Informatica 40 (2016) 399–408 399
Design of an Asynchronous Processor with Bundled-data Implementation on a
Commercial Field Programmable Gate Array
Jukiya Furushima, Masamitsu Nakajima and Hiroshi Saito
University of Aizu, Aizu-Wakamatsu 965-8580, Japan
E-mail: {m5201118, m5191117, hiroshis}@u-aizu.ac.jp
Keywords: asynchronous circuits, FPGAs, processors
Received: November 18, 2016
In this paper, we propose a modeling method and a design flow for designing asynchronous processors with
bundled-data implementation on commercial Field Programmable Gate Arrays (FPGAs). The modeling
method mainly concerns the modeling of an asynchronous control circuit on commercial FPGAs. In addition to
using the design environment provided by the FPGA vendor, the design flow includes constraint generation,
timing analysis, and delay adjustment, and covers the design of an asynchronous processor from a prepared
model to FPGA programming. In the experiments, we design three asynchronous MIPS processors. Compared with
the synchronous counterpart, one of them reduces the global cycle time, which results in a 13.8% performance
improvement, and another reduces energy consumption by 9.3% for a multiplication and by 8.8% for a matrix
multiplication.
Povzetek (summary): The paper describes the development of a new asynchronous processor based on a commercial FPGA.
1 Introduction
Field Programmable Gate Arrays (FPGAs) are reconfigurable circuits whose circuit structure can be
changed freely by designers. Therefore, compared to Application Specific Integrated Circuits (ASICs),
the lifetime of FPGAs is long. In addition, the design cost is low because FPGA vendors provide the
design environment free of charge. Recently, owing to advances in FPGA technology, FPGAs have been
widely adopted in embedded systems and in servers for data centers [1]. As FPGAs offer abundant
resources, performance can be accelerated by implementing a Multi-Processor System-on-Chip (MPSoC).
Current advanced FPGAs such as the Altera Cyclone V include an ARM Cortex processor as a hard macro
to support MPSoC.
Most commercial FPGAs are synchronous circuits. Circuit components in synchronous circuits are
controlled by global clock signals. In synchronous circuits, clock skew, power consumption, and
electromagnetic radiation become significant problems as semiconductor submicron technology advances.
In addition, the power efficiency of FPGAs is generally worse than that of ASICs. Therefore,
low-power design on FPGAs is very important.
In contrast to synchronous circuits, circuit components in asynchronous circuits are controlled by
local handshake signals. Owing to the absence of global clock signals, asynchronous circuits
potentially offer low power consumption and low electromagnetic radiation. Therefore, asynchronous
circuits may be useful for FPGAs, where low-power design is important. However, the design of
asynchronous circuits is more difficult than the design of synchronous circuits. To represent
circuit behavior, a circuit model that includes the delay model, the data encoding scheme, and the
handshake protocol must be considered, and asynchronous circuits are designed based on that model.
In addition, designing asynchronous circuits for commercial FPGAs is difficult because the design
environment provided by FPGA vendors assumes synchronous circuit design.
There are many approaches to designing asynchronous circuits on commercial FPGAs [2, 3, 4, 5, 6].
Tranchero proposed a method for designing asynchronous circuits on commercial FPGAs in [2]. Ho et al.
described how to implement a C-element [7] in a logic block of a commercial FPGA, showed that it is
free of hazards, and designed a 4-bit adder with the C-element in [3]. We proposed a floorplan method
to place asynchronous logic on commercial FPGAs. None of these works addresses design constraint
generation (e.g., the maximum delay constraints for paths) or timing verification (i.e., whether
correct timing to control resources is guaranteed). We also proposed a design method for asynchronous
circuits with bundled-data implementation, like the one in this paper, in [5]. However, it does not
target asynchronous processors. As modeling, constraint generation, and timing verification differ
for processor designs, a design method to implement asynchronous processors on commercial FPGAs is
needed. Minas et al. proposed an asynchronous processor with a concurrent error detection scheme to
detect transient errors in [6]. It was implemented on a commercial FPGA. However, the modeling,
constraint generation, and timing verification described in this paper are not addressed in [6].
Figure 1: Circuit structure of asynchronous circuits with
bundled-data implementation.
In this paper, we propose a modeling method and a de-
sign flow to design asynchronous processors with bundled-
data implementation on FPGAs. We address how to im-
plement an asynchronous control circuit on the commer-
cial FPGAs, how to synthesize the asynchronous processor
with the generation of design constraints, and how to carry
out timing verification correctly. We design three pipelined
MIPS processors using the proposed method and design
flow to evaluate area, execution time, dynamic power, and
energy consumption.
The rest of this paper is organized as follows. Section 2 describes asynchronous circuits with
bundled-data implementation. Section 3 describes FPGAs. Section 4 describes the proposed modeling
method and design flow. Section 5 presents the experimental results obtained by designing three
MIPS processors. Finally, Section 6 concludes this work.
2 Asynchronous circuits with
bundled-data implementation
The bundled-data implementation shown in Fig.1 is one of the data encoding schemes for asynchronous
circuits. The timing of data operations is guaranteed by delay elements on request signals.
Therefore, the performance depends on the delay of the control circuit. Compared to other
implementations such as the dual-rail implementation [7], where each one-bit signal is represented
by two wires and a completion detector is required, the bundled-data implementation can be realized
easily because it uses the same data-path resources as synchronous circuits. In addition, its
circuit area and power consumption are smaller than those of other implementations.
2.1 Circuit model
Figure 2 represents a bundled-data implementation model
used in this paper. It is a pipelined processor model with
several pipeline stages i. The left side is the control circuit
and the right side is the data-path circuit. The data-path cir-
cuit consists of Program Counter (PC), Memories (IMEM
and DMEM), pipeline registers (pipereg), Decoder, Reg-
ister File (RF), ALU, and delay elements wdi,k and hdk.
PC stores the address of the instruction memory. IMEM
Figure 2: An asynchronous processor model with bundled-
data implementation.
is a memory to store instructions. DMEM is a memory
to store data. piperegs are registers to separate pipeline
stages. RF is a collection of registers. Data from DMEM
and ALU are written into RF. wdi,k and hdk are delay ele-
ments for registers or memories to guarantee simultaneous
writing constraints and hold constraints.
The control circuit consists of control modules ctrli_j
(j = 1, 2). A control module ctrli_j consists of a Q-module
qi_j [8], delay elements sdi_j and cdi_j , and a C-element
ci_j [7]. sdi_j is used to guarantee setup constraints. cdi_j
is used to guarantee control initialization constraints. The
C-element ci_j is a synchronization component. The output
of the C-element is 0 when all inputs are 0. The output is
1 when all inputs are 1. Otherwise, the output does not
change. Logical 1 for the output of the C-element means
that the execution at the previous control module and the
initialization of the current control module finish.
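The C-element behavior described above can be illustrated with a small behavioral model. The following Python sketch is only an illustration under our own assumptions (the class name `CElement` and the input sequence are hypothetical, and the models used in this paper are written in Verilog HDL). The output follows the inputs when they all agree and is held otherwise.

```python
class CElement:
    """Behavioral model of a C-element with one or more inputs."""

    def __init__(self, initial_output=0):
        self.output = initial_output

    def update(self, *inputs):
        if not inputs:
            raise ValueError("a C-element needs at least one input")
        if all(value == 1 for value in inputs):
            self.output = 1          # all inputs are 1: output becomes 1
        elif all(value == 0 for value in inputs):
            self.output = 0          # all inputs are 0: output becomes 0
        # otherwise the previous output is held
        return self.output


# Hypothetical input sequence for a two-input C-element.
c = CElement()
trace = [c.update(a, b) for a, b in [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]]
print(trace)
```

In the trace, the output rises only at (1, 1) and falls only at (0, 0); the mixed inputs (1, 0) and (0, 1) leave it unchanged.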
Two points about the control circuit should be noted. First, in ordinary asynchronous pipelined
circuits such as Micropipelines [9], the feedback signal for a C-element is generated from the
output of the C-element in the next control module, whereas in this control circuit the feedback
signal is generated from outi_j. This keeps the execution time the same in all pipeline stages.
Second, we use two control modules ctrli_1 and ctrli_2 to control a pipeline stage i in order to
hide the overhead caused by handshake signals.
Control modules ctrli_j operate as follows. When the
execution at the previous control module and the initializa-
tion of the current control module finish, ci_j asserts ini_j
to trigger the Q-module qi_j . The Q-module qi_j asserts
Figure 3: Data-path sdpi,l and control path scpi,l for setup
constraints: (a) forward path and (b) backward path.
reqi_j . After the signal passes to sdi_j , it is returned to qi_j
with the assertion of acki_j . Then, the Q-module qi_j de-
asserts reqi_j . After the signal passes to sdi_j again, it is
returned to qi_j with the deassertion of acki_j . The deasser-
tion of acki_j asserts outi_j to move the control to the next
control module. Memories and registers in the data-path
circuit are controlled by the output of sdi_j . The initial-
ization of control modules ctrli_j starts immediately after
outi_j is asserted. It is tuned to the deassertion of ini_j
and outi_j . The next operation starts immediately after the
execution at the previous control module finishes.
2.2 Timing constraints and cycle time
The bundled-data implementation model used in this paper must satisfy four types of timing
constraints: setup constraints, hold constraints, control initialization constraints,
and simultaneous writing constraints.
Setup constraints mean that the input data of registers must be stable before the registers are
written. Figure 3 shows the paths related to setup constraints. sdpi,l (solid line) represents a
data-path from sdi_1 to the destination register pipereg, where data is written, through the source
memory IMEM. scpi,l (dotted line) represents a control path from sdi_1 to the destination register
pipereg through the control module ctrli_2. $t_{\min scp_{i,l}}$, $t_{\max sdp_{i,l}}$,
$t_{setup_k}$, and $sm_{i,l}$ represent the minimum delay of scpi,l, the maximum delay of sdpi,l,
the setup time of the destination register pipereg, and the margin for $t_{\max sdp_{i,l}}$,
respectively. The setup constraint can be represented by the following inequality:
$$t_{\min scp_{i,l}} > t_{\max sdp_{i,l}} + t_{setup_k} + sm_{i,l} \qquad (1)$$
If this constraint is violated, we need to adjust the delay element sdi_1 or sdi_2.
There are two types of sdpi,l. One is a forward path, where the source register is controlled by a
previous control module, as shown in Fig.3(a). The other is a backward path, where the source
register is controlled by a subsequent control module, as shown in Fig.3(b). We define the local
cycle time lcti and the global cycle time gct. The local cycle time lcti is defined for each
pipeline stage i and is equal to the maximum delay of scpi,l, tmaxscpi,l, in pipeline stage i. The
global cycle time gct is the maximum of lcti over all pipeline stages. The global cycle time,
together with the input data interval, determines the throughput of asynchronous pipelined processors.
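The two definitions above amount to taking maxima over path delays. The following Python sketch illustrates this under our own assumptions: the function names and the per-stage delay values (in ns) are hypothetical and are not taken from the experiments.

```python
def local_cycle_time(tmax_scp_delays):
    """lct_i: the maximum delay among the control paths scp_{i,l} of stage i."""
    return max(tmax_scp_delays)


def global_cycle_time(stages):
    """gct: the maximum lct_i over all pipeline stages."""
    return max(local_cycle_time(delays) for delays in stages.values())


# Hypothetical maximum delays (ns) of the control paths in each pipeline stage.
stages = {
    "IF": [12.1, 12.8],
    "ID": [13.0, 13.5],
    "EX": [12.9],
    "MEM": [13.2],
    "WB": [11.7],
}
lct = {name: local_cycle_time(delays) for name, delays in stages.items()}
gct = global_cycle_time(stages)
print(lct, gct)
```

With these example values, the ID stage determines the global cycle time.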
Figure 4: Data-path hdpi,k and control path hcpi,k for a
hold constraint.
Figure 5: Forward path cfpi_1 and backward path cbpi_1
for a control initialization constraint.
Hold constraints mean that the input data of registers must remain stable while the registers are
being written. Figure 4 shows the paths related to hold constraints. hcpi,k (dotted line)
represents a control path from sdi_1 to the destination register pipereg where data is written.
hdpi,k (solid line) represents a data-path from sdi_1 to the destination register pipereg through
data-path resources. $t_{\min hdp_{i,k}}$, $t_{\max hcp_{i,k}}$, $t_{hold_k}$, and $hm_{i,k}$
represent the minimum delay of hdpi,k, the maximum delay of hcpi,k, the hold time of the
destination register pipereg, and the margin for $t_{\max hcp_{i,k}}$, respectively. The hold
constraint can be represented by the following inequality:
$$t_{\min hdp_{i,k}} > t_{\max hcp_{i,k}} + t_{hold_k} + hm_{i,k} \qquad (2)$$
If this constraint is violated, we need to adjust the delay element hdk.
Control initialization constraints mean that the initialization of a control module must be
completed only after the control signal produced by the assertion of outi_j has reached the next
control module. Otherwise, the assertion is disabled. Figure 5 shows the paths related to a control
initialization constraint. cfpi_1 (solid line) represents a control path from sdi_1 to ci_2.
cbpi_1 (dotted line) represents a control path from sdi_1 to ci_2 through qi_1.
$t_{\max cfp_{i\_1}}$ represents the maximum delay of cfpi_1 and $t_{\min cbp_{i\_1}}$ represents
the minimum delay of cbpi_1. $cm_{i\_1}$ represents the margin for $t_{\max cfp_{i\_1}}$. The
control initialization constraint can be represented by the following inequality:
$$t_{\min cbp_{i\_1}} > t_{\max cfp_{i\_1}} + cm_{i\_1} \qquad (3)$$
If this constraint is violated, we need to adjust the delay element cdi_1.
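Inequalities (1), (2), and (3) share the same form, so each can be checked by computing a slack (left side minus right side). The following Python sketch is a minimal illustration under our own assumptions: the function names, the mapping table, and the delay values (in ps, chosen as integers to avoid rounding effects) are hypothetical.

```python
def setup_slack(tmin_scp, tmax_sdp, t_setup, sm):
    """Inequality (1): the constraint holds when the returned slack is positive."""
    return tmin_scp - (tmax_sdp + t_setup + sm)


def hold_slack(tmin_hdp, tmax_hcp, t_hold, hm):
    """Inequality (2): the constraint holds when the returned slack is positive."""
    return tmin_hdp - (tmax_hcp + t_hold + hm)


def init_slack(tmin_cbp, tmax_cfp, cm):
    """Inequality (3): the constraint holds when the returned slack is positive."""
    return tmin_cbp - (tmax_cfp + cm)


# Delay element to adjust for each kind of violated constraint (see the text).
ADJUST = {"setup": "sd", "hold": "hd", "init": "cd"}


def violations(slacks):
    """Return (constraint, delay element) pairs whose strict inequality fails."""
    return [(name, ADJUST[name]) for name, slack in slacks.items() if slack <= 0]


# Hypothetical delays in ps.
slacks = {
    "setup": setup_slack(tmin_scp=13000, tmax_sdp=11500, t_setup=200, sm=600),
    "hold": hold_slack(tmin_hdp=900, tmax_hcp=800, t_hold=150, hm=100),
    "init": init_slack(tmin_cbp=3000, tmax_cfp=2500, cm=300),
}
print(slacks, violations(slacks))
```

In this example only the hold constraint is violated, so the delay element hd would be the one to adjust.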
Figure 6: The maximum delays of two control paths scpi,l
and scpi+1,l must be nearly equal to each other in simulta-
neous writing constraints.
Simultaneous writing constraints mean that all registers must be written at the same time. In
pipelined circuits, the throughput depends on the global cycle time gct. Therefore, delaying every
register write until the global cycle time does not affect the throughput. In addition, as a
difference in register write timing may lead to setup or hold violations, preserving these
constraints reduces the occurrence of such violations. On the other hand, satisfying simultaneous
writing constraints results in behavior similar to that of synchronous circuits. However, unlike
synchronous circuits, where global clock signals are used, registers in this circuit model are
controlled by different control modules. Therefore, we expect low power consumption for the designed
asynchronous processors even when these constraints are preserved. Figure 6 shows two control
paths scpi,l (dotted line) and scpi+1,l (solid line) for setup constraints. These constraints can
be represented by the following relation:
$$t_{\max scp_{i,l}} \simeq t_{\max scp_{i+1,l}} \simeq gct \qquad (4)$$
If the above relationship is violated, we adjust delay elements sdi_2 and wdi_2,k for
$t_{\max scp_{i,l}}$ and delay elements sdi+1_2 and wdi+1_2,k for $t_{\max scp_{i+1,l}}$.
3 Field programmable gate array
A Field Programmable Gate Array (FPGA) is a reconfigurable device. FPGAs have been used in many
embedded systems because of advantages such as low design cost and the flexibility to change the
circuit structure. Figure 7 shows the structure of the Altera Cyclone IV FPGA.
The FPGA consists of Logic Array Blocks (LABs), Embedded Multipliers, Random Access Memories (RAMs),
Input/Output Elements (IOEs), and Phase Locked Loops (PLLs). A LAB consists of 16 logic elements
(LEs). A logic element consists of a D Flip-Flop (DFF) and a 4-input Look Up Table (LUT). Any logic
function with four inputs can be implemented on a LUT based on a Static RAM. Most commercial FPGAs
have a structure similar to this FPGA.
Figure 7: Structure of Altera Cyclone IV FPGA [10].
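The statement that a 4-input LUT can implement any four-input logic function follows from the LUT being a 16-entry truth table addressed by its inputs. The following Python sketch illustrates this idea only; the function names are hypothetical, and the example function (the next-state function of a two-input C-element, with the third input used as the previous output and the fourth input unused) is our own choice and does not describe how the control modules in this paper are mapped.

```python
def make_lut4(function):
    """Build the 16-entry truth table of a 4-input LUT from a Boolean function."""
    table = []
    for index in range(16):
        a, b, c, d = [(index >> bit) & 1 for bit in (3, 2, 1, 0)]
        table.append(int(bool(function(a, b, c, d))))
    return table


def lut4_read(table, a, b, c, d):
    """Read the LUT: the four inputs form the address of the stored bit."""
    return table[(a << 3) | (b << 2) | (c << 1) | d]


# Example: majority of (a, b, c), which is the next-state function of a
# two-input C-element when c is the previous output; d is unused.
c_element_table = make_lut4(lambda a, b, c, d: (a & b) | (a & c) | (b & c))
print(c_element_table)
```

Changing the stored 16 bits changes the implemented function without changing the structure, which is what makes the logic element reconfigurable.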
We use two primitives in Altera FPGAs. One is LCELL
and the other is DLATCH. LCELLs are used to implement
delay elements such as sdi_j and DLATCHes are inserted
after C-elements to carry out static timing analysis cor-
rectly with the initialization of C-elements. Both of them
are mapped to LUTs.
4 Design of asynchronous processor
on commercial FPGAs
In this section, we describe the proposed modeling method
and design flow. Even though we target Altera FPGAs in
this paper, as there is a similar design environment, we
think that we can design asynchronous processors on other
FPGAs such as Xilinx FPGAs with the modification of the
proposed modeling method and design flow.
4.1 Modeling method
As shown in Fig.2, bundled-data implementation used in
this paper consists of a control circuit and a data-path cir-
cuit. We use the same data-path resources as the ones used
in synchronous circuits. Therefore, we mainly describe
modeling of the control circuit.
The proposed modeling method extends the method de-
scribed in [11] where FPGAs are not considered. Initially,
pipeline stages are modeled by a Finite State Machine
(FSM) where nodes represent a pipeline stage and edges
represent a control flow between pipeline stages. Figure
8(a) represents an FSM for a 5 stage pipelined processor.
IF, ID, EX, MEM, and WB represent instruction fetch from
the instruction memory, instruction decode, execution, data
memory access, and write back to the register file.
Figure 8(b) represents a modeling flow. For each node in
the FSM, we split it into two nodes and map control mod-
ules (ctrli_1 and ctrli_2) . Splitting of nodes is required to
hide handshake overhead by two control modules. Then,
delay elements (sdi_j), C-elements (ci_j), feedback loops
Figure 8: Modeling flow of control circuit: (a) FSM and
(b) generation of a control circuit from FSM.
Figure 9: Internal structure of control modules.
from outi_j to ci_j are inserted. In the modeling, we just
insert delay elements sdi_j which consist of one LCELL.
All other delay elements are inserted during delay adjust-
ment after synthesis because other delay elements are re-
quired when timing constraints such as hold constraints are
violated.
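The modeling flow described above can be sketched as a simple transformation of the stage list. The following Python sketch is a rough illustration under our own assumptions: the data layout, the naming scheme, and the linear chain of stages are hypothetical (the actual FSM in Fig. 8(a) may contain other edges), and the real models are written in Verilog HDL.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def split_stage(stage):
    """Split one FSM node into the two control modules that control the stage."""
    return [f"ctrl_{stage}_1", f"ctrl_{stage}_2"]


def build_control_chain(stages):
    """Map control modules to stages and insert sd, C-element, and feedback loop."""
    modules = []
    for stage in stages:
        for name in split_stage(stage):
            suffix = name[len("ctrl_"):]
            modules.append({
                "name": name,
                "c_element": f"c_{suffix}",
                "q_module": f"q_{suffix}",
                # Only sd is inserted at modeling time, as a single LCELL.
                "sd": {"name": f"sd_{suffix}", "lcells": 1},
                # Feedback loop from out to the C-element of the same module.
                "feedback": (f"out_{suffix}", f"c_{suffix}"),
            })
    # Assumed linear control flow: each module hands control to the next one.
    edges = [(m["name"], n["name"]) for m, n in zip(modules, modules[1:])]
    return modules, edges


modules, edges = build_control_chain(STAGES)
print(len(modules), edges[:2])
```

For the 5-stage pipeline this yields ten control modules; the delay elements hd, cd, and wd are deliberately absent because they are added only during delay adjustment.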
Registers and memories are triggered by acki_j of the corresponding control modules. We insert glue
logic for registers and memories to generate a local clock signal for them from acki_j and other
conditional signals generated from the data-path circuit.
Figure 9 shows the structure of ctrli_j for Altera Cy-
clone IV FPGA. There are two DLATCHes in a ctrli_j .
One is after the C-element ci_j and the other is after the
C-element in the Q-module qi_j . They are used to initialize
the output of C-elements and to execute static timing analy-
sis correctly. Delay elements sdi_j , hdk, cdi_j , and wdi_j,k
consist of LCELLs. They work as buffers. To avoid renam-
ing of the output signals of delay elements by the synthe-
sis tool Altera Quartus Prime, we assign synthesis_keep
commands to the output signals of delay elements. Note
that we need to avoid logic optimization of control mod-
ules and all of delay elements by Quartus Prime. If they
are optimized after synthesis by Quartus Prime, we need
re-synthesis by assigning design_partition commands to
control modules and delay elements.
Finally, we prepare two models of the bundled-data implementation in the Verilog Hardware
Description Language (HDL), which is a standard modeling language for FPGAs. The first model is for
Register Transfer Level (RTL) simulation before synthesis and the second model is for synthesis.
As RTL simulation cannot handle the primitive cells DLATCH and LCELL, we represent them with logic
expressions in the simulation model. Figure 10 (a) and (b) show the simulation model and the
synthesis model for qi_j in control module ctrli_j in Verilog HDL.
Figure 10: Verilog HDL models of an asynchronous processor: (a) simulation model and (b) synthesis model.
Figure 11: Proposed design flow.
4.2 Design flow
The proposed design flow uses the design environment Al-
tera Quartus Prime. To design asynchronous processors
with bundled-data implementation on commercial FPGAs,
we need to consider timing analysis, constraint generation,
and delay adjustment for asynchronous processors which
are not supported by the design environment.
Figure 11 represents the proposed design flow to imple-
ment asynchronous processors on Altera FPGAs. The in-
puts of the design flow are the simulation model and the
synthesis model of an asynchronous processor.
The proposed design flow starts from RTL simulation
to check functional correctness for the simulation model
with a test sequence. We use the ModelSim-Altera for logic
simulation.
After RTL simulation, we extract all paths related to setup, hold, and control initialization
constraints (i.e., sdpi,l, scpi,l, hdpi,k, hcpi,k, cfpi_j, cbpi_j) in the synthesis model. To
analyze path delays such as tmaxsdpi,l correctly with the TimeQuest Timing Analyzer in Quartus
Prime, we generate report_timing commands and report_path commands. report_timing commands are
used to analyze path delays between registers, taking into account the setup and hold times tsetup
and thold of the registers. report_path commands are used for other paths. Note that Altera
recommends setting the start and end points of paths to primary inputs, registers (flip-flops and
latches), or primary outputs. On the other hand, most paths related to timing constraints in the
bundled-data implementation start or end at other pins or nets and pass through registers. For
example, sdpi,l runs from the output of sdi−1_2 to the destination register through the source
register. In such cases, we divide paths into sub-paths and prepare report_timing and report_path
commands for the divided sub-paths. For sdpi,l, for example, a report_path command is prepared to
analyze the sub-path from the output of sdi−1_2 to the source register, and a report_timing
command is prepared to analyze the sub-path from the source register to the destination register.
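The division into sub-paths described above can be automated with a small script. The following Python sketch is only an illustration under our own assumptions: the function name and all node names are hypothetical, and only the basic -from, -to, and -setup options of report_path and report_timing are used. The exact options used in the proposed flow are not stated in the text.

```python
def sub_path_commands(start_pin, registers):
    """Generate analysis commands for a path that starts at a non-register pin
    and then passes through one or more registers."""
    # First sub-path: from the pin (e.g., a delay element output) to the first
    # register. It does not start at a register, so report_path is used.
    commands = [
        f"report_path -from [get_pins {{{start_pin}}}] "
        f"-to [get_registers {{{registers[0]}}}]"
    ]
    # Remaining sub-paths are register to register, so report_timing is used.
    for src, dst in zip(registers, registers[1:]):
        commands.append(
            f"report_timing -setup -from [get_registers {{{src}}}] "
            f"-to [get_registers {{{dst}}}]"
        )
    return commands


# Hypothetical node names for a data-path sdp through a source register.
commands = sub_path_commands(
    "ctrl_if_2|sd|combout", ["imem_out_reg", "pipereg_if_id"]
)
for line in commands:
    print(line)
```

The path delay of the whole path is then obtained by adding the delays reported for its sub-paths.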
In the design flow, we first synthesize the bundled-data implementation without any constraints
(Synthesis1). Then, we decide whether to generate placement constraints. There are two ways to
generate placement constraints. The first is to fix the locations of the resources placed in the
first synthesis (Back Annotate). The second is to prepare a region in which to place the logic of
a given processor model (Create a Region). If we use placement constraints, we carry out the
second synthesis (Synthesis2). From the static timing analysis result of the first or second
synthesis, we obtain the global cycle time gct of the synthesized circuit. Then, using gct, we
generate the maximum path delay constraints for all paths so that the global cycle time of
subsequent synthesis results stays close to gct. Then, with the generated constraints, we repeat
synthesis and static timing analysis (STA) until the simultaneous writing constraints are
satisfied (Synthesis3). Then, we repeat synthesis and STA until all other timing constraints are
satisfied (Synthesis4). If some timing constraints are violated, we carry out delay adjustment for
the corresponding delay elements. Finally, after gate-level simulation of the synthesized
processor, we program it onto the target FPGA. In the rest of this sub-section, we describe the
generation of constraints and the approach for delay adjustment.
4.2.1 Generation of design constraints
Generation of the Maximum Delay Constraints. We as-
sign the maximum delay constraints to all paths related
to setup constraints using gct obtained from the STA re-
sults by TimeQuest Timing Analyzer in Quartus Prime with
Figure 12: Generation of the maximum delay constraints.
report_timing and report_path commands.
From the STA results, first, we obtain the local cycle time lcti for each pipeline stage and the
global cycle time gct. Second, we decide the margin smi,l for tmaxsdpi,l. Third, we decide two
parameters scpmargin and diff. scpmargin represents a margin between tmaxsdpi,l and tminscpi,l.
A larger scpmargin makes setup constraints easier to satisfy. However, it degrades the performance
of the synthesized processor because it may lengthen gct after the third synthesis. diff
represents the difference between tmaxscpi,l and tminscpi,l.
The maximum delay constraints for scpi,l, $t_{const_{cp}}$, are calculated by the following equation.
$$t_{const_{cp}} = gct \qquad (5)$$
The maximum delay constraints for sdpi,l, $t_{const_{dp}}$, are calculated by the following equation.
$$t_{const_{dp}} = t_{const_{cp}} - scpmargin - diff - sm_{i,l} \qquad (6)$$
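Equations (5) and (6) can be evaluated directly once the parameters are decided. The following Python sketch is a minimal illustration; the function name is hypothetical and the parameter values (in ns) are illustrative assumptions, not the values used to generate the constraints in the experiments.

```python
def max_delay_constraints(gct, scpmargin, diff, sm):
    """Return (t_const_cp, t_const_dp) following equations (5) and (6)."""
    t_const_cp = gct                                  # equation (5)
    t_const_dp = t_const_cp - scpmargin - diff - sm   # equation (6)
    return t_const_cp, t_const_dp


# Illustrative parameter values in ns.
cp, dp = max_delay_constraints(gct=13.5, scpmargin=0.6, diff=1.0, sm=0.6)
print(f"t_const_cp = {cp:.1f} ns, t_const_dp = {dp:.1f} ns")
```

The data-path budget is always tighter than the control-path budget by the sum of the three margins, which is what leaves room for the setup inequality (1) to hold after synthesis.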
As with report_timing and report_path, we assign the maximum delay constraints to sub-paths of
scpi,l and sdpi,l if these paths include several registers. From the STA results, we decide the
ratio of delay for each sub-path. For example, suppose that tconstdp for sdpi,l in Fig.12 is 10 ns
and the delay from sdi_2 to the source register is 10% of tmaxsdpi,l obtained from STA. Then, the
maximum delay constraint from sdi_2 to the source register becomes 1 ns and the maximum delay
constraint from the source register to the destination register becomes 9 ns.
We use set_max_delay commands to represent the maximum delay constraints. We prepare a Synopsys
Design Constraints (SDC) file which includes all of the set_max_delay commands.
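The proportional split in the example above (10 ns divided into 1 ns and 9 ns) can be sketched as follows. This Python sketch is an illustration under our own assumptions: the function names, the node names, and the STA delays of 1.2 ns and 10.8 ns (10% and 90% of a 12 ns path) are hypothetical, and only the basic form of set_max_delay with -from and -to is used.

```python
def split_budget(t_const, sta_delays):
    """Distribute a maximum delay constraint over sub-paths in proportion to
    the sub-path delays obtained from STA."""
    total = sum(sta_delays)
    return [t_const * delay / total for delay in sta_delays]


def set_max_delay_lines(endpoints, budgets):
    """Format one set_max_delay command per sub-path for the SDC file."""
    lines = []
    for (src, dst), budget in zip(endpoints, budgets):
        lines.append(f"set_max_delay -from {src} -to {dst} {budget:.3f}")
    return lines


# Hypothetical STA delays (ns): 10% and 90% of the maximum delay of the path.
budgets = split_budget(10.0, [1.2, 10.8])
sdc = set_max_delay_lines(
    [
        ("[get_pins {sd_i_2|combout}]", "[get_registers {src_reg}]"),
        ("[get_registers {src_reg}]", "[get_registers {dst_reg}]"),
    ],
    budgets,
)
for line in sdc:
    print(line)
```

The generated lines would be collected into the SDC file that is read in the subsequent synthesis runs.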
Generation of Placement Constraints. There are two
approaches to generate placement constraints. The first ap-
proach is to make a region for placement. In the first syn-
thesis report (Synthesis1), we can get the information about
the number of used logic elements. From the number of
logic elements, we decide a region of FPGA. The region is
created by using LogicLock in Quartus Prime.
The second approach is to fix the locations of placed
resources in the first synthesis. To realize the second ap-
Figure 13: Effect of placement constraints: (a) based on a
region and (b) based on the back annotation.
proach, we assign LogicLock for each resource (i.e., data-
path resources such as registers and control modules) in the
synthesis model (Fixing Modules). Through the first syn-
thesis with placement constraints by LogicLock, we back
annotate the locations of used logic elements, pins, multi-
pliers, and memories in the first synthesis result to a con-
straint file using Quartus Prime. The locations of resources
are represented by set_location_assignment commands
in the constraint file.
As the second approach fixes the locations of the logic to those of the first synthesis result, it
may reduce the number of delay adjustments. However, it may worsen performance if more delay
elements are required, because the already placed logic may constrain the placement of newly
introduced LCELLs for delay elements. On the other hand, the first approach allows newly
introduced LCELLs to be placed freely inside the region. However, it may increase the number of
delay adjustments. Figure 13(a) and (b) show the placed logic in the target FPGA when the first
and the second approaches are used.
4.2.2 Delay adjustment
From the static timing analysis (STA) result with
report_timing and report_path commands, simultane-
ous writing constraints are checked. For a given margin
gctmargin for the global cycle time gct, sdi_j or wdi_j,k
are adjusted so that all of tmaxscpi,l are within gct ±
gctmargin. sdi_j is adjusted if all of tmaxscpi,l are out
of gct ± gctmargin while wdi_j,k is adjusted if only cor-
responding tmaxscpi_j,l is out of gct ± gctmargin. We
generate Verilog HDL models for adjusted delay elements.
Figure 14 represents an example of delay adjustment for
simultaneous writing constraints.
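The rule for choosing between sdi_j and wdi_j,k can be sketched as a small decision function. The following Python sketch reflects our reading of the rule above and is not a definitive implementation: the function name, the path names, and the delay values (in ps, as integers) are hypothetical, and the dictionary is assumed to hold the control paths governed by one control module.

```python
def simultaneous_writing_adjustment(tmax_scp, gct, gct_margin):
    """Decide which delay element to adjust for one control module.

    tmax_scp maps each control path of the module to its maximum delay.
    Returns ("none" | "sd" | "wd", list of paths outside gct +/- gct_margin).
    """
    outside = sorted(
        path for path, delay in tmax_scp.items() if abs(delay - gct) > gct_margin
    )
    if not outside:
        return "none", []
    if len(outside) == len(tmax_scp):
        return "sd", outside   # every path is off: adjust the shared sd
    return "wd", outside       # only some paths are off: adjust their wd


# Hypothetical delays in ps with gct = 13500 ps and a margin of 1200 ps.
print(simultaneous_writing_adjustment({"scp_1": 13400, "scp_2": 11900}, 13500, 1200))
print(simultaneous_writing_adjustment({"scp_1": 11000, "scp_2": 11900}, 13500, 1200))
```

Adjusting sd shifts all register writes controlled by the module, whereas adjusting wd shifts only the write of the corresponding register or memory.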
Figure 14: Delay adjustment for simultaneous writing constraints.
Next, we adjust delays for setup, hold, and control initialization constraints. As the adjustment
of hdk affects sdpi,l, we adjust hold constraints first and then adjust setup constraints,
reflecting the delay added to or removed from hdk in sdpi,l. The adjustment of cdi_j does not
affect the paths related to setup and hold constraints. Therefore, there is no required order
between delay adjustment for setup (hold) constraints and for control initialization constraints.
From the STA result obtained with report_timing and report_path commands, we assign the delays to
both the left side and the right side of inequalities (1), (2), and (3) (see Section 2) and
subtract the right side value from the left side value. If the result is negative (i.e., a timing
violation), we add LCELLs to the corresponding delay elements to satisfy the constraint. On the
other hand, we remove LCELLs from the corresponding delay elements if the left side value is
larger than the right side value plus the margin scpmargin. Although a left side value larger than
the right side value means that there is no timing violation, a large left side value makes gct
large. Figure 15 shows an example of delay adjustment for setup constraints.
After Verilog HDL files for the adjusted delay elements are generated, we repeat synthesis, STA,
and delay adjustment until all timing constraints are satisfied.
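The add-or-remove rule above can be sketched as follows. This Python sketch is an illustration only: the function name is hypothetical, the delay of one LCELL (300 ps here) is an assumed value that in practice depends on placement and routing, and rounding to a whole number of LCELLs is our own choice. This is also why synthesis and STA are repeated after each adjustment.

```python
import math


def lcell_adjustment(lhs, rhs, scpmargin, lcell_delay):
    """Number of LCELLs to add (positive) or remove (negative).

    lhs and rhs are the left and right side values of inequality (1), (2),
    or (3) taken from the STA result; all values use the same time unit.
    """
    slack = lhs - rhs
    if slack < 0:
        # Timing violation: add enough LCELLs to cover the missing delay.
        return math.ceil(-slack / lcell_delay)
    if slack > scpmargin:
        # Excess delay beyond the margin: remove LCELLs, keeping the margin.
        return -math.floor((slack - scpmargin) / lcell_delay)
    return 0


# Hypothetical values in ps; 300 ps per LCELL is an assumption.
print(lcell_adjustment(lhs=12000, rhs=12700, scpmargin=600, lcell_delay=300))
print(lcell_adjustment(lhs=14000, rhs=12000, scpmargin=600, lcell_delay=300))
print(lcell_adjustment(lhs=12300, rhs=12000, scpmargin=600, lcell_delay=300))
```

In the second call, removing four LCELLs leaves a slack of 800 ps under the assumed LCELL delay, which is still above scpmargin.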
5 Experiments
5.1 Experimental results
In the experiments, we design asynchronous MIPS processors using the proposed modeling method and
design flow. We refer to the synchronous MIPS processor in [12] to model the asynchronous MIPS
processors. Figure 16 shows the block diagram of the MIPS processor. The synchronous MIPS
processor has a 5-stage pipeline (instruction fetch, instruction decode, execution, data memory
access, and write back). The asynchronous MIPS processor supports 9 instructions (lw, sw, j, beq,
add, sub, or, and, slt).
Figure 15: Delay adjustment for a setup constraint.
Figure 16: Block diagram of the MIPS processor in [12].
We compare the designed asynchronous MIPS proces-
sors with the synchronous MIPS processor in terms of area,
execution time, dynamic power consumption, and energy
consumption. The used synthesis tool and simulation tool
are Altera Quartus Prime ver.15.1 and ModelSim-Altera
ver.10.4b. TimeQuest timing analyzer in Quartus Prime is
also used to analyze path delays. The target device is Altera
Cyclone IV (EP4CE115F29C7).
Initially, we synthesize the synchronous MIPS processor "Sync" while changing the clock cycle time
so that the clock frequency is maximized. The clock cycle time of "Sync" is 16 ns. Then, we design
three asynchronous MIPS processors. "Async1" is designed without placement constraints. In
"Async2", the asynchronous MIPS processor is placed inside a region specified by a placement
constraint. In "Async3", the locations of all used resources are fixed to the same locations as in
the first synthesis of the design flow. Table 1 shows the parameters for the three asynchronous
MIPS processors. We decide gctmargin, smi, scpmargin, and diff from the STA results for the first
and second synthesis. gct is the value obtained from the STA result for the synthesized circuit in
which all timing constraints are satisfied.
Figure 17: Area of MIPS processors: (a) the number of LEs, (b) the number of LUTs, and (c) the
number of registers (DFFs).
Figure 18: Execution time of MIPS processors: (a) multiplication and (b) matrix multiplication.
Table 2 represents the number of delay adjustments for
three MIPS processors. "Simul" and "Others" represent
the number of delay adjustments for simultaneous writing
constraints and other timing constraints such as setup con-
straints.
Figure 17 shows the area of the MIPS processors.
Figure 17(a), (b), and (c) show the number of used LEs,
LUTs, and registers (DFFs), respectively, as reported by
Quartus Prime. Compared to "Sync", the increase in LUTs
and registers for the three asynchronous MIPS processors is
less than 5%. However, the number of LEs increases by 25%
for "Async1", 54.5% for "Async2", and 54.0% for "Async3".
This is because LUTs and DFFs are placed in separate LEs,
even though an LE contains one LUT and one DFF.
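The effect of packing on the LE count can be sketched with simple bounds. The following is a minimal illustration, assuming an LE holds exactly one LUT and one DFF and ignoring any other use of an LE; the function name and the resource counts are hypothetical and are not taken from Figure 17.

```python
def le_count_bounds(num_luts: int, num_dffs: int) -> tuple[int, int]:
    """Return (best_case, worst_case) LE counts for LEs with one LUT and one DFF."""
    if num_luts < 0 or num_dffs < 0:
        raise ValueError("resource counts must be non-negative")
    # Best case: each DFF shares an LE with a LUT wherever possible.
    best_case = max(num_luts, num_dffs)
    # Worst case: no LE holds both a LUT and a DFF.
    worst_case = num_luts + num_dffs
    return best_case, worst_case


# Hypothetical counts, chosen only for illustration (not measured values).
best, worst = le_count_bounds(num_luts=1000, num_dffs=400)
print(best, worst)  # 1000 1400
```

Under these assumptions, a design with nearly unchanged LUT and DFF counts can still use many more LEs when fewer LUT/DFF pairs share an LE.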
Figure 18 shows the execution time obtained by gate-level
simulation of the resulting designs using ModelSim-Altera.
We prepare two test benches, (a) a multiplication and (b) a
matrix multiplication. In both cases, compared to "Sync",
the execution time increases by 12.5% for "Async2" and
18.8% for "Async3", and decreases by about 13.8% for
"Async1". This is because the global cycle time of "Async2"
and "Async3" is longer, and that of "Async1" is shorter,
than the clock cycle time of "Sync" (see Table 1).
Table 1: Parameters used for the asynchronous MIPS processors ([ns]).
name    gct   gctmargin  smi  scpmargin  diff
Async1  13.5  1.2        0.6  0.6        1
Async2  17.0  1.2        0.6  0.6        0.9
Async3  18.9  1.2        0.6  0.6        1
Table 2: The number of delay adjustments.
name    Simul  Others
Async1  3      0
Async2  6      0
Async3  3      0
Figure 19: Dynamic power consumption of MIPS processors:
(a) multiplication and (b) matrix multiplication.
Figure 19 shows the dynamic power consumption obtained by
the PowerPlay Power Analyzer in Quartus Prime, using a value
change dump (.vcd) file generated by gate-level simulation
as input. Compared to "Sync", the dynamic power consumption
of "Async1" increases by 22.0% for the multiplication and
22.7% for the matrix multiplication. On the other hand, the
dynamic power consumption of "Async2" and "Async3" is
reduced by 19.4% and 15.3% for the multiplication and by
18.9% and 14.2% for the matrix multiplication, respectively.
As dynamic power consumption depends on frequency, which is
the reciprocal of the cycle time, a longer global cycle time
results in lower dynamic power consumption. In addition, the
dynamic power consumption caused by the global clock signals
(Clock in Fig. 19) is reduced in all of the asynchronous
MIPS processors.
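The frequency dependence of dynamic power can be made explicit with the standard first-order model of CMOS switching power. This model is not stated in the paper; the symbols for switching activity, switched capacitance, supply voltage, and cycle time are introduced here as assumptions for illustration.

```latex
P_{\mathrm{dyn}} \approx \alpha \, C \, V_{DD}^{2} \, f,
\qquad f = \frac{1}{T_{\mathrm{cycle}}}
```

Under this model, with $\alpha$, $C$, and $V_{DD}$ held fixed, a longer cycle time lowers $P_{\mathrm{dyn}}$ proportionally. Differences in $\alpha$ and $C$ between designs, for example from routing resources or the clock network, would shift the result further.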
Figure 20 shows the energy consumption, which is obtained
as the product of execution time and dynamic power
consumption. Compared to "Sync", the energy consumption for
the multiplication and the matrix multiplication increases
by 5.1% and 0.6% for "Async1" and by 0.7% and 1.8% for
"Async3", respectively. On the other hand, it is reduced by
9.3% and 8.8% for "Async2".
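As a rough consistency check, the product relation between energy, execution time, and dynamic power can be applied to the rounded percentages reported for "Async2". This is a minimal sketch; the function name is hypothetical, and the inputs are the rounded figures from the text rather than the raw measurements.

```python
def relative_energy_change(time_change_pct: float, power_change_pct: float) -> float:
    """Percent change in energy when energy = execution time x dynamic power.

    Both arguments are percent changes relative to the same baseline
    (negative values mean a reduction).
    """
    time_ratio = 1.0 + time_change_pct / 100.0
    power_ratio = 1.0 + power_change_pct / 100.0
    return (time_ratio * power_ratio - 1.0) * 100.0


# Rounded percentages reported in the text for "Async2" relative to "Sync".
async2_mult = relative_energy_change(time_change_pct=12.5, power_change_pct=-19.4)
async2_matrix = relative_energy_change(time_change_pct=12.5, power_change_pct=-18.9)
print(f"multiplication: {async2_mult:.2f}%")
print(f"matrix multiplication: {async2_matrix:.2f}%")
```

Both values land close to the reported 9.3% and 8.8% reductions, up to rounding of the inputs. This illustrates how a 12.5% longer execution time can still yield lower energy when the power reduction is larger.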
5.2 Discussion
The experimental results show that the proposed modeling
method and design flow offer two options for asynchronous
processors on commercial FPGAs. The first is to generate a
high-performance asynchronous processor like "Async1". As
its global cycle time is shorter than the shortest clock
cycle time of the synchronous processor, throughput
increases. To generate the high-performance one, we should
rely on the commercial design environment without placement
constraints. The second is to generate a low-energy
asynchronous processor like "Async2". To generate the
low-energy one, we should prepare a region in which to place
the logic of the processor.
Figure 20: Energy consumption of MIPS processors: (a)
multiplication and (b) matrix multiplication.
Table 3: Ratio of dynamic power consumption ([%]).
name    test bench             block  routing
Async1  multiplication         47.4   52.6
Async1  matrix multiplication  46.9   53.1
Async2  multiplication         52.3   47.7
Async2  matrix multiplication  50.9   49.1
Async3  multiplication         46.8   53.2
Async3  matrix multiplication  46.3   53.7
In all three asynchronous processor designs, the number of
delay adjustments does not differ greatly. Interestingly,
satisfying the simultaneous writing constraints may reduce
the likelihood of other timing violations ("Others" is 0 for
all three designs in Table 2).
To obtain asynchronous processors with lower power on
commercial FPGAs, we should reduce the number of used logic
elements by packing LUTs and DFFs into the same LEs. As
shown in Figure 17, the LUTs and DFFs of the three
asynchronous processors are spread over more LEs than in
"Sync". This increases dynamic power consumption because
more routing resources, such as switches among LABs, are
used. In fact, in all three asynchronous MIPS processors,
the dynamic power consumption caused by routing resources is
about half of the total dynamic power consumption, as shown
in Table 3. Packing LUTs and DFFs into the same LEs reduces
the number of used LEs, which in turn reduces the dynamic
power consumed by routing resources. We will consider this
issue in our future work.
6 Conclusions
In this paper, we proposed a modeling method and a design
flow to implement asynchronous processors on commercial
FPGAs. Using the proposed modeling method and design flow,
we designed three asynchronous MIPS processors. Compared
with a synchronous MIPS processor, one of them reduced the
global cycle time, resulting in a 13.8% performance
improvement, and another one reduced the energy consumption
by 9.3% for a multiplication and 8.8% for a matrix
multiplication.
In our future work, we are going to reduce the number of
used logic elements to reduce the dynamic power consumption
of routing resources. In addition, we are going to design
different asynchronous processors to generalize the proposed
method.
Acknowledgement
This work is partially supported by a Grant-in-Aid for
Scientific Research from the Japan Society for the Promotion
of Science (#15K00080).
References
[1] A. Putnam et al., "A Reconfigurable Fabric for Ac-
celerating Large-Scale Datacenter Services", Proc.
ISCA’14, pp.13–24, 2014.
[2] M. Tranchero and L. M. Reyneri, "Exploiting syn-
chronous placement for asynchronous circuits onto
commercial FPGAs", Proc. FPL, pp.622–625, 2009.
[3] Q. T. Ho et al., "Implementing Asynchronous Circuits
on LUT Based FPGAs", Proc. FPL, pp.36–46, 2002.
[4] H. Saito et al., "A Floorplan Method for Asyn-
chronous Circuits with Bundled-data Implementation
on FPGAs", Proc. ISCAS, pp.925–928, 2010.
[5] K. Takizawa et al., "A Design Support Tool Set for
Asynchronous Circuits with Bundled-data Implemen-
tation on FPGAs", Proc. FPL, pp.1–4, September
2014.
[6] N. Minas et al., "FPGA Implementation of an
Asynchronous Processor with Both Online and Offline
Testing Capabilities", Proc. Async, pp.128–137,
2008.
[7] J. Sparsø and S. Furber, "Principles of Asynchronous
Circuit Design: A Systems Perspective",
Springer, 2001.
[8] F. U. Rosenberger et al., "Q-Modules: Internally
Clocked Delay Insensitive Modules", IEEE Transactions
on Computers, vol. C-37, no.9, pp.1005–1018,
1988.
[9] I. E. Sutherland, "Micropipelines", Communications
of the ACM, vol.32, issue 6, pp.720–738, 1989.
[10] Altera Cyclone IV FPGA,
"https://www.altera.com/products/fpga/cyclone-series/cyclone-iv/overview.html"
[11] S. Iwasaki, "Design and Evaluation of a Low Power
Asynchronous AVR Processor considering a Cycle
Time Constraint", Master Thesis, the University of
Aizu, 2014.
[12] D. A. Patterson and J. L. Hennessy, "Computer
Organization and Design", Morgan Kaufmann, 2013.