An implementation of the Advanced RISC Machine microprocessor architecture using the micropipeline design style. In April 1994 the Amulet group in the Computer Science department of Manchester University took delivery of the AMULET1 microprocessor. This was their first large-scale asynchronous circuit and the world's first implementation of a commercial microprocessor architecture (ARM) in asynchronous logic.
Work was begun at the end of 1990 and the design despatched for fabrication in February 1993. The primary intent was to demonstrate that an asynchronous microprocessor can consume less power than a synchronous design.
The design incorporates a number of concurrent units which cooperate to give instruction-level compatibility with the existing synchronous part. These include an Address unit, which autonomously generates instruction fetch requests and interleaves (nondeterministically) data requests from the Execution unit; a Register file, which supplies operands, queues write destinations and handles data dependencies; an Execution unit, which includes a multiplier, a shifter and an ALU with data-dependent delay; a Data interface, which performs byte extraction and alignment and includes an instruction prefetch buffer; and a control path, which performs instruction decode. These units only synchronise to exchange data.
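The idea of units that run independently and synchronise only to exchange data can be illustrated in software. The following is a minimal sketch, not a model of the AMULET1 circuits: each unit is a thread, and each channel is a one-place queue, so a sender stalls only when the receiver has not yet taken the previous item. The names `fetch`, `execute`, `run` and `STOP`, the instruction tuples and the delay values are all hypothetical.

```python
import queue
import threading
import time

STOP = object()  # hypothetical sentinel marking the end of the instruction stream


def fetch(program, out_ch):
    """Stand-in for the Address unit: issues work as fast as the channel accepts it."""
    for instr in program:
        out_ch.put(instr)  # blocks only while the one-place channel is full
    out_ch.put(STOP)


def execute(in_ch, out_ch):
    """Stand-in for the Execution unit, with a data-dependent delay."""
    while True:
        item = in_ch.get()  # blocks until the upstream unit hands over data
        if item is STOP:
            out_ch.put(STOP)
            return
        op, a, b = item
        if op == "mul":
            time.sleep(0.002)  # assumed: a multiply takes longer than an add
            result = a * b
        else:
            time.sleep(0.001)
            result = a + b
        out_ch.put(result)


def run(program):
    """Connect the two units and collect results in program order."""
    fetch_to_exec = queue.Queue(maxsize=1)
    exec_to_sink = queue.Queue(maxsize=1)
    threads = [
        threading.Thread(target=fetch, args=(program, fetch_to_exec)),
        threading.Thread(target=execute, args=(fetch_to_exec, exec_to_sink)),
    ]
    for t in threads:
        t.start()
    results = []
    while True:
        item = exec_to_sink.get()
        if item is STOP:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


results = run([("add", 1, 2), ("mul", 3, 4), ("add", 5, 6)])
print(results)
```

No global clock appears anywhere in the sketch: the time each instruction takes depends on its operation, while the order of results stays fixed because each channel preserves ordering.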
The design demonstrates that all the usual problems of processor design can be solved in this asynchronous framework: backward instruction set compatibility, interrupts and exact exceptions for memory faults are all covered. It also demonstrates some unusual behaviour, for instance nondeterministic prefetch depth beyond a branch instruction (though the instructions which actually get executed are, of course, deterministic). There are some unusual problems for compiler optimisation, as the metric which must be used to compare alternative code sequences is continuous rather than discrete, and the nondeterminism in external behaviour must also be taken into account.
The chip was designed using a mixture of custom datapath and compiled control logic elements, as was the synchronous ARM. The fabrication technology is the same as that used for one version of the synchronous part, reducing the number of variables when comparing the two parts.
Two silicon implementations have been received and preliminary measurements have been taken from these. The first is on a 0.7 µm process and has achieved about 28 kDhrystones running the standard benchmark program. The other is a 1 µm implementation and achieves about 20 kDhrystones. For the faster of the parts this is equivalent to a synchronous ARM6 clocked at around 20 MHz; in the case of AMULET1 it is likely that this speed is limited by the memory system cycle time (just over 50 ns) rather than the processor chip itself.
A fair comparison of devices at the same geometries gives the AMULET1 performance as about 70% of that of an ARM6 running at 20 MHz. Its power consumption is very similar to that of the ARM6; the AMULET1 therefore delivers about 80 MIPS/W (compared with around 120 from a 20 MHz ARM6). Multiplication is several times faster on the AMULET1 owing to the inclusion of a specialised asynchronous multiplier. This performance is reasonable considering that the AMULET1 is a first-generation part, whereas the synchronous ARM has undergone several design iterations. AMULET2 (currently under development) is expected to be three times faster than AMULET1 (120 kDhrystones) and to use less power.
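The figures quoted above can be cross-checked with a little arithmetic. The sketch below uses only numbers from the text, with two simplifying assumptions: the memory cycle is rounded to exactly 50 ns, and at most one memory access completes per cycle. The variable names are illustrative.

```python
# Figures quoted in the text.
MEMORY_CYCLE_NS = 50.0          # "just over 50 ns", rounded down (assumption)
ARM6_MIPS_PER_WATT = 120.0      # 20 MHz ARM6
AMULET1_MIPS_PER_WATT = 80.0
RELATIVE_PERFORMANCE = 0.70     # AMULET1 versus a 20 MHz ARM6, same geometry

# Upper bound on memory accesses, in millions per second, for one access per cycle.
max_access_rate_mhz = 1e3 / MEMORY_CYCLE_NS

# Ratio of the two power-efficiency figures.
efficiency_ratio = AMULET1_MIPS_PER_WATT / ARM6_MIPS_PER_WATT

# Efficiency is performance divided by power, so the power ratio implied by
# the quoted numbers is the performance ratio divided by the efficiency ratio.
implied_power_ratio = RELATIVE_PERFORMANCE / efficiency_ratio

print(f"memory-limited access rate: {max_access_rate_mhz:.0f} M/s")
print(f"efficiency ratio (AMULET1/ARM6): {efficiency_ratio:.2f}")
print(f"implied power ratio (AMULET1/ARM6): {implied_power_ratio:.2f}")
```

Under these assumptions the memory system allows about 20 million accesses per second, which matches the 20 MHz ARM6 equivalence, and the implied power ratio is about 1.05, in line with the statement that the two parts consume very similar power.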
The macrocell size (without pad ring) is 5.5 mm by 4.5 mm on a 1 µm CMOS process, which is about twice the area of the synchronous part. Some of the increase can be attributed to the more sophisticated organisation of the new part: it has a deeper pipeline than the clocked version and it supports multiple outstanding memory requests; there is also specialised circuitry to increase the multiplication speed. Although there is undoubtedly some overhead attributable to the asynchronous control logic, this is estimated to be closer to 20% than to the 100% suggested by the direct comparison.
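As a rough illustration of what these area figures imply, the sketch below splits the AMULET1 macrocell area into three parts. It assumes that "about twice" means exactly 2x and that the 20% overhead is measured relative to the area of the synchronous part; neither assumption is stated in the text, and the variable names are illustrative.

```python
MACROCELL_WIDTH_MM = 5.5
MACROCELL_HEIGHT_MM = 4.5
AREA_RATIO_VS_SYNC = 2.0          # "about twice the area" taken as exactly 2 (assumption)
CONTROL_OVERHEAD_FRACTION = 0.20  # taken relative to the synchronous area (assumption)

amulet1_area = MACROCELL_WIDTH_MM * MACROCELL_HEIGHT_MM
sync_area = amulet1_area / AREA_RATIO_VS_SYNC

# Area attributed to asynchronous control logic under the 20% estimate.
control_overhead_area = CONTROL_OVERHEAD_FRACTION * sync_area

# What remains is attributed to the deeper pipeline, multiple outstanding
# memory requests and the faster multiplier.
organisation_area = amulet1_area - sync_area - control_overhead_area

print(f"AMULET1 macrocell:    {amulet1_area:.2f} mm^2")
print(f"synchronous part:     {sync_area:.2f} mm^2")
print(f"async control logic:  {control_overhead_area:.2f} mm^2")
print(f"organisation changes: {organisation_area:.2f} mm^2")
```

On these assumptions the macrocell is about 24.8 mm², the synchronous part about 12.4 mm², the asynchronous control overhead about 2.5 mm², and the remaining 9.9 mm² or so is due to the more sophisticated organisation.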
AMULET1 is code compatible with ARM6 and so is capable of running existing binaries without modification. The implementation also includes features such as interrupts and memory aborts.
The work was part of a broad ESPRIT-funded investigation into low-power technologies within the European Open Microprocessor systems Initiative (OMI) programme, where there is interest in low-power techniques both for portable equipment and (in the longer term) to alleviate the problems of the increasingly high dissipation of high-performance chips. This initial investigation into the role asynchronous logic might play has now demonstrated that asynchronous techniques can be applied to problems of the scale of a complete microprocessor.