Desktop personal computer users are experiencing a major change in the processor architectures used in familiar PCs and workstations. The migration towards Reduced Instruction Set Computers (RISC) has recently been confirmed by the agreement between Apple and IBM to devise what they hope will be a completely new personal computer architecture based on the RS6000 RISC.
The reason behind the abandonment of traditional Complex Instruction Set Computers (CISC) has been the quest for ever greater throughput. The demands of workstations involved in CAD tasks have been the real driving force behind this. To date, CPUs have been designed with assembler instruction sets geared towards making the assembler programmer's life easier through the extensive use of microcode. By providing single assembler instructions that perform, for instance, three operand multiplication, the assembler programmer (and HLL compiler writer) has been relieved of the job of achieving the same result with simpler instructions. These microcoded instructions have, however, increased the number of clock cycles needed for completion to as many as 31, in the case of the 80C196 microcontroller's three operand divide.
The need for the CPU to be able to recognise and act on (decode) many hundreds of different instructions requires complex silicon and many clock cycles. With physical limitations restricting achievable clock speeds on silicon devices, the number of cycles per instruction is obviously very significant in gaining higher performance.
RISCs tend to shift the burden of programming from the microcoder to the assembler programmers and compiler writers. Work within both academia and commercial manufacturers has shown that a suitably programmed RISC machine can achieve a far higher throughput than a CISC for a given clock speed.
Strangely, the embedded world has been slow to question the suitability of the CISC-based microcontroller. Whilst at the very top end, devices such as the i80960 have enjoyed some success, for more commonplace embedded tasks RISC is still quite rare. With the increasing complexity of modern control algorithms, the need for greater processing power is set to become an issue in anything but the simplest applications. In addition, here more than in the workstation world, the worst-case response time to non-deterministic events is crucial, an area where CISCs are especially poor.
Most current high-end microcontrollers are based on existing CISC architectures such as the 8086, 68000 etc., which, in common with 8 bit devices such as the 8051, have an internal structure that dates back up to 13 years. With the silicon vendors' need to give existing users an upgrade path, apparently new designs are often based closely on the existing architecture/instruction set, so protecting the user's investment in expensive assembler code.
Like workstations, microcontrollers are tending to be programmed in a high level language (HLL) to reduce coding times and enhance maintainability. Inevitably, even with the best compilers, some loss of performance is encountered, emphasising again the need for improved CPU performance.
In addition to straightforward data processing, microcontrollers must also handle real-world peripherals such as A/D converters, PWMs, timers, ports, PLLs etc., all of which require real time processing.
1. Complicated "labour-saving" instructions must hold the CPU's entire attention during execution, thus preventing real-world generated interrupts from being serviced. Unpredictable latency times result.
2. Each loaded instruction must be recognised from potentially many hundreds or even thousands of possibilities. Decoding is thus complicated and lengthy.
3. Data is typically fetched from off-chip memory and placed in an accumulator-type register. Mathematical or logical operations are performed and the result is then written back to memory. The value is likely to be required again in the course of the procedure, thus requiring further movements to and from off-chip memory.
4. When calling subroutines with parameters (essential in good HLL programming), parameters must be individually pushed on to the stack. They must then be moved through accumulator register(s) for processing before being returned via the stack to the caller.
5. Each peripheral device or interrupt source must have a dedicated service routine which, at the least, will require the PSW and PC to be stacked and restored and data removed from or fed to the peripheral device.
6. Embedded systems frequently contain many separate real time tasks which together form a complete system. Conventional CPUs make switching between tasks slow. Often, many registers have to be stacked to free them up for the incoming task. This problem is aggravated by the use of HLL compilers, which tend to use a large number of local variables in library functions which must be preserved.
7. With the move to HLLs, compilers are tending to dictate what instructions should be provided in silicon.
8. In practice, compilers tend to make use of only a small number of addressing modes. This results in a large number of unused addressing modes which serve only to complicate the opcode decoding process.
9. Instruction sets that have evolved tend to be difficult to use due to the large number of different basic types and the inconsistent addressing modes allowed. This is particularly true of Intel processors.
10. Whilst complex, multi-cycle instructions are being executed, the bus is idle.
To show how RISC design is used to improve microcontroller throughput, the Siemens SAB80C166 is used as an example.
Basic Definitions:
1 state time = 2 x 1/oscillator frequency
- fundamental unit of time recognised within processor system.
1 machine cycle = 2 x state time
- minimum time required to perform the simplest meaningful task within the CPU.
The unit of state times is used when making comparisons between RISCs and CISCs as this removes any dependency on clock frequency.
- All state time counts are given in single chip operation mode for both 80C196 and SAB80C166.
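The two definitions above can be turned into a small calculation. The following is only a sketch: the function names are illustrative, and the 40MHz oscillator value is an assumption, chosen because it reproduces the 100ns machine cycle quoted for the SAB80C166 elsewhere in this article.

```python
def state_time_ns(osc_hz: float) -> float:
    """1 state time = 2 x 1/oscillator frequency, returned in nanoseconds."""
    return 2e9 / osc_hz


def machine_cycle_ns(osc_hz: float) -> float:
    """1 machine cycle = 2 x state time, returned in nanoseconds."""
    return 2.0 * state_time_ns(osc_hz)


# Assumed oscillator frequency (hypothetical example value, not from the text).
OSC_HZ = 40e6

print("state time    :", state_time_ns(OSC_HZ), "ns")
print("machine cycle :", machine_cycle_ns(OSC_HZ), "ns")
print("4 machine cycles:", 4 * machine_cycle_ns(OSC_HZ), "ns")
```

Because instruction timings are quoted in state times, the same counts can be rescaled to any oscillator frequency simply by changing OSC_HZ.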
To maximise the rate at which instructions are executed, RISC CPUs are very heavily pipelined. Here, on any given machine cycle, up to 4 instructions may be processed at once by overlapping the FETCH, DECODE, EXECUTE and WRITE-BACK steps.
Thus, although each instruction takes four machine cycles, it is apparently executed in just one (2 state times). Pipelining has considerable benefits for speeding sequential code execution as the bus is guaranteed to be fully occupied.
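The overlap can be sketched with a toy model of an ideal four-stage pipeline. This is an illustration only: it assumes one new instruction enters FETCH on every machine cycle with no stalls, and the instruction names I1..I5 are placeholders.

```python
STAGES = ["FETCH", "DECODE", "EXECUTE", "WRITE-BACK"]


def pipeline_chart(instructions):
    """Return {cycle: {stage: instruction}} for an ideal, stall-free pipeline."""
    chart = {}
    for i, instr in enumerate(instructions):
        for s, stage in enumerate(STAGES):
            # Instruction i occupies stage s during machine cycle i + s.
            chart.setdefault(i + s, {})[stage] = instr
    return chart


def total_cycles(n):
    """Machine cycles needed to complete n instructions in the ideal pipeline."""
    return n + len(STAGES) - 1 if n else 0


program = ["I1", "I2", "I3", "I4", "I5"]
chart = pipeline_chart(program)
for cycle in sorted(chart):
    row = "  ".join(f"{chart[cycle].get(stage, '-'):>4}" for stage in STAGES)
    print(f"cycle {cycle}: {row}")
print("total machine cycles:", total_cycles(len(program)))
```

Once the pipeline is full, one instruction completes on every machine cycle, which is why a long run of in-line code approaches one machine cycle per instruction.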
However, phases one and two can simultaneously request access to the bus if, for example, the final phase of the current instruction is a READ. Here the External Bus Controller applies a WRITE, FETCH, READ priority to prevent bus conflicts.
Multiply and divide instructions require 5 and 10 cycles respectively and constitute the only "complex" opcodes within the RISC. Thus even in the SAB80C166, some instructions do not complete in the mandatory four (one bus) cycles. As it is not practicable to stop the pipeline during longer instructions, dummy instructions are injected into the decode stage, passing through the remaining stages as simple NOP instructions.
Whilst in-line code poses no problems for a pipelined CPU, branches require special steps. The problem is that by the time the branch instruction has reached the EXECUTE stage, the next in-line opcode has already been FETCHED. Thus the instruction immediately after the branch will be executed, followed by a jump to the target address for the branch. This peculiarity is termed a "delayed branch" and is used as an alternative to flushing out the pipeline completely.
The situation with a conditional branch is more complicated as the next instruction may be totally inappropriate given the result of the conditional test. The only solution is to either add a NOP or flush the pipeline.
The solution taken in the SAB80C166 for the unconditional branch is to inject a dummy instruction into the DECODE stage whilst the real target address is being FETCHED. Thus a single extra machine cycle is required to execute the branch. For the conditional branch, the dummy is only injected if the branch is taken; significantly, for no-branch situations, no time is lost.
A common situation in embedded control is searching through a table. This involves repetitive branching to a single fixed address. Without taking special steps, a wasted machine cycle would occur during each loop. Bearing some relationship to disk caching techniques on PCs, a "jump cache" is provided. Here, on the first time through the loop, the dummy instruction is injected as before and a single machine cycle is wasted. However, the branch target address is simultaneously stored in a cache area. Now, on subsequent passes through the loop, the target address is extracted from the cache and injected directly into the DECODE stage. Thus the branch now occurs in a single machine cycle.
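The saving from the jump cache can be estimated with a simple count. The sketch below is a simplification: it assumes a single loop-closing branch that is taken on every pass, costs it at two machine cycles when the dummy instruction is injected and one when the target comes from the cache, and uses a hypothetical label name.

```python
def loop_branch_cycles(iterations, jump_cache=True):
    """Machine cycles spent on the loop-closing branch over all passes."""
    target = "loop_top"  # hypothetical branch target label
    cache = set()
    cycles = 0
    for _ in range(iterations):
        if jump_cache and target in cache:
            cycles += 1        # target supplied from the cache, no dummy
        else:
            cycles += 2        # branch plus one injected dummy instruction
            cache.add(target)  # remembered for later passes (if cache enabled)
    return cycles


print("100 passes, no jump cache  :", loop_branch_cycles(100, jump_cache=False))
print("100 passes, with jump cache:", loop_branch_cycles(100, jump_cache=True))
```

Only the first pass pays the extra machine cycle when the cache is enabled, so the branch overhead for a long table search is roughly halved.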
With the parallel nature of the CPU, care has to be taken to avoid pipeline "mirages". Most potential problems originate from the WRITE-BACK stage using addresses that have been changed by subsequently FETCHED instructions. Although special hardware is provided for artificially bringing forward operand READs and WRITEs, some pipeline effects must still be borne in mind.
As an example, the general purpose register R0 is to be loaded with the value at the top of the stack, after the stack pointer "SP" has been moved to a new address of 0FA40H:
SP = 0FA80H
0FA80H = 0FFH - Value at old top of stack
SP = 0FA40H
0FA40H = 011H - Value at new top of stack
MOV SP,#0FA40H ; Set stack pointer to new location
POP R0 ; Get value at top of stack into R0
Machine Cycle Number ->>
-----------------------------------------------------------------------------
                 0             1             2              3             4
FETCH            Get MOV       Get POP R0
(SP=0FA80H,      SP,#0FA40H
 R0=XX)
DECODE                         Get address   Get address
(and get                       of SP         of R0 & value
operands)                                    in SP (still
(SP=0FA80H,                                  at 0FA80H)
 R0=XX)
EXECUTE
(SP=0FA80H,
 R0=XX)
WRITE-BACK                                                  Put #0FA40H   R0 POPped
(R0=0FFH)                                                   into SP       from address
                                                                          0FA80H
-----------------------------------------------------------------------------
As the instructions overlap, the value POPped into R0 will be incorrect. By putting an instruction between the MOV and POP, the value of SP will already be at the new value by the time the POP gets the value of SP. Note that, as WRITE overrules READ, the updating of SP will occur before the READing of the SP value in the decode stage of POP R0. The overlapping of instructions produces a similar effect when disabling interrupts:
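The SP/POP effect can be reproduced with a toy pipeline model. This is a sketch under stated assumptions, not a model of the real silicon: instruction i is fetched in cycle i, decoded in cycle i+1 and written back in cycle i+3; POP latches SP in DECODE; WRITE-BACK is processed before DECODE within a cycle (WRITE overrules READ); and the stack pointer increment performed by POP is left out. The tuple encoding of instructions is purely illustrative.

```python
def run(program, sp=0xFA80, memory=None):
    """Run a tiny program through a 4-stage pipeline model; return registers."""
    memory = memory or {0xFA80: 0xFF, 0xFA40: 0x11}
    regs = {"SP": sp, "R0": None}
    latched_sp = {}  # instruction index -> SP value read in DECODE
    n = len(program)
    for cycle in range(n + 3):
        # WRITE-BACK is handled first: WRITE overrules READ in the same cycle.
        wb = cycle - 3
        if 0 <= wb < n:
            op = program[wb]
            if op[0] == "MOV_SP":
                regs["SP"] = op[1]
            elif op[0] == "POP":
                regs[op[1]] = memory[latched_sp[wb]]
        # DECODE: POP reads whatever SP currently holds.
        dec = cycle - 1
        if 0 <= dec < n and program[dec][0] == "POP":
            latched_sp[dec] = regs["SP"]
    return regs


hazard = [("MOV_SP", 0xFA40), ("POP", "R0")]
fixed = [("MOV_SP", 0xFA40), ("NOP",), ("POP", "R0")]

print("MOV then POP      : R0 = %02XH" % run(hazard)["R0"])
print("MOV, NOP, then POP: R0 = %02XH" % run(fixed)["R0"])
```

In the first sequence the POP decodes while SP still holds 0FA80H, so R0 receives the old top-of-stack value; with one instruction in between, the POP decodes in the same cycle as the SP write-back and picks up the new address.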
1    BCLR IEN
2    <start of region which may not be interrupted>
3    .
4    .
5    .
As the actual updating of the IEN register does not occur until machine cycle 3, either NOPs must be inserted in cycles 2 & 3 before the critical region or the interrupt disable command must be moved back two instructions.
In the SAB80C166, branches to interrupts make use of the injected instruction technique and thus vectoring to the service routine is achieved in only 4 machine cycles (400ns). The effect of complex but necessary instructions such as MUL and DIV (5 and 10 cycles respectively) might be expected to stretch this but it is interesting to note that the SAB80C166 provides these as interruptable instructions.
Very fast interrupt service is crucial in high-end applications such as engine management systems, servo drives and radar systems where real-world timings are used in DSP-style calculations. As these normally form part of a larger closed control loop, erratic latency times manifest themselves as an undesirable jitter in the controlled variable.
Traditional microcontrollers have one or more special registers which can be used for mathematical, logical or Boolean operations. In the 8051, there is a single "accumulator" with 8 other registers which may be used for handling local variables or intermediate results in complex calculations. These additional registers are also used to access memory locations via indirect and/or indexed addressing.
As pointed out in sections 3 and 4 above, conventional CPUs spend much time moving data from slow memory areas into active registers. The RISC offers a very large number of general purpose registers which may be used for locals, parameters and intermediates. The SAB80C166 provides 16 word-wide general purpose registers (GPRs), each of which is effectively an accumulator, indirect pointer and index. With such a large number of GPRs available, it becomes realistic to keep all locals and intermediates within the CPU throughout quite large procedures. This can yield a great increase in speed.
Further significant benefits are derived from the RISC technique of register windowing. As has been said, up to 16 registers are available for use by the program. However, by making the active register bank movable within a larger on-chip RAM, the job of real time multi-tasking is considerably eased.
Central to this is the concept of a "Context Pointer" (CP), which defines the current absolute base address of the active bank. Thus a reference to "R0" means the register at the address indicated by the CP. Thereafter, the 16 registers originating from CP are accessed by a fast 4-bit offset.
The best example of how the CP is exploited is perhaps a background task and a real-time interrupt co-existing. When the interrupt occurs, rather than pushing all GPRs onto the stack, the CP of the current register bank is stacked and simply switched to a new value, determined at link time, to yield a fresh register bank. This results in a complete context switch in just one machine cycle but does rule out the use of recursion.
A hybrid method, which permits re-entrancy, uses the stack pointer to calculate the new CP dynamically. Here, on entering the interrupt, the number of registers now required is subtracted from the current SP and the result placed in CP, with the old CP stacked. Thus the new register bank is located at the top of the old stack, with the old CP and then the new stack following on immediately afterwards. On exiting the interrupt routine, the original register bank is restored by POPping the old CP from the stack. The SP is reinstated by adding the size of the new register bank onto the current SP.
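Both schemes can be sketched in a few lines. The model below is illustrative only: it assumes byte-addressed on-chip RAM with two bytes per word-wide GPR, a stack that grows towards lower addresses, and hypothetical start values for CP and SP; the class and method names are not taken from any Siemens documentation.

```python
WORD = 2  # bytes per word-wide GPR (assumes byte-addressed on-chip RAM)


class Cpu:
    def __init__(self, cp, sp):
        self.cp, self.sp, self.ram = cp, sp, {}

    def gpr_addr(self, n):
        """Absolute address of Rn: CP plus a 4-bit register offset."""
        assert 0 <= n < 16
        return self.cp + WORD * n

    def push(self, value):
        self.sp -= WORD
        self.ram[self.sp] = value

    def pop(self):
        value = self.ram[self.sp]
        self.sp += WORD
        return value

    def enter_static(self, new_cp):
        """Stack the old CP and switch to a bank fixed at link time."""
        self.push(self.cp)
        self.cp = new_cp

    def leave_static(self):
        self.cp = self.pop()

    def enter_dynamic(self, nregs):
        """Hybrid method: carve the new bank out of the stack area."""
        old_cp = self.cp
        self.sp -= WORD * nregs  # reserve room for the new bank
        self.cp = self.sp        # new bank sits at the top of the old stack
        self.push(old_cp)        # old CP follows, then the new stack

    def leave_dynamic(self, nregs):
        self.cp = self.pop()     # restore the original register bank
        self.sp += WORD * nregs  # give the bank's space back to the stack


cpu = Cpu(cp=0xFC00, sp=0xFB00)  # hypothetical start values
cpu.enter_dynamic(4)
outer_bank = cpu.cp
cpu.enter_dynamic(2)             # nested entry gets its own bank
inner_bank = cpu.cp
cpu.leave_dynamic(2)
cpu.leave_dynamic(4)
print(hex(outer_bank), hex(inner_bank), hex(cpu.cp), hex(cpu.sp))
```

Because each dynamic entry derives its bank from the current SP, nested or re-entrant entries never share registers, which is what the fixed, link-time bank cannot offer.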
A further RISC refinement is register window overlapping whereby when a new procedure is called, part of the new register bank defined by CP' is coincident with the original at CP:
              R3'   ; Register for subroutine's locals and intermediates
              R2'   ; Register for subroutine's locals and intermediates
        R7    R1'   ; Common register, R7 == R1'
CP'     R6    R0'   ; Common register, R6 == R0'
        R5          ; Register for caller's locals and intermediates
        R4          ; Register for caller's locals and intermediates
        R3          ; Register for caller's locals and intermediates
        R2          ; Register for caller's locals and intermediates
        R1          ; Register for caller's locals and intermediates
CP      R0          ; Register for caller's locals and intermediates
MODULE 1
; *** Assignment Of GPRs To Local Variables - Caller ***
x_var   LIT 'R0' ; Local variable
y_var   LIT 'R1' ; Local variable
parm1   LIT 'R6' ; Passed parameter 1
parm2   LIT 'R7' ; Passed parameter 2
result  LIT 'R6' ; Value returned from sub routine
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
MODULE 2
; *** Assignment Of GPRs To Local Variables - Sub Routine ***
a_var   LIT 'R2' ; Local variable
b_var   LIT 'R3' ; Local variable
input1  LIT 'R0' ; Received parameter 1
input2  LIT 'R1' ; Received parameter 2
ret1    LIT 'R0' ; Final result returned in R0
Fig. 1 - Giving GPRs Meaningful Names
By using some forethought, the programmer should arrange for any value to be passed to the subroutine to be located in the common area so that all the normal loading and unloading of parameters is avoided. This technique can be used in either absolute or SP-relative register bank modes.
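The effect of the common area can be demonstrated with a small model of the register RAM. This is a sketch only: the base address is hypothetical, two bytes per GPR are assumed, and the subroutine body (a simple addition) is invented for illustration; the register assignments follow Fig. 1.

```python
WORD = 2   # bytes per word-wide GPR (assumed)
ram = {}   # model of the on-chip register RAM


def gpr(cp, n):
    """Absolute address of register Rn in the bank based at cp."""
    return cp + WORD * n


def write(cp, n, value):
    ram[gpr(cp, n)] = value


def read(cp, n):
    return ram[gpr(cp, n)]


CP = 0xFC00            # hypothetical base of the caller's bank
CP_SUB = gpr(CP, 6)    # CP': caller's R6 becomes the subroutine's R0'


def subroutine(cp):
    a = read(cp, 0)            # input1 = R0' (caller's R6)
    b = read(cp, 1)            # input2 = R1' (caller's R7)
    write(cp, 2, a + b)        # a_var  = R2', a local of the subroutine
    write(cp, 0, read(cp, 2))  # ret1   = R0', seen by the caller as R6


write(CP, 6, 30)       # parm1 = R6
write(CP, 7, 12)       # parm2 = R7
subroutine(CP_SUB)     # no parameter is copied or pushed
result = read(CP, 6)   # result = R6
print("result =", result)
```

The parameters and the return value never move: the caller and the subroutine simply address the same two words through different register numbers.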
To get the best from a RISC's registers, the location of data needs close consideration. Although the instruction set is highly orthogonal, the limited number of addressing modes provided for MUL and DIV, for example, can appear somewhat restrictive. Fortunately, most operands involved will already be in registers, so eliminating the need for many addressing techniques. As might be expected, the instructions with the widest range of addressing modes are the simple data moves. That RISCs are the result of very careful analysis of the requirements for fast execution becomes obvious after a short acquaintance!
With largely single machine cycle execution, some conventional "fast" instructions such as CLEAR, INC and DEC become redundant. Therefore, to keep the total number of instructions to a minimum, RISCs simply omit them. Examples are given below:
Instruction         80C196    States    80C166        States
-------------------------------------------------------------
Clear Word          CLR       4         AND Rn,#0     2
Decrement Word      DEC       4         SUB Rn,#01    2
Increment Word      INC       4         ADD Rn,#01    2
- all direct addressing mode
RISC equivalents of common CISC "fast" instructions
Three operand instructions are also commonplace in CISCs but not present in RISCs. Although additional instructions are required, the overall number of states can still be less than for the three operand CISC equivalent, and the shorter RISC instructions allow greater opportunity for interrupt servicing.
The following example illustrates this:
Perform: z = x + y
z,x and y are directly addressed memory locations
x DW 1
y DW 1
z DW 1
ADD z,x,y ; 5 states - no interrupt possible
z,x and y are memory locations, Rw is a GPR
x DW 1
y DW 1
z DW 1
MOV Rw,x ; 2 states
; * Interruptable here
ADD Rw,y ; 2 states
; * Interruptable here
MOV z,Rw ; 2 states
; ____
; 6 states
One extra state is required when the CISC approach is carried over directly to the RISC. However, if the variables are assigned recognising that this is a RISC:
x and y are memory locations, z is a GPR
x DW 1
y DW 1
z LIT 'R0' ; z is assigned to GPR R0 via a LITeral definition
MOV z,x ; 2 states
; * Interruptable here
ADD z,y ; 2 states
; ____
; 4 states
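The three versions of z = x + y can be compared on two counts: total state times, and the longest stretch during which no interrupt can be accepted. The sketch below uses only the state counts from the listings above and assumes interrupts are accepted solely at instruction boundaries, as the comments in the listings indicate.

```python
# (instruction, state times) taken from the listings above
sequences = {
    "CISC, three operand ADD": [("ADD z,x,y", 5)],
    "RISC, all in memory":     [("MOV Rw,x", 2), ("ADD Rw,y", 2), ("MOV z,Rw", 2)],
    "RISC, z in a GPR":        [("MOV z,x", 2), ("ADD z,y", 2)],
}


def total_states(seq):
    """Total state times for the whole sequence."""
    return sum(states for _, states in seq)


def longest_uninterruptible(seq):
    """Longest single instruction, i.e. the worst wait before an interrupt."""
    return max(states for _, states in seq)


for name, seq in sequences.items():
    print(f"{name:26} total={total_states(seq)} states, "
          f"worst wait={longest_uninterruptible(seq)} states")
```

Even where the RISC sequence is one state longer overall, its worst-case wait before an interrupt can be taken is 2 states rather than 5.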
Within the workstation RISC, superscalar operation allows parallel execution of instructions, made possible by having discrete addition, multiplication, shift and other dedicated units, each with their own pipelines.
No RISC microcontroller (yet) offers quite this but something similar is possible to service on-chip peripherals such as an A/D converter.
A common situation occurs in conventional microcontrollers whereby some regular event requires attention from the CPU to load or unload data. Typically, an A/D converter will cyclically read a number of channels, causing an interrupt when completed or simply waiting for the CPU to poll its status. The net result is that valuable CPU time is spent doing what, even for a microcontroller, is a simple, repetitive task.
The RISC allows the interrupt service routine to be serviced and completed in a single machine cycle. In the case of a periodic A/D conversion, on each read the result is stored in a table, from where it may be retrieved by the CPU when convenient. This mechanism requires the CPU to perform only a single MOV [table_addr+],ADDAT after each conversion. At the end of the table, an additional cycle is required to reset the table pointer.
Any real-world generated data can be handled in this way, leaving the CPU free for data processing rather than simple data collection.
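The table-filling mechanism and its cost can be modelled as follows. This is a sketch only: the class name, table length and sample values are hypothetical, and the cycle accounting simply follows the description above (one machine cycle per stored result, plus one to reset the pointer at the end of the table).

```python
class AdcTable:
    """Model of a result table filled by MOV [table_addr+],ADDAT."""

    def __init__(self, length):
        self.table = [0] * length
        self.index = 0   # plays the role of the post-incremented pointer
        self.cycles = 0  # machine cycles taken from the CPU

    def on_conversion(self, addat):
        self.table[self.index] = addat  # MOV [table_addr+],ADDAT
        self.index += 1
        self.cycles += 1
        if self.index == len(self.table):
            self.index = 0              # reset the table pointer
            self.cycles += 1            # the one additional cycle


adc = AdcTable(length=4)
for sample in range(10, 18):            # eight hypothetical conversion results
    adc.on_conversion(sample)

print("table :", adc.table)
print("cycles:", adc.cycles)
```

Eight conversions into a four-entry table cost ten machine cycles in this model, after which the CPU can read the whole table at a time of its own choosing.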
For example, the SAB80C166 can achieve 10 million instructions per second at 20MHz clock (100ns machine cycle time). This is a result of pipelining and the ability to contain the active data for entire procedures within the CPU registers.
Although the instruction set is less diverse, the consistency of addressing modes makes assembler coding easier.
By eliminating instructions that take many cycles, interrupt response is improved. Smaller instructions effectively yield a higher "sampling rate" for real world events.
By careful use of multiple register banks controlled by a base pointer, context switching in a multitasking system can be performed in just two cycles (200ns).
In addition, parameter passing overhead to subroutines is eliminated by use of overlapping register windows, so that parameters lie in the common area.
Comparison Between CISC And RISC Microcontroller Execution Times
                            CISC          RISC
------------------------------------------------------------------------------------
Number of
Basic Instructions          85            55

Instruction                 80C196        States  80C166              States  Difference
------------------------------------------------------------------------------------
Move word direct            LD x,y        4       MOV Rw,Rw           2       2
Move word indirect          LD x,[y]      5       MOV Rw,[Rw]         2       3
Move word indexed           LD x,z[y]     7       MOV Rw,[Rw+#d16]    4       3

Add words direct            ADD x,y       4       ADD Rw,Rw           2       2
Add words indirect          ADD x,[y]     5       ADD Rw,[Rw]         2       3
Add words indexed           ADD x,x[y]    7       ADD Rw,[Rw+#d16]    4       3

Multiply words direct       MUL x,y       16      MUL Rw,Rw           10      6
Multiply words indirect     MUL x,[y]     18      N/A
Multiply words indexed      MUL x,x[y]    20      N/A

Divide words direct         DIV x,y       26      DIV Rw              20      6
Divide words indirect       DIV x,[y]     28      N/A
Divide words indexed        DIV x,z[y]    30      N/A

16 bit uncond. jump         LJMP #16      7       JMPA cc_UC,#d16     4       3

Shift Left 16 places        SHL x,#16     22      SHL Rw,Rw           4(*)    18
Software interrupt          TRAP          16      TRAP #n             4       12
Return from subroutine      RET           11      RET                 2       9

Direct data on stack        PUSH x        6       PUSH Rw             2       4
Indirect data on stack      PUSH [y]      9       N/A
Indexed data on stack       PUSH z[y]     10      N/A
------------------------------------------------------------------------------------
*with SAB80C166, both operands in shift must be held in registers and hence an additional two states are included for loading the number of shifts into a GPR, Rw.
By considering the simpler instructions which form the bulk of any program, it can be seen that the CISC requires approximately twice the number of state times of the RISC. For instructions that change program flow, the CISC overhead is even greater at a factor of 4. Taken over a complete software system, the RISC advantage should be a reduction in run times by about 55%.
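A rough check of these ratios can be made directly from the table. The tally below is unweighted: it takes the state counts from the table as given, groups them in a way chosen here for illustration, and ignores how often each instruction actually occurs in a real program, so it should be read only as a sanity check on the order of magnitude.

```python
# (80C196 states, 80C166 states) copied from the comparison table
simple = {
    "Move word direct":   (4, 2),
    "Move word indirect": (5, 2),
    "Move word indexed":  (7, 4),
    "Add words direct":   (4, 2),
    "Add words indirect": (5, 2),
    "Add words indexed":  (7, 4),
}
flow = {
    "16 bit uncond. jump":    (7, 4),
    "Software interrupt":     (16, 4),
    "Return from subroutine": (11, 2),
}


def state_ratio(rows):
    """Total CISC states divided by total RISC states for a group of rows."""
    cisc = sum(c for c, _ in rows.values())
    risc = sum(r for _, r in rows.values())
    return cisc / risc


print("simple instructions :", state_ratio(simple))
print("program flow changes:", state_ratio(flow))
```

On this unweighted basis the moves and adds come out at a ratio of 2.0 and the program flow instructions at 3.4; individual entries range from 1.75 for the jump to 5.5 for the subroutine return.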