CS480 Computer Science Seminar Fall, 2002


Presentation Transcript


  1. RISC Architecture and Instruction-Level Parallelism (ILP), based on “Computer Architecture: A Quantitative Approach” by Hennessy and Patterson, Morgan Kaufmann Publishing Co., 1996. CS480 Computer Science Seminar, Fall 2002

  2. CISC versus RISC: historical factors that affected the architecture of processors
• When the M6800 was introduced, 16K RAM chips cost $500 and 40 MB hard disks cost $55,000. When the MC68000 was introduced, 64K RAMs still cost several hundred dollars and 10 MB hard disks cost $5,000. In such periods, code size was among the top considerations, which led to CISC designs.
• As succeeding generations of processors were introduced, manufacturers continued to offer upward compatibility and, at the same time, added more capability to the old designs, which led to even more complex designs. The complex instruction sets made it difficult to support higher clock rates.
• Furthermore, machine architects wanted to close the “semantic gap” between machine and high-level instruction sets, which encouraged CISC design.

  3. The justification for RISC design
• Advances in VLSI technology drastically drove costs down (RAMs, hard disks, etc.).
• Research conducted in 1971 (Knuth) and 1982 (Patterson) showed that 85% of a program’s statements were assignments, conditional branches, or procedure calls, and that nearly 80% of the assignment statements were MOVE instructions without arithmetic operations.

  4. The bridge from CISC to RISC
• Instruction prefetching: fetch the next instruction(s) into an instruction queue before the current instruction is completed.
• Pipelining: execute the next instruction before the completion of the current instruction. (Each instruction is carried out in several stages; e.g., the PowerPC 601 has 20 stages.)
• Superscalar operation: the processor can issue more than one instruction simultaneously. The number of instructions issued may vary as the program executes.
• It is very difficult to implement the above speed-up techniques with CISC processors, because the instructions are long and of variable length and there are usually many different addressing modes. Also, operand access often depends on complex address arithmetic.

  5. RISC design philosophy
• One instruction issue per cycle.
• Fixed-length instructions.
• Only load and store instructions access memory.
• Simplified addressing modes: usually register indirect and indexed, where the index may be in a register or may be an immediate constant.
• Fewer, simpler operations (means shorter clock cycles, since less is done in a given clock cycle).
• Delayed loads and branches: often these instructions take more than one cycle to complete. The processor is allowed to execute other instructions following the load or branch while it completes (see the sketch after this list).
• Prefetch (instructions, operands, and branch targets) and speculative execution (guess the outcome of a condition and execute the code; if the guess was wrong, the result is simply discarded).
• Let the compiler figure out the dependences among instructions and schedule the instructions so that the number of “delay slots” is minimized.
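
A minimal sketch of the delayed-load idea in DLX-style assembly (register names are illustrative, not from the slides): the load’s result is not available for one cycle, so the compiler fills the delay slot with an independent instruction instead of letting the processor stall.

    LD   R2,0(R1)    ; load: R2 is not available until one cycle later
    ADD  R5,R6,R7    ; independent instruction fills the load delay slot
    ADD  R3,R2,R4    ; R2 is ready by now, so no stall occurs

If no independent instruction can be found, the slot is wasted as a stall, which is why the compiler tries to minimize the number of unfilled delay slots.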

  6. Pipeline concept: a simplified 5-stage pipeline. (The figure shows the classic DLX pipeline from H&P: IF instruction fetch, ID instruction decode/register fetch, EX execute, MEM memory access, WB write-back.)

  7. Latencies of some operations to be used in the following example. (In H&P’s example: an FP ALU op feeding another FP ALU op has a latency of 3 cycles; an FP ALU op feeding a store double, 2; a load double feeding an FP ALU op, 1; a load double feeding a store double, 0.)

  8. How scheduling of instructions can reduce total execution time by exploiting ILP
• Example: the loop adds the scalar in F2 to every element of a double-precision array. R1 initially holds the address of the array element with the highest address; F2 contains the scalar value s; for simplicity, the element of the array with the lowest address is assumed to be at address zero. Note: the body of each iteration is independent.

  9. The straightforward assembler code for the above loop, without showing the “stalls”
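
The code in the slide’s figure is the standard loop from H&P’s example; a reconstruction (offsets assume 8-byte double-precision array elements):

    Loop:  LD    F0,0(R1)     ; F0 = current array element
           ADDD  F4,F0,F2     ; add the scalar in F2
           SD    0(R1),F4     ; store the result back
           SUBI  R1,R1,#8     ; step to the next (lower-addressed) element
           BNEZ  R1,Loop      ; repeat until R1 reaches zero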

  10. The straightforward assembler code with the “stall” machine/clock cycles indicated: before scheduling, it takes 9 cycles per iteration
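
With the latencies listed on slide 7, the cycles fall out as in H&P’s example (reconstructed): one stall after the LD, two after the ADDD, and one for the delayed branch.

    Loop:  LD    F0,0(R1)     ; cycle 1
           stall              ; cycle 2 (load double -> FP ALU op)
           ADDD  F4,F0,F2     ; cycle 3
           stall              ; cycle 4 (FP ALU op -> store double)
           stall              ; cycle 5
           SD    0(R1),F4     ; cycle 6
           SUBI  R1,R1,#8     ; cycle 7
           BNEZ  R1,Loop      ; cycle 8
           stall              ; cycle 9 (branch delay)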

  11. The straightforward assembler code with the “stall” machine/clock cycles indicated: after scheduling, it takes only 6 cycles per iteration
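
The scheduled version, again following H&P: the SUBI fills the load delay, the SD moves into the branch delay slot, and its offset becomes 8(R1) to compensate for the SUBI now executing first.

    Loop:  LD    F0,0(R1)     ; cycle 1
           SUBI  R1,R1,#8     ; cycle 2 (fills the load delay)
           ADDD  F4,F0,F2     ; cycle 3
           stall              ; cycle 4
           BNEZ  R1,Loop      ; cycle 5 (delayed branch)
           SD    8(R1),F4     ; cycle 6 (branch delay slot; offset adjusted)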

  12. Loop unrolling technique (replicating the loop body multiple times) to further reduce the execution time (before scheduling). Note that there are 4 copies of the loop body, assuming R1 is initially a multiple of 32 (i.e., the number of loop iterations is a multiple of 4). Also note that registers are not reused. This loop will run in 27 cycles: each LD takes 2 cycles, each ADDD 3, the branch 2, and all other instructions 1; or approximately 6.8 cycles per array element.
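
The unrolled body, reconstructed from H&P’s example: the decrement is folded into the load/store offsets, so a single SUBI and BNEZ serve all four copies, and each copy uses fresh registers.

    Loop:  LD    F0,0(R1)
           ADDD  F4,F0,F2
           SD    0(R1),F4
           LD    F6,-8(R1)
           ADDD  F8,F6,F2
           SD    -8(R1),F8
           LD    F10,-16(R1)
           ADDD  F12,F10,F2
           SD    -16(R1),F12
           LD    F14,-24(R1)
           ADDD  F16,F14,F2
           SD    -24(R1),F16
           SUBI  R1,R1,#32    ; four 8-byte elements per trip
           BNEZ  R1,Loop

That is 14 instructions plus 13 stall cycles (one after each LD, two after each ADDD, one for the branch), giving the 27 cycles cited above.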

  13. Loop unrolling technique (replicating the loop body multiple times) to further reduce the execution time (after scheduling). After scheduling, the loop runs in 14 cycles, or 3.5 cycles per array element.
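
After scheduling (as in H&P), the loads are grouped first, then the adds, then the stores, with the SUBI and BNEZ placed so that every delay slot is filled and no stalls remain:

    Loop:  LD    F0,0(R1)
           LD    F6,-8(R1)
           LD    F10,-16(R1)
           LD    F14,-24(R1)
           ADDD  F4,F0,F2
           ADDD  F8,F6,F2
           ADDD  F12,F10,F2
           ADDD  F16,F14,F2
           SD    0(R1),F4
           SD    -8(R1),F8
           SUBI  R1,R1,#32
           SD    16(R1),F12   ; 16 - 32 = -16
           BNEZ  R1,Loop
           SD    8(R1),F16    ; 8 - 32 = -24 (branch delay slot)

Fourteen instructions in fourteen cycles, or 3.5 cycles per array element.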

  14. Complications of scheduling: inter-dependency among instructions
• Data dependences: instruction j is data dependent on instruction i if either of the following holds:
• instruction i produces a result that is used by instruction j, or
• instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
• Name dependences
• Control dependences

  15. Data dependency example
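
The example figure is the original loop body with the dependence chains marked by arrows; a reconstruction with comments in place of the arrows:

    Loop:  LD    F0,0(R1)     ; writes F0 ...
           ADDD  F4,F0,F2     ; ... reads F0: data dependent on the LD; writes F4 ...
           SD    0(R1),F4     ; ... reads F4: data dependent on the ADDD
           SUBI  R1,R1,#8     ; writes R1 ...
           BNEZ  R1,Loop      ; ... reads R1: data dependent on the SUBI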

  16. Unrolling a loop sometimes eliminates data dependences. In the example below, the arrows indicate dependences. But as discussed before, the SUBIs are not needed.
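
A sketch of the idea (two of the unrolled copies; registers as before). With a SUBI between copies, every address calculation depends on the SUBI before it; folding the decrements into the offsets leaves the copies independent of one another.

    ; with intermediate SUBIs: the second copy depends on the SUBI
           LD    F0,0(R1)
           ADDD  F4,F0,F2
           SD    0(R1),F4
           SUBI  R1,R1,#8     ; next copy's addresses depend on this result
           LD    F6,0(R1)
           ...

    ; offsets folded in: the copies no longer depend on each other
           LD    F0,0(R1)
           ADDD  F4,F0,F2
           SD    0(R1),F4
           LD    F6,-8(R1)    ; no SUBI needed until the end of the unrolled body
           ...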

  17. Name dependences
• A name dependence occurs when two instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name.
• Two types:
• Antidependence: instruction j writes a register or memory location that instruction i reads, and instruction i is executed first.
• Output dependence: instructions i and j write the same register or memory location.
• Instructions involved in a name dependence can execute simultaneously or be reordered (since no value is being transmitted between these instructions) if the name (register number or memory location) used in the instructions is changed so that the instructions do not conflict.

  18. Name dependences: example shows both data (light arrows) and name dependences (dark arrows)
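
A reconstruction of the figure’s code, with comments standing in for the arrows: two unrolled copies that reuse F0 and F4, so name dependences appear on top of the true data dependences.

    Loop:  LD    F0,0(R1)
           ADDD  F4,F0,F2     ; data dependence: reads F0 written by the LD
           SD    0(R1),F4     ; data dependence: reads F4 written by the ADDD
           LD    F0,-8(R1)    ; antidependence on the ADDD (which reads F0);
                              ; output dependence on the first LD (both write F0)
           ADDD  F4,F0,F2     ; antidependence on the SD; output dependence on the first ADDD
           SD    -8(R1),F4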

  19. Name dependences removed by renaming the registers: only true data dependences (light arrows) are left
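
The same code after renaming F0 to F6 and F4 to F8 in the second copy (a reconstruction of the figure):

    Loop:  LD    F0,0(R1)
           ADDD  F4,F0,F2
           SD    0(R1),F4
           LD    F6,-8(R1)    ; was F0: antidependence and output dependence are gone
           ADDD  F8,F6,F2     ; was F4
           SD    -8(R1),F8    ; only true data dependences remain, so the two copies
                              ; may be reordered or overlapped freely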

  20. Control dependence
if p1 { s1; }
if p2 { s2; }
s1 is control dependent on p1, and s2 is control dependent on p2 but not on p1.
• There are two constraints:
• an instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch.
• an instruction that is not control dependent on a branch cannot be moved after the branch, so that its execution is controlled by the branch.

  21. Control dependence example
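
A minimal assembly rendering of the two-branch fragment from the previous slide (branch conditions and registers are illustrative):

           BEQZ  R2,L1        ; skip s1 when p1 (here, R2 != 0) is false
           ADD   R3,R4,R5     ; s1: control dependent on the first branch
    L1:    BEQZ  R6,L2        ; skip s2 when p2 (here, R6 != 0) is false
           SUB   R7,R8,R9     ; s2: control dependent on the second branch, not the first
    L2:    ...

Hoisting the ADD above the first BEQZ would execute s1 even when p1 is false; conversely, sinking an instruction from before the BEQZ to after it would wrongly make that instruction’s execution depend on the branch.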

  22. VLIW
