
Instruction-Level Parallelism



  1. Instruction-Level Parallelism

  2. Instruction-Level Parallelism • Instruction-level parallelism (ILP) is when a processor has more than one execution unit and thus can execute more than one instruction simultaneously. • It should be distinguished from parallelism on a higher level, which might be accomplished by having more than one processor (multicore). • It should also be distinguished from pipelining, which has various instructions in various stages but only one in the execution stage at a time.

  3. Pipeline Hazards • Recall the hazards and potential hazards of pipelining. • Having multiple instructions in the pipeline means that the first instruction is not complete before the second instruction begins, which could be a problem if the instructions share data/registers. • Another term used is dependency.

  4. Dependency Categories • RAR: Read After Read • 1st instruction reads, 2nd instruction reads (no hazard, since neither instruction changes the value) • RAW: Read After Write • 1st instruction writes, 2nd instruction reads • WAR: Write After Read • 1st instruction reads, 2nd instruction writes • WAW: Write After Write • 1st instruction writes, 2nd instruction writes

  5. Bigger Problems • WAR and WAW are not really problems in a single, in-order pipeline. • However, in an out-of-order pipeline or in multiple pipelines: • In a WAR, the write may get ahead of the read, so the read picks up the new value instead of the old one – the wrong information. • In a WAW, the second write may get ahead of the first, leaving the wrong value in the register for subsequent processing.

  6. Example from Carter’s Book • LD r1, (r2) Load Reg. 1 with memory location pointed to by Reg. 2 • ADD r5, r6, r7 Add values in Reg. 6 and Reg. 7 put answer in Reg 5 • SUB r4, r1, r4 Subtract value in Reg. 4 from value in Reg. 1 put answer in Reg. 4 • MUL r8, r9, r10 Multiply values in Reg. 9 and Reg. 10, put answer in Reg. 8 • ST (r11), r4 Store value in Reg. 4 in memory location pointed to by Reg. 11

  7. Example from Carter’s Book • Execution Unit 1: LD r1, (r2); SUB r4, r1, r4; ST (r11), r4 • Execution Unit 2: ADD r5, r6, r7; MUL r8, r9, r10 • This program fragment can be broken into the parallel pieces shown above since the two pieces do not use the same registers.

  8. Another Example from Carter’s Book • ADD r1, r2, r3 • LD r4, (r5) • SUB r7, r1, r9 • MUL r5, r4, r4 • SUB r1, r12, r10 • ST (r13), r14 • OR r15, r14, r12

  9. Type of access to registers in the sequential program fragment Registers R1 and R4 have RAWs, Registers R1 and R5 have WARs, and Register R1 has a WAW.

  10. Hazards (RAW) • Instruction 3 must follow Instruction 1 because they have a RAW dependency in Register 1. • Instruction 4 must follow Instruction 2 because they have a RAW dependency in Register 4.

  11. Type of access to registers in the sequential program fragment Registers R1 and R4 have RAWs and Registers R1 and R5 have WARs

  12. Potential Hazards (WAR) • Instruction 5 (which writes to R1) can at best be simultaneous with Instruction 3 (which reads R1) because the read stage of an instruction precedes the write stage. • Instruction 4 can at best be simultaneous with Instruction 2, but we already have the stronger condition that it must follow it.

  13. Division of Labor • After identifying the various conditions on the ordering of instructions, the instructions can be divided up among the execution units in any way that respects the conditions. • Instructions that must follow each other will be sent to the same execution unit. • This ensures their order and also allows for bypassing.

  14. With Two Execution Units • Execution Unit 1: 1. ADD r1, r2, r3; 3. SUB r7, r1, r9; 5. SUB r1, r12, r10; 7. OR r15, r14, r12 • Execution Unit 2: 2. LD r4, (r5); 4. MUL r5, r4, r4; 6. ST (r13), r14 • 7 cycles → 4 cycles

  15. With Four Execution Units • Execution Unit 1: LD r4, (r5), then MUL r5, r4, r4 • Execution Unit 2: ST (r13), r14, then SUB r1, r12, r10 • Execution Unit 3: OR r15, r14, r12 • Execution Unit 4: ADD r1, r2, r3, then SUB r7, r1, r9 • 7 cycles → 2 cycles • Because of the RAW dependencies, we cannot do better than 2 cycles here – no matter how many execution units there are.

  16. Another Distinction • In the two execution unit result, one has not changed the order of the instructions – apart from executing Instructions 1 and 2 simultaneously. • In the four execution unit result, one has changed the order of the instructions – Instructions 6 and 7 occur in the first time cycle before Instructions 3, 4 and 5 which are in the second. • Therefore the benefit we gained from the latter assumes that the processor allows for out-of-order processing.

  17. Superscalar • A processor is said to be superscalar if it has multiple execution units and the placement of instructions into the parallel execution units is handled by the processor’s hardware. • In other scenarios the hardware may have parallel execution units, but the hardware does not determine how the instructions are split among them; the parallelization occurs at a higher level – it is done by the compiler.

  18. Don’t have to recompile • A superscalar processor can give ILP (Instruction-Level Parallelism) to code that was compiled for a processor that does not have ILP, without the code being recompiled. • Provided the new processor (with ILP) is backward compatible with the old processor (without ILP).

  19. But consider recompiling • The hardware can only consider so many instructions at once – its window of instructions. • The compiler can take a much broader view of the code and arrange instructions in a way that allows the superscalar processor to take greater advantage of ILP.

  20. Loop Unrolling • One example of what a compiler might do to exploit ILP is loop unrolling. • Branching is the bane of pipelining and parallelism. • Loops have at least one possible branch with each iteration. • Loop unrolling is doing two or more iterations’ worth of work in one iteration. It reduces the number of branch considerations and promotes parallelism.

  21. Loop Unrolling Example • Original: for (i = 0; i < 100; i++) { a[i] = b[i] + c[i]; } • Unrolled: for (i = 0; i < 100; i += 2) { a[i] = b[i] + c[i]; a[i+1] = b[i+1] + c[i+1]; } • The unrolled version has half as many branches and so is easier to pipeline. • The unrolled version uses more independent registers within each iteration and so takes greater advantage of ILP.

  22. Don’t try this at home • Loop unrolling requires knowledge of the processor’s capabilities (the number of execution units, the number of stages in the pipeline, etc.). If the programmer does not have this knowledge, the unrolling and other code optimization techniques should be left to the compiler.

  23. Superscalar Versus Vector • A vector is essentially a one-dimensional array. • A program that is optimized for the efficient handling of such arrays is said to be vectorized. • In a superscalar processor, the execution units can be doing different operations on different data, whereas with vectorization the execution units would be doing the same operation on different data.

  24. Vectorization • Vectorization can be beneficial even if there is only one execution unit, because the same operation is performed over and over (on different data) and so does not have to be decoded over and over again. • Vectorization is more restrictive but easier to implement than making the processor superscalar. And since it is exactly the kind of processing that arises so often, it is worth investing effort in doing it well.

  25. SIMD • Recall that one of the features of MMX (MultiMedia eXtensions or Matrix Math eXtensions) was SIMD (Single Instruction Multiple Data) in which an individual instruction allowed one to operate on many pieces of data simultaneously (i.e. vectorization). • In Mathematics, matrices operate on vectors • These are important to the optimization of audio-visual data, since such processing involves a lot of data that can be operated on in parallel.

  26. Try this at home • While loop unrolling is probably best left to the compiler, there are some things the high-level programmer can consider to try to ensure that his or her code can be vectorized to the fullest extent. • Recall that vectorization is concerned with the processing of arrays.

  27. Whenever Possible • Use for loops instead of while loops • Make the number of iterations a power of 2 • Avoid ifs • Avoid subroutine calls • In nested loops, make the loop with the larger number of iterations the inner loop

  28. Who bears the burden? • In superscalar processors, it is the hardware that provides the ILP. The compiler can help exploit the hardware’s capabilities. But the superscalar processor can yield ILP (on the fly) even for code compiled for a sequential processor. • In Very Long Instruction Word (VLIW) processors, the burden for discovering ILP is on the compiler.

  29. VLIW Processors • When the program is compiled, operations which can be done in parallel are sandwiched together in one long instruction, hence the name “very long instruction word” processor. • The processor has to parse this long instruction, but it does not have to make decisions about what can be done in parallel since that has been done by the compiler.

  30. VLIW Pros and Cons • The good thing about VLIW processors is that they depend on the compiler (pre-processor). • The bad thing about VLIW processors is that they depend on the compiler (pre-processor). • ???

  31. VLIW Pro • Placing the burden for parallelizing the code on software allows the hardware to be simpler. • The instruction-issue logic circuitry that would determine parallelization in the superscalar processor now does little more than parsing. • This allows the hardware • To be cheaper • To use less power • And possibly to be faster.

  32. VLIW Pro • The simplification of hardware puts it along the same lines as the RISC philosophy. • The reduced hardware leads to a reduction in power consumption. • E.g. computers based on the Crusoe family of processors from Transmeta can go almost all day without having to recharge the battery.

  33. VLIW Pro • The compiler can take a more global view when looking for parallelization. • The superscalar processor has a window, a limited number of instructions it sees and it looks for ILP within that window. • This is not a real advantage of VLIW over superscalar since code on a superscalar processor must also be compiled and that compiler can also look for ILP on a more global scale.

  34. VLIW Con • The dependence on the compiler for ILP can lead to backward compatibility issues. • Within a family of superscalar processors, one can change the micro-architecture (hardware implementation) without changing the architecture. Compiled code is architecture specific but not micro-architecture specific.

  35. VLIW Con (Cont.) • The new superscalar micro-architecture can take advantage (to some extent) of any new ILP capability without recompiling the code. • In a VLIW processor, more of the hardware details must be exposed to the software. And thus changes in the hardware require changes in the software – recompiling. • The old VLIW-compiled code may not work on a new VLIW processor.

  36. Hyper-Threading Technology

  37. HT Technology

  38. Thread-Level Parallelism • “Hyper-Threading Technology provides thread-level-parallelism (TLP) on each processor resulting in increased utilization of processor execution resources.” • “Hyper-Threading Technology makes a single physical processor appear as two logical processors ….”

  39. EPIC • The Itanium processors have a feature known as EPIC. • “EPIC (Explicitly Parallel Instruction Computing) is a 64-bit microprocessor instruction set, jointly defined and designed by Hewlett Packard and Intel, that provides up to 128 general and floating point unit registers and uses speculative loading, predication, and explicit parallelism to accomplish its computing tasks.”

  40. Need a compiler to take advantage • One feature of Itanium is its use of a “smart compiler” to optimize how instructions are sent to the processor. This approach allows Itanium and future IA-64 microprocessors to process more instructions per clock cycle (a higher IPC). • IPC, together with clock speed in megahertz (MHz), indicates a microprocessor’s overall performance.

  41. References • Computer Architecture, Nicholas Carter • http://www.whatis.com • http://www.webopedia.com • PC Hardware in a Nutshell, Thompson and Thompson • http://www.intel.com/technology/itj/2002/volume06issue01/art01_hyper/p01_abstract.htm
