Lecture 14: ILP - II

Last Time:
• Branch prediction
• Multiple instruction issue
Today:
• Dynamic instruction scheduling

Hardware to exploit ILP
[Figure: a 1-wide and a 2-wide pipeline, each built from PC, I-Mem, Reg File, and D-Mem]

Control Flow Restrictions
    max = a[0];
    for (i = 1; i < n; i++) {
        if (a[i] > max) max = a[i];
    }
    return max;
    LOOP:  LW   %1, %2        // a[i]
           SGT  %3, %1, %8    // a[i] > max
           BEQZ %3, NOMAX
           ADDI %8, %1, #0    // update max
    NOMAX: ADDI %2, %2, #4    // update a[i] ptr
           ADDI %4, %4, #1    // update i
           SLT  %5, %4, %9    // i < n
           BNEZ %5, LOOP
• Can only reschedule code inside a basic block
• Small basic blocks limit opportunities for scheduling
IC = 8, CPI = 11/8 = 1.4

Predication to Eliminate Branches
    max = a[0];
    for (i = 1; i < n; i++) {
        if (a[i] > max) max = a[i];
    }
    return max;
    LOOP:         LW   %1, %2        // a[i]
                  SGT  %3, %1, %8    // a[i] > max
          IF(%3)  ADDI %8, %1, #0    // update max
                  ADDI %2, %2, #4    // update a[i] ptr
                  ADDI %4, %4, #1    // update i
                  SLT  %5, %4, %9    // i < n
                  BNEZ %5, LOOP
• Predicate the conditional to make this all one basic block
IC = 7, CPI = 9/7 = 1.3
    LOOP:         LW   %1, %2        // a[i]
                  ADDI %4, %4, #1    // update i
                  ADDI %2, %2, #4    // update a[i] ptr
                  SGT  %3, %1, %8    // a[i] > max
                  SLT  %5, %4, %9    // i < n
          IF(%3)  ADDI %8, %1, #0    // update max
                  BNEZ %5, LOOP
• Reschedule to eliminate stalls and provide slack
CPI = 8/7 = 1.1
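The effect of the predicated ADDI can be sketched in a few lines of Python. This is only an illustration, assuming integer data: the function names are hypothetical, and the arithmetic select stands in for the hardware's IF(%3) guard rather than reproducing it.

```python
def max_branchy(a):
    """Reference version: the update is guarded by a branch."""
    best = a[0]
    for x in a[1:]:
        if x > best:
            best = x
    return best


def max_predicated(a):
    """Predicated version: the compare yields a 0/1 predicate and the
    update always executes, selecting either the new or the old value."""
    best = a[0]
    for x in a[1:]:
        p = int(x > best)               # SGT %3, %1, %8
        best = p * x + (1 - p) * best   # IF(%3) ADDI %8, %1, #0
    return best
```

Both functions return the same result; the second has no data-dependent control flow in the loop body, which is what lets the whole body be scheduled as one basic block.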

Loop Unrolling for ILP
    max = a[0];
    for (i = 1; i < n; i += 2) {
        if (a[i] > max) max = a[i];
        if (a[i+1] > max) max = a[i+1];
    }
    return max;
• Unroll the loop to make an even bigger basic block
  • More opportunities for parallelism
  • Lower loop overhead: 6+4 vs 2(3+4)
• This loop has a loop-carried dependence through max; there is more parallelism if the loop is not serial
    LOOP:          LW   %1, %2         // a[i]
                   LW   %11, 4(%2)     // a[i+1]
                   SGT  %3, %1, %8     // a[i] > max
                   ADDI %2, %2, #8     // update a[i] ptr
          IF(%3)   ADDI %8, %1, #0     // update max
                   SGT  %12, %11, %8   // a[i+1] > max
          IF(%12)  ADDI %8, %11, #0    // update max
                   ADDI %4, %4, #2     // update i
                   SLT  %5, %4, %9     // i < n-1
                   BNEZ %5, LOOP
IC per iteration = 5, CPI = 11/10 = 1.1
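One way to relax the loop-carried dependence through max is to keep two independent running maxima and combine them after the loop. The sketch below is not from the slide: the accumulator names m0 and m1 and the leftover-element handling are assumptions added for illustration.

```python
def max_unrolled(a):
    """Unroll by two with independent running maxima m0 and m1."""
    m0 = m1 = a[0]
    n = len(a)
    i = 1
    while i + 1 < n:            # two elements per trip
        if a[i] > m0:           # chain 0 only touches m0
            m0 = a[i]
        if a[i + 1] > m1:       # chain 1 only touches m1
            m1 = a[i + 1]
        i += 2
    if i < n and a[i] > m0:     # leftover element when one remains
        m0 = a[i]
    return m0 if m0 > m1 else m1
```

Within one trip the two compare-and-update chains no longer depend on each other, so a machine with two ALUs could execute them side by side.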

Multiple ALUs - ILP
    LOOP:          LW   %1, %2         // a[i]
                   LW   %11, 4(%2)     // a[i+1]
                   SGT  %3, %1, %8     // a[i] > max
                   ADDI %2, %2, #8     // update a[i] ptr
          IF(%3)   ADDI %8, %1, #0     // update max
                   SGT  %12, %11, %8   // a[i+1] > max
          IF(%12)  ADDI %8, %11, #0    // update max
                   ADDI %4, %4, #2     // update i
                   SLT  %5, %4, %9     // i < n-1
                   BNEZ %5, LOOP
    LOOP:          LW   %1, %2
                   LW   %11, 4(%2)
                   ADDI %2, %2, #8
                   ADDI %4, %4, #2
                   SGT  %3, %1, %8
                   SLT  %5, %4, %9
          IF(%3)   ADDI %8, %1, #0
                   SGT  %12, %11, %8
          IF(%12)  ADDI %8, %11, #0
                   BNEZ %5, LOOP
IC per iteration = 5, CPI = 7/10 = 0.7

Compilation and ISA
• Efficient compilation requires knowledge of the pipeline structure
  • latency of each operation type
• But a good ISA transcends several implementations with different pipelines
  • should things like a delayed branch be in an ISA?
  • should a compiler use the properties of one implementation when compiling for an ISA?
  • do we need a new interface?

Very-Long Instruction Word (VLIW) Computers
[Figure: IP and Instruction Memory supply an instruction word made of several (Op, Rd, Ra, Rb) fields, one per ALU; the ALUs share one Register File]
• Instruction word consists of several conventional 3-operand instructions (up to 28 on the Multiflow Trace), one for each of the ALUs
• Register file has 3N ports to feed N ALUs
• All ALU-ALU communication takes place via register file

VLIW Pros and Cons
Pros
• Very simple hardware
  • no dependency detection
  • simple issue logic
  • just ALUs and register file
• Potentially exploits large amounts of ILP
Cons
• Lockstep execution (static schedule)
  • very sensitive to long latency operations (cache misses)
• Global register file hard to build
• Lots of NOPs
  • poor code ‘density’
  • I-cache capacity and bandwidth compromised
• Must recompile sources
• Implementation visible through ISA

EPIC - Intel Merced
• 128-bit instructions
  • three 3-address operations
  • a template that encodes dependencies
• 128 general registers
• predication
• speculative load
[Figure: instruction format: op1 | op2 | op3 | tmp, where each operation is pred | op | rd | rs1 | rs2 | const]

Multiple Issue
[Figure: block diagram with Instruction Memory, Instruction Buffer, Hazard Detect, and Register File]

Multiple Issue (Details)
• Superficially looks like VLIW but:
  • Dependencies and structural hazards checked at run-time
  • Can run existing binaries
    • must recompile for performance, not correctness
• More complex issue logic
  • Swizzle next N instructions into position
  • Check dependencies and resource needs
  • Issue M <= N instructions that can execute in parallel

Example Multiple Issue
• Issue rules: at most 1 LD/ST, at most 1 floating op
• Latency: LD = 1, int = 1, F* = 1, F+ = 1 cycle
    cycle
    1      LOOP: LD    %F0, 0(%1)         // a[i]
    2            LD    %F2, 0(%2)         // b[i]
    3            MULTD %F8, %F0, %F2      // a[i] * b[i]
    4            ADDD  %F12, %F8, %F16    // + c
                 SD    %F12, 0(%3)        // d[i]
    5            ADDI  %1, %1, 4
                 ADDI  %2, %2, 4
    6            ADDI  %3, %3, 4
                 ADDI  %4, %4, 1          // increment i
    7            SLT   %5, %4, %6         // i<n-1
    8            BNEQ  %5, %0, LOOP
Old CPI = 1, New CPI = 8/11 = 0.73

Rescheduled for Multiple Issue
• Issue rules: at most 1 LD/ST, at most 1 floating op
• Latency: LD = 1, int = 1, F* = 1, F+ = 1 cycle
    cycle
    1      LOOP: LD    %F0, 0(%1)         // a[i]
                 ADDI  %1, %1, 4
    2            LD    %F2, 0(%2)         // b[i]
                 ADDI  %2, %2, 4
    3            MULTD %F8, %F0, %F2      // a[i] * b[i]
                 ADDI  %4, %4, 1          // increment i
    4            ADDD  %F12, %F8, %F16    // + c
                 SLT   %5, %4, %6         // i<n-1
    5            SD    %F12, 0(%3)        // d[i]
                 ADDI  %3, %3, 4
    6            BNEQ  %5, %0, LOOP
Old CPI = 0.73, New CPI = 6/11 = 0.55
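The cycle count for the rescheduled loop can be checked with a small issue model. This is a sketch under stated assumptions, not the lecture's hardware: issue is in order and at most 2 wide (the slide does not state a width), an instruction cannot issue in the same cycle as an earlier instruction that writes one of its sources, and every latency is 1 cycle. The tuple encoding and the name issue_cycles are hypothetical.

```python
def issue_cycles(program, width=2):
    """Count cycles for an in-order, multiple-issue machine.

    Each instruction is (text, dest, sources, unit) with unit in
    {"mem", "fp", "int"}. All latencies are 1 cycle, so a result is
    usable by any instruction issued in a later cycle.
    """
    cycles = 0
    i = 0
    while i < len(program):
        written = set()               # registers written by this cycle's group
        used = {"mem": 0, "fp": 0}    # units limited to one op per cycle
        issued = 0
        while i < len(program) and issued < width:
            _, dest, sources, unit = program[i]
            if any(s in written for s in sources):
                break                 # RAW hazard inside the group
            if unit in used and used[unit] == 1:
                break                 # at most 1 LD/ST and 1 floating op
            if unit in used:
                used[unit] += 1
            if dest is not None:
                written.add(dest)
            issued += 1
            i += 1
        cycles += 1
    return cycles


rescheduled = [
    ("LD %F0, 0(%1)",        "F0",  ["1"],         "mem"),
    ("ADDI %1, %1, 4",       "1",   ["1"],         "int"),
    ("LD %F2, 0(%2)",        "F2",  ["2"],         "mem"),
    ("ADDI %2, %2, 4",       "2",   ["2"],         "int"),
    ("MULTD %F8, %F0, %F2",  "F8",  ["F0", "F2"],  "fp"),
    ("ADDI %4, %4, 1",       "4",   ["4"],         "int"),
    ("ADDD %F12, %F8, %F16", "F12", ["F8", "F16"], "fp"),
    ("SLT %5, %4, %6",       "5",   ["4", "6"],    "int"),
    ("SD %F12, 0(%3)",       None,  ["F12", "3"],  "mem"),
    ("ADDI %3, %3, 4",       "3",   ["3"],         "int"),
    ("BNEQ %5, %0, LOOP",    None,  ["5", "0"],    "int"),
]

print(issue_cycles(rescheduled))  # 6
```

Six cycles for 11 instructions gives the CPI of 6/11 quoted on the slide. With width=1 the same model returns 11 cycles, the single-issue baseline.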

Multiple Issue vs VLIW
• More complex issue logic
  • check dependencies
  • check structural hazards
  • issue variable number of instructions (0-N)
  • shift unissued instructions over
• Able to run existing binaries
  • recompile for performance, not correctness
• Datapaths nearly identical
• Neither VLIW nor multiple issue can schedule around run-time variation in instruction latency
  • cache misses
• Dealing with run-time variation requires run-time or dynamic scheduling

The Problem with Static Scheduling
• In-order execution
  • an unexpected long latency blocks ready instructions from executing
  • binaries need to be rescheduled for each new implementation
  • small number of named registers becomes a bottleneck
    LW  R1, C        // miss 50 cycles
    LW  R2, D
    MUL R3, R1, R2
    SW  R3, C
    LW  R4, B        // ready
    ADD R5, R4, R9
    SW  R5, A
    LW  R6, F
    LW  R7, G
    ADD R8, R6, R7
    SW  R8, E
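A rough model shows how far the 50-cycle miss pushes back the instructions that were ready. The sketch assumes things the slide does not state: every operation other than the missing load takes 1 cycle, in-order issue is one instruction per cycle, the out-of-order case has unlimited issue width and window size, and memory dependences are ignored. The function name issue_times is hypothetical.

```python
def issue_times(program, in_order):
    """Return the issue cycle of each (text, dest, sources, latency) entry."""
    ready = {}      # register -> cycle at which its value is available
    times = []
    prev = -1       # issue cycle of the previous instruction
    for _, dest, sources, latency in program:
        # Earliest cycle at which all source registers are available.
        t = max((ready.get(s, 0) for s in sources), default=0)
        if in_order:
            t = max(t, prev + 1)    # cannot pass the previous instruction
        times.append(t)
        prev = t
        if dest is not None:
            ready[dest] = t + latency
    return times


program = [
    ("LW R1, C",       "R1", [],           50),  # cache miss
    ("LW R2, D",       "R2", [],            1),
    ("MUL R3, R1, R2", "R3", ["R1", "R2"],  1),
    ("SW R3, C",       None, ["R3"],        1),
    ("LW R4, B",       "R4", [],            1),  # ready
    ("ADD R5, R4, R9", "R5", ["R4", "R9"],  1),
    ("SW R5, A",       None, ["R5"],        1),
    ("LW R6, F",       "R6", [],            1),
    ("LW R7, G",       "R7", [],            1),
    ("ADD R8, R6, R7", "R8", ["R6", "R7"],  1),
    ("SW R8, E",       None, ["R8"],        1),
]

print(issue_times(program, in_order=True))   # [0, 1, 50, 51, 52, 53, 54, 55, 56, 57, 58]
print(issue_times(program, in_order=False))  # [0, 0, 50, 51, 0, 1, 2, 0, 0, 1, 2]
```

In order, LW R4, B issues at cycle 52 even though nothing it needs depends on the miss; under the idealized out-of-order model it issues at cycle 0.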

Dynamic Scheduling
• Determine execution order of instructions at run time
• Schedule with knowledge of run-time variable latency
  • cache misses
• Compatibility advantages
  • avoid need to recompile old binaries
  • avoid bottleneck of small named register sets
    • but still need to deal with spills
• Significant hardware complexity

Example
• 10 cycle data memory (cache) miss
• 3 cycle MUL latency
• 2 cycle add latency

Dynamic Scheduling: Basic Concept
[Figure: the PC fetches the sequential instruction stream into a window of instructions waiting on operands & resources; issue logic dispatches them to the execution resources and register file; finished instructions wait to commit]
Sequential instruction stream:
    LW  %1,A
    LW  %2,B
    ADD %3,%1,%2
    SW  %3,C
    LW  %4,8(A)
    LW  %5,8(B)
    ADD %6,%4,%5
    SW  %6,8(C)
    LW  %7,16(A)
    LW  %8,16(B)
    ADD %9,%7,%8
    SW  %9,16(C)
    LW  %10,24(A)
    LW  %11,24(B)

Implementation Issues
• Instruction window
  • fixed number of instruction slots (e.g., 32)
  • generic or partitioned over execution units
  • fetch next sequential instruction whenever a slot is free
  • mark input and output registers busy
  • slots monitor register status and execution unit reservation tables
• Issue when
  • all input operands available
  • output operand (register) not busy (WAW, WAR) due to earlier instruction
  • execution unit is available
• Commit when
  • all previous instructions have committed
  • why?

Register Scoreboard
[Figure: register file %0-%7 with one valid bit per register (= 0 if a write is “pending”)]
• Tracks register writes
  • busy = pending write
• Detect hazards for scheduler
    ADD %3,%1,%2
  • Wait until %1 is valid
  • Mark %3 valid when complete
    SUB %4,%0,%3
  • Wait for %3
• What about:
    LD  %3,0(%7)
    ADD %4,%3,%5
    LD  %3,4(%7)
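The valid-bit bookkeeping can be sketched as a small class. The class and method names are hypothetical, and the issue rule (all sources valid and no write pending on the destination) is one simple policy assumed for illustration; it is applied to the "What about" sequence.

```python
class Scoreboard:
    """One valid bit per register; 0 means a write is pending."""

    def __init__(self, num_regs=8):
        self.valid = [1] * num_regs

    def can_issue(self, dest, sources):
        # RAW: every source must be valid. WAW: no write pending on dest.
        return all(self.valid[s] for s in sources) and self.valid[dest] == 1

    def issue(self, dest):
        self.valid[dest] = 0    # a write to dest is now pending

    def complete(self, dest):
        self.valid[dest] = 1    # result written back


sb = Scoreboard()
sb.issue(3)                              # LD  %3,0(%7) issues
add_blocked = not sb.can_issue(4, [3, 5])   # ADD %4,%3,%5 waits for %3 (RAW)
load_blocked = not sb.can_issue(3, [7])     # LD  %3,4(%7) waits too (WAW on %3)
sb.complete(3)                           # first load writes back
add_ready = sb.can_issue(4, [3, 5])
```

Under this policy the second load cannot even enter execution while the first write to %3 is pending.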

Implementing A Simple Instruction Window
[Figure: register file %0-%7 with one valid bit per register]
Window contents:
    ADD %3,%1,%2
    SW  %3,0(C)
    ADD %6,%4,%5
    SW  %6,8(C)
    LW  %7,16(A)
    issue         result   src1         src2
    order   op    dst      reg   rdy    reg   rdy
    3       ADD   %3       %1    0      %2    1
    5       SW             %3    0      C     1
    2       ADD   %6       %4    0      %5    0
    4       SW             %6    0      C     1
    1       LW    %7       A     1            1
• Result sequence: %4, %7, %5, %1, %6, %3
• Often called reservation stations
  • reg = name, value

Implementing a Simple Instruction Window (2)
• Add an instruction to the window
  • only when dest register is not busy
  • mark destination register busy
  • check status of source registers and set ready bits
• When each result is generated
  • compare dest register field to all waiting instruction source register fields
  • update ready bits
  • mark dest register not busy
• Issue an instruction when
  • execution resource is available
  • all source operands are ready
• Result
  • issues instructions out of order as soon as source registers are available
  • allows only one operation in the window per destination register
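The three rules above can be sketched as a small window model and run on the five-instruction example from the previous slide. The class and method names are hypothetical, execution resources are assumed to be always available, and results are delivered in the slide's stated result sequence rather than being computed.

```python
class Window:
    def __init__(self, busy):
        self.busy = set(busy)    # registers with a pending write
        self.slots = []          # (text, dest, {source: ready bit})

    def add(self, text, dest, sources):
        if dest in self.busy:
            return False         # only one in-flight writer per register
        ready = {s: s not in self.busy for s in sources}
        if dest is not None:
            self.busy.add(dest)  # mark destination register busy
        self.slots.append((text, dest, ready))
        return True

    def result(self, reg):
        """A result is generated: update matching ready bits."""
        self.busy.discard(reg)
        for _, _, ready in self.slots:
            if reg in ready:
                ready[reg] = True

    def issue_ready(self):
        """Remove and return instructions whose sources are all ready."""
        out = [s[0] for s in self.slots if all(s[2].values())]
        self.slots = [s for s in self.slots if not all(s[2].values())]
        return out


w = Window(busy={"%1", "%4", "%5"})   # %1, %4, %5 start with pending writes
w.add("ADD %3,%1,%2", "%3", ["%1", "%2"])
w.add("SW %3,0(C)",   None, ["%3"])
w.add("ADD %6,%4,%5", "%6", ["%4", "%5"])
w.add("SW %6,8(C)",   None, ["%6"])
w.add("LW %7,16(A)",  "%7", [])

order = w.issue_ready()
for reg in ["%4", "%7", "%5", "%1", "%6", "%3"]:
    w.result(reg)
    order += w.issue_ready()
```

After the loop, order lists LW %7, ADD %6, ADD %3, SW %6, SW %3, which matches the issue-order column of the slide's table.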

Register Renaming (1)
• What about this sequence?
    1  LW  %1, 0(%4)
    2  ADD %2, %1, %3
    3  LW  %1, 4(%4)
    4  ADD %5, %1, %3
• Can’t add instruction 3 to the window since %1 is already busy
• Need 2 %1s!

Register Renaming (2)
• Add a tag field to each register - translates from virtual to physical register name
Rename Table (virtual registers):
    reg   valid  tag
    %1    0      P5
    %2    0      P2
    %3    1      P1
    %4    1      P7
    %5    1      P6
Physical Registers:
    name  bit    value
    P1    0      A
    P2    1      5
    P3    1      C
    P4    1      0
    P5    0      E
    P6    1      F
    P7    1      3
    P8    0      2
In window:
    LW  %1, 0(%4)
    ADD %2, %1, %3
Next instruction:
    LW  %1, 4(%4)

Register Renaming (3)
Instruction sequence:
    LW  %1,0(%4)
    ADD %2,%1,%3
    LW  %1,4(%4)
    ADD %5,%1,%3
Rename table:
           Before         After
    reg    valid  tag     valid  tag
    %1     0      P5      0      P4
    %2     0      P2      0      P2
    %3     1      P1      1      P1
    %4     1      P7      1      P7
    %5     1      P6      0      P6
Window:
    slot  op    dst   src1   rdy   src2   rdy
    S1    LW    P5    data   1            1
    S2    ADD   P2    P5     0     data   1
    S3    LW    P4    data   1            1
    S4    ADD   P6    P4     0     data   1
• Add instruction to window even if dest register is busy
• When adding instruction to window
  • read data of non-busy source registers and retain
  • read tags of busy source registers and retain
  • write tag of destination register with slot number
• When result generated
  • compare tag of result to not-ready source fields
  • grab data if match
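The step from the Before table to the After table can be sketched as a rename function. The class name, the free list containing P4, and the policy of allocating a new physical register only when the destination is busy are assumptions chosen to reproduce the slide's tables; the string "data" stands for an operand value that was read when the instruction entered the window.

```python
class RenameTable:
    def __init__(self, mapping, free):
        # mapping: virtual register -> (valid bit, physical tag)
        self.map = {v: list(entry) for v, entry in mapping.items()}
        self.free = list(free)   # unused physical registers (assumed)

    def rename(self, op, dest, sources):
        # Valid sources are read now ("data"); busy ones keep their tag.
        srcs = ["data" if self.map[s][0] else self.map[s][1] for s in sources]
        if not self.map[dest][0]:               # dest busy: give it a new name
            self.map[dest][1] = self.free.pop(0)
        self.map[dest][0] = 0                   # a write is now pending
        return (op, self.map[dest][1], srcs)


before = {
    "%1": (0, "P5"), "%2": (0, "P2"), "%3": (1, "P1"),
    "%4": (1, "P7"), "%5": (1, "P6"),
}
rt = RenameTable(before, free=["P4"])
s3 = rt.rename("LW", "%1", ["%4"])          # LW  %1,4(%4)
s4 = rt.rename("ADD", "%5", ["%1", "%3"])   # ADD %5,%1,%3
```

Here s3 is ("LW", "P4", ["data"]) and s4 is ("ADD", "P6", ["P4", "data"]), the same destination and source fields as slots S3 and S4, and the table ends with %1 mapped to P4.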

Some Issues
• How do we rename several (2-4) instructions per cycle?
• How do we make sure that the correct value winds up in the register?
• How do we make sure events (exceptions) are handled in the right order?
• When can we move a load past a store?

Summary
• Dynamic Scheduling
  • Out of order issue
• Register renaming
  • gets rid of WAW and WAR hazards
• Instruction window
  • wait for operands and execution unit
• Next time
  • Reorder buffers
  • Start caching!!!!
