
Advanced Pipelining


Presentation Transcript


  1. Advanced Pipelining • Optimally Scheduling Code • Optimally Programming Code • Scheduling for Superscalars (6.9) • Exceptions (5.6, 6.8)

  2. Optimally schedule code

     for (i = 0; i < N; i++)
         A[i] = A[i] + 10;

     &A[0] in $s1, &A[N] in $s2 (the loop runs while $s1 < $s2)

           slt  $t1, $s3, $s0
           beq  $t1, $0, end
     loop: lw   $t0, 0($s1)
           addi $t0, $t0, 10
           sw   $t0, 0($s1)
           addi $s1, $s1, 4
           slt  $t1, $s1, $s2
           bne  $t1, $0, loop

  3. Step 1: Identify Dependencies

           lw   $t0, 0($s1)
           addi $t0, $t0, 10
           sw   $t0, 0($s1)
           addi $s1, $s1, 4
           slt  $t1, $s1, $s2
           bne  $t1, $0, loop

     $t0: lw -> addi (RAW)
     $t0: addi -> sw (RAW)

  4. Step 2: Draw timing diagram WITH DATA FORWARDING

           lw   $t0, 0($s1)
           addi $t0, $t0, 10
           sw   $t0, 0($s1)
           addi $s1, $s1, 4
           slt  $t1, $s1, $s2
           bne  $t1, $0, loop

     [Timing diagram to fill in; pipeline stages: F D X M W]

  5. Step 3: Remove WAR/WAW dependencies

           lw   $t0, 0($s1)
           addi $t0, $t0, 10
           sw   $t0, 0($s1)
           addi $s1, $s1, 4
           slt  $t1, $s1, $s2
           bne  $t1, $0, loop

     RAW, WAR, WAW: target the false dependencies.

     [Timing diagram: lw, addi, sw, addi, slt, bne, each passing through stages F D X M W]

  6. Step 3: Remove WAR/WAW dependencies

     Original:
           lw   $t0, 0($s1)
           sw   $t0, 0($s1)
           addi $s1, $s1, 4

     Incorrect:
           lw   $t0, 0($s1)
           addi $s1, $s1, 4
           sw   $t0, 0($s1)

     Correct:
           lw   $t0, 0($s1)
           addi ____
           sw   ____

  7. Before:
           lw   $t0, 0($s1)
           addi $t0, $t0, 10
           sw   $t0, 0($s1)
           addi $s1, $s1, 4
           slt  $t1, $s1, $s2
           bne  $t1, $0, loop

     After:
           lw   $t0, 0($s1)
           addi $s1, $s1, 4
           addi $t0, $t0, 10
           sw   $t0, ____($s1)
           slt  $t1, $s1, $s2
           bne  $t1, $0, loop

  8. Step 3: Remove WAR/WAW dependencies

           lw   $t0, 0($s1)
           addi $s1, $s1, 4
           addi $t0, $t0, 10
           slt  $t1, $s1, $s2
           sw   $t0, -4($s1)
           bne  $t1, $0, loop

     [Timing diagram: lw, addi, addi, slt, sw, bne, each passing through stages F D X M W]
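The -4 offset on the moved store can be checked in C. The sketch below mirrors the original and the rescheduled loop bodies with pointer arithmetic. The function names `add10_original` and `add10_scheduled` are illustrative and not from the slides, and the code assumes 4-byte `int` elements so that one pointer step matches `addi $s1, $s1, 4`.

```c
#include <stddef.h>

/* Original order: load, add, store, then advance the pointer. */
void add10_original(int *a, size_t n)
{
    int *p = a;                /* $s1 = &A[0] */
    int *end = a + n;          /* $s2 = &A[N] */
    if (n == 0)
        return;                /* guard before the loop label */
    do {
        int t = *p;            /* lw   $t0, 0($s1)   */
        t += 10;               /* addi $t0, $t0, 10  */
        *p = t;                /* sw   $t0, 0($s1)   */
        p += 1;                /* addi $s1, $s1, 4   */
    } while (p < end);         /* slt / bne          */
}

/* Rescheduled order: the pointer bump moves up behind the load,
   so the store has to reach back one element (offset -4 bytes). */
void add10_scheduled(int *a, size_t n)
{
    int *p = a;
    int *end = a + n;
    if (n == 0)
        return;
    for (;;) {
        int t = *p;            /* lw   $t0, 0($s1)   */
        p += 1;                /* addi $s1, $s1, 4   */
        t += 10;               /* addi $t0, $t0, 10  */
        int more = p < end;    /* slt  $t1, $s1, $s2 */
        p[-1] = t;             /* sw   $t0, -4($s1)  */
        if (!more)             /* bne  $t1, $0, loop */
            break;
    }
}
```

Both functions leave the array in the same state; only the instruction order differs, which is what lets the independent `addi` and `slt` sit between dependent instructions.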

  9. Software Control Hazard Removal

     if ((x % 2) == 1)
         isodd = 1;
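The slide leaves the rewrite as an exercise. One possible branch-free answer is sketched below, assuming `x` is unsigned and `isodd` only ever holds 0 or 1; the function names are made up for illustration.

```c
/* Branching version, as on the slide. */
unsigned set_isodd_branch(unsigned x, unsigned isodd)
{
    if ((x % 2) == 1)
        isodd = 1;
    return isodd;
}

/* Branch-free version: (x & 1) is 1 exactly when x is odd,
   and OR-ing it in leaves an earlier 1 untouched. */
unsigned set_isodd_calc(unsigned x, unsigned isodd)
{
    return isodd | (x & 1u);
}
```

An optimizing compiler may already produce the second form from the first, so this is a sketch of the idea rather than a guaranteed speedup.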

  10. Software Control Hazard Removal

      if (x == true)
          y = false;
      else
          y = true;
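A possible answer for this one, assuming `x` and `y` are C99 `bool` values: the if/else collapses to a single logical negation. The function names are illustrative only.

```c
#include <stdbool.h>

/* Branching version, as on the slide. */
bool invert_branch(bool x)
{
    bool y;
    if (x == true)
        y = false;
    else
        y = true;
    return y;
}

/* Calculated version: no conditional control flow in the source. */
bool invert_calc(bool x)
{
    return !x;
}
```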

  11. Software Control Hazard Removal

      if ((x == MON) || (x == TUE) || (x == WED)) { }
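One way to trade the three comparisons for a single one is a range check. The sketch assumes the day constants are consecutive integers, which the slide does not state; the `enum day` definition and the function names are hypothetical.

```c
/* Assumed encoding: consecutive values, SUN = 0 ... SAT = 6. */
enum day { SUN, MON, TUE, WED, THU, FRI, SAT };

/* Three tests, each of which may compile to its own branch. */
int is_mon_to_wed_branchy(enum day x)
{
    return (x == MON) || (x == TUE) || (x == WED);
}

/* One test: shift the range down to start at 0, then compare as unsigned.
   Values below MON wrap around to large unsigned numbers and fail the test. */
int is_mon_to_wed_range(enum day x)
{
    return (unsigned)x - (unsigned)MON <= (unsigned)(WED - MON);
}
```

If the constants were not consecutive, a bit mask over the day values would be another option.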

  12. Increasing Branch Performance

      if ((TheCoinTossIsHeads) || (StudentStudiedForExam)) { }

  13. What does it all mean? • Does that mean that error-checking code is bad? That is a whole lot of branches if you do it well!!!

  14. The moral is….. • Calculation is less expensive than …..

  15. Superscalars - Parallelism • Increase Depth – assembly line – build many cars at the same time, but each car is in a different stage of assembly. • Increase Width – multiple assembly lines – build many cars at the same time by building many lines, all of which operate simultaneously. • Ford mass produces cars. We want to “mass produce” instructions.

  16. “Superpipelining” (deep pipelining – many stages) • Limiting returns because…. • Register delays are __________________________ of clock • Difficult to __________________

  17. SuperScalars • __________ parts of pipeline • Multiple instructions in _______ stage at once

  18. SuperScalars • Which instructions can execute in parallel? • Fetching multiple instructions per cycle

  19. Static Scheduling – VLIW or EPIC (Itanium) • __________ schedules the instructions • If one instruction stalls, all following instructions stall • Book Example: SuperScalar MIPS: • Two instructions / cycle • one alu/branch, one ld/st each cycle

  20. Schedule for SS MIPS

      Loop: lw   $t0, 0($s1)
            addu $t0, $t0, $s2
            sw   $t0, 0($s1)
            addi $s1, $s1, -4
            bne  $s1, $zero, Loop

      PC    ALU/branch    ld/st
       0
       8
      16
      24
      32

  21. SuperScalars - Static

      [Pipeline diagram: stages Fetch, Decode, Execute, Memory, WriteBack; instructions in flight: lw, sw, addi, addu, bne; annotations: Read Values, Write Values]

  22. Loop Problem • Problem: • Too many _______________ in loop • Not enough ______________ to fill in holes • Solution: • Do ______________ at once • More instructions • Only one branch

  23. Loop Unrolling, Step 1: Unroll Loop

      Original:
      Loop: lw   $t0, 0($s1)
            addi $s1, $s1, -4
            addu $t0, $t0, $s2
            sw   $t0, 4($s1)
            bne  $s1, $zero, Loop

      Unrolled:
      Loop: lw   $t0, 0($s1)
            addi $s1, $s1, -4
            addu $t0, $t0, $s2
            sw   $t0, 4($s1)
            lw   $t0, 0($s1)
            addi $s1, $s1, -4
            addu $t0, $t0, $s2
            sw   $t0, 4($s1)
            bne  $s1, $zero, Loop

  24. Loop Unrolling, Step 2: Rename Registers

      But wait!!! How has this helped? There are tons of dependencies? Whatever are we to do? Register Renaming!!!

      Loop: lw   $t0, 0($s1)
            addi $s1, $s1, -4
            addu $t0, $t0, $s2
            sw   $t0, 4($s1)
            lw   $t1, 0($s1)
            addi $s1, $s1, -4
            addu $t1, $t1, $s2
            sw   $t1, 4($s1)
            bne  $s1, $zero, Loop

  25. Loop Unrolling, Step 2: Rename Registers (Repeated slide for your reference)

      Before renaming:
      Loop: lw   $t0, 0($s1)
            addi $s1, $s1, -4
            addu $t0, $t0, $s2
            sw   $t0, 4($s1)
            lw   $t0, 0($s1)
            addi $s1, $s1, -4
            addu $t0, $t0, $s2
            sw   $t0, 4($s1)
            bne  $s1, $zero, Loop

      After renaming:
      Loop: lw   $t0, 0($s1)
            addi $s1, $s1, -4
            addu $t0, $t0, $s2
            sw   $t0, 4($s1)
            lw   $t1, 0($s1)
            addi $s1, $s1, -4
            addu $t1, $t1, $s2
            sw   $t1, 4($s1)
            bne  $s1, $zero, Loop

  26. Loop Unrolling, Step 3: Reduce Instructions

      Move both pointer updates together (offsets to fill in):
      Loop: lw   $t0, 0($s1)
            addi $s1, $s1, -4
            addi $s1, $s1, -4
            addu $t0, $t0, $s2
            sw   $t0, ___($s1)
            lw   $t1, ___($s1)
            addu $t1, $t1, $s2
            sw   $t1, 4($s1)
            bne  $s1, $zero, Loop

      Merge them into a single addi:
      Loop: lw   $t0, 0($s1)
            addi $s1, $s1, -8
            addu $t0, $t0, $s2
            sw   $t0, 8($s1)
            lw   $t1, 4($s1)
            addu $t1, $t1, $s2
            sw   $t1, 4($s1)
            bne  $s1, $zero, Loop
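The same three steps (unroll, rename, merge the pointer updates) can be sketched at the C level. This is an illustration under stated assumptions, not the slides' code: the function names are invented, the array is walked from the top down as in the MIPS loop, and the element count `n` is assumed to be even.

```c
#include <stddef.h>

/* Rolled loop: one element and one loop test per iteration. */
void add_scalar(int *a, size_t n, int s)
{
    for (size_t i = n; i > 0; i--)
        a[i - 1] += s;
}

/* Unrolled by two: two elements per loop test. The separate temporaries
   t0 and t1 play the role of the renamed registers $t0 and $t1, and the
   single "i -= 2" corresponds to the merged addi $s1, $s1, -8.
   Assumes n is even. */
void add_scalar_unrolled2(int *a, size_t n, int s)
{
    for (size_t i = n; i > 0; i -= 2) {
        int t0 = a[i - 1];
        int t1 = a[i - 2];
        t0 += s;
        t1 += s;
        a[i - 1] = t0;
        a[i - 2] = t1;
    }
}
```

Because `t0` and `t1` never share storage, the two element updates are independent and can be interleaved freely by a scheduler.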

  27. Loop Unrolling, Step 4: Schedule

      Loop: lw   $t0, 0($s1)        # lw1
            addi $s1, $s1, -8
            addu $t0, $t0, $s2      # addu1
            sw   $t0, 8($s1)        # sw1
            lw   $t1, 4($s1)        # lw2
            addu $t1, $t1, $s2      # addu2
            sw   $t1, 4($s1)        # sw2
            bne  $s1, $zero, Loop

      ALU/branch    lw/sw
                    lw1

  28. Performance Comparison

      Original:
      ALU/branch               ld/st
                               lw   $t0, 0($s1)
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      bne  $s1, $zero, L       sw   $t0, 4($s1)

      Unrolled:
      [no schedule recoverable]

  29. Static Scheduling Summary • Code size ______________ (because of nops) • It cannot resolve __________ dependencies • If one instruction stalls, ___________________

  30. Dynamic Scheduling • _________ schedules ready instructions • Only ___________ instructions stall • _______________ resolved in hardware

  31. 4-wide Dynamic Superscalar: Fetch • Fetch 4 instructions each cycle

      Loop: lw   r2, 0(r1)
            addu r2, r2, r5
            sw   r2, 0(r1)
            addi r1, r1, -4
            bne  r1, r7, Loop

      [Diagram: Register Alias Table, Register File, Instruction Window, Ld/St Queue, functional units (Ld/St, 1Add, 2Add, 3Add), Commit Buffer. The window shows the instructions both as fetched and with renamed operands: addu r2, ldst1, r5; sw 1add1, 0(r1); bne 2add1, r7, Loop.]

  32. 4-wide Dynamic Superscalar: Decode • Register Alias Table records • Current Register Number (WAW/WAR Register Renaming) • or

      [Same loop and diagram as slide 31]

  33. 4-wide Dynamic Superscalar: Decode • Register Alias Table records • 1. Current Register Number (WAW/WAR Register Renaming) • or • 2. Functional Unit (RAW – result not ready)

      [Same loop and diagram as slide 31]
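A much simplified sketch of the alias-table idea follows: each architectural register maps either to the register file or to the functional unit that will produce its next value. The tag names follow the labels visible in the diagram (ldst1, 1add1, 2add1); everything else, including the function names and the 32-register size, is an assumption for illustration. Real designs usually rename onto physical registers or reorder-buffer entries instead.

```c
#define NUM_REGS 32

/* Where the newest value of an architectural register will come from. */
enum source {
    FROM_REGFILE = 0,   /* value is already in the register file */
    FROM_LDST1,         /* pending result of the Ld/St unit      */
    FROM_1ADD1,         /* pending result of adder 1             */
    FROM_2ADD1,         /* pending result of adder 2             */
    FROM_3ADD1          /* pending result of adder 3             */
};

/* One entry per architectural register. */
static enum source rat[NUM_REGS];

/* Mark every register as readable from the register file. */
void rat_reset(void)
{
    for (int r = 0; r < NUM_REGS; r++)
        rat[r] = FROM_REGFILE;
}

/* Decode, source operand: which name should the instruction wait on? */
enum source rat_lookup(int reg)
{
    return rat[reg];
}

/* Decode, destination operand: later readers must wait on this unit. */
void rat_set_producer(int reg, enum source unit)
{
    rat[reg] = unit;
}

/* Result written back: clear the entry only if no newer producer replaced it. */
void rat_complete(int reg, enum source unit)
{
    if (rat[reg] == unit)
        rat[reg] = FROM_REGFILE;
}
```

Walking the loop through this table gives the renamed forms shown in the diagram: after `lw r2, 0(r1)` sets r2 to ldst1, the `addu` reads `ldst1` and sets r2 to 1add1, the `sw` reads `1add1`, the `addi` sets r1 to 2add1, and the `bne` reads `2add1`.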

  34. 4-wide Dynamic Superscalar: Execute • Wait until your inputs are ready

      [Same loop and diagram as slide 31]

  35. 4-wide Dynamic Superscalar: Execute • Execute once they are ready

      [Same loop and diagram as slide 31]

  36. 4-wide Dynamic Superscalar: Memory • First calculate the address

      [Same loop and diagram as slide 31]

  37. 4-wide Dynamic Superscalar: Memory • Ld/St Queue checks memory addresses – out of order lw/sw

      [Same loop and diagram as slide 31]

  38. 4-wide Dynamic Superscalar: Commit • Instructions wait until all previous instructions have completed

      [Same loop and diagram as slide 31, with a key: Waiting for value, Reading value. The Commit Buffer holds addi r1, r1, -4 and addu r2, r2, r5.]

  39. Fallacies & Pitfalls • Pipelining is easy • ______________ is difficult • Instruction set has no impact on pipelining • Complicated _____________ & _____________________ instructions complicate pipelining immensely

  40. Technology Influences • Pipelining ideas are good ideas regardless of technology • Only recently, with extra chip space, has ___________________ become better than ____________________ • Now, pipelining limited by ________

  41. Exceptions – Unexpected Events • Internal • External

  42. Definitions • Anything unexpected happens • External event occurs • Internal event occurs • Change in control flow

  43. Exception-Handling • Stop • Transfer control to OS • Tell OS what happened • Begin executing where we left off

  44. 1. Detect Exception • Add control lines to detect errors

  45. Step 2: Store PC into EPC

      [Datapath diagram: PC, Instruction Memory, Register File, Sign Ext (16 to 32 bits), Data Memory]

  46. Step 3: Tell OS the problem • Store error code in the _________ • Use vectored interrupts • Use error code to determine _________

  47. Cause Register • Set a flag in the cause register • How does the OS find out if an overflow occurred if the bit corresponding to an overflow is bit 5?
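One way the handler could answer the slide's question is to mask out bit 5 of the cause value. The sketch assumes a 32-bit cause register whose contents have already been copied into an ordinary variable; the macro and function names are hypothetical.

```c
#include <stdint.h>

/* Bit position from the slide's question. */
#define CAUSE_OVERFLOW_BIT 5u

/* Returns 1 if the overflow flag (bit 5) is set in the cause value, else 0. */
int overflow_occurred(uint32_t cause)
{
    return (int)((cause >> CAUSE_OVERFLOW_BIT) & 1u);
}
```

For example, `overflow_occurred(0x20)` is 1 because 0x20 has only bit 5 set, while `overflow_occurred(0x1F)` is 0 because bits 0 to 4 are set but bit 5 is not.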

  48. Vectored Interrupts • The address of trap handler is determined by cause
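In software terms, a vector table behaves like an array of handler addresses indexed by the cause. The sketch below models that with C function pointers. The cause codes, handler names, and return values are all invented for illustration; in hardware the table holds addresses that are loaded into the PC.

```c
/* Hypothetical cause codes; real encodings are architecture-specific. */
enum cause {
    CAUSE_OVERFLOW,
    CAUSE_UNDEFINED_INSTR,
    CAUSE_EXTERNAL_IO,
    NUM_CAUSES
};

typedef int (*handler_fn)(void);

/* Stand-in handlers; each returns a distinct code so dispatch is observable. */
static int handle_overflow(void)        { return 100; }
static int handle_undefined_instr(void) { return 200; }
static int handle_external_io(void)     { return 300; }

/* The vector table: one handler address per cause. */
static const handler_fn vector_table[NUM_CAUSES] = {
    [CAUSE_OVERFLOW]        = handle_overflow,
    [CAUSE_UNDEFINED_INSTR] = handle_undefined_instr,
    [CAUSE_EXTERNAL_IO]     = handle_external_io,
};

/* The cause selects the handler directly, with no chain of comparisons. */
int dispatch(enum cause c)
{
    return vector_table[c]();
}
```

The contrast with the cause-register approach is that a single OS entry point would first have to read the cause and branch on it, whereas here the lookup itself picks the destination.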

  49. Cause Register – Go to OS Handler

      [Datapath diagram: PC, Instruction Memory, Register File, Sign Ext (16 to 32 bits), Data Memory; added labels: Cause, EPC, -4]

  50. Vectored Interrupt – Go to OS

      [Datapath diagram: PC, Instruction Memory, Register File, Sign Ext (16 to 32 bits), Data Memory; added labels: Cause, Vector Table, EPC, -4]
