
3.13. Fallacies and Pitfalls




  1. 3.13. Fallacies and Pitfalls
  • Fallacy: Processors with lower CPIs will always be faster
  • Fallacy: Processors with faster clock rates will always be faster
  • Balance must be found
  • E.g. a sophisticated pipeline: CPI ↓, clock cycle ↑ (see the sketch below)
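
  The CPU time equation makes the trade-off concrete: time = instruction count × CPI × clock cycle time. A minimal sketch in C, using two hypothetical processors (the instruction count, CPIs and clock rates below are made-up values for illustration, not figures from the text):

      #include <stdio.h>

      /* CPU time = instruction count * CPI * clock cycle time.
         Hypothetical numbers: processor A has the lower CPI,
         processor B has the faster clock, yet B finishes first. */
      int main(void) {
          double insns = 1e9;                       /* same program on both */

          double cpi_a = 1.0, clock_a = 500e6;      /* low CPI, 500 MHz  */
          double cpi_b = 1.6, clock_b = 1000e6;     /* higher CPI, 1 GHz */

          double time_a = insns * cpi_a / clock_a;  /* 2.0 s */
          double time_b = insns * cpi_b / clock_b;  /* 1.6 s */

          printf("A: %.2f s, B: %.2f s\n", time_a, time_b);
          return 0;
      }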

  2. Fallacies and Pitfalls
  • Pitfall: Emphasizing improving CPI by increasing issue rate, while sacrificing clock rate, can decrease performance
  • Again, a question of balance
  • SuperSPARC –vs– HP PA 7100
  • Complex interactions between cycle time and organisation

  3. Fallacies and Pitfalls
  • Pitfall: Improving only one aspect of a multiple-issue processor and expecting overall performance improvement
  • Amdahl’s Law!
  • Boosting performance of one area may uncover problems in another

  4. Fallacies and Pitfalls
  • Pitfall: Sometimes bigger and dumber is better!
  • Alpha 21264: sophisticated multilevel tournament branch predictor
  • Alpha 21164: simple two-bit predictor (sketched below)
  • The 21164 performs better for a transaction-processing application!
  • It can handle twice as many local branch predictions
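
  As a reminder of what the “simple” scheme does, here is a minimal sketch of a two-bit saturating-counter predictor (the table size, indexing and function names are illustrative assumptions, not the 21164’s actual configuration):

      #include <stdbool.h>
      #include <stdint.h>

      /* One 2-bit saturating counter per entry:
         0,1 = predict not-taken; 2,3 = predict taken. */
      #define PRED_ENTRIES 1024              /* illustrative size only */
      static uint8_t counters[PRED_ENTRIES];

      static unsigned pred_index(uint32_t branch_pc) {
          return (branch_pc >> 2) % PRED_ENTRIES;   /* simple PC-based index */
      }

      bool predict_taken(uint32_t branch_pc) {
          return counters[pred_index(branch_pc)] >= 2;
      }

      /* Once the outcome is known, move the counter one step toward
         the observed direction, saturating at 0 and 3. */
      void predictor_update(uint32_t branch_pc, bool taken) {
          uint8_t *c = &counters[pred_index(branch_pc)];
          if (taken) { if (*c < 3) (*c)++; }
          else       { if (*c > 0) (*c)--; }
      }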

  5. Concluding Remarks
  • Lots of open questions!
  • Clock speed –vs– CPI
  • Power issues
  • Exploiting parallelism
  • ILP –vs– explicit

  6. Characteristics of Modern (2001) Processors
  • Figure 3.61
  • 3–4-way superscalar
  • 4–22 stage pipelines
  • Branch prediction
  • Register renaming (except UltraSPARC)
  • 400 MHz – 1.7 GHz
  • 7–130 million transistors

  7. Chapter 4: Exploiting ILP with Software

  8. 4.1. Compiler Techniques for Exposing ILP
  • Compilers can improve the performance of simple pipelines
  • Reduce data hazards
  • Reduce control hazards

  9. Loop Unrolling
  • Compiler technique to increase ILP
  • Duplicate loop body
  • Decrease iterations
  • Example:
  • Basic code: 10 cycles per iteration
  • Scheduled: 6 cycles

      for (int k = 0; k < 1000; k++) {
          x[k] = x[k] + s;
      }

  10. Loop Unrolling
  • Basic code: 7 cycles per “iteration”
  • Scheduled: 3.5 cycles (no stalls!)

  Unrolled four times:

      for (int k = 0; k < 1000; k += 4) {
          x[k]   = x[k]   + s;
          x[k+1] = x[k+1] + s;
          x[k+2] = x[k+2] + s;
          x[k+3] = x[k+3] + s;
      }

  Original loop:

      for (int k = 0; k < 1000; k++) {
          x[k] = x[k] + s;
      }

  11. Loop Unrolling
  • Requires clever compilers
  • Analysing data dependences, name dependences and control dependences
  • Limitations:
  • Code size
  • Decrease in amortisation of overheads
  • “Register pressure”
  • Compiler limitations
  • Useful for any architecture

  12. Superscalar Performance
  • Two-issue MIPS (int + FP)
  • 2.4 cycles per “iteration”
  • Unrolled five times

  13. 4.2. Static Branch Prediction
  • Useful:
  • where behaviour can be predicted at compile time
  • to assist dynamic prediction
  • Architectural support
  • Delayed branches

  14. Static Branch Prediction
  • Simple: predict taken
  • Average misprediction rate of 34% (SPEC)
  • Range: 9% – 59%
  • Better: predict backward taken, forward not-taken (heuristic sketched below)
  • Worse for SPEC!
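
  A minimal sketch of the backward-taken / forward-not-taken heuristic (the function name and PC representation are illustrative assumptions):

      #include <stdbool.h>
      #include <stdint.h>

      /* A branch whose target lies at a lower address than the branch
         itself is assumed to close a loop and is predicted taken;
         forward branches are predicted not-taken. */
      bool static_predict_taken(uint32_t branch_pc, uint32_t target_pc) {
          return target_pc < branch_pc;   /* backward => predict taken */
      }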

  15. Static Branch Prediction
  • Advanced compiler analysis can do better
  • Profiling is very useful
  • Misprediction rates with profile-based prediction: FP: 9% ± 4%; Int: 15% ± 5%

  16. 4.3. Static Multiple Issue: VLIW
  • Compiler groups instructions into “packets”, checking for dependences
  • Remove dependences
  • Flag dependences
  • Simplifies hardware

  17. VLIW
  • First machines used a wide instruction with multiple operations per instruction
  • Hence Very Long Instruction Word (VLIW)
  • 64–128 bits
  • Alternative: group several instructions into an issue packet

  18. VLIW Architectures
  • Multiple functional units
  • Compiler selects instructions for each unit to create one long instruction / an issue packet (sketched below)
  • Example: five operations
  • Integer/branch, 2 × FP, 2 × memory access
  • Need lots of parallelism
  • Use loop unrolling, or global scheduling
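
  A minimal sketch of what such a five-slot issue packet might look like as a data structure (the field names and layout are illustrative assumptions, not any real VLIW encoding):

      #include <stdint.h>

      /* One operation slot; a NOP fills any slot the compiler cannot use. */
      typedef struct {
          uint8_t opcode;              /* NOP, ADD, L.D, S.D, BNE, ... */
          uint8_t dest, src1, src2;    /* register numbers             */
          int16_t immediate;
      } operation_t;

      /* One VLIW instruction: five operations issued together each cycle.
         The compiler, not the hardware, guarantees they are independent. */
      typedef struct {
          operation_t int_or_branch;   /* integer ALU op or branch */
          operation_t fp[2];           /* two floating-point ops   */
          operation_t mem[2];          /* two loads/stores         */
      } issue_packet_t;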

  19. Example

      for (int k = 0; k < 1000; k++) {
          x[k] = x[k] + s;
      }

  • Loop unrolled seven times! (source-level view below)
  • 1.29 cycles per result
  • 60% of available instruction “slots” filled
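
  At the source level the seven-way unrolling looks roughly like this (a sketch; since 1000 is not a multiple of 7, a cleanup loop is assumed to handle the leftover iterations):

      int k;
      /* Main unrolled loop: seven independent results per pass,
         giving the compiler enough operations to fill the VLIW slots. */
      for (k = 0; k + 6 < 1000; k += 7) {
          x[k]   = x[k]   + s;
          x[k+1] = x[k+1] + s;
          x[k+2] = x[k+2] + s;
          x[k+3] = x[k+3] + s;
          x[k+4] = x[k+4] + s;
          x[k+5] = x[k+5] + s;
          x[k+6] = x[k+6] + s;
      }
      /* Cleanup loop for the remaining 1000 % 7 iterations. */
      for (; k < 1000; k++) {
          x[k] = x[k] + s;
      }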

  20. Summary of Improvements

  21. Drawbacks of Original VLIWs
  • Large code size
  • Need to use loop unrolling
  • Wasted space for unused slots
  • Clever encoding techniques, compression
  • Lock-step execution
  • Stalling one unit stalls them all
  • Binary code compatibility
  • Variations on structure required recompilation

  22. 4.4. Compiler Support for Exploiting ILP
  • We will not cover this section in detail
  • Loop unrolling
  • Loop-carried dependences
  • Software pipelining
  • Interleave instructions from different iterations (sketched below)
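
  A minimal source-level sketch of software pipelining applied to the same x[k] = x[k] + s loop (the function name is illustrative; in practice the compiler performs this restructuring on the scheduled machine code, not the C source):

      /* Software-pipelined form of:
             for (int k = 0; k < n; k++) x[k] = x[k] + s;
         Each steady-state pass mixes work from three different iterations:
         it stores the result for iteration k, does the add for iteration
         k + 1, and loads the operand for iteration k + 2. */
      void add_scalar_swp(double *x, int n, double s) {
          if (n < 2) {                  /* tiny cases: no pipeline needed */
              if (n == 1) x[0] = x[0] + s;
              return;
          }
          double loaded = x[0];         /* prologue: load for iteration 0 */
          double summed = loaded + s;   /* prologue: add  for iteration 0 */
          loaded = x[1];                /* prologue: load for iteration 1 */

          for (int k = 0; k < n - 2; k++) {
              x[k]   = summed;          /* store, iteration k     */
              summed = loaded + s;      /* add,   iteration k + 1 */
              loaded = x[k + 2];        /* load,  iteration k + 2 */
          }

          x[n - 2] = summed;            /* epilogue */
          x[n - 1] = loaded + s;
      }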

  23. 4.5. Hardware Support for Extracting More Parallelism
  • Techniques like loop-unrolling work well when branch behaviour can be predicted at compile time
  • If not, we need more advanced techniques:
  • Conditional instructions
  • Hardware support for compiler speculation

  24. Conditional or Predicated Instructions
  • Instructions have associated conditions
  • If the condition is true, execution proceeds normally
  • If not, the instruction becomes a no-op
  • Removes control hazards

  Source:

      if (a == 0) b = c;

  With a branch:

      bnez %r8, L1
      nop
      mov %r1, %r2
      L1: ...

  With a conditional move:

      cmovz %r8, %r1, %r2

  25. Conditional Instructions
  • Control hazards effectively replaced by data hazards
  • Can be used for speculation
  • Compiler reorders instructions depending on the likely outcome of branches

  26. Limitations on Conditional Instructions
  • Annulled instructions still take execution time
  • But they may occupy otherwise stalled time
  • Most useful when conditions can be evaluated early
  • Limited usefulness for complex conditions
  • May be slower than unconditional operations

  27. Conditional Instructions in Practice
