
3.13. Fallacies and Pitfalls




  1. 3.13. Fallacies and Pitfalls
  • Fallacy: Processors with lower CPIs will always be faster
  • Fallacy: Processors with faster clock rates will always be faster
  • Balance must be found
  • E.g. a sophisticated pipeline: CPI ↓, clock cycle ↑ (see the sketch below)
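
  The CPU time equation makes the trade-off concrete: time = instruction count × CPI × clock cycle time. A minimal sketch in C, using two hypothetical processors (the instruction count, CPIs and clock rates below are made-up values for illustration, not figures from the text):

      #include <stdio.h>

      /* CPU time = instruction count * CPI * clock cycle time.
         Hypothetical numbers: processor A has the lower CPI,
         processor B has the faster clock, yet B finishes first. */
      int main(void) {
          double insns = 1e9;                       /* same program on both */

          double cpi_a = 1.0, clock_a = 500e6;      /* low CPI, 500 MHz  */
          double cpi_b = 1.6, clock_b = 1000e6;     /* higher CPI, 1 GHz */

          double time_a = insns * cpi_a / clock_a;  /* 2.0 s */
          double time_b = insns * cpi_b / clock_b;  /* 1.6 s */

          printf("A: %.2f s, B: %.2f s\n", time_a, time_b);
          return 0;
      }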

  2. Fallacies and Pitfalls
  • Pitfall: Emphasizing improving CPI by increasing issue rate, while sacrificing clock rate, can decrease performance
  • Again, a question of balance
  • SuperSPARC –vs– HP PA 7100
  • Complex interactions between cycle time and organisation

  3. Fallacies and Pitfalls
  • Pitfall: Improving only one aspect of a multiple-issue processor and expecting overall performance improvement
  • Amdahl’s Law!
  • Boosting performance of one area may uncover problems in another

  4. Fallacies and Pitfalls
  • Pitfall: Sometimes bigger and dumber is better!
  • Alpha 21264: sophisticated multilevel tournament branch predictor
  • Alpha 21164: simple two-bit predictor (sketched below)
  • The 21164 performs better for a transaction-processing application!
  • It can handle twice as many local branch predictions
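
  As a reminder of what the “simple” scheme does, here is a minimal sketch of a two-bit saturating-counter predictor (the table size, indexing and function names are illustrative assumptions, not the 21164’s actual configuration):

      #include <stdbool.h>
      #include <stdint.h>

      /* One 2-bit saturating counter per entry:
         0,1 = predict not-taken; 2,3 = predict taken. */
      #define PRED_ENTRIES 1024              /* illustrative size only */
      static uint8_t counters[PRED_ENTRIES];

      static unsigned pred_index(uint32_t branch_pc) {
          return (branch_pc >> 2) % PRED_ENTRIES;   /* simple PC-based index */
      }

      bool predict_taken(uint32_t branch_pc) {
          return counters[pred_index(branch_pc)] >= 2;
      }

      /* Once the outcome is known, move the counter one step toward
         the observed direction, saturating at 0 and 3. */
      void predictor_update(uint32_t branch_pc, bool taken) {
          uint8_t *c = &counters[pred_index(branch_pc)];
          if (taken) { if (*c < 3) (*c)++; }
          else       { if (*c > 0) (*c)--; }
      }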

  5. Concluding Remarks
  • Lots of open questions!
  • Clock speed –vs– CPI
  • Power issues
  • Exploiting parallelism
  • ILP –vs– explicit

  6. Characteristics of Modern (2001) Processors
  • Figure 3.61
  • 3–4-way superscalar
  • 4–22 stage pipelines
  • Branch prediction
  • Register renaming (except UltraSPARC)
  • 400 MHz – 1.7 GHz
  • 7–130 million transistors

  7. Chapter 4: Exploiting ILP with Software

  8. 4.1. Compiler Techniques for Exposing ILP
  • Compilers can improve the performance of simple pipelines
  • Reduce data hazards
  • Reduce control hazards

  9. Loop Unrolling
  • Compiler technique to increase ILP
  • Duplicate loop body
  • Decrease iterations
  • Example:
  • Basic code: 10 cycles per iteration
  • Scheduled: 6 cycles

      for (int k = 0; k < 1000; k++) {
          x[k] = x[k] + s;
      }

  10. Loop Unrolling
  • Basic code: 7 cycles per “iteration”
  • Scheduled: 3.5 cycles (no stalls!)

  Unrolled four times:

      for (int k = 0; k < 1000; k += 4) {
          x[k]   = x[k]   + s;
          x[k+1] = x[k+1] + s;
          x[k+2] = x[k+2] + s;
          x[k+3] = x[k+3] + s;
      }

  Original loop:

      for (int k = 0; k < 1000; k++) {
          x[k] = x[k] + s;
      }

  11. Loop Unrolling
  • Requires clever compilers
  • Analysing data dependences, name dependences and control dependences
  • Limitations:
  • Code size
  • Decrease in amortisation of overheads
  • “Register pressure”
  • Compiler limitations
  • Useful for any architecture

  12. Superscalar Performance
  • Two-issue MIPS (int + FP)
  • 2.4 cycles per “iteration”
  • Unrolled five times

  13. 4.2. Static Branch Prediction
  • Useful:
  • where behaviour can be predicted at compile time
  • to assist dynamic prediction
  • Architectural support
  • Delayed branches

  14. Static Branch Prediction
  • Simple: predict taken
  • Average misprediction rate of 34% (SPEC)
  • Range: 9% – 59%
  • Better: predict backward taken, forward not-taken (heuristic sketched below)
  • Worse for SPEC!
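
  A minimal sketch of the backward-taken / forward-not-taken heuristic (the function name and PC representation are illustrative assumptions):

      #include <stdbool.h>
      #include <stdint.h>

      /* A branch whose target lies at a lower address than the branch
         itself is assumed to close a loop and is predicted taken;
         forward branches are predicted not-taken. */
      bool static_predict_taken(uint32_t branch_pc, uint32_t target_pc) {
          return target_pc < branch_pc;   /* backward => predict taken */
      }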

  15. Static Branch Prediction
  • Advanced compiler analysis can do better
  • Profiling is very useful
  • Misprediction rates with profile-based prediction: FP: 9% ± 4%; Int: 15% ± 5%

  16. 4.3. Static Multiple Issue: VLIW
  • Compiler groups instructions into “packets”, checking for dependences
  • Remove dependences
  • Flag dependences
  • Simplifies hardware

  17. VLIW
  • First machines used a wide instruction with multiple operations per instruction
  • Hence Very Long Instruction Word (VLIW)
  • 64–128 bits
  • Alternative: group several instructions into an issue packet

  18. VLIW Architectures
  • Multiple functional units
  • Compiler selects instructions for each unit to create one long instruction / an issue packet (sketched below)
  • Example: five operations
  • Integer/branch, 2 × FP, 2 × memory access
  • Need lots of parallelism
  • Use loop unrolling, or global scheduling
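
  A minimal sketch of what such a five-slot issue packet might look like as a data structure (the field names and layout are illustrative assumptions, not any real VLIW encoding):

      #include <stdint.h>

      /* One operation slot; a NOP fills any slot the compiler cannot use. */
      typedef struct {
          uint8_t opcode;              /* NOP, ADD, L.D, S.D, BNE, ... */
          uint8_t dest, src1, src2;    /* register numbers             */
          int16_t immediate;
      } operation_t;

      /* One VLIW instruction: five operations issued together each cycle.
         The compiler, not the hardware, guarantees they are independent. */
      typedef struct {
          operation_t int_or_branch;   /* integer ALU op or branch */
          operation_t fp[2];           /* two floating-point ops   */
          operation_t mem[2];          /* two loads/stores         */
      } issue_packet_t;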

  19. Example

      for (int k = 0; k < 1000; k++) {
          x[k] = x[k] + s;
      }

  • Loop unrolled seven times! (source-level view below)
  • 1.29 cycles per result
  • 60% of available instruction “slots” filled
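
  At the source level the seven-way unrolling looks roughly like this (a sketch; since 1000 is not a multiple of 7, a cleanup loop is assumed to handle the leftover iterations):

      int k;
      /* Main unrolled loop: seven independent results per pass,
         giving the compiler enough operations to fill the VLIW slots. */
      for (k = 0; k + 6 < 1000; k += 7) {
          x[k]   = x[k]   + s;
          x[k+1] = x[k+1] + s;
          x[k+2] = x[k+2] + s;
          x[k+3] = x[k+3] + s;
          x[k+4] = x[k+4] + s;
          x[k+5] = x[k+5] + s;
          x[k+6] = x[k+6] + s;
      }
      /* Cleanup loop for the remaining 1000 % 7 iterations. */
      for (; k < 1000; k++) {
          x[k] = x[k] + s;
      }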

  20. Summary of Improvements

  21. Drawbacks of Original VLIWs
  • Large code size
  • Need to use loop unrolling
  • Wasted space for unused slots
  • Clever encoding techniques, compression
  • Lock-step execution
  • Stalling one unit stalls them all
  • Binary code compatibility
  • Variations on structure required recompilation

  22. 4.4. Compiler Support for Exploiting ILP
  • We will not cover this section in detail
  • Loop unrolling
  • Loop-carried dependences
  • Software pipelining
  • Interleave instructions from different iterations (sketched below)
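
  A minimal source-level sketch of software pipelining applied to the same x[k] = x[k] + s loop (the function name is illustrative; in practice the compiler performs this restructuring on the scheduled machine code, not the C source):

      /* Software-pipelined form of:
             for (int k = 0; k < n; k++) x[k] = x[k] + s;
         Each steady-state pass mixes work from three different iterations:
         it stores the result for iteration k, does the add for iteration
         k + 1, and loads the operand for iteration k + 2. */
      void add_scalar_swp(double *x, int n, double s) {
          if (n < 2) {                  /* tiny cases: no pipeline needed */
              if (n == 1) x[0] = x[0] + s;
              return;
          }
          double loaded = x[0];         /* prologue: load for iteration 0 */
          double summed = loaded + s;   /* prologue: add  for iteration 0 */
          loaded = x[1];                /* prologue: load for iteration 1 */

          for (int k = 0; k < n - 2; k++) {
              x[k]   = summed;          /* store, iteration k     */
              summed = loaded + s;      /* add,   iteration k + 1 */
              loaded = x[k + 2];        /* load,  iteration k + 2 */
          }

          x[n - 2] = summed;            /* epilogue */
          x[n - 1] = loaded + s;
      }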

  23. 4.5. Hardware Support for Extracting More Parallelism
  • Techniques like loop-unrolling work well when branch behaviour can be predicted at compile time
  • If not, we need more advanced techniques:
  • Conditional instructions
  • Hardware support for compiler speculation

  24. Conditional or Predicated Instructions
  • Instructions have associated conditions
  • If the condition is true, execution proceeds normally
  • If not, the instruction becomes a no-op
  • Removes control hazards

  Source:

      if (a == 0) b = c;

  With a branch:

      bnez %r8, L1
      nop
      mov %r1, %r2
      L1: ...

  With a conditional move:

      cmovz %r8, %r1, %r2

  25. Conditional Instructions
  • Control hazards effectively replaced by data hazards
  • Can be used for speculation
  • Compiler reorders instructions depending on the likely outcome of branches

  26. Limitations on Conditional Instructions
  • Annulled instructions still take execution time
  • But they may occupy otherwise stalled time
  • Most useful when conditions can be evaluated early
  • Limited usefulness for complex conditions
  • May be slower than unconditional operations

  27. Conditional Instructions in Practice
