
Master/Slave Speculative Parallelization




Presentation Transcript


  1. Master/Slave Speculative Parallelization Craig Zilles (U. Illinois) Guri Sohi (U. Wisconsin)

  2. The Basics • #1: a well-known problem: On-chip Communication • #2: a well-known opportunity: Program Predictability • #3: our novel approach to #1 using #2

  3. Problem: Communication • Cores becoming “communication limited” • Rather than “capacity limited” • Many, many transistors on a chip, but… • Can’t bring them all to bear on one thread • Control/data dependences = freq. communication

  4. Best core << chip size • There is a sweet spot for core size • Further size increases hurt either MHz or IPC • How can we maximize the core’s efficiency?

  5. Opportunity: Predictability • Many program behaviors are predictable • Control flow, dependences, values, stalls, etc. • Widely exploited by processors/compilers • But not to help increase effective core size • Core resources are used to make and validate predictions • Example: a perfectly-biased branch (bne, taken 100% or 0% of the time)
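As background for how cores already exploit this predictability, here is an illustrative sketch (not from the talk; all names invented) of a 2-bit saturating-counter branch predictor locking onto a perfectly-biased branch:

```python
# Hypothetical sketch of a 2-bit saturating branch predictor: after a brief
# warm-up it predicts a perfectly-biased (always-taken) branch correctly
# on every execution.

def predict(counter):
    return counter >= 2            # predict taken when counter is 2 or 3

def update(counter, taken):
    # Saturating counter: move toward 3 on taken, toward 0 on not-taken.
    return min(counter + 1, 3) if taken else max(counter - 1, 0)

counter, correct = 0, 0
outcomes = [True] * 100            # a perfectly-biased branch: always taken
for taken in outcomes:
    correct += predict(counter) == taken
    counter = update(counter, taken)

assert correct >= 98               # mispredicts at most the first two times
```

Even though the prediction is essentially always right, the core still spends predictor state, fetch slots, and execution resources on the branch, which is the waste the talk targets.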

  6. Speculative Execution • Execute code before/after the branch in parallel • The branch is fetched, predicted, executed, and retired • All of this occurs in the core • Uses space in the I-cache and branch predictor • Uses execution resources: not just the branch, but its backward slice

  7. Trace/Superblock Formation • Optimize code assuming the predicted path (e.g., a perfectly-biased bne) • Reduces the cost of the branch and surrounding code • The prediction is implicitly encoded in the executable • The code still verifies the prediction • The branch and its slice are still fetched, executed, committed, etc. • All of this occurs on the core

  8. Why waste core resources? • The branch is perfectly predictable! The core should only execute instructions that are not statically predictable!

  9. If not in the core, where? • Anywhere else on chip! • Because it is predictable: • It doesn’t prevent forward progress • We can tolerate latency to verify the prediction

  10. A concrete example: Master/Slave Speculative Parallelization • Execute a “distilled program” on one processor • A version of the program with predictable instructions removed • Faster than the original, but not guaranteed to be correct • Verify predictions by executing the original program • Parallelize verification by splitting it into “tasks” • Master core: executes the distilled program • Slave cores: parallel execution of the original program

  11. Talk Outline • Removing predictability from programs: “Approximation” • Externally verifying distilled programs: Master/Slave Speculative Parallelization (MSSP) • Results • Summary

  12. Approximation Transformations • Pretend you’ve proven the common case • Preserve correctness in the common case • Break correctness in the uncommon case • Use a profile to know the common case • (Figure: a CFG of blocks A, B, C with a 99%/1%-biased bne, letting A and B merge into a single block AB)

  13. Not just for branches • Values: ld r13, 0(X), where the load is highly invariant (usually gets 7) • Approximate it as addi $zero, 7, r13 • Memory dependences: st r12, 0(A) followed by ld r11, 0(B), where A and B may alias • If they rarely alias in practice, treat them as independent • If they almost always alias, forward the stored value: mv r12, r11
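The value approximation above can be sketched in a few lines of Python (a minimal illustration; the task functions and memory layout are invented): the distilled version assumes the profiled common value and skips the load, and correctness is recovered because the original program is still executed to verify the assumption.

```python
# Hedged sketch of value approximation (names invented for illustration).

def original_task(memory):
    # Original code: actually performs the load (ld r13, 0(X)).
    r13 = memory["X"]
    return r13 * 2             # some downstream use of the loaded value

def distilled_task():
    # Distilled code: the load was profiled as highly invariant (usually 7),
    # so it is approximated by a constant (addi $zero, 7, r13).
    r13 = 7
    return r13 * 2

memory = {"X": 7}              # the common case seen in the profile
assert distilled_task() == original_task(memory)   # approximation holds

memory = {"X": 9}              # the rare case the profile missed
assert distilled_task() != original_task(memory)   # original execution catches it
```

The distilled version never stalls on the load, but it is only 99.999%-equivalent; the mismatch in the rare case is exactly what MSSP’s external verification must detect.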

  14–17. Enables Traditional Optimizations (example from bzip2; the figure shows a control-flow graph with calls to printf, fprintf, and fwrite, and an exit path at csEOF) • Many static paths, but two dominant paths • Approximate away the unimportant paths • The result has a very straightforward structure that is easy for the compiler to optimize

  18. Effect of Approximation • Distilled code is equivalent to the original 99.999% of the time, with better execution characteristics • Fewer dynamic instructions: ~1/3 of original code • Smaller static size: ~2/5 of original code • Fewer taken branches: ~1/4 of original code • Smaller fraction of loads/stores • Shorter than the best non-speculative code • Removing checks makes the code incorrect .001% of the time

  19. Talk Outline • Removing predictability from programs: “Approximation” • Externally verifying distilled programs: Master/Slave Speculative Parallelization (MSSP) • Results • Summary

  20. Goal • Achieve the performance of the distilled program • Retain the correctness of the original program • Approach: use the distilled code to speed up the original program

  21. Checkpoint parallelization • Cut original program into “tasks” • Assign tasks to processors • Provide each a checkpoint of registers & memory • Completely decouples task execution • Tasks retrieve all live-ins from checkpoint • Checkpoints taken from distilled program • Captured in hardware • Stored as a “diff” from architected state
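The “diff from architected state” idea can be sketched as follows (a minimal illustration under invented names; the real mechanism is captured in hardware): a task resolves each live-in from the speculative diff first, falling back to committed architected state.

```python
# Hedged sketch of a checkpoint stored as a "diff" from architected state.

architected = {"r1": 10, "mem[A]": 3}      # committed register/memory state
diff = {"r1": 42}                          # master's speculative updates only

def read_live_in(name):
    # Checkpoint lookup: speculative diff first, then architected state.
    return diff.get(name, architected[name])

assert read_live_in("r1") == 42            # updated speculatively by the master
assert read_live_in("mem[A]") == 3         # unchanged, from architected state

# Committing a verified task merges its diff into architected state.
architected.update(diff)
assert architected["r1"] == 42
```

Storing only the diff keeps checkpoints small, which is consistent with the modest (tens of kB) speculation-buffering requirements reported in the results.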

  22. Master core: Executes distilled program Slave cores: Parallel execution of original program

  23. Example Execution (Master, Slave1, Slave2, Slave3) • Start the Master and the first Slave from architected state • The Master executes distilled tasks A’, B’, C’; the Slaves execute the original tasks A, B, C in parallel • After each distilled task, take a checkpoint and use it to start the next task • Verify B’s inputs against A’s outputs; commit state • A bad checkpoint at C is detected at the end of B: squash and restart from architected state
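The verify-and-squash loop can be condensed into a small sequential sketch (all names invented; real slaves run concurrently and state is registers plus memory, not a single integer): the Master’s checkpoints predict each task’s starting state, each Slave re-executes the original task from its checkpoint, and its output is compared against the next checkpoint.

```python
# Hedged sketch of MSSP-style checkpoint verification and squash/restart.

def original_task(state):
    return state + 1                      # stand-in for one task of the program

def distilled_task(state):
    # Distilled program: faster but occasionally wrong (bad at state 2).
    return state + (2 if state == 2 else 1)

def run_mssp(start, num_tasks):
    architected = start
    while num_tasks > 0:
        # Master races ahead, producing a chain of speculative checkpoints.
        checkpoints = [architected]
        for _ in range(num_tasks):
            checkpoints.append(distilled_task(checkpoints[-1]))
        # Slaves verify each task (conceptually in parallel) against the
        # next checkpoint; commit while they match.
        for i in range(num_tasks):
            correct = original_task(checkpoints[i])
            if correct != checkpoints[i + 1]:
                # Bad checkpoint detected: commit through task i, squash the
                # rest, and restart the Master from architected state.
                architected = correct
                num_tasks -= i + 1
                break
            architected = correct
        else:
            num_tasks = 0
    return architected

assert run_mssp(0, 5) == 5                # matches purely sequential execution
```

Despite the intentionally bad checkpoint, the final state equals sequential execution of the original program: the original tasks, not the distilled ones, define correctness.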

  24. MSSP Critical Path • If checkpoints are correct: the critical path runs through the distilled program, with no communication latency; verification happens in the background • If a checkpoint is bad: the critical path runs through the original program, plus interprocessor communication • If bad checkpoints are rare: performance of the distilled program, tolerant of communication latency

  25. Talk Outline • Removing predictability from programs: “Approximation” • Externally verifying distilled programs: Master/Slave Speculative Parallelization (MSSP) • Results • Summary

  26. Methodology • First-cut distiller: a static binary-to-binary translator • Simple control-flow approximations • DCE, inlining, register re-allocation, save/restore elimination, code layout… • HW model: 8-way CMP of 21264s • 10-cycle interconnect latency to shared L2 • SPEC2000 integer benchmarks on Alpha

  27. Results Summary • Distilled programs can be accurate: 1 task misspeculation per 10,000 instructions • Speedup depends on distillation: 1.25 harmonic mean, ranging from 1.0 to 1.7 (gcc, vortex), relative to uniprocessor execution • Modest storage requirements: tens of kB at the L2 for speculation buffering • Decent latency tolerance: increasing latency from 5 to 20 cycles costs only a 10% slowdown

  28. Distilled Program Accuracy • Average distance between task misspeculations: > 10,000 original program instructions

  29. Distillation Effectiveness • Ratio of instructions retired by the Master (distilled program) to instructions retired by the Slaves (original program, not counting nops) • Up to a two-thirds reduction

  30. Performance • (Figure plots accuracy, distillation effectiveness, and speedup per benchmark) • Performance scales with distillation effectiveness

  31. Related Work • Slipstream • Speculative Multithreading • Pre-execution • Feedback-directed Optimization • Dynamic Optimizers

  32. Summary • Don’t waste core on predictable things • “Distill” out predictability from programs • Verify predictions with original program • Split into tasks: parallel validation • Achieve the throughput to keep up • Has some nice attributes (ask offline) • Can support legacy binaries, latency tolerant, low verification cost, complements explicit parallelism
