
Branch-mispredict Level Parallelism for Control Independence


Presentation Transcript


  1. Branch-mispredict Level Parallelism for Control Independence • Kshitiz Malik, Mayank Agarwal, Sam Stone, Kevin Woley, Matthew Frank • Implicitly Parallel Architectures Group, University of Illinois at Urbana-Champaign

  2. Summary • Mispredicted branches: a major bottleneck to ILP • BLP: an application property that can help • Control-independence architectures can exploit BLP • Current policies are ill-suited to BLP • BLP-targeted policies lead to dramatic improvements

  3. Outline • The Branch Prediction Wall • Application BLP to Scale the Wall • Architectures to Exploit BLP • Maximizing BLP

  4. The Branch Prediction Wall • Successful branch prediction => high ILP • Parallel and out-of-order (OOO) execution across branches • Mispredicted branches are catastrophic • All instructions beyond a mispredict are squashed • Mispredicted branches are fetched and executed serially • (Figure: fetch timeline for mispredicted branches B1 and B2; the work fetched past each one is squashed when it resolves. Legend: useful fetch, mispredicted branch, wasted fetch.)

  5. Pushing the Wall Farther Out • Better branch predictors • Multipath execution • Predication • Parallel resolution of independent mispredicts • Exploit existence of BLP in the application

  6. Outline • The Branch Prediction Wall • Application BLP to Scale the Wall • Architectures to Exploit BLP • Maximizing BLP

  7. Branch-mispredict Level Parallelism (BLP) • Resolve multiple mispredicted branches in parallel • Overlap the penalty of individual mispredicts • Higher rate of mispredict resolution • Increased performance • (Figure: fetch timelines for mispredicts B1 and B2, contrasting a superscalar with BLP, where the two penalties overlap. Legend: useful fetch, mispredicted branch, wasted fetch.)

  8. Conditions for BLP to Exist • For parallel resolution, mispredicts need to be control-independent and data-independent • A and B don’t lead to BLP: B is control-dependent (CD) on A • A and E don’t lead to BLP: E is control-independent of A but data-dependent on it (CIDD) • A and G can lead to BLP: G is control-independent and data-independent (CIDI) • (Figure: control-flow and data-flow graph with nodes j++, mispredicted branch A, B, C, D: i++, E: if (i), F, G: if (j<10), and regions labeled CD, CIDD and CIDI.)
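
The three cases on this slide can be sketched as a small classifier. This is only an illustration under stated assumptions: the dictionaries are a hand-written, hypothetical encoding of the slide's example graph (in particular, they assume D: i++ sits in the region that is control-dependent on A, which is how E comes to depend on A), and `classify` and `can_overlap` are made-up helper names, not part of the presentation.

```python
# Hypothetical encoding of the slide's example graph; all names are illustrative.
# CONTROL_DEP[x] = set of branches that node x is control-dependent on (assumed).
CONTROL_DEP = {"B": {"A"}, "C": {"A"}, "D": {"A"}}
# Variables written and read by each node (assumed from the node labels).
WRITES = {"D": {"i"}}
READS = {"E": {"i"}, "G": {"j"}}


def classify(branch, later):
    """Classify `later` relative to an earlier mispredicted `branch`.

    Returns "CD" (control-dependent), "CIDD" (control-independent,
    data-dependent) or "CIDI" (control-independent, data-independent).
    """
    if branch in CONTROL_DEP.get(later, set()):
        return "CD"
    # Collect values produced inside the region control-dependent on `branch`.
    region_writes = set()
    for node, deps in CONTROL_DEP.items():
        if branch in deps:
            region_writes |= WRITES.get(node, set())
    if READS.get(later, set()) & region_writes:
        return "CIDD"
    return "CIDI"


def can_overlap(branch, later):
    """Only CIDI pairs can have their mispredicts resolved in parallel."""
    return classify(branch, later) == "CIDI"


for later in ("B", "E", "G"):
    print("A ->", later, classify("A", later), can_overlap("A", later))
```

Here j++ executes before A, so it is not in A's control-dependent region and G's read of j does not make G data-dependent on A.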

  9. CIDD Branches → No BLP • No overlap • (Figure: the example graph with nodes j++, A, B, C, D: i++, E: if (i), F, G: if (j<10) and regions CD, CIDD, CIDI, next to a fetch timeline for branches B, A, G, E. Legend: useful fetch, mispredicted branch, wasted fetch.)

  10. CIDI Branches → BLP • Overlap • (Figure: the example graph with nodes j++, A, B, C, D: i++, E: if (i), F, G: if (j<10) and regions CD, CIDD, CIDI, next to a fetch timeline for branches B, A, G, E. Legend: useful fetch, mispredicted branch, wasted fetch.)

  11. How to Measure BLP • Average number of simultaneous independent mispredicts • Counts unresolved branches (fetched but not completed) • Averaged only over the time when at least one is unresolved • Only branches that eventually retire • Superscalar: BLP is exactly 1 • Case 2: BLP is more than 1 (BLP = 1.5 in the figure) • (Figure: timelines for mispredicts A and B with overlapping penalties. Legend: mispredict penalty, mispredict fetched, correct-path fetch.)
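
The metric on this slide can be sketched as a time-weighted average. This is a minimal illustration, not the authors' measurement code: `blp` is a hypothetical helper, each mispredict is modeled as a `(fetch_cycle, resolve_cycle)` interval, and the intervals in the example are invented so that the overlapped case works out to the 1.5 shown on the slide.

```python
def blp(intervals):
    """Average number of unresolved mispredicts, measured only over the
    cycles in which at least one mispredict is unresolved.

    `intervals` holds (fetch_cycle, resolve_cycle) pairs for mispredicted
    branches that eventually retire (an assumed input format).
    """
    events = []
    for fetch, resolve in intervals:
        if resolve <= fetch:
            raise ValueError("a branch must resolve after it is fetched")
        events.append((fetch, 1))     # branch becomes unresolved
        events.append((resolve, -1))  # branch resolves
    events.sort()

    active = 0    # mispredicts currently unresolved
    busy = 0      # cycles with at least one unresolved mispredict
    weighted = 0  # sum over busy cycles of the unresolved count
    prev = None
    for cycle, delta in events:
        if active > 0:
            busy += cycle - prev
            weighted += active * (cycle - prev)
        active += delta
        prev = cycle
    return weighted / busy if busy else 0.0


# Serial resolution, as in a superscalar: one mispredict at a time.
print(blp([(0, 10), (10, 20)]))  # 1.0
# Two invented, partially overlapping mispredicts.
print(blp([(0, 12), (4, 16)]))   # 1.5
```

In the second case the two mispredicts are both unresolved for 8 of the 16 busy cycles, so the average is (4·1 + 8·2 + 4·1) / 16 = 1.5.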

  12. Outline • The Branch Prediction Wall • Application BLP to Scale the Wall • Architectures to Exploit BLP • Maximizing BLP

  13. Control Independence Architectures • Start (spawn) task at control-independent (CI) point • Execute concurrently with spawner task • Can potentially exploit BLP

  14. Control-Independent Spawning • E is control-independent of branch B • Spawn E as a new task at B • Execute concurrently • Data-dependent instructions are delayed until the previous thread completes • (Figure: nodes A, B, C, D, E, F, with a spawner task and a spawnee task beginning at E.)

  15. Control-Independent Spawning for BLP • (Figure: execution timelines on CPU1, CPU2 and CPU3 for nodes A to F, showing a task spawn at B, the spawned task starting at E, mispredict resolution, and reconnect. Legend: useful fetch, mispredicted branch, wasted fetch.)

  16. Targeting BLP in CI Architectures • CI architectures can exploit BLP • But conventional policies are ill-suited to BLP • Two policies target BLP and improve performance: • Spawn selection • Smarter data dependence handling

  17. Outline • The Branch Prediction Wall • Application BLP to Scale the Wall • Architectures to Exploit BLP • Maximizing BLP

  18. Where to Look for BLP? • Example: get_heap_head() from VPR Route • ~30% of instructions • ~40% of mispredicts • Limited BLP in innermost loop • Might need to look farther out for CIDI branches • (Figure: control-flow and data-flow graph of the example, with nodes heap_head = ..; if (...); inner loop; P: if (heap[i+1] < heap[i]); Z: i++; Q: if (heap[i] > val) swap (i, from); if (...); R: ret_val = heap_head; S: if(ret_val.flag = 0); T: if(ret_val.flag = 0); and regions labeled CIDD and CIDI. Legend: low-confidence branch, control flow, data flow.)

  19. Spawn Selection for BLP • (Figure: the get_heap_head() graph, with nodes heap_head = ..; if (...); inner loop; P: if (heap[i+1] < heap[i]); Z: i++; Q: if (heap[i] > val) swap (i, from); if (...); R: ret_val = heap_head; S: if(ret_val.flag = 0); T: if(ret_val.flag = 0); and regions labeled CIDD and CIDI. Legend: low-confidence branch, control flow, data flow.)

  20. Spawn Selection for BLP • High-BLP spawn R → high performance • Naïve policies will select Q • (Figure: the get_heap_head() graph, with nodes heap_head = ..; if (...); inner loop; P: if (heap[i+1] < heap[i]); Z: i++; Q: if (heap[i] > val) swap (i, from); if (...); R: ret_val = heap_head; S: if(ret_val.flag = 0); T: if(ret_val.flag = 0); and regions labeled CIDD and CIDI. Legend: low-confidence branch, control flow, data flow.)
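
The contrast between the two choices can be sketched as two selection functions over the same candidates. Everything here is hypothetical: the slides do not say what the naïve policy is, so `nearest_first` is only an assumed stand-in for it, and the distances and BLP estimates for Q and R are invented numbers, not measurements from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpawnCandidate:
    label: str
    distance: int         # instructions from the branch to the spawn point (invented)
    estimated_blp: float  # profile-style BLP estimate for this spawn (invented)


def nearest_first(candidates):
    """Assumed stand-in for a naive policy: take the closest spawn point."""
    return min(candidates, key=lambda c: c.distance)


def highest_blp(candidates):
    """BLP-targeted policy: take the spawn point with the most estimated BLP."""
    return max(candidates, key=lambda c: c.estimated_blp)


candidates = [
    SpawnCandidate("Q", distance=5, estimated_blp=1.0),   # inside the inner loop
    SpawnCandidate("R", distance=40, estimated_blp=1.6),  # farther out
]

print(nearest_first(candidates).label)  # Q
print(highest_blp(candidates).label)    # R
```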

  21. Exploiting Choice in Spawn Selection • BLP is a good indicator of performance • An effective strategy for task selection

  22. Balanced Dependence Handling • Dependence handling impacts exploitable BLP • Conservative handling => CIDI mispredicts are marked CIDD • Blind speculation => wasted work from misspeculation • Balanced approach • Adapts dynamically • Large improvements in BLP
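
The slides do not describe how the balanced approach adapts, so the following is only one possible sketch of the idea, using a per-dependence saturating counter (a common hardware technique swapped in here as an assumption). The class name, the threshold and the keys are all hypothetical.

```python
class DependencePredictor:
    """Illustrative adaptive policy: speculate by default, wait once a
    dependence has proven real often enough. Not the authors' mechanism."""

    def __init__(self, threshold=2, maximum=3):
        self.threshold = threshold  # counter value at which we start waiting
        self.maximum = maximum      # saturation limit
        self.counters = {}          # one counter per dependence key

    def should_wait(self, dep):
        """True: treat as data-dependent (conservative). False: speculate."""
        return self.counters.get(dep, 0) >= self.threshold

    def update(self, dep, was_real):
        """Train on whether the dependence turned out to be real."""
        count = self.counters.get(dep, 0)
        if was_real:
            self.counters[dep] = min(count + 1, self.maximum)
        else:
            self.counters[dep] = max(count - 1, 0)


predictor = DependencePredictor()
key = ("store@D", "load@E")            # hypothetical dependence identifier
print(predictor.should_wait(key))      # False: speculate at first
predictor.update(key, was_real=True)
predictor.update(key, was_real=True)
print(predictor.should_wait(key))      # True: two real dependences seen
predictor.update(key, was_real=False)
print(predictor.should_wait(key))      # False: confidence decayed
```

A fixed always-wait policy corresponds to the conservative extreme on the slide, and a fixed never-wait policy to blind speculation; the counter moves between them per dependence.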

  23. BLP → Performance • 4-core setup • Baseline 4-wide OOO core • Aggressive branch predictor • Aggressive backend • Exploiting BLP is an effective heuristic to improve performance

  24. Conclusions • Branch prediction is a major bottleneck to ILP • Applications possess significant amounts of “Branch-Mispredict Level Parallelism (BLP)” • Control-independence architectures can exploit BLP • But current policies are ill-suited to BLP • Two techniques dramatically increase exploited BLP • Spawn selection and balanced dependence handling • High BLP exploited => high performance

  25. Thank You Questions?
