Using Criticality to Attack Performance Bottlenecks

Presentation Transcript


  1. Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

  2. Bottleneck Analysis Bottleneck Analysis: Determining the performance effect of an event on execution time • An event could be: • an instruction’s execution • an instruction-window-full stall • a branch mispredict • a network request • inter-processor communication • etc.

  3. Why is Bottleneck Analysis Important?

  4. Bottleneck Analysis Applications • Run-time Optimization • Resource arbitration • e.g., how to schedule memory accesses? • Effective speculation • e.g., which branches to predicate? • Dynamic reconfiguration • e.g., when to enable hyperthreading? • Energy efficiency • e.g., when to throttle frequency? • Design Decisions • Overcoming technology constraints • e.g., how to mitigate the effect of long wire latencies? • Programmer Performance Tuning • Where have the cycles gone? • e.g., which cache misses should be prefetched?

  5. Why is Bottleneck Analysis Hard?

  6. Current state of the art: event counts. Exe. time = (CPU cycles + Mem. cycles) * Clock cycle time, where Mem. cycles = Number of cache misses * Miss penalty [figure: miss1 (100 cycles) and miss2 (100 cycles) overlap: 2 misses but only 1 miss penalty]
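
To see why event counts mislead, here is a minimal Python sketch (with invented numbers) contrasting the naive model, which charges a full penalty per miss, with the actual stall time when misses overlap:

```python
MISS_PENALTY = 100  # cycles (assumed for illustration)

def naive_memory_cycles(num_misses):
    """Event-count model: Mem. cycles = number of misses * miss penalty."""
    return num_misses * MISS_PENALTY

def actual_memory_cycles(miss_intervals):
    """Cycles covered by at least one outstanding miss: merge the
    overlapping [start, end) intervals and sum their total length."""
    covered, last_end = 0, 0
    for start, end in sorted(miss_intervals):
        start = max(start, last_end)
        if end > start:
            covered += end - start
            last_end = end
    return covered

# Two misses issued one cycle apart, each taking 100 cycles:
misses = [(0, 100), (1, 101)]
print(naive_memory_cycles(len(misses)))  # 200: the model's estimate
print(actual_memory_cycles(misses))      # 101: about one miss penalty
```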

  7. Parallelism Parallelism in systems complicates performance understanding • Two parallel cache misses • Two parallel threads • A branch mispredict and full-store-buffer stall occur in the same cycle that three loads are waiting on the memory system and two floating-point multiplies are executing

  8. Criticality Challenges • Cost • How much speedup possible from optimizing an event? • Slack • How much can an event be “slowed down” before increasing execution time? • Interactions • When do multiple events need to be optimized simultaneously? • When do we have a choice? • Exploit in Hardware

  9. Our Approach

  10. Our Approach: Criticality Critical events affect execution time, non-critical do not Bottleneck Analysis: Determining the performance effect of an event on execution time

  11. Defining criticality Need Performance Sensitivity • slowing down a “critical” event should slow down the entire program • speeding up a “noncritical” event should leave execution time unchanged

  12. Standard Waterfall Diagram

  13. Annotated with Dependence Edges [figure: the waterfall diagram annotated with a branch mispredict (MISP) edge]

  14. Annotated with Dependence Edges [figure: the waterfall diagram with edges labeled Fetch BW, Data Dep, ROB, and Branch Misp.]

  15. Edge Weights Added [figure: the annotated diagram with a cycle weight (0 to 3) on each edge]

  16. Convert to Graph [figure: dependence graph with one F (fetch), E (execute), and C (commit) node per instruction, connected by weighted edges]

  17. Convert to Graph [figure: the completed dependence graph with all edge weights filled in]
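
As a concrete sketch of the conversion (not the authors' tool), the Python below builds one F/E/C node per dynamic instruction and connects them with weighted edges; all parameter names and latencies are illustrative, and the mispredict and finite-window edges (the "hidden" constraints of slide 19) are omitted for brevity:

```python
from collections import defaultdict

def convert_to_graph(n, exec_lat, data_deps, fetch_bw=1, commit_bw=1):
    """Build {node: [(successor, weight), ...]} for n instructions.
    exec_lat[i] is instruction i's execution latency; data_deps holds
    (producer, consumer, latency) tuples for E->E dependence edges."""
    g = defaultdict(list)
    for i in range(n):
        g[('F', i)].append((('E', i), 1))             # dispatch
        g[('E', i)].append((('C', i), exec_lat[i]))   # execute, then commit
        if i + 1 < n:
            g[('F', i)].append((('F', i + 1), fetch_bw))   # in-order fetch
            g[('C', i)].append((('C', i + 1), commit_bw))  # in-order commit
    for prod, cons, lat in data_deps:
        g[('E', prod)].append((('E', cons), lat))     # data dependence
    return dict(g)

# Five instructions; instruction 2 consumes instruction 0's 3-cycle result:
g = convert_to_graph(5, exec_lat=[3, 1, 1, 1, 1], data_deps=[(0, 2, 3)])
```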

  18. Smaller graph instance [figure: five-instruction F/E/C graph; a 10-cycle icache-miss edge is critical, but how costly? Another edge is non-critical, but how much slack?]

  19. Add “hidden” constraints [figure: the same five-instruction graph with additional machine-constraint edges added; again, the icache miss is critical, but how costly? The other edge is non-critical, but how much slack?]

  20. Add “hidden” constraints: Cost = 13 – 7 = 6 cycles; Slack = 13 – 7 = 6 cycles [figure: the same graph; both quantities fall out as the difference between two path lengths, 13 and 7]
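
The sketch below, a minimal Python version with an invented toy graph, computes both quantities as differences of longest-path lengths over the adjacency-list graph; the function names and numbers are assumptions, not the paper's API:

```python
import functools

def longest_path(edges, src, dst):
    """Length of the longest src->dst path in a DAG {u: [(v, w), ...]}."""
    @functools.cache
    def dist(u):
        if u == dst:
            return 0
        return max((w + dist(v) for v, w in edges.get(u, ())),
                   default=float('-inf'))
    return dist(src)

def reweighted(edges, edge, new_w):
    """Copy of the graph with one edge's weight replaced by new_w."""
    return {u: [(v, new_w if (u, v) == edge else w) for v, w in vs]
            for u, vs in edges.items()}

def cost(edges, edge, src, dst):
    """Cycles saved if this edge's latency were idealized to zero."""
    return (longest_path(edges, src, dst)
            - longest_path(reweighted(edges, edge, 0), src, dst))

def slack(edges, edge, src, dst):
    """Cycles this edge can be delayed before execution time grows:
    CP length minus the longest path forced through this edge."""
    u, v = edge
    w = next(wt for x, wt in edges[u] if x == v)
    through = longest_path(edges, src, u) + w + longest_path(edges, v, dst)
    return longest_path(edges, src, dst) - through

# Toy DAG: a 10-cycle "miss" edge A->B vs. a cheaper A->C->D route.
g = {'A': [('B', 10), ('C', 1)], 'B': [('D', 1)], 'C': [('D', 3)], 'D': []}
print(cost(g, ('A', 'B'), 'A', 'D'))   # 11 - 4 = 7 cycles
print(slack(g, ('A', 'C'), 'A', 'D'))  # 11 - (0 + 1 + 3) = 7 cycles
```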

  21. Slack “sharing” [figure: the same graph; two edges each have slack = 6 cycles, but the slack is shared: we can delay one edge by 6 cycles, but not both!]

  22. Machine Imbalance [chart: global vs. apportioned slack; ~80% of instructions have at least 5 cycles of apportioned slack]

  23. Criticality Challenges • Cost • How much speedup possible from optimizing an event? • Slack • How much can an event be “slowed down” before increasing execution time? • Interactions • When do multiple events need to be optimized simultaneously? • When do we have a choice? • Exploit in Hardware

  24. Simple criticality not always enough. Sometimes events have nearly equal criticality, e.g., miss #1 (99 cycles) in parallel with miss #2 (100 cycles). Want to know: • how critical is each event? • how far from critical is each event? Actually, even that is not enough

  25. Our solution: measure interactions. Two parallel cache misses: miss #1 (99 cycles) and miss #2 (100 cycles). Cost(miss #1) = 0; Cost(miss #2) = 1; Cost({miss #1, miss #2}) = 100. Aggregate cost (100) > sum of individual costs (0 + 1): a parallel interaction. icost = aggregate cost – sum of individual costs = 100 – 0 – 1 = 99
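
A one-line function makes the definition concrete; the numbers below are the slide's own measurements, plugged in rather than re-derived:

```python
def icost(aggregate_cost, individual_costs):
    """icost = aggregate cost - sum of individual costs."""
    return aggregate_cost - sum(individual_costs)

# This slide's parallel misses: Cost(miss #1) = 0, Cost(miss #2) = 1,
# Cost({miss #1, miss #2}) = 100.
print(icost(100, [0, 1]))  # 99: positive, a parallel interaction
```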

  26. Interaction cost (icost): icost = aggregate cost – sum of individual costs • Positive icost ⇒ parallel interaction (miss #1 alongside miss #2) • Zero icost ⇒ ?

  27. Interaction cost (icost): icost = aggregate cost – sum of individual costs • Positive icost ⇒ parallel interaction (miss #1 alongside miss #2) • Zero icost ⇒ independent (miss #1 and miss #2 do not overlap) • Negative icost ⇒ ?

  28. Negative icost. Two serial cache misses (data dependent): miss #1 (100 cycles) feeding miss #2 (100 cycles), in parallel with 110 cycles of ALU latency. Cost(miss #1) = ?

  29. Negative icost. Two serial cache misses (data dependent): miss #1 (100 cycles) feeding miss #2 (100 cycles), in parallel with 110 cycles of ALU latency. Removing either miss (or both) leaves the 110-cycle ALU chain as the longest path, so each case saves 200 – 110 = 90 cycles: Cost(miss #1) = 90; Cost(miss #2) = 90; Cost({miss #1, miss #2}) = 90. icost = aggregate cost – sum of individual costs = 90 – 90 – 90 = -90. Negative icost ⇒ serial interaction
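
As a check on the arithmetic, a minimal sketch (invented function names; execution time reduced to the max of the two competing paths) reproduces the slide's numbers:

```python
ALU_CHAIN = 110  # cycles of ALU latency in parallel with the misses

def exec_time(m1, m2):
    """Longer of the dependent miss chain and the parallel ALU chain."""
    return max(m1 + m2, ALU_CHAIN)

base = exec_time(100, 100)               # 200 cycles
cost_m1 = base - exec_time(0, 100)       # 200 - 110 = 90
cost_m2 = base - exec_time(100, 0)       # 200 - 110 = 90
cost_both = base - exec_time(0, 0)       # 200 - 110 = 90
print(cost_both - (cost_m1 + cost_m2))   # icost = 90 - 180 = -90
```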

  30. Interaction cost (icost): icost = aggregate cost – sum of individual costs • Positive icost ⇒ parallel interaction • Zero icost ⇒ independent • Negative icost ⇒ serial interaction [figure: example event pairs for each case, drawn from a branch mispredict, a load-replay trap, fetch BW, an LSQ stall, cache misses, and ALU latency]

  31. Why care about serial interactions? (miss #1 (100 cycles) feeding miss #2 (100 cycles), in parallel with 110 cycles of ALU latency) • Reason #1: We are over-optimizing! Prefetching miss #2 doesn’t help if miss #1 is already prefetched (but the overhead still costs us) • Reason #2: We have a choice of what to optimize: prefetching miss #2 has the same effect as prefetching miss #1

  32. Icost Case Study: Deep pipelines. Looking for serial interactions with the level-1 data cache (DL1)!

  33. Icost Breakdown (6 wide, 64-entry window) [chart; slides 34-36 step through the same breakdown]

  37. Icost Case Study: Deep pipelines [figure: F/E/C dependence graph for instructions i1-i6 with edge weights, a DL1 access, and the window edge marked; slides 38-42 step through the same graph]

  43. Criticality Challenges • Cost • How much speedup possible from optimizing an event? • Slack • How much can an event be “slowed down” before increasing execution time? • Interactions • When do multiple events need to be optimized simultaneously? • When do we have a choice? • Exploit in Hardware

  44. Exploit in Hardware • Criticality Analyzer: online, fast feedback; limited to critical/not critical • Replacement for Performance Counters: requires offline analysis; constructs the entire graph

  45. Only last-arriving edges can be critical [figure: an add R1 + R2 → R3] • Observation: if the dependence into R2 is on the critical path, then the value of R2 arrived last (the R1 dependence resolved early) • critical ⇒ arrives last, but arriving last does not imply critical

  46. Determining last-arrive edges: observe events within the machine • last_arrive[F] = E→F if branch mispredict; C→F if ROB stall; F→F otherwise • last_arrive[E] = F→E if data ready on fetch; E→E otherwise (observe the arrival order of operands) • last_arrive[C] = E→C if the commit pointer is delayed; C→C otherwise
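
The rules above can be written as small pure functions. This is a minimal sketch; the flag names are invented for illustration, while the hardware observes these conditions directly:

```python
def last_arrive_F(branch_mispredicted, rob_stalled):
    """Which edge last constrained this instruction's fetch (F)."""
    if branch_mispredicted:
        return 'E->F'  # fetch was redirected by the mispredicted branch
    if rob_stalled:
        return 'C->F'  # fetch waited for a reorder-buffer entry to commit
    return 'F->F'      # otherwise, in-order fetch bandwidth

def last_arrive_E(data_ready_on_fetch):
    """Which edge last constrained execution (E); when operands arrive
    later, observe their arrival order: the last producer gives E->E."""
    return 'F->E' if data_ready_on_fetch else 'E->E'

def last_arrive_C(commit_pointer_delayed):
    """Which edge last constrained commit (C)."""
    return 'E->C' if commit_pointer_delayed else 'C->C'
```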

  47. The last-arrive rule: the critical path (CP) consists only of “last-arrive” edges [figure: F/E/C graph with its last-arrive edges highlighted]

  48. Prune the graph: only last-arrive edges need to go in the graph; no other edges could be on the CP [figure: the pruned F/E/C graph, newest node at the right]

  49. …and we’ve found the critical path! Backward-propagate along last-arrive edges, starting from the newest node • Found the CP by only observing last-arrive edges • but this still requires constructing the entire graph
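
A minimal sketch of the backward walk, assuming the last-arrive edge of every node has been recorded in a map (the toy data below is invented):

```python
def critical_path(last_arrive, newest):
    """Walk backward from the newest node along recorded last-arrive
    edges; reversing the walk yields the critical path."""
    path = [newest]
    while path[-1] in last_arrive:  # stop at the oldest fetch node
        path.append(last_arrive[path[-1]])
    path.reverse()
    return path

# Three-instruction toy: i1's result feeds i2, which delays i2's commit.
la = {
    ('C', 2): ('E', 2), ('E', 2): ('E', 1),
    ('E', 1): ('F', 1), ('F', 1): ('F', 0),
}
print(critical_path(la, ('C', 2)))
# [('F', 0), ('F', 1), ('E', 1), ('E', 2), ('C', 2)]
```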

  50. Step 2. Reducing storage requirements. The CP is a “long” chain of last-arrive edges • the longer a given chain of last-arrive edges, the more likely it is part of the CP Algorithm: find sufficiently long last-arrive chains • Plant a token at a node n • Propagate it forward, only along last-arrive edges • Check for the token after several hundred cycles • If the token is still alive, n is assumed critical
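
A minimal sketch of the token heuristic, assuming a forward map from each node to the nodes whose last-arrive edge leaves it (names and data invented):

```python
def token_alive(last_arrive_succs, start, horizon):
    """Plant a token at `start` and push it forward only along
    last-arrive edges; it dies when no successor carries it on."""
    frontier = {start}
    for _ in range(horizon):
        frontier = {v for u in frontier
                    for v in last_arrive_succs.get(u, ())}
        if not frontier:
            return False  # token died: assume the node is non-critical
    return True           # token survived: assume the node is critical

succs = {'a': ['b'], 'b': ['c'], 'c': []}
print(token_alive(succs, 'a', horizon=2))  # True: chain reaches 'c'
print(token_alive(succs, 'a', horizon=3))  # False: chain ends at 'c'
```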
