Using Criticality to Attack Performance Bottlenecks

Brian Fields

UC-Berkeley

(Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Bottleneck Analysis

Bottleneck Analysis: Determining the performance effect of an event on execution time

  • An event could be:
    • an instruction’s execution
    • an instruction-window-full stall
    • a branch mispredict
    • a network request
    • inter-processor communication
    • etc.
Bottleneck Analysis Applications
  • Run-time Optimization
    • Resource arbitration
      • e.g., how to schedule memory accesses?
    • Effective speculation
      • e.g., which branches to predicate?
    • Dynamic reconfiguration
      • e.g., when to enable hyperthreading?
    • Energy efficiency
      • e.g., when to throttle frequency?
  • Design Decisions
    • Overcoming technology constraints
      • e.g., how to mitigate effect of long wire latencies?
  • Programmer Performance Tuning
    • Where have the cycles gone?
      • e.g., which cache misses should be prefetched?
Current state-of-art

Event counts:

Exe. time = (CPU cycles + Mem. cycles) * Clock cycle time

where:

Mem. cycles = Number of cache misses * Miss penalty

[figure: two overlapping misses, miss1 (100 cycles) and miss2 (100 cycles): 2 misses but only 1 miss penalty]
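The additive event-count model above can be made concrete with a small sketch (illustrative numbers only, not from the talk): when two misses overlap in an out-of-order machine, charging a full penalty per miss overstates memory time.

```python
# Illustrative sketch: the additive event-count model vs. overlapped misses.
# All numbers are made up.

MISS_PENALTY = 100  # cycles

def additive_model(cpu_cycles, n_misses):
    # Exe. time = CPU cycles + Mem. cycles; Mem. cycles = misses * penalty
    return cpu_cycles + n_misses * MISS_PENALTY

cpu_cycles = 500
predicted = additive_model(cpu_cycles, 2)  # model charges two full penalties
actual = cpu_cycles + MISS_PENALTY         # overlapped misses pay only one

print(predicted, actual)  # 700 600
```

The 100-cycle gap is exactly the "2 misses but only 1 miss penalty" effect the slide points at.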

Parallelism

Parallelism in systems complicates performance understanding

  • Two parallel cache misses
  • Two parallel threads
  • A branch mispredict and full-store-buffer stall occur in the same cycle that three loads are waiting on the memory system and two floating-point multiplies are executing
Criticality Challenges
  • Cost
    • How much speedup possible from optimizing an event?
  • Slack
    • How much can an event be “slowed down” before increasing execution time?
  • Interactions
    • When do multiple events need to be optimized simultaneously?
    • When do we have a choice?
  • Exploit in Hardware
Our Approach: Criticality

Critical events affect execution time; non-critical events do not.

Bottleneck Analysis: Determining the performance effect of an event on execution time

Defining criticality

Need Performance Sensitivity

  • slowing down a “critical” event should slow down the entire program
  • speeding up a “noncritical” event should leave execution time unchanged
Annotated with Dependence Edges

[figure: dynamic instruction stream annotated with dependence edges: Fetch BW, Data Dep, ROB, Branch Misp.]

Edge Weights Added

[figure: the same dependence graph with latency weights (0 to 3 cycles) on each edge]
Convert to Graph

[figure: pipeline events mapped onto a graph of F (fetch), E (execute), C (commit) nodes per instruction, with weighted dependence edges]
Smaller graph instance

[figure: small F/E/C graph with a 10-cycle edge marked "Critical Icache miss, but how costly?" and another edge marked "Non-critical, but how much slack?"]

Add “hidden” constraints

[figure: the same graph with the hidden machine constraints added as extra edges, so the cost of the critical Icache miss and the slack of the non-critical edge can be computed]
Add “hidden” constraints

Cost = 13 – 7 = 6 cycles

Slack = 13 – 7 = 6 cycles

[figure: with the hidden constraints in place, removing the Icache miss shortens the critical path from 13 to 7 cycles (its cost), and the non-critical edge can be delayed 6 cycles before lengthening the path (its slack)]
Slack “sharing”

Slack = 6 cycles

Can delay one edge by 6 cycles, but not both!

[figure: two non-critical edges share the same 6 cycles of slack]
Criticality Challenges
  • Cost
    • How much speedup possible from optimizing an event?
  • Slack
    • How much can an event be “slowed down” before increasing execution time?
  • Interactions
    • When do multiple events need to be optimized simultaneously?
    • When do we have a choice?
  • Exploit in Hardware
Simple criticality not always enough

Sometimes events have nearly equal criticality

miss #1 (99)

miss #2 (100)

Want to know

  • how critical is each event?
  • how far from critical is each event?

Actually, even that is not enough

Our solution: measure interactions

Two parallel cache misses

miss #1 (99)

miss #2 (100)

Cost(miss #1) = 0

Cost(miss #2) = 1

Cost({miss #1, miss #2}) = 100

Aggregate cost (100) > Sum of individual costs (0 + 1) ⇒ Parallel interaction

icost = aggregate cost – sum of individual costs = 100 – 0 – 1 = 99
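The icost definition is simple enough to state as code. A minimal sketch (the cost values are the ones from this slide and from the later serial-misses slide):

```python
# Minimal sketch of the interaction-cost (icost) definition.
# Cost values come from the slides' two worked examples.

def icost(aggregate_cost, individual_costs):
    """icost = aggregate cost - sum of individual costs."""
    return aggregate_cost - sum(individual_costs)

# Two parallel misses: removing either alone barely helps,
# removing both together saves the full 100-cycle penalty.
parallel = icost(100, [0, 1])    # 99  -> positive: parallel interaction

# Two data-dependent misses hidden behind a 110-cycle ALU chain:
serial = icost(90, [90, 90])     # -90 -> negative: serial interaction

print(parallel, serial)  # 99 -90
```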

Interaction cost (icost)

icost = aggregate cost – sum of individual costs

  • Positive icost ⇒ parallel interaction (miss #1 ∥ miss #2)
  • Zero icost ⇒ ?
Interaction cost (icost)

icost = aggregate cost – sum of individual costs

  • Positive icost ⇒ parallel interaction (miss #1 ∥ miss #2)
  • Zero icost ⇒ independent (miss #1 . . . miss #2)
  • Negative icost ⇒ ?
Negative icost

Two serial cache misses (data dependent)

miss #1 (100)

miss #2 (100)

ALU latency (110 cycles)

Cost(miss #1) = ?

Negative icost

Two serial cache misses (data dependent)

miss #1 (100)

miss #2 (100)

ALU latency (110 cycles)

Cost(miss #1) = 90

Cost(miss #2) = 90

Cost({miss #1, miss #2}) = 90

icost = aggregate cost – sum of individual costs = 90 – 90 – 90 = -90

Negative icost ⇒ serial interaction

Interaction cost (icost)

icost = aggregate cost – sum of individual costs

  • Positive icost ⇒ parallel interaction (e.g., miss #1 ∥ miss #2, branch mispredict ∥ load-replay trap, fetch BW ∥ LSQ stall)
  • Zero icost ⇒ independent (miss #1 . . . miss #2)
  • Negative icost ⇒ serial interaction (e.g., two misses on one ALU-latency chain)

Why care about serial interactions?

Reason #1: We are over-optimizing! Prefetching miss #2 doesn’t help if miss #1 is already prefetched (but the overhead still costs us).

Reason #2: We have a choice of what to optimize. Prefetching miss #2 has the same effect as prefetching miss #1.

[figure: miss #1 (100) and miss #2 (100) in series along an ALU-latency chain (110 cycles)]

Icost Case Study: Deep pipelines

Dcache (DL1)

Looking for serial interactions!

Icost Case Study: Deep pipelines

[figure: F/E/C graph for instructions i1 through i6 with DL1 access latencies, fetch-bandwidth edges, and a window edge; edge weights range from 0 to 18 cycles]

Criticality Challenges
  • Cost
    • How much speedup possible from optimizing an event?
  • Slack
    • How much can an event be “slowed down” before increasing execution time?
  • Interactions
    • When do multiple events need to be optimized simultaneously?
    • When do we have a choice?
  • Exploit in Hardware
Exploit in Hardware
  • Criticality Analyzer
      • Online, fast-feedback
      • Limited to critical/not critical
  • Replacement for Performance Counters
      • Requires offline analysis
      • Constructs entire graph
Only last-arriving edges can be critical

R1 ← R2 + R3

  • Observation: If the dependence into R2 is on the critical path, then the value of R2 arrived last.

critical ⇒ arrives last

arrives last ⇏ critical

[figure: E node with incoming edges from R2 and R3; the R3 dependence resolved early]
Determining last-arrive edges

Observe events within the machine:

last_arrive[F] =

  • E→F if branch misp.
  • C→F if ROB stall
  • F→F otherwise

last_arrive[E] =

  • F→E if data ready on fetch
  • E→E otherwise (observe arrival order of operands)

last_arrive[C] =

  • E→C if commit pointer is delayed
  • C→C otherwise
Last-arrive edges

The last-arrive rule: the CP consists only of “last-arrive” edges

[figure: F/E/C graph with only the last-arrive edges highlighted]
Prune the graph

Only the last-arrive edges need to be put in the graph; no other edges could be on the CP.

[figure: pruned F/E/C graph ending at the newest instruction]
…and we’ve found the critical path!

Backward propagate along last-arrive edges

  • Found CP by only observing last-arrive edges
  • but still requires constructing the entire graph

[figure: backward walk from the newest node along last-arrive edges]
Step 2. Reducing storage reqs

CP is a “long” chain of last-arrive edges.

  • the longer a given chain of last-arrive edges, the more likely it is part of the CP

Algorithm: find sufficiently long last-arrive chains

  • Plant token into a node n
  • Propagate forward, only along last-arrive edges
  • Check for token after several hundred cycles
  • If token alive, n is assumed critical
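The algorithm above can be sketched in software (an assumed adjacency-map encoding of last-arrive edges, not the hardware design):

```python
# Sketch of the token-passing heuristic. `last_arrive` maps each node to
# the node it feeds via a last-arrive edge; a token planted at n survives
# only while it keeps moving along the chain.

def token_survives(last_arrive, n, horizon):
    """Plant a token at node n, propagate it forward along last-arrive
    edges, and report whether it is still alive after `horizon` hops."""
    node = n
    for _ in range(horizon):
        node = last_arrive.get(node)
        if node is None:          # token "dies": n's chain ended early
            return False
    return True                   # chain long enough -> assume n critical

# Toy chain: 0 -> 1 -> 2 -> 3; node 9 feeds nothing.
chain = {0: 1, 1: 2, 2: 3}
print(token_survives(chain, 0, 3))  # True
print(token_survives(chain, 9, 3))  # False
```

In hardware the "horizon" corresponds to checking for the token after several hundred cycles rather than counting hops explicitly.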
Online Criticality Detection

Forward propagate token

[figure: a token planted at an F node propagates forward along last-arrive edges toward the newest instructions]
Online Criticality Detection

[figure: tokens “die” when their chain of last-arrive edges ends]
Online Criticality Detection

[figure: the token survives the propagation window, so the planted node is assumed critical]
Putting it all together

[figure: on the training path, the OOO core feeds last-arrive edges (producer → retired instr) to the token-passing analyzer; on the prediction path, a PC-indexed prediction table answers “E-critical?”]
Results
  • Performance (Speed)
    • Scheduling in clustered machines
      • 10% speedup
    • Selective value prediction
    • Deferred scheduling (Crowe, et al)
      • 11% speedup
    • Heterogeneous cache (Rakvic, et al.)
      • 17% speedup
  • Energy
    • Non-uniform machine: fast and slow pipelines
      • ~25% less energy
    • Instruction queue resizing (Sasanka, et al.)
    • Multiple frequency scaling (Semeraro, et al.)
      • 19% less energy with 3% less performance
    • Selective pre-execution (Petric, et al.)
Exploit in Hardware
  • Criticality Analyzer
      • Online, fast-feedback
      • Limited to critical/not critical
  • Replacement for Performance Counters
      • Requires offline analysis
      • Constructs entire graph
Profiling goal

Goal:

  • Construct graph

many dynamic instructions

Constraint:

  • Can only sample sparsely
Profiling goal: a genome-sequencing analogy

Goal:

  • Construct graph (the “DNA strand”: many dynamic instructions)

Constraint:

  • Can only sample sparsely
“Shotgun” genome sequencing

Find overlaps among samples

[figure: many short random reads of the DNA are assembled by overlapping them]
Mapping “shotgun” to our situation

[figure: sparse sampled graph fragments over many dynamic instructions, each labeled with its context (Icache miss, Dcache miss, Branch misp., or no event), are matched up to assemble the full graph]
Conclusion: Grand Challenges

(addressed by: modeling, the token-passing analyzer, shotgun profiling, parallel interactions, serial interactions)
  • Cost
    • How much speedup possible from optimizing an event?
  • Slack
    • How much can an event be “slowed down” before increasing execution time?
  • Interactions
    • When do multiple events need to be optimized simultaneously?
    • When do we have a choice?
Conclusion: Bottleneck Analysis Applications

(examples: selective value prediction; scheduling and steering in clustered processors; resizing the instruction window; non-uniform machines; coping with a high-latency dcache; measuring the cost of cache misses/branch mispredicts)
  • Run-time Optimization
    • Effective speculation
    • Resource arbitration
    • Dynamic reconfiguration
    • Energy efficiency
  • Design Decisions
    • Overcoming technology constraints
  • Programmer Performance Tuning
    • Where have the cycles gone?
Outline

Simple Criticality

  • Definition (ISCA ’01)
  • Detection (ISCA ’01)
  • Application (ISCA ’01-’02)

Advanced Criticality

  • Interpretation (MICRO ’03)
    • What types of interactions are possible?
  • Hardware Support (MICRO ’03, TACO ’04)
    • Enhancement to performance counters
Outline

Simple Criticality

  • Definition (ISCA ’01)
  • Detection (ISCA ’01)
  • Application (ISCA ’01-’02)

Advanced Criticality

  • Interpretation (MICRO ’03)
    • What types of interactions are possible?
  • Hardware Support (MICRO ’03, TACO ’04)
    • Enhancement to performance counters
Criticality Prior Work

Critical-Path Method, PERT charts

  • Developed for the Navy’s “Polaris” project, 1957
  • Used as a project management tool
  • Simple critical-path, slack concepts

“Attribution” Heuristics

  • Rosenblum et al.: SOSP-1995, and many others
  • Marks instruction at head of ROB as critical, etc.
  • Empirically, has limited accuracy
  • Does not account for interactions between events
Related Work: Microprocessor Criticality

Latency tolerance analysis

  • Srinivasan and Lebeck: MICRO-1998

Heuristics-driven criticality predictors

  • Tune et al.: HPCA-2001
  • Srinivasan et al.: ISCA-2001

“Local” slack detector

  • Casmira and Grunwald: Kool Chips Workshop-2000

ProfileMe with pair-wise sampling

  • Dean, et al.: MICRO-1997
Alternative I: Addressing Unresolved Issues

Modeling and Measurement

  • What resources can we model effectively?
    • difficulty with mutual-exclusion-type resources (ALUs)
  • Efficient algorithms
  • Release tool for measuring cost/slack

Hardware

  • Detailed design for criticality analyzer
  • Shotgun profiler simplifications
    • gradual path from counters

Optimization

  • explore heuristics for exploiting interactions
Alternative II: Chip-Multiprocessors
  • Programmer Performance Tuning
    • Parallelizing applications
      • What makes a good division into threads?
      • How can we find them automatically, or at least help programmers to find them?
  • Design Decisions
    • Should each core support out-of-order execution?
    • Should SMT be supported?
    • How many processors are useful?
    • What is the effect of inter-processor latency?
Unresolved issues

Modeling and Measurement

  • What resources can we model effectively?
    • difficulty with mutual-exclusion-type resources (ALUs)
      • In other words, unanticipated side effects

[figure: Original Execution vs. Altered Execution (computing the cost of inst #3’s cache miss) for the sequence
1. ld r2, [Mem] (cache miss)
2. add r3 ← r2 + 1
3. ld r4, [Mem] (cache miss; a hit in the altered run)
4. add r6 ← r4 + 1
Altering the miss changes the adder contention, leaving a contention edge that “should not be here” and an incorrect critical path.]
Unresolved issues

Modeling and Measurement (cont.)

  • How should processor policies be modeled?
    • relationship to icost definition
  • Efficient algorithms for measuring icosts
    • pairs of events, etc.
  • Release tool for measuring cost/slack
Unresolved issues

Hardware

  • Detailed design for criticality analyzer
    • help to convince industry-types to build it
  • Shotgun profiler simplifications
    • gradual path from counters

Optimization

  • Explore icost optimization heuristics
    • icosts are difficult to interpret
Validation: can we trust our model?

Run two simulations:

  • Reduce CP latencies ⇒ expect “big” speedup
  • Reduce non-CP latencies ⇒ expect no speedup
Validation

Two steps:

  • Increase latencies of insts. by their apportioned slack
    • for three apportioning strategies:

1) latency+1, 2) 5-cycles to as many instructions as possible, 3) 12-cycles to as many loads as possible

  • Compare to baseline (no delays inserted)
Three slack variants

Local slack: # cycles latency can be increased without delaying any subsequent instructions

Global slack: # cycles latency can be increased without delaying the last instruction in the program

Apportioned slack: Distribute global slack among instructions using an apportioning strategy
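The global-slack variant can be computed on a toy dependence graph with a forward longest-path pass and a backward pass (a sketch with made-up latencies, not the paper's full machine model):

```python
# Sketch: global slack on a small weighted DAG. Nodes are instructions,
# edge weights are latencies, `order` is a topological order.

def longest_from_start(succ, weight, order):
    t = {n: 0 for n in order}
    for u in order:
        for v in succ.get(u, []):
            t[v] = max(t[v], t[u] + weight[(u, v)])
    return t

def global_slack(succ, weight, order):
    """Cycles each node can slip without delaying the last instruction."""
    earliest = longest_from_start(succ, weight, order)
    finish = max(earliest.values())
    latest = {n: finish for n in order}
    for u in reversed(order):
        for v in succ.get(u, []):
            latest[u] = min(latest[u], latest[v] - weight[(u, v)])
    return {n: latest[n] - earliest[n] for n in order}

# a -> b -> d is the critical chain; c is a cheap side path.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
weight = {("a", "b"): 3, ("a", "c"): 1, ("b", "d"): 3, ("c", "d"): 1}
order = ["a", "b", "c", "d"]
print(global_slack(succ, weight, order))  # a, b, d: 0 cycles; c: 4 cycles
```

Apportioned slack would then distribute the 4 cycles on the side path among its instructions according to some strategy.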

Slack measurements

~90% of insts have at least 5 cycles of global slack

[chart: distribution of global vs. local slack]
Slack measurements

~80% of insts have at least 5 cycles of apportioned slack

A large amount of exploitable slack exists

[chart: distribution of global, apportioned, and local slack]

Application-centered Slack Measurements
Load slack

Can we tolerate a long-latency L1 hit?

design: wire-constrained machine, e.g., Grid

non-uniformity: multi-latency L1

apportioning strategy: apportion ALL slack to load instructions
Multi-speed ALUs

Can we tolerate ALUs running at half frequency?

design: fast/slow ALUs

non-uniformity: multi-latency execution latency, bypass

apportioning strategy: give slack equal to original latency + 1

Predicting slack

Two steps to PC-indexed, history-based prediction:

  • Measure slack of a dynamic instruction
  • Store in array indexed by PC of static instruction

Two requirements:

  • Locality of slack
  • Ability to measure slack of a dynamic instruction
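The two steps above amount to a small table keyed by PC. A minimal sketch (the table size and PC values are made up):

```python
# Sketch: a PC-indexed, history-based slack predictor. The measured slack
# of a dynamic instruction is stored in an array indexed by the PC of its
# static instruction, then reused as a prediction for later instances.

TABLE_SIZE = 4096  # assumed size, for illustration

class SlackPredictor:
    def __init__(self):
        self.table = [0] * TABLE_SIZE       # predicted slack per PC

    def train(self, pc, measured_slack):
        self.table[pc % TABLE_SIZE] = measured_slack

    def predict(self, pc):
        return self.table[pc % TABLE_SIZE]

p = SlackPredictor()
p.train(0x400123, 5)        # one dynamic instance measured 5 cycles of slack
print(p.predict(0x400123))  # 5: locality of slack makes the entry reusable
```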
Slack Detector

“delay and observe”: effective for a hardware predictor

Problem #1: Iterating repeatedly over the same dynamic instruction

Solution: Only sample each dynamic instruction once

Problem #2: Determining if overall execution time increased

Solution: Check if the delay made the instruction critical
Slack Detector

delay and observe

Goal:

Determine whether instruction has n cycles of slack

  • Delay the instruction by n cycles
  • Check if critical (via critical-path analyzer)
    • No, instruction has n cycles of slack
    • Yes, instruction does not have n cycles of slack
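The test above can be phrased as a tiny predicate (the `delay_and_run` and `is_critical` hooks are hypothetical stand-ins for the simulator and the critical-path analyzer):

```python
# Sketch of the "delay and observe" test: an instruction has n cycles of
# slack iff delaying it by n cycles does not make it critical.

def has_n_cycles_of_slack(delay_and_run, is_critical, inst, n):
    execution = delay_and_run(inst, n)        # re-run with inst delayed n cycles
    return not is_critical(execution, inst)   # critical -> no n-cycle slack

# Toy stand-ins: pretend the instruction becomes critical once delayed
# past its true slack of 4 cycles.
TRUE_SLACK = 4
delay_and_run = lambda inst, n: n
is_critical = lambda execution, inst: execution > TRUE_SLACK

print(has_n_cycles_of_slack(delay_and_run, is_critical, "ld r2", 3))  # True
print(has_n_cycles_of_slack(delay_and_run, is_critical, "ld r2", 5))  # False
```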
Fast/slow cluster microarchitecture

P ∝ F² ⇒ save ~37% core power

[figure: Fetch + Rename feeds a Steer stage that dispatches to a fast 3-wide cluster and a slow 3-wide cluster (each with Reg, WIN, ALUs), sharing a bypass bus and the data cache]

  • Aggressive non-uniform design:
  • Higher execution latencies
  • Increased (cross-domain) bypass latency
  • Decreased effective issue bandwidth
Picking bins for the slack predictor
  • Two decisions
    • Steer to fast/slow cluster
    • Schedule with high/low priority within a cluster

Use implicit slack predictor with four bins:

  • Steer to fast cluster + schedule with high priority
  • Steer to fast cluster + schedule with low priority
  • Steer to slow cluster + schedule with high priority
  • Steer to slow cluster + schedule with low priority
Slack-based policies

10% better performance from hiding non-uniformities

[chart: slack-based policy vs. reg-dep steering vs. 2 fast, high-power clusters]
Multithreaded Execution Case Study

Two questions:

  • How should a program be divided into threads?
    • what makes a good cutpoint?
    • how can we find them automatically, or at least help programmers find them?
  • What should a multiple-core design look like?
    • should each core support out-of-order execution?
    • should SMT be supported?
    • how many processors are useful?
    • what is the effect of inter-processor latency?
Parallelizing an application

Why parallelize a single-thread application?

  • Legacy code, large code bases
  • Difficult to parallelize apps
    • Interpreted code, kernels of operating systems
  • Like to use better programming languages
    • Scheme, Java instead of C/C++
Parallelizing an application

Simplifying assumption

  • Program binary unchanged

Simplified problem statement

  • Given a program of length L, find a cutpoint that divides the program into two threads that provides maximum speedup
  • Must consider:
    • data dependences, execution latencies, control dependences, proper load balancing
Parallelizing an application

Naive solution:

  • try every possible cutpoint

Our solution:

  • efficiently determine the effect of every possible cutpoint
  • model execution before and after every cut
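Under a deliberately crude cost model (perfect overlap, no cross-cut dependences; not the paper's graph-based method), evaluating every cutpoint looks like:

```python
# Sketch: evaluate every cutpoint of a trace by comparing serial time
# with the two-thread time, assuming the two halves overlap perfectly
# and nothing crosses the cut. Latencies are made up.

def best_cutpoint(latencies):
    total = sum(latencies)
    best_k, best_time = 0, total      # k = 0 means "no cut" (serial)
    prefix = 0
    for k in range(1, len(latencies)):
        prefix += latencies[k - 1]
        two_thread = max(prefix, total - prefix)  # threads run in parallel
        if two_thread < best_time:
            best_k, best_time = k, two_thread
    return best_k, best_time

trace = [3, 1, 4, 1, 5, 9, 2, 6]   # per-instruction latencies
print(best_cutpoint(trace))        # (5, 17): balanced cut minimizes the longer half
```

The real problem must also account for data dependences, control dependences, and load balancing, which is what modeling the cut on the graph provides.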
Solution

[figure: the program graph from the first instruction (“start”) to the last instruction; each candidate cutpoint is evaluated by modeling execution before and after the cut]
Parallelizing an application

Considerations:

  • Synchronization overhead
    • add latency to E→E edges
  • Synchronization may involve turning E→E edges into E→F edges
  • Scheduling of threads
    • additional C→F edges

Challenges:

  • State behavior (one thread to multiple processors)
    • caches, branch predictor
  • Control behavior
    • limits where cutpoints can be made
Parallelizing an application

More general problem:

  • Divide a program into N threads
    • NP-complete

Icost can help:

  • icost(p1,p2) << 0 implies p1 and p2 redundant
    • action: move p1 and p2 further apart
Preliminary Results

Experimental Setup

  • Simulator, based loosely on SimpleScalar
  • Alpha SpecInt binaries

Procedure

  • Assume execution trace is known
  • Look at each 1k run of instructions
  • Test every possible cutpoint using 1k graphs
Dynamic Cutpoints

Only 20% of cuts yield benefits of > 20 cycles

Static Cutpoints

Up to 60% of cuts yield benefits of > 20 cycles

Future Avenues of Research
  • Map cutpoints back to actual code
      • Compare automatically generated cutpoints to human-generated ones
      • See what performance gains are in a simulator, as opposed to just on the graph
  • Look at the effect of synchronization operations
      • What additional overhead do they introduce?
  • Deal with state, control problems
      • Might need some technique outside of the graph
Multithreaded Execution Case Study

Two possible questions:

  • How should a program be divided into threads?
    • what makes a good cutpoint?
    • how can we find them automatically, or at least help programmers find them?
  • What should a multiple-core design look like?
    • should each core support out-of-order execution?
    • should SMT be supported?
    • how many processors are useful?
    • what is the effect of inter-processor latency?
CMP design study

What we can do:

  • Try out many configurations quickly
    • dramatic changes in architecture often only small changes in graph
  • Identifying bottlenecks
    • especially interactions
CMP design study: Out-of-orderness

Is out-of-order execution necessary in a CMP?

Procedure

  • model execution with different configurations
    • adjust CD edges
  • compute breakdowns
    • notice resource/events interacting with CD edges
CMP design study: Out-of-orderness

[figure: program graph from the first to the last instruction, with CD (window) edges adjusted to model different configurations]
CMP design study: Out-of-orderness

Results summary

  • Single-core: Performance taps out at 256 entries
  • CMP: Performance gains up through 1024 entries
    • some benchmarks see gains up to 16k entries

Why more beneficial?

  • Use breakdowns to find out.....
CMP design study: Out-of-orderness

Components of window cost

  • cache misses holding up retirement?
  • long strands of data dependencies?
  • predictable control flow?

Icost breakdowns give quantitative and qualitative answers

CMP design study: Out-of-orderness

[figure: breakdown of window cost (100% down to 0%) into cost(window) + icost(window, A) + icost(window, B) + icost(window, AB), with panels distinguishing serial interaction, parallel interaction, and independent cache misses vs. ALU latency]
Summary of Preliminary Results

icost(window, ALU operations) << 0

  • primarily communication between processors
  • window often stalled waiting for data

Implications

  • larger window may be overkill
  • need a cheap non-blocking solution
    • e.g., continual-flow pipelines
CMP design study: SMT?

Benefits

  • reduced thread start-up latency
  • reduced communication costs

How we could help

  • distribution of thread lengths
  • breakdowns to understand effect of communication
CMP design study: Other Questions

What is the effect of inter-processor communication latency?

  • understand hidden vs. exposed communication

Allocating processors to programs

  • methodology for O/S to better assign programs to processors
Annotated with Dependence Edges

[backup figure: dependence-edge types: Fetch BW, Data Dep, ROB, Branch Misp.]

Edge Weights Added

[backup figure: the dependence graph with latency weights on each edge]

Convert to Graph

[backup figure: F/E/C graph with weighted dependence edges]
Find Critical Path

[Figure: the longest weighted path through the F/E/C graph highlighted as the critical path.]

Add Non-last-arriving Edges

[Figure: the full graph including the weighted edges that were not last to arrive at their target nodes.]

Graph Alterations

Branch misprediction made correct

[Figure: the graph re-drawn with the mispredicted branch treated as correctly predicted, removing its misprediction edge.]

Step 1. Observing

R1 ← R2 + R3

  • Observation: if the dependence into R2 is on the critical path, then the value of R2 arrived last (the dependence into R3 resolved early).

critical ⇒ arrives last

arrives last ⇏ critical

Determining last-arrive edges

Observe events within the machine

last_arrive[F] =
  • E → F if branch misp.
  • C → F if ROB stall
  • F → F otherwise

last_arrive[E] =
  • F → E if data ready on fetch
  • E → E otherwise: observe arrival order of operands

last_arrive[C] =
  • E → C if commit pointer is delayed
  • C → C otherwise

Last-arrive edges: a CPU stethoscope

[Figure: the CPU as a patient; observed last-arrive edges are the stethoscope. Edge types: F→E, C→F, E→E, E→F, E→C, F→F, C→C.]

Last-arrive edges

[Figure: rows of F, E, and C nodes with only the weighted last-arrive edges drawn between them.]

Remove latencies

Do not need explicit weights

[Figure: the same last-arrive graph with the edge weights dropped.]

Last-arrive edges

The last-arrive rule: the CP consists only of “last-arrive” edges

Prune the graph

Only need to put last-arrive edges in the graph

  • No other edges could be on the CP

[Figure: the pruned graph; only the newest node has no outgoing last-arrive edge.]

…and we’ve found the critical path!

Backward propagate along last-arrive edges, starting from the newest node

  • Found CP by only observing last-arrive edges
  • but still requires constructing the entire graph
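Once each node records its single last-arrive predecessor, the backward propagation is a short walk. A minimal sketch with toy node names, assuming the full pruned graph has already been built:

```python
def critical_path(last_arrive_pred, newest):
    """Walk backward from the newest (last-committed) node along the unique
    last-arrive edge into each node; reversing the walk yields the CP."""
    path = [newest]
    while path[-1] in last_arrive_pred:
        path.append(last_arrive_pred[path[-1]])
    return path[::-1]

# Toy three-instruction graph: a node is (instruction index, stage).
pred = {
    (2, "C"): (2, "E"),  # commit waited on this instruction's execute
    (2, "E"): (1, "E"),  # last-arriving operand produced by instr 1
    (1, "E"): (1, "F"),  # operands ready at fetch
    (1, "F"): (0, "F"),  # in-order fetch
}
print(critical_path(pred, (2, "C")))  # [(0, 'F'), (1, 'F'), (1, 'E'), (2, 'E'), (2, 'C')]
```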
Step 2. Efficient analysis

CP is a “long” chain of last-arrive edges

  • the longer a given chain of last-arrive edges, the more likely it is part of the CP

Algorithm: find sufficiently long last-arrive chains

  • Plant a token into a node n
  • Propagate the token forward, only along last-arrive edges
  • Check for the token after several hundred cycles
  • If the token is alive, n is assumed critical
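In software the same heuristic is a few lines. A sketch only: `last_arrive_succ` (a map from node to its last-arrive successors) and the `horizon` parameter are assumed stand-ins for the hardware's token array and its several-hundred-cycle check interval.

```python
def token_survives(last_arrive_succ, start, horizon):
    """Plant a token at `start`; each step it propagates only along
    last-arrive edges. If any copy is still alive after `horizon` steps,
    `start` is assumed critical (a heuristic, not a proof)."""
    frontier = {start}
    for _ in range(horizon):
        frontier = {v for u in frontier for v in last_arrive_succ.get(u, ())}
        if not frontier:
            return False  # token died: assumed non-critical
    return True

# A 5-node last-arrive chain: tokens planted early survive, late ones die.
chain = {0: (1,), 1: (2,), 2: (3,), 3: (4,)}
print(token_survives(chain, 0, 4))  # True
print(token_survives(chain, 2, 4))  # False
```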
Token-passing example

  1. plant token
  2. propagate token (along last-arrive edges, over an ROB-size span)
  3. is token alive?
  4. yes, train critical

  • Found CP without constructing the entire graph
Implementation: a small SRAM array

  • Read port: last-arrive producer node (inst id, type)
  • Write port: committed (inst id, type)
  • Token Queue

Size of SRAM: 3 bits × ROB size → < 200 bytes

Simply replicate for additional tokens
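The sizing claim is easy to check: one 3-bit record per ROB entry stays under 200 bytes even for a large window (512 entries here is an assumed size, not a configuration from the talk).

```python
rob_entries = 512              # assumed ROB size for the bound
array_bits = 3 * rob_entries   # one 3-bit last-arrive record per entry
array_bytes = array_bits / 8
print(array_bytes)             # 192.0 bytes, under the 200-byte bound
```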

Putting it all together

Prediction path: PC → prediction table → E-critical? → OOO Core

Training path: last-arrive edges (producer → retired instr) → token-passing analyzer → CP → prediction table

Case Study #1: Clustered architectures

  • Current state of art (Base)
  • Base + CP Scheduling
  • Base + CP Scheduling + CP Steering

[Figure: clustered pipeline with steering into per-cluster issue windows and scheduling within each cluster.]

Current State of the Art

Constant issue width, clock frequency

[Chart: performance of unclustered vs. 2-cluster vs. 4-cluster configurations.]

  • Avg. clustering penalty for 4 clusters: 19%
CP Optimizations

Base + CP Scheduling

[Chart: unclustered vs. 2-cluster vs. 4-cluster performance with CP scheduling.]

CP Optimizations

Base + CP Scheduling + CP Steering

[Chart: unclustered vs. 2-cluster vs. 4-cluster performance with CP scheduling and steering.]

  • Avg. clustering penalty reduced from 19% to 6%
Local Vs. Global Analysis

Previous CP predictors: local resource-sensitive predictions (HPCA ’01, ISCA ’01)

  • oldest-uncommitted
  • oldest-unissued
  • token-passing

  • CP exploitation seems to require global analysis
Icost Case Study: Deep pipelines

Deep pipelines cause long latency loops:

  • level-one (DL1) cache access, issue-wakeup, branch misprediction, …

But can often mitigate them indirectly

Assume 4-cycle DL1 access; how to mitigate?

  • Increase cache ports? Increase window size?
  • Increase fetch BW? Reduce cache misses?

Really, we are looking for serial interactions!

Icost Case Study: Deep pipelines

[Figure: six-instruction F/E/C graph (i1–i6) with 5-cycle DL1-access edges, execute and fetch latencies on the remaining edges, and the ROB window edge; the DL1 edges and the window edge share the critical path.]


Profiling goal

Goal:
  • Construct graph over many dynamic instructions

Constraint:
  • Can only sample sparsely
Genome sequencing ↔ Profiling goal

Goal:
  • Construct graph (the DNA strand)

Constraint:
  • Can only sample sparsely
“Shotgun” genome sequencing

[Figure: many short random samples of the DNA strand; find overlaps among samples to reassemble the whole.]

Mapping “shotgun” to our situation

[Figure: a long stream of many dynamic instructions, with sampled regions labeled by event: Icache miss, Dcache miss, Branch misp., No event.]

Offline Profiler Algorithm

[Figure: one long sample stitched together with many detailed samples.]

Design issues

Identify microexecution context
  • Choosing signature bits
    • e.g., for a branch, encode the taken/not-taken bit in the signature

Determining PCs (for better detailed-sample matching)
  • [Figure: if the signature of a detailed sample equals the signature at a point in the long sample, then their start PCs (12, 16, 20, 24, 56, 60, …) line up.]

Compare Icost and Sensitivity Study

[Figure: six-instruction F/E/C graph (i1–i6) with DL1-access edges and the window edge, as in the deep-pipeline case study.]

Corollary to the DL1 and ROB serial interaction:

As load latency increases, the benefit from enlarging the ROB increases.

Compare Icost and Sensitivity Study

Sensitivity Study Advantages

  • More information
    • e.g., concave or convex curves

Interaction Cost Advantages

  • Easy (automatic) interpretation
    • Sign and magnitude have well defined meanings
  • Concise communication
    • DL1 and ROB interact serially
Outline
  • Definition (ISCA ’01)
    • what does it mean for an event to be critical?
  • Detection (ISCA ’01)
    • how can we determine what events are critical?
  • Interpretation (MICRO ’04, TACO ’04)
    • what does it mean for two events to interact?
  • Application (ISCA ’01-’02, TACO ’04)
    • how can we exploit criticality in hardware?
Our solution: measure interactions

Two parallel cache misses (each 100 cycles)

  • Cost(miss #1) = 0
  • Cost(miss #2) = 0
  • Cost({miss #1, miss #2}) = 100

Aggregate cost (100) > sum of individual costs (0 + 0) → parallel interaction

icost = aggregate cost – sum of individual costs = 100 – 0 – 0 = 100

Interaction cost (icost)

icost = aggregate cost – sum of individual costs

  • Positive icost → parallel interaction (miss #1 and miss #2 overlap)
  • Zero icost → ?
Interaction cost (icost)

icost = aggregate cost – sum of individual costs

  • Positive icost → parallel interaction
  • Zero icost → independent (miss #1 and miss #2 do not overlap)
  • Negative icost → ?
Negative icost

Two serial cache misses (data dependent), 100 cycles each, beside a 110-cycle ALU latency chain

Cost(miss #1) = ?

Negative icost

Two serial cache misses (data dependent), 100 cycles each, beside a 110-cycle ALU latency chain

  • Cost(miss #1) = 90
  • Cost(miss #2) = 90
  • Cost({miss #1, miss #2}) = 90

icost = aggregate cost – sum of individual costs = 90 – 90 – 90 = –90

Negative icost → serial interaction
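Both the parallel and the serial numbers can be reproduced with a toy dependence-graph model: execution time is the longest path, an event's cost is the speedup from idealizing it away, and icost follows the definition above. The graph encoding and labels are illustrative, not the paper's machinery.

```python
from collections import defaultdict

def exec_time(edges, removed=frozenset()):
    """Longest s->t path. `edges` are (src, dst, latency, label) tuples
    listed in topological order; labels in `removed` get latency 0,
    i.e. the event is idealized away."""
    dist = defaultdict(int)
    for u, v, w, label in edges:
        lat = 0 if label in removed else w
        dist[v] = max(dist[v], dist[u] + lat)
    return dist["t"]

def cost(edges, events):
    """Cycles saved by idealizing `events` together."""
    return exec_time(edges) - exec_time(edges, frozenset(events))

def icost(edges, e1, e2):
    """Aggregate cost minus the sum of individual costs."""
    return cost(edges, {e1, e2}) - cost(edges, {e1}) - cost(edges, {e2})

# Two parallel 100-cycle misses: each alone costs 0, together 100.
parallel = [("s", "t", 100, "m1"), ("s", "t", 100, "m2")]
# Two dependent 100-cycle misses beside a 110-cycle ALU chain.
serial = [("s", "a", 100, "m1"), ("a", "t", 100, "m2"), ("s", "t", 110, "alu")]

print(icost(parallel, "m1", "m2"))  # 100  -> parallel interaction
print(icost(serial, "m1", "m2"))    # -90  -> serial interaction
```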

Interaction cost (icost)

icost = aggregate cost – sum of individual costs

  • Positive icost → parallel interaction (e.g., two overlapping misses; branch mispredict, load-replay trap, fetch BW, LSQ stall)
  • Zero icost → independent
  • Negative icost → serial interaction (e.g., two dependent misses beside an ALU-latency chain)

Why care about serial interactions?

Two serial cache misses (100 cycles each) beside a 110-cycle ALU latency chain

Reason #1: We are over-optimizing!
  • Prefetching miss #2 doesn’t help if miss #1 is already prefetched (but the overhead still costs us)

Reason #2: We have a choice of what to optimize
  • Prefetching miss #2 has the same effect as prefetching miss #1

Outline
  • Definition (ISCA ’01)
    • what does it mean for an event to be critical?
  • Detection (ISCA ’01)
    • how can we determine what events are critical?
  • Interpretation (MICRO ’04, TACO ’04)
    • what does it mean for two events to interact?
  • Application (ISCA ’01-’02, TACO ’04)
    • how can we exploit criticality in hardware?
Criticality Analyzer (ISCA ‘01)
  • Goal
    • Detect criticality of dynamic instructions
  • Procedure
    • Observe last-arriving edges
      • uses simple rules
    • Propagate a token forward along last-arriving edges
      • at worst, a read-modify-write sequence to a small array
    • If the token dies, non-critical; otherwise, critical
Slack Analyzer (ISCA ‘02)
  • Goal
    • Detect likely slack of static instructions
  • Procedure
    • Delay the instruction by n cycles
    • Check if critical (via critical-path analyzer)
      • No, instruction has n cycles of slack
      • Yes, instruction does not have n cycles of slack
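The procedure amounts to a simple search. In this sketch, `is_critical_with_delay` is a hypothetical stand-in for re-running the token-passing criticality analyzer with the instruction artificially delayed by n cycles.

```python
def measured_slack(is_critical_with_delay, max_delay):
    """Largest artificial delay (in cycles) that leaves the instruction
    non-critical; that delay is the instruction's measured slack."""
    slack = 0
    for n in range(1, max_delay + 1):
        if is_critical_with_delay(n):
            break  # delaying by n cycles pushed it onto the critical path
        slack = n
    return slack

# An instruction that becomes critical once delayed more than 5 cycles:
print(measured_slack(lambda n: n > 5, 20))  # 5
```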
Shotgun Profiling (TACO ‘04)
  • Goal
    • Create representative graph fragments
  • Procedure
    • Enhance ProfileMe counters with context
    • Use context to piece together counter samples
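A sketch of the stitching idea: cheap counters yield one long, sparse sample, and each detailed sample carries a context signature. Here the context is simplified to a (PC, signature) pair, an assumption for illustration; detailed fragments are attached wherever that context recurs in the long sample.

```python
def stitch(long_sample, detailed):
    """Attach detailed graph fragments to a sparse long sample.
    `long_sample` is an ordered list of (pc, signature) contexts from
    cheap counters; `detailed` maps a context to its graph fragment."""
    placed = []
    for pos, ctx in enumerate(long_sample):
        if ctx in detailed:
            placed.append((pos, detailed[ctx]))  # fragment anchored at pos
    return placed

long_sample = [("a", 0), ("b", 1), ("a", 0)]
detailed = {("a", 0): "fragment-A"}
print(stitch(long_sample, detailed))  # [(0, 'fragment-A'), (2, 'fragment-A')]
```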