Using Criticality to Attack Performance Bottlenecks

Brian Fields

UC-Berkeley

(Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)


Bottleneck Analysis

Bottleneck Analysis: determining the performance effect of an event on execution time

  • An event could be:

    • an instruction’s execution

    • an instruction-window-full stall

    • a branch mispredict

    • a network request

    • inter-processor communication

    • etc.



Bottleneck Analysis Applications

  • Run-time Optimization

    • Resource arbitration

      • e.g., how to schedule memory accesses?

    • Effective speculation

      • e.g., which branches to predicate?

    • Dynamic reconfiguration

      • e.g., when to enable hyperthreading?

    • Energy efficiency

      • e.g., when to throttle frequency?

  • Design Decisions

    • Overcoming technology constraints

      • e.g., how to mitigate effect of long wire latencies?

  • Programmer Performance Tuning

    • Where have the cycles gone?

      • e.g., which cache misses should be prefetched?



Current State of the Art

Event counts:

Exe. time = (CPU cycles + Mem. cycles) × Clock cycle time

where:

Mem. cycles = Number of cache misses × Miss penalty

[Figure: two overlapping cache misses, miss1 (100 cycles) and miss2 (100 cycles) — 2 misses but only 1 miss penalty]
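To make the failure mode concrete, here is a minimal sketch (all numbers hypothetical) of how the event-count formula overestimates when misses overlap:

```python
# Event-count model vs. overlapped execution (illustrative numbers only).
cpu_cycles = 1000
miss_penalty = 100
num_misses = 2

naive_mem_cycles = num_misses * miss_penalty       # 2 * 100 = 200
naive_exe_cycles = cpu_cycles + naive_mem_cycles   # 1200

# If the two misses overlap completely, only one penalty is exposed:
overlapped_exe_cycles = cpu_cycles + miss_penalty  # 1100

print(naive_exe_cycles - overlapped_exe_cycles)    # 100 cycles overestimated
```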


Parallelism

Parallelism in systems complicates performance understanding

  • Two parallel cache misses

  • Two parallel threads

  • A branch mispredict and full-store-buffer stall occur in the same cycle that three loads are waiting on the memory system and two floating-point multiplies are executing


Criticality Challenges

  • Cost

    • How much speedup possible from optimizing an event?

  • Slack

    • How much can an event be “slowed down” before increasing execution time?

  • Interactions

    • When do multiple events need to be optimized simultaneously?

    • When do we have a choice?

  • Exploit in Hardware



Our Approach: Criticality

Critical events affect execution time; non-critical events do not.

Bottleneck Analysis: determining the performance effect of an event on execution time


Defining criticality

Need Performance Sensitivity

  • slowing down a “critical” event should slow down the entire program

  • speeding up a “noncritical” event should leave execution time unchanged



[Figure sequence: a dynamic instruction stream (including a branch mispredict, MISP) is annotated with dependence edges — Fetch BW, Data Dep, ROB, Branch Misp. — then edge weights are added, and the whole execution is converted into a graph with F (fetch), E (execute), and C (commit) nodes per instruction]


Smaller graph instance

[Figure: a small F/E/C graph with a critical Icache miss (but how costly?) and a non-critical instruction (but how much slack?)]


Add “hidden” constraints

[Figure: the same graph with “hidden” constraint edges added; the Icache miss is critical (but how costly?) and another chain is non-critical (but how much slack?)]


Add “hidden” constraints

[Figure: with the hidden constraints in place, the graph answers both questions — Cost = 13 – 7 = 6 cycles for the critical Icache miss, and Slack = 13 – 7 = 6 cycles for the non-critical chain]
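The arithmetic behind “Cost = 13 – 7 = 6” is a difference of critical-path lengths: compute the longest path once with the event’s latency in place and once with it idealized. A minimal sketch on a tiny hypothetical graph (not the slide’s exact example):

```python
def cp_length(nodes, edges):
    """Longest path through a DAG; nodes must be in topological order,
    edges maps (src, dst) -> latency in cycles."""
    dist = {n: 0 for n in nodes}
    for u in nodes:
        for (a, b), w in edges.items():
            if a == u:
                dist[b] = max(dist[b], dist[u] + w)
    return dist[nodes[-1]]

# An F->F edge carrying a 10-cycle icache miss, then execute and commit.
nodes = ["F1", "F2", "E2", "C2"]
with_miss = {("F1", "F2"): 10, ("F2", "E2"): 1, ("E2", "C2"): 1}
idealized = dict(with_miss, **{("F1", "F2"): 1})  # miss latency removed

cost = cp_length(nodes, with_miss) - cp_length(nodes, idealized)
print(cost)  # 9: the miss's contribution to the critical path
```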


Slack “sharing”

Slack = 6 cycles

[Figure: two different edges in the graph each show 6 cycles of slack — but the slack is shared between them]

Can delay one edge by 6 cycles, but not both!


Machine Imbalance

[Chart: global vs. apportioned slack — ~80% of instructions have at least 5 cycles of apportioned slack]


Criticality Challenges

  • Cost

    • How much speedup possible from optimizing an event?

  • Slack

    • How much can an event be “slowed down” before increasing execution time?

  • Interactions

    • When do multiple events need to be optimized simultaneously?

    • When do we have a choice?

  • Exploit in Hardware


Simple criticality not always enough

Sometimes events have nearly equal criticality

miss #1 (99)

miss #2 (100)

Want to know

  • how critical is each event?

  • how far from critical is each event?

Actually, even that is not enough


Our solution: measure interactions

Two parallel cache misses

miss #1 (99)

miss #2 (100)

Cost(miss #1) = 0

Cost(miss #2) = 1

Cost({miss #1, miss #2}) = 100

Aggregate cost (100) > Sum of individual costs (0 + 1) → Parallel interaction

icost = aggregate cost – sum of individual costs = 100 – 0 – 1 = 99


Interaction cost (icost)

icost = aggregate cost – sum of individual costs

  • Positive icost → parallel interaction (miss #1 and miss #2 overlap)

  • Zero icost → ?


Interaction cost (icost)

icost = aggregate cost – sum of individual costs

  • Positive icost → parallel interaction (miss #1 and miss #2 overlap)

  • Zero icost → independent (miss #1 and miss #2 far apart)

  • Negative icost → ?


Negative icost

Two serial cache misses (data dependent)

miss #1 (100)

miss #2 (100)

ALU latency (110 cycles)

Cost(miss #1) = ?


Negative icost

Two serial cache misses (data dependent, 100 cycles each), overlapped with ALU latency (110 cycles):

Cost(miss #1) = 90
Cost(miss #2) = 90
Cost({miss #1, miss #2}) = 90

icost = aggregate cost – sum of individual costs = 90 – 90 – 90 = -90

Negative icost → serial interaction


Interaction cost (icost)

icost = aggregate cost – sum of individual costs

  • Positive icost → parallel interaction
  • Zero icost → independent
  • Negative icost → serial interaction

[Figure: example event pairs — branch mispredict & load-replay trap, fetch BW & LSQ stall, miss #1 & miss #2 in parallel; miss #1 & miss #2 in series behind ALU latency]


Why care about serial interactions?

Reason #1: We are over-optimizing!

  • Prefetching miss #2 doesn’t help if miss #1 is already prefetched (but the overhead still costs us)

Reason #2: We have a choice of what to optimize

  • Prefetching miss #2 has the same effect as prefetching miss #1

[Figure: miss #1 (100) and miss #2 (100) in series, hidden behind ALU latency (110 cycles)]


Icost Case Study: Deep pipelines

[Figure: Dcache (DL1) access pipeline — 4 stages vs. 1]

Looking for serial interactions!


Icost Breakdown (6 wide, 64-entry window)

[Chart: interaction-cost breakdown]


Icost Case Study: Deep pipelines

[Figure sequence: F/E/C dependence graph for instructions i1–i6, with DL1 access edges and the window edge highlighted step by step]


Criticality Challenges

  • Cost

    • How much speedup possible from optimizing an event?

  • Slack

    • How much can an event be “slowed down” before increasing execution time?

  • Interactions

    • When do multiple events need to be optimized simultaneously?

    • When do we have a choice?

  • Exploit in Hardware


Exploit in Hardware

  • Criticality Analyzer

    • Online, fast-feedback

    • Limited to critical/not critical

  • Replacement for Performance Counters

    • Requires offline analysis

    • Constructs entire graph


  • Observation: only last-arriving edges can be critical

    R1 ← R2 + R3

    If the dependence into R2 is on the critical path, then the value of R2 arrived last.

    critical → arrives last
    arrives last ↛ critical (a dependence may be resolved early)

    [Figure: execute node E with operand edges from R2 and R3; the R3 dependence resolved early]


    Determining last-arrive edges

    Observe events within the machine:

    last_arrive[F] =
      E→F if branch misp.
      C→F if ROB stall
      F→F otherwise

    last_arrive[E] =
      F→E if data ready on fetch
      E→E otherwise (observe arrival order of operands)

    last_arrive[C] =
      E→C if commit pointer is delayed
      C→C otherwise


    Last-arrive edges

    The last-arrive rule: the CP consists only of “last-arrive” edges

    [Figure: F/E/C graph showing only last-arrive edges]


    Prune the graph

    Only last-arrive edges need to be put in the graph; no other edges could be on the CP.

    [Figure: pruned F/E/C graph, growing toward the newest instruction]


    …and we’ve found the critical path!

    Backward-propagate along last-arrive edges from the newest instruction.

    • Found the CP by only observing last-arrive edges
    • but this still requires constructing the entire graph


    Step 2. Reducing storage requirements

    The CP is a “long” chain of last-arrive edges:

    • the longer a given chain of last-arrive edges, the more likely it is part of the CP

    Algorithm: find sufficiently long last-arrive chains (see the sketch after this list)

    • Plant token into a node n

    • Propagate forward, only along last-arrive edges

    • Check for token after several hundred cycles

    • If token alive, n is assumed critical
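As a software analogue of the hardware heuristic, a minimal sketch (the edge stream, node naming, and chain threshold are all assumptions):

```python
def is_critical(edge_stream, plant_node, min_chain=500):
    """Plant a token at plant_node, let it ride last-arrive edges
    (src, dst) observed as instructions retire, and call the node
    critical if the token is still propagating after a long chain."""
    tokened = {plant_node}
    chain = 0
    for src, dst in edge_stream:
        if src in tokened:
            tokened.add(dst)
            chain += 1
            if chain >= min_chain:
                return True   # token alive: assume plant_node is critical
    return False              # token died: assume non-critical

# Usage: is_critical(observed_edges, ("E", 42)) asks whether the execute
# node of dynamic instruction 42 lies on a sufficiently long chain.
```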


    Online Criticality Detection

    Forward-propagate the token along last-arrive edges.

    [Figure sequence: a token is planted at a node and rides last-arrive edges toward the newest instructions; tokens off the critical chain “die”, while a token on the CP survives — Token survives!]


    Putting it all together

    [Diagram: training path — the OOO core feeds last-arrive edges (producer → retired instr) to the token-passing analyzer; prediction path — a PC-indexed prediction table answers “E-critical?” for the core]


    Results

    • Performance (Speed)
      • Scheduling in clustered machines: 10% speedup
      • Selective value prediction
      • Deferred scheduling (Crowe et al.): 11% speedup
      • Heterogeneous cache (Rakvic et al.): 17% speedup
    • Energy
      • Non-uniform machine (fast and slow pipelines): ~25% less energy
      • Instruction queue resizing (Sasanka et al.)
      • Multiple frequency scaling (Semeraro et al.): 19% less energy with 3% less performance
      • Selective pre-execution (Petric et al.)


    Exploit in Hardware

    • Criticality Analyzer

      • Online, fast-feedback

      • Limited to critical/not critical

  • Replacement for Performance Counters

    • Requires offline analysis

    • Constructs entire graph


    Profiling goal

    Goal: construct the graph over many dynamic instructions

    Constraint: can only sample sparsely


    Profiling goal: analogy to genome sequencing

    Goal: construct the graph — like sequencing a DNA strand

    Constraint: can only sample sparsely


    “Shotgun” genome sequencing

    [Figure sequence: a DNA strand is sampled at many random points; overlaps among the samples are found and used to piece the strand together]


    Mapping “shotgun” to our situation

    [Figure: a long stream of dynamic instructions annotated with events — Icache miss, Dcache miss, Branch misp., No event]


    Profiler hardware requirements

    [Figure: detailed samples are matched against a long sample by overlapping context — Match!]



    Conclusion: Grand Challenges

    [Slide annotations: token-passing analyzer, modeling, shotgun profiling, parallel interactions, serial interactions]

    • Cost
      • How much speedup possible from optimizing an event?
    • Slack
      • How much can an event be “slowed down” before increasing execution time?
    • Interactions
      • When do multiple events need to be optimized simultaneously?
      • When do we have a choice?


    Conclusion: Bottleneck Analysis Applications

    [Slide annotations: selective value prediction; scheduling and steering in clustered processors; resize instruction window; non-uniform machines; helped cope with high-latency dcache; measured cost of cache misses/branch mispredicts]

    • Run-time Optimization
      • Effective speculation
      • Resource arbitration
      • Dynamic reconfiguration
      • Energy efficiency
    • Design Decisions
      • Overcoming technology constraints
    • Programmer Performance Tuning
      • Where have the cycles gone?


    Outline

    Simple Criticality

    • Definition (ISCA ’01)

    • Detection (ISCA ’01)

    • Application (ISCA ’01-’02)

      Advanced Criticality

    • Interpretation (MICRO ’03)

      • What types of interactions are possible?

    • Hardware Support (MICRO ’03, TACO ’04)

      • Enhancement to performance counters





    Criticality Prior Work

    Critical-Path Method, PERT charts

    • Developed for the Navy’s “Polaris” project, 1957
    • Used as a project-management tool
    • Simple critical-path, slack concepts

    “Attribution” Heuristics

    • Rosenblum et al.: SOSP-1995, and many others
    • Marks instruction at head of ROB as critical, etc.
    • Empirically, has limited accuracy
    • Does not account for interactions between events


    Related Work: Microprocessor Criticality

    Latency tolerance analysis

    • Srinivasan and Lebeck: MICRO-1998

      Heuristics-driven criticality predictors

    • Tune et al.: HPCA-2001

    • Srinivasan et al.: ISCA-2001

      “Local” slack detector

    • Casmira and Grunwald: Kool Chips Workshop-2000

      ProfileMe with pair-wise sampling

    • Dean, et al.: MICRO-1997



    Alternative I: Addressing Unresolved Issues

    Modeling and Measurement

    • What resources can we model effectively?

      • difficulty with mutual-exclusion-type resources (ALUs)

    • Efficient algorithms

    • Release tool for measuring cost/slack

    Hardware

    • Detailed design for criticality analyzer

    • Shotgun profiler simplifications

      • gradual path from counters

    Optimization

    • explore heuristics for exploiting interactions


    Alternative II: Chip-Multiprocessors

    • Programmer Performance Tuning

      • Parallelizing applications

        • What makes a good division into threads?

        • How can we find them automatically, or at least help programmers to find them?

    • Design Decisions

      • Should each core support out-of-order execution?

      • Should SMT be supported?

      • How many processors are useful?

      • What is the effect of inter-processor latency?


    Unresolved issues

    [Figure: Original Execution vs. Altered Execution (to compute the cost of inst #3’s cache miss) for the code
      1. ld r2, [Mem]    (cache miss)
      2. add r3 ← r2 + 1
      3. ld r4, [Mem]    (cache miss → cache hit in the altered run)
      4. add r6 ← r4 + 1
    Removing the miss turns adder contention into no contention, but the leftover contention edge “should not be here” — an incorrect critical path due to the contention edge]

    Unresolved issues

    Modeling and Measurement

    • What resources can we model effectively?
      • difficulty with mutual-exclusion-type resources (ALUs)
        • In other words, unanticipated side effects


    Unresolved issues

    Modeling and Measurement (cont.)

    • How should processor policies be modeled?

      • relationship to icost definition

    • Efficient algorithms for measuring icosts

      • pairs of events, etc.

    • Release tool for measuring cost/slack


    Unresolved issues

    Hardware

    • Detailed design for criticality analyzer

      • help to convince industry-types to build it

    • Shotgun profiler simplifications

      • gradual path from counters

    Optimization

    • Explore icost optimization heuristics

      • icosts are difficult to interpret



    Validation: can we trust our model?

    • Expect “big” speedup
    • Expect no speedup



    Validation

    Two steps:

    • Increase latencies of instructions by their apportioned slack, for three apportioning strategies:
      1) latency + 1
      2) 5 cycles to as many instructions as possible
      3) 12 cycles to as many loads as possible
    • Compare to baseline (no delays inserted)


    Validation

    [Chart: worst case inaccuracy of 0.6%]



    Three slack variants

    Local slack: the number of cycles an instruction’s latency can be increased without delaying any subsequent instruction

    Global slack: the number of cycles latency can be increased without delaying the last instruction in the program

    Apportioned slack: global slack distributed among instructions using an apportioning strategy
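One standard way to compute global slack on such a graph (a sketch; the talk’s exact formulation may differ) uses longest-path distances from the source and to the sink:

```latex
% d_src(u):  longest path from the graph source to node u
% d_sink(v): longest path from node v to the graph sink
% w(u,v):    edge latency; CP: critical-path (source-to-sink) length
GlobalSlack(u \to v) = CP - \left( d_{src}(u) + w(u,v) + d_{sink}(v) \right)
```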


    Slack measurements

    [Chart: cumulative slack distributions — ~21% of instructions have at least 5 cycles of local slack; ~90% have at least 5 cycles of global slack]


    Slack measurements

    [Chart: local, global, and apportioned slack — ~80% of instructions have at least 5 cycles of apportioned slack]

    A large amount of exploitable slack exists


    Application-centered Slack Measurements


    Load slack

    Can we tolerate a long-latency L1 hit?

    • design: wire-constrained machine, e.g. Grid
    • non-uniformity: multi-latency L1
    • apportioning strategy: apportion ALL slack to load instructions


    Apportion all slack to loads

    [Chart: most loads can tolerate an L2 cache hit]


    Multi-speed ALUs

    Can we tolerate ALUs running at half frequency?

    • design: fast/slow ALUs
    • non-uniformity: multi-latency execution, bypass
    • apportioning strategy: give slack equal to original latency + 1




    Predicting slack

    Two steps to PC-indexed, history-based prediction:

    • Measure slack of a dynamic instruction

    • Store in array indexed by PC of static instruction

    Two requirements:

    • Locality of slack

    • Ability to measure slack of a dynamic instruction
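A minimal sketch of the PC-indexed table (size and update policy are illustrative assumptions, not the paper’s tuned design):

```python
class SlackPredictor:
    def __init__(self, entries=4096):
        self.table = [0] * entries   # predicted slack, in cycles, per PC
        self.mask = entries - 1      # entries must be a power of two

    def train(self, pc, measured_slack):
        self.table[pc & self.mask] = measured_slack  # last-value update

    def predict(self, pc):
        return self.table[pc & self.mask]

# Usage (hypothetical PC):
p = SlackPredictor()
p.train(0x40321C, 5)
print(p.predict(0x40321C))  # 5
```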




    Locality of slack

    [Chart: a PC-indexed, history-based predictor can capture most of the available slack]


    Slack Detector

    “Delay and observe” is effective for a hardware predictor.

    Problem #1: Iterating repeatedly over the same dynamic instruction
      Solution: Only sample each dynamic instruction once

    Problem #2: Determining if overall execution time increased
      Solution: Check if the delay made the instruction critical


    Slack Detector

    Delay and observe

    Goal: determine whether an instruction has n cycles of slack

    • Delay the instruction by n cycles
    • Check if it became critical (via the critical-path analyzer)
      • No → the instruction has n cycles of slack
      • Yes → the instruction does not have n cycles of slack
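In code form the test is one perturbation plus one criticality check; the two helper functions below are hypothetical stand-ins for the analyzer hardware:

```python
# Hypothetical hooks standing in for the machinery above:
def delay_instruction(instr, extra_cycles):
    instr["latency"] += extra_cycles            # perturb one dynamic instance

def analyzer_says_critical(instr):
    return instr["latency"] > instr["slack_budget"]  # stand-in check

def has_n_cycles_of_slack(instr, n):
    delay_instruction(instr, extra_cycles=n)
    return not analyzer_says_critical(instr)    # critical => no n-cycle slack

print(has_n_cycles_of_slack({"latency": 1, "slack_budget": 6}, 5))  # True
```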



    Fast/slow cluster microarchitecture

    P ∝ F² → save ~37% core power

    [Diagram: Fetch + Rename feeds a Steer stage that routes instructions to a fast 3-wide cluster and a slow 3-wide cluster (each with Reg, WIN, ALUs), connected by a Bypass Bus and sharing the Data Cache]

    • Aggressive non-uniform design:
      • Higher execution latencies
      • Increased (cross-domain) bypass latency
      • Decreased effective issue bandwidth


    Picking bins for the slack predictor

    • Two decisions

      • Steer to fast/slow cluster

      • Schedule with high/low priority within a cluster

    Use implicit slack predictor with four bins:

    • Steer to fast cluster + schedule with high priority

    • Steer to fast cluster + schedule with low priority

    • Steer to slow cluster + schedule with high priority

    • Steer to slow cluster + schedule with low priority


    Slack-based policies

    [Chart: slack-based policy vs. reg-dep steering vs. 2 fast, high-power clusters — 10% better performance from hiding non-uniformities]



    Multithreaded Execution Case Study

    Two questions:

    • How should a program be divided into threads?

      • what makes a good cutpoint?

      • how can we find them automatically, or at least help programmers find them?

    • What should a multiple-core design look like?

      • should each core support out-of-order execution?

      • should SMT be supported?

      • how many processors are useful?

      • what is the effect of inter-processor latency?


    Parallelizing an application

    Why parallelize a single-thread application?

    • Legacy code, large code bases

    • Difficult to parallelize apps

      • Interpreted code, kernels of operating systems

    • Like to use better programming languages

      • Scheme, Java instead of C/C++


    Parallelizing an application

    Simplifying assumption

    • Program binary unchanged

      Simplified problem statement

    • Given a program of length L, find a cutpoint that divides the program into two threads that provides maximum speedup

    • Must consider:

      • data dependences, execution latencies, control dependences, proper load balancing


    Parallelizing an application

    Naive solution:

    • try every possible cutpoint

    Our solution:

    • efficiently determine the effect of every possible cutpoint

    • model execution before and after every cut


    Solution

    [Figure: the full F/E/C dependence graph from the first instruction to the last, with a “start” marker; each candidate cut is evaluated directly on the graph]


    Parallelizing an application

    Considerations:

    • Synchronization overhead
      • add latency to E→E edges
    • Synchronization may involve turning E→E edges into E→F edges
    • Scheduling of threads
      • additional C→F edges

    Challenges:

    • State behavior (one thread to multiple processors)
      • caches, branch predictor
    • Control behavior
      • limits where cutpoints can be made


    Parallelizing an application

    More general problem:

    • Divide a program into N threads

      • NP-complete

    Icost can help:

    • icost(p1,p2) << 0 implies p1 and p2 redundant

      • action: move p1 and p2 further apart


    Preliminary Results

    Experimental Setup

    • Simulator, based loosely on SimpleScalar

    • Alpha SpecInt binaries

      Procedure

    • Assume execution trace is known

    • Look at each 1k run of instructions

    • Test every possible cutpoint using 1k graphs


    Dynamic Cutpoints

    [Chart: only 20% of cuts yield benefits of > 20 cycles]



    Static Cutpoints

    [Chart: up to 60% of cuts yield benefits of > 20 cycles]


    Future Avenues of Research

    • Map cutpoints back to actual code

      • Compare automatically generated cutpoints to human-generated ones

      • See what performance gains are in a simulator, as opposed to just on the graph

    • Look at the effect of synchronization operations

      • What additional overhead do they introduce?

    • Deal with state, control problems

      • Might need some technique outside of the graph


    Multithreaded Execution Case Study

    Two possible questions:

    • How should a program be divided into threads?

      • what makes a good cutpoint?

      • how can we find them automatically, or at least help programmers find them?

    • What should a multiple-core design look like?

      • should each core support out-of-order execution?

      • should SMT be supported?

      • how many processors are useful?

      • what is the effect of inter-processor latency?


    CMP design study

    What we can do:

    • Try out many configurations quickly

      • dramatic changes in architecture often only small changes in graph

    • Identifying bottlenecks

      • especially interactions


    CMP design study: Out-of-orderness

    Is out-of-order execution necessary in a CMP?

    Procedure

    • model execution with different configurations
      • adjust CD edges
    • compute breakdowns
      • notice resources/events interacting with CD edges


    CMP design study: Out-of-orderness

    [Figure: the same first-to-last instruction F/E/C graph, re-used to model different window configurations]


    CMP design study: Out-of-orderness

    Results summary

    • Single-core: performance taps out at 256 entries
    • CMP: performance gains up through 1024 entries
      • some benchmarks see gains up to 16k entries

    Why more beneficial?

    • Use breakdowns to find out…


    CMP design study: Out-of-orderness

    Components of window cost

    • cache misses holding up retirement?

    • long strands of data dependencies?

    • predictable control flow?

    Icost breakdowns give quantitative and qualitative answers


    CMP design study: Out-of-orderness

    [Chart: 100%-to-0% breakdown of window cost into independent, parallel-interaction, and serial-interaction components involving ALU operations and cache misses]

    cost(window) + icost(window, A) + icost(window, B) + icost(window, AB) = 0


    Summary of Preliminary Results

    icost(window, ALU operations) << 0

    • primarily communication between processors

    • window often stalled waiting for data

    Implications

    • larger window may be overkill

    • need a cheap non-blocking solution

      • e.g., continual-flow pipelines


    CMP design study: SMT?

    Benefits

    • reduced thread start-up latency

    • reduced communication costs

    How we could help

    • distribution of thread lengths

    • breakdowns to understand effect of communication


    CMP design study: How many processors?

    [Figure: execution alternating between threads #1 and #2 from Start]


    CMP design study: Other Questions

    What is the effect of inter-processor communication latency?

    • understand hidden vs. exposed communication

      Allocating processors to programs

    • methodology for O/S to better assign programs to processors





    [Backup figure sequence: the dependence graph rebuilt step by step — Annotated with Dependence Edges (Fetch BW, Data Dep, ROB, Branch Misp.), Edge Weights Added, Convert to Graph, Find Critical Path, Add Non-last-arriving Edges, and Graph Alterations (branch misprediction made correct)]



    Step 1. Observing

    R1 ← R2 + R3

    Observation: if the dependence into R2 is on the critical path, then the value of R2 arrived last.

    critical → arrives last
    arrives last ↛ critical (a dependence may be resolved early)

    [Figure: execute node E with operand edges from R2 and R3; the R3 dependence resolved early]




    Last-arrive edges: a CPU stethoscope

    [Figure: the CPU emits one last-arrive edge per node — F→E, C→F, E→E, E→F, E→C, F→F, C→C]


    Last-arrive edges

    [Figure: example F/E/C graph showing only its last-arrive edges]


    Remove latencies

    Do not need explicit weights

    [Figure: the unweighted F/E/C last-arrive graph]




    Step 2. Efficient analysis

    The CP is a “long” chain of last-arrive edges.

    • the longer a given chain of last-arrive edges, the more likely it is part of the CP

      Algorithm: find sufficiently long last-arrive chains

    • Plant token into a node n

    • Propagate forward, only along last-arrive edges

    • Check for token after several hundred cycles

    • If token alive, n is assumed critical


    Token-passing example

    [Figure: within a window of ROB size — 1. plant token; 2. propagate token; 3. is token alive? 4. yes → train critical]

    • Found the CP without constructing the entire graph


    Implementation: a small SRAM array

    [Diagram: a token queue SRAM — read the last-arrive producer node (inst id, type), write the committed (inst id, type)]

    Size of SRAM: 3 bits × ROB size → under 200 bytes

    Simply replicate for additional tokens





    Case Study #1: Clustered architectures

    [Diagram: issue window with steering and scheduling stages]

    • Base + CP Scheduling
    • Base + CP Scheduling + CP Steering


    Current State of the Art

    [Chart: constant issue width and clock frequency — unclustered vs. 2-cluster vs. 4-cluster; avg. clustering penalty for 4 clusters: 19%]


    CP Optimizations: Base + CP Scheduling

    [Chart: unclustered vs. 2-cluster vs. 4-cluster]


    CP Optimizations: Base + CP Scheduling + CP Steering

    [Chart: avg. clustering penalty reduced from 19% to 6%]



    Local vs. Global Analysis

    Previous CP predictors: local, resource-sensitive predictions (HPCA ’01, ISCA ’01)

    [Chart: token-passing vs. oldest-uncommitted vs. oldest-unissued heuristics]

    • CP exploitation seems to require global analysis



    Icost Case Study: Deep pipelines

    Deep pipelines cause long latency loops:

    • level-one (DL1) cache access, issue-wakeup, branch misprediction, …

    But we can often mitigate them indirectly. Assume a 4-cycle DL1 access; how to mitigate it? Increase cache ports? Increase window size? Increase fetch BW? Reduce cache misses?

    Really, we are looking for serial interactions!











    Offline Profiler Algorithm

    [Figure: many detailed samples are pieced together against one long sample]

    Design issues

    • Choosing signature bits: if two signatures are equal, the samples are assumed to match
      • a branch encodes its taken/not-taken bit in the signature

    • Determining PCs (for better detailed-sample matching) within the long sample

    [Figure: Start PC with sample PCs 12, 16, 20, 24, 56, 60, …]
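A minimal sketch of the matching step, treating signatures as branch-outcome bit strings (the representation and matching policy are assumptions):

```python
def find_match(long_signature, detailed_signature):
    """Slide the detailed sample's signature over the long sample's;
    return the matching offset, or None if the context never lines up."""
    n, m = len(long_signature), len(detailed_signature)
    for off in range(n - m + 1):
        if long_signature[off:off + m] == detailed_signature:
            return off
    return None

# Usage: signatures as strings of taken (1) / not-taken (0) bits.
print(find_match("0110100111", "1001"))  # 4
```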











    Compare Icost and Sensitivity Study

    [Figure: F/E/C graph for instructions i1–i6 with DL1 access edges]

    Corollary to the DL1 and ROB serial interaction: as load latency increases, the benefit from enlarging the ROB increases.



    Compare Icost and Sensitivity Study

    Sensitivity Study Advantages

    • More information

      • e.g., concave or convex curves

        Interaction Cost Advantages

    • Easy (automatic) interpretation

      • Sign and magnitude have well defined meanings

    • Concise communication

      • DL1 and ROB interact serially


    Outline

    • Definition (ISCA ’01)

      • what does it mean for an event to be critical?

    • Detection (ISCA ’01)

      • how can we determine what events are critical?

    • Interpretation (MICRO ’04, TACO ’04)

      • what does it mean for two events to interact?

    • Application (ISCA ’01-’02, TACO ’04)

      • how can we exploit criticality in hardware?


    Our solution: measure interactions

    Two parallel cache misses (each 100 cycles)

    Cost(miss #1) = 0
    Cost(miss #2) = 0
    Cost({miss #1, miss #2}) = 100

    Aggregate cost (100) > Sum of individual costs (0 + 0) → Parallel interaction

    icost = aggregate cost – sum of individual costs = 100 – 0 – 0 = 100






    Criticality Analyzer (ISCA ’01)

    • Goal
      • Detect criticality of dynamic instructions
    • Procedure
      • Observe last-arriving edges
        • uses simple rules
      • Propagate a token forward along last-arriving edges
        • at worst, a read-modify-write sequence to a small array
      • If the token dies, the instruction is non-critical; otherwise, critical


    Slack Analyzer (ISCA ’02)

    • Goal

      • Detect likely slack of static instructions

    • Procedure

      • Delay the instruction by n cycles

      • Check if critical (via critical-path analyzer)

        • No, instruction has n cycles of slack

        • Yes, instruction does not have n cycles of slack


    Shotgun Profiling (TACO ’04)

    • Goal

      • Create representative graph fragments

    • Procedure

      • Enhance ProfileMe counters with context

      • Use context to piece together counter samples

