Using Criticality to Attack Performance Bottlenecks

Brian Fields

UC-Berkeley

(Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Bottleneck Analysis

Bottleneck Analysis: Determining the performance effect of an event on execution time

  • An event could be:
    • an instruction’s execution
    • an instruction-window-full stall
    • a branch mispredict
    • a network request
    • inter-processor communication
    • etc.
Bottleneck Analysis Applications
  • Run-time Optimization
    • Resource arbitration
      • e.g., how to schedule memory accesses?
    • Effective speculation
      • e.g., which branches to predicate?
    • Dynamic reconfiguration
      • e.g., when to enable hyperthreading?
    • Energy efficiency
      • e.g., when to throttle frequency?
  • Design Decisions
    • Overcoming technology constraints
      • e.g., how to mitigate effect of long wire latencies?
  • Programmer Performance Tuning
    • Where have the cycles gone?
      • e.g., which cache misses should be prefetched?
Current state-of-art

Event counts:

Exe. time = (CPU cycles + Mem. cycles) * Clock cycle time

where:

Mem. cycles = Number of cache misses * Miss penalty

[figure: two overlapping misses, miss1 (100 cycles) and miss2 (100 cycles): 2 misses but only 1 miss penalty]
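The additive event-count model above can be made concrete with a small sketch (illustrative numbers only, not from the talk): when two misses overlap in an out-of-order machine, charging a full penalty per miss overstates memory time.

```python
# Illustrative sketch: the additive event-count model vs. overlapped misses.
# All numbers are made up.

MISS_PENALTY = 100  # cycles

def additive_model(cpu_cycles, n_misses):
    # Exe. time = CPU cycles + Mem. cycles; Mem. cycles = misses * penalty
    return cpu_cycles + n_misses * MISS_PENALTY

cpu_cycles = 500
predicted = additive_model(cpu_cycles, 2)  # model charges two full penalties
actual = cpu_cycles + MISS_PENALTY         # overlapped misses pay only one

print(predicted, actual)  # 700 600
```

The 100-cycle gap is exactly the "2 misses but only 1 miss penalty" effect the slide points at.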

Parallelism

Parallelism in systems complicates performance understanding

  • Two parallel cache misses
  • Two parallel threads
  • A branch mispredict and full-store-buffer stall occur in the same cycle that three loads are waiting on the memory system and two floating-point multiplies are executing
Criticality Challenges
  • Cost
    • How much speedup possible from optimizing an event?
  • Slack
    • How much can an event be “slowed down” before increasing execution time?
  • Interactions
    • When do multiple events need to be optimized simultaneously?
    • When do we have a choice?
  • Exploit in Hardware
Our Approach: Criticality

Critical events affect execution time; non-critical events do not.

Bottleneck Analysis: Determining the performance effect of an event on execution time

Defining criticality

Need Performance Sensitivity

  • slowing down a “critical” event should slow down the entire program
  • speeding up a “noncritical” event should leave execution time unchanged
Annotated with Dependence Edges

[figure: dynamic instruction stream annotated with dependence edges: Fetch BW, Data Dep, ROB, Branch Misp.]

Edge Weights Added

[figure: the same dependence graph with latency weights (0 to 3 cycles) on each edge]
Convert to Graph

[figure: pipeline events mapped onto a graph of F (fetch), E (execute), C (commit) nodes per instruction, with weighted dependence edges]
Smaller graph instance

[figure: small F/E/C graph with a 10-cycle edge marked "Critical Icache miss, but how costly?" and another edge marked "Non-critical, but how much slack?"]

Add “hidden” constraints

[figure: the same graph with the hidden machine constraints added as extra edges, so the cost of the critical Icache miss and the slack of the non-critical edge can be computed]
Add “hidden” constraints

Cost = 13 – 7 = 6 cycles

Slack = 13 – 7 = 6 cycles

[figure: with the hidden constraints in place, removing the Icache miss shortens the critical path from 13 to 7 cycles (its cost), and the non-critical edge can be delayed 6 cycles before lengthening the path (its slack)]
Slack “sharing”

Slack = 6 cycles

Can delay one edge by 6 cycles, but not both!

[figure: two non-critical edges share the same 6 cycles of slack]
Criticality Challenges
  • Cost
    • How much speedup possible from optimizing an event?
  • Slack
    • How much can an event be “slowed down” before increasing execution time?
  • Interactions
    • When do multiple events need to be optimized simultaneously?
    • When do we have a choice?
  • Exploit in Hardware
Simple criticality not always enough

Sometimes events have nearly equal criticality

miss #1 (99)

miss #2 (100)

Want to know

  • how critical is each event?
  • how far from critical is each event?

Actually, even that is not enough

Our solution: measure interactions

Two parallel cache misses

miss #1 (99)

miss #2 (100)

Cost(miss #1) = 0

Cost(miss #2) = 1

Cost({miss #1, miss #2}) = 100

Aggregate cost (100) > Sum of individual costs (0 + 1) ⇒ Parallel interaction

icost = aggregate cost – sum of individual costs = 100 – 0 – 1 = 99
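The icost definition is simple enough to state as code. A minimal sketch (the cost values are the ones from this slide and from the later serial-misses slide):

```python
# Minimal sketch of the interaction-cost (icost) definition.
# Cost values come from the slides' two worked examples.

def icost(aggregate_cost, individual_costs):
    """icost = aggregate cost - sum of individual costs."""
    return aggregate_cost - sum(individual_costs)

# Two parallel misses: removing either alone barely helps,
# removing both together saves the full 100-cycle penalty.
parallel = icost(100, [0, 1])    # 99  -> positive: parallel interaction

# Two data-dependent misses hidden behind a 110-cycle ALU chain:
serial = icost(90, [90, 90])     # -90 -> negative: serial interaction

print(parallel, serial)  # 99 -90
```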

Interaction cost (icost)

icost = aggregate cost – sum of individual costs

  • Positive icost ⇒ parallel interaction (miss #1 ∥ miss #2)
  • Zero icost ⇒ ?
Interaction cost (icost)

icost = aggregate cost – sum of individual costs

  • Positive icost ⇒ parallel interaction (miss #1 ∥ miss #2)
  • Zero icost ⇒ independent (miss #1 . . . miss #2)
  • Negative icost ⇒ ?
Negative icost

Two serial cache misses (data dependent)

miss #1 (100)

miss #2 (100)

ALU latency (110 cycles)

Cost(miss #1) = ?

Negative icost

Two serial cache misses (data dependent)

miss #1 (100)

miss #2 (100)

ALU latency (110 cycles)

Cost(miss #1) = 90

Cost(miss #2) = 90

Cost({miss #1, miss #2}) = 90

icost = aggregate cost – sum of individual costs = 90 – 90 – 90 = -90

Negative icost ⇒ serial interaction

Interaction cost (icost)

icost = aggregate cost – sum of individual costs

  • Positive icost ⇒ parallel interaction (e.g., miss #1 ∥ miss #2, branch mispredict ∥ load-replay trap, fetch BW ∥ LSQ stall)
  • Zero icost ⇒ independent (miss #1 . . . miss #2)
  • Negative icost ⇒ serial interaction (e.g., two misses on one ALU-latency chain)

Why care about serial interactions?

Reason #1: We are over-optimizing! Prefetching miss #2 doesn’t help if miss #1 is already prefetched (but the overhead still costs us).

Reason #2: We have a choice of what to optimize. Prefetching miss #2 has the same effect as prefetching miss #1.

[figure: miss #1 (100) and miss #2 (100) in series along an ALU-latency chain (110 cycles)]

Icost Case Study: Deep pipelines

Dcache (DL1)

Looking for serial interactions!

Icost Case Study: Deep pipelines

[figure: F/E/C graph for instructions i1 through i6 with DL1 access latencies, fetch-bandwidth edges, and a window edge; edge weights range from 0 to 18 cycles]

Criticality Challenges
  • Cost
    • How much speedup possible from optimizing an event?
  • Slack
    • How much can an event be “slowed down” before increasing execution time?
  • Interactions
    • When do multiple events need to be optimized simultaneously?
    • When do we have a choice?
  • Exploit in Hardware
Exploit in Hardware
  • Criticality Analyzer
      • Online, fast-feedback
      • Limited to critical/not critical
  • Replacement for Performance Counters
      • Requires offline analysis
      • Constructs entire graph
Only last-arriving edges can be critical

R1 ← R2 + R3

  • Observation: If the dependence into R2 is on the critical path, then the value of R2 arrived last.

critical ⇒ arrives last

arrives last ⇏ critical

[figure: E node with incoming edges from R2 and R3; the R3 dependence resolved early]
Determining last-arrive edges

Observe events within the machine:

last_arrive[F] =

  • E→F if branch misp.
  • C→F if ROB stall
  • F→F otherwise

last_arrive[E] =

  • F→E if data ready on fetch
  • E→E otherwise (observe arrival order of operands)

last_arrive[C] =

  • E→C if commit pointer is delayed
  • C→C otherwise
Last-arrive edges

The last-arrive rule: the CP consists only of “last-arrive” edges

[figure: F/E/C graph with only the last-arrive edges highlighted]
Prune the graph

Only the last-arrive edges need to be put in the graph; no other edges could be on the CP.

[figure: pruned F/E/C graph ending at the newest instruction]
…and we’ve found the critical path!

Backward propagate along last-arrive edges

  • Found CP by only observing last-arrive edges
  • but still requires constructing the entire graph

[figure: backward walk from the newest node along last-arrive edges]
Step 2. Reducing storage reqs

CP is a “long” chain of last-arrive edges.

  • the longer a given chain of last-arrive edges, the more likely it is part of the CP

Algorithm: find sufficiently long last-arrive chains

  • Plant token into a node n
  • Propagate forward, only along last-arrive edges
  • Check for token after several hundred cycles
  • If token alive, n is assumed critical
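The algorithm above can be sketched in software (an assumed adjacency-map encoding of last-arrive edges, not the hardware design):

```python
# Sketch of the token-passing heuristic. `last_arrive` maps each node to
# the node it feeds via a last-arrive edge; a token planted at n survives
# only while it keeps moving along the chain.

def token_survives(last_arrive, n, horizon):
    """Plant a token at node n, propagate it forward along last-arrive
    edges, and report whether it is still alive after `horizon` hops."""
    node = n
    for _ in range(horizon):
        node = last_arrive.get(node)
        if node is None:          # token "dies": n's chain ended early
            return False
    return True                   # chain long enough -> assume n critical

# Toy chain: 0 -> 1 -> 2 -> 3; node 9 feeds nothing.
chain = {0: 1, 1: 2, 2: 3}
print(token_survives(chain, 0, 3))  # True
print(token_survives(chain, 9, 3))  # False
```

In hardware the "horizon" corresponds to checking for the token after several hundred cycles rather than counting hops explicitly.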
Online Criticality Detection

Forward propagate token

[figure: a token planted at an F node propagates forward along last-arrive edges toward the newest instructions]
Online Criticality Detection

[figure: tokens “die” when their chain of last-arrive edges ends]
Online Criticality Detection

[figure: the token survives the propagation window, so the planted node is assumed critical]
Putting it all together

[figure: on the training path, the OOO core feeds last-arrive edges (producer → retired instr) to the token-passing analyzer; on the prediction path, a PC-indexed prediction table answers “E-critical?”]
Results
  • Performance (Speed)
    • Scheduling in clustered machines
      • 10% speedup
    • Selective value prediction
    • Deferred scheduling (Crowe, et al)
      • 11% speedup
    • Heterogeneous cache (Rakvic, et al.)
      • 17% speedup
  • Energy
    • Non-uniform machine: fast and slow pipelines
      • ~25% less energy
    • Instruction queue resizing (Sasanka, et al.)
    • Multiple frequency scaling (Semeraro, et al.)
      • 19% less energy with 3% less performance
    • Selective pre-execution (Petric, et al.)
Exploit in Hardware
  • Criticality Analyzer
      • Online, fast-feedback
      • Limited to critical/not critical
  • Replacement for Performance Counters
      • Requires offline analysis
      • Constructs entire graph
Profiling goal

Goal:

  • Construct graph

many dynamic instructions

Constraint:

  • Can only sample sparsely
Profiling goal: a genome-sequencing analogy

Goal:

  • Construct graph (the “DNA strand”: many dynamic instructions)

Constraint:

  • Can only sample sparsely
“Shotgun” genome sequencing

Find overlaps among samples

[figure: many short random reads of the DNA are assembled by overlapping them]
Mapping “shotgun” to our situation

[figure: sparse sampled graph fragments over many dynamic instructions, each labeled with its context (Icache miss, Dcache miss, Branch misp., or no event), are matched up to assemble the full graph]
Conclusion: Grand Challenges

(addressed by: modeling, the token-passing analyzer, shotgun profiling, parallel interactions, serial interactions)
  • Cost
    • How much speedup possible from optimizing an event?
  • Slack
    • How much can an event be “slowed down” before increasing execution time?
  • Interactions
    • When do multiple events need to be optimized simultaneously?
    • When do we have a choice?
Conclusion: Bottleneck Analysis Applications

(examples: selective value prediction; scheduling and steering in clustered processors; resizing the instruction window; non-uniform machines; coping with a high-latency dcache; measuring the cost of cache misses/branch mispredicts)
  • Run-time Optimization
    • Effective speculation
    • Resource arbitration
    • Dynamic reconfiguration
    • Energy efficiency
  • Design Decisions
    • Overcoming technology constraints
  • Programmer Performance Tuning
    • Where have the cycles gone?
Outline

Simple Criticality

  • Definition (ISCA ’01)
  • Detection (ISCA ’01)
  • Application (ISCA ’01-’02)

Advanced Criticality

  • Interpretation (MICRO ’03)
    • What types of interactions are possible?
  • Hardware Support (MICRO ’03, TACO ’04)
    • Enhancement to performance counters
Outline

Simple Criticality

  • Definition (ISCA ’01)
  • Detection (ISCA ’01)
  • Application (ISCA ’01-’02)

Advanced Criticality

  • Interpretation (MICRO ’03)
    • What types of interactions are possible?
  • Hardware Support (MICRO ’03, TACO ’04)
    • Enhancement to performance counters
Criticality Prior Work

Critical-Path Method, PERT charts

  • Developed for the Navy’s “Polaris” project, 1957
  • Used as a project management tool
  • Simple critical-path, slack concepts

“Attribution” Heuristics

  • Rosenblum et al.: SOSP-1995, and many others
  • Marks instruction at head of ROB as critical, etc.
  • Empirically, has limited accuracy
  • Does not account for interactions between events
Related Work: Microprocessor Criticality

Latency tolerance analysis

  • Srinivasan and Lebeck: MICRO-1998

Heuristics-driven criticality predictors

  • Tune et al.: HPCA-2001
  • Srinivasan et al.: ISCA-2001

“Local” slack detector

  • Casmira and Grunwald: Kool Chips Workshop-2000

ProfileMe with pair-wise sampling

  • Dean, et al.: MICRO-1997
Alternative I: Addressing Unresolved Issues

Modeling and Measurement

  • What resources can we model effectively?
    • difficulty with mutual-exclusion-type resources (ALUs)
  • Efficient algorithms
  • Release tool for measuring cost/slack

Hardware

  • Detailed design for criticality analyzer
  • Shotgun profiler simplifications
    • gradual path from counters

Optimization

  • explore heuristics for exploiting interactions
Alternative II: Chip-Multiprocessors
  • Programmer Performance Tuning
    • Parallelizing applications
      • What makes a good division into threads?
      • How can we find them automatically, or at least help programmers to find them?
  • Design Decisions
    • Should each core support out-of-order execution?
    • Should SMT be supported?
    • How many processors are useful?
    • What is the effect of inter-processor latency?
Unresolved issues

Modeling and Measurement

  • What resources can we model effectively?
    • difficulty with mutual-exclusion-type resources (ALUs)
      • In other words, unanticipated side effects

[figure: Original Execution vs. Altered Execution (computing the cost of inst #3’s cache miss) for the sequence
1. ld r2, [Mem] (cache miss)
2. add r3 ← r2 + 1
3. ld r4, [Mem] (cache miss; a hit in the altered run)
4. add r6 ← r4 + 1
Altering the miss changes the adder contention, leaving a contention edge that “should not be here” and an incorrect critical path.]
Unresolved issues

Modeling and Measurement (cont.)

  • How should processor policies be modeled?
    • relationship to icost definition
  • Efficient algorithms for measuring icosts
    • pairs of events, etc.
  • Release tool for measuring cost/slack
Unresolved issues

Hardware

  • Detailed design for criticality analyzer
    • help to convince industry-types to build it
  • Shotgun profiler simplifications
    • gradual path from counters

Optimization

  • Explore icost optimization heuristics
    • icosts are difficult to interpret
Validation: can we trust our model?

Run two simulations:

  • Reduce CP latencies ⇒ expect “big” speedup
  • Reduce non-CP latencies ⇒ expect no speedup
Validation

Two steps:

  • Increase latencies of insts. by their apportioned slack
    • for three apportioning strategies:

1) latency+1, 2) 5-cycles to as many instructions as possible, 3) 12-cycles to as many loads as possible

  • Compare to baseline (no delays inserted)
Three slack variants

Local slack: # cycles latency can be increased without delaying any subsequent instructions

Global slack: # cycles latency can be increased without delaying the last instruction in the program

Apportioned slack: Distribute global slack among instructions using an apportioning strategy
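The global-slack variant can be computed on a toy dependence graph with a forward longest-path pass and a backward pass (a sketch with made-up latencies, not the paper's full machine model):

```python
# Sketch: global slack on a small weighted DAG. Nodes are instructions,
# edge weights are latencies, `order` is a topological order.

def longest_from_start(succ, weight, order):
    t = {n: 0 for n in order}
    for u in order:
        for v in succ.get(u, []):
            t[v] = max(t[v], t[u] + weight[(u, v)])
    return t

def global_slack(succ, weight, order):
    """Cycles each node can slip without delaying the last instruction."""
    earliest = longest_from_start(succ, weight, order)
    finish = max(earliest.values())
    latest = {n: finish for n in order}
    for u in reversed(order):
        for v in succ.get(u, []):
            latest[u] = min(latest[u], latest[v] - weight[(u, v)])
    return {n: latest[n] - earliest[n] for n in order}

# a -> b -> d is the critical chain; c is a cheap side path.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
weight = {("a", "b"): 3, ("a", "c"): 1, ("b", "d"): 3, ("c", "d"): 1}
order = ["a", "b", "c", "d"]
print(global_slack(succ, weight, order))  # a, b, d: 0 cycles; c: 4 cycles
```

Apportioned slack would then distribute the 4 cycles on the side path among its instructions according to some strategy.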

Slack measurements

~90% of insts have at least 5 cycles of global slack

[chart: distribution of global vs. local slack]
Slack measurements

~80% of insts have at least 5 cycles of apportioned slack

A large amount of exploitable slack exists

[chart: distribution of global, apportioned, and local slack]

Application-centered Slack Measurements
Load slack

Can we tolerate a long-latency L1 hit?

design: wire-constrained machine, e.g., Grid

non-uniformity: multi-latency L1

apportioning strategy: apportion ALL slack to load instructions
Multi-speed ALUs

Can we tolerate ALUs running at half frequency?

design: fast/slow ALUs

non-uniformity: multi-latency execution latency, bypass

apportioning strategy: give slack equal to original latency + 1

Predicting slack

Two steps to PC-indexed, history-based prediction:

  • Measure slack of a dynamic instruction
  • Store in array indexed by PC of static instruction

Two requirements:

  • Locality of slack
  • Ability to measure slack of a dynamic instruction
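The two steps above amount to a small table keyed by PC. A minimal sketch (the table size and PC values are made up):

```python
# Sketch: a PC-indexed, history-based slack predictor. The measured slack
# of a dynamic instruction is stored in an array indexed by the PC of its
# static instruction, then reused as a prediction for later instances.

TABLE_SIZE = 4096  # assumed size, for illustration

class SlackPredictor:
    def __init__(self):
        self.table = [0] * TABLE_SIZE       # predicted slack per PC

    def train(self, pc, measured_slack):
        self.table[pc % TABLE_SIZE] = measured_slack

    def predict(self, pc):
        return self.table[pc % TABLE_SIZE]

p = SlackPredictor()
p.train(0x400123, 5)        # one dynamic instance measured 5 cycles of slack
print(p.predict(0x400123))  # 5: locality of slack makes the entry reusable
```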
Slack Detector

“delay and observe”: effective for a hardware predictor

Problem #1: Iterating repeatedly over the same dynamic instruction

Solution: Only sample each dynamic instruction once

Problem #2: Determining if overall execution time increased

Solution: Check if the delay made the instruction critical
Slack Detector

delay and observe

Goal:

Determine whether instruction has n cycles of slack

  • Delay the instruction by n cycles
  • Check if critical (via critical-path analyzer)
    • No, instruction has n cycles of slack
    • Yes, instruction does not have n cycles of slack
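The test above can be phrased as a tiny predicate (the `delay_and_run` and `is_critical` hooks are hypothetical stand-ins for the simulator and the critical-path analyzer):

```python
# Sketch of the "delay and observe" test: an instruction has n cycles of
# slack iff delaying it by n cycles does not make it critical.

def has_n_cycles_of_slack(delay_and_run, is_critical, inst, n):
    execution = delay_and_run(inst, n)        # re-run with inst delayed n cycles
    return not is_critical(execution, inst)   # critical -> no n-cycle slack

# Toy stand-ins: pretend the instruction becomes critical once delayed
# past its true slack of 4 cycles.
TRUE_SLACK = 4
delay_and_run = lambda inst, n: n
is_critical = lambda execution, inst: execution > TRUE_SLACK

print(has_n_cycles_of_slack(delay_and_run, is_critical, "ld r2", 3))  # True
print(has_n_cycles_of_slack(delay_and_run, is_critical, "ld r2", 5))  # False
```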
Fast/slow cluster microarchitecture

P ∝ F² ⇒ save ~37% core power

[figure: Fetch + Rename feeds a Steer stage that dispatches to a fast 3-wide cluster and a slow 3-wide cluster (each with Reg, WIN, ALUs), sharing a bypass bus and the data cache]

  • Aggressive non-uniform design:
  • Higher execution latencies
  • Increased (cross-domain) bypass latency
  • Decreased effective issue bandwidth
Picking bins for the slack predictor
  • Two decisions
    • Steer to fast/slow cluster
    • Schedule with high/low priority within a cluster

Use implicit slack predictor with four bins:

  • Steer to fast cluster + schedule with high priority
  • Steer to fast cluster + schedule with low priority
  • Steer to slow cluster + schedule with high priority
  • Steer to slow cluster + schedule with low priority
Slack-based policies

10% better performance from hiding non-uniformities

[chart: slack-based policy vs. reg-dep steering vs. 2 fast, high-power clusters]
Multithreaded Execution Case Study

Two questions:

  • How should a program be divided into threads?
    • what makes a good cutpoint?
    • how can we find them automatically, or at least help programmers find them?
  • What should a multiple-core design look like?
    • should each core support out-of-order execution?
    • should SMT be supported?
    • how many processors are useful?
    • what is the effect of inter-processor latency?
Parallelizing an application

Why parallelize a single-thread application?

  • Legacy code, large code bases
  • Difficult to parallelize apps
    • Interpreted code, kernels of operating systems
  • Like to use better programming languages
    • Scheme, Java instead of C/C++
Parallelizing an application

Simplifying assumption

  • Program binary unchanged

Simplified problem statement

  • Given a program of length L, find a cutpoint that divides the program into two threads that provides maximum speedup
  • Must consider:
    • data dependences, execution latencies, control dependences, proper load balancing
Parallelizing an application

Naive solution:

  • try every possible cutpoint

Our solution:

  • efficiently determine the effect of every possible cutpoint
  • model execution before and after every cut
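Under a deliberately crude cost model (perfect overlap, no cross-cut dependences; not the paper's graph-based method), evaluating every cutpoint looks like:

```python
# Sketch: evaluate every cutpoint of a trace by comparing serial time
# with the two-thread time, assuming the two halves overlap perfectly
# and nothing crosses the cut. Latencies are made up.

def best_cutpoint(latencies):
    total = sum(latencies)
    best_k, best_time = 0, total      # k = 0 means "no cut" (serial)
    prefix = 0
    for k in range(1, len(latencies)):
        prefix += latencies[k - 1]
        two_thread = max(prefix, total - prefix)  # threads run in parallel
        if two_thread < best_time:
            best_k, best_time = k, two_thread
    return best_k, best_time

trace = [3, 1, 4, 1, 5, 9, 2, 6]   # per-instruction latencies
print(best_cutpoint(trace))        # (5, 17): balanced cut minimizes the longer half
```

The real problem must also account for data dependences, control dependences, and load balancing, which is what modeling the cut on the graph provides.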
Solution

[figure: the program graph from the first instruction (“start”) to the last instruction; each candidate cutpoint is evaluated by modeling execution before and after the cut]
Parallelizing an application

Considerations:

  • Synchronization overhead
    • add latency to E→E edges
  • Synchronization may involve turning E→E edges into E→F edges
  • Scheduling of threads
    • additional C→F edges

Challenges:

  • State behavior (one thread to multiple processors)
    • caches, branch predictor
  • Control behavior
    • limits where cutpoints can be made
Parallelizing an application

More general problem:

  • Divide a program into N threads
    • NP-complete

Icost can help:

  • icost(p1,p2) << 0 implies p1 and p2 redundant
    • action: move p1 and p2 further apart
Preliminary Results

Experimental Setup

  • Simulator, based loosely on SimpleScalar
  • Alpha SpecInt binaries

Procedure

  • Assume execution trace is known
  • Look at each 1k run of instructions
  • Test every possible cutpoint using 1k graphs
Dynamic Cutpoints

Only 20% of cuts yield benefits of > 20 cycles

Static Cutpoints

Up to 60% of cuts yield benefits of > 20 cycles

Future Avenues of Research
  • Map cutpoints back to actual code
      • Compare automatically generated cutpoints to human-generated ones
      • See what performance gains are in a simulator, as opposed to just on the graph
  • Look at the effect of synchronization operations
      • What additional overhead do they introduce?
  • Deal with state, control problems
      • Might need some technique outside of the graph
Multithreaded Execution Case Study

Two possible questions:

  • How should a program be divided into threads?
    • what makes a good cutpoint?
    • how can we find them automatically, or at least help programmers find them?
  • What should a multiple-core design look like?
    • should each core support out-of-order execution?
    • should SMT be supported?
    • how many processors are useful?
    • what is the effect of inter-processor latency?
CMP design study

What we can do:

  • Try out many configurations quickly
    • dramatic changes in architecture often only small changes in graph
  • Identifying bottlenecks
    • especially interactions
CMP design study: Out-of-orderness

Is out-of-order execution necessary in a CMP?

Procedure

  • model execution with different configurations
    • adjust CD edges
  • compute breakdowns
    • notice resource/events interacting with CD edges
CMP design study: Out-of-orderness

[figure: program graph from the first to the last instruction, with CD (window) edges adjusted to model different configurations]
CMP design study: Out-of-orderness

Results summary

  • Single-core: Performance taps out at 256 entries
  • CMP: Performance gains up through 1024 entries
    • some benchmarks see gains up to 16k entries

Why more beneficial?

  • Use breakdowns to find out.....
CMP design study: Out-of-orderness

Components of window cost

  • cache misses holding up retirement?
  • long strands of data dependencies?
  • predictable control flow?

Icost breakdowns give quantitative and qualitative answers

CMP design study: Out-of-orderness

[figure: breakdown of window cost (100% down to 0%) into cost(window) + icost(window, A) + icost(window, B) + icost(window, AB), with panels distinguishing serial interaction, parallel interaction, and independent cache misses vs. ALU latency]
Summary of Preliminary Results

icost(window, ALU operations) << 0

  • primarily communication between processors
  • window often stalled waiting for data

Implications

  • larger window may be overkill
  • need a cheap non-blocking solution
    • e.g., continual-flow pipelines
CMP design study: SMT?

Benefits

  • reduced thread start-up latency
  • reduced communication costs

How we could help

  • distribution of thread lengths
  • breakdowns to understand effect of communication
CMP design study: Other Questions

What is the effect of inter-processor communication latency?

  • understand hidden vs. exposed communication

Allocating processors to programs

  • methodology for O/S to better assign programs to processors
Annotated with Dependence Edges

[backup figure: dependence-edge types: Fetch BW, Data Dep, ROB, Branch Misp.]

Edge Weights Added

[backup figure: the dependence graph with latency weights on each edge]

Convert to Graph

[backup figure: F/E/C graph with weighted dependence edges]
Find Critical Path

[Figure: the longest weighted path through the F/E/C graph highlighted as the critical path.]

Add Non-last-arriving Edges

[Figure: the full graph including the weighted edges that were not last to arrive at their target nodes.]

Graph Alterations

Branch misprediction made correct

[Figure: the graph re-drawn with the mispredicted branch treated as correctly predicted, removing its misprediction edge.]

Step 1. Observing

R1 ← R2 + R3

  • Observation: if the dependence into R2 is on the critical path, then the value of R2 arrived last (the dependence into R3 resolved early).

critical ⇒ arrives last

arrives last ⇏ critical

Determining last-arrive edges

Observe events within the machine

last_arrive[F] =
  • E → F if branch misp.
  • C → F if ROB stall
  • F → F otherwise

last_arrive[E] =
  • F → E if data ready on fetch
  • E → E otherwise: observe arrival order of operands

last_arrive[C] =
  • E → C if commit pointer is delayed
  • C → C otherwise

Last-arrive edges: a CPU stethoscope

[Figure: the CPU as a patient; observed last-arrive edges are the stethoscope. Edge types: F→E, C→F, E→E, E→F, E→C, F→F, C→C.]

Last-arrive edges

[Figure: rows of F, E, and C nodes with only the weighted last-arrive edges drawn between them.]

Remove latencies

Do not need explicit weights

[Figure: the same last-arrive graph with the edge weights dropped.]

Last-arrive edges

The last-arrive rule: the CP consists only of “last-arrive” edges

Prune the graph

Only need to put last-arrive edges in the graph

  • No other edges could be on the CP

[Figure: the pruned graph; only the newest node has no outgoing last-arrive edge.]

…and we’ve found the critical path!

Backward propagate along last-arrive edges, starting from the newest node

  • Found CP by only observing last-arrive edges
  • but still requires constructing the entire graph
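Once each node records its single last-arrive predecessor, the backward propagation is a short walk. A minimal sketch with toy node names, assuming the full pruned graph has already been built:

```python
def critical_path(last_arrive_pred, newest):
    """Walk backward from the newest (last-committed) node along the unique
    last-arrive edge into each node; reversing the walk yields the CP."""
    path = [newest]
    while path[-1] in last_arrive_pred:
        path.append(last_arrive_pred[path[-1]])
    return path[::-1]

# Toy three-instruction graph: a node is (instruction index, stage).
pred = {
    (2, "C"): (2, "E"),  # commit waited on this instruction's execute
    (2, "E"): (1, "E"),  # last-arriving operand produced by instr 1
    (1, "E"): (1, "F"),  # operands ready at fetch
    (1, "F"): (0, "F"),  # in-order fetch
}
print(critical_path(pred, (2, "C")))  # [(0, 'F'), (1, 'F'), (1, 'E'), (2, 'E'), (2, 'C')]
```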
Step 2. Efficient analysis

CP is a “long” chain of last-arrive edges

  • the longer a given chain of last-arrive edges, the more likely it is part of the CP

Algorithm: find sufficiently long last-arrive chains

  • Plant a token into a node n
  • Propagate the token forward, only along last-arrive edges
  • Check for the token after several hundred cycles
  • If the token is alive, n is assumed critical
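In software the same heuristic is a few lines. A sketch only: `last_arrive_succ` (a map from node to its last-arrive successors) and the `horizon` parameter are assumed stand-ins for the hardware's token array and its several-hundred-cycle check interval.

```python
def token_survives(last_arrive_succ, start, horizon):
    """Plant a token at `start`; each step it propagates only along
    last-arrive edges. If any copy is still alive after `horizon` steps,
    `start` is assumed critical (a heuristic, not a proof)."""
    frontier = {start}
    for _ in range(horizon):
        frontier = {v for u in frontier for v in last_arrive_succ.get(u, ())}
        if not frontier:
            return False  # token died: assumed non-critical
    return True

# A 5-node last-arrive chain: tokens planted early survive, late ones die.
chain = {0: (1,), 1: (2,), 2: (3,), 3: (4,)}
print(token_survives(chain, 0, 4))  # True
print(token_survives(chain, 2, 4))  # False
```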
Token-passing example

  1. plant token
  2. propagate token (along last-arrive edges, over an ROB-size span)
  3. is token alive?
  4. yes, train critical

  • Found CP without constructing the entire graph
Implementation: a small SRAM array

  • Read port: last-arrive producer node (inst id, type)
  • Write port: committed (inst id, type)
  • Token Queue

Size of SRAM: 3 bits × ROB size → < 200 bytes

Simply replicate for additional tokens
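The sizing claim is easy to check: one 3-bit record per ROB entry stays under 200 bytes even for a large window (512 entries here is an assumed size, not a configuration from the talk).

```python
rob_entries = 512              # assumed ROB size for the bound
array_bits = 3 * rob_entries   # one 3-bit last-arrive record per entry
array_bytes = array_bits / 8
print(array_bytes)             # 192.0 bytes, under the 200-byte bound
```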

Putting it all together

Prediction path: PC → prediction table → E-critical? → OOO Core

Training path: last-arrive edges (producer → retired instr) → token-passing analyzer → CP → prediction table

Case Study #1: Clustered architectures

  • Current state of art (Base)
  • Base + CP Scheduling
  • Base + CP Scheduling + CP Steering

[Figure: clustered pipeline with steering into per-cluster issue windows and scheduling within each cluster.]

Current State of the Art

Constant issue width, clock frequency

[Chart: performance of unclustered vs. 2-cluster vs. 4-cluster configurations.]

  • Avg. clustering penalty for 4 clusters: 19%
CP Optimizations

Base + CP Scheduling

[Chart: unclustered vs. 2-cluster vs. 4-cluster performance with CP scheduling.]

CP Optimizations

Base + CP Scheduling + CP Steering

[Chart: unclustered vs. 2-cluster vs. 4-cluster performance with CP scheduling and steering.]

  • Avg. clustering penalty reduced from 19% to 6%
Local Vs. Global Analysis

Previous CP predictors: local resource-sensitive predictions (HPCA ’01, ISCA ’01)

  • oldest-uncommitted
  • oldest-unissued
  • token-passing

  • CP exploitation seems to require global analysis
Icost Case Study: Deep pipelines

Deep pipelines cause long latency loops:

  • level-one (DL1) cache access, issue-wakeup, branch misprediction, …

But can often mitigate them indirectly

Assume 4-cycle DL1 access; how to mitigate?

  • Increase cache ports? Increase window size?
  • Increase fetch BW? Reduce cache misses?

Really, we are looking for serial interactions!

Icost Case Study: Deep pipelines

[Figure: six-instruction F/E/C graph (i1–i6) with 5-cycle DL1-access edges, execute and fetch latencies on the remaining edges, and the ROB window edge; the DL1 edges and the window edge share the critical path.]


Profiling goal

Goal:
  • Construct graph over many dynamic instructions

Constraint:
  • Can only sample sparsely
Genome sequencing ↔ Profiling goal

Goal:
  • Construct graph (the DNA strand)

Constraint:
  • Can only sample sparsely
“Shotgun” genome sequencing

[Figure: many short random samples of the DNA strand; find overlaps among samples to reassemble the whole.]

Mapping “shotgun” to our situation

[Figure: a long stream of many dynamic instructions, with sampled regions labeled by event: Icache miss, Dcache miss, Branch misp., No event.]

Offline Profiler Algorithm

[Figure: one long sample stitched together with many detailed samples.]

Design issues

Identify microexecution context
  • Choosing signature bits
    • e.g., for a branch, encode the taken/not-taken bit in the signature

Determining PCs (for better detailed-sample matching)
  • [Figure: if the signature of a detailed sample equals the signature at a point in the long sample, then their start PCs (12, 16, 20, 24, 56, 60, …) line up.]

Compare Icost and Sensitivity Study

[Figure: six-instruction F/E/C graph (i1–i6) with DL1-access edges and the window edge, as in the deep-pipeline case study.]

Corollary to the DL1 and ROB serial interaction:

As load latency increases, the benefit from enlarging the ROB increases.

Compare Icost and Sensitivity Study

Sensitivity Study Advantages

  • More information
    • e.g., concave or convex curves

Interaction Cost Advantages

  • Easy (automatic) interpretation
    • Sign and magnitude have well defined meanings
  • Concise communication
    • DL1 and ROB interact serially
Outline
  • Definition (ISCA ’01)
    • what does it mean for an event to be critical?
  • Detection (ISCA ’01)
    • how can we determine what events are critical?
  • Interpretation (MICRO ’04, TACO ’04)
    • what does it mean for two events to interact?
  • Application (ISCA ’01-’02, TACO ’04)
    • how can we exploit criticality in hardware?
Our solution: measure interactions

Two parallel cache misses (each 100 cycles)

  • Cost(miss #1) = 0
  • Cost(miss #2) = 0
  • Cost({miss #1, miss #2}) = 100

Aggregate cost (100) > sum of individual costs (0 + 0) → parallel interaction

icost = aggregate cost – sum of individual costs = 100 – 0 – 0 = 100

Interaction cost (icost)

icost = aggregate cost – sum of individual costs

  • Positive icost → parallel interaction (miss #1 and miss #2 overlap)
  • Zero icost → ?
Interaction cost (icost)

icost = aggregate cost – sum of individual costs

  • Positive icost → parallel interaction
  • Zero icost → independent (miss #1 and miss #2 do not overlap)
  • Negative icost → ?
Negative icost

Two serial cache misses (data dependent), 100 cycles each, beside a 110-cycle ALU latency chain

Cost(miss #1) = ?

Negative icost

Two serial cache misses (data dependent), 100 cycles each, beside a 110-cycle ALU latency chain

  • Cost(miss #1) = 90
  • Cost(miss #2) = 90
  • Cost({miss #1, miss #2}) = 90

icost = aggregate cost – sum of individual costs = 90 – 90 – 90 = –90

Negative icost → serial interaction
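Both the parallel and the serial numbers can be reproduced with a toy dependence-graph model: execution time is the longest path, an event's cost is the speedup from idealizing it away, and icost follows the definition above. The graph encoding and labels are illustrative, not the paper's machinery.

```python
from collections import defaultdict

def exec_time(edges, removed=frozenset()):
    """Longest s->t path. `edges` are (src, dst, latency, label) tuples
    listed in topological order; labels in `removed` get latency 0,
    i.e. the event is idealized away."""
    dist = defaultdict(int)
    for u, v, w, label in edges:
        lat = 0 if label in removed else w
        dist[v] = max(dist[v], dist[u] + lat)
    return dist["t"]

def cost(edges, events):
    """Cycles saved by idealizing `events` together."""
    return exec_time(edges) - exec_time(edges, frozenset(events))

def icost(edges, e1, e2):
    """Aggregate cost minus the sum of individual costs."""
    return cost(edges, {e1, e2}) - cost(edges, {e1}) - cost(edges, {e2})

# Two parallel 100-cycle misses: each alone costs 0, together 100.
parallel = [("s", "t", 100, "m1"), ("s", "t", 100, "m2")]
# Two dependent 100-cycle misses beside a 110-cycle ALU chain.
serial = [("s", "a", 100, "m1"), ("a", "t", 100, "m2"), ("s", "t", 110, "alu")]

print(icost(parallel, "m1", "m2"))  # 100  -> parallel interaction
print(icost(serial, "m1", "m2"))    # -90  -> serial interaction
```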

Interaction cost (icost)

icost = aggregate cost – sum of individual costs

  • Positive icost → parallel interaction (e.g., two overlapping misses; branch mispredict, load-replay trap, fetch BW, LSQ stall)
  • Zero icost → independent
  • Negative icost → serial interaction (e.g., two dependent misses beside an ALU-latency chain)

Why care about serial interactions?

Two serial cache misses (100 cycles each) beside a 110-cycle ALU latency chain

Reason #1: We are over-optimizing!
  • Prefetching miss #2 doesn’t help if miss #1 is already prefetched (but the overhead still costs us)

Reason #2: We have a choice of what to optimize
  • Prefetching miss #2 has the same effect as prefetching miss #1

Outline
  • Definition (ISCA ’01)
    • what does it mean for an event to be critical?
  • Detection (ISCA ’01)
    • how can we determine what events are critical?
  • Interpretation (MICRO ’04, TACO ’04)
    • what does it mean for two events to interact?
  • Application (ISCA ’01-’02, TACO ’04)
    • how can we exploit criticality in hardware?
Criticality Analyzer (ISCA ‘01)
  • Goal
    • Detect criticality of dynamic instructions
  • Procedure
    • Observe last-arriving edges
      • uses simple rules
    • Propagate a token forward along last-arriving edges
      • at worst, a read-modify-write sequence to a small array
    • If the token dies, non-critical; otherwise, critical
Slack Analyzer (ISCA ‘02)
  • Goal
    • Detect likely slack of static instructions
  • Procedure
    • Delay the instruction by n cycles
    • Check if critical (via critical-path analyzer)
      • No, instruction has n cycles of slack
      • Yes, instruction does not have n cycles of slack
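The procedure amounts to a simple search. In this sketch, `is_critical_with_delay` is a hypothetical stand-in for re-running the token-passing criticality analyzer with the instruction artificially delayed by n cycles.

```python
def measured_slack(is_critical_with_delay, max_delay):
    """Largest artificial delay (in cycles) that leaves the instruction
    non-critical; that delay is the instruction's measured slack."""
    slack = 0
    for n in range(1, max_delay + 1):
        if is_critical_with_delay(n):
            break  # delaying by n cycles pushed it onto the critical path
        slack = n
    return slack

# An instruction that becomes critical once delayed more than 5 cycles:
print(measured_slack(lambda n: n > 5, 20))  # 5
```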
Shotgun Profiling (TACO ‘04)
  • Goal
    • Create representative graph fragments
  • Procedure
    • Enhance ProfileMe counters with context
    • Use context to piece together counter samples
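A sketch of the stitching idea: cheap counters yield one long, sparse sample, and each detailed sample carries a context signature. Here the context is simplified to a (PC, signature) pair, an assumption for illustration; detailed fragments are attached wherever that context recurs in the long sample.

```python
def stitch(long_sample, detailed):
    """Attach detailed graph fragments to a sparse long sample.
    `long_sample` is an ordered list of (pc, signature) contexts from
    cheap counters; `detailed` maps a context to its graph fragment."""
    placed = []
    for pos, ctx in enumerate(long_sample):
        if ctx in detailed:
            placed.append((pos, detailed[ctx]))  # fragment anchored at pos
    return placed

long_sample = [("a", 0), ("b", 1), ("a", 0)]
detailed = {("a", 0): "fragment-A"}
print(stitch(long_sample, detailed))  # [(0, 'fragment-A'), (2, 'fragment-A')]
```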