
A Monte Carlo Model of In-order Micro-architectural Performance: Decomposing Processor Stalls. Olaf Lubeck, Ram Srinivasan, Jeanine Cook.


Presentation Transcript


  1. A Monte Carlo Model of In-order Micro-architectural Performance: Decomposing Processor Stalls
  Olaf Lubeck, Ram Srinivasan, Jeanine Cook

  2. Single Processor Efficiency is Critical in Parallel Systems
  • Efficiency loss in scaling from 1 to 1000's of PEs? Roughly 2-3x (deterministic transport; Kerbyson, Hoisie, Pautz, 2003)
  • Percent of peak on a single PE? About 5-8% (12-20x less than peak)

  3. Processor Model: A Monte Carlo Approach to Predict CPI
  • Token generator — tokens are instruction classes; max rate: 1 token every CPIi cycles; non-producing tokens retire immediately; producer tokens can stall
  • Feedback loop — stalls arise from latencies interacting with application characteristics
  • Service centers — delays caused by ALU latencies, memory latencies, and branch misprediction
  • Inherent CPI (CPIi): the best application CPI given no processor stalls, i.e., infinite zero-latency resources

  4. Processor Model: A Monte Carlo Approach to Predict CPI
  • Token generator — tokens are instruction classes; max rate: 1 token every CPIi cycles; non-producing tokens retire immediately
  • Dependence check and stall generator — transition probabilities are associated with each path; dependence distances are generated per token
  • Service-center latencies (cycles): INT: 1, FPU: 4, L1: 1, L2: 6, L3: 16, TLB: 31, GSF: 6, BrM: 6, MEM: variable
  • Tokens retire once their path completes
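The token flow above can be sketched as a small Monte Carlo loop. This is an illustrative sketch, not the authors' code: the service-center latencies come from the slide, but the transition probabilities and the exponential dependence-distance pdf below are made-up placeholders standing in for the measured application distributions.

```python
import random

# Service-center latencies (cycles) from the slide; MEM is variable and omitted.
LATENCY = {"INT": 1, "FPU": 4, "L1": 1, "L2": 6, "L3": 16, "TLB": 31, "BrM": 6}

# Hypothetical path probabilities (the real ones come from binary instrumentation).
CLASS_PROB = {"INT": 0.40, "FPU": 0.20, "L1": 0.30, "L2": 0.06, "L3": 0.03, "BrM": 0.01}

def sample_class():
    """Pick an instruction class according to the transition probabilities."""
    r = random.random()
    cum = 0.0
    for cls, p in CLASS_PROB.items():
        cum += p
        if r < cum:
            return cls
    return "INT"

def simulate(n_tokens=100_000, cpi_inherent=0.75, mean_dep_dist=5.0):
    """Predict CPI as inherent CPI plus the average stall per token."""
    stall_cycles = 0.0
    for _ in range(n_tokens):
        cls = sample_class()
        # Placeholder dependence-distance pdf (instructions to the consumer).
        dist = random.expovariate(1.0 / mean_dep_dist)
        # The consumer issues dist * CPIi cycles after the producer; any
        # producer latency not hidden by that gap is a stall.
        stall_cycles += max(0.0, LATENCY[cls] - dist * cpi_inherent)
    return cpi_inherent + stall_cycles / n_tokens

print(f"predicted CPI ~ {simulate():.2f}")
```

With these placeholder numbers the loop converges in well under a second, consistent with the slide's claim that the model converges in a few seconds.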

  5. Processor Stalls and Characterization of Application Dependence
  • The major sources of stalls for in-order processors are RAW and WAW dependences
  • Probability distribution: load-to-use distance (in instructions)
  • Based on the path that token 2 has taken, we compute the stall time (and cause) for token 6: 16 − 4·CPIi
  • Application pdfs: load-to-use, FP-to-use, INT-to-use
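The stall rule on this slide can be written down directly. A minimal sketch (the helper name is ours): the producer's latency, minus the cycles the consumer spends issuing the intervening instructions, is the stall, floored at zero.

```python
def stall_cycles(producer_latency, dep_distance, cpi_inherent):
    """Stall seen by the consumer: the part of the producer's latency
    not hidden by the dep_distance instructions issued in between."""
    return max(0.0, producer_latency - dep_distance * cpi_inherent)

# Slide example: token 2 took the L3 path (16 cycles) and token 6 is
# 4 instructions later, so the stall is 16 - 4*CPIi.
print(stall_cycles(16, 4, 1.0))  # -> 12.0
```

A distance large enough to cover the latency (here, 16 instructions at CPIi = 1) yields no stall at all, which is why the dependence-distance pdfs matter so much.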

  6. Summary of Model Parameters
  • Inherent CPI — from a binary instrumentation tool
  • Instruction classes — INT, FP, Reg, branch misprediction, loads, non-producing; note that stores are retired immediately (treated as non-producers)
  • Transition probabilities — probabilities of generating each instruction class, computed from binary instrumentation; cache hit rates computed from performance counters (can also be predicted from models)
  • Distribution functions of dependence distances (measured in instructions) — load-to-use, FP-to-use, INT-to-use, from binary instrumentation
  • Processor and memory latencies — from architecture manuals
  • Parameters are computed in 1-2 hours; the model converges in a few seconds; ~800 lines of C code

  7. Model Accuracy: Constant Memory Latencies
  • Bench — 1.3 GHz Itanium 2, 3 MB L3 cache, 260-cycle memory latency
  • Hertz — 900 MHz Itanium 2, 1.5 MB L3 cache, 112-cycle memory latency

  8. CPI Decomposition on Bench

  9. Model Extensions for Variable Memory Latency: Compiler-controlled Prefetching

  10. Model Extensions for Variable Memory Latency: Compiler-controlled Prefetching
  • 1. Linear relationship between prefetch distance and memory latency
  • 2. A late prefetch can increase memory latency
  • This relationship suggests the prefetch-to-load pdf
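One way to read the linear relationship above: the further ahead a prefetch issues before its load (in instructions), the more of the memory latency is hidden. The sketch below is our interpretation, not the authors' formulation; the 260-cycle default is Bench's memory latency from slide 7, and the function name is hypothetical.

```python
def effective_latency(prefetch_distance, cpi_inherent, mem_latency=260.0):
    """Memory latency still exposed to the load, assuming the prefetch
    issued prefetch_distance instructions earlier hides cycles linearly
    (our reading of the slide; it does not model a late prefetch
    actually *increasing* latency, which the slide also notes)."""
    hidden = prefetch_distance * cpi_inherent
    return max(0.0, mem_latency - hidden)

# A prefetch far enough ahead hides the whole latency...
print(effective_latency(400, 1.0))  # -> 0.0
# ...while a late prefetch leaves most of it exposed.
print(effective_latency(40, 1.0))   # -> 220.0
```

Sampling `prefetch_distance` from the prefetch-to-load pdf then turns the fixed MEM latency of the base model into the variable latency this extension needs.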

  11. Model Accuracy: Variable Memory Latencies from Prefetch

  12. Model Extensions: Toward Multicore Chips (CMPs)
  • Memory latency as a function of outstanding loads — Hertz: slope of 27 cycles per outstanding load; Bench: slope of 101
  • Slopes are obtained empirically and are a function of memory controllers, chip speeds, bus bandwidths, etc.
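The latency-vs-outstanding-loads relationship above can be sketched as a simple linear model. Combining this slide's empirical slopes with the unloaded latencies from slide 7 as the intercepts is our assumption, not something the slides state outright.

```python
# Unloaded latencies from slide 7; slopes (cycles per outstanding load)
# from this slide. Treating latency as base + slope * outstanding is
# our assumed form of the linear relationship.
MACHINES = {
    "Hertz": {"base": 112, "slope": 27},   # 900 MHz Itanium 2
    "Bench": {"base": 260, "slope": 101},  # 1.3 GHz Itanium 2
}

def mem_latency(machine, outstanding_loads):
    """Memory latency (cycles) seen when outstanding_loads are in flight."""
    m = MACHINES[machine]
    return m["base"] + m["slope"] * outstanding_loads

print(mem_latency("Bench", 2))  # -> 462
```

Feeding this function's output into the MEM service center is what lets the model scale toward CMPs, where cores contend for the same memory path.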

  13. Application Characteristics: Dependence Distributions
  • Dependence distances are needed to explain stalls
  • Sixtrack: 5.7% L2 hits; Eon: 3.6% L2 hits — yet Eon's L2 stalls are 6x larger than Sixtrack's

  14. What kinds of questions can we explore with the model?
  • What if the FPU were not pipelined?
  • What if L2 were removed (a two-level cache)?
  • What if the processor frequency were changed (power-aware)?

  15. Summary
  • Monte Carlo techniques can be effectively applied to micro-architectural performance prediction
  • Main advantage: whole-program analysis; predictive and extensible
  • Main problems we have seen: small loops are not well predicted, and binary instrumentation for prefetch can take >24 hours
  • The model is surprisingly accurate given the architectural and application simplifications
  • The distributions used to develop predictive models are significant application characteristics that need to be evaluated
  • We are ready to go into "production" mode, applying the model to a number of in-order architectures: Cell and Niagara
