
A Monte Carlo Model of In-order Micro-architectural Performance: Decomposing Processor Stalls. Olaf Lubeck, Ram Srinivasan, Jeanine Cook.


Presentation Transcript


  1. A Monte Carlo Model of In-order Micro-architectural Performance: Decomposing Processor Stalls
  Olaf Lubeck, Ram Srinivasan, Jeanine Cook

  2. Single Processor Efficiency is Critical in Parallel Systems
  • Efficiency loss in scaling from 1 to 1000's of PEs? Roughly 2-3x (deterministic transport; Kerbyson, Hoisie, Pautz, 2003)
  • Percent of peak on a single PE? About 5-8% (12-20x less than peak)

  3. Processor Model: A Monte Carlo Approach to Predict CPI
  • Token generator — tokens are instruction classes; max rate: 1 token every CPIi cycles; non-producing tokens retire immediately; producer tokens can stall
  • Feedback loop — stalls arise from latencies interacting with application characteristics
  • Service centers — delays caused by ALU latencies, memory latencies, and branch misprediction
  • Inherent CPI (CPIi): the best application CPI given no processor stalls, i.e., infinite zero-latency resources

  4. Processor Model: A Monte Carlo Approach to Predict CPI
  • Token generator — tokens are instruction classes; max rate: 1 token every CPIi cycles; non-producing tokens retire immediately
  • Dependence check and stall generator — transition probabilities are associated with each path; dependence distances are generated per token
  • Service-center latencies (cycles): INT: 1, FPU: 4, L1: 1, L2: 6, L3: 16, TLB: 31, GSF: 6, BrM: 6, MEM: variable
  • Tokens retire once their path completes
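The token flow above can be sketched as a small Monte Carlo loop. This is an illustrative sketch, not the authors' code: the service-center latencies come from the slide, but the transition probabilities and the exponential dependence-distance pdf below are made-up placeholders standing in for the measured application distributions.

```python
import random

# Service-center latencies (cycles) from the slide; MEM is variable and omitted.
LATENCY = {"INT": 1, "FPU": 4, "L1": 1, "L2": 6, "L3": 16, "TLB": 31, "BrM": 6}

# Hypothetical path probabilities (the real ones come from binary instrumentation).
CLASS_PROB = {"INT": 0.40, "FPU": 0.20, "L1": 0.30, "L2": 0.06, "L3": 0.03, "BrM": 0.01}

def sample_class():
    """Pick an instruction class according to the transition probabilities."""
    r = random.random()
    cum = 0.0
    for cls, p in CLASS_PROB.items():
        cum += p
        if r < cum:
            return cls
    return "INT"

def simulate(n_tokens=100_000, cpi_inherent=0.75, mean_dep_dist=5.0):
    """Predict CPI as inherent CPI plus the average stall per token."""
    stall_cycles = 0.0
    for _ in range(n_tokens):
        cls = sample_class()
        # Placeholder dependence-distance pdf (instructions to the consumer).
        dist = random.expovariate(1.0 / mean_dep_dist)
        # The consumer issues dist * CPIi cycles after the producer; any
        # producer latency not hidden by that gap is a stall.
        stall_cycles += max(0.0, LATENCY[cls] - dist * cpi_inherent)
    return cpi_inherent + stall_cycles / n_tokens

print(f"predicted CPI ~ {simulate():.2f}")
```

With these placeholder numbers the loop converges in well under a second, consistent with the slide's claim that the model converges in a few seconds.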

  5. Processor Stalls and Characterization of Application Dependence
  • The major sources of stalls for in-order processors are RAW and WAW dependences
  • Probability distribution: load-to-use distance (in instructions)
  • Based on the path that token 2 has taken, we compute the stall time (and cause) for token 6: 16 − 4·CPIi
  • Application pdfs: load-to-use, FP-to-use, INT-to-use
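The stall rule on this slide can be written down directly. A minimal sketch (the helper name is ours): the producer's latency, minus the cycles the consumer spends issuing the intervening instructions, is the stall, floored at zero.

```python
def stall_cycles(producer_latency, dep_distance, cpi_inherent):
    """Stall seen by the consumer: the part of the producer's latency
    not hidden by the dep_distance instructions issued in between."""
    return max(0.0, producer_latency - dep_distance * cpi_inherent)

# Slide example: token 2 took the L3 path (16 cycles) and token 6 is
# 4 instructions later, so the stall is 16 - 4*CPIi.
print(stall_cycles(16, 4, 1.0))  # -> 12.0
```

A distance large enough to cover the latency (here, 16 instructions at CPIi = 1) yields no stall at all, which is why the dependence-distance pdfs matter so much.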

  6. Summary of Model Parameters
  • Inherent CPI — from a binary instrumentation tool
  • Instruction classes — INT, FP, Reg, branch misprediction, loads, non-producing; note that stores are retired immediately (treated as non-producers)
  • Transition probabilities — probabilities of generating each instruction class, computed from binary instrumentation; cache hit rates computed from performance counters (can also be predicted from models)
  • Distribution functions of dependence distances (measured in instructions) — load-to-use, FP-to-use, INT-to-use, from binary instrumentation
  • Processor and memory latencies — from architecture manuals
  • Parameters are computed in 1-2 hours; the model converges in a few seconds; ~800 lines of C code

  7. Model Accuracy: Constant Memory Latencies
  • Bench — 1.3 GHz Itanium 2, 3 MB L3 cache, 260-cycle memory latency
  • Hertz — 900 MHz Itanium 2, 1.5 MB L3 cache, 112-cycle memory latency

  8. CPI Decomposition on Bench

  9. Model Extensions for Variable Memory Latency: Compiler-controlled Prefetching

  10. Model Extensions for Variable Memory Latency: Compiler-controlled Prefetching
  • 1. Linear relationship between prefetch distance and memory latency
  • 2. A late prefetch can increase memory latency
  • This relationship suggests the prefetch-to-load pdf
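One way to read the linear relationship above: the further ahead a prefetch issues before its load (in instructions), the more of the memory latency is hidden. The sketch below is our interpretation, not the authors' formulation; the 260-cycle default is Bench's memory latency from slide 7, and the function name is hypothetical.

```python
def effective_latency(prefetch_distance, cpi_inherent, mem_latency=260.0):
    """Memory latency still exposed to the load, assuming the prefetch
    issued prefetch_distance instructions earlier hides cycles linearly
    (our reading of the slide; it does not model a late prefetch
    actually *increasing* latency, which the slide also notes)."""
    hidden = prefetch_distance * cpi_inherent
    return max(0.0, mem_latency - hidden)

# A prefetch far enough ahead hides the whole latency...
print(effective_latency(400, 1.0))  # -> 0.0
# ...while a late prefetch leaves most of it exposed.
print(effective_latency(40, 1.0))   # -> 220.0
```

Sampling `prefetch_distance` from the prefetch-to-load pdf then turns the fixed MEM latency of the base model into the variable latency this extension needs.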

  11. Model Accuracy: Variable Memory Latencies from Prefetch

  12. Model Extensions: Toward Multicore Chips (CMPs)
  • Memory latency as a function of outstanding loads — Hertz: slope of 27 cycles per outstanding load; Bench: slope of 101
  • Slopes are obtained empirically and are a function of memory controllers, chip speeds, bus bandwidths, etc.
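The latency-vs-outstanding-loads relationship above can be sketched as a simple linear model. Combining this slide's empirical slopes with the unloaded latencies from slide 7 as the intercepts is our assumption, not something the slides state outright.

```python
# Unloaded latencies from slide 7; slopes (cycles per outstanding load)
# from this slide. Treating latency as base + slope * outstanding is
# our assumed form of the linear relationship.
MACHINES = {
    "Hertz": {"base": 112, "slope": 27},   # 900 MHz Itanium 2
    "Bench": {"base": 260, "slope": 101},  # 1.3 GHz Itanium 2
}

def mem_latency(machine, outstanding_loads):
    """Memory latency (cycles) seen when outstanding_loads are in flight."""
    m = MACHINES[machine]
    return m["base"] + m["slope"] * outstanding_loads

print(mem_latency("Bench", 2))  # -> 462
```

Feeding this function's output into the MEM service center is what lets the model scale toward CMPs, where cores contend for the same memory path.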

  13. Application Characteristics: Dependence Distributions
  • Dependence distances are needed to explain stalls
  • Sixtrack: 5.7% L2 hits; Eon: 3.6% L2 hits — yet Eon's L2 stalls are 6x larger than Sixtrack's

  14. What kinds of questions can we explore with the model?
  • What if the FPU were not pipelined?
  • What if L2 were removed (a two-level cache)?
  • What if the processor frequency were changed (power-aware)?

  15. Summary
  • Monte Carlo techniques can be effectively applied to micro-architectural performance prediction
  • Main advantage: whole-program analysis; predictive and extensible
  • Main problems we have seen: small loops are not well predicted, and binary instrumentation for prefetch can take >24 hours
  • The model is surprisingly accurate given the architectural and application simplifications
  • The distributions used to develop predictive models are significant application characteristics that need to be evaluated
  • We are ready to go into "production" mode, applying the model to a number of in-order architectures: Cell and Niagara
