
PINTOS: An Execution Phase Based Optimization and Simulation Tool

Wei Hsu, Jinpyo Kim, Sreekumar Kodak. Computer Science Department, University of Minnesota. October 9, 2004. PIN Tutorial at ASPLOS'04.


Presentation Transcript


  1. PINTOS: An Execution Phase Based Optimization and Simulation Tool • Wei Hsu, Jinpyo Kim, Sreekumar Kodak • Computer Science Department, University of Minnesota • October 9, 2004 • PIN Tutorial at ASPLOS'04

  2. Outline • What is Pintos? • What can Pintos do? • Phase detection for optimization and simulation • Optimization (instruction prefetching) • Fast Simulation • Summary

  3. What is Pintos? • PINTOS is a PIN-based tool for optimization and simulation • A research framework that supports adaptive object code optimization • Supports deep analysis of run-time program behavior for object code optimization (e.g. instruction and data prefetching) • Integrates hardware performance monitoring (HPM, via Pfmon) with dynamic instrumentation (PIN) • Also supports fast performance simulation • Identifies program phases (with coarse and fine granularity) • Generates simulation strings that capture representative program behaviors

  4. Pintos Framework • [Figure: two control flows. Optimization control flow: program, pfmon profile, profile analysis, opt targets, PIN-based analysis with cache simulation, filtered opt targets. Simulation control flow: program, pfmon profile, profile analysis, phase targets, PIN-based phase detection, phase info, simulation string generation, simulation strings.]

  5. Our Background • ADORE dynamic optimization system • [Figure: ADORE architecture. Main thread and code cache; dynamic optimization thread with phase detection, trace selection, optimization, and deployment; kernel / Pfmon; hardware performance monitoring unit.]

  6. ADORE Performance: Speedup of ORC2.1 +O2 Compiled SPEC2000 Benchmarks

  7. ADORE Performance at Different Sampling Rates

  8. Future Enhancements to ADORE • I-cache prefetching • Helper-thread-based optimizations • Value-prediction-based optimizations • Dynamically undoing aggressive optimizations (e.g. control/data speculation, indirect array prefetches) • Software branch prediction

  9. What can Pintos do for us? • Pintos uses pfmon to identify high-level performance problems (e.g. I-cache misses) and to locate target code (phases) for optimization • Pintos then uses a PIN-based analysis tool to focus on the target code (phases) and conduct deep analysis • Pintos provides a framework for deep analysis of program behavior, so that we can experiment with new object code optimization techniques and feed them to ADORE • Simulation strings can be generated by Pintos and used for more efficient micro-architecture simulations

  10. Phase based Optimization and Simulation • In Pintos, a phase is a sequence of code that consistently exhibits certain performance behaviors, for example: • Gzip shows consistent and repeated data cache miss patterns • Crafty exhibits consistent I-cache misses • A repeating phase can serve as a unit for dynamic and adaptive optimization, or for fast performance simulation • The optimization unit can be a basic block, trace, procedure, or region (loops and loop nests, including complex control transfers) • The simulation unit can be an extended code sequence

  11. Phase Detection • One phase detection method doesn’t fit all needs. • Dynamic data cache prefetching requires coarse grain phases (e.g. loops) while dynamic I-cache prefetching requires fine-grain phases (e.g. frequent calling paths). • A phase tuple is used to determine the current point of execution in PIN instrumentation • Phase tuple: (phase ID #, ip addr, # of retired insts)
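A minimal sketch of how such a phase tuple might be matched during instrumentation. The slide only names the three fields; the `PhaseTuple` and `PhaseTracker` names, and the rule that a phase is entered when the instruction pointer matches and the retired-instruction count has reached the recorded value, are assumptions made for illustration. This is plain Python, not the PIN API.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseTuple:
    phase_id: int       # phase ID number
    ip_addr: int        # instruction pointer address that marks the phase
    retired_insts: int  # number of retired instructions at that point


class PhaseTracker:
    """Hypothetical matcher: walks the tuples in order of retired instructions."""

    def __init__(self, tuples):
        self._pending = deque(sorted(tuples, key=lambda t: t.retired_insts))
        self.retired = 0
        self.current_phase = None

    def on_instruction(self, ip):
        """Call once per retired instruction; returns the phase ID entered, if any."""
        self.retired += 1
        if self._pending:
            nxt = self._pending[0]
            # Assumed matching rule: same address, and enough instructions retired.
            if ip == nxt.ip_addr and self.retired >= nxt.retired_insts:
                self._pending.popleft()
                self.current_phase = nxt.phase_id
                return nxt.phase_id
        return None
```

In a real tool the per-instruction callback would be an analysis routine inserted by the instrumentation layer; here it is called directly so the matching logic can be read on its own.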

  12. Pintos for Optimization (I-Prefetch) • Many applications still suffer from significant I-cache misses (e.g. database apps, some SPEC CPU2000 benchmarks, etc.) • Predictable call sequences result in a relatively low miss rate • Complex control flows cause a high miss rate from streaming prefetches

  13. I-Cache Miss Analysis (pfmon) • Miss address based info, Crafty (2125/4760000): 25%: 30 (1.41%); 50%: 91 (4.28%); 75%: 228 (10.73%); 90%: 442 (20.80%). Each top miss PC was caused by 10-40 different paths. • Path based info, Crafty (8016/4760000): 25%: 28 (0.34%); 50%: 126 (1.57%); 75%: 436 (5.43%); 90%: 1118 (13.94%). Each top path leading to an I-cache miss has 1-2 possible prefetch targets. • The data show that we can reduce the points of interest for instruction prefetching
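The table above can be read as a coverage table: how many distinct miss PCs (or paths) are needed to account for 25%, 50%, 75%, and 90% of the sampled misses. That reading is an assumption, although the percentages agree with it (30 of 2125 is about 1.41%). The sketch below shows one way to compute such a table from a list of sampled keys; `coverage_table` and its input format are hypothetical and not taken from pfmon.

```python
from collections import Counter


def coverage_table(miss_samples, thresholds=(25, 50, 75, 90)):
    """For each threshold (integer percent of all samples), return
    (threshold, keys_needed, fraction_of_distinct_keys).

    miss_samples is an iterable of hashable keys, e.g. miss PCs or path IDs,
    one entry per sampled I-cache miss.
    """
    counts = Counter(miss_samples)
    ranked = sorted(counts.values(), reverse=True)  # hottest keys first
    total = sum(ranked)
    rows = []
    for pct in thresholds:
        covered = 0
        needed = 0
        for c in ranked:
            # Integer comparison avoids floating-point rounding at the boundary.
            if covered * 100 >= pct * total:
                break
            covered += c
            needed += 1
        rows.append((pct, needed, needed / len(ranked)))
    return rows
```

A small number of keys covering most samples, as in the Crafty figures, is what makes it practical to restrict deeper analysis to a short list of targets.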

  14. Exploring prospective points of instruction prefetching (PIN) • Pintos generates prospective paths leading to frequent I-cache misses by analyzing the pfmon profile • The PIN instrumentation routine constructs a control flow graph and simulates the instruction cache along the execution • It inserts I-cache prefetching instructions for the prospective paths based on control flow edge weight and estimated cache replacement • [Figure: paths frequently causing I-cache misses, drawn on a control flow graph of blocks B1-B8, feed an instruction cache simulator]

  15. Exploring prospective points of instruction prefetching (PIN) • Key observation • Most I-cache misses happen in the cache lines that follow the entry to or the return from a function call • L1I cache misses are mostly capacity misses, so we need to estimate how a prefetch affects the incoming instruction stream • Key idea • Run ahead by exploring the CFG with the I-cache simulator • Evaluate the prospective paths given by Pintos • [Figure: paths frequently causing I-cache misses, drawn on a control flow graph of blocks B1-B8, feed an instruction cache simulator]
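The run-ahead idea can be illustrated with a small sketch: copy the current cache state, optionally touch the lines of a prefetch candidate, then replay a prospective path and count misses. Everything here is an assumption made for illustration: the `ICache` class, the LRU replacement policy, the cache geometry, and the block addresses are hypothetical, and the slides do not describe the simulator's internals.

```python
import copy
from collections import OrderedDict


class ICache:
    """Tiny set-associative instruction cache with LRU replacement (assumed policy)."""

    def __init__(self, num_sets=4, ways=2, line_size=64):
        self.num_sets, self.ways, self.line_size = num_sets, ways, line_size
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, addr):
        """Touch the line holding addr; return True on a hit, False on a miss."""
        line = addr // self.line_size
        s = self.sets[line % self.num_sets]
        if line in s:
            s.move_to_end(line)    # mark as most recently used
            return True
        if len(s) >= self.ways:
            s.popitem(last=False)  # evict the least recently used line
        s[line] = True
        return False


def block_line_addrs(block, line_size):
    """Addresses of the cache lines covered by a (start_addr, size_bytes) block."""
    start, size = block
    return range(start - start % line_size, start + size, line_size)


def misses_on_path(cache, blocks, path, prefetch=()):
    """Run ahead on a copy of the cache and count misses along the path."""
    sim = copy.deepcopy(cache)     # the real cache state is left untouched
    for name in prefetch:          # model the prefetch as an early access
        for addr in block_line_addrs(blocks[name], sim.line_size):
            sim.access(addr)
    misses = 0
    for name in path:
        for addr in block_line_addrs(blocks[name], sim.line_size):
            if not sim.access(addr):
                misses += 1
    return misses
```

Comparing `misses_on_path(cache, blocks, path)` with and without a `prefetch` argument gives a rough estimate of whether a candidate helps or merely displaces lines that the path still needs, which matters when most misses are capacity misses.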

  16. Pintos for Fast Simulation • Execution-driven micro-architectural simulation is commonly used for evaluating new micro-architecture features and the corresponding code optimizations • A complete simulation often takes too long, so new methods for fast simulation, such as SimPoint and SMARTS, have been proposed • PASS (Phase Aware Stratified Sampling) is a different way to generate representative and customized traces for targeted simulations

  17. Fast Simulation Techniques • Truncated Execution • Run Z, FastForward-W-R • Sampling • SMARTS • SIMPOINT • Stratified Sampling • Reduced Input Sets • MinneSPEC

  18. Problems of Previous Works • Truncated execution gives very inaccurate results • Reduced input sets do not always behave the same as the reference inputs, so performance estimates based on reduced input sets may be misleading

  19. Mechanism of SMARTS • [Figure: program run time divided into repeating periods of W, U, and (K-1)*U] • W: warm-up time (fixed to 2000 instructions for SPEC 2000) • U: detailed simulation (fixed to 1000 instructions for SPEC 2000) • (K-1)*U: functional simulation with functional warming (the tool gives the value of K for which the IPC will be within ±3% of the actual value with 99.7% confidence)
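How a value of K might follow from the ±3% / 99.7% target can be sketched with the textbook sample-size rule for estimating a mean, n ≥ (z·V/ε)², where V is the coefficient of variation of the per-unit measurement, ε the relative error, and z ≈ 3 for 99.7% confidence. This is a sketch under those assumptions, not the tool's actual procedure; the function names and the way K is derived from n are illustrative.

```python
import math


def required_samples(cv, rel_error=0.03, z=3.0):
    """Minimum number of sampling units so that the mean is within rel_error
    at the confidence level implied by z (z = 3 is roughly 99.7%)."""
    return math.ceil((z * cv / rel_error) ** 2)


def sampling_interval(total_insts, unit_size, n_samples):
    """Assumed derivation of K: take one detailed unit out of every K units."""
    total_units = total_insts // unit_size
    return max(1, total_units // n_samples)


# Hypothetical inputs: 10 billion instructions, U = 1000, measured cv of 0.315.
n = required_samples(0.315)
k = sampling_interval(10_000_000_000, 1000, n)
```

The required sample count depends only on the variability of the measurement, not on program length, which is why long benchmarks can be sampled very sparsely.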

  20. Issues in Previous Work • SMARTS: the values of U and W are fixed for the SPEC 2000 suite and have to be identified again for every new benchmark suite (very time consuming) • SMARTS: oversampling in steady phases; it does not effectively exploit the existence of phases in programs • SIMPOINT: the user chooses the length of the simulation point (100 million, 10 million, or 1 million instructions) • SIMPOINT: provides simulation points based on clustering of basic block profiles, which are generated using sim-fast or ATOM

  21. Phase Aware Stratified Sampling (PASS) • Deploy a hierarchical method to detect coarse and fine grain program phases: (1) tracking the calling stack (stable bottom = coarse grain phase), inter-procedure; (2) detecting loops within the procedure, intra-procedure; (3) tracking data access patterns such as strides within loops (fine grain phases) • Select stratified samples from each phase until high statistical confidence is reached
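The "sample each phase until confident" step can be sketched as follows. Each phase is treated as a stratum; samples are drawn until the confidence interval of that stratum's mean is narrow enough, and the per-phase means are combined using each phase's share of executed instructions. The stopping rule, the thresholds, the use of CPI as the metric, and all names are assumptions made for illustration, not details taken from PASS.

```python
import math
import statistics


def sample_phase(draw, rel_error=0.03, z=3.0, min_samples=5, max_samples=1000):
    """Draw CPI samples from one phase until the z-sigma confidence interval
    half-width is within rel_error of the mean (or max_samples is reached)."""
    values = [draw() for _ in range(min_samples)]
    while len(values) < max_samples:
        mean = statistics.mean(values)
        half_width = z * statistics.stdev(values) / math.sqrt(len(values))
        if half_width <= rel_error * abs(mean):
            break
        values.append(draw())
    return statistics.mean(values), len(values)


def stratified_estimate(phases):
    """phases: list of (weight, draw) pairs, where weight is the phase's share
    of executed instructions and draw() returns one CPI sample from it.
    Returns (estimated overall CPI, total number of samples taken)."""
    total_weight = sum(w for w, _ in phases)
    estimate, used = 0.0, 0
    for weight, draw in phases:
        mean, n = sample_phase(draw)
        estimate += (weight / total_weight) * mean
        used += n
    return estimate, used
```

A steady phase stops after the minimum number of samples while a noisy one keeps sampling, which is how stratification avoids the oversampling of steady phases. CPI is used rather than IPC because instruction-weighted averages of CPI combine correctly across phases.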

  22. IPC vs SimPoint (cc1-166, 1 million insts)

  23. IPC vs Phase Classification on PASS (cc1-166, 1 million insts)

  24. IPC vs SimPoint (cc1-166, 250 million insts)

  25. IPC vs SimPoint (gzip-source, 1 million insts)

  26. IPC vs Phase Classification on PASS (gzip-source, 1 million insts)

  27. IPC vs SimPoint (gzip-source, 250 million insts)

  28. IPC vs SimPoint (mcf-ref, 1 million insts)

  29. IPC vs Phase Classification on PASS (mcf-ref)

  30. IPC vs SimPoint (mcf-ref, 250 million insts)

  31. IPC vs Phase Classification on PASS (gap-ref, 1 million insts)

  32. IPC vs SimPoint (gap-ref, 250 million insts)

  33. Summary • We show how HPM sampling (Pfmon) and dynamic instrumentation (Pin) are combined in our research framework (Pintos) for adaptive object code optimization and micro-architectural simulation • PASS (Phase Aware Stratified Sampling) may lead to a more efficient way of simulating the interaction between compiler optimizations and new micro-architectural features
