Profile-Guided Optimization Targeting High Performance Embedded Applications
Presentation Transcript

  1. Profile-Guided Optimization Targeting High Performance Embedded Applications
  David Kaeli, Murat Bicer, Efe Yardimci
  Center for Subsurface Sensing and Imaging Systems (CenSSIS), Northeastern University
  Jeffrey Smith, Mercury Computer

  2. Why Consider Using Profile-Guided Optimization?
  • Much of the potential performance available on data-parallel systems cannot be obtained due to unpredictable control flow and data flow in programs
  • Memory system performance continues to dominate the performance of many data-parallel applications
  • Program profiles provide clues to the compiler/linker/runtime to:
    • Enable more aggressive use of interprocedural optimizations
    • Eliminate bottlenecks in the data flow/control flow
    • Improve a program’s layout on the available memory hierarchy
  • Applications can then be developed at higher levels of programming abstraction (e.g., from UML) and tuned for performance later

  3. Profile Guidance
  • Obtain run-time profiles in the form of:
    • Procedure call graphs, basic block traversals
    • Program variable value profiles
    • Hardware performance counters (using PCL)
      • Cache and TLB misses, pipeline stalls, heap allocations, synchronization messages
  • Utilize run-time profiles as input to:
    • Provide user feedback (e.g., program visualization)
    • Perform profile-driven compilation (recompile using the profile)
    • Enable dynamic optimization (just-in-time compilation)
    • Evaluate software testing coverage
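
As a minimal illustration of the first bullet, a procedure call graph with call frequencies can be gathered by simple instrumentation. The sketch below does this in Python with a decorator; it is illustrative only — the tools listed on the next slide collect such profiles at the binary or hardware-counter level.

```python
from collections import Counter
import functools

call_counts = Counter()   # procedure -> invocation count (call-graph nodes)
edge_counts = Counter()   # (caller, callee) -> call frequency (weighted edges)
_stack = ["<main>"]       # current stack of instrumented procedures

def profiled(fn):
    """Wrap a procedure so each call records a node and a caller->callee edge."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        call_counts[fn.__name__] += 1
        edge_counts[(_stack[-1], fn.__name__)] += 1
        _stack.append(fn.__name__)
        try:
            return fn(*args, **kwargs)
        finally:
            _stack.pop()
    return wrapper

@profiled
def helper():
    return 1

@profiled
def work():
    return helper() + helper()

work()
# edge_counts now records ('<main>', 'work') once and ('work', 'helper') twice
```

The resulting weighted edges are exactly the input the cache line coloring algorithm on slide 9 consumes.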

  4. Profiling Tools
  • Mercury Tools
    • TATL – Trace Analysis Tool and Library
    • Procedure profiles
  • Gnu gprof
  • PowerPC Performance Counters
    • PCL – Performance Counter Library
    • PM API – targeting the PowerPC
  • Greenhills Compiler
    • MULTI profiling support
  • Custom instrumentation drivers

  5. Data Parallel Applications
  [Diagram: data-parallel programs (SAR, Software Defined Radio, MRI) pass through the compiler to a program binary; a program run produces a program profile (counter values, program paths, variable values) that is fed back to drive compile-time and binary-level optimizations]

  6. Target Optimizations
  • Compile-time
    • Aggressive procedure inlining
    • Aggressive constant propagation
    • Program variable specialization
    • Procedure cloning
    • Removal of redundant loads/stores
  • Link-time
    • Code reordering utilizing coloring
    • Static data reordering
  • Dynamic (during runtime)
    • Heap layout optimization
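
To make procedure cloning and constant propagation concrete, the hypothetical sketch below specializes a procedure for a hot parameter value reported by a value profile. The function names and the "95% of calls" figure are illustrative, not from the source.

```python
def scale(data, factor):
    # General version: factor is unknown until run time
    return [x * factor for x in data]

# Suppose value profiling showed factor == 2 on ~95% of calls (hypothetical
# profile). The compiler can then emit a specialized clone with the constant
# propagated, enabling strength reduction (multiply becomes add):
def scale_by_2(data):
    return [x + x for x in data]

def scale_dispatch(data, factor):
    # The call site is rewritten to test for the hot value and fall back
    # to the general clone otherwise
    return scale_by_2(data) if factor == 2 else scale(data, factor)
```

The same pattern underlies program variable specialization: the profile tells the compiler which constant is worth cloning for.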

  7. Memory Performance is Key to Scalability in Data-Parallel Applications
  • The performance gap between processor technology and memory technology continues to grow
  • Hierarchical memory systems (multi-level caches) have been used to bridge this gap
  • Embedded processing applications place a heavy burden on the supporting memory system
  • Applications will need to adapt (potentially dynamically) to better utilize the available memory system

  8. Cache Line Coloring
  • Attempts to reorder a program executable by coloring the cache space, avoiding caller-callee conflicts in the cache
  • Can be driven by either statically-generated call graphs or profile data
  • Improves upon the work of Pettis and Hansen by considering the organization of the cache space (i.e., cache size, line size, associativity)
  • Can be used with different levels of granularity (procedures, basic blocks) and both intra- and inter-procedurally

  9. Cache Line Coloring Algorithm
  • Build program call graph
    • nodes represent procedures
    • edges represent calls
    • edge weights represent call frequencies
  • Prune edges based on a threshold value
  • Sort graph edges and process in decreasing edge weight order
  • Place procedures in the cache space, avoiding color conflicts
  • Fill in gaps with remaining procedures
  • Reduces execution time by up to 49% for data compression algorithms
  [Figure: example call graph with nodes A, B, and E and edge weights 90 and 40]
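
The steps above can be sketched as follows. This is a simplified model with toy sizes, assuming a direct-mapped cache and procedures measured in whole lines; the real algorithm also accounts for line size and associativity, and this sketch only checks conflicts against the current edge's partner rather than all placed neighbors.

```python
CACHE_LINES = 4  # toy value: cache size / line size

def color_layout(edges, sizes, threshold=1):
    """edges: {(caller, callee): call frequency}; sizes: {proc: size in lines}.
    Returns {proc: start address in lines}, chosen greedily so heavy
    caller/callee pairs occupy disjoint colors (address mod CACHE_LINES)."""
    placed = {}

    def colors(proc):
        start = placed[proc]
        return {(start + i) % CACHE_LINES for i in range(sizes[proc])}

    next_free = 0
    # process edges in decreasing weight order, pruning light edges
    for (a, b), w in sorted(edges.items(), key=lambda e: -e[1]):
        if w < threshold:
            continue
        for proc in (a, b):
            if proc in placed:
                continue
            # slide the start address until it avoids the colors of the
            # already-placed partner on this edge
            start = next_free
            while True:
                cand = {(start + i) % CACHE_LINES for i in range(sizes[proc])}
                partners = [p for p in (a, b) if p != proc and p in placed]
                if all(cand.isdisjoint(colors(p)) for p in partners):
                    break
                start += 1
            placed[proc] = start
            next_free = max(next_free, start + sizes[proc])
    return placed

# Example from the figure: A calls B (weight 90) and E (weight 40)
layout = color_layout({('A', 'B'): 90, ('A', 'E'): 40},
                      {'A': 2, 'B': 2, 'E': 2})
```

Sliding a procedure's start address creates gaps in the layout; the "fill in gaps" step of the slide (not modeled here) places the remaining unprofiled procedures into them.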

  10. Data Memory Access
  • A disproportionate number of data cache misses are caused by accesses to dynamically allocated (heap) memory
  • Increases in cache size do not effectively reduce data cache misses caused by heap accesses
  • A small number of objects account for a large percentage of heap misses (90/10 rule)
  • Existing memory allocation routines tend to balance allocation speed and memory usage (locality preservation has not been a major concern)

  11. Miss rates (%) vs. Cache Configurations

  12. Profile-driven Data Layout
  • We have developed a profile-guided approach to allocating heap objects to improve heap behavior
  • The idea is to use existing knowledge of the computing platform (e.g., cache organization), combined with profile data, to enable the target application to execute more efficiently
  • Mapping temporally local memory blocks possessing high reference counts to the same cache area will generate a significant number of cache misses
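
For concreteness, the "cache area" of a heap block can be modeled as its set index in a direct-mapped cache. The parameters below are illustrative, not taken from the platforms in the talk.

```python
LINE_SIZE = 32   # bytes per cache line (illustrative)
NUM_SETS = 256   # cache size / line size (illustrative)

def cache_color(addr):
    """Set index ('color') that a byte address maps to in a
    direct-mapped cache."""
    return (addr // LINE_SIZE) % NUM_SETS

# Two hot blocks whose addresses differ by a multiple of the cache size
# share a color and evict each other; the layout goal is to give
# temporally local, highly referenced blocks distinct colors.
```
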

  13. Allocation
  • We have developed our own malloc routine which uses a conflict profile to avoid allocating potentially conflicting addresses
  • A multi-step allocation algorithm is repeated until a non-conflicting allocation is made
  • If all steps produce conflicts, allocation is made within the wilderness region
  • If conflicts still occur in the wilderness region, we allocate these conflicting chunks (creating a hole)
    • Allocation occurs at the first non-conflicting address after the chunk
    • The hole is immediately freed, causing minimal space wastage (though possibly some limited fragmentation)
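
A simulation of the idea (a sketch, not the authors' malloc): candidate addresses are tried in turn against a conflict profile of hot cache sets, with a wilderness fallback that steps past conflicting space, mimicking the allocate-then-free "hole".

```python
LINE = 32   # bytes per cache line (toy values)
SETS = 64   # number of cache sets

def colors_of(addr, size):
    """Cache sets touched by a block of `size` bytes at `addr`."""
    first = addr // LINE
    last = (addr + size - 1) // LINE
    return {line % SETS for line in range(first, last + 1)}

def allocate(size, hot_sets, candidates, wilderness_base):
    """hot_sets: conflict profile (sets occupied by hot live objects).
    Try each candidate address in turn (the multi-step search); if every
    step conflicts, fall back to the wilderness region, skipping past
    conflicting space as if a hole had been allocated and freed."""
    for addr in candidates:
        if colors_of(addr, size).isdisjoint(hot_sets):
            return addr
    addr = wilderness_base
    while not colors_of(addr, size).isdisjoint(hot_sets):
        addr += LINE  # step past the conflicting 'hole'
    return addr
```

In the real allocator the hole is genuinely allocated and immediately freed so the free lists stay consistent; the simulation only reproduces the resulting placement.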

  14. Runtime improvements over non-optimized heap layout

  15. Future Work
  • Present algorithms have only been evaluated on uniprocessor platforms
  • Follow-on work will target Mercury RACE multiprocessor systems
  • Target applications will include:
    • FM3TR for Software Defined Radio
    • Steepest Descent Fast Multipole Method (SDFMM) for demining applications

  16. Related Publications
  • “Improving the Performance of Heap-based Memory Access,” E. Yardimci and D. Kaeli, Proc. of the Workshop on Memory Performance Issues, June 2001.
  • “Accurate Simulation and Evaluation of Code Reordering,” J. Kalamatianos and D. Kaeli, Proc. of the IEEE International Symposium on the Performance Analysis of Systems and Software, May 2000.
  • “Model Based Parallel Programming with Profile-Guided Application Optimization,” J. Smith and D. Kaeli, Proc. of the 4th Annual High Performance Embedded Computing Workshop, MIT Lincoln Labs, Lexington, MA, September 2000, pp. 85-86.
  • “Cache Line Coloring Using Real and Estimated Profiles,” A. Hashemi, J. Kalamatianos, D. Kaeli and W. Meleis, Digital Technical Journal, Special Issue on Tools and Languages, February 1999.
  • “Parameter Value Characterization of Windows NT-based Applications,” J. Kalamatianos and D. Kaeli, Workload Characterization: Methodology and Case Studies, IEEE Computer Society, 1999, pp. 142-149.

  17. Related Publications (cont.)
  • “Analysis of Temporal-based Program Behavior for Improved Instruction Cache Performance,” J. Kalamatianos, A. Khalafi, H. Hashemi, D. Kaeli and W. Meleis, IEEE Transactions on Computers, Vol. 10, No. 2, February 1999, pp. 168-175.
  • “Memory Architecture Dependent Program Mapping,” B. Calder, A. Hashemi, and D. Kaeli, US Patent No. 5,963,972, October 5, 1999.
  • “Temporal-based Procedure Reordering for Improved Instruction Cache Performance,” Proc. of the 4th HPCA, Feb. 1998, pp. 244-253.
  • “Efficient Procedure Mapping Using Cache Line Coloring,” H. Hashemi, D. Kaeli and B. Calder, Proc. of PLDI’97, June 1997, pp. 171-182.
  • “Procedure Mapping Using Static Call Graph Estimation,” Proc. of the Workshop on the Interaction Between Compilers and Computer Architecture, TCCA News, 1997.