Profile-Guided Optimization Targeting High Performance Embedded Applications
Presentation Transcript

  1. Profile-Guided Optimization Targeting High Performance Embedded Applications
  David Kaeli, Murat Bicer, Efe Yardimci
  Center for Subsurface Sensing and Imaging Systems (CenSSIS), Northeastern University
  Jeffrey Smith, Mercury Computer

  2. Why Consider Using Profile-Guided Optimization?
  • Much of the potential performance available on data-parallel systems cannot be obtained due to unpredictable control flow and data flow in programs
  • Memory system performance continues to dominate the performance of many data-parallel applications
  • Program profiles provide clues to the compiler/linker/runtime to:
    • Enable more aggressive use of interprocedural optimizations
    • Eliminate bottlenecks in the data flow/control flow
    • Improve a program’s layout on the available memory hierarchy
  • Applications can then be developed at higher levels of programming abstraction (e.g., from UML) and tuned for performance later

  3. Profile Guidance
  • Obtain run-time profiles in the form of:
    • Procedure call graphs, basic block traversals
    • Program variable value profiles
    • Hardware performance counters (using PCL)
      • Cache and TLB misses, pipeline stalls, heap allocations, synchronization messages
  • Utilize run-time profiles as input to:
    • Provide user feedback (e.g., program visualization)
    • Perform profile-driven compilation (recompile using the profile)
    • Enable dynamic optimization (just-in-time compilation)
    • Evaluate software testing coverage
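
As a minimal illustration of the first bullet, a procedure call graph with call frequencies can be gathered by simple instrumentation. The sketch below does this in Python with a decorator; it is illustrative only — the tools listed on the next slide collect such profiles at the binary or hardware-counter level.

```python
from collections import Counter
import functools

call_counts = Counter()   # procedure -> invocation count (call-graph nodes)
edge_counts = Counter()   # (caller, callee) -> call frequency (weighted edges)
_stack = ["<main>"]       # current stack of instrumented procedures

def profiled(fn):
    """Wrap a procedure so each call records a node and a caller->callee edge."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        call_counts[fn.__name__] += 1
        edge_counts[(_stack[-1], fn.__name__)] += 1
        _stack.append(fn.__name__)
        try:
            return fn(*args, **kwargs)
        finally:
            _stack.pop()
    return wrapper

@profiled
def helper():
    return 1

@profiled
def work():
    return helper() + helper()

work()
# edge_counts now records ('<main>', 'work') once and ('work', 'helper') twice
```

The resulting weighted edges are exactly the input the cache line coloring algorithm on slide 9 consumes.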

  4. Profiling Tools
  • Mercury Tools
    • TATL – Trace Analysis Tool and Library
    • Procedure profiles
  • Gnu gprof
  • PowerPC Performance Counters
    • PCL – Performance Counter Library
    • PM API – targeting the PowerPC
  • Greenhills Compiler
    • MULTI profiling support
  • Custom instrumentation drivers

  5. Data Parallel Applications
  [Diagram: data-parallel programs (SAR, Software Defined Radio, MRI) pass through the compiler to a program binary; a program run produces a program profile (counter values, program paths, variable values) that is fed back to drive compile-time and binary-level optimizations]

  6. Target Optimizations
  • Compile-time
    • Aggressive procedure inlining
    • Aggressive constant propagation
    • Program variable specialization
    • Procedure cloning
    • Removal of redundant loads/stores
  • Link-time
    • Code reordering utilizing coloring
    • Static data reordering
  • Dynamic (during runtime)
    • Heap layout optimization
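
To make procedure cloning and constant propagation concrete, the hypothetical sketch below specializes a procedure for a hot parameter value reported by a value profile. The function names and the "95% of calls" figure are illustrative, not from the source.

```python
def scale(data, factor):
    # General version: factor is unknown until run time
    return [x * factor for x in data]

# Suppose value profiling showed factor == 2 on ~95% of calls (hypothetical
# profile). The compiler can then emit a specialized clone with the constant
# propagated, enabling strength reduction (multiply becomes add):
def scale_by_2(data):
    return [x + x for x in data]

def scale_dispatch(data, factor):
    # The call site is rewritten to test for the hot value and fall back
    # to the general clone otherwise
    return scale_by_2(data) if factor == 2 else scale(data, factor)
```

The same pattern underlies program variable specialization: the profile tells the compiler which constant is worth cloning for.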

  7. Memory Performance is Key to Scalability in Data-Parallel Applications
  • The performance gap between processor technology and memory technology continues to grow
  • Hierarchical memory systems (multi-level caches) have been used to bridge this gap
  • Embedded processing applications place a heavy burden on the supporting memory system
  • Applications will need to adapt (potentially dynamically) to better utilize the available memory system

  8. Cache Line Coloring
  • Attempts to reorder a program executable by coloring the cache space, avoiding caller-callee conflicts in the cache
  • Can be driven by either statically-generated call graphs or profile data
  • Improves upon the work of Pettis and Hansen by considering the organization of the cache space (i.e., cache size, line size, associativity)
  • Can be used with different levels of granularity (procedures, basic blocks) and both intra- and inter-procedurally

  9. Cache Line Coloring Algorithm
  • Build program call graph
    • nodes represent procedures
    • edges represent calls
    • edge weights represent call frequencies
  • Prune edges based on a threshold value
  • Sort graph edges and process in decreasing edge weight order
  • Place procedures in the cache space, avoiding color conflicts
  • Fill in gaps with remaining procedures
  • Reduces execution time by up to 49% for data compression algorithms
  [Figure: example call graph with nodes A, B, and E and edge weights 90 and 40]
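
The steps above can be sketched as follows. This is a simplified model with toy sizes, assuming a direct-mapped cache and procedures measured in whole lines; the real algorithm also accounts for line size and associativity, and this sketch only checks conflicts against the current edge's partner rather than all placed neighbors.

```python
CACHE_LINES = 4  # toy value: cache size / line size

def color_layout(edges, sizes, threshold=1):
    """edges: {(caller, callee): call frequency}; sizes: {proc: size in lines}.
    Returns {proc: start address in lines}, chosen greedily so heavy
    caller/callee pairs occupy disjoint colors (address mod CACHE_LINES)."""
    placed = {}

    def colors(proc):
        start = placed[proc]
        return {(start + i) % CACHE_LINES for i in range(sizes[proc])}

    next_free = 0
    # process edges in decreasing weight order, pruning light edges
    for (a, b), w in sorted(edges.items(), key=lambda e: -e[1]):
        if w < threshold:
            continue
        for proc in (a, b):
            if proc in placed:
                continue
            # slide the start address until it avoids the colors of the
            # already-placed partner on this edge
            start = next_free
            while True:
                cand = {(start + i) % CACHE_LINES for i in range(sizes[proc])}
                partners = [p for p in (a, b) if p != proc and p in placed]
                if all(cand.isdisjoint(colors(p)) for p in partners):
                    break
                start += 1
            placed[proc] = start
            next_free = max(next_free, start + sizes[proc])
    return placed

# Example from the figure: A calls B (weight 90) and E (weight 40)
layout = color_layout({('A', 'B'): 90, ('A', 'E'): 40},
                      {'A': 2, 'B': 2, 'E': 2})
```

Sliding a procedure's start address creates gaps in the layout; the "fill in gaps" step of the slide (not modeled here) places the remaining unprofiled procedures into them.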

  10. Data Memory Access
  • A disproportionate number of data cache misses are caused by accesses to dynamically allocated (heap) memory
  • Increases in cache size do not effectively reduce data cache misses caused by heap accesses
  • A small number of objects account for a large percentage of heap misses (90/10 rule)
  • Existing memory allocation routines tend to balance allocation speed and memory usage (locality preservation has not been a major concern)

  11. Miss rates (%) vs. Cache Configurations

  12. Profile-driven Data Layout
  • We have developed a profile-guided approach to allocating heap objects to improve heap behavior
  • The idea is to use existing knowledge of the computing platform (e.g., cache organization), combined with profile data, to enable the target application to execute more efficiently
  • Mapping temporally local memory blocks possessing high reference counts to the same cache area will generate a significant number of cache misses
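
For concreteness, the "cache area" of a heap block can be modeled as its set index in a direct-mapped cache. The parameters below are illustrative, not taken from the platforms in the talk.

```python
LINE_SIZE = 32   # bytes per cache line (illustrative)
NUM_SETS = 256   # cache size / line size (illustrative)

def cache_color(addr):
    """Set index ('color') that a byte address maps to in a
    direct-mapped cache."""
    return (addr // LINE_SIZE) % NUM_SETS

# Two hot blocks whose addresses differ by a multiple of the cache size
# share a color and evict each other; the layout goal is to give
# temporally local, highly referenced blocks distinct colors.
```
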

  13. Allocation
  • We have developed our own malloc routine which uses a conflict profile to avoid allocating potentially conflicting addresses
  • A multi-step allocation algorithm is repeated until a non-conflicting allocation is made
  • If all steps produce conflicts, allocation is made within the wilderness region
  • If conflicts still occur in the wilderness region, we allocate these conflicting chunks (creating a hole)
    • Allocation occurs at the first non-conflicting address after the chunk
    • The hole is immediately freed, causing minimal space wastage (though possibly some limited fragmentation)
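
A simulation of the idea (a sketch, not the authors' malloc): candidate addresses are tried in turn against a conflict profile of hot cache sets, with a wilderness fallback that steps past conflicting space, mimicking the allocate-then-free "hole".

```python
LINE = 32   # bytes per cache line (toy values)
SETS = 64   # number of cache sets

def colors_of(addr, size):
    """Cache sets touched by a block of `size` bytes at `addr`."""
    first = addr // LINE
    last = (addr + size - 1) // LINE
    return {line % SETS for line in range(first, last + 1)}

def allocate(size, hot_sets, candidates, wilderness_base):
    """hot_sets: conflict profile (sets occupied by hot live objects).
    Try each candidate address in turn (the multi-step search); if every
    step conflicts, fall back to the wilderness region, skipping past
    conflicting space as if a hole had been allocated and freed."""
    for addr in candidates:
        if colors_of(addr, size).isdisjoint(hot_sets):
            return addr
    addr = wilderness_base
    while not colors_of(addr, size).isdisjoint(hot_sets):
        addr += LINE  # step past the conflicting 'hole'
    return addr
```

In the real allocator the hole is genuinely allocated and immediately freed so the free lists stay consistent; the simulation only reproduces the resulting placement.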

  14. Runtime improvements over non-optimized heap layout

  15. Future Work
  • Present algorithms have only been evaluated on uniprocessor platforms
  • Follow-on work will target Mercury RACE multiprocessor systems
  • Target applications will include:
    • FM3TR for Software Defined Radio
    • Steepest Descent Fast Multipole Method (SDFMM) for demining applications

  16. Related Publications
  • “Improving the Performance of Heap-based Memory Access,” E. Yardimci and D. Kaeli, Proc. of the Workshop on Memory Performance Issues, June 2001.
  • “Accurate Simulation and Evaluation of Code Reordering,” J. Kalamatianos and D. Kaeli, Proc. of the IEEE International Symposium on the Performance Analysis of Systems and Software, May 2000.
  • “Model Based Parallel Programming with Profile-Guided Application Optimization,” J. Smith and D. Kaeli, Proc. of the 4th Annual High Performance Embedded Computing Workshop, MIT Lincoln Labs, Lexington, MA, September 2000, pp. 85-86.
  • “Cache Line Coloring Using Real and Estimated Profiles,” A. Hashemi, J. Kalamatianos, D. Kaeli and W. Meleis, Digital Technical Journal, Special Issue on Tools and Languages, February 1999.
  • “Parameter Value Characterization of Windows NT-based Applications,” J. Kalamatianos and D. Kaeli, Workload Characterization: Methodology and Case Studies, IEEE Computer Society, 1999, pp. 142-149.

  17. Related Publications (cont.)
  • “Analysis of Temporal-based Program Behavior for Improved Instruction Cache Performance,” J. Kalamatianos, A. Khalafi, H. Hashemi, D. Kaeli and W. Meleis, IEEE Transactions on Computers, Vol. 10, No. 2, February 1999, pp. 168-175.
  • “Memory Architecture Dependent Program Mapping,” B. Calder, A. Hashemi, and D. Kaeli, US Patent No. 5,963,972, October 5, 1999.
  • “Temporal-based Procedure Reordering for Improved Instruction Cache Performance,” Proc. of the 4th HPCA, Feb. 1998, pp. 244-253.
  • “Efficient Procedure Mapping Using Cache Line Coloring,” H. Hashemi, D. Kaeli and B. Calder, Proc. of PLDI’97, June 1997, pp. 171-182.
  • “Procedure Mapping Using Static Call Graph Estimation,” Proc. of the Workshop on the Interaction Between Compilers and Computer Architecture, TCCA News, 1997.