High Performance Analysis Tools and Optimization Workshop
Presentation Transcript
Performance Analysis, Tools and Optimization
Philip J. Mucci and Kevin S. London
University of Tennessee, Knoxville
ARL MSRC Users’ Group Meeting, September 2, 1998
PET, UT and You
• Training
• Environments
• Benchmarking
• Evaluation and Reviews
• Consulting
• Development
Training
• Courses on benchmarking, performance optimization, and parallel tools
• Provides a good mechanism for technology transfer
• Needs and direction develop from interaction with the user community
• Tremendous knowledge base from which to draw
Environments
• Use of the MSRC environments provides:
  • Bug reports to the vendor
  • System tuning
  • System administrator support
  • Analysis of software needs
  • Performance evaluation
  • Researcher access to advanced hardware
Performance Understanding
• In order to optimize, we must understand:
  • Why is our code performing a certain way?
  • What can be done about it?
  • How well can we do?
• Understanding results in confidence, efficiency, and better code development
• Time spent is an investment in the future
Tool Evaluation: Ptools Consortium
• Review of available performance tools, particularly parallel ones
• Regular reports are issued
• Tools that we find useful are presented to developers in training or consultation
• Installation, testing, and training
• Example: VAMPIR for scalability analysis
Optimization Course
• Course focuses on compiler options, available tools, and single-processor performance
• Single-processor performance, especially cache performance, is the biggest bottleneck in many codes
• Why? Link speeds have increased to within an order of magnitude of memory bandwidths
• Also covers MPI and language-specific issues
Benchmarks
• CacheBench - performance of the memory hierarchy (sketched below)
• MPBench - performance of core MPI operations
• BLASBench - performance of dense numerical kernels
• Intended to provide an orthogonal set of low-level benchmarks with which we can parameterize codes
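As a rough illustration of what CacheBench measures (a generic sketch, not the actual CacheBench source), the C loop below sweeps read-only vectors of increasing size and reports bandwidth; throughput drops each time the working set outgrows a level of the cache hierarchy:

```c
/* Generic CacheBench-style read benchmark (illustrative sketch only):
 * time repeated sweeps over vectors of increasing size and report the
 * bandwidth observed at each working-set size. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

static double seconds(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void) {
    for (size_t bytes = 1024; bytes <= 8 * 1024 * 1024; bytes *= 2) {
        size_t n = bytes / sizeof(double);
        double *v = malloc(bytes);
        volatile double sum = 0.0;
        for (size_t i = 0; i < n; i++) v[i] = 1.0;   /* touch every page */
        double t0 = seconds();
        for (int rep = 0; rep < 100; rep++)          /* timed read sweeps */
            for (size_t i = 0; i < n; i++) sum += v[i];
        double t1 = seconds();
        printf("%8lu bytes: %8.1f MB/s\n", (unsigned long)bytes,
               100.0 * bytes / (t1 - t0) / 1e6);
        free(v);
    }
    return 0;
}
```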
Cache Performance
• Tuning for caches is difficult without some understanding of computer architecture
• There is no way to really know what is in the cache at a given point in an application
• A factor of 2-4 performance increase is common (see the loop example below)
• Develop a tool to help identify problem regions in the source code, down to a specific reference
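The factor of 2-4 is easiest to see in the classic loop-ordering example below (generic C, not taken from the course materials). C stores arrays row-major, so the rightmost index should vary fastest, the mirror image of the Fortran rule on a later slide:

```c
/* Illustrative loop-ordering example behind the "factor of 2-4" claim.
 * C arrays are row-major, so the rightmost index should vary fastest. */
#define N 1024
double a[N][N];

/* Cache-unfriendly: column-wise sweep, a stride of N doubles per access. */
double sum_by_columns(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* Cache-friendly: row-wise sweep, unit stride; commonly 2-4x faster. */
double sum_by_rows(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}
```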
Cache Simulator
• Profiling the code reveals cache problems
• Offending routines are instrumented automatically via a GUI, or by hand (see the sketch below)
• Link with the simulator library
• Create an architecture configuration file
• Addresses are traced and simulated
• Miss locations are recorded and reports are generated
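A hypothetical view of the instrumentation step; sim_init, sim_record, and sim_report are invented placeholder names, not the simulator library's actual interface:

```c
/* Hypothetical instrumentation sketch: sim_init(), sim_record(), and
 * sim_report() are invented placeholders for the simulator library's
 * real entry points. Each traced memory reference is passed to the
 * simulator, which models the cache described in the architecture
 * configuration file and records where misses occur. */
extern void sim_init(const char *arch_config_file);
extern void sim_record(const void *addr, const char *file, int line);
extern void sim_report(void);

double dot(const double *x, const double *y, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        sim_record(&x[i], __FILE__, __LINE__);   /* trace each load */
        sim_record(&y[i], __FILE__, __LINE__);
        s += x[i] * y[i];
    }
    return s;
}
/* Typical driver: sim_init("origin2000.cfg"); ... ; sim_report(); */
```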
PerfAPI
• A standardized interface to hardware performance counters (usage sketch below)
• Easily usable by application engineers as well as tool developers
• Intended for:
  • Performance tools
  • Evaluation
  • Modeling
• Watch http://www.cs.utk.edu/~mucci/pdsa
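The interface was still being specified at the time, so the sketch below uses invented names (perf_start, perf_read, and the event constants) purely to show the intended usage pattern; see the URL above for the real definition:

```c
/* Hypothetical usage sketch: perf_start(), perf_read(), and the event
 * names are invented placeholders, not the actual PerfAPI. The pattern
 * is the point: select events, start counting, run the region of
 * interest, then read the counters. */
#include <stdio.h>

enum perf_event { PERF_CYCLES, PERF_L2_MISSES };   /* invented events */
extern int perf_start(const enum perf_event *events, int nevents);
extern int perf_read(long long *counts, int nevents);

void measure(void (*kernel)(void)) {
    enum perf_event ev[2] = { PERF_CYCLES, PERF_L2_MISSES };
    long long counts[2];
    perf_start(ev, 2);
    kernel();                                      /* code under study */
    perf_read(counts, 2);
    printf("cycles=%lld  L2 misses=%lld\n", counts[0], counts[1]);
}
```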
High Performance Debugger
• Industry-wide lack of good debugging support for parallel programs
• TotalView is expensive and GUI-only
• Sufficient bandwidth is often not available off-site
• Based on dbx and gdb as backends
• Uses p2d2 from NASA as a framework
• Standardized, familiar command-line interface
MPI Connect
• Connects separate MPI jobs with PVM
• Three function calls to enroll
• Uses include:
  • Metacomputing with vendor MPI
  • Dynamic and fault-tolerant MPI jobs, available now
The Future
• BYOC Workshops
• Regular Training Schedule
• Web-Based Training
• Consulting
• Cross-MSRC Information Exchange
• Technology Transfer
• Tool Development
Origin 2000 Performance Prescription
• Always use dplace on all codes
• Always use -LNO:cache_size2=4096
• For accuracy, compile and link with -O2 -IPA -SWP:=ON -LNO -TENV:X=0-5
• or with -Ofast=ip27 -OPT:roundoff=0-3 -OPT:IEEE_arithmetic=1-3
Origin 2000 Performance Prescription
• In Fortran, the leftmost array index should change fastest in inner loops
• Use functions in -lcomplib.sgimath or -lscs, -lfastm, -lm
• Use the nonblocking MPI_Ixxxx primitives
• Always post IRECV early (see the sketch below)
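A minimal sketch of the last two points using standard MPI calls (MPI_Irecv, MPI_Send, MPI_Wait); the routine and buffer names are illustrative. Posting the receive before local work lets the incoming message land directly in the user buffer and overlaps communication with computation:

```c
/* "Post IRECV early" with standard MPI calls: the receive is posted
 * before any local work, so an incoming message can be delivered
 * straight into recvbuf instead of waiting in a system buffer. */
#include <mpi.h>

void exchange(double *sendbuf, double *recvbuf, int n,
              int partner, MPI_Comm comm) {
    MPI_Request req;
    MPI_Status status;

    /* Post the receive first. */
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, partner, 0, comm, &req);

    /* ... local computation that does not touch recvbuf ... */

    /* Send our data; the partner's early-posted receive absorbs it. */
    MPI_Send(sendbuf, n, MPI_DOUBLE, partner, 0, comm);

    /* Complete the receive before using recvbuf. */
    MPI_Wait(&req, &status);
}
```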