
Special Topics in Grid Enablement of Scientific Applications (CIS 6612)


Presentation Transcript


  1. Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Performance/Profiling Presented by Javier Figueroa Instructor: S. Masoud Sadjadi http://www.cs.fiu.edu/~sadjadi/Teaching/ sadjadi At cs Dot fiu Dot edu

  2. Presentation Outline • Performance • Benchmarking • Parallel Performance • Profiling • Dimemas • Paraver • Acknowledgements

  3. Why Performance? • To do a time-consuming operation in less time • I am an aircraft engineer • I need to run a simulation to test the stability of the wings at high speed • I’d rather have the result in 5 minutes than in 5 hours so that I can complete the aircraft’s final design sooner • To do an operation before a tight deadline • I am a weather prediction agency • I am getting input from weather stations/sensors • I’d like to make the forecast for tomorrow before tomorrow

  4. Why Performance? • To do a high number of operations per second • I am the CTO of Amazon.com • My Web server gets 1,000 hits per second • I’d like my Web server and my databases to handle 1,000 transactions per second so that customers do not experience delays • Also called scalability • Amazon does “process” several GBytes of data per second

  5. Performance - Software • “High Performance Code” • Performance concerns • Correctness • You will see that when trying to make code go fast one often breaks it • Readability • Fast code typically requires more lines! • Modularity can hurt performance • e.g., Too many classes • Portability • Code that is fast on machine A can be slow on machine B • At the extreme, highly optimized code is not portable at all, and in fact is done in hardware!

  6. Performance - Time • Time between the start and the end of an operation • Also called running time, elapsed time, wall-clock time, response time, latency, execution time, ... • Most straightforward measure: “my program takes 12.5s on a Pentium 3.5GHz” • Can be normalized to some reference time • Must be measured on a “dedicated” machine

  7. Performance - Rate • Used often so that performance can be independent of the “size” of the application • e.g., compressing a 1MB file takes 1 minute; compressing a 2MB file takes 2 minutes. The performance is the same. • Millions of instructions / sec (MIPS) • MIPS = instruction count / (execution time × 10^6) = clock rate / (CPI × 10^6) • But Instruction Set Architectures are not equivalent • 1 CISC instruction = many RISC instructions • Programs use different instruction mixes • May be OK for the same program on the same architecture
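The two MIPS formulas above can be checked with a short script; the instruction count, runtime, clock rate, and CPI below are made-up illustrative values, not from the slides:

```python
def mips(instruction_count, execution_time_s):
    # MIPS = instruction count / (execution time * 10^6)
    return instruction_count / (execution_time_s * 1e6)

def mips_from_clock(clock_rate_hz, cpi):
    # Equivalent form: clock rate / (CPI * 10^6)
    return clock_rate_hz / (cpi * 1e6)

# Hypothetical program: 5 billion instructions in 2.5 seconds
print(mips(5e9, 2.5))           # 2000.0
# Same machine through the clock-rate form: 2 GHz CPU with CPI = 1
print(mips_from_clock(2e9, 1))  # 2000.0
```

The two forms agree because execution time = instruction count × CPI / clock rate.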

  8. Performance – Rate cont. • Millions of floating point operations / sec (MFlops) • Very popular, but often misleading • e.g., a high MFlops rate in a badly designed algorithm could still mean poor application performance • Application-specific: • Millions of frames rendered per second • Millions of amino acids compared per second • Millions of HTTP requests served per second • Application-specific metrics are often preferable, as others may be misleading • MFlops can be application-specific though • For instance: • I want to add two n-element vectors • This requires 2*n Floating Point Operations • Therefore MFlops is a good measure

  9. What is “Peak” Performance? • Resource vendors always talk about peak performance rate • computed based on the specifications of the machine • For instance: • I build a machine with 2 floating point units • Each unit can do an operation in 2 cycles • My CPU runs at 1GHz • Therefore I have a 1 × 2 / 2 = 1 GFlops machine • Problem: • In real code you will never be able to use the two floating point units constantly • Data needs to come from memory, causing the floating point units to be idle • Typically, real code achieves only a (often small) fraction of the peak performance
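The slide's back-of-the-envelope peak calculation can be sketched as:

```python
def peak_gflops(num_fp_units, cycles_per_op, clock_ghz):
    # Each unit finishes one floating point op every cycles_per_op cycles,
    # so the peak rate is units * clock / cycles_per_op.
    return num_fp_units * clock_ghz / cycles_per_op

# The slide's machine: 2 FP units, 2 cycles per op, 1 GHz clock
print(peak_gflops(2, 2, 1.0))  # 1.0 GFlops
```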

  10. Performance - Benchmarks • Since many performance metrics turn out to be misleading, people have designed benchmarks • Example: SPEC Benchmark • Integer benchmark • Floating point benchmark • These benchmarks are typically a collection of several codes that come from “real-world software” • The question “what is a good benchmark?” is difficult • If the benchmarks do not correspond to what you’ll do with your computer, then the benchmark results are not relevant to you

  11. Performance – A Program • Concerned with improving a program’s performance • For a given platform, take a given program • Run it and measure its wall-clock time • Enhance it, run it again, and quantify the performance improvement • i.e., the reduction in wall-clock time • For each version compute its performance • preferably as a relevant performance rate • so that you can say: the best implementation we have so far goes “this fast” (perhaps as a % of the peak performance)

  12. Performance – Program Speedup • We need a metric to quantify the impact of a performance enhancement • Speedup: ratio of “old” time to “new” time • old time = 2h • new time = 1h • speedup = 2h / 1h = 2 • Sometimes one talks about a “slowdown” in case the “enhancement” is not beneficial • Happens more often than one thinks

  13. Parallel Performance • The notion of speedup is completely generic • By using a rice cooker I’ve achieved a 1.20 speedup for rice cooking • For parallel programs one defines the Parallel Speedup (we’ll just say “speedup”): • Parallel program takes time T1 on 1 processor • Parallel program takes time Tp on p processors • Parallel Speedup(p) = T1 / Tp • In the ideal case, if my sequential program takes 2 hours on 1 processor, it takes 1 hour on 2 processors: called linear speedup
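Speedup is just a ratio of times; a minimal sketch using the slides' own numbers:

```python
def speedup(old_time, new_time):
    # Generic speedup: "old" time over "new" time;
    # for parallel programs, T1 over Tp.
    return old_time / new_time

# Rice-cooker example: a 1.20 speedup
print(speedup(1.2, 1.0))  # 1.2
# Ideal parallel case: 2 hours on 1 processor, 1 hour on 2 processors
print(speedup(2.0, 1.0))  # 2.0 (linear speedup on 2 processors)
```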

  14. Speedup • [Figure: speedup plotted against number of processors, showing linear, superlinear, and sub-linear speedup curves]

  15. Superlinear Speedup? • There are several possible causes • Algorithm • e.g., with optimization problems, throwing many processors at it increases the chances that one will “get lucky” and find the optimum fast • Hardware • e.g., with many processors, it is possible that the entire application data resides in cache (vs. RAM) or in RAM (vs. Disk)

  16. Amdahl’s Law • Consider a program whose execution consists of two phases • One sequential phase • One phase that can be perfectly parallelized (linear speedup) • Sequential program: T1 = time spent in the phase that cannot be parallelized, T2 = time spent in the phase that can be parallelized; old time: T = T1 + T2 • Parallel program: T1’ = T1, T2’ = time spent in the parallelized phase, with T2’ <= T2; new time: T’ = T1’ + T2’

  17. Amdahl’s Law cont. • Example: a program splits into four parts with P1 = 11%, P2 = 18%, P3 = 23%, P4 = 48%, which add up to 100% • S1 = 1 or 100% (P1 is not sped up), P2 is sped up 5x, so S2 = 500%, P3 is sped up 20x, so S3 = 2000%, and P4 is sped up 1.6x, so S4 = 160% • Using P1/S1 + P2/S2 + P3/S3 + P4/S4, we find the new running time is 0.11/1 + 0.18/5 + 0.23/20 + 0.48/1.6 = 0.4575, a little less than ½ the original running time, which we know is 1 • The overall speed boost is 1/0.4575 = 2.186, a little more than double the original speed, using the formula (P1/S1 + P2/S2 + P3/S3 + P4/S4)^−1
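The arithmetic on this slide is easy to verify with a couple of lines:

```python
# Fractions of the program (P1..P4) and their individual speedups (S1..S4),
# taken directly from the slide
fractions = [0.11, 0.18, 0.23, 0.48]
speedups = [1.0, 5.0, 20.0, 1.6]

new_time = sum(p / s for p, s in zip(fractions, speedups))
print(round(new_time, 4))      # 0.4575, a little less than half of 1
print(round(1 / new_time, 3))  # 2.186, the overall speedup
```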

  18. Amdahl’s Law - Parallelization • If P is the proportion of a program that can be made parallel (i.e., benefit from parallelization), and (1 − P) is the proportion that cannot be parallelized (remains serial), then the maximum speedup that can be achieved by using N processors is • Speedup(N) = 1 / ((1 − P) + P / N)
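Amdahl's law as a function; the 95%-parallel fraction below is an illustrative value, not from the slides:

```python
def amdahl_speedup(p, n):
    # p: parallelizable proportion of the program; n: number of processors
    return 1.0 / ((1.0 - p) + p / n)

print(round(amdahl_speedup(0.95, 4), 2))      # 3.48
# With (effectively) unlimited processors the serial part dominates:
# the limit is 1 / (1 - p) = 20
print(round(amdahl_speedup(0.95, 10**9), 2))  # 20.0
```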

  19. Parallel Efficiency • Definition: Eff(p) = S(p) / p • Typically < 1, unless linear or superlinear speedup • Used to measure how well the processors are utilized • If increasing the number of processors by a factor 10 increases the speedup by a factor 2, perhaps it’s not worth it: efficiency drops by a factor 5 • Important when purchasing a parallel machine for instance: if due to the application’s behavior efficiency is low, forget buying a large cluster
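A sketch of the efficiency metric; the processor counts (8 and 80) and speedups (6 and 12) are hypothetical numbers chosen to match the slide's factor-10 / factor-2 scenario:

```python
def efficiency(speedup, p):
    # Eff(p) = S(p) / p: fraction of the processors usefully employed
    return speedup / p

# 10x more processors, but only 2x more speedup:
print(efficiency(6.0, 8))    # 0.75
print(efficiency(12.0, 80))  # 0.15, a factor-5 drop in efficiency
```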

  20. Scalability • Measure of the “effort” needed to maintain efficiency while adding processors • For a given problem size, plot Eff(p) for increasing values of p • It should stay close to a flat line • Isoefficiency: at which rate does the problem size need to be increased to maintain efficiency? • By making a problem ridiculously large, one can typically achieve good efficiency • Problem: is that how the machine/code will be used?

  21. Performance Measurement • How does one measure the performance of a program in practice? • Two issues: • Measuring wall-clock times • We’ll see how it can be done shortly • Measuring performance rates • Measure wall-clock time (see above) • “Count” the number of “operations” (frames, flops, amino acids: whatever makes sense for the application) • Either by actively counting (count++) • Or by looking at the code and figuring out how many operations are performed • Divide the count by the wall-clock time
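The recipe above (measure wall-clock time, count operations, divide) might look like this in Python; the 2*n operation count for the sum of products is read off the code, as the slide suggests:

```python
import time

n = 1_000_000
a = [1.0] * n
b = [2.0] * n

start = time.perf_counter()                # wall-clock timer
result = sum(x * y for x, y in zip(a, b))  # the "operation" being measured
elapsed = time.perf_counter() - start

flop_count = 2 * n  # n multiplies + n adds, counted by inspecting the code
print(f"{flop_count / elapsed / 1e6:.1f} MFlops")
```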

  22. Why is Performance Poor? • Performance is poor because the code suffers from a performance bottleneck • Definition • An application runs on a platform that has many components • CPU, Memory, Operating System, Network, Hard Drive, Video Card, etc. • Pick a component and make it faster • If the application performance increases, that component was the bottleneck!

  23. Removing a Bottleneck • Hardware Upgrade • Is sometimes necessary • But can only get you so far and may be very costly • e.g., memory technology • Code Modification • The bottleneck is there because the code uses a “resource” heavily or in a non-intelligent manner • We will learn techniques to alleviate bottlenecks at the software level

  24. Identifying a Bottleneck • It can be difficult • You’re not going to change the memory bus just to see what happens to the application • But you can run the code on a different machine and see what happens • One Approach • Know/discover the characteristics of the machine • Observe the application execution on the machine • Tinker with the code • Run the application again • Repeat • Reason about what the bottleneck is

  25. Memory Latency and Bandwidth • The performance of memory is typically defined by Latency and Bandwidth (or Rate) • Latency: time to read one byte from memory • measured in nanoseconds these days • Bandwidth: how many bytes can be read per second • measured in GB/sec • Note that you don’t have bandwidth = 1 / latency! • Reading 2 bytes in sequence may be cheaper than reading one byte only • Let’s see why...

  26. Latency and Bandwidth • Latency = time for one “data item” to go from memory to the CPU over the “memory bus” • In-flight = number of data items that can be “in flight” on the memory bus • Bandwidth = Capacity / Latency • Maximum number of items I get in a latency period • Pipelining: • initial delay for getting the first data item = latency • if I ask for enough data items (> in-flight), then after latency seconds I receive data items at rate bandwidth • Just like networks, hardware pipelines, etc.
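The pipelining argument can be captured in a toy cost model: the first item pays the full latency, and subsequent items stream at the bandwidth rate. The latency and bandwidth values below are illustrative, not from the slides:

```python
def transfer_time(n_items, latency_s, items_per_s):
    # First item pays the latency; the rest arrive at the bandwidth rate.
    return latency_s + n_items / items_per_s

latency = 100e-9   # 100 ns (illustrative)
bandwidth = 1e9    # 10^9 items/sec (illustrative)

# Reading 2 items costs far less than twice the cost of reading 1:
print(transfer_time(1, latency, bandwidth))  # ~1.01e-07 s
print(transfer_time(2, latency, bandwidth))  # ~1.02e-07 s
```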

  27. Latency and Bandwidth • Why is memory bandwidth important? • Because it gives us an upper bound on how fast one could feed data to the CPU • Why is memory latency important? • Because typically one cannot feed the CPU data constantly and one gets impacted by the latency • Latency numbers must be put in perspective with the clock rate, i.e., measured in cycles

  28. Memory Bottleneck: Example • Fragment of code: a[i] = b[j] + c[k] • Three memory references: 2 reads, 1 write • One addition: can be done in one cycle • If the memory bandwidth is 12.8GB/sec, then the rate at which the processor can access integers (4 bytes) is: 12.8*1024*1024*1024 / 4 = 3.4GHz • The above code needs to access 3 integers • Therefore, the rate at which the code gets its data is ~ 1.1GHz • But the CPU could perform additions at 4GHz! • Therefore: The memory is the bottleneck • And we assumed memory worked at the peak!!! • We ignored other possible overheads on the bus • In practice the gap can be around a factor 15 or higher
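The slide's bandwidth arithmetic (using its GB = 1024^3 bytes convention) can be reproduced directly:

```python
bandwidth = 12.8 * 1024**3  # bytes/sec, the slide's 12.8 GB/sec
int_size = 4                # bytes per integer
refs_per_stmt = 3           # a[i] = b[j] + c[k]: 2 reads + 1 write

ints_per_sec = bandwidth / int_size
print(round(ints_per_sec / 1e9, 1))                  # 3.4 G integers/sec
print(round(ints_per_sec / refs_per_stmt / 1e9, 1))  # 1.1 G statements/sec
```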

  29. A realistic approach: Profiling • A profiler is a tool that monitors the execution of a program and reports the computational characteristics of different parts of the code, i.e., dynamic program analysis • Useful to identify “expensive” functions • Profiling cycle • Compile the code with profiling enabled • Run the code • Identify the most expensive function • Optimize that function • call it less often if possible, make it faster, etc. • Repeat
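In Python the profiling cycle can be driven by the standard cProfile module; the functions below are toy stand-ins for a real application:

```python
import cProfile
import io
import pstats

def expensive():
    # Deliberately heavy: the profiler should flag this one
    return sum(i * i for i in range(200_000))

def cheap():
    return sum(range(1_000))

def main():
    for _ in range(5):
        expensive()
    cheap()

profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Sort by cumulative time: the top entries are the first
# candidates for optimization in the profiling cycle.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```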

  30. Profiler Types based on Output • Flat profiler • Flat profilers compute the average call times from the calls, and do not break down the call times based on the callee or the context. • Call-Graph profiler • Call-graph profilers show the call times and frequencies of the functions, and also the call chains involved, based on the callee. However, context is not preserved.

  31. Methods of data gathering • Event based profilers • All of the programming languages listed here have event-based profilers • Java: the JVM Profiler Interface (JVMPI) provides hooks for profilers to trap events like calls, class load/unload, and thread enter/leave. • Python: Python’s profilers are the profile module and hotshot, which are call-graph based; they use the 'sys.setprofile()' function to trap events like c_{call,return,exception} and python_{call,return,exception}.
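The sys.setprofile() hook mentioned above can be demonstrated in a few lines; the tracer here records only Python-level call and return events:

```python
import sys

events = []

def tracer(frame, event, arg):
    # Invoked by the interpreter on call/return/exception events;
    # keep only Python-level calls and returns.
    if event in ("call", "return"):
        events.append((event, frame.f_code.co_name))

def work():
    return 40 + 2

sys.setprofile(tracer)   # install the profile hook
work()
sys.setprofile(None)     # remove it
print(events)            # includes ('call', 'work') and ('return', 'work')
```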

  32. Methods of data gathering • Statistical profilers • Some profilers operate by sampling: the target program’s state is probed at regular intervals, rather than instrumenting the program with additional instructions to collect the required information • The resulting data are not exact, but a statistical approximation • Some of the most commonly used statistical profilers are GNU’s gprof, OProfile, and SGI’s Pixie.

  33. Methods of data gathering • Instrumentation • Manual: done by the programmer, e.g., by adding instructions to explicitly calculate runtimes • Compiler assisted: e.g., "gcc -pg ..." for gprof, "quantify g++ ..." for Quantify • Binary translation: the tool adds instrumentation to a compiled binary. Example: ATOM • Runtime instrumentation: the code is instrumented directly before execution; the program run is fully supervised and controlled by the tool. Examples: PIN, Valgrind • Runtime injection: more lightweight than runtime instrumentation; code is modified at runtime to have jumps to helper functions. Example: DynInst • Hypervisor: data are collected by running the (usually) unmodified program under a hypervisor. Example: SIMMON • Simulator: data are collected by running under an Instruction Set Simulator. Example: SIMMON

  34. UPC – BSC Profiling Tools • Dimemas • Simulation tool for the parametric analysis of the behavior of message-passing applications on a configurable parallel platform. • Paraver • A very powerful performance visualization and analysis tool based on traces that can be used to analyze any information that is expressed on its input trace format.

  35. Dimemas • Utilization • Development • Tuning • Benchmarking • Training • Features • Predict behavior of target architecture • Forecast effects of code optimization • What-if analysis

  36. Dimemas cont. • Usage • Integrated with visualization tool • Extensive application characterization • Squeezing the information that a single instrumented run can provide

  37. Dimemas cont. • Enables the user to develop and tune parallel applications on a workstation • providing an accurate prediction of their performance on the parallel target machine • The Dimemas simulator reconstructs the time behavior of a parallel application on a machine modeled by a set of performance parameters

  38. Dimemas – DiP Environment

  39. Dimemas - Benefits • the possibility of not needing a parallel machine to run the application to get the traces • the possibility of analyzing different application choices without changing or rerunning the application itself

  40. Dimemas – Benefits cont. • The loop involving Dimemas and Paraver (the simulator and the visualization tool) gives the user the chance to analyze the application behavior when some parameters are changed • For example: what will happen to the application execution time if the execution time of a given function is reduced by 50%?

  41. Dimemas - Prediction • A tracefile can be obtained either on a dedicated parallel machine or on a shared sequential machine • Using the instrumentation records, Dimemas rebuilds the application behavior from the tracefile and the architectural parameters defined via a GUI • Per-process CPU time is used, as opposed to elapsed time

  42. Dimemas – Prediction Advantages and Disadvantages • Cost reduction • Analysis of non-existent machines • Quality of performance prediction • Performance

  43. Dimemas – Output Information • Basic results • execution time and speedup • For each task • computation time • blocking time • information related to messages: number of sends, number of receives (divided into the number of receives that produce a block because the message is not yet local to the task, and the number of messages that produce no block)

  44. Dimemas – Output Information cont. • Basic results

  45. Dimemas – Output Information cont. • Application Analysis • Is my application well balanced? Are there any dependencies? • Can we improve the application by grouping messages? Does latency influence application time? • Are we planning to run our application on a low-bandwidth network? Does bandwidth have an influence on application time? • Do resources limit application time?

  46. Dimemas – Output Information cont. • Additional Analysis • What-if analysis • Critical path • Synthetic perturbation analysis

  47. Paraver • Paraver is a flexible performance visualization and analysis tool that can be used to analyze • MPI • OpenMP • MPI+OpenMP • Java • Hardware counters profile • Operating system activity... and many other things you may think of

  48. Paraver – DiP Environment

  49. Paraver – Features • Motif-based GUI • Detailed quantitative analysis of program performance • Concurrent comparative analysis of several traces • Fast analysis of very large traces

  50. Paraver – Features cont. • Support for mixed message passing and shared memory (networks of SMPs) • Customizable semantics of the visualized information • Cooperative work, sharing views of the tracefile • Building of derived metrics
