
Performance Measurement of Applications with GPU Acceleration using CUDA


Presentation Transcript


  1. Performance Measurement of Applications with GPU Acceleration using CUDA • Shangkar Mayanglambam, Allen D. Malony, Matthew J. Sottile • {smeitei,malony,matt}@cs.uoregon.edu • Computer and Information Science Department • Performance Research Laboratory • University of Oregon

  2. Outline • Motivation • Performance perspectives • Acceleration, asynchrony, and concurrency • CPU-GPU execution scenarios • Performance measurement for GPGPUs • Accelerator performance measurement in PGI compiler • TAUcuda performance measurement • operation and API • TAUcuda tests and application case studies • Conclusions and future work

  3. Motivation • Heterogeneous computing technology more accessible • Multicore processors • Manycore accelerators (e.g., NVIDIA Tesla GPU) • High-performance processing engines (e.g., IBM Cell BE) • Achieving performance potential is challenging • Complexity of hardware operation and programming interface • CUDA created to help in GPU accelerator code development • Few performance tools for parallel accelerated applications • Need to understand acceleration in context of whole program • Need integration of accelerator measurements in scalable parallel performance tools • Focus on GPGPU performance measurement using CUDA

  4. Heterogeneous Performance Perspective • Heterogeneous applications can have concurrent execution • Main “host” path and “external” task paths • Want to capture performance for all execution paths • External execution may be difficult or impossible to measure • “Host” creates measurement view for external entity • Maintains local and remote performance data • External entity may provide performance data to the host • What perspective does the host have of the external entity? • Determines the semantics of the measurement data • Existing parallel performance tools are CPU(host)-centric • Event-based sampling (not appropriate for accelerators) • Direct measurement (through instrumentation of events)

  5. CUDA Performance Perspective • CUDA enables programming of kernels for GPU acceleration • GPU acceleration acts as an external task • Performance measurement appears straightforward • Execution model complicates performance measurement • Synchronous and asynchronous operation with respect to host • Overlapping of data transfer and kernel execution (see the sketch below) • Multiple GPU devices and multiple streams per device • Different acceleration kernels used in parallel application • Multiple application sections • Multiple application threads/processes • See performance in context: temporal, spatial, thread/process • Two general approaches: synchronous and asynchronous
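
As context for the asynchronous scenarios above, a minimal sketch (not from the slides) of the kind of CUDA code that complicates measurement: two streams queue transfers and kernels asynchronously while the host continues. Names, sizes, and the kernel body are illustrative only, and host buffers would need to be pinned (cudaMallocHost) for the copies to truly overlap.

    #include <cuda_runtime.h>

    __global__ void kernel(float *d, int n) { /* placeholder kernel body */ }

    void overlapped_launch(float *h0, float *h1, float *d0, float *d1, int n)
    {
        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);

        // Copies and kernels are queued asynchronously with respect to the host
        cudaMemcpyAsync(d0, h0, n * sizeof(float), cudaMemcpyHostToDevice, s0);
        kernel<<<128, 256, 0, s0>>>(d0, n);

        cudaMemcpyAsync(d1, h1, n * sizeof(float), cudaMemcpyHostToDevice, s1);
        kernel<<<128, 256, 0, s1>>>(d1, n);

        // ... host work proceeds here; a measurement tool must attribute
        //     GPU time to the right stream, device, and application context ...

        cudaStreamSynchronize(s0);
        cudaStreamSynchronize(s1);
        cudaStreamDestroy(s0);
        cudaStreamDestroy(s1);
    }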

  6. CPU–GPU Execution / Measurement Scenarios • [Figure: synchronous and asynchronous CPU–GPU execution/measurement scenarios]

  7. Approach • Consider use of NVIDIA PerfKit and CUDA Profiler • PerfKit provides low-level data for GPU driver interface • limited for use with CUDA programming environment • CUDA Profiler provides extensive stream-level measurements • creates post-mortem event trace of kernel operation on streams • difficult to merge with application performance data • Goal is to produce profiles (traces) showing distribution of accelerator performance with respect to application events • Approach 1: force all measurements to be synchronous • Restricts CUDA usage, disallowing concurrent operation • Create new thread for every CUDA invocation • Approach 2: develop CUDA measurement mechanism • Merge with TAU performance system

  8. PGI Compiler for GPU (using CUDA) • PGI accelerator compiler (PGI 9.x, C and Fortran, x64 Linux) • Loop parallelization for acceleration on GPUs using CUDA • Directive-based, presenting a GPGPU programming abstraction (see the sketch below) • Compilation, not source translation – the generated CUDA code stays hidden • TAU measurement of PGI acceleration • Wrappers of runtime system • Track runtime system events as seen from the host processor • Show source information associated with events • Routine name • File name, source line number for kernel • Variable names in memory upload, download operations • Grid sizes
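
As a hedged illustration of the directive style (not code from the slides), a loop nest along the lines of the matrix multiply profiled on the next slide might be annotated for the PGI 9.x accelerator compiler roughly as follows; the exact directive spelling and clauses should be checked against the PGI documentation.

    /* Hypothetical example: PGI accelerator region around a matrix multiply.
       The compiler generates and hides the CUDA code for the offloaded loops. */
    void matmul(int n, const float *a, const float *b, float *c)
    {
        int i, j, k;
    #pragma acc region
        {
            for (i = 0; i < n; ++i)
                for (j = 0; j < n; ++j) {
                    float sum = 0.0f;
                    for (k = 0; k < n; ++k)
                        sum += a[i * n + k] * b[k * n + j];
                    c[i * n + j] = sum;
                }
        }
    }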

  9. Matrix Multiplication Profile (3000x3000, ~22 GF)

  10. CUDA Programming for GPGPU • PGI compiler represents GPGPU programming abstraction • Performance tool uses runtime system wrappers • essentially a synchronous call performance model!!! • In general, programming of GPGPU devices is more complex • CUDA environment • Programming of multiple streams and GPU devices • multiple streams execute concurrently • Programming of data transfers to/from GPU device • Programming of GPU kernel code • Synchronization with streams • Stream event interface

  11. TAU CUDA Performance Measurement (TAUcuda) • Build on CUDA stream event interface • Allow “events” to be placed in streams and processed • events are timestamped • CUDA runtime reports GPU timing in event structure • Events are reported back to CPU when requested • use begin and end events to calculate intervals • Want to associate TAU event context with CUDA events • Get top of TAU event stack at begin (TAU context) • CUDA kernel invocations are asynchronous • CPU does not see actual CUDA “end” event • CPU retrieves events in a non-blocking and blocking manner • Want to capture “waiting time”
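
For reference, a minimal sketch of the underlying CUDA stream event interface that TAUcuda builds on: begin and end events are recorded into a stream around the measured work, queried or waited on from the host, and the GPU-side interval is read back. The kernel and launch configuration here are placeholders.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void work(float *d, int n) { /* placeholder kernel body */ }

    void timed_launch(float *d, int n, cudaStream_t stream)
    {
        cudaEvent_t begin, end;
        cudaEventCreate(&begin);
        cudaEventCreate(&end);

        cudaEventRecord(begin, stream);          // begin event enters the stream
        work<<<128, 256, 0, stream>>>(d, n);
        cudaEventRecord(end, stream);            // end event follows the kernel

        // Non-blocking check; fall back to a blocking wait (where CPU waiting
        // time could be attributed) if the work has not finished yet
        if (cudaEventQuery(end) == cudaErrorNotReady)
            cudaEventSynchronize(end);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, begin, end);   // GPU timing reported by the runtime
        printf("measured interval: %.3f ms\n", ms);

        cudaEventDestroy(begin);
        cudaEventDestroy(end);
    }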

  12. CPU-GPU Operation and TAUcuda Events

  13. TAU CUDA Measurement API void tau_cuda_init(int argc, char **argv); • To be called when the application starts • Initializes data structures and checks GPU status void tau_cuda_exit(); • To be called before any thread exits at end of application • Outputs all CUDA profile data for each thread of execution void* tau_cuda_stream_begin(char *event, cudaStream_t stream); • Called before CUDA statements to be measured • Returns a handle which should be used in the end call • If the event is new or the TAU context is new for the event, a new CUDA event profile object is created void tau_cuda_stream_end(void *handle); • Called immediately after CUDA statements to be measured • Handle identifies the stream • Inserts a CUDA event into the stream
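
To make the calling sequence concrete, here is a minimal sketch of how these calls might wrap a kernel launch. The API declarations are taken from the slide; the kernel, launch configuration, and buffer names are hypothetical.

    #include <cuda_runtime.h>

    // TAUcuda API declarations as given on the slide
    void  tau_cuda_init(int argc, char **argv);
    void  tau_cuda_exit();
    void* tau_cuda_stream_begin(char *event, cudaStream_t stream);
    void  tau_cuda_stream_end(void *handle);

    __global__ void scale(float *d, int n)        // hypothetical kernel
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    void measured_launch(float *d_data, int n, int argc, char **argv)
    {
        tau_cuda_init(argc, argv);                // once, at application start

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Begin/end bracket the CUDA statements to be measured on this stream
        void *h = tau_cuda_stream_begin((char *)"scale_kernel", stream);
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
        tau_cuda_stream_end(h);                   // inserts the end event into the stream

        cudaStreamSynchronize(stream);
        cudaStreamDestroy(stream);

        tau_cuda_exit();                          // outputs CUDA profile data per thread
    }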

  14. TAU CUDA Measurement API (2) vector<Event> tau_cuda_update(); • Checks for completed CUDA events on all streams • Non-blocking and returns # completed on each stream int tau_cuda_update(cudaStream_t stream); • Same as tau_cuda_update() except for a particular stream • Non-blocking and returns # completed on the stream vector<Event> tau_cuda_finalize(); • Waits for all CUDA events to complete on all streams • Blocking and returns # completed on each stream int tau_cuda_finalize(cudaStream_t stream); • Same as tau_cuda_finalize() except for a particular stream • Blocking and returns # completed on the stream
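
A brief sketch of how the non-blocking and blocking retrieval calls above might be used on a single stream; the Event structure's contents are not detailed on the slide, so only the completion counts are used here, and the surrounding host work is hypothetical. The vector<Event> variants cover all streams at once, as listed above.

    #include <cuda_runtime.h>

    // Per-stream declarations as given on the slide
    int tau_cuda_update(cudaStream_t stream);     // non-blocking, one stream
    int tau_cuda_finalize(cudaStream_t stream);   // blocking, one stream

    void poll_then_drain(cudaStream_t stream)
    {
        // Poll for completed CUDA events without stalling the host
        int completed = tau_cuda_update(stream);

        // ... overlap further host work here while the GPU keeps running ...

        // At a synchronization point, wait for the remaining events on the stream
        completed += tau_cuda_finalize(stream);
        (void)completed;                          // counts returned per the slide's API
    }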

  15. Scenario Results – One and Two Streams • Run simple CUDA experiments to validate TAU CUDA • Tesla S1070 test system

  16. Scenario Results – Two Devices, Two Contexts

  17. TAUcuda Compared to CUDA Profiler • CUDA Profiler integrated in CUDA runtime system • Captures time measures for GPGPU kernel and memory tasks • Creates a trace in memory and outputs at end of execution • Can be used to verify TAUcuda • Slight time variation due to differences in mechanism

  18. Case Study: TAUcuda in NAMD and ParFUM • TAU integrated in Charm++ (ICPP 2009 paper) • Charm++ applications • NAMD is a molecular dynamics application • Parallel Framework for Unstructured Meshing (ParFUM) • Both have been accelerated with CUDA • Demonstrate use of TAUcuda • Observe the effect of CUDA acceleration • Show scaling results for GPU cluster execution • Experimental environments • Two S1070 GPU servers (University of Oregon) • AC cluster: 32 nodes, 4 Tesla GPUs per node (UIUC)

  19. NAMD GPU Profile (Two GPU Devices) • Test out TAU CUDA with NAMD • Two processes with one Tesla GPU for each • [Figure: CPU profile, GPU profile (P0), GPU profile (P1)]

  20. NAMD GPU Efficiency Gain (16 versus 32 GPUs) • AC cluster: 16 and 32 processes • dev_sum_forces: 50% improvement • dev_nonbounded: 100% improvement • [Table columns: Event, TAU Context, Device, Stream]

  21. NAMD GPU Scaling (4 to 64 GPUs) • [Figure: scaling efficiency vs. number of devices for non-bonded and sum-forces calculations] • Strong scaling by event and device number • Good scaling for non-bonded calculations • Sum forces scales less well, but its overall contribution is small

  22. ParFUM CUDA Speedup (Single CPU plus GPU) • Problem size: 128 x 8 x 8 mesh • With GPU acceleration, only 9 seconds in CUDA kernels

  23. Case Study: HMPP-TAU • [Diagram: User Application, HMPP Runtime, CUDA, HMPP CUDA Codelet, TAU, TAUcuda] • TAU measurement: user events, HMPP events, codelet events • TAUcuda measurement: CUDA stream events, waiting information

  24. HMPP Data/Overlap Experiment • [Figure: TAUcuda events]

  25. Conclusions and Future Work • Heterogeneous parallel computing will challenge parallel performance technology • Must deal with diversity in hardware and software • Must deal with richer parallelism and concurrency • Developed and demonstrated TAUcuda • TAU + CUDA measurement approach • Showed case studies and integrated in HMPP • Next targeting OpenCL (TAUopenCL) • Better merge TAU and TAUcuda performance data • Take advantage of other tools in TAU toolset • Performance database (PerfDMF), data mining (PerfExplorer) • Integrated in application and heterogeneous environments
