
Performance Measurement of Applications with GPU Acceleration using CUDA


Presentation Transcript


  1. Performance Measurement of Applications with GPU Acceleration using CUDA • Shangkar Mayanglambam, Allen D. Malony, Matthew J. Sottile • {smeitei,malony,matt}@cs.uoregon.edu • Computer and Information Science Department • Performance Research Laboratory • University of Oregon

  2. Outline • Motivation • Performance perspectives • Acceleration, asynchrony, and concurrency • CPU-GPU execution scenarios • Performance measurement for GPGPUs • Accelerator performance measurement in PGI compiler • TAUcuda performance measurement • operation and API • TAUcuda tests and application case studies • Conclusions and future work

  3. Motivation • Heterogeneous computing technology more accessible • Multicore processors • Manycore accelerators (e.g., NVIDIA Tesla GPU) • High-performance processing engines (e.g., IBM Cell BE) • Achieving performance potential is challenging • Complexity of hardware operation and programming interface • CUDA created to help in GPU accelerator code development • Few performance tools for parallel accelerated applications • Need to understand acceleration in context of whole program • Need integration of accelerator measurements in scalable parallel performance tools • Focus on GPGPU performance measurement using CUDA

  4. Heterogeneous Performance Perspective • Heterogeneous applications can have concurrent execution • Main “host” path and “external” task paths • Want to capture performance for all execution paths • External execution may be difficult or impossible to measure • “Host” creates measurement view for external entity • Maintains local and remote performance data • External entity may provide performance data to the host • What perspective does the host have of the external entity? • Determines the semantics of the measurement data • Existing parallel performance tools are CPU(host)-centric • Event-based sampling (not appropriate for accelerators) • Direct measurement (through instrumentation of events)

  5. CUDA Performance Perspective • CUDA enables programming of kernels for GPU acceleration • GPU acceleration acts as an external task • Performance measurement appears straightforward • Execution model complicates performance measurement • Synchronous and asynchronous operation with respect to host • Overlapping of data transfer and kernel execution (see the sketch below) • Multiple GPU devices and multiple streams per device • Different acceleration kernels used in parallel application • Multiple application sections • Multiple application threads/processes • See performance in context: temporal, spatial, thread/process • Two general approaches: synchronous and asynchronous
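
As context for the asynchronous scenarios above, a minimal sketch (not from the slides) of the kind of CUDA code that complicates measurement: two streams queue transfers and kernels asynchronously while the host continues. Names, sizes, and the kernel body are illustrative only, and host buffers would need to be pinned (cudaMallocHost) for the copies to truly overlap.

    #include <cuda_runtime.h>

    __global__ void kernel(float *d, int n) { /* placeholder kernel body */ }

    void overlapped_launch(float *h0, float *h1, float *d0, float *d1, int n)
    {
        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);

        // Copies and kernels are queued asynchronously with respect to the host
        cudaMemcpyAsync(d0, h0, n * sizeof(float), cudaMemcpyHostToDevice, s0);
        kernel<<<128, 256, 0, s0>>>(d0, n);

        cudaMemcpyAsync(d1, h1, n * sizeof(float), cudaMemcpyHostToDevice, s1);
        kernel<<<128, 256, 0, s1>>>(d1, n);

        // ... host work proceeds here; a measurement tool must attribute
        //     GPU time to the right stream, device, and application context ...

        cudaStreamSynchronize(s0);
        cudaStreamSynchronize(s1);
        cudaStreamDestroy(s0);
        cudaStreamDestroy(s1);
    }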

  6. CPU–GPU Execution / Measurement Scenarios • [Figure: synchronous and asynchronous CPU–GPU execution/measurement scenarios]

  7. Approach • Consider use of NVIDIA PerfKit and CUDA Profiler • PerfKit provides low-level data for GPU driver interface • limited for use with CUDA programming environment • CUDA Profiler provides extensive stream-level measurements • creates post-mortem event trace of kernel operation on streams • difficult to merge with application performance data • Goal is to produce profiles (traces) showing distribution of accelerator performance with respect to application events • Approach 1: force all measurements to be synchronous • Restricts CUDA usage, disallowing concurrent operation • Create new thread for every CUDA invocation • Approach 2: develop CUDA measurement mechanism • Merge with TAU performance system

  8. PGI Compiler for GPU (using CUDA) • PGI accelerator compiler (PGI 9.x, C and Fortran, x64 Linux) • Loop parallelization for acceleration on GPUs using CUDA • Directive-based, presenting a GPGPU programming abstraction (see the sketch below) • Compilation, not source translation – the generated CUDA code stays hidden • TAU measurement of PGI acceleration • Wrappers of runtime system • Track runtime system events as seen from the host processor • Show source information associated with events • Routine name • File name, source line number for kernel • Variable names in memory upload, download operations • Grid sizes
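
As a hedged illustration of the directive style (not code from the slides), a loop nest along the lines of the matrix multiply profiled on the next slide might be annotated for the PGI 9.x accelerator compiler roughly as follows; the exact directive spelling and clauses should be checked against the PGI documentation.

    /* Hypothetical example: PGI accelerator region around a matrix multiply.
       The compiler generates and hides the CUDA code for the offloaded loops. */
    void matmul(int n, const float *a, const float *b, float *c)
    {
        int i, j, k;
    #pragma acc region
        {
            for (i = 0; i < n; ++i)
                for (j = 0; j < n; ++j) {
                    float sum = 0.0f;
                    for (k = 0; k < n; ++k)
                        sum += a[i * n + k] * b[k * n + j];
                    c[i * n + j] = sum;
                }
        }
    }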

  9. Matrix Multiplication Profile (3000x3000, ~22 GF)

  10. CUDA Programming for GPGPU • PGI compiler represents GPGPU programming abstraction • Performance tool uses runtime system wrappers • essentially a synchronous call performance model!!! • In general, programming of GPGPU devices is more complex • CUDA environment • Programming of multiple streams and GPU devices • multiple streams execute concurrently • Programming of data transfers to/from GPU device • Programming of GPU kernel code • Synchronization with streams • Stream event interface

  11. TAU CUDA Performance Measurement (TAUcuda) • Build on CUDA stream event interface • Allow “events” to be placed in streams and processed • events are timestamped • CUDA runtime reports GPU timing in event structure • Events are reported back to CPU when requested • use begin and end events to calculate intervals • Want to associate TAU event context with CUDA events • Get top of TAU event stack at begin (TAU context) • CUDA kernel invocations are asynchronous • CPU does not see actual CUDA “end” event • CPU retrieves events in a non-blocking and blocking manner • Want to capture “waiting time”
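
For reference, a minimal sketch of the underlying CUDA stream event interface that TAUcuda builds on: begin and end events are recorded into a stream around the measured work, queried or waited on from the host, and the GPU-side interval is read back. The kernel and launch configuration here are placeholders.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void work(float *d, int n) { /* placeholder kernel body */ }

    void timed_launch(float *d, int n, cudaStream_t stream)
    {
        cudaEvent_t begin, end;
        cudaEventCreate(&begin);
        cudaEventCreate(&end);

        cudaEventRecord(begin, stream);          // begin event enters the stream
        work<<<128, 256, 0, stream>>>(d, n);
        cudaEventRecord(end, stream);            // end event follows the kernel

        // Non-blocking check; fall back to a blocking wait (where CPU waiting
        // time could be attributed) if the work has not finished yet
        if (cudaEventQuery(end) == cudaErrorNotReady)
            cudaEventSynchronize(end);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, begin, end);   // GPU timing reported by the runtime
        printf("measured interval: %.3f ms\n", ms);

        cudaEventDestroy(begin);
        cudaEventDestroy(end);
    }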

  12. CPU-GPU Operation and TAUcuda Events

  13. TAU CUDA Measurement API void tau_cuda_init(int argc, char **argv); • To be called when the application starts • Initializes data structures and checks GPU status void tau_cuda_exit(); • To be called before any thread exits at end of application • Outputs all CUDA profile data for each thread of execution void* tau_cuda_stream_begin(char *event, cudaStream_t stream); • Called before CUDA statements to be measured • Returns a handle which should be used in the end call • If the event is new or the TAU context is new for the event, a new CUDA event profile object is created void tau_cuda_stream_end(void *handle); • Called immediately after CUDA statements to be measured • Handle identifies the stream • Inserts a CUDA event into the stream
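
To make the calling sequence concrete, here is a minimal sketch of how these calls might wrap a kernel launch. The API declarations are taken from the slide; the kernel, launch configuration, and buffer names are hypothetical.

    #include <cuda_runtime.h>

    // TAUcuda API declarations as given on the slide
    void  tau_cuda_init(int argc, char **argv);
    void  tau_cuda_exit();
    void* tau_cuda_stream_begin(char *event, cudaStream_t stream);
    void  tau_cuda_stream_end(void *handle);

    __global__ void scale(float *d, int n)        // hypothetical kernel
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    void measured_launch(float *d_data, int n, int argc, char **argv)
    {
        tau_cuda_init(argc, argv);                // once, at application start

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Begin/end bracket the CUDA statements to be measured on this stream
        void *h = tau_cuda_stream_begin((char *)"scale_kernel", stream);
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
        tau_cuda_stream_end(h);                   // inserts the end event into the stream

        cudaStreamSynchronize(stream);
        cudaStreamDestroy(stream);

        tau_cuda_exit();                          // outputs CUDA profile data per thread
    }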

  14. TAU CUDA Measurement API (2) vector<Event> tau_cuda_update(); • Checks for completed CUDA events on all streams • Non-blocking and returns # completed on each stream int tau_cuda_update(cudaStream_t stream); • Same as tau_cuda_update() except for a particular stream • Non-blocking and returns # completed on the stream vector<Event> tau_cuda_finalize(); • Waits for all CUDA events to complete on all streams • Blocking and returns # completed on each stream int tau_cuda_finalize(cudaStream_t stream); • Same as tau_cuda_finalize() except for a particular stream • Blocking and returns # completed on the stream
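
A brief sketch of how the non-blocking and blocking retrieval calls above might be used on a single stream; the Event structure's contents are not detailed on the slide, so only the completion counts are used here, and the surrounding host work is hypothetical. The vector<Event> variants cover all streams at once, as listed above.

    #include <cuda_runtime.h>

    // Per-stream declarations as given on the slide
    int tau_cuda_update(cudaStream_t stream);     // non-blocking, one stream
    int tau_cuda_finalize(cudaStream_t stream);   // blocking, one stream

    void poll_then_drain(cudaStream_t stream)
    {
        // Poll for completed CUDA events without stalling the host
        int completed = tau_cuda_update(stream);

        // ... overlap further host work here while the GPU keeps running ...

        // At a synchronization point, wait for the remaining events on the stream
        completed += tau_cuda_finalize(stream);
        (void)completed;                          // counts returned per the slide's API
    }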

  15. Scenario Results – One and Two Streams • Run simple CUDA experiments to validate TAU CUDA • Tesla S1070 test system

  16. Scenario Results – Two Devices, Two Contexts

  17. TAUcuda Compared to CUDA Profiler • CUDA Profiler integrated in CUDA runtime system • Captures time measures for GPGPU kernel and memory tasks • Creates a trace in memory and outputs at end of execution • Can be used to verify TAUcuda • Slight time variation due to differences in mechanism

  18. Case Study: TAUcuda in NAMD and ParFUM • TAU integrated in Charm++ (ICPP 2009 paper) • Charm++ applications • NAMD is a molecular dynamics application • Parallel Framework for Unstructured Meshing (ParFUM) • Both have been accelerated with CUDA • Demonstrate use of TAUcuda • Observe the effect of CUDA acceleration • Show scaling results for GPU cluster execution • Experimental environments • Two S1070 GPU servers (University of Oregon) • AC cluster: 32 nodes, 4 Tesla GPUs per node (UIUC)

  19. NAMD GPU Profile (Two GPU Devices) • Test out TAU CUDA with NAMD • Two processes with one Tesla GPU for each • [Figure: CPU profile, GPU profile (P0), GPU profile (P1)]

  20. NAMD GPU Efficiency Gain (16 versus 32 GPUs) • AC cluster: 16 and 32 processes • dev_sum_forces: 50% improvement • dev_nonbounded: 100% improvement • [Table columns: Event, TAU Context, Device, Stream]

  21. NAMD GPU Scaling (4 to 64 GPUs) • [Figure: scaling efficiency vs. number of devices for non-bonded and sum-forces calculations] • Strong scaling by event and device number • Good scaling for non-bonded calculations • Sum forces scales less well, but its overall contribution is small

  22. ParFUM CUDA Speedup (Single CPU plus GPU) • Problem size: 128 x 8 x 8 mesh • With GPU acceleration, only 9 seconds in CUDA kernels

  23. Case Study: HMPP-TAU • [Diagram: User Application, HMPP Runtime, CUDA, HMPP CUDA Codelet, TAU, TAUcuda] • TAU measurement: user events, HMPP events, codelet events • TAUcuda measurement: CUDA stream events, waiting information

  24. HMPP Data/Overlap Experiment • [Figure: TAUcuda events]

  25. Conclusions and Future Work • Heterogeneous parallel computing will challenge parallel performance technology • Must deal with diversity in hardware and software • Must deal with richer parallelism and concurrency • Developed and demonstrated TAUcuda • TAU + CUDA measurement approach • Showed case studies and integrated in HMPP • Next targeting OpenCL (TAUopenCL) • Better merge TAU and TAUcuda performance data • Take advantage of other tools in TAU toolset • Performance database (PerfDMF), data mining (PerfExplorer) • Integrated in application and heterogeneous environments
