
Algorithm Engineering „GPGPU“

Algorithm Engineering „GPGPU“. Stefan Edelkamp. Graphics Processing Units. GPGPU = (GP)²U: General Purpose Programming on the GPU. „Parallelism for the masses“. Applications: Fourier transformation, model checking, bioinformatics; see CUDA Zone.


Presentation Transcript


  1. Algorithm Engineering „GPGPU“ Stefan Edelkamp

  2. Graphics Processing Units • GPGPU = (GP)²U: General Purpose Programming on the GPU • „Parallelism for the masses“ • Applications: Fourier transformation, model checking, bioinformatics; see CUDA Zone

  3. Programming the Graphics Processing Unit with CUDA

  4. Overview • Cluster / Multicore / GPU comparison • Computing on the GPU • GPGPU languages • CUDA • Small Example

  5. Overview • Cluster / Multicore / GPU comparison • Computing on the GPU • GPGPU languages • CUDA • Small Example

  6. Cluster / Multicore / GPU • Cluster system • many independent systems • each one has • one (or more) processors • internal memory • often an HDD • communication over the network • slow compared to internal communication • no shared memory • [Diagram: three nodes (CPU, RAM, HDD each) and a switch]

  7. Cluster / Multicore / GPU • Multicore systems • multiple CPUs • RAM • external memory on HDD • communication over RAM • [Diagram labels: CPU1, CPU2, CPU3, CPU4, RAM, HDD]

  8. Cluster / Multicore / GPU • System with a Graphics Processing Unit • many (240) parallel processing units • hierarchical memory structure • RAM • video RAM • shared RAM • communication • PCI bus • [Diagram labels: CPU, RAM, Hard Disk Drive, Graphics Card, VRAM, GPU, SRAM]

  9. Overview • Cluster / Multicore / GPU comparison • Computing on the GPU • GPGPU languages • CUDA • Small Example

  10. Computing on the GPU • Hierarchical execution • Groups • executed sequentially • Threads • executed in parallel • lightweight (creation / switching nearly free) • one kernel function • executed by each thread • [Diagram label: Group 0]

  11. Computing on the GPU • Hierarchical memory • Video RAM • 1 GB • comparable to RAM • Shared RAM in the GPU • 16 KB • comparable to registers • parallel access by threads • [Diagram labels: Graphics Card, Video RAM, GPU, SRAM]
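
The memory hierarchy above can be illustrated with a small reduction: each group copies a slice of video RAM into shared RAM, sums it there, and writes one partial result back. This is only a sketch under assumptions: the names `blockSum`, `sumOnGpu` and the group size `BLOCK` are hypothetical and do not come from the slides. The buffer needs 1 KB, which fits into the 16 KB of shared RAM mentioned on the slide.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK 256  // threads per group (assumed; must be a power of two)

// Each group sums BLOCK consecutive inputs in shared RAM and writes
// one partial sum back to video RAM.
__global__ void blockSum(const int *in, int *partial, int n) {
    __shared__ int buf[BLOCK];                 // on-chip shared RAM
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    buf[threadIdx.x] = (id < n) ? in[id] : 0;  // pad the last group with zeros
    __syncthreads();                           // wait until all loads are done

    // Tree reduction: the number of active threads halves in every step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = buf[0];
}

// Hypothetical host helper: returns the sum of all values in h.
long long sumOnGpu(const std::vector<int> &h) {
    int n = (int)h.size();
    if (n == 0) return 0;
    int groups = (n + BLOCK - 1) / BLOCK;
    int *in_d, *partial_d;
    cudaMalloc((void **)&in_d, n * sizeof(int));
    cudaMalloc((void **)&partial_d, groups * sizeof(int));
    cudaMemcpy(in_d, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    blockSum<<<groups, BLOCK>>>(in_d, partial_d, n);
    std::vector<int> partial(groups);
    cudaMemcpy(partial.data(), partial_d, groups * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(in_d);
    cudaFree(partial_d);
    long long total = 0;                       // add the partial sums on the CPU
    for (int p : partial) total += p;
    return total;
}
```

The partial sums are added on the host here to keep the sketch short; a second kernel pass would be the usual alternative for large inputs.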

  12. Example architecture: G200, e.g. in the GTX 280

  13. Example problems

  14. Ranking and unranking with parity

  15. 2-Bit BFS

  16. 1-Bit BFS

  17. Sliding-tile puzzle

  18. Some results…

  19. Further results…

  20. Overview • Cluster / Multicore / GPU comparison • Computing on the GPU • GPGPU languages • CUDA • Small Example

  21. GPGPU Languages • RapidMind • supports multicore, ATI, NVIDIA and Cell • C++ analysed and compiled for the target hardware • Accelerator (Microsoft) • library for .NET languages • BrookGPU (Stanford University) • supports ATI, NVIDIA • own language, a variant of ANSI C

  22. Overview • Cluster / Multicore / GPU comparison • Computing on the GPU • Programming languages • CUDA • Small Example

  23. CUDA • Programming language • Similar to C • File suffix .cu • Own compiler called nvcc • Can be linked to C

  24. CUDA • Build flow: C++ code is compiled with GCC • CUDA code is compiled with nvcc • both are linked with ld • result: executable

  25. CUDA • Additional variable types • dim3 • int3 • char3

  26. CUDA • Different types of functions • __global__ invoked from host • __device__ called from device • Different types of variables • __device__ located in VRAM • __shared__ located in SRAM
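
A minimal sketch of the function and variable qualifiers listed above. The names `offset_d`, `addOffset`, `shiftAll` and `shiftOnGpu` are hypothetical, and `__shared__` is left out to keep the example short.

```cuda
#include <cuda_runtime.h>

// __device__ variable: located in video RAM, visible to all threads.
__device__ int offset_d = 10;

// __device__ function: can only be called from code running on the GPU.
__device__ int addOffset(int x) {
    return x + offset_d;
}

// __global__ function (kernel): invoked from the host, executed on the GPU.
__global__ void shiftAll(int *a, int n) {
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < n)
        a[id] = addOffset(a[id]);
}

// Hypothetical host wrapper: adds the device-side offset to every element.
void shiftOnGpu(int *a, int n) {
    int *a_d;
    cudaMalloc((void **)&a_d, n * sizeof(int));
    cudaMemcpy(a_d, a, n * sizeof(int), cudaMemcpyHostToDevice);
    int threads = 128;                          // threads per group (assumed)
    int groups = (n + threads - 1) / threads;   // round up
    shiftAll<<<groups, threads>>>(a_d, n);
    cudaMemcpy(a, a_d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(a_d);
}
```

Calling `addOffset` directly from host code would be rejected by nvcc, which is the practical difference between the two function qualifiers.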

  27. CUDA • Calling the kernel function • name<<<dim3 grid, dim3 block>>>(...) • Grid dimensions (groups) • Block dimensions (threads)
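
The launch syntax can be sketched with a two-dimensional grid, which is where `dim3` becomes useful. The kernel `fillIndex`, the wrapper `fillOnGpu` and the 16 x 16 group size are assumptions made for illustration only.

```cuda
#include <cuda_runtime.h>

// Every thread writes the linear index of its own matrix cell.
__global__ void fillIndex(int *m, int width, int height) {
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    int row = blockDim.y * blockIdx.y + threadIdx.y;
    if (col < width && row < height)            // guard partially filled groups
        m[row * width + col] = row * width + col;
}

// Hypothetical host wrapper: fills a width x height matrix on the GPU.
void fillOnGpu(int *m, int width, int height) {
    int *m_d;
    size_t bytes = (size_t)width * height * sizeof(int);
    cudaMalloc((void **)&m_d, bytes);

    dim3 block(16, 16);                         // 256 threads per group
    dim3 grid((width + block.x - 1) / block.x,  // round up in both dimensions
              (height + block.y - 1) / block.y);
    fillIndex<<<grid, block>>>(m_d, width, height);

    cudaMemcpy(m, m_d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(m_d);
}
```

Unused components of a `dim3` default to 1, so `dim3 block(16, 16)` describes a 16 x 16 x 1 group.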

  28. CUDA • Memory handling • cudaMalloc(...) - allocate VRAM • cudaMemcpy(...) - copy memory • cudaFree(...) - free VRAM
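
A minimal round trip through the three memory calls, with basic error checking. The helper name `roundTrip` is hypothetical; the CUDA runtime calls themselves are used with their documented signatures.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: copies n ints to the GPU and back again.
// Returns true if every CUDA call succeeded.
bool roundTrip(const int *src, int *dst, int n) {
    int *buf_d = nullptr;
    size_t bytes = (size_t)n * sizeof(int);     // sizes are given in bytes

    if (cudaMalloc((void **)&buf_d, bytes) != cudaSuccess)
        return false;

    cudaError_t err = cudaMemcpy(buf_d, src, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(dst, buf_d, bytes, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));

    cudaFree(buf_d);                            // always release the VRAM
    return err == cudaSuccess;
}
```

Note that all sizes are byte counts, so an element count has to be multiplied by `sizeof(int)`.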

  29. CUDA • Distinguish threads • blockDim – number of threads per group • blockIdx – id of group (starting with 0) • threadIdx – id of thread within its group (starting with 0) • id = blockDim.x*blockIdx.x+threadIdx.x
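
The id formula can be checked with a kernel that records, for each global id, which group and which thread produced it. For example, with 4 threads per group, thread 1 of group 2 gets id 4*2+1 = 9. The names `recordIds` and `idsOnGpu` are hypothetical.

```cuda
#include <cuda_runtime.h>

// Each thread stores its group id and its thread id at its global id.
__global__ void recordIds(int *group, int *thread, int n) {
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < n) {
        group[id] = blockIdx.x;
        thread[id] = threadIdx.x;
    }
}

// Hypothetical host wrapper: fills group[] and thread[] for ids 0..n-1.
void idsOnGpu(int *group, int *thread, int n, int threadsPerGroup) {
    int *group_d, *thread_d;
    size_t bytes = (size_t)n * sizeof(int);
    cudaMalloc((void **)&group_d, bytes);
    cudaMalloc((void **)&thread_d, bytes);

    int groups = (n + threadsPerGroup - 1) / threadsPerGroup;  // round up
    recordIds<<<groups, threadsPerGroup>>>(group_d, thread_d, n);

    cudaMemcpy(group, group_d, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(thread, thread_d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(group_d);
    cudaFree(thread_d);
}
```

The `id < n` guard matters whenever n is not a multiple of the group size, because the last group then contains threads without an element.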

  30. Overview • Cluster / Multicore / GPU comparison • Computing on the GPU • Programming languages • CUDA • Small Example

  31. C:

      void inc(int *a, int b, int N) {
          for (int i = 0; i < N; i++)
              a[i] = a[i] + b;
      }
      int main() {
          ...
          inc(a, b, N);
      }

      CUDA:

      // Kernel: each thread increments one element.
      __global__ void inc(int *a, int b, int N) {
          int id = blockDim.x * blockIdx.x + threadIdx.x;
          if (id < N)
              a[id] = a[id] + b;
      }
      int main() {
          ...
          int *a_d;
          cudaMalloc((void **)&a_d, N * sizeof(int));
          cudaMemcpy(a_d, a, N * sizeof(int), cudaMemcpyHostToDevice);
          dim3 dimBlock(blocksize, 1, 1);   // unused dimensions must be 1, not 0
          dim3 dimGrid((N + blocksize - 1) / blocksize, 1, 1);   // round up
          inc<<<dimGrid, dimBlock>>>(a_d, b, N);
      }

  32. Real-world Example • LTL model checking • Traversing an implicit graph G=(V,E) • Vertices are called states • Edges are represented by transitions • Duplicate removal needed

  33. Real-world Example • External model checking • Generate the graph with external BFS • Each BFS layer needs to be sorted • GPUs have proven to be fast at sorting

  34. Real-world Example • Challenges • Millions of states in one layer • Huge state size • Fast access only in SRAM • Elements need to be moved

  35. Real-world Example • Solutions: • Gpuqsort • Quicksort optimized for GPUs • intensive swapping in VRAM • Bitonic-based sorting • fast for subgroups • concatenating groups is slow
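
To illustrate why bitonic sorting is fast for subgroups, here is a sketch in which every group sorts one fixed-size tile entirely in shared RAM. It is a textbook bitonic network, not the implementation from the talk; `TILE`, `bitonicSortTile` and `sortTilesOnGpu` are hypothetical names, and the input length is assumed to be a multiple of `TILE`. Merging the sorted tiles, the slow part according to the slide, is not shown.

```cuda
#include <cuda_runtime.h>

#define TILE 512  // elements per group (assumed; must be a power of two)

// Each group sorts its own tile of TILE keys in ascending order.
__global__ void bitonicSortTile(unsigned int *data) {
    __shared__ unsigned int s[TILE];
    unsigned int tid = threadIdx.x;
    unsigned int base = blockIdx.x * TILE;
    s[tid] = data[base + tid];
    __syncthreads();

    // k: length of the bitonic sequences being merged; j: compare distance.
    for (unsigned int k = 2; k <= TILE; k <<= 1) {
        for (unsigned int j = k >> 1; j > 0; j >>= 1) {
            unsigned int partner = tid ^ j;
            if (partner > tid) {                 // one thread handles each pair
                bool ascending = ((tid & k) == 0);
                if ((s[tid] > s[partner]) == ascending) {
                    unsigned int tmp = s[tid];
                    s[tid] = s[partner];
                    s[partner] = tmp;
                }
            }
            __syncthreads();                     // finish the step before the next
        }
    }
    data[base + tid] = s[tid];
}

// Hypothetical host wrapper: sorts each of `tiles` tiles independently.
void sortTilesOnGpu(unsigned int *h, int tiles) {
    unsigned int *d;
    size_t bytes = (size_t)tiles * TILE * sizeof(unsigned int);
    cudaMalloc((void **)&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    bitonicSortTile<<<tiles, TILE>>>(d);         // one thread per element
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
}
```

All compare-and-swap steps touch only shared RAM, so video RAM is read once and written once per element.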

  36. Real-world Example • Our solution • States S presorted by hash H(S) • Each bucket sorted in SRAM by a group • [Diagram labels: SRAM, VRAM]

  37. Real-world Example • Our solution • Order given by (H(S), S)
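
The ordering by (H(S), S) can be sketched as a comparator: states are compared by hash first, and by the state itself only when the hashes are equal. Everything concrete here is an assumption, not the talk's implementation: a state is taken to fit into 64 bits, `hashOf` is a hypothetical multiplicative hash, and the sort runs on the host with `std::sort` purely to demonstrate the order.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

typedef unsigned long long State;  // assumption: one state packed into 64 bits

// Hypothetical hash H(S): maps a state to one of 2^bits buckets (1 <= bits <= 32).
__host__ __device__ inline unsigned int hashOf(State s, int bits) {
    return (unsigned int)((s * 0x9E3779B97F4A7C15ULL) >> (64 - bits));
}

// Lexicographic order on (H(S), S): bucket first, state as tie-breaker.
__host__ __device__ inline bool lessByHashThenState(State a, State b, int bits) {
    unsigned int ha = hashOf(a, bits);
    unsigned int hb = hashOf(b, bits);
    return (ha != hb) ? (ha < hb) : (a < b);
}

// Host-side demonstration: sort by (H(S), S), then drop duplicates,
// which are adjacent after sorting.
void sortAndDedup(std::vector<State> &states, int bits) {
    std::sort(states.begin(), states.end(),
              [bits](State a, State b) { return lessByHashThenState(a, b, bits); });
    states.erase(std::unique(states.begin(), states.end()), states.end());
}
```

Because equal states always land in the same bucket and end up next to each other, duplicate removal reduces to one linear pass over the sorted layer.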

  38. Real-world Example • Results

  39. Programming the GPU • Questions???
