
Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk


Presentation Transcript


  1. Solving Common Problems with GPU: A Case Study with Sparse Matrices. 17 Mar 2011, ARTS Talk. Sean Baxter (324)

  2. GPU Architecture • Why GPU? • Scan • Idiomatic GPU Programming • Sparse Matrix Optimization

  3. Why GPU?

  4. Why GPU? Consider advance of CMOS process technology

  5. CMOS Process Tech • 1972 – Intel 8008 – 10000nm - 3500 trans • 1985 – Intel 386 – 1000nm - 275k trans • 1989 – Intel 486 – 800nm - 1.18m trans • 1997 – Pentium II – 350nm - 7.5m trans • 2000 – Pentium 4 – 180nm - 42m trans • 2006 – Core 2 Duo – 65nm - 291m trans • 2011 – Sandy Bridge – 32nm - 995m trans

  6. CMOS Process Tech Moore’s Law is Ending because…

  7. CMOS Process Tech …we’re running out of atoms

  8. CMOS Process Tech • Current node is 28nm • 28nm is the half-pitch: half the distance between the starts of repeating features • Radius of a Si atom is 0.11nm • Only ~250 Si atoms between the starts of repeating features • Not many more nodes before we run out of atoms

  9. Why GPU? • Feature density will plateau • To get continuously improving performance we need to improve efficiency • Focus on features that actually do work • CPU pipeline length enables high clockspeed • CPU pipeline length makes big, inefficient chip • GPU uses latency hiding to manage pipeline

  10. Instruction Pipeline • Old school RISC pipeline: • Instruction fetch • Instruction decode • Execute • Memory access • Write back to register

  11. Instruction Pipeline • Cycle 0: Fet In0 • Cycle 1: Dec In0, Fet In1 • Cycle 2: Ex In0, Dec In1, Fet In2 • Cycle 3: Mem In0, Ex In1, Dec In2, Fet In3 • Cycle 4: WB In0, Mem In1, Ex In2, Dec In3, Fet In4 • Cycle 5: WB In1, Mem In2, Ex In3, Dec In4, Fet In5

  12. Instruction Pipeline • Pipelining allows concurrent execution of instructions • Latency increases but so does throughput • Longer pipelines allow higher clockspeeds for increased throughput • Longer pipelines require more parallel circuitry to implement all stages

  13. Pipeline Hazards • In0: x3 <- x2 op x1 • In1: x5 <- x4 op x3 • The result of In0 is a source of In1: pipeline stall • NOPs are silently inserted into the pipeline • Also stalls on memory contention or branches

  14. Pipeline Performance • Mitigate the hazards of longer pipelines with: • Out-of-order execution • Branch prediction • Speculative execution • Register renaming • Microfusion • Micro-op cache (1500 uOps on Sandy Bridge)

  15. Superscalar • Instruction Level Parallelism (ILP) on CPU • Retire multiple instructions per cycle (IPC) • Intel Core architecture retires 4 IPC • Build multiple pipelines and have each fetch, decode, execute multiple instructions

  16. Instruction Pipeline • Pipeline length drove MHz in 90s • CMOS Process improvements enabled longer pipelines with increasing feature density • Circuitry to support long pipelines grew faster than circuitry to execute • Superscalar not actually all that effective: • Measured performance more like 1-1.5 IPC

  17. End of Long Pipelines • Pentium 4 Prescott (90nm Netburst) • Longest pipeline of any consumer chip: • Pentium 4 Northwood 20 pipeline stages • Pentium 4 Prescott 31 pipeline stages • Pipeline irony • Longer pipelines increase latency, make hazards harder to resolve • After Prescott, all vendors retreated from long pipelines

  18. GPU Revisionist History • CPU pipeline is very complex to ensure single-threaded program correctness • GPU drops single threaded support and implements latency hiding • Scalar, in-order execution • Long pipelines for high MHz • Almost all circuitry is in execution blocks for doing work, not decoding, re-ordering, or analyzing instructions

  19. GPU Pipeline • NVIDIA DX11 part “Fermi” (GF100-GF118) • GTX 580 (GF110) has 16 Streaming Multiprocessors (SMs) • Each SM operates independently, akin to a core on a CPU • Within each SM, 32 active threads • Within each SM, 1536 threads in flight • GTX 580 is like a 16-core exotic sequential processor

  20. GPU SM Pipeline • Instruction dispatch grouped into warps • On NV warp = 32 threads, on ATI warp = 64 threads • Threads within warp execute in parallel • In0 completes on all threads of warp 0 prior to In1 executing on any thread of warp 0 • ATI and G200 execute quarter warp per cycle • Fermi executes full warp per cycle

  21. GPU SM Pipeline • Fake RISC-like pipeline on GPU • Cycle 0: • Tid 0-31 fetch In0 • Cycle 1: • Tid 0-31 decode In0 • Tid 32-63 fetch In0 • Cycle 2: • Tid 0-31 execute In0 • Tid 32-63 decode In0 • Tid 64-95 fetch In0

  22. GPU SM Pipeline • On Fermi, up to 48 warps in flight per SM • For 100% SM occupancy, 48-cycle instruction latency can be hidden with zero pipeline stalls • Actual instruction latency suspected around 22 cycles – launch at least 22 warps x 32 threads = 704 of the 1536 threads to hide it

  23. GPU SM Pipeline Performance • 32 instructions executed per cycle • Core speed 1544MHz • 16 MPs on device • 2 Flops (FMA) x 1544MHz x 32 IPC x 16 MP = 1.58 TFlop • ATI is VLIW4 for almost 3 TFlop

  24. GPU SM Pipeline • The programmer has little control over the CPU pipeline • The programmer has great control over the GPU pipeline, as expressed by the shape of thread execution • Branching over a small number of warps stalls the pipeline

  25. Memory Bandwidth • AMD Phenom II: 14.6 GB/s • Intel Sandy Bridge Quad: 21.3 GB/s • Itanium 2 (NASA Columbia): 6.4 GB/s • NVIDIA GeForce GTX 580: 192 GB/s • Rule of thumb: GPU has 10x memory bandwidth and 50x arithmetic throughput compared to CPU

  26. GPU Memory Architecture • Memory is segmented into aligned 128-byte intervals • On a global memory I/O, the number of segments addressed by the threads in a warp is computed • Each segment is a memory transaction • On a coalesced r/w, each thread in the warp addresses a different 4-byte address in the same segment • Coalesced r/w means only 1 transaction per 128 bytes

  27. Coalescing • For bandwidth-bound kernels, memory coalescing is the #1 performance priority • For 4-byte types, transactions are serviced for full warps • For 8-byte types, transactions are serviced for half warps • For 16-byte types (like float4 vecs in D3D), transactions are serviced for quarter warps • For larger types, penalties apply when loading from typed pointers • To r/w large structures (>16 bytes), spread the transactions over multiple threads
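As a rough illustration of these rules (a sketch, not code from the talk; the buffer names and the STRIDE constant are assumptions), the cs_5_0 kernel below contrasts a coalesced load, where adjacent threads read adjacent 4-byte elements and a warp touches a single 128-byte segment, with a strided load, where the same warp scatters across many segments:

      #define NUM_THREADS 256
      #define STRIDE 32   // hypothetical stride; 1 would make the second read coalesced too

      Buffer<float> srcbuf : register(t0);
      RWBuffer<float> dstbuf : register(u0);

      [numthreads(NUM_THREADS, 1, 1)]
      void CopyPatterns(uint3 dtid : SV_DispatchThreadID) {
          // Coalesced: thread i reads element i, so the 32 reads of a warp fall
          // into one 128-byte segment and cost one memory transaction.
          float coalesced = srcbuf[dtid.x];

          // Strided: thread i reads element i * STRIDE, so the 32 reads of a warp
          // land in up to 32 different segments and cost up to 32 transactions.
          float strided = srcbuf[dtid.x * STRIDE];

          dstbuf[dtid.x] = coalesced + strided;
      }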

  28. GPU Memory Architecture • GTX 580 memory clock 1002 MHz • GDDR5 controller is quad-pumped • Memory controller is 384 bits (6x64) (48 bytes) • 1002MHz x 48 bytes x 4 = 192.3e9 bytes/s • Thread switching lets the rest of the warps on the SM execute while some are waiting on memory

  29. GPU Core Speed • 16 MPs • 1544 MHz shader speed • 32 IPC • 16 MPs x 1544 MHz x 32 IPC = 790e9 IPS • If each thread reads 4 bytes, 3162e9 byte/s • 3162e9 / 192.3e9 = 16.5 cycles/read

  30. GPU Memory Bandwidth • On GTX 580, you need to do a fully coalesced memory op every 16 cycles to saturate the controller • The high arithmetic throughput is there to enable this! • Latency hiding averages this out – may either stream data in or load in all data, do all ops, write all data – still get max throughput if ratio is met

  31. GPU Memory Bandwidth • GPU has expansive fields of ALUs to allow fast read-execute-write turnaround and see max performance in real problems • CPU has 1/10th memory bandwidth because it doesn’t have the arithmetic performance to do work on data even if it had more bandwidth

  32. Worst Idea for Sparse Matrix • Consider a CSR encoded MxN sparse matrix • Launch M threads (one per row) • Each thread reads an interval from the Row array • Each thread dynamically loops over all non-zero values in the row, reads from the Col array, and uses that to index into the dense vector texture • Accumulate products and write
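A minimal sketch of that “one thread per row” kernel, with assumed buffer names and a standard CSR layout (illustrative only, not code from the talk):

      #define NUM_THREADS 256

      Buffer<uint> Row : register(t0);        // M+1 CSR row offsets
      Buffer<uint> Col : register(t1);        // column index per non-zero
      Buffer<float> Val : register(t2);       // matrix value per non-zero
      Buffer<float> Vec : register(t3);       // dense vector
      RWBuffer<float> Result : register(u0);  // one dot product per row

      cbuffer Params : register(b0) {
          uint numRows;
      };

      [numthreads(NUM_THREADS, 1, 1)]
      void SpMvOneThreadPerRow(uint3 dtid : SV_DispatchThreadID) {
          uint row = dtid.x;
          if(row >= numRows) return;

          // Each thread walks its own row interval: the loop length varies per
          // thread, and the per-thread reads of Col/Val scatter across many
          // 128-byte segments - the problems the next two slides quantify.
          uint begin = Row[row];
          uint end = Row[row + 1];

          float sum = 0.0f;
          for(uint i = begin; i < end; ++i)
              sum += Val[i] * Vec[Col[i]];

          Result[row] = sum;
      }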

  33. Worst Idea for Sparse Matrix • Consider memory transactions on GPU hardware • Threads 0-31 (computing rows 0-31) run concurrently • If average row density is 9 values, the warp reads over an interval of 32*9 = 288 values • 288 values span 10 segments • Memory throughput is 1/10th of peak

  34. Worst Idea for Sparse Matrix • Threads 0-31 (computing rows 0-31) run concurrently • All threads take as long as the thread with the most work • If all threads in warp have rows with 5 values, except one thread with 40 values, all threads wait through 40 values • Runs 1/8th efficiency!

  35. Understand This MPs are not coarse-grained parallel processors! (They are exotic sequential processors)

  36. Sean’s Sparse Matrix Performance • Utilize a special encoding of sparse matrices to fully saturate the memory controller • For double precision, each matrix element is 12 bytes, each vector element is 8 bytes • How many dot-product components can be computed at 20 bytes/component? • On GTX 570 (peak bandwidth 141.5 GB/s) I see up to 197 GB/s effective throughput. Thanks texture cache! Incredible CG method speed
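(As a rough cross-check of that question, not a figure from the talk: at 20 bytes per double-precision component, 141.5 GB/s of peak bandwidth works out to about 141.5e9 / 20 ≈ 7.1e9 dot-product components per second, and the observed 197 GB/s of effective throughput to about 9.8e9.)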

  37. Scan The GPGPU algorithm for everything

  38. Scan • GPU is fast YAY • Your old code won’t work BOO • GPU is really hard to program • GPU is fairly easy to optimize • Throw away algorithms book • One algorithm to rule them all: scan

  39. Scan • Not a callable routine – more like Batman’s utility belt • At its simplest, adds up sequence of numbers: • Inclusive scan transforms sequence: (3, 2, 1, 4, 5) -> (3, 5, 6, 10, 15) • Exclusive scan transforms sequence: (3, 2, 1, 4, 5) -> (0, 3, 5, 6, 10)

  40. // Basic 32-element scan in cs_5_0 HLSL
      #define NUM_THREADS 32
      #define sync GroupMemoryBarrierWithGroupSync

      Buffer<uint> readbuf : register(t0);
      RWBuffer<uint> writebuf : register(u0);

      groupshared uint sharedArray[NUM_THREADS];

      [numthreads(NUM_THREADS, 1, 1)]
      void WarpScan(uint tid : SV_GroupIndex) {
          // Load one element per thread and stage it in shared memory.
          uint x = readbuf[tid];
          sharedArray[tid] = x;
          sync();

          // Inclusive scan: double the reach each iteration.
          [unroll]
          for(uint offset = 1; offset < NUM_THREADS; offset <<= 1) {
              uint left = (NUM_THREADS - 1) & (tid - offset);
              uint y = sharedArray[left];
              sync();
              if(offset <= tid) x += y;
              sharedArray[tid] = x;
              sync();
          }
          writebuf[tid] = x;
      }

  41. cs_5_0
      dcl_globalFlags refactoringAllowed
      dcl_resource_buffer (uint,uint,uint,uint) t0
      dcl_uav_typed_buffer (uint,uint,uint,uint) u0
      dcl_input vThreadIDInGroupFlattened
      dcl_temps 3
      dcl_tgsm_structured g0, 4, 32
      dcl_thread_group 32, 1, 1
      ld_indexable(buffer)(uint,uint,uint,uint) r0.x, vThreadIDInGroupFlattened.xxxx, t0.xyzw
      store_structured g0.x, vThreadIDInGroupFlattened.x, l(0), r0.x
      sync_g_t
      iadd r1.xyzw, vThreadIDInGroupFlattened.xxxx, l(-1, -2, -4, -8)
      and r1.xyzw, r1.xyzw, l(31, 31, 31, 31)
      ld_structured r0.y, r1.x, l(0), g0.xxxx
      iadd r0.y, r0.x, r0.y
      sync_g_t
      uge r2.xyzw, vThreadIDInGroupFlattened.xxxx, l(1, 2, 4, 8)
      movc r0.x, r2.x, r0.y, r0.x
      store_structured g0.x, vThreadIDInGroupFlattened.x, l(0), r0.x
      sync_g_t
      ld_structured r0.y, r1.y, l(0), g0.xxxx
      iadd r0.y, r0.x, r0.y
      movc r0.x, r2.y, r0.y, r0.x
      (cont)

  42. sync_g_t
      store_structured g0.x, vThreadIDInGroupFlattened.x, l(0), r0.x
      sync_g_t
      ld_structured r0.y, r1.z, l(0), g0.xxxx
      iadd r0.y, r0.x, r0.y
      movc r0.x, r2.z, r0.y, r0.x
      sync_g_t
      store_structured g0.x, vThreadIDInGroupFlattened.x, l(0), r0.x
      sync_g_t
      ld_structured r0.y, r1.w, l(0), g0.xxxx
      iadd r0.y, r0.x, r0.y
      movc r0.x, r2.w, r0.y, r0.x
      sync_g_t
      store_structured g0.x, vThreadIDInGroupFlattened.x, l(0), r0.x
      sync_g_t
      iadd r0.y, vThreadIDInGroupFlattened.x, l(-16)
      and r0.y, r0.y, l(31)
      ld_structured r0.y, r0.y, l(0), g0.xxxx
      iadd r0.y, r0.x, r0.y
      uge r0.z, vThreadIDInGroupFlattened.x, l(16)
      movc r0.x, r0.z, r0.y, r0.x
      store_uav_typed u0.xyzw, vThreadIDInGroupFlattened.xxxx, r0.xxxx
      ret
      // Approximately 38 instruction slots used

  43. Scan • Parallel scan is inefficient for adding numbers, yet critical for idiomatic GPU programming • Complex predicates allow many problems to be solved • Essential for load balancing jagged problems across simple threads • A well-used scan broadcasts information-dense values to all threads in the warp or block

  44. Segmented Scan • Sum from left-to-right within segments • Same as the code above but with a modified predicate:
      [unroll]
      for(uint offset = 1; offset < NUM_THREADS; offset <<= 1) {
          uint left = (NUM_THREADS - 1) & (tid - offset);
          uint y = sharedArray[left];
          sync();
          if(offset <= delta) x += y;
          sharedArray[tid] = x;
          sync();
      }
      • delta is the distance from tid to the start of its segment
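The slides don’t show where delta comes from; one common approach (a sketch with assumed names flagbuf/deltabuf, not code from the talk) derives it from per-element segment-head flags with an inclusive max-scan of head positions, using the same shared-memory pattern as the scan above:

      #define NUM_THREADS 32
      #define sync GroupMemoryBarrierWithGroupSync

      Buffer<uint> flagbuf : register(t0);     // hypothetical input: 1 where a segment starts
      RWBuffer<uint> deltabuf : register(u0);  // hypothetical output: delta per element

      groupshared uint headPos[NUM_THREADS];

      [numthreads(NUM_THREADS, 1, 1)]
      void SegmentDelta(uint tid : SV_GroupIndex) {
          // Seed with "my own index if I start a segment, else 0";
          // element 0 is assumed to start a segment.
          uint x = (flagbuf[tid] != 0) ? tid : 0;
          headPos[tid] = x;
          sync();

          // Inclusive max-scan: each thread ends up holding the index of the
          // most recent segment head at or before it.
          [unroll]
          for(uint offset = 1; offset < NUM_THREADS; offset <<= 1) {
              uint left = (NUM_THREADS - 1) & (tid - offset);
              uint y = headPos[left];
              sync();
              if(offset <= tid) x = max(x, y);
              headPos[tid] = x;
              sync();
          }

          // delta = distance from tid back to the start of its segment.
          deltabuf[tid] = tid - x;
      }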

  45. Segmented Scan • Values of the same color (shown here as bracketed groups) are in the same segment: [2 1 2] [0 3 4 5] [12 10] [4 5 2 1] • Segmented scan performs a complete scan within each segment: [2 3 5] [0 3 7 12] [12 22] [4 9 11 12]

  46. Sparse Matrix • Sparse matrix * vector is a very obvious segmented scan problem • There is one segment per matrix row, sized by the number of non-zero values in that row • Each element is the product of a non-zero matrix value and its corresponding component from the dense vector

  47. Sparse Matrix • Consider data in CSR (Compressed Sparse Row) format: • Row: (0 3 5 6 9 14) • Col: (5 4 3 1 2 3 4 6 3 1 2 1 5 4) • Index vec from col to get the vector values for the matrix values, and multiply into the matrix values: • Mat*vec: (x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14)
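(Reading the Row offsets against the Col array: row 0 spans Col[0..3) = (5 4 3), row 1 spans Col[3..5) = (1 2), row 2 spans Col[5..6) = (3), row 3 spans Col[6..9) = (4 6 3), and row 4 spans Col[9..14) = (1 2 1 5 4); each product xi is the i-th matrix value times vec[Col[i]].)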

  48. Run a segmented scan – the last value in each segment is the dot product of a sparse matrix row and the dense vector: (x1 x2 x3 | x4 x5 | x6 | x7 x8 x9 | x10 x11 x12 x13 x14) scans to (x1, x1+x2, x1+x2+x3 | x4, x4+x5 | x6 | x7, x7+x8, x7+x8+x9 | x10, x10+x11, x10+x11+x12, x10+x11+x12+x13, x10+x11+x12+x13+x14)

  49. Sparse Matrix Challenges • Sparse matrix * vector should be bandwidth limited – each component requires only a multiply-and-add • To achieve peak bandwidth, we need to issue a coalesced read every 16 cycles • DP is nerfed on the Geforce series – it only runs at 1/4th the speed of the same die on a Quadro/Tesla part, so extra efficiency during reduction is essential • Parallel scan has poor work efficiency • Matrix data is not in a convenient order for processing multiple values per thread

  50. Idiomatic GPU Programming GPU architecture guides software design
