Effect of Instruction Fetch and Memory Scheduling on GPU Performance
Nagesh B Lakshminarayana, Hyesoon Kim


Presentation Transcript


  1. Effect of Instruction Fetch and Memory Scheduling on GPU Performance. Nagesh B Lakshminarayana, Hyesoon Kim

  2. Outline: Background and Motivation, Policies, Experimental Setup, Results, Conclusion

  3. GPU Architecture (based on Tesla Architecture) • SM – Streaming Multiprocessor • SP – Scalar Processor • SIMT – Single Instruction, Multiple Thread

  4. SM Architecture (based on Tesla Architecture) • Fetch Mechanism • Fetch 1 instruction for the selected warp • Stall fetch for a warp when it executes a load/store or encounters a branch • Scheduler Policy • Oldest-first and in-order (within a warp) • Caches • I-Cache, Shared Memory, Constant Cache and Texture Cache
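The fetch mechanism above can be sketched as a round-robin pick over warps, skipping any warp stalled on a load/store or an unresolved branch. This is a minimal illustration; the struct and function names are assumptions, not taken from the paper's simulator.

```cpp
#include <cassert>
#include <vector>

// One warp's fetch-relevant state. "stalled" is set when the warp
// executes a load/store or encounters a branch, per the slide above.
struct Warp {
    int id;
    bool stalled;
};

// Round-robin selection: starting after the last fetched warp, return
// the id of the first non-stalled warp, or -1 if every warp is stalled.
int selectWarpRR(const std::vector<Warp>& warps, int& lastFetched) {
    int n = static_cast<int>(warps.size());
    for (int i = 1; i <= n; ++i) {
        int candidate = (lastFetched + i) % n;
        if (!warps[candidate].stalled) {
            lastFetched = candidate;
            return warps[candidate].id;
        }
    }
    return -1;  // all warps waiting on memory or a branch
}
```

Each cycle the SM fetches one instruction for the selected warp; stalled warps are simply passed over until their load/store completes or their branch resolves.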

  5. Handling Multiple Memory Requests • MSHR/Memory Request Queue • Allows merging of memory requests (intra-core) • DRAM Controller • Allows merging of memory requests (inter-core)
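Intra-core merging in the MSHR can be illustrated as follows: requests from different threads that fall in the same cache line are merged into one outstanding memory request. The line size and all names here are illustrative assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

constexpr uint64_t kLineSize = 64;  // assumed cache-line size in bytes

struct Mshr {
    // cache-line address -> ids of threads waiting on that line
    std::unordered_map<uint64_t, std::vector<int>> entries;

    // Returns true if the request merged into an existing entry
    // (no new DRAM request issued), false if it allocated a new one.
    bool request(int threadId, uint64_t addr) {
        uint64_t line = addr / kLineSize;
        bool merged = (entries.find(line) != entries.end());
        entries[line].push_back(threadId);
        return merged;
    }
};
```

Merged requests are satisfied by a single DRAM access when the line returns, which is why fairness-oriented policies that keep corresponding threads in step (seen later in the results) increase merging opportunities.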

  6. Intra-core Merging

  7. Code Example – Intra-Core Merging • From MonteCarlo in the CUDA SDK:

      for (iSum = threadIdx.x; iSum < SUM_N; iSum += blockDim.x) {
          …
          for (int i = iSum; i < pathN; i += SUM_N) {
              real r = d_Samples[i];
              real callValue = endCallValue(S, X, r, MuByT, VBySqrtT);
              sumCall.Expected += callValue;
              sumCall.Confidence += callValue * callValue;
          }
          …
      }

  • Notation A(X, Y): value of A in block X, thread Y • iSum(0, 2) = iSum(1, 2) = iSum(2, 2) = 2 • i(0, 2) = i(1, 2) = i(2, 2) = 2 • r(0, 2) = r(1, 2) = r(2, 2) = d_Samples[2] • Multiple blocks are assigned to the same SM; threads with corresponding ids in different blocks access the same memory locations

  8. Inter-core Merging

  9. Why look at Fetch? • Allows implicit control over resources allocated to a warp • Can control the progress of a warp • Can boost performance by fetching more for critical warps • Implicit resource control within a core

  10. Why look at DRAM Scheduling? • The memory system is a performance bottleneck for several applications • DRAM scheduling decides the order in which memory requests are granted • Can prioritize warps based on criticality • Implicit performance control across cores

  11. By controlling Fetch and DRAM Scheduling, we can control performance

  12. How is This Useful? • Understand applications and their behavior better • Detect patterns or behavioral groups across applications • Design new policies for GPGPU applications to improve performance

  13. Outline: Background and Motivation, Policies, Experimental Setup, Results, Conclusion

  14. Fetch Policies • Round Robin (RR) [default in Tesla architecture] • FAIR • Ensures uniform progress of all warps • ICOUNT [Tullsen’96] • Same as ICOUNT in SMT • Tries to increase throughput by giving priority to fast-moving threads • Least Recently Fetched (LRF) • Prevents starvation of warps
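Two of these policies can be contrasted in a single cycle's decision: ICOUNT picks the warp with the fewest instructions currently in the pipeline, while LRF picks the warp fetched least recently. A minimal sketch, assuming per-warp counters with illustrative field names (not the paper's simulator structures):

```cpp
#include <cassert>
#include <climits>
#include <vector>

struct WarpState {
    int id;
    int instsInPipeline;  // used by ICOUNT
    long lastFetchCycle;  // used by LRF
};

// ICOUNT: favor the warp with the fewest in-flight instructions,
// i.e. the fastest-moving thread context.
int pickICOUNT(const std::vector<WarpState>& warps) {
    int best = -1, fewest = INT_MAX;
    for (const auto& w : warps)
        if (w.instsInPipeline < fewest) { fewest = w.instsInPipeline; best = w.id; }
    return best;
}

// LRF: favor the warp that has gone longest without a fetch,
// which prevents starvation.
int pickLRF(const std::vector<WarpState>& warps) {
    int best = -1; long oldest = LONG_MAX;
    for (const auto& w : warps)
        if (w.lastFetchCycle < oldest) { oldest = w.lastFetchCycle; best = w.id; }
    return best;
}
```

The same warp population can yield different choices under the two policies, which is exactly the behavioral difference the study measures.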

  15. New Oracle-Based Fetch Policies • ALL • Gives priority to longer warps (total length until termination) • Ensures all warps finish at the same time, which results in higher occupancy • Priorities: warp 0 > warp 1 > warp 2 > warp 3

  16. New Oracle-Based Fetch Policies • BAR • Gives priority to warps with a greater number of instructions to the next barrier • Idea is to reduce wait time at barriers • Priorities: warp 0 > warp 1 > warp 2 > warp 3 • Priorities: warp 2 > warp 1 > warp 0 > warp 3

  17. New Oracle-Based Fetch Policies • MEM_BAR • Similar to BAR but gives higher priority to warps with more memory instructions • Priorities: warp 0 > warp 2 > warp 1 = warp 3 • Priorities: warp 1 > warp 0 = warp 2 > warp 3 • Priority(Wa) > Priority(Wb) if MemInst(Wa) > MemInst(Wb), or if MemInst(Wa) = MemInst(Wb) AND Inst(Wa) > Inst(Wb)
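The MEM_BAR priority rule above translates directly into a comparator: warp Wa outranks Wb if it has more memory instructions remaining to the next barrier, with total remaining instructions as the tiebreaker. A sketch with illustrative field names:

```cpp
#include <cassert>

// Per-warp oracle counts up to the next barrier, as on the slide:
// MemInst(W) = remaining memory instructions, Inst(W) = remaining
// instructions of any kind. Field names are illustrative.
struct WarpInfo {
    int memInstToBarrier;  // MemInst(W)
    int instToBarrier;     // Inst(W)
};

// True if warp a has strictly higher MEM_BAR priority than warp b.
bool higherPriority(const WarpInfo& a, const WarpInfo& b) {
    if (a.memInstToBarrier != b.memInstToBarrier)
        return a.memInstToBarrier > b.memInstToBarrier;
    return a.instToBarrier > b.instToBarrier;
}
```

Note that two warps with equal counts on both fields compare as equal priority, matching the "warp 1 = warp 3" case in the example orderings.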

  18. DRAM Scheduling Policies • FCFS • FRFCFS [Rixner’00] • FR_FAIR (new policy) • Row hit with fairness • Ensures uniform progress of warps • REM_INST (new Oracle-based policy) • Row hit with priority for warps with a greater number of instructions remaining until termination • Prioritizes longer warps
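The FRFCFS baseline (first-ready, first-come-first-served) can be sketched as: among pending requests, prefer those that hit the currently open row; break ties by arrival order. The structure and names below are assumptions for illustration.

```cpp
#include <cassert>
#include <vector>

struct MemRequest {
    int id;
    int row;       // DRAM row this request targets
    long arrival;  // arrival time, for the FCFS tiebreak
};

// FRFCFS: pick a row hit if one exists (oldest first among hits);
// otherwise pick the oldest request overall.
int pickFRFCFS(const std::vector<MemRequest>& queue, int openRow) {
    int best = -1;
    bool bestHit = false;
    long bestArrival = 0;
    for (const auto& r : queue) {
        bool hit = (r.row == openRow);
        if (best == -1 || (hit && !bestHit) ||
            (hit == bestHit && r.arrival < bestArrival)) {
            best = r.id;
            bestHit = hit;
            bestArrival = r.arrival;
        }
    }
    return best;
}
```

FR_FAIR and REM_INST keep the row-hit-first rule but replace the FCFS tiebreak with warp fairness or remaining-instruction priority, respectively.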

  19. Outline: Background and Motivation, Policies, Experimental Setup, Results, Conclusion

  20. Experimental Setup • Simulated GPU architecture • 8 SMs • Frontend: 1-wide, 1KB I-Cache, branch stall • Execution: 8-wide SIMD execution unit, in-order (IO) scheduling, 4-cycle latency for most instructions • Caches: 64KB software-managed cache, 8 load accesses/cycle • Memory: 32B-wide bus, 8 DRAM banks • RR fetch, FRFCFS DRAM scheduling (baseline) • Trace-driven, cycle-accurate simulator • Per-warp traces generated using GPU Ocelot [Kerr’09]

  21. Benchmarks • Taken from • CUDA SDK 2.2 – MonteCarlo, Nbody, ScalarProd • PARBOIL [UIUC’09] – MRI-Q, MRI-FHD, CP, PNS • RODINIA [Che’09] – Leukocyte, Cell, Needle • Classification based on lengths of warps • Symmetric, if <= 2% divergence • Asymmetric, otherwise (results included in paper)

  22. Outline: Background and Motivation, Policies, Experimental Setup, Results, Conclusion

  23. Results – Symmetric Applications • Baseline: RR + FRFCFS • Compute-intensive – no variation with different fetch policies • Memory-bound – improvement with fairness-oriented fetch policies, i.e., FAIR, ALL, BAR, MEM_BAR

  24. Results – Symmetric Applications • Baseline: RR + FRFCFS • On average, better than FRFCFS • MersenneTwister shows a huge improvement • The REM_INST DRAM policy performs similarly to FR_FAIR

  25. Analysis: MonteCarlo • FRFCFS DRAM scheduling • Fairness-oriented fetch policies improve performance by increasing intra-core merging

  26. Analysis: MersenneTwister • Baseline: RR + FRFCFS • Fairness-oriented DRAM scheduling (FR_FAIR, REM_INST) improves performance by increasing the DRAM row-buffer hit ratio

  27. Analysis: BlackScholes • FRFCFS DRAM scheduling • Fairness-oriented fetch policies increase MLP • Increased MLP and row-buffer hit ratio together improve performance

  28. Outline: Background and Motivation, Policies, Experimental Setup, Results, Conclusion

  29. Conclusion • Compute-intensive applications • Fetch and DRAM scheduling do not matter • Symmetric memory-intensive applications • Fairness-oriented fetch policies (FAIR, ALL, BAR, MEM_BAR) and DRAM policies (FR_FAIR, REM_INST) provide performance improvement • MonteCarlo (40%), MersenneTwister (50%), BlackScholes (18%) • Asymmetric memory-intensive applications • No correlation between performance and fetch and DRAM scheduling policies

  30. THANK YOU!
