Effect of Instruction Fetch and Memory Scheduling on GPU Performance
Nagesh B Lakshminarayana, Hyesoon Kim


Presentation Transcript


  1. Effect of Instruction Fetch and Memory Scheduling on GPU Performance. Nagesh B Lakshminarayana, Hyesoon Kim

  2. Outline: Background and Motivation, Policies, Experimental Setup, Results, Conclusion

  3. GPU Architecture (based on Tesla Architecture) • SM – Streaming Multiprocessor • SP – Scalar Processor • SIMT – Single Instruction, Multiple Thread

  4. SM Architecture (based on Tesla Architecture) • Fetch Mechanism • Fetch 1 instruction for the selected warp • Stall fetch for a warp when it executes a load/store or encounters a branch • Scheduler Policy • Oldest-first and in-order (within a warp) • Caches • I-Cache, Shared Memory, Constant Cache and Texture Cache
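The fetch mechanism above can be sketched as a round-robin pick over warps, skipping any warp stalled on a load/store or an unresolved branch. This is a minimal illustration; the struct and function names are assumptions, not taken from the paper's simulator.

```cpp
#include <cassert>
#include <vector>

// One warp's fetch-relevant state. "stalled" is set when the warp
// executes a load/store or encounters a branch, per the slide above.
struct Warp {
    int id;
    bool stalled;
};

// Round-robin selection: starting after the last fetched warp, return
// the id of the first non-stalled warp, or -1 if every warp is stalled.
int selectWarpRR(const std::vector<Warp>& warps, int& lastFetched) {
    int n = static_cast<int>(warps.size());
    for (int i = 1; i <= n; ++i) {
        int candidate = (lastFetched + i) % n;
        if (!warps[candidate].stalled) {
            lastFetched = candidate;
            return warps[candidate].id;
        }
    }
    return -1;  // all warps waiting on memory or a branch
}
```

Each cycle the SM fetches one instruction for the selected warp; stalled warps are simply passed over until their load/store completes or their branch resolves.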

  5. Handling Multiple Memory Requests • MSHR/Memory Request Queue • Allows merging of memory requests (intra-core) • DRAM Controller • Allows merging of memory requests (inter-core)
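Intra-core merging in the MSHR can be illustrated as follows: requests from different threads that fall in the same cache line are merged into one outstanding memory request. The line size and all names here are illustrative assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

constexpr uint64_t kLineSize = 64;  // assumed cache-line size in bytes

struct Mshr {
    // cache-line address -> ids of threads waiting on that line
    std::unordered_map<uint64_t, std::vector<int>> entries;

    // Returns true if the request merged into an existing entry
    // (no new DRAM request issued), false if it allocated a new one.
    bool request(int threadId, uint64_t addr) {
        uint64_t line = addr / kLineSize;
        bool merged = (entries.find(line) != entries.end());
        entries[line].push_back(threadId);
        return merged;
    }
};
```

Merged requests are satisfied by a single DRAM access when the line returns, which is why fairness-oriented policies that keep corresponding threads in step (seen later in the results) increase merging opportunities.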

  6. Intra-core Merging

  7. Code Example – Intra-Core Merging • From MonteCarlo in the CUDA SDK:

      for (iSum = threadIdx.x; iSum < SUM_N; iSum += blockDim.x) {
          …
          for (int i = iSum; i < pathN; i += SUM_N) {
              real r = d_Samples[i];
              real callValue = endCallValue(S, X, r, MuByT, VBySqrtT);
              sumCall.Expected += callValue;
              sumCall.Confidence += callValue * callValue;
          }
          …
      }

  • Notation A(X, Y): value of A in block X, thread Y • iSum(0, 2) = iSum(1, 2) = iSum(2, 2) = 2 • i(0, 2) = i(1, 2) = i(2, 2) = 2 • r(0, 2) = r(1, 2) = r(2, 2) = d_Samples[2] • Multiple blocks are assigned to the same SM; threads with corresponding ids in different blocks access the same memory locations

  8. Inter-core Merging

  9. Why look at Fetch? • Allows implicit control over resources allocated to a warp • Can control the progress of a warp • Can boost performance by fetching more for critical warps • Implicit resource control within a core

  10. Why look at DRAM Scheduling? • The memory system is a performance bottleneck for several applications • DRAM scheduling decides the order in which memory requests are granted • Can prioritize warps based on criticality • Implicit performance control across cores

  11. By controlling Fetch and DRAM Scheduling, we can control performance

  12. How is This Useful? • Understand applications and their behavior better • Detect patterns or behavioral groups across applications • Design new policies for GPGPU applications to improve performance

  13. Outline: Background and Motivation, Policies, Experimental Setup, Results, Conclusion

  14. Fetch Policies • Round Robin (RR) [default in Tesla architecture] • FAIR • Ensures uniform progress of all warps • ICOUNT [Tullsen’96] • Same as ICOUNT in SMT • Tries to increase throughput by giving priority to fast-moving threads • Least Recently Fetched (LRF) • Prevents starvation of warps
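Two of these policies can be contrasted in a single cycle's decision: ICOUNT picks the warp with the fewest instructions currently in the pipeline, while LRF picks the warp fetched least recently. A minimal sketch, assuming per-warp counters with illustrative field names (not the paper's simulator structures):

```cpp
#include <cassert>
#include <climits>
#include <vector>

struct WarpState {
    int id;
    int instsInPipeline;  // used by ICOUNT
    long lastFetchCycle;  // used by LRF
};

// ICOUNT: favor the warp with the fewest in-flight instructions,
// i.e. the fastest-moving thread context.
int pickICOUNT(const std::vector<WarpState>& warps) {
    int best = -1, fewest = INT_MAX;
    for (const auto& w : warps)
        if (w.instsInPipeline < fewest) { fewest = w.instsInPipeline; best = w.id; }
    return best;
}

// LRF: favor the warp that has gone longest without a fetch,
// which prevents starvation.
int pickLRF(const std::vector<WarpState>& warps) {
    int best = -1; long oldest = LONG_MAX;
    for (const auto& w : warps)
        if (w.lastFetchCycle < oldest) { oldest = w.lastFetchCycle; best = w.id; }
    return best;
}
```

The same warp population can yield different choices under the two policies, which is exactly the behavioral difference the study measures.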

  15. New Oracle-Based Fetch Policies • ALL • Gives priority to longer warps (total length until termination) • Ensures all warps finish at the same time, which results in higher occupancy • Priorities: warp 0 > warp 1 > warp 2 > warp 3

  16. New Oracle-Based Fetch Policies • BAR • Gives priority to warps with a greater number of instructions to the next barrier • Idea is to reduce wait time at barriers • Priorities: warp 0 > warp 1 > warp 2 > warp 3 • Priorities: warp 2 > warp 1 > warp 0 > warp 3

  17. New Oracle-Based Fetch Policies • MEM_BAR • Similar to BAR but gives higher priority to warps with more memory instructions • Priorities: warp 0 > warp 2 > warp 1 = warp 3 • Priorities: warp 1 > warp 0 = warp 2 > warp 3 • Priority(Wa) > Priority(Wb) if MemInst(Wa) > MemInst(Wb), or if MemInst(Wa) = MemInst(Wb) AND Inst(Wa) > Inst(Wb)
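The MEM_BAR priority rule above translates directly into a comparator: warp Wa outranks Wb if it has more memory instructions remaining to the next barrier, with total remaining instructions as the tiebreaker. A sketch with illustrative field names:

```cpp
#include <cassert>

// Per-warp oracle counts up to the next barrier, as on the slide:
// MemInst(W) = remaining memory instructions, Inst(W) = remaining
// instructions of any kind. Field names are illustrative.
struct WarpInfo {
    int memInstToBarrier;  // MemInst(W)
    int instToBarrier;     // Inst(W)
};

// True if warp a has strictly higher MEM_BAR priority than warp b.
bool higherPriority(const WarpInfo& a, const WarpInfo& b) {
    if (a.memInstToBarrier != b.memInstToBarrier)
        return a.memInstToBarrier > b.memInstToBarrier;
    return a.instToBarrier > b.instToBarrier;
}
```

Note that two warps with equal counts on both fields compare as equal priority, matching the "warp 1 = warp 3" case in the example orderings.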

  18. DRAM Scheduling Policies • FCFS • FRFCFS [Rixner’00] • FR_FAIR (new policy) • Row hit with fairness • Ensures uniform progress of warps • REM_INST (new Oracle-based policy) • Row hit with priority for warps with a greater number of instructions remaining until termination • Prioritizes longer warps
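The FRFCFS baseline (first-ready, first-come-first-served) can be sketched as: among pending requests, prefer those that hit the currently open row; break ties by arrival order. The structure and names below are assumptions for illustration.

```cpp
#include <cassert>
#include <vector>

struct MemRequest {
    int id;
    int row;       // DRAM row this request targets
    long arrival;  // arrival time, for the FCFS tiebreak
};

// FRFCFS: pick a row hit if one exists (oldest first among hits);
// otherwise pick the oldest request overall.
int pickFRFCFS(const std::vector<MemRequest>& queue, int openRow) {
    int best = -1;
    bool bestHit = false;
    long bestArrival = 0;
    for (const auto& r : queue) {
        bool hit = (r.row == openRow);
        if (best == -1 || (hit && !bestHit) ||
            (hit == bestHit && r.arrival < bestArrival)) {
            best = r.id;
            bestHit = hit;
            bestArrival = r.arrival;
        }
    }
    return best;
}
```

FR_FAIR and REM_INST keep the row-hit-first rule but replace the FCFS tiebreak with warp fairness or remaining-instruction priority, respectively.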

  19. Outline: Background and Motivation, Policies, Experimental Setup, Results, Conclusion

  20. Experimental Setup • Simulated GPU architecture • 8 SMs • Frontend: 1-wide, 1KB I-Cache, branch stall • Execution: 8-wide SIMD execution unit, in-order (IO) scheduling, 4-cycle latency for most instructions • Caches: 64KB software-managed cache, 8 load accesses/cycle • Memory: 32B-wide bus, 8 DRAM banks • RR fetch, FRFCFS DRAM scheduling (baseline) • Trace-driven, cycle-accurate simulator • Per-warp traces generated using GPU Ocelot [Kerr’09]

  21. Benchmarks • Taken from • CUDA SDK 2.2 – MonteCarlo, Nbody, ScalarProd • PARBOIL [UIUC’09] – MRI-Q, MRI-FHD, CP, PNS • RODINIA [Che’09] – Leukocyte, Cell, Needle • Classification based on lengths of warps • Symmetric, if <= 2% divergence • Asymmetric, otherwise (results included in paper)

  22. Outline: Background and Motivation, Policies, Experimental Setup, Results, Conclusion

  23. Results – Symmetric Applications • Baseline: RR + FRFCFS • Compute-intensive – no variation with different fetch policies • Memory-bound – improvement with fairness-oriented fetch policies, i.e., FAIR, ALL, BAR, MEM_BAR

  24. Results – Symmetric Applications • Baseline: RR + FRFCFS • On average, better than FRFCFS • MersenneTwister shows a huge improvement • The REM_INST DRAM policy performs similarly to FR_FAIR

  25. Analysis: MonteCarlo • FRFCFS DRAM scheduling • Fairness-oriented fetch policies improve performance by increasing intra-core merging

  26. Analysis: MersenneTwister • Baseline: RR + FRFCFS • Fairness-oriented DRAM scheduling (FR_FAIR, REM_INST) improves performance by increasing the DRAM row-buffer hit ratio

  27. Analysis: BlackScholes • FRFCFS DRAM scheduling • Fairness-oriented fetch policies increase MLP • Increased MLP and row-buffer hit ratio together improve performance

  28. Outline: Background and Motivation, Policies, Experimental Setup, Results, Conclusion

  29. Conclusion • Compute-intensive applications • Fetch and DRAM scheduling do not matter • Symmetric memory-intensive applications • Fairness-oriented fetch policies (FAIR, ALL, BAR, MEM_BAR) and DRAM policies (FR_FAIR, REM_INST) provide performance improvement • MonteCarlo (40%), MersenneTwister (50%), BlackScholes (18%) • Asymmetric memory-intensive applications • No correlation between performance and fetch and DRAM scheduling policies

  30. THANK YOU!
