
Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures

George L. Yuan, Ali Bakhoda and Tor M. Aamodt, Electrical and Computer Engineering, University of British Columbia. December 14th, 2009 (MICRO 2009).


Presentation Transcript


  1. Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures George L. Yuan, Ali Bakhoda and Tor M. Aamodt Electrical and Computer Engineering University of British Columbia December 14th, 2009 (MICRO 2009)

  2. The Trend: DRAM Access Locality in Many-Core [Figure: good pre-interconnect access locality vs. bad post-interconnect access locality] Inside the interconnect, interleaving of memory request streams reduces the DRAM access locality seen by the memory controller.
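The effect can be illustrated with a toy model: two cores each issue a perfectly row-local request stream, and round-robin interleaving in the interconnect destroys that locality by the time the stream reaches the controller. The rows and streams below are hypothetical, not from the paper.

```python
from itertools import chain

def row_switches(requests):
    """Count row switches; each one costs a slow precharge/activate."""
    switches, prev = 0, None
    for row in requests:
        if row != prev:
            switches += 1
            prev = row
    return switches

# Two cores each issue a perfectly row-local stream (hypothetical rows).
core0 = ["A", "A", "A", "A"]
core1 = ["B", "B", "B", "B"]

# Pre-interconnect: each stream alone needs a single row activation.
assert row_switches(core0) == 1

# Post-interconnect: round-robin interleaving alternates the rows,
# so every request the controller sees forces a row switch.
interleaved = list(chain.from_iterable(zip(core0, core1)))
assert row_switches(interleaved) == 8
```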

  3. Today’s Solution: Out-of-Order Scheduling [Figure: request queue from youngest to oldest (Row A, Row B, Row A, Row A, Row B, Row A) feeding DRAM, which keeps switching between opened rows A and B] • Queue size needs to increase as the number of cores increases • Requires fully-associative logic • Circuit issues: cycle time, area, power. OoO is OK for single-core, OK for multi-core, but for many-core..?
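A minimal sketch of this out-of-order, FR-FCFS-style selection: each cycle the scheduler scans the entire queue for a request hitting the open row, which is exactly the fully-associative search the slide warns about. Queue contents are hypothetical.

```python
def frfcfs_pick(queue, open_row):
    """First-Ready: serve the oldest request hitting the open row, if
    any; otherwise First-Come First-Serve: the oldest request overall.
    Comparing every entry's row against the open row each cycle is the
    fully-associative lookup whose cost grows with queue size."""
    for i, row in enumerate(queue):          # oldest first
        if row == open_row:                  # row hit found
            return queue.pop(i)
    return queue.pop(0)                      # no hit: oldest request

queue = ["B", "A", "A", "B", "A"]   # oldest ... youngest (rows only)
served, open_row = [], "A"
while queue:
    open_row = frfcfs_pick(queue, open_row)
    served.append(open_row)

# Row switches are batched: all Row A hits drain before Row B opens.
assert served == ["A", "A", "A", "B", "B"]
```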

  4. Related Work • Rixner, Dally, et al.: First-Ready First-Come First-Serve (FR-FCFS) • Patents by Intel, Nvidia, etc. • Mutlu & Moscibroda: Stall-Time Fair Memory Scheduling, Parallelism-Aware Batch Scheduling. No prior work addresses memory access scheduling for 10,000+ threads.

  5. Our Contributions • Show request stream interleaving in the interconnect • First paper to consider the problem of DRAM scheduling for tens of thousands of threads • Integrate DRAM scheduling into the interconnect, allowing a more complexity-effective design • Achieve 91% of the performance of out-of-order scheduling using in-order scheduling for memory-limited applications

  6. Outline • Introduction • Background on DRAM • The Request Interleaving Problem • Hold-Grant Interconnect Arbitration • Experimental Results • Conclusion

  7. Example of a many-core accelerator? GPUs • High FLOP capacity for high-resolution graphics • Nvidia’s GTX285: 30 8-wide multiprocessors • 10,000s of concurrent threads • Demand on the memory system is extremely high

  8. Background: DRAM [Figure: memory controller and memory array with row decoders, column decoders, and row buffers] • Row access: activate a row of a DRAM bank and load it into the row buffer (slow) • Column access: read and write data in the row buffer (fast) • Precharge: write row buffer data back into the row (slow)
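These three operations suggest a toy cost model: a row hit pays only the fast column access, while a row miss pays precharge plus activate first. The cycle counts below are illustrative placeholders, not real DRAM timings.

```python
# Toy DRAM bank: a row miss pays precharge + activate (slow) before
# the column access (fast); a row hit pays only the column access.
# Timing values are illustrative, not taken from any real datasheet.
T_RP, T_RCD, T_CL = 15, 15, 5   # precharge, activate, column (cycles)

def service_time(requests):
    open_row, cycles = None, 0
    for row in requests:
        if row != open_row:                 # row miss
            cycles += T_RP + T_RCD          # precharge + activate
            open_row = row
        cycles += T_CL                      # column access
    return cycles

assert service_time(["A"] * 4) == 30 + 4 * 5          # one activation
assert service_time(["A", "B", "A", "B"]) == 4 * 35   # row thrashing
```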

  9. Background: DRAM Row Access Locality Definition: the number of accesses to a row between row switches. tRC = row cycle time, tRP = row precharge time, tRCD = row activate time. Higher row access locality → higher achievable DRAM bandwidth → higher performance.
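Because a bank cannot cycle a row faster than tRC, row access locality directly bounds achievable bandwidth. A rough sketch of that bound, with made-up tRC and burst times:

```python
# Achievable DRAM bandwidth as a function of row access locality:
# a bank cannot reopen a row sooner than tRC, so with L accesses per
# row, at most L data bursts complete per row cycle. The cycle counts
# here are illustrative, not real DRAM parameters.
T_RC, T_BURST = 50, 4   # row cycle time, column burst time (cycles)

def utilization(locality):
    busy = locality * T_BURST            # cycles spent moving data
    return busy / max(T_RC, busy)        # bounded by row cycle time

assert utilization(1) == 4 / 50          # row thrashing: 8% of peak
assert utilization(20) == 1.0            # 20 hits/row: fully streamed
```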

  10. The Request Interleaving Problem

  11. FR-FCFS vs FIFO [Figure: performance of FR-FCFS vs FIFO scheduling] FR-FCFS achieves almost a 2x speedup over FIFO

  12. Alternative Solution: Banked FIFO for Bank-Level Parallelism [Figure: a single FIFO vs one FIFO per DRAM bank (banks 0–3)] Banked FIFO: ~23% speedup over FIFO
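A banked FIFO can be sketched as one in-order queue per bank with round-robin selection across banks: each queue needs no associative search, yet bank-level parallelism is preserved. The bank-interleaving function and queue count below are assumptions for illustration.

```python
from collections import deque

NUM_BANKS = 4

def bank_of(addr):
    return addr % NUM_BANKS   # hypothetical bank interleaving

class BankedFIFO:
    """One in-order queue per DRAM bank. Each queue stays strictly
    FIFO (no associative search), but the scheduler round-robins
    across banks, recovering bank-level parallelism that a single
    FIFO would serialize away."""
    def __init__(self):
        self.queues = [deque() for _ in range(NUM_BANKS)]
        self.next_bank = 0

    def push(self, addr):
        self.queues[bank_of(addr)].append(addr)

    def pop(self):
        for i in range(NUM_BANKS):          # round-robin over banks
            b = (self.next_bank + i) % NUM_BANKS
            if self.queues[b]:
                self.next_bank = (b + 1) % NUM_BANKS
                return self.queues[b].popleft()
        return None                          # all queues empty

bf = BankedFIFO()
for addr in [0, 4, 1, 5]:    # two requests each to banks 0 and 1
    bf.push(addr)
# Requests to different banks are issued alternately, not serially.
assert [bf.pop() for _ in range(4)] == [0, 1, 4, 5]
```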

  13. Our Solution: Hold-Grant Interconnect Arbitration Policies • “Hold Grant” (HG): the previously granted input has highest priority • “Row-Matching Hold Grant” (RMHG): the previously granted input has highest priority if its requested row matches the previously requested row
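The two policies can be sketched as a single grant function. The fallback arbitration is simplified to lowest-port-first here (a real router would round-robin), and the port numbers and rows are hypothetical.

```python
def hold_grant(requests, last_granted, row_match=False, last_row=None):
    """Pick which input port to grant. `requests` maps port -> row of
    that port's head request. HG: keep granting the previously granted
    port while it still has a request. RMHG: hold the grant only if the
    port's next request targets the same row as before; otherwise fall
    back to ordinary arbitration (lowest port number, for simplicity)."""
    if last_granted in requests:
        if not row_match or requests[last_granted] == last_row:
            return last_granted
    return min(requests)         # simplified fallback arbitration

# HG holds port 2 even though its next request is for a new row.
assert hold_grant({0: "B", 2: "C"}, last_granted=2) == 2

# RMHG releases the grant when port 2 switches rows ...
assert hold_grant({0: "B", 2: "C"}, 2, row_match=True, last_row="A") == 0
# ... but keeps it while port 2 stays on the same row.
assert hold_grant({0: "B", 2: "A"}, 2, row_match=True, last_row="A") == 2
```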

  14. Interconnect Arbitration Policy: Round-Robin [Figure: a router (N, W, E, S ports) forwarding requests to memory controllers 0 and 1; round-robin arbitration interleaves requests to rows A, B, C, X and Y at the controllers]

  15. Interconnect Arbitration Policy: HG [Figure: the same router and request streams; hold-grant arbitration keeps each input’s same-row requests together at the memory controllers]

  16. Interconnect Arbitration Policy: RMHG [Figure: the same router and request streams; row-matching hold-grant arbitration holds the grant only while the requested row matches the previous one]

  17. Complexity Comparison For 32-entry queues: a 15x reduction in bit comparisons, and a reduction from a 32-way associative lookup to a direct-mapped one

  18. Methodology: Microarchitecture Parameters

  19. Methodology: Simulator • GPGPU-Sim: a massively multithreaded architecture performance simulator (www.gpgpu-sim.org) • Supports NVIDIA’s Compute Unified Device Architecture (CUDA) framework • Simulates Parallel Thread Execution (PTX) instructions

  20. Results: IPC Normalized to FR-FCFS [Figure: IPC of FIFO, BFIFO, BFIFO+HG, BFIFO+HMHG4 and FR-FCFS (0–100%) across benchmarks fwt, lib, mum, neu, nn, ray, red, sp, wp and the harmonic mean (HM)] Crossbar network, 28 shader cores, 8 DRAM controllers, 8-entry DRAM queues: BFIFO: 14% speedup over regular FIFO; BFIFO+HG: 18% speedup over BFIFO, within 91% of FR-FCFS

  21. Row Streak Breakers [Figure: memory controller queue (youngest to oldest) and DRAM; a “row streak” of RowA requests from core 1 is interrupted by RowB and RowC “row streak breaker” requests from core 2, stranding a later RowA request]
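A streak breaker can be identified mechanically: a request that changes rows while a later request still wants the old row, stranding it behind an extra row switch. A small sketch, with a hypothetical queue (oldest to youngest):

```python
def count_streak_breakers(queue):
    """A request breaks a row streak when it targets a different row
    than its predecessor while some later request still wants the old
    row -- that stranded request now pays an extra row switch. This is
    a simplified, single-bank reading of the idea on the slide."""
    breakers = 0
    for i in range(1, len(queue)):
        if queue[i] != queue[i - 1] and queue[i - 1] in queue[i + 1:]:
            breakers += 1
    return breakers

# Core 1's RowA streak is interrupted by core 2's RowB/RowC requests,
# stranding the trailing RowA requests behind a row switch.
assert count_streak_breakers(["A", "B", "C", "A", "A", "A"]) == 1
# An unbroken streak has no breakers.
assert count_streak_breakers(["A", "A", "A"]) == 0
```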

  22. Row Streak Breakers [Figure: row streak breaker counts per benchmark, lower is better; B = banked FIFO, H = banked FIFO + hold grant] Arithmetic mean reduction: 73%; harmonic mean reduction: 96%

  23. Conclusion • Showed request stream interleaving in the interconnect; the effect gets worse as the number of cores increases • First paper to consider the problem of DRAM scheduling for tens of thousands of threads; no prior work on memory scheduling for many-core • Integrated DRAM scheduling into the interconnect, allowing a more complexity-effective design that should allow faster clock speeds and power/area savings • Achieved 91% of the performance of out-of-order scheduling using in-order scheduling for memory-limited applications

  24. Future Work • Improve upon our memory scheduler design • Evaluate the performance of graphics applications • Design a hold-grant scheme that works in conjunction with multiple-virtual-channel deadlock avoidance schemes for torus networks • Synthesize, lay out, and use SPICE to determine actual power/area overheads and cycle time

  25. Thank you

  26. Methodology: Microarchitecture Parameters

  27. FR-FCFS vs FIFO Out-of-order scheduling is needed inside the DRAM controller to improve the row access locality of requests to DRAM chips. FIFO vs FR-FCFS: 46.8% slowdown. George Yuan, Supervisor: Dr. Tor Aamodt, University of British Columbia

  28. Varying Topology • Ring networks require multiple virtual channels for deadlock avoidance • Multiple virtual channels provide path diversity • Path diversity means requests arrive out of order, i.e., interleaving

  29. Multiple Virtual Channels: Dynamic Virtual Channel Allocation [Figure: a source sending Row B, Row A, Row A requests through a congested router with virtual channels VC0 and VC1 toward a destination holding Row X]

  30. Multiple Virtual Channels: Static Virtual Channel Allocation [Figure: the same source, congested router, VC0/VC1 and destination as the previous slide, with statically allocated virtual channels]

  31. SVCA vs DVCA [Figure: harmonic mean IPC for different virtual channel configurations] SVCA outperforms DVCA by up to 18.5%

  32. Benchmarks

  33. Sensitivity Analysis [Figures: varying DRAM controller queue size; varying topology]

  34. More Results Memory latency: 33.9% reduction for HG and 35.3% reduction for HMHG4 compared to BFIFO. DRAM efficiency: 15.1% improvement for HG and HMHG4 over BFIFO.

  35. Row Access Locality Reduction After Interconnect 44% for Crossbar, 48% for Mesh, 52% for Ring

  36. DRAM Parameters
