
Design of a Parallel Vector Access Unit for SDRAM Memory Systems


Presentation Transcript


  1. Design of a Parallel Vector Access Unit for SDRAM Memory Systems
     Impulse Group, Department of Computer Science, University of Utah
     Presented by Binu K. Mathew

  2. Motivation
     • Current microprocessors are very powerful, but ...
     • Irregular applications still perform poorly
     • Vectorizable loop
       • e.g. : for(i = 0; i < L * S; i += S) y[i] += a[i] * x[i];
     • Poor cache utilization
     • Cache pollution
     • Poor bus utilization
     • Access pattern may be predictable
     • Memory system enhancements for vectors
     • Vectors are back again

  3. A Vector Memory Controller
     • Handle both strided and normal accesses
     • Fast scatter/gather
     • Efficient cache-line fills
     • New fast Parallel Vector Access Algorithm
     • Scheduling heuristics
     • Prototype implementation

  4. The Serial Vector Access Problem
     Vector = < Base Address, Stride, Length >
     [Figure: memory banks numbered 0 to 7]
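The vector tuple above can be sketched as a small C descriptor. This is only an illustration: the type name `vector_t`, the field names, and the choice of word-granular addresses are assumptions, not something the slides specify.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical descriptor for V = <Base Address, Stride, Length>.
   Field names and word-granular addressing are assumptions. */
typedef struct {
    uint32_t base;    /* address of element 0, in words */
    uint32_t stride;  /* distance between consecutive elements, in words */
    uint32_t length;  /* number of elements */
} vector_t;

/* Address of element i: base + i * stride. */
static uint32_t element_addr(vector_t v, uint32_t i)
{
    assert(i < v.length);  /* index must lie inside the vector */
    return v.base + i * v.stride;
}
```

With V = < 1024, 32, 16 >, `element_addr` yields 1024, 1056, 1088, 1120, ... for i = 0, 1, 2, 3, matching the strided example on the later slides.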

  5. The Serial Vector Access Problem
     V = < 1024, 1, 16 >, cache-line fill
     Addresses : 1024, 1025, 1026, 1027, ...
     [Figure: Cache-line Interleaved Serial Memory System, banks 0 to 7]

  6. The Serial Vector Access Problem
     V = < 1024, 32, 16 >, strided access
     Addresses : 1024, 1056, 1088, 1120, ...
     [Figure: Cache-line Interleaved Serial Memory System, banks 0 to 7]
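One way to see why the strided case is costly on a cache-line interleaved system is to count how many distinct cache lines each vector touches. The sketch below assumes word addresses and 32 words per line (128-byte lines with 32-bit words, as in the prototype's target); the function name `lines_touched` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Count the distinct cache lines touched by a vector <base, stride, length>.
   Addresses are in words and ascend with the index, so a new line is
   detected whenever the line number changes from the previous element. */
static uint32_t lines_touched(uint32_t base, uint32_t stride,
                              uint32_t length, uint32_t words_per_line)
{
    assert(words_per_line > 0);
    uint32_t count = 0;
    uint32_t prev_line = 0;
    for (uint32_t i = 0; i < length; i++) {
        uint32_t line = (base + i * stride) / words_per_line;
        if (i == 0 || line != prev_line) {
            count++;
            prev_line = line;
        }
    }
    return count;
}
```

Under these assumptions, V = < 1024, 1, 16 > stays inside a single line, while V = < 1024, 32, 16 > touches 16 different lines to deliver the same 16 words.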

  7. Parallel Vector Access
     • Serial vector access : Low throughput
     • Exploit bank parallelism
     • Exploit internal parallelism
     • History of Parallel Vector Access
       • CVMS : Corbal, Espasa, Valero
         • Two to 15 cycle algorithm
         • Interconnect and crossbar
       • Module stride : Steven Moyer
     • Our PVA Algorithm
       • Two to three cycles, Merge on Bus
       • Scalable
       • Word and Block interleave

  8. Memory Organization

  9. The PVA Problem
     V = < 1024, 1, 16 >, cache-line fill
     Addresses : 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039
     Banks : 0 1 2 3 4 5 6 7
     Bank Access Sequence : 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7

  10. The PVA Problem : Stride 2
      V = < 1024, 2, 16 >
      Addresses : 1024 1026 1028 1030 1032 1034 1036 1038 1040 1042 1044 1046 1048 1050 1052 1054
      Banks : 0 1 2 3 4 5 6 7
      Bank Access Sequence : 0, 2, 4, 6, 0, 2, 4, 6, ...

  11. The PVA Problem : Stride 3
      V = < 1024, 3, 16 >
      Addresses : 1024 1027 1030 1033 1036 1039 1042 1045 1048 1051 1054 1057 1060 1063 1066 1069
      Banks : 0 1 2 3 4 5 6 7
      Bank Access Sequence : 0, 3, 6, 1, 4, 7, 2, 5, 0, ...
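The bank access sequences on the three slides above are consistent with a word-interleaved mapping in which address a lands in bank a mod 8. A minimal sketch under that assumption (the names `bank_of` and `bank_sequence` are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Word-interleaved mapping: consecutive words go to consecutive banks. */
static uint32_t bank_of(uint32_t addr, uint32_t num_banks)
{
    assert(num_banks > 0);
    return addr % num_banks;
}

/* Fill out[0 .. length-1] with the bank visited by each vector element.
   The caller must provide room for at least `length` entries. */
static void bank_sequence(uint32_t base, uint32_t stride, uint32_t length,
                          uint32_t num_banks, uint32_t *out)
{
    for (uint32_t i = 0; i < length; i++)
        out[i] = bank_of(base + i * stride, num_banks);
}
```

For V = < 1024, 3, 16 > and 8 banks this reproduces 0, 3, 6, 1, 4, 7, 2, 5, 0, ...; for stride 2 only the even banks are ever visited, which is the imbalance a parallel scheme has to cope with.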

  12. Our PVA Solution
      • Functions
        • FirstHit(V,b) : Compute first vector element of V that hits b
          Operations : Table lookup, multiply or shift and add
        • NextHit(V.S) : Compute incremental index of next element
          Operations : Trivial PLA
      • Bank Controller Algorithm
        • Compute i = FirstHit(V, b)
        • If there is no hit, continue
        • Till the end of the vector is reached do :
          • Schedule access to memory location V.B + i * V.S
          • i = i + NextHit(V.S)
      • Scheduling heuristics
        • Early row open
        • Reordering and interleaving requests
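The bank controller algorithm on this slide can be sketched in software as follows. This is not the hardware method: the slide computes FirstHit with a table lookup plus multiply or shift-and-add and NextHit with a PLA, whereas the sketch below swaps in a short linear search and a gcd. It also assumes word interleaving (bank = address mod number of banks); `vector_t`, `NO_HIT`, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NO_HIT UINT32_MAX  /* sentinel: the vector never touches this bank */

/* V.B, V.S and vector length, all in words (assumed layout). */
typedef struct { uint32_t base, stride, length; } vector_t;

static uint32_t gcd_u32(uint32_t a, uint32_t b)
{
    while (b != 0) { uint32_t t = a % b; a = b; b = t; }
    return a;
}

/* NextHit: index distance between consecutive elements in the same bank. */
static uint32_t next_hit(uint32_t stride, uint32_t num_banks)
{
    assert(num_banks > 0);
    return num_banks / gcd_u32(stride, num_banks);
}

/* FirstHit: smallest index i whose address maps to `bank`, or NO_HIT.
   The bank pattern repeats every next_hit() elements, so the search
   never needs to look further than one period. */
static uint32_t first_hit(vector_t v, uint32_t bank, uint32_t num_banks)
{
    uint32_t period = next_hit(v.stride, num_banks);
    for (uint32_t i = 0; i < period && i < v.length; i++)
        if ((v.base + i * v.stride) % num_banks == bank)
            return i;
    return NO_HIT;
}

/* Bank controller loop: record every address this bank would schedule.
   `out` needs room for v.length entries. Returns the number recorded. */
static uint32_t bank_controller(vector_t v, uint32_t bank,
                                uint32_t num_banks, uint32_t *out)
{
    uint32_t n = 0;
    uint32_t i = first_hit(v, bank, num_banks);
    if (i == NO_HIT)
        return 0;  /* no hit: this bank ignores the request */
    uint32_t delta = next_hit(v.stride, num_banks);
    for (; i < v.length; i += delta)
        out[n++] = v.base + i * v.stride;
    return n;
}
```

Each bank runs the same loop independently on the broadcast vector request, so no bank needs to know what the others are fetching. For V = < 1024, 2, 16 > with 8 banks, bank 2 would schedule 1026, 1034, 1042 and 1050, and the odd banks schedule nothing.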

  13. Our PVA Solution : Stride 2
      V = < 1024, 2, 16 >
      [Figure: memory banks numbered 0 to 7]

  14. Our PVA Solution : Stride 2
      Addresses : 1024 1026 1028 1030 1032 1034 1036 1038 1040 1042 1044 1046 1048 1050 1052 1054
      Bank 0 : Hit, 0 (δ=4)
      Bank 1 : No Hit
      Bank 2 : Hit, 1 (δ=4)
      Bank 3 : No Hit
      Bank 4 : Hit, 2 (δ=4)
      Bank 5 : No Hit
      Bank 6 : Hit, 3 (δ=4)
      Bank 7 : No Hit
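The hit pattern and the value δ=4 in this example can be checked with a short congruence argument. The derivation below is a sketch that assumes word interleaving over M banks (address a maps to bank a mod M); the symbols g and M are introduced here for illustration and do not appear on the slides.

```latex
% Assumption: word interleaving, so element i of V = <B, S, L> hits bank b
% exactly when B + iS is congruent to b modulo M.
\begin{align*}
B + iS &\equiv b \pmod{M}, \qquad g = \gcd(S, M) \\
\text{solvable} &\iff g \mid (b - B) \\
\delta &= M / g \quad \text{(spacing between successive solutions } i\text{)} \\[4pt]
\text{Example: } & B = 1024,\ S = 2,\ M = 8 \;\Rightarrow\; g = 2,\ \delta = 4 \\
& b \in \{0, 2, 4, 6\}:\ \text{first hit at } i = b/2 \in \{0, 1, 2, 3\} \\
& b \in \{1, 3, 5, 7\}:\ \text{no hit}
\end{align*}
```

A bank sees a hit only if, in addition, its first solution i is smaller than the vector length, which holds for all four even banks here since the length is 16.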

  15. Our PVA Solution
      • Functions
        • FirstHit(V,b) : Compute first vector element of V that hits b
          Operations : Table lookup, multiply or shift and add
        • NextHit(V.S) : Compute incremental index of next element
          Operations : Trivial PLA
      • Bank Controller Algorithm
        • Compute i = FirstHit(V, b)
        • If there is no hit, continue
        • Till the end of the vector is reached do :
          • Schedule access to memory location V.B + i * V.S
          • i = i + NextHit(V.S)
      • Scheduling heuristics
        • Early row open
        • Reordering and interleaving requests

  16. Hardware Prototype
      • Verilog model : Approximately 3600 lines of code
      • Timing estimates with FPGA, gate level simulation
      • Hardware cost per bank controller
        • Approximately 11000 gates
        • 2K bytes on-chip RAM
      • Target
        • CPU : R10000
        • L2 cache-line size : 128 bytes
        • System bus : 64 bits wide, split transaction
        • Four outstanding memory requests
        • Memory : 256 Mbit Micron SDRAM at 100 MHz
          • 32 bits x 16 banks
          • Word interleaved
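A few derived quantities follow directly from the target parameters listed above. The sketch below only restates that arithmetic; the constant and function names are hypothetical, and it assumes one word equals the 32-bit bank width.

```c
#include <assert.h>
#include <stdint.h>

/* Target parameters taken from the slide; names are illustrative only. */
enum {
    LINE_BYTES = 128,  /* L2 cache-line size */
    BUS_BYTES  = 8,    /* 64-bit system bus */
    WORD_BYTES = 4,    /* 32-bit bank width, assumed to be one word */
    NUM_BANKS  = 16
};

/* Words in one cache line. */
static uint32_t words_per_line(void)
{
    assert(LINE_BYTES % WORD_BYTES == 0);
    return LINE_BYTES / WORD_BYTES;
}

/* Bus transfers needed to move one cache line. */
static uint32_t bus_transfers_per_line(void)
{
    assert(LINE_BYTES % BUS_BYTES == 0);
    return LINE_BYTES / BUS_BYTES;
}

/* Words each bank supplies for a unit-stride line fill when the
   line is word-interleaved evenly across all banks. */
static uint32_t words_per_bank_unit_stride(void)
{
    assert(words_per_line() % NUM_BANKS == 0);
    return words_per_line() / NUM_BANKS;
}
```

So a 128-byte line is 32 words, takes 16 transfers on the 64-bit bus, and on a unit-stride fill draws 2 words from each of the 16 banks.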

  17. PVA Implementation
      [Block diagram; labels: Register File, Access Scheduler, Vector Contexts, SDRAM Interface, Request FIFO, FirstHit Predict, FirstHit Calculate, Vector Bus, Staging Unit, SDRAM Bus]

  18. Performance Evaluation
      • Kernels
      • 240 data points
      • Compared PVA with :
        • Cache-line interleaved SDRAM
          64 bits wide
          Burst length of 16
        • Scatter/Gather serial SDRAM
          32 bits wide, 16 banks
          Overlapped RAS and precharge
        • Parallel Vector Access SRAM
          32 bits wide, 16 banks
          Single cycle latency, pipelined SRAM

  19. Results : Cache-line fills

  20. Results : Strided Access
      [Chart; surviving data labels: 10.86, 32.69]

  21. Results : SRAM Comparison

  22. Future Work
      • Full program simulation
      • Integration with virtual memory
      • Parallel access techniques for other patterns
        • Indirect vectors
        • FFT
      • Impulse ASIC

  23. Summary
      • New and improved PVA Algorithm
      • Up to five times improvement over older method
      • Speedups in the range 1.0 to 32.8
      • Technique for block interleaving
      • Scalable
      • Hardware prototype designed
      • Moderate hardware complexity
