Embedded DRAM for a Reconfigurable Array

This paper discusses the use of on-chip DRAM in a reconfigurable architecture, including the Configurable Memory Block (CMB) and its evaluation. It also examines the challenges and advantages of on-chip DRAM compared to SRAM.

Presentation Transcript


  1. Embedded DRAM for a Reconfigurable Array. S. Perissakis, Y. Joo¹, J. Ahn¹, A. DeHon, J. Wawrzynek. University of California, Berkeley; ¹LG Semicon Co., Ltd.

  2. Outline • Reconfigurable architecture overview • Motivation for on-chip DRAM • Configurable Memory Block (CMB) • Evaluation • Conclusion

  3. Long Term Architecture Goal • On-chip CPU • LUT-based compute pages • DRAM memory pages • Fat pyramid network: fat tree + shortcuts


  7. Long Term Architecture Goal [Figure labels: CPU, Reconfigure, Kernel 1 (producer), Kernel 2 (consumer)]

  8. Motivation Need large on-chip memory for: • Stream buffers: reduce reconfiguration frequency • Configuration memory: speed up reconfiguration • Application memory: speed up individual kernels

  9. Challenges DRAM offers increased density (10X to 20X that of SRAM), but: • Harder to use: row/column accesses and variable latency; refresh • Lower performance: increased access latency Q: Is it worth the trouble?

  10. Trumpet test chip • One compute page • One memory page • Corresponding fraction of network [Figure labels: CPU, Trumpet]

  11. CMB Functions • Configuration source • State source/sink • Data store • Input/output

  12. CMB Overview [Block diagram. Blocks: CMB Controller, DRAM Macro, Address & Data Xbars, Retiming Registers, Stall Buffers, Rate Matching. Labels: From host, From compute page. Signals: Ctl[1:0], Cmd, Addr[9:0], Ctl[1:0], Addr[17:0], DQ[127:0], Tree[159:0], Short[159:0], [127:0], [63:0]]

  13. DRAM Macro • 0.25 µm, 4-metal eDRAM process • 1 to 8 Mbits (2 Mbits in test chip) • 128-bit wide SDRAM interface • Up to 125 MHz clock → 2 GB/s peak B/W • 36 ns / 12 ns row/col latencies • Row buffers to hide precharge & refresh • Designed by LG Semicon
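The peak-bandwidth figure on this slide follows from the interface width and clock rate. The sketch below reproduces that arithmetic in Python. The function names are illustrative and not from the slides, and rounding the row/column latencies up to whole clock cycles is an assumption made here.

```python
import math

def peak_bandwidth_bytes_per_s(width_bits: int, clock_hz: float) -> float:
    """Peak bandwidth of an interface that moves width_bits every clock."""
    return width_bits / 8 * clock_hz

def latency_cycles(latency_ns: float, clock_hz: float) -> int:
    """Latency in whole clock cycles (rounding up is an assumption)."""
    period_ns = 1e9 / clock_hz
    return math.ceil(latency_ns / period_ns)

CLOCK_HZ = 125e6                                   # 125 MHz, from the slide
peak = peak_bandwidth_bytes_per_s(128, CLOCK_HZ)   # 128-bit interface
row_cycles = latency_cycles(36, CLOCK_HZ)          # 36 ns row latency
col_cycles = latency_cycles(12, CLOCK_HZ)          # 12 ns column latency

print(f"peak bandwidth: {peak / 1e9:.1f} GB/s")
print(f"row latency: {row_cycles} cycles, column latency: {col_cycles} cycles")
```

At an 8 ns clock period the quoted latencies are not whole multiples of the period, which is one reason the logic-side latency is quoted in cycles on later slides.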

  14. SRAM Abstraction • SRAM-like interface: Req, R/W, Address, Data • Row buffers → simple direct-mapped cache • 6-cycle minimum latency, pipelined • Misses handled by logic stalls • 10-cycle miss latency “hidden” from logic
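One way to picture row buffers acting as a direct-mapped cache is the toy model below. It is a sketch under stated assumptions, not the CMB controller: the number of row buffers and the modulo mapping from row to buffer are hypothetical, the 128-byte row size comes from the "1 Kbit = 1 DRAM row" note on the example slide, and only the 10-cycle miss penalty is taken from this slide.

```python
from dataclasses import dataclass, field

ROW_BYTES = 128          # 1 Kbit per DRAM row (from the example slide)
NUM_ROW_BUFFERS = 4      # hypothetical; the slides do not give this number
MISS_STALL_CYCLES = 10   # miss penalty quoted on the slide

@dataclass
class RowBufferCache:
    """Direct-mapped cache whose lines are whole DRAM rows."""
    open_rows: list = field(default_factory=lambda: [None] * NUM_ROW_BUFFERS)
    hits: int = 0
    misses: int = 0

    def access(self, byte_addr: int) -> int:
        """Return the stall cycles this access costs (0 on a hit)."""
        row = byte_addr // ROW_BYTES
        index = row % NUM_ROW_BUFFERS    # assumed mapping: one slot per row
        if self.open_rows[index] == row:
            self.hits += 1
            return 0
        self.open_rows[index] = row      # evict whatever row was held there
        self.misses += 1
        return MISS_STALL_CYCLES

cache = RowBufferCache()
# Sequential 16-byte (128-bit) accesses across two rows: one miss per row.
stalls = sum(cache.access(addr) for addr in range(0, 2 * ROW_BYTES, 16))
print(cache.hits, cache.misses, stalls)
```

In this model a sequential stream pays one miss per row, while accesses that alternate between two rows mapping to the same buffer would miss every time.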

  15. Stalls • Stall sources: • Row buffer miss (10 cycles) • Write after read (4 cycles) • DRAM/logic clock alignment (1 cycle) • Refresh (Halt from host) • Multicycle stall distribution
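The per-event stall costs listed above can be combined into a rough efficiency estimate. The sketch below assumes one useful access per cycle when not stalled and that stalls never overlap; both are simplifications introduced here. Refresh is left out because the slide gives no cycle count for it, and the event counts in the example are made up for illustration.

```python
# Stall cost per event, in cycles, as listed on the slide.
STALL_CYCLES = {
    "row_buffer_miss": 10,
    "write_after_read": 4,
    "clock_alignment": 1,
}

def efficiency(useful_cycles: int, events: dict) -> float:
    """Fraction of cycles doing useful work, assuming stalls do not overlap."""
    stall = sum(STALL_CYCLES[name] * count for name, count in events.items())
    return useful_cycles / (useful_cycles + stall)

# Hypothetical workload: 100 accesses, 2 row misses, 3 read-to-write turnarounds.
example = efficiency(100, {"row_buffer_miss": 2, "write_after_read": 3})
print(f"{example:.1%}")
```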

  16. Stall Buffers • Memory page is never stalled • Must buffer read data during stall • Must buffer requests during stall distribution [Figure labels: DRAM macro, CMB logic, Input Stall Buf, Output Stall Buf, User logic]

  17. Trumpet Test Chip • 0.25 µm DRAM, 0.4 µm logic • 2 Mbits + 64 LUTs • 125 MHz operation • 1 GB/sec peak bandwidth • 10 µsec reconfiguration • 10 x 5 mm² die • 1 W @ 125 MHz [Figure labels: CMB, Compute Page]

  18. CMB Area Breakdown • 13.95 mm² total • 2 Mbits capacity • 147 Kbits/mm² average density • Compare to 700-900 Kbits/mm² commodity DRAM [Chart labels: DRAM core, Fuse, Datapath, SDRAM i/f, DRAM Macro, controller, Datapath, Controller, CMB Logic, clock]

  19. Using a Custom Macro • Existing: 13.95 mm², 147 Kbits/mm² • Custom: 9.4 mm², 218 Kbits/mm²

  20. Comparison to SRAM CMB • DRAM (custom macro): 218 Kb/mm² • SRAM (equal area): 25 Kb/mm², with typical SRAM core densities and: no stall buffers; simplified controller • → Close to 1 order of magnitude density advantage for DRAM
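The density figures on the last three slides can be checked with a few lines of arithmetic. The sketch below assumes "2 Mbits" means 2 × 1024 Kbits; the variable names are illustrative.

```python
CAPACITY_KBITS = 2 * 1024   # 2 Mbits, assuming 1 Mbit = 1024 Kbits

def density_kbits_per_mm2(area_mm2: float) -> float:
    """Average storage density of the whole CMB for a given area."""
    return CAPACITY_KBITS / area_mm2

existing = density_kbits_per_mm2(13.95)   # existing macro
custom = density_kbits_per_mm2(9.4)       # custom macro
advantage = 218 / 25                      # DRAM vs. SRAM CMB, slide figures

print(f"existing: {existing:.0f} Kbits/mm2, custom: {custom:.0f} Kbits/mm2")
print(f"DRAM/SRAM density ratio: {advantage:.1f}x")
```

The ratio comes out a little under 9x, which matches the slide's "close to 1 order of magnitude" wording.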

  21. Performance • Configuration / state swap: peak 1 GB/s • User accesses: dependent on access patterns • Peak if high locality • Near peak for sequential patterns (62-93%) • Column latency exposed when dependencies exist, or on mixed R/W • Row latency exposed on random accesses

  22. Performance (example) • Input image (8 x 8 blocks), scanline order • Row: ~4 misses / DCT block • 8x8 DCT block; 1 Kbit = 1 DRAM row • Col: 2 misses / DCT block • → 73% efficiency [Figure labels: Row, Column]

  23. Refresh Overhead • 8 to 16 ms retention time expected • 2.5% to 5.0% bandwidth loss • Can reduce by refreshing only active part of memory • May skip refresh for short-lived data

  24. Conclusion • Q: Is on-chip DRAM advantageous compared to SRAM? • Our experience so far: user-friendly abstraction possible; can maintain density advantage • Effect on application performance: large buffer space → less frequent reconfiguration; high bandwidth → faster reconfiguration; effect on individual kernels often limited by DRAM core latency
