
A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures

Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, David Brooks, Harvard University


Presentation Transcript


  1. A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures. Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, David Brooks, Harvard University

  2. Beyond Homogeneous Parallelism: General-Purpose Cores (CPU); Programmable Accelerators (DSP, GPU); Application-Specific Accelerators (ASIP, ASIC). Axes: Energy Efficiency, Flexibility, Programmability, Design Cost

  3. Today’s SoC OMAP 4 SoC

  4. Today’s SoC ARM Cores Face Audio GPU DSP Imaging DSP Video DMA USB SD System Bus USB DMA Secondary Bus Secondary Bus Tertiary Bus OMAP 4 SoC

  5. Today’s SoC: Apple A7; Harvard VLSI-ARCH Group SoC Tapeout

  6. Today’s SoC: GPU/DSP, CPU, CPU, Mem Interface, Buses, Acc Acc Acc Acc Acc Acc Acc Acc Acc

  7. Future Accelerator-Centric Architectures: GPU/DSP, Big Cores, Small Cores, Memory Interface, Shared Resources, Sea of Fine-Grained Accelerators. Flexibility, Design Cost, Programmability.
      • How to decompose an application to accelerators?
      • How to rapidly design lots of accelerators?
      • How to design and manage the shared resources?

  8. Aladdin: A pre-RTL, Power-Performance Accelerator Simulator
      • Inputs: Unmodified C-Code; Accelerator Design Parameters (e.g., # FU, mem. BW)
      • Outputs: Power/Area; Performance
      • Modeled components: Accelerator Specific Datapath; Private L1/Scratchpad; Shared Memory/Interconnect Models
      • “Accelerator Simulator”: Design Accelerator-Rich SoC Fabrics and Memory Systems
      • “Design Assistant”: Understand Algorithmic-HW Design Space before RTL
      Flexibility, Programmability, Design Cost

  9. Future Accelerator-Centric Architecture GPU/DSP Big Cores Small Cores Memory Interface Shared Resources Sea of Fine-Grained Accelerators

  10. Future Accelerator-Centric Architecture: GPU/DSP, Big Cores, Small Cores, Memory Interface, Shared Resources, Sea of Fine-Grained Accelerators. Aladdin can rapidly evaluate a large design space of accelerator-centric architectures.

  11. Aladdin Overview Optimization Phase Optimistic IR Idealistic DDDG Initial DDDG Power/Area Models Dynamic Data Dependence Graph (DDDG) C Code Performance Activity Resource Constrained DDDG Program Constrained DDDG Acc Design Parameters Power/Area Realization Phase

  12. Aladdin Overview Optimization Phase Optimistic IR Idealistic DDDG Initial DDDG Power/Area Models C Code Performance Activity Resource Constrained DDDG Program Constrained DDDG Acc Design Parameters Power/Area Realization Phase

  13. From C to Design Space C Code: for(i=0; i<N; ++i) c[i] = a[i] + b[i];
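As a minimal sketch, the loop on this slide can be wrapped in a compilable C function. The function name `vector_add`, the `int` element type, and the explicit length parameter `n` are assumptions made for illustration; the slide gives only the loop itself.

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise sum of two arrays: c[i] = a[i] + b[i] for 0 <= i < n.
 * Hypothetical wrapper around the slide's loop; int elements are assumed. */
void vector_add(const int *a, const int *b, int *c, int n)
{
    assert(a != NULL && b != NULL && c != NULL && n >= 0);
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```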

  14. From C to Design Space: IR Dynamic Trace
      C Code: for(i=0; i<N; ++i) c[i] = a[i] + b[i];
      IR Dynamic Trace:
        r0 = 0                   // i = 0
        r4 = load(r0 + r1)       // load a[i]
        r5 = load(r0 + r2)       // load b[i]
        r6 = r4 + r5
        store(r0 + r3, r6)       // store c[i]
        r0 = r0 + 1              // ++i
        r4 = load(r0 + r1)       // load a[i]
        r5 = load(r0 + r2)       // load b[i]
        r6 = r4 + r5
        store(r0 + r3, r6)       // store c[i]
        r0 = r0 + 1              // ++i
        …

  15. From C to Design Space: Initial DDDG
      C Code: for(i=0; i<N; ++i) c[i] = a[i] + b[i];
      IR Trace:
        0.  r0 = 0                   // i = 0
        1.  r4 = load(r0 + r1)       // load a[i]
        2.  r5 = load(r0 + r2)       // load b[i]
        3.  r6 = r4 + r5
        4.  store(r0 + r3, r6)       // store c[i]
        5.  r0 = r0 + 1              // ++i
        6.  r4 = load(r0 + r1)       // load a[i]
        7.  r5 = load(r0 + r2)       // load b[i]
        8.  r6 = r4 + r5
        9.  store(r0 + r3, r6)       // store c[i]
        10. r0 = r0 + 1              // ++i
        …
      [Figure: initial DDDG with nodes 0. i=0; 1. ld a; 2. ld b; 3. +; 4. st c; 5. i++; 6. ld a; 7. ld b; 8. +; 9. st c; 10. i++; 11. ld a; 12. ld b; 13. +; 14. st c]

  16. From C to Design Space: Idealistic DDDG
      C Code: for(i=0; i<N; ++i) c[i] = a[i] + b[i];
      IR Trace:
        0.  r0 = 0                   // i = 0
        1.  r4 = load(r0 + r1)       // load a[i]
        2.  r5 = load(r0 + r2)       // load b[i]
        3.  r6 = r4 + r5
        4.  store(r0 + r3, r6)       // store c[i]
        5.  r0 = r0 + 1              // ++i
        6.  r4 = load(r0 + r1)       // load a[i]
        7.  r5 = load(r0 + r2)       // load b[i]
        8.  r6 = r4 + r5
        9.  store(r0 + r3, r6)       // store c[i]
        10. r0 = r0 + 1              // ++i
        …
      [Figure: the initial DDDG and the idealistic DDDG, each drawn over nodes 0. i=0; 1. ld a; 2. ld b; 3. +; 4. st c; 5. i++; 6. ld a; 7. ld b; 8. +; 9. st c; 10. i++; 11. ld a; 12. ld b; 13. +; 14. st c]
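The step from the initial to the idealistic DDDG can be sketched in C as follows. This is an illustration under stated assumptions, not Aladdin's implementation: the names `build_trace` and `dddg_depth` are hypothetical, every operation is assumed to take one cycle, only register read-after-write dependences are tracked, and the arrays a, b, and c are assumed not to overlap in memory. The register numbering follows the trace on the slide.

```c
#include <assert.h>

#define MAX_ITERS 64
#define NUM_REGS 8
#define MAX_NODES (1 + 5 * MAX_ITERS)

/* One dynamic IR instruction: destination register (-1 if none)
 * and up to three source registers (-1 = unused slot). */
struct node { int dst; int src[3]; };

/* Emit the dynamic trace of: for (i = 0; i < n; ++i) c[i] = a[i] + b[i];
 * r0 = i, r1..r3 = base addresses of a, b, c (live-in), r4..r6 = temporaries. */
static int build_trace(int n_iters, struct node *t)
{
    int n = 0;
    t[n++] = (struct node){ 0, { -1, -1, -1 } };      /* r0 = 0 */
    for (int i = 0; i < n_iters; ++i) {
        t[n++] = (struct node){ 4, { 0, 1, -1 } };    /* r4 = load(r0 + r1) */
        t[n++] = (struct node){ 5, { 0, 2, -1 } };    /* r5 = load(r0 + r2) */
        t[n++] = (struct node){ 6, { 4, 5, -1 } };    /* r6 = r4 + r5 */
        t[n++] = (struct node){ -1, { 0, 3, 6 } };    /* store(r0 + r3, r6) */
        t[n++] = (struct node){ 0, { 0, -1, -1 } };   /* r0 = r0 + 1 */
    }
    return n;
}

/* Number of levels in the dependence graph when every node is placed as
 * early as its producers allow. If drop_index_chain is nonzero, the
 * dependence of each index update on the previous index value is removed,
 * which models the loop-level optimization that yields the idealistic DDDG. */
int dddg_depth(int n_iters, int drop_index_chain)
{
    assert(n_iters >= 1 && n_iters <= MAX_ITERS);
    struct node trace[MAX_NODES];
    int level[MAX_NODES];
    int last_writer[NUM_REGS];
    int n = build_trace(n_iters, trace);

    for (int r = 0; r < NUM_REGS; ++r)
        last_writer[r] = -1;              /* -1 = live-in, no producer node */

    int depth = 0;
    for (int v = 0; v < n; ++v) {
        level[v] = 0;
        for (int s = 0; s < 3; ++s) {
            int r = trace[v].src[s];
            if (r < 0 || last_writer[r] < 0)
                continue;                 /* empty slot or live-in value */
            if (drop_index_chain && r == 0 && trace[v].dst == 0)
                continue;                 /* index-to-index edge removed */
            int p = last_writer[r];
            if (level[p] + 1 > level[v])
                level[v] = level[p] + 1;
        }
        if (trace[v].dst >= 0)
            last_writer[trace[v].dst] = v;
        if (level[v] + 1 > depth)
            depth = level[v] + 1;
    }
    return depth;
}
```

Under these assumptions, `dddg_depth(n, 0)` grows as n + 3 because each `++i` waits on the previous one, while `dddg_depth(n, 1)` stays at 4 levels (index, load, add, store) for any n.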

  17. From C to Design Space: Optimization Phase: C->IR->DDDG
      • Include application-specific customization strategies.
      • Node-Level:
        • Bit-width Analysis
        • Strength Reduction
        • Tree-height Reduction
      • Loop-Level:
        • Remove dependences between loop index variables
      • Memory Optimization:
        • Memory-to-Register Conversion
        • Store-Load Forwarding
        • Store Buffer
      • Extensible
        • e.g. Model CAM accelerator by matching nodes in DDDG
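To make one of the node-level optimizations concrete, the sketch below illustrates the general idea of tree-height reduction: a chain of n - 1 dependent additions is rebalanced into a tree whose height is about log2(n). It is a generic illustration, not Aladdin's code; the function name `tree_sum` and its interface are hypothetical. Integer addition is reassociated freely here; for floating-point data the same rewrite can change rounding.

```c
#include <assert.h>
#include <stddef.h>

/* Sum v[0..n-1] by pairwise (tree) reduction, overwriting v as scratch.
 * Stores in *depth_out the number of dependent addition levels used.
 * A plain left-to-right chain would need n - 1 dependent levels instead. */
int tree_sum(int *v, int n, int *depth_out)
{
    assert(v != NULL && depth_out != NULL && n >= 1);
    int depth = 0;
    while (n > 1) {
        int half = n / 2;
        /* All additions in this pass are independent of each other. */
        for (int i = 0; i < half; ++i)
            v[i] = v[2 * i] + v[2 * i + 1];
        if (n % 2) {
            v[half] = v[n - 1];   /* odd element passes through unchanged */
            n = half + 1;
        } else {
            n = half;
        }
        ++depth;
    }
    *depth_out = depth;
    return v[0];
}
```

For eight inputs this uses 3 dependent levels instead of the 7 a sequential chain would need, which shortens the critical path of the DDDG when enough adders are available.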

  18. From C to Design Space: One Design
      • Acc Design Parameters:
        • Memory BW <= 2
        • 1 Adder
      [Figure: the idealistic DDDG (nodes 0. i=0 through 19. st c) next to its resource-constrained schedule; axes: Resource Activity (MEM, +) vs. Cycle]

  19. From C to Design Space: Another Design
      • Acc Design Parameters:
        • Memory BW <= 4
        • 2 Adders
      [Figure: the idealistic DDDG (nodes 0. i=0 through 19. st c) next to its resource-constrained schedule; axes: Resource Activity (MEM, +) vs. Cycle]
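The effect of the two parameter sets can be sketched with a small greedy scheduler over the idealistic DDDG of the vector-add loop. This is a simplified model under stated assumptions, not Aladdin's scheduler: the function name `schedule_cycles` is hypothetical, every operation takes one cycle, loads and stores share the memory bandwidth limit, index updates are treated as free, and nodes are considered in program order.

```c
#include <assert.h>

#define MAX_ITERS 64

/* Per-iteration nodes of the idealistic DDDG for c[i] = a[i] + b[i]. */
enum { LD_A, LD_B, ADD, ST_C, OPS_PER_ITER };

/* True if a node was issued in an earlier cycle, so its result is available. */
static int done_before(int issued_at, int cycle)
{
    return issued_at >= 0 && issued_at < cycle;
}

/* Cycles needed to execute n_iters iterations when at most mem_bw memory
 * operations and n_adders additions can issue per cycle. */
int schedule_cycles(int n_iters, int mem_bw, int n_adders)
{
    assert(n_iters >= 1 && n_iters <= MAX_ITERS);
    assert(mem_bw >= 1 && n_adders >= 1);

    int issued_at[MAX_ITERS][OPS_PER_ITER];
    for (int i = 0; i < n_iters; ++i)
        for (int k = 0; k < OPS_PER_ITER; ++k)
            issued_at[i][k] = -1;             /* -1 = not issued yet */

    int remaining = n_iters * OPS_PER_ITER;
    int cycle = 0;
    while (remaining > 0) {
        int mem_used = 0, add_used = 0;
        for (int i = 0; i < n_iters; ++i) {
            for (int k = 0; k < OPS_PER_ITER; ++k) {
                if (issued_at[i][k] >= 0)
                    continue;                 /* already scheduled */
                int ready = 1;                /* loads have no producers */
                if (k == ADD)
                    ready = done_before(issued_at[i][LD_A], cycle) &&
                            done_before(issued_at[i][LD_B], cycle);
                else if (k == ST_C)
                    ready = done_before(issued_at[i][ADD], cycle);
                if (!ready)
                    continue;
                /* Pick the resource class this node competes for. */
                int *used = (k == ADD) ? &add_used : &mem_used;
                int limit = (k == ADD) ? n_adders : mem_bw;
                if (*used < limit) {
                    ++*used;
                    issued_at[i][k] = cycle;
                    --remaining;
                }
            }
        }
        ++cycle;
    }
    return cycle;
}
```

In this model, four iterations take 7 cycles with `schedule_cycles(4, 2, 1)` and 4 cycles with `schedule_cycles(4, 4, 2)`, which mirrors the trade-off between the two designs: more memory bandwidth and adders cost more resources but finish sooner.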

  20. From C to Design Space: Realization Phase: DDDG->Estimates
      • Constrain the DDDG with program and user-defined resource constraints
      • Program Constraints
        • Control Dependence
        • Memory Ambiguation
      • Resource Constraints
        • Loop-level Parallelism
        • Loop Pipelining
        • Memory Ports
        • # of FUs (e.g., adders, multipliers)

  21. From C to Design Space: Power-Performance per Design
      [Figure: Power vs. Cycle plot with two design points]
      • Acc Design Parameters: Memory BW <= 4, 2 Adders
      • Acc Design Parameters: Memory BW <= 2, 1 Adder

  22. From C to Design Space: Design Space of an Algorithm
      [Figure: Power vs. Cycle plot]

  23. Aladdin Validation Aladdin C Code Power/Area Performance Design Compiler Verilog Activity ModelSim

  24. Aladdin Validation Aladdin C Code Power/Area Performance Design Compiler RTL Designer Verilog Activity Vivado HLS HLS C Tuning ModelSim

  25. Aladdin Validation

  26. Aladdin Validation

  27. Aladdin enables rapid design space exploration for accelerators.
      • Aladdin flow (C Code -> Aladdin -> Power/Area, Performance): 7 mins
      • RTL flow (RTL Designer, HLS C Tuning, Vivado HLS, Verilog, ModelSim Activity, Design Compiler): 52 hours

  28. Aladdin enables pre-RTL simulation of accelerators with the rest of the SoC.
      • GPU: GPGPU-Sim
      • Big Cores: MARSSx86 ...
      • Small Cores: XIOSim…
      • Memory Interface: DRAMSim2
      • Shared Resources: Cacti/Orion2
      • Sea of Fine-Grained Accelerators: Aladdin

  29. Modeling Accelerators in a SoC-like Environment Core Acc Core Cache Memory Acc Core Cache Memory

  30. Aladdin: A pre-RTL, Power-Performance Accelerator Simulator
      • Architectures with 1000s of accelerators will be radically different; new design tools are needed.
      • Aladdin enables rapid design space exploration of future accelerator-centric platforms.
      • You can find Aladdin at http://vlsiarch.eecs.harvard.edu/aladdin
