
SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)



Presentation Transcript


  1. SoC Subsystem Acceleration using Application-Specific Processors (ASIPs) • Markus Willems, Product Manager, Synopsys

  2. SoC Design • What to do when the performance of your main processor is insufficient? • Go multicore? • Application mapping difficult, resource utilisation unbalanced • Add hardwired accelerators? • Balanced but inflexible

  3. SoC Design • What to do when the performance of your main processor is insufficient? • ASIPs: application-specific processors • Anything between general-purpose µP and hardwired data-path • Deploys classic hardware tricks (parallelism and customized datapaths) while retaining programmability – Hardware efficiency with software programmability

  4. Agenda • ASIPs as accelerators in SoCs • How to design ASIPs • Examples • Conclusions

  5. Architectural Optimization Space • ASIP architectural optimization space: Parallelism, Specialization

  6. Architectural Optimization Space • Parallelism • Instruction-level parallelism (ILP): Orthogonal instruction set (VLIW), Encoded instruction set • Data-level parallelism: Vector processing (SIMD) • Task-level parallelism: Multicore, Multi-threading

  7. Architectural Optimization Space • Specialization • App.-specific data types • App.-specific instructions • Pipeline • Connectivity & storage matching application’s data-flow • Integer, fractional, floating-point, bits, complex, vector… • Distributed regs, sub-ranges • Multiple mem’s, sub-ranges • App.-spec. memory addressing • App.-spec. data processing • App.-spec. control processing • Direct, indirect, post-modification, indexed, stack indirect… • Any exotic operator • Jumps, subroutines, interrupts, HW do-loops, residual control, predication… • Single or multi-cycle • Relative or absolute, address range, delay slots…

  8. IP Designer: ASIP Design and Programming

  9. Agenda • ASIPs as accelerators in SoCs • How to design ASIPs • Examples • Conclusions

  10. Synopsys - Full Spectrum Processor Technology Provider

  11. 32-bit ARC HS Processors: High-Performance for Embedded Applications • Over 3100 DMIPS @ 1.6 GHz* • 53 mW* of power; 0.12 mm² area in 28-nm process* • HS Family products • HS34 CCM, HS36 CCM plus I&D cache • HS234, HS236 dual-core • HS434, HS436 quad-core • Configurable so each instance can be optimized for performance and power • Custom instructions enable integration of proprietary hardware • *Worst case 28-nm silicon and conditions • Block diagram labels: ARC Floating Point Unit; JTAG; User Defined Extensions; ARCv2 ISA / DSP; Real-Time Trace; 10-stage pipeline; MAC & SIMD; Multiplier; ALU; Divider; Late ALU; Memory Protection Unit; Instruction CCM; Data Cache; Data CCM; Instruction Cache (legend: Optional)

  12. Pedestrian Detection and HOG • Pedestrian detection • Standard feature in luxury vehicles • Moving to mid-size and compact vehicles in the next 5-10 years, also due to legislation efforts • Implementation requirements • Low cost • Low power (small form factor, and/or battery powered) • Programmable (to allow for in-field SW upgrades) • Most popular algorithm for pedestrian detection is Histogram of Oriented Gradients (HOG)

  13. Histogram Of Oriented Gradients • Scale to Multiple Resolutions: Use a fixed 64x128-pixel detection window. Apply this detection window to scaled frames. • Gradient Computation: Apply the horizontal and vertical Sobel operators.
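The gradient step on this slide can be sketched as follows. This is a minimal illustration, not the presenter's implementation: the kernel images did not survive in the transcript, so the textbook 3x3 Sobel kernels are assumed, and the names `SOBEL_X`, `SOBEL_Y`, `convolve3x3` and `sobel_gradients` are hypothetical.

```python
# Textbook 3x3 Sobel kernels (assumption: the slide's own kernels are not in the transcript).
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1],
           [0, 0, 0],
           [1, 2, 1]]


def convolve3x3(image, kernel):
    """Correlate a 2-D list of pixel values with a 3x3 kernel.

    Border pixels are left at 0 because the kernel does not fit there.
    """
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            acc = 0
            for ky in range(3):
                for kx in range(3):
                    acc += kernel[ky][kx] * image[y + ky - 1][x + kx - 1]
            out[y][x] = acc
    return out


def sobel_gradients(image):
    """Return the (horizontal, vertical) gradient images."""
    return convolve3x3(image, SOBEL_X), convolve3x3(image, SOBEL_Y)


# A 4x4 grey-scale patch with a vertical edge between columns 1 and 2.
patch = [[0, 0, 10, 10] for _ in range(4)]
gx, gy = sobel_gradients(patch)
```

On this patch the horizontal gradient responds at the edge while the vertical gradient stays at zero, which is the behaviour the later histogram step relies on.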

  14. Histogram Of Oriented Gradients • Histogram Computation: The image is divided into 8x8-pixel cells. For every block of 2x2 cells, apply Gaussian weights and compute 4 histograms of orientation of gradients. • Normalization of the Histograms: (1) L2 normalization, (2) clipping (saturation), (3) L2 normalization • Support Vector Machine: Linear classification of histograms for every 64x128 window position. • Non-Max Suppression: Cluster multi-scale dense scan of detection windows and select unique
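The normalization and classification steps listed above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the clipping threshold of 0.2 and the small `eps` term are common choices in HOG implementations but are not given on the slide, and the function names are hypothetical.

```python
import math


def l2_normalize(vec, eps=1e-6):
    """Scale a vector to unit L2 length; eps (assumed) avoids division by zero."""
    norm = math.sqrt(sum(v * v for v in vec) + eps * eps)
    return [v / norm for v in vec]


def l2_clip_l2(vec, clip=0.2):
    """Slide's three steps: L2 normalization, clipping (saturation), L2 normalization.

    clip=0.2 is an assumed threshold, not a value from the presentation.
    """
    clipped = [min(v, clip) for v in l2_normalize(vec)]
    return l2_normalize(clipped)


def svm_score(features, weights, bias):
    """Linear SVM decision value: dot product of features and weights plus bias."""
    return sum(f * w for f, w in zip(features, weights)) + bias


# Toy block histogram: two dominant bins are saturated, then renormalized.
block = l2_clip_l2([3.0, 4.0, 0.0, 0.0])
score = svm_score([1.0, 2.0], [0.5, -1.0], 0.25)
```

A detection window would be classified as containing a pedestrian when its score exceeds a chosen threshold; the weights and bias come from offline training, which is outside the scope of the slides.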

  15. HOG Functional Validation on ARC HS (640 x 480 pixels) • OpenCV float profiling results: 2.6 G cycles per frame • Fixed point profiling results: 2.4 G cycles per frame • Task assignment 1, pipeline stages: Grey scale conversion, Rescaling, Gradient, Histogram, Normalization, SVM, Non-max suppression • Diagram labels: Dedicated Streaming Interconnect (FIFOs); D, D; ASIP1, ASIP2, … ASIPn; AXI local interconnect; DMA, Sync & I/O; HS; DCCM; Subs. ctrl

  16. Profiling (640 x 480 pixels, at 30 FPS)

  17. Task Assignment #2 • Pipeline stages: Grey scale conversion, Rescaling, Gradient, Histogram, Normalization, SVM, Non-max suppression • Diagram labels: Dedicated Streaming Interconnect (FIFOs); D, D, D; ASIP1, ASIP2, ASIP4; AXI local interconnect; DMA, Sync & I/O; HS; DCCM; Subs. ctrl; L3 Ext. DRAM

  18. ASIP Example: HISTOGRAM • Vector-slot next to existing scalar instructions (VLIW) • 16x(8/16)-bit vector register files • 16x8-bit SRAM interface • 16x8-bit FIFO interfaces • Vector arithmetic instructions • Special registers and instructions to compute histograms • 4x size increase & 200x speedup (relative to RISC template) • Implemented in less than 1 week
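For orientation, the kind of kernel such a histogram ASIP accelerates can be written in scalar form as below. This is a simplified sketch, not the ASIP's instruction set or the presenter's code: it assumes 9 unsigned orientation bins over 0 to 180 degrees, and it leaves out the Gaussian weighting and bin interpolation mentioned earlier. `NUM_BINS` and `cell_histogram` are hypothetical names.

```python
import math

NUM_BINS = 9  # assumption: 9 unsigned-orientation bins covering 0..180 degrees


def cell_histogram(gx, gy, x0, y0, cell=8):
    """Accumulate gradient magnitudes into orientation bins for one cell.

    gx and gy are 2-D lists of horizontal and vertical gradients;
    (x0, y0) is the top-left pixel of the cell.
    """
    hist = [0.0] * NUM_BINS
    bin_width = 180.0 / NUM_BINS
    for y in range(y0, y0 + cell):
        for x in range(x0, x0 + cell):
            dx, dy = gx[y][x], gy[y][x]
            magnitude = math.hypot(dx, dy)
            # Fold the angle into [0, 180) so opposite directions share a bin.
            angle = math.degrees(math.atan2(dy, dx)) % 180.0
            hist[int(angle // bin_width) % NUM_BINS] += magnitude
    return hist


# An 8x8 cell with a purely horizontal gradient: all weight lands in bin 0.
flat_gx = [[1.0] * 8 for _ in range(8)]
flat_gy = [[0.0] * 8 for _ in range(8)]
hist_horizontal = cell_histogram(flat_gx, flat_gy, 0, 0)
```

Each pixel performs a data-dependent read-modify-write on the histogram, which is awkward for plain SIMD; that is one plausible reason the slide lists special registers and instructions for this step.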

  19. Task Assignment #3 • Pipeline stages: Grey scale conversion, Rescaling, Gradient, Histogram, Normalization, SVM, Non-max suppression • Diagram labels: Dedicated Streaming Interconnect (FIFOs); D, D, D, D; ASIP1, ASIP2, ASIP3, ASIP4; AXI local interconnect; DMA, Sync & I/O; HS; DCCM; Subs. ctrl; L3 Ext. DRAM

  20. Task Assignment #4 • Pipeline stages: Grey scale conversion, Rescaling, Gradient, Histogram, Normalization, SVM, Non-max suppression • Diagram labels: Dedicated Streaming Interconnect (FIFOs); D, D, D, D; ASIP1’, ASIP2, ASIP3, ASIP4; AXI local interconnect; DMA, Sync & I/O; HS; DCCM; Subs. ctrl; L3 Ext. DRAM

  21. Task Assignment #4 (variant 4’) • Pipeline stages: Grey scale conversion, Rescaling, Gradient, Histogram, Normalization, SVM, Non-max suppression • Diagram labels: Dedicated Streaming Interconnect (FIFOs); D, D, D, D; ASIP1’, ASIP2, ASIP3, ASIP4; AXI local interconnect; DMA, Sync & I/O; HS; L2 SRAM; DCCM; L3 Ext. DRAM

  22. Comparison • Task assignments 1, 2, 3, 4

  23. Final Results • 1 ARC HS, 4 ASIPs, AXI interconnect, private SRAM, L2 SRAM • 30 frames/second at 500 MHz • Functionally identical to OpenCV reference • TSMC 28nm • ASIP gate count: 330k gates • ASIP power consumption: ~130mW • Power/performance/area via ASIPs • Scaling due to multi-core, specialization and SIMD usage • Performance gains and power efficiency due to tailored instruction sets and dedicated memory architecture
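To put these numbers in perspective, a back-of-the-envelope calculation combines the 30 frames/second at 500 MHz reported here with the 2.4 G cycles per frame from the fixed-point profiling on the ARC HS. The sketch below only does that arithmetic; the function names are hypothetical, and comparing cycle counts across different processors is a rough simplification.

```python
CLOCK_HZ = 500e6                   # from the final results slide
FRAMES_PER_SECOND = 30             # from the final results slide
BASELINE_CYCLES_PER_FRAME = 2.4e9  # fixed-point profiling result on the ARC HS


def cycle_budget_per_frame(clock_hz, fps):
    """Cycles one processor clocked at clock_hz can spend on each frame."""
    return clock_hz / fps


def required_speedup(baseline_cycles, clock_hz, fps):
    """Factor by which the baseline cycle count exceeds the per-frame budget."""
    return baseline_cycles / cycle_budget_per_frame(clock_hz, fps)


budget = cycle_budget_per_frame(CLOCK_HZ, FRAMES_PER_SECOND)
speedup = required_speedup(BASELINE_CYCLES_PER_FRAME, CLOCK_HZ, FRAMES_PER_SECOND)
print(f"budget: {budget / 1e6:.1f} M cycles/frame, required factor: {speedup:.0f}x")
```

Under these assumptions each processor has roughly 16.7 M cycles per frame, so the whole subsystem has to cover a factor of about 144 through the combination of multiple cores, specialization and SIMD that the slide names.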

  24. Scenario: Need for Flexible FEC Core • Existing and emerging standards use advanced FEC schemes like turbo coding, LDPC and Viterbi • Instead of duplication of FEC cores, need for re-configurable architecture at minimum power and area • Diagram labels: DVB-X? LDPC-A; .11n LDPC-C; .11n Vit; FlexFEC (turbo/LDPC/Vit); .16e LDPC-D; 3GPP-LTE turbo-A; UMTS Turbo-B

  25. Architecture Refinement to Increase Throughput: Increased ILP from 2 to 6 • ILP: 2 FU (scalar + vector unit) • ILP: 6 FU (1 scalar + 5 vector units) • No duplication for arithmetic functionality • For exploiting ILP to increase throughput • 2 FUs for local memory access

  26. Fast Area/Performance Trade-off (40nm logical synthesis, processor only) • 0.189 sqmm • 0.177 sqmm

  27. Architectural Exploration: FU Utilization 2 → 5 • Vector slot separated in different FUs without overlapping functionality • Local memory access congestion

  28. Architectural Exploration: More Balanced FU Utilization 5 → 6

  29. Highly Efficient C-compilation: Vast Majority of 6 FU Used

  30. Blox-LDPC ASIP Latest IP Available from IMEC Instances available ad

  31. Agenda • ASIPs as accelerators in SoCs • How to design ASIPs • Examples • Conclusions

  32. Conclusion • ASIPs enable programmable accelerators • IP Designer enables efficient design and programming of ASIPs • “Programmable datapath” ASIPs offer performance, area and power comparable to hardwired accelerators • ASIPs enable balanced multicore SoC architectures
