
Introducing the ConnX D2 DSP Engine


Presentation Transcript


  1. Introducing the ConnX D2 DSP Engine
  Introduced: August 24, 2009

  2. Fastest Growing Processor / DSP IP Company
  • Customizable Dataplane Processor/DSP IP Licensing
    • Leading provider of customizable Dataplane Processor Units (DPUs)
    • Unique combination of processor & DSP IP cores + software design tools
    • Customization enables improved power, cost, performance
    • Standard DPU solutions for audio, video/imaging & baseband comms
    • Dominant patent portfolio for configurable processor technology
  • Broad-Based Success
    • 150+ licensees, including 5 of the top 10 semiconductor companies
    • Shipping in high volume today (>200M/yr rate)
    • Fastest growing semiconductor processor IP company (per Gartner, Jan 2009)
    • 21% revenue growth in 2007, 25% in 2008

  3. Focus: Dataplane Processing Units (DPUs)
  DPUs: customizable CPU+DSP delivering 10 to 100x higher performance than a CPU or DSP, and providing better flexibility & verification than RTL.
  [Diagram: positioning of the CPU as embedded controller, the DSP for main applications, and Tensilica's focus area - dataplane processors]

  4. Communications DSP Trends / Challenges
  • Code Size Increases
    • Communications standards growing in number & complexity
    • DSP algorithm code heavily integrated with more (and more complex) control code
    • Maintenance and flexibility push DSP algorithms towards C code
  • Development Teams Shrink
    • SOC development schedules tightening
    • Tightening resource constraints (do more with less)
  • Markets Changing Faster
    • Market requirements in flux as the economy wobbles
    • Emerging standards evolve faster in the Internet age

  5. Trends Within Licensable DSP Architectures
  • 1st Generation Licensable DSP Cores
    • Modest/medium performance (single/dual MAC)
    • Simple architecture (single issue, compound instructions)
    • Limited or no compiler support (mostly hand coded)
  • 2nd Generation Licensable DSP Cores
    • Added RISC-like architecture features (register arrays)
    • Improved compiler targets, but still heavily assembly coded
    • Some offer wide VLIW for performance: large area, code bloat
    • Some offer wide SIMD for performance: good area/performance tradeoff, but no performance gain when vectorization fails

  6. Vectorization Benefits (SIMD)
  [Diagram: the same Data0-Data7 stream processed one element at a time (single execution) versus two elements at a time (2-way SIMD execution)]
  • Loop counts can be reduced
  • Data computation can be done in parallel
  • Cheapest (hardware cost) method to get higher performance
  Example: 2-way SIMD performance benefit (a C sketch of a vectorizable loop follows this slide)
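
  As an illustration (not taken from the slides), the following minimal C loop is the kind of code a vectorizing compiler can map onto 2-way SIMD; the function name, Q15 scaling and array length are assumptions:

        #include <stdint.h>

        #define N 128

        /* Element-wise multiply: each iteration is independent and the data
         * accesses are contiguous, so a vectorizer can process two 16-bit
         * elements per operation, roughly halving the loop count on a
         * 2-way SIMD machine. */
        void vec_mul(const int16_t *a, const int16_t *b, int16_t *c)
        {
            for (int i = 0; i < N; i++) {
                c[i] = (int16_t)((a[i] * b[i]) >> 15);  /* Q15 product */
            }
        }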

  7. VLIW Technology
  [Diagram: four instructions issued one at a time on a single execution ALU versus two at a time on two VLIW execution ALUs]
  • Parallel execution of instructions
  • Effective use of multiple ALUs/MACs
  • Compiler allocates instructions to VLIW slots (see the C sketch after this slide)
  • Orthogonal allocation yields more flexibility
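
  To make the compiler's slot-allocation job concrete, here is a hypothetical C fragment (not from the slides): the two statements in the loop body are independent of each other, so a 2-slot VLIW compiler is free to issue them in the same cycle. All names are illustrative:

        #include <stdint.h>

        /* The multiply-accumulate and the zero-count below have no data
         * dependence on each other, so a 2-slot VLIW machine can schedule
         * them into separate issue slots within the same cycle. */
        int32_t dot_and_count(const int16_t *x, const int16_t *y, int n, int *nonzero)
        {
            int32_t acc = 0;
            int cnt = 0;
            for (int i = 0; i < n; i++) {
                acc += x[i] * y[i];       /* candidate for the MAC slot */
                cnt += (y[i] != 0);       /* candidate for the ALU slot */
            }
            *nonzero = cnt;
            return acc;
        }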

  8. Ideal 3rd Generation Licensable DSP
  • Ideal Characteristics
    • VLIW capability for good performance on general code
      • Parallelization of independent operations
    • SIMD capability for good performance on loop code
      • Data-parallel execution
    • Good C compiler target
      • Reduces or eliminates the need for assembly programming
      • Productivity benefit
    • Small, compact size
      • Keeps costs down in brutally competitive markets

  9. Tensilica - the Stealth DSP Company
  [Diagram: Xtensa DSP building blocks by market - Comms: ConnX BBE16 (16 MAC), ConnX 545CK DSP (8 MAC), ConnX Vectra LX (Quad MAC), ConnX D2 (Dual MAC); Audio: HiFi 2; Video: 388VDO; Other markets / custom DSPs: Xtensa TIE; plus building blocks including MAC16 (single MAC), MUL32, DIV32, single-precision floating point unit, double-precision acceleration floating point HW, 8 MAC and more]

  10. ConnX D2 DSP Engine - Overview
  • Dual 16-bit MAC architecture with hybrid SIMD / VLIW
    • Optimum performance on a wide range of algorithms
    • SIMD offers a high data computation rate for DSP algorithms
    • 2-way VLIW allows parallel instruction execution on SIMD and scalar code
  • "Out of the Box" industry-standard software compatibility
    • TI C6x fixed-point C intrinsics supported, fully bit-for-bit equivalent with TI C6x
    • ITU reference code fixed-point C intrinsics directly supported
  • Goals: Ease of Use, Low Area/Cost
    • Click-and-go "Out of the Box" performance from standard C code
    • Standard C and fixed-point data types - 16-bit, 32-bit and 40-bit (see the accumulator sketch after this slide)
    • Advanced optimizing, vectorizing compiler
    • Less than 70K gates (under 0.2 mm2 in 65nm)
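
  A hedged illustration of why the 40-bit fixed-point type matters (this example is not from the presentation; the names and the use of int64_t as a stand-in for the 40-bit type are assumptions):

        #include <stdint.h>

        /* Summing many Q15 x Q15 products can overflow a 32-bit accumulator
         * after only a couple of full-scale terms (each product can reach
         * 2^30), while roughly 2^9 full-scale terms still fit in 40 bits.
         * Standard C has no 40-bit type, so int64_t models it here; on
         * ConnX D2 the 40-bit XDD registers hold such an accumulator. */
        int64_t energy_guarded(const int16_t *x, int n)
        {
            int64_t acc = 0;                          /* models a 40-bit accumulator */
            for (int i = 0; i < n; i++) {
                acc += (int32_t)x[i] * (int32_t)x[i]; /* up to 2^30 per term */
            }
            return acc;
        }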

  11. Target Applications: ConnX D2
  General-purpose 16-bit DSP for a wide range of applications:
  • Embedded control
  • VoIP gateways, voice-over-networks (including VoIP codecs)
  • Femtocell and picocell base stations
  • Next-generation disk drives, data storage
  • Mobile terminals and handsets
  • Home entertainment devices
  • Computer peripherals, printers

  12. ConnX D2 DSP: An Ingredient of an Xtensa DPU
  • Hardware Use Model
    • Click-button configuration option within the Xtensa LX core
    • Part of the Tensilica configurable core deliverable package
  • Two reference configurations
    • Typical DSP solution for high performance
    • Small size for cost- and power-sensitive applications
  • Full tool support from Tensilica
    • High-level simulators (SystemC), ISS and RTL
    • Debugger and trace
    • Compiler, IDE and operating systems

  13. ConnX D2 Processor Block Diagram (Typical)

  14. ConnX D2 Engine Architecture
  [Block diagram: 32-bit AR register bank, load/store unit to local memory and/or cache, XDU alignment registers (4 x 32 bits), and XDD register file (8 x 40 bits) holding 40-, 32- and 16-bit integer and fixed-point data with overflow and carry state, accessible as 16-bit vectors, 16-bit real/imaginary pairs and hi/lo 16-bit selections]
  • Addressing Modes
    • Immediate, immediate updating
    • Indexed, indexed updating
    • Aligning updating
    • Circular (instruction)
    • Bit-reversed (instruction)
  • DSP-specific instructions
    • Add-Bit-Reverse-Base and Add-Subtract: useful for FFT implementation
    • Add-Compare-Exchange: useful for Viterbi implementation
    • Add-Modulo: circular buffer implementation, useful for FIR implementation (see the FIR sketch after this slide)
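
  As a hedged sketch of what the Add-Modulo / circular addressing support accelerates (not from the slides; all names are illustrative), a FIR delay line in plain C has to perform the index wrap-around explicitly, which circular addressing folds into the pointer update:

        #include <stdint.h>

        #define TAPS 16

        /* Delay line indexed modulo TAPS. The explicit wrap-around updates
         * below are the pointer arithmetic that circular (Add-Modulo)
         * addressing performs as part of the load, instead of a separate
         * compare/reset or modulo operation. */
        typedef struct {
            int16_t delay[TAPS];
            int     pos;                     /* current write position */
        } fir_state_t;

        int32_t fir_step(fir_state_t *s, const int16_t coeff[TAPS], int16_t sample)
        {
            s->delay[s->pos] = sample;       /* insert newest sample */
            int32_t acc = 0;
            int idx = s->pos;
            for (int k = 0; k < TAPS; k++) {
                acc += (int32_t)coeff[k] * s->delay[idx];
                idx = (idx == 0) ? TAPS - 1 : idx - 1;   /* circular decrement */
            }
            s->pos = (s->pos + 1) % TAPS;    /* circular increment of write index */
            return acc;
        }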

  15. ConnX D2: Instruction Allocation Options
  [Diagram: 16-bit instructions (base ISA); 24-bit instructions (base ISA or ConnX D2); 64-bit VLIW instructions with Slot 0 (ConnX D2 or base ISA) and Slot 1 (ConnX D2 or base ISA - register moves & C ops on register data)]
  • Flexible allocation of instructions available to the compiler
  • Optimum use of VLIW slots (ConnX D2 or base ISA instructions)
  • Improved performance and no code bloat (reduced NOPs)
  • Reduced code size when the algorithm is less performance intensive
  • Modeless switching between instruction formats

  16. ConnX D2: SIMD with VLIW - Extra Performance
  Combining SIMD and VLIW can give 6 times the performance.
  Example: energy calculation, A = sum of Xn * Xn for n = 0..127 (a 128-iteration C loop; a hypothetical C version is sketched after this slide).

  Base Xtensa configuration - instruction execution (control), 416 cycles:

        loopgtz a3,.LBB52_energy        # [3]
        l16si   a3,a2,2                 # [0*II+0]  id:16 a+0x0
        l16si   a5,a2,4                 # [0*II+1]  id:16 a+0x0
        l16si   a6,a2,6                 # [0*II+2]  id:16 a+0x0
        l16si   a7,a2,8                 # [0*II+3]  id:16 a+0x0
        mul16s  a3,a3,a3                # [0*II+4]
        mul16s  a5,a5,a5                # [0*II+5]
        mul16s  a6,a6,a6                # [0*II+6]
        mul16s  a7,a7,a7                # [0*II+7]
        addi.n  a2,a2,8                 # [0*II+8]
        add.n   a3,a4,a3                # [0*II+9]
        add.n   a3,a3,a5                # [0*II+10]
        add.n   a3,a3,a6                # [0*II+11]
        add.n   a4,a3,a7                # [0*II+12]

  ConnX D2, 64 cycles - one 64-bit VLIW instruction per loop iteration (SIMD load in one slot, SIMD MAC in the other):

        loop {      # format XD2_FLIX_FORMAT
              xd2_la.d16x2s.iu   xdd0,xdu0,a4,4;  xd2_mulaa40.d16s.ll.hh  xdd1,xdd0,xdd0
        }

  • Vectorization and SIMD double the data computation performance
  • VLIW gives 2 pipeline executions (one is SIMD) with auto-increment loads
  • The ConnX D2 architecture gives this combination and performance
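
  The slide does not reproduce the C source that was compiled; a plausible minimal version of the 128-sample energy sum (names assumed) looks like this:

        #include <stdint.h>

        /* 128-sample energy sum: a contiguous, independent loop of this shape
         * is what the vectorizing compiler can map onto the single SIMD-load
         * plus SIMD-MAC VLIW bundle shown on the slide. */
        int32_t energy128(const int16_t x[128])
        {
            int32_t a = 0;
            for (int n = 0; n < 128; n++) {
                a += (int32_t)x[n] * (int32_t)x[n];
            }
            return a;
        }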

  17. When Vectorization is Not Possible - Performance for Scalar Code Bases

        int energy(short *a, int col, int cols, int rows)
        {
            int i;
            int sum = 0;
            for (i = 0; i < rows; i++) {
                sum += a[cols*i + col] * a[cols*i + col];
            }
            return sum;
        }

  • Energy computation of column 'col' in a 2-D array
  • The loop above cannot be vectorized
  • Non-contiguous memory accesses thwart vectorizers
  • Regular compilers cannot map this code onto traditional SIMD DSPs

  18. When Vectorization is Not Possible - Performance for Scalar Code Bases

  C code (same energy function as the previous slide):

        int energy(short *a, int col, int cols, int rows)
        {
            int i;
            int sum = 0;
            for (i = 0; i < rows; i++) {
                sum += a[cols*i + col] * a[cols*i + col];
            }
            return sum;
        }

  Generated assembly code (ConnX D2: one cycle within the loop):

        entry a1,32
        blti a5,1,.Lt_0_2306
        addx2 a2,a3,a2
        slli a3,a4,1
        addi.n a4,a5,-1
        sub a2,a2,a3
        { # format XD2_FLIX_FORMAT
          xd2_l.d16s.xu xdd0,a2,a3 ; xd2_movi.d40 xdd1,0
        }
        loopgtz a4,.LBB43_energy
        { # format XD2_FLIX_FORMAT
          xd2_l.d16s.xu xdd0,a2,a3 ; xd2_mula32.d16s.ll_s1 xdd1,xdd0,xdd0
        }
        …………

  • Confirmed that the ConnX D2 and TI C6x compilers cannot vectorize this code
  • The ConnX D2 compiler can however use VLIW to increase performance
  • xd2_l.d16s.xu: scalar 16-bit load - xdd0 is loaded with the memory contents addressed by register a2, and a2 is updated by the value in a3
  • xd2_mula32.d16s.ll_s1: MAC operation on the lower 16 bits - multiplies xdd0 by xdd0 and accumulates the result into xdd1

  19. Optimization with ITU / TI Intrinsics - Performance for Generic Code Bases

  Energy calculation loop, 1000 iterations, using the L_mac ITU intrinsic:

        #define ASIZE 1000
        extern int a[ASIZE];
        extern int red;

        void energy()
        {
            int i;
            int red_0 = red;
            for (i = 0; i < ASIZE; i++) {
                red_0 = L_mac(red_0, a[i], a[i]);
            }
            red = red_0;
        }

  Generated assembly code:

        entry a1,32
        l32r a2,.LC1_40_18
        l32r a5,.LC0_40_17
        xd2_l.d16x2s.iu xdd0,a2,4  test_arr_1+0x0
        l32i.n a3,a5,0  test_global_red_0+0x0
        { # format XD2_ARUSEDEF_FORMAT
          xd2_mov.d32.a32s xdd1,a3 ; movi a3,499
        }
        loopgtz a3,
        { # format XD2_FLIX_FORMAT
          xd2_l.d16x2s.iu xdd0,a2,4 ; xd2_mulaa.fs32.d16s.ll.hh xdd1,xdd0,xdd0
        }
        ..........

  • L_mac maps to one ConnX D2 instruction (a portable reference sketch of L_mac follows this slide)
  • The compiler further optimizes by using SIMD to accelerate the loop
  • VLIW adds further acceleration with parallel loads
  • The 1000-iteration C loop is optimized to a 500-cycle loop - a sustained 3 operations / cycle
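
  For readers unfamiliar with the ITU basic operators, this is a portable reference sketch of L_mac semantics, written from the ITU-T basic-operator definitions rather than taken from the slides; the ref_ prefix marks these as illustrative stand-ins:

        #include <stdint.h>

        /* Reference sketch of the ITU-T basic operator L_mac:
         *   L_mac(L_var3, var1, var2) = L_add(L_var3, L_mult(var1, var2))
         * where L_mult is a saturating Q15 x Q15 -> Q31 multiply and L_add is
         * a saturating 32-bit add. Per the slide, the whole operation maps to
         * a single ConnX D2 instruction. */
        static int32_t ref_L_mult(int16_t var1, int16_t var2)
        {
            if (var1 == -32768 && var2 == -32768)
                return INT32_MAX;                       /* the only overflow case */
            return (int32_t)var1 * (int32_t)var2 * 2;   /* product << 1, in Q31 */
        }

        static int32_t ref_L_add(int32_t a, int32_t b)
        {
            int64_t s = (int64_t)a + (int64_t)b;
            if (s > INT32_MAX) return INT32_MAX;
            if (s < INT32_MIN) return INT32_MIN;
            return (int32_t)s;
        }

        static int32_t ref_L_mac(int32_t L_var3, int16_t var1, int16_t var2)
        {
            return ref_L_add(L_var3, ref_L_mult(var1, var2));
        }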

  20. "Out of the Box" Performance - Results
  • Comparison to TI C55x # (an industry-benchmark dual-MAC, 2-way VLIW)
    • 20% more performance (256-point complex FFT)
    • Why better? FFT-specific instructions, dual write to register files, advanced compiler, SIMD and VLIW performance
  • Comparison to other DSP IP vendors *
    • Almost twice the performance
    • Why better? 1-to-1 mapping of ITU intrinsics, SIMD and VLIW performance, flexibility in VLIW allocation, VLIW performance for scalar code
  * - 2008, from a CEVA published whitepaper
  # - Dec 2008, www.ti.com

  21. Small, Low Power & High Performance
  • Optimized for low-area / low-cost applications
    • Less than 70,000 gates
    • 0.18 mm2 in 65nm GP *
  • Low power
    • 52 uW/MHz power consumption
    • 65nm GP, measured running the AMR-NB algorithm
  • Very high performance
    • 600 MHz in 65nm GP **
  * - After full place and route, when optimized for area/power. Size is for the full Xtensa core including the D2 DSP option
  ** - After full place and route, when optimized for speed

  22. Flexible and Customizable
  • Configure memory subsystems to exact requirements
    • Up to 4 local memories
    • Instruction memory, data memory
    • RAM and ROM options
    • DMA path into these memories
    • Instruction and data cache configurations
    • MMU and memory region protection
    • Memory port interface
    • Option of dual load/store architecture
  • Full customization
    • Instruction set extensions
    • Custom I/O interfaces
    • TIE ports, queues and lookup memory interfaces

  23. ConnX D2 DSP Engine: Summary
  • Small size
  • Low power
  • Excellent performance on a wide range of code
  • Easy to use - C-programming centric
    • "Out of the Box" performance
    • Reduced development time - reduced cost
  • ITU and TI C intrinsic support - large existing code base
    • Bit equivalent to TI C6x: take current TI code, port it, and get the same functionality on ConnX D2
  • Flexible & customizable
