
The CA1024: A Massively Parallel Processor for Cost-Effective HDTV

Learn about the CA1024, a programmable and scalable processor designed for HDTV video encoding, decoding, and transcoding. Built on integral parallel machine technology, it offers a cost-effective solution for computationally intensive tasks.


Presentation Transcript


  1. The CA1024: A Massively Parallel Processor for Cost-Effective HDTV Connex Technology Proprietary and Confidential

  2. Company Background
     • Fabless semiconductor company in Silicon Valley
     • VC funded (series A & B)
     • In the product-development stage with 26+ employees
     • Deep experience with video algorithms, processor design, and digital-video system software
     • Core asset: ConnexArray™ vector-processor architecture
     • Architecture verified in CA4096 test chip
     • Six patent applications on Connex vector-processor technology
     • 1 US patent granted, 3 US patents pending, 2 US provisional
     • Granted and pending patents also filed in China, Taiwan, Korea, EEC, Japan, Singapore
     • Initial market focus on DTV

  3. Presentation Agenda
     • Why a massively parallel processor (MPP)?
     • How is MPP integrated in an SoC?
     • Processor performance
     • Project status

  4. Challenges
     • HDTV codec & post-processing are computationally intensive
     • Computation is dominated by data-parallel processes
     • HDTV is a fast-evolving domain
     • ASICs are a very costly solution

  5. Our Solution: Integral Parallel Machine
     • Data-parallel computation
     • Time-parallel computation (supported by speculative parallelism)
     • I/O process is transparent to the computational process

  6. Key Technology
     • Fully programmable solution for HDTV video encoding, decoding, and transcoding at the system and algorithm levels
     • Simple programming model
     • Silicon-efficient architecture; die size competitive with similar-function ASICs
     • Re-use of transistors
     • Minimal dedicated hard-wired blocks
     • Sufficient performance to enable multistandard, multichannel, high-definition DTV
     • Linearly scalable

  7. The Connex Architecture
     • CA1024-PVP: m = n = 32 for a 1,024-PE Connex Machine
     • Test chip: m = n = 64 for a 4,096-PE Connex Array; sequencer and I/O control in an FPGA
     • 3.2 GByte/sec I/O channel in parallel with code running on the Connex Array
     [Figure: block diagram of the Sequencer, the I/O Controller, and an m × n Connex Array. Each cell shows Index and Select, a 16-bit RAM (addresses 0 to 255), registers R0 to R7 (with Connex, I/O, and AUX labels), and a 16-bit ALU.]

  8. Connex Cell Architecture
     • PE (Processing Element) has eight accumulator registers, including Connex, Aux, and I/O special-function registers
     • Select flag enables or disables instruction processing
     • Index is a unique cell number used to direct certain instructions
     • Bidirectional 16-bit bus to 256 RAM locations
     • Connex register includes connections for shifts to/from adjacent PEs
     • Aux and I/O registers dedicated to specific instruction functions
     [Figure: one Connex cell with RAM (addresses 0 to 255), registers R0 to R7 (with Connex, I/O, and AUX labels), Select, Index, and a 16-bit ALU.]

  9. ConnexArray Structure
     • Replicated Connex cells each include PE and local RAM
     • Linear interconnect of neighbor registers
     • Conditional execution based on state of select bit or index value
     • All selected cells execute the same instruction stream
     [Figure: cells 0, 1, ..., 1023 side by side, each with RAM (addresses 0 to 255), registers R0 to R7, and a 16-bit ALU; select states shown as On, Off, On.]

  10. Connex Data-Array Structure
     • 256 lines with 1024 16-bit elements per line
     • 1GByte data I/O in parallel with computation operations
     [Figure: data array with lines m = 0 to 255 and elements n = 0 to 1023; each element is a 16-bit data operand.]
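As a quick check of the sizes on this slide, the arithmetic below works out how much data the array holds. The variable names are illustrative, and the figures used are only the ones stated on the slide.

```python
# Sizes taken from the slide: 256 lines, 1024 elements per line, 16 bits each.
LINES = 256
ELEMENTS_PER_LINE = 1024
BITS_PER_ELEMENT = 16

total_elements = LINES * ELEMENTS_PER_LINE
total_bytes = total_elements * BITS_PER_ELEMENT // 8
total_kib = total_bytes // 1024

print(total_elements, "elements,", total_kib, "KiB of array data")
```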

  11. Full Line Operations: Operate On All Elements in Parallel
     • Line k = Line i OP Line j, where OP is +, -, *, XOR, etc.
     • Line k = Line i OP scalar value (repeated for all elements)
     [Figure: lines 0 to 255 by elements 0 to 1023, showing Line i OP Line j = Line k.]
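A minimal sequential sketch of the two line-operation forms above, written in plain Python rather than CPL. The helper `line_op` is hypothetical, and wrapping results to 16 bits is an assumption about overflow behavior that the slides do not specify.

```python
import operator

WIDTH = 1024  # elements per line, from the slides


def line_op(op, line_i, operand):
    """Model of Line k = Line i OP operand, applied to every element.

    A scalar operand is repeated for all elements, as on the slide.
    Results are wrapped to 16 bits (an assumption, not stated in the source).
    """
    if isinstance(operand, int):
        operand = [operand] * len(line_i)
    return [op(a, b) & 0xFFFF for a, b in zip(line_i, operand)]


line_i = list(range(WIDTH))
line_j = [3] * WIDTH

line_k = line_op(operator.add, line_i, line_j)       # Line k = Line i + Line j
line_scalar = line_op(operator.xor, line_i, 0x00FF)  # Line k = Line i XOR scalar
```

On the hardware, each of these would be one operation across all 1,024 cells; the list comprehension here only stands in for that parallel step.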

  12. Columns Active Based On Repeating Patterns
     • Example: Mark all odd columns active. Or mark every third column active. Or mark every third and fourth column active, etc.
     [Figure: lines 0 to 255 by columns 0 to 1023, showing Line i OP Line j = Line k (OP is +, -, *, XOR, etc.) with a repeating pattern of active columns.]

  13. Columns Active Based On Results of Previous Operations
     • Example: Columns that appear random are marked active based on the data-dependent results of previous operations. This enables selective processing based on data content.
     [Figure: lines 0 to 255 by columns 0 to 1023, showing Line i OP Line j = Line k (OP is +, -, *, XOR, etc.) with an irregular set of active columns.]
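The two ways of choosing active columns (repeating patterns and data-dependent results) can be sketched in plain Python as follows. The helper `masked_add` and the sample data are hypothetical. The sketch assumes that inactive columns keep their previous value and that results wrap to 16 bits; neither detail is spelled out in the slides.

```python
WIDTH = 1024  # cells per line, from the slides


def masked_add(line_i, line_j, line_k, select):
    """Line k = Line i + Line j on active columns; inactive columns keep line k."""
    return [(a + b) & 0xFFFF if s else k
            for a, b, k, s in zip(line_i, line_j, line_k, select)]


index = list(range(WIDTH))

# Repeating patterns derived from the cell index.
odd_columns = [i % 2 == 1 for i in index]
every_third = [i % 3 == 0 for i in index]

# Data-dependent selection: the result of an earlier element-wise compare.
line_i = [(i * 7) % 100 for i in index]
line_j = [50] * WIDTH
below_threshold = [a < b for a, b in zip(line_i, line_j)]

on_odd = masked_add(line_i, line_j, [0] * WIDTH, odd_columns)
on_data = masked_add(line_i, line_j, [0] * WIDTH, below_threshold)
```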

  14. Outer-Loop Parallelism: Program in context of 128+ data-structure instances
     • Example: 8x8 DCT
     • Example: 128 sets of 8x8 run in parallel in a 1024-cell array
     [Figure: lines 0 to 255 by columns 0 to 1023; lines 0 to 7 hold a row of adjacent 8x8 blocks.]
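The block count on this slide follows from 1024 / 8 = 128. The sketch below shows the layout and a per-block operation written once and applied to every block. It is not a DCT: the `weights` table is a hypothetical stand-in for whatever per-coefficient work the real kernel does, and the layout (row r of every block stored in line r) is an assumption drawn from the figure.

```python
WIDTH = 1024  # cells per line, from the slides
BLOCK = 8     # 8x8 blocks, as in the DCT example
NUM_BLOCKS = WIDTH // BLOCK


def block_and_column(n):
    """Map cell index n to (block number, column inside that block)."""
    return n // BLOCK, n % BLOCK


# Hypothetical per-coefficient weights for ONE 8x8 block.
weights = [[r + c + 1 for c in range(BLOCK)] for r in range(BLOCK)]

# Eight lines; line r holds row r of all 128 blocks side by side.
lines = [[1] * WIDTH for _ in range(BLOCK)]

# The "program" is written for a single block. Each line operation applies
# it to the same row of every block at once.
scaled = [[lines[r][n] * weights[r][n % BLOCK] for n in range(WIDTH)]
          for r in range(BLOCK)]
```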

  15. I/O System
     [Figure: I/O system block diagram. Labeled blocks: Switch Fabric, Connex Array, IS, I/O Plane, IOC, Interrupts, DDR-DRAM Controller, and four DRAM devices.]

  16. Computational-Intensive Architecture
     • All forms of parallelism are strongly segregated
     • Connex Array for data-parallel computation
     • Speculative Array for time-parallel computation
     • The granularity perfectly fits the application domain
     • 16-bit processing elements
     • no MACs, no FPUs, no multipliers…

  17. High I/O Bandwidth
     • External I/O: 3.2 GBytes/sec
     • Serial access and random access with similar performance
     • Internal I/O: 400 GBytes/sec
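A rough sanity check of these two figures. The external number follows from the 64-bit DRAM bus and 400 MHz data rate shown on the SoC block diagram later in the deck. The internal number is only a plausibility estimate: it assumes one 16-bit RAM access per PE per cycle and a 200 MHz core clock, neither of which is stated in the slides.

```python
# External I/O: figures from the SoC block-diagram slide.
BUS_WIDTH_BITS = 64
DATA_RATE_HZ = 400_000_000
external_bytes_per_s = BUS_WIDTH_BITS // 8 * DATA_RATE_HZ

# Internal I/O: ASSUMPTIONS, not from the source.
NUM_PES = 1024                   # from the slides
BYTES_PER_ACCESS = 2             # one 16-bit word per PE per cycle (assumed)
ASSUMED_CLOCK_HZ = 200_000_000   # assumed core clock
internal_bytes_per_s = NUM_PES * BYTES_PER_ACCESS * ASSUMED_CLOCK_HZ

print(external_bytes_per_s / 1e9, "GB/s external")
print(internal_bytes_per_s / 1e9, "GB/s internal (under the assumptions above)")
```

Under those assumptions the internal estimate comes to about 410 GBytes/sec, which is in the neighborhood of the 400 GBytes/sec quoted on the slide.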

  18. Area & Power Efficiency
     • 2 GOPS/mm² (peak performance)
     • GOPS/Watt is 25–50 times greater than a mature sequential technology

  19. Programming Connex
     • CPL (Connex Programming Language) is an extension of C with C/C++ syntax
     • Code that operates on scalar data is written in regular C notation
     • Connex-specific operators defined for features not available in C, e.g. operations on vectors, selections
     • CPL uses sequential operators and control structures on vector and select datatypes
     • Using CPL, the Connex Machine is programmed the same way as conventional sequential machines
     • Hides the complexities of the parallel execution hardware
     • Complete SDK
     • Vectors are arrays of scalar components. Selections are arrays of Boolean values that dictate which vector components are active.

         {
           ...
           const short OFFSET = 15;
           ...
           short vector x, y;
           short vector min, max;
           ...
           sel = all;
           x += OFFSET;
           ...
           min = (x < y)? x : y;
           max = (x > y)? x : y;
           ...
         }
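For readers without the CPL toolchain, the fragment above can be mimicked in plain Python to show what the vector and selection semantics would look like element by element. This is an interpretation of the slide, not CPL itself: the helper `vec_select`, the four-element sample vectors, and the reading of `sel = all` as "every component active" are assumptions.

```python
OFFSET = 15


def vec_select(cond, a, b):
    """Element-wise (cond ? a : b) over vectors."""
    return [p if c else q for c, p, q in zip(cond, a, b)]


# Short sample vectors; a real Connex vector would have 1,024 components.
x = [1, 40, 7, 100]
y = [20, 30, 22, 90]

sel = [True] * len(x)                                  # sel = all
x = [v + OFFSET if s else v for v, s in zip(x, sel)]   # x += OFFSET

vmin = vec_select([a < b for a, b in zip(x, y)], x, y)  # min = (x < y) ? x : y
vmax = vec_select([a > b for a, b in zip(x, y)], x, y)  # max = (x > y) ? x : y
```

The names `vmin` and `vmax` are used only to avoid shadowing Python's built-in `min` and `max`.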

  20. Performance
     • DCT: 0.35 clock cycle per pixel
     • SAD: 0.0025 clock cycle per pixel

  21. H.264 Dual HD Stream Decoding
     • Allowed clock cycles per macroblock (2-channel 1080i): 409 cycles
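One way a budget of about 409 cycles per macroblock could arise is sketched below. Only the macroblock size (16x16) and the coded height of 1088 lines for 1080-line video are standard H.264 facts; the 200 MHz clock and the 30000/1001 frames-per-second rate are assumptions chosen for illustration and do not appear in the slides.

```python
from fractions import Fraction

# Standard H.264 geometry for 1080-line video (coded as 1920 x 1088).
MB_SIZE = 16
mbs_per_frame = (1920 // MB_SIZE) * (1088 // MB_SIZE)

# ASSUMPTIONS, not from the source.
ASSUMED_CLOCK_HZ = 200_000_000
ASSUMED_FRAME_RATE = Fraction(30000, 1001)  # about 29.97 frames/s
CHANNELS = 2                                # from the slide

mbs_per_second = mbs_per_frame * ASSUMED_FRAME_RATE * CHANNELS
cycles_per_mb = ASSUMED_CLOCK_HZ / mbs_per_second

print(mbs_per_frame, "macroblocks per frame")
print(float(cycles_per_mb), "cycles per macroblock")
```

With these inputs the budget rounds to 409 cycles, which matches the slide, but that agreement does not confirm the assumed clock or frame rate.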

  22. H.264 CABAC (SA) Decoding
     • Targeted profile and level: 4.1 Main Profile
     • Bit-rate/stream considered: 35 Mbps (45 Mbps maximum)
     • Number of bins to decode using CABAC: 47M/sec
     • Number of clock cycles per bin: 1 cycle
     • Cycles to decode bins/stream: 50 MHz
     • Typical bit-rate expected for DVB: 10 Mbps
     • Cycles to decode bins for typical stream (DVB): 15 MHz
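The cycle budgets on this slide can be roughly reproduced from the bin rate. The sketch assumes that the number of bins scales linearly with bit rate, which the slide does not state, and it reads the 50 MHz and 15 MHz figures as rounded-up budgets.

```python
# Figures from the slide.
BINS_PER_SECOND = 47_000_000   # at the 35 Mbps stream considered
CYCLES_PER_BIN = 1
STREAM_BITRATE = 35_000_000
DVB_BITRATE = 10_000_000

# 47M bins/s at 1 cycle per bin needs 47M cycles/s (slide budget: 50 MHz).
cycles_per_second = BINS_PER_SECOND * CYCLES_PER_BIN

# ASSUMPTION: bins scale linearly with bit rate.
bins_per_bit = BINS_PER_SECOND / STREAM_BITRATE
dvb_cycles_per_second = bins_per_bit * DVB_BITRATE * CYCLES_PER_BIN

print(cycles_per_second / 1e6, "MHz for the 35 Mbps stream")
print(round(dvb_cycles_per_second / 1e6, 1), "MHz for a 10 Mbps DVB stream")
```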

  23. [Figure: CA1024 SoC block diagram. Labeled blocks and interfaces:
     • Memory: DDR-DRAM Ctrl (400 MHz data rate), 64-bit wide DRAM
     • Video: Video In and Video Out ports (BT.656/1120)
     • Audio: Audio In and Audio Out ports (5x-I2S, 2x-I2S or S/PDIF, 1xI2S, S/PDIF)
     • Control and host: GPIO, I2C, JTAG, Test ICE, HOST I/F, Ext. Bus, PCI v2.2 or Generic, Flash
     • Processors: Host CPU, Audio CPU, TS/Sec CPU, Video CPU, SA
     • CA1024: I/O Controller, Instruction Sequencer, ConnexArray™ Programmable Media Processor
     • ConnexArray functions: Multi-Codec Processing, Pre-Analysis, 3D Filter, Scaling, Graphics Processing, Video Merge/Blend, Motion Adaptive De-interlacing
     • Interconnect: Switch Fabric]

  24. Project Status
     • TSMC 0.13 micron
     • 676-pin PBGA
     • Samples Q3 2006
     • sales@connextechnology.com
     [Figure: CA1024 chip layout. Labeled blocks: MIPS (x4), PCI, DDR, ACF, SA, CA256 (x4), CWOA.]

  25. In Summary
     • Fully programmable processor
     • Computational-intensive architecture
     • High-bandwidth I/O
     • Connex Programming Language & SDK
     • Die-area and power-efficient architecture

  26. Thank You!
