
Computing Platforms for Multimedia

Computing Platforms for Multimedia. Marilyn Wolf Dept. of EE Princeton University wolf@princeton.edu. Topics. Pop quiz. Embedded computing. DSP architectures. Multiprocessor systems-on-chips. Hardware/software co-design. Power and energy consumption.


Presentation Transcript


  1. Computing Platforms for Multimedia Marilyn Wolf Dept. of EE Princeton University wolf@princeton.edu © 2006 Marilyn Wolf

  2. Topics • Pop quiz. • Embedded computing. • DSP architectures. • Multiprocessor systems-on-chips. • Hardware/software co-design. • Power and energy consumption. • Performance analysis and simulation.

  3. Software performance pop quiz • Which of these operations is faster? +, * • How fast is this code? for (i=0; i<N; i++) c[i] = a[i] * b[i]; • How long does it take to execute this line of code? p1 = calloc(1,sizeof(my_struct));

  4. How long does each operation take? (Figure: register file, shifter, multiplier (*), ALU.)

  5. How many clock cycles does it take to execute this code? (Figure: array elements a[0]..a[3], b[0]..b[3], c[0]..c[3] and the cache.)
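The pop-quiz loop can be timed directly rather than guessed at. The following is a minimal sketch, assuming a hosted C environment with `clock()`; the function names `vec_mul` and `time_vec_mul` are illustrative and do not come from the slides. The measured time depends on the multiplier, the compiler, and above all on whether a, b, and c fit in the cache, so the result is a measurement on one machine, not a general answer.

```c
#include <stddef.h>
#include <time.h>

/* Element-wise product from the pop quiz: c[i] = a[i] * b[i]. */
void vec_mul(const int *a, const int *b, int *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

/* CPU seconds spent in `reps` back-to-back calls of vec_mul.
 * Repeating the call amortizes the coarse resolution of clock().
 * An optimizing compiler may shorten or remove repeated work, so
 * treat the figure as a rough estimate only. */
double time_vec_mul(const int *a, const int *b, int *c, size_t n, int reps)
{
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        vec_mul(a, b, c, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Running `time_vec_mul` once with small arrays and once with arrays much larger than the cache is one way to see the memory system, not the multiply, dominate the cost.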

  6. malloc() • Arbitrary-sized block management. • Dynamic coalescing. (Figure: memory holding Block 1 and Block 2.)
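To show why an allocation call has no fixed cost, here is a toy first-fit allocator with block splitting and coalescing. It is a sketch under stated assumptions, not how any real `malloc()` is implemented: the names `toy_malloc`, `toy_free`, `block_t`, and `HEAP_UNITS` are hypothetical, the heap is a fixed static array, and the code is not thread-safe. Both the search in `toy_malloc` and the merge sweep in `toy_free` take time that depends on how many blocks currently exist.

```c
#include <stddef.h>

#define HEAP_UNITS 1024  /* hypothetical heap size, in header-sized units */

/* Every block starts with a header; the union forces worst-case alignment. */
typedef union block {
    struct {
        size_t units;  /* block length in units, header included */
        int    free;   /* nonzero if the block is available */
    } h;
    max_align_t align;
} block_t;

static block_t heap[HEAP_UNITS];
static int heap_ready;

/* First fit: walk the blocks in address order and take the first free
 * block that is large enough, splitting off any unused tail. */
void *toy_malloc(size_t nbytes)
{
    if (!heap_ready) {
        heap[0].h.units = HEAP_UNITS;
        heap[0].h.free = 1;
        heap_ready = 1;
    }
    if (nbytes > sizeof heap)
        return NULL;
    size_t need = (nbytes + sizeof(block_t) - 1) / sizeof(block_t) + 1;
    for (size_t i = 0; i < HEAP_UNITS; i += heap[i].h.units) {
        block_t *b = &heap[i];
        if (b->h.free && b->h.units >= need) {
            if (b->h.units > need) {            /* split the block */
                block_t *rest = b + need;
                rest->h.units = b->h.units - need;
                rest->h.free = 1;
                b->h.units = need;
            }
            b->h.free = 0;
            return b + 1;                       /* payload follows header */
        }
    }
    return NULL;                                /* no block fits */
}

/* Mark the block free, then merge every run of adjacent free blocks. */
void toy_free(void *p)
{
    if (p == NULL)
        return;
    ((block_t *)p - 1)->h.free = 1;
    size_t i = 0;
    while (i < HEAP_UNITS) {
        size_t next = i + heap[i].h.units;
        if (heap[i].h.free && next < HEAP_UNITS && heap[next].h.free)
            heap[i].h.units += heap[next].h.units;  /* merge, then re-check */
        else
            i = next;
    }
}
```

Freeing two neighbouring blocks leaves one large block, which is what lets a later large request succeed.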

  7. Approximate market segments

  8. Consumer electronics prices Best Buy November 2003:

  9. Characteristics of embedded systems • Very high performance. • Vision + compression + speech + networking all on the same platform. • Multiple task, heterogeneous. • Real-time. • Often low power. • Highly reliable. • I reboot my piano every 4 months, my PC every day.

  10. Mudge et al: Mobile supercomputing • Future mobile platform: • Speech recognition. • Cryptography. • Augmented reality. • Typical applications (email, etc.). • Requires 16x 2 GHz Pentium 4. • Peak power must not exceed 75 mW. • Assumes 5% battery improvement per year.

  11. Mudge et al: Performance trends for desktop processors © 2004 IEEE Computer Society

  12. Mudge et al: Power trends for desktop processors © 2004 IEEE Computer Society

  13. Why multiple platforms? • People still care about cost. • People care about power consumption. • Sufficiently general solutions don’t fit on one chip.

  14. CPU vs. DSP • AT&T DSP-16: • On-board multiplier/accumulator. • Harvard architecture. • Today, DSP is largely a marketing term.
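The multiplier/accumulator named above exists to run the inner loop of filters and transforms. The sketch below writes that loop in portable C for 16-bit fixed-point (Q15) data; `mac_q15` is an illustrative name, and the 64-bit accumulator is a stand-in for the wide hardware accumulator of a DSP, not a model of any specific device.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of products of two Q15 vectors. Each 16 x 16 product fits in
 * 32 bits; the running sum is kept in a wider accumulator so that
 * long vectors do not overflow. The result is in Q30 format. */
int64_t mac_q15(const int16_t *x, const int16_t *h, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)x[i] * (int32_t)h[i];  /* one multiply-accumulate */
    return acc;
}
```

On a processor with a hardware MAC unit, a compiler can typically map each loop iteration to a single multiply-accumulate operation.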

  15. Performance and power features • Performance: • Pipeline. • Specialized instructions. • Cache. • Power consumption: • Clock gating. • Low power modes.

  16. C5x family • Fixed-point DSP. • Modified Harvard architecture: • 1 program memory bus. • 3 data memory busses. • 40-bit ALU. • Multiple implementations: • 1, 2 instructions/cycle.

  17. Sample configurations • C5409: • 100 MIPS @ 100 MHz. • Address space sizes: data 64K words, program 8M words. • On-board memory: RAM 16K words, ROM 32K words. • 3 buffered serial ports. • 1 timer. • C5510: • 400 MIPS @ 200 MHz. • Total address space: 8M words. • On-board memory: RAM 160K words, ROM 16K words. • 3 McBSP serial ports. • 2 timers.

  18. C55x organization (Figure: instruction unit, program flow unit, address unit, and data unit, with a 24-bit program address bus, a 32-bit program read bus, 3 data read busses (16 bits each) with 3 data read address busses (24 bits each), and 2 data write busses (16 bits each) with 2 data write address busses (24 bits each). Additional labels: B bus; C, D busses; D bus; instruction fetch; writes; dual operand read; single operand read; data read from memory; dual-multiply coefficient.)

  19. Busses and accesses

  20. Busses and accesses, cont’d

  21. Image/video hardware extensions • Available in 5509 and 5510. • Equivalent C-callable functions for other devices. • Available extensions: • DCT/IDCT. • Pixel interpolation. • Motion estimation.

  22. C55 DCT/IDCT coprocessor extensions • Load, compute, transfer to accumulators: • ACy=copr(k8,ACx,Xmem,Ymem) • Compute, transfer, mem write: • ACy=copr(k8,ACx,ACy), Lmem=ACz • Special: • ACy=copr(k8,ACx,ACy)

  23. Software pipelined load/compute/store for DCT • Iterations i-1, i, and i+1 overlap; each runs the same stage sequence: Dual_load, 4 empty, 3 Dual_load, 8 compute, empty, 4 Long_store. • Operations paired in the overlapped schedule: • op_i(0), load_i+1(0,1) • op_i(1), store_i-1(0,1) • op_i(2), store_i-1(2,3) • op_i(2), store_i-1(4,5) • op_i(2), store_i-1(6,7) • op_i(2), load_i+1(2,3) • …
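The overlap of store for iteration i-1, compute for iteration i, and load for iteration i+1 can be sketched in plain C. This is an illustration of the scheduling idea only: `pipelined_map` and the three stage functions are hypothetical names, and `compute_stage` is a trivial stand-in for the DCT work done by the coprocessor instructions. The loop body is the steady state; the code before and after it is the prologue and epilogue that fill and drain the pipeline.

```c
#include <stddef.h>

/* Stand-ins for the three pipeline stages. */
static int  load_stage(const int *in, size_t i)   { return in[i]; }
static int  compute_stage(int x)                  { return 2 * x + 1; }
static void store_stage(int *out, size_t i, int y) { out[i] = y; }

/* Applies compute_stage to every element, arranged so that each
 * steady-state pass handles three different iterations at once. */
void pipelined_map(const int *in, int *out, size_t n)
{
    if (n == 0)
        return;
    int loaded = load_stage(in, 0);            /* prologue: load 0 */
    if (n == 1) {
        store_stage(out, 0, compute_stage(loaded));
        return;
    }
    int computed = compute_stage(loaded);      /* prologue: compute 0 */
    loaded = load_stage(in, 1);                /* prologue: load 1 */

    for (size_t i = 1; i + 1 < n; i++) {
        int next = load_stage(in, i + 1);      /* load for iteration i+1 */
        store_stage(out, i - 1, computed);     /* store for iteration i-1 */
        computed = compute_stage(loaded);      /* compute for iteration i */
        loaded = next;
    }

    store_stage(out, n - 2, computed);         /* epilogue: drain */
    store_stage(out, n - 1, compute_stage(loaded));
}
```

On a machine that can issue a load, a compute, and a store together, the three statements in the loop body are independent of one another and can share a cycle.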

  24. VLIW processors • Executes several instructions per cycle. • Statically scheduled. (Figure: instruction register holding four Instr slots; register file.)

  25. TI C62/C67 • Up to 8 instructions/cycle. • 32 32-bit registers. • Function units: • Two multipliers. • Six ALUs. • All instructions execute conditionally.
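Conditional execution lets a compiler replace short branches with straight-line code, which matters on a statically scheduled machine. The sketch below shows the source-level idea with a pixel-style clamp; `clip_branch` and `clip_select` are illustrative names, and whether a given compiler actually emits predicated instructions for the second form is an assumption that has to be checked in the generated code.

```c
/* Clamp with explicit branches: control flow depends on the data. */
int clip_branch(int x, int lo, int hi)
{
    if (x < lo)
        return lo;
    if (x > hi)
        return hi;
    return x;
}

/* Same result written as two selects. Each select can be carried out
 * as a conditionally executed move, so no branch is required. */
int clip_select(int x, int lo, int hi)
{
    x = (x < lo) ? lo : x;
    x = (x > hi) ? hi : x;
    return x;
}
```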

  26. C6x block diagram (Figure: data RAM, 512K bits; program RAM/cache, 512K bits; execute unit with data path 1/reg file 1 and data path 2/reg file 2; DMA; serial; timers; PLL; JTAG; bus.)

  27. C6x data paths • General-purpose register files (A and B, 16 words each). • Eight function units: • .L1, .L2, .S1, .S2, .M1, .M2, .D1, .D2 • Two load units (LD1, LD2). • Two store units (ST1, ST2). • Two register file cross paths (1X and 2X). • Two data address paths (DA1 and DA2).

  28. C6x function units • .L • 32/40-bit arithmetic. • Leftmost 1 counting. • Logical ops. • .S • 32-bit arithmetic. • 32/40-bit shift and 32-bit field. • Branches. • Constants. • .M • 16 x 16 multiply. • .D • 32-bit add, subtract, circular address. • Load, store with 5/15-bit constant offset.

  29. Configurable vs. reconfigurable • Configurable: • CPU architectural features are selected at design time. • Reconfigurable: • Hardware can be reconfigured in the field. • May be dynamically reconfigured during execution.

  30. Tensilica configurable processors • Configurability: • Processor parameters (cache size, etc.) • Instructions. • Result: • HDL model for processor. • Software development environment.

  31. Xtensa configurability • Instruction set: • ALU extensions, coprocessors, wide instructions, DSP-style, function unit implementation. • Memory: • I cache config, D cache config, memory protection/translation, address space size, mapping of special-purpose memories, DMA access. • Interface: • Bus width, protocol, system register access, JTAG, queue interfaces to other processors. • Peripherals: • Timers, interrupts, exceptions, remote debug.

  32. TIE extensions • TIE language used to define instruction set definitions. • State declarations. • Instruction encodings and formats. • Operation descriptions.

  33. TIE example (Rowen) • Declarations: regfile LR 16 128 l (register file, 16 x 128 wide). • Operations: operation add128 {out LR sr, in LR ss, in LR st} { assign sr = st + ss; } (operation name: add128).

  34. Using instructions in C main() { int i; LR src1[256], src2[256], dest[256]; for (i=0; i<256; i++) dest[i] = add128(src1[i],src2[i]); }
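Without the Xtensa tool chain, the `add128` call can be mimicked in portable C to see what the custom instruction computes. This is an emulation sketch only: the `LR` struct of two 64-bit halves and the helper `add128_arrays` are assumptions made for illustration, whereas on the real processor `LR` is a 128-bit register file type and `add128` is a single instruction.

```c
#include <stdint.h>

/* Hypothetical software stand-in for a 128-bit LR register value. */
typedef struct {
    uint64_t lo;  /* bits 0..63 */
    uint64_t hi;  /* bits 64..127 */
} LR;

/* 128-bit add: sr = ss + st, with the carry out of the low half
 * propagated into the high half. Wraps modulo 2^128. */
static LR add128(LR ss, LR st)
{
    LR sr;
    sr.lo = ss.lo + st.lo;
    sr.hi = ss.hi + st.hi + (sr.lo < ss.lo);  /* (sr.lo < ss.lo) is the carry */
    return sr;
}

/* The loop from the slide, written as a reusable function. */
void add128_arrays(const LR *src1, const LR *src2, LR *dest, int n)
{
    for (int i = 0; i < n; i++)
        dest[i] = add128(src1[i], src2[i]);
}
```

The emulation needs several ordinary instructions per element, which is the cost the custom instruction removes.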

  35. Performance improvement • Compare Xtensa optimized vs. Xtensa out-of-the-box: • Compare performance/MHz. • EEMBC ConsumerMark: • Xtensa optimized: 2.02. • Xtensa out-of-the-box: 0.66. • EEMBC TeleMark: • Xtensa optimized: 0.47. • Xtensa out-of-the-box: 0.23. • EEMBC NetMarks: • Xtensa optimized: 0.123. • Xtensa out-of-the-box: 0.03.
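The improvement is easier to compare across benchmarks as a ratio of optimized to out-of-the-box score. A small helper, with the illustrative name `speedup`, makes the arithmetic explicit; the inputs are the per-MHz scores listed on the slide.

```c
/* Ratio of the optimized score to the out-of-the-box score. */
double speedup(double optimized, double out_of_the_box)
{
    return optimized / out_of_the_box;
}
```

With the slide's figures this gives roughly 3.1x for ConsumerMark (2.02 / 0.66), 2.0x for TeleMark (0.47 / 0.23), and 4.1x for NetMarks (0.123 / 0.03).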

  36. Philips Viper set-top-box platform (Figure: off-chip SDRAM; Trimedia; MIPS; MMI bus; bus ctrl (x2); PCI; MBS; MC bridge; TC bridge; C bridge; clocks, DMA, reset, debug; SPDIF; 2D; AICP; GPIO; I2C, Smcard; USB, 1394; MPEG.)

  37. TI OMAP • OMAP 5910: • Targets communications, multimedia. • Multiprocessor with DSP, RISC. (Figure: C55x DSP; MPU interface bridge; MMU; I/O; system DMA control; memory ctrl; ARM9.)

  38. OMAP HW/SW architecture (Figure labels: applications; Symbian, Linux, Palm, WinCE; DSP OS; app-specific protocol; DSP gateway; real-time tasks; DSP/BIOS bridge; SW; DSP manager; DSP manager server; link driver (x2); hardware arbitration layer (x2); HW; ARM9; C55x.)

  39. C55x vs. ARM9E processing time • From Pace Soft Silicon/TI (Mcycles/sec). • Faster processing time means longer shutdown for battery savings.
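The link between cycle count and shutdown time can be made concrete. The sketch below assumes a simple model in which a task needing a fixed number of Mcycles per second runs at full clock rate and the processor is shut down for the rest of each second; `busy_fraction` and `idle_fraction` are hypothetical names, and the model ignores wake-up overhead and leakage.

```c
/* Fraction of each second the processor must run, given a workload in
 * Mcycles/sec and a clock rate in MHz. Saturates at 1.0 when the
 * workload does not fit. */
double busy_fraction(double load_mcycles_per_sec, double clock_mhz)
{
    if (clock_mhz <= 0.0)
        return 1.0;
    double busy = load_mcycles_per_sec / clock_mhz;
    return busy > 1.0 ? 1.0 : busy;
}

/* Fraction of each second available for shutdown. */
double idle_fraction(double load_mcycles_per_sec, double clock_mhz)
{
    return 1.0 - busy_fraction(load_mcycles_per_sec, clock_mhz);
}
```

For example, a workload of 50 Mcycles/sec on a 200 MHz clock leaves the processor idle for three quarters of the time; halving the cycle count to 25 Mcycles/sec raises that to seven eighths. These inputs are made up for illustration and are not taken from the Pace Soft Silicon/TI data.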

  40. OMAP5912 block diagram (Figure: TMS320C55x DSP with DSP MMU; timer, intr handler; DSP public devices; flash, SRAM; memory interface; traffic controller; MPU peripheral bridge; shared I/O devices; system DMA; SDRAM; frame buffer; ARM926EJS; ARM private devices; ARM public devices; MPU IF; LCD IF.)

  41. OMAP software platform (Figure labels: MM services, plug-ins, protocols; multimedia APIs; MM OS server; high-level OS; app-specific; DSP RTOS; gateway components; DSP SW components; DSP Bridge API; DDAPI (x2); device drivers (x2); DSP/BIOS Bridge; CSLAPI; ARM CSL (OS-independent); DSP CSL (OS-independent).)

  42. DSPBridge • Abstracts the DSP software architecture for the general-purpose software environment. • APIs include driver interfaces and application interfaces: • Initiate and control DSP tasks. • Exchange messages with DSP. • Stream data to/from DSP. • Check status.

  43. ST Nomadik (heterogeneous multiprocessors) • Targets mobile multimedia. • A multiprocessor-of-multiprocessors. (Figure: ARM9; memory system; I/O bridges; audio accelerator; video accelerator.)

  44. ST MMDSP+ • Embedded processor core used in multiple chips: • Runs at 175 MHz. • 1 cycle per instruction. • 2-level instruction cache. • 16/24-bit fixed point. • 32-bit floating point. • Programmed in C. • Fully synthesizable.

  45. Nomadik video accelerator (Figure: MMDSP+ with instr RAM and data RAM; Xbus; video codec; picture input processing; picture post processing; interrupt controller; local data bus; master AHB; DMA.)

  46. Nomadik audio processor (Figure: MMDSP+ with instr cache, X RAM, and Y RAM; X bus; Y bus; DMA1; DMA2; ARM DMA; timers, GPIO, etc.; slave AHB; master AHB.)

  47. Nomadik support • Supports OMAPI standard. • Standard includes mid-level driver APIs. • Nomadik defines: • Base operating system. • Base drivers. • Multimedia accelerators and their drivers.

  48. Embedded vs. scientific applications • Embedded applications provide task-level parallelism. • Embedded applications run many different types of algorithms at once. (Figure: scientific multiprocessor, with three CPUs and three memories connected by a network; label: matrix.)

  49. Standards-based embedded systems • Many product categories rely on standards. • Standards body provides reference implementation. • Reduces development time. • Don’t want to introduce bugs. • Reference implementation may not be well-suited to implementation: • No task structure; • Not optimized.

  50. Design challenges in standards-driven markets • Design and verify methods within the standard. • Standards allow differentiation. • Design and verify methods outside the standard’s scope. • User interface, etc. • Design and verify interfaces. • Within standard, connection to extra-standard elements.
