
Low Power Multimedia Reconfigurable Platforms




  1. Low Power Multimedia Reconfigurable Platforms Jun-Dong Cho SungKyunKwan Univ. Dept. of ECE, Vada Lab. http://vada.skku.ac.kr

  2. What are the Challenges? [ST Microelectronics, MorphICs, Dataquest, eASIC]
[Chart residue: integration density (Moore's law, ×1.4/year), µprocessor integration density (×1.2/year), and communication bandwidth (Hansen's law) plotted over time; the curves diverge by a factor of 2.]

  3. Reconfigurable System
• Reconfigurable systems suit the dynamic application and communication environment of wireless multimedia devices such as SDR.
• A hierarchical system model is used in which Quality of Service and energy consumption play a crucial role.
• Tasks of an application are partitioned dynamically.

  4. Reconfigurable SoC
• As technology (supply voltage) scales down, logic (transistors) becomes virtually free, while interconnect becomes the bottleneck and the dominant power consumer.
• Parallel execution of nested do-loop algorithms on an array of localized processing elements at moderate clock frequency is a viable solution.
• It can balance three orthogonal concerns: design time, power consumption, and performance.
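The nested do-loop mapping in the second bullet can be sketched in software. Below is a toy model (all names hypothetical, not from the slides): a 2-D loop is tiled across a grid of processing elements, and each PE touches only its local tile, mimicking localized computation.

```python
# Toy sketch: tile a nested do-loop over a pe_rows x pe_cols grid of PEs.
# Each (pr, pc) pair stands for one PE working only on its local tile.

def run_on_pe_array(a, b, pe_rows, pe_cols):
    """Element-wise add of two n x n matrices, tiled across the PE grid."""
    n = len(a)
    out = [[0] * n for _ in range(n)]
    tile_r = n // pe_rows
    tile_c = n // pe_cols
    for pr in range(pe_rows):                # enumerate PEs (concurrent in HW)
        for pc in range(pe_cols):
            for i in range(pr * tile_r, (pr + 1) * tile_r):      # local rows
                for j in range(pc * tile_c, (pc + 1) * tile_c):  # local cols
                    out[i][j] = a[i][j] + b[i][j]
    return out
```

In hardware, all PEs would run these tiles concurrently at a moderate clock frequency; here the outer two loops merely enumerate them sequentially.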

  5. Context
• SoC and customizable platform-based design.
[Diagram residue: candidate mappings of a specification — DSP, fine-grain reconfigurable hardware, coarse-grain reconfigurable hardware, ASIC 1, ASIC 2 — compared on processing power, area, power consumption, etc.]
• We need metrics to compare!

  6. First choose the right architecture … (Jan Rabaey)
[Chart residue: flexibility vs. area/power efficiency —
• Embedded processor (lpARM): 0.5–5 MIPS/mW
• DSP (e.g. TI 320CXX): 10–100 MOPS/mW
• Reconfigurable processors (Maia): 100–1000 MOPS/mW
• FPGA and direct-mapped hardware: a further factor of 100–1000 over the embedded processor.]

  7. Design Space of Reconfigurable Architectures (R-SOC)
• Fine grain (FPGA): island or hierarchical topology
• Coarse grain (systolic): mesh, linear, or hierarchical topology
• Multi-granularity (heterogeneous): tile-based architecture, or processor + coarse-grain or fine-grain coprocessor
• Examples: RAW, CHESS, MATRIX, KressArray, Systolix PulseDSP, Xilinx Virtex, Xilinx Spartan, Atmel AT40K, Lattice ispXPGA, Altera Stratix, Altera Apex, Altera Cyclone, Chameleon, REMARC, MorphoSys, Pleiades, Garp, FIPSOC, Triscend E5, Triscend A7, Xilinx Virtex-II Pro, Altera Excalibur, Atmel FPSLIC, aSoC, E-FPFA, Systolic Ring, RaPiD, PipeRench, DART, FPFA

  8. Makimoto’s Wave: “Mainstream Silicon Application is switching every 10 Years”
[Chart residue: semiconductor revolutions alternating between standardization and customization — 1957 TTL, 1967 custom LSI/MSI, 1977 standard µprocessors and memory, 1987 ASICs and accelerators, 1997 reconfigurable, 2007 a new breed (M&C) needed; hardware people vs. software people; instruction streams vs. data streams; structured VLSI design after the 1st design crisis; a 2nd design crisis.]
• Communication gap: terminology clean-up needed.

  9. Three different mind sets
[Chart residue: the same timeline — 1957 TTL, 1967 LSI/MSI, 1977 µprocessors and memory, 1987 ASICs and accelerators, 1997 FPGAs and soft CPUs, 2007 coarse grain — spanning hardware people, CS people, and a new breed needed.]
• Common terminology needed.

  10. Machine paradigms
• von Neumann: instruction-stream machine — a CPU (instruction sequencer + DPU) fetches an instruction stream from memory M.
• Data-stream machine — a flowware data sequencer generates address streams; data streams flow from memories (asM*) and I/O through a (reconfigurable) datapath array, an (r)DPA of (r)DPUs, with an embedded memory architecture; software vs. configware.
[Diagram residue: memory banks M and I/O ports surrounding the (r)DPA.]
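The two paradigms can be contrasted on the same job in a few lines of toy code (hypothetical models, not from the slides): summing pairwise products of two arrays, once with a per-item instruction stream and once with a fixed, pre-configured DPU fed by a data stream.

```python
# Toy contrast of the two machine paradigms on the same computation.

def von_neumann(xs, ys):
    """Instruction-stream machine: a sequencer drives a generic DPU,
    fetching/decoding the program once per data item."""
    program = [("mul",), ("acc",)]
    acc = 0
    for x, y in zip(xs, ys):
        for op in program:          # per-item instruction fetch/decode
            if op[0] == "mul":
                tmp = x * y
            elif op[0] == "acc":
                acc += tmp
    return acc

def data_stream(xs, ys):
    """Data-stream machine: the DPU is fixed once (configware); a data
    sequencer streams operands through it -- no per-item instruction fetch."""
    dpu = lambda x, y, acc: acc + x * y   # configured once
    acc = 0
    for x, y in zip(xs, ys):              # the 'flowware': the data sequence
        acc = dpu(x, y, acc)
    return acc
```

Both produce the same result; the difference is what is sequenced per item: instructions in the first model, data in the second.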

  11. Architecture Choices for Real-time Embedded Systems (Greg Delagi, TI)

  12. Fine-Grained RSOCs: Xilinx Virtex-II Pro
• Xilinx, Inc., San Jose, CA
• Up to 4 PowerPC 405 processor cores
• Up to 160k reconfigurable logic cells (4-input, 1-output lookup tables)
• Up to 216 dedicated 18-bit × 18-bit multipliers
• Up to 216 18-kbit on-chip distributed memory blocks
• Up to 852 I/O pins
• www.xilinx.com
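The 4-input, 1-output lookup tables mentioned above can be modeled directly: a LUT is just a 16-entry truth table indexed by its input bits, and "reconfiguring" the cell means loading a different table. A minimal sketch (names hypothetical):

```python
# Minimal model of a 4-input, 1-output LUT: the configuration is a
# 16-entry truth table; the inputs form the index into it.

def make_lut4(truth_table):
    """truth_table: 16 output bits, indexed by (d<<3) | (c<<2) | (b<<1) | a."""
    assert len(truth_table) == 16
    def lut(a, b, c, d):
        return truth_table[(d << 3) | (c << 2) | (b << 1) | a]
    return lut

# The same fabric cell becomes different logic just by changing the table:
and4 = make_lut4([1 if i == 0b1111 else 0 for i in range(16)])  # 4-input AND
xor4 = make_lut4([bin(i).count("1") & 1 for i in range(16)])    # 4-input XOR
```

This is the sense in which fine-grain fabrics are "fine": every cell computes an arbitrary function of a handful of bits, at the cost of per-bit configuration overhead.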

  13. Xilinx’s Xtreme

  14. Fine-Grained RSOCs: Altera Excalibur
• Altera, San Jose, CA
• 32-bit ARM9-based microprocessor @ 200 MHz
• Up to 256 kbytes SRAM
• Up to 1M programmable logic gates
• 200 MHz bus
• Built-in SDRAM controller

  15. Fine-Grained RSOCs: Triscend A7 CSoC
• A7 family, Triscend
• 32-bit ARM7 with 8 kB cache
• 3200 logic cells max. (40K gates)
• Up to 3800 flip-flops
• Up to 300 programmable I/O pins
• www.triscend.com

  16. Coarse-Grained RSOCs: Chameleon Structure
(Paul J.M. Havinga, Lodewijk T. Smit, Gerard J.M. Smit, Martinus Bos, Paul M. Heysters)
• Chameleon Systems Inc.
• 32-bit ARC control processor
• Up to 84 32-bit datapath units (DPUs); DPU = a 32-bit ALU + a 32-bit barrel shifter
• Up to 24 16×24-bit multipliers
• Up to 48 128×32-bit local memory modules
• Up to 160 programmable I/O pins
• Targeted at 3rd-generation wireless basestations, wireless local loop, SW radio, etc.
• www.chameleonsystems.com

  17. Architectural Rationale and Motivation (Scott Weber, University of California at Berkeley)
• Configurable processors have shown orders-of-magnitude performance improvements
• Tensilica has shown ~2× to ~50× performance improvements
– Specialized functional units
– Memory configurations
• Tensilica matches the architecture with software development tools
[Diagram residue: setting memory parameters and adding DCT and Huffman (HUF) blocks to a base pipeline (Memory, RegFile, FUs, ICache) for a JPEG application.]

  18. Architectural Rationale and Motivation
• To continue this performance-improvement trend:
– Architectural features that exploit more concurrency are required
– Heterogeneous configurations need to be made possible
– Software development tools must support the new configuration options
[Diagram residue: configurable VLIW PEs and network topology — concurrent processes are required to continue the trend; the design begins to look like a VLIW; a generic mesh may not suit the application’s topology.]

  19. IXP1200 Network Processor (Intel)
• Six micro-engines supporting 24 contexts
• Hash instructions
• StrongARM core
• Bus and memory controllers
• An example of an architecture we want to be able to configure to
[Diagram residue: SDRAM controller, PCI interface, hash engine, SA core with ICache, DCache, and mini-DCache, IX bus interface, scratch-pad SRAM, SRAM controller, and the six micro-engines.]

  20. Architecture Goals
• Provide a template for the exploration of a range of architectures
– Retarget the compiler and simulator to the architecture
– Enable the compiler to exploit the architecture
• Concurrency
– Multiple instructions per processing element
– Multiple threads per and across processing elements
– Multiple processes per and across processing elements
• Support for efficient computation
– Special-purpose functional units, intelligent memory, processing elements
• Support for efficient communication
– Configurable network topology
– Combined shared memory and message passing

  21. Architecture Template
• Prototyping template for an array of processing elements
• Configure processing elements for efficient computation
• Configure memory elements for efficient retiming
• Configure the network topology for efficient communication
[Diagram residue: PEs (Memory, RegFile, FUs, DCT, HUF, ICache) configured and connected to match the application.]

  22. Architecture Template
• Templates provide a prototyping platform for constrained refinement
• Estimators feed back system performance and guide configuration
• The system designer refines the configuration, or the process is automated
• Refined elements have a compatible interface in the system
[Diagram residue: programmer’s model and µarch feeding a generated compiler, estimator, and simulator (.o), guided by the designer.]

  23. Synthesis of Architectures
• Not inventing new architectures
– We are providing a tool for the prototyping and synthesis of a family of architectures
– It gives a micro-architecture, ISA, compiler, and simulator
– Refine within an instance to improve characteristics of the design
• Most existing architectures are a point in the architecture spectrum
– We want to allow a wide range of architectures to be realized
– Each coupled with supporting software development tools

  24. Initial Processing Element
• VLIW-class architecture (HPL-PD) to exploit ILP
• Malleable elements:
– Memory size
– Cache size
– Register file size
– Number of functional units
– Specialized functional units
[Diagram residue: memory system, register file, four FUs plus a specialized FU (SFU), instruction cache.]
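The ILP exploited by a VLIW-class PE can be illustrated with a toy scheduler (a hypothetical sketch, not HPL-PD's actual algorithm): independent operations are greedily packed into wide instruction words, bounded by the number of functional units.

```python
# Toy VLIW packing: greedily fill each instruction word with operations
# whose dependences are already satisfied, up to num_fus ops per word.

def pack_vliw(ops, deps, num_fus):
    """ops: list of op names; deps: {op: set of ops it depends on}.
    Returns a list of instruction words (lists of ops issued together)."""
    done, words = set(), []
    remaining = list(ops)
    while remaining:
        word = []
        for op in list(remaining):
            if deps.get(op, set()) <= done and len(word) < num_fus:
                word.append(op)
        if not word:
            raise ValueError("cyclic dependences")
        for op in word:
            remaining.remove(op)
        done |= set(word)
        words.append(word)
    return words
```

With more FUs, more independent ops land in the same word, which is exactly the knob the "number of functional units" bullet exposes.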

  25. Future Processing Element
• Specialized memory systems for efficient memory utility
– Multi-ported, banked, multi-level, and intelligent memory
• A split register file allows greater register bandwidth to FUs
– Groups of functional units have dedicated register files
– Sticky state for specialized FUs saves register-file reads and writes
• Multiple contexts per processing element provide latency tolerance
– Hardware for efficient context switching to fill empty instruction slots
• Specialized functional units and processing elements
– SIMD instructions
– Reconfigurable fabrics for bit-level operations
– Re-use of IP blocks for more efficient computation
– Custom hardware for the highest performance

  26. Initial Distributed Architecture
• Array of concurrent PEs and supporting network
• Malleable network topology
– Topology matches the application
– Efficient communication

  27. Initial Distributed Architecture
• Array of concurrent PEs and supporting network
• Malleable network and PEs
– Topology matches the application
– Refine to meet system constraints
• Memory organized around a PE
– Each PE has physical memory
– Message passing between PEs

  28. Future Distributed Architecture
• Multiple processing elements share a memory space
– Shared-memory communication
– Snooping cache-coherency protocol
– A directory-based protocol is required if the number of PEs in a shared memory space is large
• Introspective processing elements
– Use processing elements to analyze the computation or communication
– Identify dynamic bottlenecks and remove them on the fly
– Reschedule and bind tasks as the introspective elements report

  29. Communication Models
• Shared memory
– Hardware handles loads and stores from PEs to a common memory
– Synchronization is separate from communication
– Interacting threads on a single processing element or a group of them
• Message passing
– Hardware to send and receive messages and invoke a handler
– Synchronization and communication are combined
– Interacting processes between a single processing element or a group of them
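The contrast above can be sketched with ordinary threads (a hypothetical illustration; the slide's hardware mechanisms are emulated with a lock and a queue). Both versions compute the same reduction; what differs is whether synchronization is separate from or bundled with the communication.

```python
# Both communication models computing the same reduction.
import threading
import queue

def shared_memory_sum(chunks):
    """Shared memory: workers update a common total. Synchronization
    (the lock) is separate from communication (the shared variable)."""
    total, lock = [0], threading.Lock()
    def worker(chunk):
        s = sum(chunk)
        with lock:               # explicit, separate synchronization
            total[0] += s        # communication via shared state
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads: t.start()
    for t in threads: t.join()
    return total[0]

def message_passing_sum(chunks):
    """Message passing: each 'PE' sends its partial sum; receiving the
    message is both the synchronization and the communication."""
    q = queue.Queue()
    def worker(chunk):
        q.put(sum(chunk))        # send
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads: t.start()
    for t in threads: t.join()
    return sum(q.get() for _ in chunks)   # receive
```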

  30. Memory Model
• Relax the consistency model
– Hardware implements lock and unlock mutex instructions
– Synchronization instructions are inserted in the program
– Loads and stores before a lock must complete before loads and stores after the lock are started
• Relaxing the ordering of reads and writes increases memory utility
– The compiler is constrained from reordering around synchronization barriers
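The lock-as-ordering-point rule can be illustrated with a small producer/consumer (a hypothetical sketch; Python's lock stands in for the hardware lock/unlock instructions). The store to `data` happens before the lock-protected flag is set, so any thread that later acquires the lock and sees the flag is guaranteed to see the data too.

```python
# The lock is the ordering point: stores made before releasing it are
# visible to whoever acquires it next, even if unrelated accesses may be
# reordered freely elsewhere.
import threading

data, ready = [0], [False]
lock = threading.Lock()

def producer():
    data[0] = 42             # store ordered before the flag's release
    with lock:
        ready[0] = True      # publish under the lock

def consumer(results):
    while True:
        with lock:           # acquire orders our loads after the release
            if ready[0]:
                results.append(data[0])  # guaranteed to observe 42
                return

results = []
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer, args=(results,))
t1.start(); t2.start()
t1.join(); t2.join()
```

Without the lock, a relaxed memory model would be free to make the flag visible before the data, which is exactly what the compiler-reordering constraint in the last bullet prevents.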

  31. Range of Architectures
• Scalar configuration
• EPIC configuration
• EPIC with special FUs
• Mesh of HPL-PD PEs
• Customized PEs and network
• Supports a family of architectures; plan to extend the family with the micro-architectural features presented
[Diagram residue: scalar configuration — memory system, register file, a single FU, instruction cache.]

  32. Range of Architectures (continued)
[Diagram residue: a PE with five FUs (memory system, register file, instruction cache) within a mesh of PEs.]

  33. Range of Architectures (continued)
[Diagram residue: a PE with two generic FUs plus DES, DCT, and FFT special FUs.]

  34. Range of Architectures (continued)
[Diagram residue: a PE with special FUs (DES, DCT, FFT) within a mesh of PEs.]

  35. Range of Architectures (continued)
[Diagram residue: a mesh of PEs.]

  36. Range of Architectures (Future)
• Template support for such an architecture
– Prototype the architecture
– Generate the software development tools: compiler and simulator
[Diagram residue: the Intel IXP1200 network processor — SDRAM controller, PCI interface, hash engine, SA core with caches, IX bus interface, scratch-pad SRAM, SRAM controller, six micro-engines.]

  37. The Research Playground
[Diagram residue: application, algorithm, software implementation, and compilation/SW environment on one side; architecture, microarchitecture, component assembly and synthesis, and verification, manufacture, and test on the other; “What is the Programmer’s Model?” bridging the two.]

  38. Mescal Compiler (Manish Vachharajani, Princeton University)

  39. Outline
• Compiler goals
• Compiler research issues
• Compiler infrastructure requirements
• Trimaran 2.0 compiler infrastructure
• Ongoing work
• Summary

  40. So What’s Different?
• General-purpose compilers are hand-tuned to:
– SPEC benchmarks
– A particular general-purpose machine
• We need a compiler tuned to:
– A specific application
– A particular application-specific machine
• And…
– Meet code density, real-time, and power constraints
– Do this automatically for a range of applications and architectures

  41. So What’s Different?
• Traditional application hw/sw design requires:
– Hand selection of traditional general-purpose OS components
– Hand-written customization of device drivers, memory management, …
• Instead:
– Application-specific synthesis of traditional OS components (scheduling, synchronization, …)
– Automatic synthesis of hardware-specific code (device drivers, memory management, …) from specifications

  42. Compiler Goals
• Develop a retargetable compiler infrastructure that enables a set of interesting applications to be efficiently mapped onto a family of fully programmable architectures and microarchitectures.
• 10-year vision:
– Fully automatically-retargetable compilation, OS synthesis, and simulation for a class of architectures consisting of multiple heterogeneous processing elements with specialized functional units and memories
– Compiled code size and performance within 10% of hand-coding

  43. Compiler Research Issues
• Synthesis of RTOS elements in the compiler
– On the application side: generation of an efficient application-specific static/run-time scheduler and synchronization
– On the hardware side: generation of device drivers, memory-management primitives, etc. from hardware specifications
• Automatic retargetability for a family of target architectures while preserving aggressive optimization
• Automatic application partitioning
– Mapping process/task-level concurrency onto multiple PEs using programmer guidance in the programmer’s model
• Effective visualization for a family of target architectures
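The task-to-PE mapping problem in the partitioning bullet can be sketched with a toy binder (hypothetical names and policy, not the Mescal algorithm): bind each task to the least-loaded PE, while honoring programmer hints that pin a task to a specific PE.

```python
# Toy task-to-PE binding: greedy least-loaded assignment with optional
# programmer guidance (pinned tasks), processed largest-cost first.

def bind_tasks(tasks, num_pes, pinned=None):
    """tasks: {name: cost}; pinned: {name: pe_index} programmer hints.
    Returns (assignment {name: pe_index}, per-PE load list)."""
    pinned = pinned or {}
    loads = [0] * num_pes
    assignment = {}
    for name, cost in sorted(tasks.items(), key=lambda kv: -kv[1]):
        pe = pinned.get(name, loads.index(min(loads)))  # hint wins if present
        assignment[name] = pe
        loads[pe] += cost
    return assignment, loads
```

A real partitioner would also weigh communication between tasks, which is why the slides tie this issue to the programmer's model rather than leaving it purely automatic.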

  44. Compiler Infrastructure Requirements
• High level of usability (good documentation, well-coded)
• Large suite of machine-independent code optimizations
• Significant level of retargetability
• Strong support for instruction-level parallelism
• Support for memory as a first-class citizen
• Simulation tools
• Preferably: visualization tools and a good support team

  45. Trimaran 2.0 Compiler Overview (www.trimaran.org)
• C front-end: IMPACT (U. of Illinois IMPACT Group)
• Back-end: ELCOR (HP Labs CAR Group), driven by a machine description (MDES)
• Simulator and visualization (NYU ReaCT-ILP Group)
• IMPACT/ELCOR feature strong VLIW data-structure and algorithm support:
– Data structures: basic, hyper, and super blocks; loop analysis; procedure analysis; miscellaneous (e.g. lists, sets)
– Algorithms: if-conversion, software pipelining, scheduling/register allocation

  46. Trimaran 2.0 Overview: Simulator and Visualization Tools
• Cycle-level simulator easily extensible to support new specialized operations
– Simply augment the table specifying operation semantics
• Visualization tools display an assortment of useful static/dynamic information:
– Instruction schedule
– Data-dependency graphs
– Total cycles per function/region
– Percentage of total function operations that are branches, loads, stores, integer ALU, floating-point ALU, etc.
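The "augment the table" idea can be sketched as follows (a hypothetical table format, not Trimaran's actual one): a cycle-level simulator is driven by a dictionary mapping each opcode to its latency and semantics, so supporting a new specialized operation is a one-entry change.

```python
# A cycle-counting simulator driven entirely by an operation-semantics table.

OP_TABLE = {
    # opcode: (latency_cycles, semantics)
    "add": (1, lambda a, b: a + b),
    "mul": (3, lambda a, b: a * b),
}

# Extending the simulator = augmenting the table with one entry:
OP_TABLE["mac3"] = (2, lambda a, b: a * b + a)   # a made-up specialized op

def simulate(program, regs):
    """program: list of (op, dst, src1, src2) over a register dict.
    Returns (final regs, total cycles)."""
    cycles = 0
    for op, dst, s1, s2 in program:
        latency, fn = OP_TABLE[op]
        regs[dst] = fn(regs[s1], regs[s2])
        cycles += latency
    return regs, cycles
```

Because the simulator core never hard-codes opcodes, the compiler and simulator can agree on a new operation purely through shared table data.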

  47. Trimaran 2.0 Overview: Machine Description (MDES)
• Target specified in a high-level machine-description language, translated into a low-level language
• ELCOR supports PlayDoh
– Parameterized non-clustered VLIW architecture
– Support for speculative/predicated execution and software pipelining
• The user may modify the following PlayDoh parameters:
– Number of registers
– Number of integer, floating-point, memory, and branch FUs
– Operation latencies
[Flow residue: C → IMPACT front-end → high-level PlayDoh MDES → low-level PlayDoh MDES → ELCOR back-end → simulator and visualization.]

  48. Extensions to Trimaran 2.0: Support for Multiple PEs (MESCAL Machine Description)
• ELCOR does not provide MDES and data-structure support for multiple PlayDoh PEs
• A new MDES format has been devised to support multiple PEs with varying connectivity
– An array of MDES data structures is maintained, one per PE
– Each code region must be associated with a PE’s MDES prior to code generation
• Communication channels between PEs are currently not modeled
[Diagram residue: PE1 … PEm machine descriptions; Channel1: PE1 → PE2, Channel2: PE1 → PE3, …, Channeln: PEi → PEj.]
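The scheme above can be sketched as a data structure (all field and class names are hypothetical, not the actual MESCAL format): an array of per-PE machine descriptions, a channel list recording connectivity, and a binding step that associates each code region with one PE's MDES before code generation.

```python
# Sketch of a multi-PE machine description: per-PE MDES array + channels.
from dataclasses import dataclass

@dataclass
class MDES:
    name: str
    num_regs: int
    fu_counts: dict          # e.g. {"int": 2, "fp": 1}

@dataclass
class Channel:
    src: int                 # index into the MDES array
    dst: int

class MultiPEMachine:
    def __init__(self, pes, channels):
        self.pes = pes                 # one MDES per PE
        self.channels = channels       # connectivity (not modeled in detail)
        self.region_binding = {}       # code region -> PE index

    def bind_region(self, region, pe_index):
        """Associate a code region with a PE prior to code generation."""
        if not 0 <= pe_index < len(self.pes):
            raise ValueError("no such PE")
        self.region_binding[region] = pe_index

    def mdes_for(self, region):
        """Code generation looks up the bound PE's description."""
        return self.pes[self.region_binding[region]]
```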
