Embedded Systems in Silicon TD5102 Introduction and overview

Henk Corporaal
http://www.ics.ele.tue.nl/~heco/courses/EmbSystems
Technical University Eindhoven
DTI / NUS Singapore, 2005/2006

Contents
• Trends
• Platforms
• Application mapping
• Design flow
• Summary
H.C. TD5102

Observation 1: The 3 Cs
• Convergence of the 3 Cs: computers, communications and consumer electronics
• The computer enters its third phase: computing power - networking - intelligent processing
• The world is one network: wherever, whenever, all information and communication available
• We get a smart environment

Observation 2: Current design practice
[Figure: Y-chart (Gajski-Kuhn) with axes Behaviour, Structure and Physical, and levels System, Algorithm, R/T, Logic, Circuit]
• A design flow is a path in the Y-chart
• Until the RT level is reached, the flow is largely manual

Observation 3: Informal system specification
[Figure labels: system people; tasks; paper spec; hardware people (VHDL, Verilog); software people (C, ASM); integration]

Observation 4: Design productivity
[Figure: complexity (log scale, 10^1 to 10^3) versus year (4 to 16). Process technology: +58%; HW design productivity: +21%; SW productivity: +8%. The differences are labeled HW gap and SW gap.]
• Yes, we can fabricate the ICs, but …
• Can we design them?
• Can we program them?

Observation 5: More dynamic applications
[Figure 1: load (0% to 100%) per frame, frames 0 to 300 (IPPP ...), for the sequence "weather" (VO1, binary shape, 10 Hz, 112 kbit/s, QCIF); the load varies by a factor of 2]
[Figure 2: relative CPU load for 15 fps (0% to 1200%); the load varies by an order of magnitude]
• Video: P. Kuhn, G. Diebel, “Complexity Analysis of the MPEG-4 VM 8.0,” ISO/IEC JTC1/SC29/WG11/MPEG97/m2862, Fribourg, October 1997
• 3D

Observation 6: Memory problem
Processor-memory performance gap: grows 50% per year
[Figure: performance (log scale, 1 to 1000) versus time (1980 to 2000). µProc (CPU, "Moore's Law"): 55%/year; DRAM: 7%/year. Source: Patterson]
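The growth of the gap follows from compounding the two rates on the slide. The sketch below assumes both rates compound yearly from a common starting point; the function name `gap_after` is illustrative only. With 55% and 7% the gap grows by roughly 45% per year, which the slide rounds to 50%.

```python
# Annual growth rates as stated on the slide (assumed to compound yearly).
CPU_GROWTH = 0.55
DRAM_GROWTH = 0.07


def gap_after(years):
    """Ratio of CPU to DRAM performance after `years`, both starting at 1."""
    return ((1 + CPU_GROWTH) / (1 + DRAM_GROWTH)) ** years


yearly_gap_growth = gap_after(1) - 1
print(f"gap grows {yearly_gap_growth:.1%} per year")
print(f"gap after 20 years (1980 to 2000): {gap_after(20):.0f}x")
```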

What do we learn from these observations?
We need:
• Short time-to-market
  • reuse
  • short design time
• Flexible solution
  • programmability
  • reconfigurability
• Scalability
• Low power
• Low cost
• QoS control
At sufficient performance!

Solution?
• Platforms
  • HW and SW IP reuse
  • Standardization (interfaces)
  • QoS (quality of service) hooks
• Advanced design flow for platforms
  • Raise abstraction level
  • Tool support
  • Modeling of power, cost, performance
  • Predictable design

Lecture 1: Introduction
• Trends
• Platforms
• Application mapping
• Design flow
• Summary

What is a platform?
A platform is a generic but domain-specific information processing (sub)system.
In the future it will be available as a single chip (SoC) or a single package (SiP).

What is a platform?
• HW properties:
  • One or more programmable processors
  • Advanced memory organization
  • Programmable communication network
  • I/O (highly domain dependent)
• Possible extra HW features:
  • Reconfigurable logic
  • Domain-specific accelerators

What is a platform?
• SW components:
  • Standardized RTOS
  • Proper tooling for platform system design
    • Compilers, models, exploration, debugging, simulation, …
• Possible extra SW features:
  • Middleware layer on top of the OS for features like:
    • QoS
    • Domain-specific protocols
    • Domain-specific SW interfaces
    • Control of reconfigurable logic
    • Library components
    • Distributed / active network processing
    • Billing
    • Security

Example platform: Philips Nexperia
[Figure: Philips Nexperia]
Available in the billion-transistor era:
• E.g. TI OMAP, Sony Cell, Philips Nexperia, TRIPS, Xilinx Virtex-4 Pro, …

Future platforms
Example: smart networked devices
[Figure labels: active packets; virtual machine; protocols; multimedia (MPEG 21); network OS; library; accelerator hardware; reconfigurable hardware; programmable hardware; radio]

Future platform: architecture concept
[Figure: CPUs, accelerators and reconfigurable HW blocks with memory and I/O, connected by communication networks at level 0, level 1, ..., level N]

Future platforms: NoC realization
[Figure: IP cores attached through network interfaces to an on-chip network]
• IP isles:
  • 32-bit RISC microprocessor ~ 20 Kgates
  • MPEG decoding ~ 100 Kgates
  • Wavelet filtering ~ 40 Kgates
  • SRAM
  • DRAM
  • FPGA block

Platform and platform design
[Figure labels, top to bottom: applications; SDT (system design technology); platform; PDT (platform design technology); enabling technologies. SDT and PDT together form the design technology.]

What is the system designer's problem?
[Figure: idea, specification, implementation]
For an application (idea/specification), find an efficient mapping/implementation on a given realization space, under given constraints (cost, P, E, T, E*D, throughput, #pins, ..)

A (single) processor: what does it look like inside?
[Figure: processor datapath (register file with r0, r1, r2, ...; function units; load-store unit; data memory) and processor control (instruction memory, instruction register, decode logic)]

Mapping: placing operations in space and time
d = a * b;
e = a + d;
f = 2 * b + d;
r = f - e;
x = z + y;
[Figure: data dependence graph (DDG) of these statements, with inputs a, b, 2, z, y and results d, e, f, r, x]
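A dependence graph like this one can be derived mechanically. The following is a minimal sketch, not the course's tooling: it works at statement granularity (so `f = 2 * b + d` is one node, whereas the slide's DDG splits it into a multiply and an add) and records only true (read-after-write) dependences. The names `build_ddg` and `critical_path` are illustrative.

```python
import re

STATEMENTS = ["d = a * b", "e = a + d", "f = 2 * b + d", "r = f - e", "x = z + y"]


def build_ddg(statements):
    """Return edges (producer index, consumer index) for read-after-write dependences."""
    last_writer = {}
    edges = []
    for idx, stmt in enumerate(statements):
        target, expr = [part.strip() for part in stmt.split("=")]
        for var in re.findall(r"[A-Za-z_]\w*", expr):
            if var in last_writer:
                edges.append((last_writer[var], idx))
        last_writer[target] = idx
    return edges


def critical_path(n_nodes, edges):
    """Longest dependence chain, counted in statements (edges point forward)."""
    depth = [1] * n_nodes
    for src, dst in sorted(edges, key=lambda e: e[1]):
        depth[dst] = max(depth[dst], depth[src] + 1)
    return max(depth)


edges = build_ddg(STATEMENTS)
print(edges)
print(critical_path(len(STATEMENTS), edges))
```

The longest chain bounds the schedule length from below, no matter how many function units are available.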

How to map these operations?
• Architecture 1:
  • One function unit
  • All operations have single-cycle latency
Schedule (one operation per cycle):
cycle 1: *
cycle 2: *
cycle 3: +
cycle 4: +
cycle 5: -
cycle 6: +
[Figure: DDG of the example]

How to map these operations?
• Architecture 2:
  • One Add-sub and one Mul unit
  • All operations have single-cycle latency
Schedule (Mul unit | Add-sub unit):
cycle 1: * | +
cycle 2: * | +
cycle 3:   | +
cycle 4:   | -
cycles 5 and 6: unused
[Figure: DDG of the example]

How to map these operations?
• Architecture 3:
  • One Add-sub and one Mul unit
  • Add/Sub takes 1 cycle, Mul takes 2 cycles
Schedule (Mul unit | Add-sub unit):
cycle 1: * | +
cycle 2:   |
cycle 3: * | +
cycle 4:   |
cycle 5:   | +
cycle 6:   | -
[Figure: DDG of the example]
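The three mappings can be reproduced with a greedy list scheduler. This is a sketch under stated assumptions, not the method used in the course: operations are taken in program order, the temporary `t` for `2 * b` is a made-up name, and the 2-cycle multiplier is assumed to be non-pipelined (it accepts a new operation only when the previous one has finished).

```python
from collections import namedtuple

Op = namedtuple("Op", "name kind deps")

# Operation-level DDG of the example; "t" is a hypothetical temporary for 2 * b.
OPS = [
    Op("d", "mul", []),          # d = a * b
    Op("t", "mul", []),          # t = 2 * b
    Op("e", "add", ["d"]),       # e = a + d
    Op("f", "add", ["t", "d"]),  # f = t + d
    Op("r", "sub", ["f", "e"]),  # r = f - e
    Op("x", "add", []),          # x = z + y
]

# Each unit: (set of operation kinds it executes, latency in cycles).
ARCH1 = {"fu": ({"mul", "add", "sub"}, 1)}
ARCH2 = {"mul": ({"mul"}, 1), "addsub": ({"add", "sub"}, 1)}
ARCH3 = {"mul": ({"mul"}, 2), "addsub": ({"add", "sub"}, 1)}


def list_schedule(ops, units):
    """Greedy list scheduling; returns (start cycle per op, schedule length)."""
    start, finish = {}, {}
    busy_until = {unit: 0 for unit in units}
    pending = list(ops)
    cycle = 0
    while pending:
        cycle += 1
        for unit, (kinds, latency) in units.items():
            if busy_until[unit] >= cycle:
                continue  # unit still occupied by a multi-cycle operation
            for op in pending:
                # An operand is usable only in a cycle after its producer finished.
                ready = all(finish.get(dep, cycle) < cycle for dep in op.deps)
                if op.kind in kinds and ready:
                    start[op.name] = cycle
                    finish[op.name] = cycle + latency - 1
                    busy_until[unit] = finish[op.name]
                    pending.remove(op)
                    break
    return start, max(finish.values())


for name, arch in [("Architecture 1", ARCH1), ("Architecture 2", ARCH2), ("Architecture 3", ARCH3)]:
    starts, length = list_schedule(OPS, arch)
    print(name, length, starts)
```

With these assumptions the scheduler needs 6, 4 and 6 cycles for architectures 1, 2 and 3. A pipelined multiplier would shorten the third schedule, which is one reason the pipelining assumption is worth stating explicitly.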

There are many mapping solutions
[Figure: solution space plotted as execution time T versus cost; each point x is a specific architecture and code schedule; the Pareto curve bounds the points on the lower left]
Let $S$ be the solution space containing solutions $x = (x_i)$. Then:
$x$ is a Pareto point $\iff x \in S \,\wedge\, \forall y \in S,\ y \neq x:\ \exists i:\ x_i < y_i$
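Extracting the Pareto points from a set of candidate mappings takes only a few lines. The sketch below uses the standard dominance test for minimization (no worse in every objective and strictly better in at least one), which treats ties slightly differently from a strict per-coordinate comparison. The sample (cost, execution time) pairs are invented for illustration.

```python
def dominates(y, x):
    """True if y is no worse than x in every objective and better in at least one."""
    return all(yi <= xi for yi, xi in zip(y, x)) and any(yi < xi for yi, xi in zip(y, x))


def pareto_points(solutions):
    """Keep the solutions that no other solution dominates (all objectives minimized)."""
    return [x for x in solutions if not any(dominates(y, x) for y in solutions)]


# Hypothetical (cost, execution time) pairs.
points = [(1, 9), (2, 6), (3, 7), (4, 3), (6, 3), (7, 1)]
print(pareto_points(points))
```

Here (3, 7) is dropped because (2, 6) is cheaper and faster, and (6, 3) is dropped because (4, 3) is cheaper at the same speed.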

Can we do better?
Yes!!
• Much better!!
  • transforming the specification
  • a different architecture
  • a different mapping
  • speculative execution
  • …… be creative ………..

Transforming the specification (1)
Example: tree height reduction
Based on the associativity of the + operation: a + (b + c) = (a + b) + c
[Figure: a chain of three additions next to a balanced tree of three additions]
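The effect of tree height reduction can be shown by counting dependence levels. This sketch compares a left-to-right chain of additions with a balanced tree; the function names are illustrative, and exact equality of the two sums is guaranteed for integers but not for floating point, where reassociation can change rounding.

```python
def chained_sum(values):
    """((v0 + v1) + v2) + ...: every addition waits for the previous one."""
    total, depth = values[0], 0
    for v in values[1:]:
        total, depth = total + v, depth + 1
    return total, depth


def balanced_sum(values):
    """Add neighbours pairwise, level by level; each level can run in parallel."""
    level, depth = list(values), 0
    while len(level) > 1:
        level = [sum(level[i:i + 2]) for i in range(0, len(level), 2)]
        depth += 1
    return level[0], depth


print(chained_sum([1, 2, 3, 4]))   # same sum, depth 3
print(balanced_sum([1, 2, 3, 4]))  # same sum, depth 2
```

The number of additions is unchanged; only the critical path shrinks, so the gain appears only when enough adders are available to run a level in parallel.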

Transforming the specification (2)
Original:
d = a * b;
e = a + d;
f = 2 * b + d;
r = f - e;
x = z + y;
Transformed:
r = f - e = 2*b + d - (a + d) = 2*b - a;
x = z + y;
[Figure: reduced DDG: r = (b << 1) - a, x = z + y]
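That the transformed specification computes the same `r` and `x` can be checked by running both versions side by side. This is a small sketch with illustrative function names; the shift stands in for the multiplication by 2 and is valid for integers only.

```python
def original(a, b, z, y):
    d = a * b
    e = a + d
    f = 2 * b + d
    r = f - e
    x = z + y
    return r, x


def transformed(a, b, z, y):
    r = (b << 1) - a  # 2*b - a; d cancels out, the shift replaces the multiply
    x = z + y
    return r, x


print(original(3, 5, 1, 2), transformed(3, 5, 1, 2))
```

Two multiplications and two additions disappear, provided `d`, `e` and `f` are not needed elsewhere.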

Changing the architecture: adding more complex units
[Figure: three 2-input adders replaced by one 4-input adder]
Why is this faster?

Changing the architecture: adding more complex units
In the extreme case, put everything into one unit!
Spatial mapping: no control flow

More complex control flow
Program part:
-a- ;
If cond Then -b- Else -c- ;
-d- ;
[Figure: control flow graph (CFG): -a-, then the test cond?, then either -b- or -c-, then -d-]

Mapping the CFG example: 3 options. Which is best?
Option 1: -a- br c | -b- jmp d | -c- | -d-
Option 2: -a- br b | -c- jmp d | -b- | -d-
Option 3: -a- br c | -b- | -d- | -c- jmp d

Why not remove the control flow?

If-conversion shortens the schedule
Before: -a- br c | -b- jmp d | -c- | -d-
After: -a- | cond: -b- and !cond: -c- | -d-
Using guarded instructions like:
r3: add r1,r2,r5;
!r3: mul r4,r5,#3
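Guarded (predicated) execution can be mimicked with a tiny interpreter. The sketch below is an illustration only: it assumes destination-first operand order (`add r1,r2,r5` meaning r1 = r2 + r5) and that `#3` denotes an immediate, neither of which the slide spells out. Every instruction is issued in sequence, and the guard merely decides whether its result is written.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul}


def value(regs, operand):
    """Read a register, or an immediate written as '#n'."""
    return int(operand[1:]) if operand.startswith("#") else regs[operand]


def run_guarded(program, regs):
    """Execute straight-line guarded instructions; no branch is ever taken."""
    regs = dict(regs)
    for guard, op, dst, src1, src2 in program:
        negated = guard.startswith("!")
        predicate = bool(regs[guard.lstrip("!")])
        if predicate != negated:  # guard true: commit the result
            regs[dst] = OPS[op](value(regs, src1), value(regs, src2))
    return regs


PROGRAM = [
    ("r3", "add", "r1", "r2", "r5"),   # r3:  add r1,r2,r5
    ("!r3", "mul", "r4", "r5", "#3"),  # !r3: mul r4,r5,#3
]

print(run_guarded(PROGRAM, {"r1": 0, "r2": 2, "r3": 1, "r4": 0, "r5": 5}))
print(run_guarded(PROGRAM, {"r1": 0, "r2": 2, "r3": 0, "r4": 0, "r5": 5}))
```

Because the two guarded instructions do not depend on each other, hardware with two function units could issue them in the same cycle.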

Speculative execution makes it even shorter!
Before: -a- br c | -b- jmp d | -c- | -d-
After: -a- | -b- and -c- executed in parallel | -d-
Why not execute -d- in parallel as well?

However: real life is much more complex
E.g. MPEG-4: multimedia
Huge requirements:
- > 10 GOP/s
- > 6 GB/s
- > 10 MB storage
Software specification:
- more than 200 000 lines of C
- hundreds of files
- written by approx. 80 teams

Can we handle this?
Today's implementations:
- small images
- decoding only
- not real-time
- several W
- single task
- limited dynamism
Wanted features:
- large images (HDTV)
- encoding and decoding
- real-time
- 100 mW (mobile)
- multiple tasks
- dealing with dynamism

Embedded system design
How to map your application graph A(L,A,D) to the hardware graph (L,N,C):
L: design level (e.g. architecture, implementation or realization level)
A: application components (e.g. tasks, operations, data structures)
D: dependences between application components
N: hardware components (e.g. processors, ASICs, FPGAs, memories)
C: connections between hardware components

Abstraction levels
System specification level, with the level specification languages and the inter-level transformations:
Level 0: Requirements (idea). Languages: English
  is modeled by
Level 1: Architecture. Languages: ES/RT-UML, Esterel, SDL
  is implemented by
Level 2: Implementation. Languages: C++, Java, C, VHDL, SystemC
  compiles into
Level 3: Realization. Machine code, hardware modules
Each level has its own exploration search area.

Design space exploration
[Figure: a design point at level n-1 is taken by the design transformation LT(n-1,n) into an exploration search area at level n; exploration at level n evaluates cost within that area; the realization space contains the global optimum]

Design space exploration framework: another Y-chart

Design flow steps and constraints
[Figure: from the idea (high abstraction level) to the realization (low abstraction level) through refinement steps and transformations, subject to architecture / platform constraints]

In which order should we perform the steps?
[Figure: decision trees for the order step n then step n+1, compared with the order step n+1 then step n]

Well-known phase ordering examples
• Concurrency versus data management
  • e.g. loop partitioning versus array partitioning for a multiprocessor
• Scheduling versus register allocation
• Logic synthesis versus placement and routing

Rule of thumb!
• Perform the steps with the biggest impact first
• Biggest impact:
  • depends on your interest (= cost function)
  • min. E, P, E*D, D, area, Npins, ...

Phase ordering example: why fix data storage/transfer before concurrency management issues?
[Figure: image of I rows by J columns]
Recursive image processing algorithm on local neighborhoods:
(for i : 0 .. I-1 ) ::
  (for j : 0 .. J-1 ) ::
    img[i][j] = f(img[i][j-k], old_img[i][j]);
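The loop nest translates directly into runnable code. In this sketch the kernel `f` and the boundary value used when `j - k` falls outside the row are assumptions, since the slide defines neither; the function name `process` is illustrative.

```python
def process(old_img, k, f, boundary=0):
    """img[i][j] = f(img[i][j-k], old_img[i][j]), with `boundary` used where j < k."""
    rows, cols = len(old_img), len(old_img[0])
    img = [[boundary] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            previous = img[i][j - k] if j >= k else boundary
            img[i][j] = f(previous, old_img[i][j])
    return img


# Hypothetical kernel: a running sum along each row (k = 1).
print(process([[1, 2, 3], [4, 5, 6]], 1, lambda prev, old: prev + old))
```

The recurrence runs along j: each pixel needs the result computed k columns earlier in the same row, which constrains how the inner loop can be parallelized.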

Why fix data storage/transfer before concurrency management issues?
[Figure: image of I rows by J columns; chip area 14.4 mm² (0.7 µm)]
• Unrolling the outer loop (i) M times needs:
  • M J-word FIFOs (image lines)
  • M data paths
