
Presentation Transcript


  1. Integration of High-Performance ASICs into Reconfigurable Systems Providing Additional Multimedia Functionality
  This material is based on the paper by H. Blume, H.-M. Blüthgen, C. Henning and P. Osterloh (2000).
  VLSI Algorithmic Design Automation Lab.

  2. Introduction
  Approaches: CardBus-based coprocessor board
  • Integration of an additional high-performance multimedia component into the computer system using a reconfigurable coprocessor board
  • Reconfigurable computing
    • Adaptation to a range of different applications by varying processing parameters (e.g. coefficients)
  • Dedicated ASIC providing enough computational power
    • Acceptable response times
  Constitution of the board:
  • EPLD (embedded programmable logic) or FPGA
    • Allows in-system programmability
    • Controls the functionality
  • Memory device
  • CardBus interface
    • Connected to the PCI bus, up to 132 Mbytes/s
    • Small and ideal for mobile computer systems
    • Hot plug-in: insertion into a running system
    • Dynamically reconfigurable
  • Coprocessor
    • Mounted on a socket on the board
    • Computational component such as a DSP

  3. CardBus-Based Evaluation System
  First step towards realization:
  • CardBus interface: control and data transmission
  • EPLD: controller on the coprocessor board exchanging data between the CardBus and the other on-board components
  • Configuration of the EPLD: configuration flash memory, via JTAG
  • ASIC mounted on a socket
  Next step, a dedicated coprocessor board:
  • Socket removed and the ASIC mounted directly (Ball Grid Array)
  • Flash memory replaces SDRAM
  • Hybrid reconfigurable platform: the ASIC can be used to relieve the DSPs
  • EPLD: controls the ASIC, the DSP and the on-board data flow; executes basic application-specific tasks

  4. ASICs as Highly Optimized Macros
  Histogram processor:
  • Scalable with respect to throughput rate and power consumption by a suitable choice of the number of stages and the stage sizes
  Two-dimensional transversal filter (see the sketch below):
  • Parameterizable concerning sample and coefficient wordlength and window size
  • High utilization by time-sharing
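As a point of reference for what the two-dimensional transversal filter computes (not how the ASIC implements it), here is a minimal software sketch. The 3x3 window, 8-bit samples, Q8 fixed-point coefficients and the border handling are illustrative assumptions, not parameters taken from the paper.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Minimal 2-D transversal (FIR) filter sketch.
 * Assumptions (not from the paper): 3x3 window, 8-bit samples,
 * signed Q8 fixed-point coefficients, image borders skipped,
 * result rescaled and clipped back to the sample wordlength. */
#define WIN   3
#define SHIFT 8   /* Q8 coefficients: divide the accumulator by 256 */

void fir2d(const uint8_t *in, uint8_t *out,
           size_t width, size_t height,
           int16_t coeff[WIN][WIN])
{
    for (size_t y = WIN / 2; y + WIN / 2 < height; ++y) {
        for (size_t x = WIN / 2; x + WIN / 2 < width; ++x) {
            int32_t acc = 0;
            /* multiply-accumulate over the sliding window */
            for (int j = 0; j < WIN; ++j)
                for (int i = 0; i < WIN; ++i)
                    acc += coeff[j][i] *
                           in[(y + j - WIN / 2) * width + (x + i - WIN / 2)];
            acc = acc / (1 << SHIFT);         /* rescale */
            if (acc < 0)   acc = 0;           /* clip to 8-bit sample range */
            if (acc > 255) acc = 255;
            out[y * width + x] = (uint8_t)acc;
        }
    }
}

int main(void)
{
    /* Tiny smoke test: 8x8 ramp image, 3x3 averaging kernel (coefficients sum to 256). */
    uint8_t in[8 * 8], out[8 * 8] = { 0 };
    int16_t avg[WIN][WIN] = { { 28, 28, 28 }, { 28, 32, 28 }, { 28, 28, 28 } };
    for (int k = 0; k < 64; ++k) in[k] = (uint8_t)(k * 4);
    fir2d(in, out, 8, 8, avg);
    printf("center sample: %u\n", out[4 * 8 + 4]);   /* prints 144 for this linear ramp */
    return 0;
}
```

The time-sharing mentioned on the slide would correspond, in hardware, to reusing one multiply-accumulate unit across several window taps per sample period; the software loop above makes no attempt to model that.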

  5. Performance Analysis
  Text processor ASIC:
  • Text search
  • Classical edit-distance computation (a standard software formulation is sketched below)
  • Handling of wildcards
  • Recoding of the text to handle special idiomatic properties
  • Integration of multi-token matching
  Benchmark, software and hardware:
  • 1 MByte text file, 8 search words of 8 characters each
  • General-purpose processor: UltraSPARC I, 167 MHz
  • VLIW signal processor: Philips TriMedia TM-1000, 100 MHz, instruction-level parallelism (ILP) of 3
  • Next-generation processor: TriMedia, 64 bit, 166 MHz, ILP of 5
  • PLD-based implementation of a system for searching DNA sequences in a genome database
  • Text processor ASIC
  • The ASIC: sufficient throughput and adequate flexibility
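The classical edit-distance computation mentioned above is, in software form, the standard dynamic-programming recurrence below. It is given only as a reference point; the ASIC's wildcard handling, text recoding and multi-token matching are not modeled.

```c
#include <stdio.h>
#include <string.h>

/* Classical edit distance (Levenshtein) between two words, computed
 * with plain dynamic programming. Inputs are assumed to be at most
 * MAXLEN characters long. */
#define MAXLEN 64

static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

int edit_distance(const char *s, const char *t)
{
    size_t n = strlen(s), m = strlen(t);
    static int d[MAXLEN + 1][MAXLEN + 1];
    if (n > MAXLEN || m > MAXLEN) return -1;

    for (size_t i = 0; i <= n; ++i) d[i][0] = (int)i;   /* deletions  */
    for (size_t j = 0; j <= m; ++j) d[0][j] = (int)j;   /* insertions */

    for (size_t i = 1; i <= n; ++i)
        for (size_t j = 1; j <= m; ++j) {
            int subst = (s[i - 1] != t[j - 1]);          /* 0 if equal */
            d[i][j] = min3(d[i - 1][j] + 1,              /* delete     */
                           d[i][j - 1] + 1,              /* insert     */
                           d[i - 1][j - 1] + subst);     /* substitute */
        }
    return d[n][m];
}

int main(void)
{
    printf("%d\n", edit_distance("kitten", "sitting"));  /* prints 3 */
    return 0;
}
```

Dedicated hardware usually gains its advantage by filling this table along anti-diagonals in parallel; whether the text processor ASIC uses exactly that scheme is not stated in the slides.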

  6. Partitioning Methodology for Dynamically Reconfigurable Embedded Systems
  This material is based on the paper by J. Harkin, T.M. McGinnity and L.P. Maguire, IEE Proceedings, 2000.

  7. Introduction
  Approaches to the partitioning:
  • Partitioning: allocation of the system resources
  • Hard-wired ASICs to improve implementation efficiency
  • Introduction of FPGAs to embedded systems
    • Higher levels of performance and flexibility
    • Increased computational power by customizing the reconfigurable platform
  H/W & S/W partitioning issues:
  • Automation of the approach
  • Granularity level of the partitioning
  • Flexibility of implementing different types of operations
  • Memory requirements, method of obtaining runtime execution values, profiling level
  • Target hardware
  Methodologies:
  • Partitioning the application
  • Estimating performance
  • Resource-limited embedded systems

  8. Related Work
  Strategy:
  • Column labeled "desire"
  • Codesign stage, where no H/W design has yet been performed
  • Best speedup without increasing the system resources: the use of RTR
  • Runtime reconfiguration of a non-cached FPGA

  9. Method - I
  Assumptions:
  • Target embedded system:
    • One fixed processing device (166 MHz Pentium)
    • One reconfigurable device (Xilinx XC6216)
  • The partitioning approach is only valid for H/W in which a single candidate can reside at a time (the benefit of global RTR)
  • H/W parallelism within a candidate
  • Concurrent H/W and S/W execution is not considered
  • Preemptive scheduling is not dealt with (non-reactive embedded system)

  10. Method - II
  Detection of candidates and software runtime:
  • Hardware candidates are identified at the high-level-language (C++) level
  • Performance estimation at an abstract level; C/C++-to-Verilog or -VHDL synthesis at a later date
  • Detection process: textually scanning for nested FOR and WHILE loops, a coarse approach (see the sketch below)
  • Timer: determination of the execution time in software
  Memory analysis and cost evaluation:
  • Three different memory locations; access-time overhead: main memory > local memory > stored (hard-wired) within the reconfigurable device
  • Textual scanning also counts the number of memory accesses:
    • Internal: data used exclusively within the candidate
    • External: data accessed external to the candidate
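As a rough picture of the "textually scanning" step, the sketch below flags FOR/WHILE loop headers in a C/C++ source file together with their brace-nesting depth. It is an assumption-laden stand-in for the authors' detection process: it ignores comments, strings and macros, and the heuristic of treating every loop header as a potential candidate is mine, for illustration only.

```c
#include <stdio.h>
#include <string.h>

/* Coarse textual scan of a C/C++ source file, in the spirit of the
 * candidate-detection step: report every FOR/WHILE loop header together
 * with the brace-nesting depth at which it occurs. A loop header found
 * while already inside another loop body hints at a nested loop, i.e. a
 * potential hardware candidate. Illustrative only: this is not the
 * authors' tool. */
int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s file.c\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "r");
    if (!f) {
        perror("fopen");
        return 1;
    }

    char line[512];
    int lineno = 0, depth = 0;
    while (fgets(line, sizeof line, f)) {
        ++lineno;
        if (strstr(line, "for (") || strstr(line, "while ("))
            printf("line %d (brace depth %d): loop header -> candidate?\n",
                   lineno, depth);
        for (char *p = line; *p; ++p) {            /* crude brace tracking */
            if (*p == '{') ++depth;
            else if (*p == '}' && depth > 0) --depth;
        }
    }
    fclose(f);
    return 0;
}
```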

  11. Method - III
  Hardware execution and reconfiguration time:
  • Estimated by modeling each line of code (instruction) in terms of a simple temporary parallel macro, e.g. a 32-bit full adder on the XC6216
  • Assumption: all arithmetic operations can be realized through adders and registers (CORDIC)
  Estimate of the application speedup (see the sketch below):
  • Modified version of Amdahl's speedup metric
  • Automatic local clock gating and 3 user-controlled idle modes
  • The use of global RTR
    • Reduces the memory latency: Tm setup time
    • Potential speedup: value for Tr
  • Partial reconfiguration
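For orientation, the sketch below evaluates a speedup of the general Amdahl form the slide refers to: total software time divided by the time after the candidate part is moved to hardware, with reconfiguration (Tr) and memory-setup (Tm) overheads added to the denominator. The exact modified metric from the paper is not reproduced; both the formula's shape and the numbers are assumptions for illustration.

```c
#include <stdio.h>

/* Schematic speedup estimate in the spirit of a modified Amdahl metric.
 *   t_sw   : software execution time of the candidate(s)
 *   t_hw   : estimated hardware execution time of the candidate(s)
 *   t_r    : reconfiguration time overhead (Tr)
 *   t_m    : memory setup / transfer latency (Tm)
 *   t_rest : part of the application that stays in software
 * The paper's exact metric is not reproduced; this is an assumed
 * general shape: accelerated part plus its overheads. */
static double speedup(double t_sw, double t_hw,
                      double t_r, double t_m, double t_rest)
{
    double before = t_rest + t_sw;                /* all-software time    */
    double after  = t_rest + t_hw + t_r + t_m;    /* with H/W candidate   */
    return before / after;
}

int main(void)
{
    /* Illustrative numbers only (e.g. milliseconds). */
    double s = speedup(80.0, 8.0, 4.0, 2.0, 20.0);
    printf("speedup = %.2f\n", s);   /* (20+80)/(20+8+4+2) = 2.94 */
    return 0;
}
```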

  12. Results
  The effect of global RTR:
  • Improvement of the speedup: commonality of memory data between adjacent candidates in the sequence reduces the latency
  • Best speedup: all candidates are partitioned to hardware
  • Near-optimal speedup by selecting a sequence (not all candidates)
  • Large design cost in hardware
  • Exhaustive search of all possible combinations, loosely coupled and tightly coupled (see the sketch below)
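The exhaustive search mentioned above can be pictured as a brute-force enumeration of every hardware/software assignment of the candidates. The sketch below does that for four candidates with invented timings; it is meant only to illustrate why all-in-hardware gives the best speedup while a smaller subset can be near optimal at lower design cost, not to reproduce the authors' evaluation.

```c
#include <stdio.h>

/* Brute-force enumeration of all 2^N hardware/software partitions of N
 * candidates. For each partition the speedup over the all-software case
 * is computed; partitions within 10% of the best show that a subset of
 * the candidates can be near optimal at lower hardware design cost.
 * All timings are invented for illustration. */
#define N 4

int main(void)
{
    double t_sw[N] = { 40.0, 25.0, 15.0, 10.0 };  /* software times            */
    double t_hw[N] = {  6.0,  5.0,  4.0,  8.0 };  /* hardware + overhead times */
    double t_rest  = 20.0;                        /* never moved to hardware   */

    double total_sw = t_rest;
    for (int i = 0; i < N; ++i) total_sw += t_sw[i];

    double s[1 << N];
    double best = 0.0;
    for (int mask = 0; mask < (1 << N); ++mask) {
        double t = t_rest;
        for (int i = 0; i < N; ++i)
            t += (mask & (1 << i)) ? t_hw[i] : t_sw[i];
        s[mask] = total_sw / t;
        if (s[mask] > best) best = s[mask];
    }

    for (int mask = 0; mask < (1 << N); ++mask) {
        int cost = 0;                              /* # candidates in hardware */
        for (int i = 0; i < N; ++i)
            if (mask & (1 << i)) ++cost;
        if (s[mask] >= 0.9 * best)
            printf("set 0x%x: speedup %.2f, %d candidate(s) in hardware\n",
                   mask, s[mask], cost);
    }
    return 0;
}
```

With only a handful of candidates the 2^N enumeration is cheap; the slide's distinction between loosely and tightly coupled systems is not modeled here.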

  13. Results & Conclusion
  Local RTR (partial reconfiguration):
  • Upper limit of the speedup
    • Reducing the memory latency
    • Reducing the configuration latency
  • Local RTR by exploiting the commonality among the scheduled candidates' FPGA circuit designs
  • If the overhead is reduced to zero, the upper-limit speedup is reached
  Conclusion:
  • Global RTR: exploits the functional density of the limited hardware resources
  • Local RTR: further improvement in performance by exploiting the programming of the hardware resources through local RTR
