
ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors



  1. ECE 697F Reconfigurable Computing, Lecture 19: Reconfigurable Coprocessors

  2. Overview • Focus on processor and array hybrids • Motivation • Compute models: how to fit into the computation • Examples: Garp, PRISC, Chimaera, ReMarc, OneChip • Some lecture material taken with permission from the DeHon lectures on reconfigurable computing.

  3. Motivation • Processors are efficient at sequential code and regular arithmetic operations • FPGAs are efficient at fine-grained parallelism and unusual bit-level operations • Tight coupling is important: it allows sharing of data and control • Converging technologies: SRAM is being migrated onto the same die as the processor anyway. Why not integrate?

  4. Motivation: Other Viewpoints • Replace interface glue logic • I/O pre/post-processing • Handle real-time responsiveness • Provide powerful, application-specific operations • Allow migration of function/performance over time.

  5. Compute Models • Glue logic for buses, adapters • Dedicated I/O processor • Instruction augmentation: special instructions/coprocessor ops, VLIW/microcoded extension to the processor, configurable vector unit • Autonomous coprocessor / stream processor.

  6. Interfacing • Logic replaces: ASIC customization, external FPGA/CPLD • Examples: bus protocols, peripherals, sensors, actuators • Argument: customization is needed; modern chips have the capacity; reduces part count; migrates toward system-on-a-chip; performance/power benefits.

  7. Triscend E5 Architecture

  8. Triscend E5 Architecture

  9. I/O Processor • Array dedicated to servicing an I/O channel • Sensor, LAN, WAN, peripheral • Many protocols and services • Provides protocol handling • Stream computation: compression, encryption • Effectively looks like an I/O peripheral to the processor • Don't need all functions at the same time • Offloads work from the processor.

  10. I/O Processing • Single-threaded processor created in reconfigurable logic • No support for multiple data pipes or multiple contexts • Need some minimal, local control to handle events • For performance or real-time guarantees, may need to service events rapidly • For example, checksum and acknowledge packets (see the sketch below).
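
As a concrete example of the kind of per-packet kernel such an I/O processor could absorb, here is a minimal C sketch of the standard 16-bit ones'-complement (RFC 1071 style) checksum; the function name and framing are illustrative, not from the lecture:

```c
#include <stddef.h>
#include <stdint.h>

/* 16-bit ones'-complement checksum over a packet buffer: the kind of
 * streaming, bit-level bookkeeping an I/O coprocessor can do inline
 * so the main processor never touches the raw byte stream. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    while (len > 1) {                       /* sum 16-bit words */
        sum += (uint32_t)((data[0] << 8) | data[1]);
        data += 2;
        len  -= 2;
    }
    if (len)                                /* odd trailing byte */
        sum += (uint32_t)data[0] << 8;
    while (sum >> 16)                       /* fold carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}
```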

  11. Instruction Augmentation • [Figure: bit reversal – bits a31 a30 … a0 swapped into b31 … b0] • A processor can only describe a small number of basic computations in a cycle: I instruction bits -> 2^I operations • Recall that 2^(W·2^(2W)) distinct operations could be performed on two W-bit words (each of the W output bits can independently be any Boolean function of the 2W input bits) • ALU implementations restrict execution of even some simple operations, e.g. bit reversal (see the sketch below).
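
To see why bit reversal is awkward on a conventional ALU, here is the classic log-step mask-and-shift implementation in C; it costs on the order of 15 ALU operations, while an FPGA realizes the same permutation as pure wiring:

```c
#include <stdint.h>

/* Reverse the 32 bits of x with log2(32) = 5 mask-and-shift steps.
 * Each step swaps adjacent groups of 1, 2, 4, 8, then 16 bits.
 * In an FPGA this whole operation is just routing, no logic at all. */
uint32_t bit_reverse32(uint32_t x)
{
    x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);
    x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);
    x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);
    x = ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu);
    return (x << 16) | (x >> 16);
}
```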

  12. Instruction Augmentation • Provide a way to augment the processor's instruction set for an application • Avoid mismatch between hardware and software • What's required? • Fit augmented instructions into the data and control stream • Create a functional unit for the augmented instructions • Compiler techniques to identify and use the new functional unit.

  13. Chimaera • Start from the PRISC idea • Integrate as a functional unit • No state • RFU ops (like expfu; a usage sketch follows below) • Stall the processor on an instruction miss • Additions: multiple instructions at a time, more than 2 inputs possible • Hauck: University of Washington.
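
A hedged sketch of how an expfu-style RFU op is meant to be used: the compiler collapses a short dataflow expression into a single programmable-functional-unit evaluation. The intrinsic below and the function number are illustrative stand-ins, not the actual PRISC or Chimaera toolchain interface:

```c
#include <stdint.h>

/* Hypothetical intrinsic standing in for an expfu-style instruction:
 * evaluate programmed function `pnum` on two register operands in a
 * single cycle. In reality this is an ISA extension, not a call. */
extern uint32_t expfu(unsigned pnum, uint32_t ra, uint32_t rb);

/* Before: several dependent ALU ops. */
uint32_t kernel_sw(uint32_t a, uint32_t b)
{
    return ((a ^ b) & 0x0000FFFFu) | ((a & b) << 16);
}

/* After: the same expression mapped onto one RFU evaluation.
 * Function number 3 is an arbitrary placeholder. */
uint32_t kernel_rfu(uint32_t a, uint32_t b)
{
    return expfu(3, a, b);
}
```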

  14. Chimaera Architecture • A live copy of the register file values feeds into the array • Each row of the array may compute from register values or intermediates • A tag on the array indicates an RFUOP.

  15. Chimaera Architecture • The array can operate on values as soon as they are placed in the register file • Logic is combinational • When an RFUOP matches: stall until the result is ready, then drive the result from the matching row.

  16. Chimaera Timing • [Figure: instruction sequence producing R5, R3, R2, R1 ahead of the RFUOP] • If R1 is presented last, then stall • Might be helped by instruction reordering • Physical implementation is an issue.

  17. Chimaera Results • Three SPEC92 benchmarks: compress 1.11x speedup, eqntott 1.8x, life 2.06x • Small arrays with limited state • Small speedups • Perhaps focus on global optimization rather than local optimization.

  18. Garp • Integrated as a coprocessor • Bandwidth to the processor similar to a functional unit's • Own access to memory • Supports multi-cycle operation • Allows state • Cycle counter to track operation • Configuration cache, with its own path to memory.

  19. Garp – UC Berkeley • ISA: coprocessor operations • Issue gaconfig to make a particular configuration present • Explicitly move data to/from the array • Processor is suspended during the coprocessor operation • Use the cycle counter to track progress • Array may directly access memory • Processor and array share memory • Exploits streaming data operations • Cache/MMU maintains data consistency • (A usage sketch follows below.)
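
A minimal sketch of the execution model above, written as C pseudocode. Only gaconfig, the explicit data moves, and the cycle-counter interlock come from the lecture; every helper name below is an invented stand-in for the corresponding coprocessor instruction:

```c
#include <stdint.h>

/* Hypothetical wrappers around Garp coprocessor operations.
 * `gaconfig` is named in the lecture; the move/start helpers are
 * illustrative stand-ins for the explicit register<->array moves. */
extern void     garp_gaconfig(const void *cfg);  /* make config present */
extern void     garp_put(int r, uint32_t v);     /* CPU -> array move   */
extern uint32_t garp_get(int r);                 /* array -> CPU move   */
extern void     garp_start(uint32_t cycles);     /* arm cycle counter   */

extern const void *filter_cfg;   /* configuration, assumed prebuilt */

uint32_t filter_sample(uint32_t x)
{
    garp_gaconfig(filter_cfg); /* fast if already in the config cache */
    garp_put(0, x);            /* explicit move, per slide 19 */
    garp_start(8);             /* processor suspends (interlocks) until
                                  the counter reaches zero, per slide 20 */
    return garp_get(1);
}
```

For streaming kernels the array would instead fetch its operands through its own memory path, without per-datum moves from the processor.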

  20. Garp Instructions • Interlock indicates whether the processor waits for the array to count down to zero • The last three instructions are useful for context swaps • Processor decode hardware is augmented to recognize the new instructions.

  21. Garp Array • Row-oriented logic • Dedicated paths to the processor and to memory • The processor does not have to be involved in the array-memory path.

  22. Garp Results • General results: 10-20x improvement on streaming, feed-forward operations; 2-3x when data dependencies limit pipelining • [Hauser, FCCM '97]

  23. PRISC/Chimaera vs. Garp • PRISC/Chimaera: basic op is single-cycle (expfu); no state; could have multiple PFUs; fine-grained parallelism; not effective for deep pipelines • Garp: basic op is multi-cycle (gaconfig); effective for deep pipelining; single array; requires consideration of state swapping.

  24. Common Theme • To overcome instruction-expression limits, define new array instructions; this makes the decode hardware slower / more complicated • Many bits of configuration make swap time an issue; recall the tips for dynamic reconfiguration • Give each array configuration a short "name" which the processor can call out • Store multiple configurations in the array and access them as needed (DPGA) • (A sketch of this naming scheme follows below.)
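
A sketch of the "short name" idea, assuming a DPGA-style multi-context array: full bitstreams are loaded rarely, and the per-operation cost is just writing a small context index. Everything here is an illustrative model, not any particular device's interface:

```c
#include <stdint.h>

#define NUM_CONTEXTS 4                    /* resident configurations */

/* Model of a multi-context array: configurations are stored on-chip
 * and selected by index, so the processor "calls out" a short name
 * instead of streaming configuration bits at each use. */
struct rc_array {
    const uint8_t *context[NUM_CONTEXTS]; /* stored bitstreams */
    unsigned       active;                /* currently selected slot */
};

/* Expensive and rare: many configuration bits (swap time). */
void rc_load(struct rc_array *a, unsigned slot, const uint8_t *bits)
{
    a->context[slot] = bits;
}

/* Cheap and frequent: the short "name" used per operation. */
void rc_select(struct rc_array *a, unsigned slot)
{
    a->active = slot;   /* effectively a single-cycle context switch */
}
```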

  25. ReMarc • Miyamori/Olukotun – Stanford • Array of "nano-processors" • 16-bit, 32 instructions each • VLIW-like instruction • Coprocessor interface (similar to Garp) • No direct array-to-memory access.

  26. ReMarc Architecture • 8x8 array of nanoprocessors • Reminiscent of a DPGA, except that the processing element is an ALU.

  27. Nanoprocessor Tile • Each tile has its own instruction RAM • Communication with near-neighbor tiles • A global sequencer specifies the next PC for all tiles • 16-bit output.

  28. ReMarc Results • ReMarc is roughly 60x smaller than an FPGA implementation • Performance is comparable.

  29. Observation • All of these coprocessors have been single-threaded • Performance improvement is limited by the application's parallelism • Potential for task/thread parallelism • DPGA: fast context switch • Concurrent threads, as seen in the discussion of the I/O/stream processor • The added complexity needs to be addressed in software.

  30. Scalability? • Can scale: the number of inactive contexts (similar to a cache model), and the number of PFUs in PRISC/Chimaera, though still limited by a single execution thread, which exacerbates pressure on and complexity of the reconfigurable logic/interconnect • Cannot scale: the amount of active resources • Perhaps take a coarser-grained approach to parallel processing.

  31. Parallel Computation: Processor and FPGA • What would it take to let the processor and FPGA run in parallel? • Modern processors deal with variable data delays, data dependencies, and multiple heterogeneous functional units • Via register scoreboarding and runtime dataflow (Tomasulo's algorithm) • (A toy scoreboard check follows below.)
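
A toy model of the register-scoreboarding rule named above; a real scoreboard is hardware, but the stall condition reduces to this check. All structure here is illustrative:

```c
#include <stdbool.h>

#define NUM_REGS 32

/* One pending-write bit per architectural register: set when an op
 * that writes the register issues, cleared at writeback. */
static bool pending[NUM_REGS];

/* An instruction may issue only if neither source has a write in
 * flight (RAW hazard) and its destination is not already being
 * written (WAW hazard). Otherwise the issue stage stalls. */
bool can_issue(int src1, int src2, int dst)
{
    return !pending[src1] && !pending[src2] && !pending[dst];
}

void on_issue(int dst)     { pending[dst] = true;  }
void on_writeback(int dst) { pending[dst] = false; }
```

Tomasulo-style runtime dataflow goes further, renaming registers so that only true dependencies cause waiting.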

  32. OneChip – Toronto • Allow the array to have more memory-to-memory operations • Want to fit into the programming model/ISA without forcing exclusive processor/FPGA operation • Also allow decoupled processor/array execution • Allow interlocking of data in a special "scoreboard" area.

  33. OneChip Innovations • [Figure: address-space map with regions marked 0x0, 0x1000, and 0x10000 assigned to the processor and FPGA; data pages are used like a virtual-memory system] • The FPGA operates on certain memory regions only • Makes those regions explicit to processor issue logic • Scoreboards memory blocks.

  34. OneChip • Basic op is an FPGA mem -> mem operation • No state between ops • Ops must appear sequential • Could have multiple/parallel FPGA compute units, with a scoreboard between them all • Multiprocessing? • (A sketch of the memory-block scoreboard follows below.)
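
A hedged sketch of the memory-block scoreboard: pages that a pending FPGA mem -> mem op will touch are marked busy at issue and released at completion, and processor accesses to busy pages interlock. The page size and table shape are illustrative assumptions, not OneChip's actual parameters:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12                  /* assumed 4 KB blocks */
#define NUM_PAGES  256

static bool page_busy[NUM_PAGES];      /* one busy bit per block */

static unsigned page_of(uintptr_t a) { return (unsigned)(a >> PAGE_SHIFT); }

/* Mark every page of the operand region busy when the op issues. */
void fpga_op_issue(uintptr_t base, size_t len)
{
    for (unsigned p = page_of(base); p <= page_of(base + len - 1); p++)
        page_busy[p % NUM_PAGES] = true;
}

/* Processor loads/stores consult the scoreboard; false means stall.
 * This keeps ops appearing sequential even though the processor and
 * FPGA actually run decoupled. */
bool cpu_access_ok(uintptr_t addr)
{
    return !page_busy[page_of(addr) % NUM_PAGES];
}

/* Release the region when the FPGA op retires. */
void fpga_op_retire(uintptr_t base, size_t len)
{
    for (unsigned p = page_of(base); p <= page_of(base + len - 1); p++)
        page_busy[p % NUM_PAGES] = false;
}
```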

  35. Summary • Several different models and uses for a "reconfigurable processor" • Some move toward parallel computing, others toward single processors • Exploit the density and expressiveness of fine-grained, parallel operations • There are a number of ways to integrate; need to work around the limitations.
