
ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors



  1. ECE 697F Reconfigurable Computing, Lecture 19: Reconfigurable Coprocessors

  2. Overview • Focus on processor and array hybrids • Motivation • Compute models: how to fit into the computation • Examples: Garp, PRISC, Chimaera, ReMarc, OneChip • Some lecture material taken with permission from the DeHon lectures on reconfigurable computing.

  3. Motivation • Processors are efficient at sequential code and regular arithmetic operations • FPGAs are efficient at fine-grained parallelism and unusual bit-level operations • Tight coupling is important: it allows sharing of data and control • Converging technologies: SRAM is being migrated onto the same die as the processor anyway. Why not integrate?

  4. Motivation: Other Viewpoints • Replace interface glue logic • I/O pre/post-processing • Handle real-time responsiveness • Provide powerful, application-specific operations • Allow migration of function/performance over time.

  5. Compute Models • Glue logic for buses, adapters • Dedicated I/O processor • Instruction augmentation: special instructions/coprocessor ops, VLIW/microcoded extension to the processor, configurable vector unit • Autonomous coprocessor / stream processor.

  6. Interfacing • Logic replaces: ASIC customization, external FPGA/CPLD • Examples: bus protocols, peripherals, sensors, actuators • Argument: customization is needed; modern chips have the capacity; reduces part count; migrates toward system-on-a-chip; performance/power benefits.

  7. Triscend E5 Architecture

  8. Triscend E5 Architecture

  9. I/O Processor • Array dedicated to servicing an I/O channel • Sensor, LAN, WAN, peripheral • Many protocols and services • Provides protocol handling • Stream computation: compression, encryption • Effectively looks like an I/O peripheral to the processor • Don't need all functions at the same time • Offloads work from the processor.

  10. I/O Processing • Single-threaded processor created in reconfigurable logic • No support for multiple data pipes or multiple contexts • Need some minimal, local control to handle events • For performance or real-time guarantees, may need to service events rapidly • For example, checksum and acknowledge packets (see the sketch below).
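
As a concrete example of the kind of per-packet kernel such an I/O processor could absorb, here is a minimal C sketch of the standard 16-bit ones'-complement (RFC 1071 style) checksum; the function name and framing are illustrative, not from the lecture:

```c
#include <stddef.h>
#include <stdint.h>

/* 16-bit ones'-complement checksum over a packet buffer: the kind of
 * streaming, bit-level bookkeeping an I/O coprocessor can do inline
 * so the main processor never touches the raw byte stream. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    while (len > 1) {                       /* sum 16-bit words */
        sum += (uint32_t)((data[0] << 8) | data[1]);
        data += 2;
        len  -= 2;
    }
    if (len)                                /* odd trailing byte */
        sum += (uint32_t)data[0] << 8;
    while (sum >> 16)                       /* fold carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}
```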

  11. Instruction Augmentation • [Figure: bit reversal – bits a31 a30 … a0 swapped into b31 … b0] • A processor can only describe a small number of basic computations in a cycle: I instruction bits -> 2^I operations • Recall that 2^(W·2^(2W)) distinct operations could be performed on two W-bit words (each of the W output bits can independently be any Boolean function of the 2W input bits) • ALU implementations restrict execution of even some simple operations, e.g. bit reversal (see the sketch below).
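
To see why bit reversal is awkward on a conventional ALU, here is the classic log-step mask-and-shift implementation in C; it costs on the order of 15 ALU operations, while an FPGA realizes the same permutation as pure wiring:

```c
#include <stdint.h>

/* Reverse the 32 bits of x with log2(32) = 5 mask-and-shift steps.
 * Each step swaps adjacent groups of 1, 2, 4, 8, then 16 bits.
 * In an FPGA this whole operation is just routing, no logic at all. */
uint32_t bit_reverse32(uint32_t x)
{
    x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);
    x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);
    x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);
    x = ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu);
    return (x << 16) | (x >> 16);
}
```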

  12. Instruction Augmentation • Provide a way to augment the processor's instruction set for an application • Avoid mismatch between hardware and software • What's required? • Fit augmented instructions into the data and control stream • Create a functional unit for the augmented instructions • Compiler techniques to identify and use the new functional unit.

  13. Chimaera • Start from the PRISC idea • Integrate as a functional unit • No state • RFU ops (like expfu; a usage sketch follows below) • Stall the processor on an instruction miss • Additions: multiple instructions at a time, more than 2 inputs possible • Hauck: University of Washington.
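
A hedged sketch of how an expfu-style RFU op is meant to be used: the compiler collapses a short dataflow expression into a single programmable-functional-unit evaluation. The intrinsic below and the function number are illustrative stand-ins, not the actual PRISC or Chimaera toolchain interface:

```c
#include <stdint.h>

/* Hypothetical intrinsic standing in for an expfu-style instruction:
 * evaluate programmed function `pnum` on two register operands in a
 * single cycle. In reality this is an ISA extension, not a call. */
extern uint32_t expfu(unsigned pnum, uint32_t ra, uint32_t rb);

/* Before: several dependent ALU ops. */
uint32_t kernel_sw(uint32_t a, uint32_t b)
{
    return ((a ^ b) & 0x0000FFFFu) | ((a & b) << 16);
}

/* After: the same expression mapped onto one RFU evaluation.
 * Function number 3 is an arbitrary placeholder. */
uint32_t kernel_rfu(uint32_t a, uint32_t b)
{
    return expfu(3, a, b);
}
```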

  14. Chimaera Architecture • A live copy of the register file values feeds into the array • Each row of the array may compute from register values or intermediates • A tag on the array indicates an RFUOP.

  15. Chimaera Architecture • The array can operate on values as soon as they are placed in the register file • Logic is combinational • When an RFUOP matches: stall until the result is ready, then drive the result from the matching row.

  16. Chimaera Timing • [Figure: instruction sequence producing R5, R3, R2, R1 ahead of the RFUOP] • If R1 is presented last, then stall • Might be helped by instruction reordering • Physical implementation is an issue.

  17. Chimaera Results • Three SPEC92 benchmarks: compress 1.11x speedup, eqntott 1.8x, life 2.06x • Small arrays with limited state • Small speedups • Perhaps focus on global optimization rather than local optimization.

  18. Garp • Integrated as a coprocessor • Bandwidth to the processor similar to a functional unit's • Own access to memory • Supports multi-cycle operation • Allows state • Cycle counter to track operation • Configuration cache, with its own path to memory.

  19. Garp – UC Berkeley • ISA: coprocessor operations • Issue gaconfig to make a particular configuration present • Explicitly move data to/from the array • Processor is suspended during the coprocessor operation • Use the cycle counter to track progress • Array may directly access memory • Processor and array share memory • Exploits streaming data operations • Cache/MMU maintains data consistency • (A usage sketch follows below.)
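
A minimal sketch of the execution model above, written as C pseudocode. Only gaconfig, the explicit data moves, and the cycle-counter interlock come from the lecture; every helper name below is an invented stand-in for the corresponding coprocessor instruction:

```c
#include <stdint.h>

/* Hypothetical wrappers around Garp coprocessor operations.
 * `gaconfig` is named in the lecture; the move/start helpers are
 * illustrative stand-ins for the explicit register<->array moves. */
extern void     garp_gaconfig(const void *cfg);  /* make config present */
extern void     garp_put(int r, uint32_t v);     /* CPU -> array move   */
extern uint32_t garp_get(int r);                 /* array -> CPU move   */
extern void     garp_start(uint32_t cycles);     /* arm cycle counter   */

extern const void *filter_cfg;   /* configuration, assumed prebuilt */

uint32_t filter_sample(uint32_t x)
{
    garp_gaconfig(filter_cfg); /* fast if already in the config cache */
    garp_put(0, x);            /* explicit move, per slide 19 */
    garp_start(8);             /* processor suspends (interlocks) until
                                  the counter reaches zero, per slide 20 */
    return garp_get(1);
}
```

For streaming kernels the array would instead fetch its operands through its own memory path, without per-datum moves from the processor.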

  20. Garp Instructions • Interlock indicates whether the processor waits for the array to count down to zero • The last three instructions are useful for context swaps • Processor decode hardware is augmented to recognize the new instructions.

  21. Garp Array • Row-oriented logic • Dedicated paths to the processor and to memory • The processor does not have to be involved in the array-memory path.

  22. Garp Results • General results: 10-20x improvement on streaming, feed-forward operations; 2-3x when data dependencies limit pipelining • [Hauser, FCCM '97]

  23. PRISC/Chimaera vs. Garp • PRISC/Chimaera: basic op is single-cycle (expfu); no state; could have multiple PFUs; fine-grained parallelism; not effective for deep pipelines • Garp: basic op is multi-cycle (gaconfig); effective for deep pipelining; single array; requires consideration of state swapping.

  24. Common Theme • To overcome instruction-expression limits, define new array instructions; this makes the decode hardware slower / more complicated • Many bits of configuration make swap time an issue; recall the tips for dynamic reconfiguration • Give each array configuration a short "name" which the processor can call out • Store multiple configurations in the array and access them as needed (DPGA) • (A sketch of this naming scheme follows below.)
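
A sketch of the "short name" idea, assuming a DPGA-style multi-context array: full bitstreams are loaded rarely, and the per-operation cost is just writing a small context index. Everything here is an illustrative model, not any particular device's interface:

```c
#include <stdint.h>

#define NUM_CONTEXTS 4                    /* resident configurations */

/* Model of a multi-context array: configurations are stored on-chip
 * and selected by index, so the processor "calls out" a short name
 * instead of streaming configuration bits at each use. */
struct rc_array {
    const uint8_t *context[NUM_CONTEXTS]; /* stored bitstreams */
    unsigned       active;                /* currently selected slot */
};

/* Expensive and rare: many configuration bits (swap time). */
void rc_load(struct rc_array *a, unsigned slot, const uint8_t *bits)
{
    a->context[slot] = bits;
}

/* Cheap and frequent: the short "name" used per operation. */
void rc_select(struct rc_array *a, unsigned slot)
{
    a->active = slot;   /* effectively a single-cycle context switch */
}
```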

  25. ReMarc • Miyamori/Olukotun – Stanford • Array of "nano-processors" • 16-bit, 32 instructions each • VLIW-like instruction • Coprocessor interface (similar to Garp) • No direct array-to-memory access.

  26. ReMarc Architecture • 8x8 array of nanoprocessors • Reminiscent of a DPGA, except that the processing element is an ALU.

  27. Nanoprocessor Tile • Each tile has its own instruction RAM • Communication with near-neighbor tiles • A global sequencer specifies the next PC for all tiles • 16-bit output.

  28. ReMarc Results • ReMarc is roughly 60x smaller than an FPGA implementation • Performance is comparable.

  29. Observation • All of these coprocessors have been single-threaded • Performance improvement is limited by the application's parallelism • Potential for task/thread parallelism • DPGA: fast context switch • Concurrent threads, as seen in the discussion of the I/O/stream processor • The added complexity needs to be addressed in software.

  30. Scalability? • Can scale: the number of inactive contexts (similar to a cache model), and the number of PFUs in PRISC/Chimaera, though still limited by a single execution thread, which exacerbates pressure on and complexity of the reconfigurable logic/interconnect • Cannot scale: the amount of active resources • Perhaps take a coarser-grained approach to parallel processing.

  31. Parallel Computation: Processor and FPGA • What would it take to let the processor and FPGA run in parallel? • Modern processors deal with variable data delays, data dependencies, and multiple heterogeneous functional units • Via register scoreboarding and runtime dataflow (Tomasulo's algorithm) • (A toy scoreboard check follows below.)
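
A toy model of the register-scoreboarding rule named above; a real scoreboard is hardware, but the stall condition reduces to this check. All structure here is illustrative:

```c
#include <stdbool.h>

#define NUM_REGS 32

/* One pending-write bit per architectural register: set when an op
 * that writes the register issues, cleared at writeback. */
static bool pending[NUM_REGS];

/* An instruction may issue only if neither source has a write in
 * flight (RAW hazard) and its destination is not already being
 * written (WAW hazard). Otherwise the issue stage stalls. */
bool can_issue(int src1, int src2, int dst)
{
    return !pending[src1] && !pending[src2] && !pending[dst];
}

void on_issue(int dst)     { pending[dst] = true;  }
void on_writeback(int dst) { pending[dst] = false; }
```

Tomasulo-style runtime dataflow goes further, renaming registers so that only true dependencies cause waiting.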

  32. OneChip – Toronto • Allow the array to have more memory-to-memory operations • Want to fit into the programming model/ISA without forcing exclusive processor/FPGA operation • Also allow decoupled processor/array execution • Allow interlocking of data in a special "scoreboard" area.

  33. OneChip Innovations • [Figure: address-space map with regions marked 0x0, 0x1000, and 0x10000 assigned to the processor and FPGA; data pages are used like a virtual-memory system] • The FPGA operates on certain memory regions only • Makes those regions explicit to processor issue logic • Scoreboards memory blocks.

  34. OneChip • Basic op is an FPGA mem -> mem operation • No state between ops • Ops must appear sequential • Could have multiple/parallel FPGA compute units, with a scoreboard between them all • Multiprocessing? • (A sketch of the memory-block scoreboard follows below.)
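
A hedged sketch of the memory-block scoreboard: pages that a pending FPGA mem -> mem op will touch are marked busy at issue and released at completion, and processor accesses to busy pages interlock. The page size and table shape are illustrative assumptions, not OneChip's actual parameters:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12                  /* assumed 4 KB blocks */
#define NUM_PAGES  256

static bool page_busy[NUM_PAGES];      /* one busy bit per block */

static unsigned page_of(uintptr_t a) { return (unsigned)(a >> PAGE_SHIFT); }

/* Mark every page of the operand region busy when the op issues. */
void fpga_op_issue(uintptr_t base, size_t len)
{
    for (unsigned p = page_of(base); p <= page_of(base + len - 1); p++)
        page_busy[p % NUM_PAGES] = true;
}

/* Processor loads/stores consult the scoreboard; false means stall.
 * This keeps ops appearing sequential even though the processor and
 * FPGA actually run decoupled. */
bool cpu_access_ok(uintptr_t addr)
{
    return !page_busy[page_of(addr) % NUM_PAGES];
}

/* Release the region when the FPGA op retires. */
void fpga_op_retire(uintptr_t base, size_t len)
{
    for (unsigned p = page_of(base); p <= page_of(base + len - 1); p++)
        page_busy[p % NUM_PAGES] = false;
}
```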

  35. Summary • Several different models and uses for a "reconfigurable processor" • Some move toward parallel computing, others toward single processors • Exploit the density and expressiveness of fine-grained, parallel operations • There are a number of ways to integrate; need to work around the limitations.
