Computing System Element Choices

Figure: Computing system element choices. Elements shown:
• General Purpose Processors (GPPs): superscalar, VLIW
• Application Specific Processors: DSPs, network processors, graphics processors, …..
• Re-configurable hardware
• Co-processors
• ASICs
Axes: programmability/flexibility is highest for GPPs; specialization, development cost/time, and performance/chip area/watt (computational efficiency) are highest for ASICs.
Reconfigurable computing, also known as Custom Computing Machines (CCMs), utilizes hardware devices customized to match the computation, using FPGAs (fine grain) or micro-coded arrays of simple processors (coarse grain).

Spatial vs. Temporal Computing
• Spatial (using hardware): computation defined by the fixed functionality and connectivity of hardware elements.
• Temporal (using software/program): a processor running programs (processor instructions) written using a pre-defined, fixed set of instructions (ISA).

Computing Element Programmability: Defining Terms
Fixed Function:
• Computes one function (e.g. FP-multiply, divider, DCT)
• Function defined at fabrication time
• e.g. ASICs
Programmable:
• Computes “any” computable function (e.g. processors, DSPs, FPGAs)
• Function defined after fabrication
Parameterizable Hardware:
• Performs a limited “set” of functions
• e.g. co-processors

Computing Element Choices: Observation
• Generality and efficiency are in some sense inversely related to one another:
- The more general-purpose a computing element is, and thus the greater the number of tasks it can perform, the less efficient it will be in performing any one of those tasks.
- Design decisions are therefore almost always compromises; designers identify key features or requirements of applications that must be met and make compromises on other, less important features.
• To counter the problem of computationally intense problems for which general-purpose machines cannot achieve the necessary performance (due to their fixed ISA):
- Special-purpose processors, attached processors, and coprocessors have been built for many years, especially in such areas as image or signal processing (for which many of the computational tasks can be very well defined).
- The problem with such machines is that they are special-purpose; as problems change or new ideas and techniques develop, their lack of flexibility makes them problematic as long-term solutions.
• Reconfigurable computing or Custom Computing Machines (CCMs) using FPGAs (Field Programmable Gate Arrays, first introduced in 1986 by Xilinx) or other reconfigurable (customizable) hardware can offer an attractive alternative to other computing element choices.
- FPGAs were originally developed for hardware design verification, rapid prototyping, and potential ASIC replacement.

What is Reconfigurable Computing?
• Utilize reconfigurable hardware devices (spatially-programmed connections of hardware processing elements) tailored to the application:
- Customizing hardware to match the computations needed/present in a particular application by changing hardware functionality on the fly.
• Reconfigurable Computing Goal: Using reconfigurable hardware devices to build systems with advantages over conventional computing solutions in terms of:
- Flexibility
- Performance and power (computational efficiency)
- Time-to-market
- Life cycle cost
• “Hardware” customized to the specifics of the problem.
• Direct map of problem-specific dataflow and control.
• Circuits “adapted” as problem requirements change.
Hardware customization/reconfigurability, how? Change both the function of hardware cells (elements) and their connectivity to match the requirements of the computation/application. This is still spatial computing, but both the functionality and the connectivity of hardware elements are not fixed.

Conventional Programmable Processors vs. Configurable Devices
Conventional Programmable Processors:
• Moderately wide datapaths, which have been growing wider over time (e.g. 16, 32, 64, 128 bits).
• Support for large on-chip instruction caches, which have also been growing larger over time and can now hold thousands of instructions.
• High-bandwidth instruction distribution, so that several instructions may be issued per cycle, at the cost of dedicating considerable die area to instruction fetch/distribution/issue/scheduling.
• A single thread of computation control per processor core (SMT changes this).
Configurable devices (such as FPGAs):
• Narrow datapath (almost always one bit).
• On-chip space for only one instruction per compute element, i.e. the single instruction which tells the FPGA array cell (Configurable Logic Block, CLB) what function to perform and how to route its inputs and outputs (connectivity to other cells).
• Minimal die area dedicated to instruction distribution, such that it takes hundreds of thousands of compute cycles to change the active set of array instructions (e.g. from one FPGA configuration to another).
• Can handle regular and bit-level computations more efficiently than processors.

Why Reconfigurable Computing? • To improve performance (including predictability) and computational energy efficiency over a software implementation. • e.g. signal processing applications in configurable hardware. • Provide powerful, application-specific operations. • To improve product flexibility and development cost/time compared to hardware (ASIC) • e.g. encryption, compression or network protocols handling in configurable hardware • To use the same hardware for different purposes at different points in the computation (lowers cost). • Given sufficient use of each configuration to tolerate potentially long reconfiguration latency/overheads
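The last point above (enough use of each configuration to tolerate reconfiguration latency) can be made concrete with a small break-even calculation. This is a minimal sketch under a simple assumed cost model: one reconfiguration, followed by n_uses kernel invocations. The function names and all timing values are hypothetical and not taken from the slides.

```python
def effective_speedup(n_uses, t_sw, t_hw, t_reconfig):
    """Overall speedup of configurable hardware over software when one
    configuration is loaded once and then used n_uses times."""
    total_sw = n_uses * t_sw
    total_hw = t_reconfig + n_uses * t_hw   # pay the reconfiguration cost once
    return total_sw / total_hw


def break_even_uses(t_sw, t_hw, t_reconfig):
    """Smallest number of uses for which the hardware is strictly faster overall."""
    if t_hw >= t_sw:
        raise ValueError("hardware kernel must be faster than software per use")
    # Need n * t_sw > t_reconfig + n * t_hw, i.e. n > t_reconfig / (t_sw - t_hw)
    return int(t_reconfig // (t_sw - t_hw)) + 1


# Hypothetical numbers (same time unit throughout): software 100, hardware 10,
# reconfiguration 9000.
print(break_even_uses(100, 10, 9000))
print(effective_speedup(1000, 100, 10, 9000))
```

With these assumed numbers the per-use speedup is 10x, but the overall speedup only approaches that figure as the number of uses per configuration grows.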

Benefits of Reconfigurable Logic Devices
• Non-permanent customization and application development after fabrication
- “Late binding”
• Economies of scale (amortize large, fixed design costs)
• Shorter time-to-market than ASICs (dealing with evolving requirements and standards, new ideas)
Customization is achieved by changing both the function of hardware elements and their connectivity to match the requirements of the application.
Potential Disadvantages:
• Efficiency penalty (area, performance, power) compared to ASICs.
• Need for correctness verification (common to all hardware-based solutions).

Spatial/Configurable Hardware Benefits/Drawbacks
Spatial/Configurable Benefits:
• Potentially, an order of magnitude (10x) or higher raw computational density advantage over processors.
• Potential for fine-grained (bit-level) control/parallelism: can offer another order of magnitude benefit.
• Locality.
Spatial/Configurable Drawbacks:
• Each compute/interconnect resource is dedicated to a single function.
• Must dedicate resources for every computational subtask.
• Infrequently needed portions of a computation sit idle => inefficient use of resources (but much better than processors).

Configurable Computing Application Areas
In general, many types of applications with a few computationally intensive “kernels” (inner loops?) that can be done more efficiently in hardware:
• Digital signal processing
• Encryption
• Image processing
• Telemetry data processing (remote sensing)
• Data/image/video compression/decompression
• Low-power (through hardware "sharing")
• Scientific/engineering physical system modeling (e.g. finite-element computations)
• Network applications (e.g. reconfigurable routers)
• Variable precision arithmetic
• Logic-intensive applications
• In-the-field hardware enhancements
• Adaptive (learning) hardware elements
• Rapid system prototyping (original application of FPGAs)
• Verification of processor and ASIC designs (original application of FPGAs)
• …...

Technology Trends Driving Configurable Computing • Increasing gap between "peak" performance of general-purpose processors and "average actually achieved" performance. • Most programmers don't write code that gets anywhere near the peak performance of current superscalar CPUs • Improvements in FPGA hardware: capacity and speed: • FPGAs use standard SRAM processes and "ride the commodity technology" curve (e.g. VLSI technology) • Volume pricing even though customized solution • Improvements in synthesis and FPGA mapping/routing software • Increasing number of transistors on a (processor) chip (one billion+): How to use them efficiently? • Bigger caches (Most popular)? • Multiple processor cores? (Chip Multiprocessors - CMPs) • SMT support? • IRAM-style vector/memory? • DSP cores or other application specific processors? • Reconfigurable logic (FPGA or other reconfigurable logic)? A Combination of the above choices? Heterogeneous Computing System on a Chip?

Configurable Computing Architectures
• Configurable computing architectures combine elements of general-purpose computing and application-specific integrated circuits (ASICs).
• The general-purpose processor operates with fixed circuits that perform multiple tasks under the control of software.
• An ASIC contains circuits specialized to a particular task and thus needs little or no software to instruct it.
• The configurable computer can execute software commands that alter its configurable devices (e.g. FPGA circuits) as needed to perform a variety of jobs, i.e. to change both the functionality and the connectivity of hardware elements (cells).

Levels of the Reconfigurable Computational Elements (according to grain size)
Figure: levels listed from finer grain to coarser grain:
• Reconfigurable Logic (e.g. FPGAs): bit-level operations, e.g. encoding
• Reconfigurable Datapaths: dedicated data paths, e.g. filters, AGU
• Reconfigurable Arithmetic: arithmetic kernels, e.g. convolution
• Reconfigurable Control: Real-Time Operating Systems (RTOS), process management
• Configurable Processors

Hybrid-Architecture Computer
• Combines general-purpose processors (GPPs) and reconfigurable devices, commonly:
- FPGA chips (fine-grain reconfigurable hardware), or
- Micro-coded arrays of simple processors (coarse-grain reconfigurable hardware).
• A controller FPGA may load circuit configurations stored in memory onto the processor FPGA in response to the requests of the operating program.
• If the memory does not contain a requested circuit, the processor FPGA sends a request to the PC host, which then loads the configuration for the desired circuit.
• Common Hybrid Configurable Architecture Today:
- One or more FPGAs on a board connected to the host via an I/O bus (e.g. PCI)
• Possible Future Hybrid Configurable Architecture:
- Integrate a region of configurable hardware (FPGA or something else) onto the processor chip itself as reconfigurable functional units or coprocessors
- Integrate configurable hardware onto a DRAM chip => flexible computing without the memory bottleneck
• Current Hybrid-Architecture on a chip: Hybrid FPGAs
- Integrate one or more hard-wired GPPs with an FPGA on the same chip
- Example: Xilinx Virtex-II Pro, Virtex-4 FX (FPGA with one or two PowerPC cores)

Hybrid-Reconfigurable Computer: Levels of Coupling
Figure: different levels of coupling in a hybrid reconfigurable system (reconfigurable logic is shaded), from loose coupling (function calls) to tight coupling (ISA support):
• External standalone processing unit (e.g. via network/IO interface)
• Attached reconfigurable processing unit (e.g. via PCI); most common today
• Reconfigurable coprocessor (on or off chip)
• Reconfigurable functional units (on chip)
Tight coupling is labeled as the future direction.

Sample Configurable Computing Application: Prototype Video Communications System
• Uses a single FPGA to perform four functions that typically require separate chips.
• A memory chip stores the four circuit configurations and loads them sequentially into the FPGA.
• Initially, the FPGA's circuits are configured to acquire digitized video data.
• The chip is then rapidly reconfigured to transform the video information into a compressed form and reconfigured again to prepare it for transmission.
• Finally, the FPGA circuits are reconfigured to modulate and transmit the video information.
• At the receiver, the four configurations are applied in reverse order to demodulate the data, decompress the image, and then send it to a digital-to-analog converter so it can be displayed on a television screen.

Early Configurable (or Custom) Computing Successes
• DEC Programmable Active Memories, PAM (1992):
- A universal FPGA-based hardware co-processor, closely coupled to a standard host computer, developed at DEC's Paris Research Laboratory
- Fast RSA decryption implementation on a reconfigurable machine (10x faster than the fastest ASIC at the time)
• Splash 2 (1993):
- Attached processor system using Xilinx FPGAs as processing elements, developed at the Center for Computing Sciences
- Performs DNA sequence matching at 300x the speed of a Cray-2 and 200x the speed of a 16K Thinking Machines CM-2
- (More on Splash 2 in lecture handout)
• Many modern processors and ASICs are verified using FPGA emulation systems.
• For many digital signal processing/filtering algorithms (e.g. FIR, IIR), single-chip FPGAs outperform DSPs by 10-100x.

Fine-grain Reconfigurable Hardware Devices: Programmable Circuitry: FPGAs
• Field-Programmable Gate Array (FPGA) introduced by Xilinx (1986).
• Original target applications: hardware design verification, rapid prototyping, and potential ASIC replacement.
• Programmable circuits can be created or removed by sending signals to gates in the logic elements (configuration bit stream).
• A built-in grid of circuits arranged in columns and rows allows the designer to connect a logic element to other logic elements or to an external memory or microprocessor.
• The logic elements are grouped in Configurable Logic Blocks (CLBs) that perform basic binary operations such as AND, OR and NOT.
• Firms, including Xilinx and Altera, have developed devices with the capability of 4,000,000 or more equivalent gates.
• Recently, in addition to “general-purpose” or generic FPGAs, more specialized FPGA families targeting specific areas such as DSP applications have been developed with hard-wired functional units (e.g. MAC units).

Fine-grain Reconfigurable Hardware Devices: Field Programmable Gate Arrays (FPGAs)
• Chip contains many small building blocks that can be configured to implement different functions.
- These building blocks are known as CLBs (Configurable Logic Blocks).
• FPGAs are typically "programmed" by having them read in a stream of configuration information from off-chip.
- Typically in-circuit programmable (as opposed to EPLDs, Electrically Programmable Logic Devices, which are typically programmed by removing them from the circuit and using a PROM programmer).
• About 25% of an FPGA's gates are application-usable.
- The rest control the configurability, interconnects, etc.
• As much as 10x clock rate degradation compared to fully custom hardware implementations (ASICs).
• Typically built using SRAM fabrication technology.
• Since FPGAs "act" like SRAM or logic, they lose their program when they lose power.
• Configuration bits need to be reloaded on power-up.
- Usually reloaded from a PROM, or downloaded from memory via an I/O bus.

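Fine-grain Reconfigurable Hardware Devices: FPGAs
Look-Up Table (LUT)
• K-LUT: K-input lookup table
• Implements any function of K inputs by programming the table
Figure: a 2-LUT drawn as a small memory (Mem) addressed by inputs In1 and In2 with output Out, programmed with the table below; further panels show 2-LUTs and a 4-LUT.
In   Out
00   0
01   1
10   1
11   0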
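The lookup-table idea can be sketched in a few lines of Python. This is only an illustrative software model, not vendor code: the class name LUT and the most-significant-bit-first addressing are assumptions made for the example. The xor2 table is the one shown on the slide (00→0, 01→1, 10→1, 11→0).

```python
class LUT:
    """K-input lookup table: a 2**K-entry memory addressed by the input bits."""

    def __init__(self, k, table):
        if len(table) != 2 ** k:
            raise ValueError("a K-LUT needs exactly 2**K table entries")
        self.k = k
        self.table = list(table)

    def __call__(self, *inputs):
        if len(inputs) != self.k:
            raise ValueError("wrong number of inputs")
        address = 0
        for bit in inputs:              # first input is the most significant address bit
            address = (address << 1) | (bit & 1)
        return self.table[address]


xor2 = LUT(2, [0, 1, 1, 0])   # table from the slide
and2 = LUT(2, [0, 0, 0, 1])   # same structure, different table contents

print([xor2(a, b) for a in (0, 1) for b in (0, 1)])
print([and2(a, b) for a in (0, 1) for b in (0, 1)])
```

The point of the sketch is that changing the function means changing only the stored table, not the structure that reads it.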

Fine-grain Reconfigurable Hardware Devices: FPGAs
Conventional FPGA Tile
Figure: one FPGA tile. The logic part is a K-LUT (typical K = 4, i.e. a 4-LUT) with an optional output flip-flop, also called a Configurable Logic Block (CLB): ~25% of FPGA area. The rest of the tile: ~75% of FPGA area.

Fine-grain Reconfigurable Hardware Devices: FPGAs
A Generic Island-style FPGA Routing Architecture
Figure: an array of 64 CLBs (8x8), with one tile and one CLB marked.
Customization is achieved by changing both the function of hardware elements (CLBs here) and their connectivity to match the requirements of the application.

Fine-grain Reconfigurable Hardware Devices: FPGAs
Xilinx XC4000 Interconnect
Figure: XC4000 interconnect.
Customization is achieved by changing both the function of hardware elements (CLBs here) and their connectivity to match the requirements of the application.

Fine-grain Reconfigurable Hardware Devices: FPGAs
Xilinx XC4000 Configurable Logic Block (CLB)
Figure: cascaded LUTs (two 4-LUTs feeding one 3-LUT).
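The cascade of two 4-LUTs into one 3-LUT can be modeled as function composition. The sketch below is a simplified software model, not a description of the actual XC4000 CLB: it assumes the 3-LUT takes the two 4-LUT outputs plus one extra input (called h1 here), and it ignores flip-flops, carry logic and input multiplexers. All names are hypothetical. Programmed with parity tables, the cascade computes a 9-input parity.

```python
def lut(table):
    """Return a function that looks up its input bits in `table` (MSB-first address)."""
    def evaluate(*bits):
        address = 0
        for b in bits:
            address = (address << 1) | (b & 1)
        return table[address]
    return evaluate


def parity_table(k):
    """Table contents that make a k-input LUT compute odd parity (XOR of all inputs)."""
    return [bin(address).count("1") & 1 for address in range(2 ** k)]


f_lut = lut(parity_table(4))   # first 4-LUT
g_lut = lut(parity_table(4))   # second 4-LUT
h_lut = lut(parity_table(3))   # 3-LUT combining both outputs with one extra input


def clb(f_inputs, g_inputs, h1):
    """Two 4-LUTs cascaded into one 3-LUT: here, a 9-input parity function."""
    return h_lut(f_lut(*f_inputs), g_lut(*g_inputs), h1)


print(clb((1, 0, 0, 0), (0, 0, 0, 0), 0))
print(clb((1, 1, 0, 0), (1, 0, 0, 0), 1))
```

Not every 9-input function fits this structure; parity is used because it decomposes naturally into smaller XORs.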

Fine-grain Reconfigurable Hardware Devices: FPGAs
FPGAs vs. RISC Processors: Computational Density Comparison
Figure: computational density of FPGAs and RISC processors, with a 10X gap marked between the two groups.

Fine-grain Reconfigurable Hardware Devices: FPGAs
Processor vs. FPGA Area
Figure: area comparison of an FPGA and a processor.

Programming/Configuring FPGAs
• (1) Hardware Design Specification: A hardware design to realize the selected hardware-bound, computationally intensive portion of the application (the result of hardware-software partitioning, i.e. co-design) is specified using RTL/HDL/logic diagrams.
• Synthesis & Layout: Vendor-supplied, device-specific software tools are used to convert the hardware design to netlist format, and then to:
- (2) Partition the design into logic blocks (CLBs): LUT mapping.
- (3) Find a good placement for each block.
- (4) Route the connections between blocks.
- (5) Generate the serial configuration bitstream, which is fed down to the FPGAs themselves.
• The configuration bits are loaded into a "long shift register" on the FPGA.
• The output lines from this shift register are control wires that control the behavior of all the CLBs on the chip.
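The "long shift register" loading step can be illustrated with a toy model. This is a sketch only: real devices use vendor-specific frame formats, and the register layout, the function names and the choice of 2-input LUTs are assumptions made for the example. Bits enter at position 0 and move one place per clock, so the first bit sent ends up at the far end of the register.

```python
def shift_in(bitstream, length):
    """Shift a serial bitstream into a `length`-bit register, one bit per clock."""
    register = [0] * length
    for bit in bitstream:
        register = [bit] + register[:-1]   # new bit enters at position 0
    return register


def split_into_lut_tables(register, k):
    """Interpret consecutive groups of 2**k register outputs as K-LUT tables."""
    size = 2 ** k
    return [register[i:i + size] for i in range(0, len(register), size)]


# Desired configuration: LUT 0 = XOR table, LUT 1 = AND table (2-input LUTs).
desired = [0, 1, 1, 0] + [0, 0, 0, 1]
bitstream = desired[::-1]                  # last register bit is sent first
register = shift_in(bitstream, len(desired))
print(split_into_lut_tables(register, 2))
```

Once all bits are shifted in, every register output drives one configuration point at the same time, which is why the whole array changes behavior together.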

Programming/Configuring FPGAs
Figure: tool flow from hardware design to configuration data:
(1) Hardware design (RTL) → technology-independent optimization → (2) partition the design into CLBs (LUT mapping) → (3) placement for each CLB → (4) routing between CLBs → (5) configuration bitstream generation → configuration data

Reconfigurable Processor Tools Flow (Hardware/Software Co-design Process Flow)
Figure: the customer application / IP (C code) is split into two portions:
• Portion to be done in reconfigurable hardware (e.g. FPGA): (1) hardware design specification (RTL HDL) → synthesis & layout: (2) partitioning, (3) placement, (4) routing, (5) configuration bitstream generation → configuration bits
• Portion to be done in software: C compiler → ARC object code
• Linker → executable
• Targets and tools shown: development board, C model simulator, C debugger (hybrid system)

Programming/Configuring FPGAs
Starting Point: (1) Hardware Design Specification (RTL/HDL/logic diagrams)
RTL:
t = A + B
Reg(t, C, clk);
Logic:
$O_i = A_i \oplus B_i \oplus C_i$
$C_{i+1} = A_i B_i \lor B_i C_i \lor A_i C_i$
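The two logic equations on this slide are the sum and carry of a one-bit full adder; chaining them bit by bit gives the adder behind t = A + B. The sketch below is a plain software check of those equations, with hypothetical function names, not a hardware description.

```python
def full_adder(a, b, c):
    """One bit position: O_i = A_i xor B_i xor C_i, C_{i+1} = majority(A_i, B_i, C_i)."""
    o = a ^ b ^ c
    c_next = (a & b) | (b & c) | (a & c)
    return o, c_next


def ripple_add(a, b, width):
    """Add two `width`-bit unsigned numbers by chaining full adders, LSB first."""
    carry = 0
    total = 0
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        o, carry = full_adder(ai, bi, carry)
        total |= o << i
    return total, carry          # sum bits and final carry-out


print(ripple_add(5, 3, 4))
print(ripple_add(15, 1, 4))
```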

Programming/Configuring FPGAs (2) Partition the design into logic blocks (CLBs) : LUT Mapping
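One small piece of LUT mapping is turning a Boolean function into LUT contents. The sketch below tabulates the full-adder sum and carry functions into 8-entry tables, as would fit a 3-input LUT (or a 4-LUT with one input unused). It deliberately skips the harder part that real mapping tools perform, namely deciding how to cover a large netlist with K-input pieces; the helper name to_lut_table is hypothetical.

```python
from itertools import product


def to_lut_table(func, k):
    """Tabulate a k-input Boolean function into 2**k LUT configuration bits.

    Entries are ordered by input address 00..0 to 11..1, first input most significant.
    """
    return [func(*bits) & 1 for bits in product((0, 1), repeat=k)]


sum_table = to_lut_table(lambda a, b, c: a ^ b ^ c, 3)
carry_table = to_lut_table(lambda a, b, c: (a & b) | (b & c) | (a & c), 3)

print(sum_table)
print(carry_table)
```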

Programming/Configuring FPGAs
(3) Placement of CLBs
• Maximize locality:
- minimize the number of wires in each channel
- minimize the length of wires
- (but cannot put everything close)
• Often start by partitioning/clustering
• State-of-the-art tools finish via simulated annealing
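A minimal simulated-annealing placer, assuming a square grid of sites, two-pin nets, and total Manhattan wire length as the only cost, might look like the sketch below. Production placers also model channel congestion and timing; the cooling schedule, parameter values and function names here are illustrative assumptions.

```python
import math
import random


def wirelength(placement, nets):
    """Sum of Manhattan distances between the two blocks of each net."""
    total = 0
    for a, b in nets:
        (xa, ya), (xb, yb) = placement[a], placement[b]
        total += abs(xa - xb) + abs(ya - yb)
    return total


def anneal(blocks, nets, grid, steps=2000, t_start=5.0, t_end=0.01, seed=0):
    rng = random.Random(seed)
    sites = [(x, y) for x in range(grid) for y in range(grid)]
    if len(blocks) > len(sites):
        raise ValueError("more blocks than sites")
    # occupants[i] is the block at sites[i], or None for an empty site
    occupants = list(blocks) + [None] * (len(sites) - len(blocks))
    rng.shuffle(occupants)                       # random legal starting placement

    def placement():
        return {blk: site for blk, site in zip(occupants, sites) if blk is not None}

    cost = wirelength(placement(), nets)
    best, best_cost = placement(), cost
    for step in range(steps):
        temperature = t_start * (t_end / t_start) ** (step / steps)  # geometric cooling
        i, j = rng.sample(range(len(sites)), 2)
        occupants[i], occupants[j] = occupants[j], occupants[i]      # propose a swap
        new_cost = wirelength(placement(), nets)
        delta = new_cost - cost
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            cost = new_cost                                          # accept the move
            if cost < best_cost:
                best, best_cost = placement(), cost
        else:
            occupants[i], occupants[j] = occupants[j], occupants[i]  # reject: undo swap
    return best, best_cost


blocks = ["a", "b", "c", "d"]
nets = [("a", "b"), ("b", "c"), ("c", "d")]
best, cost = anneal(blocks, nets, grid=2)
print(best, cost)
```

Accepting some cost-increasing swaps at high temperature is what lets the search escape poor local arrangements before it settles.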

Programming/Configuring FPGAs (3) Placement of CLBs

Programming/Configuring FPGAs (4) Routing Between CLBs • Often done in two passes: • Global to determine channel. • Detailed to determine actual wires and switches. • Difficulty is: • Limited available channels. • Switchbox connectivity restrictions.
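As a rough illustration of the detailed-routing step, the sketch below uses a breadth-first (Lee-style) maze router on a grid where each cell can carry one wire. This technique is swapped in for illustration and is not claimed to be what any particular FPGA tool uses; the grid model, the net format and the one-net-at-a-time ordering are assumptions. Cells consumed by earlier nets become obstacles, which mimics the limited channel capacity mentioned above.

```python
from collections import deque


def route(grid, source, sink):
    """Breadth-first maze route on free cells (0); returns a shortest path or None."""
    rows, cols = len(grid), len(grid[0])
    previous = {source: None}
    queue = deque([source])
    while queue:
        cell = queue.popleft()
        if cell == sink:
            path = []
            while cell is not None:          # walk back from sink to source
                path.append(cell)
                cell = previous[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            free = 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0
            if free and (nr, nc) not in previous:
                previous[(nr, nc)] = cell
                queue.append((nr, nc))
    return None


def route_nets(grid, nets):
    """Route nets one at a time; each routed wire blocks its cells for later nets."""
    grid = [row[:] for row in grid]          # do not modify the caller's grid
    routes = {}
    for name, (source, sink) in nets.items():
        path = route(grid, source, sink)
        routes[name] = path
        if path:
            for r, c in path:
                grid[r][c] = 1
    return routes


empty = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
print(route_nets(empty, {"n1": ((0, 0), (0, 2)), "n2": ((1, 0), (1, 2))}))
```

Because nets are routed in order, an early net can block a later one; real routers address this with rip-up and re-route or negotiated congestion, which this sketch omits.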

Programming/Configuring FPGAs (4) Routing Between CLBs

Overall Configurable Hardware Approach
• Select critical portions or phases of an application where hardware customizations will offer an advantage, e.g. the computationally intensive “kernel(s)” of the application (hardware-software partitioning).
• Map those application phases to FPGA hardware:
- Hand hardware design/RTL/VHDL
- VHDL => synthesis & layout
• If it doesn't fit in the FPGA, re-select a (smaller) application phase and try again.
• Perform timing analysis to determine the rate at which the configurable design can be clocked.
• Write interface software for communication between the main processor (GPP) and the configurable hardware:
- Determine where input/output data communicated between software and configurable hardware will be stored
- Write code to manage its transfer (like a procedure call interface in standard software)
- Write code to invoke the configurable hardware (e.g. memory-mapped I/O)
• Compile software (including interface code)
• Send configuration bits to the configurable hardware
• Run program.
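The interface-software steps above (store inputs, invoke the hardware, fetch the result) can be sketched against a simulated device. Everything in this example is hypothetical: the register offsets, the start/done protocol and the class name are invented for illustration, and the "hardware" is just a Python function standing in for a configured kernel.

```python
class SimulatedAccelerator:
    """Stand-in for configurable hardware behind memory-mapped registers.

    The register offsets below are hypothetical, chosen only for this sketch.
    """
    ARG0, ARG1, CONTROL, STATUS, RESULT = 0x00, 0x04, 0x08, 0x0C, 0x10

    def __init__(self, kernel):
        self.kernel = kernel      # function the "hardware" was configured to compute
        self.regs = {self.ARG0: 0, self.ARG1: 0, self.STATUS: 0, self.RESULT: 0}

    def write(self, offset, value):
        if offset == self.CONTROL and value == 1:          # start command
            self.regs[self.RESULT] = self.kernel(self.regs[self.ARG0],
                                                 self.regs[self.ARG1])
            self.regs[self.STATUS] = 1                      # done flag
        else:
            self.regs[offset] = value

    def read(self, offset):
        return self.regs[offset]


def call_hardware(dev, a, b):
    """Procedure-call style wrapper: store inputs, start, wait, fetch result."""
    dev.write(dev.ARG0, a)
    dev.write(dev.ARG1, b)
    dev.write(dev.STATUS, 0)               # clear any stale done flag
    dev.write(dev.CONTROL, 1)              # invoke the configured kernel
    while dev.read(dev.STATUS) != 1:       # poll until completion is reported
        pass
    return dev.read(dev.RESULT)


dev = SimulatedAccelerator(lambda a, b: a * b)
print(call_hardware(dev, 6, 7))
```

Wrapping the register traffic in one function keeps the rest of the application looking like ordinary software, which is the "procedure call interface" idea in the list above.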

Configurable Hardware Application Challenges • This process turns applications programmers into: • Part-time hardware designers. • Performance analysis problems => what should we put in hardware? • Hardware-Software Co-design problem • Choice and granularity of computational elements. • Choice and granularity of interconnect network. • Synthesis problems • Testing/reliability problems.

Issues in Using FPGAs for Reconfigurable Computing
• Hardware-software partitioning (co-design)
• Run-time reconfiguration latency/overhead
- Time to load the configuration bitstream: may take seconds (improving)
- Reconfiguration latency hiding techniques
• I/O bandwidth limitations: need for tight coupling (e.g. hybrid FPGAs)
• Speed, power, cost, density (improving)
• High-level language support (improving)
• Performance, space estimators
• Design verification
• Partitioning and mapping across several FPGAs
• Partial reconfiguration (supported in some recent high-end FPGAs)
• Configuration caching

Example Reconfigurable Computing Research Efforts
(RC-n labels refer to papers)
• PRISM (Brown)
• PRISC (Harvard): RC-1
• DPGA-coupled uP (MIT)
• GARP (RC-3), Pleiades, … (UCB)
• OneChip (Toronto): RC-2
• RAW (MIT): RC-4
• REMARC (Stanford): RC-5
• CHIMAERA (Northwestern): RC-6
• DEC PAM
• Splash 2
• NAPA (NSC)
• E5 etc. (Triscend)

Hybrid-Architecture RC Compute Models
• Unaffected by array logic: Interfacing
- Triscend E5
• Dedicated IO Processor
- NAPA 1000
• Instruction Augmentation (tight coupling):
• Special instructions / coprocessor ops (usually FPGA-based):
- PRISM (Brown, 1991)
- PRISC (Harvard, 1994)
- Chimaera (Northwestern, 1997)
- GARP (Berkeley, 1997)
- Virtex-4 FX (Xilinx)
• VLIW/microcoded array extensions to the processor (usually arrays of simple processors):
- REMARC (Stanford, 1998)
- Raw (MIT, 1997)
- MorphoSys (UC Irvine, 2000)
- MATRIX (MIT, 1997)
- RaPiD (Reconfigurable Pipelined Datapaths) (University of Washington, 1996)
- PipeRench (Carnegie Mellon, 1999)
- DAPDNA-2 (IPFlex Inc., 2004?) (see DAPDNA handout)
- ………
• Autonomous co/stream processor
- OneChip (Toronto, 1998)

Hybrid-Architecture RC Compute Models: Interfacing
• Logic used in place of:
- ASIC environment customization
- External FPGA/PLD devices
• Examples:
- bus protocols
- peripherals
- sensors, actuators
• Case for:
- Always have some system adaptation to do
- Modern chips have the capacity to hold processor + glue logic
- Reduce part count
- Glue logic varies
- Value added must now be accommodated on chip (formerly at board level)

Triscend E5 Example: Interface/Peripherals

Hybrid-Architecture RC Compute Models: IO Processor
• Array dedicated to servicing an IO channel
- sensor, LAN, WAN, peripheral
• Provides:
- flexible protocol handling
- flexible stream computation: compression, encryption (in-place)
• Looks like an IO peripheral to the processor
• Case for:
- many protocols, services
- only need a few at a time
- dedicate attention, offload processor

Reconfigurable IO Processor Example: NAPA 1000
Figure: NAPA 1000 block diagram. Blocks shown:
• TBT: ToggleBus™ Transceiver
• CR32: CompactRISC™ 32-bit processor
• RPC: Reconfigurable Pipeline Cntr
• ALP: Adaptive Logic Processor
• CIO: Configurable I/O
• PMA: Pipeline Memory Array
• BIU: Bus Interface Unit
• SMA: Scratchpad Memory Array
• System Port
• External Memory Interface
• CR32 Peripheral Devices

Reconfigurable IO Processor Example: NAPA 1000
Figure: NAPA 1000 as IO processor. The NAPA1000 connects to the system host through its System Port, to application-specific sensors, actuators, or other circuits through CIO, and to ROM & DRAM through its memory interface.

Hybrid-Architecture RC Compute Models: Instruction Augmentation
($I$ = opcode size, $w$ = operand word size)
• Observation: Instruction Bandwidth
- Processor can only describe a small number of basic computations in a cycle (i.e. per instruction)
- $I$ bits → $2^I$ operations
- This is a small fraction of the operations one could do, even in terms of $w \times w \rightarrow w$ ops
- $w\,2^{2^{2w}}$ operations
- Processor could have to issue $w\,2^{(2^{2w} - I)}$ operations (instructions) just to describe some computations
- An a priori selected base set of functions (via ISA instructions, i.e. a fixed ISA) could be very bad for some applications
• Motivation for application-specific processors/ISAs (ASPs)
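To get a feel for the gap between what an opcode can name and what a word-level operation could be, the sketch below counts every possible mapping from two w-bit operands to one w-bit result. That counting convention (any mapping at all, giving 2^(w·2^(2w)) functions) is an assumption of this sketch and is not a transcription of the expressions on the slide; the function names are hypothetical.

```python
from itertools import product


def opcode_count(i_bits):
    """Number of distinct operations an I-bit opcode field can select."""
    return 2 ** i_bits


def function_count(w):
    """Number of distinct mappings from two w-bit operands to one w-bit result.

    There are 2**(2*w) input combinations and 2**w possible results for each.
    """
    return (2 ** w) ** (2 ** (2 * w))


def enumerate_functions(w):
    """Brute-force count for tiny w: build every truth table explicitly."""
    inputs = 2 ** (2 * w)
    return sum(1 for _ in product(range(2 ** w), repeat=inputs))


print(opcode_count(6))          # operations nameable with a 6-bit opcode
print(function_count(1))        # all 1-bit x 1-bit -> 1-bit functions
print(function_count(2))        # all 2-bit x 2-bit -> 2-bit functions
```

Even at w = 2 the number of possible operations exceeds what a 32-bit opcode could enumerate one-for-one, which is the motivation for letting the application define its own operations.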

Hybrid-Architecture RC Compute Models: Instruction Augmentation • Idea: • Provide a way to augment the processor’s instruction set (Base ISA) with operations needed by a particular application. • Close semantic gap / avoid mismatch between fixed ISA and application computational operations needed. • What’s required: • Some way to fit augmented instructions into stream • Execution engine for augmented instructions: • If programmable, has own instructions • FPGA or array of simple micro-coded processors • Interconnect to augmented instructions.

Instruction Augmentation
First Effort in Instruction Augmentation: PRISM (Brown, 1991)
• Processor Reconfiguration through Instruction Set Metamorphosis (PRISM)
• FPGA on bus (similar to Splash 2)
• Access as memory-mapped peripheral
• Explicit context management
• PRISM-1:
- 68010 (10 MHz) + XC3090
- can reconfigure FPGA in one second
- 50-75 clocks for operations

PRISM-1 Results Raw kernel speedups (IO configuration time not included?)

Instruction Augmentation
PRISC (Harvard, 1994) (paper RC-1)
PRISC = PRogrammable Instruction Set Computers
PFU = Programmable Functional Unit
• Takes the next step (tight coupling):
- What if we put it on chip?
- How to integrate it into the processor ISA?
• Architecture:
- Couple into the register file as a “superscalar” functional unit
- Flow-through array (no state)
