High level programming issues for reconfigurable computing systems

High-Level Programming Issues for Reconfigurable Computing Systems. Mark Jones ECE Virginia Tech Blacksburg, Virginia [email protected] www.ccm.ece.vt.edu. The Virginia Tech Configurable Computing Lab.

High-Level Programming Issues for Reconfigurable Computing Systems

Mark Jones

ECE

Virginia Tech

Blacksburg, Virginia

[email protected]

www.ccm.ece.vt.edu


The Virginia Tech Configurable Computing Lab

  • Focus on devices, architectures, applications, and programming issues for configurable computing

  • 30+ undergraduates, graduates, and post-docs

  • Variety of public and private sponsors

  • Peter Athanas & Mark Jones


Overview

  • Run-time reconfiguration (RTR)

  • Obstacles to RTR

  • Recent developments enabling RTR

    • New hardware

    • New bitstream generation tools

    • New runtime control software

  • RTR applications

  • Summary and predictions

  • Disclaimer: In the interests of time, I am not mentioning all of the relevant projects.


Run-Time Reconfiguration

  • Adaptive computing devices (e.g., FPGAs)

    • Hardware configurations can be changed

    • Speed of reconfiguration varies by device

  • Reconfigure during the runtime of one or more applications, in less than 1 ms

  • Goals of the DARPA ACS program include

    • Development of hardware supporting fast RTR

    • Creation of software to control RTR hardware

    • Applications that demonstrate the computational benefits of RTR in size, weight, and power


Types of RTR

  • Virtual Hardware

    • Provide programmer with an abstraction of unlimited hardware, similar to Virtual Memory

    • Useful abstraction which, like virtual memory, provides portability between devices

    • OS is responsible for directing the chip to context-switch user “hardware” (may include multiple processes)

    • Requires fast context-switching capability and software to effectively partition user hardware

    • Virtual co-processor work (e.g. DISC @ BYU) can be thought of in a similar fashion


Types of RTR (continued)

  • Data-driven RTR

    • Based on the data encountered, the hardware is reconfigured to process it

      • e.g., for a given DES key, the hardware is reconfigured to a DES core specific to the key

    • Can provide increased speed in a small package

    • Hardware must be able to reconfigure quickly and (in most cases) direct its own reconfiguration based on data encountered


Device Reconfiguration Methods

  • Entire device via a single bitstream

    • e.g. Xilinx 4K series

    • Long reconfiguration times

  • Logic-unit addressable reconfiguration

    • e.g. Xilinx 6200

    • Significant chip area devoted to this function

  • Context-based reconfiguration

    • Sanders CSRC chip

    • Significant chip area devoted to this function


Device Reconfiguration Methods (continued)

  • Partial reconfiguration

    • e.g. Xilinx Virtex

    • Must reconfigure a column at a time

  • Stream-based reconfiguration

    • e.g., Colt/Stallion

    • Appropriate for stream-based computation

  • Pipeline-oriented reconfiguration

    • e.g., PipeRench

    • Appropriate for deeply pipelined applications


Types of Reconfigurable Apps

  • Stream-oriented applications

    • Intelligent network devices, software radios, video processing

    • Reconfiguration must occur quickly enough, and without disrupting the hardware, to avoid losing data in the stream (buffering is too expensive in many situations)

  • “Batch”-type applications

    • Number-crunching simulations, off-line analysis of data

    • Reconfiguration must simply be cost-effective when trading off processing for reconfiguration
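
The two trade-offs above can be put into rough numbers. The following is a minimal sketch, assuming a fixed reconfiguration time and a fixed per-item processing time; the function names and the sample figures are hypothetical and do not come from the presentation.

```python
def break_even_items(t_general_ns, t_specialized_ns, t_reconfig_ns):
    """Smallest batch size for which reconfiguring beats staying put.

    Reconfiguring wins when
        t_reconfig + n * t_specialized < n * t_general.
    Returns None if the specialized circuit is not faster per item.
    """
    gain_per_item = t_general_ns - t_specialized_ns
    if gain_per_item <= 0:
        return None
    # Smallest integer n with n * gain_per_item > t_reconfig_ns.
    return t_reconfig_ns // gain_per_item + 1


def stream_buffer_bytes(rate_bytes_per_s, t_reconfig_ns):
    """Bytes that arrive while the device is busy reconfiguring (rounded up)."""
    return (rate_bytes_per_s * t_reconfig_ns + 10**9 - 1) // 10**9


if __name__ == "__main__":
    # Hypothetical batch job: 1 ms reconfiguration, 200 ns vs. 50 ns per item.
    print(break_even_items(200, 50, 1_000_000))
    # Hypothetical 100 MB/s stream stalled for 1 ms.
    print(stream_buffer_bytes(100_000_000, 1_000_000))
```

For a batch job only the break-even count matters. For a stream, the second figure is the buffering that would be needed if the hardware stalls during reconfiguration, which is why fast, non-disruptive reconfiguration is preferred there.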


Prior Obstacles to RTR

  • Lack of hardware devices that support RTR in an appropriate fashion

    • Provide fast reconfiguration without sacrificing performance

  • Lack of software to support RTR

    • Generate and modify bitstream configurations during runtime

  • The following slides will survey projects which are overcoming these obstacles

    • These projects really represent evolutionary advances on previous research projects


Virtual Hardware: PipeRench (CMU)

  • Many applications, particularly stream-based applications, can be deeply pipelined to improve performance

  • PipeRench is built as a reconfigurable pipeline of n units

  • The programmer views PipeRench as a programmable pipeline of m units where m is arbitrarily large


PipeRench (CMU)

  • PipeRench supports this Virtual Hardware abstraction by stepping the physical pipeline through the abstract pipeline, reconfiguring stages as it goes


PipeRench (CMU)

  • Only one stage must be reconfigured at each step

    • Allows for fast reconfiguration because only part of chip must be reconfigured

  • Defines a scalable architecture series

    • No changes to code are needed as hardware increases in size

  • Realization in VLSI exists as well as compiler tools
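
The idea of mapping an m-stage virtual pipeline onto n physical stages, rewriting one stage per step, can be sketched as a round-robin schedule. This is a simplified model for illustration only, not the scheduler used by the CMU hardware or compiler; the function names are hypothetical.

```python
def stripe_schedule(virtual_stages, physical_stages, steps):
    """Return a list of (step, physical_stage, virtual_stage) tuples.

    Simplified model: at every step exactly one physical stage is
    rewritten with the next virtual stage, round-robin, while the other
    physical stages keep computing.
    """
    schedule = []
    for step in range(steps):
        physical = step % physical_stages
        virtual = step % virtual_stages
        schedule.append((step, physical, virtual))
    return schedule


def fabric_contents(virtual_stages, physical_stages, steps):
    """Return which virtual stage each physical stage holds after `steps`."""
    fabric = [None] * physical_stages
    for _, physical, virtual in stripe_schedule(
        virtual_stages, physical_stages, steps
    ):
        fabric[physical] = virtual
    return fabric


if __name__ == "__main__":
    # A 5-stage virtual pipeline on 3 physical stages.
    for entry in stripe_schedule(5, 3, 6):
        print(entry)
```

In this model the same schedule function works for any number of physical stages, which mirrors the claim that code does not change as the hardware grows.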


Runtime Generation of Bitstreams: Loki Project (Xilinx and Virginia Tech)

[Figure: runtime bitstream generation flow. Diagram labels: Application Program, New State, Functionality, Place & Route, State, Connectivity, Resources]


Loki Project (continued)

  • JBits provides an API to the Xilinx bitstream for the 4K and Virtex parts

    • Java-based API at the LUT/pip level

    • Executing a Java program with the JBits API can create or modify a bitstream

  • The Loki project builds on this API to provide a design environment

    • Focus is on Run-Time Parameterizable cores


Loki Project (continued)

  • RTP cores (tens of cores at this point)

    • Finite state machines

    • KCMs

    • CAMs

  • Execution time for customizing bitstreams

    • Milliseconds (or less) for modification of LUTs in an existing bitstream

    • Challenge is to provide similar speeds when routing is required
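
A constant-coefficient multiplier (KCM) illustrates why LUT-only changes are fast: the constant lives entirely in table contents, so a new constant means new table values and no new routing. The sketch below models this in software with 4-bit input slices; it does not use the JBits API, and the function names are hypothetical.

```python
def build_kcm_table(constant):
    """16-entry table: entry v holds constant * v for a 4-bit input slice."""
    return [constant * v for v in range(16)]


def kcm_multiply(table, x, width_bits):
    """Multiply x by the table's constant using lookups, shifts, and adds."""
    if x < 0 or x >= 1 << width_bits:
        raise ValueError("x does not fit in width_bits")
    total = 0
    for shift in range(0, width_bits, 4):
        nibble = (x >> shift) & 0xF
        # Each slice contributes (constant * nibble), weighted by its position.
        total += table[nibble] << shift
    return total


if __name__ == "__main__":
    table = build_kcm_table(13)
    print(kcm_multiply(table, 200, 8))
    # "Reconfiguring" for a new constant only regenerates the table.
    table = build_kcm_table(7)
    print(kcm_multiply(table, 200, 8))
```

The adder structure in `kcm_multiply` stays the same for every constant, which is the software analogue of leaving placement and routing untouched.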


Loki Project (continued)

  • The RTP core-based approach provides a hierarchical approach

    • Routing & placement is handled within the core, so a full chip-wide P&R is not required

  • The JBits & RTP-based approach in the Java environment makes development of new tools much easier

    • Simulator for Virtex devices

    • Visualization of routing delays

    • Visualization of core layout and runtime execution


BoardScope Core View

[Figure: BoardScope core view of an evolved synchronous circuit, showing a vertical output shift register and 3 horizontal input shift registers, with the center register highlighted]


Runtime Hardware Control: SLAAC & DRACS (Sanders, Virginia Tech, USC/ISI-East)

  • The new hardware that supports fast RTR requires new runtime control software to reduce/eliminate the software overhead associated with reconfiguration

  • Need to provide the programmer with an abstraction for RTR that is easy to use, yet doesn’t incur runtime overhead


Runtime Hardware Control: Target Hardware

  • The SLAAC-1V board

    • 3 Virtex 1000 chips capable of partial reconfiguration

    • On-board configuration controller (Virtex 100) with a local memory cache

  • The Sanders RCM board

    • 2 CSRC chips capable of context-switching

    • PowerPC and Xilinx 4085 with local memory cache


Runtime Hardware Control: Virtual Hardware

  • Consider an OS that is swapping hardware configurations in/out of chip (microseconds)

    • Partial configurations in and out of the Virtex parts on the SLAAC-1V

    • Switching contexts on the RCM board

  • Cannot afford to have the configurations sent by the OS to the board on every configuration swap

    • The transfer would overwhelm the microsecond cost of the swap itself


Runtime Hardware Control: Virtual Hardware (continued)

  • Most programs exhibit temporal locality

    • Exploit this in a way similar to virtual memory

  • Both the SLAAC-1V and the Sanders RCM provide the memory and the control capability to build a configuration cache

    • Instead of sending configurations to the board, control signals are sent invoking reconfiguration from the cache

  • Transparent to the programmer
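
A configuration cache of this kind can be sketched as a small least-recently-used (LRU) store. This is an illustrative model only: the LRU policy, the class name, and the `fetch_from_host` callback are assumptions, not details of the SLAAC-1V or RCM control software.

```python
from collections import OrderedDict


class ConfigurationCache:
    """On-board store of configurations with LRU replacement (assumed policy)."""

    def __init__(self, capacity, fetch_from_host):
        self.capacity = capacity
        self.fetch_from_host = fetch_from_host  # hypothetical slow host transfer
        self.slots = OrderedDict()
        self.host_transfers = 0

    def activate(self, config_id):
        """Make config_id the active configuration; report 'hit' or 'miss'."""
        if config_id in self.slots:
            # Already on the board: only a control signal is needed.
            self.slots.move_to_end(config_id)
            return "hit"
        if len(self.slots) >= self.capacity:
            # Evict the least recently used configuration.
            self.slots.popitem(last=False)
        self.slots[config_id] = self.fetch_from_host(config_id)
        self.host_transfers += 1
        return "miss"


if __name__ == "__main__":
    cache = ConfigurationCache(2, lambda cid: f"bitstream-for-{cid}")
    for cid in ["A", "B", "A", "C", "B"]:
        print(cid, cache.activate(cid))
    print("host transfers:", cache.host_transfers)
```

The caller only ever names the configuration it wants, so the cache stays invisible to the application, in the same way virtual memory hides paging.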


Runtime Hardware Control: Data-Driven RTR

  • Data-driven RTR requires extremely fast reconfiguration and virtually no overhead in the control of RTR

    • Little benefit to clock-cycle RTR (CSRC) if the control software takes longer

    • Must execute control of RTR near the chip

  • Need an abstraction for programmers to target


Runtime Hardware Control: Data-Driven RTR (continued)

  • Using a Finite State Machine (FSM) provides a suitable solution

    • The FSM monitors the data encountered, triggering changes in state

    • State change in the FSM reconfigures the chip from the configuration cache

  • FSM can execute in a small space (e.g., a fraction of a Xilinx 4085) local to the board

  • Interface familiar to most programmers
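
The FSM abstraction described above can be sketched as a transition table plus a mapping from states to cached configurations. The class, the state names, and the `load_config` callback are hypothetical; on real hardware the FSM would run in logic next to the chip rather than in software.

```python
class ReconfigFSM:
    """Finite state machine whose state changes trigger reconfiguration."""

    def __init__(self, transitions, config_for_state, start, load_config):
        self.transitions = transitions            # (state, observed) -> next state
        self.config_for_state = config_for_state  # state -> cached configuration id
        self.state = start
        self.load_config = load_config            # hypothetical cache-load hook

    def step(self, observed):
        """Consume one observed data item and reconfigure on a state change."""
        next_state = self.transitions.get((self.state, observed), self.state)
        if next_state != self.state:
            self.state = next_state
            self.load_config(self.config_for_state[next_state])
        return self.state


if __name__ == "__main__":
    loaded = []
    fsm = ReconfigFSM(
        transitions={
            ("video", "audio_hdr"): "audio",
            ("audio", "video_hdr"): "video",
        },
        config_for_state={"video": "cfg_video", "audio": "cfg_audio"},
        start="video",
        load_config=loaded.append,
    )
    for item in ["payload", "audio_hdr", "payload", "video_hdr"]:
        fsm.step(item)
    print(loaded)
```

Items that match no transition leave the state, and therefore the hardware, unchanged, so reconfiguration cost is paid only when the data actually calls for it.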


Application: DES Core (Xilinx)

  • The circuitry for DES computation can be significantly reduced if a specific key is “folded into” the circuitry

    • This reduction allows for a smaller, faster hardware realization of DES

  • Of course, a DES implementation that is specific to a single key isn’t useful unless it can be reconfigured…


DES Core (continued)

  • A DES core was implemented using JBits

    • A new core for each key is generated at runtime

    • Requires only changes to LUTs to configure for a new key

  • This implementation is faster than the current ASIC DES champion from Sandia

  • Technique being exploited for other encryption methods at Xilinx
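
Folding a key into the circuitry can be illustrated with a toy example: a key XOR followed by a substitution table collapses into a single table once the key is fixed. The sketch below uses a 4-bit table with arbitrary values chosen for illustration; it is not the DES round function and not the JBits core, and all names are hypothetical.

```python
# Toy 4-bit substitution table (arbitrary permutation, for illustration only).
TOY_SBOX = [0xE, 0x4, 0xD, 0x1, 0x2, 0xF, 0xB, 0x8,
            0x3, 0xA, 0x6, 0xC, 0x5, 0x9, 0x0, 0x7]


def generic_stage(x, key):
    """General circuit: XOR gates for the key, then the substitution table."""
    return TOY_SBOX[(x ^ key) & 0xF]


def fold_key_into_lut(key):
    """Key-specific circuit: one 16-entry LUT with the key XOR absorbed."""
    return [TOY_SBOX[(x ^ key) & 0xF] for x in range(16)]


if __name__ == "__main__":
    lut = fold_key_into_lut(0b1010)
    # The specialized LUT matches the general circuit for this key.
    print(all(lut[x] == generic_stage(x, 0b1010) for x in range(16)))
    # Changing the key only regenerates LUT contents.
    print(fold_key_into_lut(0b0110) != lut)
```

The specialized version needs no XOR gates and no key inputs, and switching keys touches only table contents, which matches the LUT-only reconfiguration described above.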


EPIC View of 16 Rounds

Courtesy Cameron Patterson


Comparing Fully Unrolled and Pipelined Designs

Courtesy Cameron Patterson


Application: Number Crunching (Virginia Tech)

  • Traditional “numerical-analysis” style computation has focused on the use of IEEE-compliant floating-point arithmetic on general purpose CPUs

  • Two trends are forcing a refocus

    • Intel (and others) do not focus design on this market

    • Embedded processing is becoming increasingly complex, requiring more “number-crunching”


Number Crunching (cont.)

  • Cannot do away with key features of IEEE-compliant arithmetic (too many algorithms depend on it)

    • Floating-point units, however, are large and expensive

  • Can customize hardware to provide performance in a reasonable package

    • Reconfiguration is key


Number Crunching (cont.)

  • Use constant floating-point multipliers

    • e.g., as coefficients in an FIR

  • These multipliers are smaller and faster than two-input multipliers

    • Analysis provides bounds on the size of IEEE-compliant implementations


Summary

  • Obstacles to practical RTR are being overcome

  • New hardware devices, experimental and commercial, are now available

  • New software is coming online to allow run-time bitstream generation

  • And now for some predictions…


RTR Predictions

  • Security of reconfigurable devices comes into question and changes are made to address this issue

  • APIs to commercial FPGA bitstreams become commonplace, allowing more widespread innovation in RTR software

  • RTR hardware becomes an essential aspect of SOC solutions which, by their nature, avoid the “scale by adding more hardware” aspect of PCs

    • Will proliferate in industries that need low-cost, low-power, small solutions (e.g., cellular phones)

