High level programming issues for reconfigurable computing systems

High-Level Programming Issues for Reconfigurable Computing Systems. Mark Jones ECE Virginia Tech Blacksburg, Virginia [email protected] www.ccm.ece.vt.edu. The Virginia Tech Configurable Computing Lab.

High-Level Programming Issues for Reconfigurable Computing Systems

Mark Jones

ECE

Virginia Tech

Blacksburg, Virginia

[email protected]

www.ccm.ece.vt.edu


The Virginia Tech Configurable Computing Lab

  • Focus on devices, architectures, applications, and programming issues for configurable computing

  • 30+ undergraduates, graduates, and post-docs

  • Variety of public and private sponsors

  • Peter Athanas & Mark Jones


Overview

  • Run-time reconfiguration (RTR)

  • Obstacles to RTR

  • Recent developments enabling RTR

    • New hardware

    • New bitstream generation tools

    • New runtime control software

  • RTR applications

  • Summary and predictions

  • Disclaimer: In the interests of time, I am not mentioning all of the relevant projects.


Run-Time Reconfiguration

  • Adaptive computing devices (e.g., FPGAs)

    • Hardware configurations can be changed

    • Speed of reconfiguration varies by device

  • Reconfigure during the runtime of the application(s), in less than 1 ms

  • Goals of the DARPA ACS program include

    • Development of hardware supporting fast RTR

    • Creation of software to control RTR hardware

    • Applications that demonstrate the computational benefits of RTR in size, weight, and power


Types of RTR

  • Virtual Hardware

    • Provide programmer with an abstraction of unlimited hardware, similar to Virtual Memory

    • Useful abstraction which, like virtual memory, provides portability between devices

    • OS is responsible for directing the chip to context-switch user “hardware” (may include multiple processes)

    • Requires fast context-switching capability and software to effectively partition user hardware

    • Virtual co-processor work (e.g. DISC @ BYU) can be thought of in a similar fashion


Types of RTR (continued)

  • Data-driven RTR

    • Based on the data encountered, the hardware is reconfigured to process it

      • e.g., for a given DES key, the hardware is reconfigured to a DES core specific to the key

    • Can provide increased speed in a small package

    • Hardware must be able to reconfigure quickly and (in most cases) direct its own reconfiguration based on data encountered


Device Reconfiguration Methods

  • Entire device via a single bitstream

    • e.g. Xilinx 4K series

    • Long reconfiguration times

  • Logic-unit addressable reconfiguration

    • e.g. Xilinx 6200

    • Significant chip area devoted to this function

  • Context-based reconfiguration

    • e.g. Sanders CSRC chip

    • Significant chip area devoted to this function


Device Reconfiguration Methods (continued)

  • Partial reconfiguration

    • e.g. Xilinx Virtex

    • Must reconfigure column at a time

  • Stream-based reconfiguration

    • e.g. Colt/Stallion

    • Appropriate for stream-based computation

  • Pipeline-oriented reconfiguration

    • e.g. PipeRench

    • Appropriate for deeply pipelined applications


Types of Reconfigurable Apps

  • Stream-oriented applications

    • Intelligent network devices, software radios, video processing

    • Reconfiguration must occur quickly enough, and without disrupting the hardware, to avoid losing data in the stream (buffering is too expensive in many situations)

  • “Batch”-type applications

    • Number-crunching simulations, off-line analysis of data

    • Reconfiguration must simply be cost-effective when trading off processing for reconfiguration
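The batch-type tradeoff can be written as a simple break-even check. The following is a minimal sketch under stated assumptions: the function name, its parameters, and the timings in the usage note are illustrative and do not come from the presentation.

```python
def reconfiguration_pays_off(t_software_s, t_hardware_s, t_reconfig_s, n_reconfigs=1):
    """Return True if running on reconfigurable hardware is faster overall.

    All arguments are assumed wall-clock times in seconds:
    t_software_s  - time for the software-only run
    t_hardware_s  - compute time on the configured hardware
    t_reconfig_s  - time for one reconfiguration
    n_reconfigs   - number of reconfigurations during the run
    """
    total_hardware = t_hardware_s + n_reconfigs * t_reconfig_s
    return total_hardware < t_software_s
```

For instance, with an assumed 10 s software run, 2 s of hardware compute, and four reconfigurations of 0.5 s each, the hardware path totals 4 s and the check passes. For stream-oriented applications this test is not sufficient, since the reconfiguration must also avoid dropping data.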


Prior Obstacles to RTR

  • Lack of hardware devices that support RTR in an appropriate fashion

    • Provide fast reconfiguration without sacrificing performance

  • Lack of software to support RTR

    • Generate and modify bitstream configurations during runtime

  • The following slides will survey projects which are overcoming these obstacles

    • These projects really represent evolutionary advances on previous research projects


Virtual Hardware: PipeRench (CMU)

  • Many applications, particularly stream-based applications, can be deeply pipelined to improve performance

  • PipeRench is built as a reconfigurable pipeline of n units

  • The programmer views PipeRench as a programmable pipeline of m units where m is arbitrarily large


PipeRench (CMU)

  • PipeRench supports this Virtual Hardware abstraction by stepping the physical pipeline through the abstract pipeline, reconfiguring stages along the way


PipeRench (CMU)

  • Only one stage must be reconfigured at each step

    • Allows for fast reconfiguration because only part of chip must be reconfigured

  • Defines a scalable architecture series

    • No changes to code are needed as hardware increases in size

  • Realization in VLSI exists as well as compiler tools
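The idea of mapping an arbitrarily long abstract pipeline onto a fixed physical one, reconfiguring a single stage per step, can be sketched as a schedule. This is a simplified model with an assumed round-robin policy, not the actual PipeRench scheduler; the function name and parameters are hypothetical.

```python
def virtual_pipeline_schedule(m_virtual, n_physical, steps):
    """List (step, physical_stage, virtual_stage) reconfigurations.

    Assumed policy: at each step exactly one physical stage is reloaded
    with the next virtual stage, wrapping around both pipelines.
    """
    schedule = []
    for step in range(steps):
        physical = step % n_physical   # the one stage reconfigured this step
        virtual = step % m_virtual     # the abstract stage loaded into it
        schedule.append((step, physical, virtual))
    return schedule
```

With 5 virtual stages on 3 physical stages, the fourth step reuses physical stage 0 for virtual stage 3. The same call works unchanged for a larger `n_physical`, which is the sense in which code need not change as the hardware grows.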


Runtime Generation of Bitstreams: Loki Project (Xilinx and Virginia Tech)

[Diagram labels: Application Program, New State, Functionality, Place & Route, State, Connectivity, Resources]


Loki Project (continued)

  • JBits provides an API to the Xilinx bitstream for the 4K and Virtex parts

    • Java-based API at the LUT/pip level

    • Executing a Java program with the JBits API can create or modify a bitstream

  • The Loki project builds on this API to provide a design environment

    • Focus is on Run-Time Parameterizable cores


Loki Project (continued)

  • RTP cores (tens of cores at this point)

    • Finite state machines

    • KCMs

    • CAMs

  • Execution time for customizing bitstreams

    • Milliseconds (or less) for modification of LUTs in an existing bitstream

    • Challenge is to provide similar speeds when routing is required


Loki Project (continued)

  • The RTP core-based approach provides a hierarchical approach

    • Routing & placement are handled within the core, so a full chip-wide P&R is not required

  • The JBits & RTP-based approach in the Java environment makes development of new tools much easier

    • Simulator for Virtex devices

    • Visualization of routing delays

    • Visualization of core layout and runtime execution


BoardScope Core View

[Screenshot labels: Output Shift Register (Vertical); 3 Input Shift Registers (Horizontal), Center Register Highlighted; Evolved Synchronous Circuit]


Runtime Hardware Control: SLAAC & DRACS (Sanders, Virginia Tech, USC/ISI-East)

  • The new hardware that supports fast RTR requires new runtime control software to reduce/eliminate the software overhead associated with reconfiguration

  • Need to provide the programmer with an abstraction for RTR that is easy to use, yet doesn’t incur runtime overhead


Runtime Hardware Control: Target Hardware

  • The SLAAC-1V board

    • 3 Virtex 1000 chips capable of partial reconfiguration

    • On-board configuration controller (Virtex 100) with a local memory cache

  • The Sanders RCM board

    • 2 CSRC chips capable of context-switching

    • PowerPC and Xilinx 4085 with local memory cache


Runtime Hardware Control: Virtual Hardware

  • Consider an OS that is swapping hardware configurations in/out of chip (microseconds)

    • Partial configurations in and out of the Virtex parts on the SLAAC-1V

    • Switching contexts on the RCM board

  • Cannot afford to have the configurations sent by the OS to the board on every configuration swap

    • The transfer would overwhelm the microsecond cost of the swap itself


Runtime Hardware Control: Virtual Hardware (continued)

  • Most programs exhibit temporal locality

    • Exploit this in a way similar to virtual memory

  • Both the SLAAC-1V and the Sanders RCM provide the memory and the control capability to build a configuration cache

    • Instead of sending configurations to the board, control signals are sent invoking reconfiguration from the cache

  • Transparent to the programmer
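The configuration cache described above behaves much like a page cache. The following toy model assumes a least-recently-used eviction policy and a `fetch_from_host` callback, neither of which is specified in the presentation; the class and attribute names are hypothetical.

```python
from collections import OrderedDict


class ConfigurationCache:
    """Toy LRU cache of bitstreams held in board-local memory (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._slots = OrderedDict()  # configuration name -> bitstream bytes
        self.host_transfers = 0      # times a bitstream had to come from the host

    def activate(self, name, fetch_from_host):
        """Make `name` the active configuration, loading it only on a miss."""
        if name in self._slots:
            # Hit: only a control signal would be needed, no bitstream transfer.
            self._slots.move_to_end(name)
        else:
            if len(self._slots) >= self.capacity:
                self._slots.popitem(last=False)  # evict least recently used
            self._slots[name] = fetch_from_host(name)
            self.host_transfers += 1
        return self._slots[name]
```

If an application alternates between two configurations and the cache holds both, only the first two activations incur a host transfer; every later swap is served from board-local memory, which is the temporal-locality argument made on the slide.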


Runtime Hardware Control: Data-Driven RTR

  • Data-driven RTR requires extremely fast reconfiguration and virtually no overhead in the control of RTR

    • Little benefit to clock-cycle RTR (CSRC) if the control software takes longer

    • Must execute control of RTR near the chip

  • Need an abstraction for programmers to target


Runtime Hardware Control: Data-Driven RTR (continued)

  • Using a Finite State Machine (FSM) provides a suitable solution

    • The FSM monitors the data encountered, triggering changes in state

    • State change in the FSM reconfigures the chip from the configuration cache

  • FSM can execute in small space (e.g., fraction of Xilinx 4085) local to board

  • Interface familiar to most programmers
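The FSM abstraction can be modeled in a few lines of software. This is a sketch only: on the boards described, the FSM would run in logic near the chip, and the transition table, state names, and `reconfigure` callback here are illustrative assumptions.

```python
class ReconfigurationFSM:
    """Toy FSM in which each state names the configuration that should be active."""

    def __init__(self, transitions, start, reconfigure):
        self.transitions = transitions  # (state, symbol) -> next state
        self.state = start
        self.reconfigure = reconfigure  # callback invoked on every state change

    def observe(self, symbol):
        """Consume one data item; trigger reconfiguration only on a state change."""
        next_state = self.transitions.get((self.state, symbol), self.state)
        if next_state != self.state:
            self.state = next_state
            self.reconfigure(next_state)  # e.g. load from the configuration cache
        return self.state
```

A caller might pass a callback that selects a cached configuration by state name. Data items that do not match any transition leave the state, and therefore the hardware, unchanged.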


Application: DES Core (Xilinx)

  • The circuitry for DES computation can be significantly reduced if a specific key is “folded into” the circuitry

    • This reduction allows for a smaller, faster hardware realization of DES

  • Of course, a DES implementation that is specific to a single key isn’t useful unless it can be reconfigured…


DES Core (continued)

  • A DES core was implemented using JBits

    • A new core for each key is generated at runtime

    • Requires only changes to LUTs to configure for a new key

  • This implementation is faster than the current ASIC DES champion from Sandia

  • Technique being exploited for other encryption methods at Xilinx
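Folding a key into the circuitry so that only LUT contents change can be illustrated with a toy model. The sketch below is not DES and does not use the JBits API: it assumes a single 4-bit substitution table and a 4-bit key, and shows that a key XOR followed by a table lookup collapses into one precomputed 16-entry LUT per key. All names are hypothetical.

```python
# Toy 4-bit substitution table (a permutation of 0..15), used for illustration only.
TOY_SBOX = [0xE, 0x4, 0xD, 0x1, 0x2, 0xF, 0xB, 0x8,
            0x3, 0xA, 0x6, 0xC, 0x5, 0x9, 0x0, 0x7]


def generic_round(x, key):
    """General circuit: the key is a run-time input (XOR, then substitution)."""
    return TOY_SBOX[(x ^ key) & 0xF]


def fold_key_into_lut(key):
    """Key-specific circuit: precompute one 16-entry LUT for a fixed key."""
    return [TOY_SBOX[(x ^ key) & 0xF] for x in range(16)]
```

Indexing the folded table with `x` gives the same result as `generic_round(x, key)`, but the XOR stage and the key inputs are gone. Changing the key means regenerating the table contents, which corresponds to rewriting LUTs rather than rerouting.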


EPIC View of 16 Rounds

Courtesy Cameron Patterson


Comparing Fully Unrolled and Pipelined Designs

Courtesy Cameron Patterson


Application: Number Crunching (Virginia Tech)

  • Traditional “numerical-analysis” style computation has focused on the use of IEEE-compliant floating-point arithmetic on general purpose CPUs

  • Two trends are forcing a refocus

    • Intel (and others) do not focus design on this market

    • Embedded processing is becoming increasingly complex, requiring more “number-crunching”


Number Crunching (cont.)

  • Cannot do away with key features of IEEE-compliant arithmetic (too many algorithms depend on it)

    • Floating-point units, however, are large and expensive

  • Can customize hardware to provide performance in reasonable package

    • Reconfiguration is key


Number Crunching (cont.)

  • Use constant floating-point multipliers

    • e.g., as coefficients in an FIR

  • These multipliers are smaller and faster than two-input multipliers

    • Analysis provides bounds on the size of IEEE-compliant implementations
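One common way to build a constant multiplier is to replace the general multiplier array with small lookup tables of precomputed products. The sketch below models that idea for unsigned integers only, as might apply to a mantissa; it is an assumption-laden illustration, not the IEEE-compliant floating-point design the slides refer to, and the function names are hypothetical.

```python
def build_kcm_table(constant, nibble_bits=4):
    """Precompute constant * v for every possible nibble value v."""
    return [constant * v for v in range(1 << nibble_bits)]


def kcm_multiply(table, x, input_bits=8, nibble_bits=4):
    """Multiply x by the folded constant using lookups, shifts, and adds only."""
    mask = (1 << nibble_bits) - 1
    total = 0
    for shift in range(0, input_bits, nibble_bits):
        nibble = (x >> shift) & mask
        total += table[nibble] << shift  # partial product for this nibble
    return total
```

For an 8-bit input this needs two table lookups and one addition, with no general two-input multiplier. Changing the coefficient only changes the table contents, which is why such multipliers pair naturally with run-time reconfiguration.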


Summary

  • Obstacles to practical RTR are being overcome

  • New hardware devices, experimental and commercial, are now available

  • New software is coming online to allow run-time bitstream generation

  • And now for some predictions…


RTR Predictions

  • Security of reconfigurable devices comes into question and changes are made to address this issue

  • APIs to commercial FPGA bitstreams become commonplace, allowing more widespread innovation in RTR software

  • RTR hardware becomes an essential aspect of SOC solutions which, by their nature, avoid the “scale by adding more hardware” aspect of PCs

    • Will proliferate in industries that need low-cost, low-power, small solutions (e.g., cellular phones)

