1 / 18

A Physical Resource Management Approach to Minimizing FPGA Partial Reconfiguration Overhead

A Physical Resource Management Approach to Minimizing FPGA Partial Reconfiguration Overhead. Heng Tan and Ronald F. DeMara University of Central Florida. Agenda. Introduction Previous work on reducing Partial Reconfiguration overhead Develop a multi-step physical area management strategy

Download Presentation

A Physical Resource Management Approach to Minimizing FPGA Partial Reconfiguration Overhead

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. A Physical Resource Management Approach toMinimizing FPGA Partial Reconfiguration Overhead Heng Tan and Ronald F. DeMaraUniversity of Central Florida

  2. Agenda • Introduction • Previous work on reducing Partial Reconfiguration overhead • Develop a multi-step physical area management strategy • Experimental case studies: • 4 simple circuits: serial, parallel, block multiplier resource,high fanout carry chain LUT multiplier • 2 large circuits: SECDED, MD5 step function (new) • Conclusion and future work

  3. Motivation For partial reconfiguration operations, especially those running on a SOC configuration, • storage space and • reconfiguration speed are crucial.  Find way to optimize them . . .

  4. Previous Work • Architecture-level [Ganesan2000] • Pipelining • overlap execution of one temporal partition with reconfiguration of another • Logical level [Raghuraman2005] • Minimization of frames • relating the number of frames that need to be downloaded into FPGAs to the number of minterms of a specially constructed logic function • Hardware level [Compton2002][Hauck1999] • Requires dedicated HW component • Compression or defragmentation performed externally at runtime • Physical-level resource management? • Not specifically addressed for partial reconfiguration modules in literature • But can be done … and … also cascaded with above techniques for multiplicative benefit

  5. Module Based Flow Basic concept and capability for PR proposed by Xilinx • Including Fixed modules and PR modules • Using Bus Macro • Suitable for potential full automation • Primary PR technique to study

  6. Column-Level Configuration Format • For Virtex II/Pro series • Including IOB, IOI, CLB, GCLK, BlockRAM, and BlockRAM Interconnect • Labeling with 32-bit addresses composed of a Block Address (BA), a Major Address (MJA), a Minor Address (MNA), and a byte number • Only contents of CLB column are studied in this research

  7. Bitstream Compression • CLB columns program the configurable logic blocks, routing, and most interconnect resources. • Each CLB column contains 2 columns of slices. • 22 frames are utilized within the bitstream for each column of slices, describing the logic and routing information respectively. • Each frame occupies 424 bytes. • In bitstream of PR modules, a compression technique is already used by Xilinx to represent the unused CLB frame. • For unused CLB frames, only 10 bytes are used instead of 424 to describe empty contents.

  8. Optimization Strategy • Primary goal: minimize the number of columns of slices utilized, including routing resources • Secondary goal: do not incur increase in propagation delay after area optimization • Physical area management procedure: • Region Allocation: define size and boundary • Pin Assignment: top/bottom edge preferred • Column Alignment: fill column of slices at a time • Choke-Point Elimination: resolve high fanout cases • Repeat

  9. Case Study • The hardware platform is Xilinx Virtex II Pro VP7 device. • Module-based partial reconfiguration flow is adopted to generate the partial reconfiguration bitstream. • The Xilinx ISE 6.3 is used to support the module based flow. • The area constraints are entered directly into User Constrain File (.ucf)before map and routing. • Four representative small cases and two larger size cases studies are tested for the strategy. • Similar or identical external pin arrangement. • Hypothesis: larger circuits can achieve more savings

  10. 4-LUT Design Optimization (parallel) • Simple 4-LUT elements • Parallel logic path with direct input from external signals • LUTs feed outputs straight though to flip flops • Best strategy is by locating them in a single column close to the external pins

  11. Shifter Optimization (serial) • All logic elements are cascaded in contiguous string of CLBs • Attributes of this serial circuit functionality will have best arrangement from input to output in single column serially

  12. Block Multiplier Optimization • Block multiplier resource involved • Requires balancing the routing between two paths. • One path is between the block multiplier and the LUTs. The other path is from the LUTs to the external pins. • Decreased savings compared to use of LUT-only circuits

  13. LUT Multiplier Optimization • High fan-out occurs because of the carry chains • Single column style is not optimal • Arrange related LUTs around each other using adjacent columns

  14. Larger Case Studies • SECDED as full PR module and MD5 step functions were developed as PR module vs. SHA family • During the optimization process, not every slice has been specifically placed because of the large number of resources involved • Only the slices on the critical path are constrained. • These are comparatively larger modules, increased bitstream savings of 33% and 30% are achieved

  15. Benchmark Results

  16. Area Optimization Results • A physical resource area management strategy is proposed to minimize the reconfiguration overhead. • Experiments show that up to one-third of size reduction can be achieved for partial reconfiguration modules. • The maximum propagation delay has also been decreased slightly in most cases (not increased). • On the other hand, the larger the module is, the more complicated and time consuming the process becomes. • Providing autonomous area optimization capabilities is future work for integrating into our Multi-Layer Runtime Reconfiguration Architecture FGPA framework

  17. Multi-Layer Runtime Reconfiguration Arch. (MRRA)

  18. Current Work:Direct Bitstream Management • Change one-bit full adder to a one-bit full subtracter • Both have three one-bit inputs and two one-bit outputs. • Both used 2 LUTs with identical logic interconnections between LUTs and I/O signals. • Only difference between them is actually only one truth table stored inside one LUT, changing from 0xE8 to 0x8E Combined MD5 / SHA-1 Step Function – Area Utilization Original AlgorithmModule Based Frame Based SHA-1 192(slices) 65(slices)32(slices) MD5 881(slices) 168(slices) 32(slices) Combined 1068(slices) 324(slices) 32(slices)

More Related