
Conjoining Soft-Core FPGA Processors



Presentation Transcript


  1. Conjoining Soft-Core FPGA Processors
     David Sheldon (a), Rakesh Kumar (b), Frank Vahid (a,*), Dean Tullsen (b), Roman Lysecky (c)
     (a) Department of Computer Science and Engineering, University of California, Riverside
     (*) Also with the Center for Embedded Computer Systems at UC Irvine
     (b) Department of Computer Science and Engineering, University of California, San Diego
     (c) Department of Electrical and Computer Engineering, University of Arizona
     This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and by hardware and software donations from Xilinx.

  2. FPGA Soft Core Processors
     • Soft-core processor
     • HDL description
     • Flexible implementation
     • FPGA or ASIC
     • Technology independent
     [Figure: an HDL description targeting either an FPGA (Spartan 3, Virtex 2, Virtex 4) or an ASIC]
     David Sheldon, UC Riverside

  3. FPGA Soft Core Processors
     • Soft-core processors can have configurable options
       • Datapath units
       • Cache
       • Bus architecture
     • Current commercial FPGA soft-core processors
       • Xilinx MicroBlaze
       • Altera Nios
     [Figure: an FPGA containing a microprocessor (μP) with optional FPU, MAC, and cache]

  4. Conjoinment Overview
     • Add necessary units to both processors
     • Conjoin the FPU unit ("conjoining")
     [Figure: two base microprocessors running Application 1 and Application 2, first with one FPU each, then sharing a single conjoined FPU]

  5. Conjoinment Background
     • Conjoinment proposed for multicore desktop processing (Kumar 2004)
     • Reduces size with reasonable performance overhead
     • e.g., cache conjoinment overhead: 1%-13%
     [Figure: ICache sharing and DCache sharing]

  6. Outline
     • Conjoinment for soft-core FPGA processors
       • Area savings
       • Performance overhead
     • Tuning heuristic for two configurable soft-cores with conjoin option

  7. Area Savings
     • Significant potential area savings
     • Limitations
       • Does not consider multiplexing costs
       • Due to absence of FPGA synthesis tools supporting conjoinment
     • But good potential justifies further investigation

     Unit             Size
     Multiplier       1331
     Barrel Shifter   228
     Divider          122
     FPU              2738

     [Figure: area chart covering the Base MicroBlaze, Multiplier, Barrel Shifter, Divider, and FPU; values shown: 32%, 23%, 6%, 4%]

  8. Outline
     • Conjoinment for soft-core FPGA processors
       • Area savings
       • Performance overhead
     • Tuning heuristic for two configurable soft-cores with conjoin option

  9. Performance Overhead
     • No simulator exists for conjoined processors
       • We developed our own
     • Trace-based conjoined processor simulator
     • Simulation uses pessimistic performance assumptions
       • Kumar's techniques can improve
     • Simulator outputs contention information
     • Final cycles can be compared to unconjoined to determine performance overhead
     [Figure: app1 and app2 (e.g., brev and bitmnp) run through the Xilinx simulator to produce trace1 and trace2, which feed the conjoined simulator; the resulting timeline marks access stalls and contention stalls]
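The trace-based approach on slide 9 can be illustrated with a toy model. The sketch below is not the authors' simulator. The trace format (a list of booleans, True meaning the instruction needs the shared unit), the one-cycle base cost per instruction, the fixed priority of processor 1, and the assumption that the shared unit stays occupied during the access stall are all hypothetical choices made for illustration.

```python
def simulate_conjoined(trace1, trace2, access_stall=1):
    """Toy two-processor simulation with one conjoined (shared) unit.

    Returns (finish_cycles, contention_stalls), each a two-element list
    indexed by processor.
    """
    traces = [trace1, trace2]
    pc = [0, 0]          # next instruction index per processor
    ready_at = [0, 0]    # cycle at which each processor may issue again
    contention = [0, 0]  # cycles spent waiting for the shared unit
    unit_free_at = 0     # cycle at which the shared unit becomes free
    cycle = 0
    while pc[0] < len(trace1) or pc[1] < len(trace2):
        for core in (0, 1):  # assumption: processor 1 always wins ties
            if pc[core] >= len(traces[core]) or ready_at[core] > cycle:
                continue
            if not traces[core][pc[core]]:
                ready_at[core] = cycle + 1  # ordinary instruction
            elif unit_free_at <= cycle:
                # Shared-unit instruction: pay the access stall
                ready_at[core] = cycle + 1 + access_stall
                unit_free_at = ready_at[core]
            else:
                contention[core] += 1  # unit busy, retry next cycle
                continue
            pc[core] += 1
        cycle += 1
    return ready_at, contention


def overhead_percent(conjoined_cycles, unconjoined_cycles):
    """Relative slowdown of the conjoined run versus the unconjoined run."""
    return 100.0 * (conjoined_cycles - unconjoined_cycles) / unconjoined_cycles


finish, stalls = simulate_conjoined([False, True, False], [True, True])
# Unconjoined baseline under the same toy cost model: one cycle per instruction
print(finish, stalls, overhead_percent(finish[0], 3))
```

The toy traces are far too short to give realistic overheads; the point is only to show where access stalls and contention stalls enter the cycle count.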

  10. Performance Overhead
     • Speedup: Application time on optimally configured processor / avg. app. time on base processor
     • Compared configuration with conjoinment versus without
     • Performance overhead usually small, averaged just 4.2%
     • Overhead caused by access delays and contention of the hardware units
     [Figure: speedup charts for brev and bitmnp; annotated overheads: 2.4%, 17%]

  11. Outline
     • Conjoinment for soft-core FPGA processors
       • Area savings
       • Performance overhead
     • Tuning heuristic for two configurable soft-cores with conjoin option

  12. Tuning Heuristic
     • 5 choices per unit
       • e.g., FPU: no unit, 1 only, 2 only, 1 & 2, and conjoined
     • 4 units → 5^4 = 625 possible configurations
     • Simulation: ~30 minutes per configuration
     • Need search heuristic to tune
     [Figure: Base MicroBlaze 1 and Base MicroBlaze 2 with the FPU options (no FPU, FPU 1, FPU 2, FPU conjoined), alongside the multiplier, barrel shifter, and divider]
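The size of the search space on slide 12 can be checked by enumeration. This is a minimal sketch; the choice labels are illustrative names, and the time estimate treats the slide's "~30 minutes" as exactly 30.

```python
from itertools import product

UNITS = ("FPU", "Multiplier", "Barrel Shifter", "Divider")
# Five options per unit, following the FPU example on the slide
CHOICES = ("none", "proc 1 only", "proc 2 only", "both", "conjoined")

# One configuration = one choice for each of the four units
configs = list(product(CHOICES, repeat=len(UNITS)))

minutes = len(configs) * 30  # assumption: exactly 30 minutes per simulation
print(len(configs))          # 625
print(minutes / 60 / 24)     # roughly 13 days of simulation
```

At roughly 13 days for an exhaustive sweep of a single application pair, the need for a search heuristic is clear.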

  13. Map to 0-1 Knapsack Problem
     Creating the model

                      BS     FPU    MUL    DIV
     Perf increment   1.1    0.9    1.2    1.0
     Size increment   1.4    2.7    1.8    1.1
     Perf/Size        0.96   0.34   0.63   0.93

     [Figure: the application is run on the Base MicroBlaze plus each unit (FPU, barrel shifter, multiplier, divider); synthesis gives each unit's size and simulation gives its performance]

  14. Map to 0-1 Knapsack Problem
     • First consider tuning without conjoinment
     • Problem of instantiating units to limited FPGA size can be mapped to the 0-1 knapsack problem
     • Add items, each with weight and benefit, to weight-constrained knapsack such that profit maximized

     Item     Weight   Benefit
     MUL 1    1331     0.08
     BS 1     228      0.62
     DIV 1    121      0.00
     MUL 2    1331     0.22
     BS 2     228      0.76
     DIV 2    121      0.00
     FPU 1    2738     0.00
     FPU 2    2738     0.00

     Note: Mapping inexact; weights/benefits not strictly additive
     [Figure: items for processors 1 and 2 placed beside the two Base MicroBlazes in the available FPGA "knapsack"]
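The mapping on slide 14 can be tried out with the textbook dynamic-programming solution to the 0-1 knapsack problem. This is a sketch, not the authors' tool: the pairing of item names with weights is inferred from the unit sizes on slide 7, the capacity of 2000 is a hypothetical free-area budget, and the model assumes additive weights and benefits, which the slide notes is only approximately true.

```python
# (name, weight, benefit); name-to-weight pairing inferred from slide 7
ITEMS = [
    ("MUL 1", 1331, 0.08), ("BS 1", 228, 0.62), ("DIV 1", 121, 0.00),
    ("MUL 2", 1331, 0.22), ("BS 2", 228, 0.76), ("DIV 2", 121, 0.00),
    ("FPU 1", 2738, 0.00), ("FPU 2", 2738, 0.00),
]


def knapsack_01(items, capacity):
    """Classic 0-1 knapsack by dynamic programming over integer capacity.

    Returns (best total benefit, tuple of chosen item names).
    """
    # best[c] = (benefit, chosen names) achievable with total weight <= c
    best = [(0.0, ())] * (capacity + 1)
    for name, weight, benefit in items:
        # Go downward so each item is used at most once
        for c in range(capacity, weight - 1, -1):
            candidate = best[c - weight][0] + benefit
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - weight][1] + (name,))
    return best[capacity]


benefit, chosen = knapsack_01(ITEMS, capacity=2000)  # hypothetical capacity
print(round(benefit, 2), sorted(chosen))
```

With this budget the multiplier for processor 2 and both barrel shifters fit; the zero-benefit items are never selected because only strict improvements are accepted.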

  15. Disjunctively Constrained Knapsack
     • Problem: If conjoined unit included, can't also include standalone unit
     • Solution: Map to disjunctively-constrained 0-1 knapsack
       • Yamada T., "Heuristic and Exact Algorithms for the Disjunctively Constrained Knapsack Problem", 2002
       • Prohibits specific item pairs from being in the knapsack
       • ILP solution, running time is pseudo polynomial
     [Figure: standalone items (FPU 1, FPU 2, MUL 1, MUL 2, ...) and conjoined items (FPU C, MUL C, ...) beside the two Base MicroBlazes in the available FPGA "knapsack"]
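The disjunctive constraint on slide 15 can be made concrete with a small exhaustive solver. This is an illustration only, not the algorithm from the cited paper: it tries every subset, so it is exponential in the number of items, and the item names, weights, and benefits are made-up values.

```python
from itertools import combinations


def solve_dckp(items, conflicts, capacity):
    """Exhaustive disjunctively constrained 0-1 knapsack.

    items: dict mapping name -> (weight, benefit)
    conflicts: set of frozenset pairs that may not be chosen together
    Returns (best benefit, frozenset of chosen names).
    """
    names = list(items)
    best_benefit, best_set = 0.0, frozenset()
    for r in range(1, len(names) + 1):
        for subset in combinations(names, r):
            chosen = frozenset(subset)
            # Reject any subset containing a prohibited pair
            if any(pair <= chosen for pair in conflicts):
                continue
            weight = sum(items[n][0] for n in subset)
            benefit = sum(items[n][1] for n in subset)
            if weight <= capacity and benefit > best_benefit:
                best_benefit, best_set = benefit, chosen
    return best_benefit, best_set


# Hypothetical data: unit U standalone on either processor or conjoined
items = {"U 1": (3, 4.0), "U 2": (3, 5.0), "U C": (3, 8.0), "V 1": (2, 3.0)}
conflicts = {frozenset({"U C", "U 1"}), frozenset({"U C", "U 2"})}
print(solve_dckp(items, conflicts, capacity=6))
```

Without the conflict pairs, the best subset would combine the conjoined unit with a standalone copy of the same unit, which is exactly the configuration the constraint is meant to rule out.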

  16. Disjunctively Constrained Knapsack
     • Conjoined units show a small decrease in benefit relative to the unconjoined units
     • Conjoined units provide benefits to both processors

     Standalone items:
     Item     Weight   Benefit
     MUL 1    1331     0.08
     BS 1     228      0.62
     DIV 1    121      0
     MUL 2    1331     0.22
     BS 2     228      0.76
     DIV 2    121      0
     FPU 1    2738     0
     FPU 2    2738     0

     Conjoined items:
     Item     Weight   Benefit 1   Benefit 2
     MUL C    1331     0.06        0.21
     BS C     228      0.54        0.71
     DIV C    121      0           0
     FPU C    2738     0           0

     [Figure: standalone and conjoined items beside the two Base MicroBlazes in the available FPGA "knapsack"]
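The slide 16 numbers can be fed to a simple greedy heuristic to see how conjoined items compete with standalone ones. The slides do not say which heuristic the authors used, so the benefit-per-weight ordering below is an assumption. Two further assumptions: a conjoined item's total benefit is taken as the sum of its benefits to the two processors, and the 1600 capacity is a hypothetical area budget.

```python
UNIT_SIZES = {"MUL": 1331, "BS": 228, "DIV": 121, "FPU": 2738}
# (benefit to processor 1, benefit to processor 2), from slide 16
STANDALONE = {"MUL": (0.08, 0.22), "BS": (0.62, 0.76),
              "DIV": (0.0, 0.0), "FPU": (0.0, 0.0)}
CONJOINED = {"MUL": (0.06, 0.21), "BS": (0.54, 0.71),
             "DIV": (0.0, 0.0), "FPU": (0.0, 0.0)}


def build_items():
    """Return (items, conflicts) for the disjunctive knapsack model."""
    items, conflicts = {}, set()
    for unit, size in UNIT_SIZES.items():
        b1, b2 = STANDALONE[unit]
        items[f"{unit} 1"] = (size, b1)
        items[f"{unit} 2"] = (size, b2)
        c1, c2 = CONJOINED[unit]
        items[f"{unit} C"] = (size, c1 + c2)  # assumption: benefits add up
        # A conjoined unit excludes both standalone copies
        conflicts.add(frozenset({f"{unit} C", f"{unit} 1"}))
        conflicts.add(frozenset({f"{unit} C", f"{unit} 2"}))
    return items, conflicts


def greedy_dckp(items, conflicts, capacity):
    """Greedy by benefit/weight ratio, skipping conflicting items."""
    order = sorted((n for n in items if items[n][1] > 0),
                   key=lambda n: items[n][1] / items[n][0], reverse=True)
    chosen, used = [], 0
    for name in order:
        weight = items[name][0]
        clash = any(frozenset({name, other}) in conflicts for other in chosen)
        if not clash and used + weight <= capacity:
            chosen.append(name)
            used += weight
    return chosen, used


items, conflicts = build_items()
print(greedy_dckp(items, conflicts, capacity=1600))  # hypothetical capacity
```

Under these assumptions the greedy pass picks the conjoined barrel shifter and conjoined multiplier. A greedy ordering like this carries no optimality guarantee.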

  17. Disjunctively Constrained Knapsack
     • Running time
     • Modeling
       • 5 synthesis runs for each processor
       • At most 4 runs of the conjoined simulator
     • Disjunctively constrained 0-1 knapsack
       • NP-complete problem
       • Solved with a heuristic
       • Heuristic takes < 1 min

  18. Results
     • Data gathered for the Xilinx MicroBlaze soft-core processor
     • 10 EEMBC and Powerstone benchmarks
       • aifir, BaseFP01, bitmnp, brev, canrdr, g3fax, g721_ps, idct, matmul, tblook, ttsprk
     • Obtained results for all possible pairwise conjoinments
     • We only show conjoinment data when both applications use the unit
       • To avoid making conjoinment appear better than it is

  19. Results
     • Knapsack approach finds near-optimal results in most cases

  20. Results
     • Knapsack heuristic finds near-optimal results in most cases (versus exhaustive with conjoinment)
     • Runs in seconds
     • One example had sub-optimal results (2.9 times slower)
     • Performance overhead due to conjoinment is just a few percent on average

  21. Results
     • On average the knapsack approach yields the same size as the exhaustive with conjoinment
     • Average size savings of 16%

  22. Conclusions
     • Conjoining two soft-core FPGA processors reduces average size by 16%
     • Performance overhead is just a few percent in most cases
     • Disjunctively constrained 0-1 knapsack approach finds near-optimal results in most cases
       • But could be improved for some examples
     • Future work
       • Consider multiplexing size and delay overheads
       • Apply Kumar's advanced conjoining techniques to reduce overheads
