Conjoining Soft-Core FPGA Processors

Conjoining Soft-Core FPGA Processors David Sheldona, Rakesh Kumarb, Frank Vahida*, Dean Tullsenb , Roman Lyseckyc aDepartment of Computer Science and Engineering University of California, Riverside *Also with the Center for Embedded Computer Systems at UC Irvine bDepartment of Computer Science and Engineering University of California, San Diego cDepartment of Electrical and Computer Engineering University of Arizona This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and by hardware and software donations from Xilinx

FPGA Soft Core Processors HDL Description • Soft-core Processor • HDL description • Flexible implementation • FPGA or ASIC • Technology independent FPGA ASIC Spartan 3 Virtex 2 Virtex 4 David Sheldon, UC Riverside

FPGA FPGA Soft Core Processors • Soft Core Processors can have configurable options • Datapath units • Cache • Bus architecture • Current commercial FPGA Soft-Core Processors • Xilinx Microblaze • Altera Nios μP FPU MAC Cache David Sheldon, UC Riverside

Conjoined FPU unit Conjoinment Overview • Add necessary units to both processors Application 1 Application 2 Base micro-processor Base micro-processor FPU FPU FPU FPU FPU • Conjoin the FPU Unit “Conjoining” David Sheldon, UC Riverside

Conjoinment Background • Conjoinment proposed for multicore desktop processing (Kumar 2004) • Reduces size with reasonable performance overhead • e.g., cache conjoinment overhead: 1%-13% ICache Sharing DCache Sharing David Sheldon, UC Riverside

size perf ? Outline • Conjoinment for soft-core FPGA processors • Area savings • Performance overhead • Tuning heuristic for two configurable soft-cores with conjoin option David Sheldon, UC Riverside

Barrel Shifter Divider Area Savings • Significant potential area savings • Limitations • Does not consider multiplexing costs • Due to absence of FPGA synthesis tools supporting conjoinment • But good potential justifies further investigation Multiplier 32% Base MicroBlaze 23% 6% FPU 4% Unit Size Multiplier 1331 Barrel Shifter 228 Divider 122 FPU 2738 David Sheldon, UC Riverside

trace1 trace2 Access stall Contention stall Performance Overhead • No simulator exists for conjoined processors • We developed our own • Trace-based conjoined processor simulator • Simulation uses pessimistic performance assumptions • Kumar's techniques can improve • Simulator outputs contention information • Final cycles can be compared to unconjoined to determine performance overhead app1 app2 Xilinx simulator brev Conj. simulator bitmnp David Sheldon, UC Riverside

brev bitmnp Performance Overhead • Speedup: Application time on optimally configured processor / avg. app. time on base processor • Compared configuration with conjoinment versus without • Performance overhead usually small, averaged just 4.2% • Overhead caused by access delays and contention of the hardware units 2.4% 17% David Sheldon, UC Riverside

Multiplier Multiplier Multiplier Barrel Shifter Divider Tuning Heuristic • 5 choices per unit • e.g., FPU – no unit, 1 only, 2 only, 1 & 2, and conjoined • 4 units  54 = 625 possible configurations • Simulation: ~30 minutes per configuration • Need search heuristic to tune Base MicroBlaze 2 Base MicroBlaze 1 NO FPU FPU 2 NO FPU FPU 1 FPU conjoined David Sheldon, UC Riverside

Synthesis Synthesis FPU Barrel Shifter Multiplier Divider FPU App perf perf perf perf size size size size Base MicroBlaze MicroBlaze Map to 0-1 Knapsack Problem Creating the model BS FPU MUL DIV Perf increment 1.1 0.9 1.2 1.0 Size increment 1.4 2.7 1.8 1.1 Perf/Size 0.96 0.34 0.63 0.93 David Sheldon, UC Riverside

Map to 0-1 Knapsack Problem • First consider tuning without conjoinment • Problem of instantiating units to limited FPGA size can be mapped to the 0-1 knapsack problem • Add items, each with weight and benefit, to weight-constrained knapsack such that profit maximized FPU 2 FPU 1 MUL 2 Items: MUL 1 2 2 1 1 Weights: 1331 228 121 1331 228 121 2738 2738 Benefits: 0.08 0.62 0.00 0.22 0.76 0.00 0.00 0.00 MUL 1 Base MicroBlaze Base MicroBlaze FPU 1 Note: Mapping inexact – weights/benefits not strictly additive MUL 2 Available FPGA Knapsack David Sheldon, UC Riverside

Disjunctively Constrained Knapsack • Problem: If conjoined unit included, can't also include standalone unit • Solution: Map to disjunctively-constrained 0-1 knapsack • Yanada T., “Heuristic and Exact Algorithms for the Disjunctively Constrained Knapsack Problem”, 2002 • Prohibits specific item pairs from being in the knapsack • ILP solution, running time is pseudo polynomial FPU 2 FPU 1 MUL 2 Items: MUL 1 2 2 1 1 FPU C MUL C C C Base MicroBlaze Base MicroBlaze Available FPGA Knapsack David Sheldon, UC Riverside

Disjunctively Constrained Knapsack FPU 2 • Conjoined benefits shows a small decrease in benefit from the unconjoined unit FPU 1 MUL 2 Items: MUL 1 MUL 1 2 2 1 1 Weights: 1331 228 121 1331 228 121 2738 2738 Benefits: 0.08 0.62 0 0.22 0.76 0 0 0 FPU C MUL C MUL C C C Weights: 1331 228 121 2738 Benefits 1: 0.06 0.54 0 0 Benefits 2: 0.21 0.71 0 0 • Conjoined units provide benefits to both processors Base MicroBlaze Base MicroBlaze Available FPGA Knapsack David Sheldon, UC Riverside

Disjunctively Constrained Knapsack • Running Time • Modeling • 5 Synthesis runs for each Processor • At most 4 runs of the conjoined Simulator • Disjunctively Constrained 0-1 Knapsack • NP-complete problem • Solved with a heuristic • Heuristic takes < 1 min David Sheldon, UC Riverside

Results • Data gathered for the Xilinx Microblaze Soft-core Processor • 10 EEMBC and Powerstone benchmarks • aifir, BaseFP01, bitmnp, brev, canrdr, g3fax, g721_ps, idct, matmul, tblook, ttsprk • Obtained results for all possible pairwise conjoinment • We only show conjoinment data when both applications use unit • To avoid making conjoinment appear better than it is David Sheldon, UC Riverside

Results Knapsack approach finds near-optimal in most cases David Sheldon, UC Riverside

Results • Knapsack heuristic finds near-optimal in most cases (versus exhaustive with conjoinment) • Runs in seconds • One example had sub-optimal results (2.9 times slower) • Performance overhead due to conjoinment just a few percent on average David Sheldon, UC Riverside

Results • On average the knapsack approach yields the same size as the exhaustive with conjoinment • Average size savings of 16% David Sheldon, UC Riverside

Conclusions • Conjoining two soft-core FPGA processors reduces average size by 16% • Performance overhead just a few percent in most cases • Disjunctively constrained 0-1 knapsack approach finds near-optimal in most cases • But could be improved for some examples • Future • Consider multiplexing size and delay overheads • Apply Kumar's advanced conjoining techniques to reduce overheads David Sheldon, UC Riverside

Conjoining Soft-Core FPGA Processors

Conjoining Soft-Core FPGA Processors

Presentation Transcript

Conjoining Soft-Core FPGA Processors

Multi-core Processors and Virtualization

Heterogeneous Multi-Core Processors

Embedding Constraint Satisfaction using Parallel Soft-Core Processors on FPGAs

Multi-core processors

Soft Vector Processors with Streaming Pipelines

Application-Specific Customization of Parameterized FPGA Soft-Core Processors

The Microarchitecture of FPGA-Based Soft Processors

FPGA Calculator Core

FPGA Calculator Core

The Microarchitecture of FPGA-Based Soft Processors

FPGA Multi-core

Application-Specific Customization of FPGA Soft-core Processors

Microblaze Soft Processor Core

Improving Pipelined Soft Processors with Multithreading

Network Processors A generation of multi-core processors

FPGA Multi-core

Network Processors A generation of multi-core processors

Multi-core processors

Heterogeneous Multi-Core Processors

Multi-core processors