
Architecture and Compilation for Reconfigurable Processors

Jason Cong, Yiping Fan, Guoling Han, Zhiru Zhang. Computer Science Department, UCLA. Nov 22, 2004.


Presentation Transcript


  1. Architecture and Compilation for Reconfigurable Processors. Jason Cong, Yiping Fan, Guoling Han, Zhiru Zhang. Computer Science Department, UCLA. Nov 22, 2004

  2. Outline • Motivation • Application-specific instruction set compilation • Register file data bandwidth problem • Architecture extension – shadow registers • Shadow register binding • Conclusions

  3. Reconfigurable Processor Platform [Figure: block diagram with labels Reconfigurable Processor Core, Bus, CPU] • Reconfigurable processor (RP) core + programmable fabric • RP core supports the basic instruction set plus customized instructions • Programmable fabric implements the customized instructions • Either runtime reconfigurable or pre-synthesized • Example: Nios / Nios II from Altera • Stratix version supported by the Nios 3.0 system • 5 extended instruction formats • Up to 2048 instructions for each format

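  4. Motivational Example [Figure: data-flow graph with inputs a, b, c, 2, 5, three multiplications and three additions, grouped into extop1 and extop2] • Original code: t1 = a * b; t2 = b * 2; t3 = c * 5; t4 = t1 + t2; t5 = t2 + t3; t6 = t5 + t4; • With extended instructions: t1 = extop1(a, b, 2); t2 = extop2(b, c, 2, 5); t3 = t1 + t2; • Latencies: *: 2 clock cycles, +: 1 clock cycle • Extended instruction set: I ∪ {extop1, extop2} • Execution time: 9 clock cycles with the basic instruction set, 5 clock cycles with the extended instruction set • Speedup: 1.8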

  5. Problem Statement • Given: an application program as a control data flow graph (CDFG) G(V, E); a processor with basic instruction set I; pattern constraints: number of inputs less than Nin, 1 output, total area no more than A • Objective: generate a pattern library P and map G to the extended instruction set I ∪ P so that the total execution time is minimized.

  6. Proposed ASIP Compilation Flow [Figure: flow diagram with boxes labeled C, ASIP constraints, Compilation, CDFG, Pattern Generation, Pattern Selection, Pattern Library, Application Mapping, Mapped CDFG, Instruction Implementation, ASIP Synthesis, Simulation, Implementation] • Extended instruction candidate generation: candidates must satisfy the I/O constraints • Extended instruction selection: select a subset that maximizes the potential speedup while satisfying the resource constraint • Code generation: graph covering that minimizes the total execution time

  7. Step 1. Pattern Enumeration [Figure: data-flow graph with inputs a, b, c, 2, 5; multiplication nodes n1, n2, n3; addition nodes n4, n5, n6] • Each pattern is an Nin-feasible cone • Cut enumeration is used to enumerate all the Nin-feasible cones [Cong et al., FPGA'99] • Basic idea: in topological order, merge the cuts of the fan-ins and discard the cuts that are not Nin-feasible • 3-feasible cones: n1: {a, b}; n2: {b, 2}; n3: {c, 5}; n4: {n1, n2}, {n1, b, 2}, {n2, a, b}, {a, b, 2}
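
The merge-and-discard idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name enumerate_cuts and the dictionary encoding of the graph are assumptions made here, the fan-ins of n5 and n6 are read off the code on the motivational example slide, and the constants 2 and 5 are treated as ordinary inputs, as the slide's cut lists suggest.

```python
from itertools import product

# Fan-ins of each operation node in the example graph (assumed encoding).
FANINS = {
    "n1": ("a", "b"), "n2": ("b", "2"), "n3": ("c", "5"),
    "n4": ("n1", "n2"), "n5": ("n2", "n3"), "n6": ("n5", "n4"),
}
# A topological order: inputs first, then nodes after their fan-ins.
ORDER = ["a", "b", "c", "2", "5", "n1", "n2", "n3", "n4", "n5", "n6"]


def enumerate_cuts(fanins, order, k):
    """Return, for every node, the set of its k-feasible cuts."""
    cuts = {}
    for v in order:
        ins = fanins.get(v, ())
        if not ins:
            # A primary input has only the trivial cut {v}.
            cuts[v] = {frozenset([v])}
            continue
        merged = {frozenset([v])}  # trivial cut of an internal node
        # Pick one cut per fan-in and take the union of the picks.
        for combo in product(*(cuts[u] for u in ins)):
            cut = frozenset().union(*combo)
            if len(cut) <= k:  # discard cuts that are not k-feasible
                merged.add(cut)
        cuts[v] = merged
    return cuts


if __name__ == "__main__":
    result = enumerate_cuts(FANINS, ORDER, 3)
    for node in ("n1", "n4", "n5"):
        nontrivial = [sorted(c) for c in result[node] if c != {node}]
        print(node, sorted(nontrivial))
```

With k = 3 the non-trivial cuts reported for n4 are the four listed on the slide, and the 4-input cut {b, 2, c, 5} of n5 is discarded.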

  8. Step 2. Pattern Selection [Figure: example data-flow graph with nodes n1 to n6 and the pattern *+ highlighted] • Basic idea: simultaneously consider speedup, occurrence frequency and area • Speedup: Tsw(p) = total execution time with basic instructions; Thw(p) = length of the critical path of scheduled p; Speedup(p) = Tsw(p) / Thw(p) • Occurrence: some pattern instances may be isomorphic; graph isomorphism test [Nauty package]; for small subgraphs the isomorphism test is very fast • Gain(p) = Speedup(p) × Occurrence(p) • Selection under the area constraint can be formulated as a 0-1 knapsack problem • Example: pattern *+ has Tsw = 3, Thw = 2, Speedup = 1.5
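
The 0-1 knapsack formulation can be sketched with the textbook dynamic program. The pattern names, occurrence counts and integer area units below are hypothetical values chosen for illustration, and select_patterns is a name introduced here, not one from the slides.

```python
def select_patterns(patterns, area_budget):
    """0-1 knapsack over patterns.

    patterns: list of dicts with keys name, speedup, occurrence, area
              (area is an integer number of area units).
    Returns (best total gain, list of selected pattern names).
    """
    # best[a] = (gain, names) achievable with total area at most a.
    best = [(0.0, [])] * (area_budget + 1)
    for p in patterns:
        gain = p["speedup"] * p["occurrence"]  # Gain(p) from the slide
        # Walk capacities downward so each pattern is used at most once.
        for a in range(area_budget, p["area"] - 1, -1):
            candidate = best[a - p["area"]][0] + gain
            if candidate > best[a][0]:
                best[a] = (candidate, best[a - p["area"]][1] + [p["name"]])
    return best[area_budget]


# Hypothetical pattern library (not data from the presentation).
LIBRARY = [
    {"name": "mul_add", "speedup": 1.5, "occurrence": 4, "area": 3},
    {"name": "mul_mul_add", "speedup": 2.0, "occurrence": 2, "area": 4},
    {"name": "add_add", "speedup": 1.25, "occurrence": 4, "area": 2},
]

if __name__ == "__main__":
    print(select_patterns(LIBRARY, 5))
```

With an area budget of 5 units this picks mul_add and add_add for a total gain of 11.0. The table has one entry per area unit, so the running time grows with the budget; a coarser area unit keeps it small.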

  9. Step 3. Application Mapping • Assume execution on an in-order, single-issue processor • Cover each node in G(V, E) with the extended instruction set to minimize the execution time • Trivial pattern: software execution time • Nontrivial pattern: hardware execution time • Total execution time = sum of the execution times of the pattern instances after application mapping • Theorem: the application mapping problem is equivalent to the library-based minimum-area technology mapping problem.
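
The cost model on this slide can be checked against the motivational example. The sketch below only evaluates a given cover; it does not search for the best one. The hardware latency of 2 cycles per extended instruction is an assumption inferred from the 5-cycle total on the motivational example slide, and total_time is a name introduced here.

```python
# Software latencies from the motivational example.
SW_LATENCY = {"*": 2, "+": 1}
# Operation performed by each node of the example graph.
OPS = {"n1": "*", "n2": "*", "n3": "*", "n4": "+", "n5": "+", "n6": "+"}


def total_time(cover, ops, sw_latency):
    """Sum the execution times of the pattern instances in a cover.

    cover: list of (nodes, hw_latency). A single-node entry is a trivial
    pattern and costs the software latency of its operation; a multi-node
    entry is a nontrivial pattern and costs its hardware latency.
    """
    covered = set()
    total = 0
    for nodes, hw_latency in cover:
        covered.update(nodes)
        if len(nodes) == 1:
            total += sw_latency[ops[nodes[0]]]
        else:
            total += hw_latency
    if covered != set(ops):
        raise ValueError("cover does not include every node")
    return total


# Every node executed as a basic instruction.
BASIC_COVER = [((n,), None) for n in OPS]
# extop1 and extop2 both contain n2, so it is duplicated in hardware.
# The latency of 2 for each extended instruction is an assumption.
EXTENDED_COVER = [
    (("n1", "n2", "n4"), 2),
    (("n2", "n3", "n5"), 2),
    (("n6",), None),
]

if __name__ == "__main__":
    t_basic = total_time(BASIC_COVER, OPS, SW_LATENCY)
    t_ext = total_time(EXTENDED_COVER, OPS, SW_LATENCY)
    print(t_basic, t_ext, t_basic / t_ext)
```

Under these assumptions the two covers cost 9 and 5 cycles, matching the speedup of 1.8 quoted for the example.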

  10. Speedup and Resource Overhead on NIOS

  11. Simulation Environment • SimpleScalar v3.0 • Benchmarks: from the MediaBench suite • Machine configuration: single-issue in-order processor (ARM-like) • DL1: 8KB, 4-way, 1 cycle • IL1: 8KB, direct mapped, 1 cycle • Unified L2: 256KB, 4-way, 8 cycles • Functional units: 2 IntAdd, 1 IntMult, 1 FPAdd, 1 FPMult • Reconfigurable units: latency is the critical path latency of the collapsed instructions

  12. Pattern Distribution • Most of the patterns contain fewer than 7 nodes

  13. Ideal Speedup under Different Input Size Constraints

  14. Outline • Motivation • Application-specific instruction set compilation • Register file data bandwidth problem • Architecture extension – shadow registers • Shadow register binding • Conclusions

  15. Register File Bandwidth Problem • Most of the speedup comes from clusters with more than two inputs • 2-port register file in embedded processors • Extra cycles are needed to transfer data for extended instructions with more than 2 inputs • Speedup drops due to communication overhead

  16. Speedup Drop with Different Input Constraints • Move operation takes one cycle • 46% speedup drop on average
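
A back-of-the-envelope model shows why the extra moves hurt. It assumes, as a simplification made here rather than something stated in the slides, that each operand beyond the register file's read ports costs one move and that the move cycles simply add to the hardware latency. The Tsw and Thw values in the usage lines are hypothetical.

```python
def extra_moves(num_inputs, read_ports=2):
    """Moves needed to deliver the operands that exceed the read ports."""
    return max(0, num_inputs - read_ports)


def effective_speedup(t_sw, t_hw, num_inputs, read_ports=2, move_cycles=1):
    """Speedup of one pattern once the operand moves are charged to it."""
    overhead = extra_moves(num_inputs, read_ports) * move_cycles
    return t_sw / (t_hw + overhead)


if __name__ == "__main__":
    # Hypothetical pattern: 6 cycles in software, 2 cycles in hardware.
    for n in (2, 3, 4):
        print(n, effective_speedup(6, 2, n))
```

For this hypothetical pattern the ideal speedup of 3.0 falls to 2.0 with three inputs and to 1.5 with four.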

  17. Outline • Motivation • Application-specific instruction set compilation • Register file data bandwidth problem • Architecture extension – shadow registers • Shadow register binding • Conclusions

  18. Architecture Extensions • Existing solutions • Dedicated data link: avoids potential resource contention on the bus; needs extra cycles for communication; employed in MicroBlaze from Xilinx • Multiport register file: low utilization when executing basic instructions; area and power grow cubically • Register file replication: predetermined one-to-one correspondence; wastes area and power; limits compiler optimization

  19. Our Approach – Shadow Registers • Core registers are augmented by an extra set of shadow registers • Conditionally written • Used only by the custom logic

  20. Shadow Registers • Controlling the shadow register • Advantages and limitations • Cost-efficient for a small number of shadow registers • Only a few control signals need to be added • Opportunity for compiler optimization • Requires extra control bits

  21. Outline • Motivation • Application-specific instruction set compilation • Register file data bandwidth problem • Architecture extension – shadow registers • Shadow register binding • Conclusions

  22. Internal Representation • 2-level CDFG representation • 1st level: control flow graph • 2nd level: data flow graph • Computation node: latency and scheduled time slot • Data edge: lifetime • Variable: lifetime [Figure: the scheduled code below over time slots 1 to 6, with data edges e1 to e4] i1 = …; i2 = ext1(…, i1, …); i3 = …; i4 = ext2(…, i1, …); i5 = ext3(…, i3, …); i6 = ext4(…, i3, …); • Lifetime of e1 = [2, 2] • Lifetime of e2 = [2, 4] • Lifetime of i1 = [2, 4]
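
The lifetimes quoted on this slide can be reproduced from a schedule. The sketch assumes that instruction ik occupies time slot k, that every node has a latency of one slot, and that e1 to e4 are the edges i1→i2, i1→i4, i3→i5 and i3→i6. Only the lifetimes of e1, e2 and i1 are given in the source, so the values computed for e3, e4 and i3 are illustrative.

```python
# Assumed schedule: instruction ik runs in time slot k, unit latency.
SCHEDULE = {"i1": 1, "i2": 2, "i3": 3, "i4": 4, "i5": 5, "i6": 6}
LATENCY = {name: 1 for name in SCHEDULE}
# Data edges as (producer, consumer); e3 and e4 are assumed from the code.
EDGES = {
    "e1": ("i1", "i2"), "e2": ("i1", "i4"),
    "e3": ("i3", "i5"), "e4": ("i3", "i6"),
}


def edge_lifetimes(schedule, latency, edges):
    """An edge lives from the slot its value is ready to the slot it is read."""
    return {
        name: (schedule[src] + latency[src], schedule[dst])
        for name, (src, dst) in edges.items()
    }


def variable_lifetimes(edge_life, edges):
    """A variable lives over the union of the lifetimes of its outgoing edges."""
    life = {}
    for name, (src, _dst) in edges.items():
        start, end = edge_life[name]
        if src in life:
            old_start, old_end = life[src]
            life[src] = (min(old_start, start), max(old_end, end))
        else:
            life[src] = (start, end)
    return life


if __name__ == "__main__":
    per_edge = edge_lifetimes(SCHEDULE, LATENCY, EDGES)
    print(per_edge)
    print(variable_lifetimes(per_edge, EDGES))
```

Under these assumptions e1, e2 and i1 come out as [2, 2], [2, 4] and [2, 4], in agreement with the slide.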

  23. Observation • 2-port register file • 3-input extended instructions • Without a shadow register: 4 additional moves • Binding for 1 register [Figure: the scheduled code below over time slots 1 to 6, with data edges e1 to e4] i1 = …; i2 = ext1(…, i1, …); i3 = …; i4 = ext2(…, i1, …); i5 = ext3(…, i3, …); i6 = ext4(…, i3, …); • Binding 1: either i1 or i3 in the shadow register saves 2 moves • Binding 2: saves 3 moves

  24. Register Binding • Which operands should be bound? • Each input could be a candidate • Binding different candidates leads to different savings • Unaffordable to try all the combinations

  25. One Shadow Register Binding Problem • Problem formulation • Given: a scheduled DFG and one shadow register • Objective: bind variables to the shadow register so that the number of moves is minimized

  26. Algorithm for Binding One Shadow Register • Weighted compatibility graph • Vertex <-> data edge in the DFG • Weight <-> number of moves saved if the value is kept in the shadow register • Edge <-> lifetimes do not overlap • Theorem: the binding problem is equivalent to finding a maximum weighted chain in the compatibility graph • Can be solved optimally in time O(|V'| + |E'|) • Extension to K shadow registers
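
A maximum weighted chain in a compatibility graph can be found with one dynamic-programming pass in topological order. The sketch below is an illustration under simplifying assumptions, not the authors' algorithm: each vertex is an independent value with a closed lifetime interval, two values are compatible when one lifetime ends strictly before the other begins, and the intervals and weights are hypothetical. Building the edges here takes quadratic time; only the pass over the finished graph is linear in its size.

```python
def max_weight_chain(intervals, weights):
    """Return (total weight, chain) of a maximum weighted chain.

    intervals: name -> (start, end), closed lifetime interval.
    weights:   name -> number of moves saved by keeping the value
               in the shadow register.
    """
    # Sorting by start time is a topological order, because an edge
    # u -> v requires end(u) < start(v) and hence start(u) < start(v).
    order = sorted(intervals, key=lambda v: intervals[v][0])
    best = {}  # best[v] = weight of the heaviest chain ending at v
    prev = {}  # predecessor of v on that chain
    for v in order:
        best[v] = weights[v]
        prev[v] = None
        for u in order:
            if intervals[u][1] < intervals[v][0]:  # compatible, u before v
                if best[u] + weights[v] > best[v]:
                    best[v] = best[u] + weights[v]
                    prev[v] = u
    # Walk back from the best chain end to recover the chain.
    end = max(order, key=lambda v: best[v])
    total = best[end]
    chain = []
    while end is not None:
        chain.append(end)
        end = prev[end]
    return total, chain[::-1]


# Hypothetical values: lifetimes and moves saved (not from the slides).
LIFETIMES = {"x1": (2, 2), "x2": (2, 4), "x3": (4, 5), "x4": (5, 6)}
SAVED = {"x1": 1, "x2": 2, "x3": 1, "x4": 1}

if __name__ == "__main__":
    print(max_weight_chain(LIFETIMES, SAVED))
```

For these values the best chain keeps x2 and then x4 in the shadow register and saves 3 moves.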

  27. Experimental Results (1) Speedup under different number of shadow registers for 3-input extended instructions

  28. Experimental Results (2) Speedup under different number of shadow registers for 4-input extended instructions

  29. Conclusions • Proposed and developed a complete compilation flow • Observed and quantitatively analyzed the data bandwidth problem • Proposed a novel architecture extension and an efficient register binding algorithm

  30. Thank You
