
xPilot: A Platform-Based Behavioral Synthesis System

Prof. Jason Cong. Students: Deming Chen, Yiping Fan, Guoling Han, Wei Jiang, Zhiru Zhang. August 2005. Supported by NSF, GSRC, Altera, Xilinx.


Presentation Transcript


  1. xPilot: A Platform-Based Behavioral Synthesis System • Prof. Jason Cong • Students: Deming Chen, Yiping Fan, Guoling Han, Wei Jiang, Zhiru Zhang • August 2005 • Supported by NSF, GSRC, Altera, Xilinx

  2. Outline • Motivation • xPilot system framework • Overview of the synthesis engine • Scheduling • Resource binding • Experimental results

  3. Motivation (1) • Design complexity is outgrowing the traditional RTL method • Feasible to build SoC devices with 500M transistors; billion-transistor chips are on the horizon • Behavioral synthesis: a critical technology for enabling the move to a higher level of abstraction • Reasons for previous failures • Lack of a compelling reason: design complexity was still manageable a decade ago • Lack of a solid RTL foundation • Lack of consideration of physical reality

  4. Motivation (2) • Behavioral synthesis provides combined advantages • Better complexity management • Code size: RTL design ~300KL → behavioral design 40KL [NEC, ASPDAC04] • Shorter verification/simulation cycle • Simulation speed 100X faster than RTL-based method • Rapid system exploration • Quick evaluation of different hardware/software boundaries • Fast exploration of multiple micro-architecture alternatives • Higher quality of results • Full consideration of physical reality

  5. xPilot: Platform-Based Behavioral-to-RTL Synthesis Flow • Flow (diagram): behavioral spec. in C/SystemC + platform description → front-end compiler → SSDM → RTL → FPGAs/ASICs • Presynthesis optimizations • Loop unrolling/shifting • Strength reduction / tree height reduction • Bitwidth analysis • Memory analysis … • Core synthesis optimizations • Scheduling • Resource binding, e.g., functional unit binding, register/port binding • Arch-generation & RTL/constraints generation • Verilog/VHDL/SystemC • FPGAs: Altera, Xilinx • ASICs: Magma, Synopsys, …

  6. System-level Synthesis Data Model • SSDM (System-level Synthesis Data Model) • Hierarchical netlist of concurrent processes and communication channels • Each leaf process contains a sequential program which is represented by an extended LLVM IR with hardware-specific semantics • Port / IO interfaces, bit-vector manipulations, cycle-level notations

  7. Platform Modeling & Characterization • Target platform specification • High-level resource library with delay/latency/area/power curve for various input/bitwidth configurations • Functional units: adders, ALUs, multipliers, comparators, etc. • Connectors: mux, demux, etc. • Memories: registers, synchronous memories, etc. • Chip layout description • On-chip resource distributions • On-chip interconnect delay/power estimation

  8. Scheduling: Goals • A highly versatile scheduling engine • Applicable to a wide range of application domains • Computation-intensive, data/memory-intensive, control-intensive, etc. • Mixed behavioral & RTL • Amenable to a rich set of scheduling constraints • Data dependency constraints • Resource constraints: IO port constraints, memory port constraints, functional unit constraints, etc. • Timing constraints: frequency constraints, latency constraints, etc. • Relative IO timing constraints: cycle-fixed mode, superstate-fixed mode, free-floating mode, etc. • Retargetable to a variety of design objectives • High performance, small area, low power, etc.

  9. Scheduling: Optimization Capabilities • Offers a variety of optimization techniques in a unified framework • Combinational/sequential non-pipelined/pipelined multi-cycle operation • Unconditional/conditional operation chaining • Relative scheduling • Considerations of branching probabilities and repetitions • Multi-cycle communication (under development) • Code motion & speculation (under development) • Functional / loop pipelining (under development) • Physical layout integration (to be supported)

  10. Scheduling: Current Status • Design objective • Focus on high-performance designs • Overall approach • Use a system of pairwise difference constraints to express all kinds of scheduling constraints • Represent the design objective as a linear function • The system is immediately solvable via any linear programming solver with integral solutions

  11. Scheduling: Design Framework • Inputs: CDFG; target platform modeling (resource library & chip layout); user-specified design constraints & assignments • xPilot scheduler: constraint equations generation (relative timing constraints, dependency constraints, frequency constraints, resource constraints, …) yielding a system of pairwise difference constraints → objective function generation → linear programming solver → LP solution interpretation • Output: STG (State Transition Graph)

  12. Example: Greatest Common Divisor • GCD C description: x = inport1; y = inport2; while (x != y) { if (x > y) x = x - y; else y = y - x; } *outport = x; • Control flow graph in SSA form (edges labeled T are taken when the condition is true) • BB1: x_0 = inport1; y_0 = inport2; cond1 = (x_0 != y_0); • BB2: x_1 = φ(x_0, x_1, x_2); y_1 = φ(y_0, y_1, y_2); cond2 = (x_1 > y_1); • BB3: x_2 = x_1 - y_1; cond3 = (x_2 != y_1); • BB4: y_2 = y_1 - x_1; cond4 = (x_1 != y_2); • BB5: x_3 = φ(x_0, x_1, x_2); *outport = x_3;

  13. Constraints Generation • Data dependency constraint • Operation v is data dependent on operation u, i.e., (u, v) ∈ E: s(v) − s(u) ≥ 0, where the schedule variable s(v) represents the relative schedule of node v • Example: u: x_1 = φ(x_0, x_1, x_2); v: cond2 = (x_1 > y_1); • Other constraints can be represented in a similar way … • The constraint equations form a system of pairwise difference constraints • The constraint matrix A is totally unimodular • The feasibility check can be formulated as a single-source shortest path problem • Optimization can be performed via any LP solver; the dual problem is equivalent to a min-cost network flow problem
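
The feasibility check named on slide 13 can be sketched as a Bellman-Ford pass over the constraint graph. This is a minimal illustration and not xPilot's implementation: the function name, the (u, v, d) tuple format, and the sample operation names are assumptions made for the example. Each constraint s(v) − s(u) ≥ d is rewritten as s(u) − s(v) ≤ −d, which becomes an edge from v to u with weight −d; a negative cycle means the system is infeasible.

```python
def solve_difference_constraints(nodes, constraints):
    """Find integer s(.) with s(v) - s(u) >= d for every (u, v, d), or None.

    Hypothetical helper for illustration; it returns one feasible schedule,
    not an LP-optimal one.
    """
    if not nodes:
        return {}
    # A virtual source with zero-weight edges to every node is modeled by
    # starting all distances at 0.
    dist = {n: 0 for n in nodes}
    # s(v) - s(u) >= d  <=>  s(u) - s(v) <= -d  =>  edge v -> u, weight -d
    edges = [(v, u, -d) for (u, v, d) in constraints]
    for _ in range(len(nodes)):
        changed = False
        for a, b, w in edges:
            if dist[a] + w < dist[b]:
                dist[b] = dist[a] + w
                changed = True
        if not changed:
            break
    else:
        # Still relaxing after |V| passes: a negative cycle exists.
        return None
    # Shift so that the earliest operation sits at step 0.
    low = min(dist.values())
    return {n: dist[n] - low for n in nodes}


# Assumed example: "phi" feeds "cmp" (may share a step), and "sub" must
# start at least one step after "cmp".
ops = ["phi", "cmp", "sub"]
deps = [("phi", "cmp", 0), ("cmp", "sub", 1)]
schedule = solve_difference_constraints(ops, deps)
```

Because every constraint touches exactly two schedule variables with coefficients +1 and −1, integer bounds yield integer shortest-path distances, which is consistent with the integrality remark on the slide.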

  14. Solution by LP Solver • Scheduling is performed across the basic block boundaries • (Figure: the SSA-form control flow graph of the GCD example, BB1 through BB5, annotated with the schedule values 0 and 1 from the LP solution.)

  15. Schedule Interpretation • (Figure: the scheduled operations of BB1 through BB5 shown next to the resulting guarded code.) • Guarded code: x_0 = inport1; y_0 = inport2; cond1 = (x_0 != y_0); if (cond1) { x_1 = φ(x_0, x_1, x_2); y_1 = φ(y_0, y_1, y_2); cond2 = (x_1 > y_1); if (cond2) { x_2 = x_1 - y_1; cond3 = (x_2 != y_1); } else { y_2 = y_1 - x_1; cond4 = (x_1 != y_2); } } if (!cond1 || !cond3 && !cond4) { x_3 = φ(x_0, x_1, x_2); *outport = x_3; }

  16. Deriving State Transition Graph • Guarded code: x_0 = inport1; y_0 = inport2; cond1 = (x_0 != y_0); if (cond1) { x_1 = φ(x_0, x_1, x_2); y_1 = φ(y_0, y_1, y_2); cond2 = (x_1 > y_1); if (cond2) { x_2 = x_1 - y_1; cond3 = (x_2 != y_1); } else { y_2 = y_1 - x_1; cond4 = (x_1 != y_2); } } if (!cond1 || !cond3 && !cond4) { x_3 = φ(x_0, x_1, x_2); *outport = x_3; } • Final STG for GCD (figure; the recoverable transition label is cond3 || cond4)
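
To make the guarded code of the GCD example concrete, the sketch below steps through it as a small state machine. The split into an entry state and a single loop state is an assumption made for illustration (the STG figure itself is not recoverable from the transcript), and the function name is hypothetical. The φ operations are resolved by remembering which branch produced the latest values, and the loop state repeats while cond3 || cond4 holds.

```python
def gcd_stg(inport1, inport2):
    """Simulate the guarded GCD code; inputs are assumed positive integers."""
    # Entry state (assumed): operations of BB1.
    x_0, y_0 = inport1, inport2
    cond1 = x_0 != y_0
    if not cond1:
        return x_0  # x_3 = phi(...) selects x_0 when the loop is skipped

    # phi on loop entry: take the values arriving from BB1.
    x_1, y_1 = x_0, y_0
    while True:
        # Loop state (assumed): BB2 followed by BB3 or BB4.
        cond2 = x_1 > y_1
        if cond2:
            x_2 = x_1 - y_1
            again = x_2 != y_1      # cond3
            next_x, next_y = x_2, y_1
        else:
            y_2 = y_1 - x_1
            again = x_1 != y_2      # cond4
            next_x, next_y = x_1, y_2
        if not again:
            return next_x           # x_3 = phi(...) selects the live x
        # phi on the back edge: take the values from BB3 or BB4.
        x_1, y_1 = next_x, next_y


result = gcd_stg(12, 18)
```

Only one of cond3 and cond4 is computed in any pass through the loop state, which is why a single flag stands in for cond3 || cond4 here.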

  17. Unified Resource Binding • Provides a unified resource sharing framework to optimize for various design objectives • Simultaneous functional unit binding, register binding and port binding • Equipped with advanced techniques to optimize the interconnect and steering logic networks • Guided by a flexible cost evaluation engine to achieve different objectives, e.g., performance, area, power, etc. • Extendable to exploit physical layout information

  18. An FU/Register Binding Example • (Figure: panels (a) and (b) each contrast two bindings, Case 1 and Case 2, built from registers R1 to R5 in (a) and R1 to R3 in (b), functional units F1 and F2, and MUXes.) • Observations: • Binding has a large impact on the resulting performance and cost • Functional unit and register binding are highly correlated • Note: Assume all operations and variables are compatible for sharing

  19. Drawbacks of Previous Work • Many existing algorithms focus on minimizing the “number” of functional units or registers • Technology advances: the interconnect effect is increasing • 51% of the total dynamic power of a microprocessor in 0.13um tech. • Up to 80% of the dynamic power in future technologies • May generate a larger number of multiplexers and interconnects • Unfavorable performance and cost results • Optimization for unrealistic goals • Minimize the “number” of FUs, registers, or multiplexers • Should have detailed datapath models to guide the optimization • No technology-specific consideration • Should have platform-specific characterizations

  20. Resource Binding in xPilot • Inputs: STG (State Transition Graph); user-specified design constraints; target platform (resource library & chip layout) • xPilot architecture exploration: baseline register binding, then an iteration of FU allocation/binding and register allocation/binding, guided by the datapath model for performance-cost estimation • Improved? Yes: iterate again; No: stop • Output: STG + best datapath models

  21. Design Space Exploration • Exploration phases: • Exploring Node 2: • (1) (2): two mul • (1, 2): one mul • Exploring Node 3: • (1) (2) (3): three mul • (1, 2) (3): two mul • (1, 3) (2): two mul • Exploring Node 4: • (1) (2) (3) (4) • (1, 2, 4) (3) • (1, 2) (3, 4) • (1, 2) (3) (4) • (1, 3, 4) (2) • (1, 3) (2, 4) • (1, 3) (2) (4) • …. • (Figure panels: A State Transition Graph (STG); Compatible Graphs; Datapath for solution (1, 2, 4) (3); Datapath Model; Curve for Design Space Pruning, plotting power against delay with pruned points marked.)
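
The node-by-node exploration on slide 21 can be sketched as an incremental enumeration of bindings, where each group of operations shares one multiplier. This is an illustrative sketch only: the function names are hypothetical, the pruning by the datapath model curve is not modeled, and the assumption that operations 2 and 3 cannot share a unit is inferred from the slide's listings, which never place them in the same group.

```python
def extend(bindings, node, conflicts):
    """Add `node` to every compatible group of every binding, or to a new group."""
    result = []
    for binding in bindings:
        for i, group in enumerate(binding):
            # A group can accept the node only if no member conflicts with it.
            if all(frozenset((node, member)) not in conflicts for member in group):
                merged = group + (node,)
                result.append(binding[:i] + (merged,) + binding[i + 1:])
        # Alternatively, allocate a fresh functional unit for the node.
        result.append(binding + ((node,),))
    return result


def explore(nodes, conflicts):
    """Return, for each node, the bindings alive after exploring that node."""
    bindings = [((nodes[0],),)]
    history = {nodes[0]: bindings}
    for node in nodes[1:]:
        bindings = extend(bindings, node, conflicts)
        history[node] = bindings
    return history


# Assumed incompatibility: multiplications 2 and 3 cannot share a unit.
conflicts = {frozenset((2, 3))}
history = explore([1, 2, 3, 4], conflicts)
# Number of multipliers implied by each binding after exploring node 3.
mul_counts = sorted(len(binding) for binding in history[3])
```

Under that assumption the enumeration reproduces the two bindings listed for node 2 and the three listed for node 3; for node 4 it yields ten candidates, of which the slide lists seven before its ellipsis.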

  22. Experimental Results: Benchmark Suite • Benchmark suite • PR, MCM: • DSP kernels: pure additions/subtractions and multiplications • CACHE: • Cache controller: control-intensive design with cycle-accurate I/O operations • MOTION: • Motion compensation algorithm for MPEG-1 decoder: control-intensive with a modest amount of computation • IDCT: • JPEG inverse discrete cosine transform: computation-intensive • DWT: • JPEG2000 discrete wavelet transform: computation-intensive with modest control flow • EDGELOOP: • Extracted from an H.264 decoder: a very complex design featuring a mix of computation, control, and memory accesses

  23. Experimental Results: Code Size Reduction

  24. Experimental Results: Comparison with SPARK on Scheduling • SPARK [UCI/UCSD, 2004], a state-of-the-art academic high-level synthesis tool

  25. Experimental Results: Comparison with SPARK on Binding • On average, xPilot resource binding achieves designs with similar area and 2.48x higher frequency than SPARK

  26. Synthesis Results for DWT (JPEG2000) • Settings • Target platform: Altera Stratix • RTL synthesis & place-and-route: Altera Quartus II v5.0 • Simulation: Mentor ModelSim SE 6.0 • Design alternatives

  27. Experimental Results: ASIC Flow • Magma RTL-to-GDSII flow • Technology library: Cadence Generic Standard Cell Library 0.18um • Tradeoff study: • 1st column: delay constraint enforced in xPilot • 2nd column: control step count of the xPilot-generated RTL • 3rd to 5th columns: data reported after mapping by the Magma tool

  28. Experimental Results: ASIC Flow (cont.)
