
Compiling for Coarse-Grained Adaptable Architectures




  1. Compiling for Coarse-Grained Adaptable Architectures Carl Ebeling Affiliates Meeting February 26, 2002

  2. Outline • Embedded Systems Platforms • Processor + ASICs • The Performance/Power/Price crunch • Bridging the Processor/ASIC gap • The role of “adaptable hardware” • Coarse-grained vs. fine-grained • Compiling to coarse-grained architectures • Scheduling via Place and Route

  3. Platform-Based Systems • One platform for many different systems • Leverage economy of scale • Beyond “SOC” • Mix of processors and ASIC components • Processor • General system code • ASICs • High performance/low power

  4. Processor/ASIC Efficiency [figure: efficiency spectrum from ASIC to processor]

  5. The Platform Problem • Performance/power demands increasing rapidly • More functionality pushed into ASICs • Platform becomes special-purpose • Lose economy of scale • Solution: “Programmable ASICs” • Hardware that looks like software

  6. Adaptive Computing • ASIC is a “fixed instruction” architecture • Construct dataflow graph at fab time • Arbitrary size, complexity • Adaptive computing – change that instruction • Construct the dataflow graphs “on the fly” • Adapt the architecture to the problem

  7. FPGA-Based Adaptive Computing • FPGAs can be used to implement arbitrary circuits • Build ASIC components on-the-fly • Many styles • Configurable function units • Configurable co-processors

  8. The Problem with FPGAs • Cost • >100x overhead • Bit-level logic functions and routing • Great for bit-ops, FSMs; lousy for arithmetic • Power • Functions are widely spaced → long wires • Programming model • Mapping computation to HW is time-consuming • Weeks/months for relatively small problems

  9. Coarse-Grained Architectures • LUTs → Arithmetic operations • Wires → Data busses • Decreases overhead substantially • 25x - 100x gain • Large potential impact on embedded platforms • Compiling is the big challenge

  10. RaPiD Architecture • Merging of processor and FPGA ideas • Start with array of function units

  11. RaPiD • Add registers • “Distributed register file”

  12. RaPiD • Add small distributed memories • Save data locally for reuse

  13. RaPiD • Add I/O ports • Streaming data interfaces

  14. RaPiD • Add interconnect network • Segmented busses • Multiplexers

  15. Interconnect Control • Interconnect is modeled using muxes • FU inputs are muxed • Bus inputs are muxed • Bus hierarchy possible • Bus inputs from FU’s and other buses
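The mux-based interconnect described above can be sketched as a small C model. This is an illustration only, not the RaPiD netlist: the function names, bus count, and FU count are assumptions chosen for the example.

```c
#include <stddef.h>

/* Toy model of mux-based interconnect: every FU input and every bus
   driver goes through a multiplexer, and the mux select lines are
   control signals.  Sizes here are illustrative assumptions. */
int mux(const int *inputs, size_t n, size_t sel) {
    /* An out-of-range select drives 0 (a bus left undriven). */
    return (sel < n) ? inputs[sel] : 0;
}

/* A two-level bus hierarchy: each local bus selects among FU outputs,
   then an FU input selects among the local buses. */
int fu_input(const int fu_out[4], const size_t bus_sel[2], size_t input_sel) {
    int bus[2];
    bus[0] = mux(fu_out, 4, bus_sel[0]);
    bus[1] = mux(fu_out, 4, bus_sel[1]);
    return mux(bus, 2, input_sel);
}
```

Because buses can themselves feed other buses through muxes, the same two-level pattern nests to model the bus hierarchy the slide mentions.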

  16. Control Signals • Function units • e.g. ALU opcodes • e.g. Memory R/W • Mux controls • How many? • ~20/FU (including muxes) • >50 FUs • >1000 control signals

  17. Configuring Control
  • Fab-time decision
    • “Hard” control (~60%)
    • “Soft” control – configurable, a la FPGA (~40%)
  • Compile-time decision (soft control)
    • Static (~30%) – does not change during the current app
    • Dynamic (~10%) – changes under program control
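Combining the counts from slide 16 with the split above gives a back-of-the-envelope control budget. The percentages are the slides' rough estimates, not measurements, and `split_control` is a hypothetical helper:

```c
/* Rough control budget using the slides' numbers: ~20 control signals
   per FU, >50 FUs, split ~60% hard / ~30% static / ~10% dynamic.
   The percentages are estimates from the talk, not measurements. */
typedef struct {
    int hard;   /* fixed at fabrication       (~60%) */
    int stat;   /* configured once per app    (~30%) */
    int dyn;    /* changes under program ctrl (~10%) */
} ControlBudget;

ControlBudget split_control(int signals_per_fu, int n_fus) {
    int total = signals_per_fu * n_fus;
    ControlBudget b;
    b.hard = total * 60 / 100;
    b.stat = total * 30 / 100;
    b.dyn  = total - b.hard - b.stat;   /* remainder, ~10% */
    return b;
}
```

Only the dynamic fraction must be fed from instruction bits every cycle, which is why the control-optimization step on the later slides pays off.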

  18. Proposed Tool Flow

  19. Control-Dataflow Graph

  20. Example Dataflow Graph • Add subsequences of length 3
  for (i = 2; i < N; i++) {
      Y[i] = X[i-2] + X[i-1] + X[i];
  }
  • transform to:
  A = X[1];
  B = X[0];
  for (i = 2; i < N; i++) {
      Y[i] = A + B + X[i];
      B = A;
      A = X[i];
  }
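The two loop forms on this slide should compute the same result, which is easy to check directly in C. The function names and the array size `N` are illustrative:

```c
#define N 8

/* Original form: each iteration re-reads X[i-2] and X[i-1] from memory. */
void sum3_memory(const int *X, int *Y) {
    for (int i = 2; i < N; i++)
        Y[i] = X[i-2] + X[i-1] + X[i];
}

/* Transformed form from the slide: registers A and B carry the two
   previous inputs, so each iteration consumes only one new value --
   exactly the shape a streaming datapath wants. */
void sum3_registers(const int *X, int *Y) {
    int A = X[1], B = X[0];
    for (int i = 2; i < N; i++) {
        Y[i] = A + B + X[i];
        B = A;
        A = X[i];
    }
}
```

The transformed version trades two memory reads per iteration for two register moves, which is what lets the dataflow graph stream one input per cycle.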

  21. Example Dataflow Graph • DFG for one iteration • Combinational - executed in one cycle • DFG is in a loop, executed repeatedly • Linked to other DFGs via registers

  22. Stitching Dataflow Graphs

  23. Stitching Dataflow Graphs

  24. Stitching Dataflow Graphs

  25. Scheduling Dataflow Graphs • Mapping operations/values in space and time • Key problems • Data interconnect • No crossbar, no central register file • Control constraints • Hard control – one decision for all time • Control optimization • Soft control – maximize sharing • Place & Route formulation allows simultaneous solution of all constraints
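A toy cost function illustrates the place-and-route view of this problem. This is a sketch under simplifying assumptions (operations placed on a 1-D row of FUs, space only), not the actual RaPiD scheduler, which must place operations in both space and time:

```c
#include <stdlib.h>

/* Each dataflow edge costs the bus distance between the FUs its two
   endpoint operations are placed on; a placer minimizes total
   interconnect cost by moving operations between FUs.  fu_of_op[i]
   is the FU index chosen for operation i. */
typedef struct { int src, dst; } Edge;

int wirelength(const int *fu_of_op, const Edge *edges, int n_edges) {
    int cost = 0;
    for (int e = 0; e < n_edges; e++)
        cost += abs(fu_of_op[edges[e].src] - fu_of_op[edges[e].dst]);
    return cost;
}
```

Because there is no crossbar or central register file, every data transfer shows up in this cost, and the P&R formulation lets the interconnect, hard-control, and sharing constraints all be optimized at once.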

  26. Example Datapath Graph • Two adders with pipeline registers • Two input streams, two output streams • Two datapath registers • Two pairs of interconnect registers

  27. Datapath Execution • Control determines what datapath does • Possibly different each clock cycle • Datapath Execution (DPE) • Computation performed in one clock cycle • Starts/ends with clock tick • Combinational logic
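The DPE notion above can be sketched as a clocked state-update function: combinational logic reads the current register state, and the next state commits at the cycle boundary. The accumulator datapath here is an illustrative stand-in for the array, not RaPiD itself:

```c
/* One datapath execution (DPE) per clock tick.  The control input may
   differ every cycle, selecting what the datapath does. */
typedef struct { int acc; } Regs;

Regs tick(Regs cur, int input, int ctrl_accumulate) {
    Regs next = cur;
    /* Combinational work for this cycle: either accumulate or load. */
    next.acc = ctrl_accumulate ? cur.acc + input : input;
    return next;   /* committed at the clock edge */
}
```

Chaining calls to `tick` models the space/time execution on the following slides: each call is one cycle, and the returned registers wrap back as the next cycle's inputs.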

  28. Space/Time Execution Cycle 1

  29. Space/Time Execution Cycle 1 Cycle 2

  30. Space/Time Execution Cycle 1 Cycle 2 Cycle 3

  31. Dataflow Graph Execution Cycle 1 Cycle 2 Cycle 3

  32. Start an Execution Every Cycle Cycle 1 Cycle 2 Cycle 3

  33. Connect DFG Outputs to Inputs Cycle 1 Cycle 2 Cycle 3

  34. Dataflow Graph is in a loop • Initiation interval is one clock cycle • Wrap register outputs back to the top • Gives an iterative modulo schedule

  35. Result of Scheduling: Control Matrix • Matrix axes: control signals × time • Only soft control • Control values: 0/1, x – don’t care, or f( ) of status signals and control variables • Control optimization: compress the matrix

  36. Control Optimization • Static control • Unused or constant • No instruction bits • Shared control • Same value • Complemented value • One instruction bit • Pipelined control • Pipelined DFG • Control offset in time • One instruction bit

  37. Control Optimization • Static control • Unused or constant • No instruction bits • Shared control • Same value • Complemented value • One instruction bit • Pipelined control • Pipelined DFG • Control offset in time • One instruction bit

  38. Control Optimization • Static control • Unused or constant • No instruction bits • Shared control • Same value • Complemented value • One instruction bit • Pipelined control • Pipelined DFG • Control offset in time • Increases sharing
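The static and shared cases above can be sketched as column tests on the control matrix. Sizes and layout are illustrative assumptions (one row per cycle, one column per soft control signal, concrete 0/1 values only):

```c
/* A constant column needs no instruction bit ("static control");
   two columns that are equal, or bitwise complements of each other,
   can share a single instruction bit ("shared control"). */
int is_static(const int *col, int n_cycles) {
    for (int r = 1; r < n_cycles; r++)
        if (col[r] != col[0]) return 0;
    return 1;
}

int can_share(const int *a, const int *b, int n_cycles) {
    int equal = 1, inverted = 1;
    for (int r = 0; r < n_cycles; r++) {
        if (a[r] != b[r]) equal = 0;     /* not identical */
        if (a[r] == b[r]) inverted = 0;  /* not complements */
    }
    return equal || inverted;
}
```

The pipelined case is similar but compares one column against a time-shifted copy of another, which is why pipelining the DFG increases sharing.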

  39. Conclusion • New role for adaptive computing • Solution for embedded systems platforms • Coarse-grained architectures • Reduce configurability overhead • Merge ideas from processors and FPGAs • Compiling is the key challenge • Finding parallelism is not the problem • Scheduling data movement • Use Place & Route to solve many simultaneous constraints
