ECE 697F Reconfigurable Computing Lecture 5 Technology Mapping: Packing Logic into LUTs

Presentation Transcript


  1. ECE 697F Reconfigurable Computing Lecture 5 Technology Mapping: Packing Logic into LUTs

  2. Overview • Logic synthesis • LUT Clustering • LUT capacity • Chortle – example technology mapper • Architecture-specific optimization

  3. Boolean network • A Boolean network is the main representation of the logic functions for technology independent optimizations. • Each node can be represented as sum-of-products (or product-of-sums). • Provides multi-level structure, but functions in the network need not correspond to logic gates.

  4. Boolean network example • Primary inputs: x1, x2, x3, x4 • Intermediate nodes: k1 = x2 + x3; k2 = x1’ x2 x4 + k1; k3 = k1 x4’ • Primary outputs: out1 = k2 + x2’; out2 = k3 + x1
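The network on slide 4 can be evaluated node by node, in order from inputs to outputs. The following is a minimal sketch in Python; the function name `evaluate` and the `table` variable are illustrative and not part of the lecture.

```python
from itertools import product

def evaluate(x1, x2, x3, x4):
    """Evaluate the slide-4 Boolean network; returns (out1, out2)."""
    # Intermediate nodes, evaluated in topological order.
    k1 = x2 or x3
    k2 = ((not x1) and x2 and x4) or k1
    k3 = k1 and (not x4)
    # Primary outputs.
    out1 = k2 or (not x2)
    out2 = k3 or x1
    return bool(out1), bool(out2)

# Exhaustive truth table over the four primary inputs.
table = {bits: evaluate(*bits) for bits in product([False, True], repeat=4)}
```

Enumerating the table shows that out1 is true for every input combination: k2 already contains x2 (through k1), so k2 + x2’ is always 1. This is the kind of redundancy that technology-independent simplification is meant to remove.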

  5. Terms • Support: set of variables used by a function. • Transitive fanout: all the primary outputs and intermediate variables that depend on a function. • Transitive fanin: all the primary inputs and intermediate variables used by a function. • Transitive fanin determines a cone of logic. [Figure: cone of logic from primary inputs to an output]
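As an illustration of these terms, the sketch below computes transitive fanin and fanout on the slide-4 example network. The dictionary `NETWORK` and both function names are assumptions made for this example; each entry maps a node to its support.

```python
# Support of each node in the slide-4 example (node -> variables it uses).
NETWORK = {
    "k1": {"x2", "x3"},
    "k2": {"x1", "x2", "x4", "k1"},
    "k3": {"k1", "x4"},
    "out1": {"k2", "x2"},
    "out2": {"k3", "x1"},
}

def transitive_fanin(node, network=NETWORK):
    """All primary inputs and intermediate variables that node depends on."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        # Primary inputs have no entry in the network, so they contribute nothing.
        for src in network.get(current, ()):
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen

def transitive_fanout(node, network=NETWORK):
    """All intermediate variables and primary outputs that depend on node."""
    return {n for n in network if node in transitive_fanin(n, network)}
```

For example, the cone of logic feeding out2 is `transitive_fanin("out2")`, which contains k3, k1, and the inputs x1 through x4.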

  6. Partially-specified function [Figure: a function of x1, x2, x3 whose entries are 0, 1, or don’t care]

  7. Optimizations • Simplification: changing the way a function is represented. • Network restructuring: adding and removing nodes. • Delay restructuring: optimizations that reduce the height of critical paths.

  8. Partial collapsing [Figure: before and after views of a network with nodes f1, f2, f3, f4 and collapsed node F]

  9. Technology mapping • Cover the function:

  10. FPGA tech mapping • Cost (number of inputs) doesn’t always increase with added functions:

  11. FPGAs vs. custom logic • Cost metric for static gates is literal: • ax + bx’ has four literals, requires 8 transistors. • Cost metric for FPGAs is logic element: • All functions that fit in an LE have the same cost.

  12. LUT-based logic synthesis • Find the largest logic cone that will fit into the LUT: r = q + s’; s = d’; q = g’ + h; d = a + b
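The cone on slide 12 has support {a, b, g, h}, so the whole cone fits in a single 4-input LUT. The sketch below collapses the cone and enumerates the LUT contents; the names `cone_r` and `lut_contents` are illustrative, and the input ordering (a, b, g, h) is an arbitrary choice.

```python
from itertools import product

def cone_r(a, b, g, h):
    """Collapse the slide-12 cone into one function of its support."""
    d = a or b
    s = not d
    q = (not g) or h
    return q or (not s)

def lut_contents(func, num_inputs):
    """Return the LUT truth table as a list of 2**num_inputs bits."""
    return [int(bool(func(*bits))) for bits in product([0, 1], repeat=num_inputs)]

# Entry i corresponds to inputs (a, b, g, h) read as a 4-bit binary number.
lut = lut_contents(cone_r, 4)
```

The resulting table has a single 0 entry, at a = 0, b = 0, g = 1, h = 0, because the collapsed function is r = g’ + h + a + b.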

  13. How much fits in a LUT? • One 2-input NAND gate frequently used for comparison. • Approximately 12 to 15 gates per four-input LUT. • 2^16 functions -> 80 after IO swapping, 14 after IO inversion • 4-input determined to be optimal [Rose 1990] [Figure: LUTs with pins labeled A, B, C, D]

  14. Technology-Independent Logic Optimization • Improve circuit based on cost • Keep same functionality • Boolean Evaluation/decomposition • Simple factoring -> minimizing literals • Before: f = ac + ad + bc + bd; g = a + b + c • After: e = a + b; g = e + c; f = e(c + d)
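The factoring example on slide 14 can be checked exhaustively. The sketch below confirms that the factored network computes the same f and g, and compares literal counts. The helper `literal_count` is an illustrative shortcut that counts letters in the expression strings, so the intermediate variable e counts as a literal wherever it is used.

```python
from itertools import product

def original(a, b, c, d):
    f = (a and c) or (a and d) or (b and c) or (b and d)
    g = a or b or c
    return bool(f), bool(g)

def factored(a, b, c, d):
    e = a or b              # shared intermediate function
    g = e or c
    f = e and (c or d)
    return bool(f), bool(g)

# Same functionality on all 16 input combinations.
equivalent = all(
    original(*v) == factored(*v) for v in product([False, True], repeat=4)
)

def literal_count(expressions):
    """Count variable occurrences (single-letter names) in expression strings."""
    return sum(ch.isalpha() for expr in expressions for ch in expr)

before = literal_count(["ac + ad + bc + bd", "a + b + c"])
after = literal_count(["a + b", "e + c", "e(c + d)"])
```

Under this counting, the network drops from 11 literals to 7 while keeping the same functionality.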

  15. Factorization • Based on division: • formulate candidate divisor; • test how it divides into the function; • if g = f/c, we can use c as an intermediate function for f. • Algebraic division: does not take Boolean simplification into account. Less expensive than Boolean division.
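A minimal sketch of algebraic (weak) division follows, assuming a sum-of-products is represented as a set of cubes and each cube as a frozenset of literal names. The representation and the names `cube` and `algebraic_divide` are choices made for this example, not something the slides specify.

```python
def cube(literals):
    """Build a cube from an iterable of literal names, e.g. cube("ac")."""
    return frozenset(literals)

def algebraic_divide(f, divisor):
    """Return (quotient, remainder) with f = quotient * divisor + remainder.

    Purely algebraic: cubes are treated as sets of symbols, and no Boolean
    identities (such as a * a' = 0) are applied.
    """
    quotient = None
    for d_cube in divisor:
        # Cubes of f that contain this divisor cube, with the divisor removed.
        partial = {c - d_cube for c in f if d_cube <= c}
        quotient = partial if quotient is None else quotient & partial
    quotient = quotient or set()
    # Everything the product quotient * divisor accounts for.
    covered = {q | d for q in quotient for d in divisor}
    remainder = set(f) - covered
    return quotient, remainder

f = {cube("ac"), cube("ad"), cube("bc"), cube("bd")}
q, r = algebraic_divide(f, {cube("a"), cube("b")})
```

Dividing f = ac + ad + bc + bd by a + b gives quotient c + d and an empty remainder, which matches the factored form f = e(c + d) with e = a + b.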

  16. Library-based Technology Mapping – MIS II • Three steps: decomposition, matching, covering • Circuit first decomposed into NAND representations • Different collections of NANDs can be implemented differently in VLSI • Example library cells: Inv, cost 2; NAND2, cost 3; AOI-21, cost 4

  17. MIS II • Decompose into NAND-2 using Boolean techniques • Use dynamic programming to match subtrees with libraries • Choose lowest cost implementation that covers all primitives. [Figure: two alternative covers, each labeled with its cost]
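The matching and covering steps can be sketched as a dynamic program over a NAND2/INV subject tree. The code below is an illustration in the spirit of library-based mapping, not MIS II itself. It uses the three cell costs from slide 16 and assumes AOI-21 decomposes as INV(NAND(NAND(a, b), INV(c))). The tuple encoding of trees, the `"*"` wildcard, and the function names are all hypothetical.

```python
from functools import lru_cache

# (cell name, cost, pattern); "*" matches any subtree (a cell input).
LIBRARY = [
    ("INV", 2, ("INV", "*")),
    ("NAND2", 3, ("NAND", "*", "*")),
    ("AOI21", 4, ("INV", ("NAND", ("NAND", "*", "*"), ("INV", "*")))),
]

def match(pattern, tree):
    """Return the subtrees bound to wildcards, or None if there is no match."""
    if pattern == "*":
        return [tree]
    if isinstance(tree, str) or tree[0] != pattern[0] or len(tree) != len(pattern):
        return None
    leaves = []
    for sub_pattern, sub_tree in zip(pattern[1:], tree[1:]):
        bound = match(sub_pattern, sub_tree)
        if bound is None:
            return None
        leaves.extend(bound)
    return leaves

@lru_cache(maxsize=None)
def best_cover(tree):
    """Return (minimum cost, cells used) for the subtree rooted at tree."""
    if isinstance(tree, str):          # primary input: nothing to cover
        return 0, ()
    best = None
    for name, cost, pattern in LIBRARY:
        leaves = match(pattern, tree)
        if leaves is None:
            continue
        total, cells = cost, (name,)
        for leaf in leaves:            # optimal covers of the cell's inputs
            leaf_cost, leaf_cells = best_cover(leaf)
            total += leaf_cost
            cells += leaf_cells
        if best is None or total < best[0]:
            best = (total, cells)
    return best

subject = ("INV", ("NAND", ("NAND", "a", "b"), ("INV", "c")))
```

For this subject tree, covering with individual INV and NAND2 cells costs 2 + 3 + 3 + 2 = 10, while a single AOI-21 costs 4, so `best_cover(subject)` picks the AOI-21. The sketch matches patterns in one input order only; a fuller matcher would also try swapped NAND inputs.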

  18. Tech Mapping for LUTs • Minimize total number of LUTs • Minimize the number of levels of LUTs • Many different approaches • Partitioning -> Flowmap • BDDs -> XMAP • Covering -> Chortle • Basic Xilinx tech mapping follows Chortle with modification to handle registers.

  19. Chortle-crf • Dynamic programming approach • Primary goal: minimize # LUTs • Secondary goal: minimize # inputs the circuit root LUT uses • Operates on AND-OR circuits. • Locate boundaries [Figure: example circuit with nodes A through M and signals w, x, y, z]

  20. Chortle-crf • Major innovation is bin packing • Simultaneously addresses decomposition and matching • Goal: Find decomposition of every node in the network that minimizes # LUTs in final circuit [Figure: with decomposition, 2 LUTs; without decomposition, 4 LUTs]

  21. Mapping Each Tree • Dynamically visit each node in the graph • Fanin nodes drive the node under evaluation • Boxes -> fanin LUTs; the size of a box is its number of inputs • Bins -> N-input LUTs (in this case N = 5) • First Fit Decreasing:
      /* construct 2-level decomp */
      box list <- fanin LUTs sorted by size
      bin list <- empty
      while (box list is not empty) {
          box <- largest LUT remaining in box list
          find first bin that will contain box
          if no such bin exists
              create new bin and place box in it
          else
              pack box into the existing bin
      }
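The First Fit Decreasing step on slide 21 can be sketched in Python as follows. This is a simplified illustration: each box is reduced to its input count, and inputs are assumed not to be shared between boxes, so sizes simply add. The function name and the list-of-lists bin representation are assumptions of this example.

```python
def first_fit_decreasing(box_sizes, bin_capacity):
    """Pack boxes (fanin LUT input counts) into bins (N-input LUTs).

    Returns a list of bins, each a list of the box sizes packed into it.
    """
    bins = []
    for size in sorted(box_sizes, reverse=True):    # largest box first
        if size > bin_capacity:
            raise ValueError("box does not fit in an empty bin")
        for contents in bins:
            if sum(contents) + size <= bin_capacity:
                contents.append(size)               # pack into first bin that fits
                break
        else:
            bins.append([size])                     # no bin fits: create a new one
    return bins

# Five fanin LUTs using 3, 2, 2, 2, and 1 inputs, packed into 5-input LUTs.
packed = first_fit_decreasing([3, 2, 2, 2, 1], bin_capacity=5)
```

With these sizes the boxes fit into two 5-input bins, `[[3, 2], [2, 2, 1]]`, which is the two-level decomposition that the multi-level step then chains together.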

  22. Multi-Level Decomposition • Chain LUTs together • Output of the largest second-level LUT is connected to a LUT with an unused input • May need to add a new LUT • Leads to min LUTs and a fanout LUT with the smallest # inputs • This fanout LUT is used as an input to the next stage

  23. Examples • a) Fanin LUTs • b) Two-level Decomposition • c) Multi-level Decomposition [Figure: LUTs over inputs u, v, w, x, y with intermediate nodes z.1 and z.2]

  24. Optimality • For LUTs with fewer than 6 inputs Chortle will create an optimal result for each subtree • Combination of sub-trees is not optimized. • Local optimizations needed to ensure global optimality. • Reconvergent paths -> net drives multiple gates. • Replicating logic -> creating additional fanout

  25. Translating a Design to an FPGA • Improve 2-level decomposition to take fanout into account • Replace FFD with an exhaustive search that repeatedly invokes FFD. • Try both with and without reconvergent path and select best mapping (forced merging) • Inputs must reconverge at node being decomposed.

  26. Reconvergent Paths • Frequently, more than one pair of fan-in LUTs share inputs • For each combination of pairs that share inputs, perform FFD. • Two-level decomp with fewest bins and smallest least filled bin retained • Reconverge:
      pair list <- all pairs of fanin LUTs with shared inputs
      best LUTs <- empty
      for all possible combinations of pairs from pair list {
          merged LUTs <- copy of fanin LUTs with forced merge
          result <- FFD(merged LUTs)
          if result is better than best LUTs
              best LUTs <- result /* best combo */
      }
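The exhaustive search on slide 26 can be sketched as below. This is an illustration under simplifying assumptions, not the Chortle-crf implementation: each fanin LUT is a frozenset of input names, a forced merge is the union of a pair's inputs, each box takes part in at most one merge, and FFD counts inputs without sharing. All names are hypothetical.

```python
from itertools import combinations

def ffd(boxes, capacity):
    """First Fit Decreasing over boxes given as frozensets of input names."""
    bins = []
    for box in sorted(boxes, key=len, reverse=True):
        for contents in bins:
            if sum(len(b) for b in contents) + len(box) <= capacity:
                contents.append(box)
                break
        else:
            bins.append([box])
    return bins

def score(bins):
    """Fewest bins first, then the smallest least-filled bin."""
    fills = [sum(len(b) for b in contents) for contents in bins]
    return len(bins), min(fills, default=0)

def map_with_reconvergence(boxes, capacity):
    # Pairs that share at least one input and still fit in one LUT when merged.
    pairs = [
        (i, j) for i, j in combinations(range(len(boxes)), 2)
        if boxes[i] & boxes[j] and len(boxes[i] | boxes[j]) <= capacity
    ]
    best = ffd(boxes, capacity)                 # mapping without forced merges
    for count in range(1, len(pairs) + 1):
        for chosen in combinations(pairs, count):
            used = [k for pair in chosen for k in pair]
            if len(used) != len(set(used)):
                continue                        # a box is merged at most once here
            merged = [boxes[i] | boxes[j] for i, j in chosen]
            rest = [b for k, b in enumerate(boxes) if k not in used]
            candidate = ffd(merged + rest, capacity)
            if score(candidate) < score(best):
                best = candidate
    return best

fanin_luts = [frozenset("abc"), frozenset("abd"), frozenset("e")]
result = map_with_reconvergence(fanin_luts, capacity=5)
```

In this example, plain FFD needs two 5-input LUTs, but merging the two boxes that share a and b yields a 4-input box, which packs with e into a single LUT. The number of pair combinations grows exponentially, which is why the exhaustive search becomes prohibitive.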

  27. Maximum Share Decreasing • Exhaustive search prohibitive • Select box using the following criteria: • 1. Greatest # inputs • 2. Shares greatest # inputs with any existing bin • 3. Shares greatest # of inputs with existing (remaining) boxes • Reduces to FFD for no input sharing • Points 2 and 3 optimize network sharing

  28. Node Replication • Apply replication to fanout nodes • Map without replication first • Locally decompose fanout nodes to determine savings • Ordering important [Figure: mapping without replication vs. with replication]

  29. Results – Chortle-crf • 20 netlists mapped to 5-input LUTs • Reconvergence reduced LUTs by 2.7% • Replication reduced LUTs by 3.7% • Combined 14% reduction achieved • Replication exposes reconvergent paths creating additional opportunities for optimization.

  30. Chortle-d • Minimize delay through circuit • Generally increases hardware required • Reduced logic levels by 38% • Increased # LUTs by 79% • Note: most delay in an FPGA is in the interconnect

  31. Other Approaches • MIS-PGA • Groups inputs into LUTs • Decompose into 4-LUTs (Roth-Karp) • 47 times slower than Chortle • 14% fewer LUTs • XMAP • Represent circuit as BDDs • Effective for multiplexer based devices. • Also, BDS-PGA

  32. Flowmap • 1. Use network flow to partition circuit. • 2. Determine point where minimum flow achieved for minimum cut • 3. Cut until LUTs of size N achieved.

  33. Taking Flip flops into Account • FPGA devices contain fixed resources (FFs) • Technology mapping should take these into account • Consider fanout nodes. [Figure: LUT outputs feeding flip-flops (FF)]

  34. LUT Packing - VPACK • Seed BLE: choose the BLE with most inputs. • Select next BLE: the BLE which shares most inputs and outputs with the cluster • Continue until cluster is full or adding any BLE would exceed I, the # of cluster inputs • Hill Climbing: exceed the I limit temporarily to find a better minimum.
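The greedy part of the packing loop on slide 34 can be sketched as follows. This is a simplified illustration, not the VPACK tool: hill climbing is omitted, each BLE is a hypothetical `(input nets, output net)` pair, and cluster inputs are counted as nets used by the cluster but not generated inside it.

```python
def cluster_inputs(members, bles):
    """Nets a cluster must receive from outside."""
    nets_in = set().union(*(bles[m][0] for m in members))
    nets_out = {bles[m][1] for m in members}
    return nets_in - nets_out

def shared_nets(candidate, members, bles):
    """Number of nets (inputs and outputs) the candidate shares with the cluster."""
    cand_nets = bles[candidate][0] | {bles[candidate][1]}
    cluster_nets = set()
    for m in members:
        cluster_nets |= bles[m][0] | {bles[m][1]}
    return len(cand_nets & cluster_nets)

def pack(bles, cluster_size, max_inputs):
    """Greedy clustering: seed with the widest BLE, then add the best sharer."""
    unclustered = set(bles)
    clusters = []
    while unclustered:
        # Seed: unclustered BLE with the most inputs (name order breaks ties).
        seed = max(sorted(unclustered), key=lambda b: len(bles[b][0]))
        members = [seed]
        unclustered.remove(seed)
        while len(members) < cluster_size:
            fits = [b for b in sorted(unclustered)
                    if len(cluster_inputs(members + [b], bles)) <= max_inputs]
            if not fits:
                break                       # any addition would exceed I
            nxt = max(fits, key=lambda b: shared_nets(b, members, bles))
            members.append(nxt)
            unclustered.remove(nxt)
        clusters.append(members)
    return clusters

example_bles = {
    "A": ({"a", "b", "c", "d"}, "n1"),
    "B": ({"n1", "a", "b"}, "n2"),
    "C": ({"x", "y"}, "n3"),
    "D": ({"n3", "z"}, "n4"),
}
clusters = pack(example_bles, cluster_size=2, max_inputs=4)
```

Here A seeds the first cluster and absorbs B, because B's extra input n1 is produced inside the cluster and so does not count against I. C and D then form the second cluster.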

  35. Summary • Many tech mapping algorithms exist to minimize delay/area • Chortle uses a dynamic programming heuristic to perform mapping • Largely a solved problem • More sophisticated techniques evaluated recently
