
Combinational and Sequential Mapping with Priority Cuts

Alan Mishchenko, Sungmin Cho, Satrajit Chatterjee, Robert Brayton (UC Berkeley)


Presentation Transcript


  1. Combinational and Sequential Mapping with Priority Cuts Alan Mishchenko Sungmin Cho Satrajit Chatterjee Robert Brayton UC Berkeley

  2. Outline • Traditional cut-based LUT mapping • Improved technology mapping with priority cuts • Sequential mapping • Other applications of priority cuts • Experimental results

  3. Technology Mapping Input: A Boolean network (And-Inverter Graph) Output: A netlist of K-LUTs implementing the Boolean network, optimizing some cost function [Figure: the subject graph and the mapped netlist, over nodes labeled a to f]

  4. k-feasible Cuts A cut of a node n is a set of nodes in its transitive fan-in such that every path from n to the PIs is blocked by nodes in the cut. A k-feasible cut is a cut of size k or less. [Figure: node r with fanins p and q; p has fanins a and b; q has fanins b and c] The set {p, b, c} is a 3-feasible cut of node r. (It is also a 4-feasible cut.) k-feasible cuts are important in FPGA mapping because the logic between a node and the nodes in its cut can be replaced by a k-LUT.
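The cut definition can be checked directly with a depth-first search from the node. A set is a cut if no path reaches a PI without first passing through a member of the set. The following is a minimal Python sketch. It assumes the example network is r = f(p, q), p = f(a, b), q = f(b, c), which matches the cut sets listed on slide 5. The dictionary layout and the function names are hypothetical.

```python
# Example network (assumed): each node lists its fanins; PIs have none.
FANINS = {
    "a": [], "b": [], "c": [],
    "p": ["a", "b"],
    "q": ["b", "c"],
    "r": ["p", "q"],
}


def is_cut(fanins, node, cut):
    """True if every path from `node` down to a PI is blocked by a node in `cut`."""
    seen = set()
    stack = [node]
    while stack:
        n = stack.pop()
        if n in cut:
            continue          # this path is blocked by the cut
        if not fanins[n]:
            return False      # reached a PI without crossing the cut
        if n in seen:
            continue
        seen.add(n)
        stack.extend(fanins[n])
    return True


def is_k_feasible_cut(fanins, node, cut, k):
    """A k-feasible cut is a cut with at most k nodes."""
    return len(cut) <= k and is_cut(fanins, node, cut)


print(is_k_feasible_cut(FANINS, "r", {"p", "b", "c"}, 3))  # True
print(is_k_feasible_cut(FANINS, "r", {"p", "b", "c"}, 2))  # False: too many nodes
print(is_cut(FANINS, "r", {"p", "b"}))                     # False: path r -> q -> c
```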

  5. k-feasible Cut Computation • The set of cuts of a node is a ‘cross product’ of the sets of cuts of its children (each pair of child cuts is merged by set union), plus the node's trivial cut • Computation is done bottom-up • Any cut of size greater than k is discarded (P. Pan et al, FPGA ’98; J. Cong et al, FPGA ’99) [Figure: cut sets in the example network: a: { {a} }; b: { {b} }; c: { {c} }; p: { {p}, {a, b} }; q: { {q}, {b, c} }; r: { {r}, {p, q}, {p, b, c}, {a, b, q}, {a, b, c} }]
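The bottom-up ‘cross product’ can be sketched in a few lines of Python. The sketch takes the union of every cut of one fanin with every cut of the other fanin, drops any result with more than k leaves, and adds the trivial cut. It assumes every non-PI node has exactly two fanins, as in an And-Inverter Graph. The network and names are hypothetical and follow the example on this slide.

```python
from itertools import product

# Example network (assumed): each node lists its two fanins; PIs have none.
FANINS = {
    "a": [], "b": [], "c": [],
    "p": ["a", "b"],
    "q": ["b", "c"],
    "r": ["p", "q"],
}
TOPO_ORDER = ["a", "b", "c", "p", "q", "r"]  # every fanin precedes its fanouts


def enumerate_cuts(fanins, order, k):
    """Bottom-up k-feasible cut enumeration by merging the cut sets of the two fanins."""
    cuts = {}
    for n in order:
        trivial = frozenset([n])
        if not fanins[n]:
            cuts[n] = {trivial}  # a PI has only its trivial cut
            continue
        f0, f1 = fanins[n]
        # 'Cross product': union every cut of f0 with every cut of f1,
        # discarding any result with more than k leaves.
        merged = {c0 | c1 for c0, c1 in product(cuts[f0], cuts[f1]) if len(c0 | c1) <= k}
        cuts[n] = merged | {trivial}
    return cuts


cuts = enumerate_cuts(FANINS, TOPO_ORDER, k=3)
print(len(cuts["r"]))  # 5
for cut in sorted(sorted(c) for c in cuts["r"]):
    print(cut)
```

With k=2, only {r} and {p, q} survive at r, because every other merged cut has three leaves.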

  6. Basic Mapping Algorithm Depth-optimal LUT mapping of a DAG using all cuts at each node Input: And-Inverter Graph • Compute K-feasible cuts for each node • Compute best arrival time at each node • In topological order (from PI to PO) • Compute the depth of all cuts and choose the best one • Perform area recovery • Using area flow • Using exact local area • Choose the best cover • In reverse topological order (from PO to PI) Output: Mapped Netlist
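The delay-oriented part of this flow can be sketched as two passes. The forward pass computes each node's best arrival time over all of its non-trivial cuts. The backward pass, in reverse topological order, selects the cover starting from the POs. This Python sketch assumes a unit-delay model (one LUT level per cut) and breaks ties by smaller cut size. It leaves area recovery out. The network and the names are hypothetical.

```python
from itertools import product

FANINS = {
    "a": [], "b": [], "c": [],
    "p": ["a", "b"],
    "q": ["b", "c"],
    "r": ["p", "q"],
}
TOPO_ORDER = ["a", "b", "c", "p", "q", "r"]
OUTPUTS = ["r"]


def enumerate_cuts(fanins, order, k):
    """All k-feasible cuts of every node, merged bottom-up as on slide 5."""
    cuts = {}
    for n in order:
        trivial = frozenset([n])
        if not fanins[n]:
            cuts[n] = {trivial}
            continue
        f0, f1 = fanins[n]
        cuts[n] = {c0 | c1 for c0, c1 in product(cuts[f0], cuts[f1]) if len(c0 | c1) <= k}
        cuts[n].add(trivial)
    return cuts


def depth_optimal_map(fanins, order, outputs, k):
    cuts = enumerate_cuts(fanins, order, k)
    arrival, best = {}, {}

    def depth(cut):
        # A LUT implementing this cut sits one level above its latest leaf.
        return 1 + max(arrival[leaf] for leaf in cut)

    # Forward pass (PI to PO): best arrival time at each node.
    for n in order:
        if not fanins[n]:
            arrival[n] = 0
            continue
        candidates = [c for c in cuts[n] if n not in c]  # skip the trivial cut
        best[n] = min(candidates, key=lambda c: (depth(c), len(c), sorted(c)))
        arrival[n] = depth(best[n])

    # Backward pass (PO to PI): keep only the LUTs that the cover actually needs.
    needed, cover = set(outputs), {}
    for n in reversed(order):
        if n in needed and fanins[n]:
            cover[n] = best[n]
            needed.update(best[n])
    return arrival, cover


arrival, cover = depth_optimal_map(FANINS, TOPO_ORDER, OUTPUTS, k=3)
print(arrival["r"], len(cover))  # 1 1 (one 3-LUT covers the whole network)
```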

  7. Area Recovery Summary • Area recovery heuristics • Area-flow (global view) • Chooses cuts with better logic sharing • Exact local area (local view) • Minimizes the number of LUTs needed to map each node • The results of area recovery depend on • The order of processing nodes • The order of applying two passes • The number of iterations • This scheme works for the constant-delay model • Any change off the critical path doesn’t affect the critical path
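Area flow is commonly defined as AF(n) = (A(cut) + sum of AF(leaf) over the cut leaves) / |fanouts(n)|, with AF(PI) = 0. Dividing by the fanout count is what rewards logic sharing. The Python sketch below makes several assumptions: a unit LUT area, fanouts counted in the subject graph, and a node without fanouts (a PO driver) treated as having one. It also ignores the required-time constraint that real area recovery must respect. The network and names are hypothetical.

```python
from itertools import product

FANINS = {
    "a": [], "b": [], "c": [],
    "p": ["a", "b"],
    "q": ["b", "c"],
    "r": ["p", "q"],
}
TOPO_ORDER = ["a", "b", "c", "p", "q", "r"]


def enumerate_cuts(fanins, order, k):
    """All k-feasible cuts of every node, merged bottom-up as on slide 5."""
    cuts = {}
    for n in order:
        trivial = frozenset([n])
        if not fanins[n]:
            cuts[n] = {trivial}
            continue
        f0, f1 = fanins[n]
        cuts[n] = {c0 | c1 for c0, c1 in product(cuts[f0], cuts[f1]) if len(c0 | c1) <= k}
        cuts[n].add(trivial)
    return cuts


def area_flow_map(fanins, order, k, lut_area=1.0):
    cuts = enumerate_cuts(fanins, order, k)
    fanouts = {n: 0 for n in order}
    for n in order:
        for f in fanins[n]:
            fanouts[f] += 1
    flow, best = {}, {}

    def cut_flow(cut):
        # Area of the LUT for this cut plus the area flow arriving at its leaves.
        return lut_area + sum(flow[leaf] for leaf in cut)

    for n in order:
        if not fanins[n]:
            flow[n] = 0.0  # PIs cost nothing
            continue
        candidates = [c for c in cuts[n] if n not in c]
        best[n] = min(candidates, key=lambda c: (cut_flow(c), len(c), sorted(c)))
        # Share the cost among the fanouts; a node with no fanouts counts as one.
        flow[n] = cut_flow(best[n]) / max(1, fanouts[n])
    return flow, best


flow, best = area_flow_map(FANINS, TOPO_ORDER, k=3)
print(flow["r"], sorted(best["r"]))  # 1.0 ['a', 'b', 'c']
```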

  8. Drawbacks of Traditional Mapping Based on Exhaustive Cut Enumeration • For large designs, there may be many k-feasible cuts • Order of millions • Previous ways of dealing with the problem • Detect and remove cut dominance • Perform cut pruning • Store only cuts on the frontier of mapping
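Dominance removal, the first pruning technique listed above, discards a cut when another cut of the same node is a subset of it. The smaller cut has fewer leaves, so by depth or by area flow it is never worse, and the larger one is usually dropped. The Python sketch below is a hypothetical illustration, not the authors' code. It visits cuts in order of size, so any dominating subset has already been kept by the time its supersets are checked.

```python
def remove_dominated(cuts):
    """Drop every cut that contains another cut of the same node (it is dominated)."""
    # Visit smaller cuts first so that any dominating subset has already been kept.
    ordered = sorted(set(map(frozenset, cuts)), key=lambda c: (len(c), sorted(c)))
    kept = []
    for cut in ordered:
        if not any(smaller <= cut for smaller in kept):
            kept.append(cut)
    return kept


print([sorted(c) for c in remove_dominated([{"a", "b"}, {"a", "b", "c"}, {"c"}, {"b", "c"}])])
# [['c'], ['a', 'b']]
```

The pairwise subset test is quadratic in the number of cuts per node. That is affordable for small cut sets, but it is one reason exhaustive enumeration stays expensive when a node has very many cuts.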

  9. Outline • Traditional cut-based technology mapping • Improved technology mapping • Sequential mapping • Other applications of priority cuts • Experimental results

  10. New Mapping Algorithm Near-depth-optimal LUT mapping of a DAG using several cuts at each node Input: And-Inverter Graph • Compute arrival time at each node • In topological order (from PI to PO) • Compute at most C good cuts and choose the best one (instead of computing all K-feasible cuts and the depth of each) • Perform area recovery • Using area flow • Using exact local area • Re-compute at most C good cuts and choose the best one in each iteration • Choose the best cover • In reverse topological order (from PO to PI) Output: Mapped Netlist

  11. Computing Priority Cuts • Consider nodes in a topological order • At each node, merge the two sets of fanin cuts (each containing up to C priority cuts plus the fanin's trivial cut), getting up to (C+1) * (C+1) + 1 cuts, counting the node's own trivial cut • Sort these cuts using a given cost function, select the C best cuts, and use them for computing the priority cuts of the fanouts • Select the single best cut and use it to map the node • Sorting criteria
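The following is a minimal Python sketch of the priority-cut loop. Each fanin offers its stored priority cuts plus its trivial cut. The merged candidates are sorted, the C best are kept for the fanouts, and the first one maps the node. The slide's sorting criteria are not recoverable here, so the ranking by depth and then by cut size is an assumption, as are the example network and all names. The sketch also assumes K >= 2, so that the merge of the two trivial fanin cuts always fits.

```python
from itertools import product

# Example network (assumed): r = f(p, q), p = f(a, b), q = f(b, c).
FANINS = {
    "a": [], "b": [], "c": [],
    "p": ["a", "b"],
    "q": ["b", "c"],
    "r": ["p", "q"],
}
TOPO_ORDER = ["a", "b", "c", "p", "q", "r"]


def priority_cuts(fanins, order, k, c):
    """Keep at most c cuts per node, ranked by (depth, size); assumes k >= 2."""
    arrival, prio, best = {}, {}, {}

    def rank(cut):
        depth = 1 + max(arrival[leaf] for leaf in cut)
        return (depth, len(cut), sorted(cut))  # sorted leaves break ties deterministically

    for n in order:
        if not fanins[n]:
            arrival[n], prio[n] = 0, []  # a PI offers only its trivial cut
            continue
        f0, f1 = fanins[n]
        # Each fanin contributes its priority cuts plus its trivial cut, giving at most
        # (c+1)*(c+1) merged candidates; n's own trivial cut is added at its fanouts.
        side0 = prio[f0] + [frozenset([f0])]
        side1 = prio[f1] + [frozenset([f1])]
        candidates = {x | y for x, y in product(side0, side1) if len(x | y) <= k}
        ranked = sorted(candidates, key=rank)
        prio[n] = ranked[:c]          # the C best cuts, reused by the fanouts
        best[n] = ranked[0]           # the single best cut, used to map n
        arrival[n] = rank(best[n])[0]
    return arrival, prio, best


arrival, prio, best = priority_cuts(FANINS, TOPO_ORDER, k=3, c=2)
print(arrival["r"], [sorted(cut) for cut in prio["r"]])
# 1 [['a', 'b', 'c'], ['p', 'q']]
```

Only C cuts per node are ever stored. Memory therefore grows linearly with the number of nodes, whereas exhaustive enumeration can keep millions of cuts.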

  12. Discussion • K - max cut size • C - max number of cuts • n - number of nodes • m - number of edges • Complexity analysis • Traditional mapping algorithm • FlowMap $O(Kmn)$ (J. Cong et al, TCAD ’94) • CutMap $O(2^K m n^K)$ (J. Cong et al, FPGA ’95) • Proposed mapping algorithm • $O(K C^2 n)$

  13. Priority Cuts: A Bag of Tricks • Compute and use priority cuts (a subset of all cuts) • Dynamically update the cuts in each mapping pass • Use different sorting criteria in each mapping pass • Include the best cut from the previous pass into the set of candidate cuts of the current pass • Consider several depth-oriented mappings to get a good starting point for area recovery • Use complementary heuristics for area recovery • Perform cut expansion as part of area recovery • Use efficient memory management

  14. Outline • Traditional cut-based technology mapping • Improved technology mapping • Sequential mapping • Other applications of priority cuts • Experimental results

  15. Sequential Mapping • That is, combinational mapping and retiming combined • Minimizes the clock period in the combined solution space • Previous work: • Pan et al, FPGA’98 • Cong et al, TCAD’98 • Our contribution: divide sequential mapping into steps • Step 1: Find the best clock period via sequential arrival time computation (Pan et al, FPGA’98) • Step 2: Run combinational mapping with the resulting arrival/required times of the register outputs/inputs • Step 3: Perform final retiming to bring the circuit to the best clock period computed in Step 1
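The sequential arrival times in Step 1 can be illustrated with a label computation. The label of a node is propagated across fanin edges, and each register crossed subtracts the target period phi: label(v) = max over edges u -> v of (label(u) - phi * registers(u, v)) + delay(v). A period is treated as feasible when the labels converge and every PO label is at most phi. The labels stop converging when some loop has more delay than phi times its register count. This is a heavily simplified sketch in the spirit of the cited work, not the authors' implementation. It does no cut selection: each gate is a fixed unit-delay node, so it finds only the best period reachable by retiming this one netlist. The netlist, the names, and the integer binary search are all hypothetical.

```python
import math

# Hypothetical sequential netlist with unit-delay gates. Each edge is
# (driver, sink, number of registers on the edge); "x" is a PI and "po" a PO.
EDGES = [
    ("x", "g1", 2),
    ("g1", "g2", 0),
    ("g2", "g3", 0),
    ("g3", "g2", 1),  # loop g2 -> g3 -> g2: two gates, one register
    ("g3", "po", 0),
]
DELAY = {"x": 0, "g1": 1, "g2": 1, "g3": 1, "po": 0}
PIS, POS = {"x"}, {"po"}


def period_is_feasible(phi):
    """Label-based check of whether retiming can reach clock period phi."""
    # label[v]: sequential arrival time at v; crossing a register subtracts phi.
    label = {n: -math.inf for n in DELAY}
    for n in PIS:
        label[n] = 0
    for _ in range(len(DELAY)):  # Bellman-Ford style longest-path relaxation
        changed = False
        for u, v, regs in EDGES:
            candidate = label[u] - phi * regs + DELAY[v]
            if candidate > label[v]:
                label[v] = candidate
                changed = True
        if not changed:
            # Converged: feasible if every PO is reached within one period.
            return all(label[p] <= phi for p in POS)
    # Still changing: some loop has more delay than phi times its register count.
    return False


def min_clock_period(upper_bound):
    """Binary search over integer periods; assumes upper_bound itself is feasible."""
    lo, hi = 1, upper_bound
    while lo < hi:
        mid = (lo + hi) // 2
        if period_is_feasible(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo


print(min_clock_period(upper_bound=3))  # 2 (the unretimed circuit has period 3)
```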

  16. Sequential Mapping (continued) • Advantages • Uses priority cuts (C=1) for computing sequential arrival times • very fast • Reuses efficient area recovery available in combinational mapping • almost no degradation in LUT count and register count • Greatly simplifies implementation • due to not computing sequential cuts (cuts crossing the register boundary) • Quality of results • Leads to quality that is better (by ~15%) than combinational mapping followed by retiming • due to searching the combined search space • Achieves almost the same (-1%) clock period as the general sequential mapping with sequential cuts • due to using transparent register boundary without computing sequential cuts

  17. Outline • Traditional cut-based technology mapping • Improved technology mapping • Sequential mapping • Other applications of priority cuts • Experimental results

  18. Speeding Up SAT Solving • Perform technology mapping into K-LUTs for area • Define area as the number of CNF clauses needed to represent the Boolean function of the cut • Run several iterations of area recovery • Reduces the number of CNF clauses by ~50% • Compared to a smart circuit-to-CNF translation (M. Velev) • Improves SAT solver runtime by 3-10x • Experimental results will be given later
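The slide defines the area of a cut as the number of CNF clauses for its function, but it does not say how that number is obtained. This sketch assumes one common way to count them. Each cube of an irredundant sum-of-products (ISOP) of f yields a clause (NOT cube OR y), and each cube of an ISOP of NOT f yields a clause (NOT cube OR NOT y). Together these encode y <-> f(x), so the clause count is |ISOP(f)| + |ISOP(NOT f)|. The ISOPs are computed with the Minato-Morreale recursion on integer truth tables, and the function names are hypothetical.

```python
def isop(lower, upper, nvars):
    """Minato-Morreale ISOP between lower and upper (lower must be a subset of upper).

    Truth tables are ints with 2**nvars bits; bit i holds the value at minterm i.
    Returns (number of cubes, truth table of the resulting cover).
    """
    if lower == 0:
        return 0, 0
    full = (1 << (1 << nvars)) - 1
    if upper == full:
        return 1, full                      # a single tautology cube suffices
    half = 1 << (nvars - 1)                 # cofactor w.r.t. the top variable
    mask = (1 << half) - 1
    l0, l1 = lower & mask, lower >> half
    u0, u1 = upper & mask, upper >> half
    n0, r0 = isop(l0 & ~u1 & mask, u0, nvars - 1)   # cubes needing top var = 0
    n1, r1 = isop(l1 & ~u0 & mask, u1, nvars - 1)   # cubes needing top var = 1
    rest = ((l0 & ~r0) | (l1 & ~r1)) & mask
    n2, r2 = isop(rest, u0 & u1, nvars - 1)         # cubes independent of top var
    return n0 + n1 + n2, r0 | r2 | ((r1 | r2) << half)


def cnf_clause_count(tt, nvars):
    """Clauses for y <-> f(x): one per ISOP cube of f and one per ISOP cube of NOT f."""
    full = (1 << (1 << nvars)) - 1
    off = ~tt & full
    return isop(tt, tt, nvars)[0] + isop(off, off, nvars)[0]


print(cnf_clause_count(0b1000, 2), cnf_clause_count(0b0110, 2))  # 3 4 (AND2, XOR2)
```

In a mapper, this count would replace the unit LUT area in the area-recovery cost. Cuts whose functions have compact SOPs in both polarities then become cheaper.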

  19. Minimizing the Total Number of BDD Nodes Needed to Represent a Boolean Network • Perform technology mapping into K-LUTs for minimizing area under delay constraints • Define area of a cut as the number of BDD nodes needed to represent the Boolean function of the cut • Run delay-oriented mapping, followed by several iterations of area recovery
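For the BDD objective, the per-cut cost is the number of BDD nodes for the cut function. This Python sketch counts the internal nodes of the reduced ordered BDD of a truth table, using a unique table keyed by (variable, low child, high child). It makes two assumptions. First, the highest-index variable is tested first. Second, each cut function is built in isolation, so sharing between the BDDs of different cuts is not counted. Both assumptions and the function name are hypothetical.

```python
def bdd_node_count(tt, nvars):
    """Internal nodes of the reduced ordered BDD of a truth table (top variable first)."""
    unique = {}   # unique table: (var, low, high) -> node handle
    memo = {}     # (truth table, nvars) -> node handle

    def build(f, m):
        if m == 0:
            return ("const", f)             # terminal node 0 or 1
        if (f, m) in memo:
            return memo[(f, m)]
        half = 1 << (m - 1)
        mask = (1 << half) - 1
        low = build(f & mask, m - 1)        # cofactor with variable m-1 = 0
        high = build(f >> half, m - 1)      # cofactor with variable m-1 = 1
        if low == high:
            node = low                      # redundant test: skip this variable
        else:
            node = unique.setdefault((m - 1, low, high), ("node", len(unique)))
        memo[(f, m)] = node
        return node

    build(tt, nvars)
    return len(unique)


print(bdd_node_count(0b1000, 2), bdd_node_count(0b0110, 2))  # 2 3 (AND2, XOR2)
```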

  20. Cut Sweeping • Reduce the circuit by detecting and merging shallow equivalences (proposed by Niklas Een) • By “shallow” equivalences, we mean equivalent points, A and B, for which there exists a K-cut C (K < 16) such that $F_A(C) = F_B(C)$ • A subset of “good” K-input priority cuts can be computed • The quality of a cut is determined by the number of fanouts of the cut leaves • The more fanouts, the more likely the cut is a common cut for two nodes • Cut sweeping quickly reduces the circuit • Typically achieves ~50% of the gain of SAT sweeping (fraiging) • Cut sweeping is much faster than SAT sweeping • Typically 10-100x, for large designs • Can be used as a fast preprocessing step for (or a low-cost substitute for) SAT sweeping
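Cut sweeping can be sketched in three steps. First, compute a few priority cuts per node, ranked as the slide suggests by the total fanout of the cut leaves. Second, simulate each node over each kept cut to get its truth table. Third, treat nodes that share both a cut and a truth table as equivalent. This is a simplified Python illustration. It assumes an AIG without complemented edges, does not detect complemented equivalences, and uses a hypothetical example, names, and merge map. In the example, n2 = a & (b & c) and n4 = (a & b) & c are structurally different but share the cut {a, b, c}.

```python
from itertools import product

# Hypothetical AIG without complemented edges.
FANINS = {
    "a": [], "b": [], "c": [],
    "n1": ["b", "c"], "n2": ["a", "n1"],
    "n3": ["a", "b"], "n4": ["n3", "c"],
}
ORDER = ["a", "b", "c", "n1", "n2", "n3", "n4"]


def truth_table(fanins, node, cut):
    """Return (sorted leaves, truth table of node over those leaves) as a hashable key."""
    leaves = sorted(cut)
    width = 1 << len(leaves)
    # Elementary table of leaf i: bit m is set iff bit i of minterm m is 1.
    tables = {leaf: sum(1 << m for m in range(width) if (m >> i) & 1)
              for i, leaf in enumerate(leaves)}

    def simulate(n):
        if n not in tables:
            f0, f1 = fanins[n]
            tables[n] = simulate(f0) & simulate(f1)
        return tables[n]

    return tuple(leaves), simulate(node)


def cut_sweep(fanins, order, k, c):
    """Map each node found equivalent to an earlier node onto that representative."""
    fanouts = {n: 0 for n in order}
    for n in order:
        for f in fanins[n]:
            fanouts[f] += 1
    prio, buckets = {}, {}
    for n in order:
        if not fanins[n]:
            prio[n] = []
            continue
        f0, f1 = fanins[n]
        side0 = prio[f0] + [frozenset([f0])]
        side1 = prio[f1] + [frozenset([f1])]
        candidates = {x | y for x, y in product(side0, side1) if len(x | y) <= k}
        # Prefer cuts whose leaves have many fanouts: they are more likely to be shared.
        ranked = sorted(candidates,
                        key=lambda cut: (-sum(fanouts[leaf] for leaf in cut), sorted(cut)))
        prio[n] = ranked[:c]
        for cut in prio[n]:
            buckets.setdefault(truth_table(fanins, n, cut), []).append(n)
    # Nodes in one bucket share a cut and a function over it, so they are equivalent.
    merge = {}
    for nodes in buckets.values():
        for other in nodes[1:]:
            merge.setdefault(other, nodes[0])
    return merge


print(cut_sweep(FANINS, ORDER, k=3, c=2))  # {'n4': 'n2'}
```

With k=2, the common cut {a, b, c} is out of reach and no equivalence is found. Sweeping catches only equivalences visible within K-input cuts, which is why it is cheaper but weaker than SAT sweeping.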

  21. Sequential Resynthesis for Delay • Restructure logic along the tightest sequential loops to reduce delay after retiming (Soviani/Edwards, TCAD’07) • Similar to sequential mapping • Computes sequential arrival times for the circuit • Uses the current logic structure, as well as the logic structure transformed by Shannon expansion with respect to the latest-arriving variables • Accepts transforms leading to delay reduction • In the end, retimes to the best clock period • The improvement is 7-60% in delay with 1-12% area degradation (ISCAS circuits) • This algorithm could benefit from the use of priority cuts

  22. Outline • Traditional cut-based technology mapping • Improved technology mapping • Sequential mapping • Other applications of priority cuts • Experimental results

  23. Experimental Comparison • Compare the new mapping against the traditional mapping in terms of • Delay • Area • Runtime • Memory • Compare on large industrial benchmarks with choices • Analyze the performance of the new mapping for • Large designs • Large LUTs • Explore the potential of sequential mapping • Computer used for experiments • IBM ThinkPad laptop with a 1.6 GHz CPU and 2 GB RAM

  24. Priority Cuts vs. Cut Enumeration (C=8) Used a set of large public benchmarks

  25. Priority Cuts vs. Cut Enumeration (K=6, C=16) [Table: cut enumeration vs. priority cuts, for mapping without choices and mapping with choices] Used a set of large industrial benchmarks

  26. Performance on Large Designs (C=1) Using design wb_conmax.v (part of IWLS 2005 benchmarks). This is a WISHBONE Interconnect Matrix IP core. It can interconnect up to 8 Masters and 16 Slaves. Source: http://www.opencores.org

  27. Performance for Large LUTs (C=1) Using 100 timeframes of design wb_conmax.v

  28. Sequential Mapping (K=6, C=8) Used a subset of ISCAS benchmarks, for which retiming reduced delay

  29. Summary • Reviewed traditional technology mapping • Cut computation • Optimum-depth mapping • Area recovery • Presented an improved approach to mapping • Computes a small number of cuts at each node • Uses new ideas to dramatically reduce memory and runtime • Reported experimental results • Compared priority cuts with exhaustive cut enumeration • Delay and area are comparable or better by 1-3% • Memory and runtime are greatly reduced (5x for 6-LUTs) • Showed performance on very large designs (2 sec to map 1M) • Compared combinational and sequential mapping • Implemented in ABC • Google: “abc berkeley” (package “if”)

  30. The End
