
# Combinational and Sequential Mapping with Priority Cuts

Combinational and Sequential Mapping with Priority Cuts. Alan Mishchenko Sungmin Cho Satrajit Chatterjee Robert Brayton UC Berkeley. Outline. Traditional cut-based LUT mapping Improved technology mapping with priority cuts Sequential mapping Other applications of priority cuts


### Combinational and Sequential Mapping with Priority Cuts

Alan Mishchenko

Sungmin Cho

Satrajit Chatterjee

Robert Brayton

UC Berkeley

• Traditional cut-based LUT mapping

• Improved technology mapping with priority cuts

• Sequential mapping

• Other applications of priority cuts

• Experimental results

Input: A Boolean network (And-Inverter Graph)

Output: A netlist of K-LUTs that implements the Boolean network while optimizing some cost function

Figure: Technology mapping turns the subject graph into the mapped netlist (both drawn over the signals a, b, c, d, e, and f).

k-feasible Cuts

A cut of a node n is a set of nodes in its transitive fan-in such that every path from the node to the PIs is blocked by nodes in the cut.

A k-feasible cut is a cut whose size is k or less.

Figure: Node r has fanins p and q; p has fanins a and b; q has fanins b and c.

The set {p, b, c} is a 3-feasible cut of node r. (It is also a 4-feasible cut.)

k-feasible cuts are important in FPGA mapping since the logic between a node and the nodes in its cut can be replaced by a k-LUT.

k-feasible Cut Computation

The set of cuts of a node is a ‘cross product’ of the sets of cuts of its children

Computation is done bottom-up:

• a: { {a} }, b: { {b} }, c: { {c} }

• p: { {p}, {a, b} }

• q: { {q}, {b, c} }

• r: { {r}, {p, q}, {p, b, c}, {a, b, q}, {a, b, c} }

Any cut that is of size greater than k is discarded

(P. Pan et al, FPGA ’98; J. Cong et al, FPGA ’99)
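The bottom-up cross-product computation can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `enumerate_cuts`, the dictionary-based network representation, and the omission of inverters and dominance filtering are all assumptions made for the example.

```python
from itertools import product

def enumerate_cuts(fanins, order, k):
    """Return {node: set of frozenset cuts} holding all k-feasible cuts.

    fanins: {node: (left, right)} for two-input nodes; primary inputs are absent.
    order:  all nodes in topological order (inputs first).
    """
    cuts = {}
    for n in order:
        trivial = frozenset([n])
        if n not in fanins:              # primary input: only the trivial cut
            cuts[n] = {trivial}
            continue
        left, right = fanins[n]
        merged = {trivial}
        # cross product of the fanin cut sets
        for a, b in product(cuts[left], cuts[right]):
            c = a | b
            if len(c) <= k:              # discard cuts larger than k
                merged.add(c)
        cuts[n] = merged
    return cuts

# The example network from the slide: p = f(a, b), q = f(b, c), r = f(p, q)
fanins = {"p": ("a", "b"), "q": ("b", "c"), "r": ("p", "q")}
order = ["a", "b", "c", "p", "q", "r"]
cuts = enumerate_cuts(fanins, order, k=3)
cuts_k2 = enumerate_cuts(fanins, order, k=2)
```

With k=3 the sketch yields the five cuts of r listed on the slide; with k=2 only {r} and {p, q} survive.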

Depth-optimal LUT mapping of a DAG using all cuts at each node

Input: And-Inverter Graph

• Compute K-feasible cuts for each node

• Compute best arrival time at each node

• In topological order (from PI to PO)

• Compute the depth of all cuts and choose the best one

• Perform area recovery

• Using area flow

• Using exact local area

• Choose the best cover

• In reverse topological order (from PO to PI)

Output: Mapped Netlist
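The two traversals of this flow (arrival times from PI to PO, cover selection from PO to PI) can be sketched as follows. This is a hedged illustration under the unit-delay LUT model, with area recovery left out; the function name `map_for_depth` and the data layout are hypothetical, and the cut sets are written out by hand for the small example network.

```python
def map_for_depth(fanins, order, outputs, cuts):
    """Pick a minimum-depth cut per node, then extract the cover.

    cuts: {node: iterable of frozensets}, may include the trivial cut {node}.
    Returns (arrival, cover), where cover maps each used node to its cut.
    """
    arrival, best = {}, {}
    for n in order:                                   # from PI to PO
        if n not in fanins:
            arrival[n] = 0                            # primary input
            continue
        candidates = [c for c in cuts[n] if c != frozenset([n])]
        # depth of a cut = latest arrival time among its leaves
        best[n] = min(candidates, key=lambda c: max(arrival[x] for x in c))
        arrival[n] = 1 + max(arrival[x] for x in best[n])

    cover, stack = {}, [o for o in outputs if o in fanins]
    while stack:                                      # from PO to PI
        n = stack.pop()
        if n in cover:
            continue
        cover[n] = best[n]
        stack.extend(x for x in best[n] if x in fanins)
    return arrival, cover

fanins = {"p": ("a", "b"), "q": ("b", "c"), "r": ("p", "q")}
order = ["a", "b", "c", "p", "q", "r"]
cuts = {
    "p": [frozenset("ab")],
    "q": [frozenset("bc")],
    "r": [frozenset("pq"), frozenset("pbc"), frozenset("abq"), frozenset("abc")],
}
arrival, cover = map_for_depth(fanins, order, ["r"], cuts)
```

For this network with K=3, the cut {a, b, c} gives node r depth 1, so the cover consists of a single LUT and nodes p and q are absorbed into it.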

• Area recovery heuristics

• Area-flow (global view)

• Chooses cuts with better logic sharing

• Exact local area (local view)

• Minimizes the number of LUTs needed to map each node

• The result of area recovery depends on

• The order of processing nodes

• The order of applying two passes

• The number of iterations

• This scheme works for the constant-delay model

• Any change off the critical path doesn’t affect the critical path
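The slides do not spell out the area-flow formula. The sketch below assumes the commonly used definition, in which the flow of a cut is one LUT plus the flow of each leaf divided by that leaf's fanout count, so that shared logic is charged only partially to each user. The function name `area_flow`, the use of structural fanout counts from the subject graph, and the omission of depth constraints are simplifying assumptions.

```python
def area_flow(fanins, order, cuts):
    """Choose, per node, the cut with the smallest area flow.

    Assumed definition: flow(cut) = 1 + sum(flow(leaf) / fanout(leaf)).
    Returns (flow, best) dictionaries keyed by node.
    """
    fanout = {n: 0 for n in order}
    for left, right in fanins.values():
        fanout[left] += 1
        fanout[right] += 1

    flow, best = {}, {}
    for n in order:                          # topological order
        if n not in fanins:
            flow[n] = 0.0                    # primary inputs cost nothing
            continue

        def cost(cut):
            return 1.0 + sum(flow[x] / max(1, fanout[x]) for x in cut)

        candidates = [c for c in cuts[n] if c != frozenset([n])]
        best[n] = min(candidates, key=cost)
        flow[n] = cost(best[n])
    return flow, best

fanins = {"p": ("a", "b"), "q": ("b", "c"), "r": ("p", "q")}
order = ["a", "b", "c", "p", "q", "r"]
cuts = {
    "p": [frozenset("ab")],
    "q": [frozenset("bc")],
    "r": [frozenset("pq"), frozenset("pbc"), frozenset("abq"), frozenset("abc")],
}
flow, best = area_flow(fanins, order, cuts)
```

Here the cut {p, q} of node r has flow 3 (its own LUT plus the LUTs for p and q), while {a, b, c} has flow 1, so the heuristic prefers the single-LUT solution.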

• For large designs, there may be many k-feasible cuts

• On the order of millions

• Previous ways of dealing with the problem

• Detect and remove cut dominance

• Perform cut pruning

• Store only cuts on the frontier of mapping

Outline

• Traditional cut-based technology mapping

• Improved technology mapping

• Sequential mapping

• Other applications of priority cuts

• Experimental results

New Mapping Algorithm

Near-depth-optimal LUT mapping of a DAG using several cuts at each node

Input: And-Inverter Graph

• Compute K-feasible cuts for each node

• Compute arrival time at each node

• In topological order (from PI to PO)

• Compute the depth of all cuts and choose the best one

• Compute at most C good cuts and choose the best one

• Perform area recovery

• Using area flow

• Using exact local area

• Re-compute at most C good cuts and choose the best one in each iteration

• Choose the best cover

• In reverse topological order (from PO to PI)

Output: Mapped Netlist

Computing Priority Cuts

• Consider nodes in a topological order

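• At each node, merge two sets of fanin cuts (each containing C cuts plus the trivial cut of the fanin), getting at most (C+1) * (C+1) + 1 cuts, where the final +1 is the trivial cut of the node itself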

• Sort these cuts using a given cost function, select C best cuts, and use them for computing priority cuts of the fanouts

• Select one best cut, and use it to map the node

• Sorting criteria
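The procedure above can be sketched as follows. This is an illustrative reading of the slide, not the ABC code: the function name `priority_cuts` is hypothetical, the sorting key (depth, then cut size, then a lexical tie-break) is only one possible choice of sorting criteria, and the sketch assumes K >= 2 so that every node has at least one feasible cut. The trivial cut of each fanin is appended when that fanin's cuts are merged.

```python
def priority_cuts(fanins, order, k, c):
    """Keep at most c cuts per node and map each node with its best cut.

    Returns (stored, arrival): stored[n] is the sorted list of kept cuts.
    """
    arrival, stored = {}, {}
    for n in order:                                   # topological order
        if n not in fanins:
            arrival[n], stored[n] = 0, []
            continue
        left, right = fanins[n]
        left_set = stored[left] + [frozenset([left])]     # C cuts + trivial cut
        right_set = stored[right] + [frozenset([right])]
        merged = {a | b for a in left_set for b in right_set}
        feasible = [cut for cut in merged if len(cut) <= k]

        def key(cut):
            # assumed criteria: smaller depth first, then fewer leaves
            return (max(arrival[x] for x in cut), len(cut), sorted(cut))

        stored[n] = sorted(feasible, key=key)[:c]     # the C best cuts
        arrival[n] = 1 + max(arrival[x] for x in stored[n][0])
    return stored, arrival

fanins = {"p": ("a", "b"), "q": ("b", "c"), "r": ("p", "q")}
order = ["a", "b", "c", "p", "q", "r"]
stored, arrival = priority_cuts(fanins, order, k=3, c=2)
```

With C=2, node r keeps {a, b, c} (depth 0) and {p, q} (depth 1, two leaves) and drops the other two candidates; only the kept cuts are passed on to the fanouts of r.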

Discussion

• K - max cut size

• C - max number of cuts

• n - number of nodes

• m - number of edges

• Complexity analysis

• Traditional mapping algorithm

• FlowMap O(Kmn) (J. Cong et al, TCAD ’94)

• CutMap O(2Kmn^K) (J. Cong et al, FPGA ’95)

• Proposed mapping algorithm

• O(KC^2 n)

Priority Cuts: A Bag of Tricks

• Compute and use priority cuts (a subset of all cuts)

• Dynamically update the cuts in each mapping pass

• Use different sorting criteria in each mapping pass

• Include the best cut from the previous pass into the set of candidate cuts of the current pass

• Consider several depth-oriented mappings to get a good starting point for area recovery

• Use complementary heuristics for area recovery

• Perform cut expansion as part of area recovery

• Use efficient memory management

Outline

• Traditional cut-based technology mapping

• Improved technology mapping

• Sequential mapping

• Other applications of priority cuts

• Experimental results

Sequential Mapping

• That is, combinational mapping and retiming combined

• Minimizes clock period in the combined solution space

• Previous work:

• Pan et al, FPGA’98

• Cong et al, TCAD’98

• Our contribution: divide sequential mapping into steps

• Step 1: Find the best clock period via sequential arrival time computation (Pan et al, FPGA’98)

• Step 2: Run combinational mapping with the resulting arrival/required times of the register outputs/inputs

• Step 3: Perform final retiming to bring the circuit to the best clock period computed in Step 1
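The first step, finding the best clock period from sequential arrival times, can be illustrated with a simplified sketch. It treats every node as one unit of delay instead of selecting cuts, assumes every node is reachable from a primary input and every loop contains at least one register, and uses hypothetical names (`feasible`, `best_period`). A period is accepted when the arrival times converge and no primary output exceeds the period; the best period is found by binary search.

```python
import math

def feasible(period, nodes, inputs, outputs, edges):
    """Check a clock period using sequential arrival times.

    edges: {v: [(u, registers_on_edge), ...]} lists the fanins of v.
    Crossing a register subtracts one clock period from the arrival time.
    """
    l = {v: (0 if v in inputs else -math.inf) for v in nodes}
    for _ in range(len(nodes) + 1):
        changed = False
        for v in nodes:
            if v in inputs:
                continue
            new = 1 + max(l[u] - period * w for u, w in edges[v])
            if new > l[v]:
                l[v], changed = new, True
        if not changed:                       # converged
            return all(l[o] <= period for o in outputs)
    return False                              # still growing: some loop is too slow

def best_period(nodes, inputs, outputs, edges):
    lo, hi = 1, len(nodes)                    # len(nodes) is always feasible
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(mid, nodes, inputs, outputs, edges):
            hi = mid
        else:
            lo = mid + 1
    return lo

# Chain x -> a -> b -> c -> [register] -> d
nodes = ["x", "a", "b", "c", "d"]
edges = {"a": [("x", 0)], "b": [("a", 0)], "c": [("b", 0)], "d": [("c", 1)]}
chain = best_period(nodes, {"x"}, ["d"], edges)

# Loop a -> b -> c -> [register] -> a, driven by input x
loop_nodes = ["x", "a", "b", "c"]
loop_edges = {"a": [("x", 0), ("c", 1)], "b": [("a", 0)], "c": [("b", 0)]}
loop = best_period(loop_nodes, {"x"}, ["c"], loop_edges)
```

In the chain example the register can be moved back by one node, so the period drops from 3 to 2. In the loop example three units of delay share one register, so no retiming can do better than 3.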

Sequential Mapping (continued)

• Uses priority cuts (L=1) for computing sequential arrival times

• very fast

• Reuses efficient area recovery available in combinational mapping

• almost no degradation in LUT count and register count

• Greatly simplifies implementation

• due to not computing sequential cuts (cuts crossing register boundary)

• Quality of results

• Leads to quality that is better (by ~15%) than combinational mapping followed by retiming

• due to searching the combined search space

• Achieves almost the same (-1%) clock period as the general sequential mapping with sequential cuts

• due to using transparent register boundary without computing sequential cuts

Outline

• Traditional cut-based technology mapping

• Improved technology mapping

• Sequential mapping

• Other applications of priority cuts

• Experimental results

Speeding Up SAT Solving

• Perform technology mapping into K-LUTs for area

• Define area as the number of CNF clauses needed to represent the Boolean function of the cut

• Run several iterations of area recovery

• Reduced the number of CNF clauses by ~50%

• Compared to a smart circuit-to-CNF translation (M. Velev)

• Improves SAT solver runtime by 3-10x

• Experimental results will be given later

Minimizing the Total Number of BDD Nodes Needed to Represent a Boolean Network

• Perform technology mapping into K-LUTs for minimizing area under delay constraints

• Define area of a cut as the number of BDD nodes needed to represent the Boolean function of the cut

• Run delay-oriented mapping, followed by several iterations of area recovery

Cut Sweeping

• Reduce the circuit by detecting and merging shallow equivalences (proposed by Niklas Een)

• By “shallow” equivalences, we mean equivalent points, A and B, for which there exists a K-cut C (K < 16) such that F_A(C) = F_B(C)

• A subset of “good” K-input priority cuts can be computed

• The quality of a cut is determined by the number of fanouts of the cut leaves

• The more fanouts, the more likely the cut is a common cut for two nodes

• Cut sweeping quickly reduces the circuit

• Typically achieves ~50% of the gain of SAT sweeping (fraiging)

• Cut sweeping is much faster than SAT sweeping

• Typically 10-100x, for large designs

• Can be used as a fast preprocessing step for (or a low-cost substitute for) SAT sweeping
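The idea of detecting shallow equivalences can be sketched as follows: compute each node's truth table over the leaves of each of its cuts, and merge two nodes when they have the same function over the same cut. This is a minimal illustration with hypothetical names (`truth_table`, `cut_sweep`); the cuts are supplied by hand rather than ranked by leaf fanout, and equivalence up to complementation is not handled.

```python
def truth_table(node, leaves, fanins):
    """Truth table of node over the sorted leaves, packed into an int.

    fanins: {node: ((left, left_inverted), (right, right_inverted))} for AND nodes.
    Bit m of the result is the node's value under input minterm m.
    """
    leaves = sorted(leaves)
    rows = 1 << len(leaves)
    mask = (1 << rows) - 1
    value = {}
    for i, leaf in enumerate(leaves):
        # elementary truth table of the i-th leaf
        value[leaf] = sum(1 << m for m in range(rows) if (m >> i) & 1)

    def ev(n):
        if n not in value:
            (l, linv), (r, rinv) = fanins[n]
            a = ev(l) ^ (mask if linv else 0)
            b = ev(r) ^ (mask if rinv else 0)
            value[n] = a & b
        return value[n]

    return ev(node)

def cut_sweep(fanins, cuts):
    """Return {node: representative} for nodes that match on a common cut."""
    seen, merged = {}, {}
    for n in cuts:                            # assumed to be in topological order
        for cut in cuts[n]:
            key = (tuple(sorted(cut)), truth_table(n, cut, fanins))
            if key in seen and seen[key] != n:
                merged[n] = seen[key]
                break
            seen.setdefault(key, n)
    return merged

# f = (a & b) & c and g = a & (b & c) differ structurally but are equivalent
fanins = {
    "ab": (("a", False), ("b", False)),
    "bc": (("b", False), ("c", False)),
    "f": (("ab", False), ("c", False)),
    "g": (("a", False), ("bc", False)),
}
cuts = {
    "ab": [{"a", "b"}],
    "bc": [{"b", "c"}],
    "f": [{"a", "b", "c"}, {"ab", "c"}],
    "g": [{"a", "b", "c"}, {"a", "bc"}],
}
merged = cut_sweep(fanins, cuts)
```

Both f and g have the common cut {a, b, c} and the same truth table over it, so g is merged into f without calling a SAT solver.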

Sequential Resynthesis for Delay

• Restructure logic along the tightest sequential loops to reduce delay after retiming (Soviani/Edwards, TCAD’07)

• Similar to sequential mapping

• Computes seq arrival times for the circuit

• Uses the current logic structure as well as the logic structure transformed using Shannon expansion w.r.t. the latest-arriving variables

• Accepts transforms leading to delay reduction

• In the end, retimes to the best clock period

• The improvement is 7-60% in delay with 1-12% area degradation (ISCAS circuits)

• This algorithm could benefit from the use of priority cuts

Outline

• Traditional cut-based technology mapping

• Improved technology mapping

• Sequential mapping

• Other applications of priority cuts

• Experimental results

Experimental Comparison

• Compare the new mapping against the traditional mapping in terms of

• Delay

• Area

• Runtime

• Memory

• Compare on large industrial benchmarks with choices

• Analyze the performance of the new mapping for

• Large designs

• Large LUTs

• Explore the potential of sequential mapping

• Computer used for experiments

• IBM ThinkPad laptop with a 1.6 GHz CPU and 2 GB RAM

Priority Cuts vs. Cut Enumeration (C=8)

Used a set of the large public benchmarks

Priority Cuts vs. Cut Enumeration (K=6, C=16)

Comparison of cut enumeration and priority cuts, for mapping without choices and for mapping with choices.

Used a set of large industrial benchmarks

Performance on Large Designs (C=1)

Using design wb_conmax.v (part of IWLS 2005 benchmarks)

This is a WISHBONE Interconnect Matrix IP core. It can interconnect up to 8 Masters and 16 Slaves

Source: http://www.opencores.org

Performance for Large LUTs (C=1)

Using 100 timeframes of design wb_conmax.v

Sequential Mapping (K=6, C=8)

Used a subset of ISCAS benchmarks, for which retiming reduced delay

Summary

• Reviewed traditional technology mapping

• Cut computation

• Optimum-depth mapping

• Area recovery

• Presented an improved approach to mapping

• Computes a small number of cuts at each node

• Uses new ideas to dramatically reduce memory and runtime

• Reported experimental results

• Compared priority cuts with exhaustive cut enumeration

• Delay and area are comparable or better by 1-3%

• Memory and runtime are greatly reduced (5x for 6-LUTs)

• Showed performance on very large designs (2 sec to map 1M)

• Compared combinational and sequential mapping

• Implemented in ABC

• Google: “abc berkeley” (package “if”)

The End