### ECE 697F Reconfigurable Computing, Lecture 5: Technology Mapping: Packing Logic into LUTs

Overview

- Logic synthesis
- LUT Clustering
- LUT capacity
- Chortle – example technology mapper
- Architecture-specific optimization

Boolean network

- A Boolean network is the main representation of the logic functions for technology independent optimizations.
- Each node can be represented as sum-of-products (or product-of-sums).
- Provides multi-level structure, but functions in the network need not correspond to logic gates.

out1 = k2 + x2’

out2 = k3 + x1

k2 = x1’ x2 x4 + k1

k3 = k1 x4’

k1 = x2 + x3

Primary inputs: x1, x2, x3, x4

[Figure: Boolean network example]

Terms

- Support: set of variables used by a function.
- Transitive fanout: all the primary outputs and intermediate variables that depend on a function.
- Transitive fanin: all the primary inputs and intermediate variables used by a function. Transitive fanin determines a cone of logic.
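The three terms can be checked against the example network with a short sketch. The dictionary encoding and the function names below are illustrative assumptions, not part of the lecture; only the node equations come from the slide.

```python
# Hypothetical encoding of the example network: node -> variables it reads directly.
NETWORK = {
    "out1": {"k2", "x2"},
    "out2": {"k3", "x1"},
    "k2": {"x1", "x2", "x4", "k1"},
    "k3": {"k1", "x4"},
    "k1": {"x2", "x3"},
}


def support(node):
    """Variables used directly by the node's function."""
    return set(NETWORK[node])


def transitive_fanin(node):
    """All primary inputs and intermediate variables the node depends on."""
    seen = set()
    stack = list(NETWORK.get(node, ()))
    while stack:
        var = stack.pop()
        if var not in seen:
            seen.add(var)
            stack.extend(NETWORK.get(var, ()))  # primary inputs have no entry
    return seen


def transitive_fanout(node):
    """All intermediate variables and primary outputs that depend on the node."""
    return {n for n in NETWORK if node in transitive_fanin(n)}
```

For instance, `transitive_fanin("out2")` collects the cone of logic that drives out2, and `transitive_fanout("k1")` lists every node affected by a change to k1.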

[Figure: cone of logic between the primary inputs and an output]

Optimizations

- Simplification.
  - Changing the way a function is represented.
- Network restructuring.
  - Adding and removing nodes.
- Delay restructuring.
  - Optimizations that reduce the height of critical paths.

Technology mapping

- Cover the function:

FPGA tech mapping

- Cost (number of inputs) doesn’t always increase with added functions:

FPGAs vs. custom logic

- Cost metric for static gates is the literal count:
  - ax + bx’ has four literals, requires 8 transistors.
- Cost metric for FPGAs is the logic element:
  - All functions that fit in an LE have the same cost.
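A small sketch of the two cost metrics, applied to the slide's example ax + bx'. The list-of-cubes encoding and the function names are assumptions for illustration. The factor of two transistors per literal matches the slide's example, and the LE is assumed to be a 4-input LUT.

```python
def literal_count(sop):
    """Number of literals in a sum-of-products given as a list of cubes."""
    return sum(len(cube) for cube in sop)


def static_gate_transistors(sop):
    """Assumed static-gate cost: two transistors per literal, as in the slide's example."""
    return 2 * literal_count(sop)


def support(sop):
    """Distinct variables used, ignoring complement marks."""
    return {lit.rstrip("'") for cube in sop for lit in cube}


def fits_in_one_le(sop, k=4):
    """Any function of at most k variables costs one k-input LUT."""
    return len(support(sop)) <= k


F = [["a", "x"], ["b", "x'"]]  # ax + bx'
```

Here `F` costs four literals as a static gate, but only its three-variable support matters when it is mapped to a LUT.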

LUT-based logic synthesis

- Find the largest logic cone that will fit into the LUT:

r = q + s’

s = d’

q = g’ + h

d = a + b
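As a rough check of the example, the sketch below collapses the cone rooted at r and builds its truth table. The `NODES` encoding and the function names are assumptions made for illustration. The whole cone depends only on a, b, g and h, so it fits in one 4-input LUT.

```python
from itertools import product

# Node functions from the example: node -> (fanins, function of fanin values).
NODES = {
    "d": (("a", "b"), lambda v: v["a"] or v["b"]),
    "s": (("d",), lambda v: not v["d"]),
    "q": (("g", "h"), lambda v: (not v["g"]) or v["h"]),
    "r": (("q", "s"), lambda v: v["q"] or not v["s"]),
}


def cone_inputs(node):
    """Primary inputs of the logic cone rooted at node."""
    if node not in NODES:
        return {node}
    inputs = set()
    for fanin in NODES[node][0]:
        inputs |= cone_inputs(fanin)
    return inputs


def evaluate(node, values):
    """Evaluate node for one assignment of the primary inputs."""
    if node not in NODES:
        return bool(values[node])
    fanins, fn = NODES[node]
    return bool(fn({f: evaluate(f, values) for f in fanins}))


def lut_table(node, k=4):
    """Truth table of the collapsed cone, if it fits in a k-input LUT."""
    inputs = sorted(cone_inputs(node))
    if len(inputs) > k:
        raise ValueError("cone does not fit in one LUT")
    rows = [int(evaluate(node, dict(zip(inputs, bits))))
            for bits in product((0, 1), repeat=len(inputs))]
    return inputs, rows
```

Calling `lut_table("r")` returns the input ordering and the 16 configuration bits of the LUT.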

[Figure: candidate logic cones, with regions labeled A, B, C and D]

How much fits in a LUT?

- One 2-input NAND gate frequently used for comparison.
- Approximately 12 ~ 15 gates per four-input LUT.
- 216 functions -> 80 after IO swapping, 14 after IO inversion
- 4-input determined to be optimal [Rose 1990]

Technology-Independent Logic Optimization

- Improve circuit based on cost
- Keep same functionality

- Boolean Evaluation/decomposition
- Simple factoring -> minimizing literals

Before: f = ac + ad + bc + bd, g = a + b + c

After extracting e = a + b: f = e(c + d), g = e + c
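The factoring example can be verified exhaustively. The sketch below compares the flat and factored forms on all 16 input assignments and counts literals in each; the function names and the string-based literal count are illustrative assumptions.

```python
from itertools import product


def flat(a, b, c, d):
    """Original two-level forms of f and g."""
    f = (a and c) or (a and d) or (b and c) or (b and d)
    g = a or b or c
    return f, g


def factored(a, b, c, d):
    """Forms after extracting the shared node e = a + b."""
    e = a or b
    return e and (c or d), e or c


def literal_count(expressions):
    """Count variable occurrences on the right-hand sides."""
    return sum(ch.isalpha() for expr in expressions for ch in expr)


EQUIVALENT = all(flat(*v) == factored(*v)
                 for v in product((False, True), repeat=4))
LITERALS_BEFORE = literal_count(["ac + ad + bc + bd", "a + b + c"])
LITERALS_AFTER = literal_count(["a + b", "e(c + d)", "e + c"])
```

The two forms agree on every assignment, and the literal count drops from 11 to 7.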

Factorization

- Based on division:
- formulate candidate divisor;
- test how it divides into the function;
- if g = f/c, we can use c as an intermediate function for f.

- Algebraic division: doesn’t take Boolean simplification into account. Less expensive than Boolean division.
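A minimal sketch of algebraic division on sum-of-products expressions, assuming each cube is a set of single-letter positive literals. The representation and the names `cube` and `algebraic_divide` are illustrative, not taken from the lecture.

```python
def cube(text):
    """'ac' -> frozenset({'a', 'c'}); single-letter positive literals only."""
    return frozenset(text)


def algebraic_divide(f, divisor):
    """Return (quotient, remainder) with f = quotient * divisor + remainder.

    f and divisor are sets of cubes. No Boolean identities are applied.
    """
    quotient = None
    for d in divisor:
        # Cubes of f that contain d, with d's literals stripped out.
        partial = {c - d for c in f if d <= c}
        quotient = partial if quotient is None else quotient & partial
    quotient = quotient or set()
    covered = {q | d for q in quotient for d in divisor}
    return quotient, f - covered


F = {cube("ac"), cube("ad"), cube("bc"), cube("bd")}
DIVISOR = {cube("c"), cube("d")}
```

Dividing `F` by `DIVISOR` yields the quotient a + b with an empty remainder, which is the factorization f = (a + b)(c + d).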

Library-based Technology Mapping – MIS II

- Three steps: decomposition, matching, covering
- Circuit first decomposed into NAND representations
- Different collections of NANDs can be implemented differently in VLSI

[Figure: example library cells and their costs: NAND2, cost 3; AOI-21, cost 4]

MIS II

- Decompose into NAND-2 using Boolean techniques
- Use dynamic programming to match subtrees with libraries
- Choose lowest cost implementation that covers all primitives.
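The matching and covering steps can be sketched as a memoized recursion over a NAND/INV subject tree. This is a simplified illustration, not the MIS II implementation: the tuple encoding, the pattern shapes and the INV cost of 2 are assumptions, and input permutations of a NAND are not tried. The NAND2 and AOI-21 costs are the ones quoted on the slide.

```python
from functools import lru_cache

# (cell name, cost, pattern). Pattern leaves are placeholder strings.
LIBRARY = (
    ("INV", 2, ("INV", "a")),  # cost 2 is an assumption
    ("NAND2", 3, ("NAND", "a", "b")),
    ("AOI21", 4, ("INV", ("NAND", ("NAND", "a", "b"), ("INV", "c")))),
)


def match(pattern, tree):
    """Return the subject subtrees bound to the pattern leaves, or None."""
    if isinstance(pattern, str):
        return [tree]
    if (isinstance(tree, str) or tree[0] != pattern[0]
            or len(tree) != len(pattern)):
        return None
    leaves = []
    for sub_pattern, sub_tree in zip(pattern[1:], tree[1:]):
        bound = match(sub_pattern, sub_tree)
        if bound is None:
            return None
        leaves.extend(bound)
    return leaves


@lru_cache(maxsize=None)
def best_cover(tree):
    """Lowest-cost cover of the tree: (total cost, cells used)."""
    if isinstance(tree, str):  # primary input, nothing to cover
        return 0, ()
    best = None
    for name, cost, pattern in LIBRARY:
        leaves = match(pattern, tree)
        if leaves is None:
            continue
        total, cells = cost, (name,)
        for leaf in leaves:
            sub_cost, sub_cells = best_cover(leaf)
            total += sub_cost
            cells += sub_cells
        if best is None or total < best[0]:
            best = (total, cells)
    return best


SUBJECT = ("INV", ("NAND", ("NAND", "x", "y"), ("INV", "z")))
```

For `SUBJECT`, covering with individual INV and NAND2 cells would cost 10 under these assumed costs, while a single AOI-21 covers the same four primitives for 4.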

Tech Mapping for LUTs

- Minimize total number of LUTs
- Minimize the number of levels of LUTs
- Many different approaches
  - Partitioning -> Flowmap
  - BDDs -> XMAP
  - Covering -> Chortle

- Basic Xilinx tech mapping follows Chortle with modification to handle registers.

Chortle-crf

- Dynamic programming approach
- Minimize # LUTs (primary goal)
- Minimize # inputs the circuit root uses (secondary goal)
- Operates on AND-OR circuits.

[Figure: locate boundaries; 2-LUTs; without decomp; 4-LUTs]

Chortle-crf

- Major innovation is bin packing
- Simultaneously addresses decomposition and matching
- Goal: Find decomposition of every node in the network that minimizes # LUTs in final circuit

Mapping Each Tree

- Dynamically visit each node in the graph
- Fanin nodes drive the node under evaluation
- Boxes -> fanin LUTs, cost is number of inputs
- Bins -> N-input LUTs (in this case 5)

First Fit Decreasing /* construct 2-level decomp */

    box list <- fanin LUTs sorted by size
    bin list <- empty
    while (box list is not empty) {
        box <- largest LUT remaining in box list
        find first bin that will contain box
        if bin doesn’t exist
            bin <- box /* create new bin */
        else
            bin <- box /* pack in existing bin */
    }
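A runnable sketch of First Fit Decreasing, assuming each box is represented only by its input count and each bin is a LUT with k inputs. The function name and list-of-lists representation are illustrative.

```python
def first_fit_decreasing(box_sizes, k=5):
    """Pack boxes (fanin LUT input counts) into bins of capacity k."""
    bins = []
    for size in sorted(box_sizes, reverse=True):  # largest box first
        if size > k:
            raise ValueError("box larger than a LUT")
        for contents in bins:
            if sum(contents) + size <= k:  # first bin with enough room
                contents.append(size)
                break
        else:
            bins.append([size])  # no bin fits: open a new one
    return bins
```

For example, `first_fit_decreasing([3, 2, 2, 2, 1])` packs five fanin LUTs into two 5-input bins.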

Multi-Level Decomposition

- Chain LUTs together
- Output of largest second level LUT connected to LUT with unused input
- May need to add a new LUT
- Leads to the minimum number of LUTs and a fanout LUT with the smallest # inputs
- This fanout LUT is used as input to the next stage
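One way to sketch the chaining step, under the assumption that the two-level packing is given as a list of bin fill levels. The output of the fullest bin is connected to the first bin with a spare input, and a new LUT is opened when none has room. This is an illustration of the idea on the slide, not Chortle's actual code, and `multi_level` is a hypothetical name.

```python
def multi_level(bin_fills, k=5):
    """Chain second-level LUTs. Returns (number of LUTs, inputs used by the root LUT)."""
    if k < 2 or not bin_fills:
        raise ValueError("need k >= 2 and at least one bin")
    fills = list(bin_fills)
    finished = 0
    while len(fills) > 1:
        fills.sort(reverse=True)
        fills.pop(0)  # the fullest LUT is complete; its output needs a home
        finished += 1
        for i, used in enumerate(fills):
            if used < k:  # first LUT with an unused input
                fills[i] = used + 1
                break
        else:
            fills.append(1)  # no room anywhere: add a new LUT
    return finished + 1, fills[0]
```

With bins filled to 5, 5 and 2, no extra LUT is needed and the root LUT ends up using four inputs. With three full bins, a fourth LUT must be added.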

Examples

[Figure: a) Fanin LUTs; b) Two-level Decomposition; c) Multi-level Decomposition]

Optimality

- For LUTs with fewer than 6 inputs Chortle will create an optimal result for subtree
- Combination of sub-trees is not optimized.
- Local optimizations needed to ensure global optimality.
  - Reconvergent paths -> net drives multiple gates.
  - Replicating logic -> creating additional fanout

Translating a Design to an FPGA

- Improve 2-level decomposition to take fanout into account
- Replace FFD with an exhaustive search that repeatedly invokes FFD.
- Try both with and without reconvergent path and select best mapping (forced merging)
- Inputs must reconverge at node being decomposed.

Reconvergent Paths

- Frequently, more than one pair of fan-in LUTs share inputs
- For each combination of pairs that share inputs, perform FFD.
- Two-level decomp with fewest bins and smallest least filled bin retained
Reconverge

    pair list <- all pairs of fanin LUTs with shared inputs
    best LUTs <- empty
    for all possible combinations of pairs from pair list {
        merged LUTs <- copy of fanin LUTs with forced merge
        FFD(merged LUTs)
        if merged LUTs better than best LUTs
            best LUTs <- merged LUTs /* best combo */
    }
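A simplified sketch of forced merging, assuming each fanin LUT is a set of input names. Unlike the exhaustive search described above, it tries only one forced merge at a time, and it ranks packings by (number of bins, fill of the least filled bin). All names are illustrative.

```python
from itertools import combinations


def ffd(sizes, k):
    """First Fit Decreasing on box sizes, bins of capacity k."""
    bins = []
    for size in sorted(sizes, reverse=True):
        for contents in bins:
            if sum(contents) + size <= k:
                contents.append(size)
                break
        else:
            bins.append([size])
    return bins


def quality(bins):
    """Fewer bins first, then the smallest least filled bin."""
    return len(bins), min(sum(contents) for contents in bins)


def reconverge(fanin_luts, k=5):
    """Try each single forced merge of LUTs that share inputs; keep the best packing."""
    luts = [frozenset(lut) for lut in fanin_luts]
    best = ffd([len(lut) for lut in luts], k)
    for i, j in combinations(range(len(luts)), 2):
        union = luts[i] | luts[j]
        if not (luts[i] & luts[j]) or len(union) > k:
            continue  # no shared input, or merged LUT too large
        merged = [lut for n, lut in enumerate(luts) if n not in (i, j)]
        merged.append(union)  # shared inputs are counted once
        candidate = ffd([len(lut) for lut in merged], k)
        if quality(candidate) < quality(best):
            best = candidate
    return best


LUTS = [{"a", "b", "c"}, {"b", "c", "d"}, {"e", "f", "g"}, {"h", "i"}]
```

For `LUTS`, plain FFD needs three bins, while merging the two LUTs that share b and c brings the packing down to two.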

Maximum Share Decreasing

- Exhaustive search prohibitive
- Select box using following criteria
  1. Greatest # inputs
  2. Shares greatest # inputs with any existing bin
  3. Shares greatest # of inputs with existing (remaining) boxes
- Reduces to FFD for no input sharing
- Points 2 and 3 optimize network sharing
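The selection criteria can be sketched as a lexicographic sort key. The placement rule used here (put the box in the fitting bin that shares the most inputs, otherwise open a new bin) is an assumption, as are the names and the set-based representation.

```python
def max_share_decreasing(boxes, k=5):
    """Pack boxes (sets of input names) into bins holding at most k distinct inputs."""
    remaining = [frozenset(box) for box in boxes]
    bins = []
    while remaining:
        def key(box):
            share_bin = max((len(box & b) for b in bins), default=0)
            share_box = max((len(box & other) for other in remaining
                             if other is not box), default=0)
            return len(box), share_bin, share_box  # criteria 1, 2, 3

        box = max(remaining, key=key)
        remaining.remove(box)
        fitting = [b for b in bins if len(b | box) <= k]
        if fitting:
            target = max(fitting, key=lambda b: len(b & box))
            target |= box  # shared inputs occupy one LUT pin only
        else:
            bins.append(set(box))
    return bins


BOXES = [{"a", "b", "c"}, {"b", "c", "d"}, {"e", "f", "g"}, {"h", "i"}]
```

When no boxes share inputs, every key reduces to the box size and every fitting bin ties, so the first fitting bin is chosen and the procedure behaves like FFD.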

Node Replication

[Figure: mapping with replication]

- Apply replication to fanout nodes
- Map without replication first
- Locally decompose fanout nodes to determine savings
- Ordering important

Results – Chortle-crf

- 20 netlists mapped to 5-input LUTs
- Reconvergence reduced LUTs by 2.7%
- Replication reduced LUTs by 3.7%
- Combined 14% reduction achieved
- Replication exposes reconvergent paths creating additional opportunities for optimization.

Chortle-d

- Minimize delay through circuit
- Generally increases hardware required
- Reduced logic levels by 38%
- Increased # LUTs by 79%

- Note: most delay in an FPGA is in the interconnect

Other Approaches

- MIS-PGA
  - Groups inputs into LUTs
  - Decompose into 4-LUTs (Roth-Karp)
  - 47 times slower than Chortle
  - 14% fewer LUTs
- XMAP
  - Represent circuit as BDDs
  - Effective for multiplexer based devices.
  - Also, BDS-PGA

Flowmap

1. Use network flow to partition circuit.
2. Determine point where minimum flow achieved for minimum cut
3. Cut until LUTs of size N achieved.

Taking Flip flops into Account

- FPGA devices contain fixed resources – FFs
- Technology mapping should take these into account
- Consider fanout nodes.

LUT Packing - VPACK

- Seed BLE – choose BLE with most inputs.
- Select next BLE -> BLE which shares most inputs and outputs with cluster
- Continue until the cluster is full or adding any BLE would exceed I, the # of cluster inputs
- Hill Climbing – exceed I limit temporarily to find better minimum.
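A greedy sketch of the seed-and-grow steps for a single cluster, without the hill-climbing phase. The BLE encoding (name mapped to a pair of input nets and output net), the parameter names `n` and `i_max`, and the alphabetical tie-breaking are all assumptions for illustration, not VPACK's actual interface.

```python
def cluster_inputs(members, bles):
    """Nets that must enter the cluster from outside."""
    outputs = {bles[m][1] for m in members}
    inputs = set()
    for m in members:
        inputs |= bles[m][0]
    return inputs - outputs


def pack_one_cluster(bles, n=4, i_max=10):
    """Grow one cluster of at most n BLEs using at most i_max external inputs."""
    unclustered = set(bles)
    seed = max(sorted(unclustered), key=lambda b: len(bles[b][0]))
    cluster = [seed]
    unclustered.remove(seed)
    while len(cluster) < n and unclustered:
        nets = set()
        for m in cluster:
            nets |= bles[m][0] | {bles[m][1]}

        def gain(b):
            # Inputs and outputs shared with the current cluster.
            return len((bles[b][0] | {bles[b][1]}) & nets)

        candidates = [b for b in sorted(unclustered)
                      if len(cluster_inputs(cluster + [b], bles)) <= i_max]
        if not candidates:
            break  # every remaining BLE would overflow the input limit
        chosen = max(candidates, key=gain)
        cluster.append(chosen)
        unclustered.remove(chosen)
    return cluster


BLES = {
    "A": ({"x", "y", "z"}, "p"),
    "B": ({"p", "x"}, "q"),
    "C": ({"m", "n"}, "r"),
    "D": ({"q", "y"}, "s"),
}
```

With `n=3` and `i_max=4`, the seed is A, and B and D are absorbed because they reuse nets already inside the cluster, while C would push the input count over the limit.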

Summary

- Many tech mapping algorithms exist to minimize delay/area
- Chortle uses a dynamic programming heuristic to perform mapping
- Largely a solved problem
- More sophisticated techniques evaluated recently
