
IEEE/ACM Asia South Pacific Design Automation Conference (ASP-DAC), Shanghai, 2005


Reducing Hardware Complexity of Linear DSP Systems by Iteratively Eliminating Two-Term Common Subexpressions

IEEE/ACM Asia South Pacific Design Automation Conference (ASP-DAC), Shanghai, 2005

Anup Hosangadi

Ryan Kastner

ECE Department, UCSB

Farzan Fallah

Advanced CAD Research

Fujitsu Labs of America

- Introduction
- Related Work
- Polynomial transformation
- Common Subexpression elimination
- Results
- Conclusions

- Multiplications by constants encountered in many application areas
- DSP transforms in audio, video and image processing (DFT, DCT, IDCT, etc.)
- Filtering operations in Communication (FIR, IIR filters)
- Multiple Input Multiple Output (MIMO) systems
- Polynomials in Computer graphics

- Multiplication is expensive in hardware
- Decompose constant multiplications into shifts and additions
- 13*X = (1101)2*X = X + X<<2 + X<<3
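The decomposition above can be sketched in a few lines of Python. This is only an illustration of the shift-and-add idea for non-negative constants; the function names `shift_add_terms` and `multiply_by_constant` are made up for this sketch.

```python
def shift_add_terms(c):
    """Return the positions of the 1 bits of the non-negative constant c."""
    return [i for i in range(c.bit_length()) if (c >> i) & 1]


def multiply_by_constant(x, c):
    """Compute c * x using only left shifts and additions."""
    return sum(x << s for s in shift_add_terms(c))


print(shift_add_terms(13))          # [0, 2, 3]  ->  X + X<<2 + X<<3
print(multiply_by_constant(5, 13))  # 65
```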

- Signed digits can reduce the number of additions/subtractions
- Canonical Signed Digits (CSD) (Knuth’74)
- (57)10 = (0111001)2 = (100-1001)CSD
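A minimal sketch of a binary-to-CSD conversion, assuming non-negative integer constants. The helper name `to_csd` is hypothetical and the digit-by-digit recoding shown is one common way to obtain CSD, not necessarily the one used by the authors.

```python
def to_csd(n):
    """Return the CSD digits of a non-negative integer, most significant first.

    Each digit is -1, 0 or +1 and no two adjacent digits are non-zero.
    """
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)  # n % 4 == 1 gives +1, n % 4 == 3 gives -1
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits[::-1] or [0]


print(to_csd(57))  # [1, 0, 0, -1, 0, 0, 1]  ->  64 - 8 + 1
print(to_csd(14))  # [1, 0, 0, -1, 0]        ->  16 - 2
```

For 57 the binary form needs four non-zero digits while the CSD form needs three, which is where the saving in additions/subtractions comes from.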

- Further reduction possible by common subexpression elimination
- Up to 50% reduction (R. Hartley, TCS'96)

- Common subexpressions = common digit patterns

- F1 = 7*X = (0111)*X = X + X<<1 + X<<2
F2 = 13*X = (1101)*X = X + X<<2 + X<<3
Cost: 4+, 4<<

- D1 = X + X<<2
F1 = D1 + X<<1
F2 = D1 + X<<3
Cost: 3+, 3<<
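As a quick check of the example, the sketch below computes F1 and F2 both directly and through the shared subexpression D1. The function names are illustrative only.

```python
def f_direct(x):
    """F1 = 7*x and F2 = 13*x, each built from its own shifts and adds (4+, 4<<)."""
    f1 = x + (x << 1) + (x << 2)
    f2 = x + (x << 2) + (x << 3)
    return f1, f2


def f_shared(x):
    """Same outputs, sharing D1 = x + x<<2 between F1 and F2 (3+, 3<<)."""
    d1 = x + (x << 2)
    return d1 + (x << 1), d1 + (x << 3)


print(f_direct(3), f_shared(3))  # (21, 39) (21, 39)
```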

- Good for a single variable: FIR filters (transposed form)
- Multiple variables? (DFT, DCT, etc.)

- F1 = 7*X = (0111)*X = X + X<<1 + X<<2
Digit pattern "0101" => X + X<<2

- Simple bipartite matching (Potkonjak et al., TCAD'95)
- (10101) and (01101) => common pattern = "101"
- (10010) and (010010) => cannot detect pattern "1001"

- Recursive Shift and Add (RESANDS) (H. Nguyen et al., TVLSI 2000)
- (10010) and (010010) => common pattern "1001"

- Exhaustive enumeration of all digit patterns (Pasko et al., TCAD'99)
- (1011) => "0011", "1001", "1010", "0101", "1011"

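- Extending techniques for multiple variables

$$\begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \times \begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix}$$

[Figure (Potkonjak et al., TCAD'95): all distinct SijXj and CikDk; outputs Y1, Y2, Y3]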

- Multiple-variable common subexpression elimination (A. Hosangadi et al., ASAP'04)
- Polynomial transformation of linear systems
- Uses rectangle covering methods
- Cannot find subexpressions with reversed signs
e.g. (X1 – X2<<1) ≠ (X2<<1 – X1)

- Common occurrence when signed digits are used
- Rectangle covering has exponential complexity
- Method to overcome these limitations ?

- Algebraic methods in multi-level logic synthesis (MLLS)
- Reducing literal count in a set of Boolean expressions
- Factoring, decomposition: Established algebraic techniques
- Typically used for thousands of variables and literals

- Apply these methods to optimize linear systems?

D1 = X1+ X2<<2

Y1 = D1 + D1<<3 + X1<<3

Y2 = D1 + X2<<2

- View linear systems as set of arithmetic expressions
- Expressions consisting of +,-,<< operators
- Develop methodology for extracting common subexpressions

- Polynomial formulation

C × X = Σi (±X × L^i), where L^i denotes a left shift by i

(14)10 × X = (1110)2 × X
= X<<3 + X<<2 + X<<1
= XL^3 + XL^2 + XL
= (100-10)CSD × X = XL^4 – XL
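One way to read the polynomial form is as a list of (sign, exponent) pairs, with L^k meaning a left shift by k. The sketch below uses that assumed representation (not taken from the slides) to confirm that the binary and CSD expansions of 14 agree.

```python
def evaluate(terms, x):
    """Evaluate a sum of (sign, exponent) terms, reading L^k as a left shift by k."""
    return sum(sign * (x << k) for sign, k in terms)


binary_14 = [(+1, 3), (+1, 2), (+1, 1)]  # XL^3 + XL^2 + XL
csd_14 = [(+1, 4), (-1, 1)]              # XL^4 - XL

print(evaluate(binary_14, 5), evaluate(csd_14, 5))  # 70 70
```

The CSD form needs one subtraction instead of two additions.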

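- Decomposing constant multiplications (H.264 Integer Transform)

$$\begin{bmatrix} Y_0 \\ Y_1 \\ Y_2 \\ Y_3 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix} \begin{bmatrix} X_0 \\ X_1 \\ X_2 \\ X_3 \end{bmatrix}$$

Y0 = X0 + X1 + X2 + X3
Y1 = X0<<1 + X1 - X2 - X3<<1
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - X1<<1 + X2<<1 - X3

Cost: 12+, 4<<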


- Polynomial transformation (H.264 Integer Transform)

Y0 = X0 + X1 + X2 + X3
Y1 = X0L + X1 - X2 - X3L
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - X1L + X2L - X3

Cost: 12+, 4<<
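The polynomial transformation of the H.264 matrix can be sketched as follows. Each term is stored as an assumed (sign, variable index, exponent of L) triple, using the plain binary expansion of each coefficient; `row_terms` is a hypothetical helper name.

```python
H264 = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]


def row_terms(row):
    """Expand one matrix row into (sign, variable index, exponent of L) terms."""
    terms = []
    for j, c in enumerate(row):
        sign = 1 if c > 0 else -1
        mag = abs(c)
        for k in range(mag.bit_length()):
            if (mag >> k) & 1:
                terms.append((sign, j, k))
    return terms


exprs = [row_terms(row) for row in H264]
adds = sum(len(t) - 1 for t in exprs)
shifts = sum(1 for t in exprs for (_, _, k) in t if k > 0)
print(exprs[1])      # [(1, 0, 1), (1, 1, 0), (-1, 2, 0), (-1, 3, 1)]
print(adds, shifts)  # 12 4
```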

- Concurrent Decomposition and Factorization of Boolean Expressions (J. Rajski et al., TCAD'92)
- Popular as Fast-Extract (Fx) algorithm
- Expression f = gh + r
- g = (ab + c) => Double cube divisor
- g = ab => Single cube divisor

- Fx algorithm for Linear systems?

- Two-term divisors: obtained from every pair of terms in each expression
- Divide by the minimum exponent of L
- e.g. F = X1 + X2L + X3L^3
- {+X2L, +X3L^3}: divide by L => (X2 + X3L^2)
- Divisors = (X1 + X2L), (X1 + X3L^3), (X2 + X3L^2)
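Divisor generation can be sketched as below, assuming each term is a (sign, variable index, exponent of L) triple. The function name `two_term_divisors` is illustrative and not from the original work.

```python
from itertools import combinations


def two_term_divisors(terms):
    """Pair up all terms and divide each pair by its minimum exponent of L."""
    divisors = []
    for (s1, v1, e1), (s2, v2, e2) in combinations(terms, 2):
        m = min(e1, e2)
        divisors.append(((s1, v1, e1 - m), (s2, v2, e2 - m)))
    return divisors


# F = X1 + X2*L + X3*L^3
F = [(+1, 1, 0), (+1, 2, 1), (+1, 3, 3)]
for d in two_term_divisors(F):
    print(d)
# ((1, 1, 0), (1, 2, 1))   X1 + X2L
# ((1, 1, 0), (1, 3, 3))   X1 + X3L^3
# ((1, 2, 0), (1, 3, 2))   X2 + X3L^2
```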

- Two divisors intersect if
- They are identical, or identical with all signs reversed
- The terms involved are distinct
- (X1 – X2L) ∩ (X1 – X2L) ≠ φ
(X1 – X2L) ∩ (–X1 + X2L) ≠ φ (reversed signs allowed!)
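A simple way to let reversed-sign divisors match is to normalize the sign of the first term before comparing. The sketch below assumes each divisor is a pair of (sign, variable index, exponent) terms listed in a fixed variable order; `canonical` is a hypothetical helper.

```python
def canonical(divisor):
    """Return (divisor with a positive first term, whether signs were flipped)."""
    (s1, v1, e1), (s2, v2, e2) = divisor
    if s1 < 0:
        return ((-s1, v1, e1), (-s2, v2, e2)), True
    return divisor, False


a = ((+1, 1, 0), (-1, 2, 1))  # X1 - X2L
b = ((-1, 1, 0), (+1, 2, 1))  # -X1 + X2L
print(canonical(a)[0] == canonical(b)[0])  # True
```

The returned flag records that the second occurrence must be substituted with a minus sign.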

- Theorem: a multiple-term common subexpression exists in a set of expressions iff there is a non-overlapping intersection among the two-term divisors
- Many divisors with intersections: which one to choose?
- Use greedy selection of the divisor with the greatest number of intersections

- Selecting divisors changes expressions
- Perform concurrent decomposition of expressions

- Creating the set of divisors {Divisors}
{Divisors} = φ;
for each expression Pi
{
    {Dnew} = Divisors for Pi;
    {Divisors} = {Divisors} ∪ {Dnew};
    Update frequency statistics of {Divisors};
}

{Divisors} = Set of all 2-term divisors;
while (intersections present)
{
    Find Best_Divisor in {Divisors};
    {T} = Set of terms involved in intersection;
    {D} = Set of divisors involving any term in {T};
    {Divisors} = {Divisors} – {D};
    Rewrite Expressions;
    {Dnew} = New Divisors involving new terms;
    {Divisors} = {Divisors} ∪ {Dnew};
}
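The two loops can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: it recomputes all divisors on every pass instead of updating the divisor set incrementally, breaks ties by first occurrence, and all names (`divisor_key`, `find_occurrences`, `eliminate`, `cost`, `evaluate`) and the (sign, name, exponent) term encoding are assumptions of this sketch.

```python
from itertools import combinations

# A term is (sign, name, exp) and stands for sign * name * L^exp (L^k = shift left by k).


def divisor_key(t1, t2):
    """Normalize two terms so the pair equals sign * L^shift * (v1*L^a + rel*v2*L^b)."""
    (s1, v1, e1), (s2, v2, e2) = sorted((t1, t2), key=lambda t: (t[1], t[2]))
    shift = min(e1, e2)
    return ((v1, e1 - shift), (s1 * s2, v2, e2 - shift)), s1, shift


def find_occurrences(exprs):
    """Map every two-term divisor to its non-overlapping occurrences."""
    occurrences = {}
    for name, terms in exprs.items():
        for i, j in combinations(range(len(terms)), 2):
            key, sign, shift = divisor_key(terms[i], terms[j])
            hits = occurrences.setdefault(key, [])
            overlaps = any(n == name and {i, j} & {a, b} for n, a, b, _, _ in hits)
            if not overlaps:
                hits.append((name, i, j, sign, shift))
    return occurrences


def eliminate(exprs):
    """Greedily extract the most frequent divisor until no divisor occurs twice."""
    exprs = {name: list(terms) for name, terms in exprs.items()}
    count = 0
    while True:
        occurrences = find_occurrences(exprs)
        if not occurrences:
            break
        key, hits = max(occurrences.items(), key=lambda kv: len(kv[1]))
        if len(hits) < 2:
            break
        new_name = f"D{count}"
        count += 1
        removed, added = {}, {}
        for name, i, j, sign, shift in hits:
            removed.setdefault(name, set()).update((i, j))
            added.setdefault(name, []).append((sign, new_name, shift))
        for name, gone in removed.items():
            kept = [t for k, t in enumerate(exprs[name]) if k not in gone]
            exprs[name] = kept + added[name]
        (v1, e1), (rel, v2, e2) = key
        exprs[new_name] = [(1, v1, e1), (rel, v2, e2)]
    return exprs


def cost(exprs):
    """Count additions/subtractions and shifts over all expressions."""
    adds = sum(len(terms) - 1 for terms in exprs.values())
    shifts = sum(1 for terms in exprs.values() for _, _, e in terms if e > 0)
    return adds, shifts


def evaluate(exprs, name, inputs):
    """Evaluate an expression, resolving extracted divisors recursively."""
    if name in inputs:
        return inputs[name]
    return sum(s * (evaluate(exprs, v, inputs) << e) for s, v, e in exprs[name])


h264 = {
    "Y0": [(1, "X0", 0), (1, "X1", 0), (1, "X2", 0), (1, "X3", 0)],
    "Y1": [(1, "X0", 1), (1, "X1", 0), (-1, "X2", 0), (-1, "X3", 1)],
    "Y2": [(1, "X0", 0), (-1, "X1", 0), (-1, "X2", 0), (1, "X3", 0)],
    "Y3": [(1, "X0", 0), (-1, "X1", 1), (1, "X2", 1), (-1, "X3", 0)],
}
reduced = eliminate(h264)
print(cost(h264), cost(reduced))  # (12, 4) (8, 2)
```

On the H.264 transform this sketch extracts the same four divisors as the slides, though it may number them in a different order.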

- MxM constant matrix; N digits of precision

[Figure: M x M matrix of N-digit constants; e.g. row Y0 = 1111 1111 1011 1001 gives Y0 = X0 + X0L + ... + X(M-1)L^3 + X(M-1)]

- O(MN) terms per expression => O(M^2N^2) divisors

- Complexity annotations for the two loops of the algorithm (their placement against individual steps is not recoverable from the slide)
O(M^2N^2) distinct divisors
O(M^2N^2)
O(M^3N^2)
O(M^2N^2)
O(M^2N^2)

- H.264 example
- >> Select D0 = (X0 + X3)

Y0 = X0 + X1 + X2 + X3

Y1 = X0L + X1 - X2 - X3L

Y2 = X0 - X1 - X2 + X3

Y3 = X0 - X1L + X2L - X3

- H.264 example
- >> Select D1 = (X1 – X2)

Y0 = D0 + X1 + X2

Y1 = X0L + X1 - X2 - X3L

Y2 = D0 - X1 - X2

Y3 = X0 - X1L + X2L - X3

- H.264 example
- >> Select D2 = (X1 + X2)

Y0 = D0 + X1 + X2

Y1 = X0L + D1 - X3L

Y2 = D0 - X1 - X2

Y3 = X0 - D1L - X3

- H.264 example
- >> Select D3 = (X0 – X3)

Y0 = D0 + D2

Y1 = X0L + D1 -X3L

Y2 = D0 - D2

Y3 = X0 - D1L - X3

- Extracting 4 divisors

D0 = X0 + X3
D1 = X1 – X2
D2 = X1 + X2
D3 = X0 – X3

Y0 = D0 + D2
Y1 = D1 + D3L
Y2 = D0 – D2
Y3 = D3 – D1L

- Cost comparison
Original: 12+, 4<<
Rectangle Covering: 10+, 3<<
2-term CSE: 8+, 2<<
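The final decomposition can be checked numerically against the original matrix. The sketch below is a direct transcription of the four divisors and four outputs; the function names are made up for illustration.

```python
def h264_direct(x0, x1, x2, x3):
    """H.264 integer transform computed row by row from the constant matrix."""
    return (
        x0 + x1 + x2 + x3,
        2 * x0 + x1 - x2 - 2 * x3,
        x0 - x1 - x2 + x3,
        x0 - 2 * x1 + 2 * x2 - x3,
    )


def h264_shared(x0, x1, x2, x3):
    """Same transform using the four extracted divisors (8 add/sub, 2 shifts)."""
    d0 = x0 + x3
    d1 = x1 - x2
    d2 = x1 + x2
    d3 = x0 - x3
    return (d0 + d2, d1 + (d3 << 1), d0 - d2, d3 - (d1 << 1))


print(h264_direct(3, 5, 7, 11))  # (26, -18, 2, -4)
print(h264_shared(3, 5, 7, 11))  # (26, -18, 2, -4)
```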

- Goal
- Reduction in #additions/subtractions
- Effect on area/latency on synthesis
- Simulate designs to estimate power consumption

- Transforms DCT, IDCT,DFT, DST, DHT.
- 8x8 constant matrices
- 16 digits precision (CSD representation)
- Compare with
- Potkonjak (TCAD’95)
- RESANDS (Nguyen et al., TVLSI 2000)
- Rectangle Covering (A. Hosangadi et al., ASAP'04)

[Table: results for (III) RESANDS, (IV) Rect. Covering, (V) 2-term CSE; values not recovered. Run Time: 0.81s, 0.08s]

- Synthesis results (minimum latency constraints)

[Table: (III) RESANDS, (IV) Rect. Covering, (V) 2-term CSE; values not recovered]

- Power consumption

- A new technique for eliminating common subexpressions in linear systems
- Fewer operations than known methods
- Much faster than rectangle covering
- Combine with scheduling on given resources

- Thank you
- Questions??