420 likes | 542 Views
This paper presents techniques for optimizing polynomial expressions to reduce power consumption in hardware synthesis, focusing on adaptive filters, interpolation/extrapolation, and encryption applications. The study explores redundancy elimination methods and algebraic techniques to minimize the number of operations required, taking a comparative approach with existing methods like Horner form. Key differences, kernel generation, and common subexpression extraction strategies are discussed, aiming to enhance efficiency and performance in energy-efficient hardware design.
E N D
Energy Efficient Hardware Synthesis of Polynomial Expressions18th International Conference on VLSI Design Anup Hosangadi Ryan Kastner ECE Department, UCSB Farzan Fallah Advanced CAD Research Fujitsu Labs of America
Outline • Introduction • Related Work • Problem formulation • Algorithms for optimizing polynomials • Experimental results • Conclusions
Introduction • Embedded system applications need to compute polynomial expressions • Continuous functions can be approximated by Taylor Series • Adaptive (polynomial) filters • Polynomial interpolation/extrapolation in Computer Graphics • Encrpytion
Introduction • Commonly occuring computations implemented in hardware • More flexibility than processor architecture • NPAs (Hardware accelarators) in PICO project • Custom Instructions (Tensilica) • Upto 100 times improvement over processor implementation (Kastner et.al TODAES’02) • Develop techniques for reducing power consumption
Related Work (Behavioral transforms) • Power consumption depends on many factors • Reducing number of operations • Hardware: (Nguyen and Chatterjee TVLSI’00) • Software: (I.Hong et.al TODAES’99) • Voltage reduction after speedup transformations • Retiming, Pipelining, Algebraic restructuring (Chandrakasan et. al TCAD’95)
Related Work • Scheduling and resource allocation • Shutting down unused resources (Monteiro et. al. DAC 96) • Allocation of registers, functional units and interconnects (A.Raghunathan et. al ICCD’94) • Multiple Vdd scheduling • Assigning supply voltage to each operation in CDFG (M.Chang and M.Pedram TVLSI’97)
Related Work • Switching power is proportional to number of operations • Multiplications are expensive in Embedded systems • Average 40 times more power than addition at 5V (V.Krishna et. al, VLSI Design 1999) • Careful optimization of expressions is therefore necessary to save power
Reducing operations in polynomial expressions • No good tool for polynomials • Designers rely on hand optimized libraries • Conventional compiler techniques: CSE and Value numbering not suited for polynomials. • Horner form: most popular representation • anxn + a1xn-1 + ….an-1x + a0 = (…((anx + an-1)x + an-2)x + ..a1)x + a0 • Not good for multivariate polynomials • Only a single polynomial expression at a time
Comparison with Horner form • Quartic-spline polynomial (3-D graphics) P = zu4 + 4avu3 + 6bu2v2 + 4uv3w + qv4 • Horner form (from MapleTM) P = zu4 + (4au3 + (6bu2 + (4uw + qv)v)v)v (17 multiplications) • Proposed algebraic method: d1 = v2 ; d2 = d1*v P = u3(uz + ad2) + d1( qd1 + u(wd2 + 6bu) ) (11 multiplications)
Related Work (Polynomial Expressions • Expression Factorization (M.A. Breuer JACM’69) • Allows only one kind of operator at a time • Using Symbolic Algebra (M.A.Peymandoust, De Micheli) • Mapping polynomial datapaths to libraries (DAC’01) • Low power embedded software (DATE’02) • Results depend heavily on set of library elements eg. (a2 – b2) = (a+b)(a-b) iff (a+b) or (a-b) is a library element • Manipulates only a single expression at a time F1 = A+ B + C + D; F2 = A + P + D; => Extract (A + D)
Motivating Example • Consider set of expressions • Using CSE 16 multiplications and 4 additions/subtractions 12 multiplications and 4 additions/subtractions
Motivational Example • Using Horner transform • Using our algebraic technique 12 multiplications and 4 additions/subtractions 7 multiplications and 3 additions/subtractions
Introduction to algebraic technique for redundancy elimination • Algebraic techniques in multi-level logic synthesis (MLLS) • Decomposition, factoringreduce number of literals • Distill and Condense use Rectangle Covering methods • Polynomial Expressions (Our Technique) • Factoring, Single term common subexpressions reduces number of multiplications • Multiple term common subexpressions reduces number of additions and possibly multiplications • Key Differences (Generalization to handle higher orders) • Kernelling techniques • Finding single cube intersections
Introduction to our technique(Outline) • Find a subset of all possible subexpressions (kernel generation) • Transformation of Polynomial Expressions • Problem formulation • Extract multiple term common subexpressions and factors • Extract single term common factors
Introduction to our technique • Terminology • Literal: A variable or a constant eg. a,b,2,3.14 • Cube: Product of literals e.g. +3a2b, -2a3b2c • SOP: Sum of cubes e.g. +3a2b – 2a3b2c • Cube-free expression: No literal or cube can divide all the cubes of the expression • Kernel: A cube free sub-expression of an expression, e.g. 3 – 2abc • Co-Kernel: A cube that is used to divide an expression to get a kernel, e.g. a2b
Introduction to our Technique • Matrix Representation of Polynomial Expressions • F = x3y – xy2z is represented by • Each row represents a product term • Each column represents a variable/constant • Each element (i,j) represents power of variable j in term i
Generation of Kernels (example) • P1 = x3y + x2y2z {L} = {x,y,z} • Divide by x: Ft = P1/x = x2y + xy2z
Generation of Kernels (example) Ft = P1/x = x2y + xy2z • C = Biggest Cube dividing all cubes of Ft / C = C = 1 1 0 = xy
Generation of Kernels (example) • Obtain Kernel: F1 = Ft/C = (x2y + xy2z)/(xy) = ( x + yz) • Obtain Co-Kernel D1 = x*(xy) = x2y • No kernels within F1. Go back to P1 • P1 = x3y + x2y2z • Divide now by next variable y Ft = x3 + x2yz • C = x2 • But (x < y) ε C Stop Here, to avoid repeating same kernel Ft/C = (x + yz) • No more kernels extracted • Record kernel F1 = P1 with co-kernel ‘1’
Concept of kernels and co-kernels • Theorem: Two expressions f and g can have a multiple term common subexpression iff there are 2 kernels Kf and Kg having a multiple term intersection • Detection of multiple term common subexpressions by intersection of sets of kernels • Each co-kernel : kernel pair represents a possible factorization • e.g. x3y + x2y2z = [x2y](x + yz) • Set of kernels a subset of all possible subexpressions
All Kernels and Co Kernels Which kernels to choose?
Kernel Cube Matrix (KCM) • One row for each Kernel generated • One column for each distinct kernel cube • Each non-zero element represents a term x3y
Finding Kernel Intersections(Distill Algorithm) • Each kernel intersection or factor appears as a rectangle • Rectangle: Set of rows and columns such that all elements are ‘1’ • Value of a rectangle = Weighted sum of the energy savings of the different operations • Goal: Maximum valued rectangular covering of KCM • Greedy heuristic: covering by prime rectangles
Modeling value function of a rectangle • Formula for weighted sum of energy savings on selection of a rectangle R = # of rows ; C = # of columns M(Ri) = # of multiplications in row (co-kernel) i. M(Ci) = # of multiplications in column (kernel-cube) i m = ratio of average energy consumption of multiplication to addition in the target library Value =
4x + 4yz = 4d1 d1 = (x + yz) x3y + x2y2z = x2yd1 Saves 5 multiplications and 1 addition Value = 201 units (m = 40) Distill Algorithm
Distill Algorithm 4xy – x2y = xyd2 d2 = 4 – x Saves 2 multiplications Value = 80 Remove covered terms
Distill Algorithm • Distill algorithm exits after no more kernel intersections can be found P1 = x2yd1 d1 = x + yz P2 = 4d1 – xyz d2 = 4 - x P3 = xyd2 Can further optimize by finding single cube intersections
Finding single cube intersections (Condense algorithm) • Form Cube Literal Matrix (CLM) • One row for each cube • One column for each literal • Eg. 2 cubes F1 = a4b3c; and F2 = a2b4c2
Finding single cube intersections (Condense algorithm) • Each (single term) common subexpression appears as a rectangle. • Rectangle: Set of rows and columns where all elements are non-zero • Value of a rectangle is number of multiplications saved by selecting it • C = cube corresponding to the rectangle Value = Rows*( (ΣC[i] ) -1) • Maximum valued rectangular covering will give minimum number of multiplications • Use greedy iterative covering by prime rectangles
Cube Literal Matrix (Condense Algorithm) CLM for our example after Distill algorithm Save 2 multiplications by extracting xy C = xy
Condense Algorithm Extracting xy No more favorable cube intersections found
Final Implementation • Total 7 multiplications, 3 additions/subtractions • Savings of 5 multiplications, 1 addition/subtraction compared to CSE • Impossible to obtain such results using conventional techniques
Experimental setup • Polynomials used in Computer graphics and Signal Processing • 1.0 µ technology library, characterized for power consumption • Synthesized using Synopsys Design CompilerTM • Min Hardware constraints (1 adder + 1 multiplier) • Med Hardware constraints (Max 4 multipliers)
Experimental setup • Estimated power using Synopsys Power CompilerTM for random inputs, using RTL Simulator (VCSTM) • Compared energy consumption with CSE and Horner form • Compared energy after voltage scaling
Conclusions • Technique to reduce number of operations in polynomial expressions • Large savings in energy consumption observed over CSE and Horner methods • Need to consider scheduling and resource allocation to obtain further improvements
Conclusions • Thank you!! • Questions ???
Finding Kernel Intersections(Distill Algorithm) • Worst case scenario for Distill algorithm • Number of prime rectangles exponential in number of rows/columns • Heuristic methods to find best prime rectangle • In practice polynomial expressions are not so large