1 / 41

Anup Hosangadi Ryan Kastner ECE Department, UCSB

Energy Efficient Hardware Synthesis of Polynomial Expressions 18 th International Conference on VLSI Design. Anup Hosangadi Ryan Kastner ECE Department, UCSB. Farzan Fallah Advanced CAD Research Fujitsu Labs of America. Outline. Introduction Related Work Problem formulation

konala
Download Presentation

Anup Hosangadi Ryan Kastner ECE Department, UCSB

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Energy Efficient Hardware Synthesis of Polynomial Expressions18th International Conference on VLSI Design Anup Hosangadi Ryan Kastner ECE Department, UCSB Farzan Fallah Advanced CAD Research Fujitsu Labs of America

  2. Outline • Introduction • Related Work • Problem formulation • Algorithms for optimizing polynomials • Experimental results • Conclusions

  3. Introduction • Embedded system applications need to compute polynomial expressions • Continuous functions can be approximated by Taylor Series • Adaptive (polynomial) filters • Polynomial interpolation/extrapolation in Computer Graphics • Encrpytion

  4. Introduction • Commonly occuring computations implemented in hardware • More flexibility than processor architecture • NPAs (Hardware accelarators) in PICO project • Custom Instructions (Tensilica) • Upto 100 times improvement over processor implementation (Kastner et.al TODAES’02) • Develop techniques for reducing power consumption

  5. Related Work (Behavioral transforms) • Power consumption depends on many factors • Reducing number of operations • Hardware: (Nguyen and Chatterjee TVLSI’00) • Software: (I.Hong et.al TODAES’99) • Voltage reduction after speedup transformations • Retiming, Pipelining, Algebraic restructuring (Chandrakasan et. al TCAD’95)

  6. Related Work • Scheduling and resource allocation • Shutting down unused resources (Monteiro et. al. DAC 96) • Allocation of registers, functional units and interconnects (A.Raghunathan et. al ICCD’94) • Multiple Vdd scheduling • Assigning supply voltage to each operation in CDFG (M.Chang and M.Pedram TVLSI’97)

  7. Related Work • Switching power is proportional to number of operations • Multiplications are expensive in Embedded systems • Average 40 times more power than addition at 5V (V.Krishna et. al, VLSI Design 1999) • Careful optimization of expressions is therefore necessary to save power

  8. Reducing operations in polynomial expressions • No good tool for polynomials • Designers rely on hand optimized libraries • Conventional compiler techniques: CSE and Value numbering not suited for polynomials. • Horner form: most popular representation • anxn + a1xn-1 + ….an-1x + a0 = (…((anx + an-1)x + an-2)x + ..a1)x + a0 • Not good for multivariate polynomials • Only a single polynomial expression at a time

  9. Comparison with Horner form • Quartic-spline polynomial (3-D graphics) P = zu4 + 4avu3 + 6bu2v2 + 4uv3w + qv4 • Horner form (from MapleTM) P = zu4 + (4au3 + (6bu2 + (4uw + qv)v)v)v (17 multiplications) • Proposed algebraic method: d1 = v2 ; d2 = d1*v P = u3(uz + ad2) + d1( qd1 + u(wd2 + 6bu) ) (11 multiplications)

  10. Related Work (Polynomial Expressions • Expression Factorization (M.A. Breuer JACM’69) • Allows only one kind of operator at a time • Using Symbolic Algebra (M.A.Peymandoust, De Micheli) • Mapping polynomial datapaths to libraries (DAC’01) • Low power embedded software (DATE’02) • Results depend heavily on set of library elements eg. (a2 – b2) = (a+b)(a-b) iff (a+b) or (a-b) is a library element • Manipulates only a single expression at a time F1 = A+ B + C + D; F2 = A + P + D; => Extract (A + D)

  11. Motivating Example • Consider set of expressions • Using CSE 16 multiplications and 4 additions/subtractions 12 multiplications and 4 additions/subtractions

  12. Motivational Example • Using Horner transform • Using our algebraic technique 12 multiplications and 4 additions/subtractions 7 multiplications and 3 additions/subtractions

  13. Introduction to algebraic technique for redundancy elimination • Algebraic techniques in multi-level logic synthesis (MLLS) • Decomposition, factoringreduce number of literals • Distill and Condense use Rectangle Covering methods • Polynomial Expressions (Our Technique) • Factoring, Single term common subexpressions reduces number of multiplications • Multiple term common subexpressions reduces number of additions and possibly multiplications • Key Differences (Generalization to handle higher orders) • Kernelling techniques • Finding single cube intersections

  14. Introduction to our technique(Outline) • Find a subset of all possible subexpressions (kernel generation) • Transformation of Polynomial Expressions • Problem formulation • Extract multiple term common subexpressions and factors • Extract single term common factors

  15. Introduction to our technique • Terminology • Literal: A variable or a constant eg. a,b,2,3.14 • Cube: Product of literals e.g. +3a2b, -2a3b2c • SOP: Sum of cubes e.g. +3a2b – 2a3b2c • Cube-free expression: No literal or cube can divide all the cubes of the expression • Kernel: A cube free sub-expression of an expression, e.g. 3 – 2abc • Co-Kernel: A cube that is used to divide an expression to get a kernel, e.g. a2b

  16. Introduction to our Technique • Matrix Representation of Polynomial Expressions • F = x3y – xy2z is represented by • Each row represents a product term • Each column represents a variable/constant • Each element (i,j) represents power of variable j in term i

  17. Generation of Kernels (example) • P1 = x3y + x2y2z {L} = {x,y,z} • Divide by x: Ft = P1/x = x2y + xy2z

  18. Generation of Kernels (example) Ft = P1/x = x2y + xy2z • C = Biggest Cube dividing all cubes of Ft / C = C = 1 1 0 = xy

  19. Generation of Kernels (example) • Obtain Kernel: F1 = Ft/C = (x2y + xy2z)/(xy) = ( x + yz) • Obtain Co-Kernel D1 = x*(xy) = x2y • No kernels within F1. Go back to P1 • P1 = x3y + x2y2z • Divide now by next variable y Ft = x3 + x2yz • C = x2 • But (x < y) ε C Stop Here, to avoid repeating same kernel Ft/C = (x + yz) • No more kernels extracted • Record kernel F1 = P1 with co-kernel ‘1’

  20. Concept of kernels and co-kernels • Theorem: Two expressions f and g can have a multiple term common subexpression iff there are 2 kernels Kf and Kg having a multiple term intersection • Detection of multiple term common subexpressions by intersection of sets of kernels • Each co-kernel : kernel pair represents a possible factorization • e.g. x3y + x2y2z = [x2y](x + yz) • Set of kernels a subset of all possible subexpressions

  21. All Kernels and Co Kernels Which kernels to choose?

  22. Kernel Cube Matrix (KCM) • One row for each Kernel generated • One column for each distinct kernel cube • Each non-zero element represents a term x3y

  23. Finding Kernel Intersections(Distill Algorithm) • Each kernel intersection or factor appears as a rectangle • Rectangle: Set of rows and columns such that all elements are ‘1’ • Value of a rectangle = Weighted sum of the energy savings of the different operations • Goal: Maximum valued rectangular covering of KCM • Greedy heuristic: covering by prime rectangles

  24. Modeling value function of a rectangle • Formula for weighted sum of energy savings on selection of a rectangle R = # of rows ; C = # of columns M(Ri) = # of multiplications in row (co-kernel) i. M(Ci) = # of multiplications in column (kernel-cube) i m = ratio of average energy consumption of multiplication to addition in the target library Value =

  25. 4x + 4yz = 4d1 d1 = (x + yz) x3y + x2y2z = x2yd1 Saves 5 multiplications and 1 addition Value = 201 units (m = 40) Distill Algorithm

  26. Distill Algorithm 4xy – x2y = xyd2 d2 = 4 – x Saves 2 multiplications Value = 80 Remove covered terms

  27. Distill Algorithm • Distill algorithm exits after no more kernel intersections can be found P1 = x2yd1 d1 = x + yz P2 = 4d1 – xyz d2 = 4 - x P3 = xyd2 Can further optimize by finding single cube intersections

  28. Finding single cube intersections (Condense algorithm) • Form Cube Literal Matrix (CLM) • One row for each cube • One column for each literal • Eg. 2 cubes F1 = a4b3c; and F2 = a2b4c2

  29. Finding single cube intersections (Condense algorithm) • Each (single term) common subexpression appears as a rectangle. • Rectangle: Set of rows and columns where all elements are non-zero • Value of a rectangle is number of multiplications saved by selecting it • C = cube corresponding to the rectangle Value = Rows*( (ΣC[i] ) -1) • Maximum valued rectangular covering will give minimum number of multiplications • Use greedy iterative covering by prime rectangles

  30. Cube Literal Matrix (Condense Algorithm) CLM for our example after Distill algorithm Save 2 multiplications by extracting xy C = xy

  31. Condense Algorithm Extracting xy No more favorable cube intersections found

  32. Final Implementation • Total 7 multiplications, 3 additions/subtractions • Savings of 5 multiplications, 1 addition/subtraction compared to CSE • Impossible to obtain such results using conventional techniques

  33. Experimental setup • Polynomials used in Computer graphics and Signal Processing • 1.0 µ technology library, characterized for power consumption • Synthesized using Synopsys Design CompilerTM • Min Hardware constraints (1 adder + 1 multiplier) • Med Hardware constraints (Max 4 multipliers)

  34. Experimental setup • Estimated power using Synopsys Power CompilerTM for random inputs, using RTL Simulator (VCSTM) • Compared energy consumption with CSE and Horner form • Compared energy after voltage scaling

  35. Results (Comparing operations)

  36. Results (Min Hardware constraints)

  37. Results (Med Hardware constraints)

  38. Conclusions • Technique to reduce number of operations in polynomial expressions • Large savings in energy consumption observed over CSE and Horner methods • Need to consider scheduling and resource allocation to obtain further improvements

  39. Conclusions • Thank you!! • Questions ???

  40. Extra slides

  41. Finding Kernel Intersections(Distill Algorithm) • Worst case scenario for Distill algorithm • Number of prime rectangles exponential in number of rows/columns • Heuristic methods to find best prime rectangle • In practice polynomial expressions are not so large

More Related