This paper presents an ε-optimal algorithm for optimizing clock tree wire sizing under zero-skew constraints, addressing both delay and power minimization in digital design. The ClockTune algorithm provides a comprehensive solution set for the delay/power/area trade-off, demonstrating efficient performance with pseudo-polynomial runtime. Experimental results show rapid convergence and optimality, making it suitable for practical design applications. The approach allows for incrementally refining clock trees, effectively reducing interconnect delay and power consumption while maintaining stability.
ε-Optimal Minimum-Delay/Area Zero-Skew Clock Tree Wire-Sizing in Pseudo-Polynomial Time • Jeng-Liang Tsai, Tsung-Hao Chen, Charlie Chung-Ping Chen (National Taiwan University) • University of Wisconsin-Madison • http://vlsi.ece.wisc.edu
Outline • Background • Motivation and contribution • Literature overview • ClockTune algorithm • Problem formulation • ClockTune algorithm overview • Optimality and complexity analysis • Experimental results • Runtime, memory usage, and optimality • Power/Delay trade-off • Incremental refinement
Motivation • Clock skew → cycle time penalty • Start with a zero-skew clock tree • Minimizing clock delay reduces system-level skew (Kuh, et al. [DAC ‘90]) • The clock tree is power-hungry (30% in Intel McKinley (0.18um/1GHz/130W)) • P = f C V² • Minimize switching capacitance (wiring area) • Stability affects design convergence • Allow incremental refinement to accommodate local changes • Interconnect delay dominates total delay • Wire-sizing is effective in reducing interconnect delay
Motivation • Non-convex zero-skew constraints • No known algorithm solves zero-skew wire-sizing problem optimally with polynomial runtime • Hence, a good clock tree wire-sizing algorithm can • Minimize delay and power • Guarantee optimality and runtime • Have good stability
Contribution • First ε-optimal algorithm for solving the clock min-delay/power zero-skew wire-sizing optimization problem • Provides a complete (sampled) solution set of the delay/power/area trade-off information for design planning • Efficient pseudo-polynomial runtime (6170-branch clock tree in 6 minutes within 1% optimality) • Runtime vs. optimality trade-off • Incremental clock re-balancing to speed up design convergence
Literature Overview • “Reliable non-zero skew clock tree using wire width optimization”, Pillage, et al. [DAC ’93] • Iteratively optimize skew and delay using adjoint sensitivity analysis • Aimed at reliable clock trees under process variation • Deferred Merging Embedding (DME) algorithm, Kahng, et al. [TCAD ’92] • Bottom-up merging segment construction, top-down embedding • Integrated Deferred Merging Embedding (IDME) algorithm, Wong, et al. [ISPD’00] • Handles simultaneous routing, buffer-insertion, and wire-sizing • Merging segment set: a set of line samples of a merging region • No optimality guarantee • The size of MSS grows exponentially • “Process variation aware clock tree routing”, Lu, et al. [ISPD ’03] • Based on DME/BST
Problem formulation • min-ZSWS (Zero Skew Wire Sizing) problem • Given a clock routing, minimize clock delay and wiring area s.t. Delay(P_i) = Delay(P_j) for all leaf nodes i and j, where P_i, P_j are paths from v to leaf nodes i and j • Zero-skew constraints are non-convex constraints • No known algorithm solves the problem optimally in polynomial runtime
DC region approach • Clock Delay and wiring Capacitance are top concerns • Define f : R^N → R^2, such that • f_Y(w) = Delay(T_v(w)), f_X(w) = Capacitance(T_v(w)) • DC region of v: the projection of the feasible region • Choose a d-c pair from the DC region on R^2 • [Figure: feasible region and its projection, the DC region]
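The mapping f on this slide can be illustrated with a small Python sketch. It is not taken from the paper: it assumes an Elmore delay model, a chain of two wires driving one sink (so N = 2 and there is no skew constraint yet), and hypothetical names (`f`, `width_samples`, `dc_points`). The values of r0, c0, and the width bounds follow the experimental-setup slide, with units assumed; the wire lengths and sink load are invented for illustration.

```python
from itertools import product

R0 = 0.03                 # sheet resistance (ohm/square assumed)
C0 = 2e-16                # capacitance per um^2 (farads assumed)
W_MIN, W_MAX = 1.0, 4.0   # wire width bounds in um


def f(widths, lengths, load):
    """Map wire widths (source-to-sink order) to a (capacitance, delay) pair.

    Assumes Elmore delay: each wire contributes R * (C/2 + downstream cap).
    """
    caps = [C0 * l * w for l, w in zip(lengths, widths)]
    delay = 0.0
    downstream = load
    # Walk from the sink back to the source, accumulating downstream capacitance.
    for l, w, c in reversed(list(zip(lengths, widths, caps))):
        delay += (R0 * l / w) * (c / 2 + downstream)
        downstream += c
    return downstream, delay   # downstream now equals the total capacitance


q = 4  # number of width samples per wire (illustrative)
width_samples = [W_MIN + i * (W_MAX - W_MIN) / (q - 1) for i in range(q)]
lengths = (1000.0, 1000.0)     # um, hypothetical
load = 50e-15                  # F, hypothetical

# Each sampled width vector projects to one point of the sampled DC region.
dc_points = [(f(w, lengths, load), w) for w in product(width_samples, repeat=2)]
min_cap_point = min(dc_points, key=lambda pt: pt[0][0])
```

With more than one sink, only width vectors that satisfy the zero-skew constraints would be kept before projecting, which is what makes the region hard to describe in closed form.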
ClockTune algorithm overview • Phase 1: construct DC regions bottom-up for every node • Phase 2: top-down embedding after the delay/power trade-off is chosen
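The two phases can be sketched in Python as follows. This is a simplified illustration written for this transcript, not the authors' implementation: it assumes an Elmore delay model, keeps only the minimum-capacitance point per delay sample rather than a full DC region, and uses hypothetical names (`Node`, `build_dc`, `embed`, `sink_delays`). The example tree, sink loads, and delay step are invented; r0, c0, and the width bounds follow the experimental-setup slide with units assumed.

```python
from dataclasses import dataclass, field

R0 = 0.03                  # sheet resistance (ohm/square assumed)
C0 = 2e-16                 # capacitance per um^2 (farads assumed)
W_MIN, W_MAX = 1.0, 4.0    # wire width bounds in um


@dataclass
class Node:
    length: float = 0.0    # length (um) of the wire from the parent
    load: float = 0.0      # sink capacitance (F); leaves only
    children: list = field(default_factory=list)
    dc: dict = field(default_factory=dict)  # delay index -> (capacitance, picks)
    width: float = 0.0     # width (um) chosen for the wire from the parent


def build_dc(node, step, p):
    """Phase 1 (bottom-up): min capacitance for each feasible delay sample."""
    if not node.children:
        node.dc = {0: (node.load, None)}
        return
    for child in node.children:
        build_dc(child, step, p)
    for k in range(p):                     # candidate node delay = k * step
        total, picks = 0.0, []
        for child in node.children:
            fixed = R0 * C0 * child.length ** 2 / 2   # width-independent term
            best = None
            for kc, (cap_c, _) in child.dc.items():
                slack = (k - kc) * step - fixed
                if slack <= 0:
                    continue
                # Width that makes this branch hit the delay sample exactly.
                width = R0 * child.length * cap_c / slack
                if W_MIN <= width <= W_MAX:
                    cap = C0 * child.length * width + cap_c
                    if best is None or cap < best[0]:
                        best = (cap, kc, width)
            if best is None:
                break                      # one branch cannot reach this delay
            total += best[0]
            picks.append(best)
        else:
            node.dc[k] = (total, picks)    # every branch meets delay k * step


def embed(node, k):
    """Phase 2 (top-down): fix wire widths for the chosen delay sample k."""
    _, picks = node.dc[k]
    for child, (_, kc, width) in zip(node.children, picks or []):
        child.width = width
        embed(child, kc)


def subtree_cap(node):
    return node.load + sum(C0 * c.length * c.width + subtree_cap(c)
                           for c in node.children)


def sink_delays(node, acc=0.0):
    """Elmore delay from this node to every sink, used to check the skew."""
    if not node.children:
        return [acc]
    out = []
    for c in node.children:
        wire_cap = C0 * c.length * c.width
        d = acc + (R0 * c.length / c.width) * (wire_cap / 2 + subtree_cap(c))
        out += sink_delays(c, d)
    return out


# Hypothetical two-sink tree; delay grid of 100 samples, 0.05 ps apart.
root = Node(children=[Node(length=1000.0, load=50e-15),
                      Node(length=1000.0, load=30e-15)])
step, p = 5e-14, 100
build_dc(root, step, p)
k_min = min(root.dc)       # minimum-delay sample; any key of root.dc is valid
embed(root, k_min)
delays = sink_delays(root)
```

Each key of `root.dc` is one delay/capacitance trade-off point; picking a different key and re-running `embed` yields another zero-skew sizing without repeating the bottom-up phase.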
Optimality analysis • Embeddings that do not fall on the delay samples are omitted • Propagated error • Delay sampling error • Wire width sampling error (detailed in the paper)
Optimality analysis • Error is bounded • d : delay sampling resolution • w : wire width sampling resolution • k, : Constants related to l, r0, c0, wm, wM … • Generally speaking, the error is reduced by about half when the resolution is doubled • [Figure: error vs. resolution]
Optimality/runtime trade-off • Controlling the sampling resolution trades off optimality against runtime and memory
Complexity analysis • Runtime • Bottom-up phase takes O(n p max(p,q)) • Top-down phase takes O(np) • Overall: O(n p max(p,q)) • Memory • O(np) • where n: number of nodes of the clock tree, p: number of delay samples taken at each node, q: number of wire width samples taken at each level-2 node
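As a rough illustration of how these bounds scale, the sketch below evaluates the leading terms with all hidden constants set to 1. The function name `clocktune_cost` is hypothetical, and using the 6170-branch figure from the contribution slide as n is an assumption made only for the example.

```python
def clocktune_cost(n, p, q):
    """Leading-order operation and storage counts (constants set to 1)."""
    bottom_up = n * p * max(p, q)   # O(n p max(p, q))
    top_down = n * p                # O(n p)
    return {
        "bottom_up": bottom_up,
        "top_down": top_down,
        "runtime": bottom_up + top_down,
        "memory": n * p,            # O(n p)
    }


base = clocktune_cost(6170, 256, 256)
finer = clocktune_cost(6170, 512, 512)   # sampling resolution doubled

runtime_ratio = finer["bottom_up"] / base["bottom_up"]
memory_ratio = finer["memory"] / base["memory"]
```

Under this model, doubling both p and q quadruples the bottom-up term and doubles the memory, while both stay linear in n for fixed p and q.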
Outline • Background • Motivation and contribution • Related works • Problem formulation • ClockTune algorithm • Design space projection • Algorithm overview • Optimality and complexity analysis • Experimental results • Runtime, memory usage, and optimality • Power/Delay trade-off • Incremental refinement
Experimental setup • ClockTune is implemented in C++ and executed on a 533MHz Pentium III PC with 128MB of memory • Benchmarks r1 – r5 from Tsay et al. [ICCAD ‘91] • Initial routing generated by the BB+DME algorithm with minimum wire width w = 1 μm • ClockTune uses wm = 1 μm, wM = 4 μm • p: number of delay samples taken at every node • q: number of wire width samples taken at every level-2 node • r0 = 0.03, c0 = 2×10^-16/μm^2
Runtime and memory usage • Runtime and memory usage are linear in problem size when p, q are fixed • Within 1% optimality when p = q = 256 (runtime < 6 minutes, memory ~ 64MB)
Optimality results • Optimality • Error below 1% with p = q = 256 • Error is reduced to about half when the resolution is doubled
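Power/Delay trade-off • 15:1 delay:power trade-off • [Figure: delay vs. capacitance trade-off curve for benchmark r5, with the minimum-power and minimum-delay points marked; delay axis 5~150ns, capacitance axis 0.2~1.1nF]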
Incremental refinement • DC region captures the design space • Enables incremental refinement
Conclusion & Future Work • Provide a zero-skew clock tree wire-sizing algorithm which • Minimizes delay and area ε-optimally • Guarantees pseudo-polynomial runtime and memory usage • Provides delay/power trade-off information to designers • Speeds up design convergence by allowing clock tree re-balancing with minimum changes • Future work • Better delay model • Buffer insertion/sizing capability