An Efficient Clustering Algorithm For Low Power Clock Tree Synthesis

An Efficient Clustering Algorithm For Low Power Clock Tree Synthesis. Rupesh S. Shelar Enterprise Microprocessor Group Intel Corporation, Hillsboro, OR March 21 st , 2007 ISPD 2007, Austin. Outline. Introduction Problem Formulation Clustering Algorithm Experimental Results Conclusion.

Presentation Transcript


  1. An Efficient Clustering Algorithm For Low Power Clock Tree Synthesis Rupesh S. Shelar Enterprise Microprocessor Group Intel Corporation, Hillsboro, OR March 21st, 2007 ISPD 2007, Austin

  2. Outline • Introduction • Problem Formulation • Clustering Algorithm • Experimental Results • Conclusion

  3. Local Clock Capacitance Distribution in a Microprocessor • Interconnects contribute a major portion of total capacitance • Clocks are the most active nets in the design • Minimizing interconnect capacitance in clocks leads to a reduction in dynamic power • Distribution generated from several blocks in a microprocessor

  4. Microprocessor Clock Hierarchy • Clock network in a processor: • Distributed as a grid followed by tree • [Figure labels: PLL; global clock distribution using multiple spines; tunable grid buffers; clock grid; regional clock buffers (RCBs); local clock buffers (LCBs); to state elements; local clock network: CTS solution space]

  5. Previous Work • Zero skew (unbuffered) trees: Tsay TCAD’93, Boese et al. ASIC’92, Edahiro DAC’93, ’94 • Buffered trees: • Vittal et al., DAC’95: Trades off buffers with wires; unsuitable for controlled implementation of clock gating and delayed clocking • Mehta et al., ICCD’97: Uses dynamic programming based heuristic for clustering • Tsai et al., ICCAD’05: Formulation employing tunable buffers

  6. Clock Tree Synthesis (CTS) • Performed after the placement/sizing of sequentials • Converts logical clock tree into physical one • Flow employed in several microprocessor designs • [Figure (simplified version), labels: RTL; logic synthesis; physical synthesis; CTS; routing; logical clock tree; sequentials (x, y), sizes; clock buffer duplication; routing clock nets; sizing clock buffers]

  7. Clock Buffer Duplication • Given a clock buffer, duplicate it to meet delay, slope, RC, skew constraints • Decides • receivers driven by the same driver • the clock tree topology • Applied recursively in reverse topological order • Driven by clustering or partitioning • Often intractable when capacity constraints specified • Many heuristics available • [Figure labels: K-stage buffers; K-stage receivers; duplication]

  8. Outline • Introduction • Problem Formulation • Clustering Algorithm • Experimental Results • Conclusion

  9. Effect of Clustering on Capacitance • A cluster implies a clock buffer • Interconnect capacitance varies significantly for different solutions even with same number of clusters • [Figure labels: 4 placed sequentials; Solution 1; Solution 2; Solution 3]

  10. Clustering Targeting Power • Find the clusters such that total local clock power is minimum • Power in local clock: P_LocalClock = P_Dynamic + P_Leakage • P_Dynamic = P_SequentialCap + P_BufferCap + P_RoutingCap • P_Leakage and P_BufferCap can be shown proportional to total cap • P_SequentialCap is fixed for CTS purposes • Reducing P_LocalClock is equivalent to minimizing interconnect cap • Find the clusters such that total interconnect capacitance is minimum
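The argument on slide 10 can be written out as a short derivation. This is only a sketch: the symbols C_seq, C_int, and the constants κ, β, λ are hypothetical names introduced here, and it assumes that "total cap" means the load the buffers drive (sequential plus interconnect capacitance).

```latex
\begin{align*}
P_{\text{LocalClock}} &= P_{\text{Dynamic}} + P_{\text{Leakage}} \\
P_{\text{Dynamic}} &= P_{\text{SequentialCap}} + P_{\text{BufferCap}} + P_{\text{RoutingCap}} \\
% Assumption: switching power is proportional to the capacitance switched
P_{\text{SequentialCap}} + P_{\text{RoutingCap}} &= \kappa \,(C_{\text{seq}} + C_{\text{int}}) \\
% Slide: buffer-cap power and leakage are proportional to total cap
P_{\text{BufferCap}} = \beta\,\kappa\,(C_{\text{seq}} + C_{\text{int}}), &\qquad
P_{\text{Leakage}} = \lambda\,(C_{\text{seq}} + C_{\text{int}}) \\
\Rightarrow\; P_{\text{LocalClock}} &= \bigl(\kappa(1+\beta) + \lambda\bigr)\,(C_{\text{seq}} + C_{\text{int}})
\end{align*}
```

Since C_seq is fixed once the sequentials are placed and sized, and the leading factor is a positive constant under these assumptions, minimizing P_LocalClock over clusterings is the same as minimizing C_int.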

  11. Routing-aware Clustering: Chicken-and-Egg Problem • Routing cap is unknown till the clustering is performed • Clustering cannot be performed till routing cap is known

  12. Problem Simplification • Let’s assume minimum spanning tree (MST) routing estimates • Other candidates: HPWL, Edahiro metric • Data in the paper show MST and Edahiro metric strongly correlated with actual clock tree wirelength • MST possesses submodularity property suitable for greedy optimization • Can the problem be solved optimally, i.e., can we perform clustering such that the routing cap./overall power is minimum? • Yes, it can be (if capacity constraints are dropped)
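To make two of the candidate estimates concrete, here is a minimal Python sketch of an MST-based and an HPWL-based wirelength estimate for one cluster. It assumes Manhattan distance between receiver locations; the function names are illustrative and do not come from the presentation, and the Edahiro metric is not shown.

```python
def manhattan(p, q):
    """Manhattan distance between two (x, y) points (assumed routing metric)."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def hpwl(points):
    """Half-perimeter wirelength: width plus height of the bounding box."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def mst_length(points):
    """Total edge length of a minimum spanning tree, using Prim's algorithm."""
    n = len(points)
    if n < 2:
        return 0
    in_tree = [False] * n
    dist = [float("inf")] * n   # cheapest known connection to the tree
    dist[0] = 0
    total = 0
    for _ in range(n):
        # Pick the closest point that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: dist[i])
        in_tree[u] = True
        total += dist[u]
        # Relax distances of the remaining points through u.
        for v in range(n):
            if not in_tree[v]:
                dist[v] = min(dist[v], manhattan(points[u], points[v]))
    return total


corners = [(0, 0), (4, 0), (0, 3), (4, 3)]
print(hpwl(corners), mst_length(corners))
```

For the four corners of a 4-by-3 rectangle the two estimates differ (7 versus 10), which illustrates why the choice of estimate matters when clusters hold more than two or three receivers.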

  13. Problem Definition • Given: Set of receivers S = {s_1, …, s_n}, their loads (c_{s_i}), and locations (x_{s_i}, y_{s_i}) • Find: A set of clusters, S_clusters = {c_1, …, c_m}, such that Σ_i (α + MST(c_i)) is minimum • Subject to Constraints (or Design Parameters): • Maximum # of receivers (due to process, routing, etc.) • Maximum load in a cluster (due to library) • Bounding box width/height (to control RC delay and variations in it)
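The problem definition can be sketched in Python as a cost function plus a feasibility check. This reads the objective as the sum over clusters of α + MST(c_i), treats α as a per-cluster constant (the transcript does not define it further), and assumes Manhattan distance. The `Receiver` class and all function and parameter names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Receiver:
    x: float      # placement x-coordinate
    y: float      # placement y-coordinate
    load: float   # load presented to the clock buffer


def mst_length(cluster):
    """Manhattan MST length of one cluster (Prim's algorithm)."""
    k = len(cluster)
    if k < 2:
        return 0.0
    dist = [float("inf")] * k
    dist[0] = 0.0
    done = [False] * k
    total = 0.0
    for _ in range(k):
        u = min((i for i in range(k) if not done[i]), key=dist.__getitem__)
        done[u] = True
        total += dist[u]
        for v in range(k):
            if not done[v]:
                d = abs(cluster[u].x - cluster[v].x) + abs(cluster[u].y - cluster[v].y)
                dist[v] = min(dist[v], d)
    return total


def total_cost(clusters, alpha):
    """Objective: sum over clusters of (alpha + MST(cluster))."""
    return sum(alpha + mst_length(c) for c in clusters)


def is_feasible(cluster, max_receivers, max_load, max_width, max_height):
    """Check the three constraint families for one non-empty cluster."""
    xs = [r.x for r in cluster]
    ys = [r.y for r in cluster]
    return (len(cluster) <= max_receivers
            and sum(r.load for r in cluster) <= max_load
            and max(xs) - min(xs) <= max_width
            and max(ys) - min(ys) <= max_height)
```

A clustering is acceptable when every cluster passes `is_feasible`; among acceptable clusterings, the one with the smallest `total_cost` is preferred.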

  14. Outline • Introduction • Problem Formulation • Clustering Algorithm • Experimental Results • Conclusion

  15. Power-aware Clustering Algorithm • Similar to Kruskal’s MST construction algorithm • Steps in algorithm: • Create complete graph G(S, E, W) • Assign each edge estimated capacitance as the weight • Create trivial solution with each cluster containing a receiver • For each edge, in ascending order of weights • Merge clusters till the cost function is minimized
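The steps on slide 15 can be sketched as a Kruskal-style loop over a union-find structure. This is an illustrative reading, not the author's implementation. It assumes Manhattan distance as a stand-in for the estimated edge capacitance, and it assumes a merge is worthwhile only while the edge weight is below a per-cluster constant `alpha`, because a merge removes one cluster and adds one edge. The bounding-box constraint is omitted for brevity, and all names are hypothetical.

```python
from itertools import combinations


def cluster_receivers(points, loads, alpha, max_receivers, max_load):
    """Greedy Kruskal-like clustering; returns (clusters, total tree length)."""
    n = len(points)
    parent = list(range(n))   # union-find forest: one trivial cluster per receiver
    size = [1] * n            # receivers per cluster root
    load = list(loads)        # total load per cluster root

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    # Complete graph: one weighted edge per pair of receivers.
    edges = sorted(
        (abs(points[i][0] - points[j][0]) + abs(points[i][1] - points[j][1]), i, j)
        for i, j in combinations(range(n), 2)
    )

    wire = 0
    for w, i, j in edges:
        if w >= alpha:
            break                           # assumed stop rule: no later merge lowers cost
        ri, rj = find(i), find(j)
        if ri == rj:
            continue                        # already in the same cluster
        if size[ri] + size[rj] > max_receivers or load[ri] + load[rj] > max_load:
            continue                        # merge would violate a capacity constraint
        parent[rj] = ri
        size[ri] += size[rj]
        load[ri] += load[rj]
        wire += w

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values()), wire


pts = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
print(cluster_receivers(pts, [1] * 5, alpha=5, max_receivers=3, max_load=10))
```

With a limit of three receivers, the five collinear points split into a cluster of three and a cluster of two. Tightening the limit to two leaves the middle point as its own cluster.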

  16. Example • Constraint: maximum # of receivers = 3 • [Figure labels: a cluster; an edge; the weight; edge weights 1, 2, 4, 4, 5, 5]

  17. Example • Constraint: maximum # of receivers = 3 • [Figure: edge weights 1, 2, 4, 4, 5, 5]

  18. Example • Constraint: maximum # of receivers = 3 • [Figure: edge weights 1, 2, 4, 4, 5, 5]

  19. Example • Constraint: maximum # of receivers = 3 • Power-aware clustering results in clusters with total MST value of 3, which is optimal in this case • [Figure: edge weights 1, 2, 4, 4, 5, 5]
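The example on slides 16 to 19 can be replayed in a few lines of Python. The transcript preserves the edge weights (1, 2, 4, 4, 5, 5) and the limit of three receivers, but not which receivers each edge joins, so the node names and the edge assignment below are hypothetical. The sketch also assumes every feasible merge lowers the cost.

```python
# Hypothetical assignment of the six weights to the edges of a 4-receiver graph.
EDGES = [
    (1, "a", "b"), (2, "b", "c"), (4, "a", "c"),
    (4, "c", "d"), (5, "a", "d"), (5, "b", "d"),
]
MAX_RECEIVERS = 3


def run_example(edges, max_receivers):
    """Merge clusters along edges in ascending weight order, respecting the limit."""
    cluster_of = {}
    for _, u, v in edges:
        cluster_of.setdefault(u, frozenset([u]))
        cluster_of.setdefault(v, frozenset([v]))
    total = 0
    for w, u, v in sorted(edges):
        cu, cv = cluster_of[u], cluster_of[v]
        if cu == cv or len(cu) + len(cv) > max_receivers:
            continue                      # same cluster, or merge would exceed the limit
        merged = cu | cv
        for node in merged:
            cluster_of[node] = merged
        total += w
    return set(cluster_of.values()), total


clusters, total = run_example(EDGES, MAX_RECEIVERS)
print(sorted(sorted(c) for c in clusters), total)
```

Under this assignment the edges of weight 1 and 2 are accepted and every heavier edge is rejected by the receiver limit, giving a total MST value of 3, in line with the slide.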

  20. Optimality, Time Complexity of Algorithm • Ensures optimality when no capacity constraints (max. load, # of receivers) specified • Reduces to minimum spanning forest problem • Runs in O(n^2 log n) time in number of receivers • Handles blocks with ~5K sequentials easily • 1.34 seconds for clustering of 1037 sequentials • Run-times practical and comparable to competitive algorithms • Clock buffer duplication takes minutes on ~5K sequential blocks
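One way to arrive at the stated bound, assuming that sorting the edges of the complete graph dominates the run-time, as it does in a standard Kruskal-style implementation:

```latex
\begin{align*}
|E| &= \binom{n}{2} = \frac{n(n-1)}{2} = O(n^2) \\
T_{\text{sort}} &= O(|E| \log |E|) = O(n^2 \log n^2) = O(n^2 \log n)
\end{align*}
```

For the 1037-sequential block quoted on the slide, the complete graph has 1037 · 1036 / 2 = 537,166 edges.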

  21. Outline • Introduction • Problem Formulation • Clustering Algorithm • Experimental Results • Conclusion

  22. Evaluation of Power-Aware Clustering (PoAwCl) • Implemented clustering algorithm, PoAwCl, in C++ • Incorporated in the clock buffer duplication step using TCL • Rest of the CTS kept unchanged • Generated clock trees on microprocessor blocks by changing only the clustering/partitioning heuristics • Best of the results compared with the PoAwCl

  23. Results on Clock Trees: Int. Cap. Improvement • 13% average improvement

  24. Results on Clock Trees: Total Cap. Improvement • 6% average improvement

  25. Results on Clock Trees: Wirelength Improvement • 11% average improvement

  26. Looking at Cluster Pictures • ●, +, *, ▼ denote locations of sequentials; same-type symbols denote a cluster • 4 clusters, in each case, represent 4 clock buffers driving the sequentials in their clusters • [Figure panels: power-aware clustering; clustering aimed at minimizing # of buffers]

  27. Viewing the Routing • Power-aware clustering (on right) results in smaller wirelength

  28. Agenda • Introduction • Motivation • Problem Formulation • Clustering Algorithm • Experimental Results • Conclusion

  29. Conclusion • Power-aware clustering results in 13% improvement in interconnect cap • Also frees up routing resources by 11%, discounting shielding and spacing of clock wires • Used for other applications such as enable logic (or clock gating) synthesis, trunk-routing • Acknowledgment: Intel’s CAD Organization, for providing the source code of the CTS package, which sped up the development

  30. Thank you.
