Large Scale Circuit Placement: Gap and Promise

Large Scale Circuit Placement: Gap and Promise • Jason Cong, UCLA VLSI CAD Lab • Joint work with Chin-Chih Chang, Tim Kong, Michail Romesis, Joseph R. Shinnerl, Min Xie and Xin Yuan

Outline • Introduction • Gap Analysis of Existing Placement Algorithms • Scalable Paradigm – Multilevel Placement

Why Still Placement Problem • True, it has been studied for over 30 years, but … • We need good solutions more than ever • One of the most important steps in the IC implementation flow • Directly defines interconnects • Difficult • Problem size grows 2X every 18-24 months • Moore’s Law • Cannot place hierarchically without quality degradation

Example of Logic Hierarchy in Final Layout By courtesy of IBM (Tony Drumm)

Why Still Placement • True, it has been studied for over 30 years, but … • We need good solutions more than ever • One of the most important steps in the IC implementation flow • Directly defines interconnects • Difficult • Problem size grows 2X every 18-24 months • Moore’s Law • Cannot place hierarchically without quality degradation • We are not very good at it …

Outline • Introduction • Gap Analysis of Existing Placement Algorithms • Scalable Paradigm – Multilevel Placement

Motivation • Lack of significant progress in wirelength reduction • Rate of reduction is about 5-10% every 2-3 years • Latest developments in placement differ mainly in runtime • Most work compares only with known heuristics • Using benchmarks based on real designs • Using synthetic benchmarks • Little understanding of the divergence from the optimal

Placement Examples with Known Optimal Wirelength (PEKO) [Chang et al, 2003] • Given a (real) netlist N • Construct a netlist N’ with known optimal WL that matches the net distribution of N • All the modules are of equal size, and there is no space between rows or between adjacent modules • For 2-pin nets, connect any two adjacent modules • For each n-pin net, connect the n modules in a rectangular region close to a square, i.e., the length of each side is close to sqrt(n) • The wirelength of each n-pin net then follows from the dimensions of that rectangle
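
The near-square net construction can be sketched in a few lines. This is only an illustration under stated assumptions: modules are unit-size cells on a grid, wirelength is measured as the half-perimeter of the bounding box of cell positions (HPWL), and the helper names `near_square_block` and `hpwl` are hypothetical, not taken from the PEKO generator.

```python
import math

def near_square_block(n):
    """Return n grid positions (row, col), packed row by row into a block
    of width ceil(sqrt(n)), so the region is as close to square as possible."""
    if n < 1:
        raise ValueError("a net needs at least one pin")
    width = math.isqrt(n - 1) + 1          # integer ceil(sqrt(n)) for n >= 1
    return [(k // width, k % width) for k in range(n)]

def hpwl(cells):
    """Half-perimeter of the bounding box of the cell positions
    (assumed wirelength model, in units of one cell pitch)."""
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    return (max(rows) - min(rows)) + (max(cols) - min(cols))

# Wirelength of the compact block for a few net degrees.
for degree in (2, 3, 4, 5, 7):
    print(degree, hpwl(near_square_block(degree)))
```

Under these assumptions a 2-pin net between adjacent modules has length 1 and a 4-pin net on a 2 x 2 block has length 2.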

Placement Examples with Known Upperbounds (PEKU) [Cong et al, 2003] • Limitations of PEKO • All the nets are local • Wirelength contribution by global connections in real designs can be significant • Extend PEKO by introducing non-local nets to mimic global connections • Method 1: Generate a subset of i-pin nets by randomly connecting i modules on the chip • Method 2: Generate a subset of i-pin nets according to a wirelength distribution vector (WDV)
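
A minimal sketch of Method 1, assuming unit cells on a rows x cols grid and HPWL as the wirelength measure. The function names are hypothetical. The wirelength of such a net in the construction placement contributes to an upper bound on the optimum, not to the optimum itself.

```python
import random

def random_net(i, rows, cols, rng):
    """Method 1 sketch: connect i distinct modules chosen uniformly
    at random from the whole rows x cols chip."""
    picks = rng.sample(range(rows * cols), i)
    return [(p // cols, p % cols) for p in picks]

def hpwl(cells):
    """Half-perimeter of the bounding box (assumed wirelength model)."""
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    return (max(rows) - min(rows)) + (max(cols) - min(cols))

rng = random.Random(0)                 # fixed seed only for repeatability
net = random_net(3, 8, 8, rng)         # a 3-pin non-local net on a 64-cell chip
print(net, hpwl(net))
```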

Illustration: PEKU Example Construction • Input: t = 64, D = {d2=35, d3=21, d4=7, d5=4, d6=2, d7=1}, ratio of randomly generated nets = 0.2, W = {w1…w3=0, w4=3, w5=3, w6=0, w7=2, w8=2, w9=1, w10=0, w11=1, w12=1} • Generate 28 2-pin nets optimally • Generate 7 2-pin nets randomly • Generate 16 3-pin nets optimally • Generate 5 3-pin nets randomly • Generate 6 4-pin nets optimally • Generate 1 4-pin net randomly • Generate 4 5-pin nets optimally • Generate 2 6-pin nets optimally • Generate 1 7-pin net optimally • Total WL = 184

Studied Five State-of-the-Art Placers • Capo [Caldwell et al, 2000] • Based on multilevel partitioner • Aims to enhance the routability • Dragon [Wang et al, 2000] • Uses hMetis for initial partition • SA with bin-based swapping • mPL [Chan et al, 2000] • Nonlinear programming on the coarsest level • Discrete relaxation at finer levels • mPG [Chang et al, 2002] • Uses FC clustering and hierarchical density control • Incremental A-tree for routability • Qplace [Cadence Inc.] • Leading edge industrial placer • Component of Silicon Ensemble

Experimental Results on PEKO • Existing algorithms can be 59% to 140% away from the optimal on PEKO • On examples with pads • mPG and Qplace show improvements of 12% and 10%, respectively • Dragon, mPL, and Capo do not benefit much from the additional information • There is significant room for improvement in placement algorithms

Experimental Results on PEKO • Capo, Qplace and mPL scale well in runtime • Average solution quality of each tool deteriorates by an additional 9% to 17% when the problem size increases by a factor of 10

Experimental Results on PEKU • The effectiveness of existing placers can vary significantly for circuits of similar size but different characteristics • Comparing QRs helps to identify the technique that works best under each scenario • QR compares placed wirelength with the upper bound, which may not be tight

High Interest in the Community

Timing-driven Placement Examples with Known Optimal (TPEKO) • Obtain a placement for the circuit from any available tool • Perform timing analysis on the circuit • Create an artificial combinational path with equal or larger delay than the longest path • Guarantee the cells in the path are adjacent to each other • Make necessary modifications (Figure labels: original longest path; artificial path)

Evaluating Timing-Driven Placement Algorithms Using TPEKO • Evaluating two state-of-the-art FPGA placement algorithms • VPR [Marquardt et al. 2000] • PATH [Kong 2002] • Can be far away from the optimal for difficult examples • 35% on average • 54% in the worst case

Observations from Gap Analysis • Significant opportunity in placement • Existing algorithms may produce solutions far away from the optimal • The solution quality of the same placer varies for circuits of similar size but different characteristics • Scalability problem in runtime and solution quality • Significant ROI • Benefit equal to one to two generations of process scaling • But without requiring multi-billion dollar investment (hopefully!)

Outline • Introduction • Gap Analysis of Existing Placement Algorithms • Scalable Paradigm • Timing Optimization • Routability Optimization • Concluding Remarks • Application • Multi-Million Gate FPGA Placement

Paradigm 2: Multilevel Placement • Coarsening: build the hierarchy by recursive aggregation (generalized clustering) • Relaxation: improve the placement at each level by localized optimization • Interpolation: transfer coarse-level solution to adjacent, finer level (generalized declustering) • Multilevel Flow: multiple traversals over multiple hierarchies (V-cycle variations)
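
The interplay of coarsening, interpolation and relaxation can be sketched as one recursive V-cycle. This is a toy skeleton, not mPL's implementation: the four component functions (`pair_up`, `place_in_order`, `inherit`, `spread`) are hypothetical stand-ins on a 1-D row of uniquely named cells, chosen only to make the control flow runnable.

```python
def v_cycle(cells, coarsen, place_coarsest, interpolate, relax, min_size=2):
    """One V-cycle: coarsen until the problem is small, place the coarsest
    level, then interpolate and relax on the way back up.
    min_size must be at least 1 so that the recursion terminates."""
    if len(cells) <= min_size:
        return place_coarsest(cells)
    clusters, cluster_of = coarsen(cells)                      # coarsening
    coarse_pos = v_cycle(clusters, coarsen, place_coarsest,
                         interpolate, relax, min_size)
    pos = interpolate(cells, cluster_of, coarse_pos)           # interpolation
    return relax(cells, pos)                                   # relaxation

# Toy components (assumptions for illustration only).
def pair_up(cells):
    """Cluster consecutive cells two at a time."""
    clusters = [tuple(cells[k:k + 2]) for k in range(0, len(cells), 2)]
    cluster_of = {c: cl for cl in clusters for c in cl}
    return clusters, cluster_of

def place_in_order(cells):
    """Coarsest-level placement: one slot per cell, in list order."""
    return {c: float(k) for k, c in enumerate(cells)}

def inherit(cells, cluster_of, coarse_pos):
    """Simplest interpolation: every cell takes its cluster's position."""
    return {c: coarse_pos[cluster_of[c]] for c in cells}

def spread(cells, pos):
    """Toy relaxation: remove overlap by assigning consecutive slots in
    order of inherited position (stable sort keeps ties in list order)."""
    order = sorted(cells, key=lambda c: pos[c])
    return {c: float(k) for k, c in enumerate(order)}

placement = v_cycle(list("abcdefgh"), pair_up, place_in_order, inherit, spread)
print(placement)
```

Passing the components as functions mirrors the point that different aggregation or relaxation methods can be swapped in at different levels or in different V-cycles.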

Multi-Level Optimization Framework • Multilevel coarsening generates smaller problem sizes at coarser levels, and therefore faster optimization at coarser levels • May explore different aspects of the solution space at different levels • Gradual refinement on good solutions from coarser levels is very efficient • Successful in many applications • Originally developed for PDEs • Recent success in VLSI CAD: partitioning, placement, routing (Figure labels: given problem; coarsening (clustering), problem size decreases; interpolation & relaxation (optimization))

Multilevel Coarse Placement • A bin grid structure at each level • Hierarchical area density control • Optimization by SA, QP, RDFL, etc. (Figure labels: coarsening by clustering; initial placement; refinement by placement)

Multilevel Methods: Coarsening by Recursive Aggregation • Recursive aggregation defines the hierarchy • Different aggregation algorithms can be used on different levels and/or in different V-cycles • Clustering methods • First-Choice Clustering (hMetis [Karypis 1999]) • AMG-based aggregation • An aggregate need not be a cluster: a cell can be fractionally associated with more than one aggregate (Figure labels: merge each vertex with its “best” neighbor; merged nets)
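
Merging each vertex with its "best" neighbor can be sketched as follows. This is a simplified variant under stated assumptions: the netlist is given as a weighted graph, "best" means the heaviest connection, vertices are visited in dictionary order rather than random order, and there is no area limit on clusters. The function name and input layout are hypothetical, not hMetis's API.

```python
def first_choice_clustering(adj):
    """adj maps each vertex to {neighbor: connection weight} (symmetric).
    Returns a dict mapping each vertex to a cluster id."""
    cluster_of = {}
    for v in adj:
        if v in cluster_of:
            continue                       # already absorbed by a neighbor
        neighbors = adj[v]
        if not neighbors:
            cluster_of[v] = v              # isolated vertex stays alone
            continue
        best = max(neighbors, key=neighbors.get)
        if best in cluster_of:
            # First choice: join the best neighbor even if it is
            # already part of a cluster.
            cluster_of[v] = cluster_of[best]
        else:
            cluster_of[v] = cluster_of[best] = v
    return cluster_of

adj = {
    "a": {"b": 3, "c": 1, "e": 1},
    "b": {"a": 3, "c": 1},
    "c": {"a": 1, "b": 1, "d": 2},
    "d": {"c": 2},
    "e": {"a": 1},
}
print(first_choice_clustering(adj))
```

Because a vertex may join an existing cluster, clusters can grow beyond pairs, which is one reason a practical flow also needs some form of area or density control.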

Multilevel Methods: Relaxation (Intralevel Optimization) • Iterative improvement at each level by fast, localized computation • Discrete permutation enumerations; swapping • Unconstrained quadratic wirelength minimization on subsets • Network-flow based improvement on subsets (RDFL) • Local relaxation is sufficient. Global improvement comes from the multilevel hierarchy. • Relaxations at finer levels may be quite different, e.g., more discrete, than relaxations at coarser levels.

Relaxation on Local Subsets • Move the red cells to their optimal positions, holding all other cells fixed and (perhaps) ignoring overlap (Figure: original subnetlist with subproblem; legend: movable cell, fixed neighbor, unrelated cell)

Example: Goto-based Discrete Relaxation • Each cell’s optimal location is readily calculated when all other cells are held fixed. • Compute a chain A, B, C, D, E, where B is a randomly selected neighbor of A’s optimal location, etc. • Examine all permutations of the chain and take the best one. • Problem: the chain is not closed (A is not necessarily near any other cell’s optimal location).
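
Two steps of this relaxation can be sketched under simple assumptions: wirelength is HPWL, a cell's optimal location is taken as the per-axis median of the cells it shares nets with, and the chain is already given. The random chain construction is not shown, and all names are hypothetical.

```python
from itertools import permutations
from statistics import median

def hpwl(net, pos):
    xs = [pos[c][0] for c in net]
    ys = [pos[c][1] for c in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def optimal_location(cell, pos, nets):
    """Per-axis median of connected cells, all other cells held fixed.
    The median minimizes the summed Manhattan distance to those cells."""
    xs, ys = [], []
    for net in nets:
        if cell in net:
            for other in net:
                if other != cell:
                    xs.append(pos[other][0])
                    ys.append(pos[other][1])
    if not xs:
        return pos[cell]                   # unconnected cell: stay put
    return (median(xs), median(ys))

def best_chain_permutation(chain, pos, nets):
    """Try every assignment of the chain's cells to the chain's current
    slots and keep the cheapest one (the identity is tried first)."""
    slots = [pos[c] for c in chain]
    touched = [net for net in nets if any(c in net for c in chain)]
    best_cost, best_pos = None, None
    for perm in permutations(chain):
        trial = dict(pos)
        trial.update(zip(perm, slots))
        cost = sum(hpwl(net, trial) for net in touched)
        if best_cost is None or cost < best_cost:
            best_cost, best_pos = cost, trial
    return best_pos, best_cost

pos = {"p0": (0, 0), "a": (1, 0), "b": (2, 0), "p1": (3, 0)}
nets = [("p0", "b"), ("p1", "a")]
new_pos, cost = best_chain_permutation(["a", "b"], pos, nets)
print(new_pos, cost)
```

Enumeration costs k! evaluations for a chain of length k, so it is practical only for short chains such as the five-cell chain in the slide.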

Example: Quadratic Relaxation on Noncontiguous Subsets (QRS) • Select a subset M of cells to move • Identify the other cells and pads, F, connected to M by nets • Decouple the horizontal and vertical problems. • M is obtained as segments of length k along a DFS vertex traversal of the netlist

Solving the QRS subproblem • Problem formulation (horizontal case): • Iteratively solve the weighted quadratic minimization problem, using the current solution to determine the weight (as in Gordian-L) • May result in cell overlap!
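
As a hedged illustration of the horizontal subproblem, assume the objective is the sum of w_ij * (x_i - x_j)^2 over 2-pin connections that touch a movable cell, with the cells in F held fixed. At the minimum every movable cell sits at the weighted average of its neighbors, so plain Gauss-Seidel sweeps converge when each movable cell is connected, directly or indirectly, to a fixed cell. The function name and edge format are hypothetical, and the weight update from the current solution is not shown.

```python
def relax_subset_x(x, movable, edges, sweeps=200):
    """Minimize sum of w * (x_i - x_j)^2 over the movable cells.
    x: dict cell -> coordinate; edges: list of (i, j, w) triples.
    Cells not listed in `movable` stay fixed."""
    x = dict(x)
    nbrs = {m: [] for m in movable}
    for i, j, w in edges:
        if i in nbrs:
            nbrs[i].append((j, w))
        if j in nbrs:
            nbrs[j].append((i, w))
    for _ in range(sweeps):
        for m in movable:
            total = sum(w for _, w in nbrs[m])
            if total > 0:
                # Setting the derivative to zero gives the weighted mean.
                x[m] = sum(w * x[j] for j, w in nbrs[m]) / total
    return x

x0 = {"f0": 0.0, "a": 5.0, "b": 5.0, "f1": 3.0}
edges = [("f0", "a", 1.0), ("a", "b", 1.0), ("b", "f1", 1.0)]
solved = relax_subset_x(x0, ["a", "b"], edges)
print(solved["a"], solved["b"])
```

The result places cells purely by connectivity, which is why overlap can appear and a later legalization step is needed.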

Ripple-move legalization [Hur and Lillis, 2000] • Because many forms of subset relaxation ignore overlap, post-relaxation cell swaps may be needed to remove overlap. • Calculate a max-gain monotone path on the bin-grid graph. • Define a DAG on neighboring bins. Edge cost reflects the best wirelength gain over all cell swaps between two bins.

Multilevel Methods: Interpolation (Generalized Declustering) • Goal: transfer a partial solution from a coarser level to its adjacent finer level • Simplest approach: place all components of a cluster at its center • Better approach: place each component of an aggregate at the weighted average of the aggregates to which it is strongly connected • Optionally: impose constraints; e.g., the average location of the components can be held fixed
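
The weighted-average approach can be sketched as follows, assuming each fine-level cell comes with a table of connection strengths to already-placed aggregates. How those strengths are computed is not specified here, and the function name and data layout are hypothetical.

```python
def interpolate(strengths, aggregate_pos):
    """strengths: cell -> {aggregate: connection strength (> 0)}.
    aggregate_pos: aggregate -> (x, y) from the coarser level.
    Each cell is placed at the strength-weighted average position of
    the aggregates it is connected to."""
    placed = {}
    for cell, table in strengths.items():
        total = sum(table.values())
        x = sum(s * aggregate_pos[a][0] for a, s in table.items()) / total
        y = sum(s * aggregate_pos[a][1] for a, s in table.items()) / total
        placed[cell] = (x, y)
    return placed

aggregate_pos = {"A": (0.0, 0.0), "B": (4.0, 8.0)}
strengths = {
    "c1": {"A": 3.0, "B": 1.0},   # pulled mostly toward A
    "c2": {"A": 1.0},             # single aggregate: lands on A
}
print(interpolate(strengths, aggregate_pos))
```

With a single aggregate per cell this reduces to the simplest approach of placing every component at its cluster's position.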

AMG-style Linear Interpolation • Place the C-point representatives • Place the F-points by weighted interpolation • The inherited position of a cluster component can be determined by several cluster positions, not just its own

AMG-based Linear Interpolation [A. Brandt 1986] • Within each cluster, select the cell with maximum degree as the C-point; the others are considered F-points (Figure labels: next finer level; cells; cluster; C-point; AMG interpolation; constant)

Iterated Multilevel Flow • First Choice (FC) clustering • Geometric-based FC clustering: makes use of the placement solution from the 1st V-cycle

Iterated Multilevel Flow • Iterated V-Cycles • F-Cycle • Backtracking V-Cycle

Sample Impact of the Multilevel Components on mPL’s Overall Quality • First-Choice Clustering: 3-4% reduced WL • QRS Relaxation: 5-6% reduced WL • AMG Interpolation: 2-3% reduced WL • Iterated V-cycles: 2-8% reduced WL

mPL 3.0 vs. mPL 1.0 and Gordian-L (12% better than mPL 1.0 with 2x longer runtime; 2% better than Gordian-L and 7x faster)

mPL 3.0 vs. Capo 8.5 and Dragon (10% better than Capo with 2x longer runtime; 4% worse than Dragon but 4x faster)

mPL 3.0 vs. mPL 1.0, Capo 8.5, Dragon and Gordian-L (Chart labels: QRS; AMG; 2nd V-cycle)

Extension: Multilevel Mixed-size Placement • Simultaneously place big and small objects • Gradually fix the locations of big objects and generate an overlap-free placement for big objects during multilevel placement (Figure labels: Level 0; Level k; coarsest level placement; big objects legalization; small object; big object; cluster; fixed big object)

Example: Final Placement of ibm02 by mPG-ms

Concluding Remarks • There is significant opportunity to improve the placement technologies • Multilevel placement is a promising scalable solution
