A Network-Flow Approach to Timing-Driven Incremental Placement for ASICs

Shantanu Dutt, Huan Ren, Fenghua Yuan and Vishal Suthar
Dept. of Electrical and Computer Engineering, University of Illinois-Chicago

Outline
• Motivation & prior work
• General methodology of FlowPlace
• Net delay model
• TD analytical global placement
• TD network-flow based detailed placer
• Benchmarks
• Experimental results
• Conclusions

Motivation
• Placement in high-performance designs
  • Has a large effect on performance metrics, e.g., timing and power
  • Fast timing closure is a major but often hard-to-realize goal
  • Several metrics need to be met at the same time
• Incremental timing-driven placement
  • Initial placement → improve timing incrementally on critical paths
  • More accurate timing information can be acquired from the initial placement
  • Minimizes the effect on the other metrics of the initial placement; convergence is a byproduct
  • Also important for ECO applications

Prior Work
• Existing timing-driven placement
  • Path-based: minimize the critical paths directly
    • Pros: timing is essentially path-based
    • Cons: excessive number of paths
  • Net-based: transform timing into net weights or net budgets
    • Pros: low complexity, flexible
    • Cons: often ignores path information; has a convergence problem
• The net-based approach is the most common method
  • Kahng et al. (ISPD’02)
    • Minimize the max weighted net delay using LP, with net weights based on the max path delay violation through the net. All paths meet constraints simultaneously.
    • Can fit into a standard WL-driven top-down design flow
  • Yang et al. (ICCAD’02)
    • New slack allocation approach that assigns more slack to nets with larger estimated WL and fanout
    • Minimizes total net delay violation using simulated annealing
    • Achieves more efficient slack usage in the final placement

Prior Work (cont.)
• Incremental TD placement
  • Wonjoon et al. (ICCAD’03)
    • Path-based constraints for every violated path pj (with a limit on the maximum number of paths)
    • Simple bisection method to remove overlap; no control of delay change
  • Luo et al. (DAC’06)
    • Considers both cells on the critical path and cells that are logically adjacent to it, to control timing perturbation
    • Delay model with delay/slew propagation
  • Both algorithms use LP for re-placement, which doesn’t address the quadratic part of the delay accurately
• N/w-flow based detailed placement
  • Brenner et al. (ISPD’04), Doll et al. (ICCAD’94)
    • Try to send flow, at minimum cost, from congested areas or from cells that haven’t been placed to vacant areas
    • Allow temporary small illegality (e.g., overlap or out of boundary) caused by movement according to the flow
    • WL-driven; the deterioration from the global placement results is small

Our Goals & Methodology
• Goals:
  • Accurate pre-route delay estimation
  • Targeted global & detailed TD re-placement of critical & near-critical paths
  • Minimal effect on the rest of the circuit
  • Fast
• Methodology (flow chart): Initial placed circuit → STA & determine critical node set (moveC) → TD analytical global placement (TAN) on moveC → TD n/w-flow based detailed placement (TIF) on moveC → New placement w/ improved performance

WL and Pre-Route Delay Model
• WL calculation
  • We use a star graph model to calculate WL
  [Figure: star graph model. Net nodes up (xp, yp), ud (xd, yd), uq (xq, yq) and ui (xi, yi) connect to the centroid C(xc, yc)]
• Pre-route delay model
  • Driver node driving load capacitance
  • Self interconnect delay
  • Self interconnect seeing other interconnect & load capacitance
  [Figure: delay model. Labels: ud (xd, yd), ui (xi, yi), up (xp, yp), uq (xq, yq); interconnect length ld,i and ld,i/2; g of Ctotal and (1-g) of Ctotal]
  • Best results for g = 1
  [Figure: fidelity of our model]
• A future model, which models nets with multiple star structures, is still under development
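The star-graph WL calculation can be sketched as follows. This is a minimal illustration, not the authors' formula: it assumes the centroid is the mean of the pin coordinates and that each star edge is measured in Manhattan distance. The function name `star_wirelength` is hypothetical.

```python
def star_wirelength(pins):
    """Star-model wirelength of one net.

    pins: list of (x, y) cell positions on the net.
    Assumptions (not stated on the slide): the centroid is the mean of
    the pin coordinates, and each star edge uses Manhattan distance.
    """
    xc = sum(x for x, _ in pins) / len(pins)
    yc = sum(y for _, y in pins) / len(pins)
    wl = sum(abs(x - xc) + abs(y - yc) for x, y in pins)
    return (xc, yc), wl


# Four pins on the corners of a 4 x 4 square.
centroid, wl = star_wirelength([(0, 0), (4, 0), (0, 4), (4, 4)])
print(centroid, wl)
```

Each pin contributes exactly one edge to the centroid, so moving a single cell changes only its own edge length plus a small shift of the centroid.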

TD Analytical Global Placement (TAN)
• A TD extension of a combination of Gordian and Gordian-L
• Essentially a quadratic programming approach
• Uses an iterative approach to model the linear terms of delay in the objective function
• Critical delay cost of a net
  • Need to focus only on the sinks on the critical paths
  • Formulation:
  • A net with more critical paths through it is more important to optimize: one optimization step can minimize delay on all those paths
[Figure: net w/ 2 critical paths through it]

TD Analytical Global Placement (contd.)
• Allocated slack of a net
  • A weight measure for determining the TD WL reduction of a net
  • Two factors need to be considered: the minimum path slack through the net and the # of nets in that path
  • Therefore we uniformly allocate path slack to each net; the allocated slack of a net is: Net slack = Path slack / # of nets in the path
• Equi-delay paths
  • Observation: nets with the same weight in TAN tend to have the same length after optimization
  • Thus we can get:
[Figure: net delays on two paths. Panel captions: "Before optimization two paths have the same delay"; "After optimization one is longer than the other"; "After optimization both paths have approx. the same delay"]
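A minimal sketch of uniform slack allocation, under one reading of the slide: for each net, take the minimum-slack path through it and divide that path's slack evenly among the nets on the path. The function name, the data layout and the example numbers are hypothetical.

```python
def allocated_slack(paths):
    """Uniformly allocate path slack to nets.

    paths: list of (path_slack, [net names on the path]).
    For each net, use the minimum-slack path through it and divide that
    slack by the number of nets on that path (an assumed reading of the
    slide, not a confirmed formula).
    """
    worst = {}  # net -> (slack, number of nets) of its min-slack path
    for slack, nets in paths:
        for net in nets:
            if net not in worst or slack < worst[net][0]:
                worst[net] = (slack, len(nets))
    return {net: slack / count for net, (slack, count) in worst.items()}


# Hypothetical example: net n3 lies on both paths.
paths = [(6.0, ["n1", "n2", "n3"]), (8.0, ["n3", "n4"])]
print(allocated_slack(paths))
```

Here n3 takes its share from the tighter path (slack 6.0 over three nets), while n4 sees only the looser path.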

TD Analytical Global Placement (contd.)
• Final objective function, to solve min-max via min-sum
• The delay cost of a net
• The objective function
  • The delay cost is divided into a quadratic part and a linear part
  • Quadratic terms: can be solved by a normal quadratic programming technique
  • Linear part: each linear term is approximated by a quadratic term as follows
    • In the formulation, the coordinates in the denominator take their current values
    • We run several iterations until the results converge
    • The linear terms in y are dealt with in the same way
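The iterative treatment of linear terms can be illustrated with a one-dimensional toy problem. This sketch assumes the Gordian-L-style approximation |x - a| ≈ (x - a)² / |x_cur - a|, where x_cur is the current coordinate; it is not the TAN objective itself, which mixes quadratic and linear terms over many cells. The function name and data are hypothetical.

```python
def minimize_linear_by_reweighting(fixed, iters=50, eps=1e-9):
    """Minimize sum_i |x - a_i| over one movable coordinate x.

    Each linear term is replaced by the quadratic (x - a_i)^2 / |x_cur - a_i|,
    with the denominator evaluated at the current x. The resulting quadratic
    is minimized in closed form (a weighted mean) and the step is repeated.
    """
    x = sum(fixed) / len(fixed)  # start from the pure quadratic solution
    for _ in range(iters):
        # eps guards against division by zero when x lands on a fixed point
        weights = [1.0 / max(abs(x - a), eps) for a in fixed]
        x = sum(w * a for w, a in zip(weights, fixed)) / sum(weights)
    return x


# Three fixed neighbours; the linear objective is minimized at their median.
print(minimize_linear_by_reweighting([0.0, 1.0, 10.0]))
```

The starting point is the mean (about 3.67), which is what a purely quadratic objective would return; the reweighted iterations pull x toward the median, which is the minimizer of the linear objective.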

TD N/W-Flow Based Detailed Placement (TIF)
• General purpose
  • Solves the overlap problem from the global placer
  • Minimizes the deterioration of the delay improvement obtained from the global placer
  • Legalizes the placement, satisfying WS constraints
• General n/w-flow graph
  • Arc cost = TD cost; linear & step functions
  • Arc capacity:
    • Horizontal: how much a cell can move (accuracy issues)
    • Vertical: width of head cell
    • S → moved cell: width(cell)
    • Row → T: WS of row
[Figure: general n/w-flow graph with source S, sink T and new cells A1, A2. Row1: C11 C12 C13 C14 W1; Row2: C21 C22 W21 C24 W2; Row3: C31 C32 C33 W3. Caption: flow to legalize A1 position]
[Figure: cell placement after cells are moved in the flow direction]
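To make the capacity rules concrete, here is a small min-cost flow sketch on a toy two-row network. It uses a textbook successive-shortest-path solver with Bellman-Ford, not the TIF solver, and all node names, capacities and costs are made up for illustration: S → A1 has capacity width(A1), each row → T arc has capacity equal to that row's white space, and the A1 arcs carry an assumed per-unit TD cost.

```python
def min_cost_flow(n, arcs, s, t, demand):
    """Successive shortest paths with Bellman-Ford.

    arcs: list of (u, v, capacity, unit_cost). Returns (flow, total_cost).
    """
    graph = [[] for _ in range(n)]  # edge = [to, residual cap, cost, rev index]
    for u, v, cap, cost in arcs:
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])
    flow = total_cost = 0
    inf = float("inf")
    while flow < demand:
        dist = [inf] * n
        dist[s] = 0
        prev = [None] * n  # (previous node, edge index) on the shortest path
        for _ in range(n - 1):
            changed = False
            for u in range(n):
                if dist[u] == inf:
                    continue
                for i, (v, cap, cost, _) in enumerate(graph[u]):
                    if cap > 0 and dist[u] + cost < dist[v]:
                        dist[v] = dist[u] + cost
                        prev[v] = (u, i)
                        changed = True
            if not changed:
                break
        if dist[t] == inf:
            break  # no augmenting path left
        push, v = demand - flow, t
        while v != s:  # find the bottleneck capacity along the path
            u, i = prev[v]
            push = min(push, graph[u][i][1])
            v = u
        v = t
        while v != s:  # update residual capacities
            u, i = prev[v]
            graph[u][i][1] -= push
            graph[v][graph[u][i][3]][1] += push
            v = u
        flow += push
        total_cost += push * dist[t]
    return flow, total_cost


S, A1, W1, W2, T = range(5)
arcs = [
    (S, A1, 4, 0),   # S -> moved cell: width(A1) = 4
    (A1, W1, 4, 1),  # horizontal shift within row 1 (assumed unit cost 1)
    (A1, W2, 4, 3),  # vertical move to row 2 (assumed unit cost 3)
    (W1, T, 3, 0),   # row 1 -> T: white space of row 1 = 3
    (W2, T, 5, 0),   # row 2 -> T: white space of row 2 = 5
]
print(min_cost_flow(5, arcs, S, T, demand=4))  # prints (4, 6)
```

In this toy instance the cheapest solution sends 3 units into row 1 and only 1 unit along the vertical arc, which is the kind of non-discrete vertical flow the slides on illegalities deal with.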

Arc Cost in TIF
• Sensitivity-based cost
  • We define the delay of a net to be the delay from its driver cell to its most critical sink cell. Consider the net delay change when a cell is moved:
    • If ui is the critical sink or driver:
    • Otherwise:
• Arc cost formulation
  • For a cell, we find the most critical nets (those belonging to the path with the smallest slack) connected to it; the unit flow cost of the arcs from the cell is:
  • From experiments, k=2 gives the best results
[Figure: delay model when ui moves to u’i. Labels: up, ud, uq, Δld,i, l’d,i, g of C’total, (1-g) of C’total]

Tackling Illegalities in TIF
• The incremental detailed placement problem is a discrete optimization problem (DOP), so solving it with a continuous optimization method introduces certain illegalities. There are two major problems.
• Discrete flow requirement in vertical arcs
  • A vertical arc represents vertical cell movement by a discrete amount (distance to the nearest row)
  • → Flow on it should be either full capacity (cell width) or 0
  • The n/w-flow solution may not meet this requirement
• Resulting placement problems:
  • The full cost of movement is not incurred in the n/w flow
  • The cell moved up has a larger area than the n/w flow modeled
[Figure: non-discrete flow example with cells u, v, w, x; flows f1=2 and f2=3; arc labels (5, c1), (7, c2), (7, c3). Panels: initial placement, resulting placement]

Tackling Illegalities in TIF (contd.)
• Our flow-discretizing solution for a vertical arc is a 3-step process:
  • Step 1: Initially, vertical arc cap = 1, cost = full cost
  • Step 2: After the first 1 unit of flow is passed, cap = original cap - 1, cost = 0
  • Step 3: After all flow is passed, the cost and capacity of the adjacent horizontal arc are updated to 0
[Figure: Step1, Step2, Step3 and final placement for cells u, v, w, x. Arc labels include (1, full-cost), (4,0), (7, c2), (7, c3) and (inf,0); flows f1=1 and f2=4. Captions: "Full cost is incurred"; "Encourage flow to keep going through arc"; "Horiz arc cost updated"; "Final placement"]
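Steps 1 and 2 amount to a step cost function on the vertical arc: the first unit of flow pays the full movement cost and the remaining width passes for free. A minimal sketch, assuming integer flow; the function names and the example cost of 12 are hypothetical, and Step 3 (the horizontal-arc update) is not modeled.

```python
def vertical_arc_segments(cell_width, full_cost):
    """Arc states from Steps 1 and 2 as (capacity, unit cost) pairs."""
    return [(1, full_cost), (cell_width - 1, 0)]


def vertical_move_cost(flow, cell_width, full_cost):
    """Cost charged for pushing `flow` units over the discretized arc."""
    cost, remaining = 0, flow
    for cap, unit_cost in vertical_arc_segments(cell_width, full_cost):
        used = min(remaining, cap)
        cost += used * unit_cost
        remaining -= used
    if remaining > 0:
        raise ValueError("flow exceeds the cell width")
    return cost


# A cell of width 5 whose full vertical movement cost is assumed to be 12.
for f in (0, 1, 3, 5):
    print(f, vertical_move_cost(f, 5, 12))
```

Any nonzero flow is charged the same full cost, so once the first unit has passed there is no cost reason to stop short of the full cell width.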

Tackling Illegalities in TIF (contd.)
• Split flows
  • These occur when there is flow on both the upward and the downward arc
• Two heuristics to solve the problem
  • The two split flows go through the tree structure to the sink
  1. Max flow: we choose the branch tree with the larger flow
  2. Min cost: we choose the branch tree with the smaller flow cost, looking at the first k levels
• Our experiments show that the max-flow heuristic does better
[Figure: split flow with f1=2 on arc (5,c1) and f2=3 on arc (5,c2), between cells C21, C22 and C31, C32]
[Figure: flows f1 and f2 from A1 into Tree1 (C12 ..., C21 C22 C23) and Tree2 (C31 C32 C33)]

Satisfying White Space Constraints
• Due to the discrete nature of the detailed placement problem, the white space constraint (the max row width must not exceed a pre-specified limit) can’t be ensured by the n/w-flow process
• Two methods are used to deal with this problem
  • Dynamic row size constraint monitoring
  • Push-violation arcs in the next iteration
[Figure: WS violation caused by non-discrete flow, with cells u, v, w, x; flows f1=2 and f2=3; arc labels (5, c1), (7, c2), (7, c3); row white space WS=3 and WS=-2]
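A minimal sketch of the row-width check behind the white space constraint: a row's white space is its width limit minus the total width of the cells placed in it, and a negative value marks a violation. The function names, row names and numbers are illustrative assumptions, not taken from the benchmarks.

```python
def row_white_space(width_limit, cell_widths):
    """White space left in a row; negative means the row is over its limit."""
    return width_limit - sum(cell_widths)


def violated_rows(rows):
    """rows: dict of row name -> (width_limit, [cell widths]).

    Returns the rows whose total cell width exceeds the limit, with their
    (negative) white space.
    """
    result = {}
    for name, (limit, widths) in rows.items():
        ws = row_white_space(limit, widths)
        if ws < 0:
            result[name] = ws
    return result


rows = {
    "row1": (20, [7, 5, 5]),     # 3 units of white space remain
    "row2": (20, [7, 5, 5, 5]),  # a whole cell of width 5 landed here
}
print(violated_rows(rows))
```

Because a cell moves as a whole even when only part of its width was routed as flow, a check like this has to run on actual cell widths rather than on flow amounts.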

Satisfying WS Constraints (contd.)
• Dynamic WS constraint monitoring
  • Monitor the total cell width in each row after every negative-cycle-based iterative improvement of the n/w flow:
    • Initial flow on a vertical arc: the total cell width is moved to the target row
    • Fully reversed flow: the total cell width is moved back to the orig. row
  • To facilitate cell movement, we allow a temporary white space violation in a row for each direction of the flow
  • Once a violation in a direction occurs, no further violations are allowed unless it goes to 0. Monitored by top and bottom violation guards Gt and Gb
• If a violation remains in the row, then:
  • Push-violation arc in the next iteration. Thrashing is prevented by disallowing reverse movement
[Figure: violated row and full row between min-cost flows from S. Labels: W=4, W=3, W=9, W=7; Gt = 0 → -5; Gb = 0 → 4 → 4; Net viol = 0 → -1; "Otherwise"]

TIF’s High-level Flow
[Flow chart boxes: Global n/w flow; Detailed n/w flow (on induced network); Physical flow interpretation; decision "All new cells placed & all viol fixed?" (No: iterate again; Yes: End)]

Global Network Flow
• The global flow network gives a global view of how flows will generally go
• With the global flow, we can eliminate detailed-flow arcs that are unlikely to carry flow
• This can greatly reduce the cycles in the detailed n/w flow, reducing runtime without noticeably degrading the improvement
• Ci+1 is the probabilistic average of all left-to-right detailed horizontal arc costs in the row
• Ci+1,i is the weighted average of the detailed vertical arc costs between two rows
• 65% runtime reduction at the cost of 1-2 timing deterioration
[Figure: global flow network over Row i-1, Row i (violated row) and Row i+1, with A1, A2 and the sink. Arc labels: (w(A2),0), (violi,0), (w(R), Ci+1,i), (w(Wi+1), Ci+1)]

Benchmarks
• There are three sets of benchmarks: Ibm, Faraday and TD-Dragon
• The Ibm and Faraday suites are originally not timing benchmarks; we generate synthetic timing characteristics for them
• The Ibm circuits don’t identify FFs. We determine FFs in cycles, and break all cycles with the minimum # of FFs. The average percentage of FFs is 13%
• Both suites lack information on the resistances and capacitances of cells and interconnects. We choose typical values for 0.18-micron technology for these parameters
• Benchmark characteristics

Efficacy of TD Arc Costs
• Global place (TAN) → Detailed place (TD cost): deterioration 4.3%
• → Detailed place (unit cost): deterioration 7.8%
• → Detailed place (0 cost): deterioration 10.7%
• 45% reduction in the deterioration of global-place results by going from unit cost → TD cost
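The 45% figure follows from the two deterioration values listed on this slide, taking the unit-cost deterioration as the baseline:

```latex
\[
\frac{7.8\% - 4.3\%}{7.8\%} = \frac{3.5}{7.8} \approx 0.449 \approx 45\%
\]
```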

Final Results
[Chart: delay improvement for ibm benchmarks, initial placement WL-driven (Dragon). Values shown: 24.2%, 19.7%, 4.5%]
[Chart: delay improvement for Faraday benchmarks, initial placement WL-driven (Dragon). Values shown: 24.3%, 20.6%, 3.7%]

Final Results (contd.)
[Charts: delay improvement for TD-Dragon benchmarks placed by Dragon (cell delay), and delay improvement for TD-Dragon benchmarks placed by Dragon (no cell delay). Values shown: 12.0%, 8.2%, 24.1%, 3.8%, 19.6%, 4.5%]
[Chart: delay improvement on TD-Dragon placement for different WS constraints. Values shown: 10.2%, 4.0%, 6.2%]
• [Wonjoon & Bazargan, ICCAD’03] achieves an avg. of 2.8% improvement with 5% WS
• For 5% WS, our improvement is 4.2% (50% relative improvement)

Empirical Asymptotic Time Complexity
• Runtime is 18% of Dragon and 12% of TD-Dragon
• Obtains a solution for the 210K-cell circuit ibm18 w/ 34% improvement in 24 mins
[Two runtime plots; in each, a linear curve best fits the data]

Conclusions
• Proposed a TD incremental placement flow, FlowPlace
  • Global and detailed incremental placer
  • New accurate pre-route net delay models
  • Can optimize both the quadratic and the linear delay terms in the global placer
  • TD n/w flow to solve detailed TD placement:
    • Sensitivity-based TD arc costs; constraint satisfaction (e.g., WS); discretization of illegal continuous solutions; global n/w flow graph
• Promising results
  • Delay improvement of up to 34% for a 210K-cell WL-optimized layout in 24 mins
  • Delay improvement of up to 10% for a 26K-cell TD-optimized layout in just above 5 mins
  • The average delay improvement is 18.34%
  • The WL deterioration is 8% on average
  • The average run time is only 12-18% of the original placement runtime
• TD-IBM benchmarks and placed outputs are available at the FlowPlace page: www.ece.uic.edu/~dutt/benchmarks-etc/FlowPlace/flow.html
• Concepts can be extended to timing and power optimization with constraints, and to physical re-synthesis

Satisfying White Space Constraints
• Dynamic WS constraint monitoring
  • We monitor the total cell width in each row after every negative-cycle-based iterative improvement of the n/w flow:
    • Initial flow on a vertical arc: the total cell width is moved to the target row
    • Fully reversed flow: the total cell width is moved back to the orig. row
  • To facilitate cell movement, we allow temporary white space violation under constraints
    • Viol_max = max cell width
    • Violations from above and below are calculated separately
  • Because the flow in step two is allowed, thanks to the separate violation limits for flow from above and from below, we can finally legalize the placement
[Figure: cells u, v, x moving between rows toward the sink. Labels: W=5, W=7; vio_top values 3 and 0; vio_bot values 0 and 2; row WS values 2, -2, -3, -5, 0, 5]
