Parallelization by SimPLification: A Case Study in VLSI Placement

Myung-Chul Kim, Dong-Jin Lee and Igor L. Markov
Dept. of EECS, University of Michigan
PAPA2011, University of Michigan

Complexities of Parallel Algorithms & SW
1. Objectives of parallelization
   A. Improve completion time by using multiple cores in parallel
   B. Improve throughput by using stream processing (latency may increase and become less predictable)
   C. Improve power consumption (by decreasing the clock rate)
2. Not an objective (a pitfall)
   - Come up with a slow algorithm that is easy to parallelize
3. In this talk: how to accomplish 1.A without 2
   - Take a leading algorithm and speed up its bottlenecks
   - Design a new algorithm that is (a) better, (b) easy to parallelize

CAD Algorithms
- Sequence of optimizations
  - Subject to Amdahl's law
  - The more stages, the harder to parallelize effectively
- Additional complications
  - Elaborate data structures may entail overhead for parallel access
  - When processing is light, memory bandwidth may become a bottleneck (with 4+ threads)
- Recommendations
  - A simpler algorithm is often easier to parallelize (fewer stages, simpler data structures)
  - Using standard solvers, e.g., for linear algebra, helps reuse previous work on parallelization
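The Amdahl's law constraint mentioned above can be made concrete with a small calculation. This is a minimal sketch; the function name and the 80% parallel fraction are illustrative assumptions, not figures from the talk.

```python
def amdahl_speedup(parallel_fraction: float, n_threads: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the runtime scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_threads)


# Hypothetical flow in which 80% of the runtime is parallelizable.
for n in (2, 4, 8):
    print(n, "threads:", round(amdahl_speedup(0.8, n), 2))
```

Even with 8 threads the bound stays well below 8x, which is why every additional serial stage in a CAD flow hurts.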

Global Placement: Motivation
- Interconnect lagging in performance while transistors continue scaling
  - Circuit delay, power dissipation and area dominated by interconnect
  - Routing quality highly controlled by placement
- Circuit size and complexity rapidly increasing
  - Scalable placement algorithm is critical
  - Simplicity, integration with other optimizations
[Figure labels: IR drop, Coupling, RC delay, Unloaded]

Goals in Placement
- Find good relative ordering of cells
  - Minimize wire length and congestion
  - Maximize timing slack
- Find good spacing of cells
  - Eliminate wiring congestion problems
  - Provide space for post-placement stages: clock trees, buffer insertion, timing correction
- Find good global position

Optimize Relative Order [Figure: cells A, B, C]

To spread ... [Figure: cells A, B, C]

... or not to spread [Figure: cells A, B, C]

Place to the left [Figure: cells A, B, C]

... or to the right [Figure: cells A, B, C]

Optimize Relative Order [Figure: cells A, B, C]
Without whitespace, placement is dominated by ordering

Example of Global Placement (APlace 2.04 from UCSD)

Example of Global Placement (mFar from UCSB)

Placement Formulation
- Objective: minimize estimated wirelength
  - Half-perimeter wirelength (HPWL): (max X – min X) + (max Y – min Y)
- Subject to constraints:
  - Legality: row-based placement with no overlaps
  - Routability: limiting local interconnect congestion for successful routing
  - Timing: meeting the performance target of a design
[Figure axes: x, y]
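The HPWL objective above is easy to state in code. This is a minimal sketch, assuming each net is given as a list of (x, y) pin locations; the data layout and function names are illustrative, not taken from the SimPL implementation.

```python
def hpwl(net_pins):
    """Half-perimeter wirelength of one net: (max X - min X) + (max Y - min Y)."""
    xs = [x for x, _ in net_pins]
    ys = [y for _, y in net_pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_hpwl(nets):
    """Sum of per-net HPWL over the whole netlist."""
    return sum(hpwl(pins) for pins in nets)


# Two hypothetical nets: a 2-pin net and a 3-pin net.
nets = [[(0, 0), (3, 4)], [(1, 1), (2, 5), (6, 2)]]
print(total_hpwl(nets))
```

Only the bounding box of each net matters, so interior pins can move freely without changing the estimate.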

Quadratic Placement
- Consider a graph first, not a hypergraph
- Minimize $\sum_{e_{ij}} \left[(x_i - x_j)^2 + (y_i - y_j)^2\right]$ (the sum is over edges $e_{ij}$)
  - Seems unrelated to $\sum_{e_{ij}} \left(|x_i - x_j| + |y_i - y_j|\right)$, but can still be separated into x- and y-components
- Physical analogy: Hooke's law
  - Consider an elastic spring, stretched by x
  - Force $F = -kx$ (k is the spring constant)
  - Energy $E = \tfrac{1}{2}kx^2$
- Our goal: minimize the energy of the system
  - A system of springs will only settle in a minimum
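Setting the derivative of the quadratic objective to zero places every movable cell at the weighted average of its neighbors, which is a sparse linear system. The sketch below solves a tiny 1-D instance with Gauss-Seidel sweeps. That solver is chosen here only to keep the example dependency-free (the talk itself uses conjugate gradient), and the function name and data layout are assumptions.

```python
def solve_quadratic_1d(movable, fixed, edges, sweeps=200):
    """Minimize sum of w * (x_i - x_j)^2 over edges, with fixed nodes held in place.

    movable: list of node names; fixed: {name: coordinate};
    edges: list of (i, j, w). Every movable node is assumed to have a neighbor.
    """
    pos = dict(fixed)
    pos.update({name: 0.0 for name in movable})
    neighbors = {name: [] for name in movable}
    for i, j, w in edges:
        if i in neighbors:
            neighbors[i].append((j, w))
        if j in neighbors:
            neighbors[j].append((i, w))
    for _ in range(sweeps):
        for name in movable:
            # Zero gradient: the node sits at the weighted average of its neighbors.
            total_w = sum(w for _, w in neighbors[name])
            pos[name] = sum(w * pos[n] for n, w in neighbors[name]) / total_w
    return {name: pos[name] for name in movable}


# Chain of unit springs: pad at 0, cells a and b, pad at 3.
result = solve_quadratic_1d(
    ["a", "b"],
    {"p0": 0.0, "p3": 3.0},
    [("p0", "a", 1.0), ("a", "b", 1.0), ("b", "p3", 1.0)],
)
print(result)
```

The two cells settle at equal spacing between the pads, as the spring analogy suggests.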

Iterative Optimization

Prior Work
- Ideal placer
  - Low runtime without sacrificing solution quality
  - Simplicity, integration with other optimizations
[Figure: speed vs. solution quality; groups shown: ideal placer; quadratic and force-directed (mFAR, Kraftwerk2, FastPlace3); non-convex optimization (mPL6, APlace2, NTUPlace3)]

Key features of SimPL
- Flat quadratic placement
- Primal-dual optimization
- Closing the gap between upper and lower bounds
[Figure: wirelength vs. iteration; labels: upper-bound solution by look-ahead legalization, lower-bound solution by linear system solver, initial WL opt., final solution, final legal solution]

Common Analytical Placement Flow
[Flowchart: Placement Instance -> Initial WL Optimization -> Global Placement -> Converge? (no: repeat Global Placement; yes: Legalization and Detailed Placement)]

SimPL Flow
[Flowchart: Placement Instance -> Initial WL Optimization (B2B Graph Building -> Linear System Solver; repeat until WL converges) -> Global Placement (Look-ahead Legalization (Upper-Bound) -> Pseudonet Insertion -> B2B Graph Building -> Linear System Solver (Lower-Bound); repeat until converged) -> Legalization and Detailed Placement]
- B2B net model: P. Spindler, et al., "Kraftwerk2 - A Fast Force-Directed Quadratic Placement Approach Using an Accurate Net Model," TCAD 2008
- We delegate final legalization and detailed placement to FastPlace-DP: M. Pan, et al., "An Efficient and Effective Detailed Placement Algorithm," ICCAD 2005

SimPL: Look-ahead Legalization
- Purpose: produce an almost-legal placement (upper bound) while preserving the relative cell ordering given by the linear system solver (lower bound)
- Identify target region
  - Find overflow bin b
  - Create a minimal, wide-enough bin cluster B around b
- Perform geometric top-down partitioning
  - Find cell-area median (Cc) and whitespace median (CB)
  - Assign cells (Cc) to corresponding partitions (CB)
- Non-linear scaling
  - Form stripe regions
  - Move cells across stripe regions in order, based on whitespace
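The top-down partitioning step can be illustrated in one dimension. This is a simplified sketch, not the SimPL implementation: it assumes uniform whitespace with no obstacles (so the whitespace median CB is the region midpoint), recurses down to single cells instead of applying per-stripe scaling, and uses a hypothetical (name, x, area) cell format.

```python
def lookahead_spread_1d(cells, lo, hi):
    """Spread cells over [lo, hi] while preserving their left-to-right order.

    cells: list of (name, x, area) with positive areas.
    Returns {name: new_x}.
    """
    cells = sorted(cells, key=lambda c: c[1])
    if not cells:
        return {}
    if len(cells) == 1:
        return {cells[0][0]: (lo + hi) / 2.0}

    # Cell-area median Cc: longest prefix whose area does not exceed half
    # of the total, keeping at least one cell on each side.
    total_area = sum(area for _, _, area in cells)
    running, split = 0.0, 0
    for k, (_, _, area) in enumerate(cells):
        if k > 0 and running + area > total_area / 2.0:
            break
        running += area
        split = k + 1

    # Whitespace median CB: the midpoint, under the uniform-whitespace assumption.
    cut = (lo + hi) / 2.0
    placed = lookahead_spread_1d(cells[:split], lo, cut)
    placed.update(lookahead_spread_1d(cells[split:], cut, hi))
    return placed


# Four heavily overlapping unit-area cells spread over a region of width 8.
clustered = [("a", 0.0, 1.0), ("b", 0.1, 1.0), ("c", 0.2, 1.0), ("d", 0.3, 1.0)]
spread = lookahead_spread_1d(clustered, 0.0, 8.0)
print(spread)
```

The cells end up evenly spaced and in their original order, which is the property the upper-bound solution relies on.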

SimPL: Look-ahead Legalization (1)
- Performing geometric top-down partitioning
[Figure labels: cell-area median (Cc), whitespace median (CB), overfilled bin, bin cluster (B), B0, B1]

SimPL: Look-ahead Legalization (2)
[Figure labels: cell-area median (Cc), whitespace median (CB), B0]

SimPL: Look-ahead Legalization (2)
[Figure: per-stripe linear scaling of cells 1-8 with cell ordering; labels: CB, uniform cutlines, obstacle borders]

SimPL: Look-ahead Legalization (3)
- Example (adaptec1)
- Look-ahead legalization stops when target regions become small enough

SimPL: Using legal locations as anchors
- Purpose: gradually perturb the linear system to generate lower-bound solutions with less overlap
- Anchors and pseudonets
  - Look-ahead locations used as fixed, zero-area anchors
  - Anchors and original cells connected with 2-pin pseudonets
  - Pseudonet weights grow linearly with iterations
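The effect of a growing pseudonet weight can be seen on a single cell in one dimension. This is a minimal sketch: the closed-form minimizer follows from the quadratic objective, but the function name, the neighbor positions, and the growth rate of 0.5 per iteration are illustrative assumptions.

```python
def lower_bound_position(neighbors, anchor, pseudonet_weight):
    """Minimize sum_j w_j * (x - x_j)^2 + pseudonet_weight * (x - anchor)^2 over x.

    neighbors: list of (x_j, w_j) for the cell's regular connections.
    """
    numerator = sum(w * xj for xj, w in neighbors) + pseudonet_weight * anchor
    denominator = sum(w for _, w in neighbors) + pseudonet_weight
    return numerator / denominator


neighbors = [(0.0, 1.0), (2.0, 1.0)]  # regular nets pull the cell toward x = 1
anchor = 10.0                         # look-ahead (legalized) location of the cell
# Pseudonet weight grows linearly with the iteration count (rate 0.5 is hypothetical).
trajectory = [lower_bound_position(neighbors, anchor, 0.5 * it) for it in range(6)]
print([round(x, 3) for x in trajectory])
```

With zero weight the cell sits at the wirelength optimum; as the weight grows it is pulled steadily toward its anchor, which is the tug-of-war between the two bounds.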

Next illustration: Tug-of-war between low-wirelength and legalized placements

SimPL Iterations on Adaptec1 (1)
[Figure panels: Iteration=0 (Init WL Opt.), Iteration=1 (Upper Bound), Iteration=2 (Lower Bound), Iteration=3 (Upper Bound)]

SimPL Iterations on Adaptec1 (2)
[Figure panels: Iteration=10 (Lower Bound), Iteration=11 (Upper Bound), Iteration=20 (Lower Bound), Iteration=21 (Upper Bound)]

SimPL Iterations on Adaptec1 (3)
[Figure panels: Iteration=30 (Lower Bound), Iteration=31 (Upper Bound), Iteration=40 (Lower Bound), Iteration=41 (Upper Bound)]

Convergence of SimPL
Legal solution is formed between two bounds

Empirical Results: ISPD05 Benchmarks
- Experimental setup
  - Single-threaded runs on a 3.2GHz Intel core i7 Quad CPU Q660 Linux workstation
  - HPWL is computed by GSRC Bookshelf Evaluator
- < 5000 lines of code in C++, including CG solver for sparse linear systems (with Jacobi preconditioner)

Speeding Up Placement Using Parallelism
- SimPL has very few components (5 KLOC)
- Each bottleneck is amenable to some form of parallelism
  - Thread-level
  - Instruction-level

Parallelism in Conjugate Gradient Solver
- Coarse-grain row partitioning
  - Implemented using OpenMP 3.0 compiler intrinsic
- SSE2 (Streaming SIMD Extensions) instructions
  - Process 4 data elements with a single instruction
  - Marginal runtime improvement in SpMxV
- Reducing memory bandwidth demand of SpMxV
  - CSR (Compressed Sparse Row) format
  - Y. Saad, "Iterative Methods for Sparse Linear Systems," SIAM 2003
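Coarse-grain row partitioning of the sparse matrix-vector product (SpMxV) inside a Jacobi-preconditioned CG solver can be sketched as follows. This is an illustration only: the talk's implementation is C++ with OpenMP and SSE2, whereas this sketch uses Python threads (which show the partitioning but, because of the interpreter lock, not the speedup), and all function names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def spmv_rows(values, col_idx, row_ptr, x, row_begin, row_end):
    """y[r] = sum_k A[r, k] * x[k] for one contiguous block of CSR rows."""
    out = []
    for r in range(row_begin, row_end):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        out.append(acc)
    return out


def spmv_parallel(values, col_idx, row_ptr, x, pool, n_chunks):
    """Split the rows into n_chunks contiguous blocks, one task per block."""
    n_rows = len(row_ptr) - 1
    bounds = [n_rows * c // n_chunks for c in range(n_chunks + 1)]
    futures = [pool.submit(spmv_rows, values, col_idx, row_ptr, x,
                           bounds[c], bounds[c + 1]) for c in range(n_chunks)]
    y = []
    for f in futures:
        y.extend(f.result())
    return y


def jacobi_pcg(values, col_idx, row_ptr, b, n_threads=2, tol=1e-10, max_iter=1000):
    """Conjugate gradient with a diagonal (Jacobi) preconditioner, A in CSR."""
    n = len(b)
    diag = [0.0] * n
    for r in range(n):
        for k in range(row_ptr[r], row_ptr[r + 1]):
            if col_idx[k] == r:
                diag[r] = values[k]

    def dot(u, v):
        return sum(a * c for a, c in zip(u, v))

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        x = [0.0] * n
        r = list(b)                               # residual for x = 0
        z = [ri / di for ri, di in zip(r, diag)]  # apply the preconditioner
        p = list(z)
        rz = dot(r, z)
        for _ in range(max_iter):
            if dot(r, r) ** 0.5 < tol:
                break
            Ap = spmv_parallel(values, col_idx, row_ptr, p, pool, n_threads)
            alpha = rz / dot(p, Ap)
            x = [xi + alpha * pi for xi, pi in zip(x, p)]
            r = [ri - alpha * api for ri, api in zip(r, Ap)]
            z = [ri / di for ri, di in zip(r, diag)]
            rz_new = dot(r, z)
            p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
            rz = rz_new
    return x


# A = [[2, -1], [-1, 2]] in CSR form, b = [0, 3].
solution = jacobi_pcg([2.0, -1.0, -1.0, 2.0], [0, 1, 0, 1], [0, 2, 4], [0.0, 3.0])
print(solution)
```

Each task reads the shared vector and writes a disjoint block of the result, so no locking is needed inside the product.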

Parallelism in CG Solver - Example

Parallelism in B2B Model Update
- B2B net model update
  - B2B model is separable
  - Can process the x and y cases in parallel
- Additionally, split the nets of the netlist into equal groups that can be processed by multiple threads
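The separability described above can be sketched with a simplified bound-to-bound (B2B) builder. The weighting below, 1 / ((p - 1) * |distance|) per two-pin connection, is one common convention for the B2B model (constant factors differ between formulations); the function names, the eps floor for coincident pins, and the net format are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def b2b_edges(pins, eps=1e-6):
    """Two-pin connections for one net along one axis.

    pins: list of (name, coordinate). Every pin is tied to both boundary pins.
    """
    if len(pins) < 2:
        return []
    order = sorted(range(len(pins)), key=lambda k: pins[k][1])
    lo, hi = order[0], order[-1]
    scale = 1.0 / (len(pins) - 1)

    def edge(a, b):
        dist = max(abs(pins[a][1] - pins[b][1]), eps)  # guard coincident pins
        return (pins[a][0], pins[b][0], scale / dist)

    edges = [edge(lo, hi)]
    for k in order[1:-1]:
        edges += [edge(k, lo), edge(k, hi)]
    return edges


def build_axis(nets, axis):
    """B2B connections of a group of nets for axis 0 (x) or 1 (y)."""
    edges = []
    for net in nets:
        edges.extend(b2b_edges([(name, xy[axis]) for name, xy in net]))
    return edges


def build_b2b(nets, n_groups=2):
    """Process x and y separately, and split the nets into equal groups."""
    groups = [nets[g::n_groups] for g in range(n_groups)]
    with ThreadPoolExecutor() as pool:
        jobs = {axis: [pool.submit(build_axis, grp, axis) for grp in groups]
                for axis in (0, 1)}
        return tuple([e for job in jobs[axis] for e in job.result()]
                     for axis in (0, 1))


def quadratic_cost(edges, coord):
    return sum(w * (coord[i] - coord[j]) ** 2 for i, j, w in edges)


nets = [[("a", (0.0, 0.0)), ("b", (1.0, 5.0)), ("c", (4.0, 2.0))],
        [("b", (1.0, 5.0)), ("d", (3.0, 1.0))]]
x_edges, y_edges = build_b2b(nets)
xs = {"a": 0.0, "b": 1.0, "c": 4.0, "d": 3.0}
ys = {"a": 0.0, "b": 5.0, "c": 2.0, "d": 1.0}
print(quadratic_cost(x_edges, xs), quadratic_cost(y_edges, ys))
```

With this weighting, the quadratic cost evaluated at the current positions matches the per-axis half-perimeter spans of the nets, and each (axis, group) task is independent of the others.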

SSE optimization affects Runtime Profile

Parallelism in Look-ahead Legalization (1)
- Look-ahead legalization (LAL) started consuming a significant fraction of overall runtime
- Top-down geometric partitioning and non-linear scaling (T&N) are amenable to parallelization
  - Top-down partitioning generates an increasing number of subtasks of similar sizes, which can be solved in parallel
  - After each level of T&N on a bin cluster, each thread generates two sub-clusters with similar numbers of cells

Parallelism in Look-ahead Legalization (2)
- LAL keeps the global queue of bin clusters Q
- Static partitioning
  - Assign initial bin clusters to available threads such that each thread has a similar number of bin clusters to start
- Subtask updates
  - Thread ti processes one of two sub-clusters (for the next level of T&N); the remainder is added to the global cluster queue Q
- Dynamic task scheduling
  - When thread ti is idle, it dynamically retrieves clusters from the global cluster queue Q
  - The number of clusters to be retrieved: N = max(Q.size()/N_threads, 1)
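The scheduling scheme above can be sketched with a shared queue and a lock. This is a toy model, not the SimPL code: each "cluster" is just a (lo, hi) range of cell indices that is halved until it holds at most min_size cells (min_size is assumed to be at least 1), and the function and variable names are hypothetical.

```python
import threading
import time
from collections import deque


def parallel_partition(initial_clusters, n_threads=4, min_size=2):
    """Recursively halve (lo, hi) clusters using the static + dynamic scheme."""
    queue = deque()                    # global cluster queue Q
    lock = threading.Lock()
    leaves = []                        # clusters that need no further splitting
    pending = [len(initial_clusters)]  # clusters that exist but are not finished

    def worker(local):
        while True:
            while local:
                lo, hi = local.pop()
                if hi - lo <= min_size:
                    with lock:
                        leaves.append((lo, hi))
                        pending[0] -= 1
                    continue
                mid = (lo + hi) // 2
                # Subtask update: keep one sub-cluster, publish the other to Q.
                with lock:
                    queue.append((mid, hi))
                    pending[0] += 1
                local.append((lo, mid))
            with lock:
                if pending[0] == 0:
                    return
                # Dynamic scheduling: N = max(Q.size() / N_threads, 1).
                n = max(len(queue) // n_threads, 1)
                for _ in range(min(n, len(queue))):
                    local.append(queue.popleft())
            if not local:
                time.sleep(0.0005)     # nothing available yet; retry shortly

    # Static partitioning: deal the initial clusters out round-robin.
    threads = [threading.Thread(target=worker,
                                args=(list(initial_clusters[t::n_threads]),))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(leaves)


result = parallel_partition([(0, 16), (16, 24)], n_threads=3, min_size=2)
print(result)
```

Retrieving a share of the queue rather than a single cluster reduces contention on the lock while still leaving work for other idle threads.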

Empirical Results – Overall Speed-ups
- Experimental setup
  - Multithreaded runs on an 8-core AMD-based system with four dual-core CPUs and 16 GB of RAM
  - Each CPU was an Opteron 880 processor running at 2.4GHz with 1024KB cache

Empirical Results – Component Speed-ups

Extending the Routability-driven Placement
Ongoing work: simultaneous place-and-route

Simultaneous Place-and-Route
- After Look-Ahead Legalization (LAL), perform Look-Ahead Routing (LAR)
  - Integrate an in-house router through a clean API
  - Cell locations in, accurate congestion maps out
  - The placer accounts for congestion in addition to density (slightly modified formulas, almost no extra work)
- ISPD 2011 contest organized by IBM Research
  - New, large benchmarks
  - Placements evaluated by a common global router

SimPL SimPLR Key metric is #overflows (OF) Also shown – routed WL (RtWL) PAPA2011, University of Michigan

Conclusions
- New flat quadratic placement algorithm: SimPL
  - Novel primal-dual based approach
  - Amenable to integration with physical synthesis
- Self-contained, compact implementation
  - Fastest among available academic placers
  - Highly competitive solution quality
- Amenable to parallelism
- Easy to extend to simultaneous place-and-route

Questions and Answers
Thank you! Time for Questions
