Scalable Stochastic Programming

Scalable Stochastic Programming Cosmin G. Petra Mathematics and Computer Science Division Argonne National Laboratory petra@mcs.anl.gov Joint work with MihaiAnitescu and Miles Lubin

Motivation • Sources of uncertainty in complex energy systems • Weather • Consumer Demand • Market prices • Applications @Argonne – Anitescu, Constantinescu, Zavala • Stochastic Unit Commitment with Wind Power Generation • Energy management of Co-generation • Economic Optimization of a Building Energy System

Stochastic Unit Commitment with Wind Power • Wind Forecast – WRF(Weather Research and Forecasting) Model • Real-time grid-nested 24h simulation • 30 samples require 1h on 500 CPUs (Jazz@Argonne) Thermal generator Wind farm Slide courtesy of V. Zavala & E. Constantinescu

Economic Optimization of a Building Energy System • Proactive management - temperature forecasting & electricity prices • Minimize daily energy costs Slide courtesy of V. Zavala

Optimization under Uncertainty • Two-stage stochastic programming with recourse (“here-and-now”) subj. to. subj. to. continuous discrete Sample average approximation (SAA) subj. to. Sampling Inference Analysis M samples

Linear Algebra of Primal-Dual Interior-Point Methods Convex quadratic problem IPM Linear System Min subj. to. Multi-stage SP Two-stage SP nested arrow-shaped linear system (via a permutation)

The Direct Schur Complement Method (DSC) • Uses the arrow shape of H • 1.Implicit factorization 2. Solving Hz=r • 2.1. Back substitution 2.2. Diagonal Solve 2.3. Forward substitution

Parallelizing DSC – 1. Factorization phase Process 1 Dense linear algebra Process 2 Process 1 2. Backsolve Process p Factorization of the 1st stage Schur complement matrix = BOTTLENECK Sparse linear algebra

Parallelizing DSC – 2. Backsolve Process 1 Process 1 Process 2 Process 2 Dense linear algebra Process 1 1.Factorization Process p Process p 1st stage backsolve = BOTTLENECK Sparse linear algebra Sparse linear algebra

Scalability of DSC Unit commitment 76.7% efficiency butnot always the case Large number of 1st stage variables: 38.6% efficiency on Fusion @ Argonne

BOTTLENECK SOLUTION 1: STOCHASTIC PRECONDITIONER

Preconditioned Schur Complement (PSC) (separate process) REMOVES the factorization bottleneck Slightly largerbacksolve bottleneck

The Stochastic Preconditioner • The exact structure of C is • IID subset of n scenarios: • The stochastic preconditioner(Petra & Anitescu, 2010) • For C use the constraint preconditioner (Keller et. al., 2000)

The “Ugly” Unit Commitment Problem • DSC on P processes vs PSC on P+1 process Optimal use of PSC – linear scaling • 120 scenarios Factorization of the preconditioner can not be hidden anymore.

Quality of the Stochastic Preconditioner • “Exponentially” better preconditioning (Petra & Anitescu 2010) • Proof: Hoeffding inequality • Assumptions on the problem’s random data • Boundedness • Uniform full rank of and not restrictive

Quality of the Constraint Preconditioner • has an eigenvalue 1 with order of multiplicity . • The rest of the eigenvalues satisfy • Proof: based on Bergamaschiet. al., 2004.

The Krylov Methods Used for • BiCGStab using constraint preconditioner M • Preconditioned Projected CG (PPCG) (Gould et. al., 2001) • Preconditioned projection onto the • Does not compute the basis for Instead,

Performance of the preconditioner • Eigenvalues clustering & Krylov iterations • Affected by the well-known ill-conditioning of IPMs.

SOLUTION 2: PARALELLIZATION OF STAGE 1 LINEAR ALGEBRA

Parallelizing the 1st stage linear algebra • We distribute the 1st stage Schur complement system. • C is treated as dense. • Alternative to PSC for problems with large number of 1st stage variables. • Removes the memory bottleneck of PSC and DSC. • We investigated ScaLapack, Elemental (successor of PLAPACK) • None have a solver for symmetric indefinite matrices (Bunch-Kaufman); • LU or Cholesky only. • So we had to think of modifying either. densesymm. pos. def., sparse full rank.

ScaLapack (ORNL) • Classical block distribution of the matrix • Blocked “down-looking” Cholesky - algorithmic blocks • Size of algorithmic block = size of distribution block! • For cache-performance - large algorithmic blocks • For good load balancing - small distribution blocks • Must trade off cache-performance for load balancing • Communication: basic MPI calls • Inflexible in working with sub-blocks

Elemental (UT Austin) • Unconventional “elemental” distribution: blocks of size 1. • Size of algorithmic block size of distribution block • Both cache-performance (large alg. blocks) and load balancing (distrib. blocks of size 1) • Communication • More sophisticated MPI calls • Overhead O(log(sqrt(p))), p is the number of processors. • Sub-blocks friendly • Better performance in a hybrid approach, MPI+SMP, than ScaLapack

Cholesky-based -like factorization • Can be viewed as an “implicit” normal equations approach. • In-place implementation inside Elemental: no extra memory needed. • Idea: modify the Cholesky factorization, by changing the sign after processing p columns. • It is much easier to do in Elemental, since this distributes elements, not blocks. • Twice as fast as LU • Works for more general saddle-point linear systems, i.e., pos. semi-def. (2,2) block.

Distributing the 1st stage Schur complement matrix • All processors contribute to all of the elements of the (1,1) dense block • A large amount of inter-process communication occurs. • Possibly more costly than the factorization itself. • Solution: use buffer to reduce the number of messages when doing a Reduce_scatter. • approach also reduces the communication by half – only need to send lower triangle.

Reduce operations • Streamlined copying procedure - Lubin and Petra (2010) • Loop over continuous memory and copy elements in send buffer • Avoids divisions and modulus ops needed to compute the positions • “Symmetric” reduce for • Only lower triangle is reduced • Fixed buffer size • A variable number of columns reduced. • Effectively halves the communication (both data & # of MPI calls).

Large-scale performance • First-stage linear algebra: ScaLapack (LU), Elemental(LU), and • Strong scaling of PIPS with and • 90.1% from 64 to 1024 cores • 75.4% from 64 to 2048 cores • > 4,000 scenarios SAA problem: 1st stage variables: 82,000 Total #: 189 million Thermal units: 1,000 Wind farms: 1,200

Towards real-life models – UC with transmission constraints • Current status: ISOs (Independent system operator) uses UC with • deterministic wind profiles, market prices and demand • network (transmission) constraints • Outer 1-h timestep 24 horizon simulation • Inner 5-min timestep 1h horizon corrections • Stochastic UC with transmission constraints (V. Zavala 2010) • Stochastic wind profiles & transmission constraints • Deterministic market prices and demand • 24 horizon with 1h timestep • Kirchoff’s laws are part of the constraints • The problem is huge: KKT systems are 1.8 Bil x 1.8 Bil Load node Generator

Solving UC with transmission constraints • 32k wind scenarios (k=1024) • 32k nodes (131,072 cores) on Intrepid BG/P • Hybrid programming model: SMP inside MPI • Sparse 2nd-stage linear algebra: WSMP (IBM) • Dense 1st-stage linear algebra: Elemental in SMP mode • Time to solution: 20h = 4h for loading + 16h for solving • Flop count is aprox. 1% of the peak • For a 4h Horizon problem very good strong scaling

Real-time solution • Exploiting the structure • Time-coupling in the second-stage -> block tridiagonal linear systems. • Sparsebacksolvesmust be used. • Speedup in the backsolve phase > 20x is obtained. • Preliminary tests for 24h horizon:time to solution < 1h (after loading). • Flop count does not necessarily increase. • Strongscaling does not change by much in the case of the 24h horizon problem. The second-stage KKT systems has a block tridiagonal structure given by the time-coupling The structure of the KKT system for a general 2-stage SP problem

Concluding remarks • PIPS – parallel interior-point solver for stochastic SAA problems • Specialized linear algebra layer • Small-sized 1st-stage subproblems DSC • Medium-sized 1st-stage  PSC • Large-sized 1st-stage  Distributed SC • Scenario parallelization in a hybrid programming model MPI+SMP

Thank you for your attention! Questions?

Scalable Stochastic Programming