129 Views

Download Presentation
##### Scalable Stochastic Programming

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**Scalable Stochastic Programming**Cosmin G. Petra Mathematics and Computer Science Division Argonne National Laboratory petra@mcs.anl.gov Joint work with MihaiAnitescu and Miles Lubin**Motivation**• Sources of uncertainty in complex energy systems • Weather • Consumer Demand • Market prices • Applications @Argonne – Anitescu, Constantinescu, Zavala • Stochastic Unit Commitment with Wind Power Generation • Energy management of Co-generation • Economic Optimization of a Building Energy System**Stochastic Unit Commitment with Wind Power**• Wind Forecast – WRF(Weather Research and Forecasting) Model • Real-time grid-nested 24h simulation • 30 samples require 1h on 500 CPUs (Jazz@Argonne) Thermal generator Wind farm Slide courtesy of V. Zavala & E. Constantinescu**Economic Optimization of a Building Energy System**• Proactive management - temperature forecasting & electricity prices • Minimize daily energy costs Slide courtesy of V. Zavala**Optimization under Uncertainty**• Two-stage stochastic programming with recourse (“here-and-now”) subj. to. subj. to. continuous discrete Sample average approximation (SAA) subj. to. Sampling Inference Analysis M samples**Linear Algebra of Primal-Dual Interior-Point Methods**Convex quadratic problem IPM Linear System Min subj. to. Multi-stage SP Two-stage SP nested arrow-shaped linear system (via a permutation)**The Direct Schur Complement Method (DSC)**• Uses the arrow shape of H • 1.Implicit factorization 2. Solving Hz=r • 2.1. Back substitution 2.2. Diagonal Solve 2.3. Forward substitution**Parallelizing DSC – 1. Factorization phase**Process 1 Dense linear algebra Process 2 Process 1 2. Backsolve Process p Factorization of the 1st stage Schur complement matrix = BOTTLENECK Sparse linear algebra**Parallelizing DSC – 2. Backsolve**Process 1 Process 1 Process 2 Process 2 Dense linear algebra Process 1 1.Factorization Process p Process p 1st stage backsolve = BOTTLENECK Sparse linear algebra Sparse linear algebra**Scalability of DSC**Unit commitment 76.7% efficiency butnot always the case Large number of 1st stage variables: 38.6% efficiency on Fusion @ Argonne**Preconditioned Schur Complement (PSC)**(separate process) REMOVES the factorization bottleneck Slightly largerbacksolve bottleneck**The Stochastic Preconditioner**• The exact structure of C is • IID subset of n scenarios: • The stochastic preconditioner(Petra & Anitescu, 2010) • For C use the constraint preconditioner (Keller et. al., 2000)**The “Ugly” Unit Commitment Problem**• DSC on P processes vs PSC on P+1 process Optimal use of PSC – linear scaling • 120 scenarios Factorization of the preconditioner can not be hidden anymore.**Quality of the Stochastic Preconditioner**• “Exponentially” better preconditioning (Petra & Anitescu 2010) • Proof: Hoeffding inequality • Assumptions on the problem’s random data • Boundedness • Uniform full rank of and not restrictive**Quality of the Constraint Preconditioner**• has an eigenvalue 1 with order of multiplicity . • The rest of the eigenvalues satisfy • Proof: based on Bergamaschiet. al., 2004.**The Krylov Methods Used for**• BiCGStab using constraint preconditioner M • Preconditioned Projected CG (PPCG) (Gould et. al., 2001) • Preconditioned projection onto the • Does not compute the basis for Instead,**Performance of the preconditioner**• Eigenvalues clustering & Krylov iterations • Affected by the well-known ill-conditioning of IPMs.**Parallelizing the 1st stage linear algebra**• We distribute the 1st stage Schur complement system. • C is treated as dense. • Alternative to PSC for problems with large number of 1st stage variables. • Removes the memory bottleneck of PSC and DSC. • We investigated ScaLapack, Elemental (successor of PLAPACK) • None have a solver for symmetric indefinite matrices (Bunch-Kaufman); • LU or Cholesky only. • So we had to think of modifying either. densesymm. pos. def., sparse full rank.**ScaLapack (ORNL)**• Classical block distribution of the matrix • Blocked “down-looking” Cholesky - algorithmic blocks • Size of algorithmic block = size of distribution block! • For cache-performance - large algorithmic blocks • For good load balancing - small distribution blocks • Must trade off cache-performance for load balancing • Communication: basic MPI calls • Inflexible in working with sub-blocks**Elemental (UT Austin)**• Unconventional “elemental” distribution: blocks of size 1. • Size of algorithmic block size of distribution block • Both cache-performance (large alg. blocks) and load balancing (distrib. blocks of size 1) • Communication • More sophisticated MPI calls • Overhead O(log(sqrt(p))), p is the number of processors. • Sub-blocks friendly • Better performance in a hybrid approach, MPI+SMP, than ScaLapack**Cholesky-based -like factorization**• Can be viewed as an “implicit” normal equations approach. • In-place implementation inside Elemental: no extra memory needed. • Idea: modify the Cholesky factorization, by changing the sign after processing p columns. • It is much easier to do in Elemental, since this distributes elements, not blocks. • Twice as fast as LU • Works for more general saddle-point linear systems, i.e., pos. semi-def. (2,2) block.**Distributing the 1st stage Schur complement matrix**• All processors contribute to all of the elements of the (1,1) dense block • A large amount of inter-process communication occurs. • Possibly more costly than the factorization itself. • Solution: use buffer to reduce the number of messages when doing a Reduce_scatter. • approach also reduces the communication by half – only need to send lower triangle.**Reduce operations**• Streamlined copying procedure - Lubin and Petra (2010) • Loop over continuous memory and copy elements in send buffer • Avoids divisions and modulus ops needed to compute the positions • “Symmetric” reduce for • Only lower triangle is reduced • Fixed buffer size • A variable number of columns reduced. • Effectively halves the communication (both data & # of MPI calls).**Large-scale performance**• First-stage linear algebra: ScaLapack (LU), Elemental(LU), and • Strong scaling of PIPS with and • 90.1% from 64 to 1024 cores • 75.4% from 64 to 2048 cores • > 4,000 scenarios SAA problem: 1st stage variables: 82,000 Total #: 189 million Thermal units: 1,000 Wind farms: 1,200**Towards real-life models – UC with transmission**constraints • Current status: ISOs (Independent system operator) uses UC with • deterministic wind profiles, market prices and demand • network (transmission) constraints • Outer 1-h timestep 24 horizon simulation • Inner 5-min timestep 1h horizon corrections • Stochastic UC with transmission constraints (V. Zavala 2010) • Stochastic wind profiles & transmission constraints • Deterministic market prices and demand • 24 horizon with 1h timestep • Kirchoff’s laws are part of the constraints • The problem is huge: KKT systems are 1.8 Bil x 1.8 Bil Load node Generator**Solving UC with transmission constraints**• 32k wind scenarios (k=1024) • 32k nodes (131,072 cores) on Intrepid BG/P • Hybrid programming model: SMP inside MPI • Sparse 2nd-stage linear algebra: WSMP (IBM) • Dense 1st-stage linear algebra: Elemental in SMP mode • Time to solution: 20h = 4h for loading + 16h for solving • Flop count is aprox. 1% of the peak • For a 4h Horizon problem very good strong scaling**Real-time solution**• Exploiting the structure • Time-coupling in the second-stage -> block tridiagonal linear systems. • Sparsebacksolvesmust be used. • Speedup in the backsolve phase > 20x is obtained. • Preliminary tests for 24h horizon:time to solution < 1h (after loading). • Flop count does not necessarily increase. • Strongscaling does not change by much in the case of the 24h horizon problem. The second-stage KKT systems has a block tridiagonal structure given by the time-coupling The structure of the KKT system for a general 2-stage SP problem**Concluding remarks**• PIPS – parallel interior-point solver for stochastic SAA problems • Specialized linear algebra layer • Small-sized 1st-stage subproblems DSC • Medium-sized 1st-stage PSC • Large-sized 1st-stage Distributed SC • Scenario parallelization in a hybrid programming model MPI+SMP**Thank you for your attention!**Questions?