Scalable Stochastic Programming

Cosmin Petra and Mihai Anitescu

Mathematics and Computer Science Division

Argonne National Laboratory

INFORMS Computing Society Conference

Monterey, California

January, 2011

- Sources of uncertainty in complex energy systems
- Weather
- Consumer Demand
- Market prices

- Applications @Argonne – Anitescu, Constantinescu, Zavala
- Stochastic Unit Commitment with Wind Power Generation
- Energy management of Co-generation
- Economic Optimization of a Building Energy System

Zavala’s SA2 talk

- Wind Forecast – WRF (Weather Research and Forecasting) Model
- Real-time grid-nested 24h simulation
- 30 samples require 1h on 500 CPUs

Slide courtesy of V. Zavala & E. Constantinescu

- Two-stage stochastic programming with recourse (“here-and-now”)

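[Formulation not preserved in this transcript: a first-stage problem and a recourse problem, each with its own "subj. to" constraints; labels: continuous, discrete]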

Sample average approximation (SAA)

subj. to.

Sampling

Inference Analysis

M samples
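
The formulas on these slides did not survive in this transcript. As a reference point, a two-stage stochastic program with recourse and its sample average approximation are commonly written as below. The symbols ($x$, $y$, $\xi$, $A$, $b$, $T$, $W$, $h$, $f$, $q$) follow the usual textbook convention and are assumptions here, not necessarily the notation used in the talk.

```latex
% Two-stage problem with recourse (textbook form; notation assumed)
\min_{x} \; f(x) + \mathbb{E}_{\xi}\left[ Q(x,\xi) \right]
\quad \text{subj. to} \quad A x = b,\; x \ge 0,

% Recourse (second-stage) problem for one realization \xi
Q(x,\xi) = \min_{y} \; q(y,\xi)
\quad \text{subj. to} \quad T(\xi)\, x + W(\xi)\, y = h(\xi),\; y \ge 0.

% Sample average approximation with M samples \xi_1, \dots, \xi_M
\min_{x,\, y_1, \dots, y_M} \; f(x) + \frac{1}{M} \sum_{i=1}^{M} q(y_i,\xi_i)
\quad \text{subj. to} \quad A x = b,\;
T(\xi_i)\, x + W(\xi_i)\, y_i = h(\xi_i),\; x \ge 0,\; y_i \ge 0.
```

In this form each sample contributes its own block of variables $y_i$, coupled to the others only through $x$.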

Convex quadratic problem

IPM Linear System

Min

subj. to.

Multi-stage SP

Two-stage SP

nested

arrow-shaped linear system

(via a permutation)

The Direct Schur Complement Method (DSC)

- Uses the arrow shape of H
- 1. Implicit factorization
- 2. Solving Hz = r
- 2.1. Back substitution
- 2.2. Diagonal solve
- 2.3. Forward substitution
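
The Schur complement idea behind these steps can be illustrated on a small dense example. This is a minimal sketch, not the PIPS implementation: the block names `K`, `B`, `K0`, the sizes, and the random data are all assumptions, and `np.linalg.solve` stands in for the sparse factorizations a real solver would reuse. The per-scenario terms are independent of each other, which is what allows them to be spread across processes.

```python
import numpy as np

def schur_solve(K, B, K0, r, r0):
    """Solve an arrow-shaped system via the first-stage Schur complement.

    K[i]: (n, n) scenario block, B[i]: (n, n0) border block,
    K0: (n0, n0) first-stage block, r[i] and r0: right-hand sides.
    """
    C = K0.copy()
    r0_red = r0.copy()
    for Ki, Bi, ri in zip(K, B, r):
        # Each term involves a single scenario only.
        C -= Bi.T @ np.linalg.solve(Ki, Bi)
        r0_red -= Bi.T @ np.linalg.solve(Ki, ri)
    z0 = np.linalg.solve(C, r0_red)  # first-stage solve with the Schur complement
    z = [np.linalg.solve(Ki, ri - Bi @ z0) for Ki, Bi, ri in zip(K, B, r)]
    return z, z0

def assemble_dense(K, B, K0):
    """Build the full arrow matrix, only to check the result."""
    n, n0, N = K[0].shape[0], K0.shape[0], len(K)
    H = np.zeros((N * n + n0, N * n + n0))
    for i, (Ki, Bi) in enumerate(zip(K, B)):
        s = slice(i * n, (i + 1) * n)
        H[s, s] = Ki
        H[s, N * n:] = Bi
        H[N * n:, s] = Bi.T
    H[N * n:, N * n:] = K0
    return H

rng = np.random.default_rng(0)
n, n0, N = 4, 3, 5  # illustrative sizes
K = []
for _ in range(N):
    G = rng.standard_normal((n, n))
    K.append(G @ G.T + n * np.eye(n))
B = [rng.standard_normal((n, n0)) for _ in range(N)]
K0 = 10.0 * N * np.eye(n0)
r = [rng.standard_normal(n) for _ in range(N)]
r0 = rng.standard_normal(n0)

z, z0 = schur_solve(K, B, K0, r, r0)
z_dense = np.linalg.solve(assemble_dense(K, B, K0), np.concatenate(r + [r0]))
max_err = np.max(np.abs(np.concatenate(z + [z0]) - z_dense))
```

The only step that touches all scenarios at once is the solve with `C`, which is consistent with the first-stage bottleneck discussed on the following slides.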

[Diagram: work split across Process 1, Process 2, ..., Process p]

1. Factorization

Factorization of the 1st stage Schur complement matrix = BOTTLENECK

2. Backsolve

1st stage backsolve = BOTTLENECK

Unit commitment

76.7% efficiency

but not always the case

Large number of 1st stage variables: 38.6% efficiency

on Fusion @ Argonne

(separate process)

REMOVES the factorization bottleneck

Slightly larger backsolve bottleneck

- The exact structure of C is
- IID subset of n scenarios:
- The stochastic preconditioner (Petra & Anitescu, 2010)
- For C use the constraint preconditioner (Keller et al., 2000)
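
The sampling idea can be sketched on a simplified model. The formulas for C and for the preconditioner are not preserved above, so everything here is an assumption made for illustration: the first-stage matrix is taken to be symmetric positive definite and of the form `Q0` plus an average of per-scenario terms, and the preconditioner averages the same terms over a random subset. The constraint (saddle-point) part handled by the constraint preconditioner is not modeled.

```python
import numpy as np

def scenario_term(rng, n0):
    """Hypothetical SPD first-stage contribution of one scenario."""
    G = rng.standard_normal((n0, n0))
    return G @ G.T / n0

def averaged_schur(Q0, S, idx):
    """Q0 plus the average of the scenario terms S[i] for i in idx."""
    return Q0 + sum(S[i] for i in idx) / len(idx)

def preconditioned_eigenvalues(C, M):
    """Eigenvalues of M^{-1} C, computed from L^{-1} C L^{-T} with M = L L^T."""
    L = np.linalg.cholesky(M)
    X = np.linalg.solve(L, C)          # L^{-1} C
    Y = np.linalg.solve(L, X.T).T      # L^{-1} C L^{-T}
    return np.linalg.eigvalsh((Y + Y.T) / 2)

rng = np.random.default_rng(1)
n0, N, n = 6, 200, 20                  # illustrative sizes
Q0 = np.eye(n0)
S = [scenario_term(rng, n0) for _ in range(N)]

C = averaged_schur(Q0, S, np.arange(N))          # uses all N scenarios
subset = rng.choice(N, size=n, replace=False)    # random subset of n scenarios
M = averaged_schur(Q0, S, subset)                # sampled preconditioner
eigs = preconditioned_eigenvalues(C, M)
print("eigenvalues of M^{-1} C:", np.round(eigs, 3))
```

How tightly the printed eigenvalues cluster around 1 depends on `n` and on the variability of the scenario terms; with `subset` equal to all scenarios they are exactly 1.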

- DSC on P processes vs PSC on P+1 processes

Optimal use of PSC – linear scaling

- 120 scenarios

Factorization of the preconditioner cannot be hidden anymore.

- “Exponentially” better preconditioning (Petra & Anitescu 2010)
- Proof: Hoeffding inequality
- Assumptions on the problem’s random data
- Boundedness
- Uniform full rank of and

not restrictive

- has an eigenvalue 1 with order of multiplicity .
- The rest of the eigenvalues satisfy
- Proof: based on Bergamaschi et al., 2004.

- Eigenvalue clustering & Krylov iterations
- Affected by the well-known ill-conditioning of IPMs.

- We distribute the 1st stage Schur complement system.
- C is treated as dense.
- Alternative to PSC for problems with large number of 1st stage variables.
- Removes the memory bottleneck of PSC and DSC.
- We investigated ScaLAPACK and Elemental (successor of PLAPACK).
- Neither has a solver for symmetric indefinite matrices (Bunch-Kaufman);
- LU or Cholesky only.
- So we had to consider modifying one of them.

dense symm. pos. def.,

sparse full rank.

- Classical block distribution of the matrix
- Blocked “down-looking” Cholesky - algorithmic blocks
- Size of algorithmic block = size of distribution block!

- For cache-performance - large algorithmic blocks
- For good load balancing - small distribution blocks
- Must trade off cache-performance for load balancing
- Communication: basic MPI calls
- Inflexible in working with sub-blocks

- Unconventional “elemental” distribution: blocks of size 1.
- Size of algorithmic block ≠ size of distribution block
- Both cache-performance (large alg. blocks) and load balancing (distrib. blocks of size 1)
- Communication
- More sophisticated MPI calls
- Overhead O(log(sqrt(p))), p is the number of processors.

- Sub-block friendly
- Better performance in a hybrid approach, MPI+SMP, than ScaLAPACK
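
The load-balancing contrast between the two distributions can be made concrete by counting which process owns which entries. This is a rough sketch under assumptions: a 2 x 2 process grid, a 64 x 64 matrix, and the number of stored lower-triangle entries per process as a proxy for work. It does not reproduce the actual ScaLAPACK or Elemental data layouts or their communication.

```python
def lower_triangle_load(n, grid_rows, grid_cols, block):
    """Count lower-triangle entries owned by each process of a 2D grid
    under a block-cyclic distribution with square blocks of size `block`."""
    load = {(pr, pc): 0 for pr in range(grid_rows) for pc in range(grid_cols)}
    for i in range(n):
        for j in range(i + 1):
            owner = ((i // block) % grid_rows, (j // block) % grid_cols)
            load[owner] += 1
    return load

n = 64
block_load = lower_triangle_load(n, 2, 2, block=32)  # one 32 x 32 block per process
elem_load = lower_triangle_load(n, 2, 2, block=1)    # "elemental": blocks of size 1

block_spread = max(block_load.values()) - min(block_load.values())
elem_spread = max(elem_load.values()) - min(elem_load.values())
print(block_spread, elem_spread)  # prints "1024 32"
```

With one large block per process, one process holds no lower-triangle entries at all, while with blocks of size 1 every process holds close to a quarter of them.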

- Can be viewed as an “implicit” normal equations approach.
- In-place implementation inside Elemental: no extra memory needed.
- Idea: modify the Cholesky factorization, by changing the sign after processing p columns.
- It is much easier to do in Elemental, since this distributes elements, not blocks.
- Twice as fast as LU
- Works for more general saddle-point linear systems, i.e., pos. semi-def. (2,2) block.
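
The sign-change idea can be sketched serially on a small dense matrix. This is an illustration under assumptions, not the Elemental code: the function name `signed_cholesky`, the test matrix, and the requirement that no pivoting is needed (positive definite (1,1) block, full-rank off-diagonal block) are all choices made here. It produces K = L J L^T, where J has +1 on the first p diagonal entries and -1 on the rest.

```python
import numpy as np

def signed_cholesky(K, p):
    """Factor K = L @ J @ L.T with J = diag(+1 (p times), -1 (rest)).

    Assumes K = [[A, B.T], [B, -C]] with A (p x p) positive definite and
    C + B A^{-1} B.T positive definite, so no pivoting is needed.
    """
    n = K.shape[0]
    W = np.array(K, dtype=float)   # working copy, overwritten by Schur complements
    L = np.zeros((n, n))
    signs = np.where(np.arange(n) < p, 1.0, -1.0)
    for j in range(n):
        s = signs[j]
        pivot = s * W[j, j]
        if pivot <= 0.0:
            raise ValueError(f"unexpected pivot sign at column {j}")
        L[j, j] = np.sqrt(pivot)
        L[j + 1:, j] = W[j + 1:, j] / (s * L[j, j])
        # Rank-one update of the trailing block; the sign flips after p columns.
        W[j + 1:, j + 1:] -= s * np.outer(L[j + 1:, j], L[j + 1:, j])
    return L, signs

rng = np.random.default_rng(3)
p, m = 4, 2                        # illustrative block sizes
G = rng.standard_normal((p, p))
A = G @ G.T + p * np.eye(p)        # positive definite (1,1) block
B = rng.standard_normal((m, p))    # full row rank with probability 1
K = np.block([[A, B.T], [B, np.zeros((m, m))]])

L, signs = signed_cholesky(K, p)
residual = np.max(np.abs(L @ np.diag(signs) @ L.T - K))
```

Apart from the sign handling, the loop is an ordinary right-looking Cholesky, so it touches only one triangle's worth of data.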

- All processors contribute to all of the elements of the (1,1) dense block
- A large amount of inter-process communication occurs.
- Possibly more costly than the factorization itself.
- Solution: use buffer to reduce the number of messages when doing a Reduce_scatter.
- approach also reduces the communication by half – only need to send lower triangle.

- Streamlined copying procedure - Lubin and Petra (2010)
- Loop over contiguous memory and copy elements into the send buffer
- Avoids divisions and modulus ops needed to compute the positions

- “Symmetric” reduce for
- Only lower triangle is reduced
- Fixed buffer size
- A variable number of columns reduced.

- Effectively halves the communication (both data & # of MPI calls).
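
The saving from reducing only the lower triangle can be illustrated without MPI. In this sketch the "processes" are just entries of a Python list and the reduction is a plain sum of packed buffers; the helper names `pack_lower` and `unpack_lower` are hypothetical, and the fixed buffer size and column-wise chunking mentioned above are not modeled.

```python
import numpy as np

def pack_lower(A):
    """Copy the lower triangle (with diagonal) of A into a flat send buffer."""
    return A[np.tril_indices(A.shape[0])].copy()

def unpack_lower(buf, n):
    """Rebuild the full symmetric matrix from its packed lower triangle."""
    A = np.zeros((n, n))
    A[np.tril_indices(n)] = buf
    return A + np.tril(A, -1).T

rng = np.random.default_rng(2)
n, num_procs = 5, 4
# Hypothetical per-process symmetric contributions to the dense (1,1) block.
local = []
for _ in range(num_procs):
    G = rng.standard_normal((n, n))
    local.append(G @ G.T)

# Stand-in for the reduction: sum packed buffers instead of full matrices.
reduced = unpack_lower(sum(pack_lower(A) for A in local), n)
full_sum = sum(local)

packed_len = n * (n + 1) // 2   # entries sent per process with packing
full_len = n * n                # entries sent per process without packing
```

The packed buffer holds n(n+1)/2 entries instead of n^2, which approaches half as n grows.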

- First-stage linear algebra: ScaLAPACK (LU), Elemental (LU), and
- Strong scaling of PIPS with and
- 90.1% from 64 to 1024 cores
- 75.4% from 64 to 2048 cores
- > 4,000 scenarios
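
The efficiency figures can be turned into implied speedups with a line of arithmetic. This assumes the usual strong-scaling definition relative to the 64-core run (efficiency = speedup divided by the increase in core count); the slides do not state the definition, so treat it as an assumption.

```python
def implied_speedup(base_cores, cores, efficiency):
    """Speedup relative to the base run implied by a strong-scaling efficiency."""
    return efficiency * cores / base_cores

s1024 = implied_speedup(64, 1024, 0.901)  # 16x more cores
s2048 = implied_speedup(64, 2048, 0.754)  # 32x more cores
print(round(s1024, 1), round(s2048, 1))   # prints "14.4 24.1"
```

Under that definition, the 2048-core run would be roughly 24 times faster than the 64-core run.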

SAA problem:

1st stage variables: 82,000

Total #: 189 million

Thermal units: 1,000

Wind farms: 1,200

- PIPS – parallel interior-point solver for stochastic SAA problems
- Largest SAA prob.
- 189 Mil vars = 82k 1st-stage vars + 4k scens * 47k 2nd-stage vars
- 2048 cores

- Specialized linear algebra layer
- Small-sized 1st-stage subproblems: DSC
- Medium-sized 1st-stage: PSC
- Large-sized 1st-stage: Distributed SC

- Current work: Scenario parallelization in a hybrid programming model MPI+SMP

Thank you for your attention!

Questions?