Chip-Multiprocessors & You

Chip-Multiprocessors & You John Dennis dennis@ucar.edu March 16, 2007

Intel “Tera Chip” • 80 core chip • 1 Teraflop • 3.16 Ghz / 0.95V / 62W • Process • 45 nm technology • High-K • 2D mesh network • Each processor has 5-port router • Connects to “3D-memory” Software Engineering Working Group Meeting

Outline • Chip-Multiprocessor • Parallel I/O library (PIO) • Full with Large Processor Counts • POP • CICE Software Engineering Working Group Meeting

Moore’s Law • Most things are twice as nice [18 months] • Transistor count • Processor speed • DRAM density • Historical Result: • Solve problem twice as large in same time • Solve same size problem in half the time --> Inactivity leads to progress! Software Engineering Working Group Meeting

The advent of Chip-multiprocessors Moore’s Law gone bad!

New implications of Moore’s Law • Every 18 months • # of cores per socket doubles • Memory density doubles • Clock cycle may increase slightly • 18 months from now • 8 cores per socket • Slight increase in clock cycle (~15%) • Same memory per core!! Software Engineering Working Group Meeting

New implications of Moore’s Law (con’t) • Inactivity leads to no progress! • Possible outcome • Same problem size / same parallelism • solve problem ~15% faster • Bigger problem size • scalable memory? • More processors enable ~2x reduction in time to solution • Non-scalable memory? • May limit number of processors that can be used • Waste 1/2 of cores on sockets to use memory? • All components of application must scale to benefit from Moore’s Law increases! Memory footprint problem will not solve itself! Software Engineering Working Group Meeting

Questions ? Software Engineering Working Group Meeting

Parallel I/O library (PIO) John Dennis (dennis@ucar.edu) Ray Loy (rloy@mcs.anl.gov) March 16, 2007

Introduction • All component models need parallel I/O • Serial I/O is bad! • Increased memory requirement • Typically negative impact on performance • Primary Developers: [J. Dennis, R. Loy] • Necessary for POP BGW runs Software Engineering Working Group Meeting

Design goals • Provide parallel I/O for all component models • Encapsulate complexity into library • Simple interface for component developers to implement Software Engineering Working Group Meeting

Design goals (con’t) • Extensible for future I/O technology • Backward compatible (node=0) • Support for multiple formats • {sequential,direct} binary • netcdf • Preserve format of input/output files • Supports 1D, 2D and 3D arrays • Currently XY • Extensible to XZ or YZ Software Engineering Working Group Meeting

Terms and Concepts • PnetCDF: [ANL] • High performance I/O • Different interface • Stable • netCDF4 + HDF5 [NCSA] • Same interface • Needs HDF5 library • Less stable • Lower performance • No support on Blue Gene Software Engineering Working Group Meeting

Terms and Concepts (con’t) • Processor stride: • Allows matching of subset of MPI IO nodes to system hardware Software Engineering Working Group Meeting

Terms and Concepts (con’t) • IO decomp vs. COMP decomp • IO decomp == COMP decomp • MPI-IO + message aggregation • IO decomp != COMP decomp • Need Rearranger : MCT • No component specific info in library • Pair with existing communication tech • 1D arrays in library; component must flatten 2D and 3D arrays Software Engineering Working Group Meeting

Component Model ‘issues’ • POP & CICE: • Missing blocks • Update of neighbors halo • Who writes missing blocks? • Asymmetry between read/write • ‘sub-block’ decompositions not rectangular • CLM • Decomposition not rectangular • Who writes missing data? Software Engineering Working Group Meeting

What works • Binary I/O [direct] • Test on POWER5, BGL • Rearrange w/MCT + MPI-IO • MPI-IO no rearrangement • netCDF • netCDF • Rearrange with MCT [New] • Reduced memory • PnetCDF: • Rearrange with MCT • No rearrangement • Test on POWER5, BGL Software Engineering Working Group Meeting

What works (con’t) • Prototype added to POP2 • Reads restart and forcing files correctly • Writes binary restart files correctly • Necessary for BGW runs • Prototype implementation in HOMME [J. Edwards] • Writes netCDF history files correctly • POPIO benchmark • 2D array [3600x2400] (70 Mbyte) • Test code for correctness and performance • Tested on 30K BGL processors in Oct 06 • Performance • POWER5: 2-3x serial I/O approach • BGL: mixed Software Engineering Working Group Meeting

Complexity / Remaining Issues • Mulitple ways to express decomposition • GDOF: global degree of freedom --> (MCT, MPI-IO) • Subarrays: start + count (pNetCDF) • Limited expressiveness • Will not support ‘sub-block’ in POP & CICE, CLM • Need common language for interface • Interface between component model and library Software Engineering Working Group Meeting

Conclusions • Working prototype • POP2 for binary I/O • HOMME for netCDF • PIO telecon: discuss progress every 2 weeks • Work in progress • Multiple efforts underway • accepting help • http://swiki.ucar.edu/ccsm/93 • In CCSM subversion repository Software Engineering Working Group Meeting

Fun with Large Processor Counts:POP, CICE John Dennis dennis@ucar.edu March 16, 2007

Motivation • Can Community Climate System Model (CCSM) be a Petascale Application? • Use 10-100K processors per simulation • Increasing common access to large systems • ORNL Cray XT3/4 : 20K [2-3 weeks] • ANL Blue Gene/P : 160K [Jan 2008] • TACC Sun : 55K [Jan 2008] • Petascale for the masses ? • lagtime in Top 500 List [4-5 years] • @ NCAR before 2015 Software Engineering Working Group Meeting

Outline • Chip-Multiprocessor • Parallel I/O library (PIO) • Fun with Large Processor Counts • POP • CICE Software Engineering Working Group Meeting

Status of POP • Access to 17K Cray XT4 processors • 12.5 years/day [Current Record] • 70% of time in solver • Won BGW cycle allocation Eddy Stirring: The Missing Ingredient in Nailing Down Ocean Tracer Transport [J. Dennis, F. Bryan, B. Fox-Kemper, M. Maltrud, J. McClean, S. Peacock] • 110 Rack Days/ 5.4M CPU hours • 20 year 0.1° POP simulation • Includes a suite of dye-like tracers • Simulate eddy diffusivity tensor Software Engineering Working Group Meeting

Status of POP (con’t) • Allocation will occur over ~7 days • Run in production on 30K processors • Needs Parallel I/O to write history file • Start runs in 4-6 weeks Software Engineering Working Group Meeting

Outline • Chip-Multiprocessor • Parallel I/O library (PIO) • Fun with Large Processor Counts • POP • CICE Software Engineering Working Group Meeting

Status of CICE • Tested CICE @ 1/10 • 10K Cray XT4 processors • 40K IBM Blue Gene processors [BGW days] • Use weighted space-filling curves (wSFC) • erfc • climatology Software Engineering Working Group Meeting

POP (gx1v3) + Space-filling curve Software Engineering Working Group Meeting

Space-filling curve partition for 8 processors Software Engineering Working Group Meeting

Weighted Space-filling curves • Estimate work for each grid block Worki = w0 + Pi*w1 where: w0: Fixed work for all blocks w1: Work if block contains Sea-ice Pi: Probability block contains Sea-ice • For our experiments: w0 = 2, w1 = 10 Software Engineering Working Group Meeting

Probability Function • Error Function: Pi = erfc(( -max(|lati|))/) where: lati max lat in block i  mean sea-ice extent  variance in sea-ice extent NH=70°, SH =60°, =5 ° Software Engineering Working Group Meeting

Large domains @ low latitudes Small domains @ high latitudes 1° CICE4 on 20 processors Software Engineering Working Group Meeting

0.1° CICE4 • Developed at LANL • Finite Difference • Models sea-ice • Shares grid and infrastructure with POP • Reuse techniques from POP work • Computational grid: [3600 x 2400 x 20] • Computational load-imbalance creates problems: • ~15% of grid has sea-ice • Use weighted Space-filling curves? • Evaluate using Benchmark: • 1 day/ Initial run / 30 minute timestep / no Forcing Software Engineering Working Group Meeting

CICE4 @ 0.1° Software Engineering Working Group Meeting

Load-imbalance: Hudson Bay south of 70° Timings for 1°,npes=160, NH=70° Software Engineering Working Group Meeting

Timings for 1°,npes=160, NH=55° Software Engineering Working Group Meeting

Better Probability Function • Climatological Function: Where: ij climatological maximum sea-ice extent [satellite observation] ni is the number of points within block i with non-zero ij Software Engineering Working Group Meeting

Timings for 1°,npes=160, climate-based Reduces dynamics sub-cycling time by 28%! Software Engineering Working Group Meeting

Thanks to: D. Bailey (NCAR) F. Bryan (NCAR) T. Craig (NCAR) J. Edwards (IBM) E. Hunke (LANL) B. Kadlec (CU) E. Jessup (CU) P. Jones (LANL) K. Lindsay (NCAR) W. Lipscomb (LANL) M. Taylor (SNL) H. Tufo (NCAR) M. Vertenstein (NCAR) S. Weese (NCAR) P. Worley (ORNL) Computer Time: Blue Gene/L time: NSF MRI Grant NCAR University of Colorado IBM (SUR) program BGW Consortium Days IBM research (Watson) Cray XT3/4 time: ORNL Sandia Acknowledgements/Questions? Software Engineering Working Group Meeting et

Nb Partitioning with Space-filling Curves • Map 2D -> 1D • Variety of sizes • Hilbert (Nb=2n) • Peano (Nb=3m) • Cinco (Nb=5p) • Hilbert-Peano (Nb=2n3m) • Hilbert-Peano-Cinco (Nb=2n3m5p) • Partitioning 1D array Software Engineering Working Group Meeting

Scalable data structures • Common problem among applications • WRF • Serial I/O [fixed] • Duplication of lateral boundary values • POP & CICE • Serial I/O • CLM • Serial I/O • Duplication of grid info Software Engineering Working Group Meeting

Scalable data structures (con’t) • CAM • Serial I/O • Lookup tables • CPL • Serial I/O • Duplication of grid info Memory footprint problem will not solve itself! Software Engineering Working Group Meeting

Remove Land blocks Software Engineering Working Group Meeting

Case Study:Memory use in CLM • CLM Configuration: • 1x1.25 grid • No RTM • MAXPATCH_PFT = 4 • No CN, DGVM • Measure stack and heap on 32-512 BG/L processors Software Engineering Working Group Meeting

Memory use of CLM on BGL Software Engineering Working Group Meeting

Motivation (con’t) • Multiple efforts underway • CAM scalability + high resolution coupled simulation [A. Mirin] • Sequential coupler [M. Vertenstein, R. Jacob] • Single executable coupler [J. Wolfe] • CCSM on Blue Gene [J. Wolfe, R. Loy, R. Jacob] • HOMME in CAM [J. Edwards] Software Engineering Working Group Meeting

Outline • Chip-Multiprocessor • Fun with Large Processor Counts • POP • CICE • CLM • Parallel I/O library (PIO) Software Engineering Working Group Meeting

Status of CLM • Work of T. Craig • Elimination of global memory • Reworking of decomposition algorithms • Addition of PIO • Short term goal: • Participation in BGW days June 07 • Investigation scalability at 1/10 Software Engineering Working Group Meeting

Status of CLM memory usage • May 1, 2006: • memory usage increases with processor count • Can run 1x1.25 on 32-512 processors of BGL • July 10, 2006: • Memory usage scales to asymptote • Can run 1x1.25 on 32- 2K processors of BGL • ~350 persistent global arrays [24 Gbytes/proc @ 1/10 degree] • January, 2007: • 150 persistent global arrays • 1/2 degee runs on 32-2K BGL processors • ~150 persistent global arrays [10.5 Gbytes/proc @ 1/10 degree] • February, 2007: • 18 persistent global arrays [1.2 Gbytes/proc @ 1/10 degree] • Target: • no persistent global arrays • 1/10 degree runs on single rack BGL Software Engineering Working Group Meeting

Proposed Petascale Experiment • Ensemble of 10 runs/200 years • Petascale Configuration: • CAM (30 km, L66) • POP @ 0.1° • 12.5 years / wall-clock day [17K Cray XT4 processors] • Sea-Ice @ 0.1° • 42 years / wall-clock day [10K Cray XT3 processors • Land model @ 0.1° • Sequential Design (105 days per run) • 32K BGL/ 10K XT3 processors • Concurrent Design (33 days per run) • 120K BGL / 42K XT3 processors Software Engineering Working Group Meeting

Chip-Multiprocessors & You