Download
chip multiprocessors you n.
Skip this Video
Loading SlideShow in 5 Seconds..
Chip-Multiprocessors & You PowerPoint Presentation
Download Presentation
Chip-Multiprocessors & You

Chip-Multiprocessors & You

169 Views Download Presentation
Download Presentation

Chip-Multiprocessors & You

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. Chip-Multiprocessors & You John Dennis dennis@ucar.edu March 16, 2007

  2. Intel “Tera Chip” • 80 core chip • 1 Teraflop • 3.16 Ghz / 0.95V / 62W • Process • 45 nm technology • High-K • 2D mesh network • Each processor has 5-port router • Connects to “3D-memory” Software Engineering Working Group Meeting

  3. Outline • Chip-Multiprocessor • Parallel I/O library (PIO) • Full with Large Processor Counts • POP • CICE Software Engineering Working Group Meeting

  4. Moore’s Law • Most things are twice as nice [18 months] • Transistor count • Processor speed • DRAM density • Historical Result: • Solve problem twice as large in same time • Solve same size problem in half the time --> Inactivity leads to progress! Software Engineering Working Group Meeting

  5. The advent of Chip-multiprocessors Moore’s Law gone bad!

  6. New implications of Moore’s Law • Every 18 months • # of cores per socket doubles • Memory density doubles • Clock cycle may increase slightly • 18 months from now • 8 cores per socket • Slight increase in clock cycle (~15%) • Same memory per core!! Software Engineering Working Group Meeting

  7. New implications of Moore’s Law (con’t) • Inactivity leads to no progress! • Possible outcome • Same problem size / same parallelism • solve problem ~15% faster • Bigger problem size • scalable memory? • More processors enable ~2x reduction in time to solution • Non-scalable memory? • May limit number of processors that can be used • Waste 1/2 of cores on sockets to use memory? • All components of application must scale to benefit from Moore’s Law increases! Memory footprint problem will not solve itself! Software Engineering Working Group Meeting

  8. Questions ? Software Engineering Working Group Meeting

  9. Parallel I/O library (PIO) John Dennis (dennis@ucar.edu) Ray Loy (rloy@mcs.anl.gov) March 16, 2007

  10. Introduction • All component models need parallel I/O • Serial I/O is bad! • Increased memory requirement • Typically negative impact on performance • Primary Developers: [J. Dennis, R. Loy] • Necessary for POP BGW runs Software Engineering Working Group Meeting

  11. Design goals • Provide parallel I/O for all component models • Encapsulate complexity into library • Simple interface for component developers to implement Software Engineering Working Group Meeting

  12. Design goals (con’t) • Extensible for future I/O technology • Backward compatible (node=0) • Support for multiple formats • {sequential,direct} binary • netcdf • Preserve format of input/output files • Supports 1D, 2D and 3D arrays • Currently XY • Extensible to XZ or YZ Software Engineering Working Group Meeting

  13. Terms and Concepts • PnetCDF: [ANL] • High performance I/O • Different interface • Stable • netCDF4 + HDF5 [NCSA] • Same interface • Needs HDF5 library • Less stable • Lower performance • No support on Blue Gene Software Engineering Working Group Meeting

  14. Terms and Concepts (con’t) • Processor stride: • Allows matching of subset of MPI IO nodes to system hardware Software Engineering Working Group Meeting

  15. Terms and Concepts (con’t) • IO decomp vs. COMP decomp • IO decomp == COMP decomp • MPI-IO + message aggregation • IO decomp != COMP decomp • Need Rearranger : MCT • No component specific info in library • Pair with existing communication tech • 1D arrays in library; component must flatten 2D and 3D arrays Software Engineering Working Group Meeting

  16. Component Model ‘issues’ • POP & CICE: • Missing blocks • Update of neighbors halo • Who writes missing blocks? • Asymmetry between read/write • ‘sub-block’ decompositions not rectangular • CLM • Decomposition not rectangular • Who writes missing data? Software Engineering Working Group Meeting

  17. What works • Binary I/O [direct] • Test on POWER5, BGL • Rearrange w/MCT + MPI-IO • MPI-IO no rearrangement • netCDF • netCDF • Rearrange with MCT [New] • Reduced memory • PnetCDF: • Rearrange with MCT • No rearrangement • Test on POWER5, BGL Software Engineering Working Group Meeting

  18. What works (con’t) • Prototype added to POP2 • Reads restart and forcing files correctly • Writes binary restart files correctly • Necessary for BGW runs • Prototype implementation in HOMME [J. Edwards] • Writes netCDF history files correctly • POPIO benchmark • 2D array [3600x2400] (70 Mbyte) • Test code for correctness and performance • Tested on 30K BGL processors in Oct 06 • Performance • POWER5: 2-3x serial I/O approach • BGL: mixed Software Engineering Working Group Meeting

  19. Complexity / Remaining Issues • Mulitple ways to express decomposition • GDOF: global degree of freedom --> (MCT, MPI-IO) • Subarrays: start + count (pNetCDF) • Limited expressiveness • Will not support ‘sub-block’ in POP & CICE, CLM • Need common language for interface • Interface between component model and library Software Engineering Working Group Meeting

  20. Conclusions • Working prototype • POP2 for binary I/O • HOMME for netCDF • PIO telecon: discuss progress every 2 weeks • Work in progress • Multiple efforts underway • accepting help • http://swiki.ucar.edu/ccsm/93 • In CCSM subversion repository Software Engineering Working Group Meeting

  21. Fun with Large Processor Counts:POP, CICE John Dennis dennis@ucar.edu March 16, 2007

  22. Motivation • Can Community Climate System Model (CCSM) be a Petascale Application? • Use 10-100K processors per simulation • Increasing common access to large systems • ORNL Cray XT3/4 : 20K [2-3 weeks] • ANL Blue Gene/P : 160K [Jan 2008] • TACC Sun : 55K [Jan 2008] • Petascale for the masses ? • lagtime in Top 500 List [4-5 years] • @ NCAR before 2015 Software Engineering Working Group Meeting

  23. Outline • Chip-Multiprocessor • Parallel I/O library (PIO) • Fun with Large Processor Counts • POP • CICE Software Engineering Working Group Meeting

  24. Status of POP • Access to 17K Cray XT4 processors • 12.5 years/day [Current Record] • 70% of time in solver • Won BGW cycle allocation Eddy Stirring: The Missing Ingredient in Nailing Down Ocean Tracer Transport [J. Dennis, F. Bryan, B. Fox-Kemper, M. Maltrud, J. McClean, S. Peacock] • 110 Rack Days/ 5.4M CPU hours • 20 year 0.1° POP simulation • Includes a suite of dye-like tracers • Simulate eddy diffusivity tensor Software Engineering Working Group Meeting

  25. Status of POP (con’t) • Allocation will occur over ~7 days • Run in production on 30K processors • Needs Parallel I/O to write history file • Start runs in 4-6 weeks Software Engineering Working Group Meeting

  26. Outline • Chip-Multiprocessor • Parallel I/O library (PIO) • Fun with Large Processor Counts • POP • CICE Software Engineering Working Group Meeting

  27. Status of CICE • Tested CICE @ 1/10 • 10K Cray XT4 processors • 40K IBM Blue Gene processors [BGW days] • Use weighted space-filling curves (wSFC) • erfc • climatology Software Engineering Working Group Meeting

  28. POP (gx1v3) + Space-filling curve Software Engineering Working Group Meeting

  29. Space-filling curve partition for 8 processors Software Engineering Working Group Meeting

  30. Weighted Space-filling curves • Estimate work for each grid block Worki = w0 + Pi*w1 where: w0: Fixed work for all blocks w1: Work if block contains Sea-ice Pi: Probability block contains Sea-ice • For our experiments: w0 = 2, w1 = 10 Software Engineering Working Group Meeting

  31. Probability Function • Error Function: Pi = erfc(( -max(|lati|))/) where: lati max lat in block i  mean sea-ice extent  variance in sea-ice extent NH=70°, SH =60°, =5 ° Software Engineering Working Group Meeting

  32. Large domains @ low latitudes Small domains @ high latitudes 1° CICE4 on 20 processors Software Engineering Working Group Meeting

  33. 0.1° CICE4 • Developed at LANL • Finite Difference • Models sea-ice • Shares grid and infrastructure with POP • Reuse techniques from POP work • Computational grid: [3600 x 2400 x 20] • Computational load-imbalance creates problems: • ~15% of grid has sea-ice • Use weighted Space-filling curves? • Evaluate using Benchmark: • 1 day/ Initial run / 30 minute timestep / no Forcing Software Engineering Working Group Meeting

  34. CICE4 @ 0.1° Software Engineering Working Group Meeting

  35. Load-imbalance: Hudson Bay south of 70° Timings for 1°,npes=160, NH=70° Software Engineering Working Group Meeting

  36. Timings for 1°,npes=160, NH=55° Software Engineering Working Group Meeting

  37. Better Probability Function • Climatological Function: Where: ij climatological maximum sea-ice extent [satellite observation] ni is the number of points within block i with non-zero ij Software Engineering Working Group Meeting

  38. Timings for 1°,npes=160, climate-based Reduces dynamics sub-cycling time by 28%! Software Engineering Working Group Meeting

  39. Thanks to: D. Bailey (NCAR) F. Bryan (NCAR) T. Craig (NCAR) J. Edwards (IBM) E. Hunke (LANL) B. Kadlec (CU) E. Jessup (CU) P. Jones (LANL) K. Lindsay (NCAR) W. Lipscomb (LANL) M. Taylor (SNL) H. Tufo (NCAR) M. Vertenstein (NCAR) S. Weese (NCAR) P. Worley (ORNL) Computer Time: Blue Gene/L time: NSF MRI Grant NCAR University of Colorado IBM (SUR) program BGW Consortium Days IBM research (Watson) Cray XT3/4 time: ORNL Sandia Acknowledgements/Questions? Software Engineering Working Group Meeting et

  40. Nb Partitioning with Space-filling Curves • Map 2D -> 1D • Variety of sizes • Hilbert (Nb=2n) • Peano (Nb=3m) • Cinco (Nb=5p) • Hilbert-Peano (Nb=2n3m) • Hilbert-Peano-Cinco (Nb=2n3m5p) • Partitioning 1D array Software Engineering Working Group Meeting

  41. Scalable data structures • Common problem among applications • WRF • Serial I/O [fixed] • Duplication of lateral boundary values • POP & CICE • Serial I/O • CLM • Serial I/O • Duplication of grid info Software Engineering Working Group Meeting

  42. Scalable data structures (con’t) • CAM • Serial I/O • Lookup tables • CPL • Serial I/O • Duplication of grid info Memory footprint problem will not solve itself! Software Engineering Working Group Meeting

  43. Remove Land blocks Software Engineering Working Group Meeting

  44. Case Study:Memory use in CLM • CLM Configuration: • 1x1.25 grid • No RTM • MAXPATCH_PFT = 4 • No CN, DGVM • Measure stack and heap on 32-512 BG/L processors Software Engineering Working Group Meeting

  45. Memory use of CLM on BGL Software Engineering Working Group Meeting

  46. Motivation (con’t) • Multiple efforts underway • CAM scalability + high resolution coupled simulation [A. Mirin] • Sequential coupler [M. Vertenstein, R. Jacob] • Single executable coupler [J. Wolfe] • CCSM on Blue Gene [J. Wolfe, R. Loy, R. Jacob] • HOMME in CAM [J. Edwards] Software Engineering Working Group Meeting

  47. Outline • Chip-Multiprocessor • Fun with Large Processor Counts • POP • CICE • CLM • Parallel I/O library (PIO) Software Engineering Working Group Meeting

  48. Status of CLM • Work of T. Craig • Elimination of global memory • Reworking of decomposition algorithms • Addition of PIO • Short term goal: • Participation in BGW days June 07 • Investigation scalability at 1/10 Software Engineering Working Group Meeting

  49. Status of CLM memory usage • May 1, 2006: • memory usage increases with processor count • Can run 1x1.25 on 32-512 processors of BGL • July 10, 2006: • Memory usage scales to asymptote • Can run 1x1.25 on 32- 2K processors of BGL • ~350 persistent global arrays [24 Gbytes/proc @ 1/10 degree] • January, 2007: • 150 persistent global arrays • 1/2 degee runs on 32-2K BGL processors • ~150 persistent global arrays [10.5 Gbytes/proc @ 1/10 degree] • February, 2007: • 18 persistent global arrays [1.2 Gbytes/proc @ 1/10 degree] • Target: • no persistent global arrays • 1/10 degree runs on single rack BGL Software Engineering Working Group Meeting

  50. Proposed Petascale Experiment • Ensemble of 10 runs/200 years • Petascale Configuration: • CAM (30 km, L66) • POP @ 0.1° • 12.5 years / wall-clock day [17K Cray XT4 processors] • Sea-Ice @ 0.1° • 42 years / wall-clock day [10K Cray XT3 processors • Land model @ 0.1° • Sequential Design (105 days per run) • 32K BGL/ 10K XT3 processors • Concurrent Design (33 days per run) • 120K BGL / 42K XT3 processors Software Engineering Working Group Meeting