
Project Athena: Technical Issues



  1. Project Athena: Technical Issues Larry Marx and the Project Athena Team

  2. Outline • Project Athena Resources • Models and Machine Usage • Experiments • Running Models • Initial and Boundary Data Preparation • Post Processing, Data Selection and Compression • Data Management

  3. Project Athena Resources • Athena: 4,512 nodes @ 4 cores, 2 GB memory; dedicated, Oct’09 – Mar’10; 79 million core-hours • Verne: 5 nodes @ 32 cores, 128 GB memory; dedicated, Oct’09 – Mar’10; post-processing • Kraken: 8,256 nodes @ 12 cores, 16 GB memory; shared, Oct’09 – Mar’10; 5 million core-hours • Storage: read-only scratch, 78 TB (Lustre); nakji, 360 TB (Lustre); homes, 8 TB (NFS); 800+ TB HPSS tape archive
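A quick back-of-the-envelope check (my arithmetic, not from the slide) shows that the 79 million core-hour figure is consistent with dedicating all of Athena's cores for the roughly six-month period:

```python
# Back-of-the-envelope check of the Athena core-hour figure.
# The ~182-day length of the Oct'09 - Mar'10 period is an assumption.
nodes, cores_per_node = 4512, 4
total_cores = nodes * cores_per_node       # 18,048 cores
days = 182                                 # approximate Oct-Mar span
core_hours = total_cores * 24 * days       # ~78.8 million core-hours
print(f"{total_cores} cores, ~{core_hours / 1e6:.1f} M core-hours")
```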

  4. Models and Machine Usage • NICAM was initially the primary focus of implementation • Limited flexibility in scaling, due to the icosahedral grid • Limited testing on multicore/cache processor architectures; production had been primarily on the vector-parallel (NEC SX) Earth Simulator • Step 1: Port a low-resolution version with simple physics to Athena • Step 2: Determine the highest resolution possible on Athena and the minimum and maximum number of cores to be used • Unique solution: G-level 10, i.e., 10,485,762 cells (7-km spacing), using exactly 2,560 cores (see the grid arithmetic sketch below) • Step 3: Initially, NICAM jobs failed frequently due to improper namelist settings. During a visit by U. Tokyo and JAMSTEC scientists to COLA, new settings were determined that generally ran with little trouble; however, the 2003 case could never be stabilized and was abandoned.
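The "unique solution" above follows from the grid arithmetic. A minimal sketch, assuming the standard icosahedral glevel/rlevel construction (the cell-count formula reproduces the number on the slide; the region count is my inference about why exactly 2,560 cores fit):

```python
# Icosahedral grid arithmetic; the glevel/rlevel formulas are assumed
# from the standard NICAM grid construction, not taken from the slide.
def n_cells(glevel):
    # 10 * 4**g quadrilateral cells plus the 2 polar points
    return 10 * 4 ** glevel + 2

def n_regions(rlevel):
    # the sphere is split into 10 * 4**r regions for parallel decomposition
    return 10 * 4 ** rlevel

print(n_cells(10))    # 10,485,762 cells at G-level 10 (~7-km spacing)
print(n_regions(4))   # 2,560 regions, i.e. one region per core on Athena
```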

  5. Models and Machine Usage (cont’d) • IFS’s flexible scalability sustains good performance for the higher-resolution configurations (T1279 and T2047) using 2,560 processor cores • We defined one “slot” as 2,560 cores and managed a mix of NICAM and IFS jobs at one job per slot, giving maximally efficient use of the resource • Having equal-size slots for both models permits either model to be queued and run in the event of a job failure • Selected jobs were given higher priority so that they continue to run ahead of others • Machine partition: 7 slots of 2,560 cores = 17,920 cores out of 18,048 • 99% machine utilization • The remaining 128 cores were used for pre- and post-processing and as spares (to postpone reboots) • Lower-resolution IFS experiments (T159 and T511) were run on Kraken • IFS runs were initially made by COLA; once the ECMWF SMS model management system was installed, runs could be made by COLA or ECMWF.
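The partition numbers above can be checked directly; a small sketch of the arithmetic (not of the actual job scheduler):

```python
# Slot arithmetic for the dedicated Athena partition.
cores_total = 4512 * 4             # 18,048 cores on Athena
slot_size = 2560                   # one NICAM or IFS job per slot
slots = cores_total // slot_size   # 7 slots
cores_in_slots = slots * slot_size # 17,920 cores
spare = cores_total - cores_in_slots
print(slots, cores_in_slots, spare)                               # 7 17920 128
print(f"utilization ~{100 * cores_in_slots / cores_total:.1f}%")  # ~99.3%
```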

  6. Project Athena Experiments

  7. Initial and Boundary Data Preparation • IFS: • Most input data prepared by ECMWF. Large files shipped by removable disk. • Time Slice experiment input data prepared by COLA. • NICAM: • Initial data from GDAS 1° files, available for all dates. • Boundary files other than SST included with NICAM. • SST from the ¼° NCDC OI daily analysis (version 2). Data starting 1 June 2002 include in situ, AVHRR (IR), and AMSR-E (microwave) observations; earlier data do not include AMSR-E. • All data interpolated to the icosahedral grid (see the interpolation sketch below).
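As a rough illustration of the last step, the sketch below bilinearly interpolates a regular lat-lon field (such as the ¼° OI SST) onto icosahedral cell centers. It assumes the target cell-center coordinates are available as plain arrays; it is not the actual NICAM preprocessing tool and ignores details such as longitude periodicity and land masking.

```python
# Sketch: bilinear interpolation of a regular lat-lon field onto
# icosahedral cell centers (illustrative only; data are synthetic).
import numpy as np
from scipy.interpolate import RegularGridInterpolator

def to_icosahedral(field, lats, lons, cell_lat, cell_lon):
    """field: 2-D array on a regular (lat, lon) grid, lats/lons ascending.
    cell_lat, cell_lon: 1-D arrays of cell-center coordinates (degrees)."""
    interp = RegularGridInterpolator((lats, lons), field,
                                     bounds_error=False, fill_value=None)
    points = np.column_stack([cell_lat, cell_lon % 360.0])
    return interp(points)

# synthetic quarter-degree SST and stand-in cell centers
lats = np.linspace(-89.875, 89.875, 720)
lons = np.linspace(0.125, 359.875, 1440)
sst = np.random.rand(720, 1440)
cell_lat = np.random.uniform(-90.0, 90.0, 2560)
cell_lon = np.random.uniform(0.0, 360.0, 2560)
sst_ico = to_icosahedral(sst, lats, lons, cell_lat, cell_lon)
```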

  8. Post Processing, Data Selection and Compression • All IFS (GRIB-1) data interpolated (coarsened) to the N80 reduced grid for common comparison among the resolutions and with the ERA-40 data. All IFS spectral data truncated to T159 coefficients and transformed to the N80 full grid. • Key fields at full model resolution were also processed, including transforming spectral coefficients to grids and compression to NetCDF-4 via GrADS (see the compression sketch below). • Processing was done on Kraken, because Athena lacks sufficient memory and computing power on each node. • All of the common-comparison data and selected high-resolution data were electronically transferred to COLA via bbcp (up to 40 MB/s sustained).
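For the compression step, the sketch below shows the same kind of lossless NetCDF-4 deflate compression using the Python netCDF4 library; the project itself did this via GrADS, so this is only an illustration, and the variable and file names are invented.

```python
# Sketch of NetCDF-4 deflate compression with the Python netCDF4 library.
# The project used GrADS for this step; names and data here are invented.
import numpy as np
from netCDF4 import Dataset

data = np.random.rand(10, 160, 320).astype("f4")   # stand-in N80 field

with Dataset("field_n80.nc", "w", format="NETCDF4") as nc:
    nc.createDimension("time", None)
    nc.createDimension("lat", 160)    # N80 full grid: 160 latitudes
    nc.createDimension("lon", 320)    # and 320 longitudes
    var = nc.createVariable("t2m", "f4", ("time", "lat", "lon"),
                            zlib=True, complevel=4, shuffle=True)
    var[:] = data
```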

  9. Post Processing, Data Selection and Compression (cont’d) • Nearly all (91) NICAM diagnostic variables were saved. Each variable was written as 2,560 separate files, one per model domain, resulting in over 230,000 files. The number of files quickly saturated the Lustre file system. • The original program for interpolating data to a regular lat-lon grid had to be revised to use less I/O and to multithread, thereby eliminating a processing backlog. • Selected 3-d fields were interpolated from z-coordinate to p-coordinate levels (see the sketch below). • Selected 2- and 3-d fields were compressed (NetCDF-4) and electronically transferred to COLA. • All selected fields were coarsened to the N80 full grid.
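The z-to-p step amounts to a per-column vertical interpolation. A minimal sketch, assuming the pressure on the model z levels is known; the actual NICAM post-processing program is more involved (I/O-light and multithreaded), and the numbers below are invented.

```python
# Sketch of z-coordinate -> p-coordinate interpolation for one column.
import numpy as np

def z_to_p_column(field_z, p_z, p_targets):
    """field_z: values on model z levels, ordered bottom to top.
    p_z: pressure (Pa) on the same levels, decreasing with height.
    p_targets: desired pressure levels (Pa)."""
    # np.interp requires increasing x, so work in log(p) with levels reversed
    x = np.log(p_z[::-1])
    y = field_z[::-1]
    return np.interp(np.log(p_targets), x, y)

# invented example column
p_z = np.array([100000., 85000., 70000., 50000., 30000., 10000.])
temp_z = np.array([288., 280., 272., 256., 230., 210.])
p_levels = np.array([92500., 60000., 20000.])
print(z_to_p_column(temp_z, p_z, p_levels))
```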

  10. Data Management: NICS • All data archived to HPSS, approaching 1 PB in total • Workflow required complex data movement: • All model runs at high resolution were done on Athena • Model output was stored on scratch or nakji, and all of it was copied to tape on HPSS • IFS data interpolation/truncation was done directly from retrieved HPSS files • NICAM data were processed using Verne and nakji (more capable CPUs and larger memory)

  11. Data Management: COLA • Project Athena was allocated 50 TB (26%) on COLA disk servers. • Considerable discussion and judgment were required to down-select variables from IFS and NICAM, based on factors including scientific use and data compressibility. • A large directory structure was needed to organize the data, particularly for IFS, with its many resolutions, sub-resolutions, data forms and ensemble members (a hypothetical layout is sketched below).
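As a purely hypothetical illustration of such a hierarchy (every path component below is invented; only the resolution names come from the earlier slides):

```python
# Hypothetical directory layout for organizing IFS output; all names are
# invented for illustration and do not reflect the actual COLA structure.
from pathlib import Path

root = Path("athena")                        # hypothetical archive root
for resolution in ["T159", "T511", "T1279", "T2047"]:
    for form in ["n80_common", "highres_selected"]:
        for member in ["e01", "e02"]:        # hypothetical ensemble members
            (root / "ifs" / resolution / form / member).mkdir(
                parents=True, exist_ok=True)
```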

  12. Data Management: Future • New machines at COLA and NICS will permit further analysis not currently possible due to lack of memory and compute power. • Some or all of the data will eventually be made publicly available once their long-term disposition is determined. • TeraGrid Science Portal?? • Earth System Grid??

  13. Summary • A large, international team of climate and computer scientists, using dedicated and shared resources, presents many challenges for production computing, data analysis and data management • The sheer volume and complexity of the data “break” everything: • Disk capacity • File name space • Bandwidth connecting systems within NICS • HPSS tape capacity • Bandwidth to remote sites for collaborating groups • Software for analysis and display of results (GrADS modifications) • COLA overcame these difficulties as they were encountered in 24×7 production mode, preventing the dedicated computer from sitting idle.
