Understanding and Optimizing Data Input/Output of Large-Scale Scientific Applications
Jeffrey S. Vetter (Leader), Weikuan Yu, Philip C. Roth
Future Technologies Group, Computer Science and Mathematics Division
I/O for large-scale scientific computing
• Reading input and restart files
• Writing checkpoint files
• Writing movie and history files
• Gaps of understanding across domains; efficiency is low
[Figure: compute nodes and I/O nodes connected to a metadata server and object storage servers]
[Images: SciDAC climate studies visualization at ORNL; SciDAC astrophysics simulation visualization at ORNL]
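To make the checkpoint-writing workload concrete, here is a minimal sketch of a collective checkpoint write with MPI-I/O. The file name, per-process array size, and contiguous layout are hypothetical stand-ins for real application state, not a specific application's scheme.

```c
/* Minimal sketch of a collective checkpoint write with MPI-I/O.
 * File name and sizes are hypothetical. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const MPI_Offset count = 1 << 20;          /* doubles per process (hypothetical) */
    double *state = malloc(count * sizeof(double));
    for (MPI_Offset i = 0; i < count; i++)
        state[i] = (double)rank;               /* stand-in for simulation state */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes its contiguous block at a rank-based offset. */
    MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
    MPI_File_write_at_all(fh, offset, state, count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(state);
    MPI_Finalize();
    return 0;
}
```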
The I/O gap
• Widening gap between application I/O demands and system I/O capability.
• Gap may grow too large for existing techniques (e.g., checkpointing) to be viable, because system reliability decreases as systems get larger.
[Figure: bandwidth and expected time between system failures vs. system size; application I/O demand outpaces I/O system capability, and checkpoints eventually take longer than the system MTTF]
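A back-of-the-envelope model shows why the curves in the figure eventually cross; all numbers below are hypothetical and only illustrate the trend, not measurements from any real system.

```c
/* Illustrative model behind the I/O gap figure (all numbers hypothetical):
 * checkpoint time grows with total system memory, while system MTTF shrinks
 * as nodes are added, so the two curves eventually cross. */
#include <stdio.h>

int main(void)
{
    const double mem_per_node_gb  = 8.0;      /* hypothetical */
    const double io_bandwidth_gbs = 50.0;     /* aggregate I/O bandwidth, hypothetical */
    const double node_mttf_hours  = 50000.0;  /* per-node MTTF, hypothetical */

    for (int nodes = 1024; nodes <= 1048576; nodes *= 4) {
        double ckpt_hours = nodes * mem_per_node_gb / io_bandwidth_gbs / 3600.0;
        double sys_mttf_hours = node_mttf_hours / nodes;  /* independent-failure approximation */
        printf("%8d nodes: checkpoint %.2f h, system MTTF %.2f h%s\n",
               nodes, ckpt_hours, sys_mttf_hours,
               ckpt_hours > sys_mttf_hours ? "  <-- checkpointing no longer viable" : "");
    }
    return 0;
}
```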
Insight into I/O behavior
• Performance data collection infrastructure for Cray XT
• Gathers detailed I/O request data without changes to application source code
• Useful for
  • Characterizing application I/O
  • Driving storage system simulations
  • Deciding how and where to optimize I/O
[Figure: instrumented Cray XT MPI software stack — application, instrumented MPI function wrapper library, MPI library (including MPI-I/O), instrumented POSIX I/O wrapper library, POSIX I/O functions, Lustre file system, Portals data transfer layer]
[Image: Jaguar Cray XT system at ORNL]
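The wrapper-library approach can be illustrated with the standard MPI profiling interface (PMPI): the wrapper intercepts a call, records the request, and forwards it to the real implementation. This is a minimal sketch of the idea for a single MPI-I/O call, not the actual Cray XT infrastructure; the stderr trace record is purely illustrative.

```c
/* Sketch of an instrumented wrapper that captures I/O request data without
 * changing application source, using the standard PMPI profiling interface. */
#include <mpi.h>
#include <stdio.h>

int MPI_File_write_at_all(MPI_File fh, MPI_Offset offset, const void *buf,
                          int count, MPI_Datatype datatype, MPI_Status *status)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_File_write_at_all(fh, offset, buf, count, datatype, status);
    double t1 = MPI_Wtime();

    int type_size;
    MPI_Type_size(datatype, &type_size);

    /* A real tool would log to a low-overhead trace buffer instead of stderr. */
    fprintf(stderr, "write_at_all: offset=%lld bytes=%lld time=%.6fs\n",
            (long long)offset, (long long)count * type_size, t1 - t0);
    return rc;
}
```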
Optimization through parallel I/O libraries
• Challenges approachable via libraries:
  • Application hints and data manipulation
  • Processor/memory architecture
  • Parallel I/O protocol processing overhead
  • File-system–specific techniques
  • Network topology and status
• Advantages from parallel I/O libraries:
  • Interfacing application, runtime, and operating system
  • Ease of solution deployment
[Figure: end-to-end I/O path — scientific application, HDF5/PnetCDF, MPI-I/O, and file system clients (Lustre/PVFS2) on the compute node (processor cores, UMA/NUMA memory, page/buffer cache, I/O schedulers); I/O node with network interface; file system/storage servers with disk controller engine, RAID volume management, host bus interface, storage enclosure, and disk tiers]
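One concrete way a library layer conveys application hints down the stack is through MPI_Info hints at file-open time. The sketch below uses common ROMIO/Lustre hint names; the values are illustrative, and which hints take effect depends on the MPI-I/O implementation and file system.

```c
/* Sketch of passing application hints to the MPI-I/O layer via MPI_Info.
 * Hint values are illustrative only. */
#include <mpi.h>

void open_with_hints(MPI_Comm comm, const char *path, MPI_File *fh)
{
    MPI_Info info;
    MPI_Info_create(&info);

    MPI_Info_set(info, "striping_factor", "16");     /* number of Lustre OSTs */
    MPI_Info_set(info, "striping_unit", "1048576");  /* 1 MB stripe size */
    MPI_Info_set(info, "romio_cb_write", "enable");  /* collective buffering on writes */
    MPI_Info_set(info, "cb_nodes", "16");            /* number of I/O aggregators */

    MPI_File_open(comm, (char *)path,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, fh);
    MPI_Info_free(&info);
}
```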
Opportunistic and adaptive MPI-I/O for Lustre (OPAL)
• An MPI-I/O package optimized for Lustre
• A unified code base for Cray XT and Linux
• Open source, better performance
  • Improved data-sieving implementation
  • Enabled arbitrary striping specification on the Cray XT
  • Lustre stripe-aligned file domain partitioning
• http://ft.ornl.gov/projects/io/#download
• Provides better bandwidth and scalability than the default Cray XT MPI-I/O library in many cases
[Figure: OPAL software stack — application, (P)netCDF/HDF5, MPI-I/O, AD_lustre/AD_Sysio, POSIX syscalls, sysio/liblustre, Portals]
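The sketch below illustrates the arithmetic behind stripe-aligned file domain partitioning: the collectively accessed byte range is divided among I/O aggregators so that domain boundaries fall on Lustre stripe boundaries. It is an illustration of the idea under simple assumptions, not OPAL's actual ADIO code.

```c
/* Sketch of stripe-aligned file domain partitioning (illustrative only). */
#include <stdio.h>

typedef long long offset_t;

/* Compute aggregator a's file domain [*lo, *hi) within the range [start, end). */
void stripe_aligned_domain(offset_t start, offset_t end, offset_t stripe_size,
                           int naggs, int a, offset_t *lo, offset_t *hi)
{
    offset_t first_stripe = start / stripe_size;
    offset_t last_stripe  = (end + stripe_size - 1) / stripe_size;  /* exclusive */
    offset_t nstripes     = last_stripe - first_stripe;

    /* Distribute whole stripes across aggregators as evenly as possible. */
    offset_t per_agg  = nstripes / naggs;
    offset_t extra    = nstripes % naggs;
    offset_t my_first = first_stripe + a * per_agg + (a < extra ? a : extra);
    offset_t my_count = per_agg + (a < extra ? 1 : 0);

    *lo = my_first * stripe_size;
    *hi = (my_first + my_count) * stripe_size;
    if (*lo < start) *lo = start;   /* clip first and last domains to the range */
    if (*hi > end)   *hi = end;
}

int main(void)
{
    offset_t lo, hi;
    for (int a = 0; a < 4; a++) {
        stripe_aligned_domain(100, 5000000, 1048576, 4, a, &lo, &hi);
        printf("aggregator %d: [%lld, %lld)\n", a, lo, hi);
    }
    return 0;
}
```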
Example results: OPAL
• Bandwidth for the FLASH I/O benchmark (Checkpoint, Plot-Center, and Plot-Corner files).
• OPAL provided better bandwidth and scalability than the default Cray XT MPI-I/O library.
[Figure: bandwidth (MB/s) of Cray vs. OPAL for Checkpoint, Plot-Center, and Plot-Corner at 256, 512, and 1024 processes]
Partitioned collective I/O (ParColl)
• Collective I/O is good for aggregating small I/O requests.
• But the global synchronization inside collective I/O is a barrier to scalability.
• ParColl partitions the global processes, the I/O aggregators, and the shared global file appropriately for scalable I/O aggregation.
[Figure: ParColl extends two-phase collective I/O in MPI-I/O over Lustre with file area partitioning, I/O aggregator distribution, intermediate file views, and independent I/O within groups]
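The sketch below illustrates the partitioning idea with MPI_Comm_split: processes are split into groups, and each group performs collective I/O independently on its own disjoint region of the shared file, so synchronization stays within the group. The group count and offset scheme are illustrative assumptions, not ParColl's actual heuristics.

```c
/* Sketch of group-partitioned collective I/O (illustrative, not ParColl itself). */
#include <mpi.h>

void partitioned_collective_write(const void *buf, int count, MPI_Datatype type,
                                  const char *path, int ngroups)
{
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Partition processes into ngroups contiguous groups. */
    int group = rank / ((nprocs + ngroups - 1) / ngroups);
    MPI_Comm group_comm;
    MPI_Comm_split(MPI_COMM_WORLD, group, rank, &group_comm);

    int type_size;
    MPI_Type_size(type, &type_size);

    /* Each group collectively writes its own disjoint region of the shared file;
     * synchronization now happens only among the group's processes. */
    MPI_File fh;
    MPI_File_open(group_comm, (char *)path,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)rank * count * type_size;
    MPI_File_write_at_all(fh, offset, (void *)buf, count, type, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Comm_free(&group_comm);
}
```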
Example results: ParColl
• Evaluated benchmarks: MPI-Tile-I/O and FLASH I/O.
• ParColl improves collective I/O for both benchmarks.
[Figure: MPI-Tile-I/O with ParColl — bandwidth (MB/s) for Cray vs. ParColl at 16–1024 processors; FLASH I/O with ParColl — bandwidth (MB/s) and percent improvement over Cray vs. number of groups (1–64)]
Contacts
• Jeffrey S. Vetter, Leader, Future Technologies Group, Computer Science and Mathematics Division, (865) 356-1649, vetter@ornl.gov
• Weikuan Yu, (865) 574-7990, wyu@ornl.gov
• Philip C. Roth, (865) 241-1543, rothpc@ornl.gov
• For more information, including code downloads, see http://ft.ornl.gov/projects/io/