PIDX

PIDX • PIDX - a parallel API to capture the data models used by HPC application and write it out in an IDX format. • PIDX enables simulations to write out IDX data directly in parallel • Real-time interactive visualization and analyze of data. • monitor the health of the simulations which can assist in steering the simulation as well • Usage S3D combustion application to demonstrate the efficacy of PIDX for a real-world scientific simulation.

PIDX I/O phases • Describe data model • Create an IDX block bitmap • The bitmap indicates which IDX blocks must be populated in order to store an arbitrary N-dimensional dataset. • Create underlying file and directory hierarchy • The IDX file and directory hierarchy is created by the rank 0 process in the application before any I/O is performed. • Perform HZ encoding • The HZ encoding step is performed independently on each process. • In order to minimize memory access complexity, all samples are copied into intermediate buffers in a linear Z ordering. • Aggregate data • Write data to storage

HPC data models with PIDX /* define variables across all processes * / var1 = PIDX_variable_ global_define (“var1”, samples, datatype ) ; /* add local variables to the dataset */ PIDX_variable_local_add (dataset , var1 , global_index, count ) ; /* describe memory layout */ PIDX_variable_local_layout (dataset,var1,memory_address, datatype) ; /* write all data */ PIDX_write ( dataset ) ;

Aggregation Phases Using RMA to transmit each contiguous data segment to an intermediate aggregator. Aggregator Process Performs one single large I/O operation. SeparateIO By each process leads to a large number of small accesses to each file. Bundle noncontiguous memory into a single MPI indexed data types. Reduces the number of small network messages

Throughput comparison of all the versions of the API (Aggregation Strategy) EXPERIMENT SETUP Each process writes out a (64)3 sub-volume with 4 variables. PERFORMANCE RESULTS At 256 processes, we achieve up to a 18-fold speed up, and at 2048 processes, we achieve up to 30-fold speed up over a scheme with no aggregation. The aggregation strategy that utilized MPI datatypes yielded a 20% improvement over the aggregation strategy that issued a separate MPI_Put() for each contiguous region.

Performance Evaluation With S3D EXPERIMENT SETUP In each run, S3D I/O wrote out 10 time-steps wherein each process contributed 32MiB data set PERFORMANCE RESULTS At 8192 processes, PIDX achieves a maximum I/O throughput of 18 GiB/s ( 90% of the IOR throughput). IOR and Fortran I/O achieve similar throughput for all the process counts. Fortran I/O in S3D behaves similarly to IOR test case with each process populating a unique output file.

Impact of PIDX file parameters on Lustre EXPERIMENT SETUP Procs : 256 to 4K. Proc Size : 643 (doubles) 512 MiB (256 procs) and 4 GiB (4K procs). Elements per block 215 to 218 Blocks per file 128, 256 and 512 PERFORMANCE RESULTS As the number of files increases a noticeable speed up :- The number of aggregators is increased. The Lustre file system performs better as data is distributed across a larger number of files. Design is flexible enough to be tuned to generate small number of large shared files or a large number of files depending on which is optimal for the target system.

Time taken by the various PIDX I/Ocomponents

PIDX

PIDX

Presentation Transcript

INTRODUCTION TO PIDX

Chevron eCatalog Initiative Presented to PIDX by Randy Nance October 10, 2000

PowerTrack Presentation To PIDX Symposium October 9, 2000