Yijian Wang David Kaeli Electrical and Computer Engineering Department Northeastern University

Profile-Guided I/O Partitioning Yijian Wang David Kaeli Electrical and Computer Engineering Department Northeastern University {yiwang, kaeli}@ece.neu.edu

Outline • Introduction • Related work • Profile-guided I/O partitioning • Benchmarks • Experimental results • Conclusions and future work

Introduction • The I/O bottleneck • The growing gap between the speed of processors and I/O devices • Some applications access disks very frequently • I/O intensive applications • Multimedia applications • Database applications • Parallel scientific applications

Related work • Fast disks • FC-connected SCSI disks • Smart caching I/O controller (EMC, IO Integrity) • Parallel I/O • Parallel disks (e.g., RAID) • Parallel file systems (NFS, PIOF, HPS, etc.) • Runtime parallel systems (MPI-IO, ROMIO, ADIO) • Compiler technology (loop tiling, compiler-directed collective I/O) • To achieve high performance, I/O should be parallelized at multiple levels (application, file system, disks)

I/O Partitioning • Our target applications are parallel scientific codes running on Beowulf clusters • I/O is parallelized at both the application level (using MPI and MPI-IO) and the disk level (using file partitioning) • Ideally, every process accesses only files on its local disk (though this is typically not possible due to data sharing) • How do we recognize the access patterns? • dynamically (profiling) • statically (compiler)

Profile generation • Run the application • Capture I/O traces • Apply our partitioning algorithm • Rerun the tuned application

I/O traces and partitioning • For every process, for every contiguous file access, we capture the following I/O profile information: • Process ID • File ID • Address • Chunk size • I/O operation (read/write) • Timestamp • Generate a partition for every process • Partitioning is NP-complete
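The six profile fields listed above map naturally onto a small record type. The following is a minimal sketch, not the authors' tracing tool: the comma-separated trace line layout, the names `IORecord`, `parse_record`, and `group_by_process`, and the units (byte offsets, seconds) are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class IORecord:
    """One contiguous file access, holding the six fields from the slide."""
    process_id: int
    file_id: int
    address: int      # assumed: byte offset into the file
    chunk_size: int   # assumed: size in bytes
    operation: str    # "read" or "write"
    timestamp: float  # assumed: seconds since the start of the run


def parse_record(line: str) -> IORecord:
    """Parse one line of a hypothetical comma-separated trace file."""
    pid, fid, addr, size, op, ts = (field.strip() for field in line.split(","))
    if op not in ("read", "write"):
        raise ValueError(f"unknown I/O operation: {op!r}")
    return IORecord(int(pid), int(fid), int(addr), int(size), op, float(ts))


def group_by_process(records):
    """Return {process_id: [records sorted by timestamp]}."""
    per_process = defaultdict(list)
    for rec in records:
        per_process[rec.process_id].append(rec)
    for recs in per_process.values():
        recs.sort(key=lambda r: r.timestamp)
    return dict(per_process)
```

Grouping by process and sorting by timestamp gives each process's access sequence, which is the form a per-process partitioning step would most likely consume.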

Our Greedy Algorithm
For each MPI-IO process
    create a file partition;
For each contiguous data chunk
    identify the process that most frequently accesses this chunk;
    assign the chunk to the associated partition;
For each partition
    reorder data in the partition based on first access to each chunk;
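The three greedy steps can be sketched in Python as follows. This is an illustrative reading of the slide, not the authors' implementation: the function name `greedy_partition`, the `(process_id, chunk_id, timestamp)` trace tuples, the tie-break by lowest process ID, and the choice to order chunks by the owning process's first access are all assumptions.

```python
from collections import Counter, defaultdict


def greedy_partition(trace):
    """Assign each chunk to the process that accesses it most often.

    trace: iterable of (process_id, chunk_id, timestamp) tuples.
    Returns {process_id: [chunk_id, ...]}, each list ordered by the
    owning process's first access to the chunk (an assumption).
    """
    access_counts = defaultdict(Counter)  # chunk -> {process: access count}
    first_access = {}                     # (process, chunk) -> earliest timestamp
    processes = set()
    for pid, chunk, ts in trace:
        processes.add(pid)
        access_counts[chunk][pid] += 1
        key = (pid, chunk)
        if key not in first_access or ts < first_access[key]:
            first_access[key] = ts

    # Step 1: create one (initially empty) partition per process.
    partitions = {pid: [] for pid in processes}

    # Step 2: give each chunk to its most frequent accessor.
    # Ties go to the lowest process ID (assumed tie-break rule).
    for chunk, counts in access_counts.items():
        owner = min(counts, key=lambda p: (-counts[p], p))
        partitions[owner].append(chunk)

    # Step 3: reorder each partition by first access time.
    for pid, chunks in partitions.items():
        chunks.sort(key=lambda c: first_access[(pid, c)])
    return partitions
```

For example, with the trace `[(0, "A", 1.0), (1, "A", 2.0), (0, "A", 3.0), (1, "B", 0.5)]`, chunk A goes to process 0 (two accesses against one) and chunk B goes to process 1.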

Benchmarks • NAS Parallel Benchmarks (NPB 2.4)/BT • Computational fluid dynamics • Generates a file (~1.6 GB) dynamically and then reads it • Writes/reads sequentially in chunks of 2040 bytes • SPEChpc96/seismic • Seismic processing • Generates a file (~1.5 GB) dynamically and then reads it back • Writes sequential chunks of 96 KB and reads sequential chunks of 2 KB • mpi-tile-io • Parallel Benchmarking Consortium • Tile access to a two-dimensional matrix (~1 GB) with overlap • Writes/reads sequential chunks of 32 KB, with 2 KB of overlap • All applications use MPI and MPI-IO for computation, communication, and I/O

Conclusions and future work • We obtain scalable speedup due to: • creating parallel I/O channels • reducing disk seek time • reducing communication overhead • I/O access patterns are generally independent of data values, for the applications studied • Investigating static (compile time) approaches to I/O partitioning

Northeastern University Computer Architecture Research Group http://www.ece.neu.edu/groups/nucar This project is supported by the NSF-funded Center for Subsurface Sensing and Imaging Systems (CenSSIS)
