
Design and Evaluation of Non-Blocking Collective I/O Operations


Presentation Transcript


  1. Design and Evaluation of Non-Blocking Collective I/O Operations
  Vishwanath Venkatesan and Edgar Gabriel
  Parallel Software Technologies Laboratory, Department of Computer Science, University of Houston
  <venkates, gabriel>@cs.uh.edu

  2. Outline
  • I/O Challenge in HPC
  • MPI File I/O
  • Non-blocking Collective Operations
  • Non-blocking Collective I/O Operations
  • Experimental Results
  • Conclusion

  3. I/O Challenge in HPC
  • A 2005 paper from LLNL [1] states that applications on leadership-class machines require 1 GB/s of I/O bandwidth per teraflop of compute capability
  • Jaguar at ORNL (fastest machine in 2008): more than 250 teraflops of peak compute performance, with a peak I/O performance of 72 GB/s [3]
  • K computer (fastest machine in 2011): nearly 10 petaflops of peak compute performance, with a realized I/O bandwidth of 96 GB/s [2]

  [1] R. Hedges, B. Loewe, T. McLarty, and C. Morrone. Parallel File System Testing for the Lunatic Fringe: The Care and Feeding of Restless I/O Power Users. In Proceedings of the 22nd IEEE / 13th NASA Goddard Conference on Mass Storage Systems and Technologies, 2005.
  [2] S. Sumimoto. An Overview of Fujitsu's Lustre-Based File System. Technical report, Fujitsu, 2011.
  [3] M. Fahey, J. Larkin, and J. Adams. I/O Performance on a Massively Parallel Cray XT3/XT4. In Parallel and Distributed Processing.

  4. MPI File I/O
  • MPI has been the de-facto standard for parallel programming over the last decade
  • MPI I/O:
    • File view: the portion of a file visible to a process
    • Individual and collective I/O operations
  • Example illustrating the advantage of collective I/O:
    • 4 processes accessing a 2D matrix stored in row-major format
    • MPI I/O can detect this access pattern and issue one large I/O request, followed by a distribution step for the data among the processes (see the sketch below)
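
  A minimal sketch of that collective access pattern, assuming an N x N matrix of doubles split into 2 x 2 blocks among 4 processes; the matrix size and file name are illustrative, not taken from the talk:

  #include <mpi.h>
  #include <stdlib.h>

  #define N 1024                        /* global matrix dimension (assumed) */

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* 2 x 2 process grid: each rank owns an (N/2) x (N/2) block,
         which is non-contiguous in the row-major file */
      int sizes[2]    = { N, N };
      int subsizes[2] = { N / 2, N / 2 };
      int starts[2]   = { (rank / 2) * (N / 2), (rank % 2) * (N / 2) };

      MPI_Datatype filetype;
      MPI_Type_create_subarray(2, sizes, subsizes, starts,
                               MPI_ORDER_C, MPI_DOUBLE, &filetype);
      MPI_Type_commit(&filetype);

      double *block = malloc((N / 2) * (N / 2) * sizeof(double));
      for (int i = 0; i < (N / 2) * (N / 2); i++) block[i] = rank;

      MPI_File fh;
      MPI_File_open(MPI_COMM_WORLD, "matrix.dat",
                    MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
      /* the file view makes only this rank's block visible to it */
      MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

      /* collective write: the library can merge the four interleaved
         requests into a few large contiguous accesses plus a data shuffle */
      MPI_File_write_all(fh, block, (N / 2) * (N / 2), MPI_DOUBLE,
                         MPI_STATUS_IGNORE);

      MPI_File_close(&fh);
      MPI_Type_free(&filetype);
      free(block);
      MPI_Finalize();
      return 0;
  }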

  5. Non-blocking Collective Operations
  • Non-blocking point-to-point operations:
    • Asynchronous data transfer operations
    • Hide communication latency by overlapping it with computation
    • Demonstrated benefits for a number of applications [1]
  • Non-blocking collective communication operations have been implemented in LibNBC [2]:
    • Schedule-based design: a process-local schedule of point-to-point operations is created
    • Schedule execution is represented as a state machine (with dependencies)
    • State and schedule are attached to every request
  • Non-blocking collective communication operations have been voted into the upcoming MPI-3 specification [2] (see the sketch below)
  • Non-blocking collective I/O operations have not (yet) been added to the document

  [1] D. Buettner, J. Kunkel, and T. Ludwig. Using Non-blocking I/O Operations in High Performance Computing to Reduce Execution Times. In Proceedings of the 16th European PVM/MPI Users' Group Meeting, 2009.
  [2] T. Hoefler, A. Lumsdaine, and W. Rehm. Implementation and Performance Analysis of Non-Blocking Collective Operations for MPI. In Supercomputing 2007.
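
  A minimal sketch of the overlap pattern with a non-blocking collective communication operation, written against the MPI-3 interface; compute_something() is a placeholder for independent application work:

  #include <mpi.h>

  /* placeholder for work that does not depend on the reduction result */
  static void compute_something(void) { /* ... */ }

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);

      double local = 1.0, global = 0.0;
      MPI_Request req;

      /* start the collective; the library progresses it in the background
         or whenever it is re-entered */
      MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                     MPI_COMM_WORLD, &req);

      compute_something();                 /* overlapped computation */

      MPI_Wait(&req, MPI_STATUS_IGNORE);   /* 'global' is valid only after this */

      MPI_Finalize();
      return 0;
  }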

  6. Non-blocking Collective I/O Operations
  MPI_File_iwrite_all(MPI_File file, void *buf, int cnt, MPI_Datatype dt, MPI_Request *request)
  • Different from non-blocking collective communication operations:
    • Every process is allowed to provide a different amount of data per collective read/write operation
    • No process has a 'global' view of how much data is read/written
  • Create a schedule for a non-blocking Allgather(v):
    • Determine the overall amount of data written across all processes
    • Determine the offsets for each data item within each group
  • Upon completion:
    • Create a new schedule for the shuffle and I/O steps
    • The schedule can consist of multiple cycles (see the usage sketch below)
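
  A minimal usage sketch of the interface above (this interface was later added to the MPI standard). The per-rank counts, file name, and do_independent_work() are illustrative, and the explicit MPI_Exscan offset calculation is only one way to give each rank its own file region:

  #include <mpi.h>
  #include <stdlib.h>

  static void do_independent_work(void) { /* placeholder for overlapped computation */ }

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* every process may contribute a different amount of data */
      int count = 1024 * (rank + 1);
      int *buf = malloc(count * sizeof(int));
      for (int i = 0; i < count; i++) buf[i] = rank;

      MPI_File fh;
      MPI_File_open(MPI_COMM_WORLD, "out.dat",
                    MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

      /* give each rank a disjoint file region based on the lower ranks' sizes */
      long my_bytes = (long)count * sizeof(int), disp = 0;
      MPI_Exscan(&my_bytes, &disp, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
      if (rank == 0) disp = 0;              /* MPI_Exscan leaves rank 0 undefined */
      MPI_File_set_view(fh, (MPI_Offset)disp, MPI_INT, MPI_INT,
                        "native", MPI_INFO_NULL);

      MPI_Request req;
      /* start the collective write; internally this triggers the allgather of
         per-process sizes, the offset computation, and the shuffle + I/O cycles */
      MPI_File_iwrite_all(fh, buf, count, MPI_INT, &req);

      do_independent_work();                /* overlap with the write */

      MPI_Wait(&req, MPI_STATUS_IGNORE);    /* buf may be reused after this */
      MPI_File_close(&fh);
      free(buf);
      MPI_Finalize();
      return 0;
  }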

  7. Experimental Evaluation
  • Crill cluster at the University of Houston:
    • Distributed PVFS2 file system with 16 I/O servers
    • 4x SDR InfiniBand message-passing network (2 ports per node)
    • Gigabit Ethernet I/O network
    • 18 nodes, 864 compute cores
  • LibNBC integrated with the Open MPI trunk, rev. 24640
  • Evaluation focuses on collective write operations

  8. Latency I/O Overlap Tests
  • Overlap a non-blocking collective I/O operation with an equally expensive compute operation
  • Best case: overall time = max(I/O time, compute time)
  • Strong dependence on the ability to make progress
  • Best case: time between subsequent calls to NBC_Test = time to execute one cycle of the collective I/O (see the sketch below)
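
  A sketch of the benchmark's overlap loop. The standard MPI_Test is used here as the progress call in place of the prototype's NBC_Test; the chunk count and compute_chunk() are illustrative:

  #include <mpi.h>

  #define NUM_CHUNKS 64   /* granularity of the compute loop (assumed) */

  static void compute_chunk(int i) { (void)i; /* one slice of the compute work */ }

  /* overlap one non-blocking collective write with an equally expensive
     computation, calling into the progress engine between compute chunks */
  double overlapped_write(MPI_File fh, void *buf, int count, MPI_Datatype dt)
  {
      MPI_Request req;
      int flag = 0;
      double t0 = MPI_Wtime();

      MPI_File_iwrite_all(fh, buf, count, dt, &req);

      for (int i = 0; i < NUM_CHUNKS; i++) {
          compute_chunk(i);
          MPI_Test(&req, &flag, MPI_STATUS_IGNORE);  /* drive internal progress */
      }
      MPI_Wait(&req, MPI_STATUS_IGNORE);

      /* best case: the elapsed time is close to max(I/O time, compute time) */
      return MPI_Wtime() - t0;
  }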

  9. Parallel Image Segmentation Application
  • Used to assist in diagnosing thyroid cancer
  • Based on microscopic images obtained through Fine Needle Aspiration (FNA) [1]
  • Executes convolution operations for different filters and writes the resulting data
  • Code modified to overlap the write of iteration i with the computation of iteration i+1 (see the sketch below)
  • Two code versions generated:
    • NBC: additional calls to the progress engine added between different code blocks
    • NBC w/FFTW: FFTW modified to insert further calls to the progress engine

  [1] E. Gabriel, V. Venkatesan, and S. Shah. Towards High Performance Cell Segmentation in Multispectral Fine Needle Aspiration Cytology of Thyroid Lesions. Computer Methods and Programs in Biomedicine, 2009.
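
  A sketch of the double-buffered pipeline implied by that modification, not the application's actual code; compute_iteration() and the two-buffer scheme are assumptions:

  #include <mpi.h>
  #include <stdlib.h>

  /* placeholder: run the convolution filters of one iteration into 'out' */
  static void compute_iteration(int iter, double *out, int count) {
      for (int i = 0; i < count; i++) out[i] = (double)iter;
  }

  /* Write the output of iteration i in the background while iteration i+1
     is computed. Assumes num_iters >= 1 and a file view that gives each
     rank and each iteration its own region of the file. */
  void pipelined_filters(MPI_File fh, int num_iters, int count)
  {
      double *bufs[2];
      MPI_Request req;
      bufs[0] = malloc(count * sizeof(double));
      bufs[1] = malloc(count * sizeof(double));

      compute_iteration(0, bufs[0], count);
      MPI_File_iwrite_all(fh, bufs[0], count, MPI_DOUBLE, &req);

      for (int i = 1; i < num_iters; i++) {
          /* this computation overlaps the still pending write of iteration i-1;
             the application additionally calls the progress engine between
             its internal code blocks */
          compute_iteration(i, bufs[i % 2], count);
          MPI_Wait(&req, MPI_STATUS_IGNORE);   /* finish the write of iteration i-1 */
          MPI_File_iwrite_all(fh, bufs[i % 2], count, MPI_DOUBLE, &req);
      }
      MPI_Wait(&req, MPI_STATUS_IGNORE);       /* drain the last write */
      free(bufs[0]);
      free(bufs[1]);
  }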

  10. Application Results
  • 8192 x 8192 pixels, 21 spectral channels
  • 1.3 GB of input data, ~3 GB of output data
  • 32 aggregators with a 4 MB cycle buffer size

  11. Conclusions
  • The specification of non-blocking collective I/O operations is straightforward
  • The implementation is challenging, but doable
  • Results show a strong dependence on the ability to make progress:
    • (Nearly) perfect overlap for the micro-benchmark
    • Mostly good results in the application scenario
  • The proposal is up for its first vote in the MPI Forum
