
Parallel and Grid I/O Infrastructure


Presentation Transcript


  1. Parallel and Grid I/O Infrastructure • Rob Ross, Argonne National Laboratory • Parallel Disk Access and Grid I/O (P4) • SDM All Hands Meeting, March 26, 2002

  2. Participants • Argonne National Laboratory • Bill Gropp, Rob Ross, Rajeev Thakur, Rob Latham, Anthony Chan • Northwestern University • Alok Choudhary, Wei-keng Liao, Avery Ching, Kenin Coloma, Jianwei Li • Collaborators • Lawrence Livermore National Laboratory • Ghaleb Abdulla, Tina Eliassi-Rad, Terence Critchlow • Application groups

  3. Focus Areas in Project • Parallel I/O on clusters • Parallel Virtual File System (PVFS) • MPI-IO hints • ROMIO MPI-IO implementation • Grid I/O • Linking PVFS and ROMIO with Grid I/O components • Application interfaces • NetCDF and HDF5 • Everything is interconnected! • Wei-keng Liao will drill down into specific tasks

  4. Parallel Virtual File System • Lead developer R. Ross (ANL) • R. Latham (ANL), developer • A. Ching, K. Coloma (NWU), collaborators • Open source, scalable parallel file system • Project began in the mid-1990s at Clemson University • Now a collaboration between Clemson and ANL • Successes • In use on large Linux clusters (OSC, Utah, Clemson, ANL, Phillips Petroleum, …) • 100+ unique downloads/month • 160+ users on mailing list, 90+ on developers list • Multi-gigabyte-per-second performance demonstrated

  5. Keeping PVFS Relevant: PVFS2 • Scaling to thousands of clients and hundreds of servers requires some design changes • Distributed metadata • New storage formats • Improved fault tolerance • New technology, new features • High-performance networking (e.g. InfiniBand, VIA) • Application metadata • New design and implementation warranted (PVFS2)

  6. PVFS1, PVFS2, and SDM • Maintaining PVFS1 as a resource to community • Providing support, bug fixes • Encouraging use by application groups • Adding functionality to improve performance (e.g. tiled display) • Implementing next-generation parallel file system • Basic infrastructure for future PFS work • New physical distributions (e.g. chunking) • Application metadata storage • Ensuring that a working parallel file system will continue to be available on clusters as they scale

  7. Data Staging for Tiled Display • Contact: Joe Insley (ANL) • Commodity components • projectors, PCs • Provide very high resolution visualization • Staging application preprocesses “frames” into a tile stream for each “visualization node” • Uses MPI-IO to access data from PVFS file system • Streams of tiles are merged into movie files on visualization nodes • End goal is to display frames directly from PVFS • Enhancing PVFS and ROMIO to improve performance

  8. Example Tile Layout • 3x2 display, 6 readers • Frame size is 2532x1408 pixels • Tile size is 1024x768 pixels (overlapped) • Movies broken into frames with each frame stored in its own file in PVFS • Readers pull data from PVFS and send to display

  9. Tested access patterns • Subtile • Each reader grabs a piece of a tile • Small noncontiguous accesses • Lots of accesses for a frame • Tile • Each reader grabs a whole tile • Larger noncontiguous accesses • Six accesses for a frame • Reading individual pieces is simply too slow

  10. Noncontiguous Access in ROMIO • ROMIO performs “data sieving” to cut down the number of I/O operations • Uses large reads that grab multiple noncontiguous pieces • Example: reading tile 1, as sketched below
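
Data sieving is easiest to see in code. The sketch below is a minimal illustration, not ROMIO's actual internals; the piece list and helper name are invented. It services a batch of small noncontiguous reads with one large contiguous read plus in-memory copies:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Data sieving sketch: satisfy many small noncontiguous reads with
 * one large contiguous read, then copy out the wanted bytes.
 * Illustrative only; not ROMIO's actual internals. */
typedef struct { long offset; long length; } piece_t;

/* pieces[] must be sorted by offset and non-overlapping. */
static int sieve_read(FILE *fp, const piece_t *pieces, int npieces,
                      char *dest)
{
    long lo = pieces[0].offset;
    long hi = pieces[npieces - 1].offset + pieces[npieces - 1].length;
    char *buf = malloc(hi - lo);
    if (buf == NULL)
        return -1;

    /* One big read covering every piece and the holes between them. */
    fseek(fp, lo, SEEK_SET);
    fread(buf, 1, hi - lo, fp);

    /* Deliver only the bytes the caller actually asked for. */
    for (int i = 0; i < npieces; i++) {
        memcpy(dest, buf + (pieces[i].offset - lo), pieces[i].length);
        dest += pieces[i].length;
    }
    free(buf);
    return 0;
}
```

The tradeoff is exactly the one slide 11 quantifies: far fewer I/O operations, at the cost of also reading the data in the holes.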

  11. Noncontiguous Access in PVFS • ROMIO data sieving • Works for all file systems (uses only contiguous reads) • Reads extra data (three times the desired amount) • Noncontiguous access primitive allows requesting just the desired bytes (A. Ching, NWU) • Support in ROMIO allows transparent use of new optimization (K. Coloma, NWU) • PVFS and ROMIO support implemented
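
In MPI-IO terms, the application describes the noncontiguous tile pattern once with a file view, and the library (via data sieving or, on PVFS, the new noncontiguous primitive) handles it under the covers. A minimal sketch of one reader pulling its tile, using the slide 8 geometry with one byte per pixel assumed for simplicity; the file name is illustrative:

```c
#include <stdlib.h>
#include <mpi.h>

/* Read one tile of a frame through an MPI-IO file view.
 * Geometry follows slide 8; one byte per pixel is assumed for
 * simplicity, and the file name is illustrative. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int frame[2] = {1408, 2532};  /* full frame: rows, cols    */
    int tile[2]  = {768, 1024};   /* one tile:   rows, cols    */
    int start[2] = {0, 0};        /* this reader's tile origin */

    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, frame, tile, start, MPI_ORDER_C,
                             MPI_BYTE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "pvfs:/frames/frame0001",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    /* The view makes the tile's scattered rows look contiguous. */
    MPI_File_set_view(fh, 0, MPI_BYTE, filetype, "native",
                      MPI_INFO_NULL);

    char *buf = malloc((size_t)tile[0] * tile[1]);
    MPI_File_read_all(fh, buf, tile[0] * tile[1], MPI_BYTE,
                      MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(buf);
    MPI_Finalize();
    return 0;
}
```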

  12. Metadata in File Systems • Associative arrays of information related to a file • Seen in other file systems (MacOS, BeOS, ReiserFS) • Some potential uses: • Ancillary data (from applications) • Derived values • Thumbnail images • Execution parameters • I/O library metadata • Block layout information • Attributes on variables • Attributes of dataset as a whole • Headers • Keeps header out of data stream • Eliminates need for alignment in libraries

  13. Metadata and PVFS2 Status • Prototype metadata storage for PVFS2 implemented • R. Ross (ANL) • Uses Berkeley DB for storage of keyword/value pairs • Need to investigate how to interface to MPI-IO • Other components of PVFS2 coming along • Networking in testing (P. Carns, Clemson) • Client side API under development (Clemson) • PVFS2 beta early fourth quarter?
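
As a rough illustration of the storage approach (not the actual PVFS2 prototype code, and using the Berkeley DB 4.1-style open signature), storing one keyword/value pair looks like this:

```c
#include <string.h>
#include <db.h>

/* Sketch of keyword/value metadata storage with Berkeley DB, in
 * the spirit of the PVFS2 prototype (not its actual code). */
int store_attr(const char *dbfile, const char *keyword,
               const void *value, size_t len)
{
    DB *dbp;
    if (db_create(&dbp, NULL, 0) != 0)
        return -1;
    if (dbp->open(dbp, NULL, dbfile, NULL, DB_BTREE,
                  DB_CREATE, 0644) != 0)
        return -1;

    /* Berkeley DB stores opaque key/value byte strings (DBTs). */
    DBT key, data;
    memset(&key, 0, sizeof key);
    memset(&data, 0, sizeof data);
    key.data  = (void *)keyword;
    key.size  = strlen(keyword) + 1;
    data.data = (void *)value;
    data.size = len;

    int ret = dbp->put(dbp, NULL, &key, &data, 0);
    dbp->close(dbp, 0);
    return ret;
}
```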

  14. ROMIO MPI-IO Implementation • Written by R. Thakur (ANL) • R. Ross and R. Latham (ANL), developers • K. Coloma (NWU), collaborator • Implementation of MPI-2 I/O specification • Operates on wide variety of platforms • Abstract Device Interface for I/O (ADIO) aids in porting to new file systems • Successes • Adopted by industry (e.g. Compaq, HP, SGI) • Used at ASCI sites (e.g. LANL Blue Mountain)

  15. ROMIO Current Directions • Support for PVFS noncontiguous requests • K. Coloma (NWU) • Hints - key to efficient use of HW & SW components • Collective I/O • Aggregation (synergy) • Performance portability • Controlling ROMIO Optimizations • Access patterns • Grid I/O • Scalability • Parallel I/O benchmarking

  16. ROMIO Aggregation Hints • Part of ASCI Software Pathforward project • Contact: Gary Grider (LANL) • Implementation by R. Ross, R. Latham (ANL) • Hints control which processes do I/O in collectives • Examples: • All processes on same node as attached storage • One process per host • Additionally limit the number of processes that open the file • Good for systems without a shared FS (e.g. O2K clusters) • More scalable
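
These hints are passed through an MPI_Info object at open time. A sketch using ROMIO's cb_config_list and cb_nodes hint names; the file name is illustrative:

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    /* ROMIO aggregation hints: one I/O aggregator per host, and
     * at most 4 processes performing collective I/O. */
    MPI_Info_set(info, "cb_config_list", "*:1");
    MPI_Info_set(info, "cb_nodes", "4");

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "pvfs:/scratch/output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    /* ... collective writes are funneled through the aggregators ... */
    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```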

  17. Aggregation Example • Cluster of SMPs • Only one SMP box has connection to disks • Data is aggregated to processes on single box • Processes on that box perform I/O on behalf of the others

  18. Optimization Hints • MPI-IO calls should be chosen to best describe the I/O taking place • Use of file views • Collective calls for inherently collective operations • Unfortunately, sometimes choosing the “right” calls can result in lower performance • Allow application programmers to tune ROMIO with hints rather than using different MPI-IO calls • Avoid the misapplication of optimizations (aggregation, data sieving)

  19. Optimization Problems • ROMIO checks for applicability of two-phase optimization when collective I/O is used • With tiled display application using subtile access, this optimization is never used • Checking for applicability requires communication between processes • Results in 33% drop in throughput (on test system) • A hint that tells ROMIO not to apply the optimization can avoid this without changes to the rest of the application
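
That kind of hint exists in ROMIO as romio_cb_read (and romio_cb_write). A small sketch of opening the frame file with two-phase collective buffering disabled for reads; the path and function name are illustrative:

```c
#include <mpi.h>

/* Open a file for the subtile readers with ROMIO's two-phase
 * (collective buffering) read optimization disabled, so collective
 * reads skip the applicability check entirely. */
MPI_File open_for_subtile_reads(const char *path)
{
    MPI_Info info;
    MPI_File fh;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_read", "disable");
    MPI_File_open(MPI_COMM_WORLD, path, MPI_MODE_RDONLY, info, &fh);
    MPI_Info_free(&info);
    return fh;
}
```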

  20. Access Pattern Hints • Collaboration between ANL and LLNL (and growing) • Examining how access pattern information can be passed to MPI-IO interface, through to underlying file system • Used as input to optimizations in MPI-IO layer • Used as input to optimizations in FS layer as well • Prefetching • Caching • Writeback

  21. Status of Hints • Aggregation control finished • Optimization hints • Collectives, data sieving read finished • Data sieving write control in progress • PVFS noncontiguous I/O control in progress • Access pattern hints • Exchanging log files, formats • Getting up to speed on respective tools

  22. Parallel I/O Benchmarking • No common parallel I/O benchmarks • New effort (consortium) to: • Define some terminology • Define test methodology • Collect tests • Goal: provide a meaningful test suite with consistent measurement techniques • Interested parties at numerous sites (and growing) • LLNL, Sandia, UIUC, ANL, UCAR, Clemson • In infancy…

  23. Grid I/O • Looking at ways to connect our I/O work with components and APIs used in the Grid • New ways of getting data in and out of PVFS • Using MPI-IO to access data in the Grid • Alternative mechanisms for transporting data across the Grid (synergy) • Working towards more seamless integration of the tools used in the Grid and those used on clusters and in parallel applications (specifically MPI applications) • Facilitate moving between Grid and Cluster worlds

  24. Local Access to GridFTP Data • Grid I/O Contact: B. Allcock (ANL) • GridFTP striped server provides high-throughput mechanism for moving data across Grid • Relies on proprietary storage format on striped servers • Must manage metadata on stripe location • Data stored on servers must be read back from servers • No alternative/more direct way to access local data • Next version assumes shared file system underneath

  25. GridFTP Striped Servers • Remote applications connect to multiple striped servers to quickly transfer data over Grid • Multiple TCP streams better utilize WAN network • Local processes would need to use same mechanism to get to data on striped servers

  26. PVFS under GridFTP • With PVFS underneath, GridFTP servers would store data on PVFS I/O servers • Stripe information stored on PVFS metadata server

  27. Local Data Access • Application tasks that are part of a local parallel job could access data directly off PVFS file system • Output from application could be retrieved remotely via GridFTP

  28. MPI-IO Access to GridFTP • Applications such as tiled display reader desire remote access to GridFTP data • Access through MPI-IO would allow this with no code changes • ROMIO ADIO interface provides the infrastructure necessary to do this • MPI-IO hints provide means for specifying number of stripes, transfer sizes, etc.

  29. WAN File Transfer Mechanism • B. Gropp (ANL), P. Dickens (IIT) • Applications • PPM and COMMAS (Paul Woodward, UMN) • Alternative mechanism for moving data across Grid using UDP • Focuses on requirements for file movement • All data must arrive at destination • Ordering doesn’t matter • Lost blocks can be retransmitted when detected, but need not stop the remainder of the transfer
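
These relaxed semantics admit a very simple receiver: write each block at its final offset as it arrives, track arrivals in a bitmap, and periodically ask for whatever is missing. A schematic sketch; the block size, structures, and function names are invented for illustration and are not the actual WAN FT code:

```c
#include <stdio.h>
#include <stdlib.h>

/* Receiver-side bookkeeping for an out-of-order UDP file transfer.
 * Block size, types, and names are illustrative only. */
#define BLOCK_SIZE 8192

typedef struct {
    FILE          *out;       /* destination file      */
    long           nblocks;   /* total blocks expected */
    long           received;  /* blocks seen so far    */
    unsigned char *bitmap;    /* one bit per block     */
} transfer_t;

/* Handle one datagram; blocks may arrive in any order. */
void on_block(transfer_t *t, long seq, const char *data, size_t len)
{
    if (t->bitmap[seq / 8] & (1u << (seq % 8)))
        return;                       /* duplicate; ignore it */
    fseek(t->out, seq * (long)BLOCK_SIZE, SEEK_SET);
    fwrite(data, 1, len, t->out);     /* write at final offset */
    t->bitmap[seq / 8] |= 1u << (seq % 8);
    t->received++;
}

/* On timeout, ask the sender to retransmit missing blocks only;
 * already-received blocks never block progress. */
void request_missing(const transfer_t *t, void (*send_nack)(long seq))
{
    for (long seq = 0; seq < t->nblocks; seq++)
        if (!(t->bitmap[seq / 8] & (1u << (seq % 8))))
            send_nack(seq);
}
```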

  30. WAN File Transfer Performance • Comparing TCP utilization to WAN FT technique • See 10-12% utilization with a single TCP stream (8 streams needed to approach maximum utilization) • With WAN FT, obtain near-90% utilization and more uniform performance

  31. Grid I/O Status • Planning with Grid I/O group • Matching up components • Identifying useful hints • Globus FTP client library is available • 2nd generation striped server being implemented • XIO interface prototyped • Hooks for alternative local file systems • Obvious match for PVFS under GridFTP

  32. NetCDF • Applications in climate and fusion • PCM • John Drake (ORNL) • Weather Research and Forecast Model (WRF) • John Michalakes (NCAR) • Center for Extended Magnetohydrodynamic Modeling • Steve Jardin (PPPL) • Plasma Microturbulence Project • Bill Nevins (LLNL) • Maintained by Unidata Program Center • API and file format for storing multidimensional datasets and associated metadata (in a single file)

  33. NetCDF Interface • Strong points: • It’s a standard! • I/O routines allow for subarray and strided access with single calls • Access is clearly split into two modes • Defining the datasets (define mode) • Accessing and/or modifying the datasets (data mode) • Weakness: no parallel writes, limited parallel read capability • This forces applications to ship data to a single node for writing, severely limiting usability in I/O intensive applications
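
For reference, the define/data split looks like this in the serial netCDF C interface (dimension sizes, names, and the file name are illustrative):

```c
#include <netcdf.h>

/* Serial netCDF: define mode describes the file, data mode fills it. */
int write_field(const float *temp /* 64 x 128 values */)
{
    int ncid, dimids[2], varid;

    /* Define mode: create the file and describe its contents. */
    nc_create("climate.nc", NC_CLOBBER, &ncid);
    nc_def_dim(ncid, "lat", 64, &dimids[0]);
    nc_def_dim(ncid, "lon", 128, &dimids[1]);
    nc_def_var(ncid, "temperature", NC_FLOAT, 2, dimids, &varid);
    nc_put_att_text(ncid, varid, "units", 1, "K");
    nc_enddef(ncid);                 /* switch to data mode */

    /* Data mode: write the whole variable in one call. */
    nc_put_var_float(ncid, varid, temp);
    return nc_close(ncid);
}
```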

  34. Parallel NetCDF • Rich I/O routines and explicit define/data modes provide a good foundation • Existing applications are already describing noncontiguous regions • Modes allow for a synchronization point when file layout changes • Missing: • Semantics for parallel access • Collective routines • Option for using MPI datatypes • Implement in terms of MPI-IO operations • Retain file format for interoperability
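
To make the target concrete, here is a hedged sketch of what a collective subarray write looks like in this style. The ncmpi_ names follow the PnetCDF library that grew out of this design; the geometry and function name are invented:

```c
#include <mpi.h>
#include <pnetcdf.h>

/* Sketch of a collective subarray write in the PnetCDF style.
 * Each rank writes its own 32x128 slab of a 64x128 variable;
 * start/count select the slab, as with MPI-IO subarrays. */
void checkpoint(MPI_Comm comm, int rank, const float *slab)
{
    int ncid, dimids[2], varid;
    MPI_Offset start[2] = {rank * 32, 0};
    MPI_Offset count[2] = {32, 128};

    ncmpi_create(comm, "ckpt.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "lat", 64, &dimids[0]);
    ncmpi_def_dim(ncid, "lon", 128, &dimids[1]);
    ncmpi_def_var(ncid, "temperature", NC_FLOAT, 2, dimids, &varid);
    ncmpi_enddef(ncid);

    /* Collective data-mode write: all ranks participate. */
    ncmpi_put_vara_float_all(ncid, varid, start, count, slab);
    ncmpi_close(ncid);
}
```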

  35. Parallel NetCDF Status • Design document created • B. Gropp, R. Ross, and R. Thakur (ANL) • Prototype in progress • J. Li (NWU) • Focus is on write functions first • Biggest bottleneck for checkpointing applications • Read functions follow • Investigate alternative file formats in future • Address differences in access modes between writing and reading

  36. FLASH Astrophysics Code • Developed at ASCI Center at University of Chicago • Contact: Mike Zingale • Adaptive mesh (AMR) code for simulating astrophysical thermonuclear flashes • Written in Fortran90, uses MPI for communication, HDF5 for checkpointing and visualization data • Scales to thousands of processors, runs for weeks, needs to checkpoint • At the time, I/O was a bottleneck (½ of runtime on 1024 processors)

  37. HDF5 Overhead Analysis • Instrumented FLASH I/O to log calls to H5Dwrite and the underlying MPI_File_write_at • [Figure: instrumented timeline comparing time spent in H5Dwrite vs. MPI_File_write_at]

  38. HDF5 Hyperslab Operations • In the instrumented timeline, the white region is the hyperslab “gather” (from memory) • Cyan is the “scatter” (to file)

  39. Hand-Coded Packing • Packing time is in the black regions between bars • Nearly an order of magnitude improvement
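
Slides 38 and 39 contrast two ways of writing a noncontiguous memory region with HDF5: letting H5Dwrite gather a hyperslab selection itself versus hand-packing into a contiguous buffer first. A sketch of both, using current HDF5 API names; the dataset handle, sizes, and region are illustrative, and the dataset is assumed to already exist as a 6x6 array of doubles:

```c
#include <string.h>
#include <hdf5.h>

/* Write the interior 6x6 region of an 8x8 in-memory array into a
 * 6x6 dataset, two ways. Sizes and names are illustrative. */
void write_region(hid_t dset, const double mem[8][8])
{
    hsize_t mdims[2] = {8, 8};
    hsize_t start[2] = {1, 1};
    hsize_t count[2] = {6, 6};

    /* 1. Hyperslab path (slide 38): H5Dwrite gathers the
     *    noncontiguous selection from memory itself. */
    hid_t memspace = H5Screate_simple(2, mdims, NULL);
    H5Sselect_hyperslab(memspace, H5S_SELECT_SET, start, NULL,
                        count, NULL);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, H5S_ALL,
             H5P_DEFAULT, mem);
    H5Sclose(memspace);

    /* 2. Hand-coded packing (slide 39): copy rows into a
     *    contiguous buffer, then write it in one shot. */
    double packed[6][6];
    for (int i = 0; i < 6; i++)
        memcpy(packed[i], &mem[i + 1][1], 6 * sizeof(double));
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL,
             H5P_DEFAULT, packed);
}
```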

  40. Wrap Up • Progress being made on multiple fronts • ANL/NWU collaboration is strong • Collaborations with other groups maturing • Balance of immediate payoff and medium term infrastructure improvements • Providing expertise to application groups • Adding functionality targeted at specific applications • Building core infrastructure to scale, ensure availability • Synergy with other projects • On to Wei-keng!
