IO Best Practices For Franklin

Katie Antypas

User Services Group

[email protected]

NERSC User Group Meeting

September 19, 2007


Outline

  • Goals and scope of tutorial

  • IO Formats

  • Parallel IO strategies

  • Striping

  • Recommendations

Thanks to Julian Borrill, Hongzhang Shan, John Shalf and Harvey Wasserman for slides and data, Nick Cardo for Franklin/Lustre tutorials, and the NERSC-IO group for feedback

Goals

  • At a very high level, answer the question “how should I do my IO on Franklin?”

  • That is: with X GB of data to output, running on Y processors -- do this.

Axis of IO

This is why IO is complicated...

[Figure: word cloud of the axes along which an IO strategy can vary: Total Output Size, Number of Processors, Number of Files per Output Dump, File Size Per Processor, Transfer Size, Blocksize, Strided or Contiguous Access, Collective vs Independent, Weak vs Strong Scaling, Striping, Chunking, File System Hints, IO Library]

Axis of IO

[Figure: the same word cloud, reduced to the axes this study explores: Total File Size, File Size Per Processor, Transfer Size, Blocksize, Number of Writers, Number of Processors, Striping, IO Library. Fixed choices: primarily large-block IO with transfer size equal to blocksize, strong scaling, and HDF5 as the IO library. Some basic tips follow.]

Parallel I/O: A User Perspective

  • Wish List

    • Write data from multiple processors into a single file

    • File can be read in the same manner regardless of the number of CPUs that read from or write to the file (e.g. users want to see the logical data layout, not the physical layout)

    • Do so with the same performance as writing one file per processor (users only write one file per processor because of performance problems)

    • And make all of the above portable from one machine to the next

I/O Formats


Common Storage Formats

Many NERSC users are at this level. We would like to encourage them to transition to a higher-level IO library.

  • ASCII:

    • Slow

    • Takes more space!

    • Inaccurate

  • Binary

    • Non-portable (e.g. byte ordering and type sizes)

    • Not future proof

    • Parallel I/O using MPI-IO

  • Self-Describing formats

    • NetCDF/HDF4, HDF5, Parallel NetCDF

    • Example in HDF5: API implements Object DB model in portable file

    • Parallel I/O using pHDF5/pNetCDF, which hide MPI-IO (see the sketch after this list)

  • Community File Formats

    • FITS, HDF-EOS, SAF, PDB, Plot3D

    • Modern Implementations built on top of HDF, NetCDF, or other self-describing object-model API
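To make “hides MPI-IO” concrete, below is a minimal sketch of a collective shared-file write through parallel HDF5, assuming the HDF5 1.8+ API and an MPI-enabled build of the library; the file name, dataset name and per-rank element count are illustrative, not from the talk.

    #include <mpi.h>
    #include "hdf5.h"

    #define N 1024  /* doubles per processor; illustrative */

    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double local[N];
        for (int i = 0; i < N; i++) local[i] = rank + i * 1e-6;

        /* Open one shared file through the MPI-IO driver */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        hid_t file = H5Fcreate("shared.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* One global dataset; each rank selects its own hyperslab */
        hsize_t gdims[1] = {(hsize_t)nprocs * N};
        hsize_t start[1] = {(hsize_t)rank * N};
        hsize_t count[1] = {N};
        hid_t fspace = H5Screate_simple(1, gdims, NULL);
        hid_t dset   = H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, fspace,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
        hid_t mspace = H5Screate_simple(1, count, NULL);

        /* Collective transfer: the HDF5 library issues the MPI-IO calls */
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, local);

        H5Pclose(dxpl); H5Sclose(mspace); H5Sclose(fspace);
        H5Dclose(dset); H5Pclose(fapl); H5Fclose(file);
        MPI_Finalize();
        return 0;
    }

Note that the application never touches MPI_File_* calls, and the resulting file presents the logical layout no matter how many ranks later read it back.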

HDF5 Library

HDF5 is a general purpose library and file format for storing scientific data

  • Can store data structures, arrays, vectors, grids, complex data types, text

  • Can use basic HDF5 types (integers, floats, reals) or user-defined types such as multi-dimensional arrays, objects and strings

  • Stores metadata necessary for portability - endian type, size, architecture
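For a flavor of the API, here is a minimal serial sketch that writes one small integer array, assuming the HDF5 1.8+ interface; the file and dataset names are illustrative.

    #include "hdf5.h"

    int main(void) {
        hsize_t dims[2] = {4, 6};
        int data[4][6];
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 6; j++)
                data[i][j] = i * 6 + j;

        /* HDF5 records the type, shape and endianness in the file for us */
        hid_t file  = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t space = H5Screate_simple(2, dims, NULL);
        hid_t dset  = H5Dcreate(file, "/Dataset0", H5T_NATIVE_INT, space,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

        H5Dclose(dset); H5Sclose(space); H5Fclose(file);
        return 0;
    }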

HDF5 Data Model

  • Groups

    • Arranged in directory hierarchy

    • root group is always ‘/’

  • Datasets

    • Dataspace

    • Datatype

  • Attributes

    • Bind to Group & Dataset

  • References

    • Similar to softlinks

    • Can also be subsets of data

[Figure: an example HDF5 file. The root group “/” carries the attributes “author”=Jane Doe and “date”=10/24/2006; it contains “Dataset0” and “Dataset1” (each with a datatype and dataspace, and “Dataset1” carrying the attributes “time”=0.2345 and “validity”=None), plus a subgroup “subgrp” holding “Dataset0.1” and “Dataset0.2”.]
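A sketch of how these pieces map onto API calls, again assuming the HDF5 1.8+ interface; the group name and attribute value mirror the figure above (here the attribute is bound to the group for brevity).

    #include "hdf5.h"

    int main(void) {
        hid_t file = H5Fcreate("model.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

        /* A group under the root group "/" */
        hid_t grp = H5Gcreate(file, "/subgrp", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* A scalar attribute "time"=0.2345 bound to the group */
        double t = 0.2345;
        hid_t space = H5Screate(H5S_SCALAR);
        hid_t attr  = H5Acreate(grp, "time", H5T_NATIVE_DOUBLE, space,
                                H5P_DEFAULT, H5P_DEFAULT);
        H5Awrite(attr, H5T_NATIVE_DOUBLE, &t);

        H5Aclose(attr); H5Sclose(space); H5Gclose(grp); H5Fclose(file);
        return 0;
    }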

A Plug for Self-Describing Formats...

  • Application developers shouldn’t have to care about the physical layout of data

  • Using your own binary file format forces you to understand the layers below the application to get optimal IO performance

  • Every time the code is ported to a new machine, or the underlying file system is changed or upgraded, the user has to make changes again to maintain IO performance

  • Let other people do the work

    • HDF5 can be optimized for given platforms and file systems by HDF5 developers

    • User can stay with the high level

  • But what about performance?

IO Library Overhead

There is very little, if any, overhead from HDF5 for one-file-per-processor IO compared to POSIX and MPI-IO.

[Chart: comparative write rates; data from Hongzhang Shan]

Ways to do Parallel IO

Serial I/O

[Figure: processors 0-5 all send their data to one master processor, which writes the single file]

  • Each processor sends its data to the master, which then writes the data to a file (see the sketch below)

  • Advantages

    • Simple

    • May perform ok for very small IO sizes

  • Disadvantages

    • Not scalable

    • Not efficient: slow for any large number of processors or data sizes

    • May not be possible if memory constrained

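A minimal sketch of this pattern in MPI, with rank 0 as the master; the buffer size and file name are illustrative. Note that rank 0 must hold every processor’s data in memory at once, which is exactly the memory constraint mentioned above.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 1024  /* doubles per processor; illustrative */

    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double local[N];
        for (int i = 0; i < N; i++) local[i] = rank + i * 1e-6;

        /* The master gathers everyone's data, then writes it alone */
        double *all = NULL;
        if (rank == 0) all = malloc((size_t)nprocs * N * sizeof(double));
        MPI_Gather(local, N, MPI_DOUBLE, all, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            FILE *fp = fopen("serial_io.dat", "wb");
            fwrite(all, sizeof(double), (size_t)nprocs * N, fp);
            fclose(fp);
            free(all);
        }
        MPI_Finalize();
        return 0;
    }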

Parallel I/O: Multi-file

[Figure: processors 0-5 each write their own data to their own separate file]

  • Each processor writes its own data to a separate file (see the sketch below)

  • Advantages

    • Simple to program

    • Can be fast (up to a point)

  • Disadvantages

    • Can quickly accumulate many files

    • With Lustre, can hit metadata server limits

    • Hard to manage

    • Requires post-processing

    • Difficult for storage systems such as HPSS to handle many small files
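A minimal sketch of the file-per-processor pattern, embedding the rank in each file name; the name pattern and buffer size are illustrative.

    #include <mpi.h>
    #include <stdio.h>

    #define N 1024  /* doubles per processor; illustrative */

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local[N];
        for (int i = 0; i < N; i++) local[i] = rank + i * 1e-6;

        /* One file per processor: embed the rank in the file name */
        char fname[64];
        snprintf(fname, sizeof(fname), "dump_%05d.dat", rank);
        FILE *fp = fopen(fname, "wb");
        fwrite(local, sizeof(double), N, fp);
        fclose(fp);

        MPI_Finalize();
        return 0;
    }

At 32,000 processors this produces 32,000 files per dump, which is how a run like the one on the next slide ends up with tens of millions of files.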

Flash Center IO Nightmare…

  • Large 32,000 processor run on LLNL BG/L

  • Parallel IO libraries were not yet available

  • Intensive I/O application

    • Checkpoint files: 0.7 TB each, dumped every 4 hours, 200 dumps

      • Used for restarting the run

      • Full-resolution snapshots of the entire grid

    • Plotfiles: 20 GB each, 700 dumps

      • Grid coarsened by a factor of two with averaging

      • Single precision

      • Subset of grid variables

    • Particle files: 1400 files, 470 MB each

  • 154 TB of disk capacity

  • 74 million files!

  • Unix tool problems

  • Two years later, still trying to sift through the data and sew files together

Parallel I/O: Single-file

[Figure: processors 0-5 all write their data into one shared file]

  • Each processor writes its own data to the same file using MPI-IO mapping

  • Advantages

    • Single file

    • Manageable data

  • Disadvantages

    • Lower performance than one file per processor at some concurrencies


Parallel IO single file

[Figure: processors 0-5 each write their values into their own section of a single data array in the shared file]

Each processor writes to a section of a data array; each must know its offset from the beginning of the array and the number of elements to write, as sketched below.
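In raw MPI-IO that offset arithmetic looks like the following minimal sketch: a collective write of N doubles per rank at element offset rank*N (file name and sizes illustrative). Higher-level libraries such as pHDF5 perform this mapping for you.

    #include <mpi.h>

    #define N 1024  /* doubles per processor; illustrative */

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local[N];
        for (int i = 0; i < N; i++) local[i] = rank + i * 1e-6;

        /* Every rank writes N doubles at byte offset rank*N*sizeof(double) */
        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_Offset offset = (MPI_Offset)rank * N * sizeof(double);
        MPI_File_write_at_all(fh, offset, local, N, MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }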

Trade-offs

  • Ideally users want speed, portability and usability

    • speed - one file per processor

    • portability - high level IO library

    • usability

      • a single shared file, and

      • your own file format or a community file format layered on top of a high-level IO library

It isn’t hard to have speed, portability or usability. It is hard to have speed, portability and usability in the same implementation

Benchmarking Methodology and Results

Disclaimer

  • IO runs were done during production time

  • Rates depend on the other jobs running on the system

  • Focus on trends rather than on one or two outliers

  • Some tests ran twice, others only once

Peak IO Performance on Franklin

  • One might expect IO rates to continue to rise linearly with processor count

  • In practice the back end saturates at around ~250 processors

  • Weak-scaling IO, ~300 MB/proc

  • Peak performance ~11 GB/sec (5 DDNs × ~2 GB/sec)

[Figure: measured IO rate vs. processor count; image from Julian Borrill]

Description of IOR

  • Developed by LLNL; used for the Purple procurement

  • Focuses on parallel/sequential read/write operations that are typical in scientific applications

  • Can exercise one file per processor or shared file access for common set of testing parameters

  • Exercises an array of modern file APIs: MPI-IO, POSIX (shared or unshared), HDF5 and Parallel-NetCDF

  • Parameterized parallel file access patterns to mimic different application situations


Benchmark Methodology

[Figure: the two access patterns compared: processors 0-5 writing one single shared file vs. processors 0-5 each writing their own file]

Focus on the performance difference between a single shared file and one file per processor

Benchmark Methodology

  • Using the IOR HDF5 interface

  • Contiguous IO

  • Not intended to be a scaling study

  • Blocksize and transfer size are always equal to each other, but vary from run to run

  • Goal is to fill out the chart below with the best IO strategy

[Chart to be filled in: processors (256, 512, 1024, 2048, 4096) vs. aggregate output size (100 MB, 1 GB, 10 GB, 100 GB, 1 TB)]

Small Aggregate Output Sizes: 100 MB - 1 GB

One File per Processor vs Shared File - GB/Sec

[Charts: rates for aggregate file sizes of 100 MB and 1 GB, with a peak performance line; anything greater than the line is due to caching effects or timer granularity]

Clearly the ‘one file per processor’ strategy wins in the low concurrency cases correct?

Small Aggregate Output Sizes: 100 MB - 1 GB

One File per Processor vs Shared File - Time

[Charts: absolute times for aggregate file sizes of 100 MB and 1 GB]

But when looking at absolute time, the difference doesn’t seem so big...

Aggregate Output Size 100 GB

One File per Processor vs Shared File

[Charts: rate (GB/sec) and time (seconds) with a peak performance line; 100 GB is 390 MB/proc at 256 processors and 24 MB/proc at 4096; the slowest case takes about 2.5 minutes]

Is there anything we can do to improve the performance of the 4096 processor shared file case?

Hybrid Model


  • Examine the 4096 processor case more closely

  • Group subsets of processors to write to separate shared files

  • Try grouping 64, 256, 512, 1024, and 2048 processors per file to see the performance difference from the file-per-processor and single-shared-file cases (see the sketch below)

[Figure: processors 0-5 split into groups, each group writing its own shared file]
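A minimal sketch of this grouping with MPI_Comm_split: ranks are divided into groups of GROUP_SIZE, and each group performs a collective MPI-IO write to its own file. GROUP_SIZE, the file names and the buffer size are illustrative, not the actual IOR implementation.

    #include <mpi.h>
    #include <stdio.h>

    #define N 1024          /* doubles per processor; illustrative */
    #define GROUP_SIZE 512  /* ranks per shared file; one of the sizes tried here */

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local[N];
        for (int i = 0; i < N; i++) local[i] = rank + i * 1e-6;

        /* Split COMM_WORLD: one sub-communicator (color) per shared file */
        int color = rank / GROUP_SIZE;
        MPI_Comm group;
        MPI_Comm_split(MPI_COMM_WORLD, color, rank, &group);

        int grank;
        MPI_Comm_rank(group, &grank);

        char fname[64];
        snprintf(fname, sizeof(fname), "dump_group_%03d.dat", color);

        MPI_File fh;
        MPI_File_open(group, fname, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        MPI_Offset offset = (MPI_Offset)grank * N * sizeof(double);
        MPI_File_write_at_all(fh, offset, local, N, MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Comm_free(&group);
        MPI_Finalize();
        return 0;
    }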

Effect of Grouping Processors into Separate Smaller Shared Files

100 GB Aggregate Output Size on 4096 procs

  • Each processor writes out 24 MB

  • The only difference between runs is the number of files into which processors are grouped

  • Created a new MPI communicator in IOR for multiple shared files

  • Users gain some performance from grouping

  • Since very little data is written per processor, synchronization overhead dominates

[Chart: rate vs. number of files, ranging from a single shared file through 2048-, 512-, and 64-processor groups to one file per processor]

Aggregate Output Size 1 TB

One File per Processor vs Shared File

[Charts: rate (GB/sec) and time (seconds); 1 TB is 976 MB/proc at 1024 processors and 244 MB/proc at 4096; the slowest case takes about 3 minutes]

Is there anything we can do to improve the performance of the 4096 processor shared file case?


Effect of Grouping Processors into Separate Smaller Shared Files

  • Each processor writes out 244 MB

  • The only difference between runs is the number of files into which processors are grouped

  • Created a new MPI communicator in IOR for multiple shared files

[Chart: rate vs. number of files, from a single shared file through 2048-, 512-, and 64-processor groups to one file per processor]

Effect of Grouping Processors into Separate Smaller Shared Files

  • Each processor writes out 488 MB

  • The only difference between runs is the number of files into which processors are grouped

  • Created a new MPI communicator in IOR for multiple shared files

[Chart: rate vs. number of files, from a single shared file through 512- and 64-processor groups to one file per processor]

What is Striping?

  • The Lustre file system on Franklin is made up of an underlying set of file systems called Object Storage Targets (OSTs), essentially a set of parallel IO servers

  • A file is said to be striped when read and write operations access multiple OSTs concurrently

  • Striping can be a way to increase IO performance, since writing to or reading from multiple OSTs simultaneously increases the available IO bandwidth

What is Striping?

  • File striping will most likely improve performance for applications that read from or write to a single (or multiple) large shared files

  • Striping will likely have little effect for the following types of IO patterns:

    • Serial IO, where a single processor performs all the IO

    • Multiple nodes perform IO, but access files at different times

    • Multiple nodes perform IO simultaneously to different files that are small (each < 100 MB)

    • One file per processor

Striping Commands

  • Striping can be set at a file or directory level

  • Set striping on a directory, and all files created in that directory will inherit the striping level of the directory

  • Moving a file into a directory with a set striping will NOT change the striping of that file

  • stripe size

    • Number of bytes in each stripe (a multiple of the 64 KB block)

  • OST offset

    • Always keep this -1

    • Chooses the starting OST round robin

  • stripe count

    • Number of OSTs to stripe over

    • -1 stripes over all OSTs

    • 1 stripes over one OST

lfs setstripe <directory|file> <stripe size> <OST offset> <stripe count>

For example, lfs setstripe mydir 0 -1 -1 (with mydir an illustrative directory name) keeps the default stripe size, round-robins the starting OST, and stripes over all OSTs.

Stripe-Count Suggestions

  • Franklin Default Striping

    • 1 MB stripe size

    • Round robin starting OST (OST Offset -1)

    • Stripe over 4 OSTs (Stripe count 4)

  • Many small files, one file per proc

    • Use default striping

    • Or 0 -1 1 (default stripe size, round-robin offset, one OST)

  • Large shared files

    • Stripe over all available OSTs (0 -1 -1)

    • Or some number larger than 4 (0 -1 X)

  • Stripe over odd numbers?

  • Prime numbers?

Recommendations

[Chart: recommended strategy by processor count (256, 512, 1024, 2048, 4096) and aggregate file size (100 MB, 1 GB, 10 GB, 100 GB, 1 TB). Legend: Single Shared File, Default or No Striping; Single Shared File, Stripe over some OSTs (~10); Single Shared File, Stripe over many OSTs; Single Shared File, Stripe over many OSTs OR File per processor with default striping; Benefits to mod n shared files]

Recommendations

  • Think about the big picture

    • Run time vs. post-processing trade-off

    • Decide how much IO overhead you can afford

    • Data Analysis

    • Portability

    • Longevity

      • h5dump works on all platforms

      • Can view an old file with h5dump

      • If you use your own binary format you must keep track of not only your file format version but the version of your file reader as well

    • Storability

Recommendations

  • Use a standard IO format, even if you are following a one-file-per-processor model

  • The one-file-per-processor model really only makes sense when writing very large files at high concurrencies; for small files, the shared-file overhead is low

  • If you must do one file per processor IO then at least put it in a standard IO format so pieces can be put back together more easily

  • Splitting large shared files into a few files appears promising

    • Option for some users, but requires code changes and output format changes

    • Could be implemented better in IO library APIs

  • Follow striping recommendations

  • Ask the consultants, we are here to help!


Questions?
