
Data Management at NERSC



  1. Data Management at NERSC Quincey Koziol NERSC Data and Analytics Services Group August 22, 2016 koziol@lbl.gov

  2. Outline • Best Practices and Guidelines • I/O Libraries • Databases • Related topics


  4. Why Manage Your Data? • “Data management is the development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets.”* *DAMA-DMBOK Guide (Data Management Body of Knowledge) Introduction & Project Status

  5. Data @ NERSC NERSC offers a variety of services to support data-centric workloads. We provide tools in the areas of: • Data Analytics (statistics, machine learning, imaging) • Data Management (storage, representation) • Data Transfer • Workflows • Science Gateways • Visualization http://www.nersc.gov/users/data-analytics/


  7. General Recommendations • NERSC recommends the use of modern, scientific I/O libraries (HDF5, netCDF, ROOT) to represent and store scientific data. • We provide database technologies (MongoDB, SciDB, MySQL, PostgreSQL) for our users as a complementary mechanism for storing and accessing data. • Low-level POSIX I/O from applications to NERSC file systems is also supported, if necessary. Details here: http://www.nersc.gov/users/storage-and-file-systems/

  8. Notes on NERSC File I/O • Use the local scratch file system on Edison and Cori for the best I/O rates. • For some types of I/O you can further optimize I/O rates using a technique called file striping. • Keep in mind that data in the local scratch directories are purged, so you should always back up important files to HPSS* or project space. • You can share data with your collaborators using project directories. These are directories that are shared by all members of a NERSC repository. *HPSS: http://www.nersc.gov/users/storage-and-file-systems/hpss/getting-started/

  9. Introduction to Scientific I/O I/O is commonly used by scientific applications to achieve goals like: • Storing numerical output from simulations for later analysis • Implementing 'out-of-core' techniques for algorithms that process more data than can fit in system memory and must page data in from disk • Checkpointing to files, to save the state of an application in case of system failure.

  10. Types of Application I/O to Parallel File Systems

  11. File-per-processor • The simplest pattern to implement. • Parallel file systems often perform well with this type of access up to several thousand files, but synchronizing metadata for a large collection of files introduces a bottleneck. • One way to mitigate this is to use a 'square-root' file layout, for example by dividing 10,000 files into 100 subdirectories of 100 files each. • However, large collections of files can be unwieldy to manage. • This may not be a concern for checkpoint files, though. • Another disadvantage of file-per-processor access is that later restarts (probably) require the same number and layout of processors. • A minimal file-per-process sketch follows below.
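
Below is a minimal file-per-process sketch in C. It assumes the per-rank output directories (output/dirNNN) already exist, and the file names, array size, and 'square-root' layout are illustrative rather than a NERSC-prescribed convention.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* 'square-root' layout: 100 ranks per subdirectory (directories assumed to exist) */
        char path[256];
        snprintf(path, sizeof(path), "output/dir%03d/rank%06d.dat", rank / 100, rank);

        double data[1024];
        for (int i = 0; i < 1024; i++) data[i] = (double)rank;

        /* plain POSIX-style I/O: no coordination between ranks is needed */
        FILE *fp = fopen(path, "wb");
        if (fp != NULL) {
            fwrite(data, sizeof(double), 1024, fp);
            fclose(fp);
        }

        MPI_Finalize();
        return 0;
    }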

  12. Shared-file (Independent) • Allows many processors to share a common file handle, but write independently to exclusive regions of the file. • However, there can be significant overhead for write patterns where the regions in the file may be contested by two or more processors. • In these cases, the file system uses a 'lock manager' to serialize access to the contested region and guarantee file consistency. • The advantage of shared file access lies in data management and portability, especially when a higher-level I/O format such as HDF5 or netCDF is used to encapsulate the data in the file.
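
As a rough sketch of the independent shared-file pattern, the following C/MPI-IO fragment has every rank open the same file and write to its own non-overlapping byte range; the file name and element count are placeholders.

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        enum { COUNT = 1024 };
        double data[COUNT];
        for (int i = 0; i < COUNT; i++) data[i] = (double)rank;

        /* all ranks share one file handle */
        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* each rank writes independently to an exclusive region of the file */
        MPI_Offset offset = (MPI_Offset)rank * COUNT * sizeof(double);
        MPI_File_write_at(fh, offset, data, COUNT, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }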

  13. Shared-file (Collective Buffering) • Improves the performance of shared-file access by offloading some of the coordination work from the file system to the application*. • A subset of the processors is chosen to be 'aggregators', which collect data from the other processors, pack it into contiguous buffers in memory, and then write those buffers to the file system. • Originally, collective buffering was developed to reduce the number of small, noncontiguous writes, but another benefit that is important for file systems such as Lustre is that the buffer size can be set to a multiple of the ideal transfer size preferred by the file system. • *Also implemented within MPI-IO for collective I/O operations (and used by HDF5, etc.).

  14. Lustre • Scalable, POSIX-compliant parallel file system designed for large, distributed-memory systems • Uses a server-client model with separate servers for file metadata and file content

  15. Lustre Files • Lustre performs much better when the I/O accesses are aligned to Lustre's fundamental unit of storage, which is called a stripe and has a default size (on NERSC systems) of 1MB. • Striping divides up a file across many OSTs. • Each stripe is stored on a different OST, and the assignment of stripes to OSTs is round-robin.

  16. Lustre Striping • Striping a file increases the available bandwidth. • Writing to several OSTs in parallel aggregates the bandwidth of the individual OSTs. • This is the same principle behind the use of striping in a RAID array of disks. In fact, the disk arrays backing a Lustre file system are themselves RAID-striped, so you can think of Lustre striping as a second layer of striping that lets you access every physical disk in the file system when you want the maximum available bandwidth (i.e., by striping over all available OSTs).

  17. Lustre Striping • Lustre allows you to modify three striping parameters for a shared file: • Stripe count controls how many OSTs the file is striped over • For example, a stripe count of 4 spreads the file across four OSTs • Stripe size controls the number of bytes in each stripe • Start index chooses where in the list of OSTs to start the round-robin assignment • The default value of -1 allows the system to choose the offset in order to load balance the file system.

  18. Lustre Striping • Stripe parameters can be viewed and changed on a per-file or per-directory basis using the commands: • lfs setstripe [file,dir] -c [count] -s [size] -i [index] • lfs getstripe [file,dir] • Striping parameters must be set before a file is written to (either by touching a file and setting the parameters, or by setting the parameters on the directory). • To change the parameters after a file has been written to, the file must be copied over to a new file with the desired parameters, for instance with these commands: • lfs setstripe newfile [new parameters] • cp oldfile newfile • Striping can also be set programmatically; see the sketch below.
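
On systems where the Lustre user-space API is installed, striping can also be set from C code before a file is first written, as a rough equivalent of lfs setstripe. This is a sketch only: it assumes the lustre/lustreapi.h header and liblustreapi are available, and the file name and parameters are illustrative.

    #include <fcntl.h>
    #include <unistd.h>
    #include <lustre/lustreapi.h>

    int main(void) {
        /* roughly `lfs setstripe -c 8 -s 1m -i -1 newfile`:
           1 MB stripes over 8 OSTs, system-chosen start index */
        int rc = llapi_file_create("newfile", 1 << 20 /* stripe size */,
                                   -1 /* start index */, 8 /* stripe count */,
                                   0  /* default striping pattern */);
        if (rc != 0)
            return 1;

        /* the striped file can now be written with ordinary POSIX calls */
        int fd = open("newfile", O_WRONLY);
        if (fd >= 0) {
            const char msg[] = "hello\n";
            write(fd, msg, sizeof(msg) - 1);
            close(fd);
        }
        return 0;
    }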

  19. Lustre Striping • A file automatically inherits the striping configuration of the directory in which it is created. • Importantly, the desired striping must be set on the directory before creating the files (later changes to the directory striping are not inherited by existing files). • When copying an existing striped file into a striped directory, the new copy will inherit the directory's striping configuration. • This provides another approach to changing the striping of an existing file.

  20. Lustre Striping • Striping on NERSC scratch file systems: • Default stripe count of 2 on Edison's $SCRATCH, max 96 • Backed by either /scratch1 or /scratch2 • Default of 8 on Edison's $SCRATCH3, max 36 • Default of 1 on Cori's $SCRATCH, max 248* • Selecting the best striping can be complicated: striping a file over too few OSTs will not take advantage of the system's available bandwidth, while striping over too many will cause unnecessary overhead and lead to a loss in performance. *We recommend striping over fewer than 160 OSTs on Cori; higher counts can cause file system problems.

  21. Lustre Striping • NERSC has also created the following shortcut commands for applying roughly optimal striping parameters for a range of common file sizes and access patterns: • stripe_fpp [file,dir] • stripe_small [file,dir] • stripe_medium [file,dir] • stripe_large [file,dir] • stripe_small sets the stripe count (number of OSTs) to 8, stripe_medium to 24, and stripe_large to 72. • In all cases, the stripe size is 1MB.

  22. Lustre Striping • Inheritance of striping provides a convenient way to set the striping on multiple output files at once, if all such files are written to the same output directory. • For example, if a job will produce multiple 10-100 GB output files in a known output directory, the striping of that directory can be configured before job submission: stripe_medium [myOutputDir]

  23. MPI Collective I/O • Collective I/O refers to a set of optimizations available in many implementations of MPI-IO that improve the performance of large-scale I/O to shared files. • To enable these optimizations, you must use the collective calls in the MPI-IO library that end in _all • For instance: MPI_File_write_at_all() (see the sketch below). • And, all MPI tasks in the given MPI communicator must participate in the collective call, even if they are not performing any I/O operations. • The MPI-IO library has a heuristic to determine whether to enable collective buffering, the primary optimization used in collective mode.
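
A minimal collective-write sketch in C, assuming every rank contributes the same number of elements; the file name and sizes are placeholders.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int count = 1 << 18;                /* elements per rank */
        double *data = malloc(count * sizeof(double));
        for (int i = 0; i < count; i++) data[i] = (double)rank;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "collective.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* the _all variant is collective: every rank in the communicator must
           call it, even a rank with nothing to write (it would pass count = 0) */
        MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
        MPI_File_write_at_all(fh, offset, data, count, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(data);
        MPI_Finalize();
        return 0;
    }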

  24. MPI-IO Collective Buffering • Collective buffering breaks an I/O operation into two stages. • For a collective read • The first stage uses a subset of MPI tasks (called aggregators) to communicate with the IO servers (OSTs in Lustre) and read a large chunk of data into a temporary buffer. • In the second stage, the aggregators ship the data from the buffer to its destination among the remaining MPI tasks using point-to-point MPI calls. • A collective write does the reverse: • Sending data through MPI into buffers on the aggregator nodes. • Then writing from the aggregator nodes to the IO servers. • The advantage of collective buffering is that fewer nodes are communicating with the IO servers, which reduces contention while still attaining high performance through concurrent I/O transfers. • Lustre prefers a one-to-one mapping of aggregator nodes to OSTs.

  25. MPI-IO Collective Buffering • Since the release of mpt/3.3, Cray has included a Lustre-aware implementation of the MPI-IO collective buffering algorithm. • This implementation is able to buffer data on the aggregator nodes into stripe-sized chunks so that all reads and writes to the Lustre filesystem are automatically stripe aligned without requiring any padding or manual alignment from the developer. • Because of the way Lustre is designed, alignment is a key factor in achieving optimal performance.

  26. MPI-IO Hints • Several environment variables can be used to control the behavior of collective buffering on Edison and Cori. • The MPICH_MPIIO_HINTS variable specifies hints to the MPI-IO library that can, for instance, override the built-in heuristic and force collective buffering on:
    setenv MPICH_MPIIO_HINTS "*:romio_cb_write=enable:romio_ds_write=disable"
• Placing this command in your batch file before calling aprun will cause your program to use these hints. • The * indicates that the hint applies to any file opened by MPI-IO, while romio_cb_write controls collective buffering for writes and romio_ds_write controls data sieving for writes, an older collective-mode optimization that is no longer used and can interfere with collective buffering. • The options for these hints are enable, disable, or automatic (the default value, which uses the built-in heuristic). • It is also possible to control the number of aggregator nodes using the cb_nodes hint, although the MPI-IO library will automatically set this to the stripe count of your file. • These hints can also be set in code, as sketched below.
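
The same ROMIO hints can be set programmatically through an MPI_Info object passed at file-open time, instead of (or in addition to) the MPICH_MPIIO_HINTS environment variable. A sketch, with an illustrative file name and aggregator count:

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "romio_cb_write", "enable");   /* force collective buffering on */
        MPI_Info_set(info, "romio_ds_write", "disable");  /* turn off data sieving for writes */
        MPI_Info_set(info, "cb_nodes", "8");              /* example aggregator count */

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "hinted.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

        /* ... collective reads/writes as usual ... */

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }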

  27. MPI-IO Hints • When set to 1, the MPICH_MPIIO_HINTS_DISPLAY variable causes your program to dump a summary of the current MPI-IO hints to stderr each time a file is opened. • This is useful for debugging and as a sanity check against spelling errors in your hints. • The MPICH_MPIIO_XSTATS variable enables profiling of the collective buffering algorithm, including binning of read/write calls and timings for the two phases. • Setting it to 1 provides summary data, while setting it to 2 or 3 provides more detail. • More detail on MPICH runtime environment variables, including a full list and description of MPI-IO hints, is available from the intro_mpi man page.

  28. Outline • Best Practices and Guidelines • I/O Libraries • Databases • Related topics

  29. Why I/O Middleware? • The complexity of I/O systems poses significant challenges in investigating the root cause of performance loss. • Use of I/O middleware for writing parallel applications has been shown to greatly enhance developers' productivity. • Such an approach hides many of the complexities associated with performing parallel I/O, rather than relying purely on programming-language aids and parallel library support such as MPI.

  30. I/O Middleware @ NERSC • HDF5 • A data model and set of libraries & tools for storing and managing large scientific datasets. • netCDF • A set of libraries and machine-independent data formats for creation, access, and sharing of array-oriented scientific data. • ROOT • A self-describing, column-based binary file format that allows serialization of a large collection of C++ objects and efficient subsequent analysis. • Others • http://www.nersc.gov/users/data-analytics/data-management/i-o-libraries/i-o-library-list/

  31. HDF5 • The Hierarchical Data Format v5 (HDF5) library is a portable I/O library used for storing scientific data. • The HDF5 technology suite includes: • A versatile data model that can represent very complex data objects and a wide variety of metadata. • A completely portable file format with no limit on the number or size of data objects in the collection. • A software library that runs on a range of computational platforms, from laptops to massively parallel systems, and implements a high-level API with C, C++, Fortran 90, and Java interfaces. • A rich set of integrated performance features that allow for access-time and storage-space optimizations. • Tools and applications for managing, manipulating, viewing, and analyzing the data in the collection. • HDF5's 'object database' data model enables users to focus on high-level concepts of relationships between data objects rather than descending into the details of the specific layout of every byte in the data file.

  32. HDF5 • An HDF5 file has a root group, under which you can add groups, datasets of various dimensions, single-value attributes, and links between groups and datasets. • The HDF5 library provides a 'logical view' of the file's contents as a graph. The flexibility to create arbitrary graphs of objects allows HDF5 to express a wide variety of data models. • The library transparently handles how these logical objects of the file map to the actual bytes in the file. • In many ways, HDF5 provides an abstracted filesystem-within-a-file that is portable to any system with the HDF5 library, regardless of the underlying storage type, filesystem, or conventions about binary data (i.e. 'endianness'). • A minimal serial example of this model follows below. • Additional information can be found in the HDF5 Tutorial from the HDF Group: • https://www.hdfgroup.org/HDF5/Tutor/index.html
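
As a rough serial illustration of this object model (assuming HDF5 1.8 or later; the file, group, dataset, and attribute names are made up), the following C program creates a group under the root group, a 2-D dataset inside it, and a single-value attribute on the dataset:

    #include <hdf5.h>

    int main(void) {
        /* one file, one group, one 2-D dataset, one attribute */
        hid_t file = H5Fcreate("model.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t grp  = H5Gcreate(file, "/results", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        hsize_t dims[2] = {4, 6};
        hid_t space = H5Screate_simple(2, dims, NULL);
        hid_t dset  = H5Dcreate(grp, "temperature", H5T_NATIVE_DOUBLE, space,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        double data[4][6] = {{0}};
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

        /* a single-value attribute attached to the dataset */
        hid_t aspace = H5Screate(H5S_SCALAR);
        hid_t attr   = H5Acreate(dset, "version", H5T_NATIVE_INT, aspace,
                                 H5P_DEFAULT, H5P_DEFAULT);
        int one = 1;
        H5Awrite(attr, H5T_NATIVE_INT, &one);

        H5Aclose(attr); H5Sclose(aspace);
        H5Dclose(dset); H5Sclose(space);
        H5Gclose(grp);  H5Fclose(file);
        return 0;
    }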

  33. Parallel HDF5 • The HDF5 library can be compiled to provide parallel support using the MPI library. • The HDF Group maintains a tutorial on parallel topics: https://www.hdfgroup.org/HDF5/Tutor/parallel.html • An HDF5 file can be opened in parallel from an MPI application by specifying a parallel 'file driver' with an MPI communicator and info structure. This information is communicated to HDF5 through a 'property list,' an HDF5 structure that is used to modify the default behavior of the library. In the following code, a file access property list is created and set to use the MPI-IO file driver:
    fapl_id = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl_id, mpi_comm, mpi_info);
    file_id = H5Fcreate("myparfile.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl_id);
• The MPI-IO file driver defaults to independent mode, where each processor can access the file independently and any conflicting accesses are handled by the underlying parallel file system.

  34. Parallel HDF5 • Alternatively, the MPI-IO file driver can be used to perform collective I/O. • This is not specified during file creation, but rather during each access (read or write) to a dataset, by passing a dataset transfer property list to the read or write call. The following code shows how to write collectively (a complete, self-contained version of this pattern is sketched below):
    dxpl_id = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl_id, H5FD_MPIO_COLLECTIVE);
    /* describe a 1D array of elements on this processor */
    memspace = H5Screate_simple(1, &count, NULL);
    /* map this processor's elements into the shared file */
    total = mpi_size * count;
    filespace = H5Screate_simple(1, &total, NULL);
    offset = mpi_rank * count;
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL, &count, NULL);
    H5Dwrite(dset_id, H5T_NATIVE_FLOAT, memspace, filespace, dxpl_id, somedata0);
• For more examples, see: • http://www.nersc.gov/users/training/online-tutorials/introduction-to-scientific-i-o/?show_all=1#toc-anchor-5
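
For reference, here is a self-contained sketch tying the two snippets together: open a file with the MPI-IO driver, create one shared 1-D dataset, and have each rank write its own hyperslab collectively. The file name, dataset name, and per-rank element count are illustrative.

    #include <stdlib.h>
    #include <mpi.h>
    #include <hdf5.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        hsize_t count = 1024;                       /* elements per rank */
        float *data = malloc(count * sizeof(float));
        for (hsize_t i = 0; i < count; i++) data[i] = (float)rank;

        /* open the file collectively with the MPI-IO driver */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        hid_t file = H5Fcreate("myparfile.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* one shared 1-D dataset of size count * nranks */
        hsize_t dims = count * (hsize_t)size;
        hid_t filespace = H5Screate_simple(1, &dims, NULL);
        hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_FLOAT, filespace,
                               H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* each rank selects its own contiguous hyperslab */
        hsize_t offset = (hsize_t)rank * count;
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL, &count, NULL);
        hid_t memspace = H5Screate_simple(1, &count, NULL);

        /* collective write */
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
        H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, dxpl, data);

        H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
        H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
        free(data);
        MPI_Finalize();
        return 0;
    }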

  35. HDF5 @ NERSC • There are HDF5 libraries provided by Cray. • Use the command "module avail cray-hdf5" to see the available Cray versions. • Cray serial HDF5 on Edison and Cori:
    module swap PrgEnv-intel PrgEnv-gnu   (optional)
    module load cray-hdf5
    ftn ...   (for Fortran code)
    cc ...    (for C code)
    CC ...    (for C++ code)
• Cray parallel HDF5 on Edison and Cori:
    module swap PrgEnv-intel PrgEnv-gnu   (or a similar command to use another compiler)
    module load cray-hdf5-parallel
    cc ...    (for C code)
    ftn ...   (for Fortran code)
• With Cray's HDF5, Cray's compiler wrappers cc/ftn should be used.

  36. Additional HDF5 Tools @ NERSC • H5hut - H5hut (HDF5 Utility Toolkit) is a veneer API for storing particle-based simulation data in HDF5 • http://www.nersc.gov/users/data-analytics/data-management/i-o-libraries/hdf5-2/h5part/ • H5py - A Pythonic interface to HDF5 data • http://www.nersc.gov/users/data-analytics/data-management/i-o-libraries/hdf5-2/h5py/ • H5spark - Supports accessing HDF5/NetCDF4 data in Spark • http://www.nersc.gov/users/data-analytics/data-management/i-o-libraries/hdf5-2/h5spark/

  37. netCDF • netCDF (Network Common Data Form) is a set of software libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. • Typically used in the climate field • More constrained than HDF5 and at a higher level of abstraction • More netCDF information here: http://www.unidata.ucar.edu/software/netcdf/docs/netcdf/
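
A minimal serial sketch of the netCDF C API, in which the file, dimension, variable, and attribute names (and the data values) are illustrative:

    #include <netcdf.h>

    int main(void) {
        int ncid, dimids[1], varid;
        float temp[8] = {271.5f, 272.0f, 272.5f, 273.0f,
                         273.5f, 274.0f, 274.5f, 275.0f};

        /* define a file with one dimension and one variable, then write the data */
        if (nc_create("sample.nc", NC_CLOBBER, &ncid) != NC_NOERR) return 1;
        nc_def_dim(ncid, "time", 8, &dimids[0]);
        nc_def_var(ncid, "temperature", NC_FLOAT, 1, dimids, &varid);
        nc_put_att_text(ncid, varid, "units", 6, "kelvin");
        nc_enddef(ncid);                 /* leave define mode before writing data */

        nc_put_var_float(ncid, varid, temp);
        nc_close(ncid);
        return 0;
    }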

  38. netCDF @ NERSC • To see the available Cray installations and versions use the following command: module avail cray-netcdf • Loading the Cray netCDF module files will also load a cray-hdf5 module file. • Using the Cray netCDF Library • No compile or link options are required as long as you use the Cray compiler wrappers. Below is an example for using the Cray compiler on Edison:
    module swap PrgEnv-intel PrgEnv-cray   (# or similar if needed to swap to another compiler)
    module load cray-netcdf
    ftn ...
    cc  ...
    CC  ...

  39. Additional netCDF Tools @ NERSC • NetCDF Operators (NCO) - A suite of file operators that facilitate manipulation and analysis of self-describing data stored in the netCDF or HDF4 formats. • To access, use: ‘module load nco’ • NCCMP - Compares two netCDF files bitwise or with a user-defined tolerance (absolute or relative percentage). • To access, use: ‘module load nccmp’ • Climate Data Operators (CDO) - A collection of command line operators to manipulate and analyse climate model data. Supported data formats are GRIB, netCDF, SERVICE, EXTRA and IEG. There are more than 350 operators available. • To access, use: ‘module load cdo’ • NCView - A visual browser for netCDF format files. • To access, use: ‘module load ncview’ • netCDF-python - An object-oriented Python interface to the netCDF C library • http://www.nersc.gov/users/data-analytics/data-management/i-o-libraries/netcdf-2/

  40. ROOT • A set of object-oriented frameworks with the functionality needed to handle and analyze large amounts of data in a very efficient way. • ROOT is written in C++ and uses an indexed tree format as its base data unit, with substructures called branches and leaves. • Originally designed for particle physics, its usage has extended to other data-intensive fields like astrophysics and neuroscience. • ROOT is mainly used for data analysis at NERSC. • ROOT Docs: https://root.cern.ch/drupal/

  41. ROOT @ NERSC • You can access ROOT interactively on NERSC systems with:
    % module swap PrgEnv-intel PrgEnv-gnu
    % module load root
    % root
• For submitting ROOT jobs, see here: http://www.nersc.gov/users/data-analytics/data-management/i-o-libraries/root/

  42. Outline • Best Practices and Guidelines • I/O Libraries • Databases • Related topics

  43. Databases @ NERSC • NERSC supports the provisioning of databases to hold large scientific datasets, as part of the science gateways effort. • Data-centric science often benefits from database solutions to store scientific data, or metadata about data stored in more traditional file formats like HDF5, netCDF or ROOT. • Our database offerings are targeted toward large data sets and high performance. Currently we support: • MySQL • PostgreSQL • MongoDB • SciDB • If you would like to request a database at NERSC, please fill out this form and you'll be contacted by NERSC staff: http://www.nersc.gov/users/data-analytics/data-management/databases/science-database-request-form/

  44. PostgreSQL • PostgreSQL is an object-relational database. It is known for having powerful and advanced features and extensions as well as supporting SQL standards. • NERSC provides a set of database nodes for users that wish to use PostgreSQL with their scientific applications. • PostgreSQL documentation here: http://www.postgresql.org/docs/

  45. PostgreSQL @ NERSC • The psql command line client is available on login at NERSC on Edison and Cori, and can be used to connect directly to a Postgres database via the command line (you do not need to load a module). • The system version of the client is 9.1, so if you use it to connect to the NERSC Postgres (scidb) servers, which are at version 9.4, you will get a warning. You can get a v9.4 client by loading the module: % module load postgresql • On a NERSC system, type the following command to use the psql command line client*: % psql -h scidb1.nersc.gov <yourdb> <dbuser> • A sketch of connecting from C code follows below. • NERSC PostgreSQL page: http://www.nersc.gov/users/data-analytics/data-management/databases/postgresql/ *You may need to replace “scidb1” with “scidb2”.
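
For programmatic access, the standard libpq C client library can be used. This is a sketch only; the connection parameters (database, user, host) are placeholders you would replace with the values NERSC provides.

    #include <stdio.h>
    #include <libpq-fe.h>

    int main(void) {
        /* placeholder connection string: substitute your database, user,
           and host (scidb1 or scidb2) */
        PGconn *conn = PQconnectdb("host=scidb1.nersc.gov dbname=yourdb user=dbuser");
        if (PQstatus(conn) != CONNECTION_OK) {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            PQfinish(conn);
            return 1;
        }

        /* trivial query to confirm the connection works */
        PGresult *res = PQexec(conn, "SELECT version()");
        if (PQresultStatus(res) == PGRES_TUPLES_OK)
            printf("%s\n", PQgetvalue(res, 0, 0));

        PQclear(res);
        PQfinish(conn);
        return 0;
    }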

  46. MySQL • MySQL is a very popular and powerful open-source relational database. • It has many features: • Pluggable Storage Engine Architecture, with multiple storage engines: • InnoDB • MyISAM • NDB (MySQL Cluster) • Memory • Merge • Archive • CSV • and more • Replication to improve application performance and scalability • Partitioning to improve performance and management of large database applications • Stored Procedures to improve developer productivity • Views to ensure sensitive information is not compromised • … • MySQL user documentation: http://dev.mysql.com/doc/

  47. MySQL @ NERSC • The mysql command line client is available on login at NERSC on Edison and Cori, and can be used to connect directly to a MySQL database via the command line (you do not need to load a module). • On a NERSC system, type the following command to use the mysql command line client*: % mysql <yourdb> -u <dbuser> -h scidb1.nersc.gov -p • A sketch of connecting from C code follows below. • NERSC MySQL page: http://www.nersc.gov/users/data-analytics/data-management/databases/mysql/ *You may need to replace “scidb1” with “scidb2”.
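
Programmatic access works through the standard MySQL C client library. The sketch below uses placeholder credentials; substitute the database name, user, password, and host assigned to your project.

    #include <stdio.h>
    #include <mysql.h>

    int main(void) {
        MYSQL *conn = mysql_init(NULL);
        /* placeholder credentials and host (scidb1 or scidb2) */
        if (!mysql_real_connect(conn, "scidb1.nersc.gov", "dbuser", "password",
                                "yourdb", 0, NULL, 0)) {
            fprintf(stderr, "connect failed: %s\n", mysql_error(conn));
            return 1;
        }

        /* trivial query to confirm the connection works */
        if (mysql_query(conn, "SELECT VERSION()") == 0) {
            MYSQL_RES *res = mysql_store_result(conn);
            MYSQL_ROW row = mysql_fetch_row(res);
            if (row) printf("%s\n", row[0]);
            mysql_free_result(res);
        }

        mysql_close(conn);
        return 0;
    }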

  48. SciDB • SciDB is a parallel database for array-structured data, good for TBs of time series, spectra, imaging, etc. • A full ACID database management system that stores data in multidimensional arrays with strongly-typed attributes (aka fields) within each cell. • SciDB User Documentation: https://paradigm4.atlassian.net/wiki/display/ESD/SciDB+Documentation • To request access to NERSC SciDB instances, please email consult@nersc.gov

  49. MongoDB • A cross-platform document-oriented database. • Classified as a NoSQL database, MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas, making the integration of data in certain types of applications easier and faster. • MongoDB user documentation: https://docs.mongodb.com/v2.6/

  50. MongoDB • Here are some considerations for when to use MongoDB as your scientific datastore. MongoDB is suitable for use cases where: • Your data model and schema are evolving rapidly • You need a simple query interface to access the data, i.e., you would like to interact with JSON objects directly (see the sketch below) • Your data is mostly non-relational, i.e., your data exists in self-contained objects with minimal cross-linking between objects • The MongoDB setup at NERSC may not be ideal in cases where: • You need to do high-volume transactional data writes. We can handle bulk loads and a small volume of writes at a time, but our current setup is not optimized for heavy write volumes • Your data model is heavily relational (lots of JOINs across data tables).
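
As a sketch of working with JSON-like documents from C via the MongoDB C driver (libmongoc 1.x assumed); the connection URI, database, collection, and document fields are all placeholders:

    #include <stdio.h>
    #include <mongoc.h>   /* newer driver releases use <mongoc/mongoc.h> */

    int main(void) {
        mongoc_init();

        /* placeholder URI: substitute the host, database, and credentials
           provided for your NERSC MongoDB instance */
        mongoc_client_t *client =
            mongoc_client_new("mongodb://dbuser:password@mongohost.nersc.gov:27017/yourdb");
        mongoc_collection_t *coll =
            mongoc_client_get_collection(client, "yourdb", "runs");

        /* documents are schemaless JSON-like objects (BSON) */
        bson_t *doc = BCON_NEW("run_id", BCON_INT32(42),
                               "status", BCON_UTF8("complete"));
        bson_error_t error;
        if (!mongoc_collection_insert(coll, MONGOC_INSERT_NONE, doc, NULL, &error))
            fprintf(stderr, "insert failed: %s\n", error.message);

        bson_destroy(doc);
        mongoc_collection_destroy(coll);
        mongoc_client_destroy(client);
        mongoc_cleanup();
        return 0;
    }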
