The Vesta Parallel File System

Peter F. Corbett and Dror G. Feitelson

Outline • Introduction • Motivation and Design Guidelines • Abstractions and Interface • Implementation • Conclusion

Introduction The Vesta parallel file system: • Is built for AIX on the IBM SP2 • Is designed to provide parallel file access • Can achieve high efficiency on parallel I/O hardware • Deals exclusively with persistent on-line storage of files, particularly those that must be accessed by parallel applications

Introduction cont. • Approach of the Vesta file system: • Introduces a new abstraction of parallel files, by which application programmers can express the required partitioning of file data among the processes of a parallel application • Reduces the need for synchronization and concurrency control, which allows for a more streamlined implementation • Provides explicit control over the way data is distributed across the I/O nodes, so the distribution can be tailored to the expected access patterns

Motivation and Design Guidelines • Motivation • Users are able to create distributed files without full control over the mapping of data to disks • Design Guidelines • Parallelism • Scalability • Layering • Providing commonly expected service

Simple striping method to get a parallel view • Simple striping technique: assuming that the number of I/O nodes is N, block i of the file is located on I/O node i mod N.
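
The striping rule above can be sketched in a few lines of Python. The function name and the example sizes are illustrative only; the slide gives just the formula i mod N.

```python
def io_node_for_block(block_index: int, num_io_nodes: int) -> int:
    """Return the I/O node that holds a file block under simple striping."""
    if num_io_nodes <= 0:
        raise ValueError("num_io_nodes must be positive")
    if block_index < 0:
        raise ValueError("block_index must be non-negative")
    return block_index % num_io_nodes


# Blocks 0..7 of a file striped over N = 3 I/O nodes (example sizes only).
layout = [io_node_for_block(i, 3) for i in range(8)]
print(layout)
```

Note that the layout is tied to N: changing the number of I/O nodes moves almost every block, which is the dependency Vesta's cell abstraction is meant to remove.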

Vesta's method to get a parallel view • Two steps: • Abstract away from a direct dependency on the number of I/O nodes • Allow a variety of partitioned views of the data, in addition to partitioning according to the physical distribution of data to the I/O nodes • All these parallel views partition the file into disjoint subfiles that are typically accessed by different processes of a parallel application • This guarantees that the accesses by the different processes are non-overlapping at the byte level • It also allows each process to access its data directly

Cell abstraction of Vesta • Abstracting away from I/O nodes is done by introducing the notion of cells • Cells can be thought of as containers where data can be deposited • When a file is created, the number of cells is given as a parameter • If the number of cells is no more than the number of I/O nodes, then each cell will reside on a different I/O node • If there are more cells than I/O nodes, the cells will be distributed to the I/O nodes in a round-robin manner
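
A minimal sketch of the cell placement described above. It assumes that cells and I/O nodes are numbered from 0 and that placement starts at node 0; the slide does not say where the round-robin begins, and the function name is hypothetical.

```python
def place_cells(num_cells: int, num_io_nodes: int) -> dict[int, int]:
    """Map each cell number to an I/O node number, round-robin.

    ASSUMPTION: placement starts at I/O node 0. With num_cells <= num_io_nodes
    every cell lands on a different node; otherwise the assignment wraps around.
    """
    if num_cells <= 0 or num_io_nodes <= 0:
        raise ValueError("num_cells and num_io_nodes must be positive")
    return {cell: cell % num_io_nodes for cell in range(num_cells)}


few_cells = place_cells(3, 4)   # fewer cells than I/O nodes
many_cells = place_cells(6, 4)  # more cells than I/O nodes
print(few_cells)
print(many_cells)
```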

2-d structure of Vesta • 2-dimensional structure: • Cell dimension (horizontal) specifies the parallelism in accessing the data • Data within the cells (vertical) • The data in each cell is viewed as a sequence of basic striping units (BSUs). • The BSU size can be an arbitrary number of bytes, and should be chosen to reflect the minimal unit of data access

Two parameters to define the structure • The number of cells • The BSU size • The two parameters are defined when the file is created, and can’t be changed thereafter. • Attach: a new call to do this • Every process in the application must attach every file before it can open the file.

Partition files for parallel access • Define the template of Vesta subfiles • Define the block size used to distribute the data • Data decomposition scheme:

Handling awkward cases • Ghost cells: extra cells are added to make the total a multiple of Hbs × Hn • Ghost cells have no effect on reading and writing • Holes: a hole can be left in the middle of a cell when cells have different lengths • Writing to a hole causes it to be filled with valid data • Call the Vesta stat function to find how much data is contained in the whole file
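
The ghost-cell rule amounts to rounding the cell count up to the next multiple of Hbs × Hn. The sketch below only uses the product of the two partitioning parameters named on the slide; the function name and the example values are assumptions for illustration.

```python
def ghost_cells_needed(num_cells: int, hbs: int, hn: int) -> int:
    """Number of ghost cells that pad num_cells to a multiple of hbs * hn."""
    if num_cells <= 0 or hbs <= 0 or hn <= 0:
        raise ValueError("all arguments must be positive")
    group = hbs * hn
    # (-n) % g is the distance from n up to the next multiple of g.
    return (-num_cells) % group


# Example values only: 10 real cells, Hbs = 2, Hn = 3.
padding = ghost_cells_needed(10, 2, 3)
print(padding, 10 + padding)
```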

Data ordering

Features of the Vesta system • Key feature: the capability to perform direct access from a compute node to an I/O node without referencing any centralized metadata • The form of the abstraction • The 2-d structure of BSUs within cells • The interface used to access the abstraction • Partitioning is also an innovative feature • The partitioning is defined in advance, and then processes can perform independent accesses to any part of their partition (subfile)

Implementation • Dedicated I/O nodes • A client library linked with application code running on the compute nodes • A server that runs on the I/O nodes • Achieves direct access from a compute node to the I/O node • Metadata is distributed among all the I/O nodes • The client can identify the I/O nodes to contact using a combination of the file metadata, the partitioning parameters, and the offset and count of the data

Access to Metadata • Vesta objects: files, cells and Xrefs • Each I/O node maintains the Vesta objects in a memory-mapped table. • The I/O nodes are logically numbered • Each entry in the table contains the file name, its owner ID, group and access permissions, creation, access, and last modification times, the number of cells, the BSU size, the base and highest numbered I/O nodes used, and the current file status. • A 7-bit uniquifier field distinguishes two files or Xrefs with different names whose names hash to the same value • A 1-bit field distinguishes files from Xrefs • An 8-bit level field is used to number the cells of a file
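
To make the three ID fields concrete, here is a hedged sketch that packs them into one integer. The slide gives only the field widths (7, 1 and 8 bits). The presence of a name-hash field, its 48-bit width, and the order of the fields are assumptions made for this example, and the function names are hypothetical.

```python
HASH_BITS = 48   # ASSUMPTION: width of the name hash is not given on the slide
UNIQ_BITS = 7    # from the slide: uniquifier
TYPE_BITS = 1    # from the slide: file vs. Xref
LEVEL_BITS = 8   # from the slide: cell number within a file


def pack_object_id(name_hash: int, uniquifier: int, is_xref: bool, level: int) -> int:
    """Pack the fields into one integer (field order is an assumption)."""
    if not 0 <= name_hash < (1 << HASH_BITS):
        raise ValueError("name_hash out of range")
    if not 0 <= uniquifier < (1 << UNIQ_BITS):
        raise ValueError("uniquifier out of range")
    if not 0 <= level < (1 << LEVEL_BITS):
        raise ValueError("level out of range")
    object_id = name_hash
    object_id = (object_id << UNIQ_BITS) | uniquifier
    object_id = (object_id << TYPE_BITS) | int(is_xref)
    object_id = (object_id << LEVEL_BITS) | level
    return object_id


def unpack_object_id(object_id: int) -> tuple[int, int, bool, int]:
    """Inverse of pack_object_id."""
    level = object_id & ((1 << LEVEL_BITS) - 1)
    object_id >>= LEVEL_BITS
    is_xref = bool(object_id & 1)
    object_id >>= TYPE_BITS
    uniquifier = object_id & ((1 << UNIQ_BITS) - 1)
    name_hash = object_id >> UNIQ_BITS
    return name_hash, uniquifier, is_xref, level


# Two names that collide on the same hash stay distinct through the uniquifier.
first = pack_object_id(0xBEEF, 0, False, 0)
second = pack_object_id(0xBEEF, 1, False, 0)
print(first != second, unpack_object_id(pack_object_id(0xBEEF, 5, True, 3)))
```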

Attaching and opening • Attach the file to the application: access the metadata to get parameters such as the base and maximal I/O nodes, the number of cells, and the BSU size • Open a subfile: call the open function to set the partitioning parameters that define which subfile is being accessed

Directory Structure • Vesta files are accessed directly by hashing their pathnames, so Vesta does not need to maintain directories to find files. • To make it easy for users to organize their files, a hierarchical structure of directories is created using Xrefs. • Xrefs simply contain lists of internal IDs of files and other Xrefs.

Access to File Data • Access is done by providing a byte offset and a byte count • Vesta does not have a separate seek function • File data is not cached on compute nodes • Three mechanisms for reducing access latency: • Use of buffer caches on the I/O node • Asynchronous I/O operations • Explicit prefetch and flush operations

Sharing • Vesta supports sharing in two main ways: • Partition the file into disjoint subfiles that can be accessed with no synchronization among the sharing processes • Share a subfile • Each process can have an independent file pointer into the shared subfile • Or all processes can share a single pointer • When an application process opens a subfile for the first time, it gets a local, private pointer. • When a pointer is shared, a random I/O node is chosen, and the pointer is moved to that I/O node. The identity of this node and the pointer’s ID on that node are passed to all processes that share its use. When a data access based on a shared pointer is performed, the accessing node first communicates with the I/O node holding the pointer. The current pointer value is returned to the accessing node.
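
The shared-pointer protocol above can be simulated in a single process. This is a sketch under assumptions, not Vesta's code: the class and function names are hypothetical, and advancing the pointer by the byte count of each access is an assumption (the slide only says the current value is returned).

```python
import random


class IONode:
    """Toy I/O node that can host shared file pointers."""

    def __init__(self, node_id: int) -> None:
        self.node_id = node_id
        self.pointers: dict[int, int] = {}  # pointer ID -> current offset
        self._next_pointer_id = 0

    def host_pointer(self, offset: int) -> int:
        """Take over a pointer and return its ID on this node."""
        pointer_id = self._next_pointer_id
        self._next_pointer_id += 1
        self.pointers[pointer_id] = offset
        return pointer_id

    def fetch_and_advance(self, pointer_id: int, count: int) -> int:
        """Return the current pointer value, then advance it (ASSUMPTION)."""
        offset = self.pointers[pointer_id]
        self.pointers[pointer_id] = offset + count
        return offset


def share_pointer(io_nodes: list[IONode], local_offset: int, rng: random.Random) -> tuple[int, int]:
    """Move a private pointer to a randomly chosen I/O node.

    Returns (node ID, pointer ID), the pair handed to every sharing process.
    """
    node = rng.choice(io_nodes)
    return node.node_id, node.host_pointer(local_offset)


nodes = [IONode(i) for i in range(4)]
node_id, pointer_id = share_pointer(nodes, 100, random.Random(0))
# Two processes access 50 bytes each through the shared pointer.
first_offset = nodes[node_id].fetch_and_advance(pointer_id, 50)
second_offset = nodes[node_id].fetch_and_advance(pointer_id, 50)
print(node_id, first_offset, second_offset)
```

Every shared-pointer access costs one extra round trip to the hosting I/O node, which is the price paid for a single, consistent pointer value.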

Concurrency Control • Concurrency control is needed when: • Data is written to a shared subfile • Overlapping subfiles are accessed using independent offsets • An application interleaves file metadata operations that also affect the file data • One application writes a file while others read it • Vesta uses a fast token-passing mechanism among the I/O nodes to guarantee the atomicity of requests that span multiple I/O nodes, and to provide sequential consistency and linearizability among requests • When the token reaches the last I/O node, that node sends an acknowledgement to the requesting compute node.

Concurrency Control • Each I/O node maintains a set of 64 token buckets, each with an in counter and an out counter • Each file is assigned to one bucket of the set • Each time a token is sent, the out counter is incremented • When a node receives a token, it first tries to match the token’s value with the value of the bucket’s in counter. Tokens that do not match are delayed until the tokens that should be processed before them arrive and increment the in counter.
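
The in/out counter matching can be illustrated with a small simulation. It is a sketch under assumptions: the names are hypothetical, a token is stamped with the sender's out counter before the increment, and the file-to-bucket rule (file ID modulo 64) is made up for the example because the slide does not give it.

```python
NUM_BUCKETS = 64  # from the slide


def bucket_for_file(file_id: int) -> int:
    """ASSUMPTION: hypothetical file-to-bucket assignment."""
    return file_id % NUM_BUCKETS


class TokenBucket:
    def __init__(self) -> None:
        self.in_counter = 0
        self.out_counter = 0
        self.delayed: dict[int, str] = {}  # token value -> waiting request


class IONode:
    def __init__(self) -> None:
        self.buckets = [TokenBucket() for _ in range(NUM_BUCKETS)]
        self.processed: list[str] = []

    def send_token(self, bucket_index: int) -> int:
        """Stamp a token with the out counter, then increment the counter."""
        bucket = self.buckets[bucket_index]
        value = bucket.out_counter
        bucket.out_counter += 1
        return value

    def receive_token(self, bucket_index: int, value: int, request: str) -> None:
        """Process tokens strictly in counter order, delaying early arrivals."""
        bucket = self.buckets[bucket_index]
        bucket.delayed[value] = request
        # Release every token whose turn has come.
        while bucket.in_counter in bucket.delayed:
            self.processed.append(bucket.delayed.pop(bucket.in_counter))
            bucket.in_counter += 1


sender, receiver = IONode(), IONode()
bucket = bucket_for_file(133)
tokens = [(sender.send_token(bucket), f"req{i}") for i in range(3)]
# Deliver out of order: token 1, then token 2, then token 0.
for value, request in (tokens[1], tokens[2], tokens[0]):
    receiver.receive_token(bucket, value, request)
print(receiver.processed)
```

Tokens 1 and 2 wait in the bucket until token 0 arrives, so the receiver handles the requests in the order in which the sender issued them.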

Structures for Storing Data • Block lists for cells are maintained at the I/O nodes • All I/O node metadata, including the block lists, is pinned into memory • The block list of each cell is organized as a 16-ary tree.
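
One common way to index a 16-ary tree is to read the logical block number as base-16 digits, one digit per tree level. The sketch below shows that idea only. The fixed depth, the function name, and the digit-per-level layout are assumptions, since the slide does not describe how Vesta lays out the tree.

```python
FANOUT = 16
BITS_PER_LEVEL = 4  # log2(16)


def tree_path(block_index: int, depth: int) -> list[int]:
    """Child slot to follow at each level, from the root down to the leaf.

    ASSUMPTION: a fixed-depth tree where level k consumes one base-16 digit
    of the logical block index, most significant digit first.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    if not 0 <= block_index < FANOUT ** depth:
        raise ValueError("block_index does not fit in a tree of this depth")
    return [
        (block_index >> (BITS_PER_LEVEL * level)) & (FANOUT - 1)
        for level in reversed(range(depth))
    ]


path = tree_path(0x1A3, 3)
print(path)
```

With a fan-out of 16, a tree of depth d addresses 16^d blocks, so lookups stay shallow even for large cells.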

Conclusion • Vesta is a new approach to parallel I/O file systems • The basis of this approach is the 2-d structure of Vesta files: one dimension represents the parallelism and the other represents sequential data • Vesta introduces the notion of partitioning the data • Vesta is fully implemented on an IBM SP1 multicomputer, using the EUI-H message passing library and the MPX job control facility • Vesta is the base technology for the AIX Parallel I/O File System used with the IBM SP2

Questions • What is the 2-dimensional structure of Vesta files? • What is the key feature of the Vesta Parallel File System? • What mechanism does the Vesta file system use to control concurrency?
