

  1. Architectural and Design Issues in the General Parallel File System IBM Research Lab in Haifa May 12, 2002 Benny Mandler - mandler@il.ibm.com

  2. Agenda • What is GPFS? • a file system for deep computing • GPFS uses • General architecture • How does GPFS meet its challenges - architectural issues • performance • scalability • high availability • concurrency control

  3. Scalable Parallel Computing • RS/6000 SP Scalable Parallel Computer • 1-512 nodes connected by high-speed switch • 1-16 CPUs per node (Power2 or PowerPC) • >1 TB disk per node • 500 MB/s full duplex per switch port • Scalable parallel computing enables I/O-intensive applications: • Deep computing - simulation, seismic analysis, data mining • Server consolidation - aggregating file, web servers onto a centrally-managed machine • Streaming video and audio for multimedia presentation • Scalable object store for large digital libraries, web servers, databases, ... What is GPFS?

  4. GPFS addresses SP I/O requirements • High Performance - multiple GB/s to/from a single file • concurrent reads and writes, parallel data access - within a file and across files • fully parallel access to both file data and metadata • client caching enabled by distributed locking • wide striping, large data blocks, prefetch • Scalability • scales up to 512 nodes (N-way SMP): storage nodes, file system nodes, disks, adapters... • High Availability • fault tolerance via logging, replication, RAID support • survives node and disk failures • Uniform access via shared disks - single-image file system • High capacity - multiple TB per file system, 100s of GB per file • Standards compliant (X/Open 4.0 "POSIX") with minor exceptions What is GPFS?

  5. GPFS vs. local and distributed file systems on the SP2 • Native AIX File System (JFS) • No file sharing - an application can only access files on its own node • Applications must do their own data partitioning • DCE Distributed File System (successor to AFS) • Application nodes (DCE clients) share files on a server node • Switch is used as a fast LAN • Coarse-grained (file or segment level) parallelism • Server node is a performance and capacity bottleneck • GPFS Parallel File System • GPFS file systems are striped across multiple disks on multiple storage nodes • Independent GPFS instances run on each application node • GPFS instances use storage nodes as "block servers" - all instances can access all disks

  6. Tokyo Video on Demand Trial • Video on Demand for new "borough" of Tokyo • Applications: movies, news, karaoke, education ... • Video distribution via hybrid fiber/coax • Trial "live" since June '96 • Currently 500 subscribers • 6 Mbit/sec MPEG video streams • 100 simultaneous viewers (75 MB/sec) • 200 hours of video on line (700 GB) • 12-node SP-2 (7 distribution, 5 storage)

  7. Engineering Design • Major aircraft manufacturer • Using CATIA for large designs, Elfini for structural modeling and analysis • SP used for modeling/analysis • Using GPFS to store CATIA designs and structural modeling data • GPFS allows all nodes to share designs and models GPFS uses

  8. Shared Disks - Virtual Shared Disk architecture • File systems consist of one or more shared disks • An individual disk can contain data, metadata, or both • Each disk is assigned to a failure group • Data and metadata are striped to balance load and maximize parallelism • Recoverable Virtual Shared Disk (VSD) for accessing disk storage • Disks are physically attached to SP nodes • VSD allows clients to access disks over the SP switch • VSD client looks like a disk device driver on the client node • VSD server executes I/O requests on the storage node • VSD supports JBOD or RAID volumes, fencing, multi-pathing (where physical hardware permits) • GPFS only assumes a conventional block I/O interface General architecture
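GPFS only requires a conventional block I/O interface from the layer below. A minimal sketch of what such an interface could look like, with hypothetical names (this is not the actual VSD or GPFS API):

```python
# Minimal sketch of the conventional block I/O interface GPFS assumes
# over a shared (virtual) disk. Class and method names are illustrative,
# not the real VSD API.
class SharedDisk:
    def __init__(self, disk_id, block_size=256 * 1024):
        self.disk_id = disk_id
        self.block_size = block_size
        self._blocks = {}                      # block number -> bytes (stand-in for media)

    def read_block(self, block_no):
        """Return one block; unwritten blocks read back as zeros."""
        return self._blocks.get(block_no, bytes(self.block_size))

    def write_block(self, block_no, data):
        """Write exactly one block; any node sharing the disk may call this."""
        assert len(data) == self.block_size
        self._blocks[block_no] = data
```

Every GPFS instance issues such reads and writes directly against the shared disks; all coordination between nodes happens above this layer.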

  9. GPFS Architecture Overview • Implications of Shared Disk Model • All data and metadata on globally accessible disks (VSD) • All access to permanent data through disk I/O interface • Distributed protocols, e.g., distributed locking, coordinate disk access from multiple nodes • Fine-grained locking allows parallel access by multiple clients • Logging and Shadowing restore consistency after node failures • Implications of Large Scale • Support up to 4096 disks of up to 1 TB each (4 Petabytes) • The largest system in production is 75 TB • Failure detection and recovery protocols to handle node failures • Replication and/or RAID protect against disk / storage node failure • On-line dynamic reconfiguration (add, delete, replace disks and nodes; rebalance file system) General architecture

  10. GPFS Architecture - Node Roles • Three types of nodes: file system, storage, and manager • Each node can perform any of these functions • File system nodes • run user programs, read/write data to/from storage nodes • implement virtual file system interface • cooperate with manager nodes to perform metadata operations • Manager nodes (one per “file system”) • global lock manager • recovery manager • global allocation manager • quota manager • file metadata manager • admin services fail over • Storage nodes • implement block I/O interface • shared access from file system and manager nodes • interact with manager nodes for recovery (e.g. fencing) • file data and metadata striped across multiple disks on multiple storage nodes General architecture

  11. GPFS Software Structure General architecture

  12. Disk Data Structures: Files • Large block size allows efficient use of disk bandwidth • Fragments reduce space overhead for small files • No designated "mirror", no fixed placement function: • Flexible replication (e.g., replicate only metadata, or only important files) • Dynamic reconfiguration: data can migrate block-by-block • Multi-level indirect blocks • Each disk address: • list of pointers to replicas • Each pointer: • disk id + sector no. General architecture
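The addressing scheme on this slide can be sketched as follows; the field names are hypothetical, chosen only to mirror the bullet points above:

```python
# Illustrative sketch of the file addressing structure described above:
# a disk address is a list of replica pointers (disk id + sector number),
# and large files are reached through multi-level indirect blocks.
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class ReplicaPointer:
    disk_id: int          # which shared disk holds this copy
    sector_no: int        # starting sector of the block on that disk

@dataclass
class DiskAddress:
    replicas: List[ReplicaPointer] = field(default_factory=list)   # one entry per replica

@dataclass
class IndirectBlock:
    # Each entry is either a data-block address (leaf level) or a
    # lower-level indirect block, forming the multi-level tree.
    entries: List[Union[DiskAddress, "IndirectBlock"]] = field(default_factory=list)
```

Because each disk address carries its own list of replica pointers, replication can be enabled per file or per metadata class, and individual blocks can migrate during reconfiguration simply by rewriting their pointers.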

  13. Large File Block Size • Conventional file systems store data in small blocks to pack data more densely • GPFS uses large blocks (256KB default) to optimize disk transfer speed Performance
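A back-of-the-envelope calculation shows the effect; the seek time, rotational delay, and media rate below are assumed values for a disk of that era, not figures from the presentation:

```python
# Rough illustration of why large blocks raise effective disk throughput:
# the fixed positioning cost is amortized over a much larger transfer.
# All disk parameters here are assumptions, not numbers from the slides.
def effective_throughput_mb_s(block_size_bytes, seek_ms=8.0, rotation_ms=4.0,
                              media_rate_mb_s=10.0):
    transfer_ms = block_size_bytes / (media_rate_mb_s * 1e6) * 1000.0
    total_ms = seek_ms + rotation_ms + transfer_ms
    return (block_size_bytes / 1e6) / (total_ms / 1000.0)

print(effective_throughput_mb_s(4 * 1024))      # ~0.3 MB/s with 4 KB blocks
print(effective_throughput_mb_s(256 * 1024))    # ~6.9 MB/s with 256 KB blocks
```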

  14. Parallelism and consistency • Distributed locking - acquire the appropriate lock for every operation - used for updates to user data • Centralized management - conflicting operations are forwarded to a designated node - used for file metadata • Distributed locking + centralized hints - used for space allocation • Central coordinator - used for configuration changes (chart: I/O slowdown shows up as additional I/O activity rather than token-server overload)

  15. Parallel File Access From Multiple Nodes • GPFS allows parallel applications on multiple nodes to access non-overlapping ranges of a single file with no conflict • Global locking serializes access to overlapping ranges of a file • Global locking based on "tokens" which convey access rights to an object (e.g. a file) or subset of an object (e.g. a byte range) • Tokens can be held across file system operations, enabling coherent data caching in clients • Cached data discarded or written to disk when token is revoked • Performance optimizations: required/desired ranges, metanode, data shipping, special token modes for file size operations Performance
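A simplified sketch of the byte-range conflict check such a token server performs is shown below; real GPFS tokens also carry required/desired ranges and use revocation callbacks, which are omitted here:

```python
# Simplified sketch of byte-range token conflict checking: a write token
# conflicts with any overlapping token held by another node, so writers
# to non-overlapping ranges of the same file never interfere.
from collections import namedtuple

Token = namedtuple("Token", "node mode start end")      # mode: 'read' or 'write'

class ByteRangeTokenServer:
    def __init__(self):
        self.granted = []                                # tokens currently held

    def _conflicts(self, a, b):
        overlap = a.start < b.end and b.start < a.end
        return overlap and a.node != b.node and "write" in (a.mode, b.mode)

    def acquire(self, node, mode, start, end):
        want = Token(node, mode, start, end)
        for held in [t for t in self.granted if self._conflicts(want, t)]:
            self.granted.remove(held)        # holder flushes or discards its cached
                                             # data, then the token is taken away
        self.granted.append(want)
        return want

srv = ByteRangeTokenServer()
srv.acquire("node1", "write", 0, 1 << 20)                # node1 writes the first MB
srv.acquire("node2", "write", 1 << 20, 2 << 20)          # disjoint range: nothing revoked
```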

  16. Deep Prefetch for High Throughput • GPFS stripes successive blocks across successive disks • Disk I/O for sequential reads and writes is done in parallel • GPFS measures application "think time", disk throughput, and cache state to automatically determine the optimal degree of parallelism • Prefetch algorithms now recognize strided and reverse sequential access • Accepts hints • Write-behind policy (example: application reads at 15 MB/sec, each disk reads at 5 MB/sec, three I/Os executed in parallel) Performance
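The striping rule behind this (successive file blocks on successive disks) and the parallel prefetch it enables can be sketched as follows; round-robin placement is shown only for illustration, as it is not the only striping method GPFS offers:

```python
# Sketch of round-robin striping and the parallel prefetch it enables.
# Round-robin placement is illustrative, not the only GPFS striping method.
from concurrent.futures import ThreadPoolExecutor

def block_location(file_block_no, num_disks):
    disk = file_block_no % num_disks              # successive blocks -> successive disks
    block_on_disk = file_block_no // num_disks
    return disk, block_on_disk

def prefetch(read_block_fn, first_block, depth, num_disks):
    """Issue `depth` sequential block reads in parallel, one per underlying disk."""
    with ThreadPoolExecutor(max_workers=depth) as pool:
        futures = [pool.submit(read_block_fn, *block_location(first_block + i, num_disks))
                   for i in range(depth)]
        return [f.result() for f in futures]
```

With three disks each delivering 5 MB/s, a prefetch depth of three keeps all of them busy and gives the application the 15 MB/s shown in the example above.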

  17. GPFS Throughput Scaling for Non-cached Files • Hardware: Power2 wide nodes, SSA disks • Experiment: sequential read/write from a large number of GPFS nodes to a varying number of storage nodes • Result: throughput increases nearly linearly with the number of storage nodes • Bottlenecks: • Micro Channel bus limits node throughput to 50 MB/s • system throughput limited by the available storage nodes Scalability

  18. Disk Data Structures: Allocation map • Segmented block allocation map: • Each segment contains bits representing blocks on all disks • Each segment is a separately lockable unit • Minimizes contention for the allocation map when writing files from multiple nodes • Allocation manager service provides hints on which segments to try • A similar scheme is used for the inode allocation map Scalability
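A sketch of such a segmented map, with hypothetical names: each segment describes blocks on every disk and carries its own lock, so nodes allocating for different files rarely contend, and the allocation manager only needs to hand out segment hints:

```python
# Illustrative sketch of a segmented block allocation map. Each segment
# covers some blocks on every disk and is a separately lockable unit;
# the allocation manager hands out hints about which segment to try.
import threading

class AllocationSegment:
    def __init__(self, seg_no, num_disks, blocks_per_disk_in_segment):
        self.seg_no = seg_no
        self.lock = threading.Lock()                 # one lock per segment
        self.free = {d: set(range(blocks_per_disk_in_segment))
                     for d in range(num_disks)}      # free-block bits, per disk

    def allocate(self, disk):
        with self.lock:                              # only this segment is locked
            return self.free[disk].pop() if self.free[disk] else None

class AllocationManager:
    def __init__(self, segments):
        self.segments = segments
        self.next_hint = 0

    def segment_hint(self):
        """Point different nodes at different segments to avoid contention."""
        seg = self.segments[self.next_hint]
        self.next_hint = (self.next_hint + 1) % len(self.segments)
        return seg
```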

  19. High Availability - Logging and Recovery • Problem: detect/fix file system inconsistencies after a failure of one or more nodes • All updates that may leave inconsistencies if uncompleted are logged • Write-ahead logging policy: log record is forced to disk before dirty metadata is written • Redo log: replaying all log records at recovery time restores file system consistency • Logged updates: • I/O to replicated data • directory operations (create, delete, move, ...) • allocation map changes • Other techniques: • ordered writes • shadowing High Availability
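The write-ahead, redo-only discipline described above can be sketched as follows (the file format and helper names are invented for the example):

```python
# Sketch of write-ahead redo logging for metadata updates. The rule shown:
# the log record is forced to disk before the dirty metadata it describes,
# so replaying the log at recovery time restores consistency.
import json
import os

class RedoLog:
    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())                 # log record reaches disk first

    def replay(self, apply_fn):
        """At recovery, re-apply every logged update; redo records are idempotent."""
        if not os.path.exists(self.path):
            return
        with open(self.path) as f:
            for line in f:
                apply_fn(json.loads(line))

def logged_update(log, metadata, key, value):
    log.append({"key": key, "value": value})     # 1. write-ahead: log before metadata
    metadata[key] = value                        # 2. only then write the dirty metadata
```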

  20. Node Failure Recovery • Application node failure: • force-on-steal policy ensures that all changes visible to other nodes have been written to disk and will not be lost • all potential inconsistencies are protected by a token and are logged • file system manager runs log recovery on behalf of the failed node • after successful log recovery tokens held by the failed node are released • actions taken: restore metadata being updated by the failed node to a consistent state, release resources held by the failed node • File system manager failure: • new node is appointed to take over • new file system manager restores volatile state by querying other nodes • New file system manager may have to undo or finish a partially completed configuration change (e.g., add/delete disk) • Storage node failure: • Dual-attached disk: use alternate path (VSD) • Single attached disk: treat as disk failure High Availability

  21. Handling Disk Failures • When a disk failure is detected • The node that detects the failure informs the file system manager • File system manager updates the configuration data to mark the failed disk as "down" (quorum algorithm) • While a disk is down • Read one / write all available copies • "Missing update" bit set in the inode of modified files • When/if disk recovers • File system manager searches inode file for missing update bits • All data & metadata of files with missing updates are copied back to the recovering disk (one file at a time, normal locking protocol) • Until missing update recovery is complete, data on the recovering disk is treated as write-only • Unrecoverable disk failure • Failed disk is deleted from configuration or replaced by a new one • New replicas are created on the replacement or on other disks
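The "read one / write all available copies" rule and the missing-update bit can be sketched as below; the code reuses the SharedDisk sketch from the slide 8 example and is illustrative only:

```python
# Simplified sketch of replicated block I/O while a disk is down: reads use
# any available copy, writes go to every available copy, and a missing-update
# flag is set whenever a replica could not be written.
class ReplicatedBlock:
    def __init__(self, replicas, block_no=0):
        self.replicas = replicas                     # SharedDisk-like objects
        self.block_no = block_no

    def read(self, disk_status):
        for disk in self.replicas:
            if disk_status[disk.disk_id] == "up":
                return disk.read_block(self.block_no)        # read one copy
        raise IOError("no replica of this block is available")

    def write(self, data, disk_status, inode):
        wrote_all = True
        for disk in self.replicas:
            if disk_status[disk.disk_id] == "up":
                disk.write_block(self.block_no, data)        # write all available copies
            else:
                wrote_all = False
        if not wrote_all:
            inode["missing_update"] = True           # triggers copy-back on disk recovery
```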

  22. Cache Management (diagram: the total cache is split into a general pool plus several fixed-block-size pools, each managed by a clock list and tracked with sequential/random and optimal/total statistics; the general pool also supports merge and re-map) • Balance pool sizes dynamically according to usage patterns • Avoid fragmentation - internal and external • Unified steal across pools • Periodic re-balancing
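As an illustration of the clock-list mechanism named in the diagram, here is a minimal sketch of one fixed-block-size pool with second-chance replacement; the unified steal and dynamic re-balancing across pools are omitted:

```python
# Sketch of one fixed-block-size buffer pool managed by a clock
# (second-chance) list. Unified steal and re-balancing across pools,
# which the slide also mentions, are not shown.
class ClockPool:
    def __init__(self, capacity):
        self.capacity = capacity
        self.frames = []                 # each frame: [block_id, data, referenced_bit]
        self.hand = 0

    def access(self, block_id, load_fn):
        for frame in self.frames:
            if frame[0] == block_id:
                frame[2] = True                          # hit: mark as recently used
                return frame[1]
        data = load_fn(block_id)                         # miss: fetch the block
        if len(self.frames) < self.capacity:
            self.frames.append([block_id, data, True])
        else:
            while self.frames[self.hand][2]:             # sweep, clearing reference bits
                self.frames[self.hand][2] = False
                self.hand = (self.hand + 1) % self.capacity
            self.frames[self.hand] = [block_id, data, True]   # steal this frame
            self.hand = (self.hand + 1) % self.capacity
        return data
```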

  23. Epilogue • Used on six of the ten most powerful supercomputers in the world, including the largest (ASCI White) • Installed at several hundred customer sites, on clusters ranging from a few nodes with less than a TB of disk up to 512 nodes with 140 TB of disk in 2 file systems • IP rich - ~20 filed patents • State of the art • TeraSort • world record of 17 minutes • using a 488-node SP: 432 file system nodes and 56 storage nodes (332 MHz 604e) • 6 TB of total disk space • References • GPFS home page: http://www.haifa.il.ibm.com/projects/storage/gpfs.html • FAST 2002: http://www.usenix.org/events/fast/schmuck.html • TeraSort: http://www.almaden.ibm.com/cs/gpfs-spsort.html • Tiger Shark: http://www.research.ibm.com/journal/rd/422/haskin.html
