  1. GPFS, Glamour and Panache: Towards a Global Scalable File Service. Renu Tewari, IBM Almaden

  2. World-wide Sharing of Data • Example: grid computing workflows • Massive amounts of data gathered at remote locations • Distributed compute resources process input data and store their results • Results are analyzed or visualized by people at many locations • [Figure: Ingest → Compute → Visualize pipeline]

  3. What is the Function of a Filesystem? • Namespace • Consistency • Security and access control • Fault tolerance • Performance

  4. How do we get to a Scalable Global FS? • Scale up: enterprise-class storage systems; large parallel machines • Scale out: Internet companies and “cloud computing” are clearly embracing scale-out HW architectures; large clusters of commodity servers connected by high-bandwidth Ethernet; for many applications, scale-out has proven to be cheaper than scale-up • The key question is how network bandwidth compares to disk bandwidth (a rough calculation is sketched below)
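
A rough back-of-envelope comparison of the network-vs-disk bandwidth question, as an illustration only; the link speed and per-disk throughput below are assumed values, not figures from the slides.

    # Illustrative only: compare one network link's bandwidth against commodity disks.
    # The numbers are assumptions for the sketch, not measurements from the talk.

    link_gbps = 10          # assumed 10 GbE client/server link
    disk_mb_per_s = 100     # assumed streaming throughput of one commodity disk

    link_mb_per_s = link_gbps * 1000 / 8          # ~1250 MB/s
    disks_to_saturate = link_mb_per_s / disk_mb_per_s

    print(f"Link delivers ~{link_mb_per_s:.0f} MB/s; "
          f"~{disks_to_saturate:.0f} disks saturate it.")
    # With only about a dozen disks matching one fast link, scale-out designs
    # stripe data across many servers so that the network, not a single server,
    # sets the ceiling.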

  5. System Scaling Up: BlueGene/L • Chip: 2 processors, 5.6 GF/s, 4 MB • Compute card: 2 chips (4 processors), 11.2 GF/s, 1.0 GB • Node card: 16 compute cards (16 compute, 0-2 I/O cards), 64 processors, 180 GF/s, 16 GB • Rack: 32 node cards, 2048 processors, 5.6 TF/s, 512 GB, 20 kW, 1 m2 footprint • System: 64 racks (64x32x32), 131,072 processors, 360 TF/s, 32 TB, 1.3 MW • Coming soon: BlueGene/P, BlueGene/C, BlueGene/Q

  6. Technology keeps growing … • Technology blurs the distinction among SAN, LAN, and WAN • “Why can’t I use whatever fabric I have for storage?” • “Why do I need separate fabrics for storage and communication?” • “Why can’t I share files across my entire SAN like I can with Ethernet?” • “Why can’t I access storage over long distances?”

  7. Outline • GPFS • Scaling the “local” filesystem for a large data store • pNFS • Scaling the pipe for remote data access • WAN Caching • Bringing data closer to where it is consumed • Federated filesystems • Separating the physical location from the logical namespace

  8. GPFS

  9. Parallel File Systems 101 • Manage data sharing in large data stores • Two architectures: • Asymmetric • E.g., PVFS2, Lustre, High Road • Symmetric • E.g., GPFS, GFS, Polyserve

  10. GPFS File System Nodes Switching fabric (System or storage area network) Shared disks (SAN-attached or network block device) GPFS: parallel file access Parallel Cluster File System Based on Shared Disk (SAN) Model • Cluster – fabric-interconnected nodes (IP, SAN, …) • Shared disk - all data and metadata on fabric-attached disk • Parallel - data and metadata flows from all of the nodes to all of the disks in parallel under control of distributed lock manager.

  11. What GPFS is NOT • Not a client-server file system like NFS, DFS, or AFS: no single-server bottleneck, no protocol overhead for data transfer • Not like some SAN file systems (e.g. Veritas, CXFS): no distinct metadata server (which is a potential bottleneck)

  12. GPFS Features • High capacity: large number of disks in a single FS • High-bandwidth access to a single file: large block size, full-stride I/O to RAID; wide striping – one file over all disks; small files spread uniformly; multiple nodes read/write in parallel • High availability – nodes: log recovery restores consistency after a node failure; data: RAID or internal replication; on-line management (add/remove disks or nodes without un-mounting); rolling software upgrade • Single-system image, standard POSIX interface: distributed locking for read/write semantics • Advanced file system features: memory-mapped files, direct I/O, replication, snapshots, quotas, extended attributes

  13. Distributed Locking • GPFS architecture: clients share data and metadata, synchronized by means of distributed locking • Distributed locking is essential to: synchronize file system operations for POSIX semantics; synchronize updates to file system metadata on disk to prevent corruption; maintain cache consistency of data and metadata cached on different nodes • Synchronization requires communication • Problem: sending a lock message for every operation will not scale • Solution: a token-based lock manager allows “lock caching”

  14. Token-based Lock Manager • Token server grants tokens • A token represents the right to read, cache, and/or update a particular piece of data or metadata • A single message to the token server allows repeated access to the same object • A conflicting operation on another node will revoke the token • Force-on-steal: dirty data & metadata are flushed to disk when a token is stolen • Numerous optimizations: large byte ranges, lookup/open
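
To make the token mechanism concrete, here is a minimal sketch of a token server with lock caching and revoke-on-conflict. It is an illustration of the idea only; the class and method names are invented for the example and do not correspond to GPFS internals.

    # Minimal sketch of token-based lock caching (illustrative, not GPFS code).
    # A node asks the token server once, then reuses the cached token locally;
    # a conflicting request from another node triggers a revoke (token steal).

    class TokenServer:
        def __init__(self):
            self.holders = {}            # object_id -> (node, mode)

        def acquire(self, node, obj, mode):
            holder = self.holders.get(obj)
            if holder and holder[0] is not node and self._conflicts(holder[1], mode):
                self._revoke(holder[0], obj)      # holder flushes dirty data, then releases
            self.holders[obj] = (node, mode)
            return True                           # node may now cache this token

        def _conflicts(self, held, wanted):
            return held == "write" or wanted == "write"

        def _revoke(self, node, obj):
            node.flush_and_release(obj)           # force-on-steal: write back before losing token


    class Node:
        def __init__(self, name, server):
            self.name, self.server = name, server
            self.tokens, self.dirty = set(), set()

        def write(self, obj, data):
            if (obj, "write") not in self.tokens:          # only the first access messages the server
                self.server.acquire(self, obj, "write")
                self.tokens.add((obj, "write"))
            self.dirty.add(obj)                            # subsequent writes are purely local

        def flush_and_release(self, obj):
            self.dirty.discard(obj)                        # flush to shared disk (elided)
            self.tokens = {t for t in self.tokens if t[0] != obj}


    # Example: two nodes write the same object; the second write steals the token.
    srv = TokenServer()
    a, b = Node("A", srv), Node("B", srv)
    a.write("inode42", b"x")      # A acquires and caches the write token
    b.write("inode42", b"y")      # conflict: A is revoked (flushes), B gets the token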

  15. Consistency Control: Locks and Tokens [Figure: client systems run applications over GPFS file structures (Foo.1, Foo.2, Foo.3) and data buffers (blocks 2, 19, 237, 936); locks give local consistency on each node, while tokens obtained from the token servers give a cached capability with global consistency] • Token servers: request / release, revoke, multiple modes, distributed via hash, recoverable service

  16. Distributed locking: optimizations • Byte range locking • Opportunistic locking: small “required” range and large “requested” range • Minimizes lock traffic for piecewise sharing • Metadata token optimizations • During lookup, client speculatively acquires tokens to open the file • Inode reuse: node deleting a file caches its inode for subsequent create • Directory scans: readdir() “prefetches” inode tokens for subsequent stat() • “metanode” dynamically assigned to handle metadata updates when write-sharing occurs • Block allocation • Segmented allocation map allows each node to allocate space on all disks for data striping • New optimizations for multi-cluster scaling • Token manager split across multiple nodes (hashed tokens) • Metanode dynamically migrated to minimize communication latency
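
The opportunistic byte-range idea in the first bullet above can be sketched as follows; this is an assumed simplification for illustration, not the actual GPFS negotiation protocol.

    # Illustrative sketch of opportunistic byte-range tokens: a node asks for the
    # small range it needs ("required") plus a large speculative range ("requested");
    # the server grants as much of the requested range as does not conflict.

    def grant_range(required, requested, held_by_others):
        """required, requested: (start, end) tuples; held_by_others: list of (start, end)."""
        start, end = requested
        for (o_start, o_end) in held_by_others:
            if o_end <= required[0]:
                start = max(start, o_end)          # shrink from the left around other holders
            elif o_start >= required[1]:
                end = min(end, o_start)            # shrink from the right
            else:
                raise RuntimeError("conflict on the required range: must revoke first")
        return (start, end)

    # Example: a node needs bytes [0, 1 MB) but asks for [0, 1 TB); another node
    # holds [4 MB, 8 MB), so the grant is [0, 4 MB) and later writes below 4 MB
    # need no further lock traffic.
    print(grant_range((0, 1 << 20), (0, 1 << 40), [(4 << 20, 8 << 20)]))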

  17. Performance • 130+ GB/s parallel read/write achieved on ASC Purple • 12 GB/s parallel read / 8 GB/s write on a 40-node IA64 Linux cluster with 120 TB SAN-attached FAStT 600 • 7.8 GB/s streaming read, 7.3 GB/s write to a single file on p690 • 15.8 GB/s read / 14.5 GB/s write aggregate to multiple files • 1000+ files/sec/node (3051 files/sec on 3 Linux nodes, 50 drives on one FAStT controller; 256M files in 24 hours 26 mins)

  18. The Largest GPFS systems

  19. GSA Experience (August 2006): 142 TB total space

  20. GPFS on ASC Purple/C Supercomputer • 1536-node, 100 TF pSeries cluster at Lawrence Livermore National Laboratory • 2 PB GPFS file system (one mount point) • 500 RAID controller pairs, 11000 disk drives • 130 GB/s parallel I/O measured to a single file (135GB/s to multiple files)

  21. GPFS Directions • Petascale computing work (HPCS) • File systems to a trillion files • Space to exabytes • Clusters to 50K nodes • Improved robustness with increasing numbers of components • Data Integrity • Fixing more than 2 failures • Covering the entire data path (disks, controllers, interconnect) • Storage management • Extensions for content managers, intelligent data store facilities • File Serving • pNFS (Parallel NFSv4) • Metadata intensive workload improvements • Large cache exploitation • Distributed file systems • Disk-based caching at remote sites • Independent, disconnected operation (even writes) • Multi-update reconciliation • Multi-site replication for availability • Automated operations improvements • Improved installation process • Automatic setting of most performance control knobs • GUI for management overview

  22. GPFS beyond the Cluster • Problem: In a large data center, multiple clusters and other nodes need to share data over the SAN • Solution: eliminate the notion of a fixed cluster • “control” nodes for admin, managing locking, recovery, … • File access from “client” nodes • Client nodes authenticate to control nodes to mount a file system • Client nodes are trusted to enforce access control • Clients still directly access disk data and metadata

  23. Multi-cluster GPFS and Grid Storage • Multi-cluster supported in GPFS • Nodes of one cluster mount file system(s) of another cluster • Tight consistency identical to that of a single cluster • User IDs mapped across clusters • Server and remote client clusters can have different user ID spaces • File user IDs on disk may not match credentials on the remote cluster • Pluggable infrastructure allows user IDs to be mapped across clusters • Multi-cluster works within a site or across a WAN • Lawrence Berkeley Labs (NERSC): multiple supercomputer clusters share large GPFS file systems • DEISA (European computing grid): RZG, CINECA, IDRIS, CSC, CSCS, UPC, IBM; pSeries and other clusters interconnected with a multi-gigabit WAN; multi-cluster GPFS in “friendly-user” production 4/2005 • TeraGrid: SDSC, NCSA, ANL, PSC, CalTech, IU, UT, …; sites linked via 30 Gb/sec dedicated WAN; 500 TB GPFS file system at SDSC shared across sites, 1500 nodes; multi-cluster GPFS in production 10/2005

  24. Remote Data Access Problem: Gap Between Users and Parallel File Systems • Q: Use an existing parallel file system? A: Not likely • Q: Build “yet another” parallel file system? A: See above

  25. Possible Solutions: Heterogeneous and Remote Data Access

  26. pNFS, NFSv4.1

  27. Why not vanilla NFS? • NFS is frequently used for remote data access • Pros: simple, reliable, easy to manage; practically every OS comes with a free NFS client; lots of vendors make NFS servers • Cons: NFS “loose consistency” not suitable for some applications; capacity and bandwidth don’t scale; attempts to make NFS scale make it unreliable and difficult to manage • Single-server bottleneck (capacity and throughput limitations): to overcome the bottleneck, the user has to explicitly partition the namespace across servers; partitioning and re-partitioning is tedious and hard to get right • Various attempts to scale conventional NFS, each with limitations, e.g. clustered NFS servers (NFS on top of a cluster file system): each server handles all data, but each client only mounts from one server

  28. NFSv4 Optimized for WAN • Compound ops • Redirection support • Delegations

  29. Model for Remote Data Access over NFS

  30. Parallel NFS (pNFS) – extending NFS to HPC • pNFS is included in NFSv4.1 to support parallel access • Minor version extension of the NFSv4 protocol • NFSv4 client mounts from a pNFS-enabled NFSv4 metadata server • A non-pNFS-enabled client gets data directly from the metadata server • A pNFS-enabled client is redirected to the (multiple) data servers that store the file • Benefits: an individual high-performance client can benefit from striping across multiple data servers, e.g. a 10 GbE client with 1 GbE data servers (or servers with few disks); multiple clients can use the data servers like a parallel file system • Still NFS semantics, vs. most parallel file systems which support POSIX • Three variants of the protocol: File – data servers are NFSv4 file servers; Object – data servers are ANSI T10 OSD servers; Block – data servers are block devices (Fibre Channel FCP, iSCSI)
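
As a hedged illustration of why striping across data servers helps, the sketch below maps a file offset to a data server under a simple round-robin layout. The stripe size, server count, and function name are assumptions for the example; pNFS itself leaves these details to the layout type.

    # Illustrative round-robin (dense) striping: which data server, and which
    # offset on that server, hold a given file byte offset. Parameters are
    # assumed, not taken from the pNFS specification.

    STRIPE_SIZE = 1 << 20        # assumed 1 MiB stripe unit
    NUM_DATA_SERVERS = 4         # assumed size of the data-server list in the layout

    def locate(offset):
        stripe_no = offset // STRIPE_SIZE
        server = stripe_no % NUM_DATA_SERVERS          # round-robin across servers
        server_stripe = stripe_no // NUM_DATA_SERVERS  # index of the stripe on that server
        server_offset = server_stripe * STRIPE_SIZE + offset % STRIPE_SIZE
        return server, server_offset

    # A fast client reading sequentially touches all four data servers in turn,
    # so its aggregate bandwidth is not limited by any single data server.
    for off in range(0, 8 << 20, 1 << 20):
        print(off, locate(off))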

  31. pNFS Timeline [Figure: timeline from 2003 to 2008 with milestones: Dec. 2003 NEPS Workshop @ CITI; MSST05 pNFS paper published; NFSv4.1 draft and block draft (Dec.); object draft (June); May–Aug. pNFS/PVFS2 prototype @ SNL; ICS06 pNFS small writes; SC06 pNFS/PVFS2 prototype demonstration; Sept. PDSI established; HPDC07 Direct-pNFS; ongoing Connectathon/Bakeathon interoperability testing; planned Linux kernel acceptance]

  32. pNFS protocol extensions to NFSv4.1 • The basic pNFS architecture: clients call the metadata server to create/delete/open/close; clients get a layout (map of the data) from the metadata server; the layout directs clients to data servers for read/write operations • pNFS protocol operations: GETDEVICELIST – retrieve the list of data-server devices; GETDEVICEINFO – retrieve information (identity and attributes) for a single device; LAYOUTGET(file_handle, type, byte_range) – asks the server for a “layout” that tells the client how to map file byte ranges to data servers, valid until returned or recalled by the server; LAYOUTRETURN(file_handle, byte_range) – tells the server that the client is no longer accessing the byte range, so the server can release state; LAYOUTCOMMIT(file_handle, byte_range, attributes, layout) – tells the server to update metadata (e.g. EOF and block status) to make the data visible to other clients; CB_LAYOUTRECALL – server tells the client to stop using a layout; CB_RECALLABLE_OBJ_AVAIL – tells the client that a delegation is now available for a file for which it was not previously available (a client-side sketch of this flow follows below)
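
The operations above compose into a simple client-side flow, sketched below as pseudocode. The function names mirror the protocol operations, but the mds object and its methods are invented stand-ins for illustration, not an actual NFS client API; a real client performs these steps inside the kernel layout driver.

    # Illustrative client-side pNFS write flow built from the operations listed above.
    # 'mds' (metadata server handle) and the per-segment attributes are hypothetical.

    def pnfs_write(mds, path, offset, data):
        fh = mds.OPEN(path)                                   # metadata ops go to the MDS
        layout = mds.LAYOUTGET(fh, type="file",
                               byte_range=(offset, offset + len(data)))
        try:
            for segment in layout.segments:                   # layout maps byte ranges to devices
                dev = mds.GETDEVICEINFO(segment.device_id)
                dev.WRITE(segment.fh, segment.offset, data[segment.slice])
            mds.LAYOUTCOMMIT(fh, (offset, offset + len(data)),
                             attributes={"size": offset + len(data)},
                             layout=layout)                   # make new size/blocks visible
        finally:
            mds.LAYOUTRETURN(fh, (offset, offset + len(data)))  # release layout state
            mds.CLOSE(fh)
    # If the server needs the layout back earlier, it issues CB_LAYOUTRECALL and
    # the client stops using the layout and returns it, as listed above.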

  33. pNFS Architecture [Figure: NFSv4.1 clients (SGI, Linux, AIX) perform parallel FS I/O directly against the parallel FS storage nodes and exchange NFSv4.1 metadata with the NFSv4.1 server / parallel FS metadata server, which in turn talks to the storage nodes via the parallel FS management protocol]

  34. [MSST05] pNFS Internal Architecture

  35. pNFS Discussion • Heterogeneity: high layout driver development cost; reduce costs with standard storage protocols • Transparency: file system semantics are a mix of NFSv4 and the underlying file system, e.g., the pNFS/PVFS2 prototype caches only metadata • Security: storage protocol dependent; may restrict WAN capability • Validation of the opaque layout design: negligible layout retrieval overhead; maintains file system independence; pNFS clients can utilize multiple storage protocols simultaneously; the underlying file system must manage layout information • Layout driver enables scalability and performance: avoids the indirection penalty and single-server bottleneck; escapes NFSv4 block size restrictions

  36. [ICS06] pNFS Small Write Performance • Scientific I/O workloads: 90% of requests transfer 10% of the data; gap between small and large I/O requests • File system characteristics • Distributed file systems: asynchronous client requests; client write-back cache; server write gathering • Parallel file systems: large transfer buffers limit asynchrony; write-through or no cache • Use a mixture of distributed and parallel storage protocols to increase overall write performance

  37. pNFS performance scaling • Look at pNFS performance in two scenarios: a high-performance (e.g. 10GbE) client driving a server cluster, and parallel/cluster clients driving a server cluster • Test runs on a small GPFS cluster at CITI • Single 10GbE client accessing a server cluster of varying size: iozone -aec -i 0 -i 1 -+n -r 256K -s 1G -f /mnt/nfs4/io1 -U /mnt/nfs4 • Varying number of 1GbE clients accessing a 1GbE server cluster: mpirun -np {1-3} -machinefile /tmp/gpfs.nodes /nas/test/IOR -b 1g -t 32k -g -e -F -o /mnt/ior/t1

  38. Accessing Remote Data: Approaches • Admin-controlled replication: requires grouping data a priori and then triggering replication; replicas need to be kept in sync with the master copy; replica updates are done periodically or are admin driven • Explicit file transfer, e.g., GridFTP: requires keeping track of multiple copies of the data; requires planning ahead (all data must be copied in advance); new data has to get migrated back • File access over the WAN, e.g., GPFS multi-cluster over TeraGrid: a true shared file system; one copy of the data; updates propagated and visible instantly; requires a high-bandwidth, low-latency WAN; requires a constant connection between sites, e.g., for the duration of a parallel job

  39. WAN Caching

  40. WAN Caching: The Wish List • Scalable cache for high-performance applications • Single namespace (not absolutely necessary but convenient) • Files moved transparently upon access (or pre-staged, e.g. with “touch”) • Files can be stored on disk at the cache site for as long as they are needed • Files automatically moved back as they age out of the cache or are needed elsewhere • Possible (and desirable) to support disconnected operation • Cache “browsable” by the client: a local ‘ls’ should show the same list of files as the remote site • Cache consistent both locally across nodes and with the remote system

  41. Panache: Key Design Points • A clustered cache embedded in GPFS caches data from a remote cluster • Applications: grid computing – allowing data to move transparently during grid workflows; content distribution for global enterprises; “follow-the-sun” engineering teams • Key features: a cluster-wide cache designed for scalability and performance; integrated with GPFS for consistent and concurrent data access from all nodes of the cluster • No special WAN protocol or hardware: uses the soon-to-be industry standard pNFS (NFSv4.1) protocol for parallel access; no additional hardware at either end, unlike WAFS vendor appliances

  42. Panache Architecture • Completely transparent to GPFS clients and applications • An application running on the local cache sees the same view as if it were running on the remote cluster • Any client can browse the cache as a traditional filesystem • Cache presents a POSIX-compliant view of the cached data • Cache data can be re-exported over NFS • pNFS for parallel data transfer over the WAN • Cache consistency is “tunable”: NFSv4 file/dir delegations for across-the-WAN consistency • Supports disconnected operation: resync after reconnect with a policy for conflict resolution • Supports policies for pre-fetching and pre-populating the cache along with on-demand access • Supports whole-file or partial-file caching • Plan for P2P-style sharing across caches • [Figure: application nodes at the local GPFS cache site act as pNFS clients, fetching data over pNFS across the WAN from the remote cluster’s pNFS data servers and metadata server]
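
A minimal sketch of the on-demand, whole-file caching behavior described above, assuming a local cache directory in front of a pNFS-mounted remote tree; the paths and the mtime-based revalidation policy are assumptions for the example, not a statement of how Panache is actually implemented.

    # Illustrative whole-file cache in front of a remote (e.g. pNFS-mounted) tree.
    # Paths and the revalidation policy are assumptions for the sketch.

    import os, shutil

    REMOTE_ROOT = "/mnt/remote_pnfs"     # hypothetical pNFS mount of the remote cluster
    CACHE_ROOT = "/gpfs/cache"           # hypothetical local GPFS cache fileset

    def connected():
        return os.path.ismount(REMOTE_ROOT)     # crude connectivity check for the sketch

    def cached_open(relpath, mode="rb"):
        remote = os.path.join(REMOTE_ROOT, relpath)
        local = os.path.join(CACHE_ROOT, relpath)
        os.makedirs(os.path.dirname(local), exist_ok=True)

        if connected():
            stale = (not os.path.exists(local) or
                     os.path.getmtime(local) < os.path.getmtime(remote))
            if stale:
                shutil.copy2(remote, local)      # on-demand fetch into the cache
        # When disconnected, serve whatever is cached; writes would be queued and
        # resynchronized (with a conflict-resolution policy) after reconnect.
        return open(local, mode)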

  43. Federated Filesystems

  44. Project Glamour • An architecture for filesystem virtualization • Decouple the file from the physical fileserver/filesystem • Unified “file-based” view of storage across an enterprise • Consolidate the “fragmented view” of heterogeneous independent fileservers • Could be distributed across multiple geographical locations • Simplify management • Single point of access for setting policies, control, capacity management

  45. Approaches to Filesystem Virtualization • Layer-7 gateway/proxy appliance • In the data path can limit scalability • Uses standard protocols (including NFSv3) • Works with existing file servers and NAS appliances • E.g., NeoPath, Acopia • Server redirection • Requests forwarded among backend servers using proprietary protocols • Vendor lock-in • Does not work with existing/heterogeneous systems • Can be optimized for performance • E.g., Spinnaker, Panasas • Client redirection • Redirect clients to the server with the required data • Needs protocol support (CIFS/NFS v4 only) • Does not work with v3 clients • Works with standard v4, CIFS clients • Works well for distributed clients and servers • Can be very scalable • E.g., MS-Dfs, Glamour, NuView

  46. Glamour: Common Namespace [Figure: a single common namespace tree rooted at / (users/home, storage/proj, tools, …) built from filesets exported by the Almaden, Watson, and Haifa sites. Examples: Almaden exports local /home/alice as /users/home/alice; Watson exports local /home/bob as /users/home/bob; /usr/local/glamor is exported as /storage/proj/latest/glamor, and Almaden:/usr/local/glamor is replicated as /storage/proj/backup/glamor]

  47. Architecture [Figure: admin tools (CLI, GUI, etc.) and a Glamour admin component with a Glamor LDAP plugin manage a central namespace LDAP server (OpenLDAP slapd with an LDB_LDAP namespace DB and the Glamour schema, plus NIS map and RNFS schemas). Client interface: NFSv3 clients resolve the namespace via autofs and the NIS map schema; NFSv4 clients are redirected via the RNFS schema. Data server interface: GPFS and Glamor-enabled NFSv4 servers run a local Glamour agent]

  48. Namespace Features
  • Common Glamour schema: understood by the agent in a Glamour-enabled fileserver; converted to the desired schema for v3, v4, or GPFS clients
  • Fileset LDIF entry (Glamour schema):
      dn: DmDir=/dir1,DmNamespace=/,ou=locations,dc=company,dc=com
      objectclass: DmDir
      DmDir: /fileset11
      DmParent: /dir1
      DmNamespace: /
      DmType: Leaf
  • Locations LDIF entry:
      dn: DmLocation=server:/path/at/server,ou=locations,dc=company,dc=com
      objectclass: DmLocation
      DmLocation: server:/path/at/server
      DmDir: /fileset11
      DmParent: /dir1
      DmNamespace: /
      DmSite: us
      DmNfsType: 3
      DmHost: DmDomain: DmPath: DmVersion: DmStatus: (left empty in this example)
  • Multi-protocol support:
    • v3 client: LDAP option in /etc/auto.* with the NIS map schema
      dn: cn=dir1,nisMapName=/foo,dc=company,dc=com
      cn: dir1
      nisMapEntry: -fstype=autofs ldap://server,nisMapName=/foo/dir1,dc=…
      objectclass: nisObject
    • Linux v4 server: LDAP option in /etc/exports with the rnfs schema
      dn: cn=dir1,ou=locations,dc=company,dc=com
      cn: dir1
      fslocations: server1:/path/at/server1
      fslocations: server2:/path/at/server2
      referral: TRUE
      objectclass: rnfs
    • CIFS in progress
  • Multi-system support: AIX, Linux, …
  • Nested mounts: only supported for v4 currently; no cycles
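
To show how a client-side agent might resolve a fileset’s physical locations from the schema above, here is a hedged sketch using the python ldap3 library; the server address, base DN, and anonymous bind are placeholders following the example entries on this slide, and the function itself is illustrative rather than part of Glamour.

    # Illustrative lookup of fileset locations from an LDAP namespace database,
    # following the rnfs-style entry above (cn + fslocations attributes).
    # Server, base DN, and credentials are placeholders for the sketch.

    from ldap3 import Server, Connection, ALL

    def fileset_locations(name, ldap_host="ldap://server",
                          base="ou=locations,dc=company,dc=com"):
        conn = Connection(Server(ldap_host, get_info=ALL), auto_bind=True)
        conn.search(base,
                    f"(&(objectclass=rnfs)(cn={name}))",
                    attributes=["fslocations"])
        locs = []
        for entry in conn.entries:
            locs.extend(str(v) for v in entry.fslocations)   # e.g. "server1:/path/at/server1"
        conn.unbind()
        return locs

    # An automounter or a Glamour-enabled NFSv4 server could then mount, or refer
    # clients to, one of the returned server:/path locations.
    print(fileset_locations("dir1"))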

  49. Towards a Federated Filesystem Standard • IBM is continuing the Glamour effort to incorporate NFSv4, NFSv3, CIFS, and even local file systems such as GPFS • Prototype on Linux as a reference implementation • Collaboration with NetApp and the extended NFS community towards an IETF standard for administering federated filesystems

  50. Questions • www.almaden.ibm.com/StorageSystems/areas/filesys • tewarir at us ibm com
