
The Google File System - GFS



  1. The Google File System - GFS Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google (2003 ACM Symposium on Operating Systems Principles) • Presented by Binh Tran • 03/23/2010

  2. Outline • Introduction • Design overview • GFS architecture • System Interactions, operations • Fault tolerance and diagnosis • Measurements • Conclusion

  3. Introduction • The rapidly growing demands of Google’s data processing needs. • A distributed file system must provide performance, scalability, reliability, and availability. • Re-examine traditional choices and explore radically different points in the design space. • Component failures (application bugs, OS bugs, human errors, failures of disks, memory, connectors, networking, and power supplies) are the norm, not the exception -> constant monitoring, error detection, fault tolerance, and automatic recovery. • Fast-growing data sets of many TBs, and multi-GB files are common -> I/O operation and block sizes have to be revisited. • Most files are mutated by appending new data rather than overwriting existing data -> performance optimization and atomicity guarantees focus on appends. • An atomic append operation is needed so that multiple clients can append concurrently to a file without extra synchronization between them. • Over 1000 storage nodes, over 300 TB of disk storage, heavily accessed by hundreds of clients on distinct machines. • Google File System (GFS)

  4. Design Overview (1/7): Assumptions • Built from many inexpensive commodity components that often fail -> the system must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis. • Stores a few million files, each typically 100 MB or larger in size. Small files must be supported, but we need not optimize for them. • Two kinds of reads: large streaming reads (hundreds of KBs, 1 MB or more) and small random reads (a few KBs at some arbitrary offset). • Large, sequential writes that append data to files. Once written, files are seldom modified again. • Multiple clients concurrently append to the same file. Files are often used as producer-consumer queues: hundreds of producers, running one per machine, will concurrently append to a file. Atomicity with minimal synchronization overhead is essential. • High sustained bandwidth is more important than low latency.

  5. Design Overview (2/7): Interface • A familiar file system interface, though it does not implement a standard API such as POSIX. • Files are organized in directories and identified by path-name. • Usual operations: create, delete, open, close, read, and write files. • Other operations: snapshot and record append. • Snapshot: creates a copy of a file or a directory tree at low cost. • Record append: allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client’s append.

  6. Design Overview (3/7): Architecture (1/2) • A GFS cluster consists of a single master and multiple chunk-servers, and is accessed by multiple clients.

  7. Design Overview (3/7): Architecture (2/2) • Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64-bit chunk handle assigned by the master at the time of chunk creation. • Chunk-servers store chunks on local disks and read or write chunk data specified by a chunk handle and byte range. Each chunk is replicated on multiple chunk-servers (3 replicas by default). • The master maintains all file system metadata: namespace, access control information, the mapping from files to chunks, and the current locations of chunks. • The master also controls system-wide activities: chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunk-servers. • The master periodically communicates with each chunk-server in HeartBeat messages to give it instructions and collect its state. • Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunk-servers. • Neither clients nor chunk-servers cache file data (clients do cache metadata, however).

  8. Design Overview (4/7): Single Master (1/2) • A single master simplifies the design and enables the master to make sophisticated chunk placement and replication decisions using global knowledge. • Clients never read and write file data through the master. • A client asks the master which chunk-servers it should contact. It caches this information for a limited time and interacts with the chunk-servers directly for many subsequent operations.

  9. Design Overview (4/7): Single Master (2/2) • Interactions for a simple read (see the sketch below): • Using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file. • The client sends the master a request {file name, chunk index}. • The master replies with the corresponding chunk handle and the locations of the replicas. • The client caches this info using the file name and chunk index as the key. • The client then sends a request {chunk handle, byte range within that chunk} to one of the replicas. • Further reads of the same chunk require no more client-master interaction until the cached information expires or the file is reopened. • In fact, the client typically asks for multiple chunks in the same request, and the master can include the info for chunks immediately following those requested.
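To make the read path concrete, here is a minimal client-side sketch of the lookup-and-read sequence above. The object and method names (master.lookup(), replica.read()) are hypothetical stand-ins for the real RPCs, and only the happy path is shown.

```python
# Illustrative sketch of the GFS read path described on this slide; not the real client API.

GFS_CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size

class GFSClient:
    def __init__(self, master):
        self.master = master
        self.cache = {}  # (file_name, chunk_index) -> (chunk_handle, replica_locations)

    def read(self, file_name, offset, length):
        # 1. Translate (file name, byte offset) into a chunk index.
        chunk_index = offset // GFS_CHUNK_SIZE
        key = (file_name, chunk_index)

        # 2-4. Ask the master only on a cache miss; cache its reply.
        if key not in self.cache:
            self.cache[key] = self.master.lookup(file_name, chunk_index)
        chunk_handle, replicas = self.cache[key]

        # 5. Read the byte range within that chunk directly from one replica.
        chunk_offset = offset % GFS_CHUNK_SIZE
        return replicas[0].read(chunk_handle, chunk_offset, length)
```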

  10. Design Overview (5/7): Chunk Size • Chunk size is one of the key design parameters. • 64 MB chunk size, much larger than typical file system block sizes. • Each chunk replica is stored on a chunk-server. • Large chunk size: • Advantages: • Reduces clients’ need to interact with the master. • A client is more likely to perform many operations on a given chunk, which reduces network overhead. • Reduces the size of the metadata stored on the master. • Disadvantages: • A small file consists of a small number of chunks, perhaps just one. The chunk-servers storing those chunks may become hot spots if many clients are accessing the same file. • In practice this problem is fixed by storing such executables with a higher replication factor. • A potential long-term solution: allow clients to read data from other clients in such situations.

  11. Design Overview (6/7): Metadata • Three major types of metadata: (1) the file and chunk namespaces, (2) the mapping from files to chunks, (3) the locations of each chunk’s replicas. • All metadata is kept in the master’s memory. • (1) and (2) are also kept persistent by logging mutations to an operation log stored on the master’s local disk and replicated on remote machines. • Using a log allows us to update the master state simply, reliably, and without risking inconsistencies in the event of a master crash. • The master does not store chunk location information persistently. Instead, it asks each chunk-server about its chunks at master startup and whenever a chunk-server joins the cluster.

  12. Design Overview (6/7): Metadata - In-Memory Data Structures (1/3) • Since metadata is stored in memory, master operations are fast. • It is easy and efficient for the master to periodically scan through its entire state in the background. • The periodic scanning is used to implement: • Chunk garbage collection • Re-replication in the presence of chunk-server failures • Chunk migration to balance load and disk space usage across chunk-servers • Is the memory-only approach a limitation? • The master maintains less than 64 bytes of metadata for each 64 MB chunk (see the estimate below). • Most chunks are full because most files contain many chunks, with only the last chunk partially filled. • If it becomes necessary to support even larger file systems, the cost of adding extra memory to the master is a small price to pay for the simplicity, reliability, performance, and flexibility gained by storing the metadata in memory.
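As a rough sanity check on the memory-only claim, the following back-of-the-envelope calculation (my own arithmetic, using the 64-bytes-per-64-MB figure above and the ~300 TB cluster size mentioned in the introduction) estimates the chunk metadata footprint:

```python
# Rough estimate of master memory needed for chunk metadata, under the figures quoted above.
CHUNK_SIZE = 64 * 1024 * 1024          # 64 MB per chunk
METADATA_PER_CHUNK = 64                # < 64 bytes of metadata per chunk

storage = 300 * 1024**4                # e.g. 300 TB of file data (figure from the intro slide)
num_chunks = storage // CHUNK_SIZE
metadata_bytes = num_chunks * METADATA_PER_CHUNK

print(f"{num_chunks:,} chunks -> about {metadata_bytes / 1024**2:.0f} MB of chunk metadata")
# ~4.9 million chunks -> about 300 MB, easily held in the master's memory.
```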

  13. Design Overview (6/7): Metadata - Chunk Locations (2/3) • The master does not keep a persistent record of which chunk-servers have a replica of a given chunk. • The master controls all chunk placement and monitors chunk-server status with regular HeartBeat messages. • This eliminates the problem of keeping the master and chunk-servers in sync as chunk-servers join and leave the cluster, change names, fail, restart, and so on. • Another reason for this design is that a chunk-server has the final word over what chunks it does or does not have on its own disks. Errors on a chunk-server may cause chunks to vanish spontaneously (a disk may go bad and be disabled), or an operator may rename a chunk-server.
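A minimal sketch of how chunk locations could be maintained purely from chunk-server reports, as described above. The class and method names are hypothetical, and the rest of the master state and error handling are omitted.

```python
# Sketch only: the master learns chunk locations from reports, never from disk.
from collections import defaultdict

class Master:
    def __init__(self):
        # chunk_handle -> set of chunk-server ids; never written to disk
        self.chunk_locations = defaultdict(set)

    def handle_report(self, server_id, chunk_handles):
        """Called at chunk-server startup/join and on periodic HeartBeats.
        The chunk-server has the final word about which chunks it holds."""
        for handle in chunk_handles:
            self.chunk_locations[handle].add(server_id)

    def handle_server_down(self, server_id):
        # Simply forget the server; locations are re-learned from future reports.
        for holders in self.chunk_locations.values():
            holders.discard(server_id)
```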

  14. Design Overview (6/7): Metadata - Operation Log (3/3) • The operation log contains a historical record of critical metadata changes. • It also serves as a logical timeline that defines the order of concurrent operations. • Files and chunks are eternally identified by the logical times at which they were created. • The operation log is critical: it must be stored reliably, and changes must not be made visible to clients until the metadata changes are persistent. • The master recovers its file system state by replaying the operation log. • The master checkpoints its state whenever the log grows beyond a certain size, so that it can recover by loading the latest checkpoint from local disk and replaying only the limited number of log records after that. • The master switches to a new log file and creates the new checkpoint in a separate thread. • A failure during checkpointing does not affect correctness because the recovery code detects and skips incomplete checkpoints.
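The log-then-apply discipline and checkpoint replay can be sketched as follows. This is an illustrative approximation: it assumes a single local JSON-lines log file that is started fresh at each checkpoint, and it does not show the real record format or remote log replication.

```python
# Sketch of "make the mutation durable before it becomes visible", plus recovery by replay.
import json, os

class OperationLog:
    def __init__(self, path="oplog.jsonl"):
        self.path = path
        self.log = open(path, "a")

    def append(self, record):
        # Durable (and, in GFS, remotely replicated) before clients see the change.
        self.log.write(json.dumps(record) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())

def recover(checkpoint_state, apply, log_path="oplog.jsonl"):
    """Load the latest checkpoint, then replay the (short) log written after it."""
    state = dict(checkpoint_state)
    if os.path.exists(log_path):
        with open(log_path) as f:
            for line in f:
                state = apply(state, json.loads(line))
    return state
```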

  15. Design Overview (7/7): Consistency Model • A relaxed consistency model that supports highly distributed applications well but remains relatively simple and efficient to implement. • Guarantees by GFS: • File namespace mutations (e.g. file creation) are atomic, handled exclusively by the master. • Namespace locking guarantees atomicity and correctness. • The master’s operation log defines a global total order of these operations. • Data mutations: writes or record appends. • Write: data is written at an application-specified file offset. • Record append: data is appended atomically at least once even in the presence of concurrent mutations.

  16. System Interactions (1/4): Leases and Mutation Order (1/2) • Goal: minimize the master’s involvement in all operations. • Interactions: data mutations, atomic record append, and snapshot. • Leases and mutation order: • Mutation: changes the contents or metadata of a chunk, such as a write or an append operation. Each mutation is performed at all of the chunk’s replicas. • Leases maintain a consistent mutation order across replicas. • Primary: the master grants a chunk lease to one of the replicas. The primary picks a serial order for all mutations to the chunk. • The lease mechanism minimizes management overhead at the master. • The master may sometimes try to revoke a lease before it expires; it can safely grant a new lease to another replica only after the old lease expires.
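A small sketch of lease bookkeeping on the master under this scheme. The 60-second initial lease timeout matches the paper; the data structures and the "pick any replica as primary" rule are simplified assumptions.

```python
# Sketch of lease bookkeeping; extensions piggybacked on HeartBeats are not shown.
import time

LEASE_DURATION = 60.0  # seconds (initial timeout from the paper)

class LeaseTable:
    def __init__(self):
        self.leases = {}  # chunk_handle -> (primary_server, expiry_time)

    def grant_or_get_primary(self, chunk_handle, replicas):
        primary, expiry = self.leases.get(chunk_handle, (None, 0.0))
        if time.time() < expiry:
            return primary                      # existing lease still valid
        primary = replicas[0]                   # simplification: pick any replica
        self.leases[chunk_handle] = (primary, time.time() + LEASE_DURATION)
        return primary
```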

  17. System Interactions (1/4): Leases and Mutation Order - Write Control and Data Flow (2/2) The control flow of a write (sketched in code after this slide): 1. The client asks the master which chunk-server holds the current lease for the chunk and the locations of the other replicas. 2. The master replies with the identity of the primary and the locations of the secondary replicas. The client caches this data for future mutations. 3. The client pushes the data to all the replicas, in any order. Each chunk-server stores the data in an internal LRU buffer cache until the data is used or aged out. 4. Once all the replicas have acknowledged receiving the data, the client sends a write request to the primary. The request identifies the data pushed earlier. The primary assigns consecutive serial numbers to all the mutations it receives, possibly from multiple clients, which provides the necessary serialization, and applies the mutation to its own local state in serial number order. 5. The primary forwards the write request to all secondary replicas. Each secondary replica applies mutations in the same serial number order assigned by the primary. 6. The secondaries reply to the primary once they have completed the operation. 7. The primary replies to the client; any errors encountered at any of the replicas are reported to the client.
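A hedged client-side sketch of those seven steps. The method names (find_lease_holder, push_data, write) are invented stand-ins for the real RPCs, and retry logic is omitted.

```python
# Sketch of the write protocol on this slide, from the client's point of view.
import uuid

def gfs_write(master, chunk_handle, data):
    # Steps 1-2: learn who the primary is and where the secondaries are (cacheable).
    primary, secondaries = master.find_lease_holder(chunk_handle)

    # Step 3: push the data to all replicas, in any order; each replica buffers it
    # in an LRU cache until it is used or aged out.
    data_id = uuid.uuid4().hex            # identifies the pushed data in later requests
    for replica in [primary] + secondaries:
        replica.push_data(chunk_handle, data_id, data)

    # Step 4: once every replica has the data, send the write request to the primary,
    # which assigns a serial number and applies the mutation locally.
    # Steps 5-6 happen inside the primary: it forwards the request to the secondaries,
    # which apply it in the same serial order and then reply.
    ok, errors = primary.write(chunk_handle, data_id)

    # Step 7: the primary reports success or any replica errors back to the client.
    if not ok:
        raise IOError(f"write failed on some replicas: {errors}")
```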

  18. System Interactions (2/4): Data Flow • Goal: decouple the flow of data from the flow of control to use the network efficiently. • Fully utilize each machine’s network bandwidth: data is pushed linearly along a chain of chunk-servers rather than distributed in some other topology (such as a tree). • Avoid network bottlenecks and high-latency links: each machine forwards the data to the “closest” machine in the network topology that has not received it. • Minimize the latency to push through all the data: pipeline the data transfer over TCP connections, which works well on a switched network with full-duplex links (see the estimate below).
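The benefit of pipelining can be quantified with the paper's latency model: transferring B bytes to R replicas along the chain takes roughly B/T + R·L, where T is the per-link throughput and L is the per-hop latency. The calculation below recomputes the paper's 1 MB example; the 1 ms per-hop latency is an assumed round figure.

```python
# Ideal pipelined transfer time: B/T + R*L.
B = 1 * 1000 * 1000        # 1 MB payload
T = 100 * 1000 * 1000 / 8  # 100 Mbps full-duplex links, in bytes/second
R = 3                      # replicas in the chain
L = 0.001                  # assumed ~1 ms per-hop latency

ideal_seconds = B / T + R * L
print(f"~{ideal_seconds * 1000:.0f} ms to push 1 MB to 3 replicas")  # ~83 ms
```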

  19. System Interactions (3/4): Atomic Record Appends • Record append (an atomic append operation): • Traditional write: the client specifies the offset at which data is to be written. Concurrent writes to the same region are not serializable: the region may end up containing data fragments from multiple clients. • Record append: the client specifies only the data. GFS appends it to the file at least once atomically (as a continuous sequence of bytes) at an offset of GFS’s choosing and returns that offset to the client. • With traditional writes, distributed applications would need complicated and expensive synchronization, such as a distributed lock manager. With record append, files can serve as multiple-producer/single-consumer queues or contain merged results from many different clients.
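A sketch of the decision the primary makes for a record append, under the rule that a record must fit entirely within one chunk: if it does not fit, the chunk is padded and the client retries on the next chunk. The padding-and-retry behaviour follows the paper; the object interface is invented for illustration.

```python
# Sketch of the primary's record-append logic; not the real chunk-server code.
GFS_CHUNK_SIZE = 64 * 1024 * 1024

def record_append(primary_chunk, record):
    """primary_chunk: object with .length plus .pad_to() and .append() methods."""
    if primary_chunk.length + len(record) > GFS_CHUNK_SIZE:
        primary_chunk.pad_to(GFS_CHUNK_SIZE)   # pad the chunk; secondaries do the same
        return None                            # tell the client to retry on the next chunk
    offset = primary_chunk.length
    primary_chunk.append(record)               # secondaries append at the same offset
    return offset                              # the GFS-chosen offset returned to the client
```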

  20. System Interactions (4/4): Snapshot • Snapshot makes a copy of a file or a directory tree almost instantaneously, while minimizing any interruptions of ongoing mutations. • Uses standard copy-on-write techniques. • When the master receives a snapshot request: • It first revokes any outstanding leases on the chunks about to be snapshotted. • After the leases have been revoked or have expired, the master logs the operation to disk. • It applies this log record to its in-memory state by duplicating the metadata for the source file or directory tree. • After the snapshot, when a client wants to write to a chunk C, it sends a request to the master to find the current lease holder. • If the reference count for chunk C is greater than one, the master asks each chunk-server that has a current replica of C to create a new chunk called C’. • Because the new chunk is created on the same chunk-servers as the original, the data can be copied locally, and request handling is no different from that for any other chunk. • The master grants one of the replicas a lease on the new chunk C’ and replies to the client.
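A simplified sketch of snapshot metadata handling with reference counts. Lease revocation, logging, and updating the file-to-chunk mapping after copy-on-write are omitted, and the data structures are stand-ins for the master's real metadata.

```python
# Sketch of copy-on-write snapshot bookkeeping on the master.
class SnapshotMaster:
    def __init__(self):
        self.files = {}       # path -> list of chunk handles
        self.refcount = {}    # chunk handle -> number of files referencing it

    def snapshot(self, src, dst):
        # Duplicate only the metadata; source and snapshot now share the same chunks.
        chunks = list(self.files[src])
        self.files[dst] = chunks
        for c in chunks:
            self.refcount[c] = self.refcount.get(c, 1) + 1

    def before_write(self, chunk):
        # Copy-on-write: if the chunk is shared, make a new chunk C' first.
        # (Updating the owning file's chunk list is omitted in this sketch.)
        if self.refcount.get(chunk, 1) > 1:
            new_chunk = chunk + "'"            # each replica copies C locally into C'
            self.refcount[chunk] -= 1
            self.refcount[new_chunk] = 1
            return new_chunk
        return chunk
```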

  21. Master Operation (1/5): Namespace Management and Locking (1/2) • The master executes all namespace operations. • It also makes placement decisions, creates new chunks (and hence replicas), and coordinates various system-wide activities to keep chunks fully replicated, balance load across all chunk-servers, and reclaim unused storage. • Multiple operations are allowed to be active, with locks over regions of the namespace to ensure proper serialization. • GFS has no per-directory data structure and no aliases for the same file or directory; it logically represents its namespace as a lookup table mapping full pathnames to metadata. With prefix compression this table is represented efficiently in memory, and each node in the namespace tree has an associated read-write lock. • Each master operation acquires a set of locks before it runs.

  22. Master Operation (1/5): Namespace Management and Locking (2/2) • Example: the locking mechanism can prevent a file /home/user/foo from being created while /home/user is being snapshotted to /save/user. • Snapshot operation: read locks on /home and /save, write locks on /home/user and /save/user. • File creation: read locks on /home and /home/user, a write lock on /home/user/foo. • The two operations will be serialized properly because both try to obtain conflicting locks on /home/user (see the lock-set sketch below). • File creation does not require a write lock on the parent directory: the read lock on the directory name is sufficient to protect the parent directory from deletion. • A nice property of this locking scheme is that it allows concurrent mutations in the same directory, e.g. multiple file creations: each acquires a read lock on the directory name and a write lock on the file name. • The read locks on the directory name prevent deletion, rename, and snapshot. • The write locks on the file names serialize attempts to create a file with the same name twice.
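The lock sets in the example can be computed mechanically: read locks on every ancestor directory name plus a read or write lock on the full pathname itself. A small illustrative helper (not GFS code):

```python
# Compute the lock set for an operation on a full pathname.
def locks_needed(full_path, write=True):
    parts = full_path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    leaf = "/" + "/".join(parts)
    return [(p, "read") for p in ancestors] + [(leaf, "write" if write else "read")]

# File creation of /home/user/foo vs. snapshotting /home/user conflict on /home/user:
print(locks_needed("/home/user/foo", write=True))
# [('/home', 'read'), ('/home/user', 'read'), ('/home/user/foo', 'write')]
print(locks_needed("/home/user", write=True))
# [('/home', 'read'), ('/home/user', 'write')]
```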

  23. Master Operation (2/5): Replica Placement • Hundreds of chunk-servers are spread across many machine racks. • Communication between two machines on different racks may cross one or more network switches. This multi-level distribution presents a unique challenge in distributing data for scalability, reliability, and availability. • Two purposes of the chunk replica placement policy: • Maximize data reliability and availability. • Maximize network bandwidth utilization. • Solution: spread chunk replicas across racks -> some replicas of a chunk will survive and remain available even if an entire rack is damaged or offline, and read traffic for a chunk can exploit the aggregate bandwidth of multiple racks.

  24. Master Operation (3/5): Creation, Re-replication, Rebalancing • Chunk replicas are created for three reasons: chunk creation, re-replication, and rebalancing. • Creating a chunk (see the placement sketch below): • Place new replicas on chunk-servers with below-average disk space utilization. • Limit the number of recent creations on each chunk-server. • Spread replicas of a chunk across racks. • Re-replication: the master re-replicates a chunk as soon as the number of available replicas falls below a user-specified goal, e.g. when: • A chunk-server becomes unavailable. • One of its disks is disabled because of errors. • The replication goal is increased. • Rebalancing: the master rebalances replicas periodically. It examines the current replica distribution and moves replicas for better disk space and load balancing. Policies: gradually fill up a new chunk-server rather than instantly swamping it, and prefer to remove replicas from chunk-servers with below-average free space so as to equalize disk space usage.
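One way to express the chunk-creation placement policy above as code. The scoring, the rack-spreading passes, and the recent-creation limit value are illustrative assumptions, not the master's actual heuristic.

```python
# Sketch of placing new replicas: below-average disk use, creation limit, rack spread.
def place_new_replicas(servers, num_replicas=3, recent_creation_limit=10):
    """servers: list of dicts with 'id', 'rack', 'disk_used_frac', 'recent_creations'."""
    candidates = [s for s in servers if s["recent_creations"] < recent_creation_limit]
    candidates.sort(key=lambda s: s["disk_used_frac"])   # prefer emptier servers

    chosen, racks = [], set()
    for s in candidates:                 # first pass: one replica per rack
        if s["rack"] not in racks:
            chosen.append(s)
            racks.add(s["rack"])
        if len(chosen) == num_replicas:
            return chosen
    for s in candidates:                 # second pass: fill any remaining slots
        if s not in chosen:
            chosen.append(s)
        if len(chosen) == num_replicas:
            break
    return chosen
```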

  25. Master Operation (4/5): Garbage Collection • After a file is deleted, GFS does not immediately reclaim the available physical storage. It does so only lazily during regular garbage collection at both the file and chunk levels. • Goal: simpler and more reliable reclamation. • Mechanism (sketched below): • When a file is deleted: • The master logs the deletion immediately. • The file is renamed to a hidden name that includes the deletion timestamp. • During the master’s regular namespace scan, any such hidden files that have existed for more than three days are removed; until then, they can still be undeleted by renaming them back to normal. • When a hidden file is removed from the namespace, its in-memory metadata is erased. • Orphaned chunks (not reachable from any file) and their metadata are also erased.
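A sketch of the lazy-deletion mechanism, treating the namespace as a plain dictionary and encoding the deletion timestamp in the hidden name. The naming convention is invented; only the overall behaviour (rename on delete, reclaim after three days during the regular scan) follows the slide.

```python
# Sketch of lazy file deletion and the periodic garbage-collection scan.
import time

HIDE_PREFIX = ".deleted_"
GRACE_PERIOD = 3 * 24 * 3600  # three days, as on the slide

def delete_file(namespace, path):
    # Deletion is just a rename to a hidden, timestamped name (plus a log record, not shown).
    namespace[f"{HIDE_PREFIX}{int(time.time())}_{path}"] = namespace.pop(path)

def gc_scan(namespace, now=None):
    now = now or time.time()
    for name in list(namespace):
        if name.startswith(HIDE_PREFIX):
            deleted_at = int(name[len(HIDE_PREFIX):].split("_", 1)[0])
            if now - deleted_at > GRACE_PERIOD:
                del namespace[name]   # in-memory metadata erased; its chunks become orphaned
```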

  26. Master Operation (5/5): Stale Replica Detection • Chunk replicas may become stale if a chunk-server fails and misses mutations to the chunk while it is down. • For each chunk, the master maintains a chunk version number to distinguish between up-to-date and stale replicas. • The master increases the chunk version number whenever it grants a new lease on the chunk. • The master removes stale replicas in its regular garbage collection.
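A minimal sketch of the version-number bookkeeping described above; the class is illustrative, not the real master code.

```python
# Sketch: bump the version on each new lease, flag replicas that report an older one.
class VersionTracker:
    def __init__(self):
        self.current_version = {}   # chunk handle -> latest version known to the master

    def on_lease_grant(self, chunk):
        # Bump the version before any new mutation is allowed on the chunk.
        self.current_version[chunk] = self.current_version.get(chunk, 0) + 1
        return self.current_version[chunk]

    def is_stale(self, chunk, reported_version):
        # A replica that missed mutations while its server was down reports an
        # older version; such replicas are removed in regular garbage collection.
        return reported_version < self.current_version.get(chunk, 0)
```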

  27. Fault Tolerance and Diagnosis (1/2): High Availability • Among hundreds of servers in a GFS cluster, some are bound to be unavailable at any given time. The overall system is kept highly available through two strategies: fast recovery and replication. • Fast recovery: the master and the chunk-servers are designed to restore their state and start in seconds. • Chunk replication: each chunk is replicated on multiple chunk-servers on different racks. • Master replication: the master state is replicated for reliability; its operation log and checkpoints are replicated on multiple machines. Shadow masters provide read-only access to the file system even when the primary master is down.

  28. Fault Tolerance and Diagnosis (2/2): Data Integrity + Diagnostic Tools • Each chunk-server uses checksumming to detect corruption of stored data. • Each chunk-server must independently verify the integrity of its own copy by maintaining checksums. • A chunk is broken up into 64 KB blocks, each with a corresponding 32-bit checksum (see the sketch below). • For reads, the chunk-server verifies the checksum of each block that overlaps the read range; reads are aligned at checksum block boundaries. • For writes, checksum computation is optimized for appends to the end of a chunk (as opposed to writes that overwrite existing data). • Diagnostic tools: extensive and detailed diagnostic logging has helped in problem isolation, debugging, and performance analysis, while incurring only a minimal cost.
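A sketch of per-block checksumming along the lines described: 64 KB blocks, each with its own 32-bit checksum, verified before any data is returned to a reader. zlib.crc32 is used here purely as a stand-in checksum function.

```python
# Sketch of block checksum computation and read-time verification.
import zlib

BLOCK_SIZE = 64 * 1024

def compute_checksums(chunk_data):
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verify_read(chunk_data, checksums, offset, length):
    """Verify every 64 KB block overlapping [offset, offset+length) before
    returning data, so corruption is never propagated to the client."""
    first, last = offset // BLOCK_SIZE, (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk_data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            raise IOError(f"checksum mismatch in block {b}; report to the master")
    return chunk_data[offset:offset + length]
```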

  29. Measurements: Micro-benchmarks • A GFS cluster: one master + two master replicas + 16 chunk-servers + 16 clients. • All machines: 1.4 GHz PIII processors, 2 GB of memory, two 80 GB 5400 rpm disks, and a 100 Mbps full-duplex Ethernet connection to an HP 2524 switch.

  30. Measurements: Micro-benchmarks - Reads • Setup: N clients read simultaneously from the file system; each client reads a randomly selected 4 MB region from a 320 GB file set. • The limit peaks at an aggregate of 125 MB/s when the 1 Gbps link between the two switches is saturated, or 12.5 MB/s per client when a 100 Mbps network interface is saturated. • The observed read rate is 10 MB/s, or 80% of the per-client limit, when just one client is reading. • The aggregate read rate is 94 MB/s for 16 readers, about 75% of the 125 MB/s link limit, or 6 MB/s per client. • The efficiency drops from 80% to 75% because, as the number of readers increases, so does the probability that multiple readers simultaneously read from the same chunk-server.

  31. Measurements: Micro-benchmarks - Writes • Setup: N clients write simultaneously to N distinct files; each client writes 1 GB of data to a new file in a series of 1 MB writes. • The limit peaks at an aggregate of 67 MB/s because each byte must be written to 3 of the 16 chunk-servers, each with a 12.5 MB/s input connection (see the calculation below). • The aggregate write rate is 35 MB/s for 16 clients (2.2 MB/s per client), about half the theoretical limit. • As the number of clients increases, it becomes more likely that multiple clients write concurrently to the same chunk-servers; collisions are more likely because each client write involves three different replicas. • Writes are slower than we would like.
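The 67 MB/s theoretical write limit follows directly from the cluster's numbers; the short calculation below simply restates that arithmetic.

```python
# Aggregate write limit: every byte goes to 3 of the 16 chunk-servers,
# and each chunk-server can only accept 12.5 MB/s (100 Mbps) of input.
chunkservers = 16
input_per_server_MBps = 12.5
replicas_per_write = 3

aggregate_write_limit = chunkservers * input_per_server_MBps / replicas_per_write
print(f"{aggregate_write_limit:.1f} MB/s")   # ~66.7 MB/s, i.e. the ~67 MB/s on the slide
```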

  32. Measurements: Micro-benchmarks - Record Appends • Setup: N clients append simultaneously to a single file; performance is limited by the network bandwidth of the chunk-servers that store the last chunk of the file. • The append rate starts at 6.0 MB/s for one client and drops to 4.8 MB/s for 16 clients, due to congestion and variance in the network transfer rates seen by different clients.

  33. Real World Clusters • Examine two clusters in use within Google. • Cluster A: used for research and development by over a hundred engineers. • Tasks run up to several hours. • They read through a few MBs to a few TBs of data, transform or analyze the data, and write the results back to the cluster. • Cluster B: used for production data processing. • Generates and processes multi-TB data sets with only occasional human intervention. • Read rates were much higher than write rates; the peak read rate of Cluster B was 1300 MB/s. • The average write rate was less than 30 MB/s; Cluster B was in the middle of a burst of write activity generating about 100 MB/s of data. • The rate of operations sent to the master was around 200 to 500 operations per second.

  34. Comparisons • GFS provides a location-independent namespace which enables data to be moved transparently for load balancing or fault tolerance (like AFS). • GFS spreads a file’s data across storage servers in a way that delivers aggregate performance and increased fault tolerance (unlike AFS). • GFS currently uses only replication for redundancy and so consumes more raw storage (unlike xFS or Swift). • GFS does not provide any caching below the file system interface; clients cache data only within a single application run (unlike AFS, xFS, Frangipani). • GFS opts for a centralized approach to simplify the design, increase its reliability, and gain flexibility (unlike Frangipani, xFS, Minnesota’s GFS, GPFS). • GFS most closely resembles the NASD architecture, but while NASD is based on network-attached disk drives, GFS uses commodity machines as chunk-servers.

  35. Issues & Conclusion • The biggest problems were disk- and Linux-related. • Some disks claimed to the Linux driver that they supported a range of IDE protocol versions but in fact responded reliably only to the more recent ones; the resulting mismatches between the driver and the kernel about the drive’s state could silently corrupt data.

  36. Goals • A scalable distributed file system for large distributed data-intensive applications. • Fault tolerance while running on inexpensive commodity hardware. • High aggregate performance to a large number of clients. • Re-examine traditional choices and explore radically different design points. • Large data sets: the largest cluster provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, accessed by hundreds of clients. • Distributed applications. • Measurements from both micro-benchmarks and real-world use. • Meet the rapidly growing demands of Google’s data processing needs. • Performance, scalability, reliability, and availability.

  37. Characteristics of two GFS clusters

  38. Design Overview (7/7): Consistency Model – Implications for Applications (2/2)

  39. Aggregate Throughputs

  40. Performance Metrics for two GFS Clusters

  41. Operations Breakdown by Size

  42. Bytes Transferred Breakdown by Operation Size (%)

  43. Master Requests Breakdown by Type (%)

  44. Interface • Familiar file system interface • Operations: create, delete, open, close, read, and write files. • Other operations: snapshot and record append • Snapshot: copy of a file or a directory tree at low cost • Record append: multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client’s append. • Multi-way merge results and producer-consumer queues that many clients can simultaneously append to without additional locking
