
File Systems for the Cloud


Presentation Transcript


  1. COS 497 - Cloud Computing File Systems for the Cloud

  2. Cloud File Systems Traditional distributed file systems (DFSs) need modifications for the Cloud. Like traditional DFSs, Cloud file systems need ... - Performance - Scalability - Reliability - Availability But there are differences: - Component failures are the norm (large numbers of commodity machines). - Files are really huge (100 MB and up). - Appending new data at the end of files is preferable to overwriting existing data. - High, sustained bandwidth is more important than low latency. Examples of Cloud DFSs: Google File System (GFS), Amazon Simple Storage Service (S3), Hadoop Distributed File System (HDFS).

  3. GFS – Google File System Google Proprietary

  4. • Google File System (GFS or Google FS) is a proprietary distributed file system developed by Google for its own use. • It is designed to provide efficient, reliable access to data using large clusters of commodity hardware. • Despite having published details of the technology in the paper “The Google File System”, Google has not released the software as open source and shows little interest in selling it. • The only way it is available to another enterprise is in an embedded form - if you buy a high-end version of the Google Search Appliance, one that is delivered as a rack of servers, you get Google's technology for managing that cluster as part of the package. http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/gfs-sosp2003.pdf

  5. History • Google was growing fast, really fast! • Had different needs and problems than everybody else. • Desperately needed a new distributed file system solution. - None commercially available. • Google File System grew out of an earlier Google effort, "BigFiles", developed by Larry Page and Sergey Brin in the early days of Google.

  6. Requirements • Designed specifically for Google's workload • Why not use an existing file system? - Google’s problems were different from anyone else’s: different workload and design priorities. • Specifically designed for MapReduce! [Diagram: input flows through parallel Map tasks into Reduce tasks to produce output.]

  7. Requirements • High performance, scalable and distributed file system - But do not try to solve everything • Run on commodity hardware; failures are expected • Must provide high availability to clients • Prefer high throughput over low latency (i.e. short time delays) • Files are write-once, mostly appended to - Perhaps concurrently • Large streaming reads

  8. Design • Support sequential and random reads - But optimized for large sequential reads • Support append and overwrite writes - But optimized for concurrent appends • Must handle a “modest” number of files ~ 1 million • Files are big, from hundreds of MBs to GBs • Relaxed consistency model – CAP and BASE

  9. GFS: The Google File System • Familiar file system interface – Unix-like file store - Files are organized hierarchically in directories. - Files are identified by path names. • File operations: - create - delete - open - close - read - write • Some extra operations: - snapshot: Creates a copy of a file/directory at low cost. - append: Guarantees atomicity even with multiple concurrent appends.
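
As a rough illustration only (Google's client library is proprietary and unpublished), the operations above might be collected into a client interface along these lines; every name here is hypothetical:

```java
// Hypothetical sketch of a GFS-style client interface, covering the
// operations listed above. Google's real API is proprietary; all names
// here are illustrative only.
public interface GfsClient {
    void create(String path);                             // create a new file
    void delete(String path);                             // remove a file or directory
    long open(String path);                               // returns an opaque file handle
    void close(long handle);
    int read(long handle, long offset, byte[] buffer);    // sequential or random read
    void write(long handle, long offset, byte[] data);    // overwrite at a given offset
    long recordAppend(long handle, byte[] data);          // atomic append; GFS picks the offset
    void snapshot(String sourcePath, String targetPath);  // low-cost copy of a file/directory
}
```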

  10. Architecture [Diagram: an application calls the GFS client; the client exchanges metadata with the master and reads/writes chunks directly on several chunk servers.] Replicated? There are replicas of the master, standing by in case the master dies. • Master/slaves structure of servers • Single (replicated) master server • Multiple “chunk” servers – the actual storage areas – slaves, workers

  11. • The replica masters update themselves with the most recent metadata

  12. • Serving Requests: - Client retrieves metadata for operation from master. - Read/Write data flows between client and chunk server.
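
A minimal sketch of that metadata/data split, with invented stand-in types for the master and chunk servers (the real GFS wire protocol is not public):

```java
// Hypothetical sketch of the GFS read path: metadata from the master,
// data directly from a chunk server. All names are illustrative.
public class GfsReadSketch {
    static final long CHUNK_SIZE = 64L * 1024 * 1024;   // 64 MB chunks

    // Minimal stand-ins for the two kinds of servers.
    interface Master {
        ChunkInfo lookup(String path, long chunkIndex);  // metadata-only RPC
    }
    interface ChunkServer {
        byte[] readChunk(long chunkHandle, long offsetInChunk, int length);
    }
    record ChunkInfo(long chunkHandle, ChunkServer closestReplica) {}

    static byte[] read(Master master, String path, long fileOffset, int length) {
        // 1. Ask the master which chunk covers this offset and where replicas live.
        long chunkIndex = fileOffset / CHUNK_SIZE;
        ChunkInfo info = master.lookup(path, chunkIndex);

        // 2. Fetch the bytes directly from a chunk server; file data never
        //    flows through the master.
        return info.closestReplica().readChunk(info.chunkHandle(), fileOffset % CHUNK_SIZE, length);
    }
}
```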

  13. Master server • Manages the namespace and operations • Manages file metadata - File and chunk namespaces - Mapping from files to chunks - Locations of each chunk’s replicas • Coordinates chunk servers - Creation / deletion - Placement - Load balancing - Maintains the replication factor (default 3)

  14. • The master stores only metadata – data about the files - File namespace - File-to-chunk mappings - Chunk location information - Access control information - Chunk version numbers - Etc. • All in memory (~64 bytes / chunk) - Fast - Accessible • Has an operation log for persistent logging of critical metadata updates - Persistent on local disk - Replicated - Checkpoints for faster recovery
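
A rough picture of those in-memory tables and the operation log, sketched in Java; Google's actual data structures are not public, so all names here are invented:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the master's in-memory metadata, mirroring the
// bullets above. Real GFS structures are not published; names are illustrative.
public class MasterMetadata {
    // File namespace: full path name -> ordered list of chunk handles.
    private final Map<String, List<Long>> fileToChunks = new HashMap<>();

    // Per-chunk state: replica locations and version number (~64 bytes per chunk).
    record ChunkState(List<String> replicaServers, long version) {}
    private final Map<Long, ChunkState> chunkTable = new HashMap<>();

    // Every mutation is first appended to a persistent, replicated operation
    // log, so the state can be rebuilt from the last checkpoint plus the log.
    private final OperationLog log = new OperationLog();

    void addChunk(String path, long handle, ChunkState state) {
        log.append("ADD_CHUNK " + path + " " + handle);   // persist before applying
        fileToChunks.computeIfAbsent(path, p -> new ArrayList<>()).add(handle);
        chunkTable.put(handle, state);
    }

    static class OperationLog {
        void append(String entry) {
            // In a real system: write to local disk, replicate remotely, then ack.
        }
    }
}
```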

  15. Chunks • Files are stored as “chunks”, 64 MB in size, on chunk servers. • Each chunk is identified by a unique 64-bit “handle” (aka identifier or filename). • A checksum is kept for each 64 KB block – to check for corruption of bits on disk. • Chunks are replicated to different chunk servers - Reliability through replication - Each chunk replicated across 3+ chunk servers Note: There are hundreds of chunk servers in a GFS cluster, distributed over multiple racks.
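
As an illustration of the per-64 KB checksums, here is a minimal sketch using a plain CRC32; GFS's actual checksum scheme is not published, so treat this as an assumption-laden example:

```java
import java.util.zip.CRC32;

// Illustrative sketch of per-block checksumming: each 64 KB block of a
// 64 MB chunk gets its own checksum so corruption can be detected on read.
public class ChunkChecksums {
    static final int BLOCK_SIZE = 64 * 1024;   // 64 KB checksum blocks

    // Compute one CRC per 64 KB block of the chunk data.
    static long[] checksum(byte[] chunkData) {
        int blocks = (chunkData.length + BLOCK_SIZE - 1) / BLOCK_SIZE;
        long[] sums = new long[blocks];
        for (int i = 0; i < blocks; i++) {
            int start = i * BLOCK_SIZE;
            int len = Math.min(BLOCK_SIZE, chunkData.length - start);
            CRC32 crc = new CRC32();
            crc.update(chunkData, start, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // On read, recompute the CRC of the touched block and compare.
    static boolean verifyBlock(byte[] chunkData, long[] sums, int blockIndex) {
        int start = blockIndex * BLOCK_SIZE;
        int len = Math.min(BLOCK_SIZE, chunkData.length - start);
        CRC32 crc = new CRC32();
        crc.update(chunkData, start, len);
        return crc.getValue() == sums[blockIndex];
    }
}
```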

  16. Chunk server • Manages the chunks under its control • Stores chunks as plain UNIX file system files • Actively checks the integrity of its chunks

  17. Master server • Why centralization (i.e. a master/slave organization)? • Simplicity - Allows better decisions with global knowledge. • But it’s still a bottleneck … - Mitigated by keeping bulk data transfer out of the master's path

  18. Master Chunk Server Communication • Master and chunk server communicate regularly to obtain state: - Is chunk server down? - Are there disk failures on chunk server? - Are any chunk replicas corrupted? - Which chunk replicas does a chunk server store? • Master sends instructions to a chunk server: - Delete existing chunk. - Create new chunk.
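
One way to picture that heartbeat exchange is the sketch below; the message shapes and names are hypothetical, not the real GFS protocol:

```java
import java.util.List;

// Hypothetical sketch of the periodic master <-> chunk server exchange
// described above. The real GFS message formats are not public.
public class HeartbeatSketch {
    // What a chunk server reports about its own state.
    record HeartbeatReport(String serverId,
                           List<Long> chunkHandlesHeld,    // which replicas it stores
                           List<Long> corruptedReplicas,   // failed checksum verification
                           boolean diskFailure) {}

    // Instructions the master may send back.
    enum Kind { DELETE_CHUNK, CREATE_CHUNK }
    record Instruction(Kind kind, long chunkHandle) {}

    interface Master {
        // The master updates its replica map from the report (and notices a dead
        // chunk server when heartbeats stop), then replies with instructions such
        // as "delete this stale replica" or "create this new chunk".
        List<Instruction> heartbeat(HeartbeatReport report);
    }
}
```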

  19. Single master • Problems? - Single point of failure - Scalability bottleneck • GFS solutions: - Shadow masters - Minimize master involvement: never move data through it, use it only for metadata, and cache metadata at clients - Large chunk size • Simple, and good enough for Google’s concerns

  20. Summary • GFS supports big-data processing using commodity hardware • Nothing like the traditional file system assumptions • Highly optimized for the workload (MapReduce) • High availability • Replication, monitoring, check-summing • Failure is expected, not an exception • Scalable • De-coupled control and data transfer • High throughput

  21. Change to Caffeine • In 2010, Google remodeled its search infrastructure • Old system – Based on MapReduce (on GFS) to generate index files – Batch process: next phase of MapReduce cannot start until first is complete Web crawling → MapReduce → propagation – Initially, Google updated its index every 4 months. Around 2000, it re-indexed and propagated changes every month. Process took about 10 days. Users hitting different servers might get different results. • New system, named Caffeine – Fully incremental system: Based on BigTable running on GFS2 – Support indexing many more documents: ~100PB – High degree of interactivity: web crawlers can update tables dynamically. – Analyze web continuously in small chunks. • Identify pages that are likely to change frequently. – BTW, MapReduce is not dead. Caffeine uses it in some places, as do lots of other services.

  22. From GFS to GFS2 • GFS was designed with MapReduce in mind. – But MapReduce spawned lots of other applications which needed a better file system! – Designed for batch-oriented operations. • Problems – Single master node in charge of chunk servers – a single point of failure (SPOF). – All information (i.e. metadata) about files is stored in the master’s memory → limits the total number of files. – Problems when storage grew to tens of petabytes (a petabyte is 10^15 bytes). – Automatic failover (if the master goes down) was added (but still takes 10 seconds). – Designed for high throughput, but delivers high latency → the master is a bottleneck. – Delays in recovering from a failed chunk server replica also delay the client.

  23. GFS2 – Distributed masters, not just one – more redundancy. – More masters also increases the number of files that can be accommodated, as more masters means more metadata can be stored. – Support for smaller files: chunks go from 64 MB to 1 MB. – Designed specifically for Google’s BigTable (but does not make GFS obsolete – it is still used by some applications). Google BigTable is Google’s proprietary data storage system.

  24. Hadoop File System - HDFS Apache Open Source

  25. • HDFS is built around the GFS model – the research paper published by Google – with the names changed to protect the innocent! HDFS Architecture • Master/slave architecture • An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. • There are a number of DataNodes, usually one per node in the cluster. - The DataNodes manage storage attached to the nodes that they run on. • HDFS exposes a file system namespace and allows user data to be stored in files. • A file is split into one or more blocks, and the set of blocks is stored in DataNodes. • DataNodes serve read and write requests, and perform block creation, deletion, and replication upon instruction from the NameNode.

  26. • HDFS is a very large distributed file system - E.g. 10K nodes, 100 million files, 10 PB of data • Assumes commodity hardware. • Files are replicated in order to handle hardware failure. • System detects failures and recovers from them. • Optimized for Batch Processing. • Data locations exposed so that computations can move to where data resides. • Provides very high aggregate bandwidth.

  27. An application can specify the number of replicas of a file that it needs: the replication factor of the file. • This information is stored in the NameNode.
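
In HDFS this can be done through the public FileSystem API, for example (assuming a reachable cluster and an existing file at the illustrative path /data/example.log):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Setting a per-file replication factor through the Hadoop FileSystem API.
// Assumes a reachable cluster and an existing file at the given path.
public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf)) {
            // Ask the NameNode to keep 3 replicas of this file's blocks.
            fs.setReplication(new Path("/data/example.log"), (short) 3);
        }
    }
}
```

The same effect is available from the command line with hdfs dfs -setrep 3 /data/example.log.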

  28. Hadoop Distributed File System • Single Namespace for entire cluster • Data Coherency - Write-once/read-many access model. - Client can only append to existing files. • Files are broken up into blocks. - Typically 64 - 128 MB block size. - Each block replicated on multiple DataNodes. • Intelligent Client. - Client can find location of blocks via NameNode. - Client accesses data directly from a DataNode.
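
The "intelligent client" point is visible in the public API: a client can ask the NameNode where each block of a file lives and then go to the DataNodes directly. A small example, assuming a running cluster and an existing file at the illustrative path /data/example.log:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Looking up block locations via the NameNode, as described above.
public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/example.log");
            FileStatus status = fs.getFileStatus(file);
            // For each block, the NameNode returns the DataNodes holding a replica.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset " + block.getOffset()
                        + " length " + block.getLength()
                        + " hosts " + String.join(",", block.getHosts()));
            }
        }
    }
}
```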

  29. NameNode • The most vital of the HDFS components is the NameNode. • The NameNode is the master of HDFS: it directs the slave DataNode tasks to perform the low-level I/O tasks. • The NameNode is the bookkeeper of HDFS - it keeps track of how files are broken down into file blocks, which nodes store those blocks, and the overall health of the distributed file system. • The function of the NameNode is memory- and I/O-intensive. • As such, the server hosting the NameNode typically does not store any user data or perform any computations for a MapReduce program, in order to lower the workload on that machine.

  30. Secondary NameNode • The Secondary NameNode (SNN) is an assistant for monitoring the state of the cluster's HDFS. • Like the NameNode, each cluster has one SNN, and it typically resides on its own machine as well. No other DataNode or TaskTracker tasks run on the same server. • The SNN differs from the NameNode in that this process does not receive or record any real-time changes to HDFS. • Instead, it communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration. • The NameNode is a single point of failure for a Hadoop cluster, and the SNN snapshots help minimize the downtime and loss of data.

  31. DataNode • Each slave machine in the cluster hosts a DataNode task to perform the “grunt” work of the distributed file system - reading and writing HDFS blocks to actual files on the local file system. • When you want to read or write an HDFS file, the file is broken into blocks and the NameNode tells your client which DataNode each block resides on. • Your client communicates directly with the DataNode tasks to process the local files corresponding to the blocks. • Furthermore, a DataNode may communicate with other DataNodes to replicate its data blocks for redundancy.
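
From the application's point of view this indirection is hidden behind an ordinary stream: the client library fetches block locations from the NameNode and streams the bytes from DataNodes transparently. A minimal read, again assuming an existing text file at an illustrative path:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Reading an HDFS file: block lookup and DataNode access happen inside
// the client library; the application just sees an input stream.
public class ReadHdfsFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/example.log")),
                                           StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());   // first line of the file
        }
    }
}
```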

  32. Questions?
