Introduction to cloud computing Jiaheng Lu Department of Computer Science Renmin University of China www.jiahenglu.net
The Google File System (GFS) • A scalable distributed file system for large, distributed, data-intensive applications • Multiple GFS clusters are currently deployed. • The largest ones have: • 1000+ storage nodes • 300+ terabytes of disk storage • heavily accessed by hundreds of clients on distinct machines
Introduction • Shares many of the same goals as previous distributed file systems • performance, scalability, reliability, etc. • The GFS design has been driven by four key observations of Google's application workloads and technological environment
Intro: Observations 1 • 1. Component failures are the norm • constant monitoring, error detection, fault tolerance, and automatic recovery are integral to the system • 2. Huge files (by traditional standards) • Multi-GB files are common • I/O operations and block sizes must be revisited
Intro: Observations 2 • 3. Most files are mutated by appending new data • This is the focus of performance optimization and atomicity guarantees • 4. Co-designing the applications and the file system API benefits the overall system by increasing flexibility
The Design • A cluster consists of a single master and multiple chunkservers, and is accessed by multiple clients
The Master • Maintains all file system metadata • namespace, access control info, file-to-chunk mappings, chunk (and replica) locations, etc. • Periodically communicates with chunkservers via HeartBeat messages to give instructions and check state
The Master • Makes sophisticated chunk placement and replication decisions using global knowledge • For reads and writes, a client contacts the master only to get chunk locations, then deals directly with chunkservers • The master is therefore not a bottleneck for reads/writes
Chunkservers • Files are broken into chunks. Each chunk has an immutable, globally unique 64-bit chunk handle • the handle is assigned by the master at chunk creation • Chunk size is 64 MB • Each chunk is replicated on 3 servers by default
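With a fixed 64 MB chunk size, a client can translate any file byte offset into a chunk index purely locally, before asking the master for the corresponding chunk handle. A minimal sketch (function name and structure are illustrative, not the real GFS client library):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the GFS chunk size

def chunk_index(byte_offset):
    """Map a file byte offset to the index of the chunk that
    holds it. The client would then ask the master for the
    chunk handle at this index (illustrative sketch)."""
    return byte_offset // CHUNK_SIZE

# The first byte of the third chunk lives at index 2:
assert chunk_index(2 * CHUNK_SIZE) == 2
```

Because the mapping is a pure computation, clients can cache handle lookups and avoid contacting the master on every read.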
Clients • Linked into apps via the file system API • Communicate with the master and chunkservers for reading and writing • Master interactions only for metadata • Chunkserver interactions for data • Clients cache only metadata • the data is too large to cache
Chunk Locations • The master does not keep a persistent record of chunk and replica locations • Instead, it polls chunkservers for this at startup and when chunkservers join or leave the cluster • It stays up to date by controlling placement of new chunks and through HeartBeat messages (while monitoring chunkservers)
Operation Log • Record of all critical metadata changes • Stored on the master and replicated on other machines • Defines the order of concurrent operations • Changes are not visible to clients until they propagate to all chunk replicas • Also used to recover the file system state
System Interactions: Leases and Mutation Order • Leases maintain a consistent mutation order across all chunk replicas • The master grants a lease to one replica, called the primary • The primary chooses the serial mutation order, and all replicas follow this order • Minimizes management overhead for the master
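The lease mechanism can be modeled in a few lines: the primary stamps every mutation with a serial number, and every replica applies mutations sorted by that number, so all replicas converge to the same order. A toy sketch (class and method names are illustrative, not the real GFS protocol):

```python
class PrimaryReplica:
    """Toy model of the lease mechanism: the replica holding
    the lease (the primary) assigns a serial number to each
    mutation; all replicas apply mutations in serial order."""
    def __init__(self):
        self.next_serial = 0

    def order(self, mutation):
        # Stamp the mutation with the next serial number.
        serial = self.next_serial
        self.next_serial += 1
        return (serial, mutation)

primary = PrimaryReplica()
ordered = [primary.order(m) for m in ["write A", "write B"]]
# Every replica applies mutations sorted by serial number,
# so all replicas see the same order regardless of arrival order.
assert ordered == [(0, "write A"), (1, "write B")]
```

The key design point is that ordering is decided at one place (the primary) instead of by the master, which is why the master's overhead stays low.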
Atomic Record Append • The client specifies the data to write; GFS chooses the offset, returns it to the client, and appends the data to each replica at least once • Heavily used by Google's distributed applications • No need for a distributed lock manager • GFS chooses the offset, not the client
Atomic Record Append: How? • Follows a control flow similar to other mutations • The primary tells the secondary replicas to append at the same offset as the primary • If the append fails at any replica, the client retries it • So replicas of the same chunk may contain different data, including duplicates, whole or in part, of the same record
Atomic Record Append: How? • GFS does not guarantee that all replicas are bitwise identical. • Only guarantees that data is written at least once in an atomic unit. • Data must be written at the same offset for all chunk replicas for success to be reported.
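These at-least-once semantics can be demonstrated with a toy retry loop: when an attempt fails at some replica, the client retries the whole append, leaving a duplicate at the replicas that had already succeeded. This sketch is illustrative only (it ignores offsets and padding, which the real protocol handles):

```python
def record_append(replicas, record, fail_once_at):
    """Toy model of record-append retry semantics: a failed
    attempt at any replica makes the client retry the whole
    append, so replicas may diverge and hold duplicates."""
    attempt = 0
    while True:
        attempt += 1
        ok = True
        for i, r in enumerate(replicas):
            if attempt == 1 and i == fail_once_at:
                ok = False  # this replica missed the record
                break
            r.append(record)
        if ok:
            return

r1, r2 = [], []
record_append([r1, r2], "rec", fail_once_at=1)
assert r1 == ["rec", "rec"]  # duplicate from the failed first attempt
assert r2 == ["rec"]
```

Applications tolerate this by writing self-identifying records and discarding duplicates on read, which is cheaper than making the append exactly-once.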
Replica Placement • Placement policy maximizes data reliability and network bandwidth • Spread replicas not only across machines, but also across racks • Guards against machine failures, and racks getting damaged or going offline • Reads for a chunk exploit aggregate bandwidth of multiple racks • Writes have to flow through multiple racks • tradeoff made willingly
Chunk creation • Chunks are created and placed by the master • placed on chunkservers with below-average disk utilization • the master limits the number of recent "creations" on each chunkserver • creations are followed by heavy write traffic
Detecting Stale Replicas • The master keeps a chunk version number to distinguish up-to-date and stale replicas • The version is increased when granting a lease • If a replica is unavailable, its version is not increased • the master detects stale replicas when chunkservers report their chunks and versions • Stale replicas are removed during garbage collection
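The detection step reduces to comparing each reported version against the master's current version for that chunk. A minimal sketch (names and dict shapes are assumptions for illustration):

```python
def find_stale(master_versions, reported):
    """Compare chunk versions reported by a chunkserver against
    the master's current versions; any replica with an older
    version missed a lease grant and is stale (sketch only)."""
    return [chunk for chunk, v in reported.items()
            if v < master_versions[chunk]]

master = {"chunk-42": 3, "chunk-7": 1}
report = {"chunk-42": 2, "chunk-7": 1}  # chunk-42 was down during a lease grant
assert find_stale(master, report) == ["chunk-42"]
```

A replica reporting a version *higher* than the master's record would indicate the master missed an update; the real system treats that case specially, but it is omitted here.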
Garbage collection • When a client deletes a file, the master logs the deletion like other changes and renames the file to a hidden name. • The master removes files hidden for longer than 3 days during its regular scan of the file system namespace • their metadata is erased as well • In HeartBeat messages, each chunkserver reports a subset of its chunks, and the master replies with the chunks that no longer have metadata. • The chunkserver removes those chunks on its own
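The lazy-deletion scan is simple to sketch: track when each file was hidden, and during the namespace scan collect those past the grace period. Names and data shapes below are illustrative assumptions:

```python
GRACE_PERIOD = 3 * 24 * 3600  # 3 days, the window before real deletion

def scan_namespace(hidden_files, now):
    """Return hidden (deleted) files older than the grace
    period; the master erases their metadata during its
    regular namespace scan. Illustrative sketch only."""
    return [name for name, hidden_at in hidden_files.items()
            if now - hidden_at > GRACE_PERIOD]

hidden = {".deleted-a": 0, ".deleted-b": 10}
# Four days later, both files have exceeded the 3-day grace period.
four_days = 4 * 24 * 3600
assert scan_namespace(hidden, four_days) == [".deleted-a", ".deleted-b"]
```

The grace period doubles as an undelete window: reading or renaming the hidden file before the scan restores it.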
Fault Tolerance: High Availability • Fast recovery • The master and chunkservers can restart in seconds • Chunk replication • Master replication • "shadow" masters provide read-only access when the primary master is down • mutations are not considered done until recorded on all master replicas
Fault Tolerance: Data Integrity • Chunkservers use checksums to detect corrupt data • Since replicas are not bitwise identical, each chunkserver maintains its own checksums • For reads, the chunkserver verifies the checksum before sending data • Checksums are updated during writes
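Per-replica integrity checking can be sketched as a checksum per fixed-size block of the chunk, verified before any data is returned to the reader. Here `zlib.crc32` stands in for the actual checksum function, which is an assumption for illustration:

```python
import zlib

BLOCK = 64 * 1024  # checksum granularity: 64 KB blocks

def checksum_blocks(data):
    """Compute one checksum per 64 KB block, as each
    chunkserver does independently for its own replica
    (zlib.crc32 is a stand-in, not the real function)."""
    return [zlib.crc32(data[i:i + BLOCK])
            for i in range(0, len(data), BLOCK)]

def verify(data, checksums):
    # A read is served only if every covered block's checksum matches.
    return checksum_blocks(data) == checksums

data = b"x" * (2 * BLOCK)
sums = checksum_blocks(data)
assert verify(data, sums)
assert not verify(b"y" + data[1:], sums)  # single-byte corruption detected
```

Keeping checksums per replica rather than per chunk is what lets replicas diverge byte-wise (as record append allows) while still catching disk corruption locally.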
MapReduce: Insight "Consider the problem of counting the number of occurrences of each word in a large collection of documents." How would you do it in parallel?
MapReduce Programming Model • Inspired by the map and reduce operations commonly used in functional programming languages such as Lisp • Users implement an interface of two primary methods: • 1. Map: (key1, val1) → (key2, val2) • 2. Reduce: (key2, [val2]) → [val3]
Map operation • Map, a pure function written by the user, takes an input key/value pair and produces a set of intermediate key/value pairs • e.g. (doc-id, doc-content) • Drawing an analogy to SQL, map can be visualized as the group-by clause of an aggregate query.
Reduce operation • On completion of the map phase, all the intermediate values for a given output key are combined into a list and given to a reducer. • Can be visualized as an aggregate function (e.g., average) computed over all the rows with the same group-by attribute.
Pseudo-code

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
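The word-count pseudo-code above can be exercised as a single-process Python simulation. The shuffle step that groups intermediate values by key is modeled with a dict; everything else mirrors the pseudo-code (the runner itself is a sketch, not the real framework):

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Emit (word, 1) for each word, mirroring EmitIntermediate.
    for w in contents.split():
        yield (w, 1)

def reduce_fn(word, counts):
    # Sum the counts for one word, mirroring the reduce pseudo-code.
    return sum(counts)

def run_mapreduce(docs):
    """Single-process simulation: run map over every document,
    shuffle intermediate pairs by key, then run reduce per key."""
    groups = defaultdict(list)
    for name, contents in docs.items():
        for k, v in map_fn(name, contents):
            groups[k].append(v)
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

result = run_mapreduce({"d1": "to be or not to be"})
assert result == {"to": 2, "be": 2, "or": 1, "not": 1}
```

The real system distributes the map and reduce calls across machines and materializes the shuffle on disk, but the user-visible contract is exactly this.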
MapReduce: Fault Tolerance • Handled via re-execution of tasks • Task completion is committed through the master • What happens if a mapper fails? • Re-execute its completed and in-progress map tasks • What happens if a reducer fails? • Re-execute its in-progress reduce tasks • What happens if the master fails? • Potential trouble!
MapReduce: Walk through of One more Application
MapReduce: PageRank • PageRank models the behavior of a "random surfer": PR(p) = (1-d) + d * sum over pages t linking to p of PR(t)/C(t) • C(t) is the out-degree of t, and (1-d) is a damping factor (random jump) • The "random surfer" keeps clicking on successive links at random, not taking content into consideration • Each page distributes its PageRank equally among all pages it links to • The damping factor models the surfer "getting bored" and typing an arbitrary URL
PageRank: Key Insights • The effect of each iteration is local: the (i+1)th iteration depends only on the ith • Within iteration i, the PageRank of individual nodes can be computed independently
PageRank using MapReduce • Use a sparse matrix representation (M) • Map each row of M to a list of PageRank "credit" to assign to out-link neighbors • These credits are reduced to a single PageRank value per page by aggregating over them
PageRank using MapReduce Map: distribute PageRank “credit” to link targets Reduce: gather up PageRank “credit” from multiple sources to compute new PageRank value Iterate until convergence Source of Image: Lin 2008
Phase 1: Process HTML • Map task takes (URL, page-content) pairs and maps them to (URL, (PRinit, list-of-urls)) • PRinit is the “seed” PageRank for URL • list-of-urls contains all pages pointed to by URL • Reduce task is just the identity function
Phase 2: PageRank Distribution • The reduce task gets (URL, url_list) and many (URL, val) values • Sum the vals and apply the damping factor d to get the new PR • Emit (URL, (new_rank, url_list)) • Check for convergence using a non-parallel component
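One round of the two phases can be sketched over an in-memory adjacency dict. This is a simplified model of the job described above (dangling nodes and normalization are ignored, and the graph/rank shapes are illustrative assumptions):

```python
from collections import defaultdict

D = 0.85  # damping factor d

def pagerank_iteration(graph, ranks):
    """One map/reduce round of PageRank over {page: [out-links]}:
    map distributes each page's rank equally over its out-links,
    reduce gathers credit and applies the damping factor."""
    credit = defaultdict(float)
    # Map phase: distribute PageRank "credit" to link targets.
    for page, links in graph.items():
        for target in links:
            credit[target] += ranks[page] / len(links)
    # Reduce phase: sum credit per page and fix up with d.
    return {p: (1 - D) + D * credit[p] for p in graph}

graph = {"a": ["b"], "b": ["a"]}
ranks = {"a": 1.0, "b": 1.0}
new = pagerank_iteration(graph, ranks)
# A symmetric 2-cycle is already at its fixed point:
assert abs(new["a"] - 1.0) < 1e-9 and abs(new["b"] - 1.0) < 1e-9
```

Iterating this function until successive rank vectors stop changing is exactly the convergence check done by the non-parallel component.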
MapReduce: Some More Apps • MapReduce programs in the Google source tree: • Distributed grep • Count of URL access frequency • Clustering (k-means) • Graph algorithms • Indexing systems
MapReduce: Extensions and similar systems • Pig (Yahoo!) • Hadoop (Apache) • DryadLINQ (Microsoft)
Introduction BigTable is a distributed storage system for managing structured data. Designed to scale to a very large size Petabytes of data across thousands of servers Used for many Google projects Web indexing, Personalized Search, Google Earth, Google Analytics, Google Finance, … Flexible, high-performance solution for all of Google’s products