Mariposa / The Google File System

Haowei Lu

Madhusudhanan Palani

EECS 584, Fall 2011

From LAN to WAN
  • Drawbacks of traditional distributed DBMS
    • Static Data Allocation
      • Move objects manually
    • Single Administrative Structure
      • Cost-based optimizer cannot scale well
    • Uniformity
      • Different Machine Architecture
      • Different Data Type

EECS 584, Fall 2011

From LAN to WAN
  • New requirements
    • Scalability to a large number of cooperating sites
    • Data mobility
    • No global synchronization
    • Total local autonomy
    • Easily configurable policies

EECS 584, Fall 2011

From LAN to WAN
  • Solution – A distributed microeconomic approach
    • Well studied economic model
    • Reduces scheduling complexity (?!)
    • The invisible hand leads to a local optimum

EECS 584, Fall 2011

Mariposa
  • Let each site act on its own behalf to maximize its own profit
  • In turn, this improves the overall performance of the DBMS ecosystem

EECS 584, Fall 2011

Architecture - Glossary
  • Fragment – Units of storage that are bought and sold by sites
    • Range distribution
    • Hash-Based distribution
      • Unstructured! Whenever the site wants!
  • Stride
    • Operations that can proceed in parallel

EECS 584, Fall 2011

Architecture

EECS 584, Fall 2011

The bidding process

EECS 584, Fall 2011

The bidding process
  • The Broker: sends out requests for bids for the query plan
  • The Bidder: responds to a request for bid with its formulated price and other information in the form:
    • (C, D, E): Cost, Delay, Expiration Date (see the sketch below)
  • The whole logic is implemented using RUSH
    • A low-level, very efficient embedded scripting language and rule system
    • Form: on <condition> do <action>
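Below is a minimal Python sketch of the bid information exchanged here; the Bid dataclass and the on_bid_request handler are hypothetical names used only to illustrate the (C, D, E) triple and the "on <condition> do <action>" flavor of a RUSH rule, not actual RUSH code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Bid:
    """A bidder's answer to a request for bid: (Cost, Delay, Expiration)."""
    cost: float          # C: price charged to run the subquery
    delay: float         # D: promised processing time in seconds
    expires: datetime    # E: date after which the bid is void

def on_bid_request(subquery, billing_rate, estimated_delay):
    """Hypothetical handler mirroring the RUSH 'on <condition> do <action>' form:
    on receiving a request for bid, respond with a (C, D, E) triple."""
    cost = billing_rate * estimated_delay
    return Bid(cost=cost, delay=estimated_delay,
               expires=datetime.now() + timedelta(days=1))
```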

EECS 584, Fall 2011

The bidding process: Bidder
  • The Bidder: setting the price for a bid
    • Billing rate on a per-fragment basis
    • Considers site load
      • Actual Bid = Computed Bid * Load Average
    • Bids reference the hot list from the storage manager (see the sketch below)
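A small Python sketch of the pricing rule above (Actual Bid = Computed Bid x Load Average, with the hot list consulted); the per-fragment rate and the 2x hot-list premium are illustrative assumptions, not values from the paper.

```python
def price_bid(per_fragment_rate: float, fragments: list,
              load_average: float, hot_list: set) -> float:
    """Compute a bid price on a per-fragment basis, scaled by site load.

    Fragments on the storage manager's hot list are assumed (illustratively)
    to be billed at a higher rate.
    """
    computed_bid = 0.0
    for frag in fragments:
        rate = per_fragment_rate
        if frag in hot_list:      # hot fragments earn a premium (assumed 2x)
            rate *= 2.0
        computed_bid += rate
    # Actual Bid = Computed Bid * Load Average (from the slide)
    return computed_bid * load_average

# Example: two fragments, one of them hot, on a site with load average 1.5
print(price_bid(10.0, ["emp.f1", "emp.f2"], 1.5, hot_list={"emp.f2"}))
```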

EECS 584, Fall 2011

The bidding process

EECS 584, Fall 2011

The bidding process: Broker
  • The Broker
    • Input: Fragmented query plan
    • In process: decides which sites run the fragments & sends out bid acceptances
      • Expensive bid protocol
      • Purchase order protocol (mainly used)
    • Output: hands off the task to the coordinator

EECS 584, Fall 2011

The bidding process: Broker
  • Expensive bid protocol

[Diagram: the broker, while under budget, sends requests for bids to the bidders (individual sites); it consults the Ads Table (located at the name server) and a bookkeeping table of previous winner sites (kept at the same site as the broker).]
EECS 584, Fall 2011

The bidding process: Broker
  • Purchase Order Protocol

[Diagram: the broker sends the work directly to the most probable bidder; the site either accepts, processes the work, and generates a bill, or refuses and passes it to another site or returns it to the broker.]

EECS 584, Fall 2011

The bidding process: Broker
  • The Broker finds bidder using Ad table

EECS 584, Fall 2011

The bidding process: Broker
  • The Broker finds bidder using Ad table
  • Example (Sale Price)
    • Query-Template: SELECT * FROM TMP
    • Server Id: 123
    • Start Time: 2011/10/01
    • Expiration Time: 2011/10/04
    • Price: 10 unit
    • Delay: 5 seconds

EECS 584, Fall 2011

The bidding process: Broker
  • Types of Ads (REALLY FANCY)

EECS 584, Fall 2011

The bidding process: Bid Acceptance
  • The main idea: make the difference as large as possible
    • Difference := B(D) – C (D: delay, C: cost, B(t): the budget function)
  • Method: greedy algorithm (see the sketch below)
    • Pre-step: start from the least-delay result
    • Iteration steps:
      • Calculate the cost gradient CG := cost reduction / delay increase for each stride
      • Keep substituting the stride with MAX(CG) until the difference no longer increases
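A Python sketch of this greedy substitution, assuming a simple linear budget function B and that stride delays add up sequentially; the bid tuples and numbers are illustrative, not the paper's algorithm verbatim.

```python
def B(total_delay):
    """Assumed budget function: willing to pay 100 minus 2 per second of delay."""
    return 100.0 - 2.0 * total_delay

def accept_bids(strides):
    """strides: list of lists of (cost, delay) bids, one list per stride.
    Start from the least-delay bid per stride, then repeatedly swap in the
    alternative with the largest cost gradient (cost reduction / delay
    increase) while B(D) - C keeps growing."""
    chosen = [min(bids, key=lambda b: b[1]) for bids in strides]  # least delay first

    def difference(sel):
        total_cost = sum(c for c, _ in sel)
        total_delay = sum(d for _, d in sel)
        return B(total_delay) - total_cost

    improved = True
    while improved:
        improved = False
        best = None  # (cost_gradient, stride_index, candidate_bid)
        for i, bids in enumerate(strides):
            c0, d0 = chosen[i]
            for c, d in bids:
                if d > d0 and c < c0:
                    cg = (c0 - c) / (d - d0)       # cost gradient
                    if best is None or cg > best[0]:
                        best = (cg, i, (c, d))
        if best is not None:
            trial = list(chosen)
            trial[best[1]] = best[2]
            if difference(trial) > difference(chosen):
                chosen = trial
                improved = True
    return chosen, difference(chosen)

# Example: two strides, each with a fast/expensive and a slow/cheap bid
print(accept_bids([[(30, 2), (10, 6)], [(40, 3), (25, 5)]]))
```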

EECS 584, Fall 2011

The bidding process: Separate Bidder
  • Network bidder
    • One trip to get the bandwidth
    • A return trip to get the price
    • Happens at the second stage

EECS 584, Fall 2011

Storage Manager
  • An asynchronous process which runs in tandem with the bidder
  • Objective
    • Maximize revenue income per unit time
  • Functions
    • Calculate Fragment Values
    • Buy Fragments
    • Sell Fragments
    • Split/Coalesce Fragments

EECS 584, Fall 2011

Fragment Values
  • The value of a fragment is defined using its revenue history
  • The revenue history consists of
    • Query, number of records in the result, time since the last query, last revenue, delay, and CPU & I/O used
  • CPU & I/O are normalized & stored in site-independent units
  • Each site should (see the sketch below)
    • Convert these CPU & I/O units to site-specific units via weighting functions
    • Adjust revenue, as the current node may be faster or slower, by using the average bid curve
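To make the value calculation concrete, here is a minimal Python sketch; the tuple layout of the revenue history, the weighting multipliers, and the speed_ratio stand-in for the average-bid-curve adjustment are all illustrative assumptions, not Mariposa's actual formulas.

```python
def fragment_value(revenue_history, cpu_weight, io_weight, speed_ratio):
    """Estimate a fragment's value from its revenue history.

    Each record is assumed to hold (revenue, cpu_units, io_units) with CPU and
    I/O in site-independent units. The site converts them to its own cost via
    weighting functions (here simple multipliers) and scales revenue by how
    much faster or slower it is than the recorded site (speed_ratio stands in
    for the average-bid-curve adjustment on the slide)."""
    value = 0.0
    for revenue, cpu_units, io_units in revenue_history:
        local_cost = cpu_weight * cpu_units + io_weight * io_units
        adjusted_revenue = revenue * speed_ratio
        value += adjusted_revenue - local_cost
    return value

# Example: two past queries on this fragment, on a site 20% faster than average
print(fragment_value([(10.0, 2.0, 5.0), (6.0, 1.0, 3.0)],
                     cpu_weight=0.5, io_weight=0.2, speed_ratio=1.2))
```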

EECS 584, Fall 2011

Buying fragments
  • In order to bid for a query/subquery, the site must have the referenced fragments
  • The site can buy the fragments in advance (prefetch) or when the query comes in (on demand)
  • The buyer locates the owner of the fragment and requests its revenue history
  • Calculates the value of the fragment
  • Evicts old fragments (alternate fragments) to free up space
    • Only to the extent needed to make space available for the new fragment
  • Buyer offer price = value of fragment – value of alternate fragments + price received (see the sketch below)
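A direct translation of the offer-price formula above into Python; the helper name and the example values are made up for illustration.

```python
def buyer_offer_price(fragment_value, evicted_values, expected_sale_prices):
    """Offer price = value of the fragment
                     - value of the alternate (evicted) fragments
                     + price expected to be received for selling them.
    All values are assumed to come from revenue histories (previous slide)."""
    return fragment_value - sum(evicted_values) + sum(expected_sale_prices)

# Example: a fragment worth 50 revenue units; to make room we evict two
# fragments worth 10 and 5, which we expect to sell for 8 and 4.
print(buyer_offer_price(50.0, [10.0, 5.0], [8.0, 4.0]))  # -> 47.0
```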

EECS 584, Fall 2011

Selling Fragments
  • The seller can evict the fragment being bought or any other fragment (alternate) of equivalent size (Why is this a must?)
  • The seller will sell if (see the sketch below)
    • offer price > value of fragment (sold) – value of alternate fragments + price received
  • If the offer price is not sufficient,
    • the seller tries to evict a fragment of higher value,
    • and lowers the price of the fragment as a final option
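The seller's side of the same rule, again a literal Python rendering of the inequality above; the function name and inputs are illustrative.

```python
def seller_accepts(offer_price, fragment_value,
                   alternate_values, alternate_sale_prices):
    """Seller's decision rule from the slide: sell the fragment if the buyer's
    offer exceeds (value of the fragment being sold) - (value of the alternate
    fragments it could evict instead) + (price it would receive for them)."""
    threshold = fragment_value - sum(alternate_values) + sum(alternate_sale_prices)
    return offer_price > threshold

# Example: an offer of 47 against a fragment worth 50 whose alternates
# are worth 10 and could be sold for 6.
print(seller_accepts(47.0, 50.0, [10.0], [6.0]))  # -> True
```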

EECS 584, Fall 2011

Split & Coalesce
  • When to Split/Coalesce?
    • Split if there are too few fragments; otherwise parallelization will take a hit
    • Coalesce if there are too many fragments, as the overhead of dealing with them & response time will take a hit
  • The algorithm for split/coalesce must strike the correct balance between the two

EECS 584, Fall 2011

How to solve this issue??? An interlude

Why not extend my microeconomics analogy!?!

EECS 584, Fall 2011

Stonebraker's Microeconomics Idea
  • Market pressure should correct inappropriate fragment sizes
  • Large fragment size => large revenue
  • Now everyone wants a share of the pie
  • But the owner does not want to lose the revenue!

EECS 584, Fall 2011

The Idea Continued
  • Break the large fragment into smaller fragments
  • Smaller fragment means less revenue & less attractive for copies

EECS 584, Fall 2011

It still continues….
  • Smaller fragments also mean more overhead => Works against the owner!

EECS 584, Fall 2011

And it ends…
  • So depending on the market demand these two opposing motivations will balance each other

EECS 584, Fall 2011

How to solve this issue???

Why not extend my microeconomics analogy!?!

A more “concrete” approach !!

EECS 584, Fall 2011

A more “concrete” approach...
  • Mariposa calculates the expected delay (ED) due to parallel execution over multiple fragments (Numc)
  • It then computes the expected bid per site as
    • B(ED) / Numc
  • Vary Numc to arrive at the maximum revenue per site => Num* (see the sketch below)
  • Sites keep track of this Num* as the basis for their split/coalesce decisions
  • The sites must also ensure that existing contracts are not affected
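A small sketch of the Num* search described above; the budget function and the delay model (parallel speed-up plus a per-fragment coordination overhead) are illustrative assumptions, not values from the paper.

```python
def expected_bid_per_site(B, expected_delay_fn, num_fragments):
    """Expected revenue per site when a query runs in parallel over
    num_fragments fragments: B(ED) / Numc, as on the slide."""
    ed = expected_delay_fn(num_fragments)
    return B(ed) / num_fragments

def find_num_star(B, expected_delay_fn, max_fragments=64):
    """Vary Numc and keep the value that maximizes revenue per site (Num*)."""
    return max(range(1, max_fragments + 1),
               key=lambda n: expected_bid_per_site(B, expected_delay_fn, n))

# Illustrative assumptions: a linear budget function and a delay model in
# which parallelism helps but adds a small per-fragment coordination cost.
budget = lambda d: max(0.0, 100.0 - 2.0 * d)
delay = lambda n: 40.0 / n + 0.5 * n
print(find_num_star(budget, delay))  # -> 2 under these toy assumptions
```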

EECS 584, Fall 2011

Name Service Architecture

[Diagram: the broker contacts the name service; the name service queries multiple name servers, each of which gathers metadata from the local sites.]

EECS 584, Fall 2011

What are the different types of names?
  • Internal Names: They are location dependent and carry info related to the physical location of the object
  • Full Names: They uniquely identify an object, are location independent, & carry full info related to the attributes of the object
  • Common Names: They are user defined & scoped within a name space
  • Simple rules help translate common names to full names
  • The missing components are usually derived from parameters supplied by the user or from the user's environment
  • Name Context: This is similar to access modifiers in programming languages

EECS 584, Fall 2011

How are names resolved?
  • Name resolution helps discover the object that is bound to a name
    • Common Name => Full Name
    • Full Name => Internal Name
  • The broker employs the following steps to resolve a name (see the sketch below)
    • Searches the local cache
    • Rule-driven search to resolve ambiguities
    • Queries one or more name servers
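A minimal Python sketch of this three-step lookup order; the name formats, the name_context dictionary, and the dict-based "name servers" are illustrative assumptions rather than Mariposa's actual name syntax.

```python
def resolve_name(common_name, name_context, local_cache, name_servers):
    """Sketch of the broker's resolution order: local cache first, then a
    rule-driven step (fill missing components from the user's context),
    then one or more name servers. All structures are illustrative."""
    # 1. Local cache
    if common_name in local_cache:
        return local_cache[common_name]
    # 2. Rule-driven step: derive a full name from the user's environment
    full_name = "{user}.{db}.{name}".format(name=common_name, **name_context)
    # 3. Query name servers until one knows the binding
    for server in name_servers:
        internal = server.get(full_name)
        if internal is not None:
            local_cache[common_name] = internal
            return internal
    raise LookupError(f"cannot resolve {common_name!r}")

# Example with two in-memory stand-ins for name servers
ns1, ns2 = {}, {"alice.eecs584.TMP": "site123:/frag/TMP.0"}
print(resolve_name("TMP", {"user": "alice", "db": "eecs584"}, {}, [ns1, ns2]))
```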

EECS 584, Fall 2011

How is QoS of Name Servers Defined?
  • Name servers help translate common names to full names using name contexts provided by clients
  • The name service contacts various name servers
  • Each name server maintains a composite set of metadata about the local sites under it
  • It is the name server's role to periodically update its catalog
  • QoS is defined as the combination of the price & the staleness of this data

EECS 584, Fall 2011

Experiment

EECS 584, Fall 2011


The Query:

    • SELECT *
      FROM R1(SB), R2(B), R3(SD)
      WHERE R1.u1 = R2.u1
        AND R2.u1 = R3.u1

  • The following statistics are available to the optimizer
    • R1 join R2 (1MB)
    • R2 join R3 (3MB)
    • R1 join R2 join R3 (4.5MB)

EECS 584, Fall 2011


A traditional distributed RDBMS plans a query & sends the sub-queries to the processing sites, which is the same as the purchase order protocol

  • Therefore the overhead due to Mariposa is the difference in elapsed time between the two protocols, weighted by the proportion of queries using each protocol
  • Bid price = (1.5 x estimated cost) x load average (see the sketch below)
    • Load average = 1
  • A node will sell a fragment if
    • Offer price > 2 x scan cost / load average
  • The decision to buy a fragment rather than subcontract is based on
    • Sale price <= Total money spent on scans
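The experiment's pricing and trading rules above translate directly into code; this sketch simply restates them (the helper names are ours, not from the paper).

```python
def bid_price(estimated_cost, load_average):
    """Bid price = (1.5 x estimated cost) x load average (from the slide)."""
    return 1.5 * estimated_cost * load_average

def will_sell_fragment(offer_price, scan_cost, load_average):
    """A node sells a fragment if offer price > 2 x scan cost / load average."""
    return offer_price > 2.0 * scan_cost / load_average

def will_buy_rather_than_subcontract(sale_price, money_spent_on_scans):
    """Buy the fragment instead of subcontracting if its sale price is no more
    than the total money already spent scanning it remotely."""
    return sale_price <= money_spent_on_scans

# Example with load average 1 (as in the experiment)
print(bid_price(10.0, 1.0),
      will_sell_fragment(25.0, 10.0, 1.0),
      will_buy_rather_than_subcontract(18.0, 20.0))
```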

EECS 584, Fall 2011


The query optimizer chooses a plan based on the data transferred across the network

  • The initial plan generated by both Mariposa and the traditional systems will be similar
  • But due to the migration of fragments, subsequent executions of the same query will generate much better plans

EECS 584, Fall 2011

GFS - Topics Covered
  • Motivation
  • Architectural/File System Hierarchical Overview
  • Read/Write/Append/Snapshot Operation
  • Key Design Parameters
  • Replication & Rebalancing
  • Garbage Collection
  • Fault Tolerance

EECS 584, Fall 2011

Motivation
  • Customized Needs
  • Reliability
  • Availability
  • Performance
  • Scalability

EECS 584, Fall 2011

Customized Needs: How is it different?
  • Runs on commodity hardware, where failure is the expectation rather than the exception (PC vs. Mac, anyone?)
  • Huge files (on the order of multiple GBs)
  • Writes mostly involve appending data, unlike in traditional systems
  • The applications that use the system are in-house!
  • The files stored are primarily web documents

EECS 584, Fall 2011

GFS - Topics Covered
  • Motivation
  • Architectural/File System Hierarchical Overview
  • Read/Write/Append/Snapshot Operation
  • Key Design Parameters
  • Replication & Rebalancing
  • Garbage Collection
  • Fault Tolerance

EECS 584, Fall 2011

File System Hierarchy

[Diagram: the master server maintains the directory tree (Directory => File 1 ... File n) and the mapping from files to chunks; each chunk has a 64-bit globally unique id and is stored on chunk servers (Chunk0 ... Chunk5).]

EECS 584, Fall 2011

Types of servers
  • The master server holds all metadata information, such as
    • Directory => file mapping
    • File => chunk mapping
    • Chunk locations
  • It keeps in touch with the chunk servers via heartbeat messages
  • Chunk servers store the actual chunks on local disks as Linux files
  • For reliability, chunks may be replicated across multiple chunk servers

EECS 584, Fall 2011

GFS - Topics Covered
  • Motivation
  • Architectural /File System Hierarchical Overview
  • Read/Write/Append/Snapshot Operation
  • Key Design Parameters
  • Replication & Rebalancing
  • Garbage Collection
  • Fault Tolerance

EECS 584, Fall 2011

Read Operation
  • Using the fixed chunk size & the user-provided filename & byte offset, the client translates the offset into a chunk index (see the sketch below)
  • The filename & chunk index are then sent to the master to get the chunk handle & the replica locations
  • The client caches this info (for a limited time), using filename & chunk index as the key
  • The client then communicates directly with the closest chunk server
  • To minimize client-master interaction, the client batches chunk location requests & the master also returns locations for chunks following the requested ones.
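A small sketch of this client-side translation, assuming the 64 MB chunk size GFS uses; the helper name is ours, not the real client API.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunks

def to_chunk_request(filename, byte_offset):
    """Client-side translation described above: (filename, byte offset) ->
    (filename, chunk index) plus the offset within that chunk. The master is
    then asked for the chunk handle and replica locations for that index."""
    chunk_index = byte_offset // CHUNK_SIZE
    offset_in_chunk = byte_offset % CHUNK_SIZE
    return filename, chunk_index, offset_in_chunk

# Example: byte 200,000,000 of a file falls in chunk 2
print(to_chunk_request("/logs/web-00", 200_000_000))
```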

EECS 584, Fall 2011

Write Operation

EECS 584, Fall 2011

Write Operation
  • The client requests a chunk from the master
  • The master assigns a chunk lease (60 seconds, renewable) to a primary among the replicas
  • The client then pushes the data to be written to the nearest chunk server (see the sketch below)
    • Each chunk server in turn pushes this data to the next nearest server
    • This ensures that network bandwidth is fully utilized
  • Once all replicas have the data, the client sends the write request to the primary
    • The primary determines the order of mutations based on the requests it receives from one or more clients
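A sketch of the chained data push described above: replicas are ordered so that each hop forwards to the nearest remaining server. The distance function and host names are toy assumptions; in GFS the distance is estimated from IP addresses and each server forwards data while it is still receiving it.

```python
def push_data_along_chain(data, client_addr, replicas, distance):
    """Order the replicas into a forwarding chain: the client sends to the
    nearest replica, which forwards to the replica nearest to it, and so on.
    Returns the chain; in the real system the bytes stream along this chain."""
    chain, current, remaining = [], client_addr, list(replicas)
    while remaining:
        nxt = min(remaining, key=lambda r: distance(current, r))
        chain.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return chain

# Toy example: distances derived from made-up host coordinates
coords = {"client-00": 0, "rack1-02": 2, "rack1-07": 7, "rack2-03": 13}
dist = lambda a, b: abs(coords[a] - coords[b])
print(push_data_along_chain(b"payload", "client-00",
                            ["rack2-03", "rack1-07", "rack1-02"], dist))
```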

EECS 584, Fall 2011

Write Operation
  • The primary then pushes this ordering information to all replicas
  • The replicas then acknowledge the primary once the mutations have been successfully applied
  • The primary then acknowledges the client
  • Data flow is decoupled from control flow to ensure that the network topology dictates the throughput and not the choice of primary
  • The distance between two nodes is estimated from their IP addresses
  • Use of switched network with full duplex links allows servers to forward data as soon as they start receiving it

EECS 584, Fall 2011

Record Append Operations
  • Appends data to a file at least once atomically and returns the offset to the client
  • The client pushes the data to all replicas
  • It then sends the append request to the primary
  • The primary checks whether the chunk size would be exceeded (see the sketch below)
    • If so, it pads the remaining space of the old chunk, instructs the replicas to do the same, and asks the client to retry the operation on the next chunk
    • Else it writes to the chunk and instructs the replicas to do so
  • If an append fails at any replica, the client retries the operation
  • This is the single most commonly used operation for distributed applications at Google to write concurrently to a file
  • It allows simple coordination schemes rather than the complex distributed locking mechanisms used for traditional writes
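A sketch of the primary's decision described above, assuming 64 MB chunks; the return values are illustrative, not the real GFS RPC interface.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

def primary_handle_append(current_chunk_used, record_size):
    """If the record would not fit in the current chunk, pad the remainder
    and ask the client to retry on a new chunk; otherwise write it and
    return the chosen offset."""
    if current_chunk_used + record_size > CHUNK_SIZE:
        padding = CHUNK_SIZE - current_chunk_used
        return {"action": "pad_and_retry", "padding_bytes": padding}
    offset = current_chunk_used
    return {"action": "written", "offset": offset,
            "new_used": current_chunk_used + record_size}

print(primary_handle_append(64 * 1024 * 1024 - 100, 4096))  # forces a retry
print(primary_handle_append(1024, 4096))                    # normal append
```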

EECS 584, Fall 2011

Snapshot Operations
  • This is used by applications for checkpointing their progress
  • Creates an instant copy of a file or directory tree while minimizing interruptions to ongoing mutations
  • The master revokes any outstanding leases on the affected chunks
  • The master duplicates the metadata, which still points to the same chunks
  • Upon the first write request, the master asks the chunk server to copy the chunk
  • The copy is created on the same chunk server, thereby avoiding network traffic

EECS 584, Fall 2011

GFS - Topics Covered
  • Motivation
  • Architectural /File System Hierarchical Overview
  • Read/Write/Append/Snapshot Operation
  • Key Design Parameters
  • Replication & Rebalancing
  • Garbage Collection
  • Fault Tolerance

EECS 584, Fall 2011

Replication & Rebalancing
  • Chunks are replicated both across racks and within racks
  • This not only boosts availability, reliability, etc., but also exploits aggregate bandwidth for reads
  • Placement of chunks (balancing) depends on several factors:
    • Even out disk utilization across servers
    • Limit the number of recent creations on each chunk server
    • Spread replicas across racks
  • The number of replicas is configurable, and the master ensures it does not drop below the threshold

EECS 584, Fall 2011

Replication & Rebalancing
  • Priority on which chunks to re-replicate is assigned by the master based on various factors like
    • Distance from the threshold
    • Live chunks over deleted chunks
    • Chunks blocking progress of clients
  • Master as well as clients throttle the cloning operations to ensure that they do not interfere with regular operations
  • Master also does periodic rebalancing for better load balancing & disk space utilization

EECS 584, Fall 2011

GFS - Topics Covered
  • Motivation
  • Architectural /File System Hierarchical Overview
  • Read/Write/Append/Snapshot Operation
  • Key Design Parameters
  • Replication & Rebalancing
  • Garbage Collection
  • Fault Tolerance

EECS 584, Fall 2011

Garbage Collection
  • A file deletion is logged by the master
  • The file is not reclaimed immediately; it is renamed to a hidden name carrying the deletion timestamp
  • The master reclaims these files during its periodic scan if they are older than 3 days
  • While reclaiming, the in-memory metadata is erased, thus severing the file's link to its chunks
  • In a similar scan of the chunk space, the master identifies orphaned chunks & erases the corresponding metadata
  • The chunks are reclaimed by the chunk servers after confirmation during regular heartbeat messages
  • Stale replicas are also collected, using version numbers

EECS 584, Fall 2011

Stale Replica Detection
  • Each chunk is associated with a version number maintained by both the master and the chunk server
  • The version number is incremented whenever a new lease is granted
  • If the chunk server's version lags behind the master's version, the chunk is marked for GC (see the sketch below)
  • If the master's version lags behind the chunk server's, the master is updated
  • This version number is also included in all communications so that the client/chunk server can verify it before performing any operation
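The version-number comparison above as a tiny sketch; the function name and returned messages are illustrative.

```python
def check_replica_version(master_version, chunkserver_version):
    """A replica whose version lags the master's is stale (garbage-collect it);
    if the master's record lags the chunk server's, the master adopts the
    higher version."""
    if chunkserver_version < master_version:
        return "stale: mark replica for garbage collection"
    if chunkserver_version > master_version:
        return "master behind: update master's version number"
    return "up to date"

print(check_replica_version(7, 6))  # stale replica
print(check_replica_version(7, 8))  # master missed a lease grant
```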

EECS 584, Fall 2011

GFS - Topics Covered
  • Motivation
  • Architectural /File System Hierarchical Overview
  • Read/Write/Append/Snapshot Operation
  • Key Design Parameters
  • Replication & Rebalancing
  • Garbage Collection
  • Fault Tolerance

EECS 584, Fall 2011

Fault Tolerance
  • Both master and chunk server are designed to restore their state and start in seconds
  • Replication of chunks across racks and within racks ensure high availability
  • Monitoring infrastructure outside GFS watches for master failure and starts a new master process from the replicated master state
  • “Shadow” masters provide read-only access even when the primary master is down

EECS 584, Fall 2011

Fault Tolerance
  • A shadow master keeps itself up to date by periodically applying the primary master's growing operation log to its own state
  • It also periodically exchanges heartbeat messages with the chunk servers to locate replicas
  • Data integrity is maintained through checksums kept at the chunk servers
  • This verification is done during any read, write, or chunk-migration request & also periodically

EECS 584, Fall 2011

Benchmark

EECS 584, Fall 2011

Measurements & Results

EECS 584, Fall 2011

Key Design Parameters
  • The choice of chunk size (64 MB), combined with the nature of reads/writes, offers several advantages:
    • Reduces client-master interaction
    • Makes many operations on the same chunk more likely
    • Reduces the size of the metadata (it can be held in main memory)
  • But hotspots can develop when many clients request the same chunk
  • This can be mitigated with replication, staggered application start-ups, P2P sharing, etc.

EECS 584, Fall 2011

Key Design Parameters
  • Chunk location information is not persistent; it is collected via heartbeat messages and stored in main memory
  • This eliminates the need to keep the master in sync whenever chunk servers join or leave the cluster
  • Also, given the large chunk size, the metadata to be stored in memory is greatly reduced
  • This small size also allows periodic scanning of the metadata for garbage collection, re-replication & chunk migration without incurring much overhead

EECS 584, Fall 2011

Key Design Parameters
  • The operation log maintains the transactional information in GFS
  • It employs checkpointing to keep the log size & recovery time low
  • These logs are replicated across multiple servers to ensure reliability
  • Responses to clients are provided only after the log records have been flushed to all these replicas

EECS 584, Fall 2011