

Mariposa & The Google File System



Presentation Transcript


  1. Mariposa & The Google File System Haowei Lu Madhusudhanan Palani EECS 584, Fall 2011

  2. From LAN to WAN • Drawbacks of traditional distributed DBMS • Static data allocation • Objects must be moved manually • Single administrative structure • Cost-based optimizers cannot scale well • Uniformity • Different machine architectures • Different data types EECS 584, Fall 2011

  3. From LAN to WAN • New requirements • Scalability to a large number of cooperating sites • Data mobility • No global synchronization • Total local autonomy • Easily configurable policies EECS 584, Fall 2011

  4. From LAN to WAN • Solution – a distributed microeconomic approach • A well-studied economic model • Reduces scheduling complexity (?!) • The invisible hand drives local optimization toward good overall behavior EECS 584, Fall 2011

  5. Mariposa • Let each site act on its own behalf to maximize its own profit • In turn, this improves the overall performance of the DBMS ecosystem EECS 584, Fall 2011

  6. Architecture - Glossary • Fragment – the unit of storage that is bought and sold by sites • Range distribution • Hash-based distribution • Unstructured – split however the site wants! • Stride – a set of operations that can proceed in parallel EECS 584, Fall 2011

  7. Architecture EECS 584, Fall 2011

  8. EECS 584, Fall 2011

  9. The bidding process EECS 584, Fall 2011

  10. The bidding process • The Broker: sends out requests for bids for the query plan • The Bidder: responds to a request for bid with its formulated price and other information in the form (C, D, E) – Cost, Delay, Expiration Date • The whole logic is implemented using RUSH • A low-level, very efficient embedded scripting language and rule system • Form: on <condition> do <action> (a sketch follows below) EECS 584, Fall 2011
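
The RUSH scripts themselves are not reproduced in the slides, so here is a minimal Python sketch of the two ideas on this slide: the (C, D, E) bid triple and an "on <condition> do <action>" rule. The names BidResponse, Rule, and fire are illustrative, not Mariposa's actual API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable

@dataclass
class BidResponse:
    """A bidder's answer to a request for bid: the (C, D, E) triple."""
    cost: float            # C: price the bidder charges
    delay: float           # D: promised processing delay, in seconds
    expiration: datetime   # E: date until which the bid is valid

@dataclass
class Rule:
    """Toy stand-in for a RUSH rule of the form 'on <condition> do <action>'."""
    condition: Callable[[dict], bool]
    action: Callable[[dict], None]

def fire(rules, event):
    """Run the action of every rule whose condition matches the event."""
    for rule in rules:
        if rule.condition(event):
            rule.action(event)

# Example: answer any request-for-bid event with a fixed-price bid.
rules = [Rule(condition=lambda e: e["type"] == "request_for_bid",
              action=lambda e: print(BidResponse(cost=10.0, delay=5.0,
                                                 expiration=datetime.now() + timedelta(days=3))))]
fire(rules, {"type": "request_for_bid", "query": "SELECT * FROM TMP"})
```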

  11. The bidding process: Bidder • The Bidder: setting the bid price • Billing rate on a per-fragment basis • Considers site load: actual bid = computed bid × load average (see the sketch below) • Consults the storage manager's hot list when bidding EECS 584, Fall 2011
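
A minimal sketch of the load-adjusted pricing rule above. The per-fragment billing rate and fragment count are illustrative inputs; a real bidder would also consult the storage manager's hot list.

```python
def compute_bid(base_rate_per_fragment: float, num_fragments: int,
                load_average: float) -> float:
    """Load-adjusted bid, per the slide: actual bid = computed bid * load average."""
    computed_bid = base_rate_per_fragment * num_fragments
    return computed_bid * load_average

# A site billing 2 units per fragment, touching 3 fragments, at load average 1.5:
print(compute_bid(2.0, 3, 1.5))   # -> 9.0
```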

  12. The bidding process EECS 584, Fall 2011

  13. The bidding process: Broker • The Broker • Input: fragmented query plan • In process: decides the sites to run fragments & sends out bid acceptances • Expensive bid protocol • Purchase order protocol (mainly used) • Output: hands off the task to the coordinator EECS 584, Fall 2011

  14. The bidding process: Broker • Expensive bid protocol • [Diagram: the broker consults the Ads table (located at the name server) and a bookkeeping table of previous winning sites (kept at the broker's site), sends requests for bids to individual bidder sites, and accepts a set of bids that stays under budget] EECS 584, Fall 2011

  15. The bidding process: Broker • Purchase Order Protocol • [Diagram: the broker sends the work directly to the most likely bidder; if that site accepts, it processes the work and generates a bill; if it refuses, it either passes the work to another site or returns it to the broker] EECS 584, Fall 2011

  16. The bidding process: Broker • The broker finds bidders using the Ads table EECS 584, Fall 2011

  17. The bidding process: Broker • The broker finds bidders using the Ads table • Example (sale-price ad) • Query template: SELECT * FROM TMP • Server Id: 123 • Start Time: 2011/10/01 • Expiration Time: 2011/10/04 • Price: 10 units • Delay: 5 seconds (see the sketch below) EECS 584, Fall 2011
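
As a rough illustration, the example ad could be represented as a record like the following; the field names are mine, not Mariposa's catalog schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class SalePriceAd:
    """One row of the Ads table, mirroring the example on this slide."""
    query_template: str       # e.g. "SELECT * FROM TMP"
    server_id: int            # site offering to run the template
    start_time: datetime      # when the offer becomes valid
    expiration_time: datetime
    price: float              # units charged to run the template
    delay: float              # promised delay, in seconds

ad = SalePriceAd("SELECT * FROM TMP", 123,
                 datetime(2011, 10, 1), datetime(2011, 10, 4),
                 price=10.0, delay=5.0)
```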

  18. The bidding process: Broker • Types of Ads (REALLY FANCY) EECS 584, Fall 2011

  19. The bidding process: Bid Acceptance • The main idea: make the difference as large as possible • Difference := B(D) – C (D: delay, C: cost, B(t): the budget function) • Method: greedy algorithm (sketched below) • Pre-step: get the least-delay result • Iteration steps: • Calculate the cost gradient CG := cost reduction / delay increase for each stride • Keep substituting the stride with MAX(CG) until the difference no longer increases EECS 584, Fall 2011
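
A minimal Python sketch of this greedy procedure, assuming bids arrive as (cost, delay) pairs per stride, that strides execute one after another (so delays add), and that B is the client's budget function. All names and numbers are illustrative.

```python
def greedy_accept(bids_per_stride, budget):
    """Greedy bid acceptance: start from the least-delay bid per stride, then
    repeatedly substitute the alternative with the steepest cost gradient
    (cost saved / delay added) while B(D) - C keeps increasing."""
    # Pre-step: pick the minimum-delay bid for every stride.
    choice = [min(bids, key=lambda b: b[1]) for bids in bids_per_stride]

    def difference(ch):
        total_cost = sum(c for c, _ in ch)
        total_delay = sum(d for _, d in ch)
        return budget(total_delay) - total_cost

    while True:
        best_cg, best_swap = 0.0, None
        for i, bids in enumerate(bids_per_stride):
            cur_cost, cur_delay = choice[i]
            for cost, delay in bids:
                saved, added = cur_cost - cost, delay - cur_delay
                if saved <= 0 or added <= 0:
                    continue  # only cheaper-but-slower substitutions have a cost gradient
                trial = choice[:i] + [(cost, delay)] + choice[i + 1:]
                cg = saved / added
                if difference(trial) > difference(choice) and cg > best_cg:
                    best_cg, best_swap = cg, (i, (cost, delay))
        if best_swap is None:
            break  # no substitution increases the difference any further
        i, bid = best_swap
        choice[i] = bid
    return choice

# Toy example: two strides; the client pays 100 units minus 2 per second of delay.
bids = [[(30.0, 1.0), (10.0, 4.0)], [(25.0, 2.0), (20.0, 3.0)]]
print(greedy_accept(bids, budget=lambda d: 100.0 - 2.0 * d))  # -> [(10.0, 4.0), (20.0, 3.0)]
```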

  20. The bidding process: Separate Bidder • Network bidder • One trip to get bandwidth • Return trip to get the price • Happens in a second stage EECS 584, Fall 2011

  21. EECS 584, Fall 2011

  22. Storage Manager • An asynchronous process that runs in tandem with the bidder • Objective • Maximize revenue per unit time • Functions • Calculate fragment values • Buy fragments • Sell fragments • Split/coalesce fragments EECS 584, Fall 2011

  23. Fragment Values • The value of a fragment is defined using its revenue history • The revenue history consists of • Query, number of records in the result, time since the last query, last revenue, delay, CPU & I/O used • CPU & I/O are normalized & stored in site-independent units • Each site should • Convert these CPU & I/O units to site-specific units via weighting functions • Adjust the revenue, since the current node may be faster or slower, using the average bid curve EECS 584, Fall 2011

  24. Buying fragments • To bid for a query/subquery, a site must have the referenced fragments • The site can buy the fragments in advance (prefetch) or when the query comes in (on demand) • The buyer locates the owner of the fragment and requests its revenue history • Calculates the value of the fragment • Evicts old fragments (alternates) to free up space, to the extent needed for the new fragment • Buyer's offer price = value of fragment – value of alternate fragments + price received (see the sketch below) EECS 584, Fall 2011
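
A sketch of the buyer's offer-price rule above; the inputs (fragment value, values of the alternates to evict, and the price expected for reselling them) are illustrative placeholders that Mariposa would derive from revenue histories.

```python
def buyer_offer_price(fragment_value: float,
                      alternate_values: list[float],
                      expected_resale_price: float) -> float:
    """Offer = value of fragment - value of evicted (alternate) fragments
               + price expected to be received for them."""
    return fragment_value - sum(alternate_values) + expected_resale_price

# Buying a fragment worth 50, evicting fragments worth 10 and 5 that we expect
# to resell for 12 in total:
print(buyer_offer_price(50.0, [10.0, 5.0], 12.0))   # -> 47.0
```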

  25. Selling Fragments • The seller can evict the fragment being bought or any other fragment (alternate) of equivalent size (why is this a must?) • The seller will sell if • offer price > value of the fragment being sold – value of the alternate fragments + price received (see the sketch below) • If the offer price is not sufficient • the seller tries to evict a fragment of higher value • or lowers the price of the fragment as a final option EECS 584, Fall 2011
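
And the matching sketch of the seller's decision rule, with the same caveat that all the values are placeholders.

```python
def seller_accepts(offer_price: float,
                   fragment_value: float,
                   alternate_values: list[float],
                   expected_price_received: float) -> bool:
    """Accept if offer > value of the fragment being sold
                         - value of the alternate fragments
                         + price expected for them.
    If not, the seller would instead try to evict a higher-valued fragment
    or, as a last resort, lower the fragment's price."""
    threshold = fragment_value - sum(alternate_values) + expected_price_received
    return offer_price > threshold

print(seller_accepts(40.0, 50.0, [15.0], 8.0))   # 40 > 43 is False -> reject
```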

  26. Split & Coalesce • When to split/coalesce? • Split if there are too few fragments, otherwise parallelization takes a hit • Coalesce if there are too many fragments, as the overhead of managing them & the response time take a hit • The algorithm for split/coalesce must strike the right balance between the two EECS 584, Fall 2011

  27. How to solve this issue??? An interlude: Why not extend my microeconomics analogy!?! EECS 584, Fall 2011

  28. Stonebraker’s Microeconomics Idea • Market pressure should correct inappropriate fragment sizes • Large fragment size => • Now everyone wants a share of the pie • But the owner does not want to lose the revenue! EECS 584, Fall 2011

  29. The Idea Continued • Break the large fragment into smaller fragments • Smaller fragment means less revenue & less attractive for copies EECS 584, Fall 2011

  30. It still continues…. • Smaller fragments also mean more overhead => Works against the owner! EECS 584, Fall 2011

  31. And it ends… • So depending on the market demand these two opposing motivations will balance each other EECS 584, Fall 2011

  32. How to solve this issue??? Why not extend my microeconomics analogy!?! A more “concrete” approach!! EECS 584, Fall 2011

  33. A more “concrete” approach... • Mariposa calculates the expected delay (ED) due to parallel execution over multiple fragments (Numc) • It then computes the expected bid per site as • B(ED) / Numc • Vary Numc to find the value that maximizes revenue per site => Num* (see the sketch below) • Sites keep track of this Num* and base their split/coalesce decisions on it • **The sites must also ensure that existing contracts are not affected EECS 584, Fall 2011
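
A sketch of this rule under made-up budget and delay curves; in Mariposa the budget function comes from the client and the expected delay from observed behavior, so both lambdas below are pure assumptions.

```python
def best_fragment_count(budget, expected_delay, max_fragments: int = 64) -> int:
    """Return Num*, the fragment count that maximizes expected revenue per site,
    i.e. B(ED(n)) / n.

    budget(d): B(d), money the client pays for total delay d.
    expected_delay(n): ED when work is spread over n fragments (more fragments ->
                       more parallelism -> lower delay, but each site's share shrinks).
    """
    return max(range(1, max_fragments + 1),
               key=lambda n: budget(expected_delay(n)) / n)

# Toy curves: delay shrinks like 100/n plus a 2-second-per-fragment overhead,
# and the client pays 300 minus 10 per second of delay (never less than 0).
num_star = best_fragment_count(budget=lambda d: max(0.0, 300.0 - 10.0 * d),
                               expected_delay=lambda n: 100.0 / n + 2.0 * n)
print(num_star)   # -> 7 with these toy curves
```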

  34. Name Service Architecture • [Diagram: the broker contacts the name service, which queries one or more name servers; each name server covers a set of local sites] EECS 584, Fall 2011

  35. What are the different types of names? • Internal names: location dependent; carry info about the physical location of the object • Full names: uniquely identify an object, are location independent & carry full info about the object’s attributes • Common names: user defined, within a name space • Simple rules help translate common names to full names • The missing components are usually derived from parameters supplied by the user or from the user’s environment • Name context: similar to access modifiers in programming languages EECS 584, Fall 2011

  36. How are names resolved? • Name resolution discovers the object that is bound to a name • Common name => full name • Full name => internal name • The broker employs the following steps to resolve a name (sketched below) • Search the local cache • Rule-driven search to resolve ambiguities • Query one or more name servers EECS 584, Fall 2011
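
A sketch of that resolution cascade in Python. The interfaces (expand_rules as plain functions, a local cache dict, name servers with a lookup() method) are stand-ins, not Mariposa's real ones, and the example names are invented.

```python
def resolve(common_name, local_cache, expand_rules, name_servers):
    """Resolve a common name following the broker's steps on this slide."""
    # Rule-driven expansion: fill in missing components (user, namespace, ...)
    # to turn the common name into candidate full names.
    candidates = [full for rule in expand_rules for full in rule(common_name)]
    # Step 1: try the broker's local cache first.
    for full_name in candidates:
        if full_name in local_cache:
            return local_cache[full_name]
    # Steps 2/3: otherwise query one or more name servers.
    for full_name in candidates:
        for server in name_servers:
            internal = server.lookup(full_name)
            if internal is not None:
                return internal
    return None

class ToyNameServer:
    def __init__(self, table):
        self.table = table
    def lookup(self, full_name):
        return self.table.get(full_name)

servers = [ToyNameServer({"/db/alice/TMP": "site123:frag7"})]
print(resolve("TMP", {}, [lambda n: ["/db/alice/" + n]], servers))  # -> site123:frag7
```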

  37. How is the QOS of name servers defined? • Name servers help translate common names to full names using name contexts provided by clients • The name service contacts various name servers • Each name server maintains a composite set of metadata for the local sites under it • It is the name server’s role to periodically update its catalog • QOS is defined as the combination of the price & the staleness of this data EECS 584, Fall 2011

  38. Experiment EECS 584, Fall 2011

  39. The Query: • SELECT * FROM R1(SB), R2(B), R3(SD) WHERE R1.u1 = R2.u1 AND R2.u1 = R3.u1 • The following statistics are available to the optimizer • R1 join R2 (1MB) • R2 join R3 (3MB) • R1 join R2 join R3 (4.5MB) EECS 584, Fall 2011

  40. The traditional distributed RDBMS plans a query & sends the sub-queries to the processing sites, which is the same as the purchase order protocol • Therefore the overhead due to Mariposa is the difference in elapsed time between the two protocols, weighted by the proportion of queries using each protocol • Bid price = (1.5 × estimated cost) × load average • Load average = 1 • A node will sell a fragment if • Offer price > 2 × scan cost / load average • The decision to buy a fragment rather than subcontract is based on • Sale price <= total money spent on scans (see the sketch below) EECS 584, Fall 2011
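
The numeric policies on this slide translate directly into code; the sketch below simply restates them, and the numbers in the example calls are made up.

```python
def bid_price(estimated_cost: float, load_average: float = 1.0) -> float:
    """Bidder policy in the experiment: bid = (1.5 x estimated cost) x load average."""
    return 1.5 * estimated_cost * load_average

def will_sell_fragment(offer_price: float, scan_cost: float,
                       load_average: float = 1.0) -> bool:
    """A node sells a fragment if the offer exceeds 2 x scan cost / load average."""
    return offer_price > 2.0 * scan_cost / load_average

def buy_rather_than_subcontract(sale_price: float, money_spent_on_scans: float) -> bool:
    """Buy the fragment instead of subcontracting if its sale price is at most
    the total money already spent scanning it."""
    return sale_price <= money_spent_on_scans

# With load average fixed at 1, as in the experiment:
print(bid_price(40.0))                          # -> 60.0
print(will_sell_fragment(90.0, 40.0))           # 90 > 80 -> True
print(buy_rather_than_subcontract(70.0, 85.0))  # -> True
```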

  41. The query optimizer chooses a plan based on the data transferred across the network • The initial plans generated by Mariposa and the traditional system will be similar • But due to the migration of fragments, subsequent executions of the same query generate much better plans EECS 584, Fall 2011

  42. EECS 584, Fall 2011

  43. GFS - Topics Covered • Motivation • Architectural/File System Hierarchical Overview • Read/Write/Append/Snapshot Operation • Key Design Parameters • Replication & Rebalancing • Garbage Collection • Fault Tolerance EECS 584, Fall 2011

  44. Motivation • Customized Needs • Reliability • Availability • Performance • Scalability EECS 584, Fall 2011

  45. Customized Needs How is it different? • Runs on commodity hardware, where failure is the expectation rather than the exception (PC vs Mac anyone?) • Huge files (on the order of multiple GBs) • Writes mostly append data, unlike traditional systems • The applications that use the system are in-house! • Files stored are primarily web documents EECS 584, Fall 2011

  46. GFS - Topics Covered • Motivation • Architectural/File System Hierarchical Overview • Read/Write/Append/Snapshot Operation • Key Design Parameters • Replication & Rebalancing • Garbage Collection • Fault Tolerance EECS 584, Fall 2011

  47. File System Hierarchy • [Diagram: the master server maps directories to files and files to chunks identified by 64-bit globally unique IDs; the chunks (e.g. Chunk0–Chunk5) are stored on chunk servers] EECS 584, Fall 2011

  48. Types of servers • The master server holds all metadata, such as • Directory => file mapping • File => chunk mapping • Chunk locations • It keeps in touch with the chunk servers via heartbeat messages • Chunk servers store the actual chunks on local disks as Linux files • For reliability, chunks may be replicated across multiple chunk servers (see the sketch below) EECS 584, Fall 2011
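
A toy Python model of the three metadata mappings the slide lists. The dictionary representation, field names, and example values are illustrative, not GFS's internal structures.

```python
from dataclasses import dataclass, field

@dataclass
class MasterMetadata:
    """Toy model of the metadata the GFS master keeps, per this slide."""
    directory_to_files: dict[str, list[str]] = field(default_factory=dict)
    file_to_chunks: dict[str, list[int]] = field(default_factory=dict)    # 64-bit chunk handles
    chunk_locations: dict[int, list[str]] = field(default_factory=dict)   # handle -> chunk servers

meta = MasterMetadata()
meta.directory_to_files["/web"] = ["crawl_0001"]
meta.file_to_chunks["crawl_0001"] = [0xDEADBEEF00000001, 0xDEADBEEF00000002]
# Each chunk may be replicated across several chunk servers for reliability.
meta.chunk_locations[0xDEADBEEF00000001] = ["chunkserver-a", "chunkserver-b", "chunkserver-c"]
```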

  49. GFS - Topics Covered • Motivation • Architectural /File System Hierarchical Overview • Read/Write/Append/Snapshot Operation • Key Design Parameters • Replication & Rebalancing • Garbage Collection • Fault Tolerance EECS 584, Fall 2011

  50. EECS 584, Fall 2011
