Mariposa / The Google File System

Haowei Lu

Madhusudhanan Palani

EECS 584, Fall 2011

From LAN to WAN
  • Drawbacks of traditional distributed DBMS
    • Static Data Allocation
      • Move objects manually
    • Single Administrative Structure
      • Cost-based optimizer cannot scale well
    • Uniformity
      • Different Machine Architecture
      • Different Data Type

EECS 584, Fall 2011

From LAN to WAN
  • New requirements
    • Scalability to a large number of cooperating sites
    • Data mobility
    • No global synchronization
    • Total local autonomy
    • Easily configurable policies

EECS 584, Fall 2011

From LAN to WAN
  • Solution – A distributed microeconomic approach
    • Well studied economic model
    • Reduces scheduling complexity (?!)
    • The invisible hand leads to a local optimum

EECS 584, Fall 2011

Mariposa
  • Let each site act on its own behalf to maximize its own profit
  • In turn, this improves the overall performance of the DBMS ecosystem

EECS 584, Fall 2011

Architecture - Glossary
  • Fragment – Units of storage that are bought and sold by sites
    • Range distribution
    • Hash-Based distribution
      • Unstructured! Whenever the site wants!
  • Stride
    • Operations that can proceed in parallel

EECS 584, Fall 2011

Architecture

EECS 584, Fall 2011

The bidding process

EECS 584, Fall 2011

The bidding process
  • The Broker: sends out requests for bids for the query plan
  • The Bidder: responds to a request for bid with its formulated price and other information in the form:
    • (C, D, E): Cost, Delay, Expiration Date (see the sketch below)
  • The whole logic is implemented using RUSH
    • A low-level, very efficient embedded scripting language and rule system
    • Form: on <condition> do <action>
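Below is a minimal Python sketch of the bid information exchanged here; the Bid dataclass and the on_bid_request handler are hypothetical names used only to illustrate the (C, D, E) triple and the "on <condition> do <action>" flavor of a RUSH rule, not actual RUSH code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Bid:
    """A bidder's answer to a request for bid: (Cost, Delay, Expiration)."""
    cost: float          # C: price charged to run the subquery
    delay: float         # D: promised processing time in seconds
    expires: datetime    # E: date after which the bid is void

def on_bid_request(subquery, billing_rate, estimated_delay):
    """Hypothetical handler mirroring the RUSH 'on <condition> do <action>' form:
    on receiving a request for bid, respond with a (C, D, E) triple."""
    cost = billing_rate * estimated_delay
    return Bid(cost=cost, delay=estimated_delay,
               expires=datetime.now() + timedelta(days=1))
```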

EECS 584, Fall 2011

The bidding process: Bidder
  • The Bidder: setting the price for a bid
    • Billing rate on a per-fragment basis
    • Considers site load
      • Actual Bid = Computed Bid * Load Average
    • Bids reference the hot list from the storage manager (see the sketch below)
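A small Python sketch of the pricing rule above (Actual Bid = Computed Bid x Load Average, with the hot list consulted); the per-fragment rate and the 2x hot-list premium are illustrative assumptions, not values from the paper.

```python
def price_bid(per_fragment_rate: float, fragments: list,
              load_average: float, hot_list: set) -> float:
    """Compute a bid price on a per-fragment basis, scaled by site load.

    Fragments on the storage manager's hot list are assumed (illustratively)
    to be billed at a higher rate.
    """
    computed_bid = 0.0
    for frag in fragments:
        rate = per_fragment_rate
        if frag in hot_list:      # hot fragments earn a premium (assumed 2x)
            rate *= 2.0
        computed_bid += rate
    # Actual Bid = Computed Bid * Load Average (from the slide)
    return computed_bid * load_average

# Example: two fragments, one of them hot, on a site with load average 1.5
print(price_bid(10.0, ["emp.f1", "emp.f2"], 1.5, hot_list={"emp.f2"}))
```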

EECS 584, Fall 2011

The bidding process

EECS 584, Fall 2011

The bidding process: Broker
  • The Broker
    • Input: Fragmented query plan
    • In process: decides which sites run the fragments & sends out bid acceptances
      • Expensive bid protocol
      • Purchase order protocol (mainly used)
    • Output: hands off the task to the coordinator

EECS 584, Fall 2011

The bidding process: Broker
  • Expensive bid protocol

[Diagram: the broker, while under budget, sends requests for bids to the bidders (individual sites); it consults the Ads Table (located at the name server) and a bookkeeping table of previous winner sites (kept at the same site as the broker).]
EECS 584, Fall 2011

The bidding process: Broker
  • Purchase Order Protocol

[Diagram: the broker sends the work directly to the most probable bidder; the site either accepts, processes the work, and generates a bill, or refuses and passes it to another site or returns it to the broker.]

EECS 584, Fall 2011

The bidding process: Broker
  • The Broker finds bidder using Ad table

EECS 584, Fall 2011

The bidding process: Broker
  • The Broker finds bidder using Ad table
  • Example (Sale Price)
    • Query-Template: SELECT * FROM TMP
    • Server Id: 123
    • Start Time: 2011/10/01
    • Expiration Time: 2011/10/04
    • Price: 10 unit
    • Delay: 5 seconds

EECS 584, Fall 2011

The bidding process: Broker
  • Types of Ads (REALLY FANCY)

EECS 584, Fall 2011

The bidding process: Bid Acceptance
  • The main idea: make the difference as large as possible
    • Difference := B(D) – C (D: delay, C: cost, B(t): the budget function)
  • Method: greedy algorithm (see the sketch below)
    • Pre-step: start from the least-delay result
    • Iteration steps:
      • Calculate the cost gradient CG := cost reduction / delay increase for each stride
      • Keep substituting the stride with MAX(CG) until the difference no longer increases
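A Python sketch of this greedy substitution, assuming a simple linear budget function B and that stride delays add up sequentially; the bid tuples and numbers are illustrative, not the paper's algorithm verbatim.

```python
def B(total_delay):
    """Assumed budget function: willing to pay 100 minus 2 per second of delay."""
    return 100.0 - 2.0 * total_delay

def accept_bids(strides):
    """strides: list of lists of (cost, delay) bids, one list per stride.
    Start from the least-delay bid per stride, then repeatedly swap in the
    alternative with the largest cost gradient (cost reduction / delay
    increase) while B(D) - C keeps growing."""
    chosen = [min(bids, key=lambda b: b[1]) for bids in strides]  # least delay first

    def difference(sel):
        total_cost = sum(c for c, _ in sel)
        total_delay = sum(d for _, d in sel)
        return B(total_delay) - total_cost

    improved = True
    while improved:
        improved = False
        best = None  # (cost_gradient, stride_index, candidate_bid)
        for i, bids in enumerate(strides):
            c0, d0 = chosen[i]
            for c, d in bids:
                if d > d0 and c < c0:
                    cg = (c0 - c) / (d - d0)       # cost gradient
                    if best is None or cg > best[0]:
                        best = (cg, i, (c, d))
        if best is not None:
            trial = list(chosen)
            trial[best[1]] = best[2]
            if difference(trial) > difference(chosen):
                chosen = trial
                improved = True
    return chosen, difference(chosen)

# Example: two strides, each with a fast/expensive and a slow/cheap bid
print(accept_bids([[(30, 2), (10, 6)], [(40, 3), (25, 5)]]))
```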

EECS 584, Fall 2011

The bidding process: Separate Bidder
  • Network bidder
    • One trip to get the bandwidth
    • A return trip to get the price
    • Happens at the second stage

EECS 584, Fall 2011

Storage Manager
  • An asynchronous process which runs in tandem with the bidder
  • Objective
    • Maximize revenue income per unit time
  • Functions
    • Calculate Fragment Values
    • Buy Fragments
    • Sell Fragments
    • Split/Coalesce Fragments

EECS 584, Fall 2011

Fragment Values
  • The value of a fragment is defined using its revenue history
  • The revenue history consists of
    • Query, number of records in the result, time since the last query, last revenue, delay, and CPU & I/O used
  • CPU & I/O are normalized & stored in site-independent units
  • Each site should (see the sketch below)
    • Convert these CPU & I/O units to site-specific units via weighting functions
    • Adjust revenue, as the current node may be faster or slower, by using the average bid curve
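To make the value calculation concrete, here is a minimal Python sketch; the tuple layout of the revenue history, the weighting multipliers, and the speed_ratio stand-in for the average-bid-curve adjustment are all illustrative assumptions, not Mariposa's actual formulas.

```python
def fragment_value(revenue_history, cpu_weight, io_weight, speed_ratio):
    """Estimate a fragment's value from its revenue history.

    Each record is assumed to hold (revenue, cpu_units, io_units) with CPU and
    I/O in site-independent units. The site converts them to its own cost via
    weighting functions (here simple multipliers) and scales revenue by how
    much faster or slower it is than the recorded site (speed_ratio stands in
    for the average-bid-curve adjustment on the slide)."""
    value = 0.0
    for revenue, cpu_units, io_units in revenue_history:
        local_cost = cpu_weight * cpu_units + io_weight * io_units
        adjusted_revenue = revenue * speed_ratio
        value += adjusted_revenue - local_cost
    return value

# Example: two past queries on this fragment, on a site 20% faster than average
print(fragment_value([(10.0, 2.0, 5.0), (6.0, 1.0, 3.0)],
                     cpu_weight=0.5, io_weight=0.2, speed_ratio=1.2))
```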

EECS 584, Fall 2011

Buying fragments
  • In order to bid for a query/subquery, the site must have the referenced fragments
  • The site can buy the fragments in advance (prefetch) or when the query comes in (on demand)
  • The buyer locates the owner of the fragment and requests its revenue history
  • Calculates the value of the fragment
  • Evicts old fragments (alternate fragments) to free up space
    • Only to the extent needed to make space available for the new fragment
  • Buyer offer price = value of fragment – value of alternate fragments + price received (see the sketch below)
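A direct translation of the offer-price formula above into Python; the helper name and the example values are made up for illustration.

```python
def buyer_offer_price(fragment_value, evicted_values, expected_sale_prices):
    """Offer price = value of the fragment
                     - value of the alternate (evicted) fragments
                     + price expected to be received for selling them.
    All values are assumed to come from revenue histories (previous slide)."""
    return fragment_value - sum(evicted_values) + sum(expected_sale_prices)

# Example: a fragment worth 50 revenue units; to make room we evict two
# fragments worth 10 and 5, which we expect to sell for 8 and 4.
print(buyer_offer_price(50.0, [10.0, 5.0], [8.0, 4.0]))  # -> 47.0
```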

EECS 584, Fall 2011

Selling Fragments
  • The seller can evict the fragment being bought or any other fragment (alternate) of equivalent size (Why is this a must?)
  • The seller will sell if (see the sketch below)
    • offer price > value of fragment (sold) – value of alternate fragments + price received
  • If the offer price is not sufficient,
    • the seller tries to evict a fragment of higher value,
    • and lowers the price of the fragment as a final option
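The seller's side of the same rule, again a literal Python rendering of the inequality above; the function name and inputs are illustrative.

```python
def seller_accepts(offer_price, fragment_value,
                   alternate_values, alternate_sale_prices):
    """Seller's decision rule from the slide: sell the fragment if the buyer's
    offer exceeds (value of the fragment being sold) - (value of the alternate
    fragments it could evict instead) + (price it would receive for them)."""
    threshold = fragment_value - sum(alternate_values) + sum(alternate_sale_prices)
    return offer_price > threshold

# Example: an offer of 47 against a fragment worth 50 whose alternates
# are worth 10 and could be sold for 6.
print(seller_accepts(47.0, 50.0, [10.0], [6.0]))  # -> True
```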

EECS 584, Fall 2011

Split & Coalesce
  • When to Split/Coalesce?
    • Split if there are too few fragments; otherwise parallelization will take a hit
    • Coalesce if there are too many fragments, as the overhead of dealing with them & response time will take a hit
  • The algorithm for split/coalesce must strike the correct balance between the two

EECS 584, Fall 2011

How to solve this issue??? An interlude

Why not extend my microeconomics analogy!?!

EECS 584, Fall 2011

Stonebraker's Microeconomics Idea
  • Market pressure should correct inappropriate fragment sizes
  • Large fragment size => large revenue
  • Now everyone wants a share of the pie
  • But the owner does not want to lose the revenue!

EECS 584, Fall 2011

The Idea Continued
  • Break the large fragment into smaller fragments
  • Smaller fragment means less revenue & less attractive for copies

EECS 584, Fall 2011

It still continues….
  • Smaller fragments also mean more overhead => Works against the owner!

EECS 584, Fall 2011

And it ends…
  • So depending on the market demand these two opposing motivations will balance each other

EECS 584, Fall 2011

How to solve this issue???

Why not extend my microeconomics analogy!?!

A more “concrete” approach !!

EECS 584, Fall 2011

A more “concrete” approach...
  • Mariposa calculates the expected delay (ED) due to parallel execution over multiple fragments (Numc)
  • It then computes the expected bid per site as
    • B(ED) / Numc
  • Vary Numc to arrive at the maximum revenue per site => Num* (see the sketch below)
  • Sites keep track of this Num* as the basis for their split/coalesce decisions
  • The sites must also ensure that existing contracts are not affected
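A small sketch of the Num* search described above; the budget function and the delay model (parallel speed-up plus a per-fragment coordination overhead) are illustrative assumptions, not values from the paper.

```python
def expected_bid_per_site(B, expected_delay_fn, num_fragments):
    """Expected revenue per site when a query runs in parallel over
    num_fragments fragments: B(ED) / Numc, as on the slide."""
    ed = expected_delay_fn(num_fragments)
    return B(ed) / num_fragments

def find_num_star(B, expected_delay_fn, max_fragments=64):
    """Vary Numc and keep the value that maximizes revenue per site (Num*)."""
    return max(range(1, max_fragments + 1),
               key=lambda n: expected_bid_per_site(B, expected_delay_fn, n))

# Illustrative assumptions: a linear budget function and a delay model in
# which parallelism helps but adds a small per-fragment coordination cost.
budget = lambda d: max(0.0, 100.0 - 2.0 * d)
delay = lambda n: 40.0 / n + 0.5 * n
print(find_num_star(budget, delay))  # -> 2 under these toy assumptions
```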

EECS 584, Fall 2011

Name Service Architecture

[Diagram: the broker contacts the name service; the name service queries multiple name servers, each of which gathers metadata from the local sites.]

EECS 584, Fall 2011

What are the different types of names?
  • Internal Names: They are location dependent and carry info related to the physical location of the object
  • Full Names: They uniquely identify an object, are location independent, & carry full info related to the attributes of the object
  • Common Names: They are user defined & scoped within a name space
  • Simple rules help translate common names to full names
  • The missing components are usually derived from parameters supplied by the user or from the user's environment
  • Name Context: This is similar to access modifiers in programming languages

EECS 584, Fall 2011

How are names resolved?
  • Name resolution helps discover the object that is bound to a name
    • Common Name => Full Name
    • Full Name => Internal Name
  • The broker employs the following steps to resolve a name (see the sketch below)
    • Searches the local cache
    • Rule-driven search to resolve ambiguities
    • Queries one or more name servers
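A minimal Python sketch of this three-step lookup order; the name formats, the name_context dictionary, and the dict-based "name servers" are illustrative assumptions rather than Mariposa's actual name syntax.

```python
def resolve_name(common_name, name_context, local_cache, name_servers):
    """Sketch of the broker's resolution order: local cache first, then a
    rule-driven step (fill missing components from the user's context),
    then one or more name servers. All structures are illustrative."""
    # 1. Local cache
    if common_name in local_cache:
        return local_cache[common_name]
    # 2. Rule-driven step: derive a full name from the user's environment
    full_name = "{user}.{db}.{name}".format(name=common_name, **name_context)
    # 3. Query name servers until one knows the binding
    for server in name_servers:
        internal = server.get(full_name)
        if internal is not None:
            local_cache[common_name] = internal
            return internal
    raise LookupError(f"cannot resolve {common_name!r}")

# Example with two in-memory stand-ins for name servers
ns1, ns2 = {}, {"alice.eecs584.TMP": "site123:/frag/TMP.0"}
print(resolve_name("TMP", {"user": "alice", "db": "eecs584"}, {}, [ns1, ns2]))
```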

EECS 584, Fall 2011

How is QoS of Name Servers Defined?
  • Name servers help translate common names to full names using name contexts provided by clients
  • The name service contacts various name servers
  • Each name server maintains a composite set of metadata about the local sites under it
  • It is the name server's role to periodically update its catalog
  • QoS is defined as the combination of the price & the staleness of this data

EECS 584, Fall 2011

Experiment

EECS 584, Fall 2011


The Query:

    • SELECT *
      FROM R1(SB), R2(B), R3(SD)
      WHERE R1.u1 = R2.u1
        AND R2.u1 = R3.u1

  • The following statistics are available to the optimizer
    • R1 join R2 (1MB)
    • R2 join R3 (3MB)
    • R1 join R2 join R3 (4.5MB)

EECS 584, Fall 2011


A traditional distributed RDBMS plans a query & sends the sub-queries to the processing sites, which is the same as the purchase order protocol

  • Therefore the overhead due to Mariposa is the difference in elapsed time between the two protocols, weighted by the proportion of queries using each protocol
  • Bid price = (1.5 x estimated cost) x load average (see the sketch below)
    • Load average = 1
  • A node will sell a fragment if
    • Offer price > 2 x scan cost / load average
  • The decision to buy a fragment rather than subcontract is based on
    • Sale price <= Total money spent on scans
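The experiment's pricing and trading rules above translate directly into code; this sketch simply restates them (the helper names are ours, not from the paper).

```python
def bid_price(estimated_cost, load_average):
    """Bid price = (1.5 x estimated cost) x load average (from the slide)."""
    return 1.5 * estimated_cost * load_average

def will_sell_fragment(offer_price, scan_cost, load_average):
    """A node sells a fragment if offer price > 2 x scan cost / load average."""
    return offer_price > 2.0 * scan_cost / load_average

def will_buy_rather_than_subcontract(sale_price, money_spent_on_scans):
    """Buy the fragment instead of subcontracting if its sale price is no more
    than the total money already spent scanning it remotely."""
    return sale_price <= money_spent_on_scans

# Example with load average 1 (as in the experiment)
print(bid_price(10.0, 1.0),
      will_sell_fragment(25.0, 10.0, 1.0),
      will_buy_rather_than_subcontract(18.0, 20.0))
```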

EECS 584, Fall 2011


The query optimizer chooses a plan based on the data transferred across the network

  • The initial plan generated by both Mariposa and the traditional systems will be similar
  • But due to the migration of fragments, subsequent executions of the same query will generate much better plans

EECS 584, Fall 2011

GFS - Topics Covered
  • Motivation
  • Architectural/File System Hierarchical Overview
  • Read/Write/Append/Snapshot Operation
  • Key Design Parameters
  • Replication & Rebalancing
  • Garbage Collection
  • Fault Tolerance

EECS 584, Fall 2011

Motivation
  • Customized Needs
  • Reliability
  • Availability
  • Performance
  • Scalability

EECS 584, Fall 2011

Customized Needs: How is it different?
  • Runs on commodity hardware, where failure is the expectation rather than the exception (PC vs. Mac, anyone?)
  • Huge files (on the order of multiple GBs)
  • Writes mostly involve appending data, unlike in traditional systems
  • The applications that use the system are in-house!
  • The files stored are primarily web documents

EECS 584, Fall 2011

GFS - Topics Covered
  • Motivation
  • Architectural/File System Hierarchical Overview
  • Read/Write/Append/Snapshot Operation
  • Key Design Parameters
  • Replication & Rebalancing
  • Garbage Collection
  • Fault Tolerance

EECS 584, Fall 2011

File System Hierarchy

[Diagram: the master server maintains the directory tree (Directory => File 1 ... File n) and the mapping from files to chunks; each chunk has a 64-bit globally unique id and is stored on chunk servers (Chunk0 ... Chunk5).]

EECS 584, Fall 2011

Types of servers
  • The master server holds all metadata information, such as
    • Directory => file mapping
    • File => chunk mapping
    • Chunk locations
  • It keeps in touch with the chunk servers via heartbeat messages
  • Chunk servers store the actual chunks on local disks as Linux files
  • For reliability, chunks may be replicated across multiple chunk servers

EECS 584, Fall 2011

GFS - Topics Covered
  • Motivation
  • Architectural /File System Hierarchical Overview
  • Read/Write/Append/Snapshot Operation
  • Key Design Parameters
  • Replication & Rebalancing
  • Garbage Collection
  • Fault Tolerance

EECS 584, Fall 2011

Read Operation
  • Using the fixed chunk size & the user-provided filename & byte offset, the client translates the offset into a chunk index (see the sketch below)
  • The filename & chunk index are then sent to the master to get the chunk handle & the replica locations
  • The client caches this info (for a limited time), using filename & chunk index as the key
  • The client then communicates directly with the closest chunk server
  • To minimize client-master interaction, the client batches chunk location requests & the master also returns locations for chunks following the requested ones.
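A small sketch of this client-side translation, assuming the 64 MB chunk size GFS uses; the helper name is ours, not the real client API.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunks

def to_chunk_request(filename, byte_offset):
    """Client-side translation described above: (filename, byte offset) ->
    (filename, chunk index) plus the offset within that chunk. The master is
    then asked for the chunk handle and replica locations for that index."""
    chunk_index = byte_offset // CHUNK_SIZE
    offset_in_chunk = byte_offset % CHUNK_SIZE
    return filename, chunk_index, offset_in_chunk

# Example: byte 200,000,000 of a file falls in chunk 2
print(to_chunk_request("/logs/web-00", 200_000_000))
```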

EECS 584, Fall 2011

Write Operation

EECS 584, Fall 2011

Write Operation
  • The client requests a chunk from the master
  • The master assigns a chunk lease (60 seconds, renewable) to a primary among the replicas
  • The client then pushes the data to be written to the nearest chunk server (see the sketch below)
    • Each chunk server in turn pushes this data to the next nearest server
    • This ensures that network bandwidth is fully utilized
  • Once all replicas have the data, the client sends the write request to the primary
    • The primary determines the order of mutations based on the requests it receives from one or more clients
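A sketch of the chained data push described above: replicas are ordered so that each hop forwards to the nearest remaining server. The distance function and host names are toy assumptions; in GFS the distance is estimated from IP addresses and each server forwards data while it is still receiving it.

```python
def push_data_along_chain(data, client_addr, replicas, distance):
    """Order the replicas into a forwarding chain: the client sends to the
    nearest replica, which forwards to the replica nearest to it, and so on.
    Returns the chain; in the real system the bytes stream along this chain."""
    chain, current, remaining = [], client_addr, list(replicas)
    while remaining:
        nxt = min(remaining, key=lambda r: distance(current, r))
        chain.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return chain

# Toy example: distances derived from made-up host coordinates
coords = {"client-00": 0, "rack1-02": 2, "rack1-07": 7, "rack2-03": 13}
dist = lambda a, b: abs(coords[a] - coords[b])
print(push_data_along_chain(b"payload", "client-00",
                            ["rack2-03", "rack1-07", "rack1-02"], dist))
```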

EECS 584, Fall 2011

Write Operation
  • The primary then pushes this ordering information to all replicas
  • The replicas then acknowledge the primary once the mutations have been successfully applied
  • The primary then acknowledges the client
  • Data flow is decoupled from control flow to ensure that the network topology dictates the throughput and not the choice of primary
  • The distance between two nodes is estimated from their IP addresses
  • Use of switched network with full duplex links allows servers to forward data as soon as they start receiving it

EECS 584, Fall 2011

Record Append Operations
  • Appends data to a file at least once atomically and returns the offset to the client
  • The client pushes the data to all replicas
  • It then sends the append request to the primary
  • The primary checks whether the chunk size would be exceeded (see the sketch below)
    • If so, it pads the remaining space of the old chunk, instructs the replicas to do the same, and asks the client to retry the operation on the next chunk
    • Else it writes to the chunk and instructs the replicas to do so
  • If an append fails at any replica, the client retries the operation
  • This is the single most commonly used operation for distributed applications at Google to write concurrently to a file
  • It allows simple coordination schemes rather than the complex distributed locking mechanisms used for traditional writes
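A sketch of the primary's decision described above, assuming 64 MB chunks; the return values are illustrative, not the real GFS RPC interface.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

def primary_handle_append(current_chunk_used, record_size):
    """If the record would not fit in the current chunk, pad the remainder
    and ask the client to retry on a new chunk; otherwise write it and
    return the chosen offset."""
    if current_chunk_used + record_size > CHUNK_SIZE:
        padding = CHUNK_SIZE - current_chunk_used
        return {"action": "pad_and_retry", "padding_bytes": padding}
    offset = current_chunk_used
    return {"action": "written", "offset": offset,
            "new_used": current_chunk_used + record_size}

print(primary_handle_append(64 * 1024 * 1024 - 100, 4096))  # forces a retry
print(primary_handle_append(1024, 4096))                    # normal append
```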

EECS 584, Fall 2011

Snapshot Operations
  • This is used by applications for checkpointing their progress
  • Creates an instant copy of a file or directory tree while minimizing interruptions to ongoing mutations
  • The master revokes any outstanding leases on the affected chunks
  • The master duplicates the metadata, which still points to the same chunks
  • Upon the first write request, the master asks the chunk server to copy the chunk
  • The copy is created on the same chunk server, thereby avoiding network traffic

EECS 584, Fall 2011

GFS - Topics Covered
  • Motivation
  • Architectural /File System Hierarchical Overview
  • Read/Write/Append/Snapshot Operation
  • Key Design Parameters
  • Replication & Rebalancing
  • Garbage Collection
  • Fault Tolerance

EECS 584, Fall 2011

Replication & Rebalancing
  • Chunks are replicated both across racks and within racks
  • This not only boosts availability, reliability, etc., but also exploits aggregate bandwidth for reads
  • Placement of chunks (balancing) depends on several factors:
    • Even out disk utilization across servers
    • Limit the number of recent creations on each chunk server
    • Spread replicas across racks
  • The number of replicas is configurable, and the master ensures it does not drop below the threshold

EECS 584, Fall 2011

Replication & Rebalancing
  • Priority on which chunks to re-replicate is assigned by the master based on various factors like
    • Distance from the threshold
    • Live chunks over deleted chunks
    • Chunks blocking progress of clients
  • Master as well as clients throttle the cloning operations to ensure that they do not interfere with regular operations
  • Master also does periodic rebalancing for better load balancing & disk space utilization

EECS 584, Fall 2011

GFS - Topics Covered
  • Motivation
  • Architectural /File System Hierarchical Overview
  • Read/Write/Append/Snapshot Operation
  • Key Design Parameters
  • Replication & Rebalancing
  • Garbage Collection
  • Fault Tolerance

EECS 584, Fall 2011

Garbage Collection
  • A file deletion is logged by the master
  • The file is not reclaimed immediately; it is renamed to a hidden name carrying the deletion timestamp
  • The master reclaims these files during its periodic scan if they are older than 3 days
  • While reclaiming, the in-memory metadata is erased, thus severing the file's link to its chunks
  • In a similar scan of the chunk space, the master identifies orphaned chunks & erases the corresponding metadata
  • The chunks are reclaimed by the chunk servers after confirmation during regular heartbeat messages
  • Stale replicas are also collected, using version numbers

EECS 584, Fall 2011

Stale Replica Detection
  • Each chunk is associated with a version number maintained by both the master and the chunk server
  • The version number is incremented whenever a new lease is granted
  • If the chunk server's version lags behind the master's version, the chunk is marked for GC (see the sketch below)
  • If the master's version lags behind the chunk server's, the master is updated
  • This version number is also included in all communications so that the client/chunk server can verify it before performing any operation
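The version-number comparison above as a tiny sketch; the function name and returned messages are illustrative.

```python
def check_replica_version(master_version, chunkserver_version):
    """A replica whose version lags the master's is stale (garbage-collect it);
    if the master's record lags the chunk server's, the master adopts the
    higher version."""
    if chunkserver_version < master_version:
        return "stale: mark replica for garbage collection"
    if chunkserver_version > master_version:
        return "master behind: update master's version number"
    return "up to date"

print(check_replica_version(7, 6))  # stale replica
print(check_replica_version(7, 8))  # master missed a lease grant
```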

EECS 584, Fall 2011

GFS - Topics Covered
  • Motivation
  • Architectural /File System Hierarchical Overview
  • Read/Write/Append/Snapshot Operation
  • Key Design Parameters
  • Replication & Rebalancing
  • Garbage Collection
  • Fault Tolerance

EECS 584, Fall 2011

Fault Tolerance
  • Both master and chunk server are designed to restore their state and start in seconds
  • Replication of chunks across racks and within racks ensure high availability
  • Monitoring infrastructure outside GFS watches for master failure and starts a new master process from the replicated master state
  • “Shadow” masters provide read-only access even when the primary master is down

EECS 584, Fall 2011

Fault Tolerance
  • A shadow master keeps itself up to date by periodically applying the primary master's growing operation log to its own state
  • It also periodically exchanges heartbeat messages with the chunk servers to locate replicas
  • Data integrity is maintained through checksums kept at the chunk servers
  • This verification is done during any read, write, or chunk-migration request & also periodically

EECS 584, Fall 2011

Benchmark

EECS 584, Fall 2011

Measurements & Results

EECS 584, Fall 2011

Key Design Parameters
  • The choice of chunk size (64 MB), combined with the nature of reads/writes, offers several advantages:
    • Reduces client-master interaction
    • Makes many operations on the same chunk more likely
    • Reduces the size of the metadata (it can be held in main memory)
  • But hotspots can develop when many clients request the same chunk
  • This can be mitigated with replication, staggered application start-ups, P2P sharing, etc.

EECS 584, Fall 2011

Key Design Parameters
  • Chunk location information is not persistent; it is collected via heartbeat messages and stored in main memory
  • This eliminates the need to keep the master in sync whenever chunk servers join or leave the cluster
  • Also, given the large chunk size, the metadata to be stored in memory is greatly reduced
  • This small size also allows periodic scanning of the metadata for garbage collection, re-replication & chunk migration without incurring much overhead

EECS 584, Fall 2011

Key Design Parameters
  • The operation log maintains the transactional information in GFS
  • It employs checkpointing to keep the log size & recovery time low
  • These logs are replicated across multiple servers to ensure reliability
  • Responses to clients are provided only after the log records have been flushed to all these replicas

EECS 584, Fall 2011