Cloud Computing and Data Centers: Overview
  • What’s Cloud Computing?
  • Data Centers and “Computing at Scale”
  • Case Studies:
    • Google File System
    • Map-Reduce Programming Model

  • Optional Material:
    • Google Bigtable
  • Readings:
    • Do the required readings
    • Also do some of the optional readings if interested

Why Study Cloud Computing and Data Centers?
  • Using Google as an example: GFS, MapReduce, etc.
    • mostly related to distributed systems, not really “networking” stuff
  • Two primary goals:
    • they represent part of current and “future” trends: how applications will be serviced, delivered, …
      • what are the important “new” networking problems?
    • more importantly, what lessons can we learn in terms of (future) networking design?
      • closely related, and there are many similar issues/challenges (availability, reliability, scalability, manageability, …)
      • (but of course, there are also unique challenges in networking)
Internet and Web
  • Simple client-server model
    • a number of clients served by a single server
    • performance determined by “peak load”
    • doesn’t scale well: when the # of clients suddenly increases (a “flash crowd”), the server may crash
  • From single server to blade server to server farm (or data center)
Internet and Web …
  • From “traditional” web to “web service” (or SOA)
    • no longer simply “file” (or web page) downloads
      • pages often dynamically generated, more complicated “objects” (e.g., Flash videos used in YouTube)
    • HTTP is used simply as a “transfer” protocol
      • many other “application protocols” layered on top of HTTP
    • web services & SOA (service-oriented architecture)
  • A schematic representation of “modern” web services:
    • front-end: web rendering, request routing, aggregators, …
    • back-end: database, storage, computing, …
Data Center and Cloud Computing
  • Data center: large server farms + data warehouses
    • not simply for web/web services
    • managed infrastructure: expensive!
  • From web hosting to cloud computing
    • individual web/content providers: must provision for peak load
      • Expensive, and typically resources are under-utilized
    • web hosting: third party provides and owns the (server farm) infrastructure, hosting web services for content providers
    • “server consolidation” via virtualization

      • each client’s stack (App + Guest OS) remains under the client web service’s control, running on top of a shared VMM

Cloud Computing
  • Cloud computing and cloud-based services:
    • beyond web-based “information access” or “information delivery”
    • computing, storage, …
  • Cloud Computing: NIST Definition

"Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction."

  • Models of Cloud Computing
    • “Infrastructure as a Service” (IaaS), e.g., Amazon EC2, Rackspace
    • “Platform as a Service” (PaaS), e.g., Microsoft Azure
    • “Software as a Service” (SaaS), e.g., Google
Data Centers: Key Challenges
  • With thousands of servers within a data center:
    • how to write applications (services) for them?
    • how to allocate resources, and manage them?
      • in particular, how to ensure performance, reliability, availability, …
  • Scale and complexity bring other key challenges:
    • with thousands of machines, failures are the default case!
    • load-balancing, handling “heterogeneity,” …
  • Data center (server cluster) as a “computer”
    • “super-computer” vs. “cluster computer”: a single “super-high-performance” and highly reliable computer vs. a “computer” built out of thousands of “cheap & unreliable” PCs
    • Pros and cons?
Case Studies
  • Google File System (GFS)
    • a “file system” (or “OS”) for the “cluster computer”
      • an “overlay” on top of the “native” OS on individual machines
    • designed with certain (common) types of applications in mind, and designed with failures as the default case
  • Google MapReduce (cf. Microsoft Dryad)
    • a new “programming paradigm” for certain (common) types of applications, built on top of GFS
  • Other examples (optional):
    • Bigtable: a (semi-)structured database for efficient key-value queries, etc., built on top of GFS
    • Amazon Dynamo: a distributed <key, value> storage system; high availability is a key design goal
    • Google’s Chubby, Sawzall, etc.
    • Open-source systems: Hadoop, …
Google Scale and Philosophy
  • Lots of data
    • copies of the web, satellite data, user data, email and USENET, Subversion backing store
  • Workloads are large and easily parallelizable
  • No commercial system big enough
    • couldn’t afford it if there was one
    • might not have made appropriate design choices
    • But truckloads of low-cost machines
      • 450,000 machines (NYTimes estimate, June 14th 2006)
  • Failures are the norm
    • Even reliable systems fail at Google scale
  • Software must tolerate failures
    • Which machine an application is running on should not matter
    • Firm believers in the “end-to-end” argument
  • Care about perf/$, not absolute machine perf
Typical Cluster at Google
  • Cluster-wide services: Cluster Scheduling Master, Lock Service, GFS Master
  • Every machine runs Linux, plus a Scheduler Slave and a GFS Chunkserver
  • User tasks, BigTable servers, and a BigTable Master are scheduled onto the same machines, alongside those daemons
Google: System Building Blocks
  • Google File System (GFS):
    • raw storage
  • (Cluster) Scheduler:
    • schedules jobs onto machines
  • Lock service:
    • distributed lock manager
    • also can reliably hold tiny files (100s of bytes) w/ high availability
  • Bigtable:
    • a multi-dimensional database
  • MapReduce:
    • simplified large-scale data processing
  • ....
Chubby: Distributed Lock Service
  • {lock/file/name} service
  • Coarse-grained locks, can store small amount of data in a lock
  • 5 replicas, need a majority vote to be active
  • Also an OSDI ’06 Paper
Google File System

Key Design Considerations

  • Component failures are the norm
    • hardware component failures, software bugs, human errors, power supply issues, …
    • Solutions: built-in mechanisms for monitoring, error detection, fault tolerance, automatic recovery
  • Files are huge by traditional standards
    • multi-GB files are common, billions of objects
    • most writes (modifications or “mutations”) are “append”
    • two types of reads: large # of “stream” (i.e., sequential) reads, with small # of “random” reads
  • High concurrency (multiple “producers/consumers” on a file)
    • atomicity with minimal synchronization
  • Sustained bandwidth more important than latency
GFS Architectural Design
  • A GFS cluster:
    • a single master + multiple chunkservers per master
    • running on commodity Linux machines
  • A file: a sequence of fixed-size chunks (64 MB)
    • labeled with 64-bit unique global IDs,
    • stored at chunkservers (as “native” Linux files, on local disk)
    • each chunk mirrored across (default 3) chunkservers
  • master server: maintains all metadata
    • name space, access control, file-to-chunk mappings, garbage collection, chunk migration
    • why only a single master? (with read-only shadow masters)
      • simple, and only answers chunk location queries from clients!
  • chunk servers (“slaves” or “workers”):
    • interact directly with clients, perform reads/writes, …
GFS Architecture: Illustration
  • GFS clients
    • consult master for metadata
    • typically ask for multiple chunk locations per request
    • access data from chunkservers

Separation of control and data flows

Chunk Size and Metadata
  • Chunk size: 64 MB
    • fewer chunk location requests to the master
    • client can perform many operations on a chunk
      • reduces overhead to access a chunk
      • can establish a persistent TCP connection to a chunkserver
    • fewer metadata entries
      • metadata can be kept in memory (at the master)
      • in-memory data structures allow fast periodic scanning
    • some potential problems with fragmentation
  • Metadata
    • file and chunk namespaces (files and chunk identifiers)
    • file-to-chunk mappings
    • locations of a chunk’s replicas
Chunk Locations and Logs
  • Chunk location:
    • does not keep a persistent record of chunk locations
    • polls chunkservers at startup, and uses heartbeat messages to monitor chunkservers: simplicity!
      • because of chunkserver failures, it is hard to keep persistent record of chunk locations
    • on-demand approach vs. coordination
      • on-demand wins when changes (failures) are frequent
  • Operation logs
    • maintains historical record of critical metadata changes
    • Namespace and mapping
    • for reliability and consistency, replicate operation log on multiple remote machines (“shadow masters”)
Clients and APIs
  • GFS not transparent to clients
    • requires clients to perform certain “consistency” verification (using chunk id & version #), make snapshots (if needed), …
  • APIs:
    • open, delete, read, write (as expected)
    • append: at least once, possibly with gaps and/or inconsistencies among clients
    • snapshot: quickly create copy of file
  • Separation of data and control:
    • Issues control (metadata) requests to master server
    • Issues data requests directly to chunkservers
  • Caches metadata, but does no caching of data
    • no consistency difficulties among clients
    • streaming reads (read once) and append writes (write once) don’t benefit much from caching at client
System Interaction: Read
  • Client sends master:
    • read(file name, chunk index)
  • Master’s reply:
    • chunk ID, chunk version#, locations of replicas
  • Client sends “closest” chunkserver w/replica:
    • read(chunk ID, byte range)
    • “closest” determined by IP address on a simple rack-based network topology

  • Chunkserver replies with data
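To make the control/data split concrete, here is a minimal Python sketch of the client-side read path (the names master.lookup, .read, and pick_closest are illustrative assumptions, not the real GFS API):

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed-size chunks

def pick_closest(replicas):
    # Placeholder policy: real GFS estimates "closeness" from IP addresses
    # in the simple rack-based topology.
    return replicas[0]

def gfs_read(master, file_name, offset, length):
    # Control flow: translate the byte offset to a chunk index, then ask
    # the master (metadata only) where that chunk lives.
    chunk_index = offset // CHUNK_SIZE
    chunk_id, version, replicas = master.lookup(file_name, chunk_index)
    # Data flow: fetch the byte range directly from the closest chunkserver.
    start = offset % CHUNK_SIZE
    return pick_closest(replicas).read(chunk_id, version, (start, start + length))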
System Interactions: Write and Record Append
  • Write and Record Append (atomic)
    • slightly different semantics: record append is “atomic”
  • The master grants a chunk lease to a chunkserver (primary), and replies back to client
  • Client first pushes data to all chunkservers
    • pushed linearly: each replica forwards as it receives
    • pipelined transfer: 13 MB/second with 100 Mbps network
  • Then issues a write/append to primary chunkserver
  • Primary chunkserver determines the order of updates to all replicas
    • in record append: primary chunkserver checks to see whether record append would exceed maximum chunk size
    • if yes, pad the chunk (and ask secondaries to do the same), and then ask the client to retry the append on the next chunk
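A rough sketch of that record-append decision at the primary chunkserver (assumed helper names, for illustration only):

CHUNK_MAX = 64 * 1024 * 1024  # 64 MB maximum chunk size

def primary_record_append(chunk, data, secondaries):
    # Would this record push the chunk past its maximum size?
    if chunk.size + len(data) > CHUNK_MAX:
        chunk.pad_to(CHUNK_MAX)                    # pad the current chunk...
        for s in secondaries:
            s.pad_to(chunk.id, CHUNK_MAX)          # ...and have secondaries pad too
        return "RETRY_ON_NEXT_CHUNK"               # client retries on a fresh chunk
    offset = chunk.append(data)                    # primary picks the offset/order...
    for s in secondaries:
        s.apply(chunk.id, offset, data)            # ...and imposes it on all replicas
    return offset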
Leases and Mutation Order
  • Lease:
    • 60 second timeouts; can be extended indefinitely
    • extension requests are piggybacked on heartbeat messages
    • after a timeout expires, master can grant new leases
  • Use leases to maintain consistent mutation order across replicas
  • Master grants the lease to one of the replicas -> the primary
  • Primary picks a serial order for all mutations
  • Other replicas follow the primary’s order
Consistency Model
  • Changes to namespace (i.e., metadata) are atomic
    • done by single master server!
    • Master uses log to define global total order of namespace-changing operations
  • Relaxed consistency
    • concurrent changes are consistent but “undefined”
      • defined: after a mutation, the file region is consistent, and all clients see the entire mutation
    • an append is atomically committed at least once
      • occasional duplications
  • All changes to a chunk are applied in the same order to all replicas
  • Use version number to detect missed updates
Master Namespace Management & Logs
  • Namespace: files and their chunks
    • metadata maintained as “flat names”, no hard/symbolic links
    • full path name to metadata mapping
      • with prefix compression
  • Each node in the namespace has an associated read-write lock (-> a total global order, no deadlocks)
    • concurrent operations can be properly serialized by this locking mechanism
  • Metadata updates are logged
    • logs replicated on remote machines
    • take global snapshots (checkpoints) to truncate logs (but checkpoints can be created while updates arrive)
  • Recovery
    • Latest checkpoint + subsequent log files
Replica Placement
  • Goals:
    • Maximize data reliability and availability
    • Maximize network bandwidth
  • Need to spread chunk replicas across machines and racks
  • Higher priority to re-replicating chunks with lower replication factors
  • Limited resources spent on replication
Other Operations
  • Locking operations
    • one lock per path component, so files in the same directory can be modified concurrently
      • to access /d1/d2/leaf, need to lock /d1, /d1/d2, and /d1/d2/leaf
      • each thread acquires: a read lock on a directory & a write lock on a file
    • totally ordered locking to prevent deadlocks
  • Garbage Collection:
    • simpler than eager deletion due to
      • unfinished replicated creation, lost deletion messages
    • deleted files are hidden for three days, then they are garbage collected
      • combined with other background (e.g., take snapshots) ops
    • safety net against accidents
Fault Tolerance and Diagnosis
  • Fast recovery
    • Master and chunkserver are designed to restore their states and start in seconds regardless of termination conditions
  • Chunk replication
  • Data integrity
    • A chunk is divided into 64-KB blocks
    • Each with its checksum
    • Verified at read and write times
    • Also background scans for rarely used data
  • Master replication
    • Shadow masters provide read-only access when the primary master is down
GFS: Summary
  • GFS is a distributed file system that supports large-scale data-processing workloads on commodity hardware
    • GFS has different points in the design space
      • Component failures as the norm
      • Optimize for huge files
    • Success: used actively by Google to support search service and other applications
    • But performance may not be good for all apps
      • assumes read-once, write-once workload (no client caching!)
  • GFS provides fault tolerance
    • Replicating data (via chunk replication), fast and automatic recovery
  • GFS has a simple, centralized master that does not become a bottleneck
  • Semantics not transparent to apps (“end-to-end” principle?)
    • Must verify file contents to avoid inconsistent regions, repeated appends (at-least-once semantics)
Google MapReduce
  • The problem
    • Many simple operations in Google
      • Grep for data, compute indexes, compute summaries, etc.
    • But the input data is large, really large
      • the whole Web, billions of pages
    • Google has lots of machines (clusters of 10K etc)
    • Many computations over VERY large datasets
    • Question is: how do you use large # of machines efficiently?
  • Can reduce computational model down to two steps
    • Map: take one operation, apply to many many data tuples
    • Reduce: take result, aggregate them
  • MapReduce
    • A generalized interface for massively parallel cluster processing
MapReduce Programming Model
  • Intuitively just like those from functional languages
    • Scheme, Lisp, Haskell, etc.
  • Map: initial parallel computation
    • map (in_key, in_value) -> list(out_key, intermediate_value)
    • In: a set of key/value pairs
    • Out: a set of intermediate key/value pairs
    • Note keys might change during Map
  • Reduce: aggregation of intermediate values by key
    • reduce (out_key, list(intermediate_value)) -> list(out_value)
    • Combines all intermediate values for a particular key
    • Produces a set of merged output values (usually just one)
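As a sanity check on the model, here is a toy single-process sketch of the two phases in Python (ignoring distribution, fault tolerance, and GFS entirely):

from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every (in_key, in_value) pair.
    intermediate = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, value in map_fn(in_key, in_value):
            intermediate[out_key].append(value)  # "shuffle": group by out_key
    # Reduce phase: combine all intermediate values for each key.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}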
Example: Word Counting
  • Goal
    • Count # of occurrences of each word in many documents
  • Sample data
    • Page 1: the weather is good
    • Page 2: today is good
    • Page 3: good weather is good
  • So what does this look like in MapReduce?

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
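The same computation, run end-to-end against the run_mapreduce sketch from the programming-model section above (a toy driver, but its output matches the walkthrough on the next slide):

pages = [
    ("Page 1", "the weather is good"),
    ("Page 2", "today is good"),
    ("Page 3", "good weather is good"),
]

def word_count_map(doc_name, contents):
    for word in contents.split():
        yield (word, 1)

def word_count_reduce(word, counts):
    return sum(counts)

print(run_mapreduce(pages, word_count_map, word_count_reduce))
# {'the': 1, 'weather': 2, 'is': 3, 'good': 4, 'today': 1}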

Map/Reduce in Action
  • Input:
    • Page 1: the weather is good
    • Page 2: today is good
    • Page 3: good weather is good
  • Map (one worker per page):
    • Worker 1: (the 1), (weather 1), (is 1), (good 1)
    • Worker 2: (today 1), (is 1), (good 1)
    • Worker 3: (good 1), (weather 1), (is 1), (good 1)
  • Feed (intermediate pairs grouped by key):
    • Worker 1: (the 1)
    • Worker 2: (is 1), (is 1), (is 1)
    • Worker 3: (weather 1), (weather 1)
    • Worker 4: (today 1)
    • Worker 5: (good 1), (good 1), (good 1), (good 1)
  • Reduce:
    • Worker 1: (the 1)
    • Worker 2: (is 3)
    • Worker 3: (weather 2)
    • Worker 4: (today 1)
    • Worker 5: (good 4)
More Examples
  • Distributed Grep
    • Map: emit line if matches given pattern P
    • Reduce: identity function, just copy result to output
  • Count of URL Access Frequency
    • Map: parses URL access logs, outputs <URL, 1>
    • Reduce: adds together counts for the same URL, outputs <URL, totalCount>
  • Reverse web-link graph: who links to this page?
    • Map: go through all source pages, output <target, source> for each link
    • Reduce: for each target, concatenate all sources into <target, list(sources)>
  • Many more examples, see paper [MapReduce, OSDI’04]
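For instance, the reverse web-link graph fits the earlier run_mapreduce sketch directly (the regex below is a crude stand-in for a real link extractor):

import re

def reverse_links_map(source_url, html):
    # Emit <target, source> for every outgoing link found on the page.
    for target in re.findall(r'href="([^"]+)"', html):
        yield (target, source_url)

def reverse_links_reduce(target, sources):
    # Concatenate all sources pointing at this target.
    return list(sources)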
MapReduce Architecture
  • A single Master node coordinates many worker nodes (“worker bees”)
MapReduce Operation
  • Initial data split into 64MB blocks
  • Map workers compute over their splits; results stored locally
  • Master informed of result locations
  • Master sends data locations to reduce workers
  • Final output written
What if Workers Die?
  • And you know they will…
  • Master periodically pings workers
    • Still alive and working? Good…
  • If corpse found…
    • Allocate task to next idle worker (ruthless!)
    • If a Map worker dies, need to recompute all its data. Why? (its map outputs are stored on local disk, which is lost with the machine)
  • If corpse comes back to life… (zombies!)
    • Give it a task, and clean slate
  • What if the Master dies?
    • Only 1 Master; if it dies, the whole thing stops
    • Fairly rare occurrence
What if You Find Stragglers?
  • Some workers can be slower than others
    • Faulty hardware
    • Software misconfiguration / bug
    • Whatever …
  • Near completion of task
    • Master looks at stragglers and their tasks
    • Assigns “backup” workers to also compute these tasks
    • Whoever finishes first wins!
    • Can now leave stragglers behind!
Google Bigtable
  • Distributed multi-level map
    • With an interesting data model
  • Fault-tolerant, persistent
  • Scalable
    • Thousands of servers
    • Terabytes of in-memory data
    • Petabytes of disk-based data
    • Millions of reads/writes per second, efficient scans
  • Self-managing
    • Servers can be added/removed dynamically
    • Servers adjust to load imbalance
  • Key points:
    • Data Model and Implementation Structure
      • Tablets, SSTables, compactions, locality groups, …
    • API and Details: shared logs, compression, replication, …
Basic Data Model
  • Distributed multi-dimensional sparse map
    • (row, column, timestamp) -> cell contents
    • e.g., row “www.cnn.com”, column “contents”, timestamps t1, t2, t3 -> successive versions of “<html>…”
  • Good match for most of Google’s applications
Rows
  • A row key is an arbitrary string
    • Typically 10-100 bytes in size, up to 64 KB.
    • Every read or write of data under a single row is atomic
  • Data is maintained in lexicographic order by row key
  • The row range for a table is dynamically partitioned
  • Each partition (row range) is named a tablet
    • Unit of distribution and load-balancing.
  • Objective: make read operations single-sited!
    • E.g., In Webtable, pages in the same domain are grouped together by reversing the hostname components of the URLs: com.google.maps instead of maps.google.com.
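The row-key trick is plain string manipulation, e.g. (not a Bigtable API):

def domain_row_key(hostname):
    # maps.google.com -> com.google.maps, so pages from one domain
    # sort next to each other and land in the same tablet(s)
    return ".".join(reversed(hostname.split(".")))

assert domain_row_key("maps.google.com") == "com.google.maps"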
Columns
  • Example row “cnn.com”: page body (“…”) under “contents:”, plus anchor columns “anchor:cnnsi.com” and “anchor:stanford.edu” holding link text (“CNN”, “CNN homepage”)
  • Columns have two-level name structure:
    • Family: optional_qualifier; e.g., Language:English
  • Column family
    • A column family must be created before data can be stored in a column key
    • Unit of access control
    • Has associated type information
  • Qualifier gives unbounded columns
    • Additional level of indexing, if desired
Locality Groups
  • column families can be assigned to a locality group
    • Used to organize underlying storage representation for performance
    • data in a locality group can be mapped into memory, and is stored in a separate SSTable
    • Avoid mingling data, e.g. page contents and page metadata
  • Can compress locality groups
  • Bloom Filters on SSTables in a locality group
    • avoid searching SSTable if bit not set
  • Tablet movement
    • Major compaction (with concurrent updates)
    • Minor compaction (to catch up with updates) without any concurrent updates
    • Load on new server without requiring any recovery action
Timestamps (64 bit integers)
  • Used to store different versions of data in a cell
    • New writes default to current time, but timestamps for writes can also be set explicitly by clients
  • Assigned by:
    • Bigtable: real-time in microseconds,
    • client application: when unique timestamps are a necessity.
  • Items in a cell are stored in decreasing timestamp order
    • Application specifies how many versions (n) of data items are maintained in a cell.
    • Bigtable garbage collects obsolete versions
  • Lookup options:
    • “Return most recent K values”
    • “Return all values in timestamp range (or all values)”
  • Column families can be marked w/ attributes:
    • “Only retain most recent K values in a cell”
    • “Keep values until they are older than K seconds”
Tablets
  • Large tables broken into tablets at row boundaries
    • Tablet holds contiguous range of rows
      • Clients can often choose row keys to achieve locality
    • Aim for ~100MB to 200MB of data per tablet
  • Serving machine responsible for ~100 tablets
    • Fast recovery:
      • 100 machines each pick up 1 tablet from failed machine
    • Fine-grained load balancing
      • Migrate tablets away from overloaded machine
      • Master makes load-balancing decisions
Tablets & Splitting
[Figure: a Webtable with “language” and “contents” columns; rows aaa.com, cnn.com, cnn.com/sports.html, …, Website.com, …, Zuppa.com/menu.html are partitioned into tablets at row boundaries]
Tablets & Splitting …
[Figure: the same table after a split: a new tablet boundary appears between Yahoo.com/kids.html and Yahoo.com/kids.html?D]
Table, Tablet and SSTable
  • Multiple tablets make up the table
  • SSTables can be shared
  • Tablets do not overlap, SSTables can overlap

[Figure: two non-overlapping tablets (row ranges aardvark to apple, and apple_two_E to boat) built from four SSTables, with SSTables possibly shared between tablets]
Tablet Representation
  • Writes go to an append-only log on GFS and to a write buffer in memory (random-access)
  • Reads see the merged view of the in-memory buffer and the SSTables on GFS (mmap)
  • SSTable: immutable on-disk ordered map from string -> string
    • string keys: <row, column, timestamp> triples
Tablet
  • Contains some range of rows of the table
  • Built out of multiple SSTables

[Figure: a tablet covering the row range from aardvark (start) to apple (end), built from two SSTables; each SSTable is a sequence of 64K blocks plus an index]

SSTable
  • Immutable, sorted file of key-value pairs
  • Chunks of data plus an index
    • Index is of block ranges, not values
  • An SSTable is stored in GFS

[Figure: an SSTable as a sequence of 64K blocks followed by an index]
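A minimal sketch of the block-index lookup this structure enables (an invented in-memory stand-in, loosely following the description above):

import bisect

class SSTable:
    """Immutable, sorted key-value file: data blocks plus a sparse index."""
    def __init__(self, blocks):
        # blocks: list of sorted (key, value) lists, each ~64 KB when serialized
        self.blocks = blocks
        self.index = [block[0][0] for block in blocks]  # first key of each block

    def get(self, key):
        # The index covers block ranges, not individual values:
        # binary-search the index, then scan a single block.
        i = bisect.bisect_right(self.index, key) - 1
        if i < 0:
            return None
        for k, v in self.blocks[i]:
            if k == key:
                return v
        return None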

Immutability
  • SSTables are immutable
    • simplifies caching, sharing across GFS etc
    • no need for concurrency control
    • SSTables of a tablet recorded in METADATA table
    • Garbage collection of SSTables done by master
    • On tablet split, split tables can start off quickly on shared SSTables, splitting them lazily
  • Only the memtable sees concurrent reads and updates
    • rows are copy-on-write, allowing concurrent read/write
Overall Architecture
  • Client talks to the Master Server and to Tablet Servers
  • Master Server:
    • assigning tablets
    • detecting the addition and expiration of tablet servers
    • balancing tablet-server load
    • handling schema changes
  • Chubby:
    • handles master election
    • stores the bootstrap location of Bigtable data
    • discovers tablet servers
    • stores access control lists
  • Cluster Management System (CMS):
    • scheduling jobs, managing resources on the cluster, dealing with machine failures
  • The master machine also runs the GFS master server and a CMS client; each tablet-server machine also runs a GFS chunkserver and a CMS client
Tablet Assignment
  • Master keeps track of the set of live tablet servers and the current assignment of tablets to tablet servers, including which tablets are unassigned
  • Assignment protocol (cluster manager, tablet server, Chubby, master), as in the figure:
    1) Cluster manager starts a tablet server
    2) Tablet server creates a lock (a uniquely named file in Chubby)
    3) Tablet server acquires the lock
    4) Master monitors the server directory in Chubby
    5) Master assigns tablets to the server
    6) Master periodically checks the lock status
    8) If the tablet server has failed, master acquires and deletes the lock
    9) Master reassigns the unassigned tablets
Placement of Tablets
  • A tablet is assigned to one tablet server at a time.
  • Master maintains:
    • The set of live tablet servers,
    • Current assignment of tablets to tablet servers (including the unassigned ones)
  • Chubby maintains tablet servers:
    • A tablet server creates and acquires an exclusive (X) lock on a uniquely named file in a specific Chubby directory (the server directory),
    • Master monitors server directory to discover tablet server,
    • A tablet server stops processing requests if it loses its X lock (network partitioning).
      • Tablet server keeps trying to reacquire an X lock on its uniquely named file as long as the file exists.
      • If the uniquely named file no longer exists, the tablet server kills itself and goes back to a free pool, to be assigned tablets by the master.
How Client Locates Tablets
  • 3-level hierarchical lookup scheme for tablets
    • Location is ip:port of relevant server
    • 1st level: bootstrapped from lock server, points to owner of META0
    • 2nd level: Uses META0 data to find owner of appropriate META1 tablet
    • 3rd level: META1 table holds locations of tablets of all other tables
      • META1 table itself can be split into multiple tablets
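A sketch of the resulting lookup chain (invented helper names; META0/META1 as described above):

def locate_tablet(chubby, table, row_key):
    # Level 1: a Chubby file (from the lock server) names the META0 owner.
    meta0_server = chubby.read("/bigtable/root-location")
    # Level 2: META0 points at the META1 tablet covering this key.
    meta1_server = meta0_server.lookup_meta0(table, row_key)
    # Level 3: META1 holds the locations (ip:port) of all other tablets.
    ip, port = meta1_server.lookup_meta1(table, row_key)
    return ip, port  # the tablet server to contact; clients cache these locations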
Editing a Table
  • Mutations are logged, then applied to an in-memory memtable
    • May contain “deletion” entries to handle updates
    • Group commit on log: collect multiple updates before log flush

[Figure: a tablet’s write path; Insert/Delete mutations are appended to the tablet log on GFS and applied to the memtable in memory, while SSTables on GFS hold older data (e.g., rows apple_two_E, boat)]
Client Write & Read Operations
  • Write operation arrives at a tablet server:
    • Server ensures the client has sufficient privileges for the write operation (Chubby),
    • A log record is generated to the commit log file,
    • Once the write commits, its contents are inserted into the memtable.
    • Once memtable reaches a threshold:
      • memtable is frozen,
      • a new memtable is created,
      • frozen memtable is converted to an SSTable and written to GFS

  • Read operation arrives at a tablet server:
    • Server ensures client has sufficient privileges for the read operation (Chubby),
    • Read is performed on a merged view of (a) the SSTables that constitute the tablet, and (b) the memtable.
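Continuing the toy memtable above, the merged read view might look like this (a sketch; real reads also merge multiple versions by timestamp):

def tablet_read(memtable, sstables, key):
    # Memtable holds the newest state; fall back to SSTables, newest first.
    value = memtable.data.get(key)
    if value is TOMBSTONE:
        return None              # deleted, even if older SSTables still hold it
    if value is not None:
        return value
    for sstable in sstables:     # ordered newest to oldest
        value = sstable.get(key)
        if value is not None:
            return value
    return None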
Compaction
  • Minor compaction:
    • memtable -> a new SSTable
    • frozen memtable is written out; a new memtable is created
  • Merging compaction:
    • memtable + a few SSTables -> a new SSTable
    • done periodically; deleted data are still alive
  • Major compaction:
    • memtable + all SSTables -> one SSTable
    • deleted data are removed; storage can be re-used
[Figure: write ops go to the tablet log and memtable; read ops see the memtable plus the SSTable files (V1.0 through V6.0) on DFS, with a frozen memtable between memory and disk]
System Structure (Again)
  • Bigtable client uses the Bigtable client library: Open() the table, then read/write
  • Bigtable cell:
    • Bigtable master: performs metadata ops, load balancing
    • Bigtable tablet servers: serve data
  • Underlying services:
    • Cluster Scheduling Master: handles failover, monitoring
    • GFS: holds tablet data, logs
    • Lock service: holds metadata, handles master election
Master’s Tasks
  • Use Chubby to monitor health of tablet servers, restart failed servers
  • Tablet server registers itself by getting a lock in a specific Chubby directory
    • Chubby gives “lease” on lock, must be renewed periodically
    • Server loses lock if it gets disconnected
  • Master monitors this directory to find which servers exist/are alive
    • If server not contactable/has lost lock, master grabs lock and reassigns tablets
    • GFS replicates data; prefer to start a tablet server on a machine where the data is already stored
Master’s Tasks …

When (new) master starts

  • grabs master lock on chubby
    • ensures only one master at a time
  • finds live servers (scan chubby directory)
  • communicates with servers to find assigned tablets
  • scans metadata table to find all tablets
    • Keeps track of unassigned tablets, assigns them
    • Root metadata tablet location comes from Chubby; other metadata tablets are assigned before scanning
Shared Logs
  • Designed for 1M tablets, 1000s of tablet servers
    • 1M logs being simultaneously written performs badly
  • Solution: shared logs
    • Write log file per tablet server instead of per tablet
      • Updates for many tablets co-mingled in same file
    • Start new log chunks every so often (64MB)
  • Problem: during recovery, server needs to read log data to apply mutations for a tablet
    • Lots of wasted I/O if lots of machines need to read data for many tablets from same log chunk
Shared Log Recovery

Recovery:

  • Servers inform master of log chunks they need to read
  • Master aggregates and orchestrates sorting of needed chunks
    • Assigns log chunks to be sorted to different tablet servers
    • Servers sort chunks by tablet, writing the sorted data to local disk
  • Other tablet servers ask the master which servers have the sorted chunks they need
  • Tablet servers issue direct RPCs to peer tablet servers to read sorted data for their tablets
Application 1: Google Analytics
  • Enables webmasters to analyze traffic patterns at their web sites. Statistics such as:
    • Number of unique visitors per day and the page views per URL per day,
    • Percentage of users that made a purchase given that they earlier viewed a specific page.
  • How?
    • A small JavaScript program that the webmaster embeds in their web pages.
    • Every time the page is visited, the program is executed.
    • Program records the following information about each request:
      • User identifier
      • The page being fetched
Application 2: Personalized Search
  • Records user queries and clicks across Google properties.
  • Users browse their search histories and request personalized search results based on their historical usage patterns.
  • One Bigtable:
    • Row name is userid
    • A column family is reserved for each action type, e.g., web queries, clicks.
    • User profiles are generated using MapReduce.
      • These profiles personalize live search results.
    • Replicated geographically to reduce latency and increase availability.
Recap: Google Scale and Philosophy
  • Lots of data
    • copies of the web, satellite data, user data, email and USENET, Subversion backing store
  • Workloads are large and easily parallelizable
  • No commercial system big enough
    • couldn’t afford it if there was one
    • might not have made appropriate design choices
    • But truckloads of low-cost machines
      • 450,000 machines (NYTimes estimate, June 14th 2006)
  • Failures are the norm
    • Even reliable systems fail at Google scale
  • Software must tolerate failures
    • Which machine an application is running on should not matter
    • Firm believers in the “end-to-end” argument
  • Care about perf/$, not absolute machine perf