Google Bigtable. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber Google, Inc. OSDI 2006. Google Scale. Lots of data

Google Bigtable

Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber

Google, Inc.

OSDI 2006

Google Scale
  • Lots of data
    • Copies of the web, satellite data, user data, email and USENET, Subversion backing store
  • Many incoming requests
  • No commercial system big enough
    • Couldn’t afford it if there was one
    • Might not have made appropriate design choices
Data model: a big map
  • <Row, Column, Timestamp> triple as key
    • <Row, Column, Timestamp> -> string
  • API has lookup, insert and delete operations based on the key
  • Does not support a relational model
    • No table-wide integrity constraints
    • No multirow transactions
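
The "big map" idea can be illustrated with a minimal in-memory sketch. This is only a toy under stated assumptions: the class name BigMap and its method names are hypothetical and are not Bigtable's client API; they simply mirror the lookup, insert and delete operations named on the slide.

```python
class BigMap:
    """Toy stand-in for a (row, column, timestamp) -> string map.

    Hypothetical names; this is not the Bigtable client API.
    """

    def __init__(self):
        self._cells = {}  # (row, column, timestamp) -> value

    def insert(self, row, column, timestamp, value):
        # A write addresses exactly one cell version.
        self._cells[(row, column, timestamp)] = value

    def lookup(self, row, column, timestamp):
        # Returns None when the cell version does not exist.
        return self._cells.get((row, column, timestamp))

    def delete(self, row, column, timestamp):
        # Deleting a missing cell version is a no-op.
        self._cells.pop((row, column, timestamp), None)


table = BigMap()
table.insert("zebras", "weight", 2006, "620 lbs")
table.insert("zebras", "weight", 2007, "600 lbs")
table.delete("zebras", "weight", 2006)
print(table.lookup("zebras", "weight", 2007))
print(table.lookup("zebras", "weight", 2006))
```

Every operation touches a single key, which is consistent with the slide's point that there are no table-wide constraints or multirow transactions.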
Data Model

Example: Zoo

Key format: (row key, col. key, timestamp)

- (zebras, length, 2006) --> 7 ft

- (zebras, weight, 2007) --> 600 lbs

- (zebras, weight, 2006) --> 620 lbs

Keys are sorted in lexicographic order

Timestamp ordering is defined as “most recent appears first”
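
The ordering rules can be sketched as a sort key: row and column compare lexicographically, and the timestamp is negated so that newer versions come first. The helper name sort_key is an assumption made for this illustration.

```python
cells = {
    ("zebras", "weight", 2006): "620 lbs",
    ("zebras", "length", 2006): "7 ft",
    ("zebras", "weight", 2007): "600 lbs",
}


def sort_key(key):
    """Order by row, then column, then newest timestamp first."""
    row, column, timestamp = key
    return (row, column, -timestamp)


ordered = sorted(cells, key=sort_key)
for key in ordered:
    print(key, "-->", cells[key])
```

With this key, the three zoo entries come out in the same order in which the slide lists them.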

Data Model

Example: Web Indexing

[Figure not preserved: web-indexing table, with callouts for columns, timestamps, and column families]

Column family

Column keys are named family:qualifier

Data Model - Timestamps
  • Used to store different versions of data in a cell
    • New writes default to current time
  • Lookup options:
    • Return most recent K values
    • Return all values in the timestamp range
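
The two lookup options can be sketched over a single cell's versions. The function names are hypothetical, the range is assumed to be inclusive at both ends, and the 2005 value is made up for the example.

```python
def most_recent(versions, k):
    """Return the k newest (timestamp, value) pairs, newest first."""
    return sorted(versions.items(), reverse=True)[:k]


def in_range(versions, start, end):
    """Return all pairs with start <= timestamp <= end, newest first."""
    return [(ts, value)
            for ts, value in sorted(versions.items(), reverse=True)
            if start <= ts <= end]


# Versions of one cell, keyed by timestamp (the 2005 entry is invented).
weight = {2005: "640 lbs", 2006: "620 lbs", 2007: "600 lbs"}

print(most_recent(weight, 2))
print(in_range(weight, 2005, 2006))
```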
API Examples: Write/Modify

Atomic row modification

No support for (RDBMS-style) multi-row transactions

API Examples: Read

Return sets can be filtered using regular expressions:

anchor: com.cnn.*
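
Regex filtering of a returned row can be sketched as below. The slide's original code was not preserved, so this is an illustration only: Python's re syntax stands in for whatever pattern syntax Bigtable accepts, and the column names and the filter_columns helper are hypothetical.

```python
import re

# One row's columns; all names and values here are invented.
row = {
    "anchor:com.cnn.www": "CNN",
    "anchor:com.cnn.money": "CNN Money",
    "anchor:org.example.www": "Example",
    "contents:": "<html>...</html>",
}


def filter_columns(row, pattern):
    """Keep only the columns whose full name matches the pattern."""
    regex = re.compile(pattern)
    return {col: val for col, val in row.items() if regex.fullmatch(col)}


matches = filter_columns(row, r"anchor:com\.cnn\..*")
print(sorted(matches))
```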

SSTable
  • The Google SSTable file format is used internally to store Bigtable data
  • An SSTable provides a persistent, ordered, immutable map from keys to values
  • Each SSTable contains a sequence of blocks
  • A block index (stored at the end of the SSTable) is used to locate blocks
  • The index is loaded into memory when the SSTable is opened
  • A lookup then needs one disk access to get the block
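
The block-index lookup can be sketched with a binary search over an in-memory index. This is a toy under stated assumptions: the class name ToySSTable is hypothetical, blocks are sized by entry count rather than 64KB, the index holds the last key of each block, and a counter stands in for disk accesses.

```python
import bisect


class ToySSTable:
    """Immutable sorted map split into blocks, with an in-memory index."""

    def __init__(self, items, block_size=2):
        pairs = sorted(items.items())
        # Each block is a sorted run of (key, value) pairs.
        self.blocks = [pairs[i:i + block_size]
                       for i in range(0, len(pairs), block_size)]
        # Index entry = last key stored in the corresponding block.
        self.index = [block[-1][0] for block in self.blocks]
        self.block_reads = 0  # stands in for disk accesses

    def get(self, key):
        # First block whose last key is >= key is the only candidate.
        i = bisect.bisect_left(self.index, key)
        if i == len(self.index):
            return None  # key is past the last block; no block is read
        self.block_reads += 1
        return dict(self.blocks[i]).get(key)


sst = ToySSTable({"a": "1", "b": "2", "c": "3", "d": "4", "e": "5"})
print(sst.get("c"), sst.block_reads)
```

Because the index is searched in memory, each lookup reads at most one block, which matches the slide's "one disk access" point.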
SSTable

[Figure: an SSTable drawn as a sequence of 64KB blocks plus an index of block ranges]

Tablet
  • Contains some range of rows of the table
  • Built out of multiple SSTables

[Figure: a tablet with start key aardvark and end key apple, built from two SSTables, each made of 64K blocks plus an index]

Table
  • Multiple tablets make up the table
  • SSTables can be shared
  • Tablets do not overlap; SSTables can overlap

[Figure: two adjacent tablets, aardvark to apple and apple to boat, drawn over four SSTables]

Organization
  • A Bigtable cluster stores tables
  • Each table consists of tablets
    • Initially each table consists of one tablet
    • As a table grows it is automatically split into multiple tablets
  • Tablets are assigned to tablet servers
    • Multiple tablets per server. Each tablet is 100-200 MB
    • Each tablet lives at only one server
    • Tablet server splits tablets that get too big
Organization
  • Master
    • Keeps track of the set of live tablet servers
    • Assigns tablets to tablet servers
    • Detects the addition and expiration of tablet servers
    • Balances tablet-server load
    • Handles schema changes such as table and column family creations
  • A Bigtable library is linked to every client.
Finding a tablet

Similar to a B+ tree

<table_id, end_row> -> location

Location is <ip address, port> of a tablet server
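
Keying tablets by <table_id, end_row> means a row's tablet is the first entry whose end row is not smaller than the row, which a sorted search finds directly. The sketch below flattens the lookup to a single level for brevity; the addresses, the table name, the "\uffff" sentinel for the last tablet, and the locate helper are all assumptions.

```python
import bisect

# (table_id, end_row) -> (ip address, port); every value here is invented.
metadata = [
    (("zoo", "apple"), ("10.0.0.1", 9000)),
    (("zoo", "boat"), ("10.0.0.2", 9000)),
    (("zoo", "\uffff"), ("10.0.0.3", 9000)),  # sentinel end row for the last tablet
]
keys = [key for key, _ in metadata]


def locate(table_id, row):
    """Return the location of the tablet whose row range contains row."""
    i = bisect.bisect_left(keys, (table_id, row))
    if i == len(keys) or keys[i][0] != table_id:
        return None  # no tablet of this table covers the row
    return metadata[i][1]


print(locate("zoo", "aardvark"))
print(locate("zoo", "banana"))
```

End rows are treated as inclusive here, which is a choice made for the sketch rather than something the slide states.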

Finding a tablet
  • A Bigtable library is linked to every client.
  • Client communicates directly with tablet server for reads/writes.
  • Client reads the Chubby file that points to the root tablet
    • This starts the location process
  • The client library caches tablet locations.
  • There are algorithms for prefetching
  • If there is no cached information, three network round-trips are needed
System Structure

[Figure: components of a Bigtable cell]
  • Bigtable client with the Bigtable client library; its arrows are labeled Open(), metadata, and read/write
  • Bigtable master: performs metadata ops and load balancing
  • Bigtable tablet servers (three shown): serve data
  • Cluster scheduling system: handles failover, monitoring
  • GFS: holds tablet data, logs
  • Lock service: holds metadata, handles master election

Tablet Server
  • When a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely named file in a specific Chubby directory
    • Call this the servers directory
  • A tablet server stops serving its tablets if it loses its exclusive lock
    • This may happen if a network partition causes the tablet server to lose its Chubby session
Tablet Server
  • A tablet server will attempt to reacquire an exclusive lock on its file as long as the file still exists
  • If the file no longer exists then the tablet server will never be able to serve again
    • It kills itself
    • At some point it can be restarted; it goes to a pool of unassigned tablet servers
Master Startup Operation
  • Upon startup the master needs to discover the current tablet assignment.
    • Grabs the unique master lock in Chubby
      • Prevents concurrent master instantiations
    • Scans the servers directory in Chubby for live servers
    • Communicates with every live tablet server
      • Discovers which tablets are already assigned to each server
    • Scans the METADATA table to learn the set of tablets
      • Unassigned tablets are marked for assignment
Master Operation
  • Detects tablet server failures/resumption
  • Master periodically asks each tablet server for the status of its lock
Master Operation
  • Tablet server lost its lock or master cannot contact tablet server:
    • Master attempts to acquire exclusive lock on the server’s file in the servers directory
    • If master acquires the lock then the tablets assigned to the tablet server are assigned to others
      • Master deletes the server’s file in the servers directory
    • Assignment of tablets should be balanced
  • If master loses its Chubby session then it kills itself
    • An election can take place to find a new master
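
The failure-handling steps above can be sketched against an in-memory stand-in for the lock service. Everything here is hypothetical: FakeLockService is not the Chubby API, the "servers/" file naming is invented, and "balanced" is approximated by giving each orphaned tablet to the least-loaded remaining server.

```python
class FakeLockService:
    """In-memory stand-in for a lock service; not the real Chubby API."""

    def __init__(self):
        self.files = {}  # file name -> lock holder, or None if unlocked

    def create_and_lock(self, name, who):
        self.files[name] = who

    def drop_lock(self, name):
        # Simulates the holder losing its session.
        if name in self.files:
            self.files[name] = None

    def try_lock(self, name, who):
        # Succeeds only if the file exists and nobody holds its lock.
        if name in self.files and self.files[name] is None:
            self.files[name] = who
            return True
        return False

    def delete(self, name):
        self.files.pop(name, None)


def handle_unresponsive_server(locks, server, assignments, live_servers):
    """Reassign a server's tablets if the master can take the server's lock."""
    name = "servers/" + server
    if not locks.try_lock(name, "master"):
        return False  # the server still holds its lock; leave its tablets alone
    locks.delete(name)  # the server can never serve again under this file
    orphans = assignments.pop(server, [])
    targets = [s for s in live_servers if s != server]
    for tablet in orphans:
        # Assumes at least one other live server exists.
        target = min(targets, key=lambda s: len(assignments.get(s, [])))
        assignments.setdefault(target, []).append(tablet)
    return True


locks = FakeLockService()
for s in ("ts1", "ts2", "ts3"):
    locks.create_and_lock("servers/" + s, s)
assignments = {"ts1": ["t1", "t2"], "ts2": ["t3"], "ts3": []}

locks.drop_lock("servers/ts1")  # ts1 loses its session
reassigned = handle_unresponsive_server(locks, "ts1", assignments, ["ts1", "ts2", "ts3"])
print(reassigned, assignments)
```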
Master Operation
  • Handles schema changes such as table and column family creations.
  • Handles table creation and merging of tablets
  • Tablet servers directly update metadata on tablet split, then notify the master
    • A lost notification may be detected lazily by the master
Tablet Serving
  • Updates are committed to a commit log that stores redo records
  • Recently committed updates are stored in a buffer called memtable
  • Older updates are stored in a sequence of SSTables
  • To recover a tablet, a tablet server reads its metadata from the METADATA table
    • Contains list of SSTables and a set of redo points
Tablet Serving

[Figure: read and write paths through the memtable (in memory) and the tablet log and SSTable files (in GFS)]

  • The commit log stores the updates that are made to the data.
  • Recent updates are stored in the memtable.
  • Older updates are stored in SSTable files.
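
The three storage pieces can be sketched as a toy tablet in which a write goes to the log and the memtable, and a read consults the memtable first and then the SSTables from newest to oldest. The class and attribute names are assumptions, and plain dicts stand in for real SSTable files.

```python
class ToyTablet:
    """Toy model of a tablet's commit log, memtable and SSTables."""

    def __init__(self):
        self.commit_log = []  # redo records, oldest first
        self.memtable = {}    # recently committed updates
        self.sstables = []    # immutable dicts, oldest first

    def write(self, key, value):
        # Log the update before applying it to the in-memory state.
        self.commit_log.append((key, value))
        self.memtable[key] = value

    def read(self, key):
        # Newest data wins: memtable first, then SSTables newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]
        return None


tablet = ToyTablet()
tablet.sstables.append({"zebras:weight": "620 lbs", "zebras:length": "7 ft"})
tablet.write("zebras:weight", "600 lbs")
print(tablet.read("zebras:weight"))
print(tablet.read("zebras:length"))
```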
Tablet Serving
  • Recovery process.
  • Reads/writes that arrive at a tablet server:
    • Is the request well-formed?
    • Is the sender authorized?
    • Chubby holds the permission file.
    • A mutation is written to the commit log; group commit is used.
Tablet Serving
  • Tablet recovery process
    • Read metadata containing the list of SSTables and redo points
      • Redo points are pointers into any commit logs that may contain data for the tablet
    • Reconstruct the memtable by applying the updates committed since the redo points
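
Recovery can be sketched as replaying the commit log from a redo point. The record layout (sequence number, key, value) and the idea of a redo point as a single sequence number are simplifying assumptions made for this illustration.

```python
def recover_memtable(commit_log, redo_point):
    """Rebuild a memtable by replaying log records at or after redo_point."""
    memtable = {}
    for seq, key, value in commit_log:
        if seq >= redo_point:
            memtable[key] = value  # later records overwrite earlier ones
    return memtable


# Records 1 and 2 are assumed to be already persisted in an SSTable.
commit_log = [(1, "a", "old"), (2, "b", "x"), (3, "a", "new"), (4, "c", "y")]
sstables = [{"a": "old", "b": "x"}]

memtable = recover_memtable(commit_log, redo_point=3)
print(memtable)
```

Only the records not yet covered by an SSTable are replayed, which is why recording redo points keeps recovery short.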
Compactions
  • Minor compaction – convert the memtable into an SSTable
    • Reduce memory usage
    • Reduce log traffic on restart
  • Merging compaction
    • Reduce number of SSTables
    • Good place to apply policy “keep only N versions”
  • Major compaction
    • Merging compaction that results in only one SSTable
    • No deletion records, only live data
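
The compaction types can be sketched with dicts standing in for SSTables. The function names and the TOMBSTONE marker for deletion records are hypothetical; a merging compaction would use the same merge step over a subset of the SSTables.

```python
TOMBSTONE = object()  # hypothetical marker for a deletion record


def minor_compaction(memtable, sstables):
    """Freeze the memtable into a new SSTable; return a fresh empty memtable."""
    if memtable:
        sstables.append(dict(memtable))
    return {}


def merge(sstables):
    """Merge SSTables given oldest first, so newer entries overwrite older ones."""
    merged = {}
    for sstable in sstables:
        merged.update(sstable)
    return merged


def major_compaction(sstables):
    """Rewrite everything into one SSTable that holds only live data."""
    merged = merge(sstables)
    return [{k: v for k, v in merged.items() if v is not TOMBSTONE}]


sstables = [{"a": "1", "b": "2"}, {"a": TOMBSTONE, "c": "3"}]
memtable = {"b": "9"}

memtable = minor_compaction(memtable, sstables)
sstables = major_compaction(sstables)
print(sstables)
```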
Locality Groups
  • Group column families together into an SSTable
    • Avoid mingling data, e.g. page contents and page metadata
    • Can keep some groups all in memory
  • Can compress locality groups
  • Tablet movement
    • Major compaction (with concurrent updates)
    • Minor compaction (to catch up with updates) without any concurrent updates
    • Load on new server without requiring any recovery action
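
Returning to the locality-group bullets on this slide, the grouping of column families can be sketched as a mapping from family to group, with each group's columns kept in a separate store. The group names, family names, and the helper below are hypothetical.

```python
# Hypothetical grouping: page contents kept apart from page metadata.
locality_groups = {
    "content": ["contents"],
    "meta": ["language", "anchor"],
}


def split_by_locality_group(row, groups):
    """Partition a row's columns into one dict per locality group."""
    family_to_group = {family: group
                       for group, families in groups.items()
                       for family in families}
    stores = {group: {} for group in groups}
    for column, value in row.items():
        family = column.split(":", 1)[0]  # column keys are family:qualifier
        stores[family_to_group[family]][column] = value
    return stores


row = {
    "contents:": "<html>...</html>",
    "language:": "EN",
    "anchor:com.cnn.www": "CNN",
}
stores = split_by_locality_group(row, locality_groups)
print(stores["meta"])
```

A reader that only needs the metadata columns can then work from the "meta" store without touching the much larger page contents.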
Lessons learned
  • Interesting point: only implement some of the requirements, since the rest are probably not needed
  • Many types of failure possible
  • Big systems need proper systems-level monitoring
  • Value simple design