
Google Bigtable

Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber

Google, Inc.

OSDI 2006


Google Scale

  • Lots of data

    • Copies of the web, satellite data, user data, email and USENET, Subversion backing store

  • Many incoming requests

  • No commercial system big enough

    • Couldn’t afford it if there was one

    • Might not have made appropriate design choices


Data model: a big map

  • <Row, Column, Timestamp> triple as key

    • <Row, Column, Timestamp> -> string

  • API has lookup, insert and delete operations based on the key

  • Does not support a relational model

    • No table-wide integrity constraints

    • No multirow transactions
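The "big map" can be sketched as a plain dictionary keyed by the triple (a toy model for illustration only, not Google's actual API; the row, column, and value strings below are invented):

```python
# Toy sketch of Bigtable's data model: a map from
# (row, column, timestamp) to an uninterpreted string.
table = {}

def put(row, col, ts, value):
    table[(row, col, ts)] = value

def lookup(row, col, ts):
    return table.get((row, col, ts))

def delete(row, col, ts):
    table.pop((row, col, ts), None)

# Example usage with an invented web-table cell:
put("com.cnn.www", "contents:", 6, "<html>...")
```

Note there is nothing relational here: no schema enforcement across rows, and no way to mutate two rows atomically.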


Data Model

Example: Zoo


Data Model

Example: Zoo

(row key, col. key, timestamp)


Data Model

Example: Zoo

(row key, col. key, timestamp)

- (zebras, length, 2006) --> 7 ft

- (zebras, weight, 2007) --> 600 lbs

- (zebras, weight, 2006) --> 620 lbs


Data Model

Example: Zoo

(row key, col. key, timestamp)

- (zebras, length, 2006) --> 7 ft

- (zebras, weight, 2007) --> 600 lbs

- (zebras, weight, 2006) --> 620 lbs

Keys are sorted in lexicographic order


Data Model

Example: Zoo

(row key, col. key, timestamp)

- (zebras, length, 2006) --> 7 ft

- (zebras, weight, 2007) --> 600 lbs

- (zebras, weight, 2006) --> 620 lbs

Timestamp ordering is defined as “most recent appears first”
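The two ordering rules from the zoo example can be demonstrated in a short sketch (hypothetical code, not the real implementation): rows and columns compare lexicographically, while timestamps compare newest-first.

```python
# Sketch of Bigtable's key order using the zoo example:
# lexicographic on (row, column), descending on timestamp.
cells = {
    ("zebras", "length", 2006): "7 ft",
    ("zebras", "weight", 2007): "600 lbs",
    ("zebras", "weight", 2006): "620 lbs",
}

# Negating the timestamp makes "most recent appears first".
ordered = sorted(cells, key=lambda k: (k[0], k[1], -k[2]))
# "length" sorts before "weight"; within "weight", 2007 precedes 2006.
```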


Data Model

Example: Web Indexing




Data Model

Columns


Data Model

Cells


Data Model

Timestamps


Data Model

Column family


Data Model

Column family

family:qualifier


Data Model

Column family

family:qualifier


Data Model - Timestamps

  • Used to store different versions of data in a cell

    • New writes default to current time

  • Lookup options:

    • Return most recent K values

    • Return all values in the timestamp range
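The two lookup options can be sketched as follows, assuming (hypothetically) that the versions of one cell are kept as a list of (timestamp, value) pairs sorted newest-first:

```python
# Versions of a single cell, newest-first (invented data).
versions = [(2007, "600 lbs"), (2006, "620 lbs"), (2005, "615 lbs")]

def most_recent(versions, k):
    # Return the most recent k values.
    return versions[:k]

def in_range(versions, lo, hi):
    # Return all values whose timestamp falls in [lo, hi].
    return [(ts, v) for ts, v in versions if lo <= ts <= hi]
```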


API Examples: Write/Modify

atomic row modification

No support for (RDBMS-style) multi-row transactions
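A minimal sketch of what atomic row modification means (a hypothetical Python model, not the real C++ API: `RowMutation` here is just a batch of set/delete operations that a server would apply together after logging them):

```python
# Sketch: all operations batched in one RowMutation apply to a single row
# and take effect together -- single-row atomicity, no multi-row transactions.
class RowMutation:
    def __init__(self, row):
        self.row, self.ops = row, []

    def set(self, col, value):
        self.ops.append(("set", col, value))

    def delete(self, col):
        self.ops.append(("delete", col))

def apply_mutation(table, m, ts):
    # A real server would first write the whole batch to the commit log.
    for op, col, *val in m.ops:
        if op == "set":
            table[(m.row, col, ts)] = val[0]
        else:
            table.pop((m.row, col, ts), None)
```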


API Examples: Read

Return sets can be filtered using regular expressions:

anchor: com.cnn.*
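Such a filter can be sketched with Python's `re` module (the row contents below are invented for illustration):

```python
import re

# Sketch of filtering a read's returned columns with a regular expression,
# keeping only anchor columns that match "com.cnn.*".
row = {
    "anchor:com.cnn.money": "Money",
    "anchor:com.cnn.www": "CNN homepage",
    "anchor:org.example.www": "Example",
}

pattern = re.compile(r"anchor:com\.cnn\..*")
filtered = {col: v for col, v in row.items() if pattern.fullmatch(col)}
```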


SSTable

  • The Google SSTable file format is used internally to store BigTable data

  • An SSTable provides a persistent, ordered, immutable map from keys to values

  • Each SSTable contains a sequence of blocks

  • A block index (stored at the end of the SSTable) is used to locate blocks

  • The index is loaded into memory when the SSTable is opened

  • One disk access to get the block
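The lookup can be sketched as a binary search over the in-memory index, followed by a single disk read of the chosen block (a simplified model; the index entries and block size here are invented):

```python
import bisect

# Sketch of the SSTable block index: an in-memory list mapping the last
# key of each 64 KB block to that block's file offset. A lookup is one
# binary search in memory plus one disk access for the block itself.
index = [("apple", 0), ("mango", 65536), ("zebra", 131072)]  # (last_key, offset)
last_keys = [k for k, _ in index]

def block_offset(key):
    # Find the first block whose last key is >= the search key.
    i = bisect.bisect_left(last_keys, key)
    if i == len(index):
        return None  # key is beyond the table's key range
    return index[i][1]
```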


SSTable

[Diagram: an SSTable = an index of block ranges + a sequence of 64 KB blocks]


Tablet

  • Contains some range of rows of the table

  • Built out of multiple SSTables

[Diagram: a tablet with start row aardvark and end row apple, built from two SSTables; each SSTable is a set of 64K blocks plus an index]


Table

  • Multiple tablets make up the table

  • SSTables can be shared

  • Tablets do not overlap, SSTables can overlap

[Diagram: two tablets, aardvark..apple and apple..boat, built from four SSTables; adjacent tablets may share an SSTable]
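Because tablets partition the row space into non-overlapping ranges, routing a row key to its tablet is a binary search over end rows (a sketch with invented range boundaries):

```python
import bisect

# Sketch of routing a row key to its tablet. Each tablet is identified by
# its end row (tablets partition the row space, so end rows are sorted).
tablet_end_rows = ["apple", "boat", "\xff"]  # last tablet is open-ended

def tablet_for(row):
    # First tablet whose end row is >= the given row key.
    return bisect.bisect_left(tablet_end_rows, row)
```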


Organization

  • A Bigtable cluster stores tables

  • Each table consists of tablets

    • Initially each table consists of one tablet

    • As a table grows it is automatically split into multiple tablets

  • Tablets are assigned to tablet servers

    • Multiple tablets per server. Each tablet is 100-200 MB

    • Each tablet lives at only one server

    • Tablet server splits tablets that get too big


Organization

  • Master

    • Keeps track of the set of live tablet servers

    • Assigns tablets to tablet servers

    • Detects the addition and expiration of tablet servers

    • Balances tablet-server load

    • Handles schema changes such as table and column family creations

  • A Bigtable library is linked to every client.


Finding a tablet

Three-level hierarchy, similar to a B+ tree

<table_id, end_row> -> location

Location is <ip address, port> of a tablet server


Finding a tablet

  • A Bigtable library is linked to every client.

  • Client communicates directly with tablet server for reads/writes.

  • Client reads the Chubby file that points to the root tablet

    • This starts the location process

  • The client library caches tablet locations.

  • The client library also prefetches extra tablet locations

  • If there is no cached information, three network round-trips are needed
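The three round-trips can be sketched as a walk down the hierarchy with a client-side cache in front (all names and data structures below are invented for illustration):

```python
# Sketch of the three-level location lookup:
# Chubby file -> root tablet -> METADATA tablet -> user tablet location.
cache = {}  # client-side cache of tablet locations

def locate(table, row, chubby, tablets):
    if (table, row) in cache:
        return cache[(table, row)]        # cache hit: zero round-trips
    root = tablets[chubby["root"]]        # round-trip 1: read the Chubby file
    meta = tablets[root[(table, row)]]    # round-trip 2: ask the root tablet
    server = meta[(table, row)]           # round-trip 3: ask the METADATA tablet
    cache[(table, row)] = server
    return server
```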


System Structure

[Diagram: a Bigtable cell. A Bigtable client links the client library, calls Open() via the lock service, and sends reads/writes directly to Bigtable tablet servers, which serve the data. The Bigtable master performs metadata ops and load balancing. Underneath: the cluster scheduling system handles failover and monitoring; GFS holds tablet data and logs; the lock service holds metadata and handles master election.]


Tablet Server

  • When a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely named file in a specific Chubby directory

    • Let’s call this the servers directory

  • A tablet server stops serving its tablets if it loses its exclusive lock

    • This may happen if a network partition causes the tablet server to lose its Chubby session


Tablet Server

  • A tablet server will attempt to reacquire an exclusive lock on its file as long as the file still exists

  • If the file no longer exists then the tablet server will never be able to serve again

    • Kills itself

    • At some point it can be restarted; it rejoins the pool of unassigned tablet servers


Master Startup Operation

  • Upon startup, the master needs to discover the current tablet assignment.

    • Grabs unique master lock in Chubby

      • Prevents concurrent master instantiations

    • Scans servers directory in Chubby for live servers

    • Communicates with every live tablet server

      • Discover all tablets

    • Scans METADATA table to learn the set of tablets

      • Unassigned tablets are marked for assignment


Master Operation

  • Detects tablet server failures/resumptions

  • Master periodically asks each tablet server for the status of its locks


Master Operation

  • Tablet server lost its lock or master cannot contact tablet server:

    • Master attempts to acquire exclusive lock on the server’s file in the servers directory

    • If master acquires the lock then the tablets assigned to the tablet server are assigned to others

      • Master deletes the server’s file in the servers directory

    • Assignment of tablets should be balanced

  • If master loses its Chubby session then it kills itself

    • An election can take place to find a new master





Master Operation

  • Handles schema changes such as table and column family creations.

  • Handles table creation, and merging of tablets

  • Tablet servers directly update metadata on tablet split, then notify master

    • A lost notification may be detected lazily by the master


Tablet Serving

  • Updates are committed to a commit log that stores redo records

  • Recently committed updates are stored in a buffer called memtable

  • Older updates are stored in a sequence of SSTables

  • To recover a tablet, a tablet server reads its metadata from the METADATA table

    • Contains list of SSTables and a set of redo points


Tablet Serving

[Diagram: write ops go to the memtable (in memory) and the tablet log (in GFS); read ops merge the memtable with the SSTable files]

  • Commit log stores the updates that are made to the data.

  • Recent updates are stored in memtable.

  • Older updates are stored in SSTable files
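The read path implied by the bullets above can be sketched as a merged lookup (a simplified model; real reads merge sorted iterators rather than probing dictionaries):

```python
# Sketch of the read path: consult the memtable first (most recent
# updates), then the SSTables from newest to oldest.
def read(key, memtable, sstables):
    if key in memtable:
        return memtable[key]
    for sst in reversed(sstables):  # newest SSTable first
        if key in sst:
            return sst[key]
    return None
```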


Tablet Serving

  • Recovery process.

  • Reads/writes that arrive at the tablet server:

    • Is the request well-formed?

    • Authorization: Chubby holds the permission file

    • If a mutation occurs, it is written to the commit log, and finally a group commit is used


Tablet Serving

  • Tablet recovery process

    • Read the metadata containing the list of SSTables and redo points

      • Redo points are pointers into the commit logs

    • Apply the mutations logged after the redo points

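The replay step can be sketched as follows (a simplified model with invented log-record shapes: each record carries a sequence number, key, and value):

```python
# Sketch of tablet recovery: rebuild the memtable by replaying only the
# commit-log records written after the redo point; earlier mutations are
# already captured in the SSTables.
def recover(commit_log, redo_point):
    memtable = {}
    for seqno, key, value in commit_log:
        if seqno > redo_point:  # skip mutations already flushed to SSTables
            memtable[key] = value
    return memtable
```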

Compactions

  • Minor compaction – convert the memtable into an SSTable

    • Reduce memory usage

    • Reduce log traffic on restart

  • Merging compaction

    • Reduce number of SSTables

    • Good place to apply policy “keep only N versions”

  • Major compaction

    • Merging compaction that results in only one SSTable

    • No deletion records, only live data
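A minor compaction can be sketched in a few lines (a toy model: SSTables are represented as sorted, frozen dictionaries):

```python
# Sketch of a minor compaction: freeze the current memtable, write it out
# as a new sorted immutable SSTable, and start a fresh memtable. Reads
# then merge the memtable with the growing list of SSTables until a
# merging (or major) compaction reduces their number.
def minor_compaction(memtable, sstables):
    sstables.append(dict(sorted(memtable.items())))  # new immutable SSTable
    return {}  # fresh, empty memtable
```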


Locality Groups

  • Group column families together into an SSTable

    • Avoid mingling data, e.g. page contents and page metadata

    • Can keep some groups all in memory

  • Can compress locality groups

  • Tablet movement

    • Major compaction (with concurrent updates)

    • Minor compaction (to catch up with updates) without any concurrent updates

    • Load on new server without requiring any recovery action


Lessons learned

  • Interesting point: implement only part of the requirements, since the rest is probably not needed

  • Many types of failure possible

  • Big systems need proper systems-level monitoring

  • Value simple design

