Google Bigtable

Google Bigtable Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber Google, Inc. OSDI 2006

Google Scale • Lots of data • Copies of the web, satellite data, user data, email and USENET, Subversion backing store • Many incoming requests • No commercial system big enough • Couldn’t afford it if there was one • Might not have made appropriate design choices

Data model: a big map • <Row, Column, Timestamp> triple as key • <Row, Column, Timestamp> -> string • API has lookup, insert and delete operations based on the key • Does not support a relational model • No table-wide integrity constraints • No multi-row transactions

Data Model Example: Zoo • Key fields: row key, column key, timestamp • (zebras, length, 2006) --> 7 ft • (zebras, weight, 2007) --> 600 lbs • (zebras, weight, 2006) --> 620 lbs • Keys are sorted in lexicographic order • Timestamp ordering is defined as “most recent appears first”
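The ordering rule in the zoo example can be illustrated with a small sketch. The `cells` dictionary and the `sort_key` helper are names made up for this illustration; they only reproduce the ordering described on the slide, not Bigtable's storage code.

```python
# Hypothetical cells from the zoo example: (row key, column key, timestamp) -> value.
cells = {
    ("zebras", "weight", 2006): "620 lbs",
    ("zebras", "length", 2006): "7 ft",
    ("zebras", "weight", 2007): "600 lbs",
}

def sort_key(key):
    row, column, timestamp = key
    # Row and column compare lexicographically; negating the timestamp
    # puts the most recent version of a cell first.
    return (row, column, -timestamp)

ordered = sorted(cells, key=sort_key)
for key in ordered:
    print(key, "-->", cells[key])
```

Sorting this way places both weight versions next to each other, with the 2007 value ahead of the 2006 value.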

Data Model Example: Web Indexing

Data Model (figure) • Row • Columns • Cells • Timestamps • Column family, with column keys written as family:qualifier

Data Model - Timestamps • Used to store different versions of data in a cell • New writes default to current time • Lookup options: • Return most recent K values • Return all values in the timestamp range
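The two lookup options can be sketched with a toy versioned cell. The `Cell` class and its method names are assumptions made for this example, not the Bigtable API, and the inclusive range bounds are a choice of the sketch.

```python
import time

class Cell:
    """Versions of one cell, kept newest first (a sketch, not Bigtable's API)."""

    def __init__(self):
        self.versions = []  # list of (timestamp, value), newest first

    def write(self, value, timestamp=None):
        # New writes default to the current time, as on the slide.
        if timestamp is None:
            timestamp = time.time()
        self.versions.append((timestamp, value))
        self.versions.sort(key=lambda tv: tv[0], reverse=True)

    def most_recent(self, k):
        # Return the K most recent versions.
        return self.versions[:k]

    def in_range(self, start, end):
        # Inclusive bounds are an assumption of this sketch.
        return [(t, v) for t, v in self.versions if start <= t <= end]

weight = Cell()
weight.write("620 lbs", timestamp=2006)
weight.write("600 lbs", timestamp=2007)
```

Here `weight.most_recent(1)` yields only the 2007 version, while `weight.in_range(2006, 2007)` yields both.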

API Examples: Write/Modify • Atomic row modification • No support for (RDBMS-style) multi-row transactions

API Examples: Read • Return sets can be filtered using regular expressions, e.g. anchor: com.cnn.*
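A rough sketch of regular-expression filtering over column keys follows. It uses Python's `re` module rather than the Bigtable client API; the `row` contents and the `filter_columns` helper are hypothetical.

```python
import re

# Hypothetical row: column key (family:qualifier) -> value.
row = {
    "anchor:com.cnn.www": "CNN",
    "anchor:com.cnn.money": "CNN Money",
    "anchor:org.example": "Example",
    "contents:": "<html>...</html>",
}

def filter_columns(row, pattern):
    """Return only the columns whose full key matches the pattern."""
    compiled = re.compile(pattern)
    return {col: val for col, val in row.items() if compiled.fullmatch(col)}

# The literal dots are escaped so that they match only "." in the key.
matches = filter_columns(row, r"anchor:com\.cnn\..*")
```

Only the two `anchor:com.cnn.` columns survive the filter.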

SSTable • The Google SSTable file format is used internally to store Bigtable data • An SSTable provides a persistent, ordered, immutable map from keys to values • Each SSTable contains a sequence of blocks • A block index (stored at the end of the SSTable) is used to locate blocks • The index is loaded into memory when the SSTable is opened • A lookup then needs one disk access to get the block

SSTable (figure: 64KB Block, 64KB Block, 64KB Block, …, followed by an Index of block ranges)
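The block-index lookup can be sketched as follows. This toy `SSTable` class is an assumption for illustration: blocks hold a fixed number of entries instead of 64 KB of data, and the `block_reads` counter stands in for disk accesses.

```python
import bisect

class SSTable:
    """Immutable sorted map split into blocks, with an in-memory block index."""

    def __init__(self, items, block_size=2):
        items = sorted(items)
        # Each block is a sorted list of (key, value) pairs; block_size counts
        # entries here, standing in for the 64 KB byte limit.
        self.blocks = [items[i:i + block_size]
                       for i in range(0, len(items), block_size)]
        # The index holds the last key of every block.
        self.index = [block[-1][0] for block in self.blocks]
        self.block_reads = 0

    def get(self, key):
        # Binary search of the in-memory index picks the only candidate block.
        i = bisect.bisect_left(self.index, key)
        if i == len(self.blocks):
            return None  # key sorts after every block; no block is read
        self.block_reads += 1  # stands in for the single disk access
        for k, v in self.blocks[i]:
            if k == key:
                return v
        return None

table = SSTable([("aardvark", 1), ("apple", 2), ("boat", 3), ("cat", 4), ("dog", 5)])
```

Because the index lives in memory, each `get` touches at most one block.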

Tablet • Contains some range of rows of the table • Built out of multiple SSTables (figure: a tablet with Start: aardvark and End: apple, built from two SSTables, each a run of 64K blocks plus an index)

Table • Multiple tablets make up the table • SSTables can be shared • Tablets do not overlap, SSTables can overlap (figure: two tablets, aardvark to apple and apple to boat, built over four SSTables)

Organization • A Bigtable cluster stores tables • Each table consists of tablets • Initially each table consists of one tablet • As a table grows it is automatically split into multiple tablets • Tablets are assigned to tablet servers • Multiple tablets per server. Each tablet is 100-200 MB • Each tablet lives at only one server • Tablet server splits tablets that get too big

Organization • Master • Keeps track of the set of live tablet servers • Assigns tablets to tablet servers • Detects the addition and expiration of tablet servers • Balances tablet-server load • Handles schema changes such as table and column family creations • A Bigtable library is linked to every client.

Finding a tablet • Similar to a B+ tree • <table_id, end_row> -> location • Location is <ip address, port> of a tablet server

Finding a tablet • A Bigtable library is linked to every client. • Client communicates directly with tablet server for reads/writes. • Client reads the Chubby file that points to the root tablet • This starts the location process • The client library caches tablet locations. • There are algorithms for prefetching • If there is no cached information, three network round-trips are needed
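The <table_id, end_row> -> location mapping and the client-side cache can be sketched as below. The metadata entries, the `LAST` sentinel, and both function names are hypothetical, and the sketch flattens the multi-level lookup into a single sorted list. It also assumes that a tablet's end row is inclusive.

```python
import bisect

# Hypothetical METADATA entries: (table_id, end_row) -> (ip address, port).
# The last tablet of a table uses a sentinel end row that sorts after all keys.
LAST = "\U0010ffff"
metadata = sorted([
    (("zoo", "apple"), ("10.0.0.1", 9000)),
    (("zoo", "boat"), ("10.0.0.2", 9000)),
    (("zoo", LAST), ("10.0.0.3", 9000)),
])
keys = [k for k, _ in metadata]

def locate(table_id, row):
    """Find the first tablet whose end row is >= the requested row."""
    i = bisect.bisect_left(keys, (table_id, row))
    if i == len(keys) or keys[i][0] != table_id:
        raise KeyError(f"no tablet for {table_id!r}")
    return metadata[i][1]

cache = {}

def cached_locate(table_id, row):
    # A real client caches tablet ranges; caching per row keeps the sketch short.
    if (table_id, row) not in cache:
        cache[(table_id, row)] = locate(table_id, row)
    return cache[(table_id, row)]
```

A call such as `cached_locate("zoo", "banana")` resolves to the tablet ending at `boat`, and a repeated call is answered from the cache without another lookup.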

System Structure (figure: a Bigtable cell) • Bigtable client with the Bigtable client library: Open(), read/write, metadata ops • Bigtable master: performs metadata ops and load balancing • Bigtable tablet servers: serve data • Cluster scheduling system: handles failover, monitoring • GFS: holds tablet data, logs • Lock service: holds metadata, handles master election

Tablet Server • When a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely named file in a specific Chubby directory • Let’s call this the servers directory • A tablet server stops serving its tablets if it loses its exclusive lock • This may happen if a network partition causes the tablet server to lose its Chubby session

Tablet Server • A tablet server will attempt to reacquire an exclusive lock on its file as long as the file still exists • If the file no longer exists then the tablet server will never be able to serve again • Kills itself • At some point it can be restarted; it goes to a pool of unassigned tablet servers

Master Startup Operation • Upon startup the master needs to discover the current tablet assignment • Grabs a unique master lock in Chubby • Prevents concurrent master instantiations • Scans the servers directory in Chubby for live servers • Communicates with every live tablet server • Discovers the tablets already assigned to each server • Scans the METADATA table to learn the set of tablets • Unassigned tablets are marked for assignment

Master Operation • Detects tablet server failures/resumption • Master periodically asks each tablet server for the status of its lock

Master Operation • Tablet server lost its lock or master cannot contact tablet server: • Master attempts to acquire exclusive lock on the server’s file in the servers directory • If master acquires the lock then the tablets assigned to the tablet server are assigned to others • Master deletes the server’s file in the servers directory • Assignment of tablets should be balanced • If master loses its Chubby session then it kills itself • An election can take place to find a new master
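The failure-handling steps above can be sketched with an in-memory stand-in for the lock service. `FakeLockService`, its method names, and `handle_suspect_server` are all hypothetical; Chubby's real interface is not shown in these slides, and the least-loaded balancing rule is an assumption of the sketch.

```python
class FakeLockService:
    """In-memory stand-in for Chubby; every method name here is hypothetical."""

    def __init__(self):
        self.holders = {}  # file name -> current lock holder

    def try_acquire(self, name, who):
        if self.holders.get(name) in (None, who):
            self.holders[name] = who
            return True
        return False

    def release(self, name):
        self.holders[name] = None

    def delete(self, name):
        self.holders.pop(name, None)

def handle_suspect_server(locks, server, assignments, live_servers):
    """Reassign a server's tablets if the master can take the server's lock."""
    name = f"servers/{server}"
    if not locks.try_acquire(name, "master"):
        return False  # the server still holds its lock, so it is still serving
    locks.delete(name)  # the server can never serve again under this file
    orphans = assignments.pop(server, [])
    for tablet in orphans:
        # Balance by giving each tablet to the least-loaded live server.
        target = min(live_servers, key=lambda s: len(assignments[s]))
        assignments[target].append(tablet)
    return True

locks = FakeLockService()
locks.try_acquire("servers/ts1", "ts1")
locks.try_acquire("servers/ts2", "ts2")
assignments = {"ts1": ["t1", "t2"], "ts2": ["t3"], "ts3": []}
locks.release("servers/ts1")  # ts1 lost its Chubby session
reassigned = handle_suspect_server(locks, "ts1", assignments, ["ts2", "ts3"])
```

In this run the master takes the released lock, removes the file, and spreads the two orphaned tablets over the remaining servers.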

Tablet Server Failure

Master Operation • Handles schema changes such as table and column family creations • Handles table creation and merging of tablets • Tablet servers directly update metadata on tablet split, then notify master • Lost notification may be detected lazily by master

Tablet Serving • Updates are committed to a commit log that stores redo records • Recently committed updates are stored in a buffer called memtable • Older updates are stored in a sequence of SSTables • To recover a tablet, a tablet server reads its metadata from the METADATA table • Contains list of SSTables and a set of redo points

Tablet Serving • Commit log stores the updates that are made to the data • Recent updates are stored in the memtable • Older updates are stored in SSTable files (figure: write ops go to the tablet log in GFS and the memtable in memory; read ops consult the memtable and the SSTable files in GFS)

Tablet Serving • Recovery process • Reads/writes that arrive at a tablet server are checked: • Is the request well-formed? • Is the sender authorized? Chubby holds the permission file • A valid mutation is written to the commit log, and group commit is used

Tablet Serving • Tablet recovery process • Read metadata containing the list of SSTables and the redo points • Redo points are pointers into any commit logs • Reconstruct the memtable by applying the updates committed since the redo points
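The write path, read path, and recovery can be tied together in a toy model. The `Tablet` class below is an assumption for illustration: plain lists and dicts stand in for the commit log and SSTables in GFS, and a single integer stands in for the redo points.

```python
class Tablet:
    """Toy tablet: commit log, memtable, and a list of immutable SSTables."""

    def __init__(self, log=None, sstables=None, redo_point=0):
        self.log = log if log is not None else []  # redo records: (key, value)
        self.sstables = sstables if sstables is not None else []  # oldest first
        self.memtable = {}
        # Recovery: replay every log record committed since the redo point.
        for key, value in self.log[redo_point:]:
            self.memtable[key] = value

    def write(self, key, value):
        self.log.append((key, value))  # commit to the log first
        self.memtable[key] = value     # then apply to the memtable

    def read(self, key):
        # Newest data wins: memtable first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]
        return None

tablet = Tablet(sstables=[{"zebras:weight": "620 lbs"}])
tablet.write("zebras:weight", "600 lbs")
tablet.write("zebras:length", "7 ft")

# Simulate a crash: the memtable is lost, the log and SSTables survive in GFS.
recovered = Tablet(log=tablet.log, sstables=tablet.sstables, redo_point=0)
```

After the simulated crash, `recovered` serves the same values as `tablet` because replaying the log rebuilds the memtable.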

Compactions • Minor compaction – convert the memtable into an SSTable • Reduce memory usage • Reduce log traffic on restart • Merging compaction • Reduce number of SSTables • Good place to apply policy “keep only N versions” • Major compaction • Merging compaction that results in only one SSTable • No deletion records, only live data
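The three kinds of compaction can be sketched over plain dictionaries. The function names and the `TOMBSTONE` marker are assumptions of this example, and the "keep only N versions" policy is left out to keep it short.

```python
TOMBSTONE = object()  # deletion record; a stand-in chosen for this sketch

def minor_compaction(memtable, sstables):
    """Freeze the memtable into a new SSTable and start an empty memtable."""
    sstables.append(dict(sorted(memtable.items())))
    memtable.clear()

def merging_compaction(sstables, count):
    """Merge the `count` newest SSTables into one; newer values win."""
    start = len(sstables) - count
    merged = {}
    for sstable in sstables[start:]:  # oldest to newest
        merged.update(sstable)
    sstables[start:] = [dict(sorted(merged.items()))]

def major_compaction(sstables):
    """Merge everything into one SSTable and drop deletion records."""
    merging_compaction(sstables, len(sstables))
    sstables[0] = {k: v for k, v in sstables[0].items() if v is not TOMBSTONE}

sstables = [{"a": 1, "b": 2}, {"b": 3, "c": 4}]
memtable = {"c": TOMBSTONE, "d": 5}
minor_compaction(memtable, sstables)
major_compaction(sstables)
```

After the major compaction a single SSTable remains, and the deleted key `c` is gone along with its deletion record.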

Locality Groups • Group column families together into an SSTable • Avoid mingling data, e.g. page contents and page metadata • Can keep some groups all in memory • Can compress locality groups • Tablet movement • Major compaction (with concurrent updates) • Minor compaction (to catch up with updates) without any concurrent updates • Load on new server without requiring any recovery action
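Grouping column families into locality groups can be sketched as a simple partition of a row's columns. The family-to-group mapping, the group names, and the sample row are hypothetical.

```python
# Hypothetical assignment of column families to locality groups.
LOCALITY_GROUPS = {
    "contents": "page_contents",
    "language": "page_metadata",
    "anchor": "page_metadata",
}

def split_by_locality_group(row):
    """Partition a row's columns so each group can go to its own SSTable."""
    groups = {}
    for column, value in row.items():
        family = column.split(":", 1)[0]
        groups.setdefault(LOCALITY_GROUPS[family], {})[column] = value
    return groups

row = {"contents:": "<html>", "language:": "EN", "anchor:com.cnn.www": "CNN"}
groups = split_by_locality_group(row)
```

A read that needs only page metadata can then skip the SSTable holding page contents.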

Lessons learned • Interesting point: only implement some of the requirements, since the rest are probably not needed • Many types of failure possible • Big systems need proper systems-level monitoring • Value simple design
