
Google Bigtable

Google Bigtable. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber Google, Inc. OSDI 2006. Google Scale. Lots of data


Presentation Transcript


  1. Google Bigtable Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber Google, Inc. OSDI 2006

  2. Google Scale • Lots of data • Copies of the web, satellite data, user data, email and USENET, Subversion backing store • Many incoming requests • No commercial system big enough • Couldn’t afford it if there was one • Might not have made appropriate design choices

  3. Data model: a big map • <Row, Column, Timestamp> triple as key • <Row, Column, Timestamp> -> string • API has lookup, insert, and delete operations based on the key • Does not support a relational model • No table-wide integrity constraints • No multirow transactions
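
The "big map" on this slide can be pictured as a dictionary keyed by the triple. The following is a minimal in-memory sketch, not Bigtable's actual API: the class name `BigMap` and its method names are hypothetical, chosen only to mirror the lookup, insert, and delete operations listed above.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, str, int]  # (row, column, timestamp)


class BigMap:
    """Toy stand-in for the <row, column, timestamp> -> string map (hypothetical)."""

    def __init__(self) -> None:
        self._cells: Dict[Key, str] = {}

    def insert(self, row: str, column: str, timestamp: int, value: str) -> None:
        # Each distinct triple addresses its own value.
        self._cells[(row, column, timestamp)] = value

    def lookup(self, row: str, column: str, timestamp: int) -> Optional[str]:
        # Returns None when the triple is not present.
        return self._cells.get((row, column, timestamp))

    def delete(self, row: str, column: str, timestamp: int) -> None:
        # Deleting a missing triple is a no-op in this sketch.
        self._cells.pop((row, column, timestamp), None)


zoo = BigMap()
zoo.insert("zebras", "weight", 2007, "600 lbs")
zoo.insert("zebras", "weight", 2006, "620 lbs")
zoo.delete("zebras", "weight", 2006)
```

Note that nothing here spans more than one key, which matches the absence of multirow transactions and table-wide constraints.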

  4. Data Model Example: Zoo

  5. Data Model Example: Zoo (row key, col. key, timestamp)

  6. Data Model Example: Zoo (row key, col. key, timestamp) - (zebras, length, 2006) --> 7 ft - (zebras, weight, 2007) --> 600 lbs - (zebras, weight, 2006) --> 620 lbs

  7. Data Model Example: Zoo (row key, col. key, timestamp) - (zebras, length, 2006) --> 7 ft - (zebras, weight, 2007) --> 600 lbs - (zebras, weight, 2006) --> 620 lbs Keys are sorted in lexicographic order

  8. Data Model Example: Zoo (row key, col. key, timestamp) - (zebras, length, 2006) --> 7 ft - (zebras, weight, 2007) --> 600 lbs - (zebras, weight, 2006) --> 620 lbs Timestamp ordering is defined as “most recent appears first”
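
The two ordering rules from the Zoo example (lexicographic row and column keys, newest timestamp first) can be combined into a single sort key. This is an illustrative sketch only; the helper name `sort_key` is an assumption, and the data is the zebra example from the slides.

```python
cells = {
    ("zebras", "weight", 2006): "620 lbs",
    ("zebras", "length", 2006): "7 ft",
    ("zebras", "weight", 2007): "600 lbs",
}


def sort_key(key):
    row, column, timestamp = key
    # Row and column compare lexicographically; negating the timestamp
    # makes the most recent version of a cell sort first.
    return (row, column, -timestamp)


ordered = sorted(cells, key=sort_key)
for key in ordered:
    print(key, "-->", cells[key])
```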

  9. Data Model Example: Web Indexing

  10. Data Model

  11. Data Model Row

  12. Data Model Columns

  13. Data Model Cells

  14. Data Model timestamps

  15. Data Model Column family

  16. Data Model Column family family:qualifier

  17. Data Model Column family family:qualifier

  18. Data Model - Timestamps • Used to store different versions of data in a cell • New writes default to the current time • Lookup options: • Return the most recent K values • Return all values in a timestamp range
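
A rough sketch of a versioned cell with the two lookup options above. The class `Cell`, its method names, the microsecond clock, and the inclusive range bounds are all assumptions made for illustration, not details taken from the slides.

```python
import time
from typing import List, Optional, Tuple


class Cell:
    """Toy versioned cell; versions are kept newest first (hypothetical)."""

    def __init__(self) -> None:
        self._versions: List[Tuple[int, str]] = []

    def write(self, value: str, timestamp: Optional[int] = None) -> None:
        if timestamp is None:
            # Default to the current time (microseconds is an assumption).
            timestamp = time.time_ns() // 1000
        self._versions.append((timestamp, value))
        self._versions.sort(key=lambda version: version[0], reverse=True)

    def most_recent(self, k: int) -> List[Tuple[int, str]]:
        # The K newest versions.
        return self._versions[:k]

    def in_range(self, start: int, end: int) -> List[Tuple[int, str]]:
        # All versions with start <= timestamp <= end (inclusive by assumption).
        return [(ts, v) for ts, v in self._versions if start <= ts <= end]


weight = Cell()
weight.write("620 lbs", timestamp=2006)
weight.write("600 lbs", timestamp=2007)
```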

  19. API Examples: Write/Modify • Atomic row modification • No support for (RDBMS-style) multi-row transactions
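
The code shown on this slide did not survive in the transcript. As a stand-in, here is a sketch of what "atomic row modification" means: several changes to one row are batched and become visible together. The classes `RowMutation` and `Table`, the single lock, and the row and column names are illustrative assumptions, not the real client library.

```python
import threading
from typing import Dict, List, Optional, Tuple


class RowMutation:
    """Collects set/delete operations for a single row (hypothetical)."""

    def __init__(self, row: str) -> None:
        self.row = row
        self.ops: List[Tuple[str, str, Optional[str]]] = []

    def set(self, column: str, value: str) -> None:
        self.ops.append(("set", column, value))

    def delete(self, column: str) -> None:
        self.ops.append(("delete", column, None))


class Table:
    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, str]] = {}
        self._lock = threading.Lock()

    def apply(self, mutation: RowMutation) -> None:
        with self._lock:
            # Work on a copy and swap it in, so a reader sees either
            # none or all of the mutation's operations.
            row = dict(self._rows.get(mutation.row, {}))
            for op, column, value in mutation.ops:
                if op == "set":
                    row[column] = value
                else:
                    row.pop(column, None)
            self._rows[mutation.row] = row

    def read_row(self, row: str) -> Dict[str, str]:
        with self._lock:
            return dict(self._rows.get(row, {}))


table = Table()
setup = RowMutation("com.cnn.www")
setup.set("anchor:www.abc.com", "ABC")
table.apply(setup)

change = RowMutation("com.cnn.www")
change.set("anchor:www.c-span.org", "CNN")
change.delete("anchor:www.abc.com")
table.apply(change)
```

Each `apply` touches exactly one row; there is deliberately no way to group mutations across rows.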

  20. API Examples: Read • Return sets can be filtered using regular expressions, e.g. anchor:com.cnn.*
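
The read example on this slide was also an image. The sketch below shows only the filtering idea, using Python's `re` module on the column names of one row. The function `filter_columns` and the sample columns are hypothetical, and the dots in the slide's pattern are escaped here so they match literally.

```python
import re

row = {
    "anchor:com.cnn.www": "CNN",
    "anchor:com.cnn.money": "CNN Money",
    "anchor:org.example.www": "Example",
    "contents:": "<html>...</html>",
}


def filter_columns(row, pattern):
    """Keep only the columns whose full name matches the regular expression."""
    regex = re.compile(pattern)
    return {name: value for name, value in row.items() if regex.fullmatch(name)}


cnn_anchors = filter_columns(row, r"anchor:com\.cnn\..*")
```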

  21. SSTable • The Google SSTable file format is used internally to store Bigtable data • An SSTable provides a persistent, ordered, immutable map from keys to values • Each SSTable contains a sequence of blocks • A block index (stored at the end of the SSTable) is used to locate blocks • The index is loaded into memory when the SSTable is opened • A lookup then needs one disk access to fetch the block
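
To make the "one disk access" point concrete, here is a toy SSTable whose index is searched in memory with `bisect` before exactly one block is read. Everything here is an assumption for illustration: blocks hold a fixed number of entries rather than 64 KB, the index stores each block's last key, and `block_reads` merely counts simulated disk accesses.

```python
import bisect


class SSTable:
    """Toy immutable sorted map split into blocks (hypothetical layout)."""

    def __init__(self, items, block_size=2):
        items = sorted(items)  # (key, value) pairs with unique keys
        self._blocks = [
            items[i:i + block_size] for i in range(0, len(items), block_size)
        ]
        # In-memory index: the last key stored in each block.
        self._index = [block[-1][0] for block in self._blocks]
        self.block_reads = 0

    def _read_block(self, i):
        self.block_reads += 1  # stands in for one disk access
        return self._blocks[i]

    def get(self, key):
        # First block whose last key is >= the requested key.
        i = bisect.bisect_left(self._index, key)
        if i == len(self._index):
            return None  # key is past the end; no block is read
        for k, v in self._read_block(i):
            if k == key:
                return v
        return None


table = SSTable([("aardvark", "a"), ("apple", "b"), ("boat", "c")])
value = table.get("apple")
```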

  22. SSTable [Figure: an SSTable laid out as a sequence of 64 KB blocks plus an index of block ranges]

  23. Tablet • Contains some range of rows of the table • Built out of multiple SSTables [Figure: a tablet with start row aardvark and end row apple, built from two SSTables, each consisting of 64 KB blocks and an index]

  24. Table • Multiple tablets make up the table • SSTables can be shared • Tablets do not overlap; SSTables can overlap [Figure: two adjacent tablets, aardvark to apple and apple to boat, backed by four SSTables]

  25. Organization • A Bigtable cluster stores tables • Each table consists of tablets • Initially each table consists of one tablet • As a table grows it is automatically split into multiple tablets • Tablets are assigned to tablet servers • Multiple tablets per server. Each tablet is 100-200 MB • Each tablet lives at only one server • Tablet server splits tablets that get too big

  26. Organization • Master • Keeps track of the set of live tablet servers • Assigns tablets to tablet servers • Detects the addition and expiration of tablet servers • Balances tablet-server load • Handles schema changes such as table and column family creations • A Bigtable library is linked to every client.

  27. Finding a tablet Similar to a B+ tree <table_id, end_row> -> location Location is <ip address, port> of a tablet server
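
The <table_id, end_row> -> location mapping can be sketched as a sorted list searched with `bisect`. This is a simplified single-level illustration, not the multi-level structure itself: the table name, end rows, addresses, and the function `locate` are made up, and the end row is assumed to be inclusive.

```python
import bisect

# Sorted (table_id, end_row) keys with their tablet server locations.
# The "\uffff" end row is a sentinel for the last tablet (an assumption).
keys = [("zoo", "apple"), ("zoo", "boat"), ("zoo", "\uffff")]
locations = [("10.0.0.1", 9000), ("10.0.0.2", 9000), ("10.0.0.3", 9000)]


def locate(table_id, row):
    """Return the <ip address, port> of the tablet server holding this row."""
    # The first entry whose end_row is >= row owns the row.
    i = bisect.bisect_left(keys, (table_id, row))
    if i == len(keys) or keys[i][0] != table_id:
        raise KeyError(f"no tablet for {table_id!r}/{row!r}")
    return locations[i]
```

Keying by end row means one binary search finds the owning tablet without storing start rows.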

  28. Finding a tablet • A Bigtable library is linked to every client. • Client communicates directly with tablet server for reads/writes. • Client reads the Chubby file that points to the root tablet • This starts the location process • The client library caches tablet locations. • There are algorithms for prefetching • If there is no cached information, three network round-trips are needed

  29. System Structure • A Bigtable cell comprises: • Bigtable client, linked with the Bigtable client library: issues Open() and read/write calls • Bigtable master: performs metadata ops and load balancing • Bigtable tablet servers: serve data • Cluster scheduling system: handles failover, monitoring • GFS: holds tablet data, logs • Lock service: holds metadata, handles master election

  30. Tablet Server • When a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely named file in a specific Chubby directory • Let’s call this the servers directory • A tablet server stops serving its tablets if it loses its exclusive lock • This may happen if a network partition causes the tablet server to lose its Chubby session

  31. Tablet Server • A tablet server will attempt to reacquire an exclusive lock on its file as long as the file still exists • If the file no longer exists, the tablet server will never be able to serve again • It kills itself • At some point it can be restarted; it then goes to a pool of unassigned tablet servers

  32. Master Startup Operation • Upon startup the master needs to discover the current tablet assignment • Grabs the unique master lock in Chubby • Prevents concurrent master instantiations • Scans the servers directory in Chubby for live servers • Communicates with every live tablet server • Discovers which tablets are already assigned • Scans the METADATA table to learn the set of tablets • Unassigned tablets are marked for assignment

  33. Master Operation • Detect tablet server failures/resumption • Master periodically asks each tablet server for the status of its lock

  34. Master Operation • Tablet server lost its lock or master cannot contact tablet server: • Master attempts to acquire exclusive lock on the server’s file in the servers directory • If master acquires the lock then the tablets assigned to the tablet server are assigned to others • Master deletes the server’s file in the servers directory • Assignment of tablets should be balanced • If master loses its Chubby session then it kills itself • An election can take place to find a new master
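
The failure-handling steps above can be walked through with a toy lock service. This is a sketch under heavy assumptions: `LockService` is an in-memory stand-in for Chubby, `handle_unreachable_server` is a hypothetical name, and the "least-loaded server" rule is just one simple way to keep assignments balanced.

```python
class LockService:
    """Toy stand-in for Chubby: each file has at most one lock holder."""

    def __init__(self):
        self._holders = {}

    def create(self, path):
        self._holders.setdefault(path, None)

    def try_acquire(self, path, owner):
        if path in self._holders and self._holders[path] is None:
            self._holders[path] = owner
            return True
        return False

    def release(self, path, owner):
        if self._holders.get(path) == owner:
            self._holders[path] = None

    def delete(self, path):
        self._holders.pop(path, None)

    def exists(self, path):
        return path in self._holders


def handle_unreachable_server(locks, server_file, assignments, live_servers):
    """Master side: returns True if the server's tablets were reassigned."""
    if not locks.try_acquire(server_file, "master"):
        return False  # the server still holds its lock, or the file is gone
    locks.delete(server_file)  # the server can never serve again
    orphaned = assignments.pop(server_file, [])
    for tablet in orphaned:
        # Give each orphaned tablet to the currently least-loaded server.
        target = min(live_servers, key=lambda s: len(assignments.setdefault(s, [])))
        assignments[target].append(tablet)
    return True


locks = LockService()
for server in ("servers/ts1", "servers/ts2", "servers/ts3"):
    locks.create(server)
    locks.try_acquire(server, server)

assignments = {"servers/ts1": ["t1", "t2"], "servers/ts2": ["t3"], "servers/ts3": []}
locks.release("servers/ts1", "servers/ts1")  # simulate a lost Chubby session
reassigned = handle_unreachable_server(
    locks, "servers/ts1", assignments, ["servers/ts2", "servers/ts3"]
)
```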

  35. Tablet Server Failure

  36. Tablet Server Failure

  37. Tablet Server Failure

  38. Master Operation • Handles schema changes such as table and column family creations • Handles table creation and merging of tablets • Tablet servers directly update metadata on tablet split, then notify the master • A lost notification may be detected lazily by the master

  39. Tablet Serving • Updates are committed to a commit log that stores redo records • Recently committed updates are stored in a buffer called memtable • Older updates are stored in a sequence of SSTables • To recover a tablet, a tablet server reads its metadata from the METADATA table • Contains list of SSTables and a set of redo points
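
The interplay of commit log, memtable, SSTables, and redo points can be sketched in a few lines. This is a toy model, not the real implementation: the class `Tablet`, the method `flush_memtable`, and the use of a list index as the redo point are assumptions, and SSTables are plain dicts.

```python
class Tablet:
    """Toy tablet: commit log + memtable + immutable SSTables (hypothetical)."""

    def __init__(self):
        self.commit_log = []  # redo records, in commit order
        self.memtable = {}    # recently committed updates
        self.sstables = []    # older updates, oldest table first
        self.redo_point = 0   # log position where replay must start

    def write(self, key, value):
        self.commit_log.append((key, value))  # log first, then memtable
        self.memtable[key] = value

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # newest SSTable first
            if key in sstable:
                return sstable[key]
        return None

    def flush_memtable(self):
        # Freeze the memtable as an SSTable; earlier log records are no
        # longer needed for recovery, so the redo point moves forward.
        self.sstables.append(dict(self.memtable))
        self.memtable = {}
        self.redo_point = len(self.commit_log)

    @classmethod
    def recover(cls, commit_log, sstables, redo_point):
        tablet = cls()
        tablet.commit_log = list(commit_log)
        tablet.sstables = list(sstables)
        tablet.redo_point = redo_point
        # Rebuild the memtable by replaying records after the redo point.
        for key, value in commit_log[redo_point:]:
            tablet.memtable[key] = value
        return tablet


original = Tablet()
original.write("a", "1")
original.flush_memtable()
original.write("b", "2")
recovered = Tablet.recover(
    original.commit_log, original.sstables, original.redo_point
)
```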

  40. Tablet Serving [Figure: write ops go to the tablet log in GFS and to the memtable in memory; read ops see the memtable together with the SSTable files in GFS] • The commit log stores the updates that are made to the data • Recent updates are stored in the memtable • Older updates are stored in SSTable files

  41. Tablet Serving • Recovery process • Reads/writes that arrive at the tablet server: • Is the request well-formed? • Authorization: Chubby holds the permission file • A valid mutation is written to the commit log; group commit is used to improve throughput

  42. Tablet Serving • Tablet recovery process • Read the metadata containing the list of SSTables and the redo points • Redo points are pointers into any commit logs that may contain data for the tablet • Reconstruct the memtable by applying all updates committed since the redo points

  43. Compactions • Minor compaction – convert the memtable into an SSTable • Reduce memory usage • Reduce log traffic on restart • Merging compaction • Reduce number of SSTables • Good place to apply policy “keep only N versions” • Major compaction • Merging compaction that results in only one SSTable • No deletion records, only live data
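
The difference between a merging and a major compaction can be sketched with dicts standing in for SSTables. This is a simplified illustration: the `TOMBSTONE` marker and the function `merging_compaction` are hypothetical names, and versions and the "keep only N versions" policy are left out.

```python
TOMBSTONE = object()  # marks a deletion record (an illustrative convention)


def merging_compaction(sstables, major=False):
    """Merge SSTables (ordered oldest to newest) into a single sorted map.

    A major compaction additionally drops deletion records, so only live
    data remains in the result.
    """
    merged = {}
    for sstable in sstables:
        merged.update(sstable)  # newer entries overwrite older ones
    if major:
        merged = {k: v for k, v in merged.items() if v is not TOMBSTONE}
    return dict(sorted(merged.items()))


older = {"a": "1", "b": "2"}
newer = {"b": TOMBSTONE, "c": "3"}

merged = merging_compaction([older, newer])
live_only = merging_compaction([older, newer], major=True)
```

A non-major merge has to keep the deletion record for "b", because an SSTable outside the merge might still hold an older value that the record needs to suppress.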

  44. Locality Groups • Group column families together into an SSTable • Avoid mingling data, e.g. page contents and page metadata • Can keep some groups all in memory • Can compress locality groups • Tablet movement • Major compaction (with concurrent updates) • Minor compaction (to catch up with updates) without any concurrent updates • Load on new server without requiring any recovery action

  45. Lessons learned • Interesting point: implement only some of the requirements, since the rest are probably not needed • Many types of failure are possible • Big systems need proper systems-level monitoring • Value simple design
