
Lecture 7 – Bigtable



Presentation Transcript


  1. Lecture 7 – Bigtable CSE 490h – Introduction to Distributed Computing, Winter 2008 Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.

  2. Previous Classes • MapReduce: toolkit for processing data • RPC, etc. • GFS: file system

  3. GFS vs Bigtable • GFS provides raw data storage • We need: • More sophisticated storage • Flexible enough to be useful • Store semi-structured data • Reliable, scalable, etc

  4. Examples • URLs: • Contents, crawl metadata, links, anchors, pagerank, … • Per-user data: • User preference settings, recent queries/search results, … • Geographic locations: • Physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, …

  5. Commercial DB • Why not use a commercial database? • Not scalable enough • Too expensive • We need to do low-level optimizations

  6. BigTable Features • Distributed key-value map • Fault-tolerant, persistent • Scalable • Thousands of servers • Terabytes of in-memory data • Petabytes of disk-based data • Millions of reads / writes per second, efficient scans • Self managing • Servers can be added / removed dynamically • Servers adjust to load imbalance

  7. Basic Data Model • Distributed multi-dimensional sparse map: (row, column, timestamp) → cell contents • Good match for most of our applications • [Figure: row “www.cnn.com”, column “contents:”, with cell versions “<html>…” at timestamps t3, t11, and t17]
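The (row, column, timestamp) → contents mapping can be pictured as nested sorted maps. A minimal illustrative C++ sketch (not Bigtable's implementation), with timestamps sorted descending so the newest version of a cell is found first:

      #include <cstdint>
      #include <functional>
      #include <iostream>
      #include <map>
      #include <string>

      // row -> column ("family:qualifier") -> timestamp -> value.
      // Timestamps are kept in descending order so begin() is the newest version.
      using Versions = std::map<int64_t, std::string, std::greater<int64_t>>;
      using Row      = std::map<std::string, Versions>;
      using Table    = std::map<std::string, Row>;

      int main() {
        Table t;
        t["www.cnn.com"]["contents:"][3]  = "<html>...v1";
        t["www.cnn.com"]["contents:"][11] = "<html>...v2";
        t["www.cnn.com"]["contents:"][17] = "<html>...v3";

        // Read the most recent version of a cell.
        const Versions& cell = t["www.cnn.com"]["contents:"];
        std::cout << "latest @t" << cell.begin()->first
                  << " = " << cell.begin()->second << "\n";
      }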

  8. Rows • Name is an arbitrary string • Access to data in a row is atomic • Row creation is implicit upon storing data • Rows ordered lexicographically • Rows close together lexicographically usually reside on one or a small number of machines
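Because rows are sorted lexicographically and nearby rows tend to land on the same machines, applications pick keys that cluster related data. The later API slide uses “com.cnn.www”, the reversed form of “www.cnn.com”, so all pages of one domain sort together; a hypothetical helper for that transformation:

      #include <iostream>
      #include <sstream>
      #include <string>
      #include <vector>

      // Reverse the dot-separated components of a hostname, e.g.
      // "www.cnn.com" -> "com.cnn.www".
      std::string ReverseHostname(const std::string& host) {
        std::vector<std::string> parts;
        std::stringstream ss(host);
        std::string part;
        while (std::getline(ss, part, '.')) parts.push_back(part);
        std::string out;
        for (auto it = parts.rbegin(); it != parts.rend(); ++it) {
          if (!out.empty()) out += '.';
          out += *it;
        }
        return out;
      }

      int main() { std::cout << ReverseHostname("www.cnn.com") << "\n"; }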

  9. Columns • Columns have two-level name structure: family:optional_qualifier • Column family • Unit of access control • Has associated type information • Qualifier gives unbounded columns • Additional level of indexing, if desired • [Figure: row “www.cnn.com” with a “contents:” cell (“<html>…”) and anchor columns “anchor:cnnsi.com” and “anchor:stanford.edu” whose cells hold the anchor text “CNN” and “CNN home page”]
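A column name is just a string with a two-level structure; splitting it at the first ':' yields the family (the unit of access control) and the optional qualifier. A small illustrative sketch:

      #include <iostream>
      #include <string>
      #include <utility>

      // Split "family:qualifier" at the first ':'. The qualifier may be empty,
      // as in "contents:".
      std::pair<std::string, std::string> SplitColumn(const std::string& name) {
        std::string::size_type pos = name.find(':');
        if (pos == std::string::npos) return {name, ""};
        return {name.substr(0, pos), name.substr(pos + 1)};
      }

      int main() {
        auto [family, qualifier] = SplitColumn("anchor:cnnsi.com");
        std::cout << family << " / " << qualifier << "\n";  // anchor / cnnsi.com
      }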

  10. Column Families • Must be created before data can be stored • Small number of column families • Unbounded number of columns

  11. Timestamps • Used to store different versions of data in a cell • New writes default to current time, but timestamps for writes can also be set explicitly by clients

  12. Timestamps • Garbage Collection • Per-column-family settings to tell Bigtable to GC • “Only retain most recent K values in a cell” • “Keep values until they are older than K seconds”
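Both garbage-collection settings are easy to picture against the versions map from the data-model sketch above. A hypothetical illustration that trims a cell to its K newest versions, or to versions newer than a cutoff timestamp:

      #include <algorithm>
      #include <cstdint>
      #include <functional>
      #include <iostream>
      #include <iterator>
      #include <map>
      #include <string>

      // Versions of one cell, newest timestamp first.
      using Versions = std::map<int64_t, std::string, std::greater<int64_t>>;

      // "Only retain most recent K values in a cell."
      void RetainNewestK(Versions* cell, size_t k) {
        auto it = cell->begin();
        std::advance(it, std::min(k, cell->size()));
        cell->erase(it, cell->end());
      }

      // "Keep values until they are older than K seconds": drop every version
      // whose timestamp is strictly older than the cutoff.
      void DropOlderThan(Versions* cell, int64_t cutoff_ts) {
        cell->erase(cell->upper_bound(cutoff_ts), cell->end());
      }

      int main() {
        Versions cell = {{17, "v3"}, {11, "v2"}, {3, "v1"}};
        RetainNewestK(&cell, 2);   // keeps t17 and t11
        DropOlderThan(&cell, 12);  // keeps only t17
        std::cout << cell.size() << " " << cell.begin()->second << "\n";  // 1 v3
      }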

  13. API • Create / delete tables and column families
      Table *T = OpenOrDie("/bigtable/web/webtable");
      RowMutation r1(T, "com.cnn.www");
      r1.Set("anchor:www.c-span.org", "CNN");
      r1.Delete("anchor:www.abc.com");
      Operation op;
      Apply(&op, &r1);
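The calls above are the slide's C++ write path, but nothing shows what Apply amounts to. A toy, in-memory sketch of the same idea, where RowMutation buffers sets and deletes and Apply installs them on the row together (all types here are stand-ins, not the real client library):

      #include <iostream>
      #include <map>
      #include <string>
      #include <vector>

      using Row = std::map<std::string, std::string>;  // column -> latest value

      // Buffers sets and deletes for one row; Apply installs them together,
      // mirroring the shape of the slide's RowMutation / Apply calls.
      struct RowMutation {
        std::string row;
        std::vector<std::pair<std::string, std::string>> sets;
        std::vector<std::string> deletes;
        void Set(const std::string& col, const std::string& val) { sets.emplace_back(col, val); }
        void Delete(const std::string& col) { deletes.push_back(col); }
      };

      void Apply(std::map<std::string, Row>* table, const RowMutation& m) {
        Row& r = (*table)[m.row];
        for (const auto& [col, val] : m.sets) r[col] = val;
        for (const auto& col : m.deletes) r.erase(col);
      }

      int main() {
        std::map<std::string, Row> webtable;
        RowMutation r1;
        r1.row = "com.cnn.www";
        r1.Set("anchor:www.c-span.org", "CNN");
        r1.Delete("anchor:www.abc.com");
        Apply(&webtable, r1);
        std::cout << webtable["com.cnn.www"]["anchor:www.c-span.org"] << "\n";
      }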

  14. Locality Groups • Column families can be assigned to a locality group • Used to organize underlying storage representation for performance • Scans over one locality group are O(bytes_in_locality_group), not O(bytes_in_table) • Data in a locality group can be explicitly memory-mapped
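A hypothetical illustration of why the scan cost drops: each locality group gets its own storage files, so a scan over one group never reads bytes belonging to another. The family names, group names, and paths below are made up for the example.

      #include <iostream>
      #include <map>
      #include <string>

      int main() {
        // Hypothetical assignment of column families to locality groups:
        // anchors are scanned often, page contents are large and cold.
        std::map<std::string, std::string> family_to_group = {
            {"contents", "page_content"},
            {"anchor",   "page_metadata"},
        };
        // One illustrative file per locality group; a scan over "page_metadata"
        // reads bytes proportional to that group, not to the whole table.
        std::map<std::string, std::string> group_file;
        for (const auto& [family, group] : family_to_group)
          group_file[group] = "/gfs/webtable/" + group + ".sst";  // illustrative path

        std::cout << group_file["page_metadata"] << "\n";
      }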

  15. SSTable • File format for storing data • Key-value map • Persistent • Ordered • Immutable • Keys and values are strings

  16. SSTable • Operations • Look up value for key • Iterate over all key/value pairs in specified range • Sequence of blocks (64 KB) • Block index used to locate blocks • How do we find a block using the block index? • Binary search on in-memory index • Or, map complete SSTable into memory
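A hypothetical sketch of the in-memory block index lookup. The exact index layout isn't on the slide; here it is assumed to store the last key of each 64 KB block plus its file offset, so one binary search picks the single block that could contain a key.

      #include <algorithm>
      #include <cstdint>
      #include <iostream>
      #include <string>
      #include <vector>

      struct BlockIndexEntry {
        std::string last_key;  // largest key stored in the block
        uint64_t offset;       // byte offset of the block within the SSTable file
      };

      // Return the offset of the only block that could contain `key`, or -1 if
      // the key is past the end of the table. One binary search, one block read.
      int64_t FindBlock(const std::vector<BlockIndexEntry>& index, const std::string& key) {
        auto it = std::lower_bound(index.begin(), index.end(), key,
            [](const BlockIndexEntry& e, const std::string& k) { return e.last_key < k; });
        if (it == index.end()) return -1;
        return static_cast<int64_t>(it->offset);
      }

      int main() {
        std::vector<BlockIndexEntry> index = {{"cat", 0}, {"dog", 65536}, {"zebra", 131072}};
        std::cout << FindBlock(index, "cow") << "\n";  // 65536: "cow" <= "dog"
      }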

  17. Chubby • Bigtable relies on a lock service called Chubby • Ensure there is at most one active master • Store bootstrap location of Bigtable data • Finalize tablet server death • Store column family information • Store access control lists

  18. Chubby • Namespace that consists of directories and small files • Each directory or file can be used as a lock • Chubby client maintains a session with the Chubby service • Expires if unable to renew its session lease within the expiration time • If expired, client loses any locks and open handles • Atomic reads / writes

  19. Tablets • As a table grows, it is split into tablets • [Figure: a table split into three tablets holding rows A–E, F–R, and S–Z]

  20. Tablets • Large tables are broken into tablets at row boundaries • Tablet holds contiguous range of rows • Aim for ~100MB to 200MB of data per tablet • Tablet server responsible for ~100 tablets • Fine-grained load balancing: • Migrate tablets away from overloaded machine • Master makes load-balancing decisions
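Since each tablet holds a contiguous, sorted range of rows, finding the tablet responsible for a row is a binary search over tablet end keys. A hypothetical sketch (the end-key layout and server ids here are assumptions for illustration; the real client resolves this through the METADATA table):

      #include <algorithm>
      #include <iostream>
      #include <string>
      #include <vector>

      struct Tablet {
        std::string end_row;  // tablets cover (previous end_row, end_row]
        int server_id;        // tablet server currently responsible for it
      };

      // Tablets are sorted by end_row; the first end_row >= row identifies the tablet.
      int FindServer(const std::vector<Tablet>& tablets, const std::string& row) {
        auto it = std::lower_bound(tablets.begin(), tablets.end(), row,
            [](const Tablet& t, const std::string& r) { return t.end_row < r; });
        return it == tablets.end() ? -1 : it->server_id;
      }

      int main() {
        std::vector<Tablet> tablets = {{"E", 1}, {"R", 2}, {"\xff", 3}};  // A-E, F-R, S-Z
        std::cout << FindServer(tablets, "Mongolia") << "\n";  // 2
      }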

  21. Tablet Server • Master assigns tablets to tablet servers • Tablet server • Handles read / write requests to its tablets • Splits large tablets • Client does not move data through master

  22. Master Startup • Grab unique master lock in Chubby • Scan servers directory in Chubby to find live servers • Communicate with every live tablet server to discover which tablets are assigned • Scan METADATA table to learn set of tablets • Track unassigned tablets

  23. Tablet Assignment • Master has list of unassigned tablets • When a tablet is unassigned, and a tablet server has room for it, Master sends tablet load request to tablet server

  24. Tablet Serving • Persistent state of tablet is stored in GFS • Updates committed to a log that stores redo records • Memtable: sorted buffer in memory of recent commits • Older updates stored in SSTables
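A toy sketch of that serving path under the stated assumptions: every mutation is appended to a redo log before it is applied to the in-memory memtable, and a read has to consult the memtable first and then progressively older SSTables (here the log and SSTables are plain in-memory stand-ins rather than files in GFS).

      #include <iostream>
      #include <map>
      #include <optional>
      #include <string>
      #include <vector>

      using KV = std::map<std::string, std::string>;

      struct TabletState {
        std::vector<std::string> redo_log;  // stand-in for the commit log in GFS
        KV memtable;                        // recent commits, sorted, in memory
        std::vector<KV> sstables;           // older data, newest first (stand-in for GFS files)
      };

      // Write: log the redo record, then apply it to the memtable.
      void Write(TabletState* t, const std::string& key, const std::string& value) {
        t->redo_log.push_back("SET " + key + "=" + value);
        t->memtable[key] = value;
      }

      // Read: check the memtable first, then progressively older SSTables.
      std::optional<std::string> Read(const TabletState& t, const std::string& key) {
        if (auto it = t.memtable.find(key); it != t.memtable.end()) return it->second;
        for (const KV& sst : t.sstables)
          if (auto it = sst.find(key); it != sst.end()) return it->second;
        return std::nullopt;
      }

      int main() {
        TabletState tablet;
        tablet.sstables.push_back({{"com.cnn.www", "<html>old"}});
        Write(&tablet, "com.cnn.www", "<html>new");
        std::cout << *Read(tablet, "com.cnn.www") << "\n";  // <html>new
      }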

  25. What if memtable gets too big? • Minor compaction: • Create new memtable • Convert old memtable to SSTable and write to GFS • Note: every minor compaction creates a new SSTable

  26. Compactions • Merging compaction: • Bound number of SSTables by executing merging compaction • Reads contents of a few SSTables and the memtable, and writes out a new SSTable • Major compaction: • Merging compaction that rewrites all SSTables into one SSTable • Produces an SSTable that contains no deletion info • Allows Bigtable to reclaim resources from deleted data
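A minimal sketch of a merging compaction under the assumptions above: several sorted inputs are merged into one new table, newer values win, and a major compaction additionally drops deletion markers so deleted data really disappears. The tombstone marker is hypothetical; it just stands in for Bigtable's deletion entries.

      #include <iostream>
      #include <iterator>
      #include <map>
      #include <string>
      #include <vector>

      using KV = std::map<std::string, std::string>;
      const std::string kTombstone = "__DELETED__";  // hypothetical deletion marker

      // Merge inputs ordered newest-first into one table. If `major` is true the
      // output is self-contained, so tombstones can be dropped entirely.
      KV Compact(const std::vector<KV>& inputs_newest_first, bool major) {
        KV out;
        // Walk oldest to newest so later (newer) values overwrite earlier ones.
        for (auto it = inputs_newest_first.rbegin(); it != inputs_newest_first.rend(); ++it)
          for (const auto& [key, value] : *it) out[key] = value;
        if (major)
          for (auto it = out.begin(); it != out.end();)
            it = (it->second == kTombstone) ? out.erase(it) : std::next(it);
        return out;
      }

      int main() {
        KV memtable = {{"a", kTombstone}};   // newest: "a" was deleted
        KV sst1 = {{"a", "1"}, {"b", "2"}};  // older data
        KV merged = Compact({memtable, sst1}, /*major=*/true);
        std::cout << merged.count("a") << " " << merged.at("b") << "\n";  // 0 2
      }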

  27. System Structure • [Figure: a Bigtable cell. A Bigtable client uses the Bigtable client library to Open() tables and read/write data; the Bigtable master performs metadata ops and load balancing; Bigtable tablet servers serve data; the cluster scheduling system handles failover and monitoring; GFS holds tablet data and logs; the lock service holds metadata and handles master election]

  28. Google: The Big Picture • Custom solutions for unique problems! • GFS: stores data reliably • But just raw files • BigTable: gives us a key/value map • Database-like, but doesn’t provide everything we need • Chubby: locking mechanism • SSTable file format • MapReduce: lets us process data from BigTable (and other sources)

  29. Common Principles • One master, multiple helpers • MapReduce: master coordinates work amongst map / reduce workers • Bigtable: master knows about location of tablet servers • GFS: master coordinates data across chunkservers • Issues with a single master • What about master failure? • How do you avoid bottlenecks?

  30. Next Class • Chord: A Scalable P2P Lookup Service for Internet Applications • Distributed Hash Table • Guest Speaker: John McDowell, Microsoft
