
Bigtable: A Distributed Storage System for Structured Data



  1. Bigtable: A Distributed Storage System for Structured Data Presented By: Srinivasa R Santhanam CS 5204 Fall 2007

  2. Outline • Motivation & Goals for Bigtable • Data Model • API • Building Blocks • Implementation • Refinements • Performance • Related Work • Conclusions CS 5204 Fall 2007

  3. Motivation • Lots of (semi-)structured data at Google • URLs: • Contents, crawl metadata, links, anchors, pagerank, … • Per-user data: • User preference settings, recent queries/search results, … • Geographic locations: • Physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, … • Scale is large • Billions of URLs, many versions/page (~20K/version) • Hundreds of millions of users, thousands of q/sec • 100TB+ of satellite image data CS 5204 Fall 2007 Source: Jeff Dean, Google Fellow, Bigtable Presentation at University of Washington

  4. Why not just use commercial DB? • Scale is too large for most commercial databases • Even if it weren’t, cost would be very high • Building internally means system can be applied across many projects for low incremental cost • Low-level storage optimizations help performance significantly • Much harder to do when running on top of a database layer CS 5204 Fall 2007 Source: Jeff Dean, Google Fellow, Bigtable Presentation at University of Washington

  5. Goals • Want asynchronous processes to be continuously updating different pieces of data • Want access to most current data at any time • Need to support: • Very high read/write rates (millions of ops per second) • Efficient scans over all or interesting subsets of data • Efficient joins of large one-to-one and one-to-many datasets • Often want to examine data changes over time • E.g. contents of a web page over multiple crawls • Self-managing • Servers can be added/removed dynamically • Servers adjust to balance load CS 5204 Fall 2007 Source: Jeff Dean, Google Fellow, Bigtable Presentation at University of Washington

  6. Data Model • Sparse • Distributed • Persistent • Multidimensional sorted map CS 5204 Fall 2007
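
To make the "multidimensional sorted map" concrete, here is a minimal single-machine sketch (my own illustration, not Bigtable code): a map from (row key, column key, timestamp) to an uninterpreted string value, kept sorted with rows and columns in lexicographic order and newer timestamps first. Distribution, persistence, and sparseness at scale are what the rest of the system adds on top.

#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <tuple>

// A cell is addressed by (row key, column key, timestamp).
using CellKey = std::tuple<std::string, std::string, int64_t>;

// Rows and columns sort lexicographically; timestamps sort newest-first,
// mirroring how Bigtable returns the most recent version of a cell first.
struct CellKeyLess {
  bool operator()(const CellKey& a, const CellKey& b) const {
    if (std::get<0>(a) != std::get<0>(b)) return std::get<0>(a) < std::get<0>(b);
    if (std::get<1>(a) != std::get<1>(b)) return std::get<1>(a) < std::get<1>(b);
    return std::get<2>(a) > std::get<2>(b);
  }
};

int main() {
  // Only cells that exist are stored, which is where "sparse" comes from.
  std::map<CellKey, std::string, CellKeyLess> table;
  table[{"com.cnn.www", "anchor:cnnsi.com", 9}] = "CNN";
  table[{"com.cnn.www", "contents:", 6}] = "<html>...";
  for (const auto& [key, value] : table)
    std::cout << std::get<0>(key) << " " << std::get<1>(key) << " t"
              << std::get<2>(key) << " -> " << value << "\n";
}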

  7. Rows • Row identified by a row key: an arbitrary string up to 64KB in size (10-100 bytes typically used) • Row key examples: • “com.cnn.www/downloads.html:ftp” • “com.cnn.www/index.html:http” • “com.cnn.www/login.html:https” • Bigtable maintains data in lexicographic order of row key • Rows that are lexicographically close usually end up on one or a small number of machines • Row creation is implicit upon storing data • Reads/writes under a single row key are atomic • No transactional support across rows CS 5204 Fall 2007
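
As a side note, the reversed-hostname row keys above are a client convention, not something Bigtable does for you. A hypothetical helper (names are mine, not part of the Bigtable API) might build them like this, so that pages from the same domain sort next to each other and tend to land in the same tablet:

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// "www.cnn.com" -> "com.cnn.www", so all cnn.com pages are adjacent in key order.
std::string ReverseHostname(const std::string& host) {
  std::vector<std::string> parts;
  std::stringstream ss(host);
  std::string part;
  while (std::getline(ss, part, '.')) parts.push_back(part);
  std::string reversed;
  for (auto it = parts.rbegin(); it != parts.rend(); ++it) {
    if (!reversed.empty()) reversed += '.';
    reversed += *it;
  }
  return reversed;
}

std::string MakeRowKey(const std::string& scheme, const std::string& host,
                       const std::string& path) {
  return ReverseHostname(host) + path + ":" + scheme;
}

int main() {
  // Prints "com.cnn.www/index.html:http"
  std::cout << MakeRowKey("http", "www.cnn.com", "/index.html") << "\n";
}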

  8. Tablets • Large tables broken into tablets at row boundaries • Tablet holds contiguous range of rows • Clients can often choose row keys to achieve locality • Aim for ~100MB to 200MB of data per tablet • Serving machine responsible for ~100 tablets • Fast recovery: • 100 machines each pick up 1 tablet from failed machine • Fine-grained load balancing: • Migrate tablets away from overloaded machine • Master makes load-balancing decisions CS 5204 Fall 2007 Source: Jeff Dean, Google Fellow, Bigtable Presentation at University of Washington

  9. Tablets & Splitting [Diagram: a table with “language:” and “contents:” columns and rows sorted from “aaa.com” through “cnn.com” (“EN”, “<html>…”), “cnn.com/sports.html”, …, “website.com”, …, “zuppa.com/menu.html”; the table is split into tablets at row boundaries, e.g. between “yahoo.com/kids.html” and “yahoo.com/kids.html\0”] CS 5204 Fall 2007 Source: Jeff Dean, Google Fellow, Bigtable Presentation at University of Washington

  10. Columns • Column family • Set of column keys • Basic unit of access control • Same type of data in a family (compressed together) • Example: “anchor:” • Column keys (traditional columns) • syntax: “family:qualifier” • Example: “anchor:cnnsi.com” CS 5204 Fall 2007
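
Since access control, locality-group assignment, and compression are all applied per column family, code handling column keys usually needs to peel off the family prefix. A tiny illustrative helper (not from the Bigtable API) for the "family:qualifier" syntax:

#include <iostream>
#include <string>
#include <utility>

// Split "family:qualifier" into its two parts. Assumes the key contains ':'
// (a bare family key such as "anchor:" yields an empty qualifier).
std::pair<std::string, std::string> SplitColumnKey(const std::string& column_key) {
  const auto colon = column_key.find(':');
  return {column_key.substr(0, colon), column_key.substr(colon + 1)};
}

int main() {
  auto [family, qualifier] = SplitColumnKey("anchor:cnnsi.com");
  std::cout << family << " / " << qualifier << "\n";  // prints: anchor / cnnsi.com
}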

  11. Locality Groups • Column families can be assigned to a locality group • Used to organize underlying storage representation for performance • Scans over one locality group are O(bytes_in_locality_group), not O(bytes_in_table) • Data in a locality group can be explicitly memory-mapped CS 5204 Fall 2007 Source: Jeff Dean, Google Fellow, Bigtable Presentation at University of Washington

  12. Locality Groups [Diagram: row “www.cnn.com” with the “contents:” family (“<html>…”) in one locality group and the “language:” and “pagerank:” families (“EN”, 0.65) in another] CS 5204 Fall 2007 Source: Jeff Dean, Google Fellow, Bigtable Presentation at University of Washington

  13. Timestamps • Multiple versions of data indexed by timestamp (64-bit integers) • Can be assigned by Bigtable or by the client application • Data stored in decreasing timestamp order • So the most recent data is read first • Lookup options: • “Return most recent K values” • “Return all values in a timestamp range” (or all values) • Two per-column-family settings (used for GC): • Keep only the last n versions of data • Keep only versions from the last x days CS 5204 Fall 2007
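
A rough sketch, with invented names, of how those two per-column-family GC settings could be applied to one cell's versions, which Bigtable stores newest-first. In practice a client configures one rule or the other; both checks are shown together here only for brevity.

#include <cstdint>
#include <string>
#include <vector>

struct Version {
  int64_t timestamp_micros;  // Bigtable timestamps are 64-bit integers
  std::string value;
};

// newest_first must be sorted by decreasing timestamp, as Bigtable stores it.
std::vector<Version> ApplyGc(const std::vector<Version>& newest_first,
                             size_t keep_last_n, int64_t now_micros,
                             int64_t max_age_micros) {
  std::vector<Version> kept;
  for (const Version& v : newest_first) {
    if (kept.size() >= keep_last_n) break;                        // "last n versions"
    if (now_micros - v.timestamp_micros > max_age_micros) break;  // "last x days"
    kept.push_back(v);
  }
  return kept;
}

int main() {
  std::vector<Version> versions = {{9000000, "v3"}, {6000000, "v2"}, {2000000, "v1"}};
  // Keeps v3 and v2; v1 is dropped (third-newest version and older than the age limit).
  std::vector<Version> kept = ApplyGc(versions, 2, /*now_micros=*/10000000,
                                      /*max_age_micros=*/5000000);
  return kept.size() == 2 ? 0 : 1;
}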

  14. Conceptual vs. Physical • The conceptual row view maps onto physical cells keyed by (row, column, timestamp), e.g. (“com.cnn.www”, “contents”, t6) and (“com.cnn.www”, “anchor:my.look.ca”, t8) • “contents:” and “anchor:” are in different locality groups, so they are stored separately CS 5204 Fall 2007

  15. API • Metadata operations • Create/delete tables and column families • Change cluster, table and column family metadata (e.g. access control rights) CS 5204 Fall 2007

  16. API - Writes • Set(): write cells in a row • DeleteCells(): delete cells in a row • DeleteRow(): delete all cells in a row • Also supports delete in a timestamp range

// Open the table
Table *T = OpenOrDie("/bigtable/web/webtable");

// Write a new anchor and delete an old anchor
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
r1.Delete("anchor:www.abc.com");
Operation op;
Apply(&op, &r1); // atomic

CS 5204 Fall 2007

  17. API - Reads • Scanner: read arbitrary cells in a Bigtable • Each row read is atomic • Can restrict returned rows to a particular range • Can ask for data from one row, all rows, etc. • Can ask for all columns, just certain column families, or specific columns

// Open the table
Table *T = OpenOrDie("/bigtable/web/webtable");

Scanner scanner(T);
ScanStream *stream;
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
scanner.Lookup("com.cnn.www");
for (; !stream->Done(); stream->Next()) {
  printf("%s %s %lld %s\n",
         scanner.RowName(),
         stream->ColumnName(),
         stream->MicroTimestamp(),
         stream->Value());
}

CS 5204 Fall 2007

  18. Building Blocks • GFS – to store log and data files (stores persistent state of the system) • SSTable – file format to store data • Chubby – persistent distributed lock service • Scheduler – schedules jobs involved in Bigtable serving • MapReduce – often used to read/write Bigtable data (Bigtable can be I/P and/or O/P for MapReduce computations) CS 5204 Fall 2007

  19. Typical Cluster [Diagram: machines 1…N each run Linux, a scheduler slave and a GFS chunkserver, plus some mix of BigTable tablet servers, a BigTable master, a GFS master and user apps (app1, app2, …); a cluster scheduling master and a lock service run alongside] CS 5204 Fall 2007 Source: Jeff Dean, Google Fellow, Bigtable Presentation at University of Washington

  20. Google File System (GFS) • Master manages metadata • Data transfers happen directly between clients and chunkservers • Files broken into chunks (typically 64 MB) • Chunks triplicated across three machines for safety [Diagram: clients talk to the GFS master (with replicas) for metadata, then exchange chunks C0…C5 directly with chunkservers 1…N] CS 5204 Fall 2007 Source: Jeff Dean, Google Fellow, Bigtable Presentation at University of Washington

  21. SSTable • Persistent, ordered, immutable map from keys (row key, column key, timestamp) to values (strings) • Lookup options: • Value associated with a key • Key/value pairs in a specified key range • SSTable contains a sequence of blocks and a block index • Block index loaded into memory when the SSTable is opened • Lookups performed with a single disk seek, using the in-memory index • Optionally, an SSTable can be completely mapped into memory • Lookups and scans without touching disk CS 5204 Fall 2007
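
The following is a rough sketch of why an SSTable lookup needs only one disk seek; the index layout and names here are assumptions for illustration, not the real SSTable format. The block index, loaded into memory when the SSTable is opened, maps the last key of each block to that block's position, so a binary search over the index picks the single block to read.

#include <cstdint>
#include <map>
#include <optional>
#include <string>

struct BlockHandle {
  uint64_t offset;  // where the block starts in the SSTable file
  uint64_t size;    // how many bytes to read
};

class SSTableIndex {
 public:
  void Add(const std::string& last_key_in_block, BlockHandle handle) {
    index_[last_key_in_block] = handle;
  }

  // Returns the handle of the only block that could contain `key`.
  std::optional<BlockHandle> FindBlock(const std::string& key) const {
    auto it = index_.lower_bound(key);            // first block whose last key >= key
    if (it == index_.end()) return std::nullopt;  // key is past the end of the table
    return it->second;
  }

 private:
  std::map<std::string, BlockHandle> index_;  // last key of each block -> block location
};

int main() {
  SSTableIndex index;
  index.Add("com.cnn.www", {0, 65536});
  index.Add("com.zuppa.www", {65536, 65536});
  std::optional<BlockHandle> block = index.FindBlock("com.dodo.www");
  // One read of block->size bytes at block->offset would finish the lookup.
  return block.has_value() ? 0 : 1;
}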

  22. Chubby • 5 active Chubby replicas, 1 elected as master to serve requests • Provides a namespace of directories and small files • Each directory/file can be used as a lock, and reads and writes to a file are atomic • Bigtable usage: • Ensures there is only one active master at any time • Stores bootstrap location of Bigtable data (root tablet – the 1st METADATA tablet) • To discover tablet servers • To finalize tablet server deaths • Stores Bigtable schema information • Stores access control lists CS 5204 Fall 2007

  23. Implementation • Three major components: • Library that is linked into every client • One master server • Many tablet servers CS 5204 Fall 2007

  24. System Structure [Diagram: a Bigtable cell; the Bigtable client links the Bigtable client library, issues metadata ops to the Bigtable master and, after Open(), reads/writes directly to the Bigtable tablet servers, which serve the data; the master performs metadata ops and load balancing; underneath sit the cluster scheduling system (handles failover, monitoring), GFS (holds tablet data, logs) and the lock service (holds metadata, handles master election)] CS 5204 Fall 2007 Source: Jeff Dean, Google Fellow, Bigtable Presentation at University of Washington

  25. Locating Tablets • Since tablets move around from server to server, given a row, how do clients find the right machine? • Need to find tablet whose row range covers the target row • One approach: could use the BigTable master • Central server almost certainly would be bottleneck in large system • Instead: store special tables containing tablet location info in BigTable cell itself CS 5204 Fall 2007

  26. Locating Tablets (cont.) • Approach: 3-level hierarchical lookup scheme for tablets • Location is the ip:port of the relevant server; rowkey = encoding(tableID + end row) • 1st level: bootstrapped from lock service, points to owner of META0 • 2nd level: uses META0 data to find owner of appropriate META1 tablet • 3rd level: META1 table holds locations of tablets of all other tables • META1 table itself can be split into multiple tablets • Aggressive prefetching + caching • Most ops go right to the proper machine [Diagram: pointer to META0 location stored in lock service → META0 table (one row per META1 tablet) → META1 table (one row per non-META tablet, all tables) → actual tablet in table T] CS 5204 Fall 2007 Source: Jeff Dean, Google Fellow, Bigtable Presentation at University of Washington
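
A client-side sketch of the 3-level lookup, with the RPC interfaces invented here purely for illustration (the stubs just return fixed addresses): Chubby hands out the location of the META0 (root) tablet, a META0 row points at the owning META1 tablet, and a META1 row points at the tablet of the user table. Caching is simplified to one entry per (table, row); the real client library caches tablet row ranges and prefetches aggressively, which is why most operations skip the hierarchy entirely.

#include <iostream>
#include <map>
#include <string>

// Location of a tablet = address of the tablet server currently serving it.
struct TabletLocation { std::string server_ip_port; };

// Stand-ins for the real RPCs to Chubby and to metadata tablet servers.
TabletLocation ReadMeta0LocationFromChubby() { return {"10.0.0.1:9000"}; }
TabletLocation LookupInMetaTablet(const TabletLocation& /*meta_tablet*/,
                                  const std::string& /*meta_row*/) {
  return {"10.0.0.2:9000"};
}

class TabletLocator {
 public:
  TabletLocation Locate(const std::string& table_id, const std::string& row_key) {
    const std::string meta_row = table_id + ";" + row_key;  // stands in for encoding(tableID + end row)
    auto it = cache_.find(meta_row);
    if (it != cache_.end()) return it->second;  // common case: no lookups at all

    TabletLocation meta0 = ReadMeta0LocationFromChubby();         // level 1: pointer in lock service
    TabletLocation meta1 = LookupInMetaTablet(meta0, meta_row);   // level 2: META0 row -> META1 tablet
    TabletLocation tablet = LookupInMetaTablet(meta1, meta_row);  // level 3: META1 row -> user tablet
    cache_[meta_row] = tablet;
    return tablet;
  }

 private:
  std::map<std::string, TabletLocation> cache_;  // simplified; real client caches row ranges
};

int main() {
  TabletLocator locator;
  std::cout << locator.Locate("webtable", "com.cnn.www").server_ip_port << "\n";
}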

  27. Tablet Assignment (I) • Each tablet assigned to one tablet server • Bigtable master: • Keeps track of unassigned tablets • Assigns a tablet by sending a tablet load request to a tablet server with sufficient room • Detects when a tablet server is no longer serving its tablets (via the status of its lock in Chubby) and returns those tablets to the set of unassigned tablets CS 5204 Fall 2007

  28. Tablet Assignment (II) • Master startup actions: • Grabs unique master lock in chubby (ensures single master) • Scans Chubby servers directory to find live servers • Communicates with live servers to discover tablets already assigned to each server • Scans METADATA table to learn set of tablets and adds unassigned tablets to a set CS 5204 Fall 2007

  29. Tablet Representation & Serving [Diagram: writes go both to an append-only commit log on GFS and to an in-memory (random-access) buffer, the memtable; reads merge the memtable with the tablet’s SSTables on GFS (optionally mmapped)] • SSTable: immutable on-disk ordered map from string → string • String keys: <row, column, timestamp> triples CS 5204 Fall 2007 Source: Jeff Dean, Google Fellow, Bigtable Presentation at University of Washington

  30. Compactions • Minor • Merging • Major CS 5204 Fall 2007

  31. Refinements • Locality Groups – explained earlier • Compression (speed vs. compression ratio) • Caching for read performance • Bloom Filters • Commit-log Implementation (Shared Logs) • Speeding up tablet recovery • Exploiting immutability CS 5204 Fall 2007

  32. Performance CS 5204 Fall 2007

  33. Related Work (I) • Boxwood project • Goal: to provide infrastructure for building higher level services such as file systems or DB, but Bigtable directly supports client applications that wish to store data • CAN, Chord, Tapestry, Pastry • Highly variable bandwidth, untrusted participants, frequent reconfiguration, decentralized control and Byzantine fault tolerance (these are not Bigtable goals) • Parallel DB – complete relational model with transaction support • Oracle’s RAC (shared disks & distributed lock manager) • IBM DB2 parallel edition (shared nothing architecture) CS 5204 Fall 2007

  34. Related Work (II) • Column-based rather than row-based storage (locality groups) • Log-structured merge tree (LSM-tree): memtable and SSTables • C-Store: read-optimized RDBMS (Bigtable performs well on both read- and write-intensive applications) • Bigtable’s load balancer solves some of the same load and memory balancing problems faced by shared-nothing DBs, but doesn’t have the following problems: • Multiple copies of the same data (views, indices) • In-memory data is user-specified, not dynamically determined (by an optimizer) • No complex queries to execute or optimize • Hbase – Bigtable-like structured storage for Hadoop HDFS • PNUTS – Platform for Nimble Universal Table Storage (PNUTS/YDHT cluster) from Yahoo! Research, led by Raghu Ramakrishnan CS 5204 Fall 2007

  35. Conclusions • Bigtable achieved its goals: it provides a high-performance storage system at large scale using just commodity hardware • Self-managing • Thousands of servers • Millions of ops/second • Multiple GB/s of reads/writes CS 5204 Fall 2007

  36. Discussion Topics • Bigtable Joins • Declarative SQL to procedural use of API • Sawzall • MapReduce • Row stores vs Column stores • Integrity Constraints CS 5204 Fall 2007

  37. Interesting Links • Hbase project (very similar to Bigtable) • http://wiki.apache.org/lucene-hadoop/Hbase • Jeff Dean’s Bigtable presentation at University of Washington • http://video.google.com/videoplay?docid=7278544055668715642&q=bigtable CS 5204 Fall 2007

  38. Backup Slides CS 5204 Fall 2007

  39. Minor Compactions • Memtable frozen, converted to an SSTable, and written to GFS • New memtable is created • Incoming reads/writes can continue during the compaction • Two goals: • Reduces memory usage of the tablet server • Reduces the amount of log that must be read during recovery CS 5204 Fall 2007

  40. Merging Compactions • Every minor compaction creates a new SSTable • Number of SSTables bounded by periodically executing a merging compaction • Input (memtable + a few SSTables) → output: one new SSTable • Inputs discarded once the compaction finishes CS 5204 Fall 2007
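
A sketch of the core idea behind a merging compaction (simplified, not Bigtable's code): the memtable and SSTables are all sorted maps, so producing the single output SSTable is just a merge of sorted runs in which a newer value for a key replaces an older one. The real system also has to carry deletion entries and multiple timestamped versions, which are ignored here.

#include <map>
#include <string>
#include <vector>

using SortedRun = std::map<std::string, std::string>;  // memtable or one SSTable

// Inputs are ordered oldest to newest; the newest write for a key wins.
SortedRun MergingCompaction(const std::vector<SortedRun>& oldest_to_newest) {
  SortedRun merged;
  for (const SortedRun& run : oldest_to_newest)
    for (const auto& [key, value] : run)
      merged[key] = value;  // later (newer) runs overwrite earlier entries
  return merged;  // written out as the one new SSTable; the inputs are discarded
}

int main() {
  SortedRun old_sstable = {{"a", "1"}, {"b", "stale"}};
  SortedRun memtable = {{"b", "fresh"}, {"c", "3"}};
  SortedRun output = MergingCompaction({old_sstable, memtable});
  return output.at("b") == "fresh" ? 0 : 1;
}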

  41. Major Compactions • A merging compaction that rewrites all SSTables into a single SSTable • Bigtable cycles through its tablets and applies major compactions regularly • Reclaims resources used by deleted data (not reclaimed by non-major compactions) • Ensures deleted data disappears from the system in a timely manner CS 5204 Fall 2007

  42. Bloom Filters • Problem: Reads can cause a lot of disk access for all the SSTables not in memory • Solution: Bloom filter, a space efficient probabilistic data structure to test element membership in a set • row/column pair (element) in SSTable (set) • false positives are possible but false negatives are not • Most lookups for non-existent rows or columns do not need to touch disk CS 5204 Fall 2007
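
A minimal Bloom-filter sketch; the bit-array size, hash count, and seeded-hash trick are arbitrary choices of mine, not Bigtable's parameters. The property that matters is asymmetric error: MayContain() can return a false positive, but never a false negative, so a negative answer lets a read skip the SSTable's disk blocks entirely.

#include <bitset>
#include <functional>
#include <iostream>
#include <string>

class BloomFilter {
 public:
  void Add(const std::string& row_and_column) {
    for (size_t i = 0; i < kNumHashes; ++i) bits_.set(Slot(row_and_column, i));
  }

  // false => the key is certainly absent, so no disk access is needed.
  // true  => the key *might* be present; read the SSTable to find out.
  bool MayContain(const std::string& row_and_column) const {
    for (size_t i = 0; i < kNumHashes; ++i)
      if (!bits_.test(Slot(row_and_column, i))) return false;
    return true;
  }

 private:
  static constexpr size_t kNumBits = 1 << 16;  // arbitrary size for illustration
  static constexpr size_t kNumHashes = 4;

  static size_t Slot(const std::string& key, size_t seed) {
    // Cheap seeded hash, good enough to illustrate k independent probes.
    return std::hash<std::string>{}(key + static_cast<char>('A' + seed)) % kNumBits;
  }

  std::bitset<kNumBits> bits_;
};

int main() {
  BloomFilter filter;
  filter.Add("com.cnn.www|anchor:cnnsi.com");
  std::cout << filter.MayContain("com.cnn.www|anchor:cnnsi.com") << " "  // 1: might be present
            << filter.MayContain("com.missing.www|contents:") << "\n";   // almost surely 0
}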

  43. Shared Logs • Problem: Bigtable designed for 1M tablets, 1000s of tablet servers • 1M logs being simultaneously written performs badly • Solution: shared logs • Write one log file per tablet server instead of per tablet • Updates for many tablets commingled in the same file • Start new log chunks every so often (64 MB) • Problem: during recovery, a server needs to read log data to apply mutations for a tablet • Lots of wasted I/O if lots of machines need to read data for many tablets from the same log chunk CS 5204 Fall 2007 Source: Jeff Dean, Google Fellow, Bigtable Presentation at University of Washington
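
A small sketch of the shared-log idea, with types invented here for illustration: one physical commit log per tablet server, each appended entry tagged with the tablet it belongs to, so mutations for many tablets end up commingled in one file. The tablet tag plus a sequence number is exactly what recovery later sorts by.

#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct LogEntry {
  std::string tablet;        // which tablet this mutation belongs to
  uint64_t sequence_number;  // total order within this server's shared log
  std::string mutation;      // serialized RowMutation
};

class SharedCommitLog {
 public:
  void Append(const std::string& tablet, std::string mutation) {
    entries_.push_back({tablet, next_sequence_++, std::move(mutation)});
    // The real implementation appends to a GFS file and starts a new log
    // chunk roughly every 64 MB; that bookkeeping is omitted here.
  }

 private:
  std::vector<LogEntry> entries_;
  uint64_t next_sequence_ = 0;
};

int main() {
  SharedCommitLog log;  // one per tablet server, shared by all tablets it serves
  log.Append("tablet-A", "set anchor:www.c-span.org = CNN");
  log.Append("tablet-B", "delete anchor:www.abc.com");
}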

  44. Shared Log Recovery • Servers inform the master of the log chunks they need to read • Master aggregates and orchestrates sorting of the needed chunks • Assigns log chunks to be sorted to different tablet servers • Servers sort chunks by tablet and write the sorted data to local disk • Other tablet servers ask the master which servers have the sorted chunks they need • Tablet servers issue direct RPCs to peer tablet servers to read the sorted data for their tablets CS 5204 Fall 2007 Source: Jeff Dean, Google Fellow, Bigtable Presentation at University of Washington

  45. Exploiting Immutability of SSTables • Do not need any synchronization for accesses to the file system when reading from SSTables • Memtable is a mutable data structure • Copy-on-write allows reads and writes to proceed in parallel • Permanently removing deleted data is transformed to GC of obsolete SSTables • Splitting tablets quickly • Instead of creating new SSTables for child tablets, the child tablets share the SSTables of the parent tablet CS 5204 Fall 2007

  46. Compression • User specified compression format applied to each SSTable block • Read small portions without decompressing entire file • Emphasis given to speed over space reduction • Good compression achieved: • Versions of same data • Good row key space design CS 5204 Fall 2007

  47. Caching for Read Performance • Two levels of caching • Scan cache: key-value pairs • Good for applications reading same data repeatedly • Block cache: SSTable blocks • Good for sequential reads • Random reads of different columns in the same hot row CS 5204 Fall 2007

  48. Speeding Up Tablet Recovery • Master moves a tablet from one tablet server to another • Source tablet server first does a minor compaction • Stops serving the tablet • Before unloading the tablet, it does another minor compaction (eliminate log that arrived between the two compactions) • After second minor compaction no log entry recovery is required CS 5204 Fall 2007

  49. Architecture of GFS [Diagram: the application asks the GFS client for (sample, chunk index 3); the client sends the file name and chunk index to the GFS master, whose chunk table maps Ch1 → (3 6 7), Ch2 → (2 9 8), Ch3 → (1 5 12), and gets back the handle and locations (ch3, {1 5 12}); it then issues Read(ch3, 4096) directly to a chunkserver and receives the data; chunkservers 1…n each store chunks on a local Linux FS and exchange heartbeat messages with the master] CS 5204 Fall 2007 Source: Jaishankar Sundararaman, GFS Presentation, CS 5204 Fall 2006

  50. A Sample Write Operation: Client Gets Metadata from Master [Diagram: the GFS client sends the file name and chunk index to the master and receives the chunk handle and the locations of the primary and secondary replicas] CS 5204 Fall 2007 Source: Jaishankar Sundararaman, GFS Presentation, CS 5204 Fall 2006
