Google Bigtable: A Distributed Storage System for Structured Data



Hadi Salimi, Distributed Systems Laboratory, School of Computer Engineering, Iran University of Science and Technology, [email protected]



Introduction
  • Bigtable is a distributed storage system for managing structured data.
  • Scales to petabytes of data and thousands of machines.
  • Developed and in use at Google since 2005. Used by more than 60 Google products.
Data Model
  • (row, column, time) => string
  • Row, column, value are arbitrary strings.
  • Every read or write of data under a single row key is atomic (regardless of the number of different columns being read or written in the row).
  • Columns are dynamically added.
  • Timestamps for different versions of data.
    • Assigned by Bigtable or by the client application.
    • Older versions are garbage-collected.
  • Example: Web map
  • Rows are sorted lexicographically.
  • Consecutive keys are grouped together as “tablets”.
    • Allows data locality.
    • Example: rows with adjacent keys (such as keys sharing a long common prefix) are likely to be in the same tablet.
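
The data model above can be pictured as a sorted, multi-versioned map. The following is a minimal in-memory sketch, not Bigtable's implementation: the class name `VersionedTable`, the `max_versions` garbage-collection policy, and the row keys in the usage note are all assumptions made for illustration.

```python
from collections import defaultdict


class VersionedTable:
    """Toy model of the (row, column, time) => string map."""

    def __init__(self, max_versions=3):
        # Hypothetical GC policy: keep only the newest N versions per cell.
        self.max_versions = max_versions
        # row key -> column key -> list of (timestamp, value), newest first
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, timestamp, value):
        cells = self._rows[row][column]
        cells.append((timestamp, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)
        del cells[self.max_versions:]  # garbage-collect older versions

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (newest overall if None)."""
        for ts, value in self._rows.get(row, {}).get(column, []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan(self, start_row, end_row):
        """Yield (row, latest column values) for start_row <= row < end_row, in key order."""
        for row in sorted(self._rows):
            if start_row <= row < end_row:
                yield row, {col: cells[0][1] for col, cells in self._rows[row].items()}
```

For example, after three `put` calls on the same cell with `max_versions=2`, `get` returns the newest value and the oldest version is no longer readable. Because `scan` walks keys in lexicographic order, rows with a shared prefix come out next to each other, which is the property tablets rely on for locality.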
Column Families
  • Column keys are grouped into sets called “column families”.
  • Column key is named using syntax: family:qualifier
  • Access control and disk/memory accounting are at column family level
  • Example: “”
API
  • Data Design
    • Creating/deleting tables and column families
    • Changing cluster, table and column family metadata like access control rights
  • Client Interactions
    • Write/Delete values
    • Read values
    • Scan row ranges
    • Single-row transactions (e.g., read/modify/write sequence for data under a row key)
  • Map/Reduce integration.
    • Read from Bigtable; write to Bigtable.
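
To make the `family:qualifier` naming and the single-row transaction concrete, here is a small sketch. It is not the Bigtable client API: `split_column_key`, `Row`, and `read_modify_write` are hypothetical names, and a per-row lock is only one assumed way to obtain per-row atomicity.

```python
import threading


def split_column_key(column_key):
    """Split 'family:qualifier' into its two parts (the qualifier may be empty)."""
    if ":" not in column_key:
        raise ValueError(f"expected 'family:qualifier', got {column_key!r}")
    family, _, qualifier = column_key.partition(":")
    return family, qualifier


class Row:
    """One row whose reads and writes are serialized by a per-row lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cells = {}  # column key -> value

    def read(self, column_key):
        with self._lock:
            return self._cells.get(column_key)

    def read_modify_write(self, column_key, update):
        """Atomically replace a cell with update(old_value) and return the new value."""
        with self._lock:
            new_value = update(self._cells.get(column_key))
            self._cells[column_key] = new_value
            return new_value
```

A counter kept in a cell such as `"stats:hits"` (an illustrative column key) can then be incremented from several threads with `row.read_modify_write("stats:hits", lambda v: (v or 0) + 1)` without losing updates, because the read and the write happen under the same row lock.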
Building Blocks
  • SSTable file: Data structure for storage
    • Maps keys to values
    • Ordered. Enables data locality for efficient writes/reads.
    • Immutable. Reads need no concurrency control, but deleted data must be garbage-collected.
    • Stored in Google File System (GFS); optionally can be mapped into memory.
      • GFS replicates data for redundancy.
  • Chubby: Distributed lock service.
    • Stores the location of the root tablet, schema info, and access control lists.
    • Used to synchronize and to detect tablet servers.
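
The SSTable properties listed above (ordered, immutable, key-to-value) can be sketched with a sorted tuple and binary search. This is an in-memory illustration only; the real on-disk format (blocks, block index) is not shown, and the class and method names are assumptions.

```python
import bisect


class SSTable:
    """Immutable, sorted key -> value map (in-memory sketch)."""

    def __init__(self, items):
        pairs = sorted(dict(items).items())
        # Tuples make the immutability explicit: there is no put() or delete().
        self._keys = tuple(k for k, _ in pairs)
        self._values = tuple(v for _, v in pairs)

    def get(self, key, default=None):
        """Point lookup by binary search over the sorted keys."""
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return default

    def scan(self, start, end):
        """Return all (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return list(zip(self._keys[lo:hi], self._values[lo:hi]))
```

Because the table never changes after construction, concurrent readers need no locking, and a range scan touches one contiguous slice of keys.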

Three components:

  • Client library
  • Master server (exactly one).
    • Assigns tablets to tablet servers.
    • Detects the addition and expiration of tablet servers.
    • Balances tablet-server load.
    • Garbage-collects GFS files.
    • Handles schema changes such as table and column family creations.
  • Tablet servers (multiple, dynamically added/removed)
    • Each handles read and write requests to the tablets that it has loaded.
    • Each splits tablets that have grown too large. Each tablet is about 100-200 MB.
Tablet Location
  • How does a client know which node to route a request to?
  • 3-level hierarchy
    • One file in Chubby holds the location of the root tablet.
    • The root tablet contains the locations of the metadata tablets.
    • The metadata tablets contain the locations of user tablets.
      • Row key: [Tablet’s Table ID] + [End Row]
      • Value: [Node ID]
  • Client library caches tablet locations.
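
The three-level lookup can be sketched as three searches over sorted (end key, target) lists. Everything named here is hypothetical: the Chubby file path, the tablet and node names, and the `LocationClient` class. The cache is also simplified to exact (table, row) keys, whereas a real client would cache whole tablet ranges.

```python
import bisect


def find_by_end_key(entries, key):
    """entries: sorted list of (end_key, target). Return the target of the
    first entry whose end_key >= key, i.e. the range that covers key."""
    end_keys = [end_key for end_key, _ in entries]
    i = bisect.bisect_left(end_keys, key)
    if i == len(entries):
        raise KeyError(key)
    return entries[i][1]


class LocationClient:
    def __init__(self, chubby, tablets):
        self._chubby = chubby    # file path -> name of the root tablet
        self._tablets = tablets  # tablet name -> sorted (end_key, target) entries
        self._cache = {}         # (table_id, row_key) -> node
        self.walks = 0           # number of full hierarchy walks performed

    def locate(self, table_id, row_key):
        key = (table_id, row_key)  # mirrors [Table ID] + [End Row]
        if key not in self._cache:
            self.walks += 1
            root = self._chubby["/bigtable/root-location"]          # level 1
            meta = find_by_end_key(self._tablets[root], key)        # level 2
            self._cache[key] = find_by_end_key(self._tablets[meta], key)  # level 3
        return self._cache[key]
```

With a root tablet pointing at two metadata tablets, a lookup for a row walks Chubby, the root tablet, and one metadata tablet once; a repeated lookup for the same row is answered from the cache without touching any of them.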
Tablet Assignment
  • Master keeps track of tablet assignment and live servers
  • Chubby
    • Each tablet server creates and locks a unique file.
    • A tablet server stops serving if it loses its lock.
    • Master periodically checks tablet servers. If a check fails, the master tries to lock the server’s file; if it succeeds, it unassigns that server’s tablets.
    • Master failure does not change tablet assignments.
  • Master restart
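
The lock-based liveness check can be illustrated with a toy lock service. This is a sketch under heavy assumptions: `LockService` is a stand-in for Chubby with none of its session or lease semantics, and the `/servers/<name>` paths and the `Master` class are hypothetical.

```python
class LockService:
    """Toy stand-in for Chubby: file path -> current lock holder."""

    def __init__(self):
        self._holders = {}

    def try_lock(self, path, owner):
        # Succeeds if the file is unlocked or already held by this owner.
        if self._holders.get(path) in (None, owner):
            self._holders[path] = owner
            return True
        return False

    def release(self, path, owner):
        if self._holders.get(path) == owner:
            del self._holders[path]


class Master:
    def __init__(self, locks):
        self.locks = locks
        self.assignments = {}  # tablet -> tablet server

    def check_server(self, server, responds):
        """Return the tablets unassigned from `server` (empty if it is still live)."""
        if responds:
            return []
        if not self.locks.try_lock(f"/servers/{server}", "master"):
            return []  # the server still holds its lock, so leave its tablets alone
        orphaned = sorted(t for t, s in self.assignments.items() if s == server)
        for tablet in orphaned:
            del self.assignments[tablet]
        return orphaned
```

The point of taking the lock before unassigning is that the master only acts once the lock service confirms the server no longer holds its file, so a server that is merely slow to answer keeps its tablets.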
Tablet Serving


Write operation:

  • Check well-formedness of request.
  • Check authorization in Chubby file.
  • Write to “tablet log” (i.e., a transaction log for “redo” in case of failure).
  • Write to memtable (RAM).
  • Separately, “compaction” moves memtable data to an SSTable and truncates the tablet log.


Read operation:

  • Check well-formedness of request.
  • Check authorization in Chubby file.
  • Merge memtable and SSTables to find data.
  • Return data.
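
The two request paths above can be sketched in a few lines. This is a simplified model, not the tablet server's code: plain dicts stand in for the memtable and SSTables, the `authorized` set stands in for the access check against a Chubby file, and the `Tablet` class and its method names are assumptions.

```python
class Tablet:
    def __init__(self, authorized):
        self.authorized = set(authorized)  # stand-in for the ACL kept in Chubby
        self.log = []        # tablet log: append-only redo records
        self.memtable = {}   # recent writes, held in RAM
        self.sstables = []   # older data as dicts, oldest first

    def _check(self, user, key):
        if not isinstance(key, str) or not key:
            raise ValueError("malformed request")  # well-formedness check
        if user not in self.authorized:
            raise PermissionError(user)            # authorization check

    def write(self, user, key, value):
        self._check(user, key)
        self.log.append((key, value))  # log first, so the write can be redone
        self.memtable[key] = value

    def read(self, user, key):
        self._check(user, key)
        if key in self.memtable:       # newest data wins
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]
        return None

    def recover(self):
        """Rebuild the memtable after a crash by replaying the tablet log."""
        self.memtable = {}
        for key, value in self.log:
            self.memtable[key] = value
```

Writing to the log before the memtable is what makes recovery possible: if the in-memory state is lost, `recover` replays the log and reads return the same values as before.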

“Compaction” is used to control the size of the memtable, tablet log, and SSTable files.

  • Minor Compaction. Move data from memtable to a new SSTable. Truncate tablet log.
  • Merging Compaction. Merge multiple SSTables and the memtable into a single SSTable.
  • Major Compaction. Rewrite all SSTables into exactly one SSTable, removing deleted data.
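
The three compaction types can be sketched as pure functions over dicts. This is an illustration under assumptions: dicts stand in for SSTables (oldest first in each list), and deletions are modeled with a `TOMBSTONE` marker, a name chosen here rather than taken from the slides.

```python
TOMBSTONE = object()  # marker written in place of a deleted value


def minor_compaction(memtable, log, sstables):
    """Freeze the memtable as a new SSTable and truncate the tablet log.
    Returns (new memtable, new log, new SSTable list)."""
    if memtable:
        sstables = sstables + [dict(sorted(memtable.items()))]
    return {}, [], sstables


def merging_compaction(sstables, memtable):
    """Merge some SSTables plus the memtable into one SSTable.
    Tombstones are kept: older SSTables outside this merge may still hold the data."""
    merged = {}
    for sstable in sstables:  # later (newer) tables override earlier ones
        merged.update(sstable)
    merged.update(memtable)
    return dict(sorted(merged.items()))


def major_compaction(sstables):
    """Rewrite all SSTables into one, dropping deleted entries for good."""
    merged = {}
    for sstable in sstables:
        merged.update(sstable)
    return {k: v for k, v in sorted(merged.items()) if v is not TOMBSTONE}
```

Only the major compaction may discard tombstones, because it is the only one guaranteed to see every SSTable that could still contain the deleted value.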
  • Locality group.
    • Client can group multiple column families into a locality group. Enables more efficient reads since each locality group is a separate SSTable.
  • Compression.
    • Client can choose to compress at locality group level.
  • Two-level caching in tablet servers
    • Scan cache (key/value pairs)
    • Block cache (SSTable blocks read from GFS)
  • Bloom filter
    • Efficiently checks whether an SSTable might contain data for a row/column pair.
  • Commit log implementation
    • Each tablet server has a single commit log (not one-per-tablet).
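
A Bloom filter answers "definitely not present" or "possibly present", which lets a read skip SSTables that cannot hold the requested row/column pair. The sketch below is a generic Bloom filter, not Bloom filter code from Bigtable: the sizes, the SHA-256-based hashing, and the key format in the usage note are assumptions.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self._bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self._bits |= 1 << pos

    def might_contain(self, key):
        """False means definitely absent; True means possibly present."""
        return all((self._bits >> pos) & 1 for pos in self._positions(key))
```

A tablet server could keep one such filter per SSTable, add a key like `"row1/anchor:x"` for every stored cell, and consult `might_contain` before reading the SSTable from disk. False positives cost one unnecessary read; false negatives cannot occur.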
Performance Evaluation
  • Random reads are slowest. Need to access SSTable block from disk.
  • Writes are faster than reads. Commit log is append-only. Reads require merging of SSTables and memtable.
  • Scans reduce number of read operations.
Performance Evaluation: Scaling
  • Throughput does not scale linearly, but scaling remains good up to 250 tablet servers.
  • Random reads scale worst because block transfers saturate the network.
Conclusions
  • Satisfies the goals of high-availability, high-performance, massively scalable data storage.
  • API. Successfully used by various Google products (>60).
  • Additional features in progress:
    • Secondary indexes
    • Cross data center replication.
    • Deploy as a hosted service.
  • Advantages of the custom development:
    • Significant flexibility due to own data model.
    • Can remove bottlenecks and inefficiencies as they arise.
Bigtable Family Tree

Non-relational DBs (HBase, Cassandra, MongoDB, etc.)

  • Column-oriented data model.
  • Multi-level storage (commit log, RAM table, SSTable)
  • Tablet management (assignment, splitting, recovery, GC, Bloom filters)

Google related technologies and open-source equivalents

  • GFS => Hadoop Distributed File System (HDFS)
  • Chubby => ZooKeeper
  • MapReduce => Hadoop MapReduce