Google Bigtable: A Distributed Storage System for Structured Data

Hadi Salimi,

Distributed Systems Laboratory,

School of Computer Engineering,

Iran University of Science and Technology

Introduction


  • Bigtable is a distributed storage system for managing structured data.

  • Scales to petabytes of data and thousands of machines.

  • Developed and in use at Google since 2005. Used for more than 60 Google products.

Data Model

  • (row:string, column:string, time:int64) => string

  • Rows, columns, and values are arbitrary strings; timestamps are 64-bit integers.

  • Every read or write of data under a single row key is atomic (regardless of the number of different columns being read or written in the row).

  • Columns are dynamically added.

  • Timestamps for different versions of data.

    • Assigned by Bigtable (current time) or explicitly by the client application.

    • Older versions are garbage-collected.

  • Example: Webtable, which stores a copy of crawled web pages, keyed by URL (a minimal sketch follows).
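
To make the model concrete, here is a minimal Python sketch of the logical map, using the paper's Webtable example. The dictionary layout and function names are illustrative, not Google's implementation:

    from collections import defaultdict

    # Bigtable's logical model: (row, column, timestamp) -> value.
    # Here each row maps a column key to {timestamp: value}.
    table = defaultdict(lambda: defaultdict(dict))

    def write(row, column, value, timestamp):
        """Mutations under one row key are atomic in Bigtable; a real
        implementation would serialize concurrent writers per row."""
        table[row][column][timestamp] = value

    def read(row, column):
        """Return the most recent version of a cell."""
        versions = table[row][column]
        return versions[max(versions)]  # highest timestamp wins

    # Webtable-style data: reversed URL as the row key.
    write("com.cnn.www", "contents:", "<html>...</html>", timestamp=3)
    write("com.cnn.www", "anchor:cnnsi.com", "CNN", timestamp=9)
    print(read("com.cnn.www", "contents:"))  # -> <html>...</html>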


  • Rows are sorted lexicographically.

  • Consecutive keys are grouped together as “tablets”.

    • Allows data locality.

    • Example: rows com.google.maps/index.html and com.google.maps/about (URLs with reversed hostnames) are likely to be in the same tablet.
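
A rough sketch of how sorted row keys shard into tablets; the boundary keys below are made up for illustration:

    import bisect

    # Each tablet covers a contiguous row range, identified here by its
    # end row key. These boundaries are invented for the example.
    tablet_end_keys = ["com.cnn.zz", "com.google.zz", "zz"]

    def tablet_for_row(row_key):
        """Rows are sorted lexicographically, so a binary search over the
        tablet end keys finds the tablet that serves a given row."""
        return bisect.bisect_left(tablet_end_keys, row_key)

    # Reversed-hostname keys keep one domain's pages in one tablet:
    assert tablet_for_row("com.google.maps/index.html") == \
           tablet_for_row("com.google.maps/about")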

Column Families

  • Column keys are grouped into sets called “column families”.

  • A column key is named using the syntax family:qualifier.

  • Access control and disk/memory accounting are at the column-family level.

  • Example: "anchor:cnnsi.com" (family anchor, qualifier cnnsi.com, a site that links to the row's page).

API

  • Data Design

    • Creating/deleting tables and column families

    • Changing cluster, table, and column-family metadata (e.g., access control rights)

  • Client Interactions

    • Write/Delete values

    • Read values

    • Scan row ranges

    • Single-row transactions (e.g., an atomic read-modify-write sequence on data under a row key)

  • MapReduce integration.

    • Bigtable can serve both as an input source and as an output target for MapReduce jobs.
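
The client interactions above might look like the following Python sketch, loosely modeled on the C++ API in the Bigtable paper; the class and method names here are hypothetical:

    class RowMutation:
        """Batches writes/deletes to one row; applied atomically."""
        def __init__(self, table, row_key):
            self.table, self.row_key, self.ops = table, row_key, []

        def set(self, column, value):
            self.ops.append(("set", column, value))

        def delete(self, column):
            self.ops.append(("delete", column))

        def apply(self):
            # A real client library would route this batch to the tablet
            # server that owns the row; atomicity is per row key.
            self.table.apply_row(self.row_key, self.ops)

    class Table:
        def __init__(self):
            self.rows = {}

        def apply_row(self, row_key, ops):
            row = self.rows.setdefault(row_key, {})
            for op, column, *value in ops:
                if op == "set":
                    row[column] = value[0]
                else:
                    row.pop(column, None)

        def scan(self, start_row, end_row):
            """Iterate a row range in sorted key order."""
            for key in sorted(self.rows):
                if start_row <= key < end_row:
                    yield key, self.rows[key]

    t = Table()
    m = RowMutation(t, "com.cnn.www")
    m.set("anchor:cnnsi.com", "CNN")
    m.apply()
    print(list(t.scan("com.a", "com.z")))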

Building Blocks

  • SSTable file: the data structure used for storage (a toy sketch follows after this list)

    • Maps keys to values

    • Ordered. Enables data locality for efficient writes/reads.

    • Immutable. On reads, no concurrency control needed. Need to garbage collect deleted data.

    • Stored in Google File System (GFS), and optionally can be mapped into memory.

      • GFS replicates the data for redundancy.

  • Chubby: Distributed lock service.

    • Stores the location of the root tablet, schema info, and access control lists

    • Used to discover tablet servers and to detect when they die
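
A toy model of the SSTable abstraction referenced above (illustrative only; real SSTables are block-structured GFS files with a block index):

    import bisect

    class SSTable:
        """Immutable, sorted key -> value map. Immutability means reads
        need no concurrency control; deleted data is reclaimed later by
        garbage collection during compaction."""
        def __init__(self, entries):
            items = sorted(entries.items())  # sorted once at build time
            self.keys = [k for k, _ in items]
            self.values = [v for _, v in items]

        def get(self, key):
            i = bisect.bisect_left(self.keys, key)
            if i < len(self.keys) and self.keys[i] == key:
                return self.values[i]
            return None

        def range(self, start, end):
            """The ordered layout makes a range read a contiguous slice."""
            lo = bisect.bisect_left(self.keys, start)
            hi = bisect.bisect_left(self.keys, end)
            return list(zip(self.keys[lo:hi], self.values[lo:hi]))

    sst = SSTable({"b": 2, "a": 1, "c": 3})
    print(sst.get("b"), sst.range("a", "c"))  # -> 2 [('a', 1), ('b', 2)]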


Implementation

3 major components:

  • Client library

  • Master Server (exactly 1).

    • Assigns tablets to tablet servers.

    • Detects the addition and expiration of tablet servers.

    • Balances tablet-server load.

    • Garbage-collects files in GFS.

    • Handles schema changes such as table and column-family creation.

  • Tablet Servers (multiple, dynamically added/removed)

    • Handle read and write requests for the tablets they have loaded.

    • Split tablets that have grown too large; each tablet stays around 100-200 MB.

Tablet Location

  • How does a client know which tablet server to route a request to?

  • A 3-level hierarchy:

    • A file in Chubby stores the location of the root tablet

    • The root tablet contains the locations of all METADATA tablets

    • METADATA tablets contain the locations of the user tablets

      • Row key: [tablet’s table ID] + [end row]

      • Value: [tablet server location]

  • Client library caches tablet locations.
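
A sketch of the 3-level lookup with client-side caching; every name and boundary key below is an illustrative stand-in:

    import bisect

    # Level 1: a Chubby file names the root tablet's server (implicit here).
    # Level 2: the root tablet maps METADATA row keys to METADATA servers.
    root_tablet = [("webtable+com.zz", "ts-2"), ("webtable+zz", "ts-3")]

    # Level 3: METADATA tablets map "<table id>+<end row>" to the servers
    # holding the user tablets.
    metadata = {
        "ts-2": [("webtable+com.dd", "ts-7"), ("webtable+com.zz", "ts-8")],
        "ts-3": [("webtable+zz", "ts-9")],
    }

    cache = {}  # the client library caches tablet locations

    def lookup(level, key):
        """Binary-search a sorted list of (end_row, location) pairs."""
        keys = [k for k, _ in level]
        return level[bisect.bisect_left(keys, key)][1]

    def locate(table_id, row):
        meta_row = f"{table_id}+{row}"
        if meta_row in cache:
            return cache[meta_row]            # cache hit: zero network hops
        meta_server = lookup(root_tablet, meta_row)            # hop 1: root
        user_server = lookup(metadata[meta_server], meta_row)  # hop 2: META
        cache[meta_row] = user_server
        return user_server

    print(locate("webtable", "com.cnn.www"))  # -> ts-7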

Tablet Assignment

  • The master keeps track of tablet assignments and live tablet servers

  • Chubby

    • On startup, a tablet server creates and locks a uniquely named file in Chubby.

    • A tablet server stops serving if it loses its lock.

    • The master periodically checks each tablet server; if a server stops responding, the master tries to acquire that server’s lock and, on success, un-assigns the server’s tablets (sketched below).

    • A master failure does not change tablet assignments.

  • Master restart: a new master acquires the master lock in Chubby, scans Chubby to find live tablet servers, asks each server for its current tablet assignments, and scans the METADATA table to detect unassigned tablets.
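
A toy model of the Chubby-lock liveness protocol described above; ChubbyCell and all identifiers are illustrative stand-ins:

    class ChubbyCell:
        """Stands in for Chubby: exclusive locks on named files."""
        def __init__(self):
            self.locks = {}

        def try_lock(self, path, owner):
            if self.locks.get(path, owner) != owner:
                return False  # someone else holds the lock
            self.locks[path] = owner
            return True

        def release(self, path, owner):
            if self.locks.get(path) == owner:
                del self.locks[path]

    chubby = ChubbyCell()

    # A tablet server locks its uniquely named file on startup.
    assert chubby.try_lock("/servers/ts-42", owner="ts-42")

    def master_check(server_id, responding):
        if responding:
            return "ok"
        # Server unresponsive: if the master can grab the lock, Chubby is
        # alive but the server is not, so its tablets get reassigned.
        if chubby.try_lock(f"/servers/{server_id}", owner="master"):
            return "reassign tablets"
        return "server still holds its lock"

    chubby.release("/servers/ts-42", owner="ts-42")  # server dies
    print(master_check("ts-42", responding=False))   # -> reassign tablets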

Tablet Serving


Write path:

  • Check that the request is well-formed.

  • Check authorization against an access-control list stored in a Chubby file.

  • Append the mutation to the “tablet log” (a redo log for recovery after failure).

  • Apply the mutation to the memtable (an in-memory buffer).

  • Separately, “compaction” moves memtable data into SSTables and truncates the tablet log.


Read path:

  • Check that the request is well-formed.

  • Check authorization against the Chubby file.

  • Merge the memtable and the SSTables to find the data.

  • Return the data.
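
Both paths in a minimal sketch; the structures are simplified stand-ins (the real log and SSTables live in GFS):

    tablet_log = []   # append-only redo log
    memtable = {}     # recent writes, held in RAM
    sstables = [{"a": "old-a", "b": "old-b"}]  # immutable, already flushed

    def write(key, value):
        tablet_log.append((key, value))  # 1. log first, for crash recovery
        memtable[key] = value            # 2. then apply to the memtable

    def read(key):
        # The memtable holds the newest data, so consult it before the
        # SSTables; a full implementation merges versions by timestamp.
        if key in memtable:
            return memtable[key]
        for sst in reversed(sstables):   # newest SSTable first
            if key in sst:
                return sst[key]
        return None

    write("a", "new-a")
    print(read("a"), read("b"))  # -> new-a old-b  (merged view)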


To bound the sizes of the memtable, the tablet log, and the SSTable files, “compactions” are used (see the sketch after this list).

  • Minor compaction: converts the memtable into a new SSTable and truncates the tablet log.

  • Merging compaction: merges a few SSTables and the memtable into a single new SSTable.

  • Major compaction: rewrites all SSTables into exactly one, removing deleted data.
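
A sketch of a merging/major compaction over sorted runs; DELETED is an illustrative stand-in for a deletion marker:

    import heapq

    DELETED = object()  # tombstone marker

    def compact(runs, major=False):
        """Merge sorted (key, value) runs, newest run first. Among equal
        keys the newest value wins; a major compaction also drops
        tombstones, since no older SSTable remains to resurrect data."""
        merged, last_key = [], None
        # heapq.merge streams the runs; it is stable, so for equal keys
        # the entry from the earlier (newer) run arrives first.
        for key, value in heapq.merge(*runs, key=lambda kv: kv[0]):
            if key == last_key:
                continue                  # an older version: superseded
            last_key = key
            if major and value is DELETED:
                continue                  # discard the tombstone itself
            merged.append((key, value))
        return merged

    runs = [
        [("a", 1), ("b", DELETED)],  # newer run
        [("b", 2), ("c", 3)],        # older run
    ]
    print(compact(runs, major=True))  # -> [('a', 1), ('c', 3)]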


  • Locality group.

    • Clients can group multiple column families into a locality group; each locality group is stored as a separate SSTable, so reads of one group do not touch the others.

  • Compression.

    • Clients can choose a compression scheme per locality group.

  • Two-level caching in tablet servers

    • Scan cache (key/value pairs)

    • Block cache (SSTable blocks read from GFS)

  • Bloom filter

    • An efficient check of whether an SSTable contains any data for a given row/column pair, sparing a disk access when it does not (see the sketch after this list).

  • Commit log implementation

    • Each tablet server has a single commit log (not one per tablet), which avoids many concurrent file writes to GFS.
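
A small Bloom filter sketch; the size and hashing scheme are illustrative choices:

    import hashlib

    class BloomFilter:
        """May answer "maybe present" for absent keys (false positives)
        but never "absent" for present ones, so a negative answer lets a
        read skip the SSTable without touching disk."""
        def __init__(self, size_bits=1024, num_hashes=3):
            self.size, self.k, self.bits = size_bits, num_hashes, 0

        def _positions(self, key):
            for i in range(self.k):
                digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.size

        def add(self, key):
            for pos in self._positions(key):
                self.bits |= 1 << pos

        def might_contain(self, key):
            return all(self.bits & (1 << pos)
                       for pos in self._positions(key))

    bf = BloomFilter()
    bf.add("com.cnn.www/anchor:cnnsi.com")  # row/column pair in an SSTable
    print(bf.might_contain("com.cnn.www/anchor:cnnsi.com"))  # True
    print(bf.might_contain("com.example/contents:"))  # False (w.h.p.)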

Performance Evaluation

  • Random reads are slowest: each one must fetch an SSTable block from disk through GFS.

  • Writes are faster than reads: the commit log is append-only, while reads must merge SSTables with the memtable.

  • Scans are fastest: a single RPC returns many values, amortizing per-request overhead.

Performance Evaluation: Scaling

  • Scaling is not linear, but aggregate throughput holds up well to 250 tablet servers.

  • Random reads scale worst: shipping SSTable blocks across the network for every read saturates the network.


Conclusion

  • Satisfies its goals of highly available, high-performance, massively scalable data storage.

  • Its API has been used successfully by a wide range of Google products (more than 60).

  • Additional features in progress:

    • Secondary indexes.

    • Cross-data-center replication.

    • Deployment as a hosted service.

  • Advantages of building a custom system:

    • Significant flexibility from designing their own data model.

    • Can remove bottlenecks and inefficiencies as they arise.

Bigtable Family Tree

Non-relational databases (HBase, Cassandra, MongoDB, etc.) borrow ideas from Bigtable:

  • Column-oriented data model.

  • Multi-level storage (commit log, RAM table, SSTable)

  • Tablet management (assignment, splitting, recovery, GC, Bloom filters)

Google technologies and their open-source equivalents:

  • GFS => Hadoop Distributed File System (HDFS)

  • Chubby => Apache ZooKeeper

  • MapReduce => Apache Hadoop MapReduce