Scalable Structured Data Storage for Web 2.0

Michael Armbrust, David Zhu, Barret Rhoden

Objectives
• Provide a framework for Web applications to scale to YouTube or MySpace sizes
• Use an alternative data store more suitable for typical Web workloads (HBase)
• Willing to trade consistency for scalability and availability
• Integrate with Ruby on Rails to give developers a clean interface to the data store
• Use declarative constructs from Rails to express the application's needs from the data store

State of the Art
Current status of data storage for Web applications:
• Large relational databases running on expensive hardware
• Manual horizontal and vertical partitioning of data
• Requires redesign at each scaling milestone
Other work and differences:
• C-Store/Vertica: Maintains full SQL semantics
• Dynamo: Optimized for Amazon's writes
• PNUTS: Hosted service, work in progress

Our Idea
• Use a large-scale distributed database suitable for Web applications
  - Relaxed consistency, no ad-hoc queries
  - Can run on 1000+ shared-nothing commodity servers
• Interface with an ActiveRecord-like layer in Ruby on Rails
  - Provides simple relationships and consistency guarantees between models: has_many, belongs_to, searchable_by (for full-text search)
  - Pre-compute joins for quick reads
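The declarative relationships and pre-computed joins named above could look roughly like the sketch below. It is only an illustration under stated assumptions: `ToyStore` is a hypothetical in-memory stand-in for the distributed store (not the HBase client API), `ToyModel` is a hypothetical ActiveRecord-like base class (not the real ActiveRecord), and the `inverse:` option is invented for the example. `searchable_by` is left out. The point shown is that saving a child also writes a denormalized copy onto the parent's row, so reading `has_many` children is a single key lookup rather than a join.

```ruby
# Hypothetical in-memory stand-in for a key-value store such as HBase.
# Rows are addressed by table name and row key only; no ad-hoc queries.
class ToyStore
  def initialize
    @tables = Hash.new { |tables, name| tables[name] = {} }
  end

  def put(table, key, row)
    @tables[table][key] = row
  end

  def get(table, key)
    @tables[table][key]
  end
end

# Hypothetical ActiveRecord-like base class (an assumption, not the real API).
class ToyModel
  class << self
    attr_accessor :store

    # Table name derived from the class name, e.g. User -> "user".
    def table
      name.split("::").last.downcase
    end

    # Parents declared with belongs_to, as [parent, inverse] pairs.
    def parents
      @parents ||= []
    end

    # Declares a parent and the name of the denormalized list kept on its row.
    def belongs_to(parent, inverse:)
      parents << [parent, inverse]
      define_method(parent) do
        ToyModel.store.get(parent.to_s, attrs[:"#{parent}_id"])
      end
    end

    # Reads children from the list stored on the parent's own row.
    def has_many(children)
      define_method(children) do
        row = ToyModel.store.get(self.class.table, attrs[:id]) || {}
        row.fetch(children, [])
      end
    end
  end

  attr_reader :attrs

  def initialize(attrs)
    @attrs = attrs
  end

  # Writes the row, then appends a copy to each parent's precomputed list.
  # Simplification: saving the same child twice would append it twice.
  def save
    store = ToyModel.store
    existing = store.get(self.class.table, attrs[:id]) || {}
    store.put(self.class.table, attrs[:id], existing.merge(attrs))
    self.class.parents.each do |parent, inverse|
      parent_row = store.get(parent.to_s, attrs[:"#{parent}_id"])
      next unless parent_row # parent not saved yet; nothing to update
      (parent_row[inverse] ||= []) << attrs.dup
    end
    self
  end
end

class User < ToyModel
  has_many :posts
end

class Post < ToyModel
  belongs_to :user, inverse: :posts
end

ToyModel.store = ToyStore.new
alice = User.new(id: "u1", name: "alice").save
Post.new(id: "p1", user_id: "u1", title: "Hello").save
Post.new(id: "p2", user_id: "u1", title: "Again").save

titles = alice.posts.map { |post| post[:title] }
puts titles.inspect # => ["Hello", "Again"]
```

The trade-off in this sketch matches the relaxed-consistency stance of the slides: writes do extra work and the copies on the parent row can go stale, in exchange for reads that touch one row.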

Risks
• HBase performance: HBase is under development and may have implementation problems
• Rails scaling: once we successfully remove the data store bottleneck from Rails, we may discover unknown bottlenecks at the Web application processing layer

Plan
Workload
• Simple app with simple workload
• Complicated app: joins, sessions, access locality
• Full-fledged app
• Possibly use Sun's Rails benchmark in addition to our workload
ActiveRecord
• Talk to HBase
• Single lookup, no joins
• Three basic joins
• Validations and prefetching
Data Store
• Scalability of HBase
• Determine comparisons with other stores
• Define data layout
• Indexing options
• Hard off-line queries
Key:
- Week 8
- Week 10 (Mid-Course)
- Week 14 (End)
