
Introduction to Distributed Storage Systems

Harry Xu, CS 239, Fall 2019


Presentation Transcript


  1. Introduction to Distributed Storage Systems
     Harry Xu, CS 239, Fall 2019

  2. Problems and Challenges
     • Extremely large amounts of data are available these days
       • FB Social: 721M vertices, 68.7B edges in May 2011
       • Google Maps: 20 petabytes of data
     • Where to put them?
       • Single machine? Servers?
     • How can we enable applications to easily access them?
       • What interfaces do they provide?
     • What guarantees do they provide?
     • How can we enable applications to access these data efficiently?
       • What is the right architecture (e.g., master+slave or peer-to-peer)?
     • What if a machine crashes?

  3. Solution: Distributed Storage Systems
     • Where to put them?
       • On a cluster of commodity servers
     • How to enable applications to easily access them?
       • Depending on data types (e.g., files, structured data, or unstructured data)
       • Standard interfaces
     • What guarantees do they provide?
       • Consistency guarantees
     • What if a machine crashes?
       • Fault tolerance: replication + quick recovery
       • Consistency between replicas
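The "replication + quick recovery" idea on this slide can be illustrated with a minimal sketch. It assumes a toy in-memory store in which every write goes to all live replicas and a crashed server recovers by copying state from a live peer. The names `Replica`, `ReplicatedStore`, `crash`, and `recover` are hypothetical and do not correspond to any system named in the slides.

```python
class Replica:
    """One storage server holding a full copy of the data (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True


class ReplicatedStore:
    """Toy store: writes go to every live replica, reads use any live one."""

    def __init__(self, names):
        self.replicas = [Replica(n) for n in names]

    def _live(self):
        live = [r for r in self.replicas if r.alive]
        if not live:
            raise RuntimeError("no live replica available")
        return live

    def put(self, key, value):
        # Keep replicas consistent by applying the write to all live copies.
        for r in self._live():
            r.data[key] = value

    def get(self, key):
        # Any live replica can serve the read.
        return self._live()[0].data[key]

    def crash(self, name):
        for r in self.replicas:
            if r.name == name:
                r.alive = False
                r.data = {}  # simulate losing local state

    def recover(self, name):
        # Recovery: copy the current state from a live peer, then rejoin.
        source = self._live()[0]
        for r in self.replicas:
            if r.name == name:
                r.data = dict(source.data)
                r.alive = True


store = ReplicatedStore(["s1", "s2", "s3"])
store.put("k", "v1")
store.crash("s1")                      # one machine fails
store.put("k", "v2")                   # writes continue on the survivors
value_during_failure = store.get("k")  # reads are still served
store.recover("s1")                    # s1 catches up from a live peer
```

Real systems must also handle concurrent writers, partial failures during a write, and network partitions, all of which this sketch ignores.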

  4. Three Different Kinds of Systems
     • Distributed File Systems
       • HDFS (Yahoo)
       • GFS (Google)
     • Distributed Structured Data Storage Systems (a.k.a. databases)
       • Bigtable (wide-column DB)
       • Spanner (NewSQL DB)
     • A mix of both
       • Azure

  5. Distributed File Systems
     • HDFS
       • One "metadata" server (NameNode) and a set of DataNodes
       • A file is divided into blocks, and each block has several replicas on different DataNodes
       • File operations are recorded in journals, which are replayed to restore a consistent state after a failure
       • Supports a wide variety of applications, including Hadoop and everything built on top of Hadoop
     • GFS
       • Uses a similar architecture
       • Replicates both file chunks and namespaces
       • Uses checksums to detect data corruption
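As a rough illustration of the points on this slide, the sketch below splits a file into fixed-size blocks, assigns each block to several distinct DataNodes, and uses a checksum to detect a corrupted block. The block size, the round-robin placement policy, and the function names are assumptions made for the example; they are not the actual placement or checksum schemes of HDFS or GFS.

```python
import zlib

BLOCK_SIZE = 4    # bytes; tiny for illustration, real block sizes are far larger
REPLICATION = 3   # assumed number of replicas per block


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Divide a file's bytes into consecutive fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assumed round-robin policy: each block goes to `replication` distinct nodes."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return {
        i: [datanodes[(i + k) % len(datanodes)] for k in range(replication)]
        for i in range(num_blocks)
    }


def checksum(block):
    """CRC32 stands in here for whatever checksum a real system uses."""
    return zlib.crc32(block)


def is_corrupt(block, expected_checksum):
    """A block is considered corrupt if its checksum no longer matches."""
    return checksum(block) != expected_checksum


data = b"hello distributed storage"
blocks = split_into_blocks(data)
checksums = [checksum(b) for b in blocks]   # recorded when the block is written
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])

damaged = b"jell"                           # first block with one byte flipped
first_block_ok = not is_corrupt(blocks[0], checksums[0])
damage_detected = is_corrupt(damaged, checksums[0])
```

When a replica fails its checksum, a reader can fetch the same block from one of the other DataNodes listed in `placement`, which is why replicas are placed on different machines.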

  6. Data Storage Systems (NoSQL Databases)
     • Bigtable
       • Built on top of GFS; available as part of Google Cloud Platform
       • It is a map (a wide-column store) from a row key and a column key to a byte array
       • Designed to scale to petabyte-size data
       • Each table has multiple dimensions and is divided into many small tablets for better integration with GFS
       • No notion of transactions
     • Spanner
       • A "NewSQL" database supporting externally consistent transactions
     • Windows Azure Storage (WAS)
       • Supports strong consistency and various types of data
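The Bigtable data model described on this slide can be sketched as a map from (row key, column key) to bytes, partitioned into tablets by row-key range. The class `TinyWideColumnTable`, its fixed split keys, and the example keys are hypothetical; the sketch omits further dimensions such as timestamps and does not reflect Bigtable's real API.

```python
import bisect


class TinyWideColumnTable:
    """Toy wide-column table: (row key, column key) -> bytes, split into tablets."""

    def __init__(self, split_keys):
        # Sorted row keys that mark tablet boundaries (assumed fixed up front).
        self.split_keys = sorted(split_keys)
        # One dict per tablet; n split keys give n + 1 row ranges.
        self.tablets = [dict() for _ in range(len(self.split_keys) + 1)]

    def tablet_index(self, row_key):
        # Binary search for the row range that contains this row key.
        return bisect.bisect_right(self.split_keys, row_key)

    def put(self, row_key, column_key, value):
        self.tablets[self.tablet_index(row_key)][(row_key, column_key)] = value

    def get(self, row_key, column_key):
        return self.tablets[self.tablet_index(row_key)].get((row_key, column_key))

    def read_row(self, row_key):
        # All columns of one row live in the same tablet.
        tablet = self.tablets[self.tablet_index(row_key)]
        return {c: v for (r, c), v in tablet.items() if r == row_key}


table = TinyWideColumnTable(split_keys=["m"])
table.put("com.example/index", "contents:html", b"<html>")
table.put("com.example/index", "anchor:home", b"Home")
table.put("org.example/about", "contents:html", b"<p>About</p>")

page = table.get("com.example/index", "contents:html")
row = table.read_row("com.example/index")
```

Because each row falls into exactly one row range, a single-row read touches only one tablet, while different tablets can be served by different machines.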
