Dive into Distributed Storage Systems: Challenges and Solutions

Introduction to Distributed Storage Systems
Harry Xu
CS 239, Fall 2019

Problems and Challenges
• Extremely large amounts of data are available these days
  • Facebook social graph: 721M vertices, 68.7B edges in May 2011
  • Google Maps: 20 petabytes of data
• Where to put them?
  • Single machine? Servers?
• How can we enable applications to easily access them?
  • What interfaces do storage systems provide?
  • What guarantees do they provide?
• How can we enable applications to efficiently access these data?
  • What should be the right architecture (e.g., master+slave, peer-to-peer, etc.)?
• What if a machine crashes?

Solution: Distributed Storage Systems
• Where to put them?
  • On a cluster of commodity servers
• How to enable applications to easily access them?
  • Depends on the data type (e.g., files, structured data, or unstructured data)
  • Standard interfaces
• What guarantees do they provide?
  • Consistency guarantees
• What if a machine crashes?
  • Fault tolerance: replication + quick recovery
  • Consistency between replicas

Three Different Kinds of Systems
• Distributed File Systems
  • HDFS -- Yahoo
  • GFS -- Google
• Distributed Structured Data Storage Systems (a.k.a., databases)
  • Bigtable (wide column DB)
  • Spanner (NewSQL DB)
• A mix of both
  • Azure

Distributed File Systems
• HDFS
  • One “metadata” server (NameNode) and a set of DataNodes
  • A file is divided into blocks, and each block has several replicas on different DataNodes
  • File operations are recorded in a journal, which is replayed to restore a consistent state after a failure
  • Supports a wide variety of applications, including Hadoop and everything built on top of Hadoop
• GFS
  • Uses a similar architecture
  • Replicates both file chunks and namespaces
  • Uses checksums to detect data corruption
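To make the block, replica, and checksum ideas above concrete, here is a minimal Python sketch. It is not how HDFS or GFS is implemented: the names `split_into_blocks`, `place_replicas`, and `is_corrupt` are hypothetical, the block size is shrunk to a few bytes for readability, CRC32 stands in for whatever checksum a real system uses, and the round-robin placement is a simplification chosen only to guarantee that replicas of a block land on different nodes.

```python
import zlib
from dataclasses import dataclass
from typing import List

BLOCK_SIZE = 4    # bytes; deliberately tiny (assumption, real block sizes are far larger)
REPLICATION = 3   # replicas per block (assumption for this sketch)


@dataclass
class Block:
    index: int      # position of the block within the file
    data: bytes     # block contents
    checksum: int   # CRC32 of the contents, recorded at write time


def split_into_blocks(payload: bytes, block_size: int = BLOCK_SIZE) -> List[Block]:
    """Cut a file's bytes into fixed-size blocks; the last block may be shorter."""
    blocks = []
    for offset in range(0, len(payload), block_size):
        chunk = payload[offset:offset + block_size]
        blocks.append(Block(offset // block_size, chunk, zlib.crc32(chunk)))
    return blocks


def place_replicas(block_index: int, nodes: List[str],
                   replication: int = REPLICATION) -> List[str]:
    """Pick `replication` distinct nodes for a block (simple round-robin, hypothetical policy)."""
    if replication > len(nodes):
        raise ValueError("not enough nodes to hold every replica on a different node")
    start = block_index % len(nodes)
    return [nodes[(start + k) % len(nodes)] for k in range(replication)]


def is_corrupt(block: Block, stored: bytes) -> bool:
    """Compare the checksum of the bytes read back against the one recorded at write time."""
    return zlib.crc32(stored) != block.checksum
```

A reader would call `split_into_blocks` on a file's contents, ask `place_replicas` where each block should live, and run `is_corrupt` on every read; when a replica fails the check, the data can be fetched from one of the other replicas instead.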

Data Storage Systems (NoSQL Databases)
• Bigtable
  • Built on top of GFS, available as part of Google Cloud Platform
  • It is a map (or a wide column store) that maps a row key, a column key, and a timestamp to a byte array
  • Designed to scale to petabyte-size data
  • Each table is a multidimensional map and is divided by row range into many small tablets for better integration with GFS
  • No notion of general transactions (only single-row operations are atomic)
• Spanner
  • A “NewSQL” database supporting externally consistent transactions
• Windows Azure Storage (WAS)
  • Supports strong consistency and various types of data
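The Bigtable data model described above can be sketched in a few lines of Python. This is an illustrative toy, not Bigtable's API: the class `TinyTable` and its methods are hypothetical, the timestamp dimension of the real data model is omitted for brevity, and everything lives in memory rather than on GFS. It shows two ideas only: cells addressed by (row key, column key) holding byte arrays, and a table split into tablets by row-key range.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple


class TinyTable:
    """Toy wide-column table: (row key, column key) -> bytes, partitioned into tablets."""

    def __init__(self, split_points: List[str]):
        # Sorted row keys at which the table is cut. With split points [s0, s1],
        # tablet 0 holds rows < s0, tablet 1 holds s0 <= row < s1, tablet 2 holds rows >= s1.
        self.split_points = sorted(split_points)
        self.tablets: List[Dict[Tuple[str, str], bytes]] = [
            {} for _ in range(len(self.split_points) + 1)
        ]

    def tablet_for(self, row_key: str) -> int:
        """Find the tablet whose row range contains row_key (binary search)."""
        return bisect_right(self.split_points, row_key)

    def put(self, row_key: str, column_key: str, value: bytes) -> None:
        self.tablets[self.tablet_for(row_key)][(row_key, column_key)] = value

    def get(self, row_key: str, column_key: str) -> Optional[bytes]:
        return self.tablets[self.tablet_for(row_key)].get((row_key, column_key))

    def scan_row(self, row_key: str) -> Dict[str, bytes]:
        """Return every column stored for one row; rows may have different columns."""
        tablet = self.tablets[self.tablet_for(row_key)]
        return {col: val for (row, col), val in tablet.items() if row == row_key}
```

Because a row always falls in exactly one tablet, all of its columns are stored together, so a single-row read or write touches one tablet only. An operation spanning rows in different tablets would need coordination between tablets, which this sketch does not attempt.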
