
Distributed Systems

Tutorial 9 – Windows Azure Storage

Written by Alex Libov

Based on the SOSP 2011 presentation

Winter semester, 2011–2012



Windows Azure Storage (WAS)

  • A scalable cloud storage system

  • In production since November 2008

  • Used inside Microsoft for applications such as:

    • social networking search; serving video, music, and game content; managing medical records; and more

  • Thousands of customers outside Microsoft

  • Anyone can sign up over the Internet to use the system.



WAS Abstractions

  • Blobs – File system in the cloud

  • Tables – Massively scalable structured storage

  • Queues – Reliable storage and delivery of messages

  • A common usage pattern: incoming and outgoing data are shipped via Blobs, Queues provide the overall workflow for processing the Blobs, and intermediate service state and final results are kept in Tables or Blobs (a sketch follows below).
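
To make this pattern concrete, here is a minimal worker-loop sketch. The queue, blob, and table clients and their method names are hypothetical stand-ins (not the real WAS/Azure client API); the sketch only illustrates the division of labor between the three abstractions.

```python
# Hypothetical clients standing in for the Queue, Blob and Table services;
# the real client libraries and method names differ.
def process_forever(queue, blobs, table):
    while True:
        msg = queue.get_message()              # work item naming an input blob
        if msg is None:
            continue                           # nothing to do right now
        data = blobs.download(msg.blob_name)   # incoming data arrives as a blob
        result = transform(data)               # application-specific processing
        blobs.upload(msg.blob_name + ".out", result)         # final result as a blob
        table.insert("jobs", {"PartitionKey": msg.job_id,    # state/summary in a table
                              "RowKey": msg.blob_name,
                              "status": "done"})
        queue.delete_message(msg)              # only after the results are durable

def transform(data):
    return data                                # placeholder for real processing
```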



Design goals

  • Highly Available with Strong Consistency

    • Provide access to data in the face of failures/partitioning

  • Durability

    • Replicate data several times within and across data centers

  • Scalability

    • Need to scale to exabytes and beyond

    • Provide a global namespace to access data around the world

    • Automatically load balance data to meet peak traffic demands



Global Partitioned Namespace

  • http(s)://AccountName.<service>.core.windows.net/PartitionName/ObjectName

  • <service> can be a blob, table or queue.

  • AccountName is the customer-selected account name for accessing storage.

    • The Account name specifies the data center where the data is stored.

    • An application may use multiple AccountNames to store its data across different locations.

  • PartitionName locates the data once a request reaches the storage cluster

  • When a PartitionName holds many objects, the ObjectName identifies individual objects within that partition

    • The system supports atomic transactions across objects with the same PartitionName value

    • The ObjectName is optional since, for some types of data, the PartitionName uniquely identifies the object within the account.
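
A small helper makes the naming scheme concrete; this is just a sketch, and the account, partition, and object names in the example are made up.

```python
def was_url(account, service, partition_name, object_name=None, https=True):
    """Build a WAS URL of the form
    http(s)://AccountName.<service>.core.windows.net/PartitionName/ObjectName."""
    assert service in ("blob", "table", "queue")
    scheme = "https" if https else "http"
    url = f"{scheme}://{account}.{service}.core.windows.net/{partition_name}"
    if object_name is not None:     # ObjectName is optional
        url += f"/{object_name}"
    return url

# e.g. blob "video1.mp4" in container "movies" of a made-up account "contoso":
print(was_url("contoso", "blob", "movies", "video1.mp4"))
# -> https://contoso.blob.core.windows.net/movies/video1.mp4
```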



Storage Stamps

  • A storage stamp is a cluster of N racks of storage nodes.

  • Each rack is built out as a separate fault domain with redundant networking and power.

  • Clusters typically range from 10 to 20 racks with 18 disk-heavy storage nodes per rack.

  • The first generation storage stamps hold approximately 2PB of raw storage each.

  • The next generation stamps hold up to 30PB of raw storage each.



High Level Architecture

Access blob storage via the URL: http://<account>.blob.core.windows.net/

[Diagram: the Storage Location Service directs accounts to storage stamps; data access goes through a load balancer (LB) in front of each stamp. Each storage stamp is layered into Front-Ends, a Partition Layer, and a Stream Layer; intra-stamp replication runs inside each stamp's Stream Layer, while inter-stamp (geo) replication runs between the Partition Layers of different stamps.]



Storage Stamp Architecture – Stream Layer

  • Append-only distributed file system

  • All data from the Partition Layer is stored into files (extents) in the Stream layer

  • An extent is replicated 3 times across different fault and upgrade domains

    • With random selection for where to place replicas

  • Checksum all stored data

    • Verified on every client read

  • Re-replicate on disk/node/rack failure or checksum mismatch

[Diagram: the Stream Layer (a distributed file system) has a small set of Paxos-replicated Stream Masters (M) coordinating many Extent Nodes (EN).]


Storage Stamp Architecture – Partition Layer

  • Provide transaction semantics and strong consistency for Blobs, Tables and Queues

  • Stores and reads the objects to/from extents in the Stream layer

  • Provides inter-stamp (geo) replication by shipping logs to other stamps

  • Scalable object index via partitioning

[Diagram: the Partition Layer consists of a Partition Master, a Lock Service, and many Partition Servers.]


Storage Stamp Architecture – Front End Layer

  • Stateless Servers

  • Authentication + authorization

  • Request routing



Storage Stamp Architecture

[Diagram: an incoming write request enters through the Front-End layer (stateless FE servers), is routed to the owning Partition Server in the Partition Layer (coordinated by the Partition Master and a Lock Service), and is persisted in the Stream Layer (Paxos-replicated Stream Masters plus Extent Nodes) before the ack flows back to the client.]


Partition Layer – Scalable Object Index

  • Hundreds of billions of blobs, entities, and messages across all accounts can be stored in a single stamp

    • Need to efficiently enumerate, query, get, and update them

    • Traffic pattern can be highly dynamic

      • Hot objects, peak load, traffic bursts, etc

  • Need a scalable index for the objects that can

    • Spread the index across 100s of servers

    • Dynamically load balance

      • Dynamically change what servers are serving each part of the index based on load



Scalable Object Index via Partitioning

  • Partition Layer maintains an internal Object Index Table for each data abstraction

    • Blob Index: contains all blob objects for all accounts in a stamp

    • Table Entity Index: contains all table entities for all accounts in a stamp

    • Queue Message Index: contains all messages for all accounts in a stamp

  • Scalability is provided for each Object Index

    • Monitor load to each part of the index to determine hot spots

    • Index is dynamically split into thousands of Index RangePartitions based on load

    • Index RangePartitions are automatically load balanced across servers to quickly adapt to changes in load
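
A minimal sketch of the split decision described above; the load metric, threshold, and split-key choice are illustrative assumptions, not the production policy.

```python
def maybe_split(range_partitions, load, split_threshold):
    """range_partitions: list of (low_key, high_key) tuples.
    load: observed request rate per RangePartition."""
    hot = max(range_partitions, key=lambda r: load[r])
    if load[hot] < split_threshold:
        return range_partitions                   # nothing is hot enough to split
    low, high = hot
    mid = pick_split_key(hot)                     # a PartitionKey boundary inside the range
    rest = [r for r in range_partitions if r != hot]
    return rest + [(low, mid), (mid, high)]       # two RangePartitions replace the hot one

def pick_split_key(range_partition):
    # Placeholder: in practice the split key comes from the observed key/load
    # distribution inside the RangePartition, not from the bounds alone.
    low, high = range_partition
    return chr((ord(low[0]) + ord(high[0])) // 2)
```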



Partition Layer – Index Range Partitioning

Blob Index

  • Split index into RangePartitions based on load

  • Split at PartitionKey boundaries

  • PartitionMap tracks Index RangePartition assignment to partition servers

  • Front-End caches the PartitionMap to route user requests

  • Each part of the index is assigned to only one Partition Server at a time

[Diagram: inside a storage stamp, the Partition Master maintains the Partition Map (e.g. A–H → PS1, H'–R → PS2, R'–Z → PS3) and assigns each index RangePartition to exactly one Partition Server; Front-End servers cache the Partition Map to route each request to the Partition Server currently serving that key range.]


Partition Layer – RangePartition

  • A RangePartition uses a Log-Structured Merge-Tree to maintain its persistent data.

  • A RangePartition consists of its own set of streams in the Stream Layer, and those streams belong solely to that RangePartition

  • Metadata Stream – The metadata stream is the root stream for a RangePartition.

    • The PM assigns a partition to a PS by providing the name of the RangePartition’s metadata stream

  • Commit Log Stream – A commit log that stores the recent insert, update, and delete operations applied to the RangePartition since the last checkpoint was generated.

  • Row Data Stream – Stores the checkpoint row data and index for the RangePartition.
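
A compressed sketch of how the three streams cooperate in the LSM-style write path; the stream handles and their append/length methods are hypothetical placeholders.

```python
class RangePartition:
    """Sketch of the LSM-style write path; commit_log, row_data and metadata
    are hypothetical stream handles, not the real stream-layer API."""

    def __init__(self, commit_log, row_data, metadata):
        self.commit_log = commit_log   # recent inserts/updates/deletes since the last checkpoint
        self.row_data = row_data       # checkpointed row data + index
        self.metadata = metadata       # root stream: records where the other streams stand
        self.memtable = {}             # in-memory view of the un-checkpointed rows

    def write(self, key, row):
        self.commit_log.append(("put", key, row))   # make the mutation durable first ...
        self.memtable[key] = row                    # ... then make it visible in memory

    def checkpoint(self):
        # Fold the memtable into a new checkpoint in the row data stream and note
        # it in the metadata stream so the commit log prefix can be discarded.
        self.row_data.append(("checkpoint", dict(self.memtable)))
        self.metadata.append(("checkpointed-up-to", self.commit_log.length()))
        self.memtable.clear()
```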



Stream Layer

  • Append-Only Distributed File System

  • Streams are very large files

    • Has a file-system-like directory namespace

  • Stream Operations

    • Open, Close, Delete Streams

    • Rename Streams

    • Concatenate Streams together

    • Append for writing

    • Random reads
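
The operations above boil down to a small append-only interface. The sketch below is an in-memory stand-in to show the shape of that interface, not the real stream-layer API.

```python
class StreamLayer:
    """In-memory stand-in for the append-only stream interface."""

    def __init__(self):
        self.streams = {}                          # stream name -> list of records

    def open(self, name):
        return self.streams.setdefault(name, [])

    def delete(self, name):
        del self.streams[name]

    def rename(self, old, new):
        self.streams[new] = self.streams.pop(old)

    def concatenate(self, dst, src):
        self.streams[dst].extend(self.streams.pop(src))   # splice the extent lists

    def append(self, name, record):
        self.streams[name].append(record)
        return len(self.streams[name]) - 1         # offset of the appended record

    def read(self, name, offset):
        return self.streams[name][offset]          # random read at a known offset
```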



Stream Layer Concepts

[Diagram: stream //foo/myfile.data is an ordered list of pointers to extents E1–E4; extents E1–E3 are sealed, the last extent E4 is unsealed, and each extent is a sequence of blocks.]

Stream

  • Hierarchical namespace

  • Ordered list of pointers to extents

  • Append/Concatenate

Extent

  • Unit of replication

  • Sequence of blocks

  • Size limit (e.g. 1GB)

  • Sealed/unsealed

Block

  • Min unit of write/read

  • Checksum

  • Up to N bytes (e.g. 4MB)
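
The three concepts map naturally onto a small data model; the field names and size limits below are illustrative.

```python
from dataclasses import dataclass, field
from typing import List
import zlib

@dataclass
class Block:                        # min unit of read/write, carries a checksum
    data: bytes
    crc: int = 0
    def __post_init__(self):
        assert len(self.data) <= 4 * 2**20   # e.g. up to 4MB
        self.crc = zlib.crc32(self.data)

@dataclass
class Extent:                       # unit of replication: a sequence of blocks
    blocks: List[Block] = field(default_factory=list)
    sealed: bool = False            # once sealed, an extent never changes again
    def append(self, block: Block):
        assert not self.sealed
        self.blocks.append(block)

@dataclass
class Stream:                       # ordered list of pointers to extents
    name: str                       # hierarchical name, e.g. "//foo/myfile.data"
    extents: List[Extent] = field(default_factory=list)
    def last_extent(self) -> Extent:
        return self.extents[-1]     # only the last extent is unsealed and appendable
```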



Creating an Extent

[Diagram: to create a stream/extent, the Partition Layer asks the Stream Master (SM, replicated via Paxos); the SM allocates the extent's replica set, e.g. EN1 as primary with EN2 and EN3 as secondaries, and returns the assignment to the Partition Layer.]


Replication Flow

[Diagram: the Partition Layer sends an append to the primary EN1; EN1 forwards the data to secondaries EN2 and EN3 and returns the ack to the Partition Layer once all replicas have written it. The Paxos-replicated SM only tracks the replica set (EN1 primary; EN2, EN3 secondary) and is not on the data path.]



Providing Bit-wise Identical Replicas

  • Want all replicas for an extent to be bit-wise the same, up to a committed length

    • Want to store pointers from the partition layer index to an extent+offset

    • Want to be able to read from any replica

  • Replication flow

    • All appends to an extent go to the Primary

    • Primary orders all incoming appends and picks the offset for the append in the extent

    • Primary then forwards offset and data to secondaries

    • Primary performs in-order acks back to clients for extent appends

      • Primary returns the offset of the append in the extent

      • An append offset can be acknowledged back to the client once that offset and all prior offsets have been completely written on all replicas

      • This represents the committed length of the extent
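
A sketch of the primary's role in this flow; replica transport and failure handling are omitted, and the method names are placeholders.

```python
class PrimaryExtentReplica:
    """Sketch: the primary orders appends, forwards them to the secondaries, and
    only acks an offset once it and every earlier offset are written everywhere."""

    def __init__(self, secondaries):
        self.secondaries = secondaries   # hypothetical replica handles
        self.next_offset = 0
        self.committed_length = 0

    def append(self, data):
        offset = self.next_offset        # primary picks the offset for this append
        self.next_offset += len(data)
        self.write_local(offset, data)
        for s in self.secondaries:       # forward offset + data to each secondary
            s.write(offset, data)
        # Every replica has now written this offset and all prior offsets, so the
        # committed length advances and the offset is acked (in order) to the client.
        self.committed_length = offset + len(data)
        return offset

    def write_local(self, offset, data):
        pass                             # placeholder for the local disk write
```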



Dealing with Write Failures

[Diagram: stream //foo/myfile.dat after an append failure: the failed extent E4 is sealed and a new extent E5 is allocated and appended to the stream's extent list.]

Failure during append

  • Ack from primary lost when going back to partition layer

    • Retry from partition layer can cause multiple blocks to be appended (duplicate records)

  • Unresponsive/Unreachable Extent Node (EN)

    • Append will not be acked back to partition layer

    • Seal the failed extent

    • Allocate a new extent and append immediately
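
The recovery path in the last two bullets can be sketched as follows; the stream, extent, and SM calls are hypothetical placeholders.

```python
def append_with_recovery(stream, sm, record, max_attempts=2):
    """Sketch: on an append failure, seal the current extent, get a fresh extent
    from the SM, and retry the append there immediately."""
    for _ in range(max_attempts):
        extent = stream.last_extent()
        try:
            return extent.append(record)              # normal case: offset returned
        except IOError:
            sm.seal(extent)                           # the failed extent is sealed as-is
            stream.add_extent(sm.allocate_extent())   # new extent, new replica set
            # Note: if only the ack was lost, the record may now exist twice
            # (once in the sealed extent, once after the retry); readers have to
            # tolerate such duplicate records.
    raise IOError("append failed after sealing and retrying")
```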



Extent Sealing (Scenario 1)

[Diagram: an append fails because one EN is unreachable, so the extent must be sealed. The Stream Master asks the ENs it can reach for their current length; both report 120, so the extent is sealed at commit length 120 and the Partition Layer is told the sealed length.]


Extent Sealing (Scenario 1)

[Diagram: the EN that was unreachable during sealing later syncs with the SM, learns that the extent was sealed at 120, and brings its replica up to exactly that commit length.]



Extent Sealing (Scenario 2)

[Diagram: again an append fails and the extent must be sealed, but this time the reachable ENs report different lengths, 120 and 100 (the longer replica holds an append that was never acknowledged to the Partition Layer). The Stream Master seals the extent at the smallest reported commit length, 100.]


Extent Sealing (Scenario 2)

[Diagram: the EN that was unreachable during sealing later syncs with the SM, learns the sealed commit length of 100, and its replica is brought to exactly 100, discarding any data beyond that offset (which was never acknowledged to the client).]
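
Both scenarios follow the same rule: the Stream Master seals at the smallest commit length reported by the replicas it can reach, which can never lose an acknowledged append (an append is only acked once every replica has written it). A sketch, with placeholder EN/SM interfaces:

```python
def seal_extent(sm, extent, replicas):
    """Sketch of the Stream Master's sealing decision; EN/SM calls are placeholders."""
    lengths = []
    for en in replicas:
        try:
            lengths.append(en.current_length(extent))   # ask each EN for its commit length
        except ConnectionError:
            continue                                    # unreachable ENs are skipped
    sealed_length = min(lengths)    # scenario 1: [120, 120] -> sealed at 120
                                    # scenario 2: [120, 100] -> sealed at 100
    sm.record_sealed_length(extent, sealed_length)
    # A replica that was unreachable during sealing later syncs with the SM and is
    # brought to exactly this length (in scenario 2, data past offset 100, which
    # was never acknowledged to the client, is discarded).
    return sealed_length
```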



Providing Consistency for Data Streams

[Diagram: a network partition leaves the Partition Server able to talk to EN3 (Secondary B) while the SM cannot. It is still safe for the Partition Server to read from EN3.]

  • For data streams, the Partition Layer only reads from offsets returned by successful appends

    • Committed on all replicas

    • Row and Blob Data Streams

  • Offset valid on any replica



Providing Consistency for Log Streams

[Diagram: under the same network partition (the Partition Server can talk to EN3, the SM cannot), the Partition Server checks the commit length of the replicas before loading; the SM seals the extent, and only EN1 and EN2 are used for loading.]

  • Logs are used on partition load

    • Commit and Metadata log streams

  • Check commit length first

  • Only read from:

    • an unsealed replica, if all replicas have the same commit length

    • a sealed replica
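
That read rule can be sketched as follows; the replica and SM interfaces are placeholders.

```python
def load_from_log(extent, replicas, sm):
    """Sketch: only read an unsealed log extent if every replica agrees on its
    commit length; otherwise seal it and read from a sealed replica."""
    lengths = [r.commit_length(extent) for r in replicas if r.reachable()]
    if not extent.sealed and len(lengths) == len(replicas) and len(set(lengths)) == 1:
        return replicas[0].read(extent, 0, lengths[0])    # all replicas agree: safe to read
    sealed_length = sm.seal(extent)                       # disagreement or missing replica
    sealed_replica = next(r for r in replicas if r.reachable())   # one of the sealed replicas
    return sealed_replica.read(extent, 0, sealed_length)  # read only up to the sealed length
```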



Summary

  • Highly Available Cloud Storage with Strong Consistency

  • Scalable data abstractions to build your applications

    • Blobs – Files and large objects

    • Tables – Massively scalable structured storage

    • Queues – Reliable delivery of messages

  • More information at: http://www.sigops.org/sosp/sosp11/current/2011-Cascais/11-calder-online.pdf

