PRACTI Replication: Towards a Unified Theory of Replication

Nalini Belaramani, Mike Dahlin, Lei Gao, Amol Nayate, Arun Venkatramani, Praveen Yalangandula, Jiandan Zheng

University of Texas at Austin

2nd August 2006

Server-Replication

Bayou [Terry et al 95]

  • All servers have full set of data
  • Nodes exchange updates made since previous synchronization
  • Any server can exchange updates with any other server
  • Eventually, nodes will agree on the order of updates to the data


Client-Server Model

Coda [Kistler et al 92]

  • Data is cached on client machines
  • Callbacks are established for notification of changes
  • Clients can get updates only from the server


File System for PlanetLab

  • Data is replicated on geographically distributed nodes
  • Updates need to be propagated from node to node
  • Strong consistency may need to be maintained, depending on the application
  • Some file systems assume complete connectivity among nodes
Personal File System

Data at multiple locations

  • Desktop, server, laptop, PDA, colleague's laptop

Desirable properties

  • Download updates only for the data I want
  • Do not necessarily have to connect to the server for updates
  • Some consistency guarantee
See a Similarity?

They are all data replication systems

  • Data is replicated on multiple nodes

They differ in . . .

  • How much data is replicated at each node
  • Who each node talks to
  • What consistency to guarantee

So many existing replication systems

  • 14 systems in SOSP/OSDI in the last 10 years
  • New application or new domain -> build a system from scratch
  • Need characteristics from different systems -> build a system from scratch
Motivation

What if we have a toolkit?

  • Supports mechanisms required for replication systems
  • Mix and match mechanisms to build system for your requirements
  • Pay for what you need.

We will have

  • A way to build better replication systems
  • A better way to build replication systems
Our Work

3 properties to characterize replication systems

  • PR – Partial Replication
  • AC – Arbitrary Consistency
  • TI – Topology Independence

Mechanisms to support the above properties

  • PRACTI prototype
  • Subsumes existing replication systems
  • Better trade-offs

Policy elegantly characterized

  • Policy as topology
    • Concise declarative rules + configuration parameters
Grand Challenge

[Bar chart: synchronization time in seconds for Full Replication, Client-Server, and PRACTI across four scenarios (Plane, Hotel, Home, Office). Full Replication and Client-Server take roughly 66 to 81 seconds, with one configuration never completing; PRACTI takes roughly 1.6 to 5.1 seconds.]

How can I convince you?

  • Better tradeoffs
  • Build 14 OSDI/SOSP systems on prototype
    • With less than 1000 lines of code each
Outline

PRACTI Taxonomy

Achieving Practi

  • PRACTI prototype
  • Evaluation

Making Policy Easier

  • Building on PRACTI
  • Policy as topology

Ongoing and Future Work


Practi Taxonomy

Characterizing Replication Systems

PRACTI Taxonomy

  • Partial Replication (PR): replicate any subset of data to any node
  • Arbitrary Consistency (AC): support the consistency requirements of the application
  • Topology Independence (TI): any node can communicate with any other node

PRACTI Taxonomy

Existing systems provide at most two of the three properties:

  • PR + TI: DHTs (e.g. CFS, PAST), object replication (e.g. Ficus, Pangaea)
  • PR + AC: hierarchy and client/server systems (e.g. Coda, Hier-AFS)
  • AC + TI: server replication (e.g. Bayou, TACT)

PRACTI Taxonomy

  • PRACTI provides all three properties (Partial Replication, Arbitrary Consistency, and Topology Independence), covering the point in the design space that the systems above leave open.

Why is PRACTI Hard?

[Figure: a timeline in which nodes replicate different subsets of objects (project/module/a, project/module/b, ..., project/module/z). One node writes module A and then module B; another node reads module B and then module A. With partial replication, reads must still see a consistent view even on nodes that do not replicate everything that was written.]


Achieving Practi

Practi Prototype

Evaluation

Peer to Peer Log Exchange [Patterson 97]

Write = <objId, acceptStamp, BODY>

Log exchanges for updates

  • <objID, timestamp, body>
  • Order of updates is maintained

[Figure: Node A and Node B, each with a log and a checkpoint, exchanging their logs pairwise.]
Peer-to-Peer Log Exchange

[Figure: four nodes exchanging logs pairwise in an arbitrary topology.]

Log exchanges for updates

  • TI: Pairwise exchange with any peer
  • AC: Careful ordering of updates in logs
    • Prefix property, causal/eventual consistency
    • Broad range of consistency [Yu and Vahdat 2002]
  • PR: not achieved; all nodes store all data and see all updates
Separate Data and Metadata Paths

Example: a write is recorded as <foo, <10, A>, body>; the invalidation <foo, <10, A>> propagates in the log exchange, and the body <foo, <10, A>, body> is sent only to a node that reads foo.

Log exchange:

  • Ordered streams of metadata (invalidations)
    • Invalidation : <object, timestamp>
  • All nodes see all invalidations (logically)

Checkpoints track which objects are VALID

Nodes receive only bodies of interest

[Figure: Node A and Node B, each with a log and a checkpoint; invalidations flow on the log exchange and bodies are sent separately.]
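A minimal Python sketch (hypothetical names) of the separation described above: every invalidation <objId, timestamp> is applied and marks the object INVALID in the checkpoint, while a body makes the object VALID again only if it matches the newest invalidation seen; bodies for objects a node does not care about simply never arrive.

# Sketch: metadata (invalidations) goes everywhere, bodies only where needed.
class Replica:
    def __init__(self):
        self.newest = {}   # obj_id -> newest timestamp seen (from invalidations)
        self.valid = {}    # obj_id -> (timestamp, body), present only if VALID

    def recv_inval(self, obj_id, ts):
        # all nodes (logically) see all invalidations
        if ts > self.newest.get(obj_id, (0, "")):
            self.newest[obj_id] = ts
            self.valid.pop(obj_id, None)         # mark INVALID

    def recv_body(self, obj_id, ts, body):
        # only subscribed or demand-read bodies arrive; assumes the matching
        # invalidation was delivered first on the ordered metadata stream
        if ts == self.newest.get(obj_id):
            self.valid[obj_id] = (ts, body)      # object becomes VALID

    def read(self, obj_id):
        if obj_id not in self.valid:
            raise KeyError("INVALID: must demand-read the body first")
        return self.valid[obj_id][1]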

Separate Data and Metadata Paths

[Figure: four nodes; invalidation streams flow among all of them, while bodies flow only where they are needed.]

Separation of data and metadata paths:

  • TI: Pairwise exchange with any peer
  • AC: Careful ordering of updates in logs
  • PR: Partial replication of bodies

Full replication of invalidations

Summarize Unneeded Metadata

Imprecise invalidation

  • Summary of group of invalidations
  • <objectSet, [start]*, [end]*>
  • “One or more objects in objectSet were modified between start time and end time”

Conservative summary

  • ObjectSet may include superset of the targets
  • Compact encoding of large number of invalidations
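A minimal Python sketch (hypothetical names, scalar timestamps instead of the per-node start/end vectors above) of collapsing a run of precise invalidations into one conservative imprecise invalidation:

# Sketch: summarize precise invalidations into one imprecise invalidation.
import os
from collections import namedtuple

PreciseInval   = namedtuple("PreciseInval",   "obj_id start end")
ImpreciseInval = namedtuple("ImpreciseInval", "obj_set start end")

def summarize(invals):
    # "one or more objects in obj_set were modified between start and end";
    # obj_set is coarsened to directories, so it may be a superset of the targets
    obj_set = frozenset(os.path.dirname(i.obj_id) + "/*" for i in invals)
    return ImpreciseInval(obj_set,
                          min(i.start for i in invals),
                          max(i.end for i in invals))

ii = summarize([PreciseInval("/proj/a/1", 10, 10),
                PreciseInval("/proj/a/2", 11, 11),
                PreciseInval("/proj/b/3", 13, 13)])
# ii.obj_set == {'/proj/a/*', '/proj/b/*'}, ii.start == 10, ii.end == 13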
Summarize Unneeded Metadata (2)

Example: a node subscribed to "green" objects receives the precise invalidation PI:<green, <10, A>> but only an imprecise invalidation II:<non-green, <11, A>, <13, A>> summarizing the updates it is not interested in.

Imprecise invalidations act as “placeholders”

  • In log and checkpoint
  • Receiver knows that it is missing information
  • Receiver blocks operations that depend on missing information

[Figure: two nodes with logs and checkpoints; one subscribes to the other for "green" objects and receives imprecise invalidations as placeholders for everything else.]

Summarize Unneeded Metadata (3)

[Figure: four nodes; both invalidation streams and bodies now flow only where they are needed.]

Summarize unneeded metadata:

  • TI: Pairwise exchange with any peer
  • AC: Careful ordering of updates in logs
  • PR: Partial replication of bodies

Partial replication of invalidations

Summary of Approach

[Figure: four nodes connected by invalidation streams and body transfers.]

3 key ideas

  • Peer-to-Peer log exchange
  • Separation of data and metadata paths
  • Summarize unneeded metadata
Why is this better?

How to evaluate?

  • Compare with
    • AC-TI server replication (e.g., Bayou, TACT)
    • PR-AC client-server (e.g., Coda, NFS)
    • PR-TI object replication (e.g., Ficus, Pangaea)
  • Key question
    • Does system provide significant advantages?

Prototype benchmarking

  • Java + Berkeley DB
PRACTI v. Client/Server v. Full Replication

[Figure: network topology for the experiment. Devices in the hotel are connected by links of 0 Mb/s, 50 Kb/s, and 1 Mb/s, and reach other machines on 10 Mb/s links across the Internet.]

Synchronization Time

[Bar chart: synchronization time in seconds for Full Replication, Client-Server, and PRACTI across four scenarios (Plane, Hotel, Home, Office). Full Replication and Client-Server take roughly 66 to 81 seconds, with one configuration never completing; PRACTI takes roughly 1.6 to 5.1 seconds.]

Palmtop <-> Laptop

  • Client-server (e.g., Coda)
      • Limited by network to server – Not an attractive solution
  • Full Replication (e.g., Bayou)
      • Limited by fraction of shared data – Not a feasible solution
  • PRACTI:
      • Up to order of magnitude better – Does what you want!

Making Policy Easier

Building on Practi

Policy as Topology

Practi as a toolkit

Practi Prototype

  • Provides all 3 properties
  • Subsumes existing replication systems
  • Gives you the mechanisms
  • Implement policy over PRACTI for different systems

[Figure: layered architecture. Policies for Bayou, Coda, a PlanetLab FS, a personal FS, and others sit above the PRACTI prototype, which provides the mechanisms.]

System Overview

Core – mechanisms

  • Asynchronous message passing

Controller - policy

[Figure: a PRACTI node. The PRACTI core exposes a local interface (Read(), Write(), Delete()) and a controller interface; the controller receives events from the core and sends requests to the local core and to remote cores; cores exchange inval streams and body streams with other nodes.]

PRACTI Basics

Subscription Streams

  • 2 types of streams – Inval streams and body streams
  • Every stream is associated with a subscription set
  • Received Invals and bodies are forwarded to appropriate outgoing streams

Controller

  • Implements the policy
    • Whom to establish subscriptions with
    • What to do on a read miss
    • Whom to send updates to (see the sketch below)
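A minimal Python sketch (hypothetical names) of the subscription-stream idea above: each outgoing stream carries a subscription set, and the core forwards each received invalidation or body onto every outgoing stream whose subscription set matches it.

# Sketch: forwarding invals and bodies onto matching outgoing streams.
import fnmatch

class OutgoingStream:
    def __init__(self, peer, subscription_set):
        self.peer = peer
        self.subscription_set = subscription_set          # e.g. ["/proj/a/*"]

    def matches(self, obj_id):
        return any(fnmatch.fnmatch(obj_id, p) for p in self.subscription_set)

    def send(self, kind, msg):
        print("->", self.peer, kind, msg)                  # stand-in for the network

class Core:
    def __init__(self):
        self.inval_streams, self.body_streams = [], []

    def recv_inval(self, obj_id, ts):
        for s in self.inval_streams:
            if s.matches(obj_id):
                s.send("INVAL", (obj_id, ts))

    def recv_body(self, obj_id, ts, body):
        for s in self.body_streams:
            if s.matches(obj_id):
                s.send("BODY", (obj_id, ts, body))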
Controller Interface

Notification of key events

  • Stream begin/end
  • Invalidation arrival
  • Body arrival
  • Local read miss
  • Became Precise
  • Became Imprecise

Directs communication among cores

  • Subscribe to inval or body stream
  • Request a demand read of a body

Local housekeeping

  • Log garbage collection
  • Cache replacement
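A minimal Python sketch (all method names are illustrative, not the prototype's actual API) of how a controller might plug into the interface above: the core calls the notification methods, and the controller reacts by invoking mechanisms on the local or remote cores.

# Sketch of a controller built against the event/request interface above.
class SimpleController:
    def __init__(self, core, server):
        self.core = core          # local core exposing the mechanisms
        self.server = server      # a designated peer, for a client-server flavor

    # notifications of key events, called by the core
    def on_stream_begin(self, peer, kind): pass
    def on_stream_end(self, peer, kind): pass
    def on_inval_arrival(self, obj_id, ts): pass
    def on_body_arrival(self, obj_id, ts): pass
    def on_became_precise(self, obj_set): pass
    def on_became_imprecise(self, obj_set): pass

    def on_local_read_miss(self, obj_id):
        # direct communication among cores: demand-read the body and
        # subscribe to invalidations for the object (a callback, Coda-style)
        self.core.demand_read(self.server, obj_id)
        self.core.add_inval_subscription(self.server, obj_id)

    # local housekeeping
    def on_log_growth(self, log_size):
        if log_size > 10_000:
            self.core.garbage_collect_log()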
Not all that simple yet

Need to take care of

  • Stream management, timeouts, etc.

In some systems

  • Arrival of a body or inval may require special processing
  • Read misses occur and need to be dealt with
  • The replication set is based on priorities or access patterns

Policy - 39 methods to do magic

  • Can we make it easier?

Policy as Topology

Characterizing Policy Elegantly

Policy & Topology

Overlay topology question:

  • Among all the nodes I could be connected to, whom do I communicate with?

Replication policy questions:

  • If data is not available locally, whom do I contact?
  • If data is locally updated
    • Whom do I send updates to?
    • Whom do I send invalidations to?
  • Whom do I prefetch from?

~Topology

  • Replication Set
  • Consistency Semantics

~Configuration Parameters

Policy Revisited

Policy now separated into several dimensions

  • Propagation of updates -> Topology
    • if there are updates, or if I have a local read miss, who do I contact?
  • Consistency requirements -> Local interface
    • Whether we can read stale/invalid data. How stale?
  • Replication of data -> config file
    • What subset of data does each node have
  • Other policy essentials -> config file
    • How long is timeout?
    • How many times to retry?
    • How often do I GC logs?
    • How much storage do I have?
    • Conflict resolution?
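A minimal sketch (Python, purely illustrative names and values, not the prototype's actual configuration format) of the config-file dimension of policy just listed:

# Sketch: the policy choices that reduce to configuration parameters.
policy_config = {
    "replication_set": ["/home/nalini/*", "/proj/practi/*"],  # what this node stores
    "read_consistency": "causal",       # what the local interface enforces on reads
    "stream_timeout_s": 30,             # how long before a stream is declared dead
    "reconnect_retries": 5,             # how many times to retry a failed peer
    "log_gc_interval_s": 600,           # how often to garbage-collect the log
    "storage_limit_mb": 512,            # cache size before replacement kicks in
    "conflict_resolution": "last-writer-wins",
}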
Bayou Policy


  • Propagation of updates
    • When connected to a neighbor, exchange updates for everything

-> establish update subscription from neighbor for “/*”

  • Replication
    • Full Replication
  • Local interface
    • Reads - only precise and valid objects
  • On read miss
    • Should not happen

How to specify topology?

In concise rules

Overlog/P2

Overlog [Boon et al 05]

  • Declarative routing language based on Datalog
  • Expressive and compact rules to specify topology
  • Relational data model
    • Tuples
      • Stored in tables, or transient
    • Rules
      • Fired by combination of tuples and conditions
      • A tuple is generated after a rule is fired
    • Inter-node access : through remote table access or tuples
    • Basic Syntax
      • <Action> :- <Event><Condition1><Condition2>…<ConditionN>
      • @ - location specifier
      • _ - wild card

P2

  • Runtime system for Overlog
    • Parses Overlog and sets up data flows between nodes, etc.
Overlog 101

Ping every neighbor periodically to find live neighbors.

/* Tables */

neighbor(X,Y)

liveNeighbor(X,Y)

/* generate ping event every PING_PERIOD seconds */

pg0 pingEvent@X(X) :- periodic@X(X, E, PING_PERIOD).

/* generate a ping request */

pg1 pingRequest@X(X, Y) :- pingEvent@X(X), neighbor@X(X, Y).

/* send reply to ping request */

pg2 pingReply@Y(Y, X) :- pingRequest@X(X, Y).

/* add to live neighbor table */

pg3 liveNeighbor@X(X, Y) :- pingReply@Y(Y, X).

Practi & P2 Overview

Wrapper

  • handles conversion between overlog tuples and Practi requests and events
  • takes care of reconnections and time-outs.

[Figure: on each node, an Overlog/P2 runtime exchanges tuples with a wrapper; the wrapper turns tuples into requests on the PRACTI core's controller interface and turns core events back into tuples; local reads and writes use the local interface, cores exchange streams with each other, and data flows connect the Overlog/P2 instances on different nodes.]

Practi & Overlog

Implement policy with Overlog rules

  • Overlog tuples/table -> invoke mechanisms in Practi
    • AddInvalSubscription / RemoveInvalSubscription
    • AddBodySubscription / RemoveBodySubscription
    • DemandRead
  • Practi Events -> overlog tuples
    • LocalRead / LocalWrite / LocalReadMiss
    • RecvInval
  • Example:
    • Policy: Subscribe to invalidations for /* from all neighbors
    • Overlog rule:

AddInvalSubscription@X(X, N, SS) :- Neighbor@X(X, N), SS:= “/*”.

Bayou in Overlog

Bayou Policy

  • Replication
    • Full Replication
  • Local interface
    • Reads - only precise and valid objects
  • On read miss
    • Should not happen
  • Propagation of updates
    • When connected to a neighbor, exchange updates (anti-entropy)

-> establish update subscription from neighbor for “/*”

    • In overlog:

subscriptionSet("localhost:5000", "/*")

AddUpdateSubscription@X(X, Y, SS) :- liveNeighbor@X(X, Y), subscriptionSet@X(X, SS).

Coda in Overlog

Policy for Coda (Single Server)

  • Replication
    • Server: All data
    • Client: HoardSet + currently being accessed
  • Local Interface (Client)
    • Reads - only precise & valid objects (blocks otherwise)
    • Writes - to locally valid objects (otherwise conflict)
  • ReadMiss
    • Get the object from the server, and establish callback:
      • Callback: establish an inval subscription for the object.
  • Propagation of Updates
    • Client sends updates to Server
    • Server: Break callback for all other clients who have the obj
      • To break callback: remove obj from inval subscription stream
  • Hoarding
    • Periodically, fetch all (invalid) objects and establish callbacks on them
Coda in Overlog 2
  • Client: On Read Miss
    • Get Obj from Server
      • DemandRead@X(X, S, Obj, offset, length) :- localReadMiss@X(X, Obj, offset, length), Server@X(X, S), isConnected@X(X, V), V == 1.
    • Establish Callback
      • AddInvalSubscription@X(X, S, Obj) :- localReadMiss@X(X, Obj, _, _), Server@X(X, S), isConnected@X(X, V), V == 1.
    • Set up Subscription for Updates
      • AddInvalSubscription@S(S, X, Obj) :- localReadMiss@X(X, Obj, _, _), Server@X(X, S), isConnected@X(X, V), V == 1.
      • AddBodySubscription@S(S, X, Obj) :- localReadMiss@X(X, Obj, _, _), Server@X(X, S), isConnected@X(X, V), V == 1.
  • Server: On receiving Update from Client
    • Break callbacks for other clients
      • removeOutgoingInvalSubscription@X(X, C2, Obj) :- receivedInval@X(X, C1, O, _, _, _, _, _), establishedOutgoingInvalSubscriptions@X(X, C2, O), C1 != C2.

Grand Challenge

[Bar chart: synchronization time in seconds for Full Replication, Client-Server, and PRACTI across four scenarios (Plane, Hotel, Home, Office). Full Replication and Client-Server take roughly 66 to 81 seconds, with one configuration never completing; PRACTI takes roughly 1.6 to 5.1 seconds.]

How can I convince you?

  • Better Tradeoffs
  • Build 14 OSDI/SOSP Systems on Prototype
    • Experience so far
      • Bayou – 1 rule + 10 config parameters
      • CODA – 13 rules + 10 config parameters
Overlog/P2 – not quite perfect

No guarantee of atomicity or ordering among rules.

Difficult to specify access-based policies.

Difficult to specify policies which store information in the replicated object itself

Ongoing and Future Work

To make a dream a reality

  • Overlog + Practi integration
  • 14 OSDI/SOSP Systems
  • NFS interface
  • New Systems: Personal File System, Enterprise File System
  • Scalability -- 1000s of nodes
  • Security
Conclusions

Identified 3 properties which can be used to classify existing replication systems.

A way to build better replication systems

  • First replication architecture which provides all three properties
  • Subsumes existing systems
  • Exposes new points in the design space

A better way to build replication systems

  • Policy elegantly characterized
    • Policy as topology and configuration parameters
  • Policy can be written as concise rules + config parameters
Thank You

Towards a unified replication architecture

http://www.cs.utexas.edu/~dahlin/unifiedReplication

Why is this better?
  • Subsumes existing systems
    • Client-server, server replication, object replication, P2P, quorums, ..
  • Exposes new points in the design space
  • Makes it easier to build new systems
  • Builds better systems