
PRACTI Replication: Towards a Unified Theory of Replication


Presentation Transcript


  1. PRACTI Replication: Towards a Unified Theory of Replication Nalini Belaramani, Mike Dahlin, Lei Gao, Amol Nayate, Arun Venkataramani, Praveen Yalagandula, Jiandan Zheng University of Texas at Austin, 2nd August 2006

  2. Replication Systems Galore

  3. Server Replication Bayou [Terry et al 95] • All servers store the full set of data • Nodes exchange updates made since the previous synchronization • Any server can exchange updates with any other server • Eventually, all nodes agree on the order of updates to the data (see the sketch below)
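
To make the exchange concrete, here is a minimal Java sketch of Bayou-style anti-entropy; all names (Write, sendToPeer, the per-peer version vector) are invented for illustration, not PRACTI's actual code. The sender walks its log in causal order and forwards every write whose accept stamp the peer's version vector does not yet cover:

// Minimal sketch of anti-entropy between two servers (hypothetical names).
import java.util.*;

record Write(String objId, String nodeId, long acceptStamp, byte[] body) {}

class AntiEntropy {
    // peerVV: highest accept stamp the peer has seen from each node.
    // The log is assumed to be scanned in causal order, so the peer's
    // log keeps the prefix property after the exchange.
    static void sync(List<Write> log, Map<String, Long> peerVV) {
        for (Write w : log) {
            if (w.acceptStamp() > peerVV.getOrDefault(w.nodeId(), -1L)) {
                sendToPeer(w);                           // peer is missing this write
                peerVV.put(w.nodeId(), w.acceptStamp()); // peer will now have it
            }
        }
    }
    static void sendToPeer(Write w) { /* network send elided for brevity */ }
}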

  4. Client-Server Model Coda [Kistler et al 92] • Data is cached on client machines • Callbacks are established for notification of changes • Clients can get updates only from the server

  5. File System for PlanetLab • Data is replicated on geographically distributed nodes • Updates need to be propagated from node to node • May need to maintain strong consistency, depending on the application • Some file systems assume complete connectivity among nodes

  6. Personal File System Data lives at multiple locations • Desktop, server, laptop, PDA, colleague's laptop Desirable properties • Download updates only for the data I want • Do not necessarily have to connect to a server for updates • Some consistency guarantees

  7. See a Similarity? They are all data replication systems • Data is replicated on multiple nodes They differ in . . . • How much data is replicated at each node • Who each node talks to • What consistency to guarantee So many existing replication systems • 14 systems in SOSP/OSDI in the last 10 years • New application or new domain -> build a system from scratch • Need characteristics from different systems -> build a system from scratch

  8. Motivation What if we had a toolkit? • Supports the mechanisms required by replication systems • Mix and match mechanisms to build a system for your requirements • Pay only for what you need We would have • A way to build better replication systems • A better way to build replication systems

  9. Our Work 3 properties to characterize replication systems • PR – Partial Replication • AC – Arbitrary Consistency • TI – Topology Independence Mechanisms to support the above properties • PRACTI prototype • Subsumes existing replication systems • Better trade-offs Policy elegantly characterized • Policy as topology • Concise declarative rules + configuration parameters

  10. Grand Challenge [Bar chart: synchronization time in seconds for Full Replication, Client-Server, and PRACTI across the Plane, Hotel, Home, and Office scenarios; the alternatives take 66–81 s or never complete, while PRACTI takes 1.6–5.1 s] How can I convince you? • Better trade-offs • Build the 14 OSDI/SOSP systems on the prototype • With less than 1000 lines of code each

  11. Outline PRACTI Taxonomy Achieving PRACTI • PRACTI prototype • Evaluation Making Policy Easier • Building on PRACTI • Policy as topology Ongoing and Future Work

  12. PRACTI Taxonomy Characterizing Replication Systems

  13. PRACTI Taxonomy • Partial Replication (PR): replicate any subset of data to any node • Arbitrary Consistency (AC): support the consistency requirements of the application • Topology Independence (TI): any node can communicate with any other node

  14. PRACTI Taxonomy [Diagram: existing systems mapped onto the three properties] • Server replication (e.g. Bayou, TACT): Arbitrary Consistency and Topology Independence, but full replication • Hierarchy / client-server (e.g. Coda, Hier-AFS): Partial Replication, but a fixed topology • DHTs (e.g. CFS, PAST) and object replication (e.g. Ficus, Pangaea): Partial Replication and Topology Independence, but restricted consistency

  15. PRACTI Taxonomy [Same diagram, with PRACTI added at the intersection] PRACTI provides all three properties: Partial Replication, Arbitrary Consistency, and Topology Independence, unlike server replication (e.g. Bayou, TACT), hierarchy/client-server (e.g. Coda, Hier-AFS), DHTs (e.g. CFS, PAST), and object replication (e.g. Ficus, Pangaea).

  16. Why is PRACTI Hard? [Timeline diagram: a node writes module A and then module B; other nodes, each replicating only some of project/module/a . . . project/module/z, read module B and then module A. A node that receives the write to B but not the write to A can return results that violate the order of the writes.]

  17. Achieving PRACTI PRACTI Prototype Evaluation

  18. Step 1: Peer-to-Peer Log Exchange

  19. Peer-to-Peer Log Exchange [Petersen et al 97] Write = <objId, acceptStamp, body> Log exchanges propagate updates • <objId, acceptStamp, body> • Order of updates is maintained [Diagram: Node A and Node B each keep a log and a checkpoint and exchange log entries] A receiver-side sketch follows below.
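
Below is a minimal receiver-side sketch with assumed names (LogEntry, checkpoint): a peer's writes are applied in accept-stamp order per accepting node, which preserves the prefix property, i.e. a write is applied only after all earlier writes accepted by the same node.

// Sketch of applying a received log stream in order (hypothetical names).
import java.util.*;

record LogEntry(String objId, String acceptorId, long acceptStamp, byte[] body) {}

class LogReceiver {
    private final Map<String, Long> applied = new HashMap<>();      // per-acceptor prefix
    private final Map<String, byte[]> checkpoint = new HashMap<>(); // objId -> latest body

    // The stream is assumed sorted by acceptStamp for each acceptor.
    void receive(List<LogEntry> stream) {
        for (LogEntry e : stream) {
            long prev = applied.getOrDefault(e.acceptorId(), -1L);
            if (e.acceptStamp() <= prev) continue;        // duplicate, already applied
            checkpoint.put(e.objId(), e.body());          // apply update to checkpoint
            applied.put(e.acceptorId(), e.acceptStamp()); // advance the prefix
        }
    }
}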

  20. Peer-to-Peer Log Exchange [Diagram: Nodes 1–4 exchanging full logs pairwise] Log exchanges propagate updates • TI: pairwise exchange with any peer • AC: careful ordering of updates in logs • Prefix property, causal/eventual consistency • Broad range of consistency [Yu and Vahdat 2002] • PR: not yet; all nodes store all data and see all updates

  21. Step 2: Separation of Metadata and Data Paths

  22. Separate Data and Metadata Paths [Diagram: Node A performs Write = <foo, <10, A>, body>; Node B receives the invalidation <foo, <10, A>> over the log exchange and fetches the body <foo, <10, A>, body> when it must Read foo] Log exchange carries ordered streams of metadata (invalidations) • Invalidation: <object, timestamp> • All nodes see all invalidations (logically) Checkpoints track which objects are VALID Nodes receive only the bodies they are interested in (see the sketch below)
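
A minimal sketch of the two paths, with all names assumed: invalidations (metadata) arrive for every update and mark the local copy INVALID; bodies (data) arrive only for subscribed objects and make the copy VALID again.

// Sketch of the separated invalidation and body paths (hypothetical names).
import java.util.*;

class SeparatedPaths {
    enum State { VALID, INVALID }

    record Stamp(long time, String nodeId) {}

    static final class Entry { State state; Stamp stamp; byte[] body; }

    private final Map<String, Entry> checkpoint = new HashMap<>();

    void onInvalidation(String objId, Stamp s) {      // metadata path: every update
        Entry e = checkpoint.computeIfAbsent(objId, k -> new Entry());
        e.state = State.INVALID;                      // local copy is now stale
        e.stamp = s;
    }

    void onBody(String objId, Stamp s, byte[] body) { // data path: subscribed objects only
        Entry e = checkpoint.get(objId);
        if (e != null && s.equals(e.stamp)) {         // body matches newest known version
            e.body = body;
            e.state = State.VALID;                    // copy is current again
        }
    }
}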

  23. Separate Data and Metadata Paths [Diagram: Nodes 1–4; the invalidation stream reaches every node, while bodies flow only to interested nodes] Separation of data and metadata paths: • TI: pairwise exchange with any peer • AC: careful ordering of updates in logs • PR: partial replication of bodies, but still full replication of invalidations

  24. Step 3: Summarize Unneeded Metadata

  25. Summarize Unneeded Metadata Imprecise invalidation • A summary of a group of invalidations • <objectSet, [start]*, [end]*> • "One or more objects in objectSet were modified between the start time and the end time" Conservative summary • objectSet may be a superset of the actual targets • Compact encoding of a large number of invalidations (see the sketch below)
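
As an illustration, a sketch of how a group of precise invalidations might be folded into one imprecise invalidation; the encoding (a common path prefix plus a time interval) and all names are assumed, and the group is assumed non-empty.

// Sketch of summarizing precise invalidations (hypothetical names/encoding).
import java.util.*;

record PreciseInval(String objId, long time, String nodeId) {}
record ImpreciseInval(String objectSet, long start, long end) {}

class Summarizer {
    static ImpreciseInval summarize(List<PreciseInval> group) { // group must be non-empty
        long start = Long.MAX_VALUE, end = Long.MIN_VALUE;
        String prefix = null;
        for (PreciseInval pi : group) {
            start = Math.min(start, pi.time());
            end = Math.max(end, pi.time());
            prefix = (prefix == null) ? pi.objId() : commonPrefix(prefix, pi.objId());
        }
        // "project/module/*"-style target set: conservative, may cover untouched objects
        return new ImpreciseInval(prefix + "*", start, end);
    }

    static String commonPrefix(String a, String b) {
        int i = 0;
        while (i < a.length() && i < b.length() && a.charAt(i) == b.charAt(i)) i++;
        return a.substring(0, i);
    }
}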

  26. Summarize Unneeded Metadata (2) [Diagram: Node B subscribes for green objects; Node A sends the precise invalidation PI: <green, <10, A>> and the imprecise invalidation II: <non-green, <11, A>, <13, A>>; Node B can then Read foo] Imprecise invalidations act as "placeholders" • In the log and in the checkpoint • The receiver knows that it is missing information • The receiver blocks operations that depend on the missing information (sketch below)
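
A sketch of the receiver-side blocking behavior, with assumed names: an imprecise invalidation marks the affected interest set imprecise, and reads block until precise information catches up again.

// Sketch of blocking reads on an imprecise interest set (hypothetical names).
import java.util.*;
import java.util.concurrent.locks.*;

class InterestSet {
    private boolean precise = true;
    private final Lock lock = new ReentrantLock();
    private final Condition becamePrecise = lock.newCondition();

    void onImpreciseInval() {            // placeholder for missing invalidations
        lock.lock();
        try { precise = false; } finally { lock.unlock(); }
    }

    void onCaughtUp() {                  // precise invals have filled the gap
        lock.lock();
        try { precise = true; becamePrecise.signalAll(); } finally { lock.unlock(); }
    }

    byte[] read(Map<String, byte[]> checkpoint, String objId) throws InterruptedException {
        lock.lock();
        try {
            while (!precise) becamePrecise.await(); // block: answer could be stale
            return checkpoint.get(objId);
        } finally { lock.unlock(); }
    }
}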

  27. Summarize Unneeded Metadata (3) [Diagram: Nodes 1–4; each node receives precise invalidations for its subscription set, imprecise invalidations for the rest, and only the bodies it wants] Summarizing unneeded metadata: • TI: pairwise exchange with any peer • AC: careful ordering of updates in logs • PR: partial replication of bodies and partial replication of invalidations

  28. Summary of Approach [Diagram: Nodes 1–4 exchanging logs] 3 key ideas • Peer-to-peer log exchange • Separation of data and metadata paths • Summarization of unneeded metadata

  29. Summary of Approach [Same diagram, now showing the invalidation stream and body transfers as separate paths] 3 key ideas • Peer-to-peer log exchange • Separation of data and metadata paths • Summarization of unneeded metadata

  31. Why is this better? How to evaluate? • Compare with • AC-TI server replication (e.g., Bayou, TACT) • PR-AC client-server (e.g., Coda, NFS) • PR-TI object replication (e.g., Ficus, Pangaea) • Key question • Does the system provide significant advantages? Prototype benchmarking • Java + Berkeley DB

  32. PRACTI v. Client/Server v. Full Replication [Network diagram of the scenarios: machines at the hotel, home, and office connected through the Internet over links ranging from 0 Mb/s (disconnected) and 50 Kb/s up to 1 Mb/s and 10 Mb/s]

  33. Synchronization Time [Bar chart: synchronization time in seconds for Full Replication, Client-Server, and PRACTI across the Plane, Hotel, Home, and Office scenarios; the alternatives take 66–81 s or never complete, while PRACTI takes 1.6–5.1 s] Palmtop <-> Laptop • Client-server (e.g., Coda) • Limited by the network to the server – not an attractive solution • Full replication (e.g., Bayou) • Limited by the fraction of shared data – not a feasible solution • PRACTI • Up to an order of magnitude better – does what you want!

  34. Making Policy Easier Building on PRACTI Policy as Topology

  35. PRACTI as a Toolkit PRACTI prototype • Provides all 3 properties • Subsumes existing replication systems • Gives you the mechanisms • Implement policy over PRACTI for different systems [Layer diagram: Bayou, Coda, PlanetLab FS, Personal FS, . . . as policy layers on top of the PRACTI prototype, which provides the mechanism]

  36. System Overview Core – mechanisms • Asynchronous message passing Controller – policy [Architecture diagram: the local interface (Read(), Write(), Delete()) and the controller interface sit on top of the PRACTI core; the controller receives events and issues requests to the local core and to remote cores; cores exchange inval streams and body streams]

  37. PRACTI Basics Subscription streams • 2 types of streams: inval streams and body streams • Every stream is associated with a subscription set • Received invals and bodies are forwarded to the appropriate outgoing streams (see the sketch below) Controller • Implements the policy • Who to establish subscriptions to • What to do on a read miss • Who to send updates to
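
A sketch of that forwarding step, with assumed names and a simplistic "/dir/*" prefix match standing in for real subscription sets.

// Sketch of forwarding invals to matching outgoing streams (hypothetical names).
import java.util.*;

class StreamForwarder {
    record Stream(String peer, String subscriptionSet) {} // e.g. "/project/*"

    private final List<Stream> outgoing = new ArrayList<>();

    void subscribe(String peer, String set) { outgoing.add(new Stream(peer, set)); }

    void onIncomingInval(String objId) {
        for (Stream s : outgoing) {
            if (matches(s.subscriptionSet(), objId)) {
                forwardInval(s.peer(), objId);   // propagate metadata downstream
            }
        }
    }

    // Simplistic prefix matching for "/dir/*"-style subscription sets.
    static boolean matches(String set, String objId) {
        return set.endsWith("*") ? objId.startsWith(set.substring(0, set.length() - 1))
                                 : set.equals(objId);
    }

    void forwardInval(String peer, String objId) { /* network send elided */ }
}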

  38. Controller Interface Notification of key events • Stream begin/end • Invalidation arrival • Body arrival • Local read miss • Became precise • Became imprecise Directs communication among cores • Subscribe to inval or body streams • Request a demand read of a body Local housekeeping • Log garbage collection • Cache replacement A Java rendering is sketched below.
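
Rendered as hypothetical Java interfaces (the talk does not show the prototype's actual signatures, so every method name here is an assumption), the controller interface might look like this:

// Hypothetical rendering of the controller/core split described above.
interface Controller {
    // --- notifications of key events, delivered by the core ---
    void onStreamBegin(String peer);
    void onStreamEnd(String peer);
    void onInvalArrival(String objId);
    void onBodyArrival(String objId);
    void onLocalReadMiss(String objId);
    void onBecamePrecise(String interestSet);
    void onBecameImprecise(String interestSet);
}

// --- actions the controller can direct the core to take ---
interface Core {
    void addInvalSubscription(String peer, String subscriptionSet);
    void addBodySubscription(String peer, String subscriptionSet);
    void demandRead(String peer, String objId);   // fetch one body on demand
    void garbageCollectLog();                     // local housekeeping
    void evict(String objId);                     // cache replacement
}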

  39. Not All That Simple Yet Need to take care of • Stream management, timeouts, etc. Some systems • Arrival of a body or inval may require special processing • Read misses occur and need to be dealt with • The replication set may be based on priorities or access patterns Policy: 39 methods to do the magic • Can we make it easier?

  40. Policy as Topology Characterizing Policy Elegantly

  41. Policy & Topology Overlay topology question: • Among all the possible nodes I am connected to, who do I communicate with? Replication policy questions: • If data is not available locally, who do I contact? • If data is locally updated • Who do I send updates? • Who do I send invalidates? • Whom to prefetch from? ~Topology • Replication Set • Consistency Semantics ~Configuration Parameters

  42. Policy Revisited Policy now separates into several dimensions • Propagation of updates -> topology • If there are updates, or if I have a local read miss, who do I contact? • Consistency requirements -> local interface • Whether we can read stale/invalid data, and how stale • Replication of data -> config file • What subset of data each node has • Other policy essentials -> config file (example below) • How long is the timeout? • How many times to retry? • How often do I GC logs? • How much storage do I have? • Conflict resolution?
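
For illustration, a hypothetical configuration loader for those "config file" essentials; every key name and default value here is invented, not taken from the prototype.

// Hypothetical configuration for the policy dimensions listed above.
import java.io.*;
import java.util.Properties;

class PolicyConfig {
    final String replicationSet;   // what subset of data this node keeps
    final long timeoutMs;          // connection timeout
    final int retries;             // reconnection attempts
    final long gcIntervalMs;       // how often to garbage-collect the log
    final long storageBytes;       // local storage budget

    PolicyConfig(Properties p) {
        replicationSet = p.getProperty("replicationSet", "/*");
        timeoutMs    = Long.parseLong(p.getProperty("timeoutMs", "5000"));
        retries      = Integer.parseInt(p.getProperty("retries", "3"));
        gcIntervalMs = Long.parseLong(p.getProperty("gcIntervalMs", "60000"));
        storageBytes = Long.parseLong(p.getProperty("storageBytes", "1073741824"));
    }

    static PolicyConfig load(File f) throws IOException {
        Properties p = new Properties();
        try (Reader r = new FileReader(f)) { p.load(r); }
        return new PolicyConfig(p);
    }
}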

  43. Bayou Policy • Propagation of updates • When connected to a neighbor, exchange updates for everything -> establish an update subscription from the neighbor for "/*" • Replication • Full replication • Local interface • Reads: only precise and valid objects • On read miss • Should not happen

  44. How to specify topology? In concise rules

  45. Overlog/P2 Overlog [Loo et al 05] • Declarative routing language based on Datalog • Expressive and compact rules to specify topology • Relational data model • Tuples • Stored in tables, or transient • Rules • Fired by a combination of tuples and conditions • A tuple is generated after a rule fires • Inter-node access: through remote table access or tuples • Basic syntax • <Action> :- <Event>, <Condition1>, <Condition2>, . . ., <ConditionN>. • @ – location specifier • _ – wild card P2 • Runtime system for Overlog • Parses Overlog and sets up data flows between nodes, etc.

  46. Overlog 101 Ping every neighbor periodically to find live neighbors.

/* Tables */
neighbor(X, Y)
liveNeighbor(X, Y)

/* generate a ping event every PING_PERIOD seconds */
pg0 pingEvent@X(X) :- periodic@X(X, E, PING_PERIOD).

/* generate a ping request */
pg1 pingRequest@X(X, Y) :- pingEvent@X(X), neighbor@X(X, Y).

/* send a reply to a ping request */
pg2 pingReply@Y(Y, X) :- pingRequest@X(X, Y).

/* add to the live neighbor table */
pg3 liveNeighbor@X(X, Y) :- pingReply@Y(Y, X).

  47. Practi & P2 Overview Wrapper • handles conversion between overlog tuples and Practi requests and events • takes care of reconnections and time-outs. DataFlows Overlog/P2 Overlog/P2 Tuples Wrapper Wrapper Requests Events Local Interface Practi Core Controller Interface Practi Core Controller Interface Local Interface Local Read & Writes Streams

  48. Practi & Overlog Implement policy with Overlog rules • Overlog tuples/table -> invoke mechanisms in Practi • AddInvalSubscription / RemoveInvalSubscription • AddBodySubscription / RemoveBodySubscription • DemandRead • Practi Events -> overlog tuples • LocalRead / LocalWrite / LocalReadMiss • RecvInval • Example: • Policy: Subscribe invalidates for /* from all neighbors • Overlog rule: AddInvalSubscription@X(X, N, SS) :- Neighbor@X(X, N), SS:= “/*”.

  49. Bayou in Overlog Bayou policy • Replication • Full replication • Local interface • Reads: only precise and valid objects • On read miss • Should not happen • Propagation of updates • When connected to a neighbor, exchange updates (anti-entropy) -> establish an update subscription from the neighbor for "/*" • In Overlog: subscriptionSet("localhost:5000", "/*") AddUpdateSubscription@X(X, Y, SS) :- liveNeighbor@X(X, Y), subscriptionSet@X(X, SS)

  50. Coda in Overlog Policy for Coda (single server) • Replication • Server: all data • Client: hoard set + currently accessed objects • Local interface (client) • Reads: only precise and valid objects (blocks otherwise) • Writes: to locally valid objects (otherwise conflict) • Read miss • Get the object from the server and establish a callback • Callback: establish an inval subscription for the object • Propagation of updates • Clients send updates to the server • Server: break the callback for all other clients that have the object • To break a callback: remove the object from the inval subscription stream • Hoarding • Periodically fetch all (invalid) objects and establish callbacks on them (client-side sketch below)
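
As a plain-Java illustration of the same policy (reusing the hypothetical Controller and Core interfaces sketched earlier, so every name here is an assumption), the client's read-miss handling and callback establishment might look like this:

// Sketch of a Coda-like client policy over the hypothetical controller API.
class CodaClientController implements Controller {
    private final Core core;
    private final String server;

    CodaClientController(Core core, String server) {
        this.core = core; this.server = server;
    }

    @Override public void onLocalReadMiss(String objId) {
        core.demandRead(server, objId);            // fetch the body from the server
        core.addInvalSubscription(server, objId);  // callback: watch for updates
    }

    // Remaining event notifications are no-ops in this sketch.
    @Override public void onStreamBegin(String peer) {}
    @Override public void onStreamEnd(String peer) {}
    @Override public void onInvalArrival(String objId) {}
    @Override public void onBodyArrival(String objId) {}
    @Override public void onBecamePrecise(String is) {}
    @Override public void onBecameImprecise(String is) {}
}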
