
Pastiche: Making Backup Cheap and Easy


ronaldhunt

Presentation Transcript


  1. Pastiche: Making Backup Cheap and Easy

  2. Introduction • Backup is cumbersome and expensive • ~$4/GB/Month • Small-scale solutions dominated by administrative efforts • Large-scale solutions require centralized management

  3. Pastiche • Observation 1: disk is no longer full • Can use excess capacity for efficient, effective, and administration-free backup • Use untrusted machines to perform backup services • Need replication for reliability • Need to balance locality and reliability

  4. Pastiche • Observation 2: Much of the data on a given machine is not unique • Office 2000: 217 MB footprint • Different installations are largely the same • Exploiting this redundancy can achieve storage savings

  5. Pastiche • Built on three pieces of research • Pastry: Peer-to-peer, self-administering, scalable routing • Content-based indexing: easy discovery of redundant data • Convergent encryption: use the same encrypted representation without sharing keys

  6. Challenges • How to discover backup buddies without a centralized directory? • How can nodes reuse their own state to back up others? • How can nodes restore files/machines without requiring administrative intervention? • How can nodes detect unfaithful buddies?

  7. Basic Idea • Summarize storage content with abstracts • Use abstracts to locate buddies • A skeleton tree is used to represent and restore an entire file system • Periodic queries of buddies for stored data

  8. Enabling Technologies • Peer-to-peer routing • Content-based indexing • Convergent encryption

  9. Peer-to-Peer Routing • Pastry: scalable, self-organizing routing and object location infrastructure • Each node has a nodeID • IDs are uniformly distributed in the ID space • A proximity metric measures the network distance between two nodes

  10. More on Pastry - I • Each node maintains three sets of state • Leaf set • Closest nodes in terms of nodeIDs • Neighborhood set • Closest nodes in terms of the proximity metric • Routing table • Prefix routing

  11. More on Pastry - II • Pastry is self organizing • Nodes come and go • seed discovery protocol

  12. Prefix Routing • In each step, a node forwards the message to a node whose nodeID shares with the destination a prefix at least one digit longer than the prefix the current nodeID shares with it • Destination: 1230 • Current NodeID: 1023 • Next Hop: 12--
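
The routing step on this slide can be sketched in a few lines of Python. This is a simplified illustration, not Pastry's actual routing table lookup: the function names, the flat list of known nodes, and the tie-break rule are all assumptions made for the example.

```python
from typing import List, Optional


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits that two IDs have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def next_hop(current: str, dest: str, known_nodes: List[str]) -> Optional[str]:
    """Pick a known node that shares a strictly longer prefix with dest."""
    have = shared_prefix_len(current, dest)
    better = [n for n in known_nodes if shared_prefix_len(n, dest) > have]
    # Hypothetical tie-break: prefer the longest shared prefix.
    return max(better, key=lambda n: shared_prefix_len(n, dest), default=None)


# The slide's example: destination 1230, current node 1023.
print(next_hop("1023", "1230", ["1201", "1032", "2310"]))  # 1201
```

Node 1023 shares only the digit "1" with the destination, so any node of the form 12-- is a valid next hop. When no such node is known the sketch returns None; how a real overlay handles that case is outside what the slide covers.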

  13. Pastiche’s Use of Pastry • Uses two separate Pastry overlay networks during buddy discovery • Once a node is discovered, traffic is sent directly via IP • Pastiche adds two mechanisms • Lighthouse sweep to discover distinct Pastry nodes • Distance metric based on the FS contents

  14. Content-Based Indexing • Goal: identify file regions for sharing • Use Rabin fingerprints • A fingerprint is generated for each overlapping k-byte substring in a file • If the lower-order bits of a fingerprint match a predetermined value, that offset is marked as an anchor • Anchors divide files into chunks; each chunk is associated with a secure hash value

  15. Sharing with Confidentiality • Sharing encrypted data without sharing keys • Need to have a single encrypted representation • Use convergent encryption

  16. Convergent Encryption • So…say…how do you share a door without sharing its corresponding keys?

  17. Convergent Encryption • How about using different safes to store those keys?

  18. Convergent Encryption • And use different keys to access those keys

  19. Implications of the Use of Convergent Encryption • If a backup node is not in the participating group (it does not hold the same data) • It cannot decrypt the data • Otherwise, the backup node knows that the client also stores that data • Information leak vs. storage efficiency
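
The idea behind convergent encryption is that the key is derived from the content itself, so identical plaintext on two machines yields identical ciphertext. The sketch below shows only that property. Its cipher is a toy keystream built from SHA-256 so the example runs with the standard library alone; it is an assumption for illustration, not the cipher Pastiche uses, and should not be used to protect real data.

```python
import hashlib
from typing import Tuple


def _keystream(key: bytes, n: int) -> bytes:
    """Toy keystream: SHA-256 over key plus counter. NOT a vetted cipher."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])


def convergent_encrypt(plaintext: bytes) -> Tuple[bytes, bytes]:
    """Return (key, ciphertext); the key is a hash of the plaintext."""
    key = hashlib.sha256(plaintext).digest()
    stream = _keystream(key, len(plaintext))
    return key, bytes(p ^ s for p, s in zip(plaintext, stream))


def convergent_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    """Invert convergent_encrypt given the content-derived key."""
    stream = _keystream(key, len(ciphertext))
    return bytes(c ^ s for c, s in zip(ciphertext, stream))


# Two hosts that hold the same chunk produce the same ciphertext.
key_a, ct_a = convergent_encrypt(b"shared library bytes")
key_b, ct_b = convergent_encrypt(b"shared library bytes")
print(ct_a == ct_b)
```

Each host can then protect its copy of the content-derived key with a key of its own, which corresponds to the "different keys to access those keys" on slide 18. A node that never held the plaintext cannot derive the key, while a node that did can tell that the client stores the same data.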

  20. Design • Pastiche data is stored in chunks • Chunk boundaries determined by content-based indexing • Encrypted with convergent encryption • Chunks carry owner lists

  21. Design • When a newly written file is closed, it is scheduled for chunking • If a chunk already exists, the local host is added to the owner list • If not, encrypt the chunk and write it out • Chunking and writing deferred to avoid short-lived files

  22. Design • Chunks are immutable • When a file is written, its set of chunks may change • A chunk is not deleted until the last reference to it is removed
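
The owner-list behavior from the three Design slides can be sketched as a small in-memory store. The class and method names are hypothetical, the payload is assumed to be already encrypted, and treating each owner as a single reference is a simplification; the slides do not say how references are counted.

```python
import hashlib
from typing import Dict, Set


class ChunkStore:
    """Immutable chunks keyed by content hash, each with an owner list."""

    def __init__(self) -> None:
        self.data: Dict[str, bytes] = {}
        self.owners: Dict[str, Set[str]] = {}

    def add(self, owner: str, payload: bytes) -> str:
        """Store payload once; later writers only join the owner list."""
        cid = hashlib.sha256(payload).hexdigest()
        if cid not in self.data:
            self.data[cid] = payload
            self.owners[cid] = set()
        self.owners[cid].add(owner)
        return cid

    def release(self, owner: str, cid: str) -> None:
        """Drop one owner; delete the chunk when no owner remains."""
        holders = self.owners.get(cid)
        if holders is None:
            return
        holders.discard(owner)
        if not holders:
            del self.owners[cid]
            del self.data[cid]


store = ChunkStore()
cid = store.add("local", b"encrypted chunk bytes")
store.add("buddy-client", b"encrypted chunk bytes")  # same chunk, new owner
store.release("local", cid)
print(cid in store.data)  # True: buddy-client still references it
```

Because chunks are immutable, rewriting a file never edits a chunk in place. The file simply ends up referencing a different set of chunk IDs.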

  23. Abstracts: Finding Redundancy • An ideal backup buddy is one that holds a superset of the new machine’s data • To find it, send the full signature (hashes) of the new node to candidate buddies • However, we need to transfer 1.3MB per GB of stored data • Solution: Abstracts—transfer only a random subset of signatures
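
A sketch of how an abstract can stand in for a full signature: sample a random subset of chunk hashes and measure what fraction a candidate buddy already holds. The function names, the sample size, and the integer stand-ins for hashes are assumptions. The arithmetic at the top shows one set of values (20-byte hashes, 16 KB expected chunk size) that is consistent with the 1.3 MB per GB figure; the slides themselves do not state these values.

```python
import random
from typing import Set

# Assumed: 20-byte hash per chunk, 16 KB expected chunk size.
SIGNATURE_BYTES_PER_GB = (2**30 // (16 * 2**10)) * 20
print(SIGNATURE_BYTES_PER_GB)  # 1310720 bytes, roughly 1.3 MB


def make_abstract(signature: Set[int], size: int, seed: int = 0) -> Set[int]:
    """Random subset of a node's signature (its set of chunk hashes)."""
    rng = random.Random(seed)
    return set(rng.sample(sorted(signature), min(size, len(signature))))


def estimated_coverage(abstract: Set[int], candidate: Set[int]) -> float:
    """Fraction of the abstract's hashes that the candidate already stores."""
    if not abstract:
        return 0.0
    return len(abstract & candidate) / len(abstract)


node1 = set(range(1000))       # stand-in chunk hashes
node2 = set(range(500, 1500))  # overlaps half of node1
abstract = make_abstract(node1, size=100)
print(estimated_coverage(abstract, node2))
```

The estimate comes from a sample, so it varies around the true overlap, but a candidate that holds a superset of the new node's data always scores 1.0.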

  24. Compare one disk to another • [Figure: Node1's signature and Node2's signature shown side by side as lists of chunk hashes, with Node1's abstract drawn as a subset of Node1's signature]

  25. Overlays: Finding a Set of Buddies • A desirable buddy should • Have a substantial overlap • Be physically nearby (with at least one far away to survive geographically correlated failures)

  26. Applied Use of Pastry • Pastiche uses two Pastry overlays to facilitate buddy discovery • One for network proximity • One for file system overlap • Coverage—the fraction of overlapping chunks stored on a site

  27. Security Problems • A malicious node can • Under-report coverage to avoid being chosen as a buddy • Over-report coverage to attract clients just to discard their chunks

  28. Backup Protocol • A Pastiche node has full control over the backup schedule • A snapshot consists of three things • Chunks to be added • Chunks to be removed • Metadata of those chunks
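
The three parts of a snapshot can be sketched as a difference between two backup states. The dictionary layout and function name are hypothetical, and attaching metadata only to the added chunks is an assumption about what "those chunks" covers.

```python
from typing import Dict


def make_snapshot(previous: Dict[str, dict], current: Dict[str, dict]) -> dict:
    """Diff two {chunk_id: metadata} maps into an add/remove/metadata record."""
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    return {
        "add": added,
        "remove": removed,
        "meta": {cid: current[cid] for cid in added},
    }


before = {"c1": {"size": 4096}, "c2": {"size": 2048}}
after = {"c2": {"size": 2048}, "c3": {"size": 1024}}
print(make_snapshot(before, after))
```

Only chunks that changed since the last snapshot appear in the record, which keeps each backup proportional to the amount of new data.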

  29. Restoration • A Pastiche node retains its archive skeleton, so performing partial restores is easy • To recover the whole machine, a node has to obtain its root node from one of the backup machines first…

  30. Detecting Failure and Malice • A node randomly requests data from its buddies • Can bound probability of having failures and malicious nodes undetected
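
The bound on undetected loss can be illustrated with a short calculation. It assumes each query picks a stored chunk independently and uniformly at random, and that a buddy has lost or discarded a fixed fraction of the chunks; both are modeling assumptions, and the function names are hypothetical.

```python
import math


def undetected_probability(missing_fraction: float, queries: int) -> float:
    """Chance that every one of `queries` random checks hits an intact chunk."""
    return (1.0 - missing_fraction) ** queries


def queries_needed(missing_fraction: float, max_undetected: float) -> int:
    """Smallest number of checks that pushes the miss chance to the target."""
    if not (0.0 < missing_fraction < 1.0 and 0.0 < max_undetected < 1.0):
        raise ValueError("both arguments must lie strictly between 0 and 1")
    return math.ceil(math.log(max_undetected) / math.log(1.0 - missing_fraction))


# A buddy that dropped 10% of the chunks, with a 1% tolerance for missing it.
print(queries_needed(0.10, 0.01))  # 44
```

Under this model the probability of missing a faulty buddy shrinks geometrically with the number of queries, so a modest number of spot checks is enough to catch large-scale loss.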

  31. Preventing Greed • Someone can store things everywhere • Need to institute distributed quota • Very difficult • Some proposed solutions • Each node monitors the overall storage costs imposed by its backup clients • Problem: Sybil attacks (forge many entities that each consume little storage)

  32. Preventing Greed • Force each node to solve puzzles proportional to storage consumption • Problems: • Needlessly expensive • Storage is traded against something other than storage • Heterogeneous computing power

  33. Preventing Greed • Electronic currency • Problems: • Need to add atomic currency transactions • Complicated

  34. Implementation • Chunkstore file system • Backup daemon

  35. Performance Overhead

  36. The Chance of Finding Buddies
