Pastiche: Making Backup Cheap and Easy

Presentation Transcript

  1. Pastiche: Making Backup Cheap and Easy

  2. Introduction • Backup is cumbersome and expensive • ~$4/GB/Month • Small-scale solutions dominated by administrative efforts • Large-scale solutions require centralized management

  3. Pastiche • Observation 1: disks are no longer full • Can use excess capacity for efficient, effective, and administration-free backup • Use untrusted machines to perform backup services • Need replication for reliability • Need to balance locality and reliability

  4. Pastiche • Observation 2: Much of the data on a given machine is not unique • Office 2000: 217 MB footprint • Different installations are largely the same • Exploiting this redundancy can achieve storage savings

  5. Pastiche • Built on three pieces of research • Pastry: Peer-to-peer, self-administering, scalable routing • Content-based indexing: easy discovery of redundant data • Convergent encryption: use the same encrypted representation without sharing keys

  6. Challenges • How to discover backup buddies without a centralized directory? • How can nodes reuse their own state to back up others? • How can nodes restore files/machines without requiring administrative intervention? • How can nodes detect unfaithful buddies?

  7. Basic Idea • Summarize storage content with abstracts • Use abstracts to locate buddies • A skeleton tree is used to represent and restore an entire file system • Periodic queries of buddies for stored data

  8. Enabling Technologies • Peer-to-peer routing • Content-based indexing • Convergent encryption

  9. Peer-to-Peer Routing • Pastry: scalable, self-organizing routing and object location infrastructure • Each node has a nodeID • IDs are uniformly distributed in the ID space • A proximity metric measures the network distance between two nodes

  10. More on Pastry - I • Each node maintains three sets of state • Leaf set • Closest nodes in terms of nodeIDs • Neighborhood set • Closest nodes in terms of the proximity metric • Routing table • Prefix routing

  11. More on Pastry - II • Pastry is self organizing • Nodes come and go • seed discovery protocol

  12. Prefix Routing • In each step, a node forwards the message to a node whose nodeID shares a prefix with the destination that is at least one digit longer than the prefix shared by the current nodeID • Destination: 1230 • Current NodeID: 1023 • Next Hop: 12--
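
The prefix-routing step above can be sketched in a few lines. This is a simplified illustration, not Pastry's actual routing code: it ignores the leaf set and the proximity metric, and the names `shared_prefix_len` and `next_hop` are hypothetical. It assumes node IDs are equal-length digit strings and that the node already knows a flat list of other nodes.

```python
from typing import List, Optional


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def next_hop(current: str, dest: str, known: List[str]) -> Optional[str]:
    """Pick a known node that shares a strictly longer prefix with dest.

    Simplification: among qualifying nodes we take the longest match.
    Returns None when no known node improves on the current one.
    """
    have = shared_prefix_len(current, dest)
    better = [n for n in known if shared_prefix_len(n, dest) > have]
    if not better:
        return None
    return max(better, key=lambda n: shared_prefix_len(n, dest))


# The slide's example: node 1023 routing toward 1230 needs a "12--" node.
hop = next_hop("1023", "1230", ["1201", "2230", "1100"])
```

Here `1201` is the only candidate that extends the shared prefix from `1` to `12`, so it is the hop chosen.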

  13. Pastiche’s Use of Pastry • Uses two separate Pastry overlay networks during buddy discovery • Once a buddy is discovered, traffic is sent directly via IP • Pastiche adds two mechanisms • Lighthouse sweep to discover distinct Pastry nodes • Distance metric based on the FS contents

  14. Content-Based Indexing • Goal: identify file regions for sharing • Use Rabin fingerprints • A fingerprint is generated for each overlapping k-byte substring in a file • If the low-order bits of a fingerprint match a predetermined value, that offset is marked as an anchor • Anchors divide files into chunks; each chunk is associated with a secure hash value
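
The anchoring scheme above might look roughly like the following. This is a sketch under stated assumptions: a simple polynomial rolling hash stands in for a true Rabin fingerprint, and the constants `WINDOW`, `MASK`, and `MAGIC` are hypothetical values chosen for illustration, not the ones Pastiche uses. SHA-1 is used as the per-chunk secure hash purely as an example.

```python
import hashlib
from typing import List, Tuple

WINDOW = 16           # k: bytes covered by each fingerprint (assumed value)
MASK = 0x3F           # compare the low 6 bits (assumed value)
MAGIC = 0             # the "predetermined value" (assumed value)
BASE = 257
MOD = (1 << 61) - 1


def chunk_boundaries(data: bytes) -> List[int]:
    """Return the end offset of every chunk in data."""
    if not data:
        return []
    if len(data) < WINDOW:
        return [len(data)]
    top = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window
    h = 0
    for b in data[:WINDOW]:
        h = (h * BASE + b) % MOD
    cuts = []
    i = WINDOW                         # h covers data[i - WINDOW:i]
    while True:
        if (h & MASK) == MAGIC:
            cuts.append(i)             # this offset is an anchor
        if i == len(data):
            break
        # Slide the window one byte: drop data[i - WINDOW], add data[i].
        h = ((h - data[i - WINDOW] * top) * BASE + data[i]) % MOD
        i += 1
    if not cuts or cuts[-1] != len(data):
        cuts.append(len(data))         # the tail forms the final chunk
    return cuts


def chunks(data: bytes) -> List[Tuple[str, bytes]]:
    """Split data at anchors and pair each chunk with its hash."""
    out, start = [], 0
    for end in chunk_boundaries(data):
        piece = data[start:end]
        out.append((hashlib.sha1(piece).hexdigest(), piece))
        start = end
    return out
```

Because anchors depend only on the bytes inside the window, inserting data near the start of a file changes the first chunk but leaves later chunks, and their hashes, unchanged.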

  15. Sharing with Confidentiality • Sharing encrypted data without sharing keys • Need to have a single encrypted representation • Use convergent encryption

  16. Convergent Encryption • So…say…how do you share a door without sharing its corresponding keys?

  17. Convergent Encryption • How about using different safes to store those keys?

  18. Convergent Encryption • And use different keys to access those keys
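
In code, the door-and-safe analogy could be sketched as follows. The essential idea is that the key for a chunk is derived from the chunk's own contents, so every owner produces the same ciphertext, and each owner then locks that key away under its own secret. This is a toy: the SHA-256 counter keystream XOR below is a stand-in for a real cipher and must not be used for actual confidentiality, and all function names are hypothetical.

```python
import hashlib
from typing import Tuple


def _keystream(key: bytes, n: int) -> bytes:
    """Toy keystream: SHA-256 over key || counter (illustration only)."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])


def _xor(data: bytes, key: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


def convergent_encrypt(chunk: bytes) -> Tuple[bytes, str, bytes]:
    """Return (content_key, chunk_id, ciphertext) for a chunk."""
    content_key = hashlib.sha256(chunk).digest()    # key comes from the data
    cipher = _xor(chunk, content_key)
    chunk_id = hashlib.sha256(cipher).hexdigest()   # public name of the chunk
    return content_key, chunk_id, cipher


def convergent_decrypt(content_key: bytes, cipher: bytes) -> bytes:
    return _xor(cipher, content_key)


def wrap_key(owner_secret: bytes, content_key: bytes) -> bytes:
    """Each owner stores the content key in its own 'safe'.

    XOR is its own inverse, so calling this again unwraps the key.
    """
    return _xor(content_key, owner_secret)
```

Two machines holding the same chunk arrive at the same `chunk_id` and ciphertext without exchanging anything, while their wrapped keys differ because their secrets differ.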

  19. Implications of the Use of Convergent Encryption • If a backup node does not itself hold the data • It cannot decrypt the data • If it does, the backup node knows the client also stores that data • Information leak vs. storage efficiency

  20. Design • Pastiche data is stored in chunks • Chunk boundaries determined by content-based indexing • Encrypted with convergent encryption • Chunks carry owner lists

  21. Design • When a newly written file is closed, it is scheduled for chunking • If a chunk already exists, the local host is added to the owner list • If not, encrypt the chunk and write it out • Chunking and writing deferred to avoid short-lived files

  22. Design • Chunks are immutable • When a file is written, its set of chunks may change • A chunk is not deleted until the last reference to it is removed
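
The chunk lifecycle described in the three Design slides can be modeled with a small in-memory store. This is a minimal sketch, not the Pastiche chunkstore: the class and method names are hypothetical, encryption is omitted, and the owner list is used as the only form of reference, whereas a real store would also track references from individual files.

```python
import hashlib
from typing import Dict, Set


class ChunkStore:
    """Immutable chunks keyed by content hash, each with an owner list."""

    def __init__(self) -> None:
        self.data: Dict[str, bytes] = {}
        self.owners: Dict[str, Set[str]] = {}

    def put(self, owner: str, chunk: bytes) -> str:
        """Store a chunk, or just add the owner if it already exists."""
        cid = hashlib.sha1(chunk).hexdigest()
        if cid not in self.data:
            self.data[cid] = chunk
            self.owners[cid] = set()
        self.owners[cid].add(owner)
        return cid

    def release(self, owner: str, cid: str) -> None:
        """Drop one owner; delete the chunk once nobody refers to it."""
        holders = self.owners.get(cid)
        if holders is None:
            return
        holders.discard(owner)
        if not holders:
            del self.owners[cid]
            del self.data[cid]


store = ChunkStore()
cid = store.put("local", b"common library code")
store.put("buddy-client", b"common library code")   # stored only once
store.release("local", cid)                          # still held by the client
```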

  23. Abstracts: Finding Redundancy • An ideal backup buddy is one that holds a superset of the new machine’s data • To find it, send the full signature (hashes) of the new node to candidate buddies • However, we need to transfer 1.3MB per GB of stored data • Solution: Abstracts—transfer only a random subset of signatures
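
One way to picture an abstract is as a random sample of the chunk hashes in a signature, which a candidate buddy checks against its own store. The sketch below assumes signatures are plain sets of hash strings; `make_abstract` and `coverage` are hypothetical names, and the fixed `seed` exists only to make the example repeatable.

```python
import random
from typing import Set


def make_abstract(signature: Set[str], size: int, seed: int = 0) -> Set[str]:
    """Sample up to `size` hashes from the full signature."""
    rng = random.Random(seed)
    ordered = sorted(signature)          # sort so sampling is reproducible
    return set(rng.sample(ordered, min(size, len(ordered))))


def coverage(abstract: Set[str], candidate_signature: Set[str]) -> float:
    """Fraction of the sampled hashes that the candidate already stores."""
    if not abstract:
        return 0.0
    return len(abstract & candidate_signature) / len(abstract)


new_node = {"h%03d" % i for i in range(200)}
good_buddy = new_node | {"extra1", "extra2"}     # holds a superset
stranger = {"x%03d" % i for i in range(200)}     # nothing in common

abstract = make_abstract(new_node, size=20)
```

The candidate that holds a superset of the new node's data scores full coverage on the sample, while the unrelated machine scores zero, even though only 20 of the 200 hashes were sent.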

  24. Compare one disk to another • [Figure: the Node1 signature and the Node2 signature shown side by side as lists of chunk hashes, with the Node1 abstract drawn as a sample of the Node1 signature]

  25. Overlays: Finding a Set of Buddies • A desirable buddy should • Have substantial overlap • Be physically nearby (with at least one far away to survive geographically correlated failures)

  26. Applied Use of Pastry • Pastiche uses two Pastry overlays to facilitate buddy discovery • One for network proximity • One for file system overlap • Coverage—the fraction of overlapping chunks stored on a site

  27. Security Problems • A malicious node can • Under-report coverage to avoid being chosen as a buddy • Over-report coverage to attract clients just to discard their chunks

  28. Backup Protocol • A Pastiche node has full control over the backup schedule • A snapshot consists of three things • Chunks to be added • Chunks to be removed • Metadata of those chunks
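
The add and remove sets of a snapshot can be thought of as set differences between the chunk IDs of the previous and current states. The sketch below is an assumption about how such a delta could be computed, not the protocol's actual message format; `snapshot_delta` is a hypothetical name and the metadata part is left out.

```python
from typing import Dict, Set


def snapshot_delta(previous: Set[str], current: Set[str]) -> Dict[str, Set[str]]:
    """Chunk IDs to add and to remove when moving to the current state."""
    return {
        "add": current - previous,       # new chunks the buddy must store
        "remove": previous - current,    # chunks no longer referenced
    }


delta = snapshot_delta({"a", "b", "c"}, {"b", "c", "d"})
```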

  29. Restoration • A Pastiche node retains its archive skeleton, so performing partial restores is easy • To recover the whole machine, a node has to obtain its root node from one of the backup machines first…

  30. Detecting Failure and Malice • A node randomly requests data from its buddies • Can bound the probability that failures and malicious nodes go undetected
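
The bound can be made concrete under a simple model, which is an assumption here rather than something stated on the slide: a buddy has lost or discarded a fraction f of a node's chunks, and each query independently picks one chunk uniformly at random. The chance that k queries all miss the damage is then (1 - f)^k. The function names are hypothetical.

```python
import math


def undetected_probability(missing_fraction: float, queries: int) -> float:
    """Chance that `queries` independent random checks all succeed."""
    if not 0.0 <= missing_fraction <= 1.0:
        raise ValueError("missing_fraction must be between 0 and 1")
    return (1.0 - missing_fraction) ** queries


def queries_needed(missing_fraction: float, target: float) -> int:
    """Fewest queries that push the undetected probability to target or below."""
    if not 0.0 < missing_fraction < 1.0:
        raise ValueError("missing_fraction must be strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return math.ceil(math.log(target) / math.log(1.0 - missing_fraction))
```

For example, if a buddy has dropped half of the chunks, `queries_needed(0.5, 0.01)` gives 7, since 0.5 to the 7th power is about 0.008.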

  31. Preventing Greed • Someone can store things everywhere • Need to institute distributed quota • Very difficult • Some proposed solutions • Each node monitors the overall storage costs imposed by its backup clients • Problem: Sybil attacks (forging many identities that each consume little storage)

  32. Preventing Greed • Force each node to solve puzzles proportional to storage consumption • Problems: • Needlessly expensive • Storage is traded against something other than storage • Heterogeneous computing power

  33. Preventing Greed • Electronic currency • Problems: • Need to add atomic currency transactions • Complicated

  34. Implementation • Chunkstore file system • Backup daemon

  35. Performance Overhead

  36. The Chance of Finding Buddies