
Pastiche: Making Backup Cheap and Easy


Presentation Transcript


  1. Pastiche: Making Backup Cheap and Easy Presented by Deniz Hastorun

  2. Overview • Motivation and Goals • Enabling Technologies • System Design • Implementation and Evaluation • Limitations • Conclusion

  3. Motivation • The majority of users today do not back up • Those who do either back up rarely or do not back up everything • Backup is cumbersome and can be expensive • Restoration is time consuming • Is it really worth the effort, and what would justify it? • Disks are getting cheap • File systems are now only 53% full on average • The data newly written by a client each day is a small fraction of the file system

  4. Goals • A P2P backup system • Utilize the excess disk storage capacity of peers • Leverage common data and programs (Windows, Office, Linux, etc.) • Ensure integrity and privacy • Efficient, cost-free, and administration-free

  5. PASTICHE • A P2P backup system • Nodes form a cooperative but mutually untrusted collection of machines providing backup service • Target environment: end-user machines • Identifies systems with overlapping data for efficient backup and space savings

  6. Enabling Technologies • Pastry for self-organizing routing and node location • Content-based indexing • identifies redundant data in similar files • divides files into chunks using Rabin fingerprints • Convergent encryption • enables encrypted sharing of data without sharing keys

  7. Pastry • Self-organizing P2P routing overlay • Nodes are represented by nodeIDs distributed uniformly across the nodeID space (SHA-1 hash of, e.g., a domain name) • Each node maintains • Leaf set: the L/2 closest smaller (and L/2 closest larger) nodeIDs • Neighborhood set: the closest nodes according to a network-proximity metric • Routing table: prefix routing • Used in Pastiche for the discovery of backup buddies • Once the buddy set is discovered, traffic is routed directly over IP • Two mechanisms are added: • lighthouse sweep • a distance metric based on file system contents
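  A minimal sketch of the prefix-routing step Pastry performs, assuming hexadecimal nodeIDs and an in-memory routing table; the function names and table layout here are illustrative, not Pastiche's actual code.

      import hashlib

      def node_id(name, digits=32):
          # Toy nodeID: hex SHA-1 of some node name (Pastry hashes, e.g., a key or hostname).
          return hashlib.sha1(name.encode()).hexdigest()[:digits]

      def shared_prefix_len(a, b):
          n = 0
          for x, y in zip(a, b):
              if x != y:
                  break
              n += 1
          return n

      def next_hop(key, here, routing_table):
          # routing_table[row][digit] holds the nodeID of a node that shares `row`
          # prefix digits with us and whose next digit is `digit` (the usual Pastry layout).
          p = shared_prefix_len(key, here)
          if p >= len(key):
              return None                                 # the key is our own ID; we are the destination
          return routing_table.get(p, {}).get(key[p])     # None -> fall back to the leaf set (omitted)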

  8. Content Based Indexing • Identify boundary regions, called anchors • Anchors divide files into chunks • Rabin fingerprints • a fingerprint is computed for each overlapping k-byte substring in a file • if its low-order bits match a predetermined value, the offset is marked as an anchor • Edits only change the chunks they touch • Each chunk is named by the SHA-1 hash of its content
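  The chunking idea can be sketched as below; a simple rolling polynomial hash stands in for true Rabin fingerprints, and the window size and bit mask are illustrative guesses, not the values used in the paper.

      import hashlib

      WINDOW = 48            # sliding-window size in bytes (illustrative, not the paper's value)
      MASK = (1 << 13) - 1   # check 13 low-order bits -> chunks of roughly 8 KB on average
      BASE = 257
      MOD = (1 << 61) - 1

      def chunk_boundaries(data):
          # Yield anchor offsets using a rolling hash over each WINDOW-byte substring.
          if len(data) <= WINDOW:
              yield len(data)
              return
          top = pow(BASE, WINDOW - 1, MOD)        # weight of the byte leaving the window
          h = 0
          for b in data[:WINDOW]:
              h = (h * BASE + b) % MOD
          for i in range(WINDOW, len(data)):
              if (h & MASK) == MASK:              # low-order bits match the predetermined value
                  yield i                         # mark this offset as an anchor
              h = ((h - data[i - WINDOW] * top) * BASE + data[i]) % MOD
          yield len(data)                         # the final chunk ends at end of file

      def chunks(data):
          # Split at anchors and name each chunk by the SHA-1 hash of its contents.
          start = 0
          for end in chunk_boundaries(data):
              piece = data[start:end]
              if piece:
                  yield hashlib.sha1(piece).hexdigest(), piece
              start = end

  Because anchors depend only on local content, an edit shifts boundaries within its own neighborhood and the remaining chunks keep their names.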

  9. Convergent Encryption • Encrypt a file using a key derived from the file's contents • Further encrypt that key using the client's key • Clients can share encrypted files without sharing keys • Applied to all on-disk chunks in Pastiche • Also used in Farsite • in Farsite the encrypted key is stored with the file; in Pastiche keys are stored in meta-data chunks • Backup buddies know they store the same data, a source of information leakage in Pastiche

  10. Design (1) • Data is stored as immutable chunks on disk • Content-based indexing + convergent encryption • Each chunk carries an owner list and maintains a reference count • When a newly written file is closed, it is scheduled for chunking, producing for each chunk: • Hc – handle • Ic – chunk ID • Kc – encryption key
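  A sketch of how a chunk's handle, key, and ID might be produced by the convergent encryption of the previous two slides. The exact derivation of Hc, Ic, and Kc is not spelled out here, so the hash choices are assumptions, and the XOR keystream is a toy stand-in for a real cipher.

      import hashlib

      def _keystream(key, length):
          # Toy keystream (SHA-256 in counter mode); a real system would use a block cipher.
          out = bytearray()
          counter = 0
          while len(out) < length:
              out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
              counter += 1
          return bytes(out[:length])

      def encrypt_chunk(plaintext):
          # Kc is derived from the chunk's own contents, so identical plaintext chunks on
          # different machines encrypt to identical ciphertext and can be shared without
          # sharing any per-client key.
          kc = hashlib.sha256(plaintext).digest()       # Kc: content-derived key
          hc = hashlib.sha1(plaintext).hexdigest()      # Hc: handle (assumed: hash of the plaintext)
          ciphertext = bytes(p ^ k for p, k in zip(plaintext, _keystream(kc, len(plaintext))))
          ic = hashlib.sha1(ciphertext).hexdigest()     # Ic: chunk ID (assumed: hash of the ciphertext)
          return hc, ic, kc, ciphertext

  Kc itself would then be protected with the client's own key and kept in the file's meta-data chunk, as described on the convergent encryption slide.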

  11. Design(2) • Chunks can be stored on the local disk, on a backup client, or both • The list of chunk IDs describing a node's file system forms its signature • Chunking simplifies the implementation of chunk sharing, convergent encryption, and backup/restore • If a chunk already exists, add the local host to its owner list and increment the reference count • Otherwise encrypt it, append a MAC, and write it to disk with a reference count of 1 for the local owner

  12. Design(3) • A remote host supplies its public key with a backup storage request • A removal request • must be signed with the corresponding secret key • the host is then removed from the owner list • For local deletes, the reference count is decreased; when it reaches 0, the local owner is removed • When the owner list becomes empty, the chunk's storage space is reclaimed
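  The write and removal path on the last two slides can be condensed into a small in-memory sketch; the class and method names are invented for illustration, and the signature check on removal requests is only noted in a comment.

      class ChunkStore:
          def __init__(self):
              self.chunks = {}   # chunk_id -> ciphertext (with MAC appended, in the real system)
              self.owners = {}   # chunk_id -> set of owners; its size is the reference count

          def store(self, chunk_id, ciphertext, owner):
              if chunk_id in self.chunks:
                  # Chunk already present: just add the owner / bump the reference count.
                  self.owners[chunk_id].add(owner)
              else:
                  # New chunk: write it out with a reference count of 1 for this owner.
                  self.chunks[chunk_id] = ciphertext
                  self.owners[chunk_id] = {owner}

          def remove(self, chunk_id, owner):
              # In Pastiche a removal request must be signed with the owner's secret key and
              # verified against the public key supplied earlier; verification is omitted here.
              owners = self.owners.get(chunk_id)
              if owners is None:
                  return
              owners.discard(owner)
              if not owners:
                  # Owner list empty: reclaim the chunk's storage space.
                  del self.owners[chunk_id]
                  del self.chunks[chunk_id]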

  13. Design (4)- File Metadata • Meta-data contains • the list of handles for the file's chunks • the usual contents (ownership, permissions, creation and modification dates, etc.) • Meta-data is mutable but keeps fixed Hc, Kc, and Ic • For the file system root meta-data, Hc is generated from a host-specific passphrase • Meta-data chunks are stored encrypted for protection

  14. Abstracts • The amount of data written after the initial installation is relatively small • The initial backup of a freshly installed machine is the most expensive • The ideal backup buddy is one with complete coverage • One option: ship the full signature to candidates and have them report the degree of overlap • But signatures are large, 20 bytes per chunk • Instead, send a small random subset of the signature, called an abstract
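  A sketch of how an abstract might be drawn and scored; the sampling fraction is an arbitrary placeholder, not a figure from the slides.

      import random

      def make_abstract(signature, fraction=0.01):
          # Ship only a small random subset of the chunk-ID signature to candidates.
          k = max(1, int(len(signature) * fraction))
          return random.sample(signature, k)

      def coverage(abstract, candidate_chunk_ids):
          # Fraction of the abstract the candidate already stores; this estimates the
          # overlap that the full signature would have reported.
          if not abstract:
              return 0.0
          return sum(1 for cid in abstract if cid in candidate_chunk_ids) / len(abstract)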

  15. Finding Set of Buddies • Two Pastry overlays are used • one organized by network proximity, one by file system overlap • Nodes join the first overlay with their nodeID • A Pastry message carrying the abstract is routed toward a random nodeID • Each node along the route returns • its address • its coverage of the abstract • If unsuccessful, the probe is repeated with the first digit of the original nodeID varied • This rotating probing process is the lighthouse sweep
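  The rotating probe can be sketched as below; route_probe is a hypothetical helper that routes a Pastry message carrying the abstract toward the given nodeID and returns the (address, coverage) pairs reported along the route, and the buddy-count and coverage thresholds are illustrative.

      HEX_DIGITS = "0123456789abcdef"

      def lighthouse_sweep(node_id, abstract, route_probe, wanted=5, threshold=0.9):
          buddies = []
          for digit in HEX_DIGITS:                 # rotate the leading digit of our own nodeID
              target = digit + node_id[1:]
              for address, cov in route_probe(target, abstract):
                  if cov >= threshold:
                      buddies.append((address, cov))
              if len(buddies) >= wanted:           # enough well-covering buddies found
                  break
          return buddies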

  16. Finding Set of Buddies(2) • What about machines with rare installations? • Such nodes join the second overlay, the coverage-rate overlay • File system overlap is used as the distance metric • The neighbor set holds the best-covering nodes encountered during the join • Backup buddies are selected from the neighbor set • Coverage rate is not symmetric between nodes • Malicious nodes could under- or over-report their coverage rate

  17. Backup Protocol • Each backup event is viewed as a single snapshot • The meta-data skeleton for each snapshot is stored as persistent per-file logs • Stored both on the machine itself and on all of its buddies • The state necessary for a new snapshot includes: • the list of chunks to be added • the list of chunks to be removed • the meta-data objects in the skeleton that change as a result

  18. Snapshot Process • The snapshot host (A) sends its public key to a backup buddy (B) • to be used for later validation of requests • A forwards the chunk IDs of the add set to B • B fetches from A the chunks it does not already store • A sends a delete list signed with its private key • the list contains only chunk IDs that no longer belong to any snapshot • A sends the updated meta-data chunks, signed with its private key • A sends a commit request for the checkpoint; B responds once all changes are stored persistently
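  Seen from host A's side, the exchange above might look like the outline below; buddy is a hypothetical RPC stub and sign signs with A's private key, so every name here is an assumption about an interface the slides do not define.

      def take_snapshot(buddy, public_key, add_ids, chunk_data, delete_ids, metadata_chunks, sign):
          buddy.begin_snapshot(public_key)             # 1. key used to validate later requests
          missing = buddy.offer_chunks(add_ids)        # 2. B reports which of the add set it lacks
          for cid in missing:                          # 3. B obtains the chunks it does not store
              buddy.put_chunk(cid, chunk_data[cid])
          buddy.delete_chunks(delete_ids, sign(delete_ids))             # 4. signed delete list
          buddy.put_metadata(metadata_chunks, sign(metadata_chunks))    # 5. signed meta-data updates
          buddy.commit()                               # 6. B acknowledges once state is persistent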

  19. Restoration • Each node retains its archive skeleton • Partial restores are straightforward • obtain the requested chunks from the relevant buddies • To recover an entire machine • a complete copy of the root meta-data object is kept at each member of the leaf set • rejoin the first overlay with the same nodeID • retrieve the root meta-data object from one of the leaves • decrypt it with the key generated from the host's passphrase • the root block contains the list of backup buddies
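  For a total-loss restore, the bootstrap step might look roughly like this; the hash choices, the get_chunk call on leaf-set members, and the decrypt helper are all assumptions made for the sketch.

      import hashlib

      def recover_root(passphrase, leaf_set, decrypt):
          # Both the root meta-data handle and its key are derived from the host-specific
          # passphrase, so nothing stored locally needs to survive the loss.
          root_handle = hashlib.sha1(passphrase.encode()).hexdigest()
          key = hashlib.sha256(passphrase.encode()).digest()
          for leaf in leaf_set:                         # every leaf-set member keeps a copy
              blob = leaf.get_chunk(root_handle)
              if blob is not None:
                  return decrypt(key, blob)             # the root lists the node's backup buddies
          raise RuntimeError("no leaf-set member holds the root meta-data object")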

  20. Detecting Failure and Malice • Untrusted buddies: • may reclaim chunk storage if they run out of disk space • may fail or leave the network • A malicious buddy may merely claim to store the chunks • Solution: a probabilistic mechanism • before taking a new snapshot, query each buddy for a random subset of chunks from the node's archive • if a buddy fails the check, remove it from the list and search for a replacement • Sybil attacks • a malicious party occupies a substantial fraction of the nodeID space • Pastiche has no defense against this • defending would require some centralized authority to certify identities
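  A sketch of the pre-snapshot spot check; buddy.get_chunk is a hypothetical RPC, and the sample size is an arbitrary choice rather than the paper's.

      import random

      def audit_buddy(buddy, archive_chunk_ids, local_chunks, sample_size=32):
          # Query the buddy for a random subset of our archive and compare against our own
          # copies; a single failed probe marks the buddy for replacement.
          probes = random.sample(archive_chunk_ids, min(sample_size, len(archive_chunk_ids)))
          for cid in probes:
              if buddy.get_chunk(cid) != local_chunks[cid]:
                  return False
          return True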

  21. Preventing Greed • A greedy host aggressively consumes storage and never retires it, the free-rider problem familiar from P2P systems • A distributed quota-enforcement mechanism is needed • allow nodes to occupy only as much space as they contribute • Three candidate solutions: • equivalence classes based on resources consumed • forcing nodes to solve cryptographic puzzles in proportion to the storage they consume • accounting for space with electronic currency • Currency accounting requires the exchange of currency and backup state to be an atomic transaction

  22. Implementation • Chunkstore file system • implemented primarily in user space (pclientd) • a small in-kernel portion implements the vnode interface • container files: an LRU cache of decrypted, recently used files, kept for performance • chunks increase internal fragmentation • Backup daemon • server side: manages remote requests for storage and restoration • client side: supervises the selection of buddies and the taking of snapshots • cleans meta-data logs and obtains the deleted chunks

  23. Evaluation • Chunkstore is compared with ext2fs on a modified Andrew benchmark: • the total overhead of 7.4% is reasonable • the copy phase is expensive, taking 80% longer on chunkstore • the overhead is due to meta-data and container-file maintenance and to computing Rabin fingerprints to find anchors • otherwise chunkstore and ext2fs performed within 1% of each other • Backup and restore compare favorably to an NFS cross-machine copy of the source tree • Conclusion: the service does not greatly degrade file system performance

  24. Evaluation(2) • The impact of abstracts is measured • coverage is first computed with full signatures • then with uniform random samples at rates of 10%, 1%, 0.1%, and 0.01% • the coverage of existing machines is compared against a freshly installed machine • The estimates are largely independent of sample size • abstract size does not matter much • However, abstracts are only effective for nodes with good coverage

  25. Evaluation (3) • How effective is the lighthouse sweep? • Simulations with 50,000 nodes drawn from a distribution of 11 installation types, on 25 different Pastry networks • Nodes whose type makes up 10% or more of the population find a sufficient number of buddies in the overlay

  26. Evaluation(4) • How effective is the coverage-rate overlay? • Simulations with 10,000 nodes, each belonging to one of 1,000 species • only nodes of the same species can be backup buddies • With a neighborhood-set size of 256 • 85% of nodes found at least 1 buddy • 72% found at least 5 buddies • Increasing the neighborhood size raised • the fraction finding at least 1 buddy from 38% to 85% • the fraction finding at least 5 buddies from 1% to 72% • Neighborhood size has a significant effect

  27. Evaluation (5) • Pastiche nodes need to query just enough chunks, q, to detect corrupted state with probability p • Results indicate that the number of queries grows slowly with respect to backup size • Query costs are modest
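  A back-of-the-envelope reading of this, not taken from the slides: if a buddy has silently dropped a fraction f of the archive's chunks, a single random probe misses the damage with probability 1 - f, so q independent probes detect it with probability p = 1 - (1 - f)^q, i.e. q = ln(1 - p) / ln(1 - f). For example, catching a 1% loss with 95% confidence needs roughly 300 probes, regardless of how large the archive is.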

  28. Conclusions • Pastiche is a P2P automatic backup system • Is it really deployable for end users? • Evaluation results show the service does not penalize file system performance • Detection of failed or malicious nodes requires administrative intervention • Much improvement is needed to provide privacy and to prevent greedy users
