
Multi-level Selective Deduplication for VM Snapshots in Cloud Storage


Presentation Transcript


  1. Multi-level Selective Deduplication for VM Snapshots in Cloud Storage
  Wei Zhang*, Hong Tang†, Hao Jiang†, Tao Yang*, Xiaogang Li†, Yue Zeng†
  * University of California at Santa Barbara
  † Aliyun.com Inc.

  2. Motivations
  • Virtual machines on the cloud use frequent backup to improve service reliability
    • Used in Alibaba's Aliyun - the largest public cloud service in China
  • High storage demand
    • Daily backup workload: hundreds of TB @ Aliyun
    • Number of VMs per cluster: 10,000+
    • Large content duplicates
  • Limited resources for deduplication
    • No special hardware or dedicated machines
    • Small CPU & memory footprint

  3. Focus and Related Work
  • Previous work
    • Version-based incremental snapshot backup: inter-block/VM duplicates are not detected
    • Chunk-based file deduplication: high cost for chunk lookup
  • Focus on
    • Parallel backup of a large number of virtual disks
    • Large files for VM disk images
  • Contributions
    • Cost-constrained solution with very limited computing resources
    • Multi-level selective duplicate detection and parallel backup

  4. Requirements
  • Negligible impact on existing cloud services and VM performance
    • Must minimize CPU and I/O bandwidth consumption for backup and deduplication workload (e.g. <1% of total resources)
  • Fast backup speed
    • Complete backup for 10,000+ users within a few hours each day during light cloud workload
  • Fault tolerance constraint
    • Addition of data deduplication should not decrease the degree of fault tolerance

  5. Design Considerations
  • Design alternatives
    • An external and dedicated backup storage system
    • A decentralized and co-hosted backup system with full deduplication
  [Diagram: multiple cloud-service nodes sending backup traffic to a backup store vs. co-hosted backup services]

  6. Design Considerations
  • Decentralized architecture running on a general-purpose cluster
    • Co-hosting both elastic computing and the backup service
  • Multi-level deduplication
    • Localize backup traffic and exploit data parallelism
    • Increase fault tolerance
  • Selective deduplication
    • Use minimal resources while still removing most redundant content and achieving good deduplication efficiency

  7. Key Observations
  • Inner-VM data characteristics
    • Exploit unchanged data to localize deduplication
  • Cross-VM data characteristics
    • A small set of common data dominates the duplicates
    • Zipf-like distribution of VM OS/user data
    • Separate consideration of OS and user data

  8. VM Snapshot Representation
  • Segments are fixed-sized
  • Data blocks are variable-sized
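To make this representation concrete, here is a minimal sketch of a snapshot recipe layout: fixed-size segments (2MB in the evaluation), each holding variable-sized blocks identified by content fingerprints. The class and field names, and the SHA-1 fingerprint choice, are assumptions for illustration, not the paper's actual metadata format.

```python
# Illustrative sketch only; names and layout are assumed, not the authors' format.
from dataclasses import dataclass, field
from typing import List

SEGMENT_SIZE = 2 * 1024 * 1024        # fixed-size segments (2MB in the evaluation)

@dataclass
class BlockRef:
    fingerprint: bytes                # content hash of the block (e.g. SHA-1, assumed)
    length: int                       # variable-sized blocks, ~4KB on average
    source: str                       # "parent", "cds", or "new" (where the data lives)

@dataclass
class SegmentRecipe:
    index: int                        # segment offset = index * SEGMENT_SIZE
    blocks: List[BlockRef] = field(default_factory=list)

@dataclass
class SnapshotRecipe:
    vm_id: str
    segments: List[SegmentRecipe] = field(default_factory=list)
```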

  9. Processing Flow of Multi-level Deduplication

  10. Data Processing Steps
  • Segment-level checkup
    • Use the dirty bitmap to see which segments are modified
  • Block-level checkup
    • Divide a segment into variable-sized blocks and compare their signatures with the parent snapshot
  • Checkup against the common dataset (CDS)
    • Identify duplicate chunks from the CDS
  • Write new snapshot blocks
    • Write new content chunks to storage
  • Save recipes
    • Save segment metadata information
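A minimal sketch of this multi-level checkup flow, assuming hypothetical interfaces (dirty_bitmap, parent_fps, cds_index, chunk_store) that stand in for the system's real components; it is not the authors' implementation:

```python
import hashlib

def split_blocks(segment_data, avg=4096):
    # Stand-in chunker: the real system uses variable-sized, content-defined
    # blocks (~4KB on average); a fixed split keeps this sketch short.
    return [segment_data[i:i + avg] for i in range(0, len(segment_data), avg)]

def backup_segment(seg_id, segment_data, dirty_bitmap, parent_fps, cds_index, chunk_store):
    # Step 1: segment-level checkup via the dirty bitmap.
    if not dirty_bitmap[seg_id]:
        return {"segment": seg_id, "ref": "parent"}     # reuse the parent snapshot's segment

    recipe = []
    for block in split_blocks(segment_data):
        fp = hashlib.sha1(block).digest()
        if fp in parent_fps.get(seg_id, set()):         # Step 2: block-level checkup vs. parent
            recipe.append((fp, "parent"))
        elif fp in cds_index:                           # Step 3: checkup against the CDS
            recipe.append((fp, "cds"))
        else:                                           # Step 4: write new content to storage
            chunk_store[fp] = block
            recipe.append((fp, "new"))
    return {"segment": seg_id, "blocks": recipe}        # Step 5: saved as the segment recipe
```

Note how an unmodified segment is skipped entirely at the first level, so no chunking, hashing, or index lookups are spent on it; this is what localizes most of the deduplication work inside each VM.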

  11. Architecture of Multi-level VM Snapshot Backup
  [Diagram: backup components co-hosted on each cluster node]

  12. Status & Evaluation
  • Prototype system running on Alibaba's Aliyun cloud
    • Based on Xen
    • 100 nodes; each has 16 cores, 48GB memory, 25 VMs
    • Uses <150MB per machine for backup & deduplication
  • Evaluation data from Aliyun's production cluster
    • 41TB
    • 10 snapshots per VM
    • Segment size: 2MB
    • Avg. block size: 4KB

  13. Data Characteristics of the Benchmark
  • Each VM uses 40GB of storage space on average
    • OS and user data disks each take ~50% of the space
  • OS data
    • 7 mainstream OS releases: Debian, Ubuntu, Redhat, CentOS, Win2003 32-bit, Win2003 64-bit, and Win2008 64-bit
  • User data
    • From 1323 VM users

  14. Impacts of 3-Level Deduplication
  • Level 1: Segment-level detection within a VM
  • Level 2: Block-level detection within a VM
  • Level 3: Common data block detection across VMs

  15. Impact for Different OS Releases

  16. Separate consideration of OS and user data
  • Both have a Zipf-like data distribution
  • But popularity growth differs as the cluster size / number of VM users increases
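For reference, "Zipf-like" here means the r-th most popular data block appears with frequency roughly proportional to 1/r^alpha. A small illustrative sketch (the exponent value below is arbitrary; the slides do not give one):

```python
# Illustrative Zipf-like popularity curve: the r-th most popular block's
# frequency falls off roughly as 1/r**alpha (alpha chosen arbitrarily here).
def zipf_frequencies(num_blocks, alpha=1.0):
    weights = [1.0 / (r ** alpha) for r in range(1, num_blocks + 1)]
    total = sum(weights)
    return [w / total for w in weights]   # normalized popularity by rank

print(zipf_frequencies(5))   # the few top-ranked blocks dominate the mass
```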

  17. Commonality among OS releases
  • 1GB of common OS metadata covers 70+%

  18. Cumulative coverage of popular user data
  • Coverage is the summation of covered data block size * frequency
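As a rough illustration of this metric (the function and variable names are assumed, not from the slides), cumulative coverage can be computed by ranking blocks by popularity and summing size * frequency:

```python
# Illustrative sketch of the coverage metric: for each popular block,
# size * frequency is the raw bytes it accounts for across all snapshots.
def cumulative_coverage(blocks, total_raw_bytes):
    """blocks: iterable of (size_bytes, frequency); returns the cumulative
    fraction of raw data covered as more popular blocks are included."""
    ranked = sorted(blocks, key=lambda b: b[0] * b[1], reverse=True)
    covered, curve = 0, []
    for size, freq in ranked:
        covered += size * freq
        curve.append(covered / total_raw_bytes)
    return curve

# e.g. three 4KB blocks appearing 1000, 100, and 10 times across snapshots
print(cumulative_coverage([(4096, 1000), (4096, 100), (4096, 10)], 4096 * 1110))
```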

  19. Space saving compared to perfect deduplication as CDS size increases
  • 100GB CDS (1GB index) -> 75% of perfect dedup
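A back-of-envelope check (an assumption, not stated on the slide) of why a 100GB CDS pairs with roughly a 1GB in-memory index:

```python
# With ~4KB average blocks, 100GB of common data is ~26M blocks; at roughly
# 40 bytes per index entry (a 20-byte SHA-1 fingerprint plus location
# metadata -- assumed sizes), the fingerprint index lands near 1GB.
cds_bytes   = 100 * 2**30      # 100GB common data set
avg_block   = 4 * 2**10        # ~4KB average block size
entry_bytes = 40               # assumed per-entry index cost

num_blocks  = cds_bytes // avg_block
index_gib   = num_blocks * entry_bytes / 2**30
print(num_blocks, round(index_gib, 2))   # 26214400 blocks, ~0.98 GiB
```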

  20. Impact of dataset-size increase

  21. Conclusions
  • Contributions
    • A multi-level selective deduplication scheme among VM snapshots
    • Inner-VM deduplication localizes backup and exposes more parallelism
    • Global deduplication with a small common data set found in OS and user data disks
    • Uses less than 0.5% of memory per node to meet a stringent cloud resource requirement -> accomplishes 75% of what perfect deduplication does
  • Experiments
    • Achieve 500TB/hour backup throughput on a 1000-node cloud cluster
    • Reduce bandwidth by 92%: 500TB/hour of raw snapshot data -> 40TB/hour actually transferred
