
Supporting Content-Addressable Caching with CZIP Compression



Presentation Transcript


  1. Supporting Content-Addressable Caching with CZIP Compression KyoungSoo Park, Sunghwan Ihm, Mic Bowman* and Vivek Pai Princeton University *Intel Research

  2. Content-Based Naming (CBN) • Naming scheme based on the content itself • Name = one-way hash (content) • Hashing function: MD5, SHA-1, etc. • Rabin’s fingerprint for chunk detection • Redundancy elimination • Network-traffic/storage systems • Research/commercial systems • Special-purpose systems USENIX 2007
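The naming rule on slide 2 (name = one-way hash of the content) can be sketched in a few lines of Python. This is only an illustration: `content_name` and `ChunkStore` are hypothetical names, SHA-1 is picked from the hash functions the slide lists, and nothing here is taken from the CZIP code itself.

```python
import hashlib


def content_name(data: bytes) -> str:
    """Name a piece of content by the SHA-1 hash of its bytes."""
    return hashlib.sha1(data).hexdigest()


class ChunkStore:
    """Toy content-addressed store: identical chunks are kept only once."""

    def __init__(self):
        self._chunks = {}

    def put(self, data: bytes) -> str:
        name = content_name(data)
        # Storing the same bytes twice is a no-op because the name is the same.
        self._chunks.setdefault(name, data)
        return name

    def get(self, name: str) -> bytes:
        return self._chunks[name]

    def __len__(self) -> int:
        return len(self._chunks)


store = ChunkStore()
names = [store.put(chunk) for chunk in (b"chunk A", b"chunk B", b"chunk A")]
print(len(names), "chunks written,", len(store), "chunks stored")
```

Because the name depends only on the bytes, the store eliminates the repeated chunk without ever comparing file names or locations.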

  3. Where Can CBN be Applied? • Similar file distribution • Linux distribution mirror • DVD ISO contains all CD ISOs • Virtual machine image migration • Base OS takes up majority of content • httpd VM vs. httpd+mysqld VM • Uncacheable Web content • Some dynamic content doesn’t change

  4. Contribution of This Work • Generic CBN tool • Easy to build new systems • Easy to upgrade existing non-CBN systems • CZIP compression + CZIP-aware apps • Can be used on existing platforms • Provides benefit to non-CZIP apps • Demonstrate sample systems • Reduces FC6 mirror memory footprint by half • Compression speed comparable to GZIP’s • 2x throughput for CZIP-aware Apache • 4x origin server BW reduction for CZIP-aware CDN

  5. CZIP Compression • Compression scheme like GZIP, BZIP2 • Export CBN information in the header [Figure: CZIP and UNCZIP convert between a file whose chunks A, B, and C repeat and a CZIP file made of a CZIP header (global fields, chunk indexes 1 to 5) plus the chunks A, B, and C.]

  6. CZIP Header • Header = global attributes + chunk info • Global attributes • One-way hash function (SHA-1/MD5) • Chunk data compression (GZIP/BZIP2) • Convergent encryption (on/off) • Header CRC, File Hash, etc. • Chunk information • Content hash, start offset, chunk size
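The header layout on slide 6 can be pictured as a small data structure. The sketch below is an assumption-heavy illustration, not the real CZIP on-disk format: the class and field names are invented, chunks are cut at a fixed size for simplicity, and the start offset is assumed to refer to the uncompressed input.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ChunkInfo:
    content_hash: str   # content-based name of the chunk
    start_offset: int   # assumed: offset within the uncompressed input
    chunk_size: int


@dataclass
class CzipHeader:
    # Global attributes named on the slide; the values are illustrative.
    hash_function: str = "sha1"
    chunk_compression: str = "gzip"
    convergent_encryption: bool = False
    file_hash: str = ""
    chunks: list = field(default_factory=list)


def build_header(data: bytes, chunk_size: int = 4) -> CzipHeader:
    """Describe `data` as a list of hashed, fixed-size chunks."""
    header = CzipHeader(file_hash=hashlib.sha1(data).hexdigest())
    for offset in range(0, len(data), chunk_size):
        piece = data[offset:offset + chunk_size]
        header.chunks.append(
            ChunkInfo(hashlib.sha1(piece).hexdigest(), offset, len(piece))
        )
    return header


header = build_header(b"AAAABBBBAAAACC")
unique_hashes = {chunk.content_hash for chunk in header.chunks}
print(len(header.chunks), "chunk entries,", len(unique_hashes), "distinct chunks")
```

A reader of such a header can tell which chunks it already holds before touching any chunk data, which is what the deployment scenarios on the following slides rely on.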

  7. Deployment Scenario • CZIP-aware server [Figure: Client A requests file1.cz and Client B requests file2.cz. For file1.cz the server reads the header and the chunks A (xyzlo5g) and B (asdfghk) into its CBN cache. For file2.cz, whose header lists chunks A, B, and C (qoiertty), it reads the header and then only chunk C.]

  8. Deployment Scenario • CZIP-aware client-side proxy • Request for the missing chunk: GET /file2.cz, Range: bytes=1000-1999, X-SHA-1: qoiertty • 1. X-SHA-1 field helps CZIP-aware server • 2. Browser cache can support CBN too! [Figure: Clients A and B reach the server through a proxy that holds the CBN cache. The cache already holds the file1.cz header and chunks A (xyzlo5g) and B (asdfghk), so for file2.cz (chunks A, B, and C) only chunk C (qoiertty) is read from the server.]
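The proxy behaviour on slide 8 amounts to a cache lookup keyed by the chunk hash, followed by a byte-range request on a miss. A minimal sketch follows; the function names, the dictionary cache, and the `fetch` callback are assumptions made for illustration, and only the `Range` and `X-SHA-1` header names come from the slide.

```python
def range_headers(start_offset: int, chunk_size: int, content_hash: str) -> dict:
    """Headers asking for one chunk by byte range, tagged with its hash."""
    end = start_offset + chunk_size - 1  # HTTP byte ranges are inclusive
    return {"Range": f"bytes={start_offset}-{end}", "X-SHA-1": content_hash}


def get_chunk(cache: dict, path: str, start_offset: int, chunk_size: int,
              content_hash: str, fetch) -> bytes:
    """Return a chunk from the CBN cache, fetching it only on a miss.

    `fetch(path, headers)` is a hypothetical callback that performs the
    actual HTTP request and returns the response body.
    """
    if content_hash in cache:
        return cache[content_hash]
    body = fetch(path, range_headers(start_offset, chunk_size, content_hash))
    cache[content_hash] = body
    return body


# Simulated origin server that records every request it receives.
requests_seen = []

def fake_fetch(path, headers):
    requests_seen.append((path, headers))
    return b"chunk C bytes"

cache = {"xyzlo5g": b"chunk A bytes", "asdfghk": b"chunk B bytes"}
get_chunk(cache, "/file2.cz", 0, 500, "xyzlo5g", fake_fetch)        # cache hit
get_chunk(cache, "/file2.cz", 1000, 1000, "qoiertty", fake_fetch)   # cache miss
print(requests_seen)
```

The cache key is the hash rather than the URL, so a chunk fetched for file1.cz also satisfies a request for file2.cz.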

  9. Compressibility • Fedora Core 6 ISOs / All files / Wikipedia DB [Figure: bar chart of data compression ratio (axis 0 to 1) for CZIP+plain, CZIP+gzip, CZIP+bzip2, GZIP, and BZIP2 on FC6_i386_ISOs.tar (6.7 GB), FC6_All_files.tar (49.7 GB), and Wikipedia_DB.tar (7.9 GB). Bar labels, in extraction order: 7.9, 6.5, 6.5, 48.3, 48.5, 3.3, 3.2, 3.2, 20.3, 19.9, 19.6, 2.7, 2.5, 2.5, 1.9.]

  10. Compression speed • On Pentium D 2.8GHz with 4GB memory [Figure: chart of compression times with labeled values 29,004 secs, 3,151 secs, and 3,964 secs.]

  11. Virtual Machine Images • Server consolidation/management • Much redundancy among similar VMs • Xen FC4 base image (X) • X + httpd (Y) / Y + mysqld (Z) • Investigating how content overlap varies with • Chunk size • Chunking methods • Rabin’s fingerprint vs. fixed-sized • After extensive use

  12. Chunk Size / Chunking Methods • Compare three VM images • Base = Xen FC4 image / Apache = Base + httpd / Both = Apache + mysqld [Figure: two panels, "Rabin’s fingerprint" and "Fixed-sized chunking".]
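The difference between the two chunking methods compared on slide 12 shows up when a byte is inserted near the start of a file. The sketch below is a simplified stand-in, not Rabin's fingerprint: it places a boundary wherever the CRC32 of the last `WINDOW` bytes has its low bits equal to zero, and it recomputes that hash at every position instead of rolling it. `WINDOW`, `MASK`, and the 64-byte fixed size are arbitrary illustrative values.

```python
import random
import zlib

WINDOW = 16   # bytes hashed at each position (illustrative value)
MASK = 0x3F   # boundary when the low 6 bits are zero (about 64-byte chunks)


def fixed_chunks(data: bytes, size: int = 64) -> list:
    """Cut data every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes) -> list:
    """Cut data wherever the hash of the trailing window matches MASK."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        # Only the last WINDOW bytes are hashed, so a boundary depends on
        # local content and not on the position within the file.
        if zlib.crc32(data[end - WINDOW:end]) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared(a: list, b: list) -> int:
    """Number of distinct chunks that appear in both lists."""
    return len(set(a) & set(b))


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))
modified = b"X" + original  # one inserted byte shifts everything after it

fixed_shared = shared(fixed_chunks(original), fixed_chunks(modified))
cdc_shared = shared(content_defined_chunks(original),
                    content_defined_chunks(modified))
print("fixed-size shared chunks:", fixed_shared)
print("content-defined shared chunks:", cdc_shared)
```

With fixed-size chunking the inserted byte shifts every later boundary, while content-defined boundaries move with the data, so every chunk after the first boundary is found again in the modified file.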

  13. Real VM Images • EC1 ~ EC5: VMs based on Xen FC-4 + standard tools • Used daily by five different engineers for three weeks

  14. Dynamic Web Pages • Observed the front page of these sites • Google News • CNN • Slashdot • Digg.com • Fark.com • New York Times • All of them non-cacheable • “no-cache”, “no-store” or “private”

  15. Average Content Overlap • Downloaded pages every 10 minutes for 18 days

  16. Potential Data Savings via CZIP [Figure: chart with labeled savings of 37%, 39%, 61%, 24%, 57%, and 90%.]

  17. Summary So Far • CZIP is comparable to GZIP in speed and compression performance • CZIP is far better on files with much redundancy • Redundancy decreases as chunk size increases • Rabin’s fingerprint exposes a good deal of redundancy regardless of chunk size • Optimal chunk size varies by workload • Bigger chunk size is better for network transfer • Dynamic content also exposes redundancy • CZIP can save 24-90% of BW compared with GZIP

  18. Server Performance • CZIP Apache Module • Test scenario (FC mirror simulation) • 1.5 GB from FC6 DVD • 1.5 GB is split into three 0.5 GB images • Each file is requested in round-robin fashion • 100-300 clients simulated by six machines in LAN • Server is a 2.8GHz Pentium D w/ 2GB physical memory and 2 Gbps-NICs

  19. CZIP Apache Module • Worst client in CZIP-aware Apache is faster than 91% of normal Apache clients [Figure: chart annotated "90%: 2.56 times" and "Median: 2.07 times".]

  20. CBN-Aware Content Distribution • CoBlitz large-file CDN [NSDI’06] • Serving 1-2 TB every day on PlanetLab • http://coblitz.codeen.org/URL • University channel – podcast/vodcast • Fedora Core mirror, Citeseer etc. • Chunk is basic caching unit • Parallel chunk requests/responses • Chunk request in HTTP byte-range query

  21. Making CoBlitz CZIP-Aware • CoBlitz’s chunk request: GET /coblitz.codeen.org/www.cs.princeton.edu/bigfile.cz,start=1000,end=1999 HTTP/1.0 | Host: coblitz.codeen.org • CZIP-aware CoBlitz (C-CoBlitz) request: GET /czip.codeen.org/Chunk_SHA-1_Hash HTTP/1.0 | Host: czip.codeen.org | X-URL: www.cs.princeton.edu/bigfile.cz | X-Range: byte=1000-1999
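The two request formats on slide 21 differ in what the request line identifies: a byte range of one URL versus a chunk hash. The sketch below only assembles the request text as it appears on the slide; the function names are hypothetical, and the hash value used in the example is a made-up placeholder.

```python
def coblitz_request(origin_url: str, start: int, end: int) -> str:
    """Regular CoBlitz chunk request: the URL and byte range name the chunk."""
    return (
        f"GET /coblitz.codeen.org/{origin_url},start={start},end={end} HTTP/1.0\r\n"
        "Host: coblitz.codeen.org\r\n"
        "\r\n"
    )


def c_coblitz_request(chunk_hash: str, origin_url: str, start: int, end: int) -> str:
    """C-CoBlitz chunk request: the content hash names the chunk.

    The origin URL and range travel in X-URL and X-Range so the chunk can
    still be fetched from the origin server on a cache miss.
    """
    return (
        f"GET /czip.codeen.org/{chunk_hash} HTTP/1.0\r\n"
        "Host: czip.codeen.org\r\n"
        f"X-URL: {origin_url}\r\n"
        f"X-Range: byte={start}-{end}\r\n"
        "\r\n"
    )


placeholder_hash = "0123456789abcdef0123456789abcdef01234567"  # made-up value
first = c_coblitz_request(placeholder_hash, "www.cs.princeton.edu/bigfile.cz", 1000, 1999)
second = c_coblitz_request(placeholder_hash, "www.cs.princeton.edu/other.cz", 0, 999)
print(first.splitlines()[0])
print(second.splitlines()[0])
```

Two files that share a chunk produce the same C-CoBlitz request line, so a cache keyed on that line can serve both, whereas the regular CoBlitz request lines differ for every URL and range.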

  22. CZIP-Aware CoBlitz Testing • Two content-overlapping files • Simultaneously fetch from 100 PlanetLab nodes • Origin server is at Princeton • Testing cases • Regular: Download original files by regular CoBlitz • File-CZIP: Download CZIP’ed files by regular CoBlitz • CZIP-CDN: Download CZIP’ed files by C-CoBlitz

  23. 100 MB File Downloading [Figure: chart comparing Regular, File-CZIP, and CZIP-CDN, with labeled values 388 MB; 273 MB, 29.6%; and 191 MB, 29.7%.]

  24. 50 MB File Downloading [Figure: chart comparing Regular, File-CZIP, and CZIP-CDN, with labeled values 183 MB; 92 MB, 49.7%; and 24 MB, 73.9%.]

  25. Conclusion • CZIP is a generic compression tool providing CBN benefits • CZIP is comparable to GZIP in compression performance • CZIP helps greatly reduce memory footprint in serving similar files • It is very easy to support CZIP and the benefit is transparent

  26. Thank you! More information can be found at http://codeen.cs.princeton.edu/czip/ CZIP code will be released soon!

  27. 200/300 Clients [Figure: two panels, "200 clients" and "300 clients". Annotations, in extraction order: 90% 2.27 times; 90% 2.11 times; 80%; 65%; Median 1.95 times; Median 1.84 times.]
