Similarity-based deduplication By: Lior Aronovich, Ron Asher, Eitan Bachmat, Haim Bitner, Michael Hirsch, Tomi Klein
Deduplication • There is a lot of redundancy in stored data, especially in backup data. • Deduplication aims to store only the differences between different versions.
Different types of deduplication • Inline or offline. • Hash comparison or byte by byte. • Similarity based or identity based.
Our initial design requirements • Support for a petabyte of physical storage. • A deduplication rate of at least 350MB/sec. • Inline • Byte to byte (B2B) comparison
Standard approach • Break the incoming data stream into segments a few KB in size. Segment boundaries are placed where a rolling hash matches a chosen pattern. • Identify each segment by a long hash. • Check whether the hash belongs to a previously stored segment. • If so, store a pointer to that segment instead of the data.
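The steps on this slide can be sketched in a few lines of Python. This is only an illustration of the general technique, not any vendor's implementation: the window size, boundary mask, hash constants, the use of SHA-1 as the "long hash", and the function names `split_segments` and `deduplicate` are all assumptions, and the segment size is kept tiny so the demo runs on small inputs.

```python
import hashlib

WINDOW = 16            # bytes covered by the rolling hash (assumed)
MASK = (1 << 6) - 1    # boundary when the low 6 bits are zero (assumed, tiny for demo)
BASE = 257             # rolling-hash base (assumed)
MOD = (1 << 31) - 1    # rolling-hash modulus (assumed)


def split_segments(data):
    """Cut data where the rolling hash of the last WINDOW bytes matches a pattern."""
    segments, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD   # drop the oldest byte
        h = (h * BASE + byte) % MOD                   # add the newest byte
        if i + 1 - start >= WINDOW and (h & MASK) == 0:
            segments.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        segments.append(data[start:])                 # trailing partial segment
    return segments


def deduplicate(data, index):
    """Return the stream as a list of segment hashes; store only unseen segments."""
    recipe = []
    for seg in split_segments(data):
        key = hashlib.sha1(seg).hexdigest()   # long hash identifying the segment
        if key not in index:
            index[key] = seg                  # new segment: store the data
        recipe.append(key)                    # in every case the recipe holds a pointer
    return recipe
```

Feeding the same stream twice leaves `index` unchanged on the second pass, and joining `index[k]` over the returned recipe reproduces the original bytes.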
Standard approach • Can be fast and can be inline. However: • Doesn’t scale to a physical petabyte (with KB-sized segments the hash index becomes too large) • No B2B comparison
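A rough calculation shows why KB-sized segments are a problem at petabyte scale. The numbers below are assumptions chosen for illustration only: 8 KB segments, 4 MB chunks, binary units, and an index entry made of a 20-byte hash plus an 8-byte location.

```python
PB = 2**50                 # one petabyte of physical storage (binary units)
SEGMENT = 8 * 2**10        # assumed segment size: 8 KB
CHUNK = 4 * 2**20          # assumed chunk size: 4 MB
ENTRY = 20 + 8             # assumed index entry: 20-byte hash + 8-byte location


def units(physical_bytes, unit_bytes):
    """Number of stored units (segments or chunks) needed to cover the storage."""
    return physical_bytes // unit_bytes


segments = units(PB, SEGMENT)
chunks = units(PB, CHUNK)
identity_index_bytes = segments * ENTRY

print(f"{segments:,} segments of 8 KB")                        # 137,438,953,472
print(f"{identity_index_bytes / 2**40:.1f} TB of hash index")  # 3.5
print(f"{chunks:,} chunks of 4 MB")                            # 268,435,456
```

Under these assumptions an identity index needs terabytes, far beyond main memory, while MB-sized chunks reduce the number of indexed units by a factor of 512.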
Our approach • Break the incoming stream into chunks a few MB in size. • Compute a similarity (not identity!) signature, so that chunks which are alike (even only 50%) have signatures which are alike (possibly only 25%). • Do a B2B comparison between the incoming chunk and the similar repository data. • Store the differences found by the B2B comparison.
Similarity signatures • Compute a rolling hash (Karp-Rabin) over all blocks of the chunk. • Three possibilities: • (Breen et al.) Take k random block hashes • (Broder, Heintze, Manber) Take the k largest hashes • (Our choice) Take the k hashes of blocks located close to the blocks that produced the k largest hashes
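A minimal sketch of the third option follows. It is not the product's code: the block size, `K`, the fixed `SHIFT` used to pick the "close" block, the hash constants, and the names `block_hashes`, `signature` and `similarity` are assumptions made for the example.

```python
import heapq

BLOCK = 32             # block size in bytes (assumed, small for demo)
K = 4                  # number of hashes in a signature (assumed)
SHIFT = 8              # offset from a maximal block to its "close" block (assumed)
BASE = 257             # Karp-Rabin base (assumed)
MOD = (1 << 61) - 1    # Karp-Rabin modulus (assumed)


def block_hashes(chunk):
    """Karp-Rabin hash of every BLOCK-byte window; entry i covers chunk[i:i+BLOCK]."""
    if len(chunk) < BLOCK:
        return []
    top = pow(BASE, BLOCK - 1, MOD)
    h = 0
    for b in chunk[:BLOCK]:
        h = (h * BASE + b) % MOD
    hashes = [h]
    for i in range(BLOCK, len(chunk)):
        # Remove the byte leaving the window, then append the new byte.
        h = ((h - chunk[i - BLOCK] * top) * BASE + chunk[i]) % MOD
        hashes.append(h)
    return hashes


def signature(chunk, k=K, shift=SHIFT):
    """Hashes of the blocks `shift` bytes after the k blocks with the largest hashes."""
    hashes = block_hashes(chunk)
    # Only consider positions that still have a block `shift` bytes further on.
    candidates = range(max(len(hashes) - shift, 0))
    top_positions = heapq.nlargest(k, candidates, key=hashes.__getitem__)
    return {hashes[p + shift] for p in top_positions}


def similarity(chunk_a, chunk_b):
    """Number of signature hashes the two chunks share."""
    return len(signature(chunk_a) & signature(chunk_b))
```

In a repository, the signature hashes of stored chunks would be kept in an index, and an incoming chunk would be matched against the stored chunk sharing the most signature hashes with it.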
Criteria for comparison • Similarity checking speed • Successful identification of similarity percentage • Low probability of false positives • Likelihood of finding the most similar match
Comparison of methods • The first method (random block hashes) is slow, has many false positives, and is less likely than the other methods to find the best match. • The second method (k maximal hashes) is faster, but still has false positives. • The third method addresses all of these issues.
B2B phase • Once similarity is detected, we know where in the repository the similar data is located, and we have a few anchoring matches. • The B2B comparison itself is completely decoupled from the similarity search! The anchors and the already computed hashes support the B2B.
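To illustrate how anchoring matches can drive a byte-by-byte comparison, the sketch below grows each anchor into a maximal matching run and encodes everything else as literal bytes. The operation format (`"copy"` and `"literal"` tuples) and the names `extend_anchor`, `encode` and `decode` are hypothetical, and anchors are assumed to be valid index pairs `(position in new, position in old)`.

```python
def extend_anchor(new, old, i, j):
    """Grow a match around new[i] / old[j] in both directions, byte by byte."""
    start_i, start_j = i, j
    while start_i > 0 and start_j > 0 and new[start_i - 1] == old[start_j - 1]:
        start_i -= 1
        start_j -= 1
    end_i, end_j = i, j
    while end_i < len(new) and end_j < len(old) and new[end_i] == old[end_j]:
        end_i += 1
        end_j += 1
    return start_i, start_j, end_i - start_i


def encode(new, old, anchors):
    """Describe `new` as copies from `old` plus the literal bytes that differ."""
    ops, pos = [], 0
    for i, j in sorted(anchors):
        if i < pos:
            continue                          # anchor already covered by a copy
        s_i, s_j, n = extend_anchor(new, old, i, j)
        clip = max(pos - s_i, 0)              # do not re-cover emitted bytes
        s_i, s_j, n = s_i + clip, s_j + clip, n - clip
        if n <= 0:
            continue                          # anchor did not verify byte by byte
        if s_i > pos:
            ops.append(("literal", new[pos:s_i]))
        ops.append(("copy", s_j, n))
        pos = s_i + n
    if pos < len(new):
        ops.append(("literal", new[pos:]))
    return ops


def decode(ops, old):
    """Rebuild the new data from the stored operations and the old data."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[1] + op[2]]
        else:
            out += op[1]
    return bytes(out)
```

For example, with `old = b"the quick brown fox jumps over the lazy dog"`, `new = b"the quick red fox jumps over the lazy cat"` and anchors `[(0, 0), (14, 16)]`, only `b"red"` and `b"cat"` are stored as literals, and `decode(encode(new, old, anchors), old)` returns `new`.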
Implementation • The TS7650 from IBM, formerly from Diligent • Has been available since 2005. • Many clients managing many petabytes • Very large installations
Did we achieve our goals? • Up to 850MB/sec on a single system node • Up to 1PB of physically usable storage with only 12 GB of memory • Inline • B2B comparison
Some of the competition • Data Domain 690 series • An HP system, the DD4000 from 2008 (academic paper at FAST 2009): the only other similarity-based product, but it uses hash comparison and a variant of the second method.
In more detail • Our solution supports 1PB physical, while Data Domain supports at most 50TB and the HP product at most 10TB. We have actual installations which are far bigger than either of these numbers. • Our solution is faster: somewhat faster than Data Domain, much faster than HP. • We still find time to do a B2B comparison; they don’t. • Our solution has a faster reconstruction rate. Remember, reconstruction is what you need in a data outage situation!