
DATA DEDUPLICATION





Presentation Transcript


  1. DATA DEDUPLICATION By: Lily Contreras April 15, 2010

  2. What is data deduplication? • Often called intelligent compression or single-instance storage. • In the deduplication process, duplicate data is deleted, leaving only one copy of the data to be stored. • Data deduplication divides the incoming data into segments, uniquely identifies each segment, and compares the segments against data that has already been stored. If an incoming segment is new, it is stored on disk; if it is a duplicate of what has already been stored, it is not stored again and a reference to the existing copy is created instead. • “Only one unique instance of the data is actually retained on storage media, such as disk or tape. Redundant data is replaced with a pointer to the unique data copy.”
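The single-instance idea above can be sketched in a few lines of Python. This is a toy illustration, not any vendor's implementation; the class and field names are hypothetical. Two files with identical contents share one stored copy, and each file name is just a pointer (a hash) to that copy.

```python
import hashlib

class SingleInstanceStore:
    """Toy file-level deduplication: one stored copy per unique content."""

    def __init__(self):
        self.blobs = {}   # content hash -> bytes, stored exactly once
        self.files = {}   # file name -> content hash (a pointer/reference)

    def put(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blobs:   # new data: retain it on "storage"
            self.blobs[digest] = data
        self.files[name] = digest      # duplicate: only a pointer is kept

    def get(self, name):
        return self.blobs[self.files[name]]

store = SingleInstanceStore()
store.put("report.doc", b"same bytes")
store.put("copy-of-report.doc", b"same bytes")  # deduplicated, not re-stored
assert len(store.blobs) == 1 and len(store.files) == 2
```

Retrieval follows the pointer: `store.get("copy-of-report.doc")` returns the single retained copy.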

  3. What is data deduplication? • Data deduplication operates at different levels, such as the file, block, and bit level. • If a file is updated, only the changed data is saved. For example, if only a few bytes of a document or presentation change, only the changed blocks or bytes are saved; the change does not create an entirely new file. This behavior makes block- and bit-level deduplication far more efficient than file-level deduplication. • Deduplication works by comparing chunks of data to detect duplicates. Each chunk is assigned an identifier calculated by the software, typically using a cryptographic hash function. When a new hash is computed, it is compared against an index of existing hashes. If that hash is already in the index, the data is considered a duplicate and does not need to be stored again; otherwise, the new hash is added to the index and the new data is stored.
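The block-level chunk-and-hash loop described above can be sketched as follows. This is a minimal illustration assuming fixed-size chunks (real products often use variable-size chunking); the function names and the tiny chunk size are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4  # tiny for demonstration; real systems use kilobyte-scale chunks

def dedupe(data, index):
    """Split data into fixed-size chunks and store only unseen chunks.
    Returns a 'recipe': the ordered list of chunk hashes for this data."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        h = hashlib.sha256(chunk).hexdigest()
        if h not in index:     # hash not in index: store new chunk
            index[h] = chunk
        recipe.append(h)       # hash already known: keep only the reference
    return recipe

def rebuild(recipe, index):
    """Reassemble the original data from its chunk references."""
    return b"".join(index[h] for h in recipe)

index = {}
recipe = dedupe(b"aaaabbbbccccaaaa", index)
assert rebuild(recipe, index) == b"aaaabbbbccccaaaa"
assert len(index) == 3  # "aaaa" appears twice but is stored once
```

Because the index is shared across calls, a second document containing the chunk `aaaa` would add no new storage for that chunk, only another reference.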

  4. Deduplication Methods In-line deduplication is the most efficient and economical method. Hash calculations are performed in real time as the data is ingested. If the target device identifies a block that has already been stored, it simply creates a reference to the existing block. An advantage of in-line deduplication over post-process deduplication is that it requires less storage, since duplicate data is never written. In-line deduplication significantly reduces the raw disk capacity needed in the system, since the full, not-yet-deduplicated data set is never written to disk. “It optimizes time-to-DR (disaster recovery) far beyond all other methods since it does not need to wait to absorb the entire data set and then deduplicate it before it can begin replicating to the remote site.” However, “because hash calculations and lookups takes so long, it can mean that the data ingestion can be slower thereby reducing the backup throughput of the device.” Post-process deduplication first stores new data on the storage device and analyzes it for deduplication later. One of its advantages is that it does not need to wait for hash calculations and lookups to complete before storing the data. However, one of its drawbacks is that it may temporarily store duplicate data, which can be a big problem if storage capacity is near its limit. Perhaps the major drawback is the inability to predict when the deduplication process will be completed.
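The capacity difference between the two methods can be made concrete with a small sketch. This is an illustrative model, not a product implementation: it tracks the peak raw bytes each approach holds while ingesting the same chunk stream.

```python
import hashlib

def inline_ingest(chunks):
    """In-line: hash before writing, so duplicates never reach disk."""
    store, peak = {}, 0
    for c in chunks:
        h = hashlib.sha256(c).hexdigest()
        if h not in store:
            store[h] = c                     # only unique chunks are written
        peak = max(peak, sum(len(v) for v in store.values()))
    return store, peak

def post_process_ingest(chunks):
    """Post-process: write everything first, deduplicate afterwards."""
    raw = list(chunks)                       # full data set lands on disk
    peak = sum(len(c) for c in raw)          # raw capacity needed before dedup
    store = {hashlib.sha256(c).hexdigest(): c for c in raw}
    return store, peak

data = [b"aaaa", b"bbbb", b"aaaa", b"aaaa"]
_, inline_peak = inline_ingest(data)
_, post_peak = post_process_ingest(data)
assert inline_peak == 8    # only the two unique chunks are ever stored
assert post_peak == 16     # all four chunks are stored before deduplication
```

Both approaches end with the same deduplicated store; the difference is the transient raw capacity post-process deduplication needs, which is exactly the concern the slide raises when capacity is near its limit.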

  5. Benefits of Data Deduplication • Eliminates redundant data. • Drives down cost. • Improves backup and recovery service levels. • Changes the economics of disk versus tape. • Reduces carbon footprint.

  6. Problems with Data Deduplication • Hash collisions • Intensive computation power required • Effect of compression • Effect of encryption
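The hash-collision problem listed above is commonly mitigated by verifying content when hashes match, rather than trusting the hash alone. The following is a hedged sketch of that idea (the helper name and index layout are illustrative and not from the slides): on a hash hit, the stored chunk is byte-compared before the new data is discarded as a duplicate.

```python
import hashlib

def is_duplicate(chunk, index):
    """On a hash match, byte-compare against the stored chunk so a
    hash collision cannot silently discard non-duplicate data."""
    h = hashlib.sha256(chunk).hexdigest()
    stored = index.get(h)
    return stored is not None and stored == chunk

index = {hashlib.sha256(b"abcd").hexdigest(): b"abcd"}
assert is_duplicate(b"abcd", index)        # true duplicate
assert not is_duplicate(b"wxyz", index)    # new data, must be stored
```

The extra comparison costs a read of the stored chunk, which is one reason deduplication is computationally intensive, as the slide notes.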

  7. How to choose a data deduplication solution? • Consider the broader implications of deduplication. • Think about how deduplication can be used to eliminate tape in your environment. • Data created by humans dedupes well, but data created by computers does not. • Compare multiple products. • Ensure ease of integration into your existing environment.

  8. References • http://searchdatabackup.techtarget.com/tip/0,289483,sid187_gci1360643,00.html • http://www.datadomain.com/resources/faq.html • http://searchstorage.techtarget.com/sDefinition/0,,sid5_gci1248105,00.html • http://forms.datadomain.com/go/datadomain/eNL_WP_IDCBR_10 • http://wwpi.com/index.php?option=com_content&view=article&id=8477:how-to-choose-a-deduplication-solution&catid=99:cover-story&Itemid=2701018
