
Implementation and Use of TSM Client/Server Data Deduplication in TSM V6.2

Implementation and Use of TSM Client/Server Data Deduplication in TSM V6.2. Dave Canan Advanced Technical Skills TSM Users Group – Rochester NY Oct 2011 ddcanan@us.ibm.com. ATS Team. Dave Canan ddcanan@us.ibm.com Dave Daun djdaun@us.ibm.com Tom Hepner hep@us.ibm.com Eric Stouffer


Presentation Transcript


  1. Implementation and Use of TSM Client/Server Data Deduplication in TSM V6.2 Dave Canan Advanced Technical Skills TSM Users Group – Rochester NY Oct 2011 ddcanan@us.ibm.com TSM Data Deduplication – Oct. 2011 NY TUG

  2. ATS Team • Dave Canan • ddcanan@us.ibm.com • Dave Daun • djdaun@us.ibm.com • Tom Hepner • hep@us.ibm.com • Eric Stouffer • ecs@us.ibm.com Data Deduplication in Tivoli Storage Manager 6

  3. Agenda • Deduplication concepts • Data deduplication in TSM V6.x • Planning for data deduplication in TSM V6 • Implementation of client side data deduplication in TSM V6.2 • Hints and tips for data deduplication in TSM V6 • Client instrumentation reports and data deduplication Data Deduplication in Tivoli Storage Manager 6

  4. Deduplication Concepts TSM Data Deduplication NY TUG

  5. Data Deduplication Value Proposition Potential advantages • Reduced storage capacity required for a given amount of data • Ability to store significantly more data on given amount of disk • Restore from disk rather than tape may improve ability to meet recovery time objective (RTO) • Network bandwidth savings (Client side deduplication) • Lower storage-management and energy costs resulting from reduced storage requirements Potential tradeoffs/limitations • Significant CPU and I/O resources required for deduplication processing • Deduplication might not be compatible with encryption • Increased sensitivity to media failure because many files could be affected by loss of common chunk • Deduplication may not be suitable for data on tape because increased fragmentation of data could greatly increase access time Data Deduplication in Tivoli Storage Manager 6

  6. Where Deduplication is Performed (with TSM) 6.2 6.1, 6.2 Note: Source-side and target-side deduplication are not mutually exclusive Data Deduplication in Tivoli Storage Manager 6

  7. When Deduplication is Performed (with TSM) 6.2 6.1, 6.2 Note: In-band and out-of-band deduplication are not mutually exclusive Data Deduplication in Tivoli Storage Manager 6

  8. Data Deduplication in TSM V6.x TSM Lunch and Learn Data Deduplication

  9. TSM 6.1 Deduplication Overview (Server Side) • Files A, B and C have common data • Deduplicated disk storage pool stores unique chunks to reduce disk utilization • Allows more objects to be stored on disk for fast access • Tape pool stores A, B, and C individually to avoid performance degradation during restore [Diagram labels: Client 1, Client 2, Client 3, Deduplication Server, TSM Database]

  10. Deduplication Example (6.1) 1. Client1 backs up files A, B, C and D. Files A and C have different names, but the same data. 2. Client2 backs up files E, F and G. File E has data in common with files B and G. 3. Server process “chunks” the data and identifies duplicate chunks C1, E2 and G1. 4. Reclamation processing recovers space occupied by duplicate chunks. [Diagram: Client1, Client2 and server volumes Vol1, Vol2 and Vol3 before and after each step]

  11. Client-Side Data Deduplication (6.2) - Operation 1. Client creates chunks 2. Client and server identify which chunks need to be sent 3. Client sends chunks and hashes to server so that it can represent object in database. Client saves newfound hash values in local cache if being used. 4. Entire file can be reconstructed during Backup Stgpool operation to non-deduplicated stg. pool at a later time [Diagram labels: TSM 6.2 client, TSM 6.2 API, Local Cache, hash index, Deduplication-Enabled Disk Storage Pool, Copy Storage Pool (non-deduplicated)]

  12. Deduplication Example (6.2) 1. Client1 backs up files A, B, C and D. Deduplication is enabled for the storage pool. 2. Client2 backs up files E, F and G. Files E and F have data in common with files B and C. 3. Server and client identify which “chunks” are duplicates. The hash index DB is built on the client. New data (File G) is sent; duplicate data is not sent to the server. 4. The identify process is not needed to recognize duplicate data; reclamation processing is needed only when deleted data leaves dead space. [Diagram: Client1, Client2 and server volumes Vol1, Vol2 and Vol3 before and after each step]

  13. Planning for Data Deduplication in TSM V6 TSM Data Deduplication NY TUG

  14. Server or Client Data Deduplication? • Consider server side deduplication if • Client has few CPU cycles to spare, or is a critical 24x7 response time application • Network capacity is not a concern • TSM server CPU and disk I/O resources are available for intensive processing to identify duplicate chunks • Clients not upgraded yet to 6.2 level • Want to dedup legacy data on server • You have a business requirement to get a copy of the data before identification of duplicate data chunks is done (see following slide) • Large files on client would slow client backup down if client side deduplication used. (more on this later) • Consider client side deduplication if you • Have spare CPU cycles on the client machine • Have a remote connection with a slower network • This method might be considered more scalable. You may be able to add more clients without as much concern for impact on the server deduplication function. • Have multiple clients that all store the same data • Have large numbers of Windows clients because of systemstate data (more on this later) Data Deduplication in Tivoli Storage Manager 6

  15. Server or Client Data Deduplication? • Maybe you want both, depending on workload • On weekends, the network might be less busy. You could use server-side data deduplication then. • During the week, network response is more critical. You might want to use client-side data deduplication during the week. • This could be controlled through a macro that updates the nodes using deduplication (update node x deduplication=serveronly) • Maybe you don’t want either • Restore of deduplicated data might be slower. Extents for a file can be spread across many volumes, making I/O from the storage pool more random in nature. Also, restore from a deduplicated pool results in additional DB I/O. • If data is considered critical and there is an SLA for optimal restore, then consider moving it to a non-deduplicated pool
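The weekday/weekend switch on this slide could be driven by a generated macro. The following is a minimal sketch, not the presenters' method: the helper names and the node names are hypothetical, and the weekend rule (Saturday and Sunday use server-side deduplication) is an assumption. Only the `update node ... deduplication=` form and its two values come from the slides.

```python
from datetime import date

def dedup_mode_for(day: date) -> str:
    """Return the node DEDUPLICATION value for a given day.

    Assumption: Saturday and Sunday use server-side deduplication only;
    weekdays allow client-side deduplication.
    """
    return "SERVERONLY" if day.weekday() >= 5 else "CLIENTORSERVER"

def build_dedup_macro(nodes, day):
    """Build the lines of a macro that sets every node to the mode for `day`."""
    mode = dedup_mode_for(day)
    return [f"update node {node} deduplication={mode}" for node in nodes]

if __name__ == "__main__":
    # NODE_A and NODE_B are placeholder node names.
    for line in build_dedup_macro(["NODE_A", "NODE_B"], date.today()):
        print(line)
```

The printed lines could be saved to a file and run as a macro from an administrative schedule or script.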

  16. Planning for TSM Deduplication (server side) • How do I want to control the duplicate identification (identify) processes? • You can have them running all the time and have them process data as transactions commit. This could be more CPU intensive. • You can run them after backups have completed and then cancel them after the identification of duplicate data has finished. Consider whether you have the extra processing time in the day to run this as a separate step. • Do you want to set up new storage pools or use existing ones? • You may want only certain nodes to perform data deduplication. These could be moved to new policy domains and management classes that point to new storage pools.

  17. Planning for TSM Deduplication – Estimating the Potential Cost Savings • How can I estimate my space savings with data deduplication in my environment? (2 possible techniques) • Best way to test this is with a test system you can delete when done. • Another way (but not a best practice): Consider backing up your data from a primary storage pool to a temporary copy storage pool that has data deduplication enabled to estimate the space savings. Downside to this technique is that it will increase DB size. • Create a copy stgpool using a devclass type of FILE. • Do a BACKUP STGPOOL primary stgpool to copy stgpool. • Run IDENTIFY DUPLICATES against the volumes in the copy storagepool. • When the IDENTIFY DUPLICATES process goes to idle state, set the reclamation threshold for the storagepool to 1%. • After reclamation finishes, issue the q stgpool command against the copy storagepool to check the amount of space that was saved. • If the results are satisfactory, then update the primary stgpool to specify deduplication is to be used. (Or if type DISK, move data to a new stgpool that is defined with devclass FILE.) Data Deduplication in Tivoli Storage Manager 6
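The copy-pool estimation steps above can be written down as an ordered command list. This is a sketch under stated assumptions: the helper name, the pool and device class names, and the `maxscratch` and `numprocess` values are hypothetical, and the exact parameter spellings should be checked against the Administrator's Reference for your server level before use.

```python
def estimation_commands(primary_pool, copy_pool, file_devclass,
                        max_scratch=100, identify_processes=2):
    """Return the administrative commands for the estimation steps, in order.

    Between the third and fourth command, wait until the identify
    processes have gone to the idle state, as the slide describes.
    """
    return [
        # Step 1: temporary deduplicated copy pool on a FILE device class.
        f"define stgpool {copy_pool} {file_devclass} pooltype=copy "
        f"maxscratch={max_scratch} deduplicate=yes",
        # Step 2: copy the primary pool data into it.
        f"backup stgpool {primary_pool} {copy_pool}",
        # Step 3: identify duplicate chunks in the copy pool.
        f"identify duplicates {copy_pool} numprocess={identify_processes}",
        # Step 4: force reclamation by setting the threshold to 1%.
        f"update stgpool {copy_pool} reclaim=1",
        # Step 5: check how much space was saved.
        f"query stgpool {copy_pool} format=detailed",
    ]

if __name__ == "__main__":
    for command in estimation_commands("BACKUPPOOL", "DEDUPTEST", "FILECLASS"):
        print(command)
```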

  18. Planning for TSM Deduplication • How much additional log space is required for data deduplication? • There are many factors in calculating this. Looking at the number of logs archived during the identify process could be used to estimate the log consumption during the data deduplication identification process. • db2diag -H 1d -g msg:="Completed archive" -fmt "Time : %%{ts} Message : @{msg}" | grep Time | gawk -F\ -f "C:\Program files\tivoli\tsm\baclient\loginfo.awk" >> c:\temp\tsm-files\\loginfo.txt • Is data deduplication suitable for all types of disk subsystems? • Restores from a deduplicated storage pool are random in nature. For disk subsystems that are slower and do not have adequate cache (for example SATA), restores of deduplicated data will be impacted.
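The log-archive counting idea can be sketched in a few lines. This is not the `loginfo.awk` script referenced on the slide, whose contents are not shown. It assumes each input line follows the `Time : ... Message : ...` format requested in the db2diag command, that the timestamp starts with `YYYY-MM-DD-HH.`, and that each archived log file is 512 MB; all three are assumptions to verify on your own system.

```python
import re
from collections import Counter

# Assumed line shape, e.g.:
# "Time : 2011-10-01-02.15.30.123456 Message : Completed archive for log file ..."
LINE_RE = re.compile(r"Time\s*:\s*(\d{4}-\d{2}-\d{2})-(\d{2})\.")

def archives_per_hour(lines):
    """Count 'Completed archive' messages per (date, hour) bucket."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and "Completed archive" in line:
            counts[(match.group(1), match.group(2))] += 1
    return counts

def log_mb_consumed(counts, log_file_mb=512):
    """Estimate archived log space per bucket; log_file_mb is an assumed file size."""
    return {bucket: n * log_file_mb for bucket, n in counts.items()}
```

Comparing the hourly totals while identify processes are running against a quiet period gives a rough view of the extra log consumption.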

  19. Planning for TSM Deduplication – Other Considerations • What is the possible impact on restore performance if I implement TSM data deduplication? • Smaller files (less than 100K) that have been deduplicated will restore more slowly than files that are not deduplicated. Having more sessions doing the restore will improve restore performance. • How many Identify Processes should I run on my TSM server? • The Identify process is both CPU and I/O intensive. You should run no more than “N” identify processes for an N-way CPU. Each identify process can use an entire CPU, so if you need CPU for other processes, use fewer.
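The "no more than N for an N-way CPU" rule is easy to encode as an upper bound. A minimal sketch: the function name and the `reserved_for_other_work` parameter are hypothetical, and the result is only a ceiling, not a recommended setting.

```python
import os

def max_identify_processes(cpu_count=None, reserved_for_other_work=0):
    """Upper bound on identify processes: one per CPU, minus CPUs kept free.

    cpu_count defaults to the number of CPUs reported by the operating
    system of the machine this runs on (assumed to be the TSM server).
    """
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    return max(0, cpu_count - reserved_for_other_work)
```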

  20. Planning for TSM Deduplication • How are new clients introduced into your system? • If many new clients are introduced at once, it might be better to use server-side data deduplication initially. If you add multiple clients at the same time and use client-side data deduplication, this will cause many more initial DB hash index lookups and more overhead on the TSM server. • What is the possible impact on backup performance if I implement TSM client-side data deduplication? • The impact of data deduplication on the client is comparable to the impact of Tivoli Storage Manager compression.

  21. Planning for TSM Deduplication • What is the impact on database size when implementing data deduplication? • The average “chunk size” is 256K, and for each chunk, there is approximately 500 bytes of metadata added to the TSM DB. Also, files less than 2K are not eligible for data deduplication. • How do I change my daily housekeeping schedules for data deduplication? Where does the Identify Process fit in the daily cycle? • If you are going to run the identify process as part of daily housekeeping, then run it after BACKUP STGPOOL has completed but before reclamation for devclass FILE storage pools has run. You could run identify for a given period of time (duration=nn); otherwise you need to know when the Identify Duplicates process has gone to IDLE state, as these processes behave differently from other TSM processes. Sample select to use for automation: select count(*) from processes where status like '%State: idle%' and process='Identify Process'
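The database-size figures on this slide lend themselves to a quick estimate. A rough sketch using only the two numbers quoted above (256K average chunk, about 500 bytes of metadata per chunk); it treats 256K as 256 KiB and counts every chunk of the stored data, both of which are simplifying assumptions.

```python
AVG_CHUNK_BYTES = 256 * 1024      # average chunk size quoted on the slide
METADATA_BYTES_PER_CHUNK = 500    # approximate DB metadata per chunk

def estimate_db_growth_bytes(stored_bytes):
    """Estimate DB metadata added for `stored_bytes` of deduplicated data."""
    chunks = -(-stored_bytes // AVG_CHUNK_BYTES)  # ceiling division
    return chunks * METADATA_BYTES_PER_CHUNK

if __name__ == "__main__":
    one_tib = 2 ** 40
    growth = estimate_db_growth_bytes(one_tib)
    print(f"1 TiB of data adds roughly {growth / 2**30:.2f} GiB of DB metadata")
```

Under these assumptions, 1 TiB of data corresponds to about 4.2 million chunks and roughly 2 GB of additional database space.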

  22. Planning for TSM Deduplication • How should I start with Data Deduplication? • If you are going to use data deduplication on an existing storage pool, consider the following approach: • Initially set identify processes to 0. • On a daily basis, run identify duplicates for some duration until all volumes in the storage pool have been processed. • Then update the storage pool identifyprocess parameter to an appropriate value. • If you are going to use data deduplication in a new storage pool, consider the following approach: • Initially set identify processes to 1 or 2. • On a daily basis, do a move nodedata for a few nodes into the new storage pool, and then point them to a new policy domain. Identify will process that set of nodes’ data. • When all nodes have been moved, delete the old storage pool and change copygroups to reflect the new storage hierarchy. • Then update the storage pool identifyprocess parameter to an appropriate value.

  23. Implementation of Client Side Data Deduplication in TSM V6.2 TSM Lunch and Learn Data Deduplication

  24. Client-Side Data Deduplication - Prerequisites • Client and server must be at version 6.2.0 or greater • Client must have the client-side deduplication option enabled (DEDUPLICATION YES) • The server must enable the node for client-side deduplication with the DEDUP=CLIENTORSERVER parameter using either REGISTER NODE or UPDATE NODE commands • The storage pool being backed up to must be a deduplication enabled storage pool • Files must be bound to the correct management class • File must not be excluded from client side deduplication processing (by default all files are included). See exclude.dedup Client option for details • File must be larger than 2 KB, transactions must be less than clientdeduptxnlimit

  25. Client-Side Data Deduplication - Restrictions • LAN-free • Unix HSM • Encryption • TSM client-side • Files from known encrypted file systems which are read in their raw encrypted format • SSL encryption is OK • Sub-file backup • API Buffer copy elimination • Simultaneous Storage Pool Write • If any of these exist, Client-side dedup does not occur Data Deduplication in Tivoli Storage Manager 6

  26. Client-Side Data Deduplication - Options / Controls • Server-side • Register and Update Node commands include: • DEDUPLICATION=SERVERONLY | CLIENTORSERVER • Copygroup destination must be • Stgpool with DEDUPLICATE=YES • Client-side • DEDUPLication YES | NO • Include.dedup and exclude.dedup • ENABLEDEDUPCAche YES | NO • DEDUPCACHEPath directory_name • DEDUPCACHESize 256 (1 to 2048 in MB) Data Deduplication in Tivoli Storage Manager 6

  27. Client-Side Data Deduplication - Include / Exclude Controls • Include.dedup filter • Exclude.dedup filter • Objtype parameter • Include and Exclude may also specify: • File • Image • SYSTEMObject • SYSTEMState • Examples • include.dedup /Users/Administrator/Documents/.../* • include.dedup objtype=image E: • include.dedup objtype=systemobject ALL • include.dedup objtype=systemstate ALL Data Deduplication in Tivoli Storage Manager 6

  28. Data Deduplication and Log Use Considerations (TSM 6.2) • Two new server options • ServerDedupTxnLimit • Sets the maximum size of objects that can be deduplicated on the server • ClientDedupTxnLimit • Sets the maximum size of a transaction when client-side deduplicated data is backed up or archived • Can be set dynamically with setopt server command • Use default values for now – changes may be coming in upcoming maintenance • Note: To control which objects are deduplicated, you can also use the MAXSIZE parameter of the DEFINE STGPOOL and UPDATE STGPOOL commands. Using the MAXSIZE parameter, you can force large objects to the NEXT storage pool for storage. Data Deduplication in Tivoli Storage Manager 6

  29. Client-Side Data Deduplication and Compression • New chunks can be compressed • Process: • Client chunks the data • Client compresses each chunk • Data is stored in compressed format • During a restore operation, compressed chunks sent back to client • For partial object restore, compressed chunks are decompressed by the server first • Compressed chunks can be shared across compressed/not compressed objects • Server will decompress on way back to lower level client • Server Operations • BACKUP STGPOOL – chunks decompressed first • BACKUPSET and EXPORT • Data is decompressed during a backup set generation or an export operation, if the node for which the backup set or export being generated is TSM 6.1 or prior. For TSM 6.2 and later clients, the chunks in backup set volumes and export volumes are not decompressed.

  30. Hints and Tips for Data Deduplication in TSM V6 (reference material) TSM Lunch and Learn Data Deduplication

  31. Hints and Tips for Using Deduplication in V6 • Each consumer backup session has its own chunk-query session to the server, but all sessions use a single client deduplication cache (client-side deduplication). Multiple threads within a session share a cache. Multiple dsmc processes do not share the cache. Keep this in mind when doing client-side data deduplication with TDPs. • If you run several dsmc processes, the first one will use the deduplication cache if enabled. Additional dsmc processes will not, but client-side deduplication will still take place. • Each chunk requires ~200 bytes in the deduplication cache. • Use the dedupcachesize option to limit the size of the deduplication cache. If this size is exceeded, all entries in the deduplication cache are deleted and the cache is reset.
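The ~200 bytes per chunk figure gives a rough idea of how much data a given dedupcachesize can cover before the cache resets. A sketch under assumptions: the function names are hypothetical, megabytes are treated as MiB, and the 256K average chunk size quoted elsewhere in this deck is assumed to apply to client-side chunks as well.

```python
BYTES_PER_CACHE_ENTRY = 200  # approximate cache cost per chunk, from the slide

def cache_capacity_chunks(dedupcachesize_mb=256):
    """Number of chunk entries that fit in a cache of the given size."""
    return (dedupcachesize_mb * 1024 * 1024) // BYTES_PER_CACHE_ENTRY

def data_covered_gib(dedupcachesize_mb=256, avg_chunk_kib=256):
    """Approximate unique client data (GiB) the cache can track.

    avg_chunk_kib is an assumed average chunk size.
    """
    chunks = cache_capacity_chunks(dedupcachesize_mb)
    return chunks * avg_chunk_kib / (1024 * 1024)

if __name__ == "__main__":
    print(cache_capacity_chunks(256), "entries in a 256 MB cache")
    print(f"about {data_covered_gib():.0f} GiB of unique data")
```

With the default of 256 MB this works out to roughly 1.3 million entries, which suggests clients with several hundred gigabytes of unique data may want a larger cache.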

  32. Types of Client Backups to consider with client-side data deduplication • Image backups • Only changed chunks are sent. This would allow for more frequent client image backups to the server. Recommendation: use for volumes up to ~300GB. This would be similar to the effect of doing incremental image backups. Performance with very large image objects is a concern. In some cases client-side deduplication may be impractical. See the slide for ServerDedupTxnLimit/ClientDedupTxnLimit. • Windows XP clients • Windows XP does not use the new “incremental” systemstate feature of TSM V6.2 • Windows 2003/Vista/2008/7 clients • Could be used in conjunction with the new “incremental” systemstate feature of the V6.2 client to further reduce the amount of data being sent to the server. If you have many servers this will save you space, but at the cost of performance and time.

  33. Debugging Tips – where to look if dedup does not appear to be working • Several areas need to be checked if it does not appear that dedup is working correctly. • Server side dedup checklist: • Is the storage pool enabled for dedup? • Has Identify been run yet? • Is DedupRequiresBackup set to yes? If so, has a “Backup Stgpool” been done yet? • Have you run reclamation yet ? • Are the transactions/files too big? • Client side dedup checklist • Is the storage pool enabled for dedup? • Has the data been excluded with exclude.dedup processing? • Do you have Deduplication set to YES in option file? • Are the transactions/files too big? Data Deduplication in Tivoli Storage Manager 6

  34. Final Statistics Messages (from backup/archive) Message Data Deduplication in Tivoli Storage Manager 6

  35. Verifying data dedup is happening with older API clients (Data Protection for Mail, DB, etc.) • V6.1 or earlier DP clients don’t display any indications of deduplication statistics. Here’s one way to see if dedup is working: • Enable a trace with the following parameters in the dsm.opt file used by the DP: TRACEFILE api.txt TRACEFLAGS dedup api • Example on following page Data Deduplication in Tivoli Storage Manager 6

  36. Verifying data dedup is happening with older API clients (DPs) example 04/09/10 14:35:58.394 : dsmsend.cpp (2073): tsmEndSendObjEx: Total bytes send 0 2183921664, encryptType is NO encryptAlg is NONE compress is 0, dedup is 1, totalCompress is 0 0 totalLFBytesSent 0 0 totalDedupSize 0 6789732 txnBytes 2183921664 The totalDedupSize parameter in the extract reports how many bytes were sent to the server after deduplication (6789732). Note that the message below reflects the number of bytes sent before deduplication occurred: ANE4991I (Session: 4892, Node: CCDORA1) TDP Oracle AIX ANU0599 TDP for Oracle: (1392894): =>(ccdora1) ANU2526I Backup details for backup piece /oradb//18lamq0j_1_1 (database "XADAC"). Total bytes sent: 2183921664. Total processing time: 00:03:18. Throughput rate: 10771.39Kb/Sec. Compressed: No. Encryption: None. LAN-Free: No. (SESSION: 4892) Note: In the future, the activity log may reflect the deduplication reduction
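The two numbers in the trace extract can be pulled out and compared programmatically. This is an illustrative sketch: the helper names are hypothetical, and reading each pair of integers as the high and low 32-bit words of one value is an assumption about the trace format, not something the slide states.

```python
import re

SAMPLE_TRACE = (
    "tsmEndSendObjEx: Total bytes send 0 2183921664, encryptType is NO "
    "encryptAlg is NONE compress is 0, dedup is 1, totalCompress is 0 0 "
    "totalLFBytesSent 0 0 totalDedupSize 0 6789732 txnBytes 2183921664"
)

def _field(trace, name):
    """Return the value following `name` in the trace line.

    Assumption: the two integers are the high and low 32-bit words.
    """
    match = re.search(re.escape(name) + r"\s+(\d+)\s+(\d+)", trace)
    if match is None:
        raise ValueError(f"{name} not found in trace line")
    high, low = int(match.group(1)), int(match.group(2))
    return (high << 32) | low

def dedup_reduction(trace):
    """Return (bytes before dedup, bytes sent after dedup, fraction saved)."""
    before = _field(trace, "Total bytes send")
    after = _field(trace, "totalDedupSize")
    return before, after, 1 - after / before

if __name__ == "__main__":
    before, after, saved = dedup_reduction(SAMPLE_TRACE)
    print(f"{before} bytes before, {after} bytes sent, {saved:.2%} saved")
```

For the extract on this slide, almost all of the roughly 2.18 GB object was already on the server, so only about 6.8 MB had to be sent.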

  37. Hints and Tips – Reporting Data Dedup Effectiveness • Q occ • Physical space occupied for storagepools that are deduplicated is not shown. • Shows the “Reporting occupancy” in the logical occupancy field. Reporting occupancy represents how much space this data would occupy if this wasn’t in a data dedup pool • Shows actual number of files in the filespace. • If you turn off deduplication for a stgpool, there is no value displayed in the “physical occupancy” field until there are no more deduplicated files in storage pool. Data Deduplication in Tivoli Storage Manager 6

  38. Hints and Tips – Reporting Data Dedup Effectiveness • Select from occupancy • Physical space occupied for storagepools that are deduplicated has value of 0.0. • Lists fields “Logical_MB” and “Reporting_MB”. Reporting_MB represents how much space this data would occupy if this wasn’t in a data dedup pool. Logical_MB is how much space is actually being used. Data Deduplication in Tivoli Storage Manager 6

  39. Changed Commands/Output • Q stgpool f=d • Contains a new field “Duplicate data not stored.” This represents the amount of data eliminated in the storage pool via data dedup. The % shown is the amount saved / (amount in the pool + amount saved). • Example: Let’s say that we are currently storing 20GB in a storage pool, and we didn’t store 10GB because of data deduplication. If we didn’t have data deduplication enabled, we would have needed 30GB of space. So, our % is 10/30 or 33%. • % Utilization reflects physical occupancy, which means it won't change until you reclaim the volume
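The percentage in the worked example can be reproduced with a one-line formula. A small sketch with a hypothetical function name, following the slide's arithmetic (amount not stored divided by what the pool would hold without deduplication).

```python
def duplicate_data_not_stored_pct(stored_gb, not_stored_gb):
    """Percentage saved: not_stored / (stored + not_stored) * 100."""
    total_without_dedup = stored_gb + not_stored_gb
    if total_without_dedup == 0:
        return 0.0
    return 100.0 * not_stored_gb / total_without_dedup

if __name__ == "__main__":
    # The slide's example: 20GB stored, 10GB not stored.
    print(f"{duplicate_data_not_stored_pct(20, 10):.0f}%")
```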

  40. Changed Commands/Output • Q stgpool f=d Data Deduplication in Tivoli Storage Manager 6

  41. New Commands • Show deduppending stgpool-name • This command shows the amount of data that has been identified as duplicate data, but has not yet been processed by reclamation to delete it. Data Deduplication in Tivoli Storage Manager 6

  42. Client instrumentation changes • Client instrumentation – new categories • Fingerprint category – where chunk boundaries are • ICC Digest – chunk hash

  43. Questions?
