Learn how to recover deleted files from HPSS, including setting up a clone server, restoring backups, and transferring files. Follow the detailed steps for efficient data retrieval.
Recovering deleted files from HPSS
Pierre-Emmanuel Brinette
Context
• HPSS is used as the MSS behind dCache for the LHC experiments
• In March 2011, LHCb accidentally deleted a dataset from the central database located at CERN
• The delete operation was propagated to all grid sites that held the data
• Many sites were impacted: CERN (CASTOR), RAL (CASTOR), GRID-KA (TSM-HSM), PIC (CASTOR)
• 3 days after the deletion, the LHCb experiment asked sites to restore the deleted files
• All these HSMs have ‘undelete’ features, and the sites were able to restore the files within 2-3 days
• No undelete feature in HPSS!
Restore recipe
• Set up a clone HPSS core server
• Restore the HPSS backup taken before the deletion
• Identify the tapes holding the files
• Check whether the tapes have been altered in the production system (repacked or reclaimed)
• Prepare the resources to recover the files
• Copy the files from the clone HPSS to the production HPSS
Set up an HPSS clone core server
• Installation of a new core server
• Clone the local users (same uid, gid, password, shadow)
• Set up the same DB2 tablespace containers (raw devices, mount points, …)
• Set up DB2 9.5
• Compilation of HPSS (no installation)
• Restore the production database
• Restore HCFG, HSUBSYSx and roll forward to the time before the deletion
• Alter the configuration to change the hostname
• Transfer /var/hpss/etc/* from the production server to the clone server
• Change the hostname in the text config files
• Recreate the HPSS Unix keytab
• Recreate the HPSS mm keytab
• DB2 rebind
• Alter the HPSS metadata
• Update EXECUTE_HOSTNAME in HCFG.HPSS.SERVER
• Update SERV_DESC in HCFG.HPSS.SERVER for the SUD daemon and LOGC
• Start SSM
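The "change the hostname in the text config files" step can be sketched as below. This is only an illustration, not the procedure used on the slides: the function names, the hostnames `hpss-prod` / `hpss-clone` and the idea of skipping files that do not decode as text (for example keytabs) are all assumptions. It should be run on the copy of /var/hpss/etc held on the clone server, never on the production files.

```python
import re
from pathlib import Path


def replace_hostname(text, old, new):
    """Replace whole occurrences of hostname `old` with `new`.

    The lookarounds avoid touching longer names that merely contain
    `old` (e.g. "hpss-prod2"), while still rewriting a fully qualified
    name such as "hpss-prod.example.org".
    """
    pattern = re.compile(
        r"(?<![A-Za-z0-9.-])" + re.escape(old) + r"(?![A-Za-z0-9-])"
    )
    # A function is used as replacement so `new` is inserted literally.
    return pattern.sub(lambda match: new, text)


def rewrite_config_dir(directory, old, new):
    """Rewrite the hostname in every text file directly under `directory`.

    Returns the sorted names of the files that were modified.
    Files that cannot be decoded as text are left untouched.
    """
    changed = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        try:
            original = path.read_text()
        except UnicodeDecodeError:
            continue  # assumed to be a binary file such as a keytab
        updated = replace_hostname(original, old, new)
        if updated != original:
            path.write_text(updated)
            changed.append(path.name)
    return changed
```

The keytabs and the DB2 metadata (EXECUTE_HOSTNAME, SERV_DESC) are not covered by this sketch; they still have to be recreated or updated with the appropriate tools.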
Set up an HPSS clone core server (continued)
• Avoid write operations
• Disable MPS
• Disable log archiving in HPSS (in LOGD)
• Lock all drives
• Disk volumes
• Tape drives
• Disk and tape movers
• Start the components and cancel all pending operations
• Start LOGC, LOGD, PVL (NOT PVR), CORE
• Cancel the PVL jobs
• Wait for the pending operations to time out
Identify the tapes containing the erased files
• Query the clone CORE server with the list of deleted files
• Get the list of tapes that contain the files
• Check that the tapes exist in the production environment
• Lock the tapes in the production environment (i.e. VV Cond Down)
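The bookkeeping behind these steps can be sketched as follows. The slides do not say how the clone CORE server is queried, so the sketch assumes the result is already available as a hypothetical mapping from file path to the tape labels holding it; the function names and data layout are illustrative only.

```python
from collections import defaultdict


def tapes_for_deleted_files(file_to_tapes, deleted_files):
    """Group the deleted files by the tape(s) that hold them.

    file_to_tapes: dict mapping file path -> list of tape labels,
                   assumed to come from the clone core server.
    Returns (tape -> list of files, files with no known tape).
    """
    tapes = defaultdict(list)
    unknown = []
    for path in deleted_files:
        labels = file_to_tapes.get(path)
        if not labels:
            unknown.append(path)
            continue
        for label in labels:
            tapes[label].append(path)
    return dict(tapes), unknown


def split_by_production_state(tapes, production_tapes):
    """Separate tapes still present in production from those that are not.

    Tapes in the first list are the ones to lock in production; tapes in
    the second list need a closer look (they may have been repacked or
    reclaimed since the backup).
    """
    present = sorted(t for t in tapes if t in production_tapes)
    absent = sorted(t for t in tapes if t not in production_tapes)
    return present, absent
```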
Restore procedure
[Diagram of the restore procedure. Recoverable labels: Purge; VV: Down; put Locked; VV: Down; get Locked; Cancel pvljob; Start PVR]
Remarks
• The clone server is set up from a “live” HPSS metadata backup
• Operations are in progress
• Disk & tape volumes are in use
• Risk reductions:
• Clone: mark all movers & MPS “non executable” (w/ hpssadm)
• Clone: disable log archiving
• Clone: don’t start PVR
• Clone: cancel all PVL operations
• Clone: force dismount of tapes
• Clone: wait for timeout
• Clone: start/stop the Core server several times
• Prod: lock the tape drives still displayed as “in use” in the clone
• Clone: start up PVR, restart PVL and wait for the PVR timeout
• Clone: shut down & restart CORE, PVL, PVR
• Take care of all the error messages on both systems
• For recovery:
• Stage the files before the transfers
• Take the position of the files on the tape into account to optimize the transfers
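The last remark, ordering transfers by file position on tape, can be sketched as below. This assumes each file to recover is described by a hypothetical (path, tape label, position on tape) tuple; how the position is obtained from HPSS is not stated on the slides. Files are grouped per tape and read in ascending position, so each tape is mounted once and read sequentially.

```python
from itertools import groupby
from operator import itemgetter


def plan_staging(files):
    """Build a per-tape staging order.

    files: iterable of (path, tape_label, position) tuples.
    Returns a dict mapping tape label -> list of paths, sorted by
    position on that tape.
    """
    # Sort by tape first, then by position within the tape.
    ordered = sorted(files, key=lambda f: (f[1], f[2]))
    return {
        tape: [path for path, _, _ in group]
        for tape, group in groupby(ordered, key=itemgetter(1))
    }


if __name__ == "__main__":
    example = [
        ("/lhcb/run1/f3", "T002", 5),
        ("/lhcb/run1/f1", "T001", 9),
        ("/lhcb/run1/f2", "T001", 2),
    ]
    for tape, paths in plan_staging(example).items():
        print(tape, paths)
```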