
/usr/afs/bin/backup



  1. /usr/afs/bin/backup
  • Always talks to the budb.
  • If given a -port, talks to a butc.
  • Run 'backup help' for commands (examples below).
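A few typical invocations, as a sketch (the volset name, dump level, and port offset here are illustrative, not from the slides):

    backup help                                                      # list subcommands
    backup dumpinfo                                                  # summarize recent dumps in the budb
    backup dump -volumeset andromache-b -dump /full -portoffset 60   # -portoffset is the '-port' above; talks to that butc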

  2. butc (on your backup server)
  • Normally started by the script that launches the daily backups.
  • If doing a restore, you will need to start one by hand (see the sketch below).
  • Appropriate ports might be port 60 on plod, for a tape restore, or port 100 on meditrina, for a TSM restore.
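Starting one by hand might look like this sketch (-localauth is an assumption, appropriate when running as root on the server):

    butc -port 60 -localauth    # on plod: tape coordinator at port offset 60
    butc -port 100 -localauth   # on meditrina: TSM (XBSA) coordinator at offset 100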

  3. TSM
  • Terribly convenient; somewhat brittle.
  • If you see XBSA errors, go ask Dave and Patrick what might've been wrong on the TSM side. You might as well kill all processes associated with your backup and clean up the status file (sketch below), as you will need to start over.
  • One TSM nodename is shared between all servers in the cell, so password changes affect every server at once.
  • If a restore takes more than 10 min, Dave will be happy to intervene and free up a drive for you.
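Cleanup after a failed run might look roughly like this (the pkill pattern is an assumption; 100 is the TSM port offset from slide 2):

    pkill -f butc                        # or kill the PIDs found via 'ps -ef | grep butc'
    rm /usr/afs/backup/logs/status.100   # clear the per-port lock file (see slide 5)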

  4. Tape
  • 15-tape DLT stackers on plod are in sequential mode.
  • Home-grown scripts (Mount.pl) will take care of loading up the next tape for a multi-tape restore. You must put the tapes in the stacker sequentially, then load up the first one.
  • Expect 10 min per tape for restores.
  • Use 'mt -f /dev/rmt/0 status' or 'mt -f /dev/rmt/0 rewoffl' if trying to figure things out remotely (examples below).
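From your own machine, that might look like this (assuming you can ssh to plod):

    ssh plod 'mt -f /dev/rmt/0 status'    # is a tape loaded, and where is it positioned?
    ssh plod 'mt -f /dev/rmt/0 rewoffl'   # rewind and unload, so the stacker advances to the next tape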

  5. Regular Backups
  • Two levels of incrementals: if a full fails, do not let the incrementals run, or they will be huge.
  • Schedule file is in /afs/.athena/service/dump_data.
  • /usr/afs/backup/logs/status.portnum exists to lock out concurrent backups. (Delete it if necessary.)
  • If two backups run into each other, or cloning runs late, “has not been re-cloned since last dump” errors get mailed out. (Harmless.)
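To poke at the schedule and the configured dump levels (the directory is from the slide; 'backup listdumps' is the stock budb query):

    ls /afs/.athena/service/dump_data   # schedule files live here
    backup listdumps                    # prints the dump hierarchy: each /full with its /inc and /iinc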

  6. Retention
  • Retention is six months for regular backups.
  • A cron job looks for dumps in our budb which are expired, and does a 'backup deletedump' while talking to a TSM butc, to flag them for deletion on TSM (sketch below).
  • The TSM server cleans up ten days later.
  • Tape backup archivals are retained for as long as we are maintaining a drive that can read them.
  • We expect to be able to restore everything listed in the budb.
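The expiry pass might boil down to something like this (the cutoff date is illustrative; 100 is the TSM butc offset):

    # flag everything dumped before the cutoff for deletion on the TSM side
    backup deletedump -to 04/01/2009 -portoffset 100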

  7. Backup Schedule
  • Schedule is normally the same from month to month; edits are made as needed.
  • Each server looks for an entry for its hostname and today's date. No entry, no error, no backup.
  • Comment out by changing the hostname a little (see below).
  • Normal schedule has a full monthly for each partition of each server. Incrementals come before fulls, because maybe the full takes all day.
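The hostname-mangling trick, as a hypothetical before/after (the rest of the entry format is elided; only the idea is from the slide):

    andromache  ...    # matched by andromache: backup runs
    Xandromache ...    # matches no server's hostname, so it is skipped with no error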

  8. volsets
  andromache  andromache-b/full/inc/iinc,andromache-a/full
  poseidon    poseidon-a/full/inc,poseidon-b/archive
  • 'andromache-b' is the volset; '/full/inc/iinc' is the dump.
  • 'backup listvolsets' and 'backup listdumps' will show you.
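New volsets (e.g. for a new server deployment, per slide 13) are defined with 'backup addvolset' and 'backup addvolentry'; a sketch, with the volume regex as an assumption:

    backup addvolset -name andromache-b
    backup addvolentry -name andromache-b -server andromache \
        -partition /vicepb -volumes '.*\.backup'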

  9. Archival Backups (performance tuning)
  • End-of-term archivals are done twice a year.
  • They are painful under TSM, because our throughput diminishes if multiple servers try to do a backup at once, and we tend to fill up staging space.
  • Normal performance is for a single full to take eight hours. However, TSM will spend a while after it's done getting it copied out to tape.
  • We have five big servers with four partitions each, so 20 fulls: at two per day and ~10 TB of data total, that's a minimum of ten days.

  10. Restores
  • Follow the checklist in the wiki.
  • Gather data first (sketch below):
    • 'backup volinfo'
    • 'vos ex' to make sure the destination volume doesn't already exist
    • 'backup volrestore' with -n, so you find out if you need tapes or TSM
  • DANGER: don't restore on top of something else.
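The data-gathering pass might look like this (volume, server, and partition names are hypothetical):

    backup volinfo -volume user.jruser                  # which dumps contain this volume?
    vos examine user.jruser.restored                    # confirm the destination name is unused
    backup volrestore -server andromache -partition /vicepa \
        -volume user.jruser -extension .restored -n     # -n reports which tapes/TSM dumps are needed, restores nothing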

  11. Restores for real
  • Start a butc somewhere appropriate.
  • Now run your 'backup volrestore' with -port instead of -n (sketch below).
  • Wait for it to finish (5-10 min).
  • 'vos ex' your new volume.
  • You might want to mount it under service/restore.
  • You might want to rename something else out of the way and rename it into place.
  • If it's an Xuser volume, you might have moira rename it for you.
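Concretely, assuming the TSM butc at offset 100 and the hypothetical names from the previous sketch:

    backup volrestore -server andromache -partition /vicepa \
        -volume user.jruser -extension .restored -portoffset 100
    vos examine user.jruser.restored                                       # did it land?
    fs mkmount /afs/.athena/service/restore/jruser user.jruser.restored    # mount path is an assumption
    # or swap it into place:
    vos rename user.jruser user.jruser.old
    vos rename user.jruser.restored user.jruser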

  12. Things you might now be ready for
  • Disable tomorrow's daily backups.
  • Clean up and restart a regular backup that's been interrupted.
  • Restore a deactivated volume from TSM.
  • Restore a deactivated volume from tape.

  13. Tasks I haven't given you the details for (but which aren't hard)
  • When doing a restore of deleted data, find the date you want precisely.
  • Reschedule to clean up after a backup failure that happened days ago.
  • Create new volumesets for a new server deployment.
  • Gracefully handle a large bulk move of data.
  • Figure out how long the last few backups took (see below).
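For the last item, the budb already has the timing data; something like this (the dump ID is a placeholder):

    backup dumpinfo -ndumps 10                # one line per recent dump, including its creation time
    backup dumpinfo -id 922097346 -verbose    # full detail for a single dump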
