Learn about different data backup solutions and the principles of fault tolerance. Understand the importance of backing up data and ensuring 100% availability. Discover how to effectively back up and restore your data to protect against data loss.
COMP1321 Backup of data Richard Henson December 2015
Week 10 – Back Up and Fault Tolerance • Objectives • describe different kinds of solutions available for data backup • explain the concept and principles of fault tolerance
Backing Up and Fault Tolerance • In terms of computing… • “back up” is what is done to data, in case the original is corrupted for some reason • e.g. all computer users should back up any files they save and want to use again later… • e.g. all network users should have their “user data” saved as part of good network processes • “fault tolerance” is more fundamental, and concerned with 100% availability • relates to hardware as well as the software required to manage that hardware… • but back up is an essential part of fault tolerance
Fault Tolerance of Data • Data on storage media is easily corrupted or deleted • magnetic disks particularly sensitive to data loss • must be backed up at all times • Useful also to store contents of memory • otherwise lost whenever there is a power interruption or system malfunction
Backing Up Data • If complete hard disk is regularly copied… • massive amounts of data will soon accumulate… • need a storage medium that can hold very large quantities • Once upon a time, tape storage was the preferred method • use a new back up tape every day • keep the old ones carefully labelled in a safe place
Which Data Should be Backed Up? • Can be classified into several types • System data • Critical System data • Application data • User data • May be backed up in different ways, with differing regularity
What data should NOT be backed up • Probably two categories: • that which is safely stored elsewhere and can be restored at leisure • e.g. applications on CD • that which won’t be used again and won’t be missed • e.g. temporary files • read/unread emails that aren’t important • saved word files, etc. that are no longer needed
Clearing out data that is no longer needed • According to a visionary researcher: • "All computer-mediated processes produce data. Unless dealt with, it stays around. • And its after-effects can be pretty toxic. • And, just as 100 years ago we ignored pollution in our rush to build the Industrial Age, today we’re ignoring data in our rush to build the Information Age. • And, I believe, 100 years from now our great-grandchildren will look back at the decisions we made and wonder how we could have been so ignorant and short-sighted." (Bruce Schneier, 2008)
Automatic tidying up of data • The answer is simple… • BACK UP processes should be accompanied by DELETE processes! • not yet accepted practice… • This is good information management • Reduces risk of information getting into the wrong hands • and ensures compliance with UK Data Protection Legislation
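A minimal Python sketch of pairing a DELETE step with the BACK UP step. The record format, the `due_for_deletion` name and the one-year retention period are assumptions made for illustration; they do not come from any particular backup product or from the legislation itself.

```python
from datetime import date, timedelta

def due_for_deletion(records, today, retention_days):
    """Return the names of items last needed before the retention cut-off.

    records: iterable of (name, last_needed_date) pairs (hypothetical format).
    """
    cutoff = today - timedelta(days=retention_days)
    return [name for name, last_needed in records if last_needed < cutoff]

# Example data, invented for this sketch.
records = [
    ("old_report.doc", date(2014, 1, 10)),
    ("current_plan.doc", date(2015, 11, 30)),
]
expired = due_for_deletion(records, today=date(2015, 12, 1), retention_days=365)
```

A scheduled job could run a check like this straight after each backup, so that data which is no longer needed is removed rather than copied again.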
Essential/Important System Data • Essential: what is needed for a healthy boot up • Microsoft networks refer to this as SYSTEM STATE DATA • highly dynamic • regular back up essential • Data to support utilities • required for “housekeeping” duties • back up every time not essential • data available on CD • Data to support services • as utilities data
Backing up User Data • A number of approaches available: • Incremental backup • just the files that have changed since the last backup of any kind (full or incremental) • Differential backup • just the files that have changed since the last full backup • Full backup • all data backed up • What about critical system data? e.g. Windows registry settings • differential backup?
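The three approaches can be compared with a short Python sketch. It uses the usual definitions: a full backup copies everything, a differential backup copies files changed since the last full backup, and an incremental backup copies files changed since the last backup of any kind. The `FileInfo` type and `select_files` function are hypothetical names, and times are plain numbers to keep the example small.

```python
from dataclasses import dataclass

@dataclass
class FileInfo:
    path: str
    modified: float  # last-modified time (any consistent unit)

def select_files(files, mode, last_full, last_backup):
    """Return the paths that one backup run would copy.

    last_full:   time of the most recent full backup
    last_backup: time of the most recent backup of any kind
    """
    if mode == "full":
        return [f.path for f in files]
    if mode == "differential":
        return [f.path for f in files if f.modified > last_full]
    if mode == "incremental":
        return [f.path for f in files if f.modified > last_backup]
    raise ValueError("unknown backup mode: " + mode)

# Example: full backup at time 100, an incremental backup at time 200.
files = [FileInfo("a.doc", 50), FileInfo("b.doc", 150), FileInfo("c.doc", 250)]
full = select_files(files, "full", last_full=100, last_backup=200)
differential = select_files(files, "differential", last_full=100, last_backup=200)
incremental = select_files(files, "incremental", last_full=100, last_backup=200)
```

The trade-off is visible in the results: the incremental run copies the least, but a restore then needs the full backup plus every incremental since; a differential restore needs only the full backup plus the latest differential.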
Tape Backup? • Can store many gigabytes of data on a single tape • Storage is fairly rapid, BUT… tape is no longer regarded as the natural choice… • storage medium is still magnetic • retrieval time can be very slow
The Backup Process • Handled by software • easily scheduled to be automatic • Data could be backed up to a variety of alternative media • e.g. removable hard disk • A lot of backup data will accumulate… • general rule to dump data after three backup “generations” • known as grandfather-father-son
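The "three generations" rule can be sketched in Python as follows. This is a simplified reading of grandfather-father-son that just keeps the three newest backups; real rotation schemes often mix daily, weekly and monthly sets, and the `split_generations` name is an assumption for this example.

```python
from datetime import date

def split_generations(backup_dates, keep=3):
    """Split backups into (kept, discarded), keeping the newest `keep` generations."""
    ordered = sorted(backup_dates, reverse=True)  # newest first
    return ordered[:keep], ordered[keep:]

# Example: five nightly backups, invented for this sketch.
nightly = [date(2015, 12, day) for day in range(1, 6)]
kept, discarded = split_generations(nightly)
```

Here `kept` holds the son, father and grandfather (newest to oldest), and `discarded` lists the media that can be reused or securely wiped.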
Other Alternatives to Tape Backup • Server data could also be backed up to: • a USB-linked hard drive • another computer on the network • a computer on another network in a different location • easily achievable via the Internet • data will be preserved in the event of a fire or environmental catastrophe
Verification of Backup • One thing to THINK that the data is being backed up • Quite another to ensure that this has indeed occurred! • no reason to assume that the backup will be completely effective • plenty that could go wrong • Data backup routine should check: • that the data has indeed been copied • that there are no errors • Good backup software should make such checks automatically!
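One common way to make both checks is to compare checksums of the original and the copy. The Python sketch below uses SHA-256 from the standard library; the function names are assumptions for illustration, and commercial backup software may verify in other ways.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def backup_matches(source: Path, copy: Path) -> bool:
    """True only if the copy exists and has the same contents as the source."""
    return copy.is_file() and sha256_of(source) == sha256_of(copy)
```

A missing copy fails the first check ("has the data been copied?") and a differing checksum fails the second ("are there errors?").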
Restoring Backed Up Data • Should happen… • as part of a regular routine • just like the backing up itself… • No good backing the data up to tape or disk if it can’t easily be recovered! • or can’t be copied back to the right place… • Back up software should always be tested in “restore” mode as well…
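A restore test can be as simple as copying a file out of the backup into a scratch location and comparing it with the original. The Python sketch below does this for a single file using only the standard library; `restore_drill` is a hypothetical name, and a real drill would cover whole directory trees and the backup product's own restore mode.

```python
import filecmp
import shutil
import tempfile
from pathlib import Path

def restore_drill(original: Path, backup_dir: Path) -> bool:
    """Back up one file, restore it to a scratch folder, and compare the result."""
    backup_copy = backup_dir / original.name
    shutil.copy2(original, backup_copy)        # the "backup" step
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / original.name
        shutil.copy2(backup_copy, restored)    # the "restore" step
        # shallow=False forces a byte-by-byte comparison
        return filecmp.cmp(original, restored, shallow=False)
```

Restoring to a scratch folder rather than over the live data means the test can run as part of a regular routine without putting the originals at risk.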
Beyond Backup: “Thinking the unthinkable...” • Humans are optimistic • we HOPE things won’t ever go wrong… • but they do!!! • ANY network device could go wrong at any time • could affect network performance • could even bring the whole network to a halt… • with time, the business/organisation will be kaput • Software can also fail • may go into an endless loop • may need to be restarted
An International Standard for “Business Continuity Planning” • BS 25999 (a British Standard, superseded in 2012 by the international standard ISO 22301): • taking “Murphy’s Law” contingency planning to all aspects of the organisation • recent UK e.g.: flooding • need to prepare for it so the business can continue… • These days, a business’s most important asset is often its information • stored in digital format • copy needs to be kept in a different location
Fault Tolerance and Computer Systems • All about availability • Any organisation now dependent on digital data • Power cut… people stop work… most of what they do involves a computer • Good fault tolerance is about minimising the chances of this happening…
Definition of “Fault Tolerant”? • “A computer system or component designed so that, in the event that a component fails, a backup component or procedure can immediately take its place with no loss of service”
Fault Tolerance role of the Network Operating System • Each important hardware component on the network should have a backup that can take over in the event of a failure • NOS should therefore • detect failures • enable a backup to automatically take over when the fault is detected...
Achieving Fault Tolerance • ONE APPROACH… • carefully written software • software detects failure of other software • takes evasive action in real time • hardware has an embedded system that: • detects failure • rapidly swaps alternative hardware into action • Makes sense for the operating system to do all of this… • detects both hardware and software failure • restarts program(s) • swaps in alternative pre-wired hardware
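The detect-and-swap idea can be sketched in a few lines of Python. The `Component` class and `choose_active` function are hypothetical; a real operating system would detect failure through hardware signals, heartbeats or timeouts instead of a simple flag.

```python
class Component:
    """A hypothetical hardware or software component with a health flag."""
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

def choose_active(primary, standby):
    """Return the component that should be in service right now."""
    if primary.healthy:
        return primary
    if standby.healthy:
        return standby          # failover: the backup takes over
    raise RuntimeError("no healthy component available")

# Example: the primary network card fails and the standby is swapped in.
primary = Component("nic0")
standby = Component("nic1")
before = choose_active(primary, standby).name
primary.healthy = False
after = choose_active(primary, standby).name
```

Note the final branch: once the standby is in use there is nothing left to fail over to, which is why the faulty component has to be replaced quickly.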
Concept of Data “Mirroring” • Problem with periodic backup: • data copied the previous night • what if the system hard disk goes kaput in the middle of the next day? • Copy of all data should additionally be stored “shorter term” on further media • easiest way is to have another disk in reserve • everything copied to system disk also copied to mirror
Disk Mirroring • Increases boot/system disk fault tolerance under most conditions • In its simplest form: • all data held on one disk • second disk is an exact copy of the first • When anything is written to disk… • written simultaneously to both disks • [Diagram: a single disk controller writes data to Disk A and the same data to Disk B]
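A toy Python model of mirroring, with each disk represented by a dictionary of blocks. The `MirroredVolume` class is an illustration only; real mirroring is done by the disk controller or the operating system, not by application code.

```python
class MirroredVolume:
    """Two in-memory 'disks'; every write goes to both of them."""
    def __init__(self):
        self.disk_a = {}
        self.disk_b = {}
        self.failed = set()   # names of failed disks: "A" and/or "B"

    def write(self, block, data):
        if "A" not in self.failed:
            self.disk_a[block] = data
        if "B" not in self.failed:
            self.disk_b[block] = data

    def read(self, block):
        if "A" not in self.failed:
            return self.disk_a[block]
        if "B" not in self.failed:
            return self.disk_b[block]   # fall back to the mirror
        raise IOError("both disks have failed")

# Example: disk A fails after a write, and the data is still readable.
volume = MirroredVolume()
volume.write(0, b"boot record")
volume.failed.add("A")
survivor = volume.read(0)
```

The sketch also shows the cost: every block is stored twice, so only half of the total capacity holds primary data.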
Where even Mirroring alone is not enough… • If the system crashes and will not reboot… • operating system doesn’t get reloaded • therefore the mirror never gets activated • and copied files cannot be read…
Recovering the system after a damaged Mirrored Boot Disk • Boot program can only point to one disk at a time… • If the boot disk crashes … • the system boot program will fail to access a disk at all next time it restarts… • System needs an alternative boot up… • e.g. use a boot floppy or CD to restart the system… • Boot program can then be modified to point to the backup, not the faulty disk
Remedial action after a broken mirror • Just because the system is up and running again, doesn’t mean the emergency is over… • Fault-tolerance MUST be restored before ANYONE can relax • replacement disk must be added asap to replace the damaged one • the mirror must then be re-established • all the disk copying required to re-establish system fault-tolerance may take some time…
Relative Merits of Mirroring (system availability) • Advantage: • system keeps going as normal if a non-boot disk crashes • Disadvantages: • disk write operations take longer • half of available disk space is used up (only 50% efficient use of storage)
Hardware flaw with Mirroring • Regardless of the boot disk problem, disk mirroring is STILL not entirely fault-tolerant! • both disks connected to the same hard disk controller • if the controller card goes down, neither disk will be accessible
Disk Duplexing • Separate controller card for each disk • if one card goes down, only the disk connected to it is affected • NOTE: • use of duplexing DOES NOT eradicate the potential re-booting problem caused by a damaged boot disk • needs the same solution as mirroring • [Diagram: Disk A attached to Controller A and Disk B attached to Controller B, both controllers on the motherboard]
Problem: Too Much redundancy of disk space • Redundancy = disk space used for backup / total disk space • Both mirroring and duplexing: • Redundancy = 0.5 (50%) • Rather high • Half of available space tied up in backup! • Solution: RAID (Redundant Array of Inexpensive Disks) • less redundancy • still full backup
What is RAID? • A system of several disks where part of each disk is used to store system data, and the rest stores backup data • If all the disks are linked together and just used for primary data (i.e. no backup): • the arrangement is known as a stripe set • also known as RAID 0 (i.e. zero fault tolerance)
Categories of RAID providing fault tolerance • RAID 1 - mirroring or duplexing • RAID 2 – backup using disks that do not have their own error-checking • RAID 3 – backup using disks with their own error checking • striped across disks at byte level • parity data stored on one drive • RAID 4 – similar to RAID 3 • but data striped in whole blocks, not per byte • poor data write performance
RAID 5 (the best!) • Can use different numbers of disks (minimum three) • Each disk divided into sections • One parity section in each disk • Data write faster than RAID 4 • Redundancy depends on number of disks used…
Example of RAID 5 (four disks) • Each of the four disks divided into four sections • one section for parity in each • parity is spread across the disks, so a write does not always go to the same parity disk • data write therefore faster than RAID 4, read slower • Redundancy = ¼
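Parity in RAID 5 is computed as the XOR of the data blocks in a stripe, which is what allows one lost block to be rebuilt from the others. The Python sketch below shows this for one stripe on a four-disk set (three data blocks plus parity); the block contents and the `xor_blocks` name are invented for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe: three data blocks and the parity block computed from them.
data = [b"ABCD", b"EFGH", b"IJKL"]
parity = xor_blocks(data)

# Suppose the disk holding data[1] fails: rebuild it from the survivors.
rebuilt = xor_blocks([data[0], data[2], parity])
```

Because XOR-ing a value in twice cancels it out, combining the surviving data blocks with the parity block leaves exactly the missing block. The same property means a second simultaneous disk failure cannot be recovered.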
RAID 5 (five disks) • RAID 5 using five disks (most popular), each divided into five sections • One section for parity as before • Redundancy = 1/5
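The redundancy figures quoted for mirroring and RAID 5 follow from redundancy = space used for backup / total space. A short Python sketch, in which the `redundancy` function and its level names are assumptions for illustration:

```python
from fractions import Fraction

def redundancy(level, disks):
    """Fraction of total disk space given over to backup or parity."""
    if level == "raid0":
        return Fraction(0)            # stripe set: no backup at all
    if level == "mirror":
        if disks != 2:
            raise ValueError("this sketch models a simple two-disk mirror")
        return Fraction(1, 2)         # one disk's worth out of two
    if level == "raid5":
        if disks < 3:
            raise ValueError("RAID 5 needs at least three disks")
        return Fraction(1, disks)     # one disk's worth of parity in total
    raise ValueError("unknown level: " + level)

mirror = redundancy("mirror", 2)
raid5_four = redundancy("raid5", 4)
raid5_five = redundancy("raid5", 5)
```

Adding disks to a RAID 5 set lowers the redundancy further, but the set still tolerates only one disk failure at a time.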
Hot Swapping • Disks that can be removed and replaced without rebooting the system • If a disk that belongs to a RAID system fails… • the system can continue • but fault tolerance is immediately lost • Helpful and quicker to replace the disk, and re-establish RAID: • as soon as possible • without having to turn the power off and reboot
Fault Tolerance and Re-boot • If a system crashes and/or is rebooted… • availability is temporarily lost • Needs to be a reserve system (backup server) that will perform that system’s functions in the meantime • Network Operating system needs to synchronise processes across systems to enable this to take place…
The Backup Server • Essential for 100% availability • Should be configured as a replacement for the main server • also needs to be a domain controller • must also have a copy of the user database, regularly synchronised with the main domain controller • also configured to be able to log users onto the network
Backup of Settings before Reconfiguration by Rebooting • New hardware will be added to a server from time to time: • more memory? • extra hard disk? • new video, sound, or network card? • Hot swapping may well NOT be supported • Server will have to reboot and reconfigure
Backup of Settings before Reconfiguration by Rebooting • If the new drivers are not correct: • system may not reboot properly • may be difficult to remove drivers • In such circumstances, system needs a “rollback” feature, so the old hardware can be put back, as well as… • previous settings safely stored where they can be easily retrieved • previous settings restored as an option on boot-up
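The rollback idea can be sketched in Python as a store that saves the previous settings before any change is applied. `SettingsStore` and the driver names are hypothetical; operating systems implement this with their own mechanisms, such as a saved copy of the last known good configuration.

```python
import copy

class SettingsStore:
    """Keeps the current settings plus one saved copy to roll back to."""
    def __init__(self, settings):
        self.current = dict(settings)
        self.previous = None

    def apply(self, changes):
        # Save the old settings BEFORE changing anything.
        self.previous = copy.deepcopy(self.current)
        self.current.update(changes)

    def rollback(self):
        if self.previous is None:
            raise RuntimeError("no saved settings to roll back to")
        self.current = self.previous
        self.previous = None

# Example: a new network driver is installed, then rolled back.
store = SettingsStore({"network_driver": "v1"})
store.apply({"network_driver": "v2"})
after_change = store.current["network_driver"]
store.rollback()
after_rollback = store.current["network_driver"]
```

The important design point is the order of operations in `apply`: the old settings are stored safely first, so they can still be retrieved if the new configuration fails to boot.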
Keeping Servers Cool! • Servers work hard (especially the disks…) • Can get hot • will reduce MTBF of components • Need good ventilation at all times…
Minimising Effects of Power Failure • Power failure can ruin hardware • mains spikes can overheat components • sudden lack of power will lose data currently being processed • Best to protect all hardware: • bottom line - surge protector • better: UPS (uninterruptible power supply)
The UPS • Battery packs that can provide mains voltage after a power cut • for a few minutes (cheap but effective) • or half an hour (expensive, less down time) • NOS needs to make sure it automatically cuts in when voltage drops sharply • Power continuation must include the backup domain controller, so synchronisation can occur • procedure of “graceful degradation” • allows processing to go to completion • allows new system settings to be written
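The decision the NOS has to make while running on battery can be sketched in Python. The `ups_action` function, its state names and the two-minute shutdown allowance are assumptions for illustration; real UPS monitoring software reads these values from the UPS hardware.

```python
def ups_action(on_mains, battery_minutes_left, shutdown_minutes_needed=2.0):
    """Decide what the server should do given the power situation."""
    if on_mains:
        return "run"
    if battery_minutes_left > shutdown_minutes_needed:
        return "run_on_battery"
    # Too little battery left to risk waiting: finish work and save settings now.
    return "shut_down_gracefully"

normal = ups_action(on_mains=True, battery_minutes_left=30)
power_cut = ups_action(on_mains=False, battery_minutes_left=10)
nearly_flat = ups_action(on_mains=False, battery_minutes_left=1.5)
```

Starting the shutdown while there is still enough battery for it to complete is what allows processing to finish and new system settings to be written.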
The Fault Tolerant Network Operating System • A Fault Tolerant system needs to have good control of hardware, backup hardware and software • The NOS, and those who configure it, need to use fault tolerance effectively so an organisational network will • keep going… (availability) • do what is expected… (reliability, stability)
Network Operating Systems and Fault Tolerance • Many features to make fault tolerance kick in automatically • However, fault tolerance only restored once the faulty component has been replaced and its replacement configured to work as the new backup… • You’ll see how all this can be achieved in the practicals…