Summarization – CS 257 Chapter – 13 & 14 (Index Structures) Database Systems: The Complete Book

Summarization – CS 257 Chapter – 13 & 14 (Index Structures)Database Systems:The Complete Book Submitted by: Deepti Kundu Submitted to: Dr.T.Y.Lin

13.1.1 Memory Hierarchy Several components for data storage having different data capacities available Cost per byte to store data also varies Device with smallest capacity offer the fastest speed with highest cost per bit

Memory Hierarchy Diagram Tertiary Storage As Visual Memory Disk File System Main Memory Cache Programs, DBMS Main Memory DBMS’s

13.1.1 Memory Hierarchy • Cache • Lowest level of the hierarchy • Data items are copies of certain locations of main memory • Sometimes, values in cache are changed and corresponding changes to main memory are delayed • Machine looks for instructions as well as data for those instructions in the cache • Holds limited amount of data • No need to update the data in main memory immediately in a single processor computer

Types of Memory

13.1.2 Transfer of Data Between levels Data moves between adjacent levels of the hierarchy At the secondary or tertiary levels accessing the desired data or finding the desired place to store the data takes a lot of time Disk is organized into bocks Entire blocks are moved to and from memory called a buffer

(cont’d) A key technique for speeding up database operations is to arrange the data so that when one piece of data block is needed it is likely that other data on the same block will be needed at the same time Same idea applies to other hierarchy levels

13.1.3 Volatile and Non Volatile Storage A volatile device forgets what data is stored on it after power off Non volatile holds data for longer period even when device is turned off All the secondary and tertiary devices are non volatile and main memory is volatile

13.1.4 Virtual Memory • Typical software executes in virtual memory • Address space is typically 32 bit or 232 bytes or 4GB • Transfer between memory and disk is in terms of blocks

13.2.1 Mechanism of Disk • Mechanisms of Disks • Use of secondary storage is one of the important characteristic of DBMS • Consists of 2 moving pieces of a disk • 1. disk assembly • 2. head assembly • Disk assembly consists of 1 or more platters • Platters rotate around a central spindle • Bits are stored on upper and lower surfaces of platters

(con’t) Disk is organized into tracks The track that are at fixed radius from center form one cylinder Tracks are organized into sectors Tracks are the segments of circle separated by gap

13.2.2 Disk Controller • One or more disks are controlled by disk controllers • Disks controllers are capable of • Controlling the mechanical actuator that moves the head assembly • Selecting the sector from among all those in the cylinder at which heads are positioned • Transferring bits between desired sector and main memory • Possible buffering an entire track

13.2.3 Disk Access Characteristics • Accessing (reading/writing) a block requires 3 steps • Disk controller positions the head assembly at the cylinder containing the track on which the block is located. It is a ‘seek time’ • The disk controller waits while the first sector of the block moves under the head. This is a ‘rotational latency’ • All the sectors and the gaps between them pass the head, while disk controller reads or writes data in these sectors. This is a ‘transfer time’

13.3 Accelerating Access to Secondary Storage • Several approaches for more-efficiently accessing data in secondary storage: • Place blocks that are together in the same cylinder. • Divide the data among multiple disks. • Mirror disks. • Use disk-scheduling algorithms. • Prefetch blocks into main memory. • Scheduling Latency – added delay in accessing data caused by a disk scheduling algorithm. • Throughput – the number of disk accesses per second that the system can accommodate.

13.3.1 The I/O Model of Computation • The number of block accesses (Disk I/O’s) is a good time approximation for the algorithm. • This should be minimized. • Ex 13.3: You want to have an index on R to identify the block on which the desired tuple appears, but not where on the block it resides. • For Megatron 747 (M747) example, it takes 11ms to read a 16k block. • A standard microprocessor can execute millions of instruction in 11ms, making any delay in searching for the desired tuple negligible.

13.3.2 Organizing Data by Cylinders • If we read all blocks on a single track or cylinder consecutively, then we can neglect all but first seek time and first rotational latency. • Ex 13.4: We request 1024 blocks of M747. • If data is randomly distributed, average latency is 10.76ms by Ex 13.2, making total latency 11s. • If all blocks are consecutively stored on 1 cylinder: • 6.46ms + 8.33ms * 16 = 139ms (1 average seek) (time per rotation) (# rotations)

13.3.3 Using Multiple Disks • If we have n disks, read/write performance will increase by a factor of n. • Striping – distributing a relation across multiple disks following this pattern: • Data on disk R1: R1, R1+n, R1+2n,… • Data on disk R2: R2, R2+n, R2+2n,… … • Data on disk Rn: Rn, Rn+n, Rn+2n, … • Ex 13.5: We request 1024 blocks with n = 4. • 6.46ms + (8.33ms * (16/4)) = 39.8ms (1 average seek) (time per rotation) (# rotations)

13.3.4 Mirroring Disks • Mirroring Disks – having 2 or more disks hold identical copied of data. • Benefit 1: If n disks are mirrors of each other, the system can survive a crash by n-1 disks. • Benefit 2: If we have n disks, read performance increases by a factor of n. • Performance increases further by having the controller select the disk which has its head closest to desired data block for each read.

13.3.5Disk Scheduling and the Elevator Problem • Disk controller will run this algorithm to select which of several requests to process first. • Pseudo code: • requests[] // array of all non-processed data requests • upon receiving new data request: • requests[].add(new request) • while(requests[] is not empty) • move head to next location • if(head location is at data in requests[]) • retrieve data • remove data from requests[] • if(head reaches end) • reverse head direction

(con’t) Events: Head starting point Request data at 8000 Request data at 24000 Request data at 56000 Get data at 8000 Request data at 16000 Get data at 24000 Request data at 64000 Get data at 56000 Request Data at 40000 Get data at 64000 Get data at 40000 Get data at 16000 64000 56000 48000 40000 32000 24000 16000 8000

(con’t) Elevator Algorithm FIFO Algorithm

13.3.6 Prefetching and Large-Scale Buffering • If at the application level, we can predict the order blocks will be requested, we can load them into main memory before they are needed.

13.4 Disk Failures • Intermittent failure • Media Decay • Write failure • Disk Crash

Intermittent Failures Read or write operation on a sector successful not on first try, but after repeated tries. The most common form of failure. Parity checks can be used to detect this kind of failure.

Media Decay Serious form of failure. Bit/Bits are permanently corrupted. Impossible to read a sector correctly even after many trials. Stable storage technique for organizing a disk is used to avoid this failure.

Write failure Attempt to write a sector is not possible. Attempt to retrieve previously written sector is unsuccessful. Possible reason – power outage while writing of the sector. Stable Storage Technique can be used to avoid this.

Disk Crash Most serious form of disk failure. Entire disk becomes unreadable, suddenly and permanently. RAID techniques can be used for coping with disk crashes.

More on Intermittent failures… When we try to read a sector, but the correct content of that sector is not delivered to the disk controller. If the controller has a way to tell that the sector is good or bad (checksums), it can then reissue the read request when bad data is read. The controller can attempt to write a sector, but the contents of the sector are not what was intended. The only way to check this is to let the disk go around again read the sector. One way to perform the check is to read the sector and compare it with the sector we intend to write.

Checksums. Technique used to determine the good/bad status of a sector. Each sector has some additional bits called the checksum that are set depending on the values of the data bits in that sector. If checksum is not proper on reading, then there is an error in reading. There is a small chance that the block was not read correctly even if the checksum is proper. The probability of correctness can be increased by using many checksum bits.

Checksum Calculation Checksum is based on the parity of all bits in the sector. If there are odd number of 1’s among a collection of bits, the bits are said to have odd parity. A parity bit ‘1’ is added. If there are even number of 1’s then the collection of bits is said to have even parity. A parity bit ‘0’ is added. The number of 1’s among a collection of bits and their parity bit is always even. During a write operation, the disk controller calculates the parity bit and append it to the sequence of bits written in the sector. Every sector will have a even parity.

Examples & the Odds… A sequence of bits 01101000 has odd number of 1’s. The parity bit will be 1. So the sequence with the parity bit will now be 011010001. A sequence of bits 11101110 will have an even parity as it has even number of 1’s. So with the parity bit 0, the sequence will be 111011100. There are chances that more than one bit can be corrupted and the error can be unnoticed. Increasing the number of parity bits can increase the chances of detecting errors. In general, if there are n independent bits as checksum, the chances of error will be one in 2n.

Stable Storage Checksums can detect the error but cannot correct it. Sometimes we overwrite the previous contents of a sector and yet cannot read the new contents correctly. To deal with these problems, Stable Storage policy can be implemented on the disks. Sectors are paired and each pair represents one sector-contents X. The left copy of the sector may be represented as XL and XR as the right copy.

Stable -Storage Writing Policy Write the value of X into XL. Check the value has status “good”; i.e., the parity-check bits are correct in the written copy. If not repeat write. If after a set number of write attempts, we have not successfully written X in XL, assume that there is a media failure in this sector. A fix-up such as substituting a spare sector for XL must be adopted. Repeat (1) for XR.

Stable-Storage Reading Policy: The policy is to alternate trying to read XL and XR until a good value is returned. If a good value is not returned after pre chosen number of tries, then it is assumed that X is truly unreadable.

Recovery from Disk Crashes. • To reduce the data loss by Dish crashes, schemes which involve redundancy, extending the idea of parity checks or duplicate sectors can be applied. • In general, if the mean time to failure of disks is n years, then in any given year, 1/nth of the surviving disks fail. • Each of the RAID schemes has data disks and redundant disks. • Redundant disks are one or more disks that hold information that is completely determined by the contents of the data disks. • When there is a disk crash of either of the disks, then the other disks can be used to restore the failed disk to avoid a permanent information loss.

Content 1)Focus on : “How to recover from disk crashes” RAID - “redundancy array of independent disks” 2)Several schemes to recover from disk crashes: • Mirroring—RAID level 1 • Parity checks--RAID 4 • Improvement--RAID 5 • RAID 6

1) Mirroring The simplest scheme to recovery from Disk Crashes -- making two or more copied of the data on different disks Benefit: -- save data in case of one disk will fail; -- divide data on several disks and let access to several blocks at once -- the only way data can be lost if there is a second (mirror/redundant) disk crash while the first (data) disk crash is being repaired.

2)Parity Blocks Why changes? -- disadvantages of Mirroring: uses so many redundant disks What’s new? -- RAID level 4: uses only one redundant disk How this one redundant disk works? -- modulo-2 sum; -- the jth bit of the redundant disk is the modulo-2 sum of the jth bits of all the data disks. Example: Data disks: Disk1: 11110000/ Disk2: 10101010/ Disk3: 00111000 Redundant disk: Disk4: 01100010

(con’t) Reading -- Similar with reading blocks from any disk; Writing 1)change the data disk; 2)change the corresponding block of the redundant disk; Why? -- hold the parity checks for the corresponding blocks of all the data disks

(con’t) For a total N data disks: 1) naïve way: • read N data disks and compute the modulo-2 sum of the corresponding blocks; • rewrite the redundant disk according to modulo-2 sum of the data disks; 2) better way: • Take modulo-2 sum of the old and new version of the data block which was rewritten; • Change the position of the redundant disk which was 1’s in the modulo-2 sum;

(con’t) • Data disks: • Disk1: 11110000 • Disk2: 10101010  01100110 • Disk3: 00111000 • to do: • Modulo-2 sum of the old and new version of disk 2: 11001100 • So, we need to change the positions 1,2,5,6 of the redundant disk. • Redundant disk: • Disk4: 01100010  10101110

(con’t) • Redundant disk crash: -- swap a new one and recomputed data from all the data disks; • One of Data disks crash: -- swap a new one; -- recomputed data from the other disks including data disks and redundant disk; • How to recomputed? (same rule, that’s why there will be some improvement) -- take modulo-2 sum of all the corresponding bits of all the other disks

3) An Improvement: RAID 5 Why need a improvement? -- Shortcoming of RAID level 4: suffers from a bottleneck defect (when updating data disk need to read and write the redundant disk); Principle of RAID level 5 (RAID 5): -- treat each disk as the redundant disk for some of the blocks; Why it is feasible? The rule of failure recovery for redundant disk and data disk is the same: “take modulo-2 sum of all the corresponding bits of all the other disks” So, there is no need to retreat one as redundant disk and others as data disks

(con’t) • How to recognize which blocks of each disk treat this disk as redundant disk? -- if there are n+1 disks which were labeled from 0 to N, then we can treat the ith cylinder of disk J as redundant if J is the remainder when I is divided by n+1; • Example : N=3; The first disk, labeled as 0 : 4,8,12…; The second disk, labeled as 1 : 1,5,9…; The third disk, labeled as 2 : 2,6,10…; Suppose all the 4 disks are equally likely to be written, for one of the 4 disks, the possibility of being written: 1/4 + 3 /4 * 1/3 =1/2 If N=m => 1/m +(m-1)/m * 1/(m-1) = 2/m

4) Coping with multiple disk crashes • RAID 6 • deal with any number of disk crashes if using enough redundant disks • Example a system of seven disks ( four data disks_numer 1-4 and 3 redundant disks_ number 5-7); • How to set up this 3*7 matrix ? (why is 3? – there are 3 redundant disks) 1)every column values three 1’s and 0’s except for all three 0’s; 2) column of the redundant disk has single 1’s; 3) column of the data disk has at least two 1’s; • Reading: • read form the data disks and ignore the redundant disk • Writing: • Change the data disk • change the corresponding bits of all the redundant disks

(con’t) • In those system which has 4 data disks and 3 redundant disk, how they can correct up to 2 disk crashes? • Suppose disk a and b failed: • find some row r (in 3*7 matrix)in which the column for a and b are different (suppose a is 0’s and b is 1’s); • Compute the correct b by taking modulo-2 sum of the corresponding bits from all the other disks other than b which have 1’s in row r; • After getting the correct b, Compute the correct a with all other disks available; • Example

Coping with multiple disk crashes

(con’t) • In that 3*7 matrix, find in row 2, disk 2 and 5 have different value and disk 2’s value is 1 and 5’s value is 0. • So: compute the first block of disk 2 by modulo-2 sum of all the corresponding bits of disk 1,4,6; • Then compute the first block of disk 2 by modulo-2 sum of all the corresponding bits of disk 1,2,3;

13.5 Arranging data on Disk • Data elements are represented as records, which stores in consecutive bytes in same same disk block. • Basic layout techniques of storing data : • Fixed-Length Records Allocation criteria - data should start at word boundary. • Fixed Length record header 1. A pointer to record schema. 2. The length of the record. 3. Timestamps to indicate last modified or last read.

Summarization – CS 257 Chapter – 13 & 14 (Index Structures) Database Systems: The Complete Book