Lecture 8 on Physical Database. DBMS has a view of the database as a collection of stored records, and that view is supported by the file manager which has a view of the database as a collection of pages, and that view is supported by the disk manager. The DBMS, file manager and disk manager.
Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.
Lecture 8 on Physical Database
DBMS has a view of the database as a collection of stored records, and that view is supported by the file manager which has a view of the database as a collection of pages, and that view is supported by the disk manager.
The DBMS, file manager and disk manager
Request stored record
Store record returned
Request stored page
Stored page return
Disk I/O operation
Data read from disk
Physical database design
The physical database design is initiated to a certain extent in the “logical” design. The physical organization is determined largely by the need for operational efficiency, fast response times, and cost minimization. Most data storage devices record data as a stream of bits. The groups of bits which we can read with one machine instruction are called physical records. The physical records are stored at locations which are identified by a means of machine addresses. A program identifies a logical record or sequent by means of a key.
With the physical sequential access method, the physical records are stored in logical sequence. If the storage medium to be used is a tape, the programmer has to present the physical records in a logical sequence. If the storage medium is a direct access one, the system will interconnect the physical records so that they are in logical sequence, even if they were not presented in logical sequence. The records must be read in a fixed sequence from begin to end sequential.
With ISAM files the records are grouped so as to fit onto physical disk tracks, and one track on each cylinder contains an index to the records stored in that cylinder. When new records are inserted after the original sequential file has been set up these are stored in an overflow area. The index track contains pointers both to the prime data area and to the overflow area.
A common scheme for extremely large files is to induce a hierarchy of indices that follow the hierarchical nature of the secondary storages devices on which the file resides. We can view the hierarchy of indices as a tree. The benefit of index file is efficient space utilization in searching indexes.
Multilevel Index B+ Tree
B+-trees are defined to use a particular insertion/deletion strategy that ensures no node, except the root, is less than half full.
In general, we define index block values as:
2d -1 ≥ Order where Order is the maximum pointer in each index block.
d = number of search values in index block
Similarly, we define leaf block values as:
2e -1 ≥ Order where Order is the number of key values in each leaf block.
e = number of minimum key values in leaf block
For example, if the order is 3, then the number of key values in an index block is d = 2 such that 2d-1>3.
Also, the number of key values in a leaf block is 3 such that 2e-1>3 where e=2.
To insert a record with key value v, apply the lookup procedure to find the block B in which this record belongs.
If there are room (< 2e-1 records) in B, insert the new record in B.
If there is no room (=2e-1 records) in block B, create a new block B1 and divide the records from B and the inserted record into two groups of e record each.
The effects of inserting a record into B can ripple up the tree for several levels up to the root.
If we delete the record with key value v, we use the lookup procedure to find this record.
If after deletion, block B has more than half (e) records,we are done.
If, after deletion, block B has less than half (e-1) records, we look for a neighbor block B1. If B1 has more than half ( e) records, we distribute records of B and B1;
otherwise combine B with B1, which will have exactly 2e-1 records, and in the parent of B, modify the record for B1 and delete record for B.
If the deleted record was the first in block B, then we go to the parents of B to change the key value in the record for B.
If B is the first child of its parent, the parent has no key value for B, so we must go to the parent’s parent and so on, until we find an ancestor A1 of B such that A1 is not the first child of its parent A2.
Then the new lowest key value of B goes in the record of A2 that points to A1.
B+-tree after deleting 64
Let us search for a record with key value v. We find a path from the root of B+-tree to some leaf, where the desired record will be found if it exists. Suppose we have reached node (block) B. If B is a leaf, then examine block B for a record with key value v. If B is not a leaf, it is an index block. Determine which key value in block B covers v. In the record of B that covers v is a pointer to another block. That block follows B in the path being constructed.
Given a B+-tree with n records and e values in the leaf, and d values in each branch.
The tree will have no more than n/e leaves, no more than n/(de) parents of parents of leaves and so on.
If there are i nodes on paths from the root to leaves, then n>di-1e. It follows that
i1+logd(n/e) where i = number of I/O costs to access leaf block
For example, if n=1000000, e=5 and d=50, the number read/writes of blocks in an operation is
i 1+log50(200000) 4.12 5 (round up)
i = 4 (round down)
Notice that log50(200000) = log (200000) / log(50)
Hashing is a form of address calculation technique which can convert an item’s key into a near-random number used to determine where the item is stored. The near-random number refer to the address where a record is stored. The number of logical records stored in this area is referred to as the bucket capacity. The benefit of hashing is fast response time but with poor space ultilzation.
If the key is not numeric, convert it into numeric.
The keys are converted into a spread of numbers of the order of magnitude of the address numbers required.
The resulting numbers are multiplied by a constant which compresses them to the precise range of address.
Remainder hashing algorithm example
Suppose supplier number values are S100, S200, S300, S400, S500, and each stored supplier record requires an entire page to itself. By using hash function division/remainder. The page numbers for the five suppliers are then 9, 5, 1, 10, 6 and the divider is 13. For example, reminder of 100/13 is 9.
Suppose that the internal bit string representation of a key is
and that 5 bits are allowed in the index. The three bit strings
01011, 10010 and 10110
are exclusive ORed to produce, i.e., 0 ORed 0 = 0
1 ORed 0 = 1
0 ORed 1 = 1
1 ORed 1 = 0
which is 15 as a binary integer.
This method treats the key as a single large number, square the number, and extract whatever number of digits is needed from the middle of the result.
Suppose you want to generate addresses between 0 and 99. If the key is the number 453, its square is 205,209. Extracting the middle two digits yields a number between 0 and 99, in this case 52.
Logical record order can be maintained using inverted list which is a table that cross references record addresses with some field value. The benefit of inverted file is to implement secondary index, that is, an alternative index besides prime index.
There is one entry in the secondary key’s index for each value that the secondary key presently has in the data file.
The entry in the multi-list index for a key value has just one pointer to the first data record with that key value.
The data records contains a pointer to the next data record with that key value, and so forth.
There is a linked list of data records for each value of the secondary key.
Multi-list chains are bi-directional, and occasionally are circular to improve update performance of a database.
The benefit of multi-list file is to implement duplicate secondary indexes.
Four kinds of file structures have been introduced to design physical database in order to implement logical database schema.
Firstly, B+ tree is good for both performance with indexing and space utilization with balanced blocks.
Secondly, hashing is good for fast response in searching key value.
Thirdly, inverted file is good for secondary indexing.
Fourthly, multi-linked lists is good for duplicate secondary indexing.
How to compare the efficiency of physical database storage by use of B+-tree file, hashing file, inverted list file and multi-list file?
What is a B+-tree and what are its components.
Given number of search-key values that fit in one index node is 2 and in one leaf node is 3. Construct a B+-tree for the following set of key values (2, 3, 5, 7, 11, 17, 19, 23, 29, 31).
Show how to use this B-tree to find record with search-key value 11. Show B-tree after inserting search-value 9 and after deleting search value 17.
Chapter 14 Indexing Structures for Files of “Fundamentals of Database Systems:, 5th edition, by Elmasri and Navathe, Pearson International Edition, 2007, pp.500-531.