Physical Database Design

Physical Database Design • The last phase of database design. • It is to determine how to store the database. • RDBMSs usually support a number of alternative physical representations. • The designer need to know the advantage and disadvantage of each representation.

Physical Database Design Objectives: • Data may be accessed with acceptable speed; • The data does not use up too much computer’s storage; • The database is reasonably robust: • Be possible to recover a damaged database system; • Still be possible to run the reminder if part of system fails.

Physical Design Process Design decisions should be based on the following knowledge: • The logical database design: which relations are to be included. • Quantities and volatility of data: the number of tuples, the frequency with which each relation will be altered, the rate at which each relation will grow. • The way in which the data is to be used: the frequency of runing the application; then longest time for the application to execute. • Costs for storing and accessing data: how the representation affects the speed to access, insert and delete the records.

Physical Design Process The design process is a ‘implement-test-improve’ process: Step 1:analysis of the database to generate a initial design; Step 2: using test database to test the initial design. Step 3:modification of initial design by removing Bottle-necks. Step 4: monitoring the performance of database systems. Modification should be made to correct inappropriate design.

Physical Representation The requirements of any strategy for representation: a) It must make it possible to access all data without having to specify where tuples are stored; b) It must be possible to apply the relational algebra operators restrict, project, join etc. c) It must be possible to display the relation as a table of values.

P1 WH1 B1 100 P1 WH1 B3 200 P4 WH3 B2 3000 P2 WH4 B9 50 P5 WH4 B10 50 P5 WH4 B11 50 P_No WareHouse Bin_No Quantity P1 WH1 B1 100 P1 WH1 B3 200 P4 WH3 B2 3000 P2 WH4 B9 50 P5 WH4 B10 50 P5 WH4 B11 50 Physical Representation Example of file representation for data File page records

File Structure and access techniques 1. Heap file, serial search; 2. Sorted file, binary search; 3. Hash file, hashing function; 4. Index: B-tree; 4. Clustering: attribute grouping.

Heap Files • The simplest file structures. A heap file is constructed as a list of pages. A new record is inserted in the last page of the file. Advantage of using heap files: 1) Fast record insertion: just insert news record in the last page. 2) Economic use of store: to store the data records only.

Heap Files Disadvantage of using heap files: 1) Can only use the serial searching method, which is the slowest searching method; 2) Unable to reclaim the space of deleted records. When to use heap files: 1)batch records is to be inserted; 2)a few pages long only; 3)used as a part of some other structures.

Access keys When a file is organized to provide direct access to records on the basis of values of specific attributes, then those attributes are called access keys. Select * FROM STOCK WHERE P_NO= ‘p1’ AND Quantity > 100; P_NO and Quantity is used as the access keys

Access Key The most appropriate access keys can be selected on the basis of how tight they are. A tight access key is one where there are relatively few tuples containing specific access key values. If there are many such tuples, the access key is said to be loose. Extreme examples: a primary key is an extremely tight; attribute ‘SEX’ is an extremely loose access key.

C1 C2 C3 C5 C7 C9 C12 C15 C19 Sorted Files In sorted files, the records are sorted in some order. For sorted file, the binary search can be used to access the specific tuples. SELECT * FROM CUSTOMER WHERE CUSTOMER_NO=‘C9’;

Sorted Files When to use sorted files: When tuples are normally accessed in some specific sequence; The main issues: How to maintain the sequence when new records are inserted.

Hash Files Hashing is the process of calculating the location of a record (page address) from the value of an access key. The access key is also called the hash key. Hash files are sometimes called random files as the records appear to be randomly distributed across the file space. Hashing potentially provides the fastest access to a record via an access key – a record may be retrieved by reading just one page.

0 1 2 3 4 i Record(k) j Hash Functions <Record Key>  [Hushing Function]  <Home Address> k ----- f(k) ---- 2 l ----- f(l) ---- 3 m---- f(m) ---- 4 Example: Hushing function Modulo-5: Access key values: k=12; l =13; m=24

When to use Hash Files When retrieval is always on the basis of the value of a single access key. Inappropriate situations to use the hash files: 1) When retrieval is on the basis of pattern matching. 2) When retrieval is on the basis of range value; 3) When access is on the basis of values only PART of the access key.

P_No P_No Address Warehouse Bin_No Quantity P1 P1 WH1 * B1 100 P1 P1 WH1 ** B3 200 P4 P2 WH3 *** B2 3000 P4 P2 WH4 **** B9 50 P5 P5 WH4 ***** B10 50 P5 P5 WH4 ****** B11 50 P6 P6 WH5 ******* B1 20 Indexes Indexes are an alternative to hashing as a mechanism for direct access to records. An index is a table of access key values, along with the address of the records to store the associating value. P_No_Index P_No_Index

Indexes Advantage of using indexes: 1) Indexes provide access to sequences of records; 2) Indexes may be used to implement many access keys for one relation, whereas there can be only one hash key. Index are managed by: CREATE INDEX DROP INDEX

P1 P6 P5 P1 P1 P4 P5 WH1 WH5 WH3 * * WH4 * B1 B10 B2 B1 100 20 3000 50 P6 P2 P5 P1 P2 P5 WH4 WH4 WH1 ** ** ** B11 B3 B9 50 50 200 Multi-level Indexes The shorter the index, the faster the search. When an index is large the search time can become significant. A solution to this problem is to split the index up into a number of shorter indexes and to provide an index to the indexes. Data Records 1st_Index 2nd_Index

B-trees A B-tree is a type of multi-level full index. B-trees are widely used because they are largely self-maintaining. A B-tree keeps itself balanced such that it always takes approximately the same time to access any data record.

* * * * * * * * C_No * C1 C9 C25 C19 C25 C9 C25 C11 C19 Name * * * * * * * * * C23 C15 C10 C2 C32 C27 C11 C27 Area * * * * * * * * C1 Nippers Ltd W Yorks C2 Tot-Gear Middl C9 Kid-Naps Middl C10 Boys Hats London C11 Play Time London C15 School Kit London C19 Smart Kids Anglia C23 Bed Socks London C25 Slugs Anglia C27 Kids Stuff Middl C32 Play Ground Middl C34 Way In W Yorks B-trees Example:

When to use indexes and B-tree Indexes are suitable for the following situations: 1) Pattern matching based retrieval; 2) Retrieval with rang of access key; 3) Retrieval based on multi-attribute access key. B-tree is more suitable for the following situation: 1) When the relation is frequently updated; 2) When the relation is so large that costly to recreate; 3) When sorting the access key is required.

Clustering Clustering is the technique of storing related records physically close together. The advantage of doing this is that it reduces the number of page accesses necessary to process a group of related records. When to use clustering: When an application accesses groups of tuples which have some common attribute value.

Summary • Physical database design is the process of deciding how to store relations; • The decision is based on the logical database design, the volume and volatility of data, the ways to manipulate the data, and the costs of representation. • A physical database design is performed by ‘implement-test-improve’ process. • Typical methods: heap file, sorted files, hash files, and indexes (like B-tree).

Summary • A heap file is a chain of pages where new records are simply added to the end. • Other file organization make it possible to access records speedily, given the value of some access key. • Sorting files could improve the access speed. Binary searching may be used to access records with a specific value of the access key. • Hashing is the process of locating a record by calculating its address from the access key value.

Summary • An index is a table of access key values and associated record addresses. Indexed are slower than hashing but more flexible. • B-tree is a typical index method. • Clustering is the technique of storing related physically close together.

Physical Database Design