Physical Database Design

Physical Database Design

Conceptual database design • The process of constructing a model of the information used in an enterprise, independent of all physical considerations. Conceptual database design is entirely independent of implementation details such as the target DBMS, application programs, programming languages, hardware platform, performance issues, or any other physical consideration.

Logical database design • The process of constructing a model of the information used in an enterprise based on a specific data model, but independent of a particular DBMS and other physical considerations.

Physical database design • The process of producing a description of the implementation of the database on secondary storage; it describes the base relations, file organizations, and indexes used to achieve efficient access to the data, and any associated integrity constraints and security measures.

Physical database design • The physical database design is tailored to a specific DBMS. There is feedback between physical and logical design, because decisions taken during physical design for improving performance may affect the logical data model.

Physical database design steps • Translate global logical data model for target DBMS 2. Design physical representation 3. Design user views 4. Design security mechanisms 5. Consider the introduction of controlled redundancy 6. Monitor and tune the operational system

1. Translate Global Logical Data Model for Target DBMS • This process requires knowledge of the functionality offered by the target DBMS. For example, the designer need to know: • How to create base relations • Whether the system supports the definition of primary keys, foreign keys, and alternate keys; • Whether the system allows attributes to be defined as NOT NULL • Whether the system supports the definition of domains, integrity constraints, and enterprise constraints.

Three activities of Translate Global Logical Data Model for Target DBMS • Design base relations • Design representation of derived data • Design enterprise constraints

Design base relations • the name of the relation • a list of simple attributes • the primary key, alternate keys, and foreign keys • a list of any derived attributes and how they should be computed • referential integrity constraints for any foreign keys identified • its domain, consisting of a data type, length, and any constraints on the domain • an optional default value for the attribute • whether the attribute can hold nulls

Design representation of derived data • Derived attributes do not appear in the logical data model but are documented in the data dictionary. • The designer should calculate: • the additional cost to store the derived data and keep it consistent with operational data from which it is derived; • the cost to calculate it each time it is required

Design enterprise constraints • CONSTRAINT staffnothandlingtoomuch CHECK ( NOT EXISTS ( SELECT staffno FROM propertyForRent GROUP BY staffNo HAVING COUNT(*) > 100 ) )

2. Design Physical Representation Objective To determine the optimal file organizations to store the base relations and the indexes that are required to achieve acceptable performance, that is, the way in which relations and tuples will be held on secondary storage.

Factors used to measure efficiency • Transaction throughput The number of transaction that can be processed in a given time. • Response time The elapsed time for the completion of a single transaction. • Disk storage The amount of disk space required to store the database files.

The four basic hardware components • The four basic hardware components interact and affect system performance: • Main memory • CPU • Disk I/O • Network

Disk I/O • The O/S files should be separated from the database files • The main data base files should be separated from the index files • The recovery log file should be separated from the rest of the database Main database file Index file Recovery log file O/S

Design physical representation activities • Analyze transactions • Choose file organizations • Choose indexes • Estimate disk space requirements

Analyze Transaction Objective To understand the functionality of the transactions that will run on the database and to analyze the important transactions.

Analyze Transaction In analyzing the transactions, we attempt to identify performance criteria, such as • The transactions that run frequently and will have a significant impact on performance • The transactions that are critical to the operation of the business • The times during the day/week when there will be a high demand made on the database (called the peak load)

Choose file organizations • In many cases, a relational DBMS may give little or no choice for choosing file organizations.

File Organization and Storage Structures Basic Concepts Page 1 2

The main types of file organization • Heap (unordered) files • Sequential (ordered) files • Hash files

Heap File • Records are placed in the file in the order as they are inserted. • Linear search must be performed to access method. • To delete record – the required page has to be retrieved, the record marked as deleted, and the page written back to disk. • This means that heap files have to be periodically reorganized by the DBA. • Heap files are one of the best organization for bulk loading data into a table.

Sequential (ordered) files • The records in a file can be sorted on the values of one or more of the fields Example – order field: Staff No SELECT * FROM staff WHERE staffno = ‘SG37’; SELECT * FROM staff ORDERBY staffno; binary search SA9 SG5 SG14 SG37 SL21 SL41 (1) (3) (2)

Sequential (ordered) files • Insert and Deleting records ค่อนข้างจะเป็นปัญหา • ในการ Insert จะหาตำแหน่ง Page ก่อน • หาก page ว่างข้อมูลจะถูกบันทึกลงไป • หากไม่ว่างจะต้องขยับ record ไปยัง page ต่อไปเพื่อให้เกิดที่ว่างเพื่อ insert record ใหม่ได้ • จะเป็นข้อเสียอย่างมากหากมีการ insert ณ ต้นๆ files

Sequential (ordered) files การแก้ปัญหาการ insert • สร้าง Temporary unsorted file หรือเรียกว่า overflow หรือ transaction file • เมื่อมีการ insert จะ insert ลง overflow • เวลา Retrieve หากหา main file ไม่พบจะมาค้นที่ overflow โดยใช้วิธี linear search • DBA จะทำหน้าที่ merge ระหว่าง main file และ overflow เป็นระยะ

Hash Files • A hash function calculates the address of the pages. • The base field is called the hash field • Techniques used to calculate are folding and division-remainder

Folding Technique • Folding technique ใช้หลักการทางคณิตศาสตร์ Example: S U R A T 19 21 18 01 20 1 9 2 1 + 1 8 0 1 2 0 = 1 8 2 0 4 1 PAGE ที่เก็บข้อมูล

Division-remainder • Division-remainder ใช้เศษที่เหลือจากการหาร Example: Product No. 120 page ที่เก็บข้อมูลคือ 117 / 13 = 9 เศษ 3 ตำแหน่งที่เก็บข้อมูล

Hash Files • ปัญหาของการใช้ Hashing คือวิธีการนี้ไม่สามารถรับรองว่าแต่ละ record จะได้ รับ unique address • แต่ละ address ที่ได้รับก็คือตำแหน่งของ page หรือ bucket • ในแต่ละ bucket จะมีหลาย slots ซึ่งแต่ละ slot จะใช้สำหรับแต่ละ record • ในแต่ละ bucket ข้อมูลแต่ละ record จะถูกบันทึกเรียงตามการมาถึง • ถ้าตั้งแต่ 2 records ขึ้นไปถูกให้บันทึกใน address เดียวกันจะเรียกว่าเกิดการชน (collision) • Records ที่ชนกันจะเรียกว่า synonyms

วิธีการแก้ปัญหา collision ใน hash file • Open addressing • Unchained overflow • Chained overflow • Multiple hashing

Collision resolution using Open Addressing Before Bucket After Bucket ID 1658 ID 6546 ID 0345 ID 2215 ID 3254 ID 1658 ID 6546 ID 0345 ID 4555 ID 2215 ID 3254 0 0 1 1 2 2 สมมติเราต้องการ insert record ID4555 ซึ่งจาก Hash function คำนวณแล้วจะต้องอยู่ bucket ที่ 2 แต่ bucket ที่ 2 เต็มแล้วจึง scan หา bucket ที่ว่าง แล้วจึงนำไป insert ที่ bucket นั้นในที่นี้ก็คือ bucket ที่ 1 นั่นเอง

Collision resolution using unchained overflow Overflow area Before Bucket Bucket ID 1658 ID 6546 ID 0345 ID 2215 ID 3254 ID 4555 0 3 1 4 2 ไม่ว่า record ใดหากคำนวณโดย Hash function แล้วเกิด collision แล้วจะนำไปเก็บไว้ใน overflow area ซึ่งก็คือ bucket ที่ 3 และ 4 เมื่อมีการค้นข้อมูลใน area หลักไม่เจอ จะไปหาในบริเวณ overflow area

Collision resolution using chained overflow Overflow area Before Bucket Bucket ID 1658 ID 6546 ID 0345 ID 2215 ID 3254 ID 4555 0 3 0 1 4 0 2 3 สมมติเราต้องการ insert record ID4555 ซึ่งจาก Hash function คำนวณแล้วจะต้องอยู่ bucket ที่ 2 แต่ bucket ที่ 2 เต็มแล้วจึง insert row ID 4555 ลงใน overflow area คือ bucket ที่ 3 โดยที่ bucket หลักจะมี synonym pointer ชี้ไปที่ bucket ที่ 3 อาจมี technique อื่นอีก เช่นแทนที่จะชี้ไปที่ bucket แต่จะชี้ไปที่ slot address แทน

Multiple hashing • หากการใช้ hashing function ชนิดแรกแล้วเกิดการชนกัน ก็จะใช้ hashing function ชนิดที่สองแทน

Limitations of Hashing Hashing ไม่เหมาะกับการค้นในลักษณะเป็นช่วง เช่นหาพนักงานที่มีเงินเดือนระหว่าง 8,000 – 12,000 บาท

Indexes Index A data structure that allows the DBMS to locate particular records in a file more quickly and thereby speed response to user queries

Dense Index Supplier file (data) S1 Smith 20 London S2 Jones 10 Paris S3 Blake 30 Paris S4 Clark 20 London S5 Adams 30 Athens

Dense Index City file (index) Athens London London Paris Paris Supplier file (data) S1 Smith 20 London S2 Jones 10 Paris S3 Blake 30 Paris S4 Clark 20 London S5 Adams 30 Athens

Index • Advantage : Speed up Retrieval • Disadvantage : Slow down update

Nondense index Supplier file (data) S1 Smith 20 London S2 Jones 10 Paris S3 Blake 30 Paris S4 Clark 20 London S5 Adams 30 Athens S# index S2 S4 S5 Page P-1 Page P Page P+1

12 32 58 70 89 94 6 8 12 15 18 32 35 40 50 51 52 52 60 62 70 71 78 82 83 85 89 91 93 94 96 97 99 B-trees Index set 50 82 Sequence set

Clustered Tables • Clusters are groups of one or more tables physically stored together because they share common columns.

Clustered Tables Employee Department Cluster key

Physical Database Design