310 likes | 314 Views
LIS569 – Database Systems Dr. Jianqiang Wang Physical Database Design. Week 7 , Feb. 25, 2013. Objectives. Define terms Describe the physical database design process Choose storage formats for attributes Select appropriate file organizations Describe three types of file organization
E N D
LIS569 – Database SystemsDr. Jianqiang WangPhysical Database Design Week 7, Feb. 25, 2013
Objectives • Define terms • Describe the physical database design process • Choose storage formats for attributes • Select appropriate file organizations • Describe three types of file organization • Describe indexes and their appropriate use • Translate a database model into efficient structures • Know when and how to use denormalization
Physical Database Design • Purpose is to translate the logical design into the technical specifications for storing and retrieving data • Goal is to create a design for storing data that will provide adequate performance and insure database integrity, security, and recoverability
Inputs Decisions • Normalized relations • Attribute definitions • Volume estimates • Response time expectations • Data security needs • Backup/recovery needs • Integrity expectations • DBMS technology used • Attribute data types • Physical record descriptions (doesn’t always match logical design) • File organizations • Indexes and database architectures • Query optimization Leads to Physical Design Process
Example of Composite Usage Map Data volumes
Example of Composite Usage Map Access Frequencies (per hour)
Example of Composite Usage Map Usage analysis: 14,000 purchased parts accessed per hour 8,000 quotations accessed from these 14,000 purchased part accesses 7,000 suppliers accessed from these 8,000 quotation accesses
Example of Composite Usage Map Usage analysis: 7,500 suppliers accessed per hour 4,000 quotations accessed from these 7,500 supplier accesses 4,000 purchased parts accessed from these 4000 quotation accesses
Designing Fields • Field: smallest unit of data in database • Field design • Choosing data type • Coding, compression, encryption • Controlling data integrity
MySQL Text Types • CHAR(size): a fixed length string, up to 255 characters. • VARCHAR(size): a variable length string, up to 255 characters. • TINYTEXT: a string with a maximum length of 255 characters. • TEXT: a string with a maximum length of 65,535 characters. • BLOB: Binary Large Objects, up to 65,535 bytes. • MEDIUMTEXT: a string with a maximum length of 16,777,215 characters. • MEDIUMBLOB: up to 16,777,215 bytes of data. • LONGTEXT: a string with a maximum length of 4,294,967,295 characters. • LONGBLOB: up to 4,294,967,295 bytes of data. • ENUM(x,y,z,etc.): let you enter a list of possible values, up to 65535 values. • SET: similar to ENUM except that SET may contain up to 64 list items and can store more than one choice
MySQL Number Types • TINYINT(size): -128 to 127 normal. 0 to 255 UNSIGNED. The maximum number of digits may be specified in parenthesis (same below). • SMALLINT(size): -32768 to 32767 normal. 0 to 65535 UNSIGNED. • MEDIUMINT(size): -8388608 to 8388607 normal. 0 to 16777215 UNSIGNED. • INT(size): -2147483648 to 2147483647 normal. 0 to 4294967295 UNSIGNED. • BIGINT(size): -9223372036854775808 to 9223372036854775807 normal. 0 to 18446744073709551615 UNSIGNED. • FLOAT(size,d) : a small number with a floating decimal point, the maximum number of digits to the right of the decimal point is specified in the d parameter. • DOUBLE(size,d) : a large number with a floating decimal point. • DECIMAL(size,d) : a DOUBLE stored as a string , allowing for a fixed decimal point.
MySQL Date/Time Types DATE(): a date in the format of: YYYY-MM-DD. The supported range is from '1000-01-01' to '9999-12-31’. DATETIME(): a date and time combination in the format of YYYY-MM-DD HH:MM:SS. The supported range is from '1000-01-01 00:00:00' to '9999-12-31 23:59:59' TIMESTAMP(): a timestamp whose values are stored as the number of seconds since the Unix epoch ('1970-01-01 00:00:00' UTC). Format: YYYY-MM-DD HH:MM:SS. The supported range is from '1970-01-01 00:00:01' UTC to '2038-01-09 03:14:07' UTC. TIME() : a time in the format of HH:MM:SS, range from '-838:59:59' to '838:59:59‘. YEAR(): a year in two-digit or four-digit format, values in four-digit format: 1901 to 2155 or values in two-digit format: 70 to 69 (representing years from 1970 to 2069).
Example of a code look-up table Code saves space, but costs an additional lookup to obtain actual value
Field Data Integrity • Default value: assumed value if no explicit value • Range control: allowable value limitations (constraints or validation rules) • Null value control: allowing or prohibiting empty fields • Referential integrity: range control (and null value allowances) for foreign-key to primary-key match-ups Sarbanes-Oxley Act (SOX) legislates importance of financial data integrity
Handling Missing Data • Substitute an estimate of the missing value (e.g., using a formula) • Construct a report listing missing values • In programs, ignore missing data unless the value is significant (sensitivity testing) Triggers can be used to perform these operations
Denormalization • Transforming normalized relations into non-normalized physical record specifications • Benefits: • Can improve performance (speed) by reducing number of table lookups (i.e. reduce number of necessary join queries) • Costs (due to data duplication) • Wasted storage space • Data integrity/consistency threats • Common denormalization opportunities • One-to-one relationship • Many-to-many relationship with non-key attributes (associative entity) • Reference data (1:N relationship where 1-side has data not used in any other relationship)
Denormalization: a M:N relationship with nonkey attributes Extra table access required Null description possible
Denormalization: reference data Extra table access required Data duplication
Partitioning • Horizontal Partitioning: Distributing the rows of a table into several separate files • Useful for situations where different users need access to different rows • Three types: Key Range Partitioning, Hash Partitioning, or Composite Partitioning • Vertical Partitioning: Distributing the columns of a table into several separate relations • Useful for situations where different users need access to different columns • The primary key must be repeated in each file • Combinations of Horizontal and Vertical Partitions often correspond with User Schemas (user views)
Partitioning (cont.) • Advantages of Partitioning: • Efficiency: Records used together are grouped together • Local optimization: Each partition can be optimized for performance • Security: data not relevant to users are segregated • Recovery and uptime: smaller files take less time to back up • Load balancing: Partitions stored on different disks, reduces contention • Disadvantages of Partitioning: • Inconsistent access speed: Slow retrievals across partitions • Complexity: Non-transparent partitioning • Extra space or update time: Duplicate data; access from multiple partitions
Designing Physical Files • Physical File: • A named portion of secondary memory allocated for the purpose of storing physical records • Tablespace–named set of disk storage elements in which physical files for database tables can be stored • Extent–contiguous section of disk space • Constructs to link two pieces of data: • Sequential storage • Pointers–field of data that can be used to locate related fields or records
File Organizations • Technique for physically arranging records of a file on secondary storage • Factors for selecting file organization: • Fast data retrieval and throughput • Efficient storage space utilization • Protection from failure and data loss • Minimizing need for reorganization • Accommodating growth • Security from unauthorized use • Types of file organizations • Sequential • Indexed • Hashed
1 2 If sorted – every insert or delete requires resort Records of the file are stored in sequence by the primary key field values If not sorted Average time to find desired record = n/2 n
Indexed File Organizations • Indexed File Organization: the storage of records either sequentially or nonsequentially with an index that allows software to locate individual records • Index: a table or other data structure used to determine in a file the location of records that satisfy some condition • Primary keys are automatically indexed • Other fields or combinations of fields can also be indexed; these are called secondary keys (or nonunique keys)
uses a tree search Average time to find desired record = depth of the tree
Hash algorithm Usually uses division-remainder to determine record position. Records with same position are grouped in lists.
Clustering Files • In some relational DBMSs, related records from different tables can be stored together in the same disk area • Useful for improving performance of join operations • Primary key records of the main table are stored adjacent to associated foreign key records of the dependent table • e.g. Oracle has a CREATE CLUSTER command
Rules for Using Indexes • Use on larger tables • Index the primary key of each table • Index search fields (fields frequently in WHERE clause) • Fields in SQL ORDER BY and GROUP BY commands • When there are >100 values but not when there are <30 values
Rules for Using Indexes (cont.) • Avoid use of indexes for fields with long values; perhaps compress values first • If key to index is used to determine location of record, use surrogate (like sequence nbr) to allow even spread in storage area • DBMS may have limit on number of indexes per table and number of bytes per indexed field(s) • Be careful of indexing attributes with null values; many DBMSs will not recognize null values in an index search