
Practical Database Design and Tuning



Presentation Transcript


  1. Practical Database Design and Tuning: Memory DBMS vs. Disk DBMS; Hardware (RAID, SAN, NAS); Software (Hashing, B-Trees, Optimization)

  2. Old Disk Storage Devices

  3. DBMS Trends: Disk or Memory. Disk-based DBMS. Source: Stephen Mc Kearney www.smckearney.com/adb/notes/lecture.memory.based.databases.2up.pdf

  4.-8. DBMS Trends: Disk or Memory (figure slides) Source: Stephen Mc Kearney www.smckearney.com/adb/notes/lecture.memory.based.databases.2up.pdf

  9. Trends in Disk Technology Source: Elmasri & Navathe Database Systems 5th ed.

  10. Parallelizing Disk Access Using RAID Technology • Secondary storage technology must keep pace in performance and reliability with processor technology. • A major advance in secondary storage technology is RAID, which originally stood for Redundant Arrays of Inexpensive Disks. • The main goal of RAID is to bridge the widening gap between the rates of performance improvement of disks and those of memory and microprocessors.

  11. RAID Technology • A natural solution is a large array of small independent disks acting as a single higher-performance logical disk. A concept called data striping is used, which exploits parallelism to improve disk performance. • Data striping distributes data transparently over multiple disks to make them appear as a single large, fast disk. Source: Elmasri & Navathe Database Systems 5th ed.

  12. RAID Technology Different RAID organizations are defined by two factors: the granularity of data interleaving (striping) and the pattern used to compute redundant information. • RAID level 0 has no redundant data and hence the best write performance. • RAID level 1 uses mirrored disks. • RAID level 2 uses memory-style redundancy based on Hamming codes, which contain parity bits for distinct overlapping subsets of components; level 2 provides both error detection and correction. • RAID level 3 uses a single parity disk, relying on the disk controller to identify which disk has failed. • RAID levels 4 and 5 use block-level data striping, with level 5 distributing data and parity information across all disks. • RAID level 6 applies the so-called P + Q redundancy scheme, using Reed-Solomon codes to protect against up to two disk failures with just two redundant disks. Source: Elmasri & Navathe Database Systems 5th ed.
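The parity used by RAID levels 3-5 is a bytewise XOR across the data disks, which is what makes single-disk recovery possible: XORing the surviving disks with the parity reproduces the lost data. A minimal sketch, using toy 4-byte "blocks" on three data disks (real arrays operate on whole disk blocks):

```python
def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# One stripe: one block per data disk, parity stored on a fourth disk.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Disk 1 fails: rebuild its block from the parity and the survivors.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```

RAID 6's P + Q scheme adds a second, Reed-Solomon-based check block so that any two simultaneous failures can be survived; plain XOR parity alone cannot reconstruct two missing disks.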

  13. Use of RAID Technology Source: Elmasri & Navathe Database Systems 5th ed.

  14. Single and Dual Parity Large-capacity drives lengthen the time needed to recover from the failure of a single drive. Single-parity RAID levels are vulnerable to data loss until the failed drive is rebuilt: the larger the drive, the longer the rebuild takes. Dual parity gives time to rebuild the array without the data being at risk if a single additional drive fails before the rebuild is complete.

  15. Common RAID Levels RAID 0 - Striped set without parity RAID 1 - Mirrored disks RAID 5 - Striped disks with parity RAID 6 - Striped disks with dual parity RAID 10 - Striping and mirroring combined
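The redundancy choices above translate directly into usable capacity. A rough sketch, assuming n identical disks (these are the standard textbook formulas; real arrays lose a little more space to metadata):

```python
def usable_capacity(level, n, disk):
    """Approximate usable space for an array of n disks of size `disk`."""
    if level == 0:          # striping only, no redundancy
        return n * disk
    if level == 1:          # mirroring: half the raw space
        return n * disk // 2
    if level == 5:          # one disk's worth of parity
        return (n - 1) * disk
    if level == 6:          # two disks' worth of parity
        return (n - 2) * disk
    if level == 10:         # mirrored pairs, then striped
        return n * disk // 2
    raise ValueError(f"unsupported RAID level {level}")

print(usable_capacity(5, 8, 4))   # eight 4 TB disks in RAID 5 -> 28
```

For example, eight 4 TB disks yield 28 TB in RAID 5 but only 16 TB in RAID 10, which is the usual trade-off between capacity and rebuild safety.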

  16. Striping Striping is the splitting of data across more than one disk. Segments can be assigned to multiple physical devices (usually disk drives in the case of RAID storage) in a round-robin fashion and thus written concurrently. This technique is useful if the processor can read or write data faster than a single disk can supply or accept it. While data is being transferred from the first disk, the second disk can locate the next segment. Striping can be coarse-grained or fine-grained.
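The round-robin placement described above is simple arithmetic: logical block b lives on disk b mod n at offset b div n. A minimal sketch, assuming block-level (coarse) interleaving:

```python
def stripe_location(logical_block, ndisks):
    """Map a logical block number to (disk, offset) under round-robin striping."""
    disk = logical_block % ndisks        # which drive holds the block
    offset = logical_block // ndisks     # block position on that drive
    return disk, offset

# Consecutive logical blocks land on different drives, so they can be
# transferred concurrently.
for lb in range(6):
    print(lb, stripe_location(lb, 3))
```

With three drives, logical blocks 0, 1, 2 go to drives 0, 1, 2 at offset 0, and blocks 3, 4, 5 wrap around to the same drives at offset 1.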

  17. Benefits of using RAID Higher Data Security Fault Tolerance Improved Availability Increased, Integrated Capacity Improved Performance

  18. Higher Data Security Through the use of redundancy, most RAID levels provide protection for the data stored on the array. This means that the data on the array can withstand the complete failure of one hard disk (or sometimes more) without any data loss and without requiring any data to be restored from backup. This security is a key benefit of RAID and probably drives the creation of more RAID arrays than any other feature. All RAID levels except RAID 0 provide some degree of data protection, depending on the exact implementation.

  19. Fault Tolerance RAID implementations that include redundancy provide a much more reliable overall storage subsystem than a single disk can. This means there is a lower chance of the storage subsystem as a whole failing due to hardware faults. (At the same time, the added hardware in RAID increases the chance of a hardware problem in some individual component, even when that problem doesn't take down the storage subsystem.)

  20. Improved Availability Availability refers to access to data. Good RAID systems improve availability both by providing fault tolerance and by providing special features that allow recovery from hardware faults without disruption.

  21. Integrated Capacity By turning a number of smaller drives into a larger array, you add their capacity together (though a percentage of total capacity is lost to overhead or redundancy in most implementations). This facilitates applications that require large amounts of contiguous disk space, and also makes disk space management simpler.

  22. Improved Performance RAID systems improve performance by allowing the controller to exploit the capabilities of multiple hard disks, working around the performance-limiting mechanical issues that plague individual drives. Different RAID implementations improve performance in different ways and to different degrees, but all improve it in some way.

  23. SAN vs NAS • It all depends on the situation. In a large network, SAN reigns supreme: it provides faster access and less network congestion. • In a small network, where server outages may be a problem, NAS is the way to go. • The future of storage seems to be migrating primarily to SAN systems, as the devices become cheaper and more manageable. • NAS will not die off, thanks to its simple and inexpensive equipment and setup.

  24. Storage Area Network • A network of storage devices connected to each other and to a server, or cluster of servers, which acts as an access point to the SAN. • SANs use special switches, which look like standard networking hubs, to connect to the servers and to each other. • This allows data to be transmitted between SAN devices without loading the general-purpose network.

  25. Storage Area Networks (requires server) • The demand for storage has risen considerably in recent times. • Organizations need to move from a static, fixed, data-center-oriented operation to a more flexible and dynamic infrastructure for information processing. • Thus they are moving to the concept of Storage Area Networks (SANs). In a SAN, online storage peripherals are configured as nodes on a high-speed network and can be attached to and detached from servers in a very flexible manner. • This allows storage systems to be placed at longer distances from the servers and provides different performance and connectivity options. Source: math.indstate.edu/chi/cs468/projects/SAN_NAS.ppt

  26. Storage Area Networks Advantages of SANs: • Flexible many-to-many connectivity among servers and storage devices using Fibre Channel hubs and switches. • Up to 10 km separation between a server and a storage system using appropriate fiber optic cables. • Better isolation capabilities, allowing nondisruptive addition of new peripherals and servers. • However, SANs face the problem of combining storage options from multiple vendors and dealing with evolving standards of storage management software and hardware. Source: math.indstate.edu/chi/cs468/projects/SAN_NAS.ppt

  27. General Information: SAN • A SAN is primarily used in a large data environment for data storage. • It is not possible to access a SAN without an intermediary unit (a server or mainframe). • SANs are expensive to set up but easy to maintain. • SANs also provide a central storage area from which tape backups can be taken. • SAN hard drives can be SCSI, ATA, or, most commonly, Fibre Channel. • ISU currently has two SANs in use: one runs the primary server cluster, and the other is used in the Library as part of the DRP (in the works). Source: math.indstate.edu/chi/cs468/projects/SAN_NAS.ppt

  28. General Information: NAS • NAS devices are inexpensive and easy to set up, but they can be difficult to maintain and can bog down a large network. • Each device maintains its own IP address, which makes it easy to access data on a NAS from anywhere in the world. • There are also no worries about server crashes, since no intermediate device is needed to access a NAS. • Windows, Mac, and Unix all support access to a NAS device one way or another (FTP, NFS, file share). • A simple example of a NAS device is an FTP server used to store work documents: you can map it as a drive on Windows machines or access it through an FTP client. Source: math.indstate.edu/chi/cs468/projects/SAN_NAS.ppt

  29. Network Attached Storage • Various devices (CD towers, SCSI towers, specialty servers). • Each device is connected directly to the network, with its own IP address. • If a server crashes, the data on a NAS device is still accessible, depending on which device is used. Source: math.indstate.edu/chi/cs468/projects/SAN_NAS.ppt

  30. Software

  31. Chapter 14 – Physical Design: Indexing for Performance • A data warehouse typically has more indexes than an OLTP system. • Indexes are very beneficial to query performance, but a large number of indexes degrades load time. • Place indexes on primary and foreign keys to start; add others as needed and as feasible. • The number of indexes varies inversely with table size.

  32. Hashed Files • The term "hash" apparently comes by way of analogy with its standard meaning in the physical world, to "chop and mix." Knuth notes that Hans Peter Luhn of IBM appears to have been the first to use the concept, in a memo dated January 1953; the term hash came into use some ten years later. Source: wikipedia

  33. Hashed Files • Hashing for disk files is called external hashing. • The file blocks are divided into M equal-sized buckets, numbered bucket 0, bucket 1, ..., bucket M-1. Typically, a bucket corresponds to one disk block (or a fixed number of them). • One of the file fields is designated as the hash key of the file. • The record with hash key value K is stored in bucket i, where i = h(K) and h is the hashing function. • Search is very efficient on the hash key. • Collisions occur when a new record hashes to a bucket that is already full. An overflow file is kept for storing such records; overflow records that hash to the same bucket can be linked together.
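The bucket arithmetic above can be sketched in a few lines. Lists stand in for disk blocks, and the mod hash function and bucket capacity are illustrative choices, not anything a particular DBMS mandates:

```python
M = 4                                 # number of buckets
BUCKET_CAPACITY = 2                   # records per "block"
buckets = [[] for _ in range(M)]      # main file blocks
overflow = [[] for _ in range(M)]     # linked overflow records per bucket

def h(key):
    """Toy hash function: i = h(K) = K mod M."""
    return key % M

def insert(key):
    i = h(key)
    if len(buckets[i]) < BUCKET_CAPACITY:
        buckets[i].append(key)
    else:                             # collision: bucket full
        overflow[i].append(key)

for k in [10, 14, 6, 3, 7]:
    insert(k)
# 10, 14, and 6 all hash to bucket 2; the third goes to overflow.
```

A lookup on the hash key reads only bucket h(K) (plus its overflow chain), which is why search on the hash key is so efficient.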

  34. Hashed Files (cont.) There are numerous methods for collision resolution, including the following: • Open addressing: Proceeding from the occupied position specified by the hash address, the program checks the subsequent positions in order until an unused (empty) position is found. • Chaining: Various overflow locations are kept, usually by extending the array with a number of overflow positions, and a pointer field is added to each record location. A collision is resolved by placing the new record in an unused overflow location and setting the pointer of the occupied hash address location to the address of that overflow location. • Multiple hashing (rehashing): The program applies a second hash function if the first results in a collision. If another collision results, the program uses open addressing, or applies a third hash function and then uses open addressing if necessary.
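Open addressing, the first method above, can be sketched with linear probing on a small in-memory table (external hashing applies the same idea to record positions within blocks; the table size and keys here are illustrative):

```python
SIZE = 8
table = [None] * SIZE                 # None marks an empty slot

def h(key):
    return key % SIZE

def insert(key):
    """Linear probing: step forward from h(key) until an empty slot."""
    pos = h(key)
    while table[pos] is not None:     # occupied: check the next position
        pos = (pos + 1) % SIZE
    table[pos] = key

for k in [3, 11, 19]:                 # all three hash to position 3
    insert(k)
# 3 lands at slot 3, 11 probes forward to 4, and 19 to 5.
```

Chaining would instead leave each record at its home address and link the colliding records through pointer fields into an overflow area.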

  35. Hashed Files (cont.)

  36. Hashed Files (cont.) • To reduce overflow records, a hash file is typically kept 70-80% full. • The hash function h should distribute the records uniformly among the buckets; otherwise, search time increases because many overflow records will exist (test with sample data). • Main disadvantages of static external hashing: a fixed number of buckets M is a problem if the number of records in the file grows or shrinks, and ordered access on the hash key is quite inefficient (it requires sorting the records).
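The 70-80% guideline turns into a simple sizing calculation: choose M so that the expected number of records divided by the total record slots stays near the target load factor. A sketch with assumed figures:

```python
import math

def buckets_needed(n_records, records_per_bucket, load_factor=0.75):
    """Buckets M needed to hold n_records at the given load factor."""
    slots = n_records / load_factor          # total record slots required
    return math.ceil(slots / records_per_bucket)

print(buckets_needed(100_000, 10))   # -> 13334
```

For 100,000 records at 10 records per bucket and 75% fullness, about 13,334 buckets keep overflow rare while not wasting too much space.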

  37. Hashed Files - Overflow handling

  38. Dynamic and Extendible Hashed Files • Hashing techniques can be adapted to allow the number of file records to grow and shrink dynamically. • These techniques include dynamic hashing, extendible hashing, and linear hashing. • Both dynamic and extendible hashing use the binary representation of the hash value h(K) to access a directory. In dynamic hashing the directory is a binary tree; in extendible hashing it is an array of size 2^d, where d is called the global depth.

  39. Dynamic and Extendible Hashing • The directories can be stored on disk, and they expand or shrink dynamically. Directory entries point to the disk blocks that contain the stored records. • An insertion into a disk block that is full causes the block to split into two blocks, and the records are redistributed between them. The directory is updated appropriately. • Dynamic and extendible hashing do not require an overflow area. • Linear hashing does require an overflow area but does not use a directory: blocks are split in linear order as the file expands.
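The split-and-double behavior of extendible hashing can be sketched with lists standing in for disk blocks. The bucket capacity is a toy choice, and the keys themselves serve as hash values (a real implementation would index on bits of h(K)):

```python
BUCKET_CAPACITY = 2

class Bucket:
    def __init__(self, depth):
        self.depth = depth            # local depth of this bucket
        self.keys = []

global_depth = 1
directory = [Bucket(1), Bucket(1)]    # array of size 2**global_depth

def insert(key):
    global global_depth, directory
    b = directory[key & ((1 << global_depth) - 1)]
    if len(b.keys) < BUCKET_CAPACITY:
        b.keys.append(key)
        return
    if b.depth == global_depth:       # no spare bit: directory doubles
        directory = directory + directory
        global_depth += 1
    # Split the full bucket on its next bit and redirect half the
    # directory entries that pointed to it.
    b.depth += 1
    new = Bucket(b.depth)
    mask = 1 << (b.depth - 1)
    for i, ptr in enumerate(directory):
        if ptr is b and (i & mask):
            directory[i] = new
    old_keys, b.keys = b.keys + [key], []
    for k in old_keys:                # redistribute between the two blocks
        insert(k)

for k in [4, 8, 1, 12, 16]:
    insert(k)
```

Inserting 12 twice forces a split here, doubling the directory to eight entries (global depth 3), with no overflow area needed at any point.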

  40. Extendible Hashing

  41. Dynamic Multilevel Indexes Using B-Trees and B+-Trees • The B-tree's creator, Rudolf Bayer, has not explained what the B stands for. The most common belief is that B stands for balanced, as all the leaf nodes are at the same level in the tree. B may also stand for Bayer, or for Boeing, because he was working at Boeing Scientific Research Labs. • Rudolf Bayer has been Professor (emeritus) of Informatics at the Technical University of Munich since 1972. He is famous for inventing the B-tree data structure with Edward M. McCreight, and later the UB-tree with Volker Markl. He received the 2001 ACM SIGMOD Edgar F. Codd Innovations Award. Source: Wikipedia

  42. Dynamic Multilevel Indexes Using B-Trees and B+-Trees • Because of the insertion and deletion problem, most multilevel indexes use B-tree or B+-tree data structures, which leave space in each tree node (disk block) to allow for new index entries. • These data structures are variations of search trees that allow efficient insertion and deletion of new search values. • In B-tree and B+-tree data structures, each node corresponds to a disk block. • Each node is kept between half full and completely full.

  43. Dynamic Multilevel Indexes Using B-Trees and B+-Trees • An insertion into a node that is not full is quite efficient; if a node is full, the insertion causes a split into two nodes. • Splitting may propagate to other tree levels. • A deletion is quite efficient if a node does not become less than half full. • If a deletion causes a node to become less than half full, it must be merged with neighboring nodes.
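The split step above can be sketched for a single leaf, with a sorted list standing in for the node's disk block (the order and the B+-tree "copy up" convention shown are illustrative choices):

```python
import bisect

ORDER = 4                             # maximum keys per node

def insert_and_split(node, key):
    """Insert key into a sorted leaf; return (node,) if it fits,
    else (left, pushed_up_key, right) after splitting at the middle."""
    bisect.insort(node, key)
    if len(node) <= ORDER:
        return (node,)
    mid = len(node) // 2
    return (node[:mid], node[mid], node[mid:])

leaf = [5, 8, 12, 15]
result = insert_and_split(leaf, 9)    # overflow -> split into two halves
# result == ([5, 8], 9, [9, 12, 15]): in a B+-tree the middle key is
# copied (not moved) up, so it stays in the right leaf as well.
```

Both halves come out at least half full, matching the invariant on the previous slide; if the parent is also full, the pushed-up key triggers the same split one level higher.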

  44. Difference Between B-tree and B+-tree • In a B-tree, pointers to data records exist at all levels of the tree. • In a B+-tree, all pointers to data records exist at the leaf-level nodes. • A B+-tree can therefore have fewer levels (or a higher capacity of search values) than the corresponding B-tree.

  45. B-tree structures. (a) A node in a B-tree with q – 1 search values. (b) A B-tree of order p = 3. The values were inserted in the order 8, 5, 1, 7, 3, 12, 9, 6.

  46. The nodes of a B+-tree. (a) Internal node of a B+-tree with q –1 search values. (b) Leaf node of a B+-tree with q – 1 search values and q – 1 data pointers.

  47. An example of insertion in a B+-tree with q = 3 and p_leaf = 2.

  48. An example of deletion from a B+-tree.
