CSE 480: Database Systems

CSE 480: Database Systems • Lecture 20: Indexing Structures

Index • Mechanism to efficiently locate row(s) of a table without having to scan the entire table • Analogous to a book index Index entry

Indexing Field • An index is built based on an indexing field • Records having a particular value for their indexing field can be quickly located • The index can be built on one or more fields • Example: index on (CrsCode, Semester) • An index contains entries <field value, pointer to record>, ordered by field value • The index file occupies considerably less disk blocks than the data file because its entries are much smaller than the records themselves • A binary search on the index yields a pointer to the file record

Example of an Index

Types of Indexes • Dense vs sparse • Primary vs clustering vs secondary • Single-level vs multi-level • Static vs dynamic

Types of Indexes • Dense index • has an index entry for each record in the data file. • Sparse (or nondense) index • has index entries for only a subset of the records in the data file

Types of Indexes • Primary vs Clustering vs Secondary Index • Depends on the indexing and ordering fields • Indexing field is the field upon which an index is built • Ordering field is the field upon which the data file is sorted • Ordering key field is an ordering field that also corresponds to the key of the table Indexing field is ProfName Ordering field is ProfID ProfID is also the ordering key field (because it is a key)

Types of Indexes • Primary Index • Specified on the ordering key fieldof an ordered file of records • Nondense (sparse) • One index entry for each block in the data file • The index entry has the key field value for the first record in the block, which is called the block anchor

Primary Index Example

Types of Indexes Clustering Index Specified on the ordering field that is not a key field Nondense (sparse) One index entry for each distinct value of the field; The index entry points to the first data block that contains records with that field value

Clustering Index Example

Types of Indexes • Secondary Index • Index defined on some non-ordering fieldof the data file • Dense • Includes one entry for each record in the data file • The index entry points to either a block or a record

Secondary Index Example

Example • EMPLOYEE(NAME, SSN, ADDRESS, JOB, SALARY, ... ) • Record size R=150 bytes • Block size B=512 bytes • No of records r = 30000 records • Blocking factor • Bfr = B  R = 512  150 = 3 records/block • Number of blocks needed to store the records • b = r/Bfr = 30000/3 = 10000 blocks

Example • Suppose we need to perform the following query: SELECT * FROM EMPLOYEE WHERE SSN = ‘1234567890’; • Heap file (unsorted): • Average cost of linear search (assuming SSN exists): • (b/2) = 10000/2 = 5000 block accesses • Sequential file (sorted on SSN) • Average cost of binary search: • log2 b = log2 10000 = 14 block accesses

Example • Suppose there is a (sparse) primaryindex on the SSN field • Field size for SSN, V = 10 bytes, • Record pointer size PR = 4 bytes • Index entry size • Ri = (V + PR) = (10 + 4) = 14 bytes • Index blocking factor • Bfri = B  Ri  = 512  14 = 36 entries/block • Number of index blocks • bi = b/Bfri = 10000/36 = 278 blocks • Binary search on index takes log2 bi = log2 278 = 9 block accesses Total cost = 10 block accesses (1 more to access the record itself)

Example • Suppose there is a (dense) secondary index on the SSN field • Field size for SSN, V = 10 bytes, • Record pointer size PR = 4 bytes • Index entry size • Ri = (V + PR) = (10 + 4) = 14 bytes • Index blocking factor • Bfri = B  Ri  = 512  14 = 36 entries/block • Number of index blocks • bi = r/Bfri = 30000/36 = 834 blocks • Binary search needs log2 bi = log2 834 = 10 block accesses Total cost = 11 block accesses (1 more to access the record itself)

Multi-Level Indexes • Because a single-level index is an ordered file, we can create an index to the index itself • In this case, the original index file is called the first-level index and the index to the index is called the second-level index • We can repeat the process, creating a third, fourth, ..., top level until all entries of the top level fit in one disk block • A multi-level index can be created for any type of first-level index (primary, secondary, clustering) as long as the first-level index has more than one disk block

Example: Two-level Primary Index

Example • Suppose there is a 2-level index on the SSN field (assume index at the first level is sparse, i.e., a primary index) • Field size for SSN, V = 10 bytes, • Record pointer size PR = 4 bytes • Index entry size • Ri = (V + PR) = (10 + 4) = 14 bytes • Index blocking factor • Bfri = B / Ri  = 512  14 = 36 entries/block • Number of index blocks at 1st level • b1 = b/Bfri = 10000/36 = 278 blocks • Number of index blocks at 2nd level • b2 = b1/Bfri = 278/36 = 8 blocks • Binary search needs log2 bi = log2 8 = 3 block accesses Total cost = 3 + 1 + 1 block accesses = 5 block accesses (2nd level) + (1st level) + (data block)

Example • Suppose there is a 3-level index on the SSN field (assume index at first level is sparse) • Field size for SSN, V = 10 bytes, • Record pointer size PR = 4 bytes • Index entry size • Ri = (V + PR) = (10 + 4) = 14 bytes • Index blocking factor • Bfri = B  Ri  = 512  14 = 36 entries/block • Number of index blocks at 1st level • b1 = b/Bfri = 10000/36 = 278 blocks • Number of index blocks at 2nd level • b2 =  b1/Bfri = 278/36 = 8 blocks • Number of index blocks at 3rd level • b3 =  b2/Bfri = 8/32 = 1 block • Total cost = 1 + 1 + 1 + 1 block accesses = 4 block accesses (3rd level) + (2nd level) + (1st level) + (data block)

Dynamic Multilevel Indexes • Insertion and deletion from multilevel indexes can be quite a severe problem • One possibility is to use overflow blocks and then do file reorganization periodically • A better strategy is to use dynamic multi-level indexes • Examples: B-tree and B+-tree (they are called search trees) • In B-Tree and B+-Tree, each node corresponds to a disk block • Each node is always kept at least half-full

A Node in a Search Tree • Each node is stored in a disk block • Within each node: K1 < K2 < … < Kq-1 • For all values X in subtree pointed by Pi: • Ki-1 < X < Ki (B-tree) or Ki-1 X < Ki (B+-tree) A node in the search tree (B-tree or B+-tree)

Search tree • A search tree of order p • each node contains at most p – 1 search values

Nodes of a B+-tree • Difference between an internal node and a leaf node

Example of a B+-tree Tree node pointer Data/record pointer Sibling pointer

Example: Search for 9

Largest value among those stored in the left subtree Example of Insertion in a B+-tree Insertion sequence: 8 5 1 7 3 12 9 6 ?

Example of Insertion in a B+-tree Insertion sequence: 8 5 1 7 3 12 9 6 ?

? ? ? Example of Insertion in a B+-tree Insertion sequence: 8 5 1 7 3 12 9 6

Example of Insertion in a B+-tree Insertion sequence: 8 5 1 7 312 9 6

Example of Deletion in a B+-tree

Example of Deletion in a B+-tree Deletion sequence: 5 12 9

Underflow in a B+-tree • Underflow • when the number of entries in a node is below the minimum required • Redistribute entries with the left sibling or • This changes the search field values at higher levels of the tree • Redistribute entries with the right sibling or • Merge the 3 nodes into 2 nodes and redistribute the entries

Example of Deletion in a B+-tree Deletion sequence: 5 12 9

Example of Deletion in a B+-tree Final tree (after deletion)

CSE 480: Database Systems