Indexing Structures for Files

Indexing Structures for Files

Basic Concepts • Indexing mechanisms used to speed up access to desired data without having to scan entire table based on a search key • Search Key • an attribute used to look up records in a file.

Index Structure pointer search-key • An index file consists of records (called index entries) of the form • Index entries • Search key value and a pointer to a row having that value • The values in the index are ordered. • Index files are typically much smaller than the original file • When a file is modified, every index on the file must be updated • Updating indices imposes overhead on database modification.

Index Evaluation Metrics • Indexing techniques evaluated on basis of: • Access types (queries) supported efficiently. • records with a specified value in the attribute • or records with an attribute value falling in a specified range of values. • Access/search time • Insertion time • Deletion time • Space overhead

Index Classification • primary index: • is specified on the ordering keyfield of an ordered file, where every record has a unique value for that field. • The index has the same ordering as the one of the file. • clustering index: • is specified on the ordering field of an ordered file. • The index has the same ordering as the one of the file. • An ordered file can have at most one primary index or one clustering index, but not both. • secondary index: • is specified on any nonordering field of the file. • The index has different ordering than the one of the file. • A file can have several secondary indices in addition to its primary/clustering index.

Primary Indices • Primary index is specified on the ordering key field of an ordered file. • There is oneindex entry (or index record) in the index file for each block in the data file. • Each index entry has the value of the primary key field for the first record in a block. • The total number of entries in the index file is the same as the number of disk blocks in the data file. • The index file for a primary index needs fewer blocks than does the data file.

Primary Indices

Finding a record is efficient – do a binary search Records insertion and deletion is a major problem. We can avoid the problem by: Using an unordered overflow file, or Using a linked list of overflow records. Primary Indices

Index (sequential) continuous free space 10 32 39 20 38 31 30 33 34 35 36 40 50 60 70 80 overflow area (not sequential) 90 Primary Indices

Sparse Vs. Dense Indices • dense index • has index entry for every record in the file. • sparse (nondense) index • has index entries for only some of the search-key values. • A primary index is sparse (nondense) index.

Sparse Vs. Dense Indices Id Name Dept Sparse primary index sorted on Id Dense secondary index sorted on Name Ordered file sorted on Id

Sparse Vs. Dense Indices Ashby, 25, 3000 22 Basu, 33, 4003 25 Bristow, 30, 2007 30 Ashby 33 Cass, 50, 5004 Cass Smith Daniels, 22, 6003 40 Jones, 40, 6003 44 44 Sparse primary index on Name Smith, 44, 3000 50 Tracy, 44, 5004 Dense secondary index on Age Ordered file on Name

Dense Indices • Pro: • Very efficient in locating a record given a key, if fits in the memory • Can tell if any record exists without accessing file • Con: • if too big and doesn’t fit into the memory, will be expense when used to find a record given its key

Sparse Indices • Sparse index contains index records for only some search-key values. • Some keys in the data file will not have an entry in the index file • Applicable when records are sequentially ordered on search-key (ordered files) • Normally keeps only one key per data block • To locate a record with search-key value K we: • Find index record with largest search-key value K • Search file sequentially starting at the record to which the index record points

Sparse Indices Ordered File Sparse/Primary Index 10 210 170 130 90 50 230 30 70 150 190 110 90 70 50 30 10 40 100 80 60 20

Sparse Indices • Less space (can keep more of index in memory) • Support multi-level indexing structure • Less maintenance overhead for insertions and deletions.

Index Update: Deletion • If deleted record was the only record in the file with its particular search-key value, the search-key is deleted from the index also. • Single-level index deletion: • Dense indices • deletion of search-key is similar to file record deletion. • Sparse indices • If an entry for the search key exists in the index, it is deleted by replacing the entry in the index with the next search-key value in the file (in search-key order). • If the next search-key value already has an index entry, the entry is deleted instead of being replaced.

Dense Index: Deletion 50 30 70 10 80 40 60 20 10 20 30 40 50 60 70 80

Dense Index: Deletion 10 70 50 20 60 80 40 40 delete record 30 10 20 30 30 40 40 50 60 70 80

Sparse Index: Deletion 50 30 70 10 80 40 60 20 10 30 50 70 90 110 130 150

Sparse Index: Deletion 50 30 70 10 80 40 60 20 delete record 40 10 30 50 70 90 110 130 150

Sparse Index: Deletion 10 30 50 70 20 40 60 80 40 40 delete record 30 10 30 50 70 90 110 130 150

Sparse Index: Deletion 10 30 50 70 20 40 60 80 50 70 delete records 30 & 40 10 30 50 70 90 110 130 150

Index Update: Insertion • Single-level index insertion: • Perform a lookup using the search-key value appearing in the record to be inserted. • Dense indices • if the search-key value does not appear in the index, insert it. • Sparse indices • if index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created. • In this case, the first search-key value appearing in the new block is inserted into the index.

Sparse Index: Insertion 10 30 40 60 20 50 10 30 40 60

Sparse Index: Insertion 10 30 40 60 20 50 34 insert record 34 10 30 40 60

Sparse Index: Insertion 10 30 60 40 20 50 15 20 20 30 insert record 15 10 30 40 60 • Illustrated: Immediate reorganization • Variation: • insert new block (chained file) • update index

Sparse Index: Insertion 10 30 40 60 25 20 50 overflow blocks (reorganize later...) insert record 25 10 30 40 60

Dense Index: Insertion • Similar • Often more expensive . . .

Duplicate keys 10 10 20 30 40 10 20 30 30 45

Duplicate keys 10 40 10 20 30 30 40 20 10 10 45 30 20 10 30 30 30 45 10 20 • Dense index 10 10 10 10 10 10 20 20 20 20 30 30 30 30 30 30

Duplicate keys 10 10 40 20 30 30 20 30 45 10 careful if looking for 20 or 30! • Sparse index, one way? 10 10 20 30 40

Duplicate keys 10 40 10 20 30 30 20 30 45 10 should this be 40? • Sparse index, another way? (clustering index) • place first new key from block 10 20 30 30

Clustering Indices • A clustering index can be used when the field (the clustering field) is non-key, and the data file is sorted by the clustering field. • A file can have at most one primary index or one clustering index, but not both. • A clustering file is also an ordered file with two fields: • Clustering field • pointer to the first block that has a record with that value for its clustering field. • There is one entry in the clustering index for each distinct value of the clustering field (rather than for every record). • Sparse index (nondense)

Clustering Indices • A clustering index on the DEPNo ordering nonkey field of an EMPLOYEE file.

Clustering Indices • Record insertion and deletion still cause problems • a solution; cluster of contiguous blocks • Good for range searches • Use location mechanism to locate index entry at start of range • This locates first data record. • Subsequent data records are contiguous if index is clustered (not so if unclustered)

Clustering Indices • Clustering index with a separate block cluster for each group of records that share the same value for the clustering field.

Secondary Indices • Secondary index: • is specified on any nonordering field of the file. • Non-ordering field can be a key (unique) or a non-key (duplicates) • The index has different ordering than the one of the file. • A file can have several secondary indices in addition to its primary index. • there is one index entry for each data record • index record points either to the block in which the record is stored, or to the record itself • Hence, such an index is dense

Secondary Indices • A secondary index usually needs more storage space and longer search time than does a primary index. • It has larger number of entries. • Sequential scan using primary index is efficient, but a sequential scan using a secondary index is expensive • each record access may fetch a new block from disk

Secondary Indices • A dense secondary index (with block pointers) on a nonorderingKEY field.

Secondary Indices • A dense secondary index (with record pointers) on a non-ordering non-key field.

Index Types and Indexing Fields • Also, review Table 14.2.

Multilevel Indices • To search the index faster we can create an index for the index. • A multilevel index considers the index file as an ordered file and creates a primary index for the first level • outer index – a sparse index of primary index • inner index – the primary index file • The above process can be repeated for a higher level if the previous level needs more than one block of disk storage. • Read EXAMPLE 3

Multilevel Indices

B+-Tree Index • A B+-tree, of order f (fan-out --- maximum node capacity), is a rooted tree satisfying the following: • All paths from root to leaf are of the same length (balanced tree) • Each non-leaf node (except the root) has between f/2 and up to ftree pointers (f-1 key values). • A leaf node has between f/2 and f-1 data pointers (plus a pointer for sibling node). • If the root is not a leaf, it has at least 2 children. • If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and f-1 values.

B+-Tree Non-leaf Node Structure • Ki are the search-key values, K1 K2 K3  …Kf-1 • all keys in the subtree to which P1 points are K1. • all keys in the subtree to which Pf points are Kf-1. • for 2  i  f-1, all keys in the subtree to which Pi points have values  Ki-1 and  Ki. • Pi are pointers to children nodes (tree nodes).

B+-Tree Leaf Node Structure • for i = 1, 2, …, f-1, pointer Pri is a data pointer, that either points to a • file record with search-key value Ki, or • block of record pointers that point to records having search-key value Ki. (if search-key is not a key) • Pnext points to next leaf node in search-key order. • Within each leaf node, K1 K2 K3 …Kf-1 • If Li, Lj are leaf nodes and ij, then • Li’s search-key values Lj’s search-key values

From non-leaf node to next leaf in sequence Sample Leaf Node 57 81 95 To record with key 57 To record with key 81 To record with key 95

to keys to keys to keys to keys  57 57  k  81 81  k  95  95 Sample Non-Leaf Node 57 81 95

Root f=4 Example of a B+-Tree 35 110 130 179 11 3 5 11 120 130 180 200 100 101 110 150 156 179 30 35

Indexing Structures for Files