Temple University – CIS Dept. CIS331– Principles of Database Systems

Temple University – CIS Dept.CIS331– Principles of Database Systems V. Megalooikonomou Indexing and Hashing I (based on notes by Silberchatz, Korth, and Sudarshan and notes by C. Faloutsos at CMU)

General Overview - rel. model • Relational model - SQL • Formal & commercial query languages • Functional Dependencies • Normalization • Physical Design • Indexing

Indexing- overview • primary / secondary indices • index-sequential (ISAM) • B - trees, B+ - trees • hashing • static hashing • dynamic hashing

Basic Concepts • Indexing mechanisms speed up access to desired data • E.g., author catalog in library • Search Key - attribute to set of attributes used to look up records in a file • An index fileconsists of records (called index entries) of the form • Index files are typically much smaller than the original file • Two basic kinds of indices: • Ordered indices: search keys are stored in sorted order • Hash indices: search keys are distributed uniformly across “buckets” using a “hash function” search-key pointer

Indexing • once the records are stored in a file, how do you search efficiently? (e.g., ssn=123?)

Indexing • once the records are stored in a file, how do you search efficiently? • brute force: retrieve all records, report the qualifying ones • better:use indices (pointers) to locate the records directly

Indexing – main idea:

Measuring ‘goodness’ • retrieval time? • insertion / deletion? • space overhead? • reorganization? • range queries?

Main concepts • search keys are sorted in the index file and point to the actual records • primary vs. secondary indices • Clustering (sparse) vs non-clustering (dense) indices

Indexing Primary key index: on primary key (no duplicates)

Indexing secondary key index: duplicates may exist Address-index

Indexing secondary key index: typically, with ‘postings lists’ Postings lists

Main concepts – cont’d • Clustering (= sparse) index: records are physically sorted on that key (and not all key values are needed in the index) • Non-clustering (=dense) index: the opposite • E.g.:

Indexing- Sparse index Clustering/sparse index on ssn >=123 >=456

Sparse Index Files • Sparse Index: contains index records for only some search-key values • Applicable when records are sequentially ordered on search-key • To locate a record with search-key value K we: • Find index record with largest search-key value < K • Search file sequentially starting at the record to which the index record points • Less space and less maintenance overhead for insertions and deletions • Generally slower than dense index for locating records • Good tradeoff: sparse index with an index entryfor every block in file, corresponding to least search-key value in the block

Indexing – Dense Index Non-clustering / dense index

Summary • All combinations are possible… • at most one sparse/clustering index • as many as desired dense indices • usually: one primary-key index (maybe clustering) and a few secondary-key indices (non-clustering)

Indexing- overview • primary / secondary indices • index-sequential (ISAM) • B - trees, B+ - trees • hashing • static hashing • dynamic hashing

ISAM • What if index is too large to search sequentially?  use a multilevel index…

>=123 >=456 block ISAM

ISAM - observations • if index is too large, store it on disk and keep index-on-the-index • usually two levels of indices, one first-level entry per disk block (why? )

ISAM - Multilevel Index

>=123 >=456 124; peterson; fifth ave. ISAM - observations • What about insertions/deletions?

ISAM - observations • What about insertions/deletions? overflows 124; peterson; fifth ave. Problems?

ISAM - observations • What about insertions/deletions? overflows 124; peterson; fifth ave. • overflow chains may become very long - what to do?

ISAM - observations • What about insertions/deletions? overflows 124; peterson; fifth ave. • overflow chains may become very long - thus: • shut-down & reorganize • start with ~80% utilization

So far • … indices (like ISAM) suffer in the presence of frequent updates • sequential scan using primary index is efficient, but a sequential scan using a secondary index is expensive • each record access may fetch a new block from disk • alternative indexing structure: B - trees

Overview • primary / secondary indices • multilevel (ISAM) • B - trees, B+ - trees • hashing • static hashing • dynamic hashing

B-trees • the most successful family of index schemes (B-trees, B+-trees, B*-trees) • can be used for primary/secondary, clustering/non-clustering index • they are balanced “n-way” search trees

B-trees • Disadvantage of indexed-sequential files: performance degrades as file grows, since many overflow blocks get created. Periodic reorganization of entire file is required • Advantage of B+-treeindex files: • automatic self-reorganization with small, local, changes, in the face of insertions and deletions. Reorganization of entire file is not required • Disadvantage of B+-trees: • extra insertion and deletion overhead, space overhead • Advantages of B+-trees outweigh disadvantages, and they are used extensively

6 9 <6 >9 <9 >6 3 1 7 13 B-trees E.g., B-tree of order 3 (i.e., at most 3 pointers from each node):

pn p1 … vn-1 v1 v2 B-tree properties: • each node, in a B-tree of order n : • key order • at most n pointers • at least n/2 pointers (except root) • all leaves at the same level • if number of pointers is k, then node has exactly k-1 keys

Properties • “block aware” nodes: each node -> disk page • O(log (N)) for everything! (ins/del/search) • typically, if N = 50 - 100, then 2 - 3 levels • utilization >= 50%, guaranteed; on average 69%

6 9 <6 >9 <9 >6 3 1 7 13 Queries • Algorithm for exact match query? (e.g., ssn=8?)

Queries • Algorithm for exact match query? (e.g., ssn=8?) 6 9 <6 >9 <9 >6 3 1 7 13

Queries • Algorithm for exact match query? (e.g., ssn=8?) 6 9 <6 H steps (= disk accesses) >9 <9 >6 3 1 7 13

Queries • what about range queries? (e.g., 5<salary<8) • Proximity/ nearest neighbor searches? (e.g., salary ~ 8 )

6 9 <6 >9 <9 >6 3 1 7 13 Queries • what about range queries? (e.g., 5<salary<8) • Proximity/ nearest neighbor searches? (e.g., salary ~ 8 )

6 9 <6 >9 <9 >6 3 1 7 13 Queries • what about range queries? (eg., 5<salary<8) • Proximity/ nearest neighbor searches? (eg., salary ~ 8 )

B-trees: Insertion • Insert in leaf; on overflow, push middle up (recursively) • split: preserves B - tree properties

6 9 <6 >9 <9 >6 3 1 7 13 B-trees Easy case: Tree T0; insert ‘8’

6 9 <6 >9 <9 >6 3 1 7 13 B-trees Tree T0; insert ‘8’ 8

6 9 <6 >9 <9 >6 3 1 7 13 B-trees Hardest case: Tree T0; insert ‘2’ 2

13 B-trees Hardest case: Tree T0; insert ‘2’ 6 9 2 1 3 7 push middle up

2 13 1 3 B-trees Hardest case: Tree T0; insert ‘2’ Ovf; push middle 2 6 9 7

13 1 3 B-trees Hardest case: Tree T0; insert ‘2’ 6 Final state 9 2 7

B-trees - insertion • Q: What if there are two middles? (e.g., order 4) • A: either one is fine

B-trees: Insertion • Insert in leaf; on overflow, push middle up (recursively – ‘propagate split’) • split: preserves all B - tree properties (!!) • notice how it grows: height increases when root overflows & splits • Automatic, incremental re-organization (contrast with ISAM!)

Temple University – CIS Dept. CIS331– Principles of Database Systems