Hashing Techniques

Hashing Techniques CST203-2 Database Management Systems Lecture 7

Disadvantages on index structure: • We must access an index structure to locate data, or must use binary search, and that results in more I/O operations • Space taken by index structure • Hashing allows us to avoid accessing an index structure

Hash File Organization • We obtain the address of the disk block containing a desired record directly by computing a function on the search-key value of the record • Bucket is a unit of storage that can store one or more records • A bucket is typically a disk block, but could be chosen to be smaller or larger than a disk block.

A hash function h is a function from K to B. h – hash function Ki - search key Bi = h(Ki) • If h(K5) = h(K7) • bucket h(K5) contains records with search-key values K5 and records with search-key values K7

Deletion • Equally straightforward • If the search-key value of the record to be deleted is Ki, we compute h(Ki), then search the corresponding bucket for that record, and delete the record from the bucket

Hash Functions • The worst possible hash function maps all search-key values to the same bucket • An ideal hash function distributes the stored keys uniformly across all the buckets, so that every bucket has the same number of records • We want to choose a hash function that assigns search-key values to buckets

The distribution is uniform. • The hash function assigns each bucket the same number of search-key values from the set of all possible search-key values • The distribution is random • Each bucket will have nearly the same number of values assigned to it, regardless of the actual distribution of search-key values. • The hash value will not be correlated to any externally visible ordering on the search-key values, such as alphabetic ordering or ordering by the length of the search keys

Example of Hash File Organization (Cont.) Hash file organization of account file, using branch-name as key • Assume…….. There are 26 buckets • hash function maps names beginning with the ith letter of the alphabet to the ith bucket • Simple but no uniform distribution • Assume……… there are 10 buckets, • The binary representation of the ith character is assumed to be the integer i. • The hash function returns the sum of the binary representations of the characters modulo 10 • E.g. h(Perryridge) = 5 h(Round Hill) = 3 h(Brighton) = 3

int h = 0; for (inti = 0; i < n; i++) { h = h + s.charAt(i); } h = h % 10;

Example of Hash File Organization Hash file organization of account file, using branch-name as key

Handling of Bucket Overflows • We have assumed that, when a record is inserted, the bucket to which it is mapped has space to store the record • If the bucket does not have enough space, a bucket overflowis said to occur • Reasons: • Insufficient buckets. • Skew 1. Multiple records may have the same search key. 2. The chosen hash function may result in non-uniform distribution of search keys.

We handle bucket overflow by using overflow buckets. • If a record must be inserted into a bucket b, and b is already full, • the system provides an overflow bucket for b, and inserts the record into the overflow bucket. • If the overflow bucket is also full, the system provides another overflow bucket, and so on • All the overflow buckets of a given bucket are chained together in a linked list(overflow chaining)

Hash Indices • A hash index organizes the search keys, with their associated pointers, into a hash file structure • We apply a hash function on a search key to identify a bucket, and store the key and its associated pointers in the bucket (or in overflow buckets)

The term hash indexis used to denote hash file structuresas well as hash indices • Hash indices are only secondary index structures • But, hash file organization provides the same direct access to records that indexing provides, we pretend that a file organized by hashing also has a primary hash index on it

Dynamic Hashing • Most databases grow larger over time • For static hashing, • Choose a hash function based on the current file size. This option will result in performance degradation as the database grows • Choose a hash function based on the anticipated size of the file at some point in the future. Although performance degradation is avoided, a significant amount of space may be wasted initially • Periodically reorganize the hash structure in response to file growth. Such a reorganization involves choosing a new hash function, recomputing the hash function on every record in the file, and generating new bucket assignments. This reorganization is a massive, time-consuming operation. Furthermore, it is necessary to forbid access to the file during reorganization.

Hashing Techniques