
Programming, Data Structures and Algorithms (Hashing)



Presentation Transcript


  1. Programming, Data Structures and Algorithms (Hashing) Anton Biasizzo

  2. Hash table ADT • Search tree ADT • Supports various operations on a set of elements. • Find runs in O(log n) time. • Insert and Delete rely on the find procedure, so both also take O(log n) time. • Hash table ADT • Supports only a subset of the search tree ADT operations (insert, delete, and find). • Very fast operations (close to constant time, O(1)). • Does not provide ordering information. • Implementing a hash table is referred to as hashing.

  3. General idea • A hash table is an array of fixed size. • The array contains keys (e.g. a string with an associated value). • The table size (TableSize) is part of the hash data structure. • Each key is mapped to a number in the range [0, TableSize-1] and stored in the corresponding cell. • This mapping is called the hash function. • The hash function should be simple to implement.

  4. General idea • The returned values are called hash values, hash codes, hash sums, or hashes. • Ideally, distinct keys should have distinct hash values. • This is impossible in general: the number of cells (i.e. hash values) is finite, while the supply of keys is inexhaustible. • The hash function should distribute keys evenly among the cells. • When two or more keys map to the same hash value, a collision occurs. • Hash table implementation: • Choose a hash function, • Manage collisions, • Determine the table size.

  5. Hash function • If the input keys are integers, the hash function is typically Key mod TableSize: • This works unless the keys have some undesirable property (e.g. the table size is 10 and all keys end in zero). • Collisions can be reduced by choosing a prime table size. • When the keys are random integers, they are distributed evenly. • Keys are usually strings: • Hash functions have to be chosen carefully. • One option is to sum the ASCII values of the characters in the string. • A second option is to use only the first few characters of the key.
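
The effect of an unlucky table size can be seen in a minimal sketch. The names `hash_int` and `count_in_cell`, and the sample sizes 10 and 11, are assumptions chosen for illustration and do not appear in the slides.

```c
/* Hash for integer keys: Key mod TableSize. */
unsigned int hash_int(unsigned int key, unsigned int table_size)
{
    return key % table_size;
}

/* Counts how many of the n keys land in the given cell. */
unsigned int count_in_cell(const unsigned int *keys, unsigned int n,
                           unsigned int table_size, unsigned int cell)
{
    unsigned int i, count = 0;

    for (i = 0; i < n; i++)
        if (hash_int(keys[i], table_size) == cell)
            count++;
    return count;
}
```

With the keys 10, 20, 30 and 40, a table size of 10 sends all four keys to cell 0, whereas the prime size 11 spreads them over cells 10, 9, 8 and 7.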

  6. Hash function • Sum of ASCII codes:

        typedef unsigned int INDEX;

        /* Sum the character codes of key, then reduce modulo the table size. */
        INDEX hash( const char *key, unsigned int H_SIZE )
        {
            unsigned int hash_val = 0;

            while ( *key != '\0' )
                hash_val += (unsigned char) *key++;
            return ( hash_val % H_SIZE );
        }

     • It is a simple and fast hash function. • If the table size is large, it does not distribute the keys well: • For keys of eight or fewer ASCII characters, the sum is between 0 and 1016 (8 × 127), so higher-numbered cells are never used.
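
A second weakness of the character sum, not mentioned on the slide, is that it ignores character order, so anagrams always collide. The sketch below repeats the sum hash under the hypothetical name `hash_sum`; the table size 10007 used in the note that follows is an assumed example value.

```c
/* Sum-of-character-codes hash (same technique as on the slide). */
unsigned int hash_sum(const char *key, unsigned int table_size)
{
    unsigned int hash_val = 0;

    while (*key != '\0')
        hash_val += (unsigned char)*key++;   /* order of characters is lost */
    return hash_val % table_size;
}
```

For a table of 10007 cells, `hash_sum("listen", 10007)` and `hash_sum("silent", 10007)` both return 655, and an eight-character key such as "zzzzzzzz" only reaches cell 976.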

  7. Hash function • First three characters:

        typedef unsigned int INDEX;

        /* Treat the first three characters as digits of a base-27 number. */
        INDEX hash( const char *key, unsigned int H_SIZE )
        {
            return ( ( key[0] + 27*key[1] + 729*key[2] ) % H_SIZE );
        }

     • Assumes that the key has at least three characters. • 27 is the number of letters in the English alphabet (26) plus the blank; 729 = 27². • This would be a good hash function if the characters were random, but that is not the case in any natural language.

  8. Hash function • Use all characters in key:

        typedef unsigned int INDEX;

        /* Shift the running value left by 5 bits (multiply by 32) per character. */
        INDEX hash( const char *key, unsigned int H_SIZE )
        {
            unsigned int hash_val = 0;

            while ( *key != '\0' )
                hash_val = ( hash_val << 5 ) + (unsigned char) *key++;
            return ( hash_val % H_SIZE );
        }

     • Multiplies by 32 (a shift by 5 bits) instead of 27. • Simple and fast hash function (if overflow is allowed). • If the keys are very long: • it might be too time-consuming, • the first characters are shifted out of the hash value. • In that case, use only some of the characters (odd positions, characters from different fields, …).

  9. Collision resolution • Collision: a newly inserted element hashes to the same value as an already inserted element. • Strategies to resolve collisions: • Open hashing, • Closed hashing.

  10. Open hashing • Open hashing is also called separate chaining. • Keep a list of all elements that hash to the same value. • ADT operations (find, insert, …) must be adapted. • In the example, the lists have headers. • The hash function is hash(X) = X mod 10. • Assume that the keys are the first 10 perfect squares.
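
Since the figure for this example is not part of the transcript, the following sketch rebuilds it. It is a simplified illustration, not the slide's code: the lists here have no header nodes, new keys are pushed to the front of a chain, and the names `chain_insert`, `insert_squares` and `chain_length` are hypothetical.

```c
#include <stdlib.h>

#define TABLE_SIZE 10

struct node {
    int key;
    struct node *next;
};

/* Pushes key onto the front of its chain; returns 0 if allocation fails.
   Keys are assumed to be non-negative. */
int chain_insert(struct node *table[TABLE_SIZE], int key)
{
    struct node *cell = malloc(sizeof *cell);

    if (cell == NULL)
        return 0;
    cell->key = key;
    cell->next = table[key % TABLE_SIZE];
    table[key % TABLE_SIZE] = cell;
    return 1;
}

/* Inserts 0*0, 1*1, ..., 9*9; every table entry must be NULL beforehand. */
int insert_squares(struct node *table[TABLE_SIZE])
{
    int i;

    for (i = 0; i < 10; i++)
        if (!chain_insert(table, i * i))
            return 0;
    return 1;
}

/* Number of keys stored in one cell's chain. */
int chain_length(struct node *table[TABLE_SIZE], int cell)
{
    int length = 0;
    struct node *p;

    for (p = table[cell]; p != NULL; p = p->next)
        length++;
    return length;
}
```

After `insert_squares`, the chains hold 0 in cell 0; 81 and 1 in cell 1; 64 and 4 in cell 4; 25 in cell 5; 36 and 16 in cell 6; 49 and 9 in cell 9. Cells 2, 3, 7 and 8 stay empty.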

  11. Open hashing type declaration • Type declaration:

        typedef struct list_node *node_ptr;

        struct list_node
        {
            element_type element;
            node_ptr next;
        };

        typedef node_ptr LIST;
        typedef node_ptr position;

        struct hash_tbl
        {
            unsigned int table_size;
            LIST *the_lists;    /* array of list headers */
        };

        typedef struct hash_tbl *HASH_TABLE;

  12. Open hashing operations • Initialization

        HASH_TABLE initialize_table( unsigned int table_size )
        {
            HASH_TABLE H;
            unsigned int i;

            /* Allocate table */
            H = (HASH_TABLE) malloc( sizeof (struct hash_tbl) );
            H->table_size = table_size;

            /* Allocate list pointers */
            H->the_lists = (LIST *) malloc( sizeof (LIST) * H->table_size );

            /* Allocate list headers */
            for( i=0; i<H->table_size; i++ )
            {
                H->the_lists[i] = (LIST) malloc( sizeof (struct list_node) );
                H->the_lists[i]->next = NULL;
            }
            return H;
        }

     • malloc requires <stdlib.h>; checks for failed allocations are omitted for brevity.

  13. Open hashing operations • Find operation • If keys are strings or complex structures, an appropriate function (e.g. strcmp for strings) must be used for key comparison.

        position find( element_type key, HASH_TABLE H )
        {
            position p;
            LIST L;

            L = H->the_lists[ hash( key, H->table_size ) ];
            p = L->next;    /* skip the header */
            while ( (p != NULL) && (p->element != key) )
                p = p->next;
            return p;       /* NULL if key is not present */
        }

  14. Open hashing operations • Insert operation (no duplicates)

        void insert( element_type key, HASH_TABLE H )
        {
            position pos, new_cell;
            LIST L;

            pos = find( key, H );
            if ( pos == NULL )    /* key is not in the table yet */
            {
                new_cell = (position) malloc( sizeof (struct list_node) );
                L = H->the_lists[ hash( key, H->table_size ) ];
                new_cell->next = L->next;
                new_cell->element = key;
                L->next = new_cell;
            }
        }

     • This implementation computes the hash value twice (once in find and once in insert).
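
One way to avoid the second hash computation is to search the chain inside the insert routine itself. The sketch below is one possible variant, not the slide's implementation: it assumes integer keys, and `hash_int`, `make_table` and `insert_once` are hypothetical names. The lists keep the header nodes used in the slides.

```c
#include <stdlib.h>

typedef int element_type;               /* assumption: integer keys */
typedef struct list_node *node_ptr;

struct list_node {
    element_type element;
    node_ptr next;
};

struct hash_tbl {
    unsigned int table_size;
    node_ptr *the_lists;                /* one header node per cell */
};

static unsigned int hash_int(element_type key, unsigned int table_size)
{
    return (unsigned int)key % table_size;
}

/* Builds a table with empty, headed lists; returns NULL on failure. */
struct hash_tbl *make_table(unsigned int table_size)
{
    unsigned int i;
    struct hash_tbl *H = malloc(sizeof *H);

    if (H == NULL)
        return NULL;
    H->table_size = table_size;
    H->the_lists = malloc(table_size * sizeof *H->the_lists);
    if (H->the_lists == NULL) {
        free(H);
        return NULL;
    }
    for (i = 0; i < table_size; i++) {
        H->the_lists[i] = malloc(sizeof(struct list_node));
        if (H->the_lists[i] == NULL) {  /* undo the partial construction */
            while (i > 0)
                free(H->the_lists[--i]);
            free(H->the_lists);
            free(H);
            return NULL;
        }
        H->the_lists[i]->next = NULL;
    }
    return H;
}

/* Returns 1 if key was inserted, 0 if it was already present,
   and -1 if memory allocation failed. The hash is computed once. */
int insert_once(element_type key, struct hash_tbl *H)
{
    node_ptr header = H->the_lists[hash_int(key, H->table_size)];
    node_ptr p, new_cell;

    for (p = header->next; p != NULL; p = p->next)
        if (p->element == key)
            return 0;
    new_cell = malloc(sizeof *new_cell);
    if (new_cell == NULL)
        return -1;
    new_cell->element = key;
    new_cell->next = header->next;      /* insert right after the header */
    header->next = new_cell;
    return 1;
}
```

The return value also tells the caller whether the key was new, which the `void` version cannot report.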

  15. Open hashing • Any scheme could be used instead of linked lists to resolve the collisions (trees, another hash table, …). • We expect that if the table is large, the lists are short. • The load factor λ is the ratio of the number of elements in the hash table to the table size. • The average length of a list is λ. • The effort to perform a search is the constant time to calculate the hash value plus the time to traverse the list. • In an unsuccessful search, the number of links to traverse is λ on average. • The general rule for open hashing is to make the table size about as large as the number of elements expected (λ ≈ 1).
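
The search costs can be written out as a short calculation. This is a sketch under the usual assumption that keys are spread uniformly over the cells; the symbol n (number of stored elements) and the estimate for a successful search are standard results that the slide does not state.

```latex
\lambda = \frac{n}{\mathit{TableSize}}, \qquad
\text{links}_{\text{unsuccessful}} \approx \lambda, \qquad
\text{links}_{\text{successful}} \approx 1 + \frac{\lambda}{2}
```

In a successful search the list contains the sought element plus, on average, about λ other elements, and roughly half of those are traversed before the match is reached.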

  16. Closed hashing • Open hashing has the disadvantage of requiring lists or another data structure. • Closed hashing, also called open addressing, is an alternative to resolving collisions with linked lists. • If a collision occurs, alternate cells are tried until an empty cell is found. • Formally: cells h_0(X), h_1(X), h_2(X), … are tried in succession, where h_i(X) = (Hash(X) + F(i)) mod TableSize. • The function F is the collision resolution strategy (F(0) = 0). • Closed hashing needs bigger tables than open hashing. • In general, the load factor should be kept below λ = 0.5.

  17. Linear probing • The collision resolution function F is a linear function (typically F(i) = i). • Cells are tried sequentially, with wraparound, in search of an empty cell. • As long as the table is big enough, a free cell can always be found. • The time to find an empty cell can get quite large. • Even when the table is relatively empty, blocks of occupied cells start forming; this is called primary clustering.

  18. Example of linear probing • Example of inserting keys {89, 18, 49, 58, 69}
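
The figure for this example is missing from the transcript, so here is a sketch that reproduces the insertions. It assumes a table of 10 cells with Hash(X) = X mod 10 (the slide does not state the size), non-negative keys, and the hypothetical names `linear_insert` and `EMPTY`.

```c
#define TABLE_SIZE 10
#define EMPTY (-1)      /* marks a free cell; keys must be non-negative */

/* Inserts key with linear probing, F(i) = i.
   Returns the cell that was used, or -1 if the table is full. */
int linear_insert(int table[TABLE_SIZE], int key)
{
    int i;

    for (i = 0; i < TABLE_SIZE; i++) {
        int cell = (key % TABLE_SIZE + i) % TABLE_SIZE;  /* wraps around */

        if (table[cell] == EMPTY) {
            table[cell] = key;
            return cell;
        }
    }
    return -1;
}
```

Starting from a table filled with `EMPTY`, the keys 89, 18, 49, 58 and 69 end up in cells 9, 8, 0, 1 and 2. The last three keys all pile up behind cell 9, which is the clustering effect described on the previous slide.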

  19. Quadratic probing • The collision resolution function F is a quadratic function (typically F(i) = i²). • It eliminates primary clustering. • Linear probing performs badly when the table gets almost full. • In quadratic probing, at most half of the table can be used as alternate locations. • With quadratic probing there is no guarantee of finding an empty cell once the table gets more than half full. • If the table size is not prime, an empty cell might not be found even when the table is less than half full.

  20. Example of quadratic probing • Example of inserting keys {89, 18, 49, 58, 69}
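
As the figure is not included in the transcript, the following sketch replays the insertions. It assumes a table of 10 cells with Hash(X) = X mod 10 (not stated on the slide), non-negative keys, and the hypothetical names `quadratic_insert` and `EMPTY`.

```c
#define TABLE_SIZE 10
#define EMPTY (-1)      /* marks a free cell; keys must be non-negative */

/* Inserts key with quadratic probing, F(i) = i * i.
   Returns the cell that was used, or -1 if no free cell was found
   within TABLE_SIZE probes (which does not imply a full table). */
int quadratic_insert(int table[TABLE_SIZE], int key)
{
    int i;

    for (i = 0; i < TABLE_SIZE; i++) {
        int cell = (key % TABLE_SIZE + i * i) % TABLE_SIZE;

        if (table[cell] == EMPTY) {
            table[cell] = key;
            return cell;
        }
    }
    return -1;
}
```

Starting from a table filled with `EMPTY`, the keys 89, 18, 49, 58 and 69 end up in cells 9, 8, 0, 2 and 3. Key 58, for example, finds cells 8 and 9 occupied and lands at (8 + 4) mod 10 = 2.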

  21. Double hashing • The collision resolution function F includes a second hash function: F(i) = i · hash2(X). • We probe at distances hash2(X), 2 · hash2(X), 3 · hash2(X), … • A good second hash function is essential. • The second hash function must never evaluate to zero! • The second hash function must be chosen such that all cells can be probed (this requires a prime table size).

  22. Example of double hashing • hash2(X) = R - (X mod R), where R = 7 • Example of inserting keys {89, 18, 49, 58, 69}
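
The figure is missing here as well, so this sketch replays the example. It assumes a table of 10 cells with Hash(X) = X mod 10 (not stated on the slide) and non-negative keys; `double_insert`, `hash2` and `EMPTY` are hypothetical names.

```c
#define TABLE_SIZE 10
#define R 7             /* constant of the second hash function */
#define EMPTY (-1)      /* marks a free cell; keys must be non-negative */

/* Second hash function from the slide; the result is in 1..R, never 0. */
int hash2(int key)
{
    return R - (key % R);
}

/* Inserts key with double hashing, F(i) = i * hash2(key).
   Returns the cell that was used, or -1 if no free cell was found
   within TABLE_SIZE probes. */
int double_insert(int table[TABLE_SIZE], int key)
{
    int i;
    int step = hash2(key);

    for (i = 0; i < TABLE_SIZE; i++) {
        int cell = (key % TABLE_SIZE + i * step) % TABLE_SIZE;

        if (table[cell] == EMPTY) {
            table[cell] = key;
            return cell;
        }
    }
    return -1;
}
```

Starting from a table filled with `EMPTY`, the keys 89, 18, 49, 58 and 69 end up in cells 9, 8, 6, 3 and 0. The example also shows why a prime table size matters: key 58 has step 5, and with 10 cells its probe sequence only alternates between cells 8 and 3.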

  23. Problems with closed hash tables • Standard deletion cannot be performed, because the cell might have caused a collision, so that a later key was placed past it. • Closed hash tables require lazy deletion: an additional field is added to each element to tag it as deleted. • If the table gets too full, the operations get slower and an insertion might even fail. • This can happen when many deletions are intermixed with insertions.
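
Lazy deletion can be sketched as follows. This is an illustration under stated assumptions rather than the slide's code: it uses linear probing, a table of 10 cells, non-negative integer keys, and the hypothetical names `cell_state`, `lazy_find`, `lazy_insert` and `lazy_delete`.

```c
#define TABLE_SIZE 10

enum cell_state { EMPTY, OCCUPIED, DELETED };

struct cell {
    int key;
    enum cell_state state;  /* the additional field used for tagging */
};

/* Returns the cell holding key, or -1 if it is absent. The search
   stops at an EMPTY cell but continues past DELETED cells. */
int lazy_find(const struct cell table[TABLE_SIZE], int key)
{
    int i;

    for (i = 0; i < TABLE_SIZE; i++) {
        int c = (key % TABLE_SIZE + i) % TABLE_SIZE;

        if (table[c].state == EMPTY)
            return -1;
        if (table[c].state == OCCUPIED && table[c].key == key)
            return c;
    }
    return -1;
}

/* Stores key in the first EMPTY or DELETED cell of its probe sequence.
   Assumes key is not already present. Returns the cell, or -1 if full. */
int lazy_insert(struct cell table[TABLE_SIZE], int key)
{
    int i;

    for (i = 0; i < TABLE_SIZE; i++) {
        int c = (key % TABLE_SIZE + i) % TABLE_SIZE;

        if (table[c].state != OCCUPIED) {
            table[c].key = key;
            table[c].state = OCCUPIED;
            return c;
        }
    }
    return -1;
}

/* Tags the cell as deleted instead of emptying it. Returns 1 on success. */
int lazy_delete(struct cell table[TABLE_SIZE], int key)
{
    int c = lazy_find(table, key);

    if (c < 0)
        return 0;
    table[c].state = DELETED;
    return 1;
}
```

If 89 and 49 are inserted (cells 9 and 0) and 89 is then deleted, 49 can still be found because the search walks past the tagged cell 9. Emptying cell 9 outright would have cut the probe sequence and made 49 unreachable.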

  24. Rehashing • The solution is rehashing: • Build another table that is twice as big, with a new hash function. • Scan the original hash table. • Insert all non-deleted elements into the new hash table. • It is an expensive operation. • It happens infrequently. • Several strategies: • Rehash when the table is half full, • Rehash only when an insertion fails, • Rehash when a certain load factor is reached.
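
The steps above can be sketched for a linear-probing table that rehashes once it is half full. The sketch makes several simplifying assumptions: keys are non-negative integers, there is no deletion and no duplicate check, the initial size is greater than zero, and the new size is simply doubled rather than rounded to a prime. All names (`struct table`, `table_init`, `place`, `rehash`, `table_insert`) are hypothetical.

```c
#include <stdlib.h>

#define EMPTY (-1)          /* marks a free cell; keys must be non-negative */

struct table {
    int *cells;
    unsigned int size;      /* number of cells */
    unsigned int count;     /* number of stored keys */
};

/* Allocates size empty cells (size > 0); returns 0 on allocation failure. */
int table_init(struct table *t, unsigned int size)
{
    unsigned int i;

    t->cells = malloc(size * sizeof *t->cells);
    if (t->cells == NULL)
        return 0;
    for (i = 0; i < size; i++)
        t->cells[i] = EMPTY;
    t->size = size;
    t->count = 0;
    return 1;
}

/* Linear probing without resizing; the caller guarantees a free cell. */
static void place(struct table *t, int key)
{
    unsigned int c = (unsigned int)key % t->size;

    while (t->cells[c] != EMPTY)
        c = (c + 1) % t->size;
    t->cells[c] = key;
    t->count++;
}

/* Builds a table twice as big and reinserts every key of the old one.
   The new hash function is key mod (new size). */
int rehash(struct table *t)
{
    struct table bigger;
    unsigned int i;

    if (!table_init(&bigger, 2 * t->size))
        return 0;
    for (i = 0; i < t->size; i++)       /* scan the original table */
        if (t->cells[i] != EMPTY)
            place(&bigger, t->cells[i]);
    free(t->cells);
    *t = bigger;
    return 1;
}

/* Inserts key, rehashing first if the table is already half full.
   Returns 0 only if a required rehash could not allocate memory. */
int table_insert(struct table *t, int key)
{
    if (2 * t->count >= t->size && !rehash(t))
        return 0;
    place(t, key);
    return 1;
}
```

Starting with 4 cells, inserting 89 and 18 fills half the table, so inserting 49 first grows the table to 8 cells; inserting 58 and 69 afterwards triggers one more rehash to 16 cells. Each rehash touches every stored key, which is why it is expensive but rare.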

  25. Hash tables • Hash tables are used to implement the Insert and Find operations in constant average time. • Hash table usage: • In compilers, to keep track of declared variables (symbol table). • In graph algorithms where nodes have names instead of numbers. • In game-playing programs, to record positions already seen (transposition table). • For dictionary implementations (spell checkers, search engines, …). • For database implementations.
