Algorithms and Data Structures

Algorithms and Data Structures BSc/BSc(Hons) Computer Science

Hash Tables
• We can store and retrieve information from lists with performance that is O(n), O(log n) or O(log(log n)), depending on the list organisation, i.e. whether we are using pointers or arrays (closed or open addressing), and on the search technique.
• An efficient search technique is one that minimises the number of comparisons. Ideally we would like a table organisation and a search technique in which there are no unnecessary comparisons.
• If each key is to be retrieved in a single access, then the location of the record within the table can depend only on that key and not on other keys, as it does in a tree structure.

Hash Table Definition
• An array is the most efficient organisation for such a table.
• If the key is an integer or other ordinal type we can store the associated records in an array (a table), e.g.

    public class HashTable {
        DataItem[] hashArray; // array holds hash table of data items
        int nElems;           // number of elements in hashtable

        public HashTable(int size) {
            nElems = 0;
            hashArray = new DataItem[size];
        }
    }

• Then the record associated with any key value i is immediately available without any searching.
• Why not use this for storing all collections of data?

Problems
• For example, a 5 digit integer key could result in 100,000 possible unique key values.
• Also, it’s not always possible or convenient to have an ordinal key; for example, the key could be an alphanumeric combination. A five character alphanumeric key converted to an integer results in even more unique key values (36^5 = 60,466,176 if only digits and single-case letters are allowed).
• Although only a subset of these unique keys may actually be used, direct addressing still needs an array large enough for every possible key, e.g. 100,000 slots for the 5 digit key.
• Even with today’s large memories, reserving a slot for every possible key wastes space when few keys are in use, and it quickly becomes infeasible as the key space grows.
• What we need is to take the original key and convert it to a value in a restricted range, i.e. apply a hash function.

Hash Function
• Use a procedure to map the search key into an integer that can be used as an index into the array:

    public int hashFunc(SomeType key, int tablesize);
    /* where SomeType represents the data type of the key and
       tablesize represents the size of the restricted hash table */

• The hash function must:
    • ensure that the resultant integer is within the bounds of the storing array;
    • spread keys evenly, as if at random, over the integers in that range (while always returning the same integer for the same key); and
    • be quick to calculate.

Example: Keys of Type String
• How do we calculate a hash index when the key value is not an integer? For example, the item type might be defined by a class as follows, where the key is a String:

    public class DataItem {
        public String key;
        public String info;
        // could be more data fields

        // constructor
        public DataItem(String nKey, String nInfo) {
            key = nKey;
            info = nInfo;
        }
    }

• We need a hash function which can calculate a hash index from a String.

Example String Hash Function
• In the case where keys are strings, we need a mapping from the String to an array index (or hash index).
• If we have an array of characters (more than two) as the key, then the following is a reasonable hash function:

    public int hashFunc(String key, int tableSize) {
        int sum = 0;
        // add up the character codes of the key
        for (int i = 0; i < key.length(); i++) {
            sum = sum + (int) key.charAt(i);
        }
        // reduce the sum to a valid array index
        return sum % tableSize;
    }
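As a quick check of the additive hash function above, here is a minimal runnable sketch. The class name StringHashDemo, the table size of 11 and the sample keys are assumptions chosen for illustration; they do not come from the slides.

```java
class StringHashDemo {
    // Same additive scheme as the slide: sum the character codes,
    // then reduce the sum modulo the table size.
    static int hashFunc(String key, int tableSize) {
        int sum = 0;
        for (int i = 0; i < key.length(); i++) {
            sum += key.charAt(i);
        }
        return sum % tableSize;
    }

    public static void main(String[] args) {
        int tableSize = 11; // hypothetical table size
        // 'c' + 'a' + 't' = 99 + 97 + 116 = 312, and 312 % 11 = 4
        System.out.println(hashFunc("cat", tableSize)); // prints 4
        // An anagram has the same character sum, so it collides.
        System.out.println(hashFunc("act", tableSize)); // prints 4
    }
}
```

Because addition ignores character order, any two anagrams hash to the same index. This is one reason the function is only "reasonable" and why collision handling is still needed.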

Other Hash Algorithms
• Mid Square Hashing
    • Key regarded as an integer and squared. Middle digits used as address.

    Key     Squared      Address
    12345   0152399025   39
    00004   0000000016   00
    50000   2500000000   00   (collision with 00004)

• Chopping and Adding
    • Key chopped into sections with the same number of digits as the address. The sections are then added together and the least significant digits used as the address.

    Key     Chopped        Address
    12345   45 + 23 + 1    69
    12543   43 + 25 + 1    69   (collision with 12345)
    00004   04 + 00 + 0    04
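The two schemes above can be sketched in Java as follows. This is an illustration under stated assumptions rather than a reference implementation: keys are non-negative integers of at most five digits, addresses have two digits, and the names OtherHashDemo, midSquare and chopAndAdd are hypothetical.

```java
class OtherHashDemo {
    // Mid-square: square the key, pad the square to ten digits,
    // and use the middle two digits as the address.
    static int midSquare(int key) {
        long squared = (long) key * key;
        String digits = String.format("%010d", squared);
        return Integer.parseInt(digits.substring(4, 6));
    }

    // Chopping and adding: split the key into two-digit sections
    // (working from the right), add them, keep the last two digits.
    static int chopAndAdd(int key) {
        int sum = 0;
        while (key > 0) {
            sum += key % 100;
            key /= 100;
        }
        return sum % 100;
    }

    public static void main(String[] args) {
        System.out.println(midSquare(12345));  // prints 39
        System.out.println(midSquare(4));      // prints 0
        System.out.println(midSquare(50000));  // prints 0 (collides with key 4)
        System.out.println(chopAndAdd(12345)); // prints 69
        System.out.println(chopAndAdd(12543)); // prints 69 (collides with 12345)
        System.out.println(chopAndAdd(4));     // prints 4
    }
}
```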

Other Hash Algorithms (cont)
• Remainder Method
    • Key divided by the size of the hash table (i.e. the array) and the remainder used as the address. The array size should be prime to reduce collisions between records having similar key patterns.

    Assume number of pages (table size) = 97

    Key     Key / No. of Pages   Address
    12345   127 rem 26           26
    12543   129 rem 30           30
    23989   247 rem 30           30   (collision with 12543)
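The remainder method is a one-line function in Java. The sketch below reproduces the three rows of the table for a table size of 97; the class and method names are hypothetical, and Math.floorMod is used (an assumption beyond the slides) so that a negative key would still give a valid index.

```java
class RemainderHashDemo {
    // Remainder method: the address is the key modulo the table size.
    static int remainderHash(int key, int tableSize) {
        return Math.floorMod(key, tableSize);
    }

    public static void main(String[] args) {
        int tableSize = 97; // prime table size, as in the slide
        System.out.println(remainderHash(12345, tableSize)); // prints 26
        System.out.println(remainderHash(12543, tableSize)); // prints 30
        System.out.println(remainderHash(23989, tableSize)); // prints 30 (collision)
    }
}
```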

Collisions
• None of these hash functions can guarantee a unique address, given that there are more keys than array entries. The hash function will return the same integer for a number of keys, and therefore there may be “multiple hits” or “collisions”. Each address generated corresponds to an array entry; if that entry is already occupied, the record must be inserted at some other position.
• A good hash function minimises the number of collisions.
• But since collisions can still happen, we need techniques to manage them.

Linear Probing (Open Addressing)
• A collision occurs when the hash function returns an index that is already occupied in the hash table.
• We can deal with collisions (resolve them) by checking successive locations. This is known as linear probing.
• If the array index is full, then sequentially search the following array indices for an empty slot.

    Key   Address        Table index   Contains Key
    4     0              0             4
    5     2              1             20
    10    2              2             5
    12    6              3             10
    18    5              4             21
    20    0              5             18
    21    4              6             12
    25    1              7             25

                         Hash Table
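The table above can be reproduced by a short trace. The sketch below takes the keys and their hash addresses exactly as listed, inserts them in that order, and moves to the next slot (wrapping around) on each collision. The class name, the use of -1 as an "empty" marker and the assumption that the keys were inserted in the listed order are illustrative choices, not part of the slides.

```java
import java.util.Arrays;

class LinearProbingTrace {
    // Places each key at its home address, or at the next free slot
    // after it, wrapping around at the end of the table.
    // Assumes keys.length <= tableSize, so a free slot always exists.
    static int[] fill(int[] keys, int[] addresses, int tableSize) {
        int[] table = new int[tableSize];
        Arrays.fill(table, -1); // -1 marks an empty slot (keys are non-negative)
        for (int k = 0; k < keys.length; k++) {
            int slot = addresses[k];
            while (table[slot] != -1) {
                slot = (slot + 1) % tableSize; // probe the next cell
            }
            table[slot] = keys[k];
        }
        return table;
    }

    public static void main(String[] args) {
        int[] keys      = {4, 5, 10, 12, 18, 20, 21, 25};
        int[] addresses = {0, 2, 2, 6, 5, 0, 4, 1};
        System.out.println(Arrays.toString(fill(keys, addresses, 8)));
        // prints [4, 20, 5, 10, 21, 18, 12, 25]
    }
}
```

Key 25 shows the cost of clustering: its home address is 1, but slots 1 to 6 are all occupied, so it ends up in slot 7 after six failed probes.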

Inserting an Element

    // pre-condition: table is not full
    public void insert(DataItem item) {
        int hashVal = hashFunc(item.key, hashArray.length); // hash the key
        // loop until empty cell
        while (hashArray[hashVal] != null) {
            hashVal++;                             // go to next cell
            hashVal = hashVal % hashArray.length;  // wrap around if necessary
        }
        hashArray[hashVal] = item; // insert item
        nElems++;                  // increase element count
    }

• Consider how this code needs to be modified to cope with the following two special cases:
    • hashArray[hashVal].key equals item.key (the item is already in the table);
    • the table is full, i.e. we have checked all locations.

Retrieving an Element

    // pre-condition: table is not full
    public DataItem find(String target) {
        int hashVal = hashFunc(target, hashArray.length); // hash the key
        // loop until empty cell
        while (hashArray[hashVal] != null) {
            // found the key?
            if (hashArray[hashVal].key.equals(target)) {
                return hashArray[hashVal]; // return located DataItem
            }
            hashVal++;                             // go to next cell
            hashVal = hashVal % hashArray.length;  // wrap around if necessary
        }
        return null; // can't find item
    }

• Consider how this code needs to be modified to cope with the condition when the element is not in the hashtable and the hashtable is full.

Deleting an Element

    // pre-condition: table is not full
    public DataItem delete(String target) {
        int hashVal = hashFunc(target, hashArray.length); // hash the key
        while (hashArray[hashVal] != null) {               // loop until empty cell
            if (hashArray[hashVal].key.equals(target)) {   // found the key?
                DataItem temp = hashArray[hashVal];        // save item
                // delete item; note that an emptied cell also ends later
                // searches for any key that probed past this cell
                hashArray[hashVal] = null;
                nElems--;                                  // decrease element count
                return temp;                               // return item
            }
            hashVal++;                             // go to next cell
            hashVal = hashVal % hashArray.length;  // wrap around if necessary
        }
        return null; // can't find item
    }

• Consider how this code needs to be modified to cope with the condition when the element is not in the hashtable and the hashtable is full.
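One possible way to handle the special cases raised on the insert, retrieve and delete slides is sketched below. It is not the only answer, and it simplifies the slides' table by storing String keys only. It bounds every probe loop by the table length (so a full table cannot cause an endless loop), rejects duplicate keys, and marks deleted cells with a "tombstone" instead of null so that later searches keep probing past them. The class ProbingTable, the DELETED marker and the method names are all hypothetical.

```java
class ProbingTable {
    // Tombstone for deleted cells, compared by identity (==), never by equals.
    private static final String DELETED = new String("<deleted>");
    private final String[] keys;
    private int nElems = 0;

    ProbingTable(int size) { keys = new String[size]; }

    private int hashFunc(String key) {
        int sum = 0;
        for (int i = 0; i < key.length(); i++) sum += key.charAt(i);
        return sum % keys.length;
    }

    // Returns the index holding the key, or -1 if it is absent.
    int findSlot(String key) {
        int slot = hashFunc(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[slot] == null) return -1; // never-used cell ends the search
            if (keys[slot] != DELETED && keys[slot].equals(key)) return slot;
            slot = (slot + 1) % keys.length;   // go to next cell, wrapping around
        }
        return -1; // every cell checked: table full and key absent
    }

    // Returns false if the key is already present or no free cell exists.
    boolean insert(String key) {
        if (findSlot(key) >= 0) return false;  // duplicate key
        int slot = hashFunc(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[slot] == null || keys[slot] == DELETED) {
                keys[slot] = key;
                nElems++;
                return true;
            }
            slot = (slot + 1) % keys.length;
        }
        return false; // table is full
    }

    boolean delete(String key) {
        int slot = findSlot(key);
        if (slot < 0) return false;
        keys[slot] = DELETED; // keep the probe sequence intact
        nElems--;
        return true;
    }

    int size() { return nElems; }

    public static void main(String[] args) {
        ProbingTable table = new ProbingTable(5);
        table.insert("ab"); // 97 + 98 = 195, 195 % 5 = 0
        table.insert("ba"); // same sum, so it probes on to cell 1
        table.delete("ab"); // cell 0 becomes a tombstone
        System.out.println(table.findSlot("ba")); // prints 1
    }
}
```

Had delete stored null in cell 0, the search for "ba" would have stopped there and reported the key as missing even though it is still in the table.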

Problems With Linear Probing
• Linear probing leads to clusters, which reduce the performance of the table, i.e. the speed of insertion/retrieval/deletion. The reason why linear probing leads to clustering is as follows:
    • When the table is empty, the probability of filling any given slot is 1/N, if the hash function generates values in the range 1..N with equal probability.
    • When one slot has been filled, the probability of filling the slot that follows it is 2/N, since it will be filled if the hash function generates either the index of the occupied slot or the index of the following slot.
    • The probability of filling the slot following k consecutive occupied slots is (k+1)/N, since it will be filled if the hash function generates the index of any of the occupied slots or the index of the following slot.
• We can see that there is therefore a tendency to form clusters: the longer a cluster is, the more likely it is to grow.
• Informally, we can argue that this will reduce performance, because a probe that lands in a cluster must step through the rest of it.

Quadratic Probing
• When a collision occurs we can generate a new position using

    newposition = (hashFunc(key, tablesize) + i * i) % tablesize    where i = 1, 2, 3, ...

• This mitigates the formation of clusters, as the positions examined are further and further away from the original position returned by the hash function.
• But how do we know when we are beginning to check the same locations?
• It can be shown (for a table whose size N is prime) that this is the case when we have checked (N+1)/2 positions. When this number of locations has been checked there is no point in checking any further, i.e. report table ’full’.
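The probe sequence can be listed with a few lines of Java. This is a minimal sketch assuming a prime table size and counting the home position as probe i = 0; the class and method names are hypothetical, and overflow of i * i for very large tables is ignored.

```java
import java.util.ArrayList;
import java.util.List;

class QuadraticProbeDemo {
    // Position of the i-th probe for a key whose home position is `home`.
    static int probe(int home, int i, int tableSize) {
        return (home + i * i) % tableSize;
    }

    // The first (N + 1) / 2 probe positions, starting with the home position.
    static List<Integer> probeSequence(int home, int tableSize) {
        List<Integer> positions = new ArrayList<>();
        int limit = (tableSize + 1) / 2;
        for (int i = 0; i < limit; i++) {
            positions.add(probe(home, i, tableSize));
        }
        return positions;
    }

    public static void main(String[] args) {
        // Table size 7 (prime), home position 3: probes 3, 3+1, 3+4, 3+9 (mod 7)
        System.out.println(probeSequence(3, 7)); // prints [3, 4, 0, 5]
        // The next probe (i = 4) gives (3 + 16) % 7 = 5, a position already checked.
        System.out.println(probe(3, 4, 7));      // prints 5
    }
}
```

For N = 7 only four distinct positions are ever examined, which is why an insert may have to report the table as "full" while empty slots remain.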

Chaining
• If a collision is detected, then the data is inserted into the list located at that hash address.

    Key   Address        Table index   Contains Key(s)
    4     0              0             4 -> 20
    5     2              1             25
    10    2              2             5 -> 10
    12    6              3             (empty)
    18    5              4             21
    20    0              5             18
    21    4              6             12
    25    1              7             (empty)

                         Hash Table

Chaining - Java Implementation
• Instead of an array of records, use an array of linked lists:

    import java.util.LinkedList;

    public class HashTable {
        LinkedList[] hashArray; // array of lists
        int nElems;             // number of elements in hashtable

        public HashTable(int size) {
            nElems = 0;
            hashArray = new LinkedList[size];
            // start every slot with an empty chain
            for (int i = 0; i < size; i++) {
                hashArray[i] = new LinkedList();
            }
        }
    }

• There are several advantages to this, namely:
    • normal insertion and collision resolution are easy: simply put the new item at the beginning of the list;
    • the table never becomes full;
    • deletions are no longer a problem;
    • it is simple to predict the performance.
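To illustrate the advantages listed above, here is a minimal sketch of a chained table with insert, lookup and delete. It stores String keys only and reuses the additive string hash from the earlier slide; the class name ChainedTable and its method names are hypothetical, and the typed LinkedList<String> array is an assumption that goes beyond the raw LinkedList used on the slide.

```java
import java.util.LinkedList;

class ChainedTable {
    private final LinkedList<String>[] chains; // one list per hash address
    private int nElems = 0;

    @SuppressWarnings("unchecked")
    ChainedTable(int size) {
        chains = new LinkedList[size];
        for (int i = 0; i < size; i++) {
            chains[i] = new LinkedList<>();
        }
    }

    private int hashFunc(String key) {
        int sum = 0;
        for (int i = 0; i < key.length(); i++) sum += key.charAt(i);
        return sum % chains.length;
    }

    // Collision resolution is just "add to the front of the chain".
    void insert(String key) {
        chains[hashFunc(key)].addFirst(key);
        nElems++;
    }

    boolean contains(String key) {
        return chains[hashFunc(key)].contains(key);
    }

    // Removing a key only touches its own chain; other keys are unaffected.
    boolean delete(String key) {
        boolean removed = chains[hashFunc(key)].remove(key);
        if (removed) nElems--;
        return removed;
    }

    int size() { return nElems; }

    public static void main(String[] args) {
        ChainedTable table = new ChainedTable(5);
        table.insert("ab"); // 195 % 5 = 0
        table.insert("ba"); // also address 0, so it joins the same chain
        table.delete("ab");
        System.out.println(table.contains("ba")); // prints true
        System.out.println(table.size());         // prints 1
    }
}
```

Unlike the open addressing version, the number of stored keys can exceed the number of slots; the chains simply grow longer and lookups become slower.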

Performance of Hash Tables
• A table implemented as an array over an integer, enumerated type or other ordinal type provides direct access to elements of the table on specification of the index value (integer, enumerated type value, etc.). The compiler/run-time system provides such direct access primarily via address computation.
• In the case of a hash table with an integer or string-based index set, implemented (as an ADT) using hashing, the performance of table access is wholly dependent upon how full the table is. A measure of how full the table is (the load factor) is given by:

    $\lambda = \frac{n}{t}$, where n is the number of entries and t is the number of array slots

• NOTE: t is the same whether we are using chaining or not. The latter case is referred to as Open Addressing/Contiguous/Direct Addressing.

Performance Analysis Open Addressing
• In this analysis we make some assumptions. We assume that all probes are independent and random, i.e. all probes are regarded as independent events. Writing $\lambda$ for the load factor (number of entries divided by number of array slots):
    • The probability that a probe hits an occupied slot is $\lambda$.
    • The probability that a probe hits an unoccupied slot is $(1 - \lambda)$.
• Unsuccessful retrieval terminates on an unoccupied slot. The set of possibilities is:
    • unsuccessful after one probe has probability $(1 - \lambda)$
    • unsuccessful after two probes has probability $\lambda (1 - \lambda)$
    • unsuccessful after three probes has probability $\lambda^{2} (1 - \lambda)$
    • unsuccessful after k probes has probability $\lambda^{k-1} (1 - \lambda)$

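Performance Analysis - cont
• Average number of probes = the sum over each possibility of its probe count multiplied by its probability, where $\lambda$ is the load factor:

    $\sum_{k=1}^{\infty} k \, \lambda^{k-1} (1 - \lambda) = \frac{1}{1 - \lambda}$

  This follows from $\sum_{k=1}^{\infty} k \, \lambda^{k-1} = \frac{1}{(1 - \lambda)^{2}}$ for $0 \le \lambda < 1$.
• Successful retrieval with open addressing. It can be shown that the average number of probes in this case is

    $\frac{1}{\lambda} \ln \frac{1}{1 - \lambda}$

Chaining - Unsuccessful Retrieval:
• If a chain contains k items, unsuccessful retrieval implies that all k items are examined.
• On average $k = \lambda$, the average number of items on each chain,
• i.e. average number of probes $= \lambda$.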
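To get a feel for these formulas, the sketch below evaluates them for a few load factors. It assumes the idealised model used in the analysis (independent, random probes) and a load factor strictly between 0 and 1 for the open addressing cases; the class and method names are hypothetical.

```java
class ProbeEstimates {
    // Open addressing, unsuccessful retrieval: 1 / (1 - lambda)
    static double unsuccessfulOpen(double lambda) {
        return 1.0 / (1.0 - lambda);
    }

    // Open addressing, successful retrieval: (1 / lambda) * ln(1 / (1 - lambda))
    static double successfulOpen(double lambda) {
        return Math.log(1.0 / (1.0 - lambda)) / lambda;
    }

    // Chaining, unsuccessful retrieval: the average chain length, lambda
    static double unsuccessfulChaining(double lambda) {
        return lambda;
    }

    public static void main(String[] args) {
        double[] loadFactors = {0.25, 0.5, 0.75, 0.9};
        for (double lambda : loadFactors) {
            System.out.printf("lambda=%.2f  open/unsuccessful=%.2f  open/successful=%.2f  chaining/unsuccessful=%.2f%n",
                    lambda,
                    unsuccessfulOpen(lambda),
                    successfulOpen(lambda),
                    unsuccessfulChaining(lambda));
        }
    }
}
```

At a load factor of 0.5 an unsuccessful open addressing search is expected to take 2 probes, and the figure rises steeply as the load factor approaches 1, whereas the chaining estimate grows only linearly.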
