
Data Structures Hash Tables


Presentation Transcript


  1. Data Structures Hash Tables Phil Tayco Slide version 1.1 April 30, 2018

  2. Hash Tables Storage space revisited A common argument in recent computing is that large amounts of disk space are cheap to acquire. Situations can then treat the use of large amounts of space as less critical. This implies the use of arrays for managing data sets.

  3. Hash Tables Sorted data If we are okay with using arrays, then certain situations that suit them can be identified. Sorted data leads to O(log n) search performance. Getting the data sorted is at best O(n log n) using quicksort, or O(n) per operation if we keep the array in order while performing maintenance. Performance is strong once the data is sorted, but maintaining that order can be costly.

  4. Hash Tables Unless we don’t need to sort Besides O(log n) search, sorted data also helps with presenting parts or all of the data (such as a web page report). If there isn’t a need to show sorted data (such as an employee management system where records are maintained one at a time), then the need to sort the data is removed. Searching unsorted data, however, is O(n), so we are now looking for a structure that delivers O(log n) maintenance performance (or better) without needing the sort (and we are okay with using arrays).

  5. Hash Tables Array index as key To achieve this, we take advantage of the fact that arrays allow direct access to their elements. Direct access is achieved by using the array index number. The question is how to maximize use of the array index when performing the maintenance functions.

  6. Hash Tables An ideal example Consider a company of 1,000 employees that is very unlikely to ever exceed 100,000. Storage is not an issue, and memory capacity can easily accommodate 100,000 records. The program maintaining these records does not prioritize functionality that shows the employee records in any sorted way. This is all great, because an array can be used with enough space to handle the worst case of 100,000 records.

  7. Hash Tables Index representation To take full advantage of the array, we treat the array index as a key value identifying an employee. Sequential employee id numbers make the perfect key (employee 15 is employees[14]). On a larger scale, an employee SSN could be used in the same way (assuming you can hold up to 999,999,999 records!). Each employee id is a unique index value, so there would never be overlap (unless you reused employee ids after employees left the company).

  8. Hash Tables Ideal efficiency Just how fast is this? Search: you know the id number, so you know the array index and have direct access. Insert: maintaining the last assigned employee id number makes adding new employees easy. Update/Delete: a search followed by the appropriate change. Each one of these ends up at O(1)!
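A minimal sketch of this direct-index idea in Java (the Employee class, field names, and the 100,000 capacity are illustrative assumptions):

    // The employee id doubles as the array index: id 15 lives at records[14].
    public class DirectIndexStore {
        static class Employee {
            final int id;
            final String name;
            Employee(int id, String name) { this.id = id; this.name = name; }
        }

        private final Employee[] records = new Employee[100_000];

        public void insert(Employee e) { records[e.id - 1] = e; }    // O(1)
        public Employee search(int id) { return records[id - 1]; }   // direct access, O(1)
        public void delete(int id)     { records[id - 1] = null; }   // O(1)
    }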

  9. Hash Tables Reality Such ideal situations are exactly that: ideal. Real situations tend to lose out on some factor: not quite enough storage space, requiring a smaller array size; or record id values that are not unique numbers. Can we reduce the array size and still find a way to line up a unique record id with an array index?

  10. Hash Tables Hashing Hashing involves deriving an index value through some logical calculation. The derivation is applied to a field (or combination of fields) of the record to calculate an index. Typical example: adding all the ASCII values of some field, like first and last name, and using mod to bring the result into the range of valid indexes.

  11. Hash Tables Calculations Example: “Phil Tayco” as the name of the record. Add all ASCII character values: 80 + 104 + 105 + 108 = 397 for “Phil”; 84 + 97 + 121 + 99 + 111 = 512 for “Tayco”. Total = 909. Say we only allow for 500 array elements. We can then mod this value by the array size: 909 % 500 = array index 409. Utilizing this approach means we have a consistent formula for deriving an index value.
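As a sketch in Java (the 500-element array size comes from the example above; everything else is illustrative):

    // ASCII-sum hash: add the character values of the key, then mod by
    // the array size to fold the total into a valid index.
    static int hash(String key, int tableSize) {
        int sum = 0;
        for (char c : key.toCharArray()) {
            sum += c;               // each character's ASCII value
        }
        return sum % tableSize;
    }

    // hash("PhilTayco", 500): 397 ("Phil") + 512 ("Tayco") = 909,
    // and 909 % 500 = 409, matching the calculation above.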

  12. Hash Tables Limitations Challenges immediately come to mind when looking at this example: eventually, the index calculation for 2 different records will derive the same value (called a “collision”). A calculation that guarantees a unique value often requires a large amount of space with heavy underutilization. We need to keep the capacity of the array reasonable while handling the inevitable collisions.

  13. Hash Tables Collisions There are multiple approaches for handling collisions when hashing. Open addressing uses the strategy of finding another open element in the array by following a search-like algorithm. The assumption is that there will be enough space for all entries (i.e. the estimated maximum capacity of the hash array is adequate).

  14. Hash Tables Linear Probing Linear probing is the basic open addressing algorithm. If a collision occurs, look in the next immediate spot in the array. If it is open, place the new item there. If it is not, continue looking at the next array index (wrapping to index 0 if needed) until an open spot is found. This is an issue only if capacity is reached (making the initial estimate important).
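A linear-probe insert, sketched in Java (reusing the hash function from the earlier sketch; null marks an empty slot, and the table is assumed never to be completely full):

    static String[] table = new String[500];       // assumed capacity

    static void insert(String key) {
        int index = hash(key, table.length);       // initial hash location
        while (table[index] != null) {             // collision: slot occupied
            index = (index + 1) % table.length;    // try the next slot, wrapping to 0
        }
        table[index] = key;                        // first open spot found
    }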

  15. Hash Tables Linear Probe Search If the hash array utilizes this form of collision handling on insert, the other functions must follow suit. Search uses the hash function to check whether a given record is at the hashed location. If that location is “empty”, the search is over. If the record is there, then the record is found. Otherwise, the search continues with the next array element. “Empty”, however, must be defined with a predetermined record value. Why?

  16. Hash Tables Linear Probe Delete Because a delete cannot simply perform the search and, if the record is found, remove it from the array. This would leave an empty spot in the array that may be interpreted as “record not found” during a later search. Instead, the array element is changed to another predetermined value meaning “deleted”. Search does not treat this as an empty spot.
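Search and delete with the “deleted” marker, sketched in Java under the same assumptions (the sentinel string is illustrative; a full implementation would also stop the probe after one complete loop of the table):

    static final String DELETED = "<deleted>";     // assumed sentinel value

    static int search(String key) {
        int index = hash(key, table.length);
        while (table[index] != null) {             // null = truly empty: stop
            if (table[index].equals(key)) {
                return index;                      // record found
            }
            index = (index + 1) % table.length;    // a DELETED slot keeps the probe going
        }
        return -1;                                 // hit an empty slot: not found
    }

    static void delete(String key) {
        int index = search(key);
        if (index != -1) {
            table[index] = DELETED;                // mark the slot rather than emptying it
        }
    }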

  17. Hash Tables Example: Records “T”, “Y” and “R” have been hashed into the array. Diagram: [3]=T, [4]=Y, [5]=empty, [6]=R

  18. Hash Tables New record “D” comes in and the hash function calculates its index as [3]. Diagram: D arrives at [3], which is occupied by T

  19. Hash Tables Record “D” collides with record “T”. Linear probe means try the next index. Diagram: the probe moves from [3] to [4]

  20. Hash Tables However, record “Y” is already there, so we try the next one. It is open, so that’s where “D” goes. Diagram: [3]=T, [4]=Y, [5]=D, [6]=R

  21. Hash Tables Later on, record “Y” is called for deletion. When “Y” is hashed, its index value is [4]. “Y” is there, so the deletion is performed. Diagram: [3]=T, [4]=Y, [5]=D, [6]=R

  22. Hash Tables However, if we remove it, that creates an empty space… Diagram: [3]=T, [4]=empty, [5]=D, [6]=R

  23. Hash Tables If we left it this way, when a search for record “D” begins, its original hash value is still [3]. Diagram: [3]=T, [4]=empty, [5]=D, [6]=R

  24. Hash Tables Since index [3] is not “empty”, the search goes to index [4], which is empty, and then incorrectly returns “not found”. Diagram: the probe stops at [4]=empty and never reaches D at [5]

  25. Hash Tables The solution: instead of removing the record, put in a designated “deleted” value (such as -1). Diagram: [3]=T, [4]=-1, [5]=D, [6]=R

  26. Hash Tables Now when a search for record “D” is performed, the linear probe treats the “-1” as not empty and continues the search correctly. Diagram: [3]=T, [4]=-1 (probe continues), [5]=D (found)

  27. Hash Tables Linear Probe Efficiency As records start to fill up the array, you can infer that the efficiency of the algorithm degrades toward O(n). The degradation depends on the quality of the hash function (how spread out the computed locations are) and on the nature of the data (do the selected fields result in well-spaced hash values?). Other methods of probing exist: quadratic probing and double hashing.

  28. Hash Tables Quadratic probing Here, the probe distance grows with each probing step. If a collision occurs, the first probe is one step away. If the search must continue, the offsets from the original index grow as the squares: 1, then 4, then 9, then 16, etc. Stepping “wraps” around as needed. If the probe steps get too large, the probe can restart from the next linear location and repeat. The idea here is that colliding elements get spaced out more, resulting in fewer of the clusters that linear probing creates.
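A sketch of the quadratic probe sequence in Java (same table and hash function as the earlier sketches; the loop returns the first open or matching slot):

    static int quadraticProbe(String key) {
        int home = hash(key, table.length);            // original hash location
        for (int i = 0; i < table.length; i++) {
            int index = (home + i * i) % table.length; // offsets 0, 1, 4, 9, 16, ...
            if (table[index] == null || table[index].equals(key)) {
                return index;                          // open slot or existing key
            }
        }
        return -1;                                     // no usable slot found
    }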

  29. Hash Tables Double hashing Here, the step size is calculated by a second hash function on the key. If a collision occurs at index n, the key is hashed with a secondary function to generate a probe step value. Probing then commences every step-value number of elements. One key may land at index n with a generated step value of 3 (probe every 3 elements from n); another key may land at x with a step value of 4 (probe every 4 elements from x).

  30. Hash Tables Double hashing The theory here is that using multiple formulas leads to less probing as collisions occur. Hashing the key a second time helps disperse the colliding elements. Mathematically it works well when the hash table array size is a prime number (this guarantees the probe will visit every element if necessary).
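A double-hashing sketch in Java. The secondary function here (built on the prime constant 7) is a common textbook choice, an assumption rather than something from the slides, and it must never return 0. Note the earlier 500-element size is not prime; a real table would pick a prime size such as 499:

    // Secondary hash: derives a nonzero step size from the key.
    static int stepSize(String key) {
        return 7 - (hash(key, table.length) % 7);      // step in the range 1..7
    }

    static int doubleHashProbe(String key) {
        int index = hash(key, table.length);           // first hash: starting slot
        int step = stepSize(key);                      // second hash: probe distance
        for (int i = 0; i < table.length; i++) {
            if (table[index] == null || table[index].equals(key)) {
                return index;
            }
            index = (index + step) % table.length;     // advance by the per-key step
        }
        return -1;   // with a prime table size, every slot will have been visited
    }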

  31. Hash Tables The bottom line Whatever hash function and open addressing probe approach you take, the logic and strategy are the same: determine appropriate field(s) for hash use; develop a hash function that generates reasonably spaced index values; design a collision handling approach that takes advantage of the hash strategy. Best and worst case will always range from O(1) to O(n). Open addressing means trying to reduce the likelihood of O(n). Hash tables can also be remapped to eliminate “deleted” elements.

  32. Hash Tables A more dynamic approach What if you’re not quite sure of your capacity estimate? Or perhaps the maximum size is wildly outrageous and would lead to large amounts of unused space. A second collision handling approach keeps a reasonably sized array and addresses the collisions dynamically. “Dynamic” memory management implies a second structure…

  33. Hash Tables A hash array of linked lists This method, known as “separate chaining”, makes each element of the array a “head” node of a linked list. When an insert is performed, the hash index is found and the new element is inserted into the linked list there. If a collision occurs, it’s okay, because the linked list insert handles it. When a search or delete is performed, the initial hash takes place, followed by a standard linked list search or delete.
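A separate-chaining sketch in Java, using java.util.LinkedList for the per-slot lists (the class and method names are illustrative):

    import java.util.LinkedList;

    // Each array slot is the head of a linked list holding every key
    // that hashed to that index.
    public class ChainedHashTable {
        private final LinkedList<String>[] buckets;

        @SuppressWarnings("unchecked")
        public ChainedHashTable(int size) {
            buckets = new LinkedList[size];
            for (int i = 0; i < size; i++) {
                buckets[i] = new LinkedList<>();
            }
        }

        private int hash(String key) {
            int sum = 0;
            for (char c : key.toCharArray()) sum += c;   // same ASCII-sum idea as before
            return sum % buckets.length;
        }

        public void insert(String key)    { buckets[hash(key)].addFirst(key); }       // a collision just grows the list
        public boolean search(String key) { return buckets[hash(key)].contains(key); }
        public void delete(String key)    { buckets[hash(key)].remove(key); }         // no "deleted" markers needed
    }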

  34. Hash Tables Same example as before: 3 records as heads of lists in the hash array. Diagram: [3] → T, [4] → Y, [6] → R

  35. Hash Tables Record “D” is hashed to index [3] and is inserted into the linked list there (note that T is now the 2nd node in that list). Diagram: [3] → D → T, [4] → Y, [6] → R

  36. Hash Tables Deleting record “Y” is simply hashing to index [4] and performing a linked list delete. Diagram: [3] → D → T, [4] empty, [6] → R

  37. Hash Tables A search for “D” hashes to index [3] as normal, and a linked list search is performed (which happens to find the head node!). Diagram: [3] → D → T, [6] → R

  38. Hash Tables Separate Chaining pros and cons The overhead of using linked lists does impact performance, but not necessarily the coding, since the list functions can be modularized. In theory, the performance is the same as open addressing, since it still depends on the hash function developed. The size of the hash array is not a critical dependency, since the linked lists handle the need for additional space. The right combination of a hash function that yields wide-ranging index values with the use of linked lists is generally preferred.

  39. Hash Tables Summary Hash tables have strong benefits for situations where single-record search and maintenance are primary, because of their near-O(1) performance. Obtaining records in ordered groups and data sets is challenging and not conducive to hash tables. Collisions can be handled using open addressing or separate chaining, the latter of which is generally considered more flexible for performance and memory usage. The key is the hash function itself: many formulas and theories exist on which fields and calculations to use to derive index values.
