Hashing for searching

Consider searching a database of records on a given key. There are three standard techniques:
• Searching sequentially --- start at the first record and look at each record in turn until you find the one with the matching key. The advantage of this method is that it is simple to understand, implement and prove correct; in particular, changes to the database of records cannot stop the search from working correctly. The disadvantage is that it is very inefficient for large databases (lots of records).
• Structured (ordered) searching --- the database to be searched is structured in such a way as to make the searching process more efficient. For example, we have already seen (or will see) ordered lists, binary trees and balanced binary trees, which use an ordering between the keys to structure the data being stored. The advantage is that the searching can be made much more efficient. The disadvantage is that updating the database is much more complicated and has the potential for disrupting the search process; consequently, correct implementation is more difficult.
• Searching by hashing --- a completely different approach which does not depend on being able to sort the database records (by key), but does provide structure to improve the efficiency of searches.
2004/2005: CS211

Hashing --- some terminology
• Hashing --- the process of accessing a record by mapping a key value to a position in a (database) table.
• Hash function --- the function that maps any given key to some table position. This is usually denoted by h.
• Hash table --- the data structure (usually an array) that holds the records. This is usually denoted by T.
• Slot --- a position in a hash table.
• Hash modulus --- the number of slots in the hash table. This is usually denoted by M, with slots numbered 0 to M-1.
• The goal when hashing is to arrange things such that, for any key value K stored in the table and some hash function h, we have the following:
• 0 <= h(K) < M, and T[h(K)].key() = K
• In this way, the hash function tells us where the record being searched for (by the given key K) can be found in the hash table.
• But, as usual, it is more complicated than this, as we shall see in later slides!

When to use hashing
• Generally used only for sets: in most cases it should not be used where multiple records with the same key are permitted.
• Not normally used for range searches: for example, finding a record with a key in a certain alphabetic range is not easy to do by hashing.
• Hashing should not be used if the ordering of the elements is important: for example, finding the largest, smallest or next value is not easy with hash tables.
• Hashing is best for answering questions like: what record, if any, has the key value K?
• For databases which are used to answer only questions of this type, hashing is the preferred method since it is very efficient (when done correctly!).
• It is very easy to choose and implement a bad hashing strategy.
• But standard (good) strategies have been developed, and we will look at some of the most important issues.
• First, we should look at a simple example ...

The simplest hashing: array indexing
• In the simplest of cases, hashing is already provided in the form of array indexing.
• For example:
  • When there are n records with unique key values in the range 0 to n-1, we can use the hashing function h(k) = k.
  • Thus, a record with key i can be stored in T[i].
  • In fact, we don't even need to store the key value as part of the record since it is the same as the array index being used!
  • To find a record with key value i, simply look in T[i].
• Unfortunately, this case is the exception rather than the rule:
  • there are usually many more values in the key range than there are slots in the hash table.
• We need to examine more realistic examples.
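
The array-indexing case above can be sketched in a few lines of Java. This is only an illustration: the class name DirectTable and the use of String as the record type are assumptions, not part of the course material.

```java
// Hypothetical sketch: a direct-address table for keys 0..n-1, with h(k) = k.
final class DirectTable {
    private final String[] slots; // slots[i] holds the record whose key is i

    DirectTable(int n) {
        slots = new String[n];
    }

    // The identity hash: the key itself is the slot number.
    static int h(int k) {
        return k;
    }

    void insert(int key, String record) {
        slots[h(key)] = record;
    }

    // Returns the record stored under the key, or null if there is none.
    String search(int key) {
        return slots[h(key)];
    }
}
```

Note that the key is never stored: the array index plays that role, exactly as the slide observes.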

An introduction to collisions
• Suppose we have a key range of 0 to 65,535 (a 16-bit unsigned integer).
• We expect to have to store 1000 records, on average, at any one time.
• It is impractical to use a table with 65,536 slots, where most would be empty.
• Instead, we design a hash function that will store records in a much smaller table. (To store in a table of size 2000 we could just take the key modulo 2000.)
• Since the possible key range is much larger than the size of the table, many key values must map to the same slot (like keys 2000 and 4000, which both map to slot 0). So, unless we are very lucky, some of the records we store will compete for the same slot:
• Given a hash function h and keys k1 and k2, if h(k1) = h(k2) = s, then we say that k1 and k2 have a collision at slot s under hash function h.
• We need to decide upon a collision policy for resolving collisions.
• Then, to find a record with key k, we compute the table position h(k) and, starting at slot h(k), locate the record using knowledge of the collision policy.
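
A minimal sketch of the modulo-2000 hash from the slide, showing the collision between keys 2000 and 4000. The class and method names (ModHash, collide) are hypothetical, and keys are assumed to be non-negative.

```java
// Hypothetical sketch: key modulo table size, as suggested on the slide.
final class ModHash {
    static final int M = 2000; // table size used in the slide's example

    // Keys are assumed to lie in 0..65,535, so the result is in 0..M-1.
    static int h(int key) {
        return key % M;
    }

    // Two keys collide when they hash to the same slot.
    static boolean collide(int k1, int k2) {
        return h(k1) == h(k2);
    }
}
```

Here ModHash.collide(2000, 4000) is true because both keys hash to slot 0.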

Hash functions
• We should choose a hash function which helps to minimise the number of collisions.
• Even in the best cases, collisions are practically unavoidable.
• Perfect hashing, where there are never any collisions, can be coded when all the records to be stored are known in advance. This is extremely efficient, but design and implementation can be difficult. A typical use of perfect hashing is on CD-ROMs, where the database will never change but access time can be expensive.
• An imperfect hash example:
  • A head-teacher wishes to keep a database of student information where the school has 200 students. (S)he decides to use a hashing function which uses the birthday of each student to map them into a table of 365 elements. This will almost certainly lead to collisions, because it takes only 23 students to give odds of better than evens that 2 of the students will share the same birthday.
• The first guideline when deciding if a hash function is suitable is whether it keeps the hash table at least half-full at any one time. In the example above, this property is met (200 records in 365 slots).
• A second guideline concerns the number of collisions that are acceptable, and this is usually something which designers must decide upon on a problem-to-problem basis.
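
The birthday claim above can be checked numerically. The sketch below assumes that every key is equally likely to land in any slot (that is, birthdays are uniform over 365 days, which is only an approximation); the class and method names are hypothetical.

```java
// Hypothetical sketch: probability of at least one collision when n keys
// are hashed uniformly at random into m slots.
final class Birthday {
    static double collisionProbability(int n, int m) {
        double allDistinct = 1.0;
        for (int i = 0; i < n; i++) {
            // The (i+1)th key must avoid the i slots already taken.
            allDistinct *= (double) (m - i) / m;
        }
        return 1.0 - allDistinct;
    }
}
```

Under this assumption, collisionProbability(23, 365) is just above one half, while 22 students fall just below it; for the 200 students in the example a collision is a near certainty.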

Key data distribution
• In general, we would like to pick a hash function which distributes the records with equal probability to all hash table slots.
• This will depend on how well the key data is distributed.
• For example: if the keys are generated as a set of random numbers selected uniformly in the key range, then any hash function that assigns the key range so that each slot receives an equal share of the range will also distribute the records evenly throughout the table.
• When input records are not well distributed throughout the key range, it can be difficult to devise a hash function that distributes them well through the table.
• This problem becomes even more complex if we do not know in advance the list of keys to be stored, or if this changes dynamically during execution!

Poorly distributed keys and distribution-dependent hashing
• There are many reasons why data values may be poorly distributed:
  • Many natural distributions are asymptotic curves.
  • Collected data is likely to be skewed in some way.
• For example, if the keys are a collection of English words then the initial letter will not be evenly distributed, so a hashing function which maps words to a table of 26 slots (1 for every letter of the alphabet that the initial character can take) would probably not be a very good idea.
• In the above example, we should use a distribution-dependent hashing function, which uses knowledge of the distribution of the keys in order to avoid collisions.
• Distribution-independent hashing functions are used when we have no knowledge of the distribution of the key values being used.

A simple example
• The following hash function is used to hash (non-negative) integers to a table of 64 slots:
    public static int h(int x) { return (x % 64); }
• The value returned by this hash function depends only on the least significant 6 bits of the key. These bits are very likely to be poorly distributed, and so the hashing function is likely to produce a table which is unevenly filled (increasing the number of collisions to be resolved).
• A better classic example: the mid-square method ---
  • square the numerical (integer) key and take the middle r bits for a table of size 2^r.
• Question: try programming the mid-square method in Java. Test it in comparison with the first hash function (h, above).
• Question: why do you think the mid-square method is deemed to be better than h (for the same sized hash tables)?
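
One possible starting point for the mid-square exercise is sketched below. It is not the only reading of "middle r bits": the sketch assumes keys are non-negative and fit in keyBits bits, so that the square has 2*keyBits bits, and it drops an equal number of bits from each end. The names MidSquare and keyBits are hypothetical.

```java
// Hypothetical sketch of the mid-square method for a table of size 2^r.
final class MidSquare {
    // key:     non-negative key that fits in keyBits bits (an assumption)
    // keyBits: assumed width of the key, e.g. 16
    // r:       the table has 2^r slots
    static int h(int key, int keyBits, int r) {
        long square = (long) key * key;          // up to 2*keyBits bits wide
        int shift = (2 * keyBits - r) / 2;       // low-order bits to discard
        long mask = (1L << r) - 1;               // keeps exactly r bits
        return (int) ((square >>> shift) & mask);
    }
}
```

For instance, with 16-bit keys and r = 6, the keys 64 and 128 (which both map to slot 0 under x % 64) are sent to slots 0 and 2 respectively, because the middle bits of the square depend on every bit of the key.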

Hashing for strings
• Consider the following hash function:
    public static int h(String x, int M) {
        int sum = 0;
        // Add up the character codes of every character in the key.
        for (int i = 0; i < x.length(); i++) {
            sum += (int) x.charAt(i);
        }
        // Reduce the sum to a slot number in the range 0 to M-1.
        return sum % M;
    }
• Here we sum the ASCII values of the letters in the string. Provided M (the size of the hashing table) is small, this should provide a good distribution because it gives equal weight to all characters.
• This is an example of a folding method: the hash function folds up the sequence of characters using the plus operator.
• Note: changing the order of characters does not change the slot calculated.
• The last step is common to all hashing functions --- apply the modulus operator to make sure the value generated is within the table range.
• Question: what would be a good size for M if the average length of key strings is 10 characters?
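
The two remarks above (character order does not matter, and M should be small) can be explored with a small sketch. It repeats the folding hash and adds a hypothetical helper, reachableSlots, which counts how many slots can ever be hit by keys of a given length; the helper assumes keys consist only of the upper-case letters A to Z (codes 65 to 90).

```java
// Hypothetical sketch: exploring the folding hash for strings.
final class FoldHash {
    // Sum of character codes, reduced modulo the table size m.
    static int h(String x, int m) {
        int sum = 0;
        for (int i = 0; i < x.length(); i++) {
            sum += x.charAt(i);
        }
        return sum % m;
    }

    // Number of distinct slots reachable by keys of the given length,
    // assuming every character is an upper-case letter (65..90).
    static int reachableSlots(int length, int m) {
        boolean[] hit = new boolean[m];
        int count = 0;
        for (int sum = 65 * length; sum <= 90 * length; sum++) {
            if (!hit[sum % m]) {
                hit[sum % m] = true;
                count++;
            }
        }
        return count;
    }
}
```

Under that assumption, keys of 10 letters can only produce 251 different sums, so a table with thousands of slots would leave most of them permanently unused; this may help when thinking about the question on the slide.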

Open hashing
• Two classes of collision resolution techniques:
  • Open hashing, also known as separate chaining
  • Closed hashing, also known as open addressing
• With open hashing, collisions lead to records being stored outside the hashing table.
• With closed hashing, collisions lead to records being stored in the table at a position different from that calculated by the hashing function.
• Open hashing implementation:
  • A simple design for open hashing defines each slot in the table to be the head of a linked list. All records hashed to that slot are then placed in this linked list.
  • Now we need only decide how records are ordered in each list --- the most popular techniques are by insertion time, by key value order, or by frequency of access.
• When to use open hashing:
  • It is most appropriate when the hash table is to be kept in main memory; storing such a table on disk would be very inefficient.
• Note: open hashing is based on the same idea as the bin sort (which we will see later).
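
A minimal sketch of open hashing (separate chaining) follows. It assumes integer keys, String records and a simple modulo hash; new records go to the head of the list, which orders each list by insertion time (newest first). All names are hypothetical.

```java
// Hypothetical sketch: open hashing with one linked list per slot.
final class ChainedTable {
    private static final class Node {
        final int key;
        String value;
        Node next;

        Node(int key, String value, Node next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Node[] heads; // heads[s] is the first node of slot s's list

    ChainedTable(int m) {
        heads = new Node[m];
    }

    private int h(int key) {
        return Math.floorMod(key, heads.length); // always in 0..M-1
    }

    // Inserts the record, or replaces the value if the key already exists.
    void insert(int key, String value) {
        int s = h(key);
        for (Node n = heads[s]; n != null; n = n.next) {
            if (n.key == key) {
                n.value = value;
                return;
            }
        }
        heads[s] = new Node(key, value, heads[s]); // newest record first
    }

    // Returns the record for the key, or null if it is not in the table.
    String search(int key) {
        for (Node n = heads[h(key)]; n != null; n = n.next) {
            if (n.key == key) {
                return n.value;
            }
        }
        return null;
    }
}
```

With a table of 10 slots, keys 3 and 13 collide at slot 3 but both remain retrievable, because the second record is simply chained outside the table.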

Closed Hashing
• Closed hashing stores all records directly in the hash table.
• Each record has a home position defined by h(k), where k is the record's key.
• If a record R is to be inserted and another record already occupies R's home, then R will be stored at some other slot.
• The collision resolution policy determines which slot.
• The same policy must also be followed when searching the database.
• A simple example: if a collision occurs, just move on to the next position in the table. If the end of the table is reached, then just loop around to the front.
• Question: this policy, although simple and correct, is not often used in real programs; can you see why?

Closed hashing continued ...
• Bucket hashing: group hash table slots into buckets.
• Hash table: an array of M slots divided into B buckets (M/B slots per bucket).
• The hash function maps each record to a bucket; the record is stored in the first free slot of that bucket.
• Overflow bucket --- when a bucket is full, the record is stored here.
• All buckets share the same overflow bucket.
• Goal: minimise use of the overflow bucket!
• Searching:
  • hash the key to determine the bucket
  • if the key is not found in the bucket and the bucket is not full, then the search is complete
  • if the key is not found and the bucket is full, then check the overflow bucket
• Note: searching the overflow bucket can be expensive if it contains many records.
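
The bucket scheme and its search rule can be sketched as follows. The sketch assumes integer keys, no deletions and no duplicate keys, and it uses a growable list as the shared overflow bucket; these choices and all names are assumptions made for illustration.

```java
// Hypothetical sketch: bucket hashing with a single shared overflow bucket.
final class BucketTable {
    private final Integer[] slots;  // M = buckets * bucketSize slots
    private final int buckets;
    private final int bucketSize;
    private final java.util.ArrayList<Integer> overflow = new java.util.ArrayList<>();

    BucketTable(int buckets, int bucketSize) {
        this.buckets = buckets;
        this.bucketSize = bucketSize;
        this.slots = new Integer[buckets * bucketSize];
    }

    private int bucketOf(int key) {
        return Math.floorMod(key, buckets);
    }

    // Stores the key in the first free slot of its bucket, else in overflow.
    void insert(int key) {
        int start = bucketOf(key) * bucketSize;
        for (int i = start; i < start + bucketSize; i++) {
            if (slots[i] == null) {
                slots[i] = key;
                return;
            }
        }
        overflow.add(key);
    }

    boolean contains(int key) {
        int start = bucketOf(key) * bucketSize;
        for (int i = start; i < start + bucketSize; i++) {
            if (slots[i] == null) {
                return false; // bucket is not full, so the search is complete
            }
            if (slots[i] == key) {
                return true;
            }
        }
        return overflow.contains(key); // bucket is full: check the overflow
    }
}
```

The early return on an empty slot is only valid because this sketch never deletes records; with deletions, a freed slot could hide records that were pushed to the overflow bucket.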

A simple variation
• Hash the key value to its home position as if no bucketing were used.
• If the home position is taken, then push the record down until the end of the bucket is reached.
• If the bottom of the bucket is reached, then wrap around to the top of the bucket.
• Example:
  • assume buckets of 8 records (slots 0 to 7)
  • if a key is hashed to 5 and 5 is full, then we try to find an empty slot in the following order:
  • 6, 7, 0, 1, 2, 3, 4
  • if all slots are taken, then the record is assigned to the overflow bucket.
• Advantage: collisions are reduced.
• Used for storing on disks, where bucket size = block size.
• Goal: maximise the likelihood that a record is stored in the same disk block as its home position.

Simple Linear Probing
• Classic form: closed hashing with no bucketing and a collision resolution policy which can potentially use any slot in the table.
• Collision resolution: generate a sequence of hash table slots that can potentially hold the record: the probe sequence.
• Assume p(K,i) is the probe function, returning an offset from the home position for the ith slot in the probe sequence.
• Linear probing: just keep moving down (allowing for circular movement when the bottom is reached) until an empty slot is found.
• The probe function for linear probing is very simple:
  • p(K,i) = i.
• Advantage: simple, and all slots will eventually be used.
• Disadvantage: leads to clustering and uneven distribution, which is inefficient!
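
Linear probing as described above can be sketched like this. The sketch assumes integer keys, a modulo hash for the home position and no deletions; the class and method names are hypothetical.

```java
// Hypothetical sketch: closed hashing with linear probing, p(K, i) = i.
final class LinearProbeTable {
    private final Integer[] slots; // null marks an empty slot

    LinearProbeTable(int m) {
        slots = new Integer[m];
    }

    private int home(int key) {
        return Math.floorMod(key, slots.length);
    }

    // The probe function: the offset of the ith slot in the probe sequence.
    static int p(int key, int i) {
        return i;
    }

    // Returns the slot where the key ends up, or -1 if the table is full.
    int insert(int key) {
        for (int i = 0; i < slots.length; i++) {
            int s = (home(key) + p(key, i)) % slots.length;
            if (slots[s] == null || slots[s] == key) {
                slots[s] = key;
                return s;
            }
        }
        return -1;
    }

    // Follows the same probe sequence; returns the slot, or -1 if absent.
    int search(int key) {
        for (int i = 0; i < slots.length; i++) {
            int s = (home(key) + p(key, i)) % slots.length;
            if (slots[s] == null) {
                return -1; // an empty slot ends the probe sequence
            }
            if (slots[s] == key) {
                return s;
            }
        }
        return -1;
    }
}
```

In a table of 10 slots, inserting 5, 15 and 25 fills slots 5, 6 and 7; a later key 6 then lands in slot 8 even though its home position is 6, which is the clustering effect named on the slide.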

Improved Collision Resolution Methods
• How to avoid clustering:
  • use linear probing but skip slots by a constant c other than 1.
  • Probe function: p(K,i) = ic, so the ith slot tried is (h(K) + ic) mod M.
  • Records with adjacent home positions will not follow the same probe sequence.
• Completeness: the probe sequence should cycle through all slots in the hash table before returning to the home position.
• Not all probe functions are complete:
  • if c = 2 and the table contains an even number of slots, then any key whose home position is in an even slot will have a probe sequence that cycles only through even slots. Similarly, an odd home position will cycle only through odd slots.
• Completeness requirement: c must be relatively prime to M.
  • For example, if M = 10 then we can choose c from among 1, 3, 7, 9.
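
The completeness requirement can be checked by simply counting the slots a probe sequence visits. The sketch below does this for the sequence (home + ic) mod M; the class and method names are hypothetical.

```java
// Hypothetical sketch: checking completeness of probing with step c.
final class StepProbe {
    // Greatest common divisor (Euclid's algorithm), for positive a and b.
    static int gcd(int a, int b) {
        return b == 0 ? a : gcd(b, a % b);
    }

    // Number of distinct slots visited by (home + i*c) mod m, i = 0..m-1.
    static int slotsVisited(int home, int c, int m) {
        boolean[] seen = new boolean[m];
        int count = 0;
        for (int i = 0; i < m; i++) {
            int s = (home + i * c) % m;
            if (!seen[s]) {
                seen[s] = true;
                count++;
            }
        }
        return count;
    }
}
```

With M = 10, a step of c = 2 visits only 5 slots, whereas c = 1, 3, 7 or 9 visits all 10; in general the sequence visits M / gcd(c, M) slots, which is why c must be relatively prime to M.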

Pseudo-random probing
• Ideal probe function: select the next position in the probe sequence randomly!
• But how would we search if we did this? We need to be able to duplicate the same probe sequence when searching for the key.
• Pseudo-random probing --- use a common sequence of 'random' numbers for adding and searching.
• Advantages: eliminates primary clustering.
• Disadvantages: complex and less efficient (and the choice of sequence is key).
• Note: there are many other techniques for eliminating primary clustering.
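
One way to obtain a common sequence of 'random' numbers is to shuffle the offsets 1 to M-1 once, using a fixed seed, and to share that permutation between insertion and search. The sketch below takes this approach; the seed parameter, the shuffle and all names are assumptions for illustration, not a prescribed method.

```java
// Hypothetical sketch: a shared pseudo-random probe sequence.
final class PseudoRandomProbe {
    private final int[] perm; // perm[0] = 0, then a shuffle of 1..m-1

    PseudoRandomProbe(int m, long seed) {
        perm = new int[m];
        for (int i = 0; i < m; i++) {
            perm[i] = i;
        }
        java.util.Random rng = new java.util.Random(seed);
        // Fisher-Yates shuffle of positions 1..m-1; position 0 stays fixed
        // so that the first slot tried is always the home position.
        for (int i = m - 1; i > 1; i--) {
            int j = 1 + rng.nextInt(i); // j is in 1..i
            int tmp = perm[i];
            perm[i] = perm[j];
            perm[j] = tmp;
        }
    }

    // The ith slot in the probe sequence of a key with the given home.
    int slot(int home, int i) {
        return (home + perm[i]) % perm.length;
    }
}
```

Because the offsets form a permutation of 0 to M-1, the probe sequence is complete, and any two tables built with the same M and seed reproduce the same sequence when searching.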

Analysis of Closed hashing
• Primary operations: insert, delete and search.
• The key property is how full the table is (on average) --- the load factor.
• Typically, the cost of hashing is close to 1 record access. This is super-efficient (a binary search takes on the order of log n record accesses!).
• Mathematical analysis shows that the best policy is to aim for the hash table being, on average, half-full.
• But this requires the implementor to have some idea of record usage.
• Question: what are the extra considerations for deletion?
• Answer: don't hinder later searches, and don't make slots unusable.
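
One common way to meet both deletion requirements is to mark a deleted slot with a special marker (often called a tombstone) instead of emptying it. This technique is not named on the slide; the sketch below shows it with linear probing, integer keys and hypothetical names.

```java
// Hypothetical sketch: deletion in closed hashing using tombstone markers.
final class TombstoneTable {
    private static final int EMPTY = 0, FULL = 1, DELETED = 2;
    private final int[] keys;
    private final int[] state;

    TombstoneTable(int m) {
        keys = new int[m];
        state = new int[m]; // all slots start as EMPTY
    }

    private int home(int key) {
        return Math.floorMod(key, keys.length);
    }

    // Returns the slot holding the key, or -1 if it is not in the table.
    int slotOf(int key) {
        for (int i = 0; i < keys.length; i++) {
            int s = (home(key) + i) % keys.length;
            if (state[s] == EMPTY) {
                return -1; // a never-used slot ends the search
            }
            if (state[s] == FULL && keys[s] == key) {
                return s;
            }
            // DELETED: keep probing, so later searches are not hindered
        }
        return -1;
    }

    // Returns false only when the table has no usable slot left.
    boolean insert(int key) {
        if (slotOf(key) >= 0) {
            return true; // already present
        }
        for (int i = 0; i < keys.length; i++) {
            int s = (home(key) + i) % keys.length;
            if (state[s] != FULL) { // EMPTY or DELETED slots are reusable
                keys[s] = key;
                state[s] = FULL;
                return true;
            }
        }
        return false;
    }

    void delete(int key) {
        int s = slotOf(key);
        if (s >= 0) {
            state[s] = DELETED;
        }
    }
}
```

After inserting 5, 15 and 25 into a table of 10 slots and deleting 15, a search for 25 still succeeds because it probes past the marker in slot 6, and a later insert of 35 reuses that slot.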

Other Hashing Concerns: File Processing and External Sorting
• Differences between primary memory and secondary storage as they affect algorithm and data structure designers:
  • speed of access,
  • quantity of data,
  • persistence of data
• Primary problem: access to disk and tape drives is much slower than access to primary memory.
• Persistent storage: disk and tape files do not lose data when they are switched off.
• Volatile storage: all information is lost when the power is switched off.
• Access ratio: primary storage is roughly 1 million times faster to access (in general).
• Goal: minimize the number of disk accesses ---
  • use a good file structure
  • read more information than is needed and store it in a cache!

Other Hashing Concerns: Disk Access Costs and Caching
• Primary cost: the seek time (accessing a sector on disk).
• But seeks are not always necessary (for example, sequential reading).
• Note, there are other delay factors:
  • rotational delay,
  • startup time,
  • second seeks
• Disk fragmentation problems: use disk defragmenters as often as possible.
• Sector buffering: keep at least 2 buffers for input and output so that reads and writes to the same disk sector do not interfere. (Standard on most modern computers.)
• More advanced techniques for parallelism: double buffering, for example.
• Caching techniques: how to keep the most used records in the local cache (fastest available memory).
• Caching layers: caching is such a good idea that there are often multiple layers.
