Hashing


# Hashing - PowerPoint PPT Presentation



## PowerPoint Slideshow about 'Hashing' - loc

Presentation Transcript
Hashing
• Basic Ideas
• A data structure that allows insertion, deletion and search in O(1) time on average.
• The location of the record is calculated from the value of its key.
• No order in the stored records.
• Relatively easy to program as compared to trees.
• Based on arrays, hence difficult to expand.
…Basic ideas
• Consider records with integer key values:
• 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
• Create a table of 10 cells: index of each cell in the range [0..9].
• Each record is stored in the cell whose index corresponds to its key value.

[Figure: the records with keys 2 and 8 are stored in cells 2 and 8 of the table.]

• When the key values span a huge range, that range must be compressed: use a hash function.
• It hashes a number in a large range into a number in a smaller range, corresponding to the index numbers in an array.
Definitions
• Hashing
• The process of accessing a record, stored in a table, by mapping the value of its key to a position in the table.
• Hash function
• A function that maps key values to table positions.
• Hash table
• The array where the records are stored.
• Hash value
• The value returned by the hash function. It usually corresponds to a position in the hash table.
Perfect hashing

[Figure: the hash function H(key) = key applied to a hash table; the record with key 8 is stored in position 8 and the record with key 2 in position 2.]

…Perfect hashing
• Each key value maps to a different position in the table.
• All the keys need to be known before the table is created.
• Problem: what if the keys are neither contiguous nor in the range of the indices of the table?
• Solution: find a hash function that allows perfect hashing! Is this always possible?
Example:
• A company has 100 employees. The Social Insurance Number (SIN) is used as the key for each record.
• Given a 9-digit SIN, should we create a table of 1,000,000,000 cells for only 100 employees?
• Knowing the SINs of all 100 employees in advance does not guarantee that a perfect hash function can be found.
• How many persons need to be together in a room in order to “most likely” have two of them with the same date of birth (month/day)?
• Hint: calculate p, the probability that at least two of n persons have the same date of birth:
• p = 1 - 364/365 · 363/365 · 362/365 · … · (365 - n + 1)/365
• If N = 365 and there are 23 records to hash,
• the probability of having at least one collision is… 0.5073!
• => Identical hash values arise easily even when the keys are distributed at random, so a good hash function is difficult to design.
• Hash functions that allow perfect hashing are so rare that it is worth looking for them only in special circumstances.
• In addition, the collection of records is often not known in advance.
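The birthday calculation above can be checked numerically. The following is a minimal sketch, assuming each record is equally likely to land in any of the N positions; the function name `collision_probability` is introduced here for illustration only.

```python
def collision_probability(n_records: int, n_slots: int = 365) -> float:
    """Probability that at least two of n_records land in the same slot,
    assuming each record picks one of n_slots uniformly at random."""
    p_all_distinct = 1.0
    for i in range(n_records):
        # The (i+1)-th record must avoid the i slots already taken.
        p_all_distinct *= (n_slots - i) / n_slots
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    for n in (10, 23, 50):
        print(n, round(collision_probability(n), 4))
```

With 23 records and 365 positions the result is slightly above one half, which is why collisions should be expected even in a sparsely filled table.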
Collisions
• What if we cannot find a perfect hash function?
• Collision: more than one key will map to the same location in the table!
• Can we avoid collisions? No, except in the case of perfect hashing (rare).
• Solution: select a “good” hash function and use a collision-resolution strategy.
• A good hash function:
• The hash function, h, must be computationally simple
• It must distribute keys evenly in the address space
Example of collision:
• The keys are integers and the hash function is:
• hashValue = key mod tableSize
• If tableSize = 10, all records whose keys have the same rightmost digit have the same hash value.

Insert 13 and 23: both keys hash to position 3, so inserting 23 causes a collision.

A poor hash function:
• Maps keys non-uniformly into table locations, or maps a set of contiguous keys into clusters.
• An ideal hash function:
• Maps keys uniformly and randomly onto the entire range of table locations.
• Each location is equally likely to be used for a randomly chosen key.
• Fast computation.
To build a hash function:
• We will generally assume that the keys are natural numbers, N = {0, 1, 2, …}.
• If they are not, then we can suitably interpret them as natural numbers.
• Mapping:
• For example, a string over the set of ASCII characters can be interpreted as an integer in base 128.
• Consider key = “data”
• hashValue = (‘a’ + ‘t’×128 + ‘a’×128² + ‘d’×128³) mod tableSize
This method generates huge numbers that the machine might not store correctly.
• Goal: reduce the number of arithmetic operations and generate relatively small numbers.
• Solution: compute the hash value in several steps, applying the modulo operation at each step.
• hashValue = ‘d’ mod tableSize
• hashValue = (hashValue×128 + ‘a’) mod tableSize
• hashValue = (hashValue×128 + ‘t’) mod tableSize
• hashValue = (hashValue×128 + ‘a’) mod tableSize
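The step-by-step computation above can be sketched as a short loop. This is an illustrative sketch, not the slides' own code: the names `hash_string` and `hash_string_direct` are assumptions, and the direct version only works here because Python integers never overflow.

```python
def hash_string(key: str, table_size: int) -> int:
    """Interpret key as a base-128 number and reduce it modulo table_size,
    taking the modulo after every step so intermediate values stay small."""
    hash_value = 0
    for ch in key:
        hash_value = (hash_value * 128 + ord(ch)) % table_size
    return hash_value


def hash_string_direct(key: str, table_size: int) -> int:
    """Same value computed the 'huge number' way, for comparison only."""
    total = 0
    for ch in key:
        total = total * 128 + ord(ch)
    return total % table_size


if __name__ == "__main__":
    print(hash_string("data", 101), hash_string_direct("data", 101))
```

Both functions return the same value because (a×128 + b) mod m depends only on a mod m, so reducing at every step loses nothing.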
Hash function: division
• H(key) = key mod tableSize
• 0 ≤ key mod tableSize ≤ tableSize - 1
• Empirical studies have shown that this function gives very good results.
• Assume H(key) = key mod tableSize
• All keys such that key mod tableSize = 0 map into position 0 in the table.
• All keys such that key mod tableSize = 1 map into position 1 in the table.
• This phenomenon is not a problem for position 0 and 1, but…
Assume tableSize = 25
• All keys that are multiples of 5 will map only into positions 0, 5, 10, 15 and 20 in the table!
• Why? Because key and tableSize have 5 as a common factor:
• There exists an integer m such that:
• key = m×5
• Therefore, key mod 25 = 5×(m mod 5) is a multiple of 5
• We wish to avoid this phenomenon when possible.
A solution:
• Choose tableSize as a prime number.
• Example: tableSize = 29 (a prime number)
• 5 mod 29 = 5,
• 10 mod 29 = 10,
• 15 mod 29 = 15,
• 20 mod 29 = 20,
• 25 mod 29 = 25,
• 30 mod 29 = 1,
• 35 mod 29 = 6,
• 40 mod 29 = 11…
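The effect of the table size can be observed directly. The sketch below, with the hypothetical helper `positions_used`, hashes the multiples of 5 below 500 into a table of size 25 and into a table of size 29.

```python
def positions_used(keys, table_size):
    """Return the sorted set of table positions hit by key mod table_size."""
    return sorted({key % table_size for key in keys})


multiples_of_5 = range(0, 500, 5)

# With tableSize = 25 the keys share the factor 5 with the table size.
print(positions_used(multiples_of_5, 25))

# With the prime tableSize = 29 the same keys spread over the whole table.
print(len(positions_used(multiples_of_5, 29)))
```

With size 25 only five positions are ever used, while with size 29 every position receives some key.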
Hash function: digit selection

Digit(s) selection:

key = d1 d2 d3 d4 d5 d6 d7 d8 d9

H(key) = di

If the collection of records is known, how should the digit(s) di be chosen?

By analysing the occurrence of each digit.

Digit selection: analysis

Assume 100 records are to be stored:

[Figure: occurrence counts of a digit over the 100 keys, contrasting a non-uniform distribution with a uniform distribution.]
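The digit analysis can be sketched as a frequency count per digit position. This is only one possible reading of the analysis: the helper names and the selection rule (prefer the position whose most frequent digit is least dominant) are assumptions, and positions are counted from 0 for d1.

```python
from collections import Counter


def digit_histograms(keys, width=9):
    """For each digit position, count how often each digit occurs."""
    padded = [str(k).zfill(width) for k in keys]
    return [Counter(p[pos] for p in padded) for pos in range(width)]


def most_uniform_position(keys, width=9):
    """Pick the position whose most frequent digit is least dominant."""
    hists = digit_histograms(keys, width)
    return min(range(width), key=lambda pos: max(hists[pos].values()))


if __name__ == "__main__":
    sample = [123456780 + i for i in range(10)]
    print(most_uniform_position(sample))
```

A position where every digit occurs about equally often spreads the records evenly, whereas a position dominated by one digit sends most records to the same cell.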

Hash functions: mid-square

Mid-square: consider key = d1 d2 d3 d4 d5

d1 d2 d3 d4 d5 × d1 d2 d3 d4 d5 = r1 r2 r3 r4 r5 r6 r7 r8 r9 r10

Select middle digits, for example r4 r5 r6

Why the middle digits and not leftmost or rightmost digits?

Mid-square: example
54321 × 54321 = 2950771041

Partial products (for the multiplier digits 1, 2, 3, 4, 5, from right to left): 54321, 108642, 162963, 217284, 271605

• Only 321 contribute in the 3 rightmost digits (041) of the multiplication result.
• Similar remark regarding the leftmost digits.
• All key digits contribute in the middle digits of the multiplication result.

Higher level of variety in the hash value => less chance of collision
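A minimal sketch of the mid-square method follows. The function name and the zero-padding of the square to ten digits (so that r4 r5 r6 always refer to the same places) are assumptions made for illustration.

```python
def mid_square(key: int, n_digits: int = 3, width: int = 10) -> int:
    """Square the key, pad the result to `width` digits (r1..r10) and
    return the n_digits taken from the middle of the padded square."""
    squared = str(key * key).zfill(width)
    start = (width - n_digits) // 2   # index 3 selects r4 when width is 10
    return int(squared[start:start + n_digits])


if __name__ == "__main__":
    print(mid_square(54321))
```

For key = 54321 the square is 2950771041, so the selected digits r4 r5 r6 are 0, 7, 7 and the hash value is 77.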

Hash functions: folding
• Folding: consider key = d1 d2 d3 d4 d5
• Combine portions of the key to form a smaller result.
• In general, folding is used in conjunction with other functions.
• Example: H(key) = d1 + d2 + d3 + d4 + d5 ≤ 45
• or, H(key) = d1 + d2×d3 + d4×d5 ≤ 171
• Example:
• Consider a computer with 16-bit registers, i.e. integers < 2¹⁶ = 65536
• Assume the 9-digit SIN is used as a key.
• The SIN requires folding before it is used:
• d1 + d2×d3×d4×d5 + d6×d7×d8×d9 ≤ 13131
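The SIN folding can be sketched as follows. The sketch reads the juxtaposed digits as products, which is an interpretation suggested by the stated bound (9 + 9×9×9×9 + 9×9×9×9 = 13131); the function name `fold_sin` is hypothetical.

```python
from math import prod


def fold_sin(sin: int) -> int:
    """Fold a 9-digit key as d1 + d2*d3*d4*d5 + d6*d7*d8*d9."""
    digits = [int(c) for c in str(sin).zfill(9)]
    if len(digits) != 9:
        raise ValueError("expected a key of at most 9 digits")
    return digits[0] + prod(digits[1:5]) + prod(digits[5:9])


if __name__ == "__main__":
    print(fold_sin(123456789))
```

The folded value always fits in a 16-bit register and would normally be reduced further, for example with mod tableSize.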
• Storing the record directly in the table.
• Deal with collisions using collision-resolution strategies.
• Chaining:
• Each cell of the hash table points towards a linked-list.
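Chaining

H(key) = key mod tableSize

Insert 13, insert 23, insert 18

Collision is resolved by inserting the elements in a linked list.

[Figure: 13 and 23 are stored in the linked list of the same cell; 18 is stored in the list of another cell.]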
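Chaining can be sketched with one list per cell. This is a minimal illustration, assuming tableSize = 10 and using Python lists in place of linked lists; the class name `ChainedHashTable` is not from the slides.

```python
class ChainedHashTable:
    """Hash table with chaining; Python lists stand in for linked lists."""

    def __init__(self, table_size: int = 10):
        self.table_size = table_size
        self.cells = [[] for _ in range(table_size)]

    def _hash(self, key: int) -> int:
        return key % self.table_size

    def insert(self, key: int) -> None:
        chain = self.cells[self._hash(key)]
        if key not in chain:          # keep keys unique within a chain
            chain.append(key)

    def search(self, key: int) -> bool:
        return key in self.cells[self._hash(key)]

    def delete(self, key: int) -> None:
        chain = self.cells[self._hash(key)]
        if key in chain:
            chain.remove(key)


if __name__ == "__main__":
    t = ChainedHashTable(10)
    for k in (13, 23, 18):
        t.insert(k)
    print(t.cells[3], t.cells[8])
```

Keys 13 and 23 end up in the chain of cell 3, and 18 in the chain of cell 8.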

Linear Probing

Search sequentially (and by wrapping around the table if necessary) until an empty position is found.

Example: H(key)=key mod tableSize

Insert 89, insert 18, insert 58, insert 9, insert 49

[Figure: resulting table with positions 0 to 9: 58 in position 0, 9 in position 1, 49 in position 2, 18 in position 8, 89 in position 9.]

hashValue = H(key)

Probe table positions: (hashValue + i) mod tableSize, with i = 1, 2, …, tableSize - 1, until an empty position is found in the table or all positions have been checked.

Example:

h(k) = k mod 10, n = 10

Insert 89

h(89) = 89 mod 10 = 9

Insert 18

h(18) = 18 mod 10 = 8

Insert 49

h(49) = 49 mod 10 = 9

We have a collision!

Search wraps around to location 0: (9 + 1) mod 10 = 0

Insert 58

h(58) = 58 mod 10 = 8

Collision again!

Search wraps around to location 1:

(8 + 1) mod 10 = 9 -> (8 + 2) mod 10 = 0 -> (8 + 3) mod 10 = 1

Insert 9

h(9) = 9 mod 10 = 9

Collision again!

Search wraps around to location 2:

(9 + 1) mod 10 = 0 -> (9 + 2) mod 10 = 1 -> (9 + 3) mod 10 = 2

[Figure: final table: 49 in position 0, 58 in position 1, 9 in position 2, 18 in position 8, 89 in position 9.]

Primary clustering!!
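The insertions traced above can be reproduced with a short sketch of linear probing. It assumes empty cells are marked with None; the function name `linear_probe_insert` is introduced for illustration.

```python
def linear_probe_insert(table: list, key: int) -> int:
    """Insert key with linear probing and return the position used.
    Empty cells hold None. Raises if every position has been checked."""
    n = len(table)
    home = key % n
    for i in range(n):
        pos = (home + i) % n          # wraps around the end of the table
        if table[pos] is None:
            table[pos] = key
            return pos
    raise RuntimeError("hash table is full")


table = [None] * 10
positions = [linear_probe_insert(table, k) for k in (89, 18, 49, 58, 9)]
print(positions)
print(table)
```

The keys 49, 58 and 9 land in the consecutive positions 0, 1 and 2, next to the occupied positions 8 and 9: one contiguous cluster.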

Linear probing is easy to implement…
• Linear probing causes many items to be stored in a few areas, creating clusters:
• This is known as primary clustering.
• Contiguous keys are mapped into contiguous table locations.
• Consequence: slow search even when the table’s load factor α is small:
• α = (number of occupied locations)/tableSize
Quadratic probing
• A collision-resolution strategy that eliminates primary clustering.
• Quadratic probing creates spaces between the inserted elements hashing to the same position: eliminates primary clustering.
• In this case, the probe sequence is
• h(k, i) = (h(k) + c1·i + c2·i²) mod n
• for i = 0, 1, …, n-1,
• where c1 and c2 are auxiliary constants
• Works much better than linear probing.
Example: Let c1 = 0 and c2 = 1 (h(k) = k mod 10, n = 10)

Insert 89: position 9

Insert 18: position 8

Insert 49

Collision!

(9 + 1) mod 10 = 0

Insert 58

Collision!

(8 + 1) mod 10 = 9 Collision!

(8 + 4) mod 10 = 2

Insert 9

Collision!

(9 + 1) mod 10 = 0 Collision again!

(9 + 4) mod 10 = 3

OK!

[Figure: final table: 49 in position 0, 58 in position 2, 9 in position 3, 18 in position 8, 89 in position 9.]

Use the hash function “mod tablesize” and quadratic probing with function “2i + i²” to insert the following numbers (in this order) 15, 23, 34, 26, 12, 37 in a hash table with tablesize = 11. Give all the steps.

15 -> position 4

23 -> position 1

34 -> position 1: collision

-> 1 + 3 -> position 4 : collision

-> 1 + 8 -> position 9

26 -> position 4 : collision

-> 4 + 3 -> position 7

12 -> position 1 : collision

-> 1 + 3 -> position 4 : collision

-> 1 + 8 -> position 9 : collision

-> 1 + 15 -> position 5

37 -> position 4 : collision

-> 4 + 3 -> position 7 : collision

-> 4 + 8 -> position 1 : collision

-> 4 + 15 -> position 8
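The steps of this exercise can be checked with a small sketch of quadratic probing. The function name and the choice to return the full list of probed positions are assumptions for illustration; the probe function 2i + i² corresponds to c1 = 2 and c2 = 1.

```python
def quadratic_probe_insert(table, key, c1=0, c2=1):
    """Insert key using probes (h + c1*i + c2*i*i) mod n, i = 0, 1, ..., n-1.
    Returns the list of positions probed; the last one is where key went."""
    n = len(table)
    home = key % n
    probed = []
    for i in range(n):
        pos = (home + c1 * i + c2 * i * i) % n
        probed.append(pos)
        if table[pos] is None:
            table[pos] = key
            return probed
    raise RuntimeError("no empty position found in n probes")


exercise = [None] * 11
for k in (15, 23, 34, 26, 12, 37):
    print(k, quadratic_probe_insert(exercise, k, c1=2, c2=1))
```

Each printed list is the sequence of positions visited for that key, so 12 for instance visits 1, 4, 9 and finally 5.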

Other operations
• Searching:
• The algorithm for searching for key k probes the same sequence of slots that the insertion algorithm examined when key k was inserted.
• The search can terminate (unsuccessfully) when it finds an empty slot…
• Why?
• If k was inserted, it would occupy a position … assuming that keys are not deleted from the hash table
• Deletion:
• When deleting a key from slot i, we should not physically remove that key.
• Doing so may make it impossible to retrieve a key k during whose insertion we probed slot i and found it occupied.
• A solution:
• Mark the slot by a special value (not deleting it).
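Searching and deletion with a special marker can be sketched as follows. This is an illustrative sketch using linear probing and assuming distinct keys; the names `EMPTY`, `DELETED`, `probe_sequence`, `insert`, `search` and `delete` are not from the slides.

```python
EMPTY = None
DELETED = object()   # special marker left behind by a deletion


def probe_sequence(key, n):
    """Linear probe sequence; any other open-addressing sequence would do."""
    home = key % n
    return ((home + i) % n for i in range(n))


def insert(table, key):
    """Insert key (assumed not already present) and return its position."""
    for pos in probe_sequence(key, len(table)):
        if table[pos] is EMPTY or table[pos] is DELETED:
            table[pos] = key
            return pos
    raise RuntimeError("hash table is full")


def search(table, key):
    """Return the position of key, or None if it is absent."""
    for pos in probe_sequence(key, len(table)):
        if table[pos] is EMPTY:       # key would have been placed here
            return None
        if table[pos] is not DELETED and table[pos] == key:
            return pos
    return None


def delete(table, key):
    pos = search(table, key)
    if pos is not None:
        table[pos] = DELETED          # mark the slot instead of emptying it
```

Because a deleted slot is marked rather than emptied, a search for a key that was inserted past that slot keeps probing instead of stopping early, and a later insertion may reuse the marked slot.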
Analysis of Linear Probing

Let α = m/n, where m of n slots in the hash table are occupied

α is called the load factor and is < 1 as long as the table is not full

Theorem 1:

Assumption: Independence of probes

Given an open-address hash table with load factor α < 1, the average number of probes in an insertion is

1/(1 - α)

Find Operation

Theorem 2:

Assuming that each key in the table is equally likely to be searched for (α < 1)

The expected number of probes in a successful search is

(1/2)(1 + 1/(1 - α))

The expected number of probes in an unsuccessful search is

(1/2)(1 + 1/(1 - α)²)

• Crucial questions:
• Will we be always able to insert element x if table is not full?
• Ease of computation?
• What happens when the load factor gets too high?
• (this applies to linear probing as well)
• The following theorem addresses the first issue
• Theorem 3:
• If quadratic probing is used and the table size is prime,
• then a new element can be inserted if the table is at least half empty.
• Also, no cell is probed twice in the course of insertion.
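• Proof (by contradiction): we assume that there exist
• i ≠ j, with 0 ≤ i, j ≤ ⌊tableSize/2⌋, such that
• (hashValue + i²) mod tableSize = (hashValue + j²) mod tableSize
• Therefore, (i² - j²) mod tableSize = 0
• Leading to (i - j)(i + j) mod tableSize = 0
• However, as tableSize is prime and (i + j) < tableSize, tableSize divides neither (i - j) nor (i + j) unless one of them is zero
• Because i ≠ j and i and j are non-negative integers, neither (i - j) nor (i + j) can be equal to zero, then
• (i - j)(i + j) mod tableSize ≠ 0, a contradiction
• Then theorem 3 is true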
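Theorem 3 can also be checked by brute force for a given table size. The sketch below, with the hypothetical helper `first_half_probes_distinct`, verifies that the probes (hashValue + i²) mod tableSize for i = 0, …, ⌊tableSize/2⌋ all fall on different cells.

```python
def first_half_probes_distinct(table_size: int) -> bool:
    """Check that (home + i*i) mod table_size gives distinct cells for
    i = 0 .. table_size // 2, whatever the home position."""
    half = table_size // 2
    for home in range(table_size):
        cells = {(home + i * i) % table_size for i in range(half + 1)}
        if len(cells) != half + 1:
            return False
    return True


if __name__ == "__main__":
    for size in (7, 11, 13, 16, 29):
        print(size, first_half_probes_distinct(size))
```

The check succeeds for the prime sizes and fails for 16, where i = 0 and i = 4 probe the same cell. A successful check on a non-prime size would not contradict the theorem, which only gives a guarantee for primes.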
The expected number of probes in an unsuccessful search is

1/(1 - α)

The expected number of probes in a successful search is

-(1/α) ln(1 - α)
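The two expressions above, 1/(1 - α) and -(1/α) ln(1 - α), can be evaluated for a few load factors. The sketch below also evaluates the classical linear-probing estimates ½(1 + 1/(1 - α)²) and ½(1 + 1/(1 - α)); using these estimates for linear probing is an assumption here, and all function names are illustrative.

```python
from math import log


def linear_successful(a: float) -> float:
    """Classical estimate for a successful search with linear probing."""
    return 0.5 * (1 + 1 / (1 - a))


def linear_unsuccessful(a: float) -> float:
    """Classical estimate for an unsuccessful search with linear probing."""
    return 0.5 * (1 + 1 / (1 - a) ** 2)


def independent_unsuccessful(a: float) -> float:
    """Unsuccessful search when probes are assumed independent."""
    return 1 / (1 - a)


def independent_successful(a: float) -> float:
    """Successful search when probes are assumed independent."""
    return -log(1 - a) / a


for a in (0.1, 0.5, 0.9):
    print(a, linear_unsuccessful(a), linear_successful(a),
          independent_unsuccessful(a), independent_successful(a))
```

At α = 0.9 the linear-probing estimate for an unsuccessful search is about five times the independent-probe value, which illustrates the cost of primary clustering.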

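Comparison with the linear probing

(U = unsuccessful search, S = successful search)

| Method | α | U | S |
|---|---|---|---|
| Linear probing | 0.1 | 1.11 | 1.05 |
| Linear probing | 0.5 | 2.50 | 1.5 |
| Linear probing | 0.9 | 50.5 | 5.5 |
| Quadratic probing | 0.1 | 1.11 | 1.05 |
| Quadratic probing | 0.5 | 2.00 | 1.38 |
| Quadratic probing | 0.9 | 10.00 | 2.55 |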

Secondary clustering

Elements that hash to the same position will also probe the same positions in the hash table.

Note:

Quadratic probing eliminates primary clustering but does not eliminate secondary clustering.

Nevertheless, quadratic probing is efficient: it distributes the data well, which gives a low probability of collision, and it is fast to compute.

What do we do when the load factor gets too high?
• Rehash!
• Double the size of the hash table
• Rehash:
• Scan the entries in the current table, and insert them in a new hash table
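Rehashing can be sketched as follows. The sketch assumes linear probing and empty cells marked with None, and it simply doubles the size as stated above; since a doubled size is not prime, an implementation that wants a prime tableSize would pick a nearby prime instead.

```python
def rehash(old_table: list) -> list:
    """Build a table twice as large and re-insert every key with
    linear probing. Empty cells hold None."""
    new_table = [None] * (2 * len(old_table))
    n = len(new_table)
    for key in old_table:
        if key is None:
            continue
        pos = key % n                 # positions change with the new size
        while new_table[pos] is not None:
            pos = (pos + 1) % n
        new_table[pos] = key
    return new_table


if __name__ == "__main__":
    old = [49, 58, 9, None, None, None, None, None, 18, 89]
    print(rehash(old))
```

Every key has to be hashed again because its position depends on tableSize; copying the cells to the same indices would break later searches.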
Double hashing
• Double hashing eliminates secondary clustering:
• It uses 2 hash functions
• hashValue = (H1(key) + i×H2(key)) mod tableSize
• for i=0,1,2...
• The idea is that even if two items hash to the same value of H1, they will usually have different values of H2, so that different probe sequences will be followed.
• H2(key) should never be zero or we will get stuck in the same location in the table.
• tableSize should be prime
Given the restriction on the range of H2, the simplest choice for H2 is:
• 1 + (key mod (tableSize - 1))
• Then H2 can never be 0
• We have to calculate the hash value for key only once
• There is no restriction on the load factor.
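Double hashing can be sketched as follows, assuming H1(key) = key mod tableSize, H2(key) = 1 + (key mod (tableSize - 1)) and empty cells marked with None. The function name `double_hash_insert` is introduced for illustration.

```python
def double_hash_insert(table: list, key: int) -> int:
    """Insert key with double hashing and return the position used.
    H1(key) = key mod n, H2(key) = 1 + (key mod (n - 1)); n should be prime."""
    n = len(table)
    h1 = key % n
    h2 = 1 + (key % (n - 1))          # always between 1 and n - 1, never 0
    for i in range(n):
        pos = (h1 + i * h2) % n
        if table[pos] is None:
            table[pos] = key
            return pos
    raise RuntimeError("hash table is full")


if __name__ == "__main__":
    table = [None] * 11
    print([double_hash_insert(table, k) for k in (15, 26, 37, 48)])
```

The four keys all have H1 = 4, yet they follow different probe sequences because their H2 values (6, 7, 8, 9) differ, which is how double hashing avoids secondary clustering.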