
Hashing

- Basic ideas
- A data structure that allows insertion, deletion and search in O(1) time on average.
- The location of the record is calculated from the value of its key.
- No order in the stored records.
- Relatively easy to program as compared to trees.
- Based on arrays, hence difficult to expand.

…Basic ideas

- Consider records with integer key values:
- 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
- Create a table of 10 cells: index of each cell in the range [0..9].
- Each record is stored in the cell whose index corresponds to its key value.

[Figure: a table of 10 cells; the record with key 2 is stored in cell 2 and the record with key 8 in cell 8.]

- Problem: real keys usually span a huge range of numbers, and that range must be compressed. Solution: use a hash function.
- It hashes a number in a large range into a number in a smaller range, corresponding to the index numbers in an array.

- Definitions
- Hashing
- The process of accessing a record, stored in a table, by mapping the value of its key to a position in the table.

- Hash function
- A function that maps key values to table positions.

- Hash table
- The array where the records are stored.

- Hash value
- The value returned by the hash function. It usually corresponds to a position in the hash table.

Perfect hashing

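Hash function: H(key) = key

[Figure: each key is passed through the hash function to obtain its position in the hash table; for example, the record with key 8 is stored at position 8.]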

- …Perfect hashing
- Each key value maps to a different position in the table.
- All the keys need to be known before the table is created.
- Problem: what if the keys are neither contiguous nor in the range of the indices of the table?
- Solution: find a hash function that allows perfect hashing! Is this always possible?

- Example:
- A company has 100 employees. The Social Insurance Number (SIN) is used as the key for each record.
- Given a 9-digit SIN, should we create a table of 1,000,000,000 cells for only 100 employees?
- Knowing the SINs of all 100 employees in advance does not guarantee that we will find a perfect hash function.

- The birthday paradox:
- How many people need to be together in a room for it to be more likely than not that two of them share the same date of birth (month/day)?

- Answer: only 23 people.
- Hint: calculate p, the probability that at least two of the n people have the same date of birth:
- $p = 1 - \frac{364}{365} \cdot \frac{363}{365} \cdot \frac{362}{365} \cdots \frac{365 - n + 1}{365}$
- If N = 365 and there are n = 23 records to hash,
- the probability of having at least one collision is… 0.5073!
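The birthday computation can be checked with a few lines of Python. This is a minimal sketch: the function name `collision_probability` and the `days` parameter are illustrative, and birthdays are assumed to be uniformly distributed over 365 days.

```python
def collision_probability(n: int, days: int = 365) -> float:
    """Probability that at least two of n people share a birthday."""
    p_all_distinct = 1.0
    for i in range(n):
        # The (i+1)-th person must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    for n in (10, 22, 23, 50):
        print(n, round(collision_probability(n), 4))
```

With n = 23 the probability is just above one half (about 0.507), while n = 22 stays below one half.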

- => Identical values arise easily even under a random (uniform) distribution, so it is difficult to design a good hash function.
- Hash functions that allow perfect hashing are so rare that it is worth looking for them only in special circumstances.
- In addition, the collection of records is often not known in advance.

- Collisions
- What if we cannot find a perfect hash function?
- Collision: more than one key will map to the same location in the table!

- Can we avoid collisions? No, except in the case of perfect hashing (rare).
- Solution: select a “good” hash function and use a collision-resolution strategy.

- A good hash function:
- The hash function, h, must be computationally simple
- It must distribute keys evenly in the address space

- Example of collision:
- The keys are integers and the hash function is:
- hashValue = key mod tableSize
- If tableSize = 10, all records whose keys have the same rightmost digit have the same hash value.
- Insert 13 and 23: both keys hash to position 3, so they collide.

- A poor hash function:
- Maps keys non-uniformly into table locations, or maps a set of contiguous keys into clusters.

- An ideal hash function:
- Maps keys uniformly and randomly onto the entire range of table locations.
- Each location is equally likely to be used for a randomly chosen key.
- Fast computation.

- To build a hash function:
- We will generally assume that the keys are natural numbers, N = {0, 1, 2, …}.
- If they are not, then we can suitably interpret them as natural numbers.
- Mapping:
- For example, a string of ASCII characters can be interpreted as an integer in base 128.
- Consider key = “data”
- hashValue = (‘a’ + ‘t’ × 128 + ‘a’ × $128^2$ + ‘d’ × $128^3$) mod tableSize

- This method generates huge numbers that the machine might not store correctly.
- Goal: reduce the number of arithmetic operations and generate relatively small numbers.
- Solution: compute the hash value in several steps, applying the modulo operation at each step.
- hashValue = ‘d’ mod tableSize
- hashValue = (hashValue × 128 + ‘a’) mod tableSize
- hashValue = (hashValue × 128 + ‘t’) mod tableSize
- hashValue = (hashValue × 128 + ‘a’) mod tableSize
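The step-by-step computation can be sketched as a short loop. The function name `string_hash` and the `base` parameter are illustrative choices; the sketch assumes every character code is below 128 (plain ASCII).

```python
def string_hash(key: str, table_size: int, base: int = 128) -> int:
    """Hash a string as a base-128 number, reducing modulo table_size at each step."""
    hash_value = 0
    for ch in key:
        # Same update as in the steps above: shift by one base-128 digit, add the
        # next character code, and reduce so intermediate values stay small.
        hash_value = (hash_value * base + ord(ch)) % table_size
    return hash_value


if __name__ == "__main__":
    print(string_hash("data", 101))
```

Because the modulo is applied after every step, no intermediate value exceeds `table_size * base`, yet the result equals that of the single large expression.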

- Hash function : division
- H(key) = key mod tableSize
- 0 ≤ key mod tableSize ≤ tableSize − 1
- Empirical studies have shown that this function gives very good results.
- Assume H(key) = key mod tableSize
- All keys such that key mod tableSize = 0 map into position 0 in the table.
- All keys such that key mod tableSize = 1 map into position 1 in the table.
- This phenomenon is not a problem for position 0 and 1, but…

- Assume tableSize = 25
- All keys that are multiples of 5 will map into positions 0, 5, 10, 15 and 20 in the table!
- Why? because key and tableSize have 5 as a common factor:
- There exists an integer m such that:

- key = m×5
- Therefore, key mod 25 = 5 × (m mod 5), which is a multiple of 5

- We wish to avoid this phenomenon when possible.

- A solution:
- Choose tableSize as a prime number.
- Example: tableSize = 29 (a prime number)
- 5 mod 29 = 5,
- 10 mod 29 = 10,
- 15 mod 29 = 15,
- 20 mod 29 = 20,
- 25 mod 29 = 25,
- 30 mod 29 = 1,
- 35 mod 29 = 6,
- 40 mod 29 = 11…
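The effect of the table size can be observed directly. In this small sketch the helper name `used_positions` is illustrative; it lists which table positions a set of keys can reach under the division method.

```python
def used_positions(keys, table_size):
    """Return the sorted table positions reached by key mod table_size."""
    return sorted({key % table_size for key in keys})


multiples_of_5 = range(0, 500, 5)

# tableSize = 25 shares the factor 5 with every key: only 5 cells are ever used.
print(used_positions(multiples_of_5, 25))

# tableSize = 29 is prime: the same keys spread over every cell of the table.
print(len(used_positions(multiples_of_5, 29)))
```

With tableSize = 25 only positions 0, 5, 10, 15 and 20 are reached, whereas with tableSize = 29 all 29 positions are used.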

Hash function: digit selection

Digit(s) selection:

key = d1 d2 d3 d4 d5 d6 d7 d8 d9

H(key) = di

If the collection of records is known, how should the digit(s) di be chosen?

By analysing the occurrence of each digit: a digit position whose values are uniformly distributed is the better choice.

Digit selection: analysis

Assume 100 records are to be stored:

[Figure: digit occurrence counts for a non-uniform distribution and for a uniform distribution.]

Hash functions: mid-square

Mid-square: consider key = d1 d2 d3 d4 d5

key × key = (d1 d2 d3 d4 d5) × (d1 d2 d3 d4 d5) = r1 r2 r3 r4 r5 r6 r7 r8 r9 r10

Select middle digits, for example r4 r5 r6

Why the middle digits and not leftmost or rightmost digits?

Mid-square: example

54321 × 54321 = 2950771041

Partial products, each shifted one more place to the left: 54321, 108642, 162963, 217284, 271605

- Only the digits 321 contribute to the 3 rightmost digits (041) of the multiplication result.

- A similar remark applies to the leftmost digits.

- All key digits contribute to the middle digits of the multiplication result.

Higher level of variety in the hash value => lower chance of collision
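A minimal sketch of the mid-square method for 5-digit keys follows. The function name and its parameters (`key_digits`, `start`, `width`) are illustrative; the square is zero-padded to 10 digits so that r4 r5 r6 always refers to the same positions.

```python
def mid_square(key: int, key_digits: int = 5, start: int = 3, width: int = 3) -> int:
    """Square the key and return the selected middle digits (r4 r5 r6 by default)."""
    # Zero-pad so the square always has 2 * key_digits digits r1 ... r10.
    squared = str(key * key).zfill(2 * key_digits)
    # start is 0-based, so start=3 selects r4, r5 and r6.
    return int(squared[start:start + width])


if __name__ == "__main__":
    print(mid_square(54321))  # 54321 * 54321 = 2950771041, middle digits "077"
```

For the key 54321 the selected digits are 0, 7 and 7, giving the hash value 77.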

- Hash functions: folding
- Folding: consider key = d1 d2 d3 d4 d5
- Combine portions of the key to form a smaller result.
- In general, folding is used in conjunction with other functions.
- Example: H(key) = d1 + d2 + d3 + d4 + d5 ≤ 45

- or, H(key) = d1 + d2×d3 + d4×d5 ≤ 171 (the digits of each group are multiplied together)
- Example:
- Consider a computer with 16-bit registers, i.e. integers < $2^{16}$ = 65536
- Assume the 9-digit SIN is used as a key.
- The SIN requires folding before it is used:
- d1 + d2×d3×d4×d5 + d6×d7×d8×d9 ≤ 13131
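The SIN folding can be sketched as follows. This sketch assumes that juxtaposed digits in the formulas above denote products of digits, which is the reading under which the stated bounds (171 and 13131) hold; the function name `fold_sin` is illustrative.

```python
from math import prod


def fold_sin(sin: int) -> int:
    """Fold a 9-digit key: d1 + d2*d3*d4*d5 + d6*d7*d8*d9 (digit products assumed)."""
    digits = [int(c) for c in f"{sin:09d}"]  # d1 ... d9, keeping leading zeros
    return digits[0] + prod(digits[1:5]) + prod(digits[5:9])


if __name__ == "__main__":
    print(fold_sin(999_999_999))  # largest possible value
    print(fold_sin(123_456_789))
```

The largest result, reached for 999999999, is 9 + 6561 + 6561 = 13131, which fits in a 16-bit register.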

Open-addressing vs. chaining

- Open-addressing:
- Storing the record directly in the table.
- Deal with collisions using collision-resolution strategies.

- Chaining:
- Each cell of the hash table points to a linked list.

Chaining

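H(key) = key mod tableSize

Insert 13

Insert 23

Insert 18

Collision is resolved by inserting the elements in a linked list.

[Figure: 13 and 23 collide and are stored in the same linked list; 18 is stored in the list of a different cell.]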
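A minimal sketch of chaining is given below. It assumes tableSize = 10 and uses Python lists as stand-ins for linked lists; the class and method names are illustrative.

```python
class ChainedHashTable:
    """Hash table with chaining; each cell holds a list of colliding keys."""

    def __init__(self, table_size: int = 10):
        self.table_size = table_size
        self.cells = [[] for _ in range(table_size)]

    def _hash(self, key: int) -> int:
        return key % self.table_size

    def insert(self, key: int) -> None:
        chain = self.cells[self._hash(key)]
        if key not in chain:          # ignore duplicates
            chain.append(key)

    def contains(self, key: int) -> bool:
        return key in self.cells[self._hash(key)]

    def delete(self, key: int) -> bool:
        chain = self.cells[self._hash(key)]
        if key in chain:
            chain.remove(key)
            return True
        return False


table = ChainedHashTable(10)
for k in (13, 23, 18):
    table.insert(k)
print(table.cells[3], table.cells[8])
```

Keys 13 and 23 end up in the chain of cell 3, and 18 in the chain of cell 8; deletion simply removes the key from its chain.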

Collision-resolution strategies in open addressing

Linear Probing

If H(key) is already occupied:

Search sequentially (and by wrapping around the table if necessary) until an empty position is found.

Example: H(key)=key mod tableSize

Insert 89, insert 18, insert 58, insert 9, insert 49

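[Figure: resulting table with positions 0 to 9: 58 at position 0, 9 at position 1, 49 at position 2, 18 at position 8 and 89 at position 9.]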

hashValue = H(key)

Probe table positions:

(hashValue + i) mod tableSize

with i = 1, 2, …, tableSize − 1,

until an empty position is found in the table, or all positions have been checked.

Example:

h(k) = k mod 10, n = 10

Insert 89

h(89) = 89 mod 10 = 9

Insert 18

h(18) = 18 mod 10 = 8

Insert 49

h(49) = 49 mod 10 = 9

We have a collision!

Search wraps around to location 0: (9 + 1) mod 10 = 0

Insert 58

h(58) = 58 mod 10 = 8

Collision again!

Search wraps around to location 1:

(8 + 1) mod 10 = 9 -> (8 + 2) mod 10 = 0 -> (8 + 3) mod 10 = 1

Insert 9

h(9) = 9 mod 10 = 9

Collision again!

Search wraps around to location 2:

(9 + 1) mod 10 = 0 -> (9 + 2) mod 10 = 1 -> (9 + 3) mod 10 = 2

Resulting table: 49 at position 0, 58 at position 1, 9 at position 2, 18 at position 8 and 89 at position 9; positions 3 to 7 are empty.

Primary clustering!!

- Linear probing is easy to implement…
- Linear probing tends to store many items in a few areas of the table, creating clusters:
- This is known as primary clustering.
- Contiguous keys are mapped into contiguous table locations.
- Consequence: slow search even when the table’s load factor $\alpha$ is small:
- $\alpha$ = (number of occupied locations)/tableSize
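Insertion with linear probing can be sketched in a few lines. The function name `linear_probe_insert` is illustrative, and the table is assumed to be a Python list in which `None` marks an empty position.

```python
def linear_probe_insert(table, key):
    """Insert key using h(k) = k mod tableSize and linear probing; return its position."""
    n = len(table)
    home = key % n
    for i in range(n):
        pos = (home + i) % n          # wraps around the end of the table
        if table[pos] is None:
            table[pos] = key
            return pos
    raise OverflowError("hash table is full")


table = [None] * 10
for key in (89, 18, 49, 58, 9):
    linear_probe_insert(table, key)
print(table)
```

After these five insertions positions 8, 9, 0, 1 and 2 are occupied: they form a single cluster, so any new key hashing to one of them has to walk to the end of the cluster.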

- Quadratic probing:
- Collision-resolution strategy that eliminates primary clustering.
- Quadratic probing leaves gaps between the inserted elements that hash to the same position.

- In this case, the probe sequence is
- $h(k, i) = (h(k) + c_1 i + c_2 i^2) \bmod n$ for i = 0, 1, …, n-1,
- where c1 and c2 are auxiliary constants
- Works much better than linear probing.

Example: Let c1 = 0 and c2 = 1, with h(k) = k mod 10

Insert 89

h(89) = 9

Insert 18

h(18) = 8

Insert 49

h(49) = 9: Collision!

(9 + 1) mod 10 = 0

Insert 58

h(58) = 8: Collision!

(8 + 1) mod 10 = 9: Collision!

(8 + 4) mod 10 = 2

Insert 9

h(9) = 9: Collision!

(9 + 1) mod 10 = 0: Collision again!

(9 + 4) mod 10 = 3

OK!

Resulting table: 49 at position 0, 58 at position 2, 9 at position 3, 18 at position 8 and 89 at position 9.

Use the hash function “mod tableSize” and quadratic probing with function “$2i + i^2$” to insert the following numbers (in this order): 15, 23, 34, 26, 12, 37 in a hash table with tableSize = 11. Give all the steps.

15 -> position 4

23 -> position 1

34 -> position 1: collision

-> 1 + 3 -> position 4 : collision

-> 1 + 8 -> position 9

26 -> position 4 : collision

-> 4 + 3 -> position 7

12 -> position 1 : collision

-> 1 + 3 -> position 4 : collision

-> 1 + 8 -> position 9 : collision

-> 1 + 15 -> position 5

37 -> position 4 : collision

-> 4 + 3 -> position 7 : collision

-> 4 + 8 -> position 1 : collision

-> 4 + 15 -> position 8
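The steps of this exercise can be checked with a short sketch. The function name and the `c1`, `c2` parameters are illustrative; the i-th probe is assumed to be (home + c1·i + c2·i²) mod tableSize, which gives the offsets 3, 8, 15, … for c1 = 2 and c2 = 1.

```python
def quadratic_probe_insert(table, key, c1=0, c2=1):
    """Insert key with quadratic probing; return the position where it was stored."""
    n = len(table)
    home = key % n
    for i in range(n):
        pos = (home + c1 * i + c2 * i * i) % n
        if table[pos] is None:
            table[pos] = key
            return pos
    # Unlike linear probing, the sequence may miss free cells.
    raise RuntimeError("no free position found along the probe sequence")


table = [None] * 11
positions = [quadratic_probe_insert(table, k, c1=2, c2=1)
             for k in (15, 23, 34, 26, 12, 37)]
print(positions)
```

The keys land at positions 4, 1, 9, 7, 5 and 8, in agreement with the steps listed above.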

- Other operations
- Searching:
- The algorithm for searching for key k probes the same sequence of slots that the insertion algorithm examined when key k was inserted.
- The search can terminate (unsuccessfully) when it finds an empty slot…
- Why?
- If k was inserted, it would occupy a position … assuming that keys are not deleted from the hash table
- Deletion:
- When deleting a key from slot i, we should not physically remove that key.
- Doing so may make it impossible to retrieve a key k whose insertion probed slot i and found it occupied.
- A solution:
- Mark the slot with a special “deleted” value instead of emptying it.
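Search and deletion with a special marker can be sketched as follows, here on top of linear probing. The class name, the `DELETED` sentinel and the choice to reuse marked slots on insertion are illustrative assumptions, not prescribed by the text.

```python
EMPTY = None
DELETED = object()   # special value marking a slot whose key was deleted


class LinearProbingTable:
    def __init__(self, size=10):
        self.slots = [EMPTY] * size

    def _probe(self, key):
        n = len(self.slots)
        home = key % n
        for i in range(n):
            yield (home + i) % n

    def insert(self, key):
        first_free = None
        for pos in self._probe(key):
            slot = self.slots[pos]
            if slot is EMPTY:
                if first_free is None:
                    first_free = pos
                break                      # key cannot be further along
            if slot is DELETED:
                if first_free is None:
                    first_free = pos       # remember the first reusable slot
            elif slot == key:
                return pos                 # already present
        if first_free is None:
            raise OverflowError("hash table is full")
        self.slots[first_free] = key
        return first_free

    def find(self, key):
        for pos in self._probe(key):
            slot = self.slots[pos]
            if slot is EMPTY:
                return None                # unsuccessful: reached an empty slot
            if slot is not DELETED and slot == key:
                return pos
        return None

    def delete(self, key):
        pos = self.find(key)
        if pos is None:
            return False
        self.slots[pos] = DELETED          # mark, do not empty
        return True
```

If 89, 49 and 9 are inserted into a table of size 10 and 49 is then deleted, the marker left at position 0 lets a search for 9 continue past that slot to position 1.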

Analysis of Linear Probing

Let $\alpha = m/n$, where m of the n slots in the hash table are occupied

$\alpha$ is called the load factor and is at most 1

Theorem 1:

Assumption: Independence of probes

Given an open-address hash table with load factor $\alpha < 1$, the average number of probes in an insertion is at most

$1/(1 - \alpha)$

Find Operation

Theorem 2:

Assuming that each key in the table is equally likely to be searched for ($\alpha < 1$)

The expected number of probes in a successful search is approximately

$\frac{1}{2}\left(1 + \frac{1}{1 - \alpha}\right)$

The expected number of probes in an unsuccessful search is approximately

$\frac{1}{2}\left(1 + \frac{1}{(1 - \alpha)^2}\right)$

Expected number of probes

- Analysis of Quadratic Probing
- Crucial questions:
- Will we always be able to insert element x if the table is not full?

- Ease of computation?
- What happens when the load factor gets too high?
- (this applies to linear probing as well)
- The following theorem addresses the first issue
- Theorem 3:
- If quadratic probing is used and the table size is prime,
- then a new element can be inserted if the table is at least half empty.
- Also, no cell is probed twice in the course of insertion.

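- Proof (by contradiction)
- We assume that there exist
- i < tableSize/2 and j < tableSize/2 such that i ≠ j and
- $(\text{hashValue} + i^2) \bmod \text{tableSize} = (\text{hashValue} + j^2) \bmod \text{tableSize}$

- Therefore, $(i^2 - j^2) \bmod \text{tableSize} = 0$
- Leading to $(i - j)(i + j) \bmod \text{tableSize} = 0$
- However, as tableSize is prime, for the above equality to hold, tableSize must divide either (i − j) or (i + j).
- Because i ≠ j and i and j are non-negative integers smaller than tableSize/2, we have 0 < |i − j| < tableSize and 0 < (i + j) < tableSize, so tableSize divides neither of them, then
- $(i - j)(i + j) \bmod \text{tableSize} \neq 0$
- This contradicts the assumption, so the probes with i < tableSize/2 all visit different cells. If the table is at least half empty, one of those cells is free: theorem 3 is true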
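The key step of the proof, that the probes with i < tableSize/2 never repeat a cell when tableSize is prime, can be checked by brute force. The helper name `probes_are_distinct` is illustrative, and the probe sequence is assumed to be (hashValue + i²) mod tableSize as in the proof.

```python
def probes_are_distinct(table_size: int) -> bool:
    """Check that, for every home position, probes i = 0 .. table_size // 2 are distinct."""
    for home in range(table_size):
        probes = [(home + i * i) % table_size
                  for i in range(table_size // 2 + 1)]
        if len(set(probes)) != len(probes):
            return False
    return True


if __name__ == "__main__":
    for size in (7, 11, 13, 16, 101):
        print(size, probes_are_distinct(size))
```

The check succeeds for the primes 7, 11, 13 and 101 and fails for 16, where for example i = 0 and i = 4 probe the same cell.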

With quadratic probing, the expected number of probes in an unsuccessful search is approximately

$1/(1 - \alpha)$

The expected number of probes in a successful search is approximately

$-(1/\alpha)\ln(1 - \alpha)$

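Comparison with the linear probing

Expected number of probes (U = unsuccessful search, S = successful search):

Linear probing, $\alpha$ = 0.1: U = 1.11, S = 1.05

Linear probing, $\alpha$ = 0.5: U = 2.50, S = 1.5

Linear probing, $\alpha$ = 0.9: U = 50.5, S = 5.5

Quadratic probing, $\alpha$ = 0.1: U = 1.11, S = 1.05

Quadratic probing, $\alpha$ = 0.5: U = 2.00, S = 1.38

Quadratic probing, $\alpha$ = 0.9: U = 10.00, S = 2.55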
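The figures in this comparison can be reproduced, up to rounding, with the sketch below. The quadratic-probing formulas are 1/(1 − α) for an unsuccessful search and −(1/α) ln(1 − α) for a successful one; for linear probing the sketch assumes the classical approximations ½(1 + 1/(1 − α)²) and ½(1 + 1/(1 − α)), which match the tabulated values. The function names are illustrative.

```python
import math


def linear_probing(alpha: float):
    """Approximate expected probes (unsuccessful, successful) for linear probing."""
    unsuccessful = 0.5 * (1 + 1 / (1 - alpha) ** 2)
    successful = 0.5 * (1 + 1 / (1 - alpha))
    return unsuccessful, successful


def quadratic_probing(alpha: float):
    """Approximate expected probes (unsuccessful, successful) for quadratic probing."""
    unsuccessful = 1 / (1 - alpha)
    successful = -math.log(1 - alpha) / alpha
    return unsuccessful, successful


if __name__ == "__main__":
    for alpha in (0.1, 0.5, 0.9):
        u_lin, s_lin = linear_probing(alpha)
        u_quad, s_quad = quadratic_probing(alpha)
        print(f"alpha={alpha}: linear U={u_lin:.2f} S={s_lin:.2f}, "
              f"quadratic U={u_quad:.2f} S={s_quad:.2f}")
```

The gap widens quickly as the table fills: at a load factor of 0.9 an unsuccessful search costs about 50 probes with linear probing against about 10 with quadratic probing.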

Secondary clustering

Secondary clustering:

Elements that hash to the same position will also probe the same positions in the hash table.

Note:

Quadratic probing eliminates primary clustering but does not eliminate secondary clustering.

Nevertheless, quadratic probing is efficient: it distributes the data well, which keeps the probability of collision low, and it is fast to compute.

- What do we do when the load factor gets too high?
- Rehash!
- Double the size of the hash table
- Rehash:
- Scan the entries in the current table, and insert them in a new hash table
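Rehashing can be sketched as follows, using linear probing for simplicity. The helper names and the `max_load` threshold of 0.5 are illustrative assumptions; the text only says to rehash when the load factor gets too high.

```python
def insert(table, key):
    """Linear-probing insert with h(k) = k mod tableSize."""
    n = len(table)
    for i in range(n):
        pos = (key % n + i) % n
        if table[pos] is None:
            table[pos] = key
            return pos
    raise OverflowError("hash table is full")


def rehash(table):
    """Scan the current table and insert every key into a table twice as large."""
    new_table = [None] * (2 * len(table))
    for key in table:
        if key is not None:
            insert(new_table, key)   # positions are recomputed for the new size
    return new_table


def insert_with_rehash(table, key, max_load=0.5):
    """Insert key, doubling the table first if the load factor would exceed max_load."""
    occupied = sum(slot is not None for slot in table)
    if (occupied + 1) / len(table) > max_load:
        table = rehash(table)
    insert(table, key)
    return table


table = [None] * 4
for k in (1, 5, 9):
    table = insert_with_rehash(table, k)
print(table)
```

Every key must be re-inserted rather than copied, because its position depends on tableSize.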

- Double hashing
- Double hashing eliminates secondary clustering:
- It uses 2 hash functions
- hashValue = (H1(key) + i × H2(key)) mod tableSize
- for i = 0, 1, 2, ...
- The idea is that even if two items hash to the same value of H1, they will usually have different values of H2, so that different probe sequences will be followed.
- H2(key) should never be zero or we will get stuck in the same location in the table.
- tableSize should be prime

- Given the restriction on the range of H2, the simplest choice for H2 is:
- H2(key) = 1 + (key mod (tableSize − 1))
- Then H2 can never be 0
- We have to calculate the hash value for key only once
- There is no restriction on the load factor.
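A minimal sketch of insertion with double hashing follows. It assumes H1(key) = key mod tableSize and H2(key) = 1 + (key mod (tableSize − 1)) with a prime tableSize; the function name is illustrative.

```python
def double_hash_insert(table, key):
    """Insert key with double hashing; return the position where it was stored."""
    n = len(table)                 # assumed to be prime
    h1 = key % n
    h2 = 1 + key % (n - 1)         # always in 1 .. n-1, never 0
    for i in range(n):
        pos = (h1 + i * h2) % n
        if table[pos] is None:
            table[pos] = key
            return pos
    raise OverflowError("hash table is full")


table = [None] * 11
# 15, 26 and 37 all have H1 = 4 but different H2 values (6, 7 and 8).
print([double_hash_insert(table, k) for k in (15, 26, 37)])
```

The three keys collide at position 4 but follow different probe sequences, ending at positions 4, 0 and 1. Because tableSize is prime and H2 is never 0, each sequence visits every cell, so an insertion succeeds as long as one cell is free.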