Introduction to Algorithms

Jiafen Liu

Sept. 2013

Hashing

- Direct access tables
- Choosing good hash functions
- Division Method
- Multiplication Method

- Resolving collisions by chaining
- Resolving collisions by open addressing

- Hashing arises in compilers as the symbol-table problem.
- Suppose a table S holds n records.
- Operations on S:
- INSERT(S, x)
- DELETE(S, x)
- SEARCH(S, k)
- Dynamic Set vs Static Set

- Suppose that the keys are drawn from the set U⊆{0, 1, …, m–1}, and keys are distinct.
- Direct-access table: set up an array T[0 . . m–1] with

$$T[k] = \begin{cases} x & \text{if } x \in S \text{ and } key[x] = k, \\ \text{NIL} & \text{otherwise.} \end{cases}$$

- In the worst case, each of the 3 operations takes time
- Θ(1)
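A minimal sketch of a direct-access table, assuming keys are integers in {0, 1, …, m–1} and using Python's `None` as NIL. The class name `DirectAccessTable` is only illustrative.

```python
class DirectAccessTable:
    """Direct-access table for distinct keys drawn from {0, 1, ..., m-1}."""

    def __init__(self, m):
        self.slots = [None] * m  # None plays the role of NIL

    def insert(self, key, record):
        self.slots[key] = record  # one array write: Theta(1)

    def delete(self, key):
        self.slots[key] = None    # one array write: Theta(1)

    def search(self, key):
        return self.slots[key]    # one array read: Theta(1)


table = DirectAccessTable(8)
table.insert(3, "record with key 3")
```

Every operation is a single array access, which is where the Θ(1) bound comes from. The cost is an array as large as the whole key range.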

- Limitations of a direct-access table?
- The range of keys can be very large, e.g. 64-bit numbers.
- Keys may be character strings, which are hard to use directly as array indices.

- Hashing: try to keep the table small while preserving constant running time for the operations.

- Solution: Use a hash function h to map the keys of records in S into {0, 1, …, m–1}.

[Figure: the hash function h maps keys k1, …, k5 into slots 0 through m–1 of table T; k2 and k5 land in the same slot, h(k2) = h(k5).]

- When a record to be inserted maps to an already occupied slot in T, a collision occurs.
- The simplest way to resolve collisions?
- Chaining: link the records in the same slot into a list.

[Figure: slot i of T points to a linked list holding the records with keys 49, 86 and 52, since h(49) = h(86) = h(52) = i.]

- What’s the worst case of chaining?
- Every key hashes to the same slot, so the table degenerates into a single linked list.
- Access time in the worst case?
- Θ(n), where n is the size of S.
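Chaining can be sketched as follows. The hash function `key % m` and the keys 12, 49, 86 are assumptions chosen so that all three keys collide; each chain is a plain Python list standing in for a linked list.

```python
class ChainedHashTable:
    """Hash table that resolves collisions by chaining."""

    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]  # one chain per slot

    def _h(self, key):
        return key % self.m  # placeholder hash function (an assumption)

    def insert(self, key, value):
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:               # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def search(self, key):
        for k, v in self.slots[self._h(key)]:  # walk the chain
            if k == key:
                return v
        return None

    def delete(self, key):
        i = self._h(key)
        self.slots[i] = [(k, v) for k, v in self.slots[i] if k != key]


t = ChainedHashTable(37)
for key in (12, 49, 86):           # 12, 49 and 86 are all 12 mod 37
    t.insert(key, f"record {key}")
longest_chain = max(len(chain) for chain in t.slots)  # 3: every key shares one slot
```

When every key lands in one chain, as here, a search has to walk all n entries. That is the Θ(n) worst case.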

- To analyze the average case,
- we need to know all possible inputs and their probabilities.
- We don’t know the exact distribution, so we have to make assumptions.

- Here, we make the assumption of simple uniform hashing:
- Each key k in S is equally likely to be hashed to any slot in T, independent of where the other keys hash.
- Note that simple uniform hashing includes an independence assumption.

- Let n be the number of keys in the table, and let m be the number of slots.
- Under the simple uniform hashing assumption, what is the probability that two given keys hash to the same slot?
- 1/m.

- Define the load factor of T to be α = n/m, the average number of keys per slot.

- The expected time for an unsuccessful search for a record with a given key is Θ(1 + α):
- the 1 is the time to apply the hash function and access the slot;
- the α is the time to search the list.

- If α = O(1), expected search time = Θ(1).
- How about a successful search?
- It has the same asymptotic bound.
- Reserved for your homework.

- The assumption of simple uniform hashing is hard to guarantee, but several common techniques tend to work well in practice.
- A good hash function should distribute the keys uniformly into all the slots.
- Regularity of the key distribution should not affect this uniformity.
- For example, the keys might all be even numbers.

- The simplest way to distribute keys to m slots evenly?

- Assume all keys are integers, and define
h(k) = k mod m.

- Advantage: simple, and usually practical.
- Caution:
- Be careful about the choice of modulus m.
- It does not work well for every table size m.

- Deficiency: don’t pick an m that has a small divisor d.
- Example: d = 2, so that m is an even number.
- Suppose all keys happen to be even.
- What happens to the hash table?
- We will never hash anything to an odd-numbered slot.

- Extreme deficiency: if m = 2^r, then 2 is its only prime factor.
- If k = (1011000111011010)₂ and m = 2^6, what does the hash value turn out to be?
- h(k) = (011010)₂, the 6 low-order bits of k.
- The hash value doesn’t depend on all the bits of k.
- Suppose all the low-order bits are the same and only the high-order bits differ: then every key hashes to the same slot.

- Heuristics for choosing modulus m:
- Choose m to be a prime
- Make m not close to a power of two or ten.
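The effect of the modulus can be checked directly. This is a small sketch; the key set (the even numbers below 2000) and the table sizes 16 and 17 are assumptions picked for illustration.

```python
def division_hash(k, m):
    """Division method: h(k) = k mod m."""
    return k % m


even_keys = range(0, 2000, 2)  # 1000 keys, all even


def used_slots(m):
    """Number of distinct slots that the even keys occupy in a table of size m."""
    return len({division_hash(k, m) for k in even_keys})


slots_even_m = used_slots(16)   # 8: only the even-numbered slots are ever hit
slots_prime_m = used_slots(17)  # 17: every slot is hit

# With m = 2^6 the hash value is just the 6 low-order bits of k.
k = 0b1011000111011010
low_bits = division_hash(k, 2**6)  # 0b011010
```

With m = 16, half of the table stays empty. The prime 17 spreads the same keys over every slot.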

- The division method is not always the best choice:
- Sometimes making the table size a prime is inconvenient; we often want a table of size 2^r.
- Division also takes longer to compute than multiplication or addition on most computers.

- The multiplication method is a little more complicated but superior.
- Assume that all keys are integers, m = 2^r, and our computer has w-bit words.
- Define h(k) = (A·k mod 2^w) rsh (w–r):
- A is an odd integer in the range 2^(w–1) < A < 2^w.
- (Both the highest bit and the lowest bit are 1.)
- rsh is the “bitwise right-shift” operator.

- Multiplication modulo 2^w is fast compared to division, and the rsh operator is fast.
- Tip: don’t pick A too close to 2^(w–1) or 2^w.

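- Suppose that m = 8 = 2^3 (so r = 3), and that our computer has w = 7-bit words:
- We choose A = 1 0 1 1 0 0 1
- k = 1 1 0 1 0 1 1
- A·k = 1 0 0 1 0 1 0 0 1 1 0 0 1 1
- The 7 high-order bits 1 0 0 1 0 1 0 are ignored by mod 2^w.
- The w–r = 4 low-order bits 0 0 1 1 are ignored by rsh.
- The 3 bits in between are the hash value: h(k) = 0 1 1.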
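The worked example can be reproduced with a few bit operations. The function name `multiplication_hash` is illustrative; the mask `(1 << w) - 1` implements mod 2^w.

```python
def multiplication_hash(k, A, w, r):
    """Multiplication method: h(k) = (A*k mod 2^w) rsh (w - r)."""
    low_w_bits = (A * k) & ((1 << w) - 1)  # A*k mod 2^w: keep the low w bits
    return low_w_bits >> (w - r)           # keep the top r of those w bits


A = 0b1011001   # 89, odd, with 2^6 < A < 2^7
k = 0b1101011   # 107
h = multiplication_hash(k, A, w=7, r=3)  # 0b011 == 3
```

Here A·k = 9523 = (10010100110011)₂. The mask keeps (0110011)₂ and the shift by 4 leaves (011)₂.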

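- We’ve talked about resolving collisions by chaining. With chaining, we need an extra link field in each record.
- Open addressing avoids the links: every record is stored in the table itself, and insertion probes the table systematically until it finds an empty slot.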

- The hash function depends on both the key and the probe number:
h : U × {0, 1, …, m–1} → {0, 1, …, m–1}
(universe of keys × probe number → slot number)

- The probe sequence 〈h(k,0), h(k,1), …, h(k,m–1)〉 should be a permutation of {0, 1, …, m–1}.

- What about HASH-SEARCH(T,k)?

- The hash table may fill up.
- The number of elements can be at most the table size: n ≤ m.

- Deletion is difficult. Why?
- Suppose we remove a key by simply emptying its slot, and later someone searches for a different key.
- The probe sequence of that search may pass through the slot we emptied.
- The search sees an empty slot and wrongly concludes that the key is not in the table.
- So we should mark deleted slots as deleted rather than as empty.

- We can also record, globally, the largest number of probes that any insertion has needed.
- A search never needs more probes than that number.

- There are many ways to form a probe sequence.
- The simplest one?
- Linear probing.

- Linear probing: given an ordinary hash function h′(k), linear probing uses
h(k,i) = (h′(k) + i) mod m

- Advantage: Simple
- Disadvantage?
- primary clustering

- Linear probing suffers from primary clustering: long runs of occupied slots build up in regions of the table.
- Any key that hashes into such a region has to probe through the whole run.
- Worse, the runs tend to keep growing, which increases the average search time.
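An open-addressed table with linear probing and deletion markers might look like the sketch below. The base hash `key % m` and the sentinel names `EMPTY` and `DELETED` are assumptions. As in the rest of the lecture, keys are taken to be distinct.

```python
EMPTY = object()    # slot has never held a key
DELETED = object()  # slot held a key that was later removed


class LinearProbingTable:
    """Open addressing with h(k, i) = (h'(k) + i) mod m."""

    def __init__(self, m):
        self.m = m
        self.slots = [EMPTY] * m

    def _probe(self, key):
        start = key % self.m  # h'(k): placeholder base hash (an assumption)
        for i in range(self.m):
            yield (start + i) % self.m

    def insert(self, key):
        for j in self._probe(key):
            if self.slots[j] is EMPTY or self.slots[j] is DELETED:
                self.slots[j] = key  # deleted slots can be reused
                return j
        raise OverflowError("hash table is full")

    def search(self, key):
        for j in self._probe(key):
            if self.slots[j] is EMPTY:
                return None          # a truly empty slot ends the search
            if self.slots[j] == key:
                return j
        return None                  # probed all m slots without success

    def delete(self, key):
        j = self.search(key)
        if j is not None:
            self.slots[j] = DELETED  # mark the slot instead of emptying it


t = LinearProbingTable(8)
positions = [t.insert(k) for k in (3, 11, 19)]  # all hash to 3 -> slots [3, 4, 5]
t.delete(11)                                    # slot 4 becomes DELETED
found = t.search(19)                            # 5: the search steps over slot 4
```

Had `delete` reset slot 4 to `EMPTY`, the search for 19 would have stopped there and reported the key as missing. The run of slots 3, 4, 5 is also a small instance of primary clustering.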

- Double hashing: given two ordinary hash functions h1(k), h2(k), double hashing uses
h(k,i) = ( h1(k) +i⋅h2(k) ) mod m

- h2(k) must be relatively prime to m so that the probe sequence visits every slot; double hashing then generally produces excellent results.
- One way to ensure this is to make m a power of 2 and design h2(k) to produce only odd numbers.
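The permutation property can be checked for small tables. In this sketch the two hash functions are hypothetical; the only property that matters is that the second one always returns an odd number while m is a power of 2.

```python
def probe_sequence(k, m):
    """Double-hashing probe sequence for key k; m is assumed to be a power of 2."""
    h1 = k % m                   # hypothetical primary hash
    h2 = (2 * (k // m) + 1) % m  # hypothetical secondary hash, always odd
    return [(h1 + i * h2) % m for i in range(m)]


m = 16
# With an odd step, every key's probe sequence visits all 16 slots exactly once.
all_permutations = all(
    sorted(probe_sequence(k, m)) == list(range(m)) for k in range(1000)
)

# For contrast, an even step of 4 only ever reaches 4 of the 16 slots.
slots_with_even_step = {(0 + i * 4) % m for i in range(m)}
```

An odd step shares no factor with a power of 2, so the sequence cannot return to its start before it has visited all m slots.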

- We make the assumption of uniform hashing:
- Each key is equally likely to have any one of the m! permutations as its probe sequence, independent of other keys.

- Theorem. Given an open-addressed hash table with load factor α= n/m< 1, the expected number of probes in an unsuccessful search is at most 1/(1–α) .

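Proof:

- At least one probe is always necessary.
- With probability n/m, the first probe hits an occupied slot, and a second probe is necessary.
- With probability (n–1)/(m–1), the second probe hits an occupied slot, and a third probe is necessary.
- With probability (n–2)/(m–2), the third probe hits an occupied slot, etc.
- Observe that (n–i)/(m–i) < n/m = α for i = 1, 2, …, n.

- Therefore, the expected number of probes is

$$1 + \frac{n}{m}\left(1 + \frac{n-1}{m-1}\left(1 + \frac{n-2}{m-2}\left(\cdots\left(1 + \frac{1}{m-n+1}\right)\cdots\right)\right)\right) \le 1 + \alpha\left(1 + \alpha\left(1 + \alpha\left(\cdots(1 + \alpha)\cdots\right)\right)\right) \le 1 + \alpha + \alpha^2 + \alpha^3 + \cdots = \sum_{i=0}^{\infty} \alpha^i = \frac{1}{1-\alpha} \quad \text{(geometric series)}$$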

- If α is constant, then accessing an open-addressed hash table takes constant time.
- If the table is half full, then the expected number of probes is ?
- 1/(1–0.5) = 2.

- If the table is 90% full, then the expected number of probes is ?
- 1/(1–0.9) = 10.
- Using nearly all of the space makes hashing slow.
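The bound can be compared with a small simulation. The sketch models uniform hashing by drawing a fresh random permutation of the slots as each probe sequence. The table size m = 20, the number of trials and the seed are arbitrary assumptions.

```python
import random


def average_unsuccessful_probes(n, m, trials=20000, seed=1):
    """Average probes of an unsuccessful search with n of m slots occupied."""
    assert n < m, "an unsuccessful search needs at least one empty slot"
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # Uniform hashing: the probe sequence is a uniformly random permutation.
        # By symmetry we may assume that slots 0 .. n-1 are the occupied ones.
        for probes, slot in enumerate(rng.sample(range(m), m), start=1):
            if slot >= n:  # first empty slot: the search stops here
                break
        total += probes
    return total / trials


half_full = average_unsuccessful_probes(n=10, m=20)    # compare with 1/(1 - 0.5) = 2
ninety_full = average_unsuccessful_probes(n=18, m=20)  # compare with 1/(1 - 0.9) = 10
```

Both averages should come out below the corresponding bound, since 1/(1–α) is an upper bound and not the exact expectation.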

- Universal hashing
- Perfect hashing

- Problem: For any hash function h, there exists a bad set of keys that all hash to the same slot.
- It causes the average access time of a hash table to skyrocket.
- An adversary can pick all keys from {k: h(k) = i } for some slot i.

- IDEA: Choose the hash function at random, independently of the keys.

- Theorem:
- Let h be a hash function chosen at random from a universal set H of hash functions.
- Suppose h is used to hash n arbitrary keys into the m slots of a table T.
- Then for a given key x, we have:
E[number of collisions with x] < n/m.

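- Proof. Let Cx be the random variable denoting the total number of collisions of keys in T with x, and let

$$c_{xy} = \begin{cases} 1 & \text{if } h(x) = h(y), \\ 0 & \text{otherwise.} \end{cases}$$

- Because H is universal, a randomly chosen h makes two distinct keys collide with probability 1/m, so E[cxy] = 1/m. Cx is the sum of cxy over the keys y ≠ x in T, so linearity of expectation gives

$$E[C_x] = \sum_{y \in T - \{x\}} E[c_{xy}] = \sum_{y \in T - \{x\}} \frac{1}{m} = \frac{n-1}{m} < \frac{n}{m}.$$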

- One method to construct a set of universal hash functions:
- Let m be prime. Decompose key k into r+1 digits, each with value in the set {0, 1, …, m–1}.
- That is, let k = 〈k0, k1, …, kr〉, where 0 ≤ ki < m.
- Randomized strategy:
- Pick a = 〈a0, a1, …, ar〉 where each ai is chosen randomly from {0, 1, …, m–1}.

- Define

$$h_a(k) = \sum_{i=0}^{r} a_i k_i \bmod m.$$

- How big is H = {ha}?
- |H| = m^(r+1).
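A sketch of this construction, assuming the usual dot-product definition: ha(k) is the sum of ai·ki over all digits, taken mod m. The helper names `to_digits` and `random_hash` are illustrative, and digits are stored least significant first.

```python
import random


def to_digits(k, m, r):
    """Decompose k into r + 1 base-m digits <k0, ..., kr>, least significant first."""
    digits = []
    for _ in range(r + 1):
        digits.append(k % m)
        k //= m
    return digits


def random_hash(m, r, rng=random):
    """Pick a = <a0, ..., ar> at random and return the function h_a."""
    a = [rng.randrange(m) for _ in range(r + 1)]

    def h(k):
        return sum(ai * ki for ai, ki in zip(a, to_digits(k, m, r))) % m

    return h


m, r = 7, 2                            # m is prime; keys lie below m**(r + 1) = 343
h = random_hash(m, r, random.Random(0))
slots = [h(k) for k in range(m ** (r + 1))]
```

Each of the r + 1 coefficients has m possible values, which gives the m^(r+1) functions counted above.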

- Theorem. The set H = {ha} is universal.
- Proof.
- Let x = 〈x0, x1, …, xr〉 and y = 〈y0, y1, …, yr〉 be distinct keys.
- Thus, they differ in at least one digit position.
- Without loss of generality, let that position be 0.
- For how many ha∈H do x and y collide?

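- ha(x) = ha(y) implies that

$$\sum_{i=0}^{r} a_i x_i \equiv \sum_{i=0}^{r} a_i y_i \pmod{m}.$$

- Equivalently, we have

$$\sum_{i=0}^{r} a_i (x_i - y_i) \equiv 0 \pmod{m}, \quad \text{that is,} \quad a_0 (x_0 - y_0) + \sum_{i=1}^{r} a_i (x_i - y_i) \equiv 0 \pmod{m}.$$

- So we have

$$a_0 (x_0 - y_0) \equiv -\sum_{i=1}^{r} a_i (x_i - y_i) \pmod{m},$$

and since x0 ≠ y0 and m is prime, an inverse (x0 – y0)^(–1) modulo m must exist, which implies that

$$a_0 \equiv \left(-\sum_{i=1}^{r} a_i (x_i - y_i)\right) (x_0 - y_0)^{-1} \pmod{m}.$$

- Thus, for any choices of a1, a2, …, ar, exactly one choice of a0 causes x and y to collide.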

- How many ha will cause x and y to collide?
- There are m choices for each of a1, a2, …, ar, but once these are chosen, exactly one choice for a0 causes x and y to collide.

- Thus, the number of functions ha that cause x and y to collide is m^r · 1 = m^r = |H|/m.
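For small parameters the count can be verified by brute force. The sketch assumes the dot-product family ha(k) = (sum of ai·ki) mod m and represents each key directly as its tuple of digits. The parameters m = 3, r = 2 are arbitrary.

```python
from itertools import product


def colliding_functions(x, y, m):
    """Count the vectors a for which h_a(x) == h_a(y); x and y are digit tuples."""
    count = 0
    for a in product(range(m), repeat=len(x)):
        hx = sum(ai * xi for ai, xi in zip(a, x)) % m
        hy = sum(ai * yi for ai, yi in zip(a, y)) % m
        count += (hx == hy)
    return count


m, r = 3, 2
keys = list(product(range(m), repeat=r + 1))  # all 27 keys with r + 1 digits
# Every pair of distinct keys collides under exactly m**r = 9 of the 27 functions.
universal = all(
    colliding_functions(x, y, m) == m ** r
    for x in keys for y in keys if x != y
)

# With a non-prime modulus the inverse may not exist and the count can be too large:
bad_count = colliding_functions((2, 0), (0, 0), 4)  # 8, although m**r = 4
```

The m = 4 case shows where the proof uses primality: 2 has no inverse modulo 4, so two values of a0 (namely 0 and 2) make the keys collide instead of one.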

- Requirement: Given a set of n keys, construct a static hash table of size m = O(n) such that SEARCH takes Θ(1) time in the worst case.
- IDEA: Two-level scheme with universal hashing at both levels. No collisions at level 2!

- Theorem. Let H be a class of universal hash functions for a table of size m = n^2. If we use a random h∈H to hash n keys into the table, the expected number of collisions is at most 1/2.
- Proof. By the definition of universality, the probability that two given keys collide under h is 1/m = 1/n^2. Since there are n(n–1)/2 pairs of keys that can possibly collide, the expected number of collisions is

$$\binom{n}{2} \cdot \frac{1}{n^2} = \frac{n(n-1)}{2} \cdot \frac{1}{n^2} < \frac{1}{2}.$$

- Markov’s inequality says that for any nonnegative random variable X, we have Pr{X ≥ t} ≤ E[X]/t.
- Theorem. The probability of no collisions is at least 1/2.
- Proof. Applying this inequality with t = 1, we find that the probability of 1 or more collisions is at most 1/2.
- Conclusion: Just by testing random hash functions in H, we’ll quickly find one that works.
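The "keep testing random functions" step can be sketched as below. This sketch swaps in a different universal family from the one constructed earlier, namely h(k) = ((a·k + b) mod p) mod m with p a prime larger than every key. The prime 2^31 – 1, the function name and the sample keys are all assumptions.

```python
import random


def find_collision_free_hash(keys, p=2_147_483_647, rng=random):
    """Try random h(k) = ((a*k + b) mod p) mod m, with m = n^2, until no keys collide.

    Assumes the keys are distinct integers in [0, p); p = 2^31 - 1 is prime.
    """
    n = len(keys)
    m = n * n
    tries = 0
    while True:
        tries += 1
        a = rng.randrange(1, p)
        b = rng.randrange(p)

        def h(k, a=a, b=b):
            return ((a * k + b) % p) % m

        if len({h(k) for k in keys}) == n:  # all n slots distinct: no collisions
            return h, m, tries


keys = [10, 22, 37, 40, 52, 60, 70, 72, 75]
h, m, tries = find_collision_free_hash(keys, rng=random.Random(0))
table = [None] * m
for k in keys:
    table[h(k)] = k  # every key gets a slot of its own
```

Each attempt succeeds with probability at least 1/2, so the expected number of attempts is at most 2. A full two-level scheme would apply this to each first-level slot separately, which is what keeps the total space at O(n).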

Have FUN !