Introduction to Algorithms

Introduction to Algorithms Jiafen Liu Sept. 2013

Today’s Tasks Hashing • Direct access tables • Choosing good hash functions • Division Method • Multiplication Method • Resolving collisions by chaining • Resolving collisions by open addressing

Symbol-Table Problem • Hashing arises in compilers as the symbol-table problem. • Suppose a table S holds n records. • Operations on S: • INSERT(S, x) • DELETE(S, x) • SEARCH(S, k) • Dynamic Set vs Static Set

The Simplest Case • Suppose that the keys are drawn from the set U ⊆ {0, 1, …, m–1}, and keys are distinct. • Direct-access table: set up an array T[0 . . m–1] with T[k] = x if x ∈ S and key[x] = k, and T[k] = NIL otherwise. • In the worst case, each of the 3 operations takes Θ(1) time. • Limitations of direct-access tables? • The range of keys can be large: 64-bit numbers. • Character strings are difficult to represent as array indices. • Hashing: Try to keep the table small, while preserving the property of constant running time.
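A direct-access table can be sketched in a few lines. This is a minimal illustration, assuming integer keys in {0, …, m–1}; the class name `DirectAccessTable` and the use of `None` for NIL are choices made here, not part of the slides.

```python
class DirectAccessTable:
    """Direct-access table for distinct integer keys in {0, ..., m-1}."""

    def __init__(self, m):
        # T[0 .. m-1]; None plays the role of NIL.
        self.slots = [None] * m

    def insert(self, key, record):
        self.slots[key] = record      # Theta(1)

    def delete(self, key):
        self.slots[key] = None        # Theta(1)

    def search(self, key):
        return self.slots[key]        # Theta(1); None if absent


table = DirectAccessTable(8)
table.insert(3, "record with key 3")
print(table.search(3))  # record with key 3
print(table.search(5))  # None
```

Every operation is a single array access, but the array needs one slot per possible key, which is what makes the approach impractical for large key ranges.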

Naïve Hashing • Solution: Use a hash function h to map the keys of records in S into {0, 1, …, m–1}. [Figure: keys k1, …, k5 mapped by h to slots of the table T[0 . . m–1]; h(k2) = h(k5), so k2 and k5 land in the same slot.]

Collisions • When a record to be inserted maps to an already occupied slot in T, a collision occurs. • The simplest way to resolve collisions? • Link records in the same slot into a list. [Figure: slot i of T holds the chain 49 → 86 → 52, since h(49) = h(86) = h(52) = i.]
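Chaining can be sketched as follows. This is an illustration only: it uses a Python list for each chain rather than a linked list, and it assumes h(k) = k mod m as the hash function; both are simplifications chosen here.

```python
class ChainedHashTable:
    """Hash table that resolves collisions by chaining."""

    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]   # one chain per slot

    def _h(self, key):
        return key % self.m                   # assumed hash function

    def insert(self, key, record):
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                      # key already present: overwrite
                chain[i] = (key, record)
                return
        chain.append((key, record))

    def search(self, key):
        for k, record in self.slots[self._h(key)]:
            if k == key:
                return record
        return None

    def delete(self, key):
        idx = self._h(key)
        self.slots[idx] = [(k, r) for k, r in self.slots[idx] if k != key]


t = ChainedHashTable(8)
for key in (1, 9, 17):                        # all three hash to slot 1
    t.insert(key, f"record {key}")
print(len(t.slots[1]))  # 3
print(t.search(9))      # record 9
```

The three keys collide in slot 1, so a search there has to walk the chain. That walk is the cost which grows with the chain length.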

Worst Case of Chaining • What’s the worst case of chaining? • Every key hashes to the same slot, so the table degenerates into a single linked list. • Access time in the worst case? • Θ(n), where n is the size of S.

Average Case of Chaining • In order to analyze the average case, we should know all possible inputs and their probabilities. • We don’t know the exact distribution, so we make assumptions. • Here, we make the assumption of simple uniform hashing: • Each key k in S is equally likely to be hashed to any slot in T, independent of other keys. • Simple uniform hashing includes an independence assumption.

Average Case of Chaining • Let n be the number of keys in the table, and let m be the number of slots. • Under the simple uniform hashing assumption, what is the probability that two given keys hash to the same slot? • 1/m. • Define the load factor of T to be α = n/m. What does that mean? • The average number of keys per slot.

Search Cost • The expected time for an unsuccessful search for a record with a given key is? Θ(1 + α): the 1 is for applying the hash function and accessing the slot, and the α is for searching the list. • If α = O(1), expected search time = Θ(1). • How about a successful search? • It has the same asymptotic bound. • Reserved for your homework.

Choosing a hash function • The assumption of simple uniform hashing is hard to guarantee, but several common techniques tend to work well in practice. • A good hash function should distribute the keys uniformly into all the slots. • Regularity of the key distribution should not affect this uniformity. • For example, all the keys are even numbers. • The simplest way to distribute keys to m slots evenly?

Division Method • Assume all keys are integers, and define h(k) = k mod m. • Advantage: simple and usually practical. • Caution: • Be careful about the choice of modulus m. • It doesn't work well for every table size m. • Example: if we pick m with a small divisor d.

Deficiency of Division Method • Deficiency: if we pick m with a small divisor d. • Example: d = 2, so that m is an even number. • Suppose all keys happen to be even. • What happens to the hash table? • We will never hash anything to an odd-numbered slot.
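The effect is easy to see on a small case. The sketch below assumes m = 8 and the even keys 0, 2, …, 98; both values are chosen here only for illustration.

```python
def division_hash(k, m):
    """Division method: h(k) = k mod m."""
    return k % m


m = 8                                   # even modulus (small divisor d = 2)
even_keys = range(0, 100, 2)            # every key is even
used_slots = sorted({division_hash(k, m) for k in even_keys})
print(used_slots)  # [0, 2, 4, 6]
```

Half of the table is never used, so the effective load factor on the remaining slots doubles.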

Deficiency of Division Method • Extreme deficiency: if $m = 2^r$, every nontrivial divisor of m is a power of 2. • If $k = (1011000111011010)_2$ and $m = 2^6$, what does the hash value turn out to be? • It is just the low-order 6 bits of k, $(011010)_2$. • The hash value doesn’t depend on all the bits of k. • Suppose all the low-order bits of the keys are the same and only the high-order bits differ: then every key hashes to the same slot.
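A quick check of this example, as a sketch: the second key `k2` is made up here to share the low-order 6 bits of k while differing in the high-order bits.

```python
k = 0b1011000111011010
m = 2 ** 6

h = k % m
print(format(h, "06b"))        # 011010

# k mod 2^r simply keeps the r low-order bits.
assert h == k & (m - 1)

# A hypothetical second key with different high-order bits
# but the same low-order 6 bits collides with k.
k2 = 0b0100111000011010
print(k2 % m == h)             # True
```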

How to choose modulus? • Heuristics for choosing modulus m: • Choose m to be a prime. • Make m not close to a power of two or ten. • The division method has drawbacks: • Sometimes, making the table size a prime is inconvenient. We often want to create a table of size $2^r$. • Division also takes more time to compute than multiplication or addition on most computers.

Another method: Multiplication • The multiplication method is a little more complicated but superior. • Assume that all keys are integers, $m = 2^r$, and our computer has w-bit words. • Define $h(k) = (A \cdot k \bmod 2^w) \text{ rsh } (w - r)$: • A is an odd integer in the range $2^{w-1} < A < 2^w$. • (Both the highest bit and the lowest bit are 1.) • rsh is the “bitwise right-shift” operator. • Multiplication modulo $2^w$ is fast compared to division, and the rsh operator is fast. • Tip: Don’t pick A too close to $2^{w-1}$ or $2^w$.

Example of multiplication method • Suppose that $m = 8 = 2^3$, r = 3, and that our computer has w = 7-bit words: • We choose A = 1011001 and k = 1101011 (binary). • A·k = 1001010 011 0011: the high 7 bits (1001010) are ignored by mod $2^w$, the low 4 bits (0011) are ignored by rsh, and the middle 3 bits give h(k) = 011.
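The multiplication method translates directly into shifts and masks. The sketch below reproduces the 7-bit example; the function name `multiplication_hash` and its parameter order are choices made here.

```python
def multiplication_hash(k, A, w, r):
    """h(k) = (A*k mod 2^w) rsh (w - r), for a table of size m = 2^r."""
    return ((A * k) % (1 << w)) >> (w - r)


A = 0b1011001   # odd, with 2^(w-1) < A < 2^w for w = 7
k = 0b1101011
print(bin(A * k))                               # 0b10010100110011
print(multiplication_hash(k, A, w=7, r=3))      # 3, i.e. binary 011
```

On real hardware the `% (1 << w)` step usually costs nothing, because a w-bit multiply already discards the high-order bits.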

Another way to solve collision • We’ve talked about resolving collisions by chaining. With chaining, we need an extra link field in each record. • There's another way—open addressing, with idea: No storage for links. • We should systematically probe the table until an empty slot is found.

Open Addressing • The hash function depends on both the key and the probe number: h : U × {0, 1, …, m–1} → {0, 1, …, m–1}, that is, (universe of keys) × (probe number) → (slot number). • The probe sequence 〈h(k,0), h(k,1), …, h(k,m–1)〉 should be a permutation of {0, 1, …, m–1}.

Implementation of Insertion • What about HASH-SEARCH(T,k)?

Implementation of Searching

More about Open Addressing • The hash table may fill up. • The number of elements must be less than or equal to the table size. • Deletion is difficult, why? • Suppose we remove a key from the table and simply mark its slot empty. • A later search for a different key may follow a probe sequence that passes through that slot. • The search sees an empty slot and wrongly concludes that the key is not in the table. • So deleted slots should be marked as deleted rather than emptied.
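Insertion, search, and deletion with a "deleted" marker can be sketched as below. This is one possible arrangement, not the slides' pseudocode: the sentinels `EMPTY` and `DELETED`, the class name, and the probe function (k mod m plus the probe number) are all assumptions made for the illustration.

```python
EMPTY = object()     # slot has never held a key
DELETED = object()   # slot held a key that was later removed


class OpenAddressTable:
    def __init__(self, m):
        self.m = m
        self.slots = [EMPTY] * m

    def _probe(self, key, i):
        # Assumed probe function; any permutation-producing h(k, i) works.
        return (key % self.m + i) % self.m

    def insert(self, key):
        for i in range(self.m):
            j = self._probe(key, i)
            if self.slots[j] is EMPTY or self.slots[j] is DELETED:
                self.slots[j] = key
                return j
        raise OverflowError("hash table overflow")

    def search(self, key):
        for i in range(self.m):
            j = self._probe(key, i)
            if self.slots[j] is EMPTY:        # truly empty: key is absent
                return None
            if self.slots[j] is not DELETED and self.slots[j] == key:
                return j
        return None

    def delete(self, key):
        j = self.search(key)
        if j is not None:
            self.slots[j] = DELETED           # keep the probe chain intact


t = OpenAddressTable(8)
for key in (1, 9, 17):                        # land in slots 1, 2, 3
    t.insert(key)
t.delete(9)
print(t.search(17))  # 3
print(t.search(9))   # None
```

If `delete` wrote `EMPTY` instead of `DELETED`, the search for 17 would stop at slot 2 and report the key as missing.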

Example of open addressing

Some heuristics about probing • We can record globally the largest number of probes any insertion has needed. • A search never needs to probe more than that number of slots. • There are lots of ideas about forming a probe sequence effectively. • The simplest one is? • Linear probing.

The simplest probing strategy • Linear probing: given an ordinary hash function h′(k), linear probing uses h(k,i) = (h′(k) + i) mod m. • Advantage: Simple • Disadvantage? • Primary clustering

Primary Clustering • Linear probing suffers from primary clustering, where regions of the hash table fill up. • Anything that hashes into such a region has to probe through all of its occupied slots. • What’s more, long runs of occupied slots tend to grow longer, increasing the average search time.
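A small experiment shows a cluster forming under linear probing. It is a sketch that assumes h′(k) = k mod m and a table of size 8; the keys are picked here so that they pile into one run.

```python
def linear_probe(k, i, m):
    """h(k, i) = (h'(k) + i) mod m, with h'(k) = k mod m assumed."""
    return (k % m + i) % m


def insert(table, k):
    """Insert k and return the number of probes it took."""
    m = len(table)
    for i in range(m):
        j = linear_probe(k, i, m)
        if table[j] is None:
            table[j] = k
            return i + 1
    raise OverflowError("hash table overflow")


m = 8
print([linear_probe(5, i, m) for i in range(m)])  # [5, 6, 7, 0, 1, 2, 3, 4]

table = [None] * m
probes = [insert(table, k) for k in (0, 8, 16, 1, 2)]
print(table)   # [0, 8, 16, 1, 2, None, None, None]
print(probes)  # [1, 2, 3, 3, 3]
```

Keys 1 and 2 hash to different slots than 0, 8, and 16, yet they still have to walk to the end of the run, which then grows by one slot each time.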

Another probing strategy • Double hashing: given two ordinary hash functions h1(k), h2(k), double hashing uses h(k,i) = (h1(k) + i⋅h2(k)) mod m. • h2(k) must be relatively prime to m so that the probe sequence visits every slot; when it is, double hashing generally produces excellent results. • One way to ensure this is to make m a power of 2 and design h2(k) to produce only odd numbers.
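The sketch below checks that an odd h2(k) with m a power of 2 yields a full permutation of the slots. The particular choices h1(k) = k mod m and h2(k) = 2⋅⌊k/m⌋ + 1 are hypothetical; the slides do not specify them.

```python
def double_hash_probe(k, i, m):
    """h(k, i) = (h1(k) + i * h2(k)) mod m with illustrative h1, h2."""
    h1 = k % m
    h2 = 2 * (k // m) + 1      # always odd, hence relatively prime to m = 2^r
    return (h1 + i * h2) % m


m = 8
seq = [double_hash_probe(27, i, m) for i in range(m)]
print(seq)  # [3, 2, 1, 0, 7, 6, 5, 4]

# Every key's probe sequence visits each slot exactly once.
all_permutations = all(
    sorted(double_hash_probe(k, i, m) for i in range(m)) == list(range(m))
    for k in range(200)
)
print(all_permutations)  # True
```

Because different keys with the same h1 usually get different step sizes h2, their probe sequences diverge after the first probe, which avoids the long runs seen with linear probing.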

Analysis of open addressing • We make the assumption of uniform hashing: • Each key is equally likely to have any one of the m! permutations as its probe sequence, independent of other keys. • Theorem. Given an open-addressed hash table with load factor α= n/m< 1, the expected number of probes in an unsuccessful search is at most 1/(1–α) .

Proof of the theorem Proof: • At least one probe is always necessary. • With probability n/m, the first probe hits an occupied slot, and a second probe is necessary. • With probability (n–1)/(m–1), the second probe hits an occupied slot, and a third probe is necessary. • With probability (n–2)/(m–2), the third probe hits an occupied slot, etc. • And then how to prove? • Observe that $\frac{n-i}{m-i} < \frac{n}{m} = \alpha$ for i = 1, 2, …, n.

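Proof of the theorem • Therefore, the expected number of probes is $1 + \frac{n}{m}\left(1 + \frac{n-1}{m-1}\left(1 + \frac{n-2}{m-2}\left(\cdots\left(1 + \frac{1}{m-n+1}\right)\cdots\right)\right)\right) \le 1 + \alpha(1 + \alpha(1 + \alpha(\cdots(1 + \alpha)\cdots))) \le 1 + \alpha + \alpha^2 + \alpha^3 + \cdots = \sum_{i=0}^{\infty} \alpha^i = \frac{1}{1-\alpha}$ (geometric series).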

Implications of the theorem • If α is constant, then accessing an open-addressed hash table takes constant expected time. • If the table is half full, then the expected number of probes is at most? • 1/(1–0.5) = 2. • If the table is 90% full, then the expected number of probes is at most? • 1/(1–0.9) = 10. • Filling the table close to capacity makes hashing slow.
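The bound can be compared with the exact expectation from the proof, in which the (i+1)-th probe hits an occupied slot with probability (n–i)/(m–i). The sketch below evaluates that nested expression with exact fractions; the helper name `expected_probes` is introduced here.

```python
from fractions import Fraction


def expected_probes(n, m):
    """Exact value of 1 + n/m (1 + (n-1)/(m-1) (1 + ... (1 + 1/(m-n+1)) ...))."""
    total = Fraction(1)
    # Build the nested expression from the innermost factor outward.
    for i in reversed(range(n)):
        total = 1 + Fraction(n - i, m - i) * total
    return total


m = 100
for n in (50, 90):
    alpha = Fraction(n, m)
    exact = expected_probes(n, m)
    bound = 1 / (1 - alpha)
    print(n, float(exact), float(bound), exact <= bound)
```

For both load factors the exact expectation stays below 1/(1–α), as the theorem promises, and the gap widens as the table fills.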

Still Hashing • Universal hashing • Perfect hashing

A weakness of hashing • Problem: For any hash function h, there exists a bad set of keys that all hash to the same slot. • It causes the average access time of a hash table to skyrocket. • An adversary can pick all keys from {k: h(k) = i } for some slot i. • IDEA: Choose the hash function at random, independently of the keys.

Universal hashing

Universality is good • Theorem: • Let h be a hash function chosen at random from a universal set H of hash functions. • Suppose h is used to hash n arbitrary keys into the m slots of a table T. • Then for a given key x, we have: E[number of collisions with x] < n/m.

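Universality theorem • Proof. Let $C_x$ be the random variable denoting the total number of collisions of keys in T with x, and let $c_{xy} = 1$ if $h(x) = h(y)$ and $c_{xy} = 0$ otherwise. • Note: $E[c_{xy}] = 1/m$ and $C_x = \sum_{y \in T - \{x\}} c_{xy}$.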

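Universality theorem • $E[C_x] = E\left[\sum_{y \in T - \{x\}} c_{xy}\right] = \sum_{y \in T - \{x\}} E[c_{xy}] = \sum_{y \in T - \{x\}} \frac{1}{m} = \frac{n-1}{m}$, by linearity of expectation and because $E[c_{xy}] = 1/m$.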

Constructing a set of universal hash functions • One method to construct a set of universal hash functions: • Let m be prime. Decompose key k into r+1 digits, each with value in the set {0, 1, …, m–1}. • That is, let k = 〈k0, k1, …, kr〉, where 0 ≤ ki < m. • Randomized strategy: • Pick a = 〈a0, a1, …, ar〉 where each ai is chosen randomly from {0, 1, …, m–1}. • Define $h_a(k) = \sum_{i=0}^{r} a_i k_i \bmod m$.

One method of Construction • How big is H = {ha}? • $|H| = m^{r+1}$. • Theorem. The set H = {ha} is universal. • Proof. • Let x = 〈x0, x1, …, xr〉 and y = 〈y0, y1, …, yr〉 be distinct keys. • Thus, they differ in at least one digit position. • Without loss of generality, position 0. • For how many ha ∈ H do x and y collide?

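One method of Construction • $h_a(x) = h_a(y)$, which implies that $\sum_{i=0}^{r} a_i x_i \equiv \sum_{i=0}^{r} a_i y_i \pmod{m}$. • Equivalently, we have $\sum_{i=0}^{r} a_i (x_i - y_i) \equiv 0 \pmod{m}$, or $a_0 (x_0 - y_0) + \sum_{i=1}^{r} a_i (x_i - y_i) \equiv 0 \pmod{m}$.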

Fact from number theory

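Back to the proof • We just have $a_0 (x_0 - y_0) \equiv -\sum_{i=1}^{r} a_i (x_i - y_i) \pmod{m}$, and since $x_0 \ne y_0$, an inverse $(x_0 - y_0)^{-1}$ must exist, which implies that $a_0 \equiv \left(-\sum_{i=1}^{r} a_i (x_i - y_i)\right) \cdot (x_0 - y_0)^{-1} \pmod{m}$. • Thus, for any choices of a1, a2, …, ar, exactly one choice of a0 causes x and y to collide.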

Proof • How many ha will cause x and y to collide? • There are m choices for each of a1, a2, …, ar, but once these are chosen, exactly one choice for a0 causes x and y to collide. • Thus, the number of h that cause x and y to collide is $m^r \cdot 1 = m^r = |H|/m$.
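The count can be confirmed by brute force on a small instance. The sketch assumes m = 5 and r = 2 (three-digit keys) and enumerates every vector a; the keys x and y are arbitrary distinct examples chosen here.

```python
from itertools import product


def h(a, key, m):
    """Dot-product hash: h_a(k) = (sum of a_i * k_i) mod m."""
    return sum(ai * ki for ai, ki in zip(a, key)) % m


def count_colliding(x, y, m):
    """Number of vectors a for which h_a(x) == h_a(y)."""
    return sum(
        1
        for a in product(range(m), repeat=len(x))
        if h(a, x, m) == h(a, y, m)
    )


m = 5                 # prime
x = (1, 2, 3)         # r + 1 = 3 digits, each in {0, ..., m-1}
y = (4, 2, 0)
family_size = m ** len(x)
print(count_colliding(x, y, m), family_size // m)  # 25 25
```

Exactly |H|/m of the 125 functions make x and y collide, so a randomly chosen ha collides them with probability 1/m.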

Perfect hashing • Requirement: Given a set of n keys, construct a static hash table of size m = O(n) such that SEARCH takes Θ(1) time in the worst case. • IDEA: Two-level scheme with universal hashing at both levels. No collisions at level 2!

Example of Perfect hashing

Collisions at level 2 • Theorem. Let H be a class of universal hash functions for a table of size $m = n^2$. If we use a random h ∈ H to hash n keys into the table, the expected number of collisions is at most 1/2. • Proof. By the definition of universality, the probability that two given keys collide under h is $1/m = 1/n^2$. There are $\binom{n}{2}$ pairs of keys that can possibly collide, so the expected number of collisions is $\binom{n}{2} \cdot \frac{1}{n^2} = \frac{n(n-1)}{2} \cdot \frac{1}{n^2} < \frac{1}{2}$.

A fact from probability theory • Markov’s inequality says that for any nonnegative random variable X and any t > 0, we have Pr{X ≥ t} ≤ E[X]/t. • Theorem. The probability of no collisions is at least 1/2. • Proof. Applying this inequality with t = 1, we find that the probability of 1 or more collisions is at most 1/2. • Conclusion: Just by testing random hash functions in H, we’ll quickly find one that works.
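The "keep trying random hash functions" step can be sketched as follows. Note the substitution: instead of the dot-product family, this sketch uses the family h(k) = ((a⋅k + b) mod p) mod m, which is also universal and does not require m to be prime. The prime `P`, the key set, and the seed are assumptions made for the illustration.

```python
import random

P = 2_147_483_647   # the prime 2^31 - 1; assumes every key is smaller than P


def random_hash(m, rng):
    """Draw h(k) = ((a*k + b) mod P) mod m at random from the family."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda k: ((a * k + b) % P) % m


def build_collision_free(keys, rng):
    """Retry random hash functions into n^2 slots until none collide."""
    m = len(keys) ** 2
    trials = 0
    while True:
        trials += 1
        h = random_hash(m, rng)
        if len({h(k) for k in keys}) == len(keys):   # all slots distinct
            return h, m, trials


rng = random.Random(2013)
keys = [10, 22, 37, 40, 52, 60, 70, 72, 75]          # distinct keys
h, m, trials = build_collision_free(keys, rng)
print(m, trials, sorted(h(k) for k in keys))
```

Each trial succeeds with probability at least 1/2, so the expected number of trials is at most 2. The quadratic table size is affordable only because the second level is applied to the small group of keys that share a first-level slot.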
