WEEK 2 Hashing Part II

CE222 – Data Structures & Algorithms II
Chapter 5.4-5.6 (based on the book by M. A. Weiss, Data Structures and Algorithm Analysis in C++, 3rd edition, 2006)

OUTLINE
HANDLING COLLISIONS
• Separate Chaining (Previous Week)
• Open Addressing
  • Linear Probing
  • Quadratic Probing
  • Double Hashing
CE 222-Data Structures & Algorithms II, Izmir University of Economics

Separate Chaining
• Separate chaining has the disadvantage of using linked lists.
  • Requires the implementation of a second data structure.
• In an open addressing hashing system, all the data go inside the table.
  • Thus, a bigger table is needed.
  • Generally the load factor (λ) should be below 0.5.
  • If a collision occurs, alternative cells are tried until an empty cell is found.

Open Addressing
• More formally:
  • Cells h_0(x), h_1(x), h_2(x), … are tried in succession, where h_i(x) = (hash(x) + f(i)) mod TableSize, with f(0) = 0.
  • The function f is the collision resolution strategy.
• There are three common collision resolution strategies:
  • Linear probing: f is a linear function
  • Quadratic probing: f is a quadratic function
  • Double hashing: f includes a second hash function
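The definition of h_i(x) can be sketched in C++ as follows. This is a minimal illustration rather than code from the course: `probeSequence`, `linearF` and `quadraticF` are hypothetical names, and hash(x) = x mod TableSize is assumed, as in the examples in these slides.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical helper: returns the first `count` cells h_0(x), h_1(x), ...
// for a key x, given a collision resolution function f with f(0) = 0.
std::vector<int> probeSequence(int x, int tableSize, int count,
                               const std::function<int(int)>& f) {
    assert(x >= 0 && tableSize > 0);
    std::vector<int> cells;
    for (int i = 0; i < count; ++i) {
        // h_i(x) = (hash(x) + f(i)) mod TableSize, with hash(x) = x mod TableSize
        cells.push_back((x % tableSize + f(i)) % tableSize);
    }
    return cells;
}

// The strategies differ only in the choice of f.
int linearF(int i) { return i; }         // linear probing
int quadraticF(int i) { return i * i; }  // quadratic probing
```

For instance, `probeSequence(49, 10, 4, linearF)` yields cells 9, 0, 1, 2, while `probeSequence(49, 10, 4, quadraticF)` yields 9, 0, 3, 8.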

Linear Probing
• Collisions are resolved by sequentially scanning the array (with wraparound) until an empty cell is found.
• f is a linear function of i, typically f(i) = i.
• Example:
  • Insert items with keys 89, 18, 49, 58, 9 into an empty hash table.
  • Table size is 10 (not prime!).
  • Hash function is hash(x) = x mod 10.
  • f(i) = i

Linear Probing: Example (Insertion)
Keys are 89, 18, 49, 58, 9, in order.
1. Hash0(89) = Hash(89) = 89 % 10 = 9
2. Hash0(18) = Hash(18) = 18 % 10 = 8
3. Hash0(49) = 49 % 10 = 9 (cell 9 is active, so apply Hash1(49))
   Hash1(49) = (Hash(49) + 1) % 10 = 0
4. Hash0(58) = 58 % 10 = 8
   Hash1(58) = (Hash(58) + 1) % 10 = 9
   Hash2(58) = (Hash(58) + 2) % 10 = 0
   Hash3(58) = (Hash(58) + 3) % 10 = 1
5. Hash0(9) = 9 % 10 = 9
   Hash1(9) = (Hash(9) + 1) % 10 = 0
   Hash2(9) = (Hash(9) + 2) % 10 = 1
   Hash3(9) = (Hash(9) + 3) % 10 = 2
Resulting table (cell: key): 0: 49, 1: 58, 2: 9, 8: 18, 9: 89
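The insertion trace above can be reproduced with a short sketch. The names `insertLinear`, `buildLinearExample` and the `EMPTY` marker are assumptions made for illustration; keys are assumed to be non-negative.

```cpp
#include <cassert>
#include <vector>

// Assumed marker for an empty cell (keys are non-negative in this sketch).
const int EMPTY = -1;

// Inserts `key` using linear probing, f(i) = i.
// Returns the cell that was used, or -1 if every cell is occupied.
int insertLinear(std::vector<int>& table, int key) {
    assert(key >= 0 && !table.empty());
    const int n = static_cast<int>(table.size());
    for (int i = 0; i < n; ++i) {
        int cell = (key % n + i) % n;  // h_i(key)
        if (table[cell] == EMPTY) {
            table[cell] = key;
            return cell;
        }
    }
    return -1;
}

// Builds the table from the example: keys 89, 18, 49, 58, 9, TableSize = 10.
std::vector<int> buildLinearExample() {
    std::vector<int> table(10, EMPTY);
    for (int key : {89, 18, 49, 58, 9}) insertLinear(table, key);
    return table;
}
```

After `buildLinearExample()`, cells 0, 1 and 2 hold 49, 58 and 9, and cells 8 and 9 hold 18 and 89; the block 8, 9, 0, 1, 2 is already a small cluster.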

Linear Probing: Search and Delete
• The search (find) and delete algorithms follow the same probe sequence as the insert algorithm.
  • A find for 89 would involve 1 probe.
  • A find for 58 would involve 4 probes.
• We must use lazy deletion (i.e., marking items as deleted).
  • Standard deletion (i.e., physically removing the item) cannot be performed.
  • e.g., if 89 were physically removed from the hash table, cell 9 would become empty, and later finds for 49, 58 and 9 would stop there and fail.

Linear Probing Deletion Problem -- SOLUTION
• Standard deletion cannot be performed in a probing hash table, because the cell might have caused a collision to go past it.
• “LAZY DELETION”
  • Each cell is in one of 3 possible states:
    • active
    • empty
    • deleted
• For Find or Delete
  • only stop the search when a cell in the EMPTY state is detected (not DELETED), or when the key is hit in any state

Linear Probing: Deletion-Aware Algorithms
• Insert
  • call Find
  • if cell = active: the key is already there
  • if cell = deleted or empty: insert, cell = active
• Find
  • cell empty: NOT FOUND
  • cell deleted: if key == key in cell, NOT FOUND; else H = (H + 1) mod TableSize
  • cell active: if key == key in cell, FOUND; else H = (H + 1) mod TableSize
• Delete
  • call Find
  • if cell active: DELETE; cell = deleted
  • if cell deleted: NOT FOUND
  • if cell empty: NOT FOUND
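The three algorithms above might be sketched as a small class. This is one possible reading of the slide, not a reference implementation: `LazyTable`, `Cell`, `State` and `findPos` are hypothetical names, keys are assumed non-negative, and rehashing is left out.

```cpp
#include <cassert>
#include <vector>

enum class State { Empty, Active, Deleted };

struct Cell {
    int key = 0;
    State state = State::Empty;
};

// Minimal sketch of a linear probing table with lazy deletion.
class LazyTable {
public:
    explicit LazyTable(int size) : cells_(size) { assert(size > 0); }

    bool contains(int key) const {
        int pos = findPos(key);
        return pos >= 0 && cells_[pos].state == State::Active;
    }

    // Returns false if the key is already active or no usable cell exists.
    bool insert(int key) {
        int pos = findPos(key);
        if (pos < 0 || cells_[pos].state == State::Active) return false;
        cells_[pos].key = key;
        cells_[pos].state = State::Active;
        return true;
    }

    // Lazy deletion: the cell is only marked, so it stays on the probe path.
    bool remove(int key) {
        int pos = findPos(key);
        if (pos < 0 || cells_[pos].state != State::Active) return false;
        cells_[pos].state = State::Deleted;
        return true;
    }

private:
    // Returns the cell holding `key` (in any state) or the first empty cell
    // on the probe path; -1 if all cells were probed without finding either.
    int findPos(int key) const {
        assert(key >= 0);
        const int n = static_cast<int>(cells_.size());
        int pos = key % n;
        for (int i = 0; i < n; ++i) {
            const Cell& c = cells_[pos];
            if (c.state == State::Empty || c.key == key) return pos;
            pos = (pos + 1) % n;  // H = (H + 1) mod TableSize
        }
        return -1;
    }

    std::vector<Cell> cells_;
};
```

With the keys 89, 18, 49, 58, 9 in a table of size 10, removing 89 only marks cell 9 as deleted, so 49, 58 and 9 remain reachable because the search does not stop at a deleted cell.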

Linear Probing: Performance Analysis 1/3
• As long as the table is big enough, a free cell can always be found, but the time to do so can get quite large.
• Worse, even if the table is relatively empty, blocks of occupied cells start forming. Elements tend to cluster in dense intervals in the array.
• This effect is known as primary clustering.
• Any key that hashes into the cluster will require several attempts to resolve the collision, and then it will add to the cluster.

Linear Probing: Performance Analysis 2/3
• Assumptions: the table is relatively empty, but blocks of occupied cells start forming (primary clustering).
• Expected number of probes:
  • For insertions and unsuccessful searches: (1 + 1/(1 − λ)^2) / 2
  • For successful searches: (1 + 1/(1 − λ)) / 2
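To get a feel for these two formulas, they can be evaluated for a few load factors. The function names below are hypothetical; the sketch only transcribes the expressions from the slide.

```cpp
#include <cassert>

// Expected probes for an insertion or unsuccessful search with linear
// probing at load factor lambda: (1 + 1/(1 - lambda)^2) / 2.
double linearInsertProbes(double lambda) {
    assert(lambda >= 0.0 && lambda < 1.0);
    const double d = 1.0 - lambda;
    return 0.5 * (1.0 + 1.0 / (d * d));
}

// Expected probes for a successful search with linear probing:
// (1 + 1/(1 - lambda)) / 2.
double linearSuccessfulProbes(double lambda) {
    assert(lambda >= 0.0 && lambda < 1.0);
    return 0.5 * (1.0 + 1.0 / (1.0 - lambda));
}
```

At λ = 0.5 these give 2.5 and 1.5 probes; at λ = 0.75 they give 8.5 and 2.5, which illustrates why the load factor should stay low.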

Linear Probing: Performance Analysis 3/3
• Assumptions: clustering is not a problem, the table is large, and probes are independent of each other.
• Expected # of probes for unsuccessful searches
  (= expected # of probes until an empty cell is found): 1 / (1 − λ), since the fraction of empty cells is 1 − λ
• Expected # of probes for successful searches
  (= expected # of probes when the element was inserted)
  (= expected # of probes for an unsuccessful search at the load factor of that time)
• Average cost of an insertion, averaged over load factors from 0 to λ (earlier insertions are cheaper): I(λ) = (1/λ) ∫_0^λ 1/(1 − x) dx = (1/λ) ln(1/(1 − λ))
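Under the idealized assumptions on this slide (independent probes, no clustering), the standard textbook results are 1/(1 − λ) probes for an unsuccessful search and (1/λ) ln(1/(1 − λ)) for a successful one. The sketch below simply evaluates those two expressions; the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Expected probes for an unsuccessful search when probes are independent:
// each probe hits an empty cell with probability 1 - lambda.
double randomUnsuccessfulProbes(double lambda) {
    assert(lambda >= 0.0 && lambda < 1.0);
    return 1.0 / (1.0 - lambda);
}

// Expected probes for a successful search: the average insertion cost over
// load factors 0..lambda, (1/lambda) * ln(1 / (1 - lambda)).
double randomSuccessfulProbes(double lambda) {
    assert(lambda > 0.0 && lambda < 1.0);
    return std::log(1.0 / (1.0 - lambda)) / lambda;
}
```

At λ = 0.75 an unsuccessful search costs 4 probes in this model, compared with 8.5 from the linear probing formula, so the difference can be attributed to primary clustering.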

Linear Probing: Analysis Example
• What is the average number of probes for a successful search and an unsuccessful search for this hash table?
• Hash function: h(x) = x mod 11
• Table contents (cell: key): 0: 9, 2: 2, 3: 13, 4: 25, 5: 24, 8: 30, 9: 20, 10: 10; cells 1, 6 and 7 are empty
Successful search (key: cells probed):
• 20: 9 -- 30: 8 -- 2: 2 -- 13: 2, 3 -- 25: 3, 4
• 24: 2, 3, 4, 5 -- 10: 10 -- 9: 9, 10, 0
Avg. probes for SS = (1+1+1+2+2+4+1+3)/8 = 15/8
Unsuccessful search (home cell: cells probed):
• We assume that the hash function uniformly distributes the keys.
• 0: 0, 1 -- 1: 1 -- 2: 2, 3, 4, 5, 6 -- 3: 3, 4, 5, 6
• 4: 4, 5, 6 -- 5: 5, 6 -- 6: 6 -- 7: 7 -- 8: 8, 9, 10, 0, 1
• 9: 9, 10, 0, 1 -- 10: 10, 0, 1
Avg. probes for US = (2+1+5+4+3+2+1+1+5+4+3)/11 = 31/11
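The two totals can be checked mechanically. The sketch below assumes the keys were inserted in the order in which they are listed (20, 30, 2, 13, 25, 24, 10, 9), which reproduces the probe paths shown; all names are hypothetical.

```cpp
#include <cassert>
#include <vector>

const int kEmpty = -1;  // assumed marker for an empty cell

// Inserts `key` with linear probing and returns the number of probes used,
// which equals the cost of a later successful search for that key.
int insertCountingProbes(std::vector<int>& table, int key) {
    assert(key >= 0 && !table.empty());
    const int n = static_cast<int>(table.size());
    for (int i = 0; i < n; ++i) {
        int cell = (key % n + i) % n;
        if (table[cell] == kEmpty) {
            table[cell] = key;
            return i + 1;
        }
    }
    return n;  // table full; not reached in this example
}

// Probes for an unsuccessful search starting at `home`: scan to the first empty cell.
int unsuccessfulProbes(const std::vector<int>& table, int home) {
    const int n = static_cast<int>(table.size());
    for (int i = 0; i < n; ++i) {
        if (table[(home + i) % n] == kEmpty) return i + 1;
    }
    return n;
}

struct ProbeTotals { int successful; int unsuccessful; };

// Totals for the example: TableSize = 11, h(x) = x mod 11.
ProbeTotals analyzeExample() {
    std::vector<int> table(11, kEmpty);
    ProbeTotals totals{0, 0};
    for (int key : {20, 30, 2, 13, 25, 24, 10, 9})
        totals.successful += insertCountingProbes(table, key);
    for (int home = 0; home < 11; ++home)
        totals.unsuccessful += unsuccessfulProbes(table, home);
    return totals;
}
```

`analyzeExample()` returns totals of 15 and 31, matching the averages 15/8 and 31/11.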

Quadratic Probing
• Quadratic probing eliminates the primary clustering problem of linear probing.
• The collision function is quadratic.
• The popular choice is f(i) = i^2, so h_i(x) = (hash(x) + i^2) mod TableSize
• Example:
  • Insert items with keys 89, 18, 49, 58, 9 into an empty hash table.
  • Table size is 10.
  • Hash function is hash(x) = x mod 10.
  • f(i) = i^2

Quadratic Probing: Example (Insertion)
Keys are 89, 18, 49, 58, 9, in order.
1. Hash0(89) = Hash(89) = 89 % 10 = 9
2. Hash0(18) = Hash(18) = 18 % 10 = 8
3. Hash0(49) = 49 % 10 = 9 (cell 9 is active, so apply Hash1(49))
   Hash1(49) = (Hash(49) + 1) % 10 = 0
4. Hash0(58) = 58 % 10 = 8
   Hash1(58) = (Hash(58) + 1) % 10 = 9
   Hash2(58) = (Hash(58) + 2*2) % 10 = 2
5. Hash0(9) = 9 % 10 = 9
   Hash1(9) = (Hash(9) + 1) % 10 = 0
   Hash2(9) = (Hash(9) + 2*2) % 10 = 3
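The same example can be run with f(i) = i^2. As before, this is only a sketch with hypothetical names (`insertQuadratic`, `buildQuadraticExample`, `kEmptyCell`) and non-negative keys. Note that the last key, 9, lands in cell 3, since (9 + 4) mod 10 = 3.

```cpp
#include <cassert>
#include <vector>

const int kEmptyCell = -1;  // assumed marker for an empty cell

// Inserts `key` using quadratic probing, f(i) = i*i.
// Gives up after TableSize probes and returns -1, because quadratic probing
// is not guaranteed to reach every cell.
int insertQuadratic(std::vector<int>& table, int key) {
    assert(key >= 0 && !table.empty());
    const int n = static_cast<int>(table.size());
    for (int i = 0; i < n; ++i) {
        int cell = (key % n + i * i) % n;  // h_i(key)
        if (table[cell] == kEmptyCell) {
            table[cell] = key;
            return cell;
        }
    }
    return -1;
}

// Builds the table from the example: keys 89, 18, 49, 58, 9, TableSize = 10.
std::vector<int> buildQuadraticExample() {
    std::vector<int> table(10, kEmptyCell);
    for (int key : {89, 18, 49, 58, 9}) insertQuadratic(table, key);
    return table;
}
```

The resulting table holds 49 in cell 0, 58 in cell 2, 9 in cell 3, 18 in cell 8 and 89 in cell 9.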

Quadratic Probing
• Problem:
  • We may not be sure that we will probe all locations in the table (i.e., there is no guarantee of finding an empty cell if the table is more than half full).
  • If the hash table size is not prime, this problem is much more severe.
• However, there is a theorem stating that: if quadratic probing is used and the table size is prime, then a new element can always be inserted if the table is at least half empty.

Quadratic Probing: Proof of Theorem
• Let M be the size of the table, with M an odd prime. We show that the first ⌈M/2⌉ alternative locations (including the initial one) are distinct.
• Let two of these locations be h + i^2 and h + j^2 (mod M), where i, j are two probe numbers s.t. 0 ≤ i, j ≤ ⌊M/2⌋. Suppose, for the sake of contradiction, that these two locations are the same but i ≠ j. Then
  h + i^2 = h + j^2 (mod M)
  i^2 = j^2 (mod M)
  i^2 - j^2 = 0 (mod M)
  (i - j)(i + j) = 0 (mod M)
• Because M is prime, either (i - j) or (i + j) is divisible by M. Neither of these possibilities can occur: since i ≠ j and 0 ≤ i, j ≤ ⌊M/2⌋, we have 0 < |i - j| < M and 0 < i + j < M. Thus we obtain a contradiction.
• It follows that the first ⌈M/2⌉ alternative locations are all distinct, and since at most ⌊M/2⌋ cells are occupied when the table is at least half empty, an insertion is guaranteed to succeed.
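The distinctness claim can also be checked empirically for small table sizes. The helper below is a hypothetical sketch; it takes the home cell h as 0, which does not change whether two probe locations coincide.

```cpp
#include <cassert>
#include <set>

// Returns true if the cells (i*i) % m for i = 0 .. floor(m/2) are all distinct.
bool firstHalfDistinct(int m) {
    assert(m > 0);
    std::set<int> seen;
    for (int i = 0; i <= m / 2; ++i) {
        // insert(...).second is false when the cell was already visited
        if (!seen.insert((i * i) % m).second) return false;
    }
    return true;
}
```

For the primes 7, 11 and 13 the check succeeds, as the theorem predicts. For M = 16 it fails, because probe i = 4 returns to the home cell.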

Quadratic Probing: Analysis
• Tends to distribute keys better than linear probing; alleviates the problem of primary clustering.
• There remains the problem of secondary clustering, in which elements that hash to the same position will probe the same alternative cells.
• Runs the risk of an infinite loop on insertion, unless precautions are taken.
  • e.g., consider inserting the key 16 into a table of size 16, with positions 0, 1, 4 and 9 already occupied.
• Therefore, the table size should be prime.
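The infinite-loop example can be made concrete by listing every cell that quadratic probing can reach for a given key. `quadraticReach` is a hypothetical helper; it relies on the fact that (i + n)^2 = i^2 (mod n), so trying i = 0 .. n-1 covers all possible probes.

```cpp
#include <cassert>
#include <set>

// All cells reachable by quadratic probing, f(i) = i*i, for `key`
// in a table of size `tableSize`, with hash(x) = x mod tableSize.
std::set<int> quadraticReach(int key, int tableSize) {
    assert(key >= 0 && tableSize > 0);
    std::set<int> cells;
    for (int i = 0; i < tableSize; ++i) {
        cells.insert((key % tableSize + i * i) % tableSize);
    }
    return cells;
}
```

For key 16 and table size 16 the reachable cells are exactly 0, 1, 4 and 9, so if those four are occupied the insertion can never succeed, even though 12 cells are free. With a prime size such as 17, the same key reaches 9 distinct cells, more than half of the table.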

Double Hashing
• f(i) = i * hash2(x) is a popular choice.
  • hash2(x) should never evaluate to zero.
• Now the increment is a function of the key.
  • The slots visited by the probe sequence will vary even if the initial slot was the same.
  • Avoids clustering.
• Theoretically interesting, but in practice slower than quadratic probing, because of the need to evaluate a second hash function.

Double Hashing
• Typical second hash function
  • hash2(x) = R − (x % R)
  • where R is a prime number, R < N (N = TableSize)

Double Hashing: Example 1
Where do you store 99? R = 11; TableSize = 15; hash2(x) = 11 − (x % 11)
• hash(99) = 99 % 15 = 9 = t (cell 9 is active)
• hash2(99) = 11 − (99 % 11) = 11 = d
• (hash(99) + 1*hash2(99)) % 15 = (9 + 11) % 15 = 5 (active)
• (hash(99) + 2*hash2(99)) % 15 = (9 + 22) % 15 = 1 (active)
• (hash(99) + 3*hash2(99)) % 15 = (9 + 33) % 15 = 12 (empty: insert 99)
Note: with R = 11 and N = 15, attempt to store the key in array elements (t + d) % N, (t + 2d) % N, (t + 3d) % N, …

Double Hashing: Example 2
Where do you store 127? R = 11; TableSize = 15; hash2(x) = 11 − (x % 11)
• hash(127) = 127 % 15 = 7 = t
• hash2(127) = 11 − (127 % 11) = 5 = d
• (7 + 5) % 15, (7 + 2*5) % 15, (7 + 3*5) % 15, (7 + 4*5) % 15, (7 + 5*5) % 15, …
• 12 → 2 → 7 → 12 → 2 → … INFINITE LOOP
• The sequence only ever visits cells 7, 12 and 2, because d = 5 shares a factor with TableSize = 15. If those cells are all active, the insertion never terminates; a prime table size avoids this.
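Both double hashing examples can be reproduced with a few lines. The sketch assumes hash(x) = x mod TableSize, which is consistent with hash(99) = 9 and hash(127) = 7 for TableSize = 15; `hash2` and `doubleHashProbes` are illustrative names.

```cpp
#include <cassert>
#include <vector>

// Second hash function from the slides: hash2(x) = R - (x % R).
// For x >= 0 the result lies in 1..R, so it is never zero.
int hash2(int x, int r) { return r - (x % r); }

// Returns the first `count` cells probed for key x:
// t, (t + d) % N, (t + 2d) % N, ... with t = hash(x) and d = hash2(x).
std::vector<int> doubleHashProbes(int x, int tableSize, int r, int count) {
    assert(x >= 0 && r > 0 && r < tableSize);
    std::vector<int> cells;
    const int t = x % tableSize;
    const int d = hash2(x, r);
    for (int i = 0; i < count; ++i) {
        cells.push_back((t + i * d) % tableSize);
    }
    return cells;
}
```

`doubleHashProbes(99, 15, 11, 4)` gives 9, 5, 1, 12, while `doubleHashProbes(127, 15, 11, 6)` gives 7, 12, 2, 7, 12, 2: the second sequence cycles through only three cells.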

Rehashing
• If the table gets too full, the operations will start taking too long.
• When the load factor exceeds a threshold, double the table size (smallest prime > 2 * old table size).
• Rehash each record in the old table into the new table.
• Expensive: O(N) work done in copying.
• However, if the threshold is large (e.g., ½), then we need to rehash only once per O(N) insertions, so the cost is “amortized” constant-time.
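A rehash step might look like the following sketch. It assumes a table of non-negative integer keys with -1 as the empty marker and linear probing in the new table; `isPrime`, `nextPrime`, `rehash` and `kFree` are hypothetical names, and the load factor check that triggers the rehash is not shown.

```cpp
#include <cassert>
#include <vector>

const int kFree = -1;  // assumed marker for an empty cell

bool isPrime(int n) {
    if (n < 2) return false;
    for (int d = 2; d * d <= n; ++d) {
        if (n % d == 0) return false;
    }
    return true;
}

// Smallest prime strictly greater than n.
int nextPrime(int n) {
    do { ++n; } while (!isPrime(n));
    return n;
}

// Builds a new table of size nextPrime(2 * old size) and reinserts every key.
// Each key must be hashed again because the table size has changed.
std::vector<int> rehash(const std::vector<int>& old) {
    assert(!old.empty());
    const int n = nextPrime(2 * static_cast<int>(old.size()));
    std::vector<int> fresh(n, kFree);
    for (int key : old) {
        if (key == kFree) continue;
        int cell = key % n;
        // Terminates: the new table has more cells than there are keys.
        while (fresh[cell] != kFree) cell = (cell + 1) % n;
        fresh[cell] = key;
    }
    return fresh;
}
```

Rehashing a table of size 10 produces a table of size 23, the smallest prime above 20; the keys 89, 18, 49, 58 and 9 then land in cells 20, 18, 3, 12 and 9 without any collision.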

Factors Affecting Efficiency in Hash Tables
• Choice of hash function
• Collision resolution strategy
• Load factor
• Hashing offers excellent performance for insertion and retrieval of data.

Summary: Hash Table
• Hash table
  • Average speed: O(1)
  • Find Min/Max: No
  • Sorted input: No problems
• Use a hash table if there is any suspicion of SORTED input & NO ordering information is required.

Homework Assignments
• 5.1, 5.2, 5.12
• You are requested to study and solve the exercises. Note that these are for you to practice only. You are not to deliver the results to me.
