Intro to Hashing: Hashing, Part One
Reaching for the Perfect Search
Most of this material stolen from "File Structures" by Folk, Zoellick, and Riccardi
Searching for Data in Large Files
• Text file vs. binary file
• Unordered binary file: average search takes N/2 file operations
• Ordered binary file: average search takes log2(N) file operations, but keeping the data file sorted is costly
• Indexed file: average search takes 3 or 4 file operations
• Perfect search: search time = 1 file read
Hash Function
• Definition: a magic black box that converts a key to the file address of that record
• (Diagram: a key such as "Dannelly" goes into the hash function and a file address comes out.)
Example Hashing Function
• Key = customer's name
• Function = ASCII code of 1st letter x ASCII code of 2nd letter; keep the rightmost 3 digits of the product as the relative record number (RRN)

    Name      ASCII product     RRN
    BALL      66 x 65 = 4290    290
    LOWELL    76 x 79 = 6004    004
    TREE      84 x 82 = 6888    888
    OLIVIER   79 x 76 = 6004    004
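The toy function above can be sketched in a few lines of Python. This is an illustration only; the function name `toy_hash` is not from the slides:

```python
def toy_hash(name):
    """Toy hash from the slide: multiply the ASCII codes of the first
    two letters, then keep the rightmost three digits as the RRN."""
    product = ord(name[0]) * ord(name[1])
    return product % 1000  # rightmost 3 digits

# "LOWELL" and "OLIVIER" both hash to RRN 004 -- a collision.
```

Running it on the table's keys reproduces the RRN column, including the LOWELL/OLIVIER collision.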
Collision
• Definition: when two or more keys hash to the same address.
• Minimizing the number of collisions:
  • pick a hash function that avoids collisions, i.e. one with a seemingly random distribution
    • e.g. our previous function is terrible because letters like "E" and "L" occur frequently, while no one's name starts with "XZ"
  • spread out the records
    • 300 records in a file with space for 1,000 records will have many fewer collisions than 300 records in a file with capacity of 400
Hash Function Selection
• Our objective is to muddle the relationship between the keys and the addresses.
• Good ideas:
  • use both addition and multiplication
  • avoid integer overflow
    • so mix in some subtraction and division too
  • divide by prime numbers
Improved Hash Function
• pad the name with spaces
• fold and add pairs of letters
• mod by a prime after each add
• mod the final sum by the file size
• Example: key = "LOWELL", file size = 1,000

    L  O | W  E | L  L | (pad) | (pad) | (pad)
    76 79 | 87 69 | 76 76 | 32 32 | 32 32 | 32 32

    7,679 + 8,769 = 16,448    % 19,937 = 16,448
    16,448 + 7,676 = 24,124   % 19,937 = 4,187
    4,187 + 3,232 = 7,419     % 19,937 = 7,419
    7,419 + 3,232 = 10,651    % 19,937 = 10,651
    10,651 + 3,232 = 13,883   % 19,937 = 13,883
    13,883 % 1,000 = 883

• Why 19,937? It is the largest prime that ensures the next add will not cause integer overflow.
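The steps above can be sketched as a small Python function. This is an illustrative sketch, not the book's code; the name `fold_and_add` and the default parameters are assumptions drawn from the example:

```python
def fold_and_add(key, file_size, pad_to=12, prime=19937):
    """Fold-and-add hash from the slide: pad the key with spaces,
    fold the characters into two-character pairs, add the pairs,
    taking mod a large prime after each add to avoid overflow,
    then mod by the file size to get an address."""
    key = key.ljust(pad_to)          # pad with spaces
    total = 0
    for i in range(0, pad_to, 2):
        # e.g. "LO" -> 76 and 79 -> 7679
        pair = ord(key[i]) * 100 + ord(key[i + 1])
        total = (total + pair) % prime
    return total % file_size
```

With key "LOWELL" and a file size of 1,000, the function walks through exactly the arithmetic shown above and returns 883.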
Class Exercise
• The simplest hash function for a string is "add up all the characters, then mod by the file size"
• For example (using each letter's position in the alphabet, a = 1):
  • file size = 100 records
  • key = "pen"
  • address = (16 + 5 + 14) % 100 = 35
• Find another word with the same mapping
• Suggest an improvement to this hash function
Optimal Hash Function
• The optimal hash function for a set of keys:
  • will evenly distribute the keys across the address space, and
  • every address has an equal chance of being used.
• Uniform distribution is nearly impossible.
• (Figure: keys A through E mapped into ten addresses. In the good mapping each key lands in a different address; in the poor mapping several keys cluster on the same addresses.)
Selecting a File Size
• Suppose we have a file of 10,000 records. Finding a hash function that will take our 10,000 keys and yield 10,000 different addresses is essentially impossible.
• So, our 10,000 records are stored in a larger file.
• How much larger than 10,000? 10,500? 12,000? 50,000?
• It depends. A larger data file means:
  • more empty (wasted) space
  • fewer collisions
When Collisions Occur
• Even with a very good hash function, collisions will occur.
• We must have an algorithm to locate alternative addresses.
• Example:
  • Suppose "dog" and "cat" both hash to location 25.
  • If we add "dog" first, then dog goes in location 25.
  • If we later add "cat", where does it go?
• Same idea for searching: if cat is supposed to be at 25 but dog is there, where do we look next?
Simple Collision Resolution
• "Linear Probing" or "Progressive Overflow"
• When a key maps to an address already in use, just try the next one. If that one is in use, try the next one, and so on.
• Easy to implement.
• Usually works well, especially with a non-dense file and a good hash function.
• Can lead to clumps of records.
Clumping
• Assume these keys map to these addresses:
  • adams = 20
  • bates = 22
  • cole  = 20
  • dean  = 21
  • evans = 23
• Where will each record be placed if inserted in that order?
• Using linear probing, how many file accesses for each?
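One way to check your answers to the exercise is to simulate linear probing directly. This sketch is not from the slides; it ignores table size and wrap-around for simplicity:

```python
def linear_probe_insert(table, key, home):
    """Insert key using linear probing: start at its home address and
    walk forward until an empty slot is found.
    Returns (slot used, number of accesses)."""
    addr, accesses = home, 1
    while addr in table:
        addr += 1        # slot occupied: probe the next one
        accesses += 1
    table[addr] = key
    return addr, accesses

# Home addresses from the slide, inserted in the order given
homes = [("adams", 20), ("bates", 22), ("cole", 20),
         ("dean", 21), ("evans", 23)]
table, results = {}, {}
for key, home in homes:
    results[key] = linear_probe_insert(table, key, home)
```

Printing `results` shows where each record lands and how many accesses it costs, and makes the clump around addresses 20-24 easy to see.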
Next Class
• How many collisions are acceptable?
• Analysis: packing density vs. probe length
• Is there a collision resolution algorithm better than linear probing?
  • buckets