Intro to Hashing


### Intro to Hashing

Hashing Part One

Reaching for the Perfect Search

Most of this material stolen from "File Structures" by Folk, Zoellick, and Riccardi.

Searching for Data in Large Files
• Text File vs. Binary File
• Unordered Binary File
• average search takes N/2 file operations
• Ordered Binary File
• average search takes log₂ N file operations
• but keeping the data file sorted is costly
• Indexed File
• average search takes 3 or 4 file operations
• Perfect Search
• search time = 1 file read
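A rough sketch of how those costs compare (the record count here is illustrative, not from the slides):

```python
import math

N = 1_000_000  # illustrative record count, not from the slides

unordered = N / 2        # average sequential scan of an unordered file
ordered = math.log2(N)   # binary search of an ordered file (~20 probes)
indexed = 4              # typical multi-level index lookup
perfect = 1              # the ideal: one read straight to the record

print(unordered, ordered, indexed, perfect)
```

Even for a million records, binary search needs only about 20 reads, but the perfect search beats it by another factor of 20.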
Hash Function
• Definition:
• a magic black box that converts a key to the file address of that record

(Diagram: key "Dannelly" → Hash Function → file address)

Example Hashing Function:

• Key = Customer's Name
• Function = multiply the ASCII codes of the 1st and 2nd letters, then use the rightmost 3 digits of the product as the RRN.

| Name    | ASCII product  | RRN |
|---------|----------------|-----|
| BALL    | 66 × 65 = 4290 | 290 |
| LOWELL  | 76 × 79 = 6004 | 004 |
| TREE    | 84 × 82 = 6888 | 888 |
| OLIVIER | 79 × 76 = 6004 | 004 |
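A minimal sketch of this toy function in Python (the function name is mine; the rule and the examples are from the slide):

```python
def toy_hash(name):
    # Toy hash from the slide: multiply the ASCII codes of the
    # first two letters, keep the rightmost 3 digits as the RRN.
    product = ord(name[0]) * ord(name[1])
    return product % 1000  # rightmost 3 digits

print(toy_hash("BALL"))     # 66 * 65 = 4290 -> 290
print(toy_hash("LOWELL"))   # 76 * 79 = 6004 -> 4 (RRN 004)
print(toy_hash("OLIVIER"))  # 79 * 76 = 6004 -> 4, collides with LOWELL
```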

Collision
• Definition:
• When two or more keys hash to the same address.
• Minimizing the Number of Collisions:
• pick a hash function that avoids collisions, i.e. one with a seemingly random distribution
• e.g. our previous function is terrible because letters like "E" and "L" occur frequently, while no one's name starts with "XZ".
• 300 records in a file with space for 1000 records will have many fewer collisions than 300 records in a file with capacity of 400
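The packing-density claim can be checked with a quick simulation, modeling an ideal hash function as a uniform random address generator (the helper name and trial counts are my assumptions):

```python
import random

def expected_collisions(n_records, file_size, trials=200, seed=1):
    # Count how many inserts land on an already-used address,
    # averaged over many trials of random addressing.
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        used = set()
        for _ in range(n_records):
            addr = rng.randrange(file_size)
            if addr in used:
                total += 1
            else:
                used.add(addr)
    return total / trials

print(expected_collisions(300, 1000))  # roughly 40 collisions
print(expected_collisions(300, 400))   # roughly 90 -- many more
```

Even with a perfectly random function, the fuller file suffers about twice as many collisions.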
Hash Function Selection
• Our objective is to muddle the relationship between the keys and the addresses.
• Good Ideas:
• use both addition and multiplication
• avoid integer overflow
• so mix in some subtraction and division too
• divide by prime numbers
Improved Hash Function

Why 19,937 ?

• pad the name with spaces
• fold and add pairs of letters
• mod by a prime after each add
• mod the final sum by the file size
• Example: Key="LOWELL" and file size = 1,000

L O W E L L

76 79 | 87 69 | 76 76 | 32 32 | 32 32 | 32 32

7679 + 8769 = 16,448 % 19,937 = 16,448

16448 + 7676 = 24,124 % 19,937 = 4,187

4187 + 3232 = 7,419 % 19,937 = 7,419

7419 + 3232 = 10,651 % 19,937 = 10,651

10651 + 3232 = 13,883 % 19,937 = 13,883

13883 % 1000 = 883

19,937 is the largest prime that ensures the running sum will never cause integer overflow (each intermediate sum stays within a 16-bit signed integer).
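The whole recipe can be sketched in Python (the function name and the 12-character pad width are assumptions based on the worked example):

```python
def fold_and_add(key, file_size, prime=19937, width=12):
    key = key.ljust(width)               # pad the name with spaces
    total = 0
    for i in range(0, width, 2):
        # fold: pack a pair of letters into one number, e.g. "LO" -> 7679
        pair = ord(key[i]) * 100 + ord(key[i + 1])
        total = (total + pair) % prime   # mod by a prime after each add
    return total % file_size             # final sum mod the file size

print(fold_and_add("LOWELL", 1000))  # 883, as in the worked example
```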

Class Exercise
• The simplest hash function for a string is "add up the alphabet positions of all the characters, then mod by the file size"
• For example,
• filesize = 100 records
• key = "pen"
• address = ( 16 + 5 + 14 ) % 100 = 35
• Find another word with the same mapping
• Give an improvement to this hash function
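A quick checker for the exercise (assuming, per the example, that "characters" means alphabet positions a=1 … z=26; the function name is mine):

```python
def simple_hash(word, file_size=100):
    # Sum the alphabet positions (a=1 ... z=26), then mod the file size
    total = sum(ord(c) - ord('a') + 1 for c in word.lower())
    return total % file_size

print(simple_hash("pen"))  # (16 + 5 + 14) % 100 = 35
# Any word whose letter positions sum to 35 (mod 100) collides with
# "pen" -- looping this function over a word list finds candidates.
```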
Optimal Hash Function
• The optimal hash function for a set of keys:
• will evenly distribute the keys across the address space, and
• every address has an equal chance of being used.
• Uniform distribution is nearly impossible.

(Diagram: keys A–E mapped into addresses 1–10. In the good mapping each key lands on a different address; in the poor mapping several keys pile up on the same addresses.)

Selecting a File Size
• Suppose we have a file of 10,000 records. Finding a hash function that will take our 10,000 keys and yield 10,000 different addresses is essentially impossible.
• So, our 10,000 records are stored in a larger file.
• How much larger than 10,000?
• 10,500?
• 12,000?
• 50,000?
• It Depends
• larger datafile:
• more empty (wasted) space
• fewer collisions
When Collisions Occur
• Even with a very good hash function, collisions will occur.
• We must have an algorithm to locate alternative addresses.
• Example,
• Suppose "dog" and "cat" both hash to location 25.
• If we add "dog" first, then dog goes in location 25.
• If we later add "cat", where does it go?
• Same idea for searching. If cat is supposed to be at 25 but dog is there, where do we look next?
Simple Collision Resolution
• "Linear Probing" or "Progressive Overflow"
• When a key maps to an address already in use, just try the next one. If that one is in use, try the next one. Yadda yadda.
• Easy to implement.
• Usually works well, especially with a non-dense file and a good hash function.
• Can lead to clumps of records.
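A sketch of linear probing in memory (the table, the EMPTY marker, and forcing both keys to home address 25 are my assumptions, mirroring the dog/cat example):

```python
EMPTY = None

def insert(table, key, home):
    # Progressive overflow: from the home address, step forward
    # (wrapping at the end of the file) until a free slot appears.
    addr = home
    while table[addr] is not EMPTY:
        addr = (addr + 1) % len(table)
    table[addr] = key
    return addr

def search(table, key, home):
    # Follow the same probe sequence; hitting an EMPTY slot means
    # the key was never stored.
    addr = home
    while table[addr] is not EMPTY:
        if table[addr] == key:
            return addr
        addr = (addr + 1) % len(table)
    return -1

table = [EMPTY] * 30
print(insert(table, "dog", 25))  # 25 -- home address is free
print(insert(table, "cat", 25))  # 26 -- 25 is taken, probe the next slot
print(search(table, "cat", 25))  # 26
```

A real data file would also need to handle a completely full table and deletions (tombstones), which this sketch omits.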
Clumping
• Assume these keys map to these addresses: