## PowerPoint Slideshow about ' Intro to Hashing' - glenna


### Intro to Hashing

Hashing Part One

Reaching for the Perfect Search

Most of this material stolen from "File Structures" by Folk, Zoellick and Riccardi

Searching for Data in Large Files

- Text File v. Binary File
- Unordered Binary File
  - average search takes N/2 file operations
- Ordered Binary File
  - average search takes log2 N file operations
  - but keeping the data file sorted is costly
- Indexed File
  - average search takes 3 or 4 file operations
- Perfect Search
  - search time = 1 file read
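
To get a feel for the gap between these search costs, here is a minimal sketch that evaluates the N/2 and log2 N figures for a couple of file sizes. The function name and the sample sizes are illustrative assumptions, not part of the original slides; the indexed-file figure (3 or 4 operations) is a quoted constant rather than a formula, so it is left out.

```python
import math


def average_file_operations(n_records: int) -> dict:
    """Average file operations per search, using the costs listed above."""
    return {
        "unordered": n_records / 2,       # sequential scan: N/2
        "ordered": math.log2(n_records),  # binary search: log2 N
        "perfect": 1,                     # ideal hash: a single file read
    }


# Hypothetical file sizes, chosen only for illustration.
for n in (1_000, 1_000_000):
    print(n, average_file_operations(n))
```

Even for a million records the ordered file needs only about 20 operations, but the perfect search stays at one regardless of file size.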

Hash Function

- Definition:
- a magic black box that converts a key to the file address of that record

[Figure: "Hash Function" black box; label "Dannelly"]

- Example Hashing Function:
  - Key = Customer's Name
  - Function = ASCII code of 1st letter x ASCII code of 2nd letter, then use the rightmost 3 digits.

| Name | ASCII product | RRN |
|---|---|---|
| BALL | 66 x 65 = 4290 | 290 |
| LOWELL | 76 x 79 = 6004 | 004 |
| TREE | 84 x 82 = 6888 | 888 |
| OLIVIER | 79 x 76 = 6004 | 004 |
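
The example function can be sketched in a few lines of Python. The function name is an assumption made for illustration; keeping the rightmost three digits is done here with `% 1000`.

```python
def first_two_letters_hash(name: str) -> int:
    """Multiply the ASCII codes of the first two letters, keep the last 3 digits."""
    product = ord(name[0]) * ord(name[1])
    return product % 1000  # rightmost three digits of the product


for name in ["BALL", "LOWELL", "TREE", "OLIVIER"]:
    print(name, first_two_letters_hash(name))
```

LOWELL and OLIVIER land on the same address because multiplication ignores the order of the two letters.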

Collision

- Definition:
  - When two or more keys hash to the same address.
- Minimizing the Number of Collisions:
  - pick a hash function that avoids collisions, i.e. one with a seemingly random distribution
    - e.g. our previous function is terrible because letters like "E" and "L" occur frequently, while no one's name starts with "XZ".
  - spread out the records
    - 300 records in a file with space for 1000 records will have many fewer collisions than 300 records in a file with capacity of 400
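
The effect of spreading out the records can be estimated with a short calculation. This is a rough sketch that assumes an idealized hash function sending every key to a uniformly random, independent address, which real hash functions only approximate; the function name is hypothetical.

```python
def expected_collisions(records: int, capacity: int) -> float:
    """Expected number of records whose home address is already taken.

    Assumes each key hashes to a uniformly random, independent address.
    """
    # Expected number of distinct addresses hit by at least one record.
    expected_used = capacity * (1 - (1 - 1 / capacity) ** records)
    # Every record beyond the first at an address counts as a collision.
    return records - expected_used


for capacity in (1000, 400):
    print(capacity, round(expected_collisions(300, capacity), 1))
```

Under that assumption, the 400-slot file sees roughly twice as many collisions as the 1000-slot file for the same 300 records.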

Hash Function Selection

- Our objective is to muddle the relationship between the keys and the addresses.
- Good Ideas:
  - use both addition and multiplication
  - avoid integer overflow
    - so mix in some subtraction and division too
  - divide by prime numbers

Improved Hash Function

- pad the name with spaces
- fold and add pairs of letters
- mod by a prime after each add
- mod the final sum by the file size
- Example: Key="LOWELL" and file size = 1,000

L O W E L L

76 79 | 87 69 | 76 76 | 32 32 | 32 32 | 32 32

7,679 + 8,769 = 16,448, then 16,448 % 19,937 = 16,448

16,448 + 7,676 = 24,124, then 24,124 % 19,937 = 4,187

4,187 + 3,232 = 7,419, then 7,419 % 19,937 = 7,419

7,419 + 3,232 = 10,651, then 10,651 % 19,937 = 10,651

10,651 + 3,232 = 13,883, then 13,883 % 19,937 = 13,883

13,883 % 1,000 = 883

Why 19,937? It is a prime chosen to ensure that the next add will not cause integer overflow.
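
The fold-and-add steps can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the 12-character width, and the truncation of longer keys are assumptions, and the pair value `ord(a) * 100 + ord(b)` matches the digit concatenation in the example only for two-digit ASCII codes (upper-case letters and spaces). Python integers do not overflow, so the mod by 19,937 is kept purely to mirror the algorithm.

```python
def fold_and_add_hash(key: str, file_size: int,
                      width: int = 12, prime: int = 19937) -> int:
    """Fold-and-add hash: pad, add letter pairs, mod by a prime, mod by file size.

    Assumes an even `width` and keys made of characters with 2-digit ASCII codes.
    """
    padded = key.ljust(width)[:width]  # pad with spaces (assumed: truncate if longer)
    total = 0
    for i in range(0, width, 2):
        # "Fold" two characters into one number, e.g. 'L','O' -> 7679.
        pair = ord(padded[i]) * 100 + ord(padded[i + 1])
        total = (total + pair) % prime  # keep the running sum small
    return total % file_size            # map the sum onto the address space


print(fold_and_add_hash("LOWELL", 1000))  # 883
```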

Class Exercise

- The simplest hash function for a string is "add up all the characters, then take the remainder after dividing by filesize"
- For example, numbering the letters a = 1, b = 2, ..., z = 26:
  - filesize = 100 records
  - key = "pen"
  - address = ( 16 + 5 + 14 ) % 100 = 35
- Find another word with the same mapping
- Give an improvement to this hash function
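
For checking candidate words in the exercise, the additive hash can be written as below. The function name is hypothetical, and the letter numbering (a = 1 through z = 26) is inferred from the "pen" example, so treat it as an assumption.

```python
def additive_hash(word: str, file_size: int) -> int:
    """Sum the letter positions (a=1 ... z=26), then mod by the file size."""
    total = sum(ord(ch) - ord("a") + 1 for ch in word.lower())
    return total % file_size


print(additive_hash("pen", 100))  # 35
```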

Optimal Hash Function

- The optimal hash function for a set of keys:
  - will evenly distribute the keys across the address space, and
  - gives every address an equal chance of being used.
- Uniform distribution is nearly impossible.

[Figure: two panels, "Good Mapping" and "Poor Mapping", each mapping keys A through E onto addresses 1 through 10]

Selecting a File Size

- Suppose we have a file of 10,000 records. Finding a hash function that will take our 10,000 keys and yield 10,000 different addresses is essentially impossible.
- So, our 10,000 records are stored in a larger file.
- How much larger than 10,000?
  - 10,500?
  - 12,000?
  - 50,000?
- It Depends
  - larger datafile:
    - more empty (wasted) space
    - fewer collisions

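
One way to compare the candidate file sizes is by packing density (records stored divided by file capacity) and by the number of slots left empty. The sketch below is only a bookkeeping aid with hypothetical function names; it does not say which size is best.

```python
def packing_density(records: int, capacity: int) -> float:
    """Fraction of the file's record slots that hold a record."""
    return records / capacity


def wasted_slots(records: int, capacity: int) -> int:
    """Number of record slots that stay empty."""
    return capacity - records


for capacity in (10_500, 12_000, 50_000):
    density = packing_density(10_000, capacity)
    print(capacity, f"{density:.0%}", wasted_slots(10_000, capacity))
```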

When Collisions Occur

- Even with a very good hash function, collisions will occur.
- We must have an algorithm to locate alternative addresses.
- Example:
  - Suppose "dog" and "cat" both hash to location 25.
  - If we add "dog" first, then dog goes in location 25.
  - If we later add "cat", where does it go?
  - Same idea for searching. If cat is supposed to be at 25 but dog is there, where do we look next?

Simple Collision Resolution

- "Linear Probing" or "Progressive Overflow"
  - When a key maps to an address already in use, just try the next one. If that one is in use, try the next one, and so on.
  - Easy to implement.
  - Usually works well, especially with a non-dense file and a good hash function.
  - Can lead to clumps of records.
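
A minimal in-memory sketch of linear probing is shown below. The class and method names are hypothetical, a Python list stands in for the data file, the home address is passed in directly instead of being computed by a hash function, and wrapping around from the last slot to the first is an assumption the slides do not state.

```python
class LinearProbingFile:
    """A fixed-size table that resolves collisions by trying the next slot."""

    def __init__(self, capacity: int):
        self.slots = [None] * capacity

    def insert(self, key: str, home: int) -> int:
        """Store key at its home address or the next free slot; return the address."""
        for step in range(len(self.slots)):
            address = (home + step) % len(self.slots)  # assumed wrap-around
            if self.slots[address] is None:
                self.slots[address] = key
                return address
        raise RuntimeError("file is full")

    def search(self, key: str, home: int):
        """Probe from the home address; return the key's address or None."""
        for step in range(len(self.slots)):
            address = (home + step) % len(self.slots)
            if self.slots[address] is None:
                return None  # an empty slot ends the probe: key is absent
            if self.slots[address] == key:
                return address
        return None


table = LinearProbingFile(100)
print(table.insert("dog", 25))  # 25
print(table.insert("cat", 25))  # 26
print(table.search("cat", 25))  # 26
```

Searching follows the same path as insertion, which is why "cat" is found one slot past its home address. This sketch does not handle deletions, which would need extra care so that probes are not cut short.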

Clumping

- Assume these keys map to these addresses:
  - adams = 20
  - bates = 22
  - cole = 20
  - dean = 21
  - evans = 23
- Where will each record be placed if inserted in that order?
- Using linear probing, how many file accesses for each?
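
To check an answer to this exercise, the placements can be traced with a small helper. The helper name and the capacity of 100 are assumptions, and one "file access" is counted for every slot examined, including the one finally used.

```python
def place_with_linear_probing(keys_with_homes, capacity: int = 100):
    """Insert keys in order; return (key, final address, slots examined) tuples."""
    slots = [None] * capacity
    report = []
    for key, home in keys_with_homes:
        address, accesses = home, 1
        while slots[address] is not None:
            if accesses >= capacity:
                raise RuntimeError("file is full")
            address = (address + 1) % capacity  # try the next slot
            accesses += 1
        slots[address] = key
        report.append((key, address, accesses))
    return report


keys = [("adams", 20), ("bates", 22), ("cole", 20), ("dean", 21), ("evans", 23)]
for key, address, accesses in place_with_linear_probing(keys):
    print(key, address, accesses)
```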

Next Class

- How many collisions are acceptable?
  - Analysis: packing density v. probing length
- Is there a collision resolution algorithm better than linear probing?
  - buckets
