Intro to Hashing

PowerPoint Slideshow about 'Intro to Hashing' - glenna



Intro to Hashing

Hashing Part One

Reaching for the Perfect Search

Most of this material stolen from

"File Structures" by Folk, Zoellick and Riccardi

Searching for Data in Large Files
  • Text File v. Binary File
  • Unordered Binary File
    • average search takes N/2 file operations
  • Ordered Binary File
    • average search takes log2(N) file operations
    • but keeping the data file sorted is costly
  • Indexed File
    • average search takes 3 or 4 file operations
  • Perfect Search
    • search time = 1 file read
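To put rough numbers on these costs, the sketch below evaluates each estimate for a hypothetical file of one million records. The file size is an assumption chosen for illustration, and the indexed figure simply restates the upper end of the "3 or 4" above.

```python
import math

def average_reads(n_records):
    """Approximate average file operations per search for each organization."""
    return {
        "unordered": n_records / 2,        # sequential scan, N/2 on average
        "ordered": math.log2(n_records),   # binary search, about log2(N)
        "indexed": 4,                      # upper end of "3 or 4"
        "perfect search": 1,               # a single file read
    }

# Hypothetical file size, used only to illustrate the gap between methods.
for name, reads in average_reads(1_000_000).items():
    print(f"{name:15s} {reads:12,.1f}")
```

The gap between the first two rows and the last is the motivation for hashing.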
Hash Function
  • Definition:
    • a magic black box that converts a key to the file address of that record

[Figure: a box labeled "Hash Function", with the label "Dannelly"]


Example Hashing Function:

    • Key = Customer's Name
    • Function = ASCII code of 1st letter x ASCII code of 2nd letter, then use the rightmost 3 digits as the relative record number (RRN).

Name      ASCII product      RRN
BALL      66 x 65 = 4290     290
LOWELL    76 x 79 = 6004     004
TREE      84 x 82 = 6888     888
OLIVIER   79 x 76 = 6004     004
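The example function can be sketched in a few lines of Python. The function name is made up for illustration, and the sketch assumes every key has at least two characters; it keeps the last three digits of the product, matching the RRN column.

```python
def first_two_letters_hash(name):
    """Multiply the ASCII codes of the first two letters, keep the last 3 digits."""
    product = ord(name[0]) * ord(name[1])
    return product % 1000  # the rightmost three digits become the RRN

for name in ["BALL", "LOWELL", "TREE", "OLIVIER"]:
    print(f"{name:8s} {first_two_letters_hash(name):03d}")
```

LOWELL and OLIVIER land on the same RRN because multiplication does not care about the order of the two letters.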

Collision
  • Definition:
    • When two or more keys hash to the same address.
  • Minimizing the Number of Collisions:
    • pick a hash function that avoids collisions, i.e. one with a seemingly random distribution
      • e.g. our previous function is terrible because letters like "E" and "L" occur frequently, while no one's name starts with "XZ".
    • spread out the records
      • 300 records in a file with space for 1,000 records will have many fewer collisions than 300 records in a file with a capacity of 400
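The effect of spreading out the records can be estimated with a short calculation. The sketch below assumes an idealized hash function that sends each record to a uniformly random address, which is an assumption and not a property of any function shown here; the function name is also made up for illustration.

```python
def expected_collisions(n_records, n_addresses):
    """Expected number of records whose home address is already in use,
    assuming every address is equally likely for every record."""
    # Expected number of distinct addresses hit by n_records random choices.
    expected_used = n_addresses * (1 - (1 - 1 / n_addresses) ** n_records)
    return n_records - expected_used

print(round(expected_collisions(300, 1000)))  # file with space for 1,000
print(round(expected_collisions(300, 400)))   # file with a capacity of 400
```

Under that assumption the roomier file sees roughly 41 colliding records and the tighter file roughly 89, so more than twice as many.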
Hash Function Selection
  • Our objective is to muddle the relationship between the keys and the addresses.
  • Good Ideas:
    • use both addition and multiplication
    • avoid integer overflow
      • so mix in some subtraction and division too
    • divide by prime numbers
Improved Hash Function

  • pad the name with spaces
  • fold and add pairs of letters
  • mod by a prime after each add
  • take the final sum modulo the file size
  • Example: Key = "LOWELL" and file size = 1,000

 L  O  |  W  E  |  L  L  |        |        |
76 79  | 87 69  | 76 76  | 32 32  | 32 32  | 32 32

 7,679 + 8,769 = 16,448     16,448 % 19,937 = 16,448
16,448 + 7,676 = 24,124     24,124 % 19,937 =  4,187
 4,187 + 3,232 =  7,419      7,419 % 19,937 =  7,419
 7,419 + 3,232 = 10,651     10,651 % 19,937 = 10,651
10,651 + 3,232 = 13,883     13,883 % 19,937 = 13,883

13,883 % 1,000 = 883

  • Why 19,937?
    • 19,937 is the largest prime that ensures the next add will not cause integer overflow.
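The fold-and-add steps can be sketched as follows. The function name, the fixed width of 12 characters (taken from the padding in the example), and the truncation of longer keys are assumptions for illustration. Pairing two codes as `first * 100 + second` reproduces the concatenation in the example only for codes below 100, so the sketch assumes uppercase keys.

```python
def fold_and_add_hash(key, file_size, width=12, prime=19937):
    """Fold-and-add hash: pad, add letter pairs, mod by a prime, mod by file size."""
    padded = key.ljust(width)[:width]   # pad with spaces (or truncate) to `width`
    total = 0
    for i in range(0, width, 2):
        # Treat two ASCII codes as one number, e.g. 'L','O' -> 76, 79 -> 7679.
        pair = ord(padded[i]) * 100 + ord(padded[i + 1])
        total = (total + pair) % prime  # mod by the prime after each add
    return total % file_size            # the remainder is the address

print(fold_and_add_hash("LOWELL", 1000))  # 883
```

Tracing the loop for "LOWELL" gives running totals of 7,679, 16,448, 4,187, 7,419, 10,651 and 13,883, so the address is 13,883 % 1,000 = 883.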

Class Exercise
  • The simplest hash function for a string is "add up all the characters, then take the remainder after dividing by the file size"
  • For example,
    • filesize = 100 records
    • key = "pen"
    • address = ( 16 + 5 + 14 ) % 100 = 35, using alphabet positions p = 16, e = 5, n = 14
  • Find another word with the same mapping
  • Give an improvement to this hash function
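As a starting point for the exercise, here is a minimal sketch of that simplest function. It assumes letters are valued by alphabet position (a = 1 through z = 26), as in the "pen" example, and it ignores any character outside a-z; the function name is made up for illustration.

```python
def simple_hash(key, file_size):
    """Add up the alphabet positions of the letters, then take the remainder."""
    total = sum(ord(ch) - ord("a") + 1 for ch in key.lower() if "a" <= ch <= "z")
    return total % file_size

print(simple_hash("pen", 100))  # 16 + 5 + 14 = 35
```

Because addition ignores order, any rearrangement of the same letters produces the same address, which is one hint about where this function falls short.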
Optimal Hash Function
  • The optimal hash function for a set of keys:
    • will evenly distribute the keys across the address space, and
    • every address has an equal chance of being used.
  • Uniform distribution is nearly impossible.

[Figure: two panels, "Good Mapping" and "Poor Mapping", each mapping keys A-E to addresses 1-10]

Selecting a File Size
  • Suppose we have a file of 10,000 records. Finding a hash function that will take our 10,000 keys and yield 10,000 different addresses is essentially impossible.
  • So, our 10,000 records are stored in a larger file.
  • How much larger than 10,000?
    • 10,500?
    • 12,000?
    • 50,000?
  • It Depends
    • larger datafile:
      • more empty (wasted) space
      • fewer collisions
When Collisions Occur
  • Even with a very good hash function, collisions will occur.
  • We must have an algorithm to locate alternative addresses.
  • Example:
    • Suppose "dog" and "cat" both hash to location 25.
    • If we add "dog" first, then dog goes in location 25.
    • If we later add "cat", where does it go?
    • Same idea for searching. If cat is supposed to be at 25 but dog is there, where do we look next?
Simple Collision Resolution
  • "Linear Probing" or "Progressive Overflow"
  • When a key maps to an address already in use, just try the next one. If that one is in use, try the one after it, and so on.
  • Easy to implement.
  • Usually works well, especially with a non-dense file and a good hash function.
  • Can lead to clumps of records.
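Linear probing can be sketched with a Python list standing in for the file, where `None` marks an empty slot. The function names are made up for illustration, and wrapping from the last address back to address 0 is an assumption the slides do not spell out.

```python
def insert(table, key, home):
    """Store key at its home address, or at the next free slot after it."""
    for step in range(len(table)):
        slot = (home + step) % len(table)   # wrap around at the end (assumed)
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("file is full")

def search(table, key, home):
    """Probe from the home address until the key or an empty slot turns up."""
    for step in range(len(table)):
        slot = (home + step) % len(table)
        if table[slot] is None:
            return None                     # empty slot: the key was never stored
        if table[slot] == key:
            return slot
    return None                             # probed every slot without a match

table = [None] * 100
insert(table, "dog", 25)
insert(table, "cat", 25)
print(search(table, "cat", 25))  # 26
```

With "dog" and "cat" both hashing to 25, "dog" takes 25 and "cat" settles in 26; a search for "cat" starts at 25, finds "dog", and moves on to 26.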
Clumping
  • Assume these keys map to these addresses:
    • adams = 20
    • bates = 22
    • cole = 20
    • dean = 21
    • evans = 23
  • Where will each record be placed if inserted in that order?
  • Using linear probing, how many file accesses for each?
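One way to check an answer is to simulate the insertions. The sketch below takes the home addresses above as given, assumes a file of 100 slots (the file size is not stated), and counts one file access per slot examined; the function name is made up for illustration.

```python
def place_records(homes, file_size=100):
    """Insert keys in order with linear probing; return {key: (slot, accesses)}."""
    slots = [None] * file_size
    placed = {}
    for key, home in homes:
        slot, accesses = home, 1
        # Step forward until a free slot is found (assumes the file is not full).
        while slots[slot] is not None:
            slot = (slot + 1) % file_size
            accesses += 1
        slots[slot] = key
        placed[key] = (slot, accesses)
    return placed

homes = [("adams", 20), ("bates", 22), ("cole", 20), ("dean", 21), ("evans", 23)]
for key, (slot, accesses) in place_records(homes).items():
    print(f"{key:6s} slot {slot}  accesses {accesses}")
```

Under these assumptions the records occupy slots 20 through 24, and dean needs three accesses even though no other key shares its home address of 21: the clump started by adams and cole has spilled over it.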
Next Class
  • How many collisions are acceptable?
    • Analysis: packing density v. probing length
  • Is there a collision resolution algorithm better than linear probing?
    • buckets