# More Hashing





## Presentation Transcript

### More Hashing

Hashing Part Two

Better Collision Resolution

Small parts of this material stolen from "File Organization and Access" by Austing and Cassel.

### Recap of Last Class
• Hash function converts key to file address
• Collision is when two or more keys hash to the same address
• Collision Avoidance
• Good hash function spreads the keys evenly across the whole address space
• Non-dense file decreases the chance of collisions and the number of probes after a collision
### Recap: Linear Probing
• Very simple collision resolution
• if H(key) = A, and A is already in use, try A+1, then A+2, etc.
• easy to implement
• guaranteed to use all addresses
• drawback: clustering / clumping
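
To make the probe sequence concrete, here is a minimal linear-probing sketch in Python; the table size, hash function, and record representation are illustrative assumptions, not from the slides.

```python
TABLE_SIZE = 100                      # assumed number of file addresses
table = [None] * TABLE_SIZE

def home_address(key):
    """Stand-in hash function: sum of character codes, mod table size."""
    return sum(ord(c) for c in key) % TABLE_SIZE

def insert(key):
    """Linear probing: if H(key) = A is taken, try A+1, A+2, ... (wrapping)."""
    a = home_address(key)
    for step in range(TABLE_SIZE):    # wrapping guarantees every address is tried
        slot = (a + step) % TABLE_SIZE
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("file is full")

def search(key):
    """Retrace the same probe sequence; an empty slot means the key is absent."""
    a = home_address(key)
    for step in range(TABLE_SIZE):
        slot = (a + step) % TABLE_SIZE
        if table[slot] is None:
            return None
        if table[slot] == key:
            return slot
    return None
```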
### Clumping
• Given the following hashes and linear probing:
• bates = 22
• cole = 20
• dean = 21
• evans = 23
• Result of either
• poor hash function
• dense file
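
The effect can be traced with the slide's hash values. In the sketch below, a fifth key ("frank", hashing to 20) is a hypothetical addition: it lands inside the clump and needs five probes to find a free address.

```python
# Hash values taken from the slide; "frank" and its hash are hypothetical,
# added only to show what happens to a key that lands inside the clump.
hashes = {"bates": 22, "cole": 20, "dean": 21, "evans": 23, "frank": 20}
table = {}

def insert_linear(key):
    a = hashes[key]
    probes = 1
    while a in table:      # linear probing: step to the next address
        a += 1
        probes += 1
    table[a] = key
    return a, probes

for k in ("cole", "dean", "bates", "evans"):
    insert_linear(k)       # each lands at home: addresses 20-23 form a clump

print(insert_linear("frank"))   # (24, 5) -- five probes to get past the clump
```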
### Random Probing
• True random would not work. Instead use pseudo-random.

While A is in use:

A = (A + R) mod T

R = a prime constant

T = table size

But what if 25 and 30 already had keys directly hashed to those locations? Cole would be at 35 -- 4 probes away.

• bates = 22
• cole = 20
• dean = 21
• evans = 23
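
A sketch reproducing the slide's scenario. R = 5 and T = 100 are assumptions chosen so the probe sequence is 20 -> 25 -> 30 -> 35; the slides only require that R be prime.

```python
R, T = 5, 100   # assumed: prime step and table size matching the example

# Assume cole's home address 20 is already taken, and that 25 and 30
# already hold directly-hashed keys, as in the slide's scenario:
table = {20: "earlier key", 25: "x", 30: "y"}

def insert_random(key, home):
    a, probes = home, 1
    while a in table:
        a = (a + R) % T    # pseudo-random step instead of +1
        probes += 1
    table[a] = key
    return a, probes

print(insert_random("cole", 20))   # (35, 4) -- address 35, four probes away
```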
### Chaining
• Assuming a better hash function and a less dense file are not options...
• And assuming linear and random probing lead to coalesced lists...

19 : null

20 : 35 -> null

21 : null

• Advantage: Faster at resolving collisions
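
Read the picture above as: address 19 is empty, and the chain rooted at address 20 continues with an overflow record stored at address 35. Below is a minimal sketch of that scheme, where each slot holds a (record, next_address) pair; the reserved overflow area and the second colliding key are assumptions.

```python
T = 100
OVERFLOW_START = 35           # assumed reserved overflow area, to match the diagram
table = [None] * T            # each slot: (record, next_address) or None

def insert_chained(key, home):
    if table[home] is None:                # home address free: start a chain
        table[home] = (key, None)
        return home
    a = home
    while table[a][1] is not None:         # walk this key's chain to its tail
        a = table[a][1]
    free = next(i for i in range(OVERFLOW_START, T) if table[i] is None)
    table[free] = (key, None)              # store the overflow record there...
    table[a] = (table[a][0], free)         # ...and link it onto the chain
    return free

def search_chained(key, home):
    a = home                               # only this chain is examined, so
    while a is not None and table[a] is not None:   # chains never coalesce
        record, nxt = table[a]
        if record == key:
            return a
        a = nxt
    return None

insert_chained("cole", 20)    # 20 : cole -> null
insert_chained("colby", 20)   # hypothetical colliding key: 20 : cole -> 35 -> null
```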
### Re-Cap from weeks ago
• File Read Time = seek time + latency + data read time
• Smallest Readable Portion = 1 cluster = 4KB (usually)
• To access a portion of a file, most of the time is spent in seek time and latency, not read time
• so, the number of file reads matters more than the size of each read, until reads get very large
• SO... reading a few records from a file takes hardly any more time than reading just one record
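
A back-of-the-envelope illustration with assumed timings (the specific milliseconds are not from the slides):

```python
# Illustrative timings only (assumed): ~9 ms average seek, ~4 ms rotational
# latency, ~0.05 ms to transfer one 4 KB cluster.
seek_ms, latency_ms, per_cluster_ms = 9.0, 4.0, 0.05

one_record   = seek_ms + latency_ms + 1 * per_cluster_ms    # 13.05 ms
five_records = seek_ms + latency_ms + 5 * per_cluster_ms    # 13.25 ms

# The fixed seek + latency dominates, so reading five contiguous records
# costs only ~1.5% more than reading one -- the observation behind buckets.
```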
### Buckets
• Given that collisions will occur...
• Why not just read 2, or 3, or 4 records instead of just 1 on each read operation?
• "Bucket" - a group of records stored at the same address
• "Hash File of Buckets" - hashed keys collide into small arrays of records in the data file
### Bucket Size?
• use avg collisions and stddev?
• if 1000 records and 200 addresses
• then avg is 5.0
• but stddev might be 1.0
• start by determining how many records can fit in one or more disk clusters (see the sketch after this list)
• then design a good hash function to match that address space
• Can achieve relatively fast access
• Remember, the hash function tells us where the bucket is located, so only 1 read operation is needed. Even with collisions, the whole list of candidate records is read into memory, where searching is fast.
• Search Time = time to read bucket + time to search the array
• What do we do when the bucket is full?
• solutions are similar to collision resolution
• we end up reading multiple sets of records
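
A sketch of the sizing arithmetic described above; the 128-byte record and the 50% packing density are assumed figures, while the 4 KB cluster comes from the recap.

```python
CLUSTER_BYTES = 4 * 1024
RECORD_BYTES = 128            # assumption for illustration

bucket_size = CLUSTER_BYTES // RECORD_BYTES   # 32 records per one-cluster bucket

# For, say, 1000 records at a packing density of 50%, you would want
# roughly (1000 / 0.5) / bucket_size = 62.5, i.e. about 63 bucket addresses,
# and a hash function that spreads keys evenly over exactly that space.
print(bucket_size)
```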
### Predicting Collision Rates
• Collisions will happen!
• Poisson Function:
• p(x) gives the probability that a given address will have exactly x records assigned to it.

p(x) = (r/N)^x * e^(-(r/N)) / x!

N = number of available addresses

r = number of records to be stored

x = number of records assigned to a given address

### Analysis Continued
• Given
• N = 1000
• r = 1000
• Probability that a given address will have exactly one, two, or three keys hashed to it:

p(1) = 0.368

p(2) = 0.184

p(3) = 0.061

### Analysis Continued
• Given
• N = 10,000
• r = 10,000
• How many addresses should have one, two, or three keys hashed to them?

10,000 x p(1) = 10,000 x 0.3679 = 3679

10,000 x p(2) = 10,000 x 0.1839 = 1839

10,000 x p(3) = 10,000 x 0.0613 = 613

• So, about 1839 addresses will see one collision and about 613 will see two.
• Many of those collisions will disrupt probing.
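
Both sets of figures can be checked directly from the Poisson formula; here is a quick Python verification using nothing beyond the slide's own numbers.

```python
from math import exp, factorial

def p(x, r, N):
    """Poisson: probability an address receives exactly x of r records."""
    lam = r / N
    return lam ** x * exp(-lam) / factorial(x)

# r/N = 1 in both slides, so p(x) is the same for N = 1000 and N = 10,000:
for x in (1, 2, 3):
    print(x, round(p(x, 1000, 1000), 4))     # 0.3679, 0.1839, 0.0613

# Expected number of addresses with exactly x keys when N = r = 10,000:
print([round(10_000 * p(x, 10_000, 10_000)) for x in (1, 2, 3)])
# -> [3679, 1839, 613], matching the slide
```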
### Impact of Packing Density
• Given
• r = 500
• N = 1000
• Addresses with exactly one record?

N x p(1) = 1000 x 0.303 = 303

• How many overflow records (records that cannot go at their home address)?

1 x N x p(2) + 2 x N x p(3) + 3 x N x p(4) + ...

≈ N x [1 x p(2) + 2 x p(3) + 3 x p(4)]

= 1000 x [1 x 0.076 + 2 x 0.013 + 3 x 0.002]

≈ 107

• Percentage of records NOT stored at their home address

107 / 500 = 21.4%

• All 500 records accounted for:

Records that never collide = 303

Records at their home address, but involved in collisions = 90

Records that cannot go at their home address (overflow) = 107

Total = 500
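
The same check works here, truncating the infinite sum once the terms become negligible.

```python
from math import exp, factorial

def p(x, r, N):
    lam = r / N
    return lam ** x * exp(-lam) / factorial(x)

r, N = 500, 1000   # packing density 50%

never_collide = N * p(1, r, N)                                 # ~303
at_home_but_hit = N * sum(p(k, r, N) for k in range(2, 20))    # ~90
overflow = N * sum(k * p(k + 1, r, N) for k in range(1, 20))   # ~107

print(round(never_collide), round(at_home_but_hit), round(overflow))
print(f"{round(overflow) / r:.1%} of records overflow")        # 21.4%
```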

### Real Life
• We must balance many factors:
• file size
• e.g., wasted space in hashed files
• e.g., extra space for index files
• disk access times
• available memory
• frequency of additions and deletions compared to searches
• Best Solution of All?
• probably a combination of indexed files, hashing, and buckets
### Next Classes…
• Thursday April 14
• No Class
• Tuesday April 19
• B-Trees
• Thursday April 21
• Review