
More Hashing

Hashing Part Two

Better Collision Resolution

Small parts of this material stolen from

"File Organization and Access" by Austing and Cassel


Recap of Last Class

  • A hash function converts a key to a file address

  • A collision occurs when two or more keys hash to the same address

  • Collision Avoidance

    • A good hash function spreads the keys evenly across the whole address space

    • A non-dense file decreases the chance of collisions and the number of probes after a collision


Recap: Linear Probing

  • Very simple collision resolution

  • If H(key) = A and address A is already in use, try A+1, then A+2, etc.

  • Advantages

    • easy to implement

    • guaranteed to use all addresses

  • Disadvantages

    • clustering / clumping
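
A minimal sketch of linear-probing insertion, with the file modeled as an in-memory list (the function and parameter names are illustrative, not from the slides):

    def linear_probe_insert(table, key, hash_fn):
        """Insert key with linear probing; table is a fixed-size list, None = empty slot."""
        size = len(table)
        home = hash_fn(key) % size
        for i in range(size):              # try at most `size` slots
            slot = (home + i) % size       # A, A+1, A+2, ... wrapping around
            if table[slot] is None:
                table[slot] = key
                return slot                # finds a slot whenever one is free
        raise RuntimeError("table is full")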


Clumping

  • Given the following hashes and linear probing:

    • adams = 20

    • bates = 22

    • cole = 20

    • dean = 21

    • evans = 23

  • Clumping like this is the result of either

    • poor hash function

    • dense file
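
The clump is easy to see by running the insertions (a toy sketch; the table size of 40 is arbitrary):

    hashes = {"adams": 20, "bates": 22, "cole": 20, "dean": 21, "evans": 23}
    table = [None] * 40

    for name, home in hashes.items():      # dict order = insertion order here
        slot = home
        while table[slot] is not None:     # linear probing: step by 1
            slot = (slot + 1) % len(table)
        table[slot] = name

    print(table[20:25])   # ['adams', 'cole', 'bates', 'dean', 'evans'] -- one solid clump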


Random Probing

  • Instead of adding 1, spread the probes out by a larger step

  • A truly random step would not work -- a search could never retrace the same probe sequence -- so use a fixed pseudo-random step instead:

    While A is in use:
        A = (A + R) mod T

    A = address
    R = a prime constant (the step size)
    T = table size
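
A sketch of the loop above (it assumes gcd(R, T) = 1, so the probe sequence eventually visits every address):

    def random_probe_insert(table, key, hash_fn, R=5):
        """Insert with pseudo-random probing: step by a fixed prime R instead of 1."""
        T = len(table)
        A = hash_fn(key) % T
        for _ in range(T):                 # at most T probes
            if table[A] is None:
                table[A] = key
                return A
            A = (A + R) % T                # A = (A + R) mod T
        raise RuntimeError("table is full")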


Random Probing Example

  • adams = 20

  • bates = 22

  • cole = 20

  • dean = 21

  • evans = 23

With R = 5, Cole (colliding with Adams at 20) probes 25 next. But what if 25 and 30 already had keys directly hashed to those locations? Cole would be at 35 -- 4 probes away.
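
Tracing that case, with 25 and 30 pre-filled by assumed keys "x" and "y":

    R, T = 5, 100
    table = [None] * T
    table[20], table[25], table[30] = "adams", "x", "y"   # 25 and 30 already taken

    A, probes = 20, 1            # cole hashes to 20: that is the first probe
    while table[A] is not None:
        A = (A + R) % T
        probes += 1
    print(A, probes)             # 35 4 -- Cole lands at 35 on the 4th probe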


Chaining

  • Assuming a better hash function and less dense file are not options...

  • And assuming linear and random probing lead to coalesced lists...

  • Chaining: maintain a linked list of collisions, one head per address

    • Example: after adding Adams and Cole, with R = 5 (and 25 and 30 occupied, as in the earlier example):

      19 : null

      20 : 35 -> null

      21 : null

  • Advantage: faster at resolving collisions -- a search follows the chain directly instead of re-probing slots that belong to other keys

  • Disadvantage: space for the link fields
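
A sketch of address chaining in the slide's style: records live in one table, and each home address keeps the head of a linked list of its overflow addresses (all names here are illustrative):

    R, T = 5, 100
    table = [None] * T      # record stored at each file address
    head = [None] * T       # first overflow address chained to this home, or None
    nxt = [None] * T        # link from one overflow address to the next

    def insert(key, home):
        if table[home] is None:                  # home address is free
            table[home] = key
            return home
        A = home                                 # find a free overflow slot (step by R)
        while table[A] is not None:
            A = (A + R) % T
        table[A] = key
        if head[home] is None:                   # append A to home's chain
            head[home] = A
        else:
            cur = head[home]
            while nxt[cur] is not None:
                cur = nxt[cur]
            nxt[cur] = A
        return A

    insert("adams", 20)    # stored at its home address 20
    insert("cole", 20)     # collides: stored at 25 here and chained (head[20] = 25);
                           # with 25 and 30 occupied, as on the slide, it lands at 35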


Re-Cap from Weeks Ago

  • File Read Time = seek time + latency + data read time

  • Smallest Readable Portion = 1 cluster = 4KB (usually)

  • To access portion of a file, most of the time is in seek time and latency, not read time

    • so, number of file reads is more important than size of reads, until size gets really big

  • SO... reading a few records from a file takes no more time than reading just one record
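
A back-of-envelope illustration with assumed figures (typical magnetic-disk numbers, not from the slides):

    SEEK_MS, LATENCY_MS = 9.0, 4.0     # assumed seek time and rotational latency
    TRANSFER_MS_PER_KB = 0.01          # assumed sequential transfer cost

    def read_time_ms(kb):
        return SEEK_MS + LATENCY_MS + kb * TRANSFER_MS_PER_KB

    print(read_time_ms(4))     # 13.04 ms for one 4 KB cluster
    print(read_time_ms(16))    # 13.16 ms -- four clusters cost barely more than one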


Buckets

  • Given, collisions will occur...

  • Why not just read 2, or 3, or 4 records instead of just 1 on each read operation?

  • "Bucket" - a group of records at the same address

  • "Hash File of Buckets" - hashed keys collide to small arrays of records in the data file


Bucket Size?

  • Use the average number of collisions and the standard deviation?

  • if 1000 records and 200 addresses

    • then avg is 5.0

    • but stddev might be 1.0

  • start by determining how many records can fit in one or more disk clusters

  • then design a good hash function to match that address space
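
A sketch of that sizing arithmetic (the 200-byte record size is an assumption for illustration):

    CLUSTER_BYTES = 4096           # smallest readable portion, from the recap
    RECORD_BYTES = 200             # assumed record size

    bucket_size = CLUSTER_BYTES // RECORD_BYTES   # 20 records per one-cluster bucket
    n_records = 1000
    n_buckets = -(-n_records // bucket_size)      # ceiling division: 50 home addresses
    print(bucket_size, n_buckets)                 # 20 50 (add slack to keep density low)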


Advantages and Disadvantages of Buckets

  • Advantages:

  • Can achieve relatively fast access

    • Remember, the hash function tells us where the record is located, so only one read operation is needed. Even with collisions, the whole list of candidate records is read into memory, where searching is fast.

    • Search Time = time to read bucket + time to search the array

  • Disadvantages:

  • What do we do when the bucket is full?

    • solutions are similar to collision resolution

    • we end up reading multiple sets of records
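
One possible overflow policy, sketched here as linear probing on whole buckets -- which is exactly why an overflowing search ends up reading multiple sets of records:

    def insert_with_overflow(buckets, key, record, bucket_size=4):
        """If the home bucket is full, try the next bucket, and so on."""
        n = len(buckets)
        home = hash(key) % n
        for i in range(n):
            b = buckets[(home + i) % n]
            if len(b) < bucket_size:
                b.append((key, record))
                return (home + i) % n      # each step past home costs one more read
        raise RuntimeError("file is full")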


Predicting Collision Rates

  • Collisions will happen!

  • Poisson Function:

    • p(x) gives the probability that a given address will have had x records assigned to it.

      p(x) = ((r/N)^x * e^(-(r/N))) / x!

      N = number of available addresses

      r = number of records to be stored

      x = number of records assigned to a given address
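
The function is direct to compute; this sketch reproduces the next slide's values:

    import math

    def p(x, r, N):
        """Poisson estimate: P(a given address is assigned exactly x of the r records)."""
        lam = r / N
        return lam ** x * math.exp(-lam) / math.factorial(x)

    for x in (1, 2, 3):                        # with r = N = 1000
        print(x, round(p(x, 1000, 1000), 3))   # 0.368, 0.184, 0.061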


Analysis Continued

  • Given

    • N = 1000

    • r = 1000

  • Probability that a given address will have exactly one, two, or three keys hashed to it:

    p(1) = 0.368

    p(2) = 0.184

    p(3) = 0.061


Analysis Continued

  • Given

    • N = 10,000

    • r = 10,000

  • How many addresses should have one, two, or three keys hashed to them?

    10,000 x p(1) = 10000x0.3679 = 3679

    10,000 x p(2) = 10000x0.1839 = 1839

    10,000 x p(3) = 10000x0.0613 = 613

  • So about 1839 addresses will receive one colliding key, and about 613 will receive two.

  • Many of those collisions will disrupt probing.
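
Reusing p(x) from the earlier sketch, the expected counts fall straight out:

    print([round(10000 * p(x, 10000, 10000)) for x in (1, 2, 3)])   # [3679, 1839, 613]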


Impact of Packing Density

Breakdown of the 500 records:

  • Records that never collide = 303

  • Records that cannot go at their home address = 107

  • Records at their home address, but causing collisions = 90

  • Total = 500

  • Given

    • r = 500

    • N = 1000

    • one record per address

  • Addresses with exactly one record?

    N x p(1) = 1000 x 0.303 = 303

  • How many overflow records?

    1 x N x p(2) + 2 x N x p(3) + 3 x N x p(4) + ...

    = N x [1 x p(2) + 2 x p(3) + 3 x p(4) + ...]

    = 1000 x [1 x 0.076 + 2 x 0.013 + 3 x 0.002 + ...]

    ≈ 107

  • Percentage of Records NOT stored at home address

    107 / 500 = 21.4%
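
The whole slide can be checked with the earlier p(x); summing a few more terms of the series gives the 107 rather than the truncated 108:

    import math

    def p(x, r, N):
        lam = r / N
        return lam ** x * math.exp(-lam) / math.factorial(x)

    r, N = 500, 1000
    overflow = N * sum((x - 1) * p(x, r, N) for x in range(2, 20))
    print(round(N * p(1, r, N)))    # 303 records alone at their home address
    print(round(overflow))          # 107 overflow records
    print(round(overflow) / r)      # 0.214 -- 21.4% of records not at home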




Real Life

  • We must balance many factors:

    • file size

      • e.g., wasted space in hashed files

      • e.g., extra space for index files

    • disk access times

    • available memory

    • frequency of additions and deletions compared to searches

  • Best Solution of All?

    • probably a combination of indexed files, hashing, and buckets


Next Classes…

  • Thursday April 14

    • No Class

  • Tuesday April 19

    • B-Trees

  • Thursday April 21

    • Review

