
Hashing

Joe Meehean


Motivation
  • BST
    • easy to implement
    • average-case times O(LogN)
    • worst-case times O(N)
  • AVL Trees
    • harder to implement
    • worst case times O(LogN)
  • Can we do better in the average-case?
Concept
  • “Dictionary” ADT
    • average-case time O(1) for lookup, insert, and delete
  • Idea
    • stores keys (and associated values) in an array
    • compute each key’s array index as a function of its value
    • take advantage of array’s fast random access
  • Alternative implementation for sets and maps
Example
  • Goal
    • Store info about a company’s 50 employees
    • Each employee has a unique employee ID
      • in range 100-200
  • Approach
    • use an array of size 101 (range of IDs)
    • store employee E’s info in array[E-100]
  • Result
    • insert, lookup, delete each O(1)
    • wasted space: 51 unused locations (see the sketch below)
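A minimal sketch of this direct-addressing approach (the class and member names are hypothetical):

#include &lt;string&gt;

// Direct addressing: IDs 100-200 map to slots 0-100 via hash(id) = id - 100.
class EmployeeTable {
    static const int LOW = 100, SIZE = 101;
    std::string info[SIZE];                       // "" means the slot is unused
public:
    void insert(int id, const std::string&amp; s) { info[id - LOW] = s; }
    std::string lookup(int id) const { return info[id - LOW]; }
    void remove(int id) { info[id - LOW].clear(); }
};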
Drawbacks
  • Less functionality than trees
  • Hash tables cannot efficiently
    • find min
    • find max
    • print entire table in sorted order
  • Must be very careful how we use them
Terminology
  • Hashtable
    • the underlying array
  • Hash function
    • function that converts a key to an index
    • in example: hash(x) = x – 100
  • TableSize
    • size of underlying array or vector
  • Bucket
    • single cell of a hash table array
  • Collision
    • when two keys hash to the same bucket
Assumptions
  • Keys we are using have a hash function
    • or we can define good hash functions for them
  • Keys overload the following operators
    • ==
    • !=
Resolving Obvious Problems

How do we make a good hash function?

What should we do about collisions?

How large should we make our hash table?

Hash Function Goals
  • Hash function should be fast
  • Keys should be evenly distributed
    • different keys should have different hash values
  • Should reduce space needed
    • e.g., student IDs are 10 digits
    • do not need an array size of 10,000,000,000
    • there are only ~3,000 students
Hash Functions Approach
  • Convert key to an int n
    • scramble up the data
    • ensure the data spreads over the entire integer space
  • Return n % TableSize
    • ensures that n doesn’t fall off the end of the table
Example: Converting Strings
  • Method 1
    • convert each char to an int
    • sum them
    • return sum % TableSize
  • Advantages
    • simple
    • time is O(key length)
Example: Converting Strings
  • Method 1
    • convert each char to an int
    • sum them
    • return sum % TableSize
  • Problems
    • short keys may not reach end of table
      • sum of characters < TableSize (by a lot)
    • maps all permutations to same hash
      • hash(“able”) = hash(“bale”)
    • time is O(key length) (see the sketch below)
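A sketch of Method 1 under these assumptions (the function name is made up):

#include &lt;string&gt;

// Method 1: add up the character codes, then mod by the table size.
unsigned sumHash(const std::string&amp; key, unsigned tableSize) {
    unsigned sum = 0;
    for (char c : key)
        sum += static_cast&lt;unsigned char&gt;(c);
    return sum % tableSize;   // short keys cluster near the front of a big table
}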
Example: Converting Strings
  • Method 2
    • Multiply individual chars by different values
    • Then sum
    • a[0] * 37^n + a[1] * 37^(n-1) + … + a[n-1] * 37
    • i.e., the sum of a[i] * 37^(n-i)
  • Advantages
    • produces big range of values
    • permutations hash to different values
Example: Converting Strings
  • Method 2
    • Multiply individual chars by different values
    • Then sum
  • Disadvantages
    • relies on integer overflow
    • need to worry about negative hashes
  • Handling negative hash
    • hash = hash % TableSize
    • if(hash < 0) hash += TableSize
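A sketch of Method 2 using Horner's rule (equivalent to the polynomial above up to a constant factor), with the negative-hash fix; note that, as the slide says, this relies on integer overflow, and production code would use an unsigned accumulator to avoid the undefined signed overflow:

#include &lt;string&gt;

// Method 2: treat the string as a polynomial in 37 (Horner's rule).
int polyHash(const std::string&amp; key, int tableSize) {
    int hash = 0;
    for (char c : key)
        hash = 37 * hash + c;     // overflow scrambles the high bits
    hash %= tableSize;
    if (hash &lt; 0)                 // overflow may leave hash negative
        hash += tableSize;
    return hash;
}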
Hash Function Tradeoffs
  • Fast hash vs. evenly distributed hash
    • often faster leads to less evenly distributed
    • even distribution leads to slower
  • String example
    • could use only some of the characters
    • faster, but more collisions likely
Resolving Obvious Problems

How do we make a good hash function?

What should we do about collisions?

How large should we make our hash table?

Handling collisions
  • What if two keys hash to the same bucket (array entry)?
  • Array entries are linked lists (or trees)
    • different keys with same hash value stored in same list (or tree)
    • commonly called chained bucket hashing, or just chaining
Handling Collisions Example
  • TableSize = 10
  • keys: 10 digit student IDs
  • hashfn = sum of digits % TableSize
Example

[diagram: a 10-bucket hash table (cells 0-9); five student IDs A-E land in buckets chosen by the digit-sum hash, with colliding IDs chained in the same bucket’s list]

Handling collisions
  • During a lookup
  • How can we tell which value we want if there is more than one entry in the bucket?
  • Compare the keys
    • buckets store keys and values
Resolving Obvious Problems

How do we make a good hash function?

What should we do about collisions?

How large should we make our hash table?

Hash Table Size
  • Related to load factor
    • load factor λ: ratio of items in hash table to TableSize
    • average length of a bucket’s list is λ
  • Goal is to keep λ around 1 (e.g., 100 keys in a 101-cell table gives λ ≈ 1)
Hash Table Size
  • Related to hashing function
  • Some hashing functions lead to data clustered together
  • Using a prime TableSize helps resolve this issue
    • the hash values are then unlikely to share a factor with the table size
Hash Table Size
  • If number of keys known in advance
    • make the hash table a little larger
    • prime near 1.25 * the number of keys
    • a little room to avoid collisions
    • trades space for potentially faster lookup
  • If number of keys not known in advance
    • plan to expand array as needed
    • coming up in another lecture
Hash Table operations
  • Lookup Key k
    • compute h = hash(k)
    • see if k is in the list in hashtable[h]
  • Insert Key k
    • Compute h = hash(k)
    • Make sure k is not already in hashtable[h]
    • Add k to the list in hashtable[h]
  • Delete Key k
    • Compute h = hash(k)
    • Remove k from list in hashtable[h]
HashSet Class

template <class K, class Hash>
class HashSet {
private:
    vector< list<K> > table;   // each bucket chains the keys that hash there
    int currentSize;
    Hash hashfn;

public:
    bool contains(const K&) const;
    void insert(const K&);
    void remove(const K&);
};
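A sketch of how contains and insert might be written against these members (assuming the Hash functor has a const call operator returning an unsigned value):

#include &lt;algorithm&gt;   // std::find

template &lt;class K, class Hash&gt;
bool HashSet&lt;K, Hash&gt;::contains(const K&amp; k) const {
    const list&lt;K&gt;&amp; bucket = table[hashfn(k) % table.size()];
    return std::find(bucket.begin(), bucket.end(), k) != bucket.end();   // uses K's ==
}

template &lt;class K, class Hash&gt;
void HashSet&lt;K, Hash&gt;::insert(const K&amp; k) {
    list&lt;K&gt;&amp; bucket = table[hashfn(k) % table.size()];
    if (std::find(bucket.begin(), bucket.end(), k) == bucket.end()) {
        bucket.push_back(k);      // not present: chain it into the bucket
        ++currentSize;
    }
}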

Alternative to chaining
  • Recall chaining hash tables
    • array cells stored linked lists
    • 2 keys with same hash end up in same list
  • Chaining hash tables
    • require 2 data structures
    • hash table and linked list
  • Can we solve collisions with more hashing?
    • use just one data structure
Probing Hash Tables
  • No linked list in array cells
  • Collisions handled using alternative hash
    • try cells h0(x), h1(x), h2(x),…
    • until an empty cell is found
    • hi(x) = hash(x) + f(i)
    • f(i) is the collision resolution strategy
  • Probing
    • looking for alternative hash locations
Probing Hash Tables
  • All data goes directly into table
    • instead of into lists in the table
  • Need a bigger table
    • λ ≈ 0.5 (half full)
  • More wasted space
  • Marginally less complexity
Linear probing
  • f(i) is a linear function
    • often f(i)= i
  • If a collision occurs, look in the next cell
    • hash(x) + 1
    • keep looking until an empty cell is found
    • hash(x) + 2, hash(x) + 3, …
    • use modulus to wrap around table
  • Should eventually find an empty cell
    • if the table is not full
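A sketch of the linear-probing search for a free cell (here an int table with -1 marking empty; assumes the table is not full):

#include &lt;vector&gt;

// Probe hash(x), hash(x) + 1, hash(x) + 2, ... wrapping with the modulus.
int linearProbeInsert(std::vector&lt;int&gt;&amp; table, int x) {
    int size = table.size();
    int pos = x % size;               // h0(x)
    while (table[pos] != -1)          // collision: look in the next cell
        pos = (pos + 1) % size;       // wrap around the table
    table[pos] = x;
    return pos;
}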
Linear probing

Insert 89

[diagram: cells 0-9, all empty; h0(89) = 9, so 89 is placed in cell 9]

Simple hash: h(x) = x % TableSize

Linear probing

Insert 18

[diagram: h0(18) = 8 is empty, so 18 is placed in cell 8; table now holds 8:18, 9:89]

Simple hash: h(x) = x % TableSize

Linear probing

Insert 49

[diagram: h0(49) = 9 collides with 89]

Simple hash: h(x) = x % TableSize

Linear probing

Insert 49

[diagram: h1(49) = 0 (wrapping around the table) is empty, so 49 is placed in cell 0; table now holds 0:49, 8:18, 9:89]

Simple hash: h(x) = x % TableSize

Linear probing

Insert 58

[diagram: h0(58) = 8 collides with 18]

Simple hash: h(x) = x % TableSize

Linear probing

Insert 58

[diagram: h1(58) = 9 collides with 89]

Simple hash: h(x) = x % TableSize

Linear probing

Insert 58

[diagram: h2(58) = 0 collides with 49]

Simple hash: h(x) = x % TableSize

Linear probing

Insert 58

[diagram: h3(58) = 1 is empty, so 58 is placed in cell 1; table now holds 0:49, 1:58, 8:18, 9:89]

Simple hash: h(x) = x % TableSize

Linear probing
  • Advantages
    • no need for list
    • collision resolution function is fast
  • Disadvantages
    • requires more bookkeeping
    • primary clustering
Probing Extra Bookkeeping

Delete 89

[diagram: cells 0:49, 1:58, 8:18, 9:89; h0(89) = 9 locates 89 for deletion]

What if an entry is deleted and we try to lookup another entry that collided with it?

Probing Extra Bookkeeping

Lookup 49

[diagram: cells 0:49, 1:58, 8:18; cell 9 is empty after deleting 89, so h0(49) = 9 hits an empty cell and the lookup wrongly reports Not Found]

What if an entry is deleted and we try to look up another entry that collided with it?

Probing Extra Bookkeeping
  • Need extra information per cell
  • Differentiate between states
    • ACTIVE: cell contains a valid key
    • EMPTY: cell never contained a valid key
    • DELETED: previously contained a valid key
  • All cells start EMPTY
  • Lookup
    • keep looking until you find key or EMPTY cell
Probing Extra Bookkeeping

Delete 89

[diagram: cells 0:49 (A), 1:58 (A), 8:18 (A), 9:89 (A); every other cell is EMPTY; h0(89) = 9 locates 89]

Probing Extra Bookkeeping

Delete 89

[diagram: the same table after the delete; cell 9 still holds 89 but is now marked DELETED, all other states unchanged]

Probing Extra Bookkeeping

Lookup 49

[diagram: h0(49) = 9 hits the DELETED cell, a collision, so the probe continues instead of stopping]

Probing Extra Bookkeeping

Lookup 49

[diagram: h1(49) = 0 finds 49 in an ACTIVE cell; the lookup succeeds]

Probing HashSet Class

template <class K, class Hash>
class HashSet {
private:
    vector< HashEntry<K> > table;   // cells hold entries directly, no lists
    int currentSize;
    ...
};

template <class K>
class HashEntry {
public:
    enum EntryType { ACTIVE, EMPTY, DELETED };

private:
    K element;
    EntryType info;

    template <class K2, class H2> friend class HashSet;   // HashSet may touch element/info
};
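A sketch of probing lookup and lazy deletion against these classes, assuming the HashSet also declares a helper int findPos(const K&amp;) const:

template &lt;class K, class Hash&gt;
int HashSet&lt;K, Hash&gt;::findPos(const K&amp; k) const {
    int pos = hashfn(k) % table.size();
    // Probe past DELETED cells; stop at EMPTY or at an ACTIVE match.
    // Assumes the table is never completely full.
    while (table[pos].info != HashEntry&lt;K&gt;::EMPTY &amp;&amp;
           !(table[pos].info == HashEntry&lt;K&gt;::ACTIVE &amp;&amp; table[pos].element == k))
        pos = (pos + 1) % table.size();   // linear probing: f(i) = i
    return pos;
}

template &lt;class K, class Hash&gt;
void HashSet&lt;K, Hash&gt;::remove(const K&amp; k) {
    int pos = findPos(k);
    if (table[pos].info == HashEntry&lt;K&gt;::ACTIVE)
        table[pos].info = HashEntry&lt;K&gt;::DELETED;   // mark, don't erase
}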

Linear Probing Hashing Recall
  • No more bucket lists
  • Use collision resolution strategy
    • hi(x) = hash(x) + f(i)
  • If collision occurs, try the next cell
    • f(i) = i
    • repeat until you find an empty cell
  • Need extra bookkeeping
    • ACTIVE, EMPTY, DELETED
Linear Probing Hashing
  • What could go wrong?
  • How can we fix it?
    • Professor Meehean, you haven’t told us what “it” is yet.
Primary clustering
  • Clusters of data
    • requires several attempts to resolve collisions
    • makes cluster even bigger
    • too many 9’s eat up all of 8’s space
    • then the 8’s eat up 7’s space, etc.
  • Inserting keys into space that should be empty results in collisions
    • clusters have overrun whole chunks of the hash table
Primary clustering

Insert 30

[diagram: cells 0:49, 1:58, 2:29, 8:18, 9:89; h0(30) = 0 collides with 49]

Primary clustering

Insert 30

[diagram: h1(30) = 1 collides with 58]

Primary clustering

Insert 30

[diagram: h2(30) = 2 collides with 29]

Primary clustering

Insert 30

[diagram: h3(30) = 3 is empty, so 30 is placed there; the cluster now spans cells 0-3]

Primary Clustering

Only gets worse as the load factor gets larger: as memory use becomes more efficient, performance gets worse.

Quadratic Probing
  • Primary clustering caused by linear nature of linear probing
    • collisions end up right next to each other
  • What if we jumped farther away on a collision?
    • f(i) = i²
  • If a collision occurs…
    • hash(x) + 1, hash(x) + 4, hash(x) + 9, …
Quadratic Probing
  • Restrictions
    • TableSize must be a prime
    • table must be less than half full
      • i.e., λ ≤ 0.5
  • If these restrictions are met
    • guaranteed to find an empty cell, eventually
  • If not
    • no guarantee of finding an empty cell
    • an insert might fail due to continued collisions (see the sketch below)
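A sketch of the quadratic probe sequence; it uses the identity i² = (i-1)² + (2i - 1) to advance without multiplying, and assumes a prime TableSize that is at most half full (-1 marks an empty cell):

#include &lt;vector&gt;

// Probe h(x), h(x) + 1, h(x) + 4, h(x) + 9, ...
int quadraticProbe(const std::vector&lt;int&gt;&amp; table, int x) {
    int size = table.size();
    int pos = x % size;
    int step = 1;                         // 2i - 1 for i = 1, 2, 3, ...
    while (table[pos] != -1 &amp;&amp; table[pos] != x) {
        pos = (pos + step) % size;        // jump 1, 3, 5, ... cells further
        step += 2;
    }
    return pos;                           // cell holding x, or an empty cell
}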
Quadratic probing

Insert 58

[diagram: cells 0:49, 8:18, 9:89; h0(58) = 8 collides with 18]

hi(x) = h(x) + i²

Quadratic probing

Insert 58

[diagram: h1(58) = 9 collides with 89]

h1(x) = h(x) + 1

Quadratic probing

Insert 58

[diagram: h2(58) = (8 + 4) % 10 = 2 is empty, so 58 is placed in cell 2]

h2(x) = h(x) + 4

Secondary Clustering
  • Quadratic probing eliminates primary clustering
  • Keys with the same hash…
    • probe the same alternative cells
    • clusters still exist per bucket
    • just spread out
  • Called secondary clustering
  • Can we beat secondary clustering?
Double Hashing
  • If the first hashing function causes a collision, try a second hashing function
    • hi(x) = hash(x) + f(i)
    • f(i) = i • hash2(x)
    • h0(x) = hash(x)
    • h1(x) = hash(x) + hash2(x)
    • h2(x) = hash(x) + 2 • hash2(x)
    • h3(x) = hash(x) + 3 • hash2(x)
Double Hashing
  • hash2(x) must be carefully selected
  • It can never be 0
    • h1(x) = hash(x) + 1 • 0
    • h2(x) = hash(x) + 2 • 0
    • h1(x) = h2(x) = h3(x) = hn(x)
  • It must eventually probe all cells
    • quadratic probing only reached half the cells
    • requires TableSize to be prime
Double Hashing
  • hash2(x) = R – (x % R)
  • where R is a prime smaller than TableSize
    • e.g., the previous, smaller prime TableSize
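A sketch of double-hashing probes with this hash2 (an int table with -1 marking empty; R is a prime smaller than the prime TableSize):

#include &lt;vector&gt;

int doubleHashProbe(const std::vector&lt;int&gt;&amp; table, int x, int R) {
    int size = table.size();              // should be prime
    int pos = x % size;                   // h0(x)
    int step = R - (x % R);               // hash2(x): always in 1..R, never 0
    while (table[pos] != -1 &amp;&amp; table[pos] != x)
        pos = (pos + step) % size;        // hi(x) = hash(x) + i * hash2(x)
    return pos;
}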
Double Hashing

Insert 49

[diagram: cells 8:18, 9:89; h0(49) = 9 collides with 89]

  • hi(x) = hash(x) + i • hash2(x)
  • hash2(x) = R – (x % R)
  • R = 7
Double Hashing

Insert 49

[diagram: h1 probes cell 6, which is empty, so 49 is placed there]

  • h1(49) = 9 + 1 • hash2(49)
  • hash2(49) = 7 – (49 % 7) = 7 – 0 = 7
  • h1(49) = 16 % 10 = 6
Double Hashing

[diagram: cells 0:69, 3:58, 6:49, 8:18, 9:89]

  • Why a prime TableSize is important
Double Hashing

Insert 23

[diagram: cells 0:69, 3:58, 6:49, 8:18, 9:89]

  • Why a prime TableSize is important
  • hi(x) = ((x % TableSize) + i • hash2(x)) % TableSize
  • hash2(x) = 7 – (x % 7)
Double Hashing

Insert 23

[diagram: h0(23) = 3 collides with 58]

  • Why a prime TableSize is important
  • hash2(23) = 7 – (23 % 7) = 7 – 2 = 5
  • hi(23) = (3 + i • 5) % 10
Double Hashing

Insert 23

[diagram: h1(23) = 8 collides with 18]

  • Why a prime TableSize is important
  • hi(23) = (3 + i • 5) % 10
  • h1(23) = (3 + 1 • 5) % 10 = 8
Double Hashing

Insert 23

[diagram: h2(23) = 3 collides with 58 again]

  • Why a prime TableSize is important
  • hi(23) = (3 + i • 5) % 10
  • h2(23) = (3 + 2 • 5) % 10 = 3
Double Hashing

Insert 23

[diagram: h3(23) = 8 collides with 18 again]

  • Why a prime TableSize is important
  • hi(23) = (3 + i • 5) % 10
  • h3(23) = (3 + 3 • 5) % 10 = 8
Double Hashing

  • Why a prime TableSize is important
  • hi(x) = ((x % TableSize) + i • hash2(x)) % TableSize
  • hash2(23) = 7 – (23 % 7) = 7 – 2 = 5
  • hi(23) = (3 + i • 5) % 10
  • 5 is a factor of 10, so the probe sequence wraps forever, landing on the same buckets (3, 8, 3, 8, …)
  • if TableSize is prime, hash2(x) can never be a factor of it
Hash Table Problems
  • What to do when hash table gets too full?
    • problem for both chained and probing HTs
    • degrades performance
    • may cause insert failure for quadratic probing
Rehash
  • Create another table about 2x the size
    • nearest prime ≥ 2 × TableSize
  • Scan original table
    • compute new hash for valid entries
    • insert into new table
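A sketch of rehash for the chained HashSet from earlier (nextPrime is an assumed helper returning the nearest prime at or above its argument):

template &lt;class K, class Hash&gt;
void HashSet&lt;K, Hash&gt;::rehash() {
    std::vector&lt; std::list&lt;K&gt; &gt; old = table;               // keep the old buckets
    table.assign(nextPrime(2 * old.size()), std::list&lt;K&gt;());
    currentSize = 0;
    for (const std::list&lt;K&gt;&amp; bucket : old)
        for (const K&amp; k : bucket)
            insert(k);                    // recomputes hash(k) % new TableSize
}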
Rehash

Insert 23

[diagram: TableSize = 5; cells 2:37 (DELETED), 3:58 (ACTIVE), 4:49 (ACTIVE); h0(23) = 3 collides with 58]

Assume quadratic probing

hash(x) = x % TableSize

Rehash

Insert 23

[diagram: h1(23) = 4 collides with 49]

Assume quadratic probing

hash(x) = x % TableSize

Rehash

Insert 23

[diagram: h2(23) = (3 + 4) % 5 = 2; cell 2 holds only a DELETED entry]

Assume quadratic probing

hash(x) = x % TableSize

Rehash

Insert 23

[diagram: 23 is placed in cell 2, which becomes ACTIVE; table now holds 2:23 (A), 3:58 (A), 4:49 (A)]

Assume quadratic probing

hash(x) = x % TableSize

Rehash

Table Too Full

Rehash

[diagram: cells 2:23, 3:58, 4:49 all ACTIVE; the 5-cell table is more than half full]

Rehash

[diagram: a new table of size 11 (cells 0-10, the nearest prime ≥ 2 × 5), all cells EMPTY; an index i starts scanning the old table 2:23, 3:58, 4:49]

Rehash

[diagram: i advances through the old table, skipping EMPTY cells; the new table is still all EMPTY]

Rehash

[diagram: i reaches old cell 2; 23 % 11 = 1, so 23 is inserted into cell 1 of the new table]

Rehash

[diagram: i reaches old cell 3; 58 % 11 = 3, so 58 is inserted into cell 3 of the new table]

Rehash

[diagram: i reaches old cell 4; 49 % 11 = 5, so 49 is inserted into cell 5 of the new table; the rehash is complete]

Rehash Complexity
  • O(N)
  • Initialization or offline (batch)
    • cost is amortized
    • at least N/2 inserts between rehashes
  • Interactive
    • can cause periodic unresponsiveness
    • program is snappy for N/2 – 1 operations
    • the (N/2)th causes a rehash
When to rehash
  • Chaining
    • when λ ≈ 1 (around full)
  • Probing options
    • as soon as the table is ½ full (λ = 0.5)
    • when an insert fails to find an empty cell
    • middle road: some arbitrary λ > 0.5
When to rehash probing HTs?
  • λ counts the number of non-empty cells
    • both active and deleted cells are counted against the load
    • deleted cells do not contain useful data
  • Why?
    • lookups keep looking until they find an empty cell
    • a table full of deleted cells has an O(N) lookup time
      • item is not in the table
      • need to look at all cells to be sure
External Hashing
  • What if our hash table is huge?
    • does not fit in main memory
  • How do we store a hash table on disk?
External Hashing
  • What if our hash table is huge?
    • does not fit in main memory
  • How do we store a hash table on disk?
    • each cell is stored in a disk block
  • How do we find the right disk block?
External Hashing
  • What if our hash table is huge?
    • does not fit in main memory
  • How do we store a hash table on disk?
    • each cell is stored in a disk block
  • How do we find the right disk block?
    • in-memory directory
    • directs us to correct block
External Hashing
  • If hash table has N entries
  • And each disk block can store M entries
  • Need at least N/M disk blocks and directory entries
    • may need more if keys not evenly distributed
    • unevenly distributed data may fill some blocks and leave others empty
External Hashing
  • If
    • hash table has K possible keys
    • each disk block can store M entries
  • Need
    • K/M disk blocks
    • K/M directory entries
  • May waste space and slowdown lookups
    • disk blocks may not be full
    • extra directory entries to look at
External Hashing

[diagram: an in-memory directory with entries 000-111, each pointing to a disk block of hash keys:
000 → {00010}, 001 → {00100, 00101, 00110}, 010 → {01010}, 011 → {01100},
100 → {10000}, 101 → {10100, 10110, 10111}, 110 → {}, 111 → {11100, 11101}]

Assume 5-bit hash keys

Disk blocks can store 4 entries

Directory entry gives first 3 bits

External Hashing

[diagram: the same directory and blocks as above]

5 of the 8 blocks are less than half full

Wasted disk and directory space

Extendible Hashing
  • Use extendible hashing
  • Start with the smallest directory possible
    • at least one disk block is full
    • or would overflow if we shrunk the directory
  • Grow the directory and add disk blocks as needed
Extendible Hashing
  • Formally
  • D
    • number of bits used by directory
    • can be leading or trailing (first or last)
  • Number of directory entries = 2D
  • dL = number of bits shared by all hash keys in disk block L
    • dL ≤ D, and depends on the block (more later)
    • can be leading or trailing bits (must choose)
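A sketch of how the leading D bits select a directory entry (KEY_BITS = 5 matches the running example):

const int KEY_BITS = 5;

// The top D bits of a KEY_BITS-wide hash key index the directory.
int directoryIndex(unsigned key, int D) {
    return key &gt;&gt; (KEY_BITS - D);
}
// e.g., with D = 3, key 10111 maps to directory entry 101 (index 5)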
Extendible Hashing

[diagram: directory entries 00, 01, 10, 11, each pointing to a disk block:
00 → {00010, 00100, 00101, 00110}, 01 → {01010, 01100},
10 → {10000, 10100, 10110, 10111}, 11 → {11100, 11101}]

D = 2 (leading bits)

dL = 2 for all blocks

Cannot make D any smaller

Extendible Hashing

[diagram: the same directory and blocks as above]

Insert hash key 17 (10001)

No room in the 10 disk block

Need to expand

Extendible Hashing

[diagram: the full 10 block splits into 100 → {10000, 10001} and 101 → {10100, 10110, 10111}]

Split the 10 block

Use 3 bits to look up

Then insert hash key 17 (10001)

Extendible Hashing

[diagram: the directory doubles to 3-bit entries 000-111; entry 100 → {10000, 10001}, entry 101 → {10100, 10110, 10111}]

Double the directory to hold the new entries (100 & 101)

Extendible Hashing

[diagram: 000 and 001 → {00010, 00100, 00101, 00110}; 010 and 011 → {01010, 01100}; 100 → {10000, 10001}; 101 → {10100, 10110, 10111}; 110 and 111 → {11100, 11101}]

Assign new directory entries to blocks

Some blocks have hash keys from 2 directory entries

Extendible Hashing

[diagram: the same directory and blocks as above]

  • Only had to update the directory
    • no disk access required for unsplit blocks
  • Limits wasted block space
Extendible Hashing

[diagram: 100 → {10000, 10001} (dL = 3); 101 → {10100, 10110, 10111} (dL = 3); 110/111 → {11100, 11101} (dL = 2); 010/011 → {01010, 01100} (dL = 2); 000/001 → {00010, 00100, 00101, 00110} (dL = 2)]

Need extra bookkeeping

Differentiate between split and unsplit blocks

Store dL for each block

Extendible Hashing

[diagram: the same directory and blocks as above]

  • Overflows of unsplit blocks
    • only cause the block to split
    • no doubling of the directory
  • Insert 3 (00011)
Extendible Hashing

[diagram: the 00 block splits: 000 → {00010, 00011} (dL = 3) and 001 → {00100, 00101, 00110} (dL = 3); all other blocks unchanged]

  • Overflows of unsplit blocks
    • only cause the block to split
    • no doubling of the directory
  • Insert 3 (00011)
Extendible Hashing
  • Additional advantage
  • Rehash for an on-disk hash table would be very expensive
    • need to read all disk blocks
  • Extendible hashing amortizes the cost
    • splits each bucket individually
    • spreads cost over multiple inserts
Extendible Hashing
  • Special cases
  • Splitting if keys in a disk block share more than D + 1 leading bits
    • may need to double directory multiple times
    • until keys divide over two blocks
  • more than M duplicate hash keys
    • no amount of splitting will fix this
    • may need an “overflow” block to store these
Hashing Wrap Up
  • How do we use C++ hash_maps and hash_sets?
  • When should we use a map…
    • backed by a hash table
    • backed by a tree (e.g., BST, B+)
  • When should we use a set…
    • backed by a hash table
    • backed by a tree (e.g., BST, B+)
Hashing in STL
  • hash_map and hash_set
    • alternative implementation of map and set
    • use a hash table
  • Not an STL standard
  • Most compilers provide an implementation
Hashing in Visual C++
  • Found in the <hash_map> and <hash_set> headers
  • In the stdext namespace
    • not in the std namespace like <vector>, …
  • Similar methods to map and set
  • You can provide two functors
    • hash: provides the hashing function
    • compare: compares two keys
    • defaults to identity hash and <
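A sketch of the two-functor idea using std::unordered_map, the standardized successor to stdext::hash_map (the hash functor below reuses the 37-polynomial from earlier; key equality defaults to operator==):

#include &lt;string&gt;
#include &lt;unordered_map&gt;

struct StringHash {
    size_t operator()(const std::string&amp; s) const {
        size_t h = 0;
        for (char c : s)
            h = 37 * h + static_cast&lt;unsigned char&gt;(c);   // polynomial hash
        return h;
    }
};

std::unordered_map&lt;std::string, int, StringHash&gt; ids;   // ids["able"] = 1;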
Lookup
  • Lookup Key k
    • compute h = hash(k)
    • see if k is in the list in hashtable[h]
  • Time for lookup
    • Time for step 1 + step 2
  • Worst-case for step 2
    • All keys hash to same index
    • O(# keys in table) = O(N)
Hashing Loophole
  • If
    • hash function distributes keys uniformly
    • probability that hash(k) = h is 1/TableSize
    • for all h in range 0 to TableSize – 1
  • Then
    • expected # of keys hashing to a given bucket = N/TableSize
    • if N ≤ TableSize, each bucket holds at most 1 key on average
Hashing Loophole
  • Loophole compacts to…
    • If hash function distributes keys uniformly
    • AND, subset of keys distributes uniformly
    • AND, # of keys ≤ TableSize
    • AND, hash function is O(1)
    • Then, average time for lookup is O(1)
Insert
  • Insert Key k
    • compute h = hash(k)
    • put k in table at or near table[h]
  • Complexity
    • hash function: should be O(1)
    • collision resolution: O(N)
      • chained: must check all keys in list
      • probing: probe may hit every other filled cell
    • Worst case: O(N)
    • Loophole average case: O(1)
Delete
  • Delete Key k
    • compute h = hash(k)
    • remove k from the cell at or near table[h]
  • Complexity
    • same as lookup and insert
    • O(N) in the worst case
    • O(1) in the loophole average case
Summary
  • Loophole
    • limited collisions
    • O(1) average complexity for lookup, insert, and delete
  • Worst case times
    • insert: O(N)
      • even with the loophole, a rehash makes this possible
    • lookup, delete: O(N)
Summary
  • Alternative implementation for sets and maps, but…
  • Balanced trees: all operations are O(LogN)
    • safe middle of the road performance
  • Gamble on hash implementations
    • potential O(1) operations
    • potential O(N) operations
  • Some operations are not efficient
    • print in sorted order
    • find largest/smallest
When Hashing is Guaranteed Safe
  • Must be positive there will be a small # of hash key collisions
    • not just small probability
    • an actual worst-case small # of collisions
  • All keys are known in advance and hashing doesn’t cause a large # of collisions
  • The map/set will always store all keys
    • no collisions due to modulus
    • no key similarities caused by storing only a select sample of keys