## PowerPoint Slideshow about 'Hashing' - cathal

Presentation Transcript

Motivation

- BST
- easy to implement
- average-case time O(log N)
- worst-case time O(N)
- AVL Trees
- harder to implement
- worst-case time O(log N)
- Can we do better in the average-case?

Concept

- “Dictionary” ADT
- average-case time O(1) for lookup, insert, and delete
- Idea
- stores keys (and associated values) in an array
- compute each key’s array index as a function of its value
- take advantage of array’s fast random access
- Alternative implementation for sets and maps

Example

- Goal
- Store info about a company's 50 employees
- Each employee has a unique employee ID
- in range 100-200
- Approach
- use an array of size 101 (range of IDs)
- store employee E’s info in array[E-100]
- Result
- insert, lookup, delete each O(1)
- Wasted space: 51 of the 101 locations are never used
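
The employee example can be sketched in C++ as a direct-address table. This is only an illustration under assumptions: the class name `EmployeeTable` and the idea of storing a name as the employee's "info" are hypothetical and do not come from the slides.

```cpp
#include <array>
#include <cassert>
#include <string>

// Direct-address table for employee IDs 100..200 (101 slots).
class EmployeeTable {
public:
    EmployeeTable() : present(), names() {}  // all slots start unused

    void insert(int id, const std::string& name) {
        present[index(id)] = true;
        names[index(id)] = name;
    }
    bool contains(int id) const { return present[index(id)]; }
    void remove(int id) {
        present[index(id)] = false;
        names[index(id)].clear();
    }

private:
    // hash(x) = x - 100, as on the Terminology slide.
    static int index(int id) {
        assert(id >= 100 && id <= 200);
        return id - 100;
    }
    std::array<bool, 101> present;
    std::array<std::string, 101> names;
};
```

Every operation touches exactly one array cell, which is where the O(1) bound comes from.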

Drawbacks

- Less functionality than trees
- Hash tables cannot efficiently
- find min
- find max
- print entire table in sorted order
- Must be very careful how we use them

Terminology

- Hashtable
- the underlying array
- Hash function
- function that converts a key to an index
- in example: hash(x) = x – 100
- TableSize
- size of underlying array or vector
- Bucket
- single cell of a hash table array
- Collision
- when two keys hash to the same bucket

Assumptions

- Keys we are using have a hash function
- or we can define good hash functions for them
- Keys overload the following operators
- ==
- !=

Resolving Obvious Problems

How do we make a good hash function?

What should we do about collisions?

How large should we make our hash table?

Hash Function Goals

- Hash function should be fast
- Keys should be evenly distributed
- different keys should have different hash values
- Should reduce space needed
- e.g., student IDs are 10 digits
- do not need an array size of 10,000,000,000
- there are only ~3,000 students

Hash Functions Approach

- Convert key to an int n
- scramble up the data
- ensure the data spreads over the entire integer space
- Return n % TableSize
- ensures that n doesn’t fall off the end of the table

Example: Converting Strings

- Method 1
- convert each char to an int
- sum them
- return sum % TableSize
- Advantages
- simple
- time is O(key length)

Example: Converting Strings

- Method 1
- convert each char to an int
- sum them
- return sum % TableSize
- Problems
- short keys may not reach the end of the table
- sum of characters is often far smaller than TableSize
- maps all permutations to the same hash
- hash(“able”) = hash(“bale”)
- Time is O(key length)

Example: Converting Strings

- Method 2
- Multiply individual chars by different values
- Then sum
- a[0] * 37^n + a[1] * 37^(n-1) + … + a[n-1] * 37
- i.e., the sum over i of a[i] * 37^(n-i)
- Advantages
- produces big range of values
- permutations hash to different values

Example: Converting Strings

- Method 2
- Multiply individual chars by different values
- Then sum
- Disadvantages
- relies on integer overflow
- need to worry about negative hashes
- Handling negative hash
- hash = hash % TableSize
- if(hash < 0) hash += TableSize
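
Both string methods, plus the negative-hash fix-up, might look like the following sketch. The function names are hypothetical. Two choices here differ from the slides and are assumptions: Method 2 is evaluated with Horner's rule, which gives the same polynomial up to a constant factor of 37, and it uses unsigned arithmetic, because signed overflow is undefined behavior in C++ while unsigned arithmetic wraps around.

```cpp
#include <cassert>
#include <string>

// Method 1: sum the character codes, then reduce modulo the table size.
int hashSum(const std::string& key, int tableSize) {
    assert(tableSize > 0);
    unsigned int sum = 0;
    for (unsigned char c : key) sum += c;
    return static_cast<int>(sum % tableSize);
}

// Method 2: weight each character by a power of 37 (Horner's rule).
// Unsigned overflow wraps, so the value is never negative.
int hashPoly(const std::string& key, int tableSize) {
    assert(tableSize > 0);
    unsigned int h = 0;
    for (unsigned char c : key) h = 37 * h + c;
    return static_cast<int>(h % tableSize);
}

// The slides' fix-up for a signed hash value that may be negative.
int toIndex(int hash, int tableSize) {
    hash %= tableSize;
    if (hash < 0) hash += tableSize;
    return hash;
}
```

With a table size of 101, `hashSum` sends "able" and "bale" to the same bucket, while `hashPoly` separates them.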

Hash Function Tradeoffs

- Fast hash vs. evenly distributed hash
- often faster leads to less evenly distributed
- even distribution leads to slower
- String example
- could use only some of the characters
- faster, but more collisions likely

Handling collisions

- What if two keys hash to the same bucket (array entry)?
- Array entries are linked lists (or trees)
- different keys with same hash value stored in same list (or tree)
- commonly called chained bucket hashing, or just chaining

Handling Collisions Example

- TableSize = 10
- keys: 10 digit student IDs
- hashfn = sum of digits % TableSize

Handling collisions

- During a lookup
- How can we tell which value we want if there are > 1 entries in the bucket?
- Compare the keys
- buckets store keys and values

Hash Table Size

- Related to load factor λ
- ratio of items in hash table to TableSize
- average length of a bucket list is λ
- Goal is to keep λ around 1

Hash Table Size

- Related to hashing function
- Some hashing functions lead to data clustered together
- Using a prime TableSize helps resolve this issue
- hash values are then unlikely to share a factor with the table size

Hash Table Size

- If number of keys known in advance
- make the hash table a little larger
- prime near 1.25 * the number of keys
- a little room to avoid collisions
- trades space for potentially faster lookup
- If number of keys not known in advance
- plan to expand array as needed
- coming up in another lecture
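
Picking "a prime near 1.25 * the number of keys" could be done as in this sketch. The helper names are hypothetical, and rounding up to the next prime is an assumption, since the slide only says "near".

```cpp
#include <cmath>

// Trial division; fine for table-size magnitudes.
bool isPrime(int n) {
    if (n < 2) return false;
    for (int d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Smallest prime greater than or equal to n.
int nextPrime(int n) {
    while (!isPrime(n)) ++n;
    return n;
}

// Table size for a known key count: first prime at or above 1.25 * numKeys.
int chooseTableSize(int numKeys) {
    return nextPrime(static_cast<int>(std::ceil(1.25 * numKeys)));
}
```

For 80 keys this yields a table of 101 buckets, for a load factor of roughly 0.8.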

Hash Table operations

- Lookup Key k
- compute h = hash(k)
- see if k is in the list in hashtable[h]
- Insert Key k
- Compute h = hash(k)
- Make sure k is not already in hashtable[h]
- Add k to the list in hashtable[h]
- Delete Key k
- Compute h = hash(k)
- Remove k from list in hashtable[h]

HashSet Class

    template<class K, class Hash>
    class HashSet{
    private:
        vector< list<K> > table;
        int currentSize;
        Hash hashfn;
    public:
        …
        bool contains(const K&) const;
        void insert(const K&);
        void remove(const K&);
    };
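
One way to fill in the three operations from the previous slide is sketched below. It is not the lecture's implementation: the class name `ChainedHashSet`, the default of `std::hash`, the default table size of 11, and the `size()` accessor are all assumptions made to keep the example self-contained.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <list>
#include <vector>

template <class K, class Hash = std::hash<K>>
class ChainedHashSet {
public:
    explicit ChainedHashSet(std::size_t tableSize = 11)
        : table(tableSize), currentSize(0) {}

    // Lookup: hash to a bucket, then compare keys in that bucket's list.
    bool contains(const K& k) const {
        const std::list<K>& bucket = table[indexOf(k)];
        return std::find(bucket.begin(), bucket.end(), k) != bucket.end();
    }

    // Insert: make sure k is not already present, then append to the list.
    void insert(const K& k) {
        if (contains(k)) return;
        table[indexOf(k)].push_back(k);
        ++currentSize;
    }

    // Delete: remove k from its bucket's list if it is there.
    void remove(const K& k) {
        std::list<K>& bucket = table[indexOf(k)];
        auto it = std::find(bucket.begin(), bucket.end(), k);
        if (it != bucket.end()) {
            bucket.erase(it);
            --currentSize;
        }
    }

    int size() const { return currentSize; }

private:
    std::vector<std::list<K>> table;
    int currentSize;
    Hash hashfn;

    std::size_t indexOf(const K& k) const { return hashfn(k) % table.size(); }
};
```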

Alternative to chaining

- Recall chaining hash tables
- array cells stored linked lists
- 2 keys with same hash end up in same list
- Chaining hash tables
- require 2 data structures
- hash table and linked list
- Can we solve collisions with more hashing?
- use just one data structure

Probing Hash Tables

- No linked list in array cells
- Collisions handled using alternative hash
- try cells h_0(x), h_1(x), h_2(x), …
- until an empty cell is found
- h_i(x) = hash(x) + f(i)
- f(i) is the collision resolution strategy
- Probing
- looking for alternative hash locations

Probing Hash Tables

- All data goes directly into table
- instead of into lists in the table
- Need a bigger table
- load factor λ ≈ 0.5 (half full)
- More wasted space
- Marginally less complexity

Linear probing

- f(i) is a linear function
- often f(i) = i
- If a collision occurs, look in the next cell
- hash(x) + 1
- keep looking until an empty cell is found
- hash(x) + 2, hash(x) + 3, …
- use modulus to wrap around table
- Should eventually find an empty cell
- if the table is not full

Linear probing

- Advantages
- no need for list
- collision resolution function is fast
- Disadvantages
- requires more book keeping
- primary clustering

Probing Extra Book Keeping

[Figure: Delete 89. A 10-cell table (cells 0-9) holds 49, 58, 89, and 18; the first probe location is labeled h_0(x).]

What if an entry is deleted and we try to look up another entry that collided with it?

Probing Extra Book Keeping

[Figure: Lookup 49. The 10-cell table now holds 49, 58, and 18; the probe at h_0(x) is labeled Not Found.]

Probing Extra Book Keeping

- Need extra information per cell
- Differentiate between states
- ACTIVE: cell contains a valid key
- EMPTY: cell never contained a valid key
- DELETED: previously contained a valid key
- All cells start EMPTY
- Lookup
- keep probing until you find the key or an EMPTY cell (DELETED cells do not stop the search)

Probing HashSet Class

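    // HashEntry is declared first and templated on K so that HashSet can use it.
    template<class K>
    class HashEntry{
    public:
        enum EntryType{ACTIVE, EMPTY, DELETED};
    private:
        K element;
        EntryType info;
        template<class, class> friend class HashSet;
    };

    template<class K, class Hash>
    class HashSet{
    private:
        vector< HashEntry<K> > table;
        int currentSize;
        ...
    };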
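
The three cell states can be exercised with a small linear-probing set. This is a sketch under assumptions, not the lecture's code: the class name `ProbingHashSet`, the default of `std::hash`, and the choice to let inserts reuse DELETED cells are all mine.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

template <class K, class Hash = std::hash<K>>
class ProbingHashSet {
public:
    explicit ProbingHashSet(std::size_t tableSize = 11) : table(tableSize) {}

    bool contains(const K& k) const { return findActive(k) != table.size(); }

    // Returns false if k is already present or no free cell exists.
    bool insert(const K& k) {
        if (contains(k)) return false;
        std::size_t start = hashfn(k) % table.size();
        for (std::size_t i = 0; i < table.size(); ++i) {  // f(i) = i
            Entry& e = table[(start + i) % table.size()];
            if (e.info != ACTIVE) {  // EMPTY and DELETED cells are both free
                e.element = k;
                e.info = ACTIVE;
                return true;
            }
        }
        return false;
    }

    // Mark the cell DELETED instead of EMPTY so later lookups keep probing.
    void remove(const K& k) {
        std::size_t pos = findActive(k);
        if (pos != table.size()) table[pos].info = DELETED;
    }

private:
    enum EntryType { ACTIVE, EMPTY, DELETED };
    struct Entry {
        K element;
        EntryType info;
        Entry() : element(), info(EMPTY) {}  // all cells start EMPTY
    };
    std::vector<Entry> table;
    Hash hashfn;

    // Index of the ACTIVE cell holding k, or table.size() if k is absent.
    std::size_t findActive(const K& k) const {
        std::size_t start = hashfn(k) % table.size();
        for (std::size_t i = 0; i < table.size(); ++i) {
            std::size_t pos = (start + i) % table.size();
            const Entry& e = table[pos];
            if (e.info == EMPTY) break;  // never used: k cannot be further on
            if (e.info == ACTIVE && e.element == k) return pos;
        }
        return table.size();
    }
};
```

Inserting 89, 18, 49, and 58 into a 10-cell table and then removing 89 leaves 49 and 58 findable, because the search walks past the DELETED marker rather than stopping at it.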

Linear Probing Hashing Recall

- No more bucket lists
- Use collision resolution strategy
- h_i(x) = hash(x) + f(i)
- If collision occurs, try the next cell
- f(i) = i
- repeat until you find an empty cell
- Need extra book keeping
- ACTIVE, EMPTY, DELETED

Linear Probing Hashing

- What could go wrong?
- How can we fix it?
- Professor Meehean, you haven’t told us what “it” is yet.

Primary clustering

- Clusters of data
- requires several attempts to resolve collisions
- makes cluster even bigger
- too many 9’s eat up all of 8’s space
- then the 8’s eat up 7’s space, etc…
- Inserting keys in space that should be empty results in collisions
- clusters have overrun whole chunks of the hash table

Primary Clustering

Only gets worse as load factor gets larger

As memory use gets more efficient

Performance gets worse

Quadratic Probing

- Primary clustering caused by linear nature of linear probing
- colliding keys end up right next to each other
- What if we jumped farther away on a collision?
- f(i) = i^2
- If a collision occurs…
- hash(x) + 1, hash(x) + 4, hash(x) + 9, …

Quadratic Probing

- Restrictions
- TableSize must be a prime
- table must be no more than half full
- i.e., load factor λ ≤ 0.5
- If these restrictions are met
- guaranteed to find an empty cell, eventually
- If not
- no guarantee of finding an empty cell
- an insert might fail due to continuous collisions
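
The two restrictions can be checked empirically by counting how many distinct cells the quadratic probe sequence reaches. The helper below is a hypothetical illustration, not part of the lecture.

```cpp
#include <set>

// Number of distinct cells visited by the first `probes` quadratic probes
// h_i = (home + i*i) % tableSize, for i = 0 .. probes-1.
int distinctQuadraticCells(int home, int tableSize, int probes) {
    std::set<int> seen;
    for (int i = 0; i < probes; ++i)
        seen.insert((home + i * i) % tableSize);
    return static_cast<int>(seen.size());
}
```

With the prime size 11, the first 6 probes hit 6 different cells, but probing further never reaches any new ones, which is why the table must stay at most half full. With the non-prime size 16, only 4 cells are ever reached.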

Secondary Clustering

- Quadratic probing eliminates primary clustering
- Keys with the same hash…
- probe the same alternative cells
- clusters still exist per bucket
- just spread out
- Called secondary clustering
- Can we beat secondary clustering?

Double Hashing

- If the first hashing function causes a collision, try a second hashing function
- h_i(x) = hash(x) + f(i)
- f(i) = i • hash2(x)
- h_0(x) = hash(x)
- h_1(x) = hash(x) + hash2(x)
- h_2(x) = hash(x) + 2 • hash2(x)
- h_3(x) = hash(x) + 3 • hash2(x)

Double Hashing

- hash2(x) must be carefully selected
- It can never be 0
- h_1(x) = hash(x) + 1 • 0
- h_2(x) = hash(x) + 2 • 0
- h_1(x) = h_2(x) = … = h_n(x) = hash(x)
- It must eventually probe all cells
- quadratic probing only reaches half of them
- requires TableSize to be prime

Double Hashing

- hash2(x) = R – (x % R)
- where R is a prime smaller than TableSize
- previous value of TableSize?

Double Hashing

[Figure: Insert 49. A 10-cell table (cells 0-9) holds 89 and 18; the first probe h_0(x) is labeled Collision.]

- h_i(x) = hash(x) + i • hash2(x)
- hash2(x) = R – (x % R)
- R = 7

Double Hashing

[Figure: Insert 49, continued. The table now holds 89, 49, and 18; 49 is placed at h_1(x).]

- h_1(x) = (9 + 1 • hash2(x)) % 10
- hash2(x) = 7 – (49 % 7) = 7 – 0 = 7
- h_1(x) = 16 % 10 = 6

Double Hashing

[Figure: Insert 23. A 10-cell table (cells 0-9) holds 69, 58, 89, 49, and 18; successive slides mark a collision at h_0(x), h_1(x), h_2(x), and h_3(x) in turn.]

- Why a prime TableSize is important
- h_i(x) = ((x % TableSize) + i • hash2(x)) % TableSize
- hash2(x) = 7 – (23 % 7) = 7 – 2 = 5
- h_i(x) = (3 + i • 5) % 10
- h_0(x) = 3, collision
- h_1(x) = (3 + 1 • 5) % 10 = 8, collision
- h_2(x) = (3 + 2 • 5) % 10 = 3, collision
- h_3(x) = (3 + 3 • 5) % 10 = 8, collision

Double Hashing

- Why a prime TableSize is important
- h_i(x) = ((x % TableSize) + i • hash2(x)) % TableSize
- hash2(x) = 7 – (23 % 7) = 7 – 2 = 5
- h_i(x) = (3 + i • 5) % 10
- 5 is a factor of 10
- the probe sequence cycles forever through the same two buckets (3 and 8)
- if TableSize is prime, hash2(x) is smaller than TableSize and can never share a factor with it
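
The probe sequences from the Insert 49 and Insert 23 walk-throughs can be reproduced with a few lines. The function names and the parameter `r` (the slides' R) are chosen for illustration only.

```cpp
#include <vector>

// Second hash from the slides: R - (x % R), always in 1..R, so never 0.
int hash2(int x, int r) { return r - (x % r); }

// First `count` probe locations
// h_i(x) = (x % tableSize + i * hash2(x)) % tableSize.
std::vector<int> probeSequence(int x, int tableSize, int r, int count) {
    std::vector<int> cells;
    for (int i = 0; i < count; ++i)
        cells.push_back((x % tableSize + i * hash2(x, r)) % tableSize);
    return cells;
}
```

For x = 23, R = 7, and TableSize = 10 the sequence is 3, 8, 3, 8, so only two cells are ever tried. Changing TableSize to the prime 11 gives 1, 6, 0, 5, and so on, and the sequence does not repeat until every cell has been visited.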

Hash Table Problems

- What to do when hash table gets too full?
- problem for both chained and probing HTs
- degrades performance
- may cause insert failure for quadratic probing

Rehash

- Create another table about 2x the size
- use the nearest prime to 2x the old table size
- Scan original table
- compute new hash for valid entries
- insert into new table
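
For a chained table of ints, the rehash steps might look like this sketch. The function names are hypothetical, `std::hash<int>` stands in for whatever hash function the table uses, and rounding up to the first prime at or above 2x is an assumption.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <vector>

// Smallest prime greater than or equal to n (trial division).
std::size_t primeAtLeast(std::size_t n) {
    if (n < 2) n = 2;
    for (;; ++n) {
        bool prime = true;
        for (std::size_t d = 2; d * d <= n; ++d)
            if (n % d == 0) { prime = false; break; }
        if (prime) return n;
    }
}

// Build a table roughly twice as large and re-insert every key, O(N) overall.
std::vector<std::list<int>> rehash(const std::vector<std::list<int>>& old) {
    std::size_t newSize = primeAtLeast(2 * old.size());
    std::vector<std::list<int>> bigger(newSize);
    std::hash<int> hashfn;
    for (const std::list<int>& bucket : old)      // scan original table
        for (int key : bucket)                    // every stored key is valid
            bigger[hashfn(key) % newSize].push_back(key);  // new hash index
    return bigger;
}
```

Each key's bucket must be recomputed because the index depends on the table size; copying lists across unchanged would leave keys in the wrong buckets.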

Rehash Complexity

- O(N)
- Initialization or offline (batch)
- cost is amortized
- at least N/2 inserts between rehash
- Interactive
- can cause periodic unresponsiveness
- program is snappy for N/2 – 1 operations
- the (N/2)th operation triggers the rehash

When to rehash

- Chaining
- when load factor λ ≈ 1 (around full)
- Probing options
- as soon as the table is ½ full
- when an insert fails to find an empty cell
- middle road: some arbitrary load factor λ > 0.5

When to rehash probing HTs?

- Load factor counts the number of non-empty cells
- both active and deleted cells are counted against the load
- deleted cells do not contain useful data
- Why?
- lookups keep looking until they find an empty cell
- a table full of deleted cells has an O(N) lookup time
- if the item is not in the table
- need to look at all cells to be sure

External Hashing

- What if our hash table is huge?
- does not fit in main memory
- How do we store a hash table on disk?
- each cell is stored in a disk block
- How do we find the right disk block?
- in-memory directory
- directs us to correct block

External Hashing

- If hash table has N entries
- And each disk block can store M entries
- Need at least N/M disk blocks and directory entries
- may need more if keys not evenly distributed
- unevenly distributed data may fill some blocks and leave others empty

External Hashing

- If
- hash table has K possible keys
- each disk block can store M entries
- Need
- K/M disk blocks
- K/M directory entries
- May waste space and slow down lookups
- disk blocks may not be full
- extra directory entries to look at

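External Hashing

[Figure: an in-memory directory with eight 3-bit entries, each pointing to its own disk block.]

- 000: 00010
- 001: 00100, 00101, 00110
- 010: 01010
- 011: 01100
- 100: 10000
- 101: 10100, 10110, 10111
- 110: (empty)
- 111: 11100, 11101

Assume 5-bit hash keys

Disk blocks can store 4 entries

Directory entry gives first 3 bits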

External Hashing

[Figure: the same directory and disk blocks as on the previous slide.]

5 of the 8 blocks are less than half full

Wasted disk and directory space

Extendible Hashing

- Use extendible hashing
- Start with the smallest directory possible
- at least one disk block is full
- or would overflow if we shrunk the directory
- Grow the directory and add disk blocks as needed

Extendible Hashing

- Formally
- D: number of bits used by the directory
- can be leading or trailing (first or last) bits
- number of directory entries = 2^D
- d_L: number of bits shared by all hash keys in disk block L
- d_L ≤ D, depends on the block (more later)
- can be leading or trailing bits (must choose)

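Extendible Hashing

[Figure: a directory with four 2-bit entries, each pointing to its own disk block.]

- 00: 00010, 00100, 00101, 00110
- 01: 01010, 01100
- 10: 10000, 10100, 10110, 10111
- 11: 11100, 11101

D = 2 (leading bits)

d_L = 2 for all blocks

Cannot make D any smaller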

Extendible Hashing

[Figure: the same 2-bit directory and disk blocks as on the previous slide.]

Insert hash key 17 (10001)

No room in the 10 disk block

Need to expand

Extendible Hashing

[Figure: the 10 block (10000, 10100, 10110, 10111) splits into a 100 block (10000, 10001) and a 101 block (10100, 10110, 10111).]

Split the 10 block

Use 3 bits to look up

Then insert hash key 17 (10001)

Extendible Hashing

[Figure: the directory now has eight 3-bit entries (000 to 111); the two new blocks hold 10000, 10001 and 10100, 10110, 10111.]

Double the directory to hold the new entries (100 and 101)

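Extendible Hashing

[Figure: the eight-entry directory with every entry assigned to a disk block.]

- 000 and 001: 00010, 00100, 00101, 00110
- 010 and 011: 01010, 01100
- 100: 10000, 10001
- 101: 10100, 10110, 10111
- 110 and 111: 11100, 11101

Assign new directory entries to blocks

Some blocks hold hash keys from 2 directory entries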

Extendible Hashing

[Figure: the same eight-entry directory and five disk blocks as on the previous slide.]

- Only had to update directory
- no disk access required for unsplit blocks
- Limits wasted block space

Extendible Hashing

[Figure: the same eight-entry directory, with each disk block labeled by its d_L.]

- d_L = 3: 10000, 10001
- d_L = 3: 10100, 10110, 10111
- d_L = 2: 11100, 11101
- d_L = 2: 01010, 01100
- d_L = 2: 00010, 00100, 00101, 00110

Need extra book keeping

Differentiate between split and unsplit blocks

Store d_L for each block

Extendible Hashing

[Figure: the same eight-entry directory and d_L-labeled disk blocks as on the previous slide.]

- Overflows for unsplit blocks
- only cause block to split
- no doubling of directory
- Insert 3 (00011)

Extendible Hashing

[Figure: the eight-entry directory after inserting 3; the old 00 block has split in two.]

- d_L = 3: 00010, 00011
- d_L = 3: 00100, 00101, 00110
- d_L = 3: 10000, 10001
- d_L = 3: 10100, 10110, 10111
- d_L = 2: 11100, 11101
- d_L = 2: 01010, 01100

- Overflows for unsplit blocks
- only cause block to split
- no doubling of directory
- Insert 3 (00011)

Extendible Hashing

- Additional advantage
- Rehash for an on-disk hash table would be very expensive
- need to read all disk blocks
- Extendible hashing amortizes the cost
- splits each bucket individually
- spreads cost over multiple inserts

Extendible Hashing

- Special cases
- Splitting if keys in disk block share more than D + 1 leading bits
- may need to double directory multiple times
- until keys divide over two blocks
- M duplicate hash keys
- no amount of splitting will fix this
- may need an “overflow” block to store these
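
The directory and block mechanics from the preceding slides can be simulated in memory. The sketch below rests on several assumptions that are not in the lecture: the class name `ExtendibleHash`, storing the 5-bit hash keys themselves, indexing on leading bits, and using `std::shared_ptr` objects in place of disk blocks. It also has no overflow blocks, so it simply refuses an insert that cannot be resolved by splitting.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

class ExtendibleHash {
public:
    static constexpr int KEY_BITS = 5;               // 5-bit hash keys
    static constexpr std::size_t BLOCK_CAPACITY = 4; // M = 4 entries per block

    ExtendibleHash() : D(0), directory(1, std::make_shared<Block>()) {}

    bool contains(unsigned key) const {
        const std::vector<unsigned>& keys = directory[indexOf(key)]->keys;
        return std::find(keys.begin(), keys.end(), key) != keys.end();
    }

    // Returns false for a duplicate, or if splitting cannot make room.
    bool insert(unsigned key) {
        assert(key < (1u << KEY_BITS));
        if (contains(key)) return false;
        while (directory[indexOf(key)]->keys.size() == BLOCK_CAPACITY) {
            int dL = directory[indexOf(key)]->dL;
            if (dL == KEY_BITS) return false;  // would need an overflow block
            if (dL == D) doubleDirectory();    // block already uses all D bits
            split(indexOf(key));               // otherwise only the block splits
        }
        directory[indexOf(key)]->keys.push_back(key);
        return true;
    }

    int globalDepth() const { return D; }
    int localDepth(unsigned key) const { return directory[indexOf(key)]->dL; }

private:
    struct Block {
        int dL;                      // bits shared by all keys in this block
        std::vector<unsigned> keys;
        Block() : dL(0) {}
    };
    int D;                                           // bits used by directory
    std::vector<std::shared_ptr<Block>> directory;   // 2^D entries

    // Directory index = leading D bits of the key.
    std::size_t indexOf(unsigned key) const { return key >> (KEY_BITS - D); }

    // Each old entry i becomes new entries 2i and 2i+1, sharing one block.
    void doubleDirectory() {
        std::vector<std::shared_ptr<Block>> bigger(directory.size() * 2);
        for (std::size_t i = 0; i < bigger.size(); ++i)
            bigger[i] = directory[i >> 1];
        directory.swap(bigger);
        ++D;
    }

    // Split the block at directory[idx] on its next leading bit.
    void split(std::size_t idx) {
        std::shared_ptr<Block> old = directory[idx];
        std::shared_ptr<Block> zero = std::make_shared<Block>();
        std::shared_ptr<Block> one = std::make_shared<Block>();
        zero->dL = one->dL = old->dL + 1;
        int keyShift = KEY_BITS - zero->dL;  // position of the new bit in a key
        for (unsigned k : old->keys)
            (((k >> keyShift) & 1u) ? one : zero)->keys.push_back(k);
        int dirShift = D - zero->dL;         // same bit, in a directory index
        for (std::size_t i = 0; i < directory.size(); ++i)
            if (directory[i] == old)
                directory[i] = ((i >> dirShift) & 1u) ? one : zero;
    }
};
```

Inserting the twelve keys from the slides leaves D = 2; inserting 17 then doubles the directory to D = 3, while inserting 3 afterwards only splits a block.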

Hashing Wrap Up

- How do we use C++ hash_maps and hash_sets?
- When should we use a map…
- backed by a hash table
- backed by a tree (e.g., BST, B+)
- When should we use a set
- backed by a hash table
- backed by a tree (e.g., BST, B+)

Hashing in STL

- hash_map and hash_set
- alternative implementation of map and set
- use a hash table
- Not part of the C++ standard (C++11 later standardized std::unordered_map and std::unordered_set)
- Most compilers provide an implementation

Hashing in Visual C++

- Found in the <hash_map> and <hash_set> headers
- In the stdext namespace
- not in the std namespace like <vector>, …
- Similar methods to map and set
- You can provide two functors
- hash: provides the hashing function
- compare: compares two keys
- defaults to identity hash and <
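
Since `hash_set` is compiler-specific, the sketch below uses `std::unordered_set` from the C++11 standard library instead, which takes a hash functor in the same spirit. The functor `PolyHash` and the function `countDistinct` are hypothetical names; the hash itself is the base-37 string method from earlier in the lecture.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hash functor: polynomial string hash with base 37.
struct PolyHash {
    std::size_t operator()(const std::string& s) const {
        std::size_t h = 0;
        for (unsigned char c : s) h = 37 * h + c;
        return h;  // the container reduces this to a bucket index itself
    }
};

// Counts distinct words by inserting them into a hash-backed set.
std::size_t countDistinct(const std::vector<std::string>& words) {
    std::unordered_set<std::string, PolyHash> seen(words.begin(), words.end());
    return seen.size();
}
```

Unlike the `compare` functor described for `hash_set`, `std::unordered_set` takes an equality predicate, which defaults to `==`.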

Lookup

- Lookup Key k
- compute h = hash(k)
- see if k is in the list in hashtable[h]
- Time for lookup
- Time for step 1 + step 2
- Worst-case for step 2
- All keys hash to same index
- O(# keys in table) = O(N)

Hashing Loophole

- If
- hash function distributes keys uniformly
- probability that hash(k) = h is 1/TableSize
- for all h in range 0 to TableSize – 1
- Then
- probability that a new key collides with one of N stored keys ≤ N/TableSize
- expected number of stored keys in its bucket = N/TableSize, which is ≤ 1 if N ≤ TableSize

Hashing Loophole

- Loophole compacts to…
- If hash function distributes keys uniformly
- AND, subset of keys distributes uniformly
- AND, # of keys ≤ TableSize
- AND, hash function is O(1)
- Then, average time for lookup is O(1)

Insert

- Insert Key k
- compute h = hash(k)
- put k in table at or near table[h]
- Complexity
- hash function: should be O(1)
- collision resolution: O(N)
- chained: must check all keys in list
- probing: probe may hit every other filled cell
- Worst case: O(N)
- Loophole average case: O(1)

Delete

- Delete Key k
- compute h = hash(k)
- remove k from at or near table[h]
- Complexity
- same as lookup and insert
- O(N) in the worst case
- O(1) in the loophole average case

Summary

- Loophole
- limited collisions
- O(1) average complexity for lookup, insert, and delete
- Worst case times
- insert: O(N)
- even with the loophole, a rehash makes this possible
- lookup, delete: O(N)

Summary

- Alternative implementation for sets and maps, but…
- Balanced tree, all operations are: O(log N)
- safe middle of the road performance
- Gamble on hash implementations
- potential O(1) operations
- potential O(N) operations
- Some operations are not efficient
- print in sorted order
- find largest/smallest

When Hashing is Guaranteed Safe

- Must be positive there will be a small # of hash key collisions
- not just small probability
- an actual worst-case small # of collisions
- All keys are known in advance and hashing doesn’t cause a large # of collisions
- The map/set will always store all keys
- no collisions due to modulus
- no key similarities due to select sample
