
A Look at Modern Dictionary Structures & Algorithms


Presentation Transcript


  1. A Look at Modern Dictionary Structures & Algorithms Warren Hunt

  2. Dictionary Structures • Used for storing information as (key, value) pairs • The bread and butter of a data structures and algorithms course

  3. Common Dictionary Structures • List (Array) • Sorted List • Linked List • Move to Front List • Inverted Index List • Skip List ← check this one out • …

  4. Common Dictionary Structures • (Balanced) Binary Search Trees • AVL Tree • Red-Black Tree • Splay Tree • B-Tree • Trie • Patricia Tree • …

  5. Common Dictionary Structures • Hash Tables • Linear (or Quadratic) Probing • Separate Chaining (or Treeing) • Double Hashing • Perfect Hashing • Hash Trees • Cuckoo Hashing • d-ary • binned • …

  6. +Every Hybrid You Can Think Of! • Unfortunately, they don’t teach the cool ones… • Skip lists are a faster, easier-to-code alternative to most binary search trees • Invented in 1990! • Cuckoo Hashing has a huge number of nice properties (IMHO far superior to all other hashing designs) • Invented in 2001

  7. So many to choose from! Which is best? • That depends on your needs… • Sorted lists are simple and easy to implement (simple means fast on small datasets!) • Binary search trees and sorted lists provide easy access to sorted data • B-trees have great page-performance for databases • Hash tables have the fastest asymptotic lookup time

  8. Focus On Hashing for Now • Fastest lookup/insert/delete time: O(1) • Used in Bloom filters • (not the graphics kind of bloom!) • Useful in garbage collection (or anywhere you want to mark things as visited) • Small hash tables implement an associative cache • Easy to implement! (no pointer chasing)

  9. Traditional Hashing • Just make up an address in an array for some piece of data and stick it there • Hash function generates the address • Problems arise when two things have the same address, so we’ll address that: • Linear (or Quadratic) Probing • Separate Chaining (Treeing…) • Double Hashing
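
A minimal linear-probing sketch in C makes the collision-resolution idea concrete. The table size, the multiplicative hash constant, and all names here are illustrative assumptions, not from the talk:

#include <stdint.h>
#include <stddef.h>

#define TABLE_BITS 10
#define TABLE_SIZE (1u << TABLE_BITS)      /* power of two */

typedef struct { uint32_t key, value; int used; } slot_t;
static slot_t table[TABLE_SIZE];

static size_t hash(uint32_t key) {         /* simple multiplicative hash */
    return (key * 2654435761u) >> (32 - TABLE_BITS);
}

/* Probe forward from the hash address until we find the key or hit
   an empty slot (which proves the key is absent). */
static slot_t *lp_lookup(uint32_t key) {
    for (size_t i = hash(key), n = 0; n < TABLE_SIZE;
         i = (i + 1) & (TABLE_SIZE - 1), n++) {
        if (table[i].used && table[i].key == key) return &table[i];
        if (!table[i].used) return NULL;
    }
    return NULL;                           /* table completely full */
}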

  10. Problems With Traditional Hashing • Without separate chaining, the table can’t get too full or performance degrades sharply • With separate chaining, we have poor cache performance and still O(n) worst-case behavior • Separate treeing provides O(log n) worst case, but they don’t teach that in school… • Linear probing is still the most common (fastest cache behavior; bite the bullet on poorer memory utilization)

  11. Good Hash Functions • All hash table implementations require good hash functions (with the exception of separate treeing) • Universal hash functions are required (number theory, I won’t discuss it here) • Cuckoo hashing is less strict (different assumptions are made in each paper to make proofs easier)
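
To make “universal” concrete, here is a sketch of the classic Carter-Wegman family h_{a,b}(x) = ((a*x + b) mod p) mod m, with p prime and a, b drawn at random per table. The prime and all names are illustrative assumptions:

#include <stdint.h>

#define P 4294967311ULL                    /* a prime just above 2^32 */

typedef struct { uint64_t a, b; uint32_t m; } uhash_t;

/* a is drawn at random from [1,P), b from [0,P); redrawing them is
   exactly the "choose new hash functions" step used when rehashing.
   The 128-bit intermediate avoids overflow (GCC/Clang extension). */
static uint32_t uhash(const uhash_t *h, uint32_t x) {
    return (uint32_t)((((unsigned __int128)h->a * x + h->b) % P) % h->m);
}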

  12. Cuckoo Hashing • Guaranteed O(1) lookup/delete • Amortized O(1) insert • 50% space efficient • Requires *mostly* random hash functions • Newish and largely unknown (barely mentioned in Wikipedia’s Hash Table article)

  13. Cuckoo Hashing • Use two hash tables and two hash functions • Each element will have exactly one “nest” (hash location) in each table • Guarantee that any element will only ever exist in one of its “nests” • Lookup/delete are O(1) because we can check 2 locations (“nests”) in O(1) time
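
A minimal two-table sketch in C of the lookup and delete just described. The table size, the two hash functions, and all names are illustrative assumptions:

#include <stdint.h>
#include <stddef.h>

#define CUCKOO_BITS 10
#define CUCKOO_SIZE (1u << CUCKOO_BITS)

typedef struct { uint32_t key, value; int used; } cell_t;
static cell_t T1[CUCKOO_SIZE], T2[CUCKOO_SIZE];

/* Two (illustrative) multiplicative hash functions. */
static size_t h1(uint32_t k) { return (k * 2654435761u) >> (32 - CUCKOO_BITS); }
static size_t h2(uint32_t k) { return ((k ^ 0x9e3779b9u) * 0x85ebca6bu) >> (32 - CUCKOO_BITS); }

/* A key can only live in T1[h1(key)] or T2[h2(key)], so lookup
   touches exactly two slots. */
static cell_t *cuckoo_lookup(uint32_t key) {
    cell_t *c = &T1[h1(key)];
    if (c->used && c->key == key) return c;        /* first nest */
    c = &T2[h2(key)];
    if (c->used && c->key == key) return c;        /* second nest */
    return NULL;                                   /* definitely absent */
}

static void cuckoo_delete(uint32_t key) {
    cell_t *c = cuckoo_lookup(key);
    if (c) c->used = 0;        /* clearing the slot is all delete needs */
}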

  14. Cuckoo Hashing - Insertion • 1. Insert an element by finding one of its “nests” and putting it there • 2. This may evict another element! • 3. Insert the evicted element into its *other* “nest”; this may evict yet another element! (goto 2.) • Under reasonable assumptions, this process will terminate in O(1) time… (sketch below)
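
A sketch of that eviction loop, reusing the definitions from the lookup sketch above. MAX_LOOP is the depth cutoff motivated on slide 16; its value here is an arbitrary assumption:

/* cell_t, T1, T2, h1, h2 as in the lookup sketch above. */
#define MAX_LOOP 32        /* depth cutoff; O(log n) is a common choice */

static int cuckoo_insert(uint32_t key, uint32_t value) {
    cell_t cur = { key, value, 1 };
    for (int n = 0; n < MAX_LOOP; n++) {
        /* Alternate tables: an evictee from T1 goes to its T2 nest
           and vice versa. */
        cell_t *nest = (n & 1) ? &T2[h2(cur.key)] : &T1[h1(cur.key)];
        cell_t evicted = *nest;
        *nest = cur;                    /* claim the nest */
        if (!evicted.used) return 1;    /* nobody was home: done */
        cur = evicted;                  /* now re-insert the evictee */
    }
    return 0;       /* possible loop: caller must rehash (slide 16) */
}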

  15. Why does this work? • Matching property of random graphs • With high probability, any matching under a saturation threshold (50% in this case) can take another edge without breaking • More details in the paper

  16. Overflowing the Table • Insertion can potentially fail, causing an infinite insertion loop • Detected using a depth cutoff • Caused by unlucky hash functions, or by a full hash table • Double the size of the table (if need be), choose new hash functions, and rehash all of the elements (sketch below)
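
A rehash sketch under the fixed-size arrays assumed earlier (with heap-allocated tables, the same code would also double the size first). reseed_hashes() is a hypothetical stand-in; it assumes h1/h2 read mutable random seeds rather than the fixed constants in the earlier sketch:

#include <string.h>

void reseed_hashes(void);    /* hypothetical: re-randomize h1 and h2 */

/* Snapshot every occupant, clear both tables, draw fresh hash
   functions, and re-insert with cuckoo_insert from the sketch
   above; retry with new functions if we are unlucky again. */
static void cuckoo_rehash(void) {
    static cell_t old[2 * CUCKOO_SIZE];
    size_t n = 0;
    for (size_t i = 0; i < CUCKOO_SIZE; i++) {
        if (T1[i].used) old[n++] = T1[i];
        if (T2[i].used) old[n++] = T2[i];
    }
retry:
    memset(T1, 0, sizeof T1);
    memset(T2, 0, sizeof T2);
    reseed_hashes();
    for (size_t i = 0; i < n; i++)
        if (!cuckoo_insert(old[i].key, old[i].value))
            goto retry;               /* a loop recurred: draw again */
}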

  17. Example • To the board!

  18. Asymmetric Cuckoo Hashing • Choose one (the first) table to be larger than the other • Improves the probability that we get a hit on the first lookup • Only a minor slowdown on insert

  19. Same Table Cuckoo Hashing • We didn’t actually need two separate tables • They just made the analysis much easier • In practice, a single table with two hash functions suffices (sketch below)
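
The change is mechanical; a hedged sketch with one array T and the two hash functions from before (all names illustrative):

/* One array replaces T1 and T2; only the addressing changes. */
static cell_t T[CUCKOO_SIZE];

static cell_t *st_lookup(uint32_t key) {
    cell_t *c = &T[h1(key)];
    if (c->used && c->key == key) return c;
    c = &T[h2(key)];
    return (c->used && c->key == key) ? c : NULL;
}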

  20. d-ary Cuckoo Hashing • Guaranteed O(1) lookup/delete • Amortized O(1) insert • 97%+ space efficient • Analysis requires random hash functions • (not quite as easy to implement) • (more robust against weaker hash functions)

  21. d-ary Cuckoo Hashing • Use d hash tables instead of two! • Lookup and delete look at d buckets • Insert is more complicated • Insertion sees a tree of possible eviction+insertion paths • BFS to find an empty nest • Random walk to find an empty nest (easier)
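
A random-walk insertion sketch for the d-ary case (the “easier” option above). D, MAX_KICKS, the per-table seeds, and all names are illustrative assumptions:

#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>

#define D 4
#define DARY_BITS 10
#define DARY_SIZE (1u << DARY_BITS)
#define MAX_KICKS 256

typedef struct { uint32_t key, value; int used; } dcell_t;
static dcell_t Td[D][DARY_SIZE];
static const uint32_t seed[D] = { 0x9e3779b9u, 0x85ebca6bu, 0xc2b2ae35u, 0x27d4eb2fu };

static size_t hd(int j, uint32_t k) { return ((k ^ seed[j]) * 2654435761u) >> (32 - DARY_BITS); }

static int dary_insert(uint32_t key, uint32_t value) {
    dcell_t cur = { key, value, 1 };
    for (int n = 0; n < MAX_KICKS; n++) {
        for (int j = 0; j < D; j++) {               /* take any empty nest */
            dcell_t *nest = &Td[j][hd(j, cur.key)];
            if (!nest->used) { *nest = cur; return 1; }
        }
        int j = rand() % D;                         /* else evict at random */
        dcell_t *nest = &Td[j][hd(j, cur.key)];
        dcell_t evicted = *nest;
        *nest = cur;
        cur = evicted;                              /* and the walk continues */
    }
    return 0;                                       /* fall back to rehash */
}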

  22. Bucketed Cuckoo Hashing • Guaranteed O(1) lookup/delete • Amortized O(1) insert • 90%+ space efficient • Requires *mostly* random hash functions • (easier to implement) • (better, “good” cache performance)

  23. Bucketed Cuckoo Hashing • Use two hash functions, but each hashes to an associative m-wide bucket • Lookup and delete must check at most two whole buckets • Insertion into a full bucket leaves a choice during eviction • Insertion sees a tree of possible eviction+insertion paths • BFS to find an empty bucket • Best-first search prefers the emptiest target bucket • Random walk to find an empty bucket (easier) • Use least-recently-inserted (LRI) eviction for the easiest implementation (lookup sketch below)
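
A lookup sketch for the bucketed variant; M=4 matches the IBM parameters on slide 26, and everything else (bucket count, hash functions, names) is an illustrative assumption:

#include <stdint.h>
#include <stddef.h>

#define M 4                    /* bucket width */
#define NBUCKETS 256

typedef struct { uint32_t key, value; int used; } bcell_t;
typedef struct { bcell_t cell[M]; } bucket_t;   /* ideally one cache line */
static bucket_t B1[NBUCKETS], B2[NBUCKETS];

static size_t bh1(uint32_t k) { return (k * 2654435761u) >> 24; }
static size_t bh2(uint32_t k) { return ((k ^ 0x9e3779b9u) * 0x85ebca6bu) >> 24; }

/* Lookup scans at most 2*M cells, one bucket per cache-line load. */
static bcell_t *bucketed_lookup(uint32_t key) {
    bucket_t *bs[2] = { &B1[bh1(key)], &B2[bh2(key)] };
    for (int t = 0; t < 2; t++)
        for (int j = 0; j < M; j++)
            if (bs[t]->cell[j].used && bs[t]->cell[j].key == key)
                return &bs[t]->cell[j];
    return NULL;
}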

  24. Generalization: Use both! • Use k hash functions • Use buckets of size m • Get the best of both worlds!

  25. Max load for O(1) Insert – 99% Guarantee (proven)

  26. IBM’s Implementation • IBM designed a hash table for the Cell processor • Parameters: K=2, M=4 (SIMD width) • If the hash table fits in scratch L2: • lookup in 21 cycles • Simple multiplicative hash functions worked well

  27. Better Cache Performance than You Would Think • If prefetching is used, the cost of a lookup is one memory latency (plus time to compute the hash functions, which can be done in SIMD) • Exactly two cache-line loads • Binary search trees, linear probing, separate chaining, etc. usually take more cache-line loads and have a very branchy search loop
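
A sketch of the prefetching idea on the two-table layout from earlier, using the __builtin_prefetch intrinsic (available in GCC and Clang); the surrounding code is an illustrative assumption:

/* Issue both cache-line loads before touching either slot, so the
   two probes overlap within a single memory latency. */
static cell_t *prefetched_lookup(uint32_t key) {
    size_t i1 = h1(key), i2 = h2(key);
    __builtin_prefetch(&T1[i1]);
    __builtin_prefetch(&T2[i2]);
    if (T1[i1].used && T1[i1].key == key) return &T1[i1];
    if (T2[i2].used && T2[i2].key == key) return &T2[i2];
    return NULL;
}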

  28. Conclusions • Cuckoo Hashing provides: • Guaranteed O(1) lookup+delete • Amortized O(1) insert • Efficient memory utilization • Both in space and bandwidth! • Small constant factors • And SIMD friendly! • And is simple to implement • (easier than linear probing!)

  29. Good Hash Function? • http://www.burtleburtle.net/bob/c/lookup3.c • (very fast, especially if you use the __rotl intrinsic) •

#define mix(a,b,c) \
{ \
  a -= c;  a ^= rot(c, 4);  c += b; \
  b -= a;  b ^= rot(a, 6);  a += c; \
  c -= b;  c ^= rot(b, 8);  b += a; \
  a -= c;  a ^= rot(c,16);  c += b; \
  b -= a;  b ^= rot(a,19);  a += c; \
  c -= b;  c ^= rot(b, 4);  b += a; \
}
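
The macro relies on rot(), which lookup3.c defines as a 32-bit left rotate:

#define rot(x,k) (((x)<<(k)) | ((x)>>(32-(k))))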

  30. Questions?
