
15-211 Fundamental Structures of Computer Science


Presentation Transcript


  1. Dictionaries, Tries and Hash Maps 15-211 Fundamental Structures of Computer Science Jan. 30, 2003 Ananda Guna Based on lectures given by Peter Lee, Avrim Blum, Danny Sleator, William Scherlis, Ananda Guna & Klaus Sutner

  2. Dictionaries • A data structure that supports lookups and updates • Computational complexity: • The time and space needed to build and maintain the dictionary, i.e., creating and updating it • The time needed to perform a membership query • The time needed to verify the answer to a membership query • Implementing a Dictionary • A sorted array? • A sorted linked list? • A BST? • An array of linked lists? • A trie?

  3. What is a Trie? [Figure: a trie of height 2 and branching factor 5; digit-labeled edges lead down to leaves holding the keys 20, 22, 24, 31, 32, 42, 43] • A trie of size (h, b) is a tree of height h and branching factor at most b • All keys can be regarded as integers in the range [0, b^h) • Each key K can be represented as an h-digit number in base b: K1K2K3…Kh • Keys are stored at the leaf level; the path from the root spells out the decomposition of the key into digits

  4. Motivation for tries: An Application of Trees

  5. Sample problem: How to do instant messaging on a telephone keypad?

  6. Typing “I love you”: 444 * 555 666 888 33 * 999 666 88 # What is the worst-case number of keystrokes? What might be the average number of keystrokes?

  7. Better approach: 4* 5 6 8 3* 9 6 8# Can be done by using tries.

  8. Tries • A trie is a data structure in which the information identifying an item is encoded by the path from the root to the item's node, rather than stored in the node itself • A tree structure that encodes the possible sequences of symbols in a dictionary • Invented by Fredkin in 1960 • For example, suppose we have the alphabet {a,b}, and want to store the sequences • aaa, aab, baa, bbab

  9. Tries aaa, aab, baa, bbab No information stored in nodes per se. Shape of trie determines the items.

  10. Tries Nodes can be quite large: N pointers to children, where N is the size of the alphabet. Operations are very fast: search, insert, delete are all O(k) where k is the length of the sequence in question.
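The node layout and O(k) operations described above can be sketched as follows. This is a minimal illustration, not the lecture's code: it assumes a lowercase a–z alphabet (N = 26 child pointers per node), and the class and method names are made up for this example.

```java
// Minimal trie sketch: each node holds one child pointer per alphabet symbol,
// and a flag marking the end of a stored sequence. Search and insert walk one
// node per character, so both run in O(k) for a sequence of length k.
class Trie {
    private static final int R = 26;          // alphabet size (assumed a-z)

    private static class Node {
        Node[] next = new Node[R];            // one pointer per symbol
        boolean isWord;                       // true if a sequence ends here
    }

    private final Node root = new Node();

    void insert(String s) {                   // O(k), k = s.length()
        Node cur = root;
        for (char c : s.toCharArray()) {
            int i = c - 'a';
            if (cur.next[i] == null) cur.next[i] = new Node();
            cur = cur.next[i];
        }
        cur.isWord = true;
    }

    boolean contains(String s) {              // O(k)
        Node cur = root;
        for (char c : s.toCharArray()) {
            cur = cur.next[c - 'a'];
            if (cur == null) return false;    // path breaks off: not stored
        }
        return cur.isWord;                    // reached a node: stored only if marked
    }
}
```

Note how the shape of the structure carries the data: inserting aaa, aab, baa, bbab stores nothing but flags in the nodes, exactly as slide 9 says.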

  11. Keypad IM Trie [Figure: a trie whose edges are labeled with keypad digits; the root branches on 4, 5, and 9, and the leaves hold "I" (4), "like" (5453), "love" (5683), and "you" (968)]

  12. Implementing a Trie • How do we implement a trie? • How about a tree of arrays? • You can also use HashMaps to implement a trie (later) [Figure: a tree-of-arrays trie node, with one array slot per letter pointing to a child node]

  13. Hashing

  14. Why Hashing? • Suppose we need a better way to maintain a table that is easy to insert into and search • With a sorted array we can do binary search in log2 n time, but insertion takes O(n) time to shift elements over • So is there an alternative way to handle operations such as insert, search, and delete? • Yes: hashing

  15. Big Idea • Suppose we have M items we need to put into a table of size N • Can we find a map H such that • H[ith item] ∈ [0..N-1]? • Assume that N = 5 and the values we need to insert are: cab, bea, bad, etc. • Assume that we assign values to letters: a=0, b=1, c=2, etc.

  16. Big Idea Ctd.. • Define H such that • H[data] = (Σ of character values) Mod N • H[cab] = (2+0+1) Mod 5 = 3 • H[bea] = (1+4+0) Mod 5 = 0 • H[bad] = (1+0+3) Mod 5 = 4 • Resulting table: slot 0 → bea, slot 3 → cab, slot 4 → bad (slots 1 and 2 empty)
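The letter-sum map H above is easy to write down. A minimal sketch (the class and method names are illustrative, not from the lecture; it assumes lowercase input):

```java
// The slide's hash: assign a=0, b=1, c=2, ..., sum the letter values,
// and reduce mod the table size n.
class LetterSumHash {
    static int h(String s, int n) {
        int sum = 0;
        for (char c : s.toCharArray())
            sum += c - 'a';       // letter value: a=0, b=1, ...
        return sum % n;           // fold into a slot in [0, n-1]
    }
}
```

With n = 5 this reproduces the slide's slots: cab → 3, bea → 0, bad → 4.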

  17. Problems • Collisions • What if the values we need to insert are “abc”, “cba”, “bca” etc… • They all map to the same location based on our map H (obviously H is not a good hash map) • This is called “Collision” • One way to deal with collisions is “separate chaining” • The idea is to maintain an array of linked lists • More on collisions later

  18. Running Time • We need to make sure that H is easy to compute (constant time) • Lookups and deletes from the hash table depend on H • Assume M = Θ(N) • So what is a “bad” H? • Suppose we hash strings by simply adding up the letters and taking the sum mod the table size. Is that good or bad? • Homework: Think about hashing 1000 5-letter words into a table of size 1000 using this map H. What would the key distribution look like?

  19. What is a good H? • If H behaves like a random function, then each of the other M−1 keys collides with a given key with equal probability 1/N • Therefore E(collisions for a given key) = (M−1)/N • If M = Θ(N) then this value is O(1). This is great. But life is not fair. • So what is a good hash function? • Let's consider hashing a set of strings. For a string S, consider • H(S) = (Σj S[j]·d^j) Mod N, where d is some large number and N is the table size • Is this function hard to calculate?
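The polynomial hash above is cheap to calculate: one multiply-add per character using Horner's rule. A sketch under stated assumptions: the names are illustrative, and Horner's rule applies the powers of d in the reverse order of the slide's sum (S[0] gets the highest power), which gives an equally well-spread hash.

```java
// Horner's-rule evaluation of a polynomial hash: each character contributes
// its value times a power of d, all reduced mod the table size. The running
// accumulator stays small because we reduce after every step.
class PolyHash {
    static int h(String s, int d, int tableSize) {
        long acc = 0;
        for (int j = 0; j < s.length(); j++)
            acc = (acc * d + s.charAt(j)) % tableSize;  // one multiply-add per char
        return (int) acc;
    }
}
```

Unlike the letter-sum hash, this one distinguishes anagrams: h("ab") and h("ba") land in different slots for typical d and table sizes.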

  20. Collisions • Hash functions can be many-to-1 • They can map different search keys to the same hash key. hash(`a`) == 9 == hash(`w`)

  21. Collisions • Hash functions can be many-to-1 • They can map different search keys to the same hash key. hash1(`a`) == 9 == hash1(`w`) • Must compare the search key with the record found

  22. Collisions • Hash functions can be many-to-1 • They can map different search keys to the same hash key. hash1(`a`) == 9 == hash1(`w`) • Must compare the search key with the record found • If the match fails, there is a collision

  23. Collision strategies • Separate chaining • Open addressing • Linear • Quadratic • Double • Etc. • The perfect hash

  24. Linear Probing • The idea: • Table remains a simple array • On insert, if the cell is full, find another by sequentially searching for the next available slot • On find, if the cell doesn’t match, look elsewhere. • Eg: Consider H(key) = key Mod 6 (assume N=6) • H(11)=5, H(10)=4, H(17)=5, H(16)=4,H(23)=5 • Draw the Hash table
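The insert procedure above can be sketched as follows (illustrative names, not the lecture's code; it uses an Integer[] with null meaning "empty" and does no resizing, so it assumes the table never fills completely):

```java
// Linear probing insert: start at the key's home slot H(key) = key mod
// table.length; if the cell is full, step to the next cell, wrapping around,
// until an empty cell is found.
class LinearProbe {
    static void insert(Integer[] table, int key) {
        int i = key % table.length;          // home slot
        while (table[i] != null)             // cell full: keep probing
            i = (i + 1) % table.length;      // wrap around the array
        table[i] = key;
    }
}
```

Working the slide's exercise with a table of size 6: 11 goes to slot 5, 10 to slot 4, 17 (home 5, full) probes to slot 0, 16 (home 4) probes 4, 5, 0, then lands in slot 1, and 23 (home 5) probes 5, 0, 1, then lands in slot 2; only slot 3 stays empty.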

  25. Linear Probe ctd.. • How about deleting items? • Removing an item outright can break the probe chains of other items that collided with it (unlike, e.g., a BST, where a node can be spliced out cleanly) • “Lazy Delete” – just mark the item as deleted (inactive) rather than removing it

  26. More on Delete • Naïve removal can leave gaps! (“3 f” means search key f and hash key 3)
Start: 0 a | 2 b | 3 c | 3 e | 5 d | 8 j | 8 u | 10 g | 8 s
Insert f: 0 a | 2 b | 3 c | 3 e | 5 d | 3 f | 8 j | 8 u | 10 g | 8 s
Remove e (naïve): 0 a | 2 b | 3 c | empty | 5 d | 3 f | 8 j | 8 u | 10 g | 8 s
Find f: the probe for f starts at hash key 3 and stops at the empty cell, so f is never found!

  27. More about delete ctd.. • Clever removal marks the cell as “gone” (a tombstone) instead of emptying it (“3 f” means search key f and hash key 3)
Start: 0 a | 2 b | 3 c | 3 e | 5 d | 8 j | 8 u | 10 g | 8 s
Insert f: 0 a | 2 b | 3 c | 3 e | 5 d | 3 f | 8 j | 8 u | 10 g | 8 s
Remove e: 0 a | 2 b | 3 c | gone | 5 d | 3 f | 8 j | 8 u | 10 g | 8 s
Find f: the probe continues past the “gone” cell, so f is still found.
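The tombstone idea above can be sketched for integer keys (illustrative names and a sentinel value I chose for this example, not the lecture's code; it assumes real keys are nonnegative and that the table always keeps at least one truly empty cell):

```java
// Lazy delete for a linear-probing table: a removed cell is marked GONE
// rather than cleared, so later probes walk past it instead of stopping.
class LazyDelete {
    static final int GONE = Integer.MIN_VALUE;  // tombstone sentinel (assumed unused as a key)

    // Probe from the key's home slot; only a never-used (null) cell ends the search.
    static int find(Integer[] table, int key) {
        int i = key % table.length;
        while (table[i] != null) {
            if (table[i] == key) return i;      // a tombstone never matches a real key
            i = (i + 1) % table.length;         // probe past full cells AND tombstones
        }
        return -1;                              // hit an empty cell: key is absent
    }

    // Lazy delete: mark the cell, don't empty it.
    static void remove(Integer[] table, int key) {
        int i = find(table, key);
        if (i >= 0) table[i] = GONE;
    }
}
```

If remove cleared the cell to null instead, a later find for any key whose probe chain passed through that cell would stop early and wrongly report "absent", which is exactly the gap problem on slide 26.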

  28. Performance of linear probing • Average numbers of probes (λ = load factor) • Unsuccessful search and insert: ½(1 + 1/(1−λ)²) • Successful search: ½(1 + 1/(1−λ)) • When λ is low: • Probe counts are close to 1 • When λ is high: • E.g., when λ = 0.75, unsuccessful probes: ½(1 + 1/(1−λ)²) = ½(1 + 16) = 8.5 probes • E.g., when λ = 0.5, unsuccessful probes: ½(1 + 1/(1−λ)²) = ½(1 + 4) = 2.5 probes

  29. Quadratic probing • Resolve collisions by examining certain cells away from the original probe point • Collision policy: • Define h0(k), h1(k), h2(k), h3(k), … where hi(k) = (hash(k) + i²) mod size • Caveat: • May not find a vacant cell! • Table must be less than half full (λ < ½) • (Linear probing always finds a cell.)

  30. Quadratic probing • Another issue • Suppose the table size is 16. • Probe offsets that will be tried: 1 4 9 16 25 36 49 64 81

  31. Quadratic probing • Another issue • Suppose the table size is 16. • Probe offsets that will be tried: 1 mod 16 = ? 4 mod 16 = ? 9 mod 16 = ? 16 mod 16 = ? 25 mod 16 = ? 36 mod 16 = ? 49 mod 16 = ? 64 mod 16 = ? 81 mod 16 = ?

  32. Quadratic probing • Another issue • Suppose the table size is 16. • Probe offsets that will be tried: 1 mod 16 = 1 4 mod 16 = 4 9 mod 16 = 9 16 mod 16 = 0 25 mod 16 = 9 36 mod 16 = 4 49 mod 16 = 1 64 mod 16 = 0 81 mod 16 = 1

  33. Quadratic probing • Another issue • Suppose the table size is 16. • Probe offsets that will be tried: 1 mod 16 = 1 4 mod 16 = 4 9 mod 16 = 9 16 mod 16 = 0 25 mod 16 = 9 only four different values! 36 mod 16 = 4 49 mod 16 = 1 64 mod 16 = 0 81 mod 16 = 1

  34. Quadratic probing • Table size must be prime • Load factor must be less than ½
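The "only four different values" problem on the previous slides is easy to check by brute force. A small sketch (the class and method names are mine, not the lecture's):

```java
import java.util.stream.IntStream;

// Counts how many distinct cells quadratic probing can ever reach from one
// home slot: the number of distinct values of i*i mod size.
class QuadraticOffsets {
    static long distinct(int size) {
        return IntStream.rangeClosed(1, size)
                .map(i -> (i * i) % size)   // the i-th probe offset
                .distinct()
                .count();
    }
}
```

For size 16 only four offsets {0, 1, 4, 9} ever occur, so most of the table is unreachable; for the prime size 17 there are nine distinct offsets, which is the (p+1)/2 guaranteed for any prime p and why the slide requires a prime table size.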

  35. Rehash • Scaling up • What makes λ grow too large? • Too much data • Too many removals • Rehash! • Do when insert fails or load factor grows • Build a new table • Scan existing table and do inserts into new table

  36. Rehash • Scaling up • What makes λ grow too large? • Too much data • Too many removals • Rehash! • Do when insert fails or load factor grows • Build a new table • Scan existing table and do inserts into new table • Twice the size or more • Adds only constant average cost

  37. Double Hashing • Collision policy • Define h0(k), h1(k), h2(k), h3(k), … where hi(k) = (hash(k) + i·hash2(k)) mod size • Caveats • hash2(k) must never be zero • Table size must be prime • If hash2(k) shares a common factor with the table size, fewer alternative cells will be tried • Quadratic probing may be faster/easier in practice.
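The probe sequence defined above can be sketched as follows. The two hash functions here are toy choices of my own for nonnegative integer keys, not prescribed by the lecture; hash2 uses the common form R − (k mod R) with a prime R smaller than the table size, which is never zero as the slide requires.

```java
// i-th probe location for double hashing, per the slide's definition:
// h_i(k) = (hash(k) + i * hash2(k)) mod size.
class DoubleHash {
    static int probe(int key, int i, int size) {
        int hash  = key % size;        // home slot
        int hash2 = 7 - (key % 7);     // step size in 1..7, never zero
        return (hash + i * hash2) % size;
    }
}
```

Because the step size hash2(k) depends on the key, two keys with the same home slot usually follow different probe sequences, which is what breaks up the clustering that linear and quadratic probing suffer from.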

  38. About HashMap Class • This class implements the Map interface • HashMap permits null values and null keys • Constant-time performance for get and put operations, assuming the hash function disperses the elements properly among the buckets • HashMap has two parameters that affect its performance: initial capacity and load factor • Capacity – the number of buckets in the hash table • Load factor – how full the hash table is allowed to get before its capacity is automatically increased by rehashing

  39. More on Java HashMaps • HashMap(int initialCapacity) • put(Object key, Object value) • get(Object key) – returns Object • keySet() Returns a set view of the keys contained in this map. • values() Returns a collection view of the values contained in this map.
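The operations listed above in one small exercise (the key/value data is made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Exercising the java.util.HashMap operations from the slide:
// the constructor with an initial capacity, put, get, keySet, and values.
class HashMapDemo {
    static Map<String, Integer> build() {
        Map<String, Integer> m = new HashMap<>(16);  // HashMap(int initialCapacity)
        m.put("I", 4);                               // put(key, value)
        m.put("love", 5683);                         // sample keypad codes
        m.put("you", 968);
        return m;
    }
}
```

get("love") returns the stored Integer 5683; keySet() and values() return live views backed by the map, so they reflect later puts and removes. Note that iteration order over a HashMap is unspecified; use LinkedHashMap or TreeMap when order matters.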

  40. Next Week • HW2 is due Monday – you need to start early • We will discuss Priority Queues and recap of what we have done so far • Read More about Hash Tables in Chapter 20 • See you Tuesday
