EEM 480 Lecture 11 Hashing and Dictionaries
Symbol Table • Symbol tables are used by compilers to keep track of information about • variables • functions • class names • type names • temporary variables • etc. • Typical symbol table operations are Insert, Delete and Search • It's a dictionary structure!
Symbol Table • What kind of information is usually stored in a symbol table? • Type (int, short, long int, float, …) • storage class (label, static symbol, external def, structure tag, ...) • size • scope • stack frame offset • register • We also need a way to keep track of reserved words.
Symbol Table • Where is a symbol table stored? • array/linked list • simple, but linear lookup time • However, we may use a sorted array for reserved words, since they are generally few and known in advance. • balanced tree • O(log n) lookup time • hash table • most common implementation • O(1) amortized time for dictionary operations
Hashing • Hashing depends on mapping keys into positions in a table called a hash table • Hashing is a technique used for performing insertions, deletions and searches in constant average time
Hashing • In this example John maps to slot 3 • Phil maps to slot 4 … • Problems: • How will the mapping be done? • What happens if two items map to the same place?
A Plan For Hashing • Save items in a key-indexed table. Index is a function of the key. • Hash function. • Method for computing table index from key. • Collision resolution strategy. • Algorithm and data structure to handle two keys that hash to the same index. • If there is no space limitation • Trivial hash function with key as address. • If there is no time limitation • Trivial collision resolution = sequential search. • Limitations on both time and space: hashing (the real world)
Hashing • Hash tables • use an array of size m to store elements • given key k (the identifier name), use a function h to compute index h(k) for that key • collisions are possible • two keys hash into the same slot. • A good hash function • is easy to compute • avoids collisions (by breaking up patterns in the keys and uniformly distributing the hash values)
Hashing • Nomenclature • k is a key • h(k) is the hash function • m is the size of the hash table • n is the number of keys in the hash table
What is Hash • (in Wikipedia) Hash is an American dish consisting of a mixture of beef (often corned beef or roast beef), onions, potatoes, and spices that are mashed together into a coarse, chunky paste, and then cooked, either alone, or with other ingredients. • Is it related to our definition? • to chop any patterns in the keys so that the results are uniformly distributed
What is Hashing • Hashing is the transformation of a string of characters into a usually shorter fixed-length value or key that represents the original string. Hashing is used to index and retrieve items in a database because it is faster to find the item using the shorter hashed key than to find it using the original value. It is also used in many encryption algorithms.
Hashing • When the key is a string, we generally use the ASCII values of its characters in some way: • Examples for k = c1c2c3...cx • h(k) = (c1·128^(x-1) + c2·128^(x-2) + ... + cx·128^0) mod m • h(k) = (c1 + c2 + ... + cx) mod m • h(k) = (h1(c1) + h2(c2) + ... + hx(cx)) mod m, where each hi is an independent hash function.
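The first formula can be evaluated without computing large powers by applying Horner's rule and reducing mod m at each step. A minimal sketch (the base 128 for ASCII keys follows the slide; the function name is illustrative):

```python
def hash_poly(key: str, m: int) -> int:
    """Sketch of h(k) = (c1*128^(x-1) + ... + cx*128^0) mod m via Horner's rule."""
    h = 0
    for ch in key:
        h = (h * 128 + ord(ch)) % m   # reduce mod m at every step to avoid huge numbers
    return h

print(hash_poly("temp", 101))   # index of the identifier "temp" in a table of size 101
```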
Finding A Hash Function • Goal: scramble the keys. • Each table position equally likely for each key. • Ex: Vatandaşlık Numarası (national ID number) for 10,000 people • Bad: the whole number, since the full range of ID numbers is far larger than the table • Better: the last three digits; but the last digit is always even, so half the slots are never used • Best: use the 2nd, 3rd, 4th and 5th digits • Ex: date of birth. • Bad: first three digits of birth year. • Better: birthday. • Ex: phone numbers. • Bad: first three digits. • Better: last three digits.
Hash Function Truncation • Ignore part of the key and use the remaining part directly as the index. • Example: if the keys are 8-digit numbers and the hash table has 1000 entries, then the first, fourth and eighth digits could make the hash function. • Not a very good method: it does not distribute keys uniformly
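A tiny sketch of this truncation rule (the digit positions follow the slide's example; the function name is illustrative):

```python
def hash_truncate(key: int) -> int:
    """Keep the 1st, 4th and 8th digits of an 8-digit key -> an index in 0..999."""
    d = f"{key:08d}"                 # pad to 8 digits
    return int(d[0] + d[3] + d[7])

print(hash_truncate(62538194))       # digits 6, 3, 4 -> slot 634
```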
Hash Function Folding • Break up the key into parts and combine them in some way • Example: if the keys are 9-digit numbers, break up a key into three 3-digit numbers and add them up. • Ex: ISBN 0-321-37319-7 • Divide it into three parts: 321, 373 and 197 • Add them: 891; reduced mod 500 this gives 391
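A small sketch of this folding scheme (split the decimal digits into 3-digit groups, add them, reduce mod the table size; names are illustrative):

```python
def hash_fold(key: int, m: int, chunk: int = 3) -> int:
    """Split the decimal digits into groups of `chunk` digits, add them, reduce mod m."""
    digits = str(key)
    total = sum(int(digits[i:i + chunk]) for i in range(0, len(digits), chunk))
    return total % m

print(hash_fold(321373197, 500))   # 321 + 373 + 197 = 891 -> 891 mod 500 = 391
```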
Hash Function Middle Square • Compute k*k and pick some digits from the resulting number • Example: given a 9-digit key k and a hash table of size 1000, pick three digits from the middle of the squared number. • Ex: for 175344387, squaring the middle digits gives 344*344 = 118336, from which we can take 183 or 833 • Works fairly well in practice if the keys do not have many leading or trailing zeroes.
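A sketch of the variant used in the example above (square the middle three digits of the key, then keep three digits from the middle of the result; the exact digit positions are an arbitrary choice here):

```python
def hash_midsquare(key: int, m: int = 1000) -> int:
    """Square the middle three digits of the key, keep three middle digits of the square."""
    digits = str(key)
    mid = len(digits) // 2
    middle = int(digits[mid - 1:mid + 2])      # 344 for key 175344387
    squared = str(middle * middle)             # "118336"
    start = (len(squared) - 3) // 2
    return int(squared[start:start + 3]) % m   # here: 183

print(hash_midsquare(175344387))               # 183
```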
Hash Function Division • h(k) = k mod m • Fast • Not all values of m are suitable. For example, powers of 2 should be avoided, because then k mod m is just the least significant bits of k • Good values for m are prime numbers.
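For instance, with the prime m = 97, h(1234) = 1234 mod 97 = 70; with m = 100 (a power of 10, not prime), h(1234) = 34, which is just the last two digits of the key.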
Hash Function Multiplication • h(k) = floor(m * (k*c - floor(k*c))), 0 < c < 1 • In English: • Multiply the key k by a constant c, 0 < c < 1 • Take the fractional part of k*c • Multiply that by m • Take the floor of the result • The value of m does not make a difference • Some values of c work better than others • A good value for c is (sqrt(5) - 1)/2 ≈ 0.6180339887 (Knuth's suggestion), which is the constant used in the example on the next slide.
Hash Function • Multiplication • Example: • Suppose the size of the table, m, is 1301. • For k=1234, h(k)=850 • For k=1235, h(k)=353 • For k=1236, h(k)=1157 • For k=1237, h(k)=660 • For k=1238, h(k)=164 • For k=1239, h(k)=968 • For k=1240, h(k)=471
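A short sketch of the multiplication method with c = (sqrt(5) - 1)/2, which reproduces the values above (the function name is illustrative):

```python
import math

def hash_mult(key: int, m: int, c: float = (math.sqrt(5) - 1) / 2) -> int:
    """Multiplication method: floor(m * frac(k * c)) with Knuth's constant c."""
    frac = (key * c) % 1.0      # fractional part of k*c
    return int(m * frac)        # floor, since both factors are non-negative

print(hash_mult(1234, 1301))    # 850, matching the table above
```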
Hash Function • Universal Hashing • Worst-case scenario: the chosen keys all hash to the same slot. • This can be avoided if the hash function is not fixed: • Start with a collection of hash functions with the property that for any given set of inputs they will scatter the inputs well over the range of the function • Select one at random and use that. • Good performance on average: the probability that the randomly chosen hash function exhibits the worst-case behavior is very low.
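As a concrete illustration (not necessarily the family used in the lecture), the classic Carter–Wegman family h(k) = ((a*k + b) mod p) mod m can be sampled like this, where p is any prime larger than the largest possible key:

```python
import random

def make_universal_hash(m: int, p: int = 2_147_483_647):
    """Pick a random member of the family h(k) = ((a*k + b) mod p) mod m."""
    a = random.randrange(1, p)   # a must be non-zero
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % m

h = make_universal_hash(1301)
print(h(1234), h(1235))          # each run picks a different hash function
```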
When Collision Occurs... • A collision occurs when more than one item is mapped to the same location • Ex: n = 10, m = 10, use mod 10 • 9 will be mapped to 9 • 769 will be mapped to 9 • In probability theory, the birthday problem or birthday paradox pertains to the probability that in a set of randomly chosen people some pair of them will have the same birthday. In a group of 23 (or more) randomly chosen people, there is more than 50% probability that some pair of them will have been born on the same day. For 57 or more people, the probability is more than 99%, reaching 100% when the number of people reaches 366. The mathematics behind this problem leads to a well-known cryptographic attack called the birthday attack. • When a collision occurs, the algorithm has to map the second, third, ..., n'th item to well-defined alternative places in the table • In order to read data from the table, the same algorithm is used to retrieve it.
Resolving Collisions: Chaining • Put all the elements that collide in a chain (list) attached to the slot. • The hash table is an array of linked lists • The load factor indicates the average number of elements stored in a chain. It could be less than, equal to, or larger than 1.
What is Load Factor? • Given a hash table of size m and n elements stored in it, we define the load factor of the table as λ = n/m (lambda) • The load factor gives us an indication of how full the table is. • The possible values of the load factor depend on the method we use for resolving collisions.
Resolving Collisions: Chaining (cont'd) • Chaining puts elements that hash to the same slot in a linked list • Separate chaining: array of M linked lists. • Hash: map key to integer i between 0 and M-1. • Insert: put at front of ith chain. • constant time • Search: only need to search ith chain. • proportional to length of chain
Chaining • Insert/Delete/Lookup in expected O(1) time • Keep the list doubly-linked to facilitate deletions • Worst case of lookup time is linear. • However, this assumes that the chains are kept small. • If the chains start becoming too long, the table must be enlarged and all the keys rehashed.
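A compact separate-chaining sketch (Python lists stand in for the linked lists on the slides, and Python's built-in hash stands in for the lecture's hash function; class and method names are illustrative):

```python
class ChainedHashTable:
    """Separate chaining: an array of M buckets, each holding (key, value) pairs."""

    def __init__(self, m: int = 11):
        self.m = m
        self.buckets = [[] for _ in range(m)]

    def _index(self, key) -> int:
        return hash(key) % self.m

    def insert(self, key, value) -> None:
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                      # key already present: update it
                bucket[i] = (key, value)
                return
        bucket.append((key, value))           # otherwise chain it at this slot

    def search(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None                           # not found

    def delete(self, key) -> None:
        idx = self._index(key)
        self.buckets[idx] = [(k, v) for k, v in self.buckets[idx] if k != key]

t = ChainedHashTable()
t.insert("count", 3)
print(t.search("count"))   # 3
```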
Chaining Performance • Search cost is proportional to the length of the chain. • Trivial: average length = N / M. • Worst case: all keys hash to the same chain. • Theorem: let λ = N / M be the average chain length, called the load factor. • Average successful search cost: 1 + λ/2 • What is the choice of M? • M too large: too many empty chains. • M too small: chains too long. • Typical choice: λ = N / M ≈ 10 gives constant-time search/insert.
Chaining Performance • Analysis of successful search: • The expected number of elements examined during a successful search for key k is one more than the expected number of elements examined when k was inserted. • It makes no difference whether we insert at the beginning or the end of the list. • Take the average, over the n items in the table, of 1 plus the expected length of the chain to which the ith element was added: (1/n) * Σ_{i=1..n} (1 + (i-1)/m) = 1 + (n-1)/(2m) ≈ 1 + λ/2
Open Addressing • Store all elements within the table itself • The space we save from the chain pointers is used instead to make the array larger. • If there is a collision, probe the table in a systematic way to find an empty slot. • If the table fills up, we need to enlarge it and rehash all the keys.
Open Addressing • Hash function: (h(k) + i) mod m for i = 0, 1, ..., m-1 • Insert: start with the location where the key hashed and do a sequential search for an empty slot. • Search: start with the location where the key hashed and do a sequential search until you either find the key (success) or find an empty slot (failure). • Delete: (lazy deletion) follow the same route but mark the slot as DELETED rather than EMPTY, otherwise subsequent searches will fail.
Hash Table without Linked-List • Linear probing: array of size M. • Hash: map key to integer i between 0 and M-1. • Insert: put in slot i if free, if not try i+1, i+2, etc. • Search: search slot i, if occupied but no match, try i+1, i+2, etc. • Cluster. • Contiguous block of items. • Search through cluster using elementary algorithm for arrays.
Open Addressing: Linear Probing • Advantage: very easy to implement • Disadvantage: primary clustering • Long sequences of used slots build up with gaps between them. Every insertion requires several probes and adds to the cluster. • The average length of a probe sequence when inserting (or searching unsuccessfully) is roughly (1/2)(1 + 1/(1-λ)^2)
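A minimal open-addressing sketch with linear probing and the lazy deletion described above (class and sentinel names are illustrative; Python's built-in hash stands in for the lecture's hash function):

```python
EMPTY, DELETED = object(), object()          # sentinel markers for slots

class LinearProbingTable:
    """Open addressing with linear probing: probe (h + i) mod m for i = 0, 1, ..."""

    def __init__(self, m: int = 13):
        self.m = m
        self.slots = [EMPTY] * m

    def _probes(self, key):
        h = hash(key) % self.m
        return ((h + i) % self.m for i in range(self.m))

    def insert(self, key) -> None:
        for idx in self._probes(key):
            if self.slots[idx] in (EMPTY, DELETED) or self.slots[idx] == key:
                self.slots[idx] = key
                return
        raise RuntimeError("table full: enlarge and rehash")

    def search(self, key) -> bool:
        for idx in self._probes(key):
            if self.slots[idx] is EMPTY:     # never-used slot: the key cannot be further on
                return False
            if self.slots[idx] == key:
                return True
        return False

    def delete(self, key) -> None:
        for idx in self._probes(key):
            if self.slots[idx] is EMPTY:
                return
            if self.slots[idx] == key:
                self.slots[idx] = DELETED    # lazy deletion keeps later probes working
                return
```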
Quadratic Probes • Probe the table at slots (h(k) + i^2) mod m for i = 0, 1, 2, 3, ..., m-1 • Ease of computation: • Not as easy as linear probing. • Do we really have to compute a power? • Clustering • Primary clustering is avoided, since the probes are not sequential.
Quadratic Probing • Probe sequence for hash value 3 in a table of size 16 (all values mod 16): 3 + 0^2 = 3, 3 + 1^2 = 4, 3 + 2^2 = 7, 3 + 3^2 = 12, 3 + 4^2 = 3, 3 + 5^2 = 12, 3 + 6^2 = 7, 3 + 7^2 = 4, 3 + 8^2 = 3, 3 + 9^2 = 4, 3 + 10^2 = 7, 3 + 11^2 = 12, 3 + 12^2 = 3, 3 + 13^2 = 12, 3 + 14^2 = 7, 3 + 15^2 = 4 • Note that only four distinct slots (3, 4, 7, 12) are ever probed, because 16 is not prime.
Quadratic Probing • Probe sequence for hash value 3 in a table of size 19 (all values mod 19): 3 + 0^2 = 3, 3 + 1^2 = 4, 3 + 2^2 = 7, 3 + 3^2 = 12, 3 + 4^2 = 0, 3 + 5^2 = 9, 3 + 6^2 = 1, 3 + 7^2 = 14, 3 + 8^2 = 10, 3 + 9^2 = 8 • With a prime table size the probes reach many more distinct slots (here 10 of the 19).
Quadratic Probing • Disadvantage: secondary clustering: • if h(k1) == h(k2), the probing sequences for k1 and k2 are exactly the same. • Is this really bad? • In practice, not so much • It becomes an issue when the load factor is high.
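A tiny helper that reproduces the two probe sequences above and makes the prime/non-prime difference visible (the function name is illustrative):

```python
def quadratic_probes(h: int, m: int) -> list[int]:
    """Quadratic probing: the slots (h + i*i) mod m for i = 0, 1, ..., m-1."""
    return [(h + i * i) % m for i in range(m)]

print(sorted(set(quadratic_probes(3, 16))))   # [3, 4, 7, 12]: only 4 distinct slots
print(len(set(quadratic_probes(3, 19))))      # 10 distinct slots with a prime table size
```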
Double Hashing • The hash function is (h(k) + i*h2(k)) mod m • In English: use a second hash function to obtain the next slot. • The probing sequence is: • h(k), h(k)+h2(k), h(k)+2h2(k), h(k)+3h2(k), ... (all mod m) • Performance: • Much better than linear or quadratic probing. • Does not suffer from clustering • BUT requires computation of a second function
Double Hashing • The choice of h2(k) is important • It must never evaluate to zero • consider h2(k) = k mod 9 for k = 81 • The choice of m is important • If it is not prime, we may run out of alternate locations very fast.
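A sketch of a double-hashing probe sequence; the common fix h2(k) = 1 + (k mod m2), with m2 a prime smaller than m, guarantees the step is never zero (the concrete functions and values here are illustrative, not necessarily the lecture's choice):

```python
def double_hash_probes(key: int, m: int, m2: int) -> list[int]:
    """Probe sequence (h(k) + i*h2(k)) mod m with h(k) = k mod m and h2(k) = 1 + (k mod m2)."""
    h, step = key % m, 1 + (key % m2)     # step is always in 1..m2, never zero
    return [(h + i * step) % m for i in range(m)]

print(double_hash_probes(81, 13, 11))     # starts at 3 with step 5 and visits all 13 slots
```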
Rehashing • After the table is about 70% full, double the size of the hash table. • Don't forget to make the new size a prime number.
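A small sketch of that policy: once the load factor λ = n/m exceeds 0.7, grow to the first prime at least twice the old size and re-insert every key (helper names are illustrative):

```python
def is_prime(x: int) -> bool:
    return x >= 2 and all(x % d for d in range(2, int(x ** 0.5) + 1))

def next_prime(n: int) -> int:
    """Smallest prime >= n (trial division is fine for table sizes)."""
    while not is_prime(n):
        n += 1
    return n

def needs_rehash(n_items: int, m: int, max_load: float = 0.7) -> bool:
    return n_items / m > max_load

# When triggered, allocate a table of size next_prime(2*m) and re-insert
# every key using the hash function for the new size.
print(next_prime(2 * 1301))   # 2609
```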
Lempel-Ziv-Welch (LZW) Compression Algorithm • Introduction to the LZW Algorithm • Example 1: Encoding using LZW • Example 2: Decoding using LZW • LZW: Concluding Notes
Introduction to LZW • As mentioned earlier, static coding schemes require some knowledge about the data before encoding takes place. • Universal coding schemes, like LZW, do not require advance knowledge and can build such knowledge on-the-fly. • LZW is the foremost technique for general purpose data compression due to its simplicity and versatility. • It is the basis of many PC utilities that claim to “double the capacity of your hard drive” • LZW compression uses a code table, with 4096 as a common choice for the number of table entries.
Introduction to LZW (cont'd) • Codes 0-255 in the code table are always assigned to represent single bytes from the input file. • When encoding begins the code table contains only the first 256 entries, with the remainder of the table being blanks. • Compression is achieved by using codes 256 through 4095 to represent sequences of bytes. • As the encoding continues, LZW identifies repeated sequences in the data, and adds them to the code table. • Decoding is achieved by taking each code from the compressed file, and translating it through the code table to find what character or characters it represents.
LZW Encoding Algorithm

    initialize table with single-character strings
    P = first input character
    WHILE not end of input stream
        C = next input character
        IF P + C is in the string table
            P = P + C
        ELSE
            output the code for P
            add P + C to the string table
            P = C
    END WHILE
    output code for P
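A direct, runnable translation of this pseudocode (a sketch: it assumes single-byte characters and a non-empty input, and does not cap the table at 4096 entries):

```python
def lzw_encode(data: str) -> list[int]:
    """LZW encoding: emit integer codes, growing the string table as we go."""
    table = {chr(i): i for i in range(256)}   # codes 0-255: single characters
    next_code = 256
    output = []
    p = data[0]
    for c in data[1:]:
        if p + c in table:
            p = p + c                         # keep extending the current string
        else:
            output.append(table[p])           # output the code for P
            table[p + c] = next_code          # add P + C to the string table
            next_code += 1
            p = c
    output.append(table[p])                   # output the code for the final P
    return output

print(lzw_encode("BABAABAAA"))                # [66, 65, 256, 257, 65, 260]
```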
Example 1: Compression using LZW • Use the LZW algorithm to compress the string BABAABAAA
Example 1: LZW Compression Step 1 BABAABAAA • P=B, C=A • BA is not in the table: output the code for B (66), add BA as entry 256 • P=A
Example 1: LZW Compression Step 2 BABAABAAA • P=A, C=B • AB is not in the table: output the code for A (65), add AB as entry 257 • P=B
Example 1: LZW Compression Step 3 BABAABAAA • P=B, C=A • BA is in the table, so P=BA; next C=A • BAA is not in the table: output the code for BA (256), add BAA as entry 258 • P=A
Example 1: LZW Compression Step 4 BABAABAAA • P=A, C=B • AB is in the table, so P=AB; next C=A • ABA is not in the table: output the code for AB (257), add ABA as entry 259 • P=A
Example 1: LZW Compression Step 5 BABAABAAA • P=A, C=A • AA is not in the table: output the code for A (65), add AA as entry 260 • P=A
Example 1: LZW Compression Step 6 BABAABAAA • P=AA (AA is in the table), C=empty • end of input: output the code for AA (260) • Compressed output: <66> <65> <256> <257> <65> <260>
LZW Decompression • The LZW decompressor recreates the same string table during decompression. • It starts with the first 256 table entries initialized to single characters. • The string table is updated for each code in the input stream, except the first one. • Decoding is achieved by reading codes and translating them through the code table being built.
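A matching decoder sketch; note the special case where a code is not yet in the table (it can only stand for P plus the first character of P, which is what makes the final code 260 in the example decodable):

```python
def lzw_decode(codes: list[int]) -> str:
    """LZW decoding: rebuild the string table while translating the codes."""
    table = {i: chr(i) for i in range(256)}    # codes 0-255: single characters
    next_code = 256
    prev = table[codes[0]]
    output = [prev]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        else:                                  # code not in the table yet:
            entry = prev + prev[0]             # it must be prev + first char of prev
        output.append(entry)
        table[next_code] = prev + entry[0]     # the same entry the encoder added
        next_code += 1
        prev = entry
    return "".join(output)

print(lzw_decode([66, 65, 256, 257, 65, 260]))   # BABAABAAA
```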