EEM 480 Lecture 11 Hashing and Dictionaries
Symbol Table • Symbol tables are used by compilers to keep track of information about • variables • functions • class names • type names • temporary variables • etc. • Typical symbol table operations are Insert, Delete and Search • It's a dictionary structure!
Symbol Table • What kind of information is usually stored in a symbol table? • Type (int, short, long int, float, …) • storage class (label, static symbol, external def, structure tag, ...) • size • scope • stack frame offset • register • We also need a way to keep track of reserved words.
Symbol Table • Where is a symbol table stored? • array/linked list • simple, but linear lookup time • However, we may use a sorted array for reserved words, since they are generally few and known in advance. • balanced tree • O(log n) lookup time • hash table • most common implementation • O(1) amortized time for dictionary operations
Hashing • Hashing depends on mapping keys into positions in a table called a hash table • Hashing is a technique used for performing insertions, deletions and searches in constant average time
Hashing • In this example John maps to slot 3 • Phil maps to slot 4 … • Problems: • How will the mapping be done? • What happens if two items map to the same place?
A Plan For Hashing • Save items in a key-indexed table. Index is a function of the key. • Hash function. • Method for computing table index from key. • Collision resolution strategy. • Algorithm and data structure to handle two keys that hash to the same index. • If there is no space limitation • Trivial hash function with key as address. • If there is no time limitation • Trivial collision resolution = sequential search. • Limitations on both time and space: hashing (the real world)
Hashing • Hash tables • use an array of size m to store elements • given key k (the identifier name), use a function h to compute index h(k) for that key • collisions are possible • two keys hash into the same slot. • A good hash function • is easy to compute • avoids collisions (by breaking up patterns in the keys and uniformly distributing the hash values)
Hashing • Nomenclature • k is a key • h(k) is the hash function • m is the size of the hash table • n is the number of keys in the hash table
What is Hash • (in Wikipedia) Hash is an American dish consisting of a mixture of beef (often corned beef or roast beef), onions, potatoes, and spices that are mashed together into a coarse, chunky paste, and then cooked, either alone, or with other ingredients. • Is it related to our definition? • to chop any patterns in the keys so that the results are uniformly distributed
What is Hashing • Hashing is the transformation of a string of characters into a usually shorter fixed-length value or key that represents the original string. Hashing is used to index and retrieve items in a database because it is faster to find the item using the shorter hashed key than to find it using the original value. It is also used in many encryption algorithms.
Hashing • When the key is a string, we generally use the ASCII values of its characters in some way: • Examples for k = c1c2c3...cx • h(k) = (c1·128^(x-1) + c2·128^(x-2) + ... + cx·128^0) mod m • h(k) = (c1 + c2 + ... + cx) mod m • h(k) = (h1(c1) + h2(c2) + ... + hx(cx)) mod m, where each hi is an independent hash function.
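The first formula can be evaluated without computing large powers by applying Horner's rule and reducing mod m at each step. A minimal sketch (the base 128 for ASCII keys follows the slide; the function name is illustrative):

```python
def hash_poly(key: str, m: int) -> int:
    """Sketch of h(k) = (c1*128^(x-1) + ... + cx*128^0) mod m via Horner's rule."""
    h = 0
    for ch in key:
        h = (h * 128 + ord(ch)) % m   # reduce mod m at every step to avoid huge numbers
    return h

print(hash_poly("temp", 101))   # index of the identifier "temp" in a table of size 101
```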
Finding A Hash Function • Goal: scramble the keys. • Each table position equally likely for each key. • Ex: Vatandaşlık Numarası (national ID number) for 10,000 people • Bad: the whole number, since the full range of ID numbers is far larger than the table • Better: the last three digits; but the last digit is always even, so half the slots are never used • Best: use the 2nd, 3rd, 4th and 5th digits • Ex: date of birth. • Bad: first three digits of birth year. • Better: birthday. • Ex: phone numbers. • Bad: first three digits. • Better: last three digits.
Hash Function Truncation • Ignore part of the key and use the remaining part directly as the index. • Example: if the keys are 8-digit numbers and the hash table has 1000 entries, then the first, fourth and eighth digits could make the hash function. • Not a very good method: it does not distribute keys uniformly
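A tiny sketch of this truncation rule (the digit positions follow the slide's example; the function name is illustrative):

```python
def hash_truncate(key: int) -> int:
    """Keep the 1st, 4th and 8th digits of an 8-digit key -> an index in 0..999."""
    d = f"{key:08d}"                 # pad to 8 digits
    return int(d[0] + d[3] + d[7])

print(hash_truncate(62538194))       # digits 6, 3, 4 -> slot 634
```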
Hash Function Folding • Break up the key into parts and combine them in some way • Example: if the keys are 9-digit numbers, break up a key into three 3-digit numbers and add them up. • Ex: ISBN 0-321-37319-7 • Divide it into three parts: 321, 373 and 197 • Add them: 891; reduced mod 500 this gives 391
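A small sketch of this folding scheme (split the decimal digits into 3-digit groups, add them, reduce mod the table size; names are illustrative):

```python
def hash_fold(key: int, m: int, chunk: int = 3) -> int:
    """Split the decimal digits into groups of `chunk` digits, add them, reduce mod m."""
    digits = str(key)
    total = sum(int(digits[i:i + chunk]) for i in range(0, len(digits), chunk))
    return total % m

print(hash_fold(321373197, 500))   # 321 + 373 + 197 = 891 -> 891 mod 500 = 391
```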
Hash Function Middle Square • Compute k*k and pick some digits from the resulting number • Example: given a 9-digit key k and a hash table of size 1000, pick three digits from the middle of the squared number. • Ex: for 175344387, squaring the middle digits gives 344*344 = 118336, from which we can take 183 or 833 • Works fairly well in practice if the keys do not have many leading or trailing zeroes.
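A sketch of the variant used in the example above (square the middle three digits of the key, then keep three digits from the middle of the result; the exact digit positions are an arbitrary choice here):

```python
def hash_midsquare(key: int, m: int = 1000) -> int:
    """Square the middle three digits of the key, keep three middle digits of the square."""
    digits = str(key)
    mid = len(digits) // 2
    middle = int(digits[mid - 1:mid + 2])      # 344 for key 175344387
    squared = str(middle * middle)             # "118336"
    start = (len(squared) - 3) // 2
    return int(squared[start:start + 3]) % m   # here: 183

print(hash_midsquare(175344387))               # 183
```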
Hash Function Division • h(k) = k mod m • Fast • Not all values of m are suitable. For example, powers of 2 should be avoided, because then k mod m is just the least significant bits of k • Good values for m are prime numbers.
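For instance, with the prime m = 97, h(1234) = 1234 mod 97 = 70; with m = 100 (a power of 10, not prime), h(1234) = 34, which is just the last two digits of the key.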
Hash Function Multiplication • h(k) = floor(m * (k*c - floor(k*c))), 0 < c < 1 • In English: • Multiply the key k by a constant c, 0 < c < 1 • Take the fractional part of k*c • Multiply that by m • Take the floor of the result • The value of m does not make a difference • Some values of c work better than others • A good value for c is (sqrt(5) - 1)/2 ≈ 0.6180339887 (Knuth's suggestion), which is the constant used in the example on the next slide.
Hash Function • Multiplication • Example: • Suppose the size of the table, m, is 1301. • For k=1234, h(k)=850 • For k=1235, h(k)=353 • For k=1236, h(k)=1157 • For k=1237, h(k)=660 • For k=1238, h(k)=164 • For k=1239, h(k)=968 • For k=1240, h(k)=471
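A short sketch of the multiplication method with c = (sqrt(5) - 1)/2, which reproduces the values above (the function name is illustrative):

```python
import math

def hash_mult(key: int, m: int, c: float = (math.sqrt(5) - 1) / 2) -> int:
    """Multiplication method: floor(m * frac(k * c)) with Knuth's constant c."""
    frac = (key * c) % 1.0      # fractional part of k*c
    return int(m * frac)        # floor, since both factors are non-negative

print(hash_mult(1234, 1301))    # 850, matching the table above
```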
Hash Function • Universal Hashing • Worst-case scenario: the chosen keys all hash to the same slot. • This can be avoided if the hash function is not fixed: • Start with a collection of hash functions with the property that for any given set of inputs they will scatter the inputs well over the range of the function • Select one at random and use that. • Good performance on average: the probability that the randomly chosen hash function exhibits the worst-case behavior is very low.
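As a concrete illustration (not necessarily the family used in the lecture), the classic Carter–Wegman family h(k) = ((a*k + b) mod p) mod m can be sampled like this, where p is any prime larger than the largest possible key:

```python
import random

def make_universal_hash(m: int, p: int = 2_147_483_647):
    """Pick a random member of the family h(k) = ((a*k + b) mod p) mod m."""
    a = random.randrange(1, p)   # a must be non-zero
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % m

h = make_universal_hash(1301)
print(h(1234), h(1235))          # each run picks a different hash function
```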
When Collision Occurs... • A collision occurs when more than one item is mapped to the same location • Ex: n = 10, m = 10, use mod 10 • 9 will be mapped to 9 • 769 will be mapped to 9 • In probability theory, the birthday problem or birthday paradox pertains to the probability that in a set of randomly chosen people some pair of them will have the same birthday. In a group of 23 (or more) randomly chosen people, there is more than 50% probability that some pair of them will have been born on the same day. For 57 or more people, the probability is more than 99%, reaching 100% when the number of people reaches 366. The mathematics behind this problem leads to a well-known cryptographic attack called the birthday attack. • When a collision occurs, the algorithm has to map the second, third, ..., n'th item to well-defined alternative places in the table • In order to read data from the table, the same algorithm is used to retrieve it.
Resolving Collisions: Chaining • Put all the elements that collide in a chain (list) attached to the slot. • The hash table is an array of linked lists • The load factor indicates the average number of elements stored in a chain. It could be less than, equal to, or larger than 1.
What is Load Factor? • Given a hash table of size m and n elements stored in it, we define the load factor of the table as λ = n/m (lambda) • The load factor gives us an indication of how full the table is. • The possible values of the load factor depend on the method we use for resolving collisions.
Resolving Collisions: Chaining (cont'd) • Chaining puts elements that hash to the same slot in a linked list • Separate chaining: array of M linked lists. • Hash: map key to integer i between 0 and M-1. • Insert: put at front of ith chain. • constant time • Search: only need to search ith chain. • proportional to length of chain
Chaining • Insert/Delete/Lookup in expected O(1) time • Keep the list doubly-linked to facilitate deletions • Worst case of lookup time is linear. • However, this assumes that the chains are kept small. • If the chains start becoming too long, the table must be enlarged and all the keys rehashed.
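A compact separate-chaining sketch (Python lists stand in for the linked lists on the slides, and Python's built-in hash stands in for the lecture's hash function; class and method names are illustrative):

```python
class ChainedHashTable:
    """Separate chaining: an array of M buckets, each holding (key, value) pairs."""

    def __init__(self, m: int = 11):
        self.m = m
        self.buckets = [[] for _ in range(m)]

    def _index(self, key) -> int:
        return hash(key) % self.m

    def insert(self, key, value) -> None:
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                      # key already present: update it
                bucket[i] = (key, value)
                return
        bucket.append((key, value))           # otherwise chain it at this slot

    def search(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None                           # not found

    def delete(self, key) -> None:
        idx = self._index(key)
        self.buckets[idx] = [(k, v) for k, v in self.buckets[idx] if k != key]

t = ChainedHashTable()
t.insert("count", 3)
print(t.search("count"))   # 3
```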
Chaining Performance • Search cost is proportional to the length of the chain. • Trivial: average length = N / M. • Worst case: all keys hash to the same chain. • Theorem: let λ = N / M be the average chain length, called the load factor. • Average successful search cost: 1 + λ/2 • What is the choice of M? • M too large: too many empty chains. • M too small: chains too long. • Typical choice: λ = N / M ≈ 10 gives constant-time search/insert.
Chaining Performance • Analysis of successful search: • The expected number of elements examined during a successful search for key k is one more than the expected number of elements examined when k was inserted. • It makes no difference whether we insert at the beginning or the end of the list. • Take the average, over the n items in the table, of 1 plus the expected length of the chain to which the ith element was added: (1/n) * Σ_{i=1..n} (1 + (i-1)/m) = 1 + (n-1)/(2m) ≈ 1 + λ/2
Open Addressing • Store all elements within the table itself • The space we save from the chain pointers is used instead to make the array larger. • If there is a collision, probe the table in a systematic way to find an empty slot. • If the table fills up, we need to enlarge it and rehash all the keys.
Open Addressing • Hash function: (h(k) + i) mod m for i = 0, 1, ..., m-1 • Insert: start with the location where the key hashed and do a sequential search for an empty slot. • Search: start with the location where the key hashed and do a sequential search until you either find the key (success) or find an empty slot (failure). • Delete: (lazy deletion) follow the same route but mark the slot as DELETED rather than EMPTY, otherwise subsequent searches will fail.
Hash Table without Linked-List • Linear probing: array of size M. • Hash: map key to integer i between 0 and M-1. • Insert: put in slot i if free, if not try i+1, i+2, etc. • Search: search slot i, if occupied but no match, try i+1, i+2, etc. • Cluster. • Contiguous block of items. • Search through cluster using elementary algorithm for arrays.
Open Addressing: Linear Probing • Advantage: very easy to implement • Disadvantage: primary clustering • Long sequences of used slots build up with gaps between them. Every insertion requires several probes and adds to the cluster. • The average length of a probe sequence when inserting (or searching unsuccessfully) is roughly (1/2)(1 + 1/(1-λ)^2)
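A minimal open-addressing sketch with linear probing and the lazy deletion described above (class and sentinel names are illustrative; Python's built-in hash stands in for the lecture's hash function):

```python
EMPTY, DELETED = object(), object()          # sentinel markers for slots

class LinearProbingTable:
    """Open addressing with linear probing: probe (h + i) mod m for i = 0, 1, ..."""

    def __init__(self, m: int = 13):
        self.m = m
        self.slots = [EMPTY] * m

    def _probes(self, key):
        h = hash(key) % self.m
        return ((h + i) % self.m for i in range(self.m))

    def insert(self, key) -> None:
        for idx in self._probes(key):
            if self.slots[idx] in (EMPTY, DELETED) or self.slots[idx] == key:
                self.slots[idx] = key
                return
        raise RuntimeError("table full: enlarge and rehash")

    def search(self, key) -> bool:
        for idx in self._probes(key):
            if self.slots[idx] is EMPTY:     # never-used slot: the key cannot be further on
                return False
            if self.slots[idx] == key:
                return True
        return False

    def delete(self, key) -> None:
        for idx in self._probes(key):
            if self.slots[idx] is EMPTY:
                return
            if self.slots[idx] == key:
                self.slots[idx] = DELETED    # lazy deletion keeps later probes working
                return
```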
Quadratic Probes • Probe the table at slots (h(k) + i^2) mod m for i = 0, 1, 2, 3, ..., m-1 • Ease of computation: • Not as easy as linear probing. • Do we really have to compute a power? • Clustering • Primary clustering is avoided, since the probes are not sequential.
Quadratic Probing • Probe sequence for hash value 3 in a table of size 16 (all values mod 16): 3 + 0^2 = 3, 3 + 1^2 = 4, 3 + 2^2 = 7, 3 + 3^2 = 12, 3 + 4^2 = 3, 3 + 5^2 = 12, 3 + 6^2 = 7, 3 + 7^2 = 4, 3 + 8^2 = 3, 3 + 9^2 = 4, 3 + 10^2 = 7, 3 + 11^2 = 12, 3 + 12^2 = 3, 3 + 13^2 = 12, 3 + 14^2 = 7, 3 + 15^2 = 4 • Note that only four distinct slots (3, 4, 7, 12) are ever probed, because 16 is not prime.
Quadratic Probing • Probe sequence for hash value 3 in a table of size 19 (all values mod 19): 3 + 0^2 = 3, 3 + 1^2 = 4, 3 + 2^2 = 7, 3 + 3^2 = 12, 3 + 4^2 = 0, 3 + 5^2 = 9, 3 + 6^2 = 1, 3 + 7^2 = 14, 3 + 8^2 = 10, 3 + 9^2 = 8 • With a prime table size the probes reach many more distinct slots (here 10 of the 19).
Quadratic Probing • Disadvantage: secondary clustering: • if h(k1) == h(k2), the probing sequences for k1 and k2 are exactly the same. • Is this really bad? • In practice, not so much • It becomes an issue when the load factor is high.
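A tiny helper that reproduces the two probe sequences above and makes the prime/non-prime difference visible (the function name is illustrative):

```python
def quadratic_probes(h: int, m: int) -> list[int]:
    """Quadratic probing: the slots (h + i*i) mod m for i = 0, 1, ..., m-1."""
    return [(h + i * i) % m for i in range(m)]

print(sorted(set(quadratic_probes(3, 16))))   # [3, 4, 7, 12]: only 4 distinct slots
print(len(set(quadratic_probes(3, 19))))      # 10 distinct slots with a prime table size
```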
Double Hashing • The hash function is (h(k) + i*h2(k)) mod m • In English: use a second hash function to obtain the next slot. • The probing sequence is: • h(k), h(k)+h2(k), h(k)+2h2(k), h(k)+3h2(k), ... (all mod m) • Performance: • Much better than linear or quadratic probing. • Does not suffer from clustering • BUT requires computation of a second function
Double Hashing • The choice of h2(k) is important • It must never evaluate to zero • consider h2(k) = k mod 9 for k = 81 • The choice of m is important • If it is not prime, we may run out of alternate locations very fast.
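A sketch of a double-hashing probe sequence; the common fix h2(k) = 1 + (k mod m2), with m2 a prime smaller than m, guarantees the step is never zero (the concrete functions and values here are illustrative, not necessarily the lecture's choice):

```python
def double_hash_probes(key: int, m: int, m2: int) -> list[int]:
    """Probe sequence (h(k) + i*h2(k)) mod m with h(k) = k mod m and h2(k) = 1 + (k mod m2)."""
    h, step = key % m, 1 + (key % m2)     # step is always in 1..m2, never zero
    return [(h + i * step) % m for i in range(m)]

print(double_hash_probes(81, 13, 11))     # starts at 3 with step 5 and visits all 13 slots
```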
Rehashing • After the table is about 70% full, double the size of the hash table. • Don't forget to make the new size a prime number.
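A small sketch of that policy: once the load factor λ = n/m exceeds 0.7, grow to the first prime at least twice the old size and re-insert every key (helper names are illustrative):

```python
def is_prime(x: int) -> bool:
    return x >= 2 and all(x % d for d in range(2, int(x ** 0.5) + 1))

def next_prime(n: int) -> int:
    """Smallest prime >= n (trial division is fine for table sizes)."""
    while not is_prime(n):
        n += 1
    return n

def needs_rehash(n_items: int, m: int, max_load: float = 0.7) -> bool:
    return n_items / m > max_load

# When triggered, allocate a table of size next_prime(2*m) and re-insert
# every key using the hash function for the new size.
print(next_prime(2 * 1301))   # 2609
```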
Lempel-Ziv-Welch (LZW) Compression Algorithm • Introduction to the LZW Algorithm • Example 1: Encoding using LZW • Example 2: Decoding using LZW • LZW: Concluding Notes
Introduction to LZW • As mentioned earlier, static coding schemes require some knowledge about the data before encoding takes place. • Universal coding schemes, like LZW, do not require advance knowledge and can build such knowledge on-the-fly. • LZW is the foremost technique for general purpose data compression due to its simplicity and versatility. • It is the basis of many PC utilities that claim to “double the capacity of your hard drive” • LZW compression uses a code table, with 4096 as a common choice for the number of table entries.
Introduction to LZW (cont'd) • Codes 0-255 in the code table are always assigned to represent single bytes from the input file. • When encoding begins the code table contains only the first 256 entries, with the remainder of the table being blanks. • Compression is achieved by using codes 256 through 4095 to represent sequences of bytes. • As the encoding continues, LZW identifies repeated sequences in the data, and adds them to the code table. • Decoding is achieved by taking each code from the compressed file, and translating it through the code table to find what character or characters it represents.
LZW Encoding Algorithm

    initialize table with single-character strings
    P = first input character
    WHILE not end of input stream
        C = next input character
        IF P + C is in the string table
            P = P + C
        ELSE
            output the code for P
            add P + C to the string table
            P = C
    END WHILE
    output code for P
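A direct, runnable translation of this pseudocode (a sketch: it assumes single-byte characters and a non-empty input, and does not cap the table at 4096 entries):

```python
def lzw_encode(data: str) -> list[int]:
    """LZW encoding: emit integer codes, growing the string table as we go."""
    table = {chr(i): i for i in range(256)}   # codes 0-255: single characters
    next_code = 256
    output = []
    p = data[0]
    for c in data[1:]:
        if p + c in table:
            p = p + c                         # keep extending the current string
        else:
            output.append(table[p])           # output the code for P
            table[p + c] = next_code          # add P + C to the string table
            next_code += 1
            p = c
    output.append(table[p])                   # output the code for the final P
    return output

print(lzw_encode("BABAABAAA"))                # [66, 65, 256, 257, 65, 260]
```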
Example 1: Compression using LZW • Use the LZW algorithm to compress the string BABAABAAA
Example 1: LZW Compression Step 1 BABAABAAA • P=B, C=A • BA is not in the table: output the code for B (66), add BA as entry 256 • P=A
Example 1: LZW Compression Step 2 BABAABAAA • P=A, C=B • AB is not in the table: output the code for A (65), add AB as entry 257 • P=B
Example 1: LZW Compression Step 3 BABAABAAA • P=B, C=A • BA is in the table, so P=BA; next C=A • BAA is not in the table: output the code for BA (256), add BAA as entry 258 • P=A
Example 1: LZW Compression Step 4 BABAABAAA • P=A, C=B • AB is in the table, so P=AB; next C=A • ABA is not in the table: output the code for AB (257), add ABA as entry 259 • P=A
Example 1: LZW Compression Step 5 BABAABAAA • P=A, C=A • AA is not in the table: output the code for A (65), add AA as entry 260 • P=A
Example 1: LZW Compression Step 6 BABAABAAA • P=AA (AA is in the table), C=empty • end of input: output the code for AA (260) • Compressed output: <66> <65> <256> <257> <65> <260>
LZW Decompression • The LZW decompressor recreates the same string table during decompression. • It starts with the first 256 table entries initialized to single characters. • The string table is updated for each code in the input stream, except the first one. • Decoding is achieved by reading codes and translating them through the code table being built.
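A matching decoder sketch; note the special case where a code is not yet in the table (it can only stand for P plus the first character of P, which is what makes the final code 260 in the example decodable):

```python
def lzw_decode(codes: list[int]) -> str:
    """LZW decoding: rebuild the string table while translating the codes."""
    table = {i: chr(i) for i in range(256)}    # codes 0-255: single characters
    next_code = 256
    prev = table[codes[0]]
    output = [prev]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        else:                                  # code not in the table yet:
            entry = prev + prev[0]             # it must be prev + first char of prev
        output.append(entry)
        table[next_code] = prev + entry[0]     # the same entry the encoder added
        next_code += 1
        prev = entry
    return "".join(output)

print(lzw_decode([66, 65, 256, 257, 65, 260]))   # BABAABAAA
```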