
PowerPoint Slideshow about 'EEM 480' - muhammad-jasmi


EEM 480

Lecture 11

Hashing and Dictionaries

Symbol Table
  • Symbol tables are used by compilers to keep track of information about
    • variables
    • functions
    • class names
    • type names
    • temporary variables
    • etc.
  • Typical symbol table operations are Insert, Delete and Search
    • It's a dictionary structure!
Symbol Table
  • What kind of information is usually stored in a symbol table?
    • Type ( int, short, long int, float, …)
    • storage class (label, static symbol, external def, structure tag, ...)
    • size
    • scope
    • stack frame offset
    • register
  • We also need a way to keep track of reserved words.
Symbol Table

Where is a symbol table stored?

  • array/linked list
    • simple, but linear lookup time
    • However, we may use a sorted array for reserved words, since they are generally few and known in advance.
  • balanced tree
    • O(log n) lookup time
  • hash table
    • most common implementation
    • O(1) amortized time for dictionary operations
  • Hashing depends on mapping keys into positions in a table called a hash table
  • Hashing is a technique for performing insertions, deletions and searches in constant average time
  • In this example, John maps to 3
  • Phil maps to 4 …
  • Problems:
    • How will the mapping be done?
    • What happens if two items map to the same place?
A Plan For Hashing
  • Save items in a key-indexed table. Index is a function of the key.
  • Hash function.
    • Method for computing table index from key.
  • Collision resolution strategy.
    • Algorithm and data structure to handle two keys that hash to the same index.
  • If there is no space limitation
    • Trivial hash function with key as address.
  • If there is no time limitation
    • Trivial collision resolution = sequential search.
  • Limitations on both time and space: hashing (the real world)
  • Hash tables
    • use array of size m to store elements
    • given key k (the identifier name), use a function h to compute index h(k) for that key
      • collisions are possible: two keys may hash into the same slot.
  • A good hash function
    • is easy to compute
    • avoids collisions (by breaking up patterns in the keys and uniformly distributing the hash values)
  • Nomenclature
    • k  is a key
    • h(k) is the hash function
    • m  is the size of the hash table
    • n  is the number of keys in the hash table
What is Hash
  • (in Wikipedia) Hash is an American dish consisting of a mixture of beef (often corned beef or roast beef), onions, potatoes, and spices that are mashed together into a coarse, chunky paste, and then cooked, either alone, or with other ingredients.
  • Is it related to our definition?
    • Yes: the idea is to chop up any patterns in the keys so that the results are uniformly distributed
What is Hashing

Hashing is the transformation of a string of characters into a usually shorter fixed-length value or key that represents the original string. Hashing is used to index and retrieve items in a database because it is faster to find the item using the shorter hashed key than to find it using the original value. It is also used in many encryption algorithms.

  • When the key is a string, we generally use the ASCII values of its characters in some way:
    • Examples for k = c1 c2 ... cx (a string of x characters):
    • h(k) = (c1*128^(x-1) + c2*128^(x-2) + ... + cx*128^0) mod m
    • h(k) = (c1 + c2 + ... + cx) mod m
    • h(k) = (h1(c1) + h2(c2) + ... + hx(cx)) mod m, where each hi is an independent hash function.
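
The first two string formulas can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the function names are hypothetical, and the polynomial form is evaluated with Horner's rule so that the large powers of 128 never have to be computed explicitly.

```python
def polynomial_hash(key: str, m: int, base: int = 128) -> int:
    """Treat the characters as digits of a base-128 number, reduced mod m."""
    h = 0
    for ch in key:
        # Horner's rule: ((c1*128 + c2)*128 + c3)... taken mod m at each step.
        h = (h * base + ord(ch)) % m
    return h


def additive_hash(key: str, m: int) -> int:
    """Sum of the character codes mod m."""
    return sum(ord(ch) for ch in key) % m


print(polynomial_hash("ab", 1000))   # (97*128 + 98) mod 1000 = 514
print(additive_hash("listen", 101) == additive_hash("silent", 101))  # True
```

The second print shows a weakness of the additive form: it ignores character order, so anagrams always collide.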
Finding A Hash Function
  • Goal: scramble the keys.
    • Each table position equally likely for each key.
  • Ex: Turkish citizenship number (Vatandaşlık Numarası) for 10000 people
    • Bad: the whole number, since 10000 will not be used forever
    • Better: the last three digits, but every number is even
    • Best: use digits 2, 3, 4 and 5
  • Ex: date of birth.
    • Bad: first three digits of birth year.
    • Better: birthday.
  • Ex: phone numbers.
    • Bad: first three digits.
    • Better: last three digits.
Hash Function

Truncation

  • Ignore part of the key and use the remaining part directly as the index.
  • Example: if the keys are 8-digit numbers and the hash table has 1000 entries, then the first, fourth and eighth digits could make up the hash value.
  • Not a very good method: it does not distribute keys uniformly.
Hash Function

Folding

  • Break up the key into parts and combine them in some way.
  • Example: if the keys are 9-digit numbers, break up a key into three 3-digit numbers and add them up.
    • Ex: ISBN 0-321-37319-7
    • Drop the leading 0 and divide the remaining digits into three parts: 321, 373 and 197
    • Add them: 321 + 373 + 197 = 891, then reduce to the table size: 891 mod 500 = 391
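
A short sketch of this folding scheme, assuming the key is already an integer (the ISBN with its hyphens and leading zero removed). The function name and the `part_digits` parameter are illustrative choices, not part of the slides.

```python
def folding_hash(key: int, m: int, part_digits: int = 3) -> int:
    """Split the decimal digits into groups of part_digits, add the groups, reduce mod m."""
    total = 0
    while key > 0:
        # Peel off the lowest part_digits digits as one group.
        key, part = divmod(key, 10 ** part_digits)
        total += part
    return total % m


# ISBN 0-321-37319-7 without the leading zero: 321 | 373 | 197
print(folding_hash(321373197, 500))  # (321 + 373 + 197) mod 500 = 391
```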
Hash Function

Middle square

  • Compute k*k and pick some digits from the resulting number
  • Example: given a 9-digit key k and a hash table of size 1000, pick three digits from the middle of the number k*k.
  • Ex: for k = 175344387, squaring its middle three digits gives 344*344 = 118336, from which we can take 183 or 833
  • Works fairly well in practice if the keys do not have many leading or trailing zeroes.
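
A minimal sketch of the middle-square idea. Which digits count as "the middle" is a design choice; here the hypothetical helper centres a window of `digits` characters in the decimal representation of k*k, rounding the start position down.

```python
def middle_square_hash(k: int, digits: int = 3) -> int:
    """Square k and return `digits` digits taken from the middle of the result."""
    squared = str(k * k)
    # Centre the window; max() guards against squares shorter than the window.
    start = max(0, (len(squared) - digits) // 2)
    return int(squared[start:start + digits])


print(middle_square_hash(344))  # 344*344 = 118336, middle window "183" -> 183
```

Shifting the window one position to the right would give 833 instead, the other value mentioned in the example.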
Hash Function

Division

  • h(k) = k mod m
  • Fast
  • Not all values of m are suitable for this. For example, powers of 2 should be avoided, because then k mod m is just the least significant bits of k.
  • Good values for m are prime numbers.
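
The effect of the table size can be seen with a small experiment. The keys below are an assumed example (multiples of 8); with m = 8 they all land in one slot, while the prime m = 7 spreads them out.

```python
def division_hash(k: int, m: int) -> int:
    """Division method: the remainder of k divided by the table size m."""
    return k % m


keys = [8, 16, 24, 32, 40, 48]

# m = 8 = 2^3 keeps only the lowest 3 bits of each key (k % 8 == k & 7).
power_of_two_slots = [division_hash(k, 8) for k in keys]
# m = 7 is prime, so the pattern in the keys is broken up.
prime_slots = [division_hash(k, 7) for k in keys]

print(power_of_two_slots)  # [0, 0, 0, 0, 0, 0]
print(prime_slots)         # [1, 2, 3, 4, 5, 6]
```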
Hash Function

Multiplication

    • h(k) = int(m * (k * c - int(k * c))), 0 < c < 1
    • In English:
    • Multiply the key k by a constant c, 0 < c < 1
    • Take the fractional part of k * c
    • Multiply that by m
    • Take the floor of the result
  • The value of m does not make a difference
  • Some values of c work better than others
  • A good value for c: (√5 - 1)/2 ≈ 0.6180339887
Hash Function
  • Multiplication
    • Example:
    • Suppose the size of the table, m, is 1301.
    • For k=1234,   h(k)=850
    • For k=1235,   h(k)=353
    • For k=1236,   h(k)=1157
    • For k=1237,   h(k)=660
    • For k=1238,   h(k)=164
    • For k=1239,   h(k)=968
    • For k=1240,   h(k)=471
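
The multiplication method can be sketched in a few lines. The constant c = (√5 - 1)/2 is an assumption here (it is the value commonly recommended for this method), and floating-point arithmetic is used for simplicity; a production version would typically use integer arithmetic instead.

```python
import math

C = (math.sqrt(5) - 1) / 2  # assumed constant, about 0.6180339887


def multiplication_hash(k: int, m: int, c: float = C) -> int:
    """Multiplication method: floor(m * fractional_part(k * c))."""
    fractional = (k * c) % 1.0
    return math.floor(m * fractional)


for k in range(1234, 1241):
    print(k, multiplication_hash(k, 1301))
# The first two lines printed are "1234 850" and "1235 353".
```

Consecutive keys end up far apart in the table, which is the scattering behaviour the method is meant to provide.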
Hash Function
  • Universal Hashing
    • Worst-case scenario: the chosen keys all hash to the same slot.
    • This can be avoided if the hash function is not fixed:
    • Start with a collection of hash functions with the property that for any given set of inputs they will scatter the inputs among the range of the function well
    • Select one at random and use that.
    • Good performance on average: the probability that the randomly chosen hash function exhibits the worst-case behavior is very low.
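
One well-known collection with this property, used here purely as an illustration since the slides do not name a specific family, is h(k) = ((a*k + b) mod p) mod m with a prime p larger than every key and random a, b. The helper name and the default p = 2^31 - 1 are assumptions.

```python
import random


def make_universal_hash(m: int, p: int = 2_147_483_647, rng=random):
    """Pick one function at random from the family ((a*k + b) mod p) mod m.

    Assumes integer keys in the range 0 <= k < p, where p is prime.
    """
    a = rng.randrange(1, p)   # a must be non-zero
    b = rng.randrange(0, p)

    def h(k: int) -> int:
        return ((a * k + b) % p) % m

    return h


h = make_universal_hash(m=10)
print([h(k) for k in (9, 769, 1234)])  # slots depend on the random draw
```

Because a and b are drawn at table-creation time, no fixed set of keys is bad for every function in the family.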
When Collision Occurs...
  • A collision occurs when more than one item is mapped to the same location
    • Ex: n = 10, m = 10, h(k) = k mod 10
    • 9 will be mapped to 9
    • 769 will be mapped to 9
  • In probability theory, the birthday problem or birthday paradox pertains to the probability that in a set of randomly chosen people some pair of them will have the same birthday. In a group of 23 (or more) randomly chosen people, there is more than 50% probability that some pair of them will both have been born on the same day. For 57 or more people, the probability is more than 99%, reaching 100% as the number of people reaches 366 (with 365 possible birthdays). The mathematics behind this problem leads to a well-known cryptographic attack called the birthday attack.
  • When a collision occurs, an algorithm has to map the second, third, ..., n-th item to definite places in the table
  • To read data back from the table, the same algorithm must be used to retrieve it.
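
The birthday figures quoted above can be checked numerically. The sketch below assumes 365 equally likely birthdays and computes the probability that at least two of n people share one; the same calculation applies to n keys hashed uniformly into a table of 365 slots.

```python
def birthday_collision_probability(n: int, days: int = 365) -> float:
    """Probability that at least two of n uniformly random birthdays coincide."""
    if n > days:
        return 1.0  # pigeonhole principle
    p_all_distinct = 1.0
    for i in range(n):
        # The (i+1)-th person must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


for n in (22, 23, 57):
    print(n, round(birthday_collision_probability(n), 4))
```

For hashing, the lesson is that collisions appear long before the table is full, so a collision resolution strategy is always needed.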
Resolving Collision

Chaining

  • Put all the elements that collide in a chain (list) attached to the slot.
  • The hash table is an array of linked lists
  • The load factor indicates the average number of elements stored in a chain. It could be less than, equal to, or larger than 1.
What is Load Factor?
  • Given a hash table of size m, and n elements stored in it, we define the load factor of the table as λ = n/m (lambda)
  • The load factor gives us an indication of how full the table is.
  • The possible values of the load factor depend on the method we use for resolving collisions.
Return to Resolving Collision: Chaining (ctd.)
  • Chaining puts elements that hash to the same slot in a linked list
  • Separate chaining: array of M linked lists.
    • Hash: map key to integer i between 0 and M-1.
    • Insert: put at front of ith chain.
      • constant time
    • Search: only need to search ith chain.
      • proportional to length of chain
  • Insert/Delete/Lookup in expected O(1) time
    • Keep the list doubly-linked to facilitate deletions
  • Worst case of lookup time is linear.
    • The expected O(1) bound assumes that the chains are kept small.
    • If the chains start becoming too long, the table must be enlarged and all the keys rehashed.
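
A compact sketch of separate chaining with rehashing. Several details are assumptions made for brevity: Python lists stand in for the linked lists, new items are appended at the end of a chain, Python's built-in `hash` is the hash function, and the resize threshold `max_load` and growth rule `2*m + 1` are arbitrary illustrative choices.

```python
class ChainedHashTable:
    """Separate chaining: an array of m chains, one per slot."""

    def __init__(self, m=11, max_load=2.0):
        self.m = m                  # number of slots
        self.n = 0                  # number of stored keys
        self.max_load = max_load    # assumed threshold for enlarging the table
        self.slots = [[] for _ in range(m)]

    def _chain(self, key):
        return self.slots[hash(key) % self.m]

    def insert(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:            # key already present: overwrite its value
                chain[i] = (key, value)
                return
        chain.append((key, value))
        self.n += 1
        if self.n / self.m > self.max_load:
            self._rehash(2 * self.m + 1)

    def search(self, key):
        for k, v in self._chain(key):   # cost proportional to chain length
            if k == key:
                return v
        return None

    def delete(self, key):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                self.n -= 1
                return True
        return False

    def _rehash(self, new_m):
        # Enlarge the table and redistribute every stored pair.
        items = [pair for chain in self.slots for pair in chain]
        self.m = new_m
        self.slots = [[] for _ in range(new_m)]
        for k, v in items:
            self.slots[hash(k) % new_m].append((k, v))


symbols = ChainedHashTable(m=3)
for i in range(20):
    symbols.insert(f"id{i}", i)
print(symbols.search("id7"), symbols.m)  # 7 15
```

Starting from m = 3, the 20 insertions trigger two rehashes (to 7 and then 15 slots), which keeps the load factor at or below the threshold.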
Chaining Performance
  • Search cost is proportional to length of chain.
    • Trivial: average length = N / M.
    • Worst case: all keys hash to same chain.
  • Theorem. Let λ = N / M > 1 be the average length of a list, which is called the load factor.
    • Average search cost: 1 + λ/2
  • What is the choice of M?
    • M too large: too many empty chains.
    • M too small: chains too long.
    • Typical choice: λ = N / M ≈ 10, giving constant-time search/insert.
Chaining Performance
  • Analysis of successful search:
    • Expected number e of elements examined during a successful search for key k = one more than the expected number of elements examined when k was inserted.
  • It makes no difference whether we insert at the beginning or the end of the list.
  • Take the average, over the n items in the table, of 1 plus the expected length of the chain to which the ith element was added:
    • e = (1/n) Σ_{i=1..n} (1 + (i - 1)/m) = 1 + λ/2 - 1/(2m)
Open Addressing

  • Store all elements within the table
  • The space we save from the chain pointers is used instead to make the array larger.
  • If there is a collision, probe the table in a systematic way to find an empty slot.
  • If the table fills up, we need to enlarge it and rehash all the keys.
Open Addressing
  • Hash function: (h(k) + i) mod m for i = 0, 1, ..., m-1
  • Insert: start with the location where the key hashed and do a sequential search for an empty slot.
  • Search: start with the location where the key hashed and do a sequential search until you either find the key (success) or find an empty slot (failure).
  • Delete: (lazy deletion) follow the same route but mark the slot as DELETED rather than EMPTY; otherwise subsequent searches for keys stored further along the probe sequence would stop too early and fail.
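
The three operations above, including lazy deletion, can be sketched as follows. The sentinel objects `EMPTY` and `DELETED`, the class name, and the use of Python's built-in `hash` are illustrative assumptions; the table stores keys only and does not resize.

```python
EMPTY = object()     # slot has never been used
DELETED = object()   # slot was used, then lazily deleted


class LinearProbingTable:
    def __init__(self, m=11):
        self.m = m
        self.keys = [EMPTY] * m

    def _probe(self, key):
        start = hash(key) % self.m
        for i in range(self.m):
            yield (start + i) % self.m      # (h(k) + i) mod m

    def insert(self, key):
        first_free = None
        for j in self._probe(key):
            slot = self.keys[j]
            if slot is EMPTY:
                if first_free is None:
                    first_free = j
                break                        # key cannot be further along
            if slot is DELETED:
                if first_free is None:
                    first_free = j           # reusable, but keep looking for key
            elif slot == key:
                return j                     # already present
        if first_free is None:
            raise RuntimeError("table is full")
        self.keys[first_free] = key
        return first_free

    def search(self, key):
        for j in self._probe(key):
            slot = self.keys[j]
            if slot is EMPTY:
                return None                  # failure: hit a never-used slot
            if slot is not DELETED and slot == key:
                return j
        return None

    def delete(self, key):
        j = self.search(key)
        if j is None:
            return False
        self.keys[j] = DELETED               # not EMPTY, so later probes continue
        return True


table = LinearProbingTable(m=10)
print(table.insert(9), table.insert(769))    # 9 0  (769 collides and wraps around)
table.delete(9)
print(table.search(769))                     # 0  (still found past the DELETED slot)
```

If `delete` wrote EMPTY instead of DELETED, the final search would stop at slot 9 and wrongly report that 769 is missing.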
Hash Table without Linked-List
  • Linear probing: array of size M.
  • Hash: map key to integer i between 0 and M-1.
  • Insert: put in slot i if free, if not try i+1, i+2, etc.
  • Search: search slot i, if occupied but no match, try i+1, i+2, etc.
  • Cluster.
  • Contiguous block of items.
  • Search through cluster using elementary algorithm for arrays.
Open Address Linear Probing
  • Advantage: very easy to implement
    • Disadvantage: primary clustering
    • Long sequences of used slots build up with gaps between them. Every insertion requires several probes and adds to the cluster.
    • The average length of a probe sequence when inserting is approximately (1/2)(1 + 1/(1 - λ)^2), where λ is the load factor
Quadratic Probes
  • Probe the table at slots (h(k) + i^2) mod m for i = 0, 1, 2, 3, ..., m-1
    • Ease of computation:
    • Not as easy as linear probing.
  • Do we really have to compute a power?
    • No: consecutive squares differ by 1, 3, 5, ..., so each probe can be obtained from the previous one by an addition.
    • Clustering
    • Primary clustering is avoided, since the probes are not sequential.
Search Quadratic Probing
  • Probe sequence for hash value 3 in a table of size 16 (all results mod 16):

3 + 0^2 = 3
3 + 1^2 = 4
3 + 2^2 = 7
3 + 3^2 = 12
3 + 4^2 = 3
3 + 5^2 = 12
3 + 6^2 = 7
3 + 7^2 = 4
3 + 8^2 = 3
3 + 9^2 = 4
3 + 10^2 = 7
3 + 11^2 = 12
3 + 12^2 = 3
3 + 13^2 = 12
3 + 14^2 = 7
3 + 15^2 = 4

  • Only four distinct slots (3, 4, 7 and 12) are ever probed.

Quadratic Probing
  • Probe sequence for hash value 3 in a table of size 19 (all results mod 19):

3 + 0^2 = 3
3 + 1^2 = 4
3 + 2^2 = 7
3 + 3^2 = 12
3 + 4^2 = 0
3 + 5^2 = 9
3 + 6^2 = 1
3 + 7^2 = 14
3 + 8^2 = 10
3 + 9^2 = 8
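
Both probe tables can be regenerated with a few lines of Python, which also makes it easy to compare how many distinct slots each table size reaches. The function name and the optional `count` parameter are illustrative.

```python
def quadratic_probe_sequence(h, m, count=None):
    """Slots (h + i^2) mod m for i = 0 .. count-1 (default: m probes)."""
    count = m if count is None else count
    return [(h + i * i) % m for i in range(count)]


size_16 = quadratic_probe_sequence(3, 16)
size_19 = quadratic_probe_sequence(3, 19)

print(sorted(set(size_16)))   # [3, 4, 7, 12]
print(size_19[:10])           # [3, 4, 7, 12, 0, 9, 1, 14, 10, 8]
print(len(set(size_19)))      # 10
```

With the power-of-two size only 4 of the 16 slots are reachable from hash value 3, while the prime size 19 reaches 10 different slots, one reason prime table sizes are preferred for quadratic probing.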

Quadratic Probing
  • Disadvantage: secondary clustering:
  • if h(k1) == h(k2), the probing sequences for k1 and k2 are exactly the same.
  • Is this really bad?
    • In practice, not so much
    • It becomes an issue when the load factor is high.
Double Hashing
  • The hash function is (h(k) + i*h2(k)) mod m
  • In English: use a second hash function to obtain the next slot.
    • The probing sequence (mod m) is:
  • h(k),  h(k)+h2(k),  h(k)+2h2(k),  h(k)+3h2(k), ...
  • Performance:
    • Much better than linear or quadratic probing.
    • Does not suffer from clustering
    • BUT requires computation of a second function
Double Hashing
  • The choice of h2(k) is important
    • It must never evaluate to zero
  • Consider h2(k) = k mod 9 for k = 81: the step is 0, so every probe lands on the same slot
    • The choice of m is important
    • If it is not prime, we may run out of alternate locations very fast.
  • After 70% of the table is full, double the size of the hash table.
  • Don’t forget to make the new size a prime number
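
The two warnings above (a zero step, and a non-prime table size) can be demonstrated directly. The helper `probe_sequence` and the second hash `safe_h2(k) = 1 + (k mod (m - 1))` are assumptions chosen for illustration; the slides do not prescribe a particular h2.

```python
def probe_sequence(h1, step, m):
    """Slots visited by double hashing: (h1 + i*step) mod m for i = 0 .. m-1."""
    return [(h1 + i * step) % m for i in range(m)]


def safe_h2(k, m):
    """An assumed second hash that is never zero: values lie in 1 .. m-1."""
    return 1 + (k % (m - 1))


k = 81
bad_step = k % 9                                   # 0: the probe never moves
print(len(set(probe_sequence(k % 13, bad_step, 13))))        # 1

# Non-prime m = 12 with step 4: only 3 of the 12 slots are reachable.
print(len(set(probe_sequence(3, 4, 12))))                    # 3

# Prime m = 13 with a non-zero step: every slot is reachable.
print(len(set(probe_sequence(k % 13, safe_h2(k, 13), 13))))  # 13
```

When m is prime, any step between 1 and m-1 is coprime to m, so the probe sequence is a permutation of the whole table.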
Lempel-Ziv-Welch (LZW) Compression Algorithm
  • Introduction to the LZW Algorithm
  • Example 1: Encoding using LZW
  • Example 2: Decoding using LZW
  • LZW: Concluding Notes
Introduction to LZW
  • As mentioned earlier, static coding schemes require some knowledge about the data before encoding takes place.
  • Universal coding schemes, like LZW, do not require advance knowledge and can build such knowledge on-the-fly.
  • LZW is the foremost technique for general purpose data compression due to its simplicity and versatility.
  • It is the basis of many PC utilities that claim to “double the capacity of your hard drive”
  • LZW compression uses a code table, with 4096 as a common choice for the number of table entries.
Introduction to LZW (cont'd)
  • Codes 0-255 in the code table are always assigned to represent single bytes from the input file.
  • When encoding begins the code table contains only the first 256 entries, with the remainder of the table being blanks.
  • Compression is achieved by using codes 256 through 4095 to represent sequences of bytes.
  • As the encoding continues, LZW identifies repeated sequences in the data, and adds them to the code table.
  • Decoding is achieved by taking each code from the compressed file, and translating it through the code table to find what character or characters it represents.
LZW Encoding Algorithm

Initialize table with single character strings
P = first input character
WHILE not end of input stream
    C = next input character
    IF P + C is in the string table
        P = P + C
    ELSE
        output the code for P
        add P + C to the string table
        P = C
END WHILE
output code for P
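
The encoding algorithm translates almost line by line into Python. This is a sketch under a few assumptions: the input is a string whose characters all have codes below 256, the string table is a Python dict, and once `max_codes` entries exist no further strings are added (one of several possible policies for a full table).

```python
def lzw_encode(text, max_codes=4096):
    """Return the list of LZW codes for text (characters must have codes < 256)."""
    # Codes 0-255 represent the single characters.
    table = {chr(i): i for i in range(256)}
    if not text:
        return []
    codes = []
    p = text[0]
    for c in text[1:]:
        if p + c in table:
            p = p + c                      # keep extending the current match
        else:
            codes.append(table[p])         # emit the longest known prefix
            if len(table) < max_codes:
                table[p + c] = len(table)  # next free code: 256, 257, ...
            p = c
    codes.append(table[p])                 # flush the final match
    return codes


print(lzw_encode("BABAABAAA"))  # [66, 65, 256, 257, 65, 260]
```

Nine 8-bit characters become six codes; with 12-bit codes that is 72 bits in and 72 bits out, so savings only appear once the table holds longer strings.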

Example 1: Compression using LZW

Example 1: Use the LZW algorithm to compress the string BABAABAAA

Example 1: LZW Compression Step 6



LZW Decompression
  • The LZW decompressor creates the same string table during decompression.
  • It starts with the first 256 table entries initialized to single characters.
  • The string table is updated for each code in the input stream, except the first one.
  • Decoding is achieved by reading codes and translating them through the code table being built.
LZW Decompression Algorithm

Initialize table with single character strings
OLD = first input code
output translation of OLD
WHILE not end of input stream
    NEW = next input code
    IF NEW is not in the string table
        S = translation of OLD
        S = S + C
    ELSE
        S = translation of NEW
    output S
    C = first character of S
    add translation of OLD + C to the string table
    OLD = NEW
END WHILE
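
A Python sketch of the decompression algorithm follows. It assumes the codes were produced by an encoder using the same 256 initial entries and the same table-size limit. One detail is added that the pseudocode leaves implicit: C is initialised to the first character of the first output, so that the "NEW is not in the string table" branch is well defined even on the second code.

```python
def lzw_decode(codes, max_codes=4096):
    """Rebuild the original string from a list of LZW codes."""
    # Codes 0-255 translate to the single characters.
    table = {i: chr(i) for i in range(256)}
    if not codes:
        return ""
    old = codes[0]
    out = [table[old]]
    c = table[old][0]                # assumed initial value for C
    for new in codes[1:]:
        if new not in table:
            # The code was created by the encoder in its previous step:
            # it must be translation(OLD) + first character of translation(OLD).
            s = table[old] + c
        else:
            s = table[new]
        out.append(s)
        c = s[0]
        if len(table) < max_codes:
            table[len(table)] = table[old] + c   # same entry the encoder added
        old = new
    return "".join(out)


print(lzw_decode([66, 65, 256, 257, 65, 260]))  # BABAABAAA
```

The last code, 260, is not yet in the table when it arrives, so this input exercises the special case in the IF branch.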


Example 2: LZW Decompression 1

Example 2: Use LZW to decompress the output sequence of Example 1:

<66><65><256><257><65><260>


Example 2: LZW Decompression Step 1

<66><65><256><257><65><260> Old = 65 S = A

New = 65 C = A

Example 2: LZW Decompression Step 2

<66><65><256><257><65><260> Old = 256 S = BA

New = 256 C = B

Example 2: LZW Decompression Step 3

<66><65><256><257><65><260> Old = 257 S = AB

New = 257 C = A

Example 2: LZW Decompression Step 4

<66><65><256><257><65><260> Old = 65 S = A

New = 65 C = A

Example 2: LZW Decompression Step 5

<66><65><256><257><65><260> Old = 260 S = AA

New = 260 C = A

LZW: Some Notes
  • This algorithm compresses repetitive sequences of data well.
  • Since the codewords are 12 bits, any single encoded character will expand the data size rather than reduce it.
  • In this example, 72 bits are represented with 72 bits of data. After a reasonable string table is built, compression improves dramatically.
  • Advantages of LZW over Huffman:
    • LZW requires no prior information about the input data stream.
    • LZW can compress the input stream in one single pass.
    • Another advantage of LZW is its simplicity, allowing fast execution.
LZW: Limitations
  • What happens when the dictionary gets too large (i.e., when all the 4096 locations have been used)?
  • Here are some options usually implemented:
    • Simply forget about adding any more entries and use the table as is.
    • Throw the dictionary away when it reaches a certain size.
    • Throw the dictionary away when it is no longer effective at compression.
    • Clear entries 256-4095 and start building the dictionary again.
      • Some clever schemes rebuild a string table from the last N input characters.