## A Practical Minimal Perfect Hashing Method


**A Practical Minimal Perfect Hashing Method**

Fabiano C. Botelho¹, Yoshiharu Kohayakawa² and Nivio Ziviani¹

¹ Dept. of Computer Science, Federal University of Minas Gerais. ² Dept. of Computer Science, University of São Paulo.

LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

**What is the Problem to Solve?**

• Finding algorithms that:
  • Construct minimal perfect hash functions faster than the ones available in the literature.
  • Use little memory to generate minimal perfect hash functions.
  • Generate minimal perfect hash functions that can be represented with a very economical description.
  • Construct minimal perfect hash functions that can be evaluated very fast.

**Characteristics of Our Method**

• We are able to find minimal perfect hash functions using cyclic random graphs.
• We have to impose a restriction:
  • The random graphs must have at most 50% of their edges in cycles.

**Basic Concepts – Hash Function**

[Figure: a hash function maps a set S of n keys (case, if, set, for, while, ...) to a hash table with slots 0 to m − 1; one collision is marked.]

**Basic Concepts – Perfect Hash Function**

[Figure: a perfect hash function maps a set of n keys (0 to n − 1) to a hash table with slots 0 to m − 1.]

**Basic Concepts – Order Preserving Perfect Hash Function**

• A perfect hash function h is order preserving if the keys in S are arranged in some order and h preserves this order in the hash table.
• For example, considering the lexicographic order: since Fabiano < Nivio < Yoshi, we have h(Fabiano) < h(Nivio) < h(Yoshi).

**Basic Concepts – Minimal Perfect Hash Function**

[Figure: a minimal perfect hash function maps a set of n keys (0 to n − 1) to a hash table with slots 0 to n − 1.]
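The three definitions above (perfect, minimal perfect, order preserving) can be written as small predicates. The following is only an illustrative sketch: the function names and the lookup-table "hash functions" are hypothetical and are not part of the presented method.

```python
def is_perfect(h, keys, m):
    """True if h maps the keys to distinct slots in the range [0, m)."""
    values = [h(k) for k in keys]
    in_range = all(0 <= v < m for v in values)
    return in_range and len(set(values)) == len(values)


def is_minimal_perfect(h, keys):
    """Perfect, with a table that has exactly as many slots as keys."""
    return is_perfect(h, keys, len(keys))


def is_order_preserving(h, keys):
    """keys are listed in the desired order; h must be strictly increasing."""
    values = [h(k) for k in keys]
    return all(a < b for a, b in zip(values, values[1:]))


# Toy functions given as lookup tables (illustration only).
KEYS = ["Fabiano", "Nivio", "Yoshi"]
ordered = {"Fabiano": 0, "Nivio": 1, "Yoshi": 2}.__getitem__
shuffled = {"Fabiano": 2, "Nivio": 0, "Yoshi": 1}.__getitem__
sparse = {"Fabiano": 0, "Nivio": 3, "Yoshi": 5}.__getitem__
colliding = {"Fabiano": 1, "Nivio": 1, "Yoshi": 2}.__getitem__
```

Here `ordered` is an order preserving minimal perfect hash function, `shuffled` is minimal perfect but not order preserving, `sparse` is perfect only for a table with at least 6 slots, and `colliding` is not perfect at all.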
**Applications**

• Nowadays huge collections are common.
• Memory-efficient storage and fast retrieval of items from static sets:
  • Words in natural languages.
  • Reserved words in programming languages or interactive systems.
  • Uniform resource locators (URLs) in Web search engines.
  • Item sets in data mining techniques.
  • Among others.

**Approach Used for Constructing Minimal Perfect Hash Functions**

• MOS - Mapping, Ordering and Searching:
  • Mapping: transforms the key set from the original universe to a new universe.
  • Ordering: places the keys in a sequential order that determines the order in which hash values are assigned to keys.
  • Searching: attempts to assign hash values to the keys.

**A Starting Point**

• We use as a starting point the algorithm proposed in:
  • Z.J. Czech, G. Havas, and B.S. Majewski. An optimal algorithm for generating minimal perfect hash functions. Information Processing Letters, 43(5):257-264, 1992.
• We use the acronym CHM, after its authors, to refer to this algorithm.
• The CHM algorithm generates order preserving minimal perfect hash functions using acyclic random graphs.
• This implies that more memory is needed to generate and to store the resulting function.

**The CHM Algorithm**

• To generate an acyclic graph, two vertices h1(x) and h2(x) are computed for each key x in S, where [definitions of h1 and h2 missing in source].
• The set of edges is E(G) = {{h1(x), h2(x)} : x in S}.
• To show how the CHM algorithm works, we use a small example with the abbreviated names of the first six months: S = {jan, feb, mar, apr, mai, jun}.

**CHM Algorithm - Mapping**

| Key | h1 | h2 |
|-----|----|----|
| jan | 6  | 4  |
| feb | 2  | 3  |
| mar | 3  | 0  |
| apr | 7  | 0  |
| mai | 6  | 7  |
| jun | 1  | 4  |

[Figure: graph G on vertices 0 to 7 with one edge {h1(x), h2(x)} per key.]

• The resulting graph must be acyclic.

**CHM Algorithm - Ordering**

[Figure: graph G with each edge labelled by the address of its key.]

• jan is placed in address 0
• feb is placed in address 1
• mar is placed in address 2
• apr is placed in address 3
• mai is placed in address 4
• jun is placed in address 5

**How to Obtain an Acyclic Random Graph?**

• The algorithm repeatedly selects h1 and h2.
• They proved that if |V(G)| = cn and c > 2, then the probability that G is acyclic is: [formula missing in source]
• For c = 2.09, this probability is p = 0.342.
• The expected number of iterations to obtain an acyclic graph is 1/p = 2.92.

**CHM Algorithm - Searching**

• Given the acyclic random graph G = (V, E):

[Figure: graph G with the vertex labels g being assigned. Recoverable annotations: g(0) = 0 and g(2) = (1 − 0) mod 6 = 1.]

• The problem is to find an assignment of values to V(G) that makes the function h(x) = (g(h1(x)) + g(h2(x))) mod n an order preserving minimal perfect hash function.

**CHM Algorithm – Evaluating the Resulting Function**

• MPHF: h(x) = (g(h1(x)) + g(h2(x))) mod n

[Figure: graph G with its g labels, highlighting h1(jan) = 6 and h2(jan) = 4.]

**Our Method**

• The goals are:
  • Constructing minimal perfect hash functions for huge sets of keys.
  • Constructing a minimal perfect hash function in O(n) time, with a small constant.
  • Using little memory to generate the functions.
  • Storing the functions with a very economical description.

**Our Method**

• The algorithm shares several features with the CHM algorithm.
• The differences are:
  • We generate cyclic random graphs G = (V, E) with |V(G)| = cn and |E(G)| = |S| = n, where [bound on c missing in source].
  • They generate acyclic random graphs with a greater number of vertices: [bound missing in source].
  • They generate order preserving minimal perfect hash functions, while our algorithm does not preserve order.
• Thus, our algorithm improves the space requirement at the expense of generating functions that are not order preserving.

**Our Method**

• The algorithm uses the MOS approach to look for a function (MPHF): h(x) = g(a) + g(b)
• Where:
  • e = {a, b}
  • a = h1(x)
  • b = h2(x)
  • x is a key in S
• To show how our algorithm works, we use a small example with the abbreviated names of the first eight months:
  • S = {jan, feb, mar, apr, mai, jun, jul, ago}

**Our Method - Mapping**

| Key | h1 | h2 |
|-----|----|----|
| jan | 7  | 0  |
| feb | 1  | 2  |
| mar | 8  | 1  |
| apr | 3  | 4  |
| mai | 4  | 8  |
| jun | 8  | 0  |
| jul | 3  | 8  |
| ago | 7  | 8  |

• G must be simple and must have at most 50% of its edges in cycles.

[Figure: graph G on vertices 0 to 8 with one edge {h1(x), h2(x)} per key.]

**How to Obtain a Simple Random Graph?**

• The probability that G = (V, E), with |E| = n and |V| = cn, is simple is: [formula missing in source]
• For c = 1.15, this probability is p = 0.47.
• The expected number of iterations to obtain a simple graph is 1/p = 2.12.
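The mapping step above can be sketched as follows, using the h1 and h2 values from the eight-month example. This is a minimal sketch under stated assumptions: the helper names are hypothetical, random lookup tables stand in for real hash functions, and only simplicity is checked (the "at most 50% of edges in cycles" condition is not tested here).

```python
import random

# h1/h2 values copied from the mapping example (keys: abbreviated months).
H1 = {"jan": 7, "feb": 1, "mar": 8, "apr": 3,
      "mai": 4, "jun": 8, "jul": 3, "ago": 7}
H2 = {"jan": 0, "feb": 2, "mar": 1, "apr": 4,
      "mai": 8, "jun": 0, "jul": 8, "ago": 8}
KEYS = list(H1)


def build_edges(keys, h1, h2):
    """One edge {h1(x), h2(x)} per key x, kept in key order."""
    return [(h1[k], h2[k]) for k in keys]


def is_simple(edges):
    """True if the graph has no self-loops and no repeated edges."""
    seen = set()
    for a, b in edges:
        edge = frozenset((a, b))
        if a == b or edge in seen:
            return False
        seen.add(edge)
    return True


def map_until_simple(keys, num_vertices, rng, max_tries=10_000):
    """Redraw h1 and h2 until the graph is simple.

    Assumption: random lookup tables replace real hash functions; this
    only illustrates the retry loop of the mapping step.
    """
    for attempt in range(1, max_tries + 1):
        h1 = {k: rng.randrange(num_vertices) for k in keys}
        h2 = {k: rng.randrange(num_vertices) for k in keys}
        edges = build_edges(keys, h1, h2)
        if is_simple(edges):
            return edges, attempt
    raise RuntimeError("no simple graph found within max_tries")


EXAMPLE_EDGES = build_edges(KEYS, H1, H2)
edges, attempts = map_until_simple(KEYS, num_vertices=9, rng=random.Random(0))
```

`is_simple(EXAMPLE_EDGES)` holds for the example graph, and `attempts` counts how many draws of h1 and h2 the retry loop needed, which is the quantity whose expectation is 1/p.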
**Our Method - Ordering**

[Figure: graph G on vertices 0 to 8 before and after the ordering step.]

[Figure: the 2-core of G.]

[Figure: the acyclic part of G.]

**How to Obtain a Random Graph With at Most 50% of Edges in Cycles?**

• The crucial step now is to determine the value of c (in |V(G)| = cn) that yields a random graph with at most 50% of its edges in cycles.
• This is equivalent to determining the value of c for which the expected number of edges in the 2-core of G is 0.5|E(G)|.

**Determining the Value of c Theoretically**

• Pittel and Wormald (2005) present detailed results for the 2-core of the giant component of the random graph G.
• They have determined that |Vcrit| and |Ecrit| are given by: [formulas missing in source], where [definition missing in source] and 0 < T < 1 is the unique solution to the equation [equation missing in source]. (Figure annotation: average degree of vertices in G.)

**Determining the Value of c Theoretically**

• Using the equations to calculate |Vcrit| and |Ecrit| we have: [results missing in source]
• We determined empirically that c = 1.15.

**Our Method - Searching**

• The labelling g is defined such that h(x) = g(h1(x)) + g(h2(x)) is a minimal perfect hash function.
• First, we obtain the g values for the vertices in Gcrit.
• Second, we obtain the g values for the vertices in Gncrit.

**Assignment of Values to Critical Vertices**

• The labels g(v) (v in Vcrit) are assigned in increasing order following a greedy strategy.
• The critical vertices v are considered one at a time according to a breadth-first search on Gcrit.
• If a candidate value x for g(v) is forbidden because setting g(v) = x would create two edges with the same sum:
  • Try x + 1 for g(v).

**Assignment of Values to Critical Vertices**

• Let us apply the algorithm to the critical graph (2-core) obtained in the ordering step of the example.

[Figure sequence: labels are assigned one at a time to the 2-core vertices 0, 3, 4, 7 and 8. Labels shown: g:0, g:1, g:2, g:3; the candidates g:4 and g:5 are rejected (reassignment) before g:6 is accepted.]

• Used addresses: {1, 2, 3, 5, 6, 7}
• Unused addresses: {0, 4}

**Assignment of Values to Non-Critical Vertices**

[Figure sequence: labels are assigned to the non-critical vertices using the unused addresses; the set of unused addresses shrinks from {0, 4} to {4} to {}.]

**Analysis of the Searching Step**

• We have shown that the maximal value assigned to an edge is: [formula missing in source]
• We have also shown that the number of back edges of G is: Nbedges = |Ecrit| − |Vcrit| + 1

**Analysis of the Searching Step**

• Combining these results and considering that [condition missing in source], we obtain [bound missing in source].
• If [condition missing in source], then [consequence missing in source] and an MPHF is generated in linear time.
• The only problem left open is to prove that [statement missing in source].
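The ordering and searching steps described above can be sketched on the eight-month example. This is a simplified illustration, not the authors' implementation: the function names are hypothetical, the graph is assumed to be simple, the BFS visits vertices in a different order than the slides (so the individual g values differ), and g values on non-critical vertices are allowed to be negative.

```python
from collections import deque


def two_core_edges(num_vertices, edges):
    """Indices of the critical edges: peel degree-1 vertices repeatedly."""
    incident = [set() for _ in range(num_vertices)]
    for i, (a, b) in enumerate(edges):
        incident[a].add(i)
        incident[b].add(i)
    alive = set(range(len(edges)))
    stack = [v for v in range(num_vertices) if len(incident[v]) == 1]
    while stack:
        v = stack.pop()
        if len(incident[v]) != 1:
            continue
        (i,) = incident[v]
        a, b = edges[i]
        u = b if a == v else a
        incident[v].discard(i)
        incident[u].discard(i)
        alive.discard(i)
        if len(incident[u]) == 1:
            stack.append(u)
    return alive


def search(num_vertices, edges):
    """Return labels g with {g[a] + g[b]} = {0, ..., n-1} over all edges."""
    n = len(edges)
    critical = two_core_edges(num_vertices, edges)
    adj = [[] for _ in range(num_vertices)]
    for i, (a, b) in enumerate(edges):
        adj[a].append((b, i))
        adj[b].append((a, i))
    g = [None] * num_vertices
    seen = [False] * num_vertices
    used, x = set(), 0
    # Critical vertices: BFS over the 2-core, greedy increasing candidates.
    for root in range(num_vertices):
        if seen[root] or not any(i in critical for _, i in adj[root]):
            continue
        seen[root] = True
        queue = deque([root])
        while queue:
            v = queue.popleft()
            labelled = [g[u] for u, i in adj[v]
                        if i in critical and g[u] is not None]
            while used & {gu + x for gu in labelled}:
                x += 1  # candidate forbidden: try x + 1
            g[v] = x
            used |= {gu + x for gu in labelled}
            x += 1
            for u, i in adj[v]:
                if i in critical and not seen[u]:
                    seen[u] = True
                    queue.append(u)
    if any(a >= n for a in used):
        raise ValueError("address out of range; retry with new h1 and h2")
    # Non-critical vertices: hand out the unused addresses along the trees.
    free = deque(a for a in range(n) if a not in used)

    def spread(queue):
        while queue:
            v = queue.popleft()
            for u, i in adj[v]:
                if i not in critical and g[u] is None:
                    g[u] = free.popleft() - g[v]  # may be negative here
                    queue.append(u)

    spread(deque(v for v in range(num_vertices) if g[v] is not None))
    for v in range(num_vertices):
        if g[v] is None:  # isolated vertex or tree not attached to the 2-core
            g[v] = 0
            spread(deque([v]))
    return g


# Edges in key order: jan, feb, mar, apr, mai, jun, jul, ago.
EDGES = [(7, 0), (1, 2), (8, 1), (3, 4), (4, 8), (8, 0), (3, 8), (7, 8)]
G_LABELS = search(9, EDGES)
ADDRESSES = [G_LABELS[a] + G_LABELS[b] for a, b in EDGES]
```

For the example, `two_core_edges` keeps the six edges of jan, apr, mai, jun, jul and ago, and `ADDRESSES` is a permutation of 0 to 7, so h(x) = g(h1(x)) + g(h2(x)) is a minimal perfect hash function on S. When the greedy assignment produces an address of n or more, the sketch simply raises an error so that the caller can redo the mapping step.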
**Experimental Evidence**

• Experimental evidence that [statement missing in source]:
• Recall: Nbedges = |Ecrit| − |Vcrit| + 1 = 0.501n − 0.401n + 1 = 0.1n + 1.

**Experimental Settings**

• Our data consists of a collection of 100 million uniform resource locators (URLs) collected from the Web.
• The average length of a URL in the collection is 63 bytes.
• All experiments were carried out on a PC with a 2.4 GHz processor and 4 gigabytes of main memory.
• The table entries shown in the following represent averages over 50 trials.

**Experimental Results**

• Gains:
  • 59% in the time for constructing an MPHF.
  • The resulting functions are generated using 25% less memory than the CHM algorithm.
  • The resulting functions are stored in 55% of the space needed to store the ones generated by the CHM algorithm.

**Conclusions**

• We have presented a practical method to construct minimal perfect hash functions.
• The method uses only 24.80n + O(1) bytes to generate the functions.
• The method is very fast.
• So, it is a good option for huge static sets.
• The implementation of the method is available at http://cmph.sf.net under the LGPL free software license.

**Heuristic**

• We have proposed a heuristic that reduces the space requirement to any given value between 1.15n words and 0.93n words.
• The heuristic reuses, when possible, the set of x values that caused reassignments, just before trying x + 1.
• Problem: decreasing the value of c leads to an increase in the number of iterations to generate G.
• For example:
  • For c = 1 and c = 0.93, the analytical expected numbers of iterations are 2.72 and 3.17, respectively.
• However, the algorithm is still linear and needs less memory to generate and to store the resulting functions.

**Heuristic**

• Let us suppose that we have the following 2-core:

[Figure sequence: a 2-core on vertices 2 to 8 with labels g:0 to g:4 assigned and no unused g values. The figures then show reassignments (g:5 to g:6, and g:6 to g:7), with the set of unused g values changing from {} to {5}.]

**Heuristic**

• Unfortunately, the heuristic does not work for the previous example.
• However, it works well for random graphs.
• That is why we are able to reduce the value of c.
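The iteration counts quoted for the different values of c can be checked with a short calculation. The closed form below is an assumption: the formula itself is not preserved in this transcript, and p(c) = e^{-1/c²} is used only because it agrees with the three quoted values up to rounding.

```latex
% Assumption: p(c) = e^{-1/c^2} is not stated in the transcript; it is
% chosen because it agrees with the quoted values for c = 1.15, 1 and 0.93.
\begin{align*}
p(c) &\approx e^{-1/c^{2}}, \qquad N(c) = \frac{1}{p(c)} = e^{1/c^{2}} \\
N(1.15) &= e^{1/1.3225} \approx e^{0.756} \approx 2.13 \\
N(1) &= e^{1} \approx 2.72 \\
N(0.93) &= e^{1/0.8649} \approx e^{1.156} \approx 3.18
\end{align*}
```

Under this assumption p(1.15) ≈ 0.47, and the expected numbers of iterations agree with the quoted 2.12, 2.72 and 3.17 up to rounding. Reducing c from 1.15 to 0.93 therefore costs roughly one extra iteration of the mapping step on average.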