1 / 52

External Perfect Hashing for Very Large Key Sets

External Perfect Hashing for Very Large Key Sets. Fabiano C. Botelho. Department of Computer Science Federal University of Minas Gerais, Brazil. Nivio Ziviani. Department of Computer Science Federal University of Minas Gerais, Brazil. What Is The Problem to Solve?.

evania
Download Presentation

External Perfect Hashing for Very Large Key Sets

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. External Perfect Hashing for Very Large Key Sets Fabiano C. Botelho Department of Computer Science Federal University of Minas Gerais, Brazil Nivio Ziviani Department of Computer Science Federal University of Minas Gerais, Brazil LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)1

  2. What Is The Problem to Solve? Generate practical MPHFs for very large key sets (in the order of billions of keys) What is required for that? Efficiently dealing with external memory Generate compact functions (near space optimal) Using a little amount of internal memory to scale Work in linear time LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)2

  3. Where to Use a PHF or a MPHF for Large Sets? • Ubiquitous in Computer Science: access items based on the value of a key • The work with huge static item sets has become a daily task. For example: • In data warehousing applications: • On-Line Analytical Processing (OLAP) applications • In search engines: • Large vocabularies (ongoing work) • To map long URLs in smaller integer numbers that are used as IDs LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)3

  4. Mapping Long URLs To Smaller Integer Numbers • Link Analysis (PageRank, Hits, Salsa, and so on): 6 2 4 0 8 7 3 5 1 WEB graph: 9 Vertices: URLs Edges: Links among web pages

  5. M P H F Mapping Long URLs To Smaller Integer Numbers 0 WEB graph vertices 1 URL 1 URLS 2 URL 2 URL3 3 URL4 URL5 4 URL6 URL7 5 ... 6 URLn ... n-1

  6. Perfect Hash Function Key set S of size n ... 0 1 n -1 Hash Table Perfect Hash Function ... m -1 0 1 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)6

  7. Minimal Perfect Hash Function Key set S of size n ... 0 1 n -1 Minimal Perfect Hash Function Hash Table ... n -1 0 1 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)7

  8. Lower Bounds For Storage Space • PHFs (m ≈ n): • MPHFs (m = n): LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)8

  9. Uniform Hashing Versus Universal Hashing Key universe U of size u Hash function Range M of size m

  10. Uniform Hashing Versus Universal Hashing • Uniform hashing • # of functions from U to M? • Independent functions with values uniformly distributed • # of bits to encode each fucntion Key universe U of size u Hash function Range M of size m

  11. Uniform Hashing Versus Universal Hashing • Universal hashing • A family of hash functions H is universal if: • for any pair of distinct keys (x1, x2) from U and • a hash function h chosen uniformly from H then: • Uniform hashing • # of functions from U to M? • Independent functions with values uniformly distributed • # of bits to encode each fucntion Key universe U of size u Hash function Range M of size m X

  12. Our Contribution • First external perfect hashing (EPH) algorithm capable of efficiently dealing with sets in the order of billions of keys • This is possible because: • It makes good use of memory hierarchy. • It requires O(N) internal memory words for constructing the functions, where N is much lower than n. • Two implementations: • Theoretical well-founded EPH (uses uniform hashing) • Heuristic EPH (uses universal hashing) • The resulting functions are: • generated in linear time • requires O(n) bits to be stored • evaluated in constant time LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)12

  13. Related Work • Theoretical Results (use uniform hashing) • Practical Results (assume uniform hashing for free) • Empirical Results

  14. Theoretical Results

  15. Theoretical Results

  16. Practical Results – Assume Uniform Hashing For Free

  17. Practical Results – Assume Uniform Hashing For Free

  18. Practical Results – Assume Uniform Hashing For Free

  19. Empirical Results

  20. The External Perfect Hashing Algorithm

  21. The External Perfect Hashing Algorithm (EPH) n-1 0 1 Key Set S … h0 Partitioning Buckets … 0 1 2b - 1 2 Searching MPHF0 MPHF1 MPHF2 MPHF2b-1 Hash Table … n-1 0 1 mphf(x) = mphfi(x) + offset[i];

  22. Key Set Does Not Fit In Internal Memory n-1 0 Key Set S of β bytes ... … ... µ bytes of Internal memory µ bytes of Internal memory Partitioning h0 h0 … … … 0 0 1 2b - 1 1 2b - 1 2 2 File N File 1 Each bucket ≤ 256 N = β/µ b = Number of bits of each bucket address

  23. Important Design Decisions • We map long URLs to smaller keys of fixed size (128 bits) using a hash function • Choose a linear time and near space-optimal algorithm to generate the MPHF of each bucket • How do we obtain a linear time complexity? • Using internal radix sorting to form the buckets • Using a heap of N entries to drive a N-way merge that reads the buckets from disk in one pass

  24. Algorithm Used For The Buckets S f0(x) 2 0 1 information knowledge management Mapping management information knowledge f1(x) 5 3 4 Assigning The resulting functions require approximately 3.3 bits/key to be stored g 0 0 Hash Table 2 1 0 information Ranking 1 2 management 1 3 2 knowledge 2 1 4 5 2 LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)24

  25. Why the EPH Algorithm is Well-Founded? First Point: Pool of uniform hash functions on each bucket ≤ 256 Sharing f0 f1 f0 f1 … 0 1 2b - 1 2 Buckets

  26. Why the EPH Algorithm is Well-Founded? Second Point: We have shown how to create that pool based on the linear hash functions proposed by Alon et al (JACM 1999)

  27. Why the EPH Algorithm is Well-Founded? Second Point: We have shown how to create that pool based on the linear hash functions proposed by Alon et al (JACM 1999) The pool

  28. Why the EPH Algorithm is Well-Founded? Second Point: We have shown how to create that pool based on the linear hash functions proposed by Alon et al (JACM 1999) The linear hash functions The pool

  29. Why the EPH Algorithm is Well-Founded? Second Point: Computing 128 bits with the linear hash functions . . .

  30. Why the EPH Algorithm is Well-Founded? Third Point: How to keep the maximum bucket size smaller than l = 256?

  31. The Heuristic External Perfect Hashing Algorithm

  32. The Heuristic EPH Algorithm • Theoretically founded hash functions from the pool: • Slower than simpler universal hash functions and pseudo random hash functions • Require a lookup table that do not fit in cache • Heuristic EPH uses a pseudo random hash function proposed by Jenkins (1997): • Faster hash function • Requires just one random integer number as seed to be stored • Using universal hashing we often loose relatively little compared to using a complete uniform random map LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)32

  33. Experimental Results • Metrics: • Construction time • Storage space • Evaluation time • Collection: • 1.024 billions of URLs collected from the web (64 bytes long on average) • Experiments • Commodity PC with a cache of 512 Kbytes • 1 GHz, 1 GB, Linux, 64 bits architeture LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)33

  34. Construction Time (in Minutes) Time in minutes to construct a MPHF for 32, 128, 512 and 1024 millions of keys

  35. Construction Time

  36. Related Algorithms • Botelho, Kohayakawa, Ziviani (2005) - BKZ • Botelho, Pagh, Ziviani (2007) - BPZ • Botelho, Ziviani (2007) - EPH • Fox, Chen and Heath (1992) – FCH • Czech, Havas and Majewski (1992) – CHM • Pagh (1999) - PAGH All algorithms coded in the same framework: CMPH free software library (http://cmph.sourceforge.net)

  37. Construction Time 2 million keys

  38. Construction Time and Storage Space 2 million keys

  39. Construction Time, Storage Space and Evaluation Time 2 million keys Key length = 64 bytes

  40. Internal Memory Size Versus Generation Time LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)40

  41. CMinimal Perfect Hashing Library • Why to build a library? • Lack of similar libraries in the free software community • To test the applicability of our algorithms out there • Feedbacks: • 1,310 downloads • Incorporated by Debian • Applications: • Norton spam filter (Symantec) • Natural languages translation models (University of California) • Library address: http://cmph.sourceforge.net

  42. Conclusions • We have presented an efficient external memory based algorithm • Two implementations were developed: • Theoretical EPH algorithm • Heuristic EPH algorithm • Both algorithms generate near space-optimal functions in linear time • The functions are computed in time O(1) and are simple enough to be used in practice. • The EPH algorithm is the first theorecally well-founded algorithm that can be used in practice and will work for every key set. LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)42

  43. Ongoing Work – Large Vocabularies • Problem: • Large vocabularies use a considerable portion of the main memory • Access to search engine vocabularies follows a Zipf distribution LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)43

  44. Ongoing Work – Large Vocabularies • Solution: • MPHF to index the vocabulay entries in disk and a cache to resolve most of the accesses quickly • Evaluation: • To compare our approach with the one that uses a BTree LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)44

  45. Ongoing Work - Distributed EPH • Problem: • How to build a distributed version of the EPH algorithm? • Solution: • Data parallelism • Evaluation: • To compare with the original EPH algorithm LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)45

  46. ? LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)46

  47. Important Design Decisions a) Buckets Logical View a) Buckets Phisical View … … … … … 0 1 2b - 1 2 File 1 File 2 File N • To map long URLs to smaller keys of fixed size by using a hash function • How we obtain a linear time complexity? • Using radix sort to form the buckets in internal memory. • Using a heap of N entries to drive a N-way merge that reads the buckets from disk in the Searching step. LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)47

  48. Conclusions – CMinimal Perfect Hashing Library • Why to build a library? • Lack of similar libraries in the free software community • To test the applicability of our algorithms out there • Feedbacks: • We have released 3 versions • 1014 downloads (338 per version on average) • 100 emails thanking us for the library • .deb packages available for Ubuntu (a Linux distribution) • .deb packages distributed with Debian • An invitation to spend a summer in the University of California • A job offer from Symantec Corp. LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)48

  49. jan mar feb apr Using uniform hash functions to generate a Random Graph • Two hash functions induce a random 2-graph. • Example for the key set S = {jan, feb, mar, apr} Gr: h0 h0(jan) = 2 h1(jan) = 5 2 0 1 3 h0(feb) = 2 h1(feb) = 6 h0(mar) = 0 h1(mar) = 5 h0(apr) = 2 h1(abr) = 7 h1 6 4 5 7 • Then, r hash functions induce a random r-graph. LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)49

  50. Influence of the internal memory size in the generation time LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)50

More Related