A highly scalable perfect hashing algorithm
Download
1 / 89

A Highly Scalable Perfect Hashing Algorithm - PowerPoint PPT Presentation


  • 131 Views
  • Uploaded on

A Highly Scalable Perfect Hashing Algorithm. Nivio Ziviani Fabiano C. Botelho Department of Computer Science Federal University of Minas Gerais, Brazil. 3rd Intl. Conference on Scalable Information Systems Naples, Italy, June 4-6, 2008. Where is Belo Horizonte?.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' A Highly Scalable Perfect Hashing Algorithm' - johnna


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
A highly scalable perfect hashing algorithm

LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 1

A Highly Scalable Perfect Hashing Algorithm

Nivio Ziviani

Fabiano C. Botelho

Department of Computer Science

Federal University of Minas Gerais, Brazil

3rd Intl. Conference on Scalable Information Systems

Naples, Italy, June 4-6, 2008


Where is belo horizonte
Where is Belo Horizonte? LAboratory for Treating INformation (www.dcc.ufmg.br/latin)


Pampulha s church oscar niemeyer
Pampulha’s Church LAboratory for Treating INformation (www.dcc.ufmg.br/latin) Oscar Niemeyer


Objective of the presentation
Objective of the Presentation LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

Present a perfect hashing algorithm:

Sequential construction of the function

Distributed construction of the function

Description and evaluation of the function:

Centralized in one machine

Distributed among the participating machines

Algorithm is highly scalable, time

efficient and near space-optimal

LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 4


Perfect hash function

LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 5

Perfect Hash Function

Static key set S of size n

...

0

1

n -1

Hash Table

Perfect Hash Function

...

m -1

0

1


Minimal perfect hash function

LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 6

Minimal Perfect Hash Function

Static key set S of size n

...

0

1

n -1

Minimal Perfect Hash Function

Hash Table

...

n -1

0

1


The algorithm
The Algorithm LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

A perfect hashing algorithm that uses the idea of partitioning the input key set into small buckets:

  • Key set fits in the internal memory

    • Internal Random Access memory algorithm

  • Key set larger than the internal memory

    • External Cache-Aware memory algorithm


Where to use a phf or a mphf

LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 8

Where to use a PHF or a MPHF?

  • Access items based on the value of a key is ubiquitous in Computer Science

  • Work with huge static item sets:

    • In data warehousing applications:

      • On-Line Analytical Processing (OLAP) applications

    • In Web search engines:

      • Large vocabularies

      • Map long URLs in smaller integer numbers that are used as IDs


Indexing representing the vocabulary

LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 9

Vocabulary

Inverted List

Collection of

documents

Term 1

Doc 1

Doc 5

...

Doc 1

Doc 2

Doc 3

Doc 4

Doc 5

...

Doc n

Term 2

Doc 1

Doc 2

...

Term 3

Doc 3

Doc 4

...

Term 4

Doc 7

Doc 9

...

Term 5

Doc 6

Doc 10

...

Term 6

Doc 1

Doc 5

...

Term 7

Term 8

...

Term t

Doc 9

Doc 11

...

Indexing: Representing the Vocabulary

Indexing


Mapping urls to web graph vertices

LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 10

0

Web Graph Vertices

1

URL 1

URLS

2

URL 2

URL3

3

URL4

URL5

4

URL6

URL7

5

...

6

URLn

...

n-1

Mapping URLs to Web Graph Vertices


Mapping urls to web graph vertices1

LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 11

0

Web Graph Vertices

M

P

H

F

1

URL 1

URLS

2

URL 2

URL3

3

URL4

URL5

4

URL6

URL7

5

...

6

URLn

...

n-1

Mapping URLs to Web Graph Vertices


Information theoretical lower bounds for storage space
Information Theoretical Lower Bounds for LAboratory for Treating INformation (www.dcc.ufmg.br/latin) Storage Space

  • PHFs (m ≈ n):

  • MPHFs (m = n):

m < 3n


Related work
Related Work LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

  • Theoretical Results

    (use uniform hashing)

  • Practical Results

    (use universal hashing - assume uniform hashing for free)

  • Heuristics


Theoretical results use complete randomness uniform hash functions
Theoretical Results LAboratory for Treating INformation (www.dcc.ufmg.br/latin) Use Complete Randomness (Uniform Hash Functions)

LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)14


Theoretical results use complete randomness uniform hash functions1
Theoretical Results LAboratory for Treating INformation (www.dcc.ufmg.br/latin) Use Complete Randomness (Uniform Hash Functions)

LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)15


Practical results assume uniform hashing for free use universal hashing
Practical Results LAboratory for Treating INformation (www.dcc.ufmg.br/latin) Assume Uniform Hashing for Free (Use Universal Hashing)

LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)16


Practical results assume uniform hashing for free use universal hashing1
Practical Results LAboratory for Treating INformation (www.dcc.ufmg.br/latin) Assume Uniform Hashing for Free (Use Universal Hashing)

LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)17


Practical results assume uniform hashing for free use universal hashing2
Practical Results LAboratory for Treating INformation (www.dcc.ufmg.br/latin) Assume Uniform Hashing for Free (Use Universal Hashing)

LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)18


Empirical results
Empirical Results LAboratory for Treating INformation (www.dcc.ufmg.br/latin)


The Sequential LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

External Cache-Aware Algorithm...

LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 20


LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 21

External Cache-Aware Memory Algorithm

  • First MPHF algorithm for very large key sets (in the order of billions of keys)‏

  • This is possible because

    • Deals with external memory efficiently‏

    • Works in linear time

    • Generates compact functions (near space-optimal)

      • MPHF (m = n): 3.3n bits

      • PHF (m =1.23n): 2.7n bits

      • Theoretical lower bound:

        • MPHF:1.44n bits

        • PHF: 0.89n bits


LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 22

Sequential External Perfect Hashing Algorithm‏

n-1

0

1

Key Set S

h0

Partitioning

Buckets

0

1

2b - 1

2

Searching

MPHF0

MPHF1

MPHF2

MPHF2b-1

Hash Table

n-1

0

1

MPHF(x) = MPHFi(x) + offset[i];


LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 23

Key Set Does Not Fit In Internal Memory

n-1

0

Key Set S of β

bytes

...

...

µ bytes of

Internal

memory

µ bytes of

Internal

memory

Partitioning

h0

h0

0

0

1

2b - 1

1

2b - 1

2

2

File N

File 1

Each bucket ≤ 256

N = β/µ

b = Number of bits of each bucket address


Important Design Decisions LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

  • We map long URLs to a fingerprint of fixed size using a hash function

  • Use our linear time and near space-optimal algorithm to generate the MPHF of each bucket

  • How do we obtain a linear time complexity?

    • Using internal radix sorting to form the buckets

    • Using a heap of N entries to drive a N-way merge that reads the buckets from disk in one pass


LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 25

Algorithm Used for the Buckets:

Internal Random Access Memory Algorithm...


Internal Random Access Memory Algorithm LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

  • Near space optimal

  • Evaluation in constant time

  • Function generation in linear time

  • Simple to describe and implement

  • Known algorithms with near-optimal space either:

    • Require exponential time for construction and evaluation, or

    • Use near-optimal space only asymptotically, for large n

  • Acyclic random hypergraphs

    • Used before by Majewski et all (1996): O(n log n) bits

  • We proceed differently: O(n) bits

    (we changed space complexity, close to theoretical lower bound)


Random hypergraphs r graphs
Random Hypergraphs (r-graphs) LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

  • 3-graph:

1

0

2

3

4

5

  • 3-graph is induced by three uniform hash functions


Random hypergraphs r graphs1
Random Hypergraphs (r-graphs) LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

  • 3-graph:

h0(jan) = 1 h1(jan) = 3 h2(jan) = 5

1

0

2

3

4

5

  • 3-graph is induced by three uniform hash functions


Random hypergraphs r graphs2
Random Hypergraphs (r-graphs) LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

  • 3-graph:

h0(jan) = 1 h1(jan) = 3 h2(jan) = 5

1

0

h0(feb) = 1 h1(feb) = 2 h2(feb) = 5

2

3

4

5

  • 3-graph is induced by three uniform hash functions


Random hypergraphs r graphs3
Random Hypergraphs (r-graphs) LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

  • 3-graph:

h0(jan) = 1 h1(jan) = 3 h2(jan) = 5

1

0

h0(feb) = 1 h1(feb) = 2 h2(feb) = 5

2

3

h0(mar) = 0 h1(mar) = 3 h2(mar) = 4

4

5

  • 3-graph is induced by three uniform hash functions

  • Our best result uses 3-graphs


Acyclic 2 graph
Acyclic 2-graph LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

Gr:

L:Ø

h0

1

2

0

3

jan

mar

apr

feb

h1

4

5

6

7


Acyclic 2 graph1
Acyclic 2-graph LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

Gr:

L:

{0,5}

h0

1

2

0

3

jan

apr

feb

h1

4

5

6

7


Acyclic 2 graph2
Acyclic 2-graph LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

0

1

Gr:

L:

{0,5}

{2,6}

h0

1

2

0

3

jan

apr

h1

4

5

6

7


Acyclic 2 graph3
Acyclic 2-graph LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

0

1

2

Gr:

L:

{0,5}

{2,6}

{2,7}

h0

1

2

0

3

jan

h1

4

5

6

7


Acyclic 2 graph4
Acyclic 2-graph LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

0

1

2

3

Gr:

L:

{0,5}

{2,6}

{2,7}

{2,5}

h0

1

2

0

3

Gr is acyclic

h1

4

5

6

7


Internal Random Access Memory Algorithm (r=2) LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

S

jan

feb

mar

apr


Internal Random Access Memory Algorithm (r=2) LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

S

Gr:

jan

feb

mar

apr

h0

1

2

0

3

jan

Mapping

mar

apr

feb

h1

4

5

6

7


Internal Random Access Memory Algorithm (r=2) LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

g

S

Gr:

0

r

jan

feb

mar

apr

r

1

h0

1

2

0

3

r

2

L

3

r

jan

Mapping

Assigning

mar

apr

r

feb

4

5

r

h1

4

5

6

7

6

r

7

r

0

1

2

3

L:

{0,5}

{2,6}

{2,7}

{2,5}


Internal Random Access Memory Algorithm (r=2) LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

g

S

Gr:

0

r

jan

feb

mar

apr

r

1

h0

1

2

0

3

0

2

L

3

r

jan

Mapping

Assigning

mar

apr

r

feb

4

5

r

h1

4

5

6

7

6

r

7

r

0

1

2

3

L:

{0,5}

{2,6}

{2,7}

{2,5}


Internal Random Access Memory Algorithm (r=2) LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

g

S

Gr:

0

r

jan

feb

mar

apr

r

1

h0

1

2

0

3

0

2

L

3

r

jan

Mapping

Assigning

mar

apr

r

feb

4

5

r

h1

4

5

6

7

6

r

7

1

0

1

2

3

L:

{0,5}

{2,6}

{2,7}

{2,5}


Internal Random Access Memory Algorithm (r=2) LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

g

S

assigned

Gr:

assigned

0

0

jan

feb

mar

apr

r

1

h0

1

2

0

3

0

2

L

3

r

jan

Mapping

Assigning

mar

apr

r

feb

4

5

r

h1

4

5

6

7

6

1

7

1

assigned

assigned


Internal Random Access Memory Algorithm (r=2) LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

g

S

assigned

Gr:

assigned

0

0

jan

feb

mar

apr

r

1

h0

1

2

0

3

0

2

L

3

r

jan

Mapping

Assigning

mar

apr

r

feb

4

5

r

h1

4

5

6

7

6

1

7

1

assigned

assigned

i = (g[h0(feb)] + g[h1(feb)]) mod r =(g[2] + g[6]) mod 2 = 1


Internal Random Access Memory Algorithm: PHF LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

Hash Table

g

S

assigned

Gr:

assigned

0

0

mar

jan

feb

mar

apr

r

-

1

h0

1

2

0

3

jan

0

2

L

3

-

r

jan

Mapping

Assigning

mar

apr

-

r

feb

4

5

-

r

h1

4

5

6

7

6

feb

1

7

apr

1

assigned

assigned

i = (g[h0(feb)] + g[h1(feb)]) mod r =(g[2] + g[6]) mod 2 = 1

phf(feb) = hi=1 (feb) = 6


Internal Random Access Memory Algorithm: MPHF LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

g

S

assigned

Gr:

assigned

0

0

jan

feb

mar

apr

Hash Table

r

1

h0

1

2

0

3

0

0

mar

2

L

3

jan

r

1

jan

Mapping

Assigning

Ranking

mar

apr

2

feb

r

feb

4

5

apr

r

3

h1

4

5

6

7

6

1

7

1

assigned

assigned

i = (g[h0(feb)] + g[h1(feb)]) mod r =(g[2] + g[6]) mod 2 = 1

phf(feb) = hi=1 (feb) = 6

mphf(feb) = rank(phf(feb)) = rank(6) = 2


Space to Represent the Function LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

g

S

Gr:

0

0

jan

feb

mar

apr

2 bits

for each

entry

r

1

h0

1

2

0

3

0

2

L

3

r

jan

Mapping

Assigning

mar

apr

r

feb

4

5

r

h1

4

5

6

7

6

1

7

1


Space to Represent the Functions (r = 3) LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

  • PHF g: [0,m-1] → {0,1,2}

    • m = cn bits, c = 1.23 → 2.46 n bits

    • (log 3) cn bits, c = 1.23 → 1.95 n bits (arith. coding)

    • Optimal: 0.89n bits

  • MPHF g: [0,m-1] → {0,1,2,3}(ranking info required)

    • 2m + εm = (2+ ε)cn bits

    • For c = 1.23 and ε = 0.125 → 2.62 n bits

    • Optimal: 1.44n bits.


Use of Acyclic Random Hypergraphs LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

  • Sufficient condition to work

  • Repeatedly selects h0, h1..., hr-1

  • For r = 3, m = 1.23n: Pra tends to 1

  • Number of iterations is 1/Pra = 1


Experimental results
Experimental Results LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

  • Metrics:

    • Generation time

    • Storage space for the description

    • Evaluation time

  • Collection:

    • 1.024 billions of URLs collected from the web

    • 64 bytes long on average

  • Experiments

    • Commodity PC with a cache of 4 Mbytes

    • 1.86 GHz, 1 GB, Linux, 64 bits architecture


Generation time of mphfs in minutes
Generation Time of MPHFs (in Minutes) LAboratory for Treating INformation (www.dcc.ufmg.br/latin)


Related algorithms
Related Algorithms LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

  • Fox, Chen and Heath (1992) – FCH

  • Majewski, Wormald, Havas and Czech (1996) – MWHC

    All algorithms coded in the same framework


Generation time
Generation Time LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

3,541,615 URLs


Generation time and storage space
Generation Time and Storage Space LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

3,541,615 URLs


Generation time storage space and evaluation time
Generation Time, Storage Space and Evaluation Time LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

3,541,615 URLs

Key length = 64 bytes


The Distributed External Memory Based Algorithm ... LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 54


LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 55

Manager

Worker

p-1

Worker

0

Worker

1

. . .

Distributed Construction of the MPHFs


Distributed Construction of MPHFs LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

  • Manager :

    • Assign tasks to workers

    • Determine global values during execution

    • Dump resulting MPHFs to disk

  • Worker :

    • Has a partition of the keys on disk

    • Creates its buckets from the keys (Bpw = Nb/p)

      • Migrate data whenever necessary

    • Constructs a MPHF for each bucket

LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 56


LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 57

Sequential External Perfect Hashing Algorithm‏

Worker 0

Worker 1

Worker p-1

n-1

0

1

2

3

4

5

Key Set S

h0

Partitioning

Worker p-1

Worker 0

Buckets

0

1

2b - 1

2

Searching

MPHF0

MPHF1

MPHF2

MPHF2b-1

Hash Table

n-1

0

1

MPHF(x) = MPHFi(x) + offset[i];


LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 58

Partitioning Step in Each Worker

Disk

reader

Keys queue

Disk


LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 59

Partitioning Step in Each Worker

Disk

reader

Keys queue

Fingerprint queue

Hashing

Disk

Fingerprint queue


LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 60

Partitioning Step in Each Worker

Disk

reader

Keys queue

Fingerprint queue

Hashing

Disk

Fingerprint queue

Sender

Local network


LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 61

Partitioning Step in Each Worker

Disk

reader

Net

reader

Keys queue

Fingerprint queue

Fingerprint queue

Hashing

Local network

Disk

Fingerprint queue

Sender

Local network


LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 62

Partitioning Step in Each Worker

Disk

reader

Net

reader

Keys queue

Fingerprint queue

Fingerprint queue

Block

sorter

Hashing

Local network

Disk

Fingerprint queue

Block queue

Sender

Local network


LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 63

Partitioning Step in Each Worker

Disk

reader

Net

reader

Keys queue

Fingerprint queue

Fingerprint queue

Block

sorter

Hashing

Local network

Disk

Fingerprint queue

Block queue

Sender

Block

dumper

Disk

Local network


Sequential External Perfect Hashing Algorithm LAboratory for Treating INformation (www.dcc.ufmg.br/latin) ‏

Worker 0

Worker 1

Worker p-1

n-1

0

1

2

3

4

5

Key Set S

h0

Partitioning

Worker p-1

Worker 0

Buckets

0

1

2b - 1

2

Searching

Worker p-1

Worker 0

Worker 1

MPHF0

MPHF1

MPHF2

MPHF2b-1

Hash Table

n-1

0

1

MPHF(x) = MPHFi(x) + offset[i];

LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 64


LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 65

Searching Step in Each Worker

Disk

Bucket

reader

Buckets queue


LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 66

Searching Step in Each Worker

Disk

Bucket

reader

Buckets queue

...

MPHF

gen 0

MPHF

gen t-1


Description and evaluation of a mphf
Description and Evaluation of a MPHF LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

Centralized in one machine

Distributed among the participating machines

LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 67


LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 68

Centralized Description and Evaluation of MPHFs

  • End of the partitioning step:

    • Worker sends size of each bucket to manager

    • Manager calculates the offset array

  • End of searching step (construction of MPHFs):

    • Worker sends MPHFs of its buckets to manager

    • Manager writes sequentially final MPHF to disk

  • MPHF(x) = MPHFi(x) + offset[i]


LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 69

Distributed Description and Evaluation of MPHFs

  • Description of MPHFs of a bucket

    • Stays in the bucket

  • Evaluation of a MPHF

    • Locate the key inside the bucket:

    • MPHFpartial(k) = MPHFi(k) + localoffset[i]

    • Add this to the number of keys before worker w:

    • MPHF(k) = MPHFpartial(k) + globaloffset[w]

    • A key stream is evaluated in parallel


LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 70

Advantages of Distributed Evaluation of MPHFs

  • No need to send the MPHFs of a bucket to manager

  • They are written to disk in parallel by the workers

  • Final function is stored in a distributed way

    • Size of the description of the MPHF grows linearly with the size of the input key


Communication Overhead LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

  • On average, the number of keys τsent through the net during the execution is:

LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 71


Experimental Setup LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

  • Three collections

  • Cluster with 14 equal 64 bits single core machines

    • 2 gigabytes of main memory

    • 2.13 gigahertz

    • Linux operating system version 2.6

LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 72


LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 73

Speedup

  • 64 bytes URLs


LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 74

Speedup

  • 16 bytes random integers


LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 75

Speedup

  • 8 bytes random integers


LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 76

Scale-up

  • 64 bytes URLs


LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 77

Scale-up

  • 16 bytes random integers


LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 78

Scale-up

  • 8 bytes random integers


Scale-up LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

  • 14.336 billion random integers

  • 14 machines (each with 1.024 billion keys)

LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 79


Load Balancing LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

  • Execution time: fastest minus slowest (in minutes)

LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 80


Distributed Evaluation LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

  • 1.024 billion key stream taken at random (minutes)

LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 81


C m inimal p erfect h ashing l ibrary
C LAboratory for Treating INformation (www.dcc.ufmg.br/latin) Minimal Perfect Hashing Library

  • Why to build a library?

    • Lack of similar libraries in the free software community

    • Test the applicability of our algorithm out there

  • Feedbacks:

    • 2,243 downloads (until May 27th, 2008)

    • Incorporated by Debian

  • Library address: http: //cmph.sourceforge.net


Conclusions LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

  • Sequential and parallel perfect hashing algorithm

  • Near space-optimal functions in linear time

  • Function evaluation in time O(1)

  • The algorithms are simpler and has much lower constant factors than existing theoretical results

  • Outperforms the main practical general purpose algorithms found in the literature

LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 83


Conclusions LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

  • Construction time: 14 machines, 1 billion URLs

    • Sequential algorithm: 50 minutes

    • Parallel algorithm: 4 minutes

  • Speedup > 90% for keys with more than 16 bytes

  • Description and evaluation of MPHF:

    • Centralized

    • Distributed: fast evaluation for key streams

LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 84


? LAboratory for Treating INformation (www.dcc.ufmg.br/latin)


Uniform hashing versus universal hashing
Uniform Hashing Versus Universal Hashing LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

Key universe

U of size u

Hash function

Range M of size m


Uniform hashing versus universal hashing1
Uniform Hashing Versus Universal Hashing LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

Uniform hashing

  • # of functions from U to M?

  • # of bits to encode each fucntion

  • Independent functions with values uniformly distributed

Key universe

U of size u

Hash function

Range M of size m


Uniform hashing versus universal hashing2
Uniform Hashing Versus Universal Hashing LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

Uniform hashing

  • # of functions from U to M?

  • # of bits to encode each fucntion

  • Independent functions with values uniformly distributed

Key universe

U of size u

Hash function

Range M of size m

Universal hashing

  • A family of hash functions H is universal if:

    • for any pair of distinct keys (x1, x2) from U and

    • a hash function h chosen uniformly from H then:


Intuition behind universal hashing
Intuition Behind Universal Hashing LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

  • We often lose relatively little compared to using a completely random map (uniform hashing)

  • If S of size n is hashed to n2 buckets, with probability more than ½, no collisions occur

    • Even with complete randomness, we do not expect little o(n2) buckets to suffice (the birthday paradox)

    • So nothing is lost by using a universal family instead!


ad