A Highly Scalable Perfect Hashing Algorithm. Nivio Ziviani Fabiano C. Botelho Department of Computer Science Federal University of Minas Gerais, Brazil. 3rd Intl. Conference on Scalable Information Systems Naples, Italy, June 46, 2008. Where is Belo Horizonte?.
Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 1
Nivio Ziviani
Fabiano C. Botelho
Department of Computer Science
Federal University of Minas Gerais, Brazil
3rd Intl. Conference on Scalable Information Systems
Naples, Italy, June 46, 2008
Present a perfect hashing algorithm:
Sequential construction of the function
Distributed construction of the function
Description and evaluation of the function:
Centralized in one machine
Distributed among the participating machines
Algorithm is highly scalable, time
efficient and near spaceoptimal
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 4
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 5
Static key set S of size n
...
0
1
n 1
Hash Table
Perfect Hash Function
...
m 1
0
1
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 6
Static key set S of size n
...
0
1
n 1
Minimal Perfect Hash Function
Hash Table
...
n 1
0
1
A perfect hashing algorithm that uses the idea of partitioning the input key set into small buckets:
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 8
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 9
Vocabulary
Inverted List
Collection of
documents
Term 1
Doc 1
Doc 5
...
Doc 1
Doc 2
Doc 3
Doc 4
Doc 5
...
Doc n
Term 2
Doc 1
Doc 2
...
Term 3
Doc 3
Doc 4
...
Term 4
Doc 7
Doc 9
...
Term 5
Doc 6
Doc 10
...
Term 6
Doc 1
Doc 5
...
Term 7
Term 8
...
Term t
Doc 9
Doc 11
...
Indexing
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 10
0
Web Graph Vertices
1
URL 1
URLS
2
URL 2
URL3
3
URL4
URL5
4
URL6
URL7
5
...
6
URLn
...
n1
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 11
0
Web Graph Vertices
M
P
H
F
1
URL 1
URLS
2
URL 2
URL3
3
URL4
URL5
4
URL6
URL7
5
...
6
URLn
...
n1
m < 3n
(use uniform hashing)
(use universal hashing  assume uniform hashing for free)
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin)14
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin)15
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin)16
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin)17
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin)18
The Sequential
External CacheAware Algorithm...
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 20
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 21
External CacheAware Memory Algorithm
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 22
Sequential External Perfect Hashing Algorithm
n1
0
1
Key Set S
…
h0
Partitioning
Buckets
…
0
1
2b  1
2
Searching
MPHF0
MPHF1
MPHF2
MPHF2b1
Hash Table
…
n1
0
1
MPHF(x) = MPHFi(x) + offset[i];
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 23
Key Set Does Not Fit In Internal Memory
n1
0
Key Set S of β
bytes
...
…
...
µ bytes of
Internal
memory
µ bytes of
Internal
memory
Partitioning
h0
h0
…
…
…
0
0
1
2b  1
1
2b  1
2
2
File N
File 1
Each bucket ≤ 256
N = β/µ
b = Number of bits of each bucket address
Important Design Decisions
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 25
Algorithm Used for the Buckets:
Internal Random Access Memory Algorithm...
Internal Random Access Memory Algorithm
(we changed space complexity, close to theoretical lower bound)
1
0
2
3
4
5
h0(jan) = 1 h1(jan) = 3 h2(jan) = 5
1
0
2
3
4
5
h0(jan) = 1 h1(jan) = 3 h2(jan) = 5
1
0
h0(feb) = 1 h1(feb) = 2 h2(feb) = 5
2
3
4
5
h0(jan) = 1 h1(jan) = 3 h2(jan) = 5
1
0
h0(feb) = 1 h1(feb) = 2 h2(feb) = 5
2
3
h0(mar) = 0 h1(mar) = 3 h2(mar) = 4
4
5
Gr:
L:Ø
h0
1
2
0
3
jan
mar
apr
feb
h1
4
5
6
7
Gr:
L:
{0,5}
h0
1
2
0
3
jan
apr
feb
h1
4
5
6
7
0
1
Gr:
L:
{0,5}
{2,6}
h0
1
2
0
3
jan
apr
h1
4
5
6
7
0
1
2
Gr:
L:
{0,5}
{2,6}
{2,7}
h0
1
2
0
3
jan
h1
4
5
6
7
0
1
2
3
Gr:
L:
{0,5}
{2,6}
{2,7}
{2,5}
h0
1
2
0
3
Gr is acyclic
h1
4
5
6
7
Internal Random Access Memory Algorithm (r=2)
S
jan
feb
mar
apr
Internal Random Access Memory Algorithm (r=2)
S
Gr:
jan
feb
mar
apr
h0
1
2
0
3
jan
Mapping
mar
apr
feb
h1
4
5
6
7
Internal Random Access Memory Algorithm (r=2)
g
S
Gr:
0
r
jan
feb
mar
apr
r
1
h0
1
2
0
3
r
2
L
3
r
jan
Mapping
Assigning
mar
apr
r
feb
4
5
r
h1
4
5
6
7
6
r
7
r
0
1
2
3
L:
{0,5}
{2,6}
{2,7}
{2,5}
Internal Random Access Memory Algorithm (r=2)
g
S
Gr:
0
r
jan
feb
mar
apr
r
1
h0
1
2
0
3
0
2
L
3
r
jan
Mapping
Assigning
mar
apr
r
feb
4
5
r
h1
4
5
6
7
6
r
7
r
0
1
2
3
L:
{0,5}
{2,6}
{2,7}
{2,5}
Internal Random Access Memory Algorithm (r=2)
g
S
Gr:
0
r
jan
feb
mar
apr
r
1
h0
1
2
0
3
0
2
L
3
r
jan
Mapping
Assigning
mar
apr
r
feb
4
5
r
h1
4
5
6
7
6
r
7
1
0
1
2
3
L:
{0,5}
{2,6}
{2,7}
{2,5}
Internal Random Access Memory Algorithm (r=2)
g
S
assigned
Gr:
assigned
0
0
jan
feb
mar
apr
r
1
h0
1
2
0
3
0
2
L
3
r
jan
Mapping
Assigning
mar
apr
r
feb
4
5
r
h1
4
5
6
7
6
1
7
1
assigned
assigned
Internal Random Access Memory Algorithm (r=2)
g
S
assigned
Gr:
assigned
0
0
jan
feb
mar
apr
r
1
h0
1
2
0
3
0
2
L
3
r
jan
Mapping
Assigning
mar
apr
r
feb
4
5
r
h1
4
5
6
7
6
1
7
1
assigned
assigned
i = (g[h0(feb)] + g[h1(feb)]) mod r =(g[2] + g[6]) mod 2 = 1
Internal Random Access Memory Algorithm: PHF
Hash Table
g
S
assigned
Gr:
assigned
0
0
mar
jan
feb
mar
apr
r

1
h0
1
2
0
3
jan
0
2
L
3

r
jan
Mapping
Assigning
mar
apr

r
feb
4
5

r
h1
4
5
6
7
6
feb
1
7
apr
1
assigned
assigned
i = (g[h0(feb)] + g[h1(feb)]) mod r =(g[2] + g[6]) mod 2 = 1
phf(feb) = hi=1 (feb) = 6
Internal Random Access Memory Algorithm: MPHF
g
S
assigned
Gr:
assigned
0
0
jan
feb
mar
apr
Hash Table
r
1
h0
1
2
0
3
0
0
mar
2
L
3
jan
r
1
jan
Mapping
Assigning
Ranking
mar
apr
2
feb
r
feb
4
5
apr
r
3
h1
4
5
6
7
6
1
7
1
assigned
assigned
i = (g[h0(feb)] + g[h1(feb)]) mod r =(g[2] + g[6]) mod 2 = 1
phf(feb) = hi=1 (feb) = 6
mphf(feb) = rank(phf(feb)) = rank(6) = 2
Space to Represent the Function
g
S
Gr:
0
0
jan
feb
mar
apr
2 bits
for each
entry
r
1
h0
1
2
0
3
0
2
L
3
r
jan
Mapping
Assigning
mar
apr
r
feb
4
5
r
h1
4
5
6
7
6
1
7
1
Space to Represent the Functions (r = 3)
Use of Acyclic Random Hypergraphs
All algorithms coded in the same framework
3,541,615 URLs
3,541,615 URLs
3,541,615 URLs
Key length = 64 bytes
The Distributed External Memory Based Algorithm ...
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 54
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 55
Manager
Worker
p1
Worker
0
Worker
1
. . .
Distributed Construction of the MPHFs
Distributed Construction of MPHFs
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 56
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 57
Sequential External Perfect Hashing Algorithm
Worker 0
Worker 1
Worker p1
n1
0
1
2
3
4
5
Key Set S
…
h0
Partitioning
Worker p1
Worker 0
Buckets
…
0
1
2b  1
2
Searching
MPHF0
MPHF1
MPHF2
MPHF2b1
Hash Table
…
n1
0
1
MPHF(x) = MPHFi(x) + offset[i];
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 58
Partitioning Step in Each Worker
Disk
reader
Keys queue
Disk
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 59
Partitioning Step in Each Worker
Disk
reader
Keys queue
Fingerprint queue
Hashing
Disk
Fingerprint queue
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 60
Partitioning Step in Each Worker
Disk
reader
Keys queue
Fingerprint queue
Hashing
Disk
Fingerprint queue
Sender
Local network
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 61
Partitioning Step in Each Worker
Disk
reader
Net
reader
Keys queue
Fingerprint queue
Fingerprint queue
Hashing
Local network
Disk
Fingerprint queue
Sender
Local network
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 62
Partitioning Step in Each Worker
Disk
reader
Net
reader
Keys queue
Fingerprint queue
Fingerprint queue
Block
sorter
Hashing
Local network
Disk
Fingerprint queue
Block queue
Sender
Local network
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 63
Partitioning Step in Each Worker
Disk
reader
Net
reader
Keys queue
Fingerprint queue
Fingerprint queue
Block
sorter
Hashing
Local network
Disk
Fingerprint queue
Block queue
Sender
Block
dumper
Disk
Local network
Sequential External Perfect Hashing Algorithm
Worker 0
Worker 1
Worker p1
n1
0
1
2
3
4
5
Key Set S
…
h0
Partitioning
Worker p1
Worker 0
Buckets
…
0
1
2b  1
2
Searching
Worker p1
Worker 0
Worker 1
MPHF0
MPHF1
MPHF2
MPHF2b1
Hash Table
…
n1
0
1
MPHF(x) = MPHFi(x) + offset[i];
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 64
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 65
Searching Step in Each Worker
Disk
Bucket
reader
Buckets queue
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 66
Searching Step in Each Worker
Disk
Bucket
reader
Buckets queue
...
MPHF
gen 0
MPHF
gen t1
Centralized in one machine
Distributed among the participating machines
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 67
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 68
Centralized Description and Evaluation of MPHFs
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 69
Distributed Description and Evaluation of MPHFs
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 70
Advantages of Distributed Evaluation of MPHFs
Communication Overhead
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 71
Experimental Setup
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 72
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 73
Speedup
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 74
Speedup
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 75
Speedup
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 76
Scaleup
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 77
Scaleup
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 78
Scaleup
Scaleup
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 79
Load Balancing
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 80
Distributed Evaluation
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 81
Conclusions
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 83
Conclusions
LATIN  LAboratory for Treating INformation (www.dcc.ufmg.br/latin) 84
?
Key universe
U of size u
Hash function
Range M of size m
Uniform hashing
Key universe
U of size u
Hash function
Range M of size m
Uniform hashing
Key universe
U of size u
Hash function
Range M of size m
Universal hashing