CS 277: Database System Implementation. Notes 5: Hashing and More. Arthur Keller. Hashing. key h(key). <key>. Buckets (typically 1 disk block). Two alternatives. records. (1) key h(key). Two alternatives. record. (2) key h(key). key 1. Index.
Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.
Notes 5
This may not be best function …
Read Knuth Vol. 3 if you reallyneed to select a good function.
Good hash Expected number of
function:keys/bucket is the
same for all buckets
Notes 5
& Inserts/Deletes not too frequent
Notes 5
a
e
c
b
EXAMPLE 2 records/bucketINSERT:
h(a) = 1
h(b) = 2
h(c) = 1
h(d) = 0
0
1
2
3
h(e) = 1
Notes 5
between 50% and 80%
Utilization = # keys used
total # keys that fit
Notes 5
(a) Use i of b bits output by hash function
b
h(K)
use i grows over time….
00110101
Notes 5
i =
2
00
01
10
11
1
1
2
1010
New directory
2
1100
Example: h(k) is 4 bits; 2 keys/bucket1
0001
i =
1
1001
1100
Insert 1010
Notes 5
0000
0001
2
0111
2
2
Example continued
i =
2
00
01
10
11
1
0001
0111
1001
1010
Insert:
0111
0000
1100
Notes 5
i =
3
000
001
010
011
100
101
110
111
3
1001
1001
2
1001
1010
3
1010
1100
2
Example continued
0000
2
0001
i =
2
00
01
10
11
0111
2
Insert:
1001
Notes 5
(Reverse insert procedure)
Notes 5
(Not bad if directory in memory)
Directory doubles in size
(Now it fits, now it does not)


Summary
Extensible hashing+
Can handle growing files
 with less wasted space
 with no full reorganizations
Notes 5
b
(a) Use ilow order bits of hash
01110101
grows
i
(b) File grows linearly
Linear hashingNotes 5
If h(k)[i ] m, then
look at bucket h(k)[i ]
else, look at bucket h(k)[i ] 2i 1
Rule
Exampleb=4 bits, i =2, 2 keys/bucket00 01 1011
Future
growth
buckets
0000
0101
1010
1111
m = 01 (max used block)
Notes 5
1010
1111
0101
10
11
Exampleb=4 bits, i =2, 2 keys/bucket00 01 1011
Future
growth
buckets
0000
0101
1010
1111
m = 01 (max used block)
Notes 5
0101
0101
101
100
0
0
0
0
100 101 110 111
100
101
Example Continued:How to grow beyond this?i = 2
00 01 1011
0000
0101
1010
1111
0101
. . .
m = 11 (max used block)
Notes 5
(and maybe i )
total # of slots
= U
Notes 5
Can still have overflow chains

Summary
Linear Hashing+
Can handle growing files
 with less wasted space
 with no full reorganizations
No indirection like extensible hashing
+
Notes 5
Range Searches:
e.g., SELECT
FROM R
WHERE R.A > 5
Notes 5
defines candidate key
Notes 5
CANNOT SPECIFY TYPE OF INDEX
(e.g. Btree, Hashing, …)
OR PARAMETERS
(e.g. Load Factor, Size of Hash,...)
... at least in SQL...
Notes 5
Sales
Toy
Example10k
15k
Example
Record
Dept
Index
Salary
Index
17k
21k
Name=Joe
DEPT=Sales
SAL=15k
12k
15k
15k
19k
Notes 5
Find RECs Dept = “Sales” SAL=20k
Find RECs Dept = “Sales” SAL > 20k
Find RECs Dept = “Sales”
Find RECs SAL = 20k
Notes 5
DATA:
<X1,Y1, Attributes>
<X2,Y2, Attributes>
y
x
. . .
Notes 5
Notes 5
30
10 20
20
m
g
b
n
h
o
d
e
a
k
c
f
j
l
i
10
25
15 35
20
10 20
5
d e
h i
g
f
c
a b
15
15
n o
j k
m
l
Example
Notes 5
Notes 5
Notes 5
Notes 5
V1V2V3 suggested
X1
X1
X1
Like
Array...
X2
X2
X2
X3
X3
X3
X4
X4
X4
But there is a catch with Grid Indexes!
Problem:
Notes 5
X1 X2 X3
Buckets
V1
V2
V3 *Grid only
V4 contains
pointers to
buckets
Buckets















Notes 5
Notes 5
SalaryGrid
020K
1
20K50K
2
50K
3
8
Linear Scale
1
2
3
Toy
Sales
Personnel
Notes 5
Good for multiplekey search
Space, management overhead(nothing is free)
Need partitioning ranges that evenlysplit keys
+


Notes 5
<Fred> suggested
<Joe><Sally>
EX:h1(toy)=0000
h1(sales)=1001
h1(art)=1010
.011
.
h2(10k)=01100
h2(20k)=11101
h2(30k)=01110
h2(40k)=00111
.
.
<Fred,toy,10k>,<Joe,sales,10k>
<Sally,art,30k>
Insert
Notes 5
h1(toy)=0000 suggested
h1(sales)=1001
h1(art)=1010
.011
.
h2(10k)=01100
h2(20k)=11101
h2(30k)=01110
h2(40k)=00111
.
.
<Fred>
<Joe><Jan>
<Mary>
<Sally>
<Tom><Bill>
<Andy>
Notes 5
look here suggested
h1(toy)=0000
h1(sales)=1001
h1(art)=1010
.011
.
h2(10k)=01100
h2(20k)=11101
h2(30k)=01110
h2(40k)=00111
.
.
<Fred>
<Joe><Jan>
<Mary>
<Sally>
<Tom><Bill>
<Andy>
Notes 5
look here suggested
h1(toy)=0000
h1(sales)=1001
h1(art)=1010
.011
.
h2(10k)=01100
h2(20k)=11101
h2(30k)=01110
h2(40k)=00111
.
.
<Fred>
<Joe><Jan>
<Mary>
<Sally>
<Tom><Bill>
<Andy>
Notes 5
Summary suggested
Post hashing discussion:
 Indexing vs. Hashing
 SQL Index Definition
 Multiple Key Access
 Multi Key Index
Variations: Grid, Geo Data
 Partitioned Hash
Notes 5
Notes 5
 B trees
 Hashing
 Multi key
NEXT
Notes 5