Cs 277 database system implementation
This presentation is the property of its rightful owner.
Sponsored Links
1 / 61

CS 277: Database System Implementation PowerPoint PPT Presentation


  • 51 Views
  • Uploaded on
  • Presentation posted in: General

CS 277: Database System Implementation. Notes 5: Hashing and More. Arthur Keller. Hashing. key  h(key). <key>. Buckets (typically 1 disk block). Two alternatives. records. (1) key  h(key). Two alternatives. record. (2) key  h(key). key 1. Index.

Download Presentation

CS 277: Database System Implementation

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Cs 277 database system implementation

CS 277: Database System Implementation

Notes 5: Hashing and More

Arthur Keller

Notes 5


Cs 277 database system implementation

Hashing

key  h(key)

<key>

Buckets

(typically 1

disk block)

.

.

.

Notes 5


Cs 277 database system implementation

Two alternatives

.

.

.

records

(1) key  h(key)

.

.

.

Notes 5


Cs 277 database system implementation

Two alternatives

record

(2) key  h(key)

key 1

Index

  • Alt (2) for “secondary” search key

Notes 5


Example hash function

Example hash function

  • Key = ‘x1 x2 … xn’ n byte character string

  • Have b buckets

  • h: add x1 + x2 + ….. xn

    • compute sum modulo b

Notes 5


Cs 277 database system implementation

 This may not be best function …

 Read Knuth Vol. 3 if you reallyneed to select a good function.

Good hash Expected number of

function:keys/bucket is the

same for all buckets

Notes 5


Within a bucket

Within a bucket:

  • Do we keep keys sorted?

  • Yes, if CPU time critical

    & Inserts/Deletes not too frequent

Notes 5


Next example to illustrate inserts overflows deletes

Next: example to illustrateinserts, overflows, deletes

h(K)

Notes 5


Example 2 records bucket

d

a

e

c

b

EXAMPLE 2 records/bucket

INSERT:

h(a) = 1

h(b) = 2

h(c) = 1

h(d) = 0

0

1

2

3

h(e) = 1

Notes 5


Cs 277 database system implementation

d

maybe move

“g” up

EXAMPLE: deletion

Delete:ef

0

1

2

3

a

b

d

c

c

e

f

g

Notes 5


Rule of thumb

  • If < 50%, wasting space

  • If > 80%, overflows significantdepends on how good hashfunction is & on # keys/bucket

Rule of thumb:

  • Try to keep space utilization

    between 50% and 80%

    Utilization = # keys used

    total # keys that fit

Notes 5


How do we cope with growth

  • Extensible

  • Linear

How do we cope with growth?

  • Overflows and reorganizations

  • Dynamic hashing

Notes 5


Extensible hashing two ideas

Extensible hashing: two ideas

(a) Use i of b bits output by hash function

b

h(K) 

use i grows over time….

00110101

Notes 5


Cs 277 database system implementation

(b) Use directory

h(K)[i ] to bucket

.

.

.

.

.

.

Notes 5


Example h k is 4 bits 2 keys bucket

i =

2

00

01

10

11

1

1

2

1010

New directory

2

1100

Example: h(k) is 4 bits; 2 keys/bucket

1

0001

i =

1

1001

1100

Insert 1010

Notes 5


Cs 277 database system implementation

2

0000

0001

2

0111

2

2

Example continued

i =

2

00

01

10

11

1

0001

0111

1001

1010

Insert:

0111

0000

1100

Notes 5


Cs 277 database system implementation

i =

3

000

001

010

011

100

101

110

111

3

1001

1001

2

1001

1010

3

1010

1100

2

Example continued

0000

2

0001

i =

2

00

01

10

11

0111

2

Insert:

1001

Notes 5


Extensible hashing deletion

Extensible hashing: deletion

  • No merging of blocks

  • Merge blocks and cut directory if possible

    (Reverse insert procedure)

Notes 5


Deletion example

Deletion example:

  • Run thru insert example in reverse!

Notes 5


Extensible hashing

Indirection

(Not bad if directory in memory)

Directory doubles in size

(Now it fits, now it does not)

-

-

Summary

Extensible hashing

+

Can handle growing files

- with less wasted space

- with no full reorganizations

Notes 5


Linear hashing

Two ideas:

b

(a) Use ilow order bits of hash

01110101

grows

i

(b) File grows linearly

Linear hashing

  • Another dynamic hashing scheme

Notes 5


Example b 4 bits i 2 2 keys bucket

0101

  • can have overflow chains!

If h(k)[i ]  m, then

look at bucket h(k)[i ]

else, look at bucket h(k)[i ]- 2i -1

Rule

Exampleb=4 bits, i =2, 2 keys/bucket

  • insert 0101

00 01 1011

Future

growth

buckets

0000

0101

1010

1111

m = 01 (max used block)

Notes 5


Example b 4 bits i 2 2 keys bucket1

0101

  • insert 0101

1010

1111

0101

10

11

Exampleb=4 bits, i =2, 2 keys/bucket

00 01 1011

Future

growth

buckets

0000

0101

1010

1111

m = 01 (max used block)

Notes 5


Example continued how to grow beyond this

3

0101

0101

101

100

0

0

0

0

100 101 110 111

100

101

Example Continued:How to grow beyond this?

i = 2

00 01 1011

0000

0101

1010

1111

0101

. . .

m = 11 (max used block)

Notes 5


When do we expand file

 When do we expand file?

  • If U > threshold then increase m

    (and maybe i )

  • Keep track of: # used slots

    total # of slots

= U

Notes 5


Linear hashing1

Can still have overflow chains

-

Summary

Linear Hashing

+

Can handle growing files

- with less wasted space

- with no full reorganizations

No indirection like extensible hashing

+

Notes 5


Example bad case

Example: BAD CASE

Very full

Very emptyNeed to move

m here…

Would waste

space...

Notes 5


Cs 277 database system implementation

Summary

Hashing

- How it works

- Dynamic hashing

- Extensible

- Linear

Notes 5


Cs 277 database system implementation

Next:

  • Indexing vs Hashing

  • Index definition in SQL

  • Multiple key access

Notes 5


Cs 277 database system implementation

Indexing vs Hashing

  • Hashing good for probes given key

    e.g., SELECT …

    FROM R

    WHERE R.A = 5

Notes 5


Cs 277 database system implementation

Indexing vs Hashing

  • INDEXING (Including B Trees) good for

    Range Searches:

    e.g., SELECT

    FROM R

    WHERE R.A > 5

Notes 5


Index definition in sql

Index definition in SQL

  • Createindex name on rel (attr)

  • Createuniqueindex name on rel (attr)

defines candidate key

  • Drop INDEX name

Notes 5


Cs 277 database system implementation

Note

CANNOT SPECIFY TYPE OF INDEX

(e.g. B-tree, Hashing, …)

OR PARAMETERS

(e.g. Load Factor, Size of Hash,...)

... at least in SQL...

Notes 5


Cs 277 database system implementation

Note

ATTRIBUTE LIST MULTIKEY INDEX

(next)

e.g., CREATEINDEX foo ON R(A,B,C)

Notes 5


Cs 277 database system implementation

Multi-key Index

Motivation: Find records where

DEPT = “Toy” AND SAL > 50k

Notes 5


Strategy i

Strategy I:

  • Use one index, say Dept.

  • Get all Dept = “Toy” records and check their salary

I1

Notes 5


Strategy ii

Strategy II:

  • Use 2 Indexes; Manipulate Pointers

    ToySal

    > 50k

Notes 5


Strategy iii

Strategy III:

  • Multiple Key Index

    One idea:

I2

I3

I1

Notes 5


Example

Art

Sales

Toy

Example

10k

15k

Example

Record

Dept

Index

Salary

Index

17k

21k

Name=Joe

DEPT=Sales

SAL=15k

12k

15k

15k

19k

Notes 5


For which queries is this index good

For which queries is this index good?

Find RECs Dept = “Sales” SAL=20k

Find RECs Dept = “Sales” SAL > 20k

Find RECs Dept = “Sales”

Find RECs SAL = 20k

Notes 5


Interesting application

Interesting application:

  • Geographic Data

    DATA:

    <X1,Y1, Attributes>

    <X2,Y2, Attributes>

y

x

. . .

Notes 5


Queries

Queries:

  • What city is at <Xi,Yi>?

  • What is within 5 miles from <Xi,Yi>?

  • Which is closest point to <Xi,Yi>?

Notes 5


Cs 277 database system implementation

40

30

10 20

20

m

g

b

n

h

o

d

e

a

k

c

f

j

l

i

10

25

15 35

20

10 20

5

d e

h i

g

f

c

a b

15

15

  • Search points near f

  • Search points near b

n o

j k

m

l

Example

Notes 5


Queries1

Queries

  • Find points with Yi > 20

  • Find points with Xi < 5

  • Find points “close” to i = <12,38>

  • Find points “close” to b = <7,24>

Notes 5


Cs 277 database system implementation

  • Many types of geographic index structures have been suggested

    • Quad Trees

    • R Trees

Notes 5


Two more types of multi key indexes

Two more types of multi key indexes

  • Grid

  • Partitioned hash

Notes 5


Grid index

Grid Index

Key 2

X1 X2 …… Xn

V1

V2

Key 1

Vn

To records with key1=V3, key2=X2

Notes 5


Claim

CLAIM

  • Can quickly find records with

    • key 1 = Vi Key 2 = Xj

    • key 1 = Vi

    • key 2 = Xj

  • And also ranges….

    • E.g., key 1  Vi key 2 < Xj

Notes 5


Cs 277 database system implementation

V1V2V3

X1

X1

X1

Like

Array...

X2

X2

X2

X3

X3

X3

X4

X4

X4

  • How is Grid Index stored on disk?

 But there is a catch with Grid Indexes!

Problem:

  • Need regularity so we can computeposition of <Vi,Xj> entry

Notes 5


Solution use indirection

Solution: Use Indirection

X1 X2 X3

Buckets

V1

V2

V3 *Grid only

V4 contains

pointers to

buckets

Buckets

--

--

--

--

--

--

--

--

--

--

--

--

--

--

--

Notes 5


With indirection

With indirection:

  • Grid can be regular without wasting space

  • We do have price of indirection

Notes 5


Can also index grid on value ranges

Can also index grid on value ranges

SalaryGrid

0-20K

1

20K-50K

2

50K-

3

8

Linear Scale

1

2

3

Toy

Sales

Personnel

Notes 5


Grid files

Grid files

Good for multiple-key search

Space, management overhead(nothing is free)

Need partitioning ranges that evenlysplit keys

+

-

-

Notes 5


Cs 277 database system implementation

Partitioned hash function

010110 1110010

Idea:

Key1 Key2

h1

h2

Notes 5


Cs 277 database system implementation

<Fred>

<Joe><Sally>

EX:

h1(toy)=0000

h1(sales)=1001

h1(art)=1010

.011

.

h2(10k)=01100

h2(20k)=11101

h2(30k)=01110

h2(40k)=00111

.

.

<Fred,toy,10k>,<Joe,sales,10k>

<Sally,art,30k>

Insert

Notes 5


Cs 277 database system implementation

h1(toy)=0000

h1(sales)=1001

h1(art)=1010

.011

.

h2(10k)=01100

h2(20k)=11101

h2(30k)=01110

h2(40k)=00111

.

.

  • Find Emp. with Dept. = Sales  Sal=40k

<Fred>

<Joe><Jan>

<Mary>

<Sally>

<Tom><Bill>

<Andy>

Notes 5


Cs 277 database system implementation

look here

h1(toy)=0000

h1(sales)=1001

h1(art)=1010

.011

.

h2(10k)=01100

h2(20k)=11101

h2(30k)=01110

h2(40k)=00111

.

.

  • Find Emp. with Sal=30k

<Fred>

<Joe><Jan>

<Mary>

<Sally>

<Tom><Bill>

<Andy>

Notes 5


Cs 277 database system implementation

look here

h1(toy)=0000

h1(sales)=1001

h1(art)=1010

.011

.

h2(10k)=01100

h2(20k)=11101

h2(30k)=01110

h2(40k)=00111

.

.

  • Find Emp. with Dept. = Sales

<Fred>

<Joe><Jan>

<Mary>

<Sally>

<Tom><Bill>

<Andy>

Notes 5


Cs 277 database system implementation

Summary

Post hashing discussion:

- Indexing vs. Hashing

- SQL Index Definition

- Multiple Key Access

- Multi Key Index

Variations: Grid, Geo Data

- Partitioned Hash

Notes 5


Reading chapter 14

Reading Chapter 14

  • Skim the following sections:

    • 14.3.6, 14.3.7, 14.3.8

    • 14.4.2, 14.4.3, 14.4.4

  • Read the rest

Notes 5


The big picture

The BIG picture….

  • Chapters 11 & 12: Storage, records, blocks...

  • Chapter 13 & 14: Access Mechanisms- Indexes

    - B trees

    - Hashing

    - Multi key

  • Chapter 15 & 16: Query Processing

NEXT

Notes 5


  • Login