
Database Systems(資料庫系統)

November 8, 2004

Lecture #9

By Hao-hua Chu (朱浩華)

Announcement
  • Midterm exam: November 20 (Sat), 2:30 PM in CSIE 101/103
  • Assignment #6 is available on the course homepage.
    • It is due on 11/24.
    • It is very difficult.
    • We suggest you do it before the midterm exam.
  • Assignment #7 will be available on the course homepage later this afternoon.
    • It is due 11/16 (next Tuesday).
    • It is easy.
    • It will help you prepare for the midterm exam.
Cool Ubicomp Project: Counter Intelligence (MIT)
  • Smart kitchen & kitchen wares
  • Talking Spoon
    • Salty, sweet, hot?
  • Talking Cutlery
    • Bacteria?
  • Smart fridge & counters
    • RFID tags
    • Tracking food from fridge to your mouth
Introduction
  • Recall that hash-based indexes are best for equality selections.
    • Cannot support range searches.
    • Equality selections are useful for join operations.
  • Static and dynamic hashing techniques exist
    • Trade-offs similar to ISAM vs. B+ trees.
    • Static hashing technique
    • Two dynamic hashing techniques
      • Extendible Hashing
      • Linear Hashing
Static Hashing
  • # primary pages fixed, allocated sequentially, never de-allocated; overflow pages if needed.
  • h(k) mod N = bucket to which the data entry with key k belongs. (N = # of buckets)

[Figure: key → h → h(key) mod N selects one of the N primary bucket pages (0, 1, ..., N-1); each primary page may chain to overflow pages.]
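The scheme above can be sketched in a few lines of Python. PAGE_CAPACITY, the constants inside h, and the plain lists standing in for disk pages are all illustrative assumptions, not part of the lecture:

```python
# A minimal sketch of static hashing: a fixed number of primary bucket
# pages, with overflow chains when a primary page fills up.
PAGE_CAPACITY = 4   # data entries per primary page (assumed)
N = 8               # number of primary buckets, fixed at creation time

def h(key):
    # h(key) = a*key + b, as suggested later in the lecture; a, b arbitrary here
    return 37 * key + 11

class StaticHashFile:
    def __init__(self):
        self.primary = [[] for _ in range(N)]    # primary bucket pages
        self.overflow = [[] for _ in range(N)]   # overflow chains

    def insert(self, key):
        b = h(key) % N
        if len(self.primary[b]) < PAGE_CAPACITY:
            self.primary[b].append(key)
        else:
            self.overflow[b].append(key)   # chains can grow without bound

    def search(self, key):
        b = h(key) % N
        return key in self.primary[b] or key in self.overflow[b]

f = StaticHashFile()
for k in range(100):
    f.insert(k)   # 100 entries in 8 fixed buckets: long chains develop
```

With 100 entries and only 8 × 4 primary slots, most entries end up in overflow chains, which is exactly the degradation the next slide discusses.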

Static Hashing (Contd.)
  • Buckets contain data entries.
  • Hash function works on search key field of record r.
    • Ideally uniformly distribute values over range 0 ... N-1
    • h(key) = (a * key + b) usually works well.
    • a and b are constants; lots known about how to tune h.
  • Cost of insertion/deletion/search:
    • two/two/one disk page I/Os (assuming no overflow chains).
  • Long overflow chains can develop and degrade performance.
    • Why poor performance? Scan through overflow chains linearly.
    • Extendible and Linear Hashing: Dynamic techniques to fix this problem.
Simple Solution
  • Avoid creating overflow pages:
    • When a bucket (primary page) becomes full, double # of buckets & re-organize the file.
  • What’s wrong with this simple solution?
    • High cost concern: reading and writing all pages is expensive!
Extendible Hashing
  • The basic idea (another level of abstraction):
    • Use a directory of pointers to buckets (hash to a directory entry).
    • Double the # of buckets by doubling the directory,
    • splitting just the bucket that overflowed!
  • Directory much smaller than file, so doubling it is much cheaper.
  • Only one page of data entries is split
    • The page that overflows, rehash that page to two pages.
  • Trick lies in how hash function is adjusted!
    • Before doubling directory, h(r) -> 0..N-1 buckets.
    • After doubling directory, h(r) -> 0 .. 2N-1
Example
  • Directory is array of size 4.
  • To find bucket for r, take last global depth # bits of h(r);
    • Example: If h(r) = 5 = binary 101, it is in bucket pointed to by 01.
  • Global depth: # of bits used for hashing directory entries.
  • Local depth of a bucket: # bits for hashing a bucket.
  • When can global depth be different from local depth?

[Figure: GLOBAL DEPTH = 2; directory entries 00, 01, 10, 11 point to the data pages: Bucket A (16*, 4*, 12*, 32*), Bucket B (1*, 5*, 21*, 13*), Bucket C (10*), Bucket D (15*, 7*, 19*); each bucket has LOCAL DEPTH = 2.]
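The lookup rule in the example is just a bit mask over h(r) (h is the identity on these slides); a one-line sketch:

```python
def bucket_bits(hash_value, global_depth):
    # directory index = last `global_depth` bits of h(r)
    return hash_value & ((1 << global_depth) - 1)

assert bucket_bits(5, 2) == 0b01    # h(r) = 5 = 101 -> directory entry 01
assert bucket_bits(21, 2) == 0b01   # 21 = 10101 -> same entry (Bucket B)
```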

Insert h(r)=20 (Causes Doubling)

[Figure: before and after inserting 20. Binary values: 4 = 0000 0100, 12 = 0000 1100, 20 = 0001 0100, 16 = 0001 0000, 32 = 0010 0000. Bucket A (16*, 4*, 12*, 32*, local depth 2) is full, and its local depth equals the global depth, so the directory doubles to global depth 3 (entries 000-111). Bucket A splits on the third-to-last bit: Bucket A keeps 32* and 16* (local depth 3), while its `split image' Bucket A2 gets 4*, 12*, 20* (local depth 3). Buckets B, C, D keep local depth 2, each now pointed to by two directory entries.]
Extendible Hashing Insert
  • Check if the bucket is full.
    • If not, insert the entry: done!
  • Otherwise, check whether local depth = global depth.
    • If not, rehash the bucket's entries, distribute them into two buckets, and increment the local depth.
    • If so, double the directory first, then rehash and distribute into two buckets.
  • The directory is doubled by copying it over and `fixing' the pointer to the split image page.
    • This copying trick works only if the directory is indexed by the least significant bits.
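The steps above can be sketched as follows, assuming (as the examples do) that h is the identity, a page holds 4 entries, and the directory is indexed by least-significant bits. The class names are illustrative:

```python
CAPACITY = 4  # data entries per bucket page, as in the slides

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.entries = []

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _index(self, key):
        # last `global_depth` bits of h(key); h is the identity here
        return key & ((1 << self.global_depth) - 1)

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        if len(bucket.entries) < CAPACITY:
            bucket.entries.append(key)          # bucket not full: done
            return
        if bucket.local_depth == self.global_depth:
            # double the directory by copying it (least-significant-bit trick)
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # split just the overflowing bucket
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)      # the bucket's `split image'
        bit = 1 << (bucket.local_depth - 1)
        # fix the directory pointers whose deciding bit selects the image
        for i, b in enumerate(self.directory):
            if b is bucket and (i & bit):
                self.directory[i] = image
        # rehash the old entries plus the new one; may cascade if still full
        old, bucket.entries = bucket.entries, []
        for k in old + [key]:
            self.insert(k)

    def search(self, key):
        return key in self.directory[self._index(key)].entries

eh = ExtendibleHash()
for k in (16, 4, 12, 32, 20):   # the slide's Bucket A, then the insert of 20
    eh.insert(k)
```

Running the slide's example this way ends with global depth 3, one bucket holding 16 and 32, and its split image holding 4, 12, 20, matching the figure.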
Insert 9

[Figure: binary values: 1 = 0000 0001, 5 = 0000 0101, 21 = 0001 0101, 13 = 0000 1101, 9 = 0000 1001. Inserting 9 (last bits 001) lands in full Bucket B (1*, 5*, 21*, 13*, local depth 2). Since local depth < global depth (3), only Bucket B splits; the directory does not double. After the split, Bucket B holds 1* and 9* (local depth 3), and its `split image' Bucket B2 holds 5*, 21*, 13* (local depth 3); Buckets A, A2, C, D are unchanged.]
Directory Doubling
  • Why use least significant bits in directory?
    • Allows for doubling via copying!

[Figure: doubling a directory of size 4 (depth 2) to size 8 (depth 3), illustrated with key 6 = 110. With least-significant bits, the new directory is the old directory copied twice: entry 10 becomes entries 010 and 110, so 6* stays reachable after a simple copy. With most-significant bits, the entries must be interleaved rather than copied, so doubling cannot be done by copying.]
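The figure's point can be checked directly. Assuming 3-bit hash values as on the slide (key 6 hashes to 110), copying requires that after doubling, slot i and slot i + old_size hold the same pointer, i.e. both index schemes for a key differ only by a multiple of the old directory size:

```python
old_size = 4  # directory size at global depth 2

def lsb(h, depth):
    return h & ((1 << depth) - 1)      # last `depth` bits

def msb(h, depth, width=3):
    return h >> (width - depth)        # first `depth` of `width` bits

# Least-significant bits: the 3-bit index of every key is its 2-bit index
# or its 2-bit index + old_size -- exactly what copying the directory gives.
assert all(lsb(h, 3) % old_size == lsb(h, 2) for h in range(8))

# Most-significant bits: the property fails (e.g. for 6 = 110), so the
# doubled directory cannot be built by a simple copy.
assert msb(0b110, 3) % old_size != msb(0b110, 2)
```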

Comments on Extendible Hashing
  • If directory fits in memory, equality search answered with one disk access; else two.
    • 100MB file, 100 bytes/rec, you have 1M data entries.
    • A 4K page (a bucket) can contain 40 data entries, so you need about 25,000 directory elements; chances are high that the directory will fit in memory.
    • If the distribution of hash values is skewed (concentrates on a few buckets), directory can grow large.
  • Delete: If removal of data entry makes bucket empty, can be merged with `split image’. If each directory element points to same bucket as its split image, can halve directory.
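The slide's back-of-the-envelope numbers check out (taking a 4K page to be 4096 bytes):

```python
# Directory-size estimate from the slide, verified step by step.
file_size = 100 * 10**6        # 100 MB file
record_size = 100              # bytes per data entry
page_size = 4096               # 4K bucket page

entries = file_size // record_size           # data entries in the file
entries_per_page = page_size // record_size  # entries per bucket page
buckets = entries // entries_per_page        # ~ directory elements needed

assert entries == 1_000_000
assert entries_per_page == 40
assert buckets == 25_000
```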
Linear Hashing (LH)
  • This is another dynamic hashing scheme, an alternative to Extendible Hashing.
    • LH fixes the problem of long overflow chains (in static hashing) without using a directory (in extendible hashing).
  • Basic Idea: Use a family of hash functions h0, h1, h2, ...
    • Each function’s range is twice that of its predecessor.
    • Pages are split when overflows occur, but not necessarily the overflowing page. (Splitting occurs in turn, in a round-robin fashion.)
    • Buckets are added gradually (one bucket at a time).
    • When all the pages at one level (the current hash function) have been split, a new level is applied.
    • Primary pages are allocated consecutively.
Levels of Linear Hashing
  • Initial Stage.
    • The initial level distributes entries into N0 buckets.
    • Call the hash function that performs this distribution h0.
  • Splitting buckets.
    • If a bucket overflows, its primary page is chained to an overflow page (same as in static hashing).
    • Also, when a bucket overflows, some bucket is split.
      • The first bucket to be split is the first bucket in the file (not necessarily the bucket that overflows).
      • The next bucket to be split is the second bucket in the file, and so on, until the N0-th bucket has been split.
      • When buckets are split, their entries (including those in overflow pages) are distributed using h1.
    • To access split buckets, the next-level hash function (h1) is applied.
    • h1 maps entries to 2N0 (or N1) buckets.
Levels of Linear Hashing (Cnt)
  • Level progression:
    • Once all Ni buckets of the current level (i) are split, the hash function hi is replaced by hi+1.
    • The splitting process starts again at the first bucket, and hi+2 is applied to find entries in split buckets.
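The level and split-pointer machinery above can be sketched in Python. N0, CAPACITY, the single list standing in for a primary page plus its overflow chain, and the choice h_i(key) = key mod (N0 * 2**i) are all simplifying assumptions for illustration:

```python
N0 = 4        # initial number of buckets
CAPACITY = 3  # entries per primary page, as in the upcoming example

class LinearHash:
    def __init__(self):
        self.level = 0
        self.next = 0                            # next bucket to split
        self.buckets = [[] for _ in range(N0)]   # primary page + overflow chain

    def _h(self, level, key):
        return key % (N0 * 2 ** level)           # h_i: range doubles per level

    def _bucket(self, key):
        b = self._h(self.level, key)
        if b < self.next:                        # bucket already split this round,
            b = self._h(self.level + 1, key)     # so use the next level's hash
        return b

    def insert(self, key):
        b = self._bucket(key)
        overflowed = len(self.buckets[b]) >= CAPACITY
        self.buckets[b].append(key)              # chains into overflow if full
        if overflowed:
            self._split()

    def _split(self):
        # split the bucket `next` points to (round robin) -- not necessarily
        # the one that overflowed -- redistributing its entries with h_{level+1}
        self.buckets.append([])                  # the new page joins the sequence
        victim, self.buckets[self.next] = self.buckets[self.next], []
        self.next += 1
        for k in victim:
            self.buckets[self._h(self.level + 1, k)].append(k)
        if self.next == N0 * 2 ** self.level:    # every bucket at this level split:
            self.level += 1                      # advance to the next level
            self.next = 0

    def search(self, key):
        return key in self.buckets[self._bucket(key)]

lh = LinearHash()
for k in range(30):
    lh.insert(k)
```

Note how `_bucket` encodes the rule from the slides: buckets before `next` have already been split this round, so they must be addressed with the next level's hash function.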
Linear Hashing Example
  • Initially, the level equals 0 and N0 equals 4 (three entries fit on a page).
  • h0 maps index entries to one of four buckets.
  • h0 is used and no buckets have been split.
  • Now consider what happens when 9 (1001) is inserted (which will not fit in the second bucket).
  • Note that next indicates which bucket is to split next. (Round Robin)

[Figure: level-0 file with four buckets, addressed by h0 values 00, 01, 10, 11.]

Linear Hashing Example 2
  • The page indicated by next is split (the first one).
  • Next is incremented.
  • An overflow page is chained to the primary page to contain the inserted value.
  • If h0 maps a value between zero and next - 1 (just the first page in this case), h1 must be used to insert the new entry.
  • Note how the new page falls naturally into the sequence as the fifth page.

[Figure: after the split, h1 addresses the split bucket and its new fifth page as 000 and 100, while the unsplit buckets are still addressed as 01, 10, 11.]
Linear Hashing Example 3
  • Assume inserts of 8, 7, 18, 14, 111, 32, 162, 10, 13, 233
  • After the 2nd split, the base level is 1 (N1 = 8); use h1.
  • Subsequent splits will use h2 for inserts between the first bucket and next-1.
Linear Hashing vs. Extendible Hashing
  • What is the similarity?
    • One round of round-robin splitting in LH corresponds to a one-step doubling of the directory in EH.
  • What are the differences?
    • Directory overhead vs. none
    • Overflow pages vs. none
    • Gradual splitting (of pages) vs. one-step doubling (of directory)
    • Pages are allocated in order vs. not in order
    • Splitting non-overflowing pages vs. splitting overflowing pages
Summary
  • Hash-based indexes: best for equality searches, cannot support range searches.
  • Static Hashing can lead to long overflow chains.
  • Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. (Duplicates may require overflow pages.)
    • Directory to keep track of buckets, doubles periodically.
    • Can get large with skewed data; additional I/O if this does not fit in main memory.
    • a skewed data distribution is one in which the hash values of data entries are not uniformly distributed!
Summary (Contd.)
  • Linear Hashing avoids directory by splitting buckets round-robin, and using overflow pages.
    • Overflow pages not likely to be long.
    • Space utilization could be lower than Extendible Hashing, since splits not concentrated on `dense’ data areas.
      • Can tune criterion for triggering splits to trade-off slightly longer chains for better space utilization.