1 / 11

Notions of clustering

Notions of clustering. Clustered file : e.g. store movie tuples together with the corresponding studio tuple. Clustered relation : tuples are stored in blocks mostly devoted to that relation. Clustering index : tuples (of the relation) with same search key are stored together.

wilfredb
Download Presentation

Notions of clustering

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Notions of clustering • Clustered file: e.g. store movie tuples together with the corresponding studio tuple. • Clustered relation: tuples are stored in blocks mostly devoted to that relation. • Clustering index: tuples (of the relation) with same search key are stored together.

  2. Index-based algorithms: selection • To evaluate a=(R) use index on a, if it exists • Cost: cost of index lookup (negligible) plus • If index is clustering: B(R)/V(R,a) I/O’s (the fraction of the relation with some value for a) • Otherwise, an approximation is that each tuple we retrieve is in a different block, so we get: T(R)/V(R,a) I/O’s

  3. Example of index-based selection • a=(R), and B(R) = 1000, T(R) =20,000 • R is clustered, and no index on attribute a 1000 disk I/O’s • R is unclustered, and no index on attribute a 20,000 I/O’s • R has a clustering index on a, V(R,a) = 100 10 I/O’s • R has a non-clustering index on a, V(R,a) = 100 20,000/100 = 200 disk I/O’s • V(R,a) = 20,000 (i.e. attribute a is key)  just 1 I/O

  4. Index joins • We want to compute R(X,Y) S(Y,Z) • Suppose there is a Y-index on S. • For each tuple t of R, lookup all tuples in S with key-value t[Y] and output the join of t. • Cost: B(R) to read R (clustered case) --- We ignore this cost • Each tuple of R joins with T(S)/V(S,Y) tuples of S, on average. • S has a non-clustered index on Y: T(R)T(S)/V(S,Y) • S has a clustered index on Y: T(R)B(S)/V(S,Y)

  5. Example of index-join • T(R) = 10,000, B(R) = 1000 • T(S) = 5000, B(S) = 500, V(S,Y) = 100 • To compute R(X,Y)  S(Y,Z) using a clustered Y-index on S: • 1000 + 10,000*(500/100) = 51,000 I/O’s Bad!!

  6. However, things are not so bad in practice • Suppose we have the relations: • StarsIn(title, year, starName) • MovieStar(name, address, gender, birthdate) • And there is an index on MovieStar.name • Consider the SQL query: • SELECT birthdate • FROM StarsIn, MovieStar • WHERE title = 'King Kong' AND starName = name;

  7. Practice (Cont’d) • We can first do the selection of those tuples in StarsIn relation with title=‘King Kong’. Suppose they are 10 such tuples. • Now, we know that stars take care to not have the same name with some other star. So, name is a key for the relation MovieStar. (V(MovieStar, name) = ?) • Hence, V(MovieStar, name) = T(MovieStar) • Finally the number of I/O’s is: • B(StarsIn) + T(name=‘King Kong’(StarsIn)) I/O’s for R clustered and • T(StarsIn) + T(name=‘King Kong’(StarsIn)) I/O’s for R non-clustered.

  8. Joins using sorted indexes We want to compute R(X,Y)  S(Y,Z) • If S has a B-tree index on Y, • Create sorted sublists of R only, and • Do a sort join, extracting the S-tuples in order through the index • Of both have B-tree index on Y, do a zigzag-join.

  9. Example (B-Tree index on S[Y]) • T(R) = 10,000, B(R) = 1000, • T(S) = 5000, B(S) = 500, V(S,Y) = 100, S has a B-Tree index on Y • Assume that both relations and the indexes are clustered. • M = 101 buffers • Create 10 sorted sublists of R. Cost: 2B(R) • 10 buffers for sublists of R, 1 buffer for S (retrieved via index) • Join tuples from input buffers • Total cost: 2B(R) + B(R) + B(S) + index lookup = 2000 + 1000 + 500 + index lookup = 3500 + index lookup

  10. Zigzag Join • Suppose we have B-Tree indexes on both S[Y] and R[Y]. • We can jump back and forth between the indexes finding Y-values that they share in common. • Tuples from R with Y-value that doesn’t appear in S need never be retrieved, and similarly tuples of S whose Y-value doesn’t appear in R need never be retrieved. Example. • Let the Y-values for R be: 1,3,4,4,4,5,6 • Let the Y-values for S be: 2,2,4,4,6,7 • Start with the 1 and 2. • Since 1<2 skip 1 in R. • Since 2<3 skip the 2’s in S. • Since 3<4 skip 3 in R. • Join 4’s. • …

  11. Example (Zigzag Join) • T(R) = 10,000, B(R) = 1000, • T(S) = 5000, B(S) = 500, • S and R both have clustered B-Tree indexes on Y • There is no need to store either relation. • We use just 1000+500 disk I/O’s to read the blocks of R and S through their indexes. • We can determine from the indexes alone that a large fracion of R or S cannot match tuples of the other relation, so the cost might be considerably less than 1500 I/O’s.

More Related