1 / 70

CS143: Index

CS143: Index. Book Chapters: (4 th ) 12.1-3, 12.5-8 (5 th ) 12.1-3, 12.6-8, 12.10. Topics to Learn. Important concepts Dense index vs. sparse index Primary index vs. secondary index (= clustering index vs. non-clustering index) Tree-based vs. hash-based index Tree-based index

illias
Download Presentation

CS143: Index

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS143: Index Book Chapters: (4th) 12.1-3, 12.5-8 (5th) 12.1-3, 12.6-8, 12.10

  2. Topics to Learn • Important concepts • Dense index vs. sparse index • Primary index vs. secondary index (= clustering index vs. non-clustering index) • Tree-based vs. hash-based index • Tree-based index • Indexed sequential file • B+-tree • Hash-based index • Static hashing • Extendible hashing

  3. Basic Problem • SELECT *FROM StudentWHERE sid = 40 • How can we answer the query?

  4. Random-Order File • How do we find sid=40?

  5. Sequential File • Table sequenced by sid. Find sid=40?

  6. Binary Search • 100,000 records • Q: How many blocks to read? • Any better way? • In a library, how do we find a book?

  7. Basic Idea • Build an “index” on the table • An auxiliary structure to help us locate a record given a “key” 40

  8. 110 70 50 30 10 90 100 80 60 40 20 120 10 30 90 70 50 80 60 40 20 100 Dense, Primary Index Dense Index Sequential File • Primary index (clustering index) • Index on the search key • Dense index • (key, pointer) pair for every record • Find the key from index and follow pointer • Maybe through binary search • Q: Why dense index? • Isn’t binary search on the file the same?

  9. Why Dense Index? • Example • 10,000,000 records (900-bytes/rec) • 4-byte search key, 4-byte pointer • 4096-byte block. Unspanned tuples • Q: How many blocks for table (how big)? • Q: How many blocks for index (how big)?

  10. 90 50 10 130 150 110 70 30 10 30 50 70 90 100 60 40 20 80 Sparse, Primary Index Sequential File Sparse Index • Sparse index • (key, pointer) pair per every “block” • (key, pointer) pair points to the first record in the block • Q: How can we find 60?

  11. Sequential File Sparse 2nd level 1st level 130 330 170 10 210 10 50 490 90 170 70 570 110 30 230 410 150 250 90 190 90 70 50 30 10 100 80 60 40 20 Multi-level index Q: Why multi-level index? Q: Does dense, 2nd level index make sense?

  12. 30 20 80 100 90 50 70 40 10 60 Secondary (non-clustering) Index Sequence field • Secondary (non-clustering) index • When tuples in the table are not ordered by the index search key • Index on a non-search-key for sequential file • Unordered file • Q: What index? • Does sparse index make sense?

  13. 100 30 20 80 90 60 40 10 50 70 90 30 ... 20 80 100 Sparse and secondary index?

  14. 90 30 20 80 100 50 70 40 10 60 10 50 10 60 20 50 30 70 90 ... ... 40 Secondary index sparse High level • First level is always dense • Sparse from the second level

  15. Important terms • Dense index vs. sparse index • Primary index vs. secondary index • Clustering index vs. non-clustering index • Multi-level index • Indexed sequential file • Sometimes called ISAM (indexed sequential access method) • Search key ( primary key)

  16. Insertion Insert 35 10 10 20 30 30 40 60 40 50 60 Q: Do we need to update higher-level index?

  17. Insertion Insert 15 (overflow) 10 10 20 30 30 40 60 40 50 60 Q: Do we need to update higher-level index?

  18. Insertion Insert 15 (redistribute) 10 10 20 30 30 40 60 40 50 60 Q: Do we need to update higher-level index?

  19. 32 39 38 31 34 35 36 Main index Potential performance problem After many insertions… 10 20 30 33 40 50 60 70 80 overflow pages (not sequential) 90

  20. Traditional Index (ISAM) • Advantage • Simple • Sequential blocks • Disadvantage • Not suitable for updates • Becomes ugly (loses sequentiality and balance) over time

  21. B+Tree • Most popular index structure in RDBMS • Advantage • Suitable for dynamic updates • Balanced • Minimum space usage guarantee • Disadvantage • Non-sequential index blocks

  22. 50 50 20 30 80 90 70 80 70 B+Tree Example (n=3) root Non leaf Leaf ... ... ... Balanced: All leaf nodes are at the same level

  23. Sample Leaf Node (n=3) From a non-leaf node 20 30 Last pointer: to the next leaf node points to tuple • n: max # of pointers in a node • All pointers (except the last one) point to tuples • At least half of the pointers are used. • (more precisely, (n+1)/2 pointers)

  24. Sample Non-leaf Node (n=3) 23 56 To keys 56 k To keys k<23 To keys 23 k<56 • Points to the nodes one-level below • - No direct pointers to tuples • At least half of the ptrs used (precisely, n/2) • - except root, where at least 2 ptrs used

  25. 50 20 30 80 90 70 50 80 70 Search on B+tree • Find 30, 60, 70? • Find a greater key and follow the link on the left • (Algorithm: Figure 12.10 on textbook)

  26. 5 8 10 5 5 8 10 5 8 Nodes are never too empty • Use at least Non-leaf: n/2 pointers Leaf: (n+1)/2 pointers full node min. node Non-leaf Leaf n=4

  27. Number of Ptrs/Keys for B+tree Max Max Min Min Ptrs keys ptrs keys Non-leaf (non-root) n n-1 n/2 n/2-1 Leaf (non-root) n n-1 (n+1)/2 (n-1)/2 Root n n-1 2 1

  28. B+Tree Insertion (a) simple case (no overflow) (b) leaf overflow (c) non-leaf overflow (d) new root

  29. 50 20 30 80 90 70 50 80 70 Insertion (Simple Case) • Insert 60

  30. 20 30 50 60 80 90 70 50 80 70 Insertion (Leaf Overflow) • Insert 55 • No space to store 55

  31. 20 30 80 90 70 80 70 60 50 60 50 55 Insertion (Leaf Overflow) • Insert 55 No overflow. Stop • Q: After split, leaf nodes always half full?

  32. 20 30 50 60 50 55 60 70 Insertion (Non-leaf Overflow) • Insert 52 Leaf overflow. Split and copy the first key of the new node

  33. 20 30 50 52 50 60 60 55 55 70 Insertion (Non-leaf Overflow) • Insert 52 No overflow. Stop Q: After split, non-leaf at least half full?

  34. 20 30 50 60 50 55 60 Insertion (New Root Node) • Insert 25

  35. 50 55 60 50 20 25 60 30 Insertion (New Root Node) • Insert 25 • Q: At least 2 ptrs at root? 30

  36. B+Tree Insertion • Leaf node overflow • The first key of the new node is copied to the parent • Non-leaf node overflow • The middle key is moved to the parent • Detailed algorithm: Figure 12.13

  37. B+Tree Deletion (a) Simple case (no underflow) (b) Leaf node, coalesce with neighbor (c) Leaf node, redistribute with neighbor (d) Non-leaf node, coalesce with neighbor (e) Non-leaf node, redistribute with neighbor In the examples, n = 4 • Underflow for non-leaf when fewer than n/2 = 2 ptrs • Underflow for leaf when fewer than (n+1)/2 = 3 ptrs • Nodes are labeled as a, b, c, d, …

  38. 20 25 30 20 40 60 40 50 (a) Simple case a • Delete 25 b c d e

  39. 40 50 20 40 60 20 30 (b) Coalesce with sibling (leaf) a b c d e • Delete 50 60

  40. 20 60 20 30 40 (b) Coalesce with sibling (leaf) a b c e • Delete 50 • Check underflow at a. Min 2 ptrs, currently 3 60

  41. 20 40 60 20 25 30 40 50 (c) Redistribute (leaf) a b c d e • Delete 50 60

  42. 30 30 40 20 40 60 20 25 (c) Redistribute (leaf) a b c d e • Delete 50 • No underflow at a. Done. 60

  43. 30 70 30 40 50 90 50 60 10 20 (d) Coalesce (non-leaf) a b c • Delete 20 • Underflow! Merge d with e. • Move everything in the right to the left d e f g 70

  44. 50 70 90 10 30 40 50 60 (d) Coalesce (non-leaf) a b • Delete 20 • Underflow at a? Min 2 ptrs. Currently 2. Done. d f g 70

  45. 30 70 90 97 30 40 50 99 50 60 10 20 (e) Redistribute (non-leaf) a b c • Delete 20 • Underflow! Merge d with e. d e f g 70

  46. 90 99 50 70 10 30 40 97 50 60 (e) Redistribute (non-leaf) a b c • Delete 20 • Underflow at a? Min 2 ptrs, currently 3. Done d f g 70

  47. Important Points • Remember: • For leaf node merging, we delete the mid-key from the parent • For non-leaf node merging/redistribution, we pull down the mid-key from their parent. • Exact algorithm: Figure 12.17 • In practice • Coalescing is often not implemented • Too hard and not worth it

  48. Where does n come from? • n determined by • Size of a node • Size of search key • Size of an index pointer • Q: 1024B node, 10B key, 8B ptr  n?

  49. 20 30 50 60 80 90 70 50 80 70 Question on B+tree • SELECT *FROM StudentWHERE sid > 60?

  50. Summary on tree index • Issues to consider • Sparse vs. dense • Primary (clustering) vs. secondary (non-clustering) • Indexed sequential file (ISAM) • Simple algorithm. Sequential blocks • Not suitable for dynamic environment • B+trees • Balanced, minimum space guarantee • Insertion, deletion algorithms

More Related