1 / 50

CSCE 520 Test 2 Info Indexing

CSCE 520 Test 2 Info Indexing. Modified from slides of Hector Garcia-Molina and Jeff Ullman. Physical Storage Media. Speed of data access Cost per unit of data Reliability Data loss (power failure or system crash) Physical failure (storage device) Storage types Volatile storage

conlan
Download Presentation

CSCE 520 Test 2 Info Indexing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CSCE 520 Test 2 InfoIndexing Modified from slides of Hector Garcia-Molina and Jeff Ullman

  2. Physical Storage Media • Speed of data access • Cost per unit of data • Reliability • Data loss (power failure or system crash) • Physical failure (storage device) • Storage types • Volatile storage • Non-volatile storage

  3. Memory Hierarchy Programs, Main Memory DBMS Tertiary Storage DBMS Disk Virtual Memory File System Main Memory Cache

  4. Disk Access Characteristics • Move data to main memory: • Position head on cylinder • Find and access sector • Steps of reading a block: • Processor and disk controller processes the request • Seek time: position the head • Rotation latency: rotate the sector under the head • Transfer time: sector/block read by the head

  5. Disk Access Characteristics • Steps of writing a block: • Read the block into the main memory • Change main memory copy of block • Write new content back on disk • Verify correctness of write

  6. How to find records efficiently? • Primary key – sequential organization • Search key? • High I/O cost  INDEXING

  7. Cost of Indexing • Where the time spent on answering a query • Fast: processing in memory • Slow: fetching from secondary storage • Cost of indexing: • Index on several attributes: fast retrieval but slow writes (maintain index structure)

  8. Topics • Conventional indexes • B-trees • Hashing schemes (read only)

  9. 10 30 50 70 90 20 40 60 80 100 Sequential File

  10. 70 50 30 10 110 90 20 40 100 80 120 60 70 50 30 10 90 80 60 40 20 100 Sequential File Dense Index

  11. 130 90 50 10 210 170 30 70 190 150 230 110 70 50 30 10 90 80 60 40 20 100 Sequential File Sparse Index

  12. 210 50 130 10 170 490 90 170 330 10 110 30 570 70 410 150 190 90 230 250 30 90 70 50 10 100 80 60 40 20 Sequential File Sparse 2nd level

  13. Sparse vs. Dense Tradeoff • Sparse: Less index space per record can keep more of index in memory • Dense: Can tell if any record exists without accessing file

  14. Terms • Index sequential file • Search key ( primary key) • Primary index (on Sequencing field) • Secondary index • Dense index (all Search Key values in) • Sparse index • Multi-level index

  15. Next: • Duplicate keys • Deletion/Insertion • Secondary indexes

  16. 10 10 20 30 40 10 20 30 30 45 Duplicate keys

  17. 10 40 10 20 30 30 40 20 10 10 45 20 30 10 30 30 30 45 10 20 Duplicate keys Dense index, one way to implement? 10 10 10 10 10 10 20 20 20 20 30 30 30 30 30 30

  18. 10 10 20 30 40 10 20 30 30 45 Duplicate keys Dense index, better way? 10 20 30 40

  19. 10 40 10 30 20 30 30 20 10 45 careful if looking for 20 or 30! Duplicate keys Sparse index, one way? 10 10 20 30

  20. 10 40 10 30 20 30 30 20 10 45 should this be 40? Duplicate keys Sparse index, another way? • place first new key from block 10 20 30 30

  21. Duplicate values, primary index Summary • Index may point to first instance of each value only File Index a a a . . b

  22. 70 10 30 50 20 40 60 80 Deletion from sparse index 10 30 50 70 90 110 130 150

  23. 70 10 30 50 20 40 60 80 Deletion from sparse index • delete record 40 10 30 50 70 90 110 130 150

  24. 10 30 50 70 20 40 60 80 40 40 Deletion from sparse index • delete record 30 10 30 50 70 90 110 130 150

  25. 10 30 50 70 20 40 60 80 50 70 Deletion from sparse index • delete records 30 & 40 10 30 50 70 90 110 130 150

  26. 70 10 30 50 20 40 60 80 Deletion from dense index 10 20 30 40 50 60 70 80

  27. 10 70 50 20 60 80 40 40 Deletion from dense index • delete record 30 10 20 30 30 40 40 50 60 70 80

  28. 10 30 40 60 20 50 Insertion, sparse index case 10 30 40 60

  29. 10 30 40 60 20 50 34 • our lucky day! • we have free space • where we need it! Insertion, sparse index case • insert record 34 10 30 40 60

  30. 10 30 40 60 20 50 15 20 20 30 Insertion, sparse index case • insert record 15 10 30 40 60 • Illustrated: Immediate reorganization • Variation: • insert new block (chained file) • update index

  31. 10 30 40 60 25 20 50 overflow blocks (reorganize later...) Insertion, sparse index case • insert record 25 10 30 40 60

  32. Insertion, dense index case • Similar • Often more expensive . . .

  33. Summary so far • Conventional index • Basic Ideas: sparse, dense, multi-level… • Duplicate Keys • Deletion/Insertion • Secondary indexes

  34. Conventional indexes Advantage: - Simple - Index is sequential file good for scans Disadvantage: - Inserts expensive, and/or - Lose sequentiality & balance

  35. NEXT: Another type of index • Give up on sequentiality of index • Try to get “balance”

  36. B+Tree Example n=3 100 Root 120 150 180 30 3 5 11 120 130 180 200 100 101 110 150 156 179 30 35

  37. Sample non-leaf 57 81 95 to keys to keys to keys to keys < 57 57 k<81 81k<95 95

  38. Sample leaf node: From non-leaf node to next leaf in sequence 57 81 95 To record with key 57 To record with key 81 To record with key 85

  39. Size of nodes: n+1 pointers n keys (fixed)

  40. Don’t want nodes to be too empty • Use at least Non-leaf: (n+1)/2 pointers Leaf: (n+1)/2 pointers to data

  41. n=3 Full node min. node Non-leaf Leaf 120 150 180 30 3 5 11 30 35 counts even if null

  42. B+tree rules tree of order n (1) All leaves at same lowest level (balanced tree) (2) Pointers in leaves point to records except for “sequence pointer”

  43. (3) Number of pointers/keys for B+tree Max Max Min Min ptrs keys ptrsdata keys Non-leaf (non-root) n+1 n (n+1)/2 (n+1)/2- 1 Leaf (non-root) n+1 n (n+1)/2 (n+1)/2 Root n+1 n 1 1

  44. Insert into B+tree (read only) (a) simple case • space available in leaf (b) leaf overflow (c) non-leaf overflow (d) new root

  45. 32 n=3 100 (a) Insert key = 32 30 3 5 11 30 31

  46. 7 3 5 7 n=3 100 (a) Insert key = 7 30 3 5 11 30 31

  47. Deletion from B+tree (a) Simple case - no example (b) Coalesce with neighbor (sibling) (c) Re-distribute keys (d) Cases (b) or (c) at non-leaf

  48. 40 n=4 (b) Coalesce with sibling • Delete 50 10 40 100 10 20 30 40 50

  49. 35 35 n=4 (c) Redistribute keys • Delete 50 10 40 100 10 20 30 35 40 50

  50. B+tree deletions in practice • Often, coalescing is not implemented • Too hard and not worth it!

More Related