1 / 41

7. Indexes

Section 7 # 1. Heap files allow record retrieval: by specifying the Record IDentifier, RID , or by scanning all records sequentially. Sometimes, retrieval of records by specifying the values in one or more fields is needed (semantic search or value-based querying), e.g.,

Download Presentation

7. Indexes

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Section 7 # 1 • Heap files allow record retrieval: • by specifying the Record IDentifier, RID, or • by scanning all records sequentially. • Sometimes, retrieval of records by specifying the values in one or more fields is needed (semantic search or value-based querying), e.g., • Find all students in the CS dept; or • Find students with gpa > 3; • Indexes are auxilliary files (separate from the data files they index) that enable answering these value-based queries efficiently. • An index contains a “search key attribute”, k, and “data entries”, k*, which lead us to the records containing the search key value. • the k*s can be actual records, • or pointers to the records • or pointers to pointers to the records (indirect pointers). • In these notes we will always assume the second alternative • (pointer to the records). 7. Indexes

  2. Section 7 # 2 Index Classification • Primary Indexes vs. Secondary Indexes: • If the search key is, or at least contains, the clustered primary key, then the index is called a primary index, else it is called a secondary index. • Clustered Indexes vs. Unclustered Indexes: • If closeness of key values implies closeness of the data records containing those key values, the index is called a clustered index, else it is an unclustered index. (recall, since physical disk storage is not completely linear, "close" means • 1st: close on one disk track • 2nd: on differnt tracks but on the same cylinder • 3rd: on the different cylinders but on close cylinders. • A file can be clustered on at most 1 attribute (search key) but that attribute may be composite (made up of multiple attributes). • The cost of retrieving data records through an index varies greatly based on whether index is clustered or not!

  3. key Anchor records of each page Name, age, bonus Ashby, 25, 3000 22 Basu, 33, 4003 25 Bristow, 30, 2007 30 Ashby 33 Cass, 50, 5004 Cass Smith Daniels, 22, 6003 40 Jones, 40, 6003 44 44 Smith, 44, 3000 50 Tracy, 44, 5004 Sparse Index Dense Index on on Data File Name Age Section 7 # 3 Index Classification continued • If there is at least one index entry per existing attribute value, then it is called dense, else sparse or non-dense. • Every sparse index must be clustered! Sparse indexes are smaller. • We show a sparse index on the Name attribute. • The key values (name in this example) that do occur in a sparse index are called ANCHOR values (always the first key value that occurs on a page). • In this example the anchor names are Ashby, Cass and Smith. Basu, Bristow, Daniels, Jones and Tracy are non-anchor names.

  4. page 1 page 2 page 3 Primary Index on S# |S#|pg |17| 1 |32| 2 |57| 3 Section 7 # 4 Primary Index PRIMARY INDEX: I(k, k*) k  ordered or clustered "key" field values from an ordered or clustered field of file with theuniqueness property (individual value occurrences are "unique" i.e., each value can occur at most once.) k* = pointer to page (sematic pointer - namely page number) containing the first record with key value, k. Example: Assume the blocking factor (bfr) is 2, which means there is room for 2 records per page. STUDENT |S#|SNAME |LCODE | |17|BAID |NY2091| |25|CLAY |NJ5101| |32|THAISZ|NJ5102| |38|GOOD |FL6321| |57|BROWN |NY2092| |83|THOM |ND3450|

  5. page 1 page 2 page 3 page 4 |C#|pg| Non-dense Clustering_Index on C# |6 | 1| (indexing new anchor records only) |8 | 4| There's no more search overhead with this 2nd type of non-dense clustering index, but How can you know which page has C#=7? (search pages sequentially starting at page=1) How can you know which page has C#=9? (search pages sequentially starting at page=4) Section 7 # 5 Clustering Index is like a primary index except that the attribute need not be a key - but the file must be clustered on the attribute, k - and the pointer for any k is the 1st page with that k-value ENROLL2 |S#|C#|GRADE |17|6 | 96 | |25|6 | 76 | |32|6 | 62 | |38|6 | 98 | |32|6 | 91 | |25|7 | 68 | |32|8 | 89 | |17|9 | 95 | |C#|pg| Dense Clustering_Index on C# |6 | 1| |7 | 3| |8 | 4| |9 | 4|

  6. page 1 page 4 page 2 page 3 page 5 Option2: Use repeating groups of pointers (requires a variable length page attribute) |C#|page |5 | 4 |6 | 1,2,3 |7 | 3,4 |8 | 1,2,4 Section 7 # 6 Secondary Index These indexes are the same as Clustering Indexes except, - the file need not be clustered on k, - k* points to the page or record containing k - every record must be indexed (all secondary indexes are dense). Why? Option1: If there are multiple occurences of k, use multiple index entries for that k. S#|C#|GRADE ENROLL(unclustered C#) 32|8 | 89 | 25|6 | 76 | 32|6 | 62 | 25|8 | 86 | 38|6 | 98 | 32|7 | 91 | 17|5 | 96 | 25|7 | 68 | 17|8 | 95 | C#|pgSecondary_Index, Option1 on C# 5 | 4 6 | 1 6 | 2 6 | 3 7 | 3 7 | 4 8 | 1 8 | 2 8 | 4 Option3: Use 1 index entry for each value, with 1 pointer to a linked list of record pointers. (1 level of indirection) |S#|C#|GRADEENROLL (unclustered C#) |32|8 | 89 | |25|6 | 76 | |32|6 | 62 | |25|8 | 86 | |38|6 | 98 | |32|7 | 91 | |17|5 | 96 | |25|7 | 68 | |17|8 | 95 | |C#| page |5 | -->|4| |6 | -->|1|->|2|->|3| |7 | -->|3|->|4| |8 | -->|1|->|2|->|4|

  7. page 1 of STUDENT page 4 page 1 of First Level Index page 2 Section 7 # 7 Multi-level Index (made up of an index on an index) For any index, since it is a file clustered on the key, k, it can have a primary or clustering index on it. (constituting the second level of the multilevel index). STUDENT |S#|SNAME |LCODE | |17|BAID |NY2091| |25|CLAY |NJ5101| |32|THAISZ|NJ5102| |38|GOOD |FL6321| |57|BROWN |NY2092| |83|THOM |ND3450| |91|PARK |MN7334| |94|SIVA |OR1123| |S#|pg| (of index file) S#-index (nondense, primary) |17| 1| |32| 2| |57| 3| |91| 4| 2nd_LEVEL (a second level, nondense index) |S#|pg| |17| 1| |57| 2|

  8. Tree-structured or Multilevel indexing techniques: • ISAM: (a variation of multilevel Secondary indexing) static structure; • B+ tree:dynamic structure which adjusts gracefully under insert and delete. index entry K* K K* K K* K* K m 0 1 2 1 m 2 • Leaf pages contain data entries, <k,k*>. • Non-Leaf or Inodes contain k values only Non-leaf (inode Leaf Overflow Pages page Primary pages Section 7 # 8 ISAM 1 index every record by <k, k*>. This Index file may still be quite large, but we can apply the idea repeatedly!

  9. Root 40 20 33 51 63 46,46* 55,55* 10,10* 15,15* 20,20* 27,27* 33,33* 37,37* 40,40* 51,51* 97,97* 63,63* Section 7 # 9 Example ISAM Tree Blocking factor is 2 (each node has 2 k entries) In any internal node or inode (non-leaf) add a ptr for key_values < first k-value

  10. 48,48* 42,42* Index Pages Primary Leaf Pages 23,23* Overflow Pages Section 7 # 10 Insert k=48 Insert k=41 Insert k=23 Insert k=42 40 20 33 51 63 46,46* 55,55* 10,10* 15,15* 20,20* 27,27* 33,33* 37,37* 40,40* 51,51* 97,97* 63,63* Need overflow page Need overflow page 41,41* Need overflow page

  11. 48,48* 42,42* 23,23* Section 7 # 11 40 Deleting 51 Deleting 42 Deleting 97 20 33 51 63 46,46* 55,55* 10,10* 15,15* 20,20* 27,27* 33,33* 37,37* 40,40* 51,51* 97,97* 63,63* 41,41* • Note that 51 appears in index levels, but not in leaf!

  12. Index Entries (“Direct search set or index set”) Data Entries ("Sequence set") Section 7 # 12 B+ Tree: The Most Widely Used Index • keeps tree height-balanced. • Minimum 50% occupancy (except for root). Each node contains m entries, where d m  2d. • d is called the degree or order of the index. • Supports equality and range-searches efficiently.

  13. Section 7 # 13 Example B+ Tree (d=2) Search for 5 Search for 15 Search begins at root, key comparisons direct it to a leaf (similar to ISAM except use  comparisons to keys and take left pointer iff < lowest key). Search for all data entries  24 Root 30 13 17 24 Leaves are doubly linked for fast sequential <, , , > search 39* 3* 5* 19* 20* 22* 24* 27* 38* 2* 7* 14* 16* 29* 33* 34* 5* 15 is not in the file!

  14. Section 7 # 14 Example B+ Tree (contd.) • Search for all data entries < 23 • (note, this is the reason for the double linkage). Root 30 13 17 24 39* 20* 3* 5* 19* 22* 24* 27* 38* 2* 7* 14* 16* 29* 33* 34*

  15. Section 7 # 15 Inserting a Data Entry into a B+ Tree • Find correct leaf L. • Put data entry in L. • If L has enough space, done! • Else, must splitL (into L and a new node L2) • Redistribute entries, copy up (promote)middle key. • middle value which was promoted and is now the anchor key for L2). • This can happen recursively (e.g., if there is no space for the promoted middle value in the inode to which it is promoted) • To split inode, redistribute entries evenly, but push up (promote)middle key. • So promote means Copy up at leaf; Move up at inode. • Splits “grow” tree • only a root split increases height. • Only tree growth possible: wider or 1 level taller at top.

  16. Entry to be inserted in parent node. (Note that 17 is moved up and only appears once in the index. Contrast this with a leaf split.) 5 17 3* 5* 2* 7* 5 24 13 30 30 13 17 24 5 is promoted to parent node. 39* 38* 3* 5* 19* 20* 22* 24* 27* 2* 7* 14* 16* 29* 33* 34* 2* 5* 3* 7* 8* (Note that 5 is s copied up and continues to appear in the new leaf node, L2, as its anchor value.) Section 7 # 16 Inserting 8* No room for 5, so split and move 17 up. No room for 8, so split. Observe how minimum occupancy is guaranteed in both leaf and index pg splits. • Note difference between copy-up (leaf)and move-up (inode)

  17. Root 30 13 17 24 39* 38* 3* 5* 19* 20* 22* 24* 27* 2* 7* 14* 16* 29* 33* 34* Root 17 24 5 13 30 After Inserting 8* 39* 2* 3* 5* 7* 8* 19* 20* 22* 24* 27* 38* 29* 33* 34* 14* 16* Section 7 # 17 B+ Tree Before Inserting 8* Note height_increase, balance and occupancy maintenance.

  18. Section 7 # 18 Deleting a Data Entry from a B+ Tree • Start at root, find leaf L where entry belongs. • Remove the entry. • If L is at least half-full, done! • If L has only d-1 entries, • Try to re-distribute, borrowing from sibling (adjacent node with same parent as L). • If re-distribution fails, mergeL and a sibling. • Merge could propagate to root, and therefore decreasing height.

  19. Root 17 17 Then Deleting 19*, 20* Root 24 5 13 30 27 5 13 30 39* 2* 3* 19* 20* 22* 24* 27* 5* 7* 8* 38* 29* 33* 34* 14* 16* 39* 2* 3* 5* 7* 8* 22* 24* 27* 29* 38* 33* 34* 14* 16* Section 7 # 19 Example Tree After Inserting 8* • Deleting 19* is easy. • Deleting 20* is done with re-distribution of 24* (and revision of anchor value (from 24 to 27) in inode.

  20. 27 17 30 5 13 17 39* 2* 3* 27* 29* 5* 7* 8* 22* 38* 34* 33* 14* 16* 27 30 5 13 39* 2* 3* 5* 7* 8* 22* 24* 27* 29* 38* 34* 33* 14* 16* Root 5 13 17 30 3* 39* 2* 5* 7* 8* 22* 38* 27* 33* 34* 14* 16* 29* Section 7 # 20 ... And Then Deleting 24* • Must merge. • Observe `toss’ of index entry, 27, now that inode is below min occupancy so merge it with its sibling • and index entry, 17 can be `pulled down’ (sibling merge, followed by pull-down)

  21. Section 7 # 21 Multidimensional Index Multidimensional data almost always requires multidimensional indexing for effective access. One dimensional indexes assume a single search column, attribute (search key) which can be a composite column or key. Data structures, that support queries into multidimensional data specifically, fall in two categories: 1. Hash-table-like (e.g., Grid files and Partitioned Hash Functions) 2. Tree-like, eg, multi-key indexes, kd-trees, quad-trees (for sets of points); R-trees (for sets of regions as well as sets of points) ), Predicate-trees (P-trees) for vertical compressed, representations of data

  22. If vertical grid lines are drawn at age=40, age=65, horizontal at SAL=90K, SAL=224K The points are hashed by ranges: 40 56 400K 380K 360K 340K 320K 300K 280K 260K 240K 220K 200K 180K 160K 140K 120K 100K 80K 60K 40K 20K 0K 0 10 20 30 40 50 60 70 80 90 100  AGE * * * * * Grid hash function age sal points range range 0-39 0-89K (24,60) 40-55 0-89K (46,60) (50,80) 40-55 90-223K (50,100) (50,120) 56-99 90-223K (70,100) (84,140) 0-39 224-400K (30,260) (26,400) 40-55 224-400K (44,360) (50,280) 56-99 224-400K (60,260) * * * * * * * Section 7 # 22 Hash-like Structures for Multidimensional e.g., Data Grid Files Partition the POINTS in the space into a grid. In each dimension "grid lines" partition space into stripes. Points that fall directly on a grid line belong to the stripe above or to the right of it. Example:12 customer(age,salary) 2-dimensional data records. (age,salary): (24,60) (46,60) (50,80) (50,100) (50,120) (70,100) (84,140) (30,260) (26,400) (44,360) (50,280) (60,260) Inserting into Grid files: If there is room, insert, else (two methods) 1. add overflow block and chain it to the primary block, or 2. reorganize the structure by adding or moving grid lines (similar to dynamic hashing) A problem with Grid files is that the number of buckets grows exponentially with dimension and the grid may become sparse.

  23. Partitioined hash function keypoints 00 0 (24,060) (46,060) (26,400) 0 0 1(84,140) 0 1 0(60,260) 0 1 1 (44,360) 10 0(50,080) 10 1(50,100) (50,120) (70,100) 1 1 0(30,260) (50,280) 11 1 Section 7 # 23 Hash-like Structures for Multidimensional e.g., Partitioned hash Files is a sequence of hash functions, h=(h1,...hn) such that hi produces the ith segment of bits in the hash key, that is, h(a) is the concatenation of bit subsequences, h1(a)h2(a)..hn(a). Example: The data file is CUSTOMER(AGE,SAL) consisting again of (24,60) (46,60) (50,80) (50,100) (50,120) (70,100) (84,140) (30,260) (26,400) (44,360) (50,280) (60,260) Use 2 hash functions and 3 bits, the 1st bit is for age with hash function, mod2(tens_digit of age) and the last 2 bits are for salary with hash function, mod4(hundreds_digit of sal) The lookup table is:

  24. Section 7 # 24 Tree-like Structures for Multidimensional e.g., Multi-key Index /|--> / |--> Index on .--> < |--> 1st attr / \ |.. /|/ \|--> / | / | /|--> / | / |--> --> < |----> < |--> \ | \ |.. \ |\ \|--> \ | \ \| \ /|--> \ \ / |--> \ `>< |--> \ \ |.. \ \|--> \ `-> . . . indexes on 2nd attr Assume several attributes representing "dimensions" of the data points (data cube tuples) - uses a multi-level index, e.g., suppose there are 2 attributes: Provides a second level of Indexes on 2nd attribute to all tuples with same 1st attribute value Take the (age, salary) points again (24,60) (24,260) (24,400) (50,80) (50,100) (50,120) (50,280) (60,100) (60,260) (84,140) . - - - - - - - - - - -> (24,060) / .- - - - - - -> (24,060) / / .- - - -> (24,400) ___/_________/______/ .--> |_60_|_260_|_400_____| / .- - - - - - -- - - - - -> (50,080) age / ____/________________ 24----' .-> |_80_|_100|_120_|_280_|- - - --> (50,280) 50-------' \ `- - - - - - - --> (50,120) 60. _______ `- - - - - - - - - - -> (50,100) 84 `-->|100|260|- - - - - - - - - - - - - --> (60,260) \ `- - - - - - - - - - - - - - - - --> (60,100) \ _____ `- >|_140_|- - - - - - - - - - - - - - --> (84,140)

  25. Section 7 # 25 Tree-like Structures for Multidimensional e.g., k dimensional (kd tree) Index Interior nodes have (Attribute, Value, LowPointer, HighPontr) - Value is a value which splits data points - The example below will show (a, V, down, up) with pointers going down for LowPointer and up for HighPointer. (goes up on greater or equal actually). - Attributes used for different levels are different and ROTATE among the dimensions (round robin). - The leaves are blocks of records (assume data blocks hold 2 records, i.e., the blocking factor, bfr, is 2). - to search: decide along the tree until you reach a leaf (going up on greater or equal) - to insert: decide along the tree until you reach the proper leaf if there is room there, insert; else split the block and divide its contents according to the appropriate attribute (next one in the rotation). Example: (insert into kd-tree in this order using age first then salart, sal): age,sal (50,80) (84,140) (30,260) (44,360) (50,120) (70,100) (24,60) (26,400) (50,280) (46,60) (60,260) (50,100) insert the first 2 pairs (no tree yet, since just 1 leaf block): 50, 80 84, 140 age sal 30, 260 (leaf is full so split it and divide the contents by sal=150) 30,260 sal / { ,150} < \ 50,80 84,140

  26. Section 7 # 26 Tree-like Structures for Multidimensional e.g., k dimensional (kd tree) Index continued 30,260 sal / { ,150} < \ 50,80 84,140 age sal 44,360 (leaf is not full so insert) 30,260 44,360 sal / { ,150} < \ 50,80 84,140 age sal 50,120 (leaf is full so split, divide contents by age=55) 30,260 44,360 sal / { ,150} < \ 84,140 \age / { 55, } < \ 50,80 50,120

  27. Section 7 # 27 age sal 50,120 30,260 44,360 sal / { ,150} < \ 84,140 \age / { 55, } < \ 50,80 50,120 Tree-like Structures for Multidimensional e.g., k dimensional (kd tree) Index continued age sal 70,100 (leaf is not full so insert) 30,260 44,360 sal / { ,150} < 84,140 \ 70,100 \age / { 55, } < \ 50,80 50,120 age sal 24,060 (leaf is full so split, divide by sal=75) 30,260 44,360 sal / { ,150} < 84,140 \ 70,100 \age / { 55, } < \ 50,080 \ 50,120 \ sal / { ,75)< \ 24,060

  28. Section 7 # 28 30,260 44,360 sal / { ,150} < 84,140 \ 70,100 \age / { 55, } < \ 50,080 \ 50,120 \ sal / { ,75)< \ 24,060 Tree-like Structures for Multidimensional e.g., k dimensional (kd tree) Index continued age sal 50,280 (leaf full split, div by sal=300) 44,360 sal / (300, }< age / \ { 28, }< 30,260 / \ 50,280 / 26,400 / sal / { ,150} < 84,140 \ 70,100 \age / { 55, } < \ 50,080 \ 50,120 \ sal / { ,75)< \ 24,060 age sal 26,400 (leaf full split, div by age=28) 30,260 44,360 age / { 28, }< / \ / 26,400 / sal / { ,150} < 84,140 \ 70,100 \age / { 55, } < \ 50,080 \ 50,120 \ sal / { ,75)< \ 24,060

  29. Section 7 # 29 Tree-like Structures for Multidimensional e.g., k dimensional (kd tree) Index continued 44,360 sal / (300, }< age / \ { 28, }< 30,260 / \ 50,280 / 26,400 / sal / { ,150} < 84,140 \ 70,100 \age / { 55, } < \ 50,080 \ 50,120 \ sal / { ,75)< \ 24,060 age sal 46,060 (leaf not full so insert) 44,360 sal / (300, }< age / \ { 28, }< 30,260 / \ 50,280 / 26,400 / sal / { ,150} < 84,140 \ 70,100 \age / { 55, } < \ 50,080 \ 50,120 \ sal / { ,75)< \ 24,060 46,060

  30. Section 7 # 30 Tree-like Structures for Multidimensional e.g., k dimensional (kd tree) Index continued 44,360 sal / (300, }< age / \ { 28, }< 30,260 / \ 50,280 / 26,400 / sal / { ,150} < 84,140 \ 70,100 \age / { 55, } < \ 50,080 \ 50,120 \ sal / { ,75)< \ 24,060 46,060 age sal 60,260 (leaf full split by age=40) 44,360 sal / (300, }< age / \ 30,260 { 28, }< \ age / / \ { 40, }< / 26,400 \ / 50,280 sal / 60,260 { ,150} < 84,140 \ 70,100 \age / { 55, } < \ 50,080 \ 50,120 \ sal / { ,75)< \ 24,060 46,060

  31. Section 7 # 31 44,360 sal / (300, }< age / \ 30,260 { 28, }< \age / / \ { 40, }< / 26,400 \ / 50,280 sal / 60,260 { ,150} < 84,140 \ 70,100 \age / { 55, } < \ 50,080 \ 50,120 \ sal / { ,75)< \ 24,060 46,060 Tree-like Structures for Multidime.g., k dim (kd tree) Index continued age sal 50,100 (full split age=50 full again split sal=90) 44,360 sal / (300, }< age / \ 30,260 { 28, }< \age / / \ { 40, }< / 26,400 \ / 50,280 sal / 60,260 { ,150} < 84,140 50,120 \ 70,100 50,100 \age / sal / { 55, } < { , 90 }< \ age / \ \ { 50, }< 50,080 \ sal / \ { ,75)< \ 24,060 46,060

  32. 400K 380K 360K 340K 320K 300K 280K 260K 240K 220K 200K 180K 160K 140K 120K 100K 80K 60K 40K 20K 0K 0 10 20 30 40 50 60 70 80 90 100  AGE * * Section 7 # 32 Tree-like Structures for Multidimensional datasets e.g., Quad tree indexes - Interior nodes (Inodes) correspond to rectangulars in 2-D (more generally, they can be constructed to represent hypercubes higher dimensional space) - If the number of points in the rectangle fits in a block, it's a leaf, else the rectangle is treated as interior node with children corresponding to its 4 quadrants. - to insert into the quad treee index: search to find the proper leaf; if there's room, insert; else split node into 4 quadrants, divide contents appropriately. Example: Build the Quad-tree index as it would develop, assuming (age,sal) arrive in this order: age,sal (24,60) (46,60) (50,80) (50,100) (50,120) (70,100) (84,140) (30,260) (26,400) (44,360) (50,280) (60,260) Insert (24,60) (46,60) The only leaf node is: age sal 24,060 46,060

  33. 400K 380K 360K 340K 320K 300K 280K 260K 240K 220K 200K 180K 160K 140K 120K 100K 80K 60K 40K 20K 0K 0 10 20 30 40 50 60 70 80 90 100  AGE * * * Section 7 # 33 Tree-like Structures for Multidim datasets e.g., Quad tree indexes insert age sal 50, 080 (leaf full split (e.g., at age=50 and sal=200) divide contents by quadrant .-NW / /---NE age,sal / {50,200} < \ \---SW 24,060 \ 46,060 \ `SE 50,080 24,060 46,060

  34. 400K 380K 360K 340K 320K 300K 280K 260K 240K 220K 200K 180K 160K 140K 120K 100K 80K 60K 40K 20K 0K 0 10 20 30 40 50 60 70 80 90 100  AGE * * * Section 7 # 34 insert 50, 120full split SE at 75,100 .-NW / /---NE age,sal / {50,200} < \ \---SW 24,060 \ 46,060 \ `SE(75,100)< Tree-like Structures for Multidim datasets e.g., Quad tree indexes insert 50, 100 (not full insert .-NW / /---NE age,sal / {50,200} < \ \---SW 24,060 \ 46,060 \ `SE 50,080 50,100 .-NW / /---NE age,sal / {50,200} < \ \---SW 24,060 \ 46,060 \ `SE 50,080 .-NW 50,100 / 50,120 /---NE / < \ \---SW 50,080 \ \ `SE ETC. * *

  35. Thus the R-tree has a root and 1 leaf: ( (0,0), (100,90) ) (corners of outer red region) road1 road2 house1 school house2 pipeline (a full leaf with 6 objects) Section 7 # 35 Tree-like Structures for Multidim datasets: Region tree (Rtree) indexes Assume a leaf can hold 6 regions (bfr=6) and that the 6 regions or objects above are together on 1 leaf block, whose region is shown as the outer red rectangle - inodes of an R-tree correspond to interior regions, (which can be overlapping) (usually regions are rectangles, tho, not necessarily) - R-tree regions have subregions that represent the contents of their children - And the subregions need not cover the region they subdivide (but all data must be within a subregion) Example, Consider the spatial image: Example: Consider the spatial image: 100________________________________________________________ | | | | | | | | | .---------. | | | | | | | school | | | |_________| | | | | | |---------------------------. | | road1 | .-------. | |---------------------------| |house2 | | | |r | |_______| | | .------.________ |o_|_____________________________| | |house1|________ |a_|________pipeline_____________| | |______| |d | | | |2 | | | | | | | | | | 0 `-------------------------------------------------------' 0 100

  36. (0,0),(60,50) (20,20),(100,80) POP Now suppose a local cellular phone company adds a POP as shown. Section 7 # 36 (0,0),(100,90) Rtree indexes cont. road1 | road2 | house1 school | house2 | pipeline POP 100________________________________________________________ | | | | | | | | | .---------. | | | | | | | school | | | |_________| | | | | | |---------------------------. | | road1 | .-------. | |---------------------------| |house2 | | | |r | |_______| | | .------.________ |o_|_____________________________| | |house1|________ |a_|________pipeline_____________| | |______| |d | | | |2 | | | | | | | | | | 0 `-------------------------------------------------------' 0 100 Split the full leaf putting 4 objects in 1 new leaf and 3 in the other (minimize overlap and split ~evenly)

  37. (80,50) house2 house3 Now suppose we insert house3 house3 Section 7 # 37 (0,0),(100,90) Rtree indexes cont. (0,0),(60,50) (20,20),(100,80) road1 | road2 | house1 school | house2 | pipeline POP Note that house2 is in both regions. 100________________________________________________________ | | | | | | | | | .---------. | | | | | | | school | | | |_________| | | | | | |---------------------------. | | road1 | .-------. | |---------------------------| |house2 | | | |r | |_______| | | .------.________ |o_|_____________________________| | |house1|________ |a_|________pipeline_____________| | |______| |d | | | |2 | | | | | | | | | | 0 `-------------------------------------------------------' 0 100 Since house3 is not in either region (and both have room) we must decide to expand one of them. If we pick the green, expanding it to (0,20), (100,80) we add 1600 units2 If we pick the purple, expanding it to ((0,0), (80,50) we add 1000 units2 so to minimize we pick the purple. POP

  38. INSERT JAY | LA | 25 | STAR (assigned RRN=1 to it) b3 J< INSERT JON | LA | 45 | HOOD (assigned RRN=2 1st letters are teh same (J) so the common pat is embedded in the root 2nd letters: A and O, bit 3 (zero-based count) is 1st difference (and makes the decision) 0123 4567 bit positions DBCDIC for A=1100 0001 EBCDIC for O=1101 0110 ON 2 Section 7 # 38 Binary Radix Tree Index (AKA a trie) an additional index structure (e.g., used in IBM AS/400 systems) Similar to B-tree, except - only the common parts of key values are embedded in inodes - a single bit is used to make the navigation direction decision at each level (0 for up and 1 for down). (zero-based bit positions are used) Example: (in this example, the tree structure is being built left-to-right) Starting with an empty structure, nam_trie_INDEX CUSTOMER FILE RRN nam loc age job name part RRN 1 | JAY | LA | 25 | STAR JAY 1 2 | JON | LA | 45 | HOOD

  39. JAY 1 b2 A< N 3 b3 J< Y 1 ON 2 b3 J< INSERT JAN | RO | 93 | DOC (assigned RRN=3 0123 4567 bit positions DBCDIC for N=1101 0101 EBCDIC for Y=1110 0000 ON 2 Section 7 # 39 Binary Radix Tree Index (AKA a trie) cont. nam_trie_INDEX CUSTOMER FILE RRN nam loc age job 1 | JAY | LA | 25 | STAR 2 | JON | LA | 45 | HOOD 3 | JAN | RO | 93 | DOC

  40. b2 < b2 A< N 3 b3 J< SUE 2 Y 1 ON 2 INSERT SUE | RO | 16 | PROG (assigned RRN=4 0123 4567 bit positions DBCDIC for J= 1101 0001 EBCDIC for Y= 1110 0010 Section 7 # 40 Binary Radix Tree Index (AKA a trie) cont. nam_trie_INDEX CUSTOMER FILE RRN nam loc age job 1 | JAY | LA | 25 | STAR 2 | JON | LA | 45 | HOOD 3 | JAN | RO | 93 | DOC 4 | SUE | RO | 16 | PROG

  41. Section 7 Thank you.

More Related