CS511 Design of Database Management Systems

Lecture 07: R-Trees. Kevin C. Chang.

Presentation Transcript


  1. CS511 Design of Database Management Systems. Lecture 07: R-Trees. Kevin C. Chang

  2. Two Essential Techniques for Efficient Processing of Declarative Queries ??

  3. Access Methods: Indexing • What is an index? • partition data into buckets • label each bucket • use the label to determine whether a bucket is relevant to a query • One-dimensional indexing: • hashing • B-trees/B+-trees (we’ll simply call them B-trees) • ?? What are the differences?

  4. Puzzles: Why B and R? • What do B-trees and R-trees stand for? • guess: Binary trees, Reasonable trees?

  5. Multidimensional Data • Values may be multi-dimensional, with extent

  6. Multidimensional Data • GIS (geographic information systems) databases • e.g.: map databases • CAD databases • circuit layout (of rectangles) • Data cube systems • sales data (date, time, store, item, price) • a tuple is a point in multi-dimensional space • queries typically retrieve a “cube” • e.g.: date during 2000, store=champaign, price > 100

  7. ?? Multidimensional Queries • What “kinds” of queries to ask?

  8. Multidimensional Queries • Point queries: • constrain values of some dimensions • Range queries: • constrain ranges of some dimensions • Nearest neighbor: • e.g.: the closest city to Champaign • Where-am-I queries: regions related to a query point or region • given a point, find out where it is located • given a region, find overlapping, containing, contained regions • Indexes are designed to serve queries! • check each index method w.r.t. typical queries

  9. ?? B-Tree for Nearest Neighbor Queries • Object represented by (X,Y) • X indexed by a B-tree • Y indexed by a B-tree • Given an object (x0, y0), find its NN object: • let d = 1 • find objs in the box from (x0-d, y0-d) to (x0+d, y0+d) • if there are some objs, find the NN among them --> done! • else d = d + delta; repeat • ?? Problems?

  10. ?? B-Tree for Nearest Neighbor Queries • Intersecting multiple indexes is not efficient • Need to guess the distance d and the step delta • too small: no objs in the bounding box • too large: may be too many objs in the bounding box • May actually miss the NN point! • a point in a corner of the box can be farther than d from the query, while a closer point lies just outside the box --> wrong answer!
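
The "may miss the NN" problem can be reproduced with a short sketch. It emulates the two B-tree range scans with a plain filter over a list; the function names `box_nn` and `true_nn` and the sample points are hypothetical, chosen only to trigger the failure.

```python
import math

def box_nn(points, q, d=1.0, delta=1.0):
    """Expanding-box search as on the slide: grow the box until it is
    non-empty, then return the nearest point found inside it.
    Assumes `points` is non-empty (otherwise the loop never ends)."""
    x0, y0 = q
    while True:
        # Stands in for intersecting a range scan on X with a range scan on Y.
        in_box = [(x, y) for (x, y) in points
                  if x0 - d <= x <= x0 + d and y0 - d <= y <= y0 + d]
        if in_box:
            return min(in_box, key=lambda p: math.dist(p, q))
        d += delta

def true_nn(points, q):
    """Brute-force nearest neighbor, used as the reference answer."""
    return min(points, key=lambda p: math.dist(p, q))

q = (0.0, 0.0)
points = [(0.9, 0.9),   # inside the d=1 box, but at distance ~1.27 (box corner)
          (1.1, 0.0)]   # just outside the box, at distance 1.1

found = box_nn(points, q)
exact = true_nn(points, q)
print(found, exact)
```

Here the box search returns (0.9, 0.9) although (1.1, 0.0) is closer. One common repair is a second range scan whose half-width equals the distance to the candidate just found, which adds yet another pair of index scans.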

  11. Indexing Techniques: How to reach the right buckets? • Hash-like schemes: lookup by k → h(k) • grid files • partitioned hash functions • Tree-like schemes: lookup by tree traversal • multiple-key indexes • kd-trees • quad trees • R-trees • Demo: http://www.cs.umd.edu/~brabec/quadtree

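  12. Grid Files [Figure: grid file over Age (x-axis, ticks 0, 30, 40, 70, 100) and Salary (y-axis, ticks 0, 20K, 80K, 150K, 200K, 250K); data points, drawn as *, are scattered unevenly over the grid cells]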

  13. Grid Files • Each region corresponds to a bucket • ? why is this hash-based indexing? • Good for partial-match, range, NN queries • Buckets may become empty or overfull over time • points are not distributed uniformly • Number of buckets grows exponentially with the number of dimensions D • Reorganization requires repartitioning the space • need to move the grid lines
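
A minimal sketch of why a grid file counts as hash-like: the bucket is computed directly from the key values, with no tree traversal. The class name `GridFile`, the grid-line positions (loosely based on the axis ticks of the figure), and the sample points are assumptions for illustration.

```python
from bisect import bisect_right
from collections import defaultdict

class GridFile:
    """Toy two-attribute grid file: grid lines split each axis into stripes,
    and each (age stripe, salary stripe) cell is one bucket."""

    def __init__(self, age_lines, salary_lines):
        self.age_lines = age_lines        # sorted split positions on the age axis
        self.salary_lines = salary_lines  # sorted split positions on the salary axis
        self.buckets = defaultdict(list)

    def cell(self, age, salary):
        # The "hash function": key values map straight to a cell index.
        return (bisect_right(self.age_lines, age),
                bisect_right(self.salary_lines, salary))

    def insert(self, age, salary):
        self.buckets[self.cell(age, salary)].append((age, salary))

    def range_query(self, age_lo, age_hi, sal_lo, sal_hi):
        # Visit only the cells that the query rectangle touches.
        i_lo, j_lo = self.cell(age_lo, sal_lo)
        i_hi, j_hi = self.cell(age_hi, sal_hi)
        hits = []
        for i in range(i_lo, i_hi + 1):
            for j in range(j_lo, j_hi + 1):
                hits += [(a, s) for (a, s) in self.buckets.get((i, j), [])
                         if age_lo <= a <= age_hi and sal_lo <= s <= sal_hi]
        return hits

# Salary in thousands; line positions are illustrative assumptions.
g = GridFile(age_lines=[30, 40, 70], salary_lines=[20, 80, 150, 200])
for point in [(25, 60), (45, 60), (50, 75), (50, 100), (85, 140)]:
    g.insert(*point)

print(g.cell(25, 60))
print(sorted(g.range_query(40, 60, 50, 110)))
```

With D attributes and k stripes per axis the cell table has k^D entries, which is the exponential growth the slide warns about.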

  14. Partitioned Hash Functions • Hash function produces a K-bit bucket number • Bits are partitioned among attributes • buckets identified by b1b2b3b4b5b6 • age produces b1b2b3 • salary produces b4b5b6 • ? what is different from grid files? • can simulate a grid file (a generalization of grid files!) • range/NN queries? • bucket utilization?
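
The bit layout on this slide can be sketched as follows. The two per-attribute hash functions are hypothetical stand-ins (any functions producing 3 bits each would do); only the 3 + 3 bit split comes from the slide.

```python
AGE_BITS = 3
SALARY_BITS = 3

def hash_age(age):
    # Hypothetical 3-bit hash of age (b1b2b3).
    return age % (1 << AGE_BITS)

def hash_salary(salary_k):
    # Hypothetical 3-bit hash of salary in thousands (b4b5b6).
    return (salary_k // 10) % (1 << SALARY_BITS)

def bucket(age, salary_k):
    # Concatenate the two bit groups into one 6-bit bucket number.
    return (hash_age(age) << SALARY_BITS) | hash_salary(salary_k)

def candidate_buckets(age=None, salary_k=None):
    """Partial match: an unspecified attribute leaves its bits free,
    so every combination of those bits must be visited."""
    age_parts = [hash_age(age)] if age is not None else range(1 << AGE_BITS)
    sal_parts = [hash_salary(salary_k)] if salary_k is not None else range(1 << SALARY_BITS)
    return [(a << SALARY_BITS) | s for a in age_parts for s in sal_parts]

b = bucket(45, 60)
print(format(b, "06b"))
print(candidate_buckets(age=45))
```

A partial-match query on age alone touches 8 of the 64 buckets. Because these hashes scatter nearby values, a range or NN query gets no such pruning, which is one way to read the slide's "range/NN queries?" question.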

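  15. Multiple-Key Indexes • Each level is an index for one attribute • Partial match works if the specified attributes follow the index order • e.g., what if only salary is specified? • NN queries? [Figure: two-level index, with an age index on top pointing into per-age salary indexes]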

  16. kd-Tree: k-dimensional Search Tree • binary tree • interior nodes: attr, div value • attributes alternate at different levels • binary search to find records [Figure: example kd-tree over (age, salary) records; interior nodes (Salary, 150), (Age, 60), (Age, 47), (Salary, 80), (Salary, 300), (Age, 38); leaf records (50, 275), (70, 110), (60, 260), (85, 140), (50, 100), (50, 120), (30, 260), (25, 400), (25, 60), (45, 60), (45, 350), (50, 75)]
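
A small sketch of a kd-tree in the style described here: interior nodes hold (attribute, dividing value), attributes alternate by level, and records live in leaves. The median-based choice of dividing value and the `leaf_size` parameter are assumptions, so the resulting tree will not match the figure's exact shape; the records are the (age, salary) pairs listed with the figure.

```python
def build(points, depth=0, leaf_size=2):
    """Build a kd-tree as nested dicts. Leaves: {"leaf": [...]};
    interior nodes: {"axis", "value", "left", "right"}."""
    if len(points) <= leaf_size:
        return {"leaf": list(points)}
    axis = depth % 2                       # 0 = age, 1 = salary, alternating
    pts = sorted(points, key=lambda p: p[axis])
    value = pts[len(pts) // 2][axis]       # median as dividing value (assumption)
    left = [p for p in pts if p[axis] < value]
    right = [p for p in pts if p[axis] >= value]
    if not left:                           # all ties on this axis: stop splitting
        return {"leaf": pts}
    return {"axis": axis, "value": value,
            "left": build(left, depth + 1, leaf_size),
            "right": build(right, depth + 1, leaf_size)}

def find(node, point):
    """Exact-match lookup: one binary decision per level."""
    while "leaf" not in node:
        node = node["left"] if point[node["axis"]] < node["value"] else node["right"]
    return point in node["leaf"]

def range_search(node, lo, hi):
    """Return all points p with lo[i] <= p[i] <= hi[i] on both axes."""
    if "leaf" in node:
        return [p for p in node["leaf"]
                if all(lo[i] <= p[i] <= hi[i] for i in range(2))]
    axis, value = node["axis"], node["value"]
    hits = []
    if lo[axis] < value:                   # query range reaches the left side
        hits += range_search(node["left"], lo, hi)
    if hi[axis] >= value:                  # query range reaches the right side
        hits += range_search(node["right"], lo, hi)
    return hits

records = [(50, 275), (70, 110), (60, 260), (85, 140), (50, 100), (50, 120),
           (30, 260), (25, 400), (25, 60), (45, 60), (45, 350), (50, 75)]
tree = build(records)
print(find(tree, (45, 60)), find(tree, (26, 61)))
print(sorted(range_search(tree, (40, 50), (60, 130))))
```

A range query may have to follow both children of a node when the query range straddles the dividing value, so unlike exact match it is not a single root-to-leaf path.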

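  17. Quad Trees • Interior nodes represent square regions • with at most M entries • Split into four quadrants if more than M entries [Figure: quad tree partitioning of the Age (0 to 100) by Salary (0 to 200K) plane; data points drawn as *, with denser areas split into smaller quadrants]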
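
The split rule can be sketched as a point quad tree over a square region. The class name `QuadTree`, the `capacity` parameter (playing the role of M), the `min_size` guard against endless splitting on duplicate points, and the sample points are all assumptions for illustration.

```python
class QuadTree:
    """Toy point quad tree over the square [x, x+size) x [y, y+size)."""

    def __init__(self, x, y, size, capacity=2, min_size=1e-6):
        self.x, self.y, self.size = x, y, size
        self.capacity = capacity      # M on the slide
        self.min_size = min_size      # stop splitting below this side length
        self.points = []
        self.children = None          # None while this node is a leaf

    def insert(self, p):
        if self.children is not None:
            self._child_for(p).insert(p)
            return
        self.points.append(p)
        if len(self.points) > self.capacity and self.size > self.min_size:
            self._split()

    def _split(self):
        half = self.size / 2
        # Children in row-major order: index = row * 2 + col.
        self.children = [QuadTree(self.x + dx * half, self.y + dy * half, half,
                                  self.capacity, self.min_size)
                         for dy in (0, 1) for dx in (0, 1)]
        pending, self.points = self.points, []
        for p in pending:
            self._child_for(p).insert(p)

    def _child_for(self, p):
        half = self.size / 2
        col = 1 if p[0] >= self.x + half else 0
        row = 1 if p[1] >= self.y + half else 0
        return self.children[row * 2 + col]

    def leaves(self):
        if self.children is None:
            return [self]
        return [leaf for c in self.children for leaf in c.leaves()]

qt = QuadTree(0, 0, 100, capacity=2)
sample = [(10, 10), (20, 20), (30, 15), (80, 80), (60, 70)]
for p in sample:
    qt.insert(p)
print(len(qt.leaves()), [leaf.points for leaf in qt.leaves()])
```

Splits always cut at the midpoint of the square, regardless of where the data lies, so clustered points produce a deep, unbalanced tree. That is one difference from a kd-tree, whose dividing values can follow the data.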

  18. ?? Quad Tree: In terms of space partitioning: • ? What is different from a grid file? • ? What is different from a kd-tree?

  19. ?? R-Trees: Region Trees • How is it different from others?

  20. ?? R-Trees: Region Trees • B-tree related: • balanced tree (by dynamic restructuring) • guaranteed utilization of nodes • between half and completely full • Support regions • Regions covered by sibling nodes may overlap

  21. R-Trees • Structure like a B-tree, but keys are MBRs (minimum bounding rectangles) • does a B-tree use (1-D) MBRs? e.g., keys: | 57 | 81 | 95 | • An interior node contains a set of regions • Regions can overlap • Regions need not fully cover the entire parent region • A region can be a point or a shape • Utilization of nodes: (unless root) • max capacity: M. ? How to determine M • min capacity: m <= M/2. ? Why?

  22. R-Tree: Demo • http://www.cs.umd.edu/~brabec/quadtree

  23. R-Tree: Where-Am-I Queries • Given point P, return data regions containing P • Start with the root • For each subregion S containing P: • if S is a data region: return S • else: recursively search S
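
The recursive search can be sketched over a hand-built tree. The `Node` layout, the rectangle encoding `(xmin, ymin, xmax, ymax)`, and the region names are hypothetical; a real R-tree would build the nodes through insertion and splitting.

```python
from dataclasses import dataclass, field

def contains_point(rect, p):
    """rect = (xmin, ymin, xmax, ymax); borders count as inside."""
    return rect[0] <= p[0] <= rect[2] and rect[1] <= p[1] <= rect[3]

@dataclass
class Node:
    mbr: tuple                                    # bounding rectangle of this entry
    children: list = field(default_factory=list)  # child entries (interior node)
    data: object = None                           # not None marks a data region

def where_am_i(node, p):
    """Return the data regions whose rectangle contains point p."""
    if not contains_point(node.mbr, p):
        return []                                 # prune this whole subtree
    if node.data is not None:
        return [node.data]
    hits = []
    for child in node.children:                   # siblings may overlap, so
        hits += where_am_i(child, p)              # several may need a visit
    return hits

# Two sibling regions A and B that overlap in the square (4,4)-(6,6).
a = Node((0, 0, 6, 6), [Node((1, 1, 5, 5), data="park"),
                        Node((2, 2, 3, 3), data="lake")])
b = Node((4, 4, 10, 10), [Node((4, 4, 9, 9), data="city"),
                          Node((7, 7, 8, 8), data="mall")])
root = Node((0, 0, 10, 10), [a, b])

print(where_am_i(root, (4.5, 4.5)))
print(where_am_i(root, (2.5, 2.5)))
```

For the point (4.5, 4.5) both subtrees must be searched because the sibling MBRs overlap there. A B-tree lookup never branches this way, which is why R-tree splits try to keep overlap and covered area small.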

  24. R-Tree: Split Node • Objectives: • minimize covered area of the containing region • ? why? • ? Is this the only objective? What else?

  25. R-Tree: Split Node • Finding the minimal-covered-area split is exponential • need to consider all binary partitionings • M+1 entries --> approximately 2^(M-1) candidate groupings (?) • Use heuristics (not exhaustive) • quadratic algorithm • linear algorithm • Demo: node splitting

  26. R-Tree: Quadratic Split • PickSeeds: select two seeds for the two groups • heuristic: the pair that wastes the largest area • waste = area(MBR of the pair) - [area(I) + area(J)] • While entries can still be assigned to either group: PickNext: • heuristic: greedily select the entry with the strongest preference • d1 & d2: area increase if entry E is added to group 1 & 2 • select E with max(|d1 - d2|) • Add E to the group with the least enlargement • Quadratic in M: • PickSeeds considers every pair • ? PickNext will run how many times?
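
The quadratic heuristic can be sketched as below, using the waste and |d1 - d2| rules from the slide. Rectangles are `(xmin, ymin, xmax, ymax)` tuples; the tie-breaking rule (prefer the group with the smaller area), the way the minimum fill `m` is enforced, and the sample entries are assumptions made for this sketch.

```python
def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a, b):
    """Minimum bounding rectangle of two rectangles."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def enlargement(mbr, r):
    """Area increase of `mbr` if rectangle `r` is added to it."""
    return area(union(mbr, r)) - area(mbr)

def pick_seeds(entries):
    """Try every pair and keep the one wasting the most area: O(M^2)."""
    best = None
    for i in range(len(entries)):
        for j in range(i + 1, len(entries)):
            waste = (area(union(entries[i], entries[j]))
                     - area(entries[i]) - area(entries[j]))
            if best is None or waste > best[0]:
                best = (waste, i, j)
    return best[1], best[2]

def quadratic_split(entries, m):
    i, j = pick_seeds(entries)
    g1, g2 = [entries[i]], [entries[j]]
    mbr1, mbr2 = entries[i], entries[j]
    rest = [e for k, e in enumerate(entries) if k not in (i, j)]
    while rest:
        # If a group needs every remaining entry to reach m, hand them over.
        if len(g1) + len(rest) <= m:
            g1 += rest
            break
        if len(g2) + len(rest) <= m:
            g2 += rest
            break
        # PickNext: the entry with the strongest preference for one group.
        e = max(rest, key=lambda r: abs(enlargement(mbr1, r) - enlargement(mbr2, r)))
        rest.remove(e)
        d1, d2 = enlargement(mbr1, e), enlargement(mbr2, e)
        if d1 < d2 or (d1 == d2 and area(mbr1) <= area(mbr2)):
            g1.append(e)
            mbr1 = union(mbr1, e)
        else:
            g2.append(e)
            mbr2 = union(mbr2, e)
    return g1, g2

# M = 4, so a split handles M + 1 = 5 entries; m = 2.
entries = [(0, 0, 1, 1), (0.5, 0.5, 1.5, 1.5), (1, 0, 2, 1),
           (8, 8, 9, 9), (9, 8, 10, 9)]
g1, g2 = quadratic_split(entries, m=2)
print(g1)
print(g2)
```

On this input the two spatial clusters end up in separate groups. Each PickNext call rescans all unassigned entries and there are up to M - 1 such calls, which, together with the all-pairs PickSeeds, gives the quadratic cost.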

  27. R-Tree: Linear Split • PickSeeds: select two seeds for the two groups • heuristic: the pair that is most separated along any dimension • separation = distance between the near sides • normalized by the extent of the whole set along that dimension • While entries can still be assigned to either group: PickNext: • heuristic: take any of the remaining entries • Add it to the group with the least enlargement • ? Why is this linear?
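
A matching sketch of the linear heuristic follows. As above, rectangles are `(xmin, ymin, xmax, ymax)` tuples, and the fallback seeds, tie-breaking, minimum-fill handling, and sample entries are assumptions rather than details taken from the slides.

```python
def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def enlargement(mbr, r):
    return area(union(mbr, r)) - area(mbr)

def linear_pick_seeds(entries):
    """One pass per dimension: find the rectangle with the highest low side
    and the one with the lowest high side, then normalize their separation
    by the extent of the whole set along that dimension."""
    best = None
    idx = range(len(entries))
    for dim in (0, 1):
        lo, hi = dim, dim + 2
        highest_low = max(idx, key=lambda k: entries[k][lo])
        lowest_high = min(idx, key=lambda k: entries[k][hi])
        width = max(e[hi] for e in entries) - min(e[lo] for e in entries)
        if highest_low == lowest_high or width == 0:
            continue
        sep = (entries[highest_low][lo] - entries[lowest_high][hi]) / width
        if best is None or sep > best[0]:
            best = (sep, lowest_high, highest_low)
    return (best[1], best[2]) if best else (0, 1)   # fallback is an assumption

def linear_split(entries, m):
    i, j = linear_pick_seeds(entries)
    g1, g2 = [entries[i]], [entries[j]]
    mbr1, mbr2 = entries[i], entries[j]
    rest = [e for k, e in enumerate(entries) if k not in (i, j)]
    for k, e in enumerate(rest):          # PickNext: simply the next entry
        remaining = len(rest) - k
        if len(g1) + remaining <= m:      # g1 needs all that is left
            g1 += rest[k:]
            break
        if len(g2) + remaining <= m:      # g2 needs all that is left
            g2 += rest[k:]
            break
        if enlargement(mbr1, e) <= enlargement(mbr2, e):
            g1.append(e)
            mbr1 = union(mbr1, e)
        else:
            g2.append(e)
            mbr2 = union(mbr2, e)
    return g1, g2

entries = [(0, 0, 1, 1), (0.5, 0.5, 1.5, 1.5), (1, 0, 2, 1),
           (8, 8, 9, 9), (9, 8, 10, 9)]
g1, g2 = linear_split(entries, m=2)
print(g1)
print(g2)
```

Seed selection scans the entries a constant number of times per dimension, and each remaining entry is placed once without rescanning the others, so the work grows linearly with M (for a fixed number of dimensions).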

  28. R-Tree: Other Operations • Insertion: • ChooseLeaf with least enlargement • SplitNode • AdjustTree: • propagate node splits/adjust MBRs upward • ? what is different from a B-tree? • Deletion: • seek and destroy • CondenseTree: (in principle) • propagate node deletions/adjust MBRs upward • reinsert orphaned entries • in practice, often just delete
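
The ChooseLeaf and AdjustTree steps can be sketched without the split case. Nodes are plain dicts here, which is an assumption of the sketch, as are the tie-break on smaller area and the sample tree; overflow handling (SplitNode) is deliberately left out.

```python
def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def enlargement(mbr, r):
    return area(union(mbr, r)) - area(mbr)

def insert(root, rect):
    """Insert `rect` (xmin, ymin, xmax, ymax) into the subtree under `root`.
    Returns the leaf that received the entry. No SplitNode: the leaf is
    assumed to have room."""
    path = [root]
    node = root
    while not node["leaf"]:
        # ChooseLeaf: follow the child needing the least enlargement,
        # breaking ties by smaller current area.
        node = min(node["children"],
                   key=lambda c: (enlargement(c["mbr"], rect), area(c["mbr"])))
        path.append(node)
    node["entries"].append(rect)
    for n in path:
        # AdjustTree (without splits): grow every MBR on the path.
        n["mbr"] = union(n["mbr"], rect)
    return node

leaf_a = {"leaf": True, "mbr": (0, 0, 4, 4), "entries": [(0, 0, 4, 4)]}
leaf_b = {"leaf": True, "mbr": (6, 6, 10, 10), "entries": [(6, 6, 10, 10)]}
root = {"leaf": False, "mbr": (0, 0, 10, 10), "children": [leaf_a, leaf_b]}

chosen = insert(root, (4, 4, 5, 5))
print(chosen is leaf_a, leaf_a["mbr"], root["mbr"])
```

Unlike a B-tree, where the key order dictates a single correct leaf, here any leaf could legally hold the new rectangle; the least-enlargement rule is a heuristic for keeping later searches cheap.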

  29. R-trees in PostgreSQL • How are various operations implemented? • split scheme? • deletion?

  30. R-Tree Variants • R*-trees: • new and improved split algorithm: n log(n) • forced reinserts of extreme entries • decrease overlap of sibling regions • regions are packed better • splits happen less often • R+-trees: • instead of allowing overlapping regions, insert an object into multiple nodes • speeds up search, slows down insert/delete/update

  31. What’s Next? • Unified trees • implement 6 methods then you get an “X” tree!

  32. End Of Talk

  33. Quote: R-tree Degrades to Linked List • Lin et al., "The TV-tree: An index structure for high-dimensional data," VLDB Journal, 1995: • an R-tree does not work if a single feature vector requires more storage space than a disk page, since the R-tree will have a fanout of 1, reducing to a linked list • --> key compression for high-dimensional feature vector keys
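
The fanout argument is simple arithmetic, sketched below. The page size, coordinate width, and pointer width are assumptions picked for illustration, not figures from the quoted paper.

```python
def entry_bytes(dims, coord_bytes=8, pointer_bytes=8):
    """One R-tree entry: an MBR (a low and a high coordinate per dimension)
    plus a child pointer. Sizes are illustrative assumptions."""
    return 2 * dims * coord_bytes + pointer_bytes

def fanout(page_bytes, dims):
    """How many entries fit into one disk page."""
    return page_bytes // entry_bytes(dims)

for dims in (2, 64, 256):
    print(dims, entry_bytes(dims), fanout(8192, dims))
```

Under these assumptions an 8 KB page holds 204 two-dimensional entries but only one 256-dimensional entry. With a fanout of 1 every node has a single child, so the "tree" is a chain of pages.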
