
Lecture 6: Indexing: GiST

Sept. 13 2006. ChengXiang Zhai. Most slides are adapted from Kevin Chang’s lecture slides.


Presentation Transcript


  1. Lecture 6: Indexing: GiST Sept. 13 2006 ChengXiang Zhai Most slides are adapted from Kevin Chang’s lecture slides

  2. Search Trees: Previous Approaches • Specialized search trees (yet another tree!): • redundant code: most trees are very similar • concurrency control, logging/recovery: tricky • Trees for extensible data types: • B-tree for any data with linear ordering • e.g.: index titles (alph. ordering) with B-tree • problem: does not support natural queries • e.g.: WHERE book.title has “database”?

  3. GiST: Generalized Search Tree • General: cover B+-tree, R-tree, etc… • Extensible: • domain-specific data types & queries definable • Easy to extend: six “key methods” for a new tree • Efficient: can match specialized trees • Reusable: concurrency, recovery for indexes

  4. Example: Indexing Book Titles • Titles of 4 books: • T1 = “database optimization” • T2 = “web database” • T3 = “complexity of optimization algorithms” • T4 = “algorithms and complexity” • Indexable with (extensible) B+-tree? • linear ordering: T4, T3, T1, T2 • Note: Just an example for demonstrating GiST! • What we will do to index “titles” is not the best and typical way to index “textual data”! --- No notion of fuzzy “relevance”. • stay tuned for text and web search

  5. Queries on Titles • Indexing is to accommodate query processing • What “predicates” to ask about titles?

  6. Queries on Titles • Equality predicates: • WHERE book.title = “web databases” • Containment predicates: • WHERE book.title has “web” • Prefix predicates: • WHERE book.title start-with “web” • RegEx predicates: (generalize all the others) • WHERE book.title like “# web # database”

  7. Extensible B+-Tree for Titles • Observations: • indexed values have linear ordering: T4, T3, T1, T2 • keys are simply separators: T4, c, T3, d, T1, w, T2 [Figure: B+-tree with separator keys d, c, w over the leaves T4: alg. …, T3: complexity …, T1: database …, T2: web …]

  8. Using B+-Tree: What’s Wrong? • What predicates can B+-tree support well? • equality, containing, prefix, regex? [Figure: B+-tree with separator keys d, c, w over the leaves T4: alg. …, T3: complexity …, T1: database …, T2: web …]
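
The question on this slide can be made concrete with a small sketch. It assumes the B+-tree leaf level behaves like a sorted list, so Python's `bisect` stands in for the tree descent; the function names `prefix_search` and `contains_search` are illustrative and not from the slides. Prefix (and equality) queries map to one contiguous key range, while a containment query has no such range and falls back to inspecting every title.

```python
import bisect

# The four example titles, kept in sorted order like B+-tree leaves.
titles = sorted([
    "database optimization",
    "web database",
    "complexity of optimization algorithms",
    "algorithms and complexity",
])

def prefix_search(sorted_titles, prefix):
    """Prefix queries cover one contiguous key range, so binary search works."""
    lo = bisect.bisect_left(sorted_titles, prefix)
    # "\uffff" is used as an upper sentinel; assumes titles contain no such character.
    hi = bisect.bisect_left(sorted_titles, prefix + "\uffff")
    return sorted_titles[lo:hi]

def contains_search(sorted_titles, word):
    """Containment matches are scattered across the ordering: scan everything."""
    return [t for t in sorted_titles if word in t.split()]
```

Here `prefix_search(titles, "web")` touches only a slice of the ordering, whereas `contains_search(titles, "database")` has to look at all four titles, which is the weakness the slide is hinting at.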

  9. GiST: Generalizing Balanced Search Trees • GiST is not universal (just reasonable generalization) • balanced tree of <key, ptr> pairs, keys can overlap • GRE test: B-Tree : R-Tree = R-Tree : ________ • What is the key generalization? [Figure: internal nodes (directory) holding key1, key2, …; leaf nodes (linked list)]

  10. The Key Generalization: The Key • Key evolution: 1-D separator --> 2-D MBR --> predicates • R-Tree : B-Tree • generalizing key from 1-D line to 2-D area • bounding range to (minimal) bounding region • GiST : R-Tree • generalizing key from 2-D MBR to “predicates” • a predicate that all values v in subtree will satisfy • B-tree keys: • [k1:k2) --> contains([k1:k2), v) • R-tree keys: • (x1,y1,x2,y2) --> contains((x1,y1,x2,y2), v) • RD-tree keys: • {x1,…,xk} --> subset({x1,…,xk}, v)
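
The three key forms on this slide can be read as three Boolean functions of a key and a value. The sketch below is one possible reading, not code from the lecture: the function names are made up for illustration, R-tree values are assumed to be 2-D points, and the RD-tree predicate is taken to mean "v is a subset of the key set".

```python
def btree_contains(key, v):
    """B-tree key [k1, k2): holds when k1 <= v < k2."""
    k1, k2 = key
    return k1 <= v < k2

def rtree_contains(key, v):
    """R-tree key (x1, y1, x2, y2): holds when point v lies inside the rectangle."""
    x1, y1, x2, y2 = key
    x, y = v
    return x1 <= x <= x2 and y1 <= y <= y2

def rdtree_contains(key, v):
    """RD-tree key {x1, ..., xk}: holds when the set v is a subset of the key set."""
    return set(v) <= set(key)
```

In each case the key is a predicate that every value stored in the subtree satisfies, which is the property GiST generalizes.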

  11. GiST for Title Indexing: Predicates Must first determine predicates: • What query predicates to support? • equality: equal(v, “web db”) • containing: has(v, “web”) • What key predicate to use? • Criteria for choosing key predicates? • What do you suggest?

  12. GiST for Title Indexing: Predicates • Key predicates: Contains(S, v) [Figure: tree with keys SL = {alg, comp, opt} and SR = {db, opt, web}; below them SLL = {alg, comp}, SLR = {comp, opt}, SRL = {db, opt}, SRR = {db, web}; leaves T4: alg. …, T3: complexity …, T1: database …, T2: web …]

  13. GiST: Built-in Tree Operations • Search(root R, predicate q) • Insert(root R, entry E, level l) • Delete(root R, entry E)

  14. GiST: Application-Specific Methods Search: • Consistent(E, q): search subtree E for predicate q? Labeling: • Union(E1, …, En): how to label the union of E1, …, En? Categorization: • Penalty(E1, E2): penalty for inserting E2 in subtree E1 • PickSplit(E1, …, En): how to split into two groups of entries Compression: (storage/time tradeoff) • Compress(E): E --> Ec • Decompress(Ec): --> E’ such that E.p implies E’.p

  15. Search Operation: Consistent Method • Search(root R, predicate q): • traverse subtrees where Consistent true • return leaf entries that are consistent
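
A minimal sketch of this traversal, assuming a two-level tree for the four example titles whose root keys are the SL and SR word sets from the slides. The `Node` class, the `(key, pointer)` entry layout, and the `has_word` helper are illustrative assumptions, not part of the lecture.

```python
class Node:
    def __init__(self, entries, is_leaf):
        self.entries = entries    # list of (key, pointer) pairs
        self.is_leaf = is_leaf    # leaf pointers are title ids, others are child nodes

def search(node, q, consistent):
    """Descend only into entries whose key is consistent with query q."""
    results = []
    for key, ptr in node.entries:
        if not consistent(key, q):
            continue              # prune: nothing below can satisfy q
        if node.is_leaf:
            results.append(ptr)
        else:
            results.extend(search(ptr, q, consistent))
    return results

def has_word(key, word):
    """Consistent for has(v, word) when keys are word sets."""
    return word in key

left = Node([({"alg", "comp"}, "T4"), ({"comp", "opt", "alg"}, "T3")], True)
right = Node([({"db", "opt"}, "T1"), ({"web", "db"}, "T2")], True)
root = Node([({"alg", "comp", "opt"}, left), ({"db", "opt", "web"}, right)], False)
```

With this tree, `search(root, "web", has_word)` never visits the left subtree, because "web" is not in its key.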

  16. Consistent Method • Consistent(E, q): • Can E.p and q both hold? • Does E.p imply (not q)? • Title GiST: • key predicate: p = Contains(S, v) or simply S • e.g., SL = {alg, comp, opt} • e.g., SR = {db, opt, web} • Consistent(SL, has(v, “web”))? • how to implement? • Consistent(SR, equals(v, “web database”))? • how to implement?
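
One possible answer to the "how to implement?" questions, as a sketch only. It assumes keys are word sets S, that a query is encoded as a `(kind, argument)` tuple (an encoding invented here for illustration), and that equality is checked on the set of words in the title.

```python
def consistent(S, q):
    """Return False only when no value under key S can satisfy query q."""
    kind, arg = q
    if kind == "has":
        # A title below S contains only words from S.
        return arg in S
    if kind == "equals":
        # A title equal to the query must have all of its words in S.
        return set(arg) <= S
    raise ValueError(f"unsupported query kind: {kind}")

SL = {"alg", "comp", "opt"}
SR = {"db", "opt", "web"}
```

Under these assumptions `consistent(SL, ("has", "web"))` is false, so the SL subtree is pruned, while `consistent(SR, ("equals", {"web", "db"}))` is true, so SR is searched. A true result only means "maybe": the subtree can still turn out to hold no match.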

  17. Insert Operation • Insert(root R, entry E, level l) • descend tree minimizing potential increase in Penalty • stop at level specified • if there is room at node, insert there • else split according to PickSplit • propagate changes using Union to adjust keys • Why do we need a “level parameter”?
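
The steps above can be sketched for the title example as follows. This is a simplified illustration under several assumptions that are not in the slides: keys are word sets, Penalty is the number of new words a key would gain, Union is set union, the split is a placeholder that cuts the node in half rather than a real PickSplit, the capacity `MAX_ENTRIES` is 2, and the level parameter is omitted (entries always go to a leaf).

```python
MAX_ENTRIES = 2   # assumed node capacity, chosen small to force splits

class Node:
    def __init__(self, leaf, entries=None):
        self.leaf = leaf
        self.entries = entries or []   # list of [key_set, child_or_value]

def key_of(node):
    """Union: the key that labels a whole node."""
    return set().union(*(k for k, _ in node.entries))

def insert(node, key, value):
    """Insert below node; return a new sibling if node had to split, else None."""
    if node.leaf:
        node.entries.append([set(key), value])
    else:
        # Penalty: descend into the entry whose key grows the least.
        best = min(node.entries, key=lambda e: len(set(key) - e[0]))
        sibling = insert(best[1], key, value)
        best[0] = key_of(best[1])      # propagate the change upward via Union
        if sibling is not None:
            node.entries.append([key_of(sibling), sibling])
    if len(node.entries) > MAX_ENTRIES:
        half = len(node.entries) // 2  # placeholder for PickSplit
        sibling = Node(node.leaf, node.entries[half:])
        node.entries = node.entries[:half]
        return sibling
    return None

def insert_root(root, key, value):
    """Insert and grow a new root when the old root splits."""
    sibling = insert(root, key, value)
    if sibling is None:
        return root
    return Node(False, [[key_of(root), root], [key_of(sibling), sibling]])

def all_values(node):
    if node.leaf:
        return [v for _, v in node.entries]
    return [v for _, child in node.entries for v in all_values(child)]

titles = {
    "T1": {"db", "opt"}, "T2": {"web", "db"}, "T3": {"comp", "opt", "alg"},
    "T4": {"alg", "comp"}, "T5": {"comp", "web", "alg"},
}
root = Node(True)
for name, words in titles.items():
    root = insert_root(root, words, name)
```

Because splits propagate upward and only the root can add a level, all leaves stay at the same depth, which is what keeps the tree balanced.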

  18. Title GiST: Insert • Where to insert T5: “complexity of web algorithms”? [Figure: tree with keys SL = {alg, comp, opt} and SR = {db, opt, web}; below them SLL = {alg, comp}, SLR = {comp, opt}, SRL = {db, opt}, SRR = {db, web}; leaves T4: alg. …, T3: complexity …, T1: database …, T2: web …]

  19. Penalty Method • Penalty(E1, E2): • penalty for inserting E2 in subtree E1 • Title GiST: • E2 with S = {comp, web, alg} (i.e., T5: “complexity of web algorithms”) • Where to insert? • root: SL = {alg, comp, opt} vs. SR = {db, opt, web}? • Penalty: • how to implement?
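
One plausible Penalty for word-set keys, offered as a sketch rather than the lecture's answer: count how many new words the subtree key would have to absorb. The function names and the dictionary of named keys are assumptions made for illustration.

```python
def penalty(subtree_key, new_key):
    """Number of words the subtree key would gain by accepting new_key."""
    return len(set(new_key) - set(subtree_key))

def choose_subtree(keys, new_key):
    """Pick the name of the subtree with the smallest penalty."""
    return min(keys, key=lambda name: penalty(keys[name], new_key))

T5 = {"comp", "web", "alg"}
root_keys = {"SL": {"alg", "comp", "opt"}, "SR": {"db", "opt", "web"}}
left_keys = {"SLL": {"alg", "comp"}, "SLR": {"comp", "opt"}}
```

With this measure SL would gain one word ("web") and SR two ("comp", "alg"), so T5 descends into SL; one level down, SLL gains one word and SLR two, so it continues into SLL.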

  20. PickSplit Method • PickSplit(E1, …, En): • how to split into two groups of entries • Title GiST: • suppose we have 3 entries (after an Insert) • S1 = {alg, comp} • S2 = {comp, opt} • S3 = {comp, web, alg} (new) • ? how to split {S1, S2, S3} into two? • something similar to R-tree algorithm will do
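
A sketch loosely modeled on the R-tree quadratic split, under assumptions not stated in the slides: the two seed entries are the pair whose combined word set is largest (the pair that would be most wasteful under a single key), and every other entry joins the group whose key grows least. No minimum fill is enforced.

```python
from itertools import combinations

def pick_split(entries):
    """Split word-set entries into two groups; returns [group_a, group_b]."""
    entries = [frozenset(e) for e in entries]
    # Seeds: the pair with the largest union.
    i, j = max(combinations(range(len(entries)), 2),
               key=lambda p: len(entries[p[0]] | entries[p[1]]))
    groups = [[entries[i]], [entries[j]]]
    keys = [set(entries[i]), set(entries[j])]
    for k, e in enumerate(entries):
        if k in (i, j):
            continue
        # Assign to the group whose key gains the fewest new words.
        g = min((0, 1), key=lambda idx: len(e - keys[idx]))
        groups[g].append(e)
        keys[g] |= e
    return groups

S1 = {"alg", "comp"}
S2 = {"comp", "opt"}
S3 = {"comp", "web", "alg"}
group_a, group_b = pick_split([S1, S2, S3])
```

For the three entries on the slide, S2 and S3 become the seeds and S1 joins S3, since S1 adds no new word to {comp, web, alg}.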

  21. Union Method • Union(E1, …, En): • Generates a label for the subtree with E1, …, En • Title GiST: • key predicate: p = Contains(S, v) or simply S • S1 = {alg, comp}, S2 = {comp, opt} • Combined key = ? • Union(E1=(SL, ptr1), E2=(SR, ptr2)) = ? • how to implement?
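
For the Contains(S, v) key predicate, a natural Union is plain set union, since the combined key must cover every word that appears below it. The sketch assumes entries are `(key, ptr)` pairs; the pointer values here are placeholders.

```python
def union(*entries):
    """Combine the word-set keys of (key, ptr) entries into one covering key."""
    combined = set()
    for key, _ptr in entries:
        combined |= key
    return combined

S1 = {"alg", "comp"}
S2 = {"comp", "opt"}
SL = {"alg", "comp", "opt"}
SR = {"db", "opt", "web"}
```

Under this definition `union((S1, None), (S2, None))` gives {alg, comp, opt}, and `union((SL, "ptr1"), (SR, "ptr2"))` gives all five words.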

  22. Compress/Decompress Method? Key storage vs. search time tradeoff • Compress(E): E --> Ec • Decompress(Ec): --> E’.p can be “looser” than E.p (less pruning power) • Lossy compression: may need more time for search • Title GiST: • Any suggestions?

  23. Title GiST: Compress/Decompress • Example 1: no compression • Compress(E) --> Ec = E • Decompress(Ec) --> E’ = Ec • Example 2: compress by taking word initials • Compress: {algorithm, complexity, optimization} --> {al, co, op} • Decompress: {al, co, op} --> {al*, co*, op*}
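
Example 2 can be sketched as follows, assuming full words are stored in the key and compressed to their first two letters. The helper `decompressed_contains` is a hypothetical name for the looser predicate E’.p that the decompressed key {al*, co*, op*} represents.

```python
def compress(words):
    """Keep only the first two letters of each word."""
    return {w[:2] for w in words}

def decompressed_contains(prefixes, title_words):
    """E'.p: every word of the title starts with one of the stored prefixes."""
    return all(any(w.startswith(p) for p in prefixes) for w in title_words)

key = {"algorithm", "complexity", "optimization"}
compressed = compress(key)
```

The decompressed predicate still accepts everything the original key accepted, but it also accepts titles such as {"compiler", "optics"} that the original key would have rejected. That loss of pruning power is the search-time cost of the smaller key.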

  24. GiST: No Magic • It offers (only) what its model is based on • It does not represent all possible index structures: • e.g.: duplicate objects by multiple inserts (R+-tree) • e.g.: support notion of distance and similarity • rather than Boolean based predicates • any more?

  25. Outlook: Indexability • Observation: • the simplest version of the Consistent method is a routine that always returns MAYBE, which gives you a search tree with no efficiency gain • Big questions: • what is an index machinery? (analog: Turing machine) • how do we characterize “workload”? (analog: languages) • can an index always help in search? (analog: decidability, complexity) • what are the performance parameters? (analog: size of input) • what are the performance measures? (analog: time, space complexity) • Initial result: Hellerstein, Koutsoupias and Papadimitriou: On the Analysis of Indexing Schemes, PODS 97 • An empirical question: • Can we learn an indexing strategy from the characteristics of the workload?

  26. What You Should Know • What is GiST? • What are the six key methods? • How does GiST generalize other more specialized trees? • What are some limitations of GiST?

  27. Carry Away Messages • Once again, generalize whenever it’s possible • 1-dimension indexing (B+-tree, interval-based)-> Multi-dimension indexing (R-tree, region-based) -> Arbitrary objects (GiST, predicate-based) • Avoid over-generalization • While “predicate” is quite general, it doesn’t guarantee pruning power • Where’s the notion of “bounding” in GiST? • Whenever you see “yet another X”, think about possibilities for a more general formulation of X
