Lecture 6: Indexing: GiST

Lecture 6: Indexing: GiST • Sept. 12, 2007 • ChengXiang Zhai • Most slides are adapted from Kevin Chang’s lecture slides

Search Trees: Previous Approaches • Specialized search trees (yet another tree!): • redundant code: most trees are very similar • concurrency control, logging/recovery: tricky • Trees for extensible data types: • B-tree for any data with linear ordering • e.g.: index titles (alph. ordering) with B-tree • problem: does not support natural queries • e.g.: WHERE book.title has “database”?

GiST: Generalized Search Tree • General: cover B+-tree, R-tree, etc… • Extensible: • domain-specific data types & queries definable • Easy to extend: six “key methods” for a new tree • Efficient: can match specialized trees • Reusable: concurrency, recovery for indexes

Example: Indexing Book Titles • Titles of 4 books: • T1 = “database optimization” • T2 = “web database” • T3 = “complexity of optimization algorithms” • T4 = “algorithms and complexity” • Indexable with (extensible) B+-tree? • linear ordering: T4, T3, T1, T2 • Note: Just an example for demonstrating GiST! • What we will do to index “titles” is neither the best nor the typical way to index “textual data”: there is no notion of fuzzy “relevance”. • stay tuned for text and web search

Queries on Titles • Indexing is to accommodate query processing • What “predicates” to ask about titles?

Queries on Titles • Equality predicates: • WHERE book.title = “web databases” • Containment predicates: • WHERE book.title has “web” • Prefix predicates: • WHERE book.title start-with “web” • RegEx predicates: (generalize all the others) • WHERE book.title like “# web # database”

Extensible B+-Tree for Titles • Observations: • indexed values have linear ordering: T4, T3, T1, T2 • keys are simply separators: T4, c, T3, d, T1, w, T2 [Figure: B+-tree with separator keys d, c, w above the leaves T4: alg. …, T3: complexity …, T1: database …, T2: web …]

Using B+-Tree: What’s Wrong? • What predicates can a B+-tree support well? • equality, containing, prefix, regex? [Figure: B+-tree with separator keys d, c, w above the leaves T4: alg. …, T3: complexity …, T1: database …, T2: web …]
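A minimal Python sketch of the question above, assuming a sorted list stands in for the B+-tree leaf level (the helper names `equal`, `starts_with`, and `has` are illustrative and not from the slides). Equality and prefix predicates map to a contiguous range of the ordering and can use binary search; containment cannot exploit the ordering and falls back to a full scan.

```python
import bisect

# Titles from the running example, kept sorted as a B+-tree leaf level would be.
titles = sorted([
    "database optimization",                  # T1
    "web database",                           # T2
    "complexity of optimization algorithms",  # T3
    "algorithms and complexity",              # T4
])

def equal(q):
    """Equality: one binary search."""
    i = bisect.bisect_left(titles, q)
    return titles[i:i + 1] == [q]

def starts_with(prefix):
    """Prefix: all matches form one contiguous range of the ordering."""
    lo = bisect.bisect_left(titles, prefix)
    hi = bisect.bisect_left(titles, prefix + "\uffff")
    return titles[lo:hi]

def has(word):
    """Containment: the ordering gives no help, so every title is scanned."""
    return [t for t in titles if word in t.split()]

print(titles)                # linear ordering T4, T3, T1, T2
print(equal("web database"))
print(starts_with("web"))
print(has("database"))
```

Regex predicates behave like containment in general: unless the pattern fixes a prefix, the separators cannot rule out any subtree.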

GiST: Generalizing Balanced Search Trees • GiST is not universal (just a reasonable generalization) • balanced tree of <key, ptr> pairs, keys can overlap • GRE test: B-Tree : R-Tree = R-Tree : ________ • What is the key generalization? [Figure: balanced tree whose internal nodes (directory) hold entries key1, key2, … and whose leaf nodes form a linked list]

The Key Generalization: The Key • Key evolution: 1-D separator --> 2-D MBR --> predicates • R-Tree : B-Tree • generalizing key from 1-D line to 2-D area • bounding range to (minimal) bounding region • GiST : R-Tree • generalizing key from 2-D MBR to “predicates” • a predicate that all values v in subtree will satisfy • B-tree keys: • [k1:k2) --> contains([k1:k2), v) • R-tree keys: • (x1,y1,x2,y2) --> contains((x1,y1,x2,y2), v) • RD-tree keys: • {x1,…,xk} --> subset({x1,…,xk}, v)
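The three key predicates above can be written as ordinary Boolean functions. This is a sketch only; the function names and the tuple/set encodings of keys are assumptions made for illustration.

```python
def btree_contains(key, v):
    """B-tree key [k1:k2): every value v in the subtree lies in the half-open range."""
    k1, k2 = key
    return k1 <= v < k2

def rtree_contains(key, v):
    """R-tree key (x1, y1, x2, y2): every point v lies inside the bounding rectangle."""
    x1, y1, x2, y2 = key
    x, y = v
    return x1 <= x <= x2 and y1 <= y <= y2

def rdtree_subset(key, v):
    """RD-tree key {x1, ..., xk}: every set-valued v is a subset of the bounding set."""
    return set(v) <= set(key)

print(btree_contains((10, 20), 15))
print(rtree_contains((0, 0, 2, 2), (1, 1)))
print(rdtree_subset({"db", "opt", "web"}, {"web", "db"}))
```

In each case the key is a predicate that holds for everything stored below it, which is exactly what GiST keeps and what it generalizes.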

GiST for Title Indexing: Predicates Must first determine predicates: • What query predicates to support? • equality: equal(v, “web db”) • containing: has(v, “web”) • What key predicate to use? • Criteria for choosing key predicates? • What do you suggest?

GiST for Title Indexing: Predicates • Key predicates: Contains(S, v) [Figure: GiST over the titles. Root entries SL = {alg, comp, opt} and SR = {db, opt, web}; second-level entries SLL = {alg, comp}, SLR = {comp, opt}, SRL = {db, opt}, SRR = {db, web}; leaves T4: alg. …, T3: complexity …, T1: database …, T2: web …]

GiST: Built-in Tree Operations • Search(root R, predicate q) • Insert(root R, entry E, level l) • Delete(root R, entry E)

GiST: Application-Specific Methods Search: • Consistent(E, q): search subtree E for predicate q? Labeling: • Union(E1, …, En): how to label the union of E1, …, En? Categorization: • Penalty(E1, E2): penalty for inserting E2 in subtree E1 • PickSplit(E1, …, En): how to split into two groups of entries Compression: (storage/time tradeoff) • Compress(E): E --> Ec • Decompress(Ec): --> E’ such that E.p implies E’.p

Search Operation: Consistent Method • Search(root R, predicate q): • traverse subtrees where Consistent true • return leaf entries that are consistent
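The traversal above can be sketched in a few lines of Python. The `Node` class, the `consistent_has` helper, and the simplified two-level tree (root entries SL and SR pointing directly at leaf nodes) are assumptions made for this illustration, not the layout on the slides.

```python
from dataclasses import dataclass

@dataclass
class Node:
    leaf: bool
    entries: list  # (key, child Node) in a directory node, (key, title id) in a leaf

def consistent_has(key, word):
    """Consistent for has(v, word): a subtree can match only if its key set holds the word."""
    return word in key

def search(node, q, consistent):
    """Descend into every entry whose key is consistent with q; collect matching leaf entries."""
    results = []
    for key, ptr in node.entries:
        if consistent(key, q):
            if node.leaf:
                results.append(ptr)
            else:
                results.extend(search(ptr, q, consistent))
    return results

left = Node(True, [({"alg", "comp"}, "T4"), ({"alg", "comp", "opt"}, "T3")])
right = Node(True, [({"db", "opt"}, "T1"), ({"db", "web"}, "T2")])
root = Node(False, [({"alg", "comp", "opt"}, left), ({"db", "opt", "web"}, right)])

print(search(root, "web", consistent_has))  # only the SR subtree is visited
print(search(root, "opt", consistent_has))  # keys overlap, so both subtrees are visited
```

Because keys may overlap, a single query can follow several root-to-leaf paths, unlike a B+-tree lookup.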

Consistent Method • Consistent(E, q): • Can E.p and q both hold? • Does E.p imply (not q)? • Title GiST: • key predicate: p = Contains(S, v) or simply S • e.g., SL = {alg, comp, opt} • e.g., SR = {db, opt, web} • Consistent(SL, has(v, “web”))? • how to implement? • Consistent(SR, equals(v, “web database”))? • how to implement?
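One possible answer to the two "how to implement?" questions, as a hedged sketch: with key predicate Contains(S, v), a `has` query is consistent with S when the word is in S, and an `equals` query is consistent when all words of the queried title are in S. The stop-word list, the abbreviation table, and the `(kind, argument)` query encoding are assumptions introduced here.

```python
STOP_WORDS = {"of", "and"}  # assumed: words left out of the key sets
ABBREV = {"algorithms": "alg", "complexity": "comp", "optimization": "opt",
          "database": "db"}  # assumed mapping to the abbreviations used in the keys

def word_set(title):
    """Turn a title into the set of (abbreviated) words that a key would contain."""
    return {ABBREV.get(w, w) for w in title.lower().split() if w not in STOP_WORDS}

def consistent(S, q):
    """Return False only when no value under key S can satisfy query q."""
    kind, arg = q
    if kind == "has":
        return ABBREV.get(arg, arg) in S
    if kind == "equals":
        return word_set(arg) <= S
    return True  # unknown predicate: cannot prune, so answer "maybe"

SL = {"alg", "comp", "opt"}
SR = {"db", "opt", "web"}
print(consistent(SL, ("has", "web")))
print(consistent(SR, ("equals", "web database")))
```

Returning True is always safe but prunes nothing; returning False is only allowed when E.p implies (not q).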

Insert Operation • Insert(root R, entry E, level l) • descend tree minimizing potential increase in Penalty • stop at level specified • if there is room at node, insert there • else split according to PickSplit • propagate changes using Union to adjust keys • Why do we need a “level parameter”?
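A compact sketch of the insert steps above, under several simplifying assumptions: the level parameter is omitted (entries always go to a leaf), the node capacity `M` is a made-up value of 2, `pick_split` naively cuts the entry list in half, and keys are the word sets of the title example. All names are illustrative.

```python
M = 2  # hypothetical node capacity, kept tiny so that splits happen early

class Node:
    def __init__(self, leaf, entries=None):
        self.leaf = leaf
        self.entries = entries if entries is not None else []  # (key, ptr) pairs

def union(keys):
    return frozenset().union(*keys)

def penalty(key, new_key):
    return len(new_key - key)  # words the key would have to absorb

def pick_split(entries):
    mid = len(entries) // 2  # naive split, a placeholder for a real PickSplit
    return entries[:mid], entries[mid:]

def label(entries):
    return union(k for k, _ in entries)

def insert(node, key, ptr):
    """Insert below node; return the 1 or 2 (key, node) entries that now describe it."""
    if node.leaf:
        node.entries.append((key, ptr))
    else:
        # Descend into the child whose key grows least.
        i = min(range(len(node.entries)),
                key=lambda n: penalty(node.entries[n][0], key))
        # Replace the child's entry by its refreshed entry (or two after a split).
        node.entries[i:i + 1] = insert(node.entries[i][1], key, ptr)
    if len(node.entries) <= M:
        return [(label(node.entries), node)]  # Union adjusts the key
    left, right = pick_split(node.entries)
    node.entries = left
    return [(label(left), node), (label(right), Node(node.leaf, right))]

def insert_root(root, key, ptr):
    parts = insert(root, key, ptr)
    return root if len(parts) == 1 else Node(False, parts)  # root split grows the tree

root = Node(True)
for name, words in [("T4", {"alg", "comp"}), ("T3", {"alg", "comp", "opt"}),
                    ("T1", {"db", "opt"}), ("T2", {"db", "web"}),
                    ("T5", {"comp", "web", "alg"})]:
    root = insert_root(root, frozenset(words), name)
```

Splits propagate upward through the return value, and a split of the root adds a level, which keeps the tree balanced.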

Title GiST: Insert • Where to insert T5: “complexity of web algorithms”? [Figure: GiST over the titles. Root entries SL = {alg, comp, opt} and SR = {db, opt, web}; second-level entries SLL = {alg, comp}, SLR = {comp, opt}, SRL = {db, opt}, SRR = {db, web}; leaves T4: alg. …, T3: complexity …, T1: database …, T2: web …]

Penalty Method • Penalty(E1, E2): • penalty for inserting E2 in subtree E1 • Title GiST: • E2 with S = {comp, web, alg} (i.e., T5: “complexity of web algorithms”) • Where to insert? • root: SL = {alg, comp, opt} vs. SR = {db, opt, web}? • Penalty: • how to implement?
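One plausible Penalty for set-valued keys, offered as a sketch rather than the intended answer: count the words the subtree key would have to absorb, i.e. |S2 − S1|. The function name and this particular measure are assumptions.

```python
def penalty(S1, S2):
    """Number of new words that key S1 must absorb if an entry with key S2 is inserted below it."""
    return len(S2 - S1)

SL = {"alg", "comp", "opt"}
SR = {"db", "opt", "web"}
T5 = {"comp", "web", "alg"}  # "complexity of web algorithms"

print(penalty(SL, T5))  # SL only needs to add "web"
print(penalty(SR, T5))  # SR needs to add "comp" and "alg"
```

Under this measure T5 goes into the SL subtree, since its key grows by one word instead of two; a smaller key growth keeps more pruning power for later searches.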

PickSplit Method • PickSplit(E1, …, En): • how to split into two groups of entries • Title GiST: • suppose we have 3 entries (after an Insert) • S1 = {alg, comp} • S2 = {comp, opt} • S3 = {comp, web, alg} (new) • how to split {S1, S2, S3} into two? • something similar to what R-tree algorithm will do
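A sketch of an R-tree-style split adapted to word sets. It is an assumption about what "something similar" could look like: the two seeds are the pair of keys with the smallest overlap, and every remaining key joins the group whose label grows least. Minimum-fill constraints are ignored.

```python
from itertools import combinations

def penalty(label, key):
    return len(key - label)

def pick_split(keys):
    """Split a list of set-valued keys into two groups."""
    # Seeds: the pair sharing the fewest words (first such pair on ties).
    i, j = min(combinations(range(len(keys)), 2),
               key=lambda p: len(keys[p[0]] & keys[p[1]]))
    groups = ([keys[i]], [keys[j]])
    labels = [set(keys[i]), set(keys[j])]
    for n, k in enumerate(keys):
        if n in (i, j):
            continue
        # Assign to the group whose label needs fewer new words (group 0 on ties).
        g = 0 if penalty(labels[0], k) <= penalty(labels[1], k) else 1
        groups[g].append(k)
        labels[g] |= k
    return groups

S1 = frozenset({"alg", "comp"})
S2 = frozenset({"comp", "opt"})
S3 = frozenset({"comp", "web", "alg"})
print(pick_split([S1, S2, S3]))
```

For the three entries on the slide, S1 and S2 become the seeds and S3 joins S1, because S1 and S3 share two words while S2 and S3 share only one.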

Union Method • Union(E1, …, En): • Generates a label for the subtree with E1, …, En • Title GiST: • key predicate: p = Contains(S, v) or simply S • S1 = {alg, comp}, S2 = {comp, opt} • Combined key = ? • Union(E1=(SL, ptr1), E2=(SR, ptr2)) = ? • how to implement?
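With Contains(S, v) as the key predicate, a natural Union is plain set union: the combined key must contain every word that any entry below it contains. A minimal sketch, with the function name chosen for illustration:

```python
def union(*keys):
    """Smallest word set S such that Contains(S, v) holds for every value under any input key."""
    combined = set()
    for k in keys:
        combined |= k
    return combined

S1, S2 = {"alg", "comp"}, {"comp", "opt"}
SL, SR = {"alg", "comp", "opt"}, {"db", "opt", "web"}

print(sorted(union(S1, S2)))  # the label for a node holding S1 and S2
print(sorted(union(SL, SR)))  # the label for a node holding SL and SR
```

The first result equals SL from the example tree, which is how that key arises from its two children.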

Compress/Decompress Method? Key storage vs. search time tradeoff • Compress(E): E --> Ec • Decompress(Ec): Ec --> E’, where E’.p can be “looser” than E.p (less pruning power) • Lossy compression: may need more time for search • Title GiST: • Any suggestions?

Title GiST: Compress/Decompress • Example 1: no compression • Compress(E) --> Ec = E • Decompress(Ec) --> E’ = Ec • Example 2: compress by taking word initials • Compress: {algorithm, complexity, optimization} --> {al, co, op} • Decompress: {al, co, op} --> {al*, co*, op*}
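Example 2 can be sketched as follows. The two-letter cut comes from the slide; the helper `consistent_has` and the representation of decompressed keys as strings ending in `*` are assumptions for illustration.

```python
def compress(S):
    """Keep only the first two letters of each word in the key."""
    return frozenset(w[:2] for w in S)

def decompress(Sc):
    """Read each stored initial as a prefix pattern; the result is looser than the original key."""
    return frozenset(p + "*" for p in Sc)

def consistent_has(patterns, word):
    """has(v, word) may hold under this key if any prefix pattern matches the word."""
    return any(word.startswith(p.rstrip("*")) for p in patterns)

E = {"algorithm", "complexity", "optimization"}
Ec = compress(E)
Ep = decompress(Ec)
print(sorted(Ec))
print(sorted(Ep))
print(consistent_has(Ep, "complexity"))  # true match is kept: E.p implies E'.p
print(consistent_has(Ep, "compiler"))    # false positive: the price of lossy compression
print(consistent_has(Ep, "web"))         # still pruned
```

The decompressed key never rejects a word the original key accepted, so search stays correct; it only visits some subtrees that the uncompressed key would have pruned.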

GiST: No Magic • It offers (only) what its model is based on • It does not represent all possible index structures: • e.g.: duplicate objects by multiple inserts (R+-tree) • e.g.: support notion of distance and similarity • rather than Boolean-based predicates • any more?

Outlook: Indexability • Observation: • the simplest version of the Consistent method is a routine that always returns MAYBE, which gives you a search tree with no efficiency gain • Big questions: • what is an index machinery? (analog: Turing machine) • how do we characterize “workload”? (analog: languages) • can an index always help in search? (analog: decidability, complexity) • what are the performance parameters? (analog: size of input) • what are the performance measures? (analog: time, space complexity) • Initial result: Hellerstein, Koutsoupias and Papadimitriou: On the Analysis of Indexing Schemes, PODS 1997 • An empirical question: • Can we learn an indexing strategy from the characteristics of the workload?

What You Should Know • What is GiST? • What are the six key methods? • How does GiST generalize other more specialized trees? • What are some limitations of GiST?

Carry Away Messages • Once again, generalize whenever it’s possible • 1-dimension indexing (B+-tree, interval-based) -> Multi-dimension indexing (R-tree, region-based) -> Arbitrary objects (GiST, predicate-based) • Avoid over-generalization • While “predicate” is quite general, it doesn’t guarantee pruning power • Where’s the notion of “bounding” in GiST? • Whenever you see “yet another X”, think about possibilities for a more general formulation of X
