
Lecture 6: Indexing: GiST


Sept. 12, 2007

ChengXiang Zhai

Most slides are adapted from Kevin Chang’s lecture slides

- Specialized search trees (yet another tree!):
- redundant code: most trees are very similar
- concurrency control, logging/recovery: tricky

- Trees for extensible data types:
- B-tree for any data with linear ordering
- e.g.: index titles (alph. ordering) with B-tree

- problem: does not support natural queries
- e.g.: WHERE book.title has “database”?


- General: covers B+-tree, R-tree, etc.
- Extensible:
- domain-specific data types & queries definable

- Easy to extend: six “key methods” for a new tree
- Efficient: can match specialized trees
- Reusable: concurrency, recovery for indexes

- Titles of 4 books:
- T1 = “database optimization”
- T2 = “web database”
- T3 = “complexity of optimization algorithms”
- T4 = “algorithms and complexity”

- Indexable with (extensible) B+-tree?
- linear ordering: T4, T3, T1, T2

- Note: Just an example for demonstrating GiST!
- What we will do to index “titles” is neither the best nor the typical way to index “textual data”: there is no notion of fuzzy “relevance”.
- stay tuned for text and web search

- Indexing is to accommodate query processing
- What “predicates” to ask about titles?

- Equality predicates:
- WHERE book.title = “web databases”

- Containment predicates:
- WHERE book.title has “web”

- Prefix predicates:
- WHERE book.title start-with “web”

- RegEx predicates: (generalize all the others)
- WHERE book.title like “# web # database”

- Observations:
- indexed values have linear ordering: T4, T3, T1, T2
- keys are simply separators: T4, c, T3, d, T1, w, T2

[Figure: B+-tree over the titles, with separator keys d, c, w and leaves T4: alg. …, T3: complexity …, T1: database …, T2: web …]

- What predicates can B+tree support well?
- equality, containing, prefix, regex?

[Figure: B+-tree over the titles, with separator keys d, c, w and leaves T4: alg. …, T3: complexity …, T1: database …, T2: web …]
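One way to explore the question above is to stand in for the B+-tree with a sorted list and binary search. This is only a sketch: the helper names `equal`, `starts_with`, and `has` are made up for illustration, and the `"\uffff"` sentinel assumes titles contain only characters below U+FFFF. Equality and prefix lookups map to a contiguous range of the linear order; containment does not, so it falls back to checking every title.

```python
from bisect import bisect_left

# The four example titles; sorting gives the order T4, T3, T1, T2.
TITLES = sorted([
    "database optimization",                  # T1
    "web database",                           # T2
    "complexity of optimization algorithms",  # T3
    "algorithms and complexity",              # T4
])

def equal(titles, t):
    """Equality: a single binary search in the sorted order."""
    i = bisect_left(titles, t)
    return titles[i:i + 1] == [t]

def starts_with(titles, prefix):
    """Prefix: all matches form one contiguous range of the sorted order."""
    lo = bisect_left(titles, prefix)
    # Assumed sentinel: works only if titles use characters below U+FFFF.
    hi = bisect_left(titles, prefix + "\uffff")
    return titles[lo:hi]

def has(titles, word):
    """Containment: matches are scattered in the order, so every title is checked."""
    return [t for t in titles if word in t.split()]
```

With this stand-in, `starts_with(TITLES, "web")` touches only the tail of the order, while `has(TITLES, "database")` has to look at all four titles.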

- GiST is not universal (just reasonable generalization)
- balanced tree of <key, ptr> pairs, keys can overlap

- GRE test: B-Tree : R-Tree = R-Tree : ________
- What is the key generalization?

[Figure: GiST structure, with internal nodes (directory) holding entries key1, key2, … above leaf nodes (linked list)]

- Key evolution: 1-D separator --> 2-D MBR --> predicates
- R-Tree : B-Tree
- generalizing key from 1-D line to 2-D area
- bounding range to (minimal) bounding region

- GiST : R-Tree
- generalizing key from 2-D MBR to “predicates”
- a predicate that all values v in subtree will satisfy

- B-tree keys:
- [k1:k2) --> contains([k1:k2), v)

- R-tree keys:
- (x1,y1,x2,y2) --> contains((x1,y1, x2,y2), v)

- RD-tree keys:
- {x1,…,xk} --> subset({x1,…,xk}, v)
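The three key forms above can be read as functions that take a value v and return true or false. The sketch below makes that reading concrete. The function names are hypothetical, the R-tree value is taken to be a point for brevity, and the RD-tree argument order is an assumption: `subset(key, v)` is read as "v is a subset of the key".

```python
def btree_key(k1, k2):
    """B-tree key [k1:k2) as the predicate contains([k1:k2), v)."""
    return lambda v: k1 <= v < k2

def rtree_key(x1, y1, x2, y2):
    """R-tree key (x1,y1,x2,y2) as a predicate over points (x, y)."""
    return lambda p: x1 <= p[0] <= x2 and y1 <= p[1] <= y2

def rdtree_key(items):
    """RD-tree key {x1,…,xk} as a predicate over set-valued v (assumed: v within key)."""
    bound = frozenset(items)
    return lambda v: frozenset(v) <= bound

in_c_to_d = btree_key("c", "d")
in_box = rtree_key(0, 0, 10, 5)
in_set = rdtree_key({"alg", "comp", "opt"})
```

In each case the key is just a predicate that every value stored in the subtree satisfies, which is the step from R-tree to GiST.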


Must first determine predicates:

- What query predicates to support?
- equality: equal(v, “web db”)
- containing: has(v, “web”)

- What key predicate to use?
- Criteria for choosing key predicates?
- What do you suggest?

- Key predicates: Contains(S, v)

[Figure: title GiST. Root entries SL = {alg, comp, opt} and SR = {db, opt, web}; second-level entries SLL = {alg, comp}, SLR = {comp, opt}, SRL = {db, opt}, SRR = {db, web}; leaves T4: alg. …, T3: complexity …, T1: database …, T2: web …]

- Search(root R, predicate q)
- Insert(root R, entry E, level l)
- Delete(root R, entry E)

Search:

- Consistent(E, q): search subtree E for predicate q?

Labeling:

- Union(E1, …, En): how to label the union of E1, …, En?

Categorization:

- Penalty(E1, E2): penalty for inserting E2 in subtree E1
- PickSplit(E1, …, En): how to split into two groups of entries

Compression: (storage/time tradeoff)

- Compress(E): E --> Ec
- Decompress(Ec): --> E’ such that E.p implies E’.p

- Search(root R, predicate q):
- traverse subtrees where Consistent true
- return leaf entries that are consistent

- Consistent(E, q):
- Can E.p and q both hold?
- Does E.p imply (not q)?
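The search procedure above can be sketched in a few lines. This is an illustration only, not the GiST paper's code: `Entry`, `Node`, and `leaf` are hypothetical names, the tree has just two levels, and keys are word sets using the slides' abbreviations (alg, comp, opt, db, web). The `consistent` function is passed in, which is the whole point of the design: the traversal never looks inside a key.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    key: frozenset        # S in the key predicate Contains(S, v)
    child: object = None  # Node, for directory entries
    value: object = None  # record id, for leaf entries

@dataclass
class Node:
    leaf: bool
    entries: list

def search(node, q, consistent):
    """Return values of all leaf entries consistent with q."""
    hits = []
    for e in node.entries:
        if not consistent(e.key, q):
            continue  # E.p implies (not q): skip this entry or subtree
        if node.leaf:
            hits.append(e.value)
        else:
            hits.extend(search(e.child, q, consistent))
    return hits

def leaf(*pairs):
    """Hypothetical helper: build a leaf node from (name, word set) pairs."""
    return Node(True, [Entry(frozenset(w), value=n) for n, w in pairs])

left = leaf(("T4", {"alg", "comp"}), ("T3", {"comp", "opt", "alg"}))
right = leaf(("T1", {"db", "opt"}), ("T2", {"db", "web"}))
root = Node(False, [
    Entry(frozenset({"alg", "comp", "opt"}), child=left),   # SL
    Entry(frozenset({"db", "opt", "web"}), child=right),    # SR
])

# Query predicate has(v, word): a key S is consistent when word is in S.
has_word = lambda key, word: word in key
```

For example, `search(root, "web", has_word)` never descends into the left subtree, because "web" is not in SL.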

- Title GiST:
- key predicate: p = Contains(S, v) or simply S
- e.g., SL = {alg, comp, opt}
- e.g., SR = {db, opt, web}

- Consistent(SL, has(v, “web”))?
- how to implement?

- Consistent(SR, equals(v, “web database”))?
- how to implement?
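One possible answer to the two "how to implement?" questions, as a sketch under stated assumptions: Contains(S, v) is read as "every word of v is in S", queries use the slides' abbreviated words (so "web database" becomes {web, db}), and the classes `Has` and `Equals` are hypothetical names for the two query predicates.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Has:
    word: str          # has(v, word)

@dataclass(frozen=True)
class Equals:
    words: frozenset   # equals(v, title), with the title given as its word set

def consistent(key, q):
    """Return False only when Contains(key, v) rules out q for every v."""
    if isinstance(q, Has):
        # v can contain the word only if the word is allowed by the key.
        return q.word in key
    if isinstance(q, Equals):
        # v can equal the title only if all of the title's words are allowed.
        return q.words <= key
    return True  # unknown predicate: cannot prune, so answer "maybe"

SL = frozenset({"alg", "comp", "opt"})
SR = frozenset({"db", "opt", "web"})
```

Under these assumptions, `consistent(SL, Has("web"))` is false, so the SL subtree is skipped, while `consistent(SR, Equals(frozenset({"web", "db"})))` is true, so the SR subtree must be searched.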


- Insert(root R, entry E, level l)
- descend tree minimizing potential increase in Penalty
- stop at level specified

- if there is room at node, insert there
- else split according to PickSplit
- propagate changes using Union to adjust keys
- Why do we need a “level parameter”?


- Where to insert T5:“complexity of web algorithms” ?

[Figure: title GiST. Root entries SL = {alg, comp, opt} and SR = {db, opt, web}; second-level entries SLL = {alg, comp}, SLR = {comp, opt}, SRL = {db, opt}, SRR = {db, web}; leaves T4: alg. …, T3: complexity …, T1: database …, T2: web …]

- Penalty(E1, E2):
- penalty for inserting E2 in subtree E1

- Title GiST:
- E2 with S ={comp,web, alg} (i.e., T5:“complexity of web algorithms”)
- Where to insert?
- root: SL = {alg, comp, opt} vs. SR = {db, opt, web}?

- Penalty:
- how to implement?
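One plausible Penalty for the title GiST, offered as an assumption rather than the lecture's answer, is the number of words that would have to be added to the subtree's key so that it still covers the new entry. This mirrors the R-tree idea of charging for area enlargement.

```python
def penalty(subtree_key, new_key):
    """Number of words the subtree key must grow by to cover new_key."""
    return len(new_key - subtree_key)

SL = frozenset({"alg", "comp", "opt"})
SR = frozenset({"db", "opt", "web"})
T5 = frozenset({"comp", "web", "alg"})   # "complexity of web algorithms"

# Choose the root entry with the smallest penalty.
best = min([("SL", SL), ("SR", SR)], key=lambda e: penalty(e[1], T5))
```

With this choice, SL only needs "web" added, while SR would need both "comp" and "alg", so the descent goes left.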

- PickSplit(E1, …, En):
- how to split into two groups of entries

- Title GiST:
- suppose we have 3 entries (after an Insert)
- S1 = {alg, comp}
- S2 = {comp, opt}
- S3 = {comp, web, alg} (new)

- how to split {S1, S2, S3} into two?
- something similar to what R-tree algorithm will do
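A rough set-based analogue of an R-tree style split, sketched under assumptions: seeds are the two keys whose union is largest (the pair that would be most wasteful to keep together), and each remaining key joins the group whose combined key grows least. Minimum fill constraints are left out, and the function names are hypothetical.

```python
from itertools import combinations

def union(keys):
    """Smallest word set covering all the given keys."""
    return frozenset().union(*keys)

def pick_split(keys):
    """Split keys into two groups, seeded by the pair with the largest union."""
    a, b = max(combinations(range(len(keys)), 2),
               key=lambda ij: len(keys[ij[0]] | keys[ij[1]]))
    groups = ([keys[a]], [keys[b]])
    for i, k in enumerate(keys):
        if i in (a, b):
            continue
        # Growth of each group's key if k were added to it.
        growth = [len(k - union(g)) for g in groups]
        groups[growth.index(min(growth))].append(k)
    return groups

S1 = frozenset({"alg", "comp"})
S2 = frozenset({"comp", "opt"})
S3 = frozenset({"comp", "web", "alg"})
first, second = pick_split([S1, S2, S3])
```

On the three entries from the slide, S2 and S3 become the seeds and S1 joins S3, since S3 already covers every word of S1.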


- Union(E1, …, En):
- Generates a label for the subtree with E1, …, En

- Title GiST:
- key predicate: p = Contains(S, v) or simply S
- S1 = {alg, comp}, S2 = {comp, opt}
- Combined key = ?

- Union(E1=(SL, ptr1), E2=(SR, ptr2)) = ?
- how to implement?
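For set-valued keys, a natural candidate for Union is plain set union: it is the smallest S for which Contains(S, v) still holds for everything below the combined entries. The sketch assumes that reading of Contains and works on the keys only, ignoring the pointer half of each entry.

```python
def union(*keys):
    """Label for a subtree holding all the given keys: their set union."""
    return frozenset().union(*keys)

S1 = frozenset({"alg", "comp"})
S2 = frozenset({"comp", "opt"})
SL = frozenset({"alg", "comp", "opt"})
SR = frozenset({"db", "opt", "web"})

combined = union(S1, S2)   # label for a node holding S1 and S2
root_key = union(SL, SR)   # label for a node holding SL and SR
```

Note that `union(S1, S2)` reproduces the SL label used in the running example.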


Key storage vs. search time tradeoff

- Compress(E): E --> Ec
- Decompress(Ec): Ec --> E’, where E’.p can be “looser” than E.p (less pruning power)
- Lossy compression: may need more time for search
- Title GiST:
- Any suggestions?

- Example 1: no compression
- Compress(E) --> Ec = E
- Decompress(Ec) --> E’ = Ec

- Example 2: compress by taking word initials
- Compress: {algorithm, complexity, optimization} --> {al, co, op}
- Decompress: {al, co, op} --> {al*, co*, op*}
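Example 2 can be sketched as follows. The two-letter cutoff comes from the slide's example; the helper `consistent_has` and the way prefix patterns are matched are assumptions made for illustration. The point is that the decompressed key is looser than the original: it still prunes some queries, but it also lets through words the original key would have rejected.

```python
def compress(key):
    """Keep only the first two letters of each word."""
    return frozenset(w[:2] for w in key)

def decompress(ckey):
    """Turn each stored prefix into a pattern such as 'al*'."""
    return frozenset(p + "*" for p in ckey)

def consistent_has(patterns, word):
    """has(v, word) is possible if the word matches any prefix pattern."""
    return any(word.startswith(p[:-1]) for p in patterns)

key = frozenset({"algorithm", "complexity", "optimization"})
stored = compress(key)
loose = decompress(stored)
```

Here `consistent_has(loose, "web")` is still false, but `consistent_has(loose, "optics")` is true even though "optics" is not in the original key, which is the extra search cost of lossy compression.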


- It offers (only) what its model is based on
- It does not represent all possible index structures:
- e.g.: structures that duplicate objects by multiple inserts (R+-tree)
- e.g.: structures that support a notion of distance and similarity rather than Boolean predicates

- any more?

- Observation:
- the simplest version of the Consistent method is a routine that always returns MAYBE, which gives you a search tree with no efficiency gain

- Big questions:
- what is the indexing machinery? (analog: Turing machine)
- how do we characterize “workload”? (analog: languages)
- can an index always help in search? (analog: decidability, complexity)
- what are the performance parameters? (analog: size of input)
- what are the performance measures? (analog: time, space complexity)

- Initial result: Hellerstein, Koutsoupias and Papadimitriou: On the Analysis of Indexing Schemes, PODS 97

- An empirical question:
- Can we learn an indexing strategy from the characteristics of the workload?

- What is GiST?
- What are the six key methods?
- How does GiST generalize other more specialized trees?
- What are some limitations of GiST?

- Once again, generalize whenever it’s possible
- 1-dimension indexing (B+-tree, interval-based) -> Multi-dimension indexing (R-tree, region-based) -> Arbitrary objects (GiST, predicate-based)

- Avoid over-generalization
- While “predicate” is quite general, it doesn’t guarantee pruning power
- Where’s the notion of “bounding” in GiST?

- Whenever you see “yet another X”, think about possibilities for a more general formulation of X