
Lecture 6: Indexing: GiST - PowerPoint PPT Presentation


Lecture 6: Indexing: GiST. Sept. 12, 2007 ChengXiang Zhai. Most slides are adapted from Kevin Chang’s lecture slides. Search Trees: Previous Approaches. Specialized search trees (yet another tree!): redundant code: most trees are very similar concurrency control, logging/recovery: tricky

Lecture 6: Indexing: GiST

Sept. 12, 2007

ChengXiang Zhai

Most slides are adapted from Kevin Chang’s lecture slides

Search Trees: Previous Approaches

  • Specialized search trees (yet another tree!):

    • redundant code: most trees are very similar

    • concurrency control, logging/recovery: tricky

  • Trees for extensible data types:

    • B-tree for any data with linear ordering

      • e.g.: index titles (alph. ordering) with B-tree

    • problem: does not support natural queries

      • e.g.: WHERE book.title has “database”?

GiST: Generalized Search Tree

  • General: covers B+-tree, R-tree, etc.

  • Extensible:

    • domain-specific data types & queries definable

  • Easy to extend: six “key methods” for a new tree

  • Efficient: can match specialized trees

  • Reusable: concurrency, recovery for indexes

Example: Indexing Book Titles

  • Titles of 4 books:

    • T1 = “database optimization”

    • T2 = “web database”

    • T3 = “complexity of optimization algorithms”

    • T4 = “algorithms and complexity”

  • Indexable with (extensible) B+-tree?

    • linear ordering: T4, T3, T1, T2

  • Note: Just an example for demonstrating GiST!

    • What we will do to index “titles” is neither the best nor the typical way to index “textual data”: there is no notion of fuzzy “relevance”.

    • stay tuned for text and web search

Queries on Titles

  • Indexing is to accommodate query processing

  • What “predicates” to ask about titles?

Queries on Titles

  • Equality predicates:

    • WHERE book.title = “web databases”

  • Containment predicates:

    • WHERE book.title has “web”

  • Prefix predicates:

    • WHERE book.title start-with “web”

  • RegEx predicates: (generalize all the others)

    • WHERE book.title like “# web # database”

Extensible B+-Tree for Titles

  • Observations:

    • indexed values have linear ordering: T4, T3, T1, T2

    • keys are simply separators: T4, c, T3, d, T1, w, T2




[Figure: B+-tree leaves in key order: T4: alg. …, T3: complexity …, T1: database …, T2: web …]

Using B+-Tree: What’s Wrong?

  • What predicates can a B+-tree support well?

    • equality, containing, prefix, regex?




[Figure: the same B+-tree leaves in key order: T4: alg. …, T3: complexity …, T1: database …, T2: web …]
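
To make the question concrete, here is a minimal Python sketch that stands in for the B+-tree with a sorted list. That substitution is an assumption made for brevity, and the function names are illustrative rather than taken from the lecture. Equality and prefix predicates can exploit the linear ordering, while a containment predicate ends up inspecting every title.

```python
import bisect

# The four titles from the slides, kept in sorted (linear) order,
# as the leaves of a B+-tree would be: T4, T3, T1, T2.
titles = sorted([
    "database optimization",
    "web database",
    "complexity of optimization algorithms",
    "algorithms and complexity",
])

def equality(q):
    """Binary search: the ordering locates q directly."""
    i = bisect.bisect_left(titles, q)
    return [titles[i]] if i < len(titles) and titles[i] == q else []

def prefix(q):
    """Titles starting with q form one contiguous run in sorted order."""
    lo = bisect.bisect_left(titles, q)
    hi = lo
    while hi < len(titles) and titles[hi].startswith(q):
        hi += 1
    return titles[lo:hi]

def containing(word):
    """The ordering gives no help here: every title is inspected."""
    return [t for t in titles if word in t.split()]
```

A regex predicate behaves like `containing` in this sketch: without a usable prefix, the sorted order offers nothing to prune on.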

GiST: Generalizing Balanced Search Trees

  • GiST is not universal (just reasonable generalization)

    • balanced tree of <key, ptr> pairs, keys can overlap

  • GRE test: B-Tree : R-Tree = R-Tree : ________

    • What is the key generalization?

[Figure: balanced tree. Internal nodes (directory) hold entries key1, key2, …; leaf nodes form a linked list.]

The Key Generalization: The Key

  • Key evolution: 1-D separator --> 2-D MBR --> predicates

  • R-Tree : B-Tree

    • generalizing key from 1-D line to 2-D area

      • bounding range to (minimal) bounding region

  • GiST : R-Tree

    • generalizing key from 2-D MBR to “predicates”

      • a predicate that all values v in subtree will satisfy

    • B-tree keys:

      • [k1:k2) --> contains([k1:k2), v)

    • R-tree keys:

      • (x1,y1,x2,y2) --> contains((x1,y1, x2,y2), v)

    • RD-tree keys:

      • {x1,…,xk} --> subset({x1,…,xk}, v)
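
The three key types above can be read as one idea: a key is a predicate that every value v in the subtree satisfies. The following Python sketch writes each key as a function returning such a predicate. The function names and the tuple and set representations are assumptions made for illustration.

```python
def btree_key(k1, k2):
    """B-tree key [k1:k2): v lies in a half-open 1-D range."""
    return lambda v: k1 <= v < k2

def rtree_key(x1, y1, x2, y2):
    """R-tree key: the point v = (x, y) lies inside the bounding rectangle."""
    return lambda v: x1 <= v[0] <= x2 and y1 <= v[1] <= y2

def rdtree_key(xs):
    """RD-tree key: the set v is a subset of the bounding set {x1, ..., xk}."""
    bound = frozenset(xs)
    return lambda v: set(v) <= bound

# Each key is used the same way: apply it to a value from the subtree.
in_range = btree_key(10, 20)
in_box = rtree_key(0, 0, 2, 2)
in_set = rdtree_key({"a", "b", "c"})
```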

GiST for Title Indexing: Predicates

Must first determine predicates:

  • What query predicates to support?

    • equality: equal(v, “web db”)

    • containing: has(v, “web”)

  • What key predicate to use?

    • Criteria for choosing key predicates?

    • What do you suggest?

GiST for Title Indexing: Predicates

  • Key predicates: Contains(S, v)



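[Figure: two-level GiST over the titles. Root keys: {alg, comp, opt} and {db, opt, web}. Under {alg, comp, opt}: {alg, comp} (T4: alg. …) and {comp, opt} (T3: complexity …). Under {db, opt, web}: {db, opt} (T1: database …) and {db, web} (T2: web …).]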

GiST: Built-in Tree Operations

  • Search(root R, predicate q)

  • Insert(root R, entry E, level l)

  • Delete(root R, entry E)

GiST: Application-Specific Methods


  • Consistent(E, q): search subtree E for predicate q?


  • Union(E1, …, En): how to label the union of E1, …, En?


  • Penalty(E1, E2): penalty for inserting E2 in subtree E1

  • PickSplit(E1, …, En): how to split into two groups of entries

    Compression: (storage/time tradeoff)

  • Compress(E): E --> Ec

  • Decompress(Ec): --> E’ such that E.p implies E’.p
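
One way to picture the six methods is as an interface that the built-in tree operations call into. The sketch below is an assumed Python rendering, not the lecture's or any library's API: the class names, the method signatures, and the interval-based instance (a rough B+-tree analogue) are all hypothetical.

```python
from abc import ABC, abstractmethod

class GiSTMethods(ABC):
    """The six application-specific methods, as a hypothetical interface."""

    @abstractmethod
    def consistent(self, key, query): ...

    @abstractmethod
    def union(self, keys): ...

    @abstractmethod
    def penalty(self, key, new_key): ...

    @abstractmethod
    def pick_split(self, keys): ...

    def compress(self, key):
        return key          # default: no compression

    def decompress(self, stored):
        return stored       # default: identity, so E'.p is exactly E.p

class IntervalMethods(GiSTMethods):
    """Keys and queries are half-open intervals (lo, hi), as in a B+-tree."""

    def consistent(self, key, query):
        return key[0] < query[1] and query[0] < key[1]   # intervals overlap

    def union(self, keys):
        return (min(k[0] for k in keys), max(k[1] for k in keys))

    def penalty(self, key, new_key):
        u = self.union([key, new_key])
        return (u[1] - u[0]) - (key[1] - key[0])         # growth in length

    def pick_split(self, keys):
        ordered = sorted(keys)
        mid = len(ordered) // 2
        return ordered[:mid], ordered[mid:]

m = IntervalMethods()
```

Only the four abstract methods have to be supplied for a new tree in this sketch; the compression pair defaults to the identity.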

Search Operation: Consistent Method

  • Search(root R, predicate q):

    • traverse subtrees where Consistent true

    • return leaf entries that are consistent
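
The two steps of Search can be sketched in a few lines of Python. The `Node` class, the small title tree, and the `has` predicate below are assumptions for illustration; the keys are recomputed from each title's words (using the slide abbreviations) rather than copied from a figure.

```python
class Node:
    def __init__(self, entries, leaf):
        self.entries = entries   # list of (key, child node or stored value)
        self.leaf = leaf

def search(node, q, consistent):
    """Follow every entry whose key is consistent with q; collect leaf hits."""
    results = []
    for key, target in node.entries:
        if not consistent(key, q):
            continue                       # prune this subtree or entry
        if node.leaf:
            results.append(target)
        else:
            results.extend(search(target, q, consistent))
    return results

def has(key, word):
    """Consistent for has(v, word) when keys are bounding word sets."""
    return word in key

left = Node([({"alg", "comp"}, "algorithms and complexity"),
             ({"alg", "comp", "opt"}, "complexity of optimization algorithms")],
            leaf=True)
right = Node([({"db", "opt"}, "database optimization"),
              ({"db", "web"}, "web database")],
             leaf=True)
root = Node([({"alg", "comp", "opt"}, left), ({"db", "opt", "web"}, right)],
            leaf=False)
```

Because keys may overlap, a query such as `search(root, "opt", has)` descends into both subtrees, which is the main behavioural difference from a B+-tree lookup.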

Consistent Method

  • Consistent(E, q):

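    • Can E.p and q both hold? If so, return true (the subtree may contain matches)

    • Does E.p imply (not q)? If so, return false (the subtree can be pruned)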

  • Title GiST:

    • key predicate: p = Contains(S, v) or simply S

      • e.g., SL = {alg, comp, opt}

      • e.g., SR = {db, opt, web}

    • Consistent(SL, has(v, “web”))?

      • how to implement?

    • Consistent(SR, equal(v, “web database”))?

      • how to implement?
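
One possible answer to the two "how to implement?" questions, as a Python sketch. It assumes queries are encoded as `(kind, argument)` tuples, that titles are reduced to the slide abbreviations via a lookup table, and that "of" and "and" are dropped as stop words; none of these choices come from the lecture.

```python
# Assumed mapping from title words to the abbreviations used on the slides.
ABBREV = {"algorithms": "alg", "complexity": "comp", "database": "db",
          "optimization": "opt", "web": "web"}
STOP_WORDS = {"of", "and"}

def words(title):
    """Reduce a title to its set of abbreviated content words."""
    return {ABBREV.get(w, w) for w in title.split() if w not in STOP_WORDS}

def consistent(S, q):
    """False only when no title whose words all lie in S can satisfy q."""
    kind, arg = q
    if kind == "has":
        return ABBREV.get(arg, arg) in S      # the word must be available in S
    if kind == "equal":
        return words(arg) <= S                # every query word must be in S
    raise ValueError(f"unsupported predicate: {kind}")

SL = {"alg", "comp", "opt"}
SR = {"db", "opt", "web"}
```

With these definitions, `consistent(SL, ("has", "web"))` is false, so the left subtree is pruned, while `consistent(SR, ("equal", "web database"))` is true, so the right subtree is searched.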

Insert Operation

  • Insert(root R, entry E, level l)

    • descend tree minimizing potential increase in Penalty

      • stop at level specified

    • if there is room at node, insert there

    • else split according to PickSplit

    • propagate changes using Union to adjust keys

    • Why do we need a “level parameter”?
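
The steps above can be sketched as a short recursive routine. This is a simplified illustration under several assumptions: keys are word sets, the fan-out is capped at 2, the split policy just halves the entry list, and the level parameter is left out, so every insert goes to the leaf level. All names are hypothetical.

```python
MAX_ENTRIES = 2   # assumed tiny fan-out so that splits show up quickly

class Node:
    def __init__(self, leaf, entries=None):
        self.leaf = leaf
        self.entries = entries if entries is not None else []  # [key, child or value]

def union(keys):
    return frozenset().union(*keys)

def penalty(key, new_key):
    return len(new_key - key)      # words the subtree key would have to absorb

def pick_split(entries):
    mid = len(entries) // 2        # placeholder policy: cut the list in half
    return entries[:mid], entries[mid:]

def _insert(node, key, value):
    """Insert below node; return the one or two [key, node] entries standing for it."""
    if node.leaf:
        node.entries.append([key, value])
    else:
        best = min(range(len(node.entries)),
                   key=lambda i: penalty(node.entries[i][0], key))
        child = node.entries[best][1]
        # The child comes back as one entry, or two after a split,
        # each relabelled with the Union of the keys beneath it.
        node.entries[best:best + 1] = _insert(child, key, value)
    if len(node.entries) <= MAX_ENTRIES:
        return [[union(k for k, _ in node.entries), node]]
    a, b = pick_split(node.entries)
    return [[union(k for k, _ in a), Node(node.leaf, a)],
            [union(k for k, _ in b), Node(node.leaf, b)]]

def insert(root, key, value):
    """Return the (possibly new) root after inserting value under key."""
    parts = _insert(root, key, value)
    return parts[0][1] if len(parts) == 1 else Node(False, parts)

def leaf_depths(node, depth=0):
    if node.leaf:
        return {depth}
    return set().union(*(leaf_depths(c, depth + 1) for _, c in node.entries))

def values(node):
    if node.leaf:
        return [v for _, v in node.entries]
    return [v for _, c in node.entries for v in values(c)]

root = Node(leaf=True)
for title, ws in [("database optimization", {"db", "opt"}),
                  ("web database", {"db", "web"}),
                  ("complexity of optimization algorithms", {"comp", "opt", "alg"}),
                  ("algorithms and complexity", {"alg", "comp"}),
                  ("complexity of web algorithms", {"comp", "web", "alg"})]:
    root = insert(root, frozenset(ws), title)
```

Splits propagate upward and a root split adds a level on top, so all leaves stay at the same depth. A real implementation would also enforce a minimum fill per node, which this sketch does not.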

Title GiST: Insert

  • Where to insert T5: “complexity of web algorithms”?



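[Figure: two-level GiST over the titles. Root keys: {alg, comp, opt} and {db, opt, web}. Under {alg, comp, opt}: {alg, comp} (T4: alg. …) and {comp, opt} (T3: complexity …). Under {db, opt, web}: {db, opt} (T1: database …) and {db, web} (T2: web …).]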

Penalty Method

  • Penalty(E1, E2):

    • penalty for inserting E2 in subtree E1

  • Title GiST:

    • E2 with S = {comp, web, alg} (i.e., T5: “complexity of web algorithms”)

    • Where to insert?

      • root: SL = {alg, comp, opt} vs. SR = {db, opt, web}?

    • Penalty:

      • how to implement?
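
One plausible Penalty for set-valued keys, sketched in Python: count the words the subtree key would have to absorb. This particular measure is an assumption chosen for simplicity, not the lecture's stated answer.

```python
def penalty(S1, S2):
    """Number of words in S2 that the subtree key S1 does not yet cover."""
    return len(set(S2) - set(S1))

SL = {"alg", "comp", "opt"}
SR = {"db", "opt", "web"}
T5 = {"comp", "web", "alg"}     # "complexity of web algorithms"

left_cost = penalty(SL, T5)     # SL would grow by {"web"}
right_cost = penalty(SR, T5)    # SR would grow by {"comp", "alg"}
```

Under this measure the left subtree is cheaper (1 new word against 2), so T5 would descend into SL.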

PickSplit Method

  • PickSplit(E1, …, En):

    • how to split into two groups of entries

  • Title GiST:

    • suppose we have 3 entries (after an Insert)

      • S1 = {alg, comp}

      • S2 = {comp, opt}

      • S3 = {comp, web, alg} (new)

    • how to split {S1, S2, S3} into two?

    • something similar to what R-tree algorithm will do
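
A rough Python sketch of an R-tree-style split adapted to word sets. As assumptions for illustration, the two seeds are the pair of keys with the largest symmetric difference, and each remaining key joins the group whose cover grows least. This loosely mirrors the seed-then-assign shape of R-tree splitting but is not claimed to be the algorithm from the lecture.

```python
from itertools import combinations

def pick_split(keys):
    """Split word-set keys into two groups around the two most dissimilar seeds."""
    keys = [frozenset(k) for k in keys]
    # Seeds: the pair that shares the least, measured by symmetric difference.
    i, j = max(combinations(range(len(keys)), 2),
               key=lambda p: len(keys[p[0]] ^ keys[p[1]]))
    groups = [[keys[i]], [keys[j]]]
    covers = [set(keys[i]), set(keys[j])]
    for n, k in enumerate(keys):
        if n in (i, j):
            continue
        # Assign to the group whose covering set needs fewer new words.
        g = 0 if len(k - covers[0]) <= len(k - covers[1]) else 1
        groups[g].append(k)
        covers[g] |= k
    return groups[0], groups[1]

S1 = {"alg", "comp"}
S2 = {"comp", "opt"}
S3 = {"comp", "web", "alg"}
first, second = pick_split([S1, S2, S3])
```

On the slide's example, S2 and S3 are picked as seeds and S1 joins S3, since {comp, web, alg} already covers it.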

Union Method

  • Union(E1, …, En):

    • Generates a label for the subtree with E1, …, En

  • Title GiST:

    • key predicate: p = Contains(S, v) or simply S

      • S1 = {alg, comp}, S2 = {comp, opt}

      • Combined key = ?

    • Union(E1=(SL, ptr1), E2=(SR, ptr2)) = ?

      • how to implement?
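
For keys of the form Contains(S, v), a natural Union is plain set union: the combined key must cover every word covered by any child key. A minimal Python sketch, with the entry representation `(S, ptr)` assumed for illustration:

```python
def union(entries):
    """Label for a subtree: the union of the word sets of its entries."""
    combined = set()
    for S, _ptr in entries:
        combined |= set(S)
    return combined

S1, S2 = {"alg", "comp"}, {"comp", "opt"}
SL, SR = {"alg", "comp", "opt"}, {"db", "opt", "web"}

combined_key = union([(S1, "ptr1"), (S2, "ptr2")])
root_key = union([(SL, "ptr1"), (SR, "ptr2")])
```

Here `combined_key` comes out as {alg, comp, opt}, which matches SL, and `root_key` is {alg, comp, db, opt, web}.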

Compress/Decompress Method?

Key storage vs. search time tradeoff

  • Compress(E): E --> Ec

  • Decompress(Ec): --> E’, where E’.p can be “looser” than E.p (less pruning power)

  • Lossy compression: may need more time for search

  • Title GiST:

    • Any suggestions?

Title GiST: Compress/Decompress

  • Example 1: no compression

    • Compress(E) --> Ec = E

    • Decompress(Ec) --> E’ = Ec

  • Example 2: compress by keeping only the first two letters of each word

    • Compress:

      {algorithm, complexity, optimization} --> {al, co, op}

    • Decompress:

      {al, co, op} --> {al*, co*, op*}
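
Example 2 can be sketched in Python as follows. The decompressed key is represented as a predicate over a title's word set, which is an assumed encoding of {al*, co*, op*}; the function names are illustrative.

```python
def compress(S):
    """Lossy: keep only the first two letters of each word."""
    return frozenset(w[:2] for w in S)

def decompress(Sc):
    """Looser predicate: every word of v starts with one of the stored prefixes."""
    return lambda v: all(any(w.startswith(p) for p in Sc) for w in v)

def contains(S):
    """Original predicate Contains(S, v): every word of v is in S."""
    return lambda v: set(v) <= set(S)

S = {"algorithm", "complexity", "optimization"}
exact = contains(S)
loose = decompress(compress(S))
```

Whatever satisfies `exact` also satisfies `loose`, so no answers are lost. The reverse does not hold: `loose({"compiler"})` is true while `exact({"compiler"})` is false, which is the lost pruning power that lossy compression trades for smaller keys.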

GiST: No Magic

  • It offers (only) what its model is based on

  • It does not represent all possible index structures:

    • e.g.: duplicate objects by multiple inserts (R+-tree)

    • e.g.: support notion of distance and similarity

      • rather than Boolean based predicates

    • any more?

Outlook: Indexability

  • Observation:

    • the simplest version of the Consistent method is a routine that always returns MAYBE, which gives you a search tree with no efficiency gain

  • Big questions:

    • what is an index machinery? (analog: Turing machine)

    • how do we characterize “workload”? (analog: languages)

    • can index always help in search? (analog: decidability, complexity)

      • what are the performance parameters? (analog: size of input)

      • what are the performance measures? (analog: time, space complexity)

  • Initial result: Hellerstein, Koutsoupias and Papadimitriou:

    On the Analysis of Indexing Schemes, PODS 97

  • An empirical question:

    • Can we learn an indexing strategy from the characteristics of the workload?
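
The observation at the top of this slide, that a Consistent method which always answers MAYBE yields no efficiency gain, can be checked with a small Python experiment. The dictionary-based tree, the node names, and the final verification step against the stored title are assumptions made for illustration.

```python
def search(node, word, consistent, visited):
    """Search for titles containing word, recording which nodes are read."""
    visited.append(node["name"])
    hits = []
    for key, target in node["entries"]:
        if not consistent(key, word):
            continue
        if node["leaf"]:
            if word in target.split():      # verify the candidate itself
                hits.append(target)
        else:
            hits.extend(search(target, word, consistent, visited))
    return hits

left = {"name": "left", "leaf": True, "entries": [
    ({"algorithms", "complexity"}, "algorithms and complexity"),
    ({"complexity", "optimization", "algorithms"},
     "complexity of optimization algorithms")]}
right = {"name": "right", "leaf": True, "entries": [
    ({"database", "optimization"}, "database optimization"),
    ({"web", "database"}, "web database")]}
root = {"name": "root", "leaf": False, "entries": [
    ({"algorithms", "complexity", "optimization"}, left),
    ({"database", "optimization", "web"}, right)]}

pruning = lambda key, word: word in key     # uses the key predicate
always_maybe = lambda key, word: True       # never rules anything out

seen_pruning, seen_maybe = [], []
answer_pruning = search(root, "web", pruning, seen_pruning)
answer_maybe = search(root, "web", always_maybe, seen_maybe)
```

Both runs return the same answer, but the always-MAYBE version reads every node while the pruning version skips the left subtree.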

What You Should Know

  • What is GiST?

  • What are the six key methods?

  • How does GiST generalize other more specialized trees?

  • What are some limitations of GiST?

Carry Away Messages

  • Once again, generalize whenever it’s possible

    • 1-dimension indexing (B+-tree, interval-based) --> Multi-dimension indexing (R-tree, region-based) --> Arbitrary objects (GiST, predicate-based)

  • Avoid over-generalization

    • While “predicate” is quite general, it doesn’t guarantee pruning power

    • Where’s the notion of “bounding” in GiST?

  • Whenever you see “yet another X”, think about possibilities for a more general formulation of X