Lecture 6: Indexing: GiST

Sept. 12, 2007

ChengXiang Zhai

Most slides are adapted from Kevin Chang’s lecture slides


Search Trees: Previous Approaches

  • Specialized search trees (yet another tree!):

    • redundant code: most trees are very similar

    • concurrency control, logging/recovery: tricky

  • Trees for extensible data types:

    • B-tree for any data with linear ordering

      • e.g.: index titles (alphabetical ordering) with B-tree

    • problem: does not support natural queries

      • e.g.: WHERE book.title has “database”?


GiST: Generalized Search Tree

  • General: covers B+-tree, R-tree, etc.

  • Extensible:

    • domain-specific data types & queries definable

  • Easy to extend: six “key methods” for a new tree

  • Efficient: can match specialized trees

  • Reusable: concurrency, recovery for indexes


Example: Indexing Book Titles

  • Titles of 4 books:

    • T1 = “database optimization”

    • T2 = “web database”

    • T3 = “complexity of optimization algorithms”

    • T4 = “algorithms and complexity”

  • Indexable with (extensible) B+-tree?

    • linear ordering: T4, T3, T1, T2

  • Note: Just an example for demonstrating GiST!

    • What we will do to index “titles” is neither the best nor the typical way to index textual data: there is no notion of fuzzy “relevance”.

    • stay tuned for text and web search


Queries on Titles

  • Indexing is to accommodate query processing

  • What “predicates” to ask about titles?


Queries on Titles

  • Equality predicates:

    • WHERE book.title = “web databases”

  • Containment predicates:

    • WHERE book.title has “web”

  • Prefix predicates:

    • WHERE book.title start-with “web”

  • RegEx predicates: (generalize all the others)

    • WHERE book.title like “# web # database”
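The first three predicate types can be sketched as plain functions over a title string. This is only an illustration: the function names `equal`, `has`, and `starts_with` are assumptions, and the regex form is left out because the slide does not define the `#` wildcard syntax.

```python
def equal(title: str, q: str) -> bool:
    """Equality predicate: the whole title matches q exactly."""
    return title == q


def has(title: str, word: str) -> bool:
    """Containment predicate: word appears as one of the title's words."""
    return word in title.split()


def starts_with(title: str, prefix: str) -> bool:
    """Prefix predicate: the title begins with prefix."""
    return title.startswith(prefix)


titles = {
    "T1": "database optimization",
    "T2": "web database",
    "T3": "complexity of optimization algorithms",
    "T4": "algorithms and complexity",
}

# Titles satisfying each predicate for the keyword "web"
containing_web = [t for t, v in titles.items() if has(v, "web")]
prefixed_web = [t for t, v in titles.items() if starts_with(v, "web")]
```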


Extensible B+-Tree for Titles

  • Observations:

    • indexed values have linear ordering: T4, T3, T1, T2

    • keys are simply separators: T4, c, T3, d, T1, w, T2

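[Figure: B+-tree over the titles. Root separator: d. Second-level separators: c (left) and w (right). Leaves in order: T4: alg. …, T3: complexity …, T1: database …, T2: web …]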


Using B+-Tree: What’s Wrong?

  • What predicates can a B+-tree support well?

    • equality, containment, prefix, regex?

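[Figure: the same B+-tree. Root separator: d. Second-level separators: c (left) and w (right). Leaves in order: T4: alg. …, T3: complexity …, T1: database …, T2: web …]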
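One way to explore the question is to stand in for the B+-tree with a sorted list and binary search. This is a sketch under that assumption, not an actual B+-tree; the helper names are hypothetical. Equality and prefix queries map to a position or a contiguous run in the linear order, while containment gets no help from the ordering and falls back to a full scan.

```python
import bisect

# Sorted order corresponds to the linear ordering T4, T3, T1, T2.
titles = sorted([
    "database optimization",
    "web database",
    "complexity of optimization algorithms",
    "algorithms and complexity",
])


def equality_lookup(sorted_titles, q):
    """Binary search: O(log n), like descending the separators."""
    i = bisect.bisect_left(sorted_titles, q)
    return i < len(sorted_titles) and sorted_titles[i] == q


def prefix_lookup(sorted_titles, prefix):
    """Titles sharing a prefix form one contiguous run in sorted order."""
    start = bisect.bisect_left(sorted_titles, prefix)
    matches = []
    for t in sorted_titles[start:]:
        if not t.startswith(prefix):
            break
        matches.append(t)
    return matches


def containment_lookup(sorted_titles, word):
    """The ordering says nothing about inner words, so every title is checked."""
    return [t for t in sorted_titles if word in t.split()]
```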


GiST: Generalizing Balanced Search Trees

  • GiST is not universal (just reasonable generalization)

    • balanced tree of <key, ptr> pairs, keys can overlap

  • GRE test: B-Tree : R-Tree = R-Tree : ________

    • What is the key generalization?

[Figure: balanced tree whose nodes hold entries key1, key2, …; internal nodes form the directory, and leaf nodes are chained as a linked list]


The Key Generalization: The Key

  • Key evolution: 1-D separator --> 2-D MBR --> predicates

  • R-Tree : B-Tree

    • generalizing key from 1-D line to 2-D area

      • bounding range to (minimal) bounding region

  • GiST : R-Tree

    • generalizing key from 2-D MBR to “predicates”

      • a predicate that all values v in subtree will satisfy

    • B-tree keys:

      • [k1:k2) --> contains([k1:k2), v)

    • R-tree keys:

      • (x1,y1,x2,y2) --> contains((x1,y1, x2,y2), v)

    • RD-tree keys:

      • {x1,…,xk} --> subset({x1,…,xk}, v)
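The three key forms above can be read as functions that build a predicate on a value v. The sketch below is illustrative only: the factory names are hypothetical, and the RD-tree case assumes that subset({x1,…,xk}, v) means v is a subset of the bounding set.

```python
def btree_key(k1, k2):
    """[k1:k2) --> contains([k1:k2), v): v lies in a half-open 1-D range."""
    return lambda v: k1 <= v < k2


def rtree_key(x1, y1, x2, y2):
    """(x1,y1,x2,y2) --> v is a 2-D point inside the bounding rectangle."""
    return lambda v: x1 <= v[0] <= x2 and y1 <= v[1] <= y2


def rdtree_key(bounding_set):
    """{x1,...,xk} --> v is a set whose elements all lie in the bounding set."""
    bound = frozenset(bounding_set)
    return lambda v: set(v) <= bound


# Every value stored in a subtree must satisfy that subtree's key predicate.
in_range = btree_key(10, 20)
in_box = rtree_key(0, 0, 2, 2)
in_set = rdtree_key({"alg", "comp", "opt"})
```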


GiST for Title Indexing: Predicates

Must first determine predicates:

  • What query predicates to support?

    • equality: equal(v, “web db”)

    • containing: has(v, “web”)

  • What key predicate to use?

    • Criteria for choosing key predicates?

    • What do you suggest?


GiST for Title Indexing: Predicates

  • Key predicates: Contains(S, v)

[Figure: title GiST. Root entries: SL = {alg, comp, opt} and SR = {db, opt, web}. Under SL: SLL = {alg, comp} (T4: alg. …) and SLR = {comp, opt} (T3: complexity …). Under SR: SRL = {db, opt} (T1: database …) and SRR = {db, web} (T2: web …)]


GiST: Built-in Tree Operations

  • Search(root R, predicate q)

  • Insert(root R, entry E, level l)

  • Delete(root R, entry E)


GiST: Application-Specific Methods

Search:

  • Consistent(E, q): search subtree E for predicate q?

Labeling:

  • Union(E1, …, En): how to label the union of E1, …, En?

Categorization:

  • Penalty(E1, E2): penalty for inserting E2 in subtree E1

  • PickSplit(E1, …, En): how to split into two groups of entries

Compression: (storage/time tradeoff)

  • Compress(E): E --> Ec

  • Decompress(Ec): Ec --> E’ such that E.p implies E’.p
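The six methods listed above can be pictured as an abstract interface that a new tree type fills in. This is a minimal sketch: the class name `GiSTKeyMethods` and the Python signatures are assumptions made for illustration, not an API from the lecture or from any particular system.

```python
from abc import ABC, abstractmethod


class GiSTKeyMethods(ABC):
    """Hypothetical interface: what a new tree type must supply."""

    @abstractmethod
    def consistent(self, entry, query) -> bool:
        """Search: False only if no value under entry can satisfy query."""

    @abstractmethod
    def union(self, entries):
        """Labeling: a key that holds for everything under all the entries."""

    @abstractmethod
    def penalty(self, e1, e2) -> float:
        """Categorization: cost of inserting e2 into the subtree of e1."""

    @abstractmethod
    def pick_split(self, entries):
        """Categorization: divide the entries of a full node into two groups."""

    @abstractmethod
    def compress(self, entry):
        """Compression: a smaller stored form of the entry's key."""

    @abstractmethod
    def decompress(self, compressed):
        """Compression: a key implied by the original (possibly looser)."""
```

The built-in Search, Insert, and Delete operations would then be written once against such an interface, which is where the reuse of concurrency and recovery code comes from.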


Search Operation: Consistent Method

  • Search(root R, predicate q):

    • traverse subtrees where Consistent true

    • return leaf entries that are consistent
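The two steps of Search can be sketched as a short recursive routine that is parameterized by a Consistent function. The `Entry` and `Node` classes and the interval-keyed toy tree are assumptions chosen to keep the example small; they are not the lecture's data structures.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Entry:
    key: Any                          # the key predicate (here: an interval)
    child: Optional["Node"] = None    # subtree, for internal entries
    value: Any = None                 # stored item, for leaf entries


@dataclass
class Node:
    entries: List[Entry] = field(default_factory=list)
    is_leaf: bool = True


def search(node, q, consistent):
    """Descend only into consistent subtrees; return consistent leaf values."""
    results = []
    for e in node.entries:
        if not consistent(e, q):
            continue                  # prune: nothing below can match q
        if node.is_leaf:
            results.append(e.value)
        else:
            results.extend(search(e.child, q, consistent))
    return results


# Toy tree with closed-interval keys (lo, hi); q is a single number.
leaf_a = Node([Entry((3, 3), value="a"), Entry((7, 7), value="b")])
leaf_b = Node([Entry((12, 12), value="c")])
root = Node([Entry((3, 7), child=leaf_a), Entry((12, 12), child=leaf_b)],
            is_leaf=False)


def in_interval(e, q):
    return e.key[0] <= q <= e.key[1]


found = search(root, 7, in_interval)
```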


Consistent Method

  • Consistent(E, q):

    • Can E.p and q both hold? If so, return true (the subtree may contain matches)

    • Return false only if E.p implies (not q)

  • Title GiST:

    • key predicate: p = Contains(S, v) or simply S

      • e.g., SL = {alg, comp, opt}

      • e.g., SR = {db, opt, web}

    • Consistent(SL, has(v, “web”))?

      • how to implement?

    • Consistent(SR, equals(v, “web database”))?

      • how to implement?
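One possible answer to the two "how to implement?" questions, as a sketch. It assumes that a key S bounds the keyword set of every title in its subtree, that titles are reduced to the abbreviated keywords used on the slides (so “web database” becomes {web, db}), and that queries are passed as hypothetical `("has", word)` or `("equals", keywords)` tuples.

```python
def consistent(S, q):
    """Return False only when no title bounded by S can satisfy q."""
    kind, arg = q
    if kind == "has":
        # A title below S can contain the word only if S does.
        return arg in S
    if kind == "equals":
        # A title equal to the query must have all its keywords inside S.
        return set(arg) <= set(S)
    raise ValueError(f"unsupported predicate: {kind}")


SL = {"alg", "comp", "opt"}
SR = {"db", "opt", "web"}

prune_left = not consistent(SL, ("has", "web"))         # "web" is not in SL
visit_right = consistent(SR, ("equals", {"web", "db"}))  # {web, db} fits in SR
```

A True result only means "maybe": the search still has to descend and check the entries below.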


Insert Operation

  • Insert(root R, entry E, level l)

    • descend tree minimizing potential increase in Penalty

      • stop at level specified

    • if there is room at node, insert there

    • else split according to PickSplit

    • propagate changes using Union to adjust keys

    • Why do we need a “level parameter”?


Title GiST: Insert

  • Where to insert T5: “complexity of web algorithms”?

[Figure: title GiST. Root entries: SL = {alg, comp, opt} and SR = {db, opt, web}. Under SL: SLL = {alg, comp} (T4: alg. …) and SLR = {comp, opt} (T3: complexity …). Under SR: SRL = {db, opt} (T1: database …) and SRR = {db, web} (T2: web …)]


Penalty Method

  • Penalty(E1, E2):

    • penalty for inserting E2 in subtree E1

  • Title GiST:

    • E2 with S = {comp, web, alg} (i.e., T5: “complexity of web algorithms”)

    • Where to insert?

      • root: SL = {alg, comp, opt} vs. SR = {db, opt, web}?

    • Penalty:

      • how to implement?
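One plausible Penalty for set-valued keys, offered as an assumption rather than the lecture's answer, is the number of keywords the subtree's key would have to absorb. Insert then descends into the entry with the smallest penalty. The helper `choose_subtree` is a hypothetical name.

```python
def penalty(S1, S2):
    """How many new keywords S1 must absorb to cover S2."""
    return len(set(S2) - set(S1))


def choose_subtree(keys, new_key):
    """Pick the entry name whose key grows least when new_key is added."""
    return min(keys, key=lambda name: penalty(keys[name], new_key))


root_keys = {
    "SL": {"alg", "comp", "opt"},
    "SR": {"db", "opt", "web"},
}
T5 = {"comp", "web", "alg"}

cost_left = penalty(root_keys["SL"], T5)    # SL lacks only "web"
cost_right = penalty(root_keys["SR"], T5)   # SR lacks "comp" and "alg"
target = choose_subtree(root_keys, T5)
```

Under this measure T5 goes to the left subtree, since SL needs one extra keyword and SR needs two.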


PickSplit Method

  • PickSplit(E1, …, En):

    • how to split into two groups of entries

  • Title GiST:

    • suppose we have 3 entries (after an Insert)

      • S1 = {alg, comp}

      • S2 = {comp, opt}

      • S3 = {comp, web, alg} (new)

    • how to split {S1, S2, S3} into two?

    • something similar to what R-tree algorithm will do
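A sketch in the spirit of the R-tree's seed-based split, adapted to keyword sets. The measures are assumptions: the two keys that differ most (largest symmetric difference) become seeds, and each remaining key joins the group whose bounding set grows least. It does not enforce a minimum fill per group.

```python
from itertools import combinations


def pick_split(keys):
    """Split a list of keyword-set keys into two groups."""
    keys = [frozenset(k) for k in keys]
    # Seeds: the pair of keys with the largest symmetric difference.
    i, j = max(combinations(range(len(keys)), 2),
               key=lambda p: len(keys[p[0]] ^ keys[p[1]]))
    groups = ([keys[i]], [keys[j]])
    bounds = [set(keys[i]), set(keys[j])]
    for n, k in enumerate(keys):
        if n in (i, j):
            continue
        # Join the group that needs fewer new keywords (ties go to the first).
        g = 0 if len(k - bounds[0]) <= len(k - bounds[1]) else 1
        groups[g].append(k)
        bounds[g] |= k
    return groups


S1 = {"alg", "comp"}
S2 = {"comp", "opt"}
S3 = {"comp", "web", "alg"}

left, right = pick_split([S1, S2, S3])
```

With these measures S2 and S3 become the seeds, and S1 joins S3 because it adds no new keyword there.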


Union Method

  • Union(E1, …, En):

    • Generates a label for the subtree with E1, …, En

  • Title GiST:

    • key predicate: p = Contains(S, v) or simply S

      • S1 = {alg, comp}, S2 = {comp, opt}

      • Combined key = ?

    • Union(E1=(SL, ptr1), E2=(SR, ptr2)) = ?

      • how to implement?
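If a key S is read as a bounding set of keywords (an assumption about what Contains(S, v) means), the natural Union is plain set union. A minimal sketch:

```python
def union(*keys):
    """Smallest keyword set that bounds every given key."""
    combined = set()
    for k in keys:
        combined |= set(k)
    return combined


S1 = {"alg", "comp"}
S2 = {"comp", "opt"}
SL = {"alg", "comp", "opt"}
SR = {"db", "opt", "web"}

combined_key = union(S1, S2)   # label for a node holding S1 and S2
root_key = union(SL, SR)       # label for a node holding SL and SR
```

The first result equals SL, the label shown for the left subtree in the title GiST.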


Compress/Decompress Method?

Key storage vs. search time tradeoff

  • Compress(E): E --> Ec

  • Decompress(Ec): Ec --> E’, where E’.p can be “looser” than E.p (less pruning power)

  • Lossy compression: may need more time for search

  • Title GiST:

    • Any suggestions?


Title GiST: Compress/Decompress

  • Example 1: no compression

    • Compress(E) --> Ec = E

    • Decompress(Ec) --> E’ = Ec

  • Example 2: compress by truncating each word to its first two letters

    • Compress:

      {algorithm, complexity, optimization} --> {al, co, op}

    • Decompress:

      {al, co, op} --> {al*, co*, op*}
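Example 2 can be sketched as follows. The helper `may_contain` and the treatment of `al*` as a prefix pattern are assumptions made for illustration. The sketch shows both properties from the slides: every original word still matches after decompression (E.p implies E’.p), and unrelated words can match too, which costs pruning power.

```python
def compress(S, n=2):
    """Keep only the first n letters of each keyword."""
    return {w[:n] for w in S}


def decompress(Sc):
    """Turn each stored prefix into a prefix pattern such as 'al*'."""
    return {p + "*" for p in Sc}


def may_contain(patterns, word):
    """True if word matches any prefix pattern in the decompressed key."""
    return any(word.startswith(p.rstrip("*")) for p in patterns)


S = {"algorithm", "complexity", "optimization"}
Sc = compress(S)
S_loose = decompress(Sc)

still_covered = all(may_contain(S_loose, w) for w in S)
false_positive = may_contain(S_loose, "algebra")   # not in S, yet matches al*
pruned = not may_contain(S_loose, "web")
```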


GiST: No Magic

  • It offers (only) what its model is based on

  • It does not represent all possible index structures:

    • e.g.: duplicate objects by multiple inserts (R+-tree)

    • e.g.: support notion of distance and similarity

      • rather than Boolean based predicates

    • any more?


Outlook: Indexability

  • Observation:

    • the simplest version of the Consistent method is a routine that always returns MAYBE, which gives you a search tree with no efficiency gain

  • Big questions:

    • what is an index machinery? (analog: Turing machine)

    • how do we characterize “workload”? (analog: languages)

    • can index always help in search? (analog: decidability, complexity)

      • what are the performance parameters? (analog: size of input)

      • what are the performance measures? (analog: time, space complexity)

  • Initial result: Hellerstein, Koutsoupias and Papadimitriou:

    On the Analysis of Indexing Schemes, PODS 97

  • An empirical question:

    • Can we learn an indexing strategy from the characteristics of the workload?
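The observation about an always-MAYBE Consistent method can be made concrete by counting how many nodes a search touches. The tuple-based tree, the interval keys, and the function names below are assumptions chosen only to keep the sketch short.

```python
def visits(node, q, consistent):
    """Count the nodes examined while searching for q."""
    key, children = node
    if not consistent(key, q):
        return 0
    if children is None:          # leaf entry
        return 1
    return 1 + sum(visits(c, q, consistent) for c in children)


def leaf(v):
    return ((v, v), None)


# Interval keys (lo, hi); each internal key bounds the values below it.
tree = ((0, 7), [
    ((0, 3), [leaf(0), leaf(3)]),
    ((4, 7), [leaf(4), leaf(7)]),
])


def in_interval(key, q):
    return key[0] <= q <= key[1]


def always_maybe(key, q):
    return True


pruned_visits = visits(tree, 3, in_interval)    # follows a single path
full_visits = visits(tree, 3, always_maybe)     # touches every node
```

With the always-MAYBE version the tree degenerates into a full scan, and the leaf entries would still have to be checked against q separately.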


What You Should Know

  • What is GiST?

  • What are the six key methods?

  • How does GiST generalize other more specialized trees?

  • What are some limitations of GiST?


Carry Away Messages

  • Once again, generalize whenever it’s possible

    • 1-dimension indexing (B+-tree, interval-based) --> multi-dimension indexing (R-tree, region-based) --> arbitrary objects (GiST, predicate-based)

  • Avoid over-generalization

    • While “predicate” is quite general, it doesn’t guarantee pruning power

    • Where’s the notion of “bounding” in GiST?

  • Whenever you see “yet another X”, think about possibilities for a more general formulation of X

