Supporting keyword queries over databases: A simpl{e/istic} first step

Presentation Transcript
Supporting keyword queries over databases: A simpl{e/istic} first step

Slide added after the class discussion

  • As you may have heard, Google struck deals with AZ, UT and some other states to make their public records databases searchable by “Google Users”

  • Clearly the databases are not going to be searched with SQL queries, but rather with keyword search. How does one do keyword search over database records?

  • One simple idea is to

    • Generate the universal relation of the database (i.e., de-normalize and create a single big table)

    • Write out each tuple as a simple HTML file

    • Support keyword queries on these HTML files

  • Looks like an obviously dumb idea (we are destroying the structure in the tuple by writing it out as an HTML file!)

    • So we need a “smart-sounding” name for the idea

      • We shall call it “SURFACING THE DEEP WEB”

  • We surface the deep web into HTML pages since we have a very nice hammer, keyword-based search, that can then be used on those pages…
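A minimal sketch of this surfacing pipeline, using only the standard library; the table names and data are invented for illustration:

```python
import sqlite3

# Hypothetical normalized "public records" database (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person(id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE record(person_id INTEGER, kind TEXT, detail TEXT);
    INSERT INTO person VALUES (1, 'Ann Lee', 'Phoenix');
    INSERT INTO record VALUES (1, 'license', 'class D, expires 2009');
""")

# Step 1: de-normalize into a universal relation via a join.
rows = conn.execute("""
    SELECT p.name, p.city, r.kind, r.detail
    FROM person p JOIN record r ON r.person_id = p.id
""").fetchall()

# Steps 2-3: write each tuple out as a trivial HTML page that a
# keyword search engine can then crawl and index.
pages = []
for row in rows:
    body = " ".join(str(v) for v in row)
    pages.append(f"<html><body><p>{body}</p></body></html>")
```

Each page deliberately throws away the column structure; the bet is that plain keyword search over the flattened text is still useful.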

Soft Joins: WHIRL [Cohen]

  • We can extend the notion of Joins to “Similarity Joins” where similarity is measured in terms of vector similarity over the text attributes. So, the join tuples are output in a ranked form—with the rank proportional to the similarity

    • Neat idea… but does have some implementation difficulties

      • Most tuples in the cross-product will have non-zero similarities. So, need query processing that will somehow just produce highly ranked tuples

        • Uses A*-search to focus on top-K answers

        • (See Surajit Chaudhuri et al., CIDR 2005, who argue for a whole new query algebra to help support top-K query processing)
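A toy sketch of a similarity join using TF-IDF cosine similarity over text attributes; the relations and data are invented, and this naive version scores the entire cross product rather than using WHIRL's A*-search to focus on the top-K answers:

```python
import math
import re
from collections import Counter

def tfidf_vectors(texts):
    """TF-IDF vectors over a shared corpus; rare tokens get higher weight."""
    docs = [re.findall(r"[a-z0-9']+", t.lower()) for t in texts]
    n = len(docs)
    df = Counter(tok for d in docs for tok in set(d))
    return [{t: c * math.log(1 + n / df[t]) for t, c in Counter(d).items()}
            for d in docs]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy relations (titles only, made up for illustration).
reviews = ["Hitchhiker's Guide to the Galaxy", "Star Wars"]
listings = ["The Hitchhiker's Guide to the Galaxy, 2005", "Shrek"]

vecs = tfidf_vectors(reviews + listings)
# The soft join scores every pair in the cross product and outputs
# join tuples ranked by similarity instead of requiring exact equality.
ranked = sorted(
    ((cosine(vecs[i], vecs[len(reviews) + j]), r, l)
     for i, r in enumerate(reviews)
     for j, l in enumerate(listings)),
    reverse=True)
```

The two near-identical titles join with a high score even though they are not string-equal, which is exactly the point of the soft join.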

WHIRL queries

  • Assume two relations:

    review(movieTitle,reviewText): archive of reviews

    listing(theatre, movieTitle, showTimes, …): now showing

WHIRL queries

  • “Find reviews of sci-fi comedies” [movie domain]

    FROM review as r SELECT * WHERE r.text~’sci fi comedy’

    (like standard ranked retrieval of “sci-fi comedy”)

  • “Where is [that sci-fi comedy] playing?”

    FROM review as r, listing as s SELECT *

    WHERE r.title~s.title and r.text~’sci fi comedy’

    (best answers: titles are similar to each other – e.g., “Hitchhiker’s Guide to the Galaxy” and “The Hitchhiker’s Guide to the Galaxy, 2005” and the review text is similar to “sci-fi comedy”)

WHIRL queries

  • Similarity is based on TFIDF: rare words are most important.

    (Years are common in the review archive, so they have low weight.)

  • Search for high-ranking answers uses inverted indices….

    - It is easy to find the (few) items that match on “important” terms

    - Search for strong matches can prune “unimportant” terms
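A minimal illustration of why inverted indices make this search cheap; the corpus and query are made up:

```python
from collections import defaultdict

# Toy tuple "documents" keyed by id; contents are invented.
corpus = {
    1: "hitchhiker guide galaxy 2005",
    2: "galaxy quest 1999",
    3: "office space 1999",
}

# Inverted index: token -> ids of the tuples containing it.
index = defaultdict(set)
for tid, text in corpus.items():
    for tok in text.split():
        index[tok].add(tid)

# Rare ("important") terms have short posting lists, so starting from
# the rarest query term touches only a few candidate tuples; tuples that
# share nothing but common terms (like a year) are never examined.
query = ["hitchhiker", "1999"]
rarest = min(query, key=lambda t: len(index[t]))
candidates = index[rarest]
```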

WHIRL results

  • This sort of worked:

    • Interactive speeds (<0.3s/q) with a few hundred thousand tuples.

    • For 2-way joins, average precision (sort of like area under precision-recall curve) from 85% to 100% on 13 problems in 6 domains.

    • Average precision better than 90% on 5-way joins

WHIRL and soft integration

What worked:

  • WHIRL worked for a number of web-based demo applications, e.g., integrating data from 30-50 smallish web DBs with <1 FTE of labor

  • WHIRL could link many data types reasonably well, without engineering

  • WHIRL generated numerous papers (SIGMOD98, KDD98, Agents99, AAAI99, TOIS2000, AIJ2000, ICML2000, JAIR2001)

Limitations:

  • WHIRL was relational (but see ELIXIR, SIGIR2001)

  • WHIRL users need to know the schema of source DBs

  • WHIRL’s query-time linkage worked only for TFIDF, token-based distance metrics → text fields with few misspellings

  • WHIRL was memory-based: all data must be centrally stored (no federated data) → small datasets only

String Similarity Metrics

  • TF-IDF measures are not very good at handling similarity between “short textual attributes” (e.g., titles)

    • String similarity metrics are more suitable

  • String similarity can be handled in terms of

    • Edit distance (# of primitive ops such as “backspace”, “overtype”) needed to convert one string into another

    • N-gram distance (see next slide)

N-gram distance

  • An n-gram of a string is a contiguous n-character subsequence of the string

    • The 3-grams of the string “hitchhiker” are

      • {hit, itc, tch, chh, hhi, hik, ike, ker}

    • “space” can be treated as a special character

  • A string S can be represented as a set of its n-grams

    • Similarity between two strings can be defined in terms of the similarity between the sets

      • Can use Jaccard similarity

  • N-grams are to strings what K-shingles are to documents

    • Document duplicate detection is often done in terms of the set similarity between its shingles

      • Each shingle is hashed to a hash signature; Jaccard similarity is computed between the documents’ shingle sets

        • Useful for plagiarism detection (e.g., the Turnitin software)
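A sketch of n-gram extraction and Jaccard similarity; the 3-gram set matches the slide's example, and a one-letter misspelling still scores high where exact equality would fail:

```python
def ngrams(s, n=3):
    """The set of contiguous n-character substrings of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    """Set similarity: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 0.0

g3 = ngrams("hitchhiker")
# -> {'hit', 'itc', 'tch', 'chh', 'hhi', 'hik', 'ike', 'ker'}

# Dropping one letter leaves most 3-grams intact, so the
# misspelled string is still recognized as very similar.
sim = jaccard(ngrams("hitchhiker"), ngrams("hitchiker"))
```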

Performance

Query Processing in Data Integration (Gathering and Using Source Statistics)

Query Optimization Challenges

-- Deciding what to optimize

-- Getting the statistics on sources

-- Doing the optimization

What to Optimize

  • Traditional DB optimizers compare candidate plans purely in terms of the time they take to produce all answers to a query.

  • In Integration scenarios, the optimization is “multi-objective”

    • Total time of execution

    • Cost to first few tuples

      • Often, the users are happier with plans that give first tuples faster

    • Coverage of the plan

      • Full coverage is no longer an iron-clad requirement

        • Too many relevant sources; uncontrolled overlap between the sources

      • Can’t call them all!

    • (Robustness, access premiums…)

Roadmap

  • We will first focus on optimization issues in vertical integration (“data aggregation”) scenarios

    • Learning source statistics

    • Using them to do source selection

  • Then move to optimization issues in horizontal integration (“data linking”) scenarios.

    • Join optimization issues in data integration scenarios

Query Processing Issues in Data Aggregation

  • Recall that in DA, all sources are exporting fragments of the same relation R

    • E.g., employment opportunities, bibliography records, item/price records, etc.

    • The fragment of R exported by a source may have fewer columns and/or fewer rows

  • The main issue in DA is “Source Selection”

    • Given a query q, which source(s) should be selected and in what order

  • Objective: Call the fewest sources that will give the most high-quality tuples in the least amount of time

    • Decision version: Call k sources that ….

    • Quality of tuples– may be domain specific (e.g. give lowest price records) or domain independent (e.g. give tuples with fewest null values)

Issues affecting Source Selection in DA

  • Source Overlap

    • In most cases you want to avoid calling overlapping sources

    • …but in some cases you want to call overlapping sources

      • E.g. to get as much information about a tuple as possible; to get the lowest priced tuple etc.

  • Source latency

    • You want to call sources that are likely to respond fast

  • Source quality

    • You want to call sources that have high quality data

      • Domain independent: E.g. High density (fewer null values)

      • Domain specific: E.g. sources having lower-cost books

  • Source “consistency”?

    • Exports data that is error free

Learning Source Statistics

  • Coverage, overlap, latency, density and quality statistics about sources are not likely to be exported by sources!

    • Need to learn them

  • Most of the statistics are source and query specific

    • Coverage and Overlap of a source may depend on the query

    • Latency may depend on the query

    • Density may depend on the query

  • Statistics can be learned in a qualitative or quantitative way

    • LCW vs. coverage/overlap statistics

    • Feasible access patterns vs. binding pattern specific latency statistics

  • Quantitative is more general and amenable to learning

  • Too costly to learn statistics w.r.t. each specific query

    • Challenge: Find right type of query classes with respect to which statistics are learned

      • Query class definition may depend on the type of statistics

  • Since sources, user population and network are all changing, statistics need to be maintained (through incremental changes)

BibFinder Case Study

  • See the BibFinder slides

Case Study: BibFinder, a popular CS bibliographic mediator

  • Integrating 8 online sources: DBLP, ACM DL, ACM Guide, IEEE Xplore, ScienceDirect, Network Bibliography, CSB, CiteSeer

  • More than 58,000 real user queries collected

  • Mediated schema relation in BibFinder:

    paper(title, author, conference/journal, year)

    Primary key: title+author+year

  • Focus on selection queries:

    Q(title, author, year) :- paper(title, author, conference/journal, year)

Selecting top-K sources for a given query

    • Given a query Q, and sources S1….Sn, we need the coverage and overlap statistics of sources Si w.r.t. Q

      • P(S|Q) is the coverage (Probability that a random tuple belonging to Q is exported by source S)

      • P({S1..Sj}|Q) is the overlap between S1..Sj w.r.t. query Q (Probability that a random tuple belonging to Q is exported by all the sources S1..Sj).

      • If we have the coverage and overlap statistics, then it is possible to pick the top-K sources that will give maximal number of tuples for Q.

Computing Effective Coverage provided by a set of sources

Suppose we are calling 3 sources S1, S2, S3 to answer a query Q. The effective coverage we get is P(S1 ∪ S2 ∪ S3 | Q). In order to compute this union, we need the intersection (overlap) statistics, in addition to the coverage statistics.

Given the above, we can pick the optimal 3 sources for answering Q by considering all 3-sized subsets of the source set S1…Sn, and picking the set with highest coverage.
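The union can be computed from the coverage and overlap statistics by inclusion-exclusion; a sketch, with statistics invented for illustration:

```python
from itertools import combinations

def effective_coverage(sources, stats):
    """P(S1 U ... U Sk | Q) by inclusion-exclusion. `stats` maps a
    frozenset of source names to its coverage (singletons) or overlap
    (larger subsets) probability w.r.t. the query Q."""
    total = 0.0
    for r in range(1, len(sources) + 1):
        for subset in combinations(sources, r):
            # odd-sized subsets are added, even-sized subtracted
            total += ((-1) ** (r + 1)) * stats[frozenset(subset)]
    return total

# Illustrative, made-up statistics for three sources w.r.t. some query Q.
stats = {
    frozenset({"S1"}): 0.5, frozenset({"S2"}): 0.4, frozenset({"S3"}): 0.3,
    frozenset({"S1", "S2"}): 0.2, frozenset({"S1", "S3"}): 0.1,
    frozenset({"S2", "S3"}): 0.1, frozenset({"S1", "S2", "S3"}): 0.05,
}
cov = effective_coverage(["S1", "S2", "S3"], stats)
# 0.5 + 0.4 + 0.3 - 0.2 - 0.1 - 0.1 + 0.05 = 0.85
```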

Selecting top-K sources: the greedy way

Selecting optimal K sources is hard in general. One way to reduce cost is to select sources greedily, one after another. For example, to select 3 sources, we select the first source Si as the source with the highest P(Si|Q) value.

To pick the jth source, we compute the residual coverage of each of the remaining sources, given the 1, 2, … j-1 sources we have already picked. The residual coverage computation requires overlap statistics. For example, picking a third source S3 in the context of sources S1 and S2 requires us to calculate:

P(S3 & ¬S1 & ¬S2 | Q) = P(S3|Q) − P(S1&S3|Q) − P(S2&S3|Q) + P(S1&S2&S3|Q)
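The greedy loop can be sketched as follows; residual coverage is computed by inclusion-exclusion over the overlaps with the already-picked sources, and the statistics are invented for illustration:

```python
from itertools import combinations

def residual_coverage(s, picked, stats):
    """P(s & not-picked | Q): the new tuples s contributes beyond the
    already-picked sources, by inclusion-exclusion over overlaps.
    `stats` maps frozensets of source names to probabilities."""
    total = stats[frozenset({s})]
    for r in range(1, len(picked) + 1):
        for subset in combinations(picked, r):
            total += ((-1) ** r) * stats[frozenset(subset) | {s}]
    return total

def greedy_select(k, sources, stats):
    """Pick k sources one at a time, each maximizing residual coverage."""
    picked = []
    for _ in range(k):
        best = max((s for s in sources if s not in picked),
                   key=lambda s: residual_coverage(s, picked, stats))
        picked.append(best)
    return picked

# Illustrative, made-up coverage/overlap statistics.
stats = {
    frozenset({"S1"}): 0.5, frozenset({"S2"}): 0.4, frozenset({"S3"}): 0.3,
    frozenset({"S1", "S2"}): 0.2, frozenset({"S1", "S3"}): 0.1,
    frozenset({"S2", "S3"}): 0.1, frozenset({"S1", "S2", "S3"}): 0.05,
}
order = greedy_select(2, ["S1", "S2", "S3"], stats)
```

The greedy pick evaluates only n-1 residual coverages per step instead of enumerating all K-sized subsets.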

Challenges in gathering overlap statistics

    • Sources are incomplete and partially overlapping

    • Calling every possible source is inefficient and impolite

    • Need coverage and overlap statistics to figure out what sources are most relevant for every possible query!

    • We introduce a frequency-based approach for mining these statistics

Query List

  • Each query q corresponds to a vector of coverage/overlap statistics. If there are 3 sources S1, S2, S3, we have:

    [P(S1|q), P(S2|q), P(S3|q), P(S1&S2|q), P(S2&S3|q), P(S1&S3|q), P(S1&S2&S3|q)]

  • A sparse vector with exponential dimensions

    • By keeping thresholds on minimum overlap, we can avoid remembering small values

    • The larger the thresholds, the sparser the vectors

Issues in Storing & Using Statistics

    • Storing statistics for each query is disadvantageous

      • Too many queries

      • Stored statistics can only be useful if the same query comes up again

    • Idea 1: Focus only on frequently asked queries

    • Idea 2: Store statistics w.r.t. query classes

      • Generate query classes by clustering..

    • When a new query comes, we can map it to some existing query classes

  • But Clustering directly on queries won’t work

    • Because we won’t know how to map a new query into existing query classes

    • Idea: First do “subspace” clustering—cluster attribute values

    • A query class is then defined as a cross product of attribute value clusters

StatMiner

A query is a vector of overlap statistics.

Latency statistics (Or, what good is coverage without good response time?)

    • Sources vary significantly in terms of their response times

      • The response time depends both on the source itself, as well as the query that is asked of it

        • Specifically, what fields are bound in the selection query can make a difference

    • ..So, learn statistics w.r.t. binding patterns

Query Binding Patterns

    • A binding pattern refers to which arguments of a relational query are “bound”

      • Given a relation S(X,Y,Z)

        • A query S(“Rao”, Y, “Tom”) has binding pattern bfb

        • A query S(X,Y, “TOM”) has binding pattern ffb

    • Binding patterns can be generalized to take “types” of bindings

      • E.g., S(X,Y,1) may be ffn (n being a numeric binding) and S(X,Y,“TOM”) may be ffs (s being a string binding)

    • Sources tend to have different latencies based on the binding pattern

      • In extreme cases, certain binding patterns may have infinite latency (i.e., you are not allowed to ask that query)

        • Called “infeasible” binding patterns
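Deriving a binding pattern from a query's argument list can be sketched as below; the helper is hypothetical, and free variables are represented as None:

```python
def binding_pattern(args, typed=False):
    """Return the binding pattern of a query's argument list:
    'f' = free, 'b' = bound; with typed=True, bound positions are
    refined into 'n' (numeric binding) or 's' (string binding)."""
    pat = []
    for a in args:
        if a is None:
            pat.append("f")          # unbound variable
        elif not typed:
            pat.append("b")          # bound, type ignored
        elif isinstance(a, (int, float)):
            pat.append("n")          # numeric binding
        else:
            pat.append("s")          # string binding
    return "".join(pat)

# The slide's examples over a relation S(X, Y, Z):
bp1 = binding_pattern(["Rao", None, "Tom"])             # S("Rao", Y, "Tom") -> bfb
bp2 = binding_pattern([None, None, "TOM"])              # S(X, Y, "TOM")     -> ffb
bp3 = binding_pattern([None, None, 1], typed=True)      # S(X, Y, 1)         -> ffn
bp4 = binding_pattern([None, None, "TOM"], typed=True)  # S(X, Y, "TOM")     -> ffs
```

Latency statistics would then be stored per (source, binding pattern) pair, and infeasible patterns flagged so the optimizer never issues them.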

(Digression)

    • LCWs are the “qualitative” versions of quantitative coverage/overlap statistics

    • Feasible binding patterns are “qualitative” versions of quantitative latency statistics

Combining coverage and response time

    • Qn: How do we define an optimal plan in the context of both coverage/overlap and response time requirements?

      • An instance of “multi-objective” optimization

        • General solution involves presenting a set of “pareto-optimal” solutions to the user and let her decide

          • Pareto-optimal set is a set of solutions where no solution is dominated by another one in all optimization dimensions (i.e., both better coverage and lower response time)

        • Another idea is to combine both objectives into a single weighted objective
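A minimal sketch of extracting the pareto-optimal set from candidate plans; the plan tuples are hypothetical:

```python
def dominates(p, q):
    """p dominates q if p is at least as good on both objectives
    (coverage: higher is better, time: lower is better) and strictly
    better on at least one. Plans are (name, coverage, time) tuples."""
    _, pc, pt = p
    _, qc, qt = q
    return pc >= qc and pt <= qt and (pc > qc or pt < qt)

def pareto_front(plans):
    """The plans not dominated by any other plan; the user then
    chooses among these according to her own trade-off."""
    return [p for p in plans if not any(dominates(q, p) for q in plans)]

# Hypothetical plans: (name, coverage, response time in seconds).
plans = [("A", 0.9, 5.0), ("B", 0.7, 2.0), ("C", 0.6, 3.0)]
front = pareto_front(plans)  # C is dominated by B; A and B remain
```

A weighted single objective (e.g. w * coverage − (1 − w) * time) would instead pick one point from this front for a given weight w.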

It is possible to optimize for first tuples