Issues in Bridging DB & IR

Administrivia
  • Homework 4 socket open
    • *PLEASE* start working. There may not be an extra week before submission
    • Considering making Homework 4 subsume the second exam—okay?
  • Topics coming up
    • DB/IR (1.5 classes); Collection Selection (.5 classes)
    • Social Network Analysis (1 class); Webservices (1 class)
    • Interactive review/Summary (last class)
DB and IR: Two Parallel Universes

Parallel universes forever?

                            Database Systems                     Information Retrieval
  canonical application:    accounting                           libraries
  data type:                numbers, short strings               text
  foundation:               algebraic / logic based              probabilistic / statistics based
  search paradigm:          Boolean retrieval                    ranked retrieval
                            (exact queries, result sets/bags)    (vague queries, result lists)

DB vs. IR

  Databases:
  • DBs allow structured querying
  • Queries and results (tuples) are different objects
  • Soundness & Completeness expected
  • User is expected to know what she is doing

  IR:
  • IR only supports unstructured querying
  • Queries and results are both documents!
  • High Precision & Recall is hoped for
  • User is expected to be a dunderhead.
Top-down Motivation: Applications (1) - Customer Support -

  Typical data:
  • Customers (CId, Name, Address, Area, Category, Priority, ...)
  • Requests (RId, CId, Date, Product, ProblemType, Body, RPriority, WFId, ...)
  • Answers (AId, RId, Date, Class, Body, WFId, WFStatus, ...)

  Typical queries (premium customer from Germany):
  “A notebook, model ... configured with ..., has a problem with the driver of
  its Wave-LAN card. I already tried the fix ..., but received error message ...”
  • request classification & routing
  • find similar requests

  Why customizable scoring?
  • wealth of different apps within this app class
  • different customer classes
  • adjustment to evolving business needs
  • scoring on text + structured data
    (weighted sums, language models, skyline, w/ correlations, etc. etc.)

  Platform desiderata (from the app developer’s viewpoint):
  • Flexible ranking and scoring on text, categorical, and numerical attributes
  • Incorporation of dimension hierarchies for products, locations, etc.
  • Efficient execution of complex queries over text and data attributes
  • Support for high update rates concurrently with high query load
Top-down Motivation: Applications (2)

More application classes:

  • Global health-care management for monitoring epidemics
  • News archives for journalists, press agencies, etc.
  • Product catalogs for houses, cars, vacation places, etc.
  • Customer relationship management in banks, insurance, telecom, etc.
  • Bulletin boards for social communities
  • P2P personalized & collaborative Web search
  • etc. etc.
Top-down Motivation: Applications (3)

  Next wave “Text2Data”:
  use Information-Extraction technology (regular expressions, HMMs, lexicons,
  other NLP and ML techniques) to convert text docs into relational facts,
  moving up in the value chain.

  Example:
  “The CIDR‘05 conference takes place in Asilomar from Jan 4 to Jan 7,
  and is organized by D.J. DeWitt, Mike Stonebreaker, ...”

  Extracted relations (each extracted fact carries a confidence score, e.g. 0.95, 0.9, 0.75):

  Conference (Name, Year, Location, Date)
    CIDR   2005   Asilomar   05/01/04

  ConfOrganization (Name, Year, Chair)
    CIDR   2005   P68
    CIDR   2005   P35

  People (Id, Name)
    P35   Michael Stonebraker
    P68   David J. DeWitt

  • facts now have confidence scores
  • queries involve probabilistic inferences and result ranking
  • relevant for “business intelligence”

Some specific problems
  • How to handle textual attributes in data processing (e.g. Joins)?
  • How to support keyword-based querying over normalized relations?
  • How to handle imprecise queries?

(Ullas Nambiar’s work)

  • How to do query processing for top-K results?

(Surajit et al.’s paper in CIDR 2005)

1. Handling text fields in data tuples
  • Often you have database relations some of whose fields are “Textual”
    • E.g. a movie database, which has, in addition to year, director etc., a column called “Review” which is unstructured text
  • Normal DB operations ignore this unstructured stuff (can’t join over them).
    • SQL sometimes supports a “Contains” constraint (e.g. give me movies that contain “Rotten” in the review)
STIR (Simple Text in Relations)
  • The elements of a tuple are seen as documents (rather than atoms)
  • The query language is the same as SQL, except for an added “similarity” predicate
Soft Joins..WHIRL [Cohen]
  • We can extend the notion of Joins to “Similarity Joins” where similarity is measured in terms of vector similarity over the text attributes. So, the join tuples are output in a ranked form—with the rank proportional to the similarity
    • Neat idea… but does have some implementation difficulties
      • Most tuples in the cross-product will have non-zero similarities. So, need query processing that will somehow just produce highly ranked tuples
        • Uses A*-search to focus on top-K answers
        • (See Surajit et al., CIDR 2005, who argue for a whole new query algebra to help support top-K query processing)
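To make the similarity-join idea concrete, here is a minimal sketch of a TF-IDF soft join in Python. This is not Cohen's WHIRL implementation (which uses inverted indices and A* search to avoid scoring the whole cross product); this toy version scores every pair with a naive tokenizer, and the relation contents and function names are invented for the example.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Build unit-length TF-IDF vectors for a list of short text fields."""
    docs = [t.lower().split() for t in texts]        # naive whitespace tokenizer
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: tf[t] * math.log(1 + n / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vecs.append({t: w / norm for t, w in vec.items()})
    return vecs

def cosine(u, v):
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def soft_join(left, right, top_k=5):
    """Rank every (left, right) pair by TF-IDF cosine similarity of the join columns."""
    vecs = tfidf_vectors(left + right)
    lv, rv = vecs[:len(left)], vecs[len(left):]
    pairs = [(cosine(a, b), l, r)
             for a, l in zip(lv, left) for b, r in zip(rv, right)]
    return sorted(pairs, reverse=True)[:top_k]

# Toy review titles vs. listing titles (hypothetical data).
reviews  = ["The Hitchhiker's Guide to the Galaxy, 2005", "Sin City review"]
listings = ["Hitchhiker's Guide to the Galaxy", "Star Wars Episode III"]
print(soft_join(reviews, listings))
```

The output is a ranked list of join pairs, with the closest title matches first, which is exactly the ranked-join behavior described above.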
WHIRL queries
  • Assume two relations:

review(movieTitle,reviewText): archive of reviews

listing(theatre, movieTitle, showTimes, …): now showing

WHIRL queries
  • “Find reviews of sci-fi comedies” [movie domain]

FROM review AS r SELECT * WHERE r.text~’sci fi comedy’

(like standard ranked retrieval of “sci-fi comedy”)

  • “Where is [that sci-fi comedy] playing?”

FROM review AS r, listing AS s SELECT *

WHERE r.title~s.title and r.text~’sci fi comedy’

(best answers: titles are similar to each other – e.g., “Hitchhiker’s Guide to the Galaxy” and “The Hitchhiker’s Guide to the Galaxy, 2005” and the review text is similar to “sci-fi comedy”)

WHIRL queries
  • Similarity is based on TF-IDF: rare words are most important.
  • Search for high-ranking answers uses inverted indices….
WHIRL queries
  • Similarity is based on TF-IDF: rare words are most important.
    • e.g., years are common in the review archive, so they have low weight
  • Search for high-ranking answers uses inverted indices….
    • It is easy to find the (few) items that match on “important” terms
    • Search for strong matches can prune “unimportant” terms

WHIRL results
  • This sort of worked:
    • Interactive speeds (<0.3s/q) with a few hundred thousand tuples.
    • For 2-way joins, average precision (sort of like area under precision-recall curve) from 85% to 100% on 13 problems in 6 domains.
    • Average precision better than 90% on 5-way joins
WHIRL and soft integration

  What worked:
  • WHIRL worked for a number of web-based demo applications,
    e.g., integrating data from 30-50 smallish web DBs with <1 FTE of labor
  • WHIRL could link many data types reasonably well, without engineering
  • WHIRL generated numerous papers (SIGMOD98, KDD98, Agents99, AAAI99,
    TOIS2000, AIJ2000, ICML2000, JAIR2001)

  Limitations:
  • WHIRL was relational (but see ELIXIR, SIGIR2001)
  • WHIRL users need to know the schema of the source DBs
  • WHIRL’s query-time linkage worked only for TFIDF, token-based distance metrics
    → text fields with few misspellings
  • WHIRL was memory-based: all data must be centrally stored (no federated data)
    → small datasets only
WHIRL vision: very radical, everything was inter-dependent

  Query Q:
    SELECT R.a, S.a, S.b, T.b FROM R, S, T
    WHERE R.a~S.a AND S.b~T.b        (~ means TFIDF-similar)

  • Link items as needed by Q
  • Incrementally produce a ranked list of possible links, with “best matches”
    first. User (or downstream process) decides how much of the list to generate
    and examine.

String Similarity Metrics
  • Tf-idf measures are not really very good at handling similarity between “short textual attributes” (e.g. titles)
    • String similarity metrics are more suitable
  • String similarity can be handled in terms of
    • Edit distance (# of primitive ops such as “backspace”, “overtype”) needed to convert one string into another
    • N-gram distance (see next slide)
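As a concrete reference for the edit-distance bullet above, here is the standard Levenshtein dynamic program; it is a generic sketch (not tied to any particular system in the slides) counting insertions, deletions, and overtypes as unit-cost operations.

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions ("overtypes") needed to turn s into t."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # delete from s
                         cur[j - 1] + 1,     # insert into s
                         prev[j - 1] + cost) # overtype (or keep on match)
        prev = cur
    return prev[n]

print(edit_distance("hitchhiker", "hithhiker"))   # 1: one deleted character
```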
N-gram distance
  • An n-gram of a string is a contiguous n-character subsequence of the string
    • 3-grams of the string “hitchhiker” are
      • {hit, itc, tch, chh, hhi, hik, ike, ker}
    • “space” can be treated as a special character
  • A string S can be represented as a set of its n-grams
    • Similarity between two strings can be defined in terms of the similarity between the sets
      • Can use Jaccard similarity
  • N-grams are to strings what K-shingles are to documents
    • Document duplicate detection is often done in terms of the set similarity between its shingles
      • Each shingle is hashed to a hash signature. A jaccard similarity is computed between the document shingle sets
        • Useful for plagiarism detection (e.g. the Turnitin software does this)
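A small sketch of n-gram extraction and Jaccard set similarity, using the “hitchhiker” example from the slide; the helper names are purely illustrative.

```python
def ngrams(s, n=3):
    """Contiguous n-character substrings of s; spaces count as ordinary characters."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

g1 = ngrams("hitchhiker")      # the eight 3-grams listed on the slide
g2 = ngrams("hitchiker")       # a misspelled variant
print(sorted(g1))
print(jaccard(g1, g2))         # still fairly high despite the misspelling
```

The same set-similarity computation, applied to hashed k-shingles instead of character n-grams, is what document duplicate detection uses.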
2. Supporting keyword search on databases

  How do we answer a query like “Soumen Sunita”?

  Issues:
  • The schema is normalized (not everything is in one table)
  • How to rank multiple tuples which contain the keywords?

Motivation
  • Keyword search of documents on the Web has been enormously successful
    • Simple and intuitive, no need to learn any query language
  • Database querying using keywords is desirable
    • SQL is not appropriate for casual users
    • Form interfaces cumbersome:
      • Require separate form for each type of query — confusing for casual users of Web information systems
      • Not suitable for ad hoc queries
Motivation
  • Many Web documents are dynamically generated from databases
    • E.g. Catalog data
  • Keyword querying of generated Web documents
    • May miss answers that need to combine information on different pages
    • Suffers from duplication overheads
Examples of Keyword Queries
  • On a railway reservation database
    • “mumbai bangalore”
  • On a university database
    • “database course”
  • On an e-store database
    • “camcorder panasonic”
  • On a book store database
    • “sudarshan databases”
Differences from IR/Web Search
  • Related data split across multiple tuples due to normalization
    • E.g. Paper (paper-id, title, journal), Author (author-id, name), Writes (author-id, paper-id, position)
  • Different keywords may match tuples from different relations
    • What joins are to be computed can only be decided on the fly
      • Cites(citing-paper-id, cited-paper-id)
Connectivity
  • Tuples may be connected by
    • Foreign key and object references
    • Inclusion dependencies and join conditions
    • Implicit links (shared words), etc.
  • Would like to find sets of (closely) connected tuples that match all given keywords
Basic Model
  • Database: modeled as a graph
    • Nodes = tuples
    • Edges = references between tuples
      • foreign key, inclusion dependencies, ..
      • Edges are directed.

  [Example graph (BANKS keyword search): nodes are tuples from the paper, writes,
  and author relations, e.g. the paper “MultiQuery Optimization” and the authors
  Charuta, S. Sudarshan, and Prasan Roy, connected by foreign-key edges.]

Answer Example

  Query: sudarshan roy

  [Answer tree: the paper node “MultiQuery Optimization” connects through two
  writes nodes to the author nodes S. Sudarshan and Prasan Roy.]

Combining Keyword Search and Browsing
  • Catalog searching applications
    • Keywords may restrict answers to a small set, then user needs to browse answers
    • If there are multiple answers, hierarchical browsing required on the answers
What BANKS Does

  • The whole DB is seen as a directed graph (edges correspond to foreign keys)
  • Answers are subgraphs, ranked by edge weights

Solutions as rooted weighted trees
  • In BANKS, each potential solution is a rooted weighted tree where
    • Nodes are tuples from tables
      • Node weight can be defined in terms of “pagerank” style notions (e.g. back-degree)
        • They use log(1+x) where x is the number of back links
    • Edges are foreign-primary key references between tuples across tables
      • Links are given domain specific weights
        • A Paper→writes edge is seen as stronger than a Paper→cites edge
    • Tuples in the tree must cover keywords
  • Relevance of a tree is based on its weight
    • Weight of the tree is a combination of its node and link weights
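A toy sketch of the scoring idea just described (not the BANKS search algorithm itself): node prestige is log(1 + number of back links) as on the slide, while the way node and edge parts are combined below (an alpha-weighted sum) is only an illustrative assumption; the tuple identifiers, back-link counts, and edge weights are hypothetical.

```python
import math

# Hypothetical answer tree for the query "sudarshan roy": nodes are tuples,
# edges are foreign-key references between them, with domain-specific weights.
back_links = {"paper:MQO": 4, "writes:1": 0, "writes:2": 0,
              "author:Sudarshan": 7, "author:Roy": 3}   # back-link counts (made up)
tree_edges = [("writes:1", "paper:MQO", 1.0), ("writes:1", "author:Sudarshan", 1.0),
              ("writes:2", "paper:MQO", 1.0), ("writes:2", "author:Roy", 1.0)]

def node_weight(node):
    # "PageRank-style" prestige from the slide: log(1 + number of back links)
    return math.log(1 + back_links[node])

def tree_score(nodes, edges, alpha=0.5):
    """Combine average node prestige and average edge weight into one
    relevance score for the answer tree (the combination rule is illustrative)."""
    node_part = sum(node_weight(n) for n in nodes) / len(nodes)
    edge_part = sum(w for _, _, w in edges) / len(edges)
    return alpha * node_part + (1 - alpha) * edge_part

print(tree_score(list(back_links), tree_edges))
```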

Part III: Answer Imprecise Queries with

[ICDE 2006;WebDB, 2004; WWW, 2004]

Why Imprecise Queries?

  A feasible query: Make = “Toyota”, Model = “Camry”, Price ≤ $7000

    Make     Model   Price   Year
    Toyota   Camry   $7000   1999
    Toyota   Camry   $7000   2001
    Toyota   Camry   $6700   2000
    Toyota   Camry   $6500   1998
    ………

  But the user really wants a ‘sedan’ priced around $7000:
  • What about the price of a Honda Accord?
  • Is there a Camry for $7100?

  Solution: Support Imprecise Queries
Dichotomy in Query Processing
  • Databases
    • User knows what she wants
    • User query completely expresses the need
    • Answers exactly matching query constraints
  • IR Systems
    • User has an idea of what she wants
    • User query captures the need to some degree
    • Answers ranked by degree of relevance

  Imprecise queries on databases cross the divide.

Existing Approaches
  • Similarity search over Vector space
    • Data must be stored as vectors of text

WHIRL, W. Cohen, 1998

  • Enhanced database model
    • Add ‘similar-to’ operator to SQL. Distances provided by an expert/system designer

VAGUE, A. Motro, 1998

    • Support similarity search and query refinement over abstract data types

Binderberger et al, 2003

  • User guidance
    • Users provide information about objects required and their possible neighborhood

Proximity Search, Goldman et al, 1998

  • Limitations of existing approaches:
    • User/expert must provide similarity measures
    • New operators to use distance measures
    • Not applicable over autonomous databases
  • Our objectives:
    • Minimal user input
    • Database internals not affected
    • Domain-independent & applicable to Web databases
Imprecise queries vs. Empty queries
  • The “empty query” problem arises when the user’s query, when submitted to the database, returns an empty set of answers.
    • We want to develop methods that can automatically minimally relax this empty query and resubmit it so the user gets some results
  • Existing approaches for empty query problem are mostly syntactic—and rely on relaxing various query constraints
    • Little attention is paid to the best order in which to relax the constraints.
  • The imprecise query problem is a generalization of the empty query problem
    • We may have a non-empty set of answers to the base query
    • We are interested not just in returning some tuples, but in returning them in order of relevance
General ideas for supporting imprecise queries
  • Main issues are
    • How to rewrite the base query such that more relevant tuples can be retrieved.
    • How to rank the retrieved tuples in the order of relevance.

A spectrum of approaches are possible—including

    • Data-dependent approaches
    • User-dependent approaches
    • Collaborative approaches

We will look at an approach—which is basically data-dependent

An Example
  • Relation: CarDB(Make, Model, Price, Year)
  • Imprecise query
    Q :− CarDB(Model like “Camry”, Price like “10k”)
  • Base query
    Qpr :− CarDB(Model = “Camry”, Price = “10k”)
  • Base set Abs
    Make = “Toyota”, Model = “Camry”, Price = “10k”, Year = “2000”
    Make = “Toyota”, Model = “Camry”, Price = “10k”, Year = “2001”
Obtaining Extended Set
  • Problem: Given base set, find tuples from database similar to tuples in base set.
  • Solution:
    • Consider each tuple in base set as a selection query.

e.g. Make = “Toyota”, Model = “Camry”, Price = “10k”, Year = “2000”

    • Relax each such query to obtain “similar” precise queries.

e.g. Make = “Toyota”, Model = “Camry”, Price = “”, Year =“2000”

    • Execute and determine tuples having similarity above some threshold.
  • Challenge: Which attribute should be relaxed first ?
    • Make ? Model ? Price ? Year ?
  • Solution: Relax the least important attribute first.
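A minimal sketch of the relaxation step just described: starting from a base-set tuple, drop combinations of the least important attributes first to generate “similar” precise queries. The attribute names follow the CarDB example; the relaxation order is assumed to come from the AFD analysis described on the next slides.

```python
from itertools import combinations

def relaxed_queries(base_tuple, order):
    """Generate relaxed selection queries from a base-set tuple by dropping
    combinations of attributes, least important attributes first."""
    for k in range(1, len(order)):                 # 1-attribute, then 2-attribute, ...
        for dropped in combinations(order, k):
            yield {a: v for a, v in base_tuple.items() if a not in dropped}

# Hypothetical base-set tuple and relaxation order (least important first).
base  = {"Make": "Toyota", "Model": "Camry", "Price": "10k", "Year": "2000"}
order = ["Price", "Model", "Year", "Make"]
for q in relaxed_queries(base, order):
    print(q)      # each dict is one "similar" precise query to execute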
Least Important Attribute
  • Definition: An attribute whose binding value when changed has minimal effect on values binding other attributes.
    • Does not decide values of other attributes
    • Value may depend on other attributes
  • E.g. Changing/relaxing Price will usually not affect other attributes
  • but changing Model usually affects Price
  • Dependence between attributes useful to decide relative importance
    • Approximate Functional Dependencies & Approximate Keys
      • Approximate in the sense that they are obeyed by a large percentage (but not all) of tuples in the database
        • Can use TANE, an algorithm by Huhtala et al [1999]
Attribute Ordering
  • Given a relation R
    • Determine the AFDs and approximate keys
    • Pick the key with highest support, say Kbest
    • Partition the attributes of R into
      • key attributes, i.e. belonging to Kbest
      • non-key attributes, i.e. not belonging to Kbest
    • Sort the subsets using influence weights: the weight of an attribute Ai ∈ A’ ⊆ R
      aggregates the strength of its dependencies with the other attributes Aj,
      j ≠ i, j = 1 to |Attributes(R)|
  • Attribute relaxation order is all non-keys first, then keys
  • Multi-attribute relaxation assumes independence between attributes

  Example: CarDB(Make, Model, Year, Price)
    Key attributes: Make, Year
    Non-key attributes: Model, Price
    Relaxation order: Price, Model, Year, Make
    1-attribute relaxations: {Price, Model, Year, Make}
    2-attribute relaxations: {(Price, Model), (Price, Year), (Price, Make), …}
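For illustration, here is a crude stand-in for the AFD-based ordering: estimate the confidence of each approximate FD directly from the data and rank attributes by how strongly the other attributes depend on them, least important first. The actual system mines AFDs and approximate keys with TANE and uses the influence-weight formula above; this sketch and its toy data only approximate that idea.

```python
from collections import Counter, defaultdict

def afd_confidence(rows, x, y):
    """Confidence of the approximate FD x -> y: the largest fraction of rows
    that satisfy it once violating rows are removed."""
    groups = defaultdict(Counter)
    for r in rows:
        groups[r[x]][r[y]] += 1
    kept = sum(c.most_common(1)[0][1] for c in groups.values())
    return kept / len(rows)

def attribute_order(rows, attrs):
    """Order attributes by how strongly the others depend on them
    (a crude stand-in for the influence weights): least important first."""
    return sorted(attrs,
                  key=lambda a: sum(afd_confidence(rows, a, b)
                                    for b in attrs if b != a))

# Toy CarDB fragment (hypothetical tuples).
rows = [{"Make": "Toyota", "Model": "Camry",  "Year": 2000, "Price": 7000},
        {"Make": "Toyota", "Model": "Camry",  "Year": 2001, "Price": 7500},
        {"Make": "Honda",  "Model": "Accord", "Year": 2000, "Price": 7200}]
print(attribute_order(rows, ["Make", "Model", "Year", "Price"]))
```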

Tuple Similarity
  • Tuples obtained after relaxation are ranked according to their similarity to
    the corresponding tuples in the base set:

      Sim(t1, t2) = Σi Wi × VSim(t1.Ai, t2.Ai)

    where Wi = normalized influence weights, Σ Wi = 1, i = 1 to |Attributes(R)|
  • Value similarity (VSim)
    • Euclidean for numerical attributes, e.g. Price, Year
    • Concept similarity for categorical attributes, e.g. Make, Model
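A small sketch of the weighted tuple-similarity score. The 1/(1 + |difference|) form used for numeric attributes, the example weights, and the concept-similarity entry are placeholders rather than the paper's exact measures; in practice numeric distances would be normalized by the attribute's range.

```python
def value_sim(attr, v1, v2, concept_sim):
    """Similarity of two attribute values: a 1/(1+distance) placeholder for
    numeric attributes, a concept-similarity lookup for categorical ones."""
    if isinstance(v1, (int, float)):
        return 1.0 / (1.0 + abs(v1 - v2))   # placeholder; normalize by attribute range in practice
    return concept_sim.get((attr, v1, v2), 1.0 if v1 == v2 else 0.0)

def tuple_similarity(t1, t2, weights, concept_sim):
    """Weighted sum of per-attribute similarities; weights assumed normalized to 1."""
    return sum(w * value_sim(a, t1[a], t2[a], concept_sim)
               for a, w in weights.items())

weights     = {"Make": 0.3, "Model": 0.4, "Year": 0.1, "Price": 0.2}   # hypothetical influence weights
concept_sim = {("Make", "Toyota", "Honda"): 0.22}                      # hypothetical concept similarity
t1 = {"Make": "Toyota", "Model": "Camry",  "Year": 2000, "Price": 7000}
t2 = {"Make": "Honda",  "Model": "Accord", "Year": 2001, "Price": 6800}
print(tuple_similarity(t1, t2, weights, concept_sim))
```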
Concept (Value) Similarity
  • Concept: any distinct attribute-value pair, e.g. Make=Toyota
    • Visualized as a selection query binding a single attribute
    • Represented as a supertuple, e.g. the supertuple for the concept Make=Toyota
      (the values co-occurring with Make=Toyota in the other attributes)
  • Concept similarity: estimated as the percentage of correlated values common to
    two given concepts, where v1, v2 ∈ Aj, i ≠ j and Ai, Aj ∈ R
    • Measured as the Jaccard similarity among the supertuples representing the concepts:

      JaccardSim(A, B) = |A ∩ B| / |A ∪ B|
Concept (Value) Similarity Graph

  [Example graph over values of Make: nodes Toyota, Honda, Ford, Chevrolet, Dodge,
  Nissan, and BMW, with edge weights (0.25, 0.22, 0.16, 0.15, 0.12, 0.11) giving
  the estimated similarity between pairs of makes.]
Empirical Evaluation of
  • Goal
    • Evaluate the effectiveness of the query relaxation and concept learning
  • Setup
    • A database of used cars

CarDB( Make, Model, Year, Price, Mileage, Location, Color)

    • Populated using 30k tuples from Yahoo Autos
    • Concept similarity estimated for Make, Model, Location, Color
    • Two query relaxation algorithms
      • RandomRelax – randomly picks attribute to relax
      • GuidedRelax – uses relaxation order determined using approximate keys and AFDs
Evaluating the effectiveness of relaxation
  • Test scenario
    • 10 randomly selected base queries from CarDB
    • Retrieve the 20 tuples showing similarity > ε, with 0.5 < ε < 1
    • Weighted summation of attribute similarities
      • Euclidean distance used for Year, Price, Mileage
      • Concept similarity used for Make, Model, Location, Color
    • Limit of 64 relaxed queries per base query
      • (with 7 attributes, 2^7 = 128 relaxed queries are possible)
    • Efficiency measured as the number of tuples that must be extracted per
      relevant tuple found
Efficiency of Relaxation

  Guided Relaxation:
  • Average of 4 tuples extracted per relevant tuple for ε = 0.5; goes up to
    12 tuples for ε = 0.7
  • Resilient to changes in ε

  Random Relaxation:
  • Average of 8 tuples extracted per relevant tuple for ε = 0.5; increases to
    120 tuples for ε = 0.7
  • Not resilient to changes in ε

  (Note: the two plots are on different scales.)

Summary
  • An approach for answering imprecise queries over Web databases
    • Mine and use AFDs to determine attribute importance
    • Domain-independent concept similarity estimation technique
    • Tuple similarity score as a weighted sum of attribute similarity scores
  • Empirical evaluation shows
    • Reasonable concept similarity models estimated
    • Set of similar precise queries efficiently identified
Collection Selection/Meta Search Introduction
  • Metasearch Engine
      • A system that provides unified access to multiple existing search engines.
  • Metasearch Engine Components
    • Database Selector
      • Identifying potentially useful databases for each user query
    • Document Selector
      • Identifying potentially useful documents returned from the selected databases
    • Query Dispatcher and Result Merger
      • Ranking the selected documents
Collection Selection

  [Pipeline: a query goes through collection selection, query execution, and
  result merging over a set of collections such as WSJ, WP, CNN, NYT, and FT.]
Evaluating collection selection
  • Let c1..cj be the collections that are chosen to be accessed for the query Q. Let d1…dk be the top documents returned from these collections.
  • We compare these results to the results that would have been returned from a central union database
    • Ground Truth: The ranking of documents that the retrieval technique (say vector space or jaccard similarity) would have retrieved from a central union database that is the union of all the collections
  • Compare the precision of the documents returned by accessing only the selected collections against this ground truth
General Scheme & Challenges
  • Get a representative of each of the databases
    • Representative is a sample of files from the database
      • Challenge: get an unbiased sample when you can only access the database through queries.
  • Compare the query to the representatives to judge the relevance of a database
    • Coarse approach: Convert the representative files into a single file (super-document). Take the (vector) similarity between the query and the super document of a database to judge that database’s relevance
    • Finer approach: Keep the representative as a mini-database. Union the mini-databases to get a central mini-database. Apply the query to the central mini-database and find the top k answers from it. Decide the relevance of each database based on which of the answers came from which database’s representative
      • You can use an estimate of the size of the database too
    • What about overlap between collections? (See ROSCO paper)
Uniform Probing for Content Summary Construction
  • Automatic extraction of document frequency statistics from uncooperative databases
    • [Callan and Connell TOIS 2001],[Callan et al. SIGMOD 1999]
  • Main Ideas
    • Pick a word and send it as a query to database D
      • RandomSampling-OtherResource(RS-Ord): from a dictionary
      • RandomSampling-LearnedResource(RS-Lrd): from retrieved documents
    • Retrieve the top-k documents returned
    • If the number of retrieved documents exceeds a threshold T, stop; otherwise restart at the beginning
    • Typically k = 4, T = 300
    • Compute the sample document frequency for each word that appeared in a retrieved document.
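A sketch of the probing loop under the RS-Ord variant (probe words drawn from a dictionary), with k = 4 and a stop threshold of 300 sampled documents as above. The search function is a placeholder for whatever query interface the remote database exposes; the toy corpus is invented.

```python
import random
from collections import Counter

def sample_content_summary(search, dictionary, k=4, max_docs=300):
    """Query-based sampling: probe the database with single-word queries drawn
    from a dictionary (RS-Ord), keep the top-k documents per probe, stop once
    max_docs documents are sampled, then compute sample document frequencies."""
    sampled, seen_ids = [], set()
    vocab = list(dictionary)
    while len(sampled) < max_docs and vocab:
        word = vocab.pop(random.randrange(len(vocab)))
        for doc_id, text in search(word, k):
            if doc_id not in seen_ids:
                seen_ids.add(doc_id)
                sampled.append(text)
        # RS-Lrd variant (not shown): draw later probe words from the sampled documents
    return Counter(term for text in sampled for term in set(text.lower().split()))

# Toy stand-in for a remote database's query interface.
corpus = {1: "neural network training", 2: "database query optimization",
          3: "network routing protocols"}
def toy_search(word, k):
    return [(i, t) for i, t in corpus.items() if word in t][:k]

print(sample_content_summary(toy_search, ["network", "database", "protein"]))
```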
CORI Net Approach (Representative as a super document)
  • Representative Statistics
    • The document frequency for each term and each database
    • The database frequency for each term
  • Main Ideas
    • Visualize the representative of a database as a super-document, and the set of all representatives as a database of super-documents
    • Document frequency becomes term frequency in the super document, and database frequency becomes document frequency in the super database
    • Ranking scores can be computed using a similarity function such as the Cosine function
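A simplified stand-in for this super-document view (not the exact CORI weighting): each representative is collapsed into one bag of words, and collections are ranked by a cosine-style score between the query and the super-documents. The collection names and documents are hypothetical.

```python
import math
from collections import Counter

def rank_collections(query, representatives):
    """Rank collections by matching the query against each collection's
    representative, treated as one super-document (simplified CORI-style scoring)."""
    supers = {name: Counter(" ".join(docs).lower().split())
              for name, docs in representatives.items()}
    n = len(supers)
    dbf = Counter(t for s in supers.values() for t in s)   # "database frequency"
    q_terms = query.lower().split()

    def score(sup):
        num = sum(sup[t] * math.log(1 + n / dbf[t]) for t in q_terms if dbf[t])
        return num / math.sqrt(sum(v * v for v in sup.values()))

    return sorted(supers, key=lambda name: score(supers[name]), reverse=True)

reps = {"CNN": ["bank mergers in europe", "markets fall on merger news"],
        "WSJ": ["bank stocks rally", "mergers and acquisitions roundup"],
        "FT":  ["football scores", "weather report"]}
print(rank_collections("bank mergers", reps))
```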
ReDDE Approach(Representative as a mini-collection)
  • Use the representatives as mini collections
  • Construct a union-representative that is the union of the mini-collections (such that each document keeps information on which collection it was sampled from)
    • Send the query first to the union-representative and get the top-k ranked results
    • See which of the results in the top-k came from which mini-collection. The collections are ranked in terms of how much their mini-collections contributed to the top-k answers of the query.
    • Scale the number of returned results by the expected size of the actual collection
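A minimal ReDDE-style sketch: results in the top-k of the union of samples vote for their source collection, and each collection's vote is scaled by the ratio of its estimated size to its sample size. Sample sizes, collection sizes, and document ids are made up.

```python
from collections import Counter

def redde_rank(topk_results, sample_sizes, collection_sizes):
    """Rank collections: each top-k result from the union of mini-collections
    votes for the collection it was sampled from, scaled by estimated
    collection size over sample size."""
    votes = Counter(src for src, _doc in topk_results)
    scores = {c: votes[c] * collection_sizes[c] / sample_sizes[c]
              for c in sample_sizes}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-5 results of a query against the union of mini-collections.
topk = [("WSJ", "d12"), ("WSJ", "d7"), ("CNN", "d3"), ("WSJ", "d1"), ("NYT", "d9")]
print(redde_rank(topk,
                 sample_sizes={"WSJ": 300, "CNN": 300, "NYT": 300},
                 collection_sizes={"WSJ": 90000, "CNN": 40000, "NYT": 120000}))
```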
Selecting among overlapping collections

  [Same pipeline as before: collection selection, query execution, and result
  merging over collections WSJ, WP, CNN, NYT, FT, producing a merged result list.]

  • Overlap between collections
    • News meta-searcher, bibliography search engine, etc.
  • Objectives:
    • Retrieve variety of results
    • Avoid collections with irrelevant or redundant results
  • Example query: “bank mergers” against collections FT and CNN
  • Existing work (e.g. CORI) assumes collections are disjoint!

The ROSCO approach
(the following slides are from Thomas Hernandez’s MS thesis defense)

Challenge: Defining & Computing Overlap

  [Figure: example result lists from collections C1 (Results A–G), C2 (Results V–Z),
  and C3 (Results I–M), illustrating points (A) and (B) below.]

  • Collection overlap may be non-symmetric, or “directional”. (A)
  • Document overlap may be non-transitive. (B)

Gathering Overlap Statistics
  • Solution:
    • Consider the query result set of a particular collection as a single bag of words
    • Approximate overlap as the intersection between the result-set bags
  • Approximate overlap between 3+ collections using only pairwise overlaps
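A small sketch of the bag-of-words overlap approximation; the whitespace tokenization and the example result sets are simplifications.

```python
from collections import Counter

def result_bag(result_docs):
    """Represent one collection's result set for a query as a single bag of words."""
    return Counter(term for doc in result_docs for term in doc.lower().split())

def pairwise_overlap(bag1, bag2):
    """Approximate overlap as the size of the bag (multiset) intersection."""
    return sum((bag1 & bag2).values())

c1 = result_bag(["bank mergers approved", "bank stocks rally"])
c2 = result_bag(["bank mergers approved", "weather report"])
print(pairwise_overlap(c1, c2))   # number of shared word occurrences
```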


Controlling Statistics
  • Objectives:
    • Limit the number of statistics stored
    • Improve the chances of having statistics for new queries
  • Solution:
    • Identify frequent item sets among queries (Apriori algorithm)
    • Store statistics only with respect to these frequent item sets


The Online Component

  [Architecture of the Collection Selection System. Offline component: gather
  coverage and overlap information for past queries; identify frequent item sets
  among the queries; compute statistics for the frequent item sets, stored as
  Coverage/Overlap Statistics. Online component: map the user query to frequent
  item sets; compute statistics for the query using the mapped item sets;
  determine the collection order for the query.]

  • Purpose: determine collection order for user query
    • 1. Map query to stored item sets
    • 2. Compute statistics for query
    • 3. Determine collection order

Training our System
  • Training set: 90% of the query list
  • Gathering statistics for training queries:
    • Probing of the 15 collections
  • Identifying frequent item sets:
    • Support threshold used: 0.05% (i.e. 9 queries)
    • 681 frequent item sets found
  • Computing statistics for item sets:
    • Statistics fit in a 1.28MB file
    • Sample entry:

      the item set {network, neural}, followed by coverage and overlap statistics
      for collections such as MIX15, CI, SC, AG, AD, …


Performance Evaluation
  • Measuring number of new and duplicate results:
    • Duplicate result: has cosine similarity > 0.95 with at least one retrieved result
    • New result: has no duplicate
  • Oracular approach:
    • Knows which collection has the most new results
    • Retrieves large portion of new results early


Evaluation of Collection Selection
