Issues in Bridging DB & IR

Administrivia
  • Homework 4 socket open
    • *PLEASE* start working. There may not be an extra week before submission
    • Considering making Homework 4 subsume the second exam—okay?
  • Topics coming up
    • DB/IR (1.5 classes); Collection Selection (.5 classes)
    • Social Network Analysis (1 class); Webservices (1 class)
    • Interactive review/Summary (last class)
DB and IR: Two Parallel Universes

Parallel universes forever?

                            Database Systems                     Information Retrieval
  canonical application:    accounting                           libraries
  data type:                numbers, short strings               text
  foundation:               algebraic / logic based              probabilistic / statistics based
  search paradigm:          Boolean retrieval                    ranked retrieval
                            (exact queries, result sets/bags)    (vague queries, result lists)

DB vs. IR

  Databases:
  • DBs allow structured querying
  • Queries and results (tuples) are different objects
  • Soundness & Completeness expected
  • User is expected to know what she is doing

  IR:
  • IR only supports unstructured querying
  • Queries and results are both documents!
  • High Precision & Recall is hoped for
  • User is expected to be a dunderhead.
Top-down Motivation: Applications (1) - Customer Support -

  Typical data:
  • Customers (CId, Name, Address, Area, Category, Priority, ...)
  • Requests (RId, CId, Date, Product, ProblemType, Body, RPriority, WFId, ...)
  • Answers (AId, RId, Date, Class, Body, WFId, WFStatus, ...)

  Typical queries (premium customer from Germany):
  “A notebook, model ... configured with ..., has a problem with the driver of
  its Wave-LAN card. I already tried the fix ..., but received error message ...”
  • request classification & routing
  • find similar requests

  Why customizable scoring?
  • wealth of different apps within this app class
  • different customer classes
  • adjustment to evolving business needs
  • scoring on text + structured data
    (weighted sums, language models, skyline, w/ correlations, etc. etc.)

  Platform desiderata (from the app developer’s viewpoint):
  • Flexible ranking and scoring on text, categorical, and numerical attributes
  • Incorporation of dimension hierarchies for products, locations, etc.
  • Efficient execution of complex queries over text and data attributes
  • Support for high update rates concurrently with high query load
Top-down Motivation: Applications (2)

More application classes:

  • Global health-care management for monitoring epidemics
  • News archives for journalists, press agencies, etc.
  • Product catalogs for houses, cars, vacation places, etc.
  • Customer relationship management in banks, insurance, telecom, etc.
  • Bulletin boards for social communities
  • P2P personalized & collaborative Web search
  • etc. etc.
Top-down Motivation: Applications (3)

  Next wave “Text2Data”:
  use Information-Extraction technology (regular expressions, HMMs, lexicons,
  other NLP and ML techniques) to convert text docs into relational facts,
  moving up in the value chain.

  Example:
  “The CIDR‘05 conference takes place in Asilomar from Jan 4 to Jan 7,
  and is organized by D.J. DeWitt, Mike Stonebreaker, ...”

  Extracted relations (each extracted fact carries a confidence score, e.g. 0.95, 0.9, 0.75):

  Conference (Name, Year, Location, Date)
    CIDR   2005   Asilomar   05/01/04

  ConfOrganization (Name, Year, Chair)
    CIDR   2005   P68
    CIDR   2005   P35

  People (Id, Name)
    P35   Michael Stonebraker
    P68   David J. DeWitt

  • facts now have confidence scores
  • queries involve probabilistic inferences and result ranking
  • relevant for “business intelligence”

Some specific problems
  • How to handle textual attributes in data processing (e.g. Joins)?
  • How to support keyword-based querying over normalized relations?
  • How to handle imprecise queries?

(Ullas Nambiar’s work)

  • How to do query processing for top-K results?

(Surajit et al.’s paper in CIDR 2005)

1. Handling text fields in data tuples
  • Often you have database relations some of whose fields are “Textual”
    • E.g. a movie database, which has, in addition to year, director etc., a column called “Review” which is unstructured text
  • Normal DB operations ignore this unstructured stuff (can’t join over them).
    • SQL sometimes supports a “Contains” constraint (e.g. give me movies that contain “Rotten” in the review)
STIR (Simple Text in Relations)
  • The elements of a tuple are seen as documents (rather than atoms)
  • The query language is the same as SQL, except for an added “similarity” predicate
Soft Joins..WHIRL [Cohen]
  • We can extend the notion of Joins to “Similarity Joins” where similarity is measured in terms of vector similarity over the text attributes. So, the join tuples are output in a ranked form—with the rank proportional to the similarity
    • Neat idea… but does have some implementation difficulties
      • Most tuples in the cross-product will have non-zero similarities. So, need query processing that will somehow just produce highly ranked tuples
        • Uses A*-search to focus on top-K answers
        • (See Surajit et al., CIDR 2005, who argue for a whole new query algebra to help support top-K query processing)
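To make the similarity-join idea concrete, here is a minimal sketch of a TF-IDF soft join in Python. This is not Cohen's WHIRL implementation (which uses inverted indices and A* search to avoid scoring the whole cross product); this toy version scores every pair with a naive tokenizer, and the relation contents and function names are invented for the example.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Build unit-length TF-IDF vectors for a list of short text fields."""
    docs = [t.lower().split() for t in texts]        # naive whitespace tokenizer
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: tf[t] * math.log(1 + n / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vecs.append({t: w / norm for t, w in vec.items()})
    return vecs

def cosine(u, v):
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def soft_join(left, right, top_k=5):
    """Rank every (left, right) pair by TF-IDF cosine similarity of the join columns."""
    vecs = tfidf_vectors(left + right)
    lv, rv = vecs[:len(left)], vecs[len(left):]
    pairs = [(cosine(a, b), l, r)
             for a, l in zip(lv, left) for b, r in zip(rv, right)]
    return sorted(pairs, reverse=True)[:top_k]

# Toy review titles vs. listing titles (hypothetical data).
reviews  = ["The Hitchhiker's Guide to the Galaxy, 2005", "Sin City review"]
listings = ["Hitchhiker's Guide to the Galaxy", "Star Wars Episode III"]
print(soft_join(reviews, listings))
```

The output is a ranked list of join pairs, with the closest title matches first, which is exactly the ranked-join behavior described above.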
WHIRL queries
  • Assume two relations:

review(movieTitle,reviewText): archive of reviews

listing(theatre, movieTitle, showTimes, …): now showing

WHIRL queries
  • “Find reviews of sci-fi comedies” [movie domain]

FROM review AS r SELECT * WHERE r.text~’sci fi comedy’

(like standard ranked retrieval of “sci-fi comedy”)

  • “Where is [that sci-fi comedy] playing?”

FROM review AS r, listing AS s SELECT *

WHERE r.title~s.title and r.text~’sci fi comedy’

(best answers: titles are similar to each other – e.g., “Hitchhiker’s Guide to the Galaxy” and “The Hitchhiker’s Guide to the Galaxy, 2005” and the review text is similar to “sci-fi comedy”)

WHIRL queries
  • Similarity is based on TF-IDF: rare words are most important.
  • Search for high-ranking answers uses inverted indices….
WHIRL queries
  • Similarity is based on TF-IDF: rare words are most important.
    • e.g., years are common in the review archive, so they have low weight
  • Search for high-ranking answers uses inverted indices….
    • It is easy to find the (few) items that match on “important” terms
    • Search for strong matches can prune “unimportant” terms

WHIRL results
  • This sort of worked:
    • Interactive speeds (<0.3s/q) with a few hundred thousand tuples.
    • For 2-way joins, average precision (sort of like area under precision-recall curve) from 85% to 100% on 13 problems in 6 domains.
    • Average precision better than 90% on 5-way joins
WHIRL and soft integration

  What worked:
  • WHIRL worked for a number of web-based demo applications,
    e.g., integrating data from 30-50 smallish web DBs with <1 FTE of labor
  • WHIRL could link many data types reasonably well, without engineering
  • WHIRL generated numerous papers (SIGMOD98, KDD98, Agents99, AAAI99,
    TOIS2000, AIJ2000, ICML2000, JAIR2001)

  Limitations:
  • WHIRL was relational (but see ELIXIR, SIGIR2001)
  • WHIRL users need to know the schema of the source DBs
  • WHIRL’s query-time linkage worked only for TFIDF, token-based distance metrics
    → text fields with few misspellings
  • WHIRL was memory-based: all data must be centrally stored (no federated data)
    → small datasets only
WHIRL vision: very radical, everything was inter-dependent

  Query Q:
    SELECT R.a, S.a, S.b, T.b FROM R, S, T
    WHERE R.a~S.a AND S.b~T.b        (~ means TFIDF-similar)

  • Link items as needed by Q
  • Incrementally produce a ranked list of possible links, with “best matches”
    first. User (or downstream process) decides how much of the list to generate
    and examine.

String Similarity Metrics
  • Tf-idf measures are not really very good at handling similarity between “short textual attributes” (e.g. titles)
    • String similarity metrics are more suitable
  • String similarity can be handled in terms of
    • Edit distance (# of primitive ops such as “backspace”, “overtype”) needed to convert one string into another
    • N-gram distance (see next slide)
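As a concrete reference for the edit-distance bullet above, here is the standard Levenshtein dynamic program; it is a generic sketch (not tied to any particular system in the slides) counting insertions, deletions, and overtypes as unit-cost operations.

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions ("overtypes") needed to turn s into t."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # delete from s
                         cur[j - 1] + 1,     # insert into s
                         prev[j - 1] + cost) # overtype (or keep on match)
        prev = cur
    return prev[n]

print(edit_distance("hitchhiker", "hithhiker"))   # 1: one deleted character
```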
N-gram distance
  • An n-gram of a string is a contiguous n-character subsequence of the string
    • 3-grams of the string “hitchhiker” are
      • {hit, itc, tch, chh, hhi, hik, ike, ker}
    • “space” can be treated as a special character
  • A string S can be represented as a set of its n-grams
    • Similarity between two strings can be defined in terms of the similarity between the sets
      • Can use Jaccard similarity
  • N-grams are to strings what K-shingles are to documents
    • Document duplicate detection is often done in terms of the set similarity between its shingles
      • Each shingle is hashed to a hash signature. A jaccard similarity is computed between the document shingle sets
        • Useful for plagiarism detection (e.g. the Turnitin software does this)
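A small sketch of n-gram extraction and Jaccard set similarity, using the “hitchhiker” example from the slide; the helper names are purely illustrative.

```python
def ngrams(s, n=3):
    """Contiguous n-character substrings of s; spaces count as ordinary characters."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

g1 = ngrams("hitchhiker")      # the eight 3-grams listed on the slide
g2 = ngrams("hitchiker")       # a misspelled variant
print(sorted(g1))
print(jaccard(g1, g2))         # still fairly high despite the misspelling
```

The same set-similarity computation, applied to hashed k-shingles instead of character n-grams, is what document duplicate detection uses.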
2. Supporting keyword search on databases

  How do we answer a query like “Soumen Sunita”?

  Issues:
  • The schema is normalized (not everything is in one table)
  • How to rank multiple tuples which contain the keywords?

Motivation
  • Keyword search of documents on the Web has been enormously successful
    • Simple and intuitive, no need to learn any query language
  • Database querying using keywords is desirable
    • SQL is not appropriate for casual users
    • Form interfaces cumbersome:
      • Require separate form for each type of query — confusing for casual users of Web information systems
      • Not suitable for ad hoc queries
Motivation
  • Many Web documents are dynamically generated from databases
    • E.g. Catalog data
  • Keyword querying of generated Web documents
    • May miss answers that need to combine information on different pages
    • Suffers from duplication overheads
Examples of Keyword Queries
  • On a railway reservation database
    • “mumbai bangalore”
  • On a university database
    • “database course”
  • On an e-store database
    • “camcorder panasonic”
  • On a book store database
    • “sudarshan databases”
Differences from IR/Web Search
  • Related data split across multiple tuples due to normalization
    • E.g. Paper (paper-id, title, journal), Author (author-id, name), Writes (author-id, paper-id, position)
  • Different keywords may match tuples from different relations
    • What joins are to be computed can only be decided on the fly
      • Cites(citing-paper-id, cited-paper-id)
Connectivity
  • Tuples may be connected by
    • Foreign key and object references
    • Inclusion dependencies and join conditions
    • Implicit links (shared words), etc.
  • Would like to find sets of (closely) connected tuples that match all given keywords
Basic Model
  • Database: modeled as a graph
    • Nodes = tuples
    • Edges = references between tuples
      • foreign key, inclusion dependencies, ..
      • Edges are directed.

  [Example graph (BANKS keyword search): nodes are tuples from the paper, writes,
  and author relations, e.g. the paper “MultiQuery Optimization” and the authors
  Charuta, S. Sudarshan, and Prasan Roy, connected by foreign-key edges.]

Answer Example

  Query: sudarshan roy

  [Answer tree: the paper node “MultiQuery Optimization” connects through two
  writes nodes to the author nodes S. Sudarshan and Prasan Roy.]

Combining Keyword Search and Browsing
  • Catalog searching applications
    • Keywords may restrict answers to a small set, then user needs to browse answers
    • If there are multiple answers, hierarchical browsing required on the answers
What BANKS Does

  • The whole DB is seen as a directed graph (edges correspond to foreign keys)
  • Answers are subgraphs, ranked by edge weights

Solutions as rooted weighted trees
  • In BANKS, each potential solution is a rooted weighted tree where
    • Nodes are tuples from tables
      • Node weight can be defined in terms of “pagerank” style notions (e.g. back-degree)
        • They use log(1+x) where x is the number of back links
    • Edges are foreign-primary key references between tuples across tables
      • Links are given domain specific weights
        • A Paper→writes edge is seen as stronger than a Paper→cites edge
    • Tuples in the tree must cover keywords
  • Relevance of a tree is based on its weight
    • Weight of the tree is a combination of its node and link weights
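A toy sketch of the scoring idea just described (not the BANKS search algorithm itself): node prestige is log(1 + number of back links) as on the slide, while the way node and edge parts are combined below (an alpha-weighted sum) is only an illustrative assumption; the tuple identifiers, back-link counts, and edge weights are hypothetical.

```python
import math

# Hypothetical answer tree for the query "sudarshan roy": nodes are tuples,
# edges are foreign-key references between them, with domain-specific weights.
back_links = {"paper:MQO": 4, "writes:1": 0, "writes:2": 0,
              "author:Sudarshan": 7, "author:Roy": 3}   # back-link counts (made up)
tree_edges = [("writes:1", "paper:MQO", 1.0), ("writes:1", "author:Sudarshan", 1.0),
              ("writes:2", "paper:MQO", 1.0), ("writes:2", "author:Roy", 1.0)]

def node_weight(node):
    # "PageRank-style" prestige from the slide: log(1 + number of back links)
    return math.log(1 + back_links[node])

def tree_score(nodes, edges, alpha=0.5):
    """Combine average node prestige and average edge weight into one
    relevance score for the answer tree (the combination rule is illustrative)."""
    node_part = sum(node_weight(n) for n in nodes) / len(nodes)
    edge_part = sum(w for _, _, w in edges) / len(edges)
    return alpha * node_part + (1 - alpha) * edge_part

print(tree_score(list(back_links), tree_edges))
```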

Part III: Answer Imprecise Queries with

[ICDE 2006;WebDB, 2004; WWW, 2004]

Why Imprecise Queries?

  A feasible query: Make = “Toyota”, Model = “Camry”, Price ≤ $7000

    Make     Model   Price   Year
    Toyota   Camry   $7000   1999
    Toyota   Camry   $7000   2001
    Toyota   Camry   $6700   2000
    Toyota   Camry   $6500   1998
    ………

  But the user really wants a ‘sedan’ priced around $7000:
  • What about the price of a Honda Accord?
  • Is there a Camry for $7100?

  Solution: Support Imprecise Queries
Dichotomy in Query Processing
  • Databases
    • User knows what she wants
    • User query completely expresses the need
    • Answers exactly matching query constraints
  • IR Systems
    • User has an idea of what she wants
    • User query captures the need to some degree
    • Answers ranked by degree of relevance

  Imprecise queries on databases cross the divide.

Existing Approaches
  • Similarity search over Vector space
    • Data must be stored as vectors of text

WHIRL, W. Cohen, 1998

  • Enhanced database model
    • Add ‘similar-to’ operator to SQL. Distances provided by an expert/system designer

VAGUE, A. Motro, 1998

    • Support similarity search and query refinement over abstract data types

Binderberger et al, 2003

  • User guidance
    • Users provide information about objects required and their possible neighborhood

Proximity Search, Goldman et al, 1998

  • Limitations of existing approaches:
    • User/expert must provide similarity measures
    • New operators to use distance measures
    • Not applicable over autonomous databases
  • Our objectives:
    • Minimal user input
    • Database internals not affected
    • Domain-independent & applicable to Web databases
Imprecise queries vs. Empty queries
  • The “empty query” problem arises when the user’s query, when submitted to the database, returns an empty set of answers.
    • We want to develop methods that can automatically minimally relax this empty query and resubmit it so the user gets some results
  • Existing approaches for empty query problem are mostly syntactic—and rely on relaxing various query constraints
    • Little attention is paid to the best order in which to relax the constraints.
  • The imprecise query problem is a generalization of the empty query problem
    • We may have a non-empty set of answers to the base query
    • We are interested not just in returning some tuples, but in returning them in order of relevance
General ideas for supporting imprecise queries
  • Main issues are
    • How to rewrite the base query such that more relevant tuples can be retrieved.
    • How to rank the retrieved tuples in the order of relevance.

A spectrum of approaches are possible—including

    • Data-dependent approaches
    • User-dependent approaches
    • Collaborative approaches

We will look at an approach—which is basically data-dependent

An Example
  • Relation: CarDB(Make, Model, Price, Year)
  • Imprecise query
    Q :− CarDB(Model like “Camry”, Price like “10k”)
  • Base query
    Qpr :− CarDB(Model = “Camry”, Price = “10k”)
  • Base set Abs
    Make = “Toyota”, Model = “Camry”, Price = “10k”, Year = “2000”
    Make = “Toyota”, Model = “Camry”, Price = “10k”, Year = “2001”
Obtaining Extended Set
  • Problem: Given base set, find tuples from database similar to tuples in base set.
  • Solution:
    • Consider each tuple in base set as a selection query.

e.g. Make = “Toyota”, Model = “Camry”, Price = “10k”, Year = “2000”

    • Relax each such query to obtain “similar” precise queries.

e.g. Make = “Toyota”, Model = “Camry”, Price = “”, Year =“2000”

    • Execute and determine tuples having similarity above some threshold.
  • Challenge: Which attribute should be relaxed first ?
    • Make ? Model ? Price ? Year ?
  • Solution: Relax the least important attribute first.
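A minimal sketch of the relaxation step just described: starting from a base-set tuple, drop combinations of the least important attributes first to generate “similar” precise queries. The attribute names follow the CarDB example; the relaxation order is assumed to come from the AFD analysis described on the next slides.

```python
from itertools import combinations

def relaxed_queries(base_tuple, order):
    """Generate relaxed selection queries from a base-set tuple by dropping
    combinations of attributes, least important attributes first."""
    for k in range(1, len(order)):                 # 1-attribute, then 2-attribute, ...
        for dropped in combinations(order, k):
            yield {a: v for a, v in base_tuple.items() if a not in dropped}

# Hypothetical base-set tuple and relaxation order (least important first).
base  = {"Make": "Toyota", "Model": "Camry", "Price": "10k", "Year": "2000"}
order = ["Price", "Model", "Year", "Make"]
for q in relaxed_queries(base, order):
    print(q)      # each dict is one "similar" precise query to execute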
Least Important Attribute
  • Definition: An attribute whose binding value when changed has minimal effect on values binding other attributes.
    • Does not decide values of other attributes
    • Value may depend on other attributes
  • E.g. Changing/relaxing Price will usually not affect other attributes
  • but changing Model usually affects Price
  • Dependence between attributes useful to decide relative importance
    • Approximate Functional Dependencies & Approximate Keys
      • Approximate in the sense that they are obeyed by a large percentage (but not all) of tuples in the database
        • Can use TANE, an algorithm by Huhtala et al [1999]
Attribute Ordering
  • Given a relation R
    • Determine the AFDs and approximate keys
    • Pick the key with highest support, say Kbest
    • Partition the attributes of R into
      • key attributes, i.e. belonging to Kbest
      • non-key attributes, i.e. not belonging to Kbest
    • Sort the subsets using influence weights: the weight of an attribute Ai ∈ A’ ⊆ R
      aggregates the strength of its dependencies with the other attributes Aj,
      j ≠ i, j = 1 to |Attributes(R)|
  • Attribute relaxation order is all non-keys first, then keys
  • Multi-attribute relaxation assumes independence between attributes

  Example: CarDB(Make, Model, Year, Price)
    Key attributes: Make, Year
    Non-key attributes: Model, Price
    Relaxation order: Price, Model, Year, Make
    1-attribute relaxations: {Price, Model, Year, Make}
    2-attribute relaxations: {(Price, Model), (Price, Year), (Price, Make), …}
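For illustration, here is a crude stand-in for the AFD-based ordering: estimate the confidence of each approximate FD directly from the data and rank attributes by how strongly the other attributes depend on them, least important first. The actual system mines AFDs and approximate keys with TANE and uses the influence-weight formula above; this sketch and its toy data only approximate that idea.

```python
from collections import Counter, defaultdict

def afd_confidence(rows, x, y):
    """Confidence of the approximate FD x -> y: the largest fraction of rows
    that satisfy it once violating rows are removed."""
    groups = defaultdict(Counter)
    for r in rows:
        groups[r[x]][r[y]] += 1
    kept = sum(c.most_common(1)[0][1] for c in groups.values())
    return kept / len(rows)

def attribute_order(rows, attrs):
    """Order attributes by how strongly the others depend on them
    (a crude stand-in for the influence weights): least important first."""
    return sorted(attrs,
                  key=lambda a: sum(afd_confidence(rows, a, b)
                                    for b in attrs if b != a))

# Toy CarDB fragment (hypothetical tuples).
rows = [{"Make": "Toyota", "Model": "Camry",  "Year": 2000, "Price": 7000},
        {"Make": "Toyota", "Model": "Camry",  "Year": 2001, "Price": 7500},
        {"Make": "Honda",  "Model": "Accord", "Year": 2000, "Price": 7200}]
print(attribute_order(rows, ["Make", "Model", "Year", "Price"]))
```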

Tuple Similarity
  • Tuples obtained after relaxation are ranked according to their similarity to
    the corresponding tuples in the base set:

      Sim(t1, t2) = Σi Wi × VSim(t1.Ai, t2.Ai)

    where Wi = normalized influence weights, Σ Wi = 1, i = 1 to |Attributes(R)|
  • Value similarity (VSim)
    • Euclidean for numerical attributes, e.g. Price, Year
    • Concept similarity for categorical attributes, e.g. Make, Model
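A small sketch of the weighted tuple-similarity score. The 1/(1 + |difference|) form used for numeric attributes, the example weights, and the concept-similarity entry are placeholders rather than the paper's exact measures; in practice numeric distances would be normalized by the attribute's range.

```python
def value_sim(attr, v1, v2, concept_sim):
    """Similarity of two attribute values: a 1/(1+distance) placeholder for
    numeric attributes, a concept-similarity lookup for categorical ones."""
    if isinstance(v1, (int, float)):
        return 1.0 / (1.0 + abs(v1 - v2))   # placeholder; normalize by attribute range in practice
    return concept_sim.get((attr, v1, v2), 1.0 if v1 == v2 else 0.0)

def tuple_similarity(t1, t2, weights, concept_sim):
    """Weighted sum of per-attribute similarities; weights assumed normalized to 1."""
    return sum(w * value_sim(a, t1[a], t2[a], concept_sim)
               for a, w in weights.items())

weights     = {"Make": 0.3, "Model": 0.4, "Year": 0.1, "Price": 0.2}   # hypothetical influence weights
concept_sim = {("Make", "Toyota", "Honda"): 0.22}                      # hypothetical concept similarity
t1 = {"Make": "Toyota", "Model": "Camry",  "Year": 2000, "Price": 7000}
t2 = {"Make": "Honda",  "Model": "Accord", "Year": 2001, "Price": 6800}
print(tuple_similarity(t1, t2, weights, concept_sim))
```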
Concept (Value) Similarity
  • Concept: any distinct attribute-value pair, e.g. Make=Toyota
    • Visualized as a selection query binding a single attribute
    • Represented as a supertuple, e.g. the supertuple for the concept Make=Toyota
      (the values co-occurring with Make=Toyota in the other attributes)
  • Concept similarity: estimated as the percentage of correlated values common to
    two given concepts, where v1, v2 ∈ Aj, i ≠ j and Ai, Aj ∈ R
    • Measured as the Jaccard similarity among the supertuples representing the concepts:

      JaccardSim(A, B) = |A ∩ B| / |A ∪ B|
Concept (Value) Similarity Graph

  [Example graph over values of Make: nodes Toyota, Honda, Ford, Chevrolet, Dodge,
  Nissan, and BMW, with edge weights (0.25, 0.22, 0.16, 0.15, 0.12, 0.11) giving
  the estimated similarity between pairs of makes.]
Empirical Evaluation of
  • Goal
    • Evaluate the effectiveness of the query relaxation and concept learning
  • Setup
    • A database of used cars

CarDB( Make, Model, Year, Price, Mileage, Location, Color)

    • Populated using 30k tuples from Yahoo Autos
    • Concept similarity estimated for Make, Model, Location, Color
    • Two query relaxation algorithms
      • RandomRelax – randomly picks attribute to relax
      • GuidedRelax – uses relaxation order determined using approximate keys and AFDs
Evaluating the effectiveness of relaxation
  • Test scenario
    • 10 randomly selected base queries from CarDB
    • Retrieve the 20 tuples showing similarity > ε, with 0.5 < ε < 1
    • Weighted summation of attribute similarities
      • Euclidean distance used for Year, Price, Mileage
      • Concept similarity used for Make, Model, Location, Color
    • Limit of 64 relaxed queries per base query
      • (with 7 attributes, 2^7 = 128 relaxed queries are possible)
    • Efficiency measured as the number of tuples that must be extracted per
      relevant tuple found
Efficiency of Relaxation

  Guided Relaxation:
  • Average of 4 tuples extracted per relevant tuple for ε = 0.5; goes up to
    12 tuples for ε = 0.7
  • Resilient to changes in ε

  Random Relaxation:
  • Average of 8 tuples extracted per relevant tuple for ε = 0.5; increases to
    120 tuples for ε = 0.7
  • Not resilient to changes in ε

  (Note: the two plots are on different scales.)

Summary
  • An approach for answering imprecise queries over Web databases
    • Mine and use AFDs to determine attribute importance
    • Domain-independent concept similarity estimation technique
    • Tuple similarity score as a weighted sum of attribute similarity scores
  • Empirical evaluation shows
    • Reasonable concept similarity models estimated
    • Set of similar precise queries efficiently identified
Collection Selection/Meta Search Introduction
  • Metasearch Engine
      • A system that provides unified access to multiple existing search engines.
  • Metasearch Engine Components
    • Database Selector
      • Identifying potentially useful databases for each user query
    • Document Selector
      • Identifying potentially useful documents returned from the selected databases
    • Query Dispatcher and Result Merger
      • Ranking the selected documents
Collection Selection

  [Pipeline: a query goes through collection selection, query execution, and
  result merging over a set of collections such as WSJ, WP, CNN, NYT, and FT.]
Evaluating collection selection
  • Let c1..cj be the collections that are chosen to be accessed for the query Q. Let d1…dk be the top documents returned from these collections.
  • We compare these results to the results that would have been returned from a central union database
    • Ground Truth: The ranking of documents that the retrieval technique (say vector space or jaccard similarity) would have retrieved from a central union database that is the union of all the collections
  • Compare the precision of the documents returned by accessing only the selected collections against this ground truth
General Scheme & Challenges
  • Get a representative of each of the databases
    • Representative is a sample of files from the database
      • Challenge: get an unbiased sample when you can only access the database through queries.
  • Compare the query to the representatives to judge the relevance of a database
    • Coarse approach: Convert the representative files into a single file (super-document). Take the (vector) similarity between the query and the super document of a database to judge that database’s relevance
    • Finer approach: Keep the representative as a mini-database. Union the mini-databases to get a central mini-database. Apply the query to the central mini-database and find the top k answers from it. Decide the relevance of each database based on which of the answers came from which database’s representative
      • You can use an estimate of the size of the database too
    • What about overlap between collections? (See ROSCO paper)
Uniform Probing for Content Summary Construction
  • Automatic extraction of document frequency statistics from uncooperative databases
    • [Callan and Connell TOIS 2001],[Callan et al. SIGMOD 1999]
  • Main Ideas
    • Pick a word and send it as a query to database D
      • RandomSampling-OtherResource(RS-Ord): from a dictionary
      • RandomSampling-LearnedResource(RS-Lrd): from retrieved documents
    • Retrieve the top-k documents returned
    • If the number of retrieved documents exceeds a threshold T, stop; otherwise restart at the beginning
    • Typically k = 4, T = 300
    • Compute the sample document frequency for each word that appeared in a retrieved document.
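A sketch of the probing loop under the RS-Ord variant (probe words drawn from a dictionary), with k = 4 and a stop threshold of 300 sampled documents as above. The search function is a placeholder for whatever query interface the remote database exposes; the toy corpus is invented.

```python
import random
from collections import Counter

def sample_content_summary(search, dictionary, k=4, max_docs=300):
    """Query-based sampling: probe the database with single-word queries drawn
    from a dictionary (RS-Ord), keep the top-k documents per probe, stop once
    max_docs documents are sampled, then compute sample document frequencies."""
    sampled, seen_ids = [], set()
    vocab = list(dictionary)
    while len(sampled) < max_docs and vocab:
        word = vocab.pop(random.randrange(len(vocab)))
        for doc_id, text in search(word, k):
            if doc_id not in seen_ids:
                seen_ids.add(doc_id)
                sampled.append(text)
        # RS-Lrd variant (not shown): draw later probe words from the sampled documents
    return Counter(term for text in sampled for term in set(text.lower().split()))

# Toy stand-in for a remote database's query interface.
corpus = {1: "neural network training", 2: "database query optimization",
          3: "network routing protocols"}
def toy_search(word, k):
    return [(i, t) for i, t in corpus.items() if word in t][:k]

print(sample_content_summary(toy_search, ["network", "database", "protein"]))
```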
CORI Net Approach (Representative as a super document)
  • Representative Statistics
    • The document frequency for each term and each database
    • The database frequency for each term
  • Main Ideas
    • Visualize the representative of a database as a super-document, and the set of all representatives as a database of super-documents
    • Document frequency becomes term frequency in the super document, and database frequency becomes document frequency in the super database
    • Ranking scores can be computed using a similarity function such as the Cosine function
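A simplified stand-in for this super-document view (not the exact CORI weighting): each representative is collapsed into one bag of words, and collections are ranked by a cosine-style score between the query and the super-documents. The collection names and documents are hypothetical.

```python
import math
from collections import Counter

def rank_collections(query, representatives):
    """Rank collections by matching the query against each collection's
    representative, treated as one super-document (simplified CORI-style scoring)."""
    supers = {name: Counter(" ".join(docs).lower().split())
              for name, docs in representatives.items()}
    n = len(supers)
    dbf = Counter(t for s in supers.values() for t in s)   # "database frequency"
    q_terms = query.lower().split()

    def score(sup):
        num = sum(sup[t] * math.log(1 + n / dbf[t]) for t in q_terms if dbf[t])
        return num / math.sqrt(sum(v * v for v in sup.values()))

    return sorted(supers, key=lambda name: score(supers[name]), reverse=True)

reps = {"CNN": ["bank mergers in europe", "markets fall on merger news"],
        "WSJ": ["bank stocks rally", "mergers and acquisitions roundup"],
        "FT":  ["football scores", "weather report"]}
print(rank_collections("bank mergers", reps))
```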
ReDDE Approach(Representative as a mini-collection)
  • Use the representatives as mini collections
  • Construct a union-representative that is the union of the mini-collections (such that each document keeps information on which collection it was sampled from)
    • Send the query first to the union-representative and get the top-k ranked results
    • See which of the results in the top-k came from which mini-collection. The collections are ranked in terms of how much their mini-collections contributed to the top-k answers of the query.
    • Scale the number of returned results by the expected size of the actual collection
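A minimal ReDDE-style sketch: results in the top-k of the union of samples vote for their source collection, and each collection's vote is scaled by the ratio of its estimated size to its sample size. Sample sizes, collection sizes, and document ids are made up.

```python
from collections import Counter

def redde_rank(topk_results, sample_sizes, collection_sizes):
    """Rank collections: each top-k result from the union of mini-collections
    votes for the collection it was sampled from, scaled by estimated
    collection size over sample size."""
    votes = Counter(src for src, _doc in topk_results)
    scores = {c: votes[c] * collection_sizes[c] / sample_sizes[c]
              for c in sample_sizes}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-5 results of a query against the union of mini-collections.
topk = [("WSJ", "d12"), ("WSJ", "d7"), ("CNN", "d3"), ("WSJ", "d1"), ("NYT", "d9")]
print(redde_rank(topk,
                 sample_sizes={"WSJ": 300, "CNN": 300, "NYT": 300},
                 collection_sizes={"WSJ": 90000, "CNN": 40000, "NYT": 120000}))
```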
Selecting among overlapping collections

  [Same pipeline as before: collection selection, query execution, and result
  merging over collections WSJ, WP, CNN, NYT, FT, producing a merged result list.]

  • Overlap between collections
    • News meta-searcher, bibliography search engine, etc.
  • Objectives:
    • Retrieve variety of results
    • Avoid collections with irrelevant or redundant results
  • Example query: “bank mergers” against collections FT and CNN
  • Existing work (e.g. CORI) assumes collections are disjoint!

The ROSCO approach
(the following slides are from Thomas Hernandez’s MS thesis defense)

Challenge: Defining & Computing Overlap

  [Figure: example result lists from collections C1 (Results A–G), C2 (Results V–Z),
  and C3 (Results I–M), illustrating points (A) and (B) below.]

  • Collection overlap may be non-symmetric, or “directional”. (A)
  • Document overlap may be non-transitive. (B)

Gathering Overlap Statistics
  • Solution:
    • Consider the query result set of a particular collection as a single bag of words
    • Approximate overlap as the intersection between the result-set bags
  • Approximate overlap between 3+ collections using only pairwise overlaps
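A small sketch of the bag-of-words overlap approximation; the whitespace tokenization and the example result sets are simplifications.

```python
from collections import Counter

def result_bag(result_docs):
    """Represent one collection's result set for a query as a single bag of words."""
    return Counter(term for doc in result_docs for term in doc.lower().split())

def pairwise_overlap(bag1, bag2):
    """Approximate overlap as the size of the bag (multiset) intersection."""
    return sum((bag1 & bag2).values())

c1 = result_bag(["bank mergers approved", "bank stocks rally"])
c2 = result_bag(["bank mergers approved", "weather report"])
print(pairwise_overlap(c1, c2))   # number of shared word occurrences
```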


Controlling Statistics
  • Objectives:
    • Limit the number of statistics stored
    • Improve the chances of having statistics for new queries
  • Solution:
    • Identify frequent item sets among queries (Apriori algorithm)
    • Store statistics only with respect to these frequent item sets


The Online Component

  [Architecture of the Collection Selection System. Offline component: gather
  coverage and overlap information for past queries; identify frequent item sets
  among the queries; compute statistics for the frequent item sets, stored as
  Coverage/Overlap Statistics. Online component: map the user query to frequent
  item sets; compute statistics for the query using the mapped item sets;
  determine the collection order for the query.]

  • Purpose: determine collection order for user query
    • 1. Map query to stored item sets
    • 2. Compute statistics for query
    • 3. Determine collection order

Training our System
  • Training set: 90% of the query list
  • Gathering statistics for training queries:
    • Probing of the 15 collections
  • Identifying frequent item sets:
    • Support threshold used: 0.05% (i.e. 9 queries)
    • 681 frequent item sets found
  • Computing statistics for item sets:
    • Statistics fit in a 1.28MB file
    • Sample entry:

      the item set {network, neural}, followed by coverage and overlap statistics
      for collections such as MIX15, CI, SC, AG, AD, …


Performance Evaluation
  • Measuring number of new and duplicate results:
    • Duplicate result: has cosine similarity > 0.95 with at least one retrieved result
    • New result: has no duplicate
  • Oracular approach:
    • Knows which collection has the most new results
    • Retrieves large portion of new results early


Evaluation of Collection Selection
