Keyword++: A Framework to Improve Keyword Search Over Entity Databases

Motivation • Current keyword search over databases has limitations for entity databases related to keyword matching • It does not return all relevant results • It returns irrelevant results

Related Work • Most previous search approaches for entity databases required users to enter formatted queries • Examples • (Amazon customer service #phone) • (#professor #university #research='database') • Each word prefixed with # refers to an entity, and the other words are treated as keywords

Problem • Given • Search interface S over an entity relation E • Ϙ is a set of historical keyword queries • Find • For every keyword k in Ϙ, find its mapping Mσ(k) and the confidence score Ms(k) for that mapping • Using the mapping M, find the best CNF (conjunctive normal form) SQL query Tσ(Q) for a keyword query Q

Mapping Keywords to Predicates • DQP (Differential Query Pair) • Qf and Qb where Qf = Qb ∪ {k} • Qf is the foreground query (set of keywords) • Qb is the background query (set of keywords) • k is the differential keyword

Mapping Keywords to Predicates • DQP (Differential Query Pair) • Qb = [small laptop] • Returns 20 laptops, only 3 have brand “Lenovo” • Qf = [small IBM laptop] • Returns 10 laptops, 5 have brand “Lenovo”

Mapping Keywords to Predicates • Generating DQPs for Keywords • Given a query Q and a keyword k in Q • Form a new DQP with Qf = Q and Qb = Q − {k} • The historical keyword queries Ϙ can also be used • Get all Qf and Qb in Ϙ where Qf = Qb ∪ {k}
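Both ways of generating DQPs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: queries are modeled as sets of keywords (as in the DQP definition), and the function names `dqps_from_query` and `dqps_from_log` are hypothetical, not taken from the paper.

```python
def dqps_from_query(query):
    """For each keyword k in Q, pair Qf = Q with Qb = Q - {k}."""
    q = frozenset(query)
    return [(q, q - {k}, k) for k in sorted(q)]


def dqps_from_log(log, keyword):
    """Find all (Qf, Qb) pairs in the query log where Qf = Qb ∪ {keyword}."""
    queries = {frozenset(q) for q in log}
    return [
        (qf, qf - {keyword})
        for qf in queries
        if keyword in qf and (qf - {keyword}) in queries
    ]


# Hypothetical query log used only for illustration.
log = [
    ["small", "laptop"],
    ["small", "IBM", "laptop"],
    ["IBM", "laptop"],
    ["cheap", "laptop"],
]
pairs = dqps_from_log(log, "IBM")
```

With this log, the only pair found for "IBM" is Qf = {small, IBM, laptop} with Qb = {small, laptop}, because [laptop] alone never appears as a query.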

Mapping Keywords to Predicates • Scoring Predicates using DQPs • D(A) is the range of values for a given attribute A • For every value v in D(A), let p(v, A, Se) be the probability that the attribute A has the value v for a set of objects Se • P(A, Se) is the distribution of p(v, A, Se) for all v in D(A) • Sf and Sb are the sets of results for Qf and Qb
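As a minimal sketch of these definitions, P(A, Se) can be estimated by counting attribute values over a result set. The representation of results as dictionaries and the name `distribution` are assumptions made for illustration.

```python
from collections import Counter


def distribution(results, attribute, domain):
    """Estimate P(A, Se): the fraction of objects in Se with A = v, for each v in D(A)."""
    counts = Counter(row[attribute] for row in results)
    total = len(results)
    return {v: (counts[v] / total if total else 0.0) for v in domain}


# Hypothetical result set Sb: 20 laptops, 3 of them with brand "Lenovo".
s_b = [{"BrandName": "Lenovo"}] * 3 + [{"BrandName": "Dell"}] * 17
p_b = distribution(s_b, "BrandName", ["Lenovo", "Dell"])
```

Computing the same distribution for Sf gives the two inputs that the correlation metrics compare.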

Mapping Keywords to Predicates • Correlation Metrics • KL-divergence (used for categorical predicates) • Measures the difference between two probability distributions • Given Sf and Sb, the KL-divergence is:

Mapping Keywords to Predicates • Correlation Metrics • A is BrandName and v is “Lenovo” • Qb = [small laptop] • Probability of 0.15 (3 out of 20) • Qf = [small IBM laptop] • Probability of 0.5 (5 out of 10)
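The BrandName example can be worked through numerically. The sketch below assumes the standard definition KL(Pf || Pb) = Σ_v pf(v) ln(pf(v) / pb(v)), with the foreground distribution measured against the background; the direction, the smoothing constant `eps`, and the lumping of all non-Lenovo brands into a single "Other" value are assumptions for illustration, not details confirmed by the slides.

```python
import math


def kl_divergence(p_f, p_b, eps=1e-9):
    """Standard KL divergence of the foreground from the background distribution."""
    total = 0.0
    for v, pf in p_f.items():
        if pf > 0:
            pb = max(p_b.get(v, 0.0), eps)  # eps avoids division by zero (assumption)
            total += pf * math.log(pf / pb)
    return total


# Numbers from the example: 3 of 20 results are Lenovo for Qb, 5 of 10 for Qf.
p_b = {"Lenovo": 3 / 20, "Other": 17 / 20}
p_f = {"Lenovo": 5 / 10, "Other": 5 / 10}

# Contribution of the single value v = "Lenovo" to the sum.
lenovo_term = p_f["Lenovo"] * math.log(p_f["Lenovo"] / p_b["Lenovo"])
kl = kl_divergence(p_f, p_b)
```

The Lenovo term is positive because adding "IBM" to the query raises the share of Lenovo results, which is the signal used to associate the keyword with the predicate BrandName = "Lenovo".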

Mapping Keywords to Predicates • Correlation Metrics • Earth Mover’s Distance (used for numerical predicates) • Measures the difference between two probability distributions • Given Sf and Sb, and the sorted values for D(A), the EMD is:

Mapping Keywords to Predicates • Correlation Metrics • A is ScreenSize, Qb = [IBM laptop], Qf = [small IBM laptop]
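For a one-dimensional attribute such as ScreenSize, the Earth Mover's Distance can be computed from the two cumulative distributions over the sorted values. The sketch below uses the common 1-D form with unit distance between adjacent sorted values; that choice, the function name `emd_1d`, and the screen-size numbers are all assumptions for illustration, and the exact variant used in the paper may differ.

```python
def emd_1d(p_f, p_b, sorted_values):
    """1-D EMD: sum of absolute differences between the two cumulative distributions."""
    cdf_f = cdf_b = 0.0
    total = 0.0
    for v in sorted_values:
        cdf_f += p_f.get(v, 0.0)
        cdf_b += p_b.get(v, 0.0)
        total += abs(cdf_f - cdf_b)
    return total


# Hypothetical ScreenSize distributions (inches) for Qb = [IBM laptop]
# and Qf = [small IBM laptop]; these numbers are made up.
sizes = [11, 13, 15, 17]
p_b = {11: 0.1, 13: 0.2, 15: 0.4, 17: 0.3}
p_f = {11: 0.4, 13: 0.4, 15: 0.2, 17: 0.0}
distance = emd_1d(p_f, p_b, sizes)
```

A large distance here means that adding "small" shifts the results toward different screen sizes, which suggests the keyword correlates with the ScreenSize attribute.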

Mapping Keywords to Predicates • Score Aggregation • Given a keyword k and a set of DQPs each with respect to k, the aggregate score for keyword k with respect to a predicate σ is:

Mapping Keywords to Predicates • Scoring Threshold • Categorical and Numerical Predicates • Keywords with few DQPs must meet a higher threshold before a mapping is created

Mapping Keywords to Predicates • Scoring Thresholds • Create mapping Mσ(k) with Ms(k) = AggScore

Query Translation • Q = [t1, t2, …, tq] • Qi = [t1, …, ti] is the prefix of Q with i tokens • Example • Q = [small IBM laptop] and n = 2 • Q1 = [small] and Ts(Q1) = Ms(“small”) • Q2 = [small IBM] has two candidate scores • Ts(Q2) = Ts(Q1) + Ms(“IBM”) • Ts(Q2) = Ms(“small IBM”) • Pick the one with the higher score for rewriting Q2
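The prefix-by-prefix choice in this example has the shape of a dynamic program over query prefixes. The sketch below is one reading of it, under these assumptions: n is the maximum number of tokens in a mapped phrase, scores of consecutive phrases add up as in Ts(Q1) + Ms("IBM"), and phrases without a mapping score 0. The function name and all score values are hypothetical.

```python
def best_translation(tokens, ms, n):
    """Return (score, phrases) for the best segmentation of the query.

    ms maps a phrase to its confidence score Ms; n caps the phrase length.
    """
    best = [(0.0, [])]  # best[i] holds the best result for the prefix Qi
    for i in range(1, len(tokens) + 1):
        candidates = []
        for j in range(max(0, i - n), i):
            phrase = " ".join(tokens[j:i])
            score = best[j][0] + ms.get(phrase, 0.0)
            candidates.append((score, best[j][1] + [phrase]))
        best.append(max(candidates, key=lambda c: c[0]))
    return best[-1]


# Made-up confidence scores for illustration only.
ms = {"small": 0.4, "IBM": 0.7, "small IBM": 0.9, "laptop": 0.5}
score, phrases = best_translation(["small", "IBM", "laptop"], ms, n=2)
```

With these scores, Ts(Q1) + Ms("IBM") = 1.1 beats Ms("small IBM") = 0.9, so Q2 is rewritten token by token; raising Ms("small IBM") above 1.1 would flip the choice.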

Query Translation • SELECT * FROM Table WHERE cnf(σA=v) AND cnf(σContains(A,t)) ORDER BY {σ(A,SO)} • cnf(σA=v) is a conjunctive form of categorical predicates • cnf(σContains(A,t)) is a conjunctive form of textual predicates • {σ(A,SO)} is an ordered list of numerical predicates (attribute A with sort order SO)

Query Translation • Example • Q = [small IBM laptop] • SELECT * FROM Table WHERE BrandName = 'Lenovo' AND ProductDescription LIKE '%laptop%' ORDER BY ScreenSize ASC
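The template can be assembled mechanically from the three kinds of predicates. The following sketch uses Python's built-in `sqlite3` module to run the result; the table name `Laptops`, the sample rows, and the helper `build_query` are hypothetical and only illustrate the shape of the generated SQL.

```python
import sqlite3


def build_query(categorical, textual, numerical, table="Laptops"):
    """Assemble the SQL template from categorical, textual, and numerical predicates.

    Attribute names are interpolated directly, so they must come from a
    trusted schema; values are passed as bound parameters.
    """
    where, params = [], []
    for attr, value in categorical:
        where.append(f"{attr} = ?")
        params.append(value)
    for attr, token in textual:
        where.append(f"{attr} LIKE ?")
        params.append(f"%{token}%")
    sql = f"SELECT * FROM {table}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    if numerical:
        sql += " ORDER BY " + ", ".join(f"{a} {order}" for a, order in numerical)
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Laptops (ProductName TEXT, BrandName TEXT, "
    "ProductDescription TEXT, ScreenSize REAL)"
)
conn.executemany(
    "INSERT INTO Laptops VALUES (?, ?, ?, ?)",
    [
        ("ThinkPad T", "Lenovo", "large laptop", 15.4),
        ("ThinkPad X", "Lenovo", "business laptop", 12.1),
        ("Inspiron", "Dell", "small laptop", 11.6),
        ("ThinkCentre", "Lenovo", "desktop tower", 0.0),
    ],
)

# Translation of Q = [small IBM laptop] as in the example.
sql, params = build_query(
    categorical=[("BrandName", "Lenovo")],
    textual=[("ProductDescription", "laptop")],
    numerical=[("ScreenSize", "ASC")],
)
rows = conn.execute(sql, params).fetchall()
```

Note that no product here contains the token "IBM" or "small", yet the Lenovo laptops are returned smallest first, which is the behavior plain keyword matching cannot provide.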

Experiments • Dataset • Entity table with 8,000 laptops • 28 categorical attributes • 7 numerical attributes • 2 textual attributes (ProductName and ProductDescription)

Experiments • Comparison Methods • Ground truth from 100,000 web search queries classified as web queries • Compared with the keyword-and approach and the query-portal approach • keyword-and: returns entities that contain all query tokens • query-portal approach: web search engine • Evaluated for precision, recall, and Jaccard
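The three evaluation measures can be computed per query from the set of returned entities and the ground-truth set. This is a minimal sketch using the usual set-based definitions; the function name `evaluate` and the entity IDs are hypothetical.

```python
def evaluate(returned, relevant):
    """Return (precision, recall, Jaccard) for one query's result set."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    union = len(returned | relevant)
    jaccard = hits / union if union else 0.0
    return precision, recall, jaccard


# Hypothetical entity IDs: 4 returned, 6 relevant, 2 in common.
precision, recall, jaccard = evaluate({1, 2, 3, 4}, {3, 4, 5, 6, 7, 8})
```

Jaccard penalizes both missing and spurious results in a single number, which matches the two failure modes named in the motivation.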

Experiments • Results

Fuzzy Matching of Web Queries to Structured Data

Motivation • Example • A user issues a keyword query “Indy 4 near San Fran,” instead of “Indiana Jones and the Kingdom of the Crystal Skull near the city of San Francisco”

Problem • Synonyms, Hypernyms, and Hyponyms • Let ε be the set of entities over which the synonyms are to be defined • Let S be the universal set of strings, where each string is a sequence of one or more words • We assume there exists an oracle function F(s, ε) → E where s ∈ S and E ⊆ ε

Problem • Synonyms, Hypernyms, and Hyponyms • Synonym: s1 ∈ S is a synonym of another string s2 ∈ S if and only if F(s1, ε) = F(s2, ε) • Example: s1 = “Indiana Jones IV” and s2 = “Indiana Jones 4” • Hypernym: s1 ∈ S is a hypernym of another string s2 ∈ S if and only if F(s1, ε) ⊃ F(s2, ε) • Example: s1 = “Indiana Jones series” • Hyponym: s1 ∈ S is a hyponym of another string s2 ∈ S if and only if F(s1, ε) ⊂ F(s2, ε)
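These three definitions reduce to set comparisons on the oracle's output. The sketch below stands in for the oracle F with a plain dictionary; the entity IDs and the `relation` helper are hypothetical and serve only to make the definitions concrete.

```python
def relation(s1, s2, oracle):
    """Classify s1 relative to s2 by comparing F(s1, ε) and F(s2, ε)."""
    e1, e2 = oracle[s1], oracle[s2]
    if e1 == e2:
        return "synonym"
    if e1 > e2:  # proper superset
        return "hypernym"
    if e1 < e2:  # proper subset
        return "hyponym"
    return "unrelated"


# Hypothetical oracle: each string maps to the set of movie entities it denotes.
oracle = {
    "Indiana Jones IV": frozenset({"movie_4"}),
    "Indiana Jones 4": frozenset({"movie_4"}),
    "Indiana Jones series": frozenset({"movie_1", "movie_2", "movie_3", "movie_4"}),
}
kind = relation("Indiana Jones series", "Indiana Jones 4", oracle)
```

In practice no such oracle is available, which is why the approach approximates it with search and click data.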

Problem • Web Synonym Finding • Given a set of strings U, the data sets A and L, and the reference set of entities ε • Return, for each string u ∈ U, its unique set of Web synonyms Wu = { w ∈ S | GA(u, P) ≈ GL(w, P) }

Candidate Generation • Finding Surrogates • Issue a search to the Bing Search API and keep the top-k results • A web page p is a surrogate for the keyword query u if p is in those results • Referencing Surrogates • A query w is a synonym candidate for u if at least one surrogate of u has been clicked when w was issued as the keyword query
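The two steps can be sketched over in-memory data. No real search API is called here: `search_results` is a hypothetical stand-in for the ranked pages a search engine would return, `click_log` is a hypothetical list of (query, clicked page) pairs, and excluding u from its own candidates is a choice made for this illustration.

```python
def synonym_candidates(u, search_results, click_log, k=10):
    """Return queries w that led to a click on at least one surrogate page of u."""
    surrogates = set(search_results.get(u, [])[:k])  # top-k pages for u
    return {w for w, page in click_log if w != u and page in surrogates}


# Made-up search results and click log for illustration.
title = "indiana jones and the kingdom of the crystal skull"
search_results = {
    title: ["imdb.example/indy4", "wiki.example/indy4", "studio.example/indy4"],
}
click_log = [
    ("indy 4", "imdb.example/indy4"),
    ("indiana jones 4", "wiki.example/indy4"),
    ("harrison ford", "wiki.example/ford"),
    ("crystal skull trailer", "studio.example/indy4"),
]
candidates = synonym_candidates(title, search_results, click_log, k=2)
```

With k = 2 only the first two pages act as surrogates, so "crystal skull trailer" is not a candidate even though it clicked a page about the same movie.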

Candidate Selection • Intersection Page Count • Intersecting Click Ratio
