Cross-Lingual Query Suggestion Using Query Logs of Different Languages

Cross-Lingual Query Suggestion Using Query Logs of Different Languages (SIGIR '07)

Abstract • Query suggestion • To suggest relevant queries for a given query • To help users better specify their information needs • Cross-Lingual Query Suggestion (CLQS): • For a query in one language, we suggest similar or relevant queries in other languages. • cross-lingual keyword bidding (Search Engine) • cross-language information retrieval (CLIR)

Introduction • CLQS vs. Cross-Lingual Query Expansion • Full queries formulated by users in another language. • The users of search engines • similar interests in the same period of time • queries on similar topics in different languages • Key point • How to learn a similarity measure between two queries • MLQS: Term Co-Occurrence based MI and χ²

Estimating Cross-Lingual Query similarity • Discriminative Model for Estimating Cross-Lingual Query Similarity • Monolingual Query Similarity Measure Based on Click-through Information • Features Used for Learning Cross-Lingual Query Similarity Measure • Bilingual Dictionary • Parallel Corpora • Online Mining for Related Queries • Monolingual Query Suggestion • Estimating Cross-lingual Query Similarity

Discriminative Model for Estimating Cross-Lingual Query Similarity – 1/2 • qf : a source language query • qe : a target language query • simML : Monolingual query similarity • simCL : Cross-lingual query similarity • Tqf : translation of qf in the target language

Discriminative Model for Estimating Cross-Lingual Query Similarity – 2/2 • Learning: LIBSVM regression algorithm • f : feature functions • φ : mapping feature space onto kernel space • w : weight vector in the kernel space • relevant vs. irrelevant • strongly relevant, weakly relevant or irrelevant
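The equations on these two slides did not survive extraction. Read from the symbol definitions alone, the model can plausibly be written as below. This is a reconstruction under that assumption, not a copy of the original slide: the monolingual similarity between the translation of qf and qe serves as the training target, and the learned similarity is a weighted combination of the kernel-mapped features, with f the feature functions and φ the mapping onto the kernel space.

```latex
% Training target: cross-lingual similarity is fitted to the monolingual
% similarity between the translation T_{q_f} and the target query q_e.
\mathrm{sim}_{CL}(q_f, q_e) = \mathrm{sim}_{ML}(T_{q_f}, q_e)

% Learned regression function (weights w estimated with LIBSVM regression).
\mathrm{sim}_{CL}(q_f, q_e) = w \cdot \phi\bigl(f(q_f, q_e)\bigr)
```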

Monolingual Query Similarity Measure Based on Click-through Information • click-through information in query logs [26] • KN(x) : number of keywords in a query x • RD(x) : number of clicked URLs for a query x • α = 0.4 , β = 0.6
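The similarity formula itself is missing from the slide. A minimal sketch of how such a measure could be computed is given below, assuming it is a weighted sum of normalized keyword overlap and normalized clicked-URL overlap, with the two weights 0.4 and 0.6 from the slide. The function name `sim_ml` and the set-based overlap are illustrative assumptions, not taken from the source.

```python
def sim_ml(keywords_p, keywords_q, clicks_p, clicks_q, alpha=0.4, beta=0.6):
    """Assumed form: alpha * keyword overlap + beta * clicked-URL overlap.

    Each overlap is the number of shared items divided by the larger of the
    two set sizes, i.e. KN(p, q) / max(KN(p), KN(q)) and likewise for RD.
    """
    kp, kq = set(keywords_p), set(keywords_q)
    cp, cq = set(clicks_p), set(clicks_q)
    kw_den = max(len(kp), len(kq))
    cl_den = max(len(cp), len(cq))
    kw_overlap = len(kp & kq) / kw_den if kw_den else 0.0
    cl_overlap = len(cp & cq) / cl_den if cl_den else 0.0
    return alpha * kw_overlap + beta * cl_overlap


# Toy usage: two queries sharing 2 of 3 keywords and 1 of 2 clicked URLs.
score = sim_ml(["cheap", "flights"], ["cheap", "flights", "paris"],
               ["url1", "url2"], ["url2"])
```

Giving the click-through term the larger weight means two queries with few shared words can still score as similar when users click the same pages for both.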

1. Bilingual Dictionary – 1/2 • 120,000 unique entries (built in-house) • Given an input query qf={wf1,wf2,…,wfn} (in source language) • By bilingual dictionary D: D(wfi)={ti1,ti2,…,tim} • C(x,y) is the number of queries in the log containing both x and y. • C(x) is the number of queries in the log containing x. • N is the total number of queries in the log

1. Bilingual Dictionary – 2/2 • The set of top-4 query translations is denoted as S(Tqf) • T ∈ S(Tqf) • Retrieve all queries containing T in target language and assign Sdict(T) as their value
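The counts C(x,y), C(x) and N defined above are the ingredients of a co-occurrence based mutual information score. The sketch below shows one way to rank candidate query translations with it, assuming P(x,y) = C(x,y)/N and P(x) = C(x)/N, and assuming a translation's score is the sum of MI over all pairs of its chosen terms. Exhaustively enumerating combinations and the names `mutual_information` and `top_translations` are illustrative choices, not the authors' procedure.

```python
import math
from itertools import product


def mutual_information(x, y, pair_count, count, n):
    """MI(x, y) = P(x, y) * log(P(x, y) / (P(x) * P(y))), from log counts."""
    c_xy = pair_count.get(frozenset((x, y)), 0)
    if c_xy == 0 or count.get(x, 0) == 0 or count.get(y, 0) == 0:
        return 0.0
    p_xy = c_xy / n
    return p_xy * math.log(p_xy / ((count[x] / n) * (count[y] / n)))


def top_translations(candidates, pair_count, count, n, k=4):
    """Score every combination of per-word translations and keep the top k.

    candidates[i] is the dictionary translation set D(w_fi) of the i-th word.
    """
    scored = []
    for combo in product(*candidates):
        score = sum(mutual_information(a, b, pair_count, count, n)
                    for i, a in enumerate(combo) for b in combo[i + 1:])
        scored.append((score, combo))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Toy log statistics (made up for illustration only).
count = {"bank": 100, "shore": 50, "account": 80}
pair_count = {frozenset(("bank", "account")): 40}
best = top_translations([["bank", "shore"], ["account"]], pair_count, count, n=1000)
```

In this toy example "bank account" outranks "shore account" because only the former pair co-occurs in the log.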

2. Parallel Corpora • Given a pair of queries • qf : in the source language • qe : in the target language • Bi-Directional Translation Score : • IBM model 1 & GIZA++ tool • P(yj|xi) is the word-to-word translation probability • The top 10 queries {qe} with the highest bi-directional translation scores with qf are retrieved from the query log
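The bi-directional score can be illustrated with a small IBM Model 1 computation. The sketch below assumes the standard Model 1 likelihood with a NULL source word and assumes the two directions are combined by a geometric mean; the slide does not show the exact combination. The translation tables are toy values; in practice they would be estimated from parallel corpora, for example with GIZA++.

```python
import math


def ibm1_prob(target, source, t_prob, epsilon=1.0):
    """p(target | source) under IBM Model 1.

    t_prob maps (target_word, source_word) to a word translation probability.
    A NULL token is prepended to the source, as in the standard formulation.
    """
    src = ["<NULL>"] + list(source)
    prob = epsilon / (len(src) ** len(target))
    for y in target:
        prob *= sum(t_prob.get((y, x), 0.0) for x in src)
    return prob


def bidirectional_score(qf, qe, t_e_given_f, t_f_given_e):
    """Assumed combination: geometric mean of the two translation directions."""
    return math.sqrt(ibm1_prob(qe, qf, t_e_given_f) * ibm1_prob(qf, qe, t_f_given_e))


# Toy word translation tables (made up for illustration only).
t_e_given_f = {("house", "maison"): 0.8}
t_f_given_e = {("maison", "house"): 0.5}
score = bidirectional_score(["maison"], ["house"], t_e_given_f, t_f_given_e)
```

Combining both directions penalizes pairs where only one query explains the other, such as a short query matched against a much longer one.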

3. Online Mining for Related Queries – 1/3 • OOV (out-of-vocabulary) terms are a major knowledge bottleneck for query translation and CLIR • Assumption : • If a query in the target language co-occurs with the source query in many web pages • They are probably semantically related • But the mined results contain a large amount of noise

3. Online Mining for Related Queries – 2/3 • Frequency in the Snippets • For example: • Given a query q=abc in source language • By dictionary : a={a1,a2,a3}, b={b1,b2} and c={c1} • Web query : q ∧ (a1 ∨ a2 ∨ a3) ∧ (b1 ∨ b2) ∧ (c1) in target language • 700 snippets are retrieved; the 10 most frequent target queries are kept
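The example above can be sketched as two small steps: building the Boolean web query from the dictionary translations, and counting how often candidate target queries appear in the returned snippets. The `AND`/`OR` syntax, the function names, and the idea of matching candidates taken from the target query log by substring are assumptions made for illustration; the slide does not specify them.

```python
from collections import Counter


def build_web_query(source_query, translations):
    """Conjoin the source query with one OR-group per translated word."""
    groups = ["(" + " OR ".join(options) + ")" for options in translations]
    return " AND ".join(['"' + source_query + '"'] + groups)


def most_frequent_candidates(snippets, candidate_queries, top_n=10):
    """Return the candidates that occur in the most snippets, most frequent first."""
    counts = Counter()
    for snippet in snippets:
        text = snippet.lower()
        for candidate in candidate_queries:
            if candidate.lower() in text:
                counts[candidate] += 1
    return [candidate for candidate, _ in counts.most_common(top_n)]


# Mirrors the slide's example: q = abc with a={a1,a2,a3}, b={b1,b2}, c={c1}.
web_query = build_web_query("abc", [["a1", "a2", "a3"], ["b1", "b2"], ["c1"]])
top = most_frequent_candidates(["x foo y", "foo", "bar"], ["foo", "bar", "baz"], top_n=2)
```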

3. Online Mining for Related Queries – 3/3 • Any query qe mined from the web is associated with a feature: the CODC measure SCODC(qf,qe)

4. Monolingual Query Suggestion • Q0 : candidate queries (in target language) • For each target query qe, • SQML(qe) : monolingual source query

Estimating Cross-lingual Query Similarity • Four categories of features are used to learn the cross-lingual query similarity. • cross-lingual query similarity score • Learning: LIBSVM regression algorithm • f : feature functions • φ : mapping feature space onto kernel space • w : weight vector in the kernel space

Performance Evaluation – Log Data • Data Resources : • MSN Search Engine • French (source language) vs. English (target language) • A one-month English query log • 7 million unique English queries • Occurrence frequency more than 5 • 5,000 French queries • 4,171 queries have their translations in the English queries • 70% training data (to learn the weights with LIBSVM) • 10% development data • 20% testing

Performance Evaluation – CLIR • [Figure: source-language query qf → CLQS → target-language queries {qe} → BM25 retrieval] • Data Resources : • TREC6 CLIR data (AP88-90 newswire, 750MB) • 25 short French-English query pairs (CL1-CL25) • average length of 3.3 words • match the web query logs used for training CLQS

CLQS

CLIR

Conclusion • Cross-lingual query suggestion • Query Logs • French to English • TREC6 French to English CLIR task • CLQS demonstrates high quality
