Enrich Query Representation by Query Understanding Gu Xu Microsoft Research Asia
Mismatching Problem
• Mismatching is a fundamental problem in search
• Examples: NY ↔ New York, game cheats ↔ game cheatcodes
• Search engine challenges
• Head (frequent) queries: rich information available, such as clicks, query sessions, and anchor texts
• Tail (infrequent) queries: information becomes sparse and limited
• Our proposal: enrich both queries and documents, and conduct matching on the enriched representations
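The enrichment idea above can be sketched in a few lines. This is a toy illustration, not the system from the talk: the synonym table and the overlap score are invented for demonstration, and a real engine would enrich both sides with clicks, sessions, and anchor texts.

```python
# Toy synonym table (an assumption for illustration only).
SYNONYMS = {
    "ny": {"new york"},
    "cheats": {"cheatcodes"},
}

def enrich(terms):
    """Return the term set plus any known equivalent forms."""
    enriched = set(terms)
    for t in terms:
        enriched |= SYNONYMS.get(t, set())
    return enriched

def match_score(query_terms, doc_terms):
    """Overlap between the enriched query and the enriched document."""
    q, d = enrich(query_terms), enrich(doc_terms)
    return len(q & d) / max(len(q), 1)

# "ny" and "new york" now match even though the raw terms differ.
print(match_score(["ny", "hotels"], ["new york", "hotels"]))
```

With raw term matching the query and document share only "hotels"; on the enriched representations they also share "new york", which is exactly the mismatch the talk targets.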
Matching at Different Semantic Levels (ordered from deep to shallow semantics)
• Structure: match intent with answers (structures of query and document)
• "Microsoft Office home" → find the homepage of Microsoft Office; "21 movie" → find the movie named 21; "buy laptop less than 1000" → find online dealers selling laptops for less than 1000 dollars
• Topic: match topics of query and documents
• "... working for Microsoft ... my office is in ..." → topic: Personal Homepage; "Microsoft Office" → topic: PC Software
• Sense: match terms with the same meanings
• utube ↔ youtube, NY ↔ New York, motherboard ↔ mainboard
• Term: match exactly the same terms
• pairs such as NY / New York and disk / disc do not match at this level
Enrich Query Representation (example query: "michaeljordanberkele" → understanding → enriched representation)
• Term level: tokenization
• <token>michael</token> <token>jordan</token> <token>berkele</token>
• Hard cases: C# vs. C, 1,000 vs. 1 000, MAX_PATH vs. MAX PATH
• Sense level: query refinement and alternative query finding (ill-formed → well-formed)
• <correction token="berkele">berkeley</correction>, <similar-queries>michael I. jordan berkeley</similar-queries>
• Issues: ambiguity (msil or mail), equivalence or dependency (department / dept, login / sign on)
• Topic level: query classification
• <query-topics>academic</query-topics>
• Issues: definition of classes, accuracy and efficiency
• Structure level: query parsing
• <person-name>michael jordan</person-name> <location>berkeley</location>
• Issues: named entity segmentation and disambiguation, large-scale knowledge base
Query Refinement
• Example: "Papers on Machin Learn" → Papers on "Machine Learning"
• Spelling error correction: machin → machine
• Inflection: learn → learning
• Phrase segmentation: "machine learning"
• The operations are mutually dependent: spelling error correction, inflection, and phrase segmentation must be considered together
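A fixed cascade of the three operations can be sketched as below. The dictionaries are toy assumptions; the talk's point is precisely that the operations interact (correcting "machin" enables the inflection of "learn" and the phrase grouping), so the actual model decides them jointly rather than in this rigid order.

```python
# Toy refinement tables (assumptions for illustration only).
SPELLING = {"machin": "machine"}
INFLECTION = {"learn": "learning"}        # context-dependent in reality
PHRASES = {("machine", "learning")}

def refine(query):
    """Apply spelling correction, inflection, then phrase segmentation."""
    tokens = [SPELLING.get(t, t) for t in query.split()]   # spelling
    tokens = [INFLECTION.get(t, t) for t in tokens]        # inflection
    out, i = [], 0
    while i < len(tokens):                                 # segmentation
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in PHRASES:
            out.append('"%s %s"' % (tokens[i], tokens[i + 1]))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

print(refine("papers on machin learn"))  # papers on "machine learning"
```

Note that if the spelling step is skipped, neither the inflection entry nor the phrase entry fires, which illustrates the mutual dependence.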
Conventional CRF
[Diagram: input X = "papers on machin learn" (x0 ... x3); output Y ranges over every candidate word at every position, e.g. papers / paper / in / upon at x0, machine / machines / machining at x2, learning / learns / learned at x3. The label space is the cross-product of all per-position candidates, which makes a conventional word-based CRF intractable.]
CRF for Query Refinement
[Diagram: the model introduces an operation sequence O between the observed query X and the refined query Y.]
CRF for Query Refinement (cont.)
[Diagram: for x2 = "machin" and x3 = "learn", the operations O restrict the candidates y2 and y3 to words derivable from the observed tokens (e.g. machine, machined, machining; learning, learned), excluding unrelated words such as walk, soccer, lyrics.]
1. O constrains the mapping from X to Y (reduces the search space)
CRF for Query Refinement (cont.)
[Diagram: each candidate is reached through an explicit operation, e.g. machin + Insertion → machine, machin + "+ed" → machined, machin + "+ing" → machining, learn + "+ing" → learning; Deletion applies likewise.]
1. O constrains the mapping from X to Y (reduces the search space)
2. O indexes the mapping from X to Y (shares parameters across words)
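The operation-indexed candidate generation can be sketched as follows. The specific operation inventory here is a simplified assumption; the key idea from the slides is that model weights attach to operation ids rather than to individual word pairs, so parameters are shared across the whole vocabulary.

```python
def apply_ops(token):
    """Generate (candidate, operation) pairs for one query token.

    Because the operation id (not the word pair) indexes the features,
    a weight learned for "+ing" on one word transfers to every word.
    """
    cands = [(token, "keep")]
    cands.append((token + "e", "+e"))        # e.g. machin -> machine
    cands.append((token + "ed", "+ed"))      # e.g. machin -> machined
    cands.append((token + "ing", "+ing"))    # e.g. learn  -> learning
    if len(token) > 1:
        cands.append((token[:-1], "del-last"))
    return cands

for cand, op in apply_ops("machin"):
    print(op, "->", cand)
```

The candidate set per position is now a handful of operation outputs instead of the whole vocabulary, which is the "reduce space" point, and the shared operation ids are the "sharing parameters" point.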
Named Entity Recognition in Query
• "harry potter" → Movie (0.5), Book (0.4), Game (0.1)
• "harry potter film" → Movie (0.95)
• "harry potter author" → Book (0.95)
Challenges • Named entity recognition in queries differs from NER in documents • Challenges • Queries are short (2-3 words on average): fewer context features • Queries are not well-formed (typos, lower-cased, ...): fewer content features • Knowledge base issues • Coverage and freshness • Ambiguity
Our Approach to NERQ • Example: the query q = "Harry Potter Walkthrough" decomposes into e = "Harry Potter" (named entity), t = "# Walkthrough" (context), and c = "Game" (class) • The goal of NERQ becomes finding the best triple (e, t, c)* consistent with query q, i.e. (e, t, c)* = argmax over (e, t, c) of Pr(e, t, c), which factorizes as Pr(e) · Pr(c | e) · Pr(t | c)
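The triple search can be sketched numerically. The probability tables below are invented for illustration (only the Pr(c | e) values echo the ambiguity figures on the earlier slide); the point is how the factorization Pr(e) · Pr(c | e) · Pr(t | c) lets the context disambiguate the class.

```python
# Toy probability tables (assumptions for illustration only).
P_E = {"harry potter": 0.5}
P_C_GIVEN_E = {("harry potter", "Movie"): 0.5,
               ("harry potter", "Book"): 0.4,
               ("harry potter", "Game"): 0.1}
P_T_GIVEN_C = {("# walkthrough", "Game"): 0.6,
               ("# walkthrough", "Movie"): 0.01,
               ("# walkthrough", "Book"): 0.01}

def best_triple(e, t):
    """argmax over c of Pr(e) * Pr(c|e) * Pr(t|c)."""
    best_c = max(("Movie", "Book", "Game"),
                 key=lambda c: P_E[e] * P_C_GIVEN_E[(e, c)]
                               * P_T_GIVEN_C[(t, c)])
    return (e, t, best_c)

print(best_triple("harry potter", "# walkthrough"))
```

Even though "harry potter" alone leans toward Movie (0.5), the context "# walkthrough" is far more likely under Game, so the best triple picks the Game class.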
Training With Topic Model • Ideal training data: T = {(ei, ti, ci)} • Real training data: T = {(ei, ti, *)} • Queries are ambiguous (harry potter vs. harry potter review) • Training data are relatively scarce
Training With Topic Model (cont.)
[Diagram: entities e (harry potter, kung fu panda, iron man, ...) are linked, through latent topics, to contexts t (# wallpapers, # movies, # walkthrough, # book price, ...) and classes c (Movie, Game, Book, ...).]
• # is a placeholder for the named entity; here # means "harry potter"
Weakly Supervised Topic Model • Introducing supervision • Supervised signals improve the model • Alignment between implicit topics and explicit classes • Weak supervision • Label named entities rather than queries (analogous to document class labels) • Multiple class labels per entity (binary indicators), e.g. Kung Fu Panda over the classes Movie, Game, Book, giving a distribution over classes
WS-LDA • LDA + soft constraints (w.r.t. the supervision) • The objective combines the LDA likelihood of each document with a soft-constraint term that matches the document's probability on the i-th class against its binary label on the i-th class (e.g. a label vector such as 1, 1, 0)
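The combined objective can be sketched as below. The functional form of the constraint term (a label-weighted sum of class probabilities scaled by λ) and all numbers are illustrative assumptions, not the exact WS-LDA formulation; the sketch only shows how the soft constraint rewards putting topic mass on the labeled classes.

```python
def ws_lda_objective(log_lik, class_probs, labels, lam=1.0):
    """LDA log-likelihood plus a soft-constraint reward.

    class_probs[i] is the document's probability on the i-th class;
    labels[i] is its binary label on the i-th class. The reward grows
    when probability mass sits on classes labeled 1.
    """
    constraint = sum(y * p for y, p in zip(labels, class_probs))
    return log_lik + lam * constraint

# Entity labeled Movie and Game but not Book -> labels (1, 1, 0).
good_fit = ws_lda_objective(-12.3, (0.5, 0.4, 0.1), (1, 1, 0))
bad_fit = ws_lda_objective(-12.3, (0.1, 0.1, 0.8), (1, 1, 0))
print(good_fit, bad_fit)
```

A topic assignment concentrated on the labeled classes scores higher than one concentrated on an unlabeled class, which is how the supervision softly aligns topics with classes.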
Extension: Leveraging Clicks
• Context t: # wallpapers, # movies, # walkthrough, # book price, ...
• Additional features: URL words, title words, snippet words, content words, other features
• Clicked host names t': www.imdb.com, www.wikipedia.com, www.gamespot.com, www.sparknotes.com, cheats.ign.com, ...
• Classes: Game, Movie, Book
Summary The goal of query understanding is to enrich the query representation and thereby address the fundamental problem of term mismatching.