Enrich Query Representation by Query Understanding Gu Xu Microsoft Research Asia
Mismatching Problem
• Mismatching is a fundamental problem in search
• Examples: NY ↔ New York, game cheats ↔ game cheatcodes
• Search engine challenges
• Head (frequent) queries: rich information available, such as clicks, query sessions, and anchor texts
• Tail (infrequent) queries: information becomes sparse and limited
• Our proposal: enrich both queries and documents, and conduct matching on the enriched representations
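The enrichment idea above can be sketched in a few lines. This is a toy illustration, not the system from the talk: the synonym table and the overlap score are invented for demonstration, and a real engine would enrich both sides with clicks, sessions, and anchor texts.

```python
# Toy synonym table (an assumption for illustration only).
SYNONYMS = {
    "ny": {"new york"},
    "cheats": {"cheatcodes"},
}

def enrich(terms):
    """Return the term set plus any known equivalent forms."""
    enriched = set(terms)
    for t in terms:
        enriched |= SYNONYMS.get(t, set())
    return enriched

def match_score(query_terms, doc_terms):
    """Overlap between the enriched query and the enriched document."""
    q, d = enrich(query_terms), enrich(doc_terms)
    return len(q & d) / max(len(q), 1)

# "ny" and "new york" now match even though the raw terms differ.
print(match_score(["ny", "hotels"], ["new york", "hotels"]))
```

With raw term matching the query and document share only "hotels"; on the enriched representations they also share "new york", which is exactly the mismatch the talk targets.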
Matching at Different Semantic Levels (ordered from deep to shallow semantics)
• Structure: match intent with answers (structures of query and document)
• "Microsoft Office home" → find the homepage of Microsoft Office; "21 movie" → find the movie named 21; "buy laptop less than 1000" → find online dealers selling laptops for less than 1000 dollars
• Topic: match topics of query and documents
• "... working for Microsoft ... my office is in ..." → topic: Personal Homepage; "Microsoft Office" → topic: PC Software
• Sense: match terms with the same meanings
• utube ↔ youtube, NY ↔ New York, motherboard ↔ mainboard
• Term: match exactly the same terms
• pairs such as NY / New York and disk / disc do not match at this level
Enrich Query Representation (example query: "michaeljordanberkele" → understanding → enriched representation)
• Term level: tokenization
• <token>michael</token> <token>jordan</token> <token>berkele</token>
• Hard cases: C# vs. C, 1,000 vs. 1 000, MAX_PATH vs. MAX PATH
• Sense level: query refinement and alternative query finding (ill-formed → well-formed)
• <correction token="berkele">berkeley</correction>, <similar-queries>michael I. jordan berkeley</similar-queries>
• Issues: ambiguity (msil or mail), equivalence or dependency (department / dept, login / sign on)
• Topic level: query classification
• <query-topics>academic</query-topics>
• Issues: definition of classes, accuracy and efficiency
• Structure level: query parsing
• <person-name>michael jordan</person-name> <location>berkeley</location>
• Issues: named entity segmentation and disambiguation, large-scale knowledge base
Query Refinement
• Example: "Papers on Machin Learn" → Papers on "Machine Learning"
• Spelling error correction: machin → machine
• Inflection: learn → learning
• Phrase segmentation: "machine learning"
• The operations are mutually dependent: spelling error correction, inflection, and phrase segmentation must be considered together
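A fixed cascade of the three operations can be sketched as below. The dictionaries are toy assumptions; the talk's point is precisely that the operations interact (correcting "machin" enables the inflection of "learn" and the phrase grouping), so the actual model decides them jointly rather than in this rigid order.

```python
# Toy refinement tables (assumptions for illustration only).
SPELLING = {"machin": "machine"}
INFLECTION = {"learn": "learning"}        # context-dependent in reality
PHRASES = {("machine", "learning")}

def refine(query):
    """Apply spelling correction, inflection, then phrase segmentation."""
    tokens = [SPELLING.get(t, t) for t in query.split()]   # spelling
    tokens = [INFLECTION.get(t, t) for t in tokens]        # inflection
    out, i = [], 0
    while i < len(tokens):                                 # segmentation
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in PHRASES:
            out.append('"%s %s"' % (tokens[i], tokens[i + 1]))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

print(refine("papers on machin learn"))  # papers on "machine learning"
```

Note that if the spelling step is skipped, neither the inflection entry nor the phrase entry fires, which illustrates the mutual dependence.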
Conventional CRF
[Diagram: input X = "papers on machin learn" (x0 ... x3); output Y ranges over every candidate word at every position, e.g. papers / paper / in / upon at x0, machine / machines / machining at x2, learning / learns / learned at x3. The label space is the cross-product of all per-position candidates, which makes a conventional word-based CRF intractable.]
CRF for Query Refinement
[Diagram: the model introduces an operation sequence O between the observed query X and the refined query Y.]
CRF for Query Refinement (cont.)
[Diagram: for x2 = "machin" and x3 = "learn", the operations O restrict the candidates y2 and y3 to words derivable from the observed tokens (e.g. machine, machined, machining; learning, learned), excluding unrelated words such as walk, soccer, lyrics.]
1. O constrains the mapping from X to Y (reduces the search space)
CRF for Query Refinement (cont.)
[Diagram: each candidate is reached through an explicit operation, e.g. machin + Insertion → machine, machin + "+ed" → machined, machin + "+ing" → machining, learn + "+ing" → learning; Deletion applies likewise.]
1. O constrains the mapping from X to Y (reduces the search space)
2. O indexes the mapping from X to Y (shares parameters across words)
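The operation-indexed candidate generation can be sketched as follows. The specific operation inventory here is a simplified assumption; the key idea from the slides is that model weights attach to operation ids rather than to individual word pairs, so parameters are shared across the whole vocabulary.

```python
def apply_ops(token):
    """Generate (candidate, operation) pairs for one query token.

    Because the operation id (not the word pair) indexes the features,
    a weight learned for "+ing" on one word transfers to every word.
    """
    cands = [(token, "keep")]
    cands.append((token + "e", "+e"))        # e.g. machin -> machine
    cands.append((token + "ed", "+ed"))      # e.g. machin -> machined
    cands.append((token + "ing", "+ing"))    # e.g. learn  -> learning
    if len(token) > 1:
        cands.append((token[:-1], "del-last"))
    return cands

for cand, op in apply_ops("machin"):
    print(op, "->", cand)
```

The candidate set per position is now a handful of operation outputs instead of the whole vocabulary, which is the "reduce space" point, and the shared operation ids are the "sharing parameters" point.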
Named Entity Recognition in Query
• "harry potter" → Movie (0.5), Book (0.4), Game (0.1)
• "harry potter film" → Movie (0.95)
• "harry potter author" → Book (0.95)
Challenges • Named entity recognition in queries differs from NER in documents • Challenges • Queries are short (2-3 words on average): fewer context features • Queries are not well-formed (typos, lower-cased, ...): fewer content features • Knowledge base issues • Coverage and freshness • Ambiguity
Our Approach to NERQ • Example: the query q = "Harry Potter Walkthrough" decomposes into e = "Harry Potter" (named entity), t = "# Walkthrough" (context), and c = "Game" (class) • The goal of NERQ becomes finding the best triple (e, t, c)* consistent with query q, i.e. (e, t, c)* = argmax over (e, t, c) of Pr(e, t, c), which factorizes as Pr(e) · Pr(c | e) · Pr(t | c)
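The triple search can be sketched numerically. The probability tables below are invented for illustration (only the Pr(c | e) values echo the ambiguity figures on the earlier slide); the point is how the factorization Pr(e) · Pr(c | e) · Pr(t | c) lets the context disambiguate the class.

```python
# Toy probability tables (assumptions for illustration only).
P_E = {"harry potter": 0.5}
P_C_GIVEN_E = {("harry potter", "Movie"): 0.5,
               ("harry potter", "Book"): 0.4,
               ("harry potter", "Game"): 0.1}
P_T_GIVEN_C = {("# walkthrough", "Game"): 0.6,
               ("# walkthrough", "Movie"): 0.01,
               ("# walkthrough", "Book"): 0.01}

def best_triple(e, t):
    """argmax over c of Pr(e) * Pr(c|e) * Pr(t|c)."""
    best_c = max(("Movie", "Book", "Game"),
                 key=lambda c: P_E[e] * P_C_GIVEN_E[(e, c)]
                               * P_T_GIVEN_C[(t, c)])
    return (e, t, best_c)

print(best_triple("harry potter", "# walkthrough"))
```

Even though "harry potter" alone leans toward Movie (0.5), the context "# walkthrough" is far more likely under Game, so the best triple picks the Game class.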
Training With Topic Model • Ideal training data: T = {(ei, ti, ci)} • Real training data: T = {(ei, ti, *)} • Queries are ambiguous (harry potter vs. harry potter review) • Training data are relatively scarce
Training With Topic Model (cont.)
[Diagram: entities e (harry potter, kung fu panda, iron man, ...) are linked, through latent topics, to contexts t (# wallpapers, # movies, # walkthrough, # book price, ...) and classes c (Movie, Game, Book, ...).]
• # is a placeholder for the named entity; here # means "harry potter"
Weakly Supervised Topic Model • Introducing supervision • Supervised signals improve the model • Alignment between implicit topics and explicit classes • Weak supervision • Label named entities rather than queries (analogous to document class labels) • Multiple class labels per entity (binary indicators), e.g. Kung Fu Panda over the classes Movie, Game, Book, giving a distribution over classes
WS-LDA • LDA + soft constraints (w.r.t. the supervision) • The objective combines the LDA likelihood of each document with a soft-constraint term that matches the document's probability on the i-th class against its binary label on the i-th class (e.g. a label vector such as 1, 1, 0)
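The combined objective can be sketched as below. The functional form of the constraint term (a label-weighted sum of class probabilities scaled by λ) and all numbers are illustrative assumptions, not the exact WS-LDA formulation; the sketch only shows how the soft constraint rewards putting topic mass on the labeled classes.

```python
def ws_lda_objective(log_lik, class_probs, labels, lam=1.0):
    """LDA log-likelihood plus a soft-constraint reward.

    class_probs[i] is the document's probability on the i-th class;
    labels[i] is its binary label on the i-th class. The reward grows
    when probability mass sits on classes labeled 1.
    """
    constraint = sum(y * p for y, p in zip(labels, class_probs))
    return log_lik + lam * constraint

# Entity labeled Movie and Game but not Book -> labels (1, 1, 0).
good_fit = ws_lda_objective(-12.3, (0.5, 0.4, 0.1), (1, 1, 0))
bad_fit = ws_lda_objective(-12.3, (0.1, 0.1, 0.8), (1, 1, 0))
print(good_fit, bad_fit)
```

A topic assignment concentrated on the labeled classes scores higher than one concentrated on an unlabeled class, which is how the supervision softly aligns topics with classes.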
Extension: Leveraging Clicks
• Context t: # wallpapers, # movies, # walkthrough, # book price, ...
• Additional features: URL words, title words, snippet words, content words, other features
• Clicked host names t': www.imdb.com, www.wikipedia.com, www.gamespot.com, www.sparknotes.com, cheats.ign.com, ...
• Classes: Game, Movie, Book
Summary The goal of query understanding is to enrich the query representation and thereby address the fundamental problem of term mismatching.