
Automatic Query Expansion in Information Retrieval



  1. Automatic Query Expansion in Information Retrieval By Ryan Herbeck

  2. What is Automatic Query Expansion (AQE)? “A process which consists of selecting and adding terms to the user's query with the goal of minimizing query-document mismatch and thereby improving retrieval performance.” • Takes a user’s original query and selects and adds related words to it • Used to increase the effectiveness of relevant-document retrieval in information retrieval systems

  3. Current Information Retrieval (IR) Systems • Standard interface (one textbox, accepts keywords) • Keywords matched against keyword collection • Results are sorted and returned • Using multiple topic-specific keywords returns quality results • Issues: • User queries are usually short • Natural language is ambiguous • Prone to errors and omissions as a result

  4. Vocabulary Problem • System indexers and users often use different words • “Saltines” and “crackers” • Polysemy: same word, different meanings • “Java,” “Ruby” • Synonymy: different words, same meaning • “TV” and “television,” “CD” and “compact disk” • Synonymy + word inflections => decrease in recall • Recall: ability to retrieve all relevant documents • Polysemy => decrease in precision • Precision: ability to retrieve only relevant documents
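
For reference, the standard set-based definitions of the two measures mentioned on this slide (these are the conventional textbook definitions, not specific to this deck):

```latex
\mathrm{precision} = \frac{|\,\mathrm{relevant} \cap \mathrm{retrieved}\,|}{|\,\mathrm{retrieved}\,|}
\qquad
\mathrm{recall} = \frac{|\,\mathrm{relevant} \cap \mathrm{retrieved}\,|}{|\,\mathrm{relevant}\,|}
```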

  5. Proposed Solutions • Interactive query refinement • Relevance feedback • Word sense disambiguation • Search results clustering • AQE

  6. Early AQE • Suggested as early as 1960 • Investigated a variety of techniques • Vector feedback • Term-term clustering • Comparative analysis of term distributions • Experimented on small-scale collections • Yielded inconclusive results about effectiveness • Gain in recall was often offset by a loss in precision

  7. Queries Today • Volume of data has increased significantly • Number of terms in a user’s query has remained low • 2009: average query length was 2.30 words; same as in 1999 • Most common queries are 1-3 words in length • Vocabulary problem is worse • Scarcity of query terms reduces synonymy handling • Diversity and size of data increases effects of polysemy • The need for and scope of AQE have increased

  8. Applications of AQE • Question Answering • Goal: Provide direct responses as opposed to whole documents • Expand question with related terms expected to be found in documents with answers • Multimedia Information Retrieval • IR systems search over metadata (annotations, captions, etc.) • When no metadata exists, IR systems use content analysis which can be combined with AQE techniques • Automatic speech recognition, visual features

  9. Applications of AQE • Information Filtering • Monitor a stream of documents and select relevant ones • Documents arrive continuously (e-news, blogs, e-mail, etc.) • Cross-Language Information Retrieval • Retrieve documents in a language differing from the query • Issues: • Insufficient language coverage • Untranslatable terms • Translation ambiguity

  10. Related Techniques • Interactive Query Refinement • Relevance Feedback • Word Sense Disambiguation • Search Results Clustering

  11. Interactive Query Refinement (IQE) • Example: Google Suggest • System suggests several formulations of the query • Decision of query formulation made by user • Does not handle feature selection and query reformulation issues • Potential for producing better results than AQE, but requires user expertise

  12. Relevance Feedback • Returns initial query results • Receives user feedback about the relevancy of the results • Performs a new query based on that feedback • Makes the new query more similar to the relevant documents retrieved, whereas AQE forms a query more similar to the user’s intentions • Data sources for relevance feedback may be more reliable than those for AQE
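
A minimal sketch of one classical way to implement this feedback loop is the Rocchio formula (a standard relevance-feedback method; the slide does not name a specific one, so this is illustrative). Documents and the query are toy term-weight dicts; alpha, beta, and gamma are conventional tuning parameters:

```python
# Rocchio relevance feedback: move the query vector toward the relevant
# documents and away from the non-relevant ones:
#   q_new = alpha*q + beta*mean(relevant) - gamma*mean(non-relevant)
from collections import defaultdict

def rocchio(query_vec, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    new_q = defaultdict(float)
    for term, w in query_vec.items():
        new_q[term] += alpha * w
    for doc in relevant:
        for term, w in doc.items():
            new_q[term] += beta * w / len(relevant)
    for doc in nonrelevant:
        for term, w in doc.items():
            new_q[term] -= gamma * w / len(nonrelevant)
    # Negative weights are conventionally clipped to zero.
    return {t: w for t, w in new_q.items() if w > 0}

q = {"jaguar": 1.0}
rel = [{"jaguar": 0.5, "car": 0.4}, {"jaguar": 0.6, "engine": 0.3}]
nonrel = [{"jaguar": 0.5, "cat": 0.7}]
print(rocchio(q, rel, nonrel))   # "car" and "engine" enter the query
```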

  13. Word Sense Disambiguation (WSD) • Identifies word meanings in context • Approaches • Represent words by their text definitions • Use of WordNet • English lexical database which groups words into synonym subsets (synsets), gives general definitions and records semantic relations between synsets • Find all of a word’s contexts and cluster similar ones • Computational and effectiveness limitations • Typical queries may be too short for WSD • Example: “CD”
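
As a concrete illustration of the definition-overlap approach, NLTK ships a simplified Lesk implementation over WordNet (Lesk is one classical WSD algorithm; the slide does not commit to a particular method). This assumes the WordNet corpus has been downloaded, and which sense it picks depends entirely on the gloss overlaps:

```python
# Simplified Lesk word sense disambiguation via NLTK + WordNet.
# Setup (one-time): pip install nltk; then nltk.download("wordnet")
from nltk.wsd import lesk

# The ambiguous query term "CD" from the slide, given some context words.
context = "I burned my favorite songs onto a cd last night".split()
sense = lesk(context, "cd")   # picks the synset whose gloss overlaps the context most
print(sense, "-", sense.definition() if sense else "no sense found")
```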

  14. Search Results Clustering (SRC) Organizes and groups search results by topic Attempts to optimize clustering structure and label quality Labels could be seen as query refinements, but intended to help the user browse through results Example: http://clusty.com

  15. How AQE Works Data Preprocessing Feature Generation and Ranking Feature Selection Query Reformulation

  16. Data Preprocessing • Reformat data source for more effective subsequent processing • Index the collection of documents and run the query against the collection index • Extract text from documents • Extract words without punctuation and ignoring case • Remove articles and prepositions • Reduce word inflections and derivations • Assign a weighted importance value to each word

  17. Data Preprocessing • Example: • HTML: • ‘<b>Automatic query expansion</b> expands queries automatically.’ • Indexed representation (weight determined by frequency): • automat 0.33, queri 0.33, expan 0.16, expand 0.16 • Each document is represented as a collection of weighted terms
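
A toy version of the pipeline on this slide, using NLTK's Porter stemmer and relative term frequency as the weight (both are my choices; the slide specifies neither, so the exact stems and weights may differ slightly from the example above):

```python
# Toy indexing pipeline: strip markup, tokenize, drop stopwords,
# stem, and weight each stem by its relative frequency in the document.
import re
from collections import Counter
from nltk.stem import PorterStemmer

STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "and"}  # tiny stoplist
ps = PorterStemmer()

def index_document(html):
    text = re.sub(r"<[^>]+>", " ", html)           # remove HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())   # words only, lowercased
    stems = [ps.stem(t) for t in tokens if t not in STOPWORDS]
    counts = Counter(stems)
    total = sum(counts.values())
    return {stem: round(n / total, 2) for stem, n in counts.items()}

doc = "<b>Automatic query expansion</b> expands queries automatically."
print(index_document(doc))
# e.g. {'automat': 0.33, 'queri': 0.33, 'expans': 0.17, 'expand': 0.17}
```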

  18. Feature Generation and Ranking • Input: original query, transformed data source • Output: set of candidate expansion features (terms that could be added to the original query) • Original query may be preprocessed to have common words removed and/or important words extracted • Techniques: • One-to-One Associations • One-to-Many Associations • Analysis of Top-Ranked Documents • Query Language Modeling

  19. Feature Generation and Ranking • One-to-One Associations • Between expansion features and query terms • One feature is related to one query term • One or more features are generated and ranked for each term • Approaches • Stemming algorithm: reduces words to root form • WordNet: synonym sets (synsets), records semantic relations • Prevents ambiguity (select one synset for one query term) • Compute term-to-term similarities in a document collection • Mine user query logs
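
A sketch of the WordNet route for one-to-one associations: each query term independently yields its synset lemmas as candidate features. For simplicity this takes every synset's lemmas, whereas a real system would first disambiguate to a single synset, as noted above:

```python
# One-to-one associations: candidate expansion features per individual
# query term, drawn from WordNet synonym sets (synsets).
# Setup (one-time): nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def candidates_per_term(query_terms):
    cands = {}
    for term in query_terms:
        lemmas = {
            lemma.replace("_", " ")
            for syn in wn.synsets(term)
            for lemma in syn.lemma_names()
            if lemma.lower() != term
        }
        cands[term] = sorted(lemmas)
    return cands

print(candidates_per_term(["television", "crackers"]))
```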

  20. Feature Generation and Ranking • One-to-Many Associations • One feature is related to one or more query terms • Approaches • Extend one-to-one association techniques to other query terms • Generate a term if it is related to more than one term • Filters weakly related features • Combine multiple relationships between term pairs • Construct term network for the query • Network contains word pairs linked by relations (synonyms, stems, etc.)
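
One simple way to realize the "related to more than one query term" filter described above (a sketch; the threshold and the source of the per-term candidates are assumptions, e.g. the output of the one-to-one step):

```python
# One-to-many associations: keep only candidate features that are
# related to at least `min_links` distinct query terms.
def filter_one_to_many(per_term_candidates, min_links=2):
    support = {}
    for term, cands in per_term_candidates.items():
        for cand in cands:
            support.setdefault(cand, set()).add(term)
    return {c for c, terms in support.items() if len(terms) >= min_links}

per_term = {
    "compact": {"disc", "small", "dense"},
    "disk":    {"disc", "platter", "record"},
}
print(filter_one_to_many(per_term))   # {'disc'}: linked to both query terms
```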

  21. Feature Generation and Ranking • Analysis of Top-Ranked Documents • Retrieve top results for original query • Generate expansion features from related terms in these documents • Features are related to the query as a whole, as opposed to individual query terms • Approach: Pseudo-Relevance Feedback • Score each term in the top documents by applying a weighting function to the whole collection of documents • Sum up all weights of each term and sort the terms based on the sums • Issue: weights reflect importance over the collection more than importance over the query
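
A bare-bones pseudo-relevance-feedback scorer in the spirit of the steps above, with TF-IDF as a stand-in for the unnamed weighting function (my choice): each term's weight is summed over the top documents and the terms are sorted by the sums.

```python
# Pseudo-relevance feedback: score every term in the top-ranked documents
# by summing its tf-idf weight over those documents, then sort.
import math
from collections import Counter

def prf_scores(top_docs, collection):
    n = len(collection)
    df = Counter(t for doc in collection for t in set(doc))  # document frequency
    scores = Counter()
    for doc in top_docs:
        tf = Counter(doc)
        for term, f in tf.items():
            idf = math.log(n / df[term])   # safe: top_docs come from collection
            scores[term] += (f / len(doc)) * idf
    return scores.most_common()

collection = [
    "query expansion improves recall".split(),
    "query logs record user clicks".split(),
    "expansion terms come from top documents".split(),
]
top_docs = collection[:2]   # pretend these two ranked highest
print(prf_scores(top_docs, collection))
```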

  22. Feature Generation and Ranking • Query Language Modeling • Generate a probability distribution over query terms • Best features have the highest probabilities • Approaches: • Mixture Model • Builds a model from the top-ranked documents as a whole • Extracts the part most distinct from the overall document collection • Uses an expectation-maximization algorithm to get probabilities • Relevance Model • Builds a model from the top-ranked documents individually • Documents further down the list have less and less influence on word probabilities
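
A compressed sketch of the second variant, the relevance model: each word is scored by p(w|R) ≈ Σ_d p(w|d)·p(q|d), so lower-ranked documents (smaller p(q|d)) contribute less, as the slide says. The smoothing and toy data are my assumptions:

```python
# Relevance model: p(w|R) ~ sum over top docs of p(w|d) * p(query|d).
from collections import Counter

def p_term_given_doc(term, doc, vocab_size, mu=1.0):
    # Smoothed unigram language model of a single document.
    tf = Counter(doc)
    return (tf[term] + mu) / (len(doc) + mu * vocab_size)

def relevance_model(query, top_docs):
    vocab = {t for d in top_docs for t in d}
    scores = Counter()
    for doc in top_docs:
        p_q = 1.0
        for q in query:                       # p(query|doc), term independence
            p_q *= p_term_given_doc(q, doc, len(vocab))
        for term in vocab:
            scores[term] += p_term_given_doc(term, doc, len(vocab)) * p_q
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()}

docs = [
    "automatic query expansion adds related terms".split(),
    "expansion terms improve retrieval".split(),
]
rm = relevance_model(["query", "expansion"], docs)
print(sorted(rm.items(), key=lambda kv: -kv[1])[:5])
```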

  23. Feature Selection • Select top features for query expansion • Features are not evaluated further, simply selected based on rank • Limited number of features selected for rapid processing • Using all features is not necessarily better than using only a few • Typically select 10-30 features • Could select features only within a certain rank range

  24. Query Reformulation • Modify the original query by adding the selected features to it and perform the search • Approaches: • Query reweighting: assign a weight to each feature using a weighting formula • Simply add selected features to the original query without weighting
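
Tying the last two steps together, here is a sketch that selects the top-k ranked features (k = 10, inside the typical 10-30 range from the previous slide) and folds them into the query with scaled-down weights. The 0.5 scaling factor and the score normalization are arbitrary illustrative choices, not a prescribed formula:

```python
# Feature selection + query reweighting: keep the k best-ranked features
# and append them to the query with weights below the original terms.
def reformulate(query_terms, ranked_features, k=10, expansion_weight=0.5):
    new_query = {t: 1.0 for t in query_terms}   # original terms: weight 1.0
    top = ranked_features[:k]                   # selection purely by rank
    if not top:
        return new_query
    best = top[0][1]                            # normalize by the top score
    for feature, score in top:
        if feature not in new_query:
            new_query[feature] = expansion_weight * score / best
    return new_query

ranked = [("compact disc", 8.2), ("album", 5.1), ("music", 4.7)]
print(reformulate(["cd", "burn"], ranked, k=10))
```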

  25. Classification of AQE Techniques • Linguistic Analysis • Corpus-Specific Global Techniques • Query-Specific Local Techniques • Search Log Analysis • Web Data

  26. Linguistic Analysis • Focus on morphological, lexical, syntactic and semantic relationships for expansion • Analysis based on dictionaries, thesauri, or sources such as WordNet • Sensitive to word sense ambiguity • Examples: • Stemming algorithm: reduce terms to root form • Ontology browsing: paraphrase user’s query in context • Syntactic analysis: extract relations between terms to find features that appear in related relations

  27. Corpus-Specific Global Techniques • Corpus: large structured set of texts • Analyze contents of a full database to find features used similarly • Find correlations between term pairs at document level or within paragraphs or sentences • Data-driven • May not have a simple interpretation
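
A toy version of such a corpus-wide analysis: document-level co-occurrence counts turned into a Jaccard-style term-term similarity (Jaccard is my choice; the slide leaves the correlation measure open):

```python
# Corpus-wide term-term association: how often two terms occur in the
# same document, normalized by how often either occurs (Jaccard).
from collections import defaultdict

def doc_sets(corpus):
    postings = defaultdict(set)
    for i, doc in enumerate(corpus):
        for term in set(doc):
            postings[term].add(i)      # which documents each term appears in
    return postings

def jaccard(postings, a, b):
    da, db = postings[a], postings[b]
    return len(da & db) / len(da | db) if da | db else 0.0

corpus = [
    "apple iphone release".split(),
    "apple pie recipe".split(),
    "iphone screen repair".split(),
]
p = doc_sets(corpus)
print(jaccard(p, "apple", "iphone"))   # 1 shared doc out of 3 -> 0.33
```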

  28. Query-Specific Local Techniques • Utilize local context provided by the query • Make use of top-ranked documents • Examples: • Analysis of feature distribution difference • Model-based AQE • Top-document preprocessing • Removes irrelevant features before using term-ranking function

  29. Search Log Analysis • Mines users’ search logs for implicit query associations • Search logs contain queries and URLs of clicked pages • Example: user searches “apple,” find a past query “iPhone” • May encode implicit relevance feedback instead of retrieval feedback • Examples: • Extract features from past related queries that are related to the current query • Use top documents from past related queries • Extract terms directly from visited documents
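
A sketch of the first technique above: two queries are considered related when users clicked the same URL for both, and the related queries' terms become expansion candidates. The log schema and the frequency-based scoring are assumptions:

```python
# Search log analysis: find past queries that share clicked URLs with the
# current query, and harvest their terms as expansion candidates.
from collections import Counter, defaultdict

log = [  # (query, clicked_url) pairs; toy data
    ("apple", "http://apple.com/iphone"),
    ("iphone 13 review", "http://apple.com/iphone"),
    ("apple pie", "http://recipes.example/pie"),
]

def expansion_from_logs(current_query, log):
    clicks = defaultdict(set)
    for q, url in log:
        clicks[q].add(url)
    my_urls = clicks.get(current_query, set())
    cands = Counter()
    for q, urls in clicks.items():
        if q != current_query and urls & my_urls:   # shared clicked URL
            cands.update(t for t in q.split() if t not in current_query.split())
    return cands.most_common()

print(expansion_from_logs("apple", log))   # e.g. iphone, 13, review
```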

  30. Web Data • Use of anchor texts to generate features • Anchor text: visible, clickable text of a hyperlink • Most anchor texts are similar to real user queries • Anchor texts typically describe contents of the document • Issues: • “click here” • One-word/short anchor texts • Use of Wikipedia documents and hyperlinks
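
Finally, a sketch of the anchor-text idea: collect the anchors pointing at a document, drop generic ones such as "click here" (the issue noted above), and rank the rest by frequency as an expansion source. The stoplist and data are illustrative:

```python
# Anchor-text mining: anchors that point at a document often describe it,
# so frequent non-generic anchors make good expansion features.
from collections import Counter

GENERIC = {"click here", "here", "link", "more", "read more"}  # assumed stoplist

def anchor_features(anchors):
    cleaned = (a.lower().strip() for a in anchors)
    return Counter(a for a in cleaned if a not in GENERIC).most_common()

anchors_to_doc = ["Query Expansion survey", "click here", "query expansion",
                  "AQE survey", "query expansion", "here"]
print(anchor_features(anchors_to_doc))   # "query expansion" ranks first
```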

  31. Critical Issues • Parameter Setting • Efficiency • Usability

  32. Parameter Setting • Rely on several parameters • Number of pseudo-relevant documents • Number of expansion terms • Variables within term-ranking and weighting functions • Could use fixed values for key parameters • Fixed values may not work well for all queries

  33. Efficiency • Need to deliver real-time results to a large volume of users • Balancing processing time with quality results • Good AQE is computationally expensive • Expanded queries take longer to execute

  34. Usability • Implementation is hidden from users • Users may receive high-ranked documents that contain none of the query terms • Query terms substituted entirely by synonyms • Irrelevant document contains the query terms only in its anchor text • Could increase user control • Show the user the features used • Allow the user to revise the expanded query • AQE is better suited to non-expert users

  35. Conclusions • No perfect solution for the vocabulary problem • Overcomes users’ reluctance or inability to refine their own queries • Variety of implementations • Efficiency is gradually increasing • Near the end of its experimental stage • Not yet ready to be implemented in large-scale IR systems such as web search engines

  36. Questions?

  37. References
Carpineto, C. and Romano, G. 2012. A survey of automatic query expansion in information retrieval. ACM Comput. Surv. 44, 1, Article 1 (January 2012), 50 pages.
Furnas, G. W., Landauer, T. K., Gomez, L. M., and Dumais, S. T. 1987. The vocabulary problem in human-system communication. Comm. ACM 30, 11, 964-971.
Mitra, M., Singhal, A., and Buckley, C. 1998. Improving automatic query expansion. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, 206-214.
Vechtomova, O. 2009. Query expansion for information retrieval. In Encyclopedia of Database Systems, L. Liu and M. T. Özsu, Eds., Springer, 2254-2257.
