
Automatic Query Expansion in Information Retrieval



  1. Automatic Query Expansion in Information Retrieval By Ryan Herbeck

  2. What is Automatic Query Expansion (AQE)? “A process which consists of selecting and adding terms to the user's query with the goal of minimizing query-document mismatch and thereby improving retrieval performance.” • Takes a user’s original query and selects and adds related words to it • Used to increase the effectiveness of relevant-document retrieval in information retrieval systems

  3. Current Information Retrieval (IR) Systems • Standard interface (one textbox, accepts keywords) • Keywords matched against keyword collection • Results are sorted and returned • Using multiple topic-specific keywords returns quality results • Issues: • User queries are usually short • Natural language is ambiguous • Prone to errors and omissions as a result

  4. Vocabulary Problem • System indexers and users often use different words • “Saltines” and “crackers” • Polysemy: same word, different meanings • “Java,” “Ruby” • Synonymy: different words, same meaning • “TV” and “television,” “CD” and “compact disk” • Synonymy + word inflections => decrease in recall • Recall: ability to retrieve all relevant documents • Polysemy => decrease in precision • Precision: ability to retrieve only relevant documents
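
For reference, the standard set-based definitions of the two measures mentioned on this slide (these are the conventional textbook definitions, not specific to this deck):

```latex
\mathrm{precision} = \frac{|\,\mathrm{relevant} \cap \mathrm{retrieved}\,|}{|\,\mathrm{retrieved}\,|}
\qquad
\mathrm{recall} = \frac{|\,\mathrm{relevant} \cap \mathrm{retrieved}\,|}{|\,\mathrm{relevant}\,|}
```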

  5. Proposed Solutions • Interactive query refinement • Relevance feedback • Word sense disambiguation • Search results clustering • AQE

  6. Early AQE • Suggested as early as 1960 • Investigated a variety of techniques • Vector feedback • Term-term clustering • Comparative analysis of term distributions • Experimented on small-scale collections • Yielded inconclusive results about effectiveness • Gain in recall was often offset by a loss in precision

  7. Queries Today • Volume of data has increased significantly • Number of terms in a user’s query has remained low • 2009: average query length was 2.30 words; same as in 1999 • Most common queries are 1-3 words in length • Vocabulary problem is worse • Scarcity of query terms reduces synonymy handling • Diversity and size of data increases effects of polysemy • The need for and scope of AQE have increased

  8. Applications of AQE • Question Answering • Goal: Provide direct responses as opposed to whole documents • Expand question with related terms expected to be found in documents with answers • Multimedia Information Retrieval • IR systems search over metadata (annotations, captions, etc.) • When no metadata exists, IR systems use content analysis which can be combined with AQE techniques • Automatic speech recognition, visual features

  9. Applications of AQE • Information Filtering • Monitor a stream of documents and select relevant ones • Documents arrive continuously (e-news, blogs, e-mail, etc.) • Cross-Language Information Retrieval • Retrieve documents in a language differing from the query • Issues: • Insufficient language coverage • Untranslatable terms • Translation ambiguity

  10. Related Techniques • Interactive Query Refinement • Relevance Feedback • Word Sense Disambiguation • Search Results Clustering

  11. Interactive Query Refinement (IQE) • Example: Google Suggest • System suggests several formulations of the query • Decision of query formulation made by user • Does not handle feature selection and query reformulation issues • Potential for producing better results than AQE, but requires user expertise

  12. Relevance Feedback • Returns initial query results • Receives user feedback about the relevancy of the results • Performs a new query based on that feedback • Makes the new query more similar to the relevant documents retrieved, whereas AQE forms a query more similar to the user’s intentions • Data sources for relevance feedback may be more reliable than those for AQE
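
A minimal sketch of one classical way to implement this feedback loop is the Rocchio formula (a standard relevance-feedback method; the slide does not name a specific one, so this is illustrative). Documents and the query are toy term-weight dicts; alpha, beta, and gamma are conventional tuning parameters:

```python
# Rocchio relevance feedback: move the query vector toward the relevant
# documents and away from the non-relevant ones:
#   q_new = alpha*q + beta*mean(relevant) - gamma*mean(non-relevant)
from collections import defaultdict

def rocchio(query_vec, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    new_q = defaultdict(float)
    for term, w in query_vec.items():
        new_q[term] += alpha * w
    for doc in relevant:
        for term, w in doc.items():
            new_q[term] += beta * w / len(relevant)
    for doc in nonrelevant:
        for term, w in doc.items():
            new_q[term] -= gamma * w / len(nonrelevant)
    # Negative weights are conventionally clipped to zero.
    return {t: w for t, w in new_q.items() if w > 0}

q = {"jaguar": 1.0}
rel = [{"jaguar": 0.5, "car": 0.4}, {"jaguar": 0.6, "engine": 0.3}]
nonrel = [{"jaguar": 0.5, "cat": 0.7}]
print(rocchio(q, rel, nonrel))   # "car" and "engine" enter the query
```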

  13. Word Sense Disambiguation (WSD) • Identifies word meanings in context • Approaches • Represent words by their text definitions • Use of WordNet • English lexical database which groups words into synonym subsets (synsets), gives general definitions and records semantic relations between synsets • Find all of a word’s contexts and cluster similar ones • Computational and effectiveness limitations • Typical queries may be too short for WSD • Example: “CD”
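
As a concrete illustration of the definition-overlap approach, NLTK ships a simplified Lesk implementation over WordNet (Lesk is one classical WSD algorithm; the slide does not commit to a particular method). This assumes the WordNet corpus has been downloaded, and which sense it picks depends entirely on the gloss overlaps:

```python
# Simplified Lesk word sense disambiguation via NLTK + WordNet.
# Setup (one-time): pip install nltk; then nltk.download("wordnet")
from nltk.wsd import lesk

# The ambiguous query term "CD" from the slide, given some context words.
context = "I burned my favorite songs onto a cd last night".split()
sense = lesk(context, "cd")   # picks the synset whose gloss overlaps the context most
print(sense, "-", sense.definition() if sense else "no sense found")
```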

  14. Search Results Clustering (SRC) Organizes and groups search results by topic Attempts to optimize clustering structure and label quality Labels could be seen as query refinements, but intended to help the user browse through results Example: http://clusty.com

  15. How AQE Works Data Preprocessing Feature Generation and Ranking Feature Selection Query Reformulation

  16. Data Preprocessing • Reformat data source for more effective subsequent processing • Index the collection of documents and run the query against the collection index • Extract text from documents • Extract words without punctuation and ignoring case • Remove articles and prepositions • Reduce word inflections and derivations • Assign a weighted importance value to each word

  17. Data Preprocessing • Example: • HTML: • ‘<b>Automatic query expansion</b> expands queries automatically.’ • Indexed representation (weight determined by frequency): • automat 0.33, queri 0.33, expan 0.16, expand 0.16 • Each document is represented as a collection of weighted terms
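
A toy version of the pipeline on this slide, using NLTK's Porter stemmer and relative term frequency as the weight (both are my choices; the slide specifies neither, so the exact stems and weights may differ slightly from the example above):

```python
# Toy indexing pipeline: strip markup, tokenize, drop stopwords,
# stem, and weight each stem by its relative frequency in the document.
import re
from collections import Counter
from nltk.stem import PorterStemmer

STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "and"}  # tiny stoplist
ps = PorterStemmer()

def index_document(html):
    text = re.sub(r"<[^>]+>", " ", html)           # remove HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())   # words only, lowercased
    stems = [ps.stem(t) for t in tokens if t not in STOPWORDS]
    counts = Counter(stems)
    total = sum(counts.values())
    return {stem: round(n / total, 2) for stem, n in counts.items()}

doc = "<b>Automatic query expansion</b> expands queries automatically."
print(index_document(doc))
# e.g. {'automat': 0.33, 'queri': 0.33, 'expans': 0.17, 'expand': 0.17}
```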

  18. Feature Generation and Ranking • Input: original query, transformed data source • Output: set of candidate expansion features (terms that could be added to the original query) • Original query may be preprocessed to have common words removed and/or important words extracted • Techniques: • One-to-One Associations • One-to-Many Associations • Analysis of Top-Ranked Documents • Query Language Modeling

  19. Feature Generation and Ranking • One-to-One Associations • Between expansion features and query terms • One feature is related to one query term • One or more features are generated and ranked for each term • Approaches • Stemming algorithm: reduces words to root form • WordNet: synonym sets (synsets), records semantic relations • Prevents ambiguity (select one synset for one query term) • Compute term-to-term similarities in a document collection • Mine user query logs
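
A sketch of the WordNet route for one-to-one associations: each query term independently yields its synset lemmas as candidate features. For simplicity this takes every synset's lemmas, whereas a real system would first disambiguate to a single synset, as noted above:

```python
# One-to-one associations: candidate expansion features per individual
# query term, drawn from WordNet synonym sets (synsets).
# Setup (one-time): nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def candidates_per_term(query_terms):
    cands = {}
    for term in query_terms:
        lemmas = {
            lemma.replace("_", " ")
            for syn in wn.synsets(term)
            for lemma in syn.lemma_names()
            if lemma.lower() != term
        }
        cands[term] = sorted(lemmas)
    return cands

print(candidates_per_term(["television", "crackers"]))
```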

  20. Feature Generation and Ranking • One-to-Many Associations • One feature is related to one or more query terms • Approaches • Extend one-to-one association techniques to other query terms • Generate a term if it is related to more than one term • Filters weakly related features • Combine multiple relationships between term pairs • Construct term network for the query • Network contains word pairs linked by relations (synonyms, stems, etc.)
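
One simple way to realize the "related to more than one query term" filter described above (a sketch; the threshold and the source of the per-term candidates are assumptions, e.g. the output of the one-to-one step):

```python
# One-to-many associations: keep only candidate features that are
# related to at least `min_links` distinct query terms.
def filter_one_to_many(per_term_candidates, min_links=2):
    support = {}
    for term, cands in per_term_candidates.items():
        for cand in cands:
            support.setdefault(cand, set()).add(term)
    return {c for c, terms in support.items() if len(terms) >= min_links}

per_term = {
    "compact": {"disc", "small", "dense"},
    "disk":    {"disc", "platter", "record"},
}
print(filter_one_to_many(per_term))   # {'disc'}: linked to both query terms
```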

  21. Feature Generation and Ranking • Analysis of Top-Ranked Documents • Retrieve top results for original query • Generate expansion features from related terms in these documents • Features are related to the query as a whole, as opposed to individual query terms • Approach: Pseudo-Relevance Feedback • Score each term in the top documents by applying a weighting function to the whole collection of documents • Sum up all weights of each term and sort the terms based on the sums • Issue: weights reflect importance over the collection more than importance over the query
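
A bare-bones pseudo-relevance-feedback scorer in the spirit of the steps above, with TF-IDF as a stand-in for the unnamed weighting function (my choice): each term's weight is summed over the top documents and the terms are sorted by the sums.

```python
# Pseudo-relevance feedback: score every term in the top-ranked documents
# by summing its tf-idf weight over those documents, then sort.
import math
from collections import Counter

def prf_scores(top_docs, collection):
    n = len(collection)
    df = Counter(t for doc in collection for t in set(doc))  # document frequency
    scores = Counter()
    for doc in top_docs:
        tf = Counter(doc)
        for term, f in tf.items():
            idf = math.log(n / df[term])   # safe: top_docs come from collection
            scores[term] += (f / len(doc)) * idf
    return scores.most_common()

collection = [
    "query expansion improves recall".split(),
    "query logs record user clicks".split(),
    "expansion terms come from top documents".split(),
]
top_docs = collection[:2]   # pretend these two ranked highest
print(prf_scores(top_docs, collection))
```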

  22. Feature Generation and Ranking • Query Language Modeling • Generate a probability distribution over query terms • Best features have the highest probabilities • Approaches: • Mixture Model • Builds a model from the top-ranked documents as a whole • Extracts the part most distinct from the overall document collection • Uses an expectation-maximization algorithm to get probabilities • Relevance Model • Builds a model from the top-ranked documents individually • Documents further down the list have less and less influence on word probabilities
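
A compressed sketch of the second variant, the relevance model: each word is scored by p(w|R) ≈ Σ_d p(w|d)·p(q|d), so lower-ranked documents (smaller p(q|d)) contribute less, as the slide says. The smoothing and toy data are my assumptions:

```python
# Relevance model: p(w|R) ~ sum over top docs of p(w|d) * p(query|d).
from collections import Counter

def p_term_given_doc(term, doc, vocab_size, mu=1.0):
    # Smoothed unigram language model of a single document.
    tf = Counter(doc)
    return (tf[term] + mu) / (len(doc) + mu * vocab_size)

def relevance_model(query, top_docs):
    vocab = {t for d in top_docs for t in d}
    scores = Counter()
    for doc in top_docs:
        p_q = 1.0
        for q in query:                       # p(query|doc), term independence
            p_q *= p_term_given_doc(q, doc, len(vocab))
        for term in vocab:
            scores[term] += p_term_given_doc(term, doc, len(vocab)) * p_q
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()}

docs = [
    "automatic query expansion adds related terms".split(),
    "expansion terms improve retrieval".split(),
]
rm = relevance_model(["query", "expansion"], docs)
print(sorted(rm.items(), key=lambda kv: -kv[1])[:5])
```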

  23. Feature Selection • Select top features for query expansion • Features are not evaluated further, simply selected based on rank • Limited number of features selected for rapid processing • Using all features is not necessarily better than using only a few • Typically select 10-30 features • Could select features only within a certain rank range

  24. Query Reformulation • Modify the original query by adding the selected features to it and perform the search • Approaches: • Query reweighting: assign a weight to each feature using a weighting formula • Simply add selected features to the original query without weighting
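
Tying the last two steps together, here is a sketch that selects the top-k ranked features (k = 10, inside the typical 10-30 range from the previous slide) and folds them into the query with scaled-down weights. The 0.5 scaling factor and the score normalization are arbitrary illustrative choices, not a prescribed formula:

```python
# Feature selection + query reweighting: keep the k best-ranked features
# and append them to the query with weights below the original terms.
def reformulate(query_terms, ranked_features, k=10, expansion_weight=0.5):
    new_query = {t: 1.0 for t in query_terms}   # original terms: weight 1.0
    top = ranked_features[:k]                   # selection purely by rank
    if not top:
        return new_query
    best = top[0][1]                            # normalize by the top score
    for feature, score in top:
        if feature not in new_query:
            new_query[feature] = expansion_weight * score / best
    return new_query

ranked = [("compact disc", 8.2), ("album", 5.1), ("music", 4.7)]
print(reformulate(["cd", "burn"], ranked, k=10))
```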

  25. Classification of AQE Techniques • Linguistic Analysis • Corpus-Specific Global Techniques • Query-Specific Local Techniques • Search Log Analysis • Web Data

  26. Linguistic Analysis • Focus on morphological, lexical, syntactic and semantic relationships for expansion • Analysis based on dictionaries, thesauri, or sources such as WordNet • Sensitive to word sense ambiguity • Examples: • Stemming algorithm: reduce terms to root form • Ontology browsing: paraphrase user’s query in context • Syntactic analysis: extract relations between terms to find features that appear in related relations

  27. Corpus-Specific Global Techniques • Corpus: large structured set of texts • Analyze contents of a full database to find features used similarly • Find correlations between term pairs at document level or within paragraphs or sentences • Data-driven • May not have a simple interpretation
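
A toy version of such a corpus-wide analysis: document-level co-occurrence counts turned into a Jaccard-style term-term similarity (Jaccard is my choice; the slide leaves the correlation measure open):

```python
# Corpus-wide term-term association: how often two terms occur in the
# same document, normalized by how often either occurs (Jaccard).
from collections import defaultdict

def doc_sets(corpus):
    postings = defaultdict(set)
    for i, doc in enumerate(corpus):
        for term in set(doc):
            postings[term].add(i)      # which documents each term appears in
    return postings

def jaccard(postings, a, b):
    da, db = postings[a], postings[b]
    return len(da & db) / len(da | db) if da | db else 0.0

corpus = [
    "apple iphone release".split(),
    "apple pie recipe".split(),
    "iphone screen repair".split(),
]
p = doc_sets(corpus)
print(jaccard(p, "apple", "iphone"))   # 1 shared doc out of 3 -> 0.33
```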

  28. Query-Specific Local Techniques • Utilize local context provided by the query • Make use of top-ranked documents • Examples: • Analysis of feature distribution difference • Model-based AQE • Top-document preprocessing • Removes irrelevant features before using term-ranking function

  29. Search Log Analysis • Mines users’ search logs for implicit query associations • Search logs contain queries and URLs of clicked pages • Example: user searches “apple,” find a past query “iPhone” • May encode implicit relevance feedback instead of retrieval feedback • Examples: • Extract features from past related queries that are related to the current query • Use top documents from past related queries • Extract terms directly from visited documents
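
A sketch of the first technique above: two queries are considered related when users clicked the same URL for both, and the related queries' terms become expansion candidates. The log schema and the frequency-based scoring are assumptions:

```python
# Search log analysis: find past queries that share clicked URLs with the
# current query, and harvest their terms as expansion candidates.
from collections import Counter, defaultdict

log = [  # (query, clicked_url) pairs; toy data
    ("apple", "http://apple.com/iphone"),
    ("iphone 13 review", "http://apple.com/iphone"),
    ("apple pie", "http://recipes.example/pie"),
]

def expansion_from_logs(current_query, log):
    clicks = defaultdict(set)
    for q, url in log:
        clicks[q].add(url)
    my_urls = clicks.get(current_query, set())
    cands = Counter()
    for q, urls in clicks.items():
        if q != current_query and urls & my_urls:   # shared clicked URL
            cands.update(t for t in q.split() if t not in current_query.split())
    return cands.most_common()

print(expansion_from_logs("apple", log))   # e.g. iphone, 13, review
```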

  30. Web Data • Use of anchor texts to generate features • Anchor text: visible, clickable text of a hyperlink • Most anchor texts are similar to real user queries • Anchor texts typically describe contents of the document • Issues: • “click here” • One-word/short anchor texts • Use of Wikipedia documents and hyperlinks
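
Finally, a sketch of the anchor-text idea: collect the anchors pointing at a document, drop generic ones such as "click here" (the issue noted above), and rank the rest by frequency as an expansion source. The stoplist and data are illustrative:

```python
# Anchor-text mining: anchors that point at a document often describe it,
# so frequent non-generic anchors make good expansion features.
from collections import Counter

GENERIC = {"click here", "here", "link", "more", "read more"}  # assumed stoplist

def anchor_features(anchors):
    cleaned = (a.lower().strip() for a in anchors)
    return Counter(a for a in cleaned if a not in GENERIC).most_common()

anchors_to_doc = ["Query Expansion survey", "click here", "query expansion",
                  "AQE survey", "query expansion", "here"]
print(anchor_features(anchors_to_doc))   # "query expansion" ranks first
```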

  31. Critical Issues • Parameter Setting • Efficiency • Usability

  32. Parameter Setting • Rely on several parameters • Number of pseudo-relevant documents • Number of expansion terms • Variables within term-ranking and weighting functions • Could use fixed values for key parameters • Fixed values may not work well for all queries

  33. Efficiency • Need to deliver real-time results to a large volume of users • Balancing processing time with quality results • Good AQE is computationally expensive • Expanded queries take longer to execute

  34. Usability • Implementation is hidden from users • Users may receive high-ranked documents that contain none of the query terms • Query terms substituted entirely by synonyms • Irrelevant document contains the query terms only in its anchor text • Could increase user control • Show the user the features used • Allow the user to revise the expanded query • AQE is better suited to non-expert users

  35. Conclusions • No perfect solution for the vocabulary problem • Overcomes users’ reluctance or inability to refine their own queries • Variety of implementations • Efficiency is gradually increasing • Near the end of its experimental stage • Not yet ready to be implemented in large-scale IR systems such as web search engines

  36. Questions?

  37. References
Carpineto, C. and Romano, G. 2012. A survey of automatic query expansion in information retrieval. ACM Comput. Surv. 44, 1, Article 1 (January 2012), 50 pages.
Furnas, G. W., Landauer, T. K., Gomez, L. M., and Dumais, S. T. 1987. The vocabulary problem in human-system communication. Comm. ACM 30, 11, 964-971.
Mitra, M., Singhal, A., and Buckley, C. 1998. Improving automatic query expansion. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, 206-214.
Vechtomova, O. 2009. Query expansion for information retrieval. In Encyclopedia of Database Systems, L. Liu and M. T. Özsu, Eds., Springer, 2254-2257.
