Automatic Web Query Classification using Labeled and Unlabeled Training Data
Steven M. Beitzel, Eric C. Jensen, David D. Lewis, Abdur Chowdhury, Aleksander Kolcz, Ophir Frieder
Information Retrieval Laboratory, Department of Computer Science, http://ir.iit.edu


Presentation Transcript


  1. Automatic Web Query Classification using Labeled and Unlabeled Training Data Steven M. Beitzel, Eric C. Jensen, David D. Lewis, Abdur Chowdhury, Aleksander Kolcz, Ophir Frieder Information Retrieval Laboratory Department of Computer Science http://ir.iit.edu

  2. Overview • Introduction: Query Classification • Motivations & Prior Work • Our approach • Results & Analysis • Conclusions • Future Work

  3. Introduction • Goal: develop a system that can label a query with its relevant topical categories • Automatic classifiers help a search service decide when to use specialized databases • Specialized databases may provide tailored, topic-specific results

  4. Problem Statement • A query contains more information than just its terms • Search is not just about finding relevant documents – users have: • Target task • Target topic • General information need • Queries are simply an attempt to express all of the above in a couple of terms (average of 2.2 per query)

  5. Popular Web Queries

  6. Problem Statement (2) • Current search systems focus mainly on the terms in the queries • No focus on extracting topic information • Manual query classification is expensive • Does not take advantage of the large supply of unlabeled data available in query logs

  7. Prior Work • Much early text classification was document-based • Query Classification: • Manual (human assessors) • Automatic • Clustering techniques – don't help identify topics • Supervised learning via retrieved documents • Still expensive – the retrieved documents must themselves be classified

  8. Query frequency vs. % of Weekly Query Stream

  9. Automatic Query Classification Motivations • Web queries have very few features • Achieving and sustaining classification recall is difficult • Web query logs provide a rich source of unlabeled data; we must harness this data to aid classification

  10. Our Approach • Combine three methods of classification: • Labeled Data Approaches: • Manual (exact-match lookup using labeled queries) • Supervised Learning (Perceptron trained with labeled queries) • Unlabeled Data Approach: • Unsupervised Rule Learning with unlabeled data from a large query log • Disjunctive Combination of the above

  11. Approach #1 - Exact-Match to Manual Classifications • A team of editors manually classified approximately 1M popular queries into 18 topical categories • General topics (sports, health, entertainment) • Mostly popular queries • Pros • Expect high precision from exact-match lookup • Cons • Expensive to maintain • Very low classification recall • Not robust to changes in the query stream
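A minimal sketch of the exact-match lookup in Approach #1, assuming a simple in-memory dictionary; the example queries and category labels below are placeholders, not the actual editorial data.

```python
# Sketch of Approach #1: exact-match lookup against manually labeled queries.
# The table here is a toy stand-in for the ~1M editor-labeled queries.
from typing import Dict, Set

LABELED_QUERIES: Dict[str, Set[str]] = {
    "yellowstone national park": {"TRAVEL", "PLACES"},
    "britney spears": {"ENTERTAINMENT"},
}

def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so lookups ignore case and spacing."""
    return " ".join(query.lower().split())

def exact_match_classify(query: str) -> Set[str]:
    """Return the manually assigned categories, or an empty set if the query is unseen."""
    return LABELED_QUERIES.get(normalize(query), set())
```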

  12. Approach #2 - Supervised Learning with a Perceptron • Goal: achieve higher levels of recall than human efforts • Supervised Learning • Used heavily in text classification • Bayes, Perceptron, SVM, etc… • Use manually classified queries to train a classifier

  13. Supervised Learning Experiments • Perceptron-based machine learning system • Separate collections for training and testing: • Training: • Nearly 1M web queries manually classified by a team of editors • Grouped non-exclusively into 18 topical categories, and trained each category independently • Testing: • 20,000 web queries classified by human assessors • ~30% agreement with classifications in training set
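A rough sketch of this setup using scikit-learn's Perceptron with bag-of-words query-term features and one independently trained binary classifier per category, as described above; the queries, labels, and category names are toy placeholders rather than the actual training collection.

```python
# Sketch: one binary Perceptron per category over bag-of-words query features,
# each category trained independently. Toy data only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Perceptron

train_queries = ["yankees spring training", "cheap flights to rome", "lyrics hey jude"]
train_labels = {  # non-exclusive category membership, one label vector per category
    "SPORTS":        [1, 0, 0],
    "TRAVEL":        [0, 1, 0],
    "ENTERTAINMENT": [0, 0, 1],
}

vectorizer = CountVectorizer()               # query terms as features
X = vectorizer.fit_transform(train_queries)

classifiers = {}
for category, y in train_labels.items():
    clf = Perceptron(max_iter=1000)
    clf.fit(X, y)                            # independent binary classifier per category
    classifiers[category] = clf

def perceptron_classify(query: str):
    x = vectorizer.transform([query])
    return {c for c, clf in classifiers.items() if clf.predict(x)[0] == 1}
```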

  14. Supervised Learning Exp. (2) • Test queries were submitted to the trained learner for evaluation • Calculated true-positive and false-positive rates over all feature sets for each class • Plotted classifier performance using Detection-Error Tradeoff (DET) curves
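A small sketch of how such DET points can be computed for one category from classifier decision scores, assuming scikit-learn 0.24 or later (which provides det_curve); the scores and labels are illustrative.

```python
# Sketch: Detection-Error Tradeoff points (false-positive vs. false-negative rate)
# for one category, from illustrative classifier scores.
import numpy as np
from sklearn.metrics import det_curve

y_true   = np.array([1, 1, 0, 0, 1, 0])                 # human assessor labels for one category
y_scores = np.array([0.9, 0.4, 0.35, 0.8, 0.65, 0.1])   # classifier decision scores

fpr, fnr, thresholds = det_curve(y_true, y_scores)
for fp, fn, t in zip(fpr, fnr, thresholds):
    print(f"threshold={t:.2f}  false-positive rate={fp:.2f}  false-negative rate={fn:.2f}")
```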

  15. Supervised Learning DET Curves

  16. Supervised Learning Analysis • The DET curves for each class show a clear trend: • To lower the rate of false-negatives, substantial false-positives must be tolerated • This is a clear illustration of the query classification “recall problem” that has been identified in prior studies

  17. Approach #3 - Unsupervised Rule Learning Using Unlabeled Data • We have query logs with very large numbers of queries • Must take advantage of millions of users showing us how they look for things • Build on manual efforts • Manual efforts tell us some words from each category • Find words associated with each category • Learn how people look for topics, e.g. “what words do users use to find musicians or lawn-mowers”

  18. Unsupervised Rule Learning Using Unlabeled Data (2) • Find good predictors of a class based on how users look for queries related to certain categories • Use those words to predict new members of each category • Apply the notion of selectional preferences to find weighted rules for classifying queries automatically

  19. Selectional Preferences: Step 1 • Obtain a large log of unlabeled web queries • View each query as pairs of lexical units: • <head, tail> • Only applicable to queries of 2+ terms • Queries with n terms form n-1 pairs • Example: “directions to ICDM” forms two pairs: • <directions, to ICDM> and <directions to, ICDM> • Count and record the frequency of each pair
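A brief sketch of Step 1 as described above: each n-term query is split into its n-1 <head, tail> pairs and pair frequencies are counted; the miniature query list stands in for a real query log.

```python
# Sketch of Step 1: split each query into its n-1 <head, tail> pairs
# and count pair frequencies across the (unlabeled) query log.
from collections import Counter
from typing import List, Tuple

def head_tail_pairs(query: str) -> List[Tuple[str, str]]:
    """All <head, tail> splits of a query; single-term queries yield none."""
    terms = query.lower().split()
    return [(" ".join(terms[:i]), " ".join(terms[i:])) for i in range(1, len(terms))]

pair_counts: Counter = Counter()
for q in ["directions to icdm", "directions to chicago", "cheap flights"]:  # stand-in for a query log
    pair_counts.update(head_tail_pairs(q))

# e.g. ("directions", "to icdm") and ("directions to", "icdm") are each counted once
```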

  20. Selectional Preferences: Step 2 • Obtain a set of manually labeled queries • Check the heads and tails of each pair to see if they appear in the manually labeled set • Convert each <head, tail> pair into: • <head, CATEGORY> (forward preference) • <CATEGORY, tail> (backward preference) • Discard <head, tail> pairs for which there is no category information at all • Sum counts for all contributing pairs and normalize by the number of contributing pairs
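A sketch of Step 2 under the same toy setup: <head, tail> counts are converted into forward (<head, CATEGORY>) and backward (<CATEGORY, tail>) preference counts using the labeled queries, pairs with no category information are discarded, and sums are normalized by the number of contributing pairs. The labeled lookup, the pair counts, and the exact normalization are illustrative assumptions.

```python
# Sketch of Step 2 (toy data): convert <head, tail> counts into forward and
# backward preference counts via the manually labeled queries, then normalize
# by the number of contributing pairs. The normalization shown is one plausible
# reading of the slide, not necessarily the authors' exact procedure.
from collections import Counter, defaultdict

pair_counts = Counter({("directions to", "icdm"): 3,
                       ("cheap flights to", "chicago"): 5})    # from Step 1
labeled = {"icdm": {"TECH"}, "chicago": {"PLACES", "TRAVEL"}}  # lexical unit -> categories

forward  = Counter()             # (head, CATEGORY) counts
backward = Counter()             # (CATEGORY, tail) counts
contributing = defaultdict(int)  # how many pairs contributed to each lexical unit

for (head, tail), count in pair_counts.items():
    cats_tail = labeled.get(tail, set())
    cats_head = labeled.get(head, set())
    if not cats_tail and not cats_head:
        continue                                # no category information at all: discard
    for cat in cats_tail:                       # forward preference <head, CATEGORY>
        forward[(head, cat)] += count
    for cat in cats_head:                       # backward preference <CATEGORY, tail>
        backward[(cat, tail)] += count
    if cats_tail:
        contributing[head] += 1
    if cats_head:
        contributing[tail] += 1

forward_freq = {(h, c): n / contributing[h] for (h, c), n in forward.items()}
```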

  21. Selectional Preferences: Step 2

  22. Selectional Preferences: Step 3 • Score each preference using Resnik's Selectional Preference Strength formula: • S(x) = Σ_u P(u|x) · log( P(u|x) / P(u) ) • Where u represents a category, as found in Step 2 • S(x) is the sum of the weighted scores for every category u associated with a given lexical unit x
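A sketch of Step 3 with toy numbers: the per-category terms P(u|x) · log(P(u|x)/P(u)) act as the weighted scores attached to each rule, and their sum is the preference strength S(x). The counts and category names are made up for illustration.

```python
# Sketch of Step 3: Resnik-style selectional preference strength,
#   S(x) = sum_u P(u|x) * log( P(u|x) / P(u) ),
# computed from toy category counts for a lexical unit x.
import math
from typing import Dict

def weighted_scores(cat_counts_for_x: Dict[str, int],
                    overall_cat_counts: Dict[str, int]) -> Dict[str, float]:
    """Per-category term P(u|x) * log(P(u|x)/P(u)) for a lexical unit x."""
    total_x = sum(cat_counts_for_x.values())
    total = sum(overall_cat_counts.values())
    scores = {}
    for u, c in cat_counts_for_x.items():
        p_u_given_x = c / total_x
        p_u = overall_cat_counts[u] / total
        scores[u] = p_u_given_x * math.log(p_u_given_x / p_u)
    return scores

# Toy numbers for the unit x = "harley chicks with"
per_cat = weighted_scores({"PORN": 8, "AUTOS": 1},
                          {"PORN": 50, "AUTOS": 400, "TRAVEL": 550})
strength = sum(per_cat.values())   # S(x): summed over the categories seen with x
```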

  23. Selectional Preferences: Step 4 • Use the mined preferences and weighted scores from Steps 2 and 3 to assign classifications to unseen queries
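A sketch of Step 4: an unseen query is split into its <head, tail> pairs, each split is matched against the mined forward and backward rules, and rule weights are accumulated per category. The rules and the score threshold below are illustrative (compare the examples on the next slide).

```python
# Sketch of Step 4: classify an unseen query by matching its <head, tail> splits
# against mined rules and accumulating rule weights per category.
from collections import defaultdict

forward_rules  = {"harley all stainless": {"AUTOS": 3.448, "SHOPPING": 0.021}}  # head -> category weights
backward_rules = {"getaway bargain": {"PLACES": 0.877, "TRAVEL": 0.862}}        # tail -> category weights

def sp_rule_classify(query: str, threshold: float = 0.5):
    terms = query.lower().split()
    scores = defaultdict(float)
    for i in range(1, len(terms)):
        head, tail = " ".join(terms[:i]), " ".join(terms[i:])
        for cat, w in forward_rules.get(head, {}).items():   # forward rule: head predicts the tail's topic
            scores[cat] += w
        for cat, w in backward_rules.get(tail, {}).items():  # backward rule: tail predicts the head's topic
            scores[cat] += w
    return {cat for cat, s in scores.items() if s >= threshold}

print(sp_rule_classify("harley all stainless exhaust"))   # -> {'AUTOS'}
```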

  24. Selectional Preference Rule Examples • Forward Rules: • harlem club X: ENT->0.722, PLACES->0.378, TRAVEL->1.531 • harley all stainless X: AUTOS->3.448, SHOPPING->0.021 • harley chicks with X: PORN->5.681 • Backward Rules: • X gets hot wont start: AUTOS->2.049, PLACES->0.594 • X getaway bargain: PLACES->0.877, SHOPPING->0.047, TRAVEL->0.862 • X getaway bargain hotel and airfare: PLACES->0.594, TRAVEL->2.057

  25. Combined Approach • Each approach exploits different qualities of our query stream • A natural next step is to combine them • How similar are the approaches?
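A one-function sketch of the disjunctive combination: a query receives a category if any component classifier assigns it. The components are passed in as callables returning category sets (for example, the exact-match, perceptron, and rule-based sketches above).

```python
# Sketch of the disjunctive combination: union of the category sets
# produced by each component classifier.
from typing import Callable, Iterable, Set

def combined_classify(query: str,
                      classifiers: Iterable[Callable[[str], Set[str]]]) -> Set[str]:
    categories: Set[str] = set()
    for clf in classifiers:
        categories |= clf(query)   # a single positive vote is enough (disjunction)
    return categories
```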

  26. Evaluation Metrics • Classification Precision: • #true positives / (#true positives + #false positives) • Classification Recall: • #true positives / (#true positives + #false negatives) • F-Measure: • F_beta = (1 + beta^2) * Precision * Recall / (beta^2 * Precision + Recall) • Higher values of beta put more emphasis on recall
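A tiny sketch of these metrics computed from raw counts; the numbers are made up for illustration.

```python
# Precision, recall, and F-measure from raw counts (illustrative numbers).
def precision_recall_f(tp: int, fp: int, fn: int, beta: float = 1.0):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f_beta

print(precision_recall_f(tp=80, fp=20, fn=40, beta=1.0))  # balanced F1
print(precision_recall_f(tp=80, fp=20, fn=40, beta=2.0))  # beta > 1 emphasizes recall
```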

  27. Effectiveness of each approach

  28. Performance of Perceptron vs. SP Rules at varying levels of Beta

  29. Conclusions • Our system successfully makes use of large amounts of unlabeled data • The Selectional Preference rules allow us to classify a significantly larger portion of the query stream than manual efforts alone • Excellent potential for further improvements

  30. Future Work • Expand available classification features per query • Mine web query logs for related terms and patterns • More intelligent combination methods • Learned combination functions • Voting algorithms • Utilize external sources of information • Patterns and trends from query log analysis • Topical ontology lookups • Experiment using other datasets (KDD Cup) • Use automatic query classification to improve effectiveness and efficiency in a production search system
