
Research in IR at MS





Presentation Transcript


  1. Research in IR at MS • Microsoft Research (http://research.microsoft.com) • Adaptive Systems and Interaction - IR/UI • Machine Learning and Applied Statistics • Data Mining • Natural Language Processing • Collaboration and Education • MSR Cambridge • MSR Beijing • Microsoft Product Groups … many IR-related

  2. IR Research at MSR • Improvements to representations and matching algorithms • New directions • User modeling • Domain modeling • Interactive interfaces • An example: Text categorization

  3. Traditional View of IR • [Diagram: Query Words -> Ranked List]

  4. What’s Missing? • [Diagram: Query Words -> Ranked List, plus the missing pieces: User Modeling, Domain Modeling, Information Use]

  5. Domain/Obj Modeling • Not all objects are equal … potentially a big win • A priori importance • Information use (“readware”; collaborative filtering) • Inter-object relationships • Link structure / hypertext • Subject categories - e.g., text categorization, text clustering • Metadata • E.g., reliability, recency, cost -> combining

  6. User/Task Modeling • Demographics • Task -- What’s the user’s goal? • e.g., Lumiere • Short- and long-term content interests • e.g., Implicit queries • Interest model = f(content_similarity, time, interest) • e.g., Letizia, WebWatcher, Fab
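
The slide's interest model, f(content_similarity, time, interest), can be sketched as below. The exponential recency decay, the multiplicative combination, and the half-life parameter are illustrative assumptions, not details from the talk.

```python
# Hypothetical sketch of an interest model: items that are similar to
# the user's interests, recent, and high-interest score higher.
def interest_score(content_similarity, age_seconds, interest_level,
                   half_life=3600.0):
    """Combine similarity, recency, and interest into one score."""
    recency = 0.5 ** (age_seconds / half_life)  # halves every half_life
    return content_similarity * recency * interest_level
```

With these assumptions, an item one half-life old scores half as much as a fresh one with the same similarity and interest level.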

  7. Information Use • Beyond the batch IR model (“query -> results”) • Consider the larger task context • Knowledge management • Human attention is the critical resource • Techniques for automatic information summarization, organization, discovery, filtering, mining, etc. • Advanced UIs and interaction techniques • E.g., tight coupling of search and browsing to support information management

  8. The Broader View of IR • [Diagram: Query Words -> Ranked List, augmented with User Modeling, Domain Modeling, Information Use]

  9. Text Categorization: Road Map • Text Categorization Basics • Inductive Learning Methods • Reuters Results • Future Plans

  10. Text Categorization • Text Categorization: assign objects to one or more of a predefined set of categories using text features • Example Applications: • Sorting new items into existing structures (e.g., general ontologies, file folders, spam vs. not) • Information routing/filtering/push • Topic-specific processing • Structured browsing & search

  11. Text Categorization - Methods • Human classifiers (e.g., Dewey, LCSH, MeSH, Yahoo!, CyberPatrol) • Hand-crafted knowledge engineered systems (e.g., CONSTRUE) • Inductive learning of classifiers • (Semi-) automatic classification

  12. Classifiers • A classifier is a function: f(x) = conf(class) • from attribute vectors, x=(x1,x2, … xd) • to target values, confidence(class) • Example classifiers • if (interest AND rate) OR (quarterly), then confidence(interest) = 0.9 • confidence(interest) = 0.3*interest + 0.4*rate + 0.1*quarterly
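
The two example classifiers on the slide can be sketched over binary word-presence features; representing a document as a word->value dictionary is an illustrative assumption.

```python
# The slide's two example classifiers for the "interest" category.

def rule_classifier(x):
    """Rule-based: if (interest AND rate) OR quarterly, conf = 0.9."""
    if (x.get("interest", 0) and x.get("rate", 0)) or x.get("quarterly", 0):
        return 0.9
    return 0.0

def linear_classifier(x):
    """Linear: weighted sum of the feature values."""
    return (0.3 * x.get("interest", 0)
            + 0.4 * x.get("rate", 0)
            + 0.1 * x.get("quarterly", 0))
```

For a document containing "interest" and "rate", the rule fires (0.9) and the linear classifier returns 0.3 + 0.4 = 0.7.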

  13. Inductive Learning Methods • Supervised learning to build classifiers • Labeled training data (i.e., examples of each category) • “Learn” classifier • Test effectiveness on new instances • Statistical guarantees of effectiveness • Classifiers are easy to construct and update; require only subject knowledge • Customizable for an individual’s categories and tasks

  14. Text Representation • Vector space representation of documents word1 word2 word3 word4 ... Doc 1 = <1, 0, 3, 0, … > Doc 2 = <0, 1, 0, 0, … > Doc 3 = <0, 0, 0, 5, … > • Text can have 10^7 or more dimensions e.g., 100k web pages had 2.5 million distinct words • Mostly use: simple words, binary weights, subset of features
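
The slide's document vectors can be sketched as binary bag-of-words vectors over a shared vocabulary; the toy documents and whitespace tokenization are illustrative assumptions.

```python
# Minimal sketch: binary word-presence vectors over a shared vocabulary.
docs = ["rates rose again", "the bank cut rates", "wheat harvest news"]
vocab = sorted({w for d in docs for w in d.split()})  # one dimension per word

def to_binary_vector(text):
    """Map a document to a 0/1 vector aligned with vocab."""
    words = set(text.split())
    return [1 if w in words else 0 for w in vocab]

vectors = [to_binary_vector(d) for d in docs]
```

Real collections make these vectors extremely sparse, which is why the talk restricts attention to a small subset of features.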

  15. Feature Selection • Word distribution - remove very frequent and very infrequent words, following Zipf’s law: frequency * rank ~ constant (equivalently, log(frequency) + log(rank) ~ constant) • [Plot: word frequency (f) vs. words by rank order (r)]
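
Dropping the very frequent and very infrequent words can be sketched with document-frequency thresholds; the cutoff values below are illustrative assumptions, not the talk's settings.

```python
from collections import Counter

def prune_vocabulary(docs, min_df=2, max_df_ratio=0.5):
    """Keep words that appear in at least min_df documents but in at
    most max_df_ratio of all documents (Zipf-motivated pruning)."""
    df = Counter(w for d in docs for w in set(d.split()))
    n = len(docs)
    return {w for w, count in df.items()
            if count >= min_df and count / n <= max_df_ratio}
```

Very common words (e.g., "the") fail the upper cutoff; words seen only once fail the lower one.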

  16. Feature Selection (cont’d) • Fit to categories - use mutual information to select features which best discriminate category vs. not • Designer features - domain specific, including non-text features • Use 100-500 best features from this process as input to learning methods
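
The mutual-information criterion in the bullets above can be sketched for a single binary word feature against a binary category label; the parallel-list input format is an illustrative assumption.

```python
import math

def mutual_information(word_in_doc, in_category):
    """MI (in bits) between a 0/1 word feature and a 0/1 category
    label, given parallel per-document lists. Higher MI means the
    word better discriminates category vs. not."""
    n = len(word_in_doc)
    mi = 0.0
    for w in (0, 1):
        for c in (0, 1):
            joint = sum(1 for x, y in zip(word_in_doc, in_category)
                        if x == w and y == c) / n
            pw = sum(1 for x in word_in_doc if x == w) / n
            pc = sum(1 for y in in_category if y == c) / n
            if joint > 0:
                mi += joint * math.log2(joint / (pw * pc))
    return mi
```

A feature perfectly aligned with the category yields 1 bit; an independent feature yields 0. Ranking features by MI and keeping the top 100-500 matches the selection step described above.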

  17. Inductive Learning Methods • Find Similar • Decision Trees • Naïve Bayes • Bayes Nets • Support Vector Machines (SVMs) • All support: • “Probabilities” - graded membership; comparability across categories • Adaptive - over time; across individuals

  18. Find Similar • Aka, relevance feedback • Rocchio • Classifier parameters are a weighted combination of weights in positive and negative examples -- “centroid” • New items classified by similarity (e.g., inner product) to this centroid • Uses all features, idf weights
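
The Rocchio-style centroid classifier can be sketched as below: the class profile is the positive centroid minus a down-weighted negative centroid, and new items are scored by inner product. The alpha/beta weights are illustrative assumptions.

```python
# Sketch of a "Find Similar" (Rocchio) classifier over dense vectors.

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n, d = len(vectors), len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(d)]

def rocchio_profile(pos, neg, alpha=1.0, beta=0.5):
    """Class profile: weighted positive centroid minus negative centroid."""
    cp, cn = centroid(pos), centroid(neg)
    return [alpha * p - beta * q for p, q in zip(cp, cn)]

def score(profile, x):
    """Inner-product similarity of a new item to the profile."""
    return sum(w * xi for w, xi in zip(profile, x))
```

Items resembling the positive examples score higher than items resembling the negatives.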

  19. Decision Trees • Learn a sequence of tests on features, typically using top-down, greedy search • Binary (yes/no) or continuous decisions • [Example tree: test f1, then f7; leaves assign P(class) = 0.2, 0.6, 0.9]
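
The learned tree is just a sequence of feature tests ending in class probabilities. A sketch of the slide's example tree follows; the assignment of the leaf probabilities (0.2, 0.6, 0.9) to particular branches is an assumption, since the original diagram did not survive extraction.

```python
def tree_classify(x):
    """Walk the example tree: test f1, then f7; return P(class).
    Branch-to-leaf assignment is hypothetical."""
    if x["f1"]:
        if x["f7"]:
            return 0.9
        return 0.6
    return 0.2
```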

  20. Naïve Bayes • Aka, binary independence model • Maximize: Pr(Class | Features) • Assume features are conditionally independent given the class - math easy; surprisingly effective • [Diagram: class node C with independent feature nodes x1 … xn]
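
A minimal sketch of the model above, for binary features (a Bernoulli naive Bayes with add-one smoothing; the smoothing choice is an assumption):

```python
import math

def train_nb(X, y):
    """X: list of binary feature vectors; y: list of 0/1 labels.
    Returns, per class, (log prior, smoothed P(feature=1 | class))."""
    model = {}
    for c in (0, 1):
        Xc = [x for x, label in zip(X, y) if label == c]
        prior = len(Xc) / len(X)
        probs = [(sum(x[i] for x in Xc) + 1) / (len(Xc) + 2)
                 for i in range(len(X[0]))]
        model[c] = (math.log(prior), probs)
    return model

def log_posterior(model, x, c):
    """Unnormalized log Pr(class=c | features=x) assuming conditional
    independence: log prior plus per-feature log likelihoods."""
    lp, probs = model[c]
    for xi, p in zip(x, probs):
        lp += math.log(p if xi else 1 - p)
    return lp
```

Classification picks the class with the larger log posterior; the independence assumption makes this a simple sum.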

  21. Bayes Nets • Maximize: Pr(Class | Features) • Does not assume independence of features - dependency modeling • [Diagram: class node C with dependent feature nodes x1 … xn]

  22. Support Vector Machines • Vapnik (1979) • Binary classifiers that maximize the margin • Find a hyperplane w·x + b = 0 separating positive and negative examples; the closest training points are the support vectors • Optimization: maximize the margin subject to classifying the training examples correctly • Classify new items using: f(x) = sign(w·x + b)
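
The decision rule can also be written in dual form, where only the support vectors (nonzero multipliers) contribute; the toy multipliers and vectors below are illustrative, not a trained model.

```python
# Sketch of the SVM decision function in dual form:
# f(x) = sum_i alpha_i * y_i * (sv_i . x) + b, sign gives the class.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def svm_decision(x, support_vectors, alphas, labels, b):
    """Signed distance-like score; positive means the positive class."""
    s = sum(a * y * dot(sv, x)
            for a, y, sv in zip(alphas, labels, support_vectors))
    return s + b
```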

  23. Support Vector Machines • Extendable to: • Non-separable problems (Cortes & Vapnik, 1995) • Non-linear classifiers (Boser et al., 1992) • Good generalization performance • Handwriting recognition (LeCun et al.) • Face detection (Osuna et al.) • Text classification (Joachims) • Platt’s Sequential Minimal Optimization (SMO) algorithm very efficient

  24. SVMs: Platt’s SMO Algorithm • SMO is a very fast way to train SVMs • SMO works by analytically solving the smallest possible QP sub-problem (two Lagrange multipliers at a time) • Substantially faster than chunking • Training time scales somewhere between linear and quadratic in the training set size

  25. Text Classification Process • Pipeline: text files -> Index Server -> word counts per file -> feature selection -> data set -> learning method (Find Similar, decision tree, Naïve Bayes, Bayes nets, support vector machine) -> test classifier

  26. Reuters Data Set (21578 - ModApte split) • 9603 training articles; 3299 test articles • Example “interest” article: 2-APR-1987 06:35:19.50 west-germany b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052 FRANKFURT, March 2 - The Bundesbank left credit policies unchanged after today's regular meeting of its council, a spokesman said in answer to enquiries. The West German discount rate remains at 3.0 pct, and the Lombard emergency financing rate at 5.0 pct. REUTER • Average article is about 200 words long

  27. Reuters Data Set (21578 - ModApte split) • 118 categories • An article can be in more than one category • Learn 118 binary category distinctions • Most common categories (#train, #test) • Earn (2877, 1087) • Acquisitions (1650, 719) • Money-fx (538, 179) • Grain (433, 149) • Crude (389, 189) • Trade (369, 119) • Interest (347, 131) • Ship (197, 89) • Wheat (212, 71) • Corn (182, 56)

  28. Category: “interest” • [Example decision tree branching on features rate, rate.t, lending, prime, discount, pct, year, each tested = 0 or = 1]

  29. Category: Interest • Example SVM feature weights: • -0.71 dlrs • -0.35 world • -0.33 sees • -0.25 year • -0.24 group • -0.24 dlr • -0.24 january • 0.70 prime • 0.67 rate • 0.63 interest • 0.60 rates • 0.46 discount • 0.43 bundesbank • 0.43 baker

  30. Accuracy Scores • Based on a contingency table: a = items correctly assigned to the category, b = items incorrectly assigned, c = items incorrectly rejected, n = total items • Effectiveness measures for binary classification • error rate = (b+c)/n • accuracy = 1 - error rate • precision (P) = a/(a+b) • recall (R) = a/(a+c) • break-even = (P+R)/2 • F measure = 2PR/(P+R)
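
The slide's measures can be computed directly from the contingency-table cells:

```python
def metrics(a, b, c, n):
    """a = true positives, b = false positives, c = false negatives,
    n = total items; returns the slide's effectiveness measures."""
    error_rate = (b + c) / n
    precision = a / (a + b)
    recall = a / (a + c)
    return {
        "error_rate": error_rate,
        "accuracy": 1 - error_rate,
        "precision": precision,
        "recall": recall,
        "break_even": (precision + recall) / 2,
        "f_measure": 2 * precision * recall / (precision + recall),
    }
```

Note that when precision equals recall, the break-even point and the F measure coincide.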

  31. Reuters - Accuracy ((R+P)/2) • [Chart: break-even accuracy per learning method and category] • Recall: % labeled in category among those stories that are really in category • Precision: % really in category among those stories labeled in category • Break Even: (Recall + Precision) / 2

  32. ROC for Category - Grain

  33. ROC for Category - Earn

  34. ROC for Category - Acq

  35. ROC for Category - Money-Fx

  36. ROC for Category - Grain

  37. ROC for Category - Crude

  38. ROC for Category - Trade

  39. ROC for Category - Interest

  40. ROC for Category - Ship

  41. ROC for Category - Wheat

  42. ROC for Category - Corn

  43. SVM: Dumais et al. vs. Joachims • Top 10 Reuters categories, micro-averaged break-even • [Table: break-even comparison]

  44. Reuters - Sample Size (SVM) • [Chart: classification accuracy vs. training-set size, for sample set 1 and sample set 2]

  45. Reuters - Other Experiments • Simple words vs. NLP-derived phrases • NLP-derived phrases • factoids (April_8, Salomon_Brothers_International) • multi-word dictionary entries (New_York, interest_rate) • noun phrases (first_quarter, modest_growth) • No advantage for Find Similar, Naïve Bayes, SVM • Vary number of features • Binary vs. 0/1/2 features • No advantage of 0/1/2 for Decision Trees, SVM

  46. Number of Features - NLP

  47. Reuters Summary • Accurate classifiers can be learned automatically from training examples • Linear SVMs are efficient and provide very good classification accuracy • Best results for this test collection • Widely applicable, flexible, and adaptable representations

  48. Text Classification Horizon • Other applications • OHSUMED, TREC, spam vs. not-spam, Web • Text representation enhancements • Use of hierarchical category structure • UI for semi-automatic classification • Dynamic interests

  49. More Information • General stuff - http://research.microsoft.com/~sdumais • SMO - http://research.microsoft.com/~jplatt
