
Supervised IR

This presentation addresses the problem of obtaining relevance feedback automatically with minimal effort in supervised information retrieval. It covers incremental user feedback, initial fixed training sets, and user tags for relevant and irrelevant documents, then turns to pattern matching between query and document vectors, Rocchio relevance feedback, and probabilistic IR and text classification, where the goal is to compute the probability of relevance given a representation of a document.

Presentation Transcript


  1. Supervised IR
Refined computation of the relevant set, based on:
• Incremental user feedback (relevance feedback), OR
• Initial fixed training sets
• User tags documents as relevant/irrelevant
• Routing problem → initial class
Big open question: how do we obtain feedback automatically with minimal effort?

  2. “Unsupervised” IR: Predicting Relevance Without User Feedback
Pattern matching: query vector/set against document vector/set.
Co-occurrence of terms is assumed to be an indication of relevance.
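The slide does not commit to a particular matching function; the following is a minimal sketch, assuming cosine similarity over raw term-frequency vectors as the co-occurrence score (names are illustrative):

```python
import math
from collections import Counter

def cosine_score(query_terms, doc_terms):
    """Score a document against a query by the cosine of their
    term-frequency vectors: shared (co-occurring) terms raise the score."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q.keys() & d.keys())
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

# A document sharing more query terms scores higher.
print(cosine_score(["collie", "show"], ["collie", "groom", "show", "pup"]))
print(cosine_score(["collie", "show"], ["compiler", "lex", "yacc"]))
```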

  3. Relevance Feedback
Incremental feedback in the vector model (see Rocchio, 1971):

$$Q_1 = \alpha\, Q_0 + \frac{\beta}{N_{Rel}} \sum_{i=1}^{N_{Rel}} R_i \;-\; \frac{\gamma}{N_{Irrel}} \sum_{i=1}^{N_{Irrel}} S_i$$

where Q_0 is the initial query, the R_i are the known relevant documents, and the S_i the known irrelevant ones.
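The update translates directly into code; a sketch in Python/NumPy, where the weights alpha=1.0, beta=0.75, gamma=0.15 are conventional defaults rather than values from the slide:

```python
import numpy as np

def rocchio_update(q0, rel, irrel, alpha=1.0, beta=0.75, gamma=0.15):
    """One Rocchio step: move the query vector toward the mean of the
    relevant document vectors and away from the mean of the irrelevant
    ones. `rel` and `irrel` are lists of document vectors."""
    q1 = alpha * np.asarray(q0, dtype=float)
    if len(rel):
        q1 += (beta / len(rel)) * np.sum(rel, axis=0)
    if len(irrel):
        q1 -= (gamma / len(irrel)) * np.sum(irrel, axis=0)
    return q1

# The query drifts toward the judged-relevant document.
print(rocchio_update([1.0, 0.0], rel=[[0.9, 0.1]], irrel=[[0.0, 1.0]]))
```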

  4. Probabilistic IR / Text Classification
Document retrieval:
If P(Rel|Doc_i) > P(Irrel|Doc_i), then Doc_i is “relevant”; else Doc_i is “not relevant”.
OR, equivalently: if P(Rel|Doc_i) / P(Irrel|Doc_i) > 1, then Doc_i is “relevant”. The magnitude of the ratio indicates our confidence.

  5. Text Classification
Select Class_j such that P(Class_j | Doc_i) is maximized, where Doc_i is, e.g., an incoming mail message and the classes are Bowling, DogBreeding, etc.
Alternatively, select the Class_j for which P(Class_j | Doc_i) / P(NOT Class_j | Doc_i) is maximized.

  6. General Formulation
Compute: P(Class_j | Evidence)
• Class_j is one of a fixed set of K *disjoint* classes: an item can't be a member of more than one.
• The evidence is a set of feature values (e.g. words in a language, medical test results, etc.).
Uses:
• Rel/Irrel → document retrieval
• Work/Bowling/DogBreeding → text classification/routing
• Spanish/Italian/English → language ID
• Sick/Well → medical diagnosis
• Herpes/NotHerpes → medical diagnosis

  7. Feature Set
Goal: to compute P(Class_j | Doc_i), the abstract formulation.
In practice we compute P(Class_j | representation of Doc_i): the probability given a representation of Doc_i.
One representation: P(Class_j | W_1, W_2, …, W_k), a vector of the words in the document.
OR, more generally: P(Class_j | F_1, F_2, …, F_k), a list of document features.

  8. Problem: Sparse Feature Set
In medical diagnosis it is feasible to consider all possible feature combinations: P(Evidence|Class_i) can be computed directly from the data for every evidence pattern, e.g. P(T,T,F | Herpes) = 12 / (total Herpes cases).
In IR there are far too many combinations of feature values to estimate the class distribution over every combination.

  9. Bayes Rule

$$P(Class_i \mid Evidence) = \frac{P(Evidence \mid Class_i) \times P(Class_i)}{P(Evidence)}$$

P(Class_i | Evidence) is the posterior probability of the class given the evidence; P(Class_i) is the prior probability of the class.
Uninformative prior: P(Class_i) = 1 / (total # of classes).
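The rule is mechanical to apply once likelihoods and priors are in hand; a minimal sketch over disjoint classes (the example numbers are hypothetical):

```python
def posteriors(likelihoods, priors):
    """Bayes rule over disjoint classes:
    P(c|e) = P(e|c) * P(c) / P(e), with P(e) = sum over classes of
    P(e|c) * P(c), so the posteriors sum to 1."""
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(joint.values())  # P(Evidence)
    return {c: p / evidence for c, p in joint.items()}

# Uninformative prior: 1 / (total # of classes); likelihoods made up.
print(posteriors({"Herpes": 0.30, "NotHerpes": 0.05},
                 {"Herpes": 0.5, "NotHerpes": 0.5}))
```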

  10. Example in Medical Diagnosis
A single blood test.

$$P(Herpes \mid Evidence) = \frac{P(Evidence \mid Herpes) \times P(Herpes)}{P(Evidence)}$$

P(Herpes | Evidence): probability of herpes given a test result.
P(Evidence | Herpes): probability of the test result if the patient has herpes.
P(Herpes): prior probability of the patient having herpes.
P(Evidence): probability of a (positive/negative) test result.

P(Herpes | Positive Test) = .9    P(Not Herpes | Positive Test) = .1
P(Herpes | Negative Test) = .001  P(Not Herpes | Negative Test) = .999

  11. Evidence Decomposition
P(Class_j | Evidence), where the evidence is a given combination of feature values. The same decomposition applies in medical diagnosis and in text classification/routing.

  12. Example in Text Classification / Routing
Is an incoming mail message about dog breeding (collie, groom, show)?

$$P(Class_i \mid Evidence) = \frac{P(Evidence \mid Class_i) \times P(Class_i)}{P(Evidence)}$$

P(Class_i): the prior chance that mail is about dog breeding.
P(Evidence | Class_i): observed directly through training data (estimation sketched below).

Training data:
Class 1 (Dog Breeding): fur, collie, collie, groom, show, poodle, sire, breed, akita, pup
Class 2 (Work): compiler, X86, C++, lex, YACC, computer, Java
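P(Evidence|Class) is estimated from those training word lists; a sketch, with add-one smoothing added on top of the slide so a word unseen in one class does not get probability zero:

```python
from collections import Counter

training = {
    "DogBreeding": "fur collie collie groom show poodle sire breed akita pup".split(),
    "Work": "compiler x86 c++ lex yacc computer java".split(),
}
counts = {c: Counter(words) for c, words in training.items()}
vocab = {w for words in training.values() for w in words}

def p_word_given_class(word, cls):
    """Add-one-smoothed estimate of P(word | class):
    (count(word, class) + 1) / (total words in class + vocabulary size)."""
    return (counts[cls][word] + 1) / (sum(counts[cls].values()) + len(vocab))

print(p_word_given_class("collie", "DogBreeding"))  # relatively high
print(p_word_given_class("collie", "Work"))         # small but non-zero
```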

  13. Probabilistic IR
Target/goal, for a given model of relevance to the user's needs:
• Document retrieval
• Document routing / classification

  14. Multiple Binary Splits vs. Flat K-Way Classification
[Diagram: on the left, a question Q1 resolved by repeated binary splits (Q1 into A and B, A into A1 and A2, B into B1 and B2); on the right, the same question Q1 resolved by one flat K-way split into classes A through G.]

  15. Likelihood Ratios

$$P(Class_1 \mid Evidence) = \frac{P(Evidence \mid Class_1) \times P(Class_1)}{P(Evidence)}$$

$$P(Class_2 \mid Evidence) = \frac{P(Evidence \mid Class_2) \times P(Class_2)}{P(Evidence)}$$

Dividing the two, P(Evidence) cancels:

$$\frac{P(Class_1 \mid Evidence)}{P(Class_2 \mid Evidence)} = \frac{P(Evidence \mid Class_1)}{P(Evidence \mid Class_2)} \times \frac{P(Class_1)}{P(Class_2)}$$

  16. Likelihood Ratios
1. Binary classifications: in document retrieval the options are Rel and Irrel, so compare P(Rel|Doc_i) / P(Irrel|Doc_i).
2. Binary routing task (2 possible classes): P(Work|Doc_i) / P(Personal|Doc_i).
3. A K-way classification can be treated as a series of binary classifications: compute the ratio P(Class_j|Doc_i) / P(NOT Class_j|Doc_i) for all classes, and choose the class j for which this ratio is greatest (see the sketch below).
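Option 3 in code form; a sketch that assumes some `posterior(cls, doc)` function returning P(Class_j | Doc) is available (that function and its signature are hypothetical):

```python
def best_class(doc, classes, posterior):
    """Treat K-way classification as K binary splits: for each class j,
    form P(Class_j|doc) / P(NOT Class_j|doc) and keep the largest ratio."""
    def odds(cls):
        p = posterior(cls, doc)
        return p / (1.0 - p) if p < 1.0 else float("inf")
    return max(classes, key=odds)
```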

  17. Independence Assumption
Evidence = w_1, w_2, w_3, …, w_k

$$\frac{P(Class_1 \mid Evidence)}{P(Class_2 \mid Evidence)} = \frac{P(Class_1)}{P(Class_2)} \times \frac{P(Evidence \mid Class_1)}{P(Evidence \mid Class_2)}$$

Assuming the words occur independently given the class:

$$\frac{P(Class_1 \mid Evidence)}{P(Class_2 \mid Evidence)} = \frac{P(Class_1)}{P(Class_2)} \times \prod_{i=1}^{k} \frac{P(w_i \mid Class_1)}{P(w_i \mid Class_2)}$$

Final odds = initial odds × the product of the per-word likelihood ratios.
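The product form translates directly; a sketch that works in log space (an implementation detail not on the slide) so long documents do not underflow:

```python
import math

def log_odds(words, prior1, prior2, p_w_given_1, p_w_given_2):
    """Final odds = initial odds * product of per-word likelihood ratios
    P(w|Class1) / P(w|Class2), accumulated as logs. Returns the log-odds;
    a value > 0 favours Class1. The probability tables are assumed to be
    smoothed, so no ratio is 0 or undefined."""
    result = math.log(prior1 / prior2)
    for w in words:
        result += math.log(p_w_given_1[w] / p_w_given_2[w])
    return result
```

Fed with the smoothed estimates from the slide-12 sketch, this computes the worked example on the next slide without the divide-by-zero problem noted there.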

  18. Using the Independence Assumption

$$\frac{P(Personal \mid Akita, pup, fur, show)}{P(Work \mid Akita, pup, fur, show)} = \frac{P(Personal)}{P(Work)} \times \frac{P(Akita \mid Personal)}{P(Akita \mid Work)} \times \frac{P(pup \mid Personal)}{P(pup \mid Work)} \times \frac{P(fur \mid Personal)}{P(fur \mid Work)} \times \frac{P(show \mid Personal)}{P(show \mid Work)}$$

With counts from the training data:

$$\frac{P(Personal \mid Evidence)}{P(Work \mid Evidence)} = \frac{1}{9} \times \frac{27}{2} \times \frac{18}{0} \times \frac{36}{2} \times \frac{3}{5}$$

The posterior odds are the product of the likelihood ratios for each word. Note the zero count: a count of 0 must be replaced by some constant a_1 (i.e., the estimates must be smoothed), or the ratio is undefined.

  19. Note: Ratios Are (Partially) Self-Weighting
e.g. P(The | Personal) / P(The | Work) = (5137/100,000) / (5238/100,000) ≈ 1: common words contribute almost nothing to the odds.
e.g. P(Akita | Personal) / P(Akita | Work) = (37/100,000) / (1/100,000) ≈ 37: discriminative words dominate.

  20. Bayesian Model Applications
Authorship identification (Hamilton vs. Madison, the disputed Federalist Papers):

$$\frac{P(Hamilton \mid Evidence)}{P(Madison \mid Evidence)} = \frac{P(Evidence \mid Hamilton)}{P(Evidence \mid Madison)} \times \frac{P(Hamilton)}{P(Madison)}$$

Sense disambiguation (“tank” as container vs. vehicle):

$$\frac{P(\text{Tank-Container} \mid Evidence)}{P(\text{Tank-Vehicle} \mid Evidence)} = \frac{P(Evidence \mid \text{Tank-Container})}{P(Evidence \mid \text{Tank-Vehicle})} \times \frac{P(\text{Tank-Container})}{P(\text{Tank-Vehicle})}$$

  21. Dependence Trees (Hierarchical Bayesian Models)
Decompose the joint probability along a tree; arrows give the direction of dependence. For the tree w1 → w2; w2 → w3, w4, w5; w5 → w6:

$$P(w_1, w_2, \ldots, w_6) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2)\, P(w_4 \mid w_2)\, P(w_5 \mid w_2)\, P(w_6 \mid w_5)$$

[Diagram: tree rooted at w1 with child w2; w2 has children w3, w4, w5; w5 has child w6.]
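The tree factorisation is mechanical to evaluate; a sketch where the tree is a child-to-parent map and `cpt` holds one conditional table per node (the names and table layout are illustrative):

```python
def tree_joint(values, parents, cpt):
    """Joint probability under a dependence tree: multiply P(node | parent)
    over all nodes. The root has parent None, and its table entry
    cpt[root][(None, value)] holds the marginal P(root = value)."""
    p = 1.0
    for node, parent in parents.items():
        parent_value = values[parent] if parent is not None else None
        p *= cpt[node][(parent_value, values[node])]
    return p

# Slide 21's tree: w1 -> w2; w2 -> w3, w4, w5; w5 -> w6.
parents = {"w1": None, "w2": "w1", "w3": "w2",
           "w4": "w2", "w5": "w2", "w6": "w5"}
```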

  22. Full Probability Decomposition: Overview
• Full decomposition (the exact chain rule)
• Using a simplifying (Markov) assumption: assume each word is conditioned only on the probability of the previous word
• Assumption of full independence
• Graphical models: partial decomposition into dependence trees
The next slide gives each decomposition explicitly.

  23. Full Probability Decomposition
Exact chain rule:

$$P(\vec{w}) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2, w_1)\, P(w_4 \mid w_3, w_2, w_1) \cdots$$

Using simplifying (Markov) assumptions (assume P(word) is conditional only upon the previous word):

$$P(\vec{w}) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2)\, P(w_4 \mid w_3) \cdots$$

Assumption of full independence:

$$P(\vec{w}) = P(w_1)\, P(w_2)\, P(w_3)\, P(w_4) \cdots$$

Graphical models: partial decomposition into dependence trees, e.g.:

$$P(\vec{w}) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3)\, P(w_4 \mid w_1, w_3) \cdots$$
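Under the Markov assumption the whole-sequence probability needs only a unigram table for the first word and a bigram table for the rest; a sketch with hypothetical toy tables:

```python
import math

def markov_logprob(words, p_unigram, p_bigram):
    """log P(w1..wn) = log P(w1) + sum of log P(w_i | w_{i-1}):
    the chain rule truncated to one word of history."""
    logp = math.log(p_unigram[words[0]])
    for prev, cur in zip(words, words[1:]):
        logp += math.log(p_bigram[(prev, cur)])
    return logp

# Toy two-word vocabulary; each bigram row sums to 1.
p_uni = {"collie": 0.6, "show": 0.4}
p_bi = {("collie", "collie"): 0.5, ("collie", "show"): 0.5,
        ("show", "collie"): 0.3, ("show", "show"): 0.7}
print(markov_logprob(["collie", "show", "show"], p_uni, p_bi))
```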
