Presentation Transcript


  1. Information Retrieval: Search Engine Technology (5&6) http://tangra.si.umich.edu/clair/ir09 Prof. Dragomir R. Radev, radev@umich.edu

  2. Final projects • Two formats: • A software system that performs a specific search-engine related task. We will create a web page with all such code and make it available to the IR community. • A research experiment documented in the form of a paper. Look at the proceedings of the SIGIR, WWW, or ACL conferences for a sample format. I will encourage the authors of the most successful papers to consider submitting them to one of the IR-related conferences. • Deliverables: • System (code + documentation + examples) or Paper (+ code, data) • Poster (to be presented in class) • Web page that describes the project.

  3. SET/IR – W/S 2009 … 9. Text classification Naïve Bayesian classifiers Decision trees …

  4. Introduction • Text classification: assigning documents to predefined categories: topics, languages, users • A given set of classes C • Given x, determine its class in C • Hierarchical vs. flat • Overlapping (soft) vs non-overlapping (hard)

  5. Introduction • Ideas: manual classification using rules (e.g., Columbia AND University → Education; Columbia AND “South Carolina” → Geography) • Popular techniques: generative (knn, Naïve Bayes) vs. discriminative (SVM, regression) • Generative: model the joint probability p(x,y) and use Bayesian prediction to compute p(y|x) • Discriminative: model p(y|x) directly.

  6. Bayes formula Full probability
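The slide's two formulas were images; in standard notation, Bayes' rule and the law of full (total) probability are:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)},
\qquad
P(B) = \sum_{i} P(B \mid A_i)\, P(A_i)
```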

  7. Example (performance enhancing drug) • Drug (D) with values y/n • Test (T) with values +/- • P(D=y) = 0.001 • P(T=+|D=y) = 0.8 • P(T=+|D=n) = 0.01 • Given: athlete tests positive • P(D=y|T=+) = P(T=+|D=y)P(D=y) / (P(T=+|D=y)P(D=y) + P(T=+|D=n)P(D=n)) = (0.8×0.001) / (0.8×0.001 + 0.01×0.999) = 0.074
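A quick check of the slide's arithmetic, as a minimal Python sketch (variable names are illustrative):

```python
# Bayes' rule for the drug-test example; numbers are the slide's.
p_d = 0.001          # P(D=y): prior probability the athlete uses the drug
p_t_given_d = 0.8    # P(T=+|D=y): test sensitivity
p_t_given_nd = 0.01  # P(T=+|D=n): false-positive rate

# Full probability of a positive test, then the posterior.
p_t = p_t_given_d * p_d + p_t_given_nd * (1 - p_d)
posterior = p_t_given_d * p_d / p_t
print(f"P(D=y|T=+) = {posterior:.3f}")  # -> 0.074
```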

  8. Naïve Bayesian classifiers • Naïve Bayesian classifier • Assuming statistical independence • Features = words (or phrases) typically
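The classifier's formula on the slide was an image; with the naïve independence assumption over features w_1, …, w_n, the standard decision rule is:

```latex
\hat{c} = \arg\max_{c \in C} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```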

  9. Example • p(well)=0.9, p(cold)=0.05, p(allergy)=0.05 • p(sneeze|well)=0.1 • p(sneeze|cold)=0.9 • p(sneeze|allergy)=0.9 • p(cough|well)=0.1 • p(cough|cold)=0.8 • p(cough|allergy)=0.7 • p(fever|well)=0.01 • p(fever|cold)=0.7 • p(fever|allergy)=0.4 Example from Ray Mooney

  10. Example (cont’d) • Features: sneeze, cough, no fever • P(well|e) = (.9)(.1)(.1)(.99) / p(e) = 0.0089/p(e) • P(cold|e) = (.05)(.9)(.8)(.3) / p(e) = 0.01/p(e) • P(allergy|e) = (.05)(.9)(.7)(.6) / p(e) = 0.019/p(e) • p(e) = 0.0089 + 0.01 + 0.019 = 0.0379 • P(well|e) = .23 • P(cold|e) = .26 • P(allergy|e) = .50 Example from Ray Mooney
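The same computation as a short Python sketch (the slide rounds intermediate products, so the ratios can differ in the last digit):

```python
# Naive Bayes scores for evidence e = (sneeze, cough, no fever).
priors = {"well": 0.9, "cold": 0.05, "allergy": 0.05}
p_sym = {  # P(symptom | class) from slide 9
    "well":    {"sneeze": 0.1, "cough": 0.1, "fever": 0.01},
    "cold":    {"sneeze": 0.9, "cough": 0.8, "fever": 0.7},
    "allergy": {"sneeze": 0.9, "cough": 0.7, "fever": 0.4},
}

# sneeze and cough are present; fever is absent, so use 1 - P(fever | c).
scores = {c: priors[c] * p_sym[c]["sneeze"] * p_sym[c]["cough"]
             * (1 - p_sym[c]["fever"]) for c in priors}
p_e = sum(scores.values())  # ~0.0379
for c, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"P({c}|e) = {s / p_e:.2f}")  # allergy ~0.49, cold ~0.28, well ~0.23
```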

  11. Issues with NB • Where do we get the prior values – use maximum likelihood estimation: P(c_i) = N_i/N • Same for the conditionals – these are based on a multinomial generator and the MLE estimator is P(t_i|c_j) = T_ji / Σ_i T_ji • Smoothing is needed – why? (a single unseen word would otherwise zero out the whole product) • Laplace smoothing: (T_ji + 1) / Σ_i (T_ji + 1) • Implementation: how to avoid floating point underflow (see the sketch below)
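A minimal sketch of the two implementation points above (Laplace-smoothed estimates, log-space scoring against underflow); the function names are illustrative, not from the slides:

```python
import math

def laplace_estimate(count, total, vocab_size):
    """Add-one (Laplace) smoothing: (T_ji + 1) / (sum_i T_ji + |V|).
    Without smoothing, any unseen word gives probability 0 and
    zeroes out the entire product for that class."""
    return (count + 1) / (total + vocab_size)

def nb_log_score(prior, cond_probs):
    """Sum log-probabilities instead of multiplying raw probabilities,
    so long documents cannot underflow to 0.0."""
    return math.log(prior) + sum(math.log(p) for p in cond_probs)
```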

  12. Spam recognition
Return-Path: <ig_esq@rediffmail.com>
X-Sieve: CMU Sieve 2.2
From: "Ibrahim Galadima" <ig_esq@rediffmail.com>
Reply-To: galadima_esq@netpiper.com
To: webmaster@aclweb.org
Date: Tue, 14 Jan 2003 21:06:26 -0800
Subject: Gooday

DEAR SIR FUNDS FOR INVESTMENTS THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD NO PREVIOUS CORRESPONDENCE WITH YOU I AM THE CHAIRMAN TENDER BOARD OF INDEPENDENT NATIONAL ELECTORAL COMMISSION INEC I GOT YOUR CONTACT IN THE COURSE OF MY SEARCH FOR A RELIABLE PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTION INVOLVING THE ! TRANSFER OF FUND VALUED AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED STATES DOLLARS US$20M TO A SAFE FOREIGN ACCOUNT THE ABOVE FUND IN QUESTION IS NOT CONNECTED WITH ARMS, DRUGS OR MONEY LAUNDERING IT IS A PRODUCT OF OVER INVOICED CONTRACT AWARDED IN 1999 BY INEC TO A

  13. SpamAssassin • http://spamassassin.apache.org/ • http://spamassassin.apache.org/tests_3_1_x.html

  14. Feature selection: The χ² test • For a term t: • C = class, I_t = feature (term presence) • Testing for independence: P(C=0, I_t=0) should equal P(C=0) P(I_t=0) • P(C=0) = (k_00 + k_01)/n • P(C=1) = 1 − P(C=0) = (k_10 + k_11)/n • P(I_t=0) = (k_00 + k_10)/n • P(I_t=1) = 1 − P(I_t=0) = (k_01 + k_11)/n
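The statistic itself was an image on the slide; for the 2×2 table of counts k_ij above, the standard form compares the observed counts with those expected under independence:

```latex
\chi^2 = \sum_{i \in \{0,1\}} \sum_{j \in \{0,1\}}
  \frac{\bigl(k_{ij} - n\,P(C{=}i)\,P(I_t{=}j)\bigr)^2}{n\,P(C{=}i)\,P(I_t{=}j)}
```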

  15. Feature selection: The χ² test • High values of χ² indicate lower belief in independence. • In practice, compute χ² for all words and pick the top k among them.

  16. Feature selection: mutual information • No document length scaling is needed • Documents are assumed to be generated according to the multinomial model • Measures amount of information: if the distribution is the same as the background distribution, then MI=0 • X = word; Y = class
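The formula on the slide was an image; with X the word's occurrence and Y the class, the standard definition of mutual information is:

```latex
I(X;Y) = \sum_{x}\sum_{y} P(x,y)\,\log \frac{P(x,y)}{P(x)\,P(y)}
```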

  17. Well-known datasets • 20 newsgroups: http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/ • Reuters-21578: http://www.daviddlewis.com/resources/testcollections/reuters21578/ (categories: grain, acquisitions, corn, crude, wheat, trade…) • WebKB: http://www-2.cs.cmu.edu/~webkb/ (classes: course, student, faculty, staff, project, dept, other) • NB performance (2000), per class in that order: P = 26, 43, 18, 6, 13, 2, 94; R = 83, 75, 77, 9, 73, 100, 35

  18. Evaluation of text classification • Macroaveraging – compute the metric per class, then average over classes • Microaveraging – pool the decisions for all classes into a single contingency table, then compute the metric

  19. Vector space classification [Figure: documents from topic1 and topic2 plotted in a 2-D feature space with axes x1 and x2]

  20. Decision surfaces [Figure: a decision surface separating topic1 from topic2 in the (x1, x2) plane]

  21. Decision trees [Figure: decision-tree boundaries separating topic1 from topic2 in the (x1, x2) plane]

  22. Classification using decision trees • Expected information need: I(s_1, s_2, …, s_m) = −Σ_{i=1}^{m} p_i log_2(p_i), where p_i = s_i/s • s = number of data samples • m = number of classes

  23. Decision tree induction • I(s_1, s_2) = I(9, 5) = −(9/14) log_2(9/14) − (5/14) log_2(5/14) = 0.940

  24. Entropy and information gain • E(A) = Σ_j [(s_1j + … + s_mj)/s] · I(s_1j, …, s_mj) • Entropy = expected information based on the partitioning into subsets by attribute A • Gain(A) = I(s_1, s_2, …, s_m) − E(A)

  25. Entropy • Age <= 30: s_11 = 2, s_21 = 3, I(s_11, s_21) = 0.971 • Age in 31..40: s_12 = 4, s_22 = 0, I(s_12, s_22) = 0 • Age > 40: s_13 = 3, s_23 = 2, I(s_13, s_23) = 0.971

  26. Entropy (cont’d) • E(age) = 5/14 · I(s_11, s_21) + 4/14 · I(s_12, s_22) + 5/14 · I(s_13, s_23) = 0.694 • Gain(age) = I(s_1, s_2) − E(age) = 0.246 • Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit) = 0.048
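The entropy and gain computations from slides 22–26, as a minimal Python sketch:

```python
import math

def info(*counts):
    """Expected information I(s1, ..., sm) = -sum p_i log2 p_i, p_i = s_i/s."""
    s = sum(counts)
    return -sum(c / s * math.log2(c / s) for c in counts if c > 0)

i_all = info(9, 5)  # 0.940, as on slide 23
e_age = (5/14) * info(2, 3) + (4/14) * info(4, 0) + (5/14) * info(3, 2)
print(f"Gain(age) = {i_all - e_age:.3f}")  # ~0.247 (0.246 on the slide, which rounds)
```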

  27. Final decision tree [Figure: tree rooted at age. age <= 30 → test student (no → no, yes → yes); age 31..40 → yes; age > 40 → test credit (excellent → no, fair → yes)]

  28. Other techniques • Bayesian classifiers • X: age <=30, income = medium, student = yes, credit = fair • P(yes) = 9/14 = 0.643 • P(no) = 5/14 = 0.357

  29. Example • P(age <= 30 | yes) = 2/9 = 0.222 • P(age <= 30 | no) = 3/5 = 0.600 • P(income = medium | yes) = 4/9 = 0.444 • P(income = medium | no) = 2/5 = 0.400 • P(student = yes | yes) = 6/9 = 0.667 • P(student = yes | no) = 1/5 = 0.200 • P(credit = fair | yes) = 6/9 = 0.667 • P(credit = fair | no) = 2/5 = 0.400

  30. Example (cont’d) • P (X | yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044 • P (X | no) = 0.600 x 0.400 x 0.200 x 0.400 = 0.019 • P (X | yes) P (yes) = 0.044 x 0.643 = 0.028 • P (X | no) P (no) = 0.019 x 0.357 = 0.007 • Answer: yes/no?
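Making the closing question concrete, as a one-line comparison of the slide's two scores:

```python
score_yes = 0.044 * 0.643  # P(X|yes) P(yes) = 0.028
score_no = 0.019 * 0.357   # P(X|no)  P(no)  = 0.007
print("yes" if score_yes > score_no else "no")  # -> "yes"
```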

  31. SET/IR – W/S 2009 … 10. Linear classifiers Kernel methods Support vector machines …

  32. Linear boundary [Figure: a linear boundary separating topic1 from topic2 in the (x1, x2) plane]

  33. Vector space classifiers • Using centroids • Boundary = line that is equidistant from two centroids

  34. Generative models: knn • Assign each element to the closest cluster • K-nearest neighbors • Very easy to program • Tessellation; nonlinearity • Issues: choosing k, b? • Demo: • http://www-2.cs.cmu.edu/~zhuxj/courseproject/knndemo/KNN.html
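A minimal sketch of the kNN idea above (majority vote among the k nearest labeled points); the function name and toy data are illustrative:

```python
import math
from collections import Counter

def knn_classify(query, labeled_points, k=3):
    """Label a query vector by majority vote among its k nearest neighbors."""
    by_dist = sorted(labeled_points,
                     key=lambda pl: math.dist(query, pl[0]))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

# Toy usage: two 2-D classes.
data = [((0.0, 0.1), "topic1"), ((0.2, 0.0), "topic1"), ((1.0, 1.1), "topic2"),
        ((0.9, 1.0), "topic2"), ((1.1, 0.9), "topic2")]
print(knn_classify((1.0, 1.0), data, k=3))  # -> "topic2"
```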

  35. Linear separators • Two-dimensional case: the line w_1x_1 + w_2x_2 = b is the linear separator; w_1x_1 + w_2x_2 > b for the positive class • In n-dimensional spaces: w · x = Σ_{i=1}^{n} w_i x_i = b

  36. Example 1 [Figure: topic1 and topic2 points in the (x1, x2) plane]

  37. Example 2 • Classifier for “interest” in Reuters-21578 • b=0 • If the document is “rate discount dlr world”, its score will be 0.67×1 + 0.46×1 + (−0.71)×1 + (−0.35)×1 = 0.07 > 0 Example from MSR

  38. Example: perceptron algorithm [The slide's input/algorithm/output boxes were images; a sketch of the standard update rule follows.]
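A minimal sketch of the standard perceptron update (the slide's own pseudocode is not reproduced here):

```python
def perceptron(examples, epochs=100):
    """Train weights w and bias b on (x, y) pairs with y in {-1, +1}."""
    n = len(examples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        updated = False
        for x, y in examples:
            # Misclassified (or on the boundary): nudge w, b toward x.
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                updated = True
        if not updated:  # converged: every training point classified correctly
            break
    return w, b

# Toy usage: AND-like, linearly separable data.
data = [((0.0, 0.0), -1), ((0.0, 1.0), -1), ((1.0, 0.0), -1), ((1.0, 1.0), +1)]
w, b = perceptron(data)
```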

  39. [Slide from Chris Bishop]

  40. Linear classifiers • What is the major shortcoming of a perceptron? • How to determine the dimensionality of the separator? • Bias-variance tradeoff (example) • How to deal with multiple classes? • Any-of: build a separate binary classifier for each class • One-of: harder (J hyperplanes do not divide R^M into J regions); instead, use class complements and scoring

  41. Support vector machines • Introduced by Vapnik in the early 90s.

  42. Issues with SVM • Soft margins (inseparability) • Kernels – non-linearity
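The formulation on the slides was an image; the standard soft-margin primal, where the slack variables ξ_i handle inseparability and C trades margin width against training error, is:

```latex
\min_{w,\, b,\, \xi} \;\; \tfrac{1}{2}\|w\|^2 + C \sum_{i} \xi_i
\quad \text{s.t.} \quad y_i \,(w \cdot x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0
```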

  43. The kernel idea [Figure: the same data before the mapping (not linearly separable) and after (linearly separable)]

  44. Example (mapping to a higher-dimensional space)
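One standard instance of such a mapping (the slide's own example is not reproduced): in two dimensions,

```latex
\phi(x_1, x_2) = \left(x_1^2,\; \sqrt{2}\, x_1 x_2,\; x_2^2\right),
\qquad
\phi(x) \cdot \phi(z) = (x \cdot z)^2
```

so the dot product in the mapped 3-D space equals a quadratic function of the original dot product, which is exactly what the kernel trick on the next slide exploits.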

  45. The kernel trick • Polynomial kernel: K(x, z) = (x · z + c)^d • Sigmoid kernel: K(x, z) = tanh(κ (x · z) + θ) • RBF kernel: K(x, z) = exp(−‖x − z‖² / (2σ²)) • Many other kernels are useful for IR: e.g., string kernels, subsequence kernels, tree kernels, etc.

  46. SVM (Cont’d) • Evaluation: • SVM > knn > decision tree > NB • Implementation • Quadratic optimization • Use toolkit (e.g., Thorsten Joachims’s svmlight)

  47. Semi-supervised learning • EM • Co-training • Graph-based

  48. Exploiting Hyperlinks – Co-training • Each document instance has two alternate views (Blum and Mitchell 1998) • terms in the document, x1 • terms in the hyperlinks that point to the document, x2 • Each view is sufficient to determine the class of the instance • The labeling function that classifies examples is the same whether applied to x1 or x2 • x1 and x2 are conditionally independent, given the class [Slide from Pierre Baldi]

  49. Co-training Algorithm • Labeled data are used to infer two Naïve Bayes classifiers, one for each view • Each classifier will • examine unlabeled data • pick the most confidently predicted positive and negative examples • add these to the labeled examples • Classifiers are now retrained on the augmented set of labeled examples [Slide from Pierre Baldi]

More Related