
Text Classification with Limited Labeled Data



  1. Text Classification with Limited Labeled Data Andrew McCallum mccallum@cs.cmu.edu Just Research (formerly JPRC) Center for Automated Learning and Discovery, Carnegie Mellon University Joint work with Kamal Nigam, Tom Mitchell, Sebastian Thrun, Roni Rosenfeld, Andrew Ng, Larry Wasserman, Kristie Seymore, and Jason Rennie

  2. The Task: Document Classification (also “Document Categorization”, “Routing” or “Tagging”): automatically placing documents in their correct categories. [Figure: a set of categories (Crops, Botany, Evolution, Irrigation, Magnetism, Relativity), each with a few labeled training documents such as “water grating ditch farm tractor...”, “corn wheat silo farm grow...”, “corn tulips splicing grow...”, and “selection mutation Darwin Galapagos DNA...”; a test document “grow corn tractor…” whose correct category is Crops.]

  3. A Probabilistic Approach to Document Classification. Pick the most probable class given the evidence, where c is a class (like “Crops”), d is a document (like “grow corn tractor...”), and w_{d_i} is the ith word in d (like “corn”). Apply Bayes rule, then the “Naïve Bayes” assumptions: (1) one mixture component per class, and (2) word independence given the class.
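
In standard notation, with the symbols as defined above (a reconstruction of the slide's equations):

    \[
    \Pr(c \mid d) \;=\; \frac{\Pr(c)\,\Pr(d \mid c)}{\Pr(d)}
    \qquad \text{(Bayes rule)}
    \]
    \[
    \Pr(c \mid d) \;\propto\; \Pr(c) \prod_{i=1}^{|d|} \Pr(w_{d_i} \mid c)
    \qquad \text{(naïve Bayes)}
    \]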

  4. Parameter Estimation in Naïve Bayes. Maximum a posteriori estimate of Pr(w|c) with a Dirichlet prior (AKA “Laplace smoothing”):

    \[
    \hat{\Pr}(w \mid c) \;=\; \frac{1 + \sum_{d \in c} N(w,d)}{|V| + \sum_{w'} \sum_{d \in c} N(w',d)}
    \]

  where N(w,d) is the number of times word w occurs in document d and |V| is the vocabulary size. Two ways to improve this method: (A) make less restrictive assumptions about the model; (B) get better estimates of the model parameters, i.e. Pr(w|c).
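
A minimal sketch of this classifier in Python, assuming the multinomial naïve Bayes model and the Laplace-smoothed estimate above; the function names and data layout are illustrative, not from the talk:

    # Multinomial naive Bayes with Laplace smoothing (slides 3-4).
    # Illustrative sketch: names and data layout are not from the talk.
    import math
    from collections import Counter, defaultdict

    def train(docs):
        """docs: list of (list_of_words, class_label) pairs."""
        class_docs = Counter()              # documents per class -> Pr(c)
        word_counts = defaultdict(Counter)  # per-class word counts N(w,c)
        vocab = set()
        for words, c in docs:
            class_docs[c] += 1
            word_counts[c].update(words)
            vocab.update(words)
        n_docs = sum(class_docs.values())
        prior = {c: n / n_docs for c, n in class_docs.items()}
        # MAP estimate with a Dirichlet prior (Laplace smoothing):
        #   Pr(w|c) = (1 + N(w,c)) / (|V| + sum_w' N(w',c))
        pr_w = {}
        for c in class_docs:
            total = sum(word_counts[c].values())
            pr_w[c] = {w: (1 + word_counts[c][w]) / (len(vocab) + total)
                       for w in vocab}
        return prior, pr_w

    def classify(words, prior, pr_w):
        """argmax_c [ log Pr(c) + sum_i log Pr(w_i|c) ]; unseen words skipped."""
        return max(prior, key=lambda c: math.log(prior[c]) +
                   sum(math.log(pr_w[c][w]) for w in words if w in pr_w[c]))

    # Toy usage, echoing the slide's example:
    docs = [("corn wheat silo farm grow".split(), "Crops"),
            ("selection mutation Darwin Galapagos DNA".split(), "Evolution")]
    prior, pr_w = train(docs)
    print(classify("grow corn tractor".split(), prior, pr_w))  # -> "Crops"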

  5. The Rest of the Talk. Two methods for improving parameter estimation when labeled data is sparse: (1) Borrow data from related classes in a hierarchy. (2) Use unlabeled data.

  6. Improving Document Classification by Shrinkage in a Hierarchy Andrew McCallum Roni Rosenfeld Tom Mitchell Andrew Ng (Berkeley) Larry Wasserman (CMU Statistics)

  7. The Idea: “Shrinkage” / “Deleted Interpolation”. We can improve the parameter estimates in a leaf by averaging them with the estimates in its ancestors. This represents a tradeoff between reliability and specificity. [Figure: the same categories (Crops, Botany, Evolution, Irrigation, Magnetism, Relativity) now arranged in a hierarchy under Science, with intermediate nodes Agriculture, Biology, and Physics; the test document “corn grow tractor…” is classified into Crops.]

  8. “Shrinkage” / “Deleted Interpolation” [James and Stein, 1961] / [Jelinek and Mercer, 1980]. [Figure: the hierarchy from the previous slide, with a Uniform node above the Science root; a leaf's estimate is interpolated with the estimates of its ancestors up to the uniform distribution.]
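
Concretely, writing c^0 for the leaf class, c^1, ..., c^{k-1} for its ancestors up to the root, and treating the uniform distribution as a final “super-root”, the shrinkage estimate is a convex combination (a sketch in standard deleted-interpolation form; the weight notation is an assumption, not verbatim from the slide):

    \[
    \check{\Pr}(w \mid c) \;=\; \lambda_0 \,\hat{\Pr}(w \mid c^0)
    \;+\; \lambda_1 \,\hat{\Pr}(w \mid c^1)
    \;+\; \cdots
    \;+\; \lambda_k \,\frac{1}{|V|},
    \qquad \sum_{j=0}^{k} \lambda_j = 1 .
    \]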

  9. Learning Mixture Weights. Learn the λ's via EM, performing the E-step with leave-one-out cross-validation. E-step: use the current λ's to estimate the degree to which each node was likely to have generated the words in held-out documents. M-step: use the estimates to recalculate new values for the λ's. [Figure: the path Uniform → Science → Agriculture → Crops, with the held-out document “corn wheat silo farm grow...”.]

  10. Learning Mixture Weights: the E-step and M-step update equations.
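
One plausible rendering of these updates, consistent with the leave-one-out E-step described on the previous slide (a reconstruction: w_t ranges over the held-out word occurrences for class c, and β_j(w_t) is the posterior probability that ancestor c^j generated w_t):

    E-step:
    \[
    \beta_j(w_t) \;=\; \frac{\lambda_j \,\hat{\Pr}(w_t \mid c^j)}
                            {\sum_{j'} \lambda_{j'} \,\hat{\Pr}(w_t \mid c^{j'})}
    \]
    M-step (normalize the expected counts):
    \[
    \lambda_j \;\leftarrow\; \frac{\sum_t \beta_j(w_t)}
                                  {\sum_{j'} \sum_t \beta_{j'}(w_t)}
    \]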

  11. Newsgroups Data Set (subset of Ken Lang's 20 Newsgroups set). [Figure: a hierarchy with top-level nodes computers, religion, sport, politics, and motor over the leaves mac, atheism, misc, guns, misc, ibm, X, baseball, auto, hockey, mideast, motorcycle, graphics, christian, windows.] 15 classes, 15k documents, 1.7 million words, 52k vocabulary.

  12. Newsgroups Hierarchy: Mixture Weights

  13. Industry Sector Data Set (www.marketguide.com). [Figure: a hierarchy with 11 top-level sectors, including transportation, utilities, consumer, energy, and services, over leaves such as water, electric, gas, coal, integrated, air, misc, appliance, film, furniture, communication, railroad, water, trucking, oil&gas.] 71 classes, 6.5k documents, 1.2 million words, 30k vocabulary.

  14. Industry Sector Classification Accuracy

  15. Newsgroups Classification Accuracy

  16. Yahoo Science Data Set (www.yahoo.com/Science). [Figure: a hierarchy with 30 top-level nodes, including agriculture, biology, physics, CS, and space, over leaves such as dairy, botany, cell, AI, courses, crops, craft, magnetism, HCI, missions, agronomy, evolution, forestry, relativity.] 264 classes, 14k documents, 3 million words, 76k vocabulary.

  17. Yahoo Science Classification Accuracy

  18. Related Work
  • Shrinkage in Statistics: [Stein 1955], [James & Stein 1961]
  • Deleted Interpolation in Language Modeling: [Jelinek & Mercer 1980], [Seymore & Rosenfeld 1997]
  • Bayesian Hierarchical Modeling for n-grams: [MacKay & Peto 1994]
  • Class hierarchies for text classification: [Koller & Sahami 1997]
  • Using EM to set mixture weights in a hierarchical clustering model for unsupervised learning: [Hofmann & Puzicha 1998]

  19. Future Work
  • Learning hierarchies that aid classification.
  • Using more complex generative models:
    • Capturing word dependencies
    • Clustering words in each ancestor

  20. Shrinkage Conclusions
  • Shrinkage in a hierarchy of classes can dramatically improve classification accuracy.
  • Shrinkage helps especially when training data is sparse. In models more complex than naïve Bayes, it should be even more helpful.
  • [The hierarchy can be pruned for an exponential reduction in the computation necessary for classification, with only minimal loss in accuracy.]

  21. The Rest of the Talk. Two methods for improving parameter estimation when labeled data is sparse: (1) Borrow data from related classes in a hierarchy. (2) Use unlabeled data.

  22. Text Classification with Labeled and Unlabeled Documents Kamal Nigam Andrew McCallum Sebastian Thrun Tom Mitchell

  23. The Scenario. Training data with class labels: web pages the user says are interesting, and web pages the user says are uninteresting. Data available at training time, but without class labels: web pages the user hasn't seen or said anything about. Can we use the unlabeled documents to increase accuracy?

  24. Using the Unlabeled Data. (1) Build a classification model using the limited labeled data. (2) Use the model to estimate the labels of the unlabeled documents. (3) Use all documents to build a new classification model, which is often more accurate because it is trained using more data.

  25. An Example. Labeled data: Baseball: “The new hitter struck out...”, “Struck out in last inning...”, “Homerun in the first inning...”; Ice Skating: “Tara Lipinski's substitute ice skates didn't hurt her performance. She graced the ice with a series of perfect jumps and won the gold medal.”, “Fell on the ice...”, “Perfect triple jump...”, “Katarina Witt's gold medal performance...”. Unlabeled data: “Tara Lipinski bought a new house for her parents.”, “New ice skates...”, “Pete Rose is not as good an athlete as Tara Lipinski...”, “Practice at the ice rink every day...”. Before EM: Pr(Lipinski | Baseball) = 0.003, Pr(Lipinski) = 0.001. After EM: Pr(Lipinski | Ice Skating) = 0.02, Pr(Lipinski) = 0.01.

  26. Filling in Missing Labels with EM [Dempster et al ‘77], [Ghahramani & Jordan ‘95], [McLachlan & Krishnan ‘97]. Expectation-Maximization is a class of iterative algorithms for maximum likelihood estimation with incomplete data; it finds the model parameters that locally maximize the probability of both the labeled and the unlabeled data.
  • E-step: use current estimates of the model parameters to “guess” the values of the missing labels.
  • M-step: use current “guesses” for the missing labels to calculate new estimates of the model parameters.
  • Repeat E- and M-steps until convergence.
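
A minimal self-contained sketch of this loop for naïve Bayes text classification, using fractional class labels for the unlabeled documents; all names are illustrative, and the model matches the earlier naïve Bayes sketch:

    # EM with labeled + unlabeled documents for naive Bayes (slides 24-27).
    # Illustrative sketch; fractional labels implement the E-step "guesses".
    import math
    from collections import Counter

    def m_step(docs, label_probs, classes, vocab):
        """Estimate Pr(c) and Pr(w|c) from (possibly fractional) labels.
        label_probs[i][c] is the current guess of Pr(c | d_i)."""
        prior = {c: 0.0 for c in classes}
        counts = {c: Counter() for c in classes}
        for words, probs in zip(docs, label_probs):
            for c, p in probs.items():
                prior[c] += p
                for w in words:
                    counts[c][w] += p
        n = sum(prior.values())
        prior = {c: (1 + v) / (len(classes) + n) for c, v in prior.items()}
        pr_w = {}
        for c in classes:
            total = sum(counts[c].values())
            pr_w[c] = {w: (1 + counts[c][w]) / (len(vocab) + total)
                       for w in vocab}
        return prior, pr_w

    def e_step(words, prior, pr_w):
        """Pr(c|d) by Bayes rule, computed in log space for stability."""
        logp = {c: math.log(prior[c]) + sum(math.log(pr_w[c][w]) for w in words)
                for c in prior}
        m = max(logp.values())
        z = sum(math.exp(l - m) for l in logp.values())
        return {c: math.exp(l - m) / z for c, l in logp.items()}

    def em(labeled, unlabeled, classes, n_iters=10):
        """labeled: list of (words, class); unlabeled: list of word lists."""
        docs = [w for w, _ in labeled] + list(unlabeled)
        vocab = {w for words in docs for w in words}
        fixed = [{c: 1.0 if c == lab else 0.0 for c in classes}
                 for _, lab in labeled]      # labeled docs keep their labels
        # 1. Build a model from the labeled data alone.
        prior, pr_w = m_step([w for w, _ in labeled], fixed, classes, vocab)
        for _ in range(n_iters):
            # 2. E-step: estimate the labels of the unlabeled documents.
            guesses = [e_step(words, prior, pr_w) for words in unlabeled]
            # 3. M-step: rebuild the model from all documents.
            prior, pr_w = m_step(docs, fixed + guesses, classes, vocab)
        return prior, pr_w

As slide 42 warns, this loop can hurt when the model's mixture components do not align with the true classes; weighting the unlabeled data (slide 36) is one safeguard.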

  27. EM for Text Classification. Expectation step: estimate the class labels. Maximization step: compute new parameters using those estimates.
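
A reconstruction of the update equations in the notation of slides 3-4 (document d_i, class c_j, word counts N(w, d_i); for labeled documents Pr(c_j | d_i) is fixed at 0 or 1):

    E-step:
    \[
    \Pr(c_j \mid d_i) \;=\;
    \frac{\Pr(c_j) \prod_{k} \Pr(w_{d_{i,k}} \mid c_j)}
         {\sum_{j'} \Pr(c_{j'}) \prod_{k} \Pr(w_{d_{i,k}} \mid c_{j'})}
    \]
    M-step:
    \[
    \hat{\Pr}(w \mid c_j) \;=\;
    \frac{1 + \sum_i \Pr(c_j \mid d_i)\, N(w, d_i)}
         {|V| + \sum_{w'} \sum_i \Pr(c_j \mid d_i)\, N(w', d_i)}
    \]
    and analogously for the class priors \(\Pr(c_j)\).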

  28. WebKB Data Set. Classes: student, faculty, course, project. 4 classes, 4,199 documents from CS academic departments.

  29. Word Vector Evolution with EM (D is a digit). Iteration 0: intelligence, DD, artificial, understanding, DDw, dist, identical, rus, arrange, games, dartmouth, natural, cognitive, logic, proving, prolog. Iteration 1: DD, D, lecture, cc, D*, DD:DD, handout, due, problem, set, tay, DDam, yurtas, homework, kfoury, sec. Iteration 2: D, DD, lecture, cc, DD:DD, due, D*, homework, assignment, handout, set, hw, exam, problem, DDam, postscript.

  30. EM as Clustering. [Figure; x = unlabeled.]

  31. EM as Clustering, Gone Wrong. [Figure.]

  32. 20 Newsgroups Data Set: …, sci.med, sci.crypt, sci.space, alt.atheism, sci.electronics, comp.graphics, talk.politics.misc, comp.windows.x, rec.sport.hockey, talk.politics.guns, talk.religion.misc, rec.sport.baseball, talk.politics.mideast, comp.sys.mac.hardware, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware. 20 class labels, 20,000 documents, 62k unique words.

  33. Newsgroups Classification Accuracy, varying # labeled documents

  34. Newsgroups Classification Accuracy, varying # unlabeled documents

  35. WebKB Classification Accuracy, varying # labeled documents

  36. WebKB Classification Accuracy, varying weight of unlabeled data
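
Here “weight” means discounting the unlabeled documents' contribution to the M-step counts by a factor λ ∈ [0,1], selected by cross-validation on slide 37. A sketch, writing L and U for the labeled and unlabeled sets (this notation is assumed, not from the slide):

    \[
    \hat{\Pr}(w \mid c) \;\propto\;
    1 \;+\; \sum_{d_i \in \mathcal{L}} \Pr(c \mid d_i)\, N(w, d_i)
      \;+\; \lambda \sum_{d_i \in \mathcal{U}} \Pr(c \mid d_i)\, N(w, d_i)
    \]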

  37. WebKB Classification Accuracy, varying # labeled documents and selecting unlabeled weight by CV

  38. Reuters 21578 Data Set: earn, interest, ship, acq, crude, grain, wheat, …, corn. 135 class labels, 12,902 documents.

  39. EM as Clustering, Salvageable. [Figure.]

  40. Reuters 21578 Precision-Recall Breakeven, varying # mixture components for the negative class

  41. Related Work
  • Using EM to reduce the need for training examples: [Miller & Uyar 1997], [Shahshahani & Landgrebe 1994]
  • Using EM to fill in missing values: [Ghahramani & Jordan 1995]
  • AutoClass, unsupervised EM with Naïve Bayes: [Cheeseman et al. 1988]
  • Co-Training: [Blum & Mitchell COLT’98]
  • Relevance Feedback for Information Retrieval: [Salton & Buckley 1990]

  42. Unlabeled Data Conclusions & Future Work
  • Combining labeled and unlabeled data with EM can greatly reduce the need for labeled training data.
  • Exercise caution: EM can sometimes hurt.
    • Weight the unlabeled data.
    • Choose the parametric model carefully.
    • Vary the EM likelihood surface for different tasks.
  • Use similar techniques for other text tasks: e.g. Information Extraction.

  43. Cora Demo

  44. Populating a hierarchy
  • Naïve Bayes
    • Simple, robust document classification.
    • Many principled enhancements (e.g. shrinkage).
    • Requires a lot of labeled training data.
  • Keyword matching
    • Requires no labeled training data.
    • Human effort to select keywords (acc/cov).
    • Brittle, breaks easily.
