Text Classification from Labeled and Unlabeled Documents using EM

Learn how to improve parameter estimation and classification accuracy in text classification using the Expectation Maximization (EM) algorithm for the Naive Bayes Classifier.


Presentation Transcript


  1. Text Classification from Labeled and Unlabeled Documents using EM, by Nigam, McCallum, Thrun, Mitchell. Machine Learning, 2000. Ablai | Kristie | Cindy | Shayan

  2. Problem Statement • Text classification is a fundamental problem in NLP: assigning tags or categories to text according to its content • Broad application area: • Spam detection • Sentiment analysis • Topic labeling, and so on

  3. Text Classification • Example: spam filtering • Naive Bayes classifier (figure on slide)

  4. Labels in NLP • Labeled data: labels are expensive, labeling is slow, and humans are error-prone • Unlabeled data: usually free, available in large amounts, and can be categorized by domain

  5. Employing Unlabeled Data

  6. Employing Unlabeled Data

  7. Employing Unlabeled Data

  8. Employing Unlabeled Data

  9. Overview • M-step: train a NB classifier on labeled data • E-step: label unlabeled data • M-step: train a new NB classifier on the data • E-step: relabel the data • Repeat until convergence • Contributions: λ-EM, significant performance improvements

  10. Mixture Model • Data are modelled as a mixture of components, where each component has a simple parametric form (such as multinomial or Gaussian) • Each cluster (component) is a generative model

  11. Generative Model • Why is it called a 'generative' model? • We assume there are underlying models that may have generated the given data • Each cluster is parameterized by a disjoint subset of θ • But we don't know the parameters of the underlying model • So we estimate the parameters for all component clusters

  12. Naive Bayes Classifier: a generative classifier

  13. With training data, a certain probability distribution is assumed • Multinomial distributions → Naive Bayes classifier (a mixture of multinomials) • The distribution's required parameters are calculated to be used in the classifier • Plug the parameters into Bayes' rule (later)

  14. Assumptions • Data are produced by a mixture model • There is a 1-to-1 correspondence between mixture components of the mixture model and classes of the classification problem • Naive Bayes assumption of word independence → reduces the number of parameters
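The word-independence assumption is what keeps the parameter count manageable: the document likelihood factors into one term per word. As a reminder, in standard naive Bayes notation (not reproduced from the slide images):

```latex
% Document d = (w_1, ..., w_n), mixture component / class c_j, parameters \theta.
% Under the word-independence (naive Bayes) assumption:
P(d \mid c_j; \theta) \propto \prod_{i=1}^{n} P(w_i \mid c_j; \theta)
% so the model only needs P(w_t \mid c_j) for each vocabulary word w_t,
% plus the class priors P(c_j).
```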

  15. Model Parameters • Parameters of a mixture component: • word probabilities (probability of a word given a component): P(x|c) • mixture weights, i.e. class prior probabilities (probability of selecting a component): P(c)

  16. Training the Classifier • Learning = estimating the parameters P(X|Y), P(Y) • We want to find the parameter values that are most probable given the training data • How: use ratios of counts from labeled training data + smoothing (example to follow)

  17. Example • Label an email that contains the text: "Free $$$ bonus" • Training data: (table on slide) • Example adapted from "A practical explanation of a Naive Bayes classifier", Bruno Stecanella

  18. Example • Goal: calculate whether the email has a higher probability of being spam or not spam, using Bayes' rule
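The Bayes' rule expression itself appears only as an image on the slide; a standard reconstruction for this example is:

```latex
P(\text{spam} \mid \text{free \$\$\$ bonus})
  = \frac{P(\text{free \$\$\$ bonus} \mid \text{spam}) \, P(\text{spam})}
         {P(\text{free \$\$\$ bonus})}
```

Since the denominator is the same for both classes, only the numerators need to be compared.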

  19. Example • Make the naive Bayes assumption

  20. Example • In order to estimate the parameters, find ratios of counts, e.g. P(free | spam) = count(free in spam emails) / (total word count in spam emails), and likewise for the other words and the not-spam class

  21. Example • Some words never occur in one of the classes, so their raw count ratio is zero → need to apply Laplace smoothing!

  22. Example • Augment the numerator and denominator of the ratios with a "pseudo-count": P(w | c) = (count(w in c) + 1) / (total word count in c + |V|), where |V| is the vocabulary size

  23. Example • P(spam | free $$$ bonus) ∝ P(free | spam) · P($$$ | spam) · P(bonus | spam) · P(spam) • P(not spam | free $$$ bonus) ∝ P(free | not spam) · P($$$ | not spam) · P(bonus | not spam) · P(not spam) • If P(spam | free $$$ bonus) > P(not spam | free $$$ bonus) → label as spam
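Pulling slides 16-23 together, here is a minimal runnable sketch of the same calculation. The training emails below are hypothetical placeholders, since the slide's training table is only an image in the transcript:

```python
# Multinomial Naive Bayes with Laplace smoothing for the "free $$$ bonus" example.
# The training data here is made up for illustration, not the slide's actual table.
import math
from collections import Counter

train = [
    (["free", "money", "free", "$$$"], "spam"),
    (["bonus", "$$$", "click"], "spam"),
    (["meeting", "free", "agenda"], "not spam"),
    (["project", "bonus", "report"], "not spam"),
]

classes = {"spam", "not spam"}
word_counts = {c: Counter() for c in classes}   # count(w in class c)
class_counts = Counter()                        # number of training emails per class

for words, label in train:
    word_counts[label].update(words)
    class_counts[label] += 1

vocab = {w for words, _ in train for w in words}

def log_score(words, c):
    """log P(c) + sum_i log P(w_i | c), with Laplace (add-one) smoothing."""
    score = math.log(class_counts[c] / sum(class_counts.values()))  # class prior
    total = sum(word_counts[c].values())
    for w in words:
        # pseudo-count of 1 in the numerator, |V| added to the denominator
        score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
    return score

email = ["free", "$$$", "bonus"]
print(max(classes, key=lambda c: log_score(email, c)))  # prints the winning class
```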

  24. Using the Classifier • Calculate the probability that a particular mixture component generated the document, using Bayes' rule with the estimated parameters • Label: the class with the highest posterior probability of generating the document • The naive Bayes classifier was shown to do a good job at text classification, but can do better… (Next: applying EM to naive Bayes to improve parameter estimation / classification accuracy)

  25. Expectation Maximization

  26. Preface to Expectation Maximization: Revisit K-Means • Randomly initialize the centers • Assign each point to one cluster based on distance • Recompute each center based on the average of the points inside • Image: http://stanford.edu/~cpiech/cs221/handouts/kmeans.html

  27. Preface to Expectation Maximization: Revisit K-Means • Iteration: • Membership: fix the centers, assign each point to one class • Readjust centers: fix the point memberships, recompute each center • What if we want to estimate a probability for how likely it is that a point belongs to each class?
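A minimal sketch of the two alternating k-means steps described above (my own illustration in NumPy, not code from the slides):

```python
# k-means: alternate "membership" and "readjust center" steps for a fixed number of iterations.
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly initialize centers by picking k of the points.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Membership: fix centers, assign each point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Readjust: fix memberships, recompute each center as the mean of its points.
        centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return centers, labels

points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
centers, labels = kmeans(points, k=2)   # hard assignments: each point gets exactly one cluster
```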

  28. Hard Clustering vs. Soft Clustering • Hard clustering: every object is assigned to exactly one cluster i, so Ai = 0 or 1, with ∑i Ai = 1 over all clusters i • Soft clustering: 0 ≤ Ai ≤ 1, with ∑i Ai = 1 over all clusters i • Q: How do you do this soft clustering?

  29. Mixture Models • Each cluster is a generative model • Model: Gaussian or multinomial • Parameters of the model are unknown -- to be estimated • How to estimate? Expectation Maximization!

  30. Expectation Maximization: Basic Example • Assume we use a 1-D Gaussian model • Assume we know the number of clusters, k • If we know the true assignments, we can estimate each cluster's parameters directly • Images from Victor Lavrenko

  31. Expectation Maximization: Basic Example • If we don't know the true assignments BUT know the Gaussian parameters, we can guess how likely it is that each point belongs to each cluster • Posterior ∝ likelihood × prior: P(cluster | x) ∝ P(x | cluster) · P(cluster) • Images from Victor Lavrenko

  32. Expectation Maximization: Basic Example • Issue: what if we don't know those Gaussian parameters? • We need the Gaussian parameters to calculate the posterior probabilities, but we need the cluster posterior probabilities to estimate the Gaussian parameters • Images from Victor Lavrenko

  33. Expectation Maximization: Basic Example • Initialization: randomly initialize k Gaussians (assume Gaussian), each with its own mean and variance • E-step: fix the model parameters; "soft" assign points to clusters -- each point has a probability of belonging to each cluster • M-step: fix the membership probabilities; adjust the parameters to maximize the expected likelihood

  34. Expectation Maximization: Basic Example • Recalculate the means and variances for each cluster • Images from Victor Lavrenko
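A compact sketch of the 1-D Gaussian-mixture EM loop outlined on slides 30-34 (illustrative code, not from the slides or the paper):

```python
# EM for a 1-D Gaussian mixture: soft E-step assignments, M-step parameter updates.
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: random means, unit variances, uniform mixture weights.
    means = rng.choice(x, size=k, replace=False)
    variances = np.ones(k)
    weights = np.full(k, 1.0 / k)

    for _ in range(n_iter):
        # E-step: fix parameters, compute soft memberships
        # resp[i, j] = P(cluster j | x_i) ∝ weight_j * N(x_i; mean_j, var_j)
        likelihood = np.stack([
            weights[j] * norm.pdf(x, means[j], np.sqrt(variances[j]))
            for j in range(k)
        ], axis=1)
        resp = likelihood / likelihood.sum(axis=1, keepdims=True)

        # M-step: fix memberships, re-estimate means, variances, and weights.
        nk = resp.sum(axis=0)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
        weights = nk / len(x)

    return means, variances, weights

x = np.concatenate([np.random.normal(0, 1, 100), np.random.normal(5, 1, 100)])
means, variances, weights = em_gmm_1d(x, k=2)
```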

  35. Break • When we come back: wrapping up basic EM

  36. Why pick EM? • Unlabeled data -- we want to discover clusters • Assume each cluster has an underlying model • To estimate it, we need an iterative method for the parameters • If we are interested in which model belongs to which class label… • (Figure: example word clusters, e.g. "happy", "volunteer", "Samaritan" vs. "cheat", "scam", curse words, mapped to class labels)

  37. EM and Text • Spam? Not spam? • Words in a document • Word count • Multinomial distribution • (Figure: example emails "Free $$$!!", "Free while supplies last!") • Given this class label, how likely are you to generate this bag of words?

  38. Naive Bayes and EM • Initialization: naive Bayes -- estimate classifier parameters from labeled data • Loop: • Assign probabilistically-weighted class labels to each unlabeled document using EM • Estimate new classifier parameters using BOTH the labeled and unlabeled data
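A schematic sketch of this loop for multinomial naive Bayes, fleshing out the overview from slide 9 (my reading of the slides, not the authors' code; X_l and X_u are document-term count matrices for the labeled and unlabeled sets, y_l the integer labels):

```python
# Basic EM + Naive Bayes: E-step soft-labels the unlabeled docs, M-step retrains on all.
import numpy as np

def m_step(X, posteriors, alpha=1.0):
    """Estimate multinomial NB parameters from (possibly soft) class posteriors."""
    word_counts = posteriors.T @ X                      # expected word counts, shape (K, V)
    log_p_w = np.log(word_counts + alpha) - \
              np.log(word_counts.sum(axis=1, keepdims=True) + alpha * X.shape[1])
    log_prior = np.log(posteriors.sum(axis=0)) - np.log(posteriors.sum())
    return log_prior, log_p_w

def e_step(X, log_prior, log_p_w):
    """Posterior P(class | doc) for each document: the probabilistic labels."""
    log_post = X @ log_p_w.T + log_prior                # shape (N, K), unnormalized
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def em_naive_bayes(X_l, y_l, X_u, n_classes, n_iter=10):
    post_l = np.eye(n_classes)[y_l]          # labeled docs keep hard (one-hot) labels
    log_prior, log_p_w = m_step(X_l, post_l)             # init: NB on labeled data only
    for _ in range(n_iter):
        post_u = e_step(X_u, log_prior, log_p_w)         # E-step: soft-label unlabeled docs
        log_prior, log_p_w = m_step(np.vstack([X_l, X_u]),    # M-step: retrain on both
                                    np.vstack([post_l, post_u]))
    return log_prior, log_p_w
```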

  39. Limitations of Basic EM • Let's look at the assumptions… • 1. All data are generated by the mixture model, and the generated data uses the same parametric model used in classification • 2. 1-to-1 correspondence between mixture components and classes • Unlabeled data helps when there is very limited labeled data… but what if there is lots of labeled data?

  40. Augmented EM - Part 1 • Assumption 1: all data are generated by the mixture model • When enough labeled data is already provided, the unlabeled data overwhelms and badly skews the estimates • → Introduce a parameter 0 ≤ λ ≤ 1 to decrease the unlabeled documents' contribution • (Figure: equation with its labeled-data, prior, and unlabeled-data terms labeled)

  41. Augmented EM - Part 1 • → By weighting unlabeled documents by λ, you weight the word counts of unlabeled documents less, by a factor of λ • λ is selected by cross-validation • → When setting 0 < λ < 1, classification improves
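In terms of the sketch after slide 38, this would amount to scaling the unlabeled documents' soft labels by λ before the M-step (again my own illustration; the paper folds λ into its count equations, here it is applied to the posteriors):

```python
# λ-weighted EM + Naive Bayes: unlabeled counts contribute λ times as much.
# Reuses m_step and e_step from the sketch above.
import numpy as np

def em_naive_bayes_weighted(X_l, y_l, X_u, n_classes, lam=0.1, n_iter=10):
    post_l = np.eye(n_classes)[y_l]
    log_prior, log_p_w = m_step(X_l, post_l)
    for _ in range(n_iter):
        post_u = e_step(X_u, log_prior, log_p_w)
        # Down-weight the unlabeled documents' responsibilities by λ (chosen by cross-validation).
        log_prior, log_p_w = m_step(np.vstack([X_l, X_u]),
                                    np.vstack([post_l, lam * post_u]))
    return log_prior, log_p_w
```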

  42. Augmented EM - Part 2 • Assumption: 1-to-1 correspondence between mixture components and classes • → Many-to-one correspondence • Ex: one class may be comprised of several different sub-topics • Machine Learning → neural networks, Bayesian methods, regression, … (figure: sub-topic words such as "ReLU", "activation"; "ANOVA", "F-statistic") • One multinomial distribution may not be enough!
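With several mixture components per class, classification sums the posteriors of the components belonging to that class (standard mixture bookkeeping; the notation here is mine, not from the slide):

```latex
% Class k owns a set of mixture components C_k (many-to-one correspondence).
P(\text{class}_k \mid d) = \sum_{c_j \in C_k} P(c_j \mid d)
```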

  43. Experiments • A discussion of the practical results of this approach

  44. Empirical Validation of the Proposed System • Validation of all their claims: • Unlabeled data and overall efficacy • Weighting • Multiple mixture components

  45. Datasets • Task: text classification • We need datasets!

  46. Datasets: UseNet - General Information • Available at: http://qwone.com/~jason/20Newsgroups/ • 20 different newsgroups (the labels) • 20,017 articles • No considerable class imbalance (important)

  47. Datasets: UseNet - In this work • 62,258 unique words • Used a test set of 4,000 articles from the latest portion (20%) of the timeline, since the task is usually predicting future classes, not past ones • The train set is composed of: • 10,000 articles randomly selected from the rest, used as unlabeled data • 6,000 documents used for labeled examples
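For reference, a rough way to build a comparable labeled / unlabeled / test split with scikit-learn's copy of 20 Newsgroups (an approximation, not the authors' exact preprocessing: sklearn ships a de-duplicated ~18.8k-document version, so the 10,000 / 6,000 figures are parameterized rather than reproduced, and sklearn's own date-based train/test split stands in for the "latest 20%" test set):

```python
# Approximate labeled / unlabeled / test split on 20 Newsgroups (illustrative only).
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

train = fetch_20newsgroups(subset="train")   # earlier posts
test = fetch_20newsgroups(subset="test")     # later posts, used here as the test set

n_unlabeled, n_labeled = 8000, 3000          # scaled-down stand-ins for 10,000 / 6,000
rng = np.random.default_rng(0)
idx = rng.permutation(len(train.data))
unlabeled_idx = idx[:n_unlabeled]
labeled_idx = idx[n_unlabeled:n_unlabeled + n_labeled]

# Vocabulary built from the labeled + unlabeled documents, then reused for the test set.
vectorizer = CountVectorizer()
vectorizer.fit([train.data[i] for i in idx[:n_unlabeled + n_labeled]])
X_labeled = vectorizer.transform([train.data[i] for i in labeled_idx])
y_labeled = train.target[labeled_idx]
X_unlabeled = vectorizer.transform([train.data[i] for i in unlabeled_idx])
X_test, y_test = vectorizer.transform(test.data), test.target
```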

  48. Datasets: WebKB - General Information • Available at http://www.cs.cmu.edu/~webkb/ • 8,145 web pages from CS departments • Categories: student, faculty, staff, course, project, department, other

  49. Datasets: WebKB - In this work • Only the four main categories (those with more data) are used: 4,199 pages • Numbers are converted to either a time token or a phone-number token • Did not perform stemming or use a stoplist -- they showed this actually hurts performance • Vocabulary is limited to the 300 most informative words; this vocabulary size was selected empirically • Test using the leave-one-university-out approach: one complete CS department's data • 2,500 pages randomly selected from the rest: unlabeled set • Train set: same as before

  50. Datasets: Reuters - General Information • Available at http://www.daviddlewis.com/resources/testcollections/reuters21578/ • 12,902 articles • 90 categories from the Reuters newswire
