Text Classification from Labeled and Unlabeled Documents using EM

Learn how to improve parameter estimation and classification accuracy in text classification using the Expectation Maximization (EM) algorithm for the Naive Bayes Classifier.


Presentation Transcript


  1. Text Classification from Labeled and Unlabeled Documents using EM, by Nigam, McCallum, Thrun, Mitchell. Machine Learning, 2000. Ablai | Kristie | Cindy | Shayan

  2. Problem Statement • Text classification is a fundamental problem in NLP: assigning tags or categories to text according to its content • Broad application area: • Spam detection • Sentiment analysis • Topic labeling, and so on

  3. Text Classification • Example: spam filtering • Naive Bayes classifier (figure on slide)

  4. Labels in NLP • Labeled data: labels are expensive, labeling is slow, and humans are error-prone • Unlabeled data: usually free, available in large amounts, and can be categorized by domain

  5. Employing Unlabeled Data

  6. Employing Unlabeled Data

  7. Employing Unlabeled Data

  8. Employing Unlabeled Data

  9. Overview • M-step: train a NB classifier on labeled data • E-step: label unlabeled data • M-step: train a new NB classifier on the data • E-step: relabel the data • Repeat until convergence • Contributions: λ-EM, significant performance improvements

  10. Mixture Model • Data are modelled as a mixture of components, where each component has a simple parametric form (such as multinomial or Gaussian) • Each cluster (component) is a generative model

  11. Generative Model • Why is it called a 'generative' model? • We assume there are underlying models that may have generated the given data • Each cluster is parameterized by a disjoint subset of θ • But we don't know the parameters of the underlying model • So we estimate the parameters for all component clusters

  12. Naive Bayes Classifier: a generative classifier

  13. With training data, a certain probability distribution is assumed • Multinomial distributions → Naive Bayes classifier (a mixture of multinomials) • The distribution's required parameters are calculated to be used in the classifier • Plug the parameters into Bayes' rule (later)

  14. Assumptions • Data are produced by a mixture model • There is a 1-to-1 correspondence between mixture components of the mixture model and classes of the classification problem • Naive Bayes assumption of word independence → reduces the number of parameters
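The word-independence assumption is what keeps the parameter count manageable: the document likelihood factors into one term per word. As a reminder, in standard naive Bayes notation (not reproduced from the slide images):

```latex
% Document d = (w_1, ..., w_n), mixture component / class c_j, parameters \theta.
% Under the word-independence (naive Bayes) assumption:
P(d \mid c_j; \theta) \propto \prod_{i=1}^{n} P(w_i \mid c_j; \theta)
% so the model only needs P(w_t \mid c_j) for each vocabulary word w_t,
% plus the class priors P(c_j).
```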

  15. Model Parameters • Parameters of a mixture component: • word probabilities (probability of a word given a component): P(x|c) • mixture weights, i.e. class prior probabilities (probability of selecting a component): P(c)

  16. Training the Classifier • Learning = estimating the parameters P(X|Y), P(Y) • We want to find the parameter values that are most probable given the training data • How: use ratios of counts from labeled training data + smoothing (example to follow)

  17. Example • Label an email that contains the text: "Free $$$ bonus" • Training data: (table on slide) • Example adapted from "A practical explanation of a Naive Bayes classifier", Bruno Stecanella

  18. Example • Goal: calculate whether the email has a higher probability of being spam or not spam, using Bayes' rule
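The Bayes' rule expression itself appears only as an image on the slide; a standard reconstruction for this example is:

```latex
P(\text{spam} \mid \text{free \$\$\$ bonus})
  = \frac{P(\text{free \$\$\$ bonus} \mid \text{spam}) \, P(\text{spam})}
         {P(\text{free \$\$\$ bonus})}
```

Since the denominator is the same for both classes, only the numerators need to be compared.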

  19. Example • Make the naive Bayes assumption

  20. Example • In order to estimate the parameters, find ratios of counts, e.g. P(free | spam) = count(free in spam emails) / (total word count in spam emails), and likewise for the other words and the not-spam class

  21. Example • Some words never occur in one of the classes, so their raw count ratio is zero → need to apply Laplace smoothing!

  22. Example • Augment the numerator and denominator of the ratios with a "pseudo-count": P(w | c) = (count(w in c) + 1) / (total word count in c + |V|), where |V| is the vocabulary size

  23. Example • P(spam | free $$$ bonus) ∝ P(free | spam) · P($$$ | spam) · P(bonus | spam) · P(spam) • P(not spam | free $$$ bonus) ∝ P(free | not spam) · P($$$ | not spam) · P(bonus | not spam) · P(not spam) • If P(spam | free $$$ bonus) > P(not spam | free $$$ bonus) → label as spam
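Pulling slides 16-23 together, here is a minimal runnable sketch of the same calculation. The training emails below are hypothetical placeholders, since the slide's training table is only an image in the transcript:

```python
# Multinomial Naive Bayes with Laplace smoothing for the "free $$$ bonus" example.
# The training data here is made up for illustration, not the slide's actual table.
import math
from collections import Counter

train = [
    (["free", "money", "free", "$$$"], "spam"),
    (["bonus", "$$$", "click"], "spam"),
    (["meeting", "free", "agenda"], "not spam"),
    (["project", "bonus", "report"], "not spam"),
]

classes = {"spam", "not spam"}
word_counts = {c: Counter() for c in classes}   # count(w in class c)
class_counts = Counter()                        # number of training emails per class

for words, label in train:
    word_counts[label].update(words)
    class_counts[label] += 1

vocab = {w for words, _ in train for w in words}

def log_score(words, c):
    """log P(c) + sum_i log P(w_i | c), with Laplace (add-one) smoothing."""
    score = math.log(class_counts[c] / sum(class_counts.values()))  # class prior
    total = sum(word_counts[c].values())
    for w in words:
        # pseudo-count of 1 in the numerator, |V| added to the denominator
        score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
    return score

email = ["free", "$$$", "bonus"]
print(max(classes, key=lambda c: log_score(email, c)))  # prints the winning class
```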

  24. Using the Classifier • Calculate the probability that a particular mixture component generated the document, using Bayes' rule with the estimated parameters • Label: the class with the highest posterior probability of generating the document • The naive Bayes classifier was shown to do a good job at text classification, but can do better… (Next: applying EM to naive Bayes to improve parameter estimation / classification accuracy)

  25. Expectation Maximization

  26. Preface to Expectation Maximization: Revisit K-Means • Randomly initialize the centers • Assign each point to one cluster based on distance • Recompute each center based on the average of the points inside • Image: http://stanford.edu/~cpiech/cs221/handouts/kmeans.html

  27. Preface to Expectation Maximization: Revisit K-Means • Iteration: • Membership: fix the centers, assign each point to one class • Readjust centers: fix the point memberships, recompute each center • What if we want to estimate a probability for how likely it is that a point belongs to each class?
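A minimal sketch of the two alternating k-means steps described above (my own illustration in NumPy, not code from the slides):

```python
# k-means: alternate "membership" and "readjust center" steps for a fixed number of iterations.
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly initialize centers by picking k of the points.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Membership: fix centers, assign each point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Readjust: fix memberships, recompute each center as the mean of its points.
        centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return centers, labels

points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
centers, labels = kmeans(points, k=2)   # hard assignments: each point gets exactly one cluster
```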

  28. Hard Clustering vs. Soft Clustering • Hard clustering: every object is assigned to exactly one cluster i, so Ai = 0 or 1, with ∑i Ai = 1 over all clusters i • Soft clustering: 0 ≤ Ai ≤ 1, with ∑i Ai = 1 over all clusters i • Q: How do you do this soft clustering?

  29. Mixture Models • Each cluster is a generative model • Model: Gaussian or multinomial • Parameters of the model are unknown -- to be estimated • How to estimate? Expectation Maximization!

  30. Expectation Maximization: Basic Example • Assume we use a 1-D Gaussian model • Assume we know the number of clusters, k • If we know the true assignments, we can estimate each cluster's parameters directly • Images from Victor Lavrenko

  31. Expectation Maximization: Basic Example • If we don't know the true assignments BUT know the Gaussian parameters, we can guess how likely it is that each point belongs to each cluster • Posterior ∝ likelihood × prior: P(cluster | x) ∝ P(x | cluster) · P(cluster) • Images from Victor Lavrenko

  32. Expectation Maximization: Basic Example • Issue: what if we don't know those Gaussian parameters? • We need the Gaussian parameters to calculate the posterior probabilities, but we need the cluster posterior probabilities to estimate the Gaussian parameters • Images from Victor Lavrenko

  33. Expectation Maximization: Basic Example • Initialization: randomly initialize k Gaussians (assume Gaussian), each with its own mean and variance • E-step: fix the model parameters; "soft" assign points to clusters -- each point has a probability of belonging to each cluster • M-step: fix the membership probabilities; adjust the parameters to maximize the expected likelihood

  34. Expectation Maximization: Basic Example • Recalculate the means and variances for each cluster • Images from Victor Lavrenko
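A compact sketch of the 1-D Gaussian-mixture EM loop outlined on slides 30-34 (illustrative code, not from the slides or the paper):

```python
# EM for a 1-D Gaussian mixture: soft E-step assignments, M-step parameter updates.
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: random means, unit variances, uniform mixture weights.
    means = rng.choice(x, size=k, replace=False)
    variances = np.ones(k)
    weights = np.full(k, 1.0 / k)

    for _ in range(n_iter):
        # E-step: fix parameters, compute soft memberships
        # resp[i, j] = P(cluster j | x_i) ∝ weight_j * N(x_i; mean_j, var_j)
        likelihood = np.stack([
            weights[j] * norm.pdf(x, means[j], np.sqrt(variances[j]))
            for j in range(k)
        ], axis=1)
        resp = likelihood / likelihood.sum(axis=1, keepdims=True)

        # M-step: fix memberships, re-estimate means, variances, and weights.
        nk = resp.sum(axis=0)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
        weights = nk / len(x)

    return means, variances, weights

x = np.concatenate([np.random.normal(0, 1, 100), np.random.normal(5, 1, 100)])
means, variances, weights = em_gmm_1d(x, k=2)
```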

  35. Break • When we come back: wrapping up basic EM

  36. Why pick EM? • Unlabeled data -- we want to discover clusters • Assume each cluster has an underlying model • To estimate it, we need an iterative method for the parameters • If we are interested in which model belongs to which class label… • (Figure: example word clusters, e.g. "happy", "volunteer", "Samaritan" vs. "cheat", "scam", curse words, mapped to class labels)

  37. EM and Text • Spam? Not spam? • Words in a document • Word count • Multinomial distribution • (Figure: example emails "Free $$$!!", "Free while supplies last!") • Given this class label, how likely are you to generate this bag of words?

  38. Naive Bayes and EM • Initialization: naive Bayes -- estimate classifier parameters from labeled data • Loop: • Assign probabilistically-weighted class labels to each unlabeled document using EM • Estimate new classifier parameters using BOTH the labeled and unlabeled data
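A schematic sketch of this loop for multinomial naive Bayes, fleshing out the overview from slide 9 (my reading of the slides, not the authors' code; X_l and X_u are document-term count matrices for the labeled and unlabeled sets, y_l the integer labels):

```python
# Basic EM + Naive Bayes: E-step soft-labels the unlabeled docs, M-step retrains on all.
import numpy as np

def m_step(X, posteriors, alpha=1.0):
    """Estimate multinomial NB parameters from (possibly soft) class posteriors."""
    word_counts = posteriors.T @ X                      # expected word counts, shape (K, V)
    log_p_w = np.log(word_counts + alpha) - \
              np.log(word_counts.sum(axis=1, keepdims=True) + alpha * X.shape[1])
    log_prior = np.log(posteriors.sum(axis=0)) - np.log(posteriors.sum())
    return log_prior, log_p_w

def e_step(X, log_prior, log_p_w):
    """Posterior P(class | doc) for each document: the probabilistic labels."""
    log_post = X @ log_p_w.T + log_prior                # shape (N, K), unnormalized
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def em_naive_bayes(X_l, y_l, X_u, n_classes, n_iter=10):
    post_l = np.eye(n_classes)[y_l]          # labeled docs keep hard (one-hot) labels
    log_prior, log_p_w = m_step(X_l, post_l)             # init: NB on labeled data only
    for _ in range(n_iter):
        post_u = e_step(X_u, log_prior, log_p_w)         # E-step: soft-label unlabeled docs
        log_prior, log_p_w = m_step(np.vstack([X_l, X_u]),    # M-step: retrain on both
                                    np.vstack([post_l, post_u]))
    return log_prior, log_p_w
```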

  39. Limitations of Basic EM • Let's look at the assumptions… • 1. All data are generated by the mixture model, and the generated data uses the same parametric model used in classification • 2. 1-to-1 correspondence between mixture components and classes • Unlabeled data helps when there is very limited labeled data… but what if there is lots of labeled data?

  40. Augmented EM - Part 1 • Assumption 1: all data are generated by the mixture model • When enough labeled data is already provided, the unlabeled data overwhelms and badly skews the estimates • → Introduce a parameter 0 ≤ λ ≤ 1 to decrease the unlabeled documents' contribution • (Figure: equation with its labeled-data, prior, and unlabeled-data terms labeled)

  41. Augmented EM - Part 1 • → By weighting unlabeled documents by λ, you weight the word counts of unlabeled documents less, by a factor of λ • λ is selected by cross-validation • → When setting 0 < λ < 1, classification improves
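In terms of the sketch after slide 38, this would amount to scaling the unlabeled documents' soft labels by λ before the M-step (again my own illustration; the paper folds λ into its count equations, here it is applied to the posteriors):

```python
# λ-weighted EM + Naive Bayes: unlabeled counts contribute λ times as much.
# Reuses m_step and e_step from the sketch above.
import numpy as np

def em_naive_bayes_weighted(X_l, y_l, X_u, n_classes, lam=0.1, n_iter=10):
    post_l = np.eye(n_classes)[y_l]
    log_prior, log_p_w = m_step(X_l, post_l)
    for _ in range(n_iter):
        post_u = e_step(X_u, log_prior, log_p_w)
        # Down-weight the unlabeled documents' responsibilities by λ (chosen by cross-validation).
        log_prior, log_p_w = m_step(np.vstack([X_l, X_u]),
                                    np.vstack([post_l, lam * post_u]))
    return log_prior, log_p_w
```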

  42. Augmented EM - Part 2 • Assumption: 1-to-1 correspondence between mixture components and classes • → Many-to-one correspondence • Ex: one class may be comprised of several different sub-topics • Machine Learning → neural networks, Bayesian methods, regression, … (figure: sub-topic words such as "ReLU", "activation"; "ANOVA", "F-statistic") • One multinomial distribution may not be enough!
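With several mixture components per class, classification sums the posteriors of the components belonging to that class (standard mixture bookkeeping; the notation here is mine, not from the slide):

```latex
% Class k owns a set of mixture components C_k (many-to-one correspondence).
P(\text{class}_k \mid d) = \sum_{c_j \in C_k} P(c_j \mid d)
```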

  43. Experiments • A discussion of the practical results of this approach

  44. Empirical Validation of the Proposed System • Validation of all their claims: • Unlabeled data and overall efficacy • Weighting • Multiple mixture components

  45. Datasets • Task: text classification • We need datasets!

  46. Datasets: UseNet - General Information • Available at: http://qwone.com/~jason/20Newsgroups/ • 20 different newsgroups (the labels) • 20,017 articles • No considerable class imbalance (important)

  47. Datasets: UseNet - In this work • 62,258 unique words • Used a test set of 4,000 articles from the latest portion (20%) of the timeline, since the task is usually predicting future classes, not past ones • The train set is composed of: • 10,000 articles randomly selected from the rest, used as unlabeled data • 6,000 documents used for labeled examples
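For reference, a rough way to build a comparable labeled / unlabeled / test split with scikit-learn's copy of 20 Newsgroups (an approximation, not the authors' exact preprocessing: sklearn ships a de-duplicated ~18.8k-document version, so the 10,000 / 6,000 figures are parameterized rather than reproduced, and sklearn's own date-based train/test split stands in for the "latest 20%" test set):

```python
# Approximate labeled / unlabeled / test split on 20 Newsgroups (illustrative only).
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

train = fetch_20newsgroups(subset="train")   # earlier posts
test = fetch_20newsgroups(subset="test")     # later posts, used here as the test set

n_unlabeled, n_labeled = 8000, 3000          # scaled-down stand-ins for 10,000 / 6,000
rng = np.random.default_rng(0)
idx = rng.permutation(len(train.data))
unlabeled_idx = idx[:n_unlabeled]
labeled_idx = idx[n_unlabeled:n_unlabeled + n_labeled]

# Vocabulary built from the labeled + unlabeled documents, then reused for the test set.
vectorizer = CountVectorizer()
vectorizer.fit([train.data[i] for i in idx[:n_unlabeled + n_labeled]])
X_labeled = vectorizer.transform([train.data[i] for i in labeled_idx])
y_labeled = train.target[labeled_idx]
X_unlabeled = vectorizer.transform([train.data[i] for i in unlabeled_idx])
X_test, y_test = vectorizer.transform(test.data), test.target
```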

  48. Datasets: WebKB - General Information • Available at http://www.cs.cmu.edu/~webkb/ • 8,145 web pages from CS departments • Categories: student, faculty, staff, course, project, department, other

  49. Datasets: WebKB - In this work • Only the four main categories (those with more data) are used: 4,199 pages • Numbers are converted to either a time token or a phone-number token • Did not perform stemming or use a stoplist -- they showed this actually hurts performance • Vocabulary is limited to the 300 most informative words; this vocabulary size was selected empirically • Test using the leave-one-university-out approach: one complete CS department's data • 2,500 pages randomly selected from the rest: unlabeled set • Train set: same as before

  50. Datasets: Reuters - General Information • Available at http://www.daviddlewis.com/resources/testcollections/reuters21578/ • 12,902 articles • 90 categories from the Reuters newswire
