Text Classification from Labeled and Unlabeled Documents using EM
Kamal Nigam, Andrew K. McCallum, Sebastian Thrun, Tom Mitchell
Machine Learning (2000)
Presented by Andrew Smith, May 12, 2003

Presentation Outline: Motivation and Background; The Naive Bayes classifier; ...
- Given a set of news articles, automatically find documents on the same topic.
- We would like to require as few labeled documents as possible, since labeling documents by hand is expensive.
- Existing statistical text learning algorithms require many training examples.
- (Lang 1995) A classifier trained on 1000 labeled documents was used to rank unlabeled documents; of the top 10% it ranked, only about 50% were correct.
Can we somehow use unlabeled documents?
- Yes! Unlabeled data provide information about the joint probability distribution.
This is reminiscent of K-Means and EM.
- The data are produced by a mixture model with a one-to-one correspondence between mixture components and classes, c_j \in C = \{c_1, \ldots, c_{|C|}\}; P(d_i | c_j; \theta) is the probability of class j generating document i.
- V = \{w_1, \ldots, w_{|V|}\} is the vocabulary (indexed over t); w_t indicates a word in the vocabulary.
- Documents are ordered word lists; w_{d_i,k} indicates the word at position k in document i.
- The probability of document d_i is
  P(d_i | \theta) = \sum_{j=1}^{|C|} P(c_j | \theta) P(d_i | c_j; \theta)
- The probability of mixture component c_j generating document d_i is:
  P(d_i | c_j; \theta) = P(\langle w_{d_i,1}, \ldots, w_{d_i,|d_i|} \rangle | c_j; \theta)
The Naive Bayes assumption: The words of a document are generated independently of their order in the document, given the class.
Now the probability of a document given its class becomes
  P(d_i | c_j; \theta) = \prod_{k=1}^{|d_i|} P(w_{d_i,k} | c_j; \theta)
We can use Bayes Rule to classify documents: find the class with highest probability given a novel document.
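As a rough illustration of this classification rule (not the authors' code), here is a minimal Python sketch; the bag-of-words representation and the names doc_word_counts, log_priors, and log_word_probs are assumptions for the example:

```python
import numpy as np

def classify(doc_word_counts, log_priors, log_word_probs):
    """Pick the most probable class for one document via Bayes rule.

    doc_word_counts: array of shape (|V|,), word counts N(w_t, d_i)
    log_priors:      array of shape (|C|,), log P(c_j | theta)
    log_word_probs:  array of shape (|C|, |V|), log P(w_t | c_j; theta)
    """
    # log P(c_j | theta) + sum_t N(w_t, d_i) * log P(w_t | c_j; theta)
    log_posteriors = log_priors + log_word_probs @ doc_word_counts
    return int(np.argmax(log_posteriors))
```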
To learn the parameters \theta of the classifier, use maximum likelihood (ML): find the most likely set of parameters given the data set.
The two parameters we need to find are the word probability estimates and the mixture weights (class priors), written \hat{\theta}_{w_t | c_j} and \hat{\theta}_{c_j}.
The maximization yields parameters that are smoothed word frequency counts:

\hat{\theta}_{w_t | c_j} = \frac{1 + \sum_{i=1}^{|D|} N(w_t, d_i) P(c_j | d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i) P(c_j | d_i)} = \frac{1 + \text{no. of occurrences of } w_t \text{ in class } j}{|V| + \text{no. of words in class } j}

\hat{\theta}_{c_j} = \frac{1 + \sum_{i=1}^{|D|} P(c_j | d_i)}{|C| + |D|} = \frac{1 + \text{no. of documents in class } j}{|C| + |D|}

Laplace smoothing (the added 1 and the |V| or |C| in the denominators) gives each word a prior probability. Here N(w_t, d_i) is the number of occurrences of word t in document i, and for labeled documents P(c_j | d_i) is 1 if document i is in class j, or 0 otherwise.
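A minimal sketch of these estimates in Python, assuming a |D| x |V| word-count matrix and a |D| x |C| matrix of memberships P(c_j | d_i) (hard 0/1 values for labeled documents); the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def m_step(counts, resp):
    """Laplace-smoothed Naive Bayes parameter estimates.

    counts: (n_docs, vocab_size) word counts N(w_t, d_i)
    resp:   (n_docs, n_classes) class memberships P(c_j | d_i)
    Returns (log class priors, log word probabilities).
    """
    n_docs, vocab_size = counts.shape
    n_classes = resp.shape[1]

    # theta_{w_t|c_j}: (1 + weighted count of w_t in class j) /
    #                  (|V| + weighted count of all words in class j)
    word_counts = resp.T @ counts                      # (n_classes, vocab_size)
    word_probs = (1.0 + word_counts) / (vocab_size + word_counts.sum(axis=1, keepdims=True))

    # theta_{c_j}: (1 + weighted number of docs in class j) / (|C| + |D|)
    priors = (1.0 + resp.sum(axis=0)) / (n_classes + n_docs)
    return np.log(priors), np.log(word_probs)
```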
Using Bayes Rule:
  P(c_j | d_i; \hat{\theta}) = \frac{P(c_j | \hat{\theta}) P(d_i | c_j; \hat{\theta})}{P(d_i | \hat{\theta})} = \frac{P(c_j | \hat{\theta}) \prod_{k=1}^{|d_i|} P(w_{d_i,k} | c_j; \hat{\theta})}{\sum_{r=1}^{|C|} P(c_r | \hat{\theta}) \prod_{k=1}^{|d_i|} P(w_{d_i,k} | c_r; \hat{\theta})}
Let D^u be the set of unlabeled documents and D^l the set of labeled documents, with D = D^u \cup D^l. The probability of all the data is:

P(D | \theta) = \prod_{d_i \in D^u} \sum_{j=1}^{|C|} P(c_j | \theta) P(d_i | c_j; \theta) \times \prod_{d_i \in D^l} P(c_{y_i} | \theta) P(d_i | c_{y_i}; \theta)

where c_{y_i} denotes the known class of labeled document d_i. For unlabeled data, the contribution to the probability is a sum across all mixture components.

It is easier to maximize the log-likelihood:

l(\theta | D) = \sum_{d_i \in D^u} \log \sum_{j=1}^{|C|} P(c_j | \theta) P(d_i | c_j; \theta) + \sum_{d_i \in D^l} \log \left( P(c_{y_i} | \theta) P(d_i | c_{y_i}; \theta) \right)
This contains a log of sums, which makes maximization intractable.
Suppose we have access to the labels for the unlabeled documents, expressed as a matrix of indicator variables z, where z_{ij} = 1 if document i is in class j, and 0 otherwise (so rows are documents and columns are classes). Then the terms of the log-likelihood
are nonzero only when z_{ij} = 1, and we can treat the labeled and unlabeled documents the same.
The complete log-likelihood becomes:

l_c(\theta | D, z) = \sum_{d_i \in D} \sum_{j=1}^{|C|} z_{ij} \log \left( P(c_j | \theta) P(d_i | c_j; \theta) \right)
If we replace z with its expected value according to the current classifier, then this equation bounds the exact log-likelihood from below, so iteratively increasing it will increase the log-likelihood.
This leads to the basic algorithm: build an initial Naive Bayes classifier from the labeled documents only, then iterate an E-step (use the current classifier to probabilistically label the unlabeled documents) and an M-step (re-estimate the classifier parameters from all documents) until convergence.
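A compact Python sketch of this loop, reusing the hypothetical m_step from the earlier snippet (all names here are illustrative, not from the paper):

```python
import numpy as np

def e_step(counts, log_priors, log_word_probs):
    """P(c_j | d_i) for every document, via Bayes rule on log scores."""
    scores = counts @ log_word_probs.T + log_priors          # (n_docs, n_classes)
    scores -= scores.max(axis=1, keepdims=True)              # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum(axis=1, keepdims=True)

def basic_em(counts_l, labels_l, counts_u, n_classes, n_iters=10):
    """Semi-supervised Naive Bayes via EM (basic algorithm sketch)."""
    resp_l = np.eye(n_classes)[labels_l]                     # hard labels as 0/1 memberships
    # Initialize the classifier from the labeled documents only.
    log_priors, log_word_probs = m_step(counts_l, resp_l)
    for _ in range(n_iters):
        # E-step: probabilistically label the unlabeled documents.
        resp_u = e_step(counts_u, log_priors, log_word_probs)
        # M-step: re-estimate parameters from all documents.
        log_priors, log_word_probs = m_step(
            np.vstack([counts_l, counts_u]), np.vstack([resp_l, resp_u]))
    return log_priors, log_word_probs
```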
20 Newsgroups data set: [results figure]
WebKB data set: [results figure]
Most probable words in the WebKB "course" class as EM iterates (D indicates an arbitrary digit):

Iteration 0      Iteration 1    Iteration 2
intelligence     DD             D
DD               D              DD
artificial       lecture        lecture
understanding    cc             cc
DDw              D*             DD:DD
dist             DD:DD          due
identical        handout        D*
rus              due            homework
arrange          problem        assignment
games            set            handout
dartmouth        tay            set
natural          DDam           hw
cognitive        yurttas        exam
logic            homework       problem
proving          kkfoury        DDam
prolog           sec            postscript
knowledge        postscript     solution
human            exam           quiz
representation   solution       chapter
field            assaf          ascii
Suppose you have a few labeled documents and many more unlabeled documents.
Then the algorithm almost becomes unsupervised clustering! The only function of the labeled data is to assign class labels to the mixture components.
When the mixture-model assumptions are not true, the basic algorithm will find components that don’t correspond to different class labels.
Modulate the influence of unlabeled data with a parameter λ ∈ [0, 1] and maximize the weighted log-likelihood:

l(\theta | D) = \sum_{d_i \in D^l} \log \left( P(c_{y_i} | \theta) P(d_i | c_{y_i}; \theta) \right) + \lambda \sum_{d_i \in D^u} \log \sum_{j=1}^{|C|} P(c_j | \theta) P(d_i | c_j; \theta)

The E-step is exactly as before: assign probabilistic class labels to the unlabeled documents.
The M-step is modified to reflect λ: define a per-document weight \Lambda(i) that equals λ for unlabeled documents and 1 for labeled documents, and use it as a weighting factor to modify the frequency counts.
The new Naive Bayes parameter estimates become

\hat{\theta}_{w_t | c_j} = \frac{1 + \sum_{i=1}^{|D|} \Lambda(i) N(w_t, d_i) P(c_j | d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} \Lambda(i) N(w_s, d_i) P(c_j | d_i)}

\hat{\theta}_{c_j} = \frac{1 + \sum_{i=1}^{|D|} \Lambda(i) P(c_j | d_i)}{|C| + \sum_{i=1}^{|D|} \Lambda(i)}

where P(c_j | d_i) is the probabilistic class assignment and the denominators sum over all words and documents.
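As a rough sketch (again with illustrative names), the earlier hypothetical m_step can be adapted by scaling each document's contribution by Λ(i) before counting:

```python
import numpy as np

def m_step_lambda(counts, resp, is_unlabeled, lam):
    """EM-lambda M-step sketch: down-weight unlabeled documents by lam.

    counts:       (n_docs, vocab_size) word counts
    resp:         (n_docs, n_classes) class memberships P(c_j | d_i)
    is_unlabeled: (n_docs,) boolean mask
    lam:          weight lambda in [0, 1] for unlabeled documents
    """
    weights = np.where(is_unlabeled, lam, 1.0)        # Lambda(i)
    weighted_resp = resp * weights[:, None]           # scale each document's memberships

    n_docs, vocab_size = counts.shape
    n_classes = resp.shape[1]

    word_counts = weighted_resp.T @ counts
    word_probs = (1.0 + word_counts) / (vocab_size + word_counts.sum(axis=1, keepdims=True))
    priors = (1.0 + weighted_resp.sum(axis=0)) / (n_classes + weights.sum())
    return np.log(priors), np.log(word_probs)
```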
EM-λ reduced the effects of violated assumptions with the λ parameter.
Alternatively, we can change our assumptions. Specifically, change the requirement of a one-to-one correspondence between classes and mixture components to a many-to-one correspondence.
For textual data, this corresponds to saying that a class may consist of several different sub-topics, each best characterized by a different word distribution.
c_j now represents only a mixture component, not a class.
t_a represents the a-th class ("topic").
A pre-specified many-to-one mapping assigns mixture components to classes.
This assignment is pre-determined, deterministic, and permanent; once assigned to a particular class, mixture components do not change assignment.
M-step: same as before, find estimates for the mixture components using Laplace priors (MAP).
- For unlabeled documents, calculate the probabilistic mixture component memberships exactly as before.
- For labeled documents, we previously considered P(c_j | d_i) to be a fixed indicator (0 or 1) of class membership. Now we allow it to vary between 0 and 1 for mixture components in the same class as d_i, and we set it to zero for mixture components belonging to classes other than the one containing d_i.
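A small sketch of that labeled-document E-step under the many-to-one mapping, reusing the earlier hypothetical e_step; component_to_class is an assumed array giving each mixture component's class:

```python
import numpy as np

def labeled_resp_many_to_one(counts, labels, log_priors, log_word_probs, component_to_class):
    """Component memberships for labeled documents: zero outside the
    document's class, renormalized over components inside it.

    labels:             (n_docs,) class index of each labeled document
    component_to_class: (n_components,) class index of each mixture component
    """
    resp = e_step(counts, log_priors, log_word_probs)          # (n_docs, n_components)
    # Mask out components whose class differs from the document's label.
    allowed = (component_to_class[None, :] == labels[:, None])
    resp = np.where(allowed, resp, 0.0)
    # Renormalize over the remaining components.
    return resp / resp.sum(axis=1, keepdims=True)
```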
Reuters-21578 (Distribution 1.0) data set:
To evaluate the performance, define the two quantities

Recall = True Pos. / (True Pos. + False Neg.)
Precision = True Pos. / (True Pos. + False Pos.)
The recall-precision breakeven point is the value when the two quantities are equal.
The breakeven point is used instead of accuracy (fraction correctly classified) because the data sets have a much higher frequency of negative examples, so a classifier could achieve high accuracy simply by always predicting negative.
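For illustration only, one simple way to locate a recall-precision breakeven point is to sweep a score threshold and take the point where the two values are closest; this is an assumption about the procedure, not the paper's exact protocol:

```python
import numpy as np

def breakeven_point(scores, y_true):
    """Approximate recall-precision breakeven by sweeping a threshold.

    scores: (n,) classifier scores for the positive class
    y_true: (n,) binary labels (1 = positive)
    """
    best = None
    for t in np.sort(scores):
        pred = scores >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        if tp == 0:
            continue
        recall, precision = tp / (tp + fn), tp / (tp + fp)
        gap = abs(recall - precision)
        if best is None or gap < best[0]:
            best = (gap, (recall + precision) / 2.0)
    return best[1] if best else 0.0
```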
Results on Reuters (figures and tables omitted):
- Naive Bayes (multiple components)
- EM (multiple components)
- Using different numbers of mixture components
- Naive Bayes with different numbers of mixture components
- Using cross-validation or best-EM to select the number of mixture components