Text Classification from Labeled and Unlabeled Documents using EM




### Text Classification from Labeled and Unlabeled Documents using EM

Machine Learning (2000)

Kamal Nigam

Andrew K. McCallum

Sebastian Thrun

Tom Mitchell

Presented by Andrew Smith, May 12, 2003

Presentation Outline
• Motivation and Background
• The Naive Bayes classifier
• Incorporating unlabeled data with EM (basic algorithm)
• Enhancement 1 – Modulating the influence of the unlabeled data
• Enhancement 2 – A different probabilistic model
• Conclusions
Motivation

- Given a set of news articles, automatically find documents on the same topic.

- We would like to require as few labeled documents as possible, since labeling documents by hand is expensive.

Previous work

The problem:

- Existing statistical text learning algorithms require many training examples.

- (Lang 1995) A classifier with 1000 training documents ranked unlabeled documents. Of the top 10% only about 50% were correct.

Motivation

Can we somehow use unlabeled documents?

- Yes! Unlabeled data provide information about the joint probability distribution.

Algorithm Outline
1. Train a classifier with only the labeled documents.
2. Use it to probabilistically classify the unlabeled documents.
3. Use ALL the documents to train a new classifier.
4. Iterate steps 2 and 3 to convergence.

This is reminiscent of K-Means and EM.

Probabilistic Framework

Assumptions:

• The data are produced by a mixture model.
• There is a one-to-one correspondence between mixture components and document classes.

Notation:

• Mixture components and class labels: $c_j \in C = \{c_1, \ldots, c_{|C|}\}$
• Documents: $d_i \in D = \{d_1, \ldots, d_{|D|}\}$
• Indicator variables: $y_i = c_j$. This statement means the $i$th document belongs to class $j$.
Probabilistic Framework (2)

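• Mixture weights: $P(c_j \mid \theta)$
• Probability of class $j$ generating document $i$: $P(d_i \mid c_j; \theta)$
• The vocabulary (indexed over $t$): $V = \langle w_1, \ldots, w_{|V|} \rangle$, where $w_t$ indicates a word in the vocabulary.
• Documents are ordered word lists: $w_{d_i,k}$ indicates the word at position $k$ in document $i$.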

Probabilistic Framework (3)

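The probability of document $d_i$ is a sum over the mixture components:

$$P(d_i \mid \theta) = \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta)$$

The probability of mixture component $c_j$ generating document $d_i$ is:

$$P(d_i \mid c_j; \theta) = P(|d_i|) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j; \theta; w_{d_i,q},\, q < k)$$

Without further assumptions, each word may depend on all the words that precede it.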

Probabilistic Framework (4)

The Naive Bayes assumption: The words of a document are generated independently of each other and of their position in the document, given the class.

Probabilistic Framework (5)

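Now the probability of a document given its class becomes

$$P(d_i \mid c_j; \theta) = P(|d_i|) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j; \theta)$$

We can use Bayes' rule to classify documents: find the class with the highest probability given a novel document:

$$P(c_j \mid d_i; \hat{\theta}) = \frac{P(c_j \mid \hat{\theta})\, P(d_i \mid c_j; \hat{\theta})}{P(d_i \mid \hat{\theta})}$$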
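
The classification rule can be sketched in a few lines of NumPy. This is a minimal illustration, not the presenter's code: the function name `nb_posterior`, the array layout, and the toy numbers are assumptions. It works in log space, and it drops the document-length term because that term is the same for every class.

```python
import numpy as np

def nb_posterior(counts, log_prior, log_word_prob):
    """Return P(c_j | d_i) for each document (rows) and class (columns).

    counts:        (n_docs, |V|) word counts per document
    log_prior:     (|C|,) log mixture weights, log P(c_j)
    log_word_prob: (|C|, |V|) log word probabilities, log P(w_t | c_j)
    """
    # log P(c_j) + sum_t count(w_t, d_i) * log P(w_t | c_j)
    joint = log_prior + counts @ log_word_prob.T
    # Subtract the row maximum before exponentiating for numerical stability.
    joint = joint - joint.max(axis=1, keepdims=True)
    post = np.exp(joint)
    return post / post.sum(axis=1, keepdims=True)

# Hypothetical two-class, three-word model.
log_prior = np.log(np.array([0.5, 0.5]))
log_word_prob = np.log(np.array([[0.7, 0.2, 0.1],
                                 [0.1, 0.2, 0.7]]))
doc = np.array([[3, 1, 0]])  # word 0 three times, word 1 once

posterior = nb_posterior(doc, log_prior, log_word_prob)
predicted = posterior.argmax(axis=1)  # index of the most probable class
```

Here the document is dominated by word 0, which class 0 favors, so the posterior mass falls almost entirely on class 0.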

Probabilistic Framework (6)

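To learn the parameters $\theta$ of the classifier, use maximum a posteriori (MAP) estimation; find the most probable set of parameters given the data set:

$$\hat{\theta} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} P(D \mid \theta)\, P(\theta)$$

The two sets of parameters we need to find are the word probability estimates and the mixture weights, written

• $\theta_{w_t \mid c_j} \equiv P(w_t \mid c_j; \theta)$ and $\theta_{c_j} \equiv P(c_j \mid \theta)$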
Probabilistic Framework (6)

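The maximization yields parameters that are smoothed word frequency counts:

$$\hat{\theta}_{w_t \mid c_j} = \frac{1 + \text{No. of occurrences of } w_t \text{ in class } j}{|V| + \text{No. of words in class } j}$$

$$\hat{\theta}_{c_j} = \frac{1 + \text{No. of documents in class } j}{|C| + |D|}$$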

Laplace smoothing gives each word a prior probability.

Probabilistic Framework (7)

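Formally:

$$\hat{\theta}_{w_t \mid c_j} = \frac{1 + \sum_{i=1}^{|D|} N(w_t, d_i)\, P(y_i = c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i)\, P(y_i = c_j \mid d_i)}$$

$$\hat{\theta}_{c_j} = \frac{1 + \sum_{i=1}^{|D|} P(y_i = c_j \mid d_i)}{|C| + |D|}$$

where $N(w_t, d_i)$ is the number of occurrences of word $t$ in document $i$, and $P(y_i = c_j \mid d_i)$ is 1 if document $i$ is in class $j$, or 0 otherwise.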
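
The smoothed count estimates can be sketched as follows. This is a minimal sketch under assumptions: the name `estimate_nb`, the dense count matrix, and the toy data are illustrative and not taken from the slides. Class membership is passed as a matrix with one row per document, so the same function also accepts fractional memberships.

```python
import numpy as np

def estimate_nb(counts, resp):
    """Laplace-smoothed naive Bayes estimates.

    counts: (n_docs, |V|) word counts per document
    resp:   (n_docs, |C|) class membership of each document
            (one-hot rows for labeled documents)
    """
    n_docs, vocab_size = counts.shape
    n_classes = resp.shape[1]
    # Entry (j, t): membership-weighted count of word t in class j.
    word_counts = resp.T @ counts
    word_prob = (1.0 + word_counts) / (
        vocab_size + word_counts.sum(axis=1, keepdims=True))
    class_prior = (1.0 + resp.sum(axis=0)) / (n_classes + n_docs)
    return class_prior, word_prob

# Hypothetical corpus: three documents, three vocabulary words, two classes.
counts = np.array([[2, 0, 1],
                   [1, 1, 0],
                   [0, 3, 1]])
labels = np.array([0, 0, 1])
resp = np.eye(2)[labels]  # one-hot membership matrix

class_prior, word_prob = estimate_nb(counts, resp)
```

For class 0 the raw word counts are (3, 1, 1) over 5 words, so the smoothed estimates are (4/8, 2/8, 2/8). Each row of `word_prob` sums to 1 by construction.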

Application of EM to NB
1. Estimate $\hat{\theta}$ with only labeled data.
2. Assign probabilistically weighted class labels to unlabeled data.
3. Use all class labels (given and estimated) to find new parameters $\hat{\theta}$.
4. Repeat 2 and 3 until $\hat{\theta}$ does not change.
More Notation

$D^u$: set of unlabeled documents

$D^l$: set of labeled documents

Deriving the basic Algorithm (1)

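The probability of all the data is:

$$P(D \mid \theta) = \prod_{d_i \in D^u} \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) \times \prod_{d_i \in D^l} P(y_i = c_j \mid \theta)\, P(d_i \mid y_i = c_j; \theta)$$

For unlabeled data, the component of the probability is a sum across all mixture components.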

Deriving the basic Algorithm (2)

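Easier to maximize the log-likelihood:

$$\log P(D \mid \theta) = \sum_{d_i \in D^u} \log \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) + \sum_{d_i \in D^l} \log \bigl( P(y_i = c_j \mid \theta)\, P(d_i \mid y_i = c_j; \theta) \bigr)$$

The first term contains a log of sums, which makes maximization by setting partial derivatives to zero intractable.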

Deriving the basic Algorithm (3)

Suppose we have access to the labels for the unlabeled documents, expressed as a matrix of indicator variables $z$, where $z_{ij} = 1$ if document $i$ is in class $j$, and 0 otherwise (so rows are documents and columns are classes). Then the terms of

$$\sum_{j=1}^{|C|} z_{ij} \log \bigl( P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) \bigr)$$

are nonzero only when $z_{ij} = 1$; we treat the labeled and unlabeled documents the same.

Deriving the basic Algorithm (4)

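The complete log-likelihood becomes:

$$\log P(D, z \mid \theta) = \sum_{d_i \in D} \sum_{j=1}^{|C|} z_{ij} \log \bigl( P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) \bigr)$$

If we replace $z$ with its expected value according to the current classifier, then this equation bounds the exact log-likelihood from below, so iteratively maximizing it never decreases the log-likelihood.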
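
The lower-bound claim follows from Jensen's inequality. A short derivation for a single unlabeled document is sketched below, under one notational assumption: $q_{ij}$ (a symbol introduced here, not in the slides) denotes the expected value of $z_{ij}$ under the current classifier, with $\sum_j q_{ij} = 1$, and the sums run over components with $q_{ij} > 0$.

```latex
\begin{aligned}
\log \sum_{j} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta)
  &= \log \sum_{j} q_{ij}\,
     \frac{P(c_j \mid \theta)\, P(d_i \mid c_j; \theta)}{q_{ij}} \\
  &\ge \sum_{j} q_{ij} \log
     \frac{P(c_j \mid \theta)\, P(d_i \mid c_j; \theta)}{q_{ij}}
     && \text{(Jensen; } \log \text{ is concave)} \\
  &= \sum_{j} q_{ij} \log \bigl( P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) \bigr)
     - \sum_{j} q_{ij} \log q_{ij} \\
  &\ge \sum_{j} q_{ij} \log \bigl( P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) \bigr)
     && \text{(since } -\textstyle\sum_j q_{ij} \log q_{ij} \ge 0 \text{)}
\end{aligned}
```

The entropy term does not depend on $\theta$, so maximizing the last line over $\theta$ is the same as maximizing the tighter bound in the line before it.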

Deriving the basic Algorithm (5)

This leads to the basic algorithm:

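E-step: $\hat{z}^{(k+1)} = E[z \mid D; \hat{\theta}^{(k)}]$ (use the current classifier to compute class posteriors for the unlabeled documents)

M-step: $\hat{\theta}^{(k+1)} = \arg\max_{\theta} P(\theta \mid D; \hat{z}^{(k+1)})$ (re-estimate the parameters from all documents with the smoothed count formulas)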
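
The basic algorithm can be sketched end to end as below. This is a minimal sketch under assumptions, not the authors' implementation: the function names, the convergence test on the change in log word probabilities, and the toy data are all illustrative choices.

```python
import numpy as np

def m_step(counts, resp):
    """Laplace-smoothed parameter estimates from (possibly soft) memberships."""
    vocab_size = counts.shape[1]
    n_docs, n_classes = resp.shape
    word_counts = resp.T @ counts
    word_prob = (1.0 + word_counts) / (
        vocab_size + word_counts.sum(axis=1, keepdims=True))
    prior = (1.0 + resp.sum(axis=0)) / (n_classes + n_docs)
    return np.log(prior), np.log(word_prob)

def e_step(counts, log_prior, log_word_prob):
    """Class posteriors for each document under the current parameters."""
    joint = log_prior + counts @ log_word_prob.T
    joint = joint - joint.max(axis=1, keepdims=True)
    post = np.exp(joint)
    return post / post.sum(axis=1, keepdims=True)

def em_naive_bayes(x_lab, y_lab, x_unl, n_classes, max_iter=50, tol=1e-6):
    z_lab = np.eye(n_classes)[y_lab]                 # fixed labels
    log_prior, log_word_prob = m_step(x_lab, z_lab)  # labeled data only
    x_all = np.vstack([x_lab, x_unl])
    for _ in range(max_iter):
        z_unl = e_step(x_unl, log_prior, log_word_prob)  # E-step
        new_prior, new_word = m_step(x_all, np.vstack([z_lab, z_unl]))  # M-step
        change = np.abs(new_word - log_word_prob).max()
        log_prior, log_word_prob = new_prior, new_word
        if change < tol:  # parameters have stopped changing
            break
    return log_prior, log_word_prob

# Hypothetical data: two words, two classes, two labeled documents.
x_lab = np.array([[3, 0], [0, 3]])
y_lab = np.array([0, 1])
x_unl = np.array([[4, 1], [5, 0], [1, 4], [0, 5]])

log_prior, log_word_prob = em_naive_bayes(x_lab, y_lab, x_unl, n_classes=2)
pred = e_step(np.array([[2, 0], [0, 2]]), log_prior, log_word_prob).argmax(axis=1)
```

The labeled documents keep their given labels in every iteration; only the memberships of the unlabeled documents are re-estimated.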

Data sets

20 Newsgroups data set:

• 20017 articles drawn evenly from 20 newsgroups
• Many categories fall into confusable clusters.
• Words from a stoplist of common short words are removed.
• 62258 unique words occurring more than once
• Word counts of documents are scaled so each document has the same length.
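
The length scaling in the last bullet could look like the sketch below. The target length of 100 and the name `normalize_length` are assumptions made for illustration; the slides do not state the constant that was used.

```python
import numpy as np

def normalize_length(counts, target_length=100.0):
    """Scale each row of word counts so that it sums to target_length."""
    lengths = counts.sum(axis=1, keepdims=True)
    # Leave empty documents as all zeros instead of dividing by zero.
    safe = np.where(lengths == 0, 1, lengths)
    return counts * (target_length / safe)

counts = np.array([[1, 1],
                   [30, 10]])
scaled = normalize_length(counts)
```

After scaling, a 2-word document and a 40-word document contribute the same total count mass to the parameter estimates.
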
Data sets

WebKB data set:

• 4199 web pages from university CS departments
• Divided into four categories (student, faculty, course, project).
• No stoplist or stemming used.
• Only 300 most informative words used (mutual information with class variable).
• Validation with a leave-one-university-out approach to prevent idiosyncrasies of particular universities from inflating success measures.
Predictive words found with EM

| Iteration 0 | Iteration 1 | Iteration 2 |
|---|---|---|
| Intelligence | DD | D |
| DD | D | DD |
| artificial | lecture | lecture |
| understanding | cc | cc |
| DDw | D* | DD:DD |
| dist | DD:DD | due |
| identical | handout | D* |
| rus | due | homework |
| arrange | problem | assignment |
| games | set | handout |
| dartmouth | tay | set |
| natural | DDam | hw |
| cognitive | yurttas | exam |
| logic | homework | problem |
| proving | kkfoury | DDam |
| prolog | sec | postscript |
| knowledge | postscript | solution |
| human | exam | quiz |
| representation | solution | chapter |
| field | assaf | ascii |

The problem

Suppose you have a few labeled documents and many more unlabeled documents.

Then the algorithm almost becomes unsupervised clustering! The only function of the labeled data is to assign class labels to the mixture components.

When the mixture-model assumptions are not true, the basic algorithm will find components that don’t correspond to different class labels.

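The solution: EM-λ

Modulate the influence of the unlabeled data with a parameter $\lambda$ and maximize

$$\underbrace{\sum_{d_i \in D^l} \sum_{j=1}^{|C|} z_{ij} \log \bigl( P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) \bigr)}_{\text{labeled documents}} + \lambda \underbrace{\sum_{d_i \in D^u} \sum_{j=1}^{|C|} z_{ij} \log \bigl( P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) \bigr)}_{\text{unlabeled documents}}$$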

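EM-λ

The E-step is exactly as before: assign probabilistic class labels.

The M-step is modified to reflect $\lambda$.

Define:

$$\Lambda(i) = \begin{cases} \lambda & \text{if } d_i \in D^u \\ 1 & \text{if } d_i \in D^l \end{cases}$$

as a weighting factor to modify the frequency counts.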

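EM-λ

The new NB parameter estimates become

$$\hat{\theta}_{w_t \mid c_j} = \frac{1 + \sum_{i=1}^{|D|} \Lambda(i)\, N(w_t, d_i)\, P(y_i = c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} \Lambda(i)\, N(w_s, d_i)\, P(y_i = c_j \mid d_i)}$$

$$\hat{\theta}_{c_j} = \frac{1 + \sum_{i=1}^{|D|} \Lambda(i)\, P(y_i = c_j \mid d_i)}{|C| + |D^l| + \lambda |D^u|}$$

where $\Lambda(i)$ is the weight ($\lambda$ for unlabeled documents, 1 for labeled ones), $N(w_t, d_i)$ is the word count, $P(y_i = c_j \mid d_i)$ is the probabilistic class assignment, and the denominator of the first estimate sums over all words and documents.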
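
The weighted M-step can be sketched as below. This is a minimal sketch: the name `weighted_m_step`, the boolean mask marking unlabeled documents, and the toy numbers are assumptions for illustration. With a weight of 0 the unlabeled documents are ignored, and with a weight of 1 the update matches the basic algorithm.

```python
import numpy as np

def weighted_m_step(counts, resp, is_unlabeled, lam):
    """M-step in which unlabeled documents are down-weighted by lam.

    counts:       (n_docs, |V|) word counts
    resp:         (n_docs, |C|) class memberships (one-hot for labeled docs)
    is_unlabeled: (n_docs,) boolean mask
    lam:          weight applied to unlabeled documents
    """
    weight = np.where(is_unlabeled, lam, 1.0)  # per-document weight
    weighted_resp = resp * weight[:, None]
    vocab_size = counts.shape[1]
    n_classes = resp.shape[1]
    word_counts = weighted_resp.T @ counts
    word_prob = (1.0 + word_counts) / (
        vocab_size + word_counts.sum(axis=1, keepdims=True))
    # weight.sum() is the effective document count: labeled + lam * unlabeled.
    class_prior = (1.0 + weighted_resp.sum(axis=0)) / (n_classes + weight.sum())
    return class_prior, word_prob

# Hypothetical data: two labeled documents and one unlabeled document.
counts = np.array([[2, 0], [0, 2], [4, 0]])
resp = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
is_unlabeled = np.array([False, False, True])

prior_half, words_half = weighted_m_step(counts, resp, is_unlabeled, lam=0.5)
prior_zero, words_zero = weighted_m_step(counts, resp, is_unlabeled, lam=0.0)
```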

The idea

EM-λ reduced the effects of violated assumptions with the λ parameter.

Alternatively, we can change our assumptions. Specifically, change the requirement of a one-to-one correspondence between classes and mixture components to a many-to-one correspondence.

For textual data, this corresponds to saying that a class may consist of several different sub-topics, each best characterized by a different word distribution.

More Notation

$c_j$ now represents only mixture components, not classes.

$t_a$ represents the $a$th class (“topic”).

$P(t_a \mid c_j; \theta) \in \{0, 1\}$ is the assignment of mixture components to classes.

This assignment is pre-determined, deterministic, and permanent; once assigned to a particular class, mixture components do not change assignment.

The Algorithm

M-step: same as before, find estimates for the mixture components using Laplace priors (MAP).

E-step:

- For unlabeled documents, calculate the probabilistic mixture component memberships exactly as before.

- For labeled documents, we previously considered $z_{ij}$ to be a fixed indicator (0 or 1) of class membership. Now we allow it to vary between 0 and 1 for mixture components in the same class as $d_i$. We set it to zero for mixture components belonging to classes other than the one containing $d_i$.

Algorithm details
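• Initialize the mixture components for each class by randomly setting $P(c_j \mid d_i; \hat{\theta})$ for components in the correct class.
• Documents are classified by summing up the mixture component probabilities of one class to form a class probability: $P(t_a \mid d_i; \hat{\theta}) = \sum_{j:\, c_j \text{ assigned to } t_a} P(c_j \mid d_i; \hat{\theta})$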
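
The summation in the last bullet can be sketched as follows. The name `class_posterior`, the list encoding of the component-to-class assignment, and the toy posteriors are assumptions for illustration.

```python
import numpy as np

def class_posterior(component_post, component_to_class, n_classes):
    """Sum component posteriors within each class.

    component_post:     (n_docs, n_components) posterior of each component
    component_to_class: class index for each component (fixed in advance)
    """
    n_docs = component_post.shape[0]
    class_post = np.zeros((n_docs, n_classes))
    for j, a in enumerate(component_to_class):
        class_post[:, a] += component_post[:, j]
    return class_post

# Hypothetical setup: components 0 and 1 belong to class 0, component 2 to class 1.
component_post = np.array([[0.5, 0.25, 0.25],
                           [0.1, 0.2, 0.7]])
component_to_class = [0, 0, 1]

class_post = class_posterior(component_post, component_to_class, n_classes=2)
predicted = class_post.argmax(axis=1)
```

Because every component belongs to exactly one class, each row of the result still sums to 1.
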
Another data set

Reuters (21578 Distribution 1.0) data set:

• 12902 news articles in 90 topics from Reuters newswire; only the ten most populous classes are used.
• No stemming used.
• Documents are split into early and late categories (by date). The task is to predict the topics of the later articles with classifiers trained on the early ones.
• For all experiments on Reuters, 10 binary classifiers are trained – one per topic.
Performance Metrics

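To evaluate the performance, define the two quantities

$$\text{Recall} = \frac{\text{True Pos.}}{\text{True Pos.} + \text{False Neg.}} \qquad \text{Precision} = \frac{\text{True Pos.}}{\text{True Pos.} + \text{False Pos.}}$$

| | Actual Pos. | Actual Neg. |
|---|---|---|
| Predicted Pos. | True Pos. | False Pos. |
| Predicted Neg. | False Neg. | True Neg. |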

The recall-precision breakeven point is the value when the two quantities are equal.

The breakeven point is used instead of accuracy (fraction correctly classified). Because the data sets have a much higher frequency of negative examples, the classifier could achieve high accuracy by always predicting negative.
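
One way to compute a breakeven point is sketched below. It rests on an observation rather than on anything stated in the slides: if the classifier predicts positive for exactly as many documents as there are actual positives, then false positives equal false negatives, so precision equals recall. The function names and toy scores are assumptions, and the original experiments may have located the breakeven point differently, for example by interpolation.

```python
import numpy as np

def precision_recall(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

def breakeven_point(y_true, scores):
    """Precision and recall when the top-k scores are predicted positive,
    with k equal to the number of actual positives."""
    n_pos = int(y_true.sum())
    order = np.argsort(-scores)       # highest score first
    y_pred = np.zeros_like(y_true)
    y_pred[order[:n_pos]] = 1
    return precision_recall(y_true, y_pred)

# Hypothetical skewed data: 2 positives among 10 documents.
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.3, 0.2, 0.1, 0.1, 0.05, 0.05, 0.02, 0.01])

precision, recall = breakeven_point(y_true, scores)
always_negative_accuracy = np.mean(y_true == 0)
```

On this toy data a classifier that always predicts negative is 80% accurate, while the breakeven point of the scored classifier is only 0.5, which illustrates why accuracy is misleading on skewed classes.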

Classification of Reuters (breakeven points)

• [Table] Columns: Naive Bayes (multiple components), EM (multiple components), Naive Bayes, EM
• [Table] Using different numbers of mixture components
• [Table] Naive Bayes with different numbers of mixture components
• [Table] Using cross-validation or best-EM to select the number of mixture components

Conclusions
• Cross-validation tends to underestimate the best number of mixture components.
• Incorporating unlabeled data into any classifier is important because of the high cost of hand-labeling documents.
• Classifiers based on generative models that make incorrect assumptions can still achieve high accuracy.
• The new algorithm does not produce binary classifiers that are much better than NB.