
Text Classification from Labeled and Unlabeled Documents using EM






Presentation Transcript


  1. Text Classification from Labeled and Unlabeled Documents using EM • Kamal Nigam • Andrew Kachites McCallum • Sebastian Thrun • Tom Mitchell Presented by Yuan Fang, Fengyuan Hu and Sandhya Prabhakaran

  2. Job Hunting?

  3. Roadmap • Part 1 – Text Classification • Part 2 – Incorporating Unlabeled data with EM • Part 3 – Results and Recap

  4. Part I Text Classification

  5. Text Classification – the Definition • “Text classification systems categorize documents into one (or several) of a set of pre-defined topics of interest”

  6. How Are Automatic Text Classifiers Created? • Before: manual construction of rule sets (painful and time-consuming) • Present: supervised learning to construct a classifier (efficient and successful)

  7. What To Provide • Provide an algorithm with an example set of documents for each class, and allow it to find a representation or decision rule for classifying future documents automatically • This approach will: - give high-accuracy classifiers - be significantly less expensive than manual rule construction

  8. What Data is Available • Key difficulty: a large number of labeled training examples is required to learn accurately - what we need but don't have • One would obviously prefer algorithms that can provide accurate classifications after hand-labeling only a dozen articles, rather than thousands • What other sources of information can reduce the need for labeled data?

  9. Unlabeled data • How unlabeled data can be used to increase classification accuracy, especially when labeled data are scarce • An intuitive example

  10. Goal And Merit • The goal - to demonstrate that supervised learning algorithms can use a small number of labeled examples together with a large number of unlabeled examples to create high-accuracy text classifiers • The merit - unlabeled examples are much less expensive and more easily available

  11. Parametric Generative Model Overview • Assumption: a statistical process generates the documents (words and class labels) • This statistical process is modeled as a parametric generative model

  12. Incorporating Unlabeled Data with Generative Models • Using EM to find high-probability parameters of the model given a combination of labeled and unlabeled data • Experimental evidence shows that using unlabeled data with EM can increase classification accuracy

  13. Assumptions In the Model (1) Documents are generated by a mixture-of-multinomials model, where each mixture component corresponds to exactly one class (2) The mixture components are multinomial distributions over individual words - the words are produced independently of each other given the class (the naive Bayes assumption)

  14. Two Multi-sided Dice • Let there be |C| classes and a vocabulary of size |V|; each document d has |d| words in it. • First, we roll a biased |C|-sided die to determine the class of our document. • Then we roll the biased |V|-sided die that corresponds to the chosen class |d| times and write down the indicated words. These words form the generated document.
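This generative process is easy to simulate. Below is a minimal Python sketch of the two-dice story; the toy classes, vocabulary, priors and word distributions are invented for illustration and are not from the paper.

```python
# A minimal sketch of the "two dice" generative process. The toy classes,
# vocabulary, priors and word distributions below are invented examples.
import numpy as np

rng = np.random.default_rng(0)

classes = ["sports", "politics"]          # |C| = 2
vocab = ["ball", "team", "vote", "law"]   # |V| = 4

class_priors = np.array([0.5, 0.5])                # the biased |C|-sided die
word_probs = np.array([[0.50, 0.40, 0.05, 0.05],   # P(w | sports)
                       [0.05, 0.05, 0.50, 0.40]])  # P(w | politics)

def generate_document(length):
    """Roll the class die once, then the chosen class's word die `length` times."""
    j = rng.choice(len(classes), p=class_priors)
    words = rng.choice(vocab, size=length, p=word_probs[j])
    return classes[j], list(words)

label, doc = generate_document(length=8)
print(label, doc)
```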

  15. Parametric Generative Model • $\theta$ - the parameters of the mixture model • $c_j \in C = \{c_1, \ldots, c_{|C|}\}$ - the mixture components • $P(c_j \mid \theta)$ - the mixture weights or class probabilities • $P(d_i \mid c_j; \theta)$ - the document distribution of the selected class • Equation (1): $P(d_i \mid \theta) = \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta)$

  16. Notation • $c_j$ - the $j$th mixture component, as well as the $j$th class • $y_i$ - the class label for a particular document $d_i$ ($y_i = c_j$ means $d_i$ belongs to class $c_j$) • A document $d_i$ is considered to be an ordered list of word events; we write $w_{d_i,k}$ for the word in position $k$ of $d_i$ • $w_t$ - a word in the vocabulary $V$ • $|d_i|$ - the document length, chosen independently of the component from its own distribution

  17. Parametric Generative Model • Expanding Equation (1) with the document length and the words in the document, Equation (2): $P(d_i \mid c_j; \theta) = P(|d_i|)\, \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j; \theta; w_{d_i,q}, q < k)$ • The words of a document are generated independently of context, Equation (3): $P(w_{d_i,k} \mid c_j; \theta; w_{d_i,q}, q < k) = P(w_{d_i,k} \mid c_j; \theta)$ • Combining these last two equations gives the naive Bayes expression for the probability of a document given its class, Equation (4): $P(d_i \mid c_j; \theta) = P(|d_i|)\, \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j; \theta)$

  18. Model Parameters • Collection of word probabilities, each written $\theta_{w_t \mid c_j} = P(w_t \mid c_j; \theta)$ • Document length is identically distributed across classes, so it need not be parameterized for classification • The mixture weights (class probabilities) are denoted $\theta_{c_j} = P(c_j \mid \theta)$ • The complete collection of model parameters is $\theta = \{\theta_{w_t \mid c_j} : w_t \in V, c_j \in C;\ \theta_{c_j} : c_j \in C\}$

  19. Naive Bayes Text Classification • Using a collection of labeled documents for training • Finding the most probable parameters for the statistical model introduced

  20. Training A Naive Bayes Classifier With Labeled Data • Estimating the parameters of the generative model by using a set of labeled training data $D$ (the estimate of the parameters is written $\hat{\theta}$) • Finding $\hat{\theta} = \arg\max_{\theta} P(\theta \mid D)$ (MAP estimation), the value of $\theta$ that is most probable given the evidence of the training data and a prior

  21. Training A Naive Bayes Classifier With Labeled Data • The word probability estimates are given by Equation (6): $\hat{\theta}_{w_t \mid c_j} = P(w_t \mid c_j; \hat{\theta}) = \frac{1 + \sum_{i=1}^{|D|} N(w_t, d_i)\, P(y_i = c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i)\, P(y_i = c_j \mid d_i)}$, where $N(w_t, d_i)$ is the count of word $w_t$ in document $d_i$ • Class probabilities, Equation (7): $\hat{\theta}_{c_j} = P(c_j \mid \hat{\theta}) = \frac{1 + \sum_{i=1}^{|D|} P(y_i = c_j \mid d_i)}{|C| + |D|}$
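The two estimates above are just smoothed relative frequencies. Here is a sketch of them in Python for the fully labeled case, where $P(y_i = c_j \mid d_i)$ is simply 0 or 1; the function and variable names are mine, not the paper's.

```python
# A sketch of the MAP estimates in Equations (6) and (7) for fully labeled
# data (add-one smoothing). Names and data layout are illustrative choices.
import numpy as np

def train_naive_bayes(counts, labels, num_classes):
    """counts: (num_docs, |V|) word-count matrix; labels: class index per doc."""
    num_docs, vocab_size = counts.shape
    theta_w = np.empty((num_classes, vocab_size))  # theta_{w_t | c_j}, Eq. (6)
    theta_c = np.empty(num_classes)                # theta_{c_j}, Eq. (7)
    for j in range(num_classes):
        in_class = (labels == j)                   # P(y_i = c_j | d_i) is 0/1 here
        word_counts = counts[in_class].sum(axis=0)
        theta_w[j] = (1 + word_counts) / (vocab_size + word_counts.sum())
        theta_c[j] = (1 + in_class.sum()) / (num_classes + num_docs)
    return theta_w, theta_c
```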

  22. Classifying New Documents with Naive Bayes • Equation (8): $P(y_i = c_j \mid d_i; \hat{\theta}) = \frac{P(c_j \mid \hat{\theta})\, \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j; \hat{\theta})}{\sum_{r=1}^{|C|} P(c_r \mid \hat{\theta})\, \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_r; \hat{\theta})}$ • If the task is to classify a test document into a single class, then the class with the highest posterior probability is selected.
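A sketch of Equation (8), assuming the `theta_w`/`theta_c` arrays produced by the training sketch above. Working in log space is my implementation choice to avoid underflow on long documents, not something the slide prescribes.

```python
# A sketch of Equation (8): posterior class probabilities for a new document,
# computed in log space for numerical stability.
import numpy as np

def classify(doc_counts, theta_w, theta_c):
    """doc_counts: (|V|,) word counts of one document; returns P(y = c_j | d)."""
    log_post = np.log(theta_c) + doc_counts @ np.log(theta_w.T)
    log_post -= log_post.max()        # shift before exp to avoid underflow
    post = np.exp(log_post)
    return post / post.sum()          # normalize over classes

# predicted class = index of the highest posterior, e.g. np.argmax(classify(...))
```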

  23. Part II Incorporating Unlabeled Data with EM

  24. The Problem • The case of having only labeled data has already been explained • MAP - maximize the posterior probability of the parameters • Naïve Bayes - classify the labeled data • Now the case is that both labeled and unlabeled data are given • Searching for a solution? - Here it is!

  25. Revision of EM • Recall the EM knowledge from PMR - might be painful, but helpful • Mixture model • Hidden variable z indicates which component is active

  26. Revision of EM • EM applied to the Gaussian Mixture Model • Maximum likelihood estimation of the parameters µ and Σ • E step: evaluate the responsibilities using the current parameter estimates • M step: re-estimate the parameters given the current responsibilities • Run the demo
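In place of the live demo, here is a minimal one-dimensional Gaussian-mixture EM loop that shows the E/M pattern; the synthetic data and two-component setup are my own illustration, not the lecture's demo.

```python
# A minimal 1-D Gaussian-mixture EM sketch illustrating the E/M pattern.
# The synthetic data and initial parameters are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

weights = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])
for _ in range(50):
    # E step: responsibilities r[i, k] = P(component k | x_i, current params)
    dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    r = weights * dens
    r /= r.sum(axis=1, keepdims=True)
    # M step: re-estimate weights, means and variances from the responsibilities
    nk = r.sum(axis=0)
    weights = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(weights, mu, sigma)   # should recover roughly (0.5, 0.5), (-2, 3), (1, 1)
```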

  27. Back to the paper

  28. Back to the paper • Collection of labeled and unlabeled documents • MAP - try to maximize P(θ|D) • Bayes' rule: P(θ|D) ∝ P(θ) P(D|θ)

  29. Back to the paper • Log likelihood of the parameters given the data: $l(\theta \mid D) = \log P(\theta) + \sum_{d_i \in D^u} \log \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) + \sum_{d_i \in D^l} \log \big( P(y_i = c_j \mid \theta)\, P(d_i \mid y_i = c_j; \theta) \big)$ • This is an incomplete log likelihood: for the unlabeled documents $D^u$ the class labels are missing, which puts a sum inside the logarithm

  30. Back to the paper • $z$ - binary indicator variables: $z_{ij}$ is set to 1 if $y_i = c_j$, and to zero otherwise • With these indicators, the problem of maximizing the incomplete log likelihood can be transformed into maximizing the expected complete log likelihood of the parameters

  31. Back to the paper • Methods used in the paper • Basic EM • Augmented EM (1) Weighting the unlabeled data (2) Multiple mixture components per class

  32. Basic EM • Initialize the NB classifier using MAP parameter estimation from only the labeled dataset • E step: estimate the component membership of each unlabeled document, i.e. the expected value of its missing class label under the current parameters • M step: re-estimate the classifier over the whole data set using MAP, then loop from the E step • Monitor the complete log likelihood to measure the improvement of the parameters and decide when to stop the loop
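Putting the pieces together, here is a sketch of that loop, reusing the naive Bayes equations above; the matrix layout and the fixed iteration count are my simplifications (the slide stops when the complete log likelihood stops improving).

```python
# A sketch of basic EM for labeled + unlabeled documents. The slide's stopping
# rule monitors the complete log likelihood; a fixed iteration count keeps
# this sketch short. Data layout and names are illustrative choices.
import numpy as np

def m_step(counts, resp):
    """MAP re-estimation (Eqs. 6-7) from soft memberships resp: (num_docs, |C|)."""
    num_docs, vocab_size = counts.shape
    word_counts = resp.T @ counts                  # expected word counts per class
    theta_w = (1 + word_counts) / (vocab_size + word_counts.sum(axis=1, keepdims=True))
    theta_c = (1 + resp.sum(axis=0)) / (resp.shape[1] + num_docs)
    return theta_w, theta_c

def e_step(counts, theta_w, theta_c):
    """Posterior component memberships P(y_i = c_j | d_i; theta), Eq. (8)."""
    log_post = np.log(theta_c) + counts @ np.log(theta_w.T)
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def basic_em(counts_l, labels_l, counts_u, num_classes, iters=10):
    resp_l = np.eye(num_classes)[labels_l]          # known labels stay fixed at 0/1
    theta_w, theta_c = m_step(counts_l, resp_l)     # init from labeled data only
    for _ in range(iters):
        resp_u = e_step(counts_u, theta_w, theta_c) # E: fill in the unlabeled labels
        theta_w, theta_c = m_step(np.vstack([counts_l, counts_u]),
                                  np.vstack([resp_l, resp_u]))  # M: use all data
    return theta_w, theta_c
```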

  33. Restrictions of Basic EM • Assumptions/Restrictions: • Large unlabeled data set, small labeled data set → if this does not hold, the unlabeled data will hurt the accuracy • One-to-one correspondence of components and classes → not so accurate, because subtopics exist

  34. Augmented EM - weighting unlabeled data • Method: weaken the contribution of the unlabeled data when the labeled set alone is already good enough for classification • Equation: $l(\theta \mid D) = \log P(\theta) + \lambda \sum_{d_i \in D^u} \log \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) + \sum_{d_i \in D^l} \log \big( P(y_i = c_j \mid \theta)\, P(d_i \mid y_i = c_j; \theta) \big)$, with $0 \le \lambda \le 1$ scaling the unlabeled contribution

  35. Augmented EM - weighting unlabeled data • λ is decided by leave-one-out cross-validation • A weight Λ(i) is defined to tell whether document i is labeled or unlabeled: Λ(i) = λ for unlabeled documents and 1 for labeled ones • Modified MAP parameters: the counts in Equations (6) and (7) are weighted by Λ(i)
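A sketch of how such a weighted M step could look, building on the `m_step` sketch above; the exact form of the paper's modified estimators is paraphrased here, so treat the details as an assumption.

```python
# A sketch of an EM-lambda style M step: expected counts from unlabeled
# documents are discounted by lam (Lambda(i) = lam), labeled ones keep
# weight 1. The precise estimator form is my paraphrase of the idea.
import numpy as np

def weighted_m_step(counts_l, resp_l, counts_u, resp_u, lam):
    counts = np.vstack([counts_l, counts_u])
    resp = np.vstack([resp_l, lam * resp_u])   # rows of resp_u now sum to lam
    vocab_size = counts.shape[1]
    word_counts = resp.T @ counts
    theta_w = (1 + word_counts) / (vocab_size + word_counts.sum(axis=1, keepdims=True))
    # resp.sum() = |D_l| + lam * |D_u|: the effective document count
    theta_c = (1 + resp.sum(axis=0)) / (resp.shape[1] + resp.sum())
    return theta_w, theta_c
```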

  36. Augmented EM - multiple mixture components per class • Method: relax the assumption of a one-to-one correspondence between components and classes • Allow a many-to-one relationship between components and classes

  37. Augmented EM - multiple mixture components per class • How? • Decide the number of components per class, again by cross-validation • Mapping from components to classes: the posterior probability of a class is the sum of the posterior probabilities of the mixture components assigned to it
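A sketch of that many-to-one mapping; the five-component, two-class assignment below is a hypothetical example.

```python
# A sketch of the many-to-one component-to-class mapping: a class's posterior
# is the sum of its components' posteriors. The assignment below is made up:
# components 0-2 model class 0, components 3-4 model class 1.
import numpy as np

component_to_class = np.array([0, 0, 0, 1, 1])

def class_posteriors(component_post, num_classes):
    """component_post: (num_docs, num_components), e.g. from the E step."""
    post = np.zeros((component_post.shape[0], num_classes))
    for a, cls in enumerate(component_to_class):
        post[:, cls] += component_post[:, a]   # sum a class's components
    return post
```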

  38. The complete algorithm • Collect the labeled and unlabeled documents • Set λ by cross-validation • Set the number of components per class • Randomly assign the labeled documents among their class's mixture components • Initialize the parameters θ of the NB classifier using MAP • Loop until the complete log likelihood of the labeled and unlabeled data has converged: • E step: estimate the component membership of each document using θ • M step: re-estimate θ given the memberships, again via MAP

  39. Comparison • Basic EM: performs well compared with the naïve Bayes classifier alone, given a large unlabeled dataset and a small set of labeled data • EM-λ: clearly improves the accuracy when the assumption above does not hold • Multiple components: dramatically outperforms basic EM

  40. Part III Results and Recap

  41. Experimental Results • Empirical evidence that combining labeled with unlabeled data using EM outperforms naive Bayes alone • Datasets: 20 Newsgroups, WebKB, Reuters • Improvements in accuracy due to unlabeled data are dramatic, especially when the number of labeled documents is low • Augmented EM can increase performance even where basic EM performs poorly due to the large amount of unlabeled data

  42. Data sets and Protocols • 20 Newsgroups • 20017 articles divided evenly among 20 different UseNet discussion groups • Task - to classify an article into the one newsgroup to which it was posted • Many categories fall into confusable clusters • Stop words are removed - 62258 unique words remain • Word counts are normalized and scaled so that each document has constant length

  43. Data sets and Protocols - WebKB • 8145 web pages gathered from university computer science departments • 4199 pages are chosen, covering the categories student, faculty, course and project • Task - to classify a web page into one of the four categories • Neither stemming nor a stoplist is used • The vocabulary is limited to the 300 most informative words, chosen by leave-one-out cross-validation

  44. Data sets and Protocols • Reuters • 12902 articles and 90 topic categories • Task - to build a binary classifier for each of the ten most populous classes to identify the news topic • Words inside the <TEXT> tags are used - the REUTERS and &# tokens are not • A stoplist is used, but no stemming • The metrics are recall and precision instead of accuracy

  45. Precision-Recall breakeven point • Standard information retrieval measure • Recall = (number of correct positive predictions) / (number of positive examples) • Precision = (number of correct positive predictions) / (number of positive predictions) • The breakeven point is the value at which precision and recall are equal, found by varying the classification threshold
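A small sketch of how a breakeven point can be computed by sweeping the decision threshold over ranked scores; taking the cutoff where the two measures are closest (and averaging them) is my simplification.

```python
# A sketch of the precision-recall breakeven point: rank documents by score,
# sweep the cutoff, and report the point where precision and recall meet.
import numpy as np

def breakeven(scores, y_true):
    """scores: positive-class confidence per doc; y_true: 1 for positive docs."""
    order = np.argsort(-scores)                # rank documents by decreasing score
    y = y_true[order]
    tp = np.cumsum(y)                          # true positives at each cutoff
    precision = tp / np.arange(1, len(y) + 1)
    recall = tp / y.sum()
    k = np.argmin(np.abs(precision - recall))  # cutoff where they (nearly) cross
    return (precision[k] + recall[k]) / 2
```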

  46. Wall-clock timing • EM usually converges after about 10 iterations • Less than 1 minute for WebKB • Less than 15 minutes for 20 Newsgroups - a huge vocabulary and more documents

  47. EM with unlabeled data increases accuracy • Figure 1: accuracy versus the number of labeled documents (20 Newsgroups)

  48. Effect of varying the number of unlabeled documents • Figure 2: accuracy versus the number of unlabeled documents (20 Newsgroups)

  49. EM algorithm in action • Figure 3: the 'Course' class of the WebKB dataset

  50. EM performance degradation • Figure 4: as the number of labeled documents increases, the classifier's accuracy falls when more unlabeled data are added, showing the importance of the weighting factor λ (WebKB)
