Distributional Clustering of Words for Text Classification

Andrew Kachites McCallum (Justsystem Pittsburgh Research Center) • L. Douglas Baker (Carnegie Mellon University) • Presentation by: Thomas Walsh (Rutgers University)

Clustering • Define what it means for words to be “similar”. • “Collapse” the word space by grouping similar words in “clusters”. • Key Idea for Distributional Clustering: • Class probabilities given the words in a labeled document collection P(C|w) provide rules for correlating words to classifications.

Voting • Can be understood by a voting model: • Each word in a document casts a weighted vote for classification. • Words that normally vote similarly can be clustered together and vote with the average of their weighted votes without negatively impacting performance.

Benefits of Word Clustering • Useful semantic word clustering • Automatically generates a “thesaurus” • Higher classification accuracy • Sort of; we’ll discuss this in the results section • Smaller classification models • Size reductions as dramatic as 50,000 → 50 features

Benefits of Smaller Models • Easier to compute – with the constantly increasing amount of available text, reducing memory use is crucial. • Memory-constrained devices like PDAs could now use text classification algorithms to organize documents. • More complex algorithms can be unleashed that would be infeasible in 50,000 dimensions.

The Framework • Start with Training Data with: • Set of Classes C = {c1, c2… cm} • Set of Documents D ={d1… dn} • Each Document has a class label

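Mixture Models • $f(x_i \mid \theta) = \sum_k \pi_k \, h(x_i \mid \lambda_k)$ • The $\pi_k$ sum to 1 • $h$ is a distribution function for $x$ (such as a Gaussian) with $\lambda_k$ as its parameters, $(\mu, \Sigma)$ in the Gaussian case. • Thus $\theta = (\pi_1, \ldots, \pi_k, \lambda_1, \ldots, \lambda_k)$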
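As a rough illustration of the mixture density on this slide, the sketch below evaluates a weighted sum of one-dimensional Gaussian components. The function names, the two-component weights, and the (mean, standard deviation) pairs are all made up for the example; the slide itself only names the Gaussian as one possible component family.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian component h(x | mean, std)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_density(x, weights, params):
    """Weighted sum of component densities; the weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(w * gaussian_pdf(x, m, s) for w, (m, s) in zip(weights, params))

# Hypothetical two-component mixture (illustrative values only).
weights = [0.3, 0.7]
params = [(0.0, 1.0), (4.0, 2.0)]  # (mean, std) for each component
density_at_one = mixture_density(1.0, weights, params)
```

In the text classification setting of the following slides, each component corresponds to a class and the weights play the role of class priors.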

What is θ in this case? • Assumption: one-to-one correspondence between the mixture model components and the classes. • The class priors are contained in the vector θ0 • Each prior is estimated as the number of documents in that class / total number of documents

What is θ in this case? • The rest of the entries in θ correspond to disjoint sets. The jth entry contains the probability of each word wt in the vocabulary V given the class cj. • N(wt, di) is the number of times word wt appears in document di. • P(cj|di) ∈ {0, 1}, since each training document carries a known class label
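A minimal sketch of how these parameters might be estimated from labeled documents follows. The add-one (Laplace) smoothing of the word counts is an assumption of this sketch, as are the function name and the toy documents; the slides only say that the entries are the word probabilities given each class.

```python
from collections import Counter

def estimate_parameters(docs, labels):
    """docs: list of token lists; labels: one class label per document."""
    classes = sorted(set(labels))
    vocab = sorted({w for doc in docs for w in doc})
    # Class priors: documents in the class / total number of documents.
    priors = {c: labels.count(c) / len(labels) for c in classes}
    # N(w, d) summed over the documents of each class.
    counts = {c: Counter() for c in classes}
    for doc, label in zip(docs, labels):
        counts[label].update(doc)
    # Word probabilities per class, with add-one smoothing (an assumption).
    word_probs = {}
    for c in classes:
        total = sum(counts[c].values())
        word_probs[c] = {w: (1 + counts[c][w]) / (len(vocab) + total) for w in vocab}
    return priors, word_probs

# Toy training set (made up for illustration).
docs = [["ball", "game", "ball"], ["game", "score"], ["vote", "law"]]
labels = ["sports", "sports", "politics"]
priors, word_probs = estimate_parameters(docs, labels)
```

With smoothing, every word in the vocabulary gets a nonzero probability in every class, and each class's word probabilities still sum to 1.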

Prob. of a given Document in the Model • The mixture model can be used to produce documents with probability: $P(d_i \mid \theta) = \sum_{j=1}^{m} P(c_j \mid \theta) \, P(d_i \mid c_j; \theta)$ • Just the sum, over each class, of the probability of generating this document in the model.

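Documents as Collections of Words • Treat each document as an ordered collection of word events. • dik = word in document di at position k. • In general, each word is dependent on the preceding words; naïve Bayes drops this dependence and treats words as independent given the class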

Incorporate Expanded Formulae for θ • We can calculate the model parameter θ from the training data. • Now we wish to calculate P(cj|di; θ), the probability of document di belonging to class cj.

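Final Equation • $P(c_j \mid d_i; \theta) = \dfrac{P(c_j \mid \theta) \prod_{k} P(w_{d_{ik}} \mid c_j; \theta)}{\sum_{r} P(c_r \mid \theta) \prod_{k} P(w_{d_{ik}} \mid c_r; \theta)}$ • Numerator: class prior × product of the probabilities of each word in the document, assuming we are in class cj (from eq. 2) • Denominator: sum over all classes cr of class prior × product of all word probabilities, assuming we are in class cr (from eqs. 1, 2, 3) • The cj that maximizes this value is the class assigned to the document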
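The final equation can be sketched in a few lines: score each class by its prior times the product of its word probabilities, then normalize over the classes. Working in log space and skipping out-of-vocabulary words are choices of this sketch, not something the slides specify, and the priors and word probabilities below are invented toy values.

```python
import math

def class_posteriors(doc, priors, word_probs):
    """Return P(class | doc) for every class under the naive Bayes model."""
    log_scores = {}
    for c, prior in priors.items():
        # Sum of logs replaces the product of probabilities to avoid underflow.
        # Words missing from the vocabulary are skipped (an assumption).
        log_scores[c] = math.log(prior) + sum(
            math.log(word_probs[c][w]) for w in doc if w in word_probs[c]
        )
    # Normalize by the sum over all classes (the denominator of the equation).
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {c: v / z for c, v in unnormalized.items()}

# Toy parameters (illustrative values only).
priors = {"sports": 0.5, "politics": 0.5}
word_probs = {
    "sports": {"ball": 0.6, "vote": 0.1, "game": 0.3},
    "politics": {"ball": 0.1, "vote": 0.6, "game": 0.3},
}
posteriors = class_posteriors(["ball", "game", "ball"], priors, word_probs)
predicted = max(posteriors, key=posteriors.get)
```

Because the denominator is the same for every class, picking the maximizing class only needs the numerators; the normalization matters when the posterior values themselves are wanted.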

Shortcomings of the Framework • In real-world data (documents) there isn’t actually an underlying mixture model, and the independence assumption doesn’t actually hold. • But empirical evidence and some theoretical writing (Domingos and Pazzani 1997) indicate that the damage from this is negligible.

What about clustering? • So assuming the Framework holds… how does clustering fit into all this?

How Does Clustering affect probabilities? • Fraction of cluster from wt + fraction of cluster from ws
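One way to read this slide: when words wt and ws are merged, the cluster's class distribution is the average of the two words' class distributions, each weighted by that word's share of the cluster. The sketch below assumes the share is P(w) / (P(wt) + P(ws)); the function name and the numbers are hypothetical.

```python
def merge_class_distributions(p_wt, p_ws, dist_t, dist_s):
    """Weighted average of P(C | wt) and P(C | ws).

    p_wt, p_ws: word probabilities P(wt), P(ws)
    dist_t, dist_s: class distributions as equal-length lists
    """
    total = p_wt + p_ws
    share_t = p_wt / total  # fraction of the cluster coming from wt
    share_s = p_ws / total  # fraction of the cluster coming from ws
    return [share_t * a + share_s * b for a, b in zip(dist_t, dist_s)]

# Hypothetical example with two classes.
merged = merge_class_distributions(0.03, 0.01, [0.8, 0.2], [0.4, 0.6])
```

The more frequent word dominates the merged distribution, which is why merging a rare word into a cluster barely changes the cluster's vote.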

Vs. other forms of learning • Measures similarity based on the property it is trying to estimate (the classes) • Makes the supervision in the training data really important. • Clustering is based on the similarity of the class variable distributions • Key Idea: Clustering preserves the “shape” of the class distributions.

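Kullback-Leibler Divergence • Measures the dissimilarity between class distributions • $D(P(C \mid w_t) \,\|\, P(C \mid w_s)) = \sum_{j} P(c_j \mid w_t) \log \dfrac{P(c_j \mid w_t)}{P(c_j \mid w_s)}$ • If P(cj | wt) = P(cj | ws) for every class, then each term is log(1) = 0 and the divergence is 0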

Problems with K-L Divergence • Not symmetric • Denominator can be 0 if ws does not appear in any documents of class cj.
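Both problems are easy to see numerically. The sketch below uses the standard definition of K-L divergence, the sum over classes of p_j log(p_j / q_j), with toy distributions chosen for illustration; returning infinity for a zero denominator is a convention of this sketch.

```python
import math

def kl_divergence(p, q):
    """D(p || q) for two discrete distributions given as lists."""
    total = 0.0
    for pj, qj in zip(p, q):
        if pj == 0.0:
            continue          # a zero-probability term contributes nothing
        if qj == 0.0:
            return math.inf   # zero denominator: the divergence is unbounded
        total += pj * math.log(pj / qj)
    return total

p = [0.9, 0.1]
q = [0.5, 0.5]
forward = kl_divergence(p, q)    # D(p || q)
backward = kl_divergence(q, p)   # D(q || p), generally a different value
unbounded = kl_divergence([0.5, 0.5], [1.0, 0.0])  # second class never seen
```

Here `forward` and `backward` differ, showing the asymmetry, and `unbounded` is infinite because the second distribution gives zero probability to a class the first one uses.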

K-L Divergence from the Mean • Each word’s share of the cluster’s occurrences × the K-L divergence of that word’s class distribution from the cluster’s mean distribution, summed over the words • New and improved: uses a weighted average instead of just the mean • Justification: fits clustering because independent distributions now form combined statistics.
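A minimal sketch of this measure for two words, assuming each word's weight is P(w) / (P(wt) + P(ws)) and the mean is the correspondingly weighted average of the two class distributions. The function names and inputs are hypothetical.

```python
import math

def kl_divergence(p, q):
    """D(p || q); q is assumed nonzero wherever p is nonzero."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)

def kl_divergence_to_mean(p_wt, p_ws, dist_t, dist_s):
    """Weighted sum of each word's K-L divergence from the weighted mean."""
    total = p_wt + p_ws
    share_t, share_s = p_wt / total, p_ws / total
    mean = [share_t * a + share_s * b for a, b in zip(dist_t, dist_s)]
    return (share_t * kl_divergence(dist_t, mean)
            + share_s * kl_divergence(dist_s, mean))

# Distributions with zeros that would break plain K-L divergence.
disjoint = kl_divergence_to_mean(0.5, 0.5, [1.0, 0.0], [0.0, 1.0])
same = kl_divergence_to_mean(0.03, 0.01, [0.8, 0.2], [0.8, 0.2])
```

The mean is nonzero wherever either word's distribution is nonzero, so the zero-denominator problem disappears, and swapping the two words leaves the value unchanged.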

Minimizing Error in Naïve Bayes Scores • Assuming uniform class priors allows us to drop P(cj | θ) and the whole denominator from (6) • Then performing a little algebra gets us the cross entropy: • So error can be measured in the difference in cross-entropy caused by clustering. Minimizing this equation results in equation (9), so clustering in this method minimizes error.

The Clustering Algorithm • Comparing similarity of all possible word clusters would be O(V²) • Instead, a number M is set as the total number of desired clusters • More supervision • M clusters initialized with the M words with the highest mutual information to the class variable • Properties: Greedy, scales efficiently
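The greedy procedure might look roughly like the sketch below. The slide only states the initialization; the loop shown here (add the next word as a singleton cluster, then merge the two most similar clusters to get back to M) is an assumed reading, as is using K-L divergence to the mean as the similarity. The input is assumed to be pre-sorted by decreasing mutual information with the class variable, and all names and numbers are hypothetical.

```python
import math
from itertools import combinations

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)

def merge(c1, c2):
    """Clusters are (probability mass, class distribution, word list)."""
    (p1, d1, w1), (p2, d2, w2) = c1, c2
    total = p1 + p2
    mean = [(p1 * a + p2 * b) / total for a, b in zip(d1, d2)]
    return (total, mean, w1 + w2)

def kl_to_mean(c1, c2):
    total, mean, _ = merge(c1, c2)
    return (c1[0] / total) * kl(c1[1], mean) + (c2[0] / total) * kl(c2[1], mean)

def cluster_words(sorted_words, num_clusters):
    """sorted_words: (word, P(w), P(C|w)) tuples, highest mutual information first."""
    clusters = [(p, dist, [w]) for w, p, dist in sorted_words[:num_clusters]]
    for w, p, dist in sorted_words[num_clusters:]:
        clusters.append((p, dist, [w]))  # temporarily M + 1 clusters
        # Only O(M^2) pairs are compared per word, never all word pairs.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: kl_to_mean(clusters[ij[0]], clusters[ij[1]]))
        merged = merge(clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(c[2]) for c in clusters]

# Toy vocabulary with two classes (illustrative values only).
words = [("ball", 0.3, [0.9, 0.1]), ("vote", 0.3, [0.1, 0.9]),
         ("game", 0.2, [0.85, 0.15]), ("law", 0.2, [0.2, 0.8])]
result = cluster_words(words, 2)
```

On this toy input the two clusters end up grouping the words whose class distributions lean the same way, which is the behavior the voting intuition predicts.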

Algorithm • [Slide figure: the clustering algorithm; only the fragment Σ P(C | wt) survived extraction]

Related Work • ChiMerge / Chi2 • Use D. Clustering to discretize numbers • Class-based clustering • Uses amount that mutual information is reduced to determine when to cluster • Not effective in text classification • Feature Selection by Mutual Information • Cannot capture dependencies between words • Markov-blanket-based Feature Selection • Also attempts to preserve P(C | wt) shapes • Latent Semantic Indexing • Unsupervised, using PCA

The Experiment : Competitors to Distributional Clustering • Clustering with LSI • Information Gain Based Feature Selection • Mutual-Information Feature Selection • Feature Selection involves cutting out redundant instances • Clustering combines these redundancies

The Experiment: Testbeds • 20 Newsgroups • 20,000 articles from 20 Usenet groups (approx. 62,000 words) • ModApte “Reuters-21578” • 9,603 training docs, 3,299 testing docs, 135 topics (approx. 16,000 words) • Yahoo! Science (July 1997) • 6,294 pages in 41 classes (approx. 44,000 words) • Very noisy data

20 Newsgroups Results • Averaged over 5-20 trials • Computational constraints forced Markov blanket to a smaller data set (second graph) • LSI uses only 1/3 training ratio

20 Newsgroups Analysis • Distributional Clustering achieves 82.1% accuracy at 50 features, almost as good as having the full vocabulary. • More accurate than all non-clustering approaches • LSI did not add any improvement to clustering (claim: because it is unsupervised) • On the smaller data set, D.C. achieves 80% accuracy far quicker than the others, in some cases doubling their performance for small numbers of features. • Claim: Clustering outperforms feature selection because it conserves information rather than discarding it.

Speed in 20-Newsgroups Test • Distributional Clustering: 7.5 minutes • LSI: 23 minutes • Markov Blanket: 10 hours • Mutual information feature selection (???): 30 seconds

Reuters-21578 Results • D.C. outperforms others for small numbers of features • Information-Gain based feature selection does better for larger feature sets. • In this data set, documents can have multiple labels.

Yahoo! Results • Feature selection performs almost as well as, or better than, D.C. in these cases • Claim: The data is so noisy that it is actually beneficial to “lose data” via feature selection.

Performance Summary • Only slight loss in accuracy despite the reduction in feature space • Preserves “redundant” information better than feature selection. • The improvement is not as drastic with noisy data.

Improvements on Earlier D.C. Work • Does not show much improvement on sparse data because the performance measure is related to the data distribution • D.C. preserves class distributions, even if these are poor estimates to begin with. • Thus this whole method relies on accurate values for P(C | wi)

Future Work • Improve D.C.’s handling of sparse data (ensure good estimates of P(C | wi)) • Find ways to combine feature selection and D.C. to utilize the strengths of both (perhaps increase performance on noisy data sets?)

Some Thoughts • Extremely supervised • Needs to be retrained when new documents come in • In a paper with a lot of topics, does Naïve Bayes (word independent of context) make sense? • Didn’t work well in noisy data • How can we ensure proper theta values?
