Create Presentation
Download Presentation

Download Presentation
## Distributional Clustering of Words for Text Classification

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**Distributional Clustering of Words for Text Classification**Andrew Kachites McCallum (Justsystem Pittsburgh Research Center) L.Douglas Baker (Carnegie Mellon University) Presentation by: Thomas Walsh (Rutgers University)**Clustering**• Define what it means for words to be “similar”. • “Collapse” the word space by grouping similar words in “clusters”. • Key Idea for Distributional Clustering: • Class probabilities given the words in a labeled document collection P(C|w) provide rules for correlating words to classifications.**Voting**• Can be understood by a voting model: • Each word in a document casts a weighted vote for classification. • Words that normally vote similarly can be clustered together and vote with the average of their weighted votes without negatively impacting performance.**Benefits of Word Clustering**• Useful Semantic word clustering • Automatically generates a “Thesaurus” • Higher classification accuracy • Sort of, we’ll discuss in the results section • Smaller classification models • size reductions as dramatic as 50000 50**Benefits of Smaller Models**• Easier to compute – with the constantly increasing amount of available text, reducing the memory space is clutch. • Memory constrained devices like PDA’s could now use text classification algorithms to organize documents. • More complex algorithms can be unleashed that would be infeasible in 50000 dimensions.**The Framework**• Start with Training Data with: • Set of Classes C = {c1, c2… cm} • Set of Documents D ={d1… dn} • Each Document has a class label**Mixture Models**• f(xi|q) = Spkh(xi|lk) • Sum of pk’s is 1 • h is a distriution function for x (such as a Gausian) with lk as the parameter (m, S) in the Gausian case. • Thus q = (p1…pk, l1… lk)**What is q in this case?**• Assumption: one-to-one correspondence between the mixture model components and the classes. • The class priors are contained in the vector q0 • Instances of each class / number of documents**What is q in this case?**• The rest of the entries in q correspond to disjoint sets. The jth entry contains the probability of each word wt in the vocabulary V given the class cj. • N(wt, di) is the number of times a word appears in document di. • P(cj|di) = {0, 1}**Prob. of a given Document in the Model**• The mixture model can be used to produce documents with probability: • Just the sum of the probability of generating this document in the model over each class.**Documents as Collections of Words**• Treat each document as an ordered collection of word events. • Dik = work in document di at place k. • Each word is dependent on preceding words**Apply Naïve Bayes Assumption**• Assume each word is independent of both content and position • Where dik = wt • Update Formulas 2 and 1: • (2) P(di | cj ; q) = P P(wt|cj ; q) • (1) P(di| q) = S P(cj|q) P P(wt|cj; q)**Incorporate Expanded Formulae for q**• We can calculate the model parameter q from the training data. • Now we wish to calculate P(cj|di; q), the probability of document di belonging to class cj.**Final Equation**Class prior * (2)Product of all the probabilities of each word in the document assuming we are in class cj ------------------------------------------------------------- (1/2/3) Sum of all class priors * product of all word probabilities assuming we are in class cr • Maximize and that value of cj is the class for the document**Shortcomings of the Framework**• In real world data (documents) there isn’t actually an underlying mixture model and the independence assumption doesn’t actually hold. • But empirical evidence and some theoretical writing (Domingos and Pazzani 1997) indicates the damage from this is negligible.**What about clustering?**• So assuming the Framework holds… how does clustering fit into all this?**How Does Clustering affect probabilities?**• Fraction of cluster from wt + fraction of cluster from ws**Vs. other forms of learning**• Measures similarity based on the property it is trying to estimate (the classes) • Makes the supervision in the training data really important. • Clustering is based on the similarity of the class variable distributions • Key Idea: Clustering preserves the “shape” of the class distributions.**Kullock-Liebler Divergence**• Measures the similarity between class distributions • D( P(C | wt) || P(C | ws)) = • If P(cj | wt) = P(cj | ws) then log(1) = 0**Problems with K-L Divergence**• Not symmetric • Denominator can be 0 if ws does not appear in any documents of class cj.**K-L Divergence from the Mean**• Ratio of each words occurrence in the cluster * K-L divergence of that word within the cluster • New and improved: uses a weighted average instead of just the mean • Justification: fits clustering because independent distributions now form combined statistics.**Minimizing Error in Naïve Bayes Scores**• Assuming uniform class priors allows us to drop P(cj | q) and the whole denominator from (6) • Then performing a little algebra gets us the cross entropy: • So error can be measured in the difference in cross-entropy caused by clustering. Minimizing this equation results in equation (9), so clustering in this method minimizes error.**The Clustering Algorithm**• Comparing similarity of all possible word clusters would be O(V2) • Instead, a number M is set as the total number of desired clusters • More supervision • M clusters initialized with the M words with the highest mutual information to the class variable • Properties: Greedy, scales efficiently**Algorithm**S P(C | wt)**Related Work**• Chi Merge / Chi 2 • Use D. Clustering to discretize numbers • Class-based clustering • Uses amount that mutual information is reduced to determine when to cluster • Not effective in text classification • Feature Selection by Mutual Information • cannot capture dependencies between words • Markov-blanket-based Feature Selection • Also attempts to Preserve P(C | wt) shapes • Latent Semantic Indexing • Unsupervised, using PCA**The Experiment : Competitors to Distributional Clustering**• Clustering with LSI • Information Gain Based Feature Selection • Mutual-Information Feature Selection • Feature Selection involves cutting out redundant instances • Clustering combines these redundancies**The Experiment: Testbeds**• 20 Newsgroups • 20,000 articles from 20 usenet groups (apx 62000 words) • ModApte “Reuters-21578” • 9603 training docs, 3299 testing docs, 135 topics (apx. 16000 words) • Yahoo! Science (July 1997) • 6294 pages in 41 classes (apx. 44000 words) • Very noisy data**20 Newsgroups Results**• Averaged over 5-20 trials • Computational constraints forced Markov blanket to a smaller data set (second graph) • LSI uses only 1/3 training ratio**20 Newsgroups Analysis**• Distributional Clustering achieves 82.1% accuracy at 50 features, almost as good as having the full vocabulary. • More accurate then all non-clustering approaches • LSI did not add any improvement to clustering (claim: because it is unsupervised) • On the smaller data set, D.C. achieves 80% accuracy far quicker then the others, in some cases doubling their performance for small numbers of features. • Claim: Clustering outperforms Feature selection because it conserves information rather than discarding it.**Speed in 20-Newsgroups Test**• Distributional Clustering: 7.5 minutes • LSI: 23 minutes • Makov Blanket: 10 hours • Mutual information feature selection (???): 30 seconds**Reuters-21578 Results**• D.C. outperforms others for small numbers of features • Information-Gain based feature selection does better for larger feature sets. • In this data set, documents can have multiple labels.**Yahoo! Results**• Feature selection performs almost as well or better in these cases • Claim: The data is so noisy that it is actually beneficial to “lose data” via feature selection.**Performance Summary**• Only slight loss in accuraccy despite despite the reduction in feature space • Preserves “redundent” information better than feature selection. • The improvement is not as drastic with noisy data.**Improvements on Earlier D.C. Work**• Does not show much improvement on sparse data because the performance measure is related to the data distribution • D.C. preserves class distributions, even if these are poor estimates to begin with. • Thus this whole method relies on accurate values for P(C | wi)**Future Work**• Improve D.C.’s handling of sparse data (ensure good estimates of P(C | wi) • Find ways to combine feature selection and D.C. to utilize the strengths of both (perhaps increase performance on noisy data sets?)**Some Thoughts**• Extremely supervised • Needs to be retrained when new documents come in • In a paper with a lot of topics, does Naïve Bayes (word independent of context) make sense? • Didn’t work well in noisy data • How can we ensure proper theta values?