
Distributional Clustering of Words for Text Classification


Presentation Transcript


  1. Distributional Clustering of Words for Text Classification Andrew Kachites McCallum (Justsystem Pittsburgh Research Center) L. Douglas Baker (Carnegie Mellon University) Presentation by: Thomas Walsh (Rutgers University)

  2. Clustering • Define what it means for words to be “similar”. • “Collapse” the word space by grouping similar words into “clusters”. • Key idea for distributional clustering: • The class probabilities given each word in a labeled document collection, P(C | w), provide the rules for correlating words with classifications.
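To make the key quantity concrete, here is a minimal sketch (not from the slides) of estimating P(C | w) from a labeled collection; the toy documents and labels are invented for illustration.

```python
from collections import Counter, defaultdict

# A minimal sketch of the key quantity P(C | w): for each word, the distribution
# over class labels of its occurrences in a labeled collection. Toy data, made up.
docs = [("win cash now", "spam"), ("cash meeting agenda", "ham")]

counts = defaultdict(Counter)          # counts[w][c] = occurrences of word w in class-c documents
for text, label in docs:
    for w in text.split():
        counts[w][label] += 1

p_c_given_w = {
    w: {c: n / sum(by_class.values()) for c, n in by_class.items()}
    for w, by_class in counts.items()
}
print(p_c_given_w["cash"])             # {'spam': 0.5, 'ham': 0.5}
print(p_c_given_w["win"])              # {'spam': 1.0}
```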

  3. Voting • Can be understood by a voting model: • Each word in a document casts a weighted vote for classification. • Words that normally vote similarly can be clustered together and vote with the average of their weighted votes without negatively impacting performance.
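A tiny numerical illustration of the voting intuition (the vote weights are invented): if two words vote similarly, replacing each word's vote with the cluster's average vote leaves the document's total vote, and hence its classification, essentially unchanged.

```python
# Invented per-class vote weights for two words that "vote similarly".
votes = {"cash": [0.9, 0.1], "win": [0.8, 0.2]}

# Replace both words' votes with their cluster average.
avg = [(a + b) / 2 for a, b in zip(votes["cash"], votes["win"])]

total_before = [a + b for a, b in zip(votes["cash"], votes["win"])]   # [1.7, 0.3]
total_after = [2 * v for v in avg]                                    # [1.7, 0.3] -- unchanged
print(total_before, total_after)
```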

  4. Benefits of Word Clustering • Useful semantic word clusters • Automatically generates a “thesaurus” • Higher classification accuracy • Sort of; we’ll discuss this in the results section • Smaller classification models • Size reductions as dramatic as 50,000 → 50 features

  5. Benefits of Smaller Models • Easier to compute – with the constantly increasing amount of available text, reducing the memory footprint is critical. • Memory-constrained devices like PDAs could now run text classification algorithms to organize documents. • More complex algorithms can be unleashed that would be infeasible in 50,000 dimensions.

  6. The Framework • Start with training data consisting of: • A set of classes C = {c1, c2, …, cm} • A set of documents D = {d1, …, dn} • Each document has a class label

  7. Mixture Models • f(xi | θ) = Σk πk h(xi | λk) • The πk sum to 1 • h is a distribution function for x (such as a Gaussian) with λk as its parameter ((μk, Σk) in the Gaussian case) • Thus θ = (π1…πK, λ1…λK)
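As a purely illustrative instance of the formula above, here is a sketch with two one-dimensional Gaussian components; the weights and parameters are made up.

```python
import numpy as np
from scipy.stats import norm

# Sketch of the mixture likelihood f(x | theta) = sum_k pi_k * h(x | lambda_k),
# assuming 1-D Gaussian components with made-up parameters.
pi = np.array([0.3, 0.7])            # mixture weights pi_k, summing to 1
lam = [(0.0, 1.0), (5.0, 2.0)]       # lambda_k = (mu_k, sigma_k) for each component

def mixture_likelihood(x):
    # Weighted sum of the component densities h(x | lambda_k)
    return sum(p * norm.pdf(x, mu, sigma) for p, (mu, sigma) in zip(pi, lam))

print(mixture_likelihood(1.5))
```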

  8. What is θ in this case? • Assumption: one-to-one correspondence between the mixture model components and the classes. • The class priors are contained in the vector θ0 • Estimated as the number of instances of each class / the number of documents

  9. What is θ in this case? • The rest of the entries in θ correspond to disjoint sets. The jth set contains the probability of each word wt in the vocabulary V given the class cj. • N(wt, di) is the number of times word wt appears in document di. • P(cj | di) ∈ {0, 1}, since each training document carries exactly one known label.
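A minimal sketch (not from the slides) of estimating these θ entries from labeled documents; Laplace (add-one) smoothing is an assumed choice here, and the toy documents are invented.

```python
from collections import Counter, defaultdict

# Estimate class priors and per-class word probabilities P(w_t | c_j) from labeled
# documents. Laplace smoothing is an assumption, not something stated on the slide.
docs = [("win cash cash now", "spam"), ("meeting agenda attached", "ham")]   # toy data

vocab = {w for text, _ in docs for w in text.split()}
class_counts = Counter(label for _, label in docs)
word_counts = defaultdict(Counter)     # word_counts[c][w] = sum_i N(w, d_i) over docs labeled c
for text, label in docs:
    word_counts[label].update(text.split())

priors = {c: n / len(docs) for c, n in class_counts.items()}   # instances of class / number of docs
word_probs = {
    c: {w: (1 + counts[w]) / (len(vocab) + sum(counts.values())) for w in vocab}
    for c, counts in word_counts.items()
}
print(word_probs["spam"]["cash"])      # 3/10 = 0.3 with this toy data
```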

  10. Prob. of a given Document in the Model • The mixture model generates a document with probability: • P(di | θ) = Σj P(cj | θ) P(di | cj; θ) • Just the sum, over the classes, of the probability of generating this document from each class, weighted by that class’s prior.

  11. Documents as Collections of Words • Treat each document as an ordered collection of word events. • dik = the word in document di at position k. • Without further assumptions, each word is dependent on the preceding words.

  12. Apply the Naïve Bayes Assumption • Assume each word is independent of both context and position • Where dik = wt, update formulas (2) and (1): • (2) P(di | cj; θ) = Πk P(wt | cj; θ) • (1) P(di | θ) = Σj P(cj | θ) Πk P(wt | cj; θ)

  13. Incorporate Expanded Formulae for θ • We can calculate the model parameter θ from the training data. • Now we wish to calculate P(cj | di; θ), the probability of document di belonging to class cj.

  14. Final Equation • P(cj | di; θ) = [ P(cj | θ) Πk P(wt | cj; θ) ] / [ Σr P(cr | θ) Πk P(wt | cr; θ) ] • Numerator: the class prior times (2), the product of the probabilities of each word in the document given class cj • Denominator: the same quantity summed over all classes cr • Maximize over cj; the maximizing cj is the class assigned to the document
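A minimal sketch of applying the final equation: score each class by its prior times the product of its word probabilities and take the maximum. Working in log space is an implementation detail, not something on the slide, and `priors` / `word_probs` are assumed to be dictionaries like those in the earlier sketch.

```python
import math

def classify(words, priors, word_probs):
    # Score each class with log P(c_j) + sum_k log P(w | c_j).
    # Logs avoid underflow; the denominator is the same for every class, so it can be dropped.
    scores = {
        c: math.log(priors[c]) + sum(math.log(word_probs[c][w]) for w in words if w in word_probs[c])
        for c in priors
    }
    return max(scores, key=scores.get)

# e.g. with the toy priors/word_probs estimated earlier:
# classify("cash now".split(), priors, word_probs)   # -> "spam"
```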

  15. Shortcomings of the Framework • In real world data (documents) there isn’t actually an underlying mixture model and the independence assumption doesn’t actually hold. • But empirical evidence and some theoretical writing (Domingos and Pazzani 1997) indicates the damage from this is negligible.

  16. What about clustering? • So assuming the Framework holds… how does clustering fit into all this?

  17. How Does Clustering Affect the Probabilities? • The merged cluster’s class distribution is a weighted average: each word contributes P(C | w) weighted by its share of the cluster’s occurrences (the fraction of the cluster coming from wt plus the fraction coming from ws).
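A sketch of that weighted average, assuming we already know each word's class distribution and its occurrence count; the numbers are invented.

```python
import numpy as np

def merge_distributions(p_c_given_wt, p_c_given_ws, n_wt, n_ws):
    # The cluster's class distribution is each member's P(C | w) weighted by
    # that word's share of the cluster's total occurrences.
    wt_share = n_wt / (n_wt + n_ws)
    ws_share = n_ws / (n_wt + n_ws)
    return wt_share * np.asarray(p_c_given_wt) + ws_share * np.asarray(p_c_given_ws)

print(merge_distributions([0.9, 0.1], [0.7, 0.3], n_wt=30, n_ws=10))   # [0.85 0.15]
```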

  18. Vs. other forms of learning • Measures similarity based on the property it is trying to estimate (the classes) • Makes the supervision in the training data really important. • Clustering is based on the similarity of the class variable distributions • Key Idea: Clustering preserves the “shape” of the class distributions.

  19. Kullback-Leibler Divergence • Measures the similarity between class distributions • D( P(C | wt) || P(C | ws) ) = Σj P(cj | wt) log( P(cj | wt) / P(cj | ws) ) • If P(cj | wt) = P(cj | ws) for every cj, each term is log(1) = 0, so the divergence is 0
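For reference, a direct sketch of the divergence written above; it also makes the next slide's zero-denominator problem visible, since a zero in P(C | ws) where P(C | wt) is nonzero blows up.

```python
import numpy as np

def kl_divergence(p, q):
    # D(p || q) = sum_j p_j * log(p_j / q_j); terms with p_j = 0 contribute nothing.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))   # identical distributions -> 0.0
print(kl_divergence([0.9, 0.1], [0.6, 0.4]))   # positive, and not equal to kl_divergence(q, p)
```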

  20. Problems with K-L Divergence • Not symmetric • The denominator P(cj | ws) can be 0 if ws does not appear in any documents of class cj, making the divergence undefined

  21. K-L Divergence from the Mean • Each word’s K-L divergence is taken against the cluster’s mean distribution and weighted by that word’s share of occurrences in the cluster • New and improved: uses a weighted average rather than a simple mean, addressing the problems on the previous slide • Justification: it fits clustering because the merged cluster’s statistics are exactly this weighted combination of the individual distributions
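A sketch of this weighted "divergence to the mean" for a two-word cluster, reusing the KL function above; the occurrence counts provide the weights, and the exact weighting shown here is an assumption based on the slide's description.

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def kl_to_the_mean(p_wt, p_ws, n_wt, n_ws):
    # Each word's divergence is taken against the merged (weighted-mean) distribution
    # and weighted by that word's share of the cluster's occurrences.
    wt_share = n_wt / (n_wt + n_ws)
    ws_share = n_ws / (n_wt + n_ws)
    mean = wt_share * np.asarray(p_wt, float) + ws_share * np.asarray(p_ws, float)
    return wt_share * kl(p_wt, mean) + ws_share * kl(p_ws, mean)

# Similar class distributions give a small value; dissimilar ones a larger value.
print(kl_to_the_mean([0.9, 0.1], [0.8, 0.2], n_wt=30, n_ws=10))
print(kl_to_the_mean([0.9, 0.1], [0.1, 0.9], n_wt=30, n_ws=10))
```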

  22. Minimizing Error in Naïve Bayes Scores • Assuming uniform class priors allows us to drop P(cj | θ) and the whole denominator from (6) • Then performing a little algebra gets us the cross entropy: • So error can be measured as the difference in cross-entropy caused by clustering. Minimizing this equation results in equation (9), so clustering by this method minimizes error.

  23. The Clustering Algorithm • Comparing the similarity of all possible word clusters would be O(V²) • Instead, a number M is set as the total number of desired clusters • More supervision • The M clusters are initialized with the M words with the highest mutual information with the class variable • Properties: greedy, scales efficiently

  24. Algorithm • (Slide shows the clustering procedure, built around the weighted class distributions Σ P(C | wt); a rough sketch in code follows.)
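A rough sketch of the greedy procedure described on these two slides; the function names and merge bookkeeping are illustrative assumptions, and `divergence` would be the KL-to-the-mean measure from earlier.

```python
def cluster_words(vocab, M, mutual_information, divergence):
    # Sort words by mutual information with the class variable, highest first.
    words = sorted(vocab, key=mutual_information, reverse=True)
    clusters = [[w] for w in words[:M]]        # initialize with the top-M words
    for w in words[M:]:
        clusters.append([w])                   # add the next word as a singleton (M+1 clusters)
        # Merge the two most similar clusters (lowest divergence), restoring M clusters.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: divergence(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```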

  25. Related Work • ChiMerge / Chi2 • Use distributional clustering to discretize numeric features • Class-based clustering • Uses the amount by which mutual information is reduced to determine when to cluster • Not effective in text classification • Feature selection by mutual information • Cannot capture dependencies between words • Markov-blanket-based feature selection • Also attempts to preserve the shape of P(C | wt) • Latent Semantic Indexing • Unsupervised, using PCA

  26. The Experiment: Competitors to Distributional Clustering • Clustering with LSI • Information-gain-based feature selection • Mutual-information feature selection • Feature selection involves cutting out redundant features • Clustering instead combines these redundancies

  27. The Experiment: Testbeds • 20 Newsgroups • 20,000 articles from 20 Usenet groups (approx. 62,000 words) • ModApte “Reuters-21578” • 9,603 training docs, 3,299 testing docs, 135 topics (approx. 16,000 words) • Yahoo! Science (July 1997) • 6,294 pages in 41 classes (approx. 44,000 words) • Very noisy data

  28. 20 Newsgroups Results • Averaged over 5-20 trials • Computational constraints forced Markov blanket to a smaller data set (second graph) • LSI uses only 1/3 training ratio

  29. 20 Newsgroups Analysis • Distributional Clustering achieves 82.1% accuracy at 50 features, almost as good as having the full vocabulary. • More accurate than all non-clustering approaches • LSI did not add any improvement to clustering (claim: because it is unsupervised) • On the smaller data set, D.C. reaches 80% accuracy far quicker than the others, in some cases doubling their performance for small numbers of features. • Claim: clustering outperforms feature selection because it conserves information rather than discarding it.

  30. Speed in 20-Newsgroups Test • Distributional Clustering: 7.5 minutes • LSI: 23 minutes • Markov Blanket: 10 hours • Mutual-information feature selection (???): 30 seconds

  31. Reuters-21578 Results • D.C. outperforms others for small numbers of features • Information-Gain based feature selection does better for larger feature sets. • In this data set, documents can have multiple labels.

  32. Yahoo! Results • Feature selection performs almost as well or better in these cases • Claim: The data is so noisy that it is actually beneficial to “lose data” via feature selection.

  33. Performance Summary • Only a slight loss in accuracy despite the reduction in feature space • Preserves “redundant” information better than feature selection. • The improvement is not as drastic with noisy data.

  34. Improvements on Earlier D.C. Work • Does not show much improvement on sparse data because the performance measure is related to the data distribution • D.C. preserves class distributions, even if these are poor estimates to begin with. • Thus this whole method relies on accurate values for P(C | wi)

  35. Future Work • Improve D.C.’s handling of sparse data (ensure good estimates of P(C | wi)) • Find ways to combine feature selection and D.C. to utilize the strengths of both (perhaps to increase performance on noisy data sets?)

  36. Some Thoughts • Extremely supervised • Needs to be retrained when new documents come in • In a document covering many topics, does Naïve Bayes (each word independent of context) make sense? • Didn’t work well on noisy data • How can we ensure proper θ values?
