
Presentation Transcript


  1. Information Theoretic Clustering, Co-clustering and Matrix Approximations • Inderjit S. Dhillon, University of Texas, Austin • Joint work with A. Banerjee, J. Ghosh, Y. Guan, S. Mallela, S. Merugu & D. Modha • Data Mining Seminar Series, Mar 26, 2004

  2. Clustering: Unsupervised Learning • Grouping together of “similar” objects • Hard Clustering -- Each object belongs to a single cluster • Soft Clustering -- Each object is probabilistically assigned to clusters

  3. Contingency Tables • Let X and Y be discrete random variables • X and Y take values in {1, 2, …, m} and {1, 2, …, n} • p(X, Y) denotes the joint probability distribution; if not known, it is often estimated from co-occurrence data • Application areas: text mining, market-basket analysis, analysis of browsing behavior, etc. • Key Obstacles in Clustering Contingency Tables • High dimensionality, sparsity, noise • Need for robust and scalable algorithms
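To make the setup concrete, here is a minimal sketch (in Python/NumPy, with a made-up count matrix) of estimating p(X, Y) from co-occurrence data by normalizing counts:

```python
import numpy as np

# Made-up co-occurrence counts: rows index the values of X, columns the values of Y.
counts = np.array([[5., 0., 2.],
                   [1., 3., 0.],
                   [0., 4., 6.]])

# Estimate the joint distribution p(X, Y) by normalizing the counts to sum to 1.
p_xy = counts / counts.sum()

# Marginals p(X) and p(Y) follow by summing out the other variable.
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)
```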

  4. Co-Clustering • Simultaneously • Cluster rows of p(X, Y) into k disjoint groups • Cluster columns of p(X, Y) into l disjoint groups • Key goal is to exploit the “duality” between row and column clustering to overcome sparsity and noise

  5. Co-clustering Example for Text Data • Co-clustering clusters both words and documents simultaneously using the underlying co-occurrence frequency matrix • [Figure: word-by-document matrix with rows grouped into word clusters and columns grouped into document clusters]

  6. Co-clustering and Information Theory • View “co-occurrence” matrix as a joint probability distribution over row & column random variables • We seek a “hard-clustering” of both rows and columns such that “information” in the compressed matrix is maximized.

  7. Information Theory Concepts • Entropy of a random variable X with probability distribution p: $H(X) = -\sum_x p(x) \log p(x)$ • The Kullback-Leibler (KL) Divergence or “Relative Entropy” between two probability distributions p and q: $KL(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$ • Mutual Information between random variables X and Y: $I(X; Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$
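These three quantities are easy to compute for distributions stored as NumPy arrays; a short sketch (function names are illustrative, logarithms in base 2):

```python
import numpy as np

def entropy(p):
    """H(X) = -sum_x p(x) log p(x), in bits; zero-probability terms contribute 0."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log(p(x) / q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def mutual_information(p_xy):
    """I(X; Y) = KL(p(X, Y) || p(X) p(Y)), computed from the joint distribution."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    return kl_divergence(p_xy.ravel(), (p_x * p_y).ravel())
```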

  8. “Optimal” Co-Clustering • Seek random variables $\hat{X}$ and $\hat{Y}$, taking values in {1, 2, …, k} and {1, 2, …, l}, such that the mutual information $I(\hat{X}; \hat{Y})$ is maximized, where $\hat{X} = R(X)$ is a function of X alone and $\hat{Y} = C(Y)$ is a function of Y alone

  9. Related Work • Distributional Clustering • Pereira, Tishby & Lee (1993), Baker & McCallum (1998) • Information Bottleneck • Tishby, Pereira & Bialek (1999), Slonim, Friedman & Tishby (2001), Berkhin & Becher (2002) • Probabilistic Latent Semantic Indexing • Hofmann (1999), Hofmann & Puzicha (1999) • Non-Negative Matrix Approximation • Lee & Seung (2000)

  10. Information-Theoretic Co-clustering • Lemma: “Loss in mutual information” equals $I(X; Y) - I(\hat{X}; \hat{Y}) = KL(p(X, Y) \,\|\, q(X, Y))$, where $q(x, y) = p(\hat{x}, \hat{y})\, p(x \mid \hat{x})\, p(y \mid \hat{y})$ for $\hat{x} = R(x)$, $\hat{y} = C(y)$ • p is the input distribution • q is an approximation to p • It can be shown that q(x, y) is the maximum entropy approximation to p subject to the co-cluster constraints.
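A sketch of how such a q(x, y) could be formed from hard co-cluster assignments (it reuses kl_divergence and mutual_information from the sketch above; the helper name build_q and the assignment arrays are illustrative):

```python
import numpy as np

def build_q(p_xy, row_clusters, col_clusters, k, l):
    """Form q(x, y) = p(xhat, yhat) p(x | xhat) p(y | yhat) from hard assignments."""
    m, n = p_xy.shape
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    # Compressed joint distribution p(xhat, yhat) over the k x l co-clusters.
    p_cc = np.zeros((k, l))
    for x in range(m):
        for y in range(n):
            p_cc[row_clusters[x], col_clusters[y]] += p_xy[x, y]
    p_xhat, p_yhat = p_cc.sum(axis=1), p_cc.sum(axis=0)  # cluster marginals
    q = np.zeros((m, n))
    for x in range(m):
        for y in range(n):
            xh, yh = row_clusters[x], col_clusters[y]
            if p_xhat[xh] > 0 and p_yhat[yh] > 0:
                q[x, y] = p_cc[xh, yh] * (p_x[x] / p_xhat[xh]) * (p_y[y] / p_yhat[yh])
    return q

# Numerical check of the lemma:
#   mutual_information(p_xy) - mutual_information(p_cc)
#       == kl_divergence(p_xy.ravel(), q.ravel())
```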

  11. The number of parameters that determine q(x, y) is $(kl - 1) + (m - k) + (n - l)$: the co-cluster distribution $p(\hat{x}, \hat{y})$ contributes $kl - 1$, and the conditionals $p(x \mid \hat{x})$ and $p(y \mid \hat{y})$ contribute $m - k$ and $n - l$, respectively.

  12. Decomposition Lemma • Question: How do we minimize $KL(p \,\|\, q)$? • The following lemma reveals the answer: $KL(p \,\|\, q) = \sum_{\hat{x}} \sum_{x : R(x) = \hat{x}} p(x)\, KL\big(p(Y \mid x) \,\|\, q(Y \mid \hat{x})\big) = \sum_{\hat{y}} \sum_{y : C(y) = \hat{y}} p(y)\, KL\big(p(X \mid y) \,\|\, q(X \mid \hat{y})\big)$ • Note that $q(Y \mid \hat{x})$ may be thought of as the “prototype” of row cluster $\hat{x}$; similarly, $q(X \mid \hat{y})$ is the prototype of column cluster $\hat{y}$.

  13. Co-Clustering Algorithm • [Step 1] Set $t = 0$. Start with an initial co-clustering $(R^{(0)}, C^{(0)})$ and compute $q^{(0)}$. • [Step 2] For every row $x$, assign it to the row cluster $\hat{x}$ that minimizes $KL\big(p(Y \mid x) \,\|\, q^{(t)}(Y \mid \hat{x})\big)$. • [Step 3] We now have $(R^{(t+1)}, C^{(t+1)} = C^{(t)})$; compute $q^{(t+1)}$. • [Step 4] For every column $y$, assign it to the column cluster $\hat{y}$ that minimizes $KL\big(p(X \mid y) \,\|\, q^{(t+1)}(X \mid \hat{y})\big)$. • [Step 5] We now have $(R^{(t+2)} = R^{(t+1)}, C^{(t+2)})$; compute $q^{(t+2)}$. Set $t = t + 2$ and iterate Steps 2-5.
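A condensed sketch of this alternating loop, reusing kl_divergence and build_q from the earlier sketches (the initialization, iteration count, and variable names are illustrative, not the authors' implementation):

```python
import numpy as np

def co_cluster(p_xy, k, l, n_iters=20, seed=0):
    """Alternating row/column re-assignment; each step does not increase KL(p || q)."""
    rng = np.random.default_rng(seed)
    m, n = p_xy.shape
    row_clusters = rng.integers(0, k, size=m)      # Step 1: arbitrary initial co-clustering
    col_clusters = rng.integers(0, l, size=n)
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    for _ in range(n_iters):
        # Step 2: re-assign each row x to minimize KL(p(Y|x) || q(Y|xhat)).
        q = build_q(p_xy, row_clusters, col_clusters, k, l)
        proto_rows = np.zeros((k, n))              # row-cluster prototypes q(Y | xhat)
        for xh in range(k):
            mass = q[row_clusters == xh].sum(axis=0)
            if mass.sum() > 0:
                proto_rows[xh] = mass / mass.sum()
        for x in range(m):
            if p_x[x] > 0:
                p_y_given_x = p_xy[x] / p_x[x]
                row_clusters[x] = np.argmin(
                    [kl_divergence(p_y_given_x, proto_rows[xh]) for xh in range(k)])
        # Steps 3-4: recompute q, then re-assign each column y analogously.
        q = build_q(p_xy, row_clusters, col_clusters, k, l)
        proto_cols = np.zeros((l, m))              # column-cluster prototypes q(X | yhat)
        for yh in range(l):
            mass = q[:, col_clusters == yh].sum(axis=1)
            if mass.sum() > 0:
                proto_cols[yh] = mass / mass.sum()
        for y in range(n):
            if p_y[y] > 0:
                p_x_given_y = p_xy[:, y] / p_y[y]
                col_clusters[y] = np.argmin(
                    [kl_divergence(p_x_given_y, proto_cols[yh]) for yh in range(l)])
        # Step 5: loop back to Step 2.
    return row_clusters, col_clusters
```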

  14. Properties of Co-clustering Algorithm • Main Theorem: Co-clustering “monotonically” decreases loss in mutual information • Co-clustering converges to a local minimum • Can be generalized to multi-dimensional contingency tables • q can be viewed as a “low complexity” non-negative matrix approximation • q preserves marginals of p, and co-cluster statistics • Implicit dimensionality reduction at each step helps overcome sparsity & high-dimensionality • Computationally economical

  15. Applications -- Text Classification • Assigning class labels to text documents • Training and Testing Phases • [Figure: a document collection grouped into classes Class-1 … Class-m serves as training data for a classifier, which assigns a class to each new document]

  16. Feature Clustering (dimensionality reduction) • Feature Selection: select the “best” words and throw away the rest (frequency-based or information-criterion-based pruning); a document's m-dimensional bag-of-words becomes a k-dimensional vector over the selected words Word#1 … Word#k • Feature Clustering: do not throw away words; cluster the words instead and use the clusters as features, so the m-dimensional bag-of-words becomes a k-dimensional vector over Cluster#1 … Cluster#k
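A minimal sketch of the feature-clustering idea: collapsing a bag-of-words vector into word-cluster counts (the word-to-cluster map and the counts below are made up):

```python
import numpy as np

def cluster_features(doc_word_counts, word_to_cluster, num_clusters):
    """Collapse an m-dimensional bag-of-words vector into k word-cluster counts
    by summing the counts of all words assigned to the same cluster."""
    features = np.zeros(num_clusters)
    for word_idx, count in enumerate(doc_word_counts):
        features[word_to_cluster[word_idx]] += count
    return features

# Example: 6 words grouped into 2 word clusters.
word_to_cluster = np.array([0, 0, 1, 1, 0, 1])
doc = np.array([2, 0, 1, 3, 1, 0])
print(cluster_features(doc, word_to_cluster, 2))  # -> [3. 4.]
```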

  17. Experiments • Data sets • 20 Newsgroups data • 20 classes, 20000 documents • Classic3 data set • 3 classes (cisi, med and cran), 3893 documents • Dmoz Science HTML data • 49 leaves in the hierarchy • 5000 documents with 14538 words • Available at http://www.cs.utexas.edu/users/manyam/dmoz.txt • Implementation Details • Bow – for indexing, co-clustering, clustering and classifying

  18. Results (20Ng) • Classification accuracy on 20 Newsgroups data with a 1/3-2/3 test-train split • Divisive clustering beats feature-selection algorithms by a large margin • The effect is more pronounced at smaller numbers of features

  19. Results (Dmoz) • Classification accuracy on Dmoz data with a 1/3-2/3 test-train split • Divisive Clustering is better at smaller numbers of features • Note the contrasting behavior of Naïve Bayes and SVMs

  20. Results (Dmoz) • Naïve Bayes on Dmoz data with only 2% training data • Note that Divisive Clustering achieves a higher maximum than IG, a significant 13% increase • Divisive Clustering performs better than IG when less training data is available

  21. Hierarchical Classification • [Figure: example hierarchy with Science at the root, children Math, Physics and Social Science, and leaves such as Number Theory, Logic, Quantum Theory, Mechanics, Economics and Archeology] • A flat classifier builds a single classifier over the leaf classes of the hierarchy • A hierarchical classifier builds a classifier at each internal node of the hierarchy
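A schematic sketch of how such a hierarchical classifier could route a document through the tree (the node dictionaries and the per-node classifier interface with a predict() method are hypothetical placeholders, not the Bow implementation used here):

```python
def classify_hierarchical(doc, node, node_classifiers):
    """Route a document down the class hierarchy: at every internal node, a classifier
    trained only over that node's children picks a branch; stop at a leaf class.
    `node` is a dict {"label": ..., "children": [...]}; `node_classifiers` maps an
    internal node's label to a trained model exposing predict()."""
    children = node.get("children", [])
    if not children:
        return node["label"]                                 # leaf: final class label
    branch = node_classifiers[node["label"]].predict(doc)    # e.g. Naive Bayes at this node
    child = next(c for c in children if c["label"] == branch)
    return classify_hierarchical(doc, child, node_classifiers)
```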

  22. Results (Dmoz) • Hierarchical Classifier (Naïve Bayes at each node) • Hierarchical Classifier: 64.54% accuracy with just 10 features (the flat classifier achieves 64.04% accuracy with 1000 features) • The Hierarchical Classifier improves accuracy to 68.42% from the 64.42% (maximum) achieved by flat classifiers

  23. Anecdotal Evidence • Top few words (sorted) in clusters obtained by the Divisive and Agglomerative approaches on 20 Newsgroups data

  24. Co-Clustering Results (CLASSIC3) • Co-Clustering (0.9835): confusion matrix rows 992 4 8 / 40 1452 7 / 1 4 1387 • 1-D Clustering (0.821): confusion matrix rows 847 142 44 / 41 954 405 / 275 86 1099

  25. Results – Binary (subset of 20Ng data) • Binary (0.852, 0.67): Co-clustering 207 31 / 43 219, 1-D Clustering 179 94 / 71 156 • Binary_subject (0.946, 0.648): Co-clustering 234 11 / 16 239, 1-D Clustering 178 104 / 72 146

  26. Precision – 20Ng data

  27. Results: Sparsity (Binary_subject data)

  28. Results: Sparsity (Binary_subject data)

  29. Results (Monotonicity)

  30. Conclusions • Information-theoretic approach to clustering, co-clustering and matrix approximation • Implicit dimensionality reduction at each step to overcome sparsity & high-dimensionality • The theoretical approach has the potential to extend to other problems: • Multi-dimensional co-clustering • MDL to choose the number of co-clusters • Generalized co-clustering via Bregman divergences

  31. More Information • Email: inderjit@cs.utexas.edu • Papers are available at: http://www.cs.utexas.edu/users/inderjit • “Divisive Information-Theoretic Feature Clustering for Text Classification”, Dhillon, Mallela & Kumar, Journal of Machine Learning Research (JMLR), March 2003 (also KDD, 2002) • “Information-Theoretic Co-clustering”, Dhillon, Mallela & Modha, KDD, 2003 • “Clustering with Bregman Divergences”, Banerjee, Merugu, Dhillon & Ghosh, SIAM Data Mining Proceedings, April 2004 • “A Generalized Maximum Entropy Approach to Bregman Co-clustering & Matrix Approximation”, Banerjee, Dhillon, Ghosh, Merugu & Modha, working manuscript, 2004
