Information Theoretic Clustering, Co-clustering and Matrix Approximations Inderjit S. Dhillon University of Texas, Austin


Joint work with A. Banerjee, J. Ghosh, Y. Guan, S. Mallela, S. Merugu & D. Modha

Data Mining Seminar Series, Mar 26, 2004

- Grouping together of “similar” objects
- Hard Clustering -- Each object belongs to a single cluster
- Soft Clustering -- Each object is probabilistically assigned to clusters

- Let X and Y be discrete random variables
- X and Y take values in {1, 2, …, m} and {1, 2, …, n}
- p(X, Y) denotes the joint probability distribution—if not known, it is often estimated based on co-occurrence data
- Application areas: text mining, market-basket analysis, analysis of browsing behavior, etc.
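
As a minimal sketch of estimating p(X, Y) from co-occurrence data, the snippet below normalizes a count table so that its entries sum to 1. The function name `joint_from_counts` and the small word-by-document count table are hypothetical, chosen only for illustration.

```python
def joint_from_counts(counts):
    """Normalize a co-occurrence count table into a joint distribution p(X, Y)."""
    total = sum(sum(row) for row in counts)
    if total == 0:
        raise ValueError("count table is empty")
    return [[c / total for c in row] for row in counts]


# Hypothetical counts: 3 words (rows) by 4 documents (columns).
counts = [
    [2, 0, 1, 0],
    [0, 3, 0, 1],
    [1, 0, 2, 0],
]

p = joint_from_counts(counts)
p_x = [sum(row) for row in p]        # row marginal p(X)
p_y = [sum(col) for col in zip(*p)]  # column marginal p(Y)
```

Most entries of a real co-occurrence table are zero, which is the sparsity that the later slides set out to address.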

- Key Obstacles in Clustering Contingency Tables
- High Dimensionality, Sparsity, Noise
- Need for robust and scalable algorithms

- Simultaneously
- Cluster rows of p(X, Y) into k disjoint groups
- Cluster columns of p(X, Y) into l disjoint groups

- Key goal is to exploit the “duality” between row and column clustering to overcome sparsity and noise

- Co-clustering clusters both words and documents simultaneously using the underlying co-occurrence frequency matrix

[Figure: word-by-document co-occurrence matrix, with words grouped into word clusters and documents grouped into document clusters]

- View “co-occurrence” matrix as a joint probability distribution over row & column random variables
- We seek a “hard-clustering” of both rows and columns such that “information” in the compressed matrix is maximized.

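- Entropy of a random variable X with probability distribution p: $H(p) = -\sum_{x} p(x) \log p(x)$
- The Kullback-Leibler (KL) Divergence or “Relative Entropy” between two probability distributions p and q: $KL(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$
- Mutual Information between random variables X and Y: $I(X; Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$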
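
The three quantities can be sketched in a few lines of Python, using the standard definitions: entropy is -sum p log p, KL divergence is sum p log(p/q), and mutual information is the KL divergence between the joint distribution and the product of its marginals. The slides do not fix a logarithm base; base 2 (bits) is an assumption here, as is the convention that terms with zero probability contribute zero.

```python
import math


def entropy(p):
    """H(p) = -sum_x p(x) log2 p(x), with 0 log 0 taken as 0."""
    return -sum(v * math.log2(v) for v in p if v > 0)


def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log2(p(x) / q(x)); infinite if q(x) = 0 < p(x)."""
    total = 0.0
    for pv, qv in zip(p, q):
        if pv > 0:
            if qv <= 0:
                return math.inf
            total += pv * math.log2(pv / qv)
    return total


def mutual_information(joint):
    """I(X; Y) = sum_{x,y} p(x,y) log2(p(x,y) / (p(x) p(y)))."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return sum(
        v * math.log2(v / (p_x[i] * p_y[j]))
        for i, row in enumerate(joint)
        for j, v in enumerate(row)
        if v > 0
    )


fair_coin = entropy([0.5, 0.5])                           # 1.0 bit
dependent = mutual_information([[0.5, 0.0], [0.0, 0.5]])  # X determines Y
independent = mutual_information([[0.25, 0.25], [0.25, 0.25]])
```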

- Seek random variables $\hat{X}$ and $\hat{Y}$ taking values in {1, 2, …, k} and {1, 2, …, l} such that the mutual information $I(\hat{X}; \hat{Y})$ is maximized:
- where $\hat{X} = R(X)$ is a function of X alone
- where $\hat{Y} = C(Y)$ is a function of Y alone
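
To make the objective concrete, the sketch below compresses a joint distribution under a given row map R and column map C by summing the probability mass in each co-cluster, then evaluates the mutual information of the compressed table. The helper names (`compress`, `mutual_information`), the base-2 logarithm, and the 4-by-4 block-structured example are assumptions made for illustration.

```python
import math


def mutual_information(joint):
    """I(X; Y) in bits for a joint distribution given as a list of rows."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return sum(
        v * math.log2(v / (p_x[i] * p_y[j]))
        for i, row in enumerate(joint)
        for j, v in enumerate(row)
        if v > 0
    )


def compress(joint, row_map, col_map, k, l):
    """Sum p(x, y) over each (row cluster, column cluster) pair."""
    small = [[0.0] * l for _ in range(k)]
    for i, row in enumerate(joint):
        for j, v in enumerate(row):
            small[row_map[i]][col_map[j]] += v
    return small


# Hypothetical joint distribution with two clean blocks.
p = [
    [0.125, 0.125, 0.0, 0.0],
    [0.125, 0.125, 0.0, 0.0],
    [0.0, 0.0, 0.125, 0.125],
    [0.0, 0.0, 0.125, 0.125],
]
cols = [0, 0, 1, 1]

good = mutual_information(compress(p, [0, 0, 1, 1], cols, 2, 2))
poor = mutual_information(compress(p, [0, 1, 0, 1], cols, 2, 2))
```

In this example the row map that follows the block structure retains all of the original mutual information, while the map that mixes the two blocks retains none.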

- Distributional Clustering
- Pereira, Tishby & Lee (1993), Baker & McCallum (1998)

- Information Bottleneck
- Tishby, Pereira & Bialek (1999), Slonim, Friedman & Tishby (2001), Berkhin & Becher (2002)

- Probabilistic Latent Semantic Indexing
- Hofmann (1999), Hofmann & Puzicha (1999)

- Non-Negative Matrix Approximation
- Lee & Seung (2000)

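- Lemma: “Loss in mutual information” equals $I(X; Y) - I(\hat{X}; \hat{Y}) = KL(p(x, y) \| q(x, y))$
- p is the input distribution
- q is an approximation to p: $q(x, y) = p(\hat{x}, \hat{y})\, p(x \mid \hat{x})\, p(y \mid \hat{y})$, where $x \in \hat{x}$ and $y \in \hat{y}$
- It can be shown that q(x,y) is a maximum entropy approximation subject to cluster constraints.
- The number of parameters that determine q(x,y) is $(m - k) + (kl - 1) + (n - l)$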
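
The lemma can be checked numerically. The sketch below builds the approximation q(x, y) = p(x̂, ŷ) p(x | x̂) p(y | ŷ) for a fixed co-clustering and compares the loss in mutual information with KL(p || q). The 6-by-6 distribution, the cluster maps, the function names, and the base-2 logarithm are illustrative assumptions; the code also assumes that no cluster is empty and that no row or column has zero total probability.

```python
import math


def mutual_information(joint):
    """I(X; Y) in bits for a joint distribution given as a list of rows."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return sum(
        v * math.log2(v / (p_x[i] * p_y[j]))
        for i, row in enumerate(joint)
        for j, v in enumerate(row)
        if v > 0
    )


def approximate(p, row_map, col_map, k, l):
    """Return q(x, y) and the compressed table p(x_hat, y_hat)."""
    p_x = [sum(row) for row in p]
    p_y = [sum(col) for col in zip(*p)]
    p_hat = [[0.0] * l for _ in range(k)]
    for i, row in enumerate(p):
        for j, v in enumerate(row):
            p_hat[row_map[i]][col_map[j]] += v
    p_xhat = [sum(r) for r in p_hat]
    p_yhat = [sum(c) for c in zip(*p_hat)]
    q = [
        [
            p_hat[row_map[i]][col_map[j]]
            * (p_x[i] / p_xhat[row_map[i]])   # p(x | x_hat)
            * (p_y[j] / p_yhat[col_map[j]])   # p(y | y_hat)
            for j in range(len(p_y))
        ]
        for i in range(len(p_x))
    ]
    return q, p_hat


def kl_joint(p, q):
    """KL(p || q) in bits over all cells with p(x, y) > 0."""
    return sum(
        pv * math.log2(pv / qv)
        for p_row, q_row in zip(p, q)
        for pv, qv in zip(p_row, q_row)
        if pv > 0
    )


p = [
    [0.05, 0.05, 0.05, 0.00, 0.00, 0.00],
    [0.05, 0.05, 0.05, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.05, 0.05, 0.05],
    [0.00, 0.00, 0.00, 0.05, 0.05, 0.05],
    [0.04, 0.04, 0.00, 0.04, 0.04, 0.04],
    [0.04, 0.04, 0.04, 0.00, 0.04, 0.04],
]
row_map = [0, 0, 1, 1, 2, 2]   # k = 3 row clusters
col_map = [0, 0, 0, 1, 1, 1]   # l = 2 column clusters

q, p_hat = approximate(p, row_map, col_map, k=3, l=2)
loss = mutual_information(p) - mutual_information(p_hat)
divergence = kl_joint(p, q)    # agrees with loss up to rounding
```

The row and column sums of q match those of p, so the approximation keeps the marginals while using far fewer free parameters than the full table.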

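- Question: How to minimize $KL(p(x, y) \| q(x, y))$?
- The following Lemma reveals the answer: $KL(p(x, y) \| q(x, y)) = \sum_{\hat{x}} \sum_{x \in \hat{x}} p(x)\, KL(p(y \mid x) \| q(y \mid \hat{x}))$, where $q(y \mid \hat{x}) = p(y \mid \hat{y})\, p(\hat{y} \mid \hat{x})$
- Note that $q(y \mid \hat{x})$ may be thought of as the “prototype” of row cluster $\hat{x}$.
- Similarly, $KL(p(x, y) \| q(x, y)) = \sum_{\hat{y}} \sum_{y \in \hat{y}} p(y)\, KL(p(x \mid y) \| q(x \mid \hat{y}))$, where $q(x \mid \hat{y}) = p(x \mid \hat{x})\, p(\hat{x} \mid \hat{y})$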

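- [Step 1] Set $i = 1$. Start with $(R_i, C_i)$. Compute $q^{[i,i]}$.
- [Step 2] For every row $x$, assign it to the cluster $\hat{x}$ that minimizes $KL(p(y \mid x) \| q^{[i,i]}(y \mid \hat{x}))$
- [Step 3] We have $(R_{i+1}, C_i)$. Compute $q^{[i+1,i]}$.
- [Step 4] For every column $y$, assign it to the cluster $\hat{y}$ that minimizes $KL(p(x \mid y) \| q^{[i+1,i]}(x \mid \hat{y}))$
- [Step 5] We have $(R_{i+1}, C_{i+1})$. Compute $q^{[i+1,i+1]}$. Iterate 2-5.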
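
The alternating steps can be sketched in plain Python as follows. This is a small illustration under stated assumptions, not the authors' implementation: each row is compared with a row-cluster prototype q(y | x̂) = p(y | ŷ) p(ŷ | x̂) using KL divergence, and the column update reuses the same routine on the transposed table. The function names, the base-2 logarithm, the tie-breaking rule (lowest cluster index), the `max_iter` cap, and the 4-by-4 example with its starting assignment are all choices made here.

```python
import math


def kl(p, q):
    """KL(p || q) in bits; infinite if q is zero where p is positive."""
    total = 0.0
    for pv, qv in zip(p, q):
        if pv > 0:
            if qv <= 0:
                return math.inf
            total += pv * math.log2(pv / qv)
    return total


def row_prototypes(p, rows, cols, k, l):
    """q(y | x_hat) = p(y | y_hat) * p(y_hat | x_hat) for each row cluster."""
    joint = [[0.0] * l for _ in range(k)]
    for i, row in enumerate(p):
        for j, v in enumerate(row):
            joint[rows[i]][cols[j]] += v
    p_xh = [sum(r) for r in joint]
    p_yh = [sum(c) for c in zip(*joint)]
    p_y = [sum(c) for c in zip(*p)]
    protos = []
    for a in range(k):
        proto = []
        for j, py in enumerate(p_y):
            b = cols[j]
            if p_xh[a] > 0 and p_yh[b] > 0:
                proto.append((py / p_yh[b]) * (joint[a][b] / p_xh[a]))
            else:
                proto.append(0.0)  # empty cluster: no usable prototype
        protos.append(proto)
    return protos


def reassign(p, rows, cols, k, l):
    """Recompute prototypes, then move each row to its closest one."""
    protos = row_prototypes(p, rows, cols, k, l)
    new_rows = []
    for i, row in enumerate(p):
        p_x = sum(row)
        if p_x == 0:
            new_rows.append(rows[i])   # nothing to compare; keep the cluster
            continue
        cond = [v / p_x for v in row]  # p(y | x)
        costs = [kl(cond, proto) for proto in protos]
        new_rows.append(costs.index(min(costs)))
    return new_rows


def cocluster(p, rows, cols, k, l, max_iter=20):
    """Alternate row and column reassignment until nothing changes."""
    p_t = [list(c) for c in zip(*p)]
    for _ in range(max_iter):
        new_rows = reassign(p, rows, cols, k, l)        # steps 2-3
        new_cols = reassign(p_t, cols, new_rows, l, k)  # steps 4-5
        if new_rows == rows and new_cols == cols:
            break
        rows, cols = new_rows, new_cols
    return rows, cols


# Hypothetical counts with two noisy blocks, normalized to a distribution.
counts = [[4, 4, 1, 1], [4, 4, 1, 1], [1, 1, 4, 4], [1, 1, 4, 4]]
total = sum(map(sum, counts))
p = [[c / total for c in row] for row in counts]

rows, cols = cocluster(p, rows=[0, 1, 1, 1], cols=[0, 1, 1, 1], k=2, l=2)
```

Starting from a deliberately poor assignment, this example settles on rows `[0, 0, 1, 1]` and columns `[0, 0, 1, 1]`, the two blocks visible in the count table. As with any local search, a different starting assignment can lead to a different result.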

- Main Theorem: Co-clustering “monotonically” decreases loss in mutual information
- Co-clustering converges to a local minimum
- Can be generalized to multi-dimensional contingency tables
- q can be viewed as a “low complexity” non-negative matrix approximation
- q preserves marginals of p, and co-cluster statistics
- Implicit dimensionality reduction at each step helps overcome sparsity & high-dimensionality
- Computationally economical

- Assigning class labels to text documents
- Training and Testing Phases

[Figure: a document collection grouped into classes (Class-1 … Class-m) serves as training data; the classifier learns from the training data and assigns a class to each new document]

- Feature Selection
- Feature Clustering

- Select the “best” words
- Throw away rest
- Frequency based pruning
- Information criterion based pruning

[Figure, feature selection: a document's bag-of-words (vector of words 1 … m) is reduced to selected words Word#1 … Word#k]

- Do not throw away words
- Cluster words instead
- Use clusters as features

[Figure, feature clustering: a document's bag-of-words (vector of words 1 … m) is mapped to word clusters Cluster#1 … Cluster#k]
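
Using clusters as features can be sketched as a simple aggregation: the count for each word cluster is the sum of the counts of the words assigned to it. The function name, the word-to-cluster map, and the example document below are hypothetical, and skipping words that have no cluster assignment is an assumption of this sketch.

```python
def cluster_features(word_counts, word_to_cluster, num_clusters):
    """Collapse a bag-of-words into one count per word cluster."""
    features = [0] * num_clusters
    for word, count in word_counts.items():
        cluster = word_to_cluster.get(word)
        if cluster is not None:  # words without a cluster are skipped
            features[cluster] += count
    return features


# Hypothetical assignment of words to 2 clusters.
word_to_cluster = {"goal": 0, "puck": 0, "orbit": 1, "launch": 1}
doc = {"goal": 2, "puck": 1, "orbit": 3, "the": 7}

features = cluster_features(doc, word_to_cluster, num_clusters=2)
```

Unlike feature selection, every clustered word still contributes to the final feature vector, whose length is the number of clusters rather than the vocabulary size.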

- Data sets
- 20 Newsgroups data
- 20 classes, 20000 documents

- Classic3 data set
- 3 classes (cisi, med and cran), 3893 documents

- Dmoz Science HTML data
- 49 leaves in the hierarchy
- 5000 documents with 14538 words
- Available at http://www.cs.utexas.edu/users/manyam/dmoz.txt

- Implementation Details
- Bow: used for indexing, co-clustering, clustering and classifying

- Classification Accuracy on 20 Newsgroups data with 1/3-2/3 test-train split
- Divisive clustering beats feature selection algorithms by a large margin
- The effect is more significant at lower numbers of features

- Classification Accuracy on Dmoz data with 1/3-2/3 test-train split
- Divisive Clustering is better at lower numbers of features
- Note contrasting behavior of Naïve Bayes and SVMs

- Naïve Bayes on Dmoz data with only 2% Training data
- Note that Divisive Clustering achieves a higher maximum accuracy than Information Gain (IG), a significant 13% increase
- Divisive Clustering performs better than IG when less training data is available

Hierarchical Classification

[Figure: class hierarchy rooted at Science, with children Math, Physics and Social Science; leaves shown are Number Theory, Logic, Quantum Theory, Mechanics, Economics and Archeology]

- Flat classifier builds a classifier over the leaf classes in the above hierarchy
- Hierarchical Classifier builds a classifier at each internal node of the hierarchy

- Hierarchical Classifier (Naïve Bayes at each node)
- Hierarchical Classifier: 64.54% accuracy at just 10 features (Flat achieves 64.04% accuracy at 1000 features)
- Hierarchical Classifier improves accuracy to 68.42% from the 64.42% (maximum) achieved by flat classifiers

Top few words sorted in Clusters obtained by Divisive and Agglomerative approaches on 20 Newsgroups data

Co-Clustering (0.9835)      1-D Clustering (0.821)
 992     4     8             847   142    44
  40  1452     7              41   954   405
   1     4  1387             275    86  1099

Binary (0.852, 0.67)                  Binary_subject (0.946, 0.648)
Co-clustering    1-D Clustering       Co-clustering    1-D Clustering
 207    31        178   104            234    11        179    94
  43   219         72   146             16   239         71   156

- Information-theoretic approach to clustering, co-clustering and matrix approximation
- Implicit dimensionality reduction at each step to overcome sparsity & high-dimensionality
- Theoretical approach has the potential of extending to other problems:
- Multi-dimensional co-clustering
- MDL to choose number of co-clusters
- Generalized co-clustering by Bregman divergence

- Email: inderjit@cs.utexas.edu
- Papers are available at: http://www.cs.utexas.edu/users/inderjit
- “Divisive Information-Theoretic Feature Clustering for Text Classification”, Dhillon, Mallela & Kumar, Journal of Machine Learning Research (JMLR), March 2003 (also KDD, 2002)
- “Information-Theoretic Co-clustering”, Dhillon, Mallela & Modha, KDD, 2003.
- “Clustering with Bregman Divergences”, Banerjee, Merugu, Dhillon & Ghosh, SIAM Data Mining Proceedings, April, 2004.
- “A Generalized Maximum Entropy Approach to Bregman Co-clustering & Matrix Approximation”, Banerjee, Dhillon, Ghosh, Merugu & Modha, working manuscript, 2004.