Information Theoretic Clustering, Co-clustering and Matrix Approximations

Information Theoretic Clustering, Co-clustering and Matrix Approximations
Inderjit S. Dhillon, University of Texas, Austin

Joint work with A. Banerjee, J. Ghosh, Y. Guan, S. Mallela, S. Merugu & D. Modha

Data Mining Seminar Series, Mar 26, 2004



Clustering: Unsupervised Learning

  • Grouping together of “similar” objects

    • Hard Clustering -- Each object belongs to a single cluster

    • Soft Clustering -- Each object is probabilistically assigned to clusters
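
As a toy illustration (not from the slides), the NumPy sketch below contrasts the two: a hard assignment puts each point into the single nearest of two fixed centers, while a soft assignment spreads each point's membership over both centers.

```python
import numpy as np

points = np.array([[0.0, 0.1], [0.2, 0.0], [3.0, 3.1], [2.9, 3.0], [1.5, 1.6]])
centers = np.array([[0.0, 0.0], [3.0, 3.0]])

# Squared Euclidean distance from every point to every center.
d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

hard = d2.argmin(axis=1)                                     # one cluster per point
soft = np.exp(-d2) / np.exp(-d2).sum(axis=1, keepdims=True)  # cluster probabilities per point

print(hard)           # hard labels, one per point
print(soft.round(3))  # soft memberships; each row sums to 1
```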



Contingency Tables

  • Let X and Y be discrete random variables

    • X and Y take values in {1, 2, …, m} and {1, 2, …, n}

    • p(X, Y) denotes the joint probability distribution—if not known, it is often estimated based on co-occurrence data

    • Application areas: text mining, market-basket analysis, analysis of browsing behavior, etc.

  • Key Obstacles in Clustering Contingency Tables

    • High Dimensionality, Sparsity, Noise

    • Need for robust and scalable algorithms
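
As a concrete sketch (toy numbers, not from the talk), a co-occurrence count matrix can be normalized into an estimate of the joint distribution p(X, Y); the marginals follow by summation. The later sketches reuse this p_xy.

```python
import numpy as np

# Toy word-by-document co-occurrence counts (rows: words X, columns: documents Y).
counts = np.array([[8, 6, 0, 1],
                   [7, 7, 1, 0],
                   [0, 1, 7, 8],
                   [1, 0, 8, 7]], dtype=float)

p_xy = counts / counts.sum()   # empirical joint distribution p(X, Y)
p_x = p_xy.sum(axis=1)         # row marginal p(X)
p_y = p_xy.sum(axis=0)         # column marginal p(Y)
```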



Co-Clustering

  • Simultaneously

    • Cluster rows of p(X, Y) into k disjoint groups

    • Cluster columns of p(X, Y) into l disjoint groups

  • Key goal is to exploit the “duality” between row and column clustering to overcome sparsity and noise



Co-clustering Example for Text Data

  • Co-clustering clusters both words and documents simultaneously using the underlying co-occurrence frequency matrix

[Diagram: the word-by-document co-occurrence matrix, with its rows grouped into word clusters and its columns grouped into document clusters.]



Co-clustering and Information Theory

  • View “co-occurrence” matrix as a joint probability distribution over row & column random variables

  • We seek a “hard-clustering” of both rows and columns such that “information” in the compressed matrix is maximized.



Information Theory Concepts

  • Entropy of a random variable X with probability distribution p:

    H(X) = −Σ_x p(x) log p(x)

  • The Kullback-Leibler (KL) Divergence or “Relative Entropy” between two probability distributions p and q:

    KL(p ∥ q) = Σ_x p(x) log( p(x) / q(x) )

  • Mutual Information between random variables X and Y:

    I(X; Y) = Σ_x Σ_y p(x, y) log( p(x, y) / (p(x) p(y)) )
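
These three quantities translate directly into small NumPy helpers (a sketch using base-2 logarithms); they are reused by the later sketches.

```python
import numpy as np

def entropy(p):
    """H(X) = -sum_x p(x) log p(x), with 0 log 0 taken as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log(p(x) / q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

def mutual_information(p_xy):
    """I(X;Y) = KL( p(X,Y) || p(X) p(Y) )."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    return kl_divergence(p_xy.ravel(), (p_x * p_y).ravel())
```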



“Optimal” Co-Clustering

  • Seek random variables X̂ and Ŷ, taking values in {1, 2, …, k} and {1, 2, …, l} respectively, such that the mutual information I(X̂; Ŷ) is maximized,

    where X̂ = R(X) is a function of X alone

    and Ŷ = C(Y) is a function of Y alone
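
A minimal sketch of the compressed distribution p(X̂, Ŷ) induced by a candidate row map R and column map C, reusing p_xy and mutual_information from the sketches above (the name compressed_joint is mine):

```python
import numpy as np

def compressed_joint(p_xy, row_map, col_map, k, l):
    """p(xhat, yhat): total probability mass in each (row cluster, column cluster) block."""
    p_hat = np.zeros((k, l))
    for x in range(p_xy.shape[0]):
        for y in range(p_xy.shape[1]):
            p_hat[row_map[x], col_map[y]] += p_xy[x, y]
    return p_hat

# Candidate co-clustering of the toy p_xy: rows {0,1} vs {2,3}, columns {0,1} vs {2,3}.
row_map = np.array([0, 0, 1, 1])
col_map = np.array([0, 0, 1, 1])
p_hat = compressed_joint(p_xy, row_map, col_map, k=2, l=2)

# Compressing can only lose information: I(Xhat; Yhat) <= I(X; Y).
print(mutual_information(p_hat), mutual_information(p_xy))
```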



Related Work

  • Distributional Clustering

    • Pereira, Tishby & Lee (1993), Baker & McCallum (1998)

  • Information Bottleneck

    • Tishby, Pereira & Bialek (1999), Slonim, Friedman & Tishby (2001), Berkhin & Becher (2002)

  • Probabilistic Latent Semantic Indexing

    • Hofmann (1999), Hofmann & Puzicha (1999)

  • Non-Negative Matrix Approximation

    • Lee & Seung (2000)



Information-Theoretic Co-clustering

  • Lemma: “Loss in mutual information” equals

    I(X; Y) − I(X̂; Ŷ) = KL( p(X, Y) ∥ q(X, Y) )

  • p is the input distribution

  • q is an approximation to p, defined by

    q(x, y) = p(x̂, ŷ) p(x | x̂) p(y | ŷ), for x ∈ x̂ and y ∈ ŷ

    • It can be shown that q(x, y) is the maximum entropy approximation to p subject to the cluster constraints.
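
Continuing the toy example, the sketch below builds q and checks the lemma numerically, along with the fact (used later) that q preserves the marginals of p (the name q_approximation is mine):

```python
import numpy as np

def q_approximation(p_xy, row_map, col_map, k, l):
    """q(x, y) = p(xhat, yhat) * p(x | xhat) * p(y | yhat), for x in xhat and y in yhat."""
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    p_hat = compressed_joint(p_xy, row_map, col_map, k, l)
    p_xhat = np.array([p_x[row_map == c].sum() for c in range(k)])
    p_yhat = np.array([p_y[col_map == c].sum() for c in range(l)])
    q = np.zeros_like(p_xy)
    for x in range(p_xy.shape[0]):
        for y in range(p_xy.shape[1]):
            xh, yh = row_map[x], col_map[y]
            q[x, y] = p_hat[xh, yh] * (p_x[x] / p_xhat[xh]) * (p_y[y] / p_yhat[yh])
    return q

q = q_approximation(p_xy, row_map, col_map, 2, 2)

# The lemma: loss in mutual information equals KL(p || q).
loss = mutual_information(p_xy) - mutual_information(compressed_joint(p_xy, row_map, col_map, 2, 2))
assert np.isclose(loss, kl_divergence(p_xy.ravel(), q.ravel()))

# q preserves the row and column marginals of p.
assert np.allclose(q.sum(axis=1), p_xy.sum(axis=1))
assert np.allclose(q.sum(axis=0), p_xy.sum(axis=0))
```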



The number of parameters that determine q(x, y) is (k·l − 1) + (m − k) + (n − l): the co-cluster probabilities p(x̂, ŷ) plus the within-cluster distributions p(x | x̂) and p(y | ŷ).



Decomposition Lemma

  • Question: how do we minimize KL( p(X, Y) ∥ q(X, Y) )?

  • The following lemma reveals the answer:

    KL( p(X, Y) ∥ q(X, Y) ) = Σ_x̂ Σ_{x : R(x) = x̂} p(x) · KL( p(Y | x) ∥ q(Y | x̂) )

    Note that q(Y | x̂) may be thought of as the “prototype” of row cluster x̂.

    Similarly,

    KL( p(X, Y) ∥ q(X, Y) ) = Σ_ŷ Σ_{y : C(y) = ŷ} p(y) · KL( p(X | y) ∥ q(X | ŷ) )
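
Numerically, the row-cluster prototypes q(Y | x̂) can be read off from q, and the row form of the decomposition checked on the toy example (a sketch continuing the code above; row_prototypes is my name):

```python
import numpy as np

def row_prototypes(q, row_map, k):
    """q(Y | xhat): under q, every row x in cluster xhat shares the same conditional over Y."""
    protos = np.zeros((k, q.shape[1]))
    for c in range(k):
        rep = np.flatnonzero(row_map == c)[0]   # any representative row of the cluster
        protos[c] = q[rep] / q[rep].sum()
    return protos

protos = row_prototypes(q, row_map, k=2)
p_x = p_xy.sum(axis=1)

# KL(p || q) = sum over rows x of p(x) * KL( p(Y|x) || q(Y | xhat(x)) ).
lhs = kl_divergence(p_xy.ravel(), q.ravel())
rhs = sum(p_x[x] * kl_divergence(p_xy[x] / p_x[x], protos[row_map[x]])
          for x in range(p_xy.shape[0]))
assert np.isclose(lhs, rhs)
```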



Co-Clustering Algorithm

  • [Step 1] Set t = 0. Start with an initial co-clustering (R⁰, C⁰) and compute the corresponding approximation q⁰.

  • [Step 2] For every row x, assign it to the row cluster x̂ that minimizes KL( p(Y | x) ∥ q^t(Y | x̂) ).

  • [Step 3] We now have (R^(t+1), C^t); compute the new approximation q^(t+1).

  • [Step 4] For every column y, assign it to the column cluster ŷ that minimizes KL( p(X | y) ∥ q^(t+1)(X | ŷ) ).

  • [Step 5] We now have (R^(t+1), C^(t+2)); compute q^(t+2), set t ← t + 2, and iterate Steps 2–5 until the objective stops decreasing (a sketch follows below).
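
Putting the pieces together, here is a compact, self-contained NumPy sketch of Steps 1–5 (the names itcc, row_cluster_prototypes and assign_to_nearest are mine; a fuller implementation would also track the objective and stop once it no longer decreases):

```python
import numpy as np

def cluster_sums(p_xy, row_map, col_map, k, l):
    """p(xhat, yhat): aggregate the joint distribution over row and column clusters."""
    R = np.eye(k)[row_map]        # m x k one-hot row-cluster indicators
    C = np.eye(l)[col_map]        # n x l one-hot column-cluster indicators
    return R.T @ p_xy @ C

def row_cluster_prototypes(p_xy, row_map, col_map, k, l, eps=1e-12):
    """q(Y | xhat), built from q(y | xhat) = p(yhat | xhat) * p(y | yhat)."""
    p_y = p_xy.sum(axis=0)
    p_hat = cluster_sums(p_xy, row_map, col_map, k, l)                   # k x l
    p_yhat_given_xhat = p_hat / (p_hat.sum(axis=1, keepdims=True) + eps)
    p_yhat = p_hat.sum(axis=0)
    p_y_given_yhat = np.zeros((l, p_xy.shape[1]))
    for c in range(l):
        mask = col_map == c
        p_y_given_yhat[c, mask] = p_y[mask] / (p_yhat[c] + eps)
    return p_yhat_given_xhat @ p_y_given_yhat                            # k x n

def assign_to_nearest(conditionals, prototypes, eps=1e-12):
    """Assign each conditional distribution (a row) to the prototype with smallest KL divergence."""
    logs = np.log2(conditionals[:, None, :] + eps) - np.log2(prototypes[None, :, :] + eps)
    return (conditionals[:, None, :] * logs).sum(axis=2).argmin(axis=1)

def itcc(p_xy, k, l, n_iter=20):
    """Sketch of the co-clustering iteration (Steps 1-5)."""
    m, n = p_xy.shape
    row_map = np.arange(m) % k                 # Step 1: a simple initial co-clustering
    col_map = np.arange(n) % l
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    for _ in range(n_iter):
        # Step 2: re-assign every row to the closest row-cluster prototype q(Y | xhat).
        row_map = assign_to_nearest(p_xy / p_x[:, None],
                                    row_cluster_prototypes(p_xy, row_map, col_map, k, l))
        # Steps 3-4: recompute q, then re-assign every column to the closest q(X | yhat)
        # (by symmetry, this is the row step applied to the transposed distribution).
        col_map = assign_to_nearest((p_xy / p_y[None, :]).T,
                                    row_cluster_prototypes(p_xy.T, col_map, row_map, l, k))
    return row_map, col_map

# On the toy distribution from earlier, the iteration recovers the two row blocks
# and the two column blocks (up to relabeling of the clusters).
counts = np.array([[8, 6, 0, 1],
                   [7, 7, 1, 0],
                   [0, 1, 7, 8],
                   [1, 0, 8, 7]], dtype=float)
print(itcc(counts / counts.sum(), k=2, l=2))
```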



Properties of Co-clustering Algorithm

  • Main Theorem: Co-clustering “monotonically” decreases loss in mutual information

  • Co-clustering converges to a local minimum

  • Can be generalized to multi-dimensional contingency tables

  • q can be viewed as a “low complexity” non-negative matrix approximation

  • q preserves marginals of p, and co-cluster statistics

  • Implicit dimensionality reduction at each step helps overcome sparsity & high-dimensionality

  • Computationally economical



Applications -- Text Classification

  • Assigning class labels to text documents

  • Training and Testing Phases

[Diagram: training and testing. A document collection grouped into classes Class-1 … Class-m is used as training data for a classifier; the trained classifier then assigns a class to each new document.]



Feature Clustering (dimensionality reduction)

  • Feature Selection

    • Select the “best” words and throw away the rest

    • Frequency-based pruning

    • Information-criterion-based pruning

  • Feature Clustering

    • Do not throw away words; cluster words instead

    • Use the word clusters as features

[Diagram: a document’s bag-of-words vector over m words is reduced either to a vector over the k selected words (Word#1 … Word#k) or to a vector over k word clusters (Cluster#1 … Cluster#k).]
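
A minimal sketch of the reduction from word features to word-cluster features (toy sizes; the word_to_cluster assignment would come from a word-clustering step such as the divisive information-theoretic clustering referenced at the end of the talk):

```python
import numpy as np

def cluster_features(doc_counts, word_to_cluster, k):
    """Collapse an m-dimensional bag-of-words vector into k word-cluster counts."""
    feats = np.zeros(k)
    for word, count in enumerate(doc_counts):
        feats[word_to_cluster[word]] += count
    return feats

doc_counts = np.array([3, 0, 1, 0, 2, 4])        # toy document over m = 6 words
word_to_cluster = np.array([0, 0, 1, 1, 2, 2])   # toy assignment of words to k = 3 clusters
print(cluster_features(doc_counts, word_to_cluster, k=3))   # -> [3. 1. 6.]
```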



Experiments

  • Data sets

    • 20 Newsgroups data

      • 20 classes, 20000 documents

    • Classic3 data set

      • 3 classes (cisi, med and cran), 3893 documents

    • Dmoz Science HTML data

      • 49 leaves in the hierarchy

      • 5000 documents with 14538 words

      • Available at http://www.cs.utexas.edu/users/manyam/dmoz.txt

  • Implementation Details

    • Bow – for indexing, co-clustering, clustering and classifying



Results (20Ng)

  • Classification Accuracy on 20 Newsgroups data with 1/3-2/3 test-train split

  • Divisive clustering beats feature selection algorithms by a large margin

  • The effect is more pronounced when the number of features is small



Results (Dmoz)

  • Classification accuracy on Dmoz data with a 1/3-2/3 test-train split

  • Divisive Clustering is better when the number of features is small

  • Note the contrasting behavior of Naïve Bayes and SVMs



Results (Dmoz)

  • Naïve Bayes on Dmoz data with only 2% training data

  • Note that Divisive Clustering achieves a higher maximum than Information Gain (IG), a significant 13% increase

  • Divisive Clustering performs better than IG when training data is scarce



Hierarchical Classification

[Diagram: an example hierarchy rooted at Science, with internal nodes Math, Physics and Social Science, and leaf classes including Number Theory, Logic, Quantum Theory, Mechanics, Economics and Archeology.]

  • Flat classifier builds a classifier over the leaf classes in the above hierarchy

  • Hierarchical Classifier builds a classifier at each internal node of the hierarchy
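
A small sketch of the routing logic (the Node and KeywordClassifier names are mine; in the talk each internal node would hold a Naïve Bayes classifier trained only on the documents under that node):

```python
class Node:
    """A node in the class hierarchy; internal nodes carry a classifier over their children."""
    def __init__(self, name, classifier=None, children=None):
        self.name = name
        self.classifier = classifier      # assumed to expose predict(doc) -> child name
        self.children = children or {}    # child name -> Node; empty for leaf classes

def classify(node, doc):
    """Walk down the hierarchy, one classification per internal node, until a leaf is reached."""
    while node.children:
        node = node.children[node.classifier.predict(doc)]
    return node.name

class KeywordClassifier:
    """Toy stand-in for a per-node classifier."""
    def __init__(self, keyword_to_child):
        self.keyword_to_child = keyword_to_child
    def predict(self, doc):
        for keyword, child in self.keyword_to_child.items():
            if keyword in doc:
                return child
        return next(iter(self.keyword_to_child.values()))

root = Node("Science", KeywordClassifier({"quantum": "Physics", "prime": "Math"}), {
    "Physics": Node("Physics", KeywordClassifier({"quantum": "Quantum Theory"}),
                    {"Quantum Theory": Node("Quantum Theory"), "Mechanics": Node("Mechanics")}),
    "Math": Node("Math", KeywordClassifier({"prime": "Number Theory"}),
                 {"Number Theory": Node("Number Theory"), "Logic": Node("Logic")}),
})
print(classify(root, "a note on quantum entanglement"))   # -> Quantum Theory
```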



Results (Dmoz)

  • Hierarchical Classifier (Naïve Bayes at each node)

  • Hierarchical Classifier: 64.54% accuracy at just 10 features (Flat achieves 64.04% accuracy at 1000 features)

  • Hierarchical Classifier improves accuracy to 68.42% from the 64.42% (maximum) achieved by flat classifiers



Anecdotal Evidence

Top few words from the clusters obtained by the Divisive and Agglomerative approaches on 20 Newsgroups data



Co-Clustering Results (CLASSIC3)

Confusion matrices of clusters against the three true classes:

Co-Clustering (0.9835):

    992      4      8
     40   1452      7
      1      4   1387

1-D Clustering (0.821):

    847    142     44
     41    954    405
    275     86   1099



Results – Binary (subset of 20Ng data)

Binary (0.852, 0.67):

    Co-clustering:        1-D Clustering:
      207    31             178   104
       43   219              72   146

Binary_subject (0.946, 0.648):

    Co-clustering:        1-D Clustering:
      234    11             179    94
       16   239              71   156



Precision – 20Ng data



Results: Sparsity (Binary_subject data)



Results: Sparsity (Binary_subject data)



Results (Monotonicity)



Conclusions

  • Information-theoretic approach to clustering, co-clustering and matrix approximation

  • Implicit dimensionality reduction at each step to overcome sparsity & high-dimensionality

  • Theoretical approach has the potential of extending to other problems:

    • Multi-dimensional co-clustering

    • MDL to choose number of co-clusters

    • Generalized co-clustering via Bregman divergences



More Information

  • Email: [email protected]

  • Papers are available at: http://www.cs.utexas.edu/users/inderjit

  • “Divisive Information-Theoretic Feature Clustering for Text Classification”, Dhillon, Mallela & Kumar, Journal of Machine Learning Research (JMLR), March 2003 (also KDD, 2002)

  • “Information-Theoretic Co-clustering”, Dhillon, Mallela & Modha, KDD, 2003.

  • “Clustering with Bregman Divergences”, Banerjee, Merugu, Dhillon & Ghosh, SIAM Data Mining Proceedings, April, 2004.

  • “A Generalized Maximum Entropy Approach to Bregman Co-clustering & Matrix Approximation”, Banerjee, Dhillon, Ghosh, Merugu & Modha, working manuscript, 2004.

