Information Theoretic Clustering, Co-clustering and Matrix Approximations

Inderjit S. Dhillon, University of Texas, Austin

Joint work with A. Banerjee, J. Ghosh, Y. Guan, S. Mallela, S. Merugu & D. Modha

Data Mining Seminar Series, Mar 26, 2004

Clustering: Unsupervised Learning
  • Grouping together of “similar” objects
    • Hard Clustering -- Each object belongs to a single cluster
    • Soft Clustering -- Each object is probabilistically assigned to clusters
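
As a small illustration (not from the slides), the sketch below assigns four one-dimensional points to two made-up cluster centers, once as a hard assignment and once as a soft, probability-weighted one:

```python
import numpy as np

points = np.array([0.0, 0.1, 0.9, 1.0])    # four 1-D objects
centers = np.array([0.05, 0.95])           # two hypothetical cluster centers

dist = np.abs(points[:, None] - centers[None, :])

hard = dist.argmin(axis=1)                                        # hard: one cluster index per object
soft = np.exp(-dist) / np.exp(-dist).sum(axis=1, keepdims=True)   # soft: a probability per cluster

print(hard)   # [0 0 1 1]
print(soft)   # each row sums to 1
```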
Contingency Tables
  • Let X and Y be discrete random variables
    • X and Y take values in {1, 2, …, m} and {1, 2, …, n}
    • p(X, Y) denotes the joint probability distribution; if it is not known, it is often estimated from co-occurrence data (see the sketch after this list)
    • Application areas: text mining, market-basket analysis, analysis of browsing behavior, etc.
  • Key Obstacles in Clustering Contingency Tables
    • High Dimensionality, Sparsity, Noise
    • Need for robust and scalable algorithms
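
As noted above, the joint distribution is often estimated from co-occurrence counts. A minimal sketch, assuming a made-up count matrix and plain normalization:

```python
import numpy as np

# Made-up co-occurrence counts, e.g. word-document or item-basket frequencies.
counts = np.array([[10., 0., 2.],
                   [ 0., 5., 3.],
                   [ 4., 1., 0.]])

p = counts / counts.sum()     # joint distribution p(x, y)
p_x = p.sum(axis=1)           # row marginal p(x)
p_y = p.sum(axis=0)           # column marginal p(y)
```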
Co-Clustering
  • Simultaneously
    • Cluster rows of p(X, Y) into k disjoint groups
    • Cluster columns of p(X, Y) into l disjoint groups
  • Key goal is to exploit the “duality” between row and column clustering to overcome sparsity and noise
Co-clustering Example for Text Data
  • Co-clustering clusters both words and documents simultaneously using the underlying co-occurrence frequency matrix

[Figure: a word-document co-occurrence matrix, with words grouped into word clusters and documents grouped into document clusters.]

Co-clustering and Information Theory
  • View “co-occurrence” matrix as a joint probability distribution over row & column random variables
  • We seek a “hard-clustering” of both rows and columns such that “information” in the compressed matrix is maximized.
Information Theory Concepts
  • Entropy of a random variable X with probability distribution p:  $H(p) = -\sum_x p(x) \log p(x)$
  • The Kullback-Leibler (KL) Divergence or “Relative Entropy” between two probability distributions p and q:  $KL(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$
  • Mutual Information between random variables X and Y:  $I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$
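
The following is a minimal numpy sketch of these three quantities (the function names entropy, kl_divergence and mutual_information are mine; logarithms are base 2 and zero-probability terms are dropped, as is conventional):

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_x p(x) log2 p(x)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log2(p(x) / q(x)), assuming q > 0 wherever p > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def mutual_information(pxy):
    """I(X; Y) = KL(p(X, Y) || p(X) p(Y)) for a joint distribution given as a 2-D array."""
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    return kl_divergence(pxy.ravel(), np.outer(px, py).ravel())
```

For example, mutual_information applied to the joint distribution p built in the earlier sketch returns I(X; Y) in bits.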
“Optimal” Co-Clustering
  • Seek random variables $\hat{X}$ and $\hat{Y}$, taking values in {1, 2, …, k} and {1, 2, …, l}, such that the mutual information $I(\hat{X}; \hat{Y})$ is maximized (equivalently, the loss $I(X; Y) - I(\hat{X}; \hat{Y})$ is minimized),

    where $\hat{X} = R(X)$ is a function of X alone

    where $\hat{Y} = C(Y)$ is a function of Y alone

Related Work
  • Distributional Clustering
    • Pereira, Tishby & Lee (1993), Baker & McCallum (1998)
  • Information Bottleneck
    • Tishby, Pereira & Bialek (1999), Slonim, Friedman & Tishby (2001), Berkhin & Becher (2002)
  • Probabilistic Latent Semantic Indexing
    • Hofmann (1999), Hofmann & Puzicha (1999)
  • Non-Negative Matrix Approximation
    • Lee & Seung (2000)
Information-Theoretic Co-clustering
  • Lemma: the “loss in mutual information” equals

    $I(X; Y) - I(\hat{X}; \hat{Y}) \;=\; KL\big(p(X, Y) \,\|\, q(X, Y)\big), \quad \text{where } q(x, y) = p(\hat{x}, \hat{y})\, p(x \mid \hat{x})\, p(y \mid \hat{y})$

  • p is the input distribution
  • q is an approximation to p
    • It can be shown that q(x, y) is a maximum-entropy approximation to p subject to the cluster constraints.
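
A hedged sketch of the approximation q in the lemma; build_q and its arguments are my own names, and every row and column cluster is assumed to carry nonzero probability mass:

```python
import numpy as np

def build_q(p, row_labels, col_labels, k, l):
    """q(x, y) = p(xhat, yhat) p(x | xhat) p(y | yhat) for given cluster assignments."""
    px, py = p.sum(axis=1), p.sum(axis=0)
    # p(xhat, yhat): joint mass aggregated over each co-cluster
    p_hat = np.array([[p[np.ix_(row_labels == r, col_labels == c)].sum()
                       for c in range(l)] for r in range(k)])
    # cluster marginals p(xhat), p(yhat); assumed nonzero
    p_xhat = np.array([px[row_labels == r].sum() for r in range(k)])
    p_yhat = np.array([py[col_labels == c].sum() for c in range(l)])
    q = np.zeros(p.shape)
    for x in range(p.shape[0]):
        for y in range(p.shape[1]):
            r, c = row_labels[x], col_labels[y]
            q[x, y] = p_hat[r, c] * (px[x] / p_xhat[r]) * (py[y] / p_yhat[c])
    return q
```

The p_hat matrix aggregated inside the function is the joint distribution of the clustered variables, so the lemma can be checked numerically: mutual_information(p) minus mutual_information(p_hat) should equal kl_divergence(p.ravel(), q.ravel()) up to floating-point error.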
Decomposition Lemma
  • Question: How do we minimize $KL(p \,\|\, q)$?
  • The following lemma reveals the answer:

    $KL(p \,\|\, q) \;=\; \sum_{\hat{x}} \sum_{x \,:\, R(x) = \hat{x}} p(x)\, KL\big(p(Y \mid x) \,\|\, q(Y \mid \hat{x})\big)$

Note that $q(Y \mid \hat{x})$ may be thought of as the “prototype” of row cluster $\hat{x}$.

Similarly, $KL(p \,\|\, q) \;=\; \sum_{\hat{y}} \sum_{y \,:\, C(y) = \hat{y}} p(y)\, KL\big(p(X \mid y) \,\|\, q(X \mid \hat{y})\big)$.
Co-Clustering Algorithm
  • [Step 1] Set t = 0. Start with an initial co-clustering $(R^{(0)}, C^{(0)})$ and compute the approximation $q^{(0)}$.
  • [Step 2] For every row x, assign it to the row cluster $\hat{x}$ that minimizes $KL\big(p(Y \mid x) \,\|\, q^{(t)}(Y \mid \hat{x})\big)$.
  • [Step 3] We now have $(R^{(t+1)}, C^{(t)})$. Compute $q^{(t+1)}$.
  • [Step 4] For every column y, assign it to the column cluster $\hat{y}$ that minimizes $KL\big(p(X \mid y) \,\|\, q^{(t+1)}(X \mid \hat{y})\big)$.
  • [Step 5] We now have $(R^{(t+1)}, C^{(t+1)})$. Compute $q^{(t+2)}$, set t = t + 2, and iterate Steps 2-5 until convergence.
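
A minimal sketch of these five steps, reusing build_q and kl_divergence from the earlier sketches; co_cluster is my own name, initialization is random, empty clusters are not handled, and the loop runs a fixed number of passes instead of testing convergence (an infinite KL value simply rules a cluster out).

```python
import numpy as np

def co_cluster(p, k, l, n_iters=20, seed=0):
    m, n = p.shape
    rng = np.random.default_rng(seed)
    rows = rng.integers(k, size=m)              # Step 1: arbitrary initial row clustering
    cols = rng.integers(l, size=n)              #         and column clustering
    px, py = p.sum(axis=1), p.sum(axis=0)
    for _ in range(n_iters):
        q = build_q(p, rows, cols, k, l)
        # Row-cluster "prototypes" q(Y | xhat): identical for every row in a cluster.
        row_proto = np.array([q[rows == r].sum(axis=0) for r in range(k)])
        row_proto /= row_proto.sum(axis=1, keepdims=True)
        for x in range(m):                      # Step 2: reassign each row
            p_y_given_x = p[x] / px[x]
            rows[x] = int(np.argmin([kl_divergence(p_y_given_x, row_proto[r])
                                     for r in range(k)]))
        q = build_q(p, rows, cols, k, l)        # Step 3: recompute q
        # Column-cluster prototypes q(X | yhat).
        col_proto = np.array([q[:, cols == c].sum(axis=1) for c in range(l)])
        col_proto /= col_proto.sum(axis=1, keepdims=True)
        for y in range(n):                      # Step 4: reassign each column
            p_x_given_y = p[:, y] / py[y]
            cols[y] = int(np.argmin([kl_divergence(p_x_given_y, col_proto[c])
                                     for c in range(l)]))
        # Step 5: q is recomputed at the top of the next pass; iterate.
    return rows, cols
```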
Properties of Co-clustering Algorithm
  • Main Theorem: Co-clustering “monotonically” decreases loss in mutual information
  • Co-clustering converges to a local minimum
  • Can be generalized to multi-dimensional contingency tables
  • q can be viewed as a “low complexity” non-negative matrix approximation
  • q preserves marginals of p, and co-cluster statistics
  • Implicit dimensionality reduction at each step helps overcome sparsity & high-dimensionality
  • Computationally economical
Applications -- Text Classification
  • Assigning class labels to text documents
  • Training and Testing Phases

[Figure: training and testing pipeline. A document collection grouped into classes (Class-1 … Class-m) trains the classifier, which then assigns a class to each new document.]

Feature Clustering (dimensionality reduction)
  • Feature Selection
    • Select the “best” words; throw away the rest
    • Frequency-based pruning
    • Information-criterion-based pruning
  • Feature Clustering
    • Do not throw away words; cluster words instead
    • Use word clusters as features (see the sketch below)

[Figure: a document's bag-of-words is mapped either to a vector over selected words (Word#1 … Word#k) under feature selection, or to a vector over word clusters (Cluster#1 … Cluster#k) under feature clustering.]
Experiments
  • Data sets
    • 20 Newsgroups data
      • 20 classes, 20000 documents
    • Classic3 data set
      • 3 classes (cisi, med and cran), 3893 documents
    • Dmoz Science HTML data
      • 49 leaves in the hierarchy
      • 5000 documents with 14538 words
      • Available at http://www.cs.utexas.edu/users/manyam/dmoz.txt
  • Implementation Details
    • Bow toolkit – for indexing, co-clustering, clustering and classifying
Results (20Ng)
  • Classification Accuracy on 20 Newsgroups data with 1/3-2/3 test-train split
  • Divisive clustering beats feature selection algorithms by a large margin
  • The effect is more significant when fewer features are used
Results (Dmoz)
  • Classification Accuracy on Dmoz data with 1/3-2/3 test-train split
  • Divisive Clustering is better at lower numbers of features
  • Note the contrasting behavior of Naïve Bayes and SVMs
Results (Dmoz)
  • Naïve Bayes on Dmoz data with only 2% training data
  • Note that Divisive Clustering achieves a higher maximum accuracy than Information Gain (IG), a significant 13% increase
  • Divisive Clustering performs better than IG when less training data is available
Hierarchical Classification

[Figure: a class hierarchy rooted at Science with internal nodes Math, Physics and Social Science; leaf classes include Number Theory, Logic, Quantum Theory, Mechanics, Economics and Archeology.]

  • Flat classifier builds a classifier over the leaf classes in the above hierarchy
  • Hierarchical Classifier builds a classifier at each internal node of the hierarchy
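
A hedged sketch of the hierarchical scheme: one classifier per internal node routes a document down the tree until a leaf class is reached. The predict() interface and the two dictionaries are illustrative assumptions, not part of the slides.

```python
def classify_hierarchically(doc, root, classifiers, children):
    """children: dict mapping a node to its child nodes (empty or missing at leaves);
    classifiers: dict mapping each internal node to a model whose predict(doc)
    returns one of that node's children."""
    node = root
    while children.get(node):
        node = classifiers[node].predict(doc)   # e.g. 'Science' -> 'Physics' -> a leaf class
    return node
```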
Results (Dmoz)
  • Hierarchical Classifier (Naïve Bayes at each node)
  • Hierarchical Classifier: 64.54% accuracy at just 10 features (Flat achieves 64.04% accuracy at 1000 features)
  • Hierarchical Classifier improves accuracy to 68.42% from the 64.42% (maximum) achieved by flat classifiers
Anecdotal Evidence

Top few words, sorted within the clusters obtained by the Divisive and Agglomerative approaches on 20 Newsgroups data

Co-Clustering Results (CLASSIC3)

              Co-Clustering (0.9835)        1-D Clustering (0.821)
                992      4      8             847    142     44
                 40   1452      7              41    954    405
                  1      4   1387             275     86   1099

  • Confusion matrices of the three document clusters against the three CLASSIC3 classes, for co-clustering and one-dimensional clustering
Results – Binary (subset of 20Ng data)

              Binary (0.852, 0.67)              Binary_subject (0.946, 0.648)
          Co-clustering   1-D Clustering     Co-clustering   1-D Clustering
            207    31       178   104          234    11       179    94
             43   219        72   146           16   239        71   156

  • Confusion matrices of document clusters against the two classes for co-clustering and one-dimensional clustering, on the Binary and Binary_subject subsets
Conclusions
  • Information-theoretic approach to clustering, co-clustering and matrix approximation
  • Implicit dimensionality reduction at each step to overcome sparsity & high-dimensionality
  • The theoretical approach has the potential to extend to other problems:
    • Multi-dimensional co-clustering
    • MDL to choose number of co-clusters
    • Generalized co-clustering by Bregman divergence
More Information
  • Email: [email protected]
  • Papers are available at: http://www.cs.utexas.edu/users/inderjit
  • “Divisive Information-Theoretic Feature Clustering for Text Classification”, Dhillon, Mallela & Kumar, Journal of Machine Learning Research (JMLR), March 2003 (also KDD 2002)
  • “Information-Theoretic Co-clustering”, Dhillon, Mallela & Modha, KDD, 2003.
  • “Clustering with Bregman Divergences”, Banerjee, Merugu, Dhillon & Ghosh, SIAM Data Mining Proceedings, April 2004.
  • “A Generalized Maximum Entropy Approach to Bregman Co-clustering & Matrix Approximation”, Banerjee, Dhillon, Ghosh, Merugu & Modha, working manuscript, 2004.