1. Distributional Clustering and Graphical Models
Ron Bekkerman
University of Massachusetts at Amherst
29/6/2005
2–5. Intellectual candy?
Consider a simple model:
Discrete case
Inference should be easy
Compute the joint
What if the joint is intractable?
Factorize
What if marginals are intractable?
Make generative assumptions
And what if you cannot make assumptions?
6–7. When you cannot make assumptions
$X$ is your data
$\tilde{X}$ is a (hard) clustering system of $X$
$\tilde{X}$ is a random variable over the clusters of $X$
How is $\tilde{X}$ distributed?
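For a hard clustering this question has a standard answer; writing $c(x)$ for the cluster of element $x$ (notation assumed here, not taken from the slide), each cluster simply inherits the probability mass of its members:

```latex
% Distribution induced on the cluster variable \tilde{X}
% by a hard assignment c : X -> \tilde{X}.
p(\tilde{x}) \;=\; \sum_{x \,:\, c(x) = \tilde{x}} p(x)
```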
8. Another example
$X$ is your data
$\pi$ is a partial order on the values of $X$
$\pi$ is a random variable over orderings of $X$
Again, how is $\pi$ distributed?
9. Why clustering?
Straightforward for:
Learning the intrinsic structure of data
Density estimation
Dimensionality reduction
Thousands of applications:
Text processing
Computer vision
Bioinformatics
10. Why graphical models?
Intuitive, pictorial
Easy to make independence assumptions
Emerging in multi-modal clustering
For example, if the graph implies $X \perp Y \mid Z$, then $p(x, y, z) = p(z)\,p(x \mid z)\,p(y \mid z)$
11. Multi-way clustering
Constructing N clustering systems
Of interdependent entities
Simultaneously
Bootstrapping each other
Examples:
Documents, words, authors of documents
Images, features, image captions
Movies, users, ratings
12. Literature
Surprisingly sparse
Multivariate Information Bottleneck
Friedman et al. UAI-2001
Multi-Source Contingency Clustering
Bouvrie 2004
Multi-Modal Clustering
As building one clustering system of N types of entities, e.g. Heer & Chi 2001
13. Literature (cont'd)
A variety of work on Double Clustering:
Slonim & Tishby SIGIR-2000
El-Yaniv & Souroujon NIPS-2001
Dhillon et al. KDD-2003
Getz et al. PNAS-2000 (bio-informatics)
What's wrong with multi-way clustering?
It's computationally hard
Well, everything here is NP-hard, but...
14–16. Informal problem statement
Given a graphical model
Over the cluster variables $\tilde{X}_1, \dots, \tilde{X}_n$
Find the best assignment of clusterings
Where $\tilde{X}_i$ is a clustering of $X_i$
Say, for simplicity, that each $|\tilde{X}_i|$ is fixed in advance
This is an unsupervised model
In which we do discriminative inference
Also, discriminative density estimation
Of the joint $p(\tilde{x}_1, \dots, \tilde{x}_n)$
17–18. Simplest case
Consider the model: cluster $X$ into $\tilde{X}$, with a relevance variable $Y$
The joint distribution $p(x, y)$ is known
Objective: maximize the Mutual Information $I(\tilde{X}; Y)$
Subject to $|\tilde{X}| = k$
This is the Information Bottleneck (IB)
Tishby, Pereira & Bialek 1999
Originally formulated as a compression/relevance trade-off (see below)
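The original soft formulation alongside the hard-clustering variant used on this slide, in standard notation ($\beta$ balances compression against relevance):

```latex
% Original IB (Tishby, Pereira & Bialek 1999): compress X into \tilde{X}
% while preserving information about the relevance variable Y.
\min_{p(\tilde{x} \mid x)} \; I(X; \tilde{X}) \;-\; \beta \, I(\tilde{X}; Y)

% Hard-clustering variant: fix the number of clusters
% instead of penalizing I(X; \tilde{X}).
\max_{\tilde{X}} \; I(\tilde{X}; Y) \quad \text{s.t.} \quad |\tilde{X}| = k
```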
19. Information Bottleneck clustering
Agglomerative (bottom-up)
Slonim & Tishby NIPS-2000
Conglomerative (top-down)
Bekkerman et al. JMLR-2003
K-means-like
Slonim et al. SIGIR-2002
Pull each $x$ out of its cluster
Put it into the cluster for which the objective is maximized
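A minimal sketch of that reassignment step, assuming a caller-supplied objective function that scores a whole clustering; the data structures and names are illustrative, not the authors' implementation (real systems update the objective incrementally rather than rescoring from scratch):

```python
def correction_pass(elements, clusters, assignment, objective):
    """One sequential pass: move each element to the cluster whose
    membership maximizes the global objective (K-means-like IB step)."""
    changed = False
    for x in elements:
        current = assignment[x]
        clusters[current].remove(x)             # pull x out of its cluster
        best, best_score = None, float("-inf")
        for name, members in clusters.items():  # try every cluster
            members.add(x)
            score = objective(clusters)         # rescore the clustering
            members.remove(x)
            if score > best_score:
                best, best_score = name, score
        clusters[best].add(x)                   # put x where the score peaks
        if best != current:
            assignment[x] = best
            changed = True
    return changed                              # repeat passes until False
```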
20–21. More complex case
Consider the two-sided model: cluster $X$ into $\tilde{X}$ and $Y$ into $\tilde{Y}$
Objective: maximize the Mutual Information $I(\tilde{X}; \tilde{Y})$
Subject to $|\tilde{X}| = k_X$, $|\tilde{Y}| = k_Y$
This is information-theoretic co-clustering
Dhillon, Mallela & Modha 2003
Originally: minimize the loss in Mutual Information (see below)
Applying a K-means-like algorithm
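Dhillon, Mallela & Modha's objective for a fixed joint $p(x, y)$; because $I(X; Y)$ is a constant, minimizing the information loss is the same as maximizing $I(\tilde{X}; \tilde{Y})$:

```latex
% Information-theoretic co-clustering: lose as little
% of the original mutual information as possible.
\min_{\tilde{X},\,\tilde{Y}} \; I(X; Y) - I(\tilde{X}; \tilde{Y})
\;\;\Longleftrightarrow\;\;
\max_{\tilde{X},\,\tilde{Y}} \; I(\tilde{X}; \tilde{Y})
\quad \text{s.t.} \quad |\tilde{X}| = k_X,\; |\tilde{Y}| = k_Y
```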
22–23. Much more complex case
Consider an arbitrary model
Over $\tilde{X}_1, \dots, \tilde{X}_n$
Objective: maximize the Multi-Information
Subject to $|\tilde{X}_i| = k_i$ for each $i$
This is the multivariate Information Bottleneck
Friedman, Mosenzon, Slonim & Tishby 2001
Originally formulated over a pair of graphs, $G_{\mathrm{in}}$ and $G_{\mathrm{out}}$ (see below)
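Multi-information is the $n$-variable generalization of mutual information, and the multivariate IB trades off two such quantities, specified by an input graph and an output graph:

```latex
% Multi-information: total dependence among n variables.
\mathcal{I}(X_1; \dots; X_n) \;=\;
  \sum_{x_1, \dots, x_n} p(x_1, \dots, x_n)
  \log \frac{p(x_1, \dots, x_n)}{p(x_1) \cdots p(x_n)}

% Multivariate IB (Friedman et al. 2001): compress along G_in
% while preserving the dependencies encoded by G_out.
\min \; \mathcal{I}^{G_{\mathrm{in}}} \;-\; \beta \, \mathcal{I}^{G_{\mathrm{out}}}
```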
24. Why not multivariate IB?
Even on the shortest loops, Multi-Information can be infeasible: a triangle already calls for the full joint table $p(\tilde{x}_1, \tilde{x}_2, \tilde{x}_3)$
Multi-Information is good for tree graphs
But bad for graphs with loops
25. We propose
Consider an arbitrary model
Objective: maximize a weighted sum of pairwise MI terms, $\sum_{(i,j)} w_{ij}\, I(\tilde{X}_i; \tilde{X}_j)$, over the edges of the model
Subject to $|\tilde{X}_i| = k_i$
No multi-dimensional probability tables
Straightforward algorithmic implementation
26. Example
Consider our triangle of pairwise-connected variables $\tilde{X}_1, \tilde{X}_2, \tilde{X}_3$:
Objective in this case: $I(\tilde{X}_1; \tilde{X}_2) + I(\tilde{X}_2; \tilde{X}_3) + I(\tilde{X}_1; \tilde{X}_3)$
Easily broken into 3 parts, one per edge, as shown below:
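With unit weights, the triangle objective and its restriction to a single variable (the restriction is what the next slide exploits):

```latex
% Pairwise objective on the triangle: one MI term per edge,
% each computable from a two-dimensional table.
\max_{\tilde{X}_1, \tilde{X}_2, \tilde{X}_3} \;
  I(\tilde{X}_1; \tilde{X}_2) + I(\tilde{X}_2; \tilde{X}_3) + I(\tilde{X}_1; \tilde{X}_3)

% With \tilde{X}_2 and \tilde{X}_3 held fixed, only two terms
% depend on \tilde{X}_1:
\max_{\tilde{X}_1} \; I(\tilde{X}_1; \tilde{X}_2) + I(\tilde{X}_1; \tilde{X}_3)
```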
27–28. Example (cont'd)
Assume the current $\tilde{X}_2$ and $\tilde{X}_3$ are observed:
Find the best $\tilde{X}_1$ given all its neighbors
Which means: given the rest of the network
Roughly corresponds to maximizing $I(\tilde{X}_1; \tilde{X}_2) + I(\tilde{X}_1; \tilde{X}_3)$ alone
29. Algorithmic setup
Which setup to choose?
Top-down? Bottom-up? K-means-like?
Each one has its drawbacks
How about all of them?
Apply bottom-up to some variables
Apply top-down to the rest
Rearrange elements at each level
Maximizing the objective function
30–32. Algorithmic setup: discussion
Why not plain K-means?
K is arbitrary
Initialization is random
Why not all top-down?
The first split is random
Following splits are then random too
Why not all bottom-up?
Extremely expensive
33. Multi-way Distributional Clustering (MDC)
Initialization
If $X_i$ is clustered bottom-up, put each $x_i$ in a singleton cluster
If top-down, put all $x_i$ in one common cluster
Main loop
If bottom-up, merge every two closest clusters
If top-down, split each cluster into two halves
Correction loop
Pull each $x_i$ out of its cluster
Put it into the cluster for which the objective is maximized
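Putting the pieces together, a schematic sketch of the schedule above; distance and correct are caller-supplied placeholders (correct could be an adaptation of the correction_pass sketched earlier), and the random halving at a split is an assumption here, to be refined by the correction loop:

```python
import random

def split_in_halves(clusters):
    """Top-down step: split each non-singleton cluster into two halves."""
    out = []
    for c in clusters:
        items = list(c)
        if len(items) < 2:
            out.append(c)
            continue
        random.shuffle(items)
        mid = len(items) // 2
        out.append(set(items[:mid]))
        out.append(set(items[mid:]))
    return out

def merge_closest_pairs(clusters, distance):
    """Bottom-up step: greedily merge each cluster with its nearest
    remaining neighbor, roughly halving the number of clusters."""
    pool, merged = list(clusters), []
    while len(pool) > 1:
        a = pool.pop(0)
        j = min(range(len(pool)), key=lambda i: distance(a, pool[i]))
        merged.append(a | pool.pop(j))
    merged.extend(pool)  # leftover cluster when the count is odd
    return merged

def mdc(variables, directions, distance, correct, num_levels):
    """Schematic MDC: bottom-up variables start as singletons and merge;
    top-down variables start as one cluster and split; a correction pass
    after every level greedily reassigns elements."""
    clusterings = {}
    for name, elems in variables.items():        # initialization
        if directions[name] == "bottom-up":
            clusterings[name] = [{x} for x in elems]
        else:
            clusterings[name] = [set(elems)]
    for _ in range(num_levels):                  # main loop
        for name in clusterings:
            if directions[name] == "bottom-up":
                clusterings[name] = merge_closest_pairs(clusterings[name], distance)
            else:
                clusterings[name] = split_in_halves(clusterings[name])
        correct(clusterings)                     # correction loop
    return clusterings
```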
34. Example: 2-way MDC
35. Computational complexity
General case
At each iteration of the main loop:
Pass over all clustered variables $X_i$
Pass over all elements $x_i$ of each
Pass over all candidate clusters $\tilde{x}_i$
If only one system is clustered bottom-up, the cost shrinks accordingly
2-way case
At each iteration the number of top-down clusters is doubled
While the number of bottom-up clusters is halved
36. Experimental setup
2-way MDC
Documents and Words
3-way MDC
Documents, Words and Authors
4-way MDC
Documents, Words, Authors, and document Titles
Documents: bottom-up, the rest: top-down
37. Evaluation methodology
Clustering evaluation
Is generally unintuitive
Is an entire ML research field
We use the accuracy measure
Following Slonim et al. and Dhillon et al.
Ground truth: the known category (folder) labels
Our results: clusters, each mapped to its dominant category
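A sketch of that accuracy measure as commonly defined: map each cluster to its dominant ground-truth label, then count the elements that land under the right label (names and tie-breaking here are illustrative assumptions, not necessarily identical to the cited papers):

```python
from collections import Counter

def clustering_accuracy(clusters, true_label):
    """Fraction of elements whose cluster's dominant label matches
    their own; clusters are non-empty collections of element ids."""
    correct = total = 0
    for cluster in clusters:
        counts = Counter(true_label[x] for x in cluster)
        correct += counts.most_common(1)[0][1]  # dominant-label mass
        total += len(cluster)
    return correct / total
```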
38. Datasets
Three CALO email datasets:
acheyer: 664 messages, 38 folders
mgervasio: 777 messages, 15 folders
mgondek: 297 messages, 14 folders
Two Enron email datasets:
kitchen-l: 4015 messages, 47 folders
sanders-r: 1188 messages, 30 folders
The 20 Newsgroups: 19997 messages
39. Results
40. Improvement over the baseline
41. Discussion
Improvement over Slonim et al.
Which is a 1-way clustering algorithm
Shows that multi-modality helps
Improvement over Dhillon et al.
Which is a 2-way clustering algorithm
Shows that hierarchical setup helps
MDC is an efficient method
Which allows exploring complex models
3-way, 4-way, etc.
42. Conclusion
No generative assumptions
Unlike generative models
No training data
Unlike supervised models
Tight link between distributional clustering and graphical models
Finally!
Great empirical results
43–44. Future work
Algorithmic enhancements
Of the correction loop
Inference on the optimal number of clusters
Generalization to continuous tasks
Toward unsupervised discriminative models