
Distributional Clustering and Graphical Models



    1. Distributional Clustering and Graphical Models. Ron Bekkerman, University of Massachusetts at Amherst, 29/6/2005.

    2. Intellectual candy? Consider a simple model: discrete case. Inference should be easy.

    3. Intellectual candy? Consider a simple model: discrete case. Inference should be easy. Compute the joint; what if the joint is intractable?

    4. Intellectual candy? Consider a simple model: discrete case. Inference should be easy. Compute the joint; what if the joint is intractable? Factorize; what if the marginals are intractable?
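A worked instance of the factorization step (illustrative notation, not from the slides): for a directed graphical model over discrete variables, the joint decomposes into local conditionals,

    p(x_1, ..., x_n) = \prod_i p(x_i | pa(x_i)),

so inference only needs the small per-factor tables rather than the full joint table.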

    5. Intellectual candy? Consider a simple model: discrete case. Inference should be easy. Compute the joint; what if the joint is intractable? Factorize; what if the marginals are intractable? Make generative assumptions. And what if you cannot make assumptions?

    6. When you cannot make assumptions: X is your data; X̃ is a (hard) clustering system of X.

    7. When you cannot make assumptions: X is your data; X̃ is a (hard) clustering system of X; X̃ is a random variable over the clusters of X. How is X̃ distributed?

    8. Another example: X is your data; consider a partial order on the values of X, viewed as a random variable over the possible orderings. Again, how is it distributed?

    9. Why clustering? Straightforward for: learning the intrinsic structure of data, density estimation, dimensionality reduction. Thousands of applications: text processing, computer vision, bio-informatics, …

    10. Why graphical models? Intuitive, pictorial; easy to make independence assumptions; emerging in multi-modal clustering (if …, then …).

    11. Multi-way clustering: constructing N clustering systems of interdependent entities, simultaneously, bootstrapping each other. Examples: documents, words, authors of documents; images, features, image captions; movies, users, ratings.

    12. Literature: surprisingly sparse. Multivariate Information Bottleneck (Friedman et al., UAI-2001). Multi-Source Contingency Clustering (Bouvrie, 2004). Multi-Modal Clustering, as building one clustering system of N types of entities, e.g. Heer & Chi, 2001.

    13. Literature (cont'd): a variety of work on Double Clustering: Slonim & Tishby SIGIR-2000; El-Yaniv & Souroujon NIPS-2001; Dhillon et al. KDD-2003; Getz et al. PNAS-2000 (bio-informatics). What's wrong with multi-way clustering? It's computationally hard. Well, everything here is NP-hard, but…

    14. Informal problem statement. Given a graphical model over variables X1, …, Xn, find the "best" assignment of X̃1, …, X̃n, where X̃i is a clustering of Xi. Say, for simplicity, …

    15. Informal problem statement. Given a graphical model over variables X1, …, Xn, find the "best" assignment of X̃1, …, X̃n, where X̃i is a clustering of Xi. Say, for simplicity, … This is an unsupervised model in which we do "discriminative inference".

    16. Informal problem statement. Given a graphical model over variables X1, …, Xn, find the "best" assignment of X̃1, …, X̃n, where X̃i is a clustering of Xi. Say, for simplicity, … This is an unsupervised model in which we do "discriminative inference". Also, discriminative density estimation of the cluster variables X̃i.

    17. Simplest case. Consider the model with data X and a relevance variable Y; the distribution p(X, Y) is known. Objective: maximize the Mutual Information I(X̃; Y), subject to a fixed number of clusters |X̃| = k.

    18. Simplest case. Consider the model with data X and a relevance variable Y; the distribution p(X, Y) is known. Objective: maximize the Mutual Information I(X̃; Y), subject to a fixed number of clusters |X̃| = k. This is the Information Bottleneck (IB), Tishby, Pereira & Bialek 1999. Originally: minimize I(X; X̃) - β·I(X̃; Y).
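As an illustrative sketch (not code from the talk) of the quantity being maximized, the following Python computes I(X̃; Y) for a hard assignment of the values of X to clusters, given an empirical joint p(x, y); the function and argument names are my own:

```python
import numpy as np

def cluster_relevance_mi(p_xy, assignment, n_clusters):
    """Mutual information I(X~; Y) for a hard clustering of X.

    p_xy:        |X| x |Y| array holding the joint distribution p(x, y)
    assignment:  length-|X| array, assignment[x] = cluster index of x
    n_clusters:  number of clusters |X~|
    """
    # Aggregate rows of p(x, y) into p(x~, y) by summing over cluster members.
    p_cy = np.zeros((n_clusters, p_xy.shape[1]))
    for x, c in enumerate(assignment):
        p_cy[c] += p_xy[x]

    p_c = p_cy.sum(axis=1, keepdims=True)   # p(x~)
    p_y = p_cy.sum(axis=0, keepdims=True)   # p(y)

    # I(X~; Y) = sum_{x~, y} p(x~, y) * log[ p(x~, y) / (p(x~) p(y)) ]
    mask = p_cy > 0
    return float((p_cy[mask] * np.log(p_cy[mask] / (p_c @ p_y)[mask])).sum())
```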

    19. Information Bottleneck clustering: agglomerative (bottom-up), Slonim & Tishby NIPS-2000; conglomerative (top-down), Bekkerman et al. JMLR-2003; K-means-like, Slonim et al. SIGIR-2002. Pull each x out of its cluster and put it into the cluster s.t. the objective is maximized.
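A minimal sketch of the "pull out and reassign" move (the K-means-like step), reusing the hypothetical cluster_relevance_mi helper above; a real implementation would update the objective incrementally instead of recomputing it from scratch:

```python
def reassign_elements(p_xy, assignment, n_clusters):
    """One correction pass: move each element to the cluster that maximizes
    the objective, keeping all other assignments fixed."""
    for x in range(len(assignment)):
        best_c, best_score = assignment[x], float("-inf")
        for c in range(n_clusters):
            assignment[x] = c                  # tentatively place x in cluster c
            score = cluster_relevance_mi(p_xy, assignment, n_clusters)
            if score > best_score:
                best_c, best_score = c, score
        assignment[x] = best_c                 # keep the best placement
    return assignment
```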

    20. More complex case. Consider the model with two data variables X and Y, both to be clustered. Objective: maximize the Mutual Information I(X̃; Ỹ), subject to |X̃| = k, |Ỹ| = l.

    21. More complex case. Consider the model with two data variables X and Y, both to be clustered. Objective: maximize the Mutual Information I(X̃; Ỹ), subject to |X̃| = k, |Ỹ| = l. This is information-theoretic co-clustering, Dhillon, Mallela & Modha 2003. Originally: minimize the loss in mutual information I(X; Y) - I(X̃; Ỹ), applying a K-means-like algorithm.

    22. Much more complex case. Consider an arbitrary model over X1, …, Xn. Objective: maximize the Multi-Information of the cluster variables X̃1, …, X̃n, subject to fixed numbers of clusters |X̃i| = ki.

    23. Much more complex case. Consider an arbitrary model over X1, …, Xn. Objective: maximize the Multi-Information of the cluster variables X̃1, …, X̃n, subject to fixed numbers of clusters |X̃i| = ki. This is the multivariate Information Bottleneck, Friedman, Mosenzon, Slonim & Tishby 2001. Originally: minimize the multi-information along the "input" graph minus β times the multi-information along the "output" graph.
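For reference, the standard definition of Multi-Information (not reproduced on the slide) is

    I(X_1; ...; X_n) = \sum_i H(X_i) - H(X_1, ..., X_n)
                     = D_KL( p(x_1, ..., x_n) || \prod_i p(x_i) ),

which requires the full joint over all variables involved; this is why it stays manageable on tree-shaped models but becomes problematic once the graph has loops, as the next slide argues.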

    24. Why not multivariate IB? Even on the shortest loops, the Multi-Information can be infeasible to compute. Multi-Information is good for tree graphs, but bad for graphs with loops.

    25. We propose: consider an arbitrary model over X1, …, Xn. Objective: maximize a weighted sum of pairwise Mutual Information terms over the model's edges, subject to fixed numbers of clusters. No multi-dimensional probability tables; straightforward algorithmic implementation.
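A plausible reconstruction of the proposed objective (the edge set E and weights w_ij are assumed notation, not taken verbatim from the slides):

    maximize  \sum_{(i,j) \in E} w_ij · I(X̃_i; X̃_j)   subject to  |X̃_i| = k_i,

so every term touches only a two-dimensional table p(x̃_i, x̃_j), never a joint over all variables at once.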

    26. Example. Consider our triangle of three variables: the objective in this case is the weighted sum of the three pairwise MI terms, easily broken into 3 parts, one per edge.

    27. Example (cont'd). Assume the current clusterings of two of the variables are observed: find the best clustering of the remaining one given all its neighbors, which means "given the rest of the network".

    28. Example (cont'd). Assume the current clusterings of two of the variables are observed: find the best clustering of the remaining one given all its neighbors, which means "given the rest of the network". This roughly corresponds to …

    29. Algorithmic setup. Which setup to choose: top-down? bottom-up? K-means-like? Each one has its drawbacks. How about all of them? Apply bottom-up to some variables, apply top-down to the rest, and rearrange elements at each level, maximizing the objective function.

    30. Algorithmic setup: discussion. Why not plain K-means? K is arbitrary; initialization is random.

    31. Algorithmic setup: discussion. Why not plain K-means? K is arbitrary; initialization is random. Why not all top-down? The first split is random, and the following splits are then random too.

    32. Algorithmic setup: discussion. Why not plain K-means? K is arbitrary; initialization is random. Why not all top-down? The first split is random, and the following splits are then random too. Why not all bottom-up? Extremely expensive.

    33. Multi-way Distributional Clustering. Initialization: if a variable is clustered bottom-up, put each of its elements in a singleton cluster; if top-down, put all of them in one common cluster. Main loop: if bottom-up, merge every two closest clusters; if top-down, split each cluster into two halves. Correction loop: pull each element out of its cluster and put it into the cluster s.t. the objective is maximized.
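A schematic of how the three pieces could fit together, simplified to one bottom-up and one top-down variable; the merge, split and correction operations are passed in as hypothetical placeholders, since their objective-driven details are what the rest of the talk specifies:

```python
def mdc(elements_bu, elements_td, merge_closest, split_in_two, correct, n_levels):
    """Skeleton of Multi-way Distributional Clustering with one bottom-up
    variable (e.g. documents) and one top-down variable (e.g. words).

    merge_closest(clusters)  -> clusters with the closest pairs merged
    split_in_two(cluster)    -> two halves of a cluster
    correct(clusters, other) -> clusters after greedy element reassignment
    All three callables are placeholders for the objective-driven steps.
    """
    # Initialization: singletons for the bottom-up variable,
    # one common cluster for the top-down variable.
    clusters_bu = [[e] for e in elements_bu]
    clusters_td = [list(elements_td)]

    for _ in range(n_levels):
        # Main loop: merge bottom-up clusters, split top-down clusters.
        clusters_bu = merge_closest(clusters_bu)
        clusters_td = [half for c in clusters_td for half in split_in_two(c)]

        # Correction loop: pull each element out of its cluster and place it
        # where the objective (e.g. the weighted pairwise MI) is maximized.
        clusters_bu = correct(clusters_bu, clusters_td)
        clusters_td = correct(clusters_td, clusters_bu)

    return clusters_bu, clusters_td
```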

    34. Example: 2-way MDC.

    35. Computational complexity. General case, at each iteration of the main loop: pass over all variables, over all of their elements, and over all candidate clusters. If only one system is clustered bottom-up: … 2-way case: at each iteration, the number of top-down clusters is doubled while the number of bottom-up clusters is halved.

    36. Experimental setup. 2-way MDC: Documents and Words. 3-way MDC: Documents, Words and Authors. 4-way MDC: Documents, Words, Authors and the documents' Titles. Documents: bottom-up; the rest: top-down.

    37. Evaluation methodology. Clustering evaluation is generally unintuitive, and is an entire ML research field in itself. We use the "accuracy" measure, following Slonim et al. and Dhillon et al. Ground truth: … Our results: …
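One common way such a clustering "accuracy" is computed (my reading of the measure, assuming each cluster is scored by its majority ground-truth class; the exact definition used in the talk may differ):

```python
from collections import Counter

def clustering_accuracy(clusters, true_label):
    """Fraction of elements whose ground-truth label (e.g. email folder)
    matches the majority label of their cluster.

    clusters:   list of lists of element ids (a hard clustering)
    true_label: dict mapping element id -> ground-truth class
    """
    correct = total = 0
    for cluster in clusters:
        if not cluster:
            continue
        labels = Counter(true_label[e] for e in cluster)
        correct += labels.most_common(1)[0][1]   # size of the majority class
        total += len(cluster)
    return correct / total if total else 0.0
```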

    38. Datasets. Three CALO email datasets: acheyer, 664 messages, 38 folders; mgervasio, 777 messages, 15 folders; mgondek, 297 messages, 14 folders. Two Enron email datasets: kitchen-l, 4015 messages, 47 folders; sanders-r, 1188 messages, 30 folders. The 20 Newsgroups: 19997 messages.

    39. Results.

    40. Improvement over the baseline.

    41. Discussion. Improvement over Slonim et al., which is a 1-way clustering algorithm: shows that multi-modality helps. Improvement over Dhillon et al., which is a 2-way clustering algorithm: shows that the hierarchical setup helps. MDC is an efficient method, which allows exploring complex models: 3-way, 4-way, etc.

    42. Conclusion. No generative assumptions, unlike generative models. No training data, unlike supervised models. A tight link between distributional clustering and graphical models, finally! Great empirical results?

    43. Future work. Algorithmic enhancements of the correction loop. Inference on the "optimal" number of clusters. Generalization to continuous tasks. Toward unsupervised discriminative models.

