## Clustering with Bregman Divergences

**Clustering with Bregman Divergences**
Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, Joydeep Ghosh
Presented by Rohit Gupta
CSci 8980: Machine Learning

**Outline**
• Bregman Divergences: Basics and Examples
• Bregman Information
• Bregman Hard Clustering
• The Exponential Family and its connection to Bregman Divergences
• Bregman Soft Clustering
• Experiments and Results
• Conclusions

**Bregman Hard and Soft Clustering**
• Most existing parametric clustering methods partition the data into a pre-specified number of partitions, with a cluster representative corresponding to every partition/cluster
• Hard clustering: a disjoint partitioning of the data such that each data point belongs to exactly one of the partitions
• Soft clustering: each data point has a certain probability of belonging to each of the partitions
• Hard clustering can be seen as soft clustering in which the probabilities are either 0 or 1

**Distortion or Loss Functions**
• Squared Euclidean distance is the most commonly used loss function
  • Extensive literature
  • Easy to use: leads to simple calculations
  • Not appropriate for some domains
  • Difficult to compute for sparse data (missing dimensions)
• Example: iterative K-means algorithm
• Question: how should a distortion/loss function be chosen for a given problem?

**Bregman Divergences**
• Ref: Definition 1 in the paper: for a strictly convex, differentiable function $\phi$, $d_\phi(x, y) = \phi(x) - \phi(y) - \langle x - y, \nabla\phi(y) \rangle$
• Examples:
  • Squared distance
  • Relative entropy (KL divergence)
  • Itakura-Saito distance

**Few Take Home Points on Bregman Divergence**
1.
2.
3. Three-point property
4. Strictly convex in the first argument, but not necessarily convex in the second argument

**Bregman Information**
• The Bregman information of a random variable X is given by $I_\phi(X) = \min_{s} E[d_\phi(X, s)]$
• The optimal vector $s$ that achieves the minimal value is called the Bregman representative of X
• For squared loss, the minimum loss is the variance
• The best predictor of the random variable is the mean

**Bregman Information**
• Bregman information is the minimum loss, attained at the Bregman representative
• Points to note:
  • The representative defined above always exists
  • It is uniquely determined
  • It does not depend on the choice of Bregman divergence
  • The expectation of the random variable X defines the minimizer

**Bregman Hard Clustering**
• The problem is posed as a quantization problem that involves minimizing the loss in Bregman information
• Very similar to iterative K-means based on squared distance, except that the distortion function is drawn from the general class of Bregman divergences
• The expected Bregman divergence of the data points from their Bregman representatives is minimized
• Procedure:
  • Initialize the representatives
  • Assign points to them
  • Re-estimate the representatives

**Bregman Hard Clustering**
• Algorithm:

**Take home points**
• Exhaustiveness: the Bregman hard clustering algorithm works for all Bregman divergences, and in fact only for Bregman divergences
  • The arithmetic mean is the best predictor for Bregman divergences only
  • It is possible to design clustering algorithms based on distortion functions that are not Bregman divergences, but in that case the cluster representative would not be the arithmetic mean or the expectation
• Linear separators: the clusters obtained are separated by hyperplanes

**Take home points**
• Scalability: each iteration of the Bregman hard clustering algorithm is linear in the number of data points and in the number of desired clusters
• Applicability to mixed data types: different Bregman divergences, each meaningful and appropriate for a different subset of features, can be chosen
• The algorithm also guarantees that the objective function decreases monotonically until convergence

**Exponential families and Bregman Divergences**
• [Forster & Warmuth] remarked that the log-likelihood of the density of an exponential family distribution can be written as the negative of a uniquely determined Bregman divergence plus a term that does not depend on the distribution parameters

**Bregman Soft Clustering**
• The problem is posed as a parameter estimation problem for mixture models based on exponential family distributions
• The EM algorithm is used to design the Bregman soft clustering algorithm
• Maximizing the log-likelihood of the data in the EM algorithm is equivalent to minimizing the Bregman divergence in the Bregman soft clustering algorithm (refer to the previous slide)
• Each regular exponential family has a corresponding Bregman divergence

**Bregman Soft Clustering**
• Algorithm:

**Experiments and Results**
• Question: how does the quality of clustering depend on the appropriateness of the Bregman divergence?
• Experiments on synthetic data showed that cluster quality is better when the matching Bregman divergence is used than when a non-matching one is used
• Experiment 1:
  • Three 1-dimensional datasets of 100 samples each are generated from mixture models of Gaussian, Poisson, and Binomial distributions respectively
  • The datasets were clustered using three versions of Bregman hard clustering, corresponding to different Bregman divergences

**Experiments and Results**
• Mutual information is used to compare the results
• Table 3 in the paper shows large values along the diagonal, which shows the importance of using the appropriate Bregman divergence
• Experiment 2:
  • Similar to Experiment 1, except that it uses multi-dimensional data
  • Table 4 in the paper shows the results, which support the same observation as above

**Conclusions**
• Hard and soft clustering algorithms are presented that minimize a loss function based on Bregman divergences
• It was shown that there is a one-to-one mapping between regular exponential families and regular Bregman divergences, which helped in formulating the soft clustering algorithm
• A connection between Bregman divergences and Shannon's rate distortion theory is also established
• Experiments on synthetic data showed the importance of choosing the right Bregman divergence for the corresponding exponential family of distributions
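
The three example divergences named on the Bregman Divergences slide can be sketched in a few lines of Python. This is a minimal illustration rather than code from the paper: the function names (`bregman_divergence`, `sq_phi`, `negent_phi`, `burg_phi` and their gradients) are hypothetical, vectors are plain lists, and each example pairs a convex function with its gradient. The negative-entropy case reduces to KL divergence only when both arguments are probability vectors.

```python
import math

def bregman_divergence(phi, grad_phi, x, y):
    """d_phi(x, y) = phi(x) - phi(y) - <x - y, grad phi(y)> for list vectors."""
    inner = sum((xi - yi) * gi for xi, yi, gi in zip(x, y, grad_phi(y)))
    return phi(x) - phi(y) - inner

# Squared Euclidean distance: phi(x) = ||x||^2
def sq_phi(x):
    return sum(v * v for v in x)

def sq_grad(x):
    return [2.0 * v for v in x]

# KL divergence: phi(p) = sum_j p_j log p_j (negative entropy), entries > 0
def negent_phi(p):
    return sum(v * math.log(v) for v in p)

def negent_grad(p):
    return [math.log(v) + 1.0 for v in p]

# Itakura-Saito distance: phi(x) = -sum_j log x_j, entries > 0
def burg_phi(x):
    return -sum(math.log(v) for v in x)

def burg_grad(x):
    return [-1.0 / v for v in x]

# Two probability vectors used as illustrative inputs (assumed values).
x = [0.2, 0.8]
y = [0.5, 0.5]

d_sq = bregman_divergence(sq_phi, sq_grad, x, y)
d_kl = bregman_divergence(negent_phi, negent_grad, x, y)
d_kl_reversed = bregman_divergence(negent_phi, negent_grad, y, x)
d_is = bregman_divergence(burg_phi, burg_grad, x, y)
```

Comparing `d_kl` with `d_kl_reversed` shows that a Bregman divergence is in general not symmetric in its two arguments, which is why the order of data point and representative matters in the clustering algorithms.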
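
The Bregman Information slides state that the mean is the minimizer regardless of which Bregman divergence is used, and that for squared loss the minimum is the variance. The sketch below checks this numerically on a tiny one-dimensional sample. The data values, the uniform weighting, the grid, and all function names are assumptions made for illustration; a grid search stands in for an exact minimization.

```python
import math

def squared_loss(x, s):
    return (x - s) ** 2

def itakura_saito(x, s):
    # Itakura-Saito distance in one dimension; requires x > 0 and s > 0.
    return x / s - math.log(x / s) - 1.0

def expected_divergence(divergence, xs, s):
    """Average divergence of the sample xs from a candidate representative s."""
    return sum(divergence(x, s) for x in xs) / len(xs)

def bregman_information(divergence, xs):
    """Expected divergence from the sample mean (uniform weights assumed)."""
    mean = sum(xs) / len(xs)
    return expected_divergence(divergence, xs, mean)

def grid_minimizer(divergence, xs, grid):
    """Candidate on the grid with the smallest expected divergence."""
    return min(grid, key=lambda s: expected_divergence(divergence, xs, s))

xs = [1.0, 2.0, 6.0]                          # sample mean is 3.0
grid = [0.5 + 0.01 * i for i in range(751)]   # candidates from 0.5 to 8.0

best_sq = grid_minimizer(squared_loss, xs, grid)
best_is = grid_minimizer(itakura_saito, xs, grid)
variance = bregman_information(squared_loss, xs)
```

Both `best_sq` and `best_is` land on the grid point nearest the sample mean, even though the two divergences assign very different losses to the same candidate.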
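
The three-step procedure from the Bregman Hard Clustering slide (initialize, assign, re-estimate) can be sketched as follows. This is a hedged illustration, not the paper's pseudocode: the function name `bregman_hard_clustering`, its parameters, the random initialization from data points, and the stopping rule are all assumptions. The divergence is passed in as a function taking the data point first and the representative second, and the representative is always re-estimated as the arithmetic mean.

```python
import random

def sq_euclidean(x, mu):
    """Squared Euclidean distance, one member of the Bregman family."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))

def bregman_hard_clustering(points, k, divergence, n_iter=100, seed=0):
    """K-means-style loop with a pluggable divergence (illustrative sketch)."""
    rng = random.Random(seed)
    # Initialize the representatives with k distinct data points.
    reps = [list(p) for p in rng.sample(points, k)]
    assign = None
    for _ in range(n_iter):
        # Assignment step: nearest representative under the chosen divergence.
        new_assign = [
            min(range(k), key=lambda h: divergence(p, reps[h])) for p in points
        ]
        if new_assign == assign:
            break  # assignments stopped changing
        assign = new_assign
        # Re-estimation step: each representative becomes its cluster mean.
        for h in range(k):
            members = [p for p, a in zip(points, assign) if a == h]
            if members:  # keep the old representative if a cluster is empty
                dim = len(members[0])
                reps[h] = [
                    sum(m[j] for m in members) / len(members) for j in range(dim)
                ]
    return assign, reps

# Toy data with two well-separated groups (assumed values).
points = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
assign, reps = bregman_hard_clustering(points, k=2, divergence=sq_euclidean)
```

Swapping `sq_euclidean` for another Bregman divergence changes only the assignment step. The mean update stays the same, which is the point made on the take-home slides.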
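
The link between exponential families and Bregman divergences that the soft clustering slides rely on can be written out as a short derivation. The notation here is assumed rather than taken from the slides: $\theta$ is the natural parameter, $\psi$ the log-partition function, $p_0$ the base measure, $\mu = \nabla\psi(\theta)$ the expectation parameter, and $\phi$ the convex conjugate of $\psi$, so that $\theta = \nabla\phi(\mu)$ and $\psi(\theta) = \langle \mu, \theta \rangle - \phi(\mu)$.

```latex
\begin{aligned}
\log p_{(\psi,\theta)}(x)
  &= \langle x, \theta \rangle - \psi(\theta) + \log p_0(x) \\
  &= \phi(\mu) + \langle x - \mu, \nabla\phi(\mu) \rangle + \log p_0(x) \\
  &= -\,d_\phi(x, \mu) + \phi(x) + \log p_0(x)
\end{aligned}
```

Only the first term depends on the parameters, so under these assumptions maximizing the log-likelihood over $\mu$ amounts to minimizing $d_\phi(x, \mu)$.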