Clustering by soft-constraint affinity propagation: applications to gene-expression data

Michele Leone, Sumedha and Martin Weigt
Bioinformatics, 2007

**Outline**
• Introduction
• The algorithm and method analysis
• Experimental results
• Discussion

**Introduction**
• Affinity Propagation (AP) seeks to identify each cluster by one of its elements, the exemplar.
• Each point in the cluster refers to this exemplar.
• Each exemplar is required to refer to itself as a self-exemplar.
• However, this forces clusters to appear as stars: there is only one central node, and all other nodes are directly connected to it.

**Introduction**
• Some drawbacks of Affinity Propagation:
• The hard constraint in AP relies strongly on cluster-shape regularity.
• All information about the internal structure and the hierarchical merging/dissociation of clusters is lost.
• AP has robustness limitations.
• AP forces each exemplar to point to itself.

**Introduction**
• How can this be improved?
• The hard constraint requires exemplars to be self-exemplars.
• We relax the hard constraint by introducing a finite penalty term for each constraint violation.

**The Algorithm and Method Analysis**
• The soft-constraint affinity propagation (SCAP) equations.
• Efficient implementation of the algorithm.
• Extracting cluster signatures.

**The SCAP equations**
• We attach a constraint term to each data point: a finite penalty is assigned whenever a data point is chosen as exemplar by some other data point without being a self-exemplar.

**The SCAP equations**
• The penalty represents a compromise between the minimization of the cost function and the search for compact clusters.
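The slide's formula for the constraint term did not survive extraction. What follows is a hedged reconstruction of the one case the slide describes, written additively with penalty $p$ and exemplar choices $c_j$ as notational assumptions:

```latex
\chi_i^{(p)}(c) \;=\;
\begin{cases}
  -p & \text{if } c_i \neq i \text{ and } \exists\, j \neq i : c_j = i,\\
  \phantom{-}0 & \text{otherwise,}
\end{cases}
```

so that a violation of the self-exemplar condition costs a finite penalty $p$ instead of being forbidden outright, as in the original AP.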
• Then we introduce a positive real-valued parameter β weighing the relative importance of cost minimization with respect to the constraints.

**The SCAP equations**
• With the similarities, the penalty term and β, we can define the probability of an arbitrary clustering.
• The original AP is recovered by taking p → ∞, since any violated constraint then sets the probability of the corresponding clustering to zero.

**The SCAP equations**
• For general p, the optimal clustering can be determined by maximizing the marginal probabilities for all data points.

**The SCAP equations**
• In the limit β → ∞ we find the SCAP equations.
• The exemplar of any data point is then the candidate maximizing the sum of its responsibility and availability messages.

**The SCAP equations**
• Compared to the original AP, SCAP amounts to an additional threshold on the self-availabilities and the self-responsibilities.
• For small enough p, this threshold is active in many cases.
• The self-responsibility is then substituted by the threshold value.
• For p → ∞ (i.e. an infinite penalty), the original AP equations are recovered.

**The SCAP equations**
• This means that variables are discouraged from becoming self-exemplars beyond a given threshold, even when some other point is already pointing at them.

**Efficient implementation**
• The iterative solution alternates updates of the responsibility and availability messages until convergence.

**Efficient implementation**
• Differences from the original AP:
• Step 3 is formulated as a sequential update.
• The original AP used a damped parallel update.

**Extracting cluster signatures**
• Only a few components carry useful information about the cluster structure; they are called cluster signatures.
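Before moving on, the message-passing scheme sketched on the slides above can be illustrated in code. This is a hedged sketch, not the paper's exact algorithm: it uses the standard AP responsibility/availability updates with damped parallel updates, and implements the soft constraint as a cap of the two self-messages at the penalty `p`, which is one plausible reading of the "additional threshold" described on the slides; for large `p` it reduces to plain AP.

```python
def scap(S, p, damping=0.5, iters=100):
    """AP-style message passing with self-messages capped at p (sketch).

    S: n x n similarity matrix (list of lists); S[i][k] is the similarity
       of point i to candidate exemplar k (diagonal entries = preferences).
    Returns labels, where labels[i] is the exemplar chosen by point i.
    """
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i,k)
    A = [[0.0] * n for _ in range(n)]  # availabilities  a(i,k)
    for _ in range(iters):
        # Responsibilities: r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        for i in range(n):
            vals = [A[i][k] + S[i][k] for k in range(n)]
            for k in range(n):
                rival = max(v for kk, v in enumerate(vals) if kk != k)
                new = S[i][k] - rival
                if i == k:
                    new = min(new, p)  # soft-constraint cap (assumed form)
                R[i][k] = damping * R[i][k] + (1 - damping) * new
        # Availabilities:
        #   a(i,k) = min(0, r(k,k) + sum_{j not in {i,k}} max(0, r(j,k)))
        #   a(k,k) = sum_{j != k} max(0, r(j,k))
        for k in range(n):
            pos = [max(0.0, R[j][k]) for j in range(n)]
            tot = sum(pos) - pos[k]
            for i in range(n):
                if i == k:
                    new = min(tot, p)  # self-availability, capped (assumed)
                else:
                    new = min(0.0, R[k][k] + tot - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    # The exemplar of i maximizes a(i,k) + r(i,k)
    return [max(range(n), key=lambda k: A[i][k] + R[i][k]) for i in range(n)]
```

A typical call builds `S` from pairwise distances (e.g. negative Manhattan distance, as in the Iris experiment later) and sets the diagonal to a common preference value.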
• We assume the similarity between two data points to be additive in single-gene contributions.

**Extracting cluster signatures**
• Having found a clustering given by the exemplar selection, we can calculate the similarity of a cluster C, defined as a connected component of the directed exemplar graph, as a sum over single-gene contributions.

**Extracting cluster signatures**
• Then we compare these contributions to random exemplar choices, which are characterized by their mean and variance.

**Extracting cluster signatures**
• The relevance of a gene can be ranked by a score measuring the distance of its actual contribution from the distribution over random exemplar mappings.
• Genes can be ranked according to this score; the highest-ranking genes are considered a cluster signature.

**Experimental results**
• Iris data
• Brain cancer data
• Other benchmark cancer data
• Lymphoma cancer data
• SRBCT cancer data
• Leukemia

**Iris data**
• Three clusters: setosa, versicolor, virginica.
• Four features for 150 flowers:
• sepal length
• sepal width
• petal length
• petal width

**Iris data**
• Experimental results:
• Affinity Propagation: 16 errors.
• SCAP: 9 errors with the Manhattan distance measure for the similarity.
• On increasing the parameter value, the clusters for Versicolor and Virginica merge with each other, reflecting the fact that they are closer to each other than to Setosa.

**Brain cancer data**
• Five diagnosis types for 42 patients:
• 10 medulloblastoma
• 10 malignant glioma
• 10 atypical teratoid/rhabdoid tumors
• 4 normal cerebella
• 8 primitive neuroectodermal tumors (PNET)

**Brain cancer data**
• Clustering with AP: five clusters give the lowest number of errors. There are three well-distinguishable clusters.

**Brain cancer data**
• Clustering with SCAP: SCAP identifies four clusters with 8 errors.

**Brain cancer data**
• The eight errors are due to misclassifications of the fifth diagnosis (PNET).
• We use the procedure to extract cluster signatures in the case of four clusters:
• Samples no. 34-41 are the fifth diagnosis (PNET).

**Other benchmark cancer data**
• Lymphoma cancer data: three diagnoses for 62 patients.
• SRBCT cancer data: four expression diagnosis patterns for 63 samples.
• Leukemia: two diagnoses for 72 samples.

**Other benchmark cancer data**
• Lymphoma cancer data
• AP: 3 errors with 3 clusters.
• SCAP: 1 error with 3 clusters.
• SRBCT cancer data
• AP: 22 errors with 5 clusters.
• SCAP: 7 errors with 4 clusters.
• Leukemia
• AP: 4 errors with 2 clusters.
• SCAP: 2 errors with 2 clusters.

**Discussion**
• If clusters cannot be well represented by a single cluster exemplar, AP is bound to fail.
• SCAP is more efficient than AP, in particular in the case of noisy, irregularly organized data, and thus in biological applications concerning microarray data.
• The cluster structure can be efficiently probed.
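Finally, the cluster-signature ranking described in the "Extracting cluster signatures" slides can be sketched as follows. This is a hedged illustration, assuming a per-gene additive similarity of squared-difference form and a Monte Carlo estimate of the mean and variance over random exemplar choices; the function and variable names are illustrative, not the paper's.

```python
import random


def gene_scores(X, labels, trials=200, seed=0):
    """Rank genes by how far each per-gene contribution to the actual
    clustering lies from its distribution under random exemplar choices.

    X: n_samples x n_genes expression matrix (list of lists).
    labels: labels[i] = exemplar index chosen for sample i.
    Returns one score per gene; higher means more cluster-relevant.
    """
    n, g = len(X), len(X[0])

    def contrib(assign):
        # Additive per-gene similarity: s_g(i, c_i) = -(x[i][g] - x[c_i][g])^2
        out = [0.0] * g
        for i, c in enumerate(assign):
            for k in range(g):
                out[k] += -(X[i][k] - X[c][k]) ** 2
        return out

    actual = contrib(labels)
    rng = random.Random(seed)
    # Monte Carlo baseline: contributions under random exemplar mappings
    samples = [contrib([rng.randrange(n) for _ in range(n)])
               for _ in range(trials)]
    scores = []
    for k in range(g):
        vals = [s[k] for s in samples]
        mean = sum(vals) / trials
        var = sum((v - mean) ** 2 for v in vals) / trials
        # z-like distance of the actual contribution from the random baseline
        scores.append((actual[k] - mean) / (var ** 0.5 + 1e-12))
    return scores
```

On a toy matrix where one gene separates the clusters and another is constant, the separating gene receives the higher score, which is exactly the "signature" behaviour the slides describe.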