Clustering by soft-constraint affinity propagation: applications to gene-expression data

  1. Clustering by soft-constraint affinity propagation: applications to gene-expression data Michele Leone, Sumedha and Martin Weigt Bioinformatics, 2007

  2. Outline • Introduction • The Algorithm and Method Analysis • Experimental results • Discussion

  3. Introduction • Affinity Propagation (AP) seeks to identify each cluster by one of its elements, called the exemplar. • Each point in the cluster refers to this exemplar. • Each exemplar is required to refer to itself as a self-exemplar. • However, this forces clusters to appear as stars: there is only one central node, and all other nodes are directly connected to it.

  4. Introduction • Some drawbacks of Affinity Propagation: • The hard constraint in AP relies strongly on cluster-shape regularity. • All information about the internal structure and the hierarchical merging/dissociation of clusters is lost. • AP has robustness limitations. • AP forces each exemplar to point to itself.

  5. Introduction • How to improve it? • The hard constraint: exemplars must be self-exemplars. • We relax this hard constraint by introducing a finite penalty term for each constraint violation.

  6. The Algorithm and Method Analysis • The Soft-Constraint Affinity Propagation (SCAP) equations. • Efficient implementation of the algorithm. • Extracting cluster signatures.

  7. The SCAP equations • We write the constraint attached to a given data point i as follows, with penalty p > 0: χ_i^(p)(c_1, …, c_N) = p if c_i ≠ i but ∃ j ≠ i: c_j = i, and 0 otherwise. The first case assigns a penalty p if data point i is chosen as exemplar by some other data point j, without being a self-exemplar.
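The constraint can be read directly as a function of the full exemplar-assignment vector. The following sketch (my own illustration; `c`, `i`, and `p` match the slide's notation) counts exactly the violation described above:

```python
def constraint_penalty(c, i, p):
    """chi_i^(p): penalty p if point i is chosen as exemplar by some
    other point j while i is not a self-exemplar; 0 otherwise.
    c is the list of exemplar choices, c[j] = exemplar of point j."""
    chosen_by_other = any(c[j] == i for j in range(len(c)) if j != i)
    if c[i] != i and chosen_by_other:
        return p
    return 0.0

# Point 1 is chosen by point 0 but itself points to 2 -> violation.
print(constraint_penalty([1, 2, 2], i=1, p=0.5))  # -> 0.5
```

If point 1 were a self-exemplar (`c = [1, 1, 2]`), or if nobody chose it, the penalty would be zero.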

  8. The SCAP equations • The penalty p represents a compromise between the minimization of the cost function and the search for compact clusters. • Then, we introduce a positive real-valued parameter β weighing the relative importance of the cost minimization with respect to the constraints.

  9. The SCAP equations • So, we can define the probability of an arbitrary clustering (c_1, …, c_N) as: P(c_1, …, c_N) = (1/Z) exp{β [Σ_i S(i, c_i) − Σ_i χ_i^(p)(c_1, …, c_N)]} • Original AP is recovered by taking p → ∞, since any violated constraint then sets P to zero.
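The unnormalized weight of a clustering can be sketched as follows (my own illustration; `S` is a small similarity matrix and the constraint is the one-case penalty above). Sending p to infinity suppresses violating configurations exponentially, which is how the hard-constraint AP limit arises:

```python
import math

def weight(c, S, beta, p):
    """Unnormalized probability exp(beta * [sum_i S(i, c_i) - sum_i chi_i])."""
    n = len(c)
    total = 0.0
    for i in range(n):
        total += S[i][c[i]]
        chosen_by_other = any(c[j] == i for j in range(n) if j != i)
        if c[i] != i and chosen_by_other:
            total -= p  # soft-constraint penalty
    return math.exp(beta * total)

S = [[0.0, -1.0], [-1.0, 0.0]]
ok = [0, 0]    # point 0 is a self-exemplar, point 1 follows it
bad = [1, 0]   # each points to the other: both constraints violated
print(weight(ok, S, beta=1.0, p=2.0) > weight(bad, S, beta=1.0, p=2.0))
```

For very large p the weight of the violating configuration collapses to (numerically) zero, mirroring the p → ∞ statement on the slide.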

  10. The SCAP equations • For general p, the optimal clustering can be determined by maximizing the marginal probabilities P_i(c_i) for all data points i: c_i* = argmax_{c_i} P_i(c_i).

  11. The SCAP equations • Taking the limit β → ∞, we find the SCAP equations for the responsibilities r(i,k) and availabilities a(i,k): r(i,k) = S(i,k) − max_{k'≠k} [S(i,k') + a(i,k')]; a(i,k) = min{0, r(k,k) + Σ_{j∉{i,k}} max[0, r(j,k)]} for i ≠ k; a(k,k) = min{p, Σ_{j≠k} max[0, r(j,k)]}, with r(k,k) likewise capped at p. • The exemplar c_i of any data point i can be computed as: c_i = argmax_k [a(i,k) + r(i,k)].

  12. The SCAP equations • Compared to original AP, SCAP amounts to an additional threshold p on the self-availabilities a(k,k) and the self-responsibilities r(k,k). • For small enough p, a(k,k) = p in many cases. • The self-responsibility is substituted with min[p, r(k,k)]. • For p → ∞, the original AP equations are recovered.

  13. The SCAP equations • This means that data points are discouraged from being self-exemplars beyond a given threshold, even when other points are already pointing at them.

  14. Efficient implementation • The iterative solution alternates the responsibility and availability updates until the exemplar assignments become stable.
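The algorithm figure did not survive the transcript. Below is a sketch of how the iteration could look in NumPy, based on the update rules summarized on the previous slides. The exact form of the thresholds on the self-messages is my reconstruction, not taken verbatim from the paper, and a damped parallel update is used for brevity, whereas the slides describe a sequential one:

```python
import numpy as np

def scap(S, p, max_iter=500, damping=0.5):
    """Sketch of the SCAP message-passing iteration (beta -> infinity).

    S: (n, n) similarity matrix, with S[k, k] the exemplar preference.
    p: penalty for violating the self-exemplar constraint.
    Returns the exemplar index chosen by each data point.
    """
    n = S.shape[0]
    idx = np.arange(n)
    A = np.zeros((n, n))  # availabilities a(i, k)
    R = np.zeros((n, n))  # responsibilities r(i, k)
    for _ in range(max_iter):
        # r(i,k) = S(i,k) - max_{k' != k} [S(i,k') + a(i,k')]
        M = S + A
        best = np.argmax(M, axis=1)
        first = M[idx, best]
        M[idx, best] = -np.inf
        second = np.max(M, axis=1)
        Rnew = S - first[:, None]
        Rnew[idx, best] = S[idx, best] - second
        # reconstruction: self-responsibilities thresholded at p
        Rnew[idx, idx] = np.minimum(p, Rnew[idx, idx])
        R = damping * R + (1 - damping) * Rnew
        # a(i,k) = min{0, r(k,k) + sum_{j not in {i,k}} max[0, r(j,k)]}
        Rp = np.maximum(R, 0.0)
        Rp[idx, idx] = 0.0
        col = Rp.sum(axis=0)
        Anew = np.minimum(0.0, R[idx, idx][None, :] + col[None, :] - Rp)
        # reconstruction: self-availabilities thresholded at p
        Anew[idx, idx] = np.minimum(p, col)
        A = damping * A + (1 - damping) * Anew
    return np.argmax(A + R, axis=1)

# Toy usage: two tight groups on a line, Manhattan similarity.
x = np.array([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])
S = -np.abs(x[:, None] - x[None, :])
np.fill_diagonal(S, -1.0)  # exemplar preference
print(scap(S, p=100.0))
```

With a large p the thresholds become inactive and the updates reduce to the standard AP equations, so the toy run above behaves like ordinary affinity propagation.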

  15. Efficient implementation • Differences from the original AP: • Step 3 is formulated as a sequential update. • The original AP uses a damped parallel update.

  16. Extracting cluster signatures • Only a few components carry useful information about the cluster structure; they are called cluster signatures. • We assume the similarity between data points i and j to be additive in single-gene contributions: S(i,j) = Σ_g S_g(i,j), where g runs over the genes.

  17. Extracting cluster signatures • Having found a clustering given by the exemplar selection {c_i}, we can calculate the similarity of a cluster C, defined as a connected component of the directed graph i → c_i, as a sum over single-gene contributions: S(C) = Σ_g S_g(C), with S_g(C) = Σ_{i∈C} S_g(i, c_i).

  18. Extracting cluster signatures • Then, we compare S_g(C) to random exemplar choices, which are characterized by their mean μ_g(C) = ⟨S_g(C)⟩_rand and variance σ_g²(C) = ⟨S_g(C)²⟩_rand − μ_g(C)².

  19. Extracting cluster signatures • The relevance of a gene g can be ranked by z_g = [S_g(C) − μ_g(C)] / σ_g(C), which measures the distance of the actual S_g(C) from the distribution of random exemplar mappings. • Genes can be ranked according to z_g; the highest-ranking genes are considered a cluster signature.
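A sketch of the signature extraction (my own illustration): per-gene contributions S_g are summed over a cluster, and the mean and variance under random exemplar choices are estimated by sampling rather than computed in closed form. In the toy data, gene 0 is constructed to be cluster-informative while gene 1 is noise:

```python
import numpy as np

def gene_z_scores(X, members, c, n_samples=2000, seed=0):
    """Rank genes for the cluster `members` under exemplar assignment c.

    X: (n_points, n_genes) expression matrix.
    Assumes S_g(i, j) = -|X[i, g] - X[j, g]| (additive over genes).
    z_g compares the actual per-gene score with its distribution
    under random exemplar mappings.
    """
    rng = np.random.default_rng(seed)
    members = np.asarray(members)
    # actual per-gene cluster score S_g(C)
    actual = -np.abs(X[members] - X[c[members]]).sum(axis=0)
    # sample random exemplar choices for the cluster members
    rand = rng.integers(0, X.shape[0], size=(n_samples, len(members)))
    scores = -np.abs(X[members][None, :, :] - X[rand]).sum(axis=1)
    mu = scores.mean(axis=0)
    sigma = scores.std(axis=0)
    return (actual - mu) / sigma

X = np.array([[0.0, 0.10], [0.0, 0.30], [0.0, 0.20],
              [5.0, 0.15], [5.0, 0.25], [5.0, 0.05]])
c = np.array([0, 0, 0, 3, 3, 3])  # exemplars: points 0 and 3
z = gene_z_scores(X, members=[0, 1, 2], c=c)
print(z[0] > z[1])  # the informative gene ranks highest
```

The informative gene has an actual score far above what random exemplar mappings produce, so its z value is large, while the noise gene stays near zero.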

  20. Experimental results • Iris data • Brain cancer data • Other benchmark cancer data • Lymphoma cancer data • SRBCT cancer data • Leukemia

  21. Iris data • Three clusters: setosa, versicolor, virginica. • Four features for 150 flowers: • sepal length • sepal width • petal length • petal width

  22. Iris data • Experimental results: • Affinity Propagation: 16 errors. • SCAP: 9 errors with the Manhattan distance measure for the similarity. • On varying the control parameter, the clusters for Versicolor and Virginica merge with each other, reflecting the fact that they are closer to each other than to Setosa.

  23. Brain cancer data • Five diagnosis types for 42 patients: • 10 medulloblastoma • 10 malignant glioma • 10 atypical teratoid/rhabdoid tumors • 4 normal cerebella • 8 primitive neuroectodermal tumors – PNET

  24. Brain cancer data • Clustering with AP: five clusters give the lowest number of errors. There are three well-distinguishable clusters.

  25. Brain cancer data • Clustering with SCAP: SCAP identifies four clusters with 8 errors.

  26. Brain cancer data • The eight errors are due to misclassifications of the fifth diagnosis (PNET). • We use the procedure to extract cluster signatures in the case of four clusters: • Nos. 34–41 are the fifth diagnosis.

  27. Other benchmark cancer data • Lymphoma cancer data • Three diagnoses for 62 patients. • SRBCT cancer data • Four expression diagnosis patterns for 63 samples. • Leukemia • Two diagnoses for 72 samples.

  28. Other benchmark cancer data • Lymphoma cancer data • AP: 3 errors with 3 clusters. • SCAP: 1 error with 3 clusters. • SRBCT cancer data • AP: 22 errors with 5 clusters. • SCAP: 7 errors with 4 clusters. • Leukemia • AP: 4 errors with 2 clusters. • SCAP: 2 errors with 2 clusters.

  29. Discussion • If clusters cannot be well represented by a single cluster exemplar, AP has to fail. • SCAP is more efficient than AP, in particular in the case of noisy, irregularly organized data, and thus in biological applications concerning microarray data. • The cluster structure can be efficiently probed.