Clustering by soft-constraint affinity propagation: applications to gene-expression data Michele Leone, Sumedha and Martin Weigt Bioinformatics, 2007
Outline • Introduction • The Algorithm and Method Analysis • Experimental results • Discussion
Introduction • Affinity Propagation (AP) seeks to identify each cluster by one of its elements, called the exemplar. • Each point in the cluster refers to this exemplar. • Each exemplar is required to refer to itself as a self-exemplar. • However, this forces clusters to appear as stars: there is only one central node, and all other nodes are directly connected to it.
Introduction • Some drawbacks of Affinity Propagation: • The hard constraint in AP relies strongly on cluster-shape regularity. • All information about the internal structure and the hierarchical merging/dissociation of clusters is lost. • AP has robustness limitations. • AP forces each exemplar to point to itself.
Introduction • How to improve it? • The hard constraint: exemplars must be self-exemplars. • We relax the hard constraint by introducing a finite penalty term for each constraint violation.
The Algorithm and Method Analysis • The Soft Constraint Affinity Propagation (SCAP) equations. • Efficient implementation of the algorithm. • Extracting cluster signatures.
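The SCAP equations • We write the constraint attached to a given data point $\mu$ as a two-case function with penalty $p \in [0, 1]$: the first case assigns the penalty $p$ if data point $\mu$ is chosen as exemplar by some other data point $\nu$ without being a self-exemplar; the second case leaves a satisfied constraint unpenalized.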
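Written out, a constraint of this kind could take the following form. This is a sketch in assumed notation: $c_\nu$ denotes the exemplar chosen by data point $\nu$, and $\chi_\mu^{(p)}$ is a hypothetical name for the soft constraint attached to data point $\mu$.

```latex
\chi_\mu^{(p)}[\mathbf{c}] =
\begin{cases}
p, & \text{if } c_\mu \neq \mu \text{ and } \exists\, \nu : c_\nu = \mu, \\
1, & \text{otherwise,}
\end{cases}
\qquad p \in [0, 1].
```

In this form $p = 1$ switches the constraint off entirely, while $p = 0$ forbids any exemplar that is not a self-exemplar.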
The SCAP equations • The penalty represents a compromise between minimizing the cost function and searching for compact clusters. • We then introduce a positive real-valued parameter $\beta$ weighing the relative importance of the cost minimization with respect to the constraints.
The SCAP equations • We can therefore define the probability of an arbitrary clustering as: • Original AP is recovered by taking $p = 0$, since any violated constraint then sets the probability to zero.
The SCAP equations • For general $p$, the optimal clustering can be determined by maximizing the marginal probabilities for all data points $\mu$:
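A sketch of how the probability and its marginals could be written, assuming the cost function is the negative total similarity of every point to its exemplar (as in standard AP). The names $E$, $Z$ and $\chi_\mu^{(p)}$ for the cost, the normalization and the soft constraint of data point $\mu$ are assumptions made for illustration.

```latex
\begin{aligned}
E[\mathbf{c}] &= -\sum_{\mu} S(\mu, c_\mu), &
P[\mathbf{c}] &= \frac{1}{Z} \prod_{\mu} \chi_\mu^{(p)}[\mathbf{c}] \; e^{-\beta E[\mathbf{c}]}, \\
P_\mu(c_\mu) &= \sum_{\{c_\nu\}_{\nu \neq \mu}} P[\mathbf{c}], &
c_\mu^{*} &= \operatorname*{argmax}_{c_\mu} P_\mu(c_\mu).
\end{aligned}
```

With $p = 0$ a single violated constraint makes the whole product vanish, which is the hard-constraint behaviour of original AP.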
The SCAP equations • Assuming the limit $\beta \to \infty$ with $p = e^{-\beta \tilde{p}}$, we find the SCAP equations: • The exemplar of any data point $\mu$ can be computed as:
The SCAP equations • Compared to original AP, SCAP amounts to an additional threshold on the self-availabilities $a(\mu,\mu)$ and the self-responsibilities $r(\mu,\mu)$. • For small enough $\tilde{p}$, $a(\mu,\mu) = \tilde{p}$ in many cases. • The self-responsibility $r(\mu,\mu)$ is substituted with $\max(-\tilde{p}, r(\mu,\mu))$. • For $\tilde{p} \to \infty$ (i.e. $p = 0$), the original AP equations are recovered.
The SCAP equations • This means that data points are discouraged from being self-exemplars beyond a given threshold, even when another point is already pointing at them.
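The update rules described on these slides can be reconstructed as follows. This is a sketch under assumptions: the limit $\beta \to \infty$ is taken with $p = e^{-\beta \tilde{p}}$, $S(\mu,\nu)$ is the similarity, and $r$ and $a$ denote responsibilities and availabilities as in original AP.

```latex
\begin{aligned}
r(\mu,\nu) &= S(\mu,\nu) - \max_{\nu' \neq \nu} \bigl[ a(\mu,\nu') + S(\mu,\nu') \bigr], \\
a(\mu,\nu) &= \min \Bigl\{ 0,\; \max\bigl[-\tilde{p},\, r(\nu,\nu)\bigr]
              + \sum_{\mu' \notin \{\mu,\nu\}} \max\bigl[0,\, r(\mu',\nu)\bigr] \Bigr\},
              \qquad \mu \neq \nu, \\
a(\nu,\nu) &= \min \Bigl\{ \tilde{p},\; \sum_{\mu' \neq \nu} \max\bigl[0,\, r(\mu',\nu)\bigr] \Bigr\}, \\
c_\mu^{*}  &= \operatorname*{argmax}_{\nu} \bigl[ a(\mu,\nu) + r(\mu,\nu) \bigr].
\end{aligned}
```

Sending $\tilde{p} \to \infty$ removes both thresholds, since $\max[-\tilde{p}, r(\nu,\nu)] \to r(\nu,\nu)$ and the outer $\min\{\tilde{p}, \cdot\}$ no longer binds, which gives back the original AP updates.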
Efficient implementation • The iterative solution:
Efficient implementation • Differences from the original AP: • Step 3 is formulated as a sequential update. • The original AP used a damped parallel update.
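A minimal sketch of a sequential SCAP iteration, written under several assumptions: the update rules are the thresholded AP rules (self-availability capped at the penalty, self-responsibility floored at its negative), data points are visited in index order, and iteration stops once the exemplars stay unchanged for a number of sweeps. The function name `scap`, its parameters and the toy data are hypothetical and only illustrate the idea.

```python
import numpy as np

def scap(S, p_tilde, max_sweeps=200, stable_sweeps=10):
    """Sketch of soft-constraint affinity propagation with sequential sweeps.

    S       : (N, N) similarity matrix, N >= 2; S[mu, mu] is the self-similarity.
    p_tilde : finite penalty for an exemplar that is not a self-exemplar.
    Returns the exemplar index chosen by every data point.
    """
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    a = np.zeros((n, n))    # availabilities a(mu, nu)
    r = np.zeros((n, n))    # responsibilities r(mu, nu)
    pos = np.zeros((n, n))  # max(0, r) with the diagonal forced to zero
    col = np.zeros(n)       # col[nu] = sum over mu' != nu of max(0, r(mu', nu))
    previous, stable = None, 0
    exemplars = np.arange(n)
    for _ in range(max_sweeps):
        for mu in range(n):  # sequential update in index order (an assumption)
            # Responsibilities sent by mu: compare each candidate with the
            # best competing candidate.
            v = a[mu] + S[mu]
            top = int(np.argmax(v))
            first = v[top]
            v[top] = -np.inf
            second = v.max()
            r[mu] = S[mu] - first
            r[mu, top] = S[mu, top] - second
            # Keep the column sums of positive responsibilities up to date.
            col -= pos[mu]
            pos[mu] = np.maximum(r[mu], 0.0)
            pos[mu, mu] = 0.0
            col += pos[mu]
            # Availabilities received by mu, with the two soft thresholds.
            floored_self = np.maximum(-p_tilde, np.diag(r))
            row = np.minimum(0.0, floored_self + col - pos[mu])
            row[mu] = min(p_tilde, col[mu])
            a[mu] = row
        exemplars = np.argmax(a + r, axis=1)
        if previous is not None and np.array_equal(exemplars, previous):
            stable += 1
            if stable >= stable_sweeps:
                break
        else:
            stable = 0
        previous = exemplars
    return exemplars

# Toy data: two well-separated groups of three points on a line.
x = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
S = -np.abs(x[:, None] - x[None, :])  # negative distance as similarity
np.fill_diagonal(S, -3.0)             # self-similarity chosen for illustration
exemplars = scap(S, p_tilde=2.0)
print(exemplars)
```

No damping is used here because each point reads the messages already refreshed earlier in the same sweep; on data with exact ties the iteration may still oscillate, in which case a small amount of noise on `S` is a common remedy.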
Extracting cluster signatures • Only a few components carry useful information about the cluster structure; they are called cluster signatures. • We assume the similarity between data points $\mu$ and $\nu$ to be additive in single-gene contributions:
Extracting cluster signatures • Having found a clustering given by the exemplar selection $c$, we can calculate the similarity of a cluster C, defined as a connected component of the directed graph in which each data point points to its exemplar, as a sum over single-gene contributions:
Extracting cluster signatures • We then compare this per-gene cluster similarity to random exemplar choices, which are characterized by their mean and variance.
Extracting cluster signatures • The relevance of a gene can be ranked by a score that measures the distance of the actual per-gene cluster similarity from the distribution of random exemplar mappings. • Genes are ranked according to this score; the highest-ranking genes are considered a cluster signature.
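The signature procedure can be sketched as follows. Everything specific here is an assumption: the single-gene contribution is taken to be the negative absolute difference of expression values, random exemplars are drawn uniformly and independently from all data points, and the score is a z-score (actual value minus random mean, divided by the random standard deviation). The helper names and the tiny data matrix are hypothetical.

```python
import numpy as np

def clusters_from_exemplars(c):
    """Connected components of the graph with an edge mu -> c[mu]."""
    parent = list(range(len(c)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for mu, nu in enumerate(c):
        parent[find(mu)] = find(int(nu))
    roots = np.array([find(i) for i in range(len(c))])
    return [np.flatnonzero(roots == root) for root in np.unique(roots)]

def gene_scores(X, c, members):
    """Per-gene z-score of one cluster against random exemplar choices.

    X       : (N, G) expression matrix (N samples, G genes).
    c       : exemplar index of every sample.
    members : indices of the samples forming the cluster.
    """
    X = np.asarray(X, dtype=float)
    c = np.asarray(c, dtype=int)
    block = X[members]                                    # (m, G)
    # Actual per-gene similarity of the cluster to its chosen exemplars.
    actual = -np.abs(block - X[c[members]]).sum(axis=0)   # (G,)
    # Per-gene similarity of every member to every possible exemplar.
    pair = -np.abs(block[:, None, :] - X[None, :, :])     # (m, N, G)
    mean = pair.mean(axis=1).sum(axis=0)  # expected value under random choice
    var = pair.var(axis=1).sum(axis=0)    # variances add for independent draws
    return (actual - mean) / np.sqrt(var + 1e-12)

# Illustrative data: gene 0 separates the two groups, gene 1 does not.
X = np.array([[0.0, 5.0], [0.1, 1.0], [0.2, 9.0],
              [5.0, 2.0], [5.1, 8.0], [5.2, 4.0]])
c = np.array([1, 1, 1, 4, 4, 4])
clusters = clusters_from_exemplars(c)
scores = gene_scores(X, c, clusters[0])
ranking = np.argsort(-scores)  # highest-scoring genes first
print(ranking)
```

Genes at the front of `ranking` are those for which the cluster is much tighter than a random exemplar assignment would make it, which is the sense in which they act as a signature in this sketch.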
Experimental results • Iris data • Brain cancer data • Other benchmark cancer data • Lymphoma cancer data • SRBCT cancer data • Leukemia
Iris data • Three clusters: setosa, versicolor, virginica. • Four features for 150 flowers: • sepal length • sepal width • petal length • petal width
Iris data • Experimental results: • Affinity Propagation: 16 errors. • SCAP: 9 errors, using the Manhattan distance as the similarity measure. • On increasing the value of , the clusters for Versicolor and Virginica merge with each other, reflecting the fact that they are closer to each other than to Setosa.
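A small sketch of a Manhattan-distance similarity matrix for four-feature data. It assumes the similarity is the negative Manhattan distance and that the self-similarity is a free parameter set by the user; the function name and the three example rows are made up for illustration and are not taken from the Iris data set.

```python
import numpy as np

def manhattan_similarity(X, self_similarity):
    """Similarity S(mu, nu) = -sum_i |x_mu^i - x_nu^i|, with a chosen diagonal."""
    X = np.asarray(X, dtype=float)
    S = -np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
    np.fill_diagonal(S, self_similarity)
    return S

# Illustrative rows in the feature order
# (sepal length, sepal width, petal length, petal width).
X = [[5.0, 3.5, 1.5, 0.2],
     [6.0, 3.0, 4.5, 1.5],
     [6.5, 3.0, 5.5, 2.0]]
S = manhattan_similarity(X, self_similarity=-1.0)
print(S)
```

Because the Manhattan distance is a sum over features, a similarity built this way is additive in single-feature contributions.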
Brain cancer data • Five diagnosis types for 42 patients: • 10 medulloblastoma • 10 malignant glioma • 10 atypical teratoid/rhabdoid tumors • 4 normal cerebella • 8 primitive neuroectodermal tumors – PNET
Brain cancer data • Clustering with AP (for ): five clusters give the lowest number of errors. There are three well-distinguishable clusters.
Brain cancer data • Clustering with SCAP: SCAP identifies four clusters with 8 errors.
Brain cancer data • The eight errors are due to misclassifications of the fifth diagnosis (PNET). • We use the procedure to extract cluster signatures in the case of four clusters: • Samples No. 34~41 belong to the fifth diagnosis.
Other benchmark cancer data • Lymphoma cancer data • Three diagnoses for 62 patients. • SRBCT cancer data • Four expression diagnosis patterns for 63 samples. • Leukemia • Two diagnoses for 72 samples.
Other benchmark cancer data • Lymphoma cancer data • AP: 3 errors with 3 clusters. • SCAP: 1 error with 3 clusters. • SRBCT cancer data • AP: 22 errors with 5 clusters. • SCAP: 7 errors with 4 clusters. • Leukemia • AP: 4 errors with 2 clusters. • SCAP: 2 errors with 2 clusters.
Discussion • If clusters cannot be well represented by a single cluster exemplar, AP is bound to fail. • SCAP is more efficient than AP, in particular for noisy, irregularly organized data, and thus for biological applications involving microarray data. • The cluster structure can be efficiently probed.