Clustering by soft-constraint affinity propagation: applications to gene-expression data

Michele Leone, Sumedha and Martin Weigt. Bioinformatics, 2007

Outline
• Introduction
• The Algorithm and Method Analysis
• Experimental results
• Discussion

Introduction
• Affinity Propagation (AP) seeks to identify each cluster by one of its elements, the exemplar.
• Each point in the cluster refers to this exemplar.
• Each exemplar is required to refer to itself as a self-exemplar.
• However, this forces clusters to appear as stars: there is only one central node, and all other nodes are directly connected to it.

Introduction
• Some drawbacks of Affinity Propagation:
  • The hard constraint in AP relies strongly on cluster-shape regularity.
  • All information about the internal structure and the hierarchical merging/dissociation of clusters is lost.
  • AP has robustness limitations.
  • AP forces each exemplar to point to itself.

Introduction
• How to improve it?
• The hard constraint: exemplars must be self-exemplars.
• We relax the hard constraint by introducing a finite penalty term for each constraint violation.

The Algorithm and Method Analysis
• The Soft-Constraint Affinity Propagation (SCAP) equations.
• Efficient implementation of the algorithm.
• Extracting cluster signatures.

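The SCAP equations
• We write the constraint attached to a given data point $\mu$ as follows, with $p \in [0, 1]$ and $c_\mu$ denoting the exemplar chosen by $\mu$:
  $\chi_\mu^{(p)}[\mathbf{c}] = p$ if $c_\mu \neq \mu$ and $\exists\, \nu: c_\nu = \mu$; $\chi_\mu^{(p)}[\mathbf{c}] = 1$ otherwise.
• The first case assigns a penalty if data point $\mu$ is chosen as exemplar by some other data point $\nu$ without being a self-exemplar.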
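The penalty rule can be sketched in a few lines of Python. This is an illustration, not the authors' code: the names `constraint_factor`, `c`, and `p` are assumptions, `c[k]` holds the index of the exemplar chosen by point `k`, and the non-violating case is assumed to contribute a factor of 1.

```python
def constraint_factor(c, mu, p):
    """Return p if mu is chosen as exemplar by another point without
    being a self-exemplar, and 1.0 otherwise (assumed convention)."""
    chosen_by_other = any(c[nu] == mu for nu in range(len(c)) if nu != mu)
    if c[mu] != mu and chosen_by_other:
        return p
    return 1.0


# Point 0 points to 1, point 1 points to 2, point 2 points to itself.
# Only point 1 violates the constraint: it is chosen by 0 but is not a self-exemplar.
c = [1, 2, 2]
factors = [constraint_factor(c, mu, p=0.5) for mu in range(len(c))]
print(factors)
```

Setting `p=0.0` turns the soft penalty back into a hard exclusion of such assignments.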

The SCAP equations
• The penalty presents a compromise between minimizing the cost function and searching for compact clusters.
• We then introduce a positive real-valued parameter $\beta$ weighing the relative importance of the cost minimization with respect to the constraints.

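The SCAP equations
• So we can define the probability of an arbitrary clustering $\mathbf{c}$ as:
  $P[\mathbf{c}] \propto \prod_\mu \chi_\mu^{(p)}[\mathbf{c}] \, e^{-\beta E[\mathbf{c}]}$,
  where $\chi_\mu^{(p)}$ is the constraint attached to data point $\mu$, $p$ is the penalty, and $E[\mathbf{c}]$ is the cost function.
• Original AP is recovered by taking $p = 0$, since any violated constraint then sets the probability to zero.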

The SCAP equations
• For general $p$, the optimal clustering can be determined by maximizing the marginal probabilities $P_\mu(c_\mu)$ for all data points $\mu$:
  $c_\mu^* = \arg\max_{c_\mu} P_\mu(c_\mu)$
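On a very small data set the marginals can be computed by brute-force enumeration, which makes the role of the penalty visible. The sketch below rests on assumptions: the cost is taken to be minus the total similarity of points to their exemplars, each violated constraint contributes one factor `p`, and `marginal_argmax` and `violations` are hypothetical helper names. It is only feasible for a handful of points, since it visits all n^n assignments.

```python
import itertools
import math


def violations(c):
    """Count points chosen as exemplar by another point without being self-exemplars."""
    n = len(c)
    return sum(
        1 for mu in range(n)
        if c[mu] != mu and any(c[nu] == mu for nu in range(n) if nu != mu)
    )


def marginal_argmax(S, p, beta):
    """Enumerate all exemplar assignments and maximize each point's marginal."""
    n = len(S)
    marg = [[0.0] * n for _ in range(n)]
    for c in itertools.product(range(n), repeat=n):
        cost = -sum(S[mu][c[mu]] for mu in range(n))  # assumed cost function
        weight = (p ** violations(c)) * math.exp(-beta * cost)
        for mu in range(n):
            marg[mu][c[mu]] += weight
    return [max(range(n), key=lambda nu: marg[mu][nu]) for mu in range(n)]


# Three points on a line at 0, 1 and 2.5; similarity = minus distance,
# self-similarity fixed at -3 (illustrative values).
x = [0.0, 1.0, 2.5]
S = [[-abs(a - b) if i != j else -3.0 for j, b in enumerate(x)]
     for i, a in enumerate(x)]

print(marginal_argmax(S, p=0.0, beta=5.0))  # hard constraint
print(marginal_argmax(S, p=1.0, beta=5.0))  # no constraint at all
```

With `p=0.0` every point selects point 1, which is then a self-exemplar; with `p=1.0` each point simply selects its most similar neighbour, so points 0 and 1 point at each other.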

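The SCAP equations
• Under an additional assumption, we find the SCAP equations.
• The exemplar $c_\mu^*$ of any data point $\mu$ can be computed as:
  $c_\mu^* = \arg\max_\nu \, [a(\mu,\nu) + r(\mu,\nu)]$,
  where $a(\mu,\nu)$ and $r(\mu,\nu)$ denote the availabilities and responsibilities.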

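The SCAP equations
• Compared to original AP, SCAP amounts to an additional threshold $\tilde{p}$ on the self-availabilities $a(\mu,\mu)$ and the self-responsibilities $r(\mu,\mu)$.
• For small enough $\tilde{p}$, $a(\mu,\mu) = \tilde{p}$ in many cases.
• The self-responsibility $r(\mu,\mu)$ is substituted with $\max[-\tilde{p}, r(\mu,\mu)]$.
• For $\tilde{p} \to \infty$ (i.e. $p = 0$), the original AP equations are recovered.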

The SCAP equations
• This means that variables are discouraged from being self-exemplars beyond a given threshold, even when another data point is already pointing at them.
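For orientation, the standard AP message updates of Frey and Dueck are written out below in an assumed notation: $S$ for similarities, $r$ for responsibilities, $a$ for availabilities. The last two lines show one possible reading of the thresholding described in the bullets, with the threshold written as $\tilde{p}$. They are an interpretation, not a transcription of the SCAP equations.

```latex
% Standard AP updates (Frey and Dueck), notation assumed
\begin{align}
r(\mu,\nu) &= S(\mu,\nu) - \max_{\lambda \neq \nu} \bigl[ a(\mu,\lambda) + S(\mu,\lambda) \bigr] \\
a(\mu,\nu) &= \min\Bigl[ 0,\; r(\nu,\nu) + \sum_{\lambda \notin \{\mu,\nu\}} \max[0, r(\lambda,\nu)] \Bigr], \qquad \mu \neq \nu \\
a(\nu,\nu) &= \sum_{\lambda \neq \nu} \max[0, r(\lambda,\nu)]
\end{align}

% Assumed soft-constraint variant: threshold the self-terms with \tilde{p}
\begin{align}
a(\mu,\nu) &= \min\Bigl[ 0,\; \max[-\tilde{p},\, r(\nu,\nu)] + \sum_{\lambda \notin \{\mu,\nu\}} \max[0, r(\lambda,\nu)] \Bigr], \qquad \mu \neq \nu \\
a(\nu,\nu) &= \min\Bigl[ \tilde{p},\; \sum_{\lambda \neq \nu} \max[0, r(\lambda,\nu)] \Bigr]
\end{align}
```

In this reading, letting $\tilde{p} \to \infty$ removes both thresholds and returns the first set of updates.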

Efficient implementation • The iterative solution:

Efficient implementation
• Differences from the original AP:
  • Step 3 is formulated as a sequential update.
  • The original AP used a damped parallel update.
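A sequential, undamped message-passing loop could look like the sketch below. It is not the algorithm from the slides, whose steps are not reproduced here: the updates are the standard AP ones with an assumed threshold `p_tilde` applied to the self-responsibility and self-availability, and "sequential" is interpreted as sweeping over candidate exemplars and reusing freshly updated messages at once. The function name `soft_ap` and its parameters are hypothetical.

```python
import numpy as np


def soft_ap(S, p_tilde=np.inf, sweeps=100):
    """AP-style message passing with sequential updates and soft self-thresholds.

    S: (n, n) float similarity matrix; the diagonal holds the self-similarities.
    p_tilde: assumed threshold; np.inf gives the usual hard-constraint updates.
    """
    n = S.shape[0]
    r = np.zeros((n, n))
    a = np.zeros((n, n))
    for _ in range(sweeps):
        for nu in range(n):  # one candidate exemplar at a time
            # Responsibilities r(mu, nu), computed from the latest availabilities.
            for mu in range(n):
                others = np.delete(a[mu] + S[mu], nu)
                r[mu, nu] = S[mu, nu] - others.max()
            # Availabilities a(mu, nu), reusing the fresh r(:, nu) immediately.
            pos = np.maximum(0.0, r[:, nu])
            pos[nu] = 0.0
            total = pos.sum()
            self_r = max(-p_tilde, r[nu, nu])  # thresholded self-responsibility
            for mu in range(n):
                if mu == nu:
                    a[nu, nu] = min(p_tilde, total)  # capped self-availability
                else:
                    a[mu, nu] = min(0.0, self_r + total - pos[mu])
    return np.argmax(a + r, axis=1)


# Two well-separated pairs on a line; self-similarities are illustrative values.
x = np.array([0.0, 1.0, 10.0, 11.0])
S = -np.abs(x[:, None] - x[None, :])
S[np.arange(4), np.arange(4)] = [-2.0, -3.0, -2.0, -3.0]

print(soft_ap(S))               # hard-constraint limit
print(soft_ap(S, p_tilde=1.0))  # finite threshold
```

In the hard-constraint limit the two pairs are recovered with points 0 and 2 as exemplars. No damping factor is used here, which is the design difference the bullets point to.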

Extracting cluster signatures
• Only a few components carry useful information about the cluster structure; they are called cluster signatures.
• We assume the similarity between data points $\mu$ and $\nu$ to be additive in single-gene contributions:
  $S(\mu,\nu) = \sum_i s_i(\mu,\nu)$, where $i$ runs over genes.

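Extracting cluster signatures
• Having found a clustering given by the exemplar selection $\mathbf{c}^*$, we can calculate the similarity of a cluster $C$, defined as a connected component of the directed graph with an edge from each data point $\mu$ to its exemplar $c_\mu^*$, as a sum over single-gene contributions:
  $S_C = \sum_{\mu \in C} S(\mu, c_\mu^*) = \sum_i S_C^{(i)}$, with $S_C^{(i)} = \sum_{\mu \in C} s_i(\mu, c_\mu^*)$ the contribution of gene $i$.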
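Reading clusters off an exemplar assignment amounts to finding connected components. A minimal union-find sketch follows, assuming that edge direction is ignored when forming components; `clusters_from_exemplars` is a hypothetical helper name, and `c[k]` is the exemplar index of point `k`.

```python
def clusters_from_exemplars(c):
    """Group points into connected components of the graph mu -> c[mu],
    ignoring edge direction, using union-find."""
    parent = list(range(len(c)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for mu, ex in enumerate(c):
        parent[find(mu)] = find(ex)

    groups = {}
    for mu in range(len(c)):
        groups.setdefault(find(mu), []).append(mu)
    return sorted(groups.values())


# Points 0 -> 1 -> 2 -> 2 form a chain rather than a star; 3 and 4 both point to 4.
print(clusters_from_exemplars([1, 2, 2, 4, 4]))
```

The chain 0, 1, 2 ends up in a single cluster even though point 0 never refers to point 2 directly, which is the kind of non-star shape the soft constraint allows.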

Extracting cluster signatures
• We then compare each single-gene contribution to random exemplar choices, which are characterized by their mean and variance.

Extracting cluster signatures
• The relevance of a gene is measured by a score that gives the distance of the actual single-gene contribution from the distribution of random exemplar mappings.
• Genes are ranked according to this score; the highest-ranking genes are considered a cluster signature.
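The ranking can be sketched as a per-gene z-score. Several choices below are assumptions rather than details taken from the slides: the per-gene similarity is minus the absolute expression difference, a random exemplar is drawn uniformly and independently from all members of the cluster (so per-point variances add), and the score is the actual contribution minus the random mean, divided by the random standard deviation. `gene_relevance` is a hypothetical name.

```python
import numpy as np


def gene_relevance(X, members, exemplar):
    """Score each gene by the distance of its actual contribution to the
    cluster similarity from that of random exemplar choices in the cluster.

    X: (n_samples, n_genes) expression matrix.
    members: indices of the samples in the cluster.
    exemplar: dict mapping each member to its chosen exemplar.
    """
    members = list(members)
    n_genes = X.shape[1]
    actual = np.zeros(n_genes)
    mean = np.zeros(n_genes)
    var = np.zeros(n_genes)
    for mu in members:
        # Assumed per-gene similarity: minus the absolute difference.
        actual += -np.abs(X[mu] - X[exemplar[mu]])
        contrib = -np.abs(X[mu] - X[members])  # one row per candidate exemplar
        mean += contrib.mean(axis=0)
        var += contrib.var(axis=0)
    # Small constant avoids division by zero for genes that do not vary.
    return (actual - mean) / np.sqrt(var + 1e-12)


# Toy cluster of four samples arranged as a chain 0 -> 1 -> 2 -> 3 -> 3.
# Gene 0 varies smoothly along the chain, gene 1 does not, gene 2 is constant.
X = np.array([[0.0, 0.0, 1.0],
              [1.0, 3.0, 1.0],
              [2.0, 1.0, 1.0],
              [3.0, 2.0, 1.0]])
members = [0, 1, 2, 3]
exemplar = {0: 1, 1: 2, 2: 3, 3: 3}

scores = gene_relevance(X, members, exemplar)
ranking = np.argsort(-scores)  # highest-scoring genes first
print(scores, ranking)
```

In this toy example gene 0 receives the highest score, because each sample is closer to its actual exemplar in that gene than to a randomly chosen cluster member.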

Experimental results
• Iris data
• Brain cancer data
• Other benchmark cancer data
  • Lymphoma cancer data
  • SRBCT cancer data
  • Leukemia

Iris data
• Three clusters: setosa, versicolor, virginica.
• Four features for 150 flowers:
  • sepal length
  • sepal width
  • petal length
  • petal width

Iris data
• Experimental results:
  • Affinity Propagation: 16 errors.
  • SCAP: 9 errors, with the Manhattan distance as the similarity measure.
• On increasing the value of the parameter, the clusters for versicolor and virginica merge with each other, reflecting the fact that they are closer to each other than to setosa.
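A similarity matrix based on the Manhattan distance can be built as below. This is a sketch under assumptions: similarity is taken as minus the distance, the rows of `X` are made-up measurements rather than actual Iris records, and the diagonal is left at zero because the choice of self-similarities is not stated here.

```python
import numpy as np


def manhattan_similarity(X):
    """S[mu, nu] = -sum_i |X[mu, i] - X[nu, i]| (negative Manhattan distance)."""
    return -np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)


# Illustrative rows: sepal length, sepal width, petal length, petal width.
X = np.array([[5.0, 3.5, 1.5, 0.2],
              [4.9, 3.0, 1.4, 0.2],
              [6.5, 3.0, 5.5, 2.0]])
S = manhattan_similarity(X)
print(S)
```

The first two rows are far more similar to each other than either is to the third, which is what a clustering step would then pick up.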

Brain cancer data
• Five diagnosis types for 42 patients:
  • 10 medulloblastoma
  • 10 malignant glioma
  • 10 atypical teratoid/rhabdoid tumors
  • 4 normal cerebella
  • 8 primitive neuroectodermal tumors (PNET)

Brain cancer data
• Clustering with AP: five clusters give the lowest number of errors. There are three well-distinguishable clusters.

Brain cancer data
• Clustering with SCAP: SCAP identifies four clusters with 8 errors.

Brain cancer data
• The eight errors are due to misclassifications of the fifth diagnosis (PNET).
• We use the procedure to extract cluster signatures in the case of four clusters:
  • Samples no. 34 to 41 belong to the fifth diagnosis.

Other benchmark cancer data
• Lymphoma cancer data
  • Three diagnoses for 62 patients.
• SRBCT cancer data
  • Four expression diagnosis patterns for 63 samples.
• Leukemia
  • Two diagnoses for 72 samples.

Other benchmark cancer data
• Lymphoma cancer data
  • AP: 3 errors with 3 clusters.
  • SCAP: 1 error with 3 clusters.
• SRBCT cancer data
  • AP: 22 errors with 5 clusters.
  • SCAP: 7 errors with 4 clusters.
• Leukemia
  • AP: 4 errors with 2 clusters.
  • SCAP: 2 errors with 2 clusters.
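How the error counts were obtained is not spelled out on the slides. One common convention, assumed here, is to give each cluster the majority diagnosis of its members and count the points that disagree. `count_errors` is a hypothetical helper, and the labels below are made up for illustration.

```python
from collections import Counter


def count_errors(cluster_ids, labels):
    """Assign each cluster its majority label and count the points that disagree."""
    errors = 0
    for cid in set(cluster_ids):
        member_labels = [lab for c, lab in zip(cluster_ids, labels) if c == cid]
        majority_size = Counter(member_labels).most_common(1)[0][1]
        errors += len(member_labels) - majority_size
    return errors


# Six samples in two clusters; one sample of type "b" lands in the "a" cluster.
cluster_ids = [0, 0, 0, 1, 1, 1]
labels = ["a", "a", "b", "b", "b", "b"]
print(count_errors(cluster_ids, labels))
```

Under this convention the count can only stay the same or fall when a cluster is split, so error counts are best compared together with the number of clusters, as in the list above.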

Discussion
• If clusters cannot be well represented by a single cluster exemplar, AP is bound to fail.
• SCAP is more efficient than AP, particularly for noisy, irregularly organized data, and thus for biological applications involving microarray data.
• The cluster structure can be efficiently probed.
