
Soft clustering of gene expression data

Matthias E. Futschik, Institute for Theoretical Biology, Humboldt-University, Berlin, Germany.

Presentation Transcript


  1. Soft clustering of gene expression data. Matthias E. Futschik, Institute for Theoretical Biology, Humboldt-University, Berlin, Germany

  2. Clustering methodology
     • Hierarchical clustering
       • can be divisive or agglomerative, producing nested clusters.
       • Results are usually visualised as tree structures (dendrograms).
       • The clustering depends on the linkage procedure used: single, complete, average, ...
     • Partitional clustering
       • divides data into a (pre-)chosen number of classes.
       • Examples: k-means, SOMs, fuzzy c-means, simulated annealing, model-based clustering, HMMs, ...
       • Setting the number of clusters is problematic.
     • Cluster validity
       • Most cluster algorithms always detect clusters, even in random data.
       • Cluster validation approaches address the number of existing clusters.
       • Approaches are based on objective functions, figures of merit, resampling, adding noise, ...
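
The point that cluster algorithms report clusters even in random data can be illustrated with a minimal sketch. The code below is a plain k-means written for illustration only; the function name `kmeans`, the seeds, and the uniformly random toy data are assumptions and are not taken from the presentation.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on a list of equal-length tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise with k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster runs empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels

# Structureless toy data (assumption): 200 points drawn uniformly from the unit square.
data_rng = random.Random(42)
random_points = [(data_rng.random(), data_rng.random()) for _ in range(200)]

centroids, labels = kmeans(random_points, k=3)
cluster_sizes = [labels.count(j) for j in range(3)]
print(cluster_sizes)  # every point receives a label although the data has no structure
```

The algorithm returns a complete partition regardless of whether the data contains real groups, which is why a separate validation step is needed.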

  3. Hard clustering vs. soft clustering
     • Hard clustering:
       • Based on classical set theory
       • Assigns a gene to exactly one cluster
       • Does not differentiate how well a gene is represented by the cluster centroid
       • Examples: hierarchical clustering, k-means, SOMs, ...
     • Soft clustering:
       • Can assign a gene to several clusters
       • Differentiates the grade of representation (cluster membership)
       • Examples: fuzzy c-means, HMMs, ...
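
The difference between the two kinds of assignment can be sketched for a single gene and fixed centroids. The soft memberships below use the standard fuzzy c-means membership formula; the function names, the two toy centroids, and the default m = 2 are assumptions made for illustration.

```python
import math

def distances(x, centroids):
    """Euclidean distance from profile x to every centroid."""
    return [math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c))) for c in centroids]

def hard_assignment(x, centroids):
    """Index of the nearest centroid; ties go to the lowest index."""
    d = distances(x, centroids)
    return d.index(min(d))

def soft_memberships(x, centroids, m=2.0):
    """Fuzzy c-means memberships: u_j = 1 / sum_k (d_j / d_k)^(2/(m-1))."""
    d = distances(x, centroids)
    if 0.0 in d:  # x coincides with one or more centroids
        hits = d.count(0.0)
        return [1.0 / hits if dj == 0.0 else 0.0 for dj in d]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** exponent for dk in d) for dj in d]

centroids = [(0.0, 0.0), (4.0, 0.0)]      # hypothetical cluster centroids
clear_gene = (1.0, 0.0)                   # close to the first centroid
borderline_gene = (2.0, 0.0)              # halfway between both centroids

for gene in (clear_gene, borderline_gene):
    print(gene, hard_assignment(gene, centroids), soft_memberships(gene, centroids))
```

Both genes receive the same hard label, while the memberships show that only the first one is well represented by its centroid.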

  4. Hard clustering is sensitive to noise
     • Example data set: yeast cell cycle data by Cho et al.
     • [Figure: standard deviation of expression]
     • The standard procedure is to pre-filter genes based on their variation, because hard clustering is sensitive to noise.
     • However, no obvious threshold exists! (Heyer et al.: ca. 4000 genes, Tavazoie et al.: 3000 genes, Tamayo et al.: 823 genes)
     => Risk of losing essential information
     => Need for a noise-robust clustering method
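
Pre-filtering by variation can be sketched as follows. The gene names, expression values, and thresholds are hypothetical toy data, and the population standard deviation is only one possible variation measure; the filters used in the cited studies are not reproduced here.

```python
import statistics

def filter_by_std(expression, threshold):
    """Keep genes whose profile has a population standard deviation above threshold."""
    return {
        gene: profile
        for gene, profile in expression.items()
        if statistics.pstdev(profile) > threshold
    }

# Hypothetical toy profiles (four time points per gene).
expression = {
    "geneA": [0.1, -0.1, 0.0, 0.1],
    "geneB": [1.5, -1.2, 1.4, -1.6],
    "geneC": [0.6, -0.5, 0.4, -0.6],
}

for threshold in (0.2, 0.8):
    kept = filter_by_std(expression, threshold)
    print(threshold, sorted(kept))
```

The retained gene set changes with the threshold, which is the problem the slide points to: without an obvious cut-off, the choice is arbitrary and may discard informative genes.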

  5. Soft clustering is more noise robust
     • Hard clustering always detects clusters, even in random data.
     • Soft clustering differentiates cluster strength and can thus avoid detecting 'random' clusters.
     • Genes with high membership values cluster together in spite of added noise.

  6. Differentiation in cluster membership allows profiling of cluster cores
     • A gene can be assigned to several clusters.
     • Each gene is assigned to a cluster with a membership value between 0 and 1.
     • The membership values of a gene add up to one.
     • Genes with lower membership values are not well represented by the cluster centroid.
     • Expression of genes with high membership values is close to the cluster centroid.
     => Clusters have internal structures
     [Figure panels: hard clustering; membership value > 0.5; membership value > 0.7]
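
Extracting a cluster core by thresholding membership values can be sketched as below. The membership table is hypothetical, and the function names `hard_cluster` and `cluster_core` are introduced only for this illustration; the thresholds 0.5 and 0.7 are the ones shown on the slide.

```python
def hard_cluster(memberships, cluster):
    """Genes whose highest membership is in `cluster` (the hard-clustering view)."""
    return sorted(g for g, u in memberships.items() if u.index(max(u)) == cluster)

def cluster_core(memberships, cluster, alpha):
    """Genes whose membership in `cluster` exceeds the threshold alpha."""
    return sorted(g for g, u in memberships.items() if u[cluster] > alpha)

# Hypothetical membership values for three clusters; each row adds up to one.
memberships = {
    "geneA": [0.85, 0.10, 0.05],
    "geneB": [0.60, 0.30, 0.10],
    "geneC": [0.40, 0.35, 0.25],
    "geneD": [0.05, 0.90, 0.05],
}

print(hard_cluster(memberships, 0))        # all genes nearest to cluster 0
print(cluster_core(memberships, 0, 0.5))   # core at membership > 0.5
print(cluster_core(memberships, 0, 0.7))   # tighter core at membership > 0.7
```

Raising the threshold peels away the weakly associated genes and leaves the core of the cluster.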

  7. Variation in cluster parameter reveals cluster stability
     • Variation of the fuzzification parameter m determines the 'hardness' of the clustering:
       • m → 1: fuzzy c-means clustering becomes equivalent to k-means.
       • m → ∞: all genes are equally assigned to all clusters.
     • By variation of m, clusters can be distinguished by their stability:
       • Strong clusters maintain their core for increasing m.
       • Weak clusters lose their core.
     [Figure panels: m = 1.1; m = 1.3]
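
The effect of m can be sketched with a small textbook fuzzy c-means implementation. This is not the presenter's code: the function names, the eight toy points, the fixed starting centroids, and the chosen values of m are all assumptions made for illustration.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fuzzy_c_means(points, init_centroids, m, iterations=50):
    """Textbook fuzzy c-means; returns (centroids, memberships)."""
    centroids = [list(c) for c in init_centroids]
    exponent = 2.0 / (m - 1.0)
    dims = len(points[0])
    memberships = []
    for _ in range(iterations):
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)).
        memberships = []
        for p in points:
            d = [max(euclidean(p, c), 1e-9) for c in centroids]  # avoid division by zero
            memberships.append(
                [1.0 / sum((dj / dk) ** exponent for dk in d) for dj in d]
            )
        # Centroid update: mean of all points, weighted by u_ij^m.
        for j in range(len(centroids)):
            weights = [u[j] ** m for u in memberships]
            total = sum(weights)
            centroids[j] = [
                sum(w * p[k] for w, p in zip(weights, points)) / total
                for k in range(dims)
            ]
    return centroids, memberships

# Toy data (assumption): two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
init = [points[0], points[-1]]  # one starting centroid per group

mean_top = {}
for m in (1.25, 2.0, 4.0):
    _, u = fuzzy_c_means(points, init, m)
    mean_top[m] = sum(max(row) for row in u) / len(u)
    print(f"m={m}: mean highest membership = {mean_top[m]:.3f}")
```

For m close to 1 the highest membership of each point is nearly 1, as in a hard partition; as m grows the memberships flatten towards equal assignment. Tracking which clusters keep high-membership genes over a range of m is one way to compare their stability.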

  8. Periodic and aperiodic clusters
     [Figure panels: periodic clusters of the yeast cell cycle; aperiodic clusters]
     => Aperiodic clusters were generally weaker than periodic clusters

  9. Global clustering structure
     • c-means clustering allows the definition of an overlap between clusters, i.e. how many genes are shared by two clusters.
     • This makes it possible to define a similarity measure between clusters.
     • Global clustering structures can be visualised by graphs, with edges representing overlap.
     • Increasing the number of clusters => sub-clustering reveals sub-structures.
     [Figure: non-linear 2D projection by Sammon's mapping]
     M. Futschik and B. Carlisle, Noise robust, soft clustering of gene expression data (JBCB, Vol. 3, No. 4, 965-988, 2005)
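
One way to turn shared membership into a cluster similarity and a graph is sketched below. The slide does not give a formula, so the overlap measure used here (the mean product of the two membership values over all genes) is an assumption, as are the function names, the toy membership matrix, and the edge threshold.

```python
from itertools import combinations

def cluster_overlap(memberships, k, l):
    """Mean product of memberships in clusters k and l (assumed overlap measure)."""
    return sum(u[k] * u[l] for u in memberships) / len(memberships)

def overlap_graph(memberships, min_overlap):
    """Edges (k, l) -> overlap for all cluster pairs with overlap >= min_overlap."""
    n_clusters = len(memberships[0])
    edges = {}
    for k, l in combinations(range(n_clusters), 2):
        overlap = cluster_overlap(memberships, k, l)
        if overlap >= min_overlap:
            edges[(k, l)] = overlap
    return edges

# Hypothetical membership matrix: one row per gene, one column per cluster.
memberships = [
    [0.50, 0.45, 0.05],
    [0.60, 0.35, 0.05],
    [0.05, 0.05, 0.90],
    [0.10, 0.10, 0.80],
]

edges = overlap_graph(memberships, min_overlap=0.1)
print(edges)  # only clusters 0 and 1 share enough membership to be linked
```

Under this measure, clusters that share many genes with intermediate membership values are linked, while well-separated clusters remain unconnected.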
