
Canadian Bioinformatics Workshops


Presentation Transcript


  1. Canadian Bioinformatics Workshops www.bioinformatics.ca


  3. Module 5: Clustering. Exploratory Data Analysis and Essential Statistics using R. Boris Steipe, Toronto, September 8–9, 2011. Department of Biochemistry, Department of Molecular Genetics. Includes material originally developed by Sohrab Shah. † Herakles and Iolaos battle the Hydra. Classical (450–400 BCE).

  4. Introduction to clustering • What is clustering? • unsupervised learning • discovery of patterns in data • class discovery • Grouping together “objects” that are most similar (or least dissimilar) • objects may be genes, or samples, or both • Example questions: Are there samples in my cohort that can be subgrouped based on molecular profiling? Do these subgroups correlate with clinical outcome?

  5. Distance metrics • To perform clustering, we need a way to measure how similar (or dissimilar) two objects are • Euclidean distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$ • Manhattan distance: $d(x, y) = \sum_i |x_i - y_i|$ • 1 − correlation: $d(x, y) = 1 - r_{xy}$ • for standardized measurements this is proportional to squared Euclidean distance, but invariant to the range of measurement from one sample to the next
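
  A minimal base-R sketch of these three metrics (the toy matrix m is an assumption for illustration, not workshop data):

    set.seed(1)
    m <- matrix(rnorm(40), nrow = 4)       # 4 samples x 10 features
    dist(m, method = "euclidean")          # sqrt(sum((x - y)^2))
    dist(m, method = "manhattan")          # sum(|x - y|)
    as.dist(1 - cor(t(m)))                 # 1 - Pearson correlation between samples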

  6. Distance metrics compared [Figure: the same data clustered using Euclidean, Manhattan, and 1 − Pearson distances.] Conclusion: distance matters!

  7. Other distance metrics • Hamming distance for ordinal, binary or categorical data: $d(x, y) = \sum_i \mathbf{1}[x_i \neq y_i]$, the number of positions at which the two vectors differ
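
  A one-line illustration in base R (hypothetical vectors x and y):

    x <- c("A", "B", "B", "A", "C")
    y <- c("A", "C", "B", "A", "A")
    sum(x != y)                            # Hamming distance: 2 mismatched positions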

  8. Approaches to clustering • Partitioning methods • K-means • K-medoids (partitioning around medoids) • Model based approaches • Hierarchical methods • nested clusters • start with pairs • build a tree up to the root

  9. Partitioning methods • Anatomy of a partitioning based method • input: a data matrix, a distance function, and the number of groups K • Output • the group assignment of every object

  10. Partitioning based methods • Choose K groups • initialise the group centers (aka centroids or medoids) • assign each object to the nearest center according to the distance metric • recompute (or reassign) the centers • repeat the last two steps until the assignment stabilizes (see the K-means sketch below)
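
  As a sketch of these steps, base R's kmeans() runs the whole loop; the toy two-group matrix m and K = 2 are assumptions for illustration:

    set.seed(42)
    m  <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
                matrix(rnorm(20, mean = 3), ncol = 2))   # two toy groups
    km <- kmeans(m, centers = 2)    # choose K, then iterate assign/recompute
    km$cluster                      # group assignment of every object
    km$centers                      # final centroids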

  11. K-means vs K-medoids
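
  A hedged sketch of the comparison, assuming the cluster package is installed: pam() (partitioning around medoids) picks actual objects as centers, whereas kmeans() averages them.

    library(cluster)                       # provides pam()
    set.seed(42)
    m  <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
                matrix(rnorm(20, mean = 3), ncol = 2))
    km <- kmeans(m, centers = 2)           # K-means: centers are group means
    pm <- pam(m, k = 2)                    # K-medoids: centers are data points
    pm$medoids                             # medoids are rows of m
    table(kmeans = km$cluster, pam = pm$clustering)   # compare assignments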

  12. Partitioning based methods

  13. Agglomerative hierarchical clustering

  14. Hierarchical clustering • Anatomy of hierarchical clustering • input: a distance matrix and a linkage method • Output • a dendrogram: a tree that defines the relationships between objects and the distance between clusters • a nested sequence of clusters
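
  A minimal sketch of this anatomy in base R (toy matrix m assumed for illustration): hclust() takes a distance matrix and a linkage method; cutree() slices the resulting dendrogram into a nested sequence of clusterings.

    set.seed(42)
    m  <- matrix(rnorm(40), nrow = 10)            # 10 toy objects
    hc <- hclust(dist(m), method = "average")     # distance matrix + linkage
    plot(hc)                                      # the dendrogram
    cutree(hc, k = 2:4)                           # nested clusterings for K = 2..4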

  15. Linkage methods [Figure: four definitions of inter-cluster distance — single (closest pair of members), complete (farthest pair of members), distance between centroids, and average (mean pairwise distance).]

  16. Linkage methods • Ward (1963): form partitions that minimize the loss associated with each grouping, where loss is defined as the error sum of squares (ESS) • consider 10 objects with scores (2, 6, 5, 6, 2, 2, 2, 0, 0, 0); treated as one group with mean 2.5: $ESS_{\text{one group}} = (2 - 2.5)^2 + (6 - 2.5)^2 + \dots + (0 - 2.5)^2 = 50.5$ • On the other hand, if the 10 objects are classified according to their scores into four sets {0,0,0}, {2,2,2,2}, {5}, {6,6}, the ESS is the sum of four separate error sums of squares: $ESS_{\text{four groups}} = ESS_{\text{group 1}} + ESS_{\text{group 2}} + ESS_{\text{group 3}} + ESS_{\text{group 4}} = 0.0$ • Thus, clustering the 10 scores into these 4 clusters results in no loss of information
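
  A quick base-R check of this arithmetic (a sketch, not workshop code):

    x   <- c(2, 6, 5, 6, 2, 2, 2, 0, 0, 0)
    ess <- function(v) sum((v - mean(v))^2)
    ess(x)                                                   # one group: 50.5
    sum(sapply(list(c(0,0,0), c(2,2,2,2), 5, c(6,6)), ess))  # four groups: 0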

  17. Linkage methods in action • clustering based on single linkage • single <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "single") • plot(single)

  18. Linkage methods in action • clustering based on complete linkage • complete <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "complete") • plot(complete)

  19. Linkage methods in action • clustering based on centroid linkage • centroid <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "centroid") • plot(centroid) • (note: hclust's "centroid" method expects squared Euclidean dissimilarities for meaningful merge heights)

  20. Linkage methods in action • clustering based on average linkage • average <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "average") • plot(average)

  21. Linkage methods in action • clustering based on Ward linkage • ward <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "ward") • plot(ward) • (note: in R ≥ 3.1.0 this method is named "ward.D"; "ward.D2" implements Ward's original criterion)

  22. Linkage methods in action Conclusion: linkage matters!

  23. Hierarchical clustering analyzed

  24. Model based approaches • Assume the data are ‘generated’ from a mixture of K distributions • What cluster assignment and parameters of the K distributions best explain the data? • ‘Fit’ a model to the data and try to get the best fit • Classical example: mixture of Gaussians (mixture of normals) • Takes advantage of probability theory and well-defined distributions in statistics
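
  A hedged sketch of a Gaussian mixture fit, assuming the mclust package (an illustration, not necessarily what the workshop used); Mclust() fits mixtures by EM and selects the number of components by BIC.

    library(mclust)                        # Gaussian mixture models via EM
    set.seed(1)
    m   <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
                 matrix(rnorm(40, mean = 4), ncol = 2))  # two toy groups
    fit <- Mclust(m, G = 1:5)              # fit 1-5 components, select by BIC
    summary(fit)                           # chosen model and group sizes
    head(fit$classification)               # cluster assignment per object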

  25. Model based clustering: array CGH

  26. Model based clustering of aCGH • Problem: patient cohorts often exhibit molecular heterogeneity, making rarer shared CNAs hard to detect • Approach: cluster the data by extending the profiling to the multi-group setting • Shah et al. (Bioinformatics, 2009): HMM-Mix, a mixture of HMMs [Figure: plate diagram of HMM-Mix — per-group sparse profiles over states, the distribution of calls in a group, and per-patient CNA calls and raw data.]

  27. Advantages of model based approaches • In addition to clustering patients into groups, we output a ‘model’ that best represents the patients in a group • We can then associate each model with clinical variables and simply output a classifier to be used on new patients • Choosing the number of groups becomes a model selection problem (cf. the Bayesian Information Criterion, below); see Yeung et al., Bioinformatics (2001)
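
  For reference, the standard form of the criterion (general statistics, not specific to this workshop): for a model with $k$ free parameters, maximized likelihood $\hat{L}$, and $n$ observations,

    $$\mathrm{BIC} = -2 \ln \hat{L} + k \ln n$$

  and the number of groups whose model attains the lowest BIC is preferred (some implementations, e.g. mclust, flip the sign so that higher is better).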

  28. Clustering 106 follicular lymphoma patients with HMM-Mix [Figure: panels labelled Initialisation, Profiles, Clinical, and Converged.] • Recapitulates known FL subgroups • Subgroups have clinical relevance

  29. Feature selection • Most features (genes, SNP probesets, BAC clones) in high dimensional datasets will be uninformative • examples: unexpressed genes, housekeeping genes, ‘passenger alterations’ • Clustering (and classification) has a much higher chance of success if uninformative features are removed • Simple approaches (sketched below): • select intrinsically variable genes • require a minimum level of expression in a proportion of samples • genefilter package (Bioconductor): Lab 1 • We will return to feature selection in the context of classification
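
  A hedged sketch of both simple approaches, assuming a hypothetical genes-by-samples matrix exprMat and the Bioconductor genefilter package:

    set.seed(1)
    exprMat <- matrix(rpois(5000, 100), nrow = 500)   # 500 toy genes x 10 samples

    ## 1. select intrinsically variable genes (base R): keep the top 100 by variance
    v       <- apply(exprMat, 1, var)
    exprVar <- exprMat[order(v, decreasing = TRUE)[1:100], ]

    ## 2. require a minimum expression level in a proportion of samples
    library(genefilter)
    sel      <- genefilter(exprMat, filterfun(kOverA(k = 5, A = 100)))
    exprFilt <- exprMat[sel, ]                        # value > 100 in >= 5 samples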

  30. Advanced topics in clustering • Top down clustering • Bi-clustering or ‘two-way’ clustering • Principal components analysis • Choosing the number of groups • model selection (AIC, BIC) • silhouette coefficient (sketched below) • the Gap curve • Joint clustering and feature selection
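
  As one worked example of choosing the number of groups, a sketch of the silhouette coefficient using the cluster package (toy matrix m assumed): the average silhouette width is computed for each candidate K, and higher is better.

    library(cluster)                       # provides silhouette()
    set.seed(1)
    m <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
               matrix(rnorm(40, mean = 4), ncol = 2))
    d <- dist(m)
    avg.sil <- sapply(2:5, function(k) {
      km <- kmeans(m, centers = k)
      mean(silhouette(km$cluster, d)[, "sil_width"])
    })
    setNames(avg.sil, 2:5)                 # pick the K with the largest value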

  31. What Have We Learned? • There are three main types of clustering approaches: hierarchical, partitioning, and model based • Feature selection is important: it reduces computational time and makes well-separated groups more likely to be identified • The distance metric matters • The linkage method matters in hierarchical clustering • Model based approaches offer principled probabilistic methods

  32. We are on a Coffee Break & Networking Session
