Canadian Bioinformatics Workshops

Module 5: Clustering

Exploratory Data Analysis and Essential Statistics using R

Boris Steipe

Toronto, September 8–9 2011

Herakles and Iolaos battle the Hydra. Classical (450-400 BCE)





Includes material originally developed by

Sohrab Shah


Introduction to clustering

  • What is clustering?
    • unsupervised learning
    • discovery of patterns in data
    • class discovery
  • Grouping together “objects” that are most similar (or least dissimilar)
    • objects may be genes, or samples, or both
  • Example question: Are there samples in my cohort that can be subgrouped based on molecular profiling?
    • Do these groups correlate with clinical outcome?

Distance metrics

  • In order to perform clustering, we need a way to measure how similar (or dissimilar) two objects are
  • Euclidean distance: d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )
  • Manhattan distance: d(x, y) = Σᵢ |xᵢ − yᵢ|
  • 1 − correlation: d(x, y) = 1 − r(x, y), where r is the Pearson correlation coefficient
    • for standardized profiles, proportional to squared Euclidean distance, and invariant to the scale of measurement from one sample to the next
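The three measures can be compared on a toy example (the vectors x and y are made up; y = 2 · x, so the two profiles differ in scale but are perfectly correlated):

```r
# Toy sketch of the three distance measures on two expression-like profiles.
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 8)

euclid    <- sqrt(sum((x - y)^2))   # same value as dist(rbind(x, y))
manhattan <- sum(abs(x - y))        # dist(rbind(x, y), method = "manhattan")
d.cor     <- 1 - cor(x, y)          # correlation distance ignores the scale change
```

Euclidean and Manhattan distance both report the profiles as far apart (√30 ≈ 5.48 and 10), while 1 − correlation is 0, illustrating the scale invariance noted above.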




Distance metrics compared




Conclusion: distance matters!


Other distance metrics

  • Hamming distance for ordinal, binary or categorical data: d(x, y) = the number of positions i at which xᵢ ≠ yᵢ
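A minimal sketch of Hamming distance in R (the two genotype-like vectors are made up):

```r
# Hamming distance: the number of positions at which two vectors disagree.
hamming <- function(a, b) sum(a != b)

g1 <- c("A", "A", "B", "C", "B")
g2 <- c("A", "B", "B", "C", "A")
hamming(g1, g2)  # 2: positions 2 and 5 differ
```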

Approaches to clustering

  • Partitioning methods
    • K-means
    • K-medoids (partitioning around medoids)
    • Model based approaches
  • Hierarchical methods
    • nested clusters
      • start with pairs
      • build a tree up to the root

Partitioning methods

  • Anatomy of a partitioning based method
    • data matrix
    • distance function
    • number of groups
  • Output
    • group assignment of every object
Partitioning based methods

  • Choose K groups
  • initialise group centers (a.k.a. centroids or medoids)
  • assign each object to the nearest center according to the distance metric
  • reassign (or recompute) the centers
  • repeat the last 2 steps until the assignment stabilizes
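The steps above can be sketched with R's built-in kmeans() on simulated data (two well-separated 2-D groups, K = 2; the data are made up):

```r
# K-means on two simulated 2-D groups.
set.seed(1)
dat <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),   # 20 objects near (0, 0)
             matrix(rnorm(40, mean = 5), ncol = 2))   # 20 objects near (5, 5)

km <- kmeans(dat, centers = 2, nstart = 10)
km$cluster   # group assignment of every object
km$centers   # final centroids
```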

Hierarchical clustering

  • Anatomy of hierarchical clustering
    • distance matrix
    • linkage method
  • Output
    • a tree that defines the relationships between objects and the distance between clusters
    • a nested sequence of clusters

Linkage methods

  • single linkage: minimum distance between any object in one cluster and any object in the other
  • complete linkage: maximum distance between any object in one cluster and any object in the other
  • average linkage: mean of all pairwise distances between objects in the two clusters
  • centroid linkage: distance between centroids

Linkage methods

  • Ward (1963): form partitions that minimize the loss associated with each grouping
  • loss is defined as the error sum of squares (ESS)
  • consider 10 objects with scores (2, 6, 5, 6, 2, 2, 2, 0, 0, 0)
  • treated as a single group (mean 2.5):

ESS(one group) = (2 − 2.5)² + (6 − 2.5)² + ... + (0 − 2.5)² = 50.5

  • if, on the other hand, the 10 objects are classified according to their scores into the four sets {0, 0, 0}, {2, 2, 2, 2}, {5}, {6, 6}, the total ESS is the sum of four separate error sums of squares:

ESS(four groups) = ESS(group 1) + ESS(group 2) + ESS(group 3) + ESS(group 4) = 0.0

  • thus, clustering the 10 scores into 4 clusters results in no loss of information
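The arithmetic above can be checked in a few lines of R:

```r
# Error sum of squares (ESS) for the ten scores, as one group vs. four groups.
ess <- function(v) sum((v - mean(v))^2)

scores <- c(2, 6, 5, 6, 2, 2, 2, 0, 0, 0)
ess(scores)                                                      # 50.5
ess(c(0, 0, 0)) + ess(c(2, 2, 2, 2)) + ess(c(5)) + ess(c(6, 6))  # 0
```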

Linkage methods in action

clustering based on single linkage

single <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "single");


Linkage methods in action

clustering based on complete linkage

complete <- hclust(dist(t(exprMatSub),method="euclidean"), method="complete");


Linkage methods in action

clustering based on centroid linkage

centroid <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "centroid");


Linkage methods in action
  • clustering based on average linkage
  • average <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "average");
  • plot(average);
Linkage methods in action
  • clustering based on Ward linkage
  • ward <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "ward.D");  # "ward" in R < 3.1.0
  • plot(ward);
Linkage methods in action

Conclusion: linkage matters!

Model based approaches

Assume the data are ‘generated’ from a mixture of K distributions

What cluster assignment and parameters of the K distributions best explain the data?

‘Fit’ a model to the data

Try to get the best fit

Classical example: mixture of Gaussians (mixture of normals)

Take advantage of probability theory and well-defined distributions in statistics
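A minimal sketch of the classical example, EM for a two-component 1-D Gaussian mixture (the simulated data, starting values and iteration count are all made up; a real analysis would use a package such as mclust):

```r
# EM for a mixture of two 1-D Gaussians on simulated data.
set.seed(1)
x <- c(rnorm(100, mean = 0, sd = 1), rnorm(100, mean = 5, sd = 1))

mu <- c(min(x), max(x))   # crude initialisation of the two means
s  <- c(1, 1)             # component standard deviations
w  <- c(0.5, 0.5)         # mixing weights

for (iter in 1:50) {
  # E-step: responsibility of component 2 for each data point
  d1 <- w[1] * dnorm(x, mu[1], s[1])
  d2 <- w[2] * dnorm(x, mu[2], s[2])
  r  <- d2 / (d1 + d2)
  # M-step: re-estimate weights, means and standard deviations
  w  <- c(mean(1 - r), mean(r))
  mu <- c(weighted.mean(x, 1 - r), weighted.mean(x, r))
  s  <- c(sqrt(weighted.mean((x - mu[1])^2, 1 - r)),
          sqrt(weighted.mean((x - mu[2])^2, r)))
}

cluster <- ifelse(r > 0.5, 2, 1)  # hard assignment from soft responsibilities
```

The fitted means approach the true values (0 and 5), and the responsibilities r give the cluster assignment that best explains the data under the model.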

Model based clustering of aCGH

Problem: patient cohorts often exhibit molecular heterogeneity, making rarer shared CNAs hard to detect

Approach: Cluster the data by extending the profiling to the multi-group setting

Shah et al. (Bioinformatics, 2009)

A mixture of HMMs: HMM-Mix

[Figure: HMM-Mix graphical model. For each patient p, the raw data are converted to CNA calls (state k); each group g has a sparse profile (state c) and a distribution of calls within the group.]

Advantages of model based approaches

In addition to clustering patients into groups, we output a ‘model’ that best represents the patients in a group

We can then associate each model with clinical variables and simply output a classifier to be used on new patients

Choosing the number of groups becomes a model selection problem (cf. the Bayesian Information Criterion)

see Yeung et al., Bioinformatics (2001)

Clustering 106 follicular lymphoma patients with HMM-Mix





  • Recapitulates known FL subgroups
  • Subgroups have clinical relevance
Feature selection

Most features (genes, SNP probesets, BAC clones) in high dimensional datasets will be uninformative

examples: unexpressed genes, housekeeping genes, ‘passenger alterations’

Clustering (and classification) has a much higher chance of success if uninformative features are removed

Simple approaches:

select intrinsically variable genes

require a minimum level of expression in a proportion of samples

genefilter package (Bioconductor): Lab 1

We will return to feature selection in the context of classification
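A sketch of the simplest approach, a variance filter, on a simulated expression matrix (genes in rows, samples in columns; the matrix and the cut-off of 100 genes are made up):

```r
# Keep the most variable genes of a simulated expression matrix.
set.seed(1)
exprMat <- matrix(rnorm(1000 * 20), nrow = 1000,
                  dimnames = list(paste0("g", 1:1000), paste0("s", 1:20)))

vars <- apply(exprMat, 1, var)                  # per-gene variance
keep <- order(vars, decreasing = TRUE)[1:100]   # top 100 most variable genes
exprMatSub <- exprMat[keep, ]                   # reduced matrix for clustering
```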

Advanced topics in clustering

  • Top-down clustering
  • Bi-clustering or 'two-way' clustering
  • Principal components analysis
  • Choosing the number of groups
    • model selection
    • silhouette coefficient
    • the Gap curve
  • Joint clustering and feature selection
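As one concrete sketch of choosing the number of groups, the average silhouette width can be compared across candidate values of K, using silhouette() from the 'cluster' package (shipped with R as a recommended package); the two-group simulated data are made up:

```r
# Choose K by maximising the average silhouette width.
library(cluster)
set.seed(1)
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 4), ncol = 2))
d <- dist(x)

avg.width <- sapply(2:5, function(k) {
  cl <- kmeans(x, centers = k, nstart = 10)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})
best.k <- (2:5)[which.max(avg.width)]  # K with the best average silhouette
```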

What Have We Learned?

  • There are three main types of clustering approaches
    • partitioning
    • hierarchical
    • model based

  • Feature selection is important
    • reduces computational time
    • more likely to identify well-separated groups
  • The distance metric matters
  • The linkage method matters in hierarchical clustering
  • Model based approaches offer principled probabilistic methods


We are on a Coffee Break & Networking Session