Approaches to clustering-based analysis and validation

Dr. Huiru Zheng Dr. Francisco Azuaje

School of Computing and Mathematics

Faculty of Engineering

University of Ulster


Gene Expression Data

Genes vs perturbations

Tissues vs genes

[Figure: expression matrices illustrating two distinct genetic profiles: genes (G) vs biological conditions (C), and types of tissue (T) vs genes (G).]


Clustering

  • At the end of an unsupervised recognition process (learning), we obtain a number of classes or CLUSTERS;

  • Each cluster groups a number of cases/input vectors;

  • When new input vectors are presented to the system, they are categorised into one of the existing classes or CLUSTERS.

[Figure: scatter plot in an (x, y) feature space showing two groups of points, Cluster A and Cluster B.]



Clustering approaches to classification traditional approach l.jpg
Clustering approaches toclassification: Traditional approach

a1 a2 a3 a4 a5

s1 x y z y y

s2 x x x y y

s3 y y x y y

s4 y y y y y

s5 y x x y y

C1: {s3, s4}

C2: {s1, s2,s5}

Exclusive clusters


Clustering approaches to classification: Direct sample/attribute correlation

        a1  a2  a3  a4  a5
   s1   x   y   z   y   y
   s2   x   x   x   y   y
   s3   y   y   x   y   y
   s4   y   y   y   y   y
   s5   y   x   x   y   y

C1: {s3, s4}, {a1, a2}

C2: {s1, s2, s5}, {a4, a5}

Exclusive biclusters


Clustering approaches to classification: Multiple cluster membership

        a1  a2  a3  a4  a5
   s1   x   y   z   y   y
   s2   x   x   x   y   y
   s3   y   y   x   y   y
   s4   y   y   y   y   y
   s5   y   x   x   y   y

C1: {s3, s4}, {a1, a2}

C2: {s1, s2, s3, s5}, {a4, a5}


Key Algorithms

  • Hierarchical Clustering

  • K-means

  • Kohonen Maps

  • Self-adaptive methods

  • ……


Hierarchical Clustering (1)

  • Organizes the data into larger groups, which contain smaller groups, like a tree or dendrogram.

  • Hierarchical methods avoid specifying how many clusters are appropriate by providing a partition for each k; the partitions are obtained by cutting the tree at different levels.

  • The tree can be built in two distinct ways

    • bottom-up: agglomerative clustering;

    • top-down: divisive clustering.

  • Algorithms:

    • Agglomerative (Single-linkage, complete-linkage, average-linkage) ….


Hierarchical Clustering (2)

[Figure: dendrogram over genes; the vertical axis indicates degrees of dissimilarity.]


Hierarchical Clustering (3)

  • P = set of genes

  • While more than one subtree in P

    • Pick the most similar pair i, j in P

    • Define a new subtree k joining i and j

    • Remove i and j from P and insert k
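A minimal sketch of this agglomerative procedure using SciPy's hierarchical clustering routines rather than the explicit set P above; the toy expression matrix, the linkage method and the distance metric are illustrative assumptions, not choices made on the slides.

# Agglomerative clustering of gene expression profiles with SciPy (illustrative sketch).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
expr = rng.normal(size=(20, 6))      # toy matrix: 20 genes x 6 conditions

# Average linkage with a correlation-based distance is a common choice for
# expression profiles; both are assumptions here.
Z = linkage(expr, method="average", metric="correlation")

# Cutting the tree at different levels yields a partition for each k.
labels_k3 = fcluster(Z, t=3, criterion="maxclust")
print(labels_k3)

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree (requires matplotlib).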


Figures of Hierarchical Clustering

[Figures: stepwise construction of a dendrogram over items 1 to 5, with intermediate subtrees 1', 2' and 3' formed at successive merges, ending with the complete tree.]


Hierarchical Clustering


An Example of Hierarchical Clustering

Partitional clustering

  • Creates a single set of clusters that partitions the data into similar groups.

  • Algorithms: Forgy’s, k-means, Isodata…


K-means Clustering

  • A value for k, the number of expected clusters, is selected up front.

  • The algorithm divides the data into k clusters in such a way that the profiles within each cluster are more similar to each other than to those in other clusters.


K-means approach

  • One additional input, k, is required. There are many variants of k-means.

  • Sum-of-squares criterion: minimize the within-cluster sum of squares, i.e. the sum over all clusters of the squared distances between each profile and the centroid of its cluster.


An example of the k-means approach

  • Two passes

    • Begin with k clusters, each consisting of one of the first k samples. For each of the remaining n-k samples, find the nearest centroid and assign the sample to that cluster. After each sample is assigned, re-compute the centroid of the altered cluster.

    • For each sample, find the nearest centroid and put the sample in the cluster identified with that centroid (centroids are not re-computed in this pass).
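A short NumPy sketch of this two-pass scheme, assuming "nearest" means Euclidean distance; the toy data and the value of k are illustrative assumptions.

# Two-pass k-means-style assignment, as described above (Forgy-like variant).
import numpy as np

def two_pass_kmeans(X, k):
    centroids = X[:k].copy()                 # pass 1: the first k samples seed the clusters
    counts = np.ones(k)
    for x in X[k:]:                          # assign the remaining n-k samples one by one
        j = np.argmin(np.linalg.norm(centroids - x, axis=1))
        counts[j] += 1
        centroids[j] += (x - centroids[j]) / counts[j]   # re-compute the altered centroid
    # pass 2: final assignment to the nearest centroid, no re-computation
    labels = np.array([np.argmin(np.linalg.norm(centroids - x, axis=1)) for x in X])
    return labels, centroids

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
labels, centroids = two_pass_kmeans(X, k=3)
print(labels)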


Examples


Kohonen Self-Organising Maps (SOM)

  • The aim of Kohonen learning is to map similar signals/input-vectors/cases to similar neurone positions;

  • Neurones or nodes that are physically adjacent in the network encode patterns or inputs that are similar


SOM: architecture (1)

[Figure: a Kohonen layer of neurones above an input vector X; neurone i has weight vector wi, and the winning neurone is highlighted.]

X = [x1, x2, …, xn] ∈ Rn

wi = [wi1, wi2, …, win] ∈ Rn


SOM: architecture (2)

A rectangular grid of neurones representing a Kohonen map. Lines are used to link neighbour neurons.


SOM: architecture (3)

2-dimensional representation of random weight vectors. The lines are drawn to connect neurones which are physically adjacent.


SOM: architecture (4)

2-dimensional representation of 6 input vectors (a training data set)


SOM: architecture (5)

In a well trained (ordered) network the diagram in the weight space should have the same topology as that in physical space and will reflect the properties of the training data set.


SOM: architecture (6)

Input space (training data set)

Weight vector representations after training


SOM-based Clustering (1)

  • Type of input

    a) The input a neural network can process is a vector of fixed length.

    b) This means that only numbers can be used as input and that the network must be set up so that the longest input vector can be processed. Vectors with fewer elements must therefore be padded with additional elements until they have the same size as the longest vector.


SOM-based Clustering (2)

  • Classification of inputs

    a) In a Kohonen network, each neurone is represented by a so-called weight vector;

    b) During training these vectors are adjusted to match the input vectors in such a way that after training each of the weight vectors represents a certain class of input vectors;

    c) If a vector is presented as input in the test phase, the weight vector representing the class this input vector belongs to is given as output, i.e. the corresponding neurone is activated.


SOM-based Clustering (3)

  • Learning (training) behaviour.

    a) During training (learning) the neurones of a Kohonen network are adjusted in such a way, that on the map there will form regions which consist of neurones with similar weight vectors.

    b) This means that in a well-trained map, a class will not be represented by one single neurone, but by a group of neurons.

    c) In this group there is one central neurone which can be said to represent the most prototypical member of this class, while the surrounding neurons represent less prototypical members.


SOM: Learning Algorithm

SOMs define a mapping from an m-dimensional input data space onto a one- or two-dimensional array of nodes;

Algorithm:

1. initialize the network with n nodes;

2. select one case from the set of training cases;

3. find the node in the network that is closest (according to some measure of distance) to the selected case;

4. adjust the set of weights of the closest node and of the nodes around it;

5. repeat from 2. until some termination criterion is reached.
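A compact NumPy sketch of steps 1-5 on a small rectangular grid; the grid size, the learning-rate and neighbourhood schedules, and the toy data are assumptions made for illustration, not values from the slides.

# Minimal SOM training loop following steps 1-5 above (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
data = rng.uniform(-1, 1, size=(200, 3))        # training cases, scaled to [-1, 1]

rows, cols, dim = 5, 5, data.shape[1]
weights = rng.uniform(-0.1, 0.1, size=(rows, cols, dim))   # step 1: initialise the nodes
grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

n_epochs = 20
for epoch in range(n_epochs):
    lr = 0.5 * (1 - epoch / n_epochs)                        # learning rate decreases over epochs
    radius = max(1.0, (rows / 2) * (1 - epoch / n_epochs))   # shrinking neighbourhood
    for x in rng.permutation(data):                          # step 2: pick a training case
        d = np.linalg.norm(weights - x, axis=-1)
        k = np.unravel_index(np.argmin(d), d.shape)          # step 3: closest (winning) node
        # step 4: adjust the winner and the nodes around it on the grid
        g = np.linalg.norm(grid - np.array(k), axis=-1)
        h = (g <= radius).astype(float)[..., None]
        weights += lr * h * (x - weights)
    # step 5: repeat until the termination criterion (here, the epoch budget) is reached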


SOM: One single learning cycle (1)

1) The weights are initialised to random values (within the interval -0.1 to 0.1, for instance) and the neighbourhood sizes are set to cover over half of the network;

2) an m-dimensional input vector Xs (scaled between -1 and +1, for instance) enters the network;

3) The distances di(Wi, Xs) between all the weight vectors on the SOM and Xs are calculated, for instance using the Euclidean distance:

di(Wi, Xs) = sqrt( Σj (wij - xsj)² )

where:

Wi denotes the ith weight vector;

wij and xsj represent the jth elements of Wi and Xs respectively.


SOM: One single learning cycle (2)

4) Find the best matching or "winning" neurone, whose weight vector Wk is closest to the current input vector Xs;

5) Modify the weights of the winning neurone and all the neurones in the neighbourhood Nk by applying:

Wjnew = Wjold + α (Xs - Wjold)

where α represents the learning rate;

6) Present the next input vector, X(s+1), and repeat the process.


SOM: Learning Parameters

  • If a data set consists of P input vectors or cases, then 1 learning epoch is equal to P single learning cycles.

  • After a number of N learning epochs, the size of the neighbourhood is decreased.

  • After a number of M learning epochs, the learning rate, α, may be decreased.


SOM: Neighbourhood Schemes (1)

  • Linear

[Figure: a linear arrangement of neurones with the first and second neighbourhoods of a unit marked.]


SOM: Neighbourhood Schemes (2)

  • Rectangular

[Figure: a rectangular grid of neurones with the first and second neighbourhoods of a unit marked.]


SOM: Neighbourhood Schemes (3)

Why do we have to modify the size of neighbourhood ?

  • We need to induce map formation by adapting regions according to the similarity between weights and input vectors;

  • We need to ensure that neighbourhoods are adjacent;

  • Thus, a neighbourhood will represent a number of similar clusters or neurones;

  • By starting with a large neighbourhood we guarantee that a GLOBAL ordering takes place, otherwise there may be more than one region on the map encoding a given part of the input space.


SOM: Neighbourhood Schemes (4)

  • One good strategy is to gradually reduce the size of the neighbourhood for each neurone to zero over a first part of the learning phase, during the formation of the map topography;

  • and then continue to modify only the weight vectors of the winning neurones to pick up the fine details of the input space


SOM: Learning rate

Why do we need to decrease the learning rate α?

  • If the learning rate α is kept constant, it is possible for weight vectors to oscillate back and forth between two nearby positions;

  • Lowering α ensures that this does not occur and that the network is stable.


Visualising data and clusters with Kohonen maps

U-matrix and median distance matrix maps for leukaemia data

The U-matrix holds distances between neighbouring map units
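As a rough illustration of how such a map can be derived, the sketch below computes a U-matrix value for each unit of a trained rectangular SOM. Averaging the distances to the 4-connected neighbours is one common convention and is an assumption here, not necessarily the variant used for the leukaemia figure.

# U-matrix sketch: for each map unit, average the distance between its weight
# vector and the weight vectors of its 4-connected neighbours.
# Assumes a trained weight grid of shape (rows, cols, dim).
import numpy as np

def u_matrix(weights):
    rows, cols, _ = weights.shape
    u = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            dists = []
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    dists.append(np.linalg.norm(weights[i, j] - weights[ni, nj]))
            u[i, j] = np.mean(dists)
    return u   # high values suggest cluster borders, low values cluster interiors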


Basic Criteria For The Selection Of Clustering Techniques (1)

  • Which clustering algorithm should I use?

  • Should I apply an alternative solution?

  • How can results be improved by using different methods?


Basic Criteria For The Selection Of Clustering Techniques (2)

  • There are multiple clustering techniques that can be used to analyse expression data.

  • Choosing “the best” algorithm for a particular problem may represent a challenging task.

  • Advantages and limitations may depend on factors such as the statistical nature of the data, pre-processing procedures, number of features etc.

  • It is not uncommon to observe inconsistent results when different clustering methods are tested on a particular data set


Basic Criteria For The Selection Of Clustering Techniques (3)

  • In order to make an appropriate choice, it is important to have a good understanding of:

    • the problem domain under study, and

    • the clustering options available.


Basic Criteria For The Selection Of Clustering Techniques (4)

  • Knowledge on the underlying biological problem may allow a scientist to choose a tool that satisfies certain requirements, such as the capacity to detect overlapping classes.

  • Knowledge on the mathematical properties of a clustering technique may support the selection process.

    • How does this algorithm represent similarity (or dissimilarity)?

    • How much relevance does it assign to cluster heterogeneity?

    • How does it implement the process of measuring cluster isolation?

  • Answers to these questions may indicate crucial directions for the selection of an adequate clustering algorithm.


Basic Criteria For The Selection Of Clustering Techniques (5)

  • Empirical studies have defined several mathematical criteria of acceptability

    • For example, there may be clustering algorithms that are capable of guaranteeing the generation of partitions whose cluster structures do not intersect.

  • Several algorithms indirectly assume that the cluster structure of the data under consideration exhibits particular characteristics.

    • For instance, the k-means algorithm assumes that the shape of the clusters is spherical; and single-linkage hierarchical clustering assumes that the clusters are well separated


Basic Criteria For The Selection Of Clustering Techniques (6)

  • Unfortunately, this type of knowledge may not always be available in an expression data study.

  • In this situation a solution may be to test a number of techniques on related data sets which have previously been classified (a reference data set).

  • Thus, a user may choose a clustering method if it produces consistent categorisation results in relation to such a reference data set.
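One common way to quantify "consistent categorisation results" against such a reference data set is an external index such as the adjusted Rand index; a minimal scikit-learn sketch with invented label vectors (not results from these slides):

# Comparing a clustering result to a previously classified reference data set
# with the adjusted Rand index (1.0 = identical partitions, ~0 = chance level).
from sklearn.metrics import adjusted_rand_score

reference = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # known classes of the reference samples (toy)
clustering = [1, 1, 1, 0, 0, 2, 2, 2, 2]  # labels produced by a candidate algorithm (toy)

print(adjusted_rand_score(reference, clustering))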


Basic Criteria For The Selection Of Clustering Techniques (7)

  • Specific user requirements may also influence a selection decision.

    • For example, a scientist may be interested in observing direct relationships between classes and subclasses in a data partition. In this case, a hierarchical clustering approach may represent a basic solution.

  • But in some studies hierarchical clustering results could be difficult to visualise because of the number of samples and features involved. Thus, for instance, a SOM may be considered to guide an exploratory analysis of the data.


Basic Criteria For The Selection Of Clustering Techniques (8)

  • In general the application of two or more clustering techniques may provide the basis for the synthesis of accurate and reliable results.

  • A scientist may be more confident about the clustering experiments if very similar results are obtained by using different techniques.


Clustering approaches to classification: Key experimental factors

  • Type of clustering algorithm

  • Number of experiments (partitions)

  • Number of clustering (learning) cycles in an algorithm

  • In knowledge discovery (KD) applications the number of classes may not be known a priori

  • The number of clusters in each experiment


The problem of assessing cluster validity and evaluation

[Figure: several alternative partitions of the same data set: Partition 1 (2 clusters), Partition 2 (3 clusters), Partition 3 (4 clusters), …, Partition n (3 clusters).]


Clustering and cluster validity assessment

Data on a 2D space


Cluster validity assessment

[Figure: two methods (Method 1, Method 2) producing partitions with c = 2 and c = 4 clusters from the same data.]


Cluster validity assessment – Key questions

  • Is this a relevant partition?

  • Should we analyse these clusters?

  • Is there a better partition?

  • Which clustering method should we apply?

  • Is this relevant information from a biological point of view?


Cluster validity assessment

  • Quality indices

  • Maximize or minimize indices

  • Quality factors: compactness, heterogeneity, isolation, shape…

  • The best or correct partition is the one that maximizes or minimizes an index
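As a small illustration of this idea, the sketch below scans candidate numbers of clusters and scores each partition with the Davies-Bouldin index (introduced later in these slides). The toy data, the use of k-means and the range of c are assumptions made for the example.

# Selecting the number of clusters by minimising a quality index (Davies-Bouldin).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 4)) for c in (0, 3, 6)])  # toy data, 3 groups

scores = {}
for c in range(2, 7):
    labels = KMeans(n_clusters=c, n_init=10, random_state=0).fit_predict(X)
    scores[c] = davies_bouldin_score(X, labels)   # smaller is better

best_c = min(scores, key=scores.get)
print(scores, "best c:", best_c)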


Cluster validity assessment based on a quality index I

Partition 1

(2 clusters)

Partition 2

(3 clusters)

Partition 3

(4 clusters)

P1: I1 = 0.1

P2: I2 = 0.9

P3: I3 = 0.5

P2 is the best/correct partition

Analyze and interpret P2


Cluster Validity assessment

δ(Xi, Xj): inter-cluster distance; Δ(Xk): intra-cluster distance

[Figure: two clusters, 1 and 2, with the inter-cluster distance d(X1, X2) and the intra-cluster distance D(X1) indicated.]

Cluster Validity assessment

There are different ways to calculate δ(Xi, Xj) and Δ(Xk).

[Figure: clusters 1 and 2 with their mean vectors marked.]


Cluster validity assessment (I) - the Dunn index

  • This index aims at identifying sets of clusters that are compact and well separated.

  • For any partition of X into clusters X1, …, Xi, …, Xc, Dunn's validation index, V, is defined in terms of:

  • δ(Xi, Xj): the inter-cluster distance between clusters Xi and Xj; Δ(Xk): the intra-cluster distance of cluster Xk; c: the number of clusters of partition U.

  • Large values of V correspond to good clusters.

  • The number of clusters that maximises V is taken as the optimal number of clusters, c.
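The formula itself did not survive the transcript; in its standard form (reconstructed here, not copied from the slide) the index is:

V(U) = \min_{1 \le i \le c} \; \min_{\substack{1 \le j \le c \\ j \ne i}} \left\{ \frac{\delta(X_i, X_j)}{\max_{1 \le k \le c} \Delta(X_k)} \right\}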


Cluster validity assessment (II) – the Davies-Bouldin index

  • Small values of DB correspond to good clusters (the clusters are compact and their centres are far away from each other).

  • The cluster configuration that minimizes DB is taken as the optimal number of clusters, c.
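The defining formula is likewise missing from the transcript; the standard Davies-Bouldin definition (a reconstruction, with Δ(Xi) the intra-cluster distance of cluster i and δ(Xi, Xj) the distance between clusters i and j) is:

DB = \frac{1}{c} \sum_{i=1}^{c} \max_{j \ne i} \left\{ \frac{\Delta(X_i) + \Delta(X_j)}{\delta(X_i, X_j)} \right\}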


Cluster Validity assessment: Obtaining the partitions

Leukemia data: 2 clusters

[Dendrogram: the samples split into two clusters (A, B), corresponding to AML and ALL.]


Cluster Validity assessment: Obtaining the partitions

Leukemia data: 4 clusters

[Dendrogram: the samples split into four clusters (A, B, C, D), covering the AML, T-ALL and B-ALL subtypes.]


Cluster Validity assessment: Davies-Bouldin indexes - leukemia data

  • DB11 : using inter-cluster distance 1 and intra-cluster distance 1 (complete diameter)

  • DB21 : using inter-cluster distance 2 and intra-cluster distance 1

Bold entries represent the optimal number of clusters, c, predicted by each index.


Cluster validity assessment: Dunn's indexes - leukemia data

Bold entries represent the optimal number of clusters, c, predicted by each index.




Cluster validity assessment

  • Different intercluster/intracluster distance combinations may produce validation indices of different scale ranges.

  • Indices with higher values may have a stronger effect on the calculation of the average index values.

  • This may result in a biased prediction of the optimal number of clusters.


Cluster validity assessment

An approach for the prediction of the optimal partition is the implementation of an aggregation method based on a weighted voting strategy.

Leukaemia data and Davies-Bouldin validation index:


Cluster validity assessment

Effect of the distance metric on the prediction process

Dunn’s validity indexes and leukemia data


Cluster validity assessment

Effect of the distance metric on the prediction process

Dunn’s validity indexes and DLBCL data


Visualisation - Dendrograms

  • Dendrograms are often used to visualize the nested sequence of clusters resulting from hierarchical clustering.

  • While dendrograms are quite appealing because of their apparent ease of interpretation, they can be misleading.

    • First, the dendrogram corresponding to a given hierarchical clustering is not unique, since for each merge one needs to specify which subtree should go on the left and which on the right: for n items there are 2^(n-1) different dendrograms.

    • Second, a perhaps less recognised shortcoming of dendrograms is that they impose structure on the data instead of revealing structure in the data.


Dendrograms

  • Such a representation will be valid only to the extent that the pairwise distances possess the hierarchical structure imposed by the clustering algorithm.
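One standard check of this (not named on the slide, so offered only as a suggestion) is the cophenetic correlation coefficient, which compares the original pairwise distances with the distances implied by the dendrogram; a minimal SciPy sketch with toy data:

# Checking how faithfully a dendrogram represents the pairwise distances:
# the cophenetic correlation coefficient (values closer to 1 = more faithful).
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 4))          # toy data; linkage choice below is an assumption

d = pdist(X)                          # original pairwise distances
Z = linkage(d, method="average")
c, coph_dists = cophenet(Z, d)        # correlation and cophenetic distances
print("cophenetic correlation:", c)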


Dendrograms - example

  • Genes correspond to the rows, and the time points of each experiment are the columns.

  • The ratio of expression is color coded:

    • Red: upregulated

    • Green: downregulated

    • Black: no change

    • Grey: missing data


Visualisation - maps

  • The 828 genes were grouped into 30 clusters.

  • Each cluster is represented by the centroid for genes in the clusters.

  • Expression levels are shown on y-axis and time points on x-axis


Key problems, challenges, recent advances

Azuaje F, “Clustering-based approaches to discovering and visualizing expression patterns”, Briefings in Bioinformatics, 4 (1), pp. 31- 42, 2003.

(course material)


Conclusions

  • Clustering is a basic computational approach to pattern recognition and classification in expression studies

  • Several methods are available; it is fundamental to understand the biological problem and the statistical requirements

  • There is the need for systematic evaluation and validation frameworks to guide humans and computers to reach their classification goals

  • These techniques may support knowledge discovery processes in complex domains such as the molecular classification of cancers


Acknowledgement

Dr. Haiying Wang

School of Computing and Mathematics

Faculty of Engineering

University of Ulster

