Loading in 2 Seconds...

Download Presentation

Approaches to clustering-based analysis and validation

Loading in 2 Seconds...

- 359 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about 'Approaches to clustering-based analysis and validation' - Audrey

Download Now**An Image/Link below is provided (as is) to download presentation**

Download Now

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

### Approaches to clustering-based analysis and validation

Dr. Huiru Zheng Dr. Francisco Azuaje

School of Computing and Mathematics

Faculty of Engineering

University of Ulster

Gene Expression Data

Genes vs perturbations

Tissues vs genes

Expression matrices illustrating two distinct genetic profiles.

genes (G), biological conditions (C), types of tissue (T).

Cluster A

Cluster B

Clustering- At the end of an unsupervised recognition process (learning), we obtain a number of classes or CLUSTERS;
- Each cluster groups a number of cases/input vectors;
- When new input vectors are presented to the system, they are categorised into one of the existing classes or CLUSTERS

x

Clustering approaches toclassification: Traditional approach

a1 a2 a3 a4 a5

s1 x y z y y

s2 x x x y y

s3 y y x y y

s4 y y y y y

s5 y x x y y

C1: {s3, s4}

C2: {s1, s2,s5}

Exclusive clusters

Clustering approaches to classification: Direct sample/attribute correlation

a1 a2 a3 a4 a5

s1 x y z y y

s2 x x x y y

s3 y y x y y

s4 y y y y y

s5 y x x y y

C1: {s3, s4}, {a1, a2}

C2: {s1, s2,s5 }, {a4, a5}

Exclusive biclusters

Clustering approaches to classification:Multiple cluster membership

a1 a2 a3 a4 a5

s1 x y z y y

s2 x x x y y

s3 y y x y y

s4 y y y y y

s5 y x x y y

C1: { s3, s4}, {a1, a2}

C2: {s1, s2,s3,s5}, {a4, a5}

Key Algorithms

- Hierarchical Clustering
- K-means
- Kohonen Maps
- Self-adaptive methods
- ……

Hierarchical Clustering(1)

- Organizes the data into larger groups, which contain smaller groups, like a tree or dendrogram.
- They avoid specifying how many clusters are appropriate by providing a partition for each K. The partitions are obtained from cutting the tree at different levels.
- The tree can be built in two distinct ways
- bottom-up: agglomerative clustering;
- top-down: divisive clustering.
- Algorithms:
- Agglomerative (Single-linkage, complete-linkage, average-linkage) ….

Hierarchical Clustering (3)

- P = set of genes
- While more than one subtree in P
- Pick the most similar pair i, j in P
- Define a new subtree k joining i and j
- Remove i and j from P and insert k

Figures of Hierarchical Clustering

1 2 3 4 5

Hierarchical Clustering

Partitional clustering

- To create one set of clusters that partitions the data into similar groups.
- Algorithms: Forgy’s, k-means, Isodata…

K-means Clustering

- A value for k is selected up front; # of expected cluster
- The algorithm divides the data into k many clusters in such a way that the profiles within each cluster are more similar than those across clusters.

K-mean approach

- One more input k is required. There are many variants of k-mean.
- Sum-of squares criterion
- minimize

An example of k-mean approach

- Two passes
- Begin with k clusters, each consisting of one of the first k samples. For the remaining n-k samples, find the centroid nearest it. After each sample is assigned, re-compute the centroid of the altered cluster.
- For each sample, find the centroid nearest it. Put the sample in the cluster identified with this nearest centroid. ( do not need to re-compute.)

Kohonen Self-Organising Maps (SOM)

- The aim of Kohonen learning is to map similar signals/input-vectors/cases to similar neurone positions;
- Neurones or nodes that are physically adjacent in the network encode patterns or inputs that are similar

SOM: architecture (1)

neurone i

Kohonen layer

wi

Winning neurone

Input vector X

X=[x1,x2,…xn] Rn

wi=[wi1,wi2,…,win] Rn

SOM: architecture (2)

A rectangular grid of neurones representing a Kohonen map. Lines are used to link neighbour neurons.

SOM: architecture (3)

2-dimensional representation of random weight vectors.The lines are drawn to connect neurones which are physically adjacent.

SOM: architecture (4)

2-dimensional representation of 6 input vectors (a training data set)

SOM: architecture (5)

In a well trained (ordered) network the diagram in the weight space should have the same topology as that in physical space and will reflect the properties of the training data set.

SOM-based Clustering (1)

- Type of input

a) The input a neural network can process is a vector of fixed length.

b) This means that only numbers can be used as input and that one must setup the network in such a way that the longest input vector can be processed. This also means that to all vectors with less elements, elements must be added until they have the same size as the longest vector.

SOM-based Clustering (2)

- Classification of inputs

a) In a Kohonen network, each neurone is represented by a so-called weight vector;

b) During training these vectors are adjusted to match the input vectors in such a way that after training each of the weight vectors represents a certain class of input vectors;

c) If in the test phase a vector is presented as input, the weight vector which represents the class this input vector belongs to, is given as output, i.e. the neurone is activated.

SOM-based Clustering (3)

- Learning (training) behaviour.

a) During training (learning) the neurones of a Kohonen network are adjusted in such a way, that on the map there will form regions which consist of neurones with similar weight vectors.

b) This means that in a well-trained map, a class will not be represented by one single neurone, but by a group of neurons.

c) In this group there is one central neurone which can be said to represent the most prototypical member of this class, while the surrounding neurons represent less prototypical members.

SOM: Learning Alogorithm

SOMs define a mapping from a m-dimensional input data space onto a one- or two-dimensional array of nodes;

Algorithm:

1. initialize the network with n nodes;

2. select one case from the set of training cases;

3. find the node in the network that is closest (according to some measure of distance) to the selected case;

4. adjust the set of weights of the closest node and of the nodes around it;

5. repeat from 2. until some termination criterion is reached.

SOM :One single learning cycle (1)

1) The weights are initialised to random values (between the interval -0.1 to 0.1, for instance) and the neighbourhood sizes set to cover over half of the network;

2) a m-dimensional input vector Xs (scaled between -1 and +1, for instance) enters the network;

3) The distances di(Wi, Xs) between all the weight vectors on the SOM and Xs are calculated by using (for instance):

where:

Wi denotes the ith weight vector;

wj and xj represent the jth elements of Wi and Xi respectively

SOM:One single learning cycle (2)

4) Find the best matching neurone or “winning” neurone whose weight vector Wk is closest to the current input vector Xi ;

5) Modify the weights of the winning neurone and all the neurones in the neighbourhood Nk by applying:

Wjnew = Wjold + (Xi - Wjold)

Where represents the learning rate;

6) Next input vectorX(i+1), the process is repeated.

SOM: Learning Parameters

- If a data set consists of P input vectors or cases, then 1 learning epoch is equal to P single learning cycles
- After a number of N learning epochs, the size of the neighbourhood is decreased.
- After a number of M learning epochs, the learning rate, , may be decreased;

SOM: Neighbourhood Schemes(3)

Why do we have to modify the size of neighbourhood ?

- We need to induce map formation by adapting regions according to the similarity between weights and input vectors;
- We need to ensure that neighbourhoods are adjacent;
- Thus, a neighbourhood will represent a number of similar clusters or neurones;
- By starting with a large neighbourhood we guarantee that a GLOBAL ordering takes place, otherwise there may be more than one region on the map encoding a given part of the input space.

SOM: Neighbourhood Schemes(4)

- One good strategy is to gradually reduce the size of the neighbourhood for each neurone to zero over a first part of the learning phase, during the formation of the map topography;
- and then continue to modify only the weight vectors of the winning neurones to pick up the fine details of the input space

SOM: Learning rate

Why do we need to decrease the learning rate ?

- If the learning rate is kept constant, it is possible for weight vectors to oscillate back and forth between two nearby positions;
- Lowering ensures that this does not occur and the network is stable.

Visualising data and clusters with Kohonen maps

U-matrix and median distance matrix maps for leukaemia data

The U-matrix holds distances between neighbouring map units

Basic Criteria For The Selection Of Clustering Techniques (1)

- Which clustering algorithm should I use?
- Should I apply an alternative solution?
- How can results be improved by using different methods?

Basic Criteria For The Selection Of Clustering Techniques (2)

- There are multiple clustering techniques that can be used to analyse expression data.
- Choosing “the best” algorithm for a particular problem may represent a challenging task.
- Advantages and limitations may depend on factors such as the statistical nature of the data, pre-processing procedures, number of features etc.
- It is not uncommon to observe inconsistent results when different clustering methods are tested on a particular data set

Basic Criteria For The Selection Of Clustering Techniques (3)

- In order to make an appropriate choice, it is important to have a good understanding of:
- the problem domain under study, and
- the clustering options available.

Basic Criteria For The Selection Of Clustering Techniques (4)

- Knowledge on the underlying biological problem may allow a scientist to choose a tool that satisfies certain requirements, such as the capacity to detect overlapping classes.
- Knowledge on the mathematical properties of a clustering technique may support the selection process.
- How does this algorithm represent similarity (or dissimilarity)?,
- How much relevance does it assign to cluster heterogeneity?,
- How does it implement the process of measuring cluster isolation?.
- Answers to these questions may indicate crucial directions for the selection of an adequate clustering algorithm.

Basic Criteria For The Selection Of Clustering Techniques (5)

- Empirical studies have defined several mathematical criteria ofacceptability
- For example, there may be clustering algorithms that are capable of guaranteeing the generation of partitions whose cluster structures do not intersect.
- Several algorithms indirectly assume that the cluster structure of the data under consideration exhibits particular characteristics.
- For instance, the k-means algorithm assumes that the shape of the clusters is spherical; and single-linkage hierarchical clustering assumes that the clusters are well separated

Basic Criteria For The Selection Of Clustering Techniques (6)

- Unfortunately, this type of knowledge may not always be available in an expression data study.
- In this situation a solution may be to test a number of techniques on related data sets, which have previously been classified (a reference data set).
- Thus, a user may choose a clustering method if it produced consistent categorisation results in relation to such reference data set.

Basic Criteria For The Selection Of Clustering Techniques (7)

- Specific user requirements may also influence a selection decision.
- For example, a scientist may be interested in observing direct relationships between classes and subclasses in a data partition. In this case, a hierarchical clustering approach may represent a basic solution.
- But in some studies hierarchical clustering results could be difficult to visualise because of the number of samples and features involved. Thus, for instance, a SOM may be considered to guide an exploratory analysis of the data.

Basic Criteria For The Selection Of Clustering Techniques (8)

- In general the application of two or more clustering techniques may provide the basis for the synthesis of accurate and reliable results.
- A scientist may be more confident about the clustering experiments if very similar results are obtained by using different techniques.

Clustering approaches to classification: Key experimental factors

- Type of clustering algorithm
- Number of experiments (partitions)
- Number of clustering (learning) cycles in an algorithm
- In KD applications the number of classes may not be known a priori
- The number of clusters in each experiment

The problem of assessing cluster validity and evaluation

Partition 1 (2 clusters)

Partition 2 (3 clusters)

Partition 3

(4 clusters)

Partition n

( 3 clusters)

Clustering and cluster validity assessment

Data on a 2D space

Cluster validity assessment – Key questions:

- Is this a relevant partition?
- Should we analyse these clusters?
- Is there a better partition?
- Which clustering method should we apply ?
- Is this relevant information from a biological point of view ?

Cluster validity assessment

- Quality indices
- Maximize or minimize indices
- Quality factors:compactness, heterogeneity, isolation, shape…
- The best or correct partition is the one that maximize or minimize an index

Cluster validity assessment based on a quality index I

Partition 1

(2 clusters)

Partition 2

(3 clusters)

Partition 3

(4 clusters)

P1: I1 = 0.1

P2: I2 = 0.9

P3: I3 = 0.5

P2 is the best/correct partition

Analyze and interpret P2

Cluster Validity assessment

(Xi, Xj): intercluster distance – (Xk) intracluster distance

d

(

X

,

X

)

1

2

D

(

X

)

1

Cluster validity assessment (I) - The Dunn’s index

- This index aims at identifying sets of clusters that are compact and well separated
- For any partition UX: X1... Xi… X, the Dunn‘s validation index is defined as:

- (Xi, Xj): intercluster distance between clusters Xi and Xj (); (Xk) intracluster distance of cluster Xk; c: number of clusters of partition U
- Large values of V correspond to good clusters
- The number of clusters that maximises V is taken as the optimal number of clusters, c.

Cluster validity assessment (II) – the Davies-Bouldin index

- Small values of DB correspond to good clusters (the clusters are compact and their centres are far away from each other)
- The cluster configuration that minimizes DB is taken as the optimall number of clusters, c.

Cluster Validity assessment : Obtaining the partitions

Leukemia data: 4 clusters

A B C D

AML T-ALL B-ALL

Cluster Validity assessment : Davies-Bouldin indexes - leukemia data

- DB11 : using inter-cluster distance 1 and intra-cluster distance 1 (complete diameter)
- DB21 : using inter-cluster distance 2 and intra-cluster distance 1

Bold entries represent the optimal number of clusters, c, predicted by each index.

Cluster validity assessment: Dunn’sindexes - leukemia data

Bold entries represent the optimal number of clusters, c, predicted by each index.

Cluster validity assessment : Aggregation of Dunn’sindices - leukemia data

Cluster validity assessment : Aggregation of Davies-Bouldinindexes - leukemia data

Cluster validity assessment

- Different intercluster/intracluster distance combinations may produce validation indices of different scale ranges.
- Indices with higher values may have a stronger effect on the calculation of the average index values.
- This may result in a biased prediction of the optimal number of clusters.

Cluster validity assessment

An approach for the prediction of the optimal partition is the implementation of an aggregation method based on a weighed voting strategy.

Leukaemia data and Davies-Bouldin validation index:

Cluster validity assessment

Effect of the distance metric on the prediction process

Dunn’s validity indexes and leukemia data

Cluster validity assessment -

Effect of the distance metric on the prediction process

Dunn’s validity indexes and DLBCL data

Visualisation - Dendrograms

- Dendrograms are often used to visualize the nested sequence of clusters resulting from hierarchical clustering.
- While dendrograms are quite appealing because of their apparent ease of interpretation, they can be misleading.
- first, the dendrogram corresponding to a given hierarchical clustering is not unique, since for each merge one needs to specify which subtree should go on the left and which on the right – there are 2^(n-1) different dendrograms.
- Second, and perhaps less recognized shortcoming of dendrograms, is that they impose structure on the data, instead of revealing structure in these data.

Dendrograms

- Such a representation will be valid only to the extent that the pairwise distances possess the hierarchical structure imposed by the clustering algorithm.

Dendrograms - example

- Genes correspond to the rows, and the time points of each experiment are the columns.
- The ratio of expression is color coded:
- Red: upregulated
- Green: downregulated
- Black: no change
- Grey: missing data

Visualisation - maps

- The 828 genes were grouped into 30 clusters.
- Each cluster is represented by the centroid for genes in the clusters.
- Expression levels are shown on y-axis and time points on x-axis

Key problems, challenges, recent advances

Azuaje F, “Clustering-based approaches to discovering and visualizing expression patterns”, Briefings in Bioinformatics, 4 (1), pp. 31- 42, 2003.

(course material)

Conclusions

- Clustering is a basic computational approach to pattern recognition and classification in expression studies
- Several methods are available, it is fundamental to understand the biological problem and statistical requirements
- There is the need for systematic evaluation and validation frameworks to guide humans and computers to reach their classification goals
- These techniques may support knowledge discovery processes in complex domains such as the molecular classification of cancers

Acknowledgement

Dr. Haiying Wang

School of Computing and Mathematics

Faculty of Engineering

University of Ulster

Download Presentation

Connecting to Server..