
Approaches to clustering-based analysis and validation

Dr. Huiru Zheng, Dr. Francisco Azuaje

School of Computing and Mathematics

Faculty of Engineering

University of Ulster

[Figure: two expression matrices illustrating two distinct genetic profiles: genes vs perturbations, and tissues vs genes; legend: genes (G), biological conditions (C), types of tissue (T)]

[Figure: input vectors plotted in a two-dimensional (x, y) feature space, grouped into Cluster A and Cluster B]

- At the end of an unsupervised recognition process (learning), we obtain a number of classes or CLUSTERS;
- Each cluster groups a number of cases/input vectors;
- When new input vectors are presented to the system, they are categorised into one of the existing classes or CLUSTERS

     a1  a2  a3  a4  a5
s1   x   y   z   y   y
s2   x   x   x   y   y
s3   y   y   x   y   y
s4   y   y   y   y   y
s5   y   x   x   y   y

C1: {s3, s4}

C2: {s1, s2, s5}

Exclusive clusters

     a1  a2  a3  a4  a5
s1   x   y   z   y   y
s2   x   x   x   y   y
s3   y   y   x   y   y
s4   y   y   y   y   y
s5   y   x   x   y   y

C1: {s3, s4}, {a1, a2}

C2: {s1, s2, s5}, {a4, a5}

Exclusive biclusters

     a1  a2  a3  a4  a5
s1   x   y   z   y   y
s2   x   x   x   y   y
s3   y   y   x   y   y
s4   y   y   y   y   y
s5   y   x   x   y   y

C1: {s3, s4}, {a1, a2}

C2: {s1, s2, s3, s5}, {a4, a5}

Overlapping biclusters (sample s3 belongs to both clusters)

- Hierarchical Clustering
- K-means
- Kohonen Maps
- Self-adaptive methods
- ……

- Organizes the data into larger groups, which contain smaller groups, like a tree or dendrogram.
- They avoid specifying how many clusters are appropriate by providing a partition for each K. The partitions are obtained from cutting the tree at different levels.
- The tree can be built in two distinct ways
- bottom-up: agglomerative clustering;
- top-down: divisive clustering.

- Algorithms:
- Agglomerative (Single-linkage, complete-linkage, average-linkage) ….

[Dendrogram: genes along one axis, degrees of dissimilarity along the other]

- P = set of genes
- While more than one subtree in P
- Pick the most similar pair i, j in P
- Define a new subtree k joining i and j
- Remove i and j from P and insert k
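The loop above can be sketched in Python. This is a minimal single-linkage version for scalar items; the function name, the nested-tuple output and the distance argument are illustrative assumptions, not part of the slides:

```python
def single_linkage(points, dist):
    """Bottom-up (agglomerative) clustering following the loop above:
    repeatedly join the closest pair of subtrees until one tree remains.
    Returns the final tree as nested tuples."""
    def leaves(t):
        # flatten a subtree back into its original items
        return [x for c in t for x in leaves(c)] if isinstance(t, tuple) else [t]

    def d(t1, t2):
        # single linkage: distance between the closest members of two subtrees
        return min(dist(a, b) for a in leaves(t1) for b in leaves(t2))

    P = list(points)                      # each item starts as its own subtree
    while P and len(P) > 1:               # while more than one subtree in P
        # pick the most similar (closest) pair i, j in P
        i, j = min(((i, j) for i in range(len(P)) for j in range(i + 1, len(P))),
                   key=lambda ij: d(P[ij[0]], P[ij[1]]))
        k = (P[i], P[j])                  # define a new subtree k joining i and j
        # remove i and j from P and insert k
        P = [t for n, t in enumerate(P) if n not in (i, j)] + [k]
    return P[0]
```

With items 0, 1 and 10 and absolute difference as the distance, 0 and 1 are joined first, and the resulting subtree is then joined with 10.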

[Figure: an example of hierarchical clustering; items 1 to 5 are merged stepwise into subtrees 1', 2' and 3', building the tree bottom-up]

- To create one set of clusters that partitions the data into similar groups.
- Algorithms: Forgy’s, k-means, Isodata…

- A value for k, the number of expected clusters, is selected up front
- The algorithm divides the data into k many clusters in such a way that the profiles within each cluster are more similar than those across clusters.

- One more input, k, is required. There are many variants of k-means.
- Sum-of-squares criterion:
- minimize Σi Σ(x in Ci) ||x - mi||², where mi is the centroid (mean vector) of cluster Ci

- Two passes
- Begin with k clusters, each consisting of one of the first k samples. For each of the remaining n-k samples, find the nearest centroid and assign the sample to it. After each sample is assigned, re-compute the centroid of the altered cluster.
- For each sample, find the centroid nearest it and put the sample in the cluster identified with this nearest centroid (centroids are not re-computed in this pass).
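The two passes can be sketched as follows; this is a minimal one-dimensional illustration, and the function name and sample values are assumptions rather than the authors' code:

```python
def two_pass_kmeans(samples, k):
    """Two-pass clustering as described above (a Forgy-style variant).
    Pass 1: seed k clusters with the first k samples; assign each remaining
    sample to its nearest centroid, recomputing that centroid after each
    assignment.  Pass 2: reassign every sample to its nearest (now fixed)
    centroid, without recomputing."""
    clusters = [[s] for s in samples[:k]]
    centroids = [float(s) for s in samples[:k]]

    # Pass 1: incremental assignment with centroid updates
    for s in samples[k:]:
        i = min(range(k), key=lambda c: abs(s - centroids[c]))
        clusters[i].append(s)
        centroids[i] = sum(clusters[i]) / len(clusters[i])

    # Pass 2: final assignment against the frozen centroids
    final = [[] for _ in range(k)]
    for s in samples:
        i = min(range(k), key=lambda c: abs(s - centroids[c]))
        final[i].append(s)
    return final, centroids
```

For example, `two_pass_kmeans([0, 10, 1, 2, 11, 12], 2)` groups the low and high values into separate clusters.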

- The aim of Kohonen learning is to map similar signals/input-vectors/cases to similar neurone positions;
- Neurones or nodes that are physically adjacent in the network encode patterns or inputs that are similar

neurone i

Kohonen layer

wi

Winning neurone

Input vector X

X = [x1, x2, …, xn] ∈ Rn

wi = [wi1, wi2, …, win] ∈ Rn

A rectangular grid of neurones representing a Kohonen map. Lines are used to link neighbour neurons.

2-dimensional representation of random weight vectors.The lines are drawn to connect neurones which are physically adjacent.

2-dimensional representation of 6 input vectors (a training data set)

In a well trained (ordered) network the diagram in the weight space should have the same topology as that in physical space and will reflect the properties of the training data set.

Input space (training data set)

Weight vector representations after training

- Type of input
a) The input a neural network can process is a vector of fixed length.

b) This means that only numbers can be used as input and that the network must be set up so that the longest input vector can be processed. It also means that vectors with fewer elements must be padded with extra elements until they match the length of the longest vector.

- Classification of inputs
a) In a Kohonen network, each neurone is represented by a so-called weight vector;

b) During training these vectors are adjusted to match the input vectors in such a way that after training each of the weight vectors represents a certain class of input vectors;

c) If in the test phase a vector is presented as input, the weight vector which represents the class this input vector belongs to, is given as output, i.e. the neurone is activated.

- Learning (training) behaviour.
a) During training (learning) the neurones of a Kohonen network are adjusted in such a way, that on the map there will form regions which consist of neurones with similar weight vectors.

b) This means that in a well-trained map, a class will not be represented by one single neurone, but by a group of neurons.

c) In this group there is one central neurone which can be said to represent the most prototypical member of this class, while the surrounding neurons represent less prototypical members.

SOMs define a mapping from an m-dimensional input data space onto a one- or two-dimensional array of nodes;

Algorithm:

1. initialize the network with n nodes;

2. select one case from the set of training cases;

3. find the node in the network that is closest (according to some measure of distance) to the selected case;

4. adjust the set of weights of the closest node and of the nodes around it;

5. repeat from 2. until some termination criterion is reached.

1) The weights are initialised to random values (in the interval -0.1 to +0.1, for instance) and the neighbourhood sizes are set to cover over half of the network;

2) an m-dimensional input vector Xs (scaled between -1 and +1, for instance) enters the network;

3) The distances di(Wi, Xs) between all the weight vectors on the SOM and Xs are calculated by using (for instance) the squared Euclidean distance:

di(Wi, Xs) = Σj (wj - xj)²

where:

Wi denotes the ith weight vector;

wj and xj represent the jth elements of Wi and Xs respectively

4) Find the best matching or “winning” neurone, whose weight vector Wk is closest to the current input vector Xs;

5) Modify the weights of the winning neurone and of all the neurones in the neighbourhood Nk by applying:

Wj(new) = Wj(old) + α (Xs - Wj(old))

where α represents the learning rate;

6) For the next input vector X(s+1), the process is repeated.

- If a data set consists of P input vectors or cases, then 1 learning epoch is equal to P single learning cycles
- After a number of N learning epochs, the size of the neighbourhood is decreased.
- After a number of M learning epochs, the learning rate, α, may be decreased;
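The training procedure above (random initialisation, winner search, neighbourhood update, shrinking radius and learning rate) can be sketched for a one-dimensional map. Everything specific here, such as the node count, the decay schedules and the random seed, is an illustrative assumption:

```python
import random

def train_som(data, n_nodes, epochs=20):
    """Minimal 1-D SOM training sketch following the steps above.
    data: list of input vectors scaled to [-1, 1]."""
    m = len(data[0])
    rng = random.Random(0)
    # weights initialised to random values in [-0.1, 0.1]
    W = [[rng.uniform(-0.1, 0.1) for _ in range(m)] for _ in range(n_nodes)]

    for epoch in range(epochs):
        # neighbourhood size and learning rate both shrink over the epochs
        radius = (n_nodes // 2) * (epochs - epoch) // epochs
        alpha = 0.5 * (epochs - epoch) / epochs
        for x in data:
            # squared Euclidean distance of the input to every weight vector
            d = [sum((wj - xj) ** 2 for wj, xj in zip(w, x)) for w in W]
            k = d.index(min(d))  # winning neurone
            # update the winner and all neurones in its neighbourhood Nk
            for j in range(max(0, k - radius), min(n_nodes, k + radius + 1)):
                W[j] = [wj + alpha * (xj - wj) for wj, xj in zip(W[j], x)]
    return W
```

After training on two well-separated inputs, the two inputs activate different winning neurones, which is the mapping property described above.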

[Figure: linear and rectangular neighbourhood topologies, each showing the first and second neighbourhoods around a winning neurone]

Why do we have to modify the size of neighbourhood ?

- We need to induce map formation by adapting regions according to the similarity between weights and input vectors;
- We need to ensure that neighbourhoods are adjacent;
- Thus, a neighbourhood will represent a number of similar clusters or neurones;
- By starting with a large neighbourhood we guarantee that a GLOBAL ordering takes place, otherwise there may be more than one region on the map encoding a given part of the input space.

- One good strategy is to gradually reduce the size of the neighbourhood for each neurone to zero over a first part of the learning phase, during the formation of the map topography;
- and then continue to modify only the weight vectors of the winning neurones to pick up the fine details of the input space

Why do we need to decrease the learning rate ?

- If the learning rate is kept constant, it is possible for weight vectors to oscillate back and forth between two nearby positions;
- Lowering α ensures that this does not occur and that the network is stable.

U-matrix and median distance matrix maps for leukaemia data

The U-matrix holds distances between neighbouring map units
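On a one-dimensional map the U-matrix reduces to a vector of distances between consecutive units; a minimal sketch (the helper name is an assumption):

```python
def u_matrix_1d(weights):
    """U-matrix for a one-dimensional SOM: the Euclidean distance between
    each pair of neighbouring map units.  Large entries mark borders
    between groups of similar units, i.e. cluster boundaries."""
    def d(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return [d(weights[i], weights[i + 1]) for i in range(len(weights) - 1)]
```

A large value between two runs of near-identical units, as in `u_matrix_1d([[0, 0], [0, 0], [3, 4], [3, 4]])`, marks the boundary between two clusters on the map.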

- Which clustering algorithm should I use?
- Should I apply an alternative solution?
- How can results be improved by using different methods?

- There are multiple clustering techniques that can be used to analyse expression data.
- Choosing “the best” algorithm for a particular problem may represent a challenging task.
- Advantages and limitations may depend on factors such as the statistical nature of the data, pre-processing procedures, number of features etc.
- It is not uncommon to observe inconsistent results when different clustering methods are tested on a particular data set

- In order to make an appropriate choice, it is important to have a good understanding of:
- the problem domain under study, and
- the clustering options available.

- Knowledge on the underlying biological problem may allow a scientist to choose a tool that satisfies certain requirements, such as the capacity to detect overlapping classes.
- Knowledge on the mathematical properties of a clustering technique may support the selection process.
- How does this algorithm represent similarity (or dissimilarity)?
- How much relevance does it assign to cluster heterogeneity?
- How does it implement the process of measuring cluster isolation?

- Answers to these questions may indicate crucial directions for the selection of an adequate clustering algorithm.

- Empirical studies have defined several mathematical criteria of acceptability
- For example, there may be clustering algorithms that are capable of guaranteeing the generation of partitions whose cluster structures do not intersect.

- Several algorithms indirectly assume that the cluster structure of the data under consideration exhibits particular characteristics.
- For instance, the k-means algorithm assumes that the shape of the clusters is spherical, and single-linkage hierarchical clustering assumes that the clusters are well separated.

- Unfortunately, this type of knowledge may not always be available in an expression data study.
- In this situation a solution may be to test a number of techniques on related data sets, which have previously been classified (a reference data set).
- Thus, a user may choose a clustering method if it produced consistent categorisation results in relation to such a reference data set.
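Consistency against a pre-classified reference set can be quantified with a pair-counting agreement measure. Below is a plain Rand index sketch; the function name is illustrative, and in practice the adjusted Rand index is commonly preferred:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two labelings agree:
    both place the pair in the same cluster, or both separate it."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)
```

Identical partitions score 1.0 even when the cluster labels themselves differ, which is exactly the invariance needed when comparing a clustering against a reference categorisation.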

- Specific user requirements may also influence a selection decision.
- For example, a scientist may be interested in observing direct relationships between classes and subclasses in a data partition. In this case, a hierarchical clustering approach may represent a basic solution.

- But in some studies hierarchical clustering results could be difficult to visualise because of the number of samples and features involved. Thus, for instance, a SOM may be considered to guide an exploratory analysis of the data.

- In general the application of two or more clustering techniques may provide the basis for the synthesis of accurate and reliable results.
- A scientist may be more confident about the clustering experiments if very similar results are obtained by using different techniques.

- Type of clustering algorithm
- Number of experiments (partitions)
- Number of clustering (learning) cycles in an algorithm
- In knowledge discovery (KD) applications the number of classes may not be known a priori
- The number of clusters in each experiment

[Figure: alternative partitions of the same data: Partition 1 (2 clusters), Partition 2 (3 clusters), Partition 3 (4 clusters), Partition n (3 clusters)]

[Figure: data on a 2D space partitioned into c = 2 and c = 4 clusters by Method 1 and Method 2]

- Is this a relevant partition?
- Should we analyse these clusters?
- Is there a better partition?
- Which clustering method should we apply ?
- Is this relevant information from a biological point of view ?

- Quality indices
- Maximize or minimize indices
- Quality factors: compactness, heterogeneity, isolation, shape…
- The best or correct partition is the one that maximizes or minimizes an index

Partition 1 (2 clusters): I1 = 0.1

Partition 2 (3 clusters): I2 = 0.9

Partition 3 (4 clusters): I3 = 0.5

P2 is the best/correct partition: analyze and interpret P2

δ(Xi, Xj): intercluster distance; Δ(Xk): intracluster distance

[Figure: two clusters X1 and X2, showing the intercluster distance d(X1, X2), the intracluster diameter Δ(X1), and the mean vector of each cluster]

There are different ways to calculate δ(Xi, Xj) and Δ(Xk)

- This index aims at identifying sets of clusters that are compact and well separated
- For any partition U of X into clusters X1, …, Xi, …, Xc, Dunn's validation index is defined as:

V = min(1≤i≤c) { min(1≤j≤c, j≠i) { δ(Xi, Xj) / max(1≤k≤c) Δ(Xk) } }

- δ(Xi, Xj): intercluster distance between clusters Xi and Xj (i ≠ j); Δ(Xk): intracluster distance of cluster Xk; c: number of clusters of partition U
- Large values of V correspond to good clusters
- The number of clusters that maximises V is taken as the optimal number of clusters, c.
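As a concrete sketch, assuming nearest-pair distances between clusters and the complete diameter within clusters (one of several valid distance combinations), Dunn's index can be computed as follows; the function name and data are illustrative:

```python
def dunn_index(clusters, dist):
    """Dunn's validity index V for a partition given as a list of clusters
    (each a list of points).  delta = closest pair across two clusters;
    Delta = farthest pair within a cluster (complete diameter)."""
    def delta(a, b):
        return min(dist(x, y) for x in a for y in b)
    def diameter(c):
        return max(dist(x, y) for x in c for y in c)

    max_diam = max(diameter(c) for c in clusters)
    min_delta = min(delta(a, b)
                    for i, a in enumerate(clusters)
                    for b in clusters[i + 1:])
    return min_delta / max_diam     # large V = compact, well-separated clusters
```

A compact, well-separated partition such as {0, 1} vs {10, 11} scores much higher than a poorly separated one such as {0, 10} vs {1, 11} on the same data.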

- Small values of DB correspond to good clusters (the clusters are compact and their centres are far away from each other)
- The cluster configuration that minimizes DB is taken as the optimal number of clusters, c.
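A sketch of the Davies-Bouldin index for one-dimensional data, assuming the standard formulation (for each cluster, take the worst ratio of summed within-cluster scatters to between-centroid separation, then average over clusters); names and data are illustrative:

```python
def davies_bouldin(clusters):
    """Davies-Bouldin index for 1-D clusters (lists of numbers).
    Scatter = average distance to the cluster centroid;
    separation = distance between centroids.  Smaller is better."""
    cents = [sum(c) / len(c) for c in clusters]
    scatter = [sum(abs(x - m) for x in c) / len(c)
               for c, m in zip(clusters, cents)]
    c = len(clusters)
    # for each cluster, the worst (largest) scatter-to-separation ratio
    return sum(max((scatter[i] + scatter[j]) / abs(cents[i] - cents[j])
                   for j in range(c) if j != i)
               for i in range(c)) / c
```

Consistent with the slide: the compact, well-separated partition {0, 1} vs {10, 11} yields a small DB value, while the mixed partition {0, 10} vs {1, 11} yields a large one.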

Leukemia data: 2 clusters

[Clusters A and B correspond to AML and ALL]

Leukemia data: 4 clusters

[Clusters A, B, C and D correspond to AML, T-ALL and B-ALL]

- DB11 : using inter-cluster distance 1 and intra-cluster distance 1 (complete diameter)
- DB21 : using inter-cluster distance 2 and intra-cluster distance 1

Bold entries represent the optimal number of clusters, c, predicted by each index.


- Different intercluster/intracluster distance combinations may produce validation indices of different scale ranges.
- Indices with higher values may have a stronger effect on the calculation of the average index values.
- This may result in a biased prediction of the optimal number of clusters.

An approach for the prediction of the optimal partition is the implementation of an aggregation method based on a weighted voting strategy.

Leukaemia data and Davies-Bouldin validation index:

Effect of the distance metric on the prediction process

Dunn’s validity indexes and leukemia data

Effect of the distance metric on the prediction process

Dunn’s validity indexes and DLBCL data

- Dendrograms are often used to visualize the nested sequence of clusters resulting from hierarchical clustering.
- While dendrograms are quite appealing because of their apparent ease of interpretation, they can be misleading.
- First, the dendrogram corresponding to a given hierarchical clustering is not unique: for each merge one must specify which subtree goes on the left and which on the right, so there are 2^(n-1) different dendrograms for the same clustering.
- A second, and perhaps less recognized, shortcoming of dendrograms is that they impose structure on the data, instead of revealing structure in these data.

- Such a representation will be valid only to the extent that the pairwise distances possess the hierarchical structure imposed by the clustering algorithm.

- Genes correspond to the rows, and the time points of each experiment are the columns.
- The ratio of expression is color coded:
- Red: upregulated
- Green: downregulated
- Black: no change
- Grey: missing data

- The 828 genes were grouped into 30 clusters.
- Each cluster is represented by the centroid for genes in the clusters.
- Expression levels are shown on the y-axis and time points on the x-axis

Azuaje F, “Clustering-based approaches to discovering and visualizing expression patterns”, Briefings in Bioinformatics, 4 (1), pp. 31- 42, 2003.

(course material)

- Clustering is a basic computational approach to pattern recognition and classification in expression studies
- Several methods are available, it is fundamental to understand the biological problem and statistical requirements
- There is the need for systematic evaluation and validation frameworks to guide humans and computers to reach their classification goals
- These techniques may support knowledge discovery processes in complex domains such as the molecular classification of cancers

Dr. Haiying Wang

School of Computing and Mathematics

Faculty of Engineering

University of Ulster