### Clustering

Petter Mostad

Clustering vs. class prediction

- Class prediction:
- A learning set of objects with known classes
- Goal: put new objects into existing classes
- Also called: Supervised learning, or classification
- Clustering:
- No learning set, no given classes
- Goal: discover the "best" classes or groupings
- Also called: Unsupervised learning, or class discovery

Overview

- General clustering theory
- Steps, methods, algorithms, issues...
- Clustering microarray data
- Recommendations for this kind of data
- Programs for clustering
- Some other visualization techniques

Issues in clustering

- Used to explore and visualize data, with few preconceptions
- Many subjective choices must be made, so a clustering output tends to be subjective
- It is difficult to get truly statistically "significant" conclusions
- Algorithms will always produce clusters, whether any exist in the data or not

Steps in clustering

- Feature selection and extraction
- Defining and computing similarities
- Clustering or grouping objects
- Assessing, presenting, and using the result

1. Feature selection and extraction

- Deciding which measurements matter for similarity
- Data reduction
- Filtering away objects
- Normalization of measurements

The data matrix

- Every row contains the measurements for one object.
- Similarities are computed between all pairs of rows
- If the measurements are all of the same type, one can instead cluster the measurements (the columns)!

[Figure: data matrix with objects as rows and measurements as columns]

2. Defining and computing similarities

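- Similarity measures for continuous data vectors:
- Euclidean distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
- Minkowski distance (including Manhattan metric): $d_p(x, y) = \left(\sum_i |x_i - y_i|^p\right)^{1/p}$, where $p = 1$ gives the Manhattan metric and $p = 2$ the Euclidean distance
- Mahalanobis distance: $d_S(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}$, where $S$ is a covariance matrix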
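Centered and non-centered (absolute) Pearson correlation

- centered: $r(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$
- non-centered: $r_0(x, y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2 \, \sum_i y_i^2}}$

where $\bar{x} = \frac{1}{n}\sum_i x_i$ and $\bar{y} = \frac{1}{n}\sum_i y_i$ are the means of the two vectors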

- Spearman rank correlation
- Compute the ranking of the numbers in each vector
- Find correlation between ranking numbers
- ....
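The correlation-based measures can be illustrated as follows. This is a sketch using the standard textbook definitions; the function names and the handling of ties (average ranks) are assumptions, not something stated in the slides.

```python
import math

def pearson(x, y, centered=True):
    """Centered Pearson correlation, or the non-centered variant if centered=False."""
    if centered:
        mx, my = sum(x) / len(x), sum(y) / len(y)
        x = [a - mx for a in x]
        y = [b - my for b in y]
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den

def ranks(v):
    """Ranks 1..n, with tied values sharing their average rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    # Step 1: rank each vector. Step 2: correlate the ranks.
    return pearson(ranks(x), ranks(y))
```

The centered version ignores a constant offset between the two vectors, while the non-centered version does not; Spearman correlation is 1 for any strictly increasing relationship, linear or not.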

Geometrical view of clustering

- If measurements are coordinates, objects become points in some space
- If the similarity measure is Euclidean distance, the goal is to group nearby points
- Note: When we have only 2 or 3 measurements per object, we can do better than most algorithms using visual inspection

Similarity measures for discrete data

- Comparing two binary vectors, count the numbers a,b,c,d of 1-1’s, 1-0’s, 0-1’s, and 0-0’s, respectively
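- Construct different similarity measures based on these numbers, for example the simple matching coefficient $(a + d)/(a + b + c + d)$ or the Jaccard coefficient $a/(a + b + c)$
- Similarity of, for example, trees or other objects can also be defined in reasonable ways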
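As an illustration of the counting step, the sketch below computes a, b, c, d for two binary vectors and derives two common similarity measures from them. The simple matching and Jaccard coefficients are standard examples chosen here for illustration; the slides do not say which measures the author had in mind.

```python
def binary_counts(u, v):
    """Counts of 1-1, 1-0, 0-1 and 0-0 positions in two binary vectors."""
    pairs = list(zip(u, v))
    a = pairs.count((1, 1))
    b = pairs.count((1, 0))
    c = pairs.count((0, 1))
    d = pairs.count((0, 0))
    return a, b, c, d

def simple_matching(u, v):
    a, b, c, d = binary_counts(u, v)
    return (a + d) / (a + b + c + d)

def jaccard(u, v):
    # Ignores 0-0 matches; undefined (division by zero) if both vectors are all zeros
    a, b, c, _ = binary_counts(u, v)
    return a / (a + b + c)
```

Whether 0-0 agreements should count as similarity depends on the data: for sparse presence/absence data, a measure that ignores d, such as Jaccard, is often preferred.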

Similarities using contexts

- Mutual Neighbour Distance: $MND(x, y) = NN(x, y) + NN(y, x)$

where $NN(x, y)$ is the neighbour number of x with respect to y

- This is not a metric, but similarities do not need to be based on metrics.
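A small sketch of the mutual neighbour distance, assuming the usual definition as the sum of the two neighbour numbers and assuming Euclidean distance to rank the neighbours. The function names are hypothetical, and ties between equally distant neighbours are broken by index order.

```python
import math

def neighbour_number(points, i, j):
    """Rank of point i among the neighbours of point j (1 = nearest neighbour of j)."""
    others = [k for k in range(len(points)) if k != j]
    others.sort(key=lambda k: math.dist(points[j], points[k]))
    return others.index(i) + 1

def mutual_neighbour_distance(points, i, j):
    return neighbour_number(points, i, j) + neighbour_number(points, j, i)
```

The value depends on the other points in the data set (the context), not only on the two points being compared: adding a point between two objects changes their mutual neighbour distance even though their Euclidean distance is unchanged.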

3. Clustering or grouping

- Hierarchical clusterings
- Divisive: Starts with one big cluster and subdivides one cluster in each step
- Agglomerative: Starts with each object in separate cluster. In each step, joins the two closest clusters
- Partitional clusterings
- Probabilistic or fuzzy clusterings

Hierarchical clustering

- Agglomerative clustering depends on the type of linkage, i.e., how to compute the distance between the merged cluster (UV) and an old cluster (W):
- d(UV, W) = min(d(U, W), d(V,W)) (single linkage)
- d(UV, W) = max(d(U,W), d(V,W)) (complete linkage)
- d(UV, W) = average over all distances between objects in (UV) and objects in W (average linkage, or UPGMA: Unweighted Pair Group Method with Arithmetic mean)
- The output is a dendrogram
- A simplification of average linkage is often implemented (“average group linkage”): It may lead to inverted dendrograms!
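The three linkage rules can be compared with a naive agglomerative procedure. This is a sketch for illustration only: it recomputes all pairwise distances at every step (far slower than real implementations), assumes Euclidean distance, and uses hypothetical function names. Taking the min, max, or mean over all object pairs gives the same result as the recursive update formulas above.

```python
import math
from itertools import combinations

def cluster_distance(A, B, points, linkage):
    """Distance between clusters A and B (tuples of point indices)."""
    d = [math.dist(points[i], points[j]) for i in A for j in B]
    if linkage == "single":
        return min(d)
    if linkage == "complete":
        return max(d)
    if linkage == "average":
        return sum(d) / len(d)
    raise ValueError(linkage)

def agglomerate(points, linkage="average"):
    """Return the merges as (cluster_a, cluster_b, height), in the order they happen."""
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Join the two closest clusters under the chosen linkage
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(pair[0], pair[1], points, linkage))
        height = cluster_distance(a, b, points, linkage)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, height))
    return merges
```

The recorded heights are what a dendrogram would draw. For these three linkages the heights never decrease from one merge to the next, which is why they do not produce inverted dendrograms.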

Dendrograms, visualizations

- The data matrix is often visualized using three colors, representing positive, negative, and zero values.
- Hierarchical clustering results are often represented with a dendrogram. The similarity at which two clusters merge should correspond to the height of the corresponding horizontal line in the dendrogram!
- To display the dendrogram, the objects (rows or columns) need to be ordered. Every time two clusters are merged, they can be placed in either of two orders, so the ordering is not unique.

Ward’s hierarchical clustering

- Agglomerative.
- Goal: minimize "Error Sum of Squares" (ESS) at every step.
- ESS = The sum over all clusters, of the sum of the squares of the distances from the objects to the cluster centroid.
- When joining two clusters, find the pair that results in the smallest increase in ESS.
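A single Ward merge step can be sketched directly from the definition of ESS. The function names are hypothetical, and clusters are assumed to be non-empty lists of coordinate tuples.

```python
def centroid(cluster):
    n = len(cluster)
    return [sum(p[k] for p in cluster) / n for k in range(len(cluster[0]))]

def ess(cluster):
    """Sum of squared distances from the objects in one cluster to its centroid."""
    c = centroid(cluster)
    return sum(sum((p[k] - c[k]) ** 2 for k in range(len(c))) for p in cluster)

def ward_best_merge(clusters):
    """Return (i, j, increase) for the pair whose merge increases the total ESS the least."""
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            increase = ess(clusters[i] + clusters[j]) - ess(clusters[i]) - ess(clusters[j])
            if best is None or increase < best[2]:
                best = (i, j, increase)
    return best
```

Only the two merged clusters contribute to the change in total ESS, so the increase can be computed from those two clusters alone.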

Partitional clusterings

- The number of desired clusters is fixed at the start
- K-means clustering:
- Partition into k initial clusters
- Iteratively, reassign points to groups with the closest centroid. Recompute centroids.
- Repeat until stability
- The result may depend on initial clusters
- May include a procedure joining or splitting clusters according to size
- The choice of number of clusters may not be obvious
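The k-means iteration can be sketched as below. One assumption differs from the slide: instead of starting from an initial partition, this version starts from k randomly chosen data points as centroids, which is a common alternative. The `seed` and `max_iter` parameters are illustrative additions.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Return (labels, centroids) after iterating until the assignment is stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: initial centroids are k distinct data points
    labels = None
    for _ in range(max_iter):
        # Reassign every point to the cluster with the closest centroid
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break  # stability reached
        labels = new_labels
        # Recompute each centroid as the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids
```

Running it with several different seeds is a simple way to see how much the result depends on the initial clusters.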

Probabilistic or fuzzy clustering

- The output is, for each object and each cluster, a probability or weight that the object belongs to the cluster
- Example: The observations are modelled as draws from a number of probability densities (often multivariate normal). The parameters are then estimated by maximum likelihood (for example using the EM algorithm).
- Example: A "fuzzy" version of k-means, where the weights for the objects are updated iteratively
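The fuzzy version of k-means can be illustrated with the standard fuzzy c-means update rules. This is an assumption, since the slides do not name a specific algorithm. The fuzziness exponent `m` (here 2) and the function names are choices made for this sketch.

```python
import math

def memberships(points, centroids, m=2.0):
    """One weight per point and cluster; each point's weights sum to 1."""
    U = []
    for p in points:
        d = [math.dist(p, c) for c in centroids]
        if 0.0 in d:
            # The point coincides with a centroid: give it full weight there
            hits = [1.0 if di == 0.0 else 0.0 for di in d]
            row = [h / sum(hits) for h in hits]
        else:
            inv = [di ** (-2.0 / (m - 1.0)) for di in d]
            row = [v / sum(inv) for v in inv]
        U.append(row)
    return U

def update_centroids(points, U, m=2.0):
    """Weighted means of the points, with weights u^m."""
    dim = len(points[0])
    new = []
    for j in range(len(U[0])):
        w = [U[i][j] ** m for i in range(len(points))]
        new.append(tuple(sum(wi * p[t] for wi, p in zip(w, points)) / sum(w)
                         for t in range(dim)))
    return new
```

Alternating `memberships` and `update_centroids` until the weights stop changing gives the iterative scheme; points halfway between two centroids end up with weights close to 0.5 for each.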

Neural networks for clustering

- Neural networks are mathematical models designed to resemble biological neural networks
- They consist of layers of nodes that send out "signals" based probabilistically on their input signals
- The best-known uses are for classification, i.e., with learning sets

Clustering as optimization

- Given a definition of similarity and a definition of what makes a clustering "optimal", finding the optimum can be a huge algorithmic challenge.
- Example: Subdivide many thousands of objects into 50 clusters, minimizing e.g. the sum of the squared distances to centroids.
- Optimization algorithms are then central.

Genetic algorithms

- Try to use "evolution" to obtain good solutions to a problem
- A number of solutions are kept at every step. They may then mate or mutate to produce new solutions, and the "fittest" solutions are kept.
- Can be seen as an optimization algorithm
- Designing ways of mating and mutating that give an efficient algorithm is a great challenge

Simulated annealing

- A general optimization technique
- Iterative: At every step, a nearby solution is chosen with a probability depending on its optimality (so even less optimal solutions may be chosen)
- As the algorithm proceeds and the "temperature" sinks, the probability of choosing less optimal solutions also sinks.
- A good general way to reduce the risk of getting stuck in local optima.
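A generic simulated annealing loop might look like the sketch below. It assumes the common Metropolis acceptance rule (a worse candidate is accepted with probability exp(-increase / temperature)) and a geometric cooling schedule; neither is specified in the slides, and all parameter names and defaults are illustrative.

```python
import math
import random

def simulated_annealing(cost, neighbour, start, t0=1.0, cooling=0.995, steps=5000, seed=0):
    """Minimize cost(); neighbour(solution, rng) must return a nearby solution."""
    rng = random.Random(seed)
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    t = t0
    for _ in range(steps):
        candidate = neighbour(current, rng)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept a worse candidate with probability exp(-delta / t)
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        t *= cooling  # the "temperature" sinks, so worse moves become rarer
    return best, best_cost
```

For clustering, a solution would be an assignment of objects to clusters, `neighbour` would move one object to another cluster, and `cost` could be the sum of squared distances to the centroids.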

4. Assessing and using the result

- Visualization and summarization of the clusters
- Note: You should always investigate the dependence of your results on the choices you have made for the clustering!

Examples of applications of clustering

- Image analysis
- Speech recognition
- Data mining
- ....

Clustering microarray data

- In the data matrix, samples are columns and genes are rows
- What values to cluster?
- What is a biologically relevant measure of similarity?
- One can cluster genes and/or samples

[Figure: data matrix with samples as columns and genes as rows]

Clustering microarray data

- Usually, use log-transformed data
- Data should be on the same scale (which it usually is if the data has already been normalized)
- You may have to filter away genes that show too little variation over samples.
- Use an appropriate distance measure for the question you want to focus on (Pearson correlation often works OK).
- Use appropriate clustering algorithm (Hierarchical average linkage usually works OK).
- If you draw some conclusion from the clustering results, try to vary your clustering choices to see how stable these results are.
- Clustering works best as a tool to generate hypotheses and ideas, which may then be tested in other ways.
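The first recommendations (log-transform, filter low-variance genes, use a correlation-based distance) can be sketched as a small preprocessing pipeline. Base-2 logs, the sample variance as the filter criterion, and the function names are assumptions for illustration. The matrix is assumed to hold strictly positive values, with genes as rows and samples as columns.

```python
import math

def log2_transform(matrix):
    return [[math.log2(v) for v in row] for row in matrix]

def variance(row):
    m = sum(row) / len(row)
    return sum((v - m) ** 2 for v in row) / (len(row) - 1)

def filter_genes(matrix, min_variance):
    """Keep only the genes (rows) that vary enough across the samples."""
    return [row for row in matrix if variance(row) >= min_variance]

def correlation_distance(x, y):
    """1 - centered Pearson correlation: 0 for perfectly correlated profiles, 2 for anti-correlated."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return 1.0 - num / den
```

Filtering also has a practical side: a gene with no variation at all has an undefined Pearson correlation (division by zero), so it must be removed before a correlation-based distance is computed.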

Clustering to confirm or reject hypotheses?

- A clustering may appear to validate, or be validated by, a grouping derived by using other data
- Caution: The many different ways to do a clustering may make it possible to tweak it to produce the clusters you want
- There is a huge and complex multiple testing problem
- Note that small changes in data can change result dramatically
- If you insist on trying to get "significance":
- Using permutations of data
- Using resampling of data (bootstrapping)

How to do clustering: Programs

- A good program for clustering and visualization: HCE
- Great visualization options
- Adapted to microarray data
- http://www.cs.umd.edu/hcil/hce/
- Can import similarity matrices
- Classic for microarray data: Cluster & TreeView (Eisen)
- R/Bioconductor: package cluster, hclust function, heatmap function, ...
- Many other programs/packages

Other visualization techniques: Principal Components

- The principal components can be viewed as the axes of a “better” coordinate system for the data.
- “Better” in the sense that the data is maximally spread out along the first principal components.
- The principal components correspond to eigenvectors of the covariance matrix of the data.
- Each eigenvalue is the variance along its principal component; divided by the sum of all eigenvalues, it gives the part of the total variance explained by that component.
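The link between principal components and the covariance matrix can be illustrated with power iteration, which finds the leading eigenvector without any linear algebra library. This is a sketch under assumptions: the function names are hypothetical, the fixed number of iterations is arbitrary, and the starting vector must not be orthogonal to the first principal component.

```python
import math

def covariance_matrix(rows):
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - means[k] for k in range(d)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def first_principal_component(rows, iterations=200):
    """Leading eigenvector and eigenvalue of the covariance matrix, by power iteration."""
    S = covariance_matrix(rows)
    d = len(S)
    v = [1.0 / math.sqrt(d)] * d  # starting direction (assumed not orthogonal to the answer)
    for _ in range(iterations):
        w = [sum(S[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(S[a][b] * v[b] for b in range(d)) for a in range(d))
    return v, eigenvalue
```

Dividing the returned eigenvalue by the trace of the covariance matrix (the total variance) gives the fraction of variance explained by the first principal component.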

Other visualization techniques: Multidimensional scaling

- Start with some points in a very high dimension.
- Goal: Display these points in a lower dimension, so that distances between them are similar to distances in original dimension.
- May also try to preserve only the ranking of the pairwise distances.
- Makes it possible to use powerful visual inspection, in 2 or 3 dimensions.
- Can sometimes give very convincing pictures separating samples in a predicted way.
