Clustering

Petter Mostad

Clustering vs. class prediction
  • Class prediction:
    • A learning set of objects with known classes
    • Goal: put new objects into existing classes
    • Also called: Supervised learning, or classification
  • Clustering:
    • No learning set, no given classes
    • Goal: discover the ”best” classes or groupings
    • Also called: Unsupervised learning, or class discovery
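To make the distinction concrete, here is a minimal sketch using scikit-learn; the toy data and the choice of a nearest-neighbour classifier are illustrative assumptions, not part of the slides.

```python
# Toy data; values are made up for illustration.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X_train = np.array([[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]])
y_train = np.array([0, 0, 1, 1])  # known classes: a learning set

# Class prediction (supervised): learn from labelled objects, assign new ones.
clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print(clf.predict([[4.8, 5.0]]))  # -> [1]

# Clustering (unsupervised): no labels given; the algorithm proposes groupings.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)
print(km.labels_)                 # e.g. [0 0 1 1]; the cluster labels are arbitrary
```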
Overview
  • General clustering theory
    • Steps, methods, algorithms, issues...
  • Clustering microarray data
    • Recommendations for this kind of data
  • Programs for clustering
  • Some other visualization techniques
Issues in clustering
  • Used to explore and visualize data, with few preconceptions
  • Many subjective choices must be made, so a clustering output tends to be subjective
  • It is difficult to get truly statistically ”significant” conclusions
  • Algorithms will always produce clusters, whether any exist in the data or not
Steps in clustering
  • Feature selection and extraction
  • Defining and computing similarities
  • Clustering or grouping objects
  • Assessing, presenting, and using the result
1. Feature selection and extraction
  • Deciding which measurements matter for similarity
  • Data reduction
  • Filtering away objects
  • Normalization of measurements
The data matrix
  • Every row contains the measurements for one object.
  • Similarities are computed between all pairs of rows
  • If the measurements are all of the same type, one can instead cluster the measurements (the columns), as in the sketch below

(Figure: the data matrix, with measurements as columns and objects as rows)

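A minimal sketch of this convention with NumPy/SciPy; the values are made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Rows are objects, columns are measurements (toy values).
data = np.array([[2.0, 0.5, 1.0],    # object 1
                 [1.9, 0.4, 1.2],    # object 2
                 [0.1, 3.0, 0.2]])   # object 3

# Similarities (here: Euclidean distances) between all pairs of rows.
D = squareform(pdist(data, metric="euclidean"))
print(D.round(2))

# If the measurements are of the same type, transpose to cluster the columns instead.
D_measurements = squareform(pdist(data.T, metric="euclidean"))
```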
2. Defining and computing similarities
  • Similarity measures for continuous data vectors:
    • Euclidean distance
    • Minkowski distance (including Manhattan metric)
    • Mahalanobis distance: d(x, y) = sqrt( (x - y)^T S^(-1) (x - y) ), where S is a covariance matrix
  • Centered and non-centered (absolute) Pearson correlation
    • centered: r(x, y) = Σ_i (x_i - x̄)(y_i - ȳ) / sqrt( Σ_i (x_i - x̄)^2 · Σ_i (y_i - ȳ)^2 )
    • non-centered: r(x, y) = Σ_i x_i y_i / sqrt( Σ_i x_i^2 · Σ_i y_i^2 )

where x̄ and ȳ are the means of the two vectors

  • Spearman rank correlation
    • Compute the ranking of the numbers in each vector
    • Find the correlation between the rank vectors
  • ....
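A minimal sketch of these measures with NumPy/SciPy; the vectors and the covariance matrix are made-up illustrations, not values from the slides.

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, minkowski, mahalanobis
from scipy.stats import pearsonr, spearmanr

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.5, 3.5, 3.0])

d_euclid = euclidean(x, y)           # sqrt(sum((x - y)^2))
d_manhattan = cityblock(x, y)        # Minkowski with p = 1 (Manhattan metric)
d_minkowski = minkowski(x, y, p=3)   # general Minkowski distance

S = np.diag([1.0, 2.0, 0.5, 1.5])    # covariance matrix (made up here; normally
                                     # estimated from the data)
d_mahal = mahalanobis(x, y, np.linalg.inv(S))   # sqrt((x - y)^T S^-1 (x - y))

r_centered, _ = pearsonr(x, y)       # centered Pearson correlation
rho = spearmanr(x, y).correlation    # Spearman: correlation of the rank vectors

# Non-centered ("absolute") Pearson correlation: same formula without centering.
r_noncentered = np.sum(x * y) / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2))
```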
Geometrical view of clustering
  • If measurements are coordinates, objects become points in some space
  • If the similarity measure is Euclidean distance, the goal is to group nearby points
  • Note: when we have only 2 or 3 measurements per object, visual inspection often does better than most algorithms
Similarity measures for discrete data
  • Comparing two binary vectors, count the numbers a,b,c,d of 1-1’s, 1-0’s, 0-1’s, and 0-0’s, respectively
  • Construct different similarity measures from these numbers, for example the simple matching coefficient (a + d)/(a + b + c + d) or the Jaccard coefficient a/(a + b + c)
  • Similarity of, for example, trees or other structured objects can also be defined in reasonable ways
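A minimal sketch of how such measures are built from the counts a, b, c, d; the simple matching and Jaccard coefficients are standard examples chosen here for illustration.

```python
import numpy as np

u = np.array([1, 1, 0, 1, 0, 0])
v = np.array([1, 0, 0, 1, 0, 1])

a = np.sum((u == 1) & (v == 1))   # 1-1 matches
b = np.sum((u == 1) & (v == 0))   # 1-0 mismatches
c = np.sum((u == 0) & (v == 1))   # 0-1 mismatches
d = np.sum((u == 0) & (v == 0))   # 0-0 matches

simple_matching = (a + d) / (a + b + c + d)   # counts 0-0 as agreement
jaccard = a / (a + b + c)                     # ignores 0-0 matches
```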
Similarities using contexts
  • Mutual Neighbour Distance: MND(x, y) = NN(x, y) + NN(y, x)

where NN(y, x) is the neighbour number of x with respect to y, i.e., x is the NN(y, x)-th nearest neighbour of y

  • This is not a metric, but similarities do not need to be based on metrics.
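A minimal sketch of the Mutual Neighbour Distance for points with Euclidean coordinates; the helper function and toy points are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def mutual_neighbour_distance(points, i, j):
    """MND(i, j) = NN(i, j) + NN(j, i) for points in Euclidean space."""
    D = cdist(points, points)                        # all pairwise distances
    nn_j_wrt_i = np.argsort(D[i]).tolist().index(j)  # j is the ...-th neighbour of i
    nn_i_wrt_j = np.argsort(D[j]).tolist().index(i)  # i is the ...-th neighbour of j
    return nn_j_wrt_i + nn_i_wrt_j

points = np.array([[0.0, 0.0], [1.0, 0.0], [1.2, 0.1], [5.0, 5.0]])
print(mutual_neighbour_distance(points, 0, 1))       # -> 3 for these toy points
```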
3. Clustering or grouping
  • Hierarchical clusterings
    • Divisive: Starts with one big cluster and subdivides one cluster in each step
    • Agglomerative: Starts with each object in a separate cluster. In each step, joins the two closest clusters
  • Partitional clusterings
  • Probabilistic or fuzzy clusterings
Hierarchical clustering
  • Agglomerative clustering depends on type of linkage, i.e., how to compute the distance between merged cluster (UV) and old cluster (W):
    • d(UV, W) = min(d(U, W), d(V,W)) (single linkage)
    • d(UV, W) = max(d(U,W), d(V,W)) (complete linkage)
    • d(UV, W) = average over all distances between objects in (UV) and objects in W (average linkage, or UPGMA: Unweighted Pair Group Method with Arithmetic mean)
  • The output is a dendrogram
  • A simplification of average linkage is often implemented (“average group linkage”): It may lead to inverted dendrograms!
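A minimal sketch of these linkage choices using scipy.cluster.hierarchy; the data is simulated and the cut into 3 clusters is an arbitrary illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist

data = np.random.default_rng(0).normal(size=(10, 4))   # 10 objects, 4 measurements
dists = pdist(data, metric="euclidean")

Z_single   = linkage(dists, method="single")    # min distance between clusters
Z_complete = linkage(dists, method="complete")  # max distance between clusters
Z_average  = linkage(dists, method="average")   # UPGMA
Z_ward     = linkage(dists, method="ward")      # Ward's minimum-ESS criterion (below)

labels = fcluster(Z_average, t=3, criterion="maxclust")   # cut the tree into 3 clusters
# dendrogram(Z_average) draws the tree (requires matplotlib).
```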
Dendrograms, visualizations
  • The data matrix is often visualized using three colors, representing positive, negative, and zero values.
  • Hierarchical clustering results are often represented with a dendrogram. The similarity at which two clusters merge should correspond to the height of the corresponding horizontal line in the dendrogram!
  • To display the dendrogram, the objects (rows or columns) need to be ordered; each time two clusters are merged, the two subtrees can be placed in either of two orders.
Ward’s hierarchical clustering
  • Agglomerative.
  • Goal: minimize ”Error Sum of Squares” (ESS) at every step.
    • ESS = The sum over all clusters, of the sum of the squares of the distances from the objects to the cluster centroid.
  • When joining two clusters, find the pair that results in the smallest increase in ESS.
Partitional clusterings
  • The number of desired clusters is fixed at the start
  • K-means clustering:
    • Partition into k initial clusters
    • Iteratively reassign each point to the group with the closest centroid, then recompute the centroids
    • Repeat until the assignments stabilize
    • The result may depend on initial clusters
    • May include a procedure joining or splitting clusters according to size
  • The choice of number of clusters may not be obvious
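A minimal sketch of k-means with scikit-learn; the data and the choice k = 3 are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.random.default_rng(1).normal(size=(100, 5))

km = KMeans(n_clusters=3, n_init=10, random_state=0)   # n_init restarts guard against
labels = km.fit_predict(data)                          # unlucky initial clusters
centroids = km.cluster_centers_

# A different random_state can give a different partition, illustrating the
# dependence on the initial clusters.
```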
Probabilistic or fuzzy clustering
  • The output is, for each object and each cluster, a probability or weight that the object belongs to the cluster
  • Example: The observations are modelled as draws from a number of probability densities (often multivariate normal). Parameters are then estimated by Maximum Likelihood (for example using the EM algorithm).
  • Example: A ”fuzzy” version of k-means, where weights for objects are changed iteratively
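A minimal sketch of the first example, a Gaussian mixture fitted with the EM algorithm via scikit-learn; the simulated data and the choice of 2 components are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

data = np.random.default_rng(2).normal(size=(200, 3))

gm = GaussianMixture(n_components=2, random_state=0).fit(data)   # fitted by EM
membership = gm.predict_proba(data)    # shape (200, 2): a weight per object and
print(membership[:5].round(2))         # cluster, rather than a hard assignment
```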
Neural networks for clustering
  • Neural networks are mathematical models inspired by biological neural networks
  • They consist of layers of nodes that send out ”signals” based probabilistically on their input signals
  • The best-known uses are classification tasks, i.e., with learning sets
Clustering as optimization
  • Given a similarity definition and a definition of what an ”optimal” clustering is, it can often be a huge algorithmic challenge to find the optimum.
  • Example: Subdivide many thousand objects into 50 clusters, minimizing e.g. the sum of the squared distances to centroids.
  • Then, algorithms for optimization are central.
Genetic algorithms
  • Tries to use ”evolution” to obtain good solutions to a problem
  • A number of solutions are kept at every step: They may then mate or mutate, to produce new solutions. The ”fittest” solutions are kept.
  • Can be seen as an optimization algorithm
  • It is a great challenge to design mating and mutation operators that produce an efficient algorithm
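A minimal, untuned sketch of the idea, assuming we search over cluster-assignment vectors with fitness defined as the negated sum of squared distances to cluster centroids; all data and parameter values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(60, 2))
k, pop_size, n_generations = 3, 40, 200

def fitness(assignment):
    # Negated sum of squared distances to cluster centroids: higher is better.
    total = 0.0
    for c in range(k):
        members = data[assignment == c]
        if len(members):
            total += np.sum((members - members.mean(axis=0)) ** 2)
    return -total

# A "solution" is a cluster-assignment vector; keep a population of them.
population = [rng.integers(0, k, size=len(data)) for _ in range(pop_size)]
for _ in range(n_generations):
    children = []
    for _ in range(pop_size):
        p1, p2 = rng.choice(pop_size, size=2, replace=False)
        mask = rng.random(len(data)) < 0.5            # "mating": uniform crossover
        child = np.where(mask, population[p1], population[p2])
        mutate = rng.random(len(data)) < 0.02         # "mutation": random reassignment
        child[mutate] = rng.integers(0, k, size=int(mutate.sum()))
        children.append(child)
    # Keep the fittest solutions among parents and children.
    population = sorted(population + children, key=fitness, reverse=True)[:pop_size]

best_assignment = population[0]
```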
Simulated annealing
  • A general optimization technique
  • Iterative: At every step, nearby solutions are chosen with probabilities depending on their optimality (so even less optimal solutions may be chosen)
  • As the algorithm proceeds and the ”temperature” sinks, the probability of choosing less optimal solutions decreases.
  • It is a good general way to avoid local optima.
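A minimal sketch of simulated annealing applied to the same kind of clustering objective as above; the cooling schedule and all parameters are made-up illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(60, 2))
k = 3

def cost(assignment):
    # Sum of squared distances to cluster centroids (lower is better).
    return sum(np.sum((data[assignment == c] - data[assignment == c].mean(axis=0)) ** 2)
               for c in range(k) if np.any(assignment == c))

assignment = rng.integers(0, k, size=len(data))
current = cost(assignment)
temperature = 1.0
for step in range(20000):
    candidate = assignment.copy()
    candidate[rng.integers(len(data))] = rng.integers(k)      # a "nearby" solution
    new = cost(candidate)
    # Accept worse solutions with a probability that shrinks as the temperature sinks.
    if new < current or rng.random() < np.exp((current - new) / temperature):
        assignment, current = candidate, new
    temperature *= 0.9995                                     # cooling schedule
```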
4. Assessing and using the result
  • Visualization and summarization of the clusters
  • Note: You should always investigate the dependence of your results on the choices you have made for the clustering!
Examples of applications of clustering
  • Image analysis
  • Speech recognition
  • Data mining
  • ....
Clustering microarray data

  • Samples are columns and genes are rows in the data matrix
  • What values to cluster?
  • What is a biologically relevant measure of similarity?
  • One can cluster genes and/or samples

(Figure: the expression data matrix, with samples as columns and genes as rows)

Clustering microarray data
  • Use logged data, usually
  • Data should be on the same scale (but it usually is, if you use data that has already been normalized)
  • You may have to filter away genes that show too little variation over samples.
  • Use an appropriate distance measure for the question you want to focus on (Pearson correlation often works OK).
  • Use appropriate clustering algorithm (Hierarchical average linkage usually works OK).
  • If you draw some conclusion from the clustering results, try to vary your clustering choices to see how stable these results are.
  • Clustering works best as a tool to generate hypotheses and ideas, which may then be tested in other ways.
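A minimal sketch of this recommended pipeline, assuming a simulated genes-by-samples expression matrix; in practice you would load your own normalized data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
expression = rng.lognormal(mean=2.0, sigma=1.0, size=(500, 12))   # genes x samples

logged = np.log2(expression)                        # use logged data

# Filter away genes with too little variation over the samples (threshold made up).
variances = logged.var(axis=1)
filtered = logged[variances > np.quantile(variances, 0.75)]

# 1 - Pearson correlation as the distance between genes.
dists = pdist(filtered, metric="correlation")

Z = linkage(dists, method="average")                # hierarchical average linkage
gene_clusters = fcluster(Z, t=6, criterion="maxclust")

# To cluster the samples instead, repeat the last two steps with filtered.T.
```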
Clustering to confirm or reject hypotheses?
  • A clustering may appear to validate, or be validated by, a grouping derived by using other data
  • Caution: The many different ways to do a clustering may make it possible to tweak it to produce the clusters you want
  • There is a huge and complex multiple testing problem
  • Note that small changes in the data can change the result dramatically
  • If you insist on trying to get ”significance”:
    • Use permutations of the data
    • Use resampling of the data (bootstrapping)
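A minimal sketch of one such stability check, assuming bootstrap resampling of the measurements and counting how often pairs of objects are clustered together; all data and parameters are made up.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
data = rng.normal(size=(20, 50))       # 20 objects, 50 measurements (simulated)
n_boot, k = 200, 3

co_clustered = np.zeros((len(data), len(data)))
for _ in range(n_boot):
    cols = rng.integers(0, data.shape[1], size=data.shape[1])   # resample measurements
    labels = fcluster(linkage(pdist(data[:, cols]), method="average"),
                      t=k, criterion="maxclust")
    co_clustered += labels[:, None] == labels[None, :]

co_clustered /= n_boot   # pairs near 1.0 end up together in almost every resample
```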
How to do clustering: Programs
  • A good program for clustering and visualization: HCE
    • Great visualization options
    • Adapted to microarray data
    • http://www.cs.umd.edu/hcil/hce/
    • Can import similarity matrices
  • Classic for microarray data: Cluster & TreeView (Eisen)
  • R/BioConductor: package cluster, hclust function, heatmap function, ...
  • Many other programs/packages
Other visualization techniques: Principal Components
  • The principal components can be viewed as the axes of a “better” coordinate system for the data.
  • “Better” in the sense that the data is maximally spread out along the first principal components.
  • The principal components correspond to eigenvectors of the covariance matrix of the data.
  • The eigenvalues represent the part of the total variance explained by each of the principal components.
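A minimal sketch of principal components via the eigendecomposition of the covariance matrix, as described above; the data is simulated.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))                 # 100 objects, 5 measurements

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)             # covariance matrix of the data

eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: cov is symmetric
order = np.argsort(eigenvalues)[::-1]            # sort components by variance
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained = eigenvalues / eigenvalues.sum()      # share of total variance per component
scores = centered @ eigenvectors                 # the data in the new coordinate system
```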
Other visualization techniques: Multidimensional scaling
  • Start with points in a high-dimensional space.
  • Goal: Display these points in a lower dimension, so that the distances between them are similar to the distances in the original space.
  • May also try to preserve only the ranking of the pairwise distances.
  • Makes it possible to use powerful visual inspection, in 2 or 3 dimensions.
  • Can sometimes give very convincing pictures separating samples in a predicted way.
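A minimal sketch using scikit-learn's MDS; the simulated data and the 2-dimensional target are illustrative assumptions.

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
data = rng.normal(size=(30, 100))        # 30 points in a 100-dimensional space

# metric=True tries to preserve the distances themselves; metric=False
# (non-metric MDS) tries to preserve only the ranking of the pairwise distances.
embedding = MDS(n_components=2, metric=True, random_state=0).fit_transform(data)
print(embedding.shape)                   # (30, 2): ready for visual inspection
```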