Statistical analysis of array data: Dimensionality reduction, Clustering

Katja Astikainen, Riikka Kaven

25.2.2005

- Problems and approaches
- Dimensionality reduction by PCA
- Clustering overview
- Hierarchical clustering
- K-means
- Mixture models and EM

- The basic idea is to find patterns of expression across multiple genes and experiments
- Models of expression are used, for example, to classify diseases more precisely (disease type, disease stage)
- Expression patterns can be used to explore cellular pathways
- With gene expression modeling and clustering of conditions (experiments), one can find genes that are co-regulated
- Clustering methods can also be used for sequence alignments

- There are several methods for this, but we are going to introduce:
- Principal Component Analysis (PCA)
- Clustering (hierarchical, K-means, EM)

PCA is a statistical data analysis technique

- method to reduce dimensionality
- method to identify new meaningful underlying variables
- method to compress the data
- method to visualize the data

- We have N data points x1,…,xN in M-dimensional space, where the points x are gene expression vectors.
- With PCA we can reduce the dimension to K, which is usually much lower than M.
- Imagine taking a three-dimensional cloud of data points and rotating it so you can view it from different perspectives. Certain views separate the data into groups better than others.
- With PCA we can ignore some of the redundant experiments (low variance), or use some average of the information, with little loss of information.

- We are looking for a unit vector u1 such that, on average, the squared length of the projection of the xs along u1 is maximal (vectors are column vectors)
- Generally, if the first components u1,…,uk-1 have been determined, the next component uk is the one that maximizes the residual variance
- The principal components of an expression vector x are given by ci = ui^T x

- How can we find the eigenvectors ui?
- Find the eigenvectors that show the most informative part of the data: vectors that point in the directions of maximal variance of the data.

- First we calculate the covariance matrix
- Find the eigenvalues and eigenvectors uk of the covariance matrix
- An eigenvalue measures the variance explained by the corresponding eigenvector (its share of the total is the eigenvalue divided by the sum of all eigenvalues)
- Select the ui that are the eigenvectors of the sample covariance matrix associated with the K largest eigenvalues
- These eigenvectors explain most of the variance in the data
- This discovers the important features and patterns in the data
- For data visualization use two- or three-dimensional spaces
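The PCA steps above (covariance matrix, eigenvectors with the largest eigenvalues, projection) can be sketched in plain Python. This is only an illustration under stated assumptions: the data is centered before projecting, and the leading eigenvectors are found by power iteration with deflation rather than a full eigen-decomposition. The function names `covariance_matrix`, `top_eigenpairs` and `project` are hypothetical helpers, not part of the original slides.

```python
import math
import random

def covariance_matrix(data):
    """Sample covariance matrix (M x M) of N data points given as rows."""
    n, m = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(m)]
    centered = [[row[j] - mean[j] for j in range(m)] for row in data]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(m)]
           for a in range(m)]
    return cov, mean

def top_eigenpairs(cov, k, iters=500, seed=0):
    """Approximate the K largest eigenvalues and their unit eigenvectors.

    Assumption: power iteration with deflation stands in for a full
    eigen-decomposition; it works because the covariance matrix is symmetric.
    """
    rng = random.Random(seed)
    m = len(cov)
    a = [row[:] for row in cov]          # working copy that gets deflated
    pairs = []
    for _ in range(k):
        u = [rng.random() + 0.1 for _ in range(m)]
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]
        for _ in range(iters):
            v = [sum(a[i][j] * u[j] for j in range(m)) for i in range(m)]
            norm = math.sqrt(sum(x * x for x in v))
            if norm < 1e-12:             # nothing left to explain
                break
            u = [x / norm for x in v]
        # Rayleigh quotient u^T A u gives the eigenvalue for unit u
        lam = sum(u[i] * sum(a[i][j] * u[j] for j in range(m)) for i in range(m))
        pairs.append((lam, u))
        # Deflation: remove the component that was just found
        for i in range(m):
            for j in range(m):
                a[i][j] -= lam * u[i] * u[j]
    return pairs

def project(data, mean, pairs):
    """Principal components c_i = u_i^T (x - mean) for every data point x."""
    m = len(mean)
    return [[sum(u[j] * (row[j] - mean[j]) for j in range(m)) for _, u in pairs]
            for row in data]

# Four points lying on the line y = 2x: one component explains everything
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
cov, mean = covariance_matrix(data)
pairs = top_eigenpairs(cov, 1)
components = project(data, mean, pairs)
```

For real expression data a linear algebra library would normally be used for the eigen-decomposition; the loop version is kept here only so that each step of the slide is visible.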

- Data analysis methods for discovering patterns and underlying cluster structures
- Different kinds of methods exist, such as hierarchical clustering, partitioning-based k-means, and the Self-Organizing Map (SOM)
- There is no single method that is best for every data set
- Clustering methods (like k-means) are unsupervised methods
- There is no information about the true clusters or their number
- Clustering algorithms are used for analysing the data
- Discovered clusters are only estimates of the truth (often the result is a local optimum)

- Data types
- Typically the clustered data is numerical vector data like gene expression data (expression vectors)
- Numerical data can also be represented in relative coordinates
- Data might also be qualitative (nominal), which makes comparing the data elements challenging

- The number of clusters is often unknown
- One way to estimate the number of clusters is to analyse the data with PCA
- You might use the eigenvectors to estimate the number of clusters

- Another way is to make guesses and justify the number of clusters by good results (however those are defined)

- Similarity measures
- Pearson correlation (dot product of normalized vectors)

- Distance measures
- Euclidean (natural distance between two vectors)

- It is important to use appropriate distance/similarity measures
- In Euclidean space vectors might be close to each other even though their correlation is 0
- Example: the vectors 1000000000 and 0000000001 are at Euclidean distance √2, yet their dot product is 0
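To make the difference between the measures concrete, the two example vectors 1000000000 and 0000000001 can be compared directly. The sketch below is illustrative only; the helper names `euclidean`, `normalized_dot` and `pearson` are assumptions chosen for this example. It also shows that the mean-centered Pearson correlation and the uncentered normalized dot product are not the same number for these vectors.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def normalized_dot(x, y):
    """Dot product of the two vectors after scaling each to unit length."""
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return sum(a * b for a, b in zip(x, y)) / (nx * ny)

def pearson(x, y):
    """Pearson correlation: normalized dot product of mean-centered vectors."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    return normalized_dot([a - mx for a in x], [b - my for b in y])

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

distance = euclidean(x, y)          # sqrt(2): the vectors are close
similarity = normalized_dot(x, y)   # 0: the vectors share no component
correlation = pearson(x, y)         # slightly negative after centering
```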

Cost function and probabilistic interpretation:

- To compare different ways of clustering the same data, we need some kind of cost function for the clustering algorithm
- The goal of clustering is to minimize this cost function
- Generally the cost function depends on quantities such as:
- The centers of the clusters
- The distance of each point in a cluster to the cluster center
- The average degree of similarity of the points in a cluster

- Cost functions are algorithm-specific, so comparing the results of different clustering algorithms might be almost impossible
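As one possible instance of such a cost function, the sketch below sums the squared Euclidean distances from each point to the center of its cluster. This particular choice is an assumption made for illustration (it is the quantity k-means minimizes); the names `centroid` and `clustering_cost` are hypothetical.

```python
def centroid(points):
    """Average vector of a non-empty list of points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]

def clustering_cost(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster center."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    cost = 0.0
    for members in clusters.values():
        c = centroid(members)
        cost += sum(sum((a - b) ** 2 for a, b in zip(p, c)) for p in members)
    return cost

points = [[0, 0], [0, 2], [10, 0], [10, 2]]
good = clustering_cost(points, [0, 0, 1, 1])   # groups nearby points together
bad = clustering_cost(points, [0, 1, 0, 1])    # mixes distant points
```

Comparing `good` and `bad` shows how the same data and the same cost function rank two alternative clusterings.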

Cost function and probabilistic interpretation:

- There are some advantages associated with probabilistic models, so they are often utilized in cost functions

- A popular choice for the clustering cost function is the negative log-likelihood of an underlying probabilistic model

- The basic idea is to construct a hierarchical tree that consists of nested clusters
- The algorithm is a bottom-up method: clustering starts from single data points (genes) and stops when all data points are in the same cluster (the root of the tree)
- Clustering begins by computing pairwise similarities between all data points; once clusters are formed, similarities are computed between clusters.
- The merging step is repeated at most N-1 times: leaf nodes (genes) are paired first, and the tree becomes a binary tree.

- Calculate the pairwise similarities between data points and store them in a matrix
- Find the two data points (nodes in the tree) that are closest to each other, i.e. most similar.
- Group them together to make a new cluster.
- Calculate the average vector of the data points, which is the expression profile of the cluster (the inner node in the tree that joins the leaf nodes, i.e. the data point vectors)
- Calculate a new correlation matrix
- Calculate the pairwise similarity between the new cluster and the other clusters.
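The merge loop described above can be sketched as follows. This is a minimal illustration, not an optimized implementation: it recomputes all pairwise similarities in every round, uses Pearson correlation as the similarity, and represents each cluster by the average of its member vectors. The function names and the nested-tuple tree representation are assumptions made for this example.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long vectors (0.0 if one is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cx = [a - mx for a in x]
    cy = [b - my for b in y]
    denom = math.sqrt(sum(a * a for a in cx) * sum(b * b for b in cy))
    return sum(a * b for a, b in zip(cx, cy)) / denom if denom else 0.0

def hierarchical_clustering(vectors):
    """Bottom-up clustering; returns the binary tree as nested tuples of indices."""
    # Each cluster is (tree, average profile, number of members)
    clusters = [(i, list(v), 1) for i, v in enumerate(vectors)]
    while len(clusters) > 1:
        # Find the most similar pair of clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = pearson(clusters[i][1], clusters[j][1])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        (ti, pi, ni), (tj, pj, nj) = clusters[i], clusters[j]
        # The new inner node's profile is the average of all member vectors
        profile = [(ni * a + nj * b) / (ni + nj) for a, b in zip(pi, pj)]
        merged = ((ti, tj), profile, ni + nj)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Two rising and two falling expression profiles
vectors = [[1, 2, 3, 4], [2, 4, 6, 8.5], [4, 3, 2, 1], [8, 6, 4, 2.5]]
tree = hierarchical_clustering(vectors)
```

With N data points the loop runs N-1 merges, matching the bound stated on the slide.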

- With hierarchical clustering we can find the dendrogram of clusters of data points, but the constructed tree is not yet in optimal order
- After finding the dendrogram, which shows the similarity between nodes and genes, the optimal linear order of the nodes can be found with dynamic programming

Goal: Quickly and easily arrange the data for further inspection

[Figure: expression data matrix; axes: genes, experiments]

- Greedily join nearest cluster pair [3]
- Nearest: we use the correlation coefficient (normalized dot product); other measures can be used as well
- Optimal ordering: minimize summed distance between consecutive genes
- Criterion suggested by Eisen

[Figure: dendrogram over the leaves A, B, C, D, E; leaf orders shown on successive slides: A B C D E, A C B D E, B A C E D]

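- An optimal linear ordering of the gene expression vectors can be computed in O(N^4) steps
- We would like to maximize the similarity between neighbouring nodes, $\sum_{i=1}^{N-1} S(z_{\phi_i}, z_{\phi_{i+1}})$, where S is the similarity measure and $z_{\phi_i}$ is the ith leaf when the tree is ordered according to $\phi$ [1]
- The algorithm works from the bottom up towards the root by recursively computing M(V,U,W), the cost of the optimal ordering of the subtree rooted at V with leftmost leaf U and rightmost leaf W

- The dynamic programming recurrence is given by $M(V,U,W) = \max_{R,L} \left[ M(V_l,U,R) + M(V_r,L,W) + S(R,L) \right]$, where $V_l$ and $V_r$ are the two children of V, R is the rightmost leaf of $V_l$ and L is the leftmost leaf of $V_r$ [1]
- The optimal cost M(V) for V is obtained by maximizing over all pairs U, W.
- The global optimal cost is obtained when V is the root of the tree, and the optimal ordering can be found by standard backtracking.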
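The bottom-up computation of M(V,U,W) can be sketched as below. It is a simplified illustration under several assumptions: the tree is given as nested tuples of leaf indices, `sim` is a symmetric similarity matrix, both orientations of every inner node are tried, and the best leaf order is stored next to each score instead of being recovered by backtracking. The function names are hypothetical.

```python
def optimal_orderings(tree, sim):
    """Map (leftmost leaf, rightmost leaf) -> (score, leaf order) for a subtree."""
    if not isinstance(tree, tuple):
        # A single leaf is its own leftmost and rightmost element
        return {(tree, tree): (0.0, [tree])}
    a = optimal_orderings(tree[0], sim)
    b = optimal_orderings(tree[1], sim)
    best = {}
    # Try both orientations of the two children (flipping the inner node)
    for left, right in ((a, b), (b, a)):
        for (u, r), (s_left, o_left) in left.items():
            for (l, w), (s_right, o_right) in right.items():
                # Cost of joining: left part + similarity at the seam + right part
                score = s_left + sim[r][l] + s_right
                if (u, w) not in best or score > best[(u, w)][0]:
                    best[(u, w)] = (score, o_left + o_right)
    return best

def optimal_leaf_order(tree, sim):
    """Best total neighbour similarity and the leaf order that achieves it."""
    table = optimal_orderings(tree, sim)
    return max(table.values(), key=lambda entry: entry[0])

# Hypothetical similarities between four genes; the tree is ((0, 1), (2, 3))
sim = [
    [1.0, 0.9, 0.7, 0.1],
    [0.9, 1.0, 0.2, 0.1],
    [0.7, 0.2, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
score, order = optimal_leaf_order(((0, 1), (2, 3)), sim)
```

In this example genes 0 and 2 are similar, so the best ordering flips the first pair to place them next to each other while leaving the tree structure unchanged.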

- Data points are divided into K clusters
- Find, by iterating, a set of centroids C={v1,…,vK} that minimizes the sum of squared distances (d^2) between the expression vectors x1,…,xN and the centroid REP[xj,C] of the cluster each belongs to: $\sum_{j=1}^{N} d^2(x_j, \mathrm{REP}[x_j, C])$, where the distance measure d is Euclidean.

In practice the result is an approximation (a local optimum).

- Each expression vector belongs to exactly one cluster.

1. Initially put the expression vectors randomly into K clusters.
2. Define the cluster centroids by calculating the average vector of the expression vectors that belong to each cluster.
3. Compute the distances between the expression vectors and the centroids.
4. Move every expression vector into the cluster with the closest centroid.
5. Define new centroids for the clusters. If the cluster centroids are stable or some other stopping criterion is met, stop the algorithm. Otherwise repeat steps 3-5.
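The iteration above can be sketched in a few lines of Python. This is only an illustrative version under stated assumptions: the random initial assignment uses a fixed seed for reproducibility, an empty cluster is re-seeded with a randomly chosen data point (the slides do not say what to do in that case), and the loop stops when the centroids no longer change or after `max_iter` rounds. The names `k_means`, `sq_dist` and `mean_vector` are hypothetical.

```python
import random

def sq_dist(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean_vector(points):
    """Average vector of a non-empty list of points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]

def k_means(vectors, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial assignment of vectors to clusters
    labels = [rng.randrange(k) for _ in vectors]
    centroids = None
    for _ in range(max_iter):
        # Steps 2 and 5: centroid = average of the vectors in the cluster
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random data point
            new_centroids.append(mean_vector(members) if members
                                 else list(rng.choice(vectors)))
        if new_centroids == centroids:   # centroids are stable: stop
            break
        centroids = new_centroids
        # Steps 3 and 4: move every vector to the cluster with the closest centroid
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
    return labels, centroids

vectors = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = k_means(vectors, k=2)
```

Because the result depends on the initial assignment, running the sketch with several seeds and keeping the clustering with the lowest summed squared distance is a common way to reduce the risk of a poor local optimum.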

Figure 4 [4]: K-means example: 1) Expression vectors are randomly divided into three clusters. 2) Define the centroids. 3) Compute the distances from the expression vectors to the centroids. 4) Compute the new locations of the centroids. 5) Compute the distances from the expression vectors to the centroids. 6) Compute the new locations of the centroids and finish the clustering because the centroids have stabilized. The clusters formed are circled.

- The EM algorithm is based on modelling complex distributions by combining simple Gaussian distributions, one per cluster
- The K-means algorithm is an online approximation of the EM algorithm
- It maximizes the quadratic log-likelihood (minimizes the quadratic distances of data points to their cluster centroids)

- The EM algorithm is used to optimize the center of each cluster (weighted variance is maximal), which means that we find the maximum likelihood estimate for the center of the Gaussian distribution of the cluster
- Some initial guesses have to be made before starting
- number of clusters (k)
- initial centers of clusters

The algorithm is an iterative process that alternates between two steps:

- E-step: the membership probabilities (hidden variables) of each data point for each mixture component (cluster) are Estimated
The maximum likelihood estimate of the mixing coefficient is the sample mean of the conditional probabilities that data point di comes from model k

- M-step: K separate estimation problems, each Maximizing the log-likelihood of component k, weighted by the estimated membership probabilities
- In the M-step the means of the Gaussian distributions are estimated so that they maximize the likelihood of the models
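The two alternating steps can be sketched for a mixture of one-dimensional Gaussians. The restriction to one dimension, the initial variance of 1.0, the equal initial mixing coefficients, the fixed number of iterations and the small variance floor are all assumptions made to keep the example short; the function names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, initial_means, n_iter=50):
    """EM for a 1-D Gaussian mixture; returns (weights, means, variances)."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k            # assumed starting variances
    weights = [1.0 / k] * k          # assumed equal mixing coefficients
    for _ in range(n_iter):
        # E-step: membership probability of each data point for each component
        resp = []
        for x in data:
            p = [weights[c] * gaussian_pdf(x, means[c], variances[c])
                 for c in range(k)]
            total = sum(p)
            resp.append([pc / total for pc in p])
        # M-step: weighted maximum likelihood estimates per component
        for c in range(k):
            nc = sum(r[c] for r in resp)
            # Mixing coefficient = sample mean of the membership probabilities
            weights[c] = nc / len(data)
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / nc
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / nc
            variances[c] = max(var, 1e-6)   # floor avoids a collapsed component
    return weights, means, variances

data = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]
weights, means, variances = em_gmm_1d(data, initial_means=[2.0, 8.0])
```

Unlike k-means, every data point contributes to every component here, weighted by its membership probability; replacing those probabilities with hard 0/1 assignments gives the k-means update.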

[1] Baldi, P. and Hatfield, G. W., DNA Microarrays and Gene Expression, Cambridge University Press, 2002, 73-96.

[2] URL http://www-2.cs.cmu.edu/~zivbj/class04/lecture11.ppt

[3] Eisen, M. B., Spellman, P. T., Brown, P. O. and Botstein, D., Cluster Analysis and Display of Genome-Wide Expression Patterns. Proc Natl Acad Sci U S A, 95 (1998), 14863-8.

[4] Gasch, A. P. and Eisen, M. B., Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biology, 3,11 (2002), 1–22. URL http://citeseer.ist.psu.edu/gasch02exploring.html.