
# Unsupervised Analysis



## PowerPoint Slideshow about ' Unsupervised Analysis' - marika

Presentation Transcript

• Goal A: Find groups of genes that have correlated expression profiles. These genes are believed to belong to the same biological process and/or to be co-regulated.

• Goal B: Divide conditions into groups with similar gene expression profiles. Example: divide drugs according to their effect on gene expression.

Clustering Methods

• Given a set of numeric points in d-dimensional space and an integer k

• The algorithm generates k (or fewer) clusters as follows:

1. Assign all points to a cluster at random

2. Compute the centroid of each cluster

3. Reassign each point to the nearest centroid

4. If the centroids changed, go back to stage 2

Step 1: Make random assignments and compute centroids (big dots)

Step 2: Assign points to nearest centroids

Step 3: Re-compute centroids (in this example, solution is now stable)

K-means: Example, k = 3
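The four stages listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, and the use of squared Euclidean distance are assumptions made for the example.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Cluster d-dimensional points (tuples of floats) into at most k groups."""
    rng = random.Random(seed)
    # Stage 1: assign every point to a cluster at random.
    labels = [rng.randrange(k) for _ in points]
    centroids = {}
    for _ in range(max_iter):
        # Stage 2: compute the centroid of each non-empty cluster.
        # Empty clusters are dropped, which is why fewer than k may remain.
        new_centroids = {}
        for c in set(labels):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
        # Stage 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
        # Stage 3: reassign each point to its nearest centroid
        # (squared Euclidean distance, an assumption of this sketch).
        labels = [
            min(centroids,
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
    return labels, centroids

points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centroids = kmeans(points, k=3)
```

Because the starting assignment is random, different seeds can produce different final clusters, so it is common to run the procedure several times and compare the results.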

• The clusters produced by the k-means procedure are sometimes called "hard" or "crisp" clusters, since any feature vector x either is or is not a member of a particular cluster. This is in contrast to "soft" or "fuzzy" clusters, in which a feature vector x can have a degree of membership in each cluster.

• The fuzzy-k-means procedure allows each feature vector x to have a degree of membership in Cluster i:

• Make initial guesses for the means m1, m2,..., mk

• Until there are no changes in any mean:

• Use the estimated means to find the degree of membership $u(j,i)$ of $x_j$ in Cluster $i$; for example, if $\mathrm{dist}(j,i) = \exp(-\lVert x_j - m_i \rVert^2)$, one might use $u(j,i) = \mathrm{dist}(j,i) / \sum_j \mathrm{dist}(j,i)$

• For i from 1 to k

• Replace mi with the fuzzy mean of all of the examples for Cluster i

• end_for

• end_until
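The loop above can be sketched as follows. Several details are assumptions of this example rather than statements from the slides: memberships are normalized across clusters so that each point's memberships sum to 1 (the usual fuzzy clustering convention), the "fuzzy mean" is taken to be the average of the points weighted by squared membership, and "no changes in any mean" is approximated by a small tolerance `tol`. The function names are hypothetical.

```python
import math

def memberships(points, means):
    """Return u, where u[j][i] is the membership of point j in cluster i."""
    u = []
    for x in points:
        d2 = [sum((a - b) ** 2 for a, b in zip(x, m)) for m in means]
        # Subtracting the smallest distance leaves the ratios unchanged
        # but keeps exp() from underflowing to zero for every cluster.
        d_min = min(d2)
        sim = [math.exp(-(d - d_min)) for d in d2]
        total = sum(sim)
        u.append([s / total for s in sim])
    return u

def fuzzy_kmeans(points, means, tol=1e-9, max_iter=200):
    """Iterate membership and mean updates until the means stop moving."""
    means = [tuple(m) for m in means]
    for _ in range(max_iter):
        u = memberships(points, means)
        new_means = []
        for i, old in enumerate(means):
            # Assumed fuzzy mean: points weighted by squared membership.
            w = [row[i] ** 2 for row in u]
            w_sum = sum(w)
            if w_sum == 0.0:
                new_means.append(old)  # no point supports this cluster
                continue
            new_means.append(tuple(
                sum(wj * x[c] for wj, x in zip(w, points)) / w_sum
                for c in range(len(old))
            ))
        shift = max(abs(a - b)
                    for m, n in zip(means, new_means)
                    for a, b in zip(m, n))
        means = new_means
        if shift <= tol:
            break
    return means, u

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
means, u = fuzzy_kmeans(points, means=[(1.0, 1.0), (9.0, 9.0)])
```

Unlike the hard assignment in k-means, every point contributes to every mean here, only with a weight that shrinks quickly with distance.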

• Gene clustering.

• Given a series of microarray experiments measuring the expression of a set of genes at regular time intervals in a common cell line.

• Normalization allows comparisons across microarrays.

• Produce clusters of genes which vary in similar ways over time.

• Hypothesis: genes which vary in the same way may be co-regulated and/or participate in the same pathway.
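One common way to quantify "vary in similar ways over time" is the Pearson correlation between two time profiles, which ignores differences in baseline and scale. The sketch below is illustrative only; the function name `pearson` and the toy profiles are made up for this example, and the slides do not state which similarity measure was used.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long expression profiles.

    Raises ZeroDivisionError if either profile is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical profiles over four time points.
gene_a = [1.0, 2.0, 4.0, 3.0]
gene_b = [10.0, 20.0, 40.0, 30.0]   # same shape as gene_a, larger scale
gene_c = [4.0, 3.0, 1.0, 2.0]       # mirror image of gene_a

similar = pearson(gene_a, gene_b)    # close to +1
opposite = pearson(gene_a, gene_c)   # close to -1
```

A clustering method can then use a dissimilarity such as one minus the correlation in place of Euclidean distance.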

Sample Array. Rows are genes and columns are time points.

A cluster of co-regulated genes.

Centroid Methods - K-means

• Iterate until centroids are stable

• Assign points to centroids

• Move centroids to the center of their assigned points

Iteration = 3

• Results depend on distance update method

• Greedy iterative process

• Not robust against noise

• No inherent measure to choose the clusters

• Cluster genes and conditions

• Two independent clusterings:

• Genes represented as vectors of expression in all conditions

• Conditions are represented as vectors of expression of all genes
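The two views are the rows and the columns of the same expression matrix, so switching between them is a transpose. A small sketch with a made-up matrix (all values and variable names are hypothetical):

```python
# Toy expression matrix: rows are genes, columns are conditions.
expression = [
    [2.0, 2.1, 0.1],   # gene 1
    [1.9, 2.0, 0.2],   # gene 2
    [0.1, 0.3, 3.0],   # gene 3
    [0.2, 0.1, 2.8],   # gene 4
]

# Gene clustering: each gene is a vector of its expression in all conditions.
genes_as_vectors = [tuple(row) for row in expression]

# Condition clustering: each condition is a vector of the expression of all genes.
conditions_as_vectors = list(zip(*expression))
```

Any clustering routine that accepts a list of vectors can then be run once on `genes_as_vectors` and once on `conditions_as_vectors`.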

1. Identify tissue classes (tumor/normal)

[Figure labels: ribosomal proteins, cytochrome C, metabolism, HLA2]

2. Find differentiating and correlated genes

Two-way Clustering

• Motivation: Only a small subset of genes play a role in a particular biological process; the other genes introduce noise, which may mask the signal of the important players. Only a subset of the samples exhibit the expression patterns of interest.

• New Goal: Use subsets of genes to study subsets of samples (and vice versa)

• A non-trivial task – exponential number of subsets.

• CTWC (Coupled Two-Way Clustering) is a heuristic for this problem.

[Figure: (A) tumor vs. normal samples; (B) Protocol A vs. Protocol B samples]

• Simultaneously test m null hypotheses, one for each gene j

Hj: no association between expression measure of gene j and the response

• Because microarray experiments simultaneously monitor expression levels of thousands of genes, there is a large multiplicity issue

• Increased chance of false positives
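To see how quickly the multiplicity issue grows, consider the simplified case in which all m null hypotheses are true and the tests are independent (an assumption made only for this illustration; gene expression tests are generally correlated). The function names are hypothetical.

```python
def prob_any_false_positive(m, alpha=0.05):
    """Chance of at least one false positive among m independent true nulls."""
    return 1.0 - (1.0 - alpha) ** m

def expected_false_positives(m, alpha=0.05):
    """Expected number of false positives among m true nulls tested at level alpha."""
    return m * alpha

for m in (1, 10, 100, 10000):
    print(m, prob_any_false_positive(m), expected_false_positives(m))
```

With a single test the chance of a false positive equals the test level, but it approaches 1 well before m reaches the thousands of genes measured on a microarray.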

[Table: decision versus truth]

• All probabilities are conditional on which hypotheses are true

• Strong control refers to control of the Type I error rate under any combination of true and false nulls

• Weak control refers to control of the Type I error rate only under the complete null hypothesis (i.e. all nulls true)

• In general, weak control without other safeguards is unsatisfactory

• Test level (e.g. 0.05) does not need to be determined in advance

• Some procedures are most easily described in terms of their adjusted p-values

• Usually easily estimated using resampling

• Procedures can be readily compared based on the corresponding adjusted p-values

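• For hypothesis $H_j$, $j = 1, \dots, m$

observed test statistic: $t_j$

observed unadjusted p-value: $p_j$

• Ordering of observed (absolute) $t_j$: $\{r_j\}$

such that $|t_{r_1}| \ge |t_{r_2}| \ge \dots \ge |t_{r_m}|$

• Ordering of observed $p_j$: $\{r_j\}$

such that $p_{r_1} \le p_{r_2} \le \dots \le p_{r_m}$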

• Denote corresponding RVs by upper case letters (T, P)

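• Bonferroni single-step adjusted p-values

$p_j^* = \min(m\,p_j, 1)$

• Sidak single-step (SS) adjusted p-values

$p_j^* = 1 - (1 - p_j)^m$

• Sidak free step-down (SD) adjusted p-values

$p_{(j)}^* = 1 - (1 - p_{(j)})^{m - j + 1}$, where $p_{(j)} = p_{r_j}$ is the j-th smallest p-value

• Holm (1979) step-down adjusted p-values

$p_{r_j}^* = \max_{k = 1, \dots, j} \{ \min((m - k + 1)\,p_{r_k}, 1) \}$

• Intuitive explanation: once $H_{(1)}$ is rejected by Bonferroni, only m - 1 hypotheses remain that might still be true (then another Bonferroni, etc.)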

• Hochberg (1988) step-up adjusted p-values (Simes inequality)

$p_{r_j}^* = \min_{k = j, \dots, m} \{ \min((m - k + 1)\,p_{r_k}, 1) \}$
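The single-step and stepwise formulas above translate directly into code. The sketch below covers the Bonferroni, Sidak single-step, step-down (running maximum) and Hochberg step-up (running minimum) adjustments; the function names and the example p-values are made up for illustration.

```python
def bonferroni(p):
    """Single-step: min(m * p_j, 1)."""
    m = len(p)
    return [min(m * pj, 1.0) for pj in p]

def sidak_single_step(p):
    """Single-step: 1 - (1 - p_j)^m."""
    m = len(p)
    return [1.0 - (1.0 - pj) ** m for pj in p]

def holm(p):
    """Step-down: running maximum of min((m - k + 1) * p_(k), 1)."""
    m = len(p)
    order = sorted(range(m), key=lambda j: p[j])   # r_1, ..., r_m
    adjusted = [0.0] * m
    running = 0.0
    for k, j in enumerate(order):                  # k = 0 is the smallest p-value
        running = max(running, min((m - k) * p[j], 1.0))
        adjusted[j] = running
    return adjusted

def hochberg(p):
    """Step-up: running minimum of min((m - k + 1) * p_(k), 1), from the largest p down."""
    m = len(p)
    order = sorted(range(m), key=lambda j: p[j])
    adjusted = [0.0] * m
    running = 1.0
    for k in range(m - 1, -1, -1):
        j = order[k]
        running = min(running, min((m - k) * p[j], 1.0))
        adjusted[j] = running
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]   # hypothetical unadjusted p-values
print(bonferroni(raw))
print(holm(raw))
print(hochberg(raw))
```

Each function returns the adjusted p-values in the same order as the input, so a gene is rejected at level α exactly when its adjusted p-value is at most α.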

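• Westfall & Young (1993) step-down minP adjusted p-values

$p_{r_j}^* = \max_{k = 1, \dots, j} \{ \Pr( \min_{l \in \{r_k, \dots, r_m\}} P_l \le p_{r_k} \mid H_0^C ) \}$

• Westfall & Young (1993) step-down maxT adjusted p-values

$p_{r_j}^* = \max_{k = 1, \dots, j} \{ \Pr( \max_{l \in \{r_k, \dots, r_m\}} |T_l| \ge |t_{r_k}| \mid H_0^C ) \}$

Here $H_0^C$ denotes the complete null hypothesis (all nulls true).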

• Step-down procedures: successively smaller adjustments at each step

• Take into account the joint distribution of the test statistics

• Less conservative than Bonferroni, Sidak, Holm, or Hochberg adjusted p-values

• Can be estimated by resampling but computer-intensive (especially for minP)
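A resampling estimate of the step-down maxT adjusted p-values can be sketched as below. This is a simplified illustration under stated assumptions: two sample classes coded 0 and 1, the absolute difference in class means as the test statistic (a t-statistic would be the more usual choice), and random label permutations to approximate the complete null. The names `mean_diff`, `maxt_adjusted`, `n_perm`, and the toy data are hypothetical.

```python
import random

def mean_diff(row, labels):
    """Difference in mean expression between class 1 and class 0."""
    g1 = [x for x, lab in zip(row, labels) if lab == 1]
    g0 = [x for x, lab in zip(row, labels) if lab == 0]
    return sum(g1) / len(g1) - sum(g0) / len(g0)

def maxt_adjusted(data, labels, n_perm=1000, seed=0):
    """Permutation estimate of step-down maxT adjusted p-values, one per gene."""
    rng = random.Random(seed)
    m = len(data)
    t_obs = [abs(mean_diff(row, labels)) for row in data]
    # order[k] is the gene with the (k+1)-th largest observed |t|.
    order = sorted(range(m), key=lambda j: -t_obs[j])
    counts = [0] * m
    perm = list(labels)
    for _ in range(n_perm):
        rng.shuffle(perm)  # permuting labels keeps the class sizes fixed
        t_perm = [abs(mean_diff(row, perm)) for row in data]
        # Successive maxima over genes r_k, ..., r_m, built from the bottom up.
        running = 0.0
        for k in range(m - 1, -1, -1):
            running = max(running, t_perm[order[k]])
            if running >= t_obs[order[k]]:
                counts[k] += 1
    # Enforce monotonicity with a running maximum over k = 1, ..., j.
    adjusted = [0.0] * m
    running = 0.0
    for k, j in enumerate(order):
        running = max(running, counts[k] / n_perm)
        adjusted[j] = running
    return adjusted

# Toy data: rows are genes, columns are 4 + 4 samples.
data = [
    [0.0, 0.0, 0.0, 0.0, 10.0, 10.0, 10.0, 10.0],  # strongly differential
    [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0],      # no class difference
    [1.0, 2.0, 3.0, 4.0, 2.0, 3.0, 4.0, 5.0],      # weak difference
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
adjusted = maxt_adjusted(data, labels)
```

The cost is dominated by recomputing every gene's statistic for every permutation, which is why these procedures are described as computer-intensive.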