Unsupervised Analysis

  • Goal A: Find groups of genes that have correlated expression profiles. These genes are believed to belong to the same biological process and/or are co-regulated.

  • Goal B: Divide conditions into groups with similar gene expression profiles. Example: divide drugs according to their effect on gene expression.

Clustering Methods

K-means: The Algorithm

  • Given a set of numeric points in d-dimensional space, and an integer k

  • Algorithm generates k (or fewer) clusters as follows:

    • Assign all points to a cluster at random

    • Compute centroid for each cluster

    • Reassign each point to nearest centroid

    • If any centroid changed, go back to step 2 (compute centroids)
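
The four steps above can be sketched in plain Python. This is a minimal illustration, not a tuned implementation: the names `kmeans` and `sq_dist`, the squared Euclidean distance, the `seed` argument and the `max_iter` safety cap are all assumptions that do not appear on the slide. Clusters that end up empty are dropped, which is why the result can have fewer than k clusters.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points (tuples of floats)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed=0, max_iter=100):
    """Return (labels, centroids); centroids maps cluster id -> centroid tuple."""
    rng = random.Random(seed)
    # Step 1: assign every point to a cluster at random.
    labels = [rng.randrange(k) for _ in points]
    centroids = {}
    for _ in range(max_iter):  # max_iter is a safety cap (assumption)
        # Step 2: compute the centroid of each non-empty cluster.
        new_centroids = {}
        for c in set(labels):
            members = [p for p, l in zip(points, labels) if l == c]
            new_centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
        # Step 3: reassign each point to its nearest centroid.
        labels = [min(centroids, key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
    return labels, centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=3)
print(labels, centroids)
```

Because the starting assignment is random, different seeds can lead to different final clusterings.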

K-means: Example, k = 3

Step 1: Make random assignments and compute centroids (big dots)

Step 2: Assign points to nearest centroids

Step 3: Re-compute centroids (in this example, solution is now stable)

Fuzzy K means

  • The clusters produced by the k-means procedure are sometimes called "hard" or "crisp" clusters, since any feature vector x either is or is not a member of a particular cluster. This is in contrast to "soft" or "fuzzy" clusters, in which a feature vector x can have a degree of membership in each cluster.

  • The fuzzy-k-means procedure allows each feature vector x to have a degree of membership in Cluster i:

Fuzzy K means Algorithm

  • Make initial guesses for the means m1, m2,..., mk

  • Until there are no changes in any mean:

    • Use the estimated means to find the degree of membership u(j,i) of xj in Cluster i; for example, if $\mathrm{dist}(j,i) = \exp(-\lVert x_j - m_i \rVert^2)$, one might use $u(j,i) = \mathrm{dist}(j,i) / \sum_j \mathrm{dist}(j,i)$

    • For i from 1 to k

      • Replace mi with the fuzzy mean of all of the examples for Cluster i

    • end_for

  • end_until
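
A rough Python sketch of this loop follows. Several choices are assumptions rather than part of the slide: the function name `fuzzy_kmeans`, normalising the memberships over clusters so that each point's memberships sum to 1, taking the fuzzy mean as the average of all points weighted by u(j,i) itself (other variants raise the membership to a power), and replacing "no changes in any mean" with a small tolerance `tol`. The exponential can underflow for points far from every mean, so the sketch is only suitable for modestly scaled data.

```python
import math

def fuzzy_kmeans(points, means, tol=1e-9, max_iter=200):
    """points, means: lists of equal-length tuples of floats."""
    means = [tuple(m) for m in means]
    dims = range(len(points[0]))
    for _ in range(max_iter):
        # Membership u[j][i] of point j in cluster i, from exp(-||x_j - m_i||^2),
        # normalised (assumption) so that each point's memberships sum to 1.
        u = []
        for x in points:
            a = [math.exp(-sum((xd - md) ** 2 for xd, md in zip(x, m)))
                 for m in means]
            total = sum(a)
            u.append([v / total for v in a])
        # Fuzzy mean of cluster i: all points, weighted by their membership.
        new_means = []
        for i in range(len(means)):
            w = [row[i] for row in u]
            new_means.append(tuple(
                sum(wj * x[d] for wj, x in zip(w, points)) / sum(w)
                for d in dims))
        shift = max(abs(a - b)
                    for m, n in zip(means, new_means) for a, b in zip(m, n))
        means = new_means
        if shift < tol:  # stand-in for "no changes in any mean"
            break
    return means, u

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (4.0, 4.0), (4.0, 5.0), (5.0, 4.0)]
means, u = fuzzy_kmeans(points, [(0.0, 0.0), (5.0, 5.0)])
print(means)
```

Unlike the hard assignment in k-means, every point contributes to every mean here, only with a weight that shrinks quickly with distance.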

K-means: Sample Application

  • Gene clustering.

    • Given a series of microarray experiments measuring the expression of a set of genes at regular time intervals in a common cell line.

    • Normalization allows comparisons across microarrays.

    • Produce clusters of genes which vary in similar ways over time.

    • Hypothesis: genes which vary in the same way may be co-regulated and/or participate in the same pathway.

Sample Array. Rows are genes and columns are time points.

A cluster of co-regulated genes.

Centroid Methods - K-means

  • Start with random positions for the K centroids.

  • Iterate until centroids are stable

    • Assign points to centroids

    • Move centroids to center of assigned points

Iteration = 3

Agglomerative Hierarchical Clustering

  • Results depend on distance update method

    • Single linkage: elongated clusters

    • Complete linkage: sphere-like clusters

  • Greedy iterative process

  • Not robust against noise

  • No inherent measure to choose the clusters
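
To make the effect of the distance update method concrete, here is a small, naive sketch of the greedy merging process. The function name `agglomerative`, the Euclidean metric and the `n_clusters` stopping rule are assumptions for illustration; the need for such a user-supplied stopping rule reflects the last bullet above.

```python
import math

def agglomerative(points, n_clusters, linkage="single"):
    """Greedily merge the two closest clusters until n_clusters remain.

    Returns clusters as sorted lists of point indices.
    """
    # Single linkage uses the closest pair of members, complete the farthest.
    pick = min if linkage == "single" else max
    clusters = [[i] for i in range(len(points))]

    def cluster_dist(a, b):
        return pick(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Greedy step: find the closest pair of clusters and merge them.
        x, y = min(
            ((x, y) for x in range(len(clusters))
             for y in range(x + 1, len(clusters))),
            key=lambda xy: cluster_dist(clusters[xy[0]], clusters[xy[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]

# Points on a line with slowly growing gaps (hypothetical data).
points = [(0.0,), (2.0,), (4.1,), (6.3,), (8.6,)]
print(agglomerative(points, 2, "single"))    # [[0, 1, 2, 3], [4]]
print(agglomerative(points, 2, "complete"))  # [[0, 1], [2, 3, 4]]
```

On this data single linkage chains neighbouring points into one long cluster, while complete linkage keeps both clusters compact.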

Gene Expression Data

  • Cluster genes and conditions

  • Two independent clusterings:

    • Genes represented as vectors of expression in all conditions

    • Conditions are represented as vectors of expression of all genes
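
The two views are simply the rows and the columns of the same expression matrix. A tiny sketch, with made-up numbers and hypothetical variable names:

```python
# Rows are genes, columns are conditions (values are invented for illustration).
expression = [
    [2.0, 0.5, 1.5],  # gene g1
    [1.9, 0.4, 1.6],  # gene g2
    [0.2, 3.0, 0.1],  # gene g3
    [0.3, 2.8, 0.2],  # gene g4
]

# Clustering genes: each gene is a vector of its expression in all conditions.
gene_vectors = [list(row) for row in expression]

# Clustering conditions: each condition is a vector of all genes' expression.
condition_vectors = [list(col) for col in zip(*expression)]

print(len(gene_vectors), len(condition_vectors))  # 4 3
```

Any clustering routine can then be run once on `gene_vectors` and once on `condition_vectors`.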

First clustering - Experiments

1. Identify tissue classes (tumor/normal)

Second Clustering - Genes

Ribosomal proteins

Cytochrome C



2. Find Differentiating and Correlated Genes

Two-way Clustering

Coupled Two-way Clustering (CTWC)

  • Motivation: Only a small subset of genes play a role in a particular biological process; the other genes introduce noise, which may mask the signal of the important players. Only a subset of the samples exhibit the expression patterns of interest.

  • New Goal: Use subsets of genes to study subsets of samples (and vice versa)

  • A non-trivial task – exponential number of subsets.

  • CTWC is a heuristic to solve this problem.

CTWC of Colon Cancer Data




Protocol A

Protocol B


Multiple Testing Problem

  • Simultaneously test m null hypotheses, one for each gene j

    Hj: no association between expression measure of gene j and the response

  • Because microarray experiments simultaneously monitor expression levels of thousands of genes, there is a large multiplicity issue

  • Increased chance of false positives
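
A quick calculation shows how fast the multiplicity issue grows. The sketch assumes the m tests are independent and that every null hypothesis is true, neither of which is stated on the slide; the function name is hypothetical.

```python
def prob_any_false_positive(m, alpha=0.05):
    """P(at least one false positive) for m independent tests at level alpha,
    assuming all m null hypotheses are true."""
    return 1.0 - (1.0 - alpha) ** m

for m in (1, 10, 100, 1000):
    print(m, prob_any_false_positive(m))
```

With ten tests the chance of at least one false positive is already about 40%, and with a hundred it exceeds 99%.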

Strong Vs. Weak Control

  • All probabilities are conditional on which hypotheses are true

  • Strong control refers to control of the Type I error rate under any combination of true and false nulls

  • Weak control refers to control of the Type I error rate only under the complete null hypothesis (i.e. all nulls true)

  • In general, weak control without other safeguards is unsatisfactory

Adjusted p-values (p*)

  • Test level (e.g. 0.05) does not need to be determined in advance

  • Some procedures are most easily described in terms of their adjusted p-values

  • Usually easily estimated using resampling

  • Procedures can be readily compared based on the corresponding adjusted p-values

A Little Notation

  • For hypothesis $H_j$, $j = 1, \dots, m$

    observed test statistic: $t_j$

    observed unadjusted p-value: $p_j$

  • Ordering of observed (absolute) $t_j$: $\{r_j\}$

    such that $|t_{r_1}| \ge |t_{r_2}| \ge \dots \ge |t_{r_m}|$

  • Ordering of observed $p_j$: $\{r_j\}$

    such that $p_{r_1} \le p_{r_2} \le \dots \le p_{r_m}$

  • Denote corresponding RVs by upper case letters (T, P)

Control of the type I errors

  • Bonferroni single-step adjusted p-values

    $p_j^* = \min(m p_j, 1)$

  • Sidak single-step (SS) adjusted p-values

    $p_j^* = 1 - (1 - p_j)^m$

  • Sidak free step-down (SD) adjusted p-values

    $p_{(j)}^* = 1 - (1 - p_{(j)})^{m - j + 1}$, where $p_{(j)}$ is the j-th smallest p-value
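
The three adjustments can be sketched directly from the formulas. Function names are hypothetical. One step is an assumption beyond the slide: the step-down version takes a running maximum over the ordered p-values so that the adjusted values stay monotone, as the Holm formula does explicitly.

```python
def bonferroni(p):
    """Single-step Bonferroni: min(m * p_j, 1)."""
    m = len(p)
    return [min(m * pj, 1.0) for pj in p]

def sidak_single_step(p):
    """Single-step Sidak: 1 - (1 - p_j)^m."""
    m = len(p)
    return [1.0 - (1.0 - pj) ** m for pj in p]

def sidak_step_down(p):
    """Step-down Sidak: 1 - (1 - p_(j))^(m - j + 1) on the sorted p-values,
    with a running maximum (assumption) to keep the result monotone."""
    m = len(p)
    order = sorted(range(m), key=lambda j: p[j])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, j in enumerate(order):  # rank is 0-based, so m - rank = m - j + 1
        running_max = max(running_max, 1.0 - (1.0 - p[j]) ** (m - rank))
        adjusted[j] = running_max
    return adjusted

p = [0.01, 0.04, 0.03, 0.5]
print(bonferroni(p))
print(sidak_single_step(p))
print(sidak_step_down(p))
```

For any input, the Sidak single-step values are no larger than the Bonferroni values, and the step-down values are no larger than the single-step ones.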

Control of the type I errors

  • Holm (1979) step-down adjusted p-values

    $p_{r_j}^* = \max_{k = 1, \dots, j} \{\min((m - k + 1)\, p_{r_k}, 1)\}$

    • Intuitive explanation: once $H_{(1)}$ is rejected by Bonferroni, only m-1 hypotheses remain that might still be true (then another Bonferroni step, and so on)

  • Hochberg (1988) step-up adjusted p-values (Simes inequality)

    $p_{r_j}^* = \min_{k = j, \dots, m} \{\min((m - k + 1)\, p_{r_k}, 1)\}$
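
Both formulas reduce to one pass over the sorted p-values: Holm keeps a running maximum from the smallest p-value upwards, Hochberg a running minimum from the largest downwards. A minimal sketch with hypothetical function names:

```python
def holm(p):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p)
    order = sorted(range(m), key=lambda j: p[j])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, j in enumerate(order):  # rank = k - 1, so m - rank = m - k + 1
        running_max = max(running_max, min((m - rank) * p[j], 1.0))
        adjusted[j] = running_max
    return adjusted

def hochberg(p):
    """Hochberg step-up adjusted p-values, returned in the input order."""
    m = len(p)
    order = sorted(range(m), key=lambda j: p[j])
    adjusted = [0.0] * m
    running_min = 1.0  # starting at 1 also applies the cap min(..., 1)
    for rank in range(m - 1, -1, -1):
        j = order[rank]
        running_min = min(running_min, (m - rank) * p[j])
        adjusted[j] = running_min
    return adjusted

p = [0.01, 0.04, 0.03, 0.5]
print([round(v, 4) for v in holm(p)])      # [0.04, 0.09, 0.09, 0.5]
print([round(v, 4) for v in hochberg(p)])  # [0.04, 0.08, 0.08, 0.5]
```

On this example Hochberg is slightly less conservative than Holm for the two middle p-values.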

Control of the type I errors

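  • Westfall & Young (1993) step-down minP adjusted p-values

    $p_{r_j}^* = \max_{k = 1, \dots, j} \{\Pr(\min_{l \in \{r_k, \dots, r_m\}} P_l \le p_{r_k} \mid H_0^C)\}$

  • Westfall & Young (1993) step-down maxT adjusted p-values

    $p_{r_j}^* = \max_{k = 1, \dots, j} \{\Pr(\max_{l \in \{r_k, \dots, r_m\}} |T_l| \ge |t_{r_k}| \mid H_0^C)\}$

    ($H_0^C$ denotes the complete null hypothesis, i.e. all nulls true)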

Westfall & Young (1993) Adjusted p-values

  • Step-down procedures: successively smaller adjustments at each step

  • Take into account the joint distribution of the test statistics

  • Less conservative than Bonferroni, Sidak, Holm, or Hochberg adjusted p-values

  • Can be estimated by resampling but computer-intensive (especially for minP)
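
As a rough illustration of the resampling estimate, the sketch below computes step-down maxT adjusted p-values for a two-group comparison. It rests on several assumptions that are not in the slides: the test statistic is the absolute difference of group means rather than a t-statistic, the complete null distribution is approximated by randomly permuting the sample labels, and the names `mean_diff`, `maxt_step_down`, `n_perm` and `seed` are hypothetical.

```python
import random

def mean_diff(values, labels):
    """Absolute difference between the means of group 1 and group 0."""
    g1 = [v for v, l in zip(values, labels) if l == 1]
    g0 = [v for v, l in zip(values, labels) if l == 0]
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))

def maxt_step_down(data, labels, n_perm=1000, seed=0):
    """data[j] is the expression profile of gene j across all samples."""
    rng = random.Random(seed)
    m = len(data)
    observed = [mean_diff(row, labels) for row in data]
    # order[k] is r_(k+1): genes sorted by decreasing observed statistic.
    order = sorted(range(m), key=lambda j: -observed[j])
    counts = [0] * m
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)  # permuted labels stand in for the complete null
        stats = [mean_diff(row, perm) for row in data]
        # Successive maxima over {r_k, ..., r_m}, built from the bottom up.
        running = 0.0
        for k in range(m - 1, -1, -1):
            running = max(running, stats[order[k]])
            if running >= observed[order[k]]:
                counts[k] += 1
    # Enforce monotonicity with the outer max over k = 1, ..., j.
    adjusted = [0.0] * m
    running_max = 0.0
    for k in range(m):
        running_max = max(running_max, counts[k] / n_perm)
        adjusted[order[k]] = running_max
    return adjusted

labels = [0, 0, 0, 0, 1, 1, 1, 1]
data = [
    [0, 0, 0, 0, 10, 10, 10, 10],  # clearly differential (invented values)
    [1, 2, 1, 2, 1, 2, 1, 2],      # no group difference
    [3, 1, 2, 2, 2, 3, 1, 2],      # no group difference
]
print(maxt_step_down(data, labels))
```

Each permutation recomputes the statistic for every gene, which is where the computational cost noted above comes from; the minP variant additionally needs a p-value for every permuted statistic.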