Differential analysis

Differential Analysis - PowerPoint PPT Presentation





PowerPoint Slideshow about 'Differential Analysis' - manon


Presentation Transcript

Differential Analysis

Given phenotypically distinct classes, find “markers” that distinguish these classes from one another.

Marker selection

[Figure: marker selection; panel labels: Normal, Tumor]


Gene Marker Selection

Hierarchy of difficulty

Problem                                                  Gene Markers   Error     Example
I.   Tissue or Cell Type (Normal vs. Abnormal)           ~1000-2000     ~0%       Normal vs. Renal carcinoma
II.  Morphological Type                                  ~200-500       ~0-5%     Leukemia ALL vs. AML
III. Morphological Subtype (Multiclass Classification)   ~50-100        ~0-15%    ALL B- vs. T-Cell
IV.  Treatment Outcome (Drug Sensitivity)                ~1-20          ~5-50%    AML Treatment Outcome

Degree of difficulty increases from I to IV.

Adapted from P. Tamayo


Gene Marker Selection

Compute score for each gene

Dataset + phenotype/class labels → compute score (t-test, SNR, etc.) → ranked gene list

T-test:

Signal-to-Noise Ratio (SNR):
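A minimal sketch of the two scores named above, assuming their commonly used forms: a Welch-style t-statistic, and an SNR equal to the difference of class means divided by the sum of class standard deviations. The function name `marker_scores` and the 0/1 label encoding are illustrative assumptions; real tools may also adjust very small standard deviations, which this sketch does not do.

```python
import numpy as np

def marker_scores(X, labels):
    """Per-gene t-score and SNR for a two-class dataset.

    X      : (genes x samples) expression matrix
    labels : array of 0/1 class labels, one per sample (assumed encoding)

    t-score = (mean_A - mean_B) / sqrt(var_A / n_A + var_B / n_B)
    SNR     = (mean_A - mean_B) / (sd_A + sd_B)
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    a, b = X[:, labels == 0], X[:, labels == 1]
    mean_a, mean_b = a.mean(axis=1), b.mean(axis=1)
    # ddof=1 gives the sample standard deviation.
    sd_a, sd_b = a.std(axis=1, ddof=1), b.std(axis=1, ddof=1)
    t = (mean_a - mean_b) / np.sqrt(sd_a**2 / a.shape[1] + sd_b**2 / b.shape[1])
    snr = (mean_a - mean_b) / (sd_a + sd_b)
    return t, snr
```

Sorting genes by either score gives the ranked gene list; large positive values mark genes higher in class 0, large negative values genes higher in class 1.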


Gene Marker Selection: Challenges

  • Small sample size.

  • Each gene tested is a separate hypothesis → increased likelihood of false positives.

  • Gene interaction not taken into account.


Gene Marker Selection: Small Sample Size

  • Generate a 10,000 x 100 matrix from a Gaussian (mean = 0, SD = 0.5)

  • Pick n columns (n = 6, 14, 30, 100)

  • Assign sample labels yellow and green

  • Select top 25 markers for yellow, top 25 markers for green

[Figure: four panels of the selected markers, Yellow vs. Green, for 6, 14, 30, and 100 samples]

With small sample size it is easy to find genes correlated with phenotype
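The experiment above can be sketched as follows. This is an illustration under assumptions, not the code behind the slide: it scores genes with the SNR, splits the chosen columns evenly into yellow and green, and summarizes each run by the mean absolute score of the 50 selected markers. The function name and the fixed seed are hypothetical.

```python
import numpy as np

def top_marker_strength(n_samples, n_genes=10_000, n_top=25, seed=0):
    """Mean |SNR| of the top markers found in pure Gaussian noise."""
    rng = np.random.default_rng(seed)
    # Pure noise: no gene is truly associated with either label.
    X = rng.normal(loc=0.0, scale=0.5, size=(n_genes, 100))[:, :n_samples]
    half = n_samples // 2
    yellow, green = X[:, :half], X[:, half:]
    snr = (yellow.mean(axis=1) - green.mean(axis=1)) / (
        yellow.std(axis=1, ddof=1) + green.std(axis=1, ddof=1))
    order = np.argsort(snr)
    top_green = order[:n_top]     # most negative scores: higher in green
    top_yellow = order[-n_top:]   # most positive scores: higher in yellow
    selected = np.concatenate([top_yellow, top_green])
    return float(np.abs(snr[selected]).mean())

for n in (6, 14, 30, 100):
    print(n, round(top_marker_strength(n), 2))
```

The apparent strength of the "markers" shrinks as samples are added, even though the data contain no signal at all.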


P-value calculation

If a gene's expression values are normally distributed, the t-score follows the t-distribution.

What if they aren’t normally distributed?

Permutation test:

  • Shuffle labels (class membership)

  • Compute score for each gene (t-score, SNR, ...)

  • Repeat many times

  • This gives an empirical null distribution of scores for each gene

  • Compare observed score to empirical distribution

[Figure: distribution of permuted scores for a given gene, with the observed score of the gene marked]

No distributional assumptions are made - compute gene-specific p-values
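A minimal sketch of the permutation test described above, assuming the SNR as the score, a two-sided comparison, and 0/1 class labels. The function name, the number of permutations, and the "+1" correction (which keeps a p-value from being exactly zero) are choices made for this illustration.

```python
import numpy as np

def permutation_pvalues(X, labels, n_perm=1000, seed=0):
    """Gene-specific two-sided permutation p-values for the SNR score."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)

    def snr(lab):
        a, b = X[:, lab == 0], X[:, lab == 1]
        return (a.mean(axis=1) - b.mean(axis=1)) / (
            a.std(axis=1, ddof=1) + b.std(axis=1, ddof=1))

    observed = snr(labels)
    exceed = np.zeros(X.shape[0])
    for _ in range(n_perm):
        # Shuffle class membership and rescore every gene.
        permuted = snr(rng.permutation(labels))
        exceed += np.abs(permuted) >= np.abs(observed)
    # Fraction of permuted scores at least as extreme as the observed one.
    return observed, (exceed + 1) / (n_perm + 1)
```

With `n_perm` permutations the smallest attainable p-value is 1 / (n_perm + 1), so the number of permutations limits how small a p-value can be reported.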


Permutation test and P-value

To determine how significant a gene’s statistical score is:

[Figure: score computed for the “true” classes (known class A and known class B samples), compared with scores computed for Permutation 1, Permutation 2, ..., Permutation n, in which samples are “called” class A or class B]

Generates a “null distribution” of values for this gene

Compare with “real” score for this gene


Marker Selection Process

Dataset + phenotype/class labels → compute score (t-test, SNR, etc.) → measure significance (permutation test) → correct for multiple hypotheses (FDR, FWER, etc.) → ranked gene list → markers


Multiple Hypotheses: What to Control

Bonferroni correction:

  • Most conservative correction

  • Multiplies each p-value by the number of hypotheses (equivalently, divides the significance threshold by the number of hypotheses)

FWER (Family-Wise Error Rate): probability of calling one or more true null hypotheses significant, i.e., of making at least one false positive

FDR (False Discovery Rate): expected proportion of false positives among the hypotheses called significant

Try to reduce the number of hypotheses tested in the first place (i.e. filtering)
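A short sketch of two such corrections, written as adjusted p-values. The slides do not say which FDR procedure is meant; the Benjamini-Hochberg procedure is used here as one widely used choice, and both function names are illustrative.

```python
import numpy as np

def bonferroni(pvals):
    """Bonferroni-adjusted p-values: p * m, capped at 1 (controls FWER)."""
    p = np.asarray(pvals, dtype=float)
    return np.minimum(p * p.size, 1.0)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (controls FDR)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # p_(i) * m / i for the i-th smallest p-value.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.minimum(adjusted, 1.0)
    return out
```

Genes whose adjusted value falls at or below the chosen threshold (for example 0.05) are called significant; the Bonferroni values are never smaller than the Benjamini-Hochberg values for the same input.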



Exercise

ComparativeMarkerSelection Module

  • Choose module:

    • Gene List Selection → ComparativeMarkerSelection

  • Choose input file:

    Next to “input file”, choose “Specify URL”

    View datasets window in Web browser

    Click and drag all_aml_train.preprocessed.gct

  • Choose class file:

    Next to “cls file”, choose “Specify URL”

    View datasets window in Web browser

    Click and drag all_aml_train.cls

  • Click Run


Viewing Analysis Results


Differential Analysis Cookbook

Reduce number of hypotheses/genes by variation filtering (attempt at reducing false negatives)

Choose test statistic (e.g., SNR, t-score, ...)

If enough samples, compute p-values by permutation test (otherwise, compute asymptotic p-values using the standard t-distribution).

Control for Multiple Hypothesis Testing by using the FDR correction

Remember: if you choose FDR ≤ 0.05, you’re willing to accept that about 5% of the genes called significant are false positives.

If number of significant hypotheses/genes “too large” even for very small threshold values, either:

use the maxT correction (possible w/ empirical p-values only).

use additional criteria (e.g., min fold-change, min expression value, etc.)
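The variation filtering and additional criteria mentioned in this cookbook can be sketched as a simple row filter. This is an illustration only: the function name and the threshold values (`floor`, `min_fold`, `min_delta`) are placeholders to be tuned per dataset, not settings taken from the slides.

```python
import numpy as np

def variation_filter(X, floor=20.0, min_fold=3.0, min_delta=100.0):
    """Boolean mask of genes that vary enough across samples to be tested.

    X is a (genes x samples) matrix of non-log expression values.
    """
    # Raise very low values to a floor so ratios are not dominated by noise.
    X = np.maximum(np.asarray(X, dtype=float), floor)
    hi, lo = X.max(axis=1), X.min(axis=1)
    # Keep genes with both a large relative and a large absolute range.
    return (hi / lo >= min_fold) & (hi - lo >= min_delta)
```

Applying the mask before scoring (`X[keep]`) reduces the number of hypotheses that the multiple-testing correction has to account for.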



Differential Analysis: GenePattern modules

  • Create expression data set – ExpressionFileCreator

  • Reduce number of hypotheses/genes by variation filtering – PreprocessDataset

  • Make class file

  • Run Differential Analysis – ComparativeMarkerSelection

    • Choose test statistic (say, t-score)

    • If enough samples, compute p-values by permutation test (otherwise, use asymptotic test)

    • Control for multiple hypothesis testing (MHT) by using the FDR correction

  • View results with ComparativeMarkerSelectionViewer

  • Use HeatMapViewer to view results for top genes

  • Use GSEA to find gene sets (or pathways) that are enriched in your dataset


Working with Samples and Features


Overview

  • Extracting a set of samples

  • Computing co-expressed genes

  • Converting probe set ids to gene names

  • Computing overlap between gene sets


Working with Samples and Features

  • From a combined dataset of cancer and normal samples, select the normal samples.

  • Within the normal samples, find the genes coexpressed with LRPPRC (Affymetrix probe M92439_at), a gene with mitochondrial function.

  • Compare these genes and those coexpressed with LRPPRC in another expression dataset to determine the coexpressed genes common to both datasets.
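The last two steps can be sketched as follows, assuming Pearson correlation as the similarity measure and a fixed number of neighbors. This is not the implementation of any GenePattern module; the function name, `n_neighbors`, and the variable names in the usage note are hypothetical.

```python
import numpy as np

def coexpressed_genes(X, gene_names, target, n_neighbors=50):
    """Names of the genes most correlated with `target` across samples."""
    X = np.asarray(X, dtype=float)
    idx = gene_names.index(target)
    # After centering each gene, Pearson r is a normalized dot product.
    Z = X - X.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(Z, axis=1)
    r = (Z @ Z[idx]) / (norms * norms[idx])
    r[idx] = -np.inf  # exclude the target gene itself
    best = np.argsort(r)[::-1][:n_neighbors]
    return {gene_names[i] for i in best}
```

Given two datasets with gene-level names, the genes common to both neighbor lists are a set intersection, for example `coexpressed_genes(X1, names1, "LRPPRC") & coexpressed_genes(X2, names2, "LRPPRC")`. Genes with constant expression have zero norm and would need to be filtered out first.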

[Workflow figure: GCM_Total.res → SelectFeaturesColumns → GCM_Normals.res → GeneNeighbors → GCM_Normals.markerdata.gct and GCM_Normals.markerlist.odf (GeneListSignificanceViewer); CollapseDataset → GCM_Total_Normals.markerdata.collapsed.gct → ExtractRowNames → GCM_Total_Normals.markerdata.collapsed.row.names.txt → VennDiagram]


Exercise