## PowerPoint Slideshow about 'Differential Analysis' - manon


### Working with Samples and Features

### Exercise

Given phenotypically distinct classes, find “markers” that distinguish these classes from one another

Differential Analysis: Marker Selection

[Figure: Normal vs. Tumor samples]

Gene Marker Selection

Hierarchy of difficulty

| Problem | Gene Markers | Error | Example |
|---|---|---|---|
| I. Tissue or Cell Type (Normal vs. Abnormal) | ~1000-2000 | ~0% | Normal vs. Renal carcinoma |
| II. Morphological Type | ~200-500 | ~0-5% | Leukemia ALL vs. AML |
| III. Morphological Subtype (Multiclass Classification) | ~50-100 | ~0-15% | ALL B- vs. T-Cell |
| IV. Treatment Outcome (Drug Sensitivity) | ~1-20 | ~5-50% | AML Treatment Outcome |

Degree of difficulty increases from I to IV.

adapted from P. Tamayo

Gene Marker Selection

Compute score for each gene

[Diagram: Dataset + Phenotype/class labels → Compute score (t-test, SNR, etc.) → Score → Ranked gene list]

T-test:

Signal-to-Noise Ratio (SNR):
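A minimal sketch of the two scores, assuming the commonly used definitions $t = (\mu_A - \mu_B)/\sqrt{\sigma_A^2/n_A + \sigma_B^2/n_B}$ and $\mathrm{SNR} = (\mu_A - \mu_B)/(\sigma_A + \sigma_B)$, where $\mu$, $\sigma$ and $n$ are the per-class mean, standard deviation and sample count. The function names and the toy data are illustrative only.

```python
import numpy as np

def t_score(a, b):
    """Unequal-variance t-score per gene; a and b are (genes x samples) arrays."""
    var_a = a.var(axis=1, ddof=1) / a.shape[1]
    var_b = b.var(axis=1, ddof=1) / b.shape[1]
    return (a.mean(axis=1) - b.mean(axis=1)) / np.sqrt(var_a + var_b)

def snr(a, b):
    """Signal-to-noise ratio per gene: difference of means over sum of SDs."""
    return (a.mean(axis=1) - b.mean(axis=1)) / (
        a.std(axis=1, ddof=1) + b.std(axis=1, ddof=1)
    )

# Toy data (made up): 2 genes x 6 samples, first 3 samples in class 0.
expr = np.array([[1.0, 2.0, 3.0, 7.0, 8.0, 9.0],
                 [5.0, 6.0, 4.0, 5.0, 4.0, 6.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

class_a, class_b = expr[:, labels == 0], expr[:, labels == 1]
t = t_score(class_a, class_b)
s = snr(class_a, class_b)

# Rank genes by absolute score, strongest marker first.
ranked = np.argsort(-np.abs(t))
print(t, s, ranked)
```

Gene 0 differs strongly between the classes and ranks first; gene 1 has identical class means and scores zero on both measures.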

Each gene tested is a separate hypothesis, which increases the likelihood of false positives.

Gene interaction not taken into account.

Gene Marker Selection: Challenges

Gene Marker Selection

Small Sample Size

- Generate a 10,000x100 matrix from a Gaussian (mean=0, SD=0.5)
- Pick n columns (n = 6, 14, 30, 100)
- Assign sample labels yellow and green
- Select top 25 markers for yellow, top 25 markers for green

[Figure: four panels of top markers, Yellow vs. Green, for 6, 14, 30, and 100 samples]

With small sample size it is easy to find genes correlated with phenotype
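The experiment above can be sketched as follows. This is an illustration, not the code behind the slide: the seed, the even yellow/green split, and the use of SNR as the score are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
# 10,000 "genes" x 100 "samples" of pure noise: no gene is a real marker.
data = rng.normal(loc=0.0, scale=0.5, size=(10_000, 100))

def snr(a, b):
    return (a.mean(axis=1) - b.mean(axis=1)) / (
        a.std(axis=1, ddof=1) + b.std(axis=1, ddof=1)
    )

mean_top_score = {}
for n in (6, 14, 30, 100):
    cols = rng.choice(data.shape[1], size=n, replace=False)
    yellow = data[:, cols[: n // 2]]   # first half labeled yellow
    green = data[:, cols[n // 2:]]     # second half labeled green
    scores = snr(yellow, green)
    order = np.argsort(scores)
    top_green, top_yellow = order[:25], order[-25:]
    selected = np.concatenate([top_green, top_yellow])
    mean_top_score[n] = np.abs(scores[selected]).mean()
    print(f"n={n:3d}  mean |SNR| of the 50 selected markers: "
          f"{mean_top_score[n]:.2f}")
```

Because the data are random, every selected "marker" is a false positive. The selected scores look far more convincing with 6 samples than with 100.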

If a gene's expression values are normally distributed, the t-score follows the t-distribution

What if they aren't normally distributed?

Permutation Test:

- Shuffle labels (class membership)
- Compute score for each gene (t-score, SNR, ...)
- Repeat many times
- Obtain an empirical null distribution of scores for each gene
- Compare observed score to empirical distribution

[Figure: distribution of permuted scores for a given gene, with the observed score of the gene marked; x-axis: scores]

P-value calculation: no distributional assumptions are made; gene-specific p-values are computed.
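A minimal sketch of a gene-specific permutation p-value. Several choices here are assumptions rather than the slide's method: SNR as the score, a two-sided comparison on absolute scores, 1000 permutations, and the add-one correction that keeps p-values above zero. The data are simulated.

```python
import numpy as np

def snr(expr, labels):
    a, b = expr[:, labels == 0], expr[:, labels == 1]
    return (a.mean(axis=1) - b.mean(axis=1)) / (
        a.std(axis=1, ddof=1) + b.std(axis=1, ddof=1)
    )

def permutation_pvalues(expr, labels, n_perm=1000, seed=0):
    """Per-gene p-value: share of label shuffles scoring at least as extreme."""
    rng = np.random.default_rng(seed)
    observed = np.abs(snr(expr, labels))
    exceed = np.zeros(expr.shape[0])
    for _ in range(n_perm):
        shuffled = rng.permutation(labels)          # shuffle class membership
        exceed += np.abs(snr(expr, shuffled)) >= observed
    return (exceed + 1) / (n_perm + 1)

# Simulated data: 50 noise genes, 10 samples per class.
rng = np.random.default_rng(1)
labels = np.array([0] * 10 + [1] * 10)
expr = rng.normal(size=(50, 20))
expr[0, labels == 1] += 3.0   # make gene 0 a genuine marker

pvals = permutation_pvalues(expr, labels)
print("gene 0 p-value:", pvals[0])
```

Each gene is compared only with its own permuted scores, so no assumption about the shape of its distribution is needed.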

Permutation test and P-value

To determine how significant a gene’s statistical score is

[Figure: known class A and known class B samples; the score is computed for the “true” classes and again for Permutation 1, Permutation 2, ..., Permutation n, each of which reassigns samples to “called” class A and “called” class B]

Generates a “null distribution” of values for this gene

Compare with “real” score for this gene

Marker Selection Process

[Diagram: Dataset + Phenotype/class labels → Compute score (t-test, SNR, etc.) → Measure significance (permutation test) → Correct for multiple hypotheses (FDR, FWER, etc.) → Markers. Outputs along the way: score, measure of significance, ranked gene list.]

Bonferroni Correction:

Most conservative correction

Divides the significance threshold by the number of hypotheses (equivalently, multiplies each p-value by the number of hypotheses)

FWER (Family-Wise Error Rate): probability of calling one or more hypotheses significant given that they are all null

FDR (False Discovery Rate): expected proportion of results called significant for which the null hypothesis is actually true

Try to reduce the number of hypotheses tested in the first place (i.e. filtering)

Multiple Hypotheses: What to Control
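To make the difference between the two kinds of control concrete, here is a small sketch of a Bonferroni adjustment (controls FWER) next to a Benjamini-Hochberg adjustment (controls FDR). Benjamini-Hochberg is assumed as the FDR procedure; the slides do not name one. The p-values are made up.

```python
import numpy as np

def bonferroni(pvals):
    """Multiply each p-value by the number of hypotheses, capped at 1."""
    p = np.asarray(pvals, dtype=float)
    return np.minimum(p * p.size, 1.0)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)   # p_(i) * m / i
    # Enforce monotonicity, working down from the largest p-value.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.minimum(adjusted, 1.0)
    return out

pvals = np.array([0.001, 0.008, 0.02, 0.03, 0.60])  # illustrative values
bonf = bonferroni(pvals)
fdr = benjamini_hochberg(pvals)

print("Bonferroni significant at 0.05:", int((bonf <= 0.05).sum()))
print("FDR significant at 0.05:       ", int((fdr <= 0.05).sum()))
```

On these five p-values, Bonferroni keeps two hypotheses at the 0.05 level and the FDR adjustment keeps four, which shows why Bonferroni is described as the most conservative choice.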

Exercise

ComparativeMarkerSelection Module

- Choose module:
Gene List Selection > ComparativeMarkerSelection

- Choose input file:
Next to “input file”, choose “Specify URL”

View datasets window in Web browser

Click and drag all_aml_train.preprocessed.gct

- Choose class file:
Next to “cls file”, choose “Specify URL”

View datasets window in Web browser

Click and drag all_aml_train.cls

- Click Run
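The two input files in this exercise are plain text. Below is a minimal sketch of reading them outside GenePattern. It assumes the standard GCT (version 1.2) and CLS layouts, tab-separated GCT fields, and no missing values. The function names `read_gct` and `read_cls` and the toy file contents are hypothetical.

```python
import io
import numpy as np

def read_gct(handle):
    """Return (row_names, sample_names, data) from a GCT 1.2 file handle."""
    handle.readline()                                  # version line, "#1.2"
    n_rows, n_cols = (int(x) for x in handle.readline().split()[:2])
    header = handle.readline().rstrip("\n").split("\t")
    samples = header[2:]                               # skip Name, Description
    names, rows = [], []
    for line in handle:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        names.append(fields[0])
        rows.append([float(x) for x in fields[2:]])
    data = np.array(rows)
    if data.shape != (n_rows, n_cols):
        raise ValueError("dimensions do not match the GCT header")
    return names, samples, data

def read_cls(handle):
    """Return (class_names, labels) from a categorical CLS file handle."""
    n_samples, n_classes = (int(x) for x in handle.readline().split()[:2])
    class_names = handle.readline().lstrip("#").split()
    labels = handle.readline().split()
    if len(labels) != n_samples or len(class_names) != n_classes:
        raise ValueError("counts do not match the CLS header")
    return class_names, labels

# Toy stand-ins for the real files (contents invented for illustration).
gct_text = ("#1.2\n2\t4\nName\tDescription\tS1\tS2\tS3\tS4\n"
            "g1_at\tgene one\t1.0\t2.0\t3.0\t4.0\n"
            "g2_at\tgene two\t5.0\t6.0\t7.0\t8.0\n")
cls_text = "4 2 1\n# ALL AML\n0 0 1 1\n"

names, samples, data = read_gct(io.StringIO(gct_text))
class_names, labels = read_cls(io.StringIO(cls_text))
print(names, samples, data.shape, class_names, labels)
```

With real files, pass `open(path)` instead of the `io.StringIO` objects.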

Viewing Analysis Results

Reduce number of hypotheses/genes by variation filtering (attempt at reducing false negatives)

Choose test statistic (e.g., SNR, t-score, ...)

If enough samples, compute p-values by permutation test (otherwise, compute asymptotic test using the standard t-distribution).

Control for Multiple Hypothesis Testing by using the FDR correction

Remember: if you choose FDR ≤ 0.05, you’re willing to accept that up to 5% of the genes called significant are false positives.

If the number of significant hypotheses/genes is “too large” even for very small threshold values, either:

- use the maxT correction (possible with empirical p-values only), or
- use additional criteria (e.g., min fold-change, min expression value, etc.)
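As a rough illustration of variation filtering and of an additional fold-change criterion, the sketch below works on made-up data. The thresholds (`min_fold`, `min_delta`, and the class fold-change cutoff of 2) are hypothetical, and the code assumes strictly positive expression values. It does not reproduce PreprocessDataset.

```python
import numpy as np

def variation_filter(expr, min_fold=3.0, min_delta=100.0):
    """Keep genes whose max/min ratio and max-min range pass both thresholds."""
    hi, lo = expr.max(axis=1), expr.min(axis=1)
    return (hi / lo >= min_fold) & (hi - lo >= min_delta)

def class_fold_change(expr, labels):
    """Fold change between class means, always reported as >= 1."""
    a = expr[:, labels == 0].mean(axis=1)
    b = expr[:, labels == 1].mean(axis=1)
    return np.maximum(a, b) / np.minimum(a, b)

# Made-up data: 3 genes x 4 samples, two samples per class.
expr = np.array([[100.0, 110.0, 105.0, 95.0],    # nearly flat
                 [50.0, 60.0, 400.0, 500.0],     # varies with class
                 [20.0, 1000.0, 25.0, 900.0]])   # varies, but not by class
labels = np.array([0, 0, 1, 1])

keep = variation_filter(expr)
selected = keep & (class_fold_change(expr, labels) >= 2.0)
print("pass variation filter:", keep)
print("also pass class fold-change:", selected)
```

The flat gene is dropped before any test is run, so it never counts as a hypothesis. The third gene passes the variation filter but fails the class fold-change criterion.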

Differential Analysis Cookbook

Create expression data set – ExpressionFileCreator

Reduce number of hypotheses/genes by variation filtering – PreprocessDataset

Make class file

Run Differential Analysis – ComparativeMarkerSelection

Choose test statistic (say, t-score)

View results with ComparativeMarkerSelectionViewer

If enough samples, compute p-values by permutation test (otherwise, use asymptotic test).

Control for multiple hypothesis testing (MHT) by using the FDR correction

Use HeatMapViewer to view results for top genes

Use GSEA to find gene sets (or pathways) that are enriched in your dataset.

Differential Analysis: GenePattern modules

Extracting a set of samples

Computing co-expressed genes

Converting probe set ids to gene names

Computing overlap between gene sets

Overview: Working with Samples and Features

- From a combined dataset of cancer and normal samples, select the normal samples.
- Within the normal samples, find the genes coexpressed with LRPPRC (Affymetrix probe M92439_at), a gene with mitochondrial function.
- Compare these genes and those coexpressed with LRPPRC in another expression dataset to determine the coexpressed genes common to both datasets.
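The last two steps can be sketched in a few lines. Pearson correlation is assumed here as the measure of coexpression, `gene_neighbors` is a hypothetical helper rather than the GeneNeighbors module, and both toy datasets are invented. Only the probe id M92439_at comes from the exercise. Constant rows would need special handling, since their correlation is undefined.

```python
import numpy as np

def gene_neighbors(expr, names, target, k=2):
    """Names of the k genes most Pearson-correlated with `target`."""
    idx = names.index(target)
    centered = expr - expr.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1)
    corr = centered @ centered[idx] / (norms * norms[idx])
    corr[idx] = -np.inf                      # exclude the target itself
    return [names[i] for i in np.argsort(-corr)[:k]]

names = ["M92439_at", "g1", "g2", "g3", "g4"]

# Invented dataset A: rows follow `names`, columns are normal samples.
expr_a = np.array([[1.0, 2.0, 3.0, 4.0, 5.0],
                   [2.0, 4.0, 6.0, 8.0, 10.0],
                   [5.0, 4.0, 3.0, 2.0, 1.0],
                   [1.0, 3.0, 2.0, 5.0, 4.0],
                   [3.0, 1.0, 4.0, 1.0, 5.0]])
# Invented dataset B with the same genes.
expr_b = np.array([[2.0, 1.0, 4.0, 3.0, 6.0],
                   [4.0, 2.0, 8.0, 6.0, 12.0],
                   [1.0, 2.0, 1.0, 2.0, 1.0],
                   [6.0, 5.0, 3.0, 4.0, 1.0],
                   [1.0, 1.0, 3.0, 2.0, 5.0]])

neighbors_a = gene_neighbors(expr_a, names, "M92439_at")
neighbors_b = gene_neighbors(expr_b, names, "M92439_at")
common = set(neighbors_a) & set(neighbors_b)   # overlap between gene sets
print(neighbors_a, neighbors_b, common)
```

The set intersection at the end corresponds to the overlapping region of the Venn diagram.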

[Diagram: workflow of files and modules, in slide order: GCM_Total.res, SelectFeaturesColumns, GCM_Normals.res, GeneNeighbors, GCM_Normals.markerdata.gct, GCM_Normals.markerlist.odf, GeneListSignificanceViewer, CollapseDataset, GCM_Total_Normals.markerdata.collapsed.gct, ExtractRowNames, GCM_Total_Normals.markerdata.collapsed.row.names.txt, VennDiagram]
