
MSCL Analyst’s Toolbox, Part 2


Instructors:

Jennifer Barb, Zoila G. Rangel, Peter Munson

March 2007

Mathematical and Statistical Computing Laboratory

Division of Computational Biosciences

- Quality Control Charts
- False Discovery Rate
- Principal Components Analysis explained
- PCA Heatmap
- Data normalization, transformation
- Affymetrix probesets and “Probe-level” analysis
- MAS5, RMA, S10 compared

- Started in mid-1990s, exponential growth in popularity
- High-throughput -- measures 10,000s of genes at once
- Very noisy -- systematic and random errors
- Chip manufacturing, printing artifacts
- RNA sample quality issues
- Sample preparation, amplification, labeling reaction problems
- Hybridization reaction variability
- Linearity of response, saturation, background

- Affymetrix has controlled chip quality well.
- REPLICATION IS STILL REQUIRED!
- Statistical methods are critical in analysis!
- Quality Control is Essential!

New Scanner Installed

Scanner “burn-in”?

- Cross-sectional clinical studies from 2 or more patient groups or tissues; identify markers, prognostic indicators.
- Animal model: samples compared between treatments, groups, or over time; identify genes involved in disease process.
- Intervention Trial: collect blood samples pre/post treatment or over time, identify (and rationalize) genes involved.
- Cell culture: Treat cells in culture, identify genes and patterns of response. Complex study designs possible.
- Genetic Knock-out: Perturb genotype, give treatment, investigate expression response, in animal or cells.

- Clinical Studies:
- Exploratory analysis, Hierarchical Cluster, Heat maps
- Sample size often insufficient
- Two-sample tests, Discriminant Analysis, “machine learning” approaches to find prognostic factors

- Designed studies: Analysis plan should follow design
- T-tests, one-way ANOVA to select significantly changing genes
- Blocking to account for experimental batch
- Two-way ANOVA for complete two-factor experiments
- Regression (etc.) for time-course experiments

- Corrections for multiple-comparison (20,000 genes tested)
- False Discovery Rate

- Interpretation of gene lists (open-ended problem!)

[Figure: p-value distribution with cut at p<.05; regions marked True discoveries and False discoveries]

- Note excess of small p-values in 45,000 probe sets
- Indicates presence of significant, differentially expressed genes

FDR* = (Expected Number of False Discoveries) / (Number Discovered)
     = (Number of tests x p-value cutoff) / (Number Discovered at this p-value)

Example: 48 genes detected at p<.001 in chip with 12,000 genes.

FDR = (12,000 x .001) / 48 = 12 / 48 = 25%

*Benjamini, Y., Hochberg, Y. (1995) JRSS-B, 57, 289-300.
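
The FDR arithmetic above can be sketched in a few lines of Python. The function names `estimated_fdr` and `bh_adjust` are illustrative and do not come from the slides. The second function is a sketch of the usual step-up adjustment associated with the cited Benjamini and Hochberg paper, under the assumption that the p-values are independent or positively dependent.

```python
def estimated_fdr(n_tests, p_cutoff, n_discovered):
    """Estimate FDR as (expected false discoveries) / (number discovered)."""
    if n_discovered == 0:
        return 0.0  # nothing discovered, so report zero by convention
    expected_false = n_tests * p_cutoff
    # The ratio can exceed 1 when few genes pass the cutoff; cap it at 1.
    return min(1.0, expected_false / n_discovered)


def bh_adjust(p_values):
    """Step-up adjusted p-values: p * m / rank, made monotone from the largest p down."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Slide example: 48 genes detected at p < .001 on a chip with 12,000 genes.
slide_fdr = estimated_fdr(12_000, 0.001, 48)
print(f"{slide_fdr:.0%}")
```

Selecting genes whose adjusted value falls below a target (say 0.05) is one common way to pick a cutoff instead of fixing the p-value in advance.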

Now we have guarantee that,

[Figure: Expression Matrix, X: Samples (1 to n) by Genes (1 to 12,625), with Annotations for each Gene and Information about each Sample]

- "pre-condition" the Expression Data Matrix
- Select "significant" Genes (False Discovery Rate)
- Select relevant Samples (Outlier rejection, QC)
- Re-order, partition the Genes ("clustering")
- Re-order the Samples
- Visualize the matrix ("heat-map", PCA scatterplot), encode Gene and Sample annotations
- Visualize by Sample (rows of X, scatterplots, line plots)
- Visualize by Gene (cols of X)
- Visualize the Annotations (how?)
- Browse the display for new hypotheses!

Each Principal Component is an orthogonal, linear combination of the expression levels. For the ith gene chip:

In matrix notation, the Principal Components Matrix (PC), the Patterns Matrix (EP), and the Expression Data Matrix (X) are related by PC * EP = X.

A was chosen so that AA^T is the Identity matrix.

[Figure: PC * EP = X, where PC is Experiments (1 to n) by Components (1 to n), EP is Components (1 to n) by Genes (1 to 12,625), and X is Experiments (1 to n) by Genes (1 to 12,625)]

Plot PC(i,1) vs PC(i,2) for each experiment

- EP row1 contains most important “expression pattern"
- PC col 1 defines how that pattern is manifest in each experiment
- Similarly for EP row 2, PC col 2, etc.
- Only a few patterns needed to reconstruct data matrix X
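
One way to obtain such a decomposition is the singular value decomposition. The sketch below is an assumption about the computation, not the instructors' code: it uses NumPy, centers each gene (column) before decomposing, and runs on a small synthetic matrix. The name `pca_patterns` is illustrative.

```python
import numpy as np


def pca_patterns(X):
    """Return (PC, EP, explained) such that PC @ EP equals the column-centered X."""
    Xc = X - X.mean(axis=0)                  # center each gene (column); an assumption
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    PC = U * s                               # experiments x components
    EP = Vt                                  # components x genes, rows are orthonormal
    explained = s**2 / np.sum(s**2)          # fraction of variation per component
    return PC, EP, explained


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 50))                 # 6 chips (rows) x 50 genes (columns), synthetic
PC, EP, explained = pca_patterns(X)

# Coordinates for the scatterplot of PC(i,1) vs PC(i,2), one point per chip.
scatter_xy = PC[:, :2]
```

Because the rows of `EP` are orthonormal, `EP @ EP.T` is the identity, and keeping only the first few columns of `PC` and rows of `EP` gives a low-rank approximation of the centered data.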

[Figure: PCA scatterplot of PC 1 (38%) vs PC 2 (12%). Each spot is one chip, N=469]

[Figure: matrix diagram of PC * EP = X (Experiments by Components, Components by Genes, Experiments by Genes)]

Visualize coefficients of the first few “Patterns”, re-order Experiments

Conclusion: Sample Type and Project determine clusters

469 Chips, 468 Components, 5,933,750 values!

Data Normalization and Transformation

- Signal intensity varies chip-to-chip for a variety of technical reasons.
- Scale adjustments can be made in a variety of ways.
- Median adjustment (divide by col median) is commonly used
- Other quantiles (e.g. 75th percentile) may work better

- Log-transform
- spreads data more evenly
- makes variance more uniform

- “Lmed” is the median-normalized, log-transformed value
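
A minimal sketch of the Lmed computation for a single chip, using only the Python standard library. The log base is not stated on the slide; base 10 is assumed here because S10 is described as calibrated to Log10. The function name `lmed` is illustrative.

```python
import math
from statistics import median


def lmed(chip_signals):
    """Median-normalize one chip's signal values, then take log10 of each."""
    med = median(chip_signals)
    # The logarithm is undefined for zero or negative values, so reject them.
    if med <= 0 or any(v <= 0 for v in chip_signals):
        raise ValueError("Lmed needs strictly positive signal values")
    return [math.log10(v / med) for v in chip_signals]


# Three probe sets on one chip: the median value maps to 0.
values = lmed([10.0, 100.0, 1000.0])
```

After this transform every chip has a median of 0, so chips with different overall brightness become comparable on a common scale.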

- Quantile normalization (“ranking” the data): every percentile becomes identical across chips
- Quantile normalization may remove technical artifacts (e.g. curvature)
- Variance should be homogeneous across measurement scale
- Variance may be “homogenized” with appropriate transform (e.g. logarithm, square-root, arcsinh)
- “S10” transform -- optimal variance stabilizing, quantile normalizing transform, calibrated to match Log10 over central part of measurement scale
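
Quantile normalization can be sketched as follows. This version uses the mean of the sorted chips as the reference distribution, which is a common choice but an assumption here; the slides only say the data are normalized to a convenient distribution. Ties are broken by position, and `quantile_normalize` is an illustrative name.

```python
def quantile_normalize(chips):
    """chips: list of equal-length lists, one per chip. Returns normalized copies."""
    n = len(chips[0])
    if any(len(c) != n for c in chips):
        raise ValueError("all chips must have the same number of probe sets")
    sorted_chips = [sorted(c) for c in chips]
    # Reference distribution: mean of the k-th smallest value across chips.
    reference = [sum(s[k] for s in sorted_chips) / len(chips) for k in range(n)]
    result = []
    for chip in chips:
        order = sorted(range(n), key=lambda i: chip[i])  # probe-set indices by rank
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]  # the value at each rank is replaced by the reference
        result.append(out)
    return result


normalized = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 9.0]])
```

After normalization each chip contains exactly the same set of values, so every percentile is identical across chips while the rank order within each chip is preserved.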

- Note deviation from line of identity
- Note nonuniform variance

- Adequate in most cases
BUT….

- Some nonlinearity may remain, requiring further normalization
- Variance is not truly constant, expands at low intensities
- Cannot treat zero or negative values
- Logarithm may not be best transformation
- Median normalization may not always be adequate
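
To illustrate how an alternative transform can handle zero and negative values, here is a sketch using arcsinh, one of the transforms named earlier. This is not the S10 transform, whose exact form is not given in these slides; the rescaling to log10 units and the `scale` parameter are assumptions made for illustration.

```python
import math


def arcsinh_log10(x, scale=1.0):
    """arcsinh transform rescaled to approach log10(x / scale) for large x."""
    # asinh(u) is close to ln(2u) for large u, so asinh(u / 2) is close to ln(u);
    # dividing by ln(10) converts to log10 units.
    return math.asinh(x / (2.0 * scale)) / math.log(10.0)


high = arcsinh_log10(1000.0)   # close to log10(1000)
zero = arcsinh_log10(0.0)      # defined at zero, unlike the logarithm
neg = arcsinh_log10(-50.0)     # defined for negative values, symmetric around the origin
```

The function is linear near zero and logarithmic at high intensity, which is the general shape a variance-stabilizing transform needs when variance expands at low intensities.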

Symmetric Adaptive Transform (S10):

- We start with quantile normalization to convenient distribution
- We further transform to make variance constant with mean
- We adapt transform to empirical variance model (requires an experiment with at least 5 to 10 chips)
- We scale transform to match log10 units midrange
- We require symmetry around origin

Model the nonlinear relationship

[Figure: red line is the plot of quantile of chip 1 vs quantile of chip 2]

- Second chip is quantile-normalized to first chip
- Curvature is cured!
- Now, can we remove the variable spread?
- Nonuniform variance?

- Uses Quantile normalization
- Gives better fit to line of identity
- Adapts scale to give homogeneous variance
- Uniform scatter about line
- Calibrated to match Log10 in middle of scale
- *Munson, P.J. A consistency test for determining the significance of gene expression changes on replicate samples and two convenient variance-stabilizing transformations. in GeneLogic Workshop of Low Level Analysis of Affymetrix GeneChip Data. 2001. Bethesda, MD.

Lmed

- 12 Chips
- 3 Groups
- Two apparent outliers
- Groups not well separated
- 1st PC explains 15.3% of variation

S10

- Outliers no longer obvious
- Groups well-separated
- 1st PC explains 30.8% of variation

[Figure: fold change, Drug vs. Control, Repl. 1 vs Repl. 2; one panel for Log Fold Change (LFC), one for SFC]

[Figure panels, axis labels: Lmed Transform Value, Signal Value; Std Dev Lmed, Mean Lmed Value; S10 Transform Value, Lmed Transform Value; Std Dev S10, Mean S10 Value]

“Probe Level” analysis

Comparison of Signal, RMA, S10

To go from 11 probe pairs to a single number:

- Affymetrix MAS 4.0 (Average difference)
- Affymetrix MAS 5.0 (Signal)
- dChip (Li and Wong, 2001)
- RMA (Irizarry, 2003)
- PLIER (Hubbell, 2004, Affymetrix)
- Transformations of above statistics (Log, Glog, S10, etc.)

- Spike-in (or Latin Square) study on Affy U133A chip
- 13 concentrations plus “control” spiked into complex HeLa background
- 42 oligos, 0, 0.125 - 512 pM
- Concentration doubles at each step
- Three chips run for each concentration

www.affymetrix.com, “Latin Square Data for Expression Algorithm Assessment”

[Figure: Mean Intensity for Probeset vs Concentration Number]

Move selector box to detect more Red, fewer Blue points

- RED: spike-in genes
- BLUE: background
- TP = Red points inside detection box
- FP = Blue points inside detection box

[Figure: ROC curves, Number of True Positives vs Number of False Positives, for Lmed(Signal), RMA, and S10(Signal)]
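
The TP and FP counts behind such a curve can be sketched as below. This simplifies the two-dimensional detection box to a single score threshold per probe set, which is an assumption; the function name `roc_counts` and the toy scores are illustrative, not data from the study.

```python
def roc_counts(scores, is_spike_in):
    """Return (false_positives, true_positives) pairs as the threshold is lowered."""
    ranked = sorted(zip(scores, is_spike_in), key=lambda t: t[0], reverse=True)
    tp = fp = 0
    points = [(0, 0)]              # nothing detected at the strictest threshold
    for _, spiked in ranked:
        if spiked:
            tp += 1                # a spike-in gene (red) enters the detection region
        else:
            fp += 1                # a background gene (blue) enters the detection region
        points.append((fp, tp))
    return points


# Toy example: four probe sets, two of them spike-ins.
curve = roc_counts([0.9, 0.8, 0.3, 0.1], [True, False, True, False])
```

Plotting the second coordinate against the first for each summary method gives curves that can be compared directly: a better method reaches more true positives for the same number of false positives.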

- RMA
- gives overall best ROC curve
- requires probes on multiple chips be summarized together
- Implemented in Affy EC, R, Bioconductor or ArrayAssistLite

- Signal (MAS5)
- is convenient, available in Affy GCOS software
- summarizes each chip separately
- has expanded variance near baseline
- Lmed(MAS5) gives worst ROC curve

- S10 transform
- cures variance problem for Signal,
- improves detection efficiency (ROC curve),
- is simple to compute!