Microarray Pre-Processing

Mark Reimers

CSHL Data 2012

Outline
  • Microarray technologies
  • Quality assessment
  • Background
  • Normalization
    • Other normalization issues
  • Summarization of Affymetrix
Microarray Technologies
  • Library preparation
  • Hybridization
  • cDNA expression arrays
  • Oligo expression arrays
    • Agilent
    • Affymetrix
    • Illumina
    • NimbleGen
  • Other array types
  • Microarrays measure the abundance of DNA or RNA by relative hybridization
Preparing a cDNA/RNA Library from mRNA
  • Reverse transcribe cDNA from RNA
  • Fragment
  • Amplify cDNA
  • OR
  • Use cDNA to transcribe RNA
Affymetrix Probes Schematic


Probes are



Affymetrix Probe Sets
  • Probes for older expression arrays are drawn from the 3’ end of the gene
  • Poly-T priming picks up poly-A tails of transcripts
  • Newer exon and whole-gene arrays have probes evenly distributed
  • Random priming more even – but not uniform!
Printed Oligonucleotide Arrays
  • Agilent (off-shoot of HP) uses printing technology
Agilent Arrays
  • Now second largest supplier of arrays
  • Reputation for high quality and attention to detail (e.g. scanner optics)
  • Typical 60 nucleotide probes (60-mers)
  • 44K, 185K, and 244K standard sizes
  • Can do several (up to 8) arrays per slide
NimbleGen Oligonucleotide Arrays

NimbleGen uses a micro-mirror method to de-protect during in situ oligo synthesis

(Roche-) NimbleGen Arrays
  • Usually 60-mers
  • Random sequence controls provided
  • Standard sizes from 385K up to 2.1 million probes
  • Can also be multiplexed
  • Patent issues kept the production facility in Iceland
Illumina Bead Arrays
  • 3 µm beads manufactured with identifying segment (~12 nt) and 50-mer probe for target
  • Beads in wells (for some assays with optical fiber)
  • First scan reads ID tag; second reads target
Illumina Probes
  • Typically about 30 beads per probe type on each array
  • SD very high
  • No controls on most arrays
  • Can be multiplexed
Quality Assessment
  • You are going to be doing a lot of intense analysis on expensive data
  • Are there any factors that would lead you to doubt or distrust a particular datum (array)?
  • Quality of library – e.g. RNA quality
  • Quality of hybridization process
  • Statistical QA – try to detect non-random technical variation on any chip
RNA Quality

Ideal: Two sharp peaks for 18S & 28S RNA

Agilent BioAnalyzer

Statistical Approaches
  • Aim: are any samples different from others in technical preparation?
  • Exploratory Data Analysis (EDA)
    • Box plots, density plots, clustering, PCA
  • Are there any outliers?
    • These could be biologically interesting
  • Are there associations with technical factors?
      • Technician; date of sample prep; etc.
EDA - Boxplots
  • Boxplot of 16 chips from Cheung et al Nature 2005
Each Pair Replicates One Sample
  • Boxplot of 16 chips from Cheung et al Nature 2005
Some Causes of Technical Variation
  • Amount of RNA always differs between samples
  • Yield of conversion to cDNA or cRNA may differ
  • Label incorporation may differ
  • Temperature of hybridization may differ
  • RNA may be slightly degraded in some samples
  • Strength of ionic buffers differs
  • Stringency of wash differs
  • Scratches may occur on some chips
  • Ozone may bleach Cy5 at some times
Borrow an Idea from Model Testing
  • Question: Is the model adequate? Or do hidden factors cause systematic errors?
  • Examine residuals after fitting model
    • Should be IID Normal
    • Is there structure in residuals?
    • Plot against known technical covariates, such as order of sample
  • How to adapt residual examination for high-throughput assays?
Statistical QA for Arrays
  • Model for signal of probe $i$ on chip $j$: $y_{ij} \sim m_i + e_{ij}$
    • Each gene has same mean in all arrays (mostly true)
    • Look at residuals after fitting model
  • New twist for high-throughput assays:
    • Examine residuals within each chip (fix j; vary i)
    • Plot against known technical factors of probes
    • Is there any factor that seems to be predicting systematic errors?
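The within-chip residual check can be sketched in a few lines of Python. This is a minimal illustration rather than code from the talk: the function names, the toy intensities, and the choice of GC content as the technical factor are all assumptions, and each probe mean is estimated simply as the average across chips.

```python
from statistics import mean

def probe_residuals(y):
    """y[i][j] = log signal of probe i on chip j.
    Fit y_ij = m_i + e_ij with m_i taken as the mean of probe i across
    chips, and return the residuals e_ij in the same layout."""
    return [[v - mean(row) for v in row] for row in y]

def pearson(x, z):
    """Plain Pearson correlation between two equal-length lists."""
    mx, mz = mean(x), mean(z)
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    sxx = sum((a - mx) ** 2 for a in x)
    szz = sum((b - mz) ** 2 for b in z)
    return sxz / (sxx * szz) ** 0.5

def chip_factor_correlations(y, factor):
    """For each chip j (fix j; vary i), correlate the residuals with a
    probe-level technical factor such as GC content."""
    res = probe_residuals(y)
    n_chips = len(y[0])
    return [pearson([row[j] for row in res], factor) for j in range(n_chips)]

# Toy data (hypothetical): 5 probes x 3 chips; the last chip has a
# GC-dependent bias built in.
gc = [0.30, 0.40, 0.50, 0.60, 0.70]
y = [
    [8.0, 8.1, 7.7],
    [9.0, 8.9, 8.9],
    [7.0, 7.1, 7.2],
    [10.0, 10.1, 10.5],
    [6.0, 5.9, 6.7],
]
for j, r in enumerate(chip_factor_correlations(y, gc)):
    print(f"chip {j}: correlation of residuals with GC = {r:+.2f}")
```

A chip whose residuals correlate strongly with a probe property is a candidate for closer inspection; in practice a scatter plot of residuals against the factor shows more than a single correlation coefficient.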
Statistical QA of Arrays
  • Significant artifacts may not be obvious from visual inspection or bulk statistics
  • General approach: plot deviations from average or residuals from fit against any technical variable:
    • CG content or Tm (thermodynamics)
    • Probe position relative to 3’ end of gene (for poly-T primed RNA)
    • Physical location on chip (fluid artifacts)
    • Average Intensity across chips (saturation)
Ratio vs Intensity Plots Reveal Saturation & Quenching
  • Saturation: decreasing rate of binding of RNA as more RNA occupies the probe
  • Quenching: light emitted by one dye molecule may be re-absorbed by a nearby dye molecule, then lost as heat
    • Effect proportional to square of density
  • Plot of log ratio against average log intensity across chips

GSM25377 from the CEPH expression data GSE2552
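The quantities behind a ratio-intensity plot can be computed as below. This is a sketch under assumptions: the function names and toy values are hypothetical, the reference for each probe is taken to be its average log intensity across all chips, and the binned means only stand in for the visual trend one would read off the plot.

```python
from statistics import mean

def ratio_intensity(chips, j):
    """chips[i][j] = log2 intensity of probe i on chip j.
    For each probe return (average log intensity across chips,
    log ratio of chip j to that average)."""
    points = []
    for row in chips:
        avg = mean(row)
        points.append((avg, row[j] - avg))
    return points

def binned_mean_ratio(points, n_bins=3):
    """Mean log ratio within roughly n_bins equal-count bins, ordered by
    increasing intensity. A drop in the top bin suggests saturation or
    quenching on this chip."""
    pts = sorted(points)
    size = max(1, len(pts) // n_bins)
    return [mean(m for _, m in pts[k:k + size]) for k in range(0, len(pts), size)]

# Toy data (hypothetical): the third chip reads low at high intensity.
chips = [
    [6.0, 6.0, 6.0],
    [7.0, 7.0, 7.0],
    [9.0, 9.0, 9.0],
    [10.0, 10.0, 10.0],
    [13.0, 13.0, 12.4],
    [14.0, 14.0, 13.1],
]
trend = binned_mean_ratio(ratio_intensity(chips, 2))
print(trend)
```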

How Much Variability on R-I?
  • Ratio-Intensity plots for six arrays at random from Cheung et al Nature (2005)
RNA Quality Plots in Bioconductor
  • affyRNAdeg plots in affy package
  • Effects do not appear large because averaged
  • Samples with RNA quality differences stick out

Plot of average intensity for each probe position across all genes against probe position

Local Bias on Affymetrix Chips

Image of raw data on a log2 scale shows striations but no obvious artifacts

Image of ratios of probes to standard shows a smudge

Non-coding probes

Images show high values as red, low values as yellow

Spatial Artifacts on Affy Chips

Bubbles (yellow) in hybridization chamber

Touching cover slip and wiping incompletely

Scratches on cover slip

Model-Based QC for Affy in BioC
  • Robust Multi-array Average (RMA)
    • fits a linear model to each probe set
    • High residuals show regional patterns

High residuals in green

See http://plmimagegallery.bmbolstad.com/

Available in affyQCReport package at www.bioconductor.org

Affy QC Metrics in Bioconductor
  • affyPLM package fits probe level model to Affymetrix raw data
  • NUSE - Normalized Unscaled Standard Errors
    • normalized relative to each gene
  • How many big errors?
Spatial Artifacts in Agilent
  • Usually artifacts are not as strong as on other array types
  • BUT – consequential because only one probe per gene
  • More diffuse artifacts are common
    • probably reflecting wash irregularities
General Issues in Estimating and Compensating Background
  • ‘Background’ is heterogeneous – different genomic regions or probes have very different background levels
  • Most are comparable and a few are high
Microarray Background
  • Non-specific hybridization
  • Cross-hybridization to specific non-targets
  • Distribution of Background has outliers
    • High CG more variable than low
Current Model for Background Estimation
  • 25-mers are prone to cross-hybridization
  • MM > PM for about 1/3 of all probes
  • Cross-hybridization varies with GC content
  • Bases at ends matter less than central
  • Signal intensity varies with cross-hybe
  • Simple approach is linear model:

$m_{j,k}$ are mean effects of base $j$ at position $k$
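A linear model of this kind can be written as follows. Only the definition of $m_{j,k}$ is given on the slide, so the response $\log B$ (log background intensity of a probe), the indicator notation, and the error term $\varepsilon$ are assumptions made here for illustration; $b_k$ denotes the base at position $k$ of a 25-mer probe.

```latex
\log B \;=\; \sum_{k=1}^{25} \; \sum_{j \in \{A, C, G, T\}} m_{j,k} \, \mathbf{1}[\, b_k = j \,] \;+\; \varepsilon
```

Each probe therefore contributes exactly one coefficient per position, the one matching its base there.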

The gcRMA Approach to Background Correction
  • Estimate non-specific binding using either:
    • True null assay (non-homologous RNA)
    • Estimates from MM
  • Rather than fit 25 independent coefficients, fit a spline with 5 df for each base
  • Process background first; then normalize and fit model

Typical coefficients fit for each base at each position in the gcRMA background model (using 5 df splines to model each base curve)


Evaluating the gcRMA Model
  • We compared RNA-Seq data to microarray data on the same samples to identify genes that were not expressed; therefore all signal is cross-hybridization for these probes
  • We fit the gcRMA model to those probes
  • The model explained less than 10% of the variance among probes
Evaluating gcRMA
  • gcRMA won on AffyComp data sets (2006) using replicates with 14 spike-ins done by Affy
  • Many investigators get bad results (and don’t write it up)
      • Gharaibeh et al., BMC Bioinformatics 2008, 9:452 claimed that gcRMA does very well on highly expressed genes, not nearly so well on less expressed genes
  • That’s precisely where it doesn’t matter
Why Does gcRMA Fail?
  • gcRMA estimates cross-hybridization by fitting regression to MM probes
  • MM probes contain a good deal of specific signal
  • Symptom: gcRMA curves are almost identical for different chips, but cross-hybe varies considerably between chips assessed by other means (e.g. comparing controls or fitting the gcRMA model to genes known to be absent)
Does Cross-Hybridization Matter for Long Oligos?
  • Variation in GC content is more constrained
  • Cross-hybridization seems much more uniform
  • Too hard to estimate individual effects of bases
  • Model using quadratic curve to estimate distributions of bases over length is effective at reducing error
    • Three terms: constant, linear, quadratic
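One way to read the three-term model is that the effect of each base varies as a quadratic function of its position along the probe, so a probe is summarized by three numbers per base. The sketch below builds those features; the function names, the scaling of position to the interval [-1, 1], and the coefficient values are all hypothetical and not taken from the talk.

```python
def quadratic_base_features(seq):
    """Summarize where each base sits along a probe using three terms per
    base: constant (count), linear and quadratic in the scaled position x,
    where x runs from -1 at the first base to +1 at the last."""
    n = len(seq)
    feats = {}
    for base in "ACGT":
        xs = [2.0 * k / (n - 1) - 1.0 for k, b in enumerate(seq) if b == base]
        feats[base] = (float(len(xs)), sum(xs), sum(x * x for x in xs))
    return feats

def predicted_background(seq, coef):
    """Linear predictor: sum over bases of c0*count + c1*sum(x) + c2*sum(x^2).
    coef[base] = (c0, c1, c2) would come from a regression on probes believed
    to carry only background signal; the values below are made up."""
    feats = quadratic_base_features(seq)
    return sum(c * f for base in "ACGT" for c, f in zip(coef[base], feats[base]))

# Hypothetical coefficients: G and C contribute more, and more so when they
# sit near the centre of the probe (negative quadratic term).
coef = {
    "A": (0.1, 0.0, 0.0),
    "C": (0.3, 0.0, -0.1),
    "G": (0.3, 0.0, -0.1),
    "T": (0.1, 0.0, 0.0),
}
probe = "ACGT" * 15  # a 60-mer
print(quadratic_base_features(probe)["G"])
print(predicted_background(probe, coef))
```

With 12 coefficients in total instead of one per base per position, the model stays estimable for 60-mers.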
Common Normalization Methods
  • Simple parametric methods
    • Align mean or median intensities
    • Match mean/median and SD/MAD
  • Nonparametric methods
    • Lowess for two-color arrays
    • Align an ‘Invariant Set’ across arrays
    • ‘Shoehorn’ all samples to a common distribution
How to Assess Normalization?
  • We want to minimize technical variations in relation to biological variation
    • Most tests like t-test or ANOVA compare technical and between-group variance
  • Compare distributions of biological to technical variation after normalization
  • Most small estimates of variance are under-estimates
One Parameter Alignment
  • Set target mean or median
    • Usually mean on a log scale to avoid influence of a few very high intensities
  • Scale all values to match mean
    • Add or subtract from log values
  • Agilent suggests aligning 75th percentile (3rd quartile) of distributions
    • Median of the half of genes that are expressed
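Both variants of one-parameter alignment amount to adding a single constant to the log intensities of a chip. A minimal sketch, with hypothetical function names and toy values:

```python
from statistics import mean, quantiles

def align_mean(log_values, target):
    """Shift one chip's log intensities so that their mean equals target."""
    shift = target - mean(log_values)
    return [v + shift for v in log_values]

def align_percentile(log_values, target, pct=75):
    """Shift one chip's log intensities so that the pct-th percentile equals
    target (pct=75 is the third quartile)."""
    cut = quantiles(log_values, n=100, method="inclusive")[pct - 1]
    shift = target - cut
    return [v + shift for v in log_values]

chip = [5.0, 6.0, 7.0, 8.0, 9.0]
print(align_mean(chip, 8.0))         # mean moves from 7 to 8
print(align_percentile(chip, 10.0))  # third quartile moves from 8 to 10
```

Because the shift is applied on the log scale, it corresponds to multiplying the raw intensities by a constant factor.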
Two-color Intensity-Dependent Bias
  • Different amounts of fluorescent label get incorporated into DNA
  • Therefore saturation occurs at different densities for Cy-3 (green) and Cy-5 (red) dyes
  • We estimate the bias by an intensity-dependent function


Global (lowess) Normalization (2000)
  • The bias, b(A), could be determined by any local averaging method
  • Terry Speed suggested lowess (locally weighted regression)
  • Subtract b(A) to obtain ‘corrected’ data
  • Globally normalized data $\{(M_n, A_n)\}_{n=1}^{N}$: $M_{\mathrm{norm}} = M - b(A)$
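Since any local averaging method can supply b(A), the idea can be illustrated without a lowess implementation. The sketch below uses a running mean over spots that are neighbours in A, which is a cruder stand-in for lowess, not lowess itself. M is taken as the log ratio of the two channels and A as their average log intensity (the usual convention, assumed here); the function names and window size are hypothetical.

```python
from statistics import mean

def running_bias(M, A, window=5):
    """Estimate b(A) at each spot as the mean of M over the `window` spots
    closest in A-rank (a crude local average standing in for lowess)."""
    order = sorted(range(len(A)), key=lambda n: A[n])
    half = window // 2
    bias = [0.0] * len(A)
    for rank, n in enumerate(order):
        lo = max(0, rank - half)
        hi = min(len(order), rank + half + 1)
        bias[n] = mean(M[m] for m in order[lo:hi])
    return bias

def normalize_ma(red, green, window=5):
    """red, green: log2 intensities per spot. Returns (M_norm, A) where
    M_norm = M - b(A)."""
    M = [r - g for r, g in zip(red, green)]
    A = [(r + g) / 2.0 for r, g in zip(red, green)]
    b = running_bias(M, A, window)
    return [m - bn for m, bn in zip(M, b)], A

# Toy data (hypothetical): red reads 0.5 log2 units high at every intensity,
# so the estimated bias is 0.5 everywhere and the corrected M is zero.
green = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
red = [g + 0.5 for g in green]
m_norm, a = normalize_ma(red, green)
print(m_norm)
```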
Quantile Normalization
  • Determine reference distribution (can use any good chip or average a set of chips)
  • For each chip, for each probe, determine quantile within that chip
  • Shift to corresponding quantile of reference distribution


  • Easy to implement
  • Resolves intensity dependent bias as well as loess
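The three steps above can be sketched as follows, using the average of the sorted chips as the reference distribution. Function names and toy values are hypothetical, all chips are assumed to have the same number of probes, and ties are broken by original order, which is a simplification.

```python
from statistics import mean

def reference_distribution(chips):
    """chips[j] = list of intensities for chip j. Sort each chip and average
    the values rank by rank to obtain the reference distribution."""
    sorted_chips = [sorted(c) for c in chips]
    return [mean(vals) for vals in zip(*sorted_chips)]

def quantile_normalize(chip, reference):
    """Replace each value by the reference value of the same rank."""
    order = sorted(range(len(chip)), key=lambda i: chip[i])
    out = [0.0] * len(chip)
    for rank, i in enumerate(order):
        out[i] = reference[rank]
    return out

chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
ref = reference_distribution(chips)
normalized = [quantile_normalize(c, ref) for c in chips]
print(ref)
print(normalized)
```

After normalization every chip has exactly the same set of values; only the assignment of values to probes differs between chips.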
Quantile Normalization Method (Irizarry et al 2002)

The mapping by quantile normalization

Key Assumption of Quantile Norm
  • The processes that distort the distribution act on all probes of a given intensity more or less equally
  • Probably true within differences of 30% or 40%
  • Smaller differences depend quite strongly on technical characteristics of probes
Critiques of Quantile Normalization
  • Compresses variation of highly expressed genes
  • Confounds systematic changes due to cross-hybridization with changes in abundance to genes of low expression
  • Induces artificial correlations in gene expression across samples
Special Issues: Cancer
  • Many cancers express very many genes at much higher than normal levels
  • Quantile normalization forces down genes that actually stay the same, including many that are absent
  • Removing genes that are very high and those that are obviously higher in cancers to fit the normalization, then fitting in others around those, improves the results
Overall Evaluation of Common Normalization Methods
  • The popular simple non-parametric methods such as lowess and quantile normalization significantly improve reproducibility of results for most arrays in many common circumstances
  • There are significantly better, but more complex normalization methods
  • No good normalization method can afford to ignore real changes in the distribution
What is Summarization?
  • Some expression arrays (Affymetrix, Nimblegen) use multiple probes to target a single transcript – a ‘probe set’
  • Typically probes have different fold changes between any two samples
  • How to effectively summarize the information in a probe set?
Affy: Perfect Match and Mismatch

How to combine signals from PM & MM?

Mostly ignore MM – not used on modern chips

Probe Variation
  • Individual probes don’t agree on fold changes
  • Probes vary by two orders of magnitude on each chip
    • CG content is most important factor in signal strength

Signal from 16 probes along one gene on one chip

Many Approaches to Summarization
  • Affymetrix MicroArray Suite; PLiER
  • dChip - Li and Wong, HSPH
  • Bioconductor:
    • RMA - Bolstad, Irizarry, Speed, et al
    • affyPLM – Bolstad
    • gcRMA – Wu
  • Physical chemistry models – Zhang et al
  • Factor model
  • Probe-weighting
Critique of Averaging (MAS5)
  • Not clear what an average of different probes should mean
  • Tukey bi-weight can be unstable when data cluster at either end – frequently the conditions here
  • No ‘learning’ based on performance of individual probes across chips
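For reference, a one-step Tukey biweight average of the kind used for single-chip summarization can be sketched as below. The tuning constant c = 5 and the small eps follow the commonly cited description of MAS5 but should be treated as assumptions here, as should the function name; this is an illustration of the estimator, not Affymetrix's implementation.

```python
from statistics import median

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean in which points far from the
    median (in units of the median absolute deviation) get less weight, and
    points beyond c MADs get none."""
    m = median(values)
    mad = median(abs(v - m) for v in values)
    weights = []
    for v in values:
        u = (v - m) / (c * mad + eps)
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

print(tukey_biweight([5.0, 5.0, 5.0, 5.0, 100.0]))  # the outlier gets zero weight
print(tukey_biweight([1.0, 1.0, 1.0, 9.0, 9.0, 9.0]))  # two clusters: result sits between them
```

The second call shows the situation the slide warns about: when the probes split into two clusters, the median falls between them and the summary describes neither group.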
Motivation for multi-chip models:

Probe-level data from spike-in study (log scale); note parallel trend of all probes

Courtesy of Terry Speed

A Linear Model for Probe Intensities
  • Imagine a matrix $[b_{ij}]$ of intensities (brightness) of probes $j$ in one probe set across all arrays $i$
  • Typically $j = 1, \dots, 4$ or $j = 1, \dots, 11$
  • Each probe $p_j$ binds the gene with efficiency $f_j$
  • Sample $i$ has an amount $a_i$ of target RNA
  • Probe intensity $b_{ij}$ should be proportional to $f_j \times a_i$
  • For now we ignore non-specific hybridization
    • A probe can give high signal when binding to intended target and also to other transcripts

[Figure: matrix of probe intensities, with probes 1, 2, 3 as columns and chips 1, 2 as rows]


Bolstad, Irizarry, Speed – (RMA)
  • For each probe set, take $\log(b_{ij})$
  • where caret represents “after pre-processing”
  • Problem: there may be many outliers
  • Therefore fit this additive model by iteratively re-weighted least-squares or median polish
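Taking logs of the multiplicative relation between probe intensity, probe efficiency $f_j$ and target amount $a_i$ gives an additive model, which is presumably the one fitted here. The symbols $\alpha_i$, $\phi_j$ and the error term $\varepsilon_{ij}$ are notation introduced for this sketch, not taken from the slide.

```latex
\log \hat{b}_{ij} \;=\; \alpha_i + \phi_j + \varepsilon_{ij},
\qquad \alpha_i = \log a_i, \quad \phi_j = \log f_j
```

The array effects $\alpha_i$ are the expression estimates of interest; the probe effects $\phi_j$ absorb the large differences in signal strength between probes.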
Median Polish
  • Residuals of regular linear model have row and column sums 0
  • Tukey proposed iteratively subtracting medians from rows and columns successively, until row and column medians converge to 0
  • Add up accumulated row summaries to give estimates of relative abundance of gene
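A compact version of the procedure, written for a table with arrays as rows and probes as columns, might look like this. It is a sketch with hypothetical names and toy data; the stopping rule and iteration limit are arbitrary choices rather than anything specified in the talk.

```python
from statistics import median

def median_polish(table, max_iter=10, tol=1e-6):
    """table[i][j] = log intensity of probe j on array i for one probe set.
    Returns (overall, row_effects, col_effects, residuals)."""
    resid = [list(row) for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep row medians out of the residuals into the row effects.
        row_med = [median(r) for r in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Sweep column medians out of the residuals into the column effects.
        col_med = [median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        # Stop once the medians just removed are all negligible.
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid

# Toy probe set (hypothetical): 3 arrays x 4 probes, exactly additive.
table = [
    [8.0, 10.0, 7.0, 13.0],
    [9.0, 11.0, 8.0, 14.0],
    [11.0, 13.0, 10.0, 16.0],
]
overall, row_eff, col_eff, resid = median_polish(table)
expression = [overall + r for r in row_eff]
print(expression)
```

The per-array expression estimate is the overall level plus the accumulated row effect; large entries left in the residual table flag individual probe measurements that do not fit the additive pattern.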
Bioinformatics Issues
  • Probes may not map accurately
  • SNPs in probes
  • Affymetrix places most probes in 3’UTR of genes
    • Alternate Poly-A sites mean that some probe targets may occur less often than other probe targets from the same gene
Alternate Poly-Adenylation Sites

Poly-A marks mRNA ‘tail’

Many genes have alternative poly-A sites

3’ UTR may be longer or shorter

Early Affymetrix probe sets were in 3’UTR

Probe Set Definitions
  • For every type of chip, the probes were designed according to the state of the art knowledge at the time, according to UniGene (national database of transcript sequences)
  • There has been a significant increase in sequence information available over the last several years!
  • New CDFs (chip definition files) are constructed regularly to reflect up-to-date knowledge of where each individual probe maps
Some sources for custom CDFs
  • BrainArray project
    • www.brainarray.mbni.med.umich.edu
  • National Cancer Institute
    • www.masker.nci.nih.gov/ev/
  • Weizmann Institute of Science
    • www.genecards.weizmann.ac.il/geneannot/customcdf.shtml
  • Nutrigenomics consortium
    • www.nugo-r.bioinformatics.nl/NuGO_R.html
  • Bioconductor repository
    • www.bioconductor.org/packages/release/data/annotation
How much difference does it make?

Dai, et al. Evolving gene/transcript definitions significantly alter the interpretations of GeneChip data. Nucleic Acids Research 2005, 33(20):e175.

How much difference does it make?
  • Original tests showed a 30-50% difference between predicted differential expression.
    • Hall JL, Grindle S, et al. Genomic profiling of the human heart before and after mechanical support with a ventricular assist device reveals alterations in vascular signaling networks. Physiological Genomics 2004, 17(3):283-291
  • Comparisons with more recent manufacturer CDFs also show significantly more precise and reproducible results.
    • Sandberg R, Larsson O. Improved precision and accuracy for microarrays using updated probe set definitions. BMC Bioinformatics 2007, 8:48.