Microarray Pre-Processing

Mark Reimers, CSHL Data 2012



Outline

  • Microarray technologies

  • Quality assessment

  • Background

  • Normalization

    • Other normalization issues

  • Summarization of Affymetrix


Microarray Technologies: Outline

  • Library preparation

  • Hybridization

  • cDNA expression arrays

  • Oligo expression arrays

    • Agilent

    • Affymetrix

    • Illumina

    • NimbleGen

  • Other array types


  • Microarrays measure the abundance of DNA or RNA by relative hybridization

Preparing a cDNA/RNA Library from mRNA

  • Reverse transcribe cDNA from RNA

  • Fragment

  • Amplify cDNA

  • OR

  • Use cDNA to transcribe RNA

Affymetrix Probes Schematic


Probes are



Affymetrix Probe Sets

  • Probes for older expression arrays are drawn from the 3’ end of the gene

  • Poly-T priming picks up poly-A tails of transcripts

  • Newer exon and whole-gene arrays have probes evenly distributed

  • Random priming more even – but not uniform!

Printed Oligonucleotide Arrays

  • Agilent (off-shoot of HP) uses printing technology

Agilent Arrays

  • Now second largest supplier of arrays

  • Reputation for high quality and attention to detail (e.g. scanner optics)

  • Typical 60 nucleotide probes (60-mers)

  • 44K, 185K, and 244K standard sizes

  • Can do several (up to 8) arrays per slide

NimbleGen Oligonucleotide Arrays

NimbleGen uses a micro-mirror method to de-protect during in situ oligo synthesis

(Roche-) NimbleGen Arrays

  • Usually 60-mers

  • Random sequence controls provided

  • Standard sizes from 385K up to 2.1 million probes

  • Can also be multiplexed

  • Patent issues kept the production facility in Iceland

Illumina Bead Arrays

  • 3 µm beads manufactured with identifying segment (~12 nt) and 50-mer probe for target

  • Beads in wells (for some assays with optical fiber)

  • First scan reads ID tag; second reads target

Illumina Probes

  • Typically about 30 beads per probe on each array

  • SD very high

  • No controls on most arrays

  • Can be multiplexed

Quality Assessment

  • You are going to be doing a lot of intense analysis on expensive data

  • Are there any factors that would lead you to doubt or distrust a particular datum (array)?

  • Quality of library – e.g. RNA quality

  • Quality of hybridization process

  • Statistical QA – try to detect non-random technical variation on any chip

RNA Quality

Ideal: two sharp peaks for 18S and 28S rRNA

Agilent BioAnalyzer

Statistical Approaches

  • Aim: are any samples different from others in technical preparation?

  • Exploratory Data Analysis (EDA)

    • Box plots, density plots, clustering, PCA

  • Are there any outliers?

    • These could be biologically interesting

  • Are there associations with technical factors?

    • Technician; date of sample prep; etc.

EDA - Boxplots

  • Boxplot of 16 chips from Cheung et al Nature 2005

Each Pair Replicates One Sample

  • Boxplot of 16 chips from Cheung et al Nature 2005

Some Causes of Technical Variation

  • Amount of RNA in each sample always differs

  • Yield of conversion to cDNA or cRNA may differ

  • Label incorporation may differ

  • Temperature of hybridization may differ

  • RNA may be slightly degraded in some samples

  • Strength of ionic buffers differs

  • Stringency of wash differs

  • Scratches may occur on some chips

  • Ozone may bleach Cy5 at some times

Borrow an Idea from Model Testing

  • Question: Is the model adequate? Or do hidden factors cause systematic errors?

  • Examine residuals after fitting model

    • Should be IID Normal

    • Is there structure in residuals?

    • Plot against known technical covariates, such as order of sample

  • How to adapt residual examination for high-throughput assays?

Statistical QA for Arrays

  • Model for signal of probe $i$ on chip $j$: $y_{ij} \sim \mu_i + \varepsilon_{ij}$

    • Each gene has same mean in all arrays (mostly true)

    • Look at residuals after fitting model

  • New twist for high-throughput assays:

    • Examine residuals within each chip (fix j; vary i)

    • Plot against known technical factors of probes

    • Is there any factor that seems to be predicting systematic errors?
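The residual check described above can be sketched in a few lines. This is only an illustration under stated assumptions: the probe mean is estimated by the median across chips (a robust stand-in for the fitted mean), the technical factor is GC content, and the data and all function names are hypothetical.

```python
from statistics import mean, median

def probe_residuals(signal):
    """signal[i][j] is the log intensity of probe i on chip j.
    Returns r[i][j] = y_ij - m_i, with m_i estimated by the median across chips
    (an assumption: a robust stand-in for the fitted probe mean)."""
    return [[y - median(row) for y in row] for row in signal]

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has no variation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

def residual_trend_per_chip(signal, factor):
    """Fix chip j, vary probe i: correlate that chip's residuals with a probe-level factor."""
    res = probe_residuals(signal)
    n_probes, n_chips = len(signal), len(signal[0])
    return [pearson([res[i][j] for i in range(n_probes)], factor)
            for j in range(n_chips)]

# Hypothetical data: 4 probes on 3 chips; the third chip has a GC-dependent bias.
gc = [0.30, 0.40, 0.50, 0.60]
base = [8.0, 9.0, 10.0, 11.0]
signal = [[m, m, m + 3.0 * (g - 0.45)] for m, g in zip(base, gc)]
trend = residual_trend_per_chip(signal, gc)
```

In this toy case only the third chip shows residuals that track GC content; on real data one would plot the residuals rather than rely on a single correlation.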

Statistical QA of Arrays

  • Significant artifacts may not be obvious from visual inspection or bulk statistics

  • General approach: plot deviations from average or residuals from fit against any technical variable:

    • GC content or Tm (thermodynamics)

    • Probe position relative to 3’ end of gene (for poly-T primed RNA)

    • Physical location on chip (fluid artifacts)

    • Average Intensity across chips (saturation)

Ratio vs Intensity Plots Reveal Saturation & Quenching

Saturation: decreasing rate of binding of RNA as more RNA occupies the probe

Quenching: light emitted by one dye molecule may be re-absorbed by a nearby dye molecule, then lost as heat

Effect proportional to square of density

Plot of log ratio against average log intensity across chips

GSM25377 from the CEPH expression data GSE2552

How Much Variability on R-I?

  • Ratio-Intensity plots for six arrays at random from Cheung et al Nature (2005)

RNA Quality Plots in Bioconductor

  • AffyRNAdeg plots in the affy package

  • Effects do not appear large because they are averaged across genes

  • Samples with RNA quality differences stick out

Plot of average intensity for each probe position across all genes against probe position

Local Bias on Affymetrix Chips

Image of raw data on a log2 scale shows striations but no obvious artifacts

Image of ratios of probes to standard shows a smudge

Non-coding probes

Images show high values as red, low values as yellow

Spatial Artifacts on Affy Chips

Bubbles (yellow) in hybridization chamber

Touching cover slip and wiping incompletely

Scratches on cover slip

Model-Based QC for Affy in BioC

  • Robust Multi-array Average (RMA)

    • fits a linear model to each probe set

    • High residuals show regional patterns

High residuals in green

See http://plmimagegallery.bmbolstad.com/

Available in affyQCReport package at www.bioconductor.org

Affy QC Metrics in Bioconductor

  • affyPLM package fits probe level model to Affymetrix raw data

  • NUSE - Normalized Unscaled Standard Errors

    • normalized relative to each gene

  • How many big errors?
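As a rough illustration of the "normalized relative to each gene" step, the sketch below divides each probe set's standard error by the median standard error of that probe set across chips, then summarizes each chip by its median. It does not call affyPLM; the standard errors are made-up numbers and the function names are hypothetical.

```python
from statistics import median

def nuse(standard_errors):
    """standard_errors[g][j] = SE of the expression estimate for probe set g on chip j.
    Each row is divided by its own median, so a typical chip scores about 1."""
    return [[se / median(row) for se in row] for row in standard_errors]

def chip_medians(nuse_values):
    """Median NUSE per chip; values well above 1 flag a chip with many large errors."""
    n_chips = len(nuse_values[0])
    return [median(row[j] for row in nuse_values) for j in range(n_chips)]

# Hypothetical standard errors for 3 probe sets on 3 chips; the third chip is noisy.
se = [
    [0.10, 0.10, 0.20],
    [0.05, 0.05, 0.15],
    [0.20, 0.20, 0.30],
]
per_chip = chip_medians(nuse(se))
```

Here the first two chips have a median NUSE of 1 and the third stands out at about 2.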

Spatial Artifacts in Agilent

  • Usually artifacts are not as strong as on other array types

  • BUT – consequential because only one probe per gene

  • More diffuse artifacts are common

    • probably reflecting wash irregularities


Background Estimation

Mark Reimers

General Issues in Estimating and Compensating Background

  • ‘Background’ is heterogeneous – different genomic regions or probes have very different background levels

  • Most are comparable and a few are high

Microarray Background

  • Non-specific hybridization

  • Cross-hybridization to specific non-targets

  • Distribution of Background has outliers

    • High-GC probes are more variable than low-GC probes

Current Model for Background Estimation

  • 25-mers are prone to cross-hybridization

  • MM > PM for about 1/3 of all probes

  • Cross-hybridization varies with GC content

  • Bases at ends matter less than central

  • Signal intensity varies with cross-hybe

  • Simple approach is a linear model with one coefficient per base per position:

$\mu_{j,k}$ are the mean effects of base $j$ at position $k$
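The formula from the original slide did not survive in this transcript. One common way to write a position-dependent base model of this kind is given below. It is an assumption, not the slide's exact notation: $B_p$ is the background intensity of probe $p$, $b_{p,k}$ is the base at position $k$ of that probe, $\mu_{j,k}$ is the mean effect of base $j$ at position $k$, and $\varepsilon_p$ is the error term.

```latex
\log B_p \;=\; \sum_{k=1}^{25} \; \sum_{j \in \{A, C, G, T\}} \mu_{j,k} \, \mathbf{1}[\, b_{p,k} = j \,] \;+\; \varepsilon_p
```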

The gcRMA Approach to Background Correction

Estimate non-specific binding using either:

  • True null assay (non-homologous RNA)

  • Estimates from MM

Rather than fit 25 independent coefficients, fit a spline with 5 df for each base

Process background first; then normalize and fit model

Figure: typical coefficients fit for each base at each position in the gcRMA background model (using 5 df splines to model each base curve)

Evaluating the gcRMA Model

  • We compared RNA-Seq data to microarray data on the same samples to identify genes that were not expressed; for the probes of those genes, all signal is cross-hybridization

  • We fit the gcRMA model to those probes

  • The model explained less than 10% of the variance among probes

Evaluating gcRMA

  • gcRMA won on AffyComp data sets (2006) using replicates with 14 spike-ins done by Affy

  • Many investigators get bad results (and don’t write it up)

    • Gharaibeh et al., BMC Bioinformatics 2008, 9:452, claimed that gcRMA does very well on highly expressed genes, not nearly so well on less expressed genes

  • Highly expressed genes are precisely where background correction doesn’t matter

    Why Does gcRMA Fail?

    • gcRMA estimates cross-hybridization by fitting regression to MM probes

    • MM probes contain a good deal of specific signal

    • Symptom: gcRMA curves are almost identical for different chips, but cross-hybe varies considerably between chips when assessed by other means (e.g. comparing controls or fitting the gcRMA model to genes known to be absent)

    Does Cross-Hybridization Matter for Long Oligos?

    • Variation in GC content is more constrained

    • Cross-hybridization seems much more uniform

    • Too hard to estimate individual effects of bases

    • Model using quadratic curve to estimate distributions of bases over length is effective at reducing error

      • Three terms: constant, linear, quadratic

    Microarray Normalization: Widely-Used Methods

    Common Normalization Methods

    • Simple parametric methods

      • Align mean or median intensities

      • Match mean/median and SD/MAD

    • Nonparametric methods

      • Lowess for two-color arrays

      • Align an ‘Invariant Set’ across arrays

      • ‘Shoehorn’ all samples to a common distribution

    How to Assess Normalization?

    • We want to minimize technical variations in relation to biological variation

      • Most tests like t-test or ANOVA compare technical and between-group variance

    • Compare distributions of biological to technical variation after normalization

    • Most small estimates of variance are under-estimates

    One Parameter Alignment

    • Set target mean or median

      • Usually mean on a log scale to avoid influence of a few very high intensities

    • Scale all values to match mean

      • Add or subtract from log values

    • Agilent suggests aligning 75th percentile (3rd quartile) of distributions

      • Median of the half of genes that are expressed
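A minimal sketch of one-parameter alignment on the log scale, where scaling becomes adding a constant. The target statistic can be the median or the 75th percentile, as in the Agilent suggestion above. The two chips are invented values and the function names are hypothetical.

```python
from statistics import median, quantiles

def third_quartile(values):
    """75th percentile of the observed values."""
    return quantiles(values, n=4, method="inclusive")[2]

def align_location(log_values, target, stat=median):
    """Shift one chip's log intensities so that stat(values) equals the target.
    Adding a constant on the log scale is the same as scaling raw intensities."""
    shift = target - stat(log_values)
    return [v + shift for v in log_values]

# Hypothetical log2 intensities: chip_b is uniformly 1.5 units brighter than chip_a.
chip_a = [6.0, 7.0, 8.0, 9.0, 10.0]
chip_b = [7.5, 8.5, 9.5, 10.5, 11.5]

by_median = align_location(chip_b, target=median(chip_a))
by_q3 = align_location(chip_b, target=third_quartile(chip_a), stat=third_quartile)
```

Because chip_b differs from chip_a by a constant only, both choices of statistic recover chip_a exactly; on real data they would differ when the distributions have different shapes.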

    Two-color Intensity-Dependent Bias

    • Different amounts of fluorescent label get incorporated into DNA

    Therefore saturation occurs at different densities for Cy-3 (green) and Cy-5 (red) dyes

    We estimate the bias by an intensity-dependent function


    Global (lowess) Normalization (2000)

    Globally normalized data $\{(M, A)\}_{n=1..N}$, where $M$ is the log ratio and $A$ the average log intensity:

    $M_{\text{norm}} = M - b(A)$

    The bias, $b(A)$, could be determined by any local averaging method

    Terry Speed suggested lowess (locally weighted regression)

    Subtract $b(A)$ to obtain ‘corrected’ data
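Since the bias can be estimated by any local averaging method, the sketch below uses a running median over neighbours in A. This is a deliberately simple stand-in, not lowess itself. The window size, data and function names are assumptions made for illustration.

```python
from statistics import median

def estimate_bias(M, A, window=3):
    """Estimate b(A) at each spot as the median of M over the `window` spots
    nearest in A-rank. A running median stands in for lowess here."""
    order = sorted(range(len(A)), key=lambda n: A[n])  # spot indices sorted by A
    half = window // 2
    bias = [0.0] * len(M)
    for rank, n in enumerate(order):
        lo = max(0, rank - half)
        hi = min(len(order), rank + half + 1)
        bias[n] = median(M[order[k]] for k in range(lo, hi))
    return bias

def normalize(M, A, window=3):
    """Return M_norm = M - b(A)."""
    return [m - b for m, b in zip(M, estimate_bias(M, A, window))]

# Hypothetical two-color data: log ratios are biased upward by 0.5 at low intensity
# and downward by 0.5 at high intensity; the third spot also carries a true change of 2.
A = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
M = [0.5, 0.5, 2.5, 0.5, -0.5, -0.5, -0.5, -0.5]
M_norm = normalize(M, A)
```

The local median removes the intensity-dependent offset while leaving the one genuinely changed spot standing out, which is the behaviour wanted from b(A).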

    Quantile Normalization Directly Addresses Incompatible Distributions

    Quantile Normalization

    • Determine reference distribution (can use any good chip or average a set of chips)

    • For each chip, for each probe, determine quantile within that chip

    • Shift to corresponding quantile of reference distribution


    • Easy to implement

    • Resolves intensity dependent bias as well as loess
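The three steps above can be sketched as follows, assuming the reference distribution is the average of the sorted chips (one of the options mentioned). Ties are broken by probe order rather than averaged, which is a simplification. The data and function name are illustrative only.

```python
def quantile_normalize(chips):
    """chips: list of equal-length lists of intensities, one list per chip.
    Each value is replaced by the reference value at the same rank."""
    n = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Reference distribution: mean across chips of the r-th smallest value.
    reference = [sum(s[r] for s in sorted_chips) / len(chips) for r in range(n)]
    normalized = []
    for chip in chips:
        order = sorted(range(n), key=lambda i: chip[i])  # probe indices, low to high
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized

# Hypothetical intensities for 3 probes on 2 chips.
chips = [[2.0, 4.0, 6.0],
         [3.0, 1.0, 8.0]]
result = quantile_normalize(chips)
```

After normalization every chip has exactly the same set of values; only the assignment of values to probes differs between chips.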

    Quantile Normalization Method (Irizarry et al 2002)

    The mapping by quantile normalization

    Key Assumption of Quantile Norm

    • The processes that distort the distribution act on all probes of a given intensity more or less equally

    • Probably true within differences of 30% or 40%

    • Smaller differences depend quite strongly on technical characteristics of probes

    Critiques of Quantile Normalization

    • Compresses variation of highly expressed genes

    • Confounds systematic changes due to cross-hybridization with changes in abundance for genes of low expression

    • Induces artificial correlations in gene expression across samples

    Special Issues: Cancer

    • Many cancers express very many genes at much higher than normal levels

    • Quantile normalization forces down genes that actually stay the same, including many that are absent

    • Results improve if the normalization is fit after removing genes that are very high or obviously higher in cancers, and the removed genes are then fitted in around the rest

    Overall Evaluation of Common Normalization Methods

    • The popular simple non-parametric methods such as lowess and quantile normalization significantly improve reproducibility of results for most arrays in many common circumstances

    • There are significantly better, but more complex normalization methods

    • No good normalization method can afford to ignore real changes in the distribution

    Summarization of Affymetrix Expression Arrays

    What is Summarization?

    • Some expression arrays (Affymetrix, Nimblegen) use multiple probes to target a single transcript – a ‘probe set’

    • Typically probes have different fold changes between any two samples

    • How to effectively summarize the information in a probe set?

    Affy: Perfect Match and Mismatch

    How to combine signals from PM & MM?

    Mostly ignore MM – not used on modern chips

    Probe Variation

    • Individual probes don’t agree on fold changes

    • Probes vary by two orders of magnitude on each chip

      • GC content is the most important factor in signal strength

    Signal from 16 probes along one gene on one chip

    Many Approaches to Summarization

    • Affymetrix MicroArray Suite; PLiER

    • dChip - Li and Wong, HSPH

    • Bioconductor:

      • RMA - Bolstad, Irizarry, Speed, et al

      • affyPLM – Bolstad

      • gcRMA – Wu

    • Physical chemistry models – Zhang et al

    • Factor model

    • Probe-weighting

    Critique of Averaging (MAS5)

    • Not clear what an average of different probes should mean

    • Tukey bi-weight can be unstable when data cluster at either end – frequently the conditions here

    • No ‘learning’ based on performance of individual probes across chips
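The instability noted above can be seen in a small sketch of a one-step Tukey biweight average. The tuning constants c = 5 and eps = 0.0001 are assumed here as commonly quoted defaults, not taken from the slides. The probe values are invented and the function name is hypothetical.

```python
from statistics import median

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean in which points far from the
    median (relative to the MAD) get little or no weight."""
    m = median(values)
    mad = median(abs(v - m) for v in values)  # median absolute deviation
    weights = []
    for v in values:
        u = (v - m) / (c * mad + eps)
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Well-behaved probe set with one outlier: the outlier is ignored.
steady = tukey_biweight([7.0, 7.1, 6.9, 7.0, 12.0])

# Probes clustered at two ends: moving a single probe flips the summary.
mostly_low = tukey_biweight([6.0, 6.0, 6.0, 6.0, 10.0, 10.0, 10.0])
mostly_high = tukey_biweight([6.0, 6.0, 6.0, 10.0, 10.0, 10.0, 10.0])
```

In the clustered case the summary jumps from 6 to 10 when one probe changes sides, because the median and MAD both lock onto whichever cluster is larger.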

    Motivation for multi-chip models:

    Probe-level data from spike-in study (log scale); note the parallel trend of all probes

    Courtesy of Terry Speed

    A Linear Model for Probe Intensities

    • Imagine a matrix $[b_{ij}]$ of intensities (brightness) of probes $j$ in one probe set across all arrays $i$

    • Typically $j = 1, \dots, 4$ or $j = 1, \dots, 11$

    • Each probe $p_j$ binds the gene with efficiency $f_j$

    • Sample $i$ has an amount $a_i$ of target RNA

    • Probe intensity $b_{ij}$ should be proportional to $f_j \times a_i$

    • For now we ignore non-specific hybridization

      • A probe can give high signal when binding to intended target and also to other transcripts

    [Figure: matrix of probe intensities with probes 1, 2, 3 as columns and chips 1, 2 as rows]

    Bolstad, Irizarry, Speed – (RMA)

    • For each probe set, take $\log(b_{ij})$ and model it additively: $\log(\hat{b}_{ij}) = \log(a_i) + \log(f_j) + \varepsilon_{ij}$

    • where caret represents “after pre-processing”

    • Problem: there may be many outliers

    • Therefore fit this additive model by iteratively re-weighted least-squares or median polish

    Median Polish

    • Residuals of regular linear model have row and column sums 0

    • Tukey proposed iteratively subtracting medians from rows and columns successively, until row and column medians converge to 0

    • Add up accumulated row summaries to give estimates of relative abundance of gene
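A compact sketch of Tukey's median polish for one probe set, with arrays as rows and probes as columns to match the matrix of intensities described earlier. It is a plain-Python illustration with invented data, not the RMA implementation in Bioconductor. The iteration limit and tolerance are arbitrary choices.

```python
from statistics import median

def median_polish(table, max_iter=10, tol=1e-6):
    """table[i][j] = log intensity for array i, probe j.
    Returns (overall, row_effects, col_effects, residuals)."""
    n_rows, n_cols = len(table), len(table[0])
    resid = [row[:] for row in table]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep out row medians and accumulate them as row (array) effects.
        row_med = [median(r) for r in resid]
        for i in range(n_rows):
            resid[i] = [v - row_med[i] for v in resid[i]]
            row_eff[i] += row_med[i]
        m = median(col_eff)
        col_eff = [c - m for c in col_eff]
        overall += m
        # Sweep out column medians and accumulate them as column (probe) effects.
        col_med = [median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for i in range(n_rows):
            resid[i] = [v - col_med[j] for j, v in enumerate(resid[i])]
        col_eff = [c + cm for c, cm in zip(col_eff, col_med)]
        m = median(row_eff)
        row_eff = [r - m for r in row_eff]
        overall += m
        if max(abs(x) for x in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid

# Hypothetical probe set: 3 arrays x 3 probes, additive except for one outlier (15.0).
table = [[7.0, 8.0, 15.0],
         [8.0, 9.0, 10.0],
         [9.0, 10.0, 11.0]]
overall, row_eff, col_eff, resid = median_polish(table)
expression = [overall + r for r in row_eff]  # per-array summary on the log scale
```

The outlier ends up entirely in the residuals, so the per-array summaries are the same as they would be for the clean additive table.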

    Bioinformatics Issues

    • Probes may not map accurately

    • SNPs in probes

    • Affymetrix places most probes in 3’UTR of genes

      • Alternate Poly-A sites mean that some probe targets may occur less often than other probe targets from the same gene

    Correlations Among Probes Within a Single Probe Set

    Scale: Red = 1; Blue = 0

    Alternate Poly-Adenylation Sites

    Poly-A marks mRNA ‘tail’

    Many genes have alternative poly-A sites

    3’ UTR may be longer or shorter

    Early Affymetrix probe sets were in 3’UTR

    Probe Set Definitions

    • For every type of chip, the probes were designed according to the state of the art knowledge at the time, according to UniGene (national database of transcript sequences)

    • There has been a significant increase in sequence information available over the last several years!

    • New CDFs (chip definition files) are constructed regularly to reflect up-to-date knowledge of where each individual probe maps

    Some sources for custom CDFs

    • BrainArray project

      • www.brainarray.mbni.med.umich.edu

    • National Cancer Institute

      • www.masker.nci.nih.gov/ev/

    • Weizmann Institute of Science

      • www.genecards.weizmann.ac.il/geneannot/customcdf.shtml

    • Nutrigenomics consortium

      • www.nugo-r.bioinformatics.nl/NuGO_R.html

    • Bioconductor repository

      • www.bioconductor.org/packages/release/data/annotation

    How much difference does it make?

    Dai, et al. Evolving gene/transcript definitions significantly alter

    the interpretations of GeneChip data. Nucleic Acids Research,

    2005, 33(20):e175.

    How much difference does it make?

    • Original tests showed a 30-50% difference between predicted differential expression.

      • Hall JL, Grindle S, et al. Genomic profiling of the human heart before and after mechanical support with a ventricular assist device reveals alterations in vascular signaling networks. Physiological Genomics 2004, 17(3):283-291

    • Comparisons with more recent manufacturer CDFs also show significantly more precise and reproducible results.

      • Rickard Sandberg and Ola Larsson Improved precision and accuracy for microarrays using updated probe set definitions. BMC Bioinformatics (2007) 8:48.