Microarray Pre-Processing

Mark Reimers

CSHL Data 2012



Outline

  • Microarray technologies

  • Quality assessment

  • Background

  • Normalization

    • Other normalization issues

  • Summarization of Affymetrix

Microarray Technologies



Outline

  • Library preparation

  • Hybridization

  • cDNA expression arrays

  • Oligo expression arrays

    • Agilent

    • Affymetrix

    • Illumina

    • NimbleGen

  • Other array types



  • Microarrays measure the abundance of DNA or RNA by relative hybridization



Preparing a cDNA/RNA Library from mRNA

  • Reverse transcribe cDNA from RNA

  • Fragment

  • Amplify cDNA

  • OR

  • Use cDNA to transcribe RNA

Glass Slide Microarrays (1994)

Printing Glass Slide Arrays

Synthetic Oligonucleotide Arrays

Up to 25 bases

Affymetrix Probes Schematic


Probes are



Affymetrix Probe Sets

  • Probes for older expression arrays are drawn from the 3’ end of the gene

  • Poly-T priming picks up poly-A tails of transcripts

  • Newer exon and whole-gene arrays have probes evenly distributed

  • Random priming gives more even coverage – but not uniform!

Printed Oligonucleotide Arrays

  • Agilent (off-shoot of HP) uses printing technology

Agilent Arrays

  • Now second largest supplier of arrays

  • Reputation for high quality and attention to detail (e.g. scanner optics)

  • Typical 60 nucleotide probes (60-mers)

  • 44K, 185K, and 244K standard sizes

  • Can do several (up to 8) arrays per slide

NimbleGen Oligonucleotide Arrays

NimbleGen uses a micro-mirror method to de-protect during in situ oligo synthesis

(Roche-) NimbleGen Arrays

  • Usually 60-mers

  • Random sequence controls provided

  • Standard sizes from 385K up to 2.1 million probes

  • Can also be multiplexed

  • Patent issues kept the production facility in Iceland

Illumina Bead Arrays

  • 3 µm beads manufactured with identifying segment (~12 nt) and 50-mer probe for target

  • Beads in wells (for some assays with optical fiber)

  • First scan reads ID tag; second reads target

Illumina Probes

  • Typically about 30 beads per probe on each array

  • SD very high

  • No controls on most arrays

  • Can be multiplexed

Microarray Quality Assessment

Quality Assessment

  • You are going to be doing a lot of intense analysis on expensive data

  • Are there any factors that would lead you to doubt or distrust a particular datum (array)?

  • Quality of library – e.g. RNA quality

  • Quality of hybridization process

  • Statistical QA – try to detect non-random technical variation on any chip

RNA Quality

Ideal: Two sharp peaks for 18S & 28S RNA

Agilent BioAnalyzer

Statistical Approaches

  • Aim: are any samples different from others in technical preparation?

  • Exploratory Data Analysis (EDA)

    • Box plots, density plots, clustering, PCA

  • Are there any outliers?

    • These could be biologically interesting

  • Are there associations with technical factors?

    • Technician; date of sample prep; etc.

EDA - Boxplots

  • Boxplot of 16 chips from Cheung et al Nature 2005

Another Portrait - Densities

Each Pair Replicates One Sample

  • Boxplot of 16 chips from Cheung et al Nature 2005

Some Causes of Technical Variation

  • Amount of RNA in the sample always differs

  • Yield of conversion to cDNA or cRNA may differ

  • Label incorporation may differ

  • Temperature of hybridization may differ

  • RNA may be slightly degraded in some samples

  • Strength of ionic buffers differs

  • Stringency of wash differs

  • Scratches may occur on some chips

  • Ozone may bleach Cy5 at some times

Borrow an Idea from Model Testing

  • Question: Is the model adequate? Or do hidden factors cause systematic errors?

  • Examine residuals after fitting model

    • Should be IID Normal

    • Is there structure in residuals?

    • Plot against known technical covariates, such as order of sample

  • How to adapt residual examination for high-throughput assays?

Statistical QA for Arrays

  • Model for signal of probe i on chip j: y_ij ~ m_i + e_ij

    • Each gene has same mean in all arrays (mostly true)

    • Look at residuals after fitting model

  • New twist for high-throughput assays:

    • Examine residuals within each chip (fix j; vary i)

    • Plot against known technical factors of probes

    • Is there any factor that seems to be predicting systematic errors?
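
A minimal sketch of this residual check in Python, assuming a small table of log intensities indexed as signal[i][j] (probe i, chip j) and one technical factor per probe, such as GC count. The function names and data layout are illustrative assumptions, not taken from any package mentioned in these slides.

```python
from statistics import mean, median


def chip_residuals(signal):
    """Fit y_ij ~ m_i + e_ij: m_i is the mean of probe i over chips.

    signal[i][j] is the log intensity of probe i on chip j.
    Returns residuals with the same shape as signal.
    """
    residuals = []
    for row in signal:
        m_i = mean(row)
        residuals.append([y - m_i for y in row])
    return residuals


def residuals_by_factor(residuals, factor, chip):
    """Fix one chip (j) and summarize its residuals by a probe-level factor.

    factor[i] is a technical covariate of probe i (hypothetical example:
    number of G/C bases). Returns {factor level: median residual}.
    """
    groups = {}
    for i, row in enumerate(residuals):
        groups.setdefault(factor[i], []).append(row[chip])
    return {level: median(vals) for level, vals in sorted(groups.items())}
```

If one chip shows median residuals that trend with the factor while the other chips stay near zero, that factor is a candidate source of systematic error on that chip.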

Statistical QA of Arrays

  • Significant artifacts may not be obvious from visual inspection or bulk statistics

  • General approach: plot deviations from average or residuals from fit against any technical variable:

    • GC content or Tm (thermodynamics)

    • Probe position relative to 3’ end of gene (for poly-T primed RNA)

    • Physical location on chip (fluid artifacts)

    • Average Intensity across chips (saturation)

Ratio vs Intensity Plots Reveal Saturation & Quenching

  • Saturation: decreasing rate of binding of RNA as more RNA occupies the probe

  • Quenching: light emitted by one dye molecule may be re-absorbed by a nearby dye molecule, then lost as heat

    • Effect proportional to square of density

Plot of log ratio against average log intensity across chips

GSM25377 from the CEPH expression data GSE2552

How Much Variability on R-I?

  • Ratio-Intensity plots for six arrays at random from Cheung et al Nature (2005)

RNA Quality Plots in Bioconductor

  • AffyRNAdeg plots in affy package

  • Effects do not appear large because averaged

  • Samples with RNA quality differences stick out

Plot of average intensity for each probe position across all genes against probe position
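
The averaging behind such a plot can be sketched in a few lines of Python. This assumes each probe set is given as a list of log intensities ordered by probe position; the function name is hypothetical, and the affy implementation may apply additional shifting or scaling that is not reproduced here.

```python
from statistics import mean


def mean_by_probe_position(probe_sets):
    """Average intensity at each probe position across all probe sets.

    probe_sets: list of lists; probe_sets[g][k] is the log intensity of
    the k-th probe (ordered along the transcript) of probe set g.
    Only positions present in every probe set are used.
    """
    n_positions = min(len(ps) for ps in probe_sets)
    return [mean(ps[k] for ps in probe_sets) for k in range(n_positions)]
```

Computed once per chip, these position profiles can be overlaid; a chip whose profile has a clearly different slope from the rest is the kind of sample that "sticks out".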

Local Bias on Affymetrix Chips

Image of raw data on a log2 scale shows striations but no obvious artifacts

Image of ratios of probes to standard shows a smudge

Non-coding probes

Images show high values as red, low values as yellow

Spatial Artifacts on Affy Chips

Bubbles (yellow) in hybridization chamber

Touching cover slip and wiping incompletely

Scratches on cover slip

Model-Based QC for Affy in BioC

  • Robust Multi-chip Analysis (RMA)

    • fits a linear model to each probe set

    • High residuals show regional patterns

High residuals in green

See http://plmimagegallery.bmbolstad.com/

Available in affyQCReport package at www.bioconductor.org

Affy QC Metrics in Bioconductor

  • affyPLM package fits probe level model to Affymetrix raw data

  • NUSE - Normalized Unscaled Standard Errors

    • normalized relative to each gene

  • How many big errors?
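
As a rough illustration of the normalization step in NUSE, the sketch below divides each gene's standard error on each array by that gene's median standard error across arrays. It assumes the standard errors have already been produced by a probe-level model fit; the function names and the input layout are assumptions for illustration, not the affyPLM interface.

```python
from statistics import median


def nuse(std_errors):
    """Normalize standard errors relative to each gene.

    std_errors[g][j] is the standard error of the expression estimate
    for gene g on array j. Each row is divided by its own median, so a
    typical array has values near 1 for every gene.
    """
    normalized = []
    for row in std_errors:
        med = median(row)
        normalized.append([se / med for se in row])
    return normalized


def array_medians(nuse_values):
    """Median NUSE per array; values well above 1 flag a noisy array."""
    n_arrays = len(nuse_values[0])
    return [median(row[j] for row in nuse_values) for j in range(n_arrays)]
```

Counting, per array, how many genes have a normalized value far above 1 gives one answer to the "how many big errors?" question.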

Spatial Artifacts in Agilent

  • Usually artifacts are not as strong as on other array types

  • BUT – consequential because only one probe per gene

  • More diffuse artifacts are common

    • probably reflecting wash irregularities

Bioconductor arrayQuality Package

Background Estimation

Mark Reimers

General Issues in Estimating and Compensating Background

  • ‘Background’ is heterogeneous – different genomic regions or probes have very different background levels

  • Most are comparable and a few are high

Microarray Background

  • Non-specific hybridization

  • Cross-hybridization to specific non-targets

  • Distribution of Background has outliers

    • High-GC probes are more variable than low-GC probes

Current Model for Background Estimation

  • 25-mers are prone to cross-hybridization

  • MM > PM for about 1/3 of all probes

  • Cross-hybridization varies with GC content

  • Bases at ends matter less than central

  • Signal intensity varies with cross-hybe

  • Simple approach is linear model:

m_{j,k} are mean effects of base j at position k
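
One way to read this linear model is that the predicted (log) cross-hybridization signal of a probe is the sum, over positions k, of the effect m_{j,k} of whichever base j sits at that position. The Python sketch below evaluates such a sum under that assumption; the exact form of the equation on the original slide is not recoverable here, and the coefficient table in the usage note is made up purely for illustration.

```python
def probe_affinity(sequence, effects):
    """Sum position-specific base effects along a probe sequence.

    sequence: probe sequence as a string over A, C, G, T.
    effects:  dict mapping each base to a list of coefficients, where
              effects[base][k] plays the role of m_{j,k} (effect of
              base j at position k). Values are hypothetical.
    """
    if any(len(coeffs) < len(sequence) for coeffs in effects.values()):
        raise ValueError("need one coefficient per position for each base")
    return sum(effects[base][k] for k, base in enumerate(sequence))
```

For a toy 3-mer with effects {"A": [0, 0, 0], "C": [0.1, 0.3, 0.1], "G": [0.1, 0.2, 0.1], "T": [0, 0, 0]}, the probe "ACG" picks up the central C effect and the terminal G effect, which mirrors the point that central bases matter more than bases at the ends.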

The gcRMA Approach to Background Correction

  • Estimate non-specific binding using either:

    • True null assay (non-homologous RNA)

    • Estimates from MM

  • Rather than fit 25 independent coefficients, fit spline with 5 df for each base

  • Process background first; then normalize and fit model

Typical coefficients fit for each base at each position in the gcRMA background model (using 5 df splines to model each base curve)

Evaluating the gcRMA Model

  • We compared RNA-Seq data to microarray data on the same samples to identify genes that were not expressed; therefore all signal is cross-hybridization for these probes

  • We fit the gcRMA model to those probes

  • The model explained less than 10% of the variance among probes

Evaluating gcRMA

  • gcRMA won on AffyComp data sets (2006) using replicates with 14 spike-ins done by Affy

  • Many investigators get bad results (and don’t write it up)

    • Gharaibeh et al., BMC Bioinformatics 2008, 9:452, claimed that gcRMA does very well on highly expressed genes, not nearly so well on less expressed genes

  • That’s precisely where it doesn’t matter

    Why Does gcRMA Fail?

    • gcRMA estimates cross-hybridization by fitting regression to MM probes

    • MM probes contain a good deal of specific signal

    • Symptom: gcRMA curves are almost identical for different chips, but cross-hybe varies considerably between chips assessed by other means (e.g. comparing controls or fitting the gcRMA model to genes known to be absent)

    Does Cross-Hybridization Matter for Long Oligos?

    • Variation in GC content is more constrained

    • Cross-hybridization seems much more uniform

    • Too hard to estimate individual effects of bases

    • Model using quadratic curve to estimate distributions of bases over length is effective at reducing error

      • Three terms: constant, linear, quadratic

    Background Varies Across Long-Oligo Arrays

    Microarray Normalization: Widely-Used Methods

    Common Normalization Methods

    • Simple parametric methods

      • Align mean or median intensities

      • Match mean/median and SD/MAD

    • Nonparametric methods

      • Lowess for two-color arrays

      • Align an ‘Invariant Set’ across arrays

      • ‘Shoehorn’ all samples to a common distribution
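
The two-parameter variant (match median and MAD) can be sketched as a shift plus a rescaling of one chip's log intensities. This is a minimal Python illustration with hypothetical function names; it assumes the chip's MAD is non-zero and leaves the choice of target values to the analyst.

```python
from statistics import median


def mad(values):
    """Median absolute deviation from the median (unscaled)."""
    med = median(values)
    return median(abs(v - med) for v in values)


def match_median_mad(chip, target_median, target_mad):
    """Shift and scale one chip so its median and MAD hit the targets.

    chip: list of log intensities for a single array.
    Assumes mad(chip) > 0; a constant chip cannot be rescaled.
    """
    med = median(chip)
    spread = mad(chip)
    if spread == 0:
        raise ValueError("chip has zero MAD; cannot rescale")
    return [target_median + (v - med) * target_mad / spread for v in chip]
```

Using the median and MAD rather than the mean and SD keeps a few very bright probes from dominating the fit.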

    How to Assess Normalization?

    • We want to minimize technical variations in relation to biological variation

      • Most tests like t-test or ANOVA compare technical and between-group variance

    • Compare distributions of biological to technical variation after normalization

    • Most small estimates of variance are under-estimates

    One Parameter Alignment

    • Set target mean or median

      • Usually mean on a log scale to avoid influence of a few very high intensities

    • Scale all values to match mean

      • Add or subtract from log values

    • Agilent suggests aligning 75th percentile (3rd quartile) of distributions

      • Median of the half of genes that are expressed
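
A minimal Python sketch of one-parameter alignment on the log scale: each chip is shifted by a constant so that a chosen location statistic matches a common target. The function names are illustrative assumptions; the default target (the mean of the per-chip statistics) is just one reasonable choice.

```python
from statistics import mean, median, quantiles


def upper_quartile(values):
    """75th percentile (3rd quartile) of a list of values."""
    return quantiles(values, n=4)[2]


def align_location(chips, stat=median, target=None):
    """Add a constant to each chip's log values so stat(chip) == target.

    chips:  list of chips, each a list of log intensities.
    stat:   location statistic to align (median, mean, upper_quartile).
    target: common value to align to; defaults to the mean of the
            per-chip statistics (an assumption of this sketch).
    """
    locations = [stat(chip) for chip in chips]
    if target is None:
        target = mean(locations)
    return [
        [x + (target - loc) for x in chip]
        for chip, loc in zip(chips, locations)
    ]
```

Calling align_location(chips, stat=upper_quartile) gives the 75th-percentile alignment; because the values are logs, adding a constant corresponds to multiplying raw intensities by a scale factor.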

    Two-color Intensity-Dependent Bias

    • Different amounts of fluorescent label get incorporated into DNA

    • Therefore saturation occurs at different densities for Cy-3 (green) and Cy-5 (red) dyes

    • We estimate the bias by an intensity-dependent function


    Global (lowess) Normalization (2000)

    With M the log ratio and A the average log intensity of each spot, the bias b(A) could be determined by any local averaging method

    Terry Speed suggested lowess (local weighted regression)

    Subtract b(A) to obtain ‘corrected’ data

    Globally normalized data {(M_n, A_n)}, n = 1..N:

    Mnorm = M − b(A)
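
Since the bias b(A) can come from any local averaging method, the Python sketch below uses a plain running mean over spots that are neighbours in A, as a simple stand-in for lowess (it is not lowess itself). The function names and the window size are assumptions for illustration; real arrays would need a much wider window than the toy default.

```python
import math


def running_mean_bias(A, M, window=5):
    """Estimate b(A) at each spot as the mean M of its neighbours in A.

    Spots are ranked by A; each spot's bias is the average M over the
    `window` spots centred on its rank (fewer at the edges).
    """
    order = sorted(range(len(A)), key=lambda n: A[n])
    half = window // 2
    bias = [0.0] * len(A)
    for rank, n in enumerate(order):
        lo = max(0, rank - half)
        hi = min(len(order), rank + half + 1)
        neighbours = [M[order[r]] for r in range(lo, hi)]
        bias[n] = sum(neighbours) / len(neighbours)
    return bias


def normalize_two_color(red, green, window=5):
    """Return (A, M_norm) with M_norm = M - b(A) for one two-color array.

    red, green: positive raw intensities for the same spots.
    """
    M = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    A = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(red, green)]
    b = running_mean_bias(A, M, window)
    return A, [m - bn for m, bn in zip(M, b)]
```

A lowess fit differs mainly in weighting nearby spots more heavily and fitting a local line rather than a local constant, which behaves better at the ends of the intensity range.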

    Quantile Normalization Directly Addresses Incompatible Distributions

    Quantile Normalization

    • Determine reference distribution (can use any good chip or average a set of chips)

    • For each chip, for each probe, determine quantile within that chip

    • Shift to corresponding quantile of reference distribution


    • Easy to implement

    • Resolves intensity dependent bias as well as loess
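
The three steps above can be sketched in Python as follows. This version takes the reference distribution to be the average of the sorted chips, assumes every chip has the same number of probes, and breaks ties by probe order; those choices and the function name are assumptions of the sketch.

```python
def quantile_normalize(chips):
    """Map every chip onto a common reference distribution.

    chips: list of chips, each a list of intensities of equal length.
    Reference: rank-wise average of the sorted chips.
    Returns new chips in the original probe order.
    """
    n_probes = len(chips[0])
    if any(len(chip) != n_probes for chip in chips):
        raise ValueError("all chips must have the same number of probes")

    # Reference distribution: average of the r-th smallest values.
    sorted_chips = [sorted(chip) for chip in chips]
    reference = [
        sum(sc[r] for sc in sorted_chips) / len(chips)
        for r in range(n_probes)
    ]

    normalized = []
    for chip in chips:
        # Rank of each probe within its own chip (ties: probe order).
        order = sorted(range(n_probes), key=lambda i: chip[i])
        out = [0.0] * n_probes
        for rank, i in enumerate(order):
            out[i] = reference[rank]  # shift to the reference quantile
        normalized.append(out)
    return normalized
```

After this step every chip has exactly the same set of values; only the assignment of values to probes differs between chips.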

    Quantile Normalization Method (Irizarry et al 2002)

    The mapping by quantile normalization

    Key Assumption of Quantile Norm

    • The processes that distort the distribution act on all probes of a given intensity more or less equally

    • Probably true within differences of 30% or 40%

    • Smaller differences depend quite strongly on technical characteristics of probes

    Critiques of Quantile Normalization

    • Compresses variation of highly expressed genes

    • Confounds systematic changes due to cross-hybridization with changes in abundance for genes of low expression

    • Induces artificial correlations in gene expression across samples

    Special Issues: Cancer

    • Many cancers express very many genes at much higher than normal levels

    • Quantile normalization forces down genes that actually stay the same, including many that are absent

    • Results improve if genes that are very high, and those that are obviously higher in cancers, are removed before fitting the normalization, and the others are then fitted in around them

    Overall Evaluation of Common Normalization Methods

    • The popular simple non-parametric methods such as lowess and quantile normalization significantly improve reproducibility of results for most arrays in many common circumstances

    • There are significantly better, but more complex normalization methods

    • No good normalization method can afford to ignore real changes in the distribution

    Summarization of Affymetrix Expression Arrays

    What is Summarization?

    • Some expression arrays (Affymetrix, Nimblegen) use multiple probes to target a single transcript – a ‘probe set’

    • Typically probes have different fold changes between any two samples

    • How to effectively summarize the information in a probe set?

    Affy: Perfect Match and Mismatch

    How to combine signals from PM & MM?

    Mostly ignore MM – not used on modern chips

    Probe Variation

    • Individual probes don’t agree on fold changes

    • Probes vary by two orders of magnitude on each chip

      • GC content is most important factor in signal strength

    Signal from 16 probes along one gene on one chip

    Many Approaches to Summarization

    • Affymetrix MicroArray Suite; PLiER

    • dChip - Li and Wong, HSPH

    • Bioconductor:

      • RMA - Bolstad, Irizarry, Speed, et al

      • affyPLM – Bolstad

      • gcRMA – Wu

    • Physical chemistry models – Zhang et al

    • Factor model

    • Probe-weighting

    Critique of Averaging (MAS5)

    • Not clear what an average of different probes should mean

    • Tukey bi-weight can be unstable when data cluster at either end – frequently the conditions here

    • No ‘learning’ based on performance of individual probes across chips

    Motivation for multi-chip models:

    Probe-level data from spike-in study (log scale); note parallel trend of all probes

    Courtesy of Terry Speed

    A Linear Model for Probe Intensities

    • Imagine a matrix [b_ij] of intensities (brightness) of probes j in one probe set across all arrays i

    • Typically j = 1,…, 4 or j = 1,…,11

    • Each probe p_j binds the gene with efficiency f_j

    • Sample i has an amount a_i of target RNA

    • Probe intensity b_ij should be proportional to f_j × a_i

    • For now we ignore non-specific hybridization

      • A probe can give high signal when binding to intended target and also to other transcripts

    [Figure: matrix of probe intensities, with rows for chip 1, chip 2 and columns for probes 1, 2, 3]


    Bolstad, Irizarry, Speed – (RMA)

    • For each probe set, take logs and model log(b̂_ij) = log(a_i) + log(f_j) + e_ij

    • where caret represents “after pre-processing”

    • Problem: there may be many outliers

    • Therefore fit this additive model by iteratively re-weighted least-squares or median polish

    Median Polish

    • Residuals of regular linear model have row and column sums 0

    • Tukey proposed iteratively subtracting medians from rows and columns successively, until row and column medians converge to 0

    • Add up accumulated row summaries to give estimates of relative abundance of gene
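
A compact Python sketch of Tukey's median polish for one probe set, assuming rows are chips and columns are probes of a log-intensity table. The function name, iteration cap, and tolerance are assumptions of this illustration; it is not the RMA implementation from Bioconductor.

```python
from statistics import median


def median_polish(table, max_iter=10, tol=1e-6):
    """Fit table[i][j] ~ overall + row[i] + col[j] by sweeping medians.

    table: log intensities, rows = chips (i), columns = probes (j).
    Returns (overall, row_effects, col_effects, residuals).
    overall + row_effects[i] estimates the (log) abundance on chip i.
    """
    n_rows, n_cols = len(table), len(table[0])
    resid = [list(map(float, row)) for row in table]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols

    for _ in range(max_iter):
        # Sweep row medians out of the residuals into the row effects.
        row_med = [median(r) for r in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [x - row_med[i] for x in resid[i]]
        shift = median(col_eff)
        overall += shift
        col_eff = [c - shift for c in col_eff]

        # Sweep column medians out into the column effects.
        col_med = [
            median(resid[i][j] for i in range(n_rows))
            for j in range(n_cols)
        ]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        shift = median(row_eff)
        overall += shift
        row_eff = [r - shift for r in row_eff]

        # Stop once row and column medians are (close to) zero.
        if max(abs(m) for m in row_med + col_med) < tol:
            break

    return overall, row_eff, col_eff, resid
```

Because medians rather than means are swept out, a single outlying probe on one chip ends up in the residuals instead of distorting that chip's summary.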

    Bioinformatics Issues

    • Probes may not map accurately

    • SNPs in probes

    • Affymetrix places most probes in 3’UTR of genes

      • Alternate Poly-A sites mean that some probe targets may occur less often than other probe targets from the same gene

    Correlations Among Probes Within a Single Probe Set

    Scale: Red = 1; Blue = 0

    Alternate Poly-Adenylation Sites

    Poly-A marks mRNA ‘tail’

    Many genes have alternative poly-A sites

    3’ UTR may be longer or shorter

    Early Affymetrix probe sets were in 3’UTR

    Probe Set Definitions

    • For every type of chip, the probes were designed according to the state of the art knowledge at the time, according to UniGene (national database of transcript sequences)

    • There has been a significant increase in sequence information available over the last several years!

    • New CDFs (chip definition files) are constructed regularly to reflect up-to-date knowledge of where each individual probe maps

    Some sources for custom CDFs

    • BrainArray project

      • www.brainarray.mbni.med.umich.edu

    • National Cancer Institute

      • www.masker.nci.nih.gov/ev/

    • Weizmann Institute of Science

      • www.genecards.weizmann.ac.il/geneannot/customcdf.shtml

    • Nutrigenomics consortium

      • www.nugo-r.bioinformatics.nl/NuGO_R.html

    • Bioconductor repository

      • www.bioconductor.org/packages/release/data/annotation

    How much difference does it make?

    Dai, et al. Evolving gene/transcript definitions significantly alter the interpretations of GeneChip data. Nucleic Acids Research, 2005, 33(20):e175.

    How much difference does it make?

    • Original tests showed a 30-50% difference between predicted differential expression.

      • Hall JL, Grindle S, et al. Genomic profiling of the human heart before and after mechanical support with a ventricular assist device reveals alterations in vascular signaling networks. Physiological Genomics 2004, 17(3):283-291

    • Comparisons with more recent manufacturer CDFs also show significantly more precise and reproducible results.

      • Sandberg R, Larsson O. Improved precision and accuracy for microarrays using updated probe set definitions. BMC Bioinformatics 2007, 8:48.
