Introduction to Microarray Gene Expression


Introduction to Microarray Gene Expression. Shyamal D. Peddada Biostatistics Branch National Inst. Environmental Health Sciences (NIH) Research Triangle Park, NC. Outline of the four talks. A general overview of microarray data Some important terminology and background Various platforms


Introduction to Microarray Gene Expression

Shyamal D. Peddada

Biostatistics Branch

National Inst. Environmental Health Sciences (NIH)

Research Triangle Park, NC

Outline of the four talks
  • A general overview of microarray data
    • Some important terminology and background
    • Various platforms
    • Sources of variation
    • Normalization of data
  • Analysis of gene expression data - Nominal explanatory variables
    • Two types of explanatory variables
    • Scientific questions of interest
    • A brief discussion on false discovery rate (FDR) analysis
    • Some existing methods of analysis.
Outline of the four talks
  • Analysis of ordered gene expression data
    • Common experimental designs
    • Some existing statistical methods
    • An example
    • Demonstration of ORIOGEN
    • Some open research problems
  • Analysis of data from cell-cycle experiments
    • Some background on cell-cycle experiments
    • Modeling the data
    • Data from multiple experiments
    • Some open research problems
To perform statistical analysis of any given data
  • It is important to understand all sources of (i) bias, (ii) variability.
    • Some basic understanding of the underlying technology!
    • Understand the sampling/experimental design
Some background terminology:DNA and RNA
  • DNA (Deoxyribonucleic acid) - Contains the genetic code, or instructions, for the development and function of living organisms. It is double stranded.
  • Four Nucleotides (building blocks of DNA)
    • Adenine (A), Guanine (G),
    • Thymine (T), Cytosine (C)
  • Base pairs: (A, T) (G, C)

E.g. 5’ ---AAATGCAT---3’

3’ ---TTTACGTA---5’
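The pairing rule in the example above can be sketched in a few lines of Python. The function names are illustrative choices and are not part of the talk.

```python
# Watson-Crick pairing for DNA: A pairs with T, G pairs with C.
DNA_PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement_strand(strand_5_to_3):
    """Return the paired strand, position by position (so it reads 3' to 5')."""
    return "".join(DNA_PAIRS[base] for base in strand_5_to_3)

def reverse_complement(strand_5_to_3):
    """Return the paired strand rewritten in the usual 5' to 3' direction."""
    return complement_strand(strand_5_to_3)[::-1]

print(complement_strand("AAATGCAT"))   # TTTACGTA, the lower strand in the example
print(reverse_complement("AAATGCAT"))  # ATGCATTT
```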

Some background terminology:DNA and RNA
  • RNA (Ribonucleic acid) - transcribed (or copied) from DNA. It is single stranded. (Complementary copy of one of the strands of DNA)
  • RNA polymerase - An enzyme that helps in the transcription of DNA to form RNA.
  • Four Nucleotides (building blocks of RNA)
    • Adenine (A), Guanine (G),
    • Uracil (U), Cytosine (C)
  • Base pairs: (A, U) (G, C)
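As a rough illustration of transcription, the sketch below copies a DNA template strand into RNA using the pairing rules above. The helper name `transcribe` and the choice of the template strand are assumptions made for this example.

```python
# RNA base placed opposite each DNA template base during transcription.
TEMPLATE_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3_to_5):
    """Return the RNA (read 5' to 3') copied from a DNA template given 3' to 5'."""
    return "".join(TEMPLATE_TO_RNA[base] for base in template_3_to_5)

# Template: the 3'-TTTACGTA-5' strand from the DNA example.
print(transcribe("TTTACGTA"))  # AAAUGCAU
```

The result matches the other DNA strand (AAATGCAT) with every T replaced by U.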
Some background terminology:Types of RNA
  • Types of RNA - (transfer) tRNA, (ribosomal) rRNA, etc.

  • mRNA - messenger RNA. Carries information from DNA to ribosomes where protein synthesis takes place (less stable than DNA).
Some background terminology: Oligos
  • Oligonucleotide - a short segment of DNA consisting of a few bases. It is commonly called an “oligo” for short.
  • “mer” - unit of measurement for an oligo. It is the number of bases, so an oligo of 30 bases is a 30-mer.
Some background terminology: Probes
  • cDNA - complementary DNA. DNA sequence that is complementary to the given mRNA.
    • Obtained using an enzyme called reverse transcriptase.
  • Probes - a short segment of DNA (about 100-mer or longer) used to detect DNA or RNA that complements the sequence present in the probe.
Some background terminology:“Blots” - Origins of Microarrays
  • Southern blot (Edwin Southern, 1975 J. Molec. Biol.)
    • A method used to identify the presence of a DNA sequence in a sample of DNA.
  • Western blot (immunoblot)
    • to identify a specific protein from a tissue extract.
Some background terminology
  • Southwestern blot
    • to identify and characterize DNA-binding proteins.
  • Northern blot
    • A method used to study the gene expression from a sample of mRNA.
What is a Microarray?
  • Sequences from thousands of different genes are immobilized, or attached, at fixed locations.
  • Spotted, or actually synthesized directly onto the support.
Microarray Technology
  • Two color dye array (Spotted array)
    • Spotted cDNA microarrays
    • Spotted oligo microarrays
  • Single dye array
    • In situ oligo microarrays
Spotted DNA Microarray
  • Spotted DNA array is typically “home made” so you need to think about:
    • cDNA or Oligo
    • Location of the Oligo in a given gene
    • Oligo length - number of bp?
Spotted DNA Microarray
  • Gene expression (Y is the log ratio of the red to the green intensity at a spot, e.g. Y = log2(R/G)):
    • Y < 0; gene is over expressed in green-labeled sample compared to red-labeled sample
    • Y = 0; gene is equally expressed in both samples
    • Y > 0; gene is over expressed in red-labeled sample compared to green-labeled sample
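A minimal sketch of this sign convention, assuming Y is the base-2 log ratio of red to green intensity (the base and the function names are assumptions; the sign convention is the one listed above):

```python
import math

def log_ratio(red, green):
    """Y = log2(red / green) for one spot; intensities must be positive."""
    return math.log2(red / green)

def interpret(y):
    """Translate the sign of Y into the wording used above."""
    if y > 0:
        return "over expressed in red-labeled sample"
    if y < 0:
        return "over expressed in green-labeled sample"
    return "equally expressed in both samples"

for red, green in [(800, 200), (500, 500), (200, 800)]:
    y = log_ratio(red, green)
    print(red, green, y, interpret(y))
```

In practice a gene is rarely called differentially expressed on the sign alone; the comparison with zero here only mirrors the three cases listed above.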
Major Commercial Platforms
  • More than 50 companies are currently offering various DNA microarray platforms, reagents and software
  • Affymetrix dominated the market for many years

*Agilent offers both one-color and two-color microarray platforms

Affymetrix GeneChip
  • Each gene is represented by 11 to 20 oligos of 25-mers
  • Probe: An oligo of 25-mer
  • Probe Pair: a PM and MM pair
  • Perfect match (PM): A 25-mer complementary to a reference sequence of interest (part of the gene)
  • Mismatch (MM): same as PM with a single base change for the middle (13th) base (G <-> C, A <-> T)
  • Probe set: a collection of probe pairs (11 to 20) related to a fraction of a gene
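The PM/MM construction described above can be sketched as follows. The function name and the example sequence are made up for illustration; only the rule (swap the 13th base of a 25-mer, G <-> C and A <-> T) comes from the slide.

```python
# Base substitution applied to the middle position of the perfect-match probe.
SWAP = {"A": "T", "T": "A", "G": "C", "C": "G"}

def mismatch_probe(pm):
    """Return the MM probe: identical to the 25-mer PM except at the 13th base."""
    if len(pm) != 25:
        raise ValueError("expected a 25-mer")
    middle = 12  # 13th base, counting from zero
    return pm[:middle] + SWAP[pm[middle]] + pm[middle + 1:]

pm = "ACGTACGTACGTACGTACGTACGTA"  # hypothetical probe sequence
mm = mismatch_probe(pm)
print(pm)
print(mm)
```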
Affymetrix call for the presence of a signal
  • The Affymetrix detection algorithm uses probe pair intensities to obtain a detection p-value.
  • Using this p-value, the signal is called “present”, “marginal” or “absent”.
Affy call
  • Detection p-value
    • Calculate the discrimination score T for each probe pair
      • T = (PM-MM) / (PM+MM)
    • Determine the statistical significance of the gene by computing the p-value from a one-sided Wilcoxon signed-rank test of the scores against a small threshold.
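The following is a minimal sketch of a detection call built on the score T = (PM-MM)/(PM+MM). It assumes an exact one-sided Wilcoxon signed-rank test of the scores against a threshold `tau`, and the cut-offs `alpha1` and `alpha2`; the default values 0.015, 0.04 and 0.06 are assumptions of this example and should be checked against the Affymetrix documentation. All function names are hypothetical.

```python
def discrimination_scores(pm, mm):
    """Score (PM - MM) / (PM + MM) for each probe pair."""
    return [(p - m) / (p + m) for p, m in zip(pm, mm)]

def signed_rank_p_value(scores, tau=0.015):
    """Exact one-sided signed-rank p-value for H1: scores tend to exceed tau."""
    diffs = [s - tau for s in scores if s != tau]
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank |diffs| with average ranks for ties; ranks are stored doubled
    # so that they remain integers.
    order = sorted(range(n), key=lambda idx: abs(diffs[idx]))
    ranks2 = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2  # twice the average of ranks i+1..j+1
        i = j + 1
    observed = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    total = sum(ranks2)
    # counts[s] = number of sign patterns whose positive (doubled) rank sum is s.
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    return sum(counts[observed:]) / 2 ** n

def detection_call(p_value, alpha1=0.04, alpha2=0.06):
    """Map the detection p-value to present / marginal / absent."""
    if p_value < alpha1:
        return "present"
    if p_value < alpha2:
        return "marginal"
    return "absent"

pm = [200 + 10 * k for k in range(11)]  # made-up intensities, 11 probe pairs
mm = [100] * 11
p = signed_rank_p_value(discrimination_scores(pm, mm))
print(p, detection_call(p))
```

The exact null distribution is obtained by counting sign patterns, which is cheap for the 11 to 20 probe pairs in a probe set.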
Affy call

Ref: Affymetrix Technical Manual

Affymetrix Vs Illumina

Ref: Pan Du & Simon Lin

Why Normalize Data?
  • To “calibrate”/adjust data so as to reduce or eliminate the effects arising from variation in technology and other sources rather than due to true biological differences between test groups.
Sources of bias/variation
  • Tissue or cell lines
  • mRNA
    • It can degrade over time - so there is a potential batch effect if portions of experiment are performed at different times
    • Purity and quantity
  • Dye color effect (spotted arrays)
  • Variation due to technology - is substantially reduced with improved technology
  • Etc.
A useful graphical representation of data
  • Let its spectral decomposition be given by


Common Normalization Methods
  • Internal Control Normalization
  • Global Normalization
  • Linear Normalization (Spotted arrays)
  • Non-linear Normalization Method (Spotted arrays) - LOWESS curve.
  • COMBAT (for batch effect)
Internal control normalization (Housekeeping gene(s))
  • Expression of each gene is measured relative to the average of the housekeeping genes.
    • Basic assumption: Expression of housekeeping genes does not change.
      • Disadvantage:
        • Housekeeping genes are sometimes highly expressed. Unexpected regulation of housekeeping gene(s) leads to misinterpretation.
Global Normalization
  • Basic assumption
    • Mean/Median expression ratio of all monitored mRNAs is constant across a chip.

Regression of the log ratio on a constant (intercept only).

In simple terms, the log ratios are corrected by a common “mean” or “median”.

This method can also be applied to single-dye data.
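A minimal sketch of global normalization, assuming the correction is made by subtracting the chip-wide median log ratio (the mean could be used in the same way). The function name and the numbers are illustrative.

```python
from statistics import median

def global_normalize(log_ratios):
    """Subtract one common value (here the median) from every log ratio on a chip."""
    center = median(log_ratios)
    return [y - center for y in log_ratios]

chip = [0.5, 1.0, 1.5, 2.0, 0.0]  # made-up log ratios for five spots
print(global_normalize(chip))      # [-0.5, 0.0, 0.5, 1.0, -1.0]
```

After the correction the median log ratio on the chip is zero, which is what the basic assumption requires.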

Linear Normalization(for spotted arrays)
  • Basic assumption
    • Mean/Median expression ratio of all monitored mRNAs depends upon the average intensity

Regression of the log ratio on the average intensity (a straight-line fit).
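One way to sketch linear normalization is an ordinary least-squares fit of the log ratio M on the average log intensity A, keeping the residuals as normalized values. The symbols M and A and the function name are assumptions of this example, not notation taken from the slides.

```python
def linear_normalize(a, m):
    """Fit m = intercept + slope * a by least squares and return the residuals."""
    n = len(a)
    a_bar = sum(a) / n
    m_bar = sum(m) / n
    sxx = sum((ai - a_bar) ** 2 for ai in a)
    sxy = sum((ai - a_bar) * (mi - m_bar) for ai, mi in zip(a, m))
    slope = sxy / sxx
    intercept = m_bar - slope * a_bar
    return [mi - (intercept + slope * ai) for ai, mi in zip(a, m)]

a = [6.0, 8.0, 10.0, 12.0]   # made-up average log intensities
m = [1.1, 0.7, 0.9, 1.7]     # made-up log ratios with an intensity trend
print(linear_normalize(a, m))
```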

Non-Linear Normalization(for spotted arrays)
  • Basic assumption
    • Mean/Median expression ratio of all monitored mRNAs depends upon the average intensity

Regression of the log ratio on a smooth function of the average intensity, where the function is estimated by the robust scatterplot smoother LOWESS (LOcally WEighted Scatterplot Smoothing).
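The idea can be sketched with a simplified local regression: for each spot, fit a weighted straight line to its nearest neighbours (tricube weights) and subtract the fitted value. This sketch omits the robustness iterations of the full LOWESS procedure, and the names `lowess_fit`, `frac`, A and M are assumptions of the example.

```python
def lowess_fit(x, y, frac=0.5):
    """Local linear fit with tricube weights; no robustness iterations."""
    n = len(x)
    k = max(2, int(round(frac * n)))  # number of neighbours in each window
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12     # window half-width
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def lowess_normalize(a, m, frac=0.5):
    """Normalized log ratios: M minus the smooth trend in A."""
    return [mi - fi for mi, fi in zip(m, lowess_fit(a, m, frac))]

a = [0.5 * i for i in range(20)]             # made-up average intensities
m = [0.05 * (ai - 5) ** 2 - 0.5 for ai in a]  # made-up curved dye bias
print(lowess_normalize(a, m))
```

A production analysis would normally rely on a tested implementation with robustness weights rather than this hand-rolled version.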

Analysis of Variance (ANOVA)
  • Standard Analysis of Variance model
    • Response variable - Gene expression
    • Explanatory variables:
    • Dye color
    • Batch
    • Other potential effects?
  • Advantage: Statistically significant genes can be identified while controlling for the various experimental conditions/factors.
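As a rough illustration of how such a model removes nuisance effects, the sketch below fits an additive model (overall mean + dye effect + batch effect) to one gene in a balanced design, using marginal means. It does not perform the significance test itself, and the data layout and function name are assumptions of this example.

```python
from collections import defaultdict

def fit_additive_effects(observations):
    """observations: list of (dye, batch, value) for one gene, balanced design."""
    grand = sum(value for _, _, value in observations) / len(observations)
    by_dye, by_batch = defaultdict(list), defaultdict(list)
    for dye, batch, value in observations:
        by_dye[dye].append(value)
        by_batch[batch].append(value)
    # In a balanced design the effect estimates are deviations of marginal means.
    dye_effect = {d: sum(v) / len(v) - grand for d, v in by_dye.items()}
    batch_effect = {b: sum(v) / len(v) - grand for b, v in by_batch.items()}
    residuals = [value - grand - dye_effect[dye] - batch_effect[batch]
                 for dye, batch, value in observations]
    return grand, dye_effect, batch_effect, residuals

obs = [("red", 1, 7.5), ("red", 2, 9.5), ("green", 1, 6.5), ("green", 2, 8.5)]
grand, dye_effect, batch_effect, residuals = fit_additive_effects(obs)
print(grand, dye_effect, batch_effect, residuals)
```

The residuals are what remains after dye and batch have been accounted for; treatment effects would be added to the model as further terms.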

Some important experimental designs
  • Pooled Samples versus Separate samples
    • Sometimes there may not be sufficient biological sample/specimen from a given animal. In such cases biological samples are pooled from several identical animals to form a sample.
An example of a pooling design(for each treatment group)

Subjects Pool Observations

(Microarray chips)

The pooling design

Subjects    Pools                      Observations (microarray chips)

9           3 (3 per pool)             6

More generally:

n           p (r = n/p per pool)       m

The standard design

Subjects    Pools       Observations (microarray chips)

9           9           9

More generally:

n           p = n       m = n


Some issues
  • What are the underlying parameters?
  • Effect of pooling on power.
  • The basic assumption. Validity of the assumption.
  • Total variation in the expression of a gene can be decomposed into:
    • Biological variation
    • Technical variation
  • Biological samples (n)
  • Number of pools (p)
  • Biological samples per pool (r=n/p)
  • Observed number of samples (e.g. microarrays) (m)
Some comments about pooling

Variance of the estimated mean expression of a gene depends on:

  • number of pools (p)
  • number of bio samples per pool (r)
  • number of arrays (m)
  • biological variation
  • Technical variation.

Pooling works well when the biological variation in the gene expression is substantially larger than the technical variation.
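A sketch of how these quantities combine, under a simple model that is an assumption of this example: biological averaging holds, pools are of equal size, and each pool is measured on the same number of arrays. The variance of the estimated mean is then the biological variance divided by the number of subjects (p times r) plus the technical variance divided by the number of arrays (m).

```python
def variance_of_mean(sigma2_bio, sigma2_tech, n_pools, per_pool, arrays_per_pool):
    """Variance of the estimated mean expression under the simple model above."""
    biological = sigma2_bio / (n_pools * per_pool)        # averaged over all subjects
    technical = sigma2_tech / (n_pools * arrays_per_pool)  # averaged over all arrays
    return biological + technical

# Biological variance 4 times the technical variance (made-up units).
pooled = variance_of_mean(4.0, 1.0, n_pools=3, per_pool=3, arrays_per_pool=2)    # 9 subjects, 6 arrays
standard = variance_of_mean(4.0, 1.0, n_pools=9, per_pool=1, arrays_per_pool=1)  # 9 subjects, 9 arrays
few_arrays = variance_of_mean(4.0, 1.0, n_pools=3, per_pool=1, arrays_per_pool=1)  # 3 subjects, 3 arrays
pooled_few = variance_of_mean(4.0, 1.0, n_pools=3, per_pool=3, arrays_per_pool=1)  # 9 subjects, 3 arrays
print(pooled, standard, few_arrays, pooled_few)
```

With only three arrays available, pooling three subjects per array gives a much smaller variance than using three individual subjects, because the biological term dominates.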

Power comparisons

Biological samples   Microarrays   Pool size                  Power

5/group              5/group       1 (standard design)        0.81

6/group              6/group       1 (standard design)        0.95

6/group              3/group       2 (i.e. 3 pools/group)     0.30

8/group              4/group       2 (i.e. 4 pools/group)     0.80

10/group             5/group       2 (i.e. 5 pools/group)     0.98

  • Zhang and Gant (2005)
Power comparisons

Conditions of the simulation study:

  • Biological variation is 4 times the technical variation.
  • False positive rate is 0.001.
  • The effect to be detected is a 2-fold change in expression.
  • Data are normally distributed.

A fundamental assumption

Biological averaging:

Suppose an experiment consists of pooling “r” samples. Then the expression of a gene in the pooled sample is assumed to be the average of the gene’s expression in the “r” samples.

This assumption need not be true, especially if the expression values are transformed non-linearly.
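A small numerical illustration of why a non-linear transformation breaks the averaging assumption, using made-up intensities and a log2 transformation: the log of the pooled (averaged) expression is not the average of the individual log expressions.

```python
import math
from statistics import mean

individual = [100.0, 400.0, 1600.0]  # made-up raw expression of r = 3 samples

log_of_pool = math.log2(mean(individual))                # pool first, then transform
mean_of_logs = mean([math.log2(v) for v in individual])  # transform first, then average

print(log_of_pool, mean_of_logs)
```

Because the logarithm is concave, the log of the average is never smaller than the average of the logs, so averaging holds on the raw scale but not on the log scale.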

Some important experimental designs
  • Reference designs (Spotted array)
    • Each treatment sample is hybridized against a common reference control.
  • Loop designs (Spotted array)
    • Suppose we have a control and three experimental groups A, B and C. Then hybridize Control and A, A with B, B with C and C with A.
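The two designs can be written down as lists of array pairings. The sketch below is illustrative only: the function names are hypothetical, and the loop is taken to mean that each sample is paired with the next one and the last with the first.

```python
def reference_design(treatments, reference="Control"):
    """Every treatment sample shares an array with the common reference."""
    return [(treatment, reference) for treatment in treatments]

def loop_design(samples):
    """Each sample shares an array with the next; the last pairs with the first."""
    n = len(samples)
    return [(samples[i], samples[(i + 1) % n]) for i in range(n)]

print(reference_design(["A", "B", "C"]))
# Pairings from the example above: Control with A, then the loop A-B, B-C, C-A.
print([("Control", "A")] + loop_design(["A", "B", "C"]))
```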
Data Analysis - Preliminaries
  • Normalization
  • Transformation of data (usual methods)
    • Perhaps first fit ANOVA and plot the residuals
      • Log transformation
      • Square root
      • More generally, Box-Cox family of transformations
  • Identify potential outliers in the data (again, perhaps use the residuals)
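For reference, the Box-Cox family mentioned above can be sketched as a single function: the parameter value 0 gives the log transformation and 0.5 a shifted and rescaled square root. The function name is an illustrative choice.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a positive value y with parameter lam."""
    if y <= 0:
        raise ValueError("Box-Cox requires positive data")
    if lam == 0:
        return math.log(y)        # limiting case: natural log
    return (y ** lam - 1) / lam   # power transform, rescaled

for lam in (0, 0.5, 1):
    print(lam, [box_cox(y, lam) for y in (1.0, 4.0, 9.0)])
```

In practice the parameter is chosen so that the residuals from the fitted model look approximately normal with constant variance.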
Data Analysis
  • Method of Analysis depends upon the scientific question of interest.
  • In the next three lectures we describe several general methods and illustrate some using real data!