GeneChip analysis

GeneChip analysis: Microarrays, Statistics, Physics, Biology. Andrew Harrison, UCL + Essex, London Pain Consortium. harry@biochem.ucl.ac.uk

Microarrays (mRNA expression): Microarrays are a massively-parallel Northern Blot. Each array contains thousands of sets of different nucleotide sequences. Each set of sequences on the array is complementary to the mRNA nucleotide sequence of a different gene.

Each probe cell of an Affymetrix GeneChip contains millions of identical 25-mers.

Hybridization between biotin-labelled mRNA and the probes on the chip

A laser excites the fluorescent stain bound to the biotin label, and the emitted light is then detected by a scanner

Affymetrix microarrays
Sequence (5’ to 3’): GGTGGGAATTGGGTCAGAAGGACTGTGGCTAGGCGC
Perfect match probe: GGAATTGGGTCAGAAGGACTGTGGC
Mismatch probe: GGAATTGGGTCACAAGGACTGTGGC (differs only at the central, 13th base)
The perfect match and mismatch probe cells are actually scattered on the chip.

Affymetrix probe set: data for the same gene. Each probe pair consists of a Perfect Match (PM) and a Mismatch (MM) probe. Each GeneChip contains tens of thousands of probe sets.
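The relationship between the two probes on this slide can be sketched in a few lines of Python: the mismatch is the perfect match with its central (13th) base swapped for the complementary base. The function name is illustrative only.

```python
# Watson-Crick complements for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def mismatch_probe(perfect_match: str) -> str:
    """Return the perfect match with its central base complemented."""
    middle = len(perfect_match) // 2  # index 12, i.e. the 13th base of a 25-mer
    return (perfect_match[:middle]
            + COMPLEMENT[perfect_match[middle]]
            + perfect_match[middle + 1:])

pm = "GGAATTGGGTCAGAAGGACTGTGGC"  # perfect match probe from the slide
mm = mismatch_probe(pm)
print(mm)
```

Applied to the perfect match above, this reproduces the mismatch sequence shown on the slide.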

Each chip emits over a broad range of intensities (a dynamic range of many hundreds)

Chip calibration: Correct Background, Normalise, Correct for Cross Hybridisation, Expression Measure. Then: high-level analysis, biological interpretation.

Background Fluorescence needs to be removed

Chips need to be normalised against each other (in the plot, each chip is a different colour), e.g. using invariant genes, lowess or quantiles.

Expression Measure: The intensities of the multiple probes within a probeset are combined into ONE measure of expression. Methods: MAS, RMA, dChip.
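As an illustration of one of the normalisations listed, here is a minimal sketch of quantile normalisation in plain Python. It forces every chip to share the same intensity distribution (the mean of the sorted intensities across chips). The function name and the toy intensities are made up for the example, and ties are broken by position, which is a simplification.

```python
def quantile_normalise(chips):
    """chips: list of equal-length lists, one list of probe intensities per chip."""
    n = len(chips[0])
    # Reference distribution: mean of the k-th smallest value across all chips.
    sorted_chips = [sorted(chip) for chip in chips]
    reference = [sum(s[k] for s in sorted_chips) / len(chips) for k in range(n)]
    normalised = []
    for chip in chips:
        # Rank each probe within its own chip (ties broken by position).
        order = sorted(range(n), key=lambda i: chip[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]  # replace the value by the reference quantile
        normalised.append(out)
    return normalised

# Toy data: three chips, three probes each (hypothetical intensities).
chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0], [3.0, 4.0, 8.0]]
normalised = quantile_normalise(chips)
for chip in normalised:
    print(chip)
```

After normalisation each chip contains the same set of values, but the ordering of probes within each chip is preserved.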

MAS 5.0 (Signal) takes the Tukey biweight (a robust mean) of the log of the difference between PM and MM, where MM is replaced by an adjusted value when it exceeds PM.
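The Tukey biweight itself can be sketched as follows. This is a one-step version that down-weights values far from the median, so that a single outlying probe has little influence on the probe-set summary. The tuning constants c and eps are assumptions chosen for illustration, and the input values are hypothetical log-scale probe values.

```python
import statistics

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean with weights falling to zero for outliers."""
    m = statistics.median(values)
    # Median absolute deviation from the median, a robust measure of spread.
    s = statistics.median([abs(v - m) for v in values])
    weights = []
    for v in values:
        u = (v - m) / (c * s + eps)  # scaled distance from the median
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Hypothetical log-scale probe values for one probe set; the last one is an outlier.
probe_values = [8.1, 8.3, 7.9, 8.0, 12.5]
robust = tukey_biweight(probe_values)
plain = statistics.mean(probe_values)
print(robust, plain)
```

The outlier pulls the ordinary mean upwards, whereas the biweight stays close to the bulk of the probes.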

1-9 are different chips. dChip and RMA ‘model’ the systematic hybridisation patterns when calibrating an expression measure.

After chip calibration, differentially expressed genes are identified using: T-tests, Fold Changes, Z-scores.

T-statistics: Each gene is studied independently of the other genes. (Figure panels: more significant change; less significant change.)

Once chips have gone through the calibration process, changes in gene expression between conditions or over time can be observed. The ratio of expression values between two conditions is known as Fold Change. m = log2(Fold Change), a = log2(Average Intensity).

Variability of fold change is a function of intensity!! MAS 5 shows large variability at low intensities. RMA shows small variability at low intensities.
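A minimal sketch of the m and a values for one gene measured in two conditions. One assumption is made here: the "average intensity" is taken as the geometric mean of the two expression values, which is a common convention but is not stated on the slide. The expression values are hypothetical.

```python
import math

def ma_values(x, y):
    """x, y: expression values of one gene in conditions A and B (linear scale)."""
    m = math.log2(y / x)                      # log2 fold change
    a = 0.5 * (math.log2(x) + math.log2(y))   # log2 of the geometric-mean intensity (assumed convention)
    return m, a

m, a = ma_values(100.0, 400.0)  # hypothetical expression values
print(m, a)
```

A four-fold increase gives m = 2; a two-fold decrease would give m = -1, so up and down changes are treated symmetrically on the log scale.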

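Fold change is NOT the same as significance!! At least one of these people has tripled in weight in the last two years. Is such a change unusual?

Sliding Z (Quackenbush 2002): $$Z = \frac{m - \mathrm{mean}(m)}{\mathrm{sd}(m)}$$ where mean(m) and sd(m), the standard deviation of m, are computed over the genes within a sliding window of similar intensity.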
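A sliding Z can be sketched as follows, assuming the local mean and standard deviation are taken over a fixed number of genes nearest in intensity. The window size, the function name and the toy data are all illustrative choices, not taken from the talk.

```python
import statistics

def sliding_z(m, a, window=21):
    """Z-score each gene's log fold change m against the genes closest to it in intensity a."""
    order = sorted(range(len(a)), key=lambda i: a[i])  # genes ranked by intensity
    z = [0.0] * len(m)
    half = window // 2
    for pos, i in enumerate(order):
        lo = max(0, pos - half)
        hi = min(len(order), pos + half + 1)
        local = [m[j] for j in order[lo:hi]]           # fold changes in the local window
        sd = statistics.stdev(local)
        z[i] = (m[i] - statistics.mean(local)) / sd if sd > 0 else 0.0
    return z

# Toy data: fold changes are noisy (+/-1) at low intensity and quiet (+/-0.1) at high intensity.
a = list(range(200))
m = [(1.0 if i < 100 else 0.1) * (1 if i % 2 == 0 else -1) for i in range(200)]
m[50] = 0.5    # same fold change placed in the noisy region...
m[150] = 0.5   # ...and in the quiet region
z = sliding_z(m, a)
print(z[50], z[150])
```

The same fold change gets a small Z where the local scatter is large and a large Z where the local scatter is small, which is the point of making Z intensity dependent.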

How can you validate the results from statistics? Perform calibrations where you know the answer to expect (spike-in experiments). Determine the statistical properties you expect your results to have, and see if the experiments match your assumptions (thought experiments). Does the biology make sense?

The density of intensities of significant genes produced by T-statistics has a poor overlap with the intensities of all the other genes. The histogram is the population density; the line is the density of significant genes.

The intensity histogram of Sliding Z scores matches very well to the population.

Within Bioconductor (within R), the package “affy” allows a choice of calibration protocol. 3 background corrections: Nothing, MAS, RMA. 5 different normalisations: Constant, Invariant Genes, Lowess, Qspline, Quantiles. 3 different expression measures: dChip (aka Li-Wong), MAS, RMA. There are 3 x 5 x 3 = 45 different combinations.

Which factors lead to certain calibration protocols sharing a large consensus of significant genes? In the example matrix: 1, 2 & 3 are identical to themselves (consensus is 100%, colour white); 2 and 3 share the most in common (light grey); 1 and 2 share the least in common (dark grey).
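The count of 45 is simply every combination of one background correction, one normalisation and one expression measure. A short Python sketch of the enumeration follows; the strings are descriptive labels copied from the slide and are not necessarily the method names the affy package expects.

```python
from itertools import product

# Descriptive labels from the slide (not necessarily affy's own identifiers).
background = ["nothing", "MAS", "RMA"]
normalisation = ["constant", "invariant genes", "lowess", "qspline", "quantiles"]
expression = ["dChip (Li-Wong)", "MAS", "RMA"]

protocols = list(product(background, normalisation, expression))
print(len(protocols))  # 45
for bg, norm, expr in protocols[:3]:
    print(bg, "|", norm, "|", expr)
```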
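A consensus matrix of this kind can be sketched as below. The slide does not say how consensus is measured, so this example ASSUMES it is the size of the intersection divided by the size of the union of two lists of significant genes. The gene names are hypothetical and chosen to mimic the 1, 2, 3 example.

```python
def consensus(genes_a, genes_b):
    """Fraction of genes shared by two lists (intersection over union, an assumed definition)."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical significant-gene lists from three calibration protocols.
significant = {
    1: ["geneA", "geneB", "geneC", "geneD"],
    2: ["geneC", "geneE", "geneF", "geneG"],
    3: ["geneC", "geneD", "geneE", "geneF", "geneG"],
}

matrix = {(i, j): consensus(significant[i], significant[j])
          for i in significant for j in significant}
for i in significant:
    print([round(matrix[(i, j)], 2) for j in significant])
```

The diagonal is 1 (each protocol agrees with itself), protocols 2 and 3 share the most, and protocols 1 and 2 share the least.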

The list of significantly changing genes derived from T-statistics is sensitive (dark) to the choice of calibration protocol (45 possibilities)

T-statistics: Each gene is studied independently of the other genes. (Figure panels: more significant change; less significant change.)

T-tests are very sensitive to the choice of normalisation (figure: Condition A against Condition B). Significant: fold change is small but variance is very small. Not significant: fold change is a little larger but variance is also larger. Normalisation need modify only one signal to change the outcome.
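The two cases on this slide can be reproduced with a small sketch. It uses Welch's form of the t-statistic (unequal variances), which is a choice made for the example rather than something stated in the talk, and the log-scale expression values are hypothetical.

```python
import math
import statistics

def welch_t(a, b):
    """t-statistic for the difference in means of two groups, allowing unequal variances."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(b) - statistics.mean(a)) / se

# Gene 1: tiny change (0.1) but very small variance. Hypothetical log2 values.
t_small_change = welch_t([8.00, 8.01, 7.99], [8.10, 8.11, 8.09])
# Gene 2: larger change (0.5) but much larger variance.
t_large_change = welch_t([8.0, 7.0, 9.0], [8.5, 7.5, 9.5])
# Gene 1 again after a single replicate is shifted by 0.1 (e.g. by a different normalisation).
t_after_shift = welch_t([8.00, 8.01, 7.99], [8.10, 8.11, 7.99])

print(t_small_change, t_large_change, t_after_shift)
```

Gene 1 has by far the larger t despite the smaller fold change, and moving a single one of its values by 0.1 collapses the statistic.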

Penalised t-test

Z-scores: Z-scores are less sensitive to the choice of calibration protocol

Clustering the calibration protocol matrix indicates that the major impact on consensus of significant genes for Z-scores is the choice of expression measure. Expression measures are Li-Wong (dChip), RMA and MAS.

45 different calibration protocols: 3 background corrections x 5 normalisations for each of the 3 expression measures. Major uncertainty in the calibration is the choice of expression measure: Microarray Analysis Suite (MAS), Robust Multichip Average (RMA), Li & Wong (dChip).

Recap: The biggest uncertainty in the calibration of Affymetrix data is how to combine all the multiple probes into one value (mRNA expression per gene). Fold change is biased in intensity. T-tests are sensitive to the choice of calibration. Z-scores of fold changes provide a reliable statistical measure for all intensities. ….. but why can’t we use fold changes?

Spike-in measurements of known concentrations (plot of log (intensity) against log (transcript concentration)). The plateau is probably due to cross-hybridisation to the genomic population of mRNA. For RMA (which only uses the PM information) there remains considerable signal at very low concentrations.

The non-linearity means that Fold Change (Intensity) is NOT the same as Fold Change (Transcript). It is difficult to establish when a gene is NOT expressed.

Cross Hybridisation: MAS 5.0 (Affymetrix) corrects for cross-hybridisation by subtracting the MisMatch signal from the Perfect-Match. RMA ignores the mismatches because they also hybridise to the true signal (the target of the perfect match). But the perfect match contains a contribution from cross-hybridisation. There is a need for a model of the physics of hybridisation (Naef and Magnasco 2003).

GC content is important: AT base pairs have two hydrogen bonds; GC base pairs have three.

Van der Waals (stacking) interactions act between adjacent bases; H-bond interactions act between paired bases. Nearest-neighbour interactions predict duplex kinetics and so sequence order is important (SantaLucia). The binding energy of GAC (paired with CTG) is not the same as that of CAG (paired with GTC).
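A toy calculation makes the point. ASSUME, purely for illustration, that the measured intensity is a constant cross-hybridisation background plus a term proportional to transcript concentration; the background of 10 units and the unit gain are made-up numbers, not fitted values.

```python
def intensity(concentration, background=10.0, gain=1.0):
    """Illustrative model only: constant background plus a linear specific signal."""
    return background + gain * concentration

def intensity_fold_change(c_before, c_after):
    return intensity(c_after) / intensity(c_before)

# The transcript doubles in both cases, but the intensity fold change differs.
fc_low = intensity_fold_change(1.0, 2.0)         # low concentration, background dominates
fc_high = intensity_fold_change(1000.0, 2000.0)  # high concentration, signal dominates
print(fc_low, fc_high)
```

At low concentration a true two-fold change in transcript appears as a fold change in intensity barely above 1, while at high concentration it appears as almost exactly 2.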
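Counting the GC content and the total number of Watson-Crick hydrogen bonds for a probe is straightforward. The sketch below uses the 25-mer perfect match shown earlier in the talk; the function names are illustrative.

```python
def gc_fraction(seq):
    """Fraction of bases in the probe that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)

def hydrogen_bonds(seq):
    """Total hydrogen bonds in the duplex: 3 per GC pair, 2 per AT pair."""
    return sum(3 if base in "GC" else 2 for base in seq)

probe = "GGAATTGGGTCAGAAGGACTGTGGC"
gc = gc_fraction(probe)
bonds = hydrogen_bonds(probe)
print(gc, bonds)
```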
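To see why order matters, a sequence can be split into its overlapping nearest-neighbour pairs. The sketch below does only that decomposition and assigns no energies; the function name is illustrative.

```python
def nearest_neighbour_pairs(seq):
    """Overlapping dinucleotides, the units that nearest-neighbour models assign energies to."""
    return [seq[i:i + 2] for i in range(len(seq) - 1)]

pairs_gac = nearest_neighbour_pairs("GAC")
pairs_cag = nearest_neighbour_pairs("CAG")
print(pairs_gac, pairs_cag)
print(sorted("GAC") == sorted("CAG"))  # same base composition
```

GAC and CAG contain the same three bases but are built from different nearest-neighbour pairs, so a model that sums pair energies will generally give them different binding energies.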

The fraction of overlap between transcript and probe depends upon the position along the probe (SantaLucia). Imagine if all your fragments were of length 20. Imagine dropping the fragments randomly along a line of 25. (Plot: fraction of overlap against position along the probe, from 1 to 25.) There will also be duplex breathing and a torque between the duplex and the unbound fragment.

Biotin labelling interferes with the hybridisation: C & T (pyrimidines) are labelled. So GC* binds less strongly than CG, and AT* binding is weaker than TA. If the probe contains no C & T, it will hybridise well but with no fluorescence. If you have all C & T, it will have difficulty hybridising.

Size is important, e.g. perfect match #13 = A, so mismatch #13 is T, and the complementary base in the mRNA is also T/U. Pyrimidines (C & T) are small. There will be no steric hindrance between the pyrimidine in the mismatch and the pyrimidine in the mRNA of interest.

Size is important, e.g. perfect match #13 = T, so mismatch #13 is A, and the complementary base in the mRNA is also A. Purines (G & A) are large. There will be a large steric hindrance between the purine in the mismatch and the purine in the mRNA of interest.

Naef and Magnasco (2003): The difference in intensity between the PM and MM is sensitive to the choice of the central base of the probe!

There is a lot of physics to consider. In order to simplify matters, there have been several attempts to generate simple mathematical models which incorporate the key physics. The parameters in the models are then fitted using the data from many chips.

Zhang, Miles and Aldape (2003): Their model is named Position Dependent Nearest Neighbour (PDNN). PDNN has 24 weight factors for Gene Specific Binding, 24 factors for Non-Specific Binding and 16 stacking energy parameters. They fit their model with a dataset of ~5,000,000 probe measurements (~40 chips).

Naef and Magnasco (2003): The model contains only position specific affinities for each base (fitted using ~80 chips). A low order function can be fitted to the hybridisation for a given base at a given position. The total hybridisation for the 25 base sequence is then the sum of the local hybridisations.

Wu and Irizarry report spike-in yeast controls on a human chip. This measures non-specific hybridisation directly. Theory is comparable to experiment, though not as clean as Naef. Many unchanging genes do not express!

GCRMA (Wu and Irizarry 2004): Lots of close sequences will hybridise to a given probe. Wu and Irizarry model the variation in hybridisation of these similar sequences using a statistical model. GCRMA determines the contribution to the PM from Signal and from Non-Specific Hybridisation (stickiness).

GCRMA produces a linear relationship between intensity and concentration.

Do the genes identified by statistics make biological sense? BE CAREFUL: YOU KNOW TOO MUCH! Biologically relevant genes come from high-throughput experiments and from Anatomy, Developmental Biology, Neuroscience, Medicine, Pharmacology, Physiology.
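The thought experiment can be worked through numerically. This sketch ASSUMES that each fragment of length 20 lands entirely within the 25-base probe, with every allowed start position equally likely; fragments overhanging the ends of the probe would give a different profile.

```python
def coverage(probe_len=25, fragment_len=20):
    """Fraction of allowed fragment placements that cover each probe position."""
    starts = range(probe_len - fragment_len + 1)  # allowed start indices 0..5
    profile = []
    for pos in range(probe_len):
        hits = sum(1 for s in starts if s <= pos < s + fragment_len)
        profile.append(hits / len(starts))
    return profile

profile = coverage()
for position, fraction in enumerate(profile, start=1):
    print(position, round(fraction, 2))
```

Under this assumption the central bases are always covered, while the bases at either end of the probe are covered by only one placement in six.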
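The structure of such a model can be sketched as a weighted sum of stacking energies over the 24 adjacent base pairs of a 25-mer. Every number below is a HYPOTHETICAL placeholder chosen only so the code runs: the real weights and stacking energies are fitted to chip data and are not reproduced here.

```python
from itertools import product

# 16 placeholder stacking energies, one per dinucleotide (NOT fitted values):
# here simply more negative the more G/C the pair contains.
STACKING = {a + b: -(1.0 + 0.5 * sum(x in "GC" for x in (a, b)))
            for a, b in product("ACGT", repeat=2)}

# 24 placeholder position weights, largest in the middle of the probe (NOT fitted values).
WEIGHTS = [1.0 - abs(k - 11.5) / 12.0 for k in range(24)]

def pdnn_energy(probe, weights=WEIGHTS, stacking=STACKING):
    """Position-weighted sum of nearest-neighbour stacking energies for a 25-mer."""
    return sum(w * stacking[probe[k:k + 2]] for k, w in enumerate(weights))

central = "A" * 10 + "GGGGG" + "A" * 10   # GC-rich block in the middle
terminal = "GGGGG" + "A" * 20             # same block at one end
print(pdnn_energy(central), pdnn_energy(terminal))
```

With these placeholder parameters the same bases give a different energy depending on where they sit in the probe, which is the position dependence in the model's name. A second set of 24 weights would be used in the same way for non-specific binding.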
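The additive structure described on this slide can be sketched as follows. Each base has a low-order (here quadratic) function of position, and the probe affinity is the sum over its 25 positions. The coefficients are HYPOTHETICAL placeholders, not the published fitted values.

```python
# Placeholder quadratic coefficients (c0, c1, c2) per base; NOT fitted values.
COEFFS = {
    "A": (-0.1, 0.0, 0.05),
    "C": (0.1, 0.0, -0.05),
    "G": (0.0, 0.02, 0.0),
    "T": (0.0, -0.02, 0.0),
}

def base_affinity(base, k, length=25):
    """Affinity contribution of one base at 0-based position k."""
    x = 2 * k / (length - 1) - 1  # position rescaled to the interval [-1, 1]
    c0, c1, c2 = COEFFS[base]
    return c0 + c1 * x + c2 * x * x

def probe_affinity(seq):
    """Total affinity: the sum of the position-specific contributions."""
    return sum(base_affinity(b, k, len(seq)) for k, b in enumerate(seq))

pm = "GGAATTGGGTCAGAAGGACTGTGGC"
mm = "GGAATTGGGTCACAAGGACTGTGGC"
print(probe_affinity(pm), probe_affinity(mm))
```

Because the model is a plain sum, the perfect match and mismatch affinities differ only by the contribution of the central base.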
