Summaries of Affymetrix GeneChip probe level data
By Rafael A. Irizarry
PH 296 Project, Fall 2003
Group: Kelly Moore, Amanda Shieh, Xin Zhao
Microarrays: Many Probes for One Gene
Figure: a gene sequence (5´ to 3´) covered by multiple oligo probes, each in a perfect match and a mismatch version.
Each gene or portion of a gene is represented by 16 to 20 oligonucleotides of 25 base-pairs, i.e., 25-mers. An mRNA molecule of interest (usually related to a gene) is represented by a probe set composed of 11-20 probe pairs of these oligonucleotides.
• Probe: a 25-mer.
• Perfect match (PM): A 25-mer complementary to a reference sequence of interest (e.g., part of a gene).
• Mismatch (MM): same as PM but with a single homomeric base change for the middle (13th) base (transversion purine <-> pyrimidine, G <-> C, A <-> T).
• Probe-pair: a (PM,MM) pair.
• Probe-pair set: a collection of probe-pairs (16 to 20) related to a common gene or fraction of a gene.
• AffyID: an identifier for a probe-pair set.
• The purpose of the MM probe design is to measure non-specific binding and background noise.
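The PM/MM definitions above can be illustrated with a minimal sketch. The helper name `mismatch_probe` and the example sequence are hypothetical and not part of any Affymetrix software; the only rule encoded is the one stated in the list: complement the middle (13th) base of a 25-mer (G <-> C, A <-> T).

```python
# Hypothetical helper: build the MM probe for a given 25-mer PM probe
# by swapping the middle (13th) base for its complement.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def mismatch_probe(pm: str) -> str:
    if len(pm) != 25:
        raise ValueError("expected a 25-mer")
    middle = 12  # the 13th base, as a zero-based index
    return pm[:middle] + COMPLEMENT[pm[middle]] + pm[middle + 1:]

# Made-up sequence, used only for illustration.
pm = "ACGTACGTACGTACGTACGTACGTA"
mm = mismatch_probe(pm)
probe_pair = (pm, mm)  # a (PM, MM) probe pair
```

A probe-pair set would then simply be a list of such pairs sharing one AffyID.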
After scanning the arrays hybridized to labeled RNA samples, intensity values PMij and MMij are recorded for arrays i = 1, …, I and probe pairs j = 1, …, J, for any given probe set.
        Array1  Array2  Array3  Array4  Array5  …
Gene1     0.46    0.30    0.80    1.51    0.90  …
Gene2    -0.10    0.49    0.24    0.06    0.46  …
Gene3     0.15    0.74    0.04    0.10    0.20  …
Gene4    -0.45   -1.03   -0.79   -0.56   -0.32  …
Gene5    -0.06    1.06    1.35    1.09   -1.09  …
…            …       …       …
This assumption does not hold for GeneChip probe level data, since probes with larger mean intensities have larger variances.
PM-MM values for probe pairs
(i) the precision of the measures of expression, as estimated by standard deviations across replicate chips;
(ii) the consistency of fold change estimates based on widely differing concentrations of target mRNA hybridized to the chip;
(iii) the specificity and sensitivity of the measures’ ability to detect differential expression, presented in terms of receiver operating characteristic (ROC) curves.
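Criterion (i) can be sketched as a per-gene standard deviation across replicate chips. The function name `replicate_sd`, the gene labels, and the log2-scale values below are made up for illustration; they are not data from the study.

```python
from statistics import stdev

def replicate_sd(expr_by_gene):
    """Return {gene: sample SD of its expression across replicate arrays}."""
    return {gene: stdev(values) for gene, values in expr_by_gene.items()}

# Hypothetical log2 expression values on five replicate arrays.
expr = {
    "geneA": [7.9, 8.1, 8.0, 8.2, 7.8],
    "geneB": [3.0, 4.5, 2.1, 3.9, 5.0],
}
sds = replicate_sd(expr)
```

A measure with smaller SDs across the full range of expression would be judged more precise under this criterion.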
Two sources of cRNA, human liver tissue and a central nervous system cell line (CNS), were hybridized to human arrays (HG-U95A) in a range of dilutions and proportions.
Data from six groups of arrays that had hybridized liver and CNS cRNA at concentrations of 1.25, 2.5, 5.0, 7.5, 10.0 and 20.0 µg were studied.
Five replicate arrays were available for each generated cRNA (n=60 total).
Different cRNA fragments were added to the hybridization mixture of the arrays at different pM concentrations.
The cRNAs were spiked in at a different concentration on each array, arranged in a cyclic Latin square design with each concentration appearing once in each row and column.
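A cyclic Latin square of this kind can be generated by shifting the list of concentrations by one position per row. The sketch below is illustrative only: the function name `cyclic_latin_square` and the four concentration levels are assumptions, and treating rows as arrays and columns as spiked transcripts is one possible reading of the design.

```python
def cyclic_latin_square(levels):
    """Row r is the list of levels rotated left by r positions."""
    n = len(levels)
    return [[levels[(row + col) % n] for col in range(n)] for row in range(n)]

# Hypothetical pM concentrations, not the levels used in the experiment.
concentrations = [0.25, 0.5, 1, 2]
design = cyclic_latin_square(concentrations)
# design[row][col]: concentration of transcript `col` on array `row`
```

Because each row is a rotation of the previous one, every concentration appears exactly once in each row and in each column.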
Two different data sets from:
(ii) GeneLogic
Study Design
This data set consists of 3 technical replicates of 14 separate hybridizations of 42 spiked transcripts in a complex human background at concentrations ranging from 0.125 pM to 512 pM. Thirty of the spikes are isolated from a human cell line, four spikes are bacterial controls, and eight spikes are artificially engineered sequences believed to be unique in the human genome.
The above plot showed that RMA had a smaller SD at all levels of expression.
Log (base 2) fold change estimates of gene expression between liver and CNS samples computed from arrays hybridized to 1.25 μg of cRNA were plotted against the same estimates obtained from arrays hybridized to 20 μg for all three measures.
RMA provides more consistent estimates of fold change.
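The comparison can be sketched for a single gene as follows. This is a sketch under assumptions: averaging log2 expression over replicates before differencing is one plausible way to form the estimate, and the function name `log2_fold_change` and all values are made up for illustration.

```python
from statistics import mean

def log2_fold_change(liver, cns):
    """Log2 fold change from log2-scale values: mean(liver) - mean(cns)."""
    return mean(liver) - mean(cns)

# Hypothetical log2 expression of one gene on replicate arrays.
fc_low = log2_fold_change([6.1, 6.3, 5.9], [4.0, 4.2, 4.1])   # 1.25 µg arrays
fc_high = log2_fold_change([9.0, 9.2, 9.1], [7.0, 7.1, 6.9])  # 20 µg arrays

# A consistent measure gives nearly the same estimate at both amounts.
discrepancy = abs(fc_low - fc_high)
```

Plotting `fc_low` against `fc_high` for every gene gives a scatter that should fall near the diagonal when the measure is consistent.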
M vs. A plot: for expression Xg and Yg from two arrays being compared, the log fold change Mg is plotted against the average log expression Ag for all genes, g = 1, …, G.
• M vs. A plots are useful because log fold change (the quantity of most interest) is represented on the y-axis and average absolute log expression (another quantity of interest) on the x-axis.
• The plots on the next slides are produced by selecting one array from one of the Affymetrix spike-in experiments to use as a reference and then computing Mg and Ag for the comparisons of that array with all other arrays in the experiment using MAS 5.0, dChip, and RMA.
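The quantities Mg and Ag can be computed as in the minimal sketch below. The sign convention M = log2(X) - log2(Y) is an assumption (the opposite sign is equally common), the function name `mva` is hypothetical, and the expression values are made up.

```python
from math import log2

def mva(x, y):
    """Return (M, A) lists for paired positive expression values x and y."""
    m = [log2(xg) - log2(yg) for xg, yg in zip(x, y)]        # log fold change
    a = [(log2(xg) + log2(yg)) / 2 for xg, yg in zip(x, y)]  # average log expression
    return m, a

# Hypothetical expression values for three genes on two arrays.
x = [8.0, 16.0, 4.0]
y = [4.0, 16.0, 16.0]
m, a = mva(x, y)

# Genes with a fold change larger than 2 in either direction, i.e. |M| > 1.
large_change = [g for g, mg in enumerate(m) if abs(mg) > 1]
```

Plotting `m` on the y-axis against `a` on the x-axis gives the M vs. A plot; a less variable expression measure leaves fewer unchanged genes beyond the |M| > 1 band.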
In these plots, the colored numbers represent the log2 fold change in concentrations of spiked-in genes. The red points represent non-spiked-in genes with a fold change larger than 2. Using RMA, the plot has fewer red points, showing smaller variance, especially for genes with lower absolute expression. This resulted in better detection capability of genes spiked-in at different concentrations.
Exposed: benzene-exposed shoe workers, 6 samples
Controls: clothing factory workers, 6 samples
Matched on: gender, age, and smoking
Samples: 6 matched pairs of people; gene: lymphocyte RNA
The plots are produced by computing Mg and Ag from the unexposed (x1) and exposed (x2) arrays for both chip A and chip B. RMA shows smaller variance than MAS 5.0, which supports the results discussed in the paper.