Introduction to Bioinformatics
Microarrays 2: Microarray Data Normalisation
Course 341, Department of Computing, Imperial College, London
Moustafa Ghanem
Lecture Overview
• Background and Motivation
• Introduction
  • Microarray experiments and microarray data analysis
  • Sources of variability
  • Experimental design
• Normalisation Examples
  • Probe intensity values
  • Two colour arrays
  • Positive controls
  • Spatial normalisation within array
  • Between array normalisation
• Normalisation Methods
  • Total intensity normalisation
  • Scaling and centring
  • Linear regression
  • MA plots and Lowess
Background: Microarrays
• A microarray is a device that detects the presence and abundance of labelled nucleic acids in a biological sample.
• The microarray consists of a solid surface onto which known DNA molecules have been chemically bonded at specific locations.
• Each array location is typically known as a probe and contains many replicates of the same molecule.
• The molecules in each array location are carefully chosen so as to hybridise only with mRNA molecules corresponding to a single gene.
A microarray works by exploiting the ability of an mRNA molecule to hybridise to its complementary DNA probe. The mRNA molecules in a target biological sample are labelled with a fluorescent dye and applied to the array. The fluorescent label makes it possible to detect which probes have hybridised (presence) via the light emitted from the probe.
Background: Microarray Data Analysis
[Figure: microarray data analysis workflow. Stages shown: biological question; experimental design (sample attributes, platform choice); microarray experiment; 16-bit TIFF files; image analysis; (Rspot, Rbkg), (Gspot, Gbkg); normalisation; statistical analysis, clustering, data mining, pattern discovery, classification; biological verification and interpretation.]
Motivation
• Data generated from microarray experiments are inherently highly variable.
• First, there is the Law of Large Numbers:
  • Any measurement of thousands of values will find some large differences due to chance (normal distribution)
  • However, the average gene does not change its expression across experiments
  • Must have replication (e.g. different patients, different experiments) and statistics to show that measured differences are real.
• Second, there are also systematic sources of variability:
  • e.g. errors in scanning microarray images, differences between the properties of the Cy3 and Cy5 channels, etc.
  • Must have systematic methods for addressing such errors.
Motivation
• Normalisation is a general term for a collection of methods directed at reasoning about and resolving the systematic errors and bias introduced by microarray experimental platforms.
• Normalisation methods stand in contrast with the data analysis methods described in other lectures (e.g. differential gene expression analysis, classification and clustering).
• Our overall aim is to be able to quantify measured/calculated variability, differentials and similarity:
  • Are they biologically significant or just side effects of the experimental platforms and conditions?
Introduction: Sources of Microarray Data Variability
• There are several levels of variability in the measured gene expression of a feature.
• At the highest level, there is biological variability in the population from which the sample derives.
• At an experimental level, there is
  • variability between preparations and labelling of the sample,
  • variability between hybridisations of the same sample to different arrays, and
  • variability between the signal on replicate features on the same array.
[Figure: the true gene expression of an individual, combined with variability between individuals, between sample preparations, between arrays and hybridisations, and between replicate features, gives the measured gene expression.]
The measured gene expression in any experiment includes true gene expression, together with contributions from many sources of variability.
Introduction: Sources of Microarray Data Variability
• Population Variation
  • Whose mRNA are we using? May need to test different samples in parallel.
  • May need many replicates to study biological variation
• Sample Treatment
  • Experimental conditions
  • Tissue preparation
• Target Preparation
  • RNA isolation: need to use identical amounts of tissue and identical extraction methods; use the minimum number of steps; measure the amount of RNA and normalize the concentration
  • Labelling: need to account for and measure incorporation of label and normalize samples to the same concentration
  • Amount: need to add the same amount of label to each hybridization
There are many standard experimental protocols that biologists need to follow when conducting their experiments to minimize variability.
Introduction: Sources of Microarray Data Variability
• Arrays
  • Same sample may be hybridized to different arrays in different labs
  • PCR product probes: prepared through amplification directly from cells; must add the same amount of product to each spot on the filter
  • Uniformity of spotting: must use an arraying tool for filter arrays or a robot for microarrays.
  • Treatment and handling of filters or slides
• Hybridization and washing
  • Time: long hybridization ensures that hybridization goes to completion.
  • Temperature: most hybridisations are performed between 45 and 65 °C
• Data acquisition
  • Image acquisition
  • Spot and background detection
Oligos reduce the variability of probes compared to PCR products. In-situ synthesis standardises probe production, produces better spot quality and reduces errors in image acquisition.
Introduction: Biological and Technical Variability
• Biological variability is the variation between individuals in the population; it is independent of the microarray process itself.
  • Population variability can be measured with pilot studies.
• Technical variability is dependent on the microarray process itself.
  • Technical variability is measured in calibration experiments.
• In good experiments, technical variation should be much less than biological variation.
Introduction: Experimental Design
[Figure: tree of replicate experiments. Experiment; biological replicates (Replicate 1, Replicate 2); technical replicates (Extract 1, Extract 2); labelling (Cy3, Cy5).]
• Tree representation of replicate experiments:
  • The first level is at the level of biological replicates
  • This is followed by two independent mRNA extractions, and reciprocal Cy3 and Cy5 labelling
  • Finally, on each array, each probe is printed in triplicate.
• In this example, each data point in the experiment is replicated a total of 24 times (2 biological replicates × 2 extractions × 2 labellings × 3 printed spots).
• Furthermore, in each microarray experiment, each gene (each probe or probe set) is really a separate experiment in its own right.
Introduction: Gene Expression Matrices
[Figure: from raw data (array scans, images), through intermediate data (spot/image quantitations, spots), to final data (gene expression matrix of genes by samples, holding gene expression levels).]
Normalisation Examples: Probe Intensity Value
• The raw intensities of signal from each spot on the array are not directly comparable. Depending on the types of experiments done, a number of different approaches to normalization may be needed. Not all types of normalization are appropriate in all experiments. Some experiments may use more than one type of normalization.
• Reasonable assumption: intensities of fluorescent molecules reflect the abundance of the mRNA molecules. This is generally true but could be problematic.
• Example:
  • intensity of gene A spot is 100 units in normal-tissue array
  • intensity of gene A spot is 50 units in cancer-tissue array
  • Conclusion: gene A's expression level in normal tissue is significantly higher than in cancer tissue
Typical problem: there is usually more variability at low intensity.
Normalisation Examples: Probe Intensity Value
[Figure: images showing examples of how background intensity can be calculated.]
• Problem: what if the overall background intensity of the normal-tissue array is 95 units while the background intensity of the cancer-tissue array is 10 units?
• Solutions:
  • Subtract the background intensity value
  • Take the ratio of spot intensity to background intensity (preferable)
  • In both cases, have to decide where to measure background intensity (e.g. local to the spot or globally per chip)
• In general, there could be many factors contributing to the background intensity of a microarray chip
• To compare microarray data across different chips, data (intensity levels) need to be normalized to the "same" level
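The two background corrections can be tried on the slide's own numbers (gene A at 100 units over a background of 95 in the normal-tissue array, and 50 units over a background of 10 in the cancer-tissue array). This is only a minimal sketch: the function names are illustrative, and a single background value per array is assumed.

```python
def subtract_background(spot, background):
    """Background correction by subtraction."""
    return spot - background


def ratio_to_background(spot, background):
    """Background correction by taking the spot/background ratio."""
    return spot / background


# Values taken from the gene A example.
normal = {"spot": 100.0, "background": 95.0}
cancer = {"spot": 50.0, "background": 10.0}

normal_sub = subtract_background(**normal)
cancer_sub = subtract_background(**cancer)
normal_ratio = ratio_to_background(**normal)
cancer_ratio = ratio_to_background(**cancer)

print("subtraction:", normal_sub, cancer_sub)
print("ratio:", round(normal_ratio, 3), cancer_ratio)
```

With either correction, gene A now appears higher in the cancer-tissue array, which reverses the conclusion drawn from the raw intensities.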
Normalisation Examples: Two Colour Arrays
[Figure: sample and control channels.]
• Reasonable assumption: for two colour arrays, in a self-self hybridization, we expect Red = Green for each spot
• Problem: this is not necessarily true due to labelling effects, chemistry (dye properties), scanner properties, etc.
• Dye bias in two-channel microarrays: intensity in one channel may be higher than in the other
• Solutions:
  • Dye swapping experiments: in the first replicate label the control with red and the experiment with green; in the second replicate swap the colours
  • Calibration experiments (self vs. self hybridisation): label the same extract with both colours and calculate the variation
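One common way to use a dye-swap pair is sketched below. It assumes, as a simplification not stated on the slide, that the dye bias adds the same constant to log2(Red/Green) on both arrays, so that it cancels when the two arrays are combined. The function and variable names are illustrative.

```python
def dye_swap_estimate(m_control_red, m_sample_red):
    """Combine log2(Red/Green) from a dye-swap pair.

    m_control_red: log2(R/G) on the array where the control is red.
    m_sample_red:  log2(R/G) on the array where the sample is red.
    Assumes an additive dye bias b on the log scale, so that
    m_control_red = -t + b and m_sample_red = t + b,
    where t is the true log2(sample/control).
    """
    true_log_ratio = (m_sample_red - m_control_red) / 2
    dye_bias = (m_sample_red + m_control_red) / 2
    return true_log_ratio, dye_bias


# Synthetic example: true log ratio 1.0, dye bias 0.25 towards red.
t, b = dye_swap_estimate(m_control_red=-0.75, m_sample_red=1.25)
print(t, b)
```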
Normalisation Examples: Two Colour Arrays
• "Error" correction: the fitted line is y = ax, whereas the expected line is y = x
• Possible ways of "correction":
  1. multiplying all x by a;
  2. dividing all y by a
• Can easily be extended when the regression line is y = ax + b
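A minimal sketch of this correction, assuming the two channels are related by a line through the origin, y = ax: the slope is estimated by least squares and every y is divided by it, which brings the points back onto y = x. The data are made up for illustration.

```python
def slope_through_origin(x, y):
    """Least-squares slope a for the model y = a * x (no intercept)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


# Synthetic intensities: channel y reads 1.5 times higher than channel x.
x = [100.0, 200.0, 400.0, 800.0]
y = [150.0, 300.0, 600.0, 1200.0]

a = slope_through_origin(x, y)
y_corrected = [yi / a for yi in y]  # corrected points lie on y = x

print(a, y_corrected)
```

For a fitted line y = ax + b, the same idea gives the correction (y - b) / a.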
Normalisation Examples: Ratio of Signal to Positive Control
• Problem: is there any cross hybridisation?
• Solution: it is often useful to spike the labelling reaction with some foreign RNA or DNA that is not normally in the RNA population.
• The signal s_i for gene i would therefore be the raw counts g_i divided by the median of the counts for the vector (spiked-control) spots.
How does this approach compare to the Affymetrix PM/MM probes?
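The ratio to a positive control can be sketched as follows. The gene names and counts are invented for illustration; the only rule taken from the slide is that each gene's raw count is divided by the median count of the control spots.

```python
from statistics import median


def normalise_to_control(gene_counts, control_counts):
    """Return s_i = g_i / median(control counts) for every gene i."""
    control_median = median(control_counts)
    return {gene: count / control_median for gene, count in gene_counts.items()}


# Hypothetical raw counts for two genes and three spiked-control spots.
gene_counts = {"geneA": 1200.0, "geneB": 300.0}
control_counts = [580.0, 600.0, 650.0]

signals = normalise_to_control(gene_counts, control_counts)
print(signals)
```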
Normalisation Examples: Ratio of Signal to Positive Control
• Normalization of the signal for each gene to a ratio makes it possible to compare ratios between experiments, provided that the spiked controls are the same in all experiments.
• Normalization to a positive control is typically used in single-label experiments. Comparison of one experiment to another can either be done by plotting the signal s_i directly on a graph, or signals from two experiments can be converted into a ratio, usually by choosing one treatment as a control.
• For example, in a time course, a 0 hour time point might be chosen, and the signal from all other time points divided by the signal for the 0 hour time point, to give a ratio.
Normalisation Examples: Spatial Variation within Array
• Problem: signal varies according to spot location
  • Particularly at the corners: less hybridization solution
  • Also because of the print-tip group of the robot
• Solutions:
  • Calculate ratio to mean or total intensity value
  • Use Locally Weighted Regression (Lowess)
  • Use Block-Block Lowess
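The "ratio to mean" solution can be sketched per block of the array. The choice of computing the mean separately for each block (for example, each print-tip group) is an assumption made here for illustration, as are the names and data.

```python
from collections import defaultdict


def normalise_by_block(spots):
    """Divide each intensity by the mean intensity of its block.

    spots: list of (block_id, intensity) pairs.
    """
    by_block = defaultdict(list)
    for block, intensity in spots:
        by_block[block].append(intensity)
    block_mean = {b: sum(v) / len(v) for b, v in by_block.items()}
    return [(block, intensity / block_mean[block]) for block, intensity in spots]


# Block 1 is systematically dimmer than block 0 (e.g. a corner of the array).
spots = [(0, 100.0), (0, 300.0), (1, 50.0), (1, 150.0)]
normalised = normalise_by_block(spots)
print(normalised)
```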
Normalisation Examples: Between Array Normalisation
• Assumption: the overall intensities across two arrays should be similar
• Problem: not always the case
• Solution 1: ensure that data points in the two-intensity coordinate system are roughly centred around the diagonal
• Solution 2: use total intensity normalization for large number
Normalisation Methods: Between Array Normalisation
• Mean/median centring: mean/median intensity of every chip brought to the same level
• Total intensity normalization: scaling factor determined by summing intensities
• Spiked-control, housekeeping normalization (positive controls)
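The first two methods can be sketched in a few lines. The use of a reference array for the scaling factor, and of subtraction on log-scale values for median centring, are illustrative choices rather than details given on the slide.

```python
from statistics import median


def total_intensity_scale(reference, array):
    """Scale `array` so that its total intensity equals that of `reference`."""
    factor = sum(reference) / sum(array)
    return [value * factor for value in array]


def median_centre(log_values):
    """Shift log-scale values so that their median becomes zero."""
    m = median(log_values)
    return [value - m for value in log_values]


reference = [100.0, 200.0, 300.0]
dim_array = [50.0, 100.0, 150.0]  # same pattern at half the brightness

scaled = total_intensity_scale(reference, dim_array)
centred = median_centre([1.0, 2.0, 4.0])
print(scaled, centred)
```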
Normalisation Methods: Centring and Scaling
• Data is scaled to ensure that the means and the standard deviations of all of the distributions are equal. For each measurement on the array, subtract the mean measurement of the array and divide by the standard deviation. Following centring and scaling, the mean measurement on each array will be zero, and the standard deviation will be 1.
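A short sketch of this transformation for a single array follows. It uses the population standard deviation (`statistics.pstdev`); using the sample standard deviation instead would be an equally reasonable choice. The measurements are made up.

```python
from statistics import mean, pstdev


def centre_and_scale(values):
    """Subtract the array mean and divide by the array standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(value - mu) / sigma for value in values]


array = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, std dev 2
z = centre_and_scale(array)
print(z)
print(mean(z), pstdev(z))
```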
Normalisation Methods: Normalized Ratios Usually Expressed as Logs
• To facilitate easier mathematical handling of the data, as well as comparisons over a wide range of expression levels, ratios are usually expressed as logs.
• For example, if a gene is expressed in the control at 4 times its level in the mutant, the mutant/control ratio is 1/4 and log2(1/4) = -2. A log ratio of 0 is therefore indicative of a gene whose expression is the same in both conditions or treatments.
• Ratio: $T_g = \frac{R_g}{G_g}$
• Log ratio: $\log_2(T_g) = \log_2\left(\frac{R_g}{G_g}\right)$
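The symmetry of the log ratio is easy to check numerically. In this small sketch the function name is illustrative, and red and green stand for the two channel intensities of one gene.

```python
import math


def log_ratio(red, green):
    """log2 of the red/green intensity ratio for one gene."""
    return math.log2(red / green)


down = log_ratio(1.0, 4.0)  # four-fold lower in the red channel
up = log_ratio(4.0, 1.0)    # four-fold higher in the red channel
same = log_ratio(5.0, 5.0)  # unchanged
print(down, up, same)
```

A four-fold decrease and a four-fold increase are the same distance from zero on the log scale, whereas the raw ratios (0.25 and 4) are not symmetric around 1.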
Normalisation Methods: Regression Normalisation
• Regression normalization:
  • Fit the linear regression model: y = ax + b
  • Test the significance of the intercept b. Fit a linear regression without b if it is insignificant.
  • Transform the data
• Problem: the linearity assumption may not hold due to a nonlinear trend
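The fit and transform steps can be sketched with ordinary least squares. The significance test on the intercept is left out of this sketch, and the transform (y - b) / a is one natural choice for mapping the fitted line onto y = x. The data are synthetic.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = a * x + b; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def regression_normalise(x, y):
    """Transform y so that the fitted line becomes y = x."""
    a, b = fit_line(x, y)
    return [(yi - b) / a for yi in y]


x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]  # exactly y = 2x + 1
a, b = fit_line(x, y)
y_norm = regression_normalise(x, y)
print(a, b, y_norm)
```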
Normalisation Methods: From Scatter Plot to MA Plot
• Instead of plotting the two intensity values against one another (scatter plot), it is common to use an MA plot
• M = log2(R/G): log ratio of the two intensities
• A = log2(sqrt(R*G)) = ½ log2(R*G): mean log intensity of the two values
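The two coordinates of an MA plot can be computed directly from the definitions above; the function name and the example intensities are illustrative.

```python
import math


def ma_values(red, green):
    """Return (M, A) for one spot from its red and green intensities."""
    m = math.log2(red / green)        # log ratio
    a = 0.5 * math.log2(red * green)  # mean log intensity
    return m, a


m, a = ma_values(8.0, 2.0)
print(m, a)
```

Here A equals the average of log2(8) = 3 and log2(2) = 1, which is why it is described as the mean log intensity.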
Normalisation Methods: Lowess Normalization
• Locally Weighted Least Squares Regression
• Assumption: variation in the data is intensity dependent
• Smooths the intensity function
• Lowess is typically applied to M-A plots
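The idea can be illustrated with a deliberately simplified, pure-Python version of locally weighted linear regression on (A, M) pairs. It uses tricube weights over the nearest fraction `frac` of the points and omits the robustness iterations of full Lowess; the function names and the parameter `frac` are choices made for this sketch. The normalised value is M minus the fitted trend at the same A.

```python
def tricube(u):
    """Tricube weight: (1 - |u|^3)^3 inside the window, 0 outside."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def lowess_fit(a_vals, m_vals, frac=0.5):
    """Fitted M at each A from a locally weighted straight-line fit."""
    n = len(a_vals)
    k = max(2, int(round(frac * n)))  # number of neighbours in each window
    fitted = []
    for a0 in a_vals:
        dists = sorted(abs(a - a0) for a in a_vals)
        h = max(dists[k - 1], 1e-12)  # window half-width
        w = [tricube((a - a0) / h) for a in a_vals]
        sw = sum(w)
        swx = sum(wi * a for wi, a in zip(w, a_vals))
        swy = sum(wi * m for wi, m in zip(w, m_vals))
        swxx = sum(wi * a * a for wi, a in zip(w, a_vals))
        swxy = sum(wi * a * m for wi, a, m in zip(w, a_vals, m_vals))
        denom = sw * swxx - swx ** 2
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)  # degenerate window: weighted mean
        else:
            slope = (sw * swxy - swx * swy) / denom
            intercept = (swy - slope * swx) / sw
            fitted.append(intercept + slope * a0)
    return fitted


def lowess_normalise(a_vals, m_vals, frac=0.5):
    """Subtract the intensity-dependent trend from each M value."""
    trend = lowess_fit(a_vals, m_vals, frac)
    return [m - t for m, t in zip(m_vals, trend)]


# Synthetic data with a purely intensity-dependent bias: M = 0.5 * A - 1.
a_vals = [float(i) for i in range(10)]
m_vals = [0.5 * a - 1.0 for a in a_vals]
m_norm = lowess_normalise(a_vals, m_vals)
print([round(v, 6) for v in m_norm])
```

Because the synthetic bias is exactly linear in A, the local fits recover it and the normalised M values are all approximately zero.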
Summary
• Normalisation is used to identify whether variation is due to experimental conditions.
• Typical sources of variation are
  • Population, Sample, Target, Array (Probe), Hybridisation, Data Acquisition
• Different Normalisation Examples
  • Probe intensity values
  • Two colour arrays
  • Positive controls
  • Spatial normalisation within array
  • Between array normalisation
• Common Normalisation Methods
  • Mainly scaling factors and regression