Bio277 Lab 1: Implementing Microarray Analysis

Jess Mar, Department of Biostatistics, Quackenbush Lab, DFCI (jmar@hsph.harvard.edu)

Outline
• Gene Expression and Computing
• Low-Level Analysis: Normalization
• Expression Summary Measures
• Questions

Gene Expression
Definition: the process by which the heritable information that comprises a gene, such as the DNA sequence, is made manifest as a physical and biologically functional gene product, such as protein or RNA.
In this course, we are mainly interested in quantitative gene expression: the number of mRNA transcript copies synthesized by a cell.
While we will focus heavily on microarrays, please keep in mind that this is only one technology that allows us to measure (certain aspects of) gene expression.
• Non-array-based technologies include:
  • Quantitative real-time RT-PCR
  • High-throughput (next-generation) sequencing (AB SOLiD, 454)
Gene expression technologies differ in sensitivity, throughput, cost, etc.

Computational Biology
• What is the biological question we are trying to answer?
• What is the underlying hypothesis?
In bioinformatics and computational biology research, you will be forever inundated with new algorithms, predictors, techniques and methodologies that seek to grab your attention.

Introducing BioConductor
• www.bioconductor.org
• BioConductor is the "BioR" of bioinformatics software development projects (cf. BioPerl, BioJava, BioPython).
• Open source and open development; conceived Fall 2001.
• Main features:
  • Statistical and graphical methods for the analysis of genomic data.
  • The majority of the software is [currently] geared toward microarray data.
  • Biological metadata [PubMed, LocusLink, KEGG, GO, GenBank].
  • Documentation and reproducible research.
• The project is experiencing rapid growth:
  • version 1.6: 123 packages (version 1.1: 20 packages)
  • 28 core developers.

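Installation
• First install R.
• Easiest method of installation: download a minimal ("skeleton") set of BioConductor packages by sourcing the installer script from the website within R:
  > source("http://www.bioconductor.org/biocLite.R")
  > biocLite()
• Add any other packages you want later:
  > biocLite(c("pkg1", "pkg2"))
• Note: recent BioConductor releases have replaced biocLite() with the BiocManager package:
  > install.packages("BiocManager")
  > BiocManager::install(c("pkg1", "pkg2"))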

Zebrafish Swirl Data Set
• Experiment: cDNA microarrays compared zebrafish carrying a point mutation in the BMP2 gene with wild-type zebrafish.
• The BMP2 gene affects development of the zebrafish (specifically the dorsal/ventral body axis).
• Biological question: which genes have their expression disrupted when BMP2 carries the point mutation?
• Data: 4 replicate hybridizations (2 dye-swap pairs), 8448 probes per array, printed as a 4 x 4 grid of 22 x 24 spot matrices.
• In BioConductor:
  > library(marray); data(swirl)
  > class(swirl)  # returns "marrayRaw"
• The pre-normalization intensity data for the 4 arrays are contained in this R object.

Low-level Analysis: Normalizing Microarray Data

Motivation for Normalization
"down-scaling of an experiment makes it generally sensitive to external and internal fluctuations" Schuchhardt et al. (2002).
• Sources of variability include:
  • inconsistencies in the binding efficiencies of the two fluorescent dyes (one per channel)
  • presence of dust particles and other contaminants
  • sensitivity of the experiment to varied environmental factors (e.g. humidity and temperature)

Common Ways to Normalize Data
• For Affymetrix data: quantile, rank-invariant set, loess.
• For cDNA microarray data: print-tip loess, global loess, median, spatial-based methods.
• A normalization method carries with it assumptions about the data.
• Numerical algorithms function like "black boxes" when correcting for artifacts in the data.
• How best to normalize array data is an open issue.
• Biology-based normalization methods: housekeeping genes, positive/negative spiked-in controls, etc.

Intensity-Dependent Normalization
• Assumption: the majority of genes have unchanged expression between treatment and reference groups, so log ratio values should be independent of spot intensity.
• Examples of intensity-dependent normalization:
  • Global lowess (locally weighted scatterplot smoothing)
  • Print-tip lowess, scaled print-tip lowess.
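
The idea behind intensity-dependent normalization can be sketched in a few lines. The block below is a minimal, language-neutral illustration in Python, not the marray implementation: it computes M (log ratio) and A (average log intensity) for each spot, then centers M within groups of spots of similar A. The binned median is a deliberately crude stand-in for a lowess fit, and the function names and intensities are hypothetical.

```python
import math
from statistics import median


def ma_values(red, green):
    """Return (M, A) for paired red/green spot intensities.

    M = log2(R / G) is the log ratio; A = 0.5 * log2(R * G) is the
    average log intensity.
    """
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a


def binned_median_normalize(m, a, n_bins=4):
    """Subtract from each M the median M of spots with similar A.

    Assumption: this binned median replaces the lowess curve used by
    real intensity-dependent methods. Spots are sorted by A, split into
    n_bins groups of roughly equal size, and each group is centered at 0.
    """
    order = sorted(range(len(a)), key=lambda i: a[i])
    size = max(1, math.ceil(len(order) / n_bins))
    normalized = [0.0] * len(m)
    for start in range(0, len(order), size):
        idx = order[start:start + size]
        center = median(m[i] for i in idx)
        for i in idx:
            normalized[i] = m[i] - center
    return normalized


# Made-up intensities for six spots on one two-color array.
red = [1200.0, 850.0, 400.0, 3000.0, 5200.0, 150.0]
green = [1000.0, 900.0, 300.0, 2100.0, 3900.0, 160.0]
m, a = ma_values(red, green)
m_norm = binned_median_normalize(m, a, n_bins=2)
```

A lowess fit does the same job with a smooth curve instead of a step function, which avoids jumps at the bin boundaries.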

For cDNA microarrays

Normalization for cDNA Microarrays
• Normalization methods provided by the marray package:
  • Print-tip loess: "printTipLoess"
  • Median: "median"
  • Loess: "loess"
  • Two-dimensional spatial location using loess: "twoD"
  • Scaled print-tip MAD: "scalePrintTipMAD"
• Apply print-tip loess normalization to all slides in the zebrafish swirl data set:
  > swirl.norm <- maNorm(swirl, norm = "printTipLoess")
• Let's look at the difference our normalization method makes:
  > box.lab <- paste("Slide", 1:4, sep = "")
  > boxplot(swirl, main = "Raw Log Ratios", names = box.lab, col = rainbow(4))
  > boxplot(swirl.norm, main = "Print Tip Normalized Log Ratios", names = box.lab, col = rainbow(4))

Some Exploratory Questions
• What effects do the different normalization methods have on the swirl data? Use summary statistics (e.g. standard deviations, medians) and simple plots (e.g. histograms, quantile-quantile plots, box plots) to explore this question.
• Construct the M vs A (log ratio vs average log intensity) plots for the different normalization methods. Choose one slide to focus on in your comparisons.
• How do the normalization methods perform in removing technical variation?
• Which normalization method would you recommend to the owners of the swirl data?
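
As a starting point for the first question, the kind of summary one might compute per slide can be sketched as follows. This is an illustrative Python sketch with made-up log ratios, not output from the swirl data; `mad` and `summarize` are hypothetical helper names, and the "normalization" applied is plain median centering.

```python
from statistics import median, stdev


def mad(values):
    """Median absolute deviation: a robust measure of spread."""
    values = list(values)
    center = median(values)
    return median(abs(v - center) for v in values)


def summarize(values):
    """Location and spread summaries for one slide's log ratios."""
    return {"median": median(values), "sd": stdev(values), "mad": mad(values)}


# Made-up log ratios (M values) for one slide.
raw = [0.9, 1.1, 1.4, 0.6, 1.0]

# Median centering shifts the location but leaves the spread unchanged.
centered = [v - median(raw) for v in raw]

before = summarize(raw)
after = summarize(centered)
```

Comparing `before` and `after` across slides shows which methods change only the location of the log ratios and which also change their spread.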

Normalization for Affymetrix Chips
• Normalization methods are provided in the affy package:
  • Quantile: "quantiles"; robust version: "quantiles.robust"
  • Invariant set method by Li and Wong: "invariantset"
  • Loess: "loess"
  • Spline-based method: "qspline"
  • Scaling factor: "constant"
> library(affy)
> data(affybatch.example)
> affybatch.example.norm <- normalize(affybatch.example, method = "loess")
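
To make the quantile method less of a "black box", here is a minimal Python sketch of the basic algorithm: every array is forced to share the same empirical distribution, namely the mean of the sorted intensities across arrays. It is an illustration under simplifying assumptions, not the affy implementation: ties are broken by position rather than averaged, and `quantile_normalize` and the toy intensities are hypothetical.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of equal-length intensity lists.

    arrays: one list of probe intensities per chip.
    Returns new lists in which the k-th smallest value of every chip is
    replaced by the mean of the k-th smallest values across all chips.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(values) for values in arrays]
    # Reference distribution: mean across chips at each rank.
    reference = [
        sum(values[k] for values in sorted_arrays) / len(arrays)
        for k in range(n)
    ]
    result = []
    for values in arrays:
        # Indices of this chip's probes, from smallest to largest value.
        order = sorted(range(n), key=lambda i: values[i])
        normalized = [0.0] * n
        for rank, i in enumerate(order):
            normalized[i] = reference[rank]
        result.append(normalized)
    return result


# Two made-up chips with three probes each.
chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 9.0]]
normalized = quantile_normalize(chips)
```

After normalization the two chips contain exactly the same set of values; only the assignment of values to probes differs between them.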

Constructing Summary Expression Measures

Affymetrix Probe Sets
• Affymetrix chips gather information at the probe level.
• But what are probes exactly? And how do we interpret these measures with respect to "gene expression"?
• Statistical models have been developed to bring probe-level information together to spit out a meaningful summary measure of a gene.

Computing Expression Measures for Affymetrix Data
The affy package provides a single function, expresso(), that performs background correction, normalization, probe-specific (PM) correction and summarization into one expression value per probe set.
• Background correction methods: MAS, None, RMA, RMA2
• Probe-specific correction: MAS, PM Only, Subtract MM
> library(affy)
> eset <- expresso(affybatch.example, bgcorrect.method = "rma", normalize.method = "constant", pmcorrect.method = "pmonly", summary.method = "avgdiff")

Normalization Methods (for Affymetrix data)
> normalize.methods(affybatch.example)
[1] "constant"         "contrasts"        "invariantset"     "loess"
[5] "qspline"          "quantiles"        "quantiles.robust"
To normalize with any of these methods:
> ab.ex.ls <- normalize(affybatch.example, method = "loess")
> ab.ex.is <- normalize(affybatch.example, method = "invariantset")

Different Types of Expression Measures
There are 4 methods for assigning summary expression values to the genes on an Affymetrix array:
• Average Difference (AvgDiff) model (the default in earlier Affymetrix software, MAS 4.0)
• Li-Wong Model-Based Expression Index (MBEI): maximum likelihood estimates from a linear model.
• MAS5 (the default in Affymetrix Microarray Suite 5.0)
• Robust Multi-array Average (RMA) by Irizarry et al.

Making Summary Measures
> eset <- expresso(affybatch.example, bgcorrect.method = "rma", normalize.method = "constant", pmcorrect.method = "pmonly", summary.method = "liwong")
Other options for summary.method: avgdiff, mas, medianpolish, playerout
Alternatively (if you have widgets enabled, like Tcl/Tk on Windows) you can interactively select your options:
> eset <- expresso(affybatch.example, widget = TRUE)
(Note: this might get stuck!)
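
The "medianpolish" option is the summarization step used by RMA: for one probe set, the log2 PM intensities are arranged as a probes-by-arrays table, and Tukey's median polish splits it into an overall level, probe effects, array effects and residuals. The Python sketch below illustrates that decomposition on a made-up table; it is not the affy code, and `median_polish` and `probe_set_summary` are hypothetical names. It assumes the input has already been background-corrected, normalized and log2-transformed.

```python
from statistics import median


def median_polish(table, n_iter=10, tol=1e-6):
    """Tukey's median polish on a rows-by-columns table.

    Returns (overall, row_effects, col_effects, residuals) such that
    table[i][j] = overall + row_effects[i] + col_effects[j] + residuals[i][j].
    """
    n_rows, n_cols = len(table), len(table[0])
    resid = [list(row) for row in table]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        change = 0.0
        # Sweep the row medians out of the residuals.
        for i in range(n_rows):
            d = median(resid[i])
            row_eff[i] += d
            resid[i] = [x - d for x in resid[i]]
            change += abs(d)
        d = median(col_eff)
        overall += d
        col_eff = [c - d for c in col_eff]
        # Sweep the column medians out of the residuals.
        for j in range(n_cols):
            d = median(resid[i][j] for i in range(n_rows))
            col_eff[j] += d
            for i in range(n_rows):
                resid[i][j] -= d
            change += abs(d)
        d = median(row_eff)
        overall += d
        row_eff = [r - d for r in row_eff]
        if change < tol:
            break
    return overall, row_eff, col_eff, resid


def probe_set_summary(log_pm):
    """One expression value per array: overall level plus array effect."""
    overall, _, col_eff, _ = median_polish(log_pm)
    return [overall + c for c in col_eff]


# Made-up log2 PM values: 3 probes (rows) on 2 arrays (columns).
log_pm = [[10.0, 12.0], [11.0, 13.0], [12.0, 14.0]]
expression = probe_set_summary(log_pm)
```

Using medians rather than means keeps a single outlying probe from dominating the summary, which is the "robust" part of RMA.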

Accessing Our Pre-Processed Data
We have now gone from individual probe-level information to a single value that measures the expression of each gene in each sample.
To access these measures we use the "exprs" command:
> exprs(eset)
To access phenotypic information (about the experiment):
> phenoData(eset)
The eset we created is an instance of the "ExpressionSet" class:
> class(eset)
> eset
> slotNames(eset)

What Next?
• Granted, all this might seem a bit boring, but consider this: we are now in a position to ask (and answer!) questions like…
  • What genes are differentially expressed across our conditions of interest?
  • Are these genes linked to special pathways or processes of interest?
  • Do genes from the same process have strikingly different patterns of expression?
  • Do co-expressed genes share any common promoters (are they also co-regulated)?
  • Can I build a predictor that will take a set of genes and predict what the state of the system will be (e.g. tumor vs non-tumor, wild-type vs mutant)?
  • Etc, etc, etc.
