1 / 37

Introduction to R Bioconductor

Outline. Advantages

loe
Download Presentation

Introduction to R Bioconductor

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


    1. Introduction to R + Bioconductor J. Verducci MBI Summer Workshop August, 2005

    2. Outline Advantages & Disadvantages of R+BioC Getting Started: Setup and Basic Objects Preprocessing: from probe-level to expressions Differential Gene Expression in +8 AML Interpretation via GO

    3. Advantages of R + Bioconductor Most Extensive and Flexible Environment Oligonucleotide & cDNA Arrays Multiple methods for Background correction Probe-specific correction Normalization over multiple chips State of the Art Routines Open Source Best Platform for Development FREE!

    4. Disadvantages LARGE startup/learning cost in time No warrantee Informal Support Needs LOTS of memory Code typically not optimized

    5. Getting Started Background Basics of Affymetrics Chip Design Need to correct for Background Noise Need for Normalization NCI Expression Data on +8 subtype of AML The R computing environment Installation for Windows Bioconductor Installation Biobase Package Affy CEL and CDF files Affy Package ReadAffy Expresso

    6. Affymetrics Chip System Probes Oligionucleotides 25 bp in length (~1030 possible) Attach to different segments of each RNA target Probe Pair Perfect Match (PM) probe: exact match to target Mismatch (MM) probe: middle base pair mismatch Probe Pair Sets 16 probe pairs for a specific RNA Ordered from 5 to 3 end of the RNA Files CEL: probe level PMs and MMs related to a common affyID (one CEL file for chip) CDF: relates probe pair sets to location on the array (same map for all chips used in an experiment)

    7. Calibration Issues Global Background Correction Spatial flaws in chip (dust, scratches, etc.) Variations in binding due to sample preparation due to variations in number of G-C bonds due to positioning of probe along length of target Probe Specific Background Correction A MM probe may be similar to other PM probes, some corresponding to highly expressed genes, making MM > PM Sample may have contained DNA fragments Normalization to compensate for chip to chip variations

    8. Krahe AML Data Acute myeloid leukemia (AML) is a heterogeneous group of diseases. Normal cytogenetics (CN) constitutes the single largest group trisomy 8 (+8) as a sole abnormality is the most frequent trisomy Control Group CD34(+) cells purified from normal bone marrow (BM) were also analyzed as a representative heterogeneous population of stem and progenitor cells. Results Expression patterns of AML patients were clearly distinct from those of CD34(+) cells of normal individuals. AML+8 blasts overexpress genes on chromosome 8, estimated at 32% on average, suggesting gene-dosage effects underlying AML+8. Systematic analysis by cellular function indicated up-regulation of genes involved in cell adhesion in both groups of AML compared with CD34(+) blasts from normal individuals apoptosis-regulating genes were significantly down-regulated in AML+8 compared with AML-CN Clinical and cytogenetic heterogeneity of AML is due to fundamental biological differences.

    9. The R Computing Environment Download from http://cran.r-project.org/bin/windows/base/ CRAN Comprehensive R Archive Network Site for introduction, help, and news Object oriented programming S developed by John Chambers, AT&T Commercial package Splus (7.0) from Insightful Freeware R (2.1.0) supported by CRAN

    10. Bioconductor Installation From your R session, type: source("http://www.bioconductor.org/getBioC.R") This will download the getBioC functionality into your R session. To install Bioconductor packages, use the function "getBioC" as follows. To download and install the default packages, type: getBioC(). Note: despite documentation, tkWidgets [not really needed] must be loaded separately Win32 package download Remove the folder tkWidgets from tkWidgets_1.5.20 to unzip. From the Packages menu, pull down Load package From the new window select affy This automatically loads the required additional packages Biobase Tools reposTools

    11. Biobase Toy data geneData, data frame of real, anonymized, expression data 500 rows (genes) 26 cases (chips) eset, same information saved as an expression set To make these accessible, at the command sign > , type data(geneData) data(eset)

    12. Expression Set Fundamental Object used for most analyses Contains more stored information than a simple Data frame Information is stored in slots. To access these type expr.set@slot name (e.g eset@exprs) slotNames(eset) "exprs" 500 x 26 matrix of expressions "se.exprs" estimated std. errors "phenoData" info on covariates assoc. w cases "description" usually contains MIAME information "annotation" links affy names of genes to other systems "notes"

    13. Reading in CEL files Create a directory Move all relevant CEL files to that directory This will unzip files that have been downloaded Select Change Dir from the File menu to point to the CEL file directory Load the affy package [if not already done; you can check using >search()] Read in the data using nciAML <- ReadAffy() The symbol <- is the assignment operator in R. nciAML is an AffyBatch object This may require a lot of memory (at least 1G for Krahe AML data) Quick alternative, requiring less space, is to go directly to an expression file: >AMLeset <- justRMA()

    14. AffyBatch Example: nciAML slots nciAML@cdfName Hu6800 nciAML@nrow -- 536 common chip dimension nciAML@ncol -- 536 common chip dimension nciAML@exprs <287,296 x 31> matrix Probe values for 31 chips nciAML@se.exprs -- <0 x 0 matrix> Not yet processed nciAML@phenoData phenoData object with 1 covariate, which gives type of cell for each chip Chips 1-9: AML-CN Chips 10-23: AML+8 Chips 24-31: CD34 bone marrow stem cells from normals

    15. Preprocessing Background correction Adjust for spatial biases on each chip Normalization Adjust for Chip to Chip technical effects Probe specific background correction Adjust PM for MM Summarize adjusted probe set values into one expression measure (and s.e.)

    16. Expresso Combines 4 preprocessing steps: nciAMLeset <- expresso(nciAML, bgcorrect.method=rma, normalize.method=quantiles, pmcorrect.method=pmonly, summary.method=medianpolish) Alternatively, preprocessing may be done one step at a time, examining effects of each step (see Exploration of Probe Level Data)

    17. Method Options > bgcorrect.methods mas none rma rma2 > normalize.methods(nciAML) constant contrasts invariantset loess qspline quantiles quantiles.robust

    18. Method Options (continued) > pmcorrect.methods mas pmonly subtractmm > express.summary.stat.methods avgdiff liwong mas medianpolish playerout

    19. Generic Versions of Commercial Alternatives MAS 5.0 (generic version of Affymetrics) >AMLeset5 <- mas5(nciAML) >AMLcalls <- mas5calls(nciAML) dchip (Li and Wongs MBEI) >AMLdchip <- expresso(nciAML, normalize.method=invariantset, bg.correct=FALSE, pmcorrect.method=pmonly, summary.method=liwong)

    20. Exploration of Probe Level Data Its good practice to look at the data in a variety of ways. Can check on outliers, patterns of bias, need for log transform, etc. Possible Actions Correct obvious problems directly Use standard correction methods appropriate to the data

    21. Chip Images image(nciAML)

    22. Odd AML+8 Chip

    23. ProbeSet class of objects Organizes information from an AffyBatch object into pm and mm matrices, one for each gene. >gn <- geneNames(affybatch.example) >ps <- probeset(affybatch.example,gn) >show(ps[[1]]) ProbeSet object: id=A28102_at pm= 16 probes x 3 chips

    24. Toy Data (AffyData) barplot.ProbeSet(ps[[1]])

    25. Another gene

    26. Compare Pairs of Chips pairs.AffyBatch(affybatch.example)

    27. The M vs A plot CAGE Xylella Project (Apr/2004) The so-called M-A-plot is a graphical way to see ratios and fluorescence intensity at the same time. It was proposed by: Dudoit et al. Statistica Sinica (2002) 12:111. The merits are: it shows non-linear unwanted dependence between ratios and fluorescence intensities and it shows that using only ratios is a naive way to identify differentially expressed genes. As defined for two color cDNA arrays in that work: A = 1/2*(log2(Cy5) + log2(Cy3)) M = log2(Cy5 / Cy3) For Affy arrays, Cy3 is the reference value, which is the median value of a pm over all chips chosen from a homogeneous group Cy5 is the pm value on the chip of interest Using just the ratios or log ratios to visualize the data does not enable us to see the systematic dependence of the ratio on intensity values. The final result from micro-array should depend only on biological material hybridized and not on the strength of the hybridization signal. The M-A plot, such as the one in the figure on the next slide, shows if it is necessary to use a non-linear fitting for normalization.

    28. Lowess Normalization

    29. Chip Bias May Depend on Level

    30. Or Not

    31. Determining Differential Expression Usually just 2 groups, but sometimes more Assume Expression Set contains all relevant information Most direct approach: Use (some sort of) t-test for each gene Correct for multiple comparisons (e.g. by restricting the FDR) Verify statistically significant genes by rtPCR

    32. Danger of Using Fold-Difference Using just the ratios would not enable us to see that (corrected) ratios of low-intensity spots are not reliable. R = 100/10 = 10-fold difference has not the same experimental value of R' = 1000/100 = 10-fold too, for example. This is obvious from M-A-plot of "self-self" hybridizations (same biological material labeled with both Cy3 and Cy5), as shown in the figure below:

    33. t-test The t-test is the most basic and well-known hypothesis test of Statistics. For each gene, compares the difference: (mean expression in group A) (mean expression in group B) Relative to levels of within-group variability Some methods try to incorporate std err of the expression values Some analysts use Bayesian methods to expand within-group variability, especially for low-level expression values Typically this expression in the expression data set is measured on a log scale, so differences in logs correspond to log ratio

    34. Multiple Comparisons via Limma http://bioinf.wehi.edu.au/limma/usersguide.pdf Load package limma from the Package menu. >aml.type <- c(rep(1,9),rep(0,14)) >design<-model.matrix(~aml.type) >colnames(design) <- c("CN","+8") >fit <- lmFit(AMLeset[,1:23],design) >results <- decideTests(fit, adjust="fdr", p=0.05) To select all up genes: >fit.up <- fit[results[,2] ==1, ] To select all down genes: >fit.down <- fit[results[,2]== -1, ] To output to a file: >write.fit(fit.up[,2], file="myresults.txt") or simply >write(fit.up[,2], file="myresults.txt")

    35. Heatmap of Selected Genes

    36. Heatmap of Genes from eset

    37. More Advanced Methods Incorporating Gene Correlations Elastic Net Zou & Hastie, 2005 http://cran.r-project.org/src/contrib/PACKAGES.html Gradient Directed Regularization Friedman & Popescu, 2004 http://www-stat.stanford.edu/~jhf/PathSeeker.html Shrunken Centroid Ordering by Othogonal Projections (SCOOP) Verducci, 200? E-mail me (verducci.1@osu.edu) for alpha version

    38. Connecting to Gene Ontology OntologyTraverser: an R package for GO analysis A. Young , N. Whitehouse , J. Cho and C. Shaw * Baylor College of Medicine, Department of Molecular and Human Genetics Houston, TX, USA Summary: Gene Ontology (GO) annotations have become a major tool for analysis of genome-scale experiments. We have created OntologyTraverseran R package for GO analysis of gene lists. Our system is a major advance over previous work because the system can be installed as an R package, the system uses Java to instantiate the GO structure and the SJava system to integrate R and Java the system is also deployed as a publicly available web tool. Availability: Our software is academically available through http://franklin.imgen.bcm.tmc.edu/OntologyTraverser/. Both the R package and the web tool are accessible. Contact: cashaw@bcm.tmc.edu

More Related