1 / 56

An introduction to quantitative biology and R

An introduction to quantitative biology and R. David Quigley dquigley@cc.ucsf.edu Helen Diller Comprehensive Cancer Center, UCSF Institute for Cancer Research, University of Oslo. Molecular biology 20 years ago. qualitative methods small-scale quantitative tests.

geneva
Download Presentation

An introduction to quantitative biology and R

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. An introduction to quantitative biology and R David Quigley dquigley@cc.ucsf.edu Helen Diller Comprehensive Cancer Center, UCSF Institute for Cancer Research, University of Oslo

  2. Molecular biology 20 years ago qualitative methods small-scale quantitative tests Suzuki Med Mol Morph 2010 Oh PNAS 1996 Mao Genes Dev 2004 David Quigley dquigley@cc.ucsf.edu

  3. Molecular biology now qualitative methods small-scale quantitative tests large-scale quantitative analysis microarrays, *seq methods, cell phenotype screens... Nik-Zainal Cell 2012 Fullwood Nature 2009 CGAN Nature 2012 David Quigley dquigley@cc.ucsf.edu

  4. Many studies use both approaches Hypothesis-generating: Normalize breast tumor expression data (three cohorts) Call genotypes from SNP arrays Calculate association between genotype and expression genome-wide (eQTL) Identify an interesting candidate Valid ate eQTL in independent cohort Hypothesis-driven: Methylation analysis at MRPS30 promoter In vitro (two cell lines): ChIP-PCR +/- estrogen qPCR sequencing David Quigley dquigley@cc.ucsf.edu

  5. Common quantitative techniques Gene Expression transcription splicing methylation protein binding (ChIP-seq) Genomics and Genetics de novo assembly DNA copy number (CNV and tumors) germline variant analysis tumor variant analysis SNPs, indels, translocation analysis of tumor clonality

  6. Challenges requires statistical sophistication in study design in interpretation many data points 1,000 to 1,000,000 measurements per sample many false positives which look like great stories software becomes part of the experiment divide between engineering, biology culture & thinking David Quigley dquigley@cc.ucsf.edu

  7. Wet lab and quantitative skills: better job prospects Schillebeeckx Nature Biotech. 2013 David Quigley dquigley@cc.ucsf.edu

  8. What approachs are used to analyze quantitative data?

  9. Chosing a tool Cost Learning curve Ease of use Flexibility (closed to open-ended) Software ecosystem (none to extensive) David Quigley dquigley@cc.ucsf.edu

  10. Traditional programming languages Python, C++, Java, others can solve any computable problem creates the fastest tools free requires programming expertise complex to write and test  high effort David Quigley dquigley@cc.ucsf.edu

  11. Specialized single-purpose programs command line tools academic research type commands at a prompt or run scripts PLINK, bowtie, GATK, bedtools GUI (point and click) commercial software for a vendor’s platform slick, opaque, hard/impossible to automate David Quigley dquigley@cc.ucsf.edu

  12. Commercial statistics programs STATA, SPSS, GraphPad, others 1) Load one dataset 2) Select analysis by clicking on a GUI 3) Generate a report May have a built-in language Very mature tools for traditional biostatistics Not free David Quigley dquigley@cc.ucsf.edu

  13. Web-based tools Galaxy string together pre-defined analysis steps very easy to use reproducible David Quigley dquigley@cc.ucsf.edu

  14. R: a “software environment” Using R is like writing and using software Traditionally, biologists did not do this. R was written by statisticians to be a free replica of another language called “S”. David Quigley dquigley@cc.ucsf.edu

  15. Why is R popular? Flexible, open-ended, open-source Large library of packages package: easy-to use published methods like a Qiagen kit Free! David Quigley dquigley@cc.ucsf.edu

  16. You use R by typing at the prompt There is no pull-down menu of statistical commands David Quigley dquigley@cc.ucsf.edu

  17. What’s good about this approach? chain analyses work with multiple datasets use packages of code easy to reproduce runs on anything makes sense to computer programmers David Quigley dquigley@cc.ucsf.edu

  18. What’s hard about this approach? hard to get started cryptic commands built-in help is amusingly unhelpful David Quigley dquigley@cc.ucsf.edu

  19. packages: collections of R functions collection of R code that solves a specific task limma: microarray normalization and analysis samr: differential expression impute: dealing with missing data downloaded for free from a central repository David Quigley dquigley@cc.ucsf.edu

  20. bioconductor Curated collection of R packages Microarrays, aCGH, sequence analysis, advanced statistics, graphics, lots more bioconductor.org David Quigley dquigley@cc.ucsf.edu

  21. Learning R data types by comparing them to Excel spreadsheets

  22. Comparing Excel and R Excel Easy tasks are easy non-trivial tasks impossible or expensive No paper trail Mangles gene names Plots look terrible David Quigley dquigley@cc.ucsf.edu

  23. Comparing Excel and R Excel Easy tasks are easy non-trivial tasks impossible or expensive No paper trail Mangles gene names Plots look terrible R Easy jobs are hard at first Non-trivial things are possible Easy to make a paper trail Biostatistics researchers publish tools in R Can create publication-ready plots David Quigley dquigley@cc.ucsf.edu

  24. Organizing data in Excel Each subject has a row. Each column has a feature of your subjects. David Quigley dquigley@cc.ucsf.edu

  25. R calls the data points variables variables numbers and characters (letters, words) numbers: 2.6, 4 characters: “Flopsy”, “white, brown paws” David Quigley dquigley@cc.ucsf.edu

  26. R calls the columns vectors vectors ordered collections of a variable name: [“Flopsy”, “Mopsy”, “Cottontail”, “Peter”] age: [2.5, 2.6, 2.5, 4] David Quigley dquigley@cc.ucsf.edu

  27. R calls the data set a data frame data frame a list of vectors (columns) that have names elements can be read and written by row & column David Quigley dquigley@cc.ucsf.edu

  28. I can slice and dice the data frame David Quigley dquigley@cc.ucsf.edu

  29. Tell R to do things using functions function_name( details about how to do it ) generate sequence from 1 to 5 counting by 0.5 parameters for seq are named from, to, and by David Quigley dquigley@cc.ucsf.edu

  30. Tell R to do things using functions function_name( details about how to do it ) report the mean of my.data. Result of one function is fed into another one. David Quigley dquigley@cc.ucsf.edu

  31. Tell R to do things using functions function_name( details about how to do it ) define a new function that adds 2 to whatever it’s passed compare to original value of my.data David Quigley dquigley@cc.ucsf.edu

  32. Walk-through a straightforward analysis

  33. Primary data from METABRIC study gene expression TP53sequence 1,400 samples from 5 hospitals Is there an association between breast cancer subtype and TP53 mutation? David Quigley dquigley@cc.ucsf.edu

  34. Tasks Normalize data batch effects unwanted inter-sample variation Identify outliers associations between p53 and subtype David Quigley dquigley@cc.ucsf.edu

  35. Quantile Normalization (limma) Force every array to have the same distribution of expression intensities > library(limma) > raw = read.table('raw_extract.txt’, ...) > raw.normalized = normalize.quantiles( raw ) > normalized = log2( raw.normalized ) David Quigley dquigley@cc.ucsf.edu

  36. Identify batch effects in microarrays Principle Components Analysis Identify strongest variation in a matrix gene 1 gene 2 David Quigley dquigley@cc.ucsf.edu

  37. Identify batch effects in microarrays Principle Components Analysis Identify axes of maximal variation in a matrix gene 1 first principle component gene 2 David Quigley dquigley@cc.ucsf.edu

  38. Identify batch effects in microarrays Principle Components Analysis Identify strongest variation in a matrix group A group B gene 1 gene 1 gene 2 gene 2 David Quigley dquigley@cc.ucsf.edu

  39. PCA of identifies a batch effect second principle component hospital 3 (yellow) first principle component > my.pca = prcmp( t( expression.data ) ) > plot( my.pca, ... ) David Quigley dquigley@cc.ucsf.edu

  40. batch correction reduces bias (ComBat) ComBat package reduces user-defined batch effects second principle component first principle component David Quigley dquigley@cc.ucsf.edu

  41. Molecular subtypes of breast carcinoma, defined by gene expression ER status Luminal A N=507 Luminal B N=379 Her2 N=161 Basal N=234 > sa = read.table(‘patients.txt’, ...) > tumor.counts = table( sa$ER.status, sa$PAM50Subtype) (convert counts to percentages) > barplot( c( tumor.counts[1], tumor.counts[2] ), col=c(“red”,”green”), ... ) David Quigley dquigley@cc.ucsf.edu

  42. Find interactions: TP53 and subtype Fit a linear model: > fitted.model = lm( dependent ~ independent ) Perform Analysis of Variance: > anova( fitted.model ) general form of my analysis: > anova( lm( gene.expression ~ PAM * TP53 ) 18,000 genes PAM: {LumA, LumB, Her2, Basal} TP53: {mutant, WT} David Quigley dquigley@cc.ucsf.edu

  43. Automate with loops Calculate anova for 18,000 genes by looping through each gene and storing result. > n_genes = 18000 > result = rep( 0, n_genes ) > for( counter in 1:n_genes ){ result[counter] = anova(...) } sort results identify significant interaction repeat 18,000 times David Quigley dquigley@cc.ucsf.edu

  44. Immune infiltration in TP53-WT Basal CD3E Does p53 have a role in immune surveillance? log2 expression log2 expression absent mild severe David Quigley dquigley@cc.ucsf.edu infiltration

  45. Next steps: getting help and learning more

  46. online forums: expert help for free biostars.org all of bioinformatics David Quigley dquigley@cc.ucsf.edu

  47. online forums: expert help for free seqanswers.com biostars.org Nextgen sequencing all of bioinformatics David Quigley dquigley@cc.ucsf.edu

  48. online forums: expert help for free seqanswers.com biostars.org Nextgen sequencing all of bioinformatics stats.stackexchange.com statistics David Quigley dquigley@cc.ucsf.edu

  49. UCSF resources Library classes and information Formal courses (BMI, Biostatistics) Cores (Computational Biology, Genomics) QGDG monthly methods discussion group David Quigley dquigley@cc.ucsf.edu

  50. Online classes and blogs Free courses on data analysis http://jhudatascience.org simplystatistics.org Coursera etc... Good tutorials on sequence analysis http://evomics.org/learning David Quigley dquigley@cc.ucsf.edu

More Related