An introduction to quantitative biology and R

An introduction to quantitative biology and R David Quigley dquigley@cc.ucsf.edu Helen Diller Comprehensive Cancer Center, UCSF Institute for Cancer Research, University of Oslo

Molecular biology 20 years ago qualitative methods small-scale quantitative tests Suzuki Med Mol Morph 2010 Oh PNAS 1996 Mao Genes Dev 2004 David Quigley dquigley@cc.ucsf.edu

Molecular biology now qualitative methods small-scale quantitative tests large-scale quantitative analysis microarrays, *seq methods, cell phenotype screens... Nik-Zainal Cell 2012 Fullwood Nature 2009 CGAN Nature 2012 David Quigley dquigley@cc.ucsf.edu

Many studies use both approaches Hypothesis-generating: Normalize breast tumor expression data (three cohorts) Call genotypes from SNP arrays Calculate association between genotype and expression genome-wide (eQTL) Identify an interesting candidate Valid ate eQTL in independent cohort Hypothesis-driven: Methylation analysis at MRPS30 promoter In vitro (two cell lines): ChIP-PCR +/- estrogen qPCR sequencing David Quigley dquigley@cc.ucsf.edu

Common quantitative techniques Gene Expression transcription splicing methylation protein binding (ChIP-seq) Genomics and Genetics de novo assembly DNA copy number (CNV and tumors) germline variant analysis tumor variant analysis SNPs, indels, translocation analysis of tumor clonality

Challenges requires statistical sophistication in study design in interpretation many data points 1,000 to 1,000,000 measurements per sample many false positives which look like great stories software becomes part of the experiment divide between engineering, biology culture & thinking David Quigley dquigley@cc.ucsf.edu

Wet lab and quantitative skills: better job prospects Schillebeeckx Nature Biotech. 2013 David Quigley dquigley@cc.ucsf.edu

What approachs are used to analyze quantitative data?

Chosing a tool Cost Learning curve Ease of use Flexibility (closed to open-ended) Software ecosystem (none to extensive) David Quigley dquigley@cc.ucsf.edu

Traditional programming languages Python, C++, Java, others can solve any computable problem creates the fastest tools free requires programming expertise complex to write and test  high effort David Quigley dquigley@cc.ucsf.edu

Specialized single-purpose programs command line tools academic research type commands at a prompt or run scripts PLINK, bowtie, GATK, bedtools GUI (point and click) commercial software for a vendor’s platform slick, opaque, hard/impossible to automate David Quigley dquigley@cc.ucsf.edu

Commercial statistics programs STATA, SPSS, GraphPad, others 1) Load one dataset 2) Select analysis by clicking on a GUI 3) Generate a report May have a built-in language Very mature tools for traditional biostatistics Not free David Quigley dquigley@cc.ucsf.edu

Web-based tools Galaxy string together pre-defined analysis steps very easy to use reproducible David Quigley dquigley@cc.ucsf.edu

R: a “software environment” Using R is like writing and using software Traditionally, biologists did not do this. R was written by statisticians to be a free replica of another language called “S”. David Quigley dquigley@cc.ucsf.edu

Why is R popular? Flexible, open-ended, open-source Large library of packages package: easy-to use published methods like a Qiagen kit Free! David Quigley dquigley@cc.ucsf.edu

You use R by typing at the prompt There is no pull-down menu of statistical commands David Quigley dquigley@cc.ucsf.edu

What’s good about this approach? chain analyses work with multiple datasets use packages of code easy to reproduce runs on anything makes sense to computer programmers David Quigley dquigley@cc.ucsf.edu

What’s hard about this approach? hard to get started cryptic commands built-in help is amusingly unhelpful David Quigley dquigley@cc.ucsf.edu

packages: collections of R functions collection of R code that solves a specific task limma: microarray normalization and analysis samr: differential expression impute: dealing with missing data downloaded for free from a central repository David Quigley dquigley@cc.ucsf.edu

bioconductor Curated collection of R packages Microarrays, aCGH, sequence analysis, advanced statistics, graphics, lots more bioconductor.org David Quigley dquigley@cc.ucsf.edu

Learning R data types by comparing them to Excel spreadsheets

Comparing Excel and R Excel Easy tasks are easy non-trivial tasks impossible or expensive No paper trail Mangles gene names Plots look terrible David Quigley dquigley@cc.ucsf.edu

Comparing Excel and R Excel Easy tasks are easy non-trivial tasks impossible or expensive No paper trail Mangles gene names Plots look terrible R Easy jobs are hard at first Non-trivial things are possible Easy to make a paper trail Biostatistics researchers publish tools in R Can create publication-ready plots David Quigley dquigley@cc.ucsf.edu

Organizing data in Excel Each subject has a row. Each column has a feature of your subjects. David Quigley dquigley@cc.ucsf.edu

R calls the data points variables variables numbers and characters (letters, words) numbers: 2.6, 4 characters: “Flopsy”, “white, brown paws” David Quigley dquigley@cc.ucsf.edu

R calls the columns vectors vectors ordered collections of a variable name: [“Flopsy”, “Mopsy”, “Cottontail”, “Peter”] age: [2.5, 2.6, 2.5, 4] David Quigley dquigley@cc.ucsf.edu

R calls the data set a data frame data frame a list of vectors (columns) that have names elements can be read and written by row & column David Quigley dquigley@cc.ucsf.edu

I can slice and dice the data frame David Quigley dquigley@cc.ucsf.edu

Tell R to do things using functions function_name( details about how to do it ) generate sequence from 1 to 5 counting by 0.5 parameters for seq are named from, to, and by David Quigley dquigley@cc.ucsf.edu

Tell R to do things using functions function_name( details about how to do it ) report the mean of my.data. Result of one function is fed into another one. David Quigley dquigley@cc.ucsf.edu

Tell R to do things using functions function_name( details about how to do it ) define a new function that adds 2 to whatever it’s passed compare to original value of my.data David Quigley dquigley@cc.ucsf.edu

Walk-through a straightforward analysis

Primary data from METABRIC study gene expression TP53sequence 1,400 samples from 5 hospitals Is there an association between breast cancer subtype and TP53 mutation? David Quigley dquigley@cc.ucsf.edu

Tasks Normalize data batch effects unwanted inter-sample variation Identify outliers associations between p53 and subtype David Quigley dquigley@cc.ucsf.edu

Quantile Normalization (limma) Force every array to have the same distribution of expression intensities > library(limma) > raw = read.table('raw_extract.txt’, ...) > raw.normalized = normalize.quantiles( raw ) > normalized = log2( raw.normalized ) David Quigley dquigley@cc.ucsf.edu

Identify batch effects in microarrays Principle Components Analysis Identify strongest variation in a matrix gene 1 gene 2 David Quigley dquigley@cc.ucsf.edu

Identify batch effects in microarrays Principle Components Analysis Identify axes of maximal variation in a matrix gene 1 first principle component gene 2 David Quigley dquigley@cc.ucsf.edu

Identify batch effects in microarrays Principle Components Analysis Identify strongest variation in a matrix group A group B gene 1 gene 1 gene 2 gene 2 David Quigley dquigley@cc.ucsf.edu

PCA of identifies a batch effect second principle component hospital 3 (yellow) first principle component > my.pca = prcmp( t( expression.data ) ) > plot( my.pca, ... ) David Quigley dquigley@cc.ucsf.edu

batch correction reduces bias (ComBat) ComBat package reduces user-defined batch effects second principle component first principle component David Quigley dquigley@cc.ucsf.edu

Molecular subtypes of breast carcinoma, defined by gene expression ER status Luminal A N=507 Luminal B N=379 Her2 N=161 Basal N=234 > sa = read.table(‘patients.txt’, ...) > tumor.counts = table( sa$ER.status, sa$PAM50Subtype) (convert counts to percentages) > barplot( c( tumor.counts[1], tumor.counts[2] ), col=c(“red”,”green”), ... ) David Quigley dquigley@cc.ucsf.edu

Find interactions: TP53 and subtype Fit a linear model: > fitted.model = lm( dependent ~ independent ) Perform Analysis of Variance: > anova( fitted.model ) general form of my analysis: > anova( lm( gene.expression ~ PAM * TP53 ) 18,000 genes PAM: {LumA, LumB, Her2, Basal} TP53: {mutant, WT} David Quigley dquigley@cc.ucsf.edu

Automate with loops Calculate anova for 18,000 genes by looping through each gene and storing result. > n_genes = 18000 > result = rep( 0, n_genes ) > for( counter in 1:n_genes ){ result[counter] = anova(...) } sort results identify significant interaction repeat 18,000 times David Quigley dquigley@cc.ucsf.edu

Immune infiltration in TP53-WT Basal CD3E Does p53 have a role in immune surveillance? log2 expression log2 expression absent mild severe David Quigley dquigley@cc.ucsf.edu infiltration

Next steps: getting help and learning more

online forums: expert help for free biostars.org all of bioinformatics David Quigley dquigley@cc.ucsf.edu

online forums: expert help for free seqanswers.com biostars.org Nextgen sequencing all of bioinformatics David Quigley dquigley@cc.ucsf.edu

online forums: expert help for free seqanswers.com biostars.org Nextgen sequencing all of bioinformatics stats.stackexchange.com statistics David Quigley dquigley@cc.ucsf.edu

UCSF resources Library classes and information Formal courses (BMI, Biostatistics) Cores (Computational Biology, Genomics) QGDG monthly methods discussion group David Quigley dquigley@cc.ucsf.edu

Online classes and blogs Free courses on data analysis http://jhudatascience.org simplystatistics.org Coursera etc... Good tutorials on sequence analysis http://evomics.org/learning David Quigley dquigley@cc.ucsf.edu

An introduction to quantitative biology and R

An introduction to quantitative biology and R

Presentation Transcript

An Introduction to R

An Introduction to Molecular Biology

An Introduction to Molecular Biology

An Introduction to Networks in Biology

An Introduction to R

An Introduction to R graphics

An introduction to R

An introduction to R

An Introduction to R

An Introduction to R

An introduction to R (cont)

An Introduction to R

An Introduction to R:

An Introduction to R

An introduction to HKDSE Biology

An Introduction to R graphics

An Introduction to ACT-R

An Introduction to Quantitative Marketing

An Introduction to Quantitative Research Peter Harper

An Introduction to Cancer Biology

An introduction to R

An introduction to R