Cellbased assays and the genome - data analysis and modeling of genetic networks

Wolfgang Huber, European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton/Cambridgeshire

15 Feb 2001: "The human genome is sequenced"

But what does the sequence do? >gi|22046029|ref|NT_029998.5|Hs7_30253 Homo sapiens chromosome 7 reference genomic contig GATCTTATCTATCATGTTCACCTCCCAAGAGGTGAACATATCCCCCAAAGCCTGATAGAGAGAAGATGCTCATTAATATTTAATGCATGACCATGTGCAGACTTGGGAGGAAAAATATGCCTCAGCCTATCAATATTGGACCTTAATAAACAAGGATGTTTCTGCATCATTTCCCCACAACACCGAACAAGTGTGGCTCACTGTGGATGTTTAAGCAAATGCATTGTTTTTCCAGTTATATATCTGGTAGAGATGAGGCCATTGATAGGAATGGGAAGACGATCTCCTTTTATTTTGATGACCCAGCATGGCTGAACACTCAGTGACTACCACTGCACTTTGTTGTACTTTCAGCATTAGAGATGCCAGCCCTGTAGGATATAAAACAGGAACATCTAGTCCTCAATTATATTCAGAATTACTCAAGTCTTAGAAGCACCACTTGTCTTTTTTCAAGGGAGAGAAATGCTCAAGTGATGGGCTGAAGTGAAGGGAGGGAGTCACTCACTTGAACGGTTCCCTTAGGCTGTGTGGATGCAAACAGCATTAGACAATGACACTGACAGTGGGAAATGCACTGGAGACGATGACTGGCAAAGCCCTCCTTTTCTCCCCATCCACTATAGATACTGACAGCAAAGGGTTTGTCACAATGACAACTATACACTCCCAATATCACAGAAGAAGGAGGAATAAAAGGGTATATTATGAGTGACTGAAGTTTAGAATAAATTAATAAATATTATGTCCCTCATCCATAGAAACCACAAAGGTCTAGTAAGGCTAAGGATATAACAAGAAAATAATATGAATATTTGCTTCCCCTTCCTAGTGTAATAGAGTAAGTTACAAATGGCTTCAGGAAGGGGAGAGAGGAAGAAGAGTGGATGAGATACGTAAGAGTGCTTGAGGGCTAATTTTATGAAAGCTTTGGGAAGTTTTAAGAAAAAGAAAAGCTATTTTTCAAGGTACATGTGTGTATGCGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGAAAGACAGAAGAAAGAGGGAGACCTAAGAAGACTATGAGACACTAAGAGAAAAATTAAGGTAAAAAAGACACACACTTAGAAAAACACACATAGGGAGGAGGGAGGAGGTTAAGACATTTTACTATGTGCTGTGAATGGAAACTACAAACCATTTTTGATATATGCAATATATATACATATATACACACATATACATATGTATTTAAATATTTAAATTACATTTTCTCTTTTTTTAGAGATATGGTTTCACTATGTCACTCTGCCCAGGCTGCAGTACAGTGGTTGTTCACAGTCATGATCATAGCACATTATAGCCTTGAACTCCTGGGCTCAAGCAACCCTCCTGTATTAGTCTCCCCAGTAGTTGGGATTACTAGCATATGCCACCATGTCCACCTTTATGCTTTTTAAAGTGAAAAACCATACTAAGAATGAGGCAGCTCAACTTAATAATAAAAACATTTCAAATGTAAAGAAATTTACAAAAGAAAAACAATCAACCCCATTAAAATTGGGCAAAGGGAATGAACAGACACTTTTCAAAAGAATACATGCATGCAGCCAACAAACATACAAAAAAAAAGTTCAACATCACTGATCATTAGAGAAATGCAAATCAAAACCATAATGAGATACCATCTCACACCAGTCAGAATAGCTATCATTAAAAAGTCAAAAAATAACAGATGCTAGTGAGGCTATGGAGAAAAGGGAATGCTTATACACTGTTGTTGGGTGTGCAAATCAGTTCAATCATTGTGCAAGGAAAGTGATTCCTCAAAGAGCTAAAAGCAGAGCTACCATTCGACCCAGTAATCCCACTACTGGGTATATACCCAGATGAATATAAA
CCATTCTACCATAAAGACACATGCATACAAATGTTCATTGCAGCACTGTTCACAATAGCAAAAGTATGGGATCAACCTAATGCCCATCAATGACAGATTGGATAAAGAAAATGTGGTACATATACACCATGGAATACTATGCCGCCATTAAAAATGATATCATGTCTTTTGCTGGAATATGGATGGACCTTCTATTATCCTTAGCAAACTAATGCAGGAACAGAAAACCAAATATAGCATACTCTCAGTTATAAGTGGGAGCTAAA

Transcription and translation
DNA (transcription) → mRNA (translation) → Protein → Molecular network → Organism

Gene expression matters

Functional Genomics: "The Human Transcriptome Project"
Goals:
o experimentally identify all transcripts
o characterize their "function"
  - many levels of detail
  - contingent on what is experimentally accessible
o ultimately: mathematical models with explanatory power, involving genes, transcripts, proteins ("systems biology")

Microarrays
Samples: mRNA from tissue biopsies, cell lines
Arrays: probes = gene-specific DNA strands
Readout: fluorescent detection of the amount of sample-probe binding

Gene    tissue A  tissue B  tissue C
ErbB2   0.02      1.12      2.12
VIM     1.1       5.8       1.8
ALDH4   2.2       0.6       1.0
CASP4   0.01      0.72      0.12
LAMA4   1.32      1.67      0.67
MCAM    4.2       2.93      3.31

DNA microarrays and cancer
Cancer: somatic cells mutate to become anti-social, proliferate excessively, eventually kill the parent organism
Current cancer taxonomies:
o affected organ
o cell type of origin
o apparent grade of de-differentiation
Goals:
o molecular taxonomy: more precise, causal
o molecular diagnosis: better estimation of risk and thus treatment strategy
o molecular therapy (new, more specific drugs)

log-ratio which genes are differentially transcribed? same-same tumor-normal

Statistics 101: bias ↔ accuracy, variance ↔ precision
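A small sketch of these two notions, using made-up repeated measurements of a known true value (all numbers are hypothetical). It also checks the standard identity that mean squared error equals squared bias plus variance.

```python
from statistics import mean, pvariance


def bias_and_variance(measurements, true_value):
    """Bias is the mean error (accuracy); variance is the spread (precision)."""
    bias = mean(measurements) - true_value
    variance = pvariance(measurements)
    return bias, variance


def mean_squared_error(measurements, true_value):
    """Average squared deviation from the true value."""
    return mean([(x - true_value) ** 2 for x in measurements])


true_value = 10.0
precise_but_biased = [12.0, 12.1, 11.9, 12.0]  # tight, but off target
accurate_but_noisy = [8.0, 12.0, 9.0, 11.0]    # on target, but scattered

for name, data in [("precise but biased", precise_but_biased),
                   ("accurate but noisy", accurate_but_noisy)]:
    bias, variance = bias_and_variance(data, true_value)
    mse = mean_squared_error(data, true_value)
    print(f"{name}: bias={bias:.2f} variance={variance:.3f} mse={mse:.3f}")
```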

Basic dogma of data analysis: One can always increase sensitivity at the cost of specificity, or vice versa; the art is to find the best trade-off. (It can also be possible to increase both by a better choice of method / model.)
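The trade-off can be sketched by moving a decision threshold over a set of scores. The scores and labels below are hypothetical toy values, and the function name is an assumption made for illustration.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Call an item positive when its score is >= threshold.

    Returns (sensitivity, specificity) for the given true labels.
    """
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y and s >= threshold)
    fn = sum(1 for s, y in pairs if y and s < threshold)
    tn = sum(1 for s, y in pairs if not y and s < threshold)
    fp = sum(1 for s, y in pairs if not y and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical scores; True marks a truly differentially expressed gene.
scores = [0.1, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
labels = [False, False, True, False, True, False, True, True]

for threshold in (0.2, 0.45, 0.75):
    sens, spec = sensitivity_specificity(scores, labels, threshold)
    print(f"threshold={threshold}: sensitivity={sens:.2f} specificity={spec:.2f}")
```

Raising the threshold trades sensitivity for specificity. Improving both at once requires scores that separate the two groups better, which is the role of the method or model.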

Raw data are not mRNA concentrations The problem is less that these steps are ‘not perfect’; it is that they may vary from array to array, experiment to experiment.

Statistical error modeling approach
o Parameters a, b, s1, s2
o Maximum-Likelihood
o Robustification a la "Least Quantile Sum of Squares"
o Variance stabilizing transformation
Huber et al. Bioinformatics (2002), SAGMB (2003)
Software package vsn (Bioconductor project / LGPL)
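A minimal sketch of what a variance stabilizing transformation of this kind can look like, assuming the arsinh form h(y) = arsinh((y - a) / b) with a as a background offset and b as a scale factor. The interpretation of a and b, the parameter values and the function name are assumptions for illustration. This is not the vsn package's interface, and the maximum-likelihood estimation of the parameters is not shown.

```python
import math


def arsinh_transform(y, a, b):
    """Variance-stabilizing transform h(y) = arsinh((y - a) / b).

    a: assumed additive background offset
    b: assumed scale factor
    """
    return math.asinh((y - a) / b)


a, b = 100.0, 50.0  # hypothetical calibration parameters

for y in (100.0, 150.0, 1_000.0, 100_000.0):
    h = arsinh_transform(y, a, b)
    print(f"y={y:>9.1f}  h(y)={h:.4f}")

# For large intensities the transform behaves like a shifted logarithm,
# h(y) ~ log(2 * (y - a) / b); near the background it is roughly linear.
large = 1_000_000.0
approx = math.log(2 * (large - a) / b)
print(abs(arsinh_transform(large, a, b) - approx))
```

The point of the sketch is the shape of the transform: it avoids the divergence of the plain logarithm at low intensities while matching it, up to an additive constant, at high intensities.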

Evaluation: a benchmark for Affymetrix genechip expression measures
o Data:
  - Spike-in series: from Affymetrix, 59 x HGU95A, 16 genes, 14 concentrations, complex background
  - Dilution series: from GeneLogic, 60 x HGU95Av2, liver & CNS cRNA in different proportions and amounts
o Benchmark: 15 quality measures regarding
  - reproducibility
  - sensitivity
  - specificity
Put together by Rafael Irizarry (Johns Hopkins)
http://affycomp.biostat.jhsph.edu

 ROC curves See Affycomp website at Johns Hopkins for full details
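For readers unfamiliar with ROC curves, here is a minimal sketch of how one is traced and summarized by its area under the curve (AUC). The scores and labels are hypothetical toy values with no tied scores. This is not the affycomp procedure.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points.

    The threshold is lowered one item at a time; assumes no tied scores.
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_positive in sorted(zip(scores, labels), reverse=True):
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.1, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
labels = [False, False, True, False, True, False, True, True]

auc = area_under_curve(roc_points(scores, labels))
print(f"AUC = {auc:.4f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of true positives from true negatives.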

Conclusion for this part
o Around 2002, microarray data had come to be perceived as very noisy and unreliable.
o Some of this was caused by immature hardware, but a lot by naïve, ill-guided processing software (incl. that from the instrument vendor).
o Proper statistical analysis by academic researchers has improved sensitivity and selectivity dramatically.
o Open, academic discussion (incl. software sources) was essential.

Candidate gene sets from microarray studies: dozens…hundreds
Capacity of detailed in-vivo functional studies: one…few
How to close the gap?

Drowning by numbers
How to separate a flood of ‘significant’ secondary effects from causally relevant ones?
VHL: tumor suppressor with “gatekeeper” role in kidney cancers
Boer et al. Genome Res. 2001: kidney tumor/normal profiling study

Drowning by numbers Boer et al. Genome Res. 2001

Tumor profiling
Gene lists are great as features for classification …but what about the underlying biology?
Limitations:
o Multivariate (regression, prediction): p ≫ n, overfitting, regularization (…arbitrary)
o One gene at a time: multiple testing, loss of power
o Cohort studies provide associations, not causal interactions
o De novo discovery of a gene/protein’s role is hard
Strategies:
1. Clever experimental design: ask directed questions
2. Deeper analysis: use all available “meta”data (GO, KEGG, Ensembl, Entrez, …)
3. Deeper experiments: new technologies such as protein interaction assays, protein profiling, cell-based functional assays
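The loss of power from one-gene-at-a-time testing can be illustrated with two standard multiple testing corrections, Bonferroni and Benjamini-Hochberg. The slides do not name a specific procedure, so the choice of these two, the function names and the p-values below are assumptions for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject hypothesis i when p_i <= alpha / m (controls family-wise error)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with
    p_(k) <= alpha * k / m (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for i in order[:cutoff_rank]:
        rejected[i] = True
    return rejected


# Hypothetical per-gene p-values.
p_values = [0.001, 0.008, 0.012, 0.03, 0.2, 0.5, 0.7, 0.9]

print("uncorrected:", sum(p <= 0.05 for p in p_values))
print("Bonferroni: ", sum(bonferroni(p_values)))
print("BH:         ", sum(benjamini_hochberg(p_values)))
```

With thousands of genes on an array, the per-gene threshold shrinks accordingly, so moderate but real effects are easily lost.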

Bioconductor: an open source and open development software project for the analysis of biomedical and genomic data.
Started in the fall of 2001 by Robert Gentleman (then at Harvard); now includes 23 core developers in the US, Europe, and Australia.
R and the R package system are used to design and distribute software.
Initial focus on microarray statistics; now also proteomics, cell-based assays, bioinformatic metadata, graph-theoretic methods.

Bioconductor
o Strict 6-monthly release cycle, starting with about 15 packages
o 1.0 in March 2003, now at 1.5 with ca. 100 packages
o Ca. 2000 downloads in 4 weeks after release 1.5
o Aggressive development, no backward compatibility
o Packages vary in their maturity: software ecosystem

Core contributors
Ben Bolstad, UC Berkeley
Andreas Buness, DKFZ Heidelberg
Vince Carey, Biostatistics, Harvard
Sandrine Dudoit, Biostatistics, UC Berkeley
Jane Fridlyand, UCSF
Laurent Gautier, Technical University of Denmark
Jeff Gentry, Dana-Farber Cancer Institute
Rafael Irizarry, Biostatistics, Johns Hopkins
Denise Scholtens, Dana-Farber Cancer Institute
Gordon Smyth, WEHI
Yee Hwa (Jean) Yang, Biostatistics, UCSF
Jianhua (John) Zhang, Dana-Farber Cancer Institute

Goals
o Provide access to powerful statistical and graphical methods for the analysis of genomic data
o Facilitate the integration of biological metadata (GenBank, GO, LocusLink, PubMed) in the analysis of experimental data
o Allow the rapid development of extensible, interoperable, and scalable software
o Provide high-quality documentation
o Promote reproducible research
o Provide training in computational and statistical methods

Tools in Bioconductor
Main platform: R
But we also use many other tools:
• PostgreSQL
• Perl
• Python
• Java
• graphviz
• Boost Graph Library (BGL)
• libxml
• MAGEstk
• tcl/tk, Gtk
Philosophy: don’t reinvent the wheel

Component software Most interesting problems will require the coordinated application of many different techniques. Thus we need integrated interoperable software. Don’t think your method is the end of it all. Design your piece to be a cog in a big machine.  Software modules with standardized I/O instead of stand-alone applications Web service instead of web site

Why are we Open Source?
o so that you can find out what algorithm is being used, and how it is being used
o so that you can modify these algorithms to try out new ideas or to accommodate local conditions or needs
o so that they can be used as components
Transparency
Pursuit of reproducibility
Efficiency of development
Training

Good scientific software is like a good scientific publication
o Reproducibility
o Peer-review
o Easy accessibility by other researchers, society
o Build on the work of others
o Others will build their work on top of it
o Commercialize those developments that are successful and have a market

Commercial distributions We coordinate with Insightful to help provide ArrayAnalyzer (which contains many Bioconductor packages and resources) Bioconductor is also interfaced by Genespring, Spotfire, ExpressionNTI Our software remains free, but some users are willing to pay money for professional support (hotlines, handbooks, stricter enforcement of uniform user interfaces,…) Win-win arrangement

Bioconductor packages
Release 1.5, Nov 2004, ca. 100 packages
• General infrastructure: Biobase, DynDoc, reposTools, ruuid, tkWidgets, widgetTools, BioStrings
• Annotation: annotate, AnnBuilder & data packages
• Graphics: geneplotter, hexbin
• Pre-processing Affymetrix oligonucleotide chip data: affy, affycomp, affydata, makecdfenv, vsn, gcrma
• Pre-processing two-color spotted DNA microarray data: marray, vsn, arrayMagic, arrayQuality
• Differential gene expression: edd, genefilter, limma, multtest, ROC, siggenes, EBArrays, factDesign
• Graphs and networks: graph, RBGL, Rgraphviz
• Other data: SAGElyzer, DNAcopy, PROcess, aCGH, prada
N.B. Many new packages in the Bioconductor development version.

Classification, class prediction, machine learning: Predict outcome on basis of past observations of some explanatory variables (features) Outcome: E.g. tumor class, type of bacterial infection, response to treatment, survival Features: gene expression measures, covariates such as age, sex
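As a minimal illustration of class prediction, the sketch below classifies a new sample by majority vote among its k nearest neighbours in feature space. The two-gene expression values and the class labels are hypothetical, and the code is a plain Python illustration, not an R package interface.

```python
import math
from collections import Counter


def knn_predict(train_features, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among the k
    training samples closest to it in Euclidean distance."""
    distances = sorted(
        (math.dist(x, query), label)
        for x, label in zip(train_features, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical expression measures of two genes per sample.
train_features = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9),
                  (3.0, 3.2), (3.1, 2.9), (2.9, 3.0)]
train_labels = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

print(knn_predict(train_features, train_labels, (1.0, 1.0)))
print(knn_predict(train_features, train_labels, (3.0, 3.0)))
```

In practice the features would be many genes plus covariates, and prediction error would have to be estimated on samples not used for training.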

R class prediction packages
o class: k-nearest neighbor (knn), learning vector quantization (lvq)
o classPP: projection pursuit
o e1071: support vector machine (svm)
o MASS: linear and quadratic discriminant analysis (lda, qda)
o sma: diagonal linear and quadratic discriminant analysis, naïve Bayes
o nnet: feed-forward neural networks and multinomial log-linear models
o rpart: classification and regression trees
o knnTree: k-nn classification with variable selection inside leaves of a tree
o randomForest: random forests
o LogitBoost: boosting for tree stumps
o ipred: bagging, resampling based estimation of prediction error
o mlbench: machine learning benchmark problems
o pamr: prediction analysis for microarrays
o gpls: generalized partial least squares

graph, RBGL, Rgraphviz
o graph: basic class definitions and functionality
o RBGL: graph algorithms (e.g. shortest path, connectivity)
o Rgraphviz: rendering. Different layout algorithms. Seamlessly combinable with R graphics.

 Algorithms Shortest paths (edge weights: 1, positive, real) Connectivity (strong, weak) Graph traversal Minimal spanning tree
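To make the first item concrete, here is a minimal sketch of single-source shortest paths with positive edge weights (Dijkstra's algorithm). It is a plain Python illustration with a hypothetical toy graph, not the RBGL interface.

```python
import heapq


def shortest_path_lengths(graph, source):
    """Dijkstra's algorithm for non-negative edge weights.

    graph: dict mapping node -> dict of neighbor -> edge weight
    Returns a dict of shortest distances from `source` to reachable nodes.
    """
    dist = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry; a shorter path was already found
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return dist


# Hypothetical directed graph with positive real edge weights.
graph = {
    "A": {"B": 1.0, "C": 4.0},
    "B": {"C": 2.0, "D": 5.0},
    "C": {"D": 1.0},
    "D": {},
}

print(shortest_path_lengths(graph, "A"))
```

With all edge weights equal to 1 the same result can be obtained by breadth-first search, which is why the edge weight type matters for the choice of algorithm.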

Probabilistic tree model DNA aberrations in cancer (matrix CGH)

domain combination graph Apic, Huber, Teichmann, J. Struct. Fct. Genomics (2003)

Association vs Interference
Microarrays:
o Input: a few phenotypes (e.g. disease) or treatments
o Readout: genome-wide transcription levels
o Mostly association and correlation
o Ensemble average over many cells / tissues
o Marriage of pathology and genomics
Cellular assays:
o Input: perturbation of protein levels or activities, genome-scale
o Output: phenotypes
o Causality
o Individual cells (microscopy / FACS)
o Marriage of cell biology and genomics

Interference/Perturbation tools
o RNAi: + genome wide; o specificity; - efficiency / monitoring?
o Transfection (expression): + 100% specific; + monitoring; - library size
o Small compounds
o …

Monitoring tools
o Plate reader: 96 or 384 well, 1…4 measurements per well
o FACS: ca. 2000 x 4…8 measurements per well
o Automated microscopy: practically unlimited. Here: 30 x 1280 x 1024 x 3
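The difference in data volume per well can be worked out directly. The sketch below assumes that "30 x 1280 x 1024 x 3" means 30 images of 1280 x 1024 pixels in 3 channels; this reading of the factors is an assumption, since the slide does not label them.

```python
# Plate reader: 1 to 4 measurements per well.
plate_reader_values = (1, 4)

# FACS: ca. 2000 cells, each with 4 to 8 measurements.
facs_values = (2000 * 4, 2000 * 8)

# Automated microscopy (assumed reading): images x width x height x channels.
images_per_well = 30
width, height, channels = 1280, 1024, 3
microscopy_values = images_per_well * width * height * channels

print("plate reader:", plate_reader_values)
print("FACS:        ", facs_values)
print("microscopy:  ", microscopy_values)
```

Under this reading a single well yields about 118 million raw pixel values, several orders of magnitude more than the other two readouts, which is why automated image analysis is needed before any statistics.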

Full-length cDNAs
Sequence AND physical clone with complete and non-interrupted protein coding region of a gene

Collection          cDNAs
FLJ/IMS UTokyo      12,560
MGC/NIH (*)         11,414
FLJ/HRI (Kazusa)     8,057
DKFZ&MIPS (*)        5,521
FLJ/KDRI             2,000
others               1,096
unique              21,037
(*) publicly available

http://www.h-invitational.jp
Imanishi et al. (2004). Integrative annotation of 21,037 human genes validated by full-length cDNA clones. PLoS Biol. 2(6)
FLJ: Full-Length Long Japan, IMS: Institute for Medical Science, HRI: Helix Research Institute, MGC: Mammalian Gene Collection

Strategy
21,000+ human cDNAs (genes)
Microarray study (humans, in vivo): O(100) disease-relevant 'novel' genes
HT functional assay (cell culture): ‘hot’ candidates

assays to challenge the cell-cycle and beyond  Other assays: - Protein localisation - Protein interaction

HT functional assays
Figure labels: cDNA library (>100 clones); expression clones; GFP-ORF protein; BrdU incorporation; automated microscope; data analysis
Readouts: DAPI: identification; CFP: expression; BrdU: proliferation (effect on proliferation)

Detection of DNA replication (S-phase)
Quantification of BrdU incorporation by specific antibodies (figure labels: BrdU, Cy5)
Dorit Arlt

Liquid handling: transfection & antibody incubation
Figure labels: sterile housing; transfection reagents; buffers & antibodies; chamber slides; expression plasmids; transfection mixing plate
Christian Schmitt

 High Content Screening Microscope - Cube - Urban Liebel EMBL

Automated image analysis
Channels: DAPI, YFP, Cy5 (anti-BrdU)
Example readout (YFP channel, Cy5 channel): ... 621 258 2101 144 1732 401 1183 120 493 219 66 297 232 421 182 120 286 332 ...
Urban Liebel, EMBL

raw data

Distribution of expression intensities (YFP)
Axes: signal intensity (expression) vs. frequency (# of cells)

Proliferation assay: within-well analysis
Panels plotted against ORF expression
