Cell-based assays and the genome - data analysis and modeling of genetic networks. Wolfgang Huber, European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton/Cambridgeshire. 15 Feb 2001: "The human genome is sequenced". But what does the sequence do?
But what does the sequence do? >gi|22046029|ref|NT_029998.5|Hs7_30253 Homo sapiens chromosome 7 reference genomic contig GATCTTATCTATCATGTTCACCTCCCAAGAGGTGAACATATCCCCCAAAGCCTGATAGAGAGAAGATGCTCATTAATATTTAATGCATGACCATGTGCAGACTTGGGAGGAAAAATATGCCTCAGCCTATCAATATTGGACCTTAATAAACAAGGATGTTTCTGCATCATTTCCCCACAACACCGAACAAGTGTGGCTCACTGTGGATGTTTAAGCAAATGCATTGTTTTTCCAGTTATATATCTGGTAGAGATGAGGCCATTGATAGGAATGGGAAGACGATCTCCTTTTATTTTGATGACCCAGCATGGCTGAACACTCAGTGACTACCACTGCACTTTGTTGTACTTTCAGCATTAGAGATGCCAGCCCTGTAGGATATAAAACAGGAACATCTAGTCCTCAATTATATTCAGAATTACTCAAGTCTTAGAAGCACCACTTGTCTTTTTTCAAGGGAGAGAAATGCTCAAGTGATGGGCTGAAGTGAAGGGAGGGAGTCACTCACTTGAACGGTTCCCTTAGGCTGTGTGGATGCAAACAGCATTAGACAATGACACTGACAGTGGGAAATGCACTGGAGACGATGACTGGCAAAGCCCTCCTTTTCTCCCCATCCACTATAGATACTGACAGCAAAGGGTTTGTCACAATGACAACTATACACTCCCAATATCACAGAAGAAGGAGGAATAAAAGGGTATATTATGAGTGACTGAAGTTTAGAATAAATTAATAAATATTATGTCCCTCATCCATAGAAACCACAAAGGTCTAGTAAGGCTAAGGATATAACAAGAAAATAATATGAATATTTGCTTCCCCTTCCTAGTGTAATAGAGTAAGTTACAAATGGCTTCAGGAAGGGGAGAGAGGAAGAAGAGTGGATGAGATACGTAAGAGTGCTTGAGGGCTAATTTTATGAAAGCTTTGGGAAGTTTTAAGAAAAAGAAAAGCTATTTTTCAAGGTACATGTGTGTATGCGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGAAAGACAGAAGAAAGAGGGAGACCTAAGAAGACTATGAGACACTAAGAGAAAAATTAAGGTAAAAAAGACACACACTTAGAAAAACACACATAGGGAGGAGGGAGGAGGTTAAGACATTTTACTATGTGCTGTGAATGGAAACTACAAACCATTTTTGATATATGCAATATATATACATATATACACACATATACATATGTATTTAAATATTTAAATTACATTTTCTCTTTTTTTAGAGATATGGTTTCACTATGTCACTCTGCCCAGGCTGCAGTACAGTGGTTGTTCACAGTCATGATCATAGCACATTATAGCCTTGAACTCCTGGGCTCAAGCAACCCTCCTGTATTAGTCTCCCCAGTAGTTGGGATTACTAGCATATGCCACCATGTCCACCTTTATGCTTTTTAAAGTGAAAAACCATACTAAGAATGAGGCAGCTCAACTTAATAATAAAAACATTTCAAATGTAAAGAAATTTACAAAAGAAAAACAATCAACCCCATTAAAATTGGGCAAAGGGAATGAACAGACACTTTTCAAAAGAATACATGCATGCAGCCAACAAACATACAAAAAAAAAGTTCAACATCACTGATCATTAGAGAAATGCAAATCAAAACCATAATGAGATACCATCTCACACCAGTCAGAATAGCTATCATTAAAAAGTCAAAAAATAACAGATGCTAGTGAGGCTATGGAGAAAAGGGAATGCTTATACACTGTTGTTGGGTGTGCAAATCAGTTCAATCATTGTGCAAGGAAAGTGATTCCTCAAAGAGCTAAAAGCAGAGCTACCATTCGACCCAGTAATCCCACTACTGGGTATATACCCAGATGAATATAAA
CCATTCTACCATAAAGACACATGCATACAAATGTTCATTGCAGCACTGTTCACAATAGCAAAAGTATGGGATCAACCTAATGCCCATCAATGACAGATTGGATAAAGAAAATGTGGTACATATACACCATGGAATACTATGCCGCCATTAAAAATGATATCATGTCTTTTGCTGGAATATGGATGGACCTTCTATTATCCTTAGCAAACTAATGCAGGAACAGAAAACCAAATATAGCATACTCTCAGTTATAAGTGGGAGCTAAA
Transcription and translation
DNA → (transcription) → mRNA → (translation) → Protein → Molecular network → Organism
Functional Genomics: "The Human Transcriptome Project"
• Goals:
o experimentally identify all transcripts
o characterize their "function"
- many levels of detail
- contingent on what is experimentally accessible
o ultimately: mathematical models with explanatory power, involving genes, transcripts, proteins ("systems biology")
microarrays
samples: mRNA from tissue biopsies, cell lines
arrays: probes = gene-specific DNA strands
fluorescent detection of the amount of sample-probe binding

gene     tissue A   tissue B   tissue C
ErbB2    0.02       1.12       2.12
VIM      1.1        5.8        1.8
ALDH4    2.2        0.6        1.0
CASP4    0.01       0.72       0.12
LAMA4    1.32       1.67       0.67
MCAM     4.2        2.93       3.31
DNA microarrays and cancer
Cancer: somatic cells mutate to become anti-social, proliferate excessively, eventually kill the parent organism
Current cancer taxonomies:
o affected organ
o cell type of origin
o apparent grade of de-differentiation
Goals:
o molecular taxonomy: more precise, causal
o molecular diagnosis: better estimation of risk and thus treatment strategy
o molecular therapy (new, more specific drugs)
Which genes are differentially transcribed?
[Figure: log-ratio plots for same-same and tumor-normal comparisons]
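As a minimal sketch of the log-ratio idea, the Python snippet below computes log2 ratios between two samples for a few genes. The gene names and intensity values are hypothetical, chosen only for illustration; they are not data from the talk. A ratio of 1 maps to 0, and a four-fold increase or decrease maps to +2 or -2.

```python
import math

# Hypothetical intensities (arbitrary units) for three genes in two samples.
tumor = {"geneA": 800.0, "geneB": 100.0, "geneC": 250.0}
normal = {"geneA": 200.0, "geneB": 400.0, "geneC": 250.0}

def log_ratios(sample, reference):
    """Return log2(sample / reference) for every gene in `sample`."""
    return {g: math.log2(sample[g] / reference[g]) for g in sample}

M = log_ratios(tumor, normal)
for gene, m in M.items():
    print(f"{gene}: {m:+.2f}")
```

In a same-same comparison the log-ratios should scatter around 0, which gives a reference for judging how large a tumor-normal log-ratio has to be before it is interesting.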
Statistics 101: bias / accuracy; precision / variance
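To make the four terms concrete, here is a small Python sketch with two hypothetical series of repeated measurements of a known quantity. All numbers are invented for illustration. One series is precise but biased (low variance, far from the truth), the other is accurate on average but noisy.

```python
from statistics import mean, pvariance

TRUE_VALUE = 10.0  # hypothetical true quantity

# Hypothetical repeated measurements from two instruments.
precise_but_biased = [12.0, 12.1, 11.9, 12.0]  # low variance, large bias
accurate_but_noisy = [8.0, 12.0, 9.0, 11.0]    # no bias on average, high variance

def bias(measurements, truth):
    """Systematic error: mean of the measurements minus the true value."""
    return mean(measurements) - truth

for name, xs in [("precise_but_biased", precise_but_biased),
                 ("accurate_but_noisy", accurate_but_noisy)]:
    print(name, "bias =", round(bias(xs, TRUE_VALUE), 3),
          "variance =", round(pvariance(xs), 3))
```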
Basic dogma of data analysis: One can always increase sensitivity at the cost of specificity, or vice versa; the art is to find the best trade-off. (It can also be possible to increase both by a better choice of method / model.)
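The trade-off can be sketched in a few lines of Python. The scores below are hypothetical and the function name `sens_spec` is made up for this example. Moving the decision threshold down catches more true positives (higher sensitivity) but also lets more negatives through (lower specificity), and moving it up does the opposite.

```python
# Hypothetical scores: a higher score means stronger evidence for "positive".
positives = [0.9, 0.8, 0.6, 0.4]  # truly positive cases
negatives = [0.7, 0.5, 0.3, 0.1]  # truly negative cases

def sens_spec(threshold):
    """Call a case positive if score >= threshold; return (sensitivity, specificity)."""
    true_pos = sum(s >= threshold for s in positives)
    true_neg = sum(s < threshold for s in negatives)
    return true_pos / len(positives), true_neg / len(negatives)

for t in (0.2, 0.45, 0.75):
    sens, spec = sens_spec(t)
    print(f"threshold {t}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

Plotting sensitivity against 1 - specificity over all thresholds gives a ROC curve; a better method or model shifts the whole curve rather than moving along it.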
Raw data are not mRNA concentrations The problem is less that these steps are ‘not perfect’; it is that they may vary from array to array, experiment to experiment.
Statistical error modeling approach
o Parameters a, b, s1, s2
o Maximum-Likelihood
o Robustification a la "Least Quantile Sum of Squares"
o Variance stabilizing transformation
Huber et al. Bioinformatics (2002), SAGMB (2003)
Software package vsn (Bioconductor project / LGPL)
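The idea of a variance stabilizing transformation can be illustrated with a simulation. The sketch below is not the vsn package and does not estimate any parameters. It assumes an additive-multiplicative noise model in which s1 is the standard deviation of the additive noise and s2 that of the multiplicative noise; this reading of s1 and s2 is an assumption, the numerical values are hypothetical, and the calibration parameters a and b are left out. Under that assumption the variance grows roughly as s1² + s2²·u² with the mean u, and integrating 1/sqrt(s1² + s2²·u²) gives the arsinh-type transform used here.

```python
import math
import random
from statistics import pstdev

S1 = 10.0  # assumed sd of the additive noise (hypothetical value)
S2 = 0.1   # assumed sd of the multiplicative noise, log scale (hypothetical value)

def simulate(mu, n, rng):
    """Draw n intensities y = mu * exp(S2 * eta) + S1 * eps, eta and eps standard normal."""
    return [mu * math.exp(S2 * rng.gauss(0, 1)) + S1 * rng.gauss(0, 1)
            for _ in range(n)]

def h(y):
    """Transform that stabilizes a variance of the form S1^2 + S2^2 * mean^2."""
    return math.asinh(S2 * y / S1) / S2

rng = random.Random(1)
raw_sd, stab_sd = [], []
for mu in (0.0, 100.0, 10000.0):
    ys = simulate(mu, 20000, rng)
    raw_sd.append(pstdev(ys))
    stab_sd.append(pstdev([h(y) for y in ys]))
    print(f"mu={mu:>8}: sd(raw)={raw_sd[-1]:9.2f}  sd(transformed)={stab_sd[-1]:.3f}")
```

On the raw scale the standard deviation spans about two orders of magnitude across the three intensity levels, while on the transformed scale it stays close to constant, which is what makes differences comparable across the whole intensity range.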
Evaluation: a benchmark for Affymetrix GeneChip expression measures
o Data:
- Spike-in series: from Affymetrix, 59 x HGU95A, 16 genes, 14 concentrations, complex background
- Dilution series: from GeneLogic, 60 x HGU95Av2, liver & CNS cRNA in different proportions and amounts
o Benchmark: 15 quality measures regarding
- reproducibility
- sensitivity
- specificity
Put together by Rafael Irizarry (Johns Hopkins) http://affycomp.biostat.jhsph.edu
ROC curves See Affycomp website at Johns Hopkins for full details
Conclusion for this part
o Around 2002, microarray data had come to be perceived as very noisy and unreliable.
o Some of this was caused by immature hardware, but a lot by naïve, ill-guided processing software (incl. that from the instrument vendor).
o Proper statistical analysis by academic researchers has improved sensitivity and selectivity dramatically.
o Open, academic discussion (incl. software sources) was essential.
Candidate gene sets from microarray studies: dozens…hundreds
Capacity of detailed in-vivo functional studies: one…few
How to close the gap?
Drowning by numbers How to separate a flood of ‘significant’ secondary effects from causally relevant ones? VHL: tumor suppressor with “gatekeeper” role in kidney cancers Boer et al. Genome Res. 2001: kidney tumor/normal profiling study
Drowning by numbers Boer et al. Genome Res. 2001
Tumor profiling
Gene lists are great as features for classification …but what about the underlying biology?
o One gene at a time: multiple testing, loss of power
o Multivariate (regression, prediction): p ≫ n, overfitting, regularization (…arbitrary)
o Cohort studies provide associations, not causal interactions
o De novo discovery of a gene/protein's role is hard
• Strategies:
1. Clever experimental design – ask directed questions
2. Deeper analysis – use all available "meta"data (GO, KEGG, Ensembl, Entrez, …)
3. Deeper experiments – new technologies: protein interaction assays, protein profiling, cell-based functional assays
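The loss of power from testing one gene at a time can be sketched with two standard multiple-testing corrections. Neither is named on the slide; Bonferroni and Benjamini-Hochberg are used here only as familiar examples. The p-values are hypothetical and the function names are made up for this sketch.

```python
def bonferroni(pvalues):
    """Family-wise error control: multiply each p-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """False discovery rate control: BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping the adjusted values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for five genes.
pvals = [0.001, 0.008, 0.02, 0.03, 0.6]
n_bonf = sum(p < 0.05 for p in bonferroni(pvals))
n_bh = sum(p < 0.05 for p in benjamini_hochberg(pvals))
print("significant after Bonferroni:", n_bonf)
print("significant after Benjamini-Hochberg:", n_bh)
```

With thousands of genes instead of five, the correction factor grows accordingly, so a real effect of moderate size can easily fail to reach significance.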
Bioconductor: an open source and open development software project for the analysis of biomedical and genomic data. Started in the fall of 2001 by Robert Gentleman (then at Harvard), it now includes 23 core developers in the US, Europe, and Australia. R and the R package system are used to design and distribute software. Initial focus on microarray statistics, now also: proteomics, cell-based assays, bioinformatic metadata, graph-theoretic methods
Bioconductor
o Strict 6-monthly release cycle, starting with about 15 packages in release 1.0 (March 2003); now at 1.5 with ca. 100 packages
o Ca. 2000 downloads in 4 weeks after release 1.5
o Aggressive development, no backward compatibility
o Packages vary in their maturity: software ecosystem
Core contributors
o Ben Bolstad, UC Berkeley
o Andreas Buness, DKFZ Heidelberg
o Vince Carey, Biostatistics, Harvard
o Sandrine Dudoit, Biostatistics, UC Berkeley
o Jane Fridlyand, UCSF
o Laurent Gautier, Technical University of Denmark
o Jeff Gentry, Dana-Farber Cancer Institute
o Rafael Irizarry, Biostatistics, Johns Hopkins
o Denise Scholtens, Dana-Farber Cancer Institute
o Gordon Smyth, WEHI
o Yee Hwa (Jean) Yang, Biostatistics, UCSF
o Jianhua (John) Zhang, Dana-Farber Cancer Institute
Goals
o Provide access to powerful statistical and graphical methods for the analysis of genomic data
o Facilitate the integration of biological metadata (GenBank, GO, LocusLink, PubMed) in the analysis of experimental data
o Allow the rapid development of extensible, interoperable, and scalable software
o Provide high-quality documentation
o Promote reproducible research
o Provide training in computational and statistical methods
Tools in Bioconductor
Main platform: R
But we also use many other tools:
• PostgreSQL
• Perl
• Python
• Java
• graphviz
• Boost Graph Library (BGL)
• libxml
• MAGEstk
• tcl/tk, Gtk
Philosophy: don't reinvent the wheel
Component software Most interesting problems will require the coordinated application of many different techniques. Thus we need integrated interoperable software. Don’t think your method is the end of it all. Design your piece to be a cog in a big machine. Software modules with standardized I/O instead of stand-alone applications Web service instead of web site
Why are we Open Source?
o so that you can find out what algorithm is being used, and how it is being used
o so that you can modify these algorithms to try out new ideas or to accommodate local conditions or needs
o so that they can be used as components
Transparency; pursuit of reproducibility; efficiency of development; training
Good scientific software is like a good scientific publication
o Reproducibility
o Peer-review
o Easy accessibility by other researchers, society
o Build on the work of others
o Others will build their work on top of it
o Commercialize those developments that are successful and have a market
Commercial distributions We coordinate with Insightful to help provide ArrayAnalyzer (which contains many Bioconductor packages and resources) Bioconductor is also interfaced by Genespring, Spotfire, ExpressionNTI Our software remains free, but some users are willing to pay money for professional support (hotlines, handbooks, stricter enforcement of uniform user interfaces,…) Win-win arrangement
Bioconductor packages
Release 1.5, Nov 2004: ca. 100 packages
• General infrastructure: Biobase, DynDoc, reposTools, ruuid, tkWidgets, widgetTools, BioStrings
• Annotation: annotate, AnnBuilder & data packages
• Graphics: geneplotter, hexbin
• Pre-processing Affymetrix oligonucleotide chip data: affy, affycomp, affydata, makecdfenv, vsn, gcrma
• Pre-processing two-color spotted DNA microarray data: marray, vsn, arrayMagic, arrayQuality
• Differential gene expression: edd, genefilter, limma, multtest, ROC, siggenes, EBArrays, factDesign
• Graphs and networks: graph, RBGL, Rgraphviz
• Other data: SAGElyzer, DNAcopy, PROcess, aCGH, prada
N.B. Many new packages in Bioconductor development version.
Classification, class prediction, machine learning: Predict outcome on basis of past observations of some explanatory variables (features) Outcome: E.g. tumor class, type of bacterial infection, response to treatment, survival Features: gene expression measures, covariates such as age, sex
R class prediction packages
o class: k-nearest neighbor (knn), learning vector quantization (lvq)
o classPP: projection pursuit
o e1071: support vector machine (svm)
o MASS: linear and quadratic discriminant analysis (lda, qda)
o sma: diagonal linear and quadratic discriminant analysis, naïve Bayes
o nnet: feed-forward neural networks and multinomial log-linear models
o rpart: classification and regression trees
o knnTree: k-nn classification with variable selection inside leaves of a tree
o randomForest: random forests
o LogitBoost: boosting for tree stumps
o ipred: bagging, resampling based estimation of prediction error
o mlbench: machine learning benchmark problems
o pamR: prediction analysis for microarrays
o gpls: generalized partial least squares
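As an illustration of class prediction, here is a minimal k-nearest-neighbor sketch in plain Python. It does not call any of the R packages listed; the function name `knn_predict`, the two-feature expression profiles and the class labels are all hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical expression measures of two genes for samples of known tumor class.
train = [
    ((1.0, 5.0), "A"), ((1.2, 4.8), "A"), ((0.9, 5.3), "A"),
    ((4.0, 1.0), "B"), ((4.2, 0.8), "B"), ((3.9, 1.1), "B"),
]

print(knn_predict(train, (1.1, 5.1)))
print(knn_predict(train, (4.1, 0.9)))
```

With real microarray data the number of features far exceeds the number of samples, so the prediction error should be estimated by resampling that includes any feature selection step.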
graph, RBGL, Rgraphviz
o graph: basic class definitions and functionality
o RBGL: graph algorithms (e.g. shortest path, connectivity)
o Rgraphviz: rendering. Different layout algorithms. Seamlessly combinable with R graphics.
Algorithms Shortest paths (edge weights: 1, positive, real) Connectivity (strong, weak) Graph traversal Minimal spanning tree
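For the shortest-path case with positive real edge weights, a minimal Python sketch of Dijkstra's algorithm is given below. It is not the RBGL interface; the graph, its node names and the function name `dijkstra` are hypothetical.

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from `source` in a graph with positive edge weights.

    `graph` maps each node to a dict {neighbor: weight}.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry, a shorter path to u was already found
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Hypothetical directed graph with positive real edge weights.
graph = {
    "a": {"b": 1.0, "c": 4.0},
    "b": {"c": 2.0, "d": 5.0},
    "c": {"d": 1.0},
    "d": {},
}
print(dijkstra(graph, "a"))
```

When all edge weights equal 1, a breadth-first search gives the same distances with less bookkeeping.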
Probabilistic tree model DNA aberrations in cancer (matrix CGH)
domain combination graph Apic, Huber, Teichmann, J. Struct. Fct. Genomics (2003)
Association vs Interference
Microarrays:
o Input: a few phenotypes (e.g. disease) or treatments
o Readout: genome-wide transcription levels
o Mostly association and correlation
o Ensemble average over many cells / tissues
o Marriage of pathology and genomics
Cellular assays:
o Input: perturbation of protein levels or activities, genome-scale
o Output: phenotypes
o Causality
o Individual cells (microscopy / FACS)
o Marriage of cell biology and genomics
Interference/Perturbation tools
RNAi
+ genome wide
o specificity
- efficiency / monitoring?
Transfection (expression)
+ 100% specific
+ monitoring
- library size
Small compounds
…
Monitoring tools
o Plate reader: 96 or 384 well, 1…4 measurements per well
o FACS: ca. 2000 x 4…8 measurements per well
o Automated Microscopy: practically unlimited. Here: 30 x 1280 x 1024 x 3
full-length cDNAs
Sequence AND physical clone with complete and non-interrupted protein coding region of a gene

FLJ/IMS UTokyo      12,560
MGC/NIH (*)         11,414
FLJ/HRI (Kazusa)     8,057
DKFZ&MIPS (*)        5,521
FLJ/KDRI             2,000
others               1,096
unique              21,037
(*) publicly available
http://www.h-invitational.jp
Imanishi et al. (2004). Integrative annotation of 21,037 human genes validated by full-length cDNA clones. PLoS Biol. 2(6)
FLJ: Full-Length Long Japan, IMS: Institute for Medical Science, HRI: Helix Research Institute, MGC: Mammalian Gene Collection
Strategy
21,000+ human cDNAs (genes) → microarray study (humans, in vivo) → O(100) disease-relevant 'novel' genes → HT functional assay (cell culture) → 'hot' candidates
assays to challenge the cell-cycle and beyond Other assays: - Protein localisation - Protein interaction
HT functional assays
[Workflow figure; labels: cDNA library (>100 clones); expression clones; GFP-ORF protein; BrdU incorporation; automated microscope; data analysis; effect on proliferation]
Channels: DAPI: identification; CFP: expression; BrdU: proliferation
Detection of DNA replication (S-phase)
Quantification of BrdU incorporation by specific antibodies
[Figure labels: BrdU, Cy5, S-phase]
Dorit Arlt
Liquid handling: transfection & antibody incubation
[Figure labels: sterile housing; transfection reagents; buffers & antibodies; chamber slides; expression plasmids; transfection mixing plate]
Christian Schmitt
High Content Screening Microscope - Cube - Urban Liebel EMBL
Automated image analysis
Channels: DAPI, YFP, Cy5 (anti-BrdU)

YFP channel   Cy5 channel
...           ...
621           258
2101          144
1732          401
1183          120
493           219
66            297
232           421
182           120
286           332
...           ...
Urban Liebel, EMBL
Distribution of expression intensities
[Histogram: x-axis signal intensity (expression, YFP); y-axis frequency (# of cells)]
Proliferation assay: within-well analysis
[Figure: three panels, each with x-axis ORF expression]