Cellbased assays and the genome - data analysis and modeling of genetic networks

Cellbased assays and the genome - data analysis and modeling of genetic networks. Wolfgang Huber, European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton/Cambridgeshire. 15 Feb 2001: "The human genome is sequenced". But what does the sequence do?

Presentation Transcript


  1. Cellbased assays and the genome - data analysis and modeling of genetic networks. Wolfgang Huber, European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton/Cambridgeshire

  2. 15 Feb 2001: "The human genome is sequenced"

  3. But what does the sequence do? >gi|22046029|ref|NT_029998.5|Hs7_30253 Homo sapiens chromosome 7 reference genomic contig GATCTTATCTATCATGTTCACCTCCCAAGAGGTGAACATATCCCCCAAAGCCTGATAGAGAGAAGATGCTCATTAATATTTAATGCATGACCATGTGCAGACTTGGGAGGAAAAATATGCCTCAGCCTATCAATATTGGACCTTAATAAACAAGGATGTTTCTGCATCATTTCCCCACAACACCGAACAAGTGTGGCTCACTGTGGATGTTTAAGCAAATGCATTGTTTTTCCAGTTATATATCTGGTAGAGATGAGGCCATTGATAGGAATGGGAAGACGATCTCCTTTTATTTTGATGACCCAGCATGGCTGAACACTCAGTGACTACCACTGCACTTTGTTGTACTTTCAGCATTAGAGATGCCAGCCCTGTAGGATATAAAACAGGAACATCTAGTCCTCAATTATATTCAGAATTACTCAAGTCTTAGAAGCACCACTTGTCTTTTTTCAAGGGAGAGAAATGCTCAAGTGATGGGCTGAAGTGAAGGGAGGGAGTCACTCACTTGAACGGTTCCCTTAGGCTGTGTGGATGCAAACAGCATTAGACAATGACACTGACAGTGGGAAATGCACTGGAGACGATGACTGGCAAAGCCCTCCTTTTCTCCCCATCCACTATAGATACTGACAGCAAAGGGTTTGTCACAATGACAACTATACACTCCCAATATCACAGAAGAAGGAGGAATAAAAGGGTATATTATGAGTGACTGAAGTTTAGAATAAATTAATAAATATTATGTCCCTCATCCATAGAAACCACAAAGGTCTAGTAAGGCTAAGGATATAACAAGAAAATAATATGAATATTTGCTTCCCCTTCCTAGTGTAATAGAGTAAGTTACAAATGGCTTCAGGAAGGGGAGAGAGGAAGAAGAGTGGATGAGATACGTAAGAGTGCTTGAGGGCTAATTTTATGAAAGCTTTGGGAAGTTTTAAGAAAAAGAAAAGCTATTTTTCAAGGTACATGTGTGTATGCGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGAAAGACAGAAGAAAGAGGGAGACCTAAGAAGACTATGAGACACTAAGAGAAAAATTAAGGTAAAAAAGACACACACTTAGAAAAACACACATAGGGAGGAGGGAGGAGGTTAAGACATTTTACTATGTGCTGTGAATGGAAACTACAAACCATTTTTGATATATGCAATATATATACATATATACACACATATACATATGTATTTAAATATTTAAATTACATTTTCTCTTTTTTTAGAGATATGGTTTCACTATGTCACTCTGCCCAGGCTGCAGTACAGTGGTTGTTCACAGTCATGATCATAGCACATTATAGCCTTGAACTCCTGGGCTCAAGCAACCCTCCTGTATTAGTCTCCCCAGTAGTTGGGATTACTAGCATATGCCACCATGTCCACCTTTATGCTTTTTAAAGTGAAAAACCATACTAAGAATGAGGCAGCTCAACTTAATAATAAAAACATTTCAAATGTAAAGAAATTTACAAAAGAAAAACAATCAACCCCATTAAAATTGGGCAAAGGGAATGAACAGACACTTTTCAAAAGAATACATGCATGCAGCCAACAAACATACAAAAAAAAAGTTCAACATCACTGATCATTAGAGAAATGCAAATCAAAACCATAATGAGATACCATCTCACACCAGTCAGAATAGCTATCATTAAAAAGTCAAAAAATAACAGATGCTAGTGAGGCTATGGAGAAAAGGGAATGCTTATACACTGTTGTTGGGTGTGCAAATCAGTTCAATCATTGTGCAAGGAAAGTGATTCCTCAAAGAGCTAAAAGCAGAGCTACCATTCGACCCAGTAATCCCACTACTGGGTATATACCCAGATGAAT
ATAAACCATTCTACCATAAAGACACATGCATACAAATGTTCATTGCAGCACTGTTCACAATAGCAAAAGTATGGGATCAACCTAATGCCCATCAATGACAGATTGGATAAAGAAAATGTGGTACATATACACCATGGAATACTATGCCGCCATTAAAAATGATATCATGTCTTTTGCTGGAATATGGATGGACCTTCTATTATCCTTAGCAAACTAATGCAGGAACAGAAAACCAAATATAGCATACTCTCAGTTATAAGTGGGAGCTAAA

  4. Transcription and translation: DNA → (transcription) → mRNA → (translation) → Protein → Molecular network → Organism

  5. Gene expression matters

  6. Functional Genomics: "The Human Transcriptome Project"
     Goals:
     o experimentally identify all transcripts
     o characterize their "function"
       - many levels of detail
       - contingent on what is experimentally accessible
     o ultimately: mathematical models with explanatory power, involving genes, transcripts, proteins ("systems biology")

  7. Microarrays
     samples: mRNA from tissue biopsies, cell lines
     arrays: probes = gene-specific DNA strands
     fluorescent detection of the amount of sample-probe binding

     Gene    tissue A  tissue B  tissue C
     ErbB2   0.02      1.12      2.12
     VIM     1.1       5.8       1.8
     ALDH4   2.2       0.6       1.0
     CASP4   0.01      0.72      0.12
     LAMA4   1.32      1.67      0.67
     MCAM    4.2       2.93      3.31

  8. DNA microarrays and cancer
     Cancer: somatic cells mutate to become anti-social, proliferate excessively, and eventually kill the parent organism.
     Current cancer taxonomies:
     o affected organ
     o cell type of origin
     o apparent grade of de-differentiation
     Goals:
     o molecular taxonomy: more precise, causal
     o molecular diagnosis: better estimation of risk and thus treatment strategy
     o molecular therapy (new, more specific drugs)

  9. log-ratio which genes are differentially transcribed? same-same tumor-normal

  10. Statistics 101: bias (accuracy) and variance (precision)

  11. Basic dogma of data analysis: one can always increase sensitivity at the cost of specificity, or vice versa; the art is to find the best trade-off. (It may also be possible to increase both through a better choice of method / model.)
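
The trade-off can be made concrete with a small sketch. The scores and labels below are made up for illustration, and the function name is hypothetical; the point is only that moving the decision threshold raises one quantity while lowering the other.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Call an item positive when its score is >= threshold.

    labels: True for truly positive items, False for truly negative ones.
    Returns (sensitivity, specificity). Assumes both classes are present.
    """
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y and s >= threshold)
    fn = sum(1 for s, y in pairs if y and s < threshold)
    tn = sum(1 for s, y in pairs if not y and s < threshold)
    fp = sum(1 for s, y in pairs if not y and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up scores and truth labels, for illustration only.
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
labels = [False, False, True, False, True, True]

for threshold in (0.2, 0.5, 0.8):
    sens, spec = sensitivity_specificity(scores, labels, threshold)
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

With these numbers a lenient threshold finds every true positive but accepts most negatives, and a strict threshold does the opposite.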

  12. Raw data are not mRNA concentrations. The problem is less that these steps are ‘not perfect’; it is that they may vary from array to array, experiment to experiment.

  13. Statistical error modeling approach
      o Parameters a, b, s1, s2
      o Maximum-Likelihood
      o Robustification à la "Least Quantile Sum of Squares"
      o Variance stabilizing transformation
      Huber et al. Bioinformatics (2002), SAGMB (2003)
      Software package vsn (Bioconductor project / LGPL)
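
A minimal sketch of a variance-stabilizing transformation of the arsinh type, h(y) = arsinh(a + b*y), which is the form described in the cited vsn papers. The parameter names a and b are borrowed from the slide, but their exact role in the package's parametrization, the default values used here, and the helper names are assumptions; this is not the vsn implementation and it does no parameter estimation.

```python
import math


def arsinh_transform(y, a=0.0, b=1.0):
    """Return h(y) = arsinh(a + b*y) for a single intensity y."""
    return math.asinh(a + b * y)


def delta_h(y1, y2, a=0.0, b=1.0):
    """Difference of transformed values; plays the role of a log-ratio."""
    return arsinh_transform(y1, a, b) - arsinh_transform(y2, a, b)


# For large intensities arsinh(x) is close to log(2x), so delta_h
# approaches the natural-log ratio of the two intensities.
print(delta_h(2000.0, 1000.0), math.log(2.0))

# Near zero the transform is roughly linear and stays finite, even for
# negative background-subtracted values, where the logarithm is undefined.
print(arsinh_transform(0.0), arsinh_transform(-5.0))
```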

  14. Evaluation: a benchmark for Affymetrix genechip expression measures
      o Data:
        - Spike-in series: from Affymetrix, 59 x HGU95A, 16 genes, 14 concentrations, complex background
        - Dilution series: from GeneLogic, 60 x HGU95Av2, liver & CNS cRNA in different proportions and amounts
      o Benchmark: 15 quality measures regarding
        - reproducibility
        - sensitivity
        - specificity
      Put together by Rafael Irizarry (Johns Hopkins) http://affycomp.biostat.jhsph.edu

  15. ROC curves. See Affycomp website at Johns Hopkins for full details
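
For readers unfamiliar with ROC curves, here is a small sketch of how one is built from scores and known truth labels (as in a spike-in experiment, where the truly changed genes are known). The data are made up and the function names are hypothetical; this is not the Affycomp code.

```python
def roc_points(scores, labels):
    """Return ROC points as (false positive rate, true positive rate).

    One point per distinct score threshold, sweeping from the strictest
    threshold (nothing called positive) to the most lenient one.
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        current = ranked[i][0]
        # Items tied at the same score are processed together.
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up test statistics and truth labels, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [True, True, False, True, False, False]
curve = roc_points(scores, labels)
print(curve)
print(auc(curve))
```

A method whose curve lies above another's achieves higher sensitivity at the same specificity, which is the sense in which both can improve at once.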

  16. Conclusion for this part
      o Around 2002, microarray data had come to be perceived as very noisy and unreliable.
      o Some of this was caused by immature hardware, but a lot by naïve, ill-guided processing software (incl. that from the instrument vendor).
      o Proper statistical analysis by academic researchers has improved sensitivity and selectivity dramatically.
      o Open, academic discussion (incl. software sources) was essential.

  17. Candidate gene sets from microarray studies: dozens…hundreds. Capacity of detailed in-vivo functional studies: one…few. How to close the gap?

  18. Drowning by numbers: how to separate a flood of ‘significant’ secondary effects from causally relevant ones? VHL: tumor suppressor with “gatekeeper” role in kidney cancers. Boer et al. Genome Res. 2001: kidney tumor/normal profiling study

  19. Drowning by numbers Boer et al. Genome Res. 2001

  20. Tumor profiling
      Gene lists are great as features for classification …but what about the underlying biology?
      o Multivariate (regression, prediction): p > n, overfitting, regularization (…arbitrary)
      o One gene at a time: multiple testing, loss of power
      o Cohort studies provide associations, not causal interactions
      o De novo discovery of a gene/protein’s role is hard
      Strategies:
      1. Clever experimental design: ask directed questions
      2. Deeper analysis: use all available “meta”data (GO, KEGG, Ensembl, Entrez, …)
      3. Deeper experiments: new technologies such as protein interaction assays, protein profiling, cell-based functional assays
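
To illustrate the "multiple testing, loss of power" point, the sketch below compares two standard adjustments for per-gene p-values: Bonferroni and Benjamini-Hochberg. Neither is named on the slide; they are swapped in here as familiar examples, and the p-values are made up.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values (controls the family-wise error rate)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up per-gene p-values, for illustration only.
p_values = [0.001, 0.008, 0.012, 0.02, 0.27, 0.6]

n_bonferroni = sum(1 for p in bonferroni(p_values) if p < 0.05)
n_bh = sum(1 for p in benjamini_hochberg(p_values) if p < 0.05)
print(n_bonferroni, n_bh)
```

With thousands of genes instead of six, the stricter correction leaves very few genes significant, which is the loss of power the slide refers to.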

  21. Bioconductor: an open source and open development software project for the analysis of biomedical and genomic data. It was started in the fall of 2001 by Robert Gentleman (then at Harvard) and now includes 23 core developers in the US, Europe, and Australia. R and the R package system are used to design and distribute software. Initial focus on microarray statistics, now also: proteomics, cell-based assays, bioinformatic metadata, graph-theoretic methods

  22. Bioconductor
      o Strict 6-monthly release cycle, starting with about 15 packages in 1.0 in March 2003, now at 1.5 with ca. 100 packages
      o Ca. 2000 downloads in 4 weeks after release 1.5
      o Aggressive development, no backward compatibility
      o Packages vary in their maturity: software ecosystem

  23. Core contributors
      o Ben Bolstad, UC Berkeley
      o Andreas Buness, DKFZ Heidelberg
      o Vince Carey, Biostatistics, Harvard
      o Sandrine Dudoit, Biostatistics, UC Berkeley
      o Jane Fridlyand, UCSF
      o Laurent Gautier, Technical University of Denmark
      o Jeff Gentry, Dana-Farber Cancer Institute
      o Rafael Irizarry, Biostatistics, Johns Hopkins
      o Denise Scholtens, Dana-Farber Cancer Institute
      o Gordon Smyth, WEHI
      o Yee Hwa (Jean) Yang, Biostatistics, UCSF
      o Jianhua (John) Zhang, Dana-Farber Cancer Institute

  24. Goals
      o Provide access to powerful statistical and graphical methods for the analysis of genomic data
      o Facilitate the integration of biological metadata (GenBank, GO, LocusLink, PubMed) in the analysis of experimental data
      o Allow the rapid development of extensible, interoperable, and scalable software
      o Provide high-quality documentation
      o Promote reproducible research
      o Provide training in computational and statistical methods

  25. Tools in Bioconductor. Main platform: R. But we also use many other tools: • PostgreSQL • Perl • Python • Java • graphviz • Boost Graph Library (BGL) • libxml • MAGEstk • tcl/tk, Gtk. Philosophy: don’t reinvent the wheel

  26. Component software. Most interesting problems will require the coordinated application of many different techniques. Thus we need integrated interoperable software. Don’t think your method is the end of it all. Design your piece to be a cog in a big machine. Software modules with standardized I/O instead of stand-alone applications. Web service instead of web site.

  27. Why are we Open Source?
      o so that you can find out what algorithm is being used, and how it is being used
      o so that you can modify these algorithms to try out new ideas or to accommodate local conditions or needs
      o so that they can be used as components
      Transparency; pursuit of reproducibility; efficiency of development; training

  28. Good scientific software is like a good scientific publication
      o Reproducibility
      o Peer-review
      o Easy accessibility by other researchers, society
      o Build on the work of others
      o Others will build their work on top of it
      o Commercialize those developments that are successful and have a market

  29. Commercial distributions. We coordinate with Insightful to help provide ArrayAnalyzer (which contains many Bioconductor packages and resources). Bioconductor is also interfaced by Genespring, Spotfire, ExpressionNTI. Our software remains free, but some users are willing to pay money for professional support (hotlines, handbooks, stricter enforcement of uniform user interfaces, …). Win-win arrangement.

  30. Bioconductor packages. Release 1.5, Nov 2004. Ca. 100 packages
      • General infrastructure: Biobase, DynDoc, reposTools, ruuid, tkWidgets, widgetTools, BioStrings
      • Annotation: annotate, AnnBuilder & data packages
      • Graphics: geneplotter, hexbin
      • Pre-processing Affymetrix oligonucleotide chip data: affy, affycomp, affydata, makecdfenv, vsn, gcrma
      • Pre-processing two-color spotted DNA microarray data: marray, vsn, arrayMagic, arrayQuality
      • Differential gene expression: edd, genefilter, limma, multtest, ROC, siggenes, EBArrays, factDesign
      • Graphs and networks: graph, RBGL, Rgraphviz
      • Other data: SAGElyzer, DNAcopy, PROcess, aCGH, prada
      N.B. Many new packages in Bioconductor development version.

  31. Classification, class prediction, machine learning: predict an outcome on the basis of past observations of some explanatory variables (features).
      Outcome: e.g. tumor class, type of bacterial infection, response to treatment, survival
      Features: gene expression measures, covariates such as age, sex

  32. R class prediction packages
      o class: k-nearest neighbor (knn), learning vector quantization (lvq)
      o classPP: projection pursuit
      o e1071: support vector machine (svm)
      o MASS: linear and quadratic discriminant analysis (lda, qda)
      o sma: diagonal linear and quadratic discriminant analysis, naïve Bayes
      o nnet: feed-forward neural networks and multinomial log-linear models
      o rpart: classification and regression trees
      o knnTree: k-nn classification with variable selection inside leaves of a tree
      o randomForest: random forests
      o LogitBoost: boosting for tree stumps
      o ipred: bagging, resampling based estimation of prediction error
      o mlbench: machine learning benchmark problems
      o pamR: prediction analysis for microarrays
      o gpls: generalized partial least squares
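
As an illustration of the simplest method in this list, here is a sketch of k-nearest-neighbor class prediction. It is written in plain Python rather than R and is not the code of any of the packages above; the two-gene expression values, class names, and function name are made up.

```python
import math
from collections import Counter


def knn_predict(train_features, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training samples.

    Uses Euclidean distance; choose k odd to avoid ties between two classes.
    """
    distances = sorted(
        (math.dist(x, query), label)
        for x, label in zip(train_features, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Made-up expression measures for two genes in six samples.
train_features = [
    (0.1, 2.0), (0.3, 1.8), (0.2, 2.2),   # normal
    (2.1, 0.4), (1.9, 0.2), (2.3, 0.5),   # tumor
]
train_labels = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

print(knn_predict(train_features, train_labels, (2.0, 0.3)))
print(knn_predict(train_features, train_labels, (0.2, 1.9)))
```

With thousands of genes and few samples, the choice of features and of k should be validated on held-out data, since the prediction error estimated on the training set is optimistic.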

  33. graph, RBGL, Rgraphviz
      o graph: basic class definitions and functionality
      o RBGL: graph algorithms (e.g. shortest path, connectivity)
      o Rgraphviz: rendering. Different layout algorithms. Seamlessly combinable with R graphics.

  34. Algorithms
      o Shortest paths (edge weights: 1, positive, real)
      o Connectivity (strong, weak)
      o Graph traversal
      o Minimal spanning tree
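
A small sketch of the shortest-path case with positive edge weights, using Dijkstra's algorithm on a dictionary-based graph. This is plain Python for illustration, not the RBGL or Boost Graph Library interface; the graph, node names, and function name are made up.

```python
import heapq


def shortest_path_lengths(graph, source):
    """Dijkstra's algorithm for non-negative edge weights.

    graph: dict mapping node -> dict of neighbor -> weight (directed edges).
    Returns a dict of shortest distances from source to every reachable node.
    """
    dist = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return dist


# Made-up directed graph; node E has an edge into A but nothing leads to E.
graph = {
    "A": {"B": 1.0, "C": 4.0},
    "B": {"C": 2.0, "D": 5.0},
    "C": {"D": 1.0},
    "D": {},
    "E": {"A": 1.0},
}

distances = shortest_path_lengths(graph, "A")
print(distances)
```

The keys of the returned dictionary are also the set of nodes reachable from the source, which gives a simple connectivity check as a by-product.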

  35. Probabilistic tree model: DNA aberrations in cancer (matrix CGH)

  36. domain combination graph Apic, Huber, Teichmann, J. Struct. Fct. Genomics (2003)

  37. Association vs Interference
      Microarrays:
      o Input: a few phenotypes (e.g. disease) or treatments
      o Readout: genome-wide transcription levels
      o Mostly association and correlation
      o Ensemble average over many cells / tissues
      o Marriage of pathology and genomics
      Cellular assays:
      o Input: perturbation of protein level or activities, genome-scale
      o Output: phenotypes
      o Causality
      o Individual cells (microscopy / FACS)
      o Marriage of cell biology and genomics

  38. Interference/Perturbation tools
      RNAi: (+) genome wide; (o) specificity; (-) efficiency / monitoring?
      Transfection (expression): (+) 100% specific; (+) monitoring; (-) library size
      Small compounds …

  39. Monitoring tools
      o Plate reader: 96 or 384 well, 1…4 measurements per well
      o FACS: ca. 2000 x 4…8 measurements per well
      o Automated microscopy: practically unlimited. Here: 30 x 1280 x 1024 x 3
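
To put the three readouts on one scale, the arithmetic can be spelled out. Reading "30 x 1280 x 1024 x 3" as 30 images of 1280 x 1024 pixels in 3 channels per well is an assumption; the slide only gives the product.

```python
# Raw measurements per well, as (low, high) ranges taken from the slide.
plate_reader = (1, 4)
facs = (2000 * 4, 2000 * 8)        # ca. 2000 cells x 4 to 8 parameters

# Assumed reading: 30 images x 1280 x 1024 pixels x 3 channels.
microscopy = 30 * 1280 * 1024 * 3

print(f"plate reader: {plate_reader[0]} to {plate_reader[1]}")
print(f"FACS:         {facs[0]:,} to {facs[1]:,}")
print(f"microscopy:   {microscopy:,}")
print(f"microscopy / FACS (upper end): {microscopy / facs[1]:,.0f}x")
```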

  40. Full-length cDNAs: sequence AND physical clone with complete and non-interrupted protein coding region of a gene

      Source              Clones
      FLJ/IMS UTokyo      12,560
      MGC/NIH (*)         11,414
      FLJ/HRI (Kazusa)     8,057
      DKFZ&MIPS (*)        5,521
      FLJ/KDRI             2,000
      others               1,096
      unique              21,037
      (*) publicly available

      http://www.h-invitational.jp
      Imanishi et al. (2004). Integrative annotation of 21,037 human genes validated by full-length cDNA clones. PLoS Biol. 2(6)
      FLJ: Full-Length Long Japan, IMS: Institute for Medical Science, HRI: Helix Research Institute, MGC: Mammalian Gene Collection

  41. Strategy [Diagram labels: 21,000+ human cDNAs (genes); microarray study (humans, in vivo); O(100) disease-relevant 'novel' genes; HT functional assay (cell culture); ‘hot’ candidates]

  42. assays to challenge the cell-cycle and beyond  Other assays: - Protein localisation - Protein interaction

  43. HT functional assays [Diagram labels: cDNA library (>100 clones); expression clones; GFP-ORF protein; BrdU incorporation; automated microscope; DAPI: identification; CFP: expression; BrdU: proliferation; data analysis; effect on proliferation]

  44. Detection of DNA replication (S-phase): quantification of BrdU incorporation by specific antibodies [Figure labels: BrdU, Cy5] Dorit Arlt

  45. Liquid handling Transfection & antibody incubation Sterile housing Transfection reagents Buffers & antibodies Chamber slides Expression plasmids Transfection mixing plate Christian Schmitt

  46.  High Content Screening Microscope - Cube - Urban Liebel EMBL

  47. Automated image analysis. Channels: DAPI, YFP, Cy5 (anti-BrdU) [Figure: table of per-cell intensities in the YFP and Cy5 channels] Urban Liebel EMBL

  48. raw data

  49. Distribution of expression intensities [Histogram: x-axis signal intensity (expression, YFP), y-axis frequency (# of cells)]

  50. Proliferation assay: within-well analysis [Figure panels, x-axis: ORF expression]
