
Develop mathematical, statistical, and computational methods



Presentation Transcript


  1. Develop mathematical, statistical, and computational methods to analyse biologically or technologically novel experiments in order to understand disease-relevant regulatory and genetic interaction networks

  2. What we do. High-density microarrays for RNA transcription and protein-DNA binding: regulatory networks in heart development, Jörn Tödling (with Silke Sperling, MPI Molecular Genetics); fundamentals of transcription and genetics in yeast, Matt Ritchie (with Lars Steinmetz, EMBL). High-throughput RNAi assays, high-content automated microscopy, genetic interaction networks: Oleg Sklyar, Ligia Bras, Thomas Horn (with Michael Boutros, DKFZ; Robert Gentleman, FHCRC Seattle; Amy Kiger, UCSD). Bioconductor.

  3. Bioconductor: an open source and open development software project for the analysis of biomedical and genomic data. Started in the fall of 2001 by Robert Gentleman (then at Harvard), it now includes 23 core developers in the US, Europe, and Australia. R and the R package system are used to design and distribute software. Initial focus on microarrays; now also proteomics, cell-based assays, bioinformatic metadata, and graph-theoretic methods.

  4. Bioconductor. Strict 6-monthly release cycle, starting with about 15 packages (1.0 in March 2003), now at 1.7 with ca. 140 packages. Each release has several thousand downloads. Aggressive development: state-of-the-art algorithms; backward compatibility is desired but sometimes not possible. Packages vary in their maturity: somewhere between "software textbook" and "software journal".

  5. Acknowledgments. Ben Bolstad, UC Berkeley; Vince Carey, Biostatistics, Harvard; Sandrine Dudoit, Biostatistics, UC Berkeley; Seth Falcon, Fred Hutchinson Cancer Res Ctre, Seattle; Robert Gentleman, FHCRC; Jeff Gentry, Dana-Farber Cancer Institute; Florian Hahne, DKFZ; Rafael Irizarry, Biostatistics, Johns Hopkins; Li Long, Swiss Institute of Bioinformatics, Switzerland; James MacDonald, University of Michigan, USA; Martin Maechler, ETH Zürich, CH; Denise Scholtens, U Chicago; Gordon Smyth, WEHI; … and many others.

  6. Goals of Bioconductor. Provide access to powerful statistical and graphical methods for the analysis of genomic data. Facilitate the integration of biological metadata (GenBank, GO, LocusLink, PubMed, Ensembl) in the analysis of experimental data. Allow the rapid development of extensible, interoperable, and scalable software. Provide high-quality documentation. Promote reproducible research. Provide training in computational and statistical methods.

  7. Tools in Bioconductor. Main platform: R. But we also use many other tools: • graphviz • Boost Graph Library (BGL) • libxml • mySQL • Biomart/Ensembl • imageMagick • C/C++, Perl, Java • MAGEstk • tcl/tk, Gtk. Philosophy: don’t reinvent the wheel.

  8. Component software. Most interesting problems will require the coordinated application of many different techniques. Thus we need integrated interoperable software. Don’t think your method is the end of it all. Design your piece to be a cog in a big machine. Software modules with standardized I/O instead of stand-alone applications. Web service instead of web site.

  9. Why are we Open Source? So that you can find out what algorithm is being used, and how it is being used; so that you can modify these algorithms to try out new ideas or to accommodate local conditions or needs; so that they can be used as components. Transparency. Pursuit of reproducibility. Efficiency of development. Training.

  10. Good scientific software is like a scientific publication: • Reproducibility • Peer review • Easy accessibility by other researchers, society • Build on the work of others • Others will build their work on top of it • Commercialize those developments that are successful and have a market

  11. S. cerevisiae

  12. GeneChip S. cerevisiae Tiling Array. 4 bp tiling path over the complete genome (12 million base pairs, 16 chromosomes); sense and antisense strands; 6.5 million oligonucleotides; 5 µm feature size; manufactured by Affymetrix; designed by Lars Steinmetz (EMBL & Stanford Genome Center).

  13. RNA Hybridization

  14. Before normalization

  15. DNA hybridization; probe-specific response normalization; raw RNA hybridization data; remove ‘dead’ probes

  16. Probe-specific response normalization. $s_i$: probe-sequence specific response factor, estimated from DNA hybridization data. $b_i = b(s_i)$: probe-sequence specific background term. Estimation: for strata of probes with similar $s_i$, estimate $b$ through a location estimator of the distribution of intergenic probes, then interpolate to obtain a continuous $b(s)$.
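
The estimation procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `interpolate` and `normalize_probes`, the use of the median as location estimator, the equal-size strata, and the final combination `(y - b(s)) / s` are all assumptions made here; the slide only defines the response factor and the background term.

```python
from bisect import bisect_right
from statistics import median


def interpolate(x, xs, ys):
    """Piecewise-linear interpolation, held constant outside [xs[0], xs[-1]]."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    j = bisect_right(xs, x)          # xs[j-1] <= x < xs[j]
    w = (x - xs[j - 1]) / (xs[j] - xs[j - 1])
    return ys[j - 1] + w * (ys[j] - ys[j - 1])


def normalize_probes(y, s, is_intergenic, n_strata=10):
    """Hypothetical helper: background-correct and scale probe intensities.

    y             raw RNA hybridization intensity per probe
    s             response factor per probe (from DNA hybridization)
    is_intergenic True for probes assumed to be untranscribed
    """
    # Sort intergenic probes by response factor and cut into strata.
    inter = sorted((si, yi) for si, yi, g in zip(s, y, is_intergenic) if g)
    size = max(1, len(inter) // n_strata)
    centers, levels = [], []
    for start in range(0, len(inter), size):
        chunk = inter[start:start + size]
        centers.append(median(c[0] for c in chunk))   # typical s in stratum
        levels.append(median(c[1] for c in chunk))    # assumed location estimator
    # Interpolate to a continuous b(s), then normalize (assumed form).
    b = [interpolate(si, centers, levels) for si in s]
    q = [(yi - bi) / si for yi, bi, si in zip(y, b, s)]
    return q, b
```

With this form, intergenic probes end up near zero and transcribed probes become comparable across probe sequences, which is the stated purpose of the normalization.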

  17. After normalization

  18. Segmentation. Two obvious options: (1) smoothing and thresholding: simple, but estimates of transcript boundaries will be biased and depend on expression level; (2) Hidden Markov Model (HMM): but our “states” come from a continuum, and it is unclear how to discretize them. Our solution: fit a piecewise constant function with change points.

  19. The model. $t_1, \dots, t_S$: change points; $Y$: normalized intensities; $x$: genomic coordinates; $\mu_k$: level of the $k$-th segment

  20. Model fitting. Minimize ... $t_1, \dots, t_S$: change points; $J$: number of replicate arrays
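
The formulas on slides 19 and 20 did not survive extraction. Assuming the usual least-squares criterion for a piecewise constant fit, one plausible form consistent with the listed symbols is given below. The probe index $i$, the array index $j$, the residual $\varepsilon_{ij}$, the segment mean $\hat\mu_k$, and the boundary conventions $t_0 = 1$, $t_{S+1} = n + 1$ are notation introduced here for illustration and are not taken from the slides.

```latex
% Assumed model: intensities are constant within each segment, plus noise.
y_{ij} = \mu_k + \varepsilon_{ij}
  \qquad \text{for } t_k \le i < t_{k+1}, \quad j = 1, \dots, J

% Assumed objective: residual sum of squares over all segments and replicates.
G(t_1, \dots, t_S) =
  \sum_{k=0}^{S} \; \sum_{i=t_k}^{t_{k+1}-1} \; \sum_{j=1}^{J}
  \left( y_{ij} - \hat\mu_k \right)^2
```

Under this assumption, $\hat\mu_k$ is simply the mean of all intensities in segment $k$, so the only free parameters in the minimization are the change points.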

  21. Maximization. Naïve optimization has complexity $n^s$, where $n \approx 10^5$ and $s \approx 10^3$. Fortunately, there is a dynamic programming algorithm with complexity $O(n^2)$, and a good heuristic with $O(n)$: F. Picard, S. Robin, M. Lavielle, C. Vaisse, G. Celeux, J.J. Daudin, BMC Bioinformatics (2005); Bai and Perron, Journal of Applied Econometrics (2003). Software: W. Huber, package tilingArray, www.bioconductor.org; A. Zeileis, package strucchange, CRAN.
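
To make the dynamic programming idea concrete, here is a minimal Python sketch of an exact least-squares segmentation of a single intensity series. It is not the code of tilingArray or strucchange: the function name `segment` is hypothetical, the cost is assumed to be the residual sum of squares, and replicate arrays are ignored for brevity. The run time is quadratic in the number of probes for each additional segment.

```python
def segment(y, n_segments):
    """Fit a piecewise constant function with n_segments segments to y.

    Returns (change_points, levels), where change_points are the start
    indices of segments 2..n_segments and levels are the segment means.
    """
    n = len(y)
    # Prefix sums give the cost of any candidate segment in O(1).
    c1, c2 = [0.0], [0.0]
    for v in y:
        c1.append(c1[-1] + v)
        c2.append(c2[-1] + v * v)

    def cost(i, j):
        """Residual sum of squares of y[i:j] around its own mean."""
        total = c1[j] - c1[i]
        return (c2[j] - c2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[k][j]: minimal cost of splitting y[:j] into k segments.
    best = [[inf] * (n + 1) for _ in range(n_segments + 1)]
    back = [[0] * (n + 1) for _ in range(n_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):       # i = start of the last segment
                c = best[k - 1][i] + cost(i, j)
                if c < best[k][j]:
                    best[k][j] = c
                    back[k][j] = i

    # Trace back the optimal segment boundaries.
    bounds = [n]
    j = n
    for k in range(n_segments, 0, -1):
        j = back[k][j]
        bounds.append(j)
    bounds.reverse()                        # [0, t_1, ..., t_S, n]
    levels = [(c1[b] - c1[a]) / (b - a) for a, b in zip(bounds, bounds[1:])]
    return bounds[1:-1], levels
```

For example, `segment([0, 0, 0, 5, 5, 5, 5, 1, 1, 1], 3)` recovers the two change points at indices 3 and 7. The table of partial solutions is what replaces the exhaustive search over all placements of change points.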

  22. Confidence intervals. $D_i$: level difference; $Q_i$: number of data points per unit $t$; $W_i$: error variance (allowing serial correlations); true and estimated change points; $V_i(s)$: appropriately scaled and shifted Wiener process (Brownian motion). Bai and Perron, J. Appl. Econometrics 18 (2003).

  23. Segmentation Results 1. Compare to known 2. Discover new

  24. A closer look

  25. Mapping of UTRs

  26. UTR lengths for 2044 ORFs. Medians: 68 nucleotides and 91 nucleotides. On average 3’ UTRs are longer than 5’ UTRs. No correlation between 3’ and 5’ lengths.

  27. Long 5' UTR including cotranscribed uORFs Mapped to precision of 9 bases to known

  28. Transcriptional architectures. 921 ORFs were divided into at least two segments. MET7: folylpolyglutamate synthetase, catalyzes extension of the glutamate chains of the folate coenzymes.

  29. Operon-like structures. 123 segments contained ORFs of more than one protein-coding gene. YCK2: casein kinase I, involved in cytokinesis. GIM3: tubulin binding, involved in microtubule biogenesis. (Figure labels: YCK2, GIM3, PCR product.)

  30. Adjacent transcripts of non-coding and coding genes. Martens, J. A., Laprade, L. & Winston, F. Intergenic transcription is required to repress the Saccharomyces cerevisiae SER3 gene. Nature 429, 571-574 (2004).

  31. Expressed features. 5654 ORFs with ≥ 7 unique probes; 5104 (90%) detected above background (FDR = 0.001); untranscribed: meiosis, sporulation, mating, sugar transport, vitamin metabolism. Fraction of transcribed basepairs: 11,412,997 bp of unique sequence; 75.2% density of prior annotation (either strand); 84.5% detected above background (either strand); 16.2% of transcribed bp (exponential growth in rich media) not yet annotated.

  32. Novel Transcripts

  33. Antisense transcripts CBF1-bs CBF1: regulatory module involved in cell cycle and stress response; DNA replication and chromosome cycle; defects in growth in rich media

  34. Novel transcripts Basis: multiple alignment of 4 yeast genomes: S.cerevisiae, S.bayanus, S.mikatae, S.paradoxus. Kellis et al. Nature (2003) Conservation analysis: fraction of segments for which there is a multiple alignment; total tree length Codon signature: 3-periodicity of mutation frequencies novel transcribed segments  untranscribed << annotated transcripts. with Lee Bofkin, Nick Goldman

  35. Antisense and UTR length 3’ UTRs have more antisense than 5’ UTRs UTRs with antisense are longer than UTRs without

  36. Antisense transcripts • microtubule-mediated nuclear migration • cell separation during cytokinesis • cell wall • single-stranded RNA binding (NAB2, NAB3, NPL3, PAB1, SGN1) • Meiosis genes

  37. Antisense transcripts: NAB2

  38. Antisense transcripts: NAB3

  39. RNA-mediated regulation. UTR lengths correlate with function, localization, regulation. Antisense transcripts correlate with GO categories. Antisense found predominantly to 3’ UTRs and longer UTRs. 3’ UTRs are targets of miRNAs in other species … suggesting a functional role of antisense transcripts in S. cerevisiae.

  40. Antisense to CLN2 – G1 cyclin

  41. Conclusions • Conventional microarrays: measure transcript levels • High-resolution tiling arrays: also transcript structure (introns, exons; transcription start and stop sites; overlapping populations of transcripts; non-coding RNA: UTRs, ncRNAs, antisense) • Probe-response normalization: make signal comparable across probes • Model-based segmentation method with exact algorithm, including confidence intervals • Genome-wide evidence for association of non-coding RNA (antisense, UTRs) with function of the corresponding genes

  42. Acknowledgements Marina Granovskaia EMBL Heidelberg Lior David, Curt Palm Stanford Genome Tech. Center Jörn Tödling, Lee Bofkin, Nick Goldman EMBL-EBI Cambridge Bioconductor project Robert Gentleman Vince Carey Rafael Irizarry Ben Bolstad Paul Murrell Achim Zeileis EBI Matt Ritchie Lígia Brás Florian Hahne Oleg Sklyar

  43. RRB1 – essential regulator of ribosome biogenesis; JPL2 – protein of unknown function
