Purpose of the Session

Thoughts on Integrative Genome Analysis -- as Stimulus for a Discussion towards Consortium Publication(s).

Presentation Transcript


  1. Thoughts on Integrative Genome Analysis -- as Stimulus for a Discussion towards Consortium Publication(s)
     AWG Members: Srinka Ghosh, Sandra Michelsen, David MacAlpine, Matt Eaton, Steve Henikoff, Ann Hammonds, Gos Micklem, Manolis Kellis, Peter Park, Xiaole Liu, Mark Gerstein, Sue Celniker, Eric Lai, Kris Gunsalus, Bob Waterston, David Miller, Lincoln Stein
     modENCODE Consortium meeting, Rockville, MD, 2008.06.17, 9:15-10:30 (10' near beginning of session)
     Slides downloadable from Lectures.GersteinLab.org. (Please read permissions statement.)
     Paper references mostly from Papers.GersteinLab.org.
     (Quick overview of the ENCODE pilot results, focusing on G&T results, pgenes + DART. [I:ENCODE], fit into time)

  2. Purpose of the Session
     • Our charge
       • "NHGRI would like the Consortium to think early on about what would be involved in an integrative paper and what would be the steps forward to accomplish this goal."
     • Specific things to think about
       • What integrative analyses do we want to do?
       • What have we done, a year in?
       • What do we need to do?
         • Data freezes? Analysis discussions?
         • Bringing in new types of data or analysts?
         • Fast-track particular experiments or analyses?

  3. What is Integrative Analysis?
     Brief presentations to give us "data" for our deliberations
     • What were integrative analyses in the framework of pilot ENCODE? (case study)
     • Where is modENCODE with respect to these?
     • 2 case studies on
       • microRNAs in fly (E Lai) and
       • the 3' UTRome in worm (K Gunsalus)

  4. A Tale of TARs (& TxFrags)
     Vignette drawn from activity in pilot ENCODE Genes & Transcripts group (Gingeras TR, Guigó R, Snyder M, Birney E, Zhang ZD, Reymond A, Kapranov P, Rozowsky J, Zheng D, Castelo R, Frankish A, Harrow J, Ghosh S...)
     Case study in the annotation of un-annotated transcription and relating it to other annotation

  5. Different types of analyses carried out in the modENCODE/ENCODE consortia
     [Spectrum: Tech, Production, Integrated]
     • Development of Sequence (and Array) Technology
     • Output of Production Pipelines and Surveying of Single Type of Annotation
     • Integrated Analysis Connecting Different Types of Annotation
     Where are we now in modENCODE?

  6. A Starting Point: Noisy Raw Signal from Tiling Arrays
     [Spectrum: Tech, Production]
     Johnson et al. (2005) TIG, 21, 93-102. Li et al., PLoS ONE (2007)

  7. Signal Processing to Normalize and Standardize Signals to Get a Usable and Cross-experiment Comparable Signal Map
     [Spectrum: Tech, Production]
     • Array data can be normalized by mean, median, quantile, &c. How to do this consistently?
     • Tile scoring using a (smoothing) sliding window generates the signal map and the P-value map.
     Source: Bolstad, B.M., et al. (2003), Bioinformatics, 19, 185-93. Zhang et al. (2007) Genome Biology
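As a rough illustration of the two steps named on this slide, here is a minimal sketch of quantile normalization across arrays followed by a sliding-window smoother. The function names, the centered median window, and the naive handling of ties are assumptions made for illustration; they are not the consortium's pipeline, and the P-value map is not computed here.

```python
from statistics import mean, median


def quantile_normalize(arrays):
    """Give every array the same empirical distribution (ties broken by position)."""
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")
    # Mean intensity at each rank, taken across the sorted arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [mean(s[i] for s in sorted_arrays) for i in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices, lowest first
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


def sliding_median(signal, half_width):
    """Smooth a probe-ordered signal with a centered median window (assumed smoother)."""
    n = len(signal)
    return [
        median(signal[max(0, i - half_width): min(n, i + half_width + 1)])
        for i in range(n)
    ]
```

After normalization, each array contains the same set of values, so thresholds chosen on one experiment carry over to the others; the smoother then suppresses single-probe spikes before any hits are called.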

  8. Calibrating Error Rates for Each Platform
     [Spectrum: Tech]
     • Sens. vs. Spec. for different platforms
     • How does this extrapolate genome-wide, across samples?
     • Attempting to score experiments in uniform fashion
     • Understand "lab-effects" vs. real ones
     • Where is NextGen seq. technology in rel. to this?
     Emanuelsson et al. ('07) Gen. Res.
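To make the sensitivity vs. specificity comparison concrete, the sketch below scores one platform's calls against a validated truth set. The function name and the per-region boolean encoding are hypothetical; the slide does not specify how the consortium scored platforms.

```python
def sensitivity_specificity(calls, truth):
    """Compare per-region calls against a validated gold standard.

    calls and truth are equal-length sequences of booleans, one entry per
    tested region (hypothetical encoding for this sketch).
    """
    if len(calls) != len(truth):
        raise ValueError("calls and truth must cover the same regions")
    tp = sum(1 for c, t in zip(calls, truth) if c and t)
    fn = sum(1 for c, t in zip(calls, truth) if not c and t)
    tn = sum(1 for c, t in zip(calls, truth) if not c and not t)
    fp = sum(1 for c, t in zip(calls, truth) if c and not t)
    # Undefined rates (no positives or no negatives in truth) are reported as NaN.
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity
```

Running the same function over every platform at several thresholds gives comparable operating points, which is one simple way to score experiments in a uniform fashion.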

  9. Segmentation: Finding Discrete Annotation Blocks (TARs/TxFrags) from Processed Signal
     [Spectrum: Tech, Production]
     • Iterative process of building a model, segmenting signal into discrete, easily usable "hits" (TARs/TxFrags), validating some of them
     • Establishing consistent definitions of "hits" and TARs (e.g. point sources)
     • Defining consistent thresholds
     [Du et al. (2006) Bioinformatics; Fig. from Gerstein et al., Gen. Res. ('07)]
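The idea of turning a processed signal into discrete blocks under consistent thresholds can be sketched with a simple threshold-and-merge rule. This is an assumed stand-in for illustration, not the model-based segmentation of the cited papers; `threshold`, `max_gap`, and `min_run` are hypothetical parameter names.

```python
def call_blocks(positions, scores, threshold, max_gap, min_run):
    """Return (start, end) blocks built from probes scoring >= threshold.

    Passing probes at most max_gap bases apart are merged into one block;
    blocks spanning fewer than min_run bases are discarded. positions must
    be sorted probe coordinates on a single chromosome.
    """
    blocks = []
    start = end = None
    for pos, score in zip(positions, scores):
        if score < threshold:
            continue
        if start is not None and pos - end <= max_gap:
            end = pos  # extend the current block
        else:
            if start is not None and end - start >= min_run:
                blocks.append((start, end))
            start = end = pos  # open a new block
    if start is not None and end - start >= min_run:
        blocks.append((start, end))
    return blocks
```

Fixing the three parameters once and applying them to every experiment is what makes the resulting blocks comparable; changing any of them changes what counts as a "hit".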

  10. Statistics on the TxFrags: Surveys of a Single Type of Annotation
      [Spectrum: Production]
      • Annotated and unannotated TxFrags detected in different cell lines.
      [ENCODE Consortium, Nature 447, 2007]

  11. More Developed Annotation: Clustering and Classifying Blocks of Un-annotated Transcription into Larger Units
      [Spectrum: Production, Integrated]
      Phylogenetic Profiles or
      Rozowsky et al. Gen. Res. (2007)

  12. Vast Amounts of Different Data Types to Integrate in pilot ENCODE
      [Spectrum: Integrated]
      • Determining experimental signals for biochemical activity across each base of genome
      • Large-scale sequence comparison in relation to the human genome
      [ENCODE Consortium, Nature 447, 2007]
      (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu

  13. Single Ex. of Pseudogene Intersecting with Transcriptional and Regulatory Evidence
      [Spectrum: Integrated]
      • Are integrated experiments comparable, i.e. done on consistent cell lines, on the same coordinate sys., &c.?
      Figure labels: Composite ChIP hit; Special yG tracks in browser; diTAG; CAGE; TARs; ChIP-chip
      Connecting TARs (TxFrags) in Integrative fashion to different types of Annotation
      Zheng et al. (2007) Gen. Res.
      Zheng et al. (2007) Gen. Res.
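Intersecting one annotated feature with several evidence tracks reduces to interval overlap, provided every track uses the same coordinate system. The sketch below assumes half-open `(chrom, start, end)` tuples and a plain dictionary of tracks; both are hypothetical conventions chosen for this example, and the coordinates in the usage note are made up.

```python
def overlapping_evidence(feature, tracks):
    """Count, per track, the intervals that overlap a feature.

    feature is (chrom, start, end) with a half-open [start, end) range;
    tracks maps a track name to a list of intervals in the same form.
    Tracks with no overlapping interval are left out of the result.
    """
    chrom, start, end = feature
    hits = {}
    for name, intervals in tracks.items():
        count = sum(
            1 for c, s, e in intervals
            if c == chrom and s < end and start < e
        )
        if count:
            hits[name] = count
    return hits
```

For instance, a pseudogene at `("chr22", 1000, 2000)` checked against TAR, CAGE, and ChIP-chip tracks would return only the tracks that actually touch it. The result is only meaningful if all tracks were mapped to the same assembly, which is exactly the comparability question raised on the slide.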

  14. Integrating Transcriptional Evidence with Gene Annotation and Sequence Constraints
      [Spectrum: Integrated]
      • Avg. integration over many instances
      • No greater tendency for transcribed pseudogenes to be under selective constraint
      • Need a way of easily defining degree of constraint on sequence (not so easy for non-coding)
      Figure labels: Processed pseudogene; Non-processed pseudogene; Gene; Transcribed; Ka/Ks; Measurement of short-time variation (pN+pS)
      Zheng et al. (2007) Gen. Res.

  15. Biochemically Active Regions Don't All Appear to Be under Constraint
      [Spectrum: Integrated]
      • Integrating & averaging results over larger and larger sets
      • Comparison of integrated quantities
      [ENCODE Consortium, Nature 447, 2007]

  16. Grand Summary: Biochemical Activity vs. Sequence Constraints
      [Spectrum: Integrated]
      • Not all constrained sequence is annotated in some fashion
      • Exactly how are things defined in terms of overlap?
      "At the outset of the ENCODE Project, many believed that the broad collection of experimental data would nicely dovetail with the detailed evolutionary information derived from comparing multiple mammalian sequences to provide a neat 'dictionary' of conserved genomic elements, each with a growing annotation about their biochemical function(s). In one sense, this was achieved; the majority of constrained bases in the ENCODE regions are now associated with at least some experimentally-derived information about function. However, we have also encountered a remarkable excess of unconstrained experimentally-identified functional elements, and these cannot be dismissed for technical reasons. This is perhaps the biggest surprise of the pilot phase of the ENCODE Project, and suggests that we take a more 'neutral' view of many of the functions conferred by the genome."
      [ENCODE Consortium, Nature 447, 2007]

  17. An Added Challenge for modENCODE - Comparing Results among Evolutionarily Distant Organisms
      [Spectrum: Integrated ++]
      • Making thresholds and statistics comparable across organisms (so we can really say whether or not worm has more or less novel transcription than fly)
      • Can we relate tissues and developmental states betw. organisms?
      • Can we deal with seq. constraint in a uniform fashion?
      • Defining orthologs betw. organisms and lineage-specific genes
      • Proper liaison with ENCODE

  18. Tale of TARs: Where is modENCODE?
      [Spectrum: Tech, Production, Integrated ++ ?]
      • Where are we on Tech, Production, Integrated Spectrum?
        • More action on seq. tech call than AWG
      • Tech stuff is important, towards getting usable, comparable error-parameterized annotation blocks
      • Importance of comparative genomics issues
        • Constraint, orthologs