
McPromoter – an ancient tool to predict transcription start sites

Uwe Ohler (uwe.ohler@duke.edu), Institute for Genome Sciences and Policy, Duke University (BDGP/Univ Erlangen).


Presentation Transcript


  1. McPromoter – an ancient tool to predict transcription start sites Uwe Ohler uwe.ohler@duke.edu Institute for Genome Sciences and Policy Duke University (BDGP/Univ Erlangen)

  2. An extremely simplified view of eukaryotic transcription • Specific information about functional context of genes: proximal promoter/enhancers • Binding sites of specific transcription factors confer activation at the right developmental stage or tissue • General information: the core promoter • Region around the transcription start site (TSS) where RNA polymerase II (pol-II) interacts with general transcription factors • Potentially far away from the translation start site

  3. Probabilistic modeling of promoters • Goal: find TSS / proximal promoters ab initio • Alternative to cDNA alignments • Independent of and in addition to gene prediction • Probabilistic modeling allows us to deal with uncertainty • Models for classes of related sequences • Models represent our knowledge about sequences in the form of parameters • Parameters are automatically estimated using a representative set of sequences • A model gives the probability that a sequence belongs to a class, here: promoter or non-promoter (coding, non-coding)
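As a worked formula for the last bullet, one standard way to turn per-class sequence models into a class probability is Bayes' rule. This is a sketch under assumptions: the class priors P(c) are not mentioned on the slide, and the slide does not say that McPromoter uses exactly this decision rule.

```latex
P(c \mid s) = \frac{P(s \mid c)\, P(c)}{\sum_{c'} P(s \mid c')\, P(c')},
\qquad c, c' \in \{\text{promoter}, \text{coding}, \text{non-coding}\}
```

Here s is the sequence window being scored and P(s | c) is the probability assigned to it by the model trained for class c.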

  4. McPromoter system structure

  5. Non-promoter classes: Stationary Markov chains • Markov chain as tree • Every node corresponds to a context • Contains probability distribution • Typical order: 6 (4,096 overall parameters) • Probability of a sequence • Approximation: Restrict context to the last N symbols (N-th order chain) • Variations on Markov chains • Variable Order: Leaves on different levels • Interpolated: Combination of parameter values from different levels
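The N-th order approximation on this slide can be sketched in a few lines of Python. This is a minimal illustration, not McPromoter's code: the function names, the add-one pseudocount smoothing, and the flat dictionary (instead of a context tree) are all assumptions made for the example.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov_chain(sequences, order, pseudocount=1.0):
    """Estimate P(symbol | previous `order` symbols) from training sequences.

    Returns a function prob(context, symbol). Pseudocount smoothing is an
    assumption of this sketch, not something stated on the slide.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0

    def prob(context, symbol):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return (ctx.get(symbol, 0.0) + pseudocount) / total

    return prob

def log_probability(seq, prob, order):
    """Log-probability of seq, conditioned on its first `order` symbols."""
    return sum(
        math.log(prob(seq[i - order:i], seq[i]))
        for i in range(order, len(seq))
    )

# Number of distinct length-6 contexts over a 4-letter alphabet.
num_contexts = len(ALPHABET) ** 6
```

`num_contexts` evaluates to 4,096, which agrees with the slide's figure if "overall parameters" counts 6-symbol contexts; that reading is an inference, since the slide does not spell out how the parameters are counted.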

  6. Promoter model • Simple approach: Markov chain model • Better: Take structure into account • Generalized hidden Markov model • Each state contains a submodel for a specific promoter part, including an explicit length distribution • Interpolated Markov chains as submodels (Ohler et al., Bioinformatics 1999; PSB 2000)
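To make the generalized hidden Markov model idea concrete, the sketch below scores one fixed segmentation of a sequence: each state contributes the log-probability of its segment length plus the log-probability its submodel assigns to the segment. The state names, length distributions, and the uniform emission model in the usage note are hypothetical placeholders; the actual promoter parts and submodels are described in the cited papers, not on the slide. Transition probabilities are omitted by assuming a fixed left-to-right state order.

```python
import math

def segmentation_log_score(seq, segments, states):
    """Log-score of one fixed segmentation under a generalized HMM.

    segments: list of (state_name, length) pairs covering seq left to right.
    states:   state_name -> (length_dist, emit_log_prob), where length_dist
              maps a segment length to its probability and
              emit_log_prob(subseq) returns a log-probability.
    """
    if sum(length for _, length in segments) != len(seq):
        raise ValueError("segments must cover the sequence exactly")
    total, pos = 0.0, 0
    for name, length in segments:
        length_dist, emit_log_prob = states[name]
        p_len = length_dist.get(length, 0.0)
        if p_len == 0.0:
            return float("-inf")  # this length is impossible for the state
        total += math.log(p_len) + emit_log_prob(seq[pos:pos + length])
        pos += length
    return total
```

For example, with a hypothetical uniform emission model `uniform = lambda s: len(s) * math.log(0.25)` and states `{"a": ({2: 0.5, 3: 0.5}, uniform), "b": ({4: 1.0}, uniform)}`, the call `segmentation_log_score("ACGTAC", [("a", 2), ("b", 4)], states)` adds `log(0.5)` for the length of the first segment to six uniform emission terms. A full decoder would maximize this score over all segmentations, which the sketch does not attempt.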

  7. Example: stat6 promoter http://genes.mit.edu/McPromoter.html

  8. Evaluation of ENCODE regions • Similar problem to alternative splicing: alternative transcription start sites • Traditionally, the window to count true positives has been very large (e.g., -2,000/+2,000), and close predictions within a large window are merged • Evaluate on a per-gene basis, i.e. count a true positive if it hits at least one of the annotated TSSs • Second problem: False positives • After GASP, counting only those predictions internal to the annotated transcripts is the de facto standard • 435 genes / 1,022 different TSSs • Another problem: Circularity? (use of Eponine) (Reese et al., Genome Res 2000)
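The evaluation scheme outlined on this slide can be sketched as follows. It is an illustration under simplifying assumptions: positions are plain integers on one strand of one sequence, the data layout (`tss`, `span`) and function name are invented for the example, and a prediction that is neither near an annotated TSS nor inside an annotated transcript is put in a separate "unknown" bin.

```python
def evaluate(predictions, genes, window=2000):
    """Classify predicted TSS positions against annotated genes.

    predictions: iterable of predicted positions (integers).
    genes: name -> {"tss": [positions], "span": (start, end)}.
    A prediction is a true positive if it lies within `window` of any
    annotated TSS, a false positive if it instead falls inside an
    annotated transcript span, and "unknown" otherwise.
    """
    hit_genes = set()
    tp = fp = unknown = 0
    for pos in predictions:
        matched = {
            name
            for name, info in genes.items()
            if any(abs(pos - t) <= window for t in info["tss"])
        }
        if matched:
            tp += 1
            hit_genes |= matched
        elif any(info["span"][0] <= pos <= info["span"][1]
                 for info in genes.values()):
            fp += 1
        else:
            unknown += 1
    return {"tp": tp, "fp": fp, "unknown": unknown,
            "genes_hit": len(hit_genes)}
```

Counting `genes_hit` separately from `tp` reflects the per-gene view: a gene with several alternative TSSs counts once, however many of its start sites are hit.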

  9. Results in the ENCODE region • Standard parameters, NO repeat masking, merging predictions within 2,000 nt: 695 predictions • Positive region -2,000/+2,000: 204 TP / 197 genes (sn 47%); 77 FP (sp 73%); 414 unknown • More stringent: -500/+500: 169 TP (sn 39%); 101 FP (sp 63%) • Does it make sense to move towards a more detailed evaluation?
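The percentages on this slide can be reproduced with a short calculation. The definitions used here are inferred from the numbers rather than stated on the slides: sensitivity as TP divided by the 435 annotated genes from the previous slide, and "specificity" as TP / (TP + FP), which is what is often called positive predictive value.

```python
def rates(tp, fp, n_genes):
    """Return (sensitivity, specificity) as inferred from the slide's numbers.

    sensitivity = tp / n_genes; specificity = tp / (tp + fp).
    Both definitions are assumptions that happen to match the reported values.
    """
    return tp / n_genes, tp / (tp + fp)

N_GENES = 435  # annotated genes in the ENCODE evaluation set

sn_wide, sp_wide = rates(tp=204, fp=77, n_genes=N_GENES)      # -2,000/+2,000
sn_strict, sp_strict = rates(tp=169, fp=101, n_genes=N_GENES)  # -500/+500

# TP + FP + unknown should account for every prediction.
total_predictions = 204 + 77 + 414
```

Rounded to whole percentages these give 47% and 73% for the wide window and 39% and 63% for the stringent one, and `total_predictions` is 695, matching the slide.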

  10. Thanks to... • Berkeley Drosophila Genome Project: Gerry Rubin, Martin Reese, Suzi Lewis • Erlangen – Institute for Computer Science: Heinrich Niemann, Stefan Harbeck, Georg Stemmer
