
Fprom promoter predictions

Fprom promoter predictions. Victor Solovyev & Igor Seledtsov, Royal Holloway College, University of London; Softberry Inc. Results of promoter search on genes with known mRNAs by different promoter-finding programs. Reproduced from Liu and States (2002) Genome Research 12:462-469.


Presentation Transcript


  1. Fprom promoter predictions
     Victor Solovyev & Igor Seledtsov
     Royal Holloway College, University of London; Softberry Inc.

  2. Results of promoter search on genes with known mRNAs by different promoter-finding programs. Reproduced from Liu and States (2002) Genome Research 12:462-469.

  3. RegSite functional motifs DB
     FORMAT for Regulatory motifs DB
     AC  Accession number ... RSP***** (Plant RegSite) or RSA***** (Animal RegSite)
     DT  Date, Author ... 19-05-2002 IS
     OS  Organism/species
     GS  Gene short name
     GF  Gene full name
     SN  RegSite short name
     SD  Description /full name/ of RegSite
     DE  More detailed description of RegSite
     BF  Binding Factor
     SA  RegSite activity character
     SL  RegSite localization in relation to the gene /5'-end; Intron; Exon; 3'-end/
     SP  Accurate localization of RegSite /for example, -243:-226/
     SB  Number of RegSite /1-box, 2-box, .../
     SC  Structural type of RegSite /1-comp, 2-comp, .../
     DI  Distance between the boxes of RegSite of the "2-box", "3-box", etc. type
     SQ  Sequences of RegSite /only conserved nucleotides in upper case/
     SS  AC of Similar sites /for example, RSP00013, RSP00045/
     CS  Consensus of RegSite
     AL  Reference(s) for the Alignment of related RegSites /for example, "see: [1]"/
     MA  Reference to the Matrix for Related RegSites
     PT  Promoter type /TATA, TATA-less/
     PS  Promoter sequence reference
     OR  Organ specificity
     TI  Tissue specificity
     CE  Cell specificity
     ON  Ontogenesis specificity
     ES  Environmental signal
     EC  Environmental conditions
     MT  Methods of discovery of RegSite
     RE  References

  4. Fprom learning and testing
     Extracted 2072 non-redundant vertebrate promoters from EPD
     1) Learning set: 1425 promoters
     2) Control set: 650 promoters
     Considered 2 types. Learning set includes:
     Type 0: 1059 promoters (non-TATA promoters)
     Type 1: 366 promoters (TATA promoters)

  5. TATA-promoters selected features
     Feature 0: Sites in region -200 : -101 in Direct chain
     Feature 1: Sites in region -100 : -1 in Direct chain
     Feature 2: Sites in region -100 : -1 in Reverse chain
     Feature 3: TATA box maximal weight on interval -45 : -25
     Feature 4: TATA box average score on interval -45 : -25
     Feature 5: CG-content
     Feature 6: GGCG, CCGCC, CGCC, GCGG - content
     Feature 7: Position triplet matrix in TSS-50 : TSS+30 region
     Feature 8: TSS-100 : TSS+100 region correlation
     Feature 9: Protein-DNA-twist
     Feature 10: Protein-induced deformability
     Feature 11: Hexaplets in region -200 : -45
     Feature 12: Hexaplets in region 0 : 40
     Feature 13: Triplets in region -200 : -45
     Feature 14: Triplets in region 0 : 40
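Features 3 and 4 (maximal weight and average score of a TATA box on the interval -45 : -25) can be illustrated with a weight-matrix scan. The sketch below is only an illustration: `TOY_PWM` is a made-up 4-position matrix, not Fprom's TATA matrix, and the convention that the interval refers to the start position of the matrix placement is an assumption.

```python
# Toy 4-position weight matrix favouring "TATA".
# Hypothetical values for illustration only, NOT the matrix used by Fprom.
TOY_PWM = [
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 1.0},
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 1.0},
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
]


def window_scores(seq, tss, pwm, lo=-45, hi=-25):
    """Score every matrix placement whose start lies in [tss+lo, tss+hi].

    `tss` is the 0-based index of the candidate transcription start site.
    Placements that would run off either end of the sequence are skipped.
    """
    scores = []
    for start in range(tss + lo, tss + hi + 1):
        site = seq[start:start + len(pwm)]
        if start < 0 or len(site) < len(pwm):
            continue
        scores.append(sum(col[base] for col, base in zip(pwm, site)))
    return scores


def tata_features(seq, tss, pwm=TOY_PWM):
    """Return (maximal weight, average score) over the window."""
    scores = window_scores(seq, tss, pwm)
    return max(scores), sum(scores) / len(scores)


# A synthetic sequence with one TATA at index 20, i.e. 30 bp upstream of tss=50.
example_seq = "G" * 20 + "TATA" + "G" * 40
max_weight, avg_score = tata_features(example_seq, tss=50)
```

The maximum captures the single best TATA-like site, while the average reflects how TATA-like the whole window is; the slide lists both as separate features.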

  6. Improving discrimination
     Step: 1, Selected 1 features, MD:15.121549 Significant Features list: 7
     Step: 2, Selected 2 features, MD:15.911261 Significant Features list: 7 3
     Step: 3, Selected 3 features, MD:16.480181 Significant Features list: 7 3 0
     Step: 4, Selected 4 features, MD:16.691770 Significant Features list: 7 3 0 9
     Step: 5, Selected 5 features, MD:16.959790 Significant Features list: 7 3 0 9 4
     Step: 6, Selected 6 features, MD:17.273613 Significant Features list: 7 3 0 9 4 10
     Step: 7, Selected 7 features, MD:17.533121 Significant Features list: 7 3 0 9 4 10 12
     Step: 8, Selected 8 features, MD:19.514215 Significant Features list: 7 3 0 9 4 10 12 14
     Step: 9, Selected 9 features, MD:19.727513 Significant Features list: 7 3 0 9 4 10 12 14 2
     Step: 10, Selected 10 features, MD:19.915504 Significant Features list: 7 3 0 9 4 10 12 14 2 8
     Step: 11, Selected 11 features, MD:19.998658 Significant Features list: 7 3 0 9 4 10 12 14 2 8 11
     Step: 12, Selected 12 features, MD:21.665564 Significant Features list: 7 3 0 9 4 10 12 14 2 8 11 13
     Step: 13, Selected 13 features, MD:22.282032 Significant Features list: 7 3 0 9 4 10 12 14 2 8 11 13 5
     Step: 14, Selected 14 features, MD:22.287819 Not significant! Features list: 7 3 0 9 4 10 12 14 2 8 11 13 5 6
     Step: 15, Selected 15 features, MD:22.287839 Not significant! Features list: 7 3 0 9 4 10 12 14 2 8 11 13 5 6 1
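The log above has the shape of a greedy forward feature selection: one feature is added per step and the class separation (MD) never decreases. A minimal sketch of that procedure follows, assuming MD is the squared Mahalanobis distance between the promoter and non-promoter classes under a pooled covariance. The slide does not say how "Significant" is decided, so the `min_gain` cut-off here is a stand-in, and all function names are hypothetical.

```python
def mean(rows, cols):
    """Column means of `rows`, restricted to the feature indices in `cols`."""
    return [sum(r[c] for r in rows) / len(rows) for c in cols]


def pooled_cov(a, b, cols):
    """Pooled within-class covariance matrix of classes a and b."""
    k = len(cols)
    s = [[0.0] * k for _ in range(k)]
    for rows in (a, b):
        m = mean(rows, cols)
        for r in rows:
            d = [r[c] - m[i] for i, c in enumerate(cols)]
            for i in range(k):
                for j in range(k):
                    s[i][j] += d[i] * d[j]
    dof = len(a) + len(b) - 2
    return [[v / dof for v in row] for row in s]


def solve(mat, vec):
    """Solve mat * x = vec by Gaussian elimination with partial pivoting."""
    n = len(vec)
    aug = [row[:] + [v] for row, v in zip(mat, vec)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def mahalanobis_sq(a, b, cols):
    """Squared Mahalanobis distance between class means on features `cols`."""
    diff = [x - y for x, y in zip(mean(a, cols), mean(b, cols))]
    weights = solve(pooled_cov(a, b, cols), diff)
    return sum(d * w for d, w in zip(diff, weights))


def forward_select(a, b, n_features, min_gain=1e-3):
    """Add, one at a time, the feature that most increases the distance.

    Stops when the best gain falls below `min_gain` (an assumed stand-in
    for the significance test behind the "Not significant!" lines).
    """
    selected, best, history = [], 0.0, []
    while len(selected) < n_features:
        candidates = [f for f in range(n_features) if f not in selected]
        scores = {f: mahalanobis_sq(a, b, selected + [f]) for f in candidates}
        f = max(scores, key=scores.get)
        if scores[f] - best < min_gain:
            break
        selected.append(f)
        best = scores[f]
        history.append((list(selected), best))
    return history


# Tiny synthetic data: feature 0 separates the classes, features 1 and 2 barely do.
promoters = [[5.0, 1.0, 0.2], [6.0, 2.0, 0.9], [5.5, 1.2, 0.4], [6.5, 1.9, 0.7]]
background = [[1.0, 1.1, 0.3], [2.0, 1.8, 0.8], [1.5, 1.0, 0.5], [2.5, 2.1, 0.6]]
history = forward_select(promoters, background, n_features=3)
```

On real data the weights returned by `solve` are also the coefficients of the linear discriminant function, which is presumably where the LDF scores in the prediction output come from.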

  7. Accuracy of TATA promoter identification
     e1:0.05 e2:0.081882( 1523) thr: -5.081 L:196
     e1:0.10 e2:0.035430( 659) thr: -3.149 L:454
     e1:0.20 e2:0.007903( 147) thr: -0.4503 L:2038
     e1:0.30 e2:0.002097( 39) thr: +1.615 L:7684
     e1:0.40 e2:0.000538( 10) thr: +3.729 L:29967
     e1:0.50 e2:0.000269( 5) thr: +5.972 L:59935
     e1:0.60 e2:0.000161( 3) thr: +7.345 L:99893
     e1:0.70 e2:0.000054( 1) thr: +8.841 L:299679
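The slide does not define its columns, but the numbers are internally consistent with one reading: the value in parentheses is a false-positive count, e2 is that count divided by about 18600 negative examples, and L is the number of base pairs per false positive over 299679 bp. Both constants below are inferred from the table itself (from the first and last rows), not stated in the source, so treat this as a consistency check rather than a definition.

```python
# (e2, false-positive count, L) copied from the slide, in row order.
ROWS = [
    (0.081882, 1523, 196),
    (0.035430, 659, 454),
    (0.007903, 147, 2038),
    (0.002097, 39, 7684),
    (0.000538, 10, 29967),
    (0.000269, 5, 59935),
    (0.000161, 3, 99893),
    (0.000054, 1, 299679),
]

TOTAL_BP = 299679   # assumption: inferred from the last row (1 false positive, L = 299679)
N_NEGATIVES = 18600  # assumption: inferred from 1523 / 0.081882


def bp_per_false_positive(count, total_bp=TOTAL_BP):
    """Average number of base pairs scanned per false positive (rounded down)."""
    return total_bp // count


def false_positive_rate(count, n_negatives=N_NEGATIVES):
    """Fraction of negative examples scoring above the threshold."""
    return count / n_negatives


l_matches = [bp_per_false_positive(c) == l for _, c, l in ROWS]
e2_matches = [abs(false_positive_rate(c) - e2) < 5e-7 for e2, c, _ in ROWS]
```

Read this way, the row with thr: +1.615 means that accepting the loss of 30% of true TATA promoters (e1) leaves roughly one false prediction per 7.7 kb.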

  8. Features selected for non-TATA promoters
     Feature 0: Sites in region -200 : -101 in Direct chain
     Feature 1: Sites in region -100 : -1 in Direct chain
     Feature 2: Sites in region -100 : -1 in Reverse chain
     Feature 3: TATA box maximal weight on interval -45 : -25
     Feature 4: TATA box sub-maximal weight on interval -45 : -25
     Feature 5: TATA box average score on interval -45 : -25
     Feature 6: CG-content (number of CG dinucleotides)
     Feature 7: GGCG, CCGCC, CGCC, GCGG - content
     Feature 8: Position triplet matrix in TSS-5 : TSS+40 region
     Feature 9: Position triplet matrix in TSS-50 : TSS+-10 region
     Feature 10: TSS-100 : TSS+100 region correlation
     Feature 11: Duplex stability - disrupt energy
     Feature 12: Protein-induced deformability
     Feature 13: Z-DNA stabilizing energy
     Feature 14: Hexaplets in region -150 : -1
     Feature 15: Hexaplets in region 0 : 40
     Feature 16: Triplets in region -150 : -1
     Feature 17: Triplets in region 0 : 40
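Features 6 and 7 are simple composition counts. A minimal sketch, assuming raw overlapping counts over whatever region is being scored (the slide does not say how, or whether, Fprom normalises them); the function names are hypothetical.

```python
GC_RICH_MOTIFS = ("GGCG", "CCGCC", "CGCC", "GCGG")  # motifs listed under Feature 7


def count_overlapping(seq, motif):
    """Count occurrences of `motif` in `seq`, allowing overlaps."""
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count


def cg_dinucleotides(seq):
    """Feature 6: number of CG dinucleotides."""
    return count_overlapping(seq.upper(), "CG")


def gc_motif_content(seq, motifs=GC_RICH_MOTIFS):
    """Feature 7: total number of GC-rich motif occurrences."""
    seq = seq.upper()
    return sum(count_overlapping(seq, m) for m in motifs)
```

`str.count` is not used because it counts non-overlapping matches only, which would under-count runs such as GGCGG (one GGCG and one GCGG).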

  9. Accuracy of prediction of non-TATA promoters
     e1:0.2 e2:0.014117( 3968) thr: +1.654 L:75
     e1:0.3 e2:0.005244( 1474) thr: +3.747 L:203
     e1:0.4 e2:0.002316( 651) thr: +5.344 L:460
     e1:0.5 e2:0.001256( 353) thr: +6.523 L:848
     e1:0.6 e2:0.000559( 157) thr: +7.708 L:1908
     e1:0.7 e2:0.000320( 90) thr: +8.629 L:3329
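Tables of this kind can be produced by sweeping a score threshold over a control set. The sketch below assumes e1 is the fraction of true promoters scoring below the threshold and e2 the fraction of non-promoter positions scoring at or above it; that reading is an inference from the numbers, and the function names and toy scores are hypothetical.

```python
def error_rates(pos_scores, neg_scores, threshold):
    """Return (e1, e2, false-positive count) for one threshold."""
    missed = sum(1 for s in pos_scores if s < threshold)
    false_pos = sum(1 for s in neg_scores if s >= threshold)
    return missed / len(pos_scores), false_pos / len(neg_scores), false_pos


def threshold_for_e1(pos_scores, target_e1):
    """Pick a threshold that rejects at most `target_e1` of the true promoters.

    The threshold is an observed positive score, so with tied scores the
    realised e1 can be lower than the target.
    """
    ranked = sorted(pos_scores)
    n_missed = min(int(target_e1 * len(ranked)), len(ranked) - 1)
    return ranked[n_missed]


# Toy discriminant scores, made up for illustration.
true_promoters = [1.0, 2.0, 3.0, 4.0, 5.0]
non_promoters = [-1.0, 0.0, 0.5, 1.5, 2.5]

thr = threshold_for_e1(true_promoters, 0.2)
e1, e2, n_false = error_rates(true_promoters, non_promoters, thr)
```

Raising the threshold trades sensitivity for specificity, which is the pattern in both accuracy tables: each step up in e1 cuts the false-positive count several-fold.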

  10. Examples of prediction in -5000 : +200 regions
      Sequence 1 of 1, Name: EP07055 (+) Hs MT2A; Metallothionein 2A.
      Length of sequence: 5199
      4 promoter/enhancer(s) are predicted
      Promoter Pos: 5001 LDF: +8.891 TATA box at 4972 +3.581 TATAAACA
      Enchancer at: 4813 Score: +6.979
      Promoter Pos: 3042 LDF: +1.532 TATA box at 3011 +6.455 TATATAAT
      Promoter Pos: 2498 LDF: +1.127 TATA box at 2469 +7.713 TATATATA
      Promoter Pos: 1124 LDF: +1.036 TATA box at 1096 +3.900 TATAAACA
      ------------------------------------------------------------
      Sequence 1 of 1, Name: EP11068 (+) Hs histone H2A; Histone 1
      Length of sequence: 5200
      2 promoter/enhancer(s) are predicted
      Promoter Pos: 4993 LDF: +4.034 TATA box at 4964 +7.155 TATAAATA
      Enchancer at: 5068 Score: +6.316
      Promoter Pos: 437 LDF: -0.325 TATA box at 406 +3.711 TATAAGAA
      ------------------------------------------------------------
      Sequence 1 of 1, Name: EP11104 (+) Hs b'-globin; beta-globin. hg17_
      Length of sequence: 5200
      3 promoter/enhancer(s) are predicted
      Promoter Pos: 5001 LDF: +6.455 TATA box at 4969 +7.614 CATAAAAG
      Promoter Pos: 4491 LDF: +0.770 TATA box at 4448 +7.817 TATATATA
      Promoter Pos: 2405 LDF: +0.645 TATA box at 2371 +4.563 AATAAAAG
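For downstream processing, the promoter lines in this output can be pulled into records with a regular expression. The pattern below is written against the text exactly as it appears on this slide; the real Fprom output may contain other line types or spacing, and the `Prediction` record and `parse_predictions` function are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class Prediction(NamedTuple):
    pos: int                    # predicted promoter (TSS) position
    ldf: float                  # linear discriminant function score
    tata_pos: Optional[int]     # TATA box position, if reported
    tata_score: Optional[float]
    tata_seq: Optional[str]


PROMOTER_RE = re.compile(
    r"Promoter Pos:\s*(\d+)\s+LDF:\s*([+-]?\d+\.\d+)"
    r"(?:\s+TATA box at\s+(\d+)\s+([+-]?\d+\.\d+)\s+([ACGT]+))?"
)


def parse_predictions(text):
    """Extract every 'Promoter Pos: ...' entry from Fprom-style output."""
    results = []
    for m in PROMOTER_RE.finditer(text):
        pos, ldf, t_pos, t_score, t_seq = m.groups()
        results.append(Prediction(
            int(pos), float(ldf),
            int(t_pos) if t_pos else None,
            float(t_score) if t_score else None,
            t_seq,
        ))
    return results


# The MT2A example from the slide.
MT2A_OUTPUT = """\
Promoter Pos: 5001 LDF: +8.891 TATA box at 4972 +3.581 TATAAACA
Enchancer at: 4813 Score: +6.979
Promoter Pos: 3042 LDF: +1.532 TATA box at 3011 +6.455 TATATAAT
Promoter Pos: 2498 LDF: +1.127 TATA box at 2469 +7.713 TATATATA
Promoter Pos: 1124 LDF: +1.036 TATA box at 1096 +3.900 TATAAACA
"""

mt2a = parse_predictions(MT2A_OUTPUT)
```

The TATA-box group is optional so that a promoter line without a reported TATA box would still parse, with the three TATA fields set to `None`.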

  11. Promoter prediction in genomic sequences
      • Annotate genes
      • Take the region upstream of the 5'-CDS of predicted genes, or upstream of known mRNA (for example, for ENCODE sequences we selected -5000 bp)
      • Run Fprom in these regions
      • Select the rightmost or the maximum-scoring promoters
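The last two steps of this procedure can be sketched as follows. This assumes plus-strand genes with 0-based coordinates and predictions given as (position, LDF score) pairs; the function names are hypothetical and minus-strand handling is left out.

```python
def upstream_window(cds_start, length=5000):
    """Half-open interval [start, end) covering `length` bp upstream of the CDS.

    Clipped at the start of the sequence; plus strand only.
    """
    return max(0, cds_start - length), cds_start


def pick_rightmost(predictions):
    """Prediction closest to the gene start (largest position in the window)."""
    return max(predictions, key=lambda p: p[0])


def pick_max_scoring(predictions):
    """Prediction with the highest LDF score."""
    return max(predictions, key=lambda p: p[1])


# (position, LDF) pairs taken from the histone H2A example on the previous slide.
h2a = [(4993, 4.034), (437, -0.325)]
```

The two rules agree for all three examples shown on the previous slide, where the top-scoring prediction is also the one nearest the annotated start; they can differ when a strong upstream prediction outscores a weaker one next to the gene.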

  12. Location relative to mRNA start
      We have 251 genes derived from RefSeq mRNAs with > 40 bp of 5'-noncoding sequence.
      For 90% of them (226), promoters were predicted:
      95 with TATA boxes (TATA+)
      131 without TATA boxes (non-TATA)

  13. Accuracy of prediction

  14. What we have not done

  15. PromH with orthologous sequences

  16. To ENCODE:
      • Provide and improve promoter/TSS annotations of the 44 ENCODE regions to use them as a test bed for promoter prediction software
      • Start work on functional motif characterization of promoters. Motifs database.
