
Introduction & Motivation Dataset used Part I – Unbiased word counting

Statistical Analysis for Word Counting in Drosophila Core Promoters. Yogita Mantri, April 27, 2005. Bioinformatics Capstone presentation. Outline: Introduction & Motivation; Dataset used; Part I – Unbiased word counting; Part II – TCAGT-centric word counting; Conclusions and Future work.



  1. Statistical Analysis for Word Counting in Drosophila Core Promoters • Yogita Mantri • April 27, 2005 • Bioinformatics Capstone presentation

  2. Introduction & Motivation • Dataset used • Part I – Unbiased word counting • Part II – TCAGT-centric word counting • Conclusions and Future work

  3. Introduction
     • Regulatory elements are short DNA sequences that control gene expression.
     • They are often found around the Transcription Start Site (TSS), sometimes further upstream.
     • Identification of promoters and regulatory elements is a major challenge in bioinformatics:
       • Regulatory elements are not well conserved.
       • Computational discovery of the TSS is not straightforward.
       • Promoter sequences do not have distinguishable statistical properties.
       • Transcription is a highly cooperative process, including competitive or cooperative binding, which is not completely determined by the rest of the genome’s DNA sequence.

  4. Drosophila Core Promoters • Reference: “Computational analysis of core promoters in the Drosophila Genome”, Ohler, Rubin et al., Genome Biology 2002, 3(12):research0087.1–0087.12 • Above image edited from: http://163.238.8.180/~davis/Bio_327/lectures/Transcription/TranscriptionOver.html

  5. Motivation for project
     • A database of core promoters with experimentally determined TSSs is a huge advantage over other approaches that use only gene upstream regions.
     • Word-counting method to determine significant patterns, inspired by Dr. Peter Cherbas’ earlier work:
       • “The arthropod initiator: the capsite consensus plays an important role in transcription”, Cherbas L, Cherbas P., Insect Biochem Mol Biol. 1993 Jan;23(1):81-90

  6. Introduction & Motivation • Dataset used • Part I – Unbiased word counting • Part II – TCAGT-centric word counting • Conclusions and Future work

  7. The Database of Drosophila Core Promoters
     • Compiled by Sumit Middha. It consists of Drosophila core promoters from three experimental sources.
     • Ohler, Rubin et al.:
       • 1941 promoters
       • Stringent criteria for identifying TSSs, requiring 5’ ends of multiple cDNAs to lie in close proximity.
     • Kadonaga et al.:
       • 205 promoters
       • Changed the TSS to coincide with the A of the Inr consensus TCAGT even if experimental results reported the TSS in the vicinity.
       • The discrepancy was fixed by taking the experimentally reported TSS.
     • Eukaryotic Promoter Database:
       • 1926 promoters
       • Assigned TSS based on experimental data with a precision of +/- 5 bp or better.
     • 3458 sequences after removing redundant entries in the dataset.

  8. Introduction & Motivation • Dataset used • Part I – Unbiased word counting • Part II – TCAGT-centric word counting • Conclusions and Future work

  9. Word Analysis – Part I: Unbiased search
     • Used various statistical measures, such as the Z-score, on all possible n-mers in the entire dataset and in specific windows.
     • The goal was to see whether known patterns of interest were more significantly enriched in promoter sequences than other patterns.

  10. Basic Statistics of the dataset
      • 3458 promoter sequences in the database.
      • First step was a word-frequency analysis (pentamers used for initial analysis).
      • Performed analysis on the following sets:
        • Entire dataset (DS-1)
        • Subset of the above dataset, with only the -20 to +20 region (DS-2)
      • 2 types of analyses, differing in the “random” sequences used:
        • 1st-order Markov chains based on base and transition probabilities of the respective dataset
        • “Non-coding” regions

  11. Random set
      • Generated 100 sets of 1st-order Markov chains.
      • Each set contained the same number of sequences as the original dataset (3458), each having the same length (350).
      • Computed the occurrence of each pentamer in actual and random sequences.
      • For random sequences, calculated the average and S.D. over all sets.
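The random-set procedure on this slide can be sketched roughly as follows. This is a minimal illustration, not the presenter's code: the function names (`fit_first_order`, `sample_chain`, `count_kmers`, `random_sets`) are hypothetical, and the fallback to base frequencies when a base has no observed successor is an assumption made only to keep the sampler from stalling.

```python
import random
from collections import Counter, defaultdict


def fit_first_order(seqs):
    """Estimate base counts and base-to-base transition counts from sequences."""
    base = Counter()
    trans = defaultdict(Counter)
    for s in seqs:
        base.update(s)
        for a, b in zip(s, s[1:]):
            trans[a][b] += 1
    return base, trans


def sample_chain(base, trans, length, rng):
    """Sample one sequence of the given length from a 1st-order Markov chain."""
    if length <= 0:
        return ""
    bases = list(base)
    base_weights = [base[b] for b in bases]
    seq = rng.choices(bases, weights=base_weights)  # first base from base frequencies
    while len(seq) < length:
        nxt = trans[seq[-1]]
        if nxt:
            letters = list(nxt)
            seq += rng.choices(letters, weights=[nxt[c] for c in letters])
        else:
            # Assumption: no observed successor, so fall back to base frequencies.
            seq += rng.choices(bases, weights=base_weights)
    return "".join(seq)


def count_kmers(seqs, k=5):
    """Count every (overlapping) k-mer across all sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def random_sets(seqs, n_sets=100, seed=0):
    """Build n_sets random datasets matching the original in size and sequence lengths."""
    rng = random.Random(seed)
    base, trans = fit_first_order(seqs)
    return [[sample_chain(base, trans, len(s), rng) for s in seqs]
            for _ in range(n_sets)]
```

Calling `count_kmers` on the real dataset and on each element of `random_sets(...)` gives the observed count and the 100 background counts per pentamer that the slide describes.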

  12. Z-score • A test of significance • Mean and S.D calculated over 100 sets • Calculated Z-scores for all pentamers • Looking for pentamers with very high or very low Z-scores
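One plausible reading of the Z-score on this slide is (observed count minus mean random count) divided by the standard deviation of the random counts. The sketch below assumes that reading; the function name `z_scores` is hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption, since the slide does not say which was used.

```python
from statistics import mean, stdev


def z_scores(observed, random_counts):
    """Z-score per word.

    observed:      dict mapping word -> count in the real dataset
    random_counts: list of dicts (one per random set) mapping word -> count
    Words whose random counts have zero spread are skipped (Z is undefined).
    """
    scores = {}
    for word, obs in observed.items():
        samples = [rc.get(word, 0) for rc in random_counts]
        mu = mean(samples)
        sd = stdev(samples)  # assumption: sample S.D.; needs at least 2 random sets
        if sd > 0:
            scores[word] = (obs - mu) / sd
    return scores


# Toy usage with made-up counts (not data from the presentation):
toy = z_scores({"TCAGT": 30}, [{"TCAGT": 10}, {"TCAGT": 12}, {"TCAGT": 14}])
```

Sorting the resulting dictionary by value would surface the pentamers with very high or very low Z-scores.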

  13. Rank of TCAGT and variants in entire dataset

  14. Summary of known pentamers in different windows (table columns: -20 to +20 region, non-overlapping windows, sliding windows)

  15. Z-score Plots of tcagt and variants using sliding windows of 10 bp
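A sliding-window count of a single word could look roughly like the sketch below. It is an illustration under stated assumptions: the function name `windowed_counts` and the `tss_index` parameter (the 0-based index of the TSS in each sequence) are hypothetical, an occurrence is assigned to a window by its start position, and offsets are plain differences from `tss_index` rather than the biological convention that skips position 0.

```python
from collections import Counter


def windowed_counts(seqs, word, tss_index, width=10, step=1):
    """Count occurrences of `word` per sliding window.

    Returns {(lo, hi): count}, where lo and hi are offsets relative to the TSS
    (hi exclusive) and an occurrence is counted if it starts inside the window.
    """
    starts = Counter()
    for s in seqs:
        i = s.find(word)
        while i != -1:
            starts[i - tss_index] += 1
            i = s.find(word, i + 1)  # i + 1 so overlapping hits are also counted
    max_len = max((len(s) for s in seqs), default=0)
    result = {}
    for lo in range(-tss_index, max_len - tss_index - width + 1, step):
        result[(lo, lo + width)] = sum(starts[p] for p in range(lo, lo + width))
    return result
```

Running the same function on each random set and feeding the per-window counts into a Z-score calculation would give one Z-score per window position, which is the kind of curve this slide plots.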

  16. Lesson • Cannot ignore position preference of regulatory motifs!

  17. Introduction & Motivation • Dataset used • Part I – Unbiased word counting • Part II – TCAGT-centric word counting • Conclusions and Future work

  18. Word Analysis – Part II: Guided search, starting with the known INR element TCAGT
      • Identification of INR-enriched regions
      • Identification of synonyms
      • Correlation analysis of INR synonyms
      • Guided search

  19. TCAGT-centric word analysis
      Window     Z-score
      (-3,3)     130.58
      (-4,2)     116.27
      (-2,4)     105.67
      (-5,1)      98.96
      (-6,-1)     95.71
      (-7,-2)     85.83
      (-1,5)      59.23
      (1,6)       47.68
      (2,7)       43.30
      (3,8)       28.79

  20. INR Synonyms
      • Group 1: CTCAG---, ATCAG---, TTCAG---, GTCAG---, -TCAGT--, ---AGTTG, ---AGTCG, --CAGTT-, --CAGTC-
      • Group 2: TTAGT
      • Group 3: ACACT---, -CACTCTG
      • Group 4: -TCACA-, GTCAC--, --CACAC
      • Group 5: TCACTCT
      • Group 6: -CATTC, TCATT-
      • Reference: “Computational analysis of core promoters in the Drosophila Genome”, Ohler, Rubin et al., Genome Biology 2002, 3(12):research0087.1–0087.12

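  21. Binary Tree Representation of Dataset (tree figure: the dataset is split by INR, then TATA, then DPE)
      • TOTAL: 3412
      • INR level: INR+ 1801, INR- 1611
      • TATA level, labels and counts as listed on the slide: TATA+, TATA-, TATA+, TATA-; 397, 1404, 1201, 410
      • DPE level, labels and counts as listed on the slide: DPE+, DPE-, DPE-, DPE+, DPE-, DPE+, DPE-, DPE+; 1172, 232, 79, 321, 331, 369, 76, 832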
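The binary-tree split can be sketched as a sequence of presence/absence tests applied in order. The sketch below is illustrative only: `tree_counts` is a hypothetical name, and the plain substring tests for TCAGT and TATAAA are placeholder assumptions, since the slides do not state the exact criteria (positional windows, synonyms) used to call a promoter INR+, TATA+ or DPE+.

```python
from collections import Counter


def tree_counts(seqs, motif_tests):
    """Count sequences at every node of a binary presence/absence tree.

    motif_tests: ordered list of (name, predicate) pairs; each predicate takes
    a sequence and returns True if the motif is considered present.
    Returns a Counter keyed by the path of labels, e.g. ("INR+", "TATA-").
    """
    counts = Counter()
    for s in seqs:
        path = ()
        for name, has_motif in motif_tests:
            path += (name + ("+" if has_motif(s) else "-"),)
            counts[path] += 1
    return counts


# Placeholder tests (assumptions, not the presenter's criteria):
example_tests = [
    ("INR", lambda s: "TCAGT" in s),
    ("TATA", lambda s: "TATAAA" in s),
]
```

Each key in the result corresponds to one node of the tree, so `counts[("INR+",)]` and `counts[("INR+", "TATA-")]` play the role of the node counts shown on the slide.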

  22. 3 Clusters in INR-positive set (plot; vertical axis ticks from 0.0 to 250.0). Cluster labels: INR (-10, +2), DPE (+20, -30), TATA (-40, -35). Words labeled on the plot: ggtcacact, ggtcacac, ttcagtcg, cggtcacac, tcagt, cggacgtg, tataaaag.

  23. Contingency Matrices for INR, TATA, DPE
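A 2x2 contingency matrix for a pair of motifs, together with the counts expected if the two motifs occurred independently, could be computed along these lines. This is a sketch under assumptions: the names `contingency` and `expected_if_independent` are hypothetical, and the presence tests are supplied by the caller because the slides do not specify them.

```python
def contingency(seqs, has_a, has_b):
    """2x2 table of counts: rows are A present/absent, columns are B present/absent."""
    table = [[0, 0], [0, 0]]
    for s in seqs:
        row = 0 if has_a(s) else 1
        col = 0 if has_b(s) else 1
        table[row][col] += 1
    return table


def expected_if_independent(table):
    """Expected counts under independence: row total * column total / grand total."""
    n = sum(sum(row) for row in table)
    if n == 0:
        return [[0.0, 0.0], [0.0, 0.0]]
    row_tot = [sum(row) for row in table]
    col_tot = [table[0][j] + table[1][j] for j in range(2)]
    return [[row_tot[i] * col_tot[j] / n for j in range(2)] for i in range(2)]
```

Comparing the observed table with the expected one (for instance with a chi-square test) is one common way to ask whether two motifs such as INR and DPE tend to co-occur.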

  24. Possible Alternative TATA and INR Synonyms?? (plot; vertical axis ticks from 0.0 to 90.0; candidate clusters labeled “TATA – 2 ?” and “INR – 2 ?”)

  25. Enrichment further upstream – New Binding Sites?

  26. Next Level of Binary Tree analysis (tree figure)
      • TOTAL: 3412
      • Node labels shown: INR+, INR-, TATA+, TATA-, INR_2+, INR_2-, TATA_2+, TATA_2-, DPE+, DPE-
      • Counts shown, as listed on the slide: 1611, 1801, 410, 1201, 397, 1404; two nodes are marked “?”

  27. Conclusions & Future steps
      • The main goal of this project was to try to identify significant words based only on statistical over-representation.
      • The first part of the analysis, using an unbiased search, was successful only in a very narrow range of positions around the TSS.
      • However, the biased search starting with the Inr consensus revealed the 3 known regulatory elements in that region.
      • An analysis of the Inr-negative set showed over-representation of patterns at the positions where the Inr, TATA and DPE would be expected; these patterns could be synonyms.
      • Thus the word-counting strategy has the potential to reveal:
        • Regulatory motifs and interrelationships that other motif discovery programs cannot
        • Synonyms for regulatory motifs
        • Dependencies among regulatory motifs

  28. Acknowledgements • Dr. Haixu Tang • Dr. Sun Kim • Dr. Peter Cherbas • Sumit Middha • Bioinformatics Research Group
