
Martin Tompa University of Washington

An Exact Method for Finding Short Motifs in Sequences, with Application to the Ribosome Binding Site Problem. Martin Tompa, University of Washington. Slides courtesy Yoonkyong Lee.


Presentation Transcript


  1. An Exact Method for Finding Short Motifs in Sequences, with Application to the Ribosome Binding Site Problem
     • Martin Tompa, University of Washington
     • Slides courtesy Yoonkyong Lee

  2. Ribosome Binding Site Problem
     • Identify the short mRNA motif, called the ribosome binding site, in the 5’ untranslated region of a typical prokaryote
     • Why ribosome?
       • The SD site is complementary to a short sequence near the 3’ end of the ribosome’s 16S rRNA
     • Why prokaryote?
       • SD sites are very similar across several prokaryotes
     • Shine-Dalgarno sequence: AAGGAGG

  3. Ribosome Binding Site Problem
     • Shine-Dalgarno sequence: AAGGAGG

  4. Problem Definition
     • Search for instances of a motif of length 5 within the 20-mer just 5’ of the translation start site of each of N ≈ 4000 open reading frames
     • Instances of the motif may match inexactly
     • Given 4000 sequences, each of length 20, find a sequence s of length 5 that approximately matches a substring of many of them

  5. Contributions
     • The given solution is applicable to other sequence analysis problems that involve identifying short motifs
     • The problem is important as a step in validating true genes and in identifying the correct translation start sites

  6. Statistical Significance of Motif Occurrences
     • Observation
       • A good measure for comparing sequences should take into account both the absolute number of occurrences and the background distribution
     • Solution
       • For each k-mer s, record the number N_s of sequences containing s, allowing up to c residue substitutions
       • Estimate how unlikely it is to have N_s occurrences if the sequences were generated according to the background distribution
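
The counting step on this slide can be sketched in a few lines of Python. This is a brute-force illustration only, not the method used in the talk; the function names and the toy sequences are made up for the example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def contains_approx(seq, s, c):
    """True if some window of seq matches s with at most c substitutions."""
    k = len(s)
    return any(hamming(seq[i:i + k], s) <= c for i in range(len(seq) - k + 1))

def count_sequences(seqs, s, c):
    """N_s: number of sequences with at least one approximate occurrence of s."""
    return sum(contains_approx(seq, s, c) for seq in seqs)

# Toy upstream regions (invented for illustration, not real data).
upstream = ["TTAAGGAGGTATTC", "CCCCCCCCCCCCCC", "ATAGGTGGTTTTTT"]
print(count_sequences(upstream, "GGAGG", c=1))
```

Note that N_s counts sequences, not occurrences: a sequence with several matching windows still contributes 1.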

  7. How to estimate “how unlikely”
     • X: a single random sequence of the specified length L, generated according to the background distribution
     • p_s: the probability that X contains at least one occurrence of the k-mer s, allowing for c substitutions
     • Assumption: the N sequences are independent
     • The associated z-score: $z_s = (N_s - N p_s) / \sqrt{N p_s (1 - p_s)}$
       • $N p_s$: expected number of sequences containing at least one occurrence of s
       • $\sqrt{N p_s (1 - p_s)}$: standard deviation
     • z_s measures how unlikely it is to have N_s occurrences of s, given the background distribution
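
Under the independence assumption, the number of sequences containing s is binomial with parameters N and p_s, which gives the expected value and standard deviation named on this slide. A minimal sketch of the resulting z-score follows; the numbers in the usage line are hypothetical, chosen only to show the call.

```python
import math

def z_score(n_s, n, p_s):
    """z-score of observing n_s sequences containing s among n independent
    sequences, each containing s with probability p_s."""
    expected = n * p_s                      # mean of Binomial(n, p_s)
    sd = math.sqrt(n * p_s * (1.0 - p_s))   # its standard deviation
    return (n_s - expected) / sd

# Hypothetical values, for illustration only.
print(z_score(n_s=600, n=4000, p_s=0.1))
```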

  8. How to estimate p_s – Step 1
     • Construct a deterministic finite automaton (DFA) M that accepts strings containing a substring matching s with at most one substitution
     • States: one for every string u matching a prefix of s with at most one substitution → 1.5|s|^2 + O(|s|) states
     • Transition function: given the string u and the input character σ, move to the state corresponding to the longest suffix of uσ that matches a prefix of s with at most one substitution
     • Construction time: O(|s|^2)
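
The construction on this slide can be sketched as follows. This is a straightforward version that searches for the longest valid suffix directly, so it does not achieve the O(|s|^2) construction time stated above; the names `build_dfa`, `accepts`, and `ACCEPT` are assumptions made for the example. All strings of full length |s| are merged into one absorbing accepting state.

```python
ALPHABET = "ACGT"
ACCEPT = "<accept>"   # single absorbing accepting state

def matches_prefix(u, s, c=1):
    """True if u matches the length-|u| prefix of s with at most c substitutions."""
    return len(u) <= len(s) and sum(a != b for a, b in zip(u, s)) <= c

def build_dfa(s, c=1):
    """Return the transition table {(state, char): state}; the start state is ''."""
    delta = {}
    todo, seen = [""], {""}
    while todo:
        u = todo.pop()
        for ch in ALPHABET:
            w = u + ch
            # Longest suffix of u+ch that is itself a state string
            # (the empty suffix always qualifies, so next() cannot fail).
            v = next(w[i:] for i in range(len(w) + 1) if matches_prefix(w[i:], s, c))
            if len(v) == len(s):
                v = ACCEPT          # a full approximate match has been read
            delta[(u, ch)] = v
            if v != ACCEPT and v not in seen:
                seen.add(v)
                todo.append(v)
    for ch in ALPHABET:
        delta[(ACCEPT, ch)] = ACCEPT
    return delta

def accepts(delta, x):
    """Run the DFA on x and report whether it ends in the accepting state."""
    state = ""
    for ch in x:
        state = delta[(state, ch)]
    return state == ACCEPT

dfa = build_dfa("GGAGG")
print(len({state for state, _ in dfa}))   # number of states, accepting state included
print(accepts(dfa, "TTGGTGGAA"), accepts(dfa, "TTTTTTTTT"))
```

For |s| = 5 the non-accepting states are the strings of length 0 to 4 within one substitution of the corresponding prefix, which agrees with the 1.5|s|^2 + O(|s|) count on the slide.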

  9. How to estimate p_s – Step 2
     • Given the transition probabilities a_ij of the Markov chain G that generates X, transform M into a Markov chain M’
       • by assigning a_ij to those transitions of M labeled j that leave states whose corresponding string u ends with i
     • Estimate p_s, the probability of going from the start state to the accepting state in |X| steps in M’, by |X| successive products of a vector with the transition matrix, which has Θ(|s|^4) entries. Since the matrix is sparse, this takes O(|X|·|s|^2) time
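
A minimal sketch of this step, assuming a first-order background chain given as an initial distribution `init` and a transition table `a` (both are hypothetical inputs, as are the function names). Each loop iteration is one sparse vector-matrix product: every state has only four outgoing transitions, so the work per step is proportional to the number of states. The sketch is restricted to at most one substitution, as on the slides, because then every state reached after the first character is a non-empty string whose last character is the previous character of X.

```python
ALPHABET = "ACGT"
ACCEPT = "<accept>"

def build_dfa(s):
    """DFA from Step 1: states are strings matching a prefix of s with <= 1 substitution."""
    ok = lambda u: sum(x != y for x, y in zip(u, s)) <= 1
    delta, todo, seen = {}, [""], {""}
    while todo:
        u = todo.pop()
        for ch in ALPHABET:
            w = u + ch
            v = next(w[i:] for i in range(len(w) + 1) if ok(w[i:]))
            v = ACCEPT if len(v) == len(s) else v
            delta[(u, ch)] = v
            if v != ACCEPT and v not in seen:
                seen.add(v)
                todo.append(v)
    return delta

def prob_contains(s, length, init, a):
    """p_s: probability that a random sequence of the given length contains
    s with at most one substitution.

    init[j] : probability that the first character is j
    a[i][j] : probability that character j follows character i
    """
    delta = build_dfa(s)
    mass = {"": 1.0}      # probability of each non-accepting state
    p_accept = 0.0        # probability already absorbed by the accepting state
    for _ in range(length):
        nxt = {}
        for u, p in mass.items():
            for ch in ALPHABET:
                # The last character of u is the previous character of X.
                q = p * (a[u[-1]][ch] if u else init[ch])
                v = delta[(u, ch)]
                if v == ACCEPT:
                    p_accept += q
                else:
                    nxt[v] = nxt.get(v, 0.0) + q
        mass = nxt
    return p_accept

# Uniform background, used here only as an example input.
uniform = {ch: 0.25 for ch in ALPHABET}
chain = {ch: dict(uniform) for ch in ALPHABET}
print(prob_contains("GGAGG", 20, uniform, chain))
```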

  10. Why O(|X|·|s|^2)?
      • Transition probability matrix: (1.5|s|^2 + O(|s|)) × (1.5|s|^2 + O(|s|)), i.e. Θ(|s|^4) entries
      • Only 4 non-zero entries per row (Σ = {A, T, G, C})
      • Computation of one matrix-vector product: O(|s|^2)
      • |X| products required, so O(|X|·|s|^2) in total

  11. Experimental Results
      • 14 prokaryotic genomes
      • 10 bacteria
        • 9 of 10: strong dominance of the SD sequence, AAGGAGG
        • One exceptional case: M. genitalium
      • 4 archaea
        • Predominance of GGTGA or GGTG

  12. Bacterial Genomes - 1
      • H. influenzae
        • TAAGGAGGTGATCCAA
        • The highest simulated statistical significance score: 4

  13. Bacterial Genomes - 2
      • M. genitalium
        • GAGGTGATCCAC
        • The simulated statistical significance score: 5 - 7 (no significance)
        • Lechel [1991]: describes a possible alternative ribosome recognition site specific to M. genitalium

  14. Interesting Motifs
      • Synechocystis sp.
        • 2nd highest scoring 7-mer: CATCGCC (Ms=16)
        • Results of highest scoring 7-mers of sequences (L=40), allowing no substitutions: GGCGATCGCC (HIP1)
      • H. influenzae
        • Results of highest scoring 7-mers of sequences (L=40), allowing no substitutions: AAGTGCCGGT

  15. Archaea
      • M. jannaschii
        • GGAGGTGATCCAG

  16. Conclusion
      • Enumerates short motifs together with their exact z-scores
      • Exhaustive and exact
      • Not efficient for longer and more complex motifs that allow multiple insertions, deletions, and substitutions
