1 / 36

Remote RNA homology detection

Remote RNA homology detection. Sean R. Eddy HHMI Janelia Farm Research Campus. Probability theory is nothing but common sense reduced to calculation. Laplace (1819). RNAs conserve both secondary structure and sequence. There are many RNA structures of interest.

dannon
Download Presentation

Remote RNA homology detection

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Remote RNA homology detection Sean R. Eddy HHMI Janelia Farm Research Campus Probability theory is nothing but common sense reduced to calculation. Laplace (1819)

  2. RNAs conserve both secondary structure and sequence

  3. There are many RNA structures of interest Wade Winkler and Ron Breaker, ChemBioChem 4:1024 2003

  4. Recognizing homologous RNAs is not easy

  5. RNAs have a lot of information in pairwise correlations

  6. Models before algorithms: a short sermon Probabilistic (Bayesian) inference : no arbitrary scores Prob(data | model, parameters) Always write down the probability of everything. - Steve Gull DJC MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge, 2003

  7. Single residue scores extend to pairwise residue scores probability of observing residue a (or residues a,b) ‘here’ in the model (‘here’ = aligned to a particular residue, site, or state) background frequencies of residue a (or residues a,b) Steve Altschul, J. Mol. Biol. 219:555, 1991

  8. Formal grammars as models of biological sequences Noam Chomsky, 1958

  9. An SCFG parse tree corresponds to an RNA structure David Searls, Am. Scientist 80:579, 1992

  10. Probabilistic models of biological sequences HMM algorithms (sequence) Viterbi Forward Forward-Backward O(MN) O(M2N) O(MN) SCFG algorithms (RNA structure) CYK Inside Inside-Outside O(MN2) O(M3N3) O(MN3) Goal optimal alignment P(sequence | model) EM parameter estimation memory complexity: time complexity (general): time complexity (as used): • we can analyze target sequences with secondary structure models; • but the algorithms are computationally expensive. Richard Durbin, Sean Eddy, Graeme Mitchison, Anders Krogh Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids Cambridge, 1998

  11. Covariance models (profile SCFGs): query construction Infernal User’s Guide, 2007: http://infernal.janelia.org/

  12. CM is organized on a consensus ‘guide tree’ Infernal User’s Guide, 2007: http://infernal.janelia.org/

  13. Each node contains one or more SCFG ‘states’ Infernal User’s Guide, 2007: http://infernal.janelia.org/

  14. CM state graph looks complex but follows simple rules Infernal User’s Guide, 2007: http://infernal.janelia.org/

  15. Homologous structures: different parses from same model P(seq, parse tree | CM) is the product of all transition and emission probabilities used by the parse tree Infernal User’s Guide, 2007: http://infernal.janelia.org/

  16. Structure, parse, alignment : different views of same data

  17. Covariance models (profile SCFGs): summary • Query is nonpseudoknotted RNA secondary structure/sequence • either consensus of a multiple RNA alignment, or a single RNA structure • Position-specific scoring parameters are derived from probabilities • easiest: frequencies of events in a deep alignment (maximum likelihood estimation) • usually: maximum a posteriori estimation from counts and a Dirichlet prior • in the limit of one sequence: position-independent substitution matrix • Generative probabilistic model assigns P(seq, parse tree | CM), by factorizing into a product of transition (indel) and emission probabilities • affine gap (gap-open, gap-extend) • position-specific, but 0th order Markov: no stacking correlations • Now, to be useful, we also need: • algorithm for identifying best alignment given a sequence: the CYK algorithm • algorithm for calculating likelihood P(seq | CM): theInside algorithm

  18. Dynamic programming algorithms for SCFGs Recursively calculates log probability that CM subgraph rooted at v generates subsequence i..j. This is a 3D dynamic programming lattice, requiring O(ML^2) memory. most states: Bifurcation calculations cost O(L) per cell, and there are O(ML^2) cells in the lattice: so algorithm is O(ML^3) time worst case. Most models have few bifurcations, so running time is between ML^2 and ML^3 in practice. bifurcation states:

  19. 3D SCFG dynamic programming lattices answer here: root state 0, i=1, j=L complete sequence initialize here: end states, length 0 subsequences on diagonal

  20. Bacillus subtilis RNase P example (E. coli query)

  21. C. elegans RNase P (found with human RNase P query)

  22. The glycine riboswitch A glycine-dependent riboswitch that uses cooperative binding to control gene expression. Mandal et al. (Breaker lab), Science 306:275 (2004)

  23. Memory is no longer a limitation, but time is The divide and conquer algorithm (2002): Myers/Miller extended to 3D SCFG lattices. Memory requirement now O(L^2 log M) A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure. S.R. Eddy, BMC Bioinformatics, 3:18 (2002)

  24. Ways to accelerate CM searches • custom filtering programs (tRNAscan-SE, Lowe and Eddy, 1997) • BLAST prefilter (Rfam database does this) • linear profile-HMM filters, including “rigorous filters” (Zasha Weinberg, Larry Ruzzo) • extend the BLAST algorithm to 3D (Diana Kolbe, work in progress) • more generally: banded dynamic programming (various strategies possible) • query-dependent banding (QDB): Nawrocki and Eddy, 2007

  25. QDB algorithm QDB recursively calculates the probability that a subgraph rooted at state v will generate a subsequence of length d, for all v and d, using the generative model. Subsequence lengths with negligible probability may then be ignored in DP alignment to any target sequence. Query-dependent banding (QDB) for faster RNA similarity searches. Eric Nawrocki and Sean Eddy, PLoS Computational Biology, in press (2007)

  26. Examples of QDB bands Query-dependent banding (QDB) for faster RNA similarity searches. Eric Nawrocki and Sean Eddy, PLoS Computational Biology, in press (2007)

  27. CYK dynamic programming with QDB Query-dependent banding (QDB) for faster RNA similarity searches. Eric Nawrocki and Sean Eddy, PLoS Computational Biology, in press (2007)

  28. One free parameter: negligible probability mass threshold Query-dependent banding (QDB) for faster RNA similarity searches. Eric Nawrocki and Sean Eddy, PLoS Computational Biology, in press (2007)

  29. Four- to six-fold acceleration with QDB, not far from O(MN), the same complexity as BLAST or Smith/Waterman (albeit with a big constant) Query-dependent banding (QDB) for faster RNA similarity searches. Eric Nawrocki and Sean Eddy, PLoS Computational Biology, in press (2007)

  30. QDB has little effect on sensitivity/specificity

  31. BRAliBase III benchmark Eva Freyhult, Jonathan Bollback, and Paul Gardner, Genome Research 17:117 (2007)

  32. Software integration & availability: a final sermon http://pfam.janelia.org/ http://hmmer.janelia.org/ soon: first release of Easel, the code library underlying HMMER and Infernal http://rfam.janelia.org/ http://infernal.janelia.org/ all freely available; currently GPL, soon to be under the (BSD-like) Janelia Software License

  33. The Rfam Consortium led by Sam Griffiths-Jones (Sanger, Cambridge UK) Zasha Weinberg (Yale) Larry Ruzzo (U Washington, Seattle) Ron Breaker (Yale) Norm Pace (U Colorado, Boulder) Infernal, RNA homology search Diana Kolbe Eric Nawrocki HMMER, protein homology search Sergi Castellano Alex Coventry ncRNA genefinding: Jennifer Davila-Aponte Seolkyoung Jung Elena Rivas secret agent man: Tom Jones Now that scientific research has become a regular profession on the payroll of the state, the observer can no longer afford to concentrate for extended periods of time on one subject, and must work even harder. Gone are the days of yore... Santiago Ramon y Cajal, 1916 HHMI Janelia Farm http://selab.janelia.org/

  34. Mixture Dirichlet priors: base pairs Query-dependent banding (QDB) for faster RNA similarity searches. Eric Nawrocki and Sean Eddy, PLoS Computational Biology, in press (2007)

  35. Mixture Dirichlet priors: singlets Query-dependent banding (QDB) for faster RNA similarity searches. Eric Nawrocki and Sean Eddy, PLoS Computational Biology, in press (2007)

More Related