
Noncoding RNA Genes Pt. 2 SCFGs


Presentation Transcript


  1. Noncoding RNA Genes Pt. 2: SCFGs. CS374, Vincent Dorie

  2. Motivation • Noncoding RNA genes can be anywhere • Noncoding RNA genes can do anything

  3. Location • rRNA, snRNA • Exons? • Introns • Viral vectors

  4. Function

  5. Function, pt. 2

  6. Overview • “RSEARCH: Finding homologs of single structured RNA sequences” by Klein and Eddy (2003) • “Pairwise RNA Structure Comparison with Stochastic Context-Free Grammars” by Holmes and Rubin (2002)

  7. Comparison - Methodology • RSEARCH • DART (Stemloc) • [Figure, labeled "Sequence", not preserved in transcript]

  8. Comparison, Pt. 2 - Uses • RSEARCH: find parts of a genome which may be homologous to the query sequence; more practical in comparative genomics • DART (Stemloc): investigate a specific sequence suspected of being homologous to the query sequence

  9. Comparison, Pt. 3 - Complexity • RSEARCH: $O((M - B)LD + BLD^2)$ to scan; $O(M^4)$ to calculate statistics • DART (Stemloc): between $O(LM)$ and $O(L^3M^3)$

  10. Background: Context Free Grammars • Four-tuple {N, T, S, P} • N is a set of nonterminals • T is a set of terminals • S is the start symbol, S ∈ N • P is a set of productions

  11. Context Free Grammars, pt. 2: Sample Grammar • N = {S, A, B} • T = {a, u, c, g, ε} • P = { S -> A | B, A -> aAc | aBc | g, B -> g }

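  12. Context Free Grammars, pt. 3: Parse Trees • Parse: aagcc • [Figure: two parse trees for aagcc. Both begin S -> A -> aAc; the inner A then expands as aAc with A -> g in one tree, and as aBc with B -> g in the other]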
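
The ambiguity of the sample grammar can be checked mechanically. The following is a minimal sketch, assuming the grammar exactly as written on the "Sample Grammar" slide; the names `GRAMMAR`, `parses`, and `expand`, and the nested-tuple encoding of trees, are illustrative choices and not part of the original slides.

```python
from functools import lru_cache

# Sample grammar from the slides: each nonterminal maps to its right-hand sides.
GRAMMAR = {
    "S": [("A",), ("B",)],
    "A": [("a", "A", "c"), ("a", "B", "c"), ("g",)],
    "B": [("g",)],
}


@lru_cache(maxsize=None)
def parses(symbol, text):
    """Return every parse tree deriving `text` from `symbol`, as nested tuples."""
    if symbol not in GRAMMAR:  # terminal: it must match the text exactly
        return ((symbol,),) if text == symbol else ()
    trees = []
    for rhs in GRAMMAR[symbol]:
        for children in expand(rhs, text):
            trees.append((symbol,) + children)
    return tuple(trees)


def expand(rhs, text):
    """Yield one tuple of subtrees for each way `rhs` can derive `text`."""
    if not rhs:
        if text == "":
            yield ()
        return
    head, rest = rhs[0], rhs[1:]
    # No production here derives the empty string, so every symbol
    # consumes at least one character.
    for cut in range(1, len(text) - len(rest) + 1):
        for first in parses(head, text[:cut]):
            for others in expand(rest, text[cut:]):
                yield (first,) + others


trees = parses("S", "aagcc")
print(len(trees))  # prints 2
```

The two trees differ only in whether the innermost g comes from A -> g or from B -> g, which is exactly the kind of ambiguity a stochastic grammar resolves by scoring parses.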

  13. Stochastic CFG • Each production associated with a probability • Probabilities for all productions starting from a given nonterminal sum to one • Superset of HMM • Assigns a probability to a parse • E.g. S -> A, 0.3 | B, 0.7
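
To make "assigns a probability to a parse" concrete, here is a small sketch. Only the probabilities for S -> A (0.3) and S -> B (0.7) come from the slide; the probabilities for the A and B productions are hypothetical values chosen so that each nonterminal's productions sum to one, and the names `PROBS`, `is_normalized`, and `parse_probability` are introduced here for illustration.

```python
import math

# Production probabilities keyed by (left-hand side, right-hand side).
# S -> A | B uses the slide's 0.3 / 0.7; all other values are hypothetical.
PROBS = {
    ("S", "A"): 0.3, ("S", "B"): 0.7,
    ("A", "aAc"): 0.5, ("A", "aBc"): 0.3, ("A", "g"): 0.2,
    ("B", "g"): 1.0,
}


def is_normalized(probs):
    """Check that the productions of each nonterminal sum to one."""
    totals = {}
    for (lhs, _), p in probs.items():
        totals[lhs] = totals.get(lhs, 0.0) + p
    return all(math.isclose(t, 1.0) for t in totals.values())


def parse_probability(productions, probs=PROBS):
    """Probability of a parse: the product of its production probabilities."""
    return math.prod(probs[p] for p in productions)


# The two parses of "aagcc" under the sample grammar.
via_A = [("S", "A"), ("A", "aAc"), ("A", "aAc"), ("A", "g")]
via_B = [("S", "A"), ("A", "aAc"), ("A", "aBc"), ("B", "g")]

print(parse_probability(via_A), parse_probability(via_B))
```

Summing the two values gives the total probability of the string under these assumed parameters, which is the quantity the Inside algorithm computes without enumerating parses.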

  14. Pairwise (profile) SCFG • Terminals in each production can exist in each of two strings • E.g. W -> x_i y_k V x_j y_l

  15. RSEARCH: pSCFG Simplified • Each secondary structure specifies (most of) a grammar, creating a “Model Architecture” • Eschews probabilistic interpretation • Problem becomes fitting target to model architecture • [Figure, labeled "Sequence", not preserved in transcript]

  16. Node Types vs. Node States • Node types are what we want to do given the model (e.g. MATP is match pair) • Node state represents what happens when scanning a target sequence • E.g. node type is MATP, but the target sequence doesn’t have a pair in that location -> insert a gap

  17. Node States • Set of node states possible for node type

  18. Gap Classes • Gap class per node type/state pair

  19. Transition Scores • Gap class determines transition scores • Gap penalties are affine
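
"Affine" means a gap's cost is a linear function of its length with a separate opening cost. A minimal sketch, assuming one common convention (the opening penalty covers the first gapped residue); the function name and the example penalty values are hypothetical and are not RSEARCH's actual parameters.

```python
def affine_gap_penalty(length, gap_open, gap_extend):
    """Penalty for a gap of `length` residues.

    One opening cost, plus an extension cost for every residue after the
    first. Other conventions charge the extension cost for the first
    residue as well.
    """
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


# Hypothetical penalties: opening a gap costs 8, each extra residue costs 2.
for n in (1, 2, 4):
    print(n, affine_gap_penalty(n, gap_open=8, gap_extend=2))
```

Under this scheme one long gap is cheaper than several short gaps of the same total length, which is the usual motivation for affine penalties.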

  20. Emission Scores • Emission scores determined empirically

  21. Parameterizing the Model: Emission Scores • Substitution matrices • Scores are observed / random (log-odds of observed vs. chance frequencies)
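
An "observed / random" score is typically stored as a log-odds value. The sketch below assumes a base-2 logarithm and treats the chance frequency of an aligned pair as the product of two background frequencies; the function name and the example frequencies are hypothetical, not values taken from any RIBOSUM matrix.

```python
import math


def log_odds_score(observed, background_x, background_y):
    """log2 of the observed pair frequency over the frequency expected by chance."""
    expected = background_x * background_y
    return math.log2(observed / expected)


# Hypothetical frequencies: a pair seen four times more often than chance.
print(log_odds_score(0.25, 0.25, 0.25))  # prints 2.0
```

Positive scores mark substitutions seen more often than chance in the training alignment, negative scores mark those seen less often, and a score of zero means the pair is uninformative.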

  22. RIBOSUM Matrices • Start with MSA • Whose MSA? • RIBOSUM[X, Y] • Sequences X% identical are reweighted to sum to 1 • Only sequences Y% identical are counted in making matrices

  23. Model Parameters • Gap open penalty (single and pair) • Gap extension penalty (single and pair) • Internal start penalty • Internal end penalty

  24. Solution • Guess and check • “We might have been able to derive a more robust parameter set had we used a more comprehensive set of tests, but the long running time required by RSEARCH makes such an approach infeasible.”

  25. Digression: Biostatistics • Confidence intervals • Expectation values

  26. Gumbel Distribution • Parameterized by $\lambda$ and $K$ • $E = K N e^{-\lambda x}$, $P = 1 - e^{-E}$
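
The two formulas on the Gumbel slide translate directly into code. A minimal sketch: λ and K are the fitted Gumbel parameters, x is the hit score, and N is taken here to be the size of the search space (an assumption, since the slide does not define N). The function names and the numeric inputs are hypothetical.

```python
import math


def e_value(score, lam, K, N):
    """Expected number of chance hits scoring at least `score`: K * N * exp(-lam * score)."""
    return K * N * math.exp(-lam * score)


def p_value(score, lam, K, N):
    """Probability of at least one chance hit: 1 - exp(-E)."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate when E is tiny.
    return -math.expm1(-e_value(score, lam, K, N))


# Hypothetical parameters, for illustration only.
E = e_value(score=30.0, lam=0.7, K=0.1, N=2.1e7)
P = p_value(score=30.0, lam=0.7, K=0.1, N=2.1e7)
print(E, P)
```

For small E the two quantities are nearly equal, which is why E-values and P-values are often reported interchangeably for strong hits.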

  27. Gumbel Distribution, pt. 2 • $K$ and $\lambda$ depend on G+C content of target database • For database with heterogeneous G+C content, compute $K$ and $\lambda$ for G+C bins

  28. Putting it All Together • Run against database substrings of length two times the query • Greedily take K best, non-overlapping hits • Recover alignments • Report: score, position in database, alignment, E-value, P-value • Statistics need to be calculated for every query and target database
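
The "greedily take K best, non-overlapping hits" step can be sketched as follows. This is an illustration of the greedy idea only, not RSEARCH's actual code: the `(score, start, end)` tuple layout with inclusive coordinates and the name `greedy_hits` are assumptions, and the lowercase `k` is the number of hits to keep, unrelated to the Gumbel parameter K.

```python
def greedy_hits(hits, k):
    """Pick up to `k` best-scoring hits such that no two chosen hits overlap.

    `hits` is an iterable of (score, start, end) with inclusive coordinates.
    """
    chosen = []
    for score, start, end in sorted(hits, key=lambda h: h[0], reverse=True):
        if len(chosen) == k:
            break
        # Keep the hit only if it lies entirely before or after every chosen hit.
        if all(end < s or start > e for _, s, e in chosen):
            chosen.append((score, start, end))
    return chosen


# Hypothetical hits: the second overlaps the first and is skipped.
hits = [(50, 0, 10), (45, 5, 15), (40, 20, 30), (10, 40, 50)]
print(greedy_hits(hits, 2))  # prints [(50, 0, 10), (40, 20, 30)]
```

Greedy selection is not guaranteed to maximize the total score of the reported set, but it always keeps the single best hit and is simple to apply after the scan.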

  29. Time • For a 113 nt sequence against a $2.1 \times 10^7$ nt database: 2.9 CPU days, 2% of it computing statistics • For a 330 nt sequence against a $2.1 \times 10^7$ nt database: 38 CPU days, 7% of it computing statistics • Parallelized to 33 minutes and 7.4 hours respectively
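
The parallel timings imply a similar speedup for both queries. A small arithmetic sketch; `speedup` is a helper introduced here, and the ratio is simply CPU time divided by wall-clock time, not a figure reported in the source.

```python
def speedup(cpu_days, wall_hours):
    """Ratio of total CPU time to wall-clock time, both converted to hours."""
    return cpu_days * 24 / wall_hours


short_query = speedup(2.9, 33 / 60)  # 113 nt query: 2.9 CPU days in 33 minutes
long_query = speedup(38, 7.4)        # 330 nt query: 38 CPU days in 7.4 hours
print(round(short_query), round(long_query))
```

Both ratios come out above 120, so the two runs appear to have used a comparable degree of parallelism.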

  30. Shifting Gears: Fold Envelopes • Pre-enumerates the pSCFG search space • Presents conditional versions of dynamic programming algorithms • User-defined complexity

  31. Fold Envelopes, pt. 2 • Conceptualize search over grammars and parse trees • Each node W_u in the tree accounts for a subsequence • Inside sequence: accounts for X_{i..j} • Outside sequence: accounts for X_{0..i} and X_{j..L}

  32. Analogy: Message Passing • Inside algorithm: likelihood of a sequence over all possible parses • Cocke-Younger-Kasami algorithm: maximum likelihood parse of a sequence • Inside-Outside algorithm: expected number of times each grammar production is used • Use fold envelopes to limit messages by restricting the subsequences considered

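  33. The Inside Algorithm • To compute $a(i, j, V) = P(x_i \ldots x_j \text{ produced by } V)$ • $a(i, j, V) = \sum_{X} \sum_{Y} \sum_{k=i}^{j-1} a(i, k, X)\, a(k+1, j, Y)\, P(V \to XY)$ • [Figure: V spans i..j, split into X over i..k and Y over k+1..j; credit: Batzoglou]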
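
The recursion on the Inside Algorithm slide can be sketched for a grammar in Chomsky normal form. Several things here are assumptions: the base case a(i, i, V) = P(V -> x_i) is the standard one but is not shown on the slide; the dictionary encodings of the rules and the toy grammar S -> S S | a with its probabilities are hypothetical; and the optional `envelope` argument is a simplified stand-in for a fold envelope, treated as a set of allowed (i, j) spans.

```python
from collections import defaultdict


def inside(seq, binary, emit, start="S", envelope=None):
    """Return P(seq | grammar), summed over all parses.

    binary:   {(V, X, Y): prob} for rules V -> X Y
    emit:     {(V, terminal): prob} for rules V -> terminal
    envelope: optional set of (i, j) spans (inclusive, j > i) that may be
              filled in; None means every span is allowed.
    """
    n = len(seq)
    a = defaultdict(float)  # a[i, j, V] = P(V derives seq[i..j])
    # Base case: single characters come from terminal rules.
    for i, ch in enumerate(seq):
        for (V, t), p in emit.items():
            if t == ch:
                a[i, i, V] += p
    # Recursion: build longer spans from pairs of adjacent shorter spans.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            if envelope is not None and (i, j) not in envelope:
                continue
            for (V, X, Y), p in binary.items():
                for k in range(i, j):
                    a[i, j, V] += a[i, k, X] * a[k + 1, j, Y] * p
    return a[0, n - 1, start]


# Hypothetical toy grammar: S -> S S (0.4) | a (0.6).
BINARY = {("S", "S", "S"): 0.4}
EMIT = {("S", "a"): 0.6}

full = inside("aaa", BINARY, EMIT)
restricted = inside("aaa", BINARY, EMIT, envelope={(0, 1), (0, 2)})
print(full, restricted)
```

Excluding the span (1, 2) removes one of the two parses of "aaa", so the restricted value is half of the full one. This is the trade that envelopes make: fewer spans to fill in, at the cost of ignoring parses outside the envelope.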

  34. Constructing Fold Envelopes • Constrain to possible secondary structures • Constrain to primary sequence alignment

  35. Summary • RSEARCH to find a set of possible homologs, sorted by score and statistics • Fold Envelopes permit greater search depth in case of unfolded comparisons • RSEARCH employs simplified pSCFGs • Fold Envelopes are useful over full spectrum of comparisons but represent more computationally complex situations
