1 / 60

Prediction of RNA structure

Wi'07. Bafna. The central dogma. The DNA contains the code to all of the essential machinery of the cell, mainly proteins.Information about proteins is in genes. The genic information is first copied to an intermediate molecule (transcription)The transcript goes outside the nucleus and is transla

Ava
Download Presentation

Prediction of RNA structure

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


    1. Wi’07 Bafna Prediction of RNA structure Vineet Bafna

    2. Wi’07 Bafna The central dogma The DNA contains the code to all of the essential machinery of the cell, mainly proteins. Information about proteins is in genes. The genic information is first copied to an intermediate molecule (transcription) The transcript goes outside the nucleus and is translated into a protein.

    3. Wi’07 Bafna

    4. Wi’07 Bafna The Central Dogma However, not all genes are translated!

    5. Wi’07 Bafna Example: tRNA

    6. Wi’07 Bafna RNA Could it be that, there is a large trove of non-coding RNA, that is not discovered yet? The answer seems to be YES Today’s lecture: How can we computationally identify ncRNA?

    7. Wi’07 Bafna Novel ncRNAs are abundant: Ex: miRNAs miRNAs were the second major story in 2001 (after the genome). Subsequently, many other non-coding genes have been found

    8. Wi’07 Bafna

    9. Wi’07 Bafna Scientific American, 2006

    10. Wi’07 Bafna Bacterial Riboswitches -Breaker Lab

    11. Wi’07 Bafna ncRNA gene finding The RNA world hypothesis: RNA are as important as protein coding genes. Many undiscovered ncRNA exist Computational methods for discovering ncRNA are not mature. What are the clues to non-coding genes? Structure: Given a sequence, what is the structure into which it can fold with minimum energy?

    12. Wi’07 Bafna RNA structure: Basics Key: RNA is single-stranded. Think of a string over 4 letters, AC,G, and U. The complementary bases form pairs. A <-> U, C <-> G, G <-> U Base-pairing defines a secondary structure. The base-pairing is usually non-crossing.

    13. Wi’07 Bafna RNA structure: Basics Key: RNA is single-stranded. Think of a string over 4 letters, AC,G, and U. The complementary bases form pairs. Base-pairing defines a secondary structure. The base-pairing is usually non-crossing.

    14. Wi’07 Bafna RNA structure: pseudoknots Sometimes, unpaired bases in loops form ‘crossing pairs’. These are pseudoknots.

    15. Wi’07 Bafna De novo RNA structure prediction Any set of non-crossing base-pairs defines a secondary structure. Abstract Question: Given an RNA string find a structure that maximizes the number of non-crossing base-pairs Incorporate the true energetics of folding Incorporate Pseudo-knots

    16. Wi’07 Bafna

    17. Wi’07 Bafna De novo RNA prediction: a combinatorial formulation Input: A string over A,C,G,U A pairs with U, C pairs with G, G with U Output: A subset of possible base-pairings of maximum size (score) such that No two base-pairs intersect Each nucleotide pairs with at most one other nucleotide How can we compute this set efficiently?

    18. Wi’07 Bafna RNA structure Nussinov’s algorithm Score B for every base-pair. No penalty for loops. No pesudo-knots. Let W(i,j) be the score of the best structure of the subsequence from i to j.

    19. Wi’07 Bafna Obtaining RNA structure

    20. Wi’07 Bafna Obtaining RNA Structure

    21. Wi’07 Bafna RNA structure: example

    22. Wi’07 Bafna RNA Structure: Details

    23. Wi’07 Bafna Base-pairing & Loops Base-pairs arise from complementary nucleotides Single-stranded Stack is when 2 base-pairs are contiguous Loops arise when there are unpaired bases. They are characterized by the number of base-pairs that close it. Hairpin: closed by 1 base-pair Bulge/Interior Loops (2 base-pairs) Multiple Internal loops (k base-pairs)

    24. Wi’07 Bafna Scoring Loops, multi-loops Zuker-Turner Energy Rules http://www.bioinfo.rpi.edu/~zukerm/rna/energy/node2.html Stacking Energies Energy for Bulges and Interior Loops Energy for Multi-loops

    25. Wi’07 Bafna Improving the efficiency Computing structure on a 2000nt sequence is about 20s (2GHz, 2Gb) (Wexler et al., 2006) How much time would it take to search 10000 candidate regions? What would be the impact of an O(n) speedup on this search?

    26. Wi’07 Bafna A triangle inequality After the computations are done, A triangle inequality is satisfied

    27. Wi’07 Bafna Towards more efficient algorithms Consider the main loop again We only need to branch, when i forms a base-pair. Therefore, w.l.o.g

    28. Wi’07 Bafna A candidate list approach Note that if C is the maximum size of the candidate list, then the running time is O(Cn2) How can we reduce the size of the candidate list? Consider only those k s.t. bk is complementary to bi

    29. Wi’07 Bafna Candidate list Consider i,j,k s.t. W(i,j) = B(ri,rk)+W(i+1,k-1)+W(k+1,j) (1) Claim: For all j’ >= j j is not a candidate in computing W(i,j’)

    30. Wi’07 Bafna Proof

    31. Wi’07 Bafna The candidate list algorithm For all i = n down to 1 For all j = i+1 to n

    32. Wi’07 Bafna How large is the candidate list? Here, the answer depends very much on the energy computations. If we simply count the number of base-pairs, the candidate list should be large (probably) For more realistic functions, it is very small (experiments with n upto 1000, showed C=6)

    33. Wi’07 Bafna Empirical estimates of C

    34. Wi’07 Bafna A thermodynamic argument It is conjectured, and empirically shown that RNA has the polymer-zeta property Probability that a string of length m folds into an RNA structure is b/mc for constants b,c>0 Note that if c>0, the number of indices j that base-pair with i is O(1) In practice c~1.5

    35. Wi’07 Bafna End of Lecture

    36. Wi’07 Bafna Incorporating pseudoknots in structure prediction Q: Determine optimal structure, when pseudoknots are allowed. Pseudoknots are only loosely defined. If any interleaving is allowed, then simply select a structure in which a max number of nucleotides can be paired. Under some restrictive notions, the structure problem becomes NP-hard.

    37. Wi’07 Bafna Akutsu’s Simple pseudoknots A region [i0,k0] forms a simple pseudoknot if there exist positions j’0, j0 s.t. Each (i,j) ? M satisfies either i0 ? i < j’0 ? j < j0 or j’0 ? i < j0 ? j ? k0 Within one of the two groups, there is no interleaving. i<i’<j’0 or j’0 ?i<i’ implies j>j’.

    38. Wi’07 Bafna Simple pseudoknots A region [i0,k0] forms a simple pseudoknot if it can be partioined into 3 regions s.t. every base-pair has the following properties It either connects (A,B) , or (B,C) Two base-pairs within (A,B) (or, (B,C)) do not interleave

    39. Wi’07 Bafna Simple pseudoknotted structure Collection of simple pseudoknots (Mi) and other basepairs M’. None of the simple pseudoknot regions overlap. M’ is a secondary structure without pseudoknots for the sequence obtained by excising all pseudoknotted regions. Can we identify a sub-structure for the pseudo-knots?

    40. Wi’07 Bafna Akutsu’s algorithm (Main idea) Rotate the sequence so that it forms two loops. Each allowed base-pair is a horizontal line in exactly one of the two loops. The horizontal lines in the two loops are non-crossing. All base-pairs can be ordered.

    41. Wi’07 Bafna Main idea The base-pairs have a total order that a d.p. can exploit. (i,j) < (i’,j’) if one of the following holds i<i’<j<j’ i<i’<j’<j<j0 j’0<i’<i<j<j’

    42. Wi’07 Bafna Frontiers of D.P. We need to construct an increasing path of base-pairs. Consider a ‘frontier’ triple (i,j,k), corresponding to the region (i0,i) and (j,k) We have the following cases: (i,j) form a base-pair (j,k) form a base-pair Neither (i,j) nor (j,k) form a base-pair

    43. Wi’07 Bafna Ordering frontiers ‘frontier’ triple (i,j,k) >= (i’,j’,k’) if and only if i’ <= i j <= j’ k’ <= k

    44. Wi’07 Bafna Ordering Frontiers (i,j,k) >= (i’,j’,k’) if and only if

    45. Wi’07 Bafna Recurrences SL(i,j,k) is the optimum score of a frontier (i,j,k) assuming that i<j’0<j<j0<k (i,j) form a base-pair SR(i,j,k) is the optimum score of a frontier (i,j,k) assuming that i<j’0<j<j0<k, and (j,k) form a base-pair SM(i,j,k) is the optimum score of a frontier (i,j,k) assuming that i<j’0<j<j0<k, and Neither (i,j) nor (j,k) form a base-pair

    46. Wi’07 Bafna Computing SL

    47. Wi’07 Bafna Putting it all together

    48. Wi’07 Bafna Computing optimal pseudo-knotted structures Let S(i,j) be the opt score of a simple pseudoknotted structure. Either (i,j) is a simple pseudoknot S(i,j) = Spseudo(i,j) Or, not S(i,j) = max{ v(ai,aj) + S(i+1,j-1), max k { S(i,k-1)+S(k,j)} }

    49. Wi’07 Bafna Time Complexity We compute Spseudo(i,j) for all (i,j). Each computation is O(n3). Total time? For each i0, perform the following computation: Spseudo(i0,k0) = max i0?i<j ?k0{SL(i,j,k0),…} Total time O(n4) Can you do better?

    50. Wi’07 Bafna Recursive pseudoknots Each loop of the RNA structure is a recursive pseudoknotted RNA structure. Optimal recursive pseudoknotted RNA structure problem can be solved in O(n5) time.

    51. Wi’07 Bafna Open Questions There should be a direct generalization of simple pseudoknots to the following structure. Rivas and Eddy do consider such a generalization, but a systematic treatment is missing. Q: Given a pseudo-knotted structure, is it an Akutsu simple pseudoknotted structure? Linear time algorithm was devised recently for this problem.

    52. Wi’07 Bafna Structure as a proxy for ncRNA Any genomic region with an energetically favorable fold is a candidate ncRNA? Rivas and Eddy show otherwise. They use LOD score L = Pr(s|RNA)/Pr(s|Null) Z-score Z = G-?/?

    53. Wi’07 Bafna LOD scores for putative ncRNA A region of C. elegans is containing two tRNAs is chosen. The position of tRNA is indicated by stars. The true positions and the LOD-score correspond. Stronger results in M. janaschii (AT%=70) Weak results in E.coli (GC%=50)

    54. Wi’07 Bafna Correlation with GC content GC content detector has results similar to structure prediction!

    55. Wi’07 Bafna Significance of detected regions Test for significance was using Z-scores on shuffled sequences. Earlier tests shuffled sequences using a large window. Shuffling high scoring windows led to lower Z-scores.

    56. Wi’07 Bafna Testing chimeric sequence Test on chimeric sequence. A real tRNA was embedded in 2000bp random sequence with similar GC-composition.

    57. Wi’07 Bafna Increasing window size tRNA Z-score computation with a larger window (85nt).

    58. Wi’07 Bafna For very strong signals, a Z-score > 4 may still be significant.

    59. Wi’07 Bafna Z-score distribution Z-scores of 415 tRNA genes. 98% have Z-score lower than 4.

    60. Wi’07 Bafna Structure We hoped that RNA structure computation would give us a clue into detecting novel ncRNA Turns out that random DNA can ‘fold’ into a low energy structure.

More Related