
CS5263 Bioinformatics


sondrat

Presentation Transcript


  1. CS5263 Bioinformatics Lecture 21 RNA Secondary Structure Prediction

  2. Road map • Biological roles for RNA • What’s “secondary structure”? • How is it represented? • Why is it important? • How to predict?

  3. Central dogma The flow of genetic information transcription translation DNA RNA Protein Replication

  4. Classical Roles for RNA • mRNA - Messenger RNA • tRNA - Transfer RNA (~61 kinds, ~75 nt) • rRNA - Ribosomal RNA (~4 kinds, 120-5k nt) [Figure labels: RNA, Protein, Ribosome]

  5. Classical Roles for RNA • mRNA • tRNA • rRNA Ribosome

  6. “Semi-classical” RNA • snRNA - small nuclear RNA (splicing: U1, etc., 60-300 nt) • RNaseP - tRNA processing (~300 nt) • SRP - signal recognition particle; membrane targeting (~100-300 nt) • tmRNA - resetting stalled ribosomes, destroying aberrant mRNA • Telomerase - (200-400 nt) • snoRNA - small nucleolar RNA (many varieties; 80-200 nt)

  7. New Roles for RNA • Riboswitch: an mRNA regulates its own activity • siRNA (Nobel prize 2006, Fire & Mello) • microRNAs • saRNA: small activating RNA • Hundreds of families • Rfam release 1, 1/2003: 25 families, 55k instances • Rfam release 7, 3/2005: 503 families, 300k instances

  8. Example: Riboswitch

  9. Non-coding RNAs Dramatic discoveries in last 5 years • 100s of new families • Many roles: regulation, transport, stability, catalysis, … • 1% of DNA codes for protein, but 30% of it is copied into RNA, i.e. ncRNA >> mRNA

  10. Take-home message • RNAs play many important roles in the cell beyond the classical roles • Many of these roles are yet to be discovered • RNA functions are determined by structures

  11. RNA structure • Primary: sequence • Secondary: base-pairing • Tertiary: 3D shape

  12. RNA base-pairing • Watson-Crick Pairing • C-G ~3kcal/mole • A-U ~2kcal/mole • “Wobble Pair” G – U ~1kcal/mole • Non-canonical Pairs

  13. tRNA structure

  14. Secondary structure prediction • Given: CAUUUGUGUACCU…. • Goal: • How can we compute that?

  15. Terminology Hairpin Loops Interior loops Stems Multi-branched loop Bulge loop

  16. Pseudoknot • Makes structure prediction hard. Not considered in most algorithms. • Example sequence (5’ to 3’, 49 nt): ucgacuguaaaaaagcgggcgacuuucagucgcucuuuuugucgcgcgc [Figure: pseudoknot structure with positions numbered from 5’ to 3’]
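A pseudoknot means that two base pairs cross each other instead of nesting. The sketch below illustrates that condition only; the function name `has_pseudoknot` and the example pair lists are made up for illustration and do not come from the slides.

```python
def has_pseudoknot(pairs):
    """Return True if any two base pairs cross.

    Each pair is a tuple (i, j) of sequence positions. Two pairs (i, j)
    and (k, l) cross when i < k < j < l, which is exactly what a nested
    (pseudoknot-free) structure forbids.
    """
    # Normalize so that i < j in every pair, then sort by the 5' position.
    ordered = sorted((min(i, j), max(i, j)) for i, j in pairs)
    for idx, (i, j) in enumerate(ordered):
        for k, l in ordered[idx + 1:]:
            if i < k < j < l:
                return True
    return False


# Hypothetical pair lists, for illustration only.
nested_pairs = [(1, 10), (2, 9), (3, 8)]   # a simple stem
crossing_pairs = [(1, 10), (5, 15)]        # second pair starts inside, ends outside
print(has_pseudoknot(nested_pairs), has_pseudoknot(crossing_pairs))
```

The dynamic-programming algorithms on the following slides only ever build structures for which this check is False.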

  17. The Nussinov algorithm • Goal: maximizing the number of base-pairs • Idea: Dynamic programming • Loop matching • Nussinov, Pieczenik, Griggs, Kleitman ’78 • Too simple for accurate prediction, but stepping-stone for later algorithms

  18. The Nussinov algorithm Problem: Find the RNA structure with the maximum (weighted) number of nested pairings • Nested: no pseudoknot • Example sequence: ACCACGCUUAAGACACCUAGCUUGUGUCCUGGAGGUCUAUAAGUCAGACCGCGAGAGGGAAGACUCGUAUAAGCG [Figure: folded secondary structure of the example sequence]

  19. The Nussinov algorithm • Given sequence $X = x_1 \ldots x_N$ • Define DP matrix: $F(i, j)$ = maximum number of base-pairs if $x_i \ldots x_j$ folds optimally • Matrix is symmetric, so let $i < j$

  20. The Nussinov algorithm • Can be summarized into two cases: • $x_i$ and $x_j$ paired with each other: optimal score is $1 + F(i+1, j-1)$ • $x_i$ and $x_j$ not paired with each other: the structure splits at some $k$ with $i \le k < j$, and the optimal score is $\max_k [F(i, k) + F(k+1, j)]$ • There are a number of other ways to summarize, all equivalent

  21. The Nussinov algorithm • $F(i, i) = 0$ • $F(i, j) = \max\{\, F(i+1, j-1) + S(x_i, x_j),\ \max_{i \le k < j} [F(i, k) + F(k+1, j)] \,\}$ • $S(x_i, x_j) = 1$ if $x_i$, $x_j$ can form a base-pair, and 0 otherwise • Generalize: S(A, U) = 2, S(C, G) = 3, S(G, U) = 1 • Or other types of scores (later) • $F(1, N)$ gives the optimal score for the whole sequence
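The recurrence on this slide can be written almost literally as a memoized recursion. This is a minimal sketch, not the lecture's code: the names `nussinov_score` and `WEIGHTS` are made up, indices are 0-based, an empty interval (j < i) is assumed to score 0, and no minimum loop length is enforced yet, so adjacent bases may pair.

```python
from functools import lru_cache

# Weighted scores from the slide: S(A,U) = 2, S(C,G) = 3, S(G,U) = 1.
WEIGHTS = {frozenset("AU"): 2, frozenset("CG"): 3, frozenset("GU"): 1}


def nussinov_score(seq, weighted=False):
    """Optimal Nussinov score F(1, N) for seq (uppercase RNA letters)."""

    def S(a, b):
        w = WEIGHTS.get(frozenset((a, b)), 0)
        # Unweighted mode counts every allowed pair as 1.
        return w if weighted else int(w > 0)

    @lru_cache(maxsize=None)
    def F(i, j):
        if j <= i:
            # Single base, or an empty interval (assumed to score 0).
            return 0
        # Case 1: x_i pairs with x_j.
        best = F(i + 1, j - 1) + S(seq[i], seq[j])
        # Case 2: split the interval at k, i <= k < j.
        for k in range(i, j):
            best = max(best, F(i, k) + F(k + 1, j))
        return best

    return F(0, len(seq) - 1)


print(nussinov_score("GGGAAAUCC"), nussinov_score("GGGAAAUCC", weighted=True))
```

The recursion is fine for short sequences; the table-filling order on the next slides computes the same values without deep recursion.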

  22. How to fill in the DP matrix? • $F(i, j) = \max\{\, F(i+1, j-1) + S(x_i, x_j),\ \max_{i \le k < j} [F(i, k) + F(k+1, j)] \,\}$ [Figure: DP matrix with rows i, i+1 and columns j–1, j marked]

  23. How to fill in the DP matrix? • Same recurrence; fill the diagonal j – i = 1

  24. How to fill in the DP matrix? • Same recurrence; fill the diagonal j – i = 2

  25. How to fill in the DP matrix? • Same recurrence; fill the diagonal j – i = 3

  26. How to fill in the DP matrix? • Same recurrence; finish with the diagonal j – i = N - 1

  27. Minimum loop length • Sharp turns unlikely • Let minimum length of hairpin loop be 1 • F(i, j) = 0 for j – i < 2 [Figure: hairpin with stem pairs U-A, G-C, C-G]

  28. Algorithm
      Initialization:
        F(i, i) = 0 for i = 1 to N
        F(i, i+1) = 0 for i = 1 to N-1
      Iteration:
        For L = 2 to N-1
          For i = 1 to N-L
            j = i + L
            $F(i, j) = \max\{\, F(i+1, j-1) + S(x_i, x_j),\ \max_{i \le k < j} [F(i, k) + F(k+1, j)] \,\}$
      Termination: Best score is given by F(1, N). (Need to trace back; refer to the Durbin book.)
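The fill order above, plus one possible traceback, can be sketched as follows. This is an illustration under stated assumptions rather than the lecture's implementation: indices are 0-based, every allowed pair scores 1, the minimum hairpin loop length is a parameter, and the helper names (`can_pair`, `nussinov_fill`, `traceback`, `dot_bracket`) are made up. The traceback returns one optimal structure; others may exist.

```python
PAIRS = {frozenset("AU"), frozenset("CG"), frozenset("GU")}


def can_pair(a, b):
    return frozenset((a, b)) in PAIRS


def nussinov_fill(seq, min_loop=1):
    """Fill F diagonal by diagonal; F[i][j] = 0 whenever j - i <= min_loop."""
    n = len(seq)
    F = [[0] * n for _ in range(n)]
    for L in range(min_loop + 1, n):          # L = j - i
        for i in range(n - L):
            j = i + L
            best = F[i + 1][j - 1] + (1 if can_pair(seq[i], seq[j]) else 0)
            for k in range(i, j):             # bifurcation: i <= k < j
                best = max(best, F[i][k] + F[k + 1][j])
            F[i][j] = best
    return F


def traceback(F, seq, min_loop=1):
    """Recover one optimal set of base pairs (0-based) from the filled matrix."""
    pairs, stack = [], [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue                          # too short to contain a pair
        if can_pair(seq[i], seq[j]) and F[i][j] == F[i + 1][j - 1] + 1:
            pairs.append((i, j))
            stack.append((i + 1, j - 1))
            continue
        for k in range(i, j):
            if F[i][j] == F[i][k] + F[k + 1][j]:
                stack.append((i, k))
                stack.append((k + 1, j))
                break
    return pairs


def dot_bracket(n, pairs):
    s = ["."] * n
    for i, j in pairs:
        s[i], s[j] = "(", ")"
    return "".join(s)


seq = "GGGAAAUCC"
F = nussinov_fill(seq)
pairs = traceback(F, seq)
print(F[0][-1], dot_bracket(len(seq), pairs))
```

For the sequence GGGAAAUCC with minimum loop length 1, the best score is 3 base pairs.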

  29. Complexity
        For L = 2 to N-1
          For i = 1 to N-L
            j = i + L
            $F(i, j) = \max\{\, F(i+1, j-1) + S(x_i, x_j),\ \max_{i \le k < j} [F(i, k) + F(k+1, j)] \,\}$
      • Time complexity: O(N^3), because each of the O(N^2) cells takes a maximum over up to N split points k
      • Memory: O(N^2)
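To see where the cubic bound comes from, one can simply count how many split points k the inner maximum visits. The sketch below counts over all diagonals L = 1 to N-1 (slightly more than the loop needs once a minimum loop length is imposed); the function name `count_inner_steps` is made up for illustration.

```python
def count_inner_steps(n):
    """Count the split points k examined over all cells (i, j) with i < j."""
    steps = 0
    for L in range(1, n):            # L = j - i
        for i in range(n - L):
            steps += L               # k runs over i .. j-1, i.e. L values
    return steps


for n in (9, 100, 1000):
    print(n, count_inner_steps(n), (n - 1) * n * (n + 1) // 6)
```

The count equals (N-1)N(N+1)/6, which grows like N^3/6, consistent with the O(N^3) time bound on the slide.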

  30. Example • RNA sequence: GGGAAAUCC • Only count # of base-pairs • A-U = 1 • G-C = 1 • G-U = 1 • Minimum hairpin loop length = 1

  31-34. [Figures: DP matrix for GGGAAAUCC, rows and columns labeled G G G A A A U C C, filled in step by step]

  35-38. [Figures: completed DP matrix for GGGAAAUCC, rows and columns labeled G G G A A A U C C, with alternative structures drawn alongside, e.g. a stem of G-U, G-C, G-C closing an AAA loop, and stems containing an A-U pair in place of G-U]

  39. Energy minimization
        For L = 2 to N-1
          For i = 1 to N-L
            j = i + L
            $E(i, j) = \min\{\, E(i+1, j-1) + e(x_i, x_j),\ \min_{i \le k < j} [E(i, k) + E(k+1, j)] \,\}$
      • $e(x_i, x_j)$ is the energy of $x_i$ base-pairing with $x_j$
      • Energies are negative values, so we minimize rather than maximize
      • More complex energy rules: energy depends on neighboring bases
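A minimal sketch of the energy version follows. It assumes, purely for illustration, that the approximate pairing strengths from the base-pairing slide (C-G ~3, A-U ~2, G-U ~1 kcal/mole) are used as negative per-pair energies and that unpaired bases contribute 0; real energy rules depend on neighboring bases. The names `ENERGY` and `min_energy` are made up.

```python
# Assumed per-pair energies in kcal/mole (negative = stabilizing).
ENERGY = {frozenset("CG"): -3.0, frozenset("AU"): -2.0, frozenset("GU"): -1.0}


def min_energy(seq, min_loop=1):
    """Minimum total pairing energy of seq under the simple per-pair model."""
    n = len(seq)
    if n == 0:
        return 0.0
    E = [[0.0] * n for _ in range(n)]
    for L in range(min_loop + 1, n):          # L = j - i
        for i in range(n - L):
            j = i + L
            # Case 1: x_i pairs with x_j (adds 0 if the pair is not allowed).
            best = E[i + 1][j - 1] + ENERGY.get(frozenset((seq[i], seq[j])), 0.0)
            # Case 2: split at k, i <= k < j.
            for k in range(i, j):
                best = min(best, E[i][k] + E[k + 1][j])
            E[i][j] = best
    return E[0][n - 1]


print(min_energy("GGGAAAUCC"))
```

Compared with the pair-counting version, only the scores and the direction of the optimization change; the loop structure is the same.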

  40. Terminology Hairpin Loops Interior loops Stems Multi-branched loop Bulge loop

  41. The Zuker algorithm – main ideas • Score stacked pairs of base pairs instead of single base pairs (more accurate) • Separate score for bulges • Separate scores for loops of different size & composition • Separate score for interactions between stem & beginning of loop • Use an additional matrix to remember the current state, similar to affine-gap alignment

  42. Two popular implementations • mFold by Zuker • RNAfold in the Vienna package (Hofacker) • Includes several useful utilities, such as structure comparison, searching, base-pairing probability from partition functions, etc.

  43. Accuracy • 50-70% for sequences up to 300 nt • Not perfect, but useful • Possible reasons: • Energy rules not perfect: 5-10% error • Many alternative structures within this error range • Alternative structures do exist • Structure may change in presence of other molecules

  44. Comparative structure prediction Given K homologous aligned RNA sequences:
        Human  aagacuucggaucuggcgacaccc
        Mouse  uacacuucggaugacaccaaagug
        Worm   aggucuucggcacgggcaccauuc
        Fly    ccaacuucggauuuugcuaccaua
        Orc    aagccuucggagcgggcguaacuc
      If the bases at the ith and jth positions are complementary in every sequence and covary across sequences, then the two positions are likely to be paired

  45. Mutual information • $f_{ab}(i,j)$: fraction of sequences with bases a, b at positions i, j • $f_a(i)$: fraction of sequences with base a at position i
        aagacuucggaucuggcgacaccc
        uacacuucggaugacaccaaagug
        aggucuucggcacgggcaccauuc
        ccaacuucggauuuugcuaccaua
        aagccuucggagcgggcguaacuc
      • Position 3: $f_g(3) = 3/5$, $f_c(3) = 1/5$, $f_a(3) = 1/5$
      • Position 13: $f_c(13) = 3/5$, $f_g(13) = 1/5$, $f_u(13) = 1/5$
      • Positions 3 and 13 jointly: $f_{gc}(3,13) = 3/5$, $f_{cg}(3,13) = 1/5$, $f_{au}(3,13) = 1/5$
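The formula for the score M did not survive in the transcript. The standard definition of mutual information between two alignment columns, which is assumed here to be what the slide showed, combines the frequencies above as follows, with the worked value for positions 3 and 13:

```latex
M_{ij} = \sum_{a,b} f_{ab}(i,j)\,\log_2 \frac{f_{ab}(i,j)}{f_a(i)\,f_b(j)}

M_{3,13} = \tfrac{3}{5}\log_2\frac{3/5}{(3/5)(3/5)}
         + \tfrac{1}{5}\log_2\frac{1/5}{(1/5)(1/5)}
         + \tfrac{1}{5}\log_2\frac{1/5}{(1/5)(1/5)}
         = \tfrac{3}{5}\log_2\tfrac{5}{3} + \tfrac{2}{5}\log_2 5
         \approx 1.37 \text{ bits}
```

The three terms correspond to the observed pairs g-c, c-g and a-u; pairs that never occur contribute nothing to the sum.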

  46. Mutual information • Also called covariance score • M is high if base a in position i is always accompanied by base b in position j • Does not require a to base-pair with b • Advantage: can detect non-canonical base-pairs • However, M = 0 if there is no mutation at all, even if the bases pair perfectly
        aagacuucggaucuggcgacaccc
        uacacuucggaugacaccaaagug
        aggucuucggcacgggcaccauuc
        ccaacuucggauuuugcuaccaua
        aagccuucggagcgggcguaacuc
      • One way to get around this is to combine covariance and energy scores
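The behaviour described on this slide can be checked numerically on the five aligned sequences. The sketch assumes the standard mutual-information definition, M(i, j) = sum over a, b of f_ab(i,j) * log2( f_ab(i,j) / (f_a(i) * f_b(j)) ), since the formula itself is missing from the transcript; the function name `mutual_information` is made up, and columns are 1-based as on the slides.

```python
from collections import Counter
from math import log2

ALIGNMENT = [
    "aagacuucggaucuggcgacaccc",
    "uacacuucggaugacaccaaagug",
    "aggucuucggcacgggcaccauuc",
    "ccaacuucggauuuugcuaccaua",
    "aagccuucggagcgggcguaacuc",
]


def mutual_information(seqs, i, j):
    """Mutual information in bits between 1-based alignment columns i and j."""
    n = len(seqs)
    col_i = [s[i - 1] for s in seqs]
    col_j = [s[j - 1] for s in seqs]
    f_i, f_j = Counter(col_i), Counter(col_j)
    f_ij = Counter(zip(col_i, col_j))
    # Only pairs (a, b) that actually occur contribute to the sum.
    return sum(
        (c / n) * log2((c / n) / ((f_i[a] / n) * (f_j[b] / n)))
        for (a, b), c in f_ij.items()
    )


covarying = mutual_information(ALIGNMENT, 3, 13)   # g-c, c-g, a-u covary
conserved = mutual_information(ALIGNMENT, 5, 10)   # c and g in every sequence
print(round(covarying, 3), conserved)
```

Columns 3 and 13 covary and score well above zero, while columns 5 and 10 hold c and g in every sequence and score exactly 0. That is the limitation the slide points out, and the reason for combining covariance with energy scores.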

  47. Comparative structure prediction • Given a multiple alignment, can infer structure that maximizes the sum of mutual information, by DP • However, alignment is hard, since structure often more important than sequence

  48. Comparative structure prediction In practice: 1. Get multiple alignment 2. Find covarying bases and deduce structure 3. Improve multiple alignment (by hand) 4. Go to 2. A manual EM process!

  49. Comparative structure prediction • Align then fold • Align and fold • Fold then align

  50. Context-free Grammar for RNA Secondary Structure • S → SS | aSu | cSg | uSa | gSc | L • L → aL | cL | gL | uL | ε [Figure: example parse tree deriving a short RNA sequence with two stems from S]
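One way to get a feel for this grammar is to expand it at random and record which rule produced each base. The sketch below is an illustration only: the rule-choice probabilities, the depth limit, and the names `derive_S` and `derive_L` are assumptions and not part of the slide. Each S → xSy rule emits a matched bracket pair and each L rule emits unpaired dots, so every derivation yields a sequence together with a nested (pseudoknot-free) structure.

```python
import random

# Paired productions S -> aSu | cSg | uSa | gSc (Watson-Crick pairs only).
PAIR_RULES = [("a", "u"), ("c", "g"), ("u", "a"), ("g", "c")]


def derive_L(rng):
    """Expand L -> aL | cL | gL | uL | epsilon; returns (sequence, dots)."""
    seq = ""
    while rng.random() < 0.6:        # assumed probability of extending L
        seq += rng.choice("acgu")
    return seq, "." * len(seq)


def derive_S(rng, depth):
    """Expand S with a depth limit; returns (sequence, dot-bracket string)."""
    if depth <= 0:
        return derive_L(rng)         # force termination via S -> L
    rule = rng.choice(["SS", "pair", "L"])
    if rule == "SS":                 # S -> SS
        s1, d1 = derive_S(rng, depth - 1)
        s2, d2 = derive_S(rng, depth - 1)
        return s1 + s2, d1 + d2
    if rule == "pair":               # S -> x S y with x, y complementary
        left, right = rng.choice(PAIR_RULES)
        s, d = derive_S(rng, depth - 1)
        return left + s + right, "(" + d + ")"
    return derive_L(rng)             # S -> L


rng = random.Random(0)
seq, structure = derive_S(rng, depth=5)
print(seq)
print(structure)
```

Note that the grammar as written has no rule for G-U wobble pairs, so derivations from it contain Watson-Crick pairs only.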
