Lecture 11. RNA Secondary Structure Prediction


  1. Lecture 11. RNA Secondary Structure Prediction The Chinese University of Hong Kong CSCI3220 Algorithms for Bioinformatics

  2. Lecture outline • From sequences to functions • RNA secondary structures CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019

  3. Part 1: From Sequences to Functions

  4. From sequences to functions • One of the biggest questions in molecular biology: Can one tell the function of a molecule (DNA/RNA/protein) from its sequence alone? • Sometimes, but usually not (yet) • Easier if we also know the structure • Common belief: sequence → structure → function • Of course, function also depends on the environment

  5. Molecular structures • Four levels: • Primary structures: the sequence • Secondary structures: formed first, local • Tertiary structures: global, sometimes called “folds” or “domains” • Quaternary structures: multiple molecules Image credit: http://www.personal.psu.edu/jms5704/blogs/simmons/levels_of_protein_s_c_la_784.jpg

  6. Structure and function • Why does function depend on structure? • The structure itself is the function (e.g., tubulins) • Binding: complementarity of interacting structures, formation of special bonds Image credit: http://www.nigms.nih.gov/NR/rdonlyres/54BEAC37-47A9-454A-BC4F-B94EA127FA1E/0/fig1a_large.jpg, http://upload.wikimedia.org/wikimedia/en-labs/7/7f/Protein_Protein_Docking.JPG

  7. Structure and function • Why does function depend on structure? (cont’d) • Functional group (e.g., catalytic site) • Determining localization (e.g., transporter membrane proteins) Image credit: http://www.catalysis-ed.org.uk/principles/images/enzyme_substrate.gif, Spudich, Science 288(5470):1358-1359, 2000

  8. Part 2: RNA Secondary Structures

  9. Important RNA classes • Coding: • Messenger RNAs (mRNAs): translated into proteins • Non-coding: • Ribosomal RNAs (rRNAs): parts of the ribosome complex • Transfer RNAs (tRNAs): deliver free amino acids during translation • Micro RNAs (miRNAs): bind mRNA targets to promote RNA degradation or repress translation • Small nucleolar RNAs (snoRNAs): guide chemical modifications of other RNAs • Small nuclear RNAs (snRNAs): involved in mRNA splicing • Long non-coding RNAs (lncRNAs): some are involved in gene regulation • ... Image source: http://legacy.hopkinsville.kctcs.edu/sitecore/instructors/Jason-Arnold/VLI/Module%201/m1DNAfunction/m1DNAfunction3.html

  10. Importance of RNA structures • Structure is important to many classes of RNA • Examples shown in the figures: tRNA, snoRNA Image sources: http://www.bio.miami.edu/dana/pix/tRNA.jpg, http://lowelab.ucsc.edu/images/CDBox.jpg

  11. RNA secondary structures • Can largely be drawn on a 2D plane (all the T’s in the figure should be U’s) • Elements labeled in the figure: stem/hairpin loop, stacking pairs, bulge, internal loop, multi-loop, exterior loop, dangling nucleotides, less stable pair, coaxial stacking Image credit: http://www.clcbio.com/scienceimages/rna_prediction/RNA_structure_prediction_web.png

  12. RNA secondary structures • Pseudoknots: complex structures Image credit: Wikipedia, Sperschneider and Datta, RNA 14(4):630-640, (2008)

  13. Representing RNA secondary structures • Formats (see http://projects.binf.ku.dk/pgardner/bralibase/RNAformats.html): • Dot-bracket format • Stockholm format • ...

  14. Dot-bracket format • Sequence (in the original slide, nucleotides 10, 20, 30, etc. are marked in red):
      GUGAAUGAUGAAUUUAAUUCUUUGGUCCGUGUUUAUGAUGGGAAGUAAGACCCCCGAUAUGAGUGACAAAAGAGAUGUGGUUGACUAUCACAGUAUCUGACG
      • Structure:
      ......((((.......((((((.(((....((((((.((((..........)))).)))))).))).)))))).((((((.....)))))).)))).....
      Image credit: Xihao Hu
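The dot-bracket string above can be turned into an explicit list of base pairs with a stack: each "(" is pushed, and each ")" is matched with the most recent unmatched "(". The following is a minimal sketch; the function name `parse_dot_bracket` is a hypothetical helper, not part of the lecture, and positions are 1-based to match the slides.

```python
def parse_dot_bracket(structure):
    """Return the sorted 1-based base pairs (i, j) encoded by a dot-bracket string."""
    stack, pairs = [], []
    for pos, symbol in enumerate(structure, start=1):
        if symbol == "(":
            stack.append(pos)                 # remember the opening position
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))  # pair with the latest open bracket
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


# Structure string copied from the slide.
structure = ("......((((.......((((((.(((....((((((.((((..........)))).))))))"
             ".))).)))))).((((((.....)))))).)))).....")
pairs = parse_dot_bracket(structure)
print(len(structure), len(pairs))
print(pairs[:2])
```

The pairs recovered this way include (7, 97), (18, 74) and (81, 87), the ones used as running examples on the later slides.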

  15. Predicting RNA secondary structures • A basic assumption in structure prediction: • The real structure has the lowest free energy • In a simplified view, more stable bonds → lower free energy • In the case of RNA secondary structures: • Good to form more pairs • Canonical pairs: A-U, C-G • Sometimes G-U (a “wobble base pair”) • Good to form more stable pairs; stability order: C-G > A-U > G-U • Good to have stable sub-structures, e.g., stacking pairs
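The pairing rules on this slide can be captured in a small lookup. This is only a sketch: the ranks 3/2/1 are arbitrary labels that encode the ordering C-G > A-U > G-U from the slide, not free-energy values, and `pair_rank` is a hypothetical helper name.

```python
# Ranks encode only the ordering C-G > A-U > G-U; they are not energies.
STABILITY_RANK = {
    frozenset("CG"): 3,
    frozenset("AU"): 2,
    frozenset("GU"): 1,  # wobble pair
}


def pair_rank(x, y):
    """Return the stability rank of bases x and y, or 0 if they cannot pair."""
    return STABILITY_RANK.get(frozenset((x, y)), 0)


print(pair_rank("G", "C"), pair_rank("A", "U"), pair_rank("U", "G"), pair_rank("A", "C"))
```

Using a `frozenset` key makes the lookup symmetric, so `pair_rank("G", "C")` and `pair_rank("C", "G")` agree.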

  16. Predicting RNA secondary structures • We will assume there are no pseudoknots • With pseudoknots, no algorithm is currently known that finds the optimal solution efficiently • We need two things: • A thermodynamic model for computing the free energy of a structure • A method for finding the structure with the minimum free energy • Does this setting sound familiar? Figure: a pseudoknot. Image credit: Wikipedia

  17. Further assumptions • The free energy of a secondary structure is the sum of the free energies of its sub-structures. • It is not the sum over individual bases or base pairs, as one base pair can participate in multiple sub-structures. • We will count each sub-structure exactly once. For example, to count a hairpin loop, we consider the base pair that closes the loop. • The free energy values of the sub-structures are independent.

  18. Problem definition • Given an RNA sequence, find a set of base pairs so that each base is paired at most once • Example: • Input sequence: GUGAAUGAUGAAUUU...ACG • Output set of base pairs: • (7, 97) • (8, 96) • ... • (18, 74) • ... • (81, 87) Image credit: Xihao Hu
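A candidate output can be checked against the constraint in the problem definition (each base paired at most once) and, optionally, against the no-pseudoknot assumption made earlier. The sketch below uses a hypothetical function name, `is_valid_pair_set`, and expects 1-based pairs (i, j) with i < j.

```python
def is_valid_pair_set(n, pairs, allow_pseudoknots=False):
    """Check a list of 1-based pairs (i, j) for a sequence of length n."""
    used = set()
    for i, j in pairs:
        if not (1 <= i < j <= n):
            return False              # out of range or wrongly ordered
        if i in used or j in used:
            return False              # a base would be paired twice
        used.update((i, j))
    if not allow_pseudoknots:
        for a, (i, j) in enumerate(pairs):
            for k, l in pairs[a + 1:]:
                # Two pairs cross if exactly one end of one lies inside the other.
                if i < k < j < l or k < i < l < j:
                    return False
    return True


print(is_valid_pair_set(102, [(7, 97), (8, 96), (18, 74), (81, 87)]))
```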

  19. Linear view • Positions of the example sequence with their dot-bracket symbols (positions 2-6 and 26-66 are elided in the figure): 1 “.”; 7-10 “((((”; 11-17 “.......”; 18-23 “((((((”; 24 “.”; 25 “(”; ...; 67 “)”; 68 “.”; 69-74 “))))))”; 75 “.”; 76-81 “((((((”; 82-86 “.....”; 87-92 “))))))”; 93 “.”; 94-97 “))))”; ...

  20. Thermodynamic model • We will consider four types of sub-structures here: • Stacking pairs: both (i, j) and (i+1, j-1) are in the set • Hairpin loop: there is a pair (i, j), where all bases from i+1 to j-1 are unpaired • Bulge/Internal loop: there are two pairs (i, j) and (i_1, j_1), where i < i_1 < j_1 < j, and all bases from i+1 to i_1-1 and from j_1+1 to j-1 are unpaired • Multi-loop: there are pairs (i, j), (i_1, j_1), ..., (i_k, j_k), where i < i_1 < j_1 < ... < i_k < j_k < j, and all bases from i+1 to i_1-1, from j_1+1 to i_2-1, ..., from j_{k-1}+1 to i_k-1, and from j_k+1 to j-1 are unpaired • One base pair can participate in multiple sub-structures

  21. Stacking pairs • Both (i, j) and (i+1, j-1) are in the set, with j > i (we require j ≥ i+2) • E.g., i = 20, j = 72 (Figure: linear view with positions i, i+1, j-1 and j marked.)

  22. Hairpin loop • There is a pair (i, j), with j > i (unless specified, we require j ≥ i+2), where all bases from i+1 to j-1 are unpaired • E.g., i = 81, j = 87 (Figure: linear view with positions i and j marked.) Image source: http://img.ehowcdn.com/article-new/ds-photo/getty/article/151/226/87820768_XS.jpg

  23. Bulge/Internal loop • Internal loop: There are two pairs (i, j) and (i_1, j_1), where i < i_1 < j_1 < j, and all bases from i+1 to i_1-1 and from j_1+1 to j-1 are unpaired • Called a bulge if only one side has unpaired bases • Unless specified, we allow i_1 = i+1 or j = j_1+1 (but not both, which would be stacking pairs) • E.g., i = 23, j = 69, i_1 = 25, j_1 = 67 (Figure: linear view with positions i, i_1, j_1 and j marked.)

  24. Multi-loop • Multi-loop: There are pairs (i, j), (i_1, j_1), ..., (i_k, j_k), where k ≥ 2 and i < i_1 < j_1 < ... < i_k < j_k < j, and all bases from i+1 to i_1-1, from j_1+1 to i_2-1, ..., from j_{k-1}+1 to i_k-1, and from j_k+1 to j-1 are unpaired • E.g., k = 2, i = 10, j = 94, i_1 = 18, j_1 = 74, i_2 = 76, j_2 = 92 (Figure: linear view with positions i, i_1, j_1, i_2, j_2 and j marked.)
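The four definitions can be applied mechanically: for each pair (i, j), collect the pairs directly enclosed by it and look at how many there are and whether unpaired bases separate them from (i, j). The sketch below assumes a pseudoknot-free structure; the names `parse_pairs` and `classify` are hypothetical helpers, and the structure string is the lecture's dot-bracket example.

```python
def parse_pairs(structure):
    """Parse a pseudoknot-free dot-bracket string into 1-based pairs."""
    stack, pairs = [], []
    for pos, symbol in enumerate(structure, start=1):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            pairs.append((stack.pop(), pos))
    return sorted(pairs)


def classify(pairs):
    """Map each pair (i, j) to (kind of sub-structure it closes, inner pairs)."""
    partner = {i: j for i, j in pairs}        # opening position -> closing position
    result = {}
    for i, j in pairs:
        inner, pos = [], i + 1
        while pos < j:
            if pos in partner:                # a pair directly enclosed by (i, j)
                inner.append((pos, partner[pos]))
                pos = partner[pos] + 1        # skip everything it encloses
            else:
                pos += 1                      # unpaired base
        if not inner:
            kind = "hairpin loop"
        elif len(inner) >= 2:
            kind = "multi-loop"
        elif inner[0] == (i + 1, j - 1):
            kind = "stacking pairs"
        elif inner[0][0] == i + 1 or inner[0][1] == j - 1:
            kind = "bulge"                    # unpaired bases on one side only
        else:
            kind = "internal loop"
        result[(i, j)] = (kind, inner)
    return result


structure = ("......((((.......((((((.(((....((((((.((((..........)))).))))))"
             ".))).)))))).((((((.....)))))).)))).....")
loops = classify(parse_pairs(structure))
for pair in [(20, 72), (81, 87), (23, 69), (10, 94)]:
    print(pair, loops[pair])
```

The four pairs printed at the end are the examples given on the stacking-pair, hairpin, bulge/internal-loop and multi-loop slides.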

  25. One possible thermodynamic model • Unpaired bases have 0 free energy and all the terms below have negative free energy • eS(i, j): for the stacking pairs (i, j) and (i+1, j-1) • eH(i, j): for the hairpin loop closed at (i, j) • eBI(i, j, i_1, j_1): for a bulge or internal loop enclosed by the pairs (i, j) and (i_1, j_1) • eM(i, j, i_1, j_1, ..., i_k, j_k): for a multi-loop that consists of the pairs (i, j), (i_1, j_1), ..., (i_k, j_k), satisfying i < i_1 < j_1 < ... < i_k < j_k < j
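Combined with the additivity assumption from slide 17, the four terms give the total free energy of a structure S. The expression below is a sketch of that sum; the symbol E(S) and the names of the index sets are notation introduced here for illustration, not taken from the slides.

```latex
E(S) \;=\; \sum_{\substack{(i,j)\ \text{closing}\\ \text{stacking pairs}}} eS(i,j)
\;+\; \sum_{\substack{(i,j)\ \text{closing}\\ \text{a hairpin loop}}} eH(i,j)
\;+\; \sum_{\substack{(i,j)\ \text{closing a bulge}\\ \text{or internal loop}}} eBI(i,j,i_1,j_1)
\;+\; \sum_{\substack{(i,j)\ \text{closing}\\ \text{a multi-loop}}} eM(i,j,i_1,j_1,\dots,i_k,j_k)
```

Each sub-structure is counted once, through the pair (i, j) that closes it.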

  26. Finding the optimal structure • Dynamic programming • Let s be the RNA sequence with n nucleotides • Tables: • V(j): free energy of the optimal structure for s[1..j] • Final answer is based on V(n) • VP(i, j): free energy of the optimal structure for s[i..j] with i and j forming a pair • VBI(i, j): free energy of the optimal structure for s[i..j] with i and j forming a pair that closes a bulge or internal loop • VM(i, j): free energy of the optimal structure for s[i..j] with i and j forming a pair that closes a multi-loop

  27. Update formulas • V(j): free energy of the optimal structure for s[1..j] • V(1) = 0 • For j > 1, take the better of two cases: • j is unpaired, so the structure is the optimal one for s[1..j-1] • j pairs with some i, which splits the sequence into s[1..i-1] and the paired region s[i..j]
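The formula itself did not survive in this transcript. One way to write the two cases described on the slide is given below; it is a reconstruction, and the convention V(0) = 0 for the empty prefix is an assumption added so that the case i = 1 is well defined.

```latex
V(j) \;=\; \min
\begin{cases}
V(j-1) & \text{($j$ is unpaired)} \\[4pt]
\displaystyle \min_{1 \le i < j} \bigl[\, V(i-1) + VP(i,j) \,\bigr] & \text{($j$ pairs with $i$)}
\end{cases}
\qquad V(0) = V(1) = 0
```

The inner minimum scans up to j-1 choices of i, which matches the O(n) time per entry stated on the later slide.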

  28. Update formulas • VP(i, j): free energy of the optimal structure for s[i..j] with i and j forming a pair • We require that i < j • Cases illustrated in the figure: • Stacking pairs: (i+1, j-1) is also a pair • Hairpin loop: all bases from i+1 to j-1 are unpaired
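The slide's formula is missing from this transcript. Given the four sub-structure types and the tables VBI and VM defined earlier, a plausible reconstruction is the following; treating pairs that cannot form as having infinite energy is an assumption made here.

```latex
VP(i,j) \;=\; \min
\begin{cases}
eH(i,j) & \text{(hairpin loop)} \\
eS(i,j) + VP(i+1,\,j-1) & \text{(stacking pairs)} \\
VBI(i,j) & \text{(bulge or internal loop)} \\
VM(i,j) & \text{(multi-loop)}
\end{cases}
\qquad VP(i,j) = +\infty \ \text{if $s[i]$ and $s[j]$ cannot pair}
```

Once VBI(i, j) and VM(i, j) are available, this is a minimum over four values, consistent with the constant time per entry quoted later.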

  29. Update formulas • VBI(i, j): free energy of the optimal structure for s[i..j] with i and j forming a pair that closes a bulge or internal loop (i.e., i and j take the roles of i_1 and j_1) • Figure: a bulge or internal loop between the pairs (i, j) and (i_1, j_1); the bases from i+1 to i_1-1 and from j_1+1 to j-1 are all unpaired
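The formula is missing from this transcript. A reconstruction consistent with the bulge/internal-loop definition is sketched below; the side condition excludes the stacking case i_1 = i+1, j_1 = j-1, which is handled separately in VP.

```latex
VBI(i,j) \;=\; \min_{\substack{i < i_1 < j_1 < j \\ (i_1 - i) + (j - j_1) > 2}}
\bigl[\, eBI(i,j,i_1,j_1) + VP(i_1,j_1) \,\bigr]
```

The minimum ranges over O(n^2) choices of (i_1, j_1), which matches the O(n^2) time per entry stated on the complexity slides.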

  30. Update formulas • VM(i, j): free energy of the optimal structure for s[i..j] with i and j forming a pair that closes a multi-loop
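The formula is missing from this transcript. Following the multi-loop definition, one plausible form enumerates the k inner pairs and adds the optimal energy of each enclosed region; this is a reconstruction, not the slide's exact notation.

```latex
VM(i,j) \;=\; \min_{\substack{k \ge 2 \\ i < i_1 < j_1 < \dots < i_k < j_k < j}}
\Bigl[\, eM(i,j,i_1,j_1,\dots,i_k,j_k) + \sum_{h=1}^{k} VP(i_h,j_h) \,\Bigr]
```

For a fixed k there are O(n^{2k}) choices of the 2k inner positions, which is where the O(n^{2k}) time per entry on the complexity slides comes from.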

  31. Time and space requirements • V: n entries, each takes O(n) time • VP(i, j): O(n^2) entries, each takes constant time

  32. Time and space requirements • VBI: O(n^2) entries, each takes O(n^2) time • VM: O(n^2) entries, each takes O(n^{2k}) time

  33. Time and space requirements • Summary: • V: n entries, each takes O(n) time • VP: O(n^2) entries, each takes constant time • VBI: O(n^2) entries, each takes O(n^2) time • VM: O(n^2) entries, each takes O(n^{2k}) time • Total: O(n^2) space, O(n^{2k+2}) time • Exponential if k is unbounded • Some approximations could bring the time down to O(n^4), which is still expensive for large n but feasible for small or medium n
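To make the tables concrete, here is a minimal sketch of the dynamic program with memoized functions for V, VP and VBI. It rests on several simplifications that are assumptions of this sketch, not of the lecture: multi-loops (VM) are left out entirely, the energies are toy constants (`E_STACK`, `E_HAIRPIN`, `E_LOOP`) that do not depend on position or loop size, and only the optimal energy V(n) is returned, without traceback. With VM omitted, the running time is O(n^4), dominated by VBI.

```python
import math
from functools import lru_cache

CAN_PAIR = {"AU", "UA", "CG", "GC", "GU", "UG"}
# Toy energies (assumed values, all negative as on the slide).
E_STACK, E_HAIRPIN, E_LOOP = -2.0, -1.0, -0.5


def min_free_energy(s):
    """Return V(n) for sequence s under the toy model, ignoring multi-loops."""
    n = len(s)

    @lru_cache(maxsize=None)
    def vbi(i, j):
        # Try every inner pair (i1, j1) except the stacking case (i+1, j-1).
        best = math.inf
        for i1 in range(i + 1, j - 1):
            for j1 in range(i1 + 1, j):
                if (i1, j1) != (i + 1, j - 1):
                    best = min(best, E_LOOP + vp(i1, j1))
        return best

    @lru_cache(maxsize=None)
    def vp(i, j):
        # i and j must be able to pair, with at least one base between them.
        if j < i + 2 or s[i - 1] + s[j - 1] not in CAN_PAIR:
            return math.inf
        return min(E_HAIRPIN, E_STACK + vp(i + 1, j - 1), vbi(i, j))

    @lru_cache(maxsize=None)
    def v(j):
        if j <= 1:
            return 0.0
        best = v(j - 1)                       # case 1: j is unpaired
        for i in range(1, j):                 # case 2: j pairs with i
            best = min(best, v(i - 1) + vp(i, j))
        return best

    return v(n)


print(min_free_energy("GGGAAACCC"))
```

For "GGGAAACCC" the optimum under these toy values is a three-pair stem closing the AAA hairpin: one hairpin term plus two stacking terms. The recursion depth grows with the sequence length, so this form is suitable only for short sequences.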

  34. Some remarks • If we allow general pseudoknots, there is currently no known efficient algorithm for finding the RNA secondary structure with the minimum free energy • Other methods to predict RNA secondary structures: • Conservation and covariation, illustrated by the alignment below • High conservation: columns 2 and 4 • Strong covariation: columns 1 and 5 • Experimental methods (e.g., RNA footprinting)
      Column: 12345
              ACGGU
              ACUGU
              CCAGG
              UCCGA
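The conservation and covariation signals in the alignment can be read off column by column. The sketch below is one simple way to do it, under assumptions made here: a column is "conserved" if all sequences share the same base, and two columns "covary" if every sequence has a Watson-Crick pair at those positions while the bases themselves vary. The function names are hypothetical.

```python
from itertools import combinations

WATSON_CRICK = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def conserved_columns(alignment):
    """Return 1-based columns in which every sequence has the same base."""
    columns = list(zip(*alignment))
    return [c for c, col in enumerate(columns, start=1) if len(set(col)) == 1]


def covarying_pairs(alignment):
    """Return 1-based column pairs that stay complementary while the bases change."""
    columns = list(zip(*alignment))
    hits = []
    for a, b in combinations(range(len(columns)), 2):
        complementary = all((x, y) in WATSON_CRICK
                            for x, y in zip(columns[a], columns[b]))
        varies = len(set(columns[a])) > 1     # exclude fully conserved columns
        if complementary and varies:
            hits.append((a + 1, b + 1))
    return hits


alignment = ["ACGGU", "ACUGU", "CCAGG", "UCCGA"]
print(conserved_columns(alignment))
print(covarying_pairs(alignment))
```

Columns 2 and 4 are complementary in every sequence as well (C with G), but because neither varies they count as conserved rather than covarying under this definition.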

  35. Representing pseudoknots • Without pseudoknots, RNA secondary structures can be unambiguously represented by dots (unpaired bases) and brackets (base pairs) • What if there are pseudoknots? • Need more types of brackets (Figure: the pseudoknot-free linear view from slide 19, which needs only round brackets, shown above an example that needs a second bracket type.) • Example with two bracket types (positions 1-19):
      Sequence:  GAAGUACAAUAUGUAACCG
      Structure: .{.((((.....))}))..
      Image source: http://ultrastudio.org/upload/RNAPseudoKnot-25005810.jpg
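With several bracket types, parsing works as before except that each type gets its own stack. The sketch below also lists the crossing pairs, which are what make a structure a pseudoknot. The bracket set and the function names are choices made for this illustration.

```python
BRACKETS = {"(": ")", "[": "]", "{": "}", "<": ">"}


def parse_extended(structure):
    """Parse a dot-bracket string with several bracket types into 1-based pairs."""
    closers = {close: open_ for open_, close in BRACKETS.items()}
    stacks = {open_: [] for open_ in BRACKETS}    # one stack per bracket type
    pairs = []
    for pos, symbol in enumerate(structure, start=1):
        if symbol in BRACKETS:
            stacks[symbol].append(pos)
        elif symbol in closers:
            stack = stacks[closers[symbol]]
            if not stack:
                raise ValueError(f"unmatched {symbol!r} at position {pos}")
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if any(stacks.values()):
        raise ValueError("unmatched opening bracket")
    return sorted(pairs)


def crossing_pairs(pairs):
    """Return all ((i, j), (k, l)) with i < k < j < l, i.e. pairs that cross."""
    return [((i, j), (k, l)) for (i, j) in pairs for (k, l) in pairs if i < k < j < l]


pairs = parse_extended(".{.((((.....))}))..")
print(pairs)
print(crossing_pairs(pairs))
```

A structure is pseudoknot-free exactly when `crossing_pairs` returns an empty list; in the example, the curly-bracket pair crosses two of the round-bracket pairs.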

  36. Epilogue: Case Study, Summary and Further Readings

  37. Case study: Drug finding/design • Drugs are mostly chemicals with a specific structure that interacts with some biological objects • Examples: • Inhibiting the activities of an important protein of bacteria • Blocking the interaction between virus and receptors of host cell • Simulating the production of a hormone

  38. Case study: Drug finding/design • To identify or design a chemical that targets a particular object (e.g., a protein), we need to make sure that the two bind tightly; this is studied through a process called docking Image source: http://vds.cm.utexas.edu/

  39. Case study: Drug finding/design • Computational problem: • Input: a target protein and a list of chemicals • Goal: find a chemical that binds the target well • Try different locations and orientations • Binding depends on structure and chemistry • Output: One or more chemicals that bind the target well • Difficulties: • Computational complexity • Large search space for each protein-chemical combination • Need to try many chemicals • Need to ensure specificity (not to target other proteins and cause side effects)

  40. Case study: Drug finding/design • There is a game called FoldIt (http://fold.it/) in which players try to fold proteins • Score based on free energy • Real-time update of scores and ranks • Players can discuss and share solutions • Has produced some remarkably good folds compared with automatic predictions by computer programs Image source: http://fold.it/portal/site_files/theme/science/competition.png

  41. Summary • Functions depend on structures • Different levels of structures: • Primary (sequence) • Secondary (local) • Tertiary (global) • Quaternary (interactions) • RNA secondary structures can be predicted by dynamic programming based on a thermodynamic model • Important sub-structures: • Stacking pairs • Hairpin loops • Internal loops/bulges • Multi-loops • Pseudoknots

  42. Further readings • Chapter 11 of Algorithms in Bioinformatics: A Practical Introduction • Speed-up of the algorithm • Algorithm for RNA structure prediction with pseudoknots • Free slides available • Parts VII and VIII of Fundamental Concepts of Bioinformatics • Protein folding and protein structure prediction • Docking
