
Pattern Discovery in RNA Secondary Structure Using Affix Trees

Pattern Discovery in RNA Secondary Structure Using Affix Trees (when computer scientists meet real molecules). Giulio Pavesi & Giancarlo Mauri, Dept. of Computer Science, Systems and Communication, University of Milano – Bicocca, Milan, Italy.


Presentation Transcript


  1. Pattern Discovery in RNA Secondary Structure Using Affix Trees (when computer scientists meet real molecules) Giulio Pavesi & Giancarlo Mauri Dept. of Computer Science, Systems and Communication University of Milano – Bicocca Milan, Italy

  2. Why Is RNA So Interesting?
     • After the completion of various genome projects, the attention of many researchers has shifted from the coding to the non-coding parts of the genome
     • More than 95% of our genome is non-coding: what does this part do?
     • Non-coding RNA: RNA that is transcribed from DNA but does not directly encode a protein (tRNA, microRNA, etc.)

  3. A Motivating Example
     Post-transcriptional regulation of gene expression

  4. The Problem
     • Functionally related RNA sequences show structural similarity, at least in some regions
     • Given two or more RNA molecules, find similar (supposedly functional) structural elements in them
     • Sequence similarity implies structure similarity, but for RNA the converse does not always hold
     • Given two or more RNA sequences of unknown structure, find similar structural elements (motifs) in them
     • Low sequence similarity can still correspond to high structure similarity

  5. “Know Thine Enemy”
     • RNA secondary structure: the list of base pairs among the nucleotides of a sequence (usually canonical base pairs, i.e. Watson-Crick pairs and G-U wobble pairs, written G-T in the DNA alphabet used in these slides), such that:
       • No nucleotide takes part in more than one base pair
       • Base pairs never cross: if nucleotide i is paired with j and k with l (with i < j, k < l and i < k), then either i < j < k < l or i < k < l < j

  6. RNA Secondary Structure
     .((..(((.((....)))))...(((.(((...)))...)))))
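
The bracket string on slide 6 encodes exactly the list of base pairs defined on slide 5. As a minimal sketch (the function names `parse_dot_bracket` and `is_valid_secondary_structure` are illustrative, not taken from the slides), the pairs can be recovered with a stack and the two conditions checked directly:

```python
def parse_dot_bracket(structure):
    """Return the sorted list of base pairs (i, j), 0-based, with i < j."""
    stack, pairs = [], []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


def is_valid_secondary_structure(pairs):
    """Check the two conditions: one pair per nucleotide, no crossing pairs."""
    used = [n for pair in pairs for n in pair]
    if len(used) != len(set(used)):
        return False  # some nucleotide takes part in two base pairs
    for i, j in pairs:
        for k, l in pairs:
            if i < k < j < l:
                return False  # pairs (i, j) and (k, l) cross
    return True


structure = ".((..(((.((....)))))...(((.(((...)))...)))))"
pairs = parse_dot_bracket(structure)
```

Any string accepted by `parse_dot_bracket` is non-crossing by construction; the explicit check is only useful for pair lists that come from another source.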

  7. Motifs in RNA Secondary Structure
     • Many functional motifs can be described by secondary structure alone
     • Two types of similarity:
       • sequence similarity (mainly in unpaired nucleotides)
       • structure similarity

  8. Data Structures?
     • For DNA and protein sequences, suitable text-indexing structures (e.g. suffix trees) have brought significant advantages
     • RNA secondary structure can be described by a string
     • Is there a “good” structure for RNA sequences that lets us consider sequence and structure at the same time?

  9. Affix Trees
     [Figure: affix tree for the string S = ATATC, with suffix and prefix edges]
     • Suffix edges spell the substrings of S
     • Prefix edges (dotted) spell the substrings of S⁻¹ (the reverse of S)
     • Built in linear time
     • Takes linear space

  10. Affix Trees
      • The affix tree of a string S indexes all the substrings of both S and S⁻¹
      • Once a substring of S has been located in the tree, we can extend it to the right (by following suffix edges) and to the left (by following prefix edges)
      • Useful when we search the sequences for patterns with some kind of symmetry
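
To make the "extend to the right, extend to the left" operation concrete, here is a deliberately naive stand-in. It is not an affix tree: it keeps an explicit list of occurrences and re-reads the text, so it has none of the linear-time guarantees mentioned above. The class name `NaiveBidirectionalIndex` and its methods are hypothetical.

```python
class NaiveBidirectionalIndex:
    """Naive stand-in for an affix tree: occurrences are tracked explicitly."""

    def __init__(self, text):
        self.text = text

    def locate(self, pattern):
        """All half-open spans (start, end) where the pattern occurs."""
        n, m = len(self.text), len(pattern)
        return [(s, s + m) for s in range(n - m + 1)
                if self.text[s:s + m] == pattern]

    def extend_right(self, spans, symbol):
        """Keep the occurrences followed by `symbol` (suffix-edge direction)."""
        return [(s, e + 1) for s, e in spans
                if e < len(self.text) and self.text[e] == symbol]

    def extend_left(self, spans, symbol):
        """Keep the occurrences preceded by `symbol` (prefix-edge direction)."""
        return [(s - 1, e) for s, e in spans
                if s > 0 and self.text[s - 1] == symbol]


index = NaiveBidirectionalIndex("ATATC")
spans = index.locate("TA")                 # occurrences of TA
spans = index.extend_right(spans, "T")     # TA  -> TAT
spans = index.extend_left(spans, "A")      # TAT -> ATAT
```

The point of the affix tree is that both extensions are available from the same node, without rescanning the text as this sketch does.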

  11. The Hairpin
      • The basic element of RNA secondary structure is the hairpin (or stem-loop) structure
      • The hairpin is symmetric!
        ((((( ...... )))))
        AGGTC CAGTCA GATCT

  12. First Try
      • Predict the secondary structure of each input sequence
      • Build the affix tree for the folded sequences (in bracket notation)
      • Search exhaustively for patterns describing hairpin structures (possibly with differences)
      • Report those occurring in at least q sequences

  13. Searching for Hairpins in Affix Trees
      For each loop size l:
      1. Find l dots in the tree, on suffix edges (hairpin loop)
      2. Add a base pair:
         • Find a ) on suffix edges
         • Find a ( on prefix edges
         • If the result appears in at least q sequences, jump to 2; else return from the jump
      3. Add internal loops:
         a. Find a dot on prefix edges: jump to 2
         b. Find a dot on suffix edges: jump to 2

  14. Recursive Algorithm
      Step  Pattern        Result
      1     ....           ok
      2     (....)         ok
      2     ((....))       ok
      2     (((....)))     no
      3a    .((....))      ok
      2     (.((....)))    ok
      • On each path, we keep one pointer for the prefix edge and another for the suffix edge
      • Speed-up: represent each unpaired element with a single symbol describing its type and size, so that two symbols are compared instead of two regions
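
The recursion of slides 13 and 14 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: explicit occurrence lists replace the affix tree and its two pointers, each unpaired extension adds exactly one dot before the next base pair (as in the trace above), and the names `search_hairpins`, `extend` and `count_sequences` are hypothetical.

```python
def count_sequences(occurrences):
    """Number of distinct input structures that contain an occurrence."""
    return len({seq_id for seq_id, _, _ in occurrences})


def extend(structures, occurrences, left, right):
    """Extend occurrences by `left` and/or `right` ('' means no extension)."""
    extended = []
    for seq_id, start, end in occurrences:
        text = structures[seq_id]
        if left:
            if start == 0 or text[start - 1] != left:
                continue
            start -= 1
        if right:
            if end == len(text) or text[end] != right:
                continue
            end += 1
        extended.append((seq_id, start, end))
    return extended


def search_hairpins(structures, loop_size, q):
    """Hairpin patterns (bracket notation) found in at least q structures."""
    found = []

    def add_base_pair(pattern, occurrences):            # step 2
        occurrences = extend(structures, occurrences, "(", ")")
        if count_sequences(occurrences) < q:
            return                                       # return from the jump
        pattern = "(" + pattern + ")"
        found.append(pattern)
        add_base_pair(pattern, occurrences)              # jump to 2
        add_unpaired(pattern, occurrences)               # then try step 3

    def add_unpaired(pattern, occurrences):              # step 3
        for left, right in ((".", ""), ("", ".")):       # 3a, then 3b
            extended = extend(structures, occurrences, left, right)
            if count_sequences(extended) >= q:
                add_base_pair(left + pattern + right, extended)

    loop = "." * loop_size                               # step 1
    occurrences = [
        (seq_id, start, start + loop_size)
        for seq_id, text in enumerate(structures)
        for start in range(len(text) - loop_size + 1)
        if text[start:start + loop_size] == loop
    ]
    if count_sequences(occurrences) >= q:
        add_base_pair(loop, occurrences)
    return found


structures = ["(.((....)))", "..(.((....))).."]   # made-up folded inputs
patterns = search_hairpins(structures, loop_size=4, q=2)
```

On these two made-up inputs the search visits the same patterns as the trace on slide 14 and reports the three that end with a closing base pair.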

  15. Approximate Search
      We can allow some approximation:
      • Hairpin loops of different sizes (a range of values at step 1)
      • Internal loops of different sizes at the same position along the stem
      • Internal loops or bulges at different positions along the stem
      • Stems of different sizes (numbers of base pairs)
      • Any combination of the above

  16. Complexity
      Given a set of k folded sequences of overall length N:
      • Construction of the tree: O(N)
      • Annotation of the tree: O(kN)
      • Search: O(V(m)kN), where m is the length of the longest pattern found
      • V(m) depends on the degree of approximation
      • In practice, the most time-consuming part is predicting the structure of the sequences

  17. Does It Work?
      • Test: the Iron Responsive Element, located in the UTRs of mRNAs coding for proteins involved in iron metabolism (e.g. ferritin, transferrin)
      • Does it appear in all the predicted structures?
      • Alas, it does not!

  18. Why?
      • The “real structure” often does not correspond to the optimal one!
      • The motif “disappears” from the (supposedly) optimal structure

  19. One Possible Solution
      • Idea: for each sequence, also consider a number of alternative sub-optimal structures
      • All the possible structures can be enumerated
      • Check whether a motif appears in at least one alternative structure per sequence
      • The affix tree can efficiently handle even hundreds of alternative structures per input sequence
      • Downside: the number of potential secondary structures for a sequence of length n grows exponentially, O(2^n)
      • If the similarity requirement is not stringent, we have too many candidates
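
To get a feeling for how quickly the number of candidate structures grows, the sketch below counts non-crossing structures on n nucleotides. It rests on assumptions that are not in the slides: any two nucleotides are allowed to pair, and a hairpin loop must contain at least `min_loop` unpaired nucleotides. The count therefore ignores which bases can actually pair in a given sequence; the function name `count_structures` is illustrative.

```python
def count_structures(n, min_loop=1):
    """Count non-crossing structures on n nucleotides (sequence ignored)."""
    counts = [1] * (n + 1)          # counts[0] = 1: the empty structure
    for length in range(1, n + 1):
        # Case 1: the last nucleotide is unpaired.
        total = counts[length - 1]
        # Case 2: the last nucleotide pairs with nucleotide k (1-based),
        # leaving k - 1 nucleotides outside the pair and length - 1 - k inside.
        for k in range(1, length - min_loop):
            total += counts[k - 1] * counts[length - 1 - k]
        counts[length] = total
    return counts[n]


growth = [count_structures(n) for n in range(9)]
```

The values roughly double with each added nucleotide even for these tiny lengths, which is why enumerating every alternative structure quickly becomes impractical.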

  20. But...
      • If the same structure has to appear in a set of sequences, then the same pattern of complementary base pairs has to appear in those sequences
        ((((( ...... )))))
        AGGTC CAGTCA GATCT
        GCGAG CAGTCT CTTGC
        CCCAG CAGTCA CTGGG

  21. Idea!
      • Instead of working on folded sequences, build the affix tree for the sequences alone, and find complementary base pairs on the fly
      • The search can be implemented with the same parameters as in the folded case

  22. Building Hairpins on the Fly
      • By working on unfolded sequences, the theoretical time complexity is higher, since different paths correspond to the same structure
      • In practice it is much faster, since we do not have to run the prediction algorithm on the input sequences
      • We need to “validate” the candidate structures, e.g. according to their energy
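
A minimal sketch of finding complementary base pairs on the fly, under several assumptions: it scans each sequence directly instead of walking an affix tree, it only handles perfect stems around a loop of fixed size, it uses the DNA alphabet of the slides (T in place of U, with G-T as the wobble pair), and it skips the energy validation step entirely. The names `find_hairpins` and `shared_hairpin` are hypothetical.

```python
CANONICAL_PAIRS = {
    ("A", "T"), ("T", "A"),   # Watson-Crick
    ("G", "C"), ("C", "G"),   # Watson-Crick
    ("G", "T"), ("T", "G"),   # wobble
}


def find_hairpins(sequence, loop_size, min_stem):
    """Return (start, end, stem_length) for each candidate hairpin found."""
    hits = []
    for loop_start in range(1, len(sequence) - loop_size):
        left = loop_start - 1             # last nucleotide before the loop
        right = loop_start + loop_size    # first nucleotide after the loop
        stem = 0
        # Grow the stem outwards while the two flanking bases can pair.
        while (left >= 0 and right < len(sequence)
               and (sequence[left], sequence[right]) in CANONICAL_PAIRS):
            stem += 1
            left -= 1
            right += 1
        if stem >= min_stem:
            hits.append((left + 1, right, stem))
    return hits


def shared_hairpin(sequences, loop_size, min_stem, q):
    """True if at least q sequences contain a candidate hairpin."""
    supporting = sum(1 for s in sequences
                     if find_hairpins(s, loop_size, min_stem))
    return supporting >= q


# The three sequences from slide 20, written without spaces.
sequences = ["AGGTCCAGTCAGATCT", "GCGAGCAGTCTCTTGC", "CCCAGCAGTCACTGGG"]
hits = [find_hairpins(s, loop_size=6, min_stem=5) for s in sequences]
```

All three sequences yield the same candidate, a 5-pair stem around a 6-nucleotide loop, even though their stem sequences differ.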

  23. Post-Processing
      • So far we have considered structure alone
      • More than one motif occurrence per sequence is often reported, especially if the structural constraints are loose
      • Post-processing: compare the candidate occurrences by evaluating sequence similarity in their unpaired elements
      • Find the group of instances that are most similar at the sequence level
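
One simple way to picture this post-processing step, assuming that every candidate is reduced to its loop sequence, that all loops have the same length, and that similarity is measured by Hamming distance (none of which the slides specify), is to pick one candidate per input sequence so that the chosen group is as similar as possible. The brute-force search below is exponential in the number of sequences and is meant only as an illustration; `best_group` is a hypothetical name.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("loops must have the same length")
    return sum(x != y for x, y in zip(a, b))


def best_group(candidates):
    """Pick one loop per sequence, minimizing the sum of pairwise distances.

    `candidates` holds one list of loop strings per input sequence.
    Returns (total_distance, chosen_loops).
    """
    best = None
    for choice in product(*candidates):
        cost = sum(hamming(a, b) for a, b in combinations(choice, 2))
        if best is None or cost < best[0]:
            best = (cost, choice)
    return best


# Loops from slide 20 plus two made-up decoy candidates (TTTTTT, GGGGGG).
candidates = [["CAGTCA", "TTTTTT"], ["CAGTCT"], ["GGGGGG", "CAGTCA"]]
result = best_group(candidates)
```

Here the decoys are discarded and the three loops from slide 20 are selected, with a total pairwise distance of 2.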

  24. Results and Work in Progress
      • The second approach gave better results, in terms of both reliability and efficiency
      • Candidate hairpins can be validated according to their energy value (more reliable, in this case!)
      • Good results on “harder” tests
      • Still too many input parameters
      • To do: extend to more complex structures
