Comparative Analysis of 5S rRNA Sequences and Structures Across Organisms

RNA multiple sequence alignment
Craig L. Zirbel, zirbel@bgsu.edu, October 14, 2010

RNA primary sequences
• Laboratory techniques make it possible to extract specific RNA molecules and determine the sequence of nucleotides.
Here are the (unaligned) sequences of the 5S ribosomal RNA molecule from different organisms:
UUAGGCGGCCACAGCGGUGGGGUUGCCUCCCGUACCCAUCCCGAACACGGAAGAUAAGCCCACCAGCGUUCCGGGGAGUACUGGAGUGCGCGAGCCUCUGGGAAACCCGGUUCGCCGCCACC A H.m. (structure)
GCCUGGCGGCCGUAGCGCGGUGGUCCCACCUGACCCCAUGCCGAACUCAGAAGUGAAACGCCGUAGCGCCGAUGGUAGUGUGGGGUCUCCCCAUGCGAGAGUAGGGAACUGCCAGGC B E.coli (structure)
UCCCCCGUGCCCAUAGCGGCGUGGAACCACCCGUUCCCAUUCCGAACACGGAAGUGAAACGCGCCAGCGCCGAUGGUACUGGGCGGGCGACCGCCUGGGAGAGUAGGUCGGUGCGGGG B T.th. (structure)
AGUGGUGGCCAUAUCGGCGGGGUUCCUCCCCGUACCCAUCCUGAACACGGAAGAUAAGCCCGCCAGCGUCCGGCAAGUACUGGAGUGCGCGAGCCUCUGGGAAAUCCGGUUCGCCGCCAC A L27170.1/1-120
GUAGCGGCCACAGCGGUGGGGUUCCUCCCGUACCCAUCCCGAACACGGAAGAUAAGCCCACCAGCGUUCCGGGGAGUACUGGAGUGCGCGACCCUCUGGGAAACCGGGUUCGCCGCUAC A L27163.1/1-119
GCGGCCAGGGCGGAGGGGAAACACCCGUACCCAUUCCGAACACGGAAGUGAAGCCCUCCAGCGAACCAGCUAGUACUAGAGUGGGAGACCCUCUGGGAGCGCUGGUUCGCCGCC A L27343.1/3-116
UUUGGCGGUCAUGGCGUGGGGGUUUAUACCUGAUCUCGUUUCGAUCUCAGUAGUUAAGUCCUGCUGCGUUGUGGGUGUGUACUGCGGUUUUUUGCUGUGGGAAGCCCACUUCACUGCCAGAC A M36187.1/5-126
GUUGGCGGUCAUGGCGUGGGGUUUAUACCUGAUCUCGUUUCGAUCUCAGUAGUUAAGUCCUGCUGCGUUGUGGGUGUGUACUGCGGUUUUUUGCUGUGGGAAGCCCACUUCACUGCCAGAC A X62857.1/1-121
UUUGGCGGUCAUGGCGUGGGGGUUAUACCUGAUCUCGUUUCGAUCUCAGUAGUUAAGUCCUGCUGCGUUGUGGGUGUGUACUGCGGUGUUUUGCUGUGGGAAGCCCAUUUCACUGCCAGCC A X15364.1/6601-6721
GUCGGUGGUGUUAGCGGUGGGGUCACGCCCGGUCCCUUUCCGAACCCGGAAGCUAAGCCUGCCUGCGCCGAUGGUACUGCACCUGGGAGGGUGUGGGAGAGUAGGACCCCGCCGGCA B M16176.1/4-120
GUCGGUGGUUAUAGCGGUGGGGUCACGCCCGGUCCCAUUCCGAACCCGGAAGCUAAGCCCACCUGCGCCGAUGGUACUGCACCUGGGAGGGUGUGGGAGAGUAGGUCACCGCCGGCC B M16177.1/4-120
GUUGGUGGUUAUUGUGUCGGGGGUACGCCCGGUCCCUUUCCGAACCCGGAAGCUAAGCCCGAUUGCGCUGAUGGUACUGCACCUGGGAGGGUGUGGGAGAGUAGGUCGCUGCCAACC B X55255.1/4-120
UACGGCGGUCAAUAGCGGCAGGGAAACGCCCGGUCCCAUCCCGAACCCGGAAGCUAAGCCUGCCAGCGCCAAUGAUACUGCCCUCACCGGGUGGAAAAGUAGGACACCGCCGAAC B X55259.1/3-117
UACGGCGGUCCAUAGCGGCAGGGAAACGCCCGGUCCCAUCCCGAACCCGGAAGCUAAGCCUGCCAGCGCCGAUGAUACUACCCAUCCGGGUGGAAAAGUAGGACACCGCCGAAC B X55251.1/3-116
UACGGCGGCCACAGCGGCAGGGAAACGCCCGGUCCCAUUCCGAACCCGGAAGCUAAGCCUGCCAGCGCCGAUGAUACUGCCCCUCCGGGUGGAAAAGUAGGACACCGCCGAAC B X75601.1/91-203
UAAGGCGGCCAUAGCGGUGGGGUUACUCCCGUACCCAUCCCGAACACGGAAGAUAAGCCCGCCUGCGUUCCGGUCAGUACUGGAGUGCGCGAGCCUCUGGGAAAUCCGGUUCGCCGCCUACU A X03407.1/5927-6048
UUGGCGACCAUAGCGGCGAGUGACCUCCCGUACCCAUCCCGAACACGGAAGAUAAGCUCGCCUGCGUUUCGGUCAGUACUGGAUUGGGCGACCCUCUGGGAAAUCUGAUUCGCCGCCACC A L27168.1/1-120
GGCGGCCAGAGCGGUGAGGUUCCACCCGUACCCAUCCCGAACACGGAAGUUAAGCUCACCUGCGUUCUGGUCAGUACUGGAGUGAGCGAUCCUCUGGGAAAUCCAGUUCGCCGCCC A X02128.1/24-139
GGGCGGCCAGAGCGGUGAGGUUCCACCCGUACCCAUCCCGAACACGGAAGUUAAGCUCGCCUGCGUUCUGGUCAGUACUGGAGUGAGCGAUCCUCUGGGAAAUCCAGUUCGCCGCCCCU A X14441.1/5-123

Watson-Crick basepairs
• Watson-Crick basepairs can substitute for one another freely without changing the structure of the RNA molecule. They are said to be isosteric, and changes between these basepairs are an example of neutral variability.
• They are held together by hydrogen bonds (shown as dotted lines).
[Figure: Superposition]

RNA sequence variability
• To preserve RNA helices, mutations must be compensated: to replace a GC basepair with an AU basepair, two letters must change in distant regions of the sequence, as in the two sequences below. Statistically, this is called “long-range dependence.”
• Compensating mutations such as these do not change the secondary or tertiary structure of the molecule.
UGCCUGGCGGCCGUAGCGCGGUGGUCCCACCUGACCCCAUGCCGAACUCAGAAGUGAAACGCCGUAGCGCCGAUGGUAGUGUGGGGUCUCCCCAUGCGAGAGUAGGGAACUGCCAGGCAU
UGCCUGGCGACCGUAGCGCGGUGGUCCCACCUGACCCCAUGCCGAACUCAGAAGUGAAACGCCGUAGCGCCGAUGGUAGUGUGGGGUCUCCCCAUGCGAGAGUAGGGAAUUGCCAGGCAU
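A minimal sketch of how the compensating change in the two sequences above can be located: compare them position by position, then check whether the two changed letters form a Watson-Crick pair both before and after the change. The helper name is illustrative and not part of any published tool, and the sketch assumes equal-length sequences that need no gaps.

```python
# Illustrative helper; assumes equal-length, gap-free sequences.
WATSON_CRICK = {"AU", "UA", "GC", "CG"}

def differing_positions(seq_a, seq_b):
    """Return (1-based position, letter in seq_a, letter in seq_b) per mismatch."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [(k + 1, a, b)
            for k, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]

# The two sequences from the slide, split into blocks of ten for readability.
original = "".join([
    "UGCCUGGCGG", "CCGUAGCGCG", "GUGGUCCCAC", "CUGACCCCAU",
    "GCCGAACUCA", "GAAGUGAAAC", "GCCGUAGCGC", "CGAUGGUAGU",
    "GUGGGGUCUC", "CCCAUGCGAG", "AGUAGGGAAC", "UGCCAGGCAU",
])
variant = "".join([
    "UGCCUGGCGA", "CCGUAGCGCG", "GUGGUCCCAC", "CUGACCCCAU",
    "GCCGAACUCA", "GAAGUGAAAC", "GCCGUAGCGC", "CGAUGGUAGU",
    "GUGGGGUCUC", "CCCAUGCGAG", "AGUAGGGAAU", "UGCCAGGCAU",
])

diffs = differing_positions(original, variant)
(i, a1, b1), (j, a2, b2) = diffs      # exactly two positions differ here
pair_before = a1 + a2                 # letters at the two positions in `original`
pair_after = b1 + b2                  # letters at the same positions in `variant`

print(diffs)                          # [(10, 'G', 'A'), (110, 'C', 'U')]
print(pair_before in WATSON_CRICK, pair_after in WATSON_CRICK)  # True True
```

The two changes sit 100 positions apart, which is the long-range dependence the slide refers to: a GC pair at positions 10 and 110 has become an AU pair.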

Comparative sequence analysis
• By manually aligning similar RNA sequences and noting the pairs of columns where mainly AU, CG, GC, and UA pairs occur, one can infer the secondary structure of an RNA molecule.
• This is the inferred secondary structure of the 5S RNA, with bases labeled as found in E. coli. There are five helical regions, with three “internal loops” and two “hairpin loops” separating them. Note the colors!
Fox & Woese 1975; Peattie et al. 1981; Noller 1984; Cannone et al. 2002; http://www.rna.ccbb.utexas.edu
UGCCUGGCGGCCGUAGCGCGGUGGUCCCACCUGACCCCAUGCCGAACUCAGAAGUGAAACGCCGUAGCGCCGAUGGUAGUGUGGGGUCUCCCCAUGCGAGAGUAGGGAACUGCCAGGCAU
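The column-pair idea can be sketched in a few lines. The example assumes an already aligned, gap-free toy alignment (hypothetical sequences, not real 5S data) and reports pairs of columns in which every sequence forms a Watson-Crick pair and at least two different pair types occur. Real comparative analysis tolerates exceptions and GU pairs and uses statistical measures of covariation; the strict rule, the function name, and the `min_separation` parameter are simplifying assumptions.

```python
from itertools import combinations

WATSON_CRICK = {"AU", "UA", "GC", "CG"}

def covarying_column_pairs(alignment, min_separation=3):
    """Return 0-based (i, j) column pairs that look like conserved basepairs."""
    n_columns = len(alignment[0])
    if any(len(seq) != n_columns for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    hits = []
    for i, j in combinations(range(n_columns), 2):
        if j - i < min_separation:
            continue  # columns this close are skipped as unlikely to pair
        pairs = {seq[i] + seq[j] for seq in alignment}
        # Every sequence pairs canonically, and the pair type varies.
        if pairs <= WATSON_CRICK and len(pairs) >= 2:
            hits.append((i, j))
    return hits

# Hypothetical toy alignment: a three-basepair stem closing a GAAA loop.
toy_alignment = ["GCAGAAAUGC", "GGAGAAAUCC", "ACUGAAAAGU"]
print(covarying_column_pairs(toy_alignment))  # [(0, 9), (1, 8), (2, 7)]
```

Requiring at least two pair types is what separates covariation from plain conservation: a column pair that is always GC is consistent with a basepair but gives no evidence of compensating change.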

RNA 3D structure
• Starting late in the year 2000, high-resolution atomic structures of entire ribosomes have been published. These show the bases, the backbone, the Watson-Crick basepairs, and several new types of basepairs. The 2009 Nobel Prize in Chemistry went to Yonath, Ramakrishnan, and Steitz for their work on X-ray crystal structures of ribosomes.
[Figure: E. coli 5S]

Three 5S rRNA 3D structures
[Figure panels: Haloarcula marismortui, E. coli, Thermus thermophilus]

RNA multiple sequence alignment
• The same RNA in different organisms can be presumed to have the same, or roughly the same, secondary and 3D structure.
• Compensating changes far apart in the sequence make it hard to use multiple sequence alignment tools that were developed for proteins.

Two situations for RNA multiple sequence alignment
• We have two or more sequences from the same RNA, but don’t know their common secondary structure or 3D structure.
• We have RNA sequences and a common secondary structure, or even a single 3D structure, which we can assume they all share to some degree.

RNA MSA: RNA Multiple Sequence Alignment
Slides by Anton Petrov, Ph.D. student, BGSU, 10-14-2010

Why DNA and protein alignment methods don’t work for RNA
• RNA sequences may look dissimilar but still fold into the same structure.
Gorodkin et al., 2010. Trends in Biotechnology

Example
Gorodkin et al., 2010. Trends in Biotechnology

RNA-specific alignment methods and many others...

RNA MSA and ncRNA discovery
• Conservation is a reliable indicator of biological importance.
• If an RNA fragment is conserved across multiple species, it may function as ncRNA.
• ncRNA discovery programs scan multiple genomic sequences in order to detect putative ncRNA candidates.
• MSA is an essential part of the ncRNA discovery pipeline.

RNA MSA and ncRNA discovery
• Pipeline components: multiple sequence alignment, secondary structure prediction, ncRNA discovery
• Strategies: align first, fold first, or align and fold simultaneously

RNAz
• Once you have a good MSA, you can use tools like RNAz to scan your alignment for conserved stable secondary structures, which may function as ncRNAs.
http://rna.tbi.univie.ac.at/cgi-bin/RNAz.cgi

Suggested reading

Alignment to a common secondary structure
• One standard starting point is a “seed” alignment of 20-100 RNA sequences together with a “dot-bracket” secondary structure diagram.
• Infernal is a program that builds a “covariance model” from the seed alignment and aligns new sequences to this model, and thus to the existing alignment.
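To make the “dot-bracket” notation concrete, here is a small sketch that converts a dot-bracket string into a list of paired positions using a stack. It assumes only the characters “(”, “)” and “.” (no pseudoknot brackets). The function name is illustrative and is not part of Infernal.

```python
def dot_bracket_to_pairs(structure):
    """Return sorted 0-based (i, j) basepairs encoded by a dot-bracket string."""
    stack = []   # positions of '(' still waiting for a partner
    pairs = []
    for position, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(position)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {position}")
            pairs.append((stack.pop(), position))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)

# Hypothetical hairpin: three basepairs closing a four-letter loop.
print(dot_bracket_to_pairs("(((....)))"))  # [(0, 9), (1, 8), (2, 7)]
```

Each “(” pairs with the nearest unmatched “)” that follows it, so the notation can only describe nested basepairs, which is the same restriction a context-free grammar imposes.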

Alignment to a model based on a 3D structure
One focus of the BGSU RNA group:
• Take an RNA 3D structure, with all of the detail it gives about Watson-Crick basepairs and other RNA basepairs.
• Make a model for sequence variability.
• Align RNA sequences to the model, and thus to one another.

H.m. 5S rRNA basepair diagram
• Standard AU and GC Watson-Crick basepairs are denoted by = or –.
• In other pairs, a circle stands for the Watson-Crick edge, a square for the Hoogsteen edge, and a triangle for the Sugar edge.
• The basepair diagrams for E. coli and T.th. are similar.
• Working hypothesis: other organisms have largely the same basepair diagram, with neutral basepair substitutions that do not alter the 3D structure.

Non-Watson-Crick basepairs
• The 3D structures show a variety of planar basepair interactions other than Watson-Crick basepairs. These occur between helices and allow the RNA molecule to achieve tighter turns or other important 3D structural features.
Examples shown:
trans Hoogsteen / Sugar Edge: A78-G98 and A45-U40 in E. coli 5S
cis Watson-Crick / Sugar Edge: A57-C30 and A46-A39 in E. coli 5S
trans Sugar Edge / Sugar Edge: G13-G69 in E. coli 5S

Isostericity for non-Watson-Crick basepairs
• Non-Watson-Crick basepairs have different basepair substitution (isostericity) rules than Watson-Crick pairs. Below are some examples of geometrically similar basepairs.
trans Hoogsteen / Sugar Edge: A78-G98 and A45-U40 in E. coli 5S
cis Watson-Crick / Sugar Edge: A57-C30 and A46-A39 in E. coli 5S
trans Sugar Edge / Sugar Edge: G13-G69 in E. coli 5S

Stochastic grammars
• Stochastic grammars are probabilistic models for sequences of characters or words.
• They can enforce specified grammatical rules while still allowing variability in the specific sequence.
• The classic example: “Colorless green ideas sleep furiously” obeys English grammatical rules, but is a very unlikely sentence to occur in normal English.
• Context-free grammars have certain limitations on the grammatical rules that can be enforced.
• Chomsky 1956, 1959; Durbin and Eddy 1994.

Simple SCFG model for RNA
• From the basepair diagram, we construct a model which mimics the structure of the molecule but which allows for neutral basepair variability and other minor variations.
• The model for the 5S itself is too large to show, so we display a very small cartoon of the 5S molecule.
[Figure: cartoon model, with the 5’ and 3’ ends labeled]

Using the SCFG model to generate sequence variants
• The Initial node generates letters independently with a given length and letter distribution. This time we get an A on the left and CA on the right.
• A Basepair node generates a (dependent) pair of letters and independent insertions. The first Basepair node generates a CG pair and inserts an A on the right (before the G).
• The next Basepair node generates CG with no insertions.
• The Junction node generates nothing, but passes control to its two child nodes.
• The Initial node on the left branch generates U on the left and AC on the right; the Initial node on the right branch generates AU.
• The Basepair node on the left branch generates GC; the Basepair node on the right branch generates AG.
• The Basepair node on the left branch generates UA; the Basepair node on the right branch generates GA.
• The Hairpin node on the left branch generates UUCG (a variant of the UNCG hairpin) and the last Basepair node on the right branch generates GC.
• Finally, the last Hairpin generates GAAGA (a variant of the GNRA loop with one insertion) and generation stops.
Generated sequence (shown on every slide of this walk-through): ACCUGUUUCGACACAGGGAAGACAGAUGAGCA
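The generation history described in this walk-through can be replayed deterministically to check that the node emissions assemble into the displayed sequence. The sketch below is not the author's model: the dictionary-based node layout and the helper names are hypothetical, nothing is sampled, and one detail is an assumption read off the displayed sequence, namely that the right-branch Initial node emits its AU on the right side.

```python
# Hypothetical node constructors; each node is a plain dictionary.
def initial(left, right, child):
    return {"kind": "initial", "left": left, "right": right, "child": child}

def basepair(pair, child, ins_left="", ins_right=""):
    return {"kind": "basepair", "pair": pair, "child": child,
            "ins_left": ins_left, "ins_right": ins_right}

def junction(*children):
    return {"kind": "junction", "children": children}

def hairpin(loop):
    return {"kind": "hairpin", "loop": loop}

def emit(node):
    """Assemble the 5' to 3' sequence produced by a node and its descendants."""
    kind = node["kind"]
    if kind == "hairpin":
        return node["loop"]
    if kind == "junction":
        return "".join(emit(child) for child in node["children"])
    inner = emit(node["child"])
    if kind == "initial":
        return node["left"] + inner + node["right"]
    # Basepair: paired letters on the outside, insertions just inside them.
    five_prime, three_prime = node["pair"]
    return five_prime + node["ins_left"] + inner + node["ins_right"] + three_prime

left_branch = initial("U", "AC",
                      basepair("GC", basepair("UA", hairpin("UUCG"))))
# Assumption: this Initial node emits nothing on the left and AU on the right.
right_branch = initial("", "AU",
                       basepair("AG", basepair("GA",
                                basepair("GC", hairpin("GAAGA")))))
history = initial("A", "CA",
                  basepair("CG",
                           basepair("CG", junction(left_branch, right_branch)),
                           ins_right="A"))

print(emit(history))  # ACCUGUUUCGACACAGGGAAGACAGAUGAGCA
```

A stochastic version would draw each node's letters, pairs, and insertions from that node's probability distributions instead of fixing them, which is how the model produces different sequence variants with the same overall structure.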

Parsing sequences according to a model
• Typically, a model can generate the same sequence in several different ways.
• Given a model and a sequence that was generated by the model, we want to determine the single way of generating the sequence that is most likely.
• The most likely generation history tells which node generated which part of the sequence, and so aligns the sequence to the model.

Multiple ways to generate a sequence
• Here is another way that the same simple model could have generated the sequence. This generation history would have very low probability, since the letters it assigns to certain basepairs are not isosteric with the originally observed basepairs.
ACCUGUUUCGACACAGGGAAGACAGAUGAGCA

Determining the maximum probability generation history for a sequence
• The CYK (Cocke, Younger, Kasami) dynamic programming algorithm has each node (from leaves to root) consider each subsequence (from shorter to longer) and determine the maximum probability way that it and its children could generate that subsequence.
• Here, the blue Basepair node considers how it and its children can generate the colored subsequence.
• The blue Basepair node then considers another way for it and its children to generate the colored subsequence.
• The red nodes have already considered every subsequence of this length and shorter.
• The algorithm runs in O(L²M) time, where L is the length of the input sequence and M is the number of nodes.
ACCUGUUUCGACACAGGGAAGACAGAUGAGCA
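The shorter-to-longer table filling can be illustrated with a CYK-style parser for a toy grammar. This is a sketch under stated assumptions, not the node-based 5S model from the slides: the grammar has a single nonterminal S with the rules S → x S y (pair), S → x S, S → S x, and S → ε, and all probabilities below are hypothetical values chosen only to favor Watson-Crick pairs.

```python
import math

# Hypothetical rule probabilities for the toy grammar (they sum to 1).
P_PAIR, P_LEFT, P_RIGHT, P_END = 0.5, 0.2, 0.1, 0.2

def pair_probability(x, y):
    """Hypothetical emission distribution over the 16 ordered letter pairs."""
    if x + y in ("AU", "UA", "GC", "CG"):
        return 0.23
    if x + y in ("GU", "UG"):
        return 0.035
    return 0.001

def cyk_best_parse(seq):
    """Return (max log-probability, basepairs) for S generating seq."""
    n = len(seq)
    # best[i][j] is the best log-probability of S generating seq[i:j].
    best = [[-math.inf] * (n + 1) for _ in range(n + 1)]
    rule = [[None] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        best[i][i] = math.log(P_END)          # empty subsequence
        rule[i][i] = "end"
    for length in range(1, n + 1):            # shorter subsequences first
        for i in range(n - length + 1):
            j = i + length
            options = [
                (math.log(P_LEFT * 0.25) + best[i + 1][j], "left"),
                (math.log(P_RIGHT * 0.25) + best[i][j - 1], "right"),
            ]
            if length >= 2:
                p = P_PAIR * pair_probability(seq[i], seq[j - 1])
                options.append((math.log(p) + best[i + 1][j - 1], "pair"))
            best[i][j], rule[i][j] = max(options)
    # Trace back the winning rules to recover the basepairs.
    pairs, i, j = [], 0, n
    while i < j:
        if rule[i][j] == "pair":
            pairs.append((i, j - 1))
            i, j = i + 1, j - 1
        elif rule[i][j] == "left":
            i += 1
        else:
            j -= 1
    return best[0][n], pairs

score, pairs = cyk_best_parse("GGGAAACCC")
print(pairs)  # [(0, 8), (1, 7), (2, 6)]
```

With one nonterminal and no branching rule, the table has O(L²) cells and constant work per cell, in line with the O(L²M) bound quoted above for M nodes. The traceback is the "generation history": it says which rule produced which letters, and therefore aligns the sequence to the grammar.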

Uses for SCFG models for RNA
• Multiple sequence alignment of RNA
• Searching genomes for RNAs homologous to a given RNA
• Infernal is a commonly used SCFG program
• Note: Some RNAs are absolutely ancient (messenger RNA, ribosomal RNA, transfer RNA), but there are many RNAs that people are just learning about now, like microRNAs and other regulatory RNAs. They occur in UTRs, introns, and intergenic regions, and we need to be able to recognize them!
