RNA multiple sequence alignment

Craig L. Zirbel

[email protected]

October 14, 2010

RNA primary sequences

  • Laboratory techniques make it possible to extract specific RNA molecules and determine the sequence of nucleotides. Here are the (unaligned) sequences of the 5S ribosomal RNA molecule from different organisms:

[Figure: unaligned 5S rRNA sequences from several organisms]

Watson-Crick basepairs

  • Watson-Crick basepairs can substitute for one another freely without changing the structure of the RNA molecule. They are said to be isosteric, and changes between these basepairs are an example of neutral variability. The two bases of each pair are held together by hydrogen bonds (dotted lines).


RNA sequence variability

  • To preserve RNA helices, mutations must be compensated: replacing a GC basepair with an AU basepair requires two letters to change, often in distant regions of the sequence. Statistically, this is called “long-range dependence.”

  • Compensating mutations such as these do not change the secondary or tertiary structure of the molecule.
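
The idea of a compensating mutation can be sketched in a few lines of Python. This is only an illustration: the 12-letter hairpin, its list of paired positions, and the helper names are made up for the example and do not come from the 5S rRNA.

```python
# Watson-Crick pairs, written as (5' base, 3' base).
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

def mutate(seq, position, new_base):
    """Return a copy of seq with the base at a 0-based position replaced."""
    return seq[:position] + new_base + seq[position + 1:]

def helix_intact(seq, pairs):
    """True if every listed (i, j) position pair is a Watson-Crick pair."""
    return all((seq[i], seq[j]) in WATSON_CRICK for i, j in pairs)

# Hypothetical hairpin: stem GGCA, loop GAAA, stem UGCC.
ORIGINAL = "GGCAGAAAUGCC"
PAIRS = [(0, 11), (1, 10), (2, 9), (3, 8)]

# One mutation alone breaks the GC pair at positions (1, 10) ...
single = mutate(ORIGINAL, 1, "A")
# ... and a second, distant mutation restores it as an AU pair.
double = mutate(single, 10, "U")

print(helix_intact(ORIGINAL, PAIRS))  # helix present
print(helix_intact(single, PAIRS))    # helix broken by one change
print(helix_intact(double, PAIRS))    # helix restored by the compensating change
```

The two changed letters sit nine positions apart in the sequence, which is the long-range dependence that column-by-column alignment scoring cannot see.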



Comparative sequence analysis

  • By manually aligning similar RNA sequences and noting the pairs of columns where mainly AU, CG, GC, and UA pairs occur, one can infer the secondary structure of an RNA molecule.

  • This is the inferred secondary structure of the 5S RNA, with bases labeled as found in E. coli. There are five helical regions, with three “internal loops” and two “hairpin loops” separating them. Note the colors!

Fox & Woese 1975; Peattie et al. 1981; Noller 1984; Cannone et al. 2002; http://www.rna.ccbb.utexas.edu
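
A rough sketch of the column-pair idea behind comparative sequence analysis, in Python. The three aligned toy sequences, the function name `covarying_columns`, and the minimum-separation rule are assumptions made for this illustration. Real analyses use many more sequences, tolerate exceptions, and usually also allow GU wobble pairs.

```python
from itertools import combinations

WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

def covarying_columns(alignment, min_separation=4):
    """Return column pairs (i, j) where every sequence has a Watson-Crick
    pair and at least two different pair types occur (covariation)."""
    length = len(alignment[0])
    hits = []
    for i, j in combinations(range(length), 2):
        if j - i < min_separation:
            continue  # too close together to close a hairpin loop
        observed = {(seq[i], seq[j]) for seq in alignment}
        if observed <= WATSON_CRICK and len(observed) >= 2:
            hits.append((i, j))
    return hits

# Hypothetical, already aligned sequences of equal length.
ALIGNMENT = [
    "GGCAGAAAUGCC",
    "GACAGAAAUGUC",
    "CGUAGAAAUACG",
]

print(covarying_columns(ALIGNMENT))
```

Columns 3 and 8 hold an AU pair in every toy sequence, so they are not reported: a perfectly conserved pair is consistent with a helix but gives no covariation evidence for it.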


RNA 3D structure

  • Starting late in the year 2000, high-resolution atomic structures of entire ribosomes have been published. These show the bases, the backbone, the Watson-Crick basepairs, and several new types of basepairs.

The 2009 Nobel Prize in Chemistry went to Yonath, Ramakrishnan, and Steitz for their work on X-ray crystal structures of ribosomes.

E. coli 5S

Three 5S rRNA 3D structures

Haloarcula marismortui

E. coli

Thermus thermophilus

RNA multiple sequence alignment

The same RNA in different organisms can be presumed to have the same, or roughly the same, secondary and 3D structure.

Compensating changes far apart in the sequence make it hard to use multiple sequence alignment tools that were developed for proteins.

Two situations for RNA multiple sequence alignment

We have two or more sequences from the same RNA, but don’t know their common secondary structure or 3D structure.

We have RNA sequences and a common secondary structure, or even a single 3D structure, which we can assume they all share to some degree.

RNA MSA: RNA Multiple Sequence Alignment

Slides by Anton Petrov, Ph.D. student, BGSU


Why DNA and protein alignment methods don’t work for RNA

RNA sequences may look dissimilar but still fold into the same structure.

Gorodkin et al., 2010. Trends in Biotechnology



RNA-specific alignment methods

and many others...

RNA MSA and ncRNA discovery

  • Conservation is a reliable indicator of biological importance.

  • If an RNA fragment is conserved across multiple species, it may function as ncRNA.

  • ncRNA discovery programs scan multiple genomic sequences in order to detect putative ncRNA candidates.

  • MSA is an essential part of the ncRNA discovery pipeline.

RNA MSA and ncRNA discovery

Multiple sequence alignment and secondary structure prediction can be combined for ncRNA discovery in three ways:

  • Align first

  • Fold first

  • Align and fold simultaneously


  • Once you have a good MSA, you can use tools like RNAz to scan your alignment for conserved stable secondary structures, which may function as ncRNAs.


Suggested reading

Alignment to a common secondary structure

One standard starting point is a “seed” alignment of 20-100 RNA sequences together with a “dot-bracket” secondary structure diagram.

Infernal is a program that makes a “covariance model” based on the seed alignment and allows one to align new sequences to this model, thus aligning new sequences to an existing alignment.
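
As a small illustration of what a “dot-bracket” diagram encodes, here is a Python sketch that converts one into a list of basepaired positions. The function name `parse_dot_bracket` and the example string are hypothetical. Only plain parentheses and dots are handled, so pseudoknot notation with other bracket types is rejected.

```python
def parse_dot_bracket(structure):
    """Return sorted 0-based (i, j) pairs encoded by a dot-bracket string.

    "(" opens a basepair, ")" closes the most recently opened one,
    and "." marks an unpaired position.
    """
    open_positions = []
    pairs = []
    for position, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(position)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {position}")
            pairs.append((open_positions.pop(), position))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)

# A four-basepair helix closed by a four-letter hairpin loop.
print(parse_dot_bracket("((((....))))"))
```

Because the brackets must nest, the notation describes exactly the kind of structure a context-free grammar can model.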

Alignment to a model based on a 3D structure

One focus of the BGSU RNA group

Take an RNA 3D structure, with all of the detail it gives about Watson-Crick basepairs and other RNA basepairs

Make a model for sequence variability

Align RNA sequences to the model, and thus to one another.

H.m. 5S rRNA basepair diagram

  • Standard Watson-Crick basepairs are denoted by = (GC) or – (AU).

  • In other pairs, a circle stands for the Watson-Crick edge, a square for the Hoogsteen edge, and a triangle for the Sugar edge.

  • The basepair diagrams for E. coli and T. thermophilus are similar.

  • Working hypothesis: other organisms have largely the same basepair diagram, with neutral basepair substitutions that do not alter the 3D structure

Non-Watson-Crick basepairs

  • The 3D structures show a variety of planar basepair interactions other than Watson-Crick basepairs. These occur between helices and allow the RNA molecule to achieve tighter turns or other important 3D structural features.

  • trans Hoogsteen / Sugar Edge: A78-G98 and A45-U40 in E. coli 5S

  • cis Watson-Crick / Sugar Edge: A57-C30 and A46-A39 in E. coli 5S

  • trans Sugar Edge / Sugar Edge: G13-G69 in E. coli 5S

Isostericity for non-Watson-Crick basepairs

  • Non-Watson-Crick basepairs have different basepair substitution (isostericity) rules than Watson-Crick pairs. Below are some examples of geometrically similar basepairs.

  • trans Hoogsteen / Sugar Edge: A78-G98 and A45-U40 in E. coli 5S

  • cis Watson-Crick / Sugar Edge: A57-C30 and A46-A39 in E. coli 5S

  • trans Sugar Edge / Sugar Edge: G13-G69 in E. coli 5S

Stochastic grammars

  • Stochastic grammars are probabilistic models for sequences of characters or words.

  • They can enforce specified grammatical rules while still allowing for variability in the specific sequence.

  • The classic example:

    Colorless green ideas sleep furiously

    obeys English grammatical rules, but is a very unlikely sentence to occur in normal English.

  • Context-free grammars have certain limitations on the grammatical rules that can be enforced.

  • Chomsky 1956, 1959; Eddy and Durbin 1994.
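
To make “grammatical but unlikely” concrete, here is a toy stochastic context-free grammar in Python. Every rule and probability below is invented for illustration. The only point is that the probability of a derivation is the product of the probabilities of the rules it uses.

```python
import math

# Hypothetical rules: (left-hand side, right-hand side) -> probability.
# For each left-hand side the probabilities sum to 1.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Adj", "NP")): 0.4,
    ("NP", ("N",)): 0.6,
    ("VP", ("V", "Adv")): 0.3,
    ("VP", ("V",)): 0.7,
    ("Adj", ("colorless",)): 0.01,
    ("Adj", ("green",)): 0.09,
    ("Adj", ("new",)): 0.9,
    ("N", ("ideas",)): 0.5,
    ("N", ("dogs",)): 0.5,
    ("V", ("sleep",)): 1.0,
    ("Adv", ("furiously",)): 0.02,
    ("Adv", ("soundly",)): 0.98,
}

def derivation_probability(derivation):
    """Multiply the probabilities of the rules used in a derivation."""
    return math.prod(RULES[rule] for rule in derivation)

# "colorless green ideas sleep furiously"
UNLIKELY = [
    ("S", ("NP", "VP")),
    ("NP", ("Adj", "NP")), ("Adj", ("colorless",)),
    ("NP", ("Adj", "NP")), ("Adj", ("green",)),
    ("NP", ("N",)), ("N", ("ideas",)),
    ("VP", ("V", "Adv")), ("V", ("sleep",)), ("Adv", ("furiously",)),
]

# "new ideas sleep soundly"
LIKELY = [
    ("S", ("NP", "VP")),
    ("NP", ("Adj", "NP")), ("Adj", ("new",)),
    ("NP", ("N",)), ("N", ("ideas",)),
    ("VP", ("V", "Adv")), ("V", ("sleep",)), ("Adv", ("soundly",)),
]

print(derivation_probability(UNLIKELY))
print(derivation_probability(LIKELY))
```

Both sentences obey the grammar, but under these made-up numbers the first is several orders of magnitude less probable than the second.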

Simple SCFG model for RNA

  • From the basepair diagram, we construct a model which mimics the structure of the molecule but which allows for neutral basepair variability and other minor variations.

  • The model for the full 5S is too large to display, so we show a very small cartoon of the 5S molecule instead.



Using the SCFG model to generate sequence variants

  • The Initial node generates letters independently with a given length and letter distribution. This time we get an A on the left and CA on the right.

  • A Basepair node generates a (dependent) pair of letters and independent insertions. The first Basepair node generates a CG pair and inserts an A on the right (before the G).

  • The second Basepair node generates CG with no insertions.

  • The Junction node generates nothing, but passes control to its two child nodes.

  • The Initial node on the left branch generates U on the left and AC on the right; the Initial node on the right branch generates AU.

  • The first Basepair node on the left branch generates GC; the first Basepair node on the right branch generates AG.

  • The next Basepair node on the left branch generates UA; the next Basepair node on the right branch generates GA.

  • The Hairpin node on the left branch generates UUCG (a variant of the UNCG hairpin) and the last Basepair node on the right branch generates GC.

  • Finally, the last Hairpin node generates GAAGA (a variant of the GNRA loop with one insertion) and generation stops.
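
The generation walk-through above can be imitated with a small Python sketch. It is not the BGSU model: the node layout only mirrors the cartoon, and the pair weights, insertion probability, loop motifs, and all function names are assumptions chosen for illustration.

```python
import random

BASES = "ACGU"
# Hypothetical pair distribution: mostly Watson-Crick, occasionally GU wobble.
PAIRS = ["AU", "UA", "GC", "CG", "GU", "UG"]
PAIR_WEIGHTS = [0.22, 0.22, 0.24, 0.24, 0.04, 0.04]
INSERT_PROB = 0.1  # chance of one inserted letter on each side of a pair

def letters(rng, count):
    """Independent random letters, as an Initial node or an insertion emits."""
    return "".join(rng.choice(BASES) for _ in range(count))

def generate(node, rng):
    """Recursively generate the sequence for a node and its descendants."""
    kind = node["kind"]
    if kind == "hairpin":
        return rng.choice(node["loops"])
    if kind == "junction":  # emits nothing itself, only its children do
        return "".join(generate(child, rng) for child in node["children"])
    if kind == "initial":
        left = letters(rng, node["left"])
        right = letters(rng, node["right"])
        return left + generate(node["child"], rng) + right
    if kind == "basepair":
        pair = rng.choices(PAIRS, weights=PAIR_WEIGHTS)[0]
        ins_left = letters(rng, 1) if rng.random() < INSERT_PROB else ""
        ins_right = letters(rng, 1) if rng.random() < INSERT_PROB else ""
        return pair[0] + ins_left + generate(node["child"], rng) + ins_right + pair[1]
    raise ValueError(f"unknown node kind {kind!r}")

def initial(left, right, child):
    return {"kind": "initial", "left": left, "right": right, "child": child}

def stem(length, child):
    """Wrap child in the given number of nested Basepair nodes."""
    for _ in range(length):
        child = {"kind": "basepair", "child": child}
    return child

# Cartoon model: a two-pair stem, then a junction with two hairpin branches.
LEFT = initial(1, 2, stem(2, {"kind": "hairpin", "loops": ["UUCG", "UACG"]}))
RIGHT = initial(2, 0, stem(3, {"kind": "hairpin", "loops": ["GAAA", "GCAA", "GUGA"]}))
MODEL = initial(1, 2, stem(2, {"kind": "junction", "children": [LEFT, RIGHT]}))

rng = random.Random(2010)  # seeded so that runs are repeatable
for _ in range(3):
    print(generate(MODEL, rng))
```

Each run prints a different sequence, but all of them share the same nested basepair pattern, which is what lets the model stand in for a family of homologous RNAs.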


Parsing sequences according to a model

  • Typically, a model can generate the same sequence in several different ways.

  • Given a model and a sequence that was generated by the model, we want to determine the single way of generating the sequence that is most likely.

  • The most likely generation history tells which node generated which part of the sequence, and so aligns the sequence to the model.

Multiple ways to generate a sequence

  • Here is another way that the same simple model could have generated the sequence. This generation history would have very low probability, since the letters indicated to make certain basepairs are not isosteric with the originally observed basepair.


Determining the maximum probability generation history for a sequence

  • The CYK (Cocke, Younger, Kasami) dynamic programming algorithm has each node (from leaves to root) examine each subsequence (from shorter to longer) and record the maximum-probability way that it and its children could generate that subsequence.

  • Here, the blue Basepair node considers how it and its children can generate the colored subsequence.

  • The blue Basepair node then considers another way for it and its children to generate the colored subsequence.

  • The red nodes have already considered every subsequence of this length and shorter.

  • The algorithm runs in O(L²M) time, where L is the length of the input sequence and M is the number of nodes.
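
The recurrence can be sketched in Python for a deliberately simplified model: a linear chain of Basepair nodes ending in one Hairpin node, with no Junction nodes. All probabilities and names below are hypothetical. The sketch evaluates the CYK-style recurrence top-down with memoization instead of filling a table from shorter to longer subsequences, but it visits the same (node, subsequence) states.

```python
from functools import lru_cache

BASES = "ACGU"
# Hypothetical pair probabilities over all 16 ordered pairs (they sum to 1).
PAIR_PROB = {(x, y): 0.004 for x in BASES for y in BASES}
PAIR_PROB.update({("A", "U"): 0.22, ("U", "A"): 0.22, ("G", "C"): 0.22,
                  ("C", "G"): 0.22, ("G", "U"): 0.04, ("U", "G"): 0.04})
LOOP_LENGTH_PROB = {3: 0.1, 4: 0.6, 5: 0.2, 6: 0.1}  # hairpin loop lengths
P_CONTINUE = 0.9   # hand over to the child node
P_INSERT = 0.05    # insert one letter on the left (same value for the right)
BASE_PROB = 0.25   # inserted and loop letters are uniform

def best_parse(seq, num_pairs):
    """Return (probability, basepairs) of the most likely way a chain of
    num_pairs Basepair nodes plus a Hairpin node generates seq."""

    @lru_cache(maxsize=None)
    def node(k, i, j):
        # Best way for node k and its descendants to generate seq[i:j].
        if k == num_pairs:  # the Hairpin node emits the whole loop
            return (LOOP_LENGTH_PROB.get(j - i, 0.0) * BASE_PROB ** (j - i), ())
        if j - i < 2:
            return (0.0, ())
        prob, history = inside(k, i + 1, j - 1)
        return (PAIR_PROB[seq[i], seq[j - 1]] * prob, ((i, j - 1),) + history)

    @lru_cache(maxsize=None)
    def inside(k, i, j):
        # Node k has emitted its pair; it may insert letters or hand over.
        prob, history = node(k + 1, i, j)
        options = [(P_CONTINUE * prob, history)]
        if j > i:
            prob, history = inside(k, i + 1, j)   # insertion on the left
            options.append((P_INSERT * BASE_PROB * prob, history))
            prob, history = inside(k, i, j - 1)   # insertion on the right
            options.append((P_INSERT * BASE_PROB * prob, history))
        return max(options, key=lambda option: option[0])

    return node(0, 0, len(seq))

probability, pairs = best_parse("GGCAGAAAUGACC", num_pairs=4)
print(probability, pairs)
```

The returned basepair list says which node generated which positions, so parsing several sequences against the same model aligns them to the model and therefore to one another. Each (node, start, end) state is computed once with a constant amount of work, which for this chain-shaped model gives the O(L²M) behaviour quoted above.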


Uses for SCFG models for RNA

  • Multiple sequence alignment of RNA

  • Searching genomes for RNAs homologous to a given RNA

  • Infernal is a commonly used SCFG program

  • Note: Some RNAs are ancient (messenger RNA, ribosomal RNA, transfer RNA), but many others, such as microRNAs and other regulatory RNAs, are only now being characterized. They occur in UTRs, introns, and intergenic regions, and we need to be able to recognize them!
