Efficient RNA Secondary Structure Alignment Method with Pseudoknots

2nd International Workshop on Natural Computing, Dec. 10-12, 2007
Noyori Conference Hall, Nagoya University, Japan
An efficient multiple alignment method for RNA secondary structures including pseudoknots
Shinnosuke Seki (1) & Satoshi Kobayashi (2)
(1) Department of Computer Science, University of Western Ontario, London, Ontario, Canada, N6A 5B7, sseki@csd.uwo.ca
(2) Department of Computer Science, The University of Electro-Communications, 1-5-1 Chofugaoka, Chofu, Tokyo, Japan, 182-8585, satoshi@cs.uec.ac.jp

Problem setting
• INPUT
  • RNA secondary structures (2 or more)
    • Sequential info.
    • Structural info. (which can be obtained from a database or from a prediction algorithm based on the sequential info.)
• OUTPUT
  • The alignment of the input RNA secondary structures as a grammatical model

Secondary structure alignment
• DNA and RNA sequences fold onto themselves to form 2D (secondary) or 3D (tertiary) structures.
• These higher-dimensional structures play an important role in determining biological function.
• Similar structures may have similar functions.
• Structure alignment aims to find similarity between structures as well as between sequences.

Cloverleaf structure (tRNA)
[Figure: tRNA cloverleaf with its 5' and 3' ends and the D, A and T arms labeled]
• Secondary structure
  • 1 multiple loop with 3 hairpin loops
• Tertiary structure
  • L-shaped 3D structure

Pseudoknotted structure (tmRNA)
• E. coli transfer-messenger RNA
• Hairpin loops
• Bulge loops
• Internal loops
• Multiple loops
• Pseudoknots

NP-hardness of pseudoknotted structure alignment
• Alignment based on the edit distance between pseudoknotted structures has been proven NP-hard.
• We focus on a subset of pseudoknotted structures that can be modeled by a class of grammars called SLTAGs.
• Most pseudoknots found in practice can be modeled by SLTAGs.

Chomsky-Schützenberger hierarchy
[Figure: nested language classes: Regular, Context-free (CF), Context-sensitive (CS), Recursively enumerable]
• Context-free grammars are expressive enough to model pseudoknot-free secondary structures.
• Modeling pseudoknotted structures requires stronger grammars, such as context-sensitive grammars.
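The distinction between the two bullets above comes down to whether base pairs nest or cross. As a minimal sketch (not part of the original slides), the function below checks a list of base pairs for a crossing. The name `has_pseudoknot`, the 0-based positions, and the two example pair lists are assumptions made for illustration.

```python
from itertools import combinations


def has_pseudoknot(pairs):
    """Return True if any two base pairs cross, i.e. i < k < j < l.

    `pairs` is a list of (i, j) positions with i < j. Purely nested or
    side-by-side pairs never satisfy the crossing condition.
    """
    for (i, j), (k, l) in combinations(sorted(pairs), 2):
        # sorted() guarantees i <= k, so a crossing means k lies inside
        # (i, j) while its partner l lies outside it.
        if i < k < j < l:
            return True
    return False


# Illustrative inputs (hypothetical, not taken from the slides).
nested = [(0, 9), (1, 8), (2, 7)]            # a plain stem: nested pairs only
crossing = [(0, 6), (1, 5), (3, 9), (4, 8)]  # two stems that cross each other

print(has_pseudoknot(nested), has_pseudoknot(crossing))
```

Nested pairs are exactly what a context-free grammar can generate with rules of the form S → a S b; the crossing case is the one that needs the extra power discussed on this slide.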

Simple Linear Tree Adjoining Grammars (SLTAGs)
• A mildly context-sensitive grammar (between CF & CS)
• A tree grows by replacing a *-node with a tree called the adjoining tree (bolded in left fig.)
• Terminal symbols derived at the same time are considered to form a base pair.
• Descriptive power for pseudoknots (left fig.)
[Figure: SLTAG derivation tree over the nodes S and S*, with terminals A, C, G, U and λ, and the derived sequence read from 5' to 3']

Challenges in modeling by SLTAGs
• Ambiguity
  • A grammar may admit multiple derivations of the same word.
  • When modeling something by a grammar, its ambiguity must be taken into account!
  • How to overcome the ambiguity?
    • Alignment of derivations by SLTAGs [Seki & Kobayashi, 2005]
• Modeling multiple pseudoknots
  • SLTAGs can model an RNA secondary structure with 1 pseudoknot, but not one with multiple pseudoknots.

Abstract RNA Structure (ARNAS) model
• A tree structure that models an RNA secondary structure and represents the relationships among its components.
• Vertices of ARNAS models are
  • String (single-base chain)
  • Tandem (also called a stem; a cascade of base pairs)
  • Pseudoknot

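Example 1: ARNAS model for tRNA cloverleaf
[Figure: the tRNA secondary structure (5' and 3' ends; D, A and T arms) next to its ARNAS model, a tree under a root vertex whose vertices are single-base chains (SC) and tandems, including the D-arm, A-arm and T-arm tandems]
SC: single-base chain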

ARNAS components
• String (can be modeled by a regular grammar)
  • A single-base chain of maximal length
  • Sequential information only
• Tandem (can be modeled by a context-free grammar)
  • A cascade of base pairs
  • Information on its sequence, its nested base pairing, and its child components.
• Pseudoknot (requires a context-sensitive grammar)
  • A pseudoknot in the biological sense
  • A pseudoknot structure that can be modeled by SLTAGs
  • Information on its sequence, its crossing base pairing, and its child components.
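The three component types above can be pictured as nodes of a small tree data structure. The following is a minimal sketch under stated assumptions: the class name `ArnasNode`, its field names, and the way base pairs are stored as index pairs are all hypothetical, and the example sequences are invented for illustration. The slides do not specify a concrete representation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

KINDS = ("string", "tandem", "pseudoknot")


@dataclass
class ArnasNode:
    """One vertex of an ARNAS-like tree (hypothetical representation)."""
    kind: str                                   # one of KINDS
    sequence: str = ""                          # bases owned by this component
    pairs: List[Tuple[int, int]] = field(default_factory=list)
    children: List["ArnasNode"] = field(default_factory=list)

    def __post_init__(self):
        if self.kind not in KINDS:
            raise ValueError(f"unknown component kind: {self.kind!r}")
        # A string carries sequential information only.
        if self.kind == "string" and (self.pairs or self.children):
            raise ValueError("a string component has no pairs or children")


def count_kinds(node):
    """Count components of each kind in the subtree rooted at `node`."""
    counts = {kind: 0 for kind in KINDS}
    stack = [node]
    while stack:
        current = stack.pop()
        counts[current.kind] += 1
        stack.extend(current.children)
    return counts


# A hairpin: a tandem of three nested base pairs enclosing one loop string.
loop = ArnasNode("string", sequence="UUCG")
stem = ArnasNode("tandem", sequence="GGGCCC",
                 pairs=[(0, 5), (1, 4), (2, 3)], children=[loop])
print(count_kinds(stem))
```

Keeping the pairing information on the tandem and pseudoknot nodes, and nothing but a sequence on string nodes, mirrors the "information" bullets listed for each component type.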

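Example 2: ARNAS model for tmRNA
[Figure: the tmRNA secondary structure (5' and 3' ends) next to its ARNAS model, a tree under a root vertex with single-base chain (SC), tandem and pseudoknot vertices. Sequence labels in the figure: AAAAAAUAGUGAC, GCUUUAGCAG, CUGC, UAGAGC, CUUAAUAAC, U, CGAGG, GCGGUU, CCUCG, AGCCGC, G, GG, UAAAA]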

Alignment of ARNAS components
• ARNAS components can be modeled by SLTAGs.
• The SLTAG parser [Uemura et al., 1999] provides the set of all derivations of each component to be aligned.
• Using dynamic programming, the alignment algorithm for SLTAG models [Seki & Kobayashi, 2005] calculates alignments for all combinations of 2 derivations and finds the optimal alignment among them.
• The components to be aligned may have sub-ARNAS models as their children. The alignments of these sub-ARNAS models have already been calculated and are incorporated into the alignment of these components.
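To give a feel for the dynamic programming involved, here is a minimal sketch for the simplest component type, the string, which carries sequence information only. It computes a plain edit distance with unit costs. This is a standard textbook recurrence used as a stand-in; it is not the SLTAG derivation alignment of [Seki & Kobayashi, 2005], and the function name and the unit costs are assumptions.

```python
def edit_distance(a, b):
    """Edit distance between two base chains with unit costs.

    prev[j] holds the distance between the first i-1 bases of `a`
    and the first j bases of `b`; cur is the row for i bases.
    """
    prev = list(range(len(b) + 1))        # aligning "" against b[:j]
    for i, base_a in enumerate(a, start=1):
        cur = [i]                         # aligning a[:i] against ""
        for j, base_b in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                      # delete base_a
                cur[j - 1] + 1,                   # insert base_b
                prev[j - 1] + (base_a != base_b)  # match or substitute
            ))
        prev = cur
    return prev[-1]


# Two single-base chains differing by one deleted U (illustrative input).
print(edit_distance("GCUUUAGCAG", "GCUUAGCAG"))  # → 1
```

The table has (len(a) + 1) × (len(b) + 1) cells, each filled in constant time, so this string case runs in time proportional to the product of the two lengths.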

Time complexity of the component alignment algorithm (Table 1)
• The algorithm can employ context-free or regular grammars as its base grammar, depending on the components to be aligned.
• Its time complexity varies accordingly (Table 1; the table is not preserved in this transcript), where s1 and s2 are the numbers of bases in the components to be aligned.

ARNAS alignment algorithm
• Based on the tree alignment algorithm [Jiang et al., 1995], whose time complexity is [formula missing], where n1 and n2 are the numbers of nodes of the trees to be aligned.
• The scores for editing nodes of ARNAS models are the alignment scores of the corresponding ARNAS components.
• Bottom-up approach
  • Given two ARNAS models, the algorithm
    • calculates alignments between leaf components (strings),
    • calculates alignments between their parent components based on those alignments,
    • repeats this process until it reaches the alignment of the root components, which is the alignment between the ARNAS models.
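The bottom-up approach above implies a processing order in which every component is handled after all of its children. The sketch below shows only that visiting order, using a post-order traversal; it does not implement the tree alignment algorithm of [Jiang et al., 1995]. The `(label, children)` tuple encoding and the labels in the example tree are assumptions made for illustration.

```python
def bottom_up_order(tree):
    """Return component labels so that children precede their parent.

    `tree` is a (label, children) pair, where children is a list of
    trees of the same shape. This is an ordinary post-order traversal.
    """
    label, children = tree
    order = []
    for child in children:
        order.extend(bottom_up_order(child))
    order.append(label)  # the parent comes only after all of its children
    return order


# A small hypothetical model: root -> (tandem -> SC1, SC2), SC3.
model = ("root", [("tandem", [("SC1", []), ("SC2", [])]), ("SC3", [])])
print(bottom_up_order(model))
# → ['SC1', 'SC2', 'tandem', 'SC3', 'root']
```

Processing components in this order means that when a tandem or pseudoknot is aligned, the alignment scores of its children are already available.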

The time complexity of the ARNAS alignment algorithm
• Given RNA secondary structures of length n1 and n2,
  • the theoretical time complexity is [formula missing].
• In practice, it is not so intractable because of
  • The scarcity of pseudoknots
    • Almost all component alignments can be done in [formula missing] time.
  • Short-bp property
    • A pseudoknot is much shorter than the secondary structure itself.

Multiple alignment algorithm
• Progressive alignment approach
  • Given multiple ARNAS models, find the two ARNAS models with the highest similarity and align them.
  • The alignment result is itself an ARNAS model, so this process can be repeated until all of the given ARNAS models are aligned.
[Figure: example for four models: ARNAS2 and ARNAS3 are aligned into ARNAS(2, 3), which is aligned with ARNAS1 into ARNAS(1, (2, 3)), which is aligned with ARNAS4 into ARNAS((1, (2, 3)), 4)]
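The progressive approach above can be sketched as a greedy loop. This is a minimal illustration under stated assumptions: `progressive_align`, `similarity` and `align` are hypothetical names, the models are stand-in (label, value) pairs rather than real ARNAS models, and re-scoring the merged model against the remaining ones at every step is an assumption, since the slides do not say how similarity is recomputed.

```python
from itertools import combinations


def progressive_align(models, similarity, align):
    """Repeatedly align the most similar pair until one model remains.

    `similarity(a, b)` returns a number (higher means more similar) and
    `align(a, b)` returns a merged model of the same type as its inputs.
    """
    models = list(models)
    if not models:
        raise ValueError("need at least one model")
    while len(models) > 1:
        # Pick the index pair (i, j), i < j, with the highest similarity.
        i, j = max(combinations(range(len(models)), 2),
                   key=lambda ij: similarity(models[ij[0]], models[ij[1]]))
        models[i] = align(models[i], models[j])  # merged model replaces i
        del models[j]                            # j > i, so i stays valid
    return models[0]


# Toy stand-ins: each "model" is (label, position on a line).
toy = [("1", 0.0), ("2", 3.0), ("3", 4.0), ("4", 10.0)]
closeness = lambda a, b: -abs(a[1] - b[1])
merge = lambda a, b: ((a[0], b[0]), (a[1] + b[1]) / 2)

print(progressive_align(toy, closeness, merge)[0])
# → (('1', ('2', '3')), '4')
```

With these toy inputs the merge order reproduces the shape of the four-model example on the slide: 2 and 3 first, then 1, then 4.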

Experimental results (1)
• How many pseudoknotted secondary structures can be converted into ARNAS models?
• INPUT: 675 pseudoknotted RNA structures from the Comparative RNA Web (CRW) Site: http://www.rna.icmb.utexas.edu.
• 561 of 675 (83.1%) can be converted into ARNAS models.
• All but one of the RNAs of length up to about 2400 can be converted.
• This suggests that RNA structures rarely contain a pseudoknot that cannot be modeled by SLTAGs.

Experimental results (2)
• Short-bp property
• INPUT: The 561 RNA secondary structures whose pseudoknots can be modeled by SLTAGs.
• Compare the length of each RNA secondary structure with the length of the longest pseudoknot in it.
• The least-squares method provides a theoretical curve (the formula is not preserved in this transcript), where x is the length of the secondary structure and P(x) is the length of its longest pseudoknot.

Short-bp Property

Experimental results (3)
• An experimental time complexity
• SETTING:
  • Intel(R) Xeon processors 2.8GHz×2 with 2GB memory
  • Cf. in this environment, our original algorithm without the ARNAS modification takes about 600 sec. to align pseudoknots of length around 80 nucleotides.
• INPUT: 150 of the 561 RNAs with structural info.
• RESULT: A theoretical curve relating x (the length of the RNAs) to T(x) (the alignment time [sec.]) was fitted (the formula is not preserved in this transcript).
  • It can align RNAs of 2400 nucleotides in about 15 sec.

Running Time

Future work • Experiments on the accuracy of ARNAS alignment algorithm • Comparison with other algorithms for pseudoknotted RNA alignment
