Finding Common RNA Pseudoknot Structures in Polynomial Time

Finding Common RNA Pseudoknot Structures in Polynomial Time Patricia Evans University of New Brunswick

Ribonucleic Acid (RNA) • RNA is an organic molecule that forms long chains • Each position in the chain can be one of 4 types (bases): A, G, C, U • RNA can code gene information (messenger RNA, viral RNA) • RNA can also form structures and take many functions within a cell (eg. tRNA, rRNA and other RNA-protein complexes)

RNA Bonds and Structures • RNA bases can form bonds, in a largely pairwise fashion (A-U, G-C, some exceptions) • RNA is single stranded; its bonds form mostly within a single chain, folding it into a complex structure held together by its bonds • RNA function is affected by its structure • If two bases are paired, it often does not matter what they are; only unpaired bases are ‘available’ • Common substructures can help investigate functional relationships

RNA Structural Complexity • Deceptively simple, since bases are usually paired • Stems are formed from two bonded strands, in an antiparallel orientation • These simple bonds can however combine to form complex structures • Some are nested (stems within loops) • Some are knotted (stems effectively crossing) • RNA molecules can be very long (eg. > 1000 bases), confounding exhaustive comparison techniques

Arc Representation • At a bond level, the bond structure of an RNA molecule can be represented as arcs superimposed onto the “stretched” RNA sequence. • Each arc represents a bonded pair, and the structure is a set of pairs. Nested Structure Pseudoknot

Maximum Common Ordered Substructure Input: Structures S1 and S2, where each structure is a set of pairs over n1 and n2 positions (resp.) Output:max.substructure Sc with nc positions, such that there exist 1-1 functions f1 and f2 where:

General Structures are Hard • The general MCOS problem, allowing positions to bond multiple times, is NP-hard (Goldman et al., 1999) • Comparing two RNA (pair-bond) structures is polynomial if they do not have knots (Bafna et al., 1995) • A structure S has a knot if and only if: there are pairs (i1, j1) and (i2, j2) in S where i1 < i2 < j1 < j2 ([)] • Comparing knotted arc structures is NP-hard for arbitrary pair-bond structures (Evans 1999, and others)

Comparing Knot-Free Structures If the two structures are composed only of nested bonds, they can be compared in O(n4) time using a dynamic programming algorithm that computes: M[i1, j1, i2, j2] = max { M[i1, j1-1, i2, j2] , M[i1, j1, i2, j2-1] , M[i1, k1-1, i2, k2-1] + M[k1+1, j1-1, k2+1, j2-1] +1 if (k1, j1) is in S1 and (k2, j2) is in S2 } our answer is in M[0,|A|-1,0,|B|-1] (result: Bafna et al. 1995)

Limited Context • The polynomial time DP algorithm for nested bond structures works due to the context-free nature of segments in the nested structures. • Knotted structures have segments that are not context-free, but we can limit the context that they need if we consider special cases that cover most known RNA structures.

Pseudoknot Observations • Three mutually crossing arcs generally do not occur in RNA structures (3-knot) • A structure without 3-knots can be separated into 2 layers of non-crossing arcs (2-colourable)

Pseudoknot Observations • Crossing arcs tend to be grouped into crossing stems, though there can be some nesting • Interleaving between left and right endpoints does not usually occur, and would be biochemically unstable

i j h l i j Forming LSPs To take advantage of these restrictions, we will consider that bonds group into stems, and that a stem can break the RNA sequence into linked segment pairs (LSPs): a matched pair of segments that are, or may be, linked by bonds. LSP: an ordered segment pair Segment

Merging LSPs The key to the use of LSPs is our ability to merge them to construct a larger LSP, as shown. The restrictions allow us to consider only pairwise LSP merges – we can always fill at least one existing “hole” when we merge.

Structure Pieces We can then consider two types of comparison cases, and build up our results from them: • Segment-to-segment (4 dimensions) • LSP-to-LSP (8 dimensions) We do not need to match LSPs to segments, as long as we allow both segments and LSPs to be broken into parts.

Segment Cases Segment cases are based on the BMR95 algorithm. s1: value of matching segment (i1, j1-1) to (i2, j2) s2: value of matching segment (i1, j1) to (i2, j2-1) s3: if j1 links to k1 and j2 links to k2: 1 + (value of matching segment (i1, k1-1) to (i2, k2-1)) + (value of matching segment (k1+1, j1-1) to (k2+1, j2-1))

Creating an LSP While a matched arc can break a segment into two (as in case s3), it can also create an LSP, if we allow the segments to be linked. s4: 1+ (value of matching LSP (i1, k1-1, k1+1, j1-1) to (i2, k2-1, k2+1, j2-1))

LSP Cases – Simple The first cases for matching LSPs are based on the segment matching: two paring and one split. a1: value of matching LSP (h1,l1,i1, j1-1) to (h2,l2,i2, j2) a2: value of matching LSP (h1,l1,i1, j1) to (h2,l2,i2, j2-1) a3: (value of matching segment (h1, l1) to (h2, l2)) + (value of matching segment (i1, j1) to (i2, j2)) Case a3 can be used with s4 to allow new LSPs to be made from right segments of matched LSPs.

LSP Cases – Within Right If the arcs link to positions within the right side of the LSPs, then the segments within the arcs can be the right sides of new LSPs. a4: 1 + (value of matching LSP (h1,l1,k1+1, j1-1) to (h2,l2, k2+1, j2-1)) + (value of matching segment (i1, k1-1) to (i2, k2-1))

LSP Cases – Within Right Alternatively, the arcs could bound segments that are within the structure of the right side of the LSPs. a5: 1 + (value of matching LSP (h1, l1, i1, k1-1) to (h2, l2, i2, k2-1)) + (value of matching segment (k1+1, j1-1) to (k2+1, j2-1))

LSP Cases – Cross Left If the arcs cross to the left side of the LSPs, then their left endpoints (k) can form a hole to start new LSPs. a6: 1 + (value of matching LSP (h1,k1-1, k1+1, l1) to (h2,k2-1, k2+1, l2)) + (value of matching segment (i1, j1-1) to (i2, j2-1))

LSP Cases – Cross Left The arcs can instead separate the LSP within them from initial segments. a7: 1 + (value of matching LSP (k1+1,l1,i1, j1-1) to (k2+1,l2,i2, j2-1)) + (value of matching segment (h1, k1-1) to (h2, k2-1)) We do not try to link the first and third segments as they would form part of a 3-knot.

LSP Cases – Cross Left Matched arcs can break the LSPs into three segments. a8: 1 + (value of matching segment (h1, k1-1) to (h2, k2-1)) + (value of matching segment (k1+1, l1) to (k2+1, l2)) + (value of matching segment (i1, j1-1) to (i2, j2-1))

LSP Cases – Crossed LSPs Arcs crossingexisting LSPs could need a merging of the LSP types in a6 and a7 – but then we need to consider all places for the split to occur. a9: 1 + max [over all s1,s2 with k1<s1<l1, k2<s2<l2] (value of matching LSP (h1,k1-1, s1+1,l1) to (h2,k2-1, s2+1,l2)) +(value of matching LSP (k1+1,s1,i1, j1-1) to (k2+1,s2,i2, j2-1))

Dynamic Programming • These cases take care of all possibilities for how LSPs and segments can be broken down, and their results merged. • They can be turned straightforwardly into a dynamic programming algorithm that uses two tables (one for segments, one for LSPs) • The algorithm will need to weave between these two tables in a way consistent with the data

Making It Feasible This algorithm makes very heavy use of multidimensional dynamic programming tables, and looks more of theoretical interest than practical use. • Time complexity is high at O(n10) • Space complexity is even more crucial at O(n8) Careful implementation is needed to avoid these theoretical worst cases.

Engineering Space and Time • Space and time usage can be minimised by eliminating those computations that are not needed. • The recurrence should be computed recursively (using memoisation) to enable the data to help this pruning • Note that most segment pairs will not correspond to LSPs consistent with a given arc structure • The table can be allocated dynamically, in layers, so that a hyperplane of the table is only allocated if it will contain an entry (and note h < l < i < j ) • We can reduce this further by limiting hyperplane sizes to the corresponding segment within an arc

Experiments • Having reduced the space, experiments were run on a variety of RNA structural data to determine if the algorithm is of practical use • Large Subunit ribosomal RNA structures • RNAse P structures • Mosaic Virus structures • Structures of up to 400 arcs were compared effectively in 4Gb of space, with correct substructures found • allocating about 10-14 of the theoretical table • Even the O(n4) recurrence for unknotted structures would need too much space without the space saving technique

Conclusion and Future Work • Under these restrictions, RNA bond structures can be compared in polynomial time • With careful case pruning, the algorithm is feasible and produces useful results • The problem of comparing general 2-colourable bond structures (allowing endpoint interleaving) is still open • Extensions to pattern discovery for multiple structures can be explored • Weights can be added to model RNA more accurately

Finding Common RNA Pseudoknot Structures in Polynomial Time