RNA Structure Prediction Including Pseudoknots Based on Stochastic Multiple Context-Free Grammar. PMSB2006, June 18, Tuusula, Finland. Yuki Kato, Hiroyuki Seki and Tadao Kasami, Graduate School of Information Science, Nara Institute of Science and Technology (NAIST).


RNA Structure Prediction Including Pseudoknots Based on Stochastic Multiple Context-Free Grammar

PMSB2006, June 18, Tuusula, Finland

Yuki Kato, Hiroyuki Seki and Tadao Kasami

Graduate School of Information Science,

Nara Institute of Science and Technology (NAIST)



Table of Contents

  • Background

    • Grammatical approach to RNA structure modeling

  • Model

    • Stochastic multiple context-free grammar

  • Algorithms

    • Parsing and parameter estimation

  • Experimental results

    • RNA pseudoknot prediction

  • Summary


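RNA Secondary Structure: Stem-Loop

  • Complementary base pairs: A•U, G•C

  • Connecting the base pairs of a stem-loop with arcs gives a nested pattern.

[Figure: stem-loop formed by the sequence 5'-C U U C A U C A G A A A A U G A C-3'. The stem consists of the base pairs C•G, U•A and U•A and closes the loop; in the arc diagram below the sequence the arcs are nested.]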

Modeling RNA Secondary Structure by Context-Free Grammar (CFG)

  • RNA secondary structure can be modeled by the parse structure of a CFG.

    Structure prediction ⇔ Parsing

  • Example of CFG rules:

[Figure: the secondary structure of the sequence U U C A U C A G A A next to the corresponding derivation tree, whose internal nodes are labeled S and whose leaves spell u u c a u c a g a a.]
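The correspondence between a derivation and a nested structure can be sketched in a few lines of Python. The rule set used here (S → x S y for a base pair, S → x S and S → S x for unpaired bases, S → ε to stop) is an assumption chosen for illustration, not the rule set shown on the slide, and the function name `derive` is hypothetical.

```python
def derive(steps):
    """Expand S top-down; return (sequence, base pairs as 0-based positions).

    Assumed rules (illustrative only):
      ("pair", x, y): S -> x S y   (x and y form a base pair)
      ("left", x):    S -> x S     (unpaired base on the left)
      ("right", x):   S -> S x     (unpaired base on the right)
      ("end",):       S -> epsilon
    """
    left, right, pair_marks = [], [], []
    for step in steps:
        kind = step[0]
        if kind == "pair":
            _, x, y = step
            # Remember where both partners sit in the left/right buffers.
            pair_marks.append((len(left), len(right)))
            left.append(x)
            right.append(y)
        elif kind == "left":
            left.append(step[1])
        elif kind == "right":
            right.append(step[1])
        elif kind == "end":
            break
        else:
            raise ValueError(f"unknown rule: {kind}")
    # Symbols emitted on the right were produced outermost first.
    seq = "".join(left) + "".join(reversed(right))
    n = len(seq)
    pairs = [(i, n - 1 - j) for i, j in pair_marks]
    return seq, pairs


steps = [("pair", "u", "a"), ("pair", "u", "a"), ("pair", "c", "g"),
         ("left", "a"), ("left", "u"), ("left", "c"), ("left", "a"), ("end",)]
seq, pairs = derive(steps)
print(seq)    # uucaucagaa
print(pairs)  # [(0, 9), (1, 8), (2, 7)]
```

Because every pair rule wraps the remaining nonterminal on both sides, the base pairs produced this way are always nested, which is why parsing with such a grammar amounts to predicting a pseudoknot-free structure.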


RNA Secondary Structure: Pseudoknot

  • CFGs cannot represent pseudoknots.

  • Connecting the base pairs of a pseudoknot with arcs gives a crossed pattern.

[Figure: pseudoknot formed by the sequence 5'-C U U C A U C A G A A A A U G A C-3', drawn with two stems; in the arc diagram below the sequence the arcs of the two stems cross.]
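The nested/crossed distinction can be stated precisely: two base pairs (i, j) and (k, l) cross when i < k < j < l. The following is a minimal sketch of that test; the function names and the example positions are hypothetical and are not taken from the slides.

```python
from itertools import combinations


def crossing_pairs(pairs):
    """Return every couple of base pairs (i, j), (k, l) with i < k < j < l."""
    # Normalize each pair to (smaller, larger) and sort by opening position.
    normalized = sorted(tuple(sorted(p)) for p in pairs)
    return [(p, q) for p, q in combinations(normalized, 2)
            if p[0] < q[0] < p[1] < q[1]]


def has_pseudoknot(pairs):
    """A structure contains a pseudoknot if at least two arcs cross."""
    return bool(crossing_pairs(pairs))


nested = [(1, 10), (2, 9), (3, 8)]             # stem-loop: arcs are nested
crossed = [(1, 10), (2, 9), (5, 14), (6, 13)]  # two stems whose arcs cross
print(has_pseudoknot(nested))   # False
print(has_pseudoknot(crossed))  # True
```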


Early Studies

n: sequence length


Early Studies (cont.)

  • Grammars for fully describing RNA pseudoknots:

    • SL-TAG and ESL-TAG [Uemura et al., 1999]

    • RPG [Rivas and Eddy, 2000]

  • These grammars have been identified as subclasses of multiple context-free grammars. [Kato et al., 2005]


Motivation

  • Multiple context-free grammar (MCFG):

    • Natural extension of CFG

      • Easy to compare generative power and design algorithms

    • Generative power to represent pseudoknots

    • Polynomial time parsing algorithm

  • We have shown a candidate subclass of the minimum grammars of MCFGs for representing pseudoknots.

    [Kato et al., 2005]


What’s New in the Present Work

  • Extension of MCFGs to a probabilistic model (stochastic MCFG, SMCFG)

  • Design of polynomial time parsing and parameter estimation algorithms for the subclass of SMCFGs

  • Experiments on RNA pseudoknot prediction




Relation between SMCFG and Major Probabilistic Models

  • Generative power, from weak to strong: FA, CFG, MCFG

  • Probabilistic extensions: FA → HMM, CFG → SCFG, MCFG → SMCFG

  • Typical targets: HMM for gene finding, SCFG for stem-loops, SMCFG for pseudoknots



Stochastic Multiple Context-Free Grammar (SMCFG)

  • G = (N, T, F, P, S)

    N: finite set of nonterminals, T: finite set of terminals,

    F: finite set of functions,

    P: finite set of rules with probabilities, S ∈ N: start symbol


Functions of SMCFG

  • Example:


Rules of SMCFG

  • Rule:

    • : probability that the rule is applied

    • The sum of the probabilities of the rules with the same left hand side should be one.

  • Example:
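The normalization constraint on rule probabilities can be checked mechanically. Below is a minimal sketch, assuming rules are stored as (left-hand side, right-hand side, probability) triples; this representation, the function name and the toy rule set are all hypothetical and are not the example from the slide.

```python
import math
from collections import defaultdict


def check_normalized(rules, tol=1e-9):
    """Map each left-hand side to True if its rule probabilities sum to one."""
    totals = defaultdict(float)
    for lhs, _rhs, prob in rules:
        if not 0.0 < prob <= 1.0:
            raise ValueError(f"probability out of range for {lhs}: {prob}")
        totals[lhs] += prob
    return {lhs: math.isclose(total, 1.0, abs_tol=tol)
            for lhs, total in totals.items()}


# Hypothetical toy rule set: the right-hand sides are plain labels here.
rules = [
    ("A", "f1[B]", 0.5),
    ("A", "f2[]", 0.5),
    ("B", "f3[A]", 0.8),
    ("B", "f4[]", 0.1),   # B sums to 0.9, so B violates the constraint
]
print(check_normalized(rules))  # {'A': True, 'B': False}
```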


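Derivation Trees in SMCFG

[Figure: a derivation tree whose root A is labeled with a function f (A: f) and has children A1, …, Ak. The subtrees rooted at A1, …, Ak carry the probabilities p1, …, pk, and the root is annotated with the probability of the whole tree.]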


Modeling Pseudoknot by SMCFG

[Figure: derivation tree with three nodes, from the bottom up: A deriving (ag, cu) with probability 0.7, B deriving (ag, acu) with probability 0.35, and A deriving (ag, acuu) with probability 0.28.]

UP2La[(x1, x2)] = (x1, ax2)

UP2Ru[(x1, x2)] = (x1, x2u)
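The two functions above act on pairs of strings, which is what lets a single nonterminal grow two separated regions of the sequence at once. The sketch below replays the derivation from the figure. The rule probabilities 0.5 and 0.8 are not stated on the slide; they are inferred here from the ratios 0.35 / 0.7 and 0.28 / 0.35 and should be read as assumptions, as should the Python function names.

```python
import math


def up2_la(pair):
    """UP2La[(x1, x2)] = (x1, a x2): prepend 'a' to the second component."""
    x1, x2 = pair
    return (x1, "a" + x2)


def up2_ru(pair):
    """UP2Ru[(x1, x2)] = (x1, x2 u): append 'u' to the second component."""
    x1, x2 = pair
    return (x1, x2 + "u")


# Bottom of the tree: A derives (ag, cu) with probability 0.7 (from the figure).
value, prob = ("ag", "cu"), 0.7

# Assumed rule probabilities (inferred, not given on the slide).
steps = [(up2_la, 0.5), (up2_ru, 0.8)]

for func, rule_prob in steps:
    value = func(value)
    prob *= rule_prob  # tree probability = rule probability * subtree probability
    print(value, round(prob, 2))
```

Multiplying the rule probability by the probability of the subtree at each step reproduces the values 0.35 and 0.28 shown in the figure, up to floating-point rounding.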


SMCFG for RNA Pseudoknot Modeling

  • W1, …, Wm: nonterminals

    • Note: W1 is the start symbol.

  • For each rule, two real values called

    transition probability p1 (0 < p1 ≤ 1) and emission probability p2 (0 < p2 ≤ 1) are specified.

  • Probability of each rule is defined as


SMCFG Gs



Algorithms for SMCFG

  • CYK algorithm

    calculates the optimal alignment of a sequence to an SMCFG (the most likely derivation tree).

  • Inside algorithm

    calculates the probability of a sequence given an SMCFG.

  • Inside-outside algorithm

    estimates optimal probability parameters for an SMCFG given a set of example sequences.


CYK Algorithm

  • Input:

  • The following are calculated by dynamic programming:

    • : log maximum probability that Wv generates

    • : log maximum probability that Wy generates


CYK Algorithm (cont.)

  • Output: log maximum probability that

    W1 generates

    i.e.

    • : the most likely derivation tree

    • : entire set of probability parameters


Algorithm [CYK]

  • Initialization:

    for i ← 1 to n+1, j ← i to n+1, v ← 1 to m

    do if // ε: empty sequence

    then

    else

  • Iteration:

    for i ← n downto 1, j ← i−1 to n,

    k ← n+1 downto j+1, l ← k−1 to n, v ← 1 to m

    // Some examples are shown.


Algorithm [CYK] (cont.)

  • if

[Figure: nonterminals Wv, Wy and Wz over the input positions 1, …, n, with the positions i, h, h+1, j, k and l and the substrings x1, x21 and x22 marked.]


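Algorithm [CYK] (cont.)

  • if

[Figure: nonterminals Wv and Wy over the input positions 1, …, n, with the positions i, i+1, j, k, l−1 and l, the terminals ai and al, and the substrings x1 and x2 marked.]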


Complexity of CYK Algorithm

  • m: # of nonterminals (m = a+b)

  • n: sequence length

  • Time complexity: O(amn^4 + bn^5)

  • Space complexity: O(mn^4)
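To get a feeling for these bounds, the dominant terms can simply be evaluated. This is a rough sketch that ignores constant factors; the values of a, b and n below are hypothetical and chosen only for illustration, and the function names are not from the slides.

```python
def space_term(n, a, b):
    """Dominant term of the space bound, m * n^4 with m = a + b."""
    m = a + b
    return m * n ** 4


def time_term(n, a, b):
    """Dominant terms of the time bound, a * m * n^4 + b * n^5 with m = a + b."""
    m = a + b
    return a * m * n ** 4 + b * n ** 5


# Hypothetical grammar size (a = 3, b = 2) and two illustrative lengths.
for n in (64, 92):
    print(n, space_term(n, 3, 2), time_term(n, 3, 2))
```

Since both terms grow at least as fast as n^4, going from n = 64 to n = 92 already multiplies the table size by more than four, which is why this kind of parser is practical mainly for short RNA families.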



Experimental Method

  • Construction of a model: sample sequences with structure annotation (e.g. CUACUGUUC) are taken from an RNA family database and used to build an SMCFG.

  • Secondary structure prediction: a test sequence (e.g. CUAGUCUUA) is parsed against the SMCFG with the CYK algorithm.


Data Sets for Experiments

  • Three viral RNA families including pseudoknots from Rfam ver. 7.0


Corona_pk_3 in Rfam ver. 7.0

  • Coronavirus 3' UTR pseudoknot

  • Sequence length: 62-64

[Figure: consensus structure]


HDV_ribozyme in Rfam ver. 7.0

  • Hepatitis delta virus ribozyme

  • Sequence length: 87-91

[Figure: consensus structure]


Tombus_3_IV in Rfam ver. 7.0

  • Tombusvirus 3' UTR region IV

  • Sequence length: 89-92

[Figure: consensus structure]


Evaluation for Prediction Results

  • precision = (# of correct base pairs predicted by the algorithm) / (# of predicted base pairs)

  • recall = (# of correct base pairs predicted by the algorithm) / (# of base pairs specified by the annotation)
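Both measures are straightforward to compute once the predicted and annotated structures are given as sets of base pairs. The following is a minimal sketch; the function name, the pair representation and the example positions are hypothetical, and returning 0.0 for an empty set is a convention chosen here, not something the slides specify.

```python
def precision_recall(predicted, annotated):
    """Return (precision, recall) for two collections of base pairs (i, j)."""
    pred = {tuple(sorted(p)) for p in predicted}
    gold = {tuple(sorted(p)) for p in annotated}
    correct = len(pred & gold)  # correct base pairs predicted by the algorithm
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


predicted = [(1, 10), (2, 9), (3, 8), (5, 14)]
annotated = [(1, 10), (2, 9), (5, 14), (6, 13), (7, 12)]
print(precision_recall(predicted, annotated))  # (0.75, 0.6)
```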


Experimental Results

  • Prediction accuracy


Experimental Results (cont.)

  • Running time

*: Implementation in ANSI C on a machine with an Intel Pentium D CPU at 2.80 GHz and 2.00 GB RAM


Pair Stochastic Tree Adjoining Grammar (PSTAG) [MSS05]

  • Sample sequences with structure annotation (e.g. CUACUGUUC) are taken from an RNA family database and give a derivation tree representing the known structure.

  • A test sequence (e.g. CUAGUCUUA) is aligned to that derivation tree by the PSTAG algorithm, which yields the secondary structure prediction.

[MSS05] Matsui et al., “Pair stochastic tree adjoining grammars for aligning and predicting pseudoknot RNA structures,” Bioinformatics, 2005.



Summary

  • A new probabilistic model called SMCFG has been proposed for RNA pseudoknot modeling.

  • Polynomial time parsing and parameter estimation algorithms have been designed.

  • Experimental results on RNA pseudoknot prediction have shown good prediction accuracy.