
### RNA Structure Prediction Including Pseudoknots Based on Stochastic Multiple Context-Free Grammar

PMSB2006, June 18, Tuusula, Finland

Yuki Kato, Hiroyuki Seki and Tadao Kasami

Graduate School of Information Science, Nara Institute of Science and Technology (NAIST)

Table of Contents

- Background
  - Grammatical approach to RNA structure modeling
- Model
  - Stochastic multiple context-free grammar
- Algorithms
  - Parsing and parameter estimation
- Experimental results
  - RNA pseudoknot prediction
- Summary

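RNA Secondary Structure: Stem-Loop

- Complementary base pairs: A•U and G•C
- Connecting the base pairs with arcs gives nested arcs.

[Figure: stem-loop structure with the stem (base pairs C•G, U•A, U•A) and the loop labeled, shown with the arc diagram over the sequence C U U C A U C A G A A A A U G A C]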

Modeling RNA Secondary Structure by Context-Free Grammar (CFG)

- RNA secondary structure can be modeled by the parse structure of a CFG, so structure prediction corresponds to parsing.
- Example of CFG rules:

[Figure: secondary structure of the sequence U U C A U C A G A A and the corresponding derivation tree with nodes labeled S]
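The correspondence between parsing and structure prediction can be sketched with a small dynamic program. This is a Nussinov-style maximization of nested base pairs, not the grammar from the slides: the rule shapes in the comments, the function name `max_nested_pairs`, and the `min_loop` parameter are assumptions made for illustration.

```python
# Watson-Crick pairs only (A-U and G-C), as on the stem-loop slide.
PAIRS = {("a", "u"), ("u", "a"), ("g", "c"), ("c", "g")}


def max_nested_pairs(seq, min_loop=3):
    """Return the maximum number of nested base pairs in seq.

    best[i][j] holds the best score for seq[i..j]. Each case mirrors a
    CFG rule shape, so filling the table is a CYK-like parse.
    min_loop is the assumed minimum number of unpaired bases in a loop.
    """
    seq = seq.lower()
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]                  # S -> x S   (i unpaired)
            if (seq[i], seq[j]) in PAIRS:           # S -> x S y (i pairs with j)
                score = max(score, best[i + 1][j - 1] + 1)
            for k in range(i + 1, j):               # S -> S S   (bifurcation)
                score = max(score, best[i][k] + best[k + 1][j])
            best[i][j] = score
    return best[0][n - 1]
```

Because every case combines only adjacent or enclosing spans, the pairs it finds are always nested, which is the limitation the next slide points out.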

RNA Secondary Structure: Pseudoknot

- CFGs cannot represent pseudoknots.
- Connecting the base pairs with arcs gives crossed arcs.

[Figure: pseudoknot structure, shown with the arc diagram over the sequence C U U C A U C A G A A A A U G A C]
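The nested versus crossed distinction can be checked directly on a list of base pairs. The following is a minimal sketch, assuming each pair is given as a tuple of positions `(i, j)` with `i < j`; the function names are hypothetical.

```python
from itertools import combinations


def crosses(p, q):
    """True if the arcs of base pairs p and q cross (i < k < j < l)."""
    (i, j), (k, l) = sorted([p, q])
    return i < k < j < l


def has_pseudoknot(pairs):
    """True if any two base pairs in the structure have crossing arcs."""
    return any(crosses(p, q) for p, q in combinations(pairs, 2))
```

For example, `has_pseudoknot([(0, 9), (1, 8)])` is false because the second arc lies inside the first, while `has_pseudoknot([(0, 5), (3, 8)])` is true because the arcs interleave.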

Early Studies

n: sequence length

Early Studies (cont.)

- Grammars for fully describing RNA pseudoknots:
  - SL-TAG and ESL-TAG [Uemura et al., 1999]
  - RPG [Rivas and Eddy, 2000]
- These grammars have been identified as subclasses of multiple context-free grammars. [Kato et al., 2005]

Motivation

- Multiple context-free grammar (MCFG):
  - Natural extension of CFG
  - Easy to compare generative power and design algorithms
  - Generative power to represent pseudoknots
  - Polynomial time parsing algorithm
- We have shown a candidate subclass of the minimum grammars of MCFGs for representing pseudoknots. [Kato et al., 2005]

What’s New in the Present Work

- Extension of MCFGs to a probabilistic model (stochastic MCFG, SMCFG)
- Design of polynomial time parsing and parameter estimation algorithms for the subclass of SMCFGs
- Experiments on RNA pseudoknot prediction


Relation between SMCFG and Major Probabilistic Models

[Figure: hierarchy of models ordered by generative power, from weak to strong, each paired with its probabilistic extension and a typical target: FA and HMM (gene finding), CFG and SCFG (stem-loop), MCFG and SMCFG (pseudoknot)]

From HMM to SCFG

Stochastic Multiple Context-Free Grammar (SMCFG)

- G = (N, T, F, P, S)
  - N: finite set of nonterminals
  - T: finite set of terminals
  - F: finite set of functions
  - P: finite set of rules with probabilities
  - S ∈ N: start symbol

Functions of SMCFG

- Example:

Rules of SMCFG

- Rule:
- Each rule carries a probability, which is the probability that the rule is applied.
- The probabilities of the rules with the same left-hand side should sum to one.

- Example:

[Figure: example derivation annotated with probabilities]

- A, prob. 0.7, (ag, cu)
- B, prob. 0.35, (ag, acu)
- A, prob. 0.28, (ag, acuu)
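The requirement that rules sharing a left-hand side have probabilities summing to one can be checked mechanically. The sketch below uses a hypothetical toy rule set; the rule names `f`, `g`, `h` and their probabilities are made up for illustration and are not the rules behind the example above.

```python
import math
from collections import defaultdict


def lhs_totals(rules):
    """Sum rule probabilities per left-hand side nonterminal."""
    totals = defaultdict(float)
    for lhs, _rhs, prob in rules:
        totals[lhs] += prob
    return dict(totals)


def is_normalized(rules, tol=1e-9):
    """True if the probabilities for every left-hand side sum to one."""
    return all(
        math.isclose(total, 1.0, abs_tol=tol)
        for total in lhs_totals(rules).values()
    )


# Hypothetical rules as (left-hand side, right-hand side, probability).
TOY_RULES = [
    ("A", "f[B]", 0.6),
    ("A", "g[]", 0.4),
    ("B", "h[A]", 1.0),
]
```

Calling `is_normalized(TOY_RULES)` returns true; adding another rule for `B` with nonzero probability makes it false.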

Modeling Pseudoknot by SMCFG

UP2La[(x1, x2)] = (x1, a x2)

UP2Ru[(x1, x2)] = (x1, x2 u)
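The two functions above act on pairs of strings, which is what lets an MCFG grow two separated parts of a sequence together. The following minimal sketch applies them to the pair (ag, cu); the resulting tuples match those in the derivation example. The helper names `up2l` and `up2r` are hypothetical, and the step factors 0.5 and 0.8 are assumptions chosen only so that the running products reproduce the 0.7, 0.35 and 0.28 shown in that example.

```python
def up2l(base):
    """Build UP2L<base>: (x1, x2) -> (x1, base + x2)."""
    def apply(pair):
        x1, x2 = pair
        return (x1, base + x2)
    return apply


def up2r(base):
    """Build UP2R<base>: (x1, x2) -> (x1, x2 + base)."""
    def apply(pair):
        x1, x2 = pair
        return (x1, x2 + base)
    return apply


def derive(start, start_prob, steps):
    """Apply (function, factor) steps in order, tracking the running product."""
    pair, prob = start, start_prob
    history = [(pair, prob)]
    for func, factor in steps:
        pair, prob = func(pair), prob * factor
        history.append((pair, prob))
    return history


# Assumed step factors (0.5 and 0.8), not values stated on the slides.
HISTORY = derive(("ag", "cu"), 0.7, [(up2l("a"), 0.5), (up2r("u"), 0.8)])
```

Here `HISTORY` holds the pairs (ag, cu), (ag, acu) and (ag, acuu) in order, each with its running probability.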

SMCFG for RNA Pseudoknot Modeling

- W1, …, Wm: nonterminals
  - Note: W1 is the start symbol.
- For each rule, two real values are specified: a transition probability p1 (0 < p1 ≤ 1) and an emission probability p2 (0 < p2 ≤ 1).

- Probability of each rule is defined as

SMCFG Gs


Algorithms for SMCFG

- CYK algorithm: calculates the optimal alignment of a sequence to an SMCFG (the most likely derivation tree).
- Inside algorithm: calculates the probability of a sequence given an SMCFG.
- Inside-outside algorithm: estimates optimal probability parameters for an SMCFG given a set of example sequences.

CYK Algorithm

- Input:
- The following are calculated by dynamic programming:
- : log maximum probability that Wv generates
- : log maximum probability that Wy generates

CYK Algorithm (cont.)

- Output: log maximum probability that W1 generates the input sequence

i.e.

- : the most likely derivation tree
- : entire set of probability parameters

Algorithm [CYK]

- Initialization:
  for i ← 1 to n+1, j ← i to n+1, v ← 1 to m
  do if  // ε: empty sequence
  then
  else
- Iteration:
  for i ← n downto 1, j ← i−1 to n, k ← n+1 downto j+1, l ← k−1 to n, v ← 1 to m
  // Some examples are shown.

Complexity of CYK Algorithm

- m: # of nonterminals (m = a+b)
- n: sequence length
- Time complexity: O(amn^4 + bn^5)
- Space complexity: O(mn^4)
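The quartic factor in the space bound can be made concrete by counting the index tuples (i, j, k, l) that the iteration step visits. This sketch assumes the loop bounds read i from n down to 1, j from i−1 to n, k from n+1 down to j+1, and l from k−1 to n (the minus signs are garbled in the transcript); the function names are hypothetical.

```python
import math


def count_index_tuples(n):
    """Count the (i, j, k, l) tuples visited for sequence length n."""
    count = 0
    for i in range(n, 0, -1):              # i from n down to 1
        for j in range(i - 1, n + 1):      # j from i-1 to n
            for k in range(n + 1, j, -1):  # k from n+1 down to j+1
                count += n - k + 2         # l from k-1 to n
    return count


def table_cells(n, m):
    """Number of table entries if one is stored per nonterminal and tuple."""
    return m * count_index_tuples(n)
```

Under these bounds the count equals `math.comb(n + 4, 4) - 1`, a polynomial of degree four in n, which is consistent with the O(mn^4) space bound.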


Experimental Method

- Construction of a model

[Figure: workflow. Sample sequences with structure annotation (shown as CUACUGUUC) from an RNA family database are used to construct the SMCFG; a test sequence (shown as CUAGUCUUA) is parsed with the CYK algorithm to give a secondary structure prediction.]

Data Sets for Experiments

- Three viral RNA families including pseudoknots from Rfam ver. 7.0

Corona_pk_3 in Rfam ver. 7.0

- Coronavirus 3' UTR pseudoknot
- Sequence length: 62–64

[Figure: consensus structure]

HDV_ribozyme in Rfam ver. 7.0

- Hepatitis delta virus ribozyme
- Sequence length: 87–91

[Figure: consensus structure]

Tombus_3_IV in Rfam ver. 7.0

- Tombusvirus 3' UTR region IV
- Sequence length: 89–92

[Figure: consensus structure]

Evaluation for Prediction Results

- precision = (# of correct base pairs predicted by the algorithm) / (# of predicted base pairs)
- recall = (# of correct base pairs predicted by the algorithm) / (# of base pairs specified by the annotation)
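The two measures can be computed from sets of base pairs. This is a minimal sketch, assuming each base pair is a tuple of positions `(i, j)`; the function name and the choice to return 0.0 when a denominator is zero are assumptions, not part of the original evaluation.

```python
def precision_recall(predicted, annotated):
    """Return (precision, recall) for predicted versus annotated base pairs."""
    predicted, annotated = set(predicted), set(annotated)
    correct = len(predicted & annotated)  # correct base pairs predicted
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(annotated) if annotated else 0.0
    return precision, recall
```

For instance, with three predicted pairs of which two appear among four annotated pairs, precision is 2/3 and recall is 2/4.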

Experimental Results

- Prediction accuracy

Experimental Results (cont.)

- Running time

*: Implementation in ANSI C on a machine with an Intel Pentium D CPU at 2.80 GHz and 2.00 GB of RAM

Pair Stochastic Tree Adjoining Grammar (PSTAG) [MSS05]

[Figure: PSTAG workflow. A sample sequence with structure annotation (shown as CUACUGUUC) from an RNA family database gives a derivation tree representing the known structure; the PSTAG algorithm aligns a test sequence (shown as CUAGUCUUA) against it to give a secondary structure prediction.]

[MSS05] Matsui et al., “Pair stochastic tree adjoining grammars for aligning and predicting pseudoknot RNA structures,” Bioinformatics, 2005.

Comparison with PSTAG

Summary

- A new probabilistic model called SMCFG has been proposed for RNA pseudoknot modeling.
- Polynomial time parsing and parameter estimation algorithms have been designed.
- Experimental results on RNA pseudoknot prediction have shown good prediction accuracy.
