Enhancing RNA Secondary Structure Prediction with Stochastic Context-Free Grammars

Improved RNA Secondary Structure Prediction Using Stochastic Context Free Grammars. Esposito, D., Heitsch, C. E., Poznanovik, S. and Swenson, M. S. Georgia Institute of Technology Conclusions Abstract Introduction There seems to exist multiple characteristics of RNA sequences and structures which can be used to infer the accuracy of the secondary structure prediction using current prediction methods. If it can be shown that a sequence is difficult for MFE to predict, then it is probable that my stochastic grammar algorithm will predict the secondary structure more accurately. • RNA • RNA has 4 main classes: • tRNA, 5S, 16S and 23S from shortest to longest.[2] • Each class folds to a similar 2D structure. • The secondary structure dictates the function of the RNA [2] • RNA secondary structure prediction • Open problem in the fields of Biology, Mathematics, and Computer Science. • Efficient algorithms exits. • MFE (Minimum Free Energy) algorithm scores structures chemical stability. • Arguably most accurate prediction method. • Accuracies according to F-Measure range from 0.1357-0.75510 for 16S RNA class. See Figure 3. • Based on Nearest Neighbor Thermodynamic Model [2] • SCFG (Stochastic Context Free Grammars) • CFG's are languages which describe structural possibilities of words within the language [3]. See Figure 1 and 2. • SCFG describe the likeliness of a word existing in the language [3] Accurate RNA secondary structure prediction is an important problem in computational biology. Different RNA nucleotide sequences often fold to similar structures causing current prediction algorithms to range widely in accuracy for RNA strands with similar structures. To understand the origins of these inaccuracies we trained a stochastic context free grammar on a hard-to-predict training set and an easy-to-predict training set which corresponds to a set of sequences with low and high prediction accuracy respectively. We found interesting statistical differences in the nucleotide composition of the sequence as well as the distribution of nucleotide base pairs between the two training sets. Stochastic context free grammars provide a means to quantify subtle difference in the composition of native secondary structures. The discovery of these differences could potentially lead to the improvement of current prediction algorithms. We are currently performing a parametric analysis of several prediction methods. Figures 4. 5S and 16S F-Measure Distribution Results • Examining base pair ratios • MFE only predicts canonical base pairs (gc, au, gu) while Pfold considers all base pairs. • Figure 6 shows MFE predicting well on sequences with many canonical base pairs. • Predicting with the trained grammar • My grammar performed similar to MFE scoring base pairs similarly • MFE gained greater rewards and suffered greater penalties on average for predicting canonical-base pairs • My grammar performed better than MFE on average for hard sets shown in Figure 5. Figure 6. Canonical vs. Non-Canonical base pair probabilities for 5S and 16S. • Examining nucleotide ratios • Because g can pair with c or u, the less options MFE has ( count(c) >> count(u)) the better MFE performed in prediction. See Figure 7 • The difference p(t→c|t) – p(t→u|t) was over 7 times larger on average for the easy sets. Figure 2. Pfold Grammar. Parse Tree Figure 1. The Pfold Grammar Figure 3. F-Measure Definition. Figure 5. F-Measure accuracy MFE vs. Stochastic Grammars Figure 7. p(t→c) vs. p(t→u) for 5S and 16S. References Methods Contact • Forming training sets • RNA 5S, 16S, 23S and tRNA structures and sequences were acquired from the Comparative RNA Web site[2]. • RNAfold[4] was used to predict on each unambiguous sequence recording accuracy score according to F-Measure. See Figure 4. • Thirty-two training sets were formed by differentiating size, prediction difficulty and RNA class. • Training the P-Fold grammar • Grammar trained on each set. The min, max, mean and standard deviation was recorded for each parameter set. • Original Pfold parameters most closely matched my Grammars parameters for the hard tRNA training set. • Secondary structure prediction • The trained grammar was used to predict a structure for each of the sequences within its training set, recording and comparing accuracy to MFE for each structure. Cannone J, Subramanian S, Schnare M, Collett J, D’Souza L, Du Y, Feng B, Lin N, Madabusi L, Miller K, Pande N, Shang Z, Yu N, Gutell R: The Comparative RNA Web (CRW) Site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics 2002, 3. Mathews D.H., Schroeder S.J., Turner D.H., and Zuker M. RNA World. Cold Spring Harbor Labratory Press, 3rd edition, 2006. Sean R. Eddy Richard Durbin, Anders Krogh, and Graeme Mitchison. Biological Sequence Analysis. Cambridge University Press, 1998. Ivo L. Hofacker. Vienna rna secondary structure server. Nucleic Acids Research, 31(13):3429–3431, 2003. David Esposito Georgia Institute of Technology Email: desposito6@gatech.edu Phone: 404.444.2215 Websites: www.TheDavidEsposito.com http://www.DavidEsposito.info/REU/code.google.com/p/gt-jscfg-2/

Enhancing RNA Secondary Structure Prediction with Stochastic Context-Free Grammars

Enhancing RNA Secondary Structure Prediction with Stochastic Context-Free Grammars

Presentation Transcript

C M S

I C S M S

E S T E E M E D C L I E N T S:

M U L C H M A D N E S S

A C C E S S O R I E S A N D C O S M E T I C S

M O D E S

CLEO-c D + s         e   e  

S u c c e s s

I C S M S