
The Discovery of Novel ncRNA in Genomes

Andrew Uzilov

David Mathews



Uzilov, Keegan, Mathews. BMC Bioinformatics. 2006. In Press.


Outline:

  • Background in ncRNA.

  • Basic hypothesis.

  • The Dynalign algorithm for prediction of an RNA secondary structure common to two sequences.

  • Using Dynalign to find ncRNA sequences in genomes.

  • Optimizing Dynalign performance.





What is ncRNA?

  • Non-coding RNA (ncRNA) is an RNA that functions without being translated to a protein.

  • Known roles for ncRNAs:

    • RNA catalyzes excision/ligation in introns.

    • RNA catalyzes the maturation of tRNA.

    • RNA catalyzes peptide bond formation.

    • RNA is a required subunit in telomerase.

    • RNA plays roles in immunity and development (RNAi).

    • RNA plays a role in dosage compensation.

    • RNA plays a role in carbon storage.

    • RNA is a major subunit in the SRP, which is important in protein trafficking.

    • RNA guides RNA modification.



Predicting RNA Secondary and 3D Structure from Sequence:

AAUUGCGGGAAAGGGGUCAA

CAGCCGUUCAGUACCAAGUC

UCAGGGGAAACUUUGAGAUG

GCCUUGCAAAGGGUAUGGUA

AUAAGCUGACGGACAUGGUC

CUAACCACGCAGCCAAGUCC

UAAGUCAACAGAUCUUCUGU

UGAUAUGGAUGCAGUUCA

Cate et al. (Cech & Doudna). 1996. Science 273:1678.

Waring & Davies. (1984) Gene 28: 277.



An RNA Secondary Structure:

R2 retrotransposon 3’ UTR from D. melanogaster.

RNA 3:1-16.

On average, 46% of nucleotides are unpaired.


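Gibbs Free Energy (ΔG°):

$K_i = \frac{[\text{conformation } i]}{[\text{unfolded}]} = e^{-\Delta G^\circ_i / RT}$

$\frac{[\text{conformation } i]}{[\text{conformation } j]} = \frac{K_i}{K_j} = e^{-(\Delta G^\circ_i - \Delta G^\circ_j) / RT}$

ΔG° quantifies the favorability of a structure at a given temperature.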


Nearest Neighbor Model for RNA Secondary Structure Free Energy at 37 °C:

Mathews, Disney, Childs, Schroeder, Zuker, & Turner. 2004. PNAS 101: 7287.


How is the Lowest Free Energy Structure Determined?

  • Naïve approach would be to calculate the free energy of every possible secondary structure.

  • Number of secondary structures ≈ 1.8^N (where N is the number of nucleotides)

  • The free energies of 1000 structures can be calculated in 1 second.

  • For a 100-nucleotide sequence:

    • Number of secondary structures ≈ 3 × 10^25

    • Time to calculate ≈ 10^14 years
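The arithmetic behind this estimate can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function names are made up for illustration, and the growth factor 1.8 and the rate of 1000 structures per second are taken from the bullets above. With these exact inputs the time comes out near 10^15 years, the same ballpark as the rounded figure quoted on the slide.

```python
# Back-of-the-envelope check of the brute-force cost of enumerating structures.
# Function names are illustrative; 1.8 and 1000/s come from the slide.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def estimate_structures(n, growth=1.8):
    """Approximate number of secondary structures for n nucleotides."""
    return growth ** n


def brute_force_years(n, structures_per_second=1000.0):
    """Years needed to score every structure at the given rate."""
    return estimate_structures(n) / structures_per_second / SECONDS_PER_YEAR


count = estimate_structures(100)
years = brute_force_years(100)
print(f"{count:.1e} structures, about {years:.1e} years")
```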


Dynamic Programming Algorithm:

  • Not to be confused with molecular dynamics.

  • This is a calculation – not a simulation.

  • The lowest free energy structure is guaranteed to be found, given the nearest neighbor parameters used.

  • Reviewed by Sean Eddy. Nature Biotechnology. 2004. 11: 1457.


Dynamic Programming Algorithm:

  • Named by Richard Bellman in 1953.

  • Applies to calculations in which the cost/score is built progressively from smaller solutions.

  • Other applications

    • Sequence alignment

    • Determining partition functions for RNA secondary structures

    • Finding shortest paths

    • Determining moves in games

    • Linguistics


Dynamic Programming:

  • Recursion is used to speed the calculation.

    • The problem is divided into smaller problems.

    • The smaller problems are used to solve bigger problems.

  • Two Step Process

    • Fill – determines the lowest free energy folding possible for each subsequence

    • Traceback – determines the structure that has the lowest free energy
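The fill and traceback steps can be sketched with a deliberately simplified scoring scheme. The code below is not the free energy minimization described in these slides: it swaps in Nussinov-style base-pair maximization (one point per canonical or GU pair) so that the two-step structure is visible in a few lines. The function names and the minimum hairpin loop size of 3 are assumptions made for the example.

```python
# Simplified stand-in for free energy minimization: maximize the number of
# base pairs (Nussinov-style) to illustrate the fill and traceback steps.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_pair(a, b):
    return (a, b) in PAIRS


def fill(seq, min_loop=3):
    """Fill step: best[i][j] is the maximum pair count for seq[i..j]."""
    n = len(seq)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):          # short subsequences first
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]               # case: j is unpaired
            for k in range(i, j - min_loop):     # case: j pairs with k
                if can_pair(seq[k], seq[j]):
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best


def traceback(seq, best, min_loop=3):
    """Traceback step: recover one optimal set of pairs from the table."""
    pairs = []
    stack = [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if can_pair(seq[k], seq[j]):
                left = best[i][k - 1] if k > i else 0
                if best[i][j] == left + 1 + best[k + 1][j - 1]:
                    pairs.append((k, j))
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return sorted(pairs)


sequence = "GGGAAACCC"
table = fill(sequence)
predicted = traceback(sequence, table)
print(predicted)
```

A real implementation replaces the "+1 per pair" score with nearest neighbor free energy terms and needs additional tables for the different loop types, but the fill-then-traceback pattern is the same.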


RNA Secondary Structure Prediction Accuracy:

Percentage of Known Base Pairs Correctly Predicted:

Mathews, Disney, Childs, Schroeder, Zuker, & Turner. 2004. PNAS 101: 7287.


Pseudoknot:

i < i’ < j < j’
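The index condition above translates directly into a check on a list of base pairs. The sketch below assumes pairs are given as (i, j) index tuples; the function name is made up for the example.

```python
# Detect a pseudoknot: two pairs (i, j) and (i', j') with i < i' < j < j'.
from itertools import combinations


def is_pseudoknotted(pairs):
    """Return True if any two base pairs cross (are non-nested)."""
    normalized = sorted(tuple(sorted(p)) for p in pairs)
    for (i, j), (i2, j2) in combinations(normalized, 2):
        if i < i2 < j < j2:
            return True
    return False


nested = [(0, 8), (1, 7), (2, 6)]
crossing = [(0, 5), (3, 9)]
print(is_pseudoknotted(nested), is_pseudoknotted(crossing))
```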


Hypothesis:

  • ncRNAs have lower folding free energy change than non-structural sequences, e.g. mRNA, or random sequences.

  • Corollary:

    • ncRNAs, which are structured, can be found in genomic sequences because they have folding free energy change lower than background sequences.


Do Structural RNAs have Lower Folding Free Energy Change than Background?

  • Yes:

    • Le et al. 1990. NAR 18:1613.

    • Seffens & Digby. 1999. NAR 27:1578.

    • Clote et al. 2005. RNA 11:578.

  • No:

    • Workman & Krogh. 1999. NAR 27:418.

    • Rivas & Eddy. 2000. Bioinformatics 16:583.


Test of Hypothesis:

ncRNA (tRNA or 5S rRNA) → first-order Markov chain that preserves dinucleotide frequencies → 100 control sequences

Negative → first-order Markov chain that preserves dinucleotide frequencies → 100 control sequences
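One way to picture the control generation is a first-order Markov chain whose transition probabilities are estimated from the input sequence. The sketch below is an assumption about how such controls could be produced, not the procedure used in the study: it preserves dinucleotide frequencies only on average, not exactly, and the function name and fallback rule are made up for the example.

```python
# Generate control sequences from a first-order Markov chain fitted to the
# input. Dinucleotide frequencies are preserved in expectation only.
import random
from collections import defaultdict


def markov_control(seq, rng):
    """Sample a sequence of the same length from the fitted Markov chain."""
    transitions = defaultdict(list)
    for a, b in zip(seq, seq[1:]):
        transitions[a].append(b)          # empirical follower distribution
    current = seq[0]
    out = [current]
    while len(out) < len(seq):
        followers = transitions.get(current)
        if not followers:
            # Dead end (nucleotide only seen at the 3' end): fall back to
            # mononucleotide frequencies. This rule is an assumption.
            followers = list(seq)
        current = rng.choice(followers)
        out.append(current)
    return "".join(out)


rng = random.Random(0)
original = "GCGACCGGGGCUGGCUUGGUAAUGGUACUCCCCUGUCACGGGAGAGAAUGUGGGUUCAAAUCCCAUCGGUCGCGCCA"
controls = [markov_control(original, rng) for _ in range(100)]
print(controls[0])
```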


Calculate Z Score of Folding Free Energy Change for Positives and Negatives:

  • Calculate the mean, <ΔG°37>, and standard deviation, σ, for the controls.

  • Z score is the number of standard deviations by which a negative’s or positive’s free energy change differs from the control mean:

    Z = (ΔG°37 – <ΔG°37>)/σ

  • Choose a Z-score cutoff for classification as ncRNA.
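The Z score calculation can be written out in a few lines. The free energy values below are invented for illustration, the cutoff of -2.0 is a placeholder rather than a value from the study, and the use of the sample standard deviation (rather than the population form) is an assumption.

```python
# Z score of a folding free energy change relative to its control sequences.
# All numbers are hypothetical; units would be kcal/mol.
from statistics import mean, stdev


def z_score(dg, control_dgs):
    """Number of standard deviations between dg and the control mean."""
    return (dg - mean(control_dgs)) / stdev(control_dgs)


control_energies = [-20.0, -22.0, -18.0, -21.0, -19.0]
z = z_score(-26.0, control_energies)

CUTOFF = -2.0                     # placeholder cutoff, not from the slides
is_ncrna = z <= CUTOFF            # lower (more negative) Z means more stable
print(round(z, 3), is_ncrna)
```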


Scoring:

  • Sensitivity =

    (True Positives)/(True Positives + False Negatives) =

    percent of ncRNA correctly classified as ncRNA

  • Specificity =

    (True Negatives)/(True Negatives + False Positives) =

    percent of non-ncRNA correctly classified as non-ncRNA
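These two definitions can be applied to lists of Z scores once a cutoff is chosen. In this sketch a sequence is called ncRNA when its Z score is at or below the cutoff; the scores, the cutoff, and the function name are hypothetical.

```python
# Sensitivity and specificity for a Z-score cutoff (hypothetical data).
def sensitivity_specificity(z_positives, z_negatives, cutoff):
    """Classify as ncRNA when Z <= cutoff, then score the classification."""
    tp = sum(z <= cutoff for z in z_positives)   # ncRNA called ncRNA
    fn = len(z_positives) - tp                   # ncRNA missed
    fp = sum(z <= cutoff for z in z_negatives)   # non-ncRNA called ncRNA
    tn = len(z_negatives) - fp                   # non-ncRNA rejected
    return tp / (tp + fn), tn / (tn + fp)


positives = [-3.0, -2.5, -1.0, -0.5]
negatives = [-2.2, 0.1, 0.5, 1.0]
sens, spec = sensitivity_specificity(positives, negatives, cutoff=-2.0)
print(sens, spec)
```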


Distribution of Z Scores:

[Histogram: count vs. Z score]


Receiver-Operator Characteristic (ROC) Curve:
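A ROC curve is traced by sweeping the classification cutoff and recording the true positive rate (sensitivity) against the false positive rate (1 - specificity) at each step; the integral under the curve summarizes performance across all cutoffs. The following sketch uses hypothetical Z scores and illustrative function names, and integrates with the trapezoid rule.

```python
# Build ROC points by sweeping the Z-score cutoff, then integrate the curve.
def roc_points(z_positives, z_negatives):
    """Return (false positive rate, true positive rate) for each cutoff."""
    cutoffs = sorted(set(z_positives) | set(z_negatives))
    points = [(0.0, 0.0)]
    for c in cutoffs:
        tpr = sum(z <= c for z in z_positives) / len(z_positives)
        fpr = sum(z <= c for z in z_negatives) / len(z_negatives)
        points.append((fpr, tpr))
    return points


def area_under_curve(points):
    """Trapezoid-rule integral of the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


positives = [-3.0, -2.5, -1.0, -0.5]
negatives = [-2.2, 0.1, 0.5, 1.0]
curve = roc_points(positives, negatives)
auc = area_under_curve(curve)
print(auc)
```

An integral of 1.0 would mean perfect separation of positives from negatives, and 0.5 corresponds to random guessing.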


Why do Structural RNA Sequences Not Have a Significantly Lower Folding Free Energy Change?

  • Hypothesis is incorrect.

  • Secondary structure prediction has limited accuracy:

    • Kinetics may play a role in folding.

    • Nearest neighbor free energy parameters are based on a limited number of experiments and have error.

    • The algorithms that are used for these studies cannot predict pseudoknots (non-nested pairs).


Dynalign (a 4-D Dynamic Programming Algorithm):

Algorithm for Secondary Structure Prediction (2D dynamic programming algorithm)

Algorithm for Sequence Alignment (2D dynamic programming algorithm)

Simultaneously finds the sequence alignment and thermodynamically favorable common secondary structure for two sequences.

Dynalign requires no sequence identity.

Mathews & Turner. Journal of Molecular Biology. 317: 191-203 (2002)

Mathews. Bioinformatics. 21: 2246-2253 (2005)


Inputs, Optimization, and Outputs:

Input:

Sequence 1

Sequence 2

Optimization (minimize ΔG°total):

ΔG°total = ΔG°sequence 1 + ΔG°sequence 2 + (ΔG°gap)(number of gaps)

Output:

Sequence Alignment, Structure of 1, Structure of 2

where each helix in 1 must be homologous to a BP in 2
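As a worked instance of the objective above, the total can be evaluated for one candidate alignment. Every number here is hypothetical (the two folding energies, the gap penalty, and the aligned strings), and the helper names are made up; Dynalign itself searches over all alignments and structures rather than scoring a single given one.

```python
# Evaluate the Dynalign objective for one candidate solution.
# Energies (kcal/mol) and the gap penalty are hypothetical values.
def count_gaps(aligned_1, aligned_2):
    """Count gap characters in a pairwise alignment."""
    return aligned_1.count("-") + aligned_2.count("-")


def total_free_energy(dg_seq1, dg_seq2, dg_gap, n_gaps):
    """dG_total = dG_seq1 + dG_seq2 + dG_gap * (number of gaps)."""
    return dg_seq1 + dg_seq2 + dg_gap * n_gaps


aligned_a = "GAGAAUGUGGG"
aligned_b = "GAGAC-CCGGG"
gaps = count_gaps(aligned_a, aligned_b)
total = total_free_energy(-25.0, -27.5, dg_gap=0.5, n_gaps=gaps)
print(gaps, total)
```

Because the gap term is positive, each inserted gap makes the total less favorable, so gaps are only introduced when they allow a sufficiently better common structure.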


Optimization of ΔG°gap:

Seven 5S rRNAs with secondary structures predicted with 47.8% average accuracy. Average of all 42 pair-wise combinations predicted by Dynalign.


Improving the Accuracy of tRNA Secondary Structure Prediction:

Conventional Free Energy Minimization Predicted Structures:

RD0260

RE6781


Improving the Accuracy of tRNA Secondary Structure Prediction:

Dynalign Predicted Structures:

RE6781

RD0260

RD0260 GCGACCGGGGCUGGCUUGGUAAUGGUACUCCCCUGUCACGGGAGAGAAUGUGGGUUCAAAUCCCAUCGGUCGCGCCA

RE6781 UCCGUCGUAGUCUAGGUGGUUAGGAUACUCGGCUCUCACCCGAGAGAC-CCGGGUUCGAGUCCCGGCGACGGAACCA

^^^^^^^ ^^^^ ^^^^ ^^^^^ ^^^^^ ^^^^^ ^^^^^^^^^^^^


Benchmarks:

  • Four databases:

    • All pairwise comparisons (21) of seven 5S sequences with widely varying accuracy of secondary structure prediction using a single sequence.

    • 3 calculations with 6 SRP sequences.

    • All pairwise calculations (780) with 40 randomly chosen tRNA sequences.

    • All pairwise comparisons (105) of 15 randomly chosen 5S rRNA sequences.


Sensitivity:

Sensitivity = (Correctly Predicted Pairs)/(Total Known Pairs)
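For structure prediction, this definition amounts to a set intersection between predicted and known base pairs. The pairs below are invented for illustration, the function name is hypothetical, and the sketch counts only exact matches.

```python
# Base-pair sensitivity: fraction of known pairs that were predicted.
def pair_sensitivity(predicted, known):
    """(Correctly predicted pairs) / (total known pairs), exact matches only."""
    known_set = {tuple(sorted(p)) for p in known}
    predicted_set = {tuple(sorted(p)) for p in predicted}
    if not known_set:
        raise ValueError("no known pairs to score against")
    return len(known_set & predicted_set) / len(known_set)


known_pairs = [(0, 20), (1, 19), (2, 18), (5, 12)]
predicted_pairs = [(0, 20), (1, 19), (2, 18), (6, 11)]
print(pair_sensitivity(predicted_pairs, known_pairs))
```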


Improving Dynalign Performance:

  • The original restriction on the alignments is: |i – k| ≤ M, where i and k are nucleotide positions in sequences 1 and 2.

    • For the 3’ ends of the sequences to align: M ≥ |N1 – N2|, where N1 and N2 are the sequence lengths.

    • For most applications, the ends of the sequences should align.

  • This suggests an alternative restriction: |i·N2/N1 – k| ≤ M

    • This allows a smaller M parameter. Calculation time scales as O(N^3 M^3).
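The effect of the two restrictions is easy to see for sequences of unequal length. In this sketch the lengths (100 and 150) and the M values are arbitrary examples, and the function names are made up.

```python
# Compare the original and length-scaled alignment restrictions.
def allowed_original(i, k, m):
    """Original restriction: |i - k| <= M."""
    return abs(i - k) <= m


def allowed_scaled(i, k, m, n1, n2):
    """Alternative restriction: |i * N2 / N1 - k| <= M."""
    return abs(i * n2 / n1 - k) <= m


n1, n2 = 100, 150                      # hypothetical sequence lengths
ends_original_small_m = allowed_original(n1, n2, m=10)
ends_original_large_m = allowed_original(n1, n2, m=50)
ends_scaled_small_m = allowed_scaled(n1, n2, m=10, n1=n1, n2=n2)
print(ends_original_small_m, ends_original_large_m, ends_scaled_small_m)
```

With the original rule the 3’ ends can only align once M reaches |N1 - N2| = 50, while the scaled rule admits the same alignment with M = 10. Since the run time grows with the cube of M, the smaller value matters.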


Heuristic to Exclude Base Pairs:

  • There are many possible canonical base pairs that are not worth considering because any structure that contains them has a high free energy.

  • The “high energy” base pairs can be identified by secondary structure prediction using a single sequence (very fast). The high energy pairs can then be excluded from a Dynalign structure prediction.
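The filtering idea can be sketched as follows, assuming that a single-sequence calculation has already supplied, for each candidate pair, the lowest free energy of any structure containing that pair. That input table, the 20% threshold, and the function name are all assumptions made for the example; the energies are invented.

```python
# Keep only base pairs whose best structure is within a percentage of the
# overall lowest free energy. Inputs are hypothetical (kcal/mol).
def pairs_to_keep(pair_energies, percent):
    """pair_energies maps (i, j) to the lowest energy of a structure with i-j."""
    lowest = min(pair_energies.values())       # global minimum (negative)
    threshold = lowest * (1.0 - percent / 100.0)
    return {pair for pair, dg in pair_energies.items() if dg <= threshold}


single_sequence_energies = {
    (0, 20): -30.0,    # pair in the lowest free energy structure
    (1, 19): -29.0,    # slightly suboptimal
    (5, 15): -10.0,    # "high energy" pair, excluded at 20%
}
kept = pairs_to_keep(single_sequence_energies, percent=20.0)
print(sorted(kept))
```

Pairs that do not survive the filter are never considered in the two-sequence calculation, which shrinks the search space before the expensive step begins.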


% of Known Pairs within a % Energy Increment from the Lowest Free Energy Structure:


Time Performance Improvement:

3.2 GHz Intel Pentium 4 with 1 GB RAM; Red Hat Enterprise Linux 3; gcc 3.2.3-42 compiler


Revised Hypothesis:

  • Dynalign-calculated folding free energies for sequence pairs derived from genome alignments can be used to find ncRNAs with high sensitivity and specificity.


Testing the Hypothesis:

ncRNA pair (tRNAs or 5S rRNAs) → shuffle of global alignment → 20 control sequence pairs

Negative pair → shuffle of global alignment → 20 control sequence pairs


Dynalign ROC Curve has Larger Integral than Single Sequence:


ROC Curves Depend on M:


ROC Curves for tRNA and 5S rRNA:


Comparison to Other State of the Art Methods:

  • QRNA:

    • Rivas & Eddy. 2001. BMC Bioinformatics 2:8.

    • Comparative analysis of aligned sequences, where compensating base pair changes indicate ncRNA. Classification by stochastic context-free grammar.

  • RNAz:

    • Washietl, Hofacker, & Stadler. 2005. PNAS 102: 2454.

    • Folding free energy of two or more aligned sequences using RNAalifold. Classification by support vector machine (SVM).

  • Both Methods Use Fixed Alignments:

    • Faster than Dynalign.

    • Limited by the sequence alignment algorithm (compensating base pair changes make accurate alignment difficult).


QRNA Sequence Types:


Dynalign vs. RNAz:


What About Low Sequence Identity Pairs?


Human vs. Mouse Alignment (Santa Cruz Genome Server) Pairwise Identities for 50 Nucleotide Windows:


Faster Method Using Dynalign:

  • Run a single calculation and use a support vector machine (SVM) to classify sequence as ncRNA or not.

    • Each window only needs to be scanned once.

    • A probability is assigned to the classification.

  • SVM

    • Trained with tRNA and 5S rRNA sequences.

    • Input:

      • Dynalign total free energy change

      • Length of the shorter sequence

      • A, C, G content of each sequence
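The SVM inputs listed above can be assembled into a single feature vector per sequence pair. This sketch only builds the vector and does not train or call an SVM; the ordering of the features, the use of fractions for base content, and the function names are assumptions, and the example energy is invented.

```python
# Assemble the classifier input for one sequence pair (no SVM is trained here).
def base_fractions(seq):
    """Fractions of A, C, and G in a sequence (U follows from the other three)."""
    seq = seq.upper()
    return [seq.count(base) / len(seq) for base in "ACG"]


def svm_features(seq1, seq2, total_dg):
    """[Dynalign total free energy, shorter length, A/C/G of seq1, A/C/G of seq2]."""
    shorter = float(min(len(seq1), len(seq2)))
    return [total_dg, shorter] + base_fractions(seq1) + base_fractions(seq2)


features = svm_features("AACG", "ACGU", total_dg=-5.0)
print(features)
```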


ROC of SVM vs. 20 Controls:


Dynalign-SVM vs. RNAz at Low Identity:


Unrolling the Method on E. coli:

  • Look for ncRNA in E. coli using alignments to S. typhi.

    • MUMmer (Kurtz et al. 2004. Genome Biol 5:R12)

      • 15,214 blocks of 50 to 150 nucleotides as above (where long alignment blocks were divided into 150 nucleotide windows that overlap 75 nucleotides)
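The windowing of long alignment blocks can be sketched as below. The window size (150), overlap (75), and minimum block length (50) come from the bullet above; how the final partial window is handled is an assumption, as is the function name.

```python
# Split an alignment block into 150-nt windows that overlap by 75 nt.
def split_block(length, size=150, overlap=75, min_size=50):
    """Return (start, end) spans, end exclusive, covering a block."""
    if length < min_size:
        return []                      # too short to scan
    if length <= size:
        return [(0, length)]           # block fits in one window
    step = size - overlap
    spans = []
    start = 0
    while start + size < length:
        spans.append((start, start + size))
        start += step
    # Assumption: the last window simply runs to the end of the block.
    spans.append((start, length))
    return spans


print(split_block(300))
print(split_block(320))
```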


ncRNA Detection:


Epilogue: Improving Dynalign Performance:

  • In collaboration with Gaurav Sharma, Electrical and Computer Engineering, University of Rochester, and Arif Harmanci, we pre-determine the sequence alignment probabilities with a Hidden Markov Model.

  • Then, we only allow alignments in Dynalign that have probability greater than 10^-4.

    • This removes the need for the M parameter heuristic.

    • This does not affect the accuracy of structure prediction by Dynalign.


Benchmarks Against Other Programs Using 2000 Pairs of 5S rRNA Sequences:

Percent of Known Pairs Correctly Predicted:


Performance Benchmarks Using 200 Pairs of Sequences:

Using a single core on a dual, dual-core Opteron 270 machine running Fedora Core 5 and gcc 4.1.1.


Parallelizing Dynalign for SMP:

  • In collaboration with Paul Tymann, Computer Science, Rochester Institute of Technology and CS students Chris Connett, Glenn Katzen, Andrew Yohn, we developed an SMP version of Dynalign.

  • This takes advantage of the fact that there are a number of positions in the arrays that can be filled independently in the dynamic programming algorithm recursions.


Scaling:

Two R2 3’ UTRs of length 234 and 217 nucleotides.

Using a dual, dual-core Opteron 270 machine running Fedora Core 5 and gcc 4.1.1.


Preliminary Results with SMP-Dynalign:

  • Single sequence secondary structure prediction of E. coli 16S rRNA (1542 nucleotides) has 43.6% sensitivity.

  • E. coli 16S rRNA run on Dynalign with:

    • B. subtilis 16S rRNA (1552 nucleotides) has 80.7% sensitivity and required 381 minutes on 4 cores and 983 MB of RAM.

    • Borrelia burgdorferi 16S rRNA (1532 nucleotides) has 76.4% sensitivity and required 408 minutes on 4 cores and 1.0 GB of RAM.


Conclusions:

  • The folding free energy of single sequences does not provide a sensitive and specific method of finding ncRNAs. It does, however, provide a pre-filtering method that can remove 30% of sequences from consideration.

  • Dynalign shows promise as a method for ncRNA detection, especially at low pairwise identities of sequences.


Acknowledgements:

  • Funding:

    • Alfred P. Sloan Foundation

    • National Institutes of Health

  • Computing:

    • CASCI Lab at Rochester Institute of Technology

  • Past Lab Members:

    • Andrew Uzilov

    • Shan Zhao

    • Eliany Sanchez-Baez

  • Lab Members:

    • Sumeet Chandha

    • Zhi Lu

    • Matthew Seetin

    • Rahul Tyagi

    • Keith VanNostrand


MUMmer:


WuBLASTn:

