
Finding, Aligning and Analyzing Non Coding RNAs

Cédric Notredame

Comparative Bioinformatics Group

Bioinformatics and Genomics Program


They are Everywhere…

  • And ENCODE said…

    “nearly the entire genome may be represented in primary transcripts that extensively overlap and include many non-protein-coding regions”

  • Who Are They?

    • tRNA, rRNA, snoRNAs,

    • microRNAs, siRNAs

    • piRNAs

    • long ncRNAs (Xist, Evf, Air, CTN, PINK…)

  • How Many of Them?

    • Open question

    • 30,000 is a common guess

    • Harder to detect than proteins


Searching

“…When Looking for a Needle in a Haystack, the Optimist Wears Gloves…”


ncRNAs Can Have Different Sequences and Similar Structures


ncRNAs Can Evolve Rapidly

[Figure: base-paired stem diagrams; labelled stem fragments CTTGCCTCC, GAACGGACC, CTTGCCTGG, GAACGGAGG]

CCAGGCAAGACGGGACGAGAGTTGCCTGG

CCTCCGTTCAGAGGTGCATAGAACGGAGG

**-------*--**---*-**------**


ncRNAs are Difficult to Align

CCAGGCAAGACGGGACGAGAGTTGCCTGG

CCTCCGTTCAGAGGTGCATAGAACGGAGG

**-------*--**---*-**------**

Regular Alignment

--CCAGGCAAGACGGGACGAGAGTTGCCTGG

CCTCCGTTCAGAGGTGCATAGAACGGAGG--

   * *   *** * *   ***   *


  • Same Structure Low Sequence Identity

  • Small Alphabet, Short Sequences  Alignments often Non-Significant


Obtaining the Structure of an ncRNA is Difficult

  • Hard to Align The Sequences Without the Structure

  • Hard to Predict the Structures Without an Alignment


The Holy Grail of RNA Comparison: Sankoff's Algorithm

  • Simultaneous Folding and Alignment

    • Time Complexity: O(L^(2n)), for n sequences of length L

    • Space Complexity: O(L^(3n))

  • In Practice, for Two Sequences (time, memory):

    • 50 nucleotides: 1 min., 6 M.

    • 100 nucleotides: 16 min., 256 M.

    • 200 nucleotides: 4 hours, 4 G.

    • 400 nucleotides: 3 days, 3 T.

  • Forget about

    • Multiple sequence alignments

    • Database searches
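The run times listed above can be checked with a little arithmetic. The Python sketch below assumes that the time for two sequences grows as L to the power 4 (the stated time complexity read as L to the power 2n, with n = 2) and takes the 1 minute at 50 nucleotides as the baseline. The function name and the baseline constants are illustrative choices, not part of the original slides.

```python
# Extrapolate pairwise run time, assuming it grows as L**4
# (the quoted time complexity with n = 2 sequences).
BASE_LENGTH = 50      # nucleotides, baseline taken from the slide
BASE_MINUTES = 1.0    # run time at 50 nucleotides, taken from the slide


def estimated_minutes(length, exponent=4):
    """Scale the 50-nucleotide baseline by (length / 50) ** exponent."""
    return BASE_MINUTES * (length / BASE_LENGTH) ** exponent


for length in (50, 100, 200, 400):
    minutes = estimated_minutes(length)
    print(f"{length:>4} nt: {minutes:6.0f} min "
          f"= {minutes / 60:5.1f} h = {minutes / 1440:4.2f} days")
```

Under this assumption, doubling the length multiplies the time by 16: 16 minutes at 100 nucleotides, 256 minutes (about 4 hours) at 200, and 4096 minutes (just under 3 days) at 400, in line with the figures quoted above.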


The Next Best Thing: Consan

  • Consan = Sankoff + a few constraints

  • Use of Stochastic Context Free Grammars

    • Tree-shaped HMMs

    • Made sparse with constraints

  • The constraints are derived from the most confident positions of the alignment

  • Equivalent of Banded DP
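The banded-DP idea can be illustrated on a much simpler problem than simultaneous folding and alignment. The Python sketch below is not Consan and uses no grammar; it is an ordinary edit-distance computation in which only cells within a fixed distance of the main diagonal are filled, to show how a constraint makes the dynamic programming table sparse. The function name and the `band` parameter are illustrative assumptions.

```python
def banded_edit_distance(a, b, band):
    """Edit distance between a and b, filling only cells with |i - j| <= band.

    Returns None when the band is too narrow to reach the last cell.
    With a narrow band the result is an upper bound on the true distance.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    # Row 0: turning an empty prefix of a into b[:j] costs j insertions.
    prev = {j: j for j in range(0, min(m, band) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                cur[j] = i  # i deletions
                continue
            substitute = prev.get(j - 1, inf) + (a[i - 1] != b[j - 1])
            delete = prev.get(j, inf) + 1
            insert = cur.get(j - 1, inf) + 1
            cur[j] = min(substitute, delete, insert)
        prev = cur
    return prev[m]


print(banded_edit_distance("kitten", "sitting", band=1))
```

Each row touches at most 2 × band + 1 cells instead of the whole row, which is the saving that banding buys. In the approach described above, the constraints come from confident alignment positions rather than from a fixed diagonal band.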


Consan for Databases: Infernal

  • Infernal is a Faster version of Consan

  • For Database Search

  • Still Very Slow

[Figure: receiver operating characteristic (ROC) curves, comparison of Infernal with BLAST]



  • BLAST: 360 s.

  • Fast Infernal: 182 000 s.

  • Slow Infernal: 5 320 000 s.


Searching Databases for New RNAs


Rfam: In Practice

  • Rfam contains RNA families

    • Families → Multiple Sequence Alignment → Models

    • Models are like Pfam Profiles

      • Use Consan or cmsearch rather than HMMER

      • Much Slower

    • Too expensive to search the models

      • Models are used to build Rfam

      • People usually BLAST Rfam


Where do Rfam Families Come From?

  • Infernal Requires a Model

  • A Model Requires an MSA

  • The MSA requires a Family

  • It all starts with a BlastN

Rfam, Gardner et al. NAR 2008


Can We Make BlastN More Accurate?

  • BlastN is not very accurate because:

    • Poor substitution models for Nucleic Acids

    • Low information density (4 symbols)

  • BlastN assumes

    • Equal evolution rates for all nucleotides

    • Independence from Neighbors


Love Thy Neighbor

Measured Nearest Neighbor Dependencies on Rfam sequences


High Rate of CpG mutations


Measuring Di-Nucleotide Evolution

  • Each Nucleotide can be made more informative

  • It can incorporate the “name” of its Neighbor

    • AA => a

    • AG => b

    • AC => c

    • AT => d

  • A 16 Letter alphabet can be used to recode all nucleotide sequences

  • We name these extended Nucleotides
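The recoding described above can be sketched in a few lines of Python. Only the four assignments AA => a, AG => b, AC => c and AT => d come from the slide. The remaining twelve letters, the choice of the following nucleotide as the "neighbor", and the handling of the last nucleotide (it has no following neighbor, so the recoded sequence is one symbol shorter) are assumptions made for illustration.

```python
from itertools import product

ORDER = "AGCT"                # second-letter order seen in the four slide examples
LETTERS = "abcdefghijklmnop"  # 16 symbols, one per dinucleotide

# AA -> a, AG -> b, AC -> c, AT -> d are from the slide; the other twelve
# assignments are an assumed continuation of the same pattern.
DINUC_TO_LETTER = {
    first + second: LETTERS[index]
    for index, (first, second) in enumerate(product(ORDER, repeat=2))
}


def recode(sequence):
    """Recode a nucleotide sequence as overlapping dinucleotide symbols.

    Assumption: each nucleotide is combined with the one that follows it.
    U is treated as T; any other symbol (for example N) raises KeyError.
    """
    seq = sequence.upper().replace("U", "T")
    return "".join(DINUC_TO_LETTER[seq[i:i + 2]] for i in range(len(seq) - 1))


print(recode("AAGC"))  # AA, AG, GC
```

Because the output uses a 16-letter alphabet, it can be fed to tools that expect more than four symbols per position, which is the point developed in the following slides.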


Blosum-R and eRNA


Substitutions?

  • How much does it cost to turn one nucleotide into another?

  • Blosum/Pam style matrix

  • Matrices estimated on Rfam families
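A Blosum-style matrix scores each pair of symbols by the log of how often the pair is observed in aligned columns relative to how often it would occur by chance. The Python sketch below applies that general recipe to a toy alignment. It is not the procedure used to build Blosum-R: it omits the sequence clustering and pseudocounts a real estimate would need, and the function name and input format are illustrative assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_matrix(alignment):
    """Blosum-style log-odds scores (half-bit units) from aligned sequences.

    alignment: list of equal-length strings; '-' marks a gap and is skipped.
    Pairs never observed in any column are simply absent from the result.
    """
    pair_counts = Counter()
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        for x, y in combinations(residues, 2):
            pair_counts[tuple(sorted((x, y)))] += 1

    total_pairs = sum(pair_counts.values())
    symbol_counts = Counter()
    for (x, y), count in pair_counts.items():
        symbol_counts[x] += count
        symbol_counts[y] += count
    total_symbols = 2 * total_pairs

    scores = {}
    for (x, y), count in pair_counts.items():
        observed = count / total_pairs
        px = symbol_counts[x] / total_symbols
        py = symbol_counts[y] / total_symbols
        expected = px * py if x == y else 2 * px * py
        scores[(x, y)] = 2 * math.log2(observed / expected)
    return scores


toy = ["AC", "AC", "AG"]
for pair, score in sorted(log_odds_matrix(toy).items()):
    print(pair, round(score, 2))
```

The function makes no assumption about the alphabet, so the same code works on sequences recoded with a 16-letter extended alphabet.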




Using BlastR

  • When Nucleic Acids look like Proteins

  • They can be aligned with Protein Methods

    • BlastN → BlastP

    • BlastP with eRNA is BlastR


Validating Blast-R


Benchmarking BlastR

[Figure: benchmark setup. Labels: Query, Blast, Rfam, E-VALUES, PP, PN]

[Figure: Blast applied to Rfam 001, Rfam 002, Rfam …; ROC]


[Figure: ROC plot. Axes: False Positives and True Positive, each running from Good to Bad. Area Under Curve: Small AUC → Better]
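Why a small area is better here can be seen with a short Python sketch. It assumes that the curve plots false positives (vertical axis) against true positives (horizontal axis) as one walks down the list of hits ranked by E-value; this reading of the axes is an assumption, as are the function name and the normalization to the range 0 to 1.

```python
def fp_vs_tp_area(ranked_labels):
    """Normalized area under a false-positives-versus-true-positives curve.

    ranked_labels: list of booleans, best hit (lowest E-value) first;
    True marks a true positive, False a false positive.
    Returns 0.0 when every true positive outranks every false positive
    and 1.0 in the opposite case.
    """
    n_true = sum(1 for label in ranked_labels if label)
    n_false = len(ranked_labels) - n_true
    if n_true == 0 or n_false == 0:
        raise ValueError("need at least one true and one false positive")

    false_positives = 0
    area = 0
    for is_true in ranked_labels:
        if is_true:
            area += false_positives  # one step along x at the current height
        else:
            false_positives += 1     # one step up along y
    return area / (n_true * n_false)


print(fp_vs_tp_area([True, True, True, False, False]))
print(fp_vs_tp_area([True, False, True, False, True]))
print(fp_vs_tp_area([False, False, True, True, True]))
```

With these axes, a search method that ranks true family members ahead of unrelated sequences keeps the curve close to the horizontal axis, so the area shrinks toward zero. The more common convention of plotting true positives against false positives gives the complementary area, where larger is better.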


BlastR vs The World


The 3 Components of BlastR

  • BlastP is better than BlastN

  • BlosumR makes BlastP a little bit better

  • And Faster

Blast: wuBlast


BlastR and Clustering

  • Given all Rfam in Bulk

  • How good is BlastR at reconstituting all the families?

[Figure: ROC plot. Axes: Sensitivity and 1-Specificity]


BlastR: In Practice

[Figure: BlastR compared with BlastN. E-Value Threshold: 10^-20]


Take Home

  • Searching Nucleotides is Difficult

  • BlastN is not a very good algorithm

  • Simple Adaptations can improve the situation

    • Changing the algorithm (BlastP)

    • Changing the Scoring Scheme (BlastP-Nuc)

    • Changing the alphabet (BlastR)

