Finding, Aligning and Analyzing Non Coding RNAs

Cédric Notredame

Comparative Bioinformatics Group

Bioinformatics and Genomics Program


They are Everywhere…

  • And ENCODE said…

    “nearly the entire genome may be represented in primary transcripts that extensively overlap and include many non-protein-coding regions”

  • Who Are They?

    • tRNA, rRNA, snoRNAs,

    • microRNAs, siRNAs

    • piRNAs

    • long ncRNAs (Xist, Evf, Air, CTN, PINK…)

  • How Many of Them?

    • Open question

    • 30,000 is a common guess

    • Harder to detect than proteins


Searching

“…When Looking for a Needle in a Haystack, the Optimist Wears Gloves…”



ncRNAs Can Evolve Rapidly

[Figure: hairpin structure diagrams; stem labels CTTGCCTCC, GAACGGACC, CTTGCCTGG, GAACGGAGG]

CCAGGCAAGACGGGACGAGAGTTGCCTGG

CCTCCGTTCAGAGGTGCATAGAACGGAGG

**-------*--**---*-**------**
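
A minimal Python sketch of the comparison above. It recomputes the identity marker line for the two sequences on the slide and checks that the outer eight bases of each sequence pair with its last eight bases read backwards. The helper names and the choice of eight positions are assumptions made for illustration; the sequences are written with T, as on the slide.

```python
# Sequences copied from the slide.
seq1 = "CCAGGCAAGACGGGACGAGAGTTGCCTGG"
seq2 = "CCTCCGTTCAGAGGTGCATAGAACGGAGG"

# Watson-Crick partners (DNA alphabet, as written on the slide).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def identity_marks(a, b):
    """Return '*' for identical columns and '-' otherwise."""
    return "".join("*" if x == y else "-" for x, y in zip(a, b))


def stem_is_paired(seq, stem_len):
    """True if the first stem_len bases pair with the last stem_len bases read backwards."""
    five_prime = seq[:stem_len]
    three_prime = seq[-stem_len:][::-1]
    return all(COMPLEMENT[x] == y for x, y in zip(five_prime, three_prime))


marks = identity_marks(seq1, seq2)
print(marks)
print(marks.count("*"), "identical columns out of", len(marks))
print("stem paired in seq1:", stem_is_paired(seq1, 8))
print("stem paired in seq2:", stem_is_paired(seq2, 8))
```

Both stems pair although most stem columns differ between the two sequences: a change on one side of the stem is matched by a change on the other side, so the structure is kept while the sequence identity drops.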


ncRNAs are Difficult to Align

CCAGGCAAGACGGGACGAGAGTTGCCTGG

CCTCCGTTCAGAGGTGCATAGAACGGAGG

**-------*--**---*-**------**

Regular Alignment

--CCAGGCAAGACGGGACGAGAGTTGCCTGG

CCTCCGTTCAGAGGTGCATAGAACGGAGG--

   * *   *** * *   ***   *


ncRNAs are Difficult to Align

  • Same Structure, Low Sequence Identity

  • Small Alphabet, Short Sequences => Alignments often Non-Significant


Obtaining the Structure of an ncRNA is Difficult

  • Hard to Align The Sequences Without the Structure

  • Hard to Predict the Structures Without an Alignment



The Holy Grail of RNA Comparison: Sankoff's Algorithm

  • Simultaneous Folding and Alignment

    • Time Complexity: O(L^(2n)) for n sequences of length L

    • Space Complexity: O(L^(3n))

  • In Practice, for Two Sequences:

    • 50 nucleotides: 1 min, 6 MB

    • 100 nucleotides: 16 min, 256 MB

    • 200 nucleotides: 4 hours, 4 GB

    • 400 nucleotides: 3 days, 3 TB

  • Forget about

    • Multiple sequence alignments

    • Database searches
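
The running times quoted above for two sequences are consistent with the time complexity stated on the slide: with n = 2 the exponent is 4, so doubling the length multiplies the time by 16. The sketch below only does that arithmetic; the function name is hypothetical and the 1 minute at 50 nucleotides is taken from the slide as the reference point. The memory figures are not extrapolated here.

```python
def extrapolate_minutes(base_len, base_minutes, target_len, exponent):
    """Scale a measured running time assuming time grows as length ** exponent."""
    return base_minutes * (target_len / base_len) ** exponent


n_sequences = 2
time_exponent = 2 * n_sequences  # exponent of the time complexity given on the slide

for length in (50, 100, 200, 400):
    minutes = extrapolate_minutes(50, 1.0, length, time_exponent)
    print(f"{length:>4} nt: {minutes:6.0f} min = {minutes / 60:5.1f} h = {minutes / 1440:4.2f} days")
```

This gives 16, 256 and 4096 minutes for 100, 200 and 400 nucleotides, that is roughly 4.3 hours and 2.8 days for the last two.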


The Next Best Thing: Consan

  • Consan = Sankoff + a few constraints

  • Use of Stochastic Context Free Grammars

    • Tree-shaped HMMs

    • Made sparse with constraints

  • The constraints are derived from the most confident positions of the alignment

  • Equivalent of Banded DP
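
To make the banded DP analogy concrete, here is a minimal sketch of banding in ordinary pairwise sequence alignment. It is not Consan and it does not fold anything: it is plain Needleman-Wunsch scoring in which only cells within `band` of the diagonal are computed. The function name and the scoring values are assumptions chosen for illustration.

```python
def banded_alignment_score(a, b, band, match=1, mismatch=-1, gap=-2):
    """Global alignment score using only cells (i, j) with |i - j| <= band."""
    if abs(len(a) - len(b)) > band:
        raise ValueError("band is too narrow to reach the last cell")
    neg_inf = float("-inf")
    # Row 0: leading gaps, limited to the band. Rows are sparse dicts.
    prev = {j: j * gap for j in range(0, min(len(b), band) + 1)}
    for i in range(1, len(a) + 1):
        cur = {}
        for j in range(max(0, i - band), min(len(b), i + band) + 1):
            best = neg_inf
            if j > 0 and (j - 1) in prev:  # diagonal move: match or mismatch
                step = match if a[i - 1] == b[j - 1] else mismatch
                best = max(best, prev[j - 1] + step)
            if j in prev:  # vertical move: gap in b
                best = max(best, prev[j] + gap)
            if j > 0 and (j - 1) in cur:  # horizontal move: gap in a
                best = max(best, cur[j - 1] + gap)
            cur[j] = best
        prev = cur
    return prev[len(b)]


print(banded_alignment_score("GATTACA", "GATACA", band=1))
print(banded_alignment_score("GATTACA", "GATACA", band=10))
```

A narrow band gives the same score as a wide one whenever the best alignment stays close to the diagonal, while the number of cells visited grows with the band width rather than with the product of the two lengths.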


Consan for Databases: Infernal

  • Infernal is a Faster version of Consan

  • For Database Search

  • Still Very Slow

[Figure: Receiver operating characteristic (ROC) comparison of Infernal with BLAST]


Consan for Databases: Infernal

  • BLAST: 360 s.

  • Fast Infernal: 182 000 s.

  • Slow Infernal: 5 320 000 s.



Rfam: In Practice

  • Rfam contains RNA families

    • Families => Multiple Sequence Alignment => Models

    • Models are like Pfam Profiles

      • Use Consan or cmsearch rather than HMMER

      • Much Slower

    • Too expensive to search the models

      • Models are used to build Rfam

      • People usually BLAST Rfam


Where do Rfam Families Come From?

  • Infernal Requires a Model

  • A Model requires an MSA

  • The MSA requires a Family

  • It all starts with a BlastN

Rfam, Gardner et al. NAR 2008


Can We Make BlastN More Accurate?

  • BlastN is not very accurate because:

    • Poor substitution models for Nucleic Acids

    • Low information density (4 symbols)

  • BlastN assumes

    • Equal evolution rates for all nucleotides

    • Independence from Neighbors


Love Thy Neighbor

Measured Nearest Neighbor Dependencies on Rfam sequences



Measuring Di-Nucleotide Evolution

  • Each Nucleotide can be made more informative

  • It can incorporate the “name” of its Neighbor

    • AA => a

    • AG => b

    • AC => c

    • AT => d

  • A 16 Letter alphabet can be used to recode all nucleotide sequences

  • We name these extended Nucleotides
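
A minimal sketch of the recoding idea. Only the four codes AA, AG, AC and AT are given on the slide; the sketch continues the same A, G, C, T order to fill the other twelve letters, which is an assumption, as are the names `DINUC_CODE` and `recode`. It also assumes that the neighbor is the following nucleotide and that the last nucleotide, which has no following neighbor, is dropped.

```python
from itertools import product

BASES = "AGCT"  # order chosen so that AA, AG, AC, AT map to a, b, c, d as on the slide
LETTERS = "abcdefghijklmnop"  # letters e..p are a hypothetical continuation

# Map each of the 16 dinucleotides to one letter.
DINUC_CODE = {x + y: LETTERS[i] for i, (x, y) in enumerate(product(BASES, repeat=2))}


def recode(seq):
    """Recode a nucleotide sequence, one letter per nucleotide and its right neighbor."""
    seq = seq.upper().replace("U", "T")
    return "".join(DINUC_CODE[seq[i:i + 2]] for i in range(len(seq) - 1))


print(recode("AAGCT"))
```

With a 16-letter alphabet the recoded sequence can be handled by tools written for larger alphabets, provided a matching substitution matrix is available.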



Substitutions?

  • How much does it cost to turn one nucleotide into another one?

  • Blosum/Pam style matrix

  • Matrices estimated on Rfam families
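
As an illustration of what a Blosum/Pam style estimate involves, the sketch below computes log-odds scores from the columns of a multiple sequence alignment: observed pair frequency divided by the frequency expected from the symbol frequencies alone. It is a simplified stand-in, not the procedure used for the matrices in this talk: there is no sequence clustering, weighting, scaling or rounding, and the function name and toy alignment are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_scores(aligned_seqs):
    """Log2 odds for every symbol pair seen in the alignment columns ('-' is a gap)."""
    pair_counts = Counter()
    for s, t in combinations(aligned_seqs, 2):
        for x, y in zip(s, t):
            if x != "-" and y != "-":
                # Count both orders so the resulting matrix is symmetric.
                pair_counts[(x, y)] += 1
                pair_counts[(y, x)] += 1
    total = sum(pair_counts.values())
    symbol_counts = Counter()
    for (x, _), count in pair_counts.items():
        symbol_counts[x] += count
    scores = {}
    for (x, y), count in pair_counts.items():
        observed = count / total
        expected = (symbol_counts[x] / total) * (symbol_counts[y] / total)
        scores[(x, y)] = math.log2(observed / expected)
    return scores


toy_alignment = ["abgl", "abgl", "ab-l", "cbgl"]  # toy data over a recoded alphabet
for pair, score in sorted(log_odds_scores(toy_alignment).items()):
    print(pair, round(score, 2))
```

Pairs seen more often than chance get positive scores and rarer pairs get negative ones, which is the form a Blast-style scoring matrix takes.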



Using BlastR

  • When Nucleic Acids look like Proteins

  • They can be aligned with Protein Methods

    • BlastN => BlastP

    • BlastP with eRNA is BlastR



Benchmarking BlastR

[Figure: a Query is searched with Blast against Rfam; hits are listed by E-VALUES and marked PP or PN]

[Figure: Blast results for Rfam 001, Rfam 002, Rfam … are summarized as a ROC]

[Figure: plot with axes False Positives and True Positive, Good and Bad regions marked, Area Under Curve shown]

Small AUC => Better
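
A minimal sketch of how such an area can be computed from a ranked hit list, using the orientation of the plot described here: False Positives as the height, True Positives along the other axis, so a smaller area is better. The function name is hypothetical, the input is assumed to be already sorted from best to worst E-value, and ties between E-values are ignored.

```python
def fp_vs_tp_auc(ranked_labels):
    """Normalized area under the False Positive vs True Positive curve.

    ranked_labels: hits sorted from best to worst E-value; True marks a true positive.
    Returns 0.0 for a perfect ranking and 1.0 for the worst possible one.
    """
    n_tp = sum(ranked_labels)
    n_fp = len(ranked_labels) - n_tp
    if n_tp == 0 or n_fp == 0:
        raise ValueError("need at least one true positive and one false positive")
    fp_seen = 0
    area = 0
    for is_true in ranked_labels:
        if is_true:
            area += fp_seen  # one step along the True Positive axis at the current height
        else:
            fp_seen += 1  # one step up the False Positive axis
    return area / (n_tp * n_fp)


print(fp_vs_tp_auc([True, True, False, False]))
print(fp_vs_tp_auc([True, False, True, False]))
print(fp_vs_tp_auc([False, False, True, True]))
```

With these axes the value is one minus the conventional ROC AUC, which is why small is good here.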



The 3 Components of BlastR

  • BlastP is better than BlastN

  • BlosumR makes BlastP a little bit better

  • And Faster

Blast: WU-BLAST


BlastR and Clustering

  • Given all Rfam in Bulk

  • How good is BlastR at reconstituting all the families?

[Figure: curve with axes Sensitivity and 1-Specificity]



BlastR: In Practice

[Figure: BlastR compared with BlastN]

E-Value Threshold: 10^-20


Take Home

  • Searching Nucleotides is Difficult

  • BlastN is not a very good algorithm

  • Simple Adaptations can improve the situation

    • Changing the algorithm (BlastP)

    • Changing the Scoring Scheme (BlastP-Nuc)

    • Changing the alphabet (BlastR)

