Finding, Aligning and Analyzing Non Coding RNAs



Cédric Notredame

Comparative Bioinformatics Group

Bioinformatics and Genomics Program

They are Everywhere…
  • And ENCODE said…

“nearly the entire genome may be represented in primary transcripts that extensively overlap and include many non-protein-coding regions”

  • Who Are They?
    • tRNA, rRNA, snoRNAs,
    • microRNAs, siRNAs
    • piRNAs
    • long ncRNAs (Xist, Evf, Air, CTN, PINK…)
  • How Many of Them?
    • Open question
    • 30,000 is a common guess
    • Harder to detect than proteins


Searching

“…When Looking for a Needle in a Haystack, the Optimist Wears Gloves…”

ncRNAs Can Evolve Rapidly

[Figure: stem diagrams drawn from paired nucleotides; labelled strands CTTGCCTCC / GAACGGACC and CTTGCCTGG / GAACGGAGG]

CCAGGCAAGACGGGACGAGAGTTGCCTGG

CCTCCGTTCAGAGGTGCATAGAACGGAGG

**-------*--**---*-**------**

ncRNAs are Difficult to Align

CCAGGCAAGACGGGACGAGAGTTGCCTGG

CCTCCGTTCAGAGGTGCATAGAACGGAGG

**-------*--**---*-**------**

Regular Alignment

--CCAGGCAAGACGGGACGAGAGTTGCCTGG

CCTCCGTTCAGAGGTGCATAGAACGGAGG--

* * *** * * *** *

ncRNAs are Difficult to Align
  • Same Structure, Low Sequence Identity
  • Small Alphabet, Short Sequences → Alignments often Non-Significant
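
The "same structure, low identity" point can be checked on the two example sequences from the slides. The sketch below is only an illustration: the helper names are hypothetical, only Watson-Crick pairs are counted (no G-U wobble), and the "structure" is reduced to the stem closed by the two ends of each sequence.

```python
# Illustrative sketch: helper names are hypothetical, not from the presentation.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def percent_identity(a: str, b: str) -> float:
    """Percentage of identical columns in two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def terminal_stem_length(seq: str) -> int:
    """Count consecutive Watson-Crick pairs between the 5' and 3' ends, moving inwards."""
    i, j = 0, len(seq) - 1
    pairs = 0
    while i < j and COMPLEMENT[seq[i]] == seq[j]:
        pairs += 1
        i += 1
        j -= 1
    return pairs

SEQ_1 = "CCAGGCAAGACGGGACGAGAGTTGCCTGG"
SEQ_2 = "CCTCCGTTCAGAGGTGCATAGAACGGAGG"

print(f"identity: {percent_identity(SEQ_1, SEQ_2):.1f}%")
print(f"terminal stem, sequence 1: {terminal_stem_length(SEQ_1)} bp")
print(f"terminal stem, sequence 2: {terminal_stem_length(SEQ_2)} bp")
```

Only 10 of the 29 columns are identical, yet both sequences can close a terminal stem of 8 to 9 base pairs, because the two sides of the stem have changed together.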
Obtaining the Structure of a ncRNA is Difficult
  • Hard to Align The Sequences Without the Structure
  • Hard to Predict the Structures Without an Alignment
The Holy Grail of RNA Comparison: Sankoff's Algorithm
  • Simultaneous Folding and Alignment
    • Time Complexity: O(L^(2n)) for n sequences of length L
    • Space Complexity: O(L^(3n))
  • In Practice, for Two Sequences:
    • 50 nucleotides: 1 min., 6 MB
    • 100 nucleotides: 16 min., 256 MB
    • 200 nucleotides: 4 hours, 4 GB
    • 400 nucleotides: 3 days, 3 TB
  • Forget about
    • Multiple sequence alignments
    • Database searches
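
The running times quoted for two sequences are roughly what a fourth-power law predicts. The sketch below extrapolates from the 50-nucleotide figure, assuming time grows as L^(2n) with n = 2; the function name and the choice of reference point are illustrative assumptions, and memory is not modelled.

```python
# Illustrative sketch: extrapolates running time from the 50-nt figure on the slide,
# assuming time grows as L^(2n). The function name is hypothetical.
def extrapolate_minutes(length: int, n_seqs: int = 2,
                        ref_length: int = 50, ref_minutes: float = 1.0) -> float:
    """Scale the reference running time by (length / ref_length) ** (2 * n_seqs)."""
    exponent = 2 * n_seqs
    return ref_minutes * (length / ref_length) ** exponent

for length in (50, 100, 200, 400):
    minutes = extrapolate_minutes(length)
    print(f"{length:>4} nt: {minutes:7.0f} min = {minutes / 60:5.1f} h = {minutes / 1440:4.2f} days")
```

Doubling the length multiplies the time by 16, which gives 16 minutes, about 4.3 hours and about 2.8 days for 100, 200 and 400 nucleotides. Adding a third sequence raises the exponent to 6, which is why multiple alignments and database searches are out of reach.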
The Next Best Thing: Consan
  • Consan = Sankoff + a few constraints
  • Use of Stochastic Context Free Grammars
    • Tree-shaped HMMs
    • Made sparse with constraints
  • The constraints are derived from the most confident positions of the alignment
  • Equivalent of Banded DP
Consan for Databases: Infernal
  • Infernal is a Faster version of Consan
  • For Database Search
  • Still Very Slow

Receiver operating characteristic (ROC)

Comparison of Infernal with BLAST

Consan for Databases: Infernal
  • BLAST: 360 s.
  • Fast Infernal: 182 000 s.
  • Slow Infernal: 5 320 000 s.
Rfam: In Practice
  • Rfam contains RNA families
    • Families → Multiple Sequence Alignment → Models
    • Models are like Pfam Profiles
      • Use Consan or Cmsearch rather than HMMer
      • Much Slower
    • Too expensive to search the models
      • Models are used to build Rfam
      • People usually BLAST Rfam
Where do Rfam Families Come From?
  • Infernal Requires a Model
  • Models Require an MSA
  • The MSA requires a Family
  • It all starts with a BlastN

Rfam, Gardner et al. NAR 2008

Can we make BlastN more accurate?
  • BlastN is not very accurate because:
    • Poor substitution models for Nucleic Acids
    • Low information density (4 symbols)
  • BlastN assumes
    • Equal evolution rates for all nucleotides
    • Independence from Neighbors
Love Thy Neighbor

Measured Nearest Neighbor Dependencies on Rfam sequences

Measuring Di-Nucleotide Evolution
  • Each Nucleotide can be made more informative
  • It can incorporate the “name” of its Neighbor
    • AA => a
    • AG => b
    • AC => c
    • AT => d
  • A 16 Letter alphabet can be used to recode all nucleotide sequences
  • We name these extended Nucleotides
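
The recoding can be sketched in a few lines. Only the four mappings shown on the slide (AA, AG, AC, AT) come from the presentation; the letters assigned to the other twelve pairs, the choice of the following nucleotide as the "neighbor", and the handling of the last position (dropped here) are assumptions made for illustration.

```python
# Illustrative sketch of the 16-letter recoding. Assumptions: the neighbor is the
# NEXT nucleotide, the last nucleotide (no neighbor) is dropped, and pairs beyond
# the four on the slide are lettered by continuing the same A, G, C, T order.
import string
from itertools import product

ORDER = "AGCT"  # second-letter order from the slide: AA, AG, AC, AT

PAIR_TO_SYMBOL = {
    first + second: letter
    for (first, second), letter in zip(product(ORDER, repeat=2), string.ascii_lowercase)
}

def recode(seq: str) -> str:
    """Rewrite a nucleotide sequence as overlapping di-nucleotide symbols."""
    seq = seq.upper().replace("U", "T")
    return "".join(PAIR_TO_SYMBOL[seq[i:i + 2]] for i in range(len(seq) - 1))

print(PAIR_TO_SYMBOL["AA"], PAIR_TO_SYMBOL["AG"], PAIR_TO_SYMBOL["AC"], PAIR_TO_SYMBOL["AT"])
print(recode("CCAGGCAAGACGGGACGAGAGTTGCCTGG"))
```

Each output symbol now carries two nucleotides' worth of context, so a recoded sequence of length L - 1 is written in a 16-letter alphabet instead of a 4-letter one.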
Substitutions ??
  • How much does it cost to turn one nucleotide into another one?
  • Blosum/Pam style matrix
  • Matrices estimated on Rfam families
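
A Blosum-style matrix is a table of log-odds scores: how often two symbols are seen aligned, relative to how often they would be aligned by chance. The sketch below shows that calculation on a toy alignment. It is not the procedure used to build the matrices in the presentation: the function name, the pseudocount, the scaling by 2 and the absence of sequence clustering are all simplifying assumptions.

```python
# Illustrative log-odds estimation in the Blosum style. All names and parameters
# are hypothetical; real matrices also cluster similar sequences before counting.
import math
from collections import Counter
from itertools import combinations

def log_odds_matrix(alignment, pseudocount=1.0):
    """Estimate symmetric substitution scores from the columns of an alignment.

    alignment: list of equal-length strings over any alphabet; '-' marks a gap.
    Returns a dict mapping (symbol, symbol) to a rounded score in half-bits.
    """
    alphabet = sorted({c for seq in alignment for c in seq if c != "-"})
    pair_counts = Counter()
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        for x, y in combinations(residues, 2):
            pair_counts[tuple(sorted((x, y)))] += 1

    # Observed frequency of each unordered pair, smoothed with a pseudocount.
    pairs = [(a, b) for i, a in enumerate(alphabet) for b in alphabet[i:]]
    counts = {p: pair_counts[p] + pseudocount for p in pairs}
    total = sum(counts.values())
    observed = {p: c / total for p, c in counts.items()}

    # Background frequency of each symbol, derived from the pair frequencies.
    background = {a: 0.0 for a in alphabet}
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        expected = background[a] ** 2 if a == b else 2 * background[a] * background[b]
        score = round(2 * math.log2(freq / expected))
        scores[(a, b)] = scores[(b, a)] = score
    return scores

toy_alignment = ["ab", "ab", "ab", "aa"]
print(log_odds_matrix(toy_alignment))
```

Symbols that stay aligned with themselves more often than chance get positive scores, and rarely observed substitutions get negative ones. The same calculation applies unchanged to the 16-letter extended alphabet.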
Using BlastR
  • When Nucleic Acids look like Proteins
  • They can be aligned with Protein Methods
    • BlastN → BlastP
    • BlastP with eRNA is BlastR
Benchmarking BlastR

[Figure labels: Query, Blast, Rfam, E-VALUES, PP, PN]

Benchmarking BlastR

[Figure labels: Blast against Rfam 001, Rfam 002, Rfam …; ROC]

Benchmarking BlastR

[Figure: plot with axes False Positives and True Positive; regions marked Good and Bad]

Benchmarking BlastR

[Figure: plot with axes False Positives and True Positive; regions marked Good and Bad; Area Under Curve]
  • Small AUC → Better
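
One way to read "small AUC is better" is that false positives are accumulated along one axis as true positives are collected along the other, so a good method picks up few false positives before finding all true family members. The sketch below computes that area from a ranked hit list. The axis orientation, the normalisation and the function name are assumptions, and ties between E-values are not treated specially.

```python
# Illustrative sketch: area under a false-positives-versus-true-positives curve.
# Axis orientation and normalisation are assumptions, not taken from the slides.
def false_positive_area(hits):
    """Return a normalised area in [0, 1]; 0 means every true hit outranks every false hit.

    hits: iterable of (e_value, is_true_member) tuples, in any order.
    """
    ranked = sorted(hits, key=lambda hit: hit[0])  # best (smallest) E-value first
    n_true = sum(1 for _, is_true in ranked if is_true)
    n_false = len(ranked) - n_true
    if n_true == 0 or n_false == 0:
        raise ValueError("need at least one true and one false hit")

    false_seen = 0
    area = 0
    for _, is_true in ranked:
        if is_true:
            # One step along the true-positive axis at the current false-positive height.
            area += false_seen
        else:
            false_seen += 1
    return area / (n_true * n_false)

example_hits = [(1e-30, True), (1e-20, False), (1e-10, True), (1.0, False)]
print(false_positive_area(example_hits))
```

With this convention the value equals one minus the conventional ROC AUC, so comparing methods by "smaller is better" here gives the same ranking as "larger is better" on a standard ROC plot.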

The 3 Components of BlastR
  • BlastP is better than BlastN
  • BlosumR makes BlastP a little bit better
  • And Faster

Blast: wuBlast
BlastR and Clustering
  • Given all Rfam in Bulk
  • How good is BlastR at reconstituting all the families?

[Figure: plot of Sensitivity against 1-Specificity]
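
The clustering question can be made concrete as follows: link any two sequences that hit each other below an E-value threshold, then compare the connected groups with the known families. The sketch below uses single-linkage clustering with a union-find structure. The presentation does not state which clustering method was used, so the method, the function name, the input format and the default threshold are all assumptions.

```python
# Illustrative sketch: single-linkage clustering of pairwise hits with union-find.
# The clustering method, names and default threshold are assumptions.
def cluster_hits(hits, threshold=1e-20):
    """Group sequence identifiers connected by hits with E-value <= threshold.

    hits: iterable of (query_id, subject_id, e_value) tuples.
    Returns a list of sets, one per cluster (singletons included).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for query, subject, e_value in hits:
        root_q, root_s = find(query), find(subject)  # registers both identifiers
        if e_value <= threshold:
            parent[root_q] = root_s

    clusters = {}
    for seq_id in list(parent):
        clusters.setdefault(find(seq_id), set()).add(seq_id)
    return list(clusters.values())

example = [("s1", "s2", 1e-30), ("s2", "s3", 1e-25), ("s4", "s5", 1e-3)]
print(cluster_hits(example))
```

Sensitivity and specificity then follow from counting how many pairs of sequences from the same Rfam family end up in the same cluster, and how many pairs from different families do.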


BlastR: In Practice

[Figure: BlastR compared with BlastN; E-Value Threshold: 10^-20]

Take Home
  • Searching Nucleotides is Difficult
  • BlastN is not a very good algorithm
  • Simple Adaptations can improve the situation
    • Changing the algorithm (BlastP)
    • Changing the Scoring Scheme (BlastP-Nuc)
    • Changing the alphabet (BlastR)