
The modern RNA world: computational screens for noncoding RNA genes. Eddy lab, HHMI/Washington University, Saint Louis. The human genome sequence is (almost) done. The genome, famously, is digital.


The modern RNA world:

computational screens for noncoding RNA genes

Eddy lab

HHMI/Washington University, Saint Louis

The genome, famously, is digital

1892: Miescher postulates that genetic information may be encoded in a linear form using a few different chemical units:

“...just as all the words and concepts in all languages can find expression in twenty-four to thirty letters of the alphabet.”

Symbolic texts can be cracked

Michael Ventris and John Chadwick, 1953

“Cryptography has contributed a new weapon to the student of unknown scripts.... the basic principle is the analysis and indexing of coded texts, so that underlying patterns and regularities can be discovered. If a number of instances can be collected, it may appear that a certain group of signs in the coded text has a particular function....”

- John Chadwick, The Decipherment of Linear B, Cambridge Univ. Press, 1958

Comparative genome analysis

VISTA plot; I. Dubchak, E. Rubin, et al.

human, mouse, dog genomes

Estimates of human gene number

mean: 61,710

low: 27,462

high: 153,478

Want to place a bet? The book is held by the bartender at Cold Spring Harbor Laboratory.

The yeast genome completed

Science 274:546, 1996

Life with 6000 Genes

A. Goffeau, B.G. Barrell, H. Bussey, R.W. Davis, B. Dujon, H. Feldmann, F. Galibert, J.D. Hoheisel, C. Jacq, M. Johnston, E.J. Louis, H.W. Mewes, Y. Murakami, P. Philippsen, H. Tettelin, S.G. Oliver

where “gene” = ORF of 100 amino acids or more.

but besides the ~6000 large protein-coding genes, there’s also:

140 ribosomal RNA genes,

275 transfer RNA genes,

~40 small nuclear RNA genes,

~100 small nucleolar RNA genes,

... and ... ?

Structure of the large ribosomal subunit

Haloarcula marismortui

Ban et al., Science 289:905, 2000

inside-out genes

Tycowski, Shu, and Steitz

Nature 379:464, 1996

Human UHG (U22 host gene)

no significant ORFs; not conserved with mouse; rapidly degraded

Eight intron-encoded snoRNAs

conserved with mouse; stable

An RNA motor

Simpson et al., Nature 408:745, 2000

“Structure of the bacteriophage φ29 DNA packaging motor”

Cartilage-hair hypoplasia mapped to an RNA

M. Ridanpää et al., Cell 104:195, 2001

RMRP: Human RNase MRP, 267 nt

microRNAs (miRNAs) in metazoa

T. Tuschl; D. Bartel; V. Ambros

lin-4 acts as a translational repressor by binding the 3’ UTR

~22-mer processed from ~70-mer precursor by the RNAi pathway

RNA genes can be hard to detect


C. elegans let-7; 21 nt

Pasquinelli et al. Nature 408:86, 2000

  • often small

  • sometimes multicopy and redundant

  • often not polyadenylated (and remember EST libraries are poly-A selected)

  • immune to frameshift and nonsense mutation

  • no open reading frame or codon bias

  • relatively little information in primary sequence consensus

Two computational analysis problems

  1. Similarity search (e.g. BLAST): I give you a query; you find sequences in a database that look like the query. For RNA, you want to take the secondary structure of the query into account.

  2. Genefinding (e.g. GENSCAN): Based solely on a priori knowledge of what a “gene” looks like, find genes in a genome sequence. For RNA – with no open reading frame and no codon bias – what do you look for?

Context-free grammars

Noam Chomsky, 1956

Basic CFG

“production rules”

a CFG “derivation”
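The production rules and derivation shown on this slide are not preserved in the transcript. As a stand-in, here is a minimal sketch of a context-free grammar that generates an RNA stem-loop. The grammar, the rule names, and the `derive` helper are all hypothetical and chosen only for illustration; the point is that a paired rule such as S → g S c emits two correlated symbols at once, which is what lets a CFG describe base pairing.

```python
# Hypothetical toy grammar for an RNA stem-loop (not taken from the slides).
# "S" is the only nonterminal; lowercase letters are terminals (nucleotides).
PRODUCTIONS = {
    "pair_gc": "gSc",  # emit a g...c base pair around S
    "pair_cg": "cSg",
    "pair_au": "aSu",
    "pair_ua": "uSa",
    "left_a": "aS",    # emit one unpaired base to the left of S
    "left_c": "cS",
    "left_g": "gS",
    "left_u": "uS",
    "end": "",         # S -> empty string, ends the derivation
}


def derive(rule_names):
    """Rewrite the single nonterminal S rule by rule, recording each step."""
    current = "S"
    steps = [current]
    for name in rule_names:
        if "S" not in current:
            raise ValueError("no nonterminal left to rewrite")
        current = current.replace("S", PRODUCTIONS[name], 1)
        steps.append(current)
    return steps


# Three nested base pairs closing a four-base loop.
derivation = derive(
    ["pair_gc", "pair_au", "pair_cg", "left_g", "left_a", "left_a", "left_a", "end"]
)
print(" => ".join(step if step else "(empty)" for step in derivation))
```

Each step of the printed derivation rewrites S once; the nesting of the paired rules mirrors the nesting of base pairs in the stem.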

Sequence vs. secondary structure alignment

R Durbin, SR Eddy, GJ Mitchison, A Krogh

Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids

Cambridge Univ. Press, 1998

[Table: HMM algorithm vs. SCFG algorithm, compared by optimal alignment, P(sequence | model), EM parameter estimation, memory complexity, time complexity (general), and time complexity (as used)]

  • We can analyze target sequences with secondary structure models, but the algorithms are computationally expensive.
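To give a feel for where the cost comes from, here is a sketch of the classic Nussinov base-pair maximization recursion. This is not the SCFG alignment algorithm from the slides; it is a simpler, non-probabilistic relative with the same dynamic-programming shape: a table over all subsequences (i, j), which is quadratic in memory, and a bifurcation loop over split points k, which makes it cubic in time in the sequence length. The function name and the `min_loop` parameter are assumptions made for this example.

```python
def nussinov_max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq.

    Uses an N x N table (O(N^2) memory) and, for each cell, a loop over
    split points k (O(N^3) time overall).
    """
    seq = seq.lower()
    allowed = {("a", "u"), ("u", "a"), ("g", "c"), ("c", "g"), ("g", "u"), ("u", "g")}
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    # Fill by increasing subsequence length; hairpin loops need >= min_loop bases.
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = max(best[i + 1][j], best[i][j - 1])  # i or j left unpaired
            if (seq[i], seq[j]) in allowed:
                score = max(score, best[i + 1][j - 1] + 1)  # i pairs with j
            for k in range(i + 1, j):  # bifurcation into two substructures
                score = max(score, best[i][k] + best[k + 1][j])
            best[i][j] = score
    return best[0][n - 1]


print(nussinov_max_pairs("gacgaaaguc"))
```

SCFG algorithms replace the pair count with probabilities and add a model dimension on top of this table, which is why they are so much more expensive than their HMM counterparts.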

SCFG-based RNA similarity search

C/D methylation guide snoRNA consensus:

Graphical model, prior to conversion to probabilistic model:

the program snoscan was used to detect C/D snoRNA homologues in Archaea;

Omer et al., Science 288:517-522, 2000

SCFGs for RNA folding

Elena Rivas and S.R. Eddy, Bioinformatics 16:573, 2000

A full SCFG analogue of Michael Zuker’s minimum-energy RNA folding: this means we can apply statistical models to any RNA structure (e.g., what’s the probability that this is a plausible RNA structure?)

Genefinding by comparative analysis

Jonathan Badger, Gary Olsen: CRITICA, Mol. Biol. Evol. 16:512, 1999

Most comparative analysis relies just on differential rates of evolution.

However, the pattern of mutation is also informative.

The OTHER model: score with terms P(a,b | OTH); models divergence only.

The CODING model: score with terms P(aaa,bbb | COD); models divergence, constrained by amino acid substitution matrix and codon bias.

add: a comparative model of structural RNAs

Elena Rivas, S.R. Eddy: QRNA, BMC Bioinformatics 2:8, 2001

The RNA model: terms P(a-a’, b-b’ | RNA); models DNA divergence constrained by a secondary structure.

Some technical issues

  • The structure is unknown, so the model must average over the ensemble of possible structures.

  • The model must deal with gapped alignments.

  • The bounds of conservation or alignment don’t correspond to the bounds of the RNA.

  • The evolutionary divergence times of the three models must be the same.

  • We use a form of probabilistic model called “pair-SCFGs”.
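Once an alignment has been scored under each of the three models (OTH, COD, RNA), the comparison between them can be sketched as a small Bayesian calculation. The code below is an illustration only: the `classify` function, the uniform priors, and the log-likelihood values are assumptions made for this example and are not taken from QRNA.

```python
import math


def classify(log_likelihoods, priors=None):
    """Return the most probable model and the posterior for every model.

    log_likelihoods maps a model name to log P(alignment | model).
    priors defaults to uniform, which is an assumption of this sketch.
    """
    models = list(log_likelihoods)
    if priors is None:
        priors = {m: 1.0 / len(models) for m in models}
    log_joint = {m: log_likelihoods[m] + math.log(priors[m]) for m in models}
    # Log-sum-exp keeps the normalization numerically stable.
    top = max(log_joint.values())
    log_total = top + math.log(sum(math.exp(v - top) for v in log_joint.values()))
    posteriors = {m: math.exp(v - log_total) for m, v in log_joint.items()}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors


# Made-up scores for one pairwise alignment window.
best_model, posterior = classify({"OTH": -120.0, "COD": -118.0, "RNA": -112.0})
print(best_model, {m: round(p, 4) for m, p in posterior.items()})
```

Requiring the three models to share the same divergence time, as noted above, is what makes such a comparison fair: otherwise one model could win simply by fitting the overall level of identity better.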

A screen for novel ncRNAs in E. coli

Elena Rivas et al., Curr Biol 11:1369, 2001

2367 E. coli intergenic sequences >50 nt in length

WUBLASTN vs. S. typhi, S. paratyphi, S. enteritidis, K. pneumoniae

gave 23,674 WUBLASTN alignments w/ E<0.01, length >50 nt, >65% identity

QRNA classified: 556 candidate RNA loci

160 candidate small ORFs (not examined further)

281 candidate loci are explainable: cis-regulatory RNA structures (terminators, attenuators, etc.) and certain inverted repeat elements

leaves 275 candidate ncRNA gene loci

Northerns on 49 candidates: 11/49 are expressed as small stable RNAs in exponentially growing E. coli in rich media
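The alignment-filtering step of this screen can be sketched as a simple threshold test. The thresholds (E < 0.01, length > 50 nt, identity > 65%) come from the slide; the `Alignment` record, its field names, and the example hits are hypothetical and do not correspond to actual WU-BLAST output fields or to real data.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    # Hypothetical record; field names are illustrative only.
    query_id: str
    subject_genome: str
    e_value: float
    length: int
    percent_identity: float


def passes_screen(aln, max_e=0.01, min_length=50, min_identity=65.0):
    """Apply the slide's thresholds: E < 0.01, length > 50 nt, identity > 65%."""
    return (
        aln.e_value < max_e
        and aln.length > min_length
        and aln.percent_identity > min_identity
    )


# Made-up hits, one passing and three failing a single criterion each.
hits = [
    Alignment("igr_0001", "S. typhi", 1e-8, 120, 82.5),
    Alignment("igr_0002", "K. pneumoniae", 0.5, 90, 70.0),     # E-value too high
    Alignment("igr_0003", "S. paratyphi", 1e-4, 45, 91.0),     # too short
    Alignment("igr_0004", "S. enteritidis", 1e-6, 200, 60.0),  # identity too low
]
kept = [h for h in hits if passes_screen(h)]
print([h.query_id for h in kept])
```

Alignments that survive a filter like this are the input that QRNA then classifies.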

The Altuvia screen

Argaman et al., Current Biology 11:941, 2001

“Novel small RNA-encoding genes in the intergenic regions of E. coli”

“Over a period of about 30 years, only four bona fide regulatory RNAs have been discovered in E. coli. Here we report on the discovery of 14 novel small RNA-encoding genes....”

sraA 120 nt

sraB 149-168 nt

rprA 105 nt

sraC 234-249 nt

sraD 70 nt

gcvB 205 nt

sraE 88 nt

sraF 189 nt

sraG 146-174 nt

sraH 88-108 nt

sraI 91-94 nt

sraJ 172 nt

sraK 245 nt

sraL 140 nt

  • start w/ “intergenic” regions

  • computational identification of putative promoter and terminator, 50-400 nt apart

  • select regions conserved with other bacteria by BLAST

The Gottesman screen

Wassarman et al., Genes Dev. 15:1637, 2001

“Identification of novel small RNAs using comparative genomics and microarrays”

rydB 60 nt

ryeE 86 nt

ryfA 320 nt

ryhA 45 nt (sraH)

ryhB 90 nt (sraI)

ryiA 210 nt

ryjA 92 nt

rybB 80 nt

ryiB 270 nt (sraK, csrC)

rybA 205 nt

rygA 89 nt (sraE)

rygB 83 nt

ryeA 275 nt

ryeB 100 nt

ryeC 107,143 nt

ryeD 102,137 nt

rygC 107,139 nt

“... a multifaceted search strategy to predict sRNA genes was validated by our discovery of 17 novel sRNAs....”

  • intergenic regions >= 180 nt

  • conserved w/ other bacteria by BLAST

  • manual inspection of location & sequence

  • expression detected on high-density oligo probe array

Summary of three E. coli screens

31 different new RNAs found and confirmed by the three screens:

Altuvia: 14

Gottesman: 19 (1 showed no expression; 1 untested)

Rivas: 22 (1 showed no expression; 10 untested)

Conclusions: Sensitivity of QRNA is respectable;

most E. coli ncRNAs conserve secondary structure

Only 4/11 of our confirmed ncRNAs are in the Altuvia or Gottesman gene sets

Conclusions: These screens have not saturated E. coli for new ncRNAs;

We have >200 other candidates in testing;

We have confirmed transcripts as short as 40 nt;

The functions of these RNAs are unknown.

Pyrococcus: three hyperthermophile genomes

  • P. horikoshii: 1.8 Mb, complete; isolated off Okinawa, 1400 m depth; Kawarabayasi et al. (NITE, Tokyo)

  • P. furiosus: 1.9 Mb, complete; from Vulcano Island, Italy; Robb et al. (Utah Genome Center)

  • P. abyssi: 1.8 Mb, complete; from South Pacific vent, 3500 m depth; Genoscope (France)

A “black smoker” – deep sea hydrothermal vent

photo: American Museum of Natural History

RNAs stand out in AT-rich hyperthermophiles

Organism        growth temp (C)   % GC (genome)   % GC (RNA)   RNA - genome   % known RNAs detected

Methanococcus   85                31%             67%          36%            97%

Pyrococcus      98                42%             71%          29%            52%

Borrelia        37                29%             54%          25%            29%

Aquifex         90                44%             68%          24%            14%

Archaeoglobus   83                48%             68%          20%            2%

S. cerevisiae   30                38%             54%          16%            0

E. coli         37                51%             59%          8%             0


The G/C computational screen

Robbie Klein et al., manuscript submitted

Implemented as a 2-state hidden Markov model, using Viterbi or posterior decoding algorithms.

Methanococcus jannaschii: (Viterbi parse alone)

43 regions detected (some span multiple RNAs)

includes 36/37 tRNAs; SSU and LSU rRNA; 5S, 7S, RNase P.

9 unassigned candidates.

4/9 express small RNAs detectable on Northern.

Pyrococcus furiosus: (posterior decoding, plus conservation with P. abyssi, P. horikoshii)

51 regions detected (some span multiple RNAs)

includes 46/46 tRNAs, SSU and LSU rRNA; 2 5S, 7S, and RNase P.

8 unassigned candidates.

4/8 express small RNAs detectable on Northern.
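A 2-state HMM of this kind can be sketched in a few lines. The version below labels each base as background ("B") or GC-rich, RNA-like ("R") with the Viterbi algorithm. The state names, the emission probabilities (0.7 vs. 0.4 GC), and the switching probability are assumptions made for this example, not the parameters of the actual screen, and only Viterbi decoding is shown, not posterior decoding.

```python
import math


def viterbi_gc(seq, p_gc_rna=0.7, p_gc_bg=0.4, p_switch=0.01):
    """Label each base 'R' (GC-rich, RNA-like) or 'B' (background)."""
    seq = seq.upper()
    if not seq:
        return ""
    states = ("B", "R")
    gc_prob = {"B": p_gc_bg, "R": p_gc_rna}

    def log_emit(state, base):
        p = gc_prob[state]
        # G and C split probability p; every other base splits 1 - p.
        return math.log(p / 2) if base in "GC" else math.log((1 - p) / 2)

    log_stay, log_move = math.log(1 - p_switch), math.log(p_switch)

    def log_trans(prev, cur):
        return log_stay if prev == cur else log_move

    # Start in either state with equal probability (an assumption).
    score = {s: math.log(0.5) + log_emit(s, seq[0]) for s in states}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for cur in states:
            prev = max(states, key=lambda r: score[r] + log_trans(r, cur))
            new_score[cur] = score[prev] + log_trans(prev, cur) + log_emit(cur, base)
            pointers[cur] = prev
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(states, key=score.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


# Synthetic sequence: AT-rich flanks around a GC-rich core.
path = viterbi_gc("AT" * 30 + "GC" * 30 + "AT" * 30)
print(path)
```

A low switching probability makes the model report long contiguous regions rather than isolated GC-rich bases, which fits the slide's note that some detected regions span multiple RNAs.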

Comparison of G/C to QRNA screen

Robbie Klein et al., PNAS, in press

P. furiosus – screened by QRNA by comparison to P. horikoshii, P. abyssi

[Table: G/C screen vs. QRNA screen, compared by candidate loci, known tRNAs detected (of 46), novel loci, and loci confirmed by Northern]

  • Like the E. coli screen, about 25% of QRNA candidates were confirmed by Northern (again in a single growth condition only).

  • QRNA is detecting most novel structural RNA genes.

human/mouse ncRNA detection

the cartilage-hair hypoplasia region:

QRNA is a general genefinder for structural ncRNA genes.

The ancient RNA World

Gesteland, Cech, Atkins: The RNA World, CSHL Press, 1999

RNA is very good at recognizing RNA

Ha, Wightman, Ruvkun; Genes Dev. 10:3041, 1996

A closing idea: The modern RNA world


When a cell needs a molecule that specifically recognizes a target RNA molecule, and the function is either:

- catalytically unsophisticated

- something that can be abstracted onto a shared protein (e.g. many guide snoRNAs, one methylase)

then RNA may be the material of choice. Specific RNA-binding proteins are big, expensive, and more difficult to evolve.

In fact, an old idea...

Jacob and Monod, JMB 3:318, 1961


  • There appear to be many noncoding RNA genes.

  • Methods to find homologous RNAs by structural similarity have been greatly improved, using stochastic context-free grammar algorithms.

  • Methods to find novel RNAs by de novo genefinding have finally become possible, for instance by using comparative genome analysis.

[SR Eddy, Nature Reviews Genetics, 2:919, 2001]

[R Durbin et al., Biological Sequence Analysis, Cambridge U. Press 1998]

[E Rivas, RJ Klein, TA Jones, SR Eddy, Curr Biol 11:1369, 2001; E Rivas, SR Eddy, BMC Bioinformatics, 2:8, 2001]



the Eddy lab:

senior scientist:

Elena Rivas


Goran Ceric


Ajay Khanna

wet lab:

Ziva Misulovin

secret agent man:

Tom Jones







Zhirong Bao

Christian Zmasek

Robin Dowell

Robbie Klein

Steve Johnson

Shawn Stricklin

John McCutcheon