The modern RNA world:

computational screens for noncoding RNA genes

Eddy lab

HHMI/Washington University, Saint Louis

The human genome sequence is (almost) done

The genome, famously, is digital

1892: Miescher postulates that genetic information may be encoded in a linear form using a few different chemical units:

“...just as all the words and concepts in all languages can find expression in twenty-four to thirty letters of the alphabet.”

Symbolic texts can be cracked

Michael Ventris and John Chadwick, 1953

“Cryptography has contributed a new weapon to the student of unknown scripts.... the basic principle is the analysis and indexing of coded texts, so that underlying patterns and regularities can be discovered. If a number of instances can be collected, it may appear that a certain group of signs in the coded text has a particular function....”

- John Chadwick, The Decipherment of Linear B, Cambridge Univ. Press, 1958

The phylogenetic history of life

Comparative genome analysis

VISTA plot; I. Dubchak, E. Rubin, et al.

human, mouse, dog genomes

Estimates of human gene number

mean: 61,710

low: 27,462

high: 153,478

Want to place a bet? The book is held by the bartender at Cold Spring Harbor Laboratory.

The yeast genome completed

Science 274:546, 1996

Life with 6000 Genes

A. Goffeau, B.G. Barrell, H. Bussey, R.W. Davis, B. Dujon, H. Feldmann, F. Galibert, J.D. Hoheisel, C. Jacq, M. Johnston, E.J. Louis, H.W. Mewes, Y. Murakami, P. Philippsen, H. Tettelin, S.G. Oliver

where “gene” = ORF of 100 amino acids or more.

but besides the ~6000 large protein-coding genes, there’s also:

140 ribosomal RNA genes,

275 transfer RNA genes,

~40 small nuclear RNA genes,

~100 small nucleolar RNA genes,

... and ... ?

Structure of the large ribosomal subunit

Haloarcula marismortui

Ban et al., Science 289:905, 2000

inside-out genes

Tycowski, Shu, and Steitz

Nature 379:464, 1996

Human UHG (U22 host gene)

no significant ORFs; not conserved with mouse; rapidly degraded

Eight intron-encoded snoRNAs

conserved with mouse; stable

An RNA motor

Simpson et al., Nature 408:745, 2000

“Structure of the bacteriophage φ29 DNA packaging motor”

Cartilage-hair hypoplasia mapped to an RNA

M. Ridanpää et al., Cell 104:195, 2001

RMRP: Human RNase MRP, 267 nt

microRNAs (miRNAs) in metazoa

T. Tuschl; D. Bartel; V. Ambros

lin-4 acts as a translational repressor by binding the 3’ UTR

~22-mer processed from ~70-mer precursor by the RNAi pathway

RNA genes can be hard to detect


C. elegans let-7; 21 nt

Pasquinelli et al. Nature 408:86, 2000

  • often small

  • sometimes multicopy and redundant

  • often not polyadenylated (and remember EST libraries are poly-A selected)

  • immune to frameshift and nonsense mutation

  • no open reading frame or codon bias

  • relatively little information in primary sequence consensus

Two computational analysis problems

  1. Similarity search (e.g. BLAST):

  • I give you a query; you find sequences in a database that look like the query.

  • For RNA, you want to take the secondary structure of the query into account.

  2. Genefinding (e.g. GENSCAN):

  • Based solely on a priori knowledge of what a “gene” looks like, find genes in a genome sequence.

  • For RNA – with no open reading frame and no codon bias – what do you look for?

RNA structure: nested pairwise correlations

Context-free grammars

Noam Chomsky, 1956

Basic CFG

“production rules”

a CFG “derivation”
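A minimal sketch of what production rules and a derivation look like for RNA-like strings. The toy grammar below is an assumption made for illustration, not the grammar shown on the slide: the nonterminal S emits a base pair around itself, so the two ends of the string are generated together and stay complementary, and L emits a fixed loop.

```python
# Toy context-free grammar (hypothetical): S wraps itself in a base pair or
# hands over to L, which emits a four-base loop.
RULES = {
    "S": [["a", "S", "u"], ["u", "S", "a"], ["g", "S", "c"], ["c", "S", "g"], ["L"]],
    "L": [["g", "a", "a", "a"]],
}


def scripted(indices):
    """Return a chooser that picks rules in the given order (one index per expansion)."""
    it = iter(indices)
    return lambda nonterminal: RULES[nonterminal][next(it)]


def derive(symbols, choose):
    """Expand the leftmost nonterminal until only terminals remain; return every step."""
    steps = ["".join(symbols)]
    while any(s in RULES for s in symbols):
        i = next(k for k, s in enumerate(symbols) if s in RULES)
        symbols = symbols[:i] + choose(symbols[i]) + symbols[i + 1:]
        steps.append("".join(symbols))
    return steps


if __name__ == "__main__":
    # S -> gSc -> gaSuc -> gaLuc -> gagaaauc
    for step in derive(["S"], scripted([2, 0, 4, 0])):
        print(step)
```

The final string has g paired with c and a paired with u around the loop, which is the kind of nested pairwise correlation a regular grammar or HMM cannot express.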

Sequence vs. secondary structure alignment

R Durbin, SR Eddy, GJ Mitchison, A Krogh

Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids

Cambridge Univ. Press, 1998

[Table: HMM algorithm vs. SCFG algorithm, with rows for optimal alignment, P(sequence | model), EM parameter estimation, memory complexity, time complexity (general), and time complexity (as used); the table entries are not preserved in this transcript]

  • we can analyze target sequences with secondary structure models;

  • but the algorithms are computationally expensive.
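To give a feel for where the cost comes from, here is a sketch of a Nussinov-style dynamic program that maximizes the number of nested base pairs. It is a simplified stand-in chosen for illustration, not the SCFG alignment algorithm itself; the function name, the allowed pairs, and the `min_loop` parameter are assumptions. The three nested loops over i, j, and the split point k are what make the running time cubic in sequence length, and SCFG parsing has the same loop structure.

```python
def max_pairs(seq, min_loop=3):
    """Maximum number of nested complementary pairs in seq (Nussinov-style DP)."""
    pairs = {("a", "u"), ("u", "a"), ("g", "c"), ("c", "g"), ("g", "u"), ("u", "g")}
    n = len(seq)
    # best[i][j] holds the best pair count for the subsequence seq[i..j].
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):          # subsequence length minus one
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]               # case: position j is unpaired
            for k in range(i, j - min_loop):     # case: j pairs with some k
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0


if __name__ == "__main__":
    print(max_pairs("gggaaaccc"))
```

The table `best` uses memory quadratic in sequence length, which is why both memory and time become limiting for long target sequences.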

SCFG-based RNA similarity search

C/D methylation guide snoRNA consensus:

Graphical model, prior to conversion to probabilistic model:

the program snoscan was used to detect C/D snoRNA homologues in Archaea;

Omer et al., Science 288:517-522, 2000

SCFGs for RNA folding

Elena Rivas and S.R. Eddy, Bioinformatics 16:573, 2000

Full SCFG analogue of Michael Zuker’s minimum energy RNA folding, which means we can apply statistical models to any RNA structure (e.g., what’s the probability that this is a plausible RNA structure?)

Genefinding by comparative analysis

Jonathan Badger, Gary Olsen: CRITICA, Mol. Biol. Evol. 16:512, 1999

Most comparative analysis relies just on differential rates of evolution.

However, the pattern of mutation is also informative.

The OTHER model:

score with terms P(a,b | OTH)

models divergence only

The CODING model:

score with terms P(aaa,bbb | COD)

models divergence, constrained by amino acid substitution matrix and codon bias

add: a comparative model of structural RNAs

Elena Rivas, S.R. Eddy: QRNA, BMC Bioinformatics 2:8, 2001

The RNA model:

terms: P(a-a’, b-b’ | RNA)

models DNA divergence constrained by a secondary structure

Some technical issues

  • The structure is unknown; must do ensemble averaging.

  • The model must deal with gapped alignments.

  • The bounds of conservation or alignment don’t correspond to the bounds of the RNA.

  • The evolutionary divergence times of the three models must be the same.

  • We use a form of probabilistic model called “pair-SCFGs”.
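Once an alignment has a likelihood under each of the three models (OTH, COD, RNA), one plausible way to compare them is to turn the scores into posterior probabilities with Bayes' rule. The sketch below assumes the log-likelihoods are already computed; the function name, the uniform prior, and the example numbers are made up for illustration and are not QRNA's actual scoring code.

```python
import math


def posteriors(log_likelihoods, priors=None):
    """Posterior probability of each model given log P(alignment | model)."""
    models = list(log_likelihoods)
    if priors is None:
        # Assumption: with no other information, treat the models as equally likely.
        priors = {m: 1.0 / len(models) for m in models}
    joint = {m: log_likelihoods[m] + math.log(priors[m]) for m in models}
    # Log-sum-exp keeps the normalization stable for very negative scores.
    top = max(joint.values())
    total = top + math.log(sum(math.exp(v - top) for v in joint.values()))
    return {m: math.exp(joint[m] - total) for m in models}


if __name__ == "__main__":
    scores = {"OTH": -120.0, "COD": -118.0, "RNA": -110.0}  # invented values
    for model, p in posteriors(scores).items():
        print(model, round(p, 4))
```

The comparison is only meaningful if all three likelihoods are computed for the same alignment at the same divergence time, which is the constraint noted in the list above.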

Three models – examples of their scores

A screen for novel ncRNAs in E. coli

Elena Rivas et al., Curr Biol 11:1369, 2001

2367 E. coli intergenic sequences >50 nt in length

WUBLASTN vs. S. typhi, S. paratyphi, S. enteritidis, K. pneumoniae

gave 23,674 WUBLASTN alignments w/ E<0.01, length >50 nt, >65% identity

QRNA classified: 556 candidate RNA loci

160 candidate small ORFs (not examined further)

281 candidate loci are explainable: cis-regulatory RNA structures (terminators, attenuators, etc.) and certain inverted repeat elements

leaves 275 candidate ncRNA gene loci

Northerns on 49 candidates: 11/49 are expressed as small stable RNAs in exponentially growing E. coli in rich media
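The alignment cutoffs quoted above (E < 0.01, length > 50 nt, > 65% identity) amount to a simple filter, sketched here together with the candidate counts from the slide. The `Alignment` record and its field names are hypothetical; real WUBLASTN output would need to be parsed into this shape first.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    e_value: float
    length: int        # aligned length in nt
    identity: float    # percent identity


def passes(aln):
    """Apply the three cutoffs quoted on the slide."""
    return aln.e_value < 0.01 and aln.length > 50 and aln.identity > 65.0


# Candidate funnel, using the numbers stated on the slide.
candidate_rna_loci = 556
explainable_loci = 281
remaining_loci = candidate_rna_loci - explainable_loci   # 275
confirmed_fraction = 11 / 49                             # about 22% of tested loci

if __name__ == "__main__":
    print(passes(Alignment(1e-5, 120, 80.0)), remaining_loci, round(confirmed_fraction, 2))
```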

Northern blots confirming E. coli RNAs

The Altuvia screen

Argaman et al., Current Biology 11:941, 2001

“Novel small RNA-encoding genes in the intergenic regions of E. coli”

“Over a period of about 30 years, only four bona fide regulatory RNAs have been discovered in E. coli. Here we report on the discovery of 14 novel small RNA-encoding genes....”

sraA   120 nt
sraB   149-168 nt
rprA   105 nt
sraC   234-249 nt
sraD   70 nt
gcvB   205 nt
sraE   88 nt
sraF   189 nt
sraG   146-174 nt
sraH   88-108 nt
sraI   91-94 nt
sraJ   172 nt
sraK   245 nt
sraL   140 nt

  • start w/ “intergenic” regions

  • computational identification of putative promoter and terminator, 50-400 nt apart

  • select regions conserved with other bacteria by BLAST

The Gottesman screen

Wassarman et al., Genes Dev. 15:1637, 2001

“Identification of novel small RNAs using comparative genomics and microarrays”

rydB   60 nt
ryeE   86 nt
ryfA   320 nt
ryhA   45 nt (sraH)
ryhB   90 nt (sraI)
ryiA   210 nt
ryjA   92 nt
rybB   80 nt
ryiB   270 nt (sraK, csrC)
rybA   205 nt
rygA   89 nt (sraE)
rygB   83 nt
ryeA   275 nt
ryeB   100 nt
ryeC   107, 143 nt
ryeD   102, 137 nt
rygC   107, 139 nt

“... a multifaceted search strategy to predict sRNA genes was validated by our discovery of 17 novel sRNAs....”

  • intergenic regions >= 180 nt

  • conserved w/ other bacteria by BLAST

  • manual inspection of location & sequence

  • expression detected on high-density oligo probe array

Summary of three E. coli screens

31 different new RNAs found and confirmed by the three screens:

Altuvia: 14

Gottesman: 19 (1 showed no expression; 1 untested)

Rivas: 22 (1 showed no expression; 10 untested)

Conclusions: Sensitivity of QRNA is respectable;

most E. coli ncRNAs conserve secondary structure

Only 4/11 of our confirmed ncRNAs are among the Altuvia or Gottesman genes

Conclusions: These screens have not saturated E. coli for new ncRNAs;

We have >200 other candidates in testing;

We have confirmed transcripts as short as 40 nt;

The functions of these RNAs are unknown.

Pyrococcus: three hyperthermophile genomes

  • P. horikoshii

      • 1.8 Mb, complete

      • isolated off Okinawa, 1400 m depth

      • Kawarabayasi et al. (NITE, Tokyo)

  • P. furiosus

      • 1.9 Mb, complete

      • from Vulcano Island, Italy

      • Robb et al. (Utah Genome Center)

  • P. abyssi

      • 1.8 Mb, complete

      • from South Pacific vent, 3500 m depth

      • Genoscope (France)

A “black smoker” – deep sea hydrothermal vent

photo: American Museum of Natural History

G/C composition detects RNAs in Pyrococcus

RNAs stand out in AT-rich hyperthermophiles

Organism        growth temp (C)   % GC (genome)   % GC (RNA)   % GC difference (RNA minus genome)   % known RNAs detected
Archaeoglobus   83                48%             68%          20%                                  2%
S. cerevisiae   30                38%             54%          16%                                  0
E. coli         37                51%             59%          8%                                   0


The G/C computational screen

Robbie Klein et al., manuscript submitted

Implemented as a 2-state hidden Markov model, using Viterbi or posterior decoding algorithms.
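A minimal sketch of the idea: a two-state HMM with an AT-rich background state and a GC-rich state, decoded with Viterbi. The state names, the GC fractions, and the transition probability below are illustrative assumptions, not the parameters of the actual screen, and the function expects a non-empty uppercase DNA string.

```python
import math


def viterbi_gc(seq, gc_bg=0.4, gc_rna=0.7, stay=0.99):
    """Label each base 'B' (background) or 'R' (GC-rich) by the Viterbi path."""
    states = ("B", "R")
    gc = {"B": gc_bg, "R": gc_rna}

    def emit(state, base):
        # G and C share the state's GC fraction equally; A and T share the rest.
        p = gc[state] / 2 if base in "GC" else (1 - gc[state]) / 2
        return math.log(p)

    def trans(a, b):
        return math.log(stay if a == b else 1 - stay)

    score = {s: math.log(0.5) + emit(s, seq[0]) for s in states}
    back = []
    for base in seq[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + trans(r, s))
            score[s] = prev[best] + trans(best, s) + emit(s, base)
            ptr[s] = best
        back.append(ptr)

    # Trace back from the best final state.
    state = max(states, key=score.get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    genome = "AT" * 50 + "GC" * 50 + "AT" * 50
    print(viterbi_gc(genome))
```

Posterior decoding would replace the max over predecessor states with a sum (forward and backward passes) and report a per-base probability of the GC-rich state instead of a single best path.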

Methanococcus jannaschii: (Viterbi parse alone)

43 regions detected (some span multiple RNAs)

includes 36/37 tRNAs; SSU and LSU rRNA; 5S, 7S, RNase P.

9 unassigned candidates.

4/9 express small RNAs detectable on Northern.

Pyrococcus furiosus: (posterior decoding, plus conservation w. P.a., P.h.)

51 regions detected (some span multiple RNAs)

includes 46/46 tRNAs, SSU and LSU rRNA; 2 5S, 7S, and RNase P.

8 unassigned candidates.

4/8 express small RNAs detectable on Northern.

Pyrococcus genome comparisons

Comparison of G/C to QRNA screen

Robbie Klein et al., PNAS, in press

P. furiosus – screened by QRNA by comparison to P. horikoshii, P. abyssi

[Table: G/C screen vs. QRNA screen, with rows for candidate loci, known tRNAs detected (of 46), novel loci, and confirmed by Northern; the values are not preserved in this transcript]

  • Like the E. coli screen, about 25% of QRNA candidates were confirmed by Northern (again in a single growth condition only).

  • QRNA is detecting most novel structural RNA genes.

Archaeal RNA Northerns

human/mouse ncRNA detection

the cartilage-hair hypoplasia region:

QRNA is a general genefinder for structural ncRNA genes.

The ancient RNA World

Gesteland, Cech, Atkins: The RNA World, CSHL Press, 1999

RNA is very good at recognizing RNA

Ha, Wightman, Ruvkun; Genes Dev. 10:3041, 1996

A closing idea: The modern RNA world


When a cell needs a molecule that specifically recognizes a target RNA molecule, and the function is either:

- catalytically unsophisticated

- something that can be abstracted onto a shared protein (e.g. many guide snoRNAs, one methylase)

then RNA may be the material of choice. Specific RNA-binding proteins are big, expensive, and more difficult to evolve.

In fact, an old idea...

Jacob and Monod, JMB 3:318, 1961


  • There appear to be many noncoding RNA genes.

  • Methods to find homologous RNAs by structural similarity have been greatly improved, using stochastic context free grammar algorithms.

  • Methods to find novel RNAs by de novo genefinding have finally become possible, for instance by using comparative genome analysis.

[SR Eddy, Nature Reviews Genetics, 2:919, 2001]

[R Durbin et al., Biological Sequence Analysis, Cambridge U. Press 1998]

[E Rivas, RJ Klein, TA Jones, SR Eddy, Curr Biol 11:1369, 2001;

E Rivas, SR Eddy, BMC Bioinformatics, 2:8, 2001]


the Eddy lab:


senior scientist:

Elena Rivas


Goran Ceric


Ajay Khanna

wet lab:

Ziva Misulovin

secret agent man:

Tom Jones







Zhirong Bao

Christian Zmasek

Robin Dowell

Robbie Klein

Steve Johnson

Shawn Stricklin

John McCutcheon
