
MicroRNA - PowerPoint PPT Presentation



MicroRNA. The Computational Challenge. Bioinformatics Seminar, March 9, 2005 By Yaron Levy. Tree of RNA Types. miRNA Biological Process. Micro RNA – Computational Approach. Problem 1: Finding putative microRNA from a sequence Horesh et al, using suffix trees data structure

Presentation Transcript
MicroRNA

The Computational Challenge

Bioinformatics Seminar, March 9, 2005

By Yaron Levy




Micro RNA – Computational Approach

  • Problem 1: Finding putative microRNA from a sequence

    • Horesh et al, using a suffix tree data structure

  • Problem 2: Computing secondary structure of a given sequence

    • Zuker & Stiegler, minimum free energy, using dynamic programming

  • Problem 3: miRNA predicting algorithms

    • Lim et al, MiRscan

  • Problem 4: Predicting miRNA target genes

    • Lewis et al, TargetScan


Problem 1

Find these


Problem 1: Finding putative microRNA from a sequence

  • A naïve idea: slide a “window” of size L over the sequence of size N, looking for stems of size S.

    • Computationally O(NL+NS) – too much

  • A better approach – using a suffix tree.


What is a suffix tree?

S = M A L A Y A L A M $

1 2 3 4 5 6 7 8 9 10

[Figure: suffix tree of S. Edges carry labels such as A, M, LA, YALAM$, M$, ALAYALAM$ and $; each of the 10 leaves is numbered with the start position of its suffix.]


Suffix tree properties

  • For a string S of length n, there are n leaves and at most n internal nodes.

    • therefore requires only linear space

  • Each leaf represents a unique suffix.

  • Concatenation of edge labels from root to a leaf spells out the suffix.

  • Each internal node represents a distinct common prefix to at least two suffixes.
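The properties above can be checked on the MALAYALAM$ example with a minimal sketch. It builds an uncompressed suffix trie in quadratic time, which is a simplification assumed here and not the linear-time construction a real suffix tree uses. The function names are illustrative. Branching nodes of the trie correspond to the internal nodes of the suffix tree.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie (naive, O(n^2))."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1  # 1-based start position, as on the slide
    return root


def count_nodes(node):
    """Return (leaves, branching nodes) in the subtree rooted at node."""
    if "leaf" in node:
        return 1, 0
    leaves = 0
    branching = 1 if len(node) >= 2 else 0
    for child in node.values():
        child_leaves, child_branching = count_nodes(child)
        leaves += child_leaves
        branching += child_branching
    return leaves, branching


leaves, internal = count_nodes(build_suffix_trie("MALAYALAM$"))
print(leaves, internal)
```

For MALAYALAM$ this gives 10 leaves and 5 branching nodes (the root plus the nodes for M, A, LA and ALA), within the bound of n internal nodes.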


Finding a (short) Pattern in a (long) String

  • Build a suffix tree of the string.

  • Starting from the root, traverse a path matching characters of the pattern.

  • If the traversal gets stuck, the pattern is not present in the string.

    Otherwise, each leaf below the point reached gives a position of the pattern in the string.
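The search procedure above can be sketched as follows. This is a hedged illustration that uses a naive suffix trie (built in quadratic time) in place of a true suffix tree; the traversal logic is the same. The names `build_suffix_trie`, `leaves_below` and `find_pattern` are made up for this example.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie (naive, O(n^2))."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1  # 1-based start position of this suffix
    return root


def leaves_below(node):
    """Collect the suffix start positions stored under node."""
    if "leaf" in node:
        return [node["leaf"]]
    positions = []
    for child in node.values():
        positions.extend(leaves_below(child))
    return positions


def find_pattern(trie, pattern):
    """Walk down from the root; an empty list means the walk got stuck."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    return sorted(leaves_below(node))


trie = build_suffix_trie("MALAYALAM$")
print(find_pattern(trie, "ALA"))
```

Once the tree exists, the walk costs time proportional to the pattern length plus the number of matches, independent of the text length.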


Finding a Pattern in a String

Find “ALA”

[Figure: the same suffix tree of MALAYALAM$; the search for ALA ends at an internal node whose two leaves are 6 and 2.]

Two matches: at positions 6 and 2


Generalized Suffix Tree

WINDOW$ INDIGO$

1234567 1234567

[Figure: generalized suffix tree of the two strings. Each leaf is labelled (string number, start position), for example (1, 1) for WINDOW$ and (2, 1) for INDIGO$.]


Horesh et al – using a generalized suffix tree for finding putative microRNAs

  • Assumptions:

    • At least a triple repeat is necessary:

      • 2 for the stems of the hairpin: close to each other in the sequence, and inverted repeats of each other

      • The rest are in target genes and can be anywhere

    • The repeats must match exactly; no mismatches are allowed

      • This is a limitation of the method rather than a biological assumption


Horesh et al – the algorithm

  • Construct a generalized suffix tree of the original sequence and the inverted repeat sequence.

  • Preprocess the suffix tree for calculating:

    • Length of suffixes

    • Number of repeats

    • Index of suffix in sequence

  • With these attributes for each node, along with the indices of the suffixes in the sequence, it is possible to find the requested triple (or more) repeats.

    • Computationally efficient O(N)
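To make the role of the inverted repeat concrete, here is a small sketch. It does not build a generalized suffix tree; as a stand-in it uses a dictionary of fixed-length k-mers, which finds the same exact matches for one chosen length k. The helper names, the parameters `k` and `max_span`, and the toy sequence are all assumptions for illustration.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Inverted repeat of an RNA string: complement every base, then reverse."""
    return rna.translate(COMPLEMENT)[::-1]


def stem_candidates(rna, k, max_span):
    """Return (i, j) pairs of 0-based starts where rna[i:i+k] can pair with
    rna[j:j+k] as the two arms of a hairpin stem (exact matches only)."""
    index = defaultdict(list)
    for i in range(len(rna) - k + 1):
        index[rna[i:i + k]].append(i)
    hits = []
    for j in range(len(rna) - k + 1):
        partner = reverse_complement(rna[j:j + k])
        for i in index.get(partner, []):
            # arms must not overlap and must lie close together
            if i + k <= j and j + k - i <= max_span:
                hits.append((i, j))
    return hits


toy = "AAGGGCAAAAGCCCUU"  # made-up sequence with a short hairpin
print(stem_candidates(toy, k=4, max_span=20))
```

A suffix tree over the sequence and its inverted repeat generalizes this to all repeat lengths at once, which is what makes the linear-time claim possible.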


1. Build a suffix tree

[Figure: suffix tree of “banana”; each node is annotated with the length of the prefix it represents (the root is 0).]

2. Scan the tree in a preorder traversal

(all parents are visited before their children)

The length of the prefix a node represents is:

node.len = parent.len + length of the sequence fragment the node carries

(root is 0)


3. Scan the tree in a postorder traversal

(all children are visited before their parents)

[Figure: the same suffix tree of “banana”; each node is annotated with the number of leaves below it.]

The number of repeats of the prefix a node represents is:

node.repeats = sum of the repeats of its children (a leaf is 1)


Now every node carries the length of the prefix it represents and the number of leaves below it (the number of repeats that share this prefix).

4. Scan the tree again.

For every node that represents a prefix longer than SIZE (22, for example) and has two or more repeats, print its length, its number of repeats and the indexes of its leaves.

[Figure: example on the “banana” suffix tree, highlighting the repeated prefixes “a” and “na” and the indexes of their leaves.]

All steps are done in linear time!
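Steps 1 to 4 can be sketched on the “banana” example. This is an illustration under stated assumptions, not the authors' implementation: it uses a naive suffix trie, so every edge has length 1 and node.len is simply parent.len + 1; a single recursive pass sets len on the way down (preorder) and repeats on the way up (postorder); and only branching nodes are reported, since those are the nodes a suffix tree would contain. The class and function names and the `min_len` threshold are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.leaf = None   # 1-based suffix start, set on leaves only
        self.len = 0       # length of the prefix this node represents
        self.repeats = 0   # number of leaves below this node


def build_suffix_trie(text):
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
        node.leaf = start + 1
    return root


def annotate(node, depth=0):
    node.len = depth                      # preorder: before the children
    if node.leaf is not None:
        node.repeats = 1
    else:                                 # postorder: after the children
        node.repeats = sum(annotate(c, depth + 1) for c in node.children.values())
    return node.repeats


def leaf_positions(node):
    if node.leaf is not None:
        return [node.leaf]
    return sorted(p for c in node.children.values() for p in leaf_positions(c))


def report(node, min_len, min_repeats=2, prefix=""):
    """List (prefix, length, repeats, positions) for qualifying branching nodes."""
    found = []
    if len(node.children) >= 2 and node.len >= min_len and node.repeats >= min_repeats:
        found.append((prefix, node.len, node.repeats, leaf_positions(node)))
    for ch, child in sorted(node.children.items()):
        found.extend(report(child, min_len, min_repeats, prefix + ch))
    return found


root = build_suffix_trie("banana$")
annotate(root)
print(report(root, min_len=2))
```

With min_len=2 the sketch reports “ana” (positions 2 and 4) and “na” (positions 3 and 5). For microRNA candidates the threshold would be about 22, and the tree would cover both the sequence and its inverted repeat.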


Problem 1 conclusions

  • The problem is not trivial!

  • Suffix trees are an elegant solution, provided that:

    • No mismatches are allowed (not really biologically realistic)

    • There is enough memory to store the large data structure


Problem 2

How do these fold?


Problem 2: Computing secondary structure of a given sequence

  • Approaches to RNA secondary structure prediction:

    • comparative sequence analysis

    • prediction from base sequence

      • find minimum free energy (MFE) structure


Free energy model

  • free energy of structure (at fixed temperature, ionic concentration) = sum of loop energies

  • standard model uses experimentally determined thermodynamic parameters where available; extrapolations for long loops




On the MFE approach

  • MFE approach ignores folding pathway, metal ions, nonstandard bonds

  • “some species can remain kinetically trapped in nonequilibrium states… we expect that most RNA’s exist naturally in their thermodynamically most stable configurations” –Tinoco and Bustamante, J. Mol. Biol. 1999.


Why is MFE secondary structure prediction hard?

  • In principle, the MFE structure can be found by calculating the free energy of all possible structures

  • but the number of potential structures grows exponentially with the number of bases, n

  • structures can be arbitrarily complex

  • efficient algorithms have succeeded only for restricted classes of structures
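The exponential growth can be made tangible by counting pseudoknot-free structures. The sketch below assumes that any two bases may pair and that a hairpin loop needs at least `min_loop` unpaired bases; both are simplifying assumptions, and the function name is illustrative. The last base is either unpaired, or paired with some position p, which splits the strand into an outer part and an enclosed part.

```python
from functools import lru_cache


def count_structures(n, min_loop=3):
    """Count pseudoknot-free secondary structures on n bases."""

    @lru_cache(maxsize=None)
    def count(m):
        if m <= min_loop + 1:
            return 1                      # too short to form any pair
        total = count(m - 1)              # last base unpaired
        for p in range(1, m - min_loop):  # last base pairs with base p
            total += count(p - 1) * count(m - p - 1)
        return total

    return count(n)


for n in (10, 20, 30, 40):
    print(n, count_structures(n))
```

Even for short strands the count is far too large to enumerate, which is why the dynamic programming formulation matters.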


Predicting MFE pseudoknot free structures

  • Dynamic programming avoids explicit enumeration of all pseudoknot free structures (Zuker & Stiegler 1981)

  • Suboptimal folds, probabilities of base pairings can also be calculated

  • software: mfold, Vienna package


Dynamic programming (Zuker & Stiegler)

  • Based on the “more is less” principle: by calculating more than you need, less work is needed overall

  • Construct MFE structure for whole strand from MFE structures for substrands

  • Running time is O(n^3)


RNA folding with dynamic programming

  • Assume a function W(i,j) which is the MFE for the sequence starting at i and ending at j (i<j)

  • Define sigma as the energy function for the simple cases, where, for example, a base pair scores lower than a non-pair

  • Consider 4 recursion possibilities:

    • i,j are a base pair, added to the structure for i+1..j-1

      • Define this as V(i,j)

    • i is unpaired, added to the structure for i+1..j

    • j is unpaired, added to the structure for i..j-1

    • i,j are both paired, but not to each other; the structure for i..j adds together sub-structures for 2 sub-sequences, i..k and k+1..j: a bifurcation (i<k<j)
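The four cases can be sketched with a deliberately simple sigma. The code below assumes sigma is -1 for a Watson-Crick or G-U pair and 0 otherwise, and that a hairpin loop needs at least `min_loop` unpaired bases; these are illustrative choices, not the thermodynamic parameters used by real folding programs. The three nested loops give the O(n^3) running time.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def fold_energy(seq, min_loop=3):
    """Return W(0, n-1) for a toy energy model (0-based indices)."""
    n = len(seq)
    if n == 0:
        return 0

    def sigma(i, j):
        return -1 if (seq[i], seq[j]) in PAIRS else 0

    # W[i][j] stays 0 for spans too short to hold a hairpin loop
    W = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = min(
                W[i + 1][j - 1] + sigma(i, j),  # i and j pair with each other
                W[i + 1][j],                    # i is unpaired
                W[i][j - 1],                    # j is unpaired
            )
            for k in range(i + 1, j):           # bifurcation at k
                best = min(best, W[i][k] + W[k + 1][j])
            W[i][j] = best
    return W[0][n - 1]


print(fold_energy("GGGAAAUCC"))
print(fold_energy("GGGAAACCCGGGAAACCC"))
```

The first toy strand folds into one hairpin with three pairs; the second needs the bifurcation case to combine two hairpins.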


Dynamic programming (Zuker and Stiegler)

  • W(i,j): MFE structure of substrand from i to j

  • V(i,j): MFE structure of substrand from i to j, in which i-th and j-th bases are paired

[Figure: schematic of W(i,j), an arbitrary structure on the substrand from i to j, and of V(i,j), the same substrand closed by the base pair between i and j.]


Recurrences

W(i,j) = \min \{ V(i,j),\ \min_{i<k<j} [ W(i,k) + W(k+1,j) ] \}

[Figure: W(i,j) drawn either as the paired structure V(i,j) or as two adjacent substructures W(i,k) and W(k+1,j).]


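Recurrences

[Figure: V(i,j) as the minimum over three cases: the pair i,j closes a hairpin loop; the pair i,j closes a loop whose inner closing pair is k,l (i<k<l<j); the pair i,j encloses a bifurcation into the substrands i+1..k and k+1..j-1.]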


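Recurrences

[Figure: both recurrences shown together: W(i,j) as the minimum over V(i,j) and the bifurcations W(i,k) + W(k+1,j), and V(i,j) as the minimum over the hairpin, inner pair k,l, and bifurcation (i+1..k, k+1..j-1) cases.]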


What is actually being done?

  • Simple base pair maximization is a poor scoring scheme for RNA structure prediction.

  • It is more plausible that an RNA adopts a globally minimum energy structure, not the structure with the maximum number of base pairs.

  • The thermodynamic model was developed in conjunction with the development of DP

    • the independence assumptions in the thermodynamic model's terms have been made compatible with the independence assumptions needed for recursive dynamic programming algorithms to work.

  • Energy minimization algorithms become somewhat complex, with more detailed recursions that distinguish different lengths and types of loops, and which score base pairs according to nearest-neighbor stacking interactions with adjacent base pairs.

  • Nonetheless, the mechanics of the algorithm are pretty much the same


Problem 2 conclusions

  • RNA secondary structure finding is a hard problem – exponential number of possibilities

  • Several heuristics claim to achieve relatively good success rates

    • Specifically, MFE based algorithms are believed to be ~70% accurate on structures without pseudoknots.


Problem 3

How to predict these?


Problem 3: miRNA predicting algorithms

  • Lim et al. developed a machine learning tool called MiRscan to help identify new miRNA genes

  • This program looks at hairpin sequences conserved between species (C. elegans and C. briggsae)

  • The program is given a training set of known miRNAs in C. elegans

  • This data is then used to identify which conserved hairpin sequences are most similar to the training data.


MiRscan Algorithm

  • The MiRscan algorithm examines several features of the hairpin

  • The total score is computed by summing the scores of the individual features

  • The score for each feature is computed by dividing the frequency of the given value in the training set by its overall frequency
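The scoring scheme can be sketched as below. The sketch takes the base-2 logarithm of the frequency ratio so that summing feature scores corresponds to multiplying ratios; the logarithm is an assumption made here, since the slide only mentions the division. The feature names and all frequencies are invented toy values, not MiRscan's real features or data.

```python
import math


def feature_score(value, train_freq, background_freq):
    """Log-ratio of how often a value is seen in known miRNAs vs. overall."""
    return math.log2(train_freq[value] / background_freq[value])


def total_score(hairpin, train, background):
    """Sum the per-feature scores of one candidate hairpin."""
    return sum(
        feature_score(value, train[name], background[name])
        for name, value in hairpin.items()
    )


# Toy frequency tables (hypothetical features and numbers)
train = {
    "loop_size": {"small": 0.8, "large": 0.2},
    "stem_conserved": {True: 0.5, False: 0.5},
}
background = {
    "loop_size": {"small": 0.4, "large": 0.6},
    "stem_conserved": {True: 0.125, False: 0.875},
}

candidate = {"loop_size": "small", "stem_conserved": True}
print(total_score(candidate, train, background))
```

A value that is enriched among known miRNAs contributes a positive score, and a value that is depleted contributes a negative one.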


MiRscan – Relative importance of hairpin features

  • Certain features were found to be more useful than others in distinguishing miRNAs


MiRscan – Testing the algorithm

  • In order to test their algorithm, Lim et al. ran MiRscan on the ~36,000 conserved hairpins in the C. elegans and C. briggsae genomes

  • The 50 known miRNA genes conserved between C. elegans and C. briggsae were used as a training set

  • 35 sequences received a MiRscan score greater than the mean score of the known genes

  • These sequences were given special attention in the experimental portion of this research


MiRscan – Results


MiRscan – Results example

Flanking sequence of control and real matches in the UTRs.


Problem 3 conclusions

  • Predicting miRNA genes is a hot subject!

    • Algorithms use machine learning techniques to predict genes

    • Candidate genes can be biologically verified to be miRNA genes. Although this process may be slow, it gives feedback and allows refinement of techniques and better predictions

    • Hundreds (thousands?) of new miRNA genes are expected to be found in the (near?) future!

    • Commercial companies are performing these kinds of processes for money…


Problem 4

What are the targets these bind to?


Problem 4: Predicting target genes

  • Mammals/vertebrates

    • Lots of known miRNAs

    • Mostly unknown target genes

  • Initial method outline

    • Look at conserved miRNAs

    • Look for conserved target sites


miRNAs in animals

  • 0.5%-1.0% of predicted genes encode miRNA (!!)

    • One of the more abundant regulatory classes

  • Tissue-specific or developmental stage-specific expression

  • High evolutionary conservation


TargetScan Algorithm by Lewis et al 2003

The Goal – a ranked list of candidate target genes

  • Stage 1: Search UTRs in one organism

    • Bases 2-8 from miRNA = “miRNA seed”

    • Perfect Watson-Crick complementarity

    • No wobble pairs (G-U)

    • 7nt matches = “seed matches”
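Stage 1 can be sketched as a plain string search. The seed is bases 2-8 of the miRNA, and a seed match is the stretch of UTR that is perfectly Watson-Crick complementary to it, read in the opposite direction. The function names and both input sequences are toy assumptions for illustration.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_match_site(mirna):
    """UTR sequence (5'->3') that pairs perfectly with miRNA bases 2-8."""
    seed = mirna[1:8]                      # bases 2-8, 0-based slice
    return seed.translate(COMPLEMENT)[::-1]


def find_seed_matches(mirna, utr):
    """Return 0-based start positions of all 7-nt seed matches in the UTR."""
    site = seed_match_site(mirna)
    hits = []
    pos = utr.find(site)
    while pos != -1:
        hits.append(pos)
        pos = utr.find(site, pos + 1)
    return hits


mirna = "UGAGGUAGUAGGUUGUAUAGUU"           # toy miRNA, 5'->3'
utr = "AAACUACCUCAAAGGGCUACCUCUU"          # toy 3' UTR, 5'->3'
print(seed_match_site(mirna), find_seed_matches(mirna, utr))
```

Because no wobble pairs are allowed at this stage, an exact substring search is enough.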


TargetScan Algorithm

  • Stage 2: Extend seed matches

    • Allow G-U (wobble) pairs

    • Both directions

    • Stop at mismatches
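A hedged sketch of the extension step follows. It assumes the seed match starts at UTR position `pos` (0-based), so that miRNA base m (0-based) faces UTR base pos + 7 - m in the antiparallel duplex. The function name, the return convention and the toy sequences are assumptions, not TargetScan's actual code.

```python
WOBBLE_OK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def extend_seed_match(mirna, utr, pos):
    """Extend the seed pairing in both directions, allowing G-U pairs.

    Returns (first, last): the 0-based miRNA indices of the contiguous
    paired stretch. The seed itself covers indices 1..7.
    """

    def pairs(m):
        u = pos + 7 - m                    # UTR base facing miRNA base m
        return (
            0 <= m < len(mirna)
            and 0 <= u < len(utr)
            and (mirna[m], utr[u]) in WOBBLE_OK
        )

    first, last = 1, 7
    while pairs(first - 1):                # toward the miRNA 5' end
        first -= 1
    while pairs(last + 1):                 # toward the miRNA 3' end
        last += 1
    return first, last


mirna = "UGAGGUAGUAGGUUGUAUAGUU"           # toy miRNA, 5'->3'
utr = "CCAUGCUACCUCACC"                    # toy UTR, seed match at position 5
print(extend_seed_match(mirna, utr, 5))
```

In the toy input the pairing grows by one base at the 5' end and by two bases (one of them a G-U wobble) toward the 3' end before hitting a mismatch.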


TargetScan Algorithm

  • Stage 3: Optimize basepairing

    • Remaining 3’ region of miRNA

    • 35 bases of UTR 5’ to each seed match

    • RNAfold program (Hofacker et al 1994)


TargetScan Algorithm

  • Stage 4: Folding free energy (G) assigned to each putative miRNA:target interaction

  • Assign rank to each UTR

  • Repeat this process for each of the other organisms with UTR datasets


TargetScan Program Flow


TargetScan - Results for mammals

  • Database of 79 miRNAs searched against human, mouse, and rat orthologous 3’ UTRs

  • 451 miRNA:target interactions predicted for 400 unique genes

  • Average 5.7 targets per miRNA

  • Signal:noise ratio of 3.2:1


TargetScan - Biological relevance

  • Hypothesis: 5’ conservation of miRNAs is important for mRNA target recognition

    • Highest signal:noise ratio observed when seed positioned close to 5’ end

  • Hypothesis: highly conserved miRNAs are more involved in regulation

    • High degree of conservation -> more predicted targets

    • Membership in large miRNA family -> more predicted targets


miRNA Targets - Not the end of the story…

Many programs claim to discover miRNA targets in mammals:

  • miRanda - Enright et al, SKI

  • DIANA-MicroT - Hatzigeorgiou et al, UPenn

  • rna22 - Rigoutsos et al, IBM

  • PicTar - Rajewsky et al, NYU


Problem 4 conclusion: algorithms comparison

  • NYAS Competition (Feb 17, 2005)

    • Task: given 2 miRNAs, find mammalian targets

  • Widely differing results, from 1 target to ~500 targets!

  • Very little overlap

  • So who’s “right”?

    • Currently correct targets unknown…


Thank You

