MicroRNA

# MicroRNA - PowerPoint PPT Presentation

MicroRNA. The Computational Challenge. Bioinformatics Seminar, March 9, 2005 By Yaron Levy. Tree of RNA Types. miRNA Biological Process. Micro RNA – Computational Approach. Problem 1: Finding putative microRNA from a sequence Horesh et al, using suffix trees data structure

Presentation Transcript
MicroRNA

The Computational Challenge

Bioinformatics Seminar, March 9, 2005

By Yaron Levy

Micro RNA – Computational Approach
• Problem 1: Finding putative microRNA from a sequence
• Horesh et al, using a suffix tree data structure
• Problem 2: Computing secondary structure of a given sequence
• Zuker & Stiegler, minimum free energy, using dynamic programming
• Problem 3: miRNA predicting algorithms
• Lim et al, MiRscan
• Problem 4: Predicting miRNA target genes
• Lewis et al, TargetScan

Problem 1

Find these

Problem 1: Finding putative microRNA from a sequence
• A naïve idea: slide a “window” of size L over the sequence of size N, looking for stems of size S.
• Computationally O(NL+NS) – too much
• A better approach – using a suffix tree.

What is a suffix tree?

S = M A L A Y A L A M \$

1 2 3 4 5 6 7 8 9 10

[Figure: suffix tree of S. Each edge carries a substring of S, and each of the 10 leaves is numbered with the start position of the suffix it represents.]

Suffix tree properties
• For a string S of length n, there are n leaves and at most n internal nodes.
• The tree therefore requires only linear space
• Each leaf represents a unique suffix.
• Concatenation of edge labels from root to a leaf spells out the suffix.
• Each internal node represents a distinct common prefix to at least two suffixes.
Finding a (short) Pattern in a (long) String
• Build a suffix tree of the string.
• Starting from the root, traverse a path matching characters of the pattern.
• If stuck, pattern not present in string.

Otherwise, each leaf below gives a position of the pattern in the string.
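
The search procedure above can be tried out with a minimal sketch. It assumes a naive suffix trie (one character per edge, built in quadratic time) instead of a true linear-time suffix tree, and it stores the suffix start positions at every node rather than collecting them from the leaves below. The names `build_suffix_trie` and `find_pattern` are illustrative and do not come from the talk.

```python
def build_suffix_trie(text):
    """Naive suffix trie as nested dicts, one character per edge.

    Every node records the 1-based start positions of all suffixes that
    pass through it. This costs O(n^2) time and space, unlike a real
    suffix tree, but the search below works the same way.
    """
    root = {"children": {}, "positions": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(
                ch, {"children": {}, "positions": []}
            )
            node["positions"].append(start + 1)
    return root


def find_pattern(root, pattern):
    """Follow the pattern down from the root; return the match positions."""
    node = root
    for ch in pattern:
        if ch not in node["children"]:
            return []          # stuck: the pattern is not in the string
        node = node["children"][ch]
    return sorted(node["positions"])


trie = build_suffix_trie("MALAYALAM$")
print(find_pattern(trie, "ALA"))   # [2, 6]
print(find_pattern(trie, "LAM"))   # [7]
print(find_pattern(trie, "XYZ"))   # []
```

Once the structure is built, the search time depends on the pattern length and the number of matches, not on the length of the string.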

Finding a Pattern in a String

Find “ALA”

[Figure: the suffix tree of MALAYALAM\$, with the path spelling “ALA” traced from the root.]

Two matches, at positions 6 and 2 (the leaves below the matched path)

Generalized Suffix Tree

WINDOW\$ INDIGO\$

1234567 1234567

[Figure: generalized suffix tree of the two strings. Each leaf is labeled (string number, start position); for example, (2, 5) is the suffix GO\$ of the second string.]

Horesh et al – using a generalized suffix tree for finding putative microRNAs
• Assumptions:
• At least a triple repeat is necessary:
• 2 for the stems of the hairpin – close to each other in the sequence, and inverted repeats of each other
• The rest are target genes – can be anywhere
• The repeats must be fully matched – no mismatches are allowed
• This is more of a constraint of the method than a biological assumption
Horesh et al – the algorithm
• Construct a generalized suffix tree of the original sequence and the inverted repeat sequence.
• Preprocess the suffix tree for calculating:
• Length of suffixes
• Number of repeats
• Index of suffix in sequence
• With these attributes for each node, along with the indices of the suffixes in the sequence, it is possible to find the requested triple (or more) repeats.
• Computationally efficient: O(N)

1. Build a suffix tree

[Figure: suffix tree of “banana”, with edge labels a, na and banana.]

2. Scan the tree in a preorder traversal

(all parents are visited before their children)

The length of the prefix that a node represents is:

node.len = parent.len + length of the sequence fragment on the edge into the node

(the root has length 0)

[Figure: the suffix tree of “banana”, with its nodes annotated.]

3. Scan the tree in a postorder traversal

(all children are visited before their parents)

The number of repeats of the prefix that a node represents is:

node.repeats = sum of its children's repeats (a leaf counts as 1)

Now every node carries the length of the prefix it represents and the number of leaves below it (the number of suffixes that start with this prefix, i.e. its number of repeats).

4. Scan the tree again.

For every node that represents a prefix longer than SIZE (22, for example) and has two or more repeats, print its length, its number of repeats, and the indexes of its leaves.

All sections are done in linear time!
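
Steps 1 to 4 can be sketched on the “banana” example as follows. This is only an illustration under stated assumptions: the tree is built naively in quadratic time (not with a linear-time construction), a single recursive walk does both the preorder work (lengths) and the postorder work (repeat counts), and collecting leaf positions at every reported node is not linear either. The names `Node`, `build_suffix_tree`, `annotate`, `leaf_positions` and `report_repeats` are hypothetical.

```python
class Node:
    def __init__(self, label="", start=None):
        self.label = label      # edge label leading into this node
        self.children = {}      # first character of an edge -> child node
        self.start = start      # 1-based suffix start (leaves only)
        self.len = 0            # prefix length, set in step 2
        self.repeats = 0        # number of leaves below, set in step 3


def build_suffix_tree(text):
    """Step 1, done naively in O(n^2): insert the suffixes one by one."""
    text += "$"                 # unique terminator; assumes "$" is not in text
    root = Node()
    for i in range(len(text)):
        node, rest = root, text[i:]
        while True:
            child = node.children.get(rest[0])
            if child is None:
                node.children[rest[0]] = Node(rest, i + 1)   # new leaf
                break
            k = 0               # length of the match along this edge
            while k < min(len(child.label), len(rest)) and child.label[k] == rest[k]:
                k += 1
            if k < len(child.label):
                mid = Node(child.label[:k])      # split the edge at k
                node.children[rest[0]] = mid
                child.label = child.label[k:]
                mid.children[child.label[0]] = child
                child = mid
            node, rest = child, rest[k:]
    return root


def annotate(node, parent_len=0):
    """Steps 2 and 3 in one walk: len on the way down, repeats on the way up."""
    node.len = parent_len + len(node.label)
    for child in node.children.values():
        annotate(child, node.len)
    if node.children:
        node.repeats = sum(c.repeats for c in node.children.values())
    else:
        node.repeats = 1        # a leaf counts as one repeat


def leaf_positions(node):
    if not node.children:
        return [node.start]
    return [p for c in node.children.values() for p in leaf_positions(c)]


def report_repeats(node, size, prefix=""):
    """Step 4: yield (prefix, repeats, positions) for prefixes longer than size."""
    prefix += node.label
    if node.len > size and node.repeats >= 2:
        yield prefix, node.repeats, sorted(leaf_positions(node))
    for child in node.children.values():
        yield from report_repeats(child, size, prefix)


root = build_suffix_tree("banana")
annotate(root)
for prefix, repeats, positions in sorted(report_repeats(root, size=1)):
    print(prefix, len(prefix), repeats, positions)
# ana 3 2 [2, 4]
# na 2 2 [3, 5]
```

For the miRNA search the same walk would run on the generalized tree of the sequence and its inverted repeat, with SIZE set to about 22 as in step 4.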

Problem 1 conclusions
• The problem is not trivial!
• Suffix trees are an elegant solution, provided that:
• No mismatches are allowed (not really biologically realistic)
• There is enough memory to store the large data structure

Problem 2

How do these fold?

Problem 2: Computing secondary structure of a given sequence
• Approaches to RNA secondary structure prediction:
• comparative sequence analysis
• prediction from base sequence
• find minimum free energy (MFE) structure
Free energy model
• free energy of structure (at fixed temperature, ionic concentration) = sum of loop energies
• standard model uses experimentally determined thermodynamic parameters where available; extrapolations for long loops
On the MFE approach
• MFE approach ignores folding pathway, metal ions, nonstandard bonds
• “some species can remain kinetically trapped in nonequilibrium states… we expect that most RNA’s exist naturally in their thermodynamically most stable configurations” –Tinoco and Bustamante, J. Mol. Biol. 1999.
Why is MFE secondary structure prediction hard?
• MFE structure can be found by calculating free energy of all possible structures
• but, number of potential structures grows exponentially with the number, n, of bases
• structures can be arbitrarily complex
• success for restricted classes of structures
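
To get a feel for the exponential growth, the sketch below counts pseudoknot-free structures under strong simplifying assumptions: any two positions may pair, and no minimum hairpin loop length is enforced. It therefore counts the Motzkin numbers, which over-count real RNA structures, but the growth pattern is the point. The function name `count_structures` is illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_structures(n):
    """Number of non-crossing pairings (including the empty one) on n bases."""
    if n <= 1:
        return 1
    # Case 1: the last base is unpaired.
    total = count_structures(n - 1)
    # Case 2: the last base pairs with an earlier base, leaving k bases
    # to its left and n - 2 - k bases enclosed by the pair.
    for k in range(n - 1):
        total += count_structures(k) * count_structures(n - 2 - k)
    return total


for n in (5, 10):
    print(n, count_structures(n))
# 5 21
# 10 2188
```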
Predicting MFE pseudoknot free structures
• Dynamic programming avoids explicit enumeration of all pseudoknot free structures (Zuker & Stiegler 1981)
• Suboptimal folds, probabilities of base pairings can also be calculated
• software: mfold, Vienna package
Dynamic programming (Zuker & Stiegler)
• Based on the “more is less” principle: by calculating more than you need, less work is needed overall
• Construct MFE structure for whole strand from MFE structures for substrands
• Running time is O(n^3)
RNA folding with dynamic programming
• Assume a function W(i,j), the MFE for the subsequence starting at i and ending at j (i<j)
• Define sigma as the energy function for the simple cases, in which, for example, a base pair scores lower than a non-pair
• Consider 4 recursion possibilities:
• i,j are a base pair, added to the structure for i+1..j-1
• Define this as V(i,j)
• i is unpaired, added to the structure for i+1..j
• j is unpaired, added to the structure for i..j-1
• i,j are paired, but not to each other; the structure for i..j combines the sub-structures of 2 sub-sequences, i..k and k+1..j: a bifurcation (i<k<j)
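
The four cases can be sketched as a small dynamic program. This is not the full Zuker & Stiegler energy model: `sigma` is assumed to be -1 for any Watson-Crick or G-U pair and 0 otherwise, and a minimum hairpin loop of 3 unpaired bases (`min_loop`) is an added assumption. With this toy scoring the program simply maximizes the number of base pairs, so it shows the mechanics of the recursion only.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def sigma(a, b):
    """Toy score: a base pair scores lower (better) than a non-pair."""
    return -1 if (a, b) in PAIRS else 0


def fold_score(seq, min_loop=3):
    """Minimum total score W(0, n-1) over pseudoknot-free structures."""
    n = len(seq)
    if n == 0:
        return 0
    # W[i][j] is the best score for seq[i..j]; entries with i >= j stay 0.
    W = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            # i unpaired, or j unpaired
            best = min(W[i + 1][j], W[i][j - 1])
            # i and j paired with each other: the V(i,j) case
            if span > min_loop and sigma(seq[i], seq[j]) < 0:
                best = min(best, W[i + 1][j - 1] + sigma(seq[i], seq[j]))
            # bifurcation into i..k and k+1..j, with i < k < j
            for k in range(i + 1, j):
                best = min(best, W[i][k] + W[k + 1][j])
            W[i][j] = best
    return W[0][n - 1]


print(fold_score("GGGAAAUCC"))   # -3
print(fold_score("GAAAC"))       # -1
print(fold_score("AAAA"))        # 0
```

The three nested loops over i, j and k give the O(n^3) running time.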

[Figure: V(i,j), the substrand from i to j with bases i and j paired to each other.]

Dynamic programming (Zuker and Stiegler)
• W(i,j): MFE structure of substrand from i to j
• V(i,j): MFE structure of substrand from i to j, in which i-th and j-th bases are paired

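[Figure: W(i,j), the substrand from i to j.]

Recurrences

$$
W(i,j) = \min
\begin{cases}
W(i+1,j) \\
W(i,j-1) \\
V(i,j) \\
\min_{i<k<j} \left[ W(i,k) + W(k+1,j) \right]
\end{cases}
$$

[Figure: V(i,j) is the minimum over three cases: a hairpin loop closed by the pair (i,j); a stack, bulge or interior loop closed by (i,j) and an inner pair (k,l), with i<k<l<j; and a multiloop that splits into substructures on i+1..k and k+1..j-1.]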

What is actually being done?
• Simple base pair maximization is a poor scoring scheme for RNA structure prediction.
• It is more plausible that an RNA adopts a globally minimum energy structure, not the structure with the maximum number of base pairs.
• The thermodynamic model was developed in conjunction with the DP algorithms
• The independence assumptions in the thermodynamic model's terms were made compatible with the independence assumptions that recursive dynamic programming algorithms need in order to work.
• Energy minimization algorithms become somewhat complex, with more detailed recursions that distinguish different lengths and types of loops, and which score base pairs according to nearest-neighbor stacking interactions with adjacent base pairs.
• Nonetheless, the mechanics of the algorithm are pretty much the same
Problem 2 conclusions
• RNA secondary structure finding is a hard problem – exponential number of possibilities
• Several heuristics claim to achieve relatively good success rates
• Specifically, MFE based algorithms are believed to be ~70% accurate on structures without pseudoknots.

Problem 3

How to predict these?

Problem 3: miRNA predicting algorithms
• Lim et al. developed a machine learning tool called MiRscan to help identify new miRNA genes
• This program looks at hairpin sequences conserved between species (C. elegans and C. briggsae)
• The program is given a training set of known miRNAs in C. elegans
• This data is then used to identify which conserved hairpin sequences are most similar to the training data.
MiRscan Algorithm
• The MiRscan algorithm examines several features of the hairpin
• The total score is computed by summing the scores of the individual features
• The score for each feature is computed by dividing the frequency of the given value in the training set by its overall frequency
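
The scoring scheme can be sketched as below. Two things are assumptions here and not taken from the slides: the sketch takes the base-2 logarithm of each frequency ratio, so that summing feature scores corresponds to multiplying the ratios, and the feature names and frequency values are invented purely for illustration.

```python
import math


def feature_score(value, training_freq, background_freq):
    """Log of (frequency in the training set / overall frequency)."""
    return math.log2(training_freq[value] / background_freq[value])


def total_score(hairpin, training, background):
    """Sum of the per-feature scores of one candidate hairpin."""
    return sum(
        feature_score(hairpin[name], training[name], background[name])
        for name in hairpin
    )


# Invented toy tables: how often each feature value occurs among the
# known miRNAs (training) and among all conserved hairpins (background).
training = {
    "feature_a": {"high": 0.5, "low": 0.5},
    "feature_b": {"high": 0.75, "low": 0.25},
}
background = {
    "feature_a": {"high": 0.125, "low": 0.875},
    "feature_b": {"high": 0.5, "low": 0.5},
}

candidate = {"feature_a": "high", "feature_b": "low"}
print(total_score(candidate, training, background))   # 1.0
```

A value that is enriched among known miRNAs contributes a positive score, and a depleted value contributes a negative one.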
MiRscan – Relative importance of hairpin features
• Certain features were found to be more useful than others in distinguishing miRNAs
MiRscan – Testing the algorithm
• In order to test their algorithm, Lim et al. ran MiRscan on the ~36,000 conserved hairpins in the C. elegans and C. briggsae genomes
• The 50 known miRNA genes conserved between C. elegans and C. briggsae were used as a training set
• 35 sequences received a MiRscan score greater than the mean score of the known genes
• These sequences were given special attention in the experimental portion of this research
MiRscan – Results example

Flanking sequence of control and real matches in the UTRs.

Problem 3 conclusions
• Predicting miRNA genes is a hot subject!
• Algorithms use machine learning techniques to predict genes
• Candidate genes can be biologically verified to be miRNA genes. Although this process may be slow, it gives feedback and allows refinement of techniques and better predictions
• Hundreds (thousands?) of new miRNA genes are suspected to be found in the (near?) future!
• Commercial companies are performing these kinds of processes for money…

Problem 4

What are the targets these bind to?

Problem 4: Predicting target genes
• Mammals/vertebrates
• Lots of known miRNAs
• Mostly unknown target genes
• Initial method outline
• Look at conserved miRNAs
• Look for conserved target sites
miRNAs in animals
• 0.5%-1.0% of predicted genes encode miRNA (!!)
• One of the more abundant regulatory classes
• Tissue-specific or developmental stage-specific expression
• High evolutionary conservation
TargetScan Algorithm by Lewis et al 2003

The Goal – a ranked list of candidate target genes

• Stage 1: Search UTRs in one organism
• Bases 2-8 from miRNA = “miRNA seed”
• Perfect Watson-Crick complementarity
• No wobble pairs (G-U)
• 7nt matches = “seed matches”
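
Stage 1 can be sketched as a plain string search. The sketch assumes RNA sequences written 5' to 3' over the letters A, C, G and U; the UTR string is made up for the example, and the function names are illustrative and are not TargetScan's own code.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Watson-Crick reverse complement (no G-U wobble pairs)."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def seed_matches(mirna, utr):
    """0-based UTR positions of perfect 7nt matches to miRNA bases 2-8."""
    seed = mirna[1:8]                   # bases 2-8 of the miRNA
    site = reverse_complement(seed)     # what the UTR must contain
    hits = []
    pos = utr.find(site)
    while pos != -1:
        hits.append(pos)
        pos = utr.find(site, pos + 1)
    return hits


mirna = "UGAGGUAGUAGGUUGUAUAGUU"        # example miRNA input
utr = "AAGCUACCUCAAUUUGGCUACCUCA"       # made-up UTR fragment
print(reverse_complement(mirna[1:8]))   # CUACCUC
print(seed_matches(mirna, utr))         # [3, 17]
```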
TargetScan Algorithm
• Stage 2: Extend seed matches
• Allow G-U (wobble) pairs
• Both directions
• Stop at mismatches
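
Stage 2 can be sketched as follows, assuming the seed match position is already known. Since the miRNA and the UTR pair antiparallel, UTR index u pairs with miRNA index t whenever u + t equals the site start plus the seed length. The function names and the made-up UTR are illustrative only.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def can_pair(a, b):
    """Watson-Crick or G-U wobble pair."""
    return (a, b) in WATSON_CRICK or (a, b) in WOBBLE


def extend_seed_match(mirna, utr, site_start, seed_len=7):
    """Extend a seed match in both directions until the first mismatch.

    site_start is the 0-based UTR index where the site complementary to
    miRNA bases 2-8 begins. Returns the half-open UTR interval (lo, hi)
    covered by the extended duplex.
    """
    total = site_start + seed_len       # u + t is constant along the duplex
    lo, hi = site_start, site_start + seed_len
    # Towards the UTR 5' end, i.e. towards the miRNA 3' end.
    while (lo > 0 and total - (lo - 1) < len(mirna)
           and can_pair(mirna[total - (lo - 1)], utr[lo - 1])):
        lo -= 1
    # Towards the UTR 3' end, i.e. towards the miRNA 5' end.
    while (hi < len(utr) and total - hi >= 0
           and can_pair(mirna[total - hi], utr[hi])):
        hi += 1
    return lo, hi


mirna = "UGAGGUAGUAGGUUGUAUAGUU"        # example miRNA input
utr = "AACGCUACCUCAGG"                  # made-up UTR, seed match at index 4
print(extend_seed_match(mirna, utr, 4))   # (3, 12)
```

In this example the match grows by one G-U pair on the 5' side of the UTR and by one A-U pair on the 3' side, then stops at a mismatch and at the miRNA's first base respectively.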
TargetScan Algorithm
• Stage 3: Optimize basepairing
• Remaining 3’ region of miRNA
• 35 bases of UTR 5’ to each seed match
• RNAfold program (Hofacker et al 1994)

TargetScan Algorithm

• Stage 4: Folding free energy (ΔG) assigned to each putative miRNA:target interaction
• Assign rank to each UTR
• Repeat this process for each of the other organisms with UTR datasets
TargetScan - Results for mammals
• Database of 79 miRNAs searched against human, mouse, and rat orthologous 3’ UTRs
• 451 miRNA:target interactions predicted for 400 unique genes
• Average 5.7 targets per miRNA
• Signal:noise ratio of 3.2:1
TargetScan - Biological relevance
• Hypothesis: 5’ conservation of miRNAs is important for mRNA target recognition
• Highest signal:noise ratio observed when seed positioned close to 5’ end
• Hypothesis: highly conserved miRNAs are more involved in regulation
• High degree of conservation -> more predicted targets
• Membership in large miRNA family -> more predicted targets
miRNA Targets - Not the end of the story…

Many programs claim to discover miRNA targets in mammals:

• miRanda - Enright et al, SKI
• DIANA-MicroT - Hatzigeorgiou et al, UPenn
• rna22 - Rigoutsos et al, IBM
• PicTar - Rajewsky et al, NYU
Problem 4 conclusion: algorithms comparison
• NYAS Competition (Feb 17, 2005)
• Task: given 2 miRNAs, find mammalian targets
• Widely differing results, from 1 target to ~500 targets!
• Very little overlap
• So who’s “right”?
• Currently correct targets unknown…