Prediction of > 3000 novel human microRNAs …

MicroRNA computational prediction pipeline

Martin Reczko

ICS/IMBB Bioinformatics Program

Biomedical Informatics Lab

Institute for Computer Science – FORTH


Rfam/miRBase 7.1 (October 2005)

ID    #miRNAs  name
-------------------------------------------
aga      42    A. gambiae (MOZ2)
ame      26    A. mellifera (AMEL2.0)
ath     117    A. thaliana (RefSeq entries)
cbr      82    C. briggsae (cb25.agp8)
cel     115    C. elegans (WormBase WS140)
cfa       6    C. familiaris (BROADD1)
dme      78    D. melanogaster (BDGP4)
dps      73    D. pseudoobscura (DPSE2.0)
dre     293    D. rerio (WTSI Zv5)
fru     130    F. rubripes (FUGU2.0)
gga     122    G. gallus (WASHUC1)
hsa     325    H. sapiens (NCBI35)
mmu     255    M. musculus (NCBIM34)
osa     123    O. sativa (TIGR 3.0)
ptr      67    P. troglodytes (CHIMP1)
rno     189    R. norvegicus (RGSC3.4)
tni     131    T. nigroviridis (TETRAODON7)
zma      95    Z. mays (TIGR AZM4)
ebv       5    Epstein Barr virus (EMBL:V01555.1)
hcmv      8    Human cytomegalovirus (Refseq:NC_001347.2)
kshv     11    Kaposi sarcoma associated herpesvirus (EMBL:U75698.1)
mghv      9    Mouse gammaherpesvirus 68 (EMBL:U97553.1)

Source: microrna.sanger.ac.uk

Used here: 227 miRNAs from miRBase 6.0


Negative examples: 3’ UTRs

~9 Mbases, from http://www.ensembl.org/BioMart/


Conservation: MultiZ alignments

11111111111111111111111111111111111111110111111111111111111101111111111111111110111111111111111111111111 0
11111011111111111111111111111111111111010111111111111111111111111111110111110110111111111111111111111111 1
11111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 2
11111101111111111111111111111111111111111101101011111111111111111111111111111110111111101111111111111111 3
11101001101101111111111111111111111110011111011011111111011011111111001111011100111111101111111111111111 4
11100001101101111111111111111111111110010100001011111111011001111111000111010000111111101111111111111111 5

Conservation rules: number of 1’s in the rows above >= 120, and at least one stretch of 12 consecutive 1’s
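The two conservation rules can be checked mechanically. The following is a minimal Python sketch, not the author's code. It assumes each row is passed as a plain string of 0/1 flags (without the trailing row index) and that the stretch of 12 must lie within a single row; the function name and both readings of the rule are assumptions.

```python
def passes_conservation(rows, min_ones=120, min_run=12):
    """Apply the two conservation rules to a list of 0/1 strings.

    Assumptions (not stated on the slide): the count of 1's is summed
    over all rows, and the required run of 1's lies within one row.
    """
    total_ones = sum(row.count("1") for row in rows)
    # Splitting on "0" leaves the runs of consecutive 1's in each row.
    longest_run = max(
        (len(run) for row in rows for run in row.split("0")),
        default=0,
    )
    return total_ones >= min_ones and longest_run >= min_run


if __name__ == "__main__":
    example = ["1" * 104, "1" * 20 + "0" * 84]
    print(passes_conservation(example))
```

The thresholds are keyword arguments so that other cut-offs can be tried without editing the function.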


Genome-wide prediction pipeline

Process windows of 104 nt along the genome:

  • 1. Fast filtering using composition and palindromes

  • 2. Comparative analysis with other genomes (BLASTZ)

  • 3. Approximate secondary structure prediction (stem-loop) using a novel dynamic programming algorithm

  • 4. Feature extraction and classification (SVMs)

  • 5. Filter conserved secondary structures
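Enumerating the 104-nt windows can be sketched as a simple generator. This is an illustration only: the step size between successive windows is not given on the slide, so `step=1` is an assumption, and `windows` is a hypothetical helper name.

```python
def windows(sequence, size=104, step=1):
    """Yield (start, window) pairs of fixed-size windows along a sequence.

    size=104 follows the slide; step=1 is an assumption.
    """
    for start in range(0, len(sequence) - size + 1, step):
        yield start, sequence[start:start + size]


if __name__ == "__main__":
    toy = "ACGT" * 30  # 120 nt toy sequence
    print(sum(1 for _ in windows(toy)))
```

Each window would then be passed through steps 1 to 5 in turn, with the cheap filters first so that most windows are discarded early.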


’Fast’ rules:

  • No window containing an unknown base

  • No windows consisting entirely of repeat regions: gains a 40% reduction in analyzed size

    • sensitivity 100% -> 98.4%

    • (lost: hsa-mir-151 hsa-mir-370 hsa-mir-422a hsa-mir-513-1 hsa-mir-513-2)

  • Single nt composition, both strands:

    • max A 43%, min 9%
    • max C 38%, min 10.6%
    • max G 45%, min 11%
    • max T 40%, min 9.3%

  • Single nt composition, single strands:

    • max A 37.5%, min 9%
    • max C 38%, min 10.6%
    • max G 43.8%, min 12.5%
    • max T 40%, min 12.7%
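As an illustration of how such a filter might look, here is a minimal Python sketch that applies the unknown-base rule and the single-strand composition bounds listed above. The function and constant names are hypothetical, and the "both strands" bounds are left out because the slide does not say how the two strands are combined.

```python
# Single-strand bounds in percent, (min, max), copied from the slide.
SINGLE_STRAND_BOUNDS = {
    "A": (9.0, 37.5),
    "C": (10.6, 38.0),
    "G": (12.5, 43.8),
    "T": (12.7, 40.0),
}


def passes_single_nt_composition(window):
    """Return True if the window has no unknown base and every base
    frequency lies within the single-strand bounds."""
    window = window.upper()
    if not window or any(base not in "ACGT" for base in window):
        return False  # empty window or unknown base (e.g. N)
    for base, (low, high) in SINGLE_STRAND_BOUNDS.items():
        percent = 100.0 * window.count(base) / len(window)
        if not low <= percent <= high:
            return False
    return True


if __name__ == "__main__":
    print(passes_single_nt_composition("ACGT" * 26))
```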


More ’fast’ rules:

  • Double nt composition, single strands:

    • max AA 15.4%, min 0%
    • max AC 10.7%, min 0%
    • max AG 14.2%, min 1%
    • max AT 16.1%, min 0%
    • max CA 14.7%, min 0%
    • max CC 18.3%, min 0%
    • max CG 15.8%, min 0%
    • max CT 16.4%, min 1.3%
    • max GA 11.9%, min 0%
    • max GC 17.6%, min 0%
    • max GG 19.3%, min 1%
    • max GT 13.4%, min 1.4%
    • max TA 15.7%, min 0%
    • max TC 15.6%, min 1.1%
    • max TG 18.8%, min 2.9%
    • max TT 25.8%, min 0%
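The dinucleotide bounds can be applied in the same way as the single-nucleotide ones. The sketch below is an assumption-laden illustration: it counts overlapping dinucleotides and expresses each as a percentage of the window's `len - 1` dinucleotide positions, which the slide does not specify. All names are hypothetical.

```python
from collections import Counter

# Bounds in percent, (min, max), copied from the slide.
DINUC_BOUNDS = {
    "AA": (0.0, 15.4), "AC": (0.0, 10.7), "AG": (1.0, 14.2), "AT": (0.0, 16.1),
    "CA": (0.0, 14.7), "CC": (0.0, 18.3), "CG": (0.0, 15.8), "CT": (1.3, 16.4),
    "GA": (0.0, 11.9), "GC": (0.0, 17.6), "GG": (1.0, 19.3), "GT": (1.4, 13.4),
    "TA": (0.0, 15.7), "TC": (1.1, 15.6), "TG": (2.9, 18.8), "TT": (0.0, 25.8),
}


def dinucleotide_percentages(window):
    """Percentage of each dinucleotide among overlapping positions
    (assumed counting scheme). Requires len(window) >= 2."""
    window = window.upper()
    total = len(window) - 1
    counts = Counter(window[i:i + 2] for i in range(total))
    return {d: 100.0 * counts.get(d, 0) / total for d in DINUC_BOUNDS}


def passes_dinucleotide_rules(window):
    """Return True if every dinucleotide frequency is within bounds."""
    if len(window) < 2:
        return False
    percent = dinucleotide_percentages(window)
    return all(low <= percent[d] <= high
               for d, (low, high) in DINUC_BOUNDS.items())


if __name__ == "__main__":
    print(passes_dinucleotide_rules("AACAGATCCGCTGGTT" * 6))
```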


>= 4nt palindrome rule:

Hash table with 4^4 = 256 entries:

Hash-key      occurred at position   rev.comp
---------------------------------------------
000  AAAA      3                     255
001  AAAC      0                     254
002  AAAG      0                     253
003  AAAU      4                     252
004  AACA      0                     251
005  ...
...
254  UUUG      0                     001
255  UUUU     60                     000
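The table can be read as a base-4 encoding of each 4-mer with A=0, C=1, G=2, U=3. In every row shown, the "rev.comp" entry equals 255 minus the key, which is the key of the base-wise complement (the pairing partner read in the opposite direction). The sketch below reproduces that arithmetic and uses it to test a window for a 4-nt inverted repeat. It is a hedged reconstruction: the helper names are hypothetical, G-U wobble pairs are ignored, and how the slide's single "position" column was filled is not known, so all positions are stored here.

```python
BASES = "ACGU"  # A=0, C=1, G=2, U=3, matching the key order in the table


def kmer_key(kmer):
    """Base-4 encoding of a k-mer, e.g. AAAC -> 1, UUUU -> 255."""
    key = 0
    for base in kmer:
        key = key * 4 + BASES.index(base)
    return key


def partner_key(key, k=4):
    """Key of the base-wise complement: 255 - key for k=4,
    which matches the 'rev.comp' column in the rows shown."""
    return (4 ** k - 1) - key


def reverse_complement_key(key, k=4):
    """Key of the reverse complement read 5'->3' (digits of the
    complement in reversed order)."""
    comp, rev = partner_key(key, k), 0
    for _ in range(k):
        rev = rev * 4 + comp % 4
        comp //= 4
    return rev


def index_kmers(window, k=4):
    """Map each k-mer key to the 1-based positions where it occurs."""
    window = window.upper().replace("T", "U")
    table = {}
    for pos in range(len(window) - k + 1):
        table.setdefault(kmer_key(window[pos:pos + k]), []).append(pos + 1)
    return table


def has_inverted_repeat(window, k=4):
    """True if some k-mer is followed, without overlap, by its
    reverse complement (Watson-Crick pairs only)."""
    table = index_kmers(window, k)
    for key, positions in table.items():
        partners = table.get(reverse_complement_key(key, k), [])
        if any(q >= p + k for p in positions for q in partners):
            return True
    return False


if __name__ == "__main__":
    print(kmer_key("AAAC"), partner_key(kmer_key("AAAC")))
    print(has_inverted_repeat("AAACCCCCGUUU"))
```

Windows with unknown bases would raise a `ValueError` in `kmer_key`; under the ’fast’ rules such windows are discarded before this step.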


microRNA computational prediction pipeline

Flow diagram stages: 2 851 352 871 bases -> inverted repeats, composition -> cross-species conservation -> RNA secondary structure prediction -> energy + structural features -> SVM -> SS-conservation -> novel microRNAs: microarray verification


Prediction features

Feature sources: predicted secondary structure, comparative analysis

  • 1. Stem_Length
  • 2. GC_Content
  • 3. Stem_BPs
  • 4. maxLinHelix
  • 5. MatureCons
  • 6. MatureOppositeCons
  • 7. ArmCons
  • 8. SS_Energy
  • 9. MatureBPs
  • 10. MatureEnergyProfile

=> 10 features for SVM classification


Histogram for feature: stem length


Histogram for feature: GC content


Histogram for feature: #base pairs in stem


Feature: longest ‘linear’ helix

maxlinhelix = 18 nt

maxlinhelix = 26 nt


Histogram for feature: longest ‘linear’ helix


Features related to mature region

Window of 23 nt, sliding 0 to 15 nt from the loop

Calculate the ‘mature’ feature at all positions and keep the prediction with the highest score
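The sliding-window search for the mature region can be sketched as follows. This is an illustration under assumptions: the per-base values (for example conservation flags) are given for one arm of the stem, ordered from the loop outward, and the ‘mature’ feature is taken to be their sum over the window. The function name and this scoring are hypothetical.

```python
def best_mature_window(per_base_values, window=23, max_offset=15):
    """Slide a fixed window 0..max_offset nt away from the loop and
    return (offset, score) for the highest-scoring position.

    per_base_values[0] is assumed to be the base next to the loop.
    Returns None if no full window fits. Ties keep the smallest offset.
    """
    best = None
    for offset in range(max_offset + 1):
        chunk = per_base_values[offset:offset + window]
        if len(chunk) < window:
            break  # ran off the end of the arm
        score = sum(chunk)
        if best is None or score > best[1]:
            best = (offset, score)
    return best


if __name__ == "__main__":
    arm = [0] * 5 + [1] * 23 + [0] * 10
    print(best_mature_window(arm))
```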


Histogram for feature: #conserved bases in mature region


Histogram for feature: #conserved bases in mature region (on opposite strand)


Histogram for feature: #conserved bases in both arms of the stem


Histogram for feature: secondary structure minimal free energy


Histogram for feature: #paired bases in mature region


Mature region: average stacking energy


Histogram for feature: correlation with average mature energy profile in mature region


Learning with Support Vector Machines

‘Soft-margin’ hyperplanes, cost parameter C

(Figure panels: training data, test data)


Training with the libsvm-2.6 package by C.-C. Chang & C.-J. Lin

http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Modification: optimize the Matthews correlation, not % correct


Importance of features with ‘knockout’ retraining:

All features: Cross Validation Accuracy = 87.2728%

Feature ‘knockout’:

Cross Validation Accuracy   Feature removed
--------------------------------------------------
75.4618%                    ss-energy ***
84.6784%                    stem-start
84.409%                     stem-end
85.2758%                    loop-length
82.3163%                    loop-start
82.3909%                    # base-pairs
76.4124%                    GC-content **
86.3902%                    higher arm conservation
84.97%                      lower arm conservation
85.0393%                    loop conservation
84.0942%                    # GU pairs
85.4047%                    length of longest bulge
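To make the ranking explicit, the accuracy lost by removing each feature can be computed from the figures above. This is plain arithmetic on the slide's numbers; the variable and function names are illustrative.

```python
BASELINE = 87.2728  # cross-validation accuracy with all features (%)

# Accuracy (%) after retraining without the named feature (from the slide).
KNOCKOUT_ACCURACY = {
    "ss-energy": 75.4618,
    "stem-start": 84.6784,
    "stem-end": 84.409,
    "loop-length": 85.2758,
    "loop-start": 82.3163,
    "# base-pairs": 82.3909,
    "GC-content": 76.4124,
    "higher arm conservation": 86.3902,
    "lower arm conservation": 84.97,
    "loop conservation": 85.0393,
    "# GU pairs": 84.0942,
    "length of longest bulge": 85.4047,
}


def accuracy_drops(baseline=BASELINE, knockouts=KNOCKOUT_ACCURACY):
    """Return (feature, drop in percentage points), largest drop first."""
    drops = {name: baseline - acc for name, acc in knockouts.items()}
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for feature, drop in accuracy_drops():
        print(f"{drop:7.4f}  {feature}")
```

The two starred features, ss-energy and GC-content, come out on top, each costing more than 10 percentage points when removed.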


Test-set results for various SVM thresholds

Q      SENS   SPEC   CORR     cp  cn     fp   fn  threshold
---------------------------------------------------------------------
99.60  96.74  28.16  +0.5208  89  56497  227   3  0.010000
99.76  95.65  39.82  +0.6163  88  56591  133   4  0.020000
99.83  95.65  48.09  +0.6776  88  56629   95   4  0.030000
99.86  95.65  54.32  +0.7203  88  56650   74   4  0.040000
99.87  95.65  55.00  +0.7248  88  56652   72   4  0.050000
99.92  95.65  67.18  +0.8012  88  56681   43   4  0.100000
99.94  95.65  75.21  +0.8479  88  56695   29   4  0.150000
99.95  95.65  78.57  +0.8667  88  56700   24   4  0.200000
99.96  95.65  82.24  +0.8868  88  56705   19   4  0.250000
99.96  95.65  83.02  +0.8909  88  56706   18   4  0.300000 ***
99.96  94.57  85.29  +0.8979  87  56709   15   5  0.350000
99.97  94.57  86.14  +0.9024  87  56710   14   5  0.400000
99.97  92.39  87.63  +0.8996  85  56712   12   7  0.450000
99.97  91.30  90.32  +0.9080  84  56715    9   8  0.500000
99.97  88.04  91.01  +0.8950  81  56716    8  11  0.550000
99.96  85.87  90.80  +0.8828  79  56716    8  13  0.600000
99.96  85.87  91.86  +0.8880  79  56717    7  13  0.650000
99.97  85.87  94.05  +0.8985  79  56719    5  13  0.700000
99.96  82.61  93.83  +0.8802  76  56719    5  16  0.750000
99.96  80.43  96.10  +0.8790  74  56721    3  18  0.800000
99.96  80.43  96.10  +0.8790  74  56721    3  18  0.849999
99.96  77.17  97.26  +0.8662  71  56722    2  21  0.899999


Computation time: < 3 weeks on ~40 AMD Opteron 242 CPUs (ICS-FORTH)


Hg17-scan results for various SVM thresholds

precursor      #candidates
sensitivity    (incl. known miRNAs)   hit-rate
----------------------------------------------
95.1%          96699                  16 ppm
90.3%          45231                  7.6 ppm
85.9%          23025                  3.9 ppm
80.6%          14429                  2.4 ppm
75.7%           9732                  1.6 ppm
70.9%           6912                  1.2 ppm
----------------------------------------------
Total nt processed: 5976557831
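The hit-rate column appears to be the number of candidates divided by the total number of nucleotides processed, in parts per million. A short sketch of that arithmetic (the function name is illustrative):

```python
TOTAL_NT_PROCESSED = 5_976_557_831  # from the slide


def hit_rate_ppm(candidates, total_nt=TOTAL_NT_PROCESSED):
    """Candidates per million nucleotides processed."""
    return 1e6 * candidates / total_nt


if __name__ == "__main__":
    for n in (96699, 45231, 23025, 14429, 9732, 6912):
        print(f"{n:6d} candidates -> {hit_rate_ppm(n):.1f} ppm")
```

The computed values agree with the table to the precision shown there.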


Secondary structure conservation:

From the RNAfold library: structure-structure comparison

  Null,   H,   B,   I,   M,   S,   E
-------------------------------------
{    0,   2,   2,   2,   2,   1,   1}   Null
{    2,   0,   2,   2,   2, INF, INF}   H
{    2,   2,   0,   1,   2, INF, INF}   B
{    2,   2,   1,   0,   2, INF, INF}   I
{    2,   2,   2,   2,   0, INF, INF}   M
{    1, INF, INF, INF, INF,   0, INF}   S
{    1, INF, INF, INF, INF, INF,   0}   E

'H' hairpin loop
'I' interior loop
'B' bulge
'M' multi-loop
'S' stack
'E' external elements
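The matrix can be used as a lookup table of edit costs between structure element types. The sketch below encodes it in Python and checks that it is symmetric. It assumes that the Null row and column give the cost of inserting or deleting an element, which the slide does not state; the names are hypothetical.

```python
INF = float("inf")
LABELS = ["Null", "H", "B", "I", "M", "S", "E"]

# Rows and columns in the order of LABELS, copied from the slide.
COST_MATRIX = [
    [0,   2,   2,   2,   2,   1,   1],    # Null
    [2,   0,   2,   2,   2,   INF, INF],  # H  hairpin loop
    [2,   2,   0,   1,   2,   INF, INF],  # B  bulge
    [2,   2,   1,   0,   2,   INF, INF],  # I  interior loop
    [2,   2,   2,   2,   0,   INF, INF],  # M  multi-loop
    [1,   INF, INF, INF, INF, 0,   INF],  # S  stack
    [1,   INF, INF, INF, INF, INF, 0],    # E  external elements
]


def edit_cost(a, b):
    """Cost of replacing element type a by b; 'Null' on either side is
    read here as an insertion or deletion (an assumption)."""
    return COST_MATRIX[LABELS.index(a)][LABELS.index(b)]


def is_symmetric(matrix=COST_MATRIX):
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i]
               for i in range(n) for j in range(n))


if __name__ == "__main__":
    print(edit_cost("B", "I"), edit_cost("H", "S"), is_symmetric())
```

Read this way, a bulge and an interior loop are the cheapest pair to interchange (cost 1), while stacks and external elements cannot be substituted for any loop type.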


Secondary structure conservation vs. SVM scores


Probe design for experimental verification (RNA-RNA chip):

  • 2 probes of 60 nt for each candidate

  • The ends of the 5' probes reach 75% into the hairpin loop

  • The 3' probes start after 50% of the hairpin loop

  • Sensitivity for detecting mature miRNA: 86%

  • Chip in preparation at the University of Toronto

Estimate for the number of true miRNAs:

Q: 99.96  SENS: 85.87  SPEC: 91.86  CORR: +0.8880  cp 79  cn 56717  fp 7  fn 13  th 0.67

spec = cp/(cp+fp) = cp/nhits  =>  (expected cp) = spec * nhits = 0.9168 * 7664 = 7026

All predictions are available!
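The estimate scales the number of genome hits by the fraction of test-set hits that were correct, cp/(cp+fp). The sketch below repeats that calculation. Note that 79/(79+7) evaluates to about 0.9186, matching SPEC 91.86 in the test-set line, whereas the product on the slide uses 0.9168; both variants are printed so that the difference is visible. The function name is illustrative.

```python
def expected_true_hits(cp, fp, nhits):
    """Expected number of true miRNAs among nhits genome hits, using
    the test-set fraction of correct hits cp / (cp + fp)."""
    return cp / (cp + fp) * nhits


if __name__ == "__main__":
    # Recomputed from the counts on the slide (cp 79, fp 7, 7664 hits).
    print(round(expected_true_hits(79, 7, 7664)))
    # Product exactly as written on the slide.
    print(round(0.9168 * 7664))
```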


Just the tip of the iceberg

  • Tiling-window expression analysis of mouse: 30% of the genome is transcribed!

  • mRNA genes are 3% of the truth …


Acknowledgments:

Artemis Hatzigeorgiou,

Praveen Sethupathy, Molly Megraw, Karol Szafranski

Center for Bioinformatics, School of Medicine, University of Pennsylvania

Yannis Tollis

Panayiota Poïrazi

Anastasis Oulas

Alkiviadis Simeonidis

Angelos Bilas, Michalis Flouris

Advanced Computing Systems,

Computer Architecture and VLSI Systems Lab, ICS-FORTH

