Prediction of > 3000 novel human microRNAs …

Martin Reczko

ICS/IMBB Bioinformatics Program

Biomedical Informatics Lab

Institute for Computer Science – FORTH


Rfam/miRBase 7.1 (October 2005)

ID    #miRNAs  name
-------------------------------------------
aga        42  A. gambiae (MOZ2)
ame        26  A. mellifera (AMEL2.0)
ath       117  A. thaliana (RefSeq entries)
cbr        82  C. briggsae (cb25.agp8)
cel       115  C. elegans (WormBase WS140)
cfa         6  C. familiaris (BROADD1)
dme        78  D. melanogaster (BDGP4)
dps        73  D. pseudoobscura (DPSE2.0)
dre       293  D. rerio (WTSI Zv5)
fru       130  F. rubripes (FUGU2.0)
gga       122  G. gallus (WASHUC1)
hsa       325  H. sapiens (NCBI35)
mmu       255  M. musculus (NCBIM34)
osa       123  O. sativa (TIGR 3.0)
ptr        67  P. troglodytes (CHIMP1)
rno       189  R. norvegicus (RGSC3.4)
tni       131  T. nigroviridis (TETRAODON7)
zma        95  Z. mays (TIGR AZM4)
ebv         5  Epstein Barr virus (EMBL:V01555.1)
hcmv        8  Human cytomegalovirus (Refseq:NC_001347.2)
kshv       11  Kaposi sarcoma associated herpesvirus (EMBL:U75698.1)
mghv        9  Mouse gammaherpesvirus 68 (EMBL:U97553.1)

microrna.sanger.ac.uk

used 227 from miRBase 6.0


Negative examples: 3’UTRs

~9 Mbases, from http://www.ensembl.org/BioMart/


Conservation: MultiZ alignments

11111111111111111111111111111111111111110111111111111111111101111111111111111110111111111111111111111111 0

11111011111111111111111111111111111111010111111111111111111111111111110111110110111111111111111111111111 1

11111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 2

11111101111111111111111111111111111111111101101011111111111111111111111111111110111111101111111111111111 3

11101001101101111111111111111111111110011111011011111111011011111111001111011100111111101111111111111111 4

11100001101101111111111111111111111110010100001011111111011001111111000111010000111111101111111111111111 5

Conservation rules: total number of 1’s in the rows above >= 120, and at least one stretch of 12 consecutive 1’s
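The conservation rule can be sketched in a few lines. This is only an illustration under stated assumptions: the slide does not say whether the count of 1’s is taken over all rows together or whether the stretch of 12 must lie within one row, so the sketch assumes a total over all rows and a stretch inside a single row. The function name `passes_conservation` is hypothetical.

```python
def passes_conservation(rows, min_ones=120, min_run=12):
    """Check the two conservation rules on a list of 0/1 strings.

    Assumptions (not stated on the slide): the count of 1's is summed
    over all rows, and the stretch of 1's must lie within one row.
    """
    total_ones = sum(row.count("1") for row in rows)
    # Splitting on "0" leaves the runs of consecutive 1's.
    longest_run = max(
        (len(run) for row in rows for run in row.split("0")), default=0
    )
    return total_ones >= min_ones and longest_run >= min_run
```

For example, `passes_conservation(["1" * 104, "1" * 20 + "0" * 84])` passes both rules, while six rows of alternating `10` fail the stretch rule despite having 312 ones.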


Genome wide prediction pipeline

  • Process windows of 104 nt along the genome:

  • 1. Fast filtering using composition and palindromes

  • 2. Comparative analysis with other genomes (BLASTZ)

  • 3. Approximate secondary structure prediction (stem-loop) using a novel dynamic programming algorithm

  • 4. Feature extraction and classification (SVMs)

  • 5. Filter conserved secondary structures
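The windowed scan with cheap filters applied first can be sketched as follows. The window length of 104 nt is from the slide; the step size, the function names `windows` and `scan`, and the idea of passing the filters as a list of predicates are assumptions made for illustration only.

```python
from typing import Callable, Iterator, List, Tuple

WINDOW = 104  # window length stated on the slide


def windows(genome: str, step: int = 1) -> Iterator[Tuple[int, str]]:
    """Yield (start, window) for every full-length window.

    The step size is an assumption; the slide does not state it.
    """
    for start in range(0, len(genome) - WINDOW + 1, step):
        yield start, genome[start:start + WINDOW]


def scan(genome: str, filters: List[Callable[[str], bool]], step: int = 1) -> List[int]:
    """Return start positions of windows that pass every filter.

    all() stops at the first failing filter, so inexpensive filters
    should come first in the list.
    """
    return [start for start, w in windows(genome, step)
            if all(f(w) for f in filters)]
```

Ordering the filters from cheapest to most expensive keeps the costly steps (alignment, folding, classification) off most windows.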


’Fast’ rules:

  • No window containing an unknown base

  • No windows consisting entirely of repeat regions: 40% reduction in analyzed size,

    • sensitivity drops from 100% to 98.4%

  • (lost: hsa-mir-151 hsa-mir-370 hsa-mir-422a hsa-mir-513-1 hsa-mir-513-2)

  • - Single nt composition, both strands:

  • max A 43% min 9%

  • max C 38% min 10.6%

  • max G 45% min 11%

  • max T 40% min 9.3%

  • - Single nt composition, single strands:

  • max A 37.5% min 9%

  • max C 38% min 10.6%

  • max G 43.8% min 12.5%

  • max T 40% min 12.7%
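A minimal sketch of the single-nucleotide filter, using the single-strand bounds listed above. The function name `passes_single_nt` is hypothetical, and treating any character outside A, C, G, T as an unknown base is an assumption.

```python
# (min %, max %) per base, single-strand values from the slide
SINGLE_STRAND_BOUNDS = {
    "A": (9.0, 37.5),
    "C": (10.6, 38.0),
    "G": (12.5, 43.8),
    "T": (12.7, 40.0),
}


def passes_single_nt(window: str) -> bool:
    """Reject windows with unknown bases or out-of-range composition."""
    window = window.upper()
    if not window or any(base not in "ACGT" for base in window):
        return False  # empty window or unknown base (assumed: anything but ACGT)
    n = len(window)
    for base, (low, high) in SINGLE_STRAND_BOUNDS.items():
        percent = 100.0 * window.count(base) / n
        if not low <= percent <= high:
            return False
    return True
```

A window with 25% of each base passes; a homopolymer run or a window containing `N` is rejected.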


More ’fast’ rules:

  • Double nt composition, single strands:

  • max AA 15.4% min 0%

  • max AC 10.7% min 0%

  • max AG 14.2% min 1%

  • max AT 16.1% min 0%

  • max CA 14.7% min 0%

  • max CC 18.3% min 0%

  • max CG 15.8% min 0%

  • max CT 16.4% min 1.3%

  • max GA 11.9% min 0%

  • max GC 17.6% min 0%

  • max GG 19.3% min 1%

  • max GT 13.4% min 1.4%

  • max TA 15.7% min 0%

  • max TC 15.6% min 1.1%

  • max TG 18.8% min 2.9%

  • max TT 25.8% min 0%
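The dinucleotide bounds can be checked the same way. In this sketch the frequency of a dinucleotide is assumed to be its overlapping count divided by the number of dinucleotide positions (length minus one); the slide does not state the normalization. The name `passes_dinucleotide` is hypothetical.

```python
from collections import Counter

# (min %, max %) per dinucleotide, single-strand values from the slide
DINUC_BOUNDS = {
    "AA": (0.0, 15.4), "AC": (0.0, 10.7), "AG": (1.0, 14.2), "AT": (0.0, 16.1),
    "CA": (0.0, 14.7), "CC": (0.0, 18.3), "CG": (0.0, 15.8), "CT": (1.3, 16.4),
    "GA": (0.0, 11.9), "GC": (0.0, 17.6), "GG": (1.0, 19.3), "GT": (1.4, 13.4),
    "TA": (0.0, 15.7), "TC": (1.1, 15.6), "TG": (2.9, 18.8), "TT": (0.0, 25.8),
}


def passes_dinucleotide(window: str) -> bool:
    """Check overlapping dinucleotide percentages against the bounds."""
    window = window.upper()
    if len(window) < 2:
        return False
    # zip pairs each base with its successor, so overlaps are counted.
    counts = Counter(a + b for a, b in zip(window, window[1:]))
    positions = len(window) - 1
    for dinuc, (low, high) in DINUC_BOUNDS.items():
        percent = 100.0 * counts.get(dinuc, 0) / positions
        if not low <= percent <= high:
            return False
    return True
```

Counting with `zip` rather than `str.count` matters here, because `str.count` skips overlapping matches such as the second `AA` in `AAA`.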


>= 4nt palindrome rule:

Hash-table with 4^4=256 entries:

Hash-key   occurred at position   rev.comp
---------------------------------------
000 AAAA    3                     255
001 AAAC    0                     254
002 AAAG    0                     253
003 AAAU    4                     252
004 AACA    0                     251
005 ...
...
254 UUUG    0                     001
255 UUUU   60                     000
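A sketch of how such a table can be used to detect a 4 nt inverted repeat. The key encoding (A=0, C=1, G=2, U=3, read as a base-4 number) reproduces the keys shown on the slide. Several points are assumptions: the rows shown pair each key k with 255 − k, whereas this sketch derives the partner key by explicitly reverse-complementing the 4-mer; it stores all positions per key rather than one; and the minimum gap `min_gap` between the two arms is a hypothetical parameter. Input is expected to be upper-case A, C, G, U (T is treated as U).

```python
K = 4
CODE = {"A": 0, "C": 1, "G": 2, "U": 3, "T": 3}
COMPLEMENT = {"A": "U", "C": "G", "G": "C", "U": "A", "T": "A"}


def key(kmer: str) -> int:
    """Encode a k-mer as a base-4 integer (AAAA -> 0, UUUU -> 255)."""
    value = 0
    for base in kmer:
        value = 4 * value + CODE[base]
    return value


def reverse_complement(kmer: str) -> str:
    return "".join(COMPLEMENT[base] for base in reversed(kmer))


def build_table(seq: str):
    """Table with 4**K slots; slot i lists the positions of the 4-mer with key i."""
    table = [[] for _ in range(4 ** K)]
    for i in range(len(seq) - K + 1):
        table[key(seq[i:i + K])].append(i)
    return table


def has_palindrome(seq: str, min_gap: int = 3) -> bool:
    """True if some 4-mer is followed downstream by its reverse complement."""
    table = build_table(seq)
    for i in range(len(seq) - K + 1):
        partner = key(reverse_complement(seq[i:i + K]))
        if any(j >= i + K + min_gap for j in table[partner]):
            return True
    return False
```

Building the table is linear in the window length, and each lookup is a single array access, which is what makes this usable as a fast pre-filter.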


microRNA computational prediction pipeline

[Pipeline diagram: 2 851 352 871 bases → inverted repeats, composition → cross-species conservation → RNA secondary structure prediction → energy + structural features → SVM → SS-conservation → novel microRNAs → microarray verification]


Prediction features

predicted secondary structure

comparative analysis

  • 1. Stem_Length 2. GC_Content 3. Stem_BPs 4. maxLinHelix 5. MatureCons

  • 6. MatureOppositeCons 7. ArmCons 8. SS_Energy 9. MatureBPs 10. MatureEnergyProfile

=> 10 features for SVM classification





Feature: longest ‘linear’ helix

maxlinhelix = 18 nt

maxlinhelix = 26 nt



Features related to mature region

Sliding window of 23 nt, placed 0 to 15 nt from the loop

Calculate the ‘mature’ feature at all positions and keep the prediction with the highest score
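The sliding-window search can be sketched as below. The window length (23 nt) and offset range (0 to 15 nt) are from the slide; the function name `best_mature_window`, the convention that the arm is read starting at the loop side, and the scoring callback are hypothetical.

```python
from typing import Callable, Optional, Tuple


def best_mature_window(
    arm: str,
    score: Callable[[str], float],
    window: int = 23,
    max_offset: int = 15,
) -> Optional[Tuple[float, int]]:
    """Return (best score, offset from loop), or None if the arm is too short.

    `arm` is assumed to start at the loop; `score` stands in for any of
    the 'mature' features and is supplied by the caller.
    """
    best = None
    for offset in range(max_offset + 1):
        segment = arm[offset:offset + window]
        if len(segment) < window:
            break  # ran off the end of the arm
        value = score(segment)
        if best is None or value > best[0]:
            best = (value, offset)
    return best
```

With a toy score such as the number of G’s in the segment, an arm of 10 A’s, 23 G’s and 10 A’s gives the best window at offset 10.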








Histogram for feature: correlation with the average energy profile in the mature region


Learning with Support Vector Machines

Training data, test data

‘Soft-margin’ hyperplanes, cost parameter C


Training with libsvm-2.6 package by C.-C. Chang & C.-J. Lin

Modification: optimize Matthews correlation, not % correct

http://www.csie.ntu.edu.tw/~cjlin/libsvm/


Importance of features with ‘knockout’ retraining:

All features:

Cross Validation Accuracy = 87.2728%

Feature ‘knockout’:

Cross Validation Accuracy = 75.4618% ss-energy ***

Cross Validation Accuracy = 84.6784% stem-start

Cross Validation Accuracy = 84.409% stem-end

Cross Validation Accuracy = 85.2758% loop-length

Cross Validation Accuracy = 82.3163% loop-start

Cross Validation Accuracy = 82.3909% # base-pairs

Cross Validation Accuracy = 76.4124% GC-content **

Cross Validation Accuracy = 86.3902% higher arm conservation

Cross Validation Accuracy = 84.97% lower arm conservation

Cross Validation Accuracy = 85.0393% loop conservation

Cross Validation Accuracy = 84.0942% # GU pairs

Cross Validation Accuracy = 85.4047% length of longest bulge
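To read the knockout results, it helps to rank features by how much accuracy drops when each is removed. The sketch below only does that arithmetic on the values listed above; the names `BASELINE`, `KNOCKOUT` and `rank_by_drop` are introduced here for illustration.

```python
BASELINE = 87.2728  # cross-validation accuracy with all features (%)

# Accuracy (%) after retraining without the named feature
KNOCKOUT = {
    "ss-energy": 75.4618,
    "stem-start": 84.6784,
    "stem-end": 84.409,
    "loop-length": 85.2758,
    "loop-start": 82.3163,
    "# base-pairs": 82.3909,
    "GC-content": 76.4124,
    "higher arm conservation": 86.3902,
    "lower arm conservation": 84.97,
    "loop conservation": 85.0393,
    "# GU pairs": 84.0942,
    "length of longest bulge": 85.4047,
}


def rank_by_drop(baseline, knockout):
    """Sort features by accuracy lost when they are removed, largest first."""
    drops = {name: baseline - acc for name, acc in knockout.items()}
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)


ranked = rank_by_drop(BASELINE, KNOCKOUT)
```

The two starred features come out on top: removing ss-energy costs about 11.8 percentage points and removing GC-content about 10.9.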


Test-set results for various SVM thresholds

    Q   SENS   SPEC     CORR   cp     cn   fp   fn  threshold
---------------------------------------------------------------------
99.60  96.74  28.16  +0.5208   89  56497  227    3  0.010000
99.76  95.65  39.82  +0.6163   88  56591  133    4  0.020000
99.83  95.65  48.09  +0.6776   88  56629   95    4  0.030000
99.86  95.65  54.32  +0.7203   88  56650   74    4  0.040000
99.87  95.65  55.00  +0.7248   88  56652   72    4  0.050000
99.92  95.65  67.18  +0.8012   88  56681   43    4  0.100000
99.94  95.65  75.21  +0.8479   88  56695   29    4  0.150000
99.95  95.65  78.57  +0.8667   88  56700   24    4  0.200000
99.96  95.65  82.24  +0.8868   88  56705   19    4  0.250000
99.96  95.65  83.02  +0.8909   88  56706   18    4  0.300000 ***
99.96  94.57  85.29  +0.8979   87  56709   15    5  0.350000
99.97  94.57  86.14  +0.9024   87  56710   14    5  0.400000
99.97  92.39  87.63  +0.8996   85  56712   12    7  0.450000
99.97  91.30  90.32  +0.9080   84  56715    9    8  0.500000
99.97  88.04  91.01  +0.8950   81  56716    8   11  0.550000
99.96  85.87  90.80  +0.8828   79  56716    8   13  0.600000
99.96  85.87  91.86  +0.8880   79  56717    7   13  0.650000
99.97  85.87  94.05  +0.8985   79  56719    5   13  0.700000
99.96  82.61  93.83  +0.8802   76  56719    5   16  0.750000
99.96  80.43  96.10  +0.8790   74  56721    3   18  0.800000
99.96  80.43  96.10  +0.8790   74  56721    3   18  0.849999
99.96  77.17  97.26  +0.8662   71  56722    2   21  0.899999
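The four score columns can be recomputed from the counts in each row. In this sketch cp and cn are read as correct positives and correct negatives, SPEC follows the definition cp/(cp+fp) used later in the deck, and CORR is taken to be the Matthews correlation coefficient; these readings are assumptions, but they reproduce the tabulated values. The function name `metrics` is hypothetical.

```python
import math


def metrics(cp: int, cn: int, fp: int, fn: int):
    """Return (Q, SENS, SPEC, CORR) for one row of the table.

    Q    = overall % correct
    SENS = cp / (cp + fn), in %
    SPEC = cp / (cp + fp), in % (as defined in the deck)
    CORR = Matthews correlation coefficient
    Assumes at least one actual positive and one predicted positive.
    """
    total = cp + cn + fp + fn
    q = 100.0 * (cp + cn) / total
    sens = 100.0 * cp / (cp + fn)
    spec = 100.0 * cp / (cp + fp)
    denom = math.sqrt((cp + fp) * (cp + fn) * (cn + fp) * (cn + fn))
    corr = (cp * cn - fp * fn) / denom if denom else 0.0
    return q, sens, spec, corr
```

Because negatives outnumber positives by several hundred to one in this test set, Q stays above 99.6% across the whole threshold range, while CORR varies from about 0.52 to 0.91. That is why the correlation is the more informative column for choosing a threshold.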



Hg17-scan results for various SVM thresholds

precursor     #candidates            hit-rate
sensitivity   (incl. known miRNAs)
----------------------------------------------
95.1%         96699                  16 ppm
90.3%         45231                  7.6 ppm
85.9%         23025                  3.9 ppm
80.6%         14429                  2.4 ppm
75.7%         9732                   1.6 ppm
70.9%         6912                   1.2 ppm
----------------------------------------------
Total nt processed: 5976557831
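The hit-rate column appears to be the number of candidates per million nucleotides processed. The slide does not state this explicitly, but the sketch below reproduces the listed values under that assumption. The names `TOTAL_NT` and `hit_rate_ppm` are introduced here.

```python
TOTAL_NT = 5_976_557_831  # total nt processed, from the slide


def hit_rate_ppm(n_candidates: int, total_nt: int = TOTAL_NT) -> float:
    """Candidates per million nucleotides processed (assumed definition)."""
    return 1e6 * n_candidates / total_nt


rates = [round(hit_rate_ppm(n), 1) for n in (45231, 23025, 14429, 9732, 6912)]
```

Here `rates` matches the 7.6, 3.9, 2.4, 1.6 and 1.2 ppm entries of the table, and 96699 candidates give roughly 16 ppm.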


Secondary structure conservation:

From RNAfold-library:

structure – structure comparison:

  Null,   H,   B,   I,   M,   S,   E
-------------------------------------
{    0,   2,   2,   2,   2,   1,   1}   Null
{    2,   0,   2,   2,   2, INF, INF}   H
{    2,   2,   0,   1,   2, INF, INF}   B
{    2,   2,   1,   0,   2, INF, INF}   I
{    2,   2,   2,   2,   0, INF, INF}   M
{    1, INF, INF, INF, INF,   0, INF}   S
{    1, INF, INF, INF, INF, INF,   0}   E

'H' hairpin loop
'I' interior loop
'B' bulge
'M' multi-loop
'S' stack
'E' external elements
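To show how such a cost matrix is used, here is a plain weighted edit distance over sequences of structure-element labels. It is a simplified stand-in, not the RNAfold library routine: the `Null` row and column are read as deletion and insertion costs, and each structure is assumed to be already reduced to a flat list of labels. The names `cost` and `edit_distance` are hypothetical.

```python
INF = float("inf")
LABELS = ["Null", "H", "B", "I", "M", "S", "E"]
COST = [
    [0, 2, 2, 2, 2, 1, 1],            # Null
    [2, 0, 2, 2, 2, INF, INF],        # H
    [2, 2, 0, 1, 2, INF, INF],        # B
    [2, 2, 1, 0, 2, INF, INF],        # I
    [2, 2, 2, 2, 0, INF, INF],        # M
    [1, INF, INF, INF, INF, 0, INF],  # S
    [1, INF, INF, INF, INF, INF, 0],  # E
]
INDEX = {label: i for i, label in enumerate(LABELS)}


def cost(a: str, b: str):
    """Cost of replacing element a by element b ('Null' = gap)."""
    return COST[INDEX[a]][INDEX[b]]


def edit_distance(xs, ys):
    """Weighted edit distance between two lists of element labels."""
    n, m = len(xs), len(ys)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + cost(xs[i - 1], "Null")   # delete
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + cost("Null", ys[j - 1])   # insert
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + cost(xs[i - 1], "Null"),
                d[i][j - 1] + cost("Null", ys[j - 1]),
                d[i - 1][j - 1] + cost(xs[i - 1], ys[j - 1]),
            )
    return d[n][m]
```

The INF entries forbid direct substitution between stacks and loops, so changing a hairpin loop into a stack has to go through a deletion (cost 2) and an insertion (cost 1). Swapping a bulge for an interior loop is the cheapest substitution, with cost 1.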


Secondary structure conservation vs. SVM scores


Probe design for experimental verification (RNA-RNA chip):

  • 2 probes of 60 nt for each candidate

  • 5' probes end 75% of the way into the hairpin loop

  • 3' probes start 50% of the way into the hairpin loop

  • sensitivity for detecting the mature miRNA: 86%

  • Chip in preparation at UoToronto
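The probe placement can be sketched as simple coordinate arithmetic. The 60 nt length and the 75% and 50% positions are from the slide; everything else is an assumption made for illustration: coordinates are 0-based and half-open, fractions of the loop are rounded down, and the function name `design_probes` is hypothetical.

```python
def design_probes(loop_start: int, loop_end: int, probe_len: int = 60):
    """Return ((5' start, 5' end), (3' start, 3' end)) as half-open intervals.

    loop_start/loop_end delimit the hairpin loop (0-based, half-open).
    The 5' probe ends 75% of the way into the loop; the 3' probe starts
    at 50% of the loop. Fractions are rounded down (an assumption).
    """
    loop_len = loop_end - loop_start
    five_end = loop_start + (3 * loop_len) // 4
    three_start = loop_start + loop_len // 2
    five_probe = (five_end - probe_len, five_end)
    three_probe = (three_start, three_start + probe_len)
    return five_probe, three_probe
```

For a loop spanning positions 100 to 120, this gives a 5' probe at (55, 115) and a 3' probe at (110, 170), so the two probes overlap by a quarter of the loop.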

Estimate for the number of true miRNAs:

Q: 99.96 SENS: 85.87 SPEC: 91.86 CORR: +0.8880 cp 79 cn 56717 fp 7 fn 13 th 0.67

spec = cp/(cp+fp) = cp/nhits => (expected cp) = spec * nhits = 0.9186 * 7664 = 7040

All predictions are available!


Just the tip of an iceberg

  • tiling window expression analysis of mouse:

  • 30% of the genome is transcribed!

  • - mRNA genes are 3% of the truth…


Acknowledgments:

Artemis Hatzigeorgiou,

Praveen Sethupathy, Molly Megraw, Karol Szafranski

Center for Bioinformatics, School of Medicine, University of Pennsylvania

Yannis Tollis

Panayiota Poïrazi

Anastasis Oulas

Alkiviadis Simeonidis

Angelos Bilas, Michalis Flouris

Advanced Computing Systems,

Computer Architecture and VLSI Systems Lab, ICS-FORTH

