Elements of Bioinformatics (14F001) TP2: Gene prediction 22 October 2012 CORRECTIONS
Notice: During this practical, you will need to use ‘raw’ and ‘fasta’ sequence formats. For additional information on the different sequence formats available, please have a look at http://www.genomatix.de/online_help/help/sequence_formats.html
Choose the eukaryotic tRNA model: the general tRNA model does not give any result!
CpG island in the C. elegans cosmid
Length: 219 bp; position 21954 to 22172
cgttttctgtggtcaca cacgagtatc cggatcttct ggatcaactt gttctcgtct gcaacgtctt tgcaagaatg gcaccagaac agaaacaact actcgtggaa caccttcaag acgttgggca gacggtcgct atgtgtggcg atggagctaa tgattgtgct gctctgaaag cagctcacgc gggaatctca ctatcggagg ctgaagcatc ga
To confirm that this sequence could be part of a promoter sequence (> 80 % of CpG islands extend into the 5' flanking region of the associated genes), check, from its positions, whether this CpG island is located in a gene promoter region (see later).
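As an illustration of what a CpG island predictor measures, here is a minimal sketch. The thresholds (length of at least 200 bp, GC content above 50 %, observed/expected CpG ratio above 0.6) are the commonly cited Gardiner-Garden and Frommer criteria; they are an assumption here, since the practical does not state which criteria the tool applied. The function names are hypothetical.

```python
def cpg_stats(seq: str) -> dict:
    """Length, GC fraction and observed/expected CpG ratio of a DNA sequence."""
    s = "".join(seq.split()).upper()      # drop spaces/newlines, ignore case
    n = len(s)
    c, g = s.count("C"), s.count("G")
    cpg = s.count("CG")                   # number of CpG dinucleotides
    gc_fraction = (c + g) / n if n else 0.0
    # Obs/Exp = (number of CpG * N) / (number of C * number of G)
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return {"length": n, "gc_fraction": gc_fraction, "cpg_obs_exp": obs_exp}


def looks_like_cpg_island(seq: str, min_len: int = 200,
                          min_gc: float = 0.5, min_ratio: float = 0.6) -> bool:
    """Apply the assumed thresholds (not necessarily those of the tool used)."""
    st = cpg_stats(seq)
    return (st["length"] >= min_len
            and st["gc_fraction"] > min_gc
            and st["cpg_obs_exp"] > min_ratio)
```

Calling `cpg_stats` on the 219 bp sequence above (spaces are ignored) gives the values to compare with the thresholds.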
Gene prediction with HMM on the complete cosmid sequence
3 HMM models: firstex, exon_n, lastex
[Figure: predicted genes 1 to 4 on the cosmid; one CDS is flagged "Wrong CDS?"]
Summary:
tRNA: positions 169 to 238
Predicted CpG island: positions 21954 to 22172 -> in the middle of CDS 4: not a 'classical' CpG island (not in the 5' region of a gene)
Gene 1 prediction with HMMgene One gene found
Gene 1 prediction with HMMgene
With the 'human' model: 2 genes found, one on each strand (the one on the minus strand has lower scores). The programs are 'trained' on sequences from specific organisms. The 'codon bias', for example, is not the same in different species.
Example of codon usage tables (-> codon bias) http://www.kazusa.or.jp/codon/
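To make the notion of codon bias concrete, the following minimal sketch counts codons in a coding sequence and reports frequencies per thousand codons, the unit usually found in codon usage tables. The function name is hypothetical and the input is assumed to be an in-frame CDS.

```python
from collections import Counter


def codon_usage(cds: str) -> dict:
    """Codon frequencies per thousand codons for an in-frame coding sequence."""
    s = "".join(cds.split()).upper()
    usable = len(s) - len(s) % 3          # ignore a trailing incomplete codon
    counts = Counter(s[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: 1000.0 * k / total for codon, k in counts.items()}
```

Running it on CDSs from two different organisms and comparing the frequencies of synonymous codons shows why a predictor trained on one species can perform poorly on another.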
Gene 1 prediction with Netgene2
Netgene2 gives the positions of the first and last nucleotides of each intron (donor and acceptor splice sites).
[Figure: intron starting with GT (donor site) and ending with AG (acceptor site)]
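The GT/AG rule can be illustrated with a naive motif scan. This is only a sketch: Netgene2 scores the sequence context with neural networks, whereas the hypothetical function below simply lists every GT and AG dinucleotide, so it returns many false candidates. Positions are 1-based and follow the convention above: a donor position is the first nucleotide of the intron, an acceptor position is the last one.

```python
def candidate_splice_sites(seq: str):
    """Return (donors, acceptors) as 1-based positions of naive GT/AG matches."""
    s = "".join(seq.split()).upper()
    # Donor: position of the G in 'GT' (first nucleotide of a putative intron)
    donors = [i + 1 for i in range(len(s) - 1) if s[i:i + 2] == "GT"]
    # Acceptor: position of the G in 'AG' (last nucleotide of a putative intron)
    acceptors = [i + 2 for i in range(len(s) - 1) if s[i:i + 2] == "AG"]
    return donors, acceptors
```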
Gene 1 prediction with GeneBuilder (organism: no choice, human only; option: first and last exon disabled)
Matrix: miscellaneous
One gene found
Gene 1 prediction with GenScan
No choice of organism except vertebrate, maize and Arabidopsis!
Two genes found
FGENESH One gene found
Summary (gene prediction)
One gene, plus another potential gene from positions 2000 to 2900.
Splice sites (DO: donor site, AC: acceptor site; score in parentheses): DO 1084 (1.00), AC 1304 (0.77), DO 1407 (0.89), AC 1451 (0.90), DO 1662 (1.00), AC 1913 (1.00)
[Figure: predictions drawn 5' to 3' for GenScan (organism = human!), HMMgene, GeneBuilder, Netgene2 (organism = human!), GeneMark and FGENESH; coordinates shown on the figure: 163, 211, 977, 1003, 1083, 1305, 1406, 1452, 1557, 1661, 1914, 1997, 2000]
GeneMark finds a second gene in 3'!
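The donor and acceptor positions above can be turned into intron and exon intervals with a few lines of code. This is a sketch under stated assumptions: donors and acceptors are assumed to alternate along the sequence, and the coding start (1003) and end (1997) are read from the figure coordinates rather than from a program output. The function names are hypothetical.

```python
def introns_from_sites(donors, acceptors):
    """Pair the i-th donor with the i-th acceptor (assumes strict alternation)."""
    return list(zip(sorted(donors), sorted(acceptors)))


def exons_from_introns(start, end, introns):
    """Exons are the stretches between start/end and the intron boundaries."""
    exons, left = [], start
    for first, last in introns:
        exons.append((left, first - 1))   # exon ends just before the donor site
        left = last + 1                   # next exon starts after the acceptor
    exons.append((left, end))
    return exons


donors = [1084, 1407, 1662]       # first nucleotide of each intron
acceptors = [1304, 1451, 1913]    # last nucleotide of each intron
introns = introns_from_sites(donors, acceptors)
exons = exons_from_introns(1003, 1997, introns)   # 1003/1997: assumed CDS limits
coding_length = sum(b - a + 1 for a, b in exons)
```

With these inputs the exons are 1003-1083, 1305-1406, 1452-1661 and 1914-1997, for a total of 477 nucleotides, i.e. 159 codons. This would be consistent with a 159 AA protein if the stop codon lies just after position 1997 (an assumption).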
ID FGENESH Unreviewed; 159 AA.
SQ SEQUENCE 159 AA; 17780 MW; F9A2C7DE9614425C CRC64;
MKVETCVYSG YKIHPGHGKR LVRTDGKVQI FLSGKALKGA KLRRNPRDIR WTVLYRIKNK
KGTHGQEQVT RKKTKKSVQV VNRAVAGLSL DAILAKRNQT EDFRRQQREQ AAKIAKDANK
AVRAAKAAAN KEKKASQPKT QQKTAKNVKT AAPRVGGKR
//
ID GENESCAN1 Unreviewed; 159 AA.
SQ SEQUENCE 159 AA; 17780 MW; F9A2C7DE9614425C CRC64;
MKVETCVYSG YKIHPGHGKR LVRTDGKVQI FLSGKALKGA KLRRNPRDIR WTVLYRIKNK
KGTHGQEQVT RKKTKKSVQV VNRAVAGLSL DAILAKRNQT EDFRRQQREQ AAKIAKDANK
AVRAAKAAAN KEKKASQPKT QQKTAKNVKT AAPRVGGKR
//
ID GENESCAN2 Unreviewed; 202 AA.
SQ SEQUENCE 202 AA; 23684 MW; 98A69FA21823F2F3 CRC64;
MRTLRIAQYS VLTVGFAIYM YRLIEEIPID IRNLNSDSLE GIINSDELCD VTVSNRNRGL
LVRNDSLDLD ILKAKFTTFF SKRYLTRFLS EQVPFLHVID EALLVKRFVM CACFMVFCLT
VIWFLVIRRM GNLIKRLSVL NQLEDAESVE WARCIREFTQ EKLAVLCFCI VPPFAQTDKL
VSDKIKLFRE HKILRIRSVQ HI
//
ID GENEMARK1 Unreviewed; 184 AA.
SQ SEQUENCE 184 AA; 20255 MW; 85BB0234E6C14EA0 CRC64;
MGRCGSSGKR DGYGAKDSSS EGLSTMKVET CVYSGYKIHP GHGKRLVRTD GKVQIFLSGK
ALKGAKLRRN PRDIRWTVLY RIKNKKGTHG QEQVTRKKTK KSVQVVNRAV AGLSLDAILA
KRNQTEDFRR QQREQAAKIA KDANKAVRAA KAAANKEKKA SQPKTQQKTA KNVKTAAPRV
GGKR
//
ID GENEMARK2 Unreviewed; 183 AA.
SQ SEQUENCE 183 AA; 21336 MW; 64F65D472A58046E CRC64;
MRTLRIAQYS VLTVGFAIYM YRLIEEIPID IRNLNSDSLE GIINSDELCD VTVSNRNRGL
LVRNDSLDLD ILKAKFTTFF SKRYLTRFLS EQVPFLHVID EALLVKRFVM CACFMVFCLT
VIWFLVIRRM GNLIKRLSVL NQLEDAESVE WARCIREFTQ EKLAVLCFCI VPPFAQTDNV
QHI
//
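The predicted proteins can be compared programmatically. The FGENESH and GENESCAN1 entries carry the same CRC64 checksum, so they are identical; the sketch below checks how the GENEMARK1 prediction relates to them. The helper names are hypothetical, and the sequences are copied from the entries above with their spacing kept.

```python
FGENESH = """
MKVETCVYSG YKIHPGHGKR LVRTDGKVQI FLSGKALKGA KLRRNPRDIR WTVLYRIKNK
KGTHGQEQVT RKKTKKSVQV VNRAVAGLSL DAILAKRNQT EDFRRQQREQ AAKIAKDANK
AVRAAKAAAN KEKKASQPKT QQKTAKNVKT AAPRVGGKR
"""

GENEMARK1 = """
MGRCGSSGKR DGYGAKDSSS EGLSTMKVET CVYSGYKIHP GHGKRLVRTD GKVQIFLSGK
ALKGAKLRRN PRDIRWTVLY RIKNKKGTHG QEQVTRKKTK KSVQVVNRAV AGLSLDAILA
KRNQTEDFRR QQREQAAKIA KDANKAVRAA KAAANKEKKA SQPKTQQKTA KNVKTAAPRV
GGKR
"""


def clean(seq: str) -> str:
    """Remove the spaces and line breaks used for display."""
    return "".join(seq.split())


def n_terminal_extension(longer: str, shorter: str):
    """Return the extra N-terminal residues if `longer` ends with `shorter`."""
    longer, shorter = clean(longer), clean(shorter)
    if longer.endswith(shorter):
        return longer[:len(longer) - len(shorter)]
    return None  # the two sequences differ elsewhere


extension = n_terminal_extension(GENEMARK1, FGENESH)
```

Here `extension` is the 25-residue stretch `MGRCGSSGKRDGYGAKDSSSEGLST`: GeneMark predicts the same protein as FGENESH but with an earlier start.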
For fun… Compare the predictions of the same program (GeneMark) with different parameters (HMM trained on eukaryota or on prokaryota)
Gene 1 prediction with GeneMark (prokaryota specific; E. coli K12)
Two CDSs predicted: Protein 1 and Protein 2
Here a CDS corresponds roughly to an 'exon': prokaryotic genes have no introns!
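Summary (prokaryota gene prediction)
GeneMark (proka): Protein 1, positions 1254 to 1433; Protein 2, positions 1437 to 1688
Splice sites (DO: donor site, AC: acceptor site; score in parentheses): DO 1084 (1.00), AC 1304 (0.77), DO 1407 (0.89), AC 1451 (0.90), DO 1662 (1.00), AC 1913 (1.00)
[Figure: predictions drawn 5' to 3' for GenScan, GeneMark (euka), HMMgene, GeneBuilder, Netgene2 and GeneMark (proka); coordinates shown on the figure: 1003, 1083, 1305, 1406, 1452, 1557, 1661, 1914, 1997, 2000]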
Alignment between the sequences predicted with the 'eukaryota' and 'prokaryota' models
Gene prediction: similarity searches with ESTs
ESTs: expressed sequence tags (cDNAs that are sequenced quickly and with a high error rate)
Two genes found
BLAST 2012: Gene A, Gene B
BLAST 2010: Gene A, Gene B
Gene A
EST1
>gi|47590759|gb|BJ750997.1|BJ750997 BJ750997 unpublished oligo-capped cDNA library Caenorhabditis elegans cDNA clone yk1360e06 5', mRNA sequence
GGTTTAATTACCCAAGTTTGAGATTCGTCAAGCGAGGGCCTATCAGCAATGAAGGTCGAAACCTGCGTTT
ACTCCGGATACAAGATCCACCCAGGACACGGAAAGAGACTTGTCCGTACTGACGGAAAGGTGAGTTCAGT
TTCTCTTTGAAAGGCGTTAGCATGCTGTTAGAGCTCGTAAGGTATATTGTAATTTTACGAGTGTTGAAGT
ATTGCAAAAGTAAAGCATAATCACCTTATGTATGTGTTGGTGCTATATCTTCTAGTTTTTAGAAGTTATA
CCATCGTTAAGCATGCCACGTGTTGAGTGCGACAAACTACCGTTTCATGATTTATTTATTCAAATTTCAG
GTCCAAATCTTCCTCAGTGGAAAGGCACTCAAGGGAGCCAAGCTTCGCCGTAACCCACGTGACATCAGAT
GGACTGTCCTCTACAGAATCAAGAACAAGAAGGGAACCCACGGACAAGAGCAAGTCACCAGAAAGAAGAC
CAAGAAGTCCGTCCAGGTTGTTAACCGCGCCGTCGCTGGACTTTCCCTTGATGCTATCCTTGCCAAGAGA
AACCAGACCGAAGACTTCCGTCGCCAACAGCGTGAACAAGCCGCTAAGATCGCCAA
EST2
>gi|47646579|gb|BJ775052.1|BJ775052 BJ775052 unpublished oligo-capped cDNA library Caenorhabditis elegans cDNA clone yk1360e06 3', mRNA sequence
ATAACGGGACCGAGAACGTTTATCGCTTTCCTCCGACACGTGGAGCAGCAGTCTTCACATTCTTGGCGGT
CTTTTGCTGGGTCTTTGGCTGAGAGGCCTTCTTTTCCTTGTTGGCAGCAGCCTTGGCGGCACGGACAGCC
TTGTTGGCATCCTTGGCGATCTTAGCGGCTTGTTCACGCTGTTGGCGACGGAAGTCTTCGGTCTGGTTTC
TCTTGGCAAGGATAGCATCAAGGGAAAGTCCAGCGACGGCGCGGTTAACAACCTGGACGGACTTCTTGGT
CTTCTTTCTGGTGACTTGCTCTTGTCCGTGGGTTCCCTTCTTGTTCTTGATTCTGTAGAGGACAGTCCAT
CTGATGTCACGTGGGTTACGGCGAAGCTTGGCTCCCTTGAGTGCCTTTCCACTGAGGAAGATTTGGACCT
GAAATTTGAATAAATAAATCATGAAACGGTAGTTTGTCGCACTCAACACGTGGCATGCTTAACGATGGTA
TAACTTCTAAAAACTAGAAGATATAGCACCAACACATACATAAGGTGATTATGCTTTACTTTTGCAATAC
TTCAACACTCGTAAAATTACAATATACCTTACGAGCTCTAACAGCATGCTAACGCCTTTCAAAGAGAAAC
TGAACTCACCTTTCCGTCAGTACGGACAAGTCTCTTTCCGTGTCCTGGGTGGATCTTGTATCCGGAGTAA
ACGCAGGTTTCGACCTTCATTGCTGATANGCCCTCGCTTGACGAATCTCAAACTTGGGTAATTAAACCCC
A
EST3
>gi|47727995|gb|BJ818152.1|BJ818152 BJ818152 unpublished oligo-capped cDNA library, stage L4 Caenorhabditis elegans cDNA clone yk1685h11 3', mRNA sequence
TAACGGGACCGAGAACGTTTATCGCTTTCCTCCGACACGTGGAGCAGCAGTCTTCACATTCTTGGCGGTC
TTTTGCTGGGTCTTTGGCTGAGAGGCCTTCTTTTCCTTGTTGGCAGCAGCCTTGGCGGCACGGACAGCCT
TGTTGGCATCCTTGGCGATCTTAGCGGCTTGTTCACGCTGTTGGCGACGGAAGTCTTCGGTCTGGTTTCT
CTTGGCAAGGATAGCATCAAGGGAAAGTCCAGCGACGGCGCGGTTAACAACCTGGACGGACTTCTTGGTC
TTCTTTCTGGTGACTTGCTCTTGTCCGTGGGTTCCCTTCTTGTTCTTGATTCTGTAGAGGACAGTCCATC
TGATGTCACGTGGGTTACGGCGAAGCTTGGCTCCCTTGAGTGCCTTTCCACTGAGGAAGATTTGGACCTT
TCCGTCAGTACGGACAAGTCTCTTTCCGTGTCCTGGGTGGATCTTGTATCCGGAGTAAACGCAGGTTTCG
ACCTTCATTGTTGATAGGCCCTCGCTTGACGAATCTCAAACTTGGGTAATTAAACCTACAAATAAAAATG
AGATAAAGCATACTGCCATTCTACAACCGGAGAATAAGAAAACCGAAAACGAGAAAATTATTCTATTATG
ACAGATAGAATAAGTTAAAATGGGAAGAGTGCATTTGTCACTGATTTACTTGGTGACTTGGTGGAGAGCG
TGGGCAAGGTAAGCGACATTGTTCGATGAA
BLAST result with EST1
Aligned regions: 975-1407, 1450-1615, 1692-1865
BUT: BLAST does not take intron-exon boundaries into account when aligning genomic DNA with a cDNA (mRNA) sequence -> we have to use a dedicated tool: SIM4
The third part of EST1 is of very poor quality.
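A short sketch shows why the BLAST coordinates are not reliable splice boundaries. It computes the unaligned gaps between the three BLAST segments and compares the first gap with the intron delimited by the splice sites reported earlier (DO 1407, AC 1451). The function name is hypothetical, and attributing that intron to this gap is an assumption made for illustration.

```python
def gaps_between(hsps):
    """Unaligned genomic intervals between consecutive aligned segments."""
    hsps = sorted(hsps)
    return [(a_end + 1, b_start - 1)
            for (_, a_end), (b_start, _) in zip(hsps, hsps[1:])
            if b_start - a_end > 1]


blast_hsps = [(975, 1407), (1450, 1615), (1692, 1865)]
gaps = gaps_between(blast_hsps)

predicted_intron = (1407, 1451)   # donor and acceptor positions from the summary
# Difference between the BLAST gap and the predicted intron boundaries
offsets = (gaps[0][0] - predicted_intron[0], gaps[0][1] - predicted_intron[1])
```

The first gap is 1408-1449, while the predicted intron is 1407-1451: BLAST extends the alignment by one and two nucleotides into the intron. The second gap (1616-1691) precedes the poor-quality part of EST1 and should not be read as an intron.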
SIM4 alignment
Example with EST1 BJ750997 (partial)
The third part of EST1 is of very poor quality and is not aligned by SIM4 -> EST1 is considered partial!
SIM4 alignment results
EST1 BJ750997 (partial), EST2 BJ775052, EST3 BJ818152
Gene A summary (ESTs)
[Figure: EST1 BJ750997.1, EST2 BJ775052.1 and EST3 BJ818152.1 aligned on the gene, 5' to 3'; coordinates shown on the figure: 1003, 1083, 1305, 1406, 1452, 1615 …, 1661, 1914, 1997]
Alternative splicing event (intron retention) -> 2 different mRNAs (EST BJ750997.1 is partial)
Translation and BLASTp
Translation: beware of the EST sequence orientation!
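The orientation issue can be illustrated with a minimal six-frame translation. This is a sketch that uses the standard genetic code and hypothetical function names, not the translation tool used in the practical. A 3' EST read is typically the reverse complement of the mRNA, so its protein appears in one of the minus frames.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA sequence (N is kept as N)."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def translate(seq: str, frame: int = 0) -> str:
    """Translate from the given frame offset (0, 1 or 2); unknown codons give X."""
    s = seq.upper()
    return "".join(CODON_TABLE.get(s[i:i + 3], "X")
                   for i in range(frame, len(s) - 2, 3))


def six_frames(seq: str) -> dict:
    """Translations in the three forward (+) and three reverse (-) frames."""
    fwd, rev = seq.upper(), reverse_complement(seq)
    frames = {f"+{f + 1}": translate(fwd, f) for f in range(3)}
    frames.update({f"-{f + 1}": translate(rev, f) for f in range(3)})
    return frames
```

For example, the fragment `ATGAAGGTCGAAACCTGCGTTTACTCCGGA` from EST1 (5' read) translates directly to `MKVETCVYSG`, whereas the matching fragment of EST2 (3' read), `TCCGGAGTAAACGCAGGTTTCGACCTTCAT`, gives the same peptide only in frame -1.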
EST1
>gi|47590759|gb|BJ750997.1|BJ750997 BJ750997 unpublished oligo-capped cDNA library Caenorhabditis elegans cDNA clone yk1360e06 5', mRNA sequence
GGTTTAATTACCCAAGTTTGAGATTCGTCAAGCGAGGGCCTATCAGCAATGAAGGTCGAAACCTGCGTTT
ACTCCGGATACAAGATCCACCCAGGACACGGAAAGAGACTTGTCCGTACTGACGGAAAGGTGAGTTCAGT
TTCTCTTTGAAAGGCGTTAGCATGCTGTTAGAGCTCGTAAGGTATATTGTAATTTTACGAGTGTTGAAGT
ATTGCAAAAGTAAAGCATAATCACCTTATGTATGTGTTGGTGCTATATCTTCTAGTTTTTAGAAGTTATA
CCATCGTTAAGCATGCCACGTGTTGAGTGCGACAAACTACCGTTTCATGATTTATTTATTCAAATTTCAG
GTCCAAATCTTCCTCAGTGGAAAGGCACTCAAGGGAGCCAAGCTTCGCCGTAACCCACGTGACATCAGAT
GGACTGTCCTCTACAGAATCAAGAACAAGAAGGGAACCCACGGACAAGAGCAAGTCACCAGAAAGAAGAC
CAAGAAGTCCGTCCAGGTTGTTAACCGCGCCGTCGCTGGACTTTCCCTTGATGCTATCCTTGCCAAGAGA
AACCAGACCGAAGACTTCCGTCGCCAACAGCGTGAACAAGCCGCTAAGATCGCCAA
MIYLFKFQVQIFLSGKALKGAKLRRNPRDIRWTVLYRIKNKKGTHGQEQVTRKKTKKSVQ
VVNRAVAGLSLDAILAKRNQTEDFRRQQREQAAKIA
Blastp results
EST2
>gi|47646579|gb|BJ775052.1|BJ775052 BJ775052 unpublished oligo-capped cDNA library Caenorhabditis elegans cDNA clone yk1360e06 3', mRNA sequence
ATAACGGGACCGAGAACGTTTATCGCTTTCCTCCGACACGTGGAGCAGCAGTCTTCACATTCTTGGCGGT
CTTTTGCTGGGTCTTTGGCTGAGAGGCCTTCTTTTCCTTGTTGGCAGCAGCCTTGGCGGCACGGACAGCC
TTGTTGGCATCCTTGGCGATCTTAGCGGCTTGTTCACGCTGTTGGCGACGGAAGTCTTCGGTCTGGTTTC
TCTTGGCAAGGATAGCATCAAGGGAAAGTCCAGCGACGGCGCGGTTAACAACCTGGACGGACTTCTTGGT
CTTCTTTCTGGTGACTTGCTCTTGTCCGTGGGTTCCCTTCTTGTTCTTGATTCTGTAGAGGACAGTCCAT
CTGATGTCACGTGGGTTACGGCGAAGCTTGGCTCCCTTGAGTGCCTTTCCACTGAGGAAGATTTGGACCT
GAAATTTGAATAAATAAATCATGAAACGGTAGTTTGTCGCACTCAACACGTGGCATGCTTAACGATGGTA
TAACTTCTAAAAACTAGAAGATATAGCACCAACACATACATAAGGTGATTATGCTTTACTTTTGCAATAC
TTCAACACTCGTAAAATTACAATATACCTTACGAGCTCTAACAGCATGCTAACGCCTTTCAAAGAGAAAC
TGAACTCACCTTTCCGTCAGTACGGACAAGTCTCTTTCCGTGTCCTGGGTGGATCTTGTATCCGGAGTAA
ACGCAGGTTTCGACCTTCATTGCTGATANGCCCTCGCTTGACGAATCTCAAACTTGGGTAATTAAACCCC
A
MIYLFKFQVQIFLSGKALKGAKLRRNPRDIRWTVLYRIKNKKGTHGQEQVTRKKTKKSVQ
VVNRAVAGLSLDAILAKRNQTEDFRRQQREQAAKIAKDANKAVRAAKAAANKEKKASQPK
TQQKTAKNVKTAAPRVGGKR
Blastp results
EST3
>gi|47727995|gb|BJ818152.1|BJ818152 BJ818152 unpublished oligo-capped cDNA library, stage L4 Caenorhabditis elegans cDNA clone yk1685h11 3', mRNA sequence
TAACGGGACCGAGAACGTTTATCGCTTTCCTCCGACACGTGGAGCAGCAGTCTTCACATTCTTGGCGGTC
TTTTGCTGGGTCTTTGGCTGAGAGGCCTTCTTTTCCTTGTTGGCAGCAGCCTTGGCGGCACGGACAGCCT
TGTTGGCATCCTTGGCGATCTTAGCGGCTTGTTCACGCTGTTGGCGACGGAAGTCTTCGGTCTGGTTTCT
CTTGGCAAGGATAGCATCAAGGGAAAGTCCAGCGACGGCGCGGTTAACAACCTGGACGGACTTCTTGGTC
TTCTTTCTGGTGACTTGCTCTTGTCCGTGGGTTCCCTTCTTGTTCTTGATTCTGTAGAGGACAGTCCATC
TGATGTCACGTGGGTTACGGCGAAGCTTGGCTCCCTTGAGTGCCTTTCCACTGAGGAAGATTTGGACCTT
TCCGTCAGTACGGACAAGTCTCTTTCCGTGTCCTGGGTGGATCTTGTATCCGGAGTAAACGCAGGTTTCG
ACCTTCATTGTTGATAGGCCCTCGCTTGACGAATCTCAAACTTGGGTAATTAAACCTACAAATAAAAATG
AGATAAAGCATACTGCCATTCTACAACCGGAGAATAAGAAAACCGAAAACGAGAAAATTATTCTATTATG
ACAGATAGAATAAGTTAAAATGGGAAGAGTGCATTTGTCACTGATTTACTTGGTGACTTGGTGGAGAGCG
TGGGCAAGGTAAGCGACATTGTTCGATGAA
Gene A
EST1 is partial at the C-terminus.
EST3 corresponds to the UniProtKB/Swiss-Prot RL24_CAEEL sequence.