1 / 15

Publish.pl

Publish.pl. Steven Githens FW4089 Spring 2003. The Problem. Task: Format a sequence with multiple exons and their translations so that the amino acids line up with their corresponding codons. It should look nice too.

carina
Download Presentation

Publish.pl

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Publish.pl Steven Githens FW4089 Spring 2003

  2. The Problem • Task: Format a sequence with multiple exons and their translations so that the amino acids line up with their corresponding codons. It should look nice too. • Current Solution: Use GCG’s publish command, over and over again, manually adjusting exon boundaries, and then manually fix phase mismatches. • Natural Conclusion: This is what computers are for. • Objective: Write a formatting program to do all this and have it be easily extendable. • Tools for the Job: Perl and Bioperl

  3. Perl – Practical Extraction and Report Language • Makes manipulating text really easy. Example: @files = split(/,/, “one.txt,two.txt,three.txt”) • Good Regular Expression support Like DOS wildcards(*,?) and GCG’s findpatterns, with more power Examples: /(Fred|Wilma|Pebbles) Flinstone/ s/^([^ ]+) +([^ ]+)/$2 $1/ swap first two words /.{80,90}/ find line of length 80 to 90 /\b(\w\S+)(\s+\1) +\b/xig find repeat words(Is is this OK?) • Genetic information is itself a language and easily expressed as text. It makes sense that Perl would be a good choice to use for working with sequences.

  4. Bioperl.org • Makes manipulating sequence data really easy. • Contains modules for • Sequence formats • Sequences and annotations • Restriction Enzymes • Remote Databases • Using external programs like clustalW • Using online queries like blast • Much much more

  5. Converting sequence formats use Bio::SeqIO; my $in = Bio::SeqIO->newFh ( -file => '<seqs.sp', -format => ‘genbank' ); my $out = Bio::SeqIO->newFh ( -file => '>seqs.fasta', -format => ‘gcg' ); print $out $_ while <$in>;

  6. Statistics: $seq_stats = Bio::Tools::SeqStats->new(-seq=>$seqobj); $seq_stats->count_monomers(); $seq_stats->count_codons(); $weight = $seq_stats->get_mol_wt($seqobj); Restriction Enzymes: $re1 = new Bio::Tools::RestrictionEnzyme(-NAME =>'EcoRI'); @fragments = $re1->cut_seq($seqobj); $re2 = new Bio::Tools::RestrictionEnzyme( -NAME =>'EcoRV--GAT^ATC', -MAKE =>'custom');

  7. publish.pl Usage: perl publish.pl Options: --infile: sequence to publish in genbank format --outfile: filename for output, default is publish.output --source: Where to get exon information: all, manual, genbank, or genscan --exons: actual exon information, depends on value for –source --codontable: changes the codon table used (can vary between species and organelles) defaults to the standard codon table

  8. Using all perl publish.pl --infile af027172.genbank --source all --outfile publish_all.output publish_all.output: . . . . . . 1 CCTGAGCCGGAGTCCTATTCAATTATCTAGAAGAAGTCTGAGCCGGAGTCCCACTCGATT 60 1 P E P E S Y S I I * K K S E P E S H S I 20 . . . . . . 61 GTCTAGGAGAAGCCTAAGCCGGAGTCCCATTCGATCACCTAGGAAGAGTGTGAGCAGGAG 120 21 V * E K P K P E S H S I T * E E C E Q E 40 . . . . . . 121 TCCAGTCCGATCATCTAGGAAGAGTGTGAGCAGAAGTCCGGTTCGTTCATCCAGGAGACG 180 41 S S P I I * E E C E Q K S G S F I Q E T 60 . . . . . . 181 TATCAGCAGGAGTCCAGTCCGATCATCTAGGAAGAGTGTGAGCAGGAGTCCTATTCGATT 240 61 Y Q Q E S S P I I * E E C E Q E S Y S I 80 . . . . . . 241 GTCCAGAAGAAGTATCAGCAGGAGTCCTATTCGATTGTCCAGGAGAAGTATCAGCAGGAG 300 81 V Q K K Y Q Q E S Y S I V Q E K Y Q Q E 100 . . . . . . 301 TCCTGTTAGAGGAAGAAGAAGAATTAGCAGAAGTCCAGTTCCGGCAAGGAGAAGGAGTGT 360 101 S C * R K K K N * Q K S S S G K E K E C 120 . . . . . . 361 GCGGCCTAGATCTCCTCCTCCTGACCGCAGAAGAAGTTTGTCAAGAAGTGCTTCTCCTAA 420 121 A A * I S S S * P Q K K F V K K C F S * 140 . . . . . . 421 TGGGCGCATAAGGAGAGGGAGAGGATTTAGCCAAAGATTCTCATACGCCCGTCGATACAG 480 141 W A H K E R E R I * P K I L I R P S I Q 160 . . . . . . 481 AACTAGTCCATCTCCTGATCGATCTCCTTATCGCTTTAGTGATAGGAGTGACCGTGACAG 540 161 N * S I S * S I S L S L * * * E * P * Q 180 . . . . . . 541 GTGAATAGCCCACACATAATATAACTCCCCCTTTCTGTTACACACTCTCGTACTGAACCG 600 181 V N S P H I I * L P L S V T H S R T E P 200

  9. Using manual perl publish.pl --infile af027172.genbank --source manual –exons 2286..2366,2894..3089,3188..3360,3584..3698,3814..4003,4171..4437,4773..5118,5197..5334,5416..5541,5693..5905,6012..6276,6364..6560,6645..6995,7078..7665 publish.output: . . . . . . 2881 GGTCAAAGTGCAGACCAAACCTTTGAAGAATATGAATGGCCAGATATGTCAGATCTGTGG 2940 28 T K P L K N M N G Q I C Q I C G 43 . . . . . . 2941 TGATGATGTTGGACTCGCTGAAACTGGAGATGTCTTTGTCGCGTGTAATGAATGTGCCTT 3000 44 D D V G L A E T G D V F V A C N E C A F 63 . . . . . . 3001 CCCTGTGTGTCGGCCTTGCTATGAGTACGAGAGGAAAGATGGAACTCAGTGTTGCCCTCA 3060 64 P V C R P C Y E Y E R K D G T Q C C P Q 83 . . . . . . 3061 ATGCAAGACTAGATTCAGACGACACAGGGGTCAGTTGTCTTTTTCTTTTTGTTGGCAATT 3120 84 C K T R F R R H R G 93 . . . . . . 3121 GCTATATATGGATTTTCTCTTTTTGTTTCTTTGCTGTTGTGTTGAACAATTTTTTGGAAT 3180 . . . . . . 3181 TTTCCAGGGAGTCCTCGTGTTGAAGGAGATGAAGATGAGGATGATGTTGATGATATCGAG 3240 94 S P R V E G D E D E D D V D D I E 110 . . . . . . 3241 AATGAGTTCAATTACGCCCAGGGAGCTAACAAGGCGAGACACCAACGCCATGGCGAAGAG 3300 111 N E F N Y A Q G A N K A R H Q R H G E E 130 . . . . . . 3301 TTTTCTTCTTCCTCTAGACATGAATCTCAACCAATTCCTCTTCTCACCCATGGCCATACG 3360 131 F S S S S R H E S Q P I P L L T H G H T 150 . . . . . . 3361 GTAGGGACCTACATTTTCCCTTTAGACTCTAGAGTGATTTGTATTACTCAATAAATCCCT 3420 . . . . . . 3421 AGAGTGGTCATTTATTACTTACTATTCACGTTAATGTTATATGTGAACAAATCTTAACAG 3480 . . . . . . 3481 AATTTTTTTCTGATAGTACATGGTCATCCAAATTAAGAAATAATAATAGATGTTGTTAGT 3540 . . . . . . 3541 TGTGTCTGTTTTCAATAGATTCATGACCTTTTTCTATACACAGGTTTCTGGAGAGATTCG 3600 151 V S G E I R 156 . . . . . . 3601 CACGCCTGATACACAATCTGTGCGAACTACATCAGGTCCTTTGGGTCCTTCTGACAGGAA 3660 157 T P D T Q S V R T T S G P L G P S D R N 176 . . . . . . 3661 TGCTATTTCATCTCCATATATTGATCCACGGCAACCTGGTATTCATATGTTTTTCCCTTG 3720 177 A I S S P Y I D P R Q P V 189 . . . . . . 3721 TGCACGTGGTCTTTGTTAAATGTGATTCCTATTCATTTTTACAACATATATATTTTGTGT

  10. Using Genbank perl publish.pl --infile af027172.genbank --source genbank Gets exons from: FEATURES Location/Qualifiers source 1..8401 /organism="Arabidopsis thaliana" /mol_type="genomic DNA" /cultivar="Columbia" /db_xref="taxon:3702" /chromosome="4" /map="4; between g8300 and AtKC1/mi431" gene <2286..>7665 /gene="RSW1" mRNA join(<2286..2366,2894..3089,3188..3360,3584..3698, 3814..4003,4171..4437,4773..5118,5197..5334,5416..5541, 5693..5905,6012..6276,6364..6560,6645..6995,7078..>7665) /gene="RSW1" /product="cellulose synthase catalytic subunit" CDS join(2286..2366,2894..3089,3188..3360,3584..3698, 3814..4003,4171..4437,4773..5118,5197..5334,5416..5541, 5693..5905,6012..6276,6364..6560,6645..6995,7078..7665) /gene="RSW1" /codon_start=1 /product="cellulose synthase catalytic subunit" /protein_id="AAC39334.1" /db_xref="GI:2827139" /translation="MEASAGLVAGSYRRNELVRIRHESDGGTKPLKNMNGQICQICGD

  11. Using Genscan Output perl publish.pl --infile af027172.genbank --source genscan --exons genscan_selected.output genscan_selected.output: GENSCAN 1.0 Date run: 27-Apr-103 Time: 20:35:16Sequence 20:35:16 : 8401 bp : 40.02% C+G : Isochore 1 ( 0 - 43 C+G%)Parameter matrix: Arabidopsis.smatPredicted genes/exons:Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------ 2.01 Init + 2286 2366 81 2 0 89 9 213 0.998 19.52 2.02 Intr + 2894 3089 196 1 1 9 44 218 0.932 12.57 2.03 Intr + 3188 3360 173 0 2 66 66 202 0.999 19.54 2.04 Intr + 3584 3698 115 1 1 79 26 124 0.942 9.40 2.05 Intr + 3814 4003 190 2 1 41 101 137 0.990 13.02 2.06 Intr + 4171 4437 267 1 0 99 81 241 0.985 25.22 2.07 Intr + 4773 5118 346 0 1 80 36 278 0.981 21.27 2.08 Intr + 5197 5334 138 0 0 80 92 83 0.999 12.64 2.09 Intr + 5416 5541 126 0 0 84 10 73 0.943 4.06 2.10 Intr + 5693 5905 213 1 0 82 76 180 0.999 19.19 2.11 Intr + 6012 6276 265 2 1 84 42 187 0.933 14.86 2.12 Intr + 6364 6560 197 2 2 70 78 194 0.999 19.81 2.13 Intr + 6645 6995 351 2 0 50 68 152 0.873 8.99 2.14 Term + 7078 7665 588 0 0 75 41 573 0.964 49.83Predicted peptide sequence(s):Predicted coding sequence(s):>20:35:16|GENSCAN_predicted_peptide_1|248_aaMTVATNPFFPPGLLSEDVESDFDGDLALDGDLDTDLAFPVPMGDFEYDFDLDLDLDGLRLWLADGDLLFLGLVNRLGENLLRLLNLQRRRYKKDGSVRECVTERGSYIMCGLFTCHGEALLDKLLLRSGGGDLGRTLLLLAGTGLLLILLLPLTGLLLILLLDNRIGLLLILLLDNRIGLLLTLFLDDRTGLLLIRLLDERTGLLLTLFLDDRTGLLLTLFLGDRMGLRLRLLLDNRVGLRLRLLLDN>20:35:16|GENSCAN_predicted_CDS_1|747_bpatgacggtggctactaatccctttttcccacctgggctattatccgaggatgtcgaatccgacttcgatggagaccttgccttggatggtgatcttgacactgatcttgctttccccgtcccaatgggagattttgaatatgactttgatcttgatcttgacctcgatggactcaggctgtggctagcggatggtgaccttctctttcttgggcttgtgaaccgacttggcgagaaccttctgcgacttctaaatctacagagaagacgttataaaaaagacggttcagtacgagagtgtgtaacagaaagggggagttatattatgtgtgggctattcacctgtcacggagaagcacttcttgacaaacttcttctgcggtcaggaggaggagatctaggccgcacactccttctccttgccggaactggacttctgctaattcttcttcttcctctaacaggactcctgctgatacttctcctggacaatcgaataggactcctgctgatacttcttctggacaatcgaataggactcctgctcacactcttcctagatgatcggactggactcctgctgatacgtctcctggatgaacgaaccggacttctgctcacactcttcctagatgatcggactggactcctgctcacactcttcctaggtgatcgaatgggactccggcttaggcttctcctagacaatcgagtgggactccggctcagacttcttctagataattga>20:35:16|GENSCAN_predicted_peptide_2|1081_aaMEASAGLVAGSYRRNELVRIRHESDGGTKPLKNMNGQICQICGDDVGLAETGDVFVACNECAFPVCRPCYEYERKDGTQCCPQCKTRFRRHRGSPRVEGDEDEDDVDDIENEFNYAQGANKARHQRHGEEFSSSSRHESQPIPLLTHGHTVSGEIRTPDTQSVRTTSGPLGPSDRNAISSPYIDPRQPVPVRIVDPSKDLNSYGLGNVDWKERVEGWKLKQEKNMLQMTGKYHEGKGGEIEGTGSNGEELQMADDTRLPMSRVVPIPSSRLTPYRVVIILRLIILCFFLQYRTTHPVKNAYPLWLTSVICEIWFAFSWLLDQFPKWYPINRETYLDRLAIRYDRDGEPSQLVPVDVFVSTVDPLKEPPLVTANTVLSILSVDYPVDKVACYVSDDGSAMLTFESLSETAEFAKKWVPFCKKFNIEPRAPEFYFAQKIDYLKDKIQPSFVKERRAMKREYEEFKVRINALVAKAQKIPEEGWTMQDGTPWPGNNTRDHPGMIQVFLGHSGGLDTDGNELPRLIYVSREKRPGFQHHKKAGAMNALIRVSAVLTNGAYLLNVDCDHYFNNSKAIKEAMCFMMDPAIGKKCCYVQFPQRFDGIDLHDRYANRNIVFFDINMKGLDGIQGPVYVGTGCCFNRQALYGYDPVLTEEDLEPNIIVKSCCGSRKKGKSSKKYNYEKRRGINRSDSNAPLFNMEDIDEGFEGYDDERSILMSQRSVEKRFGQSPVFIAATFMEQGGIPPTTNPATLLKEAIHVISCGYEDKTEWGKEIGWIYGSVTEDILTGFKMHARGWISIYCNPPRPAFKGSAPINLSDRLNQVLRWALGSIEILLSRHCPIWYGYHGRLRLLERIAYINTIVYPITSIPLIAYCILPAFCLITDRFIIPEISNYASIWFILLFISIAVTGILELRWSGVSIEDWWRNEQFWVIGGTSAHLFAVFQGLLKVLAGIDTNFTVTSKATDEDGDFAELYIFKWTALLIPPTTVLLVNLIGIVAGVSYAVNSGYQSWGPLFGKLFFALWVIAHLYPFLKGLLGRQNRTPTIVIVWSVLLASIFSLLWVRINPFVDANPNANNFNGKGGVF>20:35:16|GENSCAN_predicted_CDS_2|3246_bpatggaggccagtgccggcttggttgctggatcctaccggagaaacgagctcgttcggatccgacatgaatctgatggcgggaccaaacctttgaagaatatgaatggccagatatgtcagatctgtggtgatgatgttggactcgctgaaactggagatgtctttgtcgcgtgtaatgaatgtgccttccctgtgtgtcggccttgctatgagtacgagaggaaagatggaactcagtgttgccctcaatgcaagactagattcagacgacacagggggagtcctcgtgttgaaggagatgaagatgaggatgatgttgatgatatcgagaatgagttcaattacgcccagggagctaacaaggcgagacaccaacgccatggcgaagagttttcttcttcctctagacatgaatctcaaccaattcctcttctcacccatggccatacggtttctggagagattcgcacgcctgatacacaatctgtgcgaactacatcaggtcctttgggtccttctgacaggaatgctatttcatctccatatattgatccacggcaacctgtccctgtaagaatcgtggacccgtcaaaagacttgaactcttatgggcttggtaatgttgactggaaagaaagagttgaaggctggaagctgaagcaggagaaaaatatgttacagatgactggtaaataccatgaagggaaaggaggagaaattgaagggactggttccaatggcgaagaactccaaatggctgatgatacacgtcttcctatgagtcgtgtggtgcctatcccatcttctcgcctaaccccttatcgggttgtgattattctccggcttatcatcttgtgtttcttcttgcaatatcgtacaactcaccctgtgaaaaatgcatatcctttgtggttgacctcggttatctgtgagatctggtttgcattttcttggcttcttgatcagtttcccaaatggtaccccattaacagggagacttatcttgaccgtctcgctataagatatgatcgagacggtgaaccatcacagctcgttcctgttgatgtgtttgttagtacagtggacccattgaaagagcctccccttgttacagcaaacacagttctctcgattctttctgtggactacccggtagataaagtagcctgttatgtttcagatgatggttcagctatgcttacctttgaatccctttctgaaaccgctgagtttgcaaagaaatgggtaccattttgcaagaaattcaacattgaacctagggcccctgaattctattttgcccagaagatagattacttgaaggacaagatccaaccgtcttttgttaaagagcgacgagctatgaagagagagtatgaagagtttaaagtgaggataaatgctcttgttgccaaagcacagaaaatccctgaagaaggctggacaatgcaggatggtactccctggcctggtaacaacactagagatcatcctggaatgatacaggtgttcttaggccatagtgggggtctggataccgatggaaatgagctgcctagactcatctatgtttctcgtgaaaagcggcctggatttcaacaccacaaaaaggctggagctatgaatgcattgatccgtgtatctgctgttcttaccaatggagcatatcttttgaacgtggattgtgatcattactttaataacagtaaggctattaaagaagctatgtgtttcatgatggacccggctattggaaagaagtgctgctatgtccagttccctcaacgttttgacggtattgatttgcacgatcgatatgccaacaggaatatagtctttttcgatattaacatgaaggggttggatggtatccagggtccagtatatgtgggtactggttgttgttttaataggcaggctctatatgggtatgatcctgttttgacggaagaagatttagaaccaaatattattgtcaagagctgttgcgggtcaaggaagaaaggtaaaagtagcaagaagtataactacgaaaagaggagaggcatcaacagaagtgactccaatgctccacttttcaatatggaggacatcgatgagggttttgaaggttatgatgatgagaggtctattctaatgtcccagaggagtgtagagaagcgttttggtcagtcgccggtatttattgcggcaaccttcatggaacaaggcggcattccaccaacaaccaatcccgctactcttctgaaggaggctattcatgttataagctgtggttacgaagacaagactgaatggggcaaagagattggttggatctatggttccgtgacggaagatattcttactgggttcaagatgcatgcccggggttggatatcgatctactgcaatcctccacgccctgcgttcaagggatctgcaccaatcaatctttctgatcgtttgaaccaagttcttcgatgggctttgggatctatcgagattcttcttagcagacattgtcctatctggtatggttaccatggaaggttgagacttttggagaggatcgcttatatcaacaccatcgtctatcctattacatccatccctcttattgcgtattgtattcttcccgctttttgtctcatcaccgacagattcatcatacccgagataagcaactacgcgagtatttggttcattctactcttcatctcaattgctgtgactggaatcctggagctgagatggagcggtgtgagcattgaggattggtggaggaacgagcagttctgggtcattggtggcacatccgcccatctttttgctgtcttccaaggtctacttaaggttcttgctggtatcgacaccaacttcaccgttacatctaaagccacagacgaagatggggattttgcagaactctacatcttcaaatggacagctcttctcattccaccaaccaccgtcctacttgtgaacctcataggcattgtggctggtgtctcttatgctgtaaacagtggctaccagtcgtggggtccgcttttcgggaagctcttcttcgccttatgggttattgcccatctctaccctttcttgaaaggtctgttgggaagacaaaaccgaacaccaaccatcgtcattgtctggtctgttcttctcgcctccatcttctcgttgctttgggtcaggatcaatccctttgtggacgccaatcccaatgccaacaacttcaatggcaaaggaggtgtcttttag

  12. Overall Algorithm • Determine the exon source • Assemble the nucleotide base string. • Assemble a list of exons. • Assemble the amino acid string in proportion to the base string using the exon list. • Format and print the strings.

  13. #Translate Exons and match them with Genomic Sequence my $leftover_text = ""; my $leftover_amount = 0; my $leftover_counter = 0; foreach my $single_exon (@all_exons) { my @exon_bounds = split /\.\./, $single_exon; my $exon_start = shift @exon_bounds; my $exon_end = shift @exon_bounds; my $counter = $exon_start; my $pos; if ($leftover_amount > 0) { if ($leftover_amount == 1) { my $split_codon = $leftover_text . substr($seqtext,$counter-1,2); my $split_aa = $codon_table->translate($split_codon); $pos = $leftover_counter-1; $proteintext =~ s/(.{$pos})(.)/$1$split_aa/; $counter = $counter + 2; $leftover_amount = 0; } elsif ($leftover_amount == 2) { my $split_codon = $leftover_text . substr($seqtext,$counter-1,1); my $split_aa = $codon_table->translate($split_codon); $pos = $leftover_counter-1; $proteintext =~ s/(.{$pos})(.)/$1$split_aa/; $counter = $counter + 1; $leftover_amount = 0; }} while ($counter <= $exon_end) { if ($counter + 2 <= $exon_end) { my $amino_acid = $codon_table->translate(substr($seqtext,$counter-1,3)); $pos = $counter-1; $proteintext =~ s/(.{$pos})(.)/$1$amino_acid/; $counter = $counter + 3; $leftover_amount = 0; } else { $leftover_amount = $exon_end-$counter+1; $leftover_text = substr($seqtext,$counter-1,$exon_end-$counter+1); $leftover_counter = $counter; $counter = $counter + 3; } } } } Phase Algorithm (for each exon)

  14. Room For Improvement • More exon sources • Other input formats, gcg, fasta, etc. • Various Formating options • uppercase/lowercase • linelength, numbering schemes • other formatting options like those in GCG’s publish

  15. Sources • http://www.bioperl.org • Bioperl Course • http://www.pasteur.fr/recherche/unites/sis/formation/bioperl/ • Programming Perl, Wall, Christiansen, Schwartz • GCG Documentation • http://forestry.mtu.edu/manuals/gcg/index.htm • AF027172, Arioli,T., Peng,L., Betzner,A.S., Burn,J., Wittke,W., Herth,W., Camilleri,C., Hofte,H., Plazinski,J., Birch,R., Cork,A., Glover,J., Redmond,J. and Williamson,R.E.

More Related