Using Perl

Using Perl Paul Tymann Computer Science Department Rochester Institute of Technology ptt@cs.rit.edu

Regular Expressions • To review your knowledge of regular expressions from the previous session… • There is redundancy in the genetic code • GCU, GCC, GCA, and GCG all encode Alanine • Write a series of regular expressions that could be used to match a codon to the amino acid it encodes • For example • (GC.) would match all the codons that encode Alanine

Fill In The Blanks

Answers

Using Files • In order to use a file you must associate a handle with the file using the open function open OUT, “>Results” open FROM, “<data” • Open returns a boolean value that indicates whether the file was opened or not • The die statement can be used to print an error message and terminate the program open OUT,”>Results” or die “Unable to open file ($!);

Read & Writing • To print text to a file, use the print statement • print OUT “Some output”; • To read text from a file • $line = <IN> • $line = readline(IN) • Note that these techniques only work with text files

Open the file whose name is the first command line argument As long as we have not hit the end of the file Print a message and terminate if it cannot be opened Read the next line Close the files Write the line to the copy copyfile.perl open INFILE,"<$ARGV[0]" or die "Unable to open input file ($!)"; open OUTFILE,">$ARGV[1]" or die "Unable to open output file ($!)"; while ( !eof( INFILE ) ) { $line = readline( INFILE ); print OUTFILE $line; } close INFILE; close OUTFILE;

copyfile2.perl open INFILE,"<$ARGV[0]" or die "Unable to open input file ($!)"; open OUTFILE,">$ARGV[1]" or die "Unable to open output file ($!)"; while ( <INFILE> ) { print OUTFILE; } close INFILE; close OUTFILE;

Files and Lists • It is possible to read a file directly into a list • Each line in the file is an element of the list • This is useful if you need to process the data in a file several times • You only read it once and then work with the copy in memory open INFILE,"<$ARGV[0]" or die "Unable to open input file ($!)"; @filedata = <INFILE>;

Parsing Genbank Records • Recall that perl was designed to munge the output of another program • When you do a BLAST search you get quite a bit of information, but often you only want a small part of it • Perl is well suited for stripping out the part of the record that you are interested in

Genbank Record LOCUS AF165912 5485 bp DNA linear PLN 29-JUL-1999 DEFINITION Arabidopsis thaliana CTP:phosphocholine cytidylyltransferase (CCT) gene, complete cds. ACCESSION AF165912 VERSION AF165912.1 GI:5640000 KEYWORDS . SOURCE Arabidopsis thaliana. ORGANISM Arabidopsis thaliana Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; Rosidae; eurosids II; Brassicales; Brassicaceae; Arabidopsis. REFERENCE 1 (bases 1 to 5485) AUTHORS Choi,Y.H., Choi,S.B. and Cho,S.H. TITLE Structure of a CTP:Phosphocholine Cytidylyltransferase Gene from Arabidopsis thaliana JOURNAL Unpublished REFERENCE 2 (bases 1 to 5485) AUTHORS Choi,Y.H., Choi,S.B. and Cho,S.H. TITLE Direct Submission JOURNAL Submitted (06-JUL-1999) Biology, Inha University, Yonghyon-Dong 253, Inchon 402-751, Korea

Genbank Record FEATURES Location/Qualifiers source 1..5485 /organism="Arabidopsis thaliana" /cultivar="Columbia Col-O" /db_xref="taxon:3702" gene 1..4637 /gene="CCT" promoter 1..1602 /gene="CCT" TATA_signal 1554..1560 /gene="CCT" mRNA join(1603..1891,2322..2438,2538..2633,2801..2843, 2918..3073,3167..3247,3874..3972,4082..4637) /gene="CCT" /product="CTP:phosphocholine cytidylyltransferase" 5'UTR 1603..1712 /gene="CCT"

Genbank Record CDS join(1713..1891,2322..2438,2538..2633,2801..2843, 2918..3073,3167..3247,3874..3972,4082..4309) /gene="CCT" /EC_number="2.7.7.15" /codon_start=1 /product="CTP:phosphocholine cytidylyltransferase" /protein_id="AAD45922.1" /db_xref="GI:5640001" /translation="MSNVIGDRTEDGLSTAAAASGSTAVQSSPPTDRPVRVYADGIYD LFHFGHARSLEQAKLAFPNNTYLLVGCCNDETTHKYKGRTVMTAEERYESLRHCKWVD EVIPDAPWVVNQEFLDKHQIDYVAHDSLPYADSSGAGKDVYEFVKKVGRFKETQRTEG ISTSDIIMRIVKDYNQYVMRNLDRGYSREDLGVSFVKEKRLRVNMRLKKLQERVKEQQ ERVGEKIQTVKMLRNEWVENADRWVAGFLEIFEEGCHKMGTAIVDSIQERLMRQKSAE RLENGQDDDTDDQFYEEYFDHDMGSDDDEDEKFYDEEEVKEEETEKTVMTDAKDNK" 3'UTR 4310..4637 /gene="CCT"

Genbank Record BASE COUNT 1650 a 956 c 1046 g 1833 t ORIGIN 1 ccagaatggt tactatggac atccgccaac catacaagct atggtgaaat gctttatcta 61 tctcattttt agtttcaaag cttttgttat aacacatgca aatccatatc cgtaaccaat 121 atccaatcgc ttgacatagt ctgatgaagt ttttggtagt taagataaag ctcgagactg 181 atatttcata tactggatga tttagggaaa cttgcattct attcatgaac gaatgagtca 241 atacgagaca caaccaagca tgcaaggagc tgtgagttga tgttctatgc tatttaagta lines deleted to save space… 5101 tgttgttaac caactctctt tacatattag gaccgtgctt gtcaggccaa tggttttcac 5161 ttcgaaaaat tgcttccgat atcaaactat gtgtacatta ttggtggact gtggacataa 5221 cttaaacgca taattttatt gtgtaccttt aaaataaaca atagattaca catatatata 5281 tggcaaatat ttgaacatta gatgtcaaga gaaaagtaaa acatgtcatg attacaccat 5341 ctttgttatt atttagagtg attctcacta aatcttaggc ggttagcaac cgccatagtt 5401 ttcaaaatct cattctatcg ggattaaatc tgtttttggt gactatatat aaacattggt 5461 cgaattttta ggtaagtaaa atcag //

Accession • How would you go about finding the accession number? • You can read the file line by line • See if the line starts with ACCESSION • Does it have to start with ACCESSION? • $line =~ /^ACCESSION/; • How would you strip out the word ACCESSION leaving only the number? • $line =~ s/ACCESSION(\s)*//;

accession.perl open INFILE,"<$ARGV[0]" or die "Unable to open Genbank file ($!)"; while ( !eof( INFILE ) ) { $line = <INFILE>; if ( $line =~ /^ACCESSION/ ) { $line =~ s/ACCESSION(\s)*//; print $line,"\n"; } } close INFILE;

Your Turn!! • Modify the script accession.perl so that in addition to the accession number it also prints out the locus and the organism • The output from your program should look like this Accession: AF165912 Organism: Arabidopsis thaliana Locus: AF165912 5485 bp DNA linear PLN 29-JUL-1999 • The information obtained from the file must be printed in the order specified

Sequence Data • How would you go about getting the sequence data out of a Genbank record? • The sequence data is delimited by • ORIGIN • // • So read lines looking for one that starts with ORIGIN • After seeing ORIGIN, read and print lines until you see the //

sequence.perl open INFILE,"<$ARGV[0]" or die "Unable to open Genbank file ($!)"; $in_sequence = 0; # 0 is false while ( !eof( INFILE ) ) { $line = <INFILE>; if ( $line =~ /^\/\/\n/ ) { $in_sequence = 1; # non-zero value is true } elsif ( $line =~ /^ORIGIN/ ) { $in_sequence = true; } elsif ( $in_sequence ) { print $line,"\n"; } } close INFILE;

Your turn!! • Modify the script sequence.perl so that the sequence data is placed into a scalar variable named $sequence_data. • All of the spaces, newlines, and line numbers should be removed from the sequence data • You can do this as you read or all at once at the end • Write your program so that it prints $sequence_data to verify it works correctly • Testing hint • Modify the Genbank record so that it has many fewer sequence lines so that you can determine if your program is working correctly

Where Are We? • So now we have a program that extracts the DNA sequence information out of a Genbank record • Lets go one step further and convert the sequence data into an equivalent sequence of amino acids • To simplify things assume that a reading frame starts with the first nucleotide in the sequence

Hashes • Associative arrays • Key value pairs • Given the key, the table returns the value • You can build them by hand %codons = (‘TCA’,‘S’,‘TCC’,‘S’,…) Alternatively… %codons = ( ‘TCA’ => ‘S’, ‘TCC’ => ‘S’, ‘TCG’ => ‘S’, … ); • This is a big pain for all of the codons

Building a Hash • The hash can be built under program control • The file codons.txt contains 64 lines in the following format • TCA S Serine • A perl script could read the lines one at a time • Break extract the codon and the amino acid from the line • Add the information to the hash

Handy Functions • Here are some functions that you might find useful • chomp $line • Removes the line termination character(s) from a string • split /\s/, $line • Splits a line into fields (i.e., turns it into a list) • ($codon, $amino, $name) = split /\s/,$line • sort @somelist • Sorts a list • reverse @somelist • Reverses a list

Functions on Hashes • Some operations you can perform on a hash $codons{‘TCA’}=‘S’; • Adds the entry (TCA,S) to the hash delete $codons{‘XXX’}; • Removes the key XXX from the hash keys %codons • Returns a list of the keys associated with the hash sort keys %codons • Sorts the keys in the hash

Steps through a list element by element. The current element is placed in $d Returns a sorted list of the keys in the hash %codons Prints the current element and the amino acid associated with that element What Do You Think This Does? foreach $d ( sort keys %codons ) { print “$d: $codons{$d}\n”; }

Your Turn!! • Write a script named buildhash that builds a hash that contains each of the 64 codons and the amino acid that each one encodes • The information you need to build the hash is contained in the file codons.txt • Print the hash after it has been built

Now This Is Cool!! $dna = ‘CGACGTTTCGTACGGACTAGCT’; $amino_acids = “”; for ($i=0; $i<length($dna)-2; $i=$i+3) { $amino_acids .= $codons{ substr( $dna, $i, 3 ) }; } print $amino_acids,”\n”;

Subroutines • Clearly the code to translate DNA to amino acids is something that we might want to use many times in a program • Perl supports subroutines • Think of a subroutine as a small program that you can call from other programs • Subroutines often require parameters • For the translation program • The string to translate • Where to start translation (ORFs?)

The Subroutine translate sub translate { my($dna,$start)=@_; my $amino_acid; my $i; for ($i=$start; $i<length($dna)-2; $i=$i+3) { $amino_acid .= $codons{ substr( $dna, $i, 3 ) }; } return $amino_acid; }

Packaging • Perl subroutines are typically placed in a separate file • Often referred to as a module or a library • The last line in the module must be a ‘1’ • The module name typically ends with “.pm” • The program that will call the subroutine must state that it will use the module before calling the subroutine

BioSubs.pm sub translate { my($dna,$start)=@_; my $amino_acid; my $i; for ($i=$start; $i<length($dna)-2; $i=$i+3) { $amino_acid .= $codons{ substr( $dna, $i, 3 ) }; } return $amino_acid; } 1

Using BioSubs use BioSubs; open CODONS,"<$ARGV[0]" or die "Unable to open codon file ($!)"; while (<CODONS>) { ($codon,$amino,$name)=split /\s/; $codons{$codon}=$amino; } print translate("GCAGCCGCGGCUAGAAGG”,0),"\n"; close CODONS;

Your Turn!! • Write a perl script that will read sequence data from a Genbank record and will print out the 6 reading frames associated with the sequence data in the record • Take advantage of subroutines • Clearly translate will be helpful • If you have time add a subroutine named getsequence to the BioSubs module • getSequence takes the name of a file that contains a Genbank record as an argument • It returns a scalar that contains the sequence data found in the record

Using Perl

Using Perl

Presentation Transcript

Data-Mining the Web Using Perl

Perl

Perl

PERL

Running External Programs using Perl

Perl

Data files using cgi/perl

Pushing the Envelope With Perl: Using Perl With Microsoft Exchange

Load Testing Using Perl

Perl

PERL

Perl

Perl

CGI programming using Perl

Perl

Ni mble Perl Programming Using Scriptome

Using Perl Modules

Fundamentals of CGI Programming Using Perl

CGI programming using Perl

Load Testing Using Perl