Running External Programs using Perl

Running External Programs using Perl

Running External Programs with Perl • Many clever people have written software for use in bioinformatics. Often they have used sophisticated algorithms, and tested them with a wide variety of data. • Since we want to avoid endlessly re-inventing and debugging programs that do standard procedures, we want to use their programs. • But, we want to use Perl to get the input data into the proper form, parse the output, and (often) run the program many different times with slight variations. • Most common program: BLAST, which we will use as an example.

BLAST+ • Recently, NCBI has “improved” the BLAST interface by heavily revising the syntax needed to invoke the program. The new version is called BLAST+. • Manual: http://www.ncbi.nlm.nih.gov/books/NBK1763/ • It is still possible use the old methods, by using the legacy_blast.pl program. An example: legacy_blast.pl blastall –I query -d nr -o blast.out • However, it probably best to just adjust to the new syntax.

makeblastdb • The subject sequences for a BLAST search must be put into a special format, a BLAST database, before they can be used. The makeblastdb program is used for this purpose • This was formerly done with the program formatdb. • Only three input parameters of note: • -in specifies the FASTA-formatted input file, usually containing multiple sequences. • -out specifies the name of the blast database you wish to create • -dbtype specifies the molecule type: “nucl” for DNA or RNA and “prot” for protein. • Example: makeblastdb -in scaSeq.fasta -out scaSeq -dbtype nucl

BLAST Programs • In the past, “blastall” was the basic command, with the particular type of blast specified with the –p option. • With BLAST+, the basic commands are blastn, blastp, blastx, tblastx, tblastn, psiblast, rpsblast, and rpstblastn. • Within the blastn and blastp basic commands are several options, specified with the –task flag. • For blastp, • -task blastp is the standard blastp program, • –task blastp-short is optimized for queries shorted than 30 amino acids • For blastn, • -task blastn is the standard blastn program, • –task blastn-short is optimized for queries shorted than 50 nucleotides • -task megablast uses a BLAST program with a linear gap penalty instead of an affine gap penalty; it is used for matching very similar sequences and it is much faster than blastn • -task dc-megablast is discontinuous megablast, used for more distant matches.

Necessary Parameters • All blast programs have several necessary parameters • -db specifies the BLAST database to be used • -query specifies the FAST-formatted file to be used as the query sequence • -out specifies a file name for the output. In the absence of a –out file, the output goes to STDOUT.

BLAST Options • It is also possible to control parameters of BLAST itself. There are lots of options, but here are some I have found useful: • -evalue specifies that worst e-value that is reported. The default is 10, which is way too much, so I often use –evalue 1e-3, which is more realistic. • -matrix specifies the scoring matrix. For blastp with is BLOSUM62 • -max_target_seqs specifies the maximum number of subject sequences that will appear in the output. Note that you can have more than one hit on the same subject sequence, and all of them count as just a single target. But, this is often quite useful if you just want the top hit. • -seg no turns off the SEG filter, for simple sequences • -dust no turns off the DUST simple sequence filter for blastn. • You can also fool around with the gap penalties, word sizes, and other parameters if you like. Not worth the trouble, in my opinion.

Output Formatting • Output format can be specified with the –outfmt option. • The default value (which is 0) gives the standard pairwise alignment format used in the old BLAST program • For tabular output equivalent to the old –m8 output, use –outfmt 6 • A list of all possible options can be found by using the command “blastn –help”. For tabular output, it is possible to specify exactly which pieces of data you will see as well as their order. • The default for –outfmt 6 is : -outfmt 6 'qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore‘ • There are quite a few other data items you can see if you like.

“system” command • The most commonly used Perl command for running external programs is “system”. This command executes the program specified by its arguments, then returns control to the next line in your Perl program. • “system” returns the signal number resulting from the process (program) it executed. If all went well, this is 0. Numbers other than 0 indicate some kind of error. • You should always test the results of interacting with the external world. • Thus, reasonable syntax for using the system command: if ( system(“whatever see below”) != 0) { die “error executing system command\n”; } • Surprisingly often you will find that you get an error when you first start running an external program, due to lack of proper permissions, or files not being where you thought they were, or improper syntax, etc.

More on the system command • The simplest way to use “system” is to simply enclose the command line you need in quotes: system( “blastn -task megablast -db all_genes -query temp_qu_file -out temp_out_file -outfmt 6 -evalue 1e-3 -perc_identity 97” ) • The above line invokes the bash shell to interpret the command line, converting each group of symbols separated by spaces into a separate argument. • You can avoid invoking the shell at all (a somewhat more secure method), by separating out the individual arguments yourself: system( “blastall”, “–p blastn”, “–I my_input_file”, “–d my_database”, “–o my_blast_output.txt” ) • I rarely do this.

Backticks • Another way to run external programs is to enclose the command line in backticks. • the backtick looks like: `. It is on the keyboard just to the left of 1, same key as ~. • The program’s output to STDOUT is the return value of the command enclosed in backticks: $output_string = ` blastn –task blastn –i my_input_file –d my_database`; • In this case, the output string needs to be split at newline characters, since the entire output is contained with in a single string. my @output_lines = split /\n/, $output_string; • Backticks are very useful when the program you want to run doesn’t have the option of writing to a file.

BLAST types • A BLAST search involves taking a single sequence (the query sequence) and comparing it to a database of other sequences (subject sequences), then reporting the best matches. • In general, protein searches are more capable of detecting more distant and imperfect matches than DNA searches, because synonymous and similar amino acids are more conserved evolutionarily than are nucleotide sequences. • A number of different BLAST programs are available. The main ones differ primarily by the type of query and subject sequences used: • blastn: query is DNA, subject is DNA • blastp: query is protein, subject is protein • blastx: query is nucleic acid that is translated by the program into protein sequences (all 6 reading frames); subject database is protein. Useful for imperfect query sequences such as ESTs • tblastn: query is protein; database is DNA translated into protein sequences in all 6 reading frames. Useful for low quality databases. • tblastx: query is DNA translated into protein, subject is nucleotide translated into protein. Both are translated into all 6 frames. It is very slow relative to the other BLAST types. • Several other types of BLAST search exist: see the NCBI site FAQs to learn more about them.

Query Sequence • The query sequence must be written in “FASTA” format. This consists of a single comment line that starts with a >, followed by one or more lines of sequence. • Comments usually start with a unique identifier for the sequence, then some other descriptive information. To save trouble later, the unique identifier should contain letters, numbers, undescores, hyphens and periods only, and contain no spaces. The unique identifier ends at the first space. • The sequence should contain only sequence characters: no spaces or numbers allowed. • Either upper or lower case is acceptable, • Some suggest that lines of 80 or fewer characters should be used. I prefer to put the entire sequence on a single line, which can be quite long. • Nucleic acid sequences can be either DNA or RNA. In general, A, C, G, T and U are used, along with N for “any base” and a single – (hyphen) for gaps of any length. • Some sequences also contain a variety of other letters, such as R for pyrimidine or M for amino-containing. These are not widely used and may well confuse others. • However, the existence of these characters makes it necessary that you check unknown sequences for the presence of letters other than ACGTUN-. • Protein sequence use the standard 20 amino acid single letter codes. In addition, X means any amino acid, * is the translational stop codon at the end of the sequence, and there are a few other oddities. • Note that FASTA format is not rigidly defined, and it is always worth checking sequence files for what exactly is being used as a format, and the information for any new program for what is an acceptable format.

FASTA format examples • >gi|282349|pir||A41961 chitinase (EC 3.2.1.14) D - Bacillus circulans LNQAVRFRPVITFALAFILIITWFAPRADAAAQWQAGTAYKQGDLVTYLNKDYECIQPHTALTGWEPSNVPALWKYVGEGTGGGTPTPDTTPPTVPAGLTSSLVTDTSVNLTWASTDNVGVTGYEVYRNGTLVANTSTTTAVVTGLTAGTTYVFTVKAKDAAGNLSAASTSLSVTTSTGSSNPGPSGSKWLIGYWHNFDNGSTNIKLRNVSTAYDVINVSFAEPISPGSGTLAFTPYNATVEEFKSDIAYLQSQGKKVLISMGGANGRIELTDATKKRQQFEDSLKSIISTYGFNGLDIDLEGSSLSLNAGDTDFRSPTTPKIVNLINGVKALKSHFGANFVLTAAPETAYVQGGYLNYGGPWGAYLPVIHALRNDLTLLHVQHYNTGSMVGLDGRSYAQGTADFHVAMAQMLLQGFNVGGSSGPFFSPLRPDQIAIGVPASQQAAGGGYTAPAELQKALNYLIKGVSYGGSYTLRQLRAMSVSRAL • >U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1) AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCCTGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGCCGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACTTTCAACAATGGATCTCTTGGTTCCGGC

Subject Databases • BLAST uses a special database format to speed up the search operation. • Several pre-packaged databases exist, most notably “nr” which is the non-redundant database consisting of all sequences in GenBank. See the BLAST program selection guide at http://www.ncbi.nlm.nih.gov/blast/producttable.shtml . • We mostly make our own databases, which is done with the “formatdb” program. Useful information can be found in the file /home/share/blast.dir/README.formatdb. • Databases are located either in the same directory you are running BLAST from, or in /home/share/blast.dir/data. • To use the /home/share/blast.dir/data directory without having to specify the entire path every time, you need to create a file called “.ncbirc” in your home directory that contains these two lines: [NCBI] Data=/home/share/blast.dir/data

Tabular BLAST Output • The –m 8 option gives a simple table as output. Often this information is all you need. • The table is tab-separated, with each hit on a separate line. • You can easily parse these data by splitting each line at tabs: my @data = split /\t/; • Starting at 0, the columns are: 0: query name 1: subject name 2: percent identities 3: aligned length 4: number of mismatched positions 5: number of gap positions 6: query sequence start 7: query sequence end 8: subject sequence start 9: subject sequence end 10: e-value 11: bit score

More Tabular BLAST Output • The query and subject names come from the FASTA comment lines: everything up to the first space character. • Aligned length can include gaps in both the query and the subject, so it is not necessarily the same length as either of these sequences. • Percent identities looks at every position in the aligned sequences and counts the number that have the same base or amino acid. This count is then divided by the aligned length and then multiplied by 100. • Mismatches are positions in the aligned sequences where both query and subject have a base or amino acid. Gap positions are not included. • The number of gap positions is the number of positions in the aligned sequence with a base or amino acid in one sequence and a hyphen (gap) in the other. (NOT the number of gaps regardless of length). • The query sequence start and end positions are always written with the start less than the end. However, the subject sequences can have the start position greater than the end, if the alignment is on the opposite DNA strand from the way the subject sequence was originally put into the database. • The e-value is the expected number of random sequences that would have at least this good a match in a random database of the same size as the subject. Thus, an e-value of 1 means that 1 match would be expected to occur by chance alone using this database. • It is worth noting that a probability of 0.05 is often considered significant in biological experiments. This value would almost never be considered a significant hit in BLAST searches. Perhaps a rule-of-thumb worst significant score would be 10-6 (1e-6). • Scores less than 1e-180 are given a value of 0.0. • The bit score is a normalized (corrected for the mismatch and gap penalties) version of the score derived from the Smith-Waterman search. It is related to the e-value through an exponential function that also takes into account the length of both the query and the subject database. That is, the e-value already takes into account query and subject lengths, while the bit score doesn’t. • I prefer to deal with e-values.

BLAST Output • If you don’t specify –m 8, you get a longer, detailed output file. • The output starts with basic info, identifier for query (FASTA comment line up to the first space) and the database you listed on the command line with the –d option. • Then a summary of each database sequence that had a significant alignment(s). Two numbers appear: the bit score (an integer) and the e-value (a floating point number with a minimum of 0 and a maximum of whatever you set –e to be, defaulting to 10.0). • The summary lines add scores for all hits within each subject sequence. This is useful and meaningful if the subject database contains discrete sequences such as individual genes or peptides. However, if the subject database is genomic chromosomal sequences, the summary lines can be quite misleading. • After the summary lines are detailed views of each hit.

More BLAST Output: Detailed View • At the start of the details for each subject sequence, there is a > in the first column, followed by the subject’s FASTA comment line. The next line gives the subject sequence’s length. • Note that there can be multiple hits on each subject, and each will be listed separately following these subject identifier and length lines. • Then, some summary information about this hit. For example: Score = 1594 bits (804), Expect = 0.0 Identities = 804/804 (100%) Strand = Plus / Plus • The Score is the raw bit score followed by the normalized bit score in parentheses. If you use bit scores to compare sequences, you want to use the normalized bit score. • Expect is the e-value. • Identities is the number of matching positions between the aligned sequences, followed by the percentage of identical bases. • Sometimes the Identities line will also have a “Gaps = “ section, which gives the number of gap positions. • For protein alignments, there is also a “Positives = “ section, which gives the number of amino acid matches that are not identical but which count as similar in the scoring matrix (e.g. isolecine and valine). • The order on this line is Identities, Positives, Gaps. However, Positives and/or Gaps may be absent. • Strand is the orientation of the query sequence, then the subject sequence. Query is always “Plus”, but the subject will be “Minus” if the alignment is on the opposite strand from the original sequence put into the subject database. This line appears in nucleic acid searches but not in protein searches (which only match

More BLAST Output: Detailed View • After the summary lines, the aligned sequence appears, in groups of 3 lines. The query sequence is on top, followed by a line indicating matches, followed by the subject line. • For blastn, the match line just consists of vertical lines where the bases are identical and a blank space where they are not identical. • For blastp, the match line gives the amino acid code for indentical amino acids, a “+” for similar (positives) amino acids, and a blank space otherwise. • After all hits have been displayed, the end of the output file has some statistical information about the subject database and parameters of the blast search.

Advanced Parsing of BLAST output

Parsing BLAST Output • Once you have run BLAST, its output is in a file. You need to open this file, then read it line-by-line with a while statement. • Because you are reading line-by-line, most of the information you gather needs to be stored in global variables: variables declared with “my” outside the while loop. • Each new line needs to be examined for the type of information it contains using regular expressions. It is important to be able to identify the start and end points of various sections, and respond appropriately. • Most information will be gathered using regular expression memory parentheses. • Variables need to be re-initialized at the end of each section (or the beginning of the next one if the end is not easily discerned). • We will store the information in a hash whose keys are the subject identifiers. All hits for that subject will be stored in an array. • Assuming that you want the information in the detailed analysis of each hit, note that there are 3 important sections: • 1. each new subject sequence starts with a “>” in the first column, followed by the rest of its FASTA comment line. This can spill over multiple lines in the output. Immediately following this line(s) is the “Length = “ line. • 2. Within each subject sequence there are one or more hits. Each hit starts with a “Score =“ line, followed by an “Identities = “ line, followed (for blastn but not blastp) with a “Strand = “ line. • 3. Within each hit there are groups of 3 lines representing the aligned sequence, including the coordinates of the start and end position of each line. There can be several such groups, for long hits.

1. Parsing New Subject Sequence Information • Each new subject sequence starts with a line having the entire FASTA comment line on it (i.e. a “>” in the first column), followed by zero or more additional comment lines, and ends with a “Length = “ line. • The useful pieces of information here are the comment line itself, the unique identifier for this subject sequence (part of the comment line), and the length. • Because these sections have easily recognized start and end points, you can deal with them as a unit, inputting new lines separately from the main while loop. my ($subject_ident, $comment_line, $subject_length); my %hits; # to store all data from this file; while (<INFILE>) { # code for processing other sections if (/^>/) { # finds the “>” in the first column of the line $subject_length = 0; # reset to 0 $comment_line = $_; while (1) { # endless loop to get all of comment line my $next_line = <INFILE>; if ($next_line =~ /Length = (\d+)/ ) { $subject_length = $1; last; # break out of loop } else { # continuation of comment line chomp $next_line; $next_line =~ s/^\s+/ /; # remove leading $next_line =~ s/\s+$/ /; # and trailing multiple spaces $comment_line .= $next_line; } # end else } # end while(1) loop $comment_line =~ />(\S+)\s/; $subject_ident = $1; $hits{$subject_ident}{subject_length} = $subject_length; $hits{$subject_ident}{comment_line} = $comment_line; } # end subject sequence info section # additional processing of BLAST output file } # end while <INFILE> loop

2. Parsing Hit Information • Each hit starts with the same 3 lines (2 for blastp), Score, Identities, and Strand. If you detect the Score line, you can get the next two lines without using the overall while loop: my ($e_value, $identities, $gaps, $aligned_length, $orientation); while (<INFILE>) { if (/Score = /) { my $score_line = $_; my $identities_line = <INFILE>; my $strand_line = <INFILE>; # now process the information in these 3 lines $score_line =~ /Score.*Expect = ([-.e\d]+)/; # note the regex here! $e_value = $1; $identities_line =~ /Identities = (\d+)\/(\d+)/; # note the escaped slash ($identities, $aligned_length) = ($1, $2); $gaps = 0; # set a default if ($identities_line =~ /Gaps = (\d+)/ ) { # Gaps are not always present $gaps = $1; } $orientation = 'Plus'; if ($strand_line =~ /Plus \/ (Plus|Minus)/) { # only for blastn $orientation = $1; } push @{ $hits{$subject_ident}{hit_data} }, [$e_value, $identities, $gaps, $aligned_length, $orientation]; } # end if /Score/ } # end while <INFILE>

3. Parsing the Aligned Sequences my ($query_start, $query_end, $subject_start, $subject_end); while (<INFILE>) { if ($subject_ident && exists $hits{$subject_ident}{hit_data} && (/^>/ || /Score =/ || /^Lambda/ ) ) { # but not for the first hit of the first subject push @{ $hits{$subject_ident}[-1] }, $query_start, $query_end, $subject_start, $subject_end; last if (/^Lambda/); # end of hit processing $query_start = $query_end = $subject_start = $subject_end = 0; # reset } # process the file } # end while <INFILE> • This section is the hardest to deal with, because the end point is not clearly delimited. • We solve this problem by noting that there are 3 ways the aligned sequence section can end: it can be followed by another hit (a Score = line), by a new subject (a line with > in the first column) or by the statistical information at the end of the file (identified by “Lambda” in the first column). • So, near the beginning of the overall while loop we will insert a section that recognizes any of these patterns, and writes the aligned sequence information to the array of information from the previous hit. • However, note that this should only be done if there is a previous hit, that is, not for the first hit in the first subject sequence. We can detect this because the $subject_ident variable will be undef before the first subject line is processed, and there will be no $hits{$subject_ident}{hit_data}.

3. Parsing Aligned Sequences: 2 • The aligned sequences come in groups of 3 lines, Query, the match line, and Subject. It is easiest to process each group of 3 as a unit. • We will gather the start and end positions for both query and subject. It is of course possible to collect more detailed information about the sequences if desired. • There will be one or more aligned sequence units. Since the start and end points are set to 0 at the beginning of whatever section follows, the start positions are only modified if their variables are at 0. while (<INFILE>) { # previous code if (/^Query/) { my $query_line = $_; my $match_line = <INFILE>; my $subject_line = <INFILE>; $query_line =~ /Query:\s*(\d+)\D+(\d+)\s*$/; $query_start = $1 unless $query_start; $query_end = $2; $subject_line =~ /Sbjct:\s*(\d+)\D+(\d+)\s*$/; # note wierd spelling $subject_start = $1 unless $subject_start; $subject_end = $2; } # end /^Query/ # code for other sections } # end while <INFILE>

Printing Out the Information • After the end of the while <INFILE> loop. • Very simple format. • This whole program could be a subroutine, with a reference to the %hits hash returned. foreach my $subject_ident (sort keys %hits) { print "$subject_ident:\n"; print " $hits{$subject_ident}{comment_line}\n"; print " Length = $hits{$subject_ident}{subject_length}\n"; foreach my $hit ( @{$hits{$subject_ident}{hit_data}} ) { print "@$hit\n"; } print "\n"; }

Running External Programs using Perl