Advanced Perl for Bioinformatics

Advanced Perl for Bioinformatics Lecture 5

Regular expressions - review
• Put the pattern you want to match between slashes (/ /), bind it to a variable with =~, and use the result in a conditional:
    if ($dna =~ /GAATTC/) { print "EcoRI\n"; }
• Square brackets in the pattern list the alternative characters allowed at one position:
    if ($dna =~ /CAG[AT]CAG/)
• A vertical bar means “or”; it lets you look for either of two completely different patterns:
    if ($dna =~ /GAAT|ATTC/)
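The three patterns above can be tried together in one small script. This is only a sketch: the test sequence and the describe_sites helper are made up for illustration, and GAATTC is EcoRI's recognition sequence.

```perl
use strict;
use warnings;

# Hypothetical test sequence containing an EcoRI site (GAATTC).
my $dna = "TTAGGAATTCCAGACAGTT";

# Return a label for each of the review patterns that matches the sequence.
sub describe_sites {
    my ($seq) = @_;
    my @found;
    push @found, "EcoRI"        if $seq =~ /GAATTC/;       # literal pattern
    push @found, "CAG[AT]CAG"   if $seq =~ /CAG[AT]CAG/;   # A or T in the middle
    push @found, "GAAT or ATTC" if $seq =~ /GAAT|ATTC/;    # either alternative
    return @found;
}

print "$_\n" for describe_sites($dna);
```

Wrapping the matches in a subroutine makes it easy to try other sequences without copying the conditionals.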

Reading and writing files, review
• Open a file for reading:
    open INPUT, "/home/class30/input.txt";
• Or for writing:
    open OUTPUT, ">/home/class30/output.txt";
• Make sure you can open it! The special variable $! holds the reason if the open failed:
    open INPUT, "input.txt" or die "Can't open file: $!\n";
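The same operations are often written with the three-argument form of open and a lexical filehandle (my $fh), which keeps the mode separate from the file name. A minimal sketch, assuming a temporary directory so no real files are touched; write_lines and read_lines are hypothetical helper names, not part of the lecture:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Write a list of strings to a file, one per line.
sub write_lines {
    my ($path, @lines) = @_;
    open(my $out, '>', $path) or die "Can't open $path for writing: $!\n";
    print {$out} "$_\n" for @lines;
    close($out) or die "Can't close $path: $!\n";
}

# Read a file back into a list of lines with the newlines removed.
sub read_lines {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Can't open $path for reading: $!\n";
    chomp(my @lines = <$in>);
    close($in);
    return @lines;
}

# Use a temporary directory that is deleted when the script exits.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/output.txt";
write_lines($path, "GAATTC", "GGATCC");
print join(",", read_lines($path)), "\n";
```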

Test time
Last one…

Hashes
Perl has another very useful data structure called a hash, for want of a better name. A hash is an associative array: a collection of values, each stored under a unique key (usually a string), so you look a value up by its key rather than by a numeric position.

Making a hash of it
• You can think of a hash as a set of questions (the keys) and answers (the values):
    my %matthash = (
        "first_name" => "Matt",
        "surname"    => "Hudson",
        "age"        => "secret",
        "height"     => 187,    # cm
        "hairstyle"  => "D minus",
    );

Getting the hash back
    my %matthash = (
        "first_name" => "Matt",
        "surname"    => "Hudson",
        "age"        => "secret",
        "height"     => 187,    # cm
        "hairstyle"  => "D minus",
    );
    print "my name is ", $matthash{first_name};
    print " ", $matthash{surname}, "\n";
You can store a lot of information and recover it quickly and easily by key, without knowing the order in which you added it, unlike an array.

Hashes as an array
• You can get the “keys” of the hash as a list and loop over them like an array (the order is not predictable):
    foreach my $info (keys %matthash) {
        print "$info = $matthash{$info}\n";
    }
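Because keys returns the keys in no fixed order, it is common to sort them before printing. The sketch below also uses exists to test for a key and scalar(keys ...) to count entries; it uses a shortened copy of the hash from the slides.

```perl
use strict;
use warnings;

my %matthash = (
    first_name => "Matt",
    surname    => "Hudson",
    height     => 187,    # cm
);

# Sort the keys so the output order is the same on every run.
my @report = map { "$_ = $matthash{$_}" } sort keys %matthash;
print "$_\n" for @report;

# exists checks for a key without creating it.
my $has_age = exists $matthash{age} ? 1 : 0;
print "Has an age entry: $has_age\n";
print "Number of keys: ", scalar(keys %matthash), "\n";
```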

Why are hashes useful? Exercise.
• Many of you may have noticed in the restriction site exercise that arrays gave you no way to keep track of which site belonged to which enzyme
• Modify your script to use a hash like this one:
    my %enzymehash = (
        "EcoRI"   => "GAATTC",
        "BamHI"   => "GGATCC",
        "HindIII" => "AAGCTT",
    );

(an) answer
    foreach my $name (keys %enzymehash) {
        if ($sequence =~ /$enzymehash{$name}/) {
            print "I found a site for $name, $enzymehash{$name}\n";
        }
    }
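One possible extension of this answer also reports where each site occurs. The sketch below goes slightly beyond the slides: it uses the /g modifier in a while loop to step through every match, the built-in @- array for the match start, and an array reference to hold several positions per enzyme. The find_sites name and the test sequence are invented for illustration.

```perl
use strict;
use warnings;

my %enzymehash = (
    EcoRI   => "GAATTC",
    BamHI   => "GGATCC",
    HindIII => "AAGCTT",
);

# Return a hash of enzyme name => list of 1-based start positions of its sites.
sub find_sites {
    my ($sequence, %enzymes) = @_;
    my %positions;
    foreach my $name (sort keys %enzymes) {
        my $site = $enzymes{$name};
        # Each pass of the loop finds the next match; $-[0] is its 0-based start.
        while ($sequence =~ /$site/g) {
            push @{ $positions{$name} }, $-[0] + 1;
        }
    }
    return %positions;
}

my $sequence = "AAGAATTCTTGGATCCAAGAATTC";   # hypothetical test sequence
my %hits = find_sites($sequence, %enzymehash);
foreach my $name (sort keys %hits) {
    print "$name: @{ $hits{$name} }\n";
}
```

Enzymes with no site in the sequence simply do not appear in the returned hash.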

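Putting data in a hash
    my %hash;
    while (<FILE>) {
        # store a pair only when the pattern matched; otherwise $1 and $2 are stale
        if (/stuff(important stuff) more stuff (best stuff)/) {
            $hash{$1} = $2;
        }
    }
Or…
    while (my $line = <FILE>) {
        chomp $line;    # remove the trailing newline
        my @tmp = split /\t/, $line;
        $hash{$tmp[0]} = $tmp[1];
    }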
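A runnable version of the split approach might look like the following. To keep it self-contained it reads from a string opened as an in-memory file; the gene names and descriptions are made-up placeholder data, and lines without a second column are skipped.

```perl
use strict;
use warnings;

# Hypothetical tab-separated data: a name, then a description.
my $data = "gene1\tkinase\ngene2\ttransporter\nbad line without tab\n";

# Open the string as an in-memory file so no real file is needed.
open(my $fh, '<', \$data) or die "Can't open in-memory file: $!\n";

my %hash;
while (my $line = <$fh>) {
    chomp $line;                    # remove the trailing newline
    my @tmp = split /\t/, $line;
    next unless @tmp >= 2;          # skip lines that lack a second column
    $hash{ $tmp[0] } = $tmp[1];
}
close($fh);

print "$_ => $hash{$_}\n" for sort keys %hash;
```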

Advanced regex
• The fun isn’t over yet.
• You can match precise numbers of characters
• Any number of characters
• Positions in a line
• Precise formatting (spaces, tabs, etc.)
• You can pull out parts of the string you matched and store them in variables
• You can use regexes to substitute or to translate
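Each item in the list above corresponds to a standard piece of Perl syntax. The sketch below shows one example of each on an invented sequence: {n,} for counts, ^ and $ for positions, \s for whitespace, s/// to substitute and tr/// to translate.

```perl
use strict;
use warnings;

my $seq = "ATGAAAAAAGGCTAA";   # hypothetical sequence

# {6,} matches a precise minimum count: six or more A's in a row.
my $has_polyA = $seq =~ /A{6,}/ ? 1 : 0;

# ^ and $ anchor the pattern to the start and end of the string.
my $looks_like_orf = $seq =~ /^ATG([ACGT]{3})*(TAA|TAG|TGA)$/ ? 1 : 0;

# \s+ matches any run of whitespace (spaces or tabs).
my @fields = split /\s+/, "gene1 \t 42   plus";

# s/// substitutes; here every T becomes U, working on a copy of $seq.
(my $rna = $seq) =~ s/T/U/g;

# tr/// translates character by character; combined with reverse it
# gives the reverse complement.
(my $revcomp = reverse $seq) =~ tr/ACGT/TGCA/;

print "polyA: $has_polyA, ORF-like: $looks_like_orf\n";
print "fields: @fields\n";
print "RNA: $rna\n";
print "reverse complement: $revcomp\n";
```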

Grabbing bits of the regex
    my $blastline = "Query= AT1g34399 gene CDS";
    $blastline =~ /Query= (.+) gene/;
    my $atgnumber = $1;
    print "The accession number is $atgnumber\n";
Parentheses in a regex capture the part of the string they match and store it in the special variable $1, which you can then use elsewhere. If you add a second pair of parentheses, its match is stored in $2.
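A slightly more defensive sketch of the same idea uses two pairs of parentheses and reads $1 and $2 only after checking that the match succeeded, since they keep their old values when a match fails. The variable names are illustrative.

```perl
use strict;
use warnings;

my $blastline = "Query= AT1g34399 gene CDS";

my ($accession, $feature);
# \S+ matches a run of non-space characters; \s+ matches the gaps between.
if ($blastline =~ /^Query=\s+(\S+)\s+gene\s+(\S+)/) {
    ($accession, $feature) = ($1, $2);   # first and second captures
    print "Accession: $accession, feature: $feature\n";
}
else {
    print "Line did not match\n";
}
```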

Using modules
• You can use other people’s modules, including those that come with Perl. These provide extra commands or change the way your Perl script behaves, e.g.:
    use strict;
    use warnings;
    use Bio::Perl;
You will see these stacked up at the beginning of more complicated Perl scripts. Some modules come with Perl (strict, warnings; see man perlmod); others you need to download and install yourself.
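As a small illustration using only modules that ship with Perl, the sketch below imports two functions from the core List::Util module. The fragment sizes are invented numbers.

```perl
use strict;      # variables must be declared with my
use warnings;    # report likely mistakes such as undefined values
use List::Util qw(sum max);   # core module; imports only sum and max

my @fragment_sizes = (1200, 350, 4800);   # hypothetical fragment lengths in bp

my $total   = sum(@fragment_sizes);
my $longest = max(@fragment_sizes);
print "Total: $total bp, longest fragment: $longest bp\n";
```

Listing the function names after the module name imports only those functions, which keeps it clear where each command comes from.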

A last exercise?...
• So: how might hashes help you solve this?
• Open up a BLAST output file
• Spit out the name of the query sequence, the top hit, and how many hits there were.
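One way to start on this exercise is sketched below. It makes strong simplifying assumptions: the report is a cut-down stand-in rather than real BLAST output, each query begins with a "Query=" line, and every line starting with ">" is counted as one hit. Real BLAST reports contain much more text, so the patterns would need adjusting.

```perl
use strict;
use warnings;

# Simplified stand-in for BLAST text output (an assumption, not real output).
my $report = <<'END';
Query= seqA
>hit1 some description
>hit2 another description
Query= seqB
>hit3 only hit
END

open(my $fh, '<', \$report) or die "Can't open in-memory report: $!\n";

# Two hashes keyed by query name; @queries remembers the order seen.
my (%top_hit, %hit_count, @queries, $current);
while (my $line = <$fh>) {
    if ($line =~ /^Query=\s*(\S+)/) {
        $current = $1;
        push @queries, $current;
        $hit_count{$current} = 0;
    }
    elsif (defined $current && $line =~ /^>\s*(\S+)/) {
        # The first hit listed for a query is treated as its top hit.
        $top_hit{$current} = $1 unless exists $top_hit{$current};
        $hit_count{$current}++;
    }
}
close($fh);

foreach my $query (@queries) {
    my $top = exists $top_hit{$query} ? $top_hit{$query} : "none";
    print "$query\ttop hit: $top\thits: $hit_count{$query}\n";
}
```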

Programming projects
• Now it’s time to think of your programming projects.
• Hopefully you have an idea; we’ll discuss how feasible each one is in the time available
• If not, here are some suggestions

Suggested program functions
• Translate a cDNA into protein, and then check it against the Pfam database for HMM hits.
• Make a real restriction map of a DNA sequence, with predicted fragment sizes.
• Align proteins of a favorite family, open the alignment, and find residues that are totally conserved.
• Perform BLAST against the latest version of the database files for a particular organism, checking whether the user has the latest files and downloading them if not.
• Design PCR primers, to make a fragment size chosen by the user, for a sequence input from a FASTA file.
• Check whether primer sites are unique in a sequenced, or partially sequenced, genome, and give an “electronic PCR” result.
• Output an XML-formatted version of a BLAST or HMMER text file.
• Analyze codon usage in a protein-coding DNA sequence and calculate the Ka/Ks ratio.
