Advanced Perl for Bioinformatics

Lecture 5

Regular expressions - review
  • You can put the pattern you want to match between //, bind the pattern to the variable with =~, then use it within a conditional:

if ($dna =~ /GAATTC/) { print "EcoRI\n"; }

  • Square brackets within the match expression allow for alternative characters:

if ($dna =~ /CAG[AT]CAG/)

  • A vertical bar (|) means “or”; it lets you look for either of two completely different patterns:

if ($dna =~ /GAAT|ATTC/)
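
As a minimal sketch of the three kinds of pattern reviewed above, the following tries each one against a sequence and reports which matched. The sub name describe_matches and the sample sequence are made up for illustration; GAATTC is the EcoRI recognition site.

```perl
use strict;
use warnings;

# Report which of the three review patterns match a sequence.
# The sub name and the sample sequence are illustrative only.
sub describe_matches {
    my ($dna) = @_;
    my @found;
    push @found, "GAATTC"     if $dna =~ /GAATTC/;      # exact pattern
    push @found, "CAG[AT]CAG" if $dna =~ /CAG[AT]CAG/;  # A or T in the middle
    push @found, "GAAT|ATTC"  if $dna =~ /GAAT|ATTC/;   # either pattern
    return @found;
}

my $dna = "TTCAGTCAGGGAATTCAA";
print "Matched: $_\n" for describe_matches($dna);
```

Collecting the results in a list rather than printing inside each conditional makes it easy to reuse the check on many sequences.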

Reading and writing files, review
  • Open a file for reading:

open INPUT, "/home/class30/input.txt";

  • Or writing:

open OUTPUT, ">/home/class30/output.txt";

  • Make sure you can open it!

open INPUT, "input.txt" or die "Can't open file: $!\n";
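
The same ideas can also be written with the three-argument form of open and lexical filehandles, a style many Perl programmers prefer. This is only a sketch: the sub name copy_uppercase is hypothetical, and temporary files from the core File::Temp module stand in for real paths.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Copy every line of $in_path to $out_path in upper case.
# Returns the number of lines copied. The sub name is illustrative.
sub copy_uppercase {
    my ($in_path, $out_path) = @_;
    open my $in,  '<', $in_path  or die "Can't open $in_path: $!\n";
    open my $out, '>', $out_path or die "Can't open $out_path: $!\n";
    my $count = 0;
    while (my $line = <$in>) {
        print {$out} uc $line;
        $count++;
    }
    close $in;
    close $out or die "Can't close $out_path: $!\n";
    return $count;
}

# Demonstration with temporary files, so no real paths are assumed.
my ($tmp_in_fh,  $tmp_in)  = tempfile(UNLINK => 1);
my ($tmp_out_fh, $tmp_out) = tempfile(UNLINK => 1);
print {$tmp_in_fh} "acgt\nttga\n";
close $tmp_in_fh;
close $tmp_out_fh;

my $n = copy_uppercase($tmp_in, $tmp_out);
print "Copied $n lines\n";
```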

Test time

Last one…


Perl has another super useful data structure called a hash, for want of a better name. A hash is an associative array, i.e. a collection of values in which each value is stored under, and looked up by, its own key.

Making a hash of it
  • You can think of a hash as a set of questions (the keys) and answers (the values):

my %matthash = ("first_name" => "Matt",
                "surname"    => "Hudson",
                "age"        => "secret",
                "height"     => 187,    # cm
                "hairstyle"  => "D minus");


Getting the hash back

my %matthash = ("first_name" => "Matt",
                "surname"    => "Hudson",
                "age"        => "secret",
                "height"     => 187,    # cm
                "hairstyle"  => "D minus");

print "my name is ", $matthash{first_name};
print " ", $matthash{surname}, "\n";

Unlike an array, a hash lets you store a lot of information and recover it easily and quickly by key, without knowing in what order you added it.

Hashes as an array
  • You can get the “keys” of the hash and use them like an array (they come back in no particular order):

foreach my $info (keys %matthash) {
    print "$info = $matthash{$info}\n";
}
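
Because keys returns the keys in no guaranteed order, it can help to sort them when the output order matters. The sketch below uses a shortened copy of the hash above and also shows exists, which tests for a key without creating it; the variable names @lines and $has_age are illustrative.

```perl
use strict;
use warnings;

# A shortened copy of the example hash (no "age" entry on purpose).
my %matthash = (
    first_name => "Matt",
    surname    => "Hudson",
    height     => 187,    # cm
);

# Sort the keys so the output order is predictable.
my @lines = map { "$_ = $matthash{$_}" } sort keys %matthash;
print "$_\n" for @lines;

# exists checks whether a key is present without adding it.
my $has_age = exists $matthash{age} ? "yes" : "no";
print "Has an age entry? $has_age\n";
```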


Why are hashes useful? Exercise.
  • Many of you might have noticed in the exercise on restriction sites that there was no way to keep track of which sites were which using arrays.
  • Modify your script using a hash like this one:

my %enzymehash = (
    "EcoRI"   => "GAATTC",
    "BamHI"   => "GGATCC",
    "HindIII" => "AAGCTT");

(an) answer

foreach my $name (keys %enzymehash) {
    if ($sequence =~ /$enzymehash{$name}/) {
        print "I found a site for $name, $enzymehash{$name}\n";
    }
}
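
One possible way to turn the answer into a complete script is sketched below. It counts non-overlapping occurrences of each site rather than just reporting presence. The sub name count_sites and the test sequence are made up for illustration; the recognition sites are the standard ones for these three enzymes (GAATTC for EcoRI).

```perl
use strict;
use warnings;

my %enzymehash = (
    EcoRI   => "GAATTC",
    BamHI   => "GGATCC",
    HindIII => "AAGCTT",
);

# Count non-overlapping occurrences of $site in $sequence.
# The sub name is illustrative.
sub count_sites {
    my ($sequence, $site) = @_;
    my $count = 0;
    # With /g in scalar context, each match resumes where the last ended.
    # \Q...\E makes sure the site is treated as literal text.
    $count++ while $sequence =~ /\Q$site\E/g;
    return $count;
}

my $sequence = "AAGGATCCTTGAATTCGGGAATTC";    # made-up test sequence
foreach my $name (sort keys %enzymehash) {
    my $n = count_sites($sequence, $enzymehash{$name});
    print "$name ($enzymehash{$name}): $n site(s)\n";
}
```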



Putting data in a hash

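my %hash;

# Option 1: capture two parts of each line with a regex
while (<FILE>) {
    if (/stuff(important stuff) more stuff (best stuff)/) {
        $hash{$1} = $2;
    }
}

# Option 2: split each tab-delimited line into columns
while (my $line = <FILE>) {
    chomp $line;
    my @tmp = split /\t/, $line;
    $hash{$tmp[0]} = $tmp[1];
}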
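
To try the tab-delimited approach without a real file, Perl can open a filehandle on a string. The sketch below does that with two made-up records; the second gene ID and both descriptions are invented for the example.

```perl
use strict;
use warnings;

# Two made-up tab-separated records: gene ID, then a description.
my $data = "AT1g34399\tkinase\nAT2g00001\ttransporter\n";

# Opening a reference to a string gives an in-memory filehandle,
# which stands in for a real file here.
open my $fh, '<', \$data or die "Can't open in-memory file: $!\n";

my %hash;
while (my $line = <$fh>) {
    chomp $line;                 # drop the trailing newline
    next if $line eq '';         # skip blank lines
    my ($key, $value) = split /\t/, $line;
    $hash{$key} = $value;
}
close $fh;

print "$_ => $hash{$_}\n" for sort keys %hash;
```

Without chomp, the newline at the end of each line would end up inside the last column's value.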


Advanced regex
  • The fun isn’t over yet.
  • You can match precise numbers of characters
  • Any number of characters
  • Positions in a line
  • Precise formatting (spaces, tabs etc)
  • You can get bits of the string you matched out and store them in variables
  • You can use regexes to substitute or to translate
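
Each item in the list above corresponds to a small piece of regex syntax. The following sketch shows one example of each on a made-up sequence; all variable names are illustrative.

```perl
use strict;
use warnings;

my $seq = "ATGAAAAAACCCGGGTAA";    # made-up sequence

# Precise number of characters: exactly six A's in a row.
my $six_a = $seq =~ /A{6}/ ? 1 : 0;

# Any number of characters: .* between a start codon and a stop codon.
my $orf_like = $seq =~ /ATG.*TAA/ ? 1 : 0;

# Positions in a line: ^ anchors to the start, $ to the end.
my $starts_atg = $seq =~ /^ATG/ ? 1 : 0;
my $ends_taa   = $seq =~ /TAA$/ ? 1 : 0;

# Precise formatting: \s+ matches a run of spaces and/or tabs.
my @fields = split /\s+/, "gene1 \t 42";

# Substitution: s/// replaces matched text (T to U gives the RNA).
my $rna = $seq;
$rna =~ s/T/U/g;

# Translation: tr/// maps characters one-to-one; applied to the
# reversed string it gives the reverse complement.
my $revcomp = reverse $seq;
$revcomp =~ tr/ACGT/TGCA/;

print "six A's: $six_a, ORF-like: $orf_like\n";
print "starts with ATG: $starts_atg, ends with TAA: $ends_taa\n";
print "fields: @fields\n";
print "RNA: $rna\n";
print "reverse complement: $revcomp\n";
```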
Grabbing bits of the regex
  • The fun isn’t over yet.

my $blastline = "Query= AT1g34399 gene CDS";
$blastline =~ /Query= (.+) gene/;
my $atgnumber = $1;
print "The accession number is $atgnumber\n";

Whatever matches the part of the regex inside parentheses is stored in the special variable $1, which you can then use for other things. If you add a second pair of parentheses, whatever it matches is stored in $2.
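
Since $1 and $2 keep their values from the last successful match, it is safer to use them only after checking that the match succeeded. A minimal sketch with two pairs of parentheses follows; the sub name parse_query_line is hypothetical.

```perl
use strict;
use warnings;

# Pull the ID and the feature type out of a "Query=" line.
# Returns an empty list if the line does not match.
sub parse_query_line {
    my ($line) = @_;
    if ($line =~ /Query= (\S+) gene (\S+)/) {
        return ($1, $2);    # first and second pair of parentheses
    }
    return;
}

my ($id, $type) = parse_query_line("Query= AT1g34399 gene CDS");
print "ID: $id, type: $type\n";
```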

Using modules
  • You can use other people's modules, including those that come with Perl. These provide extra commands, or change the way your Perl script behaves. E.g.

use strict;
use warnings;
use Bio::Perl;

You will see these stacked up at the beginning of more complicated Perl scripts. Some modules come with Perl (strict, warnings):

#man perlmod

Others you need to download and add in yourself.

A last exercise?...
  • So: how might hashes help you solve this?
  • Open up a BLAST output file
  • Spit out the name of the query sequence, the top hit, and how many hits there were.
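
One possible approach is sketched below, using a hash to hold the summary. It runs on a made-up, heavily simplified excerpt written in the style of BLAST text output; the assumption that hits are listed between a "Sequences producing significant alignments" line and the first line starting with ">" should be checked against your own files. The sub name summarize_blast and the hit names are invented.

```perl
use strict;
use warnings;

# A made-up, heavily simplified excerpt in the style of BLAST text output.
my $report = <<'END';
Query= AT1g34399 gene CDS

Sequences producing significant alignments:          (bits)  Value

hitA  hypothetical protein one                          200   1e-50
hitB  hypothetical protein two                          150   1e-35

>hitA  hypothetical protein one
END

# Collect the query name, the top hit and the number of hits in a hash.
sub summarize_blast {
    my ($text) = @_;
    my %summary = (query => undef, top_hit => undef, hits => 0);
    my $in_hit_list = 0;
    foreach my $line (split /\n/, $text) {
        if ($line =~ /^Query= (\S+)/) {
            $summary{query} = $1;
        }
        elsif ($line =~ /^Sequences producing significant alignments/) {
            $in_hit_list = 1;
        }
        elsif ($line =~ /^>/) {
            $in_hit_list = 0;    # alignments start, so the hit list is over
        }
        elsif ($in_hit_list && $line =~ /^(\S+)/) {
            $summary{top_hit} = $1 unless defined $summary{top_hit};
            $summary{hits}++;
        }
    }
    return %summary;
}

my %s = summarize_blast($report);
print "Query: $s{query}\nTop hit: $s{top_hit}\nHits: $s{hits}\n";
```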
Programming projects
  • Now it’s time to think of your programming projects.
  • Hopefully you have an idea – we’ll discuss how feasible they are in the time available
  • If not, here are some suggestions
Suggested program functions
  • Translate a cDNA into protein, and then check it against the Pfam database for HMM hits.
  • Make a real restriction map of a DNA sequence, with predicted fragment sizes.
  • Align proteins of a favorite family, open the alignment and find residues that are totally conserved.
  • Perform BLAST against the latest version of the database files for a particular organism: check whether the user has the latest files and, if not, download them first.
  • Design PCR primers, to make a fragment size chosen by the user, for a sequence input from a FASTA file.
  • Check whether primer sites are unique in a sequenced, or partially sequenced, genome, and give an “electronic PCR” result.
  • Output an XML-formatted version of a BLAST or HMMER text file.
  • Analyze codon usage in a protein-coding DNA sequence and calculate the Ka/Ks ratio.