Perl for Bioinformatics

Lecture 4

Variables - review
  • A variable name starts with a $
  • It contains a number or a text string
  • Use my to define a variable
  • Use = to assign a value
  • Use \ to stop the variable being interpolated
  • Take care with variable names and with changing the contents of variables
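The points above can be sketched in a few lines. This is only an illustration: the gene name and the number are made-up example values, not part of the lecture.

```perl
use strict;
use warnings;

# Define variables with "my" and assign values with "=".
my $gene   = "BRCA1";   # a text string (example value only)
my $length = 1863;      # a number (example value only)

# Inside double quotes, variables are interpolated (replaced by their contents).
my $interpolated = "Gene $gene has $length residues";

# A backslash before the $ stops the variable being interpolated.
my $escaped = "The variable \$gene contains $gene";

print "$interpolated\n";
print "$escaped\n";
```

The second print shows the literal text $gene followed by the contents of the variable, which is a quick way to see what the backslash does.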
Conditional Blocks, review
  • An if test can be used to control a command in a conditional block, according to the outcome of a decision made by comparing variables.
  • It’s important to keep track of whether variables are strings or numbers. Numbers are compared with ==, strings with eq.
  • It’s usual to indent the block to make it easier to read the code
Arrays
  • An array can store multiple pieces of data.
  • Arrays are essential for the most useful functions of Perl. They can store data such as:
    • the lines of a text file (e.g. primer sequences)
    • a list of numbers (e.g. BLAST e values)
  • Arrays are designated with the symbol @

my @bases = ("A", "C", "G", "T");

Converting a variable to an array

split breaks a string into parts wherever a pattern matches and puts the parts in an array. An empty pattern // splits between every character.

my $dnastring = "ACGTGCTA";

my @dnaarray = split //, $dnastring;

@dnaarray is now (A, C, G, T, G, C, T, A)

@dnaarray = split /T/, $dnastring;

@dnaarray is now (ACG, GC, A)

Converting an array to a variable
  • join combines the elements of an array into a single scalar variable (a string)

$dnastring = join('', @dnaarray);


The first argument to join is the separator (empty here); the second is the array to join.
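As a small sketch of how split and join undo each other, the lines below split the example string into letters and join them back together. The dash separator in the second join is only there to make the individual elements visible.

```perl
use strict;
use warnings;

my $dnastring = "ACGTGCTA";

# Split into single letters, then join them back with an empty separator.
my @dnaarray = split //, $dnastring;
my $rejoined = join('', @dnaarray);

# Join again with a different separator to show the separate elements.
my $dashed = join('-', @dnaarray);

print "$rejoined\n";
print "$dashed\n";
```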

Loops
  • A loop repeats a set of commands until it is done. The commands are placed in a BLOCK: some code delimited with curly brackets {}
  • Loops are really useful with arrays.
  • The “foreach” loop is probably the most useful of all:

foreach my $base (@dnaarray) {
    print "$base ";
}

Comparing strings
  • String comparison (is the text the same?)
      • eq (equal)
      • ne (not equal)

There are others but beware of them!

Getting part of a string
  • substr takes characters out of a string

$letter = substr($dnastring, $position, 1);

The three arguments are: which string, where in the string to start (positions are counted from zero), and how many letters to take.
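A short sketch of substr on the example string, assuming positions are counted from zero as Perl does. The variable name $codon is just for illustration.

```perl
use strict;
use warnings;

my $dnastring = "ACGTGCTA";

# Position 2 is the third letter, because counting starts at zero.
my $letter = substr($dnastring, 2, 1);

# Taking more than one letter: three letters starting at position 4.
my $codon = substr($dnastring, 4, 3);

print "$letter\n";
print "$codon\n";
```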

Combining strings
  • Strings can be concatenated (joined).
  • Use the dot . operator

$seq1 = "ACTG";
$seq2 = "GGCTA";
$seq3 = $seq1 . $seq2;
print $seq3;    # prints ACTGGGCTA

Making Decisions - review
  • The if statement is generally used together with numerical or string comparison operators, inside an (expression).

numerical: ==, !=, >, <, >=, <=

strings: eq, ne

  • You can make decisions on each member of an array using a loop which puts each part of the array through the test, one at a time
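The difference between the two families of operators can be sketched as follows. The e-values and the cutoff are invented numbers used only to have something to compare.

```perl
use strict;
use warnings;

# Numerical comparison: count the values below a cutoff.
my @evalues = (0.5, 1e-30, 0.001, 2e-10);   # made-up e-values
my $cutoff  = 1e-5;                         # made-up cutoff

my $significant = 0;
foreach my $evalue (@evalues) {
    if ($evalue < $cutoff) {
        $significant = $significant + 1;
    }
}

# String comparison: count the elements that are exactly "A".
my @bases   = ("A", "C", "G", "T", "A");
my $a_count = 0;
foreach my $base (@bases) {
    if ($base eq "A") {
        $a_count = $a_count + 1;
    }
}

print "$significant e-values below the cutoff\n";
print "$a_count bases are A\n";
```

Using == on text or eq on numbers usually runs without stopping the script but can give the wrong answer, which is why it pays to keep track of what each variable contains.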
More healthy exercise
  • Write a program that asks the user for a DNA restriction site, and then tells them whether that sequence matches the site for the restriction enzyme EcoRI, BamHI, or HindIII.
  • EcoRI: GAATTC
  • BamHI: GGATCC
  • HindIII: AAGCTT
Arrays and loops - review
  • An array starts with @. It contains multiple bits of data in a list-like format.
  • @bases = ("A", "C", "G", "T");
  • You can make decisions on each member of an array using a foreach loop which puts each part of the array through the test, one at a time
Test time, again
  • Remember –

keep track of what’s in a variable

don’t over-write a variable with another value, unless you intend to

syntax and case are critical

lines end with a semicolon

brackets and quotes must match.

Opening and closing files
  • To handle large amounts of data, Perl has to read data from files and write results to output files
  • This is done in two steps
  • First, you must give the file a name within the script - this is known as a filehandle
  • Use the open command:

open MYFILE, 'exampleprotein.txt';

Reading a file
  • Once the file is open, you can read from it, line by line, using the readline <> operator again
    • (put the filehandle between the angle brackets)
  • Perl reads files one line at a time: each time you read from the filehandle, you get the next line:

open FILE1, 'exampleprotein.txt';

$line1 = <FILE1>;

chomp $line1;

$line2 = <FILE1>;

Using loops to read in a file
  • The while loop keeps running its block while the test is true, so it keeps reading lines from the file until it runs out.
  • The special variable $_ holds the line that has just been read.

my $longsequence = '';
open FILE, 'exampleprotein.txt';
while (<FILE>) {
    $longsequence = $longsequence . $_;
    chomp $longsequence;    # remove the newline that was just added
}
close FILE;

  • This reads the whole file and appends each line, one at a time, to the variable $longsequence.
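Here is a self-contained variant of the same loop that can be run anywhere. It departs from the slide in a few ways that are assumptions of this sketch: it first writes a small temporary file with made-up sequence lines (using the core File::Temp module), it uses the three-argument form of open with a filehandle stored in a variable, and it adds "or die" so the script stops with a message if the file cannot be opened.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small temporary file so the sketch does not need exampleprotein.txt.
# The two lines of sequence are made-up example data.
my ($tmp, $filename) = tempfile(UNLINK => 1);
print $tmp "MKVLAAGI\n";
print $tmp "VGLCTSWR\n";
close $tmp;

# Open the file for reading; "or die" reports the reason if this fails.
open(my $in, '<', $filename) or die "Cannot open $filename: $!\n";

my $longsequence = '';
while (my $line = <$in>) {
    chomp $line;                             # remove the newline from this line
    $longsequence = $longsequence . $line;   # add the line to the end
}
close $in;

print "$longsequence\n";
```

Checking the result of open is worth the extra few words, because a misspelled file name otherwise just produces an empty result with no explanation.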
Now More Fun Exercises
  • Read a DNA sequence from a fasta format file
  • Calculate the GC content.
  • What about the non-DNA characters in the file?

>header lines with the name of the sequence

carriage returns !! You know this one.

blank spaces

N’s or X’s or unexpected letters
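One possible way to deal with the non-DNA characters is sketched below, using only substr, split and eq. It is not the only approach and not necessarily the intended answer. The FASTA-style lines are made up and held in an array instead of being read from a file, and uc (which converts text to upper case) is a Perl function not covered in the slides so far.

```perl
use strict;
use warnings;

# Made-up FASTA-style lines standing in for the contents of a file.
my @lines = (
    ">seq1 example sequence\n",
    "ACGT NNGC\n",
    "ggcc\n",
);

my $gc    = 0;   # number of G or C
my $total = 0;   # number of A, C, G or T

foreach my $line (@lines) {
    chomp $line;                                  # remove the line ending
    if (substr($line, 0, 1) ne ">") {             # ignore header lines
        foreach my $base (split //, uc($line)) {  # upper case, one letter at a time
            if ($base eq "G" or $base eq "C") {
                $gc    = $gc + 1;
                $total = $total + 1;
            }
            elsif ($base eq "A" or $base eq "T") {
                $total = $total + 1;
            }
            # anything else (spaces, N, X, ...) is not counted
        }
    }
}

# Assumes at least one A, C, G or T was found; otherwise this divides by zero.
my $gc_percent = 100 * $gc / $total;
print "GC content: $gc_percent%\n";
```

Whether N and X should be left out of the total, as here, or counted as non-GC bases is a decision to make for each data set.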

Writing to a File
  • Writing to a file is similar to reading from it
  • Use the > operator to open a file for writing:

open OUTPUT, '>/home/class30/output.txt';

  • This creates a new file with that name, or overwrites an existing file
  • Use >> to append text to an existing file
  • print to the file using the filehandle:

print OUTPUT $myoutputdata;
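The difference between > and >> can be seen by writing to the same file twice and reading it back. This sketch makes some assumptions of its own: it writes into a temporary directory created with the core File::Temp module so that no real file is overwritten, the file name is invented, and it uses the three-argument form of open with "or die" to report failures.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Work in a temporary directory that is deleted when the script ends.
my $dir      = tempdir(CLEANUP => 1);
my $filename = "$dir/output.txt";   # invented file name

# ">" creates the file, or overwrites it if it already exists.
open(my $out, '>', $filename) or die "Cannot write $filename: $!\n";
print $out "first line\n";
close $out;

# ">>" appends to the end of the existing file.
open($out, '>>', $filename) or die "Cannot append to $filename: $!\n";
print $out "second line\n";
close $out;

# Read the file back into an array, one element per line, to check it.
open(my $in, '<', $filename) or die "Cannot open $filename: $!\n";
my @lines = <$in>;
close $in;

print @lines;
```

If the second open used > instead of >>, the file would contain only the second line.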


Some more stuff you need to know

  • Instead of just letting the script carry on when an if test fails, you can use else to make it execute a second block of code when the statement in brackets isn't true.
  • You can string a lot of “if”s together using elsif

if ($site eq "GAATTC") {
    print "EcoRI site\n";
}
elsif ($site eq "GGATCC") {
    print "BamHI site\n";
}
elsif ($site eq "AAGCTT") {
    print "HindIII site\n";
}
else {    # only happens if none of the preceding are true
    die "I can't find any of the sites I know\n";
}

  • Bioinformatics data often can be made into array format:
    • multi-line sequence files
    • Microarray or statistics data in “tab delimited”


  • You can address part of the array as if it was a variable using a subscript

@numbers = (8, 8, 8, 23984092, 8);

print "$numbers[3]\n";

Please note – the first element is number zero! Second is 1!
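A quick sketch of subscripts on the bases array. Two things here go beyond the slide and are standard Perl: a negative subscript counts from the end of the array, and scalar(@array) gives the number of elements.

```perl
use strict;
use warnings;

my @bases = ("A", "C", "G", "T");

my $first  = $bases[0];    # the first element has subscript 0
my $second = $bases[1];    # the second element has subscript 1
my $last   = $bases[-1];   # a negative subscript counts from the end

# scalar(@bases) gives the number of elements in the array.
my $count = scalar(@bases);

print "$first $second $last $count\n";
```

Note that a single element is written with $ and not @, because one element of an array is an ordinary scalar value.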

Regular Expressions
  • Sounds odd, doesn’t it? It means a pattern that the computer can match, in a standard format.
  • Very useful in bioinformatics work
  • DNA patterns
      • restriction sites
      • promoters/transcription factor binding sites
      • intron splice site
  • Protein patterns
      • conserved domains (motifs)
      • active sites
      • structural motifs (membrane spanning, signal peptide, etc.)
The Binding and Match Operators: =~ and / /
  • The =~ operator binds a variable to a pattern match
  • The // operator matches things to patterns

It can be translated as “contains”

  • The forward slashes contain the pattern to be matched, like this:

if ($dnaseq =~ /GAATTC/) {print "EcoRI site found\n"}

A regular expression..

is a joy forever. And a pattern to match:

can be just a text string, such as: /GATC/

it can have alternative characters: /G[AT]TC/

or contain a wildcard that matches any character: /G.TC/

Or be something bizarre: /\/[^\/]*\/\.\./
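To see how the first three patterns differ, the sketch below tries each of them on three made-up strings and counts the matches. The counter names are chosen for this illustration.

```perl
use strict;
use warnings;

my @sequences = ("GATC", "GTTC", "GCTC");   # made-up test strings

my ($literal, $alternative, $wildcard) = (0, 0, 0);
foreach my $seq (@sequences) {
    if ($seq =~ /GATC/)    { $literal     = $literal + 1; }       # exact text only
    if ($seq =~ /G[AT]TC/) { $alternative = $alternative + 1; }   # A or T in position 2
    if ($seq =~ /G.TC/)    { $wildcard    = $wildcard + 1; }      # any character in position 2
}

print "Exact text matched $literal of the strings\n";
print "Alternative characters matched $alternative of the strings\n";
print "Wildcard matched $wildcard of the strings\n";
```

Each pattern is more permissive than the one before, so the counts go up from one pattern to the next.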

Perl Regular Expressions
  • It never ceases to amaze me what people can do with regular expressions, but you can match pretty much anything you can think of and a lot you can’t:

#man perlrequick

Alternative Characters
  • Square brackets within the match expression allow for alternative characters:

if ($dna =~ /CAG[AT]CAG/)

      • This will match a DNA string that contains CAG, then an A or a T, followed by another CAG.
  • A vertical line within the /expression/ means “or”; it allows you to look for either of two completely different patterns:

if ($dna=~/GAAT|ATTC/)

Special characters
  • Perl has a large set of special characters to use in regular expressions:
    • the dot (.) matches any character
    • \d matches any digit (0-9)
    • \w matches any “word” character (a letter, a digit or an underscore, not punctuation or space)
    • \s matches one white space character (use \s+ for any amount of white space)
    • \t matches a tab (useful for tab delimited files)
    • ^ matches the beginning of a line
    • $ matches the end of a line
    • Knowing this makes you lots of fun at parties.
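A few of these special characters are tried out below on a single made-up tab-delimited line. The line contents and the variable names are invented for this sketch.

```perl
use strict;
use warnings;

# A made-up tab-delimited line, as might appear in a results table.
my $line = "gene7\t42\tchr 2";

my $starts_with_gene = 0;
my $ends_with_digit  = 0;
my $has_tab_number   = 0;
my $has_space        = 0;

if ($line =~ /^gene/)     { $starts_with_gene = 1; }   # ^ anchors the start
if ($line =~ /\d$/)       { $ends_with_digit  = 1; }   # $ anchors the end
if ($line =~ /\t\d\d\t/)  { $has_tab_number   = 1; }   # tab, two digits, tab
if ($line =~ /\w\s\d/)    { $has_space        = 1; }   # word character, space, digit

# split accepts the same patterns, so /\t/ breaks the line into columns.
my @columns = split /\t/, $line;

print "$starts_with_gene $ends_with_digit $has_tab_number $has_space\n";
print "Second column: $columns[1]\n";
```

Splitting on /\t/ is often all that is needed to get at one column of a tab-delimited file.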
“Special” characters
  • What if you need to match text that contains a special character?
      • Aren’t there dots at the end of sentences?
  • Now you have to use a backslash (\) to “escape” the special meaning of that character:

if ($onewordsentence =~ /\w+\./)

-This would match any text that has one or more word characters, followed by a dot.
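The effect of the backslash is easiest to see by running the same pattern with and without it. The two file names below are made up for this sketch.

```perl
use strict;
use warnings;

my $with_dot    = "sequence.fasta";   # made-up names
my $without_dot = "sequence_fasta";

my ($escaped_hits, $unescaped_hits) = (0, 0);
foreach my $name ($with_dot, $without_dot) {
    # \. matches only a real dot.
    if ($name =~ /\w+\.fasta/) { $escaped_hits = $escaped_hits + 1; }

    # An unescaped . matches any character, including the underscore.
    if ($name =~ /\w+.fasta/)  { $unescaped_hits = $unescaped_hits + 1; }
}

print "Escaped dot matched $escaped_hits name(s)\n";
print "Unescaped dot matched $unescaped_hits name(s)\n";
```

The unescaped version matches both names, which is rarely what is wanted when looking for a file extension.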


Bringing it together
  • So now, when you think about it, you can:

Open a file

Check whether each line of the file contains a particular pattern

Recover part of that line

Write it out to another file

So.. isn’t that what you wanted to know?

But really, it’s very useful combined with the UNIX command line.
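The four steps can be put together in one short script. Everything concrete in this sketch is an assumption made for illustration: the input is a made-up tab-delimited file of names and sequences written to a temporary directory (using the core File::Temp module), the pattern is the EcoRI site, and the part of the line that is recovered is the name in the first column.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Set up a made-up input file in a temporary directory.
my $dir     = tempdir(CLEANUP => 1);
my $infile  = "$dir/sequences.txt";
my $outfile = "$dir/ecori_names.txt";

open(my $setup, '>', $infile) or die "Cannot write $infile: $!\n";
print $setup "seq1\tACGGAATTCTTA\n";
print $setup "seq2\tACGTACGTACGT\n";
print $setup "seq3\tGAATTCGGGCCC\n";
close $setup;

# 1. Open a file (and the output file we will write to).
open(my $in,  '<', $infile)  or die "Cannot open $infile: $!\n";
open(my $out, '>', $outfile) or die "Cannot write $outfile: $!\n";

my $found = 0;
while (my $line = <$in>) {
    chomp $line;
    # 2. Check whether the line contains a particular pattern.
    if ($line =~ /GAATTC/) {
        # 3. Recover part of that line: the name in the first column.
        my @columns = split /\t/, $line;
        # 4. Write it out to another file.
        print $out "$columns[0]\n";
        $found = $found + 1;
    }
}
close $in;
close $out;

print "Wrote $found names\n";
```

The same skeleton works for most line-by-line filtering jobs: only the pattern and the part of the line that is kept need to change.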

A last exercise?...
  • Now we’re getting up to speed with Perl, let’s try something more fun:
  • Open up a BLAST output file
  • Spit out the name of the query sequence, the top hit, and how many hits there were.
Only the beginning
  • Sadly, there is much, much more than this to the Power of Perl.
  • You can create websites, and download other people’s
  • Make Linux and Windows graphical programs
  • Do almost anything on the internet
  • Interact with databases
  • And much much more
Why won’t I teach you more stuff?
  • Whoa!
  • Programming takes time to learn properly
  • You’ve got the tools now to get started on a programming project
  • We will go through some more Perl functions in the later classes, especially modules such as Bioperl.
Practice makes perfect
  • You can now practice your Perl skills and understand a lot of the books and help files, which are probably more useful.

#man perlintro

#man perlrequick

#perldoc bioperl

  • Also, check out Lincoln Stein’s course at: