Perl for bioinformatics
Sponsored Links
This presentation is the property of its rightful owner.
1 / 34

Perl for Bioinformatics PowerPoint PPT Presentation

  • Uploaded on
  • Presentation posted in: General

Perl for Bioinformatics. Lecture 4. Variables - review. A variable name starts with a $ It contains a number or a text string Use my to define a variable Use = to assign a value Use \ to stop the variable being interpolated

Download Presentation

Perl for Bioinformatics

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

Perl for bioinformatics

Perl for Bioinformatics

Lecture 4

Variables review

Variables - review

  • A variable name starts with a $

  • It contains a number or a text string

  • Usemyto define a variable

  • Use = to assign a value

  • Use \ to stop the variable being interpolated

  • Take care with variable names and with changing the contents of variables

Conditional blocks review

Conditional Blocks, review

  • Anif test can be used to control a command in a conditional block, according to the outcome of a decision made by comparing variables.

  • It’s important to keep track of whether variables are strings or numbers. Numbers are compared with ==, strings with eq.

  • It’s usual to indent the block to make it easier to read the code



  • An array can store multiple pieces of data.

  • They are essential for the most useful functions of Perl. They can store data such as:

    • the lines of a text file (e.g. primer sequences)

    • a list of numbers (e.g. BLAST e values)

  • Arrays are designated with the symbol @

    my @bases = (“A”, “C”, “G”, “T”);

Converting a variable to an array

Converting a variable to an array

split splits a variable into parts and puts them in an array.

my $dnastring = "ACGTGCTA";

my @dnaarray = split //, $dnastring;

@dnaarray is now (A, C, G, T, G, C, T, A)

@dnaarray = split /T/, $dnastring;

@dnaarray is now (ACG, GC, A)

Converting an array to a variable

Converting an array to a variable

  • joincombines the elements of an array into a single scalar variable (a string)

    $dnastring = join('', @dnaarray);


(empty here)

which array



  • A loop repeats a bunch of functions until it is done. The functions are placed in a BLOCK – some code delimited with curly brackets {}

  • Loops are really useful with arrays.

  • The “foreach” loop is probably the most useful of all:

    foreach my $base (@dnaarray) {

    print "$base “;


Comparing strings

Comparing strings

  • String comparison (is the text the same?)

    • eq (equal )

    • ne (not equal )

      There are others but beware of them!

Getting part of a string

Getting part of a string

  • substrtakes characters out of a string

    $letter = substr($dnastring, $position, 1)

where in the string

how many letters to take

which string

Combining strings

Combining strings

  • Strings can be concatenated (joined).

  • Use the dot . operator

    $seq1= “ACTG”;

    $seq2= “GGCTA”;

    $seq3= $seq1 . $seq2;

    print $seq3;ACTGGGCTA

Making decisions review

Making Decisions - review

  • The if operator is generally used together with numerical or string comparison operators, inside an (expression).

    numerical: ==, !=, >, <, ≥, ≤

    strings:eq, ne

  • You can make decisions on each member of an array using a loop which puts each part of the array through the test, one at a time

More healthy exercise

More healthy exercise

  • Write a program that asks the user for a DNA restriction site, and then tells them whether that particular sequence matches the site for the restriction enzyme EcoRI, or Bam HI, or Hind III.

  • Site for EcoR1: GAATTC

  • Bam H1: GGATCC

  • Hind III: AAGCTT

Arrays and loops review

Arrays and loops - review

  • An array starts with @. It contains multiple bits of data in a list-like format.

  • @bases = (“A”, “C”, “G”, “T”);

  • You can make decisions on each member of an array using a foreachloop which puts each part of the array through the test, one at a time

Test time again

Test time, again

  • Remember –

    keep track of what’s in a variable

    don’t over-write a variable with another value, unless you intend to

    syntax and case are critical

    lines end with a semicolon

    brackets and quotes must match.

Opening and closing files

Opening and closing files

  • So we can input large amounts of data, Perl has to read data out of files, and write results into output files

  • This is done in two steps

  • First, you must give the file a name within the script - this is known as a filehandle

  • Use the open command:

    open MYFILE, ‘exampleprotein.txt’;

Reading a file

Reading a file

  • Once the file is open, you can read from it, line by line, using the readline <> operator again

    • (put the filehandle between the angle brackets)

  • Perl reads files one line at a time, each time you input data from the file, the next line is read:

    open FILE1,’exampleprotein.txt’;

    $line1 = <FILE1>;

    chomp $line1;

    $line2 = <FILE1>;

Using loops to read in a file

Using loops to read in a file

  • The while loop just keeps doing an expression while it’s true. So it will keep reading lines from the file until it runs out.

  • The special variable $_ keeps track of the line of the file we’re on.

    my $longsequence;

    open FILE, ‘exampleprotein.txt’;

    while (<FILE>){

    $longsequence = $longsequence . $_;

    chomp $longsequence;


    close FILE;

  • This reads the whole file, and puts each line into the variable $longsequence one at a time.

Now more fun excercises

Now More Fun Excercises

  • Read a DNA sequence from a fasta format file

  • Calculate the GC content.

  • What about the non-DNA characters in the file?

    >header lines with the name of the sequence

    carriage returns !! You know this one.

    blank spaces

    N’s or X’s or unexpected letters

Writing to a file

Writing to a File

  • Writing to a file is similar to reading from it

  • Use the > operator to open a file for writing:

    open OUTPUT,‘>/home/class30/output.txt’;

  • This creates a new file with that name, or overwrites an existing file

  • Use >> to append text to an existing file

  • print to the file using the filehandle:

    print OUTPUT $myoutputdata;

Perl for bioinformatics


Some more stuff you need to know

  • Instead of just letting the script go on if it fails an if test, you can get it to execute a second block of code if the statement in brackets isn’t true.



  • You can string a lot of “if”s together using elsif

    if ($site eq “GAATTC” {

    print “EcoR1 site\n”;


    elsif ($site eq “CCATGG” {

    print “BamHI site\n”;


    elsif ($site eq “AAGCTT”) {

    print “HindIII site\n”;


    else { #only happens if none of the preceeding are true

    die “I can’t find any of the sites I know\n”;




  • Bioinformatics data often can be made into array format:

    • multi-line sequence files

    • Microarray or statistics data in “tab delimited”


  • You can address part of the array as if it was a variable using a subscript

    @numbers = (8, 8, 8, 23984092, 8);

    print “$numbers[3]\n”;

    Please note – the first element is number zero! Second is 1!

Regular expressions

Regular Expressions

  • Sounds odd, doesn’t it? It means a pattern that the computer can match, in a standard format.

  • Very useful in bioinformatics work

  • DNA patterns

    • restriction sites

    • promoters/transcription factor binding sites

    • intron splice site

  • Protein patterns

    • conserved domains (motifs)

    • active sites

    • structural motifs (membrane spanning, signal peptide, etc.)

  • The binding and match operators

    The Binding and Match Operators:=~ / /

    • The =~ operator bindsfunctions together

    • The // operator matches things to patterns

      It can be translated as “contains”

    • The forward slashes contain the pattern to be matched, like this:

      if ($dnaseq=~ /GAATTC/) {print “EcoRI site found\n”}

    A regular expression

    A regular expression..

    is a joy forever. And a pattern to match:

    can be just a text string, such as: /GATC/

    it can have alternative characters: /G[AT]TC/

    or contain a wildcard that matches any character: /G.TC/

    Or be something bizzare:/\/[^\/]*\/\.\./

    Perl regular expressions

    Perl Regular Expressions

    • It never ceases to amaze me what people can do with regular expressions, but you can match pretty much anything you can think of and a lot you can’t:

      #man perlrequick

    Alternative characters

    Alternative Characters

    • Square brackets within the match expression allow for alternative characters:

      if ($dna =~ /CAG[AT]CAG/)

      • This will match an DNA string that starts with CAG; has A or T in the 4th position, followed by another CAG.

  • A vertical line within the /expression/ means “or”; it allows you to look for either of two completely different patterns:

    if ($dna=~/GAAT|ATTC/)

  • Special characters

    Special characters

    • Perl has a large set of special characters to use in regular expressions:

      • the dot (.) matches any character

      • \d matches any digit (a number from 0-9)

      • \w matches any “word” character (a letter or a number, not punctuation or space)

      • \s matches white space (any amount)

      • \t matches a tab (useful for tab delimited files)

      • ^ matches the beginning of a line

      • $ matches the end of a line

      • Knowing this makes you lots of fun at parties.

    Special characters1

    “Special” characters

    • What if you need to match text that contains a special character?

      • Aren’t there dots at the end of sentences?

  • Now you have to use a backslash (\) to “escape” the special meaning of that character:

    if $onewordsentence =~ /\w+\ ./

    -This would match any text that has one or more text characters, followed by a dot.


  • Bringing it together

    Bringing it together

    • So now, when you think about it, you can:

      Open a file

      Check whether each line of the file contains a particular pattern

      Recover part of that line

      Write it out to another file

      So.. isn’t that what you wanted to know?

      But really, it’s very useful combined with the UNIX command line.

    A last exercise

    A last exercise?...

    • Now we’re getting up to speed with Perl, lets try something more fun:

    • Open up a BLAST output file

    • Spit out the name of the query sequence, the top hit, and how many hits there were.

    Only the beginning

    Only the beginning

    • Sadly, there is much, much more than this to the Power of Perl.

    • You can make, create and download other people’s websites

    • Make Linux and Windows graphical programs

    • Do almost anything on the internet

    • Interact with databases

    • And much much more

    Why won t i teach you more stuff

    Why won’t I teach you more stuff?

    • Whoa!

    • Programming takes time to learn properly

    • You’ve got the tools now to get started on a programming project

    • We will go through some more Perl functions in the later classes, especially modules such as Bioperl.

    Practice makes perfect

    Practice makes perfect

    • You can now practice your Perl skills and understand a lot of the books and help files, which are probably more useful.

      #man perlintro

      #man perlrequick

      #perldoc bioperl

    • Also, check out Lincoln Stein’s course at:

  • Login