Last time… We covered a bit of material… Try to keep up with the reading – it’s all in there!
Welcome to lecture 4:An introduction to modular PERL
IGERT – Sponsored Bioinformatics Workshop Series
Michael Janis and Max Kopelevich, Ph.D.
Dept. of Chemistry & Biochemistry, UCLA
Gene Finding(A very simplified example)
Conversely, what is the likelihood that a given region of sequence is a coding region?
Note that we used the term ‘likelihood’:
We don’t have to choose. We can use both (heuristics)?
We’ll be dealing with sometimes complex and contradicting information
Genes are linear sequences, but gene structure is not simple:
possible stop codons (TAA, TAG, TGA)
ATG: possible start codons
In-frame start to stop → putative ORF
Note that we used the term ‘likelihood’: we need to introduce, briefly, some probability and statistics.
Just a little Evil…
Permutations
Permutations
The number of distinct permutations of n objects, where n1, n2, … nr are the numbers of objects that are alike:
n! / (n1! n2! … nr!)
Example: 6! / (3! 2!) = 60
Combinations
“n choose k”: C(n, k) = n! / (k! (n − k)!)
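These counting formulas drop straight into Perl. A minimal sketch (the subroutine names factorial and choose are our own, not part of the lecture code):

```perl
# factorial: n! computed iteratively
sub factorial {
    my ($n) = @_;
    my $result = 1;
    $result *= $_ for (2 .. $n);
    return $result;
}

# choose: "n choose k" = n! / (k! * (n-k)!)
sub choose {
    my ($n, $k) = @_;
    return factorial($n) / (factorial($k) * factorial($n - $k));
}

print choose(6, 2), "\n";   # 15 ways to choose 2 objects from 6
```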
Set Theory
Sample Space of an experiment is the set of all possible values/outcomes of the experiment
S = {a, b, c, d, e, f, g, h, i, j}
S = {Heads, Tails}
S = {1, 2, 3, 4, 5, 6}
E = {a, b, c, d}
F = {b, c, d, e, f, g}
E ⊆ S, F ⊆ S
Ec = {e, f, g, h, i, j}
Fc = {a, h, i, j}
E ∪ F = {a, b, c, d, e, f, g}
E ∩ F = EF = {b, c, d}
Venn Diagrams
(Diagram: events E, F, and G drawn as overlapping regions within the sample space S)
Simple Probability
P(S) = 1
0 ≤ P(E) ≤ 1
P(Ec) = 1 − P(E)
P(E ∪ F) = P(E) + P(F) − P(EF)
Independence
Two events E and F are independent if neither outcome depends on the outcome of the other.
So if E & F are independent, then:
P(EF) = P(E)*P(F)
If E, F & G are independent, then:
P(EFG) = P(E)*P(F)*P(G)
(Venn diagram: overlapping events E and F within sample space S; the overlap is the event EF)
Conditional Probability
Given E, the probability of F is:
P(F | E) = P(EF) / P(E)
ASSUME
Here’s a simple question. I come from a family of two children (prior information states: I am a male). What’s the probability that my sibling is a sister?
Assumptions…
Here’s the same question without the prior information: I come from a family of two children. What’s the probability that my sibling is a sister?
Random Variables
Discrete vs. Continuous
Probability Density Function
Many problems don’t have simple probabilities. For those, the probabilities are expressed as a function…
aka “pdf”
Plug a into some function, e.g. f(a) = 2a^2 − a^3
Some Useful pdf’s
Simple cases (like fair/loaded coin/dice, etc…)
Uniform random variable (“equally likely”)
f(a) = 1/2 for a = Heads
f(a) = 1/2 for a = Tails
pdf of a Binomial
Very useful function!
P(X = k) = C(n, k) p^k q^(n−k)
Where p = P(success) & q = P(failure)
P(success) + P(failure) = 1
n choose k is the total number of possible ways to get k successes in n attempts
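The binomial pmf is easy to compute directly. A sketch (the subroutine names are our own; the multiplicative form of “n choose k” avoids huge intermediate factorials):

```perl
# multiplicative "n choose k": product of (n-i+1)/i for i = 1..k
sub choose {
    my ($n, $k) = @_;
    my $result = 1;
    $result *= ($n - $_ + 1) / $_ for (1 .. $k);
    return $result;
}

# P(X = k) = C(n, k) * p^k * q^(n-k), with q = 1 - p
sub binomial_pmf {
    my ($k, $n, $p) = @_;
    return choose($n, $k) * $p**$k * (1 - $p)**($n - $k);
}

# probability of exactly 2 heads in 4 fair coin flips
print binomial_pmf(2, 4, 0.5), "\n";   # 0.375
```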
Hypergeometric distribution
P(X = k) = C(K, k) C(N − K, n − k) / C(N, n)
(k successes in n draws, without replacement, from a population of N containing K successes)
Using the p.d.f.
Notice how these terms are binomial coefficients (“n choose k”)…
Expectation of a Discrete Random Variable
Weighted average of a random variable: E[X] = Σ x · P(X = x)
…Of a function: E[g(X)] = Σ g(x) · P(X = x)
Measures of central tendency
Variance
Variation, or spread, of the values of a random variable:
Var(X) = E[(X − μ)²], where μ = E[X]
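Both definitions translate into a few lines of Perl. A sketch using a fair six-sided die as the random variable (the %pmf hash is our own illustration, not lecture code):

```perl
# E[X] and Var(X) for a discrete random variable given as
# value => probability pairs (a fair six-sided die here)
my %pmf = map { $_ => 1/6 } (1 .. 6);

my $mu = 0;
$mu += $_ * $pmf{$_} for keys %pmf;              # E[X] = sum of x * P(x)

my $var = 0;
$var += ($_ - $mu)**2 * $pmf{$_} for keys %pmf;  # Var(X) = E[(X - mu)^2]

printf "mean = %.3f, variance = %.3f\n", $mu, $var;
```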
Variance and standard deviation: measures of variation in statistics:
s^2 = Σ (x_i − mean)^2 / (n − 1); s = sqrt(s^2)
Statistics of populations
Note that μ and σ are both parameters of the normal probability curve.
Source: http://www.bsos.umd.edu/socy/smartin/601/
Measuring probabilities under the normal curve
Did rounding occur?
An ordered array (radix sort) yields stem-and-leaf plots
Difficult to integrate… but probabilities have been mapped out (tabulated) for this curve. Transformations from other curves are possible (z = (x − μ) / σ).
Box plots (box-and-whisker plots; Tukey, 1977)
Upper fence / whisker: min(Q3 + 1.5·IQR, largest x)
Outliers: points beyond the fences
Q3
Median
IQR
Q1
Lower fence / whisker: max(Q1 − 1.5·IQR, smallest x)
Let’s start with a simple model that utilizes codon bias
What we need:
possible stop codons (TAA, TAG, TGA)
ATG: possible start codons
In-frame start to stop → putative ORF
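That in-frame start-to-stop rule can be sketched as a Perl regular expression (a simplified illustration: one strand, one pass; a real gene finder scans all six reading frames):

```perl
# find an in-frame ATG...stop stretch: ATG, then whole codons
# (non-greedy), then the first in-frame stop codon
my $seq = "CCATGGCTAAATAGGC";
while ($seq =~ /(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA))/g) {
    print "putative ORF: $1\n";   # ATGGCTAAATAG
}
```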
Codon bias assumptions
Our model
>one CTAAACAAAGTGCTGCCACCCCGAATTGCCAATATAAT…
#!/usr/bin/perl -w
use strict;
open(IN, "chr.fsa") or die "Can't open chr.fsa: $!";
while (<IN>)
{
chomp;
# load the fasta file into a hash
# the header will be the key
# the sequence will be the value
}
close(IN);
(fasta file looks like this)
foreach my $file (@chrFiles) {
chomp($file); # get rid of metacharacters, newlines from filenames
%fastaSeqs=();
$header='';
open (IN,"$file") or die "Can't open $file: $!"; # create a file handle for the file being processed
$file=~s/\.fsa//;
while (<IN>) {
chomp;
# INSERT YOUR CODE TO READ IN A FASTA FILE HERE #
# (Hint: use the hash function you learned about)
}
close IN; # close that file, filehandle. we'll need to use it for the next file
### you'll need to update the hash...
}
First, let’s write a fasta file reader
On the first pass, write it for one small file
We’ll also need some functions…
I’ve written an outline for you to follow
my %fastaSeqs; # declaration of the minor hash to (re)used to hold the fasta data files
my $header; # declaration of scalar to hold the fasta header for each instance
my %chrList; # declaration of master hash, a hash of hashes, for every file
my @chrFiles=`ls -1 *.fsa`; # a way to use wildcards to load all fasta files in the CWD
foreach my $file(@chrFiles) { # process each file in turn; here we populate each minor hash,
# then pass that hash to the master hash
chomp($file); # get rid of metacharacters, newlines from filenames
%fastaSeqs=(); # clear out the minor hash from the last instance
$header=''; # clear out the header scalar
open (IN,"$file") or die "Can't open $file: $!"; # create a file handle for the file being processed
$file=~s/\.fsa//; # we've used the filename to create the filehandle;
# we don't need the filename any longer, so we'll
# remove the .fsa extension and use the filename as
# the major hash key
while (<IN>) { # use a while loop to go through the file one line at a time
chomp; # remove newlines/metacharacters
if ($_=~/^\s*$/) { # let's ignore blank lines in the file that may exist
next;
}
elsif ($_=~/^>/) { # here we grab the header; note that the key => value pair
# is empty at this point in the minor hash
$header=$_; # just take the whole header
$header=~s/>//; # strip out the leading > from the header
}
else { # here's where we grab the sequence;
# (if it's not a header, it's sequence in our fasta)
$fastaSeqs{$header} .= $_; # we simply concatenate all the sequence lines together
# into a cohesive sequence.
}
}
close IN; # close that file, filehandle! we'll need to use it for
# next file
$chrList{$file}={%fastaSeqs}; # finally, for each minor hash created, we append it to
# major hash (called chrList).
}
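Once the master hash is populated you can walk the hash of hashes. A sketch, using a toy one-entry %chrList in place of the loaded data:

```perl
# toy stand-in for the loaded data (a real run would use the reader above)
my %chrList = ( chrT => { one => 'CTAAACAAAGTG' } );

# report the length of every sequence loaded into the master hash
foreach my $chr (sort keys %chrList) {
    foreach my $header (sort keys %{ $chrList{$chr} }) {
        my $len = length $chrList{$chr}{$header};
        print "$chr\t$header\t$len bp\n";
    }
}
```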
Hopefully your fasta file reader makes sense… Now let’s build some functionality for the sequences we’ve loaded… First steps in modular programming
An analogy for programmers: procedural C++
C is the basis of most OS
C is a compiled language
C is portable
C extends functionality of many programs well (especially if they are slow)
C++ is the object-oriented C
ANSI C/C++ is handled by gcc/g++ compiler
Of course emacs has edit/compile/debug functions for both!
C/C++ are modular, based on libraries – this is the hardest part of learning C, remembering all the libraries!
We’ll look at procedural C++, and leave the OOP for later…
#include <iostream>
using namespace std;
int main() {
int numberGenes;
cout << "Enter number of genes\n";
cin >> numberGenes;
return 0;
}
No matrices; arrays of arrays are used instead
double a[3] = {1.2, 3.3, 4.4};
char b[3] = {'a', 'b', 'c'};
for (int i = 0; i < 3; i++) {
cout << a[i] << endl;
}
Write a function which accepts an integer array and returns the sum of the array.
int sumOfArray(int a[], int size)
{
int sum = 0;
for(int j = 0; j < size; j++)
sum = sum + a[j];
return (sum);
}
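For comparison, the same function in Perl: the argument list arrives in @_, so no separate size parameter is needed (the subroutine name sum_of_array is our own):

```perl
# sum an arbitrary list of numbers passed in @_
sub sum_of_array {
    my $sum = 0;
    $sum += $_ for @_;
    return $sum;
}

print sum_of_array(1, 2, 3, 4), "\n";   # 10
```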
Perl has functions as well… subroutines and Modules.
This is the basis of modern bioinformatics programming – modularity
The environment tells C or Perl which language references (libraries) it should use – kind of like locality or accent for a language.
We have been using a form of subroutines all along. Perl functions are basically built in subroutines. You call them (or "invoke") a function by typing its name, and giving it one or more arguments.
Perl gives you the opportunity to define your own functions, called "subroutines". In the simplest sense, subroutines are named blocks of code that can be reused as many times as you wish.
sub hypotenuse {
my ($a,$b) = @_;
return sqrt($a**2 + $b**2);
}
sub E {
return 2.71828182845905;
}
$y = 3;
$x = hypotenuse($y,4);
# $x now contains 5
$x = hypotenuse((3*$y),12);
# $x now contains 15
$value_e = E();
# $value_e now contains 2.71828182845905
This way of using subroutines makes them look suspiciously like functions.
Note: Unlike a function, you must use parentheses when calling a subroutine in this manner, even if you are giving it no arguments.
Perhaps the most important concept to understand is that values are passed to the subroutine in the default array @_. This array springs magically into existence (like the scalar $_ we learned about earlier), and contains the list of values that you gave to the subroutine (within the parentheses).
sub Add_two_numbers {
my $number1 = shift; # get first argument from @_ and put it in $number1
my $number2 = shift; # get second argument from @_ and put it in $number2
my $sum = $number1 + $number2;
return $sum;
}
Sometimes you need a more complex data structure! (we’ll be using a HoH!)
Examples:
* An array of arrays (can do the job of a 2-dimensional matrix).
DATA:
Spot_num  Ch1    BkgdCh1  Ch2    BkgdCh2
000       0.124  43.2     0.102  80.4
001       0.113  60.7     0.091  22.6
002       0.084  112.2    0.144  35.3
my @spotarray = ([0.124, 43.2, 0.102, 80.4], [0.113, 60.7, 0.091, 22.6], [0.084, 112.2, 0.144, 35.3]);
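Indexing into the array of arrays is row-first, then column. A short sketch using the same data:

```perl
my @spotarray = ([0.124, 43.2,  0.102, 80.4],
                 [0.113, 60.7,  0.091, 22.6],
                 [0.084, 112.2, 0.144, 35.3]);

print $spotarray[1][3], "\n";    # Bkgd Ch2 of spot 001: 22.6
print scalar @spotarray, "\n";   # number of rows (spots): 3
```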
Well, first, what is a variable?
Think of a variable as a (named) box that holds a value. The name of the box is the name of the variable. After
$x = 1;
we have
      +---+
$x:   | 1 |
      +---+
After
@y = (1, 'a', 23);
we have
      +--------------+
@y:   | (1, 'a', 23) |
      +--------------+
$list_ref = \@array;
$map_ref = \%hash;
$c_ref = \$count;
Refs to subroutines:
$sub_ref = \&subroutine;
A reference is an additional, rather different, way to name a variable.
Ex: from $ref_to_y = \@y we have
               +------------------+
          +--> | @y: (1, 'a', 23) |
          |    +------------------+
          |
+---------------+
| $ref_to_y: *--+
+---------------+
$ref_to_y contains a reference (pointer) to @y.
print @y yields 1a23 and print $ref_to_y yields ARRAY(0x80cd6ac).
@{$array_reference}
%{$hash_reference}
${$scalar_reference}
print @{$ref_to_y} yields 1a23.
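A short sketch tying the reference syntax together (the variable names are our own):

```perl
my @array = (1, 'a', 23);
my %hash  = (key => 'value');
my $count = 42;

my $list_ref = \@array;    # reference to an array
my $map_ref  = \%hash;     # reference to a hash
my $c_ref    = \$count;    # reference to a scalar

print "@{$list_ref}\n";    # dereference the whole array: 1 a 23
print ${$c_ref}, "\n";     # dereference the scalar: 42
print $list_ref->[2], "\n";# arrow syntax, same as ${$list_ref}[2]: 23
```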
#!/usr/bin/perl -w
use strict;
@ARGV = '/home/mako/DATA/sequences.txt' unless @ARGV;
$/ = ">";
my %DATA;
while (<>) {
chomp;
my ($id_line,@rest) = split "\n";
$id_line =~ /^(\S+)/ or next;
my $id = $1;
my $sequence = join '',@rest;
my $length = length $sequence;
my $gc_count = $sequence =~ tr/gcGC/gcGC/;
my $gc_content = $gc_count/$length;
$DATA{$id} = { sequence => $sequence,
length => $length,
gc_content => sprintf("%3.2f",$gc_content)
};
}
my @ids = sort { $DATA{$a}->{gc_content} <=> $DATA{$b}->{gc_content}
} keys %DATA;
foreach my $id (@ids) {
print "$id\n";
print "\tgc content = $DATA{$id}->{gc_content}\n";
print "\tlength = $DATA{$id}->{length}\n";
print "\n";
}
The File::Basename module is a standard module that is distributed with Perl. When you load the File::Basename module, you get two new functions, basename and dirname.
basename takes a long UNIX path name and returns the file name at the end. dirname takes a long UNIX path name and returns the directory part.
use File::Basename is the same syntax you’ll use for accessing any module you create
But you might have to tell perl where you put it…
#!/usr/bin/perl
# file: basename.pl
use strict;
use File::Basename;
my $path = '/home/mako/DATA/chrT.fsa';
my $base = basename($path);
my $dir = dirname($path);
print "The base is $base and the directory is $dir.\n";
Each module will automatically import a different set of variables and subroutines when you use it. You can control what gets imported by providing use with a list of what to import.
To find out what modules come with perl, look in Appendix A of the Perl 5 Pocket Reference, or use the perldoc command from the UNIX shell. All the Perl documentation is available with this command:
% perldoc perlmodlib
To learn more about a module, run perldoc with the module's name:
% perldoc File::Basename
You can find thousands of Perl Modules on CPAN, the Comprehensive Perl Archive Network:
http://www.cpan.org
Perl has a CPAN module installer built into it. You run it like this:
% perl -MCPAN -e shell
cpan shell -- CPAN exploration and modules installation (v1.59_54)
ReadLine support enabled
cpan>
cpan> install Text::Wrap
Some modules are object-oriented. Instead of importing a series of subroutines that are called directly, these modules define a series of object types that you can create and use.
We’ll see what OOP is and why we want to use it next time…