Regular expressions

Regular expressions • Perl provides a pattern-matching engine • Patterns are called regular expressions • They are extremely powerful • probably Perl's strongest feature, compared to other languages • Often called "regexps" for short

Motivation: N-glycosylation motif • Common post-translational modification in ER • Membrane & secreted proteins • Purpose: folding, stability, cell-cell adhesion • Attachment of a 14-sugar oligosaccharide • Occurs at asparagine residues with the consensus sequence NX1X2, where • X1 can be anything(but proline & aspartic acid inhibit) • X2 is serine or threonine • Can we detect potential N-glycosylationsites in a protein sequence?

User Input from Keyboard • Input a line of input from the user and save it into a variable: • We can also input a file name from the user that we want to open: print "Enter your DNA sequence:"; $dna = <STDIN>; chomp($dna); It is often needed to remove the “new line” character from user input print "Enter data file name:"; $data = <STDIN>; chomp($data); open F, $data; Create a file handle called F for the file name stored in the $data variable

Interactive testing • This script echoes input from the keyboard • Sometimes (e.g. in Windows IDEs) the output isn’t printed until the script stops • This is because of buffering. • To stop buffering, set to "autoflush": while (<STDIN>) { print; } The special filehandle STDIN means "standard input", i.e. the keyboard $| is the autoflush flag $| = 1; while (<STDIN>) { print; }

Matching alternative characters • [ACGT] matches one A, C, G or T: • In general square brackets denote a set of alternative possibilities • Use - to match a range of characters: [A-Z] • . matches anything • \s matches spaces or tabs • \S is anything that's not a space or tab • [^X] matches anything but X this is not printed This is printed Matched: This is printed while (<STDIN>) { print "Matched: $_" if /[ACGT]/; } Italics denote input text

Matching alternative strings • /(this|that)/ matches "this" or "that" • ...and is equivalent to /th(is|at)/ while (<STDIN>) { print "Matched: $_" if /this|that|other/; } Remember, regexps are case-sensitive Won't match THIS Will match this Matched: Will match this Won't match ThE oThER Will match the other Matched: Will match the other

Matching multiple characters • x* matches zero or more x's (greedily) • x*? matches zero or more x's (sparingly) • x+ matches one or more x's (greedily) • x{n} matches n x's • x{m,n} matches from m to n x's Word and string boundaries • ^ matches the start of a string • $ matches the end of a string • \b matches word boundaries

"Escaping" special characters • \ is used to "escape" characters that otherwise have meaning in a regexp • so \[ matches the character "[" • if not escaped, "[" signifies the start of a list of alternative characters, as in [ACGT]

Retrieving what was matched • If parts of the pattern are enclosed by parentheses, then (following the match) those parts can be retrieved from the scalars $1, $2... • e.g. /the (\S+) sat on the (\S+) drinking (\S+)/ • matches "the cat sat on the mat drinking milk" • with $1="cat", $2="mat", $3="milk" $| = 1; while (<STDIN>) { if (/(a|the) (\S+)/i) { print "Noun: $2\n"; } } Pick up the cup Noun: cup Sit on a chair Noun: chair Put the milk in the tea Noun: milk Note: only the first "the" is picked up by this regexp

Variations and modifiers • //i ignores upper/lower case distinctions: • //g starts search where last match left off • pos($_) is index of first character after last match • s/OLD/NEW/ replaces first "OLD" with "NEW" • s/OLD/NEW/g is "global" (i.e. replaces every occurrence of "OLD" in the string) while (<STDIN>) { print "Matched: $_" if /pattern/i; } pAttERn Matched pAttERn

N-glycosylation site detector Convert to upper case $| = 1; while (<STDIN>) { $_ = uc $_; while (/(N[^PD][ST])/g) { print "Potential N-glycosylation sequence ", $1, " at residue ", pos() - 2, "\n"; } } Regexp uses 'g' modifier to get all matches in sequence pos() is index of first residue after match, starting at zero; so, pos()-2 is index of first residue of three-residue match, starting at one. while (/(N[^PD][ST])/g) { ... } The main regular expression

PROSITE and Pfam Pfam – a database of Hidden Markov Models (HMMs) – equivalent to probabilistic regular expressions PROSITE – a database of regular expressions for protein families, domains and motifs

Subroutines • Often, we can identify self-contained tasks that occur in so many different places we may want to separate their description from the rest of our program. • Code for such a task is called a subroutine. • Examples of such tasks: • finding the length of a sequence • reverse complementing a sequence • finding the mean of a list of numbers NB: Perl provides the subroutine length($x) to do this already

Finding all sequence lengths (2) open FILE, "fly3utr.txt"; while (<FILE>) { chomp; if (/>/) { print_name_and_len(); $name = $_; $len = 0; } else { $len += length; } } print_name_and_len(); close FILE; sub print_name_and_len { if (defined ($name)) { print "$name $len\n"; } } Subroutine calls Subroutine definition; code in here is not executed unless subroutine is called

Reverse complement subroutine sub revcomp { my $rev; $rev = reverse ($dna); $rev =~ tr/acgt/tgca/; return $rev; } $rev = 12345; $dna = "accggcatg"; $rev1 = revcomp(); print "Revcomp of $dna is $rev1\n"; $dna = "cggcgt"; $rev2 = revcomp(); print "Revcomp of $dna is $rev2\n"; print "Value of rev is $rev\n"; "my" announces that $rev is local to the subroutine revcomp "return" announces that the return value of this subroutine is whatever's in $rev Value of $rev is unchanged by calls to revcomp Revcomp of accggcatg is catgccggt Revcomp of cggcgt is acgccg Value of rev is 12345

Revcomp with arguments sub revcomp { my ($dna) = @_; my $rev = reverse ($dna); $rev =~ tr/acgt/tgca/; return $rev; } $dna1 = "accggcatg"; $rev1 = revcomp ($dna1); print "Revcomp of $dna1 is $rev1\n"; $dna2 = "cggcgt"; $rev2 = revcomp ($dna2); print "Revcomp of $dna2 is $rev2\n"; The array @_ holds the arguments to the subroutine (in this case, the sequence to be revcomp'd) Now we don't have to re-use the same variable for the sequence to be revcomp'd Revcomp of accggcatg is catgccggt Revcomp of cggcgt is acgccg

Mean & standard deviation @xdata = (1, 5, 1, 12, 3, 4, 6); ($x_mean, $x_sd) = mean_sd (@xdata); @ydata = (3.2, 1.4, 2.5, 2.4, 3.6, 9.7); ($y_mean, $y_sd) = mean_sd (@ydata); sub mean_sd { my @data = @_; my $n = @data + 0; my $sum = 0; my $sqSum = 0; foreach $x (@data) { $sum += $x; $sqSum += $x * $x; } my $mean = $sum / $n; my $variance = $sqSum / $n - $mean * $mean; my $sd = sqrt ($variance); return ($mean, $sd); } Subroutine takes a list of $n numeric arguments Square root Subroutine returns a two-element list: (mean,sd)

Maximum element of an array • Subroutine to find the largest entry in an array @num = (1, 5, 1, 12, 3, 4, 6); $max = find_max (@num); print "Numbers: @num\n"; print "Maximum: $max\n"; sub find_max { my @data = @_; my $max = pop @data; foreach my $x (@data) { if ($x > $max) { $max = $x; } } return $max; } Numbers: 1 5 1 12 3 4 6 Maximum: 12

Including variables in patterns • Subroutine to find number of instances of a given binding site in a sequence $dna = "ACGCGTAAGTCGGCACGCGTACGCGT"; $mcb = "ACGCGT"; print "$dna has ", count_matches ($mcb, $dna), " matches to $mcb\n"; sub count_matches { my ($pattern, $text) = @_; my $n = 0; while ($text =~ /$pattern/g) { ++$n } return $n; } ACGCGTAAGTCGGCACGCGTACGCGT has 3 matches to ACGCGT

Data structures • Suppose we have a file containing a table of Drosophila gene names and cellular compartments, one pair on each line: Cyp12a5 Mitochondrion MRG15 Nucleus Cop Golgi bor Cytoplasm Bx42 Nucleus Suppose this file is in "genecomp.txt"

Reading a table of data • We can split eachline into a 2-elementarray using thesplit command. • This breaks the lineat each space: • The opposite of split is join, which makes a scalar from an array: open FILE, "genecomp.txt"; while (<FILE>) { ($g, $c) = split; push @gene, $g; push @comp, $c; } close FILE; print "Genes: @gene\n"; print "Compartments: @comp\n"; Genes: Cyp12a5 MRG15 Cop bor Bx42 Compartments: Mitochondrion Nucleus Golgi Cytoplasm Nucleus print join (" and ", @gene); Cyp12a5 and MRG15 and Cop and bor and Bx42

Finding an entry in a table • The following code assumes that we've already read in the table from the file: • Example:$ARGV[0] = "Cop" $geneToFind = shift @ARGV; print "Searching for gene $geneToFind\n"; for ($i = 0; $i < @gene; ++$i) { if ($gene[$i] eq $geneToFind) { print "Gene: $gene[$i]\n"; print "Compartment: $comp[$i]\n"; exit; }} print "Couldn't find gene\n"; Searching for gene Cop Gene: Cop Compartment: Golgi

Binary search • The previous algorithm is inefficient. If there are N entries in the list, then on average we have to search through ½(N+1) entries to find the one we want. • For the full Drosophila genome, N=12,000. This is painfully slow. • An alternative is the Binary Search algorithm: Start with a sorted list. Compare the middle elementwith the one we want. Pick thehalf of the list that contains ourelement. Iterate this procedure tolocate the right element. This takes around log2(N) steps.

Associative arrays (hashes) • Implementing algorithms like binary search is a common task in languages like C. • Conveniently, Perl provides a type of array called an associative array (also called a hash) that is pre-indexed for quick search. • An associative array is a set of keyvalue pairs (like our genecompartment table) $comp{"Cop"} = "Golgi"; Curly braces {} are used to index an associative array

Reading a table using hashes open FILE, "genecomp.txt"; while (<FILE>) { ($g, $c) = split; $comp{$g} = $c; } $geneToFind = shift @ARGV; print "Gene: $geneToFind\n"; print "Compartment: ", $comp{$geneToFind}, "\n"; Gene: Cop Compartment: Golgi ...with $ARGV[0] = "Cop" as before:

Reading a FASTA file into a hash sub read_FASTA { my ($filename) = @_; my (%name2seq, $name, $seq); open FILE, $filename; while (<FILE>) { chomp; if (/>/) { s/>//; if (defined $name) { $name2seq{$name} = $seq; } $name = $_; $seq = ""; } else { $seq .= $_; } } $name2seq{$name} = $seq; close FILE; return %name2seq; }

Formatted output of sequences sub print_seq { my ($name, $seq) = @_; print ">$name\n"; my $width = 50; for (my $i = 0; $i < length($seq); $i += $width) { if ($i + $width > length($seq)) { $width = length($seq) - $i; } print substr ($seq, $i, $width), "\n"; } } 50-column output The term substr($x,$i,$len) returns the substring of $x starting at position $i with length $len. For example, substr("Biology",3,3) is "log"

keys and values • keys returns the list of keys in the hash • e.g. names, in the %name2seq hash • values returns the list of values • e.g. sequences, in the %name2seq hash %name2seq = read_FASTA ("fly3utr.txt"); print "Sequence names: ", join (" ", keys (%name2seq)), "\n"; my $len = 0; foreach $seq (values %name2seq) { $len += length ($seq); } print "Total length: $len\n"; Sequence names: CG11488 CG11604 CG11455 Total length: 210

Files of sequence names • Easy way to specify a subset of a given FASTA database • Each line is the name of a sequence in a given database • e.g. CG1167 CG685 CG1041 CG1043

Get named sequences • Given a FASTA database and a "file of sequence names", print every named sequence: ($fasta, $fosn) = @ARGV; %name2seq = read_FASTA ($fasta); open FILE, $fosn; while ($name = <FILE>) { chomp $name; $seq = $name2seq{$name}; if (defined $seq) { print_seq ($name, $seq); } else { warn "Can't find sequence: $name. ", "Known sequences: ", join (" ", keys %name2seq), "\n"; } } close FILE;

Intersection of two sets CG1167 CG685 CG1041 CG1043 CG215 CG1041 CG483 CG1167 CG1163 • Two files of sequence names: • What is the overlap? • Find intersection using hashes: fosn1.txt fosn2.txt open FILE1, "fosn1.txt"; while (<FILE1>) { $gotName{$_} = 1; } close FILE1; open FILE2, "fosn2.txt"; while (<FILE2>) { print if $gotName{$_}; } close FILE2; CG1041 CG1167

Assigning hashes • A hash can be assigned directly,as a list of "key=>value" pairs: %comp = ('Cyp12a5' => 'Mitochondrion', 'MRG15' => 'Nucleus', 'Cop' => 'Golgi', 'bor' => 'Cytoplasm', 'Bx42' => 'Nucleus'); print "keys: ", join(";",keys(%comp)), "\n"; print "values: ", join(";",values(%comp)), "\n"; keys: bor;Cop;Bx42;Cyp12a5;MRG15 values: Cytoplasm;Golgi;Nucleus;Mitochondrion;Nucleus

The genetic code as a hash %aa = ('ttt'=>'F', 'tct'=>'S', 'tat'=>'Y', 'tgt'=>'C', 'ttc'=>'F', 'tcc'=>'S', 'tac'=>'Y', 'tgc'=>'C', 'tta'=>'L', 'tca'=>'S', 'taa'=>'!', 'tga'=>'!', 'ttg'=>'L', 'tcg'=>'S', 'tag'=>'!', 'tgg'=>'W', 'ctt'=>'L', 'cct'=>'P', 'cat'=>'H', 'cgt'=>'R', 'ctc'=>'L', 'ccc'=>'P', 'cac'=>'H', 'cgc'=>'R', 'cta'=>'L', 'cca'=>'P', 'caa'=>'Q', 'cga'=>'R', 'ctg'=>'L', 'ccg'=>'P', 'cag'=>'Q', 'cgg'=>'R', 'att'=>'I', 'act'=>'T', 'aat'=>'N', 'agt'=>'S', 'atc'=>'I', 'acc'=>'T', 'aac'=>'N', 'agc'=>'S', 'ata'=>'I', 'aca'=>'T', 'aaa'=>'K', 'aga'=>'R', 'atg'=>'M', 'acg'=>'T', 'aag'=>'K', 'agg'=>'R', 'gtt'=>'V', 'gct'=>'A', 'gat'=>'D', 'ggt'=>'G', 'gtc'=>'V', 'gcc'=>'A', 'gac'=>'D', 'ggc'=>'G', 'gta'=>'V', 'gca'=>'A', 'gaa'=>'E', 'gga'=>'G', 'gtg'=>'V', 'gcg'=>'A', 'gag'=>'E', 'ggg'=>'G' );

Translating: DNA to protein $prot = translate ("gatgacgaaagttgt"); print $prot; sub translate { my ($dna) = @_; $dna = lc ($dna); my $len = length ($dna); if ($len % 3 != 0) { die "Length $len is not a multiple of 3"; } my $protein = ""; for (my $i = 0; $i < $len; $i += 3) { my $codon = substr ($dna, $i, 3); if (!defined ($aa{$codon})) { die "Codon $codon is illegal"; } $protein .= $aa{$codon}; } return $protein; } DDESC

Counting residue frequencies %count = count_residues ("gatgacgaaagttgt"); @residues = keys (%count); foreach $residue (@residues) { print "$residue: $count{$residue}\n"; } sub count_residues { my ($seq) = @_; my %freq; $seq = lc ($seq); for (my $i = 0; $i < length($seq); ++$i) { my $residue = substr ($seq, $i, 1); ++$freq{$residue}; } return %freq; } g: 5 a: 5 c: 1 t: 4

Counting N-mer frequencies %count = count_nmers ("gatgacgaaagttgt", 2); @nmers = keys (%count); foreach $nmer (@nmers) { print "$nmer: $count{$nmer}\n"; } sub count_nmers { my ($seq, $n) = @_; my %freq; $seq = lc ($seq); for (my $i = 0; $i <= length($seq) - $n; ++$i) { my $nmer = substr ($seq, $i, $n); ++$freq{$nmer}; } return %freq; } cg: 1 tt: 1 ga: 3 tg: 2 gt: 2 aa: 2 ac: 1 at: 1 ag: 1

N-mer frequencies for a whole file my %name2seq = read_FASTA ("fly3utr.txt"); while (($name, $seq) = each %name2seq) { %count = count_nmers ($seq, 2, %count); } @nmers = keys (%count); foreach $nmer (@nmers) { print "$nmer: $count{$nmer}\n"; } sub count_nmers { my ($seq, $n, %freq) = @_; $seq = lc ($seq); for (my $i = 0; $i <= length($seq) - $n; ++$i) { my $nmer = substr ($seq, $i, $n); ++$freq{$nmer}; } return %freq; } ct: 5 tc: 9 tt: 26 cg: 4 ga: 11 tg: 12 gc: 2 gt: 17 aa: 39 ac: 10 gg: 4 at: 17 ca: 11 ag: 15 ta: 20 cc: 2 Note how we keep passing %freq back into the count_nmers subroutine, to get cumulative counts The each command is a shorthand for looping through each (key,value) pair in an array

Files and filehandles This XYZ is the filehandle • Opening a file: • Closing a file: • Reading a line: • Reading an array: • Printing a line: • Read-only: • Write-only: • Test if file exists: open XYZ, $filename; close XYZ; $data = <XYZ>; @data = <XYZ>; print XYZ $data; open XYZ, "<$filename"; open XYZ, ">$filename"; if (-e $filename) { print "$filename exists!\n"; }

Regular expressions

Regular expressions

Presentation Transcript

Regular Expressions

Regular Expressions

Regular Expressions

Regular Expressions

Regular Expressions

Regular Expressions

Regular Expressions

Regular Expressions

Regular Expressions

Regular Expressions

Regular Expressions

Regular Expressions

Regular Expressions

Regular Expressions

Regular Expressions

Regular Expressions

Regular Expressions

Regular expressions

Regular Expressions

Regular Expressions