Perl (2)

Hongkang Mei, Ph.D.

March 10, 2002

  • Review of Perl (1)

  • More on I/O

  • Regular Expression basics

  • More on regular expression

  • Using regular expressions

  • File and directory handles


  • Scalar data

    something single or just one

    number or string, interchangeable

    acted upon with operators

    (a scalar variable holds a single scalar value)

  • List data

    list of scalars

    (an array is a variable that contains a list)

    (a hash, or associative array, is a variable that contains a list of paired scalars, each key associated with a value)


  • Scalar variables

    ‘$’ followed by Perl identifier

  • Should be descriptive

  • Perl built-in scalar variables

  • $ARGV, $_, ”……

    * Perl identifier:

    letters, digits and '_'; must not begin with a digit


  • Numeric operators

  • ++ incrementing the value

  • $counter++;

  • $v = $counter++;

  • is different from

  • $v = ++$counter;

  • -- decrementing the value
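
The difference between the two assignment forms above can be seen in a minimal sketch (the starting value 5 is arbitrary):

```perl
use strict;
use warnings;

my $counter = 5;
my $post = $counter++;   # $post gets the old value 5, then $counter becomes 6
my $pre  = ++$counter;   # $counter becomes 7 first, then $pre gets 7

print "post=$post pre=$pre counter=$counter\n";
```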


  • List data

    list literals

    scalars separated by ',' inside ()

    (1, 2, 3, 4, 5)

    ("dnaA", "argC", "rnpA")

    qw/ dnaA argC rnpA /

    range operator ..

    (1..5) # (1, 2, 3, 4, 5)

    (1.2..5.7) # same: the endpoints are truncated to integers

    (5..1) # empty: the range only counts up

    ($a..$b) # depends on the current values

    * The qw shortcut

    words are treated like '' (single-quoted) strings

    any punctuation pair can be the delimiter

    / /, " ", {}, [], (), <>, # #, ! !
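
A short sketch of the list literals above; the variable names are illustrative only:

```perl
use strict;
use warnings;

my @genes = qw/dnaA argC rnpA/;   # same as ("dnaA", "argC", "rnpA")
my @up    = (1 .. 5);             # (1, 2, 3, 4, 5)
my @frac  = (1.2 .. 5.7);         # endpoints truncated to integers
my @down  = (5 .. 1);             # empty: the range only counts upward

print "genes: @genes\n";
print "up: @up / frac: @frac / down has ", scalar(@down), " elements\n";
```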


  • Array variables

  • @ + identifier

  • no unnecessary limit

  • Array elements: scalar variables

[Figure: an array drawn as a row of scalar variables (the elements), with indices 0 to 4]


  • Accessing array elements

  • done through the element's scalar variable:

  • $seq[0]

  • print $seq[3];

  • $seq[1] = 'acg';

  • $seq is a different thing!

  • @ and $ have different namespaces


  • Hashes

  • A hash is a variable containing a list of

  • paired scalar values (keys and values) associated with each other

  • % + identifier

  • no unnecessary limit

[Figure: a hash drawn as keys paired with values; each key and each value is a scalar]


  • Hash element access

  • $hash{$key}

  • $seq{"dnaA"} = "CAGACTCGAT";

  • foreach $gene (qw/dnaA argC rnt/) {

  • print "The sequence for $gene is $seq{$gene}.\n";

  • }

  • $key can be any expression

  • $seq{"unknown"} # undef
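
A runnable sketch of hash element access. The second sequence is made up, and the `exists` check is an addition to the slide's loop so that a missing key is reported instead of printing undef:

```perl
use strict;
use warnings;

my %seq = (dnaA => "CAGACTCGAT", argC => "TTGACCATGA");   # argC value is made up

foreach my $gene (qw/dnaA argC rnt/) {
    if (exists $seq{$gene}) {
        print "The sequence for $gene is $seq{$gene}.\n";
    } else {
        print "No sequence stored for $gene.\n";   # $seq{rnt} would be undef
    }
}
```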


  • Interpolation of variables into strings

  • Scalar:

  • print $aa_seq;

  • print "The sequence is $aa_seq.\n";

  • print "The file contains $count ${type}s.\n";

  • Array:

  • print "The list contains @array\n"; # elements separated by spaces

  • print @array; # elements printed with nothing between them

  • print 3 * @array; # scalar context: 3 times the number of elements

  • Hash:

  • print "The AC# for 'dnaA' is $g_ac{'dnaA'}.\n";

  • NO interpolation for the whole hash!!

  • printf "The %s has %d AAs.\n", $prot, $len;


  • SCALAR and LIST CONTEXT

  • Using the same variable in different context

  • means different things

  • depending on what Perl is expecting

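  • 5 + @aa; # scalar: number of elements

  • sort @aa; # list

  • @list = @aa; # list: copies the elements

  • @list = $aa; # list with one element

  • $aa[0] = @list; # scalar: number of elements in @list

  • print "The full aa list is @aa.\n";

  • print "The number of aa is " . @aa . ".\n";

  • print @aa;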
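
The same array gives different results depending on context. A small sketch with a made-up three-element array:

```perl
use strict;
use warnings;

my @aa = qw/ala arg cys/;

my $count  = @aa;       # scalar context: number of elements
my @copy   = @aa;       # list context: copies the elements
my $sum    = 5 + @aa;   # scalar context again: 5 + 3
my $joined = "@aa";     # interpolation joins the elements with spaces

print "count=$count sum=$sum joined=$joined\n";
print "The number of aa is " . @aa . ".\n";   # concatenation forces scalar context
```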


  • Control structures

  • if (cond) {…} elsif (cond) {…} else {…}

  • while (cond) {…}

  • foreach $line (list) {…}

  • for ($i=1; $i<11; $i++) {…}

  • unless (cond) {…} # same as if (!cond) {…}

  • until (cond) {…} # same as while (!cond) {…}


  • Control structures

  • autoincrement autodecrement

  • $n++; $n--;

  • ++$n; --$n;

  • $m = $n++;

  • $m = ++$n;

  • $m = $n; $n++;

  • logical operators

  • &&, ||, !

  • and, not, or


  • Control structures

  • expression modifier

  • print "Acidic\n" if $pH < 7;

  • print " ", ($n += 2) while $n < 10;

  • print "$aa{$_}\t" foreach (keys %codon);

  • short-circuit operator

  • my $n_aa = $aa{$codon} || "not in the list";

  • the ternary operator ?:

  • $type = ($pI{$aa} < 7) ? "acidic" :

  • ($pI{$aa} == 7) ? "neutral" :

  • "basic";
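
A sketch of the short-circuit default and the chained ternary. The codon table and pI values are small made-up samples, and `classify` is a hypothetical helper name:

```perl
use strict;
use warnings;

my %aa = (AUG => "met", UGG => "trp");   # tiny sample codon table
my %pI = (asp => 2.8, lys => 9.7);       # approximate values, for illustration

# short-circuit ||: fall back to a default when the lookup is false (undef here)
my $n_aa = $aa{"CCC"} || "not in the list";

# chained ternary: the final branch needs no test of its own
sub classify {
    my ($value) = @_;
    return ($value < 7)  ? "acidic"
         : ($value == 7) ? "neutral"
         :                 "basic";
}

print "CCC: $n_aa\n";
print "$_ is ", classify($pI{$_}), "\n" foreach sort keys %pI;
```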


  • Subroutines

  • functions or subroutines

  • define:

  • sub my_funct {

  • $dna_length = 3 * length($aa_seq);

  • print "DNA is $dna_length base pairs\n";

  • }

  • Invoke:

  • &my_funct;


  • Built-in functions

  • print

  • chomp

  • defined

  • chop

  • reverse, sort

  • pop, push, shift, unshift

  • return

  • length

  • scalar # a fake one

  • ……

  • perlfunc manpage


  • More on I/O


  • <STDIN>: get user input

  • from the command line:

  • chomp ($a = <STDIN>); print $a;

  • # the input line ends at the newline, which chomp removes

  • file redirection:

  • %>myprog.pl < my_input.txt

  • ……

  • $line_n = 1;

  • while (<STDIN>){

  • print "$line_n\t$_";

  • $line_n++;

  • }


  • <>: read input from the files named on the command line

  • %>myprog.pl input1 - input2

  • ……

  • $line_n = 1;

  • while (<>){

  • print "$line_n\t$_";

  • $line_n++;

  • }

  • The difference between <> and <STDIN>

  • <> works from @ARGV; "-" stands for standard input, and with no arguments <> reads standard input


  • more on print

  • buffer

  • print <>; # <> in list context returns all the lines

  • # works like cat on the command line

  • print () function

  • print (3+4)*5; # prints 7: parsed as (print(3+4))*5

  • print "The result is: ", (3+4)*5; # prints 35


  • printf

  • printf "The mutation is at %s position.\n",

  • $count_mut;

  • %s, %f, %d, %g……

  • %2d

  • %-12s (left justified)

  • %12.3f (right justified)

  • %: printf does not interpolate a whole hash

  • %% to print '%'
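
The format codes above can be tried with sprintf, which takes the same formats as printf but returns the string instead of printing it. The protein name and numbers are illustrative only:

```perl
use strict;
use warnings;

my ($prot, $len, $mw) = ("DnaA", 467, 52.551);   # illustrative values

my $line  = sprintf "The %s has %d AAs.", $prot, $len;
my $left  = sprintf "[%-12s]", $prot;    # left justified in 12 columns
my $right = sprintf "[%12.3f]", $mw;     # right justified, 3 decimals
my $pct   = sprintf "%d%% done", 50;     # %% gives a literal percent sign

print "$line\n$left\n$right\n$pct\n";
```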


  • Regular Expression basics


  • regular expression or pattern

  • mini-program

  • match or doesn’t match a given string

  • match any number of strings

  • doesn’t matter how many times

  • it matches to a string

  • works like grep

  • $_ = "ADCSFTSCGNYEQ";

  • if (/SFT/) { # a bare pattern is matched against $_

  • print "It has the motif \"SFT\".\n";

  • }


  • metacharacters

  • . Matches anything but “\n”

  • \ escape (/3\.14/)

  • () grouping


  • simple quantifiers

  • the following quantifiers repeat the preceding item

  • * 0 or more times

  • + 1 or more times

  • ? 0 or 1 times


  • the '|' alternation pattern

  • /T|S/

  • /protein(and|or)DNA/

  • /arg(ser|cys)lys/
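
A small sketch of the simple quantifiers and alternation, using grep so that each test string is matched through $_. The test strings are made up:

```perl
use strict;
use warnings;

# alternation inside a group: ser or cys between arg and lys
my @peptides = grep { /arg(ser|cys)lys/ } qw/argserlys argcyslys argglylys/;

# quantifiers apply to the item just before them
my @candidates = qw/AG ATG ATTTG ACG/;
my @star = grep { /^AT*G$/ } @candidates;   # T* : zero or more T
my @plus = grep { /^AT+G$/ } @candidates;   # T+ : one or more T
my @opt  = grep { /^AT?G$/ } @candidates;   # T? : zero or one T

print "alternation: @peptides\n";
print "star: @star\nplus: @plus\nopt:  @opt\n";
```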


  • More on regular expression


  • character classes

  • [] matches any single character inside

  • [AGCT] # any one deoxynucleotide

  • [a-zA-Z0-9]+ # 1 or more letters or digits

  • [;\-,] # '-' needs to be escaped (unless it comes first or last)


  • character classes shortcuts

  • \d [0-9]

  • \w [A-Za-z0-9_] # only a char, \w+ a word

  • \s [\f\t\n\r ] # whitespace

  • negating the shortcuts

  • \D [^\d]

  • \W [^\w]

  • \S [^\s]

  • can be part of a larger class

  • [\dA-F]

  • [\d\D] (any char)

  • [^\d\D] (nothing)


  • general quantifiers

    • * 0 or more repetitions

    • + 1 or more

    • ? 0 or 1

  • {3,5} 3 to 5 (no space inside the braces)

  • {3,} 3 or more

  • {3} exactly 3 repetitions

  • /U{5,8}/

  • /\w{8}/

  • /A{15,100}/

  • /(arg){2,}/

  • * {0,}

  • how about + and ?
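
A sketch of the brace quantifiers on made-up poly-A tails. The last lines also check that + behaves like {1,} and ? like {0,1}:

```perl
use strict;
use warnings;

my @tails = qw/GCAA GCAAA GCAAAAAA/;

my @long = grep { /A{3,}$/ }     @tails;   # three or more A at the end
my @mid  = grep { /^GCA{3,5}$/ } @tails;   # exactly 3 to 5 A after GC

print "3 or more: @long\n";
print "3 to 5:    @mid\n";

my $plus_same = ("GCAA" =~ /^GCA+$/ && "GCAA" =~ /^GCA{1,}$/)  ? 1 : 0;
my $opt_same  = ("GC"   =~ /^GCA?$/ && "GC"   =~ /^GCA{0,1}$/) ? 1 : 0;
print "+ is {1,}: $plus_same, ? is {0,1}: $opt_same\n";
```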


  • anchors

  • ^ marks beginning of the string

  • /^ATG/ # initiation codon

  • [^AGCT] # ?

  • $ marks the end

  • /(UA[AG]|UGA)$/ # stop codons

  • /^\s*$/ # a blank line


  • word anchors

  • \b word boundary anchor

  • matches either end of a word

  • /\barg/ # arg, arginine, arginyl, argue……

  • /\barg\b/ # arg

  • \B nonword boundary anchor

  • matches any point that \b would not

  • /\barg\B/ # arginine, arginyl, argue……


  • memory ()

  • () grouping

  • matched part kept in memory

  • /A(ACGT)T/ # ACGT in memory

  • backreferences

  • \1 \2

  • /(GAATTC).*\1/ # can EcoRI cut the insert out?

  • /(.)\1/ NOT /../ # two identical chars; any two chars

  • memory variables

  • $1
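
A sketch of memory and backreferences on a made-up insert sequence. GAATTC is the EcoRI recognition site; the `=~` binding operator used here is covered later in the deck:

```perl
use strict;
use warnings;

my $insert = "ccGAATTCatgaaaGAATTCtt";   # made-up sequence with two sites

# \1 must match the same text that group 1 matched
my ($site, $fragment) = $insert =~ /(GAATTC)(.*)\1/;
print "Site $site occurs twice; between the sites: $fragment\n" if defined $site;

# /(.)\1/ needs two identical characters in a row; /../ accepts any two
my @doubled = grep { /(.)\1/ } qw/ACGT AACG TTGA CAGT/;
print "doubled letters in: @doubled\n";
```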


  • precedence

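  • which parts of the pattern stick together

  • more tightly (highest precedence first)

  • () parentheses

  • * + ? {} quantifiers

  • ^ $ \b \B anchors, sequence

  • | alternation

  • atoms: chars, classes, backreferences

  • examples

  • /^fred|barney$/ # ^fred or barney$, not ^(fred|barney)$

  • /^(\w+)\s+(\w+)$/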

  • Using regular expressions


  • m//

  • a more general pattern match operator

  • can use any pairs of delimiters

  • //

  • m,, m!! m^^ m##

  • m<> m{} m[] m()

  • example:

  • m%^http://% is better than /^http:\/\//


  • option modifiers

  • /i case insensitive

  • matches both cases for all letters

  • /\byes\b/i # Yes yes YES

  • /s lets . match any character

  • including newline, which . normally skips

  • like [\d\D]

  • $_ = "ACGTTTGCG\nAACACGT";

  • /^(ACG).*(CGT)$/s

  • do not confuse with the \s shortcut


  • combining option modifiers

  • /si # both /s and /i

  • $_ = "aCGTTTGCG\nAACAcGT";

  • if (/^(ACG).*(CGT)$/si) {

  • print "That sequence begins with ACG ",

  • "and ends with CGT.\n";

  • }

  • other options
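
A sketch showing what each modifier contributes, using the two-line mixed-case string from the slide:

```perl
use strict;
use warnings;

$_ = "aCGTTTGCG\nAACAcGT";   # two lines, mixed case

my $plain = /^(ACG).*(CGT)$/   ? 1 : 0;   # fails: wrong case, and . stops at the newline
my $i     = /^(ACG).*(CGT)$/i  ? 1 : 0;   # still fails: . cannot cross the newline
my $si    = /^(ACG).*(CGT)$/si ? 1 : 0;   # matches: /s lets . match \n, /i ignores case

print "plain=$plain i=$i si=$si\n";
```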


  • the binding operator =~

  • if (/\w+/i){……} # only works on $_

  • if ($seq =~ /^(ACG).*(CGT)$/si){

  • print "That sequence begins with ACG ",

  • "and ends with CGT.\n";

  • }

  • $prot_seq = <STDIN> =~ /[^ACGT\s]/i; # true if the line has a character that is neither a base nor whitespace

  • if ($prot_seq) {blastp;}


  • interpolating into patterns

  • my $p = "arg";

  • if ($seq =~ /($p)$/si){

  • print "That sequence ends with $p.\n";

  • }

  • $profile = shift @ARGV; # get command-line args

  • if ($prot_seq =~ /$profile/si) {

  • print "$prot_seq has motif $profile\n";

  • }


  • the match variables

  • /(A)\1/ # use \1 inside pattern

  • $1 # hold memory value in Perl code

  • if ($seq =~ /(g.)\1/si){

  • print "That sequence has a $1 repeat.\n";

  • }

  • if ($prot_seq =~ /([stavli]{3,}).*([Deq]{3,})/si) {

  • print "$prot_seq has hydrophobic region $1 ",

  • "followed by hydrophilic region $2.\n";

  • }


  • the persistence of match

  • the next successful match will overwrite the earlier one

  • store your $1 away!

  • if ($prot_seq =~ /([cstv]+)/si) {

  • my $motif = $1;

  • }

  • test your match before using $1

  • it could be a leftover

  • $prot_seq =~ /([cstv]+)/si; # result not tested

  • print "I found the motif $1, correct?\n"; # $1 may be left over from an earlier match
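
One way to follow both pieces of advice is to wrap the match in a small subroutine that only returns $1 when the match succeeded. `find_motif` is a hypothetical helper name, and the protein strings are made up:

```perl
use strict;
use warnings;

# return the first run of C/S/T/V residues, or undef when there is none
sub find_motif {
    my ($prot_seq) = @_;
    if ($prot_seq =~ /([cstv]+)/i) {
        return $1;    # safe: this match just succeeded
    }
    return;           # never hand back a stale $1
}

foreach my $prot_seq (qw/MKRCSTVLA MKRLLAGE/) {
    my $motif = find_motif($prot_seq);
    print "$prot_seq: ", (defined $motif ? "motif $motif" : "no motif"), "\n";
}
```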


  • automatic matched variables (PP121)

  • $& matched part of string

  • $` part before the match

  • $' part after the match

  • $`$&$' the whole string

  • ……

  • print "The matched string is the following, ",

  • "the part matched is in <>:\n",

  • "$`<$&>$'\n";


  • substitutions with s///

  • m// search

  • s/// search and replace

  • s/match_pattern/replacement_string/

  • returns true if successful, false if not

  • replacement string:

  • $1

  • empty

  • $&

  • words

  • whitespace

  • ……


  • examples of s///

  • if (s/([a-z]{3})([cstyleu]{3})/$2/){

  • print "The mutant protein has a $1 deletion ",

  • "before $2.\n";

  • }

  • s/(arg)/$1$1/; # arg insertion

  • s/arg/cys/; # cys substitution

  • s/\s+//g; # get rid of all whitespace

  • s/\s+/ /g; # single space delimiters

  • s/[^acgt]//gi; # clean up DNA sequence

  • s/[tT]/U/g; # translate to RNA sequence

  • s/_END_.*//s; # chop off after END mark
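
Several of these substitutions can be chained to clean up a raw record. A sketch on a made-up input string; the `(my $copy = $orig) =~ s/…/…/` idiom edits a copy and leaves the original untouched:

```perl
use strict;
use warnings;

my $raw = "acg tnn\tGCA\nttg _END_ ignore this";   # made-up raw record

(my $dna = $raw) =~ s/_END_.*//s;   # chop off everything from the END mark
$dna =~ s/\s+//g;                   # remove all whitespace
$dna =~ s/[^acgt]//gi;              # drop anything that is not a base

(my $rna = $dna) =~ s/[tT]/U/g;     # T becomes U

print "DNA: $dna\nRNA: $rna\n";
```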


  • s/// different delimiters

  • just like:

  • m//

  • qw//

  • can use unpaired or paired delimiters

  • ,, "" {} [] %% ##

  • s#^https://#http://#

  • s{T}{U}

  • s<T>#U#



  • case shifting

  • $dna =~ s/(.+)/\U$1/;

  • $prot =~ s/(.+)/\L$1/;

  • $prot =~ s/(\w+)/\u\L$1/gi;
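
A sketch of the three case-shifting escapes on made-up strings, each applied to a copy so the results can be compared:

```perl
use strict;
use warnings;

my $dna  = "acgTTgca";
my $prot = "ARG cys LyS";

(my $upper = $dna)  =~ s/(.+)/\U$1/;       # \U: everything to upper case
(my $lower = $prot) =~ s/(.+)/\L$1/;       # \L: everything to lower case
(my $title = $prot) =~ s/(\w+)/\u\L$1/g;   # \u\L: lower case, then capitalize each word

print "$upper\n$lower\n$title\n";
```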


  • split

  • split /separator/, $string;

  • @aa = split / /, $aa;

  • split /separator/;

  • @aa = split /:/; # uses $_, e.g. "a:b:c:d"

  • split //;

  • @data = split //; # still $_, splits into single chars

  • split; # same as split ' ', $_: like /\s+/ but skips leading whitespace

  • @data = split; # split $_ at whitespace


  • join

  • join glue, list of pieces;

  • $full_name = join ' ', $first, $middle, $last;

  • $x = join 'y', @empty; # empty string
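
split and join are roughly inverse operations. A sketch with made-up strings, passing the string explicitly instead of relying on $_:

```perl
use strict;
use warnings;

my @genes = split /:/, "dnaA:argC:rnpA";      # split on a separator
my @bases = split //,  "ACGT";                # empty pattern: one char per element
my @words = split ' ', "  met  ala\tgly\n";   # ' ' skips leading whitespace

my $joined = join "-", @words;                # glue the pieces back together
my $empty  = join "y", ();                    # nothing to join: empty string

print "genes: @genes\nbases: @bases\njoined: $joined\n";
```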


  • File and directory handles


  • File handle

  • a name for an I/O connection, not a file name

  • usually named with uppercase letters, '_' and digits

  • Perl's special file handles

  • do not reuse these 6 names for your own handles

  • STDIN

  • STDOUT

  • %>myprog.pl <input >output

  • STDERR

  • another stream

  • DATA ARGV ARGVOUT


  • opening filehandles

  • STDIN, STDOUT, STDERR automatically opened

  • open SEQ, "e_coli_dna";

  • open INPUT, "<e_coli_dna";

  • open OUT1, ">intergene_seq";

  • open LOG, ">>genome__Update_log";

  • chomp(my $out_file = <STDIN>);

  • open OUT2, ">$out_file";



  • return value of opening filehandle

  • open returns true or false

  • reasons open can fail:

  • permissions, misspelled name, file not yet created (for input)

  • consequences of failure:

  • reading returns EOF (undef), so no data is input

  • output is discarded

  • turn on -w

  • die, warn


  • die on a fatal error

  • die "Can't calculate: \$b is 0\n" if $b == 0;

  • $length = $a/$b;

  • open LOG, ">>log_file"

  • or die "Cannot create logfile: $!";

  • die "Not enough arguments\n" if @ARGV < 2;

  • * $! is the system error message
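
A runnable sketch of checking the return value of open. It differs from the slides in two labeled ways: it uses the three-argument form of open with lexical filehandles instead of bareword handles, and it writes into a scratch directory from the core File::Temp module. The file names are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);   # scratch directory, removed at exit
my $file = "$dir/log_file";         # hypothetical file name

# append to the log, dying with the system error message on failure
open my $log, '>>', $file or die "Cannot create logfile: $!";
print {$log} "The sequence is updated\n";
close $log or die "Cannot close $file: $!";

# read it back
open my $in, '<', $file or die "Cannot read $file: $!";
my @lines = <$in>;
close $in;

# opening a file that does not exist returns false and sets $!
my $ok = open my $missing, '<', "$dir/no_such_file";
warn "open failed: $!\n" unless $ok;

print "read ", scalar(@lines), " line(s)\n";
```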


  • warn when not fatal

  • just like die, except that the program keeps running

  • warn "Input sequence is too short.\n"

  • if $seq_len < 30;


  • using filehandles

  • while (<SEQ>) {

  • if (/^AUG/)

  • {

  • print INITIAL_SEQ $_;

  • print OUT1 (">$accession\n$_\n");

  • print LOG "The sequence is updated\n";

  • }

  • }


  • file tests

  • warn "$filename is not updated.\n"

  • if -M INPUT > 14; # last modified more than 14 days ago

  • die "File named $filename already exists.\n"

  • if -e $filename;

  • if (-s $filename) { # exists and is not empty

  • print "File made successfully.\n";

  • }
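
A sketch of the -e, -s and -M file tests on a file created in a scratch directory (File::Temp is a core module; the file name and contents are made up):

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir      = tempdir(CLEANUP => 1);
my $filename = "$dir/out.seq";             # hypothetical output file

my $existed_before = (-e $filename) ? 1 : 0;   # -e: does the file exist?

open my $out, '>', $filename or die "Cannot create $filename: $!";
print {$out} ">seq1\nACGT\n";
close $out or die "Cannot close $filename: $!";

my $size = -s $filename;   # -s: size in bytes, false for an empty file
my $age  = -M $filename;   # -M: age in days, measured from program start

print "File made successfully ($size bytes).\n" if $size;
warn "$filename is not updated.\n" if $age > 14;
```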


  • chdir

  • similar to UNIX cd

  • chdir "/fasta" or die "Can't chdir to fasta: $!";

  • chdir; # with no argument, goes to your home directory (does not use $_)


  • glob and <something>

  • similar to UNIX ls, returns a list

  • my @all_files = glob ".* *";

  • my @seq_files = glob "*.seq";

  • my @dir_files = <$dir/.* $dir/*>;

  • my @files = <FASTA/*>; # a glob

  • my @lines = <FASTA>; # reads lines from the filehandle FASTA

  • my @files = <$name/*>; # a glob

  • my @lines = readline FASTA; # unambiguous: always reads lines


  • directory handles

  • opendir FASTA, $dir or die "Can't open: $!";

  • @files = readdir FASTA; # all names at once

  • closedir FASTA;

  • or one name at a time:

  • while ($name = readdir FASTA){

  • if ($name =~ /\.seq$/){

  • do something here……

  • }

  • }
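
A runnable sketch of reading a directory handle, using a lexical handle and a scratch directory from File::Temp (both are choices made for this example, not taken from the slides). The file names are made up:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

# create a few empty files to list
foreach my $name (qw/a.seq b.seq notes.txt/) {
    open my $fh, '>', "$dir/$name" or die "Cannot create $name: $!";
    close $fh;
}

opendir my $dh, $dir or die "Can't open $dir: $!";
my @all = readdir $dh;    # includes . and .. ; names come without the directory part
closedir $dh;

my @seq_files = sort grep { /\.seq$/ } @all;
print "sequence files: @seq_files\n";
```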


  • unlink files

  • similar to UNIX rm

  • unlink "seq1", "seq2", "seq3";

  • unlink glob "*.seq";

  • rename files

  • similar to UNIX mv

  • rename "old", "new";

  • rename "/bin/somewhere/e_coli", "e_coli";


  • mkdir

  • similar to UNIX mkdir

  • mkdir "fasta", 0755;

  • mkdir $name, oct($permission);

  • rmdir

  • similar to UNIX rmdir

  • rmdir $dir or warn "Can't remove $dir: $!";

  • rmdir $_ foreach glob "$dir/*"; # rmdir removes one empty directory per call


  • change permissions with chmod

  • similar to UNIX chmod

  • chmod 0640, "seq1", "seq2", "seq3";

  • change ownership with chown

  • similar to UNIX chown

  • chown $user, $group, glob "*.seq"; # $user and $group are numeric IDs


Example 1: Expression

# Take in what a user types, turn .com web sites into .orgs, and change
# the "@" in their email address to something else
while (<STDIN>) {
    if (/^quit$/i) {    # Leave the program if the user types "quit"
        last;
    }
    else {
        # Replace .com with .org in URLs. Only do it
        # for the first match in the string
        s/(http:\/\/[\w\d\.]+)\.com/$1\.org/i;

        # Replace the @ in email addresses with the ^ symbol. Do it for
        # ALL occurrences in the string
        s/([\w\d]+)\@([\w\d\.]+)/$1\^$2/ig;

        # Print out the modified string
        print;
    }
}

