Regular Expressions

Software Tools Regular Expressions

What is a Regular Expression? • A regular expression is a pattern to be matched against a string. For example, the pattern Bill. • Matching either succeeds or fails. • Sometimes you may want to replace a matched pattern with another string. • Regular expressions are used by many other Unix commands and programs, such as grep, sed, awk, vi, emacs, and even some shells.

Simple Uses of Regular Expressions • If we are looking for all the lines in a file that contain the string Shakespeare, we could use the grep command: $ grep Shakespeare movie > result • Here, Shakespeare is the regular expression that grep looks for in the file movie. • Lines that match are redirected to result.

Simple Uses of Regular Expressions • In Perl, we can make Shakespeare a regular expression by enclosing it in slashes: if(/Shakespeare/){ print $_; } • What is tested in the if-statement? Answer: $_. • When a regular expression is enclosed in slashes, $_ is tested against the regular expression, returning true if there is a match, false otherwise.

Simple Uses of Regular Expressions if(/Shakespeare/){ print $_; } • The previous example tests only one line, and prints out the line if it contains Shakespeare. • To work on all lines, add a loop: while(<>){ if(/Shakespeare/){ print; } }

Simple Uses of Regular Expressions • What if we are not sure how to spell Shakespeare? • Certainly the first part is easy Shak, and there must be a r near the end. • How can we express our idea? grep: grep "Shak.*r" movie > result Perl: while(<>){ if(/Shak.*r/){ print; } } • .* means “zero or more of any character”.

Simple Uses of Regular Expressions grep: grep "Shak.*r" movie > result • The double quotes in this grep example are needed to prevent the shell from interpreting * as “all files”. • Since Shakespeare ends in “e”, shouldn’t it be: Shak.*r.* Answer: No need. Any character can come before or after the pattern. Shak.*r is the same as .*Shak.*r.*

Substitution • Another simple regular expression is the substitute operator. • It replaces part of a string that matches the regular expression with another string. s/Shakespeare/Bill Gates/; • $_ is matched against the regular expression (Shakespeare). • If the match is successful, the part of the string that matched is discarded and replaced by the replacement string (Bill Gates). • If the match is unsuccessful, nothing happens.

Substitution • The program: $ cat movie Titanic Saving Private Ryan Shakespeare in Love Life is Beautiful $ cat sub1 #!/usr/local/bin/perl5 -w while(<>){ if(/Shakespeare/){ s/Shakespeare/Bill Gates/; print; } } $ sub1 movie Bill Gates in Love $

Substitution • An even shorter way to write it: $ cat sub2 #!/usr/local/bin/perl5 -w while(<>){ if(s/Shakespeare/Bill Gates/){ print; } } $ sub2 movie Bill Gates in Love $

Patterns • A regular expression is a pattern. • Some parts of the pattern match a single character (a). • Other parts of the pattern match multiple characters(.*).

Single-Character Patterns • The dot “.” matches any single character except the newline (\n). • For example, the pattern /a./ matches any two-letter sequence that starts with a and is not “a\n”. • Use \. if you really want to match the period. $ cat test hi hi bob. $ cat sub3 #!/usr/local/bin/perl5 -w while(<>){ if(/\./){ print; } } $ sub3 test hi bob. $

Single-Character Groups • If you want to specify one out of a group of characters to match use [ ]: /[abcde]/ This matches a string containing any one of the first 5 lowercase letters, while: /[aeiouAEIOU]/ matches any of the 5 vowels in either upper or lower case.

Single-Character Groups • If you want ] in the group, put a backslash before it, or put it as the first character in the list: /[abcde]]/ # matches [abcde] + ] /[abcde\]]/ # okay /[]abcde]/ # also okay • Use - for ranges of characters (like a through z): /[0123456789]/ # any single digit /[0-9]/ # same • If you want - in the list, put a backslash before it, or put it at the beginning/end: /[X-Z]/ # matches X, Y, Z /[X\-Z]/ # matches X, -, Z /[XZ-]/ # matches X, Z, - /[-XZ]/ # matches -, X, Z

Single-Character Groups • More range examples: /[0-9\-]/ # match 0-9, or minus /[0-9a-z]/ # match any digit or lowercase letter /[a-zA-Z0-9_]/ # match any letter, digit, underscore • There is also a negated character group, which starts with a ^ immediately after the left bracket. This matches any single character not in the list. /[^0123456789]/ # match any single non-digit /[^0-9]/ # same /[^aeiouAEIOU]/ # match any single non-vowel /[^\^]/ # match any single character except ^

Single-Character Groups • For convenience, some common character groups are predefined: Predefined Group Negated Negated Group \d (a digit) [0-9] \D (non-digit) [^0-9] \w (word char) [a-zA-Z0-9_] \W (non-word) [^a-zA-Z0-9_] \s (space char) [ \t\n] \S (non-space) [^ \t\n] • \d matches any digit • \w matches any letter, digit, underscore • \s matches any space, tab, newline • You can use these predefined groups in other groups: /[\da-fA-F]/ # match any hexadecimal digit

Multipliers • Multipliers allows you to say “one or more of these” or “up to four” of these.” • * means zero or more of the immediately previous character (or character group). • + means one or more of the immediately previous character (or character group). • ? means zero or one of the immediately previous character (or character group).

Multipliers • Example: /Ga+te?s/ matches a G followed by one or more a’s followed by t, followed by an optional e, followed by s. • *, +, and ? are greedy, and will match as many characters as possible: $_ = "Bill xxxxxxxxx Gates"; s/x+/Cheap/; # gives: Bill Cheap Gates

General Multiplier • How do you say “five to ten x’s”? /xxxxxx?x?x?x?x?/ # works, but ugly /x{5,10}/ # nicer • How do you say “five or more x’s”? /x{5,}/ • How do you say “exactly five x’s”? /x{5}/ • How do you say “up to five x’s”? /x{0,5}/

General Multiplier • How do you say “c followed by any 5 characters (which can be different) and ending with d”? /c.{5}d/ • * is the same as {0,} • + is the same as {1,} • ? is the same as {0,1}

Pattern Memory • How would we match a pattern that starts and ends with the same letter or word? • For this, we need to remember the pattern. • Use ( ) around any pattern to put that part of the string into memory (it has no effect on the pattern itself). • To recall memory, include a backslash followed by an integer. /Bill(.)Gates\1/

Pattern Memory • Example: /Bill(.)Gates\1/ This example matches a string starting with Bill, followed by any single non-newline character, followed by Gates, followed by that same single character. • So, it matches: Bill!Gates! Bill-Gates- but not: Bill?Gates! Bill-Gates_ (Note that /Bill.Gates./ would match all four)

Pattern Memory • More examples: /a(.)b(.)c\2d\1/ • This example matches a string starting with a, a character (#1), followed by b, another single character (#2), c, the character #2, d, and the character #1. • So it matches: a-b!c!d-.

Pattern Memory • The reference part can have more than a single character. • For example: /a(.*)b\1c/ • This example matches an a, followed by any number of characters (even zero), followed by b, followed by the same sequence of characters, followed by c. • So it matches: aBillbBillc and abc, but not: aBillbBillGatesc.

Alteration • How about picking from a set of alternatives when there is more than one character in the patterns. • The following example matches either Gates or Clinton or Shakespeare: /Gates|Clinton|Shakespeare/ • For single character alternatives, /[abc]/ is the same as /a|b|c/.

Anchoring Patterns • Anchors requires that the pattern be at the beginning or end of the line. • ^ matches the beginning of the line (only if ^ is the first character of the pattern): /^Bill/ # match lines that begin with Bill /^Gates/ # match lines that begin with Gates /Bill\^/ # match lines containing Bill^ somewhere /\^/ # match lines containing ^ • $ matches the end of the line (only if $ is the last character of the pattern): /Bill$/ # match lines that end with Bill /Gates$/ # match lines that end with Gates /$Bill/ # match with contents of scalar $Bill /\$/ # match lines containing $

Precedence • So what happens with the pattern: a|b* • Is this (a|b)*or a|(b*)? • Precedence of patterns from highest to lowest: Name Representation Parentheses ( ) Multipliers ? + * {m,n} Sequence & anchoring abc ^ $ Alternation | • By the table, * has higher precedence than |, so it is interpreted as a|(b*).

Precedence • What if we want the other interpretation in the previous example? • Answer: Simple, just use parentheses: (a|b)* • Use parentheses in ambiguous cases to improve clarity, even if not strictly needed. • When you use parentheses for precedence, they also go into memory (\1, \2, \3).

Precedence • More precedence examples: abc* # matches ab, abc, abcc, abccc,… (abc)* # matches "", abc, abcabc, abcabcabc,… ^a|b # matches a at beginning of line, or b anywhere ^(a|b) # matches either a or b at the beginning of line a|bc|d # a, or bc, or d (a|b)(c|d) # ac, ad, bc, or bd (Bill Gates)|(Bill Clinton) # Bill Gates, Bill Clinton Bill (Gates|Clinton) # Bill Gates, Bill Clinton (Mr\. Bill)|(Bill (Gates|Clinton)) # Mr. Bill, Bill Gates, Bill Clinton (Mr\. )?Bill( Gates| Clinton)? # Bill, Mr. Bill, Bill Gates, Bill Clinton, # Mr. Bill Gates, Mr. Bill Clinton

=~ • What if you want to match a different variable than $_? • Answer: Use =~. • Examples: $name = "Bill Shakespeare"; $name =~ /^Bill/; # true $name =~ /(.)\1/; # also true (matches ll) if($name =~ /(.)\1/){ print "$name\n"; }

=~ • An example using =~ to match <STDIN>: $ cat match1 #!/usr/local/bin/perl5 -w print "Quit (y/n)? "; if(<STDIN> =~ /^[yY]/){ print "Quitting\n"; exit; } print "Continuing\n"; $ match1 Quit (y/n)? y Quitting $

=~ • Another example using =~ to match <STDIN>: $ cat match2 #!/usr/local/bin/perl5 -w print "Wakeup (y/n)? "; while(<STDIN> =~ /^[nN]/){ print "Sleeping\n"; print "Wakeup (y/n)? "; } $ match2 Wakeup (y/n)? n Sleeping Wakeup (y/n)? N Sleeping Wakeup (y/n)? y $

Ignoring Case • In the previous examples, we used [yY] and [nN] to match either upper or lower case. • Perl has an “ignore case” option for pattern matching: /somepattern/i $ cat match1a #!/usr/local/bin/perl5 -w print "Quit (y/n)? "; if(<STDIN> =~ /^y/i){ print "Quitting\n"; exit; } print "Continuing\n"; $ match1a Quit (y/n)? Y Quitting $

Slash and Backslash • If your pattern has a slash character (/), you must precede each with a backslash (\): $ cat slash1 #!/usr/local/bin/perl5 -w print "Enter path: "; $path = <STDIN>; if($path =~ /^\/usr\/local\/bin/){ print "Path is /usr/local/bin\n"; } $ slash1 Enter path: /usr/local/bin Path is /usr/local/bin $

Different Pattern Delimiters • If your pattern has lots of slash characters (/), you can also use a different pattern delimiter with the form: m#somepattern# • The # can be any non-alphanumeric character. $ cat slash1a #!/usr/local/bin/perl5 -w print "Enter path: "; $path = <STDIN>; if($path =~ m#^/usr/local/bin#){ # if($path =~ m@^/usr/local/bin@){ # also works print "Path is /usr/local/bin\n"; } $ slash1a Enter path: /usr/local/bin Path is /usr/local/bin $

Special Read-Only Variables • After a successful pattern match, the variables $1, $2, $3,… are set to the same values as \1, \2, \3,… • You can use $1, $2, $3,… later in your program. $ cat read1 #!/usr/local/bin/perl5 -w $_ = "Bill Shakespeare in Love"; /(\w+)\W+(\w+)/; # match first two words # $1 is now "Bill" and $2 is now "Shakespeare" print "The first name of $2 is $1\n"; $ read1 The first name of Shakespeare is Bill

Special Read-Only Variables • You can also use $1, $2, $3,… by placing the match in a list context: $ cat read2 #!/usr/local/bin/perl5 -w $_ = "Bill Shakespeare in Love"; ($first, $last) = /(\w+)\W+(\w+)/; print "The first name of $last is $first\n"; $ read2 The first name of Shakespeare is Bill

Special Read-Only Variables • Other read-only variables: • $& is the part of the string that matched the pattern. • $` is the part of the string before the match • $’ is the part of the string after the match $ cat read3 #!/usr/local/bin/perl5 -w $_ = "Bill Shakespeare in Love"; / in /; print "Before: $`\n"; print "Match: $&\n"; print "After: $'\n"; $ read3 Before: Bill Shakespeare Match: in After: Love

More on Substitution • If you want to replace all matches instead of just the first match, use the g option for substitution: $ cat sub3 #!/usr/local/bin/perl5 -w $_ = "Bill Shakespeare in love with Bill Gates"; s/Bill/William/; print "Sub1: $_\n"; $_ = "Bill Shakespeare in love with Bill Gates"; s/Bill/William/g; print "Sub2: $_\n"; $ sub3 Sub1: William Shakespeare in love with Bill Gates Sub2: William Shakespeare in love with William Gates $

More on Substitution • You can use variable interpolation in substitutions: $ cat sub4 #!/usr/local/bin/perl5 -w $find = "Bill"; $replace = "William"; $_ = "Bill Shakespeare in love with Bill Gates"; s/$find/$replace/g; print "$_\n"; $ sub4 William Shakespeare in love with William Gates $

More on Substitution • Pattern characters in the regular expression allows patterns to be matched, not just fixed characters: $ cat sub5 #!/usr/local/bin/perl5 -w $_ = "Bill Shakespeare in love with Bill Gates"; s/(\w+)/<$1>/g; print "$_\n"; $ sub5 <Bill> <Shakespeare> <in> <love> <with> <Bill> <Gates> $

More on Substitution • Substitution also allows you to: • ignore case • use alternate delimiters • use =~ $ cat sub6 #!/usr/local/bin/perl5 -w $line = "Bill Shakespeare in love with bill Gates"; $line =~ s#bill#William#gi; $line =~ s@Shakespeare@Gates@gi; print "$line\n"; $ sub6 William Gates in love with William Gates $

split • The split function allows you to break a string into fields. • split takes a regular expression and a string, and breaks up the line wherever the pattern occurs. $ cat split1 #!/usr/local/bin/perl5 -w $line = "Bill Shakespeare in love with Bill Gates"; @fields = split(/ /,$line); # split $line using space as delimiter print "$fields[0] $fields[3] $fields[6]\n"; $ split1 Bill love Gates $

split • You can use $_ with split. • split defaults to look for space delimiters. $ cat split2 #!/usr/local/bin/perl5 -w $_ = "Bill Shakespeare in love with Bill Gates"; @fields = split; # split $_ using space (default) as delimiter print "$fields[0] $fields[3] $fields[6]\n"; $ split2 Bill love Gates $

join • The join function allows you to glue strings in a list together. $ cat join1 #!/usr/local/bin/perl5 -w @list = qw(Bill Shakespeare dislikes Bill Gates); $line = join(" ", @list); print "$line\n"; $ join1 Bill Shakespeare dislikes Bill Gates $ • Note that the glue string is not a regular expression, just a normal string.

Regular Expressions