Understand the concept of regular expressions, learn commonly used operations, handle special cases, and explore simple uses of regular expressions in Perl with practical examples. Enhance your understanding of pattern matching with single-character patterns and groups. Dive into pattern memory for more advanced expression manipulation.

Regular Expression (1)
Learning Objectives:
• To understand the concept of a regular expression
• To learn commonly used operations involving regular expressions / pattern matching
• To learn the special cases that occur in regular expressions / pattern matching

Simple Uses of Regular Expressions
• In Perl, we can make Shakespeare a regular expression by enclosing it in slashes:
    if(/Shakespeare/){
        print $_;
    }
• What is tested in the if-statement? Answer: $_.
• Can you write an even shorter statement using &&?

Simple Uses of Regular Expressions
    if(/Shakespeare/){
        print $_;
    }
• The previous example tests only one line, and prints the line if it contains Shakespeare.
• To work on all lines, add a loop:
    while(<>){
        if(/Shakespeare/){
            print;
        }
    }

Simple Uses of Regular Expressions
• What if we are not sure how to spell Shakespeare?
• The first part is easy (Shak), and there must be an r near the end.
• How can we express this idea?
    grep:  grep "Shak.*r" movie > result
    Perl:  while(<>){
               if(/Shak.*r/){
                   print;
               }
           }
• .* means “zero or more of any character” (any character except the newline).
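
The loose pattern above can be tried on a few lines without a separate input file. This is a minimal sketch: the sub name matching_lines and the sample lines are invented for illustration and are not part of the original slides.

```perl
use strict;
use warnings;

# Return the lines that contain "Shak", then any characters, then "r".
# matching_lines is a hypothetical helper name used only in this sketch.
sub matching_lines {
    my @lines = @_;
    return grep { /Shak.*r/ } @lines;
}

# Invented sample data.
my @sample = (
    "Shakespeare in Love",
    "Shakspere wrote plays",
    "Shaken, not stirred",
    "Bill Gates",
);

print "$_\n" for matching_lines(@sample);
```

The first three sample lines are printed. "Shaken, not stirred" matches as well, because .* is free to span everything between "Shak" and the r in "stirred"; a loose pattern like this can accept more than intended.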

Single-Character Patterns
• The dot “.” matches any single character except the newline (\n).
• For example, the pattern /a./ matches any two-character sequence that starts with a and is not “a\n”.
• Use \. if you really want to match the period.
    $ cat test
    hi
    hi bob.
    $ cat sub3
    #!/usr/local/bin/perl5 -w
    while(<>){
        if(/\./){
            print;
        }
    }
    $ sub3 test
    hi bob.
    $

Single-Character Groups (1)
• If you want to specify one out of a group of characters to match, use [ ]:
    /[abcde]/
  This matches a string containing any one of the first 5 lowercase letters, while:
    /[aeiouAEIOU]/
  matches any of the 5 vowels in either upper or lower case.

Single-Character Groups (2)
• If you want ] in the group, put a backslash before it, or put it as the first character in the list:
    /[abcde]]/     # matches [abcde] + ]
    /[abcde\]]/    # okay
    /[]abcde]/     # also okay
• Use - for ranges of characters (like a through z):
    /[0123456789]/ # any single digit
    /[0-9]/        # same
• If you want - in the list, put a backslash before it, or put it at the beginning/end:
    /[X-Z]/        # matches X, Y, Z
    /[X\-Z]/       # matches X, -, Z
    /[XZ-]/        # matches X, Z, -
    /[-XZ]/        # matches -, X, Z

Single-Character Groups (3)
• More range examples:
    /[0-9\-]/       # match 0-9, or minus
    /[0-9a-z]/      # match any digit or lowercase letter
    /[a-zA-Z0-9_]/  # match any letter, digit, underscore
• There is also a negated character group, which starts with a ^ immediately after the left bracket. This matches any single character not in the list.
    /[^0123456789]/ # match any single non-digit
    /[^0-9]/        # same
    /[^aeiouAEIOU]/ # match any single non-vowel
    /[^\^]/         # match any single character except ^

Single-Character Groups (4)
• For convenience, some common character groups are predefined (equivalents shown for ASCII text):
    Predefined        Group            Negated           Negated Group
    \d (a digit)      [0-9]            \D (non-digit)    [^0-9]
    \w (word char)    [a-zA-Z0-9_]     \W (non-word)     [^a-zA-Z0-9_]
    \s (space char)   [ \t\n\r\f]      \S (non-space)    [^ \t\n\r\f]
• \d matches any digit
• \w matches any letter, digit, underscore
• \s matches any space, tab, newline, carriage return, or form feed
• You can use these predefined groups inside other groups:
    /[\da-fA-F]/    # match any hexadecimal digit
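
A small sketch of the predefined groups in action. The sub names classify and is_hex_digit are hypothetical, and the =~ operator used here is the binding operator covered later in the deck. Note that a predefined group such as \d only combines with other characters when it sits inside [ ].

```perl
use strict;
use warnings;

# Classify a single character using the predefined groups.
# The order matters: every digit is also a word character.
sub classify {
    my ($ch) = @_;
    return "digit"      if $ch =~ /\d/;
    return "word char"  if $ch =~ /\w/;
    return "whitespace" if $ch =~ /\s/;
    return "other";
}

# True (1) if the character is a hexadecimal digit.
# The brackets are required: /\da-fA-F/ without them would look for
# a digit followed by the literal text "a-fA-F".
sub is_hex_digit {
    my ($ch) = @_;
    return $ch =~ /[\da-fA-F]/ ? 1 : 0;
}

print classify("7"),  "\n";   # digit
print classify("x"),  "\n";   # word char
print classify("_"),  "\n";   # word char
print classify("\t"), "\n";   # whitespace
print classify("-"),  "\n";   # other
print is_hex_digit("F"), is_hex_digit("g"), "\n";   # 10
```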

Split (1)
• The split function allows you to break a string into fields.
• split takes a regular expression and a string, and breaks up the string wherever the pattern occurs.
    $ cat split1
    #!/usr/local/bin/perl5 -w
    $line = "Bill Shakespeare in love with Bill Gates";
    @fields = split(/ /,$line); # split $line using space as delimiter
    print "$fields[0] $fields[3] $fields[6]\n";
    $ split1
    Bill love Gates
    $

Split (2)
• You can use $_ with split.
• With no arguments, split works on $_ and splits on whitespace.
    $ cat split2
    #!/usr/local/bin/perl5 -w
    $_ = "Bill Shakespeare in love with Bill Gates";
    @fields = split; # split $_ using whitespace (default) as delimiter
    print "$fields[0] $fields[3] $fields[6]\n";
    $ split2
    Bill love Gates
    $
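
The two split examples give the same result only because the sample sentence has exactly one space between words. A minimal sketch of where they differ, using hypothetical helper names and an invented input with a leading space and a double space:

```perl
use strict;
use warnings;

# Split on every single space, as in split1.
sub fields_by_space {
    my ($line) = @_;
    return split(/ /, $line);
}

# Split the way a bare "split" does on $_: the special pattern ' '
# skips leading whitespace and treats runs of whitespace as one delimiter.
sub fields_by_default {
    my ($line) = @_;
    return split(' ', $line);
}

my $line = " Bill  Gates";    # invented input: leading space, double space

my @a = fields_by_space($line);
my @b = fields_by_default($line);

print scalar(@a), " fields with / /\n";           # 4: "", "Bill", "", "Gates"
print scalar(@b), " fields with the default\n";   # 2: "Bill", "Gates"
```

With / / each single space is a delimiter, so empty fields appear before the leading space and between the doubled spaces. The default form is usually the better choice for free-form text.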

Pattern Memory (1)
• How would we match a pattern that starts and ends with the same letter or word?
• For this, we need to remember the pattern.
• Use ( ) around any pattern to put that part of the string into memory (it has no effect on what the pattern matches).
• To recall memory, include a backslash followed by an integer.
    /Bill(.)Gates\1/

Pattern Memory (2)
• Example:
    /Bill(.)Gates\1/
  This example matches a string containing Bill, followed by any single non-newline character, followed by Gates, followed by that same single character.
• So, it matches:
    Bill!Gates!
    Bill-Gates-
  but not:
    Bill?Gates!
    Bill-Gates_
  (Note that /Bill.Gates./ would match all four)

Pattern Memory (3)
• More examples:
    /a(.)b(.)c\2d\1/
• This example matches a string containing a, a single character (#1), b, another single character (#2), c, character #2 again, d, and character #1 again.
• So it matches: a-b!c!d-.

Pattern Memory (4)
• The remembered part can have more than a single character.
• For example:
    /a(.*)b\1c/
• This example matches an a, followed by any number of characters (even zero), followed by b, followed by the same sequence of characters, followed by c.
• So it matches:
    aBillbBillc
    abc
  but not:
    aBillbBillGatesc
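
The pattern-memory examples can be checked directly. This is a small sketch; the sub names are hypothetical and the test strings are the ones from the slides.

```perl
use strict;
use warnings;

# True (1) if the text contains Bill, one character, Gates,
# and then that same character again.
sub bill_gates_same_char {
    my ($text) = @_;
    return $text =~ /Bill(.)Gates\1/ ? 1 : 0;
}

# True (1) if the text contains a, some sequence, b,
# the same sequence again, and c.
sub repeated_sequence {
    my ($text) = @_;
    return $text =~ /a(.*)b\1c/ ? 1 : 0;
}

for my $text ("Bill!Gates!", "Bill-Gates-", "Bill?Gates!", "Bill-Gates_") {
    print "$text: ", bill_gates_same_char($text) ? "match" : "no match", "\n";
}

for my $text ("aBillbBillc", "abc", "aBillbBillGatesc") {
    print "$text: ", repeated_sequence($text) ? "match" : "no match", "\n";
}
```

The first two strings in each list match and the remaining ones do not, in line with the slides.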

Or
• How do we pick from a set of alternatives when the alternatives are longer than one character?
• The following example matches either Gates or Clinton or Shakespeare:
    /Gates|Clinton|Shakespeare/
• For single-character alternatives, /[abc]/ is the same as /a|b|c/.

Anchoring Patterns
• Anchors require that the pattern be at the beginning or end of the line.
• ^ matches the beginning of the line (only if ^ is the first character of the pattern):
    /^Bill/    # match lines that begin with Bill
    /^Gates/   # match lines that begin with Gates
    /Bill\^/   # match lines containing Bill^ somewhere
    /\^/       # match lines containing ^
• $ matches the end of the line (only if $ is the last character of the pattern):
    /Bill$/    # match lines that end with Bill
    /Gates$/   # match lines that end with Gates
    /$Bill/    # match with contents of scalar $Bill
    /\$/       # match lines containing $
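
A short sketch of the anchors on whole lines. The sub names and sample lines are invented for illustration. The lines keep their trailing newline, as they would when read with <>, to show that $ still matches just before a final newline.

```perl
use strict;
use warnings;

# True (1) if the line begins with Bill.
sub begins_with_bill {
    my ($line) = @_;
    return $line =~ /^Bill/ ? 1 : 0;
}

# True (1) if the line ends with Gates (a trailing newline is allowed).
sub ends_with_gates {
    my ($line) = @_;
    return $line =~ /Gates$/ ? 1 : 0;
}

# True (1) if the line contains a literal ^ anywhere.
sub contains_caret {
    my ($line) = @_;
    return $line =~ /\^/ ? 1 : 0;
}

# Invented sample lines, each ending in a newline.
for my $line ("Bill Gates\n", "Gates, Bill\n", "Bill^Gates\n") {
    printf "begins=%d ends=%d caret=%d  %s",
        begins_with_bill($line), ends_with_gates($line),
        contains_caret($line), $line;
}
```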

Using =~ (1)
• What if you want to match a different variable than $_?
• Answer: Use =~.
• Examples:
    $name = "Bill Shakespeare";
    $name =~ /^Bill/;      # true
    $name =~ /(.)\1/;      # also true (matches ll)
    if($name =~ /(.)\1/){
        print "$name\n";
    }

Using =~ (2)
• An example using =~ to match <STDIN>:
    $ cat match1
    #!/usr/local/bin/perl5 -w
    print "Quit (y/n)? ";
    if(<STDIN> =~ /^[yY]/){
        print "Quitting\n";
        exit;
    }
    print "Continuing\n";
    $ match1
    Quit (y/n)? y
    Quitting
    $

Ignoring Case
• In the previous example, we used [yY] to match either upper or lower case.
• Perl has an “ignore case” option for pattern matching:
    /somepattern/i
    $ cat match1a
    #!/usr/local/bin/perl5 -w
    print "Quit (y/n)? ";
    if(<STDIN> =~ /^y/i){
        print "Quitting\n";
        exit;
    }
    print "Continuing\n";
    $ match1a
    Quit (y/n)? Y
    Quitting
    $

Slash and Backslash
• If your pattern has a slash character (/), you must precede each one with a backslash (\):
    $ cat slash1
    #!/usr/local/bin/perl5 -w
    print "Enter path: ";
    $path = <STDIN>;
    if($path =~ /^\/usr\/local\/bin/){
        print "Path is /usr/local/bin\n";
    }
    $ slash1
    Enter path: /usr/local/bin
    Path is /usr/local/bin
    $

Different Pattern Delimiters
• If your pattern has lots of slash characters (/), you can also use a different pattern delimiter with the form:
    m#somepattern#
• The # can be any non-alphanumeric, non-whitespace character.
    $ cat slash1a
    #!/usr/local/bin/perl5 -w
    print "Enter path: ";
    $path = <STDIN>;
    if($path =~ m#^/usr/local/bin#){
    # if($path =~ m@^/usr/local/bin@){    # also works
        print "Path is /usr/local/bin\n";
    }
    $ slash1a
    Enter path: /usr/local/bin
    Path is /usr/local/bin
    $

Special Read-Only Variables (1)
• After a successful pattern match, the variables $1, $2, $3,… are set to the same values as \1, \2, \3,…
• You can use $1, $2, $3,… later in your program.
    $ cat read1
    #!/usr/local/bin/perl5 -w
    $_ = "Bill Shakespeare in Love";
    /(\w+)\W+(\w+)/; # match first two words
    # $1 is now "Bill" and $2 is now "Shakespeare"
    print "The first name of $2 is $1\n";
    $ read1
    The first name of Shakespeare is Bill

Special Read-Only Variables (2)
• You can also get the values of $1, $2, $3,… directly by placing the match in a list context; the match then returns the remembered parts as a list:
    $ cat read2
    #!/usr/local/bin/perl5 -w
    $_ = "Bill Shakespeare in Love";
    ($first, $last) = /(\w+)\W+(\w+)/;
    print "The first name of $last is $first\n";
    $ read2
    The first name of Shakespeare is Bill
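
A minimal sketch of the list-context form wrapped in a sub. The name first_two_words is hypothetical. One reason to prefer this form is that $1 and $2 are only set by a successful match, so a failed match can leave values from an earlier match in place; with the list form a failed match simply yields no values.

```perl
use strict;
use warnings;

# Return the first two words of the text, or an empty list
# if the text does not contain two words.
sub first_two_words {
    my ($text) = @_;
    my ($first, $second) = $text =~ /(\w+)\W+(\w+)/;
    return defined $first ? ($first, $second) : ();
}

my ($first, $last) = first_two_words("Bill Shakespeare in Love");
print "The first name of $last is $first\n";

my @none = first_two_words("Bill");
print "Words found in a one-word string: ", scalar(@none), "\n";
```

The first call prints the same sentence as read2; the second reports 0 because "Bill" alone does not contain two words.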

Special Read-Only Variables (3)
• Other read-only variables:
• $& is the part of the string that matched the pattern.
• $` is the part of the string before the match.
• $' is the part of the string after the match.
    $ cat read3
    #!/usr/local/bin/perl5 -w
    $_ = "Bill Shakespeare in Love";
    / in /;
    print "Before: $`\n";
    print "Match: $&\n";
    print "After: $'\n";
    $ read3
    Before: Bill Shakespeare
    Match:  in 
    After: Love
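
Because the matched text in read3 includes the spaces around "in", the output is easier to read with visible brackets. The sketch below copies the three variables into ordinary variables right after the match; the sub name split_around_in is hypothetical.

```perl
use strict;
use warnings;

# Return the text before, inside, and after the first " in " in the
# string, or an empty list when there is no match.
sub split_around_in {
    my ($text) = @_;
    if ($text =~ / in /) {
        # Copy the special variables immediately; a later match
        # would overwrite them.
        my ($before, $match, $after) = ($`, $&, $');
        return ($before, $match, $after);
    }
    return;
}

my ($before, $match, $after) = split_around_in("Bill Shakespeare in Love");
print "Before: [$before]\n";   # Before: [Bill Shakespeare]
print "Match:  [$match]\n";    # Match:  [ in ]
print "After:  [$after]\n";    # After:  [Love]
```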

Repeat {n}
• /(fred){5,15}/
  Match from five to fifteen repetitions of “fred”.
• /a{5,}/
  Match five or more repetitions of “a”.
• /\w{8}/
  Match exactly 8 word characters.
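
A short sketch of the repeat counts. The sub names and test strings are invented. One point worth seeing in practice: /\w{8}/ finds 8 word characters anywhere in the string, so a longer word also matches; adding the ^ and $ anchors limits the match to a string of exactly 8.

```perl
use strict;
use warnings;

# True (1) if the text contains five or more consecutive a's.
sub has_five_as {
    my ($text) = @_;
    return $text =~ /a{5,}/ ? 1 : 0;
}

# True (1) if the text contains 8 word characters in a row somewhere.
sub contains_eight_word_chars {
    my ($text) = @_;
    return $text =~ /\w{8}/ ? 1 : 0;
}

# True (1) only if the whole text is exactly 8 word characters.
sub is_exactly_eight_word_chars {
    my ($text) = @_;
    return $text =~ /^\w{8}$/ ? 1 : 0;
}

print has_five_as("aaaa"), has_five_as("aaaaaa"), "\n";         # 01
print contains_eight_word_chars("Shakespeare"), "\n";           # 1 (11 letters)
print is_exactly_eight_word_chars("Shakespeare"), "\n";         # 0
print is_exactly_eight_word_chars("password"), "\n";            # 1
```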