Regular Expressions in Perl - Part II

By Andrew Dougherty

Overview
• Grouping and hierarchical matching
• Extracting matches
• Matching repetitions

Grouping & Hierarchical Matching
• Grouping allows parts of a regular expression to be treated as a single unit.
• Example: /house(cat|keeper)/ matches 'housecat' or 'housekeeper'.
• () groups part of a pattern.
• | separates alternatives (or).
• [] defines a character class: a set of characters, any one of which may match.
• Examples (whitespace inside a pattern is literal unless the /x modifier is used, so the patterns are written without spaces):
    /(a|b)b/;            # matches 'ab' or 'bb'
    /(ac|b)b/;           # matches 'acb' or 'bb'
    /(^a|b)c/;           # matches 'ac' at the start of the string or 'bc' anywhere in the string
    /(a|[bc])d/;         # matches 'ad', 'bd', or 'cd'
    /house(cat(s|)|)/;   # matches 'housecats', 'housecat', or 'house'; groups can be nested
• Backtracking: the process of trying one alternative, seeing whether it matches, and moving on to the next alternative if it doesn't.
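
The grouping patterns above can be tried against a small word list. This is only a sketch: the word list is made up for illustration, and the patterns are anchored with ^ and $ (an addition not on the slide) so that each word has to match as a whole.

```perl
use strict;
use warnings;

# Hypothetical test words; only the patterns come from the slide.
my @words = qw(housecat housekeeper house housecats bd xd);

# Grouping with alternation: 'house' followed by 'cat' or 'keeper'.
my @group = grep { /^house(cat|keeper)$/ } @words;

# Nested groups with empty alternatives: 'housecats', 'housecat', or 'house'.
my @nested = grep { /^house(cat(s|)|)$/ } @words;

# A character class inside an alternation: 'ad', 'bd', or 'cd'.
my @class = grep { /^(a|[bc])d$/ } @words;

print "group:  @group\n";
print "nested: @nested\n";
print "class:  @class\n";
```

Without the anchors, /house(cat(s|)|)/ would match every word that merely contains 'house', including 'housekeeper'.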

Extracting Matches
• () also allows extraction of the parts of a string that matched.
• Perl sets the variables $1, $2, ..., $n to the parts of the string matched by the first, second, ..., nth group.
• Example:
    # extract hours, minutes, seconds
    if ($time =~ /(\d\d):(\d\d):(\d\d)/) {    # match hh:mm:ss format
        $hours   = $1;
        $minutes = $2;
        $seconds = $3;
    }
• In list context the match returns the captured groups, so the statement above can also be written as:
    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
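
A runnable sketch of both extraction styles, assuming a made-up sample value for $time. It also shows what happens when the match fails in list context, which the slide does not cover.

```perl
use strict;
use warnings;

my $time = "07:45:09";   # hypothetical sample value

# List context: the match returns the captured groups directly.
my ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

# Scalar context: test the match first, then read $1..$3.
if ($time =~ /(\d\d):(\d\d):(\d\d)/) {
    print "via \$1..\$3: $1 h, $2 min, $3 s\n";
}
print "via list:    $hours h, $minutes min, $seconds s\n";

# A failed match in list context returns an empty list,
# so the target variable stays undefined.
my ($h) = ("not a time" =~ /(\d\d):(\d\d):(\d\d)/);
print "no match\n" unless defined $h;
```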

Extracting Matches Cont.
• Backreferences: \1, \2, etc. are matching variables that can be used inside a regular expression.
• Example: finding doubled three-letter words separated by a space, like 'the the'.
    /(\w\w\w)\s\1/;
  The three letters captured by the group are assigned to \1, which then matches the same three letters appearing after the space.
• Finding words made of a repeated pattern of 4 letters, 3 letters, 2 letters, or 1 letter:
    % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\1$' /usr/dict/words
    beriberi
    booboo
    coco
    aa
  All four words match the grep pattern.
• $-[n] and $+[n] give the start and end positions of the nth group's match in the string.
    $x = "Mmm...donut, thought Homer";           # string stored in variable $x
    $x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/;       # regular expression to be matched
    foreach $expr (1..$#-) {
        print "Match $expr: '${$expr}' at position ($-[$expr],$+[$expr])\n";
    }
  Prints
    Match 1: 'Mmm' at position (0,3)
    Match 2: 'donut' at position (6,11)
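
The two ideas on this slide can be sketched together under `use strict`. The sample sentence is an assumption for illustration. Because strict mode forbids the symbolic reference ${$expr} used in the slide's loop, this version recovers each group with substr and the @- and @+ offsets instead.

```perl
use strict;
use warnings;

# Doubled word: \1 must repeat exactly what group 1 captured.
my $text = "Paris in the the spring";   # hypothetical sample sentence
my ($doubled) = $text =~ /\b(\w+)\s+\1\b/;
print "doubled word: $doubled\n";

# Match positions: $-[n] is where group n starts, $+[n] is where it ends.
my $x = "Mmm...donut, thought Homer";
my @spans;
if ($x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/) {
    for my $n (1 .. $#-) {
        my $matched = substr($x, $-[$n], $+[$n] - $-[$n]);
        push @spans, "Match $n: '$matched' at position ($-[$n],$+[$n])";
    }
}
print "$_\n" for @spans;
```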

Matching Repetitions
• Quantifier metacharacters determine how many repeats of a portion of a regular expression count as a match.
• a?     = match 'a' 1 or 0 times.
• a*     = match 'a' 0 or more times (any number of times).
• a+     = match 'a' 1 or more times (at least once).
• a{n,m} = match 'a' at least n times, but not more than m times.
• a{n,}  = match 'a' at least n times.
• a{n}   = match 'a' exactly n times.
• Examples:
    /[a-z]+\s+\d*/;              # match a lowercase word, at least some space, and any number of digits
    /(\w+)\s+\1/;                # match doubled words of arbitrary length (like 'the the')
    /y(es)?/i;                   # matches 'y', 'Y', or a case-insensitive 'yes'
    $year =~ /^\d{4}$|^\d{2}$/;  # makes sure year is 2 or 4 digits long (like 10 or 2010); the anchors reject 3-digit years
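
A small table-driven check of the quantifiers listed above. The inputs and expected results are illustrative assumptions, and every pattern is anchored so that the quantifier alone decides the outcome.

```perl
use strict;
use warnings;

# Each entry: [ pattern, input, whether a match is expected ].
my @cases = (
    [ qr/^ab?c$/,          'ac',     1 ],  # ? allows zero b's
    [ qr/^ab*c$/,          'abbbc',  1 ],  # * allows any number of b's
    [ qr/^ab+c$/,          'ac',     0 ],  # + needs at least one b
    [ qr/^ab{2,3}c$/,      'abbbbc', 0 ],  # {2,3} rejects four b's
    [ qr/^ab{2,}c$/,       'abbbbc', 1 ],  # {2,} accepts two or more
    [ qr/^ab{2}c$/,        'abbc',   1 ],  # {2} needs exactly two
    [ qr/^\d{4}$|^\d{2}$/, '201',    0 ],  # anchored year check rejects three digits
);

my @results;
for my $case (@cases) {
    my ($re, $input, $expected) = @$case;
    my $got = ($input =~ $re) ? 1 : 0;
    push @results, $got;
    printf "%-10s %-9s (expected %s)\n",
        $input, $got ? 'match' : 'no match', $expected ? 'match' : 'no match';
}
```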

Matching Repetitions Cont.
• Maximal match (greedy) quantifiers grab as much of the string as possible.
    $x = "the cat in the hat";
    $x =~ /^(.*)(cat)(.*)$/;   # $1 = 'the '
                               # $2 = 'cat'
                               # $3 = ' in the hat'
    $x =~ /^(.*)(at)(.*)$/;    # $1 = 'the cat in the h'
                               # $2 = 'at'
                               # $3 = '' (matches zero characters)
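
The greedy behaviour can be checked directly by capturing the groups in list context. This is a minimal sketch that uses the string implied by the captures on the slide.

```perl
use strict;
use warnings;

my $x = "the cat in the hat";

# (.*) first takes the whole string, then backs off until 'cat' can match.
my @cat = $x =~ /^(.*)(cat)(.*)$/;

# With 'at', the first (.*) only has to back off to the last 'at' (in 'hat'),
# so the final group is left with the empty string.
my @at = $x =~ /^(.*)(at)(.*)$/;

print join('|', @cat), "\n";
print join('|', @at),  "\n";
```

The empty third capture in the second case is a successful match of zero characters, not a failure: the list still has three elements.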

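Matching Repetitions Cont.
• Principle 0: Taken as a whole, any regular expression will be matched at the earliest possible position in the string.
• Principle 1: In an alternation (a|b|c...), the leftmost alternative that allows a match for the whole regular expression will be the one used.
• Examples:
    $x = "The programming republic of Perl";
    $x =~ /^(.+)(e|r)(.*)$/;   # $1 = 'The programming republic of Pe'
                               # $2 = 'r'
                               # $3 = 'l'
  The match starts at the earliest position, 'T'. Although 'e' is the leftmost alternative, the greedy (.+) takes as much as it can first, so 'r' is the alternative that ends up matching.
    $x =~ /(m{1,2})(.*)$/;     # $1 = 'mm'
                               # $2 = 'ing republic of Perl'
  The earliest possible match starts at the first 'm' in 'programming', and m{1,2} then takes both m's.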
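
Principles 0 and 1 can be observed separately with a short sketch. The "cats" strings are illustrative assumptions and are not taken from the slide.

```perl
use strict;
use warnings;

my $x = "The programming republic of Perl";

# Principle 0: the match begins at the earliest position where it can succeed,
# which is the first 'm' in 'programming'.
my @p0 = $x =~ /(m{1,2})(.*)$/;
my $start = $-[0];    # offset where the overall match began

# Principle 1: the leftmost alternative that lets the whole pattern match wins,
# even though longer alternatives would also fit.
my ($alt) = "cats and dogs" =~ /(c|ca|cat|cats)/;

# With anchors, only the last alternative lets the whole pattern match.
my ($anchored) = "cats" =~ /^(c|ca|cat|cats)$/;

print "p0: '$p0[0]' then '$p0[1]', starting at offset $start\n";
print "alt: '$alt'\n";
print "anchored: '$anchored'\n";
```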

Matching Repetitions Cont.
• Principle 2: The maximal matching quantifiers will in general match as much of the string as possible while still allowing the whole expression to match.
    $x = "The programming republic of Perl";
    $x =~ /.*(m{1,2})(.*)$/;       # $1 = 'm'
                                   # $2 = 'ing republic of Perl'
• Principle 3: If there are two or more elements in a regular expression, the leftmost greedy quantifier, if any, will match as much of the string as possible while still allowing the whole expression to match. The next leftmost greedy quantifier then matches as much as possible of what is left, and so on until all elements of the expression are satisfied.
    $x =~ /(.?)(m{1,2})(.*)$/;     # $1 = 'a'
                                   # $2 = 'mm'
                                   # $3 = 'ing republic of Perl'
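
A short sketch that runs both patterns side by side, to make the interaction between competing greedy quantifiers visible. It assumes the same sample string as the preceding slide.

```perl
use strict;
use warnings;

my $x = "The programming republic of Perl";

# Principle 2: the leading .* is greedy and takes everything up to and
# including the first 'm', leaving only one 'm' for m{1,2}.
my @p2 = $x =~ /.*(m{1,2})(.*)$/;

# Principle 3: (.?) is the leftmost quantifier and takes one character ('a');
# m{1,2} comes next and takes both m's; (.*) gets the rest.
my @p3 = $x =~ /(.?)(m{1,2})(.*)$/;

print "principle 2: ", join(' | ', @p2), "\n";
print "principle 3: ", join(' | ', @p3), "\n";
```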

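Matching Repetitions Cont.
• Minimal match (non-greedy) quantifiers match as little of the string as possible.
• a??     = match 'a' 0 or 1 times. Try 0 first, then 1.
• a*?     = match 'a' 0 or more times, but as few times as possible.
• a+?     = match 'a' 1 or more times, but as few times as possible.
• a{n,m}? = match 'a' at least n times, not more than m times, as few times as possible.
• a{n,}?  = match 'a' at least n times, but as few times as possible.
• a{n}?   = match 'a' exactly n times (same as a{n}).
• Example:
    $x = "The programming republic of Perl";
    $x =~ /^(.+?)(e|r)(.*)$/;  # $1 = 'Th'
                               # $2 = 'e'
                               # $3 = ' programming republic of Perl'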
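
Comparing the greedy and non-greedy forms of the same pattern shows the difference most clearly. This is a sketch: the $html string and the tag patterns are hypothetical additions for illustration.

```perl
use strict;
use warnings;

my $x = "The programming republic of Perl";

# Greedy: (.+) takes as much as possible, so the alternation matches the last 'r'.
my @greedy = $x =~ /^(.+)(e|r)(.*)$/;

# Non-greedy: (.+?) takes as little as possible, so the alternation matches the first 'e'.
my @lazy = $x =~ /^(.+?)(e|r)(.*)$/;

# Hypothetical markup string: a common place where the distinction matters.
my $html = "<b>one</b> and <b>two</b>";
my ($greedy_tag) = $html =~ /<b>(.*)<\/b>/;    # runs to the last </b>
my ($lazy_tag)   = $html =~ /<b>(.*?)<\/b>/;   # stops at the first </b>

print "greedy: ", join(' | ', @greedy), "\n";
print "lazy:   ", join(' | ', @lazy),   "\n";
print "greedy tag: $greedy_tag\n";
print "lazy tag:   $lazy_tag\n";
```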

Works Cited
• https://www.cs.drexel.edu/~knowak/cs265_fall_2010/perlretut_2007.pdf
• http://www.cs.tut.fi/~jkorpela/perl/regexp.html
