Mastering Regular Expressions: Subpatterns and Back-references

Regular expressions (contd.) -- remembering subpattern matches
• When a <pattern> is being matched against a target string, substrings that match sub-patterns can be remembered and re-used later in the same pattern
• Sub-patterns whose matching substrings are to be remembered are enclosed in parentheses
• The sub-patterns are implicitly numbered, starting from 1, and their matching substrings can then be re-used later in the pattern by using back-references like \1 or \2 or \3
• However, inside a double-quoted PHP string the backslash must itself be escaped, so we must type \\1 or \\2 or \\3 in our regular expressions
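
The need for the doubled backslash can be checked directly. The sketch below is illustrative only (the variable names are not from the slides); it compares the same pattern written three ways and relies on PHP treating \1 inside double quotes as an octal escape.

```php
<?php
// In a double-quoted string "\1" is an octal escape: it yields the single
// byte 0x01, so the regex engine never sees a back-reference.
$unescaped = "/([A-Z])\1/";

// Doubling the backslash, or using single quotes, passes \1 through intact.
$escaped = "/([A-Z])\\1/";
$single  = '/([A-Z])\1/';

echo strlen($unescaped), " ", strlen($escaped), "\n"; // 10 11

$result = preg_replace($single, "_", "klmAAklmABklm");
echo $result, "\n"; // klm_klmABklm
```

Writing patterns in single quotes is a common way to avoid the double escaping altogether.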

Using back-references (contd.)
• PHP code
    <?php
    $myString1 = "klmAklmAAklmABklmBklmBBklm";
    echo "myString1 is $myString1<br>";
    // ([A-Z])\\1 matches an upper-case letter followed by the same letter again
    $myString1 = preg_replace("/([A-Z])\\1/","_",$myString1);
    echo "myString1 is now $myString1<br>";
    ?>
• Resultant output is
    myString1 is klmAklmAAklmABklmBklmBBklm
    myString1 is now klmAklm_klmABklmBklm_klm

Regular expressions (contd.) -- using subpattern matches in replacements
• We saw that, within a regular expression, substrings that matched sub-patterns can be re-used later in the pattern by preceding the appropriate integer with a pair of backslashes, \\
• Within a <replacement>, substrings that matched sub-patterns in the regular expression can be used by preceding the appropriate integer with a dollar sign, $
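
As a small illustration of $n references in a replacement, the sketch below swaps two captured words. It also uses the ${n} form, which PHP accepts when a literal digit must follow the reference. The sample strings and variable names are made up for this example.

```php
<?php
// Swap two words by referring to the captured sub-patterns in the replacement.
$swapped = preg_replace('/(\w+) (\w+)/', '$2 $1', 'hello world');
echo $swapped, "\n"; // world hello

// ${1} delimits the group number when a literal digit follows it;
// '$10' would instead be read as a reference to group 10.
$padded = preg_replace('/(\d)/', '${1}0', 'a1b2');
echo $padded, "\n"; // a10b20
```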

Using sub-pattern matches in replacements (contd.)
• PHP code
    <?php
    $myString = "<p>This is paragraph 1.</p><p>This is paragraph 2.</p>";
    echo "myString is ".str_replace("<","&lt;",$myString)."<br>";
    $myString = preg_replace("/<(\w+)>(.+?)<\/\\1>/","$2",$myString);
    echo "myString is now ".str_replace("<","&lt;",$myString);
    ?>
• Resultant output is
    myString is <p>This is paragraph 1.</p><p>This is paragraph 2.</p>
    myString is now This is paragraph 1.This is paragraph 2.

A reminder about greedy/frugal quantifiers
• What would happen if we had used a greedy quantifier on the previous slide?
• PHP code
    <?php
    $myString = "<p>This is paragraph 1.</p><p>This is paragraph 2.</p>";
    echo "myString is ".str_replace("<","&lt;",$myString)."<br>";
    $myString = preg_replace("/<(\w+)>(.+)<\/\\1>/","$2",$myString);
    echo "myString is now ".str_replace("<","&lt;",$myString);
    ?>
• Resultant output is
    myString is <p>This is paragraph 1.</p><p>This is paragraph 2.</p>
    myString is now This is paragraph 1.</p><p>This is paragraph 2.
• The greedy (.+) runs on to the last </p>, so only the outermost pair of tags is removed

Choice of regexp delimiters
• Up to now, we have used the forward slash character to mark the start and end of regular expressions
• In fact, we can use any character that is not alphanumeric, a backslash or whitespace for the purpose
• For example, instead of writing /ab+c/ we could write %ab+c%
• This is useful when we wish to use the / character in our regular expression
• See next slide

Using different regexp delimiters (contd.)
• In the regular expression below, we do not have to escape the / character at the start of the close tag, because we are not using the / character as the regexp delimiter:
    <?php
    $myString = "<p>This is paragraph 1.</p><p>This is paragraph 2.</p>";
    echo "myString is ".str_replace("<","&lt;",$myString)."<br>";
    $myString = preg_replace("%<(\w+)>(.+?)</\\1>%","$2",$myString);
    echo "myString is now ".str_replace("<","&lt;",$myString);
    ?>
• Resultant output is
    myString is <p>This is paragraph 1.</p><p>This is paragraph 2.</p>
    myString is now This is paragraph 1.This is paragraph 2.
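
The saving is easiest to see on a pattern that is mostly slashes. The following is a minimal sketch with a made-up file path; it writes the same pattern with / and with # as the delimiter.

```php
<?php
$path = "/usr/local/bin/php";

// With / as the delimiter, every / inside the pattern needs a backslash.
$withSlashes = preg_replace('/^\/usr\/local\//', '', $path);

// With # as the delimiter, the same pattern needs no escaping at all.
$withHashes = preg_replace('#^/usr/local/#', '', $path);

echo $withSlashes, "\n"; // bin/php
echo $withHashes, "\n";  // bin/php
```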

Using regexps to process nested HTML
• PHP code:
    <?php
    $myString = "<ol><li>fred</li><li>tom</li></ol>";
    echo "myString is ".str_replace("<","&lt;",$myString)."<br>";
    $myString = preg_replace("%<(\w+)>(.+)</\\1>%","$2",$myString);
    echo "myString is now ".str_replace("<","&lt;",$myString);
    ?>
• Resultant output is
    myString is <ol><li>fred</li><li>tom</li></ol>
    myString is now <li>fred</li><li>tom</li>
• Suppose we wanted to remove all pairs of HTML tags. That is, suppose we wanted
    myString is <ol><li>fred</li><li>tom</li></ol>
    myString is now fredtom
• How would we achieve that?

Using regexps to process nested HTML (contd.)
• Would a frugal quantifier do the trick?
• PHP code:
    <?php
    $myString = "<ol><li>fred</li><li>tom</li></ol>";
    echo "myString is ".str_replace("<","&lt;",$myString)."<br>";
    $myString = preg_replace("%<(\w+)>(.+?)</\\1>%","$2",$myString);
    echo "myString is now ".str_replace("<","&lt;",$myString);
    ?>
• No. The resultant output is still
    myString is <ol><li>fred</li><li>tom</li></ol>
    myString is now <li>fred</li><li>tom</li>
• The reason is that, while preg_replace does replace all matching substrings in the target string, it does not perform replacement operations on the replacement string
• The value <li>fred</li><li>tom</li> above is the result of a replacement operation, so it is not modified
• However, suppose we wanted to remove all pairs, no matter how deep the nesting. How would we do that?

Using regexps to process nested HTML (contd.)
• We must use repetition to attack the nested instances
    <?php
    $myString = "<ol><li>fred</li><li>tom</li></ol>";
    echo "myString is ".str_replace("<","&lt;",$myString)."<br>";
    $newString = preg_replace("%<(\w+)>(.+?)</\\1>%","$2",$myString);
    // Repeat until a pass leaves the string unchanged
    while ($newString !== $myString) {
        $myString = $newString;
        echo "myString is now ".str_replace("<","&lt;",$myString)."<br>";
        $newString = preg_replace("%<(\w+)>(.+?)</\\1>%","$2",$myString);
    }
    ?>
• The resultant output is now
    myString is <ol><li>fred</li><li>tom</li></ol>
    myString is now <li>fred</li><li>tom</li>
    myString is now fredtom

Using regexps to process nested HTML (contd.)
• Of course, we would not want to run words together like we did on the last slide, so we would use spaces in the replacement string
    <?php
    $myString = "<ol><li>fred</li><li>tom</li></ol>";
    echo "myString is ".str_replace("<","&lt;",$myString)."<br>";
    $newString = preg_replace("%<(\w+)>(.+?)</\\1>%","$2",$myString);
    // Repeat until a pass leaves the string unchanged
    while ($newString !== $myString) {
        $myString = $newString;
        echo "myString is now ".str_replace("<","&lt;",$myString)."<br>";
        $newString = preg_replace("%<(\w+)>(.+?)</\\1>%"," $2 ",$myString);
    }
    ?>
• The resultant output is now
    myString is <ol><li>fred</li><li>tom</li></ol>
    myString is now <li>fred</li><li>tom</li>
    myString is now fred tom
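
The same repetition can be written without comparing strings, by asking preg_replace how many replacements it made through its optional fifth argument. The sketch below is one possible variant, not the slides' own code: stripTagPairs is a hypothetical helper name, and the final whitespace clean-up is an added assumption about the desired result.

```php
<?php
// Hypothetical helper: strips paired tags at every nesting depth.
function stripTagPairs(string $html): string
{
    do {
        // $count receives the number of replacements made in this pass.
        $html = preg_replace('%<(\w+)>(.+?)</\1>%', ' $2 ', $html, -1, $count);
    } while ($count > 0);

    // Collapse the runs of spaces introduced by the replacement string.
    return trim(preg_replace('/\s+/', ' ', $html));
}

$stripped = stripTagPairs("<ol><li>fred</li><li>tom</li></ol>");
echo $stripped, "\n"; // fred tom
```

The loop stops on the first pass that finds no tag pair, so it runs once more than the nesting depth.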

More on regular expressions – checking for context
• All the preg_replace operations we have written so far have consumed all the characters that matched the regular expression
• There was no notion of examining the context surrounding the consumed characters: any characters that were matched were consumed
• We often need some way of checking characters without removing them from the target string
• There are four meta-expressions for doing this, two for forward context and two for backward context

Look-ahead context checks
(?=regexp)
    This is a positive look-ahead context check
    It matches characters in the target string against the pattern specified by the embedded regular expression regexp without consuming them from the target string
• Example
    preg_replace("/\w+(?= cat)/","_",$myString)
    This replaces with an underscore any word that is followed by a space and the word cat, without removing the space or the word cat from the target string
• An example application is on the next slide

Look-ahead checks (contd.)
• Program fragment:
    $myString = "tabby is a big cat. fido is a fat dog.";
    echo "myString is $myString<br>";
    $myString = preg_replace("/\w+(?= cat)/","_",$myString);
    echo "myString is now $myString";
• Output produced
    myString is tabby is a big cat. fido is a fat dog.
    myString is now tabby is a _ cat. fido is a fat dog.
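
To see what "without consuming" means in practice, the following sketch runs the same replacement twice on a shortened version of the sentence above. The first pattern includes the context in the match; the second uses a look-ahead. The variable names are illustrative.

```php
<?php
$sentence = "tabby is a big cat.";

// The context " cat" is part of the match, so it is replaced as well.
$consumed = preg_replace('/\w+ cat/', '_', $sentence);

// The look-ahead only checks for " cat"; the replacement leaves it alone.
$checked = preg_replace('/\w+(?= cat)/', '_', $sentence);

echo $consumed, "\n"; // tabby is a _.
echo $checked, "\n";  // tabby is a _ cat.
```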

Look-ahead checks (contd.)
(?!regexp)
    This is a negative look-ahead context check
    It ensures that the following characters in the target string do not match the pattern specified by the embedded regular expression regexp
• Example
    preg_replace("/cow(?!boy)/","_",$myString)
    This replaces all sub-strings "cow" with "_", provided these sub-strings are not followed by the sub-string "boy"

Look-ahead checks (contd.)
• Program fragment:
    $myString = "Fred is a cowboy. Dolly is a cow.";
    echo "myString is $myString<br>";
    $myString = preg_replace("/cow(?!boy)/","_",$myString);
    echo "myString is now $myString";
• Output produced
    myString is Fred is a cowboy. Dolly is a cow.
    myString is now Fred is a cowboy. Dolly is a _.

Look-behind context checks
(?<=regexp)
    This is a positive look-behind context check
    It ensures that preceding characters in the target string match the pattern specified by the embedded regular expression regexp
• Example
    preg_replace("/(?<=cow)boy/","girl",$myString)
    This replaces all sub-strings "boy" with "girl", provided these sub-strings are preceded by the sub-string "cow", but the sub-string "cow" is not consumed

Look-behind checks (contd.)
• Program fragment:
    $myString = "Fred is a cowboy. Tom is a boy.";
    echo "myString is $myString<br>";
    $myString = preg_replace("/(?<=cow)boy/","girl",$myString);
    echo "myString is now $myString";
• Output produced
    myString is Fred is a cowboy. Tom is a boy.
    myString is now Fred is a cowgirl. Tom is a boy.

Look-behind checks (contd.)
(?<!regexp)
    This is a negative look-behind context check
    It ensures that preceding characters in the target string do not match the pattern specified by the embedded regular expression regexp
• Example
    preg_replace("/(?<!cow)boy/","girl",$myString)
    This replaces all sub-strings "boy" with "girl", provided these sub-strings are not preceded by the sub-string "cow"

Look-behind checks (contd.)
• Program fragment:
    $myString = "Fred is a cowboy. Tom is a boy.";
    echo "myString is $myString<br>";
    $myString = preg_replace("/(?<!cow)boy/","girl",$myString);
    echo "myString is now $myString";
• Output produced
    myString is Fred is a cowboy. Tom is a boy.
    myString is now Fred is a cowboy. Tom is a girl.
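
Look-behind and look-ahead checks can be combined in one pattern. The sketch below uses a made-up sample string; it replaces the text between parentheses and leaves the parentheses themselves in place.

```php
<?php
$text = "user(alice) met user(bob)";

// (?<=\() requires "(" just before the match; (?=\)) requires ")" just after.
// Neither parenthesis is consumed, so only the text between them is replaced.
$masked = preg_replace('/(?<=\()[^)]+(?=\))/', '*', $text);

echo $masked, "\n"; // user(*) met user(*)
```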

Regexp pattern modifiers
• We have seen that a regexp is of the form /…../ where the slash characters are delimiters (and could be replaced by other non-alphanumeric printable characters)
• The terminating delimiter can be followed by a sequence of modifiers which affect the meaning of the regexp between the delimiting characters

Example pattern modifier: the caseless match modifier
• Program fragment:
    $myString = "Fred is a boy. Tom is a BOY.";
    echo "myString is $myString<br>";
    $newstring1 = preg_replace("/boy/","_",$myString);
    echo "newstring1 is $newstring1<br>";
    // The i modifier makes the match case-insensitive
    $newstring2 = preg_replace("/boy/i","_",$myString);
    echo "newstring2 is $newstring2";
• Output produced
    myString is Fred is a boy. Tom is a BOY.
    newstring1 is Fred is a _. Tom is a BOY.
    newstring2 is Fred is a _. Tom is a _.

Contrast these 2 examples
• Program fragment 1:
    <?php
    $oldstring1 = "<p>Fred is a boy.</p>";
    echo "oldstring1 is ".str_replace("<","&lt;",$oldstring1)."<br>";
    $newstring1 = preg_replace("%<p>.+</p>%","_",$oldstring1);
    echo "newstring1 is ".str_replace("<","&lt;",$newstring1);
    ?>
• Output produced
    oldstring1 is <p>Fred is a boy.</p>
    newstring1 is _

• Program fragment 2:
    <?php
    $oldstring1 = "<p>Fred
    is a boy.</p>";
    echo "oldstring1 is ".str_replace("<","&lt;",$oldstring1)."<br>";
    $newstring1 = preg_replace("%<p>.+</p>%","_",$oldstring1);
    echo "newstring1 is ".str_replace("<","&lt;",$newstring1);
    ?>
• Output produced (the browser displays the line break as a space)
    oldstring1 is <p>Fred is a boy.</p>
    newstring1 is <p>Fred is a boy.</p>
• Why no replacement? Why no match?
• Answer:
    the target string is a multi-line string
    the . meta-character does not match newline characters

The dot-all modifier
• Program fragment:
    <?php
    $oldstring1 = "<p>Fred
    is a boy.</p>";
    echo "oldstring1 is ".str_replace("<","&lt;",$oldstring1)."<br>";
    $newstring1 = preg_replace("%<p>.+</p>%s","_",$oldstring1);
    echo "newstring1 is ".str_replace("<","&lt;",$newstring1);
    ?>
• Output produced
    oldstring1 is <p>Fred is a boy.</p>
    newstring1 is _
• The dot-all modifier s says that the dot meta-character should match all characters, including newlines

The dot-all modifier again
• Program fragment 2:
    <?php
    $oldstring1 = "<p>Fred \n is a boy.</p>";
    echo "oldstring1 is ".str_replace("<","&lt;",$oldstring1)."<br>";
    $newstring1 = preg_replace("%<p>.+</p>%s","_",$oldstring1);
    echo "newstring1 is ".str_replace("<","&lt;",$newstring1);
    ?>
• Output produced
    oldstring1 is <p>Fred is a boy.</p>
    newstring1 is _
• \n is a newline, but the dot-all modifier says that the dot meta-character should match all characters, including newlines
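
A typical use of the dot-all modifier is removing a block that spans several lines. The following sketch uses an invented HTML fragment containing a two-line comment; it shows the same pattern failing without s and succeeding with it.

```php
<?php
$html = "<p>Fred</p><!-- first line\nsecond line --><p>Ann</p>";

// Without s, the dot stops at the newline, so the comment is never matched.
$withoutS = preg_replace('/<!--.*?-->/', '', $html);

// With s, the dot crosses the newline and the whole comment is removed.
$withS = preg_replace('/<!--.*?-->/s', '', $html);

echo $withS, "\n"; // <p>Fred</p><p>Ann</p>
```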

The Ungreedy modifier
• The U modifier makes the quantifiers in a pattern ungreedy by default
• This is very similar to the use of the ? character to stop quantifiers being greedy

Example usage of the ungreedy modifier
• Program fragment:
    $myString = "<p>Fred is a boy.</p><p>Ann is a girl.</p>";
    echo "myString is ".str_replace("<","&lt;",$myString)."<br>";
    $myString = preg_replace("%<p>(.+)</p>%U","$1",$myString);
    echo "myString is now ".str_replace("<","&lt;",$myString);
• Output produced
    myString is <p>Fred is a boy.</p><p>Ann is a girl.</p>
    myString is now Fred is a boy.Ann is a girl.
• The + meta-character is normally greedy but the U modifier has made it ungreedy

2nd Example usage of U modifier, part 1
• Program fragment 1:
    $myString = "<p>Fred is a boy.</p><p>Ann is a girl.</p><hr>";
    echo "myString is ".str_replace("<","&lt;",$myString)."<br>";
    $myString = preg_replace("%(<p>.+?</p>)(.+)%","$1x$2",$myString);
    echo "myString is now ".str_replace("<","&lt;",$myString);
• Output produced
    myString is <p>Fred is a boy.</p><p>Ann is a girl.</p><hr>
    myString is now <p>Fred is a boy.</p>x<p>Ann is a girl.</p><hr>
• Here, the first + is frugal but the second is greedy

2nd Example usage of U modifier, part 2
• Program fragment 2:
    $myString = "<p>Fred is a boy.</p><p>Ann is a girl.</p><hr>";
    echo "myString is ".str_replace("<","&lt;",$myString)."<br>";
    $myString = preg_replace("%(<p>.+?</p>)(.+)%U","$1x$2",$myString);
    echo "myString is now ".str_replace("<","&lt;",$myString);
• Output produced
    myString is <p>Fred is a boy.</p><p>Ann is a girl.</p><hr>
    myString is now <p>Fred is a boy.</p><p>Ann is a girl.</p>x<hr>
• Here, the first + is greedy but the second is frugal
• That is, the U modifier reverses the meaning of the presence or absence of the ? character after a quantifier
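
The equivalence between the U modifier and the ? suffix can be checked side by side. The sketch below uses an invented sample string; it writes one frugal pattern both ways and includes the greedy form for contrast.

```php
<?php
$subject = "<b>one</b> and <b>two</b>";

// Frugal by modifier: U makes the plain + ungreedy.
$withModifier = preg_replace('%<b>(.+)</b>%U', '[$1]', $subject);

// Frugal by suffix: +? is ungreedy without any modifier.
$withQuestionMark = preg_replace('%<b>(.+?)</b>%', '[$1]', $subject);

// Greedy for comparison: the match runs on to the last </b>.
$greedy = preg_replace('%<b>(.+)</b>%', '[$1]', $subject);

echo $withModifier, "\n";     // [one] and [two]
echo $withQuestionMark, "\n"; // [one] and [two]
echo $greedy, "\n";           // [one</b> and <b>two]
```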

More on multi-line strings
• By default, the subject string is regarded as consisting of a single "line" of characters (even if it actually contains several newlines)
• The "start of line" metacharacter (^) matches only at the start of the string
• The "end of line" metacharacter ($) matches only at the end of the string (or just before a newline that ends the string)
• Suppose, however, that we want these metacharacters to also match at newlines inside the string

Multi-line strings (contd.)
• Program fragment:
    $myString = "Fred is a boy.\nAnn is a girl.";
    echo "myString is $myString<br>";
    $myString = preg_replace("%^(.+)$%","x$1y",$myString);
    echo "myString is now $myString";
• Output produced (the browser displays the newline as a space)
    myString is Fred is a boy. Ann is a girl.
    myString is now Fred is a boy. Ann is a girl.
• On the one hand, the ^ and $ characters did not match at the newline in the middle of the string
• On the other hand, the dot did not match the newline
• So the pattern could not match anywhere and the string is unchanged

The multi-line modifier
• Program fragment:
    $myString = "Fred is a boy.\nAnn is a girl.";
    echo "myString is $myString<br>";
    $myString = preg_replace("%^(.+)$%m","x$1y",$myString);
    echo "myString is now $myString";
• Output produced (the browser displays the newline as a space)
    myString is Fred is a boy. Ann is a girl.
    myString is now xFred is a boy.y xAnn is a girl.y
• The multi-line modifier m has made the ^ and $ characters match just after and just before the newline in the middle of the string
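
Because ^ matches at the start of every line under the m modifier, a replacement can be applied once per line. The following sketch is illustrative, with "> " chosen arbitrarily as the prefix; it contrasts the result with and without the modifier.

```php
<?php
$text = "Fred is a boy.\nAnn is a girl.";

// With m, ^ matches at the start of the string and just after each newline.
$quoted = preg_replace('/^/m', '> ', $text);

// Without m, ^ matches only once, at the very start of the string.
$firstOnly = preg_replace('/^/', '> ', $text);

echo $quoted, "\n";
echo $firstOnly, "\n";
```

Here $quoted holds "> Fred is a boy.\n> Ann is a girl." while $firstOnly prefixes only the first line.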

CS4408 got here on 21 oct 2005

More PHP functions for using regular expressions • To be continued in next slide set
