
Regular expressions (contd.) -- remembering subpattern matches


ziv

Presentation Transcript


  1. Regular expressions (contd.) -- remembering subpattern matches • When a <pattern> is being matched against a target string, substrings that match sub-patterns can be remembered and re-used later in the same pattern • Sub-patterns whose matching substrings are to be remembered are enclosed in parentheses • The sub-patterns are implicitly numbered from 1, in the order of their opening parentheses, and their matching substrings can be re-used later in the pattern through back-references such as \1, \2 or \3 • However, inside a double-quoted PHP string a backslash must itself be escaped, so we type \\1, \\2 or \\3 in our regular expressions
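As a minimal sketch of the escaping rule above (the sample word "bookkeeper" and the variable names are illustrative assumptions, not part of the slides), the following compares the double-quoted and single-quoted ways of writing the same back-reference and then uses it to find doubled letters:

```php
<?php
// Double-quoted PHP string: the backslash is doubled so that PCRE sees \1.
$doubleQuoted = "/([a-z])\\1/";
// Single-quoted PHP string: '\1' already reaches PCRE as \1.
$singleQuoted = '/([a-z])\1/';

// Both literals produce the same pattern text.
var_dump($doubleQuoted === $singleQuoted);   // bool(true)

// Find every doubled lower-case letter in a sample word.
preg_match_all($singleQuoted, "bookkeeper", $matches);
print_r($matches[0]);   // oo, kk, ee
```

Writing patterns in single quotes is one way to avoid the double backslash; the slides use double quotes throughout, so they need \\1.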

  2. Using back-references (contd.) • PHP code <?php $myString1 = "klmAklmAAklmABklmBklmBBklm"; echo "myString1 is $myString1 <br>"; $myString1 = preg_replace("/([A-Z])\\1/","_",$myString1); echo "myString1 is now $myString1 "; ?> • Resultant output is myString1 is klmAklmAAklmABklmBklmBBklm myString1 is now klmAklm_klmABklmBklm_klm

  3. Using back-references (contd.) • PHP code <?php $myString = "klmAklmAAklmABklmBklmBBklm"; echo "myString is $myString <br>"; $myString = preg_replace("/([A-Z])\\1/","_",$myString); echo "myString is now $myString "; ?> • Resultant output is myString is klmAklmAAklmABklmBklmBBklm myString is now klmAklm_klmABklmBklm_klm

  4. Regular expressions (contd.) -- using subpattern matches in replacements • We saw that, within a regular expression, substrings that matched sub-patterns can be re-used later in the pattern by preceding the appropriate integer with a pair of backslashes, \\ • Within a <replacement>, substrings that matched sub-patterns in the regular expression can be used by preceding the appropriate integer with a dollar sign, $
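A small sketch of $-references in a replacement string may help here. The "surname, forename" input and the variable names are hypothetical, chosen only to show two remembered substrings being re-used in a different order:

```php
<?php
// Hypothetical input in "surname, forename" form.
$name = "Smith, John";
// $1 is the surname and $2 the forename; the replacement swaps them.
$swapped = preg_replace('/(\w+), (\w+)/', '$2 $1', $name);
echo $swapped;   // John Smith
```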

  5. Using sub-pattern matches in replacements (contd.) • PHP code <?php $myString = "<p>This is paragraph 1.</p><p>This is paragraph 2.</p>"; echo "myString is ".str_replace("<","&lt;",$myString)." <br>"; $myString = preg_replace("/<(\w+)>(.+?)<\/\\1>/","$2",$myString); echo "myString is now ".str_replace("<","&lt;",$myString); ?> • Resultant output is myString is <p>This is paragraph 1.</p><p>This is paragraph 2.</p> myString is now This is paragraph 1.This is paragraph 2.

  6. A reminder about greedy/frugal quantifiers • What would have happened if we had used a greedy quantifier on the previous slide? • PHP code <?php $myString = "<p>This is paragraph 1.</p><p>This is paragraph 2.</p>"; echo "myString is ".str_replace("<","&lt;",$myString)." <br>"; $myString = preg_replace("/<(\w+)>(.+)<\/\\1>/","$2",$myString); echo "myString is now ".str_replace("<","&lt;",$myString); ?> • Resultant output is myString is <p>This is paragraph 1.</p><p>This is paragraph 2.</p> myString is now This is paragraph 1.</p><p>This is paragraph 2.

  7. Choice of regexp delimiters • Up to now, we have used the forward slash character to mark the start and end of regular expressions • In fact, we can use any non-alphanumeric character for the purpose, other than a backslash or whitespace • For example, instead of writing /ab+c/ • we could write %ab+c% • This is useful when we wish to use the / character in our regular expression • See next slide

  8. Using different regexp delimiters (contd.) • In the regular expression below, we do not have to escape the / character at the start of the close tag, because we are not using the / character as the regexp delimiter: <?php $myString = "<p>This is paragraph 1.</p><p>This is paragraph 2.</p>"; echo "myString is ".str_replace("<","&lt;",$myString)." <br>"; $myString = preg_replace("%<(\w+)>(.+?)</\\1>%","$2",$myString); echo "myString is now ".str_replace("<","&lt;",$myString); ?> • Resultant output is myString is <p>This is paragraph 1.</p><p>This is paragraph 2.</p> myString is now This is paragraph 1.This is paragraph 2.
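The same point can be seen outside HTML. As a sketch (the sample path and the choice of # as delimiter are assumptions made for illustration), matching a file-system path needs three escapes with / as the delimiter and none with #:

```php
<?php
$path = "/usr/local/bin/php";   // hypothetical sample path
// With / as the delimiter, every / inside the pattern must be escaped.
$withSlashes = preg_match('/^\/usr\/local\//', $path);
// With # as the delimiter, the slashes can be written as they are.
$withHashes = preg_match('#^/usr/local/#', $path);
var_dump($withSlashes === 1 && $withHashes === 1);   // bool(true)
```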

  9. Using regexps to process nested HTML • PHP code: <?php $myString = "<ol><li>fred</li><li>tom</li></ol>"; echo "myString is ".str_replace("<","&lt;",$myString)." <br>"; $myString = preg_replace("%<(\w+)>(.+)</\\1>%","$2",$myString); echo "myString is now ".str_replace("<","&lt;",$myString); ?> • Resultant output is myString is <ol><li>fred</li><li>tom</li></ol> myString is now <li>fred</li><li>tom</li> • Suppose we wanted to remove all pairs of HTML tags. That is, suppose we wanted myString is <ol><li>fred</li><li>tom</li></ol> myString is now fredtom • How would we achieve that?

  10. Using regexps to process nested HTML (contd.) • Would a frugal quantifier do the trick? PHP code: <?php $myString = "<ol><li>fred</li><li>tom</li></ol>"; echo "myString is ".str_replace("<","&lt;",$myString)." <br>"; $myString = preg_replace("%<(\w+)>(.+?)</\\1>%","$2",$myString); echo "myString is now ".str_replace("<","&lt;",$myString); ?> • No. The resultant output is still myString is <ol><li>fred</li><li>tom</li></ol> myString is now <li>fred</li><li>tom</li> • The reason is that, while preg_replace does replace all matching substrings in the target string, it does not perform replacement operations on the replacement string • The value <li>fred</li><li>tom</li> above is the result of a replacement operation, so it is not modified • However, suppose we wanted to remove all pairs, no matter how deep the nesting. How would we do that?

  11. Using regexps to process nested HTML (contd.) • We must use repetition to attack the nested instances <?php $myString = "<ol><li>fred</li><li>tom</li></ol>"; echo "myString is ".str_replace("<","&lt;",$myString)." <br>"; $newString = preg_replace("%<(\w+)>(.+?)</\\1>%","$2",$myString); while ($newString != $myString) { $myString = $newString; echo "myString is now ". str_replace("<","&lt;",$myString)."<br>"; $newString = preg_replace("%<(\w+)>(.+?)</\\1>%","$2",$myString); } ?> • The resultant output is now myString is <ol><li>fred</li><li>tom</li></ol> myString is now <li>fred</li><li>tom</li> myString is now fredtom

  12. Using regexps to process nested HTML (contd.) • Of course, we would not want to run words together like we did on the last slide, so we would use spaces in the replacement string <?php $myString = "<ol><li>fred</li><li>tom</li></ol>"; echo "myString is ".str_replace("<","&lt;",$myString)." <br>"; $newString = preg_replace("%<(\w+)>(.+?)</\\1>%","$2",$myString); while ($newString != $myString) { $myString = $newString; echo "myString is now ". str_replace("<","&lt;",$myString)."<br>"; $newString = preg_replace("%<(\w+)>(.+?)</\\1>%"," $2 ",$myString); } ?> • The resultant output (as rendered by a browser, which collapses runs of spaces) is now myString is <ol><li>fred</li><li>tom</li></ol> myString is now <li>fred</li><li>tom</li> myString is now fred tom
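The repetition idea can also be packaged as a function. The sketch below is one possible arrangement, not the slides' own code: the name stripTagPairs is hypothetical, and it relies on the optional fifth argument of preg_replace, which receives the number of replacements made, to decide when to stop.

```php
<?php
// Hypothetical helper: removes matched <tag>...</tag> pairs at every depth.
function stripTagPairs(string $html): string
{
    do {
        // $count receives the number of replacements made in this pass.
        $html = preg_replace('%<(\w+)>(.+?)</\1>%', ' $2 ', $html, -1, $count);
    } while ($count > 0);
    // Collapse the padding spaces added by the replacement string.
    return trim(preg_replace('/\s+/', ' ', $html));
}

echo stripTagPairs("<ol><li>fred</li><li>tom</li></ol>");   // fred tom
```

Testing the replacement count avoids comparing the old and new strings on each pass. Like the slide code, this only handles tags without attributes and elements with non-empty content.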

  13. More on regular expressions -- checking for context • All the preg_replace operations we have written so far have consumed all the characters that matched the regular expression • There was no notion of examining the context surrounding the consumed characters • Any characters that were matched were consumed • We often need some way of matching characters without removing them from the target string • There are four meta-expressions for doing this, two for forward context and two for backward context

  14. Look-ahead context checks (?=regexp) This is a positive lookahead context check It matches characters in the target string against the pattern specified by the embedded regular expression regexp without consuming them from the target string • Example preg_replace("/\w+(?= cat)/","_",$myString) This replaces with an underscore any word that is followed by a space and the word cat, without removing the space or the word cat from the target string • An example application is on the next slide

  15. Look-ahead checks (contd.) • Program fragment: $myString = "tabby is a big cat. fido is a fat dog."; echo "myString is $myString <br>"; $myString = preg_replace("/\w+(?= cat)/","_",$myString); echo "myString is now $myString"; • Output produced myString is tabby is a big cat. fido is a fat dog. myString is now tabby is a _ cat. fido is a fat dog.
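To see what "without consuming" means, it can help to look at the matched text itself rather than the result of a replacement. This sketch (the variable names are illustrative) runs the same pattern with and without the lookahead against the slide's sentence:

```php
<?php
$text = "tabby is a big cat. fido is a fat dog.";
// Without a lookahead, " cat" is part of the match and would be replaced too.
preg_match('/\w+ cat/', $text, $consuming);
// With a positive lookahead, " cat" is checked but left out of the match.
preg_match('/\w+(?= cat)/', $text, $lookahead);
echo $consuming[0], "\n";   // big cat
echo $lookahead[0], "\n";   // big
```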

  16. Look-ahead checks (contd.) (?!regexp) This is a negative lookahead context check It ensures that the following characters in the target string do not match the pattern specified by the embedded regular expression regexp • Example preg_replace("/cow(?!boy)/","_",$myString) This replaces all sub-strings "cow" with "_", provided these sub-strings are not followed by the sub-string "boy"

  17. Look-ahead checks (contd.) • Program fragment: $myString = "Fred is a cowboy. Dolly is a cow."; echo "myString is $myString <br>"; $myString = preg_replace("/cow(?!boy)/","_",$myString); echo "myString is now $myString"; • Output produced myString is Fred is a cowboy. Dolly is a cow. myString is now Fred is a cowboy. Dolly is a _.

  18. Look-behind context checks (?<=regexp) This is a positive look-behind context check It ensures that preceding characters in the target string match the pattern specified by the embedded regular expression regexp • Example preg_replace("/(?<=cow)boy/","girl",$myString) This replaces all sub-strings "boy" with "girl", provided these sub-strings are preceded by the sub-string "cow", but the sub-string "cow" is not consumed.

  19. Look-behind checks (contd.) • Program fragment: $myString = "Fred is a cowboy. Tom is a boy."; echo "myString is $myString <br>"; $myString = preg_replace("/(?<=cow)boy/","girl",$myString); echo "myString is now $myString"; • Output produced myString is Fred is a cowboy. Tom is a boy. myString is now Fred is a cowgirl. Tom is a boy.

  20. Look-behind checks (contd.) (?<!regexp) This is a negative look-behind context check It ensures that preceding characters in the target string do not match the pattern specified by the embedded regular expression regexp • Example preg_replace("/(?<!cow)boy/","girl",$myString) This replaces all sub-strings "boy" with "girl", provided these sub-strings are not preceded by the sub-string "cow"

  21. Look-behind checks (contd.) • Program fragment: $myString = "Fred is a cowboy. Tom is a boy."; echo "myString is $myString <br>"; $myString = preg_replace("/(?<!cow)boy/","girl",$myString); echo "myString is now $myString"; • Output produced myString is Fred is a cowboy. Tom is a boy. myString is now Fred is a cowboy. Tom is a girl.
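The four context checks can be compared side by side on a single target string. The string "cowboy cow boy" below is an assumption chosen so that each check selects a different occurrence:

```php
<?php
$text = "cowboy cow boy";
// Positive lookahead: "cow" only when "boy" follows.
echo preg_replace('/cow(?=boy)/', '_', $text), "\n";    // _boy cow boy
// Negative lookahead: "cow" only when "boy" does not follow.
echo preg_replace('/cow(?!boy)/', '_', $text), "\n";    // cowboy _ boy
// Positive look-behind: "boy" only when "cow" precedes.
echo preg_replace('/(?<=cow)boy/', '_', $text), "\n";   // cow_ cow boy
// Negative look-behind: "boy" only when "cow" does not precede.
echo preg_replace('/(?<!cow)boy/', '_', $text), "\n";   // cowboy cow _
```

In every case only the text outside the parenthesised check is replaced; the context itself stays in the string.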

  22. Regexp pattern modifiers • We have seen that a regexp is of the form /…../ where the slash characters are delimiters (and could be replaced by other non-alphanumeric printable characters) • The terminating delimiter can be followed by a sequence of modifiers which affect the meaning of the regexp between the delimiting characters

  23. Example pattern modifier: the caseless match modifier • Program fragment: $myString = "Fred is a boy. Tom is a BOY."; echo "myString is $myString <br>"; $newstring1 = preg_replace("/boy/","_",$myString); echo "newstring1 is $newstring1 <br>"; $newstring2 = preg_replace("/boy/i","_",$myString); echo "newstring2 is $newstring2"; • Output produced myString is Fred is a boy. Tom is a BOY. newstring1 is Fred is a _. Tom is a BOY. newstring2 is Fred is a _. Tom is a _.
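The effect of the i modifier can also be checked by counting matches instead of replacing them. In this sketch (variable names are illustrative) preg_match_all is called with only a pattern and a subject, in which case it simply returns the number of matches:

```php
<?php
$text = "Fred is a boy. Tom is a BOY.";
$caseSensitive = preg_match_all('/boy/', $text);    // counts "boy" only
$caseless      = preg_match_all('/boy/i', $text);   // counts "boy" and "BOY"
echo "$caseSensitive $caseless";   // 1 2
```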

  24. Contrast these 2 examples • Program fragment 1: <?php $oldstring1 = "<p>Fred is a boy.</p>"; echo "oldstring1 is ".str_replace("<","&lt;",$oldstring1)." <br>"; $newstring1 = preg_replace("%<p>.+</p>%","_",$oldstring1); echo "newstring1 is ".str_replace("<","&lt;",$newstring1); ?> • Output produced oldstring1 is <p>Fred is a boy.</p> newstring1 is _

  25. Program fragment 2 (the same code, except that the string literal assigned to $oldstring1 is broken across lines in the source and so contains a newline): <?php $oldstring1 = "<p>Fred is a boy.</p>"; echo "oldstring1 is ".str_replace("<","&lt;",$oldstring1)." <br>"; $newstring1 = preg_replace("%<p>.+</p>%","_",$oldstring1); echo "newstring1 is ".str_replace("<","&lt;",$newstring1); ?> • Output produced oldstring1 is <p>Fred is a boy.</p> newstring1 is <p>Fred is a boy.</p> • Why no replacement? Why no match? • Answer: • the target string is a multi-line string • the . meta-character does not match newline characters

  26. The dot-all modifier • Program fragment (the string literal again contains a newline): <?php $oldstring1 = "<p>Fred is a boy.</p>"; echo "oldstring1 is ".str_replace("<","&lt;",$oldstring1)." <br>"; $newstring1 = preg_replace("%<p>.+</p>%s","_",$oldstring1); echo "newstring1 is ".str_replace("<","&lt;",$newstring1); ?> • Output produced oldstring1 is <p>Fred is a boy.</p> newstring1 is _ • The dot-all modifier, s, says that the dot meta-character should match all characters, including newlines

  27. The dot-all modifier again • Program fragment 2: <?php $oldstring1 = "<p>Fred \n is a boy.</p>"; echo "oldstring1 is ".str_replace("<","&lt;",$oldstring1)." <br>"; $newstring1 = preg_replace("%<p>.+</p>%s","_",$oldstring1); echo "newstring1 is ".str_replace("<","&lt;",$newstring1); ?> • Output produced oldstring1 is <p>Fred is a boy.</p> newstring1 is _ • \n is a newline but the dot-all modifier says that the dot meta-character should match all characters, including newlines
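A compact way to confirm the behaviour is to run the same pattern with and without the s modifier and compare the return values of preg_match (1 for a match, 0 for none). This is a sketch using the slide's target string:

```php
<?php
$text = "<p>Fred \n is a boy.</p>";
$withoutS = preg_match('%<p>.+</p>%', $text);    // . stops at the newline
$withS    = preg_match('%<p>.+</p>%s', $text);   // . also matches "\n"
var_dump($withoutS);   // int(0)
var_dump($withS);      // int(1)
```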

  28. The Ungreedy modifier • The U modifier has much the same effect as appending the ? character to the quantifiers in a pattern to stop them being greedy

  29. Example usage of the ungreedy modifier • Program fragment: $myString = "<p>Fred is a boy.</p><p>Ann is a girl.</p>"; echo "myString is ".str_replace("<","&lt;",$myString)." <br>"; $myString = preg_replace("%<p>(.+)</p>%U","$1",$myString); echo "myString is now ".str_replace("<","&lt;",$myString); • Output produced myString is <p>Fred is a boy.</p><p>Ann is a girl.</p> myString is now Fred is a boy.Ann is a girl. • The + meta-character is normally greedy but the U modifier has made it ungreedy

  30. 2nd Example usage of U modifier, part 1 • Program fragment 1: $myString = "<p>Fred is a boy.</p><p>Ann is a girl.</p><hr>"; echo "myString is $myString <br>"; $myString = preg_replace("%<p>(.+?)</p>(.+)%","$1x$2",$myString); echo "myString is now $myString"; • Output produced myString is <p>Fred is a boy.</p><p>Ann is a girl.</p><hr> myString is now Fred is a boy.x<p>Ann is a girl.</p><hr> • Here, the first + is frugal but the second is greedy

  31. 2nd Example usage of U modifier, part 2 • Program fragment 2: $myString = "<p>Fred is a boy.</p><p>Ann is a girl.</p><hr>"; echo "myString is $myString <br>"; $myString = preg_replace("%<p>(.+?)</p>(.+)%U","$1x$2",$myString); echo "myString is now $myString"; • Output produced myString is <p>Fred is a boy.</p><p>Ann is a girl.</p><hr> myString is now Fred is a boy.</p><p>Ann is a girl.x<hr> • Here, the first + is greedy but the second is frugal • That is, the U modifier reverses the meaning of the presence or absence of the ? character after a quantifier
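The reversal can be checked directly: a pattern written with +? and no modifier should behave like the same pattern written with + and the U modifier. The sketch below makes that comparison; the square brackets in the replacement are an assumption added only to make the captured text visible.

```php
<?php
$text = "<p>Fred is a boy.</p><p>Ann is a girl.</p>";
// Frugal quantifier written explicitly with ?.
$lazyMark = preg_replace('%<p>(.+?)</p>%', '[$1]', $text);
// Plain + made frugal by the U modifier.
$uFlag = preg_replace('%<p>(.+)</p>%U', '[$1]', $text);
echo $lazyMark, "\n";             // [Fred is a boy.][Ann is a girl.]
var_dump($lazyMark === $uFlag);   // bool(true)
```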

  32. More on multi-line strings • By default, the subject string is regarded as consisting of a single "line" of characters (even if it actually contains several newlines). • The "start of line" metacharacter (^) matches only at the start of the string • The "end of line" metacharacter ($) matches only at the end of the string • Suppose, however, that we want these metacharacters to also match at newlines inside the string

  33. Multi-line strings (contd.) • Program fragment: $myString = "<p>Fred is a boy.</p>\n<p>Ann is a girl.</p>"; echo "myString is $myString <br>"; $myString = preg_replace("%^<p>(.+)</p>$%","x$1y",$myString); echo "myString is now $myString"; • Output produced myString is <p>Fred is a boy.</p><p>Ann is a girl.</p> myString is now <p>Fred is a boy.</p><p>Ann is a girl.</p> • On the one hand, the ^ and $ characters did not match at the newline in the middle of the string • On the other hand, the dot did not match the newline

  34. The multi-line modifier • Program fragment: $myString = "<p>Fred is a boy.</p>\n<p>Ann is a girl.</p>"; echo "myString is $myString <br>"; $myString = preg_replace("%^<p>(.+)</p>$%m","x$1y",$myString); echo "myString is now $myString"; • Output produced myString is <p>Fred is a boy.</p><p>Ann is a girl.</p> myString is now xFred is a boy.yxAnn is a girl.y • The multi-line modifier m has made ^ match immediately after, and $ immediately before, the newline in the middle of the string, so each line is matched separately
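As a further sketch of the m modifier (the three-line sample string is an assumption), collecting the first word of each line shows where ^ is allowed to match:

```php
<?php
$text = "first line\nsecond line\nthird line";
// Without m, ^ matches only at the very start of the subject.
preg_match_all('/^\w+/', $text, $withoutM);
// With m, ^ also matches immediately after each newline.
preg_match_all('/^\w+/m', $text, $withM);
print_r($withoutM[0]);   // first
print_r($withM[0]);      // first, second, third
```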

  35. CS4408 got here on 21 oct 2005

  36. More PHP functions for using regular expressions • To be continued in next slide set
