
Text manipulation



  1. Text manipulation • Suppose you want to build a web-page which will always contain the latest sports headlines collected from several newspaper websites • You might, for example, wish to include the Guardian’s sports headlines on your page

  2. Adding these headlines manually • You would have to access the source of the Guardian page

  3. You would then have to find the text which defines the headlines • Analyse it • And copy the relevant bits into the HTML for your own web-page

  4. Examining it, we find that the source contains one HTML table for each sport in the list of top stories • Here is the table for the tennis headlines on the page seen earlier: <table cellspacing="0"><tr> <td class="imgholder"><a HREF="/tennis/story/0,10069,1581862,00.html"><img src="http://image.guardian.co.uk/sys-images/Sport/Pix/pictures/2005/09/30/andy2.jpg" width="128" height="128" border="0" alt="Andy Murray in action during his win over Robby Ginepri" /></a></td> <td><font face="Geneva, Arial, Helvetica, sans-serif" size="2"><span class="mainlink"><a HREF="/tennis/story/0,10069,1581862,00.html">Murray magic books semi spot</a></span><br /><b>Tennis:</b> The biggest win of his career to-date saw Andy Murray stun Robby Ginepri and reach the last four at the Thailand Open. <br /> <a HREF="/tennis/story/0,10069,1580918,00.html">Tough home Davis Cup tie for GB</a><br /> <a HREF="/tennis/0,10067,495916,00.html">More tennis</a></font></td> </tr></table>

  5. Here is the text which defines the main tennis headline on the page shown earlier: <span class="mainlink"> <a HREF="/tennis/story/0,10069,1581862,00.html"> Murray magic books semi spot </a></span> <br /> <b>Tennis:</b> The biggest win of his career to-date saw Andy Murray stun Robby Ginepri and reach the last four at the Thailand Open.

  6. To get this story onto your own web-page you could then copy the relevant HTML segment into the source code for your web-page • But … • … doing this manually is very labour-intensive • We ought to automate the complete task

  7. Adding headlines automatically • To add headlines automatically, you would have to write a program which would • Download the source code for the Guardian page • Analyse this source code to extract the appropriate text • Add the relevant text to the source code for your own web-page

  8. Adding headlines automatically • Later, we will see how to download page sources from other websites • Now, we will focus on the issue of text analysis
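The "analyse and extract" step can be sketched in a few lines. This is only an illustration, not the approach the slides go on to develop: it uses preg_match, a PHP function the slides have not introduced, and the variable names ($source, $pattern, $matches) are assumptions. The page source is stood in for by a literal copy of the main-headline fragment quoted earlier.

```php
<?php
// Hypothetical stand-in for the downloaded page source: the main-headline
// fragment quoted on an earlier slide, stored as a literal string.
$source = '<span class="mainlink"><a HREF="/tennis/story/0,10069,1581862,00.html">'
        . 'Murray magic books semi spot</a></span>';

// Group 1 captures the link target, group 2 captures the headline text.
$pattern = '/<span class="mainlink">\s*<a HREF="([^"]+)">\s*(.*?)\s*<\/a>/';

$url = null;
$headline = null;
if (preg_match($pattern, $source, $matches)) {
    $url = $matches[1];
    $headline = $matches[2];
    echo "Headline: $headline <br />";
    echo "Link: $url";
}
```

The pattern is tied to the exact markup shown on the slide, so it would need adjusting if the newspaper changed its page layout.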

  9. Regular Expressions • Regular expression technology provides a convenient way of searching strings for patterns of interest

  10. Regular expressions (contd.) • Example regular expression: /ab*c/ this searches the target string for substring(s) that comprise “an a followed by zero or more instances of b followed by a c” • It will match any of the following substrings: ac abc abbc abbbc …
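A quick way to check which strings contain a match for /ab*c/ is sketched below. It assumes PHP's preg_match function, which the slides do not cover; the last two candidate strings ("ab" and "adc") are counter-examples added here for illustration.

```php
<?php
// The first four candidates come from the slide; "ab" and "adc" are
// extra counter-examples added for illustration.
$candidates = array("ac", "abc", "abbc", "abbbc", "ab", "adc");

$results = array();
foreach ($candidates as $candidate) {
    // preg_match returns 1 when the pattern is found and 0 when it is not.
    $results[$candidate] = preg_match("/ab*c/", $candidate);
    echo "$candidate: " . ($results[$candidate] ? "match" : "no match") . " <br />";
}
```

The string "ab" fails because the final c is missing, and "adc" fails because only b characters may appear between the a and the c.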

  11. Using regular expressions in PHP • Regular expressions are supported in several languages, including PHP • PHP provides a group of pre-defined functions for using them • For now, we will focus on just one of these, the preg_replace function

  12. The preg_replace function • Format of call: preg_replace (regexp, replacement, subject [, int limit]) • This function returns the result of replacing substrings in subject which match regexp with replacement • The number of matching substrings which are replaced is controlled by the optional parameter limit • An example application is on the next slide

  13. Regular expressions (contd.) • PHP code <?php $myString = "xyzacklmabbcpqrabbbbbcstu"; echo "myString is $myString <br />"; $myString = preg_replace("/ab*c/","_",$myString); echo "myString is now $myString"; ?> • Resultant output is myString is xyzacklmabbcpqrabbbbbcstu myString is now xyz_klm_pqr_stu

  14. Using the limit parameter in preg_replace • PHP code <?php $myString = "xyzacklmabbcpqrabbbbbcstu"; echo "myString is $myString <br />"; $myString = preg_replace("/ab*c/","_",$myString,1); echo "myString is now $myString"; ?> • Resultant output is myString is xyzacklmabbcpqrabbbbbcstu myString is now xyz_klmabbcpqrabbbbbcstu

  15. Meta-characters • We have seen that certain characters have a special meaning in regular expressions: • the example on the last few slides used the * character which means “0 or more instances of the preceding character or pattern” • These are called meta-characters • Other meta-characters are listed on the next slide

  16. The meta-characters include: • the * character which means “0 or more instances of preceding” • the + character, which means “1 or more instances of preceding” • the ? character, which means “0 or 1 instances of preceding” • the { and } characters delimit an expression specifying a range of acceptable occurrences of the preceding character • Examples: {m} means exactly m occurrences of preceding character/pattern {m,} means at least m occurrences of preceding char/pattern {m,n} means at least m, but not more than n, occurrences of preceding char/pattern • Thus, {0,} is equivalent to * {1,} is equivalent to + {0,1} is equivalent to ?
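The equivalences between the brace forms and *, + and ? can be checked with a short sketch. The subject string and variable names are assumptions chosen for illustration.

```php
<?php
// Hypothetical test string: a and c separated by zero to three b characters.
$subject = "ac abc abbc abbbc";

// Each pair should give identical results if the equivalences hold.
$star  = preg_replace("/ab*c/",     "_", $subject);
$starB = preg_replace("/ab{0,}c/",  "_", $subject);
$plus  = preg_replace("/ab+c/",     "_", $subject);
$plusB = preg_replace("/ab{1,}c/",  "_", $subject);
$opt   = preg_replace("/ab?c/",     "_", $subject);
$optB  = preg_replace("/ab{0,1}c/", "_", $subject);

// {2} demands exactly two b characters, so only abbc is replaced.
$two   = preg_replace("/ab{2}c/",   "_", $subject);

echo "$star <br />$plus <br />$opt <br />$two";
```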

  17. Regular expressions (contd.) • Further meta-characters are: • the ^ character, which matches the start of a string • the $ character, which matches the end of a string • the . character, which matches any character except a newline character • the [ and ] characters delimit an equivalence class of characters, any of which can match one character in the target string • the ( and ) characters delimit a group of sub-patterns • the | character separates alternative patterns

  18. Regular expressions (contd.) • Example expression: /^a.*d$/ this matches the entire target string provided the target string starts with an a, followed by zero or more non-newline characters, and ends with a d • An example application is on the next slide

  19. Example application • PHP code <?php $myString1 = "abcdefghijklmnopqrstuvd"; echo "myString1 is $myString1 <br />"; $myString1 = preg_replace("/^a.*d$/","_",$myString1); echo "myString1 is now $myString1 <br />"; $myString2 = "xabcdefghijklmnopqrstuvd"; echo "myString2 is $myString2 <br />"; $myString2 = preg_replace("/^a.*d$/","_",$myString2); echo "myString2 is now $myString2"; ?> • Resultant output is myString1 is abcdefghijklmnopqrstuvd myString1 is now _ myString2 is xabcdefghijklmnopqrstuvd myString2 is now xabcdefghijklmnopqrstuvd

  20. Regular expressions (contd.) • Example expression: /^a.{2,5}d$/ this matches the entire target string, provided the target string starts with an a, followed by between two and five non-newline characters, and ends with a d • An example application is on the next slide

  21. Regular expressions (contd.) • PHP code <?php $myString1 = "adabbbbccccaaaabbbbccccd"; echo "myString1 is $myString1 <br />"; $myString1 = preg_replace("/^a.{2,5}d$/","_",$myString1); echo "myString1 is now $myString1 <br />"; $myString2 = "afghd"; echo "myString2 is $myString2 <br />"; $myString2 = preg_replace("/^a.{2,5}d$/","_",$myString2); echo "myString2 is now $myString2"; ?> • Resultant output is myString1 is adabbbbccccaaaabbbbccccd myString1 is now adabbbbccccaaaabbbbccccd myString2 is afghd myString2 is now _

  22. Regular expressions (contd.) • Example regular expression: /(abc){2,5}d/ this matches sub-string(s) in the target that comprise “between 2 and 5 repeats of the pattern abc followed by a d” • An example application is on the next slide

  23. Regular expressions (contd.) • PHP code <?php $myString = "klmabcabcabcdpqrabcdklmabcabcabcabcdxyz"; echo "myString is $myString <br />"; $myString = preg_replace("/(abc){2,5}d/","_",$myString); echo "myString is now $myString"; ?> • Resultant output is myString is klmabcabcabcdpqrabcdklmabcabcabcabcdxyz myString is now klm_pqrabcdklm_xyz

  24. Regular expressions (contd.) • Example regular expression: /(foo|bar)/ this matches sub-strings foo or bar • An example application is on the next slide

  25. Regular expressions (contd.) • PHP code <?php $myString = "abcfoodefbarghi"; echo "myString is $myString <br />"; $myString = preg_replace("/(foo|bar)/","_",$myString); echo "myString is now $myString"; ?> • Resultant output is myString is abcfoodefbarghi myString is now abc_def_ghi

  26. Regular expressions (contd.) • Although some characters have special meanings in regular expressions, we may, sometimes, just want to use them to match themselves in the target string • We do this by escaping them in the regular expression, by preceding them with a backslash \ • Example regular expression: /^a\^+.*d$/ this matches the entire target string, provided the target string starts with an a, followed by one or more caret characters, followed by zero or more non-newline characters, and ends with a d • An example application is on the next slide

  27. Example application • PHP code <?php $myString1 = "adabbbbcabbcabced"; echo "myString1 is $myString1 <br />"; $myString1 = preg_replace("/^a\^+.*d$/","_",$myString1); echo "myString1 is now $myString1 <br />"; $myString2 = "a^^^abbbbcabbcabceed"; echo "myString2 is $myString2 <br />"; $myString2 = preg_replace("/^a\^+.*d$/","_",$myString2); echo "myString2 is now $myString2"; ?> • Resultant output is myString1 is adabbbbcabbcabced myString1 is now adabbbbcabbcabced myString2 is a^^^abbbbcabbcabceed myString2 is now _

  28. Regular expressions (contd.) • As mentioned earlier, the [ and ] characters have a special meaning in regular expressions • they delimit an equivalence class of characters, any one of which may be used to match one character in the target string • Example regular expression: /a[KLM]b/ matches any substring comprising “the letter a followed by one of the three letters KLM, followed by the letter b”

  29. Regular expressions (contd.) • The ^ character has a special meaning when used as the first character between [ and ] characters; this meaning is different from its special meaning when used outside the [ and ] characters • when used as the first character between the [ and ] characters, the ^ character specifies the complement of the equivalence class that would have been specified if it were absent • Example regular expression: /a[^KLM]b/ matches any substring comprising “the letter a followed by any single character that is not one of KLM, followed by the letter b”
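The difference between an equivalence class and its complement can be seen by running both patterns over the same string. The subject string below is an assumption chosen so that each pattern hits different substrings.

```php
<?php
// Hypothetical subject: two items inside the class, two outside it,
// and "ab", which has nothing between the a and the b.
$subject = "aKb aLb aXb a7b ab";

// [KLM] accepts exactly one character, which must be K, L or M.
$inClass = preg_replace("/a[KLM]b/", "_", $subject);

// [^KLM] accepts exactly one character, which must NOT be K, L or M.
$notInClass = preg_replace("/a[^KLM]b/", "_", $subject);

echo "inClass is $inClass <br />";
echo "notInClass is $notInClass";
```

Note that "ab" survives both replacements: a class, complemented or not, always consumes exactly one character of the target string.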

  30. Regular expressions (contd.) • The - character also has a special meaning when used between [ and ] characters: • it is used to join the start and end of a sequence of characters, any one of which may be used to match one character in the target string • Example regular expression: /a[0-9]b/ matches any substring comprising “the letter a followed by one digit, followed by the letter b”

  31. Regular expressions (contd.) • Example regular expression: /%[a-fA-F0-9]/ matches any substring comprising “a % followed by a hexadecimal digit”
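The two range examples can be tried out as follows. The subject strings are assumptions made up for illustration.

```php
<?php
// [0-9] matches exactly one digit, so "a42b" (two digits) is left alone.
$digits = preg_replace("/a[0-9]b/", "_", "a5b axb a42b");

// Several ranges can share one class: a-f, A-F and 0-9 together
// cover the hexadecimal digits. A trailing % has no digit after it.
$hex = preg_replace("/%[a-fA-F0-9]/", "_", "%A %g %7 100%");

echo "digits is $digits <br />";
echo "hex is $hex";
```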

  32. Regular expressions (contd.) • Certain escape sequences also have a special meaning in regular expressions. They define certain commonly used equivalence classes of characters: \w is equivalent to [a-zA-Z0-9_] \W is equivalent to [^a-zA-Z0-9_] \d is equivalent to [0-9] \D is equivalent to [^0-9] \s is equivalent to [ \n\t\f\r] \S is equivalent to [^ \n\t\f\r] \b denotes a word boundary \B denotes a non-word boundary • Note the SP characters in the meaning of \s and \S, that is the white-space equivalence includes SP • By the way, \f is formFeed and \r is carriageReturn
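Of the escape sequences listed, \b is the only one that matches a position rather than a character. A minimal sketch of its effect, using a made-up subject string, is given below. The patterns are written in single-quoted PHP strings so that the backslashes reach the regular expression engine unchanged.

```php
<?php
// Hypothetical subject containing "cat" as a whole word and inside other words.
$subject = "cat concat cats cat";

// Without boundaries, every occurrence of the three letters is replaced.
$anywhere = preg_replace('/cat/', "_", $subject);

// With \b on both sides, only the stand-alone word is replaced.
$wholeWord = preg_replace('/\bcat\b/', "_", $subject);

echo "anywhere is $anywhere <br />";
echo "wholeWord is $wholeWord";
```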

  33. Regular expressions (contd.) • Example regular expression: /%\d\d\d\D/ matches any substring comprising “a % followed by three decimal digits, followed by a non-digit” • Example regular expression: /\s\w\w\s/ matches any substring comprising “a white-space character, followed by two word characters, followed by another white-space character”

  34. Regular expressions (contd.) • PHP code <?php $myString = "This is not an apple"; echo "myString is $myString <br />"; $myString = preg_replace("/\s\w\w\s/","_",$myString); echo "myString is now $myString"; ?> • Resultant output is myString is This is not an apple myString is now This_not_apple

  35. Regular expressions (contd.) • The standard quantifiers are all “greedy” • they match as many occurrences as possible without causing the pattern to fail. • It is possible to make them “frugal” • that is, make them match the minimum number of times necessary • We do this by following the quantifier with a "?" • *? Match 0 or more times, preferably only 0 • +? Match 1 or more times, preferably only 1 time • ?? Match 0 or 1 time, preferably only 0 • {n}? Match exactly n times • {n,}? Match at least n times, preferably only n times • {n,m}? Match at least n but not more than m times, preferably only n times

  36. Regular expressions (contd.) • PHP code <?php $myString1 = "abcabcabcabc"; echo "myString1 is $myString1 <br />"; $myString1 = preg_replace("/(abc){2,5}/","x",$myString1); echo "myString1 is now $myString1 <br />"; $myString2 = "abcabcabcabc"; echo "myString2 is $myString2 <br />"; $myString2 = preg_replace("/(abc){2,5}?/","x",$myString2); echo "myString2 is now $myString2"; ?> • Resultant output is myString1 is abcabcabcabc myString1 is now x myString2 is abcabcabcabc myString2 is now xx • What is going on here? See next slide for contrast

  37. Regular expressions (contd.) • PHP code <?php $myString1 = "abcabcabcabc"; echo "myString1 is $myString1 <br />"; $myString1 = preg_replace("/(abc){2,5}/","x",$myString1,1); echo "myString1 is now $myString1 <br />"; $myString2 = "abcabcabcabc"; echo "myString2 is $myString2 <br />"; $myString2 = preg_replace("/(abc){2,5}?/","x",$myString2,1); echo "myString2 is now $myString2"; ?> • Resultant output is myString1 is abcabcabcabc myString1 is now x myString2 is abcabcabcabc myString2 is now xabcabc • Discussion of contrast with previous slide ...
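One way to see what the greedy and frugal quantifiers are doing in the last two slides is to list the substrings each pattern actually matches. The sketch below assumes PHP's preg_match_all function, which the slides do not cover; it collects every match of the whole pattern into element 0 of its third argument.

```php
<?php
$subject = "abcabcabcabc";

// $greedy[0] and $frugal[0] receive every full-pattern match, in order.
preg_match_all("/(abc){2,5}/", $subject, $greedy);
preg_match_all("/(abc){2,5}?/", $subject, $frugal);

echo "greedy matches: " . implode(", ", $greedy[0]) . " <br />";
echo "frugal matches: " . implode(", ", $frugal[0]);
```

The greedy pattern swallows all four repeats in a single match, so there is only one substring to replace. The frugal pattern stops after the minimum of two repeats, which leaves room for a second match; this is why the unlimited replacement produces xx while the replacement limited to one produces xabcabc.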

  38. A digression • Before proceeding to further regexp concepts, let’s apply what we have already seen to HTML manipulation

  39. Example task • Suppose we have the following HTML <ul><li>wine</li><li>f12</li><li>cheese</li></ul> • Suppose we want to eliminate from the list any list item whose content comprises only non-digits • That is, we want the HTML to become <ul><li>f12</li></ul>

  40. Regular expressions (contd.) • PHP code <?php $myString = "<ul><li>wine</li><li>f12</li><li>cheese</li></ul>"; echo "myString is $myString <br />"; $myString = preg_replace("/<li>\D+<\/li>/","",$myString); echo "myString is now $myString <br />"; ?> • Resultant output (as rendered by the browser, with each list item shown as a bullet) is myString is • wine • f12 • cheese myString is now • f12

  41. Seeing the raw-HTML • Suppose we want to see the raw HTML in our output • That is, suppose we wanted to see myString is <ul><li>wine</li><li>f12</li><li>cheese</li></ul> myString is now <ul><li>f12</li></ul> • We would have to replace all occurrences of < with &lt; • We could use regular expressions for this, but • the string to be replaced is a constant • so we can use a simpler technology

  42. Regular expressions (contd.) • PHP code <?php $myString = "<ul><li>wine</li><li>f12</li><li>cheese</li></ul>"; echo "myString is ".str_replace("<","&lt;",$myString)."<br />"; $myString = preg_replace("/<li>\D+<\/li>/","",$myString); echo "myString is now ".str_replace("<","&lt;",$myString); ?> • Now the resultant output is myString is <ul><li>wine</li><li>f12</li><li>cheese</li></ul> myString is now <ul><li>f12</li></ul>
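As an aside, PHP also ships a built-in function, htmlspecialchars, that escapes < along with the other characters that are special in HTML (>, & and quotes). The slides do not use it; the sketch below simply compares it with the str_replace approach, and the variable names are assumptions.

```php
<?php
$myString = "<ul><li>wine</li><li>f12</li><li>cheese</li></ul>";

// The approach used on the slide: escape only the < character.
$viaStrReplace = str_replace("<", "&lt;", $myString);

// The built-in alternative: escapes <, >, & and quote characters.
$viaBuiltin = htmlspecialchars($myString);

echo "str_replace gives: $viaStrReplace <br />";
echo "htmlspecialchars gives: $viaBuiltin";
```

Both versions display as the same raw HTML in a browser; the second is the safer choice when the string might also contain & characters.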

  43. Suppose we want to replace every list item with the fixed phrase listItem • That is, we wanted to see this output myString is <ul><li>wine</li><li>f12</li><li>cheese</li></ul> myString is now <ul> listItem listItem listItem </ul>

  44. Regular expressions (contd.) • Suppose we try this <?php $myString = "<ul><li>wine</li><li>f12</li><li>cheese</li></ul>"; echo "myString is ".str_replace("<","&lt;",$myString)."<br />"; $myString = preg_replace("/<li>.+<\/li>/"," listItem ",$myString); echo "myString is now ".str_replace("<","&lt;",$myString); ?> • Resultant output is myString is <ul><li>wine</li><li>f12</li><li>cheese</li></ul> myString is now <ul> listItem </ul> • What is wrong? • We need to make the + quantifier ungreedy

  45. Regular expressions (contd.) • We must do this <?php $myString = "<ul><li>wine</li><li>f12</li><li>cheese</li></ul>"; echo "myString is ".str_replace("<","&lt;",$myString)."<br />"; $myString = preg_replace("/<li>.+?<\/li>/"," listItem ",$myString); echo "myString is now ".str_replace("<","&lt;",$myString); ?> • Resultant output is myString is <ul><li>wine</li><li>f12</li><li>cheese</li></ul> myString is now <ul> listItem listItem listItem </ul>

  46. End of digression • Back to regular expressions ...

  47. Regular expressions (contd.) -- remembering subpattern matches • When a pattern is being matched with a target string, substrings that match sub-patterns can be remembered and re-used later in the same pattern • Sub-patterns whose matching substrings are to be remembered are enclosed in parentheses • The sub-patterns are implicitly numbered, starting from 1, and their matching substrings can then be re-used later in the pattern by using back-references like \1 or \2 or \3 • However, inside a double-quoted PHP string the backslash must itself be escaped, so we must type \\1 or \\2 or \\3 in our regular expressions

  48. Using back-references (contd.) • PHP code <?php $myString1 = "klmAklmAAklmABklmBklmBBklm"; echo "myString1 is $myString1 <br />"; $myString1 = preg_replace("/([A-Z])\\1/","_",$myString1); echo "myString1 is now $myString1 "; ?> • Resultant output is myString1 is klmAklmAAklmABklmBklmBBklm myString1 is now klmAklm_klmABklmBklm_klm
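The need for the doubled backslash comes from PHP's string syntax rather than from the regular expression itself: in a double-quoted string, a backslash followed by a digit is read as an octal character code. A single-quoted string avoids this. The sketch below compares the two spellings and adds one further, hypothetical use of a back-reference (collapsing a repeated word); the variable names and the second subject string are assumptions.

```php
<?php
$subject = "klmAklmAAklmABklmBklmBBklm";

// Double-quoted string: the backslash must be escaped, giving \\1.
$doubleQuoted = preg_replace("/([A-Z])\\1/", "_", $subject);

// Single-quoted string: \1 reaches the regular expression engine as typed.
$singleQuoted = preg_replace('/([A-Z])\1/', "_", $subject);

// Hypothetical extra example: a word, a space, then the same word again.
// The \b boundaries stop the "is" inside "this" from counting as a word.
$deduped = preg_replace('/\b(\w+) \1\b/', "_", "this is is a test");

echo "doubleQuoted is $doubleQuoted <br />";
echo "singleQuoted is $singleQuoted <br />";
echo "deduped is $deduped";
```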

  49. http://www.cs.ucc.ie/j.bowen/cs4408/slides/
