Regular Expressions


Handling Strings: String substitution and string matching
http://www.zend.com/manual/ref.strings.php
• A PHP string can be treated as a zero-indexed array of characters: the first character of a string is at position zero.
• All string functions use zero as the index of the first character.

String Syntax
• Single quotes: the easiest way to specify a simple string is to enclose it in single quotes (the character ').
  • No variable interpolation; the only escape codes are \' and \\
• Double quotes: "a $better string\n"
  • Variables are interpolated and the standard escape codes work
• "Here-doc" syntax: $foo = <<<END … END;
  • Great for large multi-line blocks of text or HTML
  • Variables are interpolated
  • Note: a newline must follow <<<END
  • END; must be on a line of its own with no leading whitespace (PHP 7.3 and later relax this and allow the closing marker to be indented)
• Variable interpolation: the word "interpolation" means to insert or introduce between other elements or parts. Variable interpolation therefore means inserting the value of a variable into a string.
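The three quoting styles can be compared side by side. This is a minimal sketch; the variable $name and its value are made up for illustration.

```php
<?php
// Hypothetical variable used only for illustration.
$name = 'World';

$single = 'Hello $name\n';   // no interpolation; \n stays as two characters
$double = "Hello $name\n";   // $name is replaced and \n becomes a newline

// Here-doc: behaves like a double-quoted string spread over several lines.
$heredoc = <<<END
<p>Hello $name</p>
END;

echo $single, "\n", $double, $heredoc, "\n";
```

The closing END; is kept at the start of its line so the sketch also runs on PHP versions older than 7.3.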

String Operators
• Array-like character access (zero-based):
  • $str = "MyBigString" => $str[2] == "B" (the older $str{2} form is deprecated since PHP 7.4 and removed in PHP 8.0)
• Concatenation: the dot operator
  • "This lets you join strings into " . "bigger ones"
• Note: avoiding embedded newlines "in strings that wrap onto multiple lines" is a good idea
• Concatenating assignment: .=
  • $str = "My name is"; $str .= " Mac.\n";
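A short sketch of character access and concatenation, using the slide's own example strings. Square brackets are used for indexing because they work on every current PHP version.

```php
<?php
$str = 'MyBigString';

// Zero-based access: index 0 is the first character.
$first = $str[0];   // "M"
$third = $str[2];   // "B"

// Concatenation with the dot operator, then concatenating assignment.
$greeting = 'My name is';
$greeting .= ' Mac.';

echo $first . $third . "\n";
echo $greeting . "\n";
```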

Variables in strings
• "String with a $var in it\n"
• "You can use $an_array[$var] too\n"
• "Sometimes you need ${curl}ies to mark where the {$var}iable ends" (the ${curl} form is deprecated as of PHP 8.2; prefer {$curl})
• "Curlies help on {$big['fancy'][$stuff]} too"
• "Where it's confusing to embed " . $big['ugly'][$var] . "iables, break it up as needed with concatenation."

Basic String Manipulation
• Any of this can be done with regular expressions
  • and some more complex cases can only be done with regular expressions
• But regular expressions are slower than the plain string functions
• str_replace("bar", "baz", "foobar"); // returns "foobaz"
• str_repeat("1234567890", 8); // returns the ten digits repeated 8 times (80 characters)
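The two calls above can be checked directly, next to a regular-expression version of the same replacement. A minimal sketch; preg_replace is used here only for comparison.

```php
<?php
// Plain string functions from the slide.
$replaced = str_replace('bar', 'baz', 'foobar');
$repeated = str_repeat('1234567890', 8);

// The same replacement expressed as a regular expression.
$viaRegex = preg_replace('/bar/', 'baz', 'foobar');

echo $replaced, "\n", strlen($repeated), "\n", $viaRegex, "\n";
```

Both replacements give the same result; the plain function is the simpler choice when the search text is a fixed string.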

What are Regular Expressions?
• A regular expression is a formula for matching strings that follow a pattern.
• Regular expressions are made up of normal characters and metacharacters. Normal characters include upper and lower case letters and digits. The metacharacters have special meanings.
• In the simplest case, a regular expression looks like a standard search string. For example, the regular expression "testing" contains no metacharacters. It will match "testing" and "123testing" but it will not match "Testing".
• To really make good use of regular expressions it is critical to understand metacharacters.

Regular Expressions Metacharacters
Example uses:
1. Clean HTML-formatted text
2. Grab URLs from a Web page
3. Transform all lines of a file to lower case
• \b: word boundary
• \d: digit
• \n: newline
• \r: carriage return
• \s: whitespace character
• \t: tab
• \w: word character (letter, digit, or underscore)
• ^: beginning of string
• $: end of string
• .: any character except a newline
• [ ]: matches any one character between the brackets
• [bdkp]: characters b, d, k and p
• [a-f]: characters a to f
• [^a-f]: all characters except a to f
• abc|def: string abc or string def
• *: match pattern zero or more times
• +: match pattern one or more times
• ?: match pattern zero or one time
• {p,q}: match pattern at least p times and at most q times
• {p,}: match pattern at least p times
• {p}: match pattern exactly p times

Regular Expressions Metacharacters
Here are a few codes for matching special characters in regular expressions:
• \W: any non-word character, the same as [^a-zA-Z0-9_]
• \D: any non-digit, the same as [^0-9]
• \S: any non-whitespace character
• \B: a position that is not a word boundary
POSIX character classes (written inside a bracket expression, for example [[:alnum:]]):
• [:alnum:]: alphanumeric character
• [:alpha:]: alphabetic character, any case
• [:blank:]: space and tab
• [:digit:]: digits
• [:lower:]: lowercase alphabetics
• [:punct:]: punctuation characters
• [:space:]: all whitespace characters, including newline and carriage return
• [:upper:]: uppercase alphabetics
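A few of these metacharacters can be tried with preg_match, which returns 1 on a match and 0 otherwise. This is a sketch with made-up subject strings.

```php
<?php
// \d finds a digit anywhere in the subject.
$hasDigit = preg_match('/\d/', 'abc123');

// \w covers letters, digits and the underscore, so this whole string qualifies.
$wordOnly = preg_match('/^\w+$/', 'snake_case9');

// \W (non-word character) therefore finds nothing in the same string.
$nonWord = preg_match('/\W/', 'snake_case9');

// A POSIX class goes inside an outer pair of brackets.
$posixAlpha = preg_match('/^[[:alpha:]]+$/', 'Hello');

// \b requires a word boundary, so "cat" inside "concatenate" does not count.
$boundary = preg_match('/\bcat\b/', 'concatenate');

var_dump($hasDigit, $wordOnly, $nonWord, $posixAlpha, $boundary);
```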

Regular Expressions
• A caret (^) may be used to indicate the beginning of the string
• A dollar sign ($) is used to indicate the end
    PHP      // Matches "What is PHP?"
    ^PHP     // Matches "PHP rules!" but not "What is PHP?"
    PHP$     // Matches "I love PHP" but not "What is PHP?"
    ^PHP$    // Matches "PHP" but nothing else
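The anchor examples above can be verified with preg_match. A minimal sketch; the patterns are the slide's own, wrapped in / delimiters as PCRE requires.

```php
<?php
$anywhere   = preg_match('/PHP/', 'What is PHP?');    // found in the middle
$startHit   = preg_match('/^PHP/', 'PHP rules!');     // at the beginning
$startMiss  = preg_match('/^PHP/', 'What is PHP?');   // not at the beginning
$endHit     = preg_match('/PHP$/', 'I love PHP');     // at the end
$endMiss    = preg_match('/PHP$/', 'What is PHP?');   // "?" follows, so no match
$wholeMatch = preg_match('/^PHP$/', 'PHP');           // the entire string

var_dump($anywhere, $startHit, $startMiss, $endHit, $endMiss, $wholeMatch);
```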

Regular Expressions
• To use ^, $, or other special characters as regular characters in a search string, prefix each one with a backslash:
    \$\$\$   // Matches "Show me the $$$!"
• Other cases:
    \|       // A vertical bar
    \[       // An open square bracket
    \)       // A closing parenthesis
    \*       // An asterisk
    \^       // A caret symbol
    \/       // A slash
    \\       // A backslash
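Escaping can be done by hand, as above, or with PHP's preg_quote function when the literal text comes from elsewhere. A sketch with made-up subject strings:

```php
<?php
// Hand-escaped dollar signs, as on the slide.
$dollars = preg_match('/\$\$\$/', 'Show me the $$$!');

// preg_quote adds the backslashes for us; the second argument names the delimiter.
$escaped = preg_quote('a.b*c', '/');

// The escaped text now matches only itself, not "any character" or "zero or more b".
$literal = preg_match('/' . $escaped . '/', 'xa.b*cx');

var_dump($dollars, $escaped, $literal);
```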

Regular Expressions
• Square brackets [] define a set of characters, any one of which may match. For example, the following regular expression matches a single digit from 1 to 5 inclusive.
    [12345]       // Matches "1" and "3", but not "a"; anchored as ^[12345]$ it also rejects "12"
• Ranges of numbers and letters may also be specified.
    [1-5]         // Same as previous
    ^[a-z]$       // Matches a string consisting of exactly one lowercase letter
    [0-9a-zA-Z]   // Matches any letter or digit

Regular Expressions
• The characters ?, +, and * also have special meanings. ? means "the preceding character is optional", + means "one or more of the previous character", and * means "zero or more of the previous character".
    bana?na       // Matches "banana" and "banna",
                  // but not "banaana".
    bana+na       // Matches "banana" and "banaana",
                  // but not "banna".
    bana*na       // Matches "banna", "banana", and "banaaana",
                  // but not "bnana".
    ^[a-zA-Z]+$   // Matches any string of one or more
                  // letters and nothing else.
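The quantifier examples can be checked with preg_match. In this sketch the patterns are anchored with ^ and $ so that each test covers the whole string. The last two lines show why the letter range should be written A-Z: the range A-z also takes in the characters that sit between "Z" and "a" in ASCII, including the underscore.

```php
<?php
$optionalHit  = preg_match('/^bana?na$/', 'banna');     // the "a" may be absent
$optionalMiss = preg_match('/^bana?na$/', 'banaana');   // but not doubled
$plusHit      = preg_match('/^bana+na$/', 'banaana');   // one or more "a"
$plusMiss     = preg_match('/^bana+na$/', 'banna');     // zero is not enough
$starHit      = preg_match('/^bana*na$/', 'banaaana');  // any number of "a"
$starMiss     = preg_match('/^bana*na$/', 'bnana');     // the first "a" is still required

// Correct range rejects the underscore; the A-z range lets it through.
$lettersOnly = preg_match('/^[a-zA-Z]+$/', 'Hello_');
$tooWide     = preg_match('/^[a-zA-z]+$/', 'Hello_');

var_dump($optionalHit, $optionalMiss, $plusHit, $plusMiss, $starHit, $starMiss, $lettersOnly, $tooWide);
```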

Regular Expressions
• Parentheses () may be used to group strings together to apply ?, +, or * to them as a whole.
    ba(na)+na     // Matches "banana" and "banananana",
                  // but not "bana" or "banaana".
• The period (.) matches any character except a newline.
    ^.+$          // Matches any non-empty string that contains no newline
• Parentheses (round brackets) create backreferences: besides grouping part of a regular expression together, parentheses also create a "backreference". A backreference stores the part of the string matched by the part of the regular expression inside the parentheses. In PHP, the entire regular expression match is stored as backreference zero.
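Grouping and backreferences can be observed through the optional third argument of preg_match, which receives the captured parts. A minimal sketch:

```php
<?php
// $m[0] is the whole match (backreference zero); $m[1] is the first group.
preg_match('/ba(na)+na/', 'banananana', $m);

// When a group repeats, it keeps the text of its last repetition.
$whole = $m[0];
$group = $m[1];

// The dot does not match a newline, so a lone newline fails ^.+$
$newlineOnly = preg_match('/^.+$/', "\n");
$plainText   = preg_match('/^.+$/', 'abc');

var_dump($whole, $group, $newlineOnly, $plainText);
```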

Regular Expressions – more examples
• "a(bc)*": matches a string that has an a followed by zero or more copies of the sequence "bc"
• "a(bc){1,5}": an a followed by one through five copies of "bc"
There's also the '|' symbol, which works as an OR operator:
• "hi|hello": matches a string that has either "hi" or "hello" in it
• "(b|cd)ef": a string that has either "bef" or "cdef"
• "(a|b)*c": a string that has any sequence of a's and b's (possibly empty) followed by a c
A period ('.') stands for any single character:
• "a.[0-9]": matches a string that has an a followed by one character and a digit
• "^.{3}$": a string with exactly 3 characters
Bracket expressions specify which characters are allowed in a single position of a string:
• "[ab]": matches a string that has either an a or a b (that's the same as "a|b")
• "[a-d]": a string that has one of the lowercase letters 'a' through 'd' (that's equal to "a|b|c|d" and even "[abcd]")
• "^[a-zA-Z]": a string that starts with a letter
• "[0-9]%": a string that has a single digit before a percent sign
• ",[a-zA-Z0-9]$": a string that ends in a comma followed by an alphanumeric character
You can also list which characters you DON'T want: just use a '^' as the first symbol in a bracket expression (e.g., "%[^a-zA-Z]%" matches a string with a character that is not a letter between two percent signs).

Regular Expressions
A regular expression is a specially formatted pattern that can be used to find instances of one string in another.
POSIX style: PHP has six functions that work with POSIX regular expressions. They all take a regular expression string as their first argument, and are shown below. Note that this family was deprecated in PHP 5.3 and removed in PHP 7.0; the PCRE (preg) functions covered later replace it.
• ereg: The most common regular expression function, ereg allows us to search a string for matches of a regular expression.
• ereg_replace: Allows us to search a string for a regular expression and replace any occurrence of that expression with a new string.
• eregi: Performs exactly the same matching as ereg, but is case insensitive.
• eregi_replace: Performs exactly the same search-replace functionality as ereg_replace, but is case insensitive.
• split: Splits a string wherever the regular expression matches and returns the pieces between the matches as an array of strings.
• spliti: Case insensitive version of the split function.
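Because the ereg family is not available from PHP 7.0 onward, the same tasks are written today with the preg functions. The following is a rough mapping rather than a complete migration guide, and the subject strings are made up for illustration.

```php
<?php
// ereg("PHP", ...) becomes preg_match with delimiters around the pattern.
$hasPhp = preg_match('/PHP/', 'PHP rules!');

// eregi(...) becomes the same call with the i (case-insensitive) modifier.
$hasPhpAnyCase = preg_match('/PHP/i', 'What is Php?');

// ereg_replace(...) becomes preg_replace.
$swapped = preg_replace('/cat/', 'dog', 'cat and cat');

// split(...) becomes preg_split: the pattern describes the separator.
$parts = preg_split('/[ ,]+/', 'a, b c');

var_dump($hasPhp, $hasPhpAnyCase, $swapped, $parts);
```

One difference to keep in mind: preg patterns need delimiters (here /), while ereg patterns did not.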

Language syntax
ereg is a basic function that can be used to determine whether a regular expression is "satisfied" by a particular text string. Consider the following code:
    $text = "PHP rules!";
    if (ereg("PHP", $text)) {
        echo('$text contains the string "PHP".');
    } else {
        echo('$text does not contain the string "PHP".');
    }
Output: $text contains the string "PHP".

Language syntax
eregi is a function that behaves almost identically to ereg, except that it ignores the case of text when looking for matches:
    $text = "What is Php?";
    if (eregi("PHP", $text)) {
        echo('$text contains the string "PHP".');
    } else {
        echo('$text does not contain the string "PHP".');
    }
Output: $text contains the string "PHP".

Regular Expression syntax
Beginning of string: To search from the beginning of a string, use ^. For example,
    <?php echo ereg("^hello", "hello world!"); ?>
would return true, however
    <?php echo ereg("^hello", "i say hello world"); ?>
would return false, because hello wasn't at the beginning of the string.
End of string: To search at the end of a string, use $. For example,
    <?php echo ereg("bye$", "goodbye"); ?>
would return true, however
    <?php echo ereg("bye$", "goodbye my friend"); ?>
would return false, because bye wasn't at the very end of the string.

Regular Expression syntax
Any single character: To search for any character, use the dot. For example,
    <?php echo ereg(".", "cat"); ?>
would return true, however
    <?php echo ereg(".", ""); ?>
would return false, because our search string contains no characters.
You can optionally tell the regular expression engine how many single characters it should match using curly braces. If I wanted a match on five characters, I would use ereg like this:
    <?php echo ereg(".{5}$", "12345"); ?>
The code above tells the regular expression engine to return true if and only if at least five successive characters appear at the end of the string. Longer strings therefore match as well; to require exactly five characters, anchor the start too and write "^.{5}$".

Regular Expression syntax
We can also limit the number of characters that can appear in successive order:
    <?php echo ereg("a{1,3}$", "aaa"); ?>
In the example above, we have told the regular expression engine that in order for our search string to match the expression, it should end with between one and three 'a' characters. (Because the pattern is not anchored at the start, a string ending in a longer run of 'a' characters also matches.)
    <?php echo ereg("a{1,3}$", "aaab"); ?>
The example above wouldn't return true: there are three 'a' characters in the search string, but they are not at the end of the string. If we took the end-of-string match $ out of the regular expression, then the string would match.
We can also tell the regular expression engine to match at least a certain number of characters in a row, and more if they exist. We can do so like this:
    <?php echo ereg("a{3,}$", "aaaa"); ?>
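The counting examples can be checked on current PHP versions with preg_match (used here in place of ereg, which PHP 7 and later no longer provide). A minimal sketch:

```php
<?php
// Exactly five characters needs both anchors.
$exactlyFive = preg_match('/^.{5}$/', '12345');
$sixIsTooMany = preg_match('/^.{5}$/', '123456');

// Without ^ the pattern only asks for five characters at the end.
$endsWithFive = preg_match('/.{5}$/', '123456');

// One to three "a" characters at the end of the string.
$endHit  = preg_match('/a{1,3}$/', 'aaa');
$endMiss = preg_match('/a{1,3}$/', 'aaab');   // the run is not at the end
$noAnchor = preg_match('/a{1,3}/', 'aaab');   // matches once $ is removed

// At least three "a" characters at the end.
$atLeastThree = preg_match('/a{3,}$/', 'aaaa');

var_dump($exactlyFive, $sixIsTooMany, $endsWithFive, $endHit, $endMiss, $noAnchor, $atLeastThree);
```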

Regular Expression syntax
Repeat character zero or more times: To tell the regular expression engine that a character may exist, and can be repeated, we use the * character. Here are two examples that would return true:
    <?php echo ereg("t*", "tom"); ?>
    <?php echo ereg("t*", "fom"); ?>
Even though the second example doesn't contain the 't' character, it still returns true because the * indicates that the character may appear, and that it doesn't have to. In fact, any subject string at all would cause this call to ereg to return true, because t* can match zero characters.
Repeat character one or more times: To tell the regular expression engine that a character must exist and that it can be repeated more than once, we use the + character, like this:
    <?php echo ereg("z+", "i like the zoo"); ?>
The following example would also return true:
    <?php echo ereg("z+", "i like the zzzzzzoo!"); ?>

Regular Expression syntax
Repeat character zero or one times: We can also tell the regular expression engine that a character must either exist just once, or not at all. We use the ? character to do so, like this:
    <?php echo ereg("c?", "cats are fuzzy"); ?>
If we wanted to, we could even entirely remove the 'c' from the search string shown above, and this expression would still return true. The '?' means that a 'c' may appear at the matched position but doesn't have to, so on its own the pattern c? matches any string.
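The behaviour of *, + and ? is easier to see when the matched text itself is inspected. This sketch uses preg_match (ereg is unavailable on PHP 7 and later) and its third argument, which receives the matched text.

```php
<?php
// t* succeeds on "fom" by matching the empty string at position 0.
$starResult = preg_match('/t*/', 'fom', $emptyMatch);

// On "tom" the same pattern picks up the leading "t".
preg_match('/t*/', 'tom', $tMatch);

// z+ needs at least one "z" and takes as many as follow.
preg_match('/z+/', 'i like the zzzoo!', $zMatch);
$plusMiss = preg_match('/z+/', 'fom');

// c? also succeeds when there is no "c" at all.
$optionalResult = preg_match('/c?/', 'ats are fuzzy');

var_dump($starResult, $emptyMatch[0], $tMatch[0], $zMatch[0], $plusMiss, $optionalResult);
```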

String Replacement
ereg_replace accepts a regular expression and a string of text and attempts to match the regular expression in the string. It also takes a replacement string, and replaces every match of the regular expression with that string (eregi_replace does the same, case insensitively):
    $newstring = ereg_replace(<regexp>, <replacewith>, <oldstring>);
<regexp> is the regular expression, and <replacewith> is the string that will replace matches of <regexp> in <oldstring>. The function returns the new string that is the outcome of the replacement operation. In the line above, this gets stored in $newstring.
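On PHP 7 and later the same job is done by preg_replace, which takes its arguments in the same order (pattern, replacement, subject). A sketch with made-up strings; the second call also shows backreferences being reused in the replacement.

```php
<?php
// Replace every match; the i modifier makes the search case-insensitive.
$newstring = preg_replace('/colou?r/i', 'hue', 'Color and colour');

// $1 and $2 in the replacement refer to the first and second group.
$swapped = preg_replace('/(\d+)-(\d+)/', '$2-$1', '10-20');

echo $newstring, "\n", $swapped, "\n";
```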

Regular Expressions
PCRE style (Perl-Compatible Regular Expressions): The preg functions require you to specify the regular expression as a string using Perl syntax. In Perl, /regex/ defines a regular expression. In PHP, this becomes preg_match('/regex/', $string). When using Perl-style matching, the pattern is enclosed by special delimiters. The usual choice is the forward slash, though you can use others. For example: /colour/
• preg_match_all (string pattern, string subject, array matches, int flags) fills the array "matches" with all the matches of the regular expression pattern in the subject string.
  • If you specify PREG_SET_ORDER as the flag, then $matches[0] is an array containing the full match and the backreferences of the first match, $matches[1] holds the results for the second match, and so on.
  • If you specify PREG_PATTERN_ORDER, then $matches[0] is an array with all the full regex matches, $matches[1] an array with the first backreference of each match, $matches[2] an array with the second backreference of each match, etc.
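The two orderings are easiest to compare on a small input. In this sketch the pattern and subject are made up: each match is a letter followed by a digit.

```php
<?php
$pattern = '/([a-z])(\d)/';
$subject = 'a1 b2';

// Grouped by pattern part: all full matches, then all first groups, and so on.
preg_match_all($pattern, $subject, $byPattern, PREG_PATTERN_ORDER);

// Grouped by match: one sub-array per match, holding the full match and its groups.
preg_match_all($pattern, $subject, $bySet, PREG_SET_ORDER);

print_r($byPattern);
print_r($bySet);
```

PREG_SET_ORDER is usually the more convenient shape for a foreach loop over matches, while PREG_PATTERN_ORDER suits the case where only one group is of interest.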

Regular Expressions
• preg_grep (string pattern, array subjects) returns an array that contains all the strings in the array "subjects" that can be matched by the regular expression pattern.
• preg_replace (mixed pattern, mixed replacement, mixed subject [, int limit]) returns a string with all matches of the regex pattern in the subject string replaced with the replacement string. At most limit replacements are made.
• preg_split (string pattern, string subject [, int limit]) works just like split, except that it uses the Perl syntax for the regex pattern.
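A short sketch of preg_grep and of the limit argument of preg_replace, with made-up inputs. Note that preg_grep keeps the original array keys of the entries it returns.

```php
<?php
// Keep only the entries that consist entirely of digits.
$numeric = preg_grep('/^\d+$/', ['1', 'two', '3']);

// Replace at most two matches, leaving the third "a" untouched.
$limited = preg_replace('/a/', 'b', 'aaa', 2);

print_r($numeric);
echo $limited, "\n";
```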

Regular Expressions
Forward slashes in the regular expression have to be escaped with a backslash when / is the delimiter:
• http://www.unt.edu/ becomes '/http:\/\/www\.unt\.edu\//' (the dots are escaped as well, since an unescaped dot matches any character).
Some regex matching options:
• '/regex/i' applies the regex case insensitively
• '/regex/s' makes the dot match all characters, including newlines
• '/regex/m' makes the start and end of line anchors match at embedded newlines in the subject string
Specify multiple letters to turn on several options:
• '/regex/mis' turns on all three options.
See the PHP manual for the complete list of options: http://www.php.net/manual/en/function.preg-match.php
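Choosing a different delimiter avoids escaping the slashes at all, and the s and m options can be seen on a two-line string. A sketch; the # delimiter and the test strings are illustrative choices.

```php
<?php
// With # as the delimiter, the slashes in the URL need no backslashes.
$urlMatch = preg_match('#^http://www\.unt\.edu/#', 'http://www.unt.edu/index.html');

$twoLines = "a\nb";

// By default the dot stops at a newline; the s option lifts that restriction.
$dotDefault = preg_match('/a.b/', $twoLines);
$dotAll     = preg_match('/a.b/s', $twoLines);

// By default ^ and $ refer to the whole string; the m option makes them per line.
$lineDefault = preg_match('/^b$/', $twoLines);
$multiLine   = preg_match('/^b$/m', $twoLines);

var_dump($urlMatch, $dotDefault, $dotAll, $lineDefault, $multiLine);
```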

Regular Expressions
preg_split
Example 1. preg_split() example: get the parts of a search string
    <?php
    // split the phrase by any number of commas or space characters,
    // which include " ", \r, \t, \n and \f
    $keywords = preg_split("/[\s,]+/", "hypertext language, programming");
    ?>
Example 2. Splitting a string into component characters
    <?php
    $str = 'string';
    $chars = preg_split('//', $str, -1, PREG_SPLIT_NO_EMPTY);
    print_r($chars);
    ?>

Regular Expressions
Example 3. Splitting a string into matches and their offsets
    <?php
    $str = 'hypertext language programming';
    $chars = preg_split('/ /', $str, -1, PREG_SPLIT_OFFSET_CAPTURE);
    print_r($chars);
    ?>
will yield:
    Array
    (
        [0] => Array ( [0] => hypertext   [1] => 0 )
        [1] => Array ( [0] => language    [1] => 10 )
        [2] => Array ( [0] => programming [1] => 19 )
    )

Regular Expressions
preg_match
Example 1. Find the string of text "php"
    <?php
    // The "i" after the pattern delimiter indicates a case-insensitive search
    if (preg_match("/php/i", "PHP is the web scripting language of choice.")) {
        print "A match was found.";
    } else {
        print "A match was not found.";
    }
    ?>

Regular Expressions
Example 2. Find the word "web"
    <?php
    /* The \b in the pattern indicates a word boundary, so only the distinct
     * word "web" is matched, and not a word partial like "webbing" or "cobweb" */
    if (preg_match("/\bweb\b/i", "PHP is the web scripting language of choice.")) {
        print "A match was found.";
    } else {
        print "A match was not found.";
    }
    if (preg_match("/\bweb\b/i", "PHP is the website scripting language of choice.")) {
        print "A match was found.";
    } else {
        print "A match was not found.";
    }
    ?>

Regular Expressions
Example 3. Getting the domain name out of a URL
    <?php
    // get host name from URL
    preg_match("/^(http:\/\/)?([^\/]+)/i", "http://www.php.net/index.html", $matches);
    $host = $matches[2];
    // get last two segments of host name
    preg_match("/[^\.\/]+\.[^\.\/]+$/", $host, $matches);
    echo "domain name is: {$matches[0]}\n";
    ?>
This example will produce:
    domain name is: php.net

Regular Expressions – more preg_match examples
Example:
    $pattern = "/\b(do(ugh)?nut)\b.*\b(Homer|Fred)\b/i";
    $target = "Have a donut, Homer.";
    if (preg_match($pattern, $target, $matches)) {
        print("Match: $matches[0]\n");
        print("Pastry: $matches[1]\n");
        print("Variant: $matches[2]\n");
        print("Name: $matches[3]\n");
    } else {
        print("No match.");
    }
Results:
    Match: donut, Homer
    Pastry: donut
    Variant: [blank because there was no "ugh"]
    Name: Homer
If you use the $target "Doughnut, Frederick?" there will be no match, since there has to be a word boundary after Fred. But "Doughnut, fred?" will match, since we've specified the pattern to be case-insensitive.

More Examples

Regular Expressions – find a valid IP number
    <?php
    function valid_ipv4($ip_addr) {
        // one octet, 0 to 255
        $num = "([0-9]|1?\d\d|2[0-4]\d|25[0-5])";
        // optional prefix length, 1 to 32
        $range = "([1-9]|1\d|2\d|3[0-2])";
        if (preg_match("/^$num\.$num\.$num\.$num(\/$range)?$/", $ip_addr)) {
            return 1;
        }
        return 0;
    }
    $ip_array[] = "127.0.0.1";
    $ip_array[] = "127.0.0.256";   // octet out of range
    $ip_array[] = "127.0.0.1/36";  // prefix length out of range
    $ip_array[] = "127.0.0.1/1";
    foreach ($ip_array as $ip_addr) {
        if (valid_ipv4($ip_addr)) {
            echo "$ip_addr is valid \n";
        } else {
            echo "$ip_addr is NOT valid \n";
        }
    }
    ?>

Regular Expressions
PHP get_title_tag code, which uses a simple regex and PHP string functions
    <?php
    $readfile = "index.html";
    function get_title_tag($file) {
        $content = file_get_contents($file);
        // .*? stops at the first closing tag; the s option lets the title span lines
        if (preg_match("/<title>(.*?)<\/title>/is", $content, $matches)) {
            print("Title: $matches[1]");
        }
    }
    get_title_tag($readfile);
    ?>

Performance/Speed
• Rule of thumb: use the simplest function that will get the job done right
  • strpos instead of strstr when you only need to know whether a substring occurs
  • str_replace instead of preg_replace
  • And so forth…
• The PHP manual online usually includes notes about speed differences
• PCRE is usually faster than POSIX regex
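As an illustration of the rule of thumb, a fixed-substring test needs no regular expression at all. This is a sketch; the subject string is borrowed from the earlier preg_match example.

```php
<?php
$text = 'PHP is the web scripting language of choice.';

// strpos returns the offset of the first occurrence, or false if there is none.
// Compare with !== false, because a hit at offset 0 would otherwise look like a miss.
$foundPlain = strpos($text, 'web') !== false;
$atStart    = strpos($text, 'PHP');

// The regex version gives the same answer here but invokes the regex engine to do it.
$foundRegex = preg_match('/web/', $text) === 1;

var_dump($foundPlain, $atStart, $foundRegex);
```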

Fuzzy String Matching
• Fuzzy matching: deciding that two strings are similar, but not identical, is hard to do with exact comparisons or patterns.
• Levenshtein distance: this calculates the minimum number of insertions, deletions, and substitutions necessary to convert one string into another. A low distance between two strings means that the strings are more similar.
• PHP's implementation of levenshtein() gives each operation equal weight by default, while many other implementations give substitution twice the cost of insertion or deletion. The cost of each operation can be defined by setting the optional insertion, substitution, and deletion cost parameters.
• A demo of how Levenshtein works: Peter Kleiweg's Excellent Levenshtein Demo
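A small sketch of levenshtein() in use. The helper closest_word and the word lists are hypothetical, written only to show how the distance can pick the most similar candidate; the optional cost arguments are passed in the order insertion, substitution, deletion.

```php
<?php
// Hypothetical helper: returns the candidate with the smallest edit distance.
function closest_word(string $input, array $candidates): ?string
{
    $best = null;
    $bestDistance = PHP_INT_MAX;
    foreach ($candidates as $candidate) {
        $distance = levenshtein($input, $candidate);
        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $best = $candidate;
        }
    }
    return $best;
}

// Equal weights: two substitutions and one insertion.
$plain = levenshtein('kitten', 'sitting');

// Substitution weighted twice as heavily as insertion or deletion.
$weighted = levenshtein('kitten', 'sitting', 1, 2, 1);

$guess = closest_word('carrrot', ['cabbage', 'carrot', 'potato']);

var_dump($plain, $weighted, $guess);
```

When several candidates are equally close, this sketch keeps the first one it sees; a real spelling suggester would need a tie-breaking rule.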

Regex Coach
• Regular expressions are incredibly useful, but they also easily get out of hand when trying to match complex strings. Furthermore, anything past twelve or so characters gets hard to read and understand, which is a common source of bugs.
• The Regex Coach, available from www.weitz.de/regex-coach, is free for non-commercial use and helps you check that your regular expressions are correct by visually highlighting the strings that match. The Coach is fully compatible with all the options shown here, including string replacement, and can even break down a regexp and describe it in plain English.

Resources
• Using Strings, by Nathan Wallace: http://www.webreference.com/programming/php/regexps/
• Regular Expression Library: http://www.regexlib.com/
• levenshtein() function: http://www.zend.com/manual/function.levenshtein.php
