Strings and Regular Expressions in PHP, or “PCRE, POSIX, and Bears, Oh My!”

Presentation Transcript

  1. Strings and Regular Expressions in PHP, or “PCRE, POSIX, and Bears, Oh My!” • UPHPU Meeting, January 18, 2005 • Mac Newbold, mac@macnewbold.com

  2. Who am I? • Full-time self-employed computer geek • MNE, LLC (macnewbold.com, owner) and • Digital Media Consulting, LLC (a.k.a. Dmedia, www.dmedia.ws, partner) • Wide variety of PHP-driven web sites, mostly with MySQL and without Javascript and Flash • Background: B.S. C.S. ’01, M.S. C.S. ’05 • University of Utah – Go Utes! UPHPU - Mac Newbold

  3. Campaign Promises • Intro to Strings in PHP • (Feel free to tell me how fast or slow to go) • Functions relating to HTML, SQL, etc. • Regular Expressions • PCRE • POSIX • Performance/Speed considerations • Grab bag of cool string functions

  4. Introducing: Strings in PHP • Much like strings in any other language • Major difference: Boundary between string, integer, float, and boolean is very blurred • Actually a benefit: if it’s not a string, but should be, it will be • Though this can lead to some unexpected results • Info in PHP Manual: • www.php.net/strings • www.php.net/manual/en/language.types.string.php

  5. String Syntax • Single quotes: 'a string' • No variable interpolation; \' and \\ are the only escape codes • Double quotes: "a $better string\n" • Variables work, standard escape codes work • “Here-doc” syntax: $foo = <<<END … END; • Great for large multi-line blocks of text or HTML • Variables are interpolated • Gotchas: a newline must follow <<<END • Before PHP 7.3, END; must be the entire line, with no leading whitespace
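The three quoting styles on this slide can be compared side by side. This is a minimal sketch; `$name` is a made-up variable used only for illustration.

```php
<?php
// Hypothetical variable, not from the slides.
$name = 'Mac';

// Single quotes: no interpolation, and \n stays as a backslash plus "n".
$single = 'Hello $name\n';

// Double quotes: the variable is interpolated and \n becomes a newline.
$double = "Hello $name\n";

// Heredoc: multi-line text with interpolation. The closing marker sits
// alone at the start of its line, which works on every PHP version.
$block = <<<END
Name: $name
Done
END;

echo $single, "\n", $double, $block, "\n";
```

Keeping the closing `END;` flush left is the conservative choice, since only newer PHP versions accept an indented marker.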

  6. String Operators • Array-like character access (offsets start at 0): • $str = "MyBigString" => $str[2] == "B" (the older $str{2} form was removed in PHP 8.0) • Concatenation: the dot operator • "This lets you join strings into " . "bigger ones" • Note: Avoiding embedded newlines “in strings that wrap onto multiple lines” is a good idea • Concatenating assignment: .= • $str = "My name is"; $str .= " Mac.\n";
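A short sketch of the operators on this slide. It uses square-bracket offsets, which work in all current PHP versions, and relies on offsets being zero-based.

```php
<?php
$str = "MyBigString";

// Offsets are zero-based: 0 => "M", 1 => "y", 2 => "B".
$third = $str[2];

// The dot operator concatenates two strings into a new one.
$joined = "This lets you join strings into " . "bigger ones";

// .= appends to an existing string in place.
$msg = "My name is";
$msg .= " Mac.\n";

echo $third, "\n", $joined, "\n", $msg;
```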

  7. Variables in Strings • "Simple string with a $var in it\n" • "You can use $an_array[$var] too\n" • "Sometimes you need ${curl}ies to mark where the {$var}iable ends" • "Curlies help on {$big['fancy'][$stuff]} too" • "Where it's confusing to embed " . $big['ugly'][$var] . "iables, break it up as needed with concatenation."
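The interpolation forms on this slide might look like this in practice. All variable names and array contents below are invented for the example, and the sketch sticks to the `{$var}` brace form.

```php
<?php
// Hypothetical data for illustration only.
$var      = 'key';
$fruit    = 'apple';
$an_array = ['key' => 'value'];
$big      = ['fancy' => ['key' => 'nested']];

// Simple interpolation of a scalar and of an array element.
$simple  = "Simple string with a $var in it";
$indexed = "You can use $an_array[$var] too";

// Braces mark where the variable name ends ("$fruits" would be a different variable).
$plural = "Two {$fruit}s";

// Braces are required for multi-dimensional or quoted-key access.
$nested = "Curlies help on {$big['fancy'][$var]} too";

// When interpolation gets hard to read, concatenate instead.
$concat = "Value: " . $big['fancy'][$var] . ".";
```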

  8. Must-Have String Functions • www.php.net/strings • echo/print – (print $foo)==1, echo "can", $take, "more than one", "argument"; • Echo shortcut: <b><?=$foo?></b> • trim, ltrim, rtrim/chop – remove whitespace • explode, implode/join • $arr = explode(" ", "List of words"); • $str = implode(",", $arr);
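The splitting, joining, and trimming calls from this slide, written out as a runnable sketch. The padded input string is an invented example.

```php
<?php
// Split on single spaces into an array of words.
$arr = explode(" ", "List of words");   // ["List", "of", "words"]

// Join the pieces back together with a different separator.
$str = implode(",", $arr);              // "List,of,words"

// trim removes whitespace (including newlines) from both ends;
// ltrim and rtrim handle only the left or right side.
$clean = trim("  padded \n");
$left  = ltrim("  padded ");
$right = rtrim("  padded ");

// echo accepts several arguments separated by commas.
echo $str, " / ", $clean, "\n";
```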

  9. Obligatory C-like Functions • All your old favorites are in there: • printf, sprintf, sscanf, fprintf • strcmp, strlen, strpos, strtok • They all do just what you expect, though many of them have easier alternatives • Gotcha: Some of them (like strpos and friends) return boolean false when nothing is found, because 0 is a valid result (a match at the very start of the string). Always test with “=== false”.
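The strpos gotcha is easiest to see with a match at offset 0. A minimal sketch, with the haystack chosen for illustration:

```php
<?php
$haystack = "foobar";

// "foo" is found at offset 0, which is a valid position, not a failure.
$pos = strpos($haystack, "foo");

// Loose comparison treats 0 as false and wrongly reports "not found".
$looseSaysMissing = ($pos == false);

// Strict comparison distinguishes integer 0 from boolean false.
$strictSaysMissing = ($pos === false);

// A needle that really is absent yields boolean false.
$none = strpos($haystack, "xyz");

if (strpos($haystack, "foo") !== false) {
    echo "found\n";
}
```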

  10. Basic String Manipulation • Any of this can be done with regular expressions as well… • and in more complex cases, can only be done with regular expressions • But regular expressions are slower (more later) • str_replace("bar","baz","foobar"); • str_repeat("1234567890",8);

  11. Formatting functions • strtolower, strtoupper • ucfirst, ucwords – uppercase first char, or first char of each word • wordwrap – wrap text to a given width • str_pad("tooshort",15," "); • vprintf, vfprintf, vsprintf – formatted output, with the arguments passed as an array • number_format – add thousands grouping • money_format – format as currency (removed in PHP 8.0)
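A few of these formatting helpers in action. The input values are arbitrary examples, and money_format is left out because it is not available in current PHP.

```php
<?php
$title  = ucwords("hello world");            // first letter of each word
$first  = ucfirst("hello world");            // first letter only
$upper  = strtoupper("shout");

// Pad on the right with spaces up to a total length of 15.
$padded = str_pad("tooshort", 15, " ");

// Wrap at 10 columns, breaking lines with "\n".
$wrapped = wordwrap("The quick brown fox", 10, "\n", true);

// Thousands grouping, with and without decimals.
$whole   = number_format(1234567);
$decimal = number_format(1234567.891, 2);

// vsprintf takes its arguments as an array.
$line = vsprintf("%s has %d items", ["cart", 3]);
```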

  12. Special-Purpose Functions • One of PHP’s strengths is the way it caters to the common things people need • Many string functions are specifically for use with things like dates/times, URLs, HTML, and SQL databases • Advice: When you need them, use them. “Rolling your own” doesn’t usually work out the way you plan it.

  13. Date and Time Functions • www.php.net/datetime • A variety of functions to not only do calculations with dates, but to convert dates to strings – date(), strftime() • And more importantly, to convert strings to dates – strtotime(), strptime() • Great example of why not to “roll your own”, even if it doesn’t seem that complex at first

  14. URL Functions • www.php.net/url • urlencode, urldecode • Turn non-alphanumerics (other than - _ .) into %[hex] and ' ' into '+' • rawurl{en,de}code do the same, except that a space becomes %20 rather than '+' • parse_url – break into host, path, query, etc. • http_build_query – turn array to URL query • base64_{en,de}code – base64 conversions for use with MIME, etc.
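A sketch of the URL helpers listed here. The URL and query values are placeholders (example.com is not from the slides).

```php
<?php
// urlencode turns a space into "+", rawurlencode turns it into "%20".
$encoded = urlencode("a b&c=d");
$raw     = rawurlencode("a b");
$decoded = urldecode($encoded);

// parse_url splits a URL into named components.
$parts = parse_url("http://www.example.com/path/page.php?x=1&y=2");

// http_build_query builds an encoded query string from an array.
$query = http_build_query(['q' => 'php strings', 'page' => 2]);

// base64 round trip.
$b64   = base64_encode("Hello");
$plain = base64_decode($b64);
```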

  15. HTML Functions • htmlspecialchars – encode &, ", <, and > as &amp;, &quot;, &lt;, and &gt; • htmlentities does the same for every character that has a named HTML entity • html_entity_decode is the reverse • nl2br – insert a <br /> tag before each newline (\n) • parse_str – parse a GET query string into variables or an array (see also: extract) • strip_tags – strip HTML tags [selectively]
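The HTML helpers can be tried on a small fragment. The input strings below are invented, and parse_str is shown in its two-argument form, which fills an array instead of creating variables.

```php
<?php
$html = '<a href="x">Tom & Jerry</a>';

// Encode the characters that are special in HTML.
$safe = htmlspecialchars($html);

// html_entity_decode reverses the encoding.
$back = html_entity_decode($safe);

// nl2br inserts a line-break tag before each newline and keeps the newline.
$br = nl2br("a\nb");

// strip_tags removes markup; the second argument lists tags to keep.
$text = strip_tags("<b>Hi</b> <i>there</i>");
$keep = strip_tags("<b>Hi</b> <i>there</i>", "<b>");

// Parse a query string into an array.
parse_str("x=1&name=Mac", $vars);
```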

  16. SQL Functions • “Magic Quotes” – on by default at the time (the feature was removed in PHP 5.4) • Misnamed – adds magic slashes, not quotes • addslashes, stripslashes – escape ', ", and \ • Advice: do db queries first, then use $var = htmlspecialchars(stripslashes($input), ENT_QUOTES) for use in <input value='$var'> tags (ENT_QUOTES is needed so that single quotes are encoded too) • quotemeta – escape . \ + * ? [ ^ ] ( $ ) • For shell commands (system() and `backticks`), escapeshellarg and escapeshellcmd are the purpose-built escaping functions

  17. Now for the fun stuff… • Intro to Strings in PHP • (Feel free to tell me how fast or slow to go) • Functions relating to HTML, SQL, etc. • Regular Expressions • PCRE • POSIX • Performance/Speed considerations • Grab bag of cool string functions

  18. Regular Expressions • Extremely powerful tool for pattern matching – the same technique compilers and interpreters use to break your programs into tokens • Two flavors in PHP: • PCRE – Perl-Compatible Regular Expressions • POSIX Extended • I favor PCRE – multiple languages, more features, faster, and binary-safe

  19. Basics of RE’s • They match patterns – the magic is in the pattern you tell them to match • They have to be precise, including and excluding exactly what you want • People get scared of them because the details can be tricky • But they’re one of the best tools you have for doing some pretty fancy string stuff

  20. RE Patterns • Start with strings and grouping: "abc(def)" • Add alternative branches: "abc(def|123)" • Wildcard: . matches any char but \n • Quantifiers/Repeating: • * = “0 or more”, + = “1 or more”, ? = “0 or 1” • {n} = “n times”, {n,m} = “n to m times” • "(abc)+(def|123)*(.{2})*" • At least one abc, maybe some triplets (def or 123), then an even number of characters
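The building blocks on this slide can be checked with preg_match, which returns 1 on a match and 0 otherwise. In this sketch the patterns are wrapped in ^ and $ so that the whole subject has to match; that anchoring is an addition for the demo, and the test strings are invented.

```php
<?php
// Grouping and alternation: "abc" followed by either "def" or "123".
$alt1 = preg_match('/^abc(def|123)$/', 'abc123');
$alt2 = preg_match('/^abc(def|123)$/', 'abcxyz');

// Quantifiers on a single character.
$star  = preg_match('/^ab*c$/', 'ac');      // * allows zero b's
$plus  = preg_match('/^ab+c$/', 'ac');      // + needs at least one b
$opt   = preg_match('/^ab?c$/', 'abbc');    // ? allows at most one b
$range = preg_match('/^a{2,3}$/', 'aaaa');  // {2,3} allows two or three a's

// The combined pattern from the slide, anchored to the whole string.
$combo = '/^(abc)+(def|123)*(.{2})*$/';
$c1 = preg_match($combo, 'abc');       // one abc, nothing else
$c2 = preg_match($combo, 'abcdef12');  // abc, one triplet, one pair
$c3 = preg_match($combo, 'abcx');      // a single leftover char is not a pair
```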

  21. Character Classes and Types • [] makes character classes • List of characters and ranges: [a-zA-Z0-9] • If you want to use a literal -, put it at the beginning or end of the class, or escape it • Escape any special chars with \ as usual • If first char is ^, class is negated • \d = [0-9], \D = [^0-9] • \s = whitespace, \S = non-whitespace • \w = [a-zA-Z0-9_], \W = [^a-zA-Z0-9_] • \b = word boundary – “zero-width assertion”
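A few checks that exercise the classes and shorthand types above. The subjects are made up for the example.

```php
<?php
// An explicit class with ranges.
$alnum = preg_match('/^[a-zA-Z0-9]+$/', 'abc123');

// A negated class: one or more characters that are not digits.
$noDigits = preg_match('/^[^0-9]+$/', 'abc1');

// A literal hyphen placed first in the class.
$hyphen = preg_match('/^[-a-z]+$/', 'a-b');

// \d shorthand with repetition counts.
$phone = preg_match('/^\d{3}-\d{4}$/', '555-1234');

// \b matches at word edges without consuming any characters.
$wordYes = preg_match('/\bcat\b/', 'the cat sat');
$wordNo  = preg_match('/\bcat\b/', 'concatenate');
```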

  22. Anchors • What if you want to force it to match only at the beginning of the string? Or to match the entire string? • Use an anchor! • ^ as the first char anchors the beginning • $ as the last char anchors the end • In multi-line mode (/m), ^ and $ also match at the start and end of each line
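A quick sketch of how anchors change what matches, including the multi-line case. The sample strings are arbitrary.

```php
<?php
$start    = preg_match('/^foo/', 'foobar');   // foo at the beginning
$notStart = preg_match('/^foo/', 'xfoobar');  // foo is not at the beginning
$end      = preg_match('/bar$/', 'foobar');   // bar at the end
$whole    = preg_match('/^foo$/', 'foobar');  // the entire string must be "foo"

// Without /m, ^ and $ refer to the whole subject; with /m, to each line.
$lines      = "one\ntwo";
$singleMode = preg_match_all('/^\w+$/', $lines);
$multiMode  = preg_match_all('/^\w+$/m', $lines);
```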

  23. Greediness and Modifiers • Regular expressions are greedy • They’ll keep eating characters as long as they can keep matching. • Consider: "<.*>" vs. "<[^>]*>" when matching against "<b>Hi</b>": the first matches the whole string, the second only "<b>" • PCRE has modifiers: /<pattern>/<mods> • /i = case insensitive • /U = un-greedy • /m = multi-line
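The greedy versus restricted comparison from this slide, plus the /U and /i modifiers, as a runnable sketch:

```php
<?php
$subject = "<b>Hi</b>";

// Greedy: .* runs to the last ">" in the subject.
preg_match('/<.*>/', $subject, $greedy);

// A negated class stops at the first ">".
preg_match('/<[^>]*>/', $subject, $restricted);

// The /U modifier makes quantifiers un-greedy by default.
preg_match('/<.*>/U', $subject, $ungreedy);

// /i ignores case.
$insensitive = preg_match('/hi/i', 'HI THERE');
$sensitive   = preg_match('/hi/', 'HI THERE');
```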

  24. Back References • Most commonly used in replace operations, but can be used in match patterns as well • Parentheses not only group, but capture too • Use \ followed by the number of the capture • "ab(.)\1(.)\2" will match abccdd or abxxyy, but not abcccd or abdcdc • Can get tricky to count which backref goes where with nested parentheses
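Back references in a match pattern and in a replacement. The pattern is the one from the slide; the replacement example is an added illustration. Single-quoted PHP strings are used so that \1 reaches the regex engine unchanged.

```php
<?php
$pattern = '/ab(.)\1(.)\2/';

$m1 = preg_match($pattern, 'abccdd');  // c repeated, then d repeated
$m2 = preg_match($pattern, 'abxxyy');
$m3 = preg_match($pattern, 'abcccd');  // second pair is "cd", not a repeat
$m4 = preg_match($pattern, 'abdcdc');  // first pair is "dc", not a repeat

// In a replacement string, $1 and $2 refer to the captured groups.
$swapped = preg_replace('/(\w+) (\w+)/', '$2 $1', 'hello world');
```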

  25. Modifiers for Parentheses • PCRE Only – makes some things possible that otherwise couldn’t be done • Non-capturing grouping: (?: ) • Can simplify back-reference counting • Look-ahead Assertions: • They don’t advance the matching position • Positive: (?= ), or Negative: (?! ) • Very powerful, but not always easy to understand. Trial and error can be your friend!
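A small sketch of non-capturing groups and look-ahead assertions. The subjects (a file name and user names) are invented for the example.

```php
<?php
// (?: ) groups without capturing, so the digits are capture number 1.
preg_match('/(?:foo|bar)(\d+)/', 'bar42', $m);

// Positive look-ahead: match the name only if ".php" follows,
// without including ".php" in the match.
preg_match('/\w+(?=\.php$)/', 'index.php', $base);

// Negative look-ahead: reject anything that starts with "admin".
$guest = preg_match('/^(?!admin)\w+$/', 'guest');
$admin = preg_match('/^(?!admin)\w+$/', 'administrator');
```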

  26. PCRE Specifics • www.php.net/pcre • preg_match, preg_match_all, preg_replace, preg_split, preg_grep (filter an array) • Perl RE’s have a delimiter, usually /, but can be anything: • preg_match("/foo/",$bar); • preg_match("%/usr/local/bin/%",$path);
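One call to each of the PCRE functions named on this slide. The inputs are arbitrary examples.

```php
<?php
// An alternative delimiter avoids escaping every slash in a path.
$hasPath = preg_match('%/usr/local/bin/%', '/usr/local/bin/php');

// preg_match_all returns the number of matches and collects them in $all.
$count = preg_match_all('/\d+/', 'a1b22c333', $all);

// preg_replace substitutes every match.
$squeezed = preg_replace('/\s+/', ' ', "too   many    spaces");

// preg_split splits on a pattern instead of a fixed string.
$pieces = preg_split('/[\s,]+/', "a, b  c");

// preg_grep keeps the array elements that match (original keys are preserved).
$numbers = preg_grep('/^\d+$/', ['1', 'x', '22']);
```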

  27. POSIX Specifics • www.php.net/regex • ereg, ereg_replace, split, eregi, spliti, etc. (these functions were deprecated in PHP 5.3 and removed in PHP 7.0) • [Only] Advantage over PCRE at the time of this talk: It didn’t require the PCRE library to be installed, so it was always there in any PHP installation • Other regex engines support this specification, though the Perl style seems to be more popular.

  28. Almost there… • Intro to Strings in PHP • (Feel free to tell me how fast or slow to go) • Functions relating to HTML, SQL, etc. • Regular Expressions • PCRE • POSIX • Performance/Speed considerations • Grab bag of cool string functions

  29. Performance/Speed • Rule of thumb: use the simplest function that will get the job done right • strpos instead of strstr when you only need to know whether a substring occurs • str_replace instead of preg_replace • And so forth… • The PHP manual online usually includes notes about speed differences • PCRE is faster than POSIX Regex
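The rule of thumb can be illustrated by pairing each regex call with the plain string function that gives the same answer. This sketch only shows that the results agree; it makes no timing claims, and the sample text is invented.

```php
<?php
$text = "The quick brown fox";

// Substring test: plain search versus regex.
$viaStrpos = (strpos($text, "quick") !== false);
$viaRegex  = (preg_match('/quick/', $text) === 1);

// Fixed-string replacement: plain versus regex.
$viaStrReplace  = str_replace("quick", "slow", $text);
$viaPregReplace = preg_replace('/quick/', 'slow', $text);

// A regex earns its keep only when the pattern is not a fixed string.
$collapsed = preg_replace('/\s+/', ' ', "a   b");
```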

  30. Grab Bag • md5, md5_file – Calculate md5 hashes • Widely used for passwords in databases at the time; password_hash is the safer choice today • levenshtein, similar_text – calculate the “similarity” of two strings • metaphone, soundex – calculate how similar two strings sound when spoken out loud • str_rot13 – Encryption algorithm • Protected by the DMCA
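A few grab-bag functions with well-known inputs. The word pairs are standard textbook examples rather than anything from the slides.

```php
<?php
// MD5 digest as a 32-character hex string.
$hash = md5("hello");

// Edit distance: the number of single-character edits between two strings.
$distance = levenshtein("kitten", "sitting");

// Number of matching characters according to similar_text.
$common = similar_text("World", "Word");

// Names that sound alike get the same soundex key.
$sameSound = (soundex("Robert") === soundex("Rupert"));

// ROT13 is its own inverse: applying it twice restores the input.
$rot   = str_rot13("Hello");
$unrot = str_rot13($rot);
```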

  31. Grab Bag 2 • str_shuffle – words are much more fun once they’ve been randomized • count_chars, str_word_count – statistics about your strings • strrev – if it doesn’t make sense forward, try it backwards

  32. Grand Finale • Any questions?

  33. Group Practice • 8.3 filenames - anything but zip files • /^.{0,8}(\.[^z][^i]?[^p]?)?$/i – fails filename.ftp • /^.{0,8}\.(?!zip$).{0,3}$/i – PCRE only • Sometimes easier to match rejects rather than keepers • Apache access log example: • 4.79.40.166 - - [07/Jan/2005:04:35:42 -0700] "GET /robots.txt HTTP/1.0" 404 337 "-" "Holmes/1.0" • preg_match("/^(\d{1,3}(?:\.\d{1,3}){3}) ". #IP • "- - \[(.+)\] \"\w+ (\S+) (\S+)\" (\d+) (\d+) ". • "\"-\" \"([^\"]*)\"$/",$row,$matches);
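Both exercises can be run as written below. This is a sketch under two assumptions: the look-ahead in the filename pattern is taken to mean `(?!zip$)` followed by up to three extension characters, and the group after the first IP octet is taken to be non-capturing `(?:...)`. The patterns are single-quoted so the double quotes in the log line need no escaping.

```php
<?php
// 8.3 filenames, anything but zip: reject when the extension is exactly "zip".
$notZip = '/^.{0,8}\.(?!zip$).{0,3}$/i';

$okFtp      = preg_match($notZip, 'filename.ftp');
$okZip      = preg_match($notZip, 'archive.zip');
$okZipUpper = preg_match($notZip, 'ARCHIVE.ZIP');

// The Apache access log line from the slide.
$row = '4.79.40.166 - - [07/Jan/2005:04:35:42 -0700] '
     . '"GET /robots.txt HTTP/1.0" 404 337 "-" "Holmes/1.0"';

$pattern = '/^(\d{1,3}(?:\.\d{1,3}){3}) '                   // 1: IP address
         . '- - \[(.+)\] "\w+ (\S+) (\S+)" (\d+) (\d+) '   // 2: time, 3: path, 4: protocol, 5: status, 6: bytes
         . '"-" "([^"]*)"$/';                               // 7: user agent

preg_match($pattern, $row, $matches);
echo $matches[1], " requested ", $matches[3], " (", $matches[5], ")\n";
```

With the non-capturing group, the capture numbers line up with the fields in reading order, which is what makes `$matches` easy to use afterwards.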