
Strings and Regular Expressions in PHP, or “PCRE, POSIX, and Bears, Oh My!”

UPHPU Meeting

January 18, 2005

Mac Newbold

[email protected]


Who am I?

  • Full-time self-employed computer geek

    • MNE, LLC (macnewbold.com, owner) and

    • Digital Media Consulting, LLC (a.k.a. Dmedia, www.dmedia.ws, partner)

    • Wide variety of PHP-driven web sites, mostly with MySQL and without Javascript and Flash

  • Background: B.S. C.S. ’01, M.S. C.S. ’05

    • University of Utah – Go Utes!

UPHPU - Mac Newbold


Campaign Promises

  • Intro to Strings in PHP

    • (Feel free to tell me how fast or slow to go)

  • Functions relating to HTML, SQL, etc.

  • Regular Expressions

    • PCRE

    • POSIX

  • Performance/Speed considerations

  • Grab bag of cool string functions

Introducing: Strings in PHP

  • Much like strings in any other language

    • Major difference: Boundary between string, integer, float, and boolean is very blurred

    • Actually a benefit: if it’s not a string, but should be, it will be

      • Though this can lead to some unexpected results

  • Info in PHP Manual:

    • www.php.net/strings

      • www.php.net/manual/en/language.types.string.php

String Syntax

  • Single quotes: 'a string'

    • No variable interpolation; \' and \\ are the only escape codes

  • Double quotes: "a $better string\n"

    • Variables work, standard escape codes work

  • “Here-doc” syntax: $foo = <<<END … END;

    • Great for large multi-line blocks of text or html

    • Variables are interpolated

      • Gotchas: newline must follow <<<END

      • END; must be the entire line, with no whitespace (PHP 7.3 and later also allow the closing END to be indented)
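
A minimal sketch of the three quoting styles side by side. The variable names and the sample text are made up for illustration.

```php
<?php
$better = 'quoted';

// Single quotes: no interpolation, and \n stays as two literal characters.
$single = 'a $better string\n';

// Double quotes: $better is interpolated and \n becomes a real newline.
$double = "a $better string\n";

// Heredoc: multi-line text, interpolated like a double-quoted string.
$heredoc = <<<END
A $better block
of text
END;

echo $single, "\n", $double, $heredoc, "\n";
```

The closing END; sits at the start of its own line here, which works on every PHP version.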

String Operators

  • Array-like character access:

    • $str = "MyBigString" => $str[2] == "B" (offsets start at 0; the older $str{2} form was removed in PHP 8.0)

  • Concatenation: the dot operator

    • "This lets you join strings into " . "bigger ones"

      • Note: Avoiding embedded newlines “in strings that wrap onto multiple lines” is a good idea

  • Concatenating Assignment: .=

    • $str = "My name is"; $str .= " Mac.\n";
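
A short sketch of the operators above, using square-bracket offsets since those work on all PHP versions. The variable names are only illustrative.

```php
<?php
$str = "MyBigString";

// Offsets are zero-based, so index 2 is the third character.
$third = $str[2];

// The dot operator joins strings; .= appends to an existing one.
$joined = "This lets you join strings into " . "bigger ones";
$greeting = "My name is";
$greeting .= " Mac.";

echo $third, "\n", $joined, "\n", $greeting, "\n";
```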

Variables in Strings

  • "Simple string with a $var in it\n"

  • "You can use $an_array[$var] too\n"

  • "Sometimes you need ${curl}ies to mark where the {$var}iable ends"

  • "Curlies help on {$big['fancy'][$stuff]} too"

  • "Where it's confusing to embed " . $big['ugly'][$var] . "iables, break it up as needed with concatenation."
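
The interpolation forms above, sketched with hypothetical data so the results can be inspected. The array contents are invented for the example.

```php
<?php
$var = 'key';
$an_array = ['key' => 'value'];
$big = ['fancy' => ['key' => 'nested']];

// Simple interpolation, including a one-level array lookup.
$simple  = "Simple string with a $var in it";
$indexed = "You can use $an_array[$var] too";

// Braces mark where the variable name ends.
$braced = "The {$var}s are delimited";

// Multi-level lookups need the brace form.
$nested = "Curlies help on {$big['fancy'][$var]} too";

// When in doubt, concatenate.
$concat = "Or break it up: " . $big['fancy'][$var] . ".";
```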

Must-Have String Functions

  • www.php.net/strings

  • echo/print – (print $foo)==1, echo "can", $take, "more than one", "argument";

  • Echo shortcut: <b><?=$foo?></b>

  • trim, ltrim, rtrim/chop – remove whitespace

  • explode, implode/join

    • $arr = explode(" ", "List of words");

    • $str = implode(",", $arr);
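
A quick sketch of the splitting, joining, and trimming functions just listed. The sample strings are arbitrary.

```php
<?php
// Split on a delimiter, then join with a different one.
$arr = explode(" ", "List of words");
$str = implode(",", $arr);

// trim() strips whitespace from both ends; rtrim() only from the right.
$clean = trim("  padded \n");
$right = rtrim("line\n");

print_r($arr);
echo $str, "\n";
```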

Obligatory C-like Functions

  • All your old favorites are in there:

    • printf, sprintf, sscanf, fprintf

    • strcmp, strlen, strpos, strtok

  • They all do just what you expect, though many of them have easier alternatives

  • Gotcha: Some of them (like strpos and friends) return boolean false when nothing is found, and 0 is a valid result (a match at the very start of the string). Always compare with “=== false”.
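
The strpos gotcha is easiest to see with a match at offset 0. A small sketch, with an arbitrary haystack:

```php
<?php
$haystack = "foobar";

// "foo" is found at offset 0, which is an integer, not false.
$pos = strpos($haystack, "foo");

// Loose comparison treats 0 as false and reports a miss that never happened.
$looseSaysMissing = ($pos == false);

// Strict comparison distinguishes int 0 from boolean false.
$strictSaysMissing = ($pos === false);

// A genuine miss returns boolean false.
$missing = strpos($haystack, "baz");
```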

Basic String Manipulation

  • Any of this can be done with regular expressions as well…

    • and in more complex cases, can only be done with regular expressions

      • But regular expressions are slower (more later)

  • str_replace("bar", "baz", "foobar");

  • str_repeat("1234567890", 8);
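
A sketch of the two calls above, plus the regular-expression equivalent of the first for comparison. The preg_replace line is added only to illustrate the "can be done with regular expressions as well" point.

```php
<?php
// Plain substring replacement.
$replaced = str_replace("bar", "baz", "foobar");

// Repeat a string; eight copies of ten digits make an 80-column ruler.
$ruler = str_repeat("1234567890", 8);

// Same replacement with a regular expression: works, but is more machinery.
$viaRegex = preg_replace('/bar/', 'baz', 'foobar');
```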

Formatting functions

  • strtolower, strtoupper

  • ucfirst, ucwords – uppercase first char, or first char of each word

  • wordwrap – wrap text to a given width

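  • str_pad("tooshort", 15, " ");

  • vprintf, vfprintf, vsprintf – formatted output, with the arguments passed as an array

  • number_format – add thousands grouping

  • money_format – format as currency (removed in PHP 8.0; the intl extension's NumberFormatter is the usual replacement)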
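
A few of the formatting functions in action. The inputs are arbitrary, and money_format is left out because it no longer exists in current PHP.

```php
<?php
// Case conversion.
$title = ucwords(strtolower("HELLO WORLD"));

// Pad on the right (the default) and on the left.
$padded = str_pad("tooshort", 15, " ");
$zeros  = str_pad("7", 3, "0", STR_PAD_LEFT);

// Thousands grouping with two decimals.
$num = number_format(1234567.891, 2);

// printf-style formatting: total width 5, zero padded, one decimal.
$fixed = sprintf("%05.1f", 3.14159);

// vsprintf takes its arguments as an array.
$viaArray = vsprintf("%s has %d items", ["cart", 3]);

// Wrap at 10 columns.
$wrapped = wordwrap("The quick brown fox", 10, "\n", true);
```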

Special-Purpose Functions

  • One of PHP’s strengths is the way it caters to the common things people need

  • Many string functions are specifically for use with things like dates/times, URLs, HTML, and SQL databases

  • Advice: When you need them, use them. “Rolling your own” doesn’t usually work out the way you plan it.

Date and Time Functions

  • www.php.net/datetime

  • A variety of functions to not only do calculations with dates, but to convert dates to strings – date(), strftime()

    • And more importantly, to convert strings to dates – strtotime(), strptime()

  • Great example of why not to “roll your own”, even if it doesn’t seem that complex at first
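
A rough sketch of the round trip between strings and timestamps. The date is the meeting date from the title slide; the time of day is invented, and UTC is spelled out so the result does not depend on the server's timezone.

```php
<?php
// String to timestamp.
$ts = strtotime('2005-01-18 19:00:00 UTC');

// Timestamp back to strings; gmdate() is date() fixed to UTC.
$iso = gmdate('Y-m-d H:i', $ts);
$day = gmdate('l', $ts);

// Relative expressions are the part that is painful to roll yourself.
$nextWeek = gmdate('Y-m-d', strtotime('+1 week', $ts));

echo "$day $iso, next meeting $nextWeek\n";
```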

URL Functions

  • www.php.net/url

  • urlencode, urldecode

    • Turn non-alphanumerics (other than - _ .) into %[hex] and ' ' -> '+'

    • rawurl{en,de}code do the same, except that a space becomes %20 rather than '+'

  • parse_url – break into host, path, query, etc.

  • http_build_query – turn array to URL query

  • base64_{en,de}code – base64 conversions for use with MIME, etc.
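
A sketch of the URL helpers above. The example.com address and the query values are placeholders, not anything from the talk.

```php
<?php
// urlencode() uses + for spaces; rawurlencode() uses %20.
$enc = urlencode("a b&c=d/e");
$raw = rawurlencode("a b&c=d/e");

// Break a URL into its components.
$parts = parse_url("http://www.example.com/path/page.php?x=1&y=2");

// Build a query string from an array (default separator and encoding assumed).
$query = http_build_query(['x' => 1, 'y' => 'two words']);

// Base64 round trip.
$b64 = base64_encode("Hello");
$plain = base64_decode($b64);
```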

HTML Functions

  • htmlspecialchars – encode &, ", <, and > as &amp;, &quot;, &lt;, and &gt;

    • htmlentities is the same, but for every character that has a named entity

    • html_entity_decode is the reverse

  • nl2br – insert a <br /> tag before each newline (\n)

  • parse_str – parse GET query into variables or an array (see also: extract)

  • strip_tags – strip html tags [selectively]
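
The HTML helpers, sketched on a made-up snippet. ENT_QUOTES is passed explicitly so the result does not depend on which default flags a given PHP version uses.

```php
<?php
$input = '<b>"Tom" & Jerry</b>';

// Encode the characters that are special in HTML.
$safe = htmlspecialchars($input, ENT_QUOTES);

// Remove the tags entirely.
$text = strip_tags($input);

// Add a <br /> before each newline; the newline itself is kept.
$html = nl2br("one\ntwo");

// Parse a query string into an array (second argument receives the result).
parse_str("a=1&b[]=2&b[]=3", $vars);
```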

SQL Functions

  • “Magic Quotes” – on by default in PHP 4 and early PHP 5 (the feature was removed in PHP 5.4)

    • Misnamed – adds magic slashes, not quotes

  • addslashes, stripslashes – escape ', ", and \

    • Advice: do db queries first, then use $var = htmlspecialchars(stripslashes($input), ENT_QUOTES) for use in <input value='$var'> tags

  • quotemeta – escape . \ + * ? [ ^ ] ( $ )

    • Sometimes used for commands: system() and `backticks`, though escapeshellarg and escapeshellcmd are the functions meant for that job

Now for the fun stuff…

  • Intro to Strings in PHP

    • (Feel free to tell me how fast or slow to go)

  • Functions relating to HTML, SQL, etc.

  • Regular Expressions

    • PCRE

    • POSIX

  • Performance/Speed considerations

  • Grab bag of cool string functions

Regular Expressions

  • Extremely powerful tool for pattern matching – the same kind of matching that compilers and interpreters use to tokenize your programs

  • Two flavors in PHP:

    • PCRE – Perl-Compatible Regular Expressions

    • POSIX Extended

  • I favor PCRE – multiple languages, more features, faster, and binary-safe

Basics of RE’s

  • They match patterns – the magic is in the pattern you tell them to match

  • They have to be precise, including and excluding exactly what you want

  • People get scared of them because the details can be tricky

  • But they’re one of the best tools you have for doing some pretty fancy string stuff

RE Patterns

  • Start with strings and grouping: "abc(def)"

  • Add alternative branches: "abc(def|123)"

  • Wildcard: . matches any char but \n

  • Quantifiers/Repeating:

    • * = “0 or more”, + = “1 or more”, ? = “0 or 1”

    • {n} = “n times”, {n,m} = “n to m times”

  • "(abc)+(def|123)*(.{2})*"

    • At least one abc, maybe some def or 123 triplets, then an even number of characters
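
A sketch that tries the last pattern with preg_match. The ^ and $ anchors are an addition here so that the whole string has to fit the description; the test strings are invented.

```php
<?php
$pattern = '/^(abc)+(def|123)*(.{2})*$/';

// Two abc's, two triplets, then one pair of characters.
$full = preg_match($pattern, 'abcabcdef123xy');

// One abc followed by a single pair.
$pair = preg_match($pattern, 'abcde');

// A lone trailing character breaks the "even number" rule.
$odd = preg_match($pattern, 'abcx');

// No leading abc at all.
$noAbc = preg_match($pattern, 'defxy');
```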

Character Classes and Types

  • [] makes character classes

  • List of characters and ranges: [a-zA-Z0-9]

    • If you want to use -, put it at the beginning

    • Escape any special chars with \ as usual

    • If first char is ^, class is negated

  • \d = [0-9], \D = [^0-9]

  • \s = whitespace, \S = non-whitespace

  • \w = [a-zA-Z0-9_], \W = [^a-zA-Z0-9_]

  • \b = word boundary – “zero-width assertion”
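
The classes and shorthand types above, sketched with PCRE. The sample strings are arbitrary.

```php
<?php
// \w covers letters, digits, and underscore.
$isWord = preg_match('/^\w+$/', 'user_42');

// \d finds nothing in a string without digits.
$hasDigit = preg_match('/\d/', 'no digits here');

// An explicit list of ranges.
$isHex = preg_match('/^[0-9a-fA-F]+$/', 'C0FFEE');

// A leading ^ negates the class: no lowercase vowels allowed.
$noVowels = preg_match('/^[^aeiou]+$/', 'rhythm');

// \s+ splits on any run of whitespace.
$words = preg_split('/\s+/', "split   on\twhitespace");

// \b is zero-width: "cat" inside "concatenate" has no boundary around it.
$wholeWord = preg_match('/\bcat\b/', 'concatenate');
```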

Anchors

  • What if you want to force it to match only at the beginning of the string? Or to match the entire string?

  • Use an anchor!

  • ^ as the first char anchors the beginning

  • $ as the last char anchors the end

  • (Varies slightly in multi-line mode)
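
A small sketch of how anchors change what matches, including the multi-line case. The strings are made up.

```php
<?php
// Unanchored: matches anywhere in the string.
$anywhere = preg_match('/foo/', 'a foo b');

// ^ requires the match to start at the beginning.
$atStart = preg_match('/^foo/', 'a foo b');

// ^ and $ together require the whole string to match.
$whole = preg_match('/^foo$/', 'foo');

// Without /m, ^ and $ refer to the whole subject, so the middle line is missed.
$noMulti = preg_match('/^foo$/', "bar\nfoo\nbaz");

// With /m they also match at line breaks.
$multi = preg_match('/^foo$/m', "bar\nfoo\nbaz");
```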

Greediness and Modifiers

  • Regular Expressions are Greedy

    • They’ll keep eating characters as long as they can keep matching.

    • Consider: "<.*>" vs. "<[^>]*>" when matching against "<b>Hi</b>"

  • PCRE has modifiers: /<pattern>/<mods>

    • /i = case insensitive

    • /U = un-greedy

    • /m = multi-line
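
The greedy comparison from this slide, run through preg_match. The lazy quantifier .*? is an extra shown for comparison; it is standard PCRE but was not on the slide.

```php
<?php
$html = '<b>Hi</b>';

// Greedy: .* runs to the last > in the string.
preg_match('/<.*>/', $html, $greedy);

// A negated class stops at the first >.
preg_match('/<[^>]*>/', $html, $class);

// /U flips the default so quantifiers are un-greedy.
preg_match('/<.*>/U', $html, $ungreedy);

// A trailing ? makes just that one quantifier lazy.
preg_match('/<.*?>/', $html, $lazy);

// /i ignores case.
$ci = preg_match('/hi/i', $html);
```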

Back References

  • Most commonly used in replace operations, but can be used in match patterns as well

  • Parentheses not only group, but capture too

  • Use \ followed by the number of the capture

  • "ab(.)\1(.)\2" will match abccdd or abxxyy, but not abcccd or abdcdc

  • Can get tricky to count which backref goes where with nested parentheses
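
A sketch checking the back-reference pattern against the four strings above, plus one replace operation. The name-swapping example is an illustration, not from the slides.

```php
<?php
// Single quotes keep \1 and \2 as literal backslash sequences for PCRE.
$pattern = '/ab(.)\1(.)\2/';

$results = [];
foreach (['abccdd', 'abxxyy', 'abcccd', 'abdcdc'] as $s) {
    $results[$s] = preg_match($pattern, $s);
}

// In replacements, captures are referenced as $1, $2, and so on.
$swapped = preg_replace('/(\w+), (\w+)/', '$2 $1', 'Newbold, Mac');

print_r($results);
```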

Modifiers for Parentheses

  • PCRE Only – makes some things possible that otherwise couldn’t be done

  • Non-capturing grouping: (?: )

    • Can simplify back-reference counting

  • Look-ahead Assertions:

    • They don’t advance the matching position

    • Positive: (?= ), or Negative: (?! )

    • Very powerful, but not always easy to understand. Trial and error can be your friend!
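
A minimal sketch of the parenthesis modifiers, with invented sample strings. Note that the look-ahead text is checked but never becomes part of the match.

```php
<?php
// (?: ) groups the alternatives without using up capture number 1.
preg_match('/(?:Mr|Ms)\. (\w+)/', 'Ms. Smith', $m);

// Positive look-ahead: foo only when bar follows; bar is not consumed.
$pos = preg_match('/foo(?=bar)/', 'foobar', $p);

// Negative look-ahead: foo only when bar does not follow.
$negBar = preg_match('/foo(?!bar)/', 'foobar');
$negBaz = preg_match('/foo(?!bar)/', 'foobaz');
```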

PCRE Specifics

  • www.php.net/pcre

  • preg_match, preg_match_all, preg_replace, preg_split, preg_grep (filter an array)

  • Perl RE’s have a delimiter, usually /, but can be anything:

    • preg_match("/foo/", $bar);

    • preg_match("%/usr/local/bin/%", $path);
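
One call to each of the preg_ functions listed above, as a rough sketch. All inputs are invented.

```php
<?php
// preg_match_all returns the number of matches and fills $all[0] with them.
$count = preg_match_all('/\d+/', 'a1 b22 c333', $all);

// Collapse runs of whitespace.
$replaced = preg_replace('/\s+/', ' ', "too    many   spaces");

// Split on commas or semicolons, swallowing any following spaces.
$parts = preg_split('/[,;]\s*/', 'a, b;c');

// Filter an array; the original keys are preserved.
$php = preg_grep('/\.php$/', ['index.php', 'style.css', 'lib.php']);

// A different delimiter avoids escaping every slash in a path.
$found = preg_match('%/usr/local/bin/%', '/usr/local/bin/php');
```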

POSIX Specifics

  • www.php.net/regex

  • ereg, ereg_replace, split, eregi, spliti, etc. (deprecated in PHP 5.3 and removed in PHP 7.0)

  • [Only] Advantage over PCRE: It doesn’t require the PCRE library to be installed, so it’s always there in any PHP installation

  • Other regex engines support this specification, though the Perl style seems to be more popular.

Almost there…

  • Intro to Strings in PHP

    • (Feel free to tell me how fast or slow to go)

  • Functions relating to HTML, SQL, etc.

  • Regular Expressions

    • PCRE

    • POSIX

  • Performance/Speed considerations

  • Grab bag of cool string functions

Performance/Speed

  • Rule of thumb: use the simplest function that will get the job done right

    • strpos instead of strstr or a regex, when you only need to know whether a substring is present

    • str_replace instead of preg_replace

    • And so forth…

    • The PHP manual online usually includes notes about speed differences

  • PCRE is faster than POSIX Regex

Grab Bag

  • md5, md5_file – Calculate md5 hashes

    • Handy for checksums; for passwords in databases, password_hash (PHP 5.5+) is the safer choice today

  • levenshtein, similar_text – calculate the “similarity” of two strings

  • metaphone, soundex – calculate how similar two strings sound when spoken out loud

  • str_rot13 – Encryption algorithm

    • Protected by the DMCA
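
A sketch of a few grab-bag functions. The word pairs are common textbook examples rather than anything from the talk.

```php
<?php
// Edit distance: number of single-character changes between two strings.
$distance = levenshtein('kitten', 'sitting');

// Names that sound alike get the same soundex key.
$sameSound = soundex('Robert') === soundex('Rupert');

// rot13 is its own inverse: applying it twice restores the input.
$rot  = str_rot13('Hello');
$back = str_rot13($rot);

// md5 always yields 32 hexadecimal characters.
$sum = md5('hello');

echo "$distance $rot $back $sum\n";
```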

Grab Bag 2

  • str_shuffle – words are much more fun once they’ve been randomized

  • count_chars, str_word_count – statistics about your strings

  • strrev – if it doesn’t make sense forward, try it backwards

Grand Finale

  • Any questions?

Group Practice

  • 8.3 filenames - anything but zip files

    • /^.{0,8}(\.[^z][^i]?[^p]?)?$/i – fails: wrongly rejects filename.ftp

    • /^.{0,8}\.(?!zip$).{0,3}$/i – negative look-ahead, PCRE only

    • Sometimes easier to match rejects rather than keepers

  • Apache access log example:

    • 4.79.40.166 - - [07/Jan/2005:04:35:42 -0700] "GET /robots.txt HTTP/1.0" 404 337 "-" "Holmes/1.0"

    • preg_match("/^(\d{1,3}(?:\.\d{1,3}){3}) ". #IP

    • "- - \[(.+)\] \"\w+ (\S+) (\S+)\" (\d+) (\d+) ".

    • "\"-\" \"([^\"]*)\"$/",$row,$matches);
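
Both practice problems, sketched as runnable code. The filename pattern is one possible look-ahead solution, not necessarily the one reached in the meeting, and the log pattern is written in single quotes so that only the regex escapes remain. The test filenames are invented; the log line is the one from the slide.

```php
<?php
// 8.3 filenames, anything but .zip: the look-ahead rejects "zip" as the
// complete extension, then up to three extension characters are consumed.
$notZip = '/^.{0,8}\.(?!zip$).{0,3}$/i';
$ftp  = preg_match($notZip, 'filename.ftp');
$zip  = preg_match($notZip, 'ARCHIVE.ZIP');
$long = preg_match($notZip, 'toolongname.txt');

// Apache access log line.
$row = '4.79.40.166 - - [07/Jan/2005:04:35:42 -0700] '
     . '"GET /robots.txt HTTP/1.0" 404 337 "-" "Holmes/1.0"';

$log = '/^(\d{1,3}(?:\.\d{1,3}){3}) '                      // IP address
     . '- - \[(.+)\] "\w+ (\S+) (\S+)" (\d+) (\d+) '       // date, path, protocol, status, size
     . '"-" "([^"]*)"$/';                                  // user agent

$ok = preg_match($log, $row, $matches);
print_r($matches);
```

The non-capturing (?: ) group keeps the IP address as capture 1 and the date as capture 2, which is easier to count than with a nested capturing group.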
