
Strings and Regular Expressions in PHP, or “PCRE, POSIX, and Bears, Oh My!”

UPHPU Meeting

January 18, 2005

Mac Newbold

mac@macnewbold.com



Who am I?

  • Full-time self-employed computer geek

    • MNE, LLC (macnewbold.com, owner) and

    • Digital Media Consulting, LLC (a.k.a. Dmedia, www.dmedia.ws, partner)

    • Wide variety of PHP-driven web sites, mostly with MySQL and without Javascript and Flash

  • Background: B.S. C.S. ’01, M.S. C.S. ’05

    • University of Utah – Go Utes!

UPHPU - Mac Newbold



Campaign Promises

  • Intro to Strings in PHP

    • (Feel free to tell me how fast or slow to go)

  • Functions relating to HTML, SQL, etc.

  • Regular Expressions

    • PCRE

    • POSIX

  • Performance/Speed considerations

  • Grab bag of cool string functions


Introducing: Strings in PHP

  • Much like strings in any other language

    • Major difference: Boundary between string, integer, float, and boolean is very blurred

    • Actually a benefit: if it’s not a string, but should be, it will be

      • Though this can lead to some unexpected results

  • Info in PHP Manual:

    • www.php.net/strings

      • www.php.net/manual/en/language.types.string.php
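
The blurred boundary between types can be seen in a few lines. This is a minimal sketch; the values are made up for illustration.

```php
<?php
// A numeric string is converted to a number when used in arithmetic.
$sum = "5" + 3;              // int 8

// A number is converted to a string when concatenated.
$label = "Total: " . 42;     // string "Total: 42"

// Explicit casts make the conversion visible.
$asString = (string) 3.5;    // "3.5"
$asInt    = (int) "42abc";   // 42: only the leading digits are used

var_dump($sum, $label, $asString, $asInt);
```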


String Syntax

  • Single quotes: 'a string'

    • No variable interpolation; \' and \\ are the only escape codes

  • Double quotes: "a $better string\n"

    • Variables work, standard escape codes work

  • “Here-doc” syntax: $foo = <<<END … END;

    • Great for large multi-line blocks of text or html

    • Variables are interpolated

      • Gotchas: a newline must follow <<<END

      • END; must be the entire line, with no whitespace around it (PHP 7.3 and later allow the closing END to be indented)
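
The three syntaxes side by side, as a small sketch. The variable $name and the HTML text are placeholders.

```php
<?php
$name = 'Mac';

$single = 'Hello $name\n';   // no interpolation; \n stays as two characters
$double = "Hello $name\n";   // $name is replaced; \n becomes a newline

// Heredoc: interpolates like double quotes; the closing marker sits alone
// on its own line, starting in the first column.
$heredoc = <<<END
<p>Hello $name</p>
<p>Multi-line HTML without escaping "quotes".</p>
END;

echo $single, "\n", $double, $heredoc, "\n";
```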


String Operators

  • Array-like character access:

    • $str = "MyBigString" => $str[2] == "B" (offsets start at 0; the older $str{2} form was removed in PHP 8.0)

  • Concatenation: the dot operator

    • "This lets you join strings into " . "bigger ones"

      • Note: Avoiding embedded newlines “in strings that wrap onto multiple lines” is a good idea

  • Concatenating Assignment: .=

    • $str = "My name is"; $str .= " Mac.\n";
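
A short runnable sketch of the three operators, using square brackets for character access:

```php
<?php
$str = "MyBigString";

// Offsets are zero-based: M=0, y=1, B=2.
$third = $str[2];

// The dot operator joins strings.
$joined = "This lets you join strings into " . "bigger ones";

// .= appends to an existing string.
$intro = "My name is";
$intro .= " Mac.\n";

echo $third, "\n", $joined, "\n", $intro;
```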


Variables in Strings

  • "Simple string with a $var in it\n"

  • "You can use $an_array[$var] too\n"

  • "Sometimes you need ${curl}ies to mark where the {$var}iable ends"

    • The ${curl} form is deprecated as of PHP 8.2; {$curl} works everywhere

  • "Curlies help on {$big['fancy'][$stuff]} too"

  • "Where it's confusing to embed " . $big['ugly'][$var] . "iables, break it up as needed with concatenation."
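
The interpolation forms above, as a sketch with made-up data. The names $fruit, $an_array and $big are only for illustration.

```php
<?php
$var      = 'key';
$an_array = ['key' => 'value'];
$big      = ['fancy' => ['key' => 'nested']];
$fruit    = 'apple';

$simple = "Simple string with a $var in it";
$array  = "You can use $an_array[$var] too";
$plural = "Two {$fruit}s";                        // braces mark where the name ends
$deep   = "Curlies help on {$big['fancy'][$var]} too";
$concat = "Or break it up: " . $big['fancy'][$var] . ".";

echo $simple, "\n", $array, "\n", $plural, "\n", $deep, "\n", $concat, "\n";
```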


Must-Have String Functions

  • www.php.net/strings

  • echo/print – (print $foo)==1, echo "can", $take, "more than one", "argument";

  • Echo shortcut: <b><?=$foo?></b>

  • trim, ltrim, rtrim/chop – remove whitespace

  • explode, implode/join

    • $arr = explode(" ", "List of words");

    • $str = implode(",", $arr);
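
A quick sketch of these functions in use; the sample strings are arbitrary.

```php
<?php
$clean = trim("  padded text \n");        // whitespace removed from both ends
$left  = ltrim("  padded");               // from the left only
$right = rtrim("line\n");                 // from the right only

$arr = explode(" ", "List of words");     // split on single spaces
$str = implode(",", $arr);                // join with commas

// print always returns 1; echo accepts several arguments.
$result = print "printed\n";
echo "can", " take ", "more than one", " argument\n";
```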


Obligatory C-like Functions

  • All your old favorites are in there:

    • printf, sprintf, sscanf, fprintf

    • strcmp, strlen, strpos, strtok

  • They all do just what you expect, though many of them have easier alternatives

  • Gotcha: Some of them (like strpos and friends) return boolean false when nothing is found, because 0 is a valid result (a match at the very start of the string). Always compare with “=== false”.
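
The gotcha is easiest to see with a match at offset 0. A minimal sketch:

```php
<?php
$haystack = "PHP strings";

$pos     = strpos($haystack, "PHP");   // int 0: found at the very start
$missing = strpos($haystack, "Perl");  // bool false: not found

// Wrong: 0 compares equal to false, so a match at offset 0 looks like a miss.
$looseSaysFound = ($pos != false);

// Right: !== keeps 0 and false distinct.
$strictSaysFound = ($pos !== false);

var_dump($looseSaysFound, $strictSaysFound);
```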


Basic String Manipulation

  • Any of this can be done with regular expressions as well…

    • and in more complex cases, can only be done with regular expressions

      • But regular expressions are slower (more later)

  • str_replace("bar","baz","foobar");

  • str_repeat("1234567890",8);
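
What the two calls on this slide produce, as a small sketch:

```php
<?php
// Replace every occurrence of "bar" with "baz" in "foobar".
$replaced = str_replace("bar", "baz", "foobar");

// Repeat a ten-character ruler eight times to get an 80-column line.
$ruler = str_repeat("1234567890", 8);

echo $replaced, "\n", $ruler, "\n";
```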


Formatting functions

  • strtolower, strtoupper

  • ucfirst, ucwords – uppercase first char, or first char of each word

  • wordwrap – wrap text to a given width

  • str_pad("tooshort",15," ");

  • vprintf, vfprintf, vsprintf – formatted output, with the arguments passed as an array

  • number_format – add thousands grouping

  • money_format – format as currency (deprecated in PHP 7.4 and removed in PHP 8.0)
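
A sketch of the formatting functions that are still available in current PHP; the inputs are arbitrary.

```php
<?php
$lower   = strtolower("Hello World");
$upper   = strtoupper("Hello World");
$first   = ucfirst("hello world");                 // first character only
$words   = ucwords("hello world");                 // first character of each word
$padded  = str_pad("tooshort", 15, " ");           // right-padded to 15 characters
$wrapped = wordwrap("The quick brown fox", 10, "\n");
$price   = sprintf("%.2f", 3.14159);               // two decimal places
$args    = vsprintf("%s is %d years old", ["Tom", 30]);
$grouped = number_format(1234567.891, 2);          // thousands separators, 2 decimals

echo $wrapped, "\n", $price, "\n", $args, "\n", $grouped, "\n";
```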


Special-Purpose Functions

  • One of PHP’s strengths is the way it caters to the common things people need

  • Many string functions are specifically for use with things like dates/times, URLs, HTML, and SQL databases

  • Advice: When you need them, use them. “Rolling your own” doesn’t usually work out the way you plan it.


Date and Time Functions

  • www.php.net/datetime

  • A variety of functions to not only do calculations with dates, but to convert dates to strings – date(), strftime()

    • And more importantly, to convert strings to dates – strtotime(), strptime()

  • Great example of why not to “roll your own”, even if it doesn’t seem that complex at first
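
A small sketch of both directions. The 19:00 start time is invented for the example, and the time zone is pinned to UTC so the output does not depend on the server.

```php
<?php
date_default_timezone_set('UTC');  // fixed zone so the result is reproducible

// String to timestamp: strtotime understands many English date formats.
$meeting  = strtotime('2005-01-18 19:00:00');   // hypothetical meeting time
$nextWeek = strtotime('+1 week', $meeting);     // relative to the first timestamp

// Timestamp to string: date() formats it.
$formatted = date('l, F j, Y', $meeting);
$iso       = date('Y-m-d H:i', $nextWeek);

echo $formatted, "\n", $iso, "\n";
```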


URL Functions

  • www.php.net/url

  • urlencode, urldecode

    • Turn non-alphanumerics (other than - _ .) into %[hex] and ' ' -> '+'

    • rawurl{en,de}code do the same, except that a space becomes %20 instead of '+'

  • parse_url – break into host, path, query, etc.

  • http_build_query – turn array to URL query

  • base64_{en,de}code – base64 conversions for use with MIME, etc.
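
A sketch of the URL helpers; www.example.com and the query parameters are placeholders.

```php
<?php
$encoded = urlencode("a b&c=d/e");     // space becomes +, the rest %XX
$raw     = rawurlencode("a b");        // space becomes %20
$decoded = urldecode($encoded);        // round trip

// Break a URL into its components.
$parts = parse_url("http://www.example.com/path/page.php?x=1&y=2");

// Build a query string from an array; the separator is passed explicitly
// so the result does not depend on php.ini settings.
$query = http_build_query(['x' => 1, 'y' => 'a b'], '', '&');

$b64 = base64_encode("Hello");

echo $encoded, "\n", $raw, "\n", $parts['host'], "\n", $query, "\n", $b64, "\n";
```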


HTML Functions

  • htmlspecialchars – encode &, ", <, and > as &amp;, &quot;, &lt;, and &gt;

    • htmlentities is the same, but for every character that has a named HTML entity

    • html_entity_decode is the reverse

  • nl2br – insert a <br /> tag before each newline (\n)

  • parse_str – parse GET query into variables or an array (see also: extract)

  • strip_tags – strip html tags [selectively]
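
The HTML helpers in action, as a sketch with made-up markup. ENT_QUOTES is passed explicitly so that the behavior is the same across PHP versions.

```php
<?php
$html = '<a href="x">Tom & Jerry</a>';

$safe = htmlspecialchars($html, ENT_QUOTES);     // safe to print inside a page
$back = html_entity_decode($safe, ENT_QUOTES);   // reverses the encoding

$lines = nl2br("one\ntwo");                      // adds <br /> before the newline

// Strip all tags except <b>.
$text = strip_tags("<b>Hi</b> <i>there</i>", "<b>");

// Parse a query string into an array.
parse_str("a=1&b=2", $params);

echo $safe, "\n", $lines, "\n", $text, "\n";
```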


SQL Functions

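  • “Magic Quotes” – on by default at the time of this talk (the feature was removed in PHP 5.4)

    • Misnamed – adds magic slashes, not quotes

  • addslashes, stripslashes – escape ', ", and \

    • Advice: do db queries first, then use $var = htmlspecialchars(stripslashes($input), ENT_QUOTES) for use in <input value='$var'> tags (ENT_QUOTES makes sure ' is encoded too)

  • quotemeta – escape . \ + * ? [ ^ ] ( $ )

    • Not enough on its own for shell commands (system() and `backticks`); escapeshellarg and escapeshellcmd exist for that purpose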
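
A sketch of the escaping functions named on this slide. The input string is invented, and no database is involved.

```php
<?php
$input = "O'Reilly said \"hi\"";

$escaped  = addslashes($input);        // backslash before ', " and \
$restored = stripslashes($escaped);    // back to the original

// For output inside an HTML attribute, encode both kinds of quote.
$forForm = htmlspecialchars($restored, ENT_QUOTES);

// quotemeta puts a backslash before regex-style metacharacters.
$meta = quotemeta("1+1=2?");

echo $escaped, "\n", $forForm, "\n", $meta, "\n";
```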


Now for the fun stuff…

  • Intro to Strings in PHP

    • (Feel free to tell me how fast or slow to go)

  • Functions relating to HTML, SQL, etc.

  • Regular Expressions

    • PCRE

    • POSIX

  • Performance/Speed considerations

  • Grab bag of cool string functions


Regular Expressions

  • Extremely powerful tool for pattern matching – the same technique compilers and interpreters use to break your programs into tokens

  • Two flavors in PHP:

    • PCRE – Perl-Compatible Regular Expressions

    • POSIX Extended

  • I favor PCRE – multiple languages, more features, faster, and binary-safe


Basics of RE’s

  • They match patterns – the magic is in the pattern you tell them to match

  • They have to be precise, including and excluding exactly what you want

  • People get scared of them because the details can be tricky

  • But they’re one of the best tools you have for doing some pretty fancy string stuff


RE Patterns

  • Start with strings and grouping: "abc(def)"

  • Add alternative branches: "abc(def|123)"

  • Wildcard: . matches any char but \n

  • Quantifiers/Repeating:

    • * = “0 or more”, + = “1 or more”, ? = “0 or 1”

    • {n} = “n times”, {n,m} = “n to m times”

  • "(abc)+(def|123)*(.{2})*"

    • At least one abc, maybe some def or 123 triplets, then an even number of characters
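
The combined pattern can be tried with preg_match. In this sketch the ^ and $ anchors are an addition, so that the whole string has to fit the pattern; the test strings are made up.

```php
<?php
// Anchored version of the slide's pattern (the anchors are added here).
$pattern = '/^(abc)+(def|123)*(.{2})*$/';

$tests = ['abcabcdef12xy', 'abcxy', 'abcx', 'xyz'];
$results = [];
foreach ($tests as $t) {
    // preg_match returns 1 on a match and 0 otherwise.
    $results[$t] = preg_match($pattern, $t);
}

print_r($results);
```

"abcx" fails because one leftover character is not an even count, and "xyz" fails because it does not start with abc.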


Character Classes and Types

  • [] makes character classes

  • List of characters and ranges: [a-zA-Z0-9]

    • If you want to use -, put it at the beginning

    • Escape any special chars with \ as usual

    • If first char is ^, class is negated

  • \d = [0-9], \D = [^0-9]

  • \s = whitespace, \S = non-whitespace

  • \w = [a-zA-Z0-9_], \W = [^a-zA-Z0-9_]

  • \b = word boundary – “zero-width assertion”
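
A few character classes in use, as a sketch with arbitrary sample strings:

```php
<?php
// A class with a literal - placed first, as suggested above.
$isUser = preg_match('/^[-a-zA-Z0-9_]+$/', 'mac-newbold_1');

// A negated class: is there any non-alphanumeric character?
$hasOther = preg_match('/[^a-zA-Z0-9]/', 'abc!');

// \d with quantifiers.
$isPhone = preg_match('/^\d{3}-\d{4}$/', '555-1234');

// \b keeps "cat" from matching inside "concat".
$catCount = preg_match_all('/\bcat\b/', 'cat concat cat.', $m);

// \s+ splits on any run of whitespace.
$words = preg_split('/\s+/', "a b\tc");

var_dump($isUser, $hasOther, $isPhone, $catCount, $words);
```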


Anchors

  • What if you want to force it to match only at the beginning of the string? Or to match the entire string?

  • Use an anchor!

  • ^ as the first char anchors the beginning

  • $ as the last char anchors the end

  • (Varies slightly in multi-line mode)
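
Anchors in a short sketch, including the multi-line case:

```php
<?php
$start = preg_match('/^foo/', 'foobar');    // foo at the beginning
$notAtStart = preg_match('/^bar/', 'foobar');
$end   = preg_match('/bar$/', 'foobar');    // bar at the end
$whole = preg_match('/^foo$/', 'foobar');   // must be exactly "foo"

// With /m, ^ and $ match at the start and end of every line.
$lineCount = preg_match_all('/^\w+$/m', "one\ntwo\nthree", $m);

var_dump($start, $notAtStart, $end, $whole, $lineCount);
```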


Greediness and Modifiers

  • Regular Expressions are Greedy

    • They’ll keep eating characters as long as they can keep matching.

    • Consider: "<.*>" vs. "<[^>]*>" when matching against "<b>Hi</b>": the first matches the whole string, the second only "<b>"

  • PCRE has modifiers: /<pattern>/<mods>

    • /i = case insensitive

    • /U = un-greedy

    • /m = multi-line
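
The greedy and un-greedy variants can be compared directly. A minimal sketch:

```php
<?php
$subject = '<b>Hi</b>';

preg_match('/<.*>/', $subject, $greedy);       // .* runs to the last >
preg_match('/<[^>]*>/', $subject, $negated);   // stops at the first >
preg_match('/<.*>/U', $subject, $ungreedy);    // /U flips greediness
preg_match('/<.*?>/', $subject, $lazy);        // ? after * makes it lazy

$caseInsensitive = preg_match('/<B>/i', $subject);

echo $greedy[0], "\n", $negated[0], "\n", $ungreedy[0], "\n", $lazy[0], "\n";
```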


Back References

  • Most commonly used in replace operations, but can be used in match patterns as well

  • Parentheses not only group, but capture too

  • Use \ followed by the number of the capture

  • "ab(.)\1(.)\2" will match abccdd or abxxyy, but not abcccd or abdcdc

  • Can get tricky to count which backref goes where with nested parentheses
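
Back references in a match pattern and in a replacement, as a sketch. The "hello world" swap is an added illustration, not from the slides.

```php
<?php
// In a pattern: \1 and \2 must repeat what groups 1 and 2 captured.
$pattern = '/ab(.)\1(.)\2/';
$results = [];
foreach (['abccdd', 'abxxyy', 'abcccd', 'abdcdc'] as $s) {
    $results[$s] = preg_match($pattern, $s);
}

// In a replacement: $1 and $2 refer to the captured groups.
$swapped = preg_replace('/(\w+) (\w+)/', '$2 $1', 'hello world');

print_r($results);
echo $swapped, "\n";
```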


Modifiers for Parentheses

  • PCRE Only – makes some things possible that otherwise couldn’t be done

  • Non-capturing grouping: (?: )

    • Can simplify back-reference counting

  • Look-ahead Assertions:

    • They don’t advance the matching position

    • Positive: (?= ), or Negative: (?! )

    • Very powerful, but not always easy to understand. Trial and error can be your friend!
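
A sketch of non-capturing groups and look-ahead assertions; the strings are arbitrary.

```php
<?php
// (?: ) groups without capturing, so the digit is capture number 1.
preg_match('/(?:abc)+(\d)/', 'abcabc7', $m);

// Positive look-ahead: "foo" only if "bar" follows; "bar" is not consumed.
preg_match('/foo(?=bar)/', 'foobar', $ahead);

// Negative look-ahead: "foo" only if "bar" does not follow.
$rejected = preg_match('/foo(?!bar)/', 'foobar');
$accepted = preg_match('/foo(?!bar)/', 'foobaz');

print_r($m);
echo $ahead[0], "\n";
```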


PCRE Specifics

  • www.php.net/pcre

  • preg_match, preg_match_all, preg_replace, preg_split, preg_grep (filter an array)

  • Perl RE’s have a delimiter, usually /, but can be anything:

    • preg_match("/foo/",$bar);

    • preg_match("%/usr/local/bin/%",$path);
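
One call to each of the preg_ functions listed above, as a sketch with made-up input:

```php
<?php
// All matches, not just the first.
preg_match_all('/\d+/', 'a1b22c333', $m);

// Replace runs of whitespace with a single space.
$squeezed = preg_replace('/\s+/', ' ', "a   b\n c");

// Split on commas and/or whitespace.
$parts = preg_split('/[\s,]+/', "a, b  c,d");

// Keep only the array entries that match (original keys are preserved).
$numeric = preg_grep('/^\d+$/', ['1', 'a', '22']);

// A different delimiter avoids escaping every slash in a path.
$inBin = preg_match('%/usr/local/bin/%', '/usr/local/bin/php');

print_r($m[0]);
print_r($parts);
print_r($numeric);
```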


POSIX Specifics

  • www.php.net/regex

  • ereg, ereg_replace, split, eregi, spliti, etc. (all deprecated in PHP 5.3 and removed in PHP 7.0)

  • [Only] Advantage over PCRE at the time of this talk: It doesn’t require the PCRE library to be installed, so it’s always there in any PHP installation

  • Other regex engines support this specification, though the Perl style seems to be more popular.
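
Because the ereg family no longer exists in PHP 7 and later, the sketch below shows PCRE calls that behave like typical POSIX-style calls. The pairing in the comments is a suggested mapping, not something taken from the slides.

```php
<?php
// ereg('^[[:alpha:]]+$', $s) becomes preg_match with delimiters added.
// POSIX bracket classes such as [[:alpha:]] are also valid in PCRE.
$isAlpha = preg_match('/^[[:alpha:]]+$/', 'Utah');

// eregi(...) becomes preg_match with the /i modifier.
$caseless = preg_match('/^utah$/i', 'UTAH');

// ereg_replace(...) becomes preg_replace.
$squeezed = preg_replace('/[[:space:]]+/', ' ', "a  \t b");

// split(...) becomes preg_split.
$fields = preg_split('/[,;]/', 'a,b;c');

var_dump($isAlpha, $caseless, $squeezed, $fields);
```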


Almost there…

  • Intro to Strings in PHP

    • (Feel free to tell me how fast or slow to go)

  • Functions relating to HTML, SQL, etc.

  • Regular Expressions

    • PCRE

    • POSIX

  • Performance/Speed considerations

  • Grab bag of cool string functions


Performance/Speed

  • Rule of thumb: use the simplest function that will get the job done right

    • strpos instead of strstr (when you only need to know whether a substring occurs)

    • str_replace instead of preg_replace

    • And so forth…

    • The PHP manual online usually includes notes about speed differences

  • PCRE is faster than POSIX Regex
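
For literal text, the simple functions give the same answer as the regex ones without needing any escaping. The sketch below only shows that the results agree; it does not measure speed, and the sample text is invented.

```php
<?php
$text = "price: 3.50 (approx.)";

// Literal search: strpos needs no escaping.
$viaStrpos = (strpos($text, "3.50") !== false);

// The regex version has to escape the dot, here via preg_quote.
$viaRegex = (preg_match('/' . preg_quote("3.50", '/') . '/', $text) === 1);

// Literal replace: str_replace versus preg_replace with an escaped dot.
$a = str_replace("approx.", "about", $text);
$b = preg_replace('/approx\./', 'about', $text);

var_dump($viaStrpos, $viaRegex, $a === $b);
```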


Grab Bag

  • md5, md5_file – Calculate md5 hashes

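    • Widely used for passwords in databases when this talk was given; MD5 is now considered too weak for that, and password_hash (PHP 5.5 and later) is the usual choice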

  • levenshtein, similar_text – calculate the “similarity” of two strings

  • metaphone, soundex – calculate how similar two strings sound when spoken out loud

  • str_rot13 – Encryption algorithm

    • Protected by the DMCA
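
A sketch of a few of these functions with well-known sample inputs:

```php
<?php
$hash = md5('hello');                           // 32 hexadecimal characters

$distance = levenshtein('kitten', 'sitting');   // number of single-character edits

// Names that sound alike get the same soundex key.
$sameSound = (soundex('Robert') === soundex('Rupert'));

// rot13 shifts letters by 13 places; applying it twice restores the text.
$rotated   = str_rot13('PHP');
$roundTrip = str_rot13(str_rot13('Hello'));

echo $hash, "\n", $distance, "\n", $rotated, "\n", $roundTrip, "\n";
```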


Grab Bag 2

  • str_shuffle – words are much more fun once they’ve been randomized

  • count_chars, str_word_count – statistics about your strings

  • strrev – if it doesn’t make sense forward, try it backwards
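
A sketch of the functions on this slide. The reverse function is spelled strrev in PHP, and the sample strings are arbitrary.

```php
<?php
$reversed = strrev('abc');

$wordCount = str_word_count('The quick brown fox');

// Mode 3 of count_chars returns a string of the distinct bytes used.
$distinct = count_chars('hello', 3);

// str_shuffle reorders the characters randomly; the length stays the same.
$shuffled = str_shuffle('randomize');

echo $reversed, "\n", $wordCount, "\n", $distinct, "\n", $shuffled, "\n";
```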


Grand Finale

  • Any questions?


Group Practice

  • 8.3 filenames - anything but zip files

    • /^.{0,8}(\.[^z][^i]?[^p]?)?$/i – fails filename.ftp (wrongly rejected because the extension ends in p)

    • /^.{0,8}\.(?!zip$).{1,3}$/i – PCRE only (uses a negative look-ahead)

    • Sometimes easier to match rejects rather than keepers

  • Apache access log example:

    • 4.79.40.166 - - [07/Jan/2005:04:35:42 -0700] "GET /robots.txt HTTP/1.0" 404 337 "-" "Holmes/1.0"

    • preg_match("/^(\d{1,3}(?:\.\d{1,3}){3}) ". #IP

    • "- - \[(.+)\] \"\w+ (\S+) (\S+)\" (\d+) (\d+) ".

    • "\"-\" \"([^\"]*)\"$/",$row,$matches);
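
Both exercises can be run as written below. This is a sketch: the negative look-ahead is spelled (?!zip$) and the non-capturing group (?:...), which is how PCRE writes those constructs, and the file names are made up. The log line is the one from the slide.

```php
<?php
// 8.3-style names that are not zip files.
$notZip = '/^.{0,8}\.(?!zip$).{1,3}$/i';
$names  = ['report.txt', 'filename.ftp', 'archive.zip', 'ARCHIVE.ZIP'];
$kept   = array_values(preg_grep($notZip, $names));

// Apache access log line.
$row = '4.79.40.166 - - [07/Jan/2005:04:35:42 -0700] '
     . '"GET /robots.txt HTTP/1.0" 404 337 "-" "Holmes/1.0"';

$logPattern = '/^(\d{1,3}(?:\.\d{1,3}){3}) '                   // 1: IP address
            . '- - \[(.+)\] "\w+ (\S+) (\S+)" (\d+) (\d+) '    // 2: time, 3: path, 4: protocol, 5: status, 6: size
            . '"-" "([^"]*)"$/';                               // 7: user agent

$matched = preg_match($logPattern, $row, $matches);

print_r($kept);
print_r($matches);
```

Single-quoted PHP strings are used for the patterns here so that the double quotes inside the log format need no escaping.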
