Regex 101

YAPC::EU 2014 Sofia, Bulgaria http://pvnp.us/yapc::eu/2014/regex101/ Regex 101

Outline • The Plan • What are Regular Expressions? • The 101 Part • How Regexes Really Work

The Plan • Learn to think • What are Regular Expressions? • What are Regular Expressions Good For? • When should you NOT use Regular Expressions?

What are Regular Expressions? • Regular expressions are a language • A regular expression is like a sentence. It contains: • words (literals) • grammar (metacharacters) • We use regular expressions to describe patterns we want to efficiently find in some larger text. We may want to do something with the text we find, or we may want to do something with the text that does not match, or we want to know when it does not match

The 101 Part • Basic String Matching: • 'Kelp is better than Dancer' =~ /Kelp/; • 'Kelp is better than Dancer' =~ m!Kelp!; • What else matches? • Does 'Kelp is better than Dancer' =~ m{n D}; match? • Metacharacters: • { } [ ] ( ) ^ $ . | * + ? \
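The basic matches on this slide can be tried directly. The following is a minimal sketch; the variable names are illustrative and not from the talk.

```perl
use strict;
use warnings;

my $text = 'Kelp is better than Dancer';

# The same literal match written with two different delimiters.
my $with_slashes = ($text =~ /Kelp/)  ? 1 : 0;
my $with_bangs   = ($text =~ m!Kelp!) ? 1 : 0;

# A literal may span a word break: "than Dancer" contains "n D".
my $across_words = ($text =~ m{n D}) ? 1 : 0;

# Matching is case-sensitive unless a modifier says otherwise.
my $lowercase = ($text =~ /kelp/) ? 1 : 0;

print "slashes=$with_slashes bangs=$with_bangs ",
      "across=$across_words lowercase=$lowercase\n";
```

Choosing a delimiter other than / is mostly useful when the pattern itself contains slashes.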

Character Classes • /[kK][eE][lL][pP]/ • What about /kElP/i? • /[bcr]at/ • Metacharacters differ within a character class: - ] \ ^ $ • /item[0-9]/ == /item[0123456789]/ • What does /[^0-9]/ match? • \d, \s, \w, \D, \S, \W, . • \b is the position between \w and \W, or between \W and \w • What does /\d\d:\d\d:\d\d/ match?
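The questions on this slide can be explored with a short script. This is a sketch with made-up sample strings; note that /\d\d:\d\d:\d\d/ only checks the shape of a time, not whether it is a valid one.

```perl
use strict;
use warnings;

# [bcr]at matches bat, cat and rat, but not hat or mat.
my @hits = grep { /[bcr]at/ } qw(bat cat rat hat mat);

# /i makes the literal case-insensitive.
my $caseless = ('kElP' =~ /kelp/i) ? 1 : 0;

# [^0-9] is a negated class: any character that is not a digit.
my ($non_digits) = 'item7' =~ /([^0-9]+)/;

# Three pairs of digits separated by colons.
my ($time) = 'at 06:25:57 UTC' =~ /(\d\d:\d\d:\d\d)/;
my $shape_only = ('99:99:99' =~ /\d\d:\d\d:\d\d/) ? 1 : 0;

# \b marks the edges of runs of \w characters.
my @words = 'Kelp is better' =~ /\b(\w+)\b/g;

print "hits=@hits caseless=$caseless non_digits=$non_digits ",
      "time=$time shape_only=$shape_only words=@words\n";
```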

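Matching Multi-lines • Matching across multiple lines: • /Regex/ • '.' matches any char except "\n"; '^' matches only at the beginning of the string and '$' only at the end (or before a trailing newline) • /Regex/s • string is treated as one line; '.' now also matches "\n" • /Regex/m • string is treated as multiple lines; '^' and '$' match at the beginning/end of any line • /Regex/sm • string is treated as one line ('.' matches "\n"), but '^' and '$' still know about individual lines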

Examples • $x = "Kelp is better than\nDancer\n"; • $x =~ /^Dancer/; # !match, 'Dancer' not at start of string • $x =~ /^Dancer/s; # !match, 'Dancer' not at start of string • $x =~ /^Dancer/m; # match, 'Dancer' at start of second line • $x =~ /^Dancer/sm; # match, 'Dancer' at start of second line • $x =~ /than.Dancer/; # !match, '.' doesn't match "\n" • $x =~ /than.Dancer/s; # match, '.' matches "\n" • $x =~ /than.Dancer/m; # !match, '.' doesn't match "\n" • $x =~ /than.Dancer/sm; # match, '.' matches "\n"
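The eight cases above can be checked in one go. A minimal sketch, with 1 meaning match and 0 meaning no match; the array name is illustrative.

```perl
use strict;
use warnings;

my $x = "Kelp is better than\nDancer\n";

# Same order as the slide: no modifier, /s, /m, /sm for each pattern.
my @results = (
    ($x =~ /^Dancer/       ? 1 : 0),
    ($x =~ /^Dancer/s      ? 1 : 0),
    ($x =~ /^Dancer/m      ? 1 : 0),
    ($x =~ /^Dancer/sm     ? 1 : 0),
    ($x =~ /than.Dancer/   ? 1 : 0),
    ($x =~ /than.Dancer/s  ? 1 : 0),
    ($x =~ /than.Dancer/m  ? 1 : 0),
    ($x =~ /than.Dancer/sm ? 1 : 0),
);

print "@results\n";
```

The pattern to notice is that /m only changes what '^' and '$' mean, and /s only changes what '.' means.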

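Alternation and Grouping • 'kelp and dancer' =~ /kelp|dancer|bird/; # matches 'kelp' • 'kelp and dancer' =~ /dancer|kelp|bird/; # matches 'kelp' (earliest position in the string wins) • 'stefan likes kelp' =~ /k|ke|kel|kelp/; # matches 'k' (the 'k' in 'likes') • 'stefan likes kelp' =~ /kelp|kel|ke|k/; # matches 'ke' (the 'ke' in 'likes')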
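To see where each alternation actually matches, it helps to print the offset as well as the text. This sketch uses a hypothetical helper, first_match, that is not part of the talk; it returns the matched text and its starting offset.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (matched text, start offset) or an empty list.
sub first_match {
    my ($text, $re) = @_;
    return unless $text =~ /($re)/;
    return ($1, $-[0]);    # $-[0] is the offset where the match began
}

my @kelp_first   = first_match('kelp and dancer',   qr/kelp|dancer|bird/);
my @dancer_first = first_match('kelp and dancer',   qr/dancer|kelp|bird/);
my @short_first  = first_match('stefan likes kelp', qr/k|ke|kel|kelp/);
my @long_first   = first_match('stefan likes kelp', qr/kelp|kel|ke|k/);

print "[$_->[0]] at offset $_->[1]\n"
    for \@kelp_first, \@dancer_first, \@short_first, \@long_first;
```

The last two matches start at the 'k' in 'likes': the engine takes the earliest position where any alternative succeeds, and at that position it tries the alternatives from left to right.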

More Alternation • Geoff, or Jeff, or? • /Jeffrey|Jeffery/ • /Jeff(rey|ery)/ • /Jeff(re|er)y/ • /(Geoff|Jeff)(rey|ery)/ • /(Geo|Je)ff(rey|ery)/ • /(Geo|Je)ff(re|er)y/ • /(Geo|Je)f{2}(re|er)y/

Special Characters • $`, $&, $' hold the text before the last match, the match itself, and the text after it • $1, $2, .. hold the 1st, 2nd captured groups • use outside a regex • When you want to do something later with a match • \1, \2, .. are backreferences to the 1st, 2nd captured groups • use inside a regex • When you need to match something again that you already matched earlier • my $string = 'this is some text'; • $string =~ s/(some)/<b>$1<\/b>/; # \1 also works in the replacement, but $1 is preferred
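A short sketch of these variables in use. The second sample string ('it is is here') is invented for illustration.

```perl
use strict;
use warnings;

my $string = 'this is some text';

# $`, $& and $' after a successful match.
my ($before, $matched, $after);
if ($string =~ /some/) {
    ($before, $matched, $after) = ($`, $&, $');
}

# $1 used outside the pattern, in the replacement part of s///.
(my $bold = $string) =~ s/(some)/<b>$1<\/b>/;

# \1 used inside the pattern to find a doubled word.
my $doubled = ('it is is here' =~ /\b(\w+) \1\b/) ? $1 : undef;

print "before=[$before] matched=[$matched] after=[$after]\n";
print "bold=[$bold] doubled=[$doubled]\n";
```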

Backreferences • my $regex = '([0-9]) \1 ([0-9])'; • my $string1 = '1 1 3 2 4 5 5 2'; • my $string2 = '1 2 3 4 5 6 7 8'; • my @string1_matches = $string1 =~ /$regex/; • my @string2_matches = $string2 =~ /$regex/; • What would ([0-9]) \1.*([0-9]).*\2 find in $string1?
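The slide's snippet runs as written. The sketch below adds printing and also tries the closing question; the variables $whole and @quiz are illustrative names, not from the talk.

```perl
use strict;
use warnings;

my $regex   = '([0-9]) \1 ([0-9])';
my $string1 = '1 1 3 2 4 5 5 2';
my $string2 = '1 2 3 4 5 6 7 8';

# In list context a match returns the captured groups.
my @string1_matches = $string1 =~ /$regex/;
my @string2_matches = $string2 =~ /$regex/;   # no digit repeats, so no match

# The closing question: a repeated digit, then a later digit that repeats.
my ($whole, @quiz);
if ($string1 =~ /([0-9]) \1.*([0-9]).*\2/) {
    $whole = $&;
    @quiz  = ($1, $2);
}

print "string1: [@string1_matches]\n";
print "string2: [@string2_matches]\n";
print "quiz:    [@quiz] within [$whole]\n";
```

Because the first .* is greedy, the second capture lands on the last digit that still has a copy of itself later in the string.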

Named Backreferences • my $datetime = '08/19/2014 06:25:57'; • $datetime =~ /\d\d:\d\d:\d\d/; • print "From [$datetime], I matched [$&]\n"; • $datetime =~ /(?<m>\d\d)\/(?<d>\d\d)\/(?<y>\d\d\d\d)/; … matches go to %+ hash: • print "I matched [$+{d} and $+{m} and $+{y}]\n";
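A runnable sketch of the named captures above. It copies the values out of %+ right after the match, since %+ always reflects the most recent successful match; the ISO-style output format is just an illustration.

```perl
use strict;
use warnings;

my $datetime = '08/19/2014 06:25:57';

# Plain capture for the time part.
my ($time) = $datetime =~ /(\d\d:\d\d:\d\d)/;

# Named captures for the date part; m{...} avoids escaping the slashes.
my ($month, $day, $year);
if ($datetime =~ m{(?<m>\d\d)/(?<d>\d\d)/(?<y>\d\d\d\d)}) {
    ($month, $day, $year) = ($+{m}, $+{d}, $+{y});
}

my $iso = "$year-$month-$day";
print "time=[$time] date=[$iso]\n";
```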

Numbers of Repetitions • a? means: match 'a' 1 or 0 times • a* means: match 'a' 0 or more times, i.e., any number of times • a+ means: match 'a' 1 or more times, i.e., at least once • a{n,m} means: match at least n times, but not more than m times • a{n,} means: match at least n times • a{n} means: match exactly n times • Non-greedy • Ex: a?? means: match 'a' 0 or 1 times, trying 0 first (as few as possible while the whole regex still matches) • Ex: a+? means: match 'a' 1 or more times, but as few as possible
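The difference between greedy and non-greedy quantifiers is easiest to see on a string of identical characters. A minimal sketch with illustrative variable names:

```perl
use strict;
use warnings;

my $s = 'aaaa';

my ($greedy_plus)   = $s =~ /(a+)/;      # as many as possible
my ($lazy_plus)     = $s =~ /(a+?)/;     # as few as possible, at least one
my ($bounded)       = $s =~ /(a{2,3})/;  # between two and three, prefers three
my ($exact)         = $s =~ /(a{2})/;    # exactly two
my ($lazy_optional) = $s =~ /(a??)/;     # zero or one, prefers zero

print "a+=[$greedy_plus] a+?=[$lazy_plus] a{2,3}=[$bounded] ",
      "a{2}=[$exact] a??=[$lazy_optional]\n";
```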

Big Example • @matches = 'Kelp is better than Dancer' =~ /^(.+)(e|r)(.*)$/; • Match 0 is [Kelp is better than Dance] # but why did it match this? … stay tuned! • Match 1 is [r] • Match 2 is [] • @matches = 'Kelp is better than Dancer' =~ /(t{1,2})(.*)$/; • Match 0 is [tt] • Match 1 is [er than Dancer] • @matches = 'Kelp is better than Dancer' =~ /.*(t{1,2})(.*)$/; • Match 0 is [t] • Match 1 is [han Dancer] • @matches = 'Kelp is better than Dancer' =~ /(.?)(t{1,2})(.*)$/; • Match 0 is [e] • Match 1 is [tt] • Match 2 is [er than Dancer]
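The four matches on this slide can be reproduced with a small script. The array names and the show helper are illustrative.

```perl
use strict;
use warnings;

my $text = 'Kelp is better than Dancer';

my @greedy_plus  = $text =~ /^(.+)(e|r)(.*)$/;
my @first_t      = $text =~ /(t{1,2})(.*)$/;
my @last_t       = $text =~ /.*(t{1,2})(.*)$/;
my @optional_dot = $text =~ /(.?)(t{1,2})(.*)$/;

# Print each capture list in the slide's "Match N is [...]" style.
sub show {
    my @captures = @_;
    return join ' ', map { "Match $_ is [$captures[$_]]" } 0 .. $#captures;
}

print show(@greedy_plus),  "\n";
print show(@first_t),      "\n";
print show(@last_t),       "\n";
print show(@optional_dot), "\n";
```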

Important points • Readability vs Clarity vs Precision • /Jeffrey|Jeffery/ • /Jeff(rey|ery)/ • /Jeff(re|er)y/ • /(Geoff|Jeff)(rey|ery)/ • /(Geo|Je)ff(rey|ery)/ • /(Geo|Je)ff(re|er)y/ • /(gEo|JE)f{2}(re|er)Y/i • Which one is better? Why? • Are any better than simply: /Jeffrey|Jeffery|Geoffrey|Geoffery/? • Non-capturing groups: (?:regex)
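One way to compare a factored pattern with the spelled-out one is to run both over the same inputs and look for disagreements. The sketch below does that, and also shows what a non-capturing group changes. The sample names and the added ^ and $ anchors are assumptions made for the comparison.

```perl
use strict;
use warnings;

my $spelled_out = qr/^(?:Jeffrey|Jeffery|Geoffrey|Geoffery)$/;
my $factored    = qr/^(?:Geo|Je)ff(?:re|er)y$/;

my @samples = qw(Jeffrey Jeffery Geoffrey Geoffery Jeff Geoffry Jefrey jeffrey);

# Inputs on which the two patterns give different answers (expected: none).
my @disagreements = grep {
    ($_ =~ $spelled_out ? 1 : 0) != ($_ =~ $factored ? 1 : 0)
} @samples;

# Capturing groups return their contents in list context ...
my @captured = 'Geoffrey' =~ /(Geo|Je)ff(re|er)y/;

# ... while a pattern with only (?:...) groups just returns a true value.
my @uncaptured = 'Geoffrey' =~ /(?:Geo|Je)ff(?:re|er)y/;

print "disagreements: ", scalar(@disagreements), "\n";
print "captured: [@captured] uncaptured: [@uncaptured]\n";
```

Both patterns accept the same strings, so the choice between them comes down to which one the next reader will understand fastest.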

Readability • /^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/; • That is supposed to find “numbers” • Better:
/^
   [+-]?\ *         # first, match an optional sign
   (                # then match integers or f.p. mantissas:
       \d+          # start out with a ...
       (
           \.\d*    # mantissa of the form a.b or a.
       )?           # ? takes care of integers of the form a
      |\.\d+        # mantissa of the form .b
   )
   ([eE][+-]?\d+)?  # finally, optionally match an exponent
$/x;
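A sketch of the commented /x form stored in a qr{} object and tried against a few inputs. The sample inputs are invented; note that under /x the literal space has to be written as "\ ".

```perl
use strict;
use warnings;

my $number = qr{
    ^
    [+-]?\ *            # optional sign, then optional spaces
    (                   # integer or floating-point mantissa:
        \d+             #   digits ...
        ( \.\d* )?      #   ... optionally followed by a fractional part
      | \.\d+           #   or a mantissa of the form .b
    )
    ( [eE][+-]?\d+ )?   # optional exponent
    $
}x;

my %is_number;
for my $candidate ('42', '-3.14', '.5', '6.02e23', '+ 7.', 'abc', '1e', '.') {
    $is_number{$candidate} = ($candidate =~ $number) ? 1 : 0;
}

printf "%-8s %s\n", "[$_]", $is_number{$_} ? 'number' : 'not a number'
    for sort keys %is_number;
```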

Should I use a regex? • No if parsing HTML / XML • Ok for trying to understand basic structure • http://bit.ly/1hY5QfK • No if validating email addresses • http://davidcel.is/blog/2012/09/06/stop-validating-email-addresses-with-regex/ • "Look at all these spaces!"@example.com seems valid! • No if writing an obscenity filter • Very easy to break …

How Regexes Work • Regex → a simple machine that answers yes / no • The machine checks one character at a time • If it does not match, it keeps going • Perl compiles the regexp into a compact sequence of opcodes that can often fit inside a processor cache. When the code is executed, these opcodes can then run at full throttle and search very quickly.

Penny (State) Machines • my $string = 'abaa'; • my $regex = '^(a|b)*a$'; • Match? • i.e., does $string =~ /$regex/ return 1?

Terminology • In CS terms, penny machines are 'finite automata' (FA) • The fact that some pennies get cloned makes the machines 'non-deterministic' (N) • Therefore, regexes in Perl are NFA • arrows = transitions, blank arrows = epsilon transitions, circles = states, pennies = current states
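The penny idea can be simulated in a few lines: keep a set of states that currently hold a penny and move every penny on each input character. The two-state machine below is hand-built for '^(a|b)*a$' and is an assumption, since the slide diagrams are not part of this transcript; it also has no epsilon transitions. Perl's own engine works by backtracking rather than by tracking a set of states, so this is a model of the idea, not of the implementation.

```perl
use strict;
use warnings;

# Hand-built machine (an assumption, not taken from the slides):
# state 0 loops on 'a' or 'b'; reading 'a' may also move a penny to
# state 1, the only accepting state, which has no outgoing transitions.
my %transitions = (
    0 => { a => [0, 1], b => [0] },
    1 => {},
);
my $accepting = 1;

sub penny_machine_accepts {
    my ($input) = @_;
    my %pennies = (0 => 1);                  # one penny on the start state
    for my $char (split //, $input) {
        my %next;
        for my $state (keys %pennies) {
            my $targets = $transitions{$state}{$char} || [];
            $next{$_} = 1 for @$targets;     # two targets means the penny is cloned
        }
        %pennies = %next;
    }
    return $pennies{$accepting} ? 1 : 0;     # accept if a penny ends on state 1
}

for my $input ('abaa', 'abab', 'a', '', 'bba', 'abc') {
    my $by_machine = penny_machine_accepts($input);
    my $by_regex   = ($input =~ /^(a|b)*a$/) ? 1 : 0;
    print "[$input] machine=$by_machine regex=$by_regex\n";
}
```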

‘^(a|b)*a$’

Regex -> Penny Machine • All regexes take the form /^P$/, where P is the regex. P can be a simple, or composite regex.

That's all we Need • Why are those four penny machines all we need? • [abc] = (a|b|c) takes care of character classes • \d, \s, etc. are really character classes • ? well, /P?/ = /(P|)/ • {n} well, /P{3}/ = /PPP/ • Non-greedy quantifiers don't change whether a regex matches, only how much it matches
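These equivalences can be spot-checked by running each pair of patterns over the same inputs. This is a sketch on a handful of invented ASCII inputs, not a proof; the anchors are added so that each pattern has to account for the whole string.

```perl
use strict;
use warnings;

# Each pair: the convenient form and its expansion into simpler pieces.
my @pairs = (
    [ qr/^[abc]$/, qr/^(a|b|c)$/ ],   # character class as alternation
    [ qr/^ab?c$/,  qr/^a(b|)c$/  ],   # P? as (P|)
    [ qr/^a{3}$/,  qr/^aaa$/     ],   # P{3} as PPP
    [ qr/^\d$/,    qr/^[0-9]$/   ],   # \d as a class (same on ASCII input)
);

my @inputs = ('', qw(a b c d ac abc abbc aa aaa aaaa 7 77));

my $mismatches = 0;
for my $pair (@pairs) {
    my ($short, $expanded) = @$pair;
    for my $input (@inputs) {
        my $x = ($input =~ $short)    ? 1 : 0;
        my $y = ($input =~ $expanded) ? 1 : 0;
        $mismatches++ if $x != $y;
    }
}

print "mismatches: $mismatches\n";
```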

Extended Example • /".*"/ • And then "Kelp," said Stefan, "is better than Dancer" and he was gone. • What will the regular expression match? • "Kelp," • " said Stefan, " • "is better than Dancer" • "Kelp," said Stefan, " • " said Stefan, "is better than Dancer" • "Kelp," said Stefan, "is better than Dancer"

How it Works • Notice it will fail 9 times to begin with, once at each character of 'And then ' (each * marks a failed attempt): A*n*d* *t*h*e*n* * • Rule 1: Leftmost match (earliest possible) wins, so we can cancel the green ones right away. • Rule 2: Optional matches always match, so a regex that requires nothing to succeed always does. • Rule 3: When the engine has > 1 path to choose from, it chooses one and caches the other. If the path it chooses fails, it backtracks to the most recent untried path in the cache and repeats.

Choosing a Path • When faced with two possible paths, how does perl choose which one to try first? • Forest! • If maximal matching (greedy) ?, *, +, {min, max} … perl always tries the optional path first • If minimal matching (non-greedy) ??, .. … perl always skips optional path first
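The effect of greedy versus minimal matching shows up clearly on the quoted-speech sentence from the extended example. A sketch using straight double quotes; the negated-class variant in the last line is an addition for comparison and is not from the talk.

```perl
use strict;
use warnings;

my $text = 'And then "Kelp," said Stefan, "is better than Dancer" and he was gone.';

# Greedy: .* runs to the end of the string, then backs off to the last quote.
my ($greedy) = $text =~ /(".*")/;

# Minimal: .*? stops at the first closing quote it can reach.
my ($minimal) = $text =~ /(".*?")/;

# Added for comparison: a negated class cannot cross a quote at all.
my @each_quote = $text =~ /("[^"]*")/g;

print "greedy:  $greedy\n";
print "minimal: $minimal\n";
print "each:    $_\n" for @each_quote;
```

Both forms start at the same leftmost quote; the quantifier only decides how far the match extends from there.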

Another Example
#!/usr/bin/env perl
use strict;
use warnings;
use YAPE::Regex::Explain;
my $string = "Stefan's Kelp sure can help"; # if you have too much Dancer in your diet …
my $regex = '^(.*)(lp)(.*)$';
my @matches = $string =~ /$regex/;
print YAPE::Regex::Explain->new($regex)->explain();
foreach my $match (@matches) {
    print STDERR "Matched: [$match]\n";
}

What Matched? • Matched: [Stefan's Kelp sure can he] # 1 • Matched: [lp] # 2 • Matched: [] # 3

But what about that 'explain'? • Remember this? • my $regex = '^(.*)(lp)(.*)$'; • print YAPE::Regex::Explain->new($regex)->explain();

The regular expression:

(?-imsx:^(.*)(lp)(.*)$)

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    lp                       'lp'
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  (                        group and capture to \3:
----------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
----------------------------------------------------------------------
  )                        end of \3
----------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

Summary: Big Ideas • Literals • Metacharacters • Character classes • modifiers • (constrained) Alternation • Quantifiers • Word boundaries • Multi-line searching • Back references (backtracking) • Efficiency • Readability • Captures • Greedy / speed • State Machines

References • http://www.regexr.com/ • Online tool to build / explain regular expressions • http://regex.info/book.html • THE regular expressions book. Available in at least: English, German, Russian, Japanese, French • http://perldoc.perl.org/perlre.html • THE perl regular expressions documentation • YAPE::Regex::Explain • http://search.cpan.org/dist/Kelp/
