CSE467/567 Computational Linguistics
Carl Alphonce, Computer Science & Engineering, University at Buffalo
Levels of processing
phonetics/phonology – sounds
morphology – word structure
syntax – sentence structure
semantics – meaning
“a regular expression is an algebraic notation for characterizing a set of strings” [p. 22]
Regular expressions are commonly used to specify search strings. For example, the UNIX utility program grep lets the user specify a pattern to search for in files.
Matching a sequence of characters
/a/ matches the character ‘a’
/fred/ matches the string ‘fred’
/fred/ does not match the string ‘Fred’!
In other words, patterns are case-sensitive.
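As a minimal sketch of literal, case-sensitive matching, the lines above can be tried with Python's `re` module. Python is an assumption here (the slides mention grep), and the helper name `found` and the sample sentences are illustrative only.

```python
import re

def found(pattern, text):
    """Return True if the pattern matches anywhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

lower_hit = found(r"fred", "my name is fred")   # the literal string occurs
upper_hit = found(r"fred", "my name is Fred")   # 'F' is not 'f', so no match

print(lower_hit, upper_hit)
```

The second call fails only because of the capital letter, which is the case-sensitivity the slide describes.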
Square brackets are used to indicate disjunction of characters.
/[Ff]/ matches either ‘f’ or ‘F’
/[Ff]red/ matches either ‘fred’ or ‘Fred’
This form of disjunction applies only at the character level. A set of characters in square brackets is sometimes referred to as a character class.
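A short sketch of character-level disjunction, again assuming Python's `re` module. The word list is made up for illustration.

```python
import re

pattern = re.compile(r"[Ff]red")

# Only the first character may vary; 'red' must still be lowercase.
words = ["fred", "Fred", "FRED", "bred"]
accepted = [w for w in words if pattern.search(w)]

print(accepted)
```

Here 'FRED' is rejected because the class covers only the first position, and 'bred' is rejected because 'b' is not in the class.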
Sometimes it is useful to specify “any digit” or “any letter”.
“Any digit” can be written as /[0123456789]/, since any of the ten digits satisfies the pattern.
An alternative is to use a special range notation: /[0-9]/
Any letter can be specified as /[A-Za-z]/
Range notation does not extend the power of regular expressions, but gives us a convenient way to express them.
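The claim that ranges are only a convenience can be checked with a small sketch, assuming Python's `re` module. The sample text is illustrative.

```python
import re

text = "Room 467 or 567, floor B2"

# The explicit class and the range form describe the same set of characters.
explicit = re.findall(r"[0123456789]", text)
ranged = re.findall(r"[0-9]", text)

# Two ranges can be combined inside one class.
letters = re.findall(r"[A-Za-z]", "B2!")

print(explicit == ranged, letters)
```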
To search for a character that is not in a character class, place a caret (^) immediately after the opening square bracket, before the characters of the class.
/[^a]/ matches any single character except ‘a’
/[^0-9]/ matches any single character except a digit
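A sketch of negated character classes, assuming Python's `re` module. The sample strings are illustrative. The last line shows that the caret negates only when it comes first inside the brackets.

```python
import re

# Each match is a single character that is NOT in the class.
non_a = re.findall(r"[^a]", "banana")
non_digits = re.findall(r"[^0-9]", "a1b22c")

# A caret that is not first in the class is just an ordinary character.
literal_caret = re.findall(r"[a^]", "a^b")

print(non_a, non_digits, literal_caret)
```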
The ‘?’ matches zero or one occurrences of the preceding expression.
/a?/ matches ‘a’ or ‘’ (nothing)
/cats?/ matches ‘cat’ or ‘cats’
Note that the “preceding expression”, in these examples, is a single letter. We’ll see how to form longer expressions later.
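A sketch of the optional operator, assuming Python's `re` module. `re.fullmatch` is used so that the whole word has to fit the pattern; the word list is illustrative.

```python
import re

words = ["cat", "cats", "catss", "ca"]

# '?' applies only to the 's' directly before it.
whole_word = [w for w in words if re.fullmatch(r"cats?", w)]

# /a?/ accepts both 'a' and the empty string.
empty_ok = re.fullmatch(r"a?", "") is not None

print(whole_word, empty_ok)
```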
The Kleene star (*) matches zero or more occurrences of the preceding expression.
/a*/ matches ‘’, ‘a’, ‘aa’, ‘aaa’, etc.
/[ab]*/ matches ‘’, ‘a’, ‘b’, ‘aa’, ‘ab’, ‘ba’, ‘bb’, etc.
The Kleene plus (+) matches one or more occurrences of the preceding expression.
+ is a convenience rather than a necessity: /[ab]+/ is equivalent to /[ab][ab]*/
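The equivalence of /[ab]+/ and /[ab][ab]*/ can be checked on a few strings. This is a sketch assuming Python's `re` module, with a made-up sample list; agreement on samples illustrates the equivalence but does not prove it.

```python
import re

samples = ["", "a", "b", "ab", "abba", "abc"]

star = [s for s in samples if re.fullmatch(r"[ab]*", s)]
plus = [s for s in samples if re.fullmatch(r"[ab]+", s)]
expanded = [s for s in samples if re.fullmatch(r"[ab][ab]*", s)]

# The star accepts the empty string; the plus does not.
print(star)
print(plus == expanded)
```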
The period (.) matches any single character except the newline (\n).
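A sketch of the period as a wildcard, assuming Python's `re` module with its default flags. The pattern /c.t/ and the sample text are illustrative.

```python
import re

text = "cat cot c\nt c-t"

# '.' stands for exactly one character, but not for the newline in 'c\nt'.
hits = re.findall(r"c.t", text)

print(hits)
```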
Anchors are used to restrict a match to a particular position within a string.
^ anchors to the start of a string
$ anchors to the end of a string
/[Ff]red/ matches both ‘Fred’ and ‘Fred is home’
/^[Ff]red$/ matches ‘Fred’ but not ‘Fred is home’
\b anchors to a word boundary
\B anchors to a non-boundary
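A sketch of the four anchors, assuming Python's `re` module. The sentence used for the boundary examples is made up; start positions are reported so the three occurrences of ‘the’ can be told apart.

```python
import re

# Without anchors the pattern may match anywhere in the string.
unanchored = re.search(r"[Ff]red", "Fred is home") is not None
# With ^ and $ the pattern must account for the entire string.
anchored_long = re.search(r"^[Ff]red$", "Fred is home") is not None
anchored_exact = re.search(r"^[Ff]red$", "Fred") is not None

sentence = "the other theme, the end"
# \b...\b: 'the' as a whole word (positions 0 and 17).
whole_words = [m.start() for m in re.finditer(r"\bthe\b", sentence)]
# \B...\B: 'the' strictly inside a longer word (inside 'other').
inside_words = [m.start() for m in re.finditer(r"\Bthe\B", sentence)]

print(unanchored, anchored_long, anchored_exact)
print(whole_words, inside_words)
```

The ‘the’ at the start of ‘theme’ is found by neither boundary pattern: it begins at a word boundary but does not end at one.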
Two regular expressions are concatenated by juxtaposition (placing the expressions side by side).
/a/ matches ‘a’
/m/ matches ‘m’
/am/ matches ‘am’ but not ‘a’ or ‘m’ alone
We have already seen disjunction of characters using the square bracket notation.
General disjunction is expressed using the vertical bar (|), also called the pipe symbol.
This form of disjunction allows us to match any one of the alternative patterns, not just characters like the [ ] disjunction form.
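A sketch of general disjunction, assuming Python's `re` module. The patterns /cat|dog/ and /gupp(y|ies)/ and the sample text are illustrative choices, not examples taken from the slides.

```python
import re

# The pipe chooses between whole alternatives, not single characters.
animals = re.findall(r"cat|dog", "my cat chased a dog")

# Parentheses limit how far the disjunction reaches:
# 'gupp' followed by either 'y' or 'ies'.
words = ["guppy", "guppies", "gupp", "ies"]
grouped = [w for w in words if re.fullmatch(r"gupp(y|ies)", w)]

print(animals, grouped)
```

Without the parentheses, /guppy|ies/ would mean ‘guppy’ or ‘ies’, because the pipe binds more loosely than concatenation.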
In addition to matching, we can do replacements when a match is found:
To replace the British spelling of color (‘colour’) with the American spelling, we can write, in Perl-style substitution syntax: s/colour/color/
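A sketch of the same replacement using Python's `re.sub`, which is an assumption about tooling rather than the notation used in the course. The sample sentence is illustrative.

```python
import re

text = "The colour of the colourful flag"

# re.sub replaces every non-overlapping match of the pattern.
american = re.sub(r"colour", "color", text)

print(american)
```

Because the pattern is not anchored to word boundaries, ‘colourful’ is rewritten as well.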
DE DO DO DO DE DA DA DA
IS ALL I WANT TO SAY TO YOU
/(D[AEO].)*/ will match the first line, apart from the final ‘DA’, which has no character after it for the period to match
/(D[AEO])(.D[AEO])\2\2\s\1(.D[AEO])\3\3/ matches it more specifically
This pattern also matches strings like DA DE DE DE DA DO DO DO
\s matches a whitespace character
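A sketch of how the numbered back-references behave, assuming Python's `re` module. The pattern is written with no literal spaces between its parts, since groups 2 and 3 each capture their own leading separator through the period. The variable names and the third test string are illustrative.

```python
import re

line = "DE DO DO DO DE DA DA DA"

# \1, \2 and \3 must repeat exactly what groups 1, 2 and 3 captured.
specific = re.compile(r"(D[AEO])(.D[AEO])\2\2\s\1(.D[AEO])\3\3")

first_line = specific.fullmatch(line) is not None
same_shape = specific.fullmatch("DA DE DE DE DA DO DO DO") is not None
# The third word differs from the second, so \2 fails here.
broken_repeat = specific.fullmatch("DE DO DA DO DE DA DA DA") is not None

# The looser pattern stops before the last word, which has no trailing character.
loose = re.match(r"(D[AEO].)*", line).group(0)

print(first_line, same_shape, broken_repeat)
print(repr(loose))
```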