COMP 5115 Programming Tools in Bioinformatics Week 6 Elements of an Expression

COMP 5115 Programming Tools in BioinformaticsWeek 6Elements of an Expression Character Classes: Character classes represent either a specific set of characters (e.g., uppercase) or a certain type of character (e.g., non-white-space).

Character Classes Examples Any Character . Use '..ain' in an expression to match a sequence of five characters ending in 'ain'. (Note that . matches white-space characters as well): str = 'The rain in Spain falls mainly on the plain.'; regexp(str, '..ain') ans = 4 13 24 39 Matches ' rain', 'Spain', ' main', and 'plain'. Returning Strings Rather than Indices: Here is the same example, this time specifying the command qualifier 'match'. In this case, regexp returns the text of the matching strings rather than the starting index: regexp(str, '..ain', 'match') ans = ' rain' 'Spain' ' main' 'plain'

Character Classes Examples Selected Characters [c1c2c3]Use [c1c2c3] in an expression to match selected characters r, p, or m followed by 'ain'. Specify two qualifiers this time, 'match' and 'start', along with an output argument for each, mat and idx. This returns the matching strings and the starting indices of those strings: [mat idx] = regexp(str, '[rpm]ain', 'match', 'start') mat = 'rain' 'pain' 'main' idx = 5 14 25 Range of Characters [c1 - c2]Use [c1-c2] in an expression to find words that begin with a letter in the range of A through Z: [mat idx] = regexp(str, '[A-Z]\w*', 'match', 'start') mat = 'The' 'Spain' idx = 1 13

Character Classes Examples Word and White-Space Characters \w, \sUse \w and \s in an expression to find words that end with the letter n followed by a white-space character. Add a new qualifier, 'end', to return the str index that marks the end of each match: [mat ix1 ix2] = regexp(str, '\w*n\s', 'match', 'start', 'end') mat = 'rain ' 'in ' 'Spain ' 'on ' ix1 = 5 10 13 32 ix2 = 9 12 18 34 Numeric Digits \dUse \d to find numeric digits in the following string: numstr = 'Easy as 1, 2, 3'; [mat idx] = regexp(numstr, '\d', 'match', 'start') mat = '1' '2' '3' idx = 9 12 15

Character Representation: The following character combinations represent specific character and numeric values:

Examples Character Representation Octal and Hexadecimal \o, \xUse \x and \o in an expression to find a comma (hex 2C) followed by a space (octal 40) followed by the character 2: numstr = 'Easy as 1, 2, 3'; [mat idx] = regexp(numstr, '\x2C\o{40}2', 'match', 'start') mat = ', 2' idx = 10 Special Characters \charUse \ before a character that has a special meaning to the regular expression functions if you want that character to be interpreted literally. The intention in this example is to have the string '(ab[XY|Z]c)' interpreted literally. The first expression does not do that because regexp interprets the parentheses and | sign as the special characters for grouping and logical OR: regexp('(ab[XY|Z]c)', '(ab[XY|Z]c)', 'match') ans = 'ab[XY' 'Z]c' This next expression uses a \ before any special characters. As a result the entire string is matched: regexp('(ab[XY|Z]c)', '$ab\[XY\|Z\]c$', 'match') ans = '(ab[XY|Z]c)'

Logical Operators: Logical operators do not match any specific characters. They are used to specify the context for matching an accompanying regular expression.

Examples Logical Operators Grouping and Capture (expr): elements can be grouped together using either (expr) to group and capture (Week-5 lecture: Using Tokens -- Example 1) or (?:expr) for grouping alone (see below) Grouping Only (?:expr) (1) Use (?:expr) to group a consonant followed by a vowel in the palindrome pstr. Specify at least two consecutive occurrences ({2,}) of this group. Return the starting and ending indices of the matched substrings: pstr = 'Marge lets Norah see Sharon''s telegram'; expr = '(?:[^aeiou][aeiou]){2,}'; [mat ix1 ix2] = regexp(pstr, expr, 'match', 'start', 'end') mat = 'Nora' 'haro' 'tele' ix1 = 12 23 31 ix2 = 15 26 34 (2) Remove the grouping, and the {2,} now applies only to [aeiou]. The command is entirely different now as it looks for a consonant followed by at least two consecutive vowels: expr = '[^aeiou][aeiou]{2,}'; [mat ix1 ix2] = regexp(pstr, expr, 'match', 'start', 'end') mat = 'see' ix1 = 18 ix2 = 20

Examples Logical Operators Including Comments (?#expr)Use (?#expr) to add a comment to this expression that matches capitalized words in pstr. Comments are ignored in the process of finding a match: regexp(pstr, '(?# Match words in caps)[A-Z]\w+', 'match') ans = 'Marge' 'Norah' 'Sharon' Alternative Match expr1|expr2Use p1|p2 to pick out words in the string that start with let or tel: regexpi(pstr, '(let|tel)\w+', 'match') ans = 'lets' 'telegram' (!) Note that using the '|' operator within square brackets in MATLAB regular expressions. Because the '|' operator has a higher precedence than '[ ]', MATLAB interprets the expression '[A|B]' not as 'A' or 'B', but as '[A' or 'B]'. The recommended way to match "the letter A or the letter B" in a MATLAB regexp expression is to use '[AB]'.

Examples Logical Operators Start and End of String Match ^expr, expr$Use ^expr to match words starting with the letter m or M only when it begins the string, and expr$ to match words ending with m or M only when it ends the string: regexpi(pstr, '^m\w*|\w*m$', 'match') ans = 'Marge' 'telegram' Start and End of Word Match \<expr, expr\>Use \<expr to match any words starting with n or N, or ending with e or E: regexpi(pstr, '\<n\w*|\w*e\>', 'match') ans = 'Marge' 'Norah' 'see' Exact Word Match \<expr\>Use \<expr\> to match a word starting with an n or N and ending with an h or H: regexpi(pstr, '\<n\w*h\>', 'match') ans = 'Norah'

Lookaround Operators • Lookaround operators have two components: a match pattern and a test pattern. If you call the match pattern p1 and the test pattern p2, then the simplest form of lookaround operator looks like this: • p1(?=p2) • The match pattern p1 is just like any other element in an expression. For example, it can be '\<[A-Za-z]+\>' to make regexp find any word. • The test pattern p2 places a condition on this match. There can be a match for p1 only if there is also a match for p2, and the p2 match must immediately precede (for lookbehind operators) or follow (for lookahead operators) the match for p1. • In the following expression, the match pattern is '\<[A-Za-z]+\>' and the test pattern is '\S'. The entire expression can be read as • "Find those words that are followed by a non-white-space character": • '\<[A-Za-z]+\>(?=\S)' • When used on the following string, this lookahead expression matches the letters of the words Raven and Nevermore: str = 'Quoth the Raven, "Nevermore"'; regexp(str, '\<[A-Za-z]+\>(?=\S)', 'match') ans = 'Raven' 'Nevermore' • One important characteristic of lookaround operators is how they affect the parsing of the input string. The parser can be said to "consume" pieces of the string as it looks for matching phrases. With lookaround operators, only the match pattern p1 affects the current parsing location. Finding a match for the test pattern p2 does not move the parser location.

Four lookaround expressions: 1. lookahead 2. negative lookahead 3. lookbehind 4. negative lookbehind

Examples Lookaround Operators Lookahead expr1(?=expr2)Use p1(?=p2) to find all words of this string that precede a comma: poestr = ['While I nodded, nearly napping, ' ... 'suddenly there came a tapping,']; [mat idx] = regexp(poestr, '\w*(?=,)', 'match', 'start') mat = 'nodded' 'napping' 'tapping' idx = 9 24 55 Negative Lookahead expr1(?!expr2)Use p1(?!p2) to find all words that do not precede a comma: [mat idx] = regexp(poestr, '\w+(?!\w*,)', 'match', 'start') mat = 'While' 'I' 'nearly' 'suddenly' 'there' 'came' 'a' idx = 1 7 17 33 42 48 53 Lookbehind (?<=expr1)expr2Use (?<=p1)p2 to find all words that follow a comma and zero or more spaces: [mat idx] = regexp(poestr, '(?<=,\s*)\w*', 'match', 'start') mat = 'nearly' 'suddenly' idx = 17 33

Examples Lookaround Operators Negative Lookbehind (?<!expr1)expr2Use (?<!p1)p2 to find all words that do not follow a comma and zero or more spaces: [mat idx] = regexp(poestr, '(?<!,\s*\w*)\w*', 'match', 'start') mat = 'While' 'I' 'nodded' 'napping' 'there' 'came' 'a' 'tapping' idx = 1 7 9 24 42 48 53 55 Using Lookaround as a Logical OperatorYou can use lookaround operators to perform a logical AND. The expression used here finds all words that contain a sequence of two letters under the condition that the two letters are identical and are in the range a through m. (The expression '(?=[a-m])' is a lookahead test for the range a through m, and the expression '(.)\1' tests for identical characters using a token: see Week-5 lecture notes): [mat idx] = regexp(poestr, '\<\w*(?=[a-m])(.)\1\w*\>', ... 'match', 'start') mat = 'nodded' 'suddenly' idx = 9 33 (!) Note that when using a lookahead operator to perform an AND, you need to place the match expression expr1 after the test expression expr2: (?=expr2)expr1 or (?!expr2)expr1

Quantifiers • Quantifiers can be used to specify how many instances of an element are to be matched. • When used alone, they match as much of the string as possible. Thus these are sometimes called greedy quantifiers. • When one of these quantifiers is followed by a plus sign (e.g., '\w*+'), it is known as a possessive quantifier. • Possessive quantifiers also match as much of the string as possible, but they do not rescan any portions of the string should the initial match fail. • When you follow a quantifier with a question mark (e.g., '\w*?'), it is known as a lazy quantifier. Lazy quantifiers match as little of the string as possible.

Quantifiers

Examples Quantifiers Zero or One expr?(1) Use ? to make the HTML <code> and </code> tags optional in the string. The first string, hstr1, contains one occurrence of each tag. Since the expression uses ()? around the tags, one occurrence is a match: hstr1 = '<td><a name="18854"></a><code>%%</code> </td>'; expr = '</a>(<code>)?..(</code>)? '; regexp(hstr1, expr, 'match') ans = '</a><code>%%</code> ' (2) The second string, hstr2, does not contain the code tags at all. Just the same, the expression matches because ()? allows for zero occurrences of the tags: hstr2 = '<td><a name="18854"></a>%% </td>'; expr = '</a>(<code>)?..(</code>)? '; regexp(hstr2, expr, 'match') ans = '</a>%% '

Examples Quantifiers Zero or More expr*Use * to match strings having any number of line breaks, including no line breaks at all. hstr1 = 'This string has line breaks'; expr = '.*( )*.*'; regexp(hstr1, expr, 'match') ans = 'This string has line breaks' hstr2 = 'This string has no line breaks'; regexp(hstr2, expr, 'match') ans = 'This string has no line breaks' One or More expr+Use + to verify that the HTML image source is not empty. This looks for one or more characters in the gif filename: hstr = '<a href="s12.html"><img src="b_prev.gif" border=0>'; expr = '<img src="\w+.gif'; regexp(hstr, expr, 'match') ans = '<img src="b_prev.gif'

Examples Quantifiers Exact, Minimum, and Maximum Quantities {min,max}Use {m}, {m,}, and {m,n} to verify the href syntax used in HTML. This statement requires the href to have at least one non-white-space character, followed by exactly one occurrence of .html, optionally followed by # and five to eight digits: hstr = '<a name="18749"></a><a href="s13.html#18760">'; expr = '<a href="\w{1,}(\.html){1}(\#\d{5,8}){0,1}"'; regexp(hstr, expr, 'match') ans = '<a href="s13.html#18760"' Greedy Quantifiers expr*Use * to match as many characters as possible between any < and > signs in the string. Because of the .* in the expression, regexp reads all characters in the string up to the end. Finding no closing > at the end, regexp then backs up to the </a> and ends the phrase there: hstr = '<tr valign=top><td><a name="19184"></a>xyz'; regexp(hstr, '<.*>', 'match') ans = '<tr valign=top><td><a name="19184"></a>'

Examples Quantifiers Possessive Quantifiers expr*+Except for the possessive *+ quantifier, this expression is the same as that used in the last example. Unlike the greedy quantifier, possessive quantifiers do not reevaluate parts of the string that have already been evaluated. This command scans the entire string because of the .* quantifier, but then cannot back up to locate the </a> sequence that would satisfy the expression. As a result, no match is found and regexp returns an empty cell array: regexp(hstr, '<.*+>', 'match') ans = { } Lazy Quantifiers expr*?This example shows the difference between lazy and greedy quantifiers. The first expression uses lazy .*? to match the minimum number of characters between <tr, <td, or </td tags: hstr = '<tr valign=top><td><a name="19184"></a> </td>'; regexp(hstr, '</?t.*?>', 'match') ans = '<tr valign=top>' '<td>' '</td>' The second expression uses greedy .* to match all characters from the opening <tr to the ending </td: regexp(hstr, '</?t.*>', 'match') ans = '<tr valign=top><td><a name="19184"></a> </td>'

COMP 5115 Programming Tools in Bioinformatics Week 6 Elements of an Expression