Regular Expressions in Python

Regular Expressions in Python HCA 741: Essential Programming for Health Informatics Rohit Kate

String Patterns
• Often one wants to search for a particular pattern of characters in text
• An email address: alphanumeric characters, followed by ‘@’, followed by one or more groups of “several characters and a dot”, finally followed by edu, com, org, etc.
• Wisconsin license plate (most): three digits, followed by ‘-’, followed by three letters

Regular Expression
• It is possible to write a tailored program that searches for a particular string pattern by looping through the characters and performing equality checks, but this is tedious and error-prone
• A regular expression is a mechanism for succinctly specifying a pattern of strings
• Many programming languages support regular expressions, along with built-in mechanisms to search for or match them
• The languages that support regular expressions all follow almost the same syntax for specifying them
• One can also use regular expressions at the Linux command prompt and in some editors, such as emacs

Syntax of Regular Expressions
• “\d” represents any digit, e.g. “1”, “2”, “9”, etc.
• “\D” represents any non-digit, e.g. “a”, “b”, “-”
• “\w” represents any alphanumeric character or underscore, e.g. “a”, “1”, “z”, “0”, “_”
• “\W” represents any character that is not alphanumeric or underscore, e.g. “-”, “@”
• “\s” represents whitespace, e.g. space, tab, newline
• “\S” represents non-whitespace
• Most other characters represent themselves, e.g. “a” represents “a”, “-” represents “-”, “1” represents “1”, etc.

Syntax of Regular Expressions
• A sequence of characters represents the sequence of corresponding characters
• “\d\d” represents two consecutive digits, e.g. “12”, “33”, etc.
• “abc” represents “abc”
• “\w\w\s\w” represents two alphanumeric characters, followed by a whitespace character, followed by one alphanumeric character, e.g. “ab c”, “12 e”, etc.

Python Regular Expressions
• In Python, regular expressions are written like normal strings, in quotes
    >>> myre = "\d\d"
• To use them as regular expressions, import the library “re”
    >>> import re
• Use the “re.search()” function to search for a regular expression in a string
    >>> re.search(myre, "abcd10efg")
    <_sre.SRE_Match object at 0x010BB1A8>
    >>> re.search(myre, "abcdefg")
    >>>
• If the RE is present in the string, re.search() returns a match object; otherwise it returns “None” (newer Python versions display the object as <re.Match object; span=(4, 6), match='10'>)
• One can use it in an “if .. else” statement
    >>> if re.search(myre, "abcd10efg"):
    ...     print("Present :-)")
    ... else:
    ...     print("Not present :-(")
    ...
    Present :-)
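The presence check above can be wrapped in a small helper. This is only a minimal sketch: the function name `contains_two_digits` is hypothetical, and the pattern is written with the `r` prefix (a raw string) so that the backslash reaches the `re` module unchanged.

```python
import re

# Pattern from the slide: two consecutive digits.
# The r prefix (raw string) is an addition, not part of the slide's code.
TWO_DIGITS = r"\d\d"

def contains_two_digits(text):
    """Return True if two consecutive digits occur anywhere in text."""
    # re.search returns a match object on success and None otherwise.
    return re.search(TWO_DIGITS, text) is not None

print(contains_two_digits("abcd10efg"))  # True
print(contains_two_digits("abcdefg"))    # False
```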

More Syntax of Regular Expressions
• Any of the specified characters: []
• “[abc]” represents “a” or “b” or “c”
• “[\dabc]” represents any digit or “a” or “b” or “c”
• Use of “-” in “[]”:
• “[a-z]” represents any lower-case letter
• “[A-Z]” represents any upper-case letter
• “[a-zA-Z]” represents any letter
• “[0-9]” represents any digit
• “[e-yF-Z0-9]” represents e to y or F to Z or 0 to 9

More Syntax of Regular Expressions
• None of the specified characters: [^]
• “[^abc]” represents any character except “a” or “b” or “c”
• “[^\dabc]” represents any character except any digit or “a” or “b” or “c”
• Use of “-” in “[^]”:
• “[^a-z]” represents any character except a lower-case letter
• “[^A-Z]” represents any character except an upper-case letter
• “[^a-zA-Z]” represents any character except a letter
• “[^0-9]” represents any character except a digit
• “[^e-yF-Z0-9]” represents any character except e to y or F to Z or 0 to 9
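As a quick illustration of character sets and their negation, the sketch below picks out the first character of each kind from a made-up sample string (the string and variable names are assumptions for illustration only).

```python
import re

text = "Room B-12, floor 3"  # hypothetical sample text

# [A-Z] matches one upper-case letter; [^A-Z] matches one character that is not.
first_upper = re.search(r"[A-Z]", text).group()
first_non_upper = re.search(r"[^A-Z]", text).group()

# [0-9] matches one digit; [^0-9a-zA-Z] matches anything that is
# neither a digit nor a letter (here, the first space).
first_digit = re.search(r"[0-9]", text).group()
first_symbol = re.search(r"[^0-9a-zA-Z]", text).group()

print(first_upper, first_non_upper, first_digit, repr(first_symbol))
# R o 1 ' '
```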

More Syntax of Regular Expressions
• Metacharacter: “*”
• “a*” represents zero or more “a”, e.g. “”, “a”, “aa”, “aaa”, etc.
• “b*” represents zero or more “b”, e.g. “”, “b”, “bb”, “bbb”, etc.
• “\d*” represents zero or more digits, e.g. “”, “1”, “2”, “23”, “23442”, etc.
• “\D*” represents zero or more non-digits
• “\w*” represents zero or more alphanumeric characters
• “\s*” represents zero or more whitespace characters
• “[A-Z]*” represents zero or more upper-case letters

More Syntax of Regular Expressions
• Metacharacter: “+”
• “a+” represents one or more “a”, e.g. “a”, “aa”, “aaa”, etc.
• “b+” represents one or more “b”, e.g. “b”, “bb”, “bbb”, etc.
• “\d+” represents one or more digits, e.g. “1”, “2”, “23”, “23442”, etc.
• “\D+” represents one or more non-digits
• “\w+” represents one or more alphanumeric characters
• “\s+” represents one or more whitespace characters
• “[A-Z]+” represents one or more upper-case letters

More Syntax of Regular Expressions
• Metacharacter: “?”
• “a?” represents zero or one “a”, i.e. “” or “a”
• “b?” represents zero or one “b”, i.e. “” or “b”
• “\d?” represents zero or one digit, e.g. “”, “1”, “2”, “3”, etc.
• “\D?” represents zero or one non-digit
• “\w?” represents zero or one alphanumeric character
• “\s?” represents zero or one whitespace character
• “[A-Z]?” represents zero or one upper-case letter
• Note: “[a*b+]” means “a” or “*” or “b” or “+”. Metacharacters lose their special meanings inside “[]”
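The three quantifiers can be compared side by side by asking whether a pattern describes an entire string. The sketch below uses `re.fullmatch` (available in Python 3.4 and later), which the slides do not cover; the helper name `whole_string_matches` is hypothetical.

```python
import re

def whole_string_matches(pattern, text):
    """Return True if pattern describes the entire text."""
    return re.fullmatch(pattern, text) is not None

# "*" allows zero or more, so the empty string qualifies.
print(whole_string_matches(r"a*", ""))       # True
print(whole_string_matches(r"a*", "aaa"))    # True
# "+" needs at least one occurrence.
print(whole_string_matches(r"a+", ""))       # False
# "?" allows at most one occurrence.
print(whole_string_matches(r"a?", "aa"))     # False
# Inside [], "*" and "+" are ordinary characters.
print(whole_string_matches(r"[a*b+]", "*"))  # True
```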

More Syntax of Regular Expressions
• Bounded number of repetitions: {m,n} means at least m and at most n
• “a{1,3}” represents 1 to 3 “a”, i.e. “a”, “aa”, “aaa”
• “b{2,3}” represents 2 to 3 “b”, i.e. “bb” or “bbb”
• “\d{3,5}” represents 3 to 5 digits, e.g. “111”, “1234”, “23456”, etc.
• “\D{2,5}” represents 2 to 5 non-digits
• “\w{10,100}” represents 10 to 100 alphanumeric characters
• “\s{1,2}” represents 1 to 2 whitespace characters
• Regular expression for Wisconsin license plate: “\d{3,3}-[A-Z]{3,3}” or “\d\d\d-[A-Z][A-Z][A-Z]” (“{3,3}” can also be written as “{3}”)
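The license plate pattern can be tried out as follows. This is a sketch under assumptions: the helper name `find_plate` and the sample sentences are made up, and `\d{3}` is used as shorthand for `\d{3,3}`.

```python
import re

# Pattern from the slide; \d{3} is shorthand for \d{3,3}.
PLATE = r"\d{3}-[A-Z]{3}"

def find_plate(text):
    """Return the first plate-like substring in text, or None."""
    m = re.search(PLATE, text)
    return m.group() if m else None

print(find_plate("Seen near the clinic: 123-ABC"))  # 123-ABC
print(find_plate("No plate here: 12-AB"))           # None
```

Because `re.search` looks anywhere in the string, this pattern also finds `234-ABC` inside a longer run such as `1234-ABCD`; whether that is acceptable depends on the application.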

More Syntax of Regular Expressions
• “.” represents any character except newline
• “\.” represents the character “.”; similarly, “\*” represents the character “*”, etc.
• “abc|xyz” matches either “abc” or “xyz”
• “()” can be used for grouping, e.g. “(xy)+” represents one or more “xy”, e.g. “xy”, “xyxy”, “xyxyxy”, etc.
• Regular expression for email addresses: “\w+@(\w+\.)+(edu|com|org)”
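The email pattern can be exercised as in the sketch below. The helper name `first_email` and the sample addresses are hypothetical; the pattern itself is the one from the slide.

```python
import re

# Email pattern from the slide.
EMAIL = r"\w+@(\w+\.)+(edu|com|org)"

def first_email(text):
    """Return the first substring that looks like an email address, or None."""
    m = re.search(EMAIL, text)
    return m.group() if m else None

print(first_email("Write to jdoe@cs.example.edu today"))  # jdoe@cs.example.edu
print(first_email("no address here"))                     # None
# "\w" does not match ".", so only the part after the dot is picked up:
print(first_email("j.doe@example.com"))                   # doe@example.com
```

The last call shows a limitation of this simple pattern: user names containing a dot are only partially matched.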

Search vs. Match
• We have already seen “re.search()”, which checks whether the pattern is present anywhere in the string
• “re.match()” looks for the pattern only at the beginning of the string
    >>> if re.search("\d[A-Z]", "A1B1"):
    ...     print("Found")
    ...
    Found
    >>> if re.match("\d[A-Z]", "A1B1"):
    ...     print("Found at the beginning")
    ...
    >>> if re.match("\d[A-Z]", "1A1B"):
    ...     print("Found at the beginning")
    ...
    Found at the beginning
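The difference can be seen by running both functions on the same inputs. A minimal sketch; the helper name `search_and_match` is hypothetical.

```python
import re

PATTERN = r"\d[A-Z]"  # a digit followed by an upper-case letter

def search_and_match(pattern, text):
    """Return (found anywhere, found at the beginning) as two booleans."""
    found_anywhere = re.search(pattern, text) is not None
    found_at_start = re.match(pattern, text) is not None
    return found_anywhere, found_at_start

for text in ["A1B1", "1A1B"]:
    print(text, search_and_match(PATTERN, text))
# A1B1 (True, False)
# 1A1B (True, True)
```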

Returned Object of Search and Match
• The returned object stores the portion of the string that was matched
    >>> re.search("\d[A-Z]", "A1B1")
    <_sre.SRE_Match object at 0x010BB1A8>
    >>> a = re.search("\d[A-Z]", "A1B1")
• Portion of the string that matched:
    >>> a.group()
    '1B'
• Index of the first character matched:
    >>> a.start()
    1
• Index of the last character matched plus 1:
    >>> a.end()
    3
• Note: If the search did not succeed, then “a” is None, and a.group() etc. will fail with an AttributeError.

Returned Object of Search and Match
• The following is a safer way that avoids those errors:
    >>> a = re.search("\d[A-Z]", "abc123XYZ")
    >>> if a:
    ...     print("Found pattern:", a.group(), "from characters", a.start(), "to", a.end())
    ... else:
    ...     print("Pattern not found")
    ...
    Found pattern: 3X from characters 5 to 7
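The same safe check can be packaged as a function that returns the matched text together with its position. A minimal sketch; the function name `locate` is hypothetical.

```python
import re

def locate(pattern, text):
    """Return (matched_text, start, end), or None when the pattern is absent."""
    m = re.search(pattern, text)
    if m is None:
        return None
    return m.group(), m.start(), m.end()

text = "abc123XYZ"
print(locate(r"\d[A-Z]", text))      # ('3X', 5, 7)
print(locate(r"\d[A-Z]", "abcdef"))  # None

# Slicing the string with start and end reproduces the matched text.
matched, start, end = locate(r"\d[A-Z]", text)
print(text[start:end] == matched)    # True
```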

Split
• One can split a string using a regular expression, analogous to <string>.split("..")
    >>> re.split("\d", "A1B2C3")
    ['A', 'B', 'C', '']
• <string>.split(",") can be rewritten as re.split(",", text)
• <string>.split() can be approximated by re.split("\s+", text); unlike <string>.split(), the re version returns empty strings when the text begins or ends with whitespace
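The sketch below compares the string method with `re.split` on a made-up sample string that has leading and trailing whitespace, which is where the two differ.

```python
import re

text = "  blood   pressure\t120 "  # hypothetical sample with extra whitespace

plain = text.split()
with_re = re.split(r"\s+", text)
stripped_first = re.split(r"\s+", text.strip())

print(plain)           # ['blood', 'pressure', '120']
print(with_re)         # ['', 'blood', 'pressure', '120', '']
print(stripped_first)  # ['blood', 'pressure', '120']
```

Stripping the text before calling `re.split` is one way to get the same result as `<string>.split()`.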

Sub
• One can substitute a matched portion of a string with a different string
• A regular expression for names: “(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+” (Ms. or Mr. or Dr. or Prof., followed by a space, followed by a capital letter, followed by a dot or the rest of the first name, followed by a space, followed by a capital letter, followed by the rest of the last name)
• Replace all occurrences of names in the text with “**name**”
    >>> nameRE = "(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+"
    >>> text = "Mr. John Smith went to the office of Dr. A. Wong"
    >>> deidentified_text = re.sub(nameRE, "**name**", text)
    >>> deidentified_text
    '**name** went to the office of **name**'

Capturing the Portions of the Regular Expression Matched
• Suppose you want to capture the user name and the domain once the regular expression for an email matches the text
• You can put extra “()” in the regular expression
    >>> emailre = "(\w+)@(\w+\.)+(edu|com|org)"
    >>> m = re.search(emailre, "My email is katerj@uwm.edu, what is yours?")
    >>> m.group()
    'katerj@uwm.edu'
    >>> m.group(1)
    'katerj'
    >>> m.group(2)
    'uwm.'
    >>> m.group(3)
    'edu'
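All captured groups can also be retrieved at once with the match object's `groups()` method. The sketch below uses a hypothetical helper `email_parts` and a made-up second address to show what happens when a repeated group such as `(\w+\.)+` matches more than once.

```python
import re

# Email pattern from the slide, with the user name captured as group 1.
EMAIL = r"(\w+)@(\w+\.)+(edu|com|org)"

def email_parts(text):
    """Return the tuple of captured groups for the first email, or None."""
    m = re.search(EMAIL, text)
    return m.groups() if m else None

print(email_parts("My email is katerj@uwm.edu, what is yours?"))
# ('katerj', 'uwm.', 'edu')

# A repeated group keeps only its last repetition ("example.", not "cs.").
print(email_parts("jdoe@cs.example.edu"))
# ('jdoe', 'example.', 'edu')
```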

Findall
• Finds all the occurrences of the regular expression pattern in the string and puts them in a list
    >>> ints = re.findall("\d+", "There were 20 numbers in the range of 0 to 100.")
    >>> ints
    ['20', '0', '100']
• When “()” brackets are present in the regular expression, it gives a list of tuples according to how each bracket matched.
    >>> name = "(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+"
    >>> allnames = re.findall(name, "I met Dr. A. Wong and Mr. John Smith yesterday.")
    >>> allnames
    [('Dr.', '.'), ('Mr.', 'ohn')]
• Put a bracket around the entire regular expression to get the entire matched string at the 0th index of each tuple.
    >>> name = "((Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+)"
    >>> allnames = re.findall(name, "I met Dr. A. Wong and Mr. John Smith yesterday.")
    >>> allnames
    [('Dr. A. Wong', 'Dr.', '.'), ('Mr. John Smith', 'Mr.', 'ohn')]
    >>> allnames[0][0]
    'Dr. A. Wong'
    >>> allnames[1][0]
    'Mr. John Smith'
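Two common follow-up steps are converting the found digit strings to integers and keeping only the full name from each tuple. A minimal sketch; the function names `all_integers` and `full_names` are hypothetical.

```python
import re

# Name pattern from the slide, wrapped in an outer group so the
# whole match is available at index 0 of each tuple.
NAME = r"((Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+)"

def all_integers(text):
    """Return every run of digits in text as an int."""
    return [int(s) for s in re.findall(r"\d+", text)]

def full_names(text):
    """Return only the complete matched names."""
    return [groups[0] for groups in re.findall(NAME, text)]

print(all_integers("There were 20 numbers in the range of 0 to 100."))
# [20, 0, 100]
print(full_names("I met Dr. A. Wong and Mr. John Smith yesterday."))
# ['Dr. A. Wong', 'Mr. John Smith']
```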

Match at Beginning and End
• Putting ^ in front makes a regular expression match from the beginning (different from its use inside [..]): “^abc” will match “abc”, “abcd”, “abcde” but NOT “aabc”
• Putting $ at the end makes a regular expression match at the end: “abc$” will match “abc”, “aabc”, “babc” but NOT “abcd”
• Putting both “^” and “$” makes a regular expression match exactly: “^abc$” will only match “abc”
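The three anchored patterns can be checked against the slide's examples with `re.search`. A minimal sketch; the helper name `is_present` is hypothetical.

```python
import re

def is_present(pattern, text):
    """Return True if re.search finds pattern in text."""
    return re.search(pattern, text) is not None

for text in ["abc", "abcd", "aabc"]:
    print(text,
          is_present(r"^abc", text),   # must start with abc
          is_present(r"abc$", text),   # must end with abc
          is_present(r"^abc$", text))  # must be exactly abc
# abc True True True
# abcd True False False
# aabc False True False
```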

Compile
• A regular expression can be “compiled” for efficiency. This is not necessary unless it is going to be used a lot.
    >>> nameRE = "(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+"
    >>> cnameRE = re.compile(nameRE)
• The compiled version is used like a normal regular expression.
    >>> re.sub(cnameRE, "**name**", text)
    '**name** went to the office of **name**'
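A compiled pattern also has its own methods (`sub`, `search`, `findall`, and so on), so it can be called directly instead of being passed to the `re` functions. A minimal sketch; the function name `deidentify` is hypothetical.

```python
import re

# Compile once, then reuse for every piece of text.
NAME = re.compile(r"(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+")

def deidentify(text):
    """Replace every name matched by NAME with the placeholder **name**."""
    return NAME.sub("**name**", text)

print(deidentify("Mr. John Smith went to the office of Dr. A. Wong"))
# **name** went to the office of **name**
```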

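Note
• In many places, including the second textbook, you will see regular expressions written with a preceding “r”; this is called the “raw” string form. In a raw string a backslash is kept as an ordinary character: the first call below prints a newline, the second prints the two characters “\” and “n”
    >>> print("\n")

    >>> print(r"\n")
    \n
• This prevents surprises with “\”: in a normal string, sequences such as “\n” or “\b” are converted by Python before the regular expression sees them. The patterns in these slides behave the same with or without “r”, but recent Python versions issue a warning for sequences such as “\d” in a normal string, so the raw form is the safer habit.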
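One case where the “r” prefix changes the result is `\b`, which is a backspace character in a normal Python string but means a word boundary to the `re` module. Word boundaries are not covered in these slides; they are used here only to show the difference, and the sample sentence is made up.

```python
import re

# A normal string turns \n into one newline character; a raw string keeps both characters.
print(len("\n"), len(r"\n"))  # 1 2

sentence = "a cat sat"

# Without r, Python converts \b to a backspace character, so nothing matches.
without_raw = re.search("\bcat\b", sentence)
# With r, the re module receives \b and treats it as a word boundary.
with_raw = re.search(r"\bcat\b", sentence)

print(without_raw)       # None
print(with_raw.group())  # cat
```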

Resources
• Another PowerPoint from elsewhere: http://www.cs.umbc.edu/691p/notes/python/pythonRE.ppt
• Official documentation for Python 3: http://docs.python.org/release/3.2.2/library/re.html (a lot more detailed than needed for this course)
• A tutorial: http://www.macresearch.org/files/RegularExpressionsInPython.pdf (gets into more detail than needed for this course)
