Regular Expressions in Python

HCA 741: Essential Programming for Health Informatics

Rohit Kate


String Patterns

  • Often one wants to search for a particular pattern of characters in text

    • An email address: alphanumeric characters followed by “@”, followed by one or more “multiple characters and dot”, finally followed by edu, com, org etc.

    • Wisconsin license plate (most): Three digits, followed by ‘-’, followed by three letters


Regular Expression

  • It is possible to write a tailored program that searches for a particular string pattern by looping through the characters and performing equality checks, but this is tedious and error-prone

  • A regular expression is a mechanism to succinctly specify a pattern of strings

  • Many programming languages support regular expressions, along with built-in mechanisms to search for or match them

  • The programming languages that support them all follow almost the same syntax to specify a regular expression

  • One can also use regular expressions on the Linux command prompt and in some editors, like emacs


Syntax of Regular Expressions

  • “\d” represents any digit, e.g. “1”, “2”, “9”, etc.

  • “\D” represents any non-digit, e.g. “a”, “b”, “-”

  • “\w” represents any alphanumeric character or underscore, e.g. “a”, “1”, “z”, “0”, “_”

  • “\W” represents any character that “\w” does not match, e.g. “-”, “@”

  • “\s” represents whitespace, e.g. space, tab, newline

  • “\S” represents non-whitespace

  • Most other characters represent themselves, e.g. “a” represents “a”, “-” represents “-”, “1” represents “1”, etc.


Syntax of Regular Expressions

  • A sequence of characters represents the corresponding sequence of characters

    • “\d\d” represents two consecutive digits, e.g. “12”, “33”, etc.

    • “abc” represents “abc”

    • “\w\w\s\w” represents two alphanumeric characters, followed by a space, followed by one alphanumeric character, e.g. “ab c”, “12 e” etc.


Python Regular Expressions

  • In Python, regular expressions are specified like a normal string using double-quotes

    >>> myre = "\d\d"

  • To use them as regular expressions, import the library “re”

    >>> import re

  • Use the “re.search()” function to search for a regular expression in a string

    >>> re.search(myre, "abcd10efg")
    <_sre.SRE_Match object at 0x010BB1A8>
    >>> re.search(myre, "abcdefg")
    >>>

  • If the RE is present in the string, re.search() returns a match object (its printed form varies with the Python version); otherwise it returns None

  • One can use it in an “if .. else” statement

    >>> if re.search(myre, "abcd10efg"):
    ...     print("Present :-)")
    ... else:
    ...     print("Not present :-(")
    ...
    Present :-)
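
As a quick check of the shorthand classes introduced so far, the sketch below runs a few of the slide patterns through re.search(). The helper name is_present and the sample strings are illustrative assumptions, not part of the original slides. The patterns are written in raw form (r"...") so the backslashes reach the re module unchanged.

```python
import re

def is_present(pattern, text):
    """Return True if pattern occurs anywhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

# Raw strings (r"...") pass the backslashes to the re module unchanged.
print(is_present(r"\d\d", "abcd10efg"))   # True: "10" is two consecutive digits
print(is_present(r"\d\d", "abcdefg"))     # False: no digits at all
print(is_present(r"\w\w\s\w", "12 e"))    # True: "12", a space, then "e"
print(is_present(r"\w\w\s\w", "a-b c"))   # False: no two alphanumerics directly before the space
```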


More Syntax of Regular Expressions

  • Any of the specified characters: []

    • “[abc]” represents “a” or “b” or “c”

    • “[\dabc]” represents any digit or “a” or “b” or “c”

    • Use of “-” in “[]”

      • “[a-z]” represents any lower-case letter

      • “[A-Z]” represents any upper-case letter

      • “[a-zA-Z]” represents any letter

      • “[0-9]” represents any digit

      • “[e-yF-Z0-9]” represents e to y or F to Z or 0 to 9


More Syntax of Regular Expressions

  • None of the specified characters: [^]

    • “[^abc]” represents any character except “a” or “b” or “c”

    • “[^\dabc]” represents any character except any digit or “a” or “b” or “c”

    • Use of “-” in “[^]”:

      • “[^a-z]” represents any character except a lower-case letter

      • “[^A-Z]” represents any character except an upper-case letter

      • “[^a-zA-Z]” represents any character except a letter

      • “[^0-9]” represents any character except any digit

      • “[^e-yF-Z0-9]” represents any character except e to y or F to Z or 0 to 9
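
The character sets above can be tried out with a short sketch like the following. The sample strings and variable names are assumptions chosen for illustration; each character of a test string is checked on its own against the set and against its negation.

```python
import re

# [abc] matches one of the listed characters; [^abc] matches anything else.
print(re.search(r"[abc]", "xyzb").group())    # b
print(re.search(r"[^abc]", "abcx").group())   # x

# Ranges: [e-yF-Z0-9] accepts e to y, F to Z, or a digit.
sample = "dEzf9G"
in_set = [ch for ch in sample if re.search(r"[e-yF-Z0-9]", ch)]
out_of_set = [ch for ch in sample if re.search(r"[^e-yF-Z0-9]", ch)]
print(in_set)       # ['f', '9', 'G']
print(out_of_set)   # ['d', 'E', 'z']
```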


More Syntax of Regular Expressions

  • Metacharacter: “*”

    • “a*” represents zero or more “a”, e.g. “”, “a”, “aa”, “aaa”, etc.

    • “b*” represents zero or more “b”, e.g. “”, “b”, “bb”, “bbb”, etc.

    • “\d*” represents zero or more digits, e.g. “”, “1”, “2”, “23”, “23442”, etc.

    • “\D*” represents zero or more non-digits

    • “\w*” represents zero or more alphanumeric characters

    • “\s*” represents zero or more whitespace characters

    • “[A-Z]*” represents zero or more upper-case letters


More Syntax of Regular Expressions

  • Metacharacter: “+”

    • “a+” represents one or more “a”, e.g. “a”, “aa”, “aaa”, etc.

    • “b+” represents one or more “b”, e.g. “b”, “bb”, “bbb”, etc.

    • “\d+” represents one or more digits, e.g. “1”, “2”, “23”, “23442”, etc.

    • “\D+” represents one or more non-digits

    • “\w+” represents one or more alphanumeric characters

    • “\s+” represents one or more whitespace characters

    • “[A-Z]+” represents one or more upper-case letters


More Syntax of Regular Expressions

  • Metacharacter: “?”

    • “a?” represents zero or one “a”, i.e. “” or “a”

    • “b?” represents zero or one “b”, i.e. “” or “b”

    • “\d?” represents zero or one digit, e.g. “”, “1”, “2”, “3”, etc.

    • “\D?” represents zero or one non-digit

    • “\w?” represents zero or one alphanumeric character

    • “\s?” represents zero or one whitespace character

    • “[A-Z]?” represents zero or one upper-case letter

  • Note: “[a*b+]” means “a” or “*” or “b” or “+”; metacharacters such as “*” and “+” lose their special meaning inside “[]”
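
To see the three repetition metacharacters side by side, here is a minimal sketch. It uses re.findall(), which is covered later in these slides, simply to list every match; the test string and variable names are assumptions made for the example.

```python
import re

text = "ct cat caat"

# findall returns every non-overlapping match as a list of strings
star = re.findall(r"ca*t", text)       # zero or more "a"
plus = re.findall(r"ca+t", text)       # one or more "a"
optional = re.findall(r"ca?t", text)   # zero or one "a"

print(star)       # ['ct', 'cat', 'caat']
print(plus)       # ['cat', 'caat']
print(optional)   # ['ct', 'cat']

# Inside [] the characters * and + are ordinary characters
literal = re.findall(r"[a*b+]", "a*b+c")
print(literal)    # ['a', '*', 'b', '+']
```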


More Syntax of Regular Expressions

  • Bounded number of repetitions: {m,n} (at least m and at most n)

    • “a{1,3}” represents 1 to 3 “a”, i.e. “a”, “aa”, “aaa”

    • “b{2,3}” represents 2 to 3 “b”, i.e. “bb” or “bbb”

    • “\d{3,5}” represents 3 to 5 digits, e.g. “111”, “1234”, “23456”, etc.

    • “\D{2,5}” represents 2 to 5 non-digits

    • “\w{10,100}” represents 10 to 100 alphanumeric characters

    • “\s{1,2}” represents 1 to 2 whitespace characters

  • Regular expression for Wisconsin license plate:

    “\d{3,3}-[A-Z]{3,3}” or “\d\d\d-[A-Z][A-Z][A-Z]”
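
The sketch below contrasts the hand-written character loop mentioned earlier with the one-line regular expression for the license plate. The function names and sample strings are hypothetical, and the pattern uses “{3}”, which the re module accepts as shorthand for “{3,3}”.

```python
import re

PLATE_RE = r"\d{3}-[A-Z]{3}"   # {3} is shorthand for {3,3}

def has_plate_manual(text):
    """Hand-written scan: test every 7-character window for ddd-LLL."""
    for i in range(len(text) - 6):
        chunk = text[i:i + 7]
        digits_ok = all("0" <= c <= "9" for c in chunk[:3])
        letters_ok = all("A" <= c <= "Z" for c in chunk[4:])
        if digits_ok and chunk[3] == "-" and letters_ok:
            return True
    return False

def has_plate_regex(text):
    """Same check expressed with a regular expression."""
    return re.search(PLATE_RE, text) is not None

for sample in ["My plate is 123-ABC.", "Call 555-1234", "12-ABCD"]:
    print(sample, has_plate_manual(sample), has_plate_regex(sample))
```

Only the first sample contains three digits, a hyphen, and three upper-case letters in a row, so both functions report True for it and False for the other two.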


More Syntax of Regular Expressions

  • “.” represents any character except newline

  • “\.” represents the character “.”, similarly “\*” represents the character “*” etc.

  • “abc|xyz” matches either “abc” or “xyz”

  • “()” can be used for grouping, e.g. “(xy)+” represents one or more “xy”, e.g. “xy”, “xyxy”, “xyxyxy” etc.

  • Regular expression for email addresses:

    “\w+@(\w+\.)+(edu|com|org)”
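
A small sketch of how the email pattern behaves on a few inputs follows. The addresses are made-up placeholders, not real accounts, and the variable names are assumptions for illustration.

```python
import re

email_re = r"\w+@(\w+\.)+(edu|com|org)"

candidates = [
    "someone@example.edu",        # hypothetical address
    "someone@mail.example.com",   # several "characters and dot" parts
    "someone@example",            # no dot and ending
    "someone.example.org",        # no "@"
]
found = [bool(re.search(email_re, c)) for c in candidates]
print(found)   # [True, True, False, False]

# "." matches any character except newline; "\." matches only a literal dot
print(bool(re.search(r"a.c", "abc")))    # True
print(bool(re.search(r"a\.c", "abc")))   # False
print(bool(re.search(r"a\.c", "a.c")))   # True
```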


Search vs. Match

  • We have already seen “re.search()”, which checks whether the pattern is present anywhere in the string

  • “re.match()” looks for the pattern only at the beginning of the string

    >>> if re.search("\d[A-Z]", "A1B1"):
    ...     print("Found")
    ...
    Found
    >>> if re.match("\d[A-Z]", "A1B1"):
    ...     print("Found at the beginning")
    ...
    >>> if re.match("\d[A-Z]", "1A1B"):
    ...     print("Found at the beginning")
    ...
    Found at the beginning


Returned Object of Search and Match

  • The returned object stores the portion of the string that was matched

    >>> re.search("\d[A-Z]", "A1B1")
    <_sre.SRE_Match object at 0x010BB1A8>
    >>> a = re.search("\d[A-Z]", "A1B1")

    Portion of the string that matched:

    >>> a.group()
    '1B'

    Index of the first character matched:

    >>> a.start()
    1

    Index of the last character matched plus 1:

    >>> a.end()
    3

  • Note: If the search did not succeed then “a” is None, and a.group() etc. will fail with an AttributeError.


Returned Object of Search and Match

The following is a safer way to avoid those errors:

    >>> a = re.search("\d[A-Z]", "abc123XYZ")
    >>> if a:
    ...     print("Found pattern:", a.group(), "from characters", a.start(), "to", a.end())
    ... else:
    ...     print("Pattern not found")
    ...
    Found pattern: 3X from characters 5 to 7


Split

  • One can split a string using a regular expression, analogous to <string>.split(“..”)

    >>> re.split("\d", "A1B2C3")
    ['A', 'B', 'C', '']

  • <string>.split(“,”) can be re-written as “re.split(“,”,text)”

  • <string>.split() can be re-written as “re.split(“\s+”,text)”, provided the text has no leading or trailing whitespace (otherwise re.split() also returns empty strings at the ends)
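
The relationship between the string method and re.split() can be checked with a short sketch. The sample text and variable names are assumptions; the leading and trailing spaces are deliberate, to show where the two calls differ.

```python
import re

text = "  alpha beta\tgamma  "   # note the leading and trailing spaces

plain = text.split()
with_re = re.split(r"\s+", text)
print(plain)     # ['alpha', 'beta', 'gamma']
print(with_re)   # ['', 'alpha', 'beta', 'gamma', '']

# After stripping the ends, the two calls agree
print(re.split(r"\s+", text.strip()) == plain)   # True

# A character set lets one call split on several delimiters at once
parts = re.split(r"[,;]", "a,b;c")
print(parts)     # ['a', 'b', 'c']
```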


Sub

  • One can substitute a matched portion of a string with a different string

  • A regular expression for names: “(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+”

    (Ms. or Mr. or Dr. or Prof. followed by space, followed by a capital letter, followed by a dot or rest of the first name, followed by space, followed by a capital letter, followed by rest of the last name)

  • Replace all occurrences of names in the text with “**name**”

    >>> nameRE = "(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+"
    >>> text = "Mr. John Smith went to the office of Dr. A. Wong"
    >>> deidentified_text = re.sub(nameRE, "**name**", text)
    >>> deidentified_text
    '**name** went to the office of **name**'


Capturing the Portions of the Regular Expression Matched

  • Suppose you want to capture the email name and the domain once the regular expression for an email matches the text

  • You can put extra “()” in the regular expression

    >>> emailre = "(\w+)@(\w+\.)+(edu|com|org)"
    >>> m = re.search(emailre, "My email is katerj@uwm.edu, what is yours?")
    >>> m.group()
    'katerj@uwm.edu'
    >>> m.group(1)
    'katerj'
    >>> m.group(2)
    'uwm.'
    >>> m.group(3)
    'edu'
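
One detail worth trying out: when a group is repeated with “+”, group() for that group holds only its last repetition. The sketch below illustrates this with a made-up address that has two “characters and dot” parts; the address and the sentence around it are assumptions for the example.

```python
import re

emailre = r"(\w+)@(\w+\.)+(edu|com|org)"
m = re.search(emailre, "Write to someone@mail.example.edu today")

if m:
    print(m.group())    # someone@mail.example.edu
    print(m.group(1))   # someone
    print(m.group(2))   # example.   (only the last repetition is kept)
    print(m.group(3))   # edu
    print(m.groups())   # ('someone', 'example.', 'edu')
```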


Findall

  • Finds all the occurrences of the regular expression pattern in the string and puts them in a list

    >>> ints = re.findall("\d+", "There were 20 numbers in the range of 0 to 100.")
    >>> ints
    ['20', '0', '100']

    When two or more “()” brackets are present in the regular expression, it gives a list of tuples according to how each bracket matched (with a single bracket, it gives a list of what that bracket matched).

    >>> name = "(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+"
    >>> allnames = re.findall(name, "I met Dr. A. Wong and Mr. John Smith yesterday.")
    >>> allnames
    [('Dr.', '.'), ('Mr.', 'ohn')]

    Put a bracket around the entire regular expression to get the entire matched string at the 0th index of the tuple.

    >>> name = "((Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+)"
    >>> allnames = re.findall(name, "I met Dr. A. Wong and Mr. John Smith yesterday.")
    >>> allnames
    [('Dr. A. Wong', 'Dr.', '.'), ('Mr. John Smith', 'Mr.', 'ohn')]
    >>> allnames[0][0]
    'Dr. A. Wong'
    >>> allnames[1][0]
    'Mr. John Smith'
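
Since findall() returns plain strings (or tuples of strings), a typical next step is to post-process the list. The following sketch, with variable names chosen only for illustration, converts the matched digits to integers and pulls the full names out of the tuples.

```python
import re

sentence = "There were 20 numbers in the range of 0 to 100."
ints = [int(s) for s in re.findall(r"\d+", sentence)]
print(ints, sum(ints))   # [20, 0, 100] 120

name = r"((Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+)"
matches = re.findall(name, "I met Dr. A. Wong and Mr. John Smith yesterday.")
full_names = [groups[0] for groups in matches]   # first item of each tuple
print(full_names)   # ['Dr. A. Wong', 'Mr. John Smith']
```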


Match at Beginning and End

  • Putting ^ in front makes a regular expression match from the beginning (different from its use inside [..])

    “^abc” will match “abc”, “abcd”, “abcde” but NOT “aabc”

  • Putting $ at the end makes a regular expression match at the end

    “abc$” will match “abc”, “aabc”, “babc” but NOT “abcd”

  • Putting both “^” and “$” makes a regular expression match exactly

    “^abc$” will only match “abc”
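
The anchoring rules above can be verified with re.search() in a few lines. This is a minimal sketch; the results dictionary and the second license plate string are assumptions made for the example.

```python
import re

results = {}
for s in ["abc", "abcd", "aabc"]:
    results[s] = (
        bool(re.search(r"^abc", s)),    # anchored at the beginning
        bool(re.search(r"abc$", s)),    # anchored at the end
        bool(re.search(r"^abc$", s)),   # anchored at both ends
    )
    print(s, results[s])

# Whole-string check with the license plate pattern from the earlier slide
plate = r"^\d{3,3}-[A-Z]{3,3}$"
print(bool(re.search(plate, "123-ABC")))         # True
print(bool(re.search(plate, "plate 123-ABC")))   # False
```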


Compile

  • A regular expression can be “compiled” for efficiency. This is not necessary unless it is going to be used a lot.

    >>> nameRE = "(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+"
    >>> cnameRE = re.compile(nameRE)

    Compiled version is used like a normal regular expression.

    >>> re.sub(cnameRE, "**name**", text)
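
A compiled pattern can also be used through its own methods, which is convenient when the same expression is applied to many strings. The sketch below assumes a small list of made-up notes; the second and third entries are hypothetical sample text.

```python
import re

name_re = re.compile(r"(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+")

notes = [
    "Mr. John Smith went to the office of Dr. A. Wong",
    "Prof. Jane Doe reviewed the chart",     # hypothetical sample text
    "No names appear in this sentence",      # hypothetical sample text
]

# The compiled object has its own sub() and search() methods
deidentified = [name_re.sub("**name**", note) for note in notes]
for line in deidentified:
    print(line)

has_name = [name_re.search(note) is not None for note in notes]
print(has_name)   # [True, True, False]
```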


Note

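  • At many places, including in the second textbook, you will see regular expressions written preceded with “r”; this is called the “raw” string form

    >>> print("\n")


    >>> print(r"\n")
    \n

  • In a normal string, Python interprets backslash sequences it knows, such as “\n”, before the regular expression is built; the raw form passes the backslash through unchanged. Sequences Python does not recognize, such as “\d”, are left as they are, so the patterns in these slides work with or without “r”, although recent Python versions print a warning for them in normal strings.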
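
One of the cases where the “r” matters is sketched below. It uses “\b”, a regular expression element not covered in these slides (it marks a word boundary), purely as an assumption-labeled illustration: in a normal Python string “\b” is turned into a backspace character before the re module ever sees it.

```python
import re

# A normal "\n" is one newline character; the raw form is backslash plus n
print(len("\n"), len(r"\n"))   # 1 2

# "\b" in a normal string is a backspace character; r"\b" reaches the
# re module as backslash + b, which it reads as a word boundary.
normal = re.findall("\bcat\b", "cat concat cats")
raw = re.findall(r"\bcat\b", "cat concat cats")
print(normal)   # []
print(raw)      # ['cat']
```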


Resources

  • Another powerpoint from somewhere else:

    http://www.cs.umbc.edu/691p/notes/python/pythonRE.ppt

  • Official documentation for Python 3

    http://docs.python.org/release/3.2.2/library/re.html

    A lot more detailed than needed for this course.

  • A tutorial

    http://www.macresearch.org/files/RegularExpressionsInPython.pdf

    Gets into more details than needed for this course.

