Regular Expressions in Python

HCA 741: Essential Programming for Health Informatics

Rohit Kate

String Patterns
  • Often one wants to search for a particular pattern of characters in text
    • An email address: alphanumeric characters followed by ‘@’, followed by one or more groups of “characters and a dot”, and finally edu, com, org, etc.
    • Wisconsin license plate (most): three digits, followed by ‘-’, followed by three letters
Regular Expression
  • It is possible to write a tailored program that searches for a particular string pattern by looping through the characters and performing equality checks, but this is tedious and error-prone
  • A regular expression is a mechanism to succinctly specify a pattern of strings
  • Many programming languages support regular expressions, along with built-in mechanisms to search for or match them
  • The programming languages that support them all follow almost the same syntax for specifying a regular expression
  • One can also use regular expressions at the Linux command prompt and in some editors, such as Emacs
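
To make the contrast concrete, here is a minimal sketch of both approaches for the Wisconsin license plate pattern from the previous slide. The function names are hypothetical, upper-case letters are assumed, and the pattern uses syntax that the following slides explain.

```python
import re

def is_plate_by_hand(text):
    """Check 'three digits, dash, three upper-case letters' one character at a time."""
    if len(text) != 7:
        return False
    for i, ch in enumerate(text):
        if i < 3:
            if ch not in "0123456789":
                return False
        elif i == 3:
            if ch != "-":
                return False
        elif not ("A" <= ch <= "Z"):
            return False
    return True

def is_plate_by_regex(text):
    """The same check expressed as a single regular expression."""
    return re.fullmatch(r"[0-9]{3}-[A-Z]{3}", text) is not None

for candidate in ["123-ABC", "12-ABCD", "123-abc"]:
    print(candidate, is_plate_by_hand(candidate), is_plate_by_regex(candidate))
```

Both functions agree on these inputs, but the hand-written loop needs a separate branch for every position in the string, which is where mistakes tend to creep in.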
Syntax of Regular Expressions
  • “\d” represents any digit, e.g. “1”, “2”, “9”, etc.
  • “\D” represents any non-digit, e.g. “a”, “b”, “-”
  • “\w” represents any alphanumeric character or underscore, e.g. “a”, “1”, “z”, “0”, “_”
  • “\W” represents any character that is not alphanumeric or underscore, e.g. “-”, “@”
  • “\s” represents whitespace, e.g. space, tab, newline
  • “\S” represents non-whitespace
  • Most other characters represent themselves, e.g. “a” represents “a”, “-” represents “-”, “1” represents “1”, etc.
Syntax of Regular Expressions
  • A sequence of characters represents the sequence of corresponding characters
    • “\d\d” represents two consecutive digits, e.g. “12”, “33”, etc.
    • “abc” represents “abc”
    • “\w\w\s\w” represents two alphanumeric characters, followed by a whitespace character, followed by one alphanumeric character, e.g. “ab c”, “12 e”, etc.
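
The character classes above can be tried out with a short sketch like the following. It relies on Python's re module, which the next slide introduces, and on re.fullmatch(), which is not covered in these slides: it succeeds only when the whole text fits the pattern. The test strings are made up for illustration.

```python
import re

# Each pair is (pattern, text).
checks = [
    (r"\d", "7"),
    (r"\D", "-"),
    (r"\w", "_"),            # \w also accepts the underscore
    (r"\W", "@"),
    (r"\s", "\t"),           # a tab counts as whitespace
    (r"\d\d", "12"),
    (r"\w\w\s\w", "ab c"),
    (r"\w\w\s\w", "abc"),    # too short and no whitespace, so no match
]

results = [re.fullmatch(pattern, text) is not None for pattern, text in checks]
for (pattern, text), ok in zip(checks, results):
    print(repr(pattern), repr(text), ok)
```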
Python Regular Expressions
  • In Python, regular expressions are written as ordinary strings in quotes; the “r” prefix (a raw string) keeps every backslash exactly as typed

>>> myre = r"\d\d"

  • To use them as regular expressions, import the library “re”

>>> import re

  • Use the “re.search()” function to search for a regular expression in a string

>>> re.search(myre, "abcd10efg")
<_sre.SRE_Match object at 0x010BB1A8>
>>> re.search(myre, "abcdefg")
>>>

  • If the pattern is present in the string, “re.search()” returns a match object; otherwise it returns “None”
  • One can use it in an “if .. else” statement

>>> if re.search(myre, "abcd10efg"):
...     print("Present :-)")
... else:
...     print("Not present :-(")
...
Present :-)

More Syntax of Regular Expressions
  • Any of the specified characters: []
    • “[abc]” represents “a” or “b” or “c”
    • “[\dabc]” represents any digit or “a” or “b” or “c”
    • Use of “-” in “[]”
      • “[a-z]” represents any lower-case letter
      • “[A-Z]” represents any upper-case letter
      • “[a-zA-Z]” represents any letter
      • “[0-9]” represents any digit
      • “[e-yF-Z0-9]” represents e to y or F to Z or 0 to 9
More Syntax of Regular Expressions
  • None of the specified characters: [^]
    • “[^abc]” represents any character except “a” or “b” or “c”
    • “[^\dabc]” represents any character except any digit or “a” or “b” or “c”
    • Use of “-” in “[^]”:
      • “[^a-z]” represents any character except a lower-case letter
      • “[^A-Z]” represents any character except an upper-case letter
      • “[^a-zA-Z]” represents any character except a letter
      • “[^0-9]” represents any character except any digit
      • “[^e-yF-Z0-9]” represents any character except e to y or F to Z or 0 to 9
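
As a rough illustration of “[]” and “[^]”, the sketch below pulls individual characters out of a made-up sentence. It uses re.findall(), which is covered on a later slide and returns every match as a list.

```python
import re

text = "Room 12B, floor 3"

lower = re.findall(r"[a-z]", text)            # lower-case letters only
digits = re.findall(r"[0-9]", text)           # digits only
not_letters = re.findall(r"[^a-zA-Z]", text)  # everything that is not a letter

print(lower)
print(digits)
print(not_letters)
```

Note that the negated set also picks up spaces and punctuation, since “[^a-zA-Z]” excludes only letters.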
More Syntax of Regular Expressions
  • Metacharacter: “*”
    • “a*” represents zero or more “a”, e.g. “”, “a”, “aa”, “aaa”, etc.
    • “b*” represents zero or more “b”, e.g. “”, “b”, “bb”, “bbb”, etc.
    • “\d*” represents zero or more digits, e.g. “”, “1”, “2”, “23”, “23442”, etc.
    • “\D*” represents zero or more non-digits
    • “\w*” represents zero or more alphanumeric characters
    • “\s*” represents zero or more whitespace characters
    • “[A-Z]*” represents zero or more upper-case letters
More Syntax of Regular Expressions
  • Metacharacter: “+”
    • “a+” represents one or more “a”, e.g. “a”, “aa”, “aaa”, etc.
    • “b+” represents one or more “b”, e.g. “b”, “bb”, “bbb”, etc.
    • “\d+” represents one or more digits, e.g. “1”, “2”, “23”, “23442”, etc.
    • “\D+” represents one or more non-digits
    • “\w+” represents one or more alphanumeric characters
    • “\s+” represents one or more whitespace characters
    • “[A-Z]+” represents one or more upper-case letters
More Syntax of Regular Expressions
  • Metacharacter: “?”
    • “a?” represents zero or one “a”, i.e. “” or “a”
    • “b?” represents zero or one “b”, i.e. “” or “b”
    • “\d?” represents zero or one digit, e.g. “”, “1”, “2”, “3”, etc.
    • “\D?” represents zero or one non-digit
    • “\w?” represents zero or one alphanumeric character
    • “\s?” represents zero or one whitespace character
    • “[A-Z]?” represents zero or one upper-case letter
  • Note: “[a*b+]” means “a” or “*” or “b” or “+”. Most metacharacters lose their special meaning inside “[]”

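
The three repetition metacharacters can be compared side by side with a small sketch. The helper name whole() is hypothetical, and re.fullmatch() is an assumption beyond these slides: it requires the entire text to fit the pattern.

```python
import re

def whole(pattern, text):
    """Hypothetical helper: True when the entire text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

print(whole(r"a*", ""))          # True: zero "a"s is allowed by *
print(whole(r"a+", ""))          # False: + needs at least one "a"
print(whole(r"a?", "aa"))        # False: ? allows at most one "a"
print(whole(r"\d+", "23442"))    # True: one or more digits
print(whole(r"[a*b+]", "*"))     # True: inside [] the * is a literal character
print(whole(r"[a*b+]", "aab"))   # False: a set stands for exactly one character
```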

More Syntax of Regular Expressions
  • Bounded number of repetitions: {m,n} (at least m, at most n)
    • “a{1,3}” represents 1 to 3 “a”, i.e. “a”, “aa”, “aaa”
    • “b{2,3}” represents 2 to 3 “b”, i.e. “bb” or “bbb”
    • “\d{3,5}” represents 3 to 5 digits, e.g. “111”, “1234”, “23456”, etc.
    • “\D{2,5}” represents 2 to 5 non-digits
    • “\w{10,100}” represents 10 to 100 alphanumeric characters
    • “\s{1,2}” represents 1 to 2 whitespace characters
  • Regular expression for Wisconsin license plate:

“\d{3,3}-[A-Z]{3,3}” or “\d\d\d-[A-Z][A-Z][A-Z]” (“{3,3}” can be shortened to “{3}”)
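
The license plate pattern can be exercised with a sketch like this one. The helper is_plate() and the sample plates are made up for illustration, and re.fullmatch() (not covered in these slides) is used so that extra characters cause a rejection.

```python
import re

# Pattern from the slide, written with the shorter {3} form.
PLATE = r"\d{3}-[A-Z]{3}"

def is_plate(text):
    """Hypothetical helper: True when the whole text is a plate number."""
    return re.fullmatch(PLATE, text) is not None

print(is_plate("123-ABC"))    # three digits, dash, three upper-case letters
print(is_plate("1234-ABC"))   # four digits is one too many
print(is_plate("123-AB"))     # only two letters

# Repetition is greedy: search takes as many "a"s as {1,3} allows.
longest = re.search(r"a{1,3}", "aaaa").group()
print(longest)
```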

More Syntax of Regular Expressions
  • “.” represents any character except newline
  • “\.” represents the character “.”, similarly “\*” represents the character “*” etc.
  • “abc|xyz” matches either “abc” or “xyz”
  • “()” can be used for grouping, e.g. “(xy)+” represents one or more “xy”, e.g. “xy”, “xyxy”, “xyxyxy” etc.
  • Regular expression for email addresses:

“\w+@(\w+\.)+(edu|com|org)”
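
A quick way to see what the email pattern accepts is to run it over a few strings. The addresses below are invented placeholders, and re.fullmatch() is an assumption beyond these slides. The pattern is deliberately simple: for instance, it rejects a dot or hyphen before the “@”.

```python
import re

EMAIL = r"\w+@(\w+\.)+(edu|com|org)"

samples = [
    "someone@example.edu",        # one "characters and dot" group
    "someone@mail.example.com",   # two such groups
    "someone@example",            # no dot and no edu/com/org ending
    "no-at-sign.example.org",     # no "@" at all
]

# Keep only the samples that fit the pattern from start to end.
matched = [s for s in samples if re.fullmatch(EMAIL, s)]
print(matched)
```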

Search vs. Match
  • We have already seen “re.search()”, which checks whether the pattern is present anywhere in the string
  • “re.match()” looks for the pattern only at the beginning of the string

>>> if re.search(r"\d[A-Z]", "A1B1"):
...     print("Found")
...
Found
>>> if re.match(r"\d[A-Z]", "A1B1"):
...     print("Found at the beginning")
...
>>> if re.match(r"\d[A-Z]", "1A1B"):
...     print("Found at the beginning")
...
Found at the beginning

Returned Object of Search and Match
  • The returned object stores the portion of the string that was matched

>>> re.search(r"\d[A-Z]", "A1B1")
<_sre.SRE_Match object at 0x010BB1A8>
>>> a = re.search(r"\d[A-Z]", "A1B1")

Portion of the string that matched:

>>> a.group()
'1B'

Index of the first character matched:

>>> a.start()
1

Index of the last character matched plus 1:

>>> a.end()
3

  • Note: If the search did not succeed, the result is “None”, so calling a.group() etc. raises an AttributeError.
Returned Object of Search and Match

The following is a safer way to avoid those errors:

>>> a = re.search(r"\d[A-Z]", "abc123XYZ")
>>> if a:
...     print("Found pattern:", a.group(), "from characters", a.start(), "to", a.end())
... else:
...     print("Pattern not found")
...
Found pattern: 3X from characters 5 to 7

Split
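  • One can split a string using a regular expression, analogous to <string>.split("..")

>>> re.split(r"\d", "A1B2C3")
['A', 'B', 'C', '']

  • <string>.split(",") can be rewritten as re.split(",", text)
  • <string>.split() can be approximated by re.split(r"\s+", text); unlike <string>.split(), the regular-expression version returns empty strings when the text starts or ends with whitespace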
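
The difference between the two kinds of split shows up at the edges of the string. The sketch below uses a made-up text, and also splits on either of two separators, which the plain string method cannot do in one call.

```python
import re

text = "  alpha  beta\tgamma "

plain = text.split()                      # leading/trailing whitespace is dropped
regex = re.split(r"\s+", text)            # empty strings appear at both edges
trimmed = re.split(r"\s+", text.strip())  # stripping first gives the same result as plain

# A regular expression can accept several separators at once.
fields = re.split(r"[,;]\s*", "a, b;c")

print(plain)
print(regex)
print(trimmed)
print(fields)
```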
Sub
  • One can substitute a matched portion of a string with a different string
  • A regular expression for names: “(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+”

(Ms. or Mr. or Dr. or Prof. followed by space, followed by a capital letter, followed by a dot or rest of the first name, followed by space, followed by a capital letter, followed by rest of the last name)

  • Replace all occurrences of names in text with “**name**”

>>> nameRE = r"(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+"
>>> text = "Mr. John Smith went to the office of Dr. A. Wong"
>>> deidentified_text = re.sub(nameRE, "**name**", text)
>>> deidentified_text
'**name** went to the office of **name**'

Capturing the Portions of the Regular Expression Matched
  • Suppose you want to capture the user name and the domain once the regular expression for an email address matches the text
  • You can put extra “()” in the regular expression
  • Note: a repeated group such as “(\w+\.)+” keeps only its last repetition

>>> emailre = r"(\w+)@(\w+\.)+(edu|com|org)"
>>> m = re.search(emailre, "My email is katerj@uwm.edu, what is yours?")
>>> m.group()
'katerj@uwm.edu'
>>> m.group(1)
'katerj'
>>> m.group(2)
'uwm.'
>>> m.group(3)
'edu'
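
If the whole domain is wanted in a single group, one possible variation is sketched below. It is not from the slides: it wraps the domain in one outer group and uses the non-capturing form “(?:...)” for the inner groups so they do not take up group numbers. The address is an invented placeholder.

```python
import re

ORIGINAL = r"(\w+)@(\w+\.)+(edu|com|org)"
# Assumption: outer group around the whole domain, inner groups non-capturing.
WHOLE_DOMAIN = r"(\w+)@((?:\w+\.)+(?:edu|com|org))"

text = "Write to someone@mail.example.edu for details."

# With the original pattern, group 2 holds only the last "characters and dot" piece.
last_piece = re.search(ORIGINAL, text).group(2)

m = re.search(WHOLE_DOMAIN, text)
user, domain = m.group(1), m.group(2)

print(last_piece)
print(user, domain)
```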

Findall
  • Finds all the occurrences of the regular expression pattern in the string and puts them in a list

>>> ints = re.findall(r"\d+", "There were 20 numbers in the range of 0 to 100.")
>>> ints
['20', '0', '100']

When the regular expression contains more than one “()” group, findall returns a list of tuples holding what each group matched, rather than the full matches.

>>> name = r"(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+"
>>> allnames = re.findall(name, "I met Dr. A. Wong and Mr. John Smith yesterday.")
>>> allnames
[('Dr.', '.'), ('Mr.', 'ohn')]

Put a bracket around the entire regular expression to get the entire matched string at the 0th index of each tuple.

>>> name = r"((Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+)"
>>> allnames = re.findall(name, "I met Dr. A. Wong and Mr. John Smith yesterday.")
>>> allnames
[('Dr. A. Wong', 'Dr.', '.'), ('Mr. John Smith', 'Mr.', 'ohn')]
>>> allnames[0][0]
'Dr. A. Wong'
>>> allnames[1][0]
'Mr. John Smith'
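
An alternative that avoids the extra outer bracket is re.finditer(), which is not covered in these slides. It yields one match object per occurrence, so group() gives the full match and group(1) the first bracket. A minimal sketch:

```python
import re

NAME = r"(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+"
text = "I met Dr. A. Wong and Mr. John Smith yesterday."

# Each item from finditer is a match object, as returned by re.search().
full_names = [m.group() for m in re.finditer(NAME, text)]
titles = [m.group(1) for m in re.finditer(NAME, text)]

print(full_names)
print(titles)
```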

Match at Beginning and End
  • Putting ^ in front makes a regular expression match from the beginning (different from its use inside [..])

“^abc” will match “abc”, “abcd”, “abcde” but NOT “aabc”

  • Putting $ at the end makes a regular expression match at the end

“abc$” will match “abc”, “aabc”, “babc” but NOT “abcd”

  • Putting both “^” and “$” makes a regular expression match exactly

“^abc$” will only match “abc”
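
The three cases can be checked with re.search() over a small list of made-up strings:

```python
import re

words = ["abc", "abcd", "aabc", "xabcx"]

starts = [w for w in words if re.search(r"^abc", w)]   # pattern at the beginning
ends = [w for w in words if re.search(r"abc$", w)]     # pattern at the end
exact = [w for w in words if re.search(r"^abc$", w)]   # the whole string is the pattern

print(starts)
print(ends)
print(exact)
```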

Compile
  • A regular expression can be “compiled” for efficiency. This is not necessary unless it is going to be used many times.

>>> nameRE = r"(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+"
>>> cnameRE = re.compile(nameRE)

The compiled version can be used like a normal regular expression, or through its own methods such as cnameRE.sub().

>>> re.sub(cnameRE, "**name**", text)
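
Compiling pays off when the same pattern is applied to many strings. The sketch below compiles the name pattern once and reuses it in a loop; the notes and the names in them are invented for illustration.

```python
import re

NAME = re.compile(r"(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+")

notes = [
    "Mr. John Smith went to the office of Dr. A. Wong",
    "Prof. Ada Byron reviewed the chart",
    "No names here",
]

# The compiled object offers sub(), search(), findall() etc. as methods.
cleaned = [NAME.sub("**name**", note) for note in notes]

for line in cleaned:
    print(line)
```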

Note
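  • In many places, including the second textbook, you will see regular expressions written with a preceding “r”; this is called the “raw” string form

>>> print("\n")


>>> print(r"\n")
\n

  • In a raw string Python leaves backslashes untouched, so the pattern reaches “re” exactly as written. Without “r”, sequences such as “\d” still work but trigger a warning in recent Python versions, and sequences such as “\b” silently change meaning (“\b” becomes a backspace character), so the raw form is the safer habit.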
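
One case where the “r” prefix changes the result is “\b”. In a regular expression it means “word boundary” (syntax not covered in these slides), but in a normal Python string it is the backspace character. A minimal sketch with a made-up text:

```python
import re

# A normal string turns \n into one newline character; a raw string keeps both characters.
print(len("\n"), len(r"\n"))

text = "cat concat"

# "\b" here is a backspace character, which does not occur in the text.
plain = re.findall("\bcat", text)
# r"\b" reaches the re module intact and means "word boundary".
raw = re.findall(r"\bcat", text)

print(plain)
print(raw)
```

The raw version finds only the standalone word, because the “cat” inside “concat” does not start at a word boundary.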
Resources
  • Another powerpoint from somewhere else:

http://www.cs.umbc.edu/691p/notes/python/pythonRE.ppt

  • Official documentation for Python 3

http://docs.python.org/release/3.2.2/library/re.html

A lot more detailed than needed for this course.

  • A tutorial

http://www.macresearch.org/files/RegularExpressionsInPython.pdf

Gets into more details than needed for this course.
