Chapter 14 Autocoding Xia Jiang

Department of Biomedical Informatics
University of Pittsburgh School of Medicine
http://www.dbmi.pitt.edu

The Content of this Lecture
1. Python language basics:
   1) some string constants and functions
   2) date and time objects
   3) range(), oct(), hex()
2. Textbook task: an autocoder script written in Python
3. An introduction to algorithm analysis in terms of running time

String Constants • string.ascii_letters • string.ascii_lowercase • string.ascii_uppercase • string.digits • string.hexdigits • string.letters

String Constants • string.lowercase • string.octdigits • string.punctuation • string.printable • string.uppercase • string.whitespace

>>> string.printable
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
>>> len(string.printable)
100

>>> string.ascii_letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> string.hexdigits
'0123456789abcdefABCDEF'
>>> string.whitespace
'\t\n\x0b\x0c\r '

Some String Functions
• string.maketrans(from, to)
• string.translate(s, table[, deletechars])

string.maketrans(from, to)
Return a translation table suitable for passing to translate() that maps each character in from to the character at the same position in to; from and to must have the same length.

>>> table = string.maketrans('', '')
>>> print table
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????

>>> table = string.maketrans("12345","aeiou")
>>> print table
 !"#$%&'()*+,-./0aeiou6789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????

string.translate(s, table[, deletechars])
Delete all characters from s that are in deletechars (if present), and then translate the characters using table, which must be a 256-character string giving the translation for each character value, indexed by its ordinal. If table is None, then only the character deletion step is performed.

Questions
• How do we obtain a string containing all the bad (non-printable) ASCII characters?
• How do we delete all the digits in a string?
• How do we replace any digit in a string with a *?
• How do we delete all the punctuation marks in a string?
• How do we replace any punctuation mark in a string with a *?

>>> norm = string.maketrans('','')
>>> badascii = string.translate(norm,norm,string.printable)
>>> print badascii
????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
>>> len(badascii)
156

>>> badtable = badascii + (256-len(badascii))*" "
>>> print badtable
????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
>>> len(badtable)
256
>>> junktable = 256*" "
>>> print junktable
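The sessions above use the Python 2 string module. As a rough sketch only, and assuming Python 3 (where maketrans and translate are methods of bytes and str rather than functions in string), the same set of non-printable byte values could be obtained as follows. The names norm and badascii follow the slides; printable is a helper name introduced here.

```python
import string

# All 256 byte values in order, playing the role of the identity table.
norm = bytes(range(256))

# The printable ASCII characters, encoded so they can be used with bytes.
printable = string.printable.encode("ascii")

# Passing None as the table performs only the deletion step.
badascii = norm.translate(None, printable)

print(len(norm), len(printable), len(badascii))
```

The three lengths should add up: 256 byte values minus the 100 printable characters leaves the 156 values reported in the slides.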

>>> file = open("/users/xij6/data/seealls.txt","r")
>>> str = file.read()
>>> print str
Squamous cell carcinoma, adenoid 0 0 0 0 0 1 0 3 1 5 8 8 12 17 9 14 4 3 3 0 0 0 0 0 0 0 0 1 0 2 4 5 11 17 10 18 8 10 27 0
Malignant teratoma, intermediate 0 0 0 0 3 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>> str = string.translate(str,table, string.digits)
>>> print str
Squamous cell carcinoma, adenoid
Malignant teratoma, intermediate

>>> file = open("/users/xij6/data/seealls.txt","r")
>>> str = file.read()
>>> print str
Squamous cell carcinoma, adenoid 0 0 0 0 0 1 0 3 1 5 8 8 12 17 9 14 4 3 3 0 0 0 0 0 0 0 0 1 0 2 4 5 11 17 10 18 8 10 27 0
Malignant teratoma, intermediate 0 0 0 0 3 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>> table = string.maketrans('\n\r', '  ')
>>> newstr = string.translate(str,table,string.digits)
>>> print newstr
Squamous cell carcinoma, adenoid Malignant teratoma, intermediate

How do we test the following: In a string.translate() statement, which operation is executed first, deletion or replacement?

>>> table = string.maketrans(' ', '*')
>>> newstr2 = string.translate(newstr, table)
>>> print newstr2
Squamous*cell*carcinoma,*adenoid********************************************Malignant*teratoma,*intermediate******************************************
>>> newstr3 = string.translate(newstr,table,' ')
>>> print newstr3
Squamouscellcarcinoma,adenoidMalignantteratoma,intermediate
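The session above suggests that deletion happens before replacement: when the space is both mapped to * and listed in deletechars, no * appears in the result. A minimal sketch of the same test, assuming Python 3, where bytes.translate(table, delete) is the closest counterpart of the Python 2 string.translate(). The sample text is shortened from the slides.

```python
# The table maps a space to '*', and the space is also listed for deletion.
table = bytes.maketrans(b" ", b"*")
text = b"Squamous cell carcinoma"

translated_only = text.translate(table)
deleted_then_translated = text.translate(table, b" ")

print(translated_only)
print(deleted_then_translated)
```

If replacement ran first, the spaces would already have become * by the time deletion looked for them, and the second result would still contain the asterisks.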

>>> print str
Squamous cell carcinoma, adenoid 0 0 0 0 0 1 0 3 1 5 8 8 12 17 9 14 4 3 3 0 0 0 0 0 0 0 0 1 0 2 4 5 11 17 10 18 8 10 27 0
Malignant teratoma, intermediate 0 0 0 0 3 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>> starlist = len(string.digits)*'*'
>>> print starlist
**********
>>> newtable = string.maketrans(string.digits, starlist)

>>> newstr = string.translate(str, newtable)
>>> print newstr
Squamous cell carcinoma, adenoid * * * * * * * * * * * * ** ** * ** * * * * * * * * * * * * * * * * ** ** ** ** * ** ** *
Malignant teratoma, intermediate * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
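The slides work through the digit questions; the punctuation questions can be handled the same way. The following is a sketch under the assumption of Python 3, using str.maketrans() and str.translate() in place of the Python 2 string functions. The sample text is hypothetical and not taken from seealls.txt.

```python
import string

text = "stage 2, grade 3!"   # hypothetical sample text

# Three-argument str.maketrans: the third argument lists characters to delete.
no_digits = text.translate(str.maketrans("", "", string.digits))
no_punct = text.translate(str.maketrans("", "", string.punctuation))

# Two-argument form: both strings must have the same length.
star_digits = text.translate(
    str.maketrans(string.digits, "*" * len(string.digits)))
star_punct = text.translate(
    str.maketrans(string.punctuation, "*" * len(string.punctuation)))

for result in (no_digits, no_punct, star_digits, star_punct):
    print(result)
```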

Date and Time Objects
datetime module
Objects that can be created using this module:
• datetime
• date
• timedelta
• time

>>> from datetime import datetime
>>> t0 = datetime(2012,10, 12, 4, 55, 12)
>>> print t0
2012-10-12 04:55:12
>>> d1 = datetime.now()
>>> print d1
2012-10-13 17:00:54.529736
>>> dt = d1-t0
>>> print dt
1 day, 12:05:42.529736
>>> dt.total_seconds()
129942.529736

>>> from datetime import date
>>> d0 = date(2012,10,11)
>>> print d0
2012-10-11
>>> d1 = date.today()
>>> print d1
2012-10-13
>>> dt = d1-d0
>>> print dt
2 days, 0:00:00
>>> print dt.total_seconds()
172800.0

>>> from datetime import time
>>> t0 = time()
>>> print t0
00:00:00
>>> t0 = time(4,55,12)
>>> print t0
04:55:12
>>> from datetime import date, datetime
>>> d0 = date.today()
>>> print d0
2012-10-13
>>> d0_1 = datetime.combine(d0,t0)
>>> print d0_1
2012-10-13 04:55:12

>>> t1 = d0_1.time()
>>> print t1
04:55:12
>>> t2 = datetime.now().time()
>>> print t2
17:46:53.616930
>>> td = t2-t1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for -: 'datetime.time' and 'datetime.time'
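The TypeError above arises because time objects do not support subtraction. One possible workaround, sketched here for Python 3, is to attach both times to the same date with datetime.combine() and subtract the resulting datetime objects. The values of t1 and t2 follow the slides (microseconds dropped); anchor is a name introduced here for illustration.

```python
from datetime import date, datetime, time

t1 = time(4, 55, 12)
t2 = time(17, 46, 53)

# Attach both times to the same (arbitrary) date, then subtract the
# resulting datetime objects, which yields a timedelta.
anchor = date(2012, 10, 13)
td = datetime.combine(anchor, t2) - datetime.combine(anchor, t1)

print(td)
print(td.total_seconds())
```

This only makes sense when both times fall on the same day; the choice of date does not affect the difference.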

Date and Time Objects
time module
time()
Return the time as a floating point number expressed in seconds since the epoch, in UTC. Note that even though the time is always returned as a floating point number, not all systems provide time with a better precision than 1 second.

>>> from time import time
>>> t0 = time()
>>> print t0
1350164274.71
>>> t1 = time()
>>> print t1
1350164291.39
>>> print t1-t0
16.6719491482

Date and Time Objects
timeit module
This module provides a simple way to time small bits of Python code.

>>> import timeit
>>> s = '''\
... for i in range(10):
...     print oct(i)
... '''

>>> timeit.timeit(stmt=s,number=1)
0
01
02
03
04
05
06
07
010
011
7.796287536621094e-05

>>> timeit.Timer("for i in range(3): print oct(i)").timeit(3)
0
01
02
0
01
02
0
01
02
4.887580871582031e-05
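In the sessions above, the printing itself is part of what is being timed. As a small sketch, assuming Python 3, the same loop can be timed without output and over many repetitions so that the figure is less noisy. The repetition count of 1000 and the names elapsed and per_run are choices made here, not taken from the slides.

```python
import timeit

# Python 3 spelling of the statement from the slides, without printing, so
# that the measurement is not dominated by terminal output.
s = """\
for i in range(10):
    oct(i)
"""

elapsed = timeit.timeit(stmt=s, number=1000)   # total seconds for 1000 runs
per_run = elapsed / 1000
print(per_run)
```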

>>> range(10)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> xrange(10)
xrange(10)
>>> for i in range(10):
...     print i
...
0
1
2
3
4
5
6
7
8
9

>>> oct(5)
'05'
>>> oct(9)
'011'
>>> hex(5)
'0x5'
>>> hex(15)
'0xf'
>>> hex(16)
'0x10'
>>> hex(17)
'0x11'
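The outputs above are those of Python 2. For readers following along in Python 3, a short sketch of the differences: range() returns a lazy object much like xrange(), and oct() uses the 0o prefix. The name r is introduced here for illustration.

```python
r = range(10)
print(r)          # a lazy range object, similar to Python 2's xrange
print(list(r))    # materialize it to see the ten values

print(oct(9), hex(17))

# int() with an explicit base converts the strings back to numbers.
print(int("011", 8), int("0x11", 16))
```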

Autocoder
An autocoder is a software product that can parse and code medical text. For example:
• Hepatocellular carcinoma
• Liver cell cancer
• Liver cancer
• HCC

Autocoder These terms should be annotated with the same concept code. A “nomenclature” is a list that contains codes together with their corresponding terms.

A Neoplasm Nomenclature
<name nci-code = "C6546000">pediatric ovarian embryonalca</name>
<name nci-code = "C6546000">pediatric ovarian embryonal cancer</name>
<name nci-code = "C6546000">pediatric ovarian embryonal carcinoma</name>
<name nci-code = "C6547000">ovary with childhood immature teratoma</name>
<name nci-code = "C6547000">childhood immature teratoma arising in ovary</name>
<name nci-code = "C6547000">childhood immature teratoma involving ovary</name>

Algorithm for the Neoplasm Autocoder
1. Open the nomenclature file: neocl.xml.
text = open("/users/xij6/data/neocl.xml", "r")

Algorithm for the Neoplasm Autocoder
2. Create a dictionary object whose keys are the terms and whose values are the codes.
import re
literalhash = {}
codematch = re.compile('\"(C\d{7})\"')
phrasematch = re.compile('\"\> ?(.+) ?\<\/')
Example line:
<name nci-code = "C6546000">pediatric ovarian embryonalca</name>

for line in text:
    m = codematch.search(line)
    if m:
        code = m.group(1)
    else:
        continue
    x = phrasematch.search(line)
    if x:
        phrase = x.group(1)
    else:
        continue
    literalhash[phrase] = code
text.close()
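To see what the loop produces without access to neocl.xml, here is a small sketch, assuming Python 3, that runs the same two regular expressions (written as raw strings) over two lines copied from the nomenclature excerpt. The variable sample and the comment line inside it are hypothetical additions for illustration.

```python
import re

# Two lines copied from the nomenclature excerpt, plus one hypothetical
# line with no code, which the loop should skip.
sample = '''\
<name nci-code = "C6546000">pediatric ovarian embryonal carcinoma</name>
<name nci-code = "C6547000">childhood immature teratoma arising in ovary</name>
<!-- a line with no code is skipped -->
'''

codematch = re.compile(r'"(C\d{7})"')
phrasematch = re.compile(r'"> ?(.+) ?</')

literalhash = {}
for line in sample.splitlines():
    m = codematch.search(line)
    x = phrasematch.search(line)
    if not (m and x):
        continue                      # skip lines missing a code or a term
    literalhash[x.group(1)] = m.group(1)

for phrase, code in literalhash.items():
    print(code, phrase)
```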

Algorithm for the Neoplasm Autocoder
3. Open the text file to be parsed (tumorabs.txt).
4. Parse through the file, line by line.
5. As each line is parsed, break the line into every possible ordered subsequence of consecutive words (a phrase array).

Algorithm for the Neoplasm Autocoder
6. For each item in the phrase array, determine whether the item matches a term in the dictionary.
7. If there is a match, print the phrase and the code (the value in the dictionary) to an output file. Each printed entry consists of the line from the text file, followed by the phrases from that line that match neoplasm terms, along with their codes.

The very critical procedure is step number 5!

Example for Phrase Array
Sentence: “She is a good person”
She is a good person
She is a good
She is a
She is
She
is a good person
is a good
is a
is
a good person
a good
a
good person
good
person

Example for Phrase Array
Text: “anxiety and depression in patients”
anxiety
anxiety and
anxiety and depression
anxiety and depression in
anxiety and depression in patients
and
and depression
and depression in
and depression in patients
depression
depression in
depression in patients
in
in patients
patients
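The phrase arrays above can be generated with two nested loops over the start and end positions of each phrase. The following is a minimal sketch in Python 3, not the textbook's script: phrase_array is a name introduced here, the test sentence is hypothetical, and the one-entry dictionary reuses a term and code from the nomenclature excerpt.

```python
def phrase_array(sentence):
    """Return every run of consecutive words in the sentence."""
    words = sentence.split()
    phrases = []
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            phrases.append(" ".join(words[start:end]))
    return phrases

phrases = phrase_array("anxiety and depression in patients")
print(len(phrases))

# Steps 6 and 7 in miniature: look up each phrase in the dictionary.
literalhash = {"childhood immature teratoma arising in ovary": "C6547000"}
sentence = "case of childhood immature teratoma arising in ovary"
matches = [(p, literalhash[p]) for p in phrase_array(sentence)
           if p in literalhash]
print(matches)
```

A line of n words yields n(n+1)/2 phrases (15 for the five-word examples above), so the work per line grows quadratically with its length. This is why step 5 is the natural place to start when analyzing running time.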

Analyzing the Code/Algorithm for Step Number 5
singular = re.compile('omas')
england = re.compile('tumo[u]?rs')
for line in absfile:
    sentence = line
    sentence = singular.sub("oma", sentence)
    sentence = england.sub("tumor", sentence)

For example, we have the following sentence: state and trait anxiety and depression in patients with primary brain tumor before and after surgery 1 year longitudinal study.
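The two substitutions that precede the phrase array step can be tried in isolation. The following is a sketch assuming Python 3; normalize is a helper name introduced here, and the input strings are hypothetical variants of the example sentence.

```python
import re

singular = re.compile('omas')          # plural "-omas" to singular "-oma"
england = re.compile('tumo[u]?rs')     # "tumors"/"tumours" to "tumor"

def normalize(sentence):
    """Apply the two substitutions from the slides, in the same order."""
    sentence = singular.sub("oma", sentence)
    return england.sub("tumor", sentence)

print(normalize("patients with primary brain tumours before and after surgery"))
print(normalize("two carcinomas"))
```

Because the patterns are not anchored to word boundaries, 'omas' is replaced wherever it occurs, so a name such as "Thomas" would become "Thoma". Whether that matters depends on the text being coded.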
