
Chapter 14 Autocoding Xia Jiang




  1. Chapter 14: Autocoding. Xia Jiang, Department of Biomedical Informatics, University of Pittsburgh School of Medicine, http://www.dbmi.pitt.edu

  2. The Content of this Lecture
     1. Python language basics:
        1) some string constants and functions
        2) date and time objects
        3) range(), oct(), hex()
     2. Textbook tasks: an autocoder script written in Python
     3. An introduction to algorithm analysis in terms of running time

  3. String Constants
     • string.ascii_letters
     • string.ascii_lowercase
     • string.ascii_uppercase
     • string.digits
     • string.hexdigits
     • string.letters

  4. String Constants
     • string.lowercase
     • string.octdigits
     • string.punctuation
     • string.printable
     • string.uppercase
     • string.whitespace

  5. >>> string.printable
     '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
     >>> len(string.printable)
     100

  6. >>> string.ascii_letters
     'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
     >>> string.hexdigits
     '0123456789abcdefABCDEF'
     >>> string.whitespace
     '\t\n\x0b\x0c\r '

  7. Some String Functions
     string.maketrans(from, to)
     string.translate(s, table[, deletechars])

  8. string.maketrans(from, to)
     Return a translation table suitable for passing to translate(), that will map each character in from into the character at the same position in to; from and to must have the same length.

  9. >>> table = string.maketrans('', '')
     >>> print table
     !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????

  10. >>> table = string.maketrans("12345","aeiou")
      >>> print table
      !"#$%&'()*+,-./0aeiou6789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????

  11. string.translate(s, table[, deletechars])
      Delete all characters from s that are in deletechars (if present), and then translate the characters using table, which must be a 256-character string giving the translation for each character value, indexed by its ordinal. If table is None, then only the character deletion step is performed.
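The slides use Python 2, where maketrans and translate are functions in the string module. In Python 3 those functions were removed; a minimal sketch of the same two operations, assuming the built-in str.maketrans and str.translate methods and two sample strings made up for illustration:

```python
# Python 3 sketch: str.maketrans / str.translate stand in for the
# Python 2 string.maketrans / string.translate shown on the slides.
import string

# Map each character of "12345" to the character at the same position in "aeiou".
vowel_table = str.maketrans("12345", "aeiou")
translated = "h2ll4 w4rld".translate(vowel_table)   # made-up sample string

# An optional third argument lists the characters to delete.
delete_digits = str.maketrans("", "", string.digits)
without_digits = "carcinoma 0 0 1 3".translate(delete_digits)

print(translated)
print(repr(without_digits))
```

The third argument of str.maketrans plays the role of deletechars in the Python 2 call.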

  12. Questions
      • How do we obtain a string containing all the bad ASCII characters?
      • How do we delete all the digits in a string?
      • How do we replace any digit in a string with a *?
      • How do we delete all the punctuation marks in a string?
      • How do we replace any punctuation mark in a string with a *?

  13. >>> norm = string.maketrans('', '')
      >>> badascii = string.translate(norm, norm, string.printable)
      >>> print badascii
      ????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
      >>> len(badascii)
      156

  14. >>> badtable = badascii + (256 - len(badascii)) * " "
      >>> print badtable
      ????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
      >>> len(badtable)
      256
      >>> junktable = 256 * " "
      >>> print junktable
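For readers on Python 3, the "bad ASCII" string can be rebuilt without a 256-character table. This is only a sketch: it assumes "bad" means every character with an ordinal from 0 to 255 that is not in string.printable, and the sample input is made up.

```python
import string

# Every character with an ordinal in 0..255 that is not printable.
bad_ascii = "".join(chr(i) for i in range(256) if chr(i) not in string.printable)

# A translation table that deletes those characters from any string.
strip_bad = str.maketrans("", "", bad_ascii)
cleaned = "liver\x00 cancer\x7f".translate(strip_bad)   # made-up sample input

print(len(bad_ascii))
print(cleaned)
```

The length agrees with the slide: 256 character values minus the 100 printable ones leaves 156.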

  15. >>> file = open("/users/xij6/data/seealls.txt","r")
      >>> str = file.read()
      >>> print str
      Squamous cell carcinoma, adenoid
      0 0 0 0 0 1 0 3 1 5 8 8 12 17 9 14 4 3 3 0
      0 0 0 0 0 0 0 1 0 2 4 5 11 17 10 18 8 10 27 0
      Malignant teratoma, intermediate
      0 0 0 0 3 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0
      0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      >>> str = string.translate(str, table, string.digits)
      >>> print str
      Squamous cell carcinoma, adenoid
      Malignant teratoma, intermediate

  16. >>> file = open("/users/xij6/data/seealls.txt","r")
      >>> str = file.read()
      >>> print str
      Squamous cell carcinoma, adenoid
      0 0 0 0 0 1 0 3 1 5 8 8 12 17 9 14 4 3 3 0
      0 0 0 0 0 0 0 1 0 2 4 5 11 17 10 18 8 10 27 0
      Malignant teratoma, intermediate
      0 0 0 0 3 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0
      0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      >>> table = string.maketrans('\n\r', '  ')
      >>> newstr = string.translate(str, table, string.digits)
      >>> print newstr
      Squamous cell carcinoma, adenoid Malignant teratoma, intermediate

  17. How do we test the following: In a string.translate() statement, which operation is executed first, deletion or replacement?

  18. >>> table = string.maketrans(' ', '*')
      >>> newstr2 = string.translate(newstr, table)
      >>> print newstr2
      Squamous*cell*carcinoma,*adenoid********************************************Malignant*teratoma,*intermediate******************************************
      >>> newstr3 = string.translate(newstr, table, ' ')
      >>> print newstr3
      Squamouscellcarcinoma,adenoidMalignantteratoma,intermediate
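The same experiment can be repeated in Python 3, as a sketch. Here str.maketrans folds the deletion list into a single table, and a character listed both for replacement and for deletion ends up deleted, so the observable result matches the Python 2 session. The sample text is made up.

```python
text = "liver cell cancer"   # made-up sample text

# Replacement only: every space becomes '*'.
replaced = text.translate(str.maketrans(" ", "*"))

# Same mapping, but the space is also listed for deletion.
# If replacement won, the result would contain '*'; it does not.
both = text.translate(str.maketrans(" ", "*", " "))

print(replaced)
print(both)
```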

  19. >>> print str
      Squamous cell carcinoma, adenoid
      0 0 0 0 0 1 0 3 1 5 8 8 12 17 9 14 4 3 3 0
      0 0 0 0 0 0 0 1 0 2 4 5 11 17 10 18 8 10 27 0
      Malignant teratoma, intermediate
      0 0 0 0 3 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0
      0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      >>> starlist = len(string.digits) * '*'
      >>> print starlist
      **********
      >>> newtable = string.maketrans(string.digits, starlist)

  20. >>> newstr = string.translate(str, newtable)
      >>> print newstr
      Squamous cell carcinoma, adenoid
      * * * * * * * * * * * * ** ** * ** * * * *
      * * * * * * * * * * * * ** ** ** ** * ** ** *
      Malignant teratoma, intermediate
      * * * * * * * * * * * * * * * * * * * *
      * * * * * * * * * * * * * * * * * * * *
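The two punctuation questions from the Questions slide follow the same pattern as the digit examples. A minimal Python 3 sketch, assuming string.punctuation as the definition of "punctuation mark"; the sample sentence is made up for illustration:

```python
import string

sample = "Squamous cell carcinoma, adenoid (grade 2)."   # made-up sample

# Delete every punctuation mark.
delete_punct = str.maketrans("", "", string.punctuation)
no_punct = sample.translate(delete_punct)

# Replace every punctuation mark with '*': one star per punctuation character,
# so both arguments have the same length.
star_punct = str.maketrans(string.punctuation, "*" * len(string.punctuation))
starred = sample.translate(star_punct)

print(no_punct)
print(starred)
```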

  21. Date and Time Objects
      datetime module: objects that can be created using this module
      • datetime
      • date
      • timedelta
      • time

  22. >>> from datetime import datetime
      >>> t0 = datetime(2012, 10, 12, 4, 55, 12)
      >>> print t0
      2012-10-12 04:55:12
      >>> d1 = datetime.now()
      >>> print d1
      2012-10-13 17:00:54.529736
      >>> dt = d1 - t0
      >>> print dt
      1 day, 12:05:42.529736
      >>> dt.total_seconds()
      129942.529736

  23. >>> from datetime import date
      >>> d0 = date(2012, 10, 11)
      >>> print d0
      2012-10-11
      >>> d1 = date.today()
      >>> print d1
      2012-10-13
      >>> dt = d1 - d0
      >>> print dt
      2 days, 0:00:00
      >>> print dt.total_seconds()
      172800.0

  24. >>> from datetime import time
      >>> t0 = time()
      >>> print t0
      00:00:00
      >>> t0 = time(4, 55, 12)
      >>> print t0
      04:55:12
      >>> from datetime import date, datetime
      >>> d0 = date.today()
      >>> print d0
      2012-10-13
      >>> d0_1 = datetime.combine(d0, t0)
      >>> print d0_1
      2012-10-13 04:55:12

  25. >>> t1 = d0_1.time()
      >>> print t1
      04:55:12
      >>> t2 = datetime.now().time()
      >>> print t2
      17:46:53.616930
      >>> td = t2 - t1
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      TypeError: unsupported operand type(s) for -: 'datetime.time' and 'datetime.time'
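The TypeError arises because time objects do not support subtraction. One common workaround, sketched here in Python 3 with fixed times close to those on the slide (microseconds dropped) and an arbitrary anchor date, is to attach both times to the same date and subtract the resulting datetime objects:

```python
from datetime import date, datetime, time

t1 = time(4, 55, 12)
t2 = time(17, 46, 53)

# time objects cannot be subtracted, so combine both with the same date;
# the date itself is arbitrary and cancels out in the difference.
anchor = date(2012, 10, 13)
elapsed = datetime.combine(anchor, t2) - datetime.combine(anchor, t1)

print(elapsed)
print(elapsed.total_seconds())
```

The result is a timedelta, the same type produced by subtracting two datetime or two date objects.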

  26. Date and Time Objects
      time module
      time.time(): Return the time as a floating point number expressed in seconds since the epoch, in UTC. Note that even though the time is always returned as a floating point number, not all systems provide time with a better precision than 1 second.

  27. >>> from time import time
      >>> t0 = time()
      >>> print t0
      1350164274.71
      >>> t1 = time()
      >>> print t1
      1350164291.39
      >>> print t1 - t0
      16.6719491482

  28. Date and Time Objects
      timeit module
      This module provides a simple way to measure the execution time of small bits of Python code.

  29. >>> import timeit
      >>> s = '''\
      ... for i in range(10):
      ...     print oct(i)
      ... '''

  30. >>> timeit.timeit(stmt=s, number=1)
      0
      01
      02
      03
      04
      05
      06
      07
      010
      011
      7.796287536621094e-05

  31. >>> timeit.Timer("for i in range(3): print oct(i)").timeit(3)
      0
      01
      02
      0
      01
      02
      0
      01
      02
      4.887580871582031e-05
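Printing inside the timed statement means the measurement includes console output. A Python 3 sketch that times the same octal conversion without printing, passing a callable instead of a code string; the function name octal_strings and the repetition count are choices made for this example only:

```python
import timeit

def octal_strings(n=10):
    """Build the octal representation of 0..n-1 without printing."""
    return [oct(i) for i in range(n)]

# number=1000 calls the function 1000 times and returns the total in seconds.
runs = 1000
elapsed = timeit.timeit(octal_strings, number=runs)
per_call = elapsed / runs

print("total seconds:", elapsed)
print("seconds per call:", per_call)
```

Timing many repetitions and dividing gives a steadier figure than a single run, which is dominated by measurement noise.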


  33. >>> range(10)
      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
      >>> xrange(10)
      xrange(10)
      >>> for i in range(10):
      ...     print i
      ...
      0
      1
      2
      3
      4
      5
      6
      7
      8
      9

  34. >>> oct(5)
      '05'
      >>> oct(9)
      '011'
      >>> hex(5)
      '0x5'
      >>> hex(15)
      '0xf'
      >>> hex(16)
      '0x10'
      >>> hex(17)
      '0x11'
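The outputs above are from Python 2. In Python 3, range() is lazy (as xrange was) and oct() uses a '0o' prefix. A short sketch of the Python 3 behavior, with variable names chosen for this example:

```python
# Python 3 counterparts of the Python 2 calls shown on the slides.
lazy = range(10)              # a lazy sequence, like Python 2's xrange
numbers = list(lazy)          # materialize it to get the Python 2 style list

octal_nine = oct(9)           # Python 3 writes octal with a '0o' prefix
bare_octal = format(9, "o")   # octal digits only, no prefix
back_to_int = int("11", 8)    # parse octal digits back into an integer
hex_seventeen = hex(17)       # same result as in Python 2

print(numbers, octal_nine, bare_octal, back_to_int, hex_seventeen)
```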

  35. Autocoder
      An autocoder is a software product that parses and codes medical text. For example:
      • Hepatocellular carcinoma
      • Liver cell cancer
      • Liver cancer
      • Hcc

  36. Autocoder These terms should be annotated with the same concept code. A “nomenclature” is a list that contains codes together with their corresponding terms.

  37. A Neoplasm Nomenclature
      <name nci-code = "C6546000">pediatric ovarian embryonalca</name>
      <name nci-code = "C6546000">pediatric ovarian embryonal cancer</name>
      <name nci-code = "C6546000">pediatric ovarian embryonal carcinoma</name>
      <name nci-code = "C6547000">ovary with childhood immature teratoma</name>
      <name nci-code = "C6547000">childhood immature teratoma arising in ovary</name>
      <name nci-code = "C6547000">childhood immature teratoma involving ovary</name>

  38. Algorithm for the Neoplasm Autocoder
      1. Open the nomenclature file, neocl.xml.
         text = open("/users/xij6/data/neocl.xml", "r")

  39. Algorithm for the Neoplasm Autocoder
      2. Create a dictionary object whose keys are the terms and whose values are the codes.
         import re
         literalhash = {}
         codematch = re.compile(r'"(C\d{7})"')        # captures the code, e.g. C6546000
         phrasematch = re.compile(r'"> ?(.+) ?</')    # captures the term between the tags
      Example line:
         <name nci-code = "C6546000">pediatric ovarian embryonalca</name>

  40. for line in text:
          m = codematch.search(line)
          if m:
              code = m.group(1)
          else:
              continue            # no code on this line; skip it
          x = phrasematch.search(line)
          if x:
              phrase = x.group(1)
          else:
              continue            # no term on this line; skip it
          literalhash[phrase] = code
      text.close()
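The dictionary-building loop can be tried without the nomenclature file. The sketch below runs the same two regular expressions over a short in-memory list; the first two lines are copied from the nomenclature slide, and the third is an invented non-matching line added to show the skip path.

```python
import re

# Stand-in for the lines of neocl.xml.
sample_lines = [
    '<name nci-code = "C6546000">pediatric ovarian embryonal cancer</name>',
    '<name nci-code = "C6547000">childhood immature teratoma arising in ovary</name>',
    'a line with no code or term',   # invented, to exercise the skip path
]

codematch = re.compile(r'"(C\d{7})"')       # the quoted 7-digit code
phrasematch = re.compile(r'"> ?(.+) ?</')   # the text between the tags

literalhash = {}
for line in sample_lines:
    m = codematch.search(line)
    x = phrasematch.search(line)
    if m and x:                             # skip lines missing either part
        literalhash[x.group(1)] = m.group(1)

print(literalhash)
```

Keying the dictionary on the term makes the later lookup of each candidate phrase a single hash access.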

  41. Algorithm for the Neoplasm Autocoder
      3. Open the text file to be parsed (tumorabs.txt).
      4. Parse through the file, line by line.
      5. As each line is parsed, break the line into every possible ordered run of consecutive words (a phrase array).

  42. Algorithm for the Neoplasm Autocoder
      6. For each item in the phrase array, determine whether the item matches a term in the dictionary.
      7. If there is a match, print the phrase and the code (the value in the dictionary) to an output file. Each printed entry consists of the line from the text file, followed by the phrases from that line that match neoplasm terms, along with their codes.

  43. The critical procedure is step 5!

  44. Example for Phrase Array
      Sentence: “She is a good person”
      • She is a good person
      • She is a good
      • She is a
      • She is
      • She
      • is a good person
      • is a good
      • is a

  45. • is
      • a good person
      • a good
      • a
      • good person
      • good
      • person

  46. Example for Phrase Array
      Text: “anxiety and depression in patients”
      • anxiety
      • anxiety and
      • anxiety and depression
      • anxiety and depression in
      • anxiety and depression in patients

  47. • and
      • and depression
      • and depression in
      • and depression in patients
      • depression
      • depression in
      • depression in patients

  48. • in
      • in patients
      • patients
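Steps 5 to 7 can be sketched in a few lines of Python 3. This is an illustration under stated assumptions, not the textbook script: the function name phrase_array is chosen here, the two dictionary entries are copied from the nomenclature slide, and the input line is invented.

```python
def phrase_array(sentence):
    """Return every run of consecutive words in sentence, grouped by start word."""
    words = sentence.split()
    phrases = []
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            phrases.append(" ".join(words[start:end]))
    return phrases

# Two terms copied from the nomenclature slide.
literalhash = {
    "pediatric ovarian embryonal carcinoma": "C6546000",
    "childhood immature teratoma arising in ovary": "C6547000",
}

phrases = phrase_array("anxiety and depression in patients")

# Hypothetical input line, invented for this example.
line = "recurrent pediatric ovarian embryonal carcinoma after surgery"
matches = [(p, literalhash[p]) for p in phrase_array(line) if p in literalhash]

print(len(phrases))
print(matches)
```

A line of n words yields n(n+1)/2 phrases (15 for the five-word examples above), so the number of dictionary lookups grows quadratically with line length. That growth is the quantity to keep in mind for the running-time analysis.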

  49. Analyzing the Code/Algorithm for Step Number 5
      singular = re.compile('omas')
      england = re.compile('tumo[u]?rs')
      for line in absfile:
          sentence = line
          sentence = singular.sub("oma", sentence)
          sentence = england.sub("tumor", sentence)
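The two substitutions normalize plural tumor names and the British spelling before phrases are generated. A small Python 3 sketch of their effect, wrapped in a helper named normalize (a name chosen here) and applied to made-up input:

```python
import re

singular = re.compile('omas')        # plural "-omas" becomes "-oma"
england = re.compile('tumo[u]?rs')   # "tumors" or "tumours" becomes "tumor"

def normalize(sentence):
    """Apply both substitutions in the same order as the slide."""
    sentence = singular.sub("oma", sentence)
    return england.sub("tumor", sentence)

normalized = normalize("brain tumours and teratomas")   # made-up input
side_effect = normalize("Thomas")                       # pattern is not anchored

print(normalized)
print(side_effect)
```

Because 'omas' is matched anywhere in the text, a word such as "Thomas" is also shortened. That is harmless for term matching against a neoplasm nomenclature but worth knowing if the normalized text is reused elsewhere.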

  50. For example, we have the following sentence: state and trait anxiety and depression in patients with primary brain tumor before and after surgery 1 year longitudinal study.
