
HCA 741: Essential Programming for Health Informatics Rohit Kate


  1. More Programming with SEER Dataset: Race-Based Cancer Occurrence Rates (Text 2 Chapter 20), Site-Specific Tumors (Text 2 Chapter 25). HCA 741: Essential Programming for Health Informatics, Rohit Kate

  2. Race-Based Cancer Occurrence Rates (Text 2 Chapter 20)

  3. Race-Based Analysis • Each entry in the SEER dataset contains racial information about the patient • We can ask questions about racial differences in cancer occurrence rates • What can cause the differences? • Genetic traits • Identifiable exposure to carcinogens or living conditions • Socioeconomic conditions

  4. Race-Based Analysis • Why do race-based analysis? • To design interventions that can modify the effects of the trait • To target prevention more precisely • We can find differences by analyzing millions of cancer records in the SEER dataset

  5. Race Information in SEER • The seerdic.pdf file in the “incidence” folder gives the following information: • The race/ethnicity information is stored in characters 20-21 for each occurrence • The two characters mean the following:
      01 White
      02 Black
      03 American Indian
      04 Chinese
      05 Japanese
      06 Filipino
      07 Hawaiian
      08 Korean (Effective with 1/1/1988 dx)
      10 Vietnamese (Effective with 1/1/1988 dx)
      11 Laotian (Effective with 1/1/1988 dx)
      12 Hmong (Effective with 1/1/1988 dx)
      13 Kampuchean (including Khmer and Cambodian) (Effective with 1/1/1988 dx)
      14 Thai (Effective with 1/1/1994 dx)
      15 Asian Indian or Pakistani, NOS (Effective with 1/1/1988 dx)
      16 Asian Indian (Effective with 1/1/2010 dx)
      17 Pakistani (Effective with 1/1/2010 dx)
      20 Micronesian, NOS (Effective with 1/1/1991 dx)
      21 Chamorran (Effective with 1/1/1991 dx)
      22 Guamanian, NOS (Effective with 1/1/1991 dx)
      25 Polynesian, NOS (Effective with 1/1/1991 dx)
      26 Tahitian (Effective with 1/1/1991 dx)
      27 Samoan (Effective with 1/1/1991 dx)
      28 Tongan (Effective with 1/1/1991 dx)
      30 Melanesian, NOS (Effective with 1/1/1991 dx)
      31 Fiji Islander (Effective with 1/1/1991 dx)
      32 New Guinean (Effective with 1/1/1991 dx)
      96 Other Asian, including Asian, NOS and Oriental, NOS (Effective with 1/1/1991 dx)
      97 Pacific Islander, NOS (Effective with 1/1/1991 dx)
      98 Other
      99 Unknown
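
A minimal sketch of reading the race field, assuming each record is a fixed-width text line as described above. Characters 20-21, counted from 1, correspond to the Python slice `[19:21]`. The dictionary `RACE_NAMES` holds only a few of the codes listed above, and the helper `race_of` and the sample record are made up for illustration.

```python
# A few of the race codes from seerdic.pdf (subset, for illustration only).
RACE_NAMES = {
    "01": "White",
    "02": "Black",
    "03": "American Indian",
    "04": "Chinese",
    "98": "Other",
    "99": "Unknown",
}

def race_of(line):
    """Return the race name for one fixed-width record (hypothetical helper)."""
    code = line[19:21]  # characters 20-21, counting from 1
    return RACE_NAMES.get(code, "Code " + code + " (not in this subset)")

# Made-up record: 19 filler characters, then the race code "02".
sample = "x" * 19 + "02" + "x" * 40
print(race_of(sample))
```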

  6. Comparing Rates among Whites and Blacks • We want to compare the difference in the occurrence rates among whites and blacks • We will write a program to process the SEER dataset line-by-line as before • Read the two characters representing race and keep a count for whites and blacks • Get the cancer names from ICD-O as before • Very similar to the programming assignment! • We will also print a ratio to make the comparison • Normalize by total white and black cancer occurrence counts (assume they represent population distribution) • Save in a comma-separated-value file

  7. Program
      # Using the SEER dataset, this program computes the distribution of various cancer
      # types among whites and blacks.
      import glob
      import icdo  # the file we created for the ICD-O dictionary

      def main():
          # make a list of all the relevant file names in the SEER dataset folder
          filelist = glob.glob("SEER_1973_2010_TEXTDATA/incidence/yr*/*.TXT")
          white_di = {}
          black_di = {}
          for file in filelist:  # process each file
              with open(file, "r") as infile:
                  for line in infile:
                      # ICD-Oncology-3 code (Histology+Behavior) is at characters 53-57;
                      # Python slices count from 0 and exclude the end index, hence [52:57]
                      disease = line[52:57]
                      race = line[19:21]  # characters 20-21
                      if race == "01":
                          white_di[disease] = white_di.get(disease, 0) + 1
                      elif race == "02":
                          black_di[disease] = black_di.get(disease, 0) + 1
          icdo_di = icdo.ICDO_dictionary()  # use the ICD-O dictionary to get the names for the codes
          total_white = sum(white_di.values())  # total whites
          total_black = sum(black_di.values())  # total blacks
          with open("SEER_race_out.csv", "w") as outfile:
              for code in icdo_di:
                  bo = black_di.get(code, 0)  # total blacks that get this cancer
                  wo = white_di.get(code, 0)  # total whites that get this cancer
                  if bo != 0 or wo != 0:  # skip if both are zeros
                      # compute the ratio
                      if bo == 0:  # avoid division by zero
                          ratio = 1000000 + wo/total_white  # some large value plus relative white count
                      else:
                          ratio = (wo/total_white)/(bo/total_black)  # ratio of relative counts
                      print(icdo_di[code].replace(",", ""), ",", ratio, ",", wo, ",", bo, file=outfile)

      main()

  8. Analysis • After sorting in Excel by the ratio column, one can see (white-black counts in parentheses): • Certain cancers are present only in whites • spindle cell melanoma type A (61-0) • mal. melanoma in precan. melanosis (47-0) • Certain cancers have high occurrences in whites • superficial spreading melanoma in situ (10688-14) • mal. melanoma in junctional nevus (969-1) • Certain cancers are more common in blacks • pigmented dermatofibrosarcoma protuberans (W/B Ratio = 0.23) • Strengthens the hypothesis that melanin in the tumor is a secondary phenomenon • Pathologists used to believe that Ewing sarcoma, a rare malignant tumor that occurs in children and young adults, never occurs in blacks • Analysis of the SEER dataset disproves this (75 cases found) • The book uses an older version of the dataset; we are using the 1973-2010 version, hence the numbers are different
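
The sorting step can also be done in Python instead of a spreadsheet. The following is a sketch under assumptions: `sort_by_ratio` is a hypothetical helper, it expects rows in the "name, ratio, white count, black count" layout that the program writes, and the example rows are made up rather than taken from SEER.

```python
import csv

def sort_by_ratio(lines):
    """Parse 'name, ratio, white, black' rows and sort by ratio, largest first."""
    rows = []
    for name, ratio, white, black in csv.reader(lines):
        # float() and int() ignore the spaces that print() leaves around the commas
        rows.append((name.strip(), float(ratio), int(white), int(black)))
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Made-up rows in the same layout the program writes; not real SEER counts.
example = [
    "tumor A , 0.25 , 10 , 40",
    "tumor B , 1000000.001 , 5 , 0",
    "tumor C , 3.0 , 30 , 10",
]
for row in sort_by_ratio(example):
    print(row)
```

To use it on the real output, one could pass an open file object, for example `sort_by_ratio(open("SEER_race_out.csv"))`, since `csv.reader` accepts any iterable of lines.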

  9. Site-Specific Tumors (Text 2 Chapter 25)

  10. Site-Specific Tumors • Tumors are generally discussed without a specific anatomic site in mind, on the assumption that a particular type of tumor has the same properties wherever it arises • We will consider one particular type of tumor, mesothelioma, and look empirically at the age distributions of its occurrence at different anatomical sites

  11. Mesothelioma • Malignant tumors that arise from the surfaces lining the walls of body cavities and the surfaces of organs within the body cavities • In the SEER dataset it is coded with four different ICD-O codes: 90503 = mesothelioma, 90513 = fibrous mesothelioma, 90523 = epithelial mesothelioma, 90533 = biphasic mesothelioma

  12. Mesothelioma • We will consider four major anatomic sites: • Pleural (chest; heart and lungs) • Peritoneal (abdomen; liver and intestines) • Ovarian • Testicular • Each SEER record also contains a topography code indicating the anatomic site (Codes: http://www.dhs.wisconsin.gov/wcrs/pdf/ICD-03Site-text307.pdf)

  13. Site Codes • See the book chapter for their meanings • Pleural: C341, C342, C343, C348, C349, C380, C381, C382, C383, C384, C388, C390, C398, C399 • Peritoneal: C482, C488 • Ovarian: C540, C541, C542, C543, C548, C549, C559, C569, C570, C571, C572, C573, C575, C577, C578, C579 • Testicular: C620, C621, C629, C630, C631, C632, C637, C638, C639

  14. SEER Records • Each SEER record is a line in the file which gives: • ICD-O code for the disease at characters 53-57 • Anatomical site at characters 43-46 • Age at characters 25-27
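
The character positions above are counted from 1 and include both ends, while Python slices count from 0 and exclude the end index. A small sketch of the conversion follows; the `FIELDS` table, the helper `parse_record`, and the sample record are hypothetical and exist only to exercise the positions.

```python
# 1-based, inclusive character positions taken from the slide above.
FIELDS = {
    "age": (25, 27),
    "site": (43, 46),
    "disease": (53, 57),
}

def parse_record(line):
    """Extract the fields of one fixed-width record (hypothetical helper)."""
    record = {}
    for name, (first, last) in FIELDS.items():
        # Characters first..last (1-based) become the slice line[first-1:last].
        record[name] = line[first - 1:last]
    return record

# Made-up record built only to test the positions; not real SEER data.
chars = ["."] * 60
chars[24:27] = "067"
chars[42:46] = "C384"
chars[52:57] = "90503"
sample = "".join(chars)
print(parse_record(sample))
```

This is why the programs in these slides read the disease code with `line[52:57]` even though it occupies characters 53-57.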

  15. Mesothelioma in SEER Records • ICD-O code for the disease at characters 53-57 • Match codes starting with “905” • Anatomical site at characters 43-46 • Match one of the anatomical codes that represent one of the four anatomical sites • Age at characters 25-27 • We will record that information

  16. Age-Distribution in a Site-Specific Mesothelioma • Given that we need age distributions for four different anatomical sites, we will write the computation as a general function • The function will, in fact, be general enough to give the age distribution for any disease(s) at any anatomical site(s) • Parameters: • Disease codes as a regular expression • Anatomical site codes as a regular expression • Returns: • A list with the age distribution (0-4, 5-9, …, 95+)

  17. Disease-Site Function • The function will go through every SEER file and every line in every file • If the record matches the disease and the anatomical site: • Include the age in the age-distribution

  18. Age-Distribution • It can be done in several ways • We will make 20 bins, each of 5 years • 0-4, 5-9, 10-14, …, 95+ • The last bin will include everyone >= 95 • We will count how many records fall in each age bin • How to implement this in Python?

  19. Age-Distribution • Initialize a list of size 20 with zeros • For every relevant record, integer-divide the age by 5 to get the bin number (if age > 95, make it 95) • Increment the appropriate bin
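
The binning steps above can be sketched as follows. This is an illustration rather than the course's own code: the helpers `age_bin` and `age_distribution` are hypothetical, the ages are made up, and the bin index is capped with `min()`, which has the same effect as first setting ages above 95 to 95.

```python
NUM_BINS = 20  # 0-4, 5-9, ..., 90-94, 95+

def age_bin(age):
    """Return the index of the 5-year bin for an age in years."""
    return min(age // 5, NUM_BINS - 1)  # everyone 95 or older shares the last bin

def age_distribution(ages):
    """Count how many of the given ages fall in each bin."""
    bins = [0] * NUM_BINS
    for age in ages:
        bins[age_bin(age)] += 1
    return bins

# Made-up ages, for illustration only.
counts = age_distribution([0, 4, 5, 67, 94, 95, 102])
print(counts[0], counts[1], counts[13], counts[18], counts[19])  # prints: 2 1 1 1 2
```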

  20. Regular Expression for Mesothelioma ICD codes 90503 = mesothelioma 90513 = fibrous mesothelioma 90523 = epithelial mesothelioma 90533 = biphasic mesothelioma • “^905” or “905”, used with “re.match()” • This is sufficient, although more general than needed • “905[0123]3” • This is exact
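
A short sketch of how these patterns behave with Python's `re` module. The last two strings in `codes` ("90543" and "89053") are made-up values added only for contrast; they are not claimed to be valid ICD-O codes.

```python
import re

codes = ["90503", "90513", "90523", "90533", "90543", "89053"]

# re.match only tries the pattern at the start of the string,
# so "905" and "^905" accept exactly the same strings here.
general = [c for c in codes if re.match("905", c)]

# The exact pattern accepts the four mesothelioma codes and nothing else.
exact = [c for c in codes if re.fullmatch("905[0123]3", c)]

# re.search looks anywhere in the string, so it also accepts "89053".
anywhere = [c for c in codes if re.search("905", c)]

print(general)
print(exact)
print(anywhere)
```

The general pattern is safe with `re.match()` as long as no other disease code begins with "905", which is the assumption the slide makes.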

  21. Regular Expressions for Anatomical Sites • Pleural: “(C34[12389]|C3[89][0123489])” • Peritoneal: “C48[28]” • Ovarian: “C5[4567]” • Testicular: “C6[23]” • All are a little more general than the code lists, but sufficient (they do not match any code that represents some other site) • One could also write “C620|C621|C629..” etc.
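
One way to check these patterns is to run them against the site codes themselves. The sketch below uses a selection of the codes from the Site Codes slide; `sites_matching` is a hypothetical helper, and "C550" and "C180" are extra strings chosen only to probe the patterns.

```python
import re

SITE_RE = {
    "pleural": "(C34[12389]|C3[89][0123489])",
    "peritoneal": "C48[28]",
    "ovarian": "C5[4567]",
    "testicular": "C6[23]",
}

# A selection of the site codes listed earlier in the slides.
SITE_CODES = {
    "pleural": "C341 C342 C343 C348 C349 C380 C381 C382 C383 C384 C388 C398 C399".split(),
    "peritoneal": "C482 C488".split(),
    "ovarian": "C540 C541 C542 C543 C548 C549 C559 C569 C570 C571 C572 C573 C575 C577 C578 C579".split(),
    "testicular": "C620 C621 C629 C630 C631 C632 C637 C638 C639".split(),
}

def sites_matching(code):
    """Return the names of all sites whose pattern matches the code."""
    return [site for site, pattern in SITE_RE.items() if re.match(pattern, code)]

# Collect any listed code that does not match exactly its own site's pattern.
mismatches = [(code, site)
              for site, listed in SITE_CODES.items()
              for code in listed
              if sites_matching(code) != [site]]
print(mismatches)

# The patterns are broader than the lists: "C550" is not listed but still matches.
print(sites_matching("C550"))
print(sites_matching("C180"))
```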

  22. Disease-Site Function
      import glob
      import re

      def diseaseSite(diseaseRE, siteRE):
          # Returns a list with the age-wise distribution (20 bins) of the SEER
          # records that match the disease and site regular expressions.
          filelist = glob.glob("SEER_1973_2010_TEXTDATA/incidence/yr*/*.TXT")
          ages = [0]*20  # 20 bins for ages: 0-4, 5-9, ..., 95+
          for file in filelist:  # process each file
              with open(file, "r") as infile:
                  for line in infile:
                      disease = line[52:57]  # ICD-Oncology-3 code (Histology+Behavior)
                      site = line[42:46]  # ICD-Oncology topography (site) code
                      age = int(line[24:27])
                      if re.match(diseaseRE, disease) and re.match(siteRE, site):
                          if age > 95:  # put higher ages in the last bin
                              age = 95
                          age_bin = age // 5  # integer division gives the bin number
                          ages[age_bin] += 1
          return ages

  23. Calling the Disease-Site Function
      def main():
          mesothelioma_pleura = diseaseSite("905", "(C34[12389]|C3[89][0123489])")
          print("done")
          mesothelioma_peritoneum = diseaseSite("905", "C48[28]")
          print("done")
          mesothelioma_ovary = diseaseSite("905", "C5[4567]")
          print("done")
          mesothelioma_testis = diseaseSite("905", "C6[23]")
          print("done")
          # save in a comma-separated-values file
          with open("SEER_diseaseSite_out.csv", "w") as outfile:
              print("Ages, Pleura, Peritoneum, Ovarian, Testicular", file=outfile)
              for i in range(20):
                  print(5*i, " to ", 5*i+4, ",",
                        mesothelioma_pleura[i], ",",
                        mesothelioma_peritoneum[i], ",",
                        mesothelioma_ovary[i], ",",
                        mesothelioma_testis[i], file=outfile)

      main()

  24. Age-Distribution of Mesothelioma Occurring at Different Anatomic Sites

  25. Age-Distribution Graph (plotted .csv file through Excel)

  26. Conclusions • Looking at the table and graph, we can conclude: • Ovarian and testicular mesothelioma are very rare compared to pleural and peritoneal mesothelioma • The peak age for ovarian mesothelioma (50s) is younger than for the other sites (70s)
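
The peak age can also be read directly from a 20-bin list such as the one returned by `diseaseSite`. The sketch below is an illustration under assumptions: `peak_age_range` is a hypothetical helper and the example distribution is made up, not SEER output.

```python
def peak_age_range(ages):
    """Return (low, high) for the fullest 5-year bin; high is None for the 95+ bin."""
    peak = max(range(len(ages)), key=lambda i: ages[i])  # first index of the largest count
    low = 5 * peak
    high = None if peak == len(ages) - 1 else low + 4
    return low, high

# Made-up 20-bin distribution with its largest count in the 70-74 bin.
example = [0] * 20
example[13] = 8
example[14] = 12
example[15] = 9
print(peak_age_range(example))  # prints: (70, 74)
```

When two bins tie for the largest count, this version reports the younger one, which is a design choice rather than anything the slides specify.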
