More Programming with SEER Dataset
Race-Based Cancer Occurrence Rates (Text 2 Chapter 20)
Site-Specific Tumors (Text 2 Chapter 25)
HCA 741: Essential Programming for Health Informatics
Rohit Kate
Race-Based Analysis
• Each entry in the SEER dataset contains racial information about the patient
• We can ask questions about racial differences in cancer occurrence rates
• What can cause the differences?
  • Genetic traits
  • Identifiable exposure to carcinogens or living conditions
  • Socioeconomic conditions
Race-Based Analysis
• Why do race-based analysis?
  • Design interventions that can modify the effects of the trait
  • More targeted prevention
• We can find differences by analyzing millions of cancer records in the SEER dataset
Race Information in SEER
• The seerdic.pdf file in the “incidence” folder gives the following information:
• The race/ethnicity information is stored in characters 20-21 for each occurrence
• The two characters mean the following:
  01  White
  02  Black
  03  American Indian
  04  Chinese
  05  Japanese
  06  Filipino
  07  Hawaiian
  08  Korean (Effective with 1/1/1988 dx)
  10  Vietnamese (Effective with 1/1/1988 dx)
  11  Laotian (Effective with 1/1/1988 dx)
  12  Hmong (Effective with 1/1/1988 dx)
  13  Kampuchean (including Khmer and Cambodian) (Effective with 1/1/1988 dx)
  14  Thai (Effective with 1/1/1994 dx)
  15  Asian Indian or Pakistani, NOS (Effective with 1/1/1988 dx)
  16  Asian Indian (Effective with 1/1/2010 dx)
  17  Pakistani (Effective with 1/1/2010 dx)
  20  Micronesian, NOS (Effective with 1/1/1991 dx)
  21  Chamorran (Effective with 1/1/1991 dx)
  22  Guamanian, NOS (Effective with 1/1/1991 dx)
  25  Polynesian, NOS (Effective with 1/1/1991 dx)
  26  Tahitian (Effective with 1/1/1991 dx)
  27  Samoan (Effective with 1/1/1991 dx)
  28  Tongan (Effective with 1/1/1991 dx)
  30  Melanesian, NOS (Effective with 1/1/1991 dx)
  31  Fiji Islander (Effective with 1/1/1991 dx)
  32  New Guinean (Effective with 1/1/1991 dx)
  96  Other Asian, including Asian, NOS and Oriental, NOS (Effective with 1/1/1991 dx)
  97  Pacific Islander, NOS (Effective with 1/1/1991 dx)
  98  Other
  99  Unknown
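A small lookup table makes the two-character code readable in program output. The sketch below is only an illustration: `RACE_LABELS` and `race_label` are hypothetical names (not part of SEER or the course code), the table covers just a few of the codes listed above, and the record is synthetic. It assumes the race field sits at characters 20-21, as the slide states.

```python
# Hypothetical helper: a partial code-to-label table built from the list above.
RACE_LABELS = {
    "01": "White",
    "02": "Black",
    "03": "American Indian",
    "04": "Chinese",
    "05": "Japanese",
    "98": "Other",
    "99": "Unknown",
}

def race_label(line):
    """Return a readable race label for one SEER record (one line of text)."""
    # characters 20-21 (counted from 1) are the slice 19:21 (counted from 0, end exclusive)
    code = line[19:21]
    return RACE_LABELS.get(code, "Not in table (" + code + ")")

# Synthetic record: 19 filler characters, the race code "02", then more filler.
sample = "x" * 19 + "02" + "x" * 40
print(race_label(sample))
```

Using `dict.get` with a default keeps the function from failing on codes that were left out of the partial table.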
Comparing Rates among Whites and Blacks
• We want to compare the difference in the occurrence rates among whites and blacks
• We will write a program to process the SEER dataset line-by-line as before
  • Read the two characters representing race and keep a count for whites and blacks
  • Get the cancer names from ICD-O as before
  • Very similar to the programming assignment!
• We will also print a ratio to make the comparison
  • Normalize by total white and black cancer occurrence counts (assume they represent population distribution)
• Save in a comma-separated-value file
Program

# Using the SEER dataset, this program computes distribution of various cancer
# types among whites and blacks.
import glob
import icdo  # the file we created for icdo dictionary

def main():
    # make a list of all the relevant file names in the SEER dataset folder
    filelist = glob.glob("SEER_1973_2010_TEXTDATA/incidence/yr*/*.TXT")
    white_di = {}
    black_di = {}
    for file in filelist:  # process each file
        with open(file, "r") as infile:
            for line in infile:
                # ICD-Oncology-3 code (Histology+Behavior) appears at characters 53-57.
                # Positions are counted from 1, but Python slices count from 0 and
                # exclude the end index, so characters 53-57 are line[52:57].
                disease = line[52:57]
                race = line[19:21]  # characters 20-21
                if race == "01":
                    white_di[disease] = white_di.get(disease, 0) + 1
                elif race == "02":
                    black_di[disease] = black_di.get(disease, 0) + 1
    icdo_di = icdo.ICDO_dictionary()  # use the ICD-O dictionary to get the names for the codes
    total_white = sum(white_di.values())  # total white occurrences
    total_black = sum(black_di.values())  # total black occurrences
    with open("SEER_race_out.csv", "w") as outfile:
        for code in icdo_di:
            bo = black_di.get(code, 0)  # total blacks that get this cancer
            wo = white_di.get(code, 0)  # total whites that get this cancer
            if bo != 0 or wo != 0:  # skip if both are zeros
                # compute the ratio
                if bo == 0:  # avoid division by zero
                    ratio = 1000000 + wo/total_white  # some large value plus relative white count
                else:
                    ratio = (wo/total_white)/(bo/total_black)  # ratio of relative counts
                # commas are removed from the name so it stays in one CSV column
                print(icdo_di[code].replace(",", ""), ",", ratio, ",", wo, ",", bo, file=outfile)

main()
Analysis
• After sorting in Excel according to the ratio column, one can see (white-black counts in parentheses):
• Certain cancers are present only in whites
  • spindle cell melanoma type a (61-0)
  • mal. melanoma in precan. melanosis (47-0)
• Certain cancers have high occurrences in whites
  • superficial spreading melanoma in situ (10688-14)
  • mal. melanoma in junctional nevus (969-1)
• Certain cancers are more common in blacks
  • pigmented dermatofibrosarcoma protuberans (W/B Ratio = 0.23)
  • Strengthens the hypothesis that melanin in the tumor is a secondary phenomenon
• Pathologists used to believe that Ewing sarcoma, a rare malignant tumor that occurs in children and young adults, never happens in blacks
  • Analysis of the SEER dataset disproves it (75 cases found)
• The book uses an older version of the dataset; we are using the 1973-2010 version, hence the numbers are different
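The sorting step can also be done in Python instead of a spreadsheet. This is a minimal sketch, assuming the CSV has the four columns written by the program above (name, ratio, white count, black count) and no header row; `sort_by_ratio` is a hypothetical helper name, and the sample rows are made up rather than real SEER counts.

```python
import csv
import io

def sort_by_ratio(csv_file):
    """Read (name, ratio, white, black) rows and return them sorted by ratio, largest first."""
    rows = []
    for name, ratio, white, black in csv.reader(csv_file):
        # float() and int() tolerate the spaces that print() leaves around each field
        rows.append((name.strip(), float(ratio), int(white), int(black)))
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Made-up rows in the same layout, for illustration only.
sample = io.StringIO(
    "tumor a , 2.5 , 50 , 4\n"
    "tumor b , 1000000.001 , 10 , 0\n"
    "tumor c , 0.25 , 5 , 4\n"
)
for row in sort_by_ratio(sample):
    print(row)
```

For the real output one would pass an open file instead of the in-memory sample, for example `open("SEER_race_out.csv", newline="")`. Rows with the large sentinel ratio (no black cases) sort to the top, and ratios below 1 (relatively more common in blacks) end up at the bottom.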
Site-Specific Tumors
• Tumors are generally discussed without a specific anatomic site in mind, on the assumption that a particular type of tumor has the same properties wherever it arises
• We will consider a particular type of tumor, mesothelioma, and empirically look at the age distributions of its occurrence at different anatomical sites
Mesothelioma
• Malignant tumors that arise from the surfaces lining the walls of body cavities and surfaces of organs within the body cavities
• In the SEER dataset it is coded with four different ICD-O codes:
  90503 = mesothelioma
  90513 = fibrous mesothelioma
  90523 = epithelial mesothelioma
  90533 = biphasic mesothelioma
Mesothelioma
• We will consider four major anatomic sites:
  • Pleural (chest; heart and lungs)
  • Peritoneal (abdomen; liver and intestines)
  • Ovarian
  • Testicular
• Each SEER record also contains a topography code indicating the anatomic site (Codes: http://www.dhs.wisconsin.gov/wcrs/pdf/ICD-03Site-text307.pdf)
Site Codes
• See the book chapter for their meanings
• Pleural: C341, C342, C343, C348, C349, C380, C381, C382, C383, C384, C388, C390, C398, C399
• Peritoneal: C482, C488
• Ovarian: C540, C541, C542, C543, C548, C549, C559, C569, C570, C571, C572, C573, C575, C577, C578, C579
• Testicular: C620, C621, C629, C630, C631, C632, C637, C638, C639
SEER Records
• Each SEER record is a line in the file which gives:
  • ICD-O code for the disease at characters 53-57
  • Anatomical site at characters 43-46
  • Age at characters 25-27
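The character positions above are counted from 1 and include both ends, while Python slices count from 0 and exclude the end index, so characters 53-57 become the slice `[52:57]`. The sketch below wraps that conversion in a hypothetical helper called `field` (an illustrative name, not part of the course code) and checks it on a synthetic record.

```python
def field(line, first, last):
    """Return characters first..last of line, counted from 1 and inclusive."""
    return line[first - 1:last]

# Synthetic 60-character record with recognizable values at the three positions.
chars = list("." * 60)
chars[24:27] = "067"     # age at characters 25-27
chars[42:46] = "C384"    # anatomical site at characters 43-46
chars[52:57] = "90503"   # ICD-O code at characters 53-57
record = "".join(chars)

print(field(record, 25, 27), field(record, 43, 46), field(record, 53, 57))
```

Writing `field(line, 53, 57)` keeps the numbers in the code identical to the numbers in the data dictionary, which makes off-by-one mistakes easier to spot than with raw slices.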
Mesothelioma in SEER Records
• ICD-O code for the disease at characters 53-57
  • Match codes starting with “905”
• Anatomical site at characters 43-46
  • Match one of the anatomical codes that represent one of the four anatomical sites
• Age at characters 25-27
  • We will record that information
Age-Distribution in a Site-Specific Mesothelioma
• Given that we need age distributions for four different anatomical sites, we will write it as a general function
• The function will, in fact, be general enough to give the age distributions for any disease(s) at any anatomical site(s)
• Parameters:
  • Disease codes as a regular expression
  • Anatomical site codes as a regular expression
• Returns:
  • A list with age distribution (0-4, 5-9, …, 95+)
Disease-Site Function
• The function will go through every SEER file and every line in every file
• If the record matches the disease and the anatomical site:
  • Include the age in the age-distribution
Age-Distribution
• It can be done in several ways
• We will make 20 bins, each of 5 years
  • 0-4, 5-9, 10-14, …, 95+
  • The last bin will include everyone >= 95
• We will count how many records fall in each age bin
• How to implement this in Python?
Age-Distribution
• Initialize a list of size 20 with zeros
• For every relevant record, integer-divide the age by 5 to get the bin number (if age > 95, make it 95)
• Increment the appropriate bin
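The three steps above can be sketched as follows. `age_to_bin` is a hypothetical helper name and the ages are made up. The sketch does not treat any special "unknown age" code separately, so it is worth checking seerdic.pdf for how missing ages are recorded before relying on the last bin.

```python
def age_to_bin(age):
    """Map an age in years to one of 20 five-year bins (0-4 -> 0, ..., 95+ -> 19)."""
    if age > 95:      # cap higher ages so they land in the last bin
        age = 95
    return age // 5   # integer division gives the bin number

ages = [0] * 20       # one counter per bin, all starting at zero
for age in [3, 4, 5, 67, 94, 95, 102]:   # made-up ages
    ages[age_to_bin(age)] += 1
print(ages)
```

An equivalent one-liner is `min(age // 5, 19)`, which caps the bin number rather than the age.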
Regular Expression for Mesothelioma ICD Codes
  90503 = mesothelioma
  90513 = fibrous mesothelioma
  90523 = epithelial mesothelioma
  90533 = biphasic mesothelioma
• “^905”, or “905” used with “re.match()”, which only matches at the start of the string
  • This is sufficient, although more general
• “905[0123]3”
  • This is exact
Regular Expressions for Anatomical Sites
• Pleural: “(C34[12389]|C3[89][0123489])”
• Peritoneal: “C48[28]”
• Ovarian: “C5[4567]”
• Testicular: “C6[23]”
• All are a little more general than the code lists, but sufficient (they do not match any code that represents some other site)
• One could also write “C620|C621|C629..” etc.
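One way to gain confidence in these patterns is to run them against a few codes. The sketch below is illustrative only: `SITE_PATTERNS` and `site_group` are hypothetical names, and the patterns are copied from the slide. It relies on the fact that `re.match` tests only the beginning of the string, so no leading `^` is needed.

```python
import re

# Patterns copied from the slide; the dictionary itself is a hypothetical convenience.
SITE_PATTERNS = {
    "pleural": "(C34[12389]|C3[89][0123489])",
    "peritoneal": "C48[28]",
    "ovarian": "C5[4567]",
    "testicular": "C6[23]",
}

def site_group(site_code):
    """Return the first site group whose pattern matches at the start of site_code, else None."""
    for group, pattern in SITE_PATTERNS.items():
        if re.match(pattern, site_code):
            return group
    return None

for code in ["C384", "C482", "C569", "C621", "C619"]:
    print(code, site_group(code))

# re.match anchors at the start: "905" matches 90523 but not a code that merely contains 905.
print(bool(re.match("905", "90523")), bool(re.match("905", "89053")))
# The exact pattern also rejects a 905x3 code outside the four listed ones.
print(bool(re.match("905[0123]3", "90533")), bool(re.match("905[0123]3", "90543")))
```

The last code in the loop, C619, is deliberately outside all four groups, so the function returns `None` for it.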
Disease-Site Function

import glob
import re

def diseaseSite(diseaseRE, siteRE):
    # returns a list with age-wise distribution with 20 bins
    # of the SEER records which match disease and site regular expressions
    filelist = glob.glob("SEER_1973_2010_TEXTDATA/incidence/yr*/*.TXT")
    ages = [0]*20  # 20 bins for ages, 0-4, 5-9, ..., 95+
    for file in filelist:  # process each file
        with open(file, "r") as infile:
            for line in infile:
                disease = line[52:57]  # ICD-Oncology-3 code (Histology+Behavior), characters 53-57
                site = line[42:46]     # ICD-Oncology topography (site) code, characters 43-46
                if re.match(diseaseRE, disease) and re.match(siteRE, site):
                    age = int(line[24:27])  # age, characters 25-27
                    if age > 95:  # put higher ages in the last bin
                        age = 95
                    age_bin = age // 5  # integer division that gives the bin number
                    ages[age_bin] += 1
    return ages
Calling the Disease-Site Function

def main():
    mesothelioma_pleura = diseaseSite("905", "(C34[12389]|C3[89][0123489])")
    print("done")
    mesothelioma_peritoneum = diseaseSite("905", "C48[28]")
    print("done")
    mesothelioma_ovary = diseaseSite("905", "C5[4567]")
    print("done")
    mesothelioma_testis = diseaseSite("905", "C6[23]")
    print("done")
    # save in a comma-separated-values file
    with open("SEER_diseaseSite_out.csv", "w") as outfile:
        print("Ages, Pleura, Peritoneum, Ovarian, Testicular", file=outfile)
        for i in range(20):
            # note: the last row is labeled "95 to 99" but holds all ages 95 and above
            print(5*i, " to ", 5*i+4, ",", mesothelioma_pleura[i], ",",
                  mesothelioma_peritoneum[i], ",", mesothelioma_ovary[i], ",",
                  mesothelioma_testis[i], file=outfile)

main()
Age-Distribution of Mesothelioma Occurring at Different Anatomic Sites
Conclusions
• Looking at the table and graph, we can conclude:
  • Ovarian and testicular mesothelioma are very rare compared to pleural and peritoneal mesothelioma
  • The peak age for ovarian mesothelioma (50s) is younger than for other sites (70s)
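The peak age can also be read off programmatically from a 20-bin list such as the one returned by diseaseSite. A minimal sketch, assuming the same 5-year bins; `peak_bin_label` is a hypothetical helper name and the distribution below is made up, not a SEER result.

```python
def peak_bin_label(ages):
    """Return the label of the 5-year bin with the highest count (first one on ties)."""
    peak = max(range(len(ages)), key=lambda i: ages[i])
    if peak == len(ages) - 1:          # the last bin is open-ended
        return str(5 * peak) + "+"
    return str(5 * peak) + "-" + str(5 * peak + 4)

# Made-up 20-bin distribution, for illustration only.
toy = [0, 0, 0, 1, 1, 2, 3, 5, 8, 12, 15, 11, 9, 7, 4, 2, 1, 0, 0, 0]
print(peak_bin_label(toy))
```

With very small counts, as for the rare ovarian and testicular sites, a single peak bin can be noisy, so comparing neighboring bins or decades is a sensible cross-check.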