
Anatomic Pathology Data Mining

Jules J. Berman, Ph.D., M.D., Program Director, Pathology Informatics, Cancer Diagnosis Program, National Cancer Institute. *All opinions herein are Dr. Berman’s and do not represent those of any federal agency.


Presentation Transcript


  1. Anatomic Pathology Data Mining • Jules J. Berman, Ph.D., M.D. • Program Director, Pathology Informatics • Cancer Diagnosis Program • National Cancer Institute • *All opinions herein are Dr. Berman’s and do not represent those of any federal agency.

  2. Expertise Domain of the Anatomic Pathology Data Miner • Confidentiality/Privacy Issues • Data Sharing issues, which include data standardization • Data Analysis

  3. Data Domain of Pathology Data Miner • Pathology Data linked to tissue samples • Any medical record data that can be linked to pathology data (including cancer registry data) • Any other relevant data in existence that can be sensibly linked to pathology records (this usually means the internet)

  4. Confidentiality/privacy • Anyone interested in using confidential information (essentially any data generated in a hospital that is attached to a patient) needs to understand confidentiality and privacy issues. • The fact that you might be using only your department’s data and that you treat the data confidentially will almost never exempt you from existing regulations. • The consequences to you and your institution of ignoring regulations can be profound.

  5. UNCONSENTED RECORDS VERSUS CONSENTED RECORDS • How can a researcher get a waiver from patient consent requirements? • By minimizing the risk of the study. • In most studies, this means reducing confidentiality and privacy risks to near-zero

  6. Standards issues related to data sharing • Nomenclatures and free-text mapping • Common Data Elements • Standard Report Formats • Internet Protocols

  7. CDE for Date of Birth • |birthdate| September 15, 1970 • |birthday| September 15, 1970 • |D.O.B.| September 15, 1970 • |d.o.b.| September 15, 1970 • |date of birth| September 15, 1970 • |date of birth| September 15, 1970 • |date-of-birth| September 15, 1970 • |date_of_birth| September 15, 1970 • |dob| September 15, 1970 • |DOB| September 15, 1970
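
The label variants on this slide can be collapsed onto one common data element name. The following is a minimal Python sketch, not a method from the talk: the canonical name `date_of_birth`, the alias set, and the function `canonical_tag` are all assumptions chosen for illustration.

```python
import re
from typing import Optional

# Hypothetical canonical CDE name; the slide lists variants, not a chosen standard.
CANONICAL = "date_of_birth"

# Letters-only, lowercase forms of the labels shown on the slide.
_ALIASES = {"birthdate", "birthday", "dob", "dateofbirth"}


def canonical_tag(label: str) -> Optional[str]:
    """Return the canonical CDE name for a date-of-birth label, else None."""
    # Lowercase, then drop dots, spaces, hyphens, underscores and any other non-letters.
    key = re.sub(r"[^a-z]", "", label.lower())
    return CANONICAL if key in _ALIASES else None
```

Under these assumptions, `canonical_tag("D.O.B.")` and `canonical_tag("date-of-birth")` both resolve to the same element name, while an unrelated label returns `None`.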

  8. Representation of CDE • |date_of_birth| September 15, 1970 • |date_of_birth| 15, September, 1970 • |date_of_birth| 9/15/70 • |date_of_birth| 15/9/70 • |date_of_birth| 15/09/70 • |date_of_birth| 9/15/1970 • |date_of_birth| 9.15.70 • |date_of_birth| 9,15,70 • |date_of_birth| some delta time
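
Even with one agreed tag, the value itself needs a single representation. As a rough sketch (assuming ISO 8601 as the target, English month names, and a hypothetical helper `to_iso`), the formats on this slide could be tried in turn. Note that the order of the format list is itself an assumption: a value such as 1/2/70 is genuinely ambiguous and would be read month-first here.

```python
from datetime import datetime
from typing import Optional

# Candidate formats, tried in order. Month-first is tried before day-first,
# which is an arbitrary choice made for this illustration.
FORMATS = [
    "%B %d, %Y",   # September 15, 1970
    "%d, %B, %Y",  # 15, September, 1970
    "%m/%d/%Y",    # 9/15/1970
    "%m/%d/%y",    # 9/15/70
    "%d/%m/%y",    # 15/9/70 and 15/09/70
    "%m.%d.%y",    # 9.15.70
    "%m,%d,%y",    # 9,15,70
]


def to_iso(value: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no known format fits."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

A value such as "some delta time" matches none of the formats and yields `None`, which flags it for manual handling rather than guessing.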

  9. Annotation/Curation of the CDE • Unique identifier • Creator name • Date of creation • Date of modifications • Exact definition • Hierarchy (if applicable) • List of users or CDE-specific browsers

  10. BEST EXAMPLE CDE SITE • United States Health Information Knowledgebase (USHIK) • http://hmrha.hirs.osd.mil/registry/index1.html

  11. CDEs become XML tags • <date_of_birth>10/17/00</date_of_birth>

  12. CDEs become self-attributing XML tags • <date_of_birth defn="http://www.cde.org">10/17/00</date_of_birth>
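
A self-attributing tag can be read back with any XML parser. This short sketch uses Python's standard `xml.etree.ElementTree`; the function name `read_cde` is hypothetical, and the element is the one shown on the slide (written with straight quotes, as XML requires).

```python
import xml.etree.ElementTree as ET


def read_cde(xml_text: str) -> dict:
    """Return the tag name, its definition URL (if any) and its value."""
    element = ET.fromstring(xml_text)
    return {
        "tag": element.tag,
        "defn": element.get("defn"),          # None when the attribute is absent
        "value": (element.text or "").strip(),
    }


record = read_cde('<date_of_birth defn="http://www.cde.org">10/17/00</date_of_birth>')
```

Here `record["defn"]` carries the pointer to the element's definition alongside the data value, which is the point of the self-attributing form.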

  13. Shared Pathology Informatics Network • 5-year project beginning April 2001 • Will develop the tools that will allow about 6 large laboratories to share their data with researchers, using the internet • Basically, it will allow a researcher to interrogate the pathology records at multiple institutions simultaneously and receive a summary report almost instantaneously.

  14. Shared Pathology Informatics Network • [Diagram: institutions 1, 2, 3 ... 9, each behind its own firewall, exchange test data requests and responses with a request server over the Internet]

  15. What is so special about anatomic pathology data? • Every anatomic pathology record is linked to the patient identifier and to the tissue blocks for that record • One of the important rate-limiting factors in cancer research today is access to tissues • Access to even a small fraction of the tissues routinely collected by pathology departments (about 40 million each year) would be of enormous research benefit.

  16. Increasing frequency of precancer terms, 1984-2000

  17. Example project: Virtual Precancer Archive • Johns Hopkins Surgical Pathology has cases accrued in electronic form since 1984 • 372,536 is the current (circa September 2000) number of accrued cases • Wouldn’t it be nice to be able to survey the archived precancer cases in a large archive such as the Hopkins Archive?

  18. Step 1. (Drs. Bill Moore and Robert Miller) Build a phrase list from all cases • The text of the reports can be represented as a collection of phrases that contain all of the concepts included in the reports. • The 372,536 records were parsed to find the diagnostic field free-text. • Diagnostic field free-text was parsed into sentences. • Diagnostic field sentences were parsed into phrases and words.
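
The slide does not say how sentences were split into phrases. The sketch below is one plausible approach, offered only as an assumption: sentences are split on periods, semicolons and newlines, and phrases are the runs of words left between common stop words or punctuation. The stop-word list and both function names are hypothetical.

```python
import re

# Hypothetical stop-word list; a real one would be considerably longer.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "no", "of", "the", "to", "with"}

# Words (letters, digits, apostrophes, hyphens) or single punctuation marks.
_TOKEN = re.compile(r"[a-z0-9'-]+|[,:()]")


def split_sentences(text: str) -> list:
    """Split free text on periods, semicolons and newlines (naive: breaks decimals)."""
    return [s.strip() for s in re.split(r"[.;\n]+", text) if s.strip()]


def split_phrases(sentence: str) -> list:
    """Return runs of words separated by stop words or punctuation."""
    found, current = [], []
    for token in _TOKEN.findall(sentence.lower()):
        if token in STOP_WORDS or not token[0].isalnum():
            if current:  # a barrier closes the phrase built so far
                found.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        found.append(" ".join(current))
    return found
```

Collecting the output of `split_phrases` over every diagnostic sentence into a set would give a de-duplicated phrase list of the kind described on this slide.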

  19. 418,159 phrases represent all the textual concepts in the JHH surgical pathology records and lie outside the realm of the Common Rule • minimal mononuclear cell infiltrate • minimal mononuclear cell infiltration • minimal mononuclear cell interstitial • minimal mononuclear infiltrate • minimal mononuclear inflammation • minimal mononuclear interstitial infitrates • minimal mononuclear meningeal • minimal morphologic abnormalities

  20. Step 2. Create a precancer terminology • Started with the National Library of Medicine’s UMLS (Unified Medical Language System) • We use the concept list file, which is 113,699,627 bytes and contains 1,598,176 terms. • As an example, rcc (renal cell carcinoma) has about 80 synonymous terms in UMLS

  21. UMLS CUI C0007134: Renal cell carcinoma • carcinoma, renal cell • carcinomas, renal cell • renal cell carcinoma • hypernephroid carcinoma • grawitz tumor • hypernephroma • renal cell adenocarcinoma • rcc

  22. Why do we need to disambiguate common terms? • Google search engine query 09/19/00 • "rcc" => 132,000 hits • "renal cell carcinoma" => 11,600 hits • "grawitz tumor" => 79 hits

  23. The UMLS precancer terms • 2,984 terms • Includes 221 terms that I added and assigned private J-codes

  24. Step 3. Map the Hopkins phrases to the precancer terms • Start with 418,159 phrases • One by one, try to find a matching term for each phrase in the list of 2,984 precancer terms • Prepare a file of all the matching terms • This step takes 33 seconds to complete with a Perl script running on a 450 MHz desktop computer - i.e., it’s scalable
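
The matching step can be sketched in a few lines. This is an illustration in Python rather than the original Perl script, whose details are not given in the talk: it assumes a term matches when it appears as a run of whole words inside the phrase, prefers the longest such run, and writes the result as phrase|term|code. The three terms and codes are taken from the result slide; the function names are hypothetical.

```python
from typing import Optional, Tuple

# A tiny stand-in for the 2,984-term precancer list (term -> code).
PRECANCER_TERMS = {
    "actinic keratosis": "0022602",
    "dysplasia": "0334044",
    "dysplastic": "0334045",
}


def match_phrase(phrase: str, terms: dict) -> Optional[Tuple[str, str]]:
    """Return (term, code) for the longest word run in phrase found in terms."""
    words = phrase.split()
    for size in range(len(words), 0, -1):           # longest candidates first
        for start in range(len(words) - size + 1):
            candidate = " ".join(words[start:start + size])
            if candidate in terms:                   # constant-time dictionary lookup
                return candidate, terms[candidate]
    return None


def match_line(phrase: str, terms: dict) -> Optional[str]:
    """Format a match as 'phrase|term|code', or return None when nothing matches."""
    hit = match_phrase(phrase, terms)
    return None if hit is None else f"{phrase}|{hit[0]}|{hit[1]}"
```

Because each candidate is checked with a dictionary lookup, the cost per phrase depends on the phrase length and not on the size of the term list, which is one way such a step can stay fast as the archive grows.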

  25. The result: 10,310 term matches, from 418,159 phrases: a scalable work in progress • early actinic keratosis|actinic keratosis|0022602 • early adenomatous polyp|adenomatous polyp|0206677 • early borderline rejection|borderline|0205189 • early dysplasia|dysplasia|0334044 • early dysplastic change|dysplastic|0334045 • early dysplastic process|dysplastic|0334045 • early gastric mucin cell metaplasia|metaplasia|0025568 • early gastric mucous cell metaplasia|metaplasia|0025568

  26. Step 4. Give precancer match list to Drs. Bill Moore and Robert Miller to create a concordance • 10,310 precancer terms occurred in 54,909 accessioned surgical pathology cases between 1984 and 2000. That is, each of the precancer terms was found, on average, in a little more than 5 cases. • The 54,909 cases containing a precancer term represent 54,909 / 372,536 ≈ 15% of all accrued cases
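
The two ratios on this slide can be checked directly from the figures quoted in the talk. The variable names below are illustrative only.

```python
matched_terms = 10_310      # precancer term matches
precancer_cases = 54_909    # cases containing at least one precancer term
all_cases = 372_536         # all accrued surgical pathology cases

# Average number of cases per matched term: "a little more than 5".
cases_per_term = precancer_cases / matched_terms

# Share of all accrued cases that contain a precancer term: about 15%.
percent_precancer = 100 * precancer_cases / all_cases

print(f"{cases_per_term:.2f} cases per term, {percent_precancer:.1f}% of cases")
```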

  27. The concordance looks like this: • C0001815^367220497667008419098^^ • C0002893^394120765570701149177^^ • C0002893^435120960421908784068^^ • C0002893^436410698795906686356^^ • C0002893^445510623875200588234^^
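
Each concordance line pairs a UMLS concept identifier with a long number, separated by carets. The sketch below groups lines by concept. It assumes the second field is an opaque per-case key (the slide does not say how it is produced), and the function name is hypothetical.

```python
from collections import defaultdict

SAMPLE = """\
C0001815^367220497667008419098^^
C0002893^394120765570701149177^^
C0002893^435120960421908784068^^
C0002893^436410698795906686356^^
C0002893^445510623875200588234^^
"""


def parse_concordance(lines) -> dict:
    """Map each concept identifier to the list of case keys recorded for it."""
    index = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split("^")
        concept_id, case_key = fields[0], fields[1]  # assumed: field 2 is an opaque case key
        index[concept_id].append(case_key)
    return dict(index)


index = parse_concordance(SAMPLE.splitlines())
```

Counting the keys per concept, optionally per year, would give tallies of the kind shown on the per-year slides without exposing any patient identifier.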

  28. Precancer-related cases by year
      Year  Cases  %
      1984  1175   7%
      1985  1573   8%
      1986  2024   10%
      1987  2195   11%
      1988  2239   11%
      1989  2328   11%
      1990  2721   12%
      1991  3077   14%
      1992  3185   14%
      1993  2878   13%
      1994  3060   14%
      1995  2968   13%
      1996  3475   14%
      1997  4726   17%
      1998  4989   18%
      1999  5996   20%
      2000  6298   25%

  29. Precancer-related cases by year

  30. Cases per year of Barrett’s esophagus (C0004763)
      Year  Cases
      1984  30
      1985  35
      1986  82
      1987  97
      1988  106
      1989  84
      1990  97
      1991  100
      1992  132
      1993  126
      1994  144
      1995  162
      1996  221
      1997  307
      1998  341
      1999  401

  31. Conclusion: • With these techniques, laboratories with good informatics infrastructure can create a virtual omni-archive (at very low cost) that operates within current human subject protection guidelines for minimal-risk de-identified retrospective studies.

  32. Epilog • There are other protocols for conducting confidential anatomic pathology research • These include anonymization, deidentification, brokered double encryption, and sanitization through nomenclature mapping • An example of the latter two methods is at www.netautopsy.org
