
BioNLP, Information Extraction from Radiology Reports

Emilia Apostolova, College of Computing and Digital Media, DePaul University


Presentation Transcript


  1. BioNLP, Information Extraction from Radiology Reports • Emilia Apostolova • College of Computing and Digital Media, DePaul University

  2. BioNLP – conferences and shared tasks • Pacific Symposium on Biocomputing • Intelligent Systems for Molecular Biology • Association for Computational Linguistics • North American Association for Computational Linguistics • BioNLP • BioCreative • TREC Genomics • IClef

  3. The NLP Pipeline • Lexical Analysis – tokenization, morphological analysis, linguistic lexicons. • Syntactic Analysis – part-of-speech tagging, chunking, parsing. • Semantic Analysis – lexical semantic interpretation, semantic interpretation of utterances. • Information Extraction (in BioMedicine)

  4. NLP Pipeline Frameworks • GATE - General Architecture for Text Engineering. • Apache UIMA - Unstructured Information Management Architecture. • Geneways - a system for automatically extracting, analyzing, visualizing and integrating molecular pathway data from the research literature. • PASTA - Protein Structures and Information Extraction from Biological Texts.

  5. Lexical Analysis - Tokenization • Segmenting text into linguistic tokens – words and sentences. • Abbreviations - The Study was conducted within the U.S. • Apostrophes - IL-10's cytokine synthesis inhibitory activity • Hyphenation - co-operate, cooperate • Multiple formats: 464,285.23 and 464295.23 • Sentence boundary detection - :, ;, -
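The tokenization pitfalls listed on this slide can be illustrated with a small regular-expression tokenizer. This is only a minimal sketch: the pattern and the function name `tokenize` are illustrative assumptions, not something taken from the presentation, and a real biomedical tokenizer would need many more rules.

```python
import re

# Alternatives are tried left to right, so the more specific ones come first.
TOKEN_RE = re.compile(
    r"""
    (?:[A-Za-z]\.){2,}              # dotted abbreviations such as U.S.
  | \d{1,3}(?:,\d{3})+(?:\.\d+)?    # numbers with thousands separators
  | \d+(?:\.\d+)?                   # plain integers and decimals
  | \w+(?:-\w+)*                    # words, including hyphenated ones (IL-10)
  | '\w+                            # clitics such as 's
  | [^\w\s]                         # any other single punctuation mark
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into a list of token strings (hypothetical helper)."""
    return TOKEN_RE.findall(text)


print(tokenize("IL-10's activity was 464,285.23 within the U.S."))
```

Note that the final period of "U.S." is absorbed into the abbreviation token, so a sentence-final "U.S." no longer exposes a separate sentence-ending period. This is the kind of ambiguity that makes sentence boundary detection hard.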

  6. Lexical Analysis – Morphological analysis • Link surface variants of a lexical element to its canonical base form, e.g. inflections (activat-es, activat-ed, activat-ing), derivations (activation). • Porter stemmer – lexicon-free approach. Finds the longest match of a word against a list of English derivational and inflectional suffixes. • Two-level morphology – a finite-state approach that applies a series of parallel transducers to input tokens (fly -> flies).
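The lexicon-free idea of matching a word against a suffix list can be sketched as follows. This is a toy longest-suffix stripper, not the Porter algorithm itself (Porter applies ordered rule steps with conditions on the remaining stem); the suffix list and the `min_stem` threshold are assumptions chosen for the slide's examples.

```python
# Illustrative suffix list; far from complete.
SUFFIXES = ["ing", "ion", "es", "ed", "s"]


def strip_suffix(word, min_stem=3):
    """Remove the longest matching suffix, keeping at least min_stem characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["activates", "activated", "activating", "activation"]:
    print(w, "->", strip_suffix(w))
```

All four surface variants from the slide reduce to the same stem, `activat`, which is not a dictionary word. That is typical of stemmers: they only need to map variants to a shared key.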

  7. Syntactic Level – Part of Speech Tagging • activation – POS noun, singular • activate – POS verb, present non-3rd person singular • active – POS adjective • report?

  8. Syntactic Level - Parsing • A natural language parser is a program that works out the grammatical structure of sentences, for instance, which groups of words go together (as "phrases") and which words are the subject or object of a verb. • The Stanford Dependency Parser - a Java implementation of probabilistic natural language parsers, trained on the Penn Treebank.

  9. Semantic Level – Lexical Interpretation • Selectional Restrictions: • transitive verbs: inhibit [something], transcribe [something] • semantic restrictions: inhibit [Process], transcribe [Nucleic Acid] • Syntactically admissible, but semantically invalid: to inhibit amino acids, to transcribe cell growth
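Selectional restrictions of this kind can be checked with a simple lookup. The sketch below is hypothetical: the two dictionaries are hand-written stand-ins for a real lexical resource, and the semantic type given for "amino acids" is an assumed label. Only the restrictions for "inhibit" and "transcribe" come from the slide.

```python
# Semantic type each verb requires of its direct object (from the slide).
VERB_OBJECT_TYPE = {"inhibit": "Process", "transcribe": "Nucleic Acid"}

# Hypothetical semantic types for a few noun phrases.
NOUN_TYPE = {
    "cell growth": "Process",
    "DNA": "Nucleic Acid",
    "amino acids": "Amino Acid",
}


def is_semantically_valid(verb, obj):
    """True if the object's semantic type satisfies the verb's restriction."""
    expected = VERB_OBJECT_TYPE.get(verb)
    actual = NOUN_TYPE.get(obj)
    return expected is not None and expected == actual


print(is_semantically_valid("inhibit", "cell growth"))     # admissible
print(is_semantically_valid("inhibit", "amino acids"))     # rejected
print(is_semantically_valid("transcribe", "cell growth"))  # rejected
```

Unknown verbs or nouns are treated as invalid here; a real system would more likely fall back to "unknown" rather than reject them.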

  10. Discourse Level - Pragmatics • Discourse referents; what entities does a given message refer to? • What background knowledge is needed to understand a given message? • How do the beliefs of speaker and hearer interact in the interpretation of a message? • What is a relevant answer to a given question? • Summarization, Translation, Dialog Systems, Natural Language Generation.

  11. Lexical resources for (Bio)NLP • Princeton WordNet • NLM UMLS lexicon and Metathesaurus • The Open Biomedical Ontologies

  12. Text and Image Integration

  13. Automatic Image Annotation

  14. Automatic Image Annotation • Where? Woman (Population Group), Right breast (Body Part, Organ, or Organ Component) • How? Mammography (Diagnostic Procedure) • What? Calcification (Pathologic Function), Lesion (Finding), Carcinoma, Papillary (Neoplastic Process)

  15. IE from Clinical Texts – Radiology and Pathology Reports • Northwestern University Medical School, Department of Radiology, Imaging Informatics

  16. Radiology Reports

  17. Sample Radiology Report
      • Patient Name: XXXXXXX, XXXXX
      • Medical Record Number: XXXXXXXXXX DOB: XXXX.XX.XX Sex: F
      • Accession Number: XXXXXXXX
      • Study Requested: DIG MAMMOGRAM SCREENING (3300000)
      • Scheduled Date and Time: XXXX.XX.XX 13:02:00.0000
      • Requesting Physician: XXXXXXX,
      • Reason for Exam: V76.12
      • ----------------------------Radiological Report---------------------------------
      • Comparison is made to previous exams dated XX/XX/XX.
      • CLINICAL HISTORY: Seventy-two year old woman for screening exam. Patient has a family history of breast cancer, sister age sixty years old. Patient has a history of a previous left breast benign biopsy.
      • TECHNIQUE: Mammograms were obtained using digital technique.
      • FINDINGS: There is dense fibroglandular tissue bilaterally. No dominant masses or clustered microcalcifications suggestive of malignancy are seen.
      • 1. NO SPECIFIC FEATURES OF MALIGNANCY SEEN EITHER BREAST.
      • 2. NO SIGNIFICANT CHANGE WHEN COMPARED WITH PRIOR STUDIES.
      • 3. ANNUAL SCREENING MAMMOGRAM IS RECOMMENDED.
      • CODE (1): NEGATIVE
      • Attending Radiologist: XXXXXXX, MD
      • Date Signed off: XXXXXX, Transc. by: NS

  18. NLP for Clinical Texts • Document retrieval – case finding. • Subject recruitment – identify patients that can benefit from a study. • Surveillance – monitoring disease outbreaks. • Discovery of disease-drug associations. • Discovery of disease-finding associations.

  19. IE from Radiology Reports • Automatic Section Segmentation • Demographics • History • Comparison • Technique • Findings • Impression • Recommendation • Sign off

  20. Dataset • 215,000 free-text radiology reports, selected randomly from 3 million reports over a period of 9 years and representing 24 different types of diagnostic procedures.

  21. Method – Training Set • Hand-crafted rules for automatic extraction of a training set. Common boundary patterns, e.g. for the section Findings: the text between a known Findings header and the next known section header. Header patterns:
      ^ (finding | observation | discussion)s?:
      ^ (\W*)(finding | observation | discussion)s?(\W*)$
      • 3,000 automatically segmented “high-confidence” radiology reports, containing all 8 sections of interest.
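The boundary rule for the Findings section might look roughly like this in Python. It is a sketch under several assumptions: the spaces inside the slide's patterns are treated as typographic and removed, matching is case-insensitive and line-anchored, and `NEXT_HEADER` is a hypothetical stand-in for the list of other known headers, which the slides do not give. The sample report text in the usage line is made up.

```python
import re

FLAGS = re.IGNORECASE | re.MULTILINE

# The two header patterns from the slide, with the spaces removed.
INLINE_HEADER = re.compile(r"^(finding|observation|discussion)s?:", FLAGS)
STANDALONE_HEADER = re.compile(
    r"^(\W*)(finding|observation|discussion)s?(\W*)$", FLAGS
)

# Hypothetical pattern for the other known section headers.
NEXT_HEADER = re.compile(
    r"^(clinical history|comparison|technique|impression|recommendation)s?:", FLAGS
)


def extract_findings(report):
    """Return the text between a Findings header and the next known header."""
    start = INLINE_HEADER.search(report) or STANDALONE_HEADER.search(report)
    if start is None:
        return None
    rest = report[start.end():]
    end = NEXT_HEADER.search(rest)
    body = rest[: end.start()] if end else rest
    return body.strip()


sample = "TECHNIQUE: Digital.\nFINDINGS: Dense tissue bilaterally.\nIMPRESSION: Negative.\n"
print(extract_findings(sample))
```

Reports for which such rules fire for every section of interest would be the natural candidates for a "high-confidence" training set; reports where a header is missing return `None` here and would be left for the classifier.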

  22. Method • Classification task - each sentence from a radiology report is assigned to one of 8 pre-defined report sections.
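Sentence-level section classification can be sketched with a small multinomial Naive Bayes model over word counts. The transcript does not say which classifier or features were actually used, so both the model choice and the bag-of-words features are assumptions here; the class name and the three training sentences are illustrative only.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSectionClassifier:
    """Multinomial Naive Bayes over lowercase word counts, add-one smoothing."""

    def fit(self, sentences, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for sentence, label in zip(sentences, labels):
            self.word_counts[label].update(sentence.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, sentence):
        words = sentence.lower().split()
        total = sum(self.label_counts.values())

        def log_score(label):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.label_counts[label] / total)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            return score

        return max(self.label_counts, key=log_score)


clf = NaiveBayesSectionClassifier().fit(
    [
        "mammograms were obtained using digital technique",
        "no dominant masses are seen",
        "annual screening mammogram is recommended",
    ],
    ["Technique", "Findings", "Recommendation"],
)
print(clf.predict("no masses are seen"))
```

In the setting described on the slides, the training sentences would come from the automatically segmented high-confidence reports, with one label per report section.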

  23. Sentence features used for training a classifier.

  24. Work in Progress • Identify named entities within sections using a controlled vocabulary – findings, diseases, observations, anatomical organs, imaging modalities. • Negation Discovery. • Identify relationships between named entities of interest, for example what observations are associated with a diagnosis. • Use radiology report text to support automatic annotation of medical images.

  25. Q/A
