ODIE Toolkit

NCBO Council Talk, December 18, 2007. Rebecca Crowley, crowleyrs@upmc.edu.

Presentation Transcript


  1. ODIE Toolkit NCBO Council Talk December 18, 2007 Rebecca Crowley crowleyrs@upmc.edu

  2. Outline • Overview of the Project • Aims, People, Organization, Domain, Philosophy • Specific Aims from a use case approach • Information Extraction • Ontology Enrichment • First steps, synergies, and year 1 work, working together

  3. Project Overview • Funded by National Cancer Center • Develop tools for • Information extraction from clinical text using ontologies • Enrichment of ontologies using clinical text • Project Period: 9/27/2007 – 7/31/2011 • Collaboration with National Center for Biomedical Ontology • Subcontract to Stanford (consultation on Bioportal) • Subcontract to Mayo (Terminologies, NLP)

  4. Specific Aims: Year 1 development goals • Specific Aim 1: Develop and evaluate methods for information extraction (IE) tasks using existing OBO ontologies, including: Named Entity Recognition; Co-reference Resolution; Discourse Reasoning; Attribute Value Extraction • Specific Aim 2: Develop and evaluate general methods for clinical-text mining to assist in ontology development, including: Preprocessing; Concept Discovery and Clustering; Suggest taxonomic positioning and relationships • Specific Aim 3: Develop reusable software for performing information extraction and ontology development leveraging existing NCBO tools and compatible with NCBO architecture. • Specific Aim 4: Enhance National Cancer Institute Thesaurus Ontology using the ODIE toolkit. • Specific Aim 5: Test the ability of the resulting software and ontologies to address important translational research questions in hematologic cancers.

  5. Dual Proposal Goals

  6. People @pitt • Wendy Chapman, co-I • Rebecca Crowley, PI • Preet Chaudhary, co-I • Kaihong Liu, Graduate Student • Kevin Mitchell, Architect • Girish Chavan, Interfaces • John Dowling, Annotation

  7. Organization • Annotations: Develop manually annotated sets for training and testing (Rebecca Crowley, Wendy Chapman, Kaihong Liu, John Dowling) • Algorithms: Consider and test existing algorithms; design, implement and test new algorithms (Rebecca Crowley, Wendy Chapman, Kaihong Liu, Kevin Mitchell) • Architecture: Develop and implement architecture (Rebecca Crowley, Kevin Mitchell, Girish Chavan)

  8. Domain • Will attempt to develop general tools whenever possible • Priorities for evaluation of components in: • Radiology and pathology reports • NCIT as well as other clinically relevant OBO ontologies • Cancer domains (including hematologic oncology)

  9. Philosophy • Toolkit for developers of NLP applications and ontologies • Support interaction and experimentation • Package systems at the conclusion of working with ODIE • Foster cycle of enrichment and extraction needed to advance development of NLP systems • Ontology enrichment as opposed to de novo development • Human-machine collaboration as opposed to fully automated learning

  10. Specific Aims: Key ODIE Functionality • Specific Aim 1: Develop and evaluate methods for information extraction (IE) tasks using existing OBO ontologies, including: Named Entity Recognition; Co-reference Resolution; Discourse Reasoning; Attribute Value Extraction • Specific Aim 2: Develop and evaluate general methods for clinical-text mining to assist in ontology development, including: Preprocessing; Concept Discovery and Clustering; Suggest taxonomic positioning and relationships • Specific Aim 3: Develop reusable software for performing information extraction and ontology development leveraging existing NCBO tools and compatible with NCBO architecture. • Specific Aim 4: Enhance National Cancer Institute Thesaurus Ontology using the ODIE toolkit. • Specific Aim 5: Test the ability of the resulting software and ontologies to address important translational research questions in hematologic cancers.

  11. Named Entity Recognition • User has • clinical documents • one or more ontologies • (and/or) one or more lexical resources (synonyms, POS) • (optionally) a reference standard of human annotations • User wants to • determine degree of coverage of the text by different ontologies • determine degree of overlap in annotations generated by different ontologies • (optionally) test accuracy of NER with different ontologies to choose the ‘best’ ontology to annotate text with • tag existing document set with concepts from ontology (optionally using the synonyms from their synonym source if not in ontology) • System produces annotated clinical documents and descriptive statistics
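The NER use case above can be illustrated with a minimal dictionary-lookup sketch. This is not ODIE's implementation: the toy `ONTOLOGY` mapping, the `annotate` and `coverage` helpers, and the choice of token coverage as the descriptive statistic are all assumptions made for illustration.

```python
import re

# Hypothetical toy ontology: concept label -> surface forms (label plus synonyms).
ONTOLOGY = {
    "Invasive Ductal Carcinoma": ["invasive ductal carcinoma",
                                  "infiltrating ductal carcinoma"],
    "Breast": ["breast"],
}

def annotate(text, ontology):
    """Return sorted (start, end, concept) spans found by
    case-insensitive whole-word phrase matching."""
    spans = []
    for concept, terms in ontology.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                spans.append((m.start(), m.end(), concept))
    return sorted(spans)

def coverage(text, spans):
    """Fraction of word tokens that lie inside at least one annotated span."""
    tokens = list(re.finditer(r"\w+", text))
    if not tokens:
        return 0.0
    covered = sum(
        1 for t in tokens
        if any(s <= t.start() and t.end() <= e for s, e, _ in spans)
    )
    return covered / len(tokens)

report = "Breast, Right, Lumpectomy: Infiltrating Ductal Carcinoma"
spans = annotate(report, ONTOLOGY)
concepts = [concept for _, _, concept in spans]
token_coverage = coverage(report, spans)
```

Here the synonym "infiltrating ductal carcinoma" is mapped to the ontology concept, and four of the six word tokens in the report line are covered. Running the same text against several ontologies and comparing `token_coverage` is one simple way to approach the "degree of coverage" question.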

  12. Named Entity Recognition • Clinical Document • Ontology • Lexical Resource • Metathesaurus (synonyms) • SPECIALIST (POS information)

  13. Named Entity Recognition View Annotated Concepts From A Single Ontology

  14. Named Entity Recognition Compare Annotations from Multiple Ontologies
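Comparing annotations from multiple ontologies can be reduced, in its simplest form, to set operations on annotated spans. The sketch below assumes annotations are `(start, end, concept_id)` tuples and uses Jaccard overlap of character spans as the statistic. Both the representation and the concept identifiers are hypothetical; the slides do not say which measure ODIE reports.

```python
def span_overlap(spans_a, spans_b):
    """Compare two annotation sets by the character spans they mark,
    ignoring which concept each ontology assigned to the span."""
    a = {(start, end) for start, end, _ in spans_a}
    b = {(start, end) for start, end, _ in spans_b}
    union = a | b
    return {
        "shared": len(a & b),
        "only_a": len(a - b),
        "only_b": len(b - a),
        "jaccard": len(a & b) / len(union) if union else 0.0,
    }

# Hypothetical annotations of the same report line from two ontologies.
from_ontology_a = [(0, 6, "A:Breast"), (27, 56, "A:InvasiveDuctalCarcinoma")]
from_ontology_b = [(0, 6, "B:Breast"), (15, 25, "B:Lumpectomy")]

stats = span_overlap(from_ontology_a, from_ontology_b)
```

Exact-span matching is the strictest choice. A more forgiving variant would count partially overlapping spans as shared.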

  15. Co-reference Resolution • User has • clinical documents with NER annotations • one or more ontologies • (optionally) a reference standard of co-reference annotations • User wants to • visualize co-references detected using one or more ontologies • (optionally) test accuracy of CR with different ontologies to choose ontology for annotations • tag existing document set with co-references from ontology • System produces annotated clinical documents and descriptive statistics
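Testing accuracy against a reference standard, for either NER or CR, usually comes down to precision, recall, and F1. The sketch below scores co-reference output as a set of mention-pair links. This pairwise scoring is an assumption chosen for brevity; the slides do not name an evaluation metric, and the mention identifiers are invented.

```python
def precision_recall_f1(system, reference):
    """Score system output against a reference standard.
    Both arguments are iterables of hashable items, e.g. (mention, mention) links."""
    system, reference = set(system), set(reference)
    true_positives = len(system & reference)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical mention ids; each pair says "these two mentions co-refer".
reference_links = {(1, 2), (2, 3), (4, 5)}
system_links = {(1, 2), (4, 5), (5, 6), (7, 8)}

p, r, f1 = precision_recall_f1(system_links, reference_links)
```

The same function works for NER if the items are `(start, end, concept)` spans instead of links, so one helper can produce the descriptive statistics for both tasks.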

  16. Co-reference Resolution

  17. Discourse Reasoning • User has • a set of clinical documents with NER and CR annotations • a set of information models about those documents • User wants to • determine which information model (or parts of them) should be used for which clinical document

  18. Discourse Reasoning BRAIN, RIGHT PARIETAL, STEREOTACTIC BIOPSY: Mucinous Adenocarcinoma, consistent with previous history of colon primary BRAIN Site Morphology COLON Location Grade Size TNM Stage

  19. Attribute Value Extraction • User has • clinical documents with NER, CR, DR annotations • information model of a specific subset of documents • User wants to • extract attributes and values from clinical text conforming to the model • analyze data using common tools • possibly search later for particular cases

  20. Attribute Value Extraction • Histologic Type • Clark’s Level • Breslow Depth • Mitoses • Ulcer • Perineural Invasion • Angiolymphatic Invasion • Regression

  21. Attribute Value Extraction • Histologic Type – Superficial Spreading • Clark’s Level – IV • Breslow Depth – 1.75 mm • Mitoses – Greater than 2 per HLP • Ulcer – None • Perineural Invasion – None • Angiolymphatic Invasion – None • Regression – None
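The two slides above show an information model (a list of attributes) and the filled-in values. A minimal sketch of that mapping, assuming the report states each attribute on its own line as "name, separator, value", is shown below. The `MODEL` subset, the sample `report`, and the `extract` helper are illustrative assumptions; real clinical text is far less regular and would need the NER, CR, and DR annotations listed in the use case.

```python
import re

# Hypothetical information model: a subset of the attributes named on the slide
# (written with a straight apostrophe for simplicity).
MODEL = ["Histologic Type", "Clark's Level", "Breslow Depth", "Mitoses", "Ulcer"]

def extract(text, attributes):
    """Return {attribute: value or None}, reading lines of the form
    'Attribute: value', 'Attribute - value' or 'Attribute – value'."""
    values = {}
    for attr in attributes:
        pattern = r"^\s*" + re.escape(attr) + r"\s*[:\-–]\s*(.+?)\s*$"
        m = re.search(pattern, text, flags=re.IGNORECASE | re.MULTILINE)
        values[attr] = m.group(1) if m else None
    return values

# Invented sample text in a synoptic layout.
report = """Histologic type: Superficial Spreading
Clark's level: IV
Breslow depth – 1.75 mm
Ulcer: None
"""

record = extract(report, MODEL)
```

Attributes the report does not mention come back as `None` (here `Mitoses`), which keeps "not stated" distinct from an explicit value such as the string `"None"` for `Ulcer`. The resulting dictionary is the kind of row that can be loaded into common analysis tools.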

  22. Ontology Enrichment • User has • clinical documents • an ontology • User wants to identify potential candidate concepts from the documents to include in the ontology • Visualized in a manner to ease search and recognition of presence or absence of those concepts in the ontology • Suggestions for where in taxonomy the concept should be placed • Suggestions for relationships

  23. Ontology Enrichment • Report text: Breast, Left, Excisional Biopsy: Mucinous Carcinoma; Breast, Right, Lumpectomy: Infiltrating Ductal Carcinoma; Breast, Left: Invasive Ductal Carcinoma; Breast, Left, Excisional Biopsy: Malignant Phylloides Tumor; Tumor shows osseous and lipomatous metaplasia • Ontology terms: Disease or Disorder, Breast Disorder, Breast Neoplasm, Malignant Breast Neoplasm, Breast Carcinoma, Ductal Breast Carcinoma, Invasive Ductal Carcinoma

  24. Concept Discovery • Report text: Breast, Left, Excisional Biopsy: Mucinous Carcinoma; Breast, Right, Lumpectomy: Infiltrating Ductal Carcinoma; Breast, Left: Invasive Ductal Carcinoma; Breast, Left, Excisional Biopsy: Malignant Phylloides Tumor; Tumor shows osseous and lipomatous metaplasia • Ontology terms: Disease or Disorder, Breast Disorder, Breast Neoplasm, Malignant Breast Neoplasm, Breast Carcinoma, Ductal Breast Carcinoma, Invasive Ductal Carcinoma

  25. Taxonomic Positioning • Report text: Breast, Left, Excisional Biopsy: Mucinous Carcinoma; Breast, Right, Lumpectomy: Infiltrating Ductal Carcinoma; Breast, Left: Invasive Ductal Carcinoma; Breast, Left, Excisional Biopsy: Malignant Phylloides Tumor; Tumor shows osseous and lipomatous metaplasia • Ontology terms: Disease or Disorder, Breast Disorder, Breast Neoplasm, Malignant Breast Neoplasm, Breast Carcinoma, Ductal Breast Carcinoma, Invasive Ductal Carcinoma, Mucinous Carcinoma, Malignant Phylloides Tumor
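One simple way to suggest a taxonomic position for a discovered concept is a lexical head-word heuristic: find existing concepts whose last word matches the candidate's last word and whose modifiers are all supported by the candidate or its context, then propose the deepest one as parent. The slides do not describe ODIE's method, so this is only an illustrative assumption. The `PARENT` chain below also assumes that the terms listed on the slide form a single parent-child chain in the order shown.

```python
# Assumed hierarchy: child -> parent, following the order of terms on the slide.
PARENT = {
    "Disease or Disorder": None,
    "Breast Disorder": "Disease or Disorder",
    "Breast Neoplasm": "Breast Disorder",
    "Malignant Breast Neoplasm": "Breast Neoplasm",
    "Breast Carcinoma": "Malignant Breast Neoplasm",
    "Ductal Breast Carcinoma": "Breast Carcinoma",
    "Invasive Ductal Carcinoma": "Ductal Breast Carcinoma",
}

def depth(concept, parent):
    """Number of ancestors between the concept and the root."""
    d = 0
    while parent[concept] is not None:
        concept = parent[concept]
        d += 1
    return d

def suggest_parent(candidate, context_words, parent):
    """Suggest the deepest existing concept that shares the candidate's head
    word and has no modifier absent from the candidate or its context."""
    words = candidate.lower().split()
    head = words[-1]
    allowed = set(words[:-1]) | {w.lower() for w in context_words}
    best = None
    for concept in parent:
        if concept.lower() == candidate.lower():
            continue  # never suggest a concept as its own parent
        c_words = concept.lower().split()
        if c_words[-1] == head and set(c_words[:-1]) <= allowed:
            if best is None or depth(concept, parent) > depth(best, parent):
                best = concept
    return best

# "Breast" is the specimen site taken from the report line.
mucinous = suggest_parent("Mucinous Carcinoma", ["Breast"], PARENT)
phylloides = suggest_parent("Malignant Phylloides Tumor", ["Breast"], PARENT)
```

With this toy data the heuristic proposes "Breast Carcinoma" for "Mucinous Carcinoma" and returns `None` for "Malignant Phylloides Tumor", because no existing label ends in "Tumor". Cases like the second are where the human-machine collaboration described in the Philosophy slide would take over.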

  26. Relationships • Report text: Breast, Left, Excisional Biopsy: Mucinous Carcinoma; Breast, Right, Lumpectomy: Infiltrating Ductal Carcinoma; Breast, Left: Invasive Ductal Carcinoma; Breast, Left, Excisional Biopsy: Malignant Phylloides Tumor; Tumor shows osseous and lipomatous metaplasia • Ontology terms: Disease or Disorder, Breast Disorder, Breast Neoplasm, Malignant Breast Neoplasm, Breast Carcinoma, Ductal Breast Carcinoma, Invasive Ductal Carcinoma, Mucinous Carcinoma, Malignant Phylloides Tumor • Morphologic Finding, Metaplasia, Osseous metaplasia, Lipomatous metaplasia, Cartilaginous metaplasia • Relationship: has-Finding

  27. First Steps • Use cases • Survey of Bioportal, LexBio, GATE and UIMA • Survey of ontology enrichment techniques • Architectural assumptions and notional architecture • Started discussions with Stanford and Mayo • Delineated first year work • Annotation software and document sets

  28. Architecture Decisions • The primary goal of ODIE is to serve as a workbench for building and refining text processing pipelines and ontologies. • Information retrieval is not a primary goal. However, ODIE may have a rudimentary search feature for annotated document collections. • ODIE Toolkit will be a desktop application. • ODIE UI will be based on the Eclipse Rich Client Platform. • ODIE will use UIMA as the Language Engineering Platform. GATE processing resources will be usable in ODIE by wrapping them in UIMA TAEs. • UIMA is highly configurable using XML descriptor files. • UIMA has better documentation and community support. • We will use GATE in the first year for rapid prototyping and manual annotation. • ODIE will have the ability to easily import and use UIMA TAEs developed by others. This may be expanded to GATE processing resources. • ODIE will allow for packaging a pipeline for deployment in a production environment.
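The "text processing pipeline" idea behind these decisions can be sketched as a chain of annotators that each read and extend a shared document object. The sketch below is plain Python and does not use or imitate the UIMA or GATE APIs. `Document`, `token_annotator`, `make_concept_annotator`, and `run_pipeline` are hypothetical names introduced only to show the pattern.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Document:
    """Text plus a shared list of (start, end, label) annotations."""
    text: str
    annotations: list = field(default_factory=list)

def token_annotator(doc):
    """Add one 'Token' annotation per word."""
    for m in re.finditer(r"\w+", doc.text):
        doc.annotations.append((m.start(), m.end(), "Token"))

def make_concept_annotator(terms):
    """Build an annotator that labels tokens found in a term dictionary.
    It depends on Token annotations produced by an earlier pipeline stage."""
    def annotate(doc):
        tokens = [a for a in doc.annotations if a[2] == "Token"]
        for start, end, _ in tokens:
            word = doc.text[start:end].lower()
            if word in terms:
                doc.annotations.append((start, end, "Concept:" + terms[word]))
    return annotate

def run_pipeline(doc, annotators):
    """Apply each annotator in order to the same document."""
    for annotator in annotators:
        annotator(doc)
    return doc

pipeline = [
    token_annotator,
    make_concept_annotator({"brain": "Brain", "adenocarcinoma": "Adenocarcinoma"}),
]
doc = run_pipeline(Document("Brain biopsy shows adenocarcinoma"), pipeline)
concept_labels = [a[2] for a in doc.annotations if a[2].startswith("Concept:")]
```

Because each stage only touches the shared annotation list, stages can be swapped, reordered, or wrapped around third-party components. That is the property that makes a workbench for "building and refining" pipelines practical, and it is why packaging a finished pipeline for deployment amounts to fixing the list of stages.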

  29. Notional Architecture

  30. Synergies: Ontrez • Ontrez, ODIE • Information Retrieval • Range of inputs • Annotation • Named Entity Recognition • Enhance annotation of Ontrez? • Use inference and indexing on clinical documents? • Other kinds of annotation • Information Extraction • Ontology Enrichment • Clinical Documents

  31. Synergies: Mayo • NER and Co-reference resolution • Clustering, discovery of synonyms • LexGrid • Using similar tools, focused on larger range of document types • More – to be explored

  32. First Year Work • NER and co-reference modules • Concept discovery • Develop manually annotated reference standards for NER and CR • Focus on testing and developing algorithms • ODIE 1.0 will include basic architecture and modules for NER, CR and concept discovery, statistics

  33. Working Together • Work with Mayo to scope first year collaboration (NER, CR, synonym discovery) • Decisions regarding terminology access • Better define what NCBO resources we will use

  34. Working Together • SourceForge site, ODIE website and Wiki • All our meetings are open and we are happy to arrange teleconferences • Mondays 2-4 pm (EST) • Schedule visits with Mayo and Stanford for early spring ’08 • Anticipate providing monthly progress updates at the ODIE website starting in January ’08 • Other ideas? What’s the expectation of the Council?

  35. Questions? Comments?
