
Using Domain Ontologies to Improve Information Retrieval in Scientific Publications


Presentation Transcript


  1. Using Domain Ontologies to Improve Information Retrieval in Scientific Publications
  Engineering Informatics Lab at Stanford

  2. Data

  3. TREC Genomics 2007 Data Set
  • Over 162,000 full-text scientific publications from 49 prominent journals in biomedicine
  • Metadata available through MEDLINE
  • Tasks involve passage, document, and feature retrieval
  • Methodologies are evaluated on their responses to 36 topics (‘queries’)
  • The topics are categorized based on 13 entity types (Proteins, Genes, etc.)

  4. BioPortal
  • BioPortal is an integrated resource for biomedical ontologies
  • Currently indexes over 300 ontologies, including Medical Subject Headings (MeSH) and the Gene Ontology
  • Provides a comprehensive web service, abstracting the formats and APIs of all underlying ontologies (see the lookup sketch below)
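To make the web-service idea concrete, here is a minimal sketch of a term lookup against BioPortal's REST search service. The endpoint, parameter names, and response fields are assumptions based on the public data.bioontology.org API and were not part of the original slides; the service used in this work may have differed.

```python
# Minimal sketch of a BioPortal term lookup (assumed endpoint and fields).
import requests

BIOPORTAL_SEARCH_URL = "https://data.bioontology.org/search"  # assumed endpoint
API_KEY = "YOUR_BIOPORTAL_API_KEY"                            # placeholder key

def lookup_term(term, ontology="MESH"):
    """Search a single ontology in BioPortal and return matching class records."""
    response = requests.get(
        BIOPORTAL_SEARCH_URL,
        params={"q": term, "ontologies": ontology, "apikey": API_KEY},
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("collection", [])
```

For example, lookup_term("tumor", "MESH") would return a list of candidate classes whose labels and synonyms can feed the expansion step shown a few slides later.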

  5. Methodology

  6. How Is Domain Knowledge Integrated?
  (1) Annotating documents prior to indexing (toy sketch below)
  • Response time is fast
  • Not flexible: the entire index has to be updated if a new ontology needs to be added
  • Indexes can grow very large
  (2) Query expansion
  • Response time is slower
  • Very flexible: ontologies can be chosen dynamically
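To contrast the two strategies, here is a toy sketch of option (1): matching ontology surface forms against a document and attaching the resulting concepts as an extra field before indexing, so the cost is paid at index time rather than query time. The field layout and the naive substring matching are illustrative assumptions, not the lab's implementation.

```python
def annotate_document(doc_text, ontology_terms):
    """Option (1): attach matched ontology concepts to a document before indexing.

    ontology_terms: concept label -> set of surface forms (synonyms).
    Returns an index-ready record with an extra 'concepts' field.
    """
    text = doc_text.lower()
    concepts = {
        label
        for label, surface_forms in ontology_terms.items()
        if any(form.lower() in text for form in surface_forms)
    }
    return {"text": doc_text, "concepts": sorted(concepts)}
```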

  7. Query Expansion
  • TREC queries are first manually pre-processed (parsing sketch below):
  “What [TUMOR TYPES] are found in zebrafish?” => “[Tumor][MeSH] AND zebrafish”
  • [Tumor] indicates the term that has to be expanded
  • [MeSH] indicates the ontology that should be used
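A small illustration of how the bracket markup could be parsed. Only the [Term][Ontology] syntax comes from the slide; the parse_query helper and its return format are hypothetical.

```python
import re

# Matches the "[Term][Ontology]" markup used in the pre-processed queries.
EXPANSION_PATTERN = re.compile(r"\[(?P<term>[^\]]+)\]\[(?P<ontology>[^\]]+)\]")

def parse_query(query):
    """Split a pre-processed query into expandable terms and a literal template.

    "[Tumor][MeSH] AND zebrafish" -> ([("Tumor", "MeSH")], "{0} AND zebrafish")
    """
    expansions = []

    def _placeholder(match):
        expansions.append((match.group("term"), match.group("ontology")))
        return "{" + str(len(expansions) - 1) + "}"

    template = EXPANSION_PATTERN.sub(_placeholder, query)
    return expansions, template
```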

  8. Query Expansion
  • The pre-processed query is automatically expanded using BioPortal’s API (expansion sketch below):
  [Tumor][MeSH] => {Tumor, Neoplasm, Carcinoma, Leukemia, …}
  [Figure: fragment of the MeSH hierarchy under Tumor, with branches labeled Melanoma, Adenocarcinoma, Leukemia, and Nerve Sheath Neo…]
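Building on the hypothetical lookup_term sketch above, here is a hedged sketch of turning a marked-up term into an expansion set. The prefLabel and synonym field names are assumptions about the BioPortal response format, and the real system may also have walked subclasses of the matched concept.

```python
def expand_term(term, ontology="MESH", max_classes=20):
    """Return the original term plus labels and synonyms of matching classes."""
    expansion = {term}
    for cls in lookup_term(term, ontology)[:max_classes]:  # from the earlier sketch
        if cls.get("prefLabel"):
            expansion.add(cls["prefLabel"])
        expansion.update(cls.get("synonym", []))  # assumed list-valued field
    return expansion

# expand_term("Tumor", "MESH") might yield {"Tumor", "Neoplasms", "Carcinoma", ...}
```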

  9. Which Domain Knowledge Is Integrated?
  • The use of synonymy results in inconsistent performance (2007 TREC Genomics track)
  • Common reasons include:
    • Relevant terms may not be classified as expected
    • Some relevant terms may not be classified in a particular ontology
    • Incomplete information (such as missing synonyms)
  • Selection of the appropriate domain ontology is important

  10. Enriching Existing Ontologies
  • Existing ontologies must be enriched to complete missing information
  • Multiple ontologies can be used to provide different classifications (e.g., MeSH and NCI; union sketch below)
  [Figure: example classifications in MeSH and in NCI]
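One concrete reading of the multi-ontology point: take the union of expansions from several sources, so a term that is missing or classified differently in one ontology can still be contributed by another. This reuses the hypothetical expand_term sketch; using "NCIT" as the BioPortal acronym for the NCI ontology is also an assumption.

```python
def expand_with_ontologies(term, ontologies=("MESH", "NCIT")):
    """Union of expansions from several ontologies (e.g. MeSH and NCI)."""
    expansion = set()
    for ontology in ontologies:
        expansion |= expand_term(term, ontology)  # from the earlier sketch
    return expansion
```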

  11. Evaluations
  • Baseline
  • With Query Expansion (Suggested Sources)
  • Using Enriched Ontologies
  • Multiple Query Expansions per query

  12. Queries

  13. Baseline
  • Queries are used without modification, e.g.,
    • “What [ANTIBODIES] have been used to detect protein PSD-95?”
    • “What [SIGNS OR SYMPTOMS] of anxiety disorder are related to coronary artery disease?”
  • Document MAP: 0.277 (MAP sketch below)
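For reference, the reported Document MAP figures are mean average precision over the evaluated topics. A minimal, self-contained sketch of the metric (not the official TREC evaluation program):

```python
def average_precision(ranked_docs, relevant):
    """Average precision of one ranked result list against a set of relevant docs."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_docs, relevant_set) pairs, one per topic."""
    return sum(average_precision(docs, rel) for docs, rel in runs) / len(runs)
```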

  14. Query Expansion
  • Queries are formulated in ‘AND’ clauses (query-building sketch below):
  “[Tumor][MeSH] AND zebrafish” => (Tumor, Neoplasm, Carcinoma, Leukemia, …) AND zebrafish
  • Document MAP: 0.347
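A sketch of rendering the parsed query as a boolean query string, reusing the hypothetical parse_query and expand_term helpers from the earlier sketches. Treating the expansion set as OR'd alternatives and emitting Lucene-style syntax are assumptions; the slide only shows the expanded term list.

```python
def build_query(expansions, template):
    """Fill the query template with one OR'd clause per expandable term."""
    clauses = []
    for term, ontology in expansions:
        # Note: mapping display names like "MeSH" to BioPortal acronyms such as
        # "MESH" is omitted in this sketch.
        terms = sorted(expand_term(term, ontology))  # from the earlier sketch
        clauses.append("(" + " OR ".join(f'"{t}"' for t in terms) + ")")
    return template.format(*clauses)

expansions, template = parse_query("[Tumor][MeSH] AND zebrafish")
print(build_query(expansions, template))
# e.g. ("Carcinoma" OR "Leukemia" OR "Neoplasm" OR "Tumor" ...) AND zebrafish
```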

  15. Multiple Query Expansion Terms
  • Expansion can be performed on multiple terms in the query
  • Example: Coronary Artery Disease => {Coronary heart disease, Coronary disease, CAD, …}
  [Tumor][MeSH] AND [zebrafish][MeSH] => (tumor, neoplasm, …) AND (zebrafish, Danio rerio, …)
  • Document MAP: 0.352
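With the same hypothetical helpers, the multi-term case is simply a query containing more than one bracketed term; nothing else in the pipeline changes.

```python
expansions, template = parse_query("[Tumor][MeSH] AND [zebrafish][MeSH]")
print(build_query(expansions, template))
# e.g. ("Neoplasm" OR "Tumor" OR ...) AND ("Danio rerio" OR "Zebrafish" OR ...)
```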

  16. Enriched Ontology
  • Marginal improvement over the basic enhanced models
  • Document MAP: 0.352
  • Why is the improvement only marginal?
    • The framework for enrichment based on synonymy is rigid, i.e., relevant terms that are entirely missing from the ontology are still not included
    • Relevant terms that are classified differently are never included in the search

  17. Visualization
  • Expert knowledge is valuable
  • We extend MINOE, a co-occurrence-based visualization tool originally designed for exploring marine ecosystems (co-occurrence sketch below)
  • Users can browse (or search) documents through ontologies and visualize interactions between concepts
  • SEE DEMO
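MINOE's internals are not described in the slides; as a stand-in, the sketch below shows the kind of concept co-occurrence counting such a visualization could be driven by, given per-document concept annotations. The data layout is illustrative, not MINOE's actual model.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(doc_concepts):
    """Count how often pairs of ontology concepts occur in the same document.

    doc_concepts: iterable of per-document sets of concept labels.
    Returns a Counter keyed by frozenset({concept_a, concept_b}).
    """
    counts = Counter()
    for concepts in doc_concepts:
        for a, b in combinations(sorted(concepts), 2):
            counts[frozenset((a, b))] += 1
    return counts
```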

  18. Summary
  • Search methodologies must be based on semantics in order to tackle terminology inconsistency
  • Domain ontologies provide these semantics
  • Domain ontologies need to be modified (or enriched) in order to fulfill information needs
  • User interaction is important

  19. Future Work
  • Using multiple enriched ontologies may provide the necessary terms
  • MeSH descriptors are provided for every publication during indexing and can potentially improve results
  • Implement the Okapi model for scoring documents (BM25 sketch below)
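For the Okapi item, a minimal sketch of the standard Okapi BM25 document score; the k1 and b values are the usual textbook defaults, not parameters from this work.

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, n_docs,
               k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a bag of query terms.

    doc_tf: term -> frequency in this document; df: term -> document frequency.
    """
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0 or term not in df:
            continue
        idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
        tf_norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * tf_norm
    return score
```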

  20. Backup Slides

  21. Motivation
  • Scientific literature is an important source of information
  • Retrieving relevant information from scientific publications is challenging
  • Domain terminology is used inconsistently in scientific publications
  • Increasing amounts of information amplify the problem
  • Improved methodologies based on semantics are required

  22. Background
  • The Text REtrieval Conference (TREC), organized by NIST, has showcased many successful methods
  • The Genomics track focused on full-text scientific publications from 49 prominent journals
  • Methodologies involved:
    • Use of synonymy from ontologies
    • Language-based models
    • Query expansion and annotations
    • The Okapi scoring model

  23. Goals
  • Understand how domain ontologies can be leveraged
  • Understand which domain ontologies can be leveraged
  • Develop a knowledge-based approach to integrate domain knowledge with the search mechanism
