Citation-based Extraction of Core Contents from Biomedical Articles

Rey-Long Liu, Dept. of Medical Informatics, Tzu Chi University, Taiwan

Outline
• Background
• Problem definition
• The proposed technique: CoreCE
• Empirical evaluation
• Conclusion

Background

Core Contents of Biomedical Articles
• Core contents of a scholarly article a are the textual contents about
  • Research goal of a
  • Research background of a
  • Research conclusion of a

Why Extraction of the Core Contents?
• Indexing of the articles
• Mining & analysis of highly related evidence
• Keyword-based search of the articles
  • Search engines often work by keyword input
• But the extraction is challenging
  • Core content of an article a may be expressed in different ways and scattered in a

[Figure: example for the gene-disease pair <erythropoietin, anemia>. Articles selected by biomedical experts for the pair are highly related to each other; articles recommended by PubMed are not highly related to the pair.]

Problem Definition

Goal & Contribution
• Goal
  • Given a scholarly article a, extract the core content of a
• Contribution
  • Developing a technique, CoreCE (Core Content Extractor), that extracts the core content based on how the article cites references (citation-based extraction)

Related Work
• Extraction of citation links
  • In-link citations (how article a is cited by others)
  • Out-link citations (how article a cites others)
  → Cannot support keyword-based retrieval
• Extraction of textual contents
  • Certain important parts (e.g., titles and abstracts)
  • Certain terms with higher weights (e.g., TFIDF weight)
  → But the core content of an article a may be expressed in different ways and scattered in a

The Proposed Technique: CoreCE

Basic Definitions

Interesting Ideas of CoreCE
• Core content of article a is extracted from
  • Title and abstract of a, AND
  • Titles of the references cited by a
• Term frequency of a term t is amplified if
  • t appears in citation passages of the references cited by a
• The core content is represented by plain text
  • Applicable to keyword-based indexing & retrieval
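The ideas above can be sketched roughly as follows. This is a minimal Python illustration, not the CoreCE implementation: the tokenizer, the function names, and the amplification factor `boost` are all assumptions, since the slides do not state how strongly the term frequency is amplified.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def amplified_term_counts(title, abstract, reference_titles,
                          citation_passages, boost=2):
    """Count terms from the title, abstract, and reference titles, then
    amplify terms that also occur in citation passages.

    `boost` is a hypothetical multiplier chosen for illustration only.
    """
    counts = Counter(tokenize(title) + tokenize(abstract))
    for ref_title in reference_titles:
        counts.update(tokenize(ref_title))

    # Terms that occur in any passage where the article cites a reference.
    passage_terms = set()
    for passage in citation_passages:
        passage_terms.update(tokenize(passage))

    for term in list(counts):
        if term in passage_terms:
            counts[term] *= boost
    return counts


def to_plain_text(counts):
    # Represent the result as plain text by repeating each term according
    # to its (amplified) frequency, so a keyword indexer can consume it.
    return " ".join(term
                    for term, n in sorted(counts.items())
                    for _ in range(n))
```

Emitting plain text rather than a weighted vector mirrors the slide's point that the output should remain usable by ordinary keyword-based indexing and retrieval.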

Empirical Evaluation

The data
• Two sets of articles
  • Highly related biomedical articles:
    • For each gene-disease pair <g,d>, collect the biomedical articles that biomedical experts selected to annotate the pair (noted by DisGeNET)
  • Near-miss biomedical articles (non-highly related articles):
    • For each gene-disease pair <g,d>, collect articles using two queries: “g NOT d” and “d NOT g”
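As a small illustration of the near-miss queries, the two query strings for a pair could be built as below. The function name is hypothetical, and which search engine received the queries is not stated on this slide.

```python
def near_miss_queries(gene, disease):
    """Build the two near-miss queries for a gene-disease pair <g,d>.

    Each query matches articles that mention one member of the pair but
    not the other. The function name is an assumption for illustration.
    """
    return [f"{gene} NOT {disease}", f"{disease} NOT {gene}"]


# Example with the pair used earlier in the slides.
queries = near_miss_queries("erythropoietin", "anemia")
```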

Data statistics
• 53 gene-disease pairs
• 9,876 articles, including
  • 53 targets + 9,823 candidates
• 435,786 out-link references

The Systems to Be Evaluated
(1) Title Only
(2) Abstract Only
(3) Title+Abstract
(4) Title+Abstract+ReferenceTitles
(5) Whole Article (including the main body)
(6) CoreCE

The Underlying Inter-Article Similarity Measure • One of the state-of-the-art measures:

Evaluation Criterion
• MAP (Mean Average Precision)
  • If a system ranks the articles that are highly related to the target article r higher, the average precision (AvgP) for the gene-disease pair will be higher
  • MAP is the average of the AvgP values over all gene-disease pairs
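A minimal sketch of MAP, assuming the standard definition of average precision (the mean of the precision values at each rank where a highly related article appears). The slides do not spell out the exact variant used, and the function names are illustrative.

```python
def average_precision(ranked_ids, relevant_ids):
    """AvgP for one gene-disease pair.

    ranked_ids: candidate article ids, best-ranked first.
    relevant_ids: ids of the highly related articles for the pair.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, article_id in enumerate(ranked_ids, start=1):
        if article_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def mean_average_precision(pairs):
    """MAP: the average of AvgP over all gene-disease pairs.

    pairs: list of (ranked_ids, relevant_ids) tuples, one per pair.
    """
    if not pairs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in pairs) / len(pairs)
```

Because each hit contributes the precision at its own rank, moving a highly related article up the ranking raises AvgP, which is the behaviour the criterion is meant to reward.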

Average P@X
• If the articles that are highly related to the target article r are ranked in the top-X positions, P@X for the gene-disease pair will be higher
• Average P@X is the average of the P@X values over all gene-disease pairs
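P@X can be sketched in the same way, assuming the usual definition (the fraction of the top-X ranked articles that are highly related). The function names are illustrative, not taken from the slides.

```python
def precision_at(ranked_ids, relevant_ids, x):
    """P@X for one gene-disease pair: share of the top-X ranked
    articles that are highly related."""
    if x <= 0:
        raise ValueError("x must be a positive integer")
    relevant = set(relevant_ids)
    top = ranked_ids[:x]
    return sum(1 for article_id in top if article_id in relevant) / x


def average_precision_at(pairs, x):
    """Average P@X over all gene-disease pairs.

    pairs: list of (ranked_ids, relevant_ids) tuples, one per pair.
    """
    if not pairs:
        return 0.0
    return sum(precision_at(r, rel, x) for r, rel in pairs) / len(pairs)
```

With X = 1, P@X for a pair is either 0 or 1, so the average over pairs is simply the fraction of pairs whose top-ranked article is highly related.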

Result
• With the core contents extracted by CoreCE, the system performs significantly better in ranking highly related articles

CoreCE helps to rank highly related articles at top positions (top-1 and top-3) for a higher percentage of the tests

CoreCE performs better when the size is set to 5; however, the performance differences are not statistically significant

Conclusion

• Core content of a scholarly article a is
  • The fundamental basis for the indexing, retrieval, and analysis of scientific literature, BUT
  • Scattered in a and expressed with different terms
• We develop CoreCE that
  • Extracts the core content based on titles and citation passages of the references cited by a
• The idea of CoreCE can be
  • Incorporated as a front-end processor for search engines to properly index scholarly articles
