
Application of the NLP techniques to IE and IR





Presentation Transcript


Application of the NLP techniques to IE and IR

CREST

Language Processing Group


Outline

  • Background

  • Building NLP resources

    • GENIA

  • Extracting Disease-Gene Associations from MEDLINE

    • H-invitational

    • Extracting DGAs by machine learning

  • An IR system for predicate-argument relations

    • MEDUSA


Application to the biomedical domain

  • Plenty of text

    • MEDLINE database: 12 million abstracts

    • Need for effective IE and IR

  • Domain knowledge

    • Gene ontology, KEGG, UMLS, ICD, …

  • Other information sources

    • A variety of molecular databases

      • DNA sequences, motifs, diseases, molecular interactions, etc…


Developing NLP resources

  • Resources for NLP research

    • Domain knowledge

    • Training data for ML-based techniques

    • Test data for evaluating the transferability of a system

  • We are now developing…

    • GENIA

      • Ontology

      • Corpus


GENIA corpus

  • 4,000 MEDLINE abstracts

    • Selected by MeSH Terms (Human, Blood cells, Transcription factors)

  • XML format

  • Contents

    • Named entities (Kim et al., 2003)

    • Part of speech (Tateisi et al., 2004)

    • Parse trees

    • Co-reference (Institute for Infocomm Research, Singapore)


GENIA named-entity corpus

(Annotation labels shown on the slide's example sentence: DNA, virus, cell_type)

  • Terms are annotated based on the semantic classes in the GENIA ontology

  • Size

    • 2,000 abstracts

    • Number of terms: 92,723

    • Vocabulary size: 36,568

The peri-kappa B site mediates human immunodeficiency virus type 2 enhancer activation in monocytes …


GENIA part-of-speech corpus

  • Each token is annotated with its part-of-speech tag.

  • Size

    • 2,000 abstracts

    • 20,544 sentences

    • 501,054 words (about half the size of the Penn Treebank)

The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN activation/NN in/IN monocytes/NNS …


GENIA treebank

(Parse-tree figure for the example sentence at the end of this slide; node labels: S, VP, VP, PP, NP, NP, ADJP)

  • Based on the standard of the Penn TreeBank

  • Size

    • 200 abstracts

    • (1,500 abstracts by the end of this fiscal year)

CD3-epsilon expression is controlled by a downstream T lymphocyte-specific enhancer element


GENIA corpus

  • Used in more than 240 institutions

    • Japan (28), Asia (54), North America (63), Europe (62), etc…

  • De facto standard for evaluating biomedical named-entity recognition systems

    • BioNLP workshop at Coling 2004

      • Named-entity recognition shared task

        • Institute for Infocomm Research (Singapore),

        • Stanford University (USA),

        • University of Edinburgh (UK),

        • University of Wisconsin-Madison (USA),

        • Pohang University of Science and Technology (Korea),

        • University of Alberta (Canada),

        • University of Duisburg-Essen (Germany),

        • Korea University (Korea),

        • National Taiwan University (Taiwan),


Outline

  • Background

  • Building NLP resources

    • GENIA

  • Extracting Disease-Gene Associations from MEDLINE

    • H-invitational

    • Extracting DGAs by machine learning

  • An IR system for predicate-argument relations

    • MEDUSA


H-Invitational Disease Edition

(Workflow diagram; Disease group, JBIRC, June 25, 2004. Labels: Literature (PubMed); Specific disease; Select specific disease; Known disease gene; List of genes; Genomic region of interest (GROI); Dictionary; Scoring system (PANDA); Text-mining; H-InvDB; Other DB; Genes with high score; Synthetic analysis: SNPs (public, private) AND/OR gene expression (public, private); Final result)


Disease-Gene Associations extracted from MEDLINE

  • DGA explorer (demo)


Text

  • 1.5 million MEDLINE abstracts

    • Selected by MeSH Terms

      • “Disease Category” AND (“Amino Acids, Peptides, and Proteins” OR “Genetic Structures”)

  • Parsing

    • All the sentences were parsed by the HPSG parser

    • Using a PC cluster (100 processors with GXP)

    • Time: 10 days
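As a back-of-the-envelope reading of these figures, the per-processor throughput can be estimated. This is only a sketch: it assumes (the slide does not say so) that the work was spread evenly over all 100 processors running around the clock, and the function name is introduced here for illustration.

```python
# Figures taken from the slide.
ABSTRACTS = 1_500_000
PROCESSORS = 100
DAYS = 10

# Assumption (not from the slide): perfect load balance, 24-hour utilisation.
SECONDS_PER_DAY = 24 * 60 * 60


def per_processor_throughput(abstracts, processors, days):
    """Return (abstracts per processor per day, processor-seconds per abstract)."""
    per_proc_per_day = abstracts / (processors * days)
    seconds_per_abstract = SECONDS_PER_DAY / per_proc_per_day
    return per_proc_per_day, seconds_per_abstract


rate, seconds = per_processor_throughput(ABSTRACTS, PROCESSORS, DAYS)
print(f"{rate:.0f} abstracts per processor per day")
print(f"{seconds:.1f} processor-seconds per abstract")
```

Under those assumptions each processor handles about 1,500 abstracts a day, or a little under a minute per abstract.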


Disease-Gene Associations in texts

  • These results suggested that targeted disruption of Cyp19 caused anovulation and precocious depletion of ovarian follicles.

  • Furthermore, AML cells with methylated p15(INK4B) tended to express higher levels of DNMT1 and 3B.


Training data

  • Every gene-disease co-occurrence is classified as “relevant” or “irrelevant” by a domain expert.

All foals with OLWS were homozygous for the Ile118Lys EDNRB mutation, and adults that were homozygous were not found.

Dominant radial drusen and Arg345Trp EFEMP1 mutation.

The 5 year overall survival (OS) and event-free survival (EFS) were 94 and 90 +/- 8%, respectively, with a median follow-up of 48 months.

These data may indicate that formation of parathyroid adenoma in young patients is related to a mechanism involving EGFR.


Maximum entropy learning

  • Log-linear model

  • Features

    • Bag-of-words

    • Local context

    • Gene/disease name

    • Predicate-argument structures

    • …

(Equation figure; callouts: feature function, weight)
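The log-linear form named on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses only bag-of-words features, and the weights are hypothetical values chosen by hand rather than estimated from the annotated data. The function names are likewise introduced here for illustration.

```python
import math

LABELS = ("relevant", "irrelevant")

# Hypothetical weights for illustration only; in practice they are
# estimated from the expert-annotated training sentences.
WEIGHTS = {
    "word=mutation&label=relevant": 1.2,
    "word=survival&label=irrelevant": 0.8,
}


def extract_features(sentence, label):
    """Binary bag-of-words feature functions f_i(x, y), one per (word, label)."""
    return {f"word={w.lower()}&label={label}" for w in sentence.split()}


def maxent_probs(sentence, weights=WEIGHTS, labels=LABELS):
    """p(y|x) = exp(sum_i w_i * f_i(x, y)) / Z(x) for each label y."""
    scores = {
        y: sum(weights.get(f, 0.0) for f in extract_features(sentence, y))
        for y in labels
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())      # normalisation term Z(x)
    return {y: e / z for y, e in exps.items()}


probs = maxent_probs("Dominant radial drusen and Arg345Trp EFEMP1 mutation")
print(probs)
```

Each feature function fires for one (observation, label) pair, and its weight raises or lowers the score of that label. The other feature types listed on the slide would enter the same sum as additional feature names.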


Features of predicate-argument structures (1)

(Diagram labels: ARG2, gene/disease, X)

  • Dedifferentiation of adenoid cystic carcinoma: report of a case implicating p53 gene mutation.


Features of predicate-argument structures (2)

  • These results suggested that targeted disruption of Cyp19 caused anovulation and precocious depletion of ovarian follicles.

  • Furthermore, AML cells with methylated p15(INK4B) tended to express higher levels of DNMT1 and 3B.

(Diagram labels: ARG1, ARG2, disease/gene, gene/disease, X)
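One way such features might be derived is sketched below. Everything here is an assumption made for illustration: the `Relation` tuple is a simplified stand-in for parser output (not the actual HPSG parser format), entity mentions are located by plain substring matching, and the feature-name scheme is hypothetical. Treating "anovulation" as the disease mention is also an assumption.

```python
from typing import List, NamedTuple, Optional


class Relation(NamedTuple):
    """One predicate with its arguments (hypothetical, simplified format)."""
    predicate: str
    arg1: Optional[str]
    arg2: Optional[str]


def pas_features(relations: List[Relation], gene: str, disease: str) -> List[str]:
    """Emit feature names describing which argument slots hold the entities."""
    feats = []
    for r in relations:
        # Single-slot features: predicate X with the gene/disease in one slot.
        for slot, arg in (("ARG1", r.arg1), ("ARG2", r.arg2)):
            if arg is None:
                continue
            if gene in arg:
                feats.append(f"{r.predicate}:{slot}=gene")
            if disease in arg:
                feats.append(f"{r.predicate}:{slot}=disease")
        # Two-slot features: predicate X directly links the two entities.
        if r.arg1 and r.arg2:
            if gene in r.arg1 and disease in r.arg2:
                feats.append(f"{r.predicate}:ARG1=gene,ARG2=disease")
            elif disease in r.arg1 and gene in r.arg2:
                feats.append(f"{r.predicate}:ARG1=disease,ARG2=gene")
    return feats


relations = [
    Relation(
        "cause",
        "targeted disruption of Cyp19",
        "anovulation and precocious depletion of ovarian follicles",
    )
]
feats = pas_features(relations, gene="Cyp19", disease="anovulation")
print(feats)
```

Features of this kind abstract away from word order, so a passive or relativised variant of the same sentence would yield the same predicate and argument slots.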


Extraction accuracy

  • Training/test data: 2,253 sentences

  • 10-fold cross validation
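The evaluation protocol can be sketched as below. The slide gives only the data size and the number of folds; shuffling with a fixed seed and assigning sentences to folds round-robin are assumptions, as is the helper name `k_fold_indices`.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # assumption: random assignment
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 2,253 annotated sentences, 10 folds (figures from the slide).
splits = list(k_fold_indices(2253, 10))
fold_sizes = [len(test) for _, test in splits]
print(fold_sizes)
```

Because 2,253 is not divisible by 10, three folds hold 226 sentences and seven hold 225. Each sentence is used for testing exactly once and for training in the other nine rounds.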


Outline

  • Background

  • Building NLP resources

    • GENIA

  • Extracting Disease-Gene Associations from MEDLINE

    • H-invitational

    • Extracting DGAs by machine learning

  • An IR system for predicate-argument relations

    • MEDUSA


MEDUSA: An IR system for predicate-argument structures

  • Ex.

    • Search for sentences in which the subject of the verb "activate" is a protein.

  • Simple: Since the PHO2 Asp-230 mutant mimics Ser-230-phosphorylated PHO2, we postulate that only phosphorylated PHO2 protein could activate the transcription of PHO5 gene.

  • With a relative pronoun: Transcription initiation by the sigma(54)-RNA polymerase holoenzyme requires an enhancer-binding protein that is thought to contact sigma(54) to activate transcription.

  • Coordination: Full-length Staufen protein lacking this insertion is able to associate with oskar mRNA and activate its translation, but fails to localize the RNA to the posterior.
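The idea behind this kind of query can be sketched as below. This does not reproduce MEDUSA's actual index or query language. The `Relation` records are written by hand to stand in for parser output on the three example sentences, and the field names and the `search` function are hypothetical.

```python
from typing import List, NamedTuple


class Relation(NamedTuple):
    """A verb and its subject as a parser might normalise them (hypothetical)."""
    verb: str          # base form of the predicate
    subject: str       # head word of the subject
    subject_type: str  # semantic class of the subject
    sentence_id: int


# Hand-written relations standing in for parser output on the three examples.
INDEX = [
    Relation("activate", "protein", "protein", 0),    # simple
    Relation("require", "initiation", "other", 1),
    Relation("activate", "protein", "protein", 1),    # via relative pronoun
    Relation("associate", "protein", "protein", 2),
    Relation("activate", "protein", "protein", 2),    # via coordination
]


def search(index: List[Relation], verb: str, subject_type: str) -> List[int]:
    """Return ids of sentences where `verb` has a subject of `subject_type`."""
    return sorted(
        {r.sentence_id for r in index
         if r.verb == verb and r.subject_type == subject_type}
    )


hits = search(INDEX, verb="activate", subject_type="protein")
print(hits)
```

The point of indexing relations rather than word sequences is that all three sentences match the same query, even though the subject and the verb are far apart on the surface in the second and third.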


MEDUSA demonstration

  • 100,000 MEDLINE abstracts

  • Parsed by Enju

  • Genes and diseases are annotated by using the UMLS dictionary


Summary

  • GENIA corpus

    • Parts of speech, Named-entities, Parse trees

  • Extracting gene-disease associations from MEDLINE

    • Machine learning with HPSG parse results

  • An IR system for predicate-argument structures

    • MEDUSA


Software and resources

  • GENIA

    • Named entity corpus

    • Part-of-speech corpus

    • Parse tree corpus

    • Co-reference (Singapore)

    • Part-of-speech tagger

    • Named entity tagger (soon)

    • HPSG parse results (100,000 MEDLINE abstracts)

  • Enju (HPSG parser)

  • MEDUSA

  • LiLFeS

  • Amis

