Extracting biological names and relations from texts

Ting-Yi Sung 宋定懿

Bioinformatics Program, TIGP

Institute of Information Science

Academia Sinica

2004/12/16


Motivation

  • To automatically extract information from natural language text.

    • The need arises from rapid accumulation of biomedical literature.

    • Expedite survey efforts

    • Support the database curation (automatically associate the papers with database records)


Targets of Information Extraction

  • Protein-Protein interaction/binding/inhibition

  • Protein-Small Molecules

  • Gene-Gene regulation

  • Gene-Gene Product interaction

  • Gene-Drug relation

  • Protein-Subcellular location

  • Amino Acid-Protein relation

  • Example relationships between gene and drugs:

    • The gene is the drug target

    • The gene confers resistance to the drug

    • The gene metabolizes the drug


Information Extraction Tasks

  • Identify target named entities

  • Identify relations among named entities

  • Identify relations among events and named entities

  • Associate results with existing database records


Outline

  • NER (named entity recognition) in biomedical domain

  • Challenges in biomedical NER

  • State of progress in NER

  • Abbreviation disambiguation

  • Future works


What is NER?

  • NER

    • Named Entity Recognition

    • Including two tasks

      • Identification of proper names in text

      • Classification of proper names in text

  • Newswire Domain

    • Person, Location, Organization

  • Biomedical Domain

    • Protein, DNA, RNA, Body Part, Cell Type, Lipid, etc.


Example of NER - Biomedical

  • Entity classes labeled in the example (figure not preserved): Protein, Tissue, Disease


NER in biomedical domain

  • BioNER aims to recognize the following names

    • First Priority

      • Protein name, DNA name, RNA name

    • Second Priority

      • cell type, other organic compound, cell line, lipid, multi-cell, virus, cell component, body part, tissue, amino acid monomer, polynucleotide, mono-cell, inorganic, peptide, nucleotide, atom, other artificial source, carbohydrate, organic


The Overall Spectrum

  • BioNER is only the starting point of biological information extraction

  • A whole suite of NLP techniques is needed to handle relations and events in literature mining

  • Techniques developed for BioNER should be adaptable to problems in later stages,

    • e.g. NE relation recognition


Intrinsic Features of BioNER

  • Unknown words

  • Long compound words

  • Variations of expressions

  • Nested NEs


Unknown Words

  • Words containing hyphens, digits, letters, Greek letters, or Roman numerals.

    • Alpha B1

    • Adenylyl cyclase 76E

    • Latent membrane protein 1

    • 4’-mycarosyl isovaleryl-CoA transferase

    • oligodeoxyribonucleotide

    • 18-deoxyaldosterone

  • Abbreviation and Acronym

    • IL, TECd, IFN, TPA
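
The surface clues listed above can be turned into simple per-token features. The following is a minimal sketch, not the feature set used by any system in this talk; the function name `orthographic_features`, the small Greek-letter list, and the Roman-numeral pattern (I to XXXIX only) are assumptions made for illustration.

```python
import re

# Assumed, deliberately small list of spelled-out Greek letters.
GREEK_LETTERS = {"alpha", "beta", "gamma", "delta", "kappa"}
# Assumed pattern covering Roman numerals I..XXXIX only.
ROMAN_NUMERAL = re.compile(r"^(?=[IVX])(X{0,3})(IX|IV|V?I{0,3})$")


def orthographic_features(token: str) -> dict:
    """Return surface-form clues for a single token."""
    return {
        "has_hyphen": "-" in token,
        "has_digit": any(c.isdigit() for c in token),
        "has_letter": any(c.isalpha() for c in token),
        "all_caps": token.isalpha() and token.isupper(),
        "is_greek": token.lower() in GREEK_LETTERS,
        "is_roman": bool(ROMAN_NUMERAL.match(token)),
    }


print(orthographic_features("18-deoxyaldosterone"))
print(orthographic_features("IL"))
```

Note that a token such as "II" is both all-caps and a Roman numeral, so these features are clues for a learner rather than decisions on their own.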


Long Compound words

  • interleukin 1 (IL-1)-responsive kinase

  • interleukin 1-responsive kinase

  • epidermal growth factor receptor

  • SH2 domain containing tyrosine kinase Syk

  • SH2 domain (GENIA example)


Various expressions of the same NE

  • Spelling variation

    • N-acetylcysteine, N-acetyl-cysteine, NAcetylCysteine

  • Word permutation

    • beta-1 integrin, integrin beta-1

  • Ambiguous expressions

    • epidermal growth factor receptor, EGF receptor, EGFR

    • c-jun, c-Jun, c jun
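
One common way to cope with spelling variation and word permutation is to map each name to a normalized key before comparing. The sketch below is only an illustration of that idea; `normalize_name` and `permutation_key` are hypothetical helpers, and dropping all hyphens and case is an assumption that can merge names that are genuinely different.

```python
import re


def normalize_name(name: str) -> str:
    """Lowercase the name and drop spaces, hyphens, underscores and slashes."""
    return re.sub(r"[\s\-_/]+", "", name.lower())


def permutation_key(name: str) -> str:
    """Order-insensitive key: normalize each word, then sort the words."""
    return " ".join(sorted(normalize_name(word) for word in name.split()))


variants = ["N-acetylcysteine", "N-acetyl-cysteine", "NAcetylCysteine"]
print({normalize_name(v) for v in variants})
print(permutation_key("beta-1 integrin") == permutation_key("integrin beta-1"))
```

Abbreviated variants such as "EGF receptor" and "EGFR" are not covered by this kind of normalization and need a separate long form/short form mapping.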


Various expressions: the name explains its function

  • the Ras guanine nucleotide exchange factor Sos

  • the Ras guanine nucleotide releasing protein Sos

  • the Ras exchanger Sos

  • the GDP-GTP exchange factor Sos

  • Sos(mSos), a GDP/GTP exchange protein for Ras


Various expressions: The name includes preposition and/or conjunction (ambiguity of dependencies)

  • p85 alpha subunit of PI 3-kinase

  • SH2 and SH3 domains of Src

  • NF-AT1 , AP-1 , and NF-kB sites

  • E2F1 and -3

  • Residues 432, 435, 437, 438, and 440


Nested Named Entity

  • An NE embedded in another NE.

  • IL-2: protein

  • IL-2 gene: gene

  • CBP/p300 associated factor: protein

  • CBP/p300 associated factor binding promoter: DNA


Outline

  • NER (named entity recognition) in biomedical domain

  • Challenges in biomedical NER

  • State of progress in NER

  • Abbreviation disambiguation

  • Future works


Challenges of NER

  • Unknown word identification

  • Named entity boundary detection

  • Class disambiguation


Challenges

  • Unknown word identification

    • t (10;11) (p13; q14)

    • DNA methyltransferase

    • 73 kDa protein

    • interleukin 1 (IL-1)-responsive kinase (NE may contain an abbreviation within it.)

    • Some unknown words occur very few times in the corpus, which makes them hard to recognize.


Challenges (cont’d)

  • NE boundary detection

    A boundary word can be a regular English word, an unknown word, a Roman numeral, or a digit.

    • MHC Class II

    • latent protein 1 (The left boundary is an adjective)

    • cyclin-like UDG gene product

  • Conjunction (and, or, …)

    • alpha- and beta-globin

    • human and mouse gene


Challenges (cont’d)

  • Classification of abbreviations

    • NF-AT

      • Full name: nuclear factor of activated T cells

      • Class: Protein

    • HTLV-I

      • Full name: Human T cell lymphotropic virus I

      • Class: Virus

    • TCDD

      • Full name: 2,3,7,8-tetrachlorodibenzo-p-dioxin

      • Class: Other Organic

    • GRE

      • Full name: glucocorticoid response element

      • Class: DNA


Outline

  • NER (named entity recognition) in biomedical domain

  • Challenges in biomedical NER

  • State of progress in NER

  • Abbreviation disambiguation

  • Future works


State-of-the-art Systems on NER: Two evaluation contests

  • BioCreative 2004 (March)

    • Critical Assessment of Information Extraction Systems in Biology

    • Task 1: Entity extraction

      • Target: genes (or proteins, where there is ambiguity)

      • 10000 sentences from Medline as training data, and 5000 sentences as testing data

  • BioNLP 2004 (August)

    • GENIA Corpus as training data and 404 abstracts as testing data

    • Target: 5 classes, including protein, DNA, RNA, cell line, and cell type.

  • Both use exact match scoring.
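
Under exact match scoring, a predicted entity counts as correct only if its boundaries and class both agree with the gold annotation. A minimal sketch of precision, recall, and F-measure under that rule follows; the representation of an entity as a `(start, end, class)` tuple and the function name `exact_match_prf` are assumptions, not the official evaluation scripts of either contest.

```python
def exact_match_prf(gold, predicted):
    """Precision, recall and F1 where only identical (start, end, class) tuples match."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = [(0, 2, "protein"), (5, 6, "DNA")]
pred = [(0, 2, "protein"), (5, 7, "DNA")]  # second span is one token too long
print(exact_match_prf(gold, pred))
```

In this example the second prediction overlaps the gold entity but is off by one token, so it is penalized twice: once as a false positive and once as a false negative.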


BioNLP 2004 Datasets


Current Methods

  • Machine Learning

    • HMM, SVM, ME (Maximum Entropy), CRF (Conditional Random Field)

    • Hybrid methods

  • Dictionary Based

    • Approximate String matching algorithm

      • Naming Rules

      • Dynamic Programming
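
The slide does not say which approximate matching algorithm is meant. As one plausible instance of dynamic programming for dictionary lookup, the sketch below uses Levenshtein edit distance; `edit_distance` and `approx_lookup` are hypothetical names, and the distance threshold is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def approx_lookup(name, dictionary, max_dist=1):
    """Return dictionary entries within max_dist edits of name, in dictionary order."""
    return [entry for entry in dictionary if edit_distance(name, entry) <= max_dist]


print(approx_lookup("c-jun", ["c-Jun", "c jun", "c-myb"]))
```

A linear scan like this is only practical for small dictionaries; larger gazetteers would need indexing or candidate filtering first.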


Features for Machine Learning Methods

  • Morphological Features

  • Orthographical Features

  • POS Features

    • Genia POS tagger

  • Semantic Trigger Features

    • Head-noun Features

      • NF-kappaB consensus site

      • IL-2 gene


Morphological Features


Orthographical Features


Head Nouns


Additional features used by Manning’s group: local features

  • Clues within a sentence

  • Include:

    • Previous NEs

    • Abbreviations: an abbr., a long form, neither

    • Parenthesis-matching

    • etc.


External resources used by Manning’s group

  • Motivation

    • Contextual clues do not provide sufficient evidence for confident classification.

  • External resources may be vulnerable to incompleteness, noise, and ambiguity.

  • Web

    • Least vulnerable to incompleteness, highly vulnerable to noise.

    • Prepare patterns for each class

      • For genes: X gene, X antagonist, X mutation

      • For RNA: X mRNA, …

      • For proteins: X ligation, …

    • Features: web-protein, web-RNA, O-web, …

    • Does not work well in the BioNLP task.


External resources (2)

  • Gazetteers (dictionaries)

    • Are arguably subject to all three (incompleteness, noise, and ambiguity), and yet have been used successfully in some systems.

    • Compiled a list of gene names from databases (e.g. Locus Link) and GO, the data from BioCreative Tasks 1A and 1B.

    • Filtering

      • Single character entries, e.g., ‘A’, ‘1’; entries containing only digits or symbols and digits, e.g., ‘37’, ‘3-1’

      • Entries containing only words that can be found in an English dictionary (CELEX), e.g., ‘abnormal’, ‘brain tumor’

    • 1,731,581 entries

  • Larger context
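
The gazetteer filtering rules above can be sketched as a single predicate. This is an illustration under assumptions, not the code used by Manning’s group: `keep_entry` and `filter_gazetteer` are hypothetical names, and `english_words` is a small stand-in set for the CELEX dictionary.

```python
def keep_entry(entry: str, english_words: set) -> bool:
    """Apply the three filters described on the slide to one gazetteer entry."""
    text = entry.strip()
    if len(text) <= 1:                       # single-character entries
        return False
    if not any(c.isalpha() for c in text):   # only digits, or symbols and digits
        return False
    words = text.lower().split()
    if all(word in english_words for word in words):  # only ordinary English words
        return False
    return True


def filter_gazetteer(entries, english_words):
    """Keep the entries that survive all filters, preserving order."""
    return [entry for entry in entries if keep_entry(entry, english_words)]


english = {"abnormal", "brain", "tumor"}  # stand-in for a real English word list
print(filter_gazetteer(["A", "1", "37", "3-1", "abnormal", "brain tumor", "IL-2"], english))
```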


State-of-the-art approaches

  • Machine learning + Post-processing

  • Our method (BioKDD2004)

    • Maximum entropy

    • Post-processing

      • Boundary extension

      • Re-classification


Zhou et al. approach

  • HMM + SVM

  • Post-processing

    • Rule-based: used to resolve nested named entities.

  • Top 1 in the NLPBA Task, F=72.5%


Manning et al. method

  • Machine learning:

    • ME Markov model

    • Local features

    • External resources and larger context

  • Post-processing

    • To correct gene boundaries (mainly for the BioCreative task)

  • Top 1 in BioCreative, F= 83.2%

  • Top 2 in NLPBA Task, F=70.1%


Our Method Overview

  • Training Phase

    • Knowledge input: construct boundary word lists and dictionary

    • Training data → mapping features (using the dictionary and boundary word lists) → ME learning

  • Testing Phase

    • Testing data → ME → post-processing (boundary extension, re-classify) → NEs


Experimental Results:


Post-Processing

  • Nested Named Entity

    • Ex: CIITA mRNA

    • Nested Annotation: <RNA><DNA>CIITA </DNA>mRNA</RNA>

    • ME sometimes only recognizes CIITA as DNA

    • 16.57% of NEs in GENIA 3.02 contain one or more shorter NEs [Zhang, 2003]

  • Post-processing method

    • Boundary Extension

    • Re-classification
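
To make the nested annotation concrete, the sketch below reads an inline-tagged string such as the CIITA mRNA example and lists every entity with its class, outermost first. It assumes the annotation is well-formed XML with a single outer element; `nested_entities` is a hypothetical helper, not part of the GENIA tooling.

```python
import xml.etree.ElementTree as ET


def nested_entities(annotated: str):
    """Return (class, text) for the outer entity and every entity nested inside it."""
    root = ET.fromstring(annotated)
    result = []
    for element in root.iter():  # document order: outer element first
        text = " ".join("".join(element.itertext()).split())
        result.append((element.tag, text))
    return result


print(nested_entities("<RNA><DNA>CIITA </DNA>mRNA</RNA>"))
```

A tagger that outputs only the inner `DNA` span here would miss the outer `RNA` entity, which is the case the post-processing steps are meant to repair.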


Boundary Extension (1)

  • Boundary extension for nested NEs

    • Extend the right boundary (R-boundary) repeatedly if the NE is followed by another NE, a head noun, or an R-boundary word with a valid POS tag.

    • Extend the left boundary repeatedly if the NE is preceded by an L-boundary word with a valid POS tag.


Example

  • ICAM-1 surface protein

    • ME result: ICAM-1/1U surface/unknown protein/unknown (1: protein, U: single)

    • Boundary extension

      • surface: in R-boundary word list, valid POS tag

      • Extension: ICAM-1 surface

      • protein: in R-boundary word list, valid POS tag

      • Extension: ICAM-1 surface protein
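
The extension rule walked through above can be sketched as two loops over the tokens around a recognized NE. This is a minimal illustration, not the authors’ implementation: the word lists, the set of valid POS tags, and the function name `extend_boundaries` are all assumptions, since the slides do not give their contents.

```python
# Hypothetical resources; the real lists are built during the training phase.
R_BOUNDARY_WORDS = {"surface", "protein"}
L_BOUNDARY_WORDS = {"human"}
HEAD_NOUNS = {"gene", "site"}
VALID_POS = {"NN", "NNS", "JJ"}  # assumed set of acceptable POS tags


def extend_boundaries(tokens, pos_tags, start, end, other_nes=None):
    """Extend the NE tokens[start:end] and return the new (start, end).

    other_nes maps the start index of each other recognized NE to its
    exclusive end index.
    """
    other_nes = other_nes or {}
    # Right extension: another NE, a head noun, or an R-boundary word.
    while end < len(tokens):
        if other_nes.get(end, end) > end:
            end = other_nes[end]
        elif tokens[end] in HEAD_NOUNS or (
            tokens[end] in R_BOUNDARY_WORDS and pos_tags[end] in VALID_POS
        ):
            end += 1
        else:
            break
    # Left extension: an L-boundary word with a valid POS tag.
    while (start > 0 and tokens[start - 1] in L_BOUNDARY_WORDS
           and pos_tags[start - 1] in VALID_POS):
        start -= 1
    return start, end


tokens = ["ICAM-1", "surface", "protein", "expression"]
start, end = extend_boundaries(tokens, ["NN", "NN", "NN", "NN"], 0, 1)
print(" ".join(tokens[start:end]))
```

With these assumed lists, the NE grows from "ICAM-1" to "ICAM-1 surface protein" and stops at "expression", which is in none of the lists.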


Boundary extension (2)

  • Boundary extension for NEs containing brackets or slashes

    • NE := NE + ( + NE + ) + {NE or head noun or R-boundary word with valid POS tag}

    • NE := NE + / + NE ( + / + NE ) + { NE or head noun or R-boundary word with valid POS tag}

  • Example

    • granulocyte-macrophage colony-stimulating factor ( GM-CSF ) gene

    • ME result: granulocyte-macrophage colony-stimulating factor, GM-CSF

    • Extension: granulocyte-macrophage colony-stimulating factor ( GM-CSF ) gene


Re-classification

  • Use dictionary lookup

  • Use R-boundary word

    • CIITA mRNA: RNA class

    • granulocyte-macrophage colony-stimulating factor ( GM-CSF ) gene: DNA class
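
A minimal sketch of the two re-classification steps follows. The mapping from R-boundary word to class contains only the two cases shown on the slide, and trying the dictionary before the R-boundary word is an assumption; `reclassify` and `R_BOUNDARY_CLASS` are hypothetical names.

```python
# Only the two cases from the slide; a real table would be much larger.
R_BOUNDARY_CLASS = {"mRNA": "RNA", "gene": "DNA"}


def reclassify(ne_tokens, current_class, dictionary=None):
    """Return a possibly corrected class for an NE after boundary extension."""
    if not ne_tokens:
        return current_class
    name = " ".join(ne_tokens)
    if dictionary and name in dictionary:  # step 1: dictionary lookup
        return dictionary[name]
    # Step 2: the last word (R-boundary word) decides the class, if known.
    return R_BOUNDARY_CLASS.get(ne_tokens[-1], current_class)


print(reclassify(["CIITA", "mRNA"], "DNA"))
```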


Experimental Results: NE Identification

BE-1: boundary extension for nested NEs

BE-2: boundary extension for brackets and slashes

BE-3: with human name filter


Experimental Results: NE Recognition

RC-1: re-classification using dictionary lookup

RC-2: re-classification using R-boundary words


Experimental Results:

GENIA v3.02 (10 Fold-CV)

Recently, Zhou improved the F-measure of his HMM model to 0.712 by combining it with an SVM.


Error Analysis

  • GENIA inconsistent annotation

    • IL-2 gene expression

      • <DNA>IL-2 gene</DNA> expression

      • <othername><DNA>IL-2 gene</DNA> expression</othername>

  • Conjunction

    • Human and mouse gene

  • Boundary detection error (boundary not in boundary word file)

    • Squirrel, manic, bursal…


Error Analysis

  • Abbreviation classification

    • Orthographical form fits into at least two classes.

    • Protein: SOS1, FLICE, GAG

    • Other Organic: CD336

  • False negative

    • A number of errors are due to low-frequency words or words not encountered in the training data.

  • False positive

  • Ellipsis:

    • Many inflammatory cytokine genes including TNF, IL-1, and IL-6


Outline

  • NER (named entity recognition) in biomedical domain

  • Challenges in biomedical NER

  • Current methods and our method

  • State of progress in NER

  • Future works


Manning’s conclusion (I): Key factor for low performance

  • Task difficulty does not appear to be the primary factor leading to low performance.

    • BioCreative: 1 class, BioNLP: 5 classes

  • Key factor: quality of the training and evaluation data

    • Higher inconsistency in the annotation of the BioNLP data.

    • Two of the authors independently reviewed 50 of the system’s errors; 34-35 were attributed to annotation.

    • The authors do not think the annotation inconsistencies are due to biological subtleties.


Manning’s Conclusion (II)

  • To improve biomedical annotation

  • BioNLP organizers emphasized that participants should focus on deep knowledge sources

    • coreference resolution and use of dependency relations over “widely used lexical-level features” (POS, morphological, orthographical, etc.)

  • Proper exploitation of external resources

    • In both tasks, external resources led to improvement of only 1-2%.

  • Consistent annotation might have led to a 70% reduction in error rate.


Outline

  • NER (named entity recognition) in biomedical domain

  • Challenges in biomedical NER

  • State of progress in NER

  • Abbreviation disambiguation

  • Future works


Disambiguation of abbreviation


Motivation (I)

  • Named entity (NE) recognition (NER) is the first step of information extraction.

  • NER contains two steps

    • NE identification: extract named entity from text

    • NE classification: classify given NE into specific class.


Motivation (II)

  • Because many protein and gene names are long compound names, authors usually refer to them by abbreviations.

    • A2M: Alpha-2-macroglobulin

    • A4GALT: alpha 1,4-galactosyltransferase

    • EGFR: epidermal growth factor receptor, EGF receptor

    • NF-AT: nuclear factor of activated T cells

    • HTLV-I: Human T cell lymphotropic virus I

    • TCDD: 2,3,7,8-tetrachlorodibenzo-p-dioxin

    • GRE: glucocorticoid response element


Motivation (III)

  • Abbreviation identification task:

    • It is easier than the classification task.

    • Abbreviations often have some orthographical clues.

      • All capital letters, a mix of letters and digits, etc.

  • Abbreviation classification task:

    • In some situations, it is hard to disambiguate an abbreviation’s class.

      • Example: the text mentions only the abbreviation, without the full name


Challenges of abbreviation

  • Two cases

    • Case 1: sentence contains abbreviation and full name

      • Human immunodeficiency virus type 2 (HIV-2), like HIV-1, causes AIDS and is associated with AIDS cases primarily in West Africa.

    • Case 2: sentence contains only abbreviation

      • HIV-1 and HIV-2 display significant differences in nucleic acid sequence and in the natural history of clinical disease.


Case 1

  • Case 1 is easier than Case 2

    • The classification can be solved by following steps:

      • Abbreviation – Full name association

      • Disambiguate full name’s class

      • Assign full name’s class to abbreviation

  • The challenge has shifted from abbreviation classification to abbreviation-full name association


Example of Case 1:

  • Sentence

    • Human immunodeficiency virus type 2 (HIV-2), like HIV-1, causes AIDS and is associated with AIDS cases primarily in West Africa.

  • Step 1: Abbreviation – Full name association

    • (Full name, Abbreviation) = (Human immunodeficiency virus type 2, HIV-2)

  • Step 2: Full name class assignment

    • Name: Human immunodeficiency virus type 2

    • Class: Virus

  • Step 3: Abbreviation class assignment

    • Abbreviation: HIV-2

    • Class: Virus
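
Steps 2 and 3 can be sketched as follows, assuming Step 1 has already produced a (full name, abbreviation) pair. The head-word table is a hypothetical stand-in for a real full-name classifier and contains only classes that appear in this talk; `classify_full_name` and `classify_abbreviation` are illustrative names.

```python
# Hypothetical head-word table standing in for a trained classifier.
HEAD_WORD_CLASS = {"virus": "Virus", "element": "DNA"}


def classify_full_name(full_name: str):
    """Step 2: pick the class of the last word that appears in the table."""
    for word in reversed(full_name.lower().split()):
        if word in HEAD_WORD_CLASS:
            return HEAD_WORD_CLASS[word]
    return None  # class unknown


def classify_abbreviation(pair):
    """Step 3: give the abbreviation the class of its full name."""
    full_name, abbreviation = pair
    return abbreviation, classify_full_name(full_name)


print(classify_abbreviation(("Human immunodeficiency virus type 2", "HIV-2")))
```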


A solution method to Case 1

  • Schwartz and Hearst, PSB 2003.

  • Identify <long form, short form> pairs.

    • Both long form and short form occur in the same sentence.

    • long form ‘(’ short form ‘)’ (more frequent)

    • short form ‘(’ long form ‘)’


Algorithm: Identify long form ‘(’ short form ‘)’

  • Identify long form and short form candidates (using adjacency to parentheses).

  • Identify correct long form.

    • Starting from the end of both candidates, move right to left, trying to find the shortest long form that matches the short form.

    • Every character in the short form must match a character in the long form.

    • The matched characters in the long form must be in the same order as the characters in the short form.

    • <HSF, Heat shock transcription factor>

    • <TTF-1, Thyroid transcription factor 1> : fail
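
The right-to-left matching step described above can be sketched in Python as follows. This is an illustrative port of the matching loop only: it assumes the candidate long form (the text before the parenthesis) has already been selected, it requires the first short-form character to match the start of a word, and it omits the candidate-selection and validity checks of the published algorithm. The name `find_best_long_form` is chosen here for illustration.

```python
def find_best_long_form(short_form: str, long_form: str):
    """Return the shortest suffix of long_form that matches short_form, or None."""
    l_index = len(long_form) - 1
    for s_index in range(len(short_form) - 1, -1, -1):
        ch = short_form[s_index].lower()
        if not ch.isalnum():
            continue  # punctuation in the short form is ignored
        # Move left until the character matches; the first short-form
        # character must additionally sit at the beginning of a word.
        while (l_index >= 0 and long_form[l_index].lower() != ch) or (
            s_index == 0 and l_index > 0 and long_form[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
    start = long_form.rfind(" ", 0, l_index + 1) + 1
    return long_form[start:]


print(find_best_long_form("HSF", "Heat shock transcription factor"))
print(find_best_long_form("TTF-1", "Thyroid transcription factor 1"))
```

For HSF the sketch returns the full phrase. For TTF-1 it returns only "transcription factor 1", because both T characters are matched inside "transcription" before "Thyroid" is reached, which is consistent with the failure noted above.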


Error analysis

  • Unused characters, e.g., <CNS1, cyclophilin seven suppressor>

  • Do not have any pattern between long form and short form, e.g., <ATN, anterior thalamus>

  • Partial matching:

    • The long form includes additional words to the left of the matching, e.g., <Pol I, RNA polymerase I>

  • Out-of-order mapping

  • First character matches to the internal character (of the long form).

  • Non-continuous long form.

  • Transformation in the mapping (2D -> two-dimensional)

  • Short form of only one character.


Other types of abbreviations

  • Schwartz and Hearst’s algorithm only considers candidates in parentheses.

  • Challenge: finding all possible pairs is a more difficult problem.


Example of Case 2

  • It is hard to disambiguate an abbreviation’s class, even with context information.

  • Example:

    • HIV-1 and HIV-2 display significant differences in nucleic acid sequence and in the natural history of clinical disease.

    • HIV-1 and HIV-2 are both viruses, but if we replace HIV-1 and HIV-2 with IL-2 and IL-10, the sentence still makes sense.

    • IL-2 and IL-10 display significant differences in nucleic acid sequence and in the natural history of clinical disease.

      • IL-2 and IL-10: gene name


Case 2

  • Left for future work.

  • Clues:

    • Statistical methods

    • Dictionary-based methods


Outline

  • NER (named entity recognition) in biomedical domain

  • Challenges in biomedical NER

  • State of progress in NER

  • Abbreviation disambiguation

  • Future works


What’s Next after NER is Solved?

  • Named entity relation recognition (NERR)

    • Protein-Protein interaction/binding/inhibition

    • Protein-Small Molecules

    • Gene-Gene regulation

    • Gene-Gene Product interaction

    • Gene-Drug relation

    • Protein-Subcellular location

    • Amino Acid-Protein relation



Identify Relations among Named Entities

  • Target: Extract relations between various biological named entities.

  • Example sentence: “Here we demonstrate that the c-myb proto-oncogene product, which is itself a DNA-binding protein, and transcriptional transactivator, can interact synergistically with Z.”

  • Relation (Subject, Action, Object): (c-myb proto-oncogene product, interact, Z)


Future works

  • Few papers have been published on the following specific challenging topics of NER.

    • Automated corpus correction

    • Disambiguation of abbreviations (Schwartz & Hearst, 2003,…)

    • Conjunction

  • NERR (difficult)

    • parser

    • Pronoun and anaphora resolution


Acknowledgements

  • Bioinformatics: Yi-Feng Lin, Wen-Chi Chou

  • NLP: Tzong-Han Tsai, Cheng-Wei Lee

  • Postdoc: Kuen-Pin Wu

  • Colleague: Wen-Lian Hsu (Fu Chang is jumping on the bandwagon now.)


Lab Introduction


Research topics

  • Protein structure prediction

    • Secondary structure prediction

    • Tertiary structure prediction – local structure

    • Members: Hsin-Nan Lin, Caster Chen, Jia-Ming Chang

  • Protein structure determination based on NMR data

    • Backbone assignment

    • Side chain assignment

    • RDC

    • Jia-Ming Chang, Caster Chen, Philip Chen

    • Collaborator: Prof TH Huang, IBMS


Research topics

  • Mass spectrometry based proteomics

    • Protein quantification

    • Protein identification – for modification study

    • Yi-Hwa Yian, Wen-Ting Lin, Jacky Chou, Wei-Nung Hung

    • Collaborator: Prof YR Chen, Inst of Chemistry

  • Biological literature mining

    • NER, NERR

    • Yi-Feng Lin, Jacky Chou, Richard Tsai


Faculty

  • PI: Wen-Lian Hsu, Ting-Yi Sung

  • Post-doc: Kuen-Pin Wu

