A Probabilistic Term Variant Generator for Biomedical Terms

Yoshimasa Tsuruoka and Jun’ichi Tsujii

CREST, JST

The University of Tokyo


Outline

  • Probabilistic Term Variant Generator

    • Generation Algorithm

    • Application: Dictionary expansion


Background

  • Information extraction from biomedical documents

  • Recognizing technical terms (e.g. DNA, protein names)

    We measured glucocorticoid receptors (GR) in mononuclear leukocytes (MNL) isolated…


Technical Term Recognition

  • Machine learning based

    • Identifying the regions of terms

      ⇒ No ID information

  • Dictionary-based

    • Comparing the strings with each entry in the dictionary

      ⇒ ID information


Problems of Dictionary-Based Approaches

  • Spelling variation degrades recall

     ⇒ Approximate string searching

  • False positives degrade precision

     ⇒ Filtering by machine learning


Exact String Searching

  • Example

    • Text

      Phorbol myristate acetate induced Egr-1 mRNA…

    • Dictionary

      EGP

      EGR-1

      EGR-1 binding protein

      :

      ⇒ None of the entries matches the text exactly
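The failure on this slide can be reproduced with a few lines of Python. This is only a sketch: the text and dictionary entries come from the slide, while the function name `exact_matches` and the case-insensitive comparison are illustrative assumptions, not the authors' implementation.

```python
# Dictionary entries and text are taken from the slide; the lookup is an assumed sketch.
dictionary = {"EGP", "EGR-1", "EGR-1 binding protein"}
text = "Phorbol myristate acetate induced Egr-1 mRNA"


def exact_matches(text, dictionary):
    """Return the entries that occur verbatim (case-sensitively) in the text."""
    return sorted(entry for entry in dictionary if entry in text)


exact = exact_matches(text, dictionary)

# For contrast: ignoring case (one very simple form of spelling variation)
# is already enough to recover the intended entry.
case_insensitive = sorted(e for e in dictionary if e.lower() in text.lower())

print(exact)             # []
print(case_insensitive)  # ['EGR-1']
```

Case folding handles this one example, but it would not cover hyphen/space or plural variation, which is what the rest of the talk addresses.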


Edit Distance

  • Defines the distance between two strings as the minimum number of operations, of three kinds, needed to turn one into the other:

    • Substitution

    • Insertion

    • Deletion

  • Example: board → abord

    • Cost = 2 (delete 'a' and insert 'a')
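The cost in the example can be checked with the standard dynamic-programming formulation of edit distance. The sketch below assumes unit cost for each of the three operations; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(source, target):
    """Levenshtein distance with unit cost for substitution, insertion and deletion."""
    # previous[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(edit_distance("board", "abord"))  # 2
print(edit_distance("Egr-1", "EGR-1"))  # 2
```

Note that with uniform costs, "Egr-1" is as far from "EGR-1" as "board" is from "abord", even though the first pair is a harmless case variation.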


Automatic Generation of Spelling Variants

  • Variant Generator

    Input: NF-Kappa B

    Generated variants (generation probability):

      NF-Kappa B (1.0)
      NF Kappa B (0.9)
      NF kappa B (0.6)
      NF kappaB (0.5)
      NFkappaB (0.3)
      :

  • Each generated variant is associated with its generation probability


Generation Algorithm

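  • Recursive generation

    P = P' × P_op

    (P': probability of the term the operation is applied to; P_op: probability of the operation)

    T cell (1.0)
      → T-cell (0.5)   [P_op = 0.5]
          → T-cells (0.1)   [P_op = 0.2]
      → T cells (0.2)   [P_op = 0.2]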
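The recursion can be sketched as follows. The two operations and their probabilities (0.5 for hyphenation, 0.2 for pluralization) are read off the "T cell" example; the function names, the fixed rule list, and the pruning threshold of 0.05 are assumptions made for illustration, whereas the generator in the talk learns its rules from data.

```python
def hyphenate(term):
    """Replace the first space with a hyphen; None if the rule does not apply."""
    return term.replace(" ", "-", 1) if " " in term else None


def pluralize(term):
    """Append 's'; None if the term already ends in 's'."""
    return None if term.endswith("s") else term + "s"


# (operation, operation probability) pairs; the values mirror the slide's example.
OPERATIONS = [(hyphenate, 0.5), (pluralize, 0.2)]


def generate_variants(term, probability=1.0, threshold=0.05, found=None):
    """Recursively apply operations, multiplying probabilities: P = P' * P_op."""
    if found is None:
        found = {}
    # Prune low-probability branches and variants already reached at least as well.
    if probability < threshold or probability <= found.get(term, 0.0):
        return found
    found[term] = probability
    for operation, p_op in OPERATIONS:
        variant = operation(term)
        if variant is not None:
            generate_variants(variant, probability * p_op, threshold, found)
    return found


variants = generate_variants("T cell")
for variant, p in sorted(variants.items(), key=lambda item: -item[1]):
    print(f"{variant} ({p:.1f})")
```

In this sketch "T-cells" is reachable along two paths, both with probability 0.1, and only the first is kept. How the original generator treats multiple derivations of the same variant is not stated on the slide.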


Collecting Examples of Spelling Variation

  • Abbreviation Extraction (Schwartz 2003)

    • Extracts short and long form pairs


Learning Operation Rules

  • Operations for generating variants

    • Substitution

    • Deletion

    • Insertion

  • Context

    • Character-level context: preceding (following) two characters

  • Operation Probability
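The formula for the operation probability did not survive in this transcript. One plausible estimate, assumed here and not confirmed by the slides, is a relative frequency: how often an operation was observed in a given character context, divided by how often that context was observed at all. The observation format, the `^` start-of-string padding, and the toy counts below are all hypothetical.

```python
from collections import Counter


def estimate_operation_probabilities(observations):
    """Relative-frequency estimate of P(operation | context).

    observations: iterable of (context, operation) pairs, where context is a
    (preceding two characters, following two characters) tuple.
    This estimator is an assumption, not the formula from the original slide.
    """
    observations = list(observations)
    context_counts = Counter(context for context, _ in observations)
    pair_counts = Counter(observations)
    return {
        (context, operation): count / context_counts[context]
        for (context, operation), count in pair_counts.items()
    }


# Hypothetical toy data: the space in "T cell", seen four times in variant pairs.
context = ("^T", "ce")
observations = [
    (context, "substitute ' ' with '-'"),
    (context, "substitute ' ' with '-'"),
    (context, "no change"),
    (context, "no change"),
]

probabilities = estimate_operation_probabilities(observations)
print(probabilities[(context, "substitute ' ' with '-'")])  # 0.5
```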






Application: Dictionary Expansion

  • Expanding each entry in the dictionary

    • Threshold of Generation Probability: 0.1

    • Max number of variants for each entry: 20
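The two settings above can be sketched as a filter over the generator's output. The threshold (0.1) and cap (20) are from the slide; the function `expand_dictionary`, the stand-in generator `toy_generate`, and the entry ID are hypothetical. Whether the original entry counts toward the cap of 20 is not stated, and this sketch assumes it does.

```python
def expand_dictionary(dictionary, generate, threshold=0.1, max_variants=20):
    """Map every sufficiently probable variant to the ID of the entry it came from.

    dictionary: {entry_id: term}
    generate:   function returning {variant: generation probability} for a term
    """
    expanded = {}
    for entry_id, term in dictionary.items():
        candidates = [(v, p) for v, p in generate(term).items() if p >= threshold]
        candidates.sort(key=lambda item: item[1], reverse=True)
        for variant, _ in candidates[:max_variants]:
            # Keep the first ID seen if two entries produce the same variant.
            expanded.setdefault(variant, entry_id)
    return expanded


def toy_generate(term):
    """Stand-in generator returning the variants listed for NF-Kappa B in the talk."""
    return {
        "NF-Kappa B": 1.0,
        "NF Kappa B": 0.9,
        "NF kappa B": 0.6,
        "NF kappaB": 0.5,
        "NFkappaB": 0.3,
    }


expanded = expand_dictionary({"ID:0001": "NF-Kappa B"}, toy_generate)
print(len(expanded))  # 5

stricter = expand_dictionary({"ID:0001": "NF-Kappa B"}, toy_generate, threshold=0.55)
print(sorted(stricter))  # ['NF Kappa B', 'NF kappa B', 'NF-Kappa B']
```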


Protein Name Recognition

  • Information Extraction

  • Longest match

  • GENIA corpus
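Longest-match lookup against an (expanded) dictionary can be sketched as below. This assumes whitespace tokenization and a cap of five tokens per term, neither of which is specified in the talk; the dictionary entries and IDs are made up for the example.

```python
def longest_match(tokens, dictionary, max_len=5):
    """Scan left to right, preferring the longest dictionary entry at each position.

    Returns (start, end, entry_id) triples with end exclusive.
    """
    matches = []
    i = 0
    while i < len(tokens):
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + length])
            if candidate in dictionary:
                matches.append((i, i + length, dictionary[candidate]))
                i += length  # skip past the matched term
                break
        else:
            i += 1  # no entry starts at this token
    return matches


# Hypothetical dictionary mapping surface forms to entry IDs.
dictionary = {"Egr-1": "ID1", "Egr-1 binding protein": "ID2"}
tokens = "the Egr-1 binding protein binds DNA".split()

matches = longest_match(tokens, dictionary)
print(matches)  # [(1, 4, 'ID2')]
```

Preferring the longest entry means "Egr-1 binding protein" is reported once, instead of also reporting the shorter "Egr-1" nested inside it.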



Conclusion

  • Probabilistic Variant Generator

    • Learning from actual examples

    • Dictionary expansion by the generator improves recall without loss of precision.