
A Probabilistic Term Variant Generator for Biomedical Terms

A Probabilistic Term Variant Generator for Biomedical Terms. Yoshimasa Tsuruoka and Jun’ichi Tsujii, CREST, JST, The University of Tokyo. Outline: Probabilistic Term Variant Generator; Generation Algorithm; Application: Dictionary expansion. Background.


Presentation Transcript


  1. A Probabilistic Term Variant Generator for Biomedical Terms • Yoshimasa Tsuruoka and Jun’ichi Tsujii • CREST, JST • The University of Tokyo

  2. Outline • Probabilistic Term Variant Generator • Generation Algorithm • Application: Dictionary expansion

  3. Background • Information extraction from biomedical documents • Recognizing technical terms (e.g. DNA, protein names) • Example: “We measured glucocorticoid receptors (GR) in mononuclear leukocytes (MNL) isolated…”

  4. Technical Term Recognition • Machine learning-based: identifies the regions of terms ⇒ No ID information • Dictionary-based: compares the strings with each entry in the dictionary ⇒ ID information

  5. Problems of Dictionary-based Approaches • Spelling variation degrades recall ⇒ Approximate string searching • False positives degrade precision ⇒ Filtering by machine learning

  6. Exact String Searching • Example • Text: Phorbol myristate acetate induced Egr-1 mRNA… • Dictionary: EGP, EGR-1, EGR-1 binding protein, … ⇒ None of the entries matches “Egr-1” exactly (the case differs)

  7. Edit Distance • Defines the distance between two strings as the minimum number of edit operations needed to turn one into the other, using three kinds of operations: • Substitution • Insertion • Deletion • Ex.) board → abord • Cost = 2 (delete ‘a’ and insert ‘a’ at the front)
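
The unit-cost edit distance described on this slide can be sketched with the standard dynamic-programming recurrence. This is a minimal illustration, not code from the presentation; the function name `edit_distance` is chosen here for the example, and every operation is assumed to cost 1.

```python
def edit_distance(source: str, target: str) -> int:
    """Unit-cost edit distance using substitution, insertion, and deletion."""
    # previous[j] is the distance between the prefix of `source` processed
    # so far and target[:j]; it starts as the distance from the empty string.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(edit_distance("board", "abord"))  # 2, as in the slide's example
print(edit_distance("Egr-1", "EGR-1"))  # 2 (two case substitutions)
```

With a small distance threshold, a dictionary lookup could accept “Egr-1” as a match for “EGR-1”, which is the approximate string searching mentioned on slide 5.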

  8. Automatic Generation of Spelling Variants • Variant Generator • Input: NF-Kappa B • Output: NF-Kappa B (1.0), NF Kappa B (0.9), NF kappa B (0.6), NF kappaB (0.5), NFkappaB (0.3), … • Each generated variant is associated with its generation probability

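  9. Generation Algorithm • Recursive generation • P = P’ × P_op (probability of the parent variant times the probability of the operation) • T cell (1.0) → T-cell (0.5), operation probability 0.5 • T cell (1.0) → T cells (0.2), operation probability 0.2 • T-cell (0.5) → T-cells (0.1), operation probability 0.2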
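
The recursion on this slide can be sketched as follows. The two rules (`hyphenate`, `pluralize`) and their probabilities are hypothetical stand-ins chosen to reproduce the slide's “T cell” numbers; the real generator uses learned, context-dependent rules. Keeping only the highest-probability derivation of each variant and pruning below a threshold are also assumptions of this sketch.

```python
from typing import Callable, Dict, List, Tuple

# An operation maps a term to (variant, operation probability) pairs.
Operation = Callable[[str], List[Tuple[str, float]]]


def hyphenate(term: str) -> List[Tuple[str, float]]:
    # Hypothetical rule: replace the first space with a hyphen, P_op = 0.5.
    return [(term.replace(" ", "-", 1), 0.5)] if " " in term else []


def pluralize(term: str) -> List[Tuple[str, float]]:
    # Hypothetical rule: append "s", P_op = 0.2.
    return [(term + "s", 0.2)] if not term.endswith("s") else []


def generate_variants(
    term: str, operations: List[Operation], threshold: float = 0.05
) -> Dict[str, float]:
    """Recursively apply operations, scoring each variant by P = P' * P_op."""
    best: Dict[str, float] = {term: 1.0}

    def expand(current: str, prob: float) -> None:
        for operation in operations:
            for variant, p_op in operation(current):
                p = prob * p_op
                # Prune unlikely variants and derivations that do not improve
                # on a probability already recorded for the same string.
                if p < threshold or p <= best.get(variant, 0.0):
                    continue
                best[variant] = p
                expand(variant, p)

    expand(term, 1.0)
    return best


for variant, p in sorted(
    generate_variants("T cell", [hyphenate, pluralize]).items(),
    key=lambda item: -item[1],
):
    print(f"{variant} ({p:.1f})")
```

Because every operation probability is at most 1, the score can only shrink along a derivation, so the threshold guarantees that the recursion terminates.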

  10. Collecting Examples of Spelling Variation • Abbreviation Extraction (Schwartz 2003) • Extracts short and long form pairs

  11. Learning Operation Rules • Operations for generating variants • Substitution • Deletion • Insertion • Context • Character-level context: preceding (following) two characters • Operation Probability
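
One way such rules might be collected from variant pairs is sketched below. The slides do not give the estimation formula, so this sketch assumes a relative-frequency estimate: the number of times an operation was applied in a given context divided by the number of times that context occurs in the source terms. The alignment via Python's `difflib.SequenceMatcher`, the `#` padding character, and all function names are choices made for this illustration only.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import Dict, Iterator, List, Tuple

PAD = "#"     # assumed boundary marker; terms are assumed not to contain it
CONTEXT = 2   # two characters of context on each side, as on the slide

Rule = Tuple[str, str, str, str]  # (left context, original, replacement, right context)


def extract_operations(source: str, variant: str) -> Iterator[Rule]:
    """Yield the edit operations that turn `source` into `variant`."""
    padded = PAD * CONTEXT + source + PAD * CONTEXT
    matcher = SequenceMatcher(None, source, variant, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        left = padded[i1:i1 + CONTEXT]                       # chars before source[i1]
        right = padded[i2 + CONTEXT:i2 + 2 * CONTEXT]        # chars from source[i2] on
        yield (left, source[i1:i2], variant[j1:j2], right)


def count_occurrences(text: str, pattern: str) -> int:
    """Count possibly overlapping occurrences of `pattern` in `text`."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def learn_rules(pairs: List[Tuple[str, str]]) -> Dict[Rule, float]:
    """Estimate P(operation | context) by relative frequency (an assumption)."""
    applied: Counter = Counter()
    for source, variant in pairs:
        applied.update(extract_operations(source, variant))
    padded_sources = [PAD * CONTEXT + s + PAD * CONTEXT for s, _ in pairs]
    rules: Dict[Rule, float] = {}
    for rule, count in applied.items():
        left, original, _, right = rule
        seen = sum(count_occurrences(p, left + original + right) for p in padded_sources)
        rules[rule] = count / seen
    return rules


# Toy pairs; an identity pair records a context where no operation was applied.
pairs = [("T cell", "T-cell"), ("T cell", "T cell"), ("T cell", "T cells")]
for rule, probability in learn_rules(pairs).items():
    print(rule, round(probability, 2))
```

A substitution has non-empty `original` and `replacement`, a deletion has an empty `replacement`, and an insertion has an empty `original`, so the three operation types on the slide share one rule representation.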

  12. Probabilistic Rules

  13. Example (1)

  14. Example (2)

  15. Example (3)

  16. Application: Dictionary Expansion • Expanding each entry in the dictionary • Threshold of Generation Probability: 0.1 • Max number of variants for each entry: 20
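
The expansion step with these two settings might look like the sketch below. The generator is passed in as a function; `toy_generate` is a stand-in that returns the NF-Kappa B probabilities from slide 8 plus one made-up low-probability variant to show the threshold at work. Whether the original term counts toward the cap of 20, and how ID conflicts between entries are resolved, are not stated on the slide; this sketch always keeps the original term and lets the first entry win.

```python
from typing import Callable, Dict


def expand_dictionary(
    entries: Dict[str, str],
    generate: Callable[[str], Dict[str, float]],
    threshold: float = 0.1,
    max_variants: int = 20,
) -> Dict[str, str]:
    """Map every kept surface form (original or variant) to its entry ID."""
    expanded: Dict[str, str] = {}
    for entry_id, term in entries.items():
        expanded.setdefault(term, entry_id)  # the original term is always kept
        candidates = [
            (variant, p)
            for variant, p in generate(term).items()
            if variant != term and p >= threshold
        ]
        candidates.sort(key=lambda item: item[1], reverse=True)
        for variant, _ in candidates[:max_variants]:
            expanded.setdefault(variant, entry_id)  # first entry wins on conflict
    return expanded


def toy_generate(term: str) -> Dict[str, float]:
    # Stand-in for the real generator. The first five probabilities come from
    # slide 8; "NFKB" (0.05) is invented here to illustrate the threshold.
    table = {
        "NF-Kappa B": {
            "NF-Kappa B": 1.0,
            "NF Kappa B": 0.9,
            "NF kappa B": 0.6,
            "NF kappaB": 0.5,
            "NFkappaB": 0.3,
            "NFKB": 0.05,
        }
    }
    return table.get(term, {term: 1.0})


expanded = expand_dictionary({"ID001": "NF-Kappa B"}, toy_generate)
print(sorted(expanded))
```

The entry ID `ID001` is hypothetical. Carrying the ID over to every variant preserves the ID information that slide 4 lists as the advantage of dictionary-based recognition.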

  17. Protein Name Recognition • Information Extraction • Longest match • GENIA corpus
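
Longest-match lookup against an (expanded) dictionary can be sketched as below. Whitespace tokenization, the `max_len` window, and the IDs `P1` and `P2` are assumptions made for the example; the slides do not describe the tokenizer or the matching details.

```python
from typing import Dict, List, Tuple


def longest_match(
    tokens: List[str], dictionary: Dict[str, str], max_len: int = 5
) -> List[Tuple[int, int, str]]:
    """Return (start, end, entry_id) spans, preferring the longest match."""
    matches: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate starting at token i first.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + length])
            if candidate in dictionary:
                matches.append((i, i + length, dictionary[candidate]))
                i += length  # continue after the matched span
                break
        else:
            i += 1  # no entry starts at this token
    return matches


# Hypothetical dictionary; "Egr-1" stands for a generated spelling variant.
dictionary = {"EGR-1": "P1", "Egr-1": "P1", "EGR-1 binding protein": "P2"}

print(longest_match("Phorbol myristate acetate induced Egr-1 mRNA".split(), dictionary))
print(longest_match("the EGR-1 binding protein gene".split(), dictionary))
```

In the second sentence the three-token entry “EGR-1 binding protein” is chosen over the shorter “EGR-1”, which is the point of longest matching.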

  18. Results of Dictionary Expansion

  19. Conclusion • Probabilistic Variant Generator • Learning from actual examples • Dictionary expansion by the generator improves recall without loss of precision.
