
A Bio Text Mining Workbench combined with Active Machine Learning

A Bio Text Mining Workbench combined with Active Machine Learning. Gary Geunbae Lee, Postech, 11/25, LBM2005. Contents: Introduction; POSBIOTM/W Workbench; POSBIOTM/NER System; POSBIOTM/NER with Active Machine Learning; POSBIOTM/Event System; Current status (demo).


Presentation Transcript


  1. A Bio Text Mining Workbench combined with Active Machine Learning Gary Geunbae Lee Postech 11/25 LBM2005

  2. Contents • Introduction • POSBIOTM/W Workbench • POSBIOTM/NER System • POSBIOTM/NER with Active Machine Learning • POSBIOTM/Event System • Current status (demo)

  3. Introduction • Exponentially growing biological publications

  4. Introduction • Two key issues in dealing with biological texts: • Biological named entity recognition. • Extraction of the biological interactions (events) between biological entities. • Important for biological pathways. [Figure label: Biological Papers]

  5. Introduction • Bio-text mining workbench • Development workbench (common in NLP) • Grammar development workbench • POS/Tree Tagging workbench • Uses a large amount of corpus data • Machine learning methods are used in the NER task and the event extraction task. • An annotated corpus is essential to achieve good results with machine learning based methods (both in quantity and quality) • Lack of annotated corpora (notorious in bio/medical fields) • Need • Tools that support collecting, managing, creating, annotating and exploiting rich biomedical text resources. • Tools that interact with the automatic system to grow the high-quality annotated corpus

  6. Contents • Introduction • POSBIOTM/W Workbench • POSBIOTM/NER System • POSBIOTM/NER with Active Machine Learning • POSBIOTM/Event System • Current status

  7. POSBIOTM/W: A development Workbench • Overall Design

  8. POSBIOTM/W Workbench • Managing Tool • Goal • Help users search, collect and manage publications. • Quick Search Bar • Provides quick access to PubMed. • PubMed Search Assistant • Users can select specific abstracts for named-entity tagging and event extraction

  9. POSBIOTM/W Workbench • Managing Tool • PubMed Search Assistant

  10. POSBIOTM/W Workbench • NER Tool • Named-entity recognition (NER) task • Identification of the material names of interest. • Goal: automatically and effectively annotate biomedical entities. • The NER Tool is a client tool of the POSBIOTM/NER System • Currently, three NER models are provided: • The GENIA-NER model, the GENE-NER model and the GPCR-NER model • Named-entity recognition with active learning • To minimize the human labeling effort

  11. POSBIOTM/W Workbench • NER Tool • Named-entity recognition with Active learning

  12. POSBIOTM/W Workbench • Event Extraction Tool • Goal: To extract events, each consisting of an “interaction”, an “effecter”, and a “reactant” • Named-entity types: protein (P), gene (G), small molecule (SM), and cellular process (CP). • Interaction types: biological interaction (BI) and chemical interaction (CI). • The Event Extraction Tool is a client tool of the POSBIOTM/Event System

  13. POSBIOTM/W Workbench • Event Extraction Tool • Extraction Result in XML format <Result> <NER> .... <Sentence SNum="4"><protein>EDG-1</protein>, encoded by the <gene>endothelial_differentiation_gene-1</gene>, is a <protein>heterotrimeric_guanine_nucleotide_binding_protein-coupled_receptor</protein> (<protein>GPCR</protein>) for <small_molecule>sphingosine-1-phosphate</small_molecule> (<small_molecule>SPP</small_molecule>) that has been shown to stimulate <cellular_process>angiogenesis</cellular_process> and <cellular_process>cell_migration</cellular_process> in cultured endothelial cells.</Sentence> ..... </NER> <Event_Extraction> <Event SNum="4"> <Interaction>stimulate</Interaction> <Effecter>sphingosine-1-phosphate</Effecter> <Reactant>angiogenesis</Reactant> </Event> ..... </Event_Extraction> </Result>
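
A result in this format can be read with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on an abridged, well-formed copy of the slide's example. The function name `read_events` and the dictionary keys are assumptions made for illustration; they are not part of POSBIOTM.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the slide's example result (the sentence is shortened).
RESULT_XML = """
<Result>
  <NER>
    <Sentence SNum="4"><small_molecule>sphingosine-1-phosphate</small_molecule> has been shown to stimulate <cellular_process>angiogenesis</cellular_process>.</Sentence>
  </NER>
  <Event_Extraction>
    <Event SNum="4">
      <Interaction>stimulate</Interaction>
      <Effecter>sphingosine-1-phosphate</Effecter>
      <Reactant>angiogenesis</Reactant>
    </Event>
  </Event_Extraction>
</Result>
"""

def read_events(xml_text):
    """Return one dict per <Event> with its sentence number and three slots."""
    root = ET.fromstring(xml_text.strip())
    events = []
    for event in root.iter("Event"):
        events.append({
            "sentence": int(event.get("SNum")),
            "interaction": event.findtext("Interaction"),
            "effecter": event.findtext("Effecter"),
            "reactant": event.findtext("Reactant"),
        })
    return events

events = read_events(RESULT_XML)
```

The `SNum` attribute links each event back to the sentence it was extracted from, which is what the annotation tool's lookup of the original sentence relies on.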

  14. POSBIOTM/W Workbench • Event Extraction Tool • Extraction Result

  15. POSBIOTM/W Workbench • Annotation Tool • Goal • The GUI-based Annotation Tool is designed for editing annotations manually. • Named-entity editing • Named entities are displayed in different colors, which can be changed • Users can add, remove or correct named-entity tags, change the boundaries of named entities, etc.

  16. POSBIOTM/W Workbench • Annotation Tool • Event editing • Extracted events are displayed in a table • Double-clicking an event looks up the original sentence from which it was extracted • Upload function • Users can upload well-annotated data to the POSBIOTM system • Supports the incremental build-up of a massive named-entity and event annotation corpus.

  17. POSBIOTM/W Workbench • Annotation Tool

  18. Contents • Introduction • POSBIOTM/W Workbench • POSBIOTM/NER System • POSBIOTM/NER with Active Machine Learning • POSBIOTM/Event System • Current status

  19. POSBIOTM/NER System • Named Entity Recognition (NER) • Approach • The named entity recognition problem is treated as a classification problem: each input token is marked up with a named entity category label. • CRF • Conditional random fields (CRFs) ([Lafferty et al. 2001]) are a probabilistic framework for labeling and segmenting sequential data. (s: state (tag); o: input) • For example:
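
The slide's formula is not preserved in this transcript. For reference, the standard linear-chain CRF can be written with the slide's notation (s for the state sequence, o for the input) as below. The feature functions f_k and weights λ_k follow the usual textbook notation and are an assumption here, since the slide's own symbols are not available.

```latex
P(\mathbf{s} \mid \mathbf{o}) =
  \frac{1}{Z(\mathbf{o})}
  \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(s_{t-1}, s_t, \mathbf{o}, t) \right),
\qquad
Z(\mathbf{o}) =
  \sum_{\mathbf{s}'}
  \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(s'_{t-1}, s'_t, \mathbf{o}, t) \right)
```

Here T is the number of tokens, and Z(o) normalizes over all possible tag sequences so that the probabilities sum to one.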

  20. POSBIOTM/NER System • Named Entity Recognition (NER) • Feature Set

  21. POSBIOTM/NER System • NER Models • Three NER models • GENIA model / GENE-NER model / GPCR-NER model • GENIA model • The named entity classes used in the evaluation: DNA, RNA, protein, cell_line and cell_type • The training data consists of 2000 MEDLINE abstracts of the GENIA version 3 corpus. These abstracts were collected using the search terms “human”, “blood cell”, “transcription factor”. • The testing data will come from a super-domain of the training data (“blood cell”, “transcription factor”).

  22. POSBIOTM/NER System • NER Models • GENE-NER model • The GENE-NER module uses the BioCreative corpus. • The aim of the GENE-NER module is to identify which terms in biomedical research articles are gene and/or protein names. • The training corpus consists of 7.5k sentences, selected from MEDLINE according to their likelihood of containing gene names. • GPCR-NER module (Postech) • Aims at recognizing four target named entity categories: protein, gene, small molecule and cellular process. • The training corpus consists of 50 full articles related to the GPCR (G-protein coupled receptor) signal transduction pathway.

  23. POSBIOTM/NER System • NER Models • Evaluation for Three NER models

  24. Contents • Introduction • POSBIOTM/W Workbench • POSBIOTM/NER System • POSBIOTM/NER with Active Machine Learning • POSBIOTM/Event System • Current status

  25. POSBIOTM/NER with Active Learning • Active Learning in NER • NER with Machine Learning • To enhance the NER performance through the idea of re-using the annotated data and re-training the NER module • NER with Active Machine Learning • To minimize the human labeling effort without degrading the performance • To select the most informative samples for training

  26. POSBIOTM/NER with Active Learning • Active Learning in NER Framework

  27. POSBIOTM/NER with Active Learning • Active Learning Scoring Strategy • Uncertainty-based Sample Selection • Using an entropy-based measure to quantify the uncertainty that the current classifier holds (entropy or normalized entropy of the CRF conditional probability) • The most uncertain samples are selected for human annotation
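
The uncertainty score can be sketched in a few lines of Python. The sketch assumes the classifier exposes probabilities for a handful of candidate labelings of a sentence (for example an N-best list). The slide does not say how the entropy is normalized, so dividing by log N, the maximum possible entropy for N candidates, is an assumption; the probability values are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy of a distribution, after renormalizing it to sum to 1."""
    total = sum(probs)
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)

def normalized_entropy(probs):
    """Entropy divided by its maximum, log(N), so the score lies in [0, 1]."""
    n = len(probs)
    return entropy(probs) / math.log(n) if n > 1 else 0.0

# Hypothetical N-best labeling probabilities for two sentences.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]

# The sentence with the higher score would be sent for human annotation first.
ranked = sorted(
    [("confident", confident), ("uncertain", uncertain)],
    key=lambda item: normalized_entropy(item[1]),
    reverse=True,
)
```

A sentence whose candidate labelings are almost equally likely scores close to 1, while a sentence with one dominant labeling scores close to 0.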

  28. POSBIOTM/NER with Active Learning • Active Learning Scoring Strategy • Diversity-based Sample Selection • To catch the most representative sentences in each sampling. • The divergence measures of the two sentences are represented by the minimum similarity among the examples • The similarity score of two words • The similarity score of two sentences (for syntactic path)

  29. POSBIOTM/NER with Active Learning • Active Learning Scoring Strategy • MMR (Maximal Marginal Relevance) method • The two measures for uncertainty and diversity are combined using the MMR method to give the sampling scores in our active learning strategy
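
A generic MMR-style selection loop is sketched below in Python. It is not the authors' exact scoring: the weight `lam`, the Jaccard similarity over token sets, and the toy sentences and uncertainty values are all assumptions for illustration. The slides' own word-level and sentence-level similarity formulas are not preserved in this transcript.

```python
def mmr_select(candidates, uncertainty, similarity, k, lam=0.8):
    """Greedily pick k items, trading uncertainty off against redundancy.

    score(i) = lam * uncertainty[i]
               - (1 - lam) * max(similarity(i, j) for j already selected)
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(i):
            redundancy = max((similarity(i, j) for j in selected), default=0.0)
            return lam * uncertainty[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy pool: sentences as token sets, with hypothetical uncertainty scores.
sentences = {
    "s1": {"EDG-1", "binds", "SPP"},
    "s2": {"EDG-1", "binds", "SPP", "strongly"},
    "s3": {"p53", "regulates", "apoptosis"},
}
uncertainty = {"s1": 0.9, "s2": 0.85, "s3": 0.6}

def jaccard(a, b):
    """Stand-in similarity: overlap of the two sentences' token sets."""
    x, y = sentences[a], sentences[b]
    return len(x & y) / len(x | y)

picked = mmr_select(sentences, uncertainty, jaccard, k=2, lam=0.5)
```

With these numbers the loop picks `s1` first and then `s3`: `s2` is more uncertain than `s3`, but it is nearly a duplicate of `s1`, so the diversity term pushes it down.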

  30. POSBIOTM/NER with Active Learning • Experiment and Discussion • Training Data • 2,000 MEDLINE abstracts from the GENIA corpus • 5 named entity classes • DNA, RNA, protein, cell line, cell type • Test Data • 404 abstracts • Half of them are from the same domain as the training data and the other half are from the super-domain of ‘blood cell’ and ‘transcription factor’

  31. POSBIOTM/NER with Active Learning • Experiment and Discussion • Pool-based sample selection • 100 abstracts were used to train the initial NER module • Each time, we chose k examples (sentences) from the given pool to train the new NER module • The number k varied from 1,000 to 17,000 with step size 1,000 • Active learning methods tested • Random selection • Entropy-based uncertainty selection • Entropy combined with Diversity • Normalized Entropy combined with Diversity

  32. POSBIOTM/NER with Active Learning • Experiment and Discussion

  33. POSBIOTM/NER with Active Learning • Experiment and Discussion • All three active learning strategies outperform random selection • The combined strategy needs 24.64% fewer training examples than random selection • The normalized combined strategy needs 35.43% fewer training examples than random selection • Diversity increases the classifier’s performance when a large number of samples are selected • Up to 4,000 sentences, the entropy strategy and the combined strategy perform similarly • After the 11,000-sentence point, the combined strategy surpasses the entropy strategy

  34. Contents • Introduction • POSBIOTM/W Workbench • POSBIOTM/NER System • POSBIOTM/NER with Active Machine Learning • POSBIOTM/Event System • Current status

  35. POSBIOTM/Event System • System Architecture

  36. POSBIOTM/Event System • Target Slot Definition • Template Element • Entities - participants of an event • protein (P), gene (G), small molecule (SM), cellular process (CP) • Interaction - relationship between entities • biological interaction (BI) – Functional interaction • About how/whether one component affects the other's status biologically • chemical interaction (CI) – Molecular interaction • About the interaction among entities at the molecular structural level • Event • One Interaction (I) • Connecting the effecter and reactant • Interaction keywords (BI, CI) • One Effecter (E) • Provoking an event • Template element (P, G, SM, CP) or nested event • One Reactant (R) • Responding to an effecter • Template element (P, G, SM, CP) or nested event

  37. POSBIOTM/Event System • Target Slot Definition • Example

  38. POSBIOTM/Event System • Pre-Processor • Sentence boundary detection • Annotating Named Entity (NER) • Protein • Small molecule • Gene • Cellular process • Compound/Complex Sentence Splitter • To simplify the complicated full texts

  39. POSBIOTM/Event System • Pre-Processor • Compound/Complex Sentence Splitter • Simple splitting rules • [S] NP1 VP1 NP2 [SBAR] that|which VP2 [/SBAR] [/S] ==> NP1 VP1 NP2 + NP2 VP2 • Example • “The best studied of these is EDG-1, which is implicated in cell migration and angiogenesis.” ==> 1. “The best studied of these is EDG-1.” 2. “EDG-1 is implicated in cell migration and angiogenesis.”
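
The splitting rule can be sketched in Python under the assumption that the sentence has already been chunked into labeled phrases. The `(label, text)` input format and the function name `split_relative_clause` are hypothetical; the slides do not describe the splitter's actual input representation.

```python
def split_relative_clause(chunks):
    """Rewrite 'NP1 VP1 NP2 that|which VP2' as 'NP1 VP1 NP2' + 'NP2 VP2'.

    `chunks` is a list of (label, text) pairs. Returns a list of two
    sentences, or None when the chunk sequence does not match the rule.
    """
    if [label for label, _ in chunks] != ["NP", "VP", "NP", "SBAR"]:
        return None
    np1, vp1, np2, sbar = (text for _, text in chunks)
    relativizer, _, vp2 = sbar.partition(" ")
    if relativizer not in ("that", "which") or not vp2:
        return None
    # NP2 is repeated as the subject of the second sentence.
    return [f"{np1} {vp1} {np2}.", f"{np2} {vp2}."]

# The example sentence from the slide, chunked by hand.
chunks = [
    ("NP", "The best studied of these"),
    ("VP", "is"),
    ("NP", "EDG-1"),
    ("SBAR", "which is implicated in cell migration and angiogenesis"),
]
simple_sentences = split_relative_clause(chunks)
```

On the slide's example this produces the same two simple sentences shown above.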

  40. POSBIOTM/Event System • Biological Event Extraction • Two-level Event Rule Learner

  41. POSBIOTM/Event System • Biological Event Extraction • Event Rule Learner • Adapts a supervised machine learning algorithm: WHISK • Learns rules in the form of context-based regular expressions • Induces the rules in a top-down manner • Ex) “{NP} .*? (<CP>)[E] {/NP} {VP} (<BI>)[I] {/VP} {NP} both (<P>)[R] and .*? {/NP}” • Limitations of WHISK • The longer the distance between event components, the harder it is to extract the correct event • WHISK considers all lexical words between event components • Cannot handle nested biological events • A two-level rule learning method is proposed to overcome the limitations of the flat rule learning method
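
To show how a rule of this shape could be applied, the sketch below translates the example rule into a Python regular expression. The inline encoding of the input (chunk markers such as `{NP}` and entity tags such as `<CP>...</CP>`), the group names, and the input sentence are all assumptions; the sentence is made up to fit the rule and is not a biological claim.

```python
import re

# Hand translation of the slide's example rule. Each slot marked [E], [I]
# or [R] in the rule becomes a named capture group.
RULE = re.compile(
    r"\{NP\} .*?<CP>(?P<effecter>[^<]+)</CP> \{/NP\} "
    r"\{VP\} <BI>(?P<interaction>[^<]+)</BI> \{/VP\} "
    r"\{NP\} both <P>(?P<reactant>[^<]+)</P> and .*? \{/NP\}"
)

# Hypothetical pre-processed sentence (chunked and entity-tagged).
TAGGED = (
    "{NP} <CP>cell migration</CP> {/NP} "
    "{VP} <BI>activates</BI> {/VP} "
    "{NP} both <P>EDG-1</P> and <P>EDG-3</P> {/NP}"
)

match = RULE.search(TAGGED)
event = match.groupdict() if match else None
```

The literal word `both` and the lazy `.*?` wildcards illustrate the limitation noted on the slide: the rule is tied to the lexical material between the components, so it becomes harder to match as that distance grows.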

  42. POSBIOTM/Event System • Biological Event Extraction • Two-level Event Rule Learner

  43. POSBIOTM/Event System • Biological Event Extraction • Event Extractor • Extracts events with the automatically generated rules • Uses regular expression pattern matching • Handles aliases and noun conjunctions • Aliases and noun conjunctions follow general patterns like ‘sphingosine-1-phosphate(SPP)’ or ‘FP, IP, and TP receptors’ • They are handled with simple rules like ‘A(B)’ or ‘A, B, C, and D’ • Removes sentences that include negative words • ‘not’, ‘never’, ‘fail’, etc.
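
The alias, conjunction, and negation handling can be sketched with Python's `re` module. The function names and the exact patterns are assumptions; the slide names only the 'A(B)' and 'A, B, C, and D' shapes and three negative words, so the word list here is limited to those three.

```python
import re

# Only the three words named on the slide; the real list is longer ("etc.").
NEGATIVE_WORDS = {"not", "never", "fail"}

def expand_alias(text):
    """'A(B)' -> ['A', 'B']; text without an alias is returned unchanged."""
    m = re.fullmatch(r"\s*(.+?)\s*\(([^()]+)\)\s*", text)
    return [m.group(1), m.group(2)] if m else [text.strip()]

def expand_conjunction(text):
    """'A, B, C, and D' -> ['A', 'B', 'C', 'D'].

    A shared head noun (such as 'receptors') stays on the last item only;
    distributing it over the other items is left out of this sketch.
    """
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", text)
    return [p.strip() for p in parts if p.strip()]

def has_negation(sentence):
    """True if the sentence contains one of the listed negative words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return any(token in NEGATIVE_WORDS for token in tokens)

aliases = expand_alias("sphingosine-1-phosphate(SPP)")
items = expand_conjunction("FP, IP, and TP receptors")
```

Filtering on `has_negation` before rule matching mirrors the slide's approach of dropping negated sentences rather than trying to interpret them.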

  44. POSBIOTM/Event System • Event Component Verifier

  45. POSBIOTM/Event System • Event Component Verifier • Removes incorrectly extracted events • Classifies template elements (P, G, SM, CP, BI, CI) into 4 classes • I (interaction), E (effecter), R (reactant), N (none) • I, E, R: event components • N: a template element, but not an event component • Uses a Maximum Entropy classifier • Features • POS tags, phrase chunks, the template element types of neighboring words, and semantic information

  46. POSBIOTM/Event System • Event Component Verifier

  47. POSBIOTM/Event System • Event Component Verifier • Example

  48. POSBIOTM/Event System • Experiment and Discussion • 500 MEDLINE abstracts including 2,314 biological events & 10-fold cross validation • Flat rule learner vs. two-level rule learner • Before verification vs. after verification • Performance comparison • Learning Information Extractors for Proteins and their Interactions (2004) - Razvan Bunescu et al. • 1000 abstracts & 10-fold cross validation

  49. POSBIOTM/Event System • Experiment and Discussion • Trade-off between precision and recall • Before verification: large gap between precision and recall • After verification: small gap between precision and recall • Threshold: rules are cut according to a measure of how many of the events extracted by a rule are correct
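
The rule threshold can be read as a per-rule precision cut-off. The Python sketch below assumes each rule comes with counts of correct and total extracted events; the rule names, the counts, and the choice of data on which they are measured are hypothetical, since the slide does not specify them.

```python
def filter_rules(rule_stats, threshold):
    """Keep rules whose precision (correct / extracted) meets the threshold.

    `rule_stats` maps a rule name to a (correct, extracted) pair of counts.
    Rules that extracted nothing are treated as having precision 0.
    """
    kept = []
    for name, (correct, extracted) in rule_stats.items():
        precision = correct / extracted if extracted else 0.0
        if precision >= threshold:
            kept.append(name)
    return kept

# Hypothetical counts for three learned rules.
rule_stats = {"rule_a": (9, 10), "rule_b": (3, 10), "rule_c": (0, 0)}

strict = filter_rules(rule_stats, 0.5)
lenient = filter_rules(rule_stats, 0.2)
```

Raising the threshold keeps fewer, more reliable rules, which favors precision; lowering it keeps more rules and favors recall.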

  50. POSBIOTM/Event System • Experiment and Discussion • Consistently good performance regardless of the rule learner's threshold
