A Bio Text Mining Workbench combined with Active Machine Learning

Gary Geunbae Lee • Postech • 11/25 • LBM2005

Contents • Introduction • POSBIOTM/W Workbench • POSBIOTM/NER System • POSBIOTM/NER with Active Machine Learning • POSBIOTM/Event System • Current status (demo)

Introduction • Exponentially growing biological publications

Introduction • Two key issues in dealing with biological texts • Biological named entity recognition • Extraction of biological interactions (events) between biological entities • Important for biological pathways

Introduction • Bio-text mining workbench • Development workbenches (common in NLP) • Grammar development workbench • POS/tree tagging workbench • Use a large amount of corpus data • Machine learning methods are used in the NER task and the event extraction task • An annotated corpus is essential for good results with machine-learning-based methods (in both quantity and quality) • Lack of annotated corpora (notorious in the bio/medical fields) • Need • Tools that support collecting, managing, creating, annotating and exploiting rich biomedical text resources • Tools that interact with the automatic system to grow the high-quality annotated corpus


POSBIOTM/W: A development Workbench • Overall Design

POSBIOTM/W Workbench • Managing Tool • Goal • Help users search, collect and manage publications • Quick Search Bar • Provides quick access to PubMed • PubMed Search Assistant • Users can select specific abstracts for named-entity tagging and event extraction

POSBIOTM/W Workbench • Managing Tool • PubMed Search Assistant

POSBIOTM/W Workbench • NER Tool • Named-entity recognition (NER) task • Identification of the relevant material names • Goal: automatically and effectively annotate biomedical entities • The NER Tool is a client tool of the POSBIOTM/NER System • Currently, three NER models are provided • The GENIA-NER model, the GENE-NER model and the GPCR-NER model • Named-entity recognition with active learning • To minimize the human labeling effort

POSBIOTM/W Workbench • NER Tool • Named-entity recognition with Active learning

POSBIOTM/W Workbench • Event Extraction Tool • Goal: extract events, each consisting of an “interaction”, an “effecter”, and a “reactant” • Named-entity types: protein (P), gene (G), small molecule (SM), and cellular process (CP) • Interaction types: biological interaction (BI) and chemical interaction (CI) • The Event Extraction Tool is a client tool of the POSBIOTM/Event System

POSBIOTM/W Workbench • Event Extraction Tool • Extraction Result in XML format
<Result>
  <NER>
    ....
    <Sentence SNum = "4"><protein>EDG-1</protein>, encoded by the <gene>endothelial_differentiation_gene-1</gene> , is a <protein>heterotrimeric_guanine_nucleotide_binding_protein-coupled_receptor</protein> ( <protein>GPCR</protein> ) for <small_molecule>sphingosine-1-phosphate</small_molecule> ( <small_molecule>SPP</small_molecule> ) that has been shown to stimulate <cellular_process>angiogenesis</cellular_process> and <cellular_process>cell_migration</cellular_process> in cultured endothelial cells. </Sentence>
    .....
  </NER>
  <Event_Extraction>
    <Event SNum = "4">
      <Interaction>stimulate</Interaction>
      <Effecter>sphingosine-1-phosphate</Effecter>
      <Reactant>angiogenesis</Reactant>
    </Event>
    .....
  </Event_Extraction>
</Result>

POSBIOTM/W Workbench • Event Extraction Tool • Extraction Result

POSBIOTM/W Workbench • Annotation Tool • Goal • The GUI-based Annotation Tool is designed for editing annotations manually • Named-entity editing • Named entities are displayed in different colors, which can be changed • Add, remove or correct named-entity tags, change the boundaries of named entities, etc.

POSBIOTM/W Workbench • Annotation Tool • Event editing • Extracted events are displayed in a table • Double-clicking an event looks up the original sentence from which it was extracted • Upload function • Users can upload well-annotated data to the POSBIOTM system • Incremental build-up of a large named-entity and event annotation corpus

POSBIOTM/W Workbench • Annotation Tool

POSBIOTM/NER System • Named Entity Recognition (NER) • Approach • The named entity recognition problem is treated as a classification problem: each input token is marked up with a named entity category label • CRF • Conditional random fields (CRFs) [Lafferty et al. 2001] are a probabilistic framework for labeling and segmenting sequential data (s: state (tag); o: input)
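The equation from this slide did not survive extraction. For reference, the standard linear-chain CRF conditional probability, written with the slide's symbols (s for the state/tag sequence, o for the input sequence), is sketched here. The feature functions f_k, weights λ_k and sequence length T are the usual textbook notation and are assumed, not taken from the slide.

```latex
P(\mathbf{s} \mid \mathbf{o})
  = \frac{1}{Z(\mathbf{o})}
    \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(s_{t-1}, s_t, \mathbf{o}, t) \right),
\qquad
Z(\mathbf{o})
  = \sum_{\mathbf{s}'}
    \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(s'_{t-1}, s'_t, \mathbf{o}, t) \right)
```

Here Z(o) normalizes over all candidate tag sequences s', so the probabilities of all taggings of one input sum to 1.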

POSBIOTM/NER System • Named Entity Recognition (NER) • Feature Set

POSBIOTM/NER System • NER Models • Three NER models • GENIA model / GENE-NER model / GPCR-NER model • GENIA model • Named entity classes used in the evaluation: DNA, RNA, protein, cell_line and cell_type • The training data consists of 2,000 MEDLINE abstracts from the GENIA version 3 corpus. These abstracts were collected using the search terms “human”, “blood cell” and “transcription factor”. • The test data comes from a super-domain of the training data (“blood cell”, “transcription factor”).

POSBIOTM/NER System • NER Models • GENE-NER model • The GENE-NER module uses the BioCreative corpus. • The aim of the GENE-NER module is to identify which terms in biomedical research articles are gene and/or protein names. • The training corpus consists of 7.5k sentences, selected from MEDLINE according to their likelihood of containing gene names. • GPCR-NER module (Postech) • Aims at recognizing four target named entity categories: protein, gene, small molecule and cellular process. • The training corpus consists of 50 full articles related to the GPCR (G-protein coupled receptor) signal transduction pathway.

POSBIOTM/NER System • NER Models • Evaluation for Three NER models

POSBIOTM/NER with Active Learning • Active Learning in NER • NER with Machine Learning • To enhance the NER performance through the idea of re-using the annotated data and re-training the NER module • NER with Active Machine Learning • To minimize the human labeling effort without degrading the performance • To select the most informative samples for training

POSBIOTM/NER with Active Learning • Active Learning in NER Framework

POSBIOTM/NER with Active Learning • Active Learning Scoring Strategy • Uncertainty-based Sample Selection • Using an entropy-based measure to quantify the uncertainty that the current classifier holds (entropy or normalized entropy of the CRF conditional probability) • The most uncertain samples are selected for human annotation
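A minimal sketch of the uncertainty scoring described on this slide. It assumes that each pool sentence comes with a probability distribution over candidate label sequences (for example, renormalized N-best CRF outputs). The slide does not say which distribution is used, and normalizing by log N is one common choice, not necessarily the authors'. The function names and the toy pool are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def normalized_entropy(probs):
    """Entropy divided by its maximum log(N), so the score lies in [0, 1]."""
    if len(probs) < 2:
        return 0.0
    return entropy(probs) / math.log(len(probs))


def select_most_uncertain(pool, k, score=entropy):
    """Return the ids of the k sentences with the highest uncertainty score."""
    ranked = sorted(pool, key=lambda sid: score(pool[sid]), reverse=True)
    return ranked[:k]


# Hypothetical pool: sentence id -> probabilities of candidate label sequences.
pool = {
    "s1": [0.97, 0.02, 0.01],  # classifier is confident
    "s2": [0.40, 0.35, 0.25],  # classifier is unsure
    "s3": [0.70, 0.20, 0.10],
}
to_annotate = select_most_uncertain(pool, k=2)
```

With this toy pool, the near-uniform distribution of `s2` ranks first, so it would be sent to the human annotator before the others.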

POSBIOTM/NER with Active Learning • Active Learning Scoring Strategy • Diversity-based Sample Selection • To catch the most representative sentences in each sampling. • The divergence measures of the two sentences are represented by the minimum similarity among the examples • The similarity score of two words • The similarity score of two sentences (for syntactic path)

POSBIOTM/NER with Active Learning • Active Learning Scoring Strategy • MMR(Maximal Marginal Relevance) method • The two measures for uncertainty and diversity will be combined using the MMR method to give the sampling scores in our active learning strategy
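One way to read the MMR combination is the greedy selection sketched here: each candidate is scored as `lam * uncertainty - (1 - lam) * max similarity to the sentences already selected`. The weight `lam`, the function names and the toy data are hypothetical. The Jaccard token overlap is only a stand-in for the word-level and sentence-level similarity scores of the previous slide, whose formulas did not survive in this text.

```python
def jaccard(a, b):
    """Toy sentence similarity: token-set overlap (a stand-in measure)."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def mmr_select(sentences, uncertainty, k, lam=0.5, sim=jaccard):
    """Greedily pick k sentence ids, trading uncertainty against redundancy."""
    selected = []
    remaining = sorted(sentences)  # sorted ids give deterministic tie-breaking

    def mmr_score(sid):
        # Redundancy: highest similarity to anything already selected.
        redundancy = max(
            (sim(sentences[sid], sentences[c]) for c in selected), default=0.0
        )
        return lam * uncertainty[sid] - (1.0 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical pool with precomputed uncertainty scores.
sentences = {
    "a": "IL-2 activates NF-kappa B",
    "b": "IL-2 activates NF-kappa B strongly",
    "c": "p53 binds DNA",
}
uncertainty = {"a": 0.9, "b": 0.85, "c": 0.6}

balanced = mmr_select(sentences, uncertainty, k=2, lam=0.5)
uncertainty_only = mmr_select(sentences, uncertainty, k=2, lam=1.0)
```

With `lam=1.0` the selection reduces to pure uncertainty sampling and picks the two near-duplicate sentences. With `lam=0.5` the second pick is penalized for overlapping with the first, so the less uncertain but different sentence is chosen instead.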

POSBIOTM/NER with Active Learning • Experiment and Discussion • Training Data • 2,000 MEDLINE abstracts from the GENIA corpus • 5 named entity classes • DNA, RNA, protein, cell line, cell type • Test Data • 404 abstracts • Half of them are from the same domain as the training data and the other half are from the super-domain of ‘blood cell’ and ‘transcription factor’

POSBIOTM/NER with Active Learning • Experiment and Discussion • Pool-based sample selection • 100 abstracts were used to train the initial NER module • Each time, we chose k examples (sentences) from the given pool to train the new NER module • The number k varied from 1,000 to 17,000 with a step size of 1,000 • Active learning methods tested • Random selection • Entropy-based uncertainty selection • Entropy combined with diversity • Normalized entropy combined with diversity

POSBIOTM/NER with Active Learning • Experiment and Discussion

POSBIOTM/NER with Active Learning • Experiment and Discussion • All three active learning strategies outperform random selection • The combined strategy reduces the number of training examples by 24.64% compared with random selection • The normalized combined strategy reduces the number of training examples by 35.43% compared with random selection • Diversity increases the classifier’s performance when a large number of samples are selected • Up to 4,000 sentences, the entropy strategy and the combined strategy perform similarly • After the 11,000-sentence point, the combined strategy surpasses the entropy strategy

POSBIOTM/Event System • System Architecture

POSBIOTM/Event System • Target Slot Definition • Template Element • Entities - participants of an event • protein (P), gene (G), small molecule (SM), cellular process (CP) • Interaction - relationship between entities • biological interaction (BI) – Functional interaction • About how/whether one component affects the other's status biologically • chemical interaction (CI) – Molecular interaction • About the interaction among entities at the molecular structural level • Event • One Interaction (I) • Connecting the effecter and reactant • Interaction keywords (BI, CI) • One Effecter (E) • Provoking an event • Template element (P, G, SM, CP) or nested event • One Reactant (R) • Responding to an effecter • Template element (P, G, SM, CP) or nested event
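The slot definition above can be mirrored in a small data structure. This is an illustrative sketch: the class and field names are hypothetical, not the system's actual schema. The first example event is the one from the XML result shown earlier, with its interaction type (BI) assumed. The nested event and its `protein_X` entity are invented purely to show nesting.

```python
from dataclasses import dataclass
from typing import Union

ENTITY_TYPES = {"P", "G", "SM", "CP"}  # protein, gene, small molecule, cellular process
INTERACTION_TYPES = {"BI", "CI"}       # biological / chemical interaction


@dataclass(frozen=True)
class Entity:
    text: str
    etype: str

    def __post_init__(self):
        if self.etype not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.etype}")


@dataclass(frozen=True)
class Interaction:
    keyword: str
    itype: str

    def __post_init__(self):
        if self.itype not in INTERACTION_TYPES:
            raise ValueError(f"unknown interaction type: {self.itype}")


@dataclass(frozen=True)
class Event:
    interaction: Interaction
    effecter: Union[Entity, "Event"]  # a template element or a nested event
    reactant: Union[Entity, "Event"]

    def depth(self):
        """Nesting depth: 1 for a flat event, more when slots hold events."""
        nested = [
            slot.depth()
            for slot in (self.effecter, self.reactant)
            if isinstance(slot, Event)
        ]
        return 1 + max(nested, default=0)


# Event from the XML example; the BI label is an assumption.
flat = Event(
    Interaction("stimulate", "BI"),
    effecter=Entity("sphingosine-1-phosphate", "SM"),
    reactant=Entity("angiogenesis", "CP"),
)
# Invented nested event, only to illustrate an event used as a reactant.
nested = Event(Interaction("inhibit", "BI"), Entity("protein_X", "P"), flat)
```

Allowing an `Event` in the effecter and reactant slots is what lets one event act as a participant of another, which is the nested case named on the slide.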

POSBIOTM/Event System • Target Slot Definition • Example

POSBIOTM/Event System • Pre-Processor • Sentence boundary detection • Annotating Named Entity (NER) • Protein • Small molecule • Gene • Cellular process • Compound/Complex Sentence Splitter • To simplify the complicated full texts

POSBIOTM/Event System • Pre-Processor • Compound/Complex Sentence Splitter • Simple splitting rules • [S] NP1 VP1 NP2 [SBAR] that|which VP2 [/SBAR] [/S] ==> NP1 VP1 NP2 + NP2 VP2 • Example • “The best studied of these is EDG-1, which is implicated in cell migration and angiogenesis.” ==> 1. “The best studied of these is EDG-1.” 2. “EDG-1 is implicated in cell migration and angiogenesis.”
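The splitting rule above can be sketched on pre-chunked input. This assumes that a parser or chunker has already produced the phrase sequence. The `(label, text)` representation, the `REL` label for that|which, and the function name are hypothetical simplifications, not the system's actual interface.

```python
def split_relative_clause(chunks):
    """Apply NP1 VP1 NP2 [that|which VP2] ==> NP1 VP1 NP2 + NP2 VP2.

    `chunks` is a list of (label, text) pairs. Input that does not match
    the pattern is returned as a single unsplit sentence.
    """
    labels = [label for label, _ in chunks]
    texts = [text for _, text in chunks]
    if labels != ["NP", "VP", "NP", "REL", "VP"]:
        return [" ".join(texts) + "."]
    np1, vp1, np2, _relativizer, vp2 = texts
    # NP2 is copied in as the subject of the former relative clause.
    return [f"{np1} {vp1} {np2}.", f"{np2} {vp2}."]


# The slide's example, chunked by hand.
chunks = [
    ("NP", "The best studied of these"),
    ("VP", "is"),
    ("NP", "EDG-1"),
    ("REL", "which"),
    ("VP", "is implicated in cell migration and angiogenesis"),
]
simple_sentences = split_relative_clause(chunks)
```

For the slide's example this yields the same two simple sentences as listed above.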

POSBIOTM/Event System • Biological Event Extraction • Two-level Event Rule Learner

POSBIOTM/Event System • Biological Event Extraction • Event Rule Learner • Adapts a supervised machine learning algorithm: WHISK • Learns rules in the form of context-based regular expressions • Induces rules in a top-down manner • Ex) “{NP} .*? (<CP>)[E] {/NP} {VP} (<BI>)[I] {/VP} {NP} both (<P>)[R] and .*? {/NP}” • Limitations of WHISK • The longer the distance between event components, the more difficult it is to extract the correct event • WHISK considers all lexical words between event components • Cannot handle nested biological events • We propose a two-level rule learning method to overcome the limitations of the flat rule learning method

POSBIOTM/Event System • Biological Event Extraction • Two-level Event Rule Learner

POSBIOTM/Event System • Biological Event Extraction • Event Extractor • Extracts events with the automatically generated rules • Uses regular expression pattern matching • Handles aliases and noun conjunctions • Aliases and noun conjunctions follow general patterns like ‘sphingosine-1-phosphate(SPP)’ or ‘FP, IP, and TP receptors’ • They are handled with simple rules like ‘A(B)’ or ‘A, B, C, and D’ • Removes sentences that include negative words • ‘not’, ‘never’, ‘fail’, etc.
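The alias, conjunction and negation handling above can be sketched with a few regular expressions. The patterns are illustrative assumptions, not the system's actual rules. In particular, distributing a shared head noun over the conjuncts ('FP, IP, and TP receptors') and matching the inflected forms 'fails' and 'failed' are choices made for this sketch.

```python
import re

# 'A(B)': a name followed by its alias in parentheses.
ALIAS = re.compile(r"(?P<full>[\w-]+)\s*\(\s*(?P<alias>[\w-]+)\s*\)")
# 'A, B, C, and D', optionally followed by a shared head noun.
CONJUNCTION = re.compile(
    r"(?P<items>.+?),?\s+and\s+(?P<last>\S+)(?:\s+(?P<head>.+))?"
)
# Negative words listed on the slide, plus assumed inflections of 'fail'.
NEGATION = re.compile(r"\b(?:not|never|fail(?:s|ed)?)\b", re.IGNORECASE)


def find_aliases(text):
    """Return (full name, alias) pairs for every 'A(B)' pattern."""
    return [(m.group("full"), m.group("alias")) for m in ALIAS.finditer(text)]


def expand_conjunction(phrase):
    """Split 'A, B, and C head' into ['A head', 'B head', 'C head']."""
    m = CONJUNCTION.fullmatch(phrase)
    if m is None:
        return [phrase]
    parts = [p.strip() for p in m.group("items").split(",")] + [m.group("last")]
    head = m.group("head")
    return [f"{p} {head}" if head else p for p in parts]


def has_negation(sentence):
    """True if the sentence contains one of the negative words."""
    return NEGATION.search(sentence) is not None


aliases = find_aliases("sphingosine-1-phosphate(SPP)")
receptors = expand_conjunction("FP, IP, and TP receptors")
```

A sentence for which `has_negation` returns true would be dropped before rule matching, which is a coarse filter: it also discards sentences where the negation does not apply to the event itself.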

POSBIOTM/Event System • Event Component Verifier

POSBIOTM/Event System • Event Component Verifier • Removes incorrectly extracted events • Classifies template elements (P, G, SM, CP, BI, CI) into 4 classes • I (interaction), E (effecter), R (reactant), N (none) • I, E, R: components of an event • N: a template element, but not an event component • Uses a maximum entropy classifier • Features • POS tags, phrase chunks, the template element types of neighboring words, and semantic information

POSBIOTM/Event System • Event Component Verifier

POSBIOTM/Event System • Event Component Verifier • Example

POSBIOTM/Event System • Experiment and Discussion • 500 MEDLINE abstracts including 2,314 biological events & 10-fold cross validation • Flat rule learner vs. two-level rule learner • Before verification vs. after verification • Performance comparison • Learning Information Extractors for Proteins and their Interactions (2004), Razvan Bunescu et al. • 1,000 abstracts & 10-fold cross validation

POSBIOTM/Event System • Experiment and Discussion • Trade-off between precision and recall • Before verification: large gap between precision and recall • After verification: small gap between precision and recall • Threshold: prune rules according to a measure of how many of the events extracted by a rule are correct
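A sketch of how such a threshold could trade precision against recall. It assumes the rule measure is plain per-rule precision (correct extractions divided by all extractions). The slide does not give the exact formula, so this is an assumption, and the rule names and counts are made up for illustration.

```python
def rule_precision(correct, extracted):
    """Fraction of a rule's extracted events that are correct."""
    return correct / extracted if extracted else 0.0


def prune_rules(rule_stats, threshold):
    """Keep the rules whose precision reaches the threshold."""
    return [
        name
        for name, (correct, extracted) in rule_stats.items()
        if rule_precision(correct, extracted) >= threshold
    ]


def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1 from event-level counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Hypothetical per-rule counts: rule name -> (correct, extracted).
rule_stats = {"r1": (9, 10), "r2": (3, 10), "r3": (0, 0)}
kept = prune_rules(rule_stats, threshold=0.5)
```

Raising the threshold keeps only the more reliable rules, which tends to raise precision and lower recall. That is the gap the verifier is reported to narrow.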

POSBIOTM/Event System • Experiment and Discussion • Consistently good performance regardless of the rule learner's threshold
