
Improving Scalability of Support Vector Machines for Biomedical Named Entity Recognition


Presentation Transcript


  1. Improving Scalability of Support Vector Machines for Biomedical Named Entity Recognition Ph.D. Thesis Proposal Presented By Mona Soliman Habib December 2007

  2. Outline • Objectives • Named Entity Recognition • NER Challenges • Support Vector Machines • SVM Challenges • Research Proposal • Baseline Experiments and Results

  3. Objectives • Explore the scalability problems associated with solving the named entity recognition problem using high-dimensional input space and support vector machines. • Propose a solution that improves SVM scalability for multi-class problems. • Propose an NER solution that fosters language and domain independence. • Apply the proposed solution to the biomedical domain. • (Optional) Present auxiliary issues related to SVM usability and recommend an architecture.

  4. Named Entity Recognition • Information extraction task • Identification/classification of words or groups of words denoting a concept or entity • E.g., person, location, gene, company • Entities may be relevant only to a specific domain, e.g., pneumonia is a disease • A language- or domain-specific NER solution may not be useful for other languages or domains

  5. General NER Example • Day 2 of “Oprahpalooza” begins in [ORG SC] . She says state of nation , belief in candidate led to her first endorsement . [ORG Associated Press] . [LOC COLUMBIA] , [LOC S.C.] - Media mogul [PER Oprah Winfrey] on Sunday told thousands of people in a football stadium in this early voting state to shrug off [PER Barack Obama] 's detractors and help him " seize the opportunity " in his bid for the [LOC White House] . " [LOC South Carolina] — January 26 th is your moment , " [PER Winfrey] said , referring to the state [MISC Democratic] primary date during a campaign stop alongside the [LOC Illinois] senator . " It 's your time to seize the opportunity to support a man who , as the [PER Bible] says , loves mercy and does justly . " [PER Obama] 's campaign said more than 29,000 attended the event at the [ORG University of South Carolina] 's football stadium . It had the feel of a rock concert , with bands playing for early arrivals and campaign supporters yelling " fire it up " to the crowd . Text Source: http://www.msnbc.msn.com/id/22160762/ 12/09/2007. NE Output from http://l2r.cs.uiuc.edu/~cogcomp/eoh/nedemo.html. Legend: PER = person, LOC = location, ORG = organization, MISC = miscellaneous

  6. NER Solution Approaches • Statistical, probabilistic, conditional, inference, .. • Machine learning (Supervised or unsupervised) • Hidden Markov Model • Maximum entropy approach • Decision trees • Rule-based models • Memory-based approach • Support vector machines • AdaBoost, and other approaches • Combination of different approaches

  7. Language-Specific Tools • Part-of-speech tags • Noun phrase tags, syntactic tags • Grammar rules • Affix information (character n-grams) • Orthographic patterns • Lexical features • Punctuation & parentheses handling • Word triggers, word roots, word variations

  8. Domain-Specific Tools • Specialized dictionaries • Gazetteers (reference information) • Bag of words • Definition of rules describing entities and their possible contexts • Cascaded entities • Other external resources

  9. Language and Domain Independence: Why? • Incorporating language or domain-specific knowledge requires additional pre- and/or post-processing. • Additional tasks, such as part-of-speech tagging or rule definition, are labor and time intensive. • It’s not easy to incorporate new information if/when it becomes available. • Solutions are not easily portable across domains or languages.

  10. NER in Biomedical Domain • Challenging domain for NER • Growing nomenclature; large number of new articles, reports, records, .. • Ambiguity in identifying the left boundary of multi-word entities • Strong overlap among different entities • Difficult to annotate training data • Rule definition or inference is difficult • Linguistic information may add no value

  11. Biomedical NER Example TI - Involvement of Extracellular Signal-Regulated Kinase Module in [HIV]virus - Mediated [CD4]protein Signals Controlling Activation of [Nuclear Factor-kappa B]protein and [AP-1]protein Transcription Factors AB - Although the molecular mechanisms by which the [HIV-1]virus triggers either [T cell]cell_type activation, anergy, or apoptosis remain poorly understood, it is well established that the interaction of [HIV-1]virus envelope glycoproteins with [cell surface]cell_line [CD4]protein delivers signals to the target cell, resulting in activation of transcription factors such as [NF-kappa B]protein and [AP-1]protein. In this study, we report the first evidence indicating that kinases [MEK-1]protein ([MAP kinase/Erk kinase]protein) and [ERK-1]protein ([extracellular signal-regulated kinase]protein) act as intermediates in the cascade of events that regulate [NF-kappa B]protein and [AP-1]protein activation upon [HIV-1]virus binding to [cell surface]cell_line [CD4]protein. Annotation: GENIA Corpus - Article Source: (Briant et al. 1998) The Journal of Immunology. ERK-1: example of a single-word protein name. extracellular signal-regulated kinase: example of a multi-word protein name.

  12. Biomedical NER Example We have shown that [interleukin-1]protein ([IL-1]protein) and [IL-2]protein control [IL-2 receptor alpha (IL-2R alpha) gene]DNA transcription in [CD4-CD8-murine T lymphocyte precursors]cell_line. Example MEDLINE sentence marked up for molecular biology named entities. Source: (Collier and Takeuchi 2004). interleukin-1 and IL-2: examples of single-word protein names. IL-2 receptor alpha (IL-2R alpha) gene: example of a multi-word DNA name.

  13. NER Challenges • Entities may appear in any form • Patterns may be difficult to discover • Discovering the boundaries of multi-word entities is challenging • Supervised learning requires labeled training data, which is not easy to obtain • Positive examples are usually scarce • Unbalanced representation of the different classes in the training corpus

  14. NER Solution? Now let’s look into Support Vector Machines as a machine learning solution for Named Entity Recognition

  15. Support Vector Machines • Powerful tool for pattern recognition • Based on Vapnik’s statistical learning theory (Vapnik 1995) • Kernel-based machine learning • Increasingly popular due to its high generalization ability and handling of high-dimensional input space

  16. Linearly Separable Case Problem: How to find a “good” decision boundary? (Figure: two linearly separable classes, Class 1 and Class 2.)

  17. Maximum Margin Decision Boundary Solution: Maximize the margin m between parallel supporting planes. (Figure: separating hyperplane with normal vector w and margin m between Class 1 and Class 2.)

  18. Non-Linearly Separable Case The input space is mapped into a higher-dimensional feature space where the classes become linearly separable. (Figure: mapping f(.) from input space to feature space.)

  19. SVM Optimization Problem • Linearly separable case: Minimize $\frac{1}{2}\|w\|^2$ such that $y_i (w \cdot x_i + b) \geq 1$ for all training examples $(x_i, y_i)$ • Non-linearly separable case: Minimize $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ such that $y_i (w \cdot x_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$, where C is a user-defined parameter and the $\xi_i$ are the slack variables, or margin errors.

  20. The Dual Problem • Solving the optimization problem is equivalent to solving its dual problem: find $\alpha$ that minimizes $\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_i \alpha_i$ subject to $\sum_i \alpha_i y_i = 0$ and $0 \leq \alpha_i \leq C$ • The resulting SVM is of the form $f(x) = \operatorname{sign}\big(\sum_i \alpha_i y_i (x_i \cdot x) + b\big)$
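
To make the dual form concrete, here is a minimal sketch that inspects the dual solution returned by an off-the-shelf solver (scikit-learn's SVC); the toy data points and the value of C are illustrative assumptions:

```python
# Minimal sketch: inspect the dual solution of a linear SVM on a toy dataset.
# The data points and C value are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5], [3.0, 3.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# The decision function is f(x) = sign(sum_i alpha_i y_i (x_i . x) + b);
# only the support vectors have non-zero alpha_i.
print(clf.support_vectors_)  # the x_i with non-zero alpha_i
print(clf.dual_coef_)        # alpha_i * y_i for each support vector
print(clf.intercept_)        # the bias term b
```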

  21. The Kernel “Trick” • There exists a mapping $\Phi$ such that $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$ • The dual problem becomes: Minimize $\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_i \alpha_i$ subject to the same constraints • So, using the kernel, we do not need to compute the vector dot products in the high-dimensional feature space; they are obtained implicitly from quantities in the lower-dimensional input space instead.
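
A quick numerical check of this equivalence, using the homogeneous degree-2 polynomial kernel on 2-D inputs (an illustrative sketch, not part of the proposal):

```python
# Sketch: a kernel value equals a dot product in an implicit feature space.
# For K(x, z) = (x . z)^2 on 2-D inputs, the explicit feature map is
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2).
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

def poly2_kernel(x, z):
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(np.dot(phi(x), phi(z)))  # dot product computed in the 3-D feature space
print(poly2_kernel(x, z))      # same value computed in the 2-D input space
```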

  22. Examples of Kernels • Linear kernel: $K(x, z) = x \cdot z$ • Polynomial kernel: $K(x, z) = (x \cdot z + 1)^d$ • Radial basis function kernel: $K(x, z) = \exp(-\|x - z\|^2 / 2\sigma^2)$ • Sigmoid function kernel: $K(x, z) = \tanh(\kappa \, x \cdot z + \theta)$ • It is also possible to use other kernel functions to solve specific problems

  23. Single Class SVM • Binary classification problem, i.e., a point either “belongs” to a class or does not belong to it • Direct application of the theory • Two popular implementations: LibSVM and SVM-Light • Useful for applications that look for a yes/no answer (e.g., intrusion detection)

  24. Multi-Class SVM • A given point belongs to “some” class • Requires multiple separating hyperplanes to identify the different classes • Different multi-class approaches: • One-against-one • One-against-all • Half-against-half • Solved by building several binary SVMs and attempting to classify a point with each of them → total time ≈ binary training time x n (see the sketch after this slide) • The all-together approach builds one SVM that optimizes all separating hyperplanes at the same time → a much bigger optimization problem
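
A minimal sketch of the one-against-all and one-against-one strategies built from several binary SVMs (the synthetic dataset and parameters are illustrative assumptions):

```python
# Sketch: decompose a 3-class problem into several binary SVMs.
# one-against-all trains n classifiers; one-against-one trains n(n-1)/2.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

ova = OneVsRestClassifier(SVC(kernel="linear", C=1.0)).fit(X, y)
ovo = OneVsOneClassifier(SVC(kernel="linear", C=1.0)).fit(X, y)

print(len(ova.estimators_))  # 3 binary SVMs (one per class)
print(len(ovo.estimators_))  # 3 binary SVMs (one per pair of classes)
```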

  25. Multi-Class Boundaries Overlapping areas are unclassifiable regions. (Figure: three-class decision boundaries under the one-against-all, all-together, and one-against-one approaches.)

  26. SVM Positive Features • Mathematically sound • Geometric intuition • Theoretical guarantees • Optimization algorithms exist • Can be applied to a variety of problems • SVM vs. neural networks or decision trees: • No problems with local minima • Fewer learning parameters to select • Stable and reproducible results

  27. SVM Scalability Issues • Optimization requires O(n³) time and O(n²) memory for single-class training, where n is the input size (depends on the algorithm used) • Multi-class performance depends on the approach used and degrades with more classes • Slow training, especially with non-linear kernels (see the timing sketch after this slide) • Possible remedies: • Reduce input data size (pruning, chunking, clustering) • Reduce the number of support vectors • Reduce input feature dimensionality
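
To make the growth in training time concrete, a small timing sketch on synthetic data (the sizes, dimensionality, and parameters are arbitrary assumptions; absolute times depend on the machine):

```python
# Sketch: rough illustration of how kernel-SVM training time grows with input size n.
# The synthetic data and sizes are only meant to show the trend, not to benchmark anything.
import time
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)

for n in (500, 1000, 2000, 4000):
    X = rng.randn(n, 50)
    y = (X[:, 0] + 0.1 * rng.randn(n) > 0).astype(int)
    start = time.time()
    SVC(kernel="rbf", C=1.0).fit(X, y)
    print(f"n={n:5d}  training time = {time.time() - start:.2f}s")
```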

  28. Towards a Practical NER/SVM Solution How to achieve a practical, scalable, and expandable NER/SVM solution?

  29. Research Proposal • Address SVM scalability issues • Special focus on multi-class all-together optimization problem • Apply proposed solution to biomedical named entity recognition • Recommend a framework that promotes future research work through easy expandability and maintainability

  30. Two Phases • Phase One: Baseline Experiments • Explore the scalability issues through a set of NER/SVM experiments using biomedical abstracts • Identify auxiliary usability problems • Phase Two: Proposed Research • Address multi-class scalability issues • Recommend dynamic architecture to improve SVM usability

  31. Key Ideas for Experiments • Eliminate the use of prior language and domain-specific knowledge • Capitalize on SVM’s ability to handle high-dimensional input space • Generate a very high number of binary orthographic and contextual features • Character and word n-grams do not have to make linguistic sense, e.g., form a meaningful prefix or suffix or a logical sequence of words (see the sketch after this slide) • Minimize pre- and post-processing as much as possible
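
A minimal sketch of generating such character n-gram and contextual word features without any linguistic filtering (the feature-naming scheme, n-gram sizes, and window size are illustrative assumptions, not the jFex templates used in the experiments):

```python
# Sketch: binary character n-gram and word-context features for a token.
# No linguistic filtering: the n-grams need not be meaningful prefixes or suffixes.
def char_ngrams(token, n_values=(2, 3, 4)):
    features = set()
    for n in n_values:
        for i in range(len(token) - n + 1):
            features.add(f"char_{n}gram={token[i:i + n]}")
    return features

def word_context(tokens, position, window=3):
    # contextual features: surrounding words within +/- window positions
    features = set()
    for offset in range(-window, window + 1):
        j = position + offset
        if offset != 0 and 0 <= j < len(tokens):
            features.add(f"word[{offset:+d}]={tokens[j]}")
    return features

tokens = "NF-kappa B activation upon HIV-1 binding".split()
print(sorted(char_ngrams(tokens[0])))
print(sorted(word_context(tokens, 2)))
```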

  32. Baseline Experiments • Using the JNLPBA-04 challenge task data (GENIA biomedical abstracts) • Features generated using jFex (Giuliano 2005) • Single class: find PROTEIN names • Binary classification using SVM-Light (Joachims 2002) • Multi-class: find all classes (PROTEIN, DNA, RNA, CELL-TYPE, CELL-LINE) • All-together classification using Joachims’ SVM-Multiclass implementation • Precision/Recall/F-score performance results are comparable to published results

  33. The JNLPBA-04 Datasets • Training data = 2,000 abstracts (492,551 tokens) • Test data = 404 abstracts (101,039 tokens) • Positive examples in training data = 0.2% - 0.6%

  34. Common Architecture

  35. Baseline Experiments Design

  36. Feature Selection • Orthographic features: • Capitalization: token begins with a capital letter. • Numeric: token is a numeric value. • Punctuation: token is a punctuation mark. • Uppercase: token is all in uppercase. • Lowercase: token is all in lowercase. • Single character: token length is equal to one. • Symbol: token is a special character. • Includes hyphen: one of the characters is a hyphen. • Includes slash: one of the characters is a slash. • Letters and digits: token is alphanumeric. • Capitals and digits: token contains capitals and digits. • Includes caps: some characters are in uppercase. • General regular expression summarizing the word shape. • Contextual features: • Each word is considered a feature. • Collocations of tokens within a window of three positions around the token itself. (A sketch of such features follows this slide.)
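
A rough sketch of binary orthographic features along these lines (the exact patterns and the word-shape expression are illustrative assumptions, not the jFex feature definitions used in the experiments):

```python
# Sketch: orthographic features similar to the list above, plus a word-shape pattern.
import re

def orthographic_features(token):
    return {
        "init_cap": token[:1].isupper(),                   # begins with a capital letter
        "all_caps": token.isalpha() and token.isupper(),   # all uppercase
        "all_lower": token.isalpha() and token.islower(),  # all lowercase
        "numeric": token.isdigit(),                        # numeric value
        "single_char": len(token) == 1,                    # length equal to one
        "has_hyphen": "-" in token,
        "has_slash": "/" in token,
        "letters_and_digits": token.isalnum() and not token.isalpha() and not token.isdigit(),
        "caps_and_digits": bool(re.search(r"[A-Z]", token)) and bool(re.search(r"\d", token)),
        "includes_caps": any(c.isupper() for c in token),
        "punctuation": bool(re.fullmatch(r"[^\w\s]+", token)),
        # word shape: map letters/digits to generic symbols, e.g. "IL-2" -> "AA-0"
        "shape": re.sub(r"\d", "0", re.sub(r"[a-z]", "a", re.sub(r"[A-Z]", "A", token))),
    }

print(orthographic_features("IL-2"))
```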

  37. Performance Measures
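
For reference, the standard definitions of precision, recall, and F-score, which this evaluation presumably follows:

```latex
% Standard precision/recall/F-score definitions, stated here as an assumption
% about the measures named on this slide.
\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_{\beta=1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}
                   {\mathrm{Precision} + \mathrm{Recall}}
\]
```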

  38. Experimental Results: Single Class (Linear Kernel)

  39. Experimental Results: Multi-Class (Linear Kernel)

  40. Performance Comparison

  41. How Long Was Training Time? All tests performed on the same machine (Dual Core Xeon 3.6 GHz). Margin error = 0.1, max. memory = 2 GB for all tests.

  42. NER/SVM Scalability Problems • Evidence exists that using high-dimensional orthographic and contextual features leads to good NER classification. • Input vectors are sparse and high-dimensional. • SVM requires O(n³) time and O(n²) memory for single-class training, where n is the input size. • Multi-class training time is much higher, especially for all-together optimization. • SVM is impractical for large input datasets, especially with non-linear kernel functions.

  43. Other Practical Challenges • Integrated tools are not available • Lack of standardization, incompatible interfaces, need to “reinvent the wheel” to fit pieces together • How to implement new algorithms for partial problems? • How to incorporate optional components into the overall NER/SVM solution? • How to select model parameters? • How to select a kernel function that is suitable to a given problem data? • Adding new training data requires restarting the learning process

  44. Proposed Approach • Reduce online memory requirements • Use a database repository • Reduce training time • Database-supported algorithms • Special focus on improving multi-class optimization algorithm

  45. Proposed Architecture

  46. Database-Supported Algorithms • Use a DBMS to store input vectors, the evolving model, and intermediate training results • Decompose the SVM solution into modules that perform specific tasks and share data (a hypothetical storage sketch follows this slide) • Input and intermediate data resulting from previous experiments can be reused for others, thereby reducing recomputation • As a by-product, building a growing gazetteer list is facilitated; it may be used to improve performance measures
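
A hypothetical sketch of how sparse input vectors could be stored in the shared repository (the schema, table, and column names are assumptions for illustration, not part of the proposal):

```python
# Hypothetical sketch: persist sparse feature vectors in PostgreSQL so that
# decomposed training modules can share them. Schema and names are assumptions.
import psycopg2

conn = psycopg2.connect("dbname=ner_svm user=ner")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS input_vectors (
        example_id  INTEGER,
        feature_id  INTEGER,
        value       REAL,
        label       INTEGER,
        PRIMARY KEY (example_id, feature_id)
    )
""")

# A sparse example: only non-zero features are stored.
rows = [(1, 42, 1.0, 1), (1, 1017, 1.0, 1), (1, 50311, 1.0, 1)]
cur.executemany(
    "INSERT INTO input_vectors (example_id, feature_id, value, label) "
    "VALUES (%s, %s, %s, %s)",
    rows,
)
conn.commit()
```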

  47. Embedded Database Modules • Eliminate communication overhead • Take advantage of DB caching and parallelization capabilities • Provide a base for a potential service-oriented architecture (SOA) for SVM, using the DB for data exchange • Extend the DBMS with reusable classification modules • Present a unified interface to the user

  48. Proposed Approach - DBMS • Open source: PostgreSQL or MySQL? • PostgreSQL is selected due to its rich features, adherence to standards, and the flexible options to extend the DBMS via internal or embedded functions. • MySQL historically offered better performance, but the latest versions of PostgreSQL have improved performance and enhanced scalability.

  49. Evaluation Plan • Test using the JNLPBA-04 biomedical data • Repeat experiments with different input sizes and track training time. Compare to training time using traditional SVM. • Evaluate classification performance. • (Optional) Re-train using more data and verify that previously stored model can be augmented with new data. • May need to regenerate features for JNLPBA-04 and repeat baseline tests. Some features were excluded in previous tests due to memory shortage.

  50. Success Criteria • Demonstrate that the database-supported approach requires less online memory and less total training time than traditional SVM. • Show that the training-time advantage over traditional SVM holds consistently as input data size increases. • Precision/Recall/F-score performance measures remain comparable to those obtained using traditional SVM. • (Optional) Demonstrate the ability to train incrementally without restarting the learning process. • (Auxiliary) Recommend a dynamic architecture that improves SVM usability by allowing different solution configurations to be defined.
