

  1. Biomedical Named Entity Recognition from Text using Genetic Algorithm Based Classifier Subset Selection Nazife Dimililer Supervisor: Asst. Prof. Dr. Ekrem Varoğlu

  2. Outline • Motivation • Background • Overview of IE tasks • Definition of NER • Corpus Used • Objective of Thesis • Related Work • Proposed System • Corpus • Individual Classifiers • Multiple Classifier System • Future Work

  3. Motivation of the Thesis • Vast amount of biomedical literature available online • Need for: • Intelligent information retrieval • Automatically populating databases • Document understanding/summarization • … • NER is the first step of all IE tasks • Annotated corpora: GENIA, BioCreative, FlyBase • Room for improvement • Applicability to other domains

  4. Background: What is Named Entity Recognition? • Named entity recognition (NER) is a subtask of information extraction • It identifies and labels strings of a text as belonging to predefined classes (named entities) • Example NEs: persons, organizations, expressions of time, drugs, proteins, cell types • NER poses a significant challenge in the biomedical domain

  5. Background: Overview of IE Tasks in the Biomedical Domain

  6. Background: Sources of Problems in Biomedical NER • Irregularities and mistakes in tokenization and tagging • (Irregular) use of special symbols • Lack of standard naming conventions • Changing names and notations • Continuous introduction of new names • Abbreviations, synonyms, variations • Homonyms or ambiguous names • Cascaded named entities • Complicated constructions • Comma-separated lists • Disjunctions and conjunctions • Inclusion of adjectives as part of some NEs

  7. Background: State of Current Research in Biomedical NER • A large number of systems have been proposed for biomedical NER • Systems based on individual classifiers • Multiple classifier systems with a small number of members • Systems differ in their use of external sources, hand-crafted post-processing, corpora with differing NEs, and evaluation schemes

  8. Background: State of Current Research in Biomedical NER • A very important milestone in this area was the Bio-Entity Recognition Task at JNLPBA in 2004 • The same systems as in the newswire domain were used, with slight changes • Rich feature sets were exploited • Successful classifiers relied on external resources and post-processing • Similar systems were used in the BioCreative tasks in 2004, 2006 and 2009, and in other publications

  9. Objective of the Thesis • Improve biomedical NER performance • Use a benchmark corpus • Apply classifier selection techniques to biomedical NER • Train a reliable and diverse set of individual classifiers • Utilize a large set of individual classifiers • Use a genetic algorithm to form an ensemble by performing vote-based classifier subset selection

  10. Corpus Used • JNLPBA data: based on the GENIA Corpus v3.02 • Contains 5 entities: protein, RNA, DNA, cell line, cell type • IOB2 tagged: 11 classes • B-protein / I-protein • B-RNA / I-RNA • B-DNA / I-DNA • B-cell_line / I-cell_line • B-cell_type / I-cell_type • O (Outside)

  11. Corpus: Format of JNLPBA Data. One token per line with its IOB2 tag; sentences separated by blank lines:

The	O
peri-kappa	B-DNA
B	I-DNA
site	I-DNA
mediates	O
human	B-DNA
immunodeficiency	I-DNA
virus	I-DNA
type	I-DNA
2	I-DNA
enhancer	I-DNA
activation	O
…

Human	O
immunodeficiency	O
virus	O
type	O
2	O
…

Our	O
data	O
suggest	O
that	O
lipoxygenase	B-protein
metabolites	I-protein
activate	O
ROI	O
formation	O
which	O
then	O
induce	O
IL-2	B-protein
expression	O
via	O
NF-kappa	B-protein
B	I-protein
activation	O
.	O
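A minimal parsing sketch for this format, assuming whitespace-separated token/tag columns and blank lines between sentences (the file name in the usage comment is hypothetical):

```python
def read_iob2(path):
    """Read a two-column IOB2 file: one 'token tag' pair per line,
    sentences separated by blank lines."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                       # blank line ends a sentence
                if current:
                    sentences.append(current)
                    current = []
                continue
            token, tag = line.rsplit(None, 1)  # split on the last whitespace
            current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences

# read_iob2("train.iob2")[0] -> [('The', 'O'), ('peri-kappa', 'B-DNA'), ...]
```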

  12. Corpus: Data Set Statistics • Training data: abstracts retrieved with the MeSH terms "human", "blood cells" and "transcription factors" • Test data: drawn from the super domain of "blood cells" and "transcription factors" [statistics table not preserved in the transcript]

  13. Individual Classifiers: Individual Classifier Architecture • Why use SVMs? • Successfully used in many NLP and bioinformatics tasks • CoNLL 2000 and CoNLL 2004 • BioCreAtIvE competition 2004 • Ability to handle large feature sets • IOB2 notation is used to represent entities • Multi-class classification problem • Features extracted from the training data only

  14. Individual Classifiers: Individual Classifier System Used • YamCha: a generic, customizable, open-source text chunker that uses Support Vector Machines • Tunable parameters: • Parsing direction: left-to-right / right-to-left • Range of the context window • Degree of the polynomial kernel

  15. Individual Classifiers: Context Window • The default setting is "F:-2..2:0.. T:-2..-1": the features of the two preceding and two following tokens, plus the previously predicted tags of the two preceding tokens

  16. Individual Classifiers: Training Individual Classifiers • All individual classifiers are trained using the one-vs-all approach • Backward or forward parse direction • Different context windows • Different degrees of the polynomial kernel • Different features and feature combinations

  17. Individual Classifiers • All classifiers are based on SVMs • Feature types: • Lexical features • Morphological feature • Orthographic features • Surface word feature • Tokens and the predicted tags are also used as features

  18. Individual Classifiers: Features Used • Tokens: words in the training data; the token to be classified and the preceding and following tokens, as specified by the context window • Previously predicted tags: predicted tags of the preceding tokens, also specified by the context window
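A sketch of how such a context window could be assembled into features for the token at position i; the function and feature names are illustrative, not YamCha internals, and assume the default window above:

```python
def window_features(tokens, pred_tags, i, width=2):
    """Features for position i: surrounding tokens at offsets -2..2,
    plus the already-predicted tags at offsets -2..-1 (dynamic features)."""
    feats = {}
    for off in range(-width, width + 1):
        j = i + off
        feats[f"tok[{off}]"] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
    for off in (-2, -1):
        j = i + off
        feats[f"tag[{off}]"] = pred_tags[j] if j >= 0 else "<BOS>"
    return feats
```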

  19. Individual Classifiers: Features Used (Cont.) • Lexical feature: represents grammatical functions of tokens • Part of speech: tags from the Penn Treebank project, added using the GENIA tagger. Ex: adverb, determiner, adjective • Phrase tag: phrasal categories added using an SVM trained on newswire data. Ex: noun phrase, verb phrase, adjective phrase • Base noun phrase tag: basic noun phrases are tagged using the fnTBL tagger

  20. Individual Classifiers: Features Used: Morphological • Different n-grams of the current token • An n-gram of a token is formed by taking the last or first n characters of the token • Last 1/2/3/4 letters and first 1/2/3/4 letters • Example: for TRANSCRIPTION, the prefixes are T, TR, TRA, TRAN and the suffixes are N, ON, ION, TION
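A small sketch of this morphological feature (function and feature names are illustrative):

```python
def char_ngrams(token, max_n=4):
    """First and last n characters of a token for n = 1..4."""
    feats = {}
    for n in range(1, max_n + 1):
        feats[f"first{n}"] = token[:n]
        feats[f"last{n}"] = token[-n:]
    return feats

# char_ngrams("TRANSCRIPTION")
# -> first1 'T' ... first4 'TRAN', last1 'N' ... last4 'TION'
```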

  21. Individual Classifiers: Features Used: Orthographic • Also known as word formation patterns • Information about the form of the word. Example: contains uppercase letters, digits, etc. • Two different approaches used: • Simple: the existence of a particular word formation pattern is represented by a binary yes/no feature • Intricate: multiple word formation patterns represented using a list based on representation score

  22. Individual Classifiers: Features Used: Orthographic (Cont.) • Intricate approach: a list of word formation patterns is formed in decreasing order of representation score • The representation score of an orthographic property i for entity label j, denoted RS_{i,j}, is calculated as shown below • Orthographic features whose representation score for Outside-tagged tokens exceeds 10% are eliminated from the list
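The formula itself did not survive the transcript. A plausible reconstruction, consistent with the surrounding description (the count notation c_{i,j} is an assumption):

```latex
RS_{i,j} = \frac{c_{i,j}}{\sum_{k} c_{i,k}} \times 100\%
```

where c_{i,j} is the number of training tokens that exhibit orthographic property i and are labeled j, and the sum runs over all labels k, including Outside; a property i is then dropped when RS_{i,O} > 10%.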

  23. Individual Classifiers: Features Used: Orthographic (Cont.) [table of the orthographic features used; not preserved in the transcript]

  24. Individual Classifiers: Features Used: Orthographic (Cont.) • Intricate use of the orthographic feature, priority based: each token is tagged with the first applicable word formation pattern on the list [example not preserved in the transcript]

  25. Individual Classifiers: Features Used: Orthographic (Cont.) • Intricate use of the orthographic feature, binary string: a binary string containing one bit to represent each word formation pattern in the list. Example patterns: • Initial letter capitalized • Combination of upper-case letters and other symbols • Combination of upper- and lower-case letters • Combination of upper-case letters and numbers • Contains a number • Combination of alphabetical characters and numbers • Combination of lower-case letters and other symbols • Combination of alphabetic characters and other symbols • Combination of lower-case letters and numbers • Contains an upper-case letter
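An illustrative reimplementation of the binary-string idea; the regular expressions are assumptions standing in for a few of the listed patterns, not the thesis's exact definitions:

```python
import re

PATTERNS = [
    ("init_cap",         re.compile(r"^[A-Z][a-z]*$")),
    ("upper_and_symbol", re.compile(r"^(?=.*[A-Z])(?=.*[^A-Za-z0-9]).*$")),
    ("mixed_case",       re.compile(r"^(?=.*[A-Z])(?=.*[a-z])[A-Za-z]+$")),
    ("upper_and_digit",  re.compile(r"^(?=.*[A-Z])(?=.*[0-9]).*$")),
    ("has_digit",        re.compile(r"[0-9]")),
    ("alpha_and_digit",  re.compile(r"^(?=.*[A-Za-z])(?=.*[0-9])[A-Za-z0-9]+$")),
]

def ortho_bits(token):
    """One bit per word formation pattern: '1' if the token matches it."""
    return "".join("1" if pat.search(token) else "0" for _, pat in PATTERNS)

# ortho_bits("IL-2") -> '010110' under these illustrative patterns
```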

  26. Individual Classifiers: Features Used (Cont.) • Surface words: a separate pseudo-dictionary for each entity, containing the tokens with the highest counts in the training data, such that x% of all tokens in that entity's names are in the dictionary • Pseudo-dictionaries with 50%, 60%, 70% and 80% coverage • Each token is tagged with a 5-bit string where each bit corresponds to the pseudo-dictionary of one entity
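A sketch of how such coverage-based pseudo-dictionaries could be built and applied; the function names and the ordering of the five entity dictionaries are assumptions:

```python
from collections import Counter

def pseudo_dictionary(entity_tokens, coverage=0.7):
    """Most frequent tokens inside one entity's names, kept until
    `coverage` of all token occurrences is reached."""
    counts = Counter(entity_tokens)
    total, covered, dictionary = sum(counts.values()), 0, set()
    for tok, c in counts.most_common():
        if covered / total >= coverage:
            break
        dictionary.add(tok)
        covered += c
    return dictionary

def surface_bits(token, dictionaries):
    """5-bit string: one bit per entity's pseudo-dictionary."""
    return "".join("1" if token in d else "0" for d in dictionaries)

# dictionaries = [pseudo_dictionary(toks) for toks in per_entity_token_lists]
```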

  27. Effect of Feature Extraction • Each feature type improves performance from a different perspective: • Precision • Recall • Boundaries • Entity-based performance • Careful combination of features improves the overall performance

  28. Individual Classifiers: Effect of Parse Direction and Lexical Features • Effect of backward parsing: • Precision and recall increased for both boundaries • Precision scores improved more than recall scores • An overall increase in full recall, precision and F-score • Effect of lexical features: • Single lexical features: higher precision than recall • Combinations: recall and precision values are more balanced • Combinations slightly improve both the left-boundary and the right-boundary F-scores

  29. Individual Classifiers: Effect of Morphological Features • F-score improves compared to the baseline system • Suffixes alone result in higher recall than precision • Prefixes alone result in higher precision than recall • Their combination improves the overall performance • The morphological feature improves recall but degrades precision compared to the baseline

  30. Individual Classifiers: Effect of Orthographic Features • Performance is improved by all orthographic features • The best performance is achieved by the binary string • For simple orthographic features, precision scores are slightly higher than recall scores • Intricate orthographic features provide higher recall values, resulting in an overall improvement in F-scores

  31. Individual Classifiers: Effect of the Surface Word Feature • Precision scores improved more than recall scores compared to the baseline classifier • The improvement on the right boundary is more pronounced • The precision score is greater than the recall score • The pseudo-dictionary can thus be used to generate classifiers with higher precision than recall

  32. Individual Classifiers: Effect of Feature Combinations • Some specific combinations do not yield a significant improvement in performance • Careful combination of features is useful for improving overall performance • Different combinations of feature/parameter sets favor different entities

  33. Multiple Classifier System: Motivation for Multiple Classifier Systems • For individual classifiers, a set of carefully engineered features improves performance • Unfortunately, performance is still NOT satisfactory • Combining multiple classifiers into ensembles: the combined opinion of a number of experts is more likely to be correct than that of a single expert

  34. Multiple Classifier System: Classifier Pool • Classifiers exploiting state-of-the-art feature sets => highest F-scores • Classifiers with high precision or recall • Classifiers with high precision but low recall, and vice versa • One or more classifiers providing the highest F-score for each entity

  35. Multiple Classifier System: Classifier Fusion Architecture • Training phase: feature extraction (including dictionary and context words) on the training data produces feature sets 1..M for SVM 1..M; GA-based classifier selection over the SVM classifier set yields the best-fitting ensemble • Testing phase: feature extraction on the test data feeds feature sets 1..M to SVM 1..M; classifier fusion and post-processing produce the predicted class

  36. Proposed System: Fusion Algorithm • Weighted majority voting: • The full-object F-score of each classifier on cross-validation data is used as its weight • The class that receives the highest weighted combination of all votes wins the competition • Ties are broken by a random coin toss
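A minimal sketch of this fusion rule; the names and data shapes are illustrative:

```python
import random
from collections import defaultdict

def weighted_vote(predictions, weights):
    """predictions: {classifier: predicted class}; weights: {classifier:
    full-object F-score on cross-validation data}. Ties are broken randomly."""
    scores = defaultdict(float)
    for clf, cls in predictions.items():
        scores[cls] += weights[clf]
    best = max(scores.values())
    return random.choice([c for c, s in scores.items() if s == best])

# weighted_vote({'svm1': 'B-protein', 'svm2': 'O', 'svm3': 'B-protein'},
#               {'svm1': 0.68, 'svm2': 0.70, 'svm3': 0.65})  # -> 'B-protein'
```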

  37. Weighted Majority Voting • Weight: full-object F-score [slide equation not preserved in the transcript; see the reconstruction below]
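Weighted majority voting in its usual form, with w_j the full-object F-score of classifier j and d_j its predicted class, would read:

```latex
\hat{c} \;=\; \arg\max_{c} \sum_{j=1}^{M} w_j \, \mathbb{1}[\, d_j = c \,]
```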

  38. Multiple Classifier System: Genetic Algorithm Setup • Initial population: randomly generated bit strings • Population size: 100 • Mutation rate: 2% • Crossover rate: 70% • Crossover operators: two-point crossover, uniform crossover • Tournament size: 40 • Elitist population: 20%
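A compact GA sketch using the settings above; fitness_fn, the chromosome length, the generation count and the use of uniform crossover alone are placeholders for illustration:

```python
import random

POP, MUT, CX, TOURN, ELITE = 100, 0.02, 0.70, 40, 0.20

def evolve(n_bits, fitness_fn, generations=50):
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(POP)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness_fn, reverse=True)
        new_pop = ranked[:int(ELITE * POP)]            # elitist policy
        while len(new_pop) < POP:
            # tournament selection of two parents
            p1 = max(random.sample(pop, TOURN), key=fitness_fn)
            p2 = max(random.sample(pop, TOURN), key=fitness_fn)
            child = list(p1)
            if random.random() < CX:                   # uniform crossover
                child = [random.choice(pair) for pair in zip(p1, p2)]
            child = [1 - g if random.random() < MUT else g for g in child]
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness_fn)                    # best ensemble found
```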

  39. Proposed System: Flow Chart of the Genetic Algorithm • Start: initialize the population randomly and compute the fitness of each chromosome • Loop until termination: select parents and apply crossover, mutate the offspring, compute the fitness of each chromosome, and apply the elitist policy to form the new population • On termination: select the best chromosome as the resultant ensemble

  40. Proposed System: Genetic Algorithm Setup • Chromosome: the list of classifiers to be combined • 3-fold cross-validation results are used for the individual classifiers • Fitness of a chromosome: full-object F-score of the classifier ensemble • Static classifier selection: each bit represents a classifier • Proposed vote-based classifier selection: each bit represents the reliability of a classifier for predicting one class

  41. Multiple Classifier System: Chromosome Structure of Static Classifier Selection • With M classifiers (Classifier 1 … Classifier M), the chromosome has M bits • If a gene = 1, the corresponding classifier participates in the decision for all classes; otherwise it remains silent

  42. Multiple Classifier System: Chromosome Structure for the Proposed Vote-Based Classifier Selection • With M classifiers and N classes, the chromosome has N×M bits • For each classifier, one gene is reserved per class to represent whether it participates in the decision for that class
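A sketch of decoding such a chromosome and combining it with the weighted vote above; the classifier-major gene order and the fallback to 'O' are assumptions:

```python
def may_vote(chromosome, clf_idx, class_idx, n_classes=11):
    """Gene for (classifier, class) in a classifier-major N*M bit string."""
    return chromosome[clf_idx * n_classes + class_idx] == 1

def vote_based_fusion(chromosome, predictions, weights, class_index):
    """Weighted vote in which each classifier votes only for the classes
    its genes enable; falls back to 'O' if no classifier may vote."""
    scores = {}
    for j, (clf, cls) in enumerate(predictions.items()):
        if may_vote(chromosome, j, class_index[cls]):
            scores[cls] = scores.get(cls, 0.0) + weights[clf]
    return max(scores, key=scores.get) if scores else "O"
```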

  43. Proposed System: Motivation for Vote-Based Classifier Subset Selection • A classifier cannot predict all classes with the same performance • A subset of its predictions may be unreliable • A subset of its predictions may be correlated with the predictions of other classifiers • Solution: allow a classifier to vote only for the classes it trusts

  44. Multiple Classifier System: Multiple Classifier Systems Used • Single Best (SB): not an MCS, included as a reference • Full Ensemble (FE): ensemble containing all classifiers • Forward Selection (FS): ensemble formed using forward selection • Backward Selection (BS): ensemble formed using backward selection • GA-generated Static Ensemble (GAS): ensemble formed using the GA • Vote-Based Classifier Subset Selection using the GA (VBS): vote-based ensemble formed using the GA

  45. Proposed System: Performance of Ensembles • Single Best << Full Ensemble ≈ Forward Selection << Backward Selection ≈ GA Static Ensemble << Proposed Method (72.51)

  46. Multiple Classifier System: Discussion on Ensembles • All ensembles outperform SB • VBS has the highest F-score • GA-based ensembles perform better • BS chose 38 classifiers • FE and BS are similar: precision >> recall • FS and GAS chose 9 classifiers • Precision and recall are more balanced • VBS is different: it uses 46 classifiers partially • Recall > precision

  47. Multiple Classifier System: Discussion on Ensembles • BS eliminates mainly classifiers using only two features • All eliminated classifiers are backward parsed • Backward- and forward-parsed classifiers are more balanced • FS and GAS are almost the same: 8 classifiers are shared, and the 9th classifier is forward parsed for GAS • Even though the 9th classifier has a lower F-score, the GAS ensemble achieves a higher F-score

  48. Multiple Classifier System: Entity-Based F-Scores for the Ensembles [chart comparing Single Best, Full Ensemble, Forward Selection, Backward Selection, GA Static Ensemble and the Proposed Method (VBS); not preserved in the transcript]

  49. Multiple Classifier System: Discussion of Entity-Based Scores • VBS achieves the best scores for all entities except RNA (the smallest data set) • The GAS ensemble outperforms VBS for RNA • VBS: highest F-score for protein (the largest data set), lowest F-score for cell line

  50. Multiple Classifier System: Distribution of Vote Counts for the VBS • None of the classifiers is eliminated from the ensemble • None of the classifiers votes for all eleven classes [histogram not preserved in the transcript]
