
Towards Improving Classification of Real World Biomedical Articles



  1. Towards Improving Classification of Real World Biomedical Articles

  2. Summary • We propose a method to improve performance in biomedical article classification. • We use Naïve Bayes and Maximum Entropy classifiers to classify real-world biomedical articles derived from the dataset used in the BC2.5 classification competition task. • To improve classification performance, we use two merging operators, Max and Harmonic Mean, to combine the results of the two classifiers. • The results show that we can improve classification performance on real-world biomedical data.

  3. Introduction From the biomedical point of view there are many challenges in classifying biomedical information [3]. Even the most sophisticated solutions often overfit to the training data and do not perform as well on real-world data [4]. In this paper we devise a method that makes real-world biomedical data classification more robust. First, we parse documents, applying a keyword extraction algorithm to extract keywords from the full text. Second, we apply a chi-square feature selection strategy to identify the most relevant ones. Finally, we apply Naïve Bayes and Maximum Entropy classifiers to classify documents and then combine them using two merging operators to improve performance.

  4. THE CLASSIFICATION METHOD • Naïve Bayes Classifiers: • A text classifier can be defined as a function that maps a document d of n words (features), d = (x1, x2, x3, …, xn), to a confidence that the document d belongs to a text category. • The Naïve Bayes classifier [1] is often used to estimate the probability of each category. • Bayes' theorem can be used to estimate the probabilities: Pr(c|d) = Pr(d|c) × Pr(c) / Pr(d) [6]
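As a concrete illustration of the above, here is a minimal multinomial Naïve Bayes sketch in Python (the function names, the Laplace smoothing constant, and the toy data are our own illustrative choices, not taken from the paper):

```python
import math
from collections import Counter

def train_nb(docs_by_class, alpha=1.0):
    """Estimate log Pr(c) and log Pr(x|c) with Laplace smoothing."""
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d}
    n_docs = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for c, docs in docs_by_class.items():
        counts = Counter(w for d in docs for w in d)
        total = sum(counts.values())
        log_prior = math.log(len(docs) / n_docs)
        log_like = {w: math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
                    for w in vocab}
        model[c] = (log_prior, log_like)
    return model

def classify_nb(model, doc):
    """Pick argmax_c log Pr(c) + sum_x log Pr(x|c): Bayes' theorem with
    Pr(d) dropped, since it is constant across categories."""
    scores = {c: lp + sum(ll.get(w, 0.0) for w in doc)
              for c, (lp, ll) in model.items()}
    return max(scores, key=scores.get)
```

Since Pr(d) is the same for every category, ranking categories by Pr(d|c) × Pr(c) alone gives the same decision as the full Bayes formula.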

  5. THE CLASSIFICATION METHOD • Maximum Entropy Classifiers: • Entropy was introduced by Shannon (Shannon, 1948) in communication theory. The entropy H measures the average uncertainty of a single random variable X: H(p) = H(X) = −Σx p(x) log2 p(x) [2] • The maximum entropy model can be specially adjusted for text classification. • This can be done using the improved iterative scaling (IIS) algorithm and a hill-climbing algorithm for estimating the parameters of the maximum entropy model [6]
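The entropy formula on this slide computes directly; a short sketch (the IIS parameter estimation itself is omitted here, as it is well beyond a slide-sized example):

```python
import math

def entropy(probs):
    """Shannon entropy H(p) = -sum_x p(x) * log2 p(x), in bits.
    Zero-probability outcomes contribute nothing to the sum."""
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

A uniform distribution over 2^k outcomes has entropy k bits, the maximum for that support. The maximum entropy model exploits exactly this: among all distributions consistent with the feature constraints, it picks the one with the largest H.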

  6. Merging Classifiers We use two operators to combine the results of the Naïve Bayes Classifier (NBC) and the Maximum Entropy Classifier (MEC) to improve classification performance: the Maximum and the Harmonic Mean of the results of the two classifiers. MaxC(d) = Max{NBC(d), MEC(d)} HarmC(d) = 2.0 × NBC(d) × MEC(d) / (NBC(d) + MEC(d)) The MaxC(d) operator chooses the maximum of the two classifiers' results. The HarmC(d) operator computes the Harmonic Mean of the results of these two classifiers.
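The two merging operators translate directly to code; a minimal sketch, assuming NBC(d) and MEC(d) are non-negative confidence scores:

```python
def max_c(nbc_score, mec_score):
    """MaxC(d): take the larger of the two classifiers' confidences."""
    return max(nbc_score, mec_score)

def harm_c(nbc_score, mec_score):
    """HarmC(d): harmonic mean of the two confidences."""
    if nbc_score + mec_score == 0:
        return 0.0
    return 2.0 * nbc_score * mec_score / (nbc_score + mec_score)
```

Max is optimistic (one confident classifier suffices), while the harmonic mean is pulled toward the smaller of the two scores, so it rewards agreement between the classifiers.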

  7. BioCreAtIvE challenge Description [2004-01-02] The BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology) challenge evaluation consists of a community-wide effort for evaluating text mining and information extraction systems applied to the biological domain.  http://www.biocreative.org/about/background/description/

  8. BioCreative II.5 challenge Evaluation library [2009-12-17] This is the current version of the BioCreative evaluation library, including a command line tool to use it; the current, official version is 3.2 (use the command line option --version to see the version of the script you have installed: bc-evaluate --version). If you have reason to believe that there is a bug in the tool or the library, or have any other questions related to it, please contact the author, Florian Leitner. http://www.biocreative.org/resources/biocreative-ii5/evaluation-library/

  9. BioCreative II.5 challenge Task 2: Protein-Protein Interactions [2006-04-01] This task is organized as a collaboration between the IntAct and MINT protein interaction databases and the CNIO Structural Bioinformatics and Biocomputing group. http://www.biocreative.org/tasks/biocreative-ii/task-2-protein-protein-interac/

  10. Preparing the Data. • For experimentation purposes we used the data from the article classification competition task BC2.5 [4]. • This classification task was based on a training data set comprising 61 full-text articles relevant to protein-protein interaction and 558 irrelevant ones. • For training we chose the first 60 relevant articles and randomly sampled 60 irrelevant ones; for testing we used the BioCreative 2.5 testing data set, consisting of 63 full-text articles relevant to protein-protein interaction and 532 irrelevant ones.

  11. Preparing the Data. • Before using the data for training and testing we pre-processed all articles by filtering out stop words and applying Porter stemming to the remaining words/keywords. • Finally, we ranked the keywords extracted from the BC2.5 training articles according to the chi-square scoring formula to identify the most relevant keywords [6].
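The chi-square ranking step can be sketched as follows, using the standard 2×2 contingency-table form of the chi-square statistic over per-keyword document counts (the function name and the counts are illustrative, not from the paper):

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square score of a keyword from its 2x2 contingency table:
    n11 = relevant docs containing the keyword, n10 = irrelevant docs containing it,
    n01 = relevant docs without it,             n00 = irrelevant docs without it."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    return num / den if den else 0.0
```

Keywords are then sorted by score in descending order, and the top-ranked ones are kept as features; a keyword distributed independently of the class scores 0.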

  12. Experiments • The experiments consist of the following phases: • First, we collect five sets of top-ranked keywords using the chi-square feature selection strategy. Second, we compare the performance of the two classifiers, Naïve Bayes and Maximum Entropy, for each set of word features. Third, we use the merging operators to combine the results of these two classifiers to improve performance. • In each experiment we calculate the Precision, Recall, True Negative Rate and Accuracy measures.
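The four measures can be computed from the confusion-matrix counts; a minimal sketch, treating "relevant" as the positive class, with guard clauses for empty denominators (our own conventions, not specified on the slide):

```python
def evaluate(tp, fp, tn, fn):
    """Precision, Recall, True Negative Rate and Accuracy from the
    true/false positive/negative counts of a binary classifier."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, tnr, accuracy
```

On a class-imbalanced test set such as BC2.5 (63 relevant vs. 532 irrelevant articles), reporting all four measures matters: accuracy alone can look high even when recall on the minority class is poor.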

  13. Results • The Maximum Entropy classifier achieves its best Precision, Recall and Accuracy (0.186, 0.857 and 0.589) with the 500 top-ranked keywords, while its best True Negative Rate (0.565) is reached at 700 top-ranked keywords. • We combine the results of the two classifiers using the two merging operators mentioned above to improve performance, especially the Recall rate. • The merging operators do improve performance: Precision 0.189, Recall 0.873, True Negative Rate 0.560 and Accuracy 0.591.

  14. Conclusion • The results show that the Maximum Entropy classifier performs best at 500 top-ranked keywords. • By combining the results of the two classifiers we can improve classification performance on real-world biomedical data.

  15. References 1. Fuhr, N., Hartmann, S., Lustig, G., Schwantner, M., and Tzeras, K. 1991. AIR/X – a rule-based multi-stage indexing system for large subject fields. RIAO'91, pp. 606–623. 2. Galathiya, A. S., Ganatra, A. P., and Bhensdadia, K. C. 2012. An improved decision tree induction algorithm with feature selection, cross validation, model complexity & reduced error pruning. IJSCIT, March 2012. 3. Feldman, R., Sanger, J. 2006. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press. 4. Krallinger, M., et al. 2009. The BioCreative II.5 challenge overview. In: Proc. BioCreative II.5 Workshop 2009 on Digital Annotations, pp. 7–9.

  16. References 5. Fragos, K., Maistros, I. 2006. A goodness of fit test approach in information retrieval. Information Retrieval, Springer, Volume 9, Number 3, pp. 331–342. 6. McCallum, A. and Nigam, K. 1998. A comparison of event models for Naive Bayes text classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization. 7. Fragos, K., Maistros, I., Skourlas, C. 2005. A X2-Weighted Maximum Entropy Model for Text Classification. 2nd International Conference on N.L.U.C.S., Miami, Florida.

  17. Questions…
