
A Genetic Algorithm for Text Classification Rule Induction


Presentation Transcript


  1. ECML PKDD 2008, 15-19 September 2008, Antwerp, Belgium. A Genetic Algorithm for Text Classification Rule Induction. A. Pietramala¹, V. Policicchio¹, P. Rullo¹,², I. Sidhu³. ¹ Università della Calabria (Rende, Italy), {a.pietramala,policicchio,rullo}@mat.unical.it. ² Exeura Srl (Rende, Italy). ³ Kenetica Ltd (Chicago, IL, USA), isidhu@computer.org

  2. Outline • Motivations • The Olex Hypothesis Language • The Genetic Algorithm Approach (Olex-GA) • Experimental Results and Comparative Evaluation • Discussions • Conclusions and Future Work

  3. Motivations • Rule learning algorithms have become a successful strategy for classifier induction. • Rule-based classifiers have the desirable property of being readable and, thus, easy to understand (and, possibly, modify). • Genetic Algorithms (GAs) are stochastic search methods inspired by biological evolution. • GAs have proven capable of providing good solutions to classical optimization tasks (e.g., TSP and Knapsack).

  4. Rule Induction and GAs • Rule induction is one of the application fields of GAs. The basic idea is that: • Each individual in the population represents a candidate solution (a classification rule or a classifier). • The fitness of an individual is evaluated in terms of its predictive accuracy. • We propose a GA approach, called Olex-GA, for the induction of rule-based text classifiers.

  5. Olex-GA – The hypothesis language • A classifier Hc(Pos, Neg) is of the form: "if any of the terms t1,…,tn (the set Pos) occurs in d and none of the terms tn+1,…,tn+m (the set Neg) occurs in d, then classify d under category c".
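As a concrete illustration, here is a minimal Python sketch (ours, not the authors' code) of how such a classifier decides membership; all term names are hypothetical:

```python
# A classifier Hc(Pos, Neg) over a document viewed as a set of terms:
# accept iff at least one positive term occurs and no negative term does.
def classify(doc_terms, pos, neg):
    return any(t in doc_terms for t in pos) and \
           not any(t in doc_terms for t in neg)

print(classify({"grain", "export"}, pos={"wheat", "grain"}, neg={"stock"}))  # True
print(classify({"grain", "stock"}, pos={"wheat", "grain"}, neg={"stock"}))   # False
```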

  6. Olex-GA – The hypothesis language • The terms in Pos and Neg are chosen from the local vocabulary Vc(k, f ). • Intuitively, Vc(k, f ) is the set of the best k terms for category c according to a given scoring function f.
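A small sketch of building Vc(k, f ), assuming the f-scores have already been computed for each term (the paper leaves f generic, e.g. an information-theoretic score):

```python
# Vc(k, f): keep the k best-scoring terms for category c.
def vocabulary(scores, k):
    # scores: dict mapping each term to its f-score for c (assumed precomputed)
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

scores = {"wheat": 0.9, "grain": 0.8, "export": 0.4, "stock": 0.1}
print(vocabulary(scores, k=2))  # {'wheat', 'grain'}
```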

  7. Olex-GA – Problem statement • The Olex-GA learning problem is stated as an optimization problem: PROBLEM MAX-F. Let a category c ∈ C and a vocabulary Vc(k, f ) over the training set TS be given. Then, find two subsets of Vc(k, f ), Pos = {t1,…,tn} and Neg = {tn+1,…,tn+m}, with Pos ≠ Ø, such that Hc(Pos, Neg) applied to TS yields a maximum value of Fα (over TS), for a given α ∈ [0,1]. • Problem MAX-F is NP-hard.
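In symbols, the objective could be written as follows (a hedged reconstruction: the slide's formula did not survive extraction, and Fα is assumed to be van Rijsbergen's measure combining precision P and recall R):

```latex
\max_{\substack{Pos,\,Neg \subseteq V_c(k,f) \\ Pos \neq \emptyset}}
  F_\alpha\bigl(H_c(Pos,Neg),\,TS\bigr),
\qquad
F_\alpha = \frac{1}{\alpha\,\tfrac{1}{P} + (1-\alpha)\,\tfrac{1}{R}},
\quad \alpha \in [0,1].
```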

  8. Olex-GA – A Genetic Algorithm to Solve MAX-F • Problem MAX-F is a combinatorial optimization problem aimed at finding the best combination of terms taken from a given vocabulary. • MAX-F is thus a typical problem for which GAs are known to be a good candidate resolution method.

  9. Olex-GA – Our implementation of the GA • In the following, we describe our choices concerning: • Population Encoding • Fitness Function • Evolutionary Operators

  10. Olex-GA – Population Encoding • Each individual represents an entire classifier. • An individual is simply a binary representation of the sets Pos and Neg of a classifier Hc(Pos, Neg). • Example (from the slide's figure): over the vocabulary {t1,…,t5}, the chromosome K = [1 0 1 0 0 | 0 1 0 1 0] encodes Pos = {t1, t3} and Neg = {t2, t4}.
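A sketch of decoding a chromosome into the classifier it represents (the bit layout follows the slide's figure: positive bits first, then negative bits):

```python
# The first |Pos*| bits range over candidate positive terms,
# the remaining bits over candidate negative terms.
def decode(chromosome, pos_star, neg_star):
    n = len(pos_star)
    pos = {t for t, bit in zip(pos_star, chromosome[:n]) if bit}
    neg = {t for t, bit in zip(neg_star, chromosome[n:]) if bit}
    return pos, neg

K = [1, 0, 1, 0, 0,  0, 1, 0, 1, 0]   # the slide's example bit string
terms = ["t1", "t2", "t3", "t4", "t5"]
print(decode(K, terms, terms))  # ({'t1', 't3'}, {'t2', 't4'})
```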

  11. Olex-GA – Population Encoding • We restrict the search for positive and negative terms, respectively, to: • Pos*, the set of terms belonging to Vc(k, f ) (candidate positive terms); • Neg*, the set of terms which occur in some document that contains a candidate positive term but does not belong to the training set TSc of c (candidate negative terms). • This reduction of the search space allows: • an improvement in the algorithm's efficiency • quick convergence toward good solutions
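A sketch of computing Neg* under this restriction, assuming each document is a set of terms plus a flag saying whether it belongs to TSc (excluding the candidate positives themselves is a sketch-level choice):

```python
# Neg*: terms co-occurring with a candidate positive term in
# documents that lie outside the training set TSc of c.
def candidate_negatives(docs, in_tsc, pos_star):
    neg_star = set()
    for terms, is_positive_doc in zip(docs, in_tsc):
        if not is_positive_doc and terms & pos_star:
            neg_star |= terms - pos_star
    return neg_star

docs = [{"wheat", "export"}, {"wheat", "stock"}, {"oil", "stock"}]
print(candidate_negatives(docs, [True, False, False], {"wheat"}))  # {'stock'}
```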

  12. Olex-GA – Fitness Function • The fitness of a chromosome K representing Hc(Pos, Neg) is the value of the F-measure resulting from applying Hc(Pos, Neg) to the training set TS. • This choice follows naturally from the formulation of problem MAX-F.
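A sketch of this fitness, with α = 0.5 giving the usual F1 (the predictions would come from applying the decoded classifier to each training document):

```python
# F-measure of a set of binary predictions against binary labels.
def fitness(predictions, labels, alpha=0.5):
    tp = sum(p and y for p, y in zip(predictions, labels))
    fp = sum(p and not y for p, y in zip(predictions, labels))
    fn = sum(y and not p for p, y in zip(predictions, labels))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 1.0 / (alpha / precision + (1 - alpha) / recall)

print(fitness([True, True, False], [True, False, True]))  # 0.5 (P = R = 0.5)
```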

  13. Olex-GA – Evolutionary Operators • We perform: • selection via the roulette-wheel method; • crossover via the uniform crossover scheme; • mutation, which flips each single bit with a given (low) probability; • elitism, to ensure that the best individuals of the current generation are passed to the next one without being altered.
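The four operators could look roughly like this (a sketch following the slide's description; elite count and mutation rate are illustrative, and selection assumes at least one positive fitness):

```python
import random

def roulette_select(population, fitnesses):
    # Roulette wheel: probability proportional to fitness.
    return random.choices(population, weights=fitnesses, k=1)[0]

def uniform_crossover(a, b):
    # Each gene is inherited from either parent with equal probability.
    return [x if random.random() < 0.5 else y for x, y in zip(a, b)]

def mutate(chromosome, p=0.01):
    # Flip each single bit with a low probability p.
    return [1 - g if random.random() < p else g for g in chromosome]

def next_generation(population, fitnesses, n_elite=2):
    # Elitism: the best individuals pass unaltered; the rest of the
    # generation is bred by selection + crossover + mutation.
    elite = sorted(zip(fitnesses, population), reverse=True)[:n_elite]
    new_pop = [ind for _, ind in elite]
    while len(new_pop) < len(population):
        a = roulette_select(population, fitnesses)
        b = roulette_select(population, fitnesses)
        new_pop.append(mutate(uniform_crossover(a, b)))
    return new_pop
```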

  14. Olex-GA – Experimentation • We have experimentally evaluated our algorithm on two standard benchmark corpora: • REUTERS-21578 (R10): 12,902 documents, manually classified with respect to 135 categories; we considered the subset of the 10 most populated categories. • OHSUMED: the collection consisting of the first 20,000 of the 50,216 medical abstracts from the year 1991; the classification scheme consists of the 23 MeSH disease categories.

  15. Experimental settings • We applied the stratified holdout method: • REUTERS: ModApté split: 9,603 documents form the training corpus (seen data) and 3,299 form the test set (unseen data). • OHSUMED: the first 10,000 documents were used as seen data and the second 10,000 as unseen data. • In both cases, we randomly split the seen data into a training set (70%), on which to run the GA, and a validation set (30%), on which to tune the model parameters.
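A sketch of a stratified 70/30 split (single-label view for simplicity; the corpora are multi-label, so the actual split is handled per category):

```python
import random
from collections import defaultdict

# Split so each category keeps roughly its proportion in both parts.
def stratified_split(docs, labels, train_frac=0.7, seed=0):
    by_label = defaultdict(list)
    for doc, lab in zip(docs, labels):
        by_label[lab].append(doc)
    rng, train, valid = random.Random(seed), [], []
    for group in by_label.values():
        rng.shuffle(group)
        cut = int(train_frac * len(group))
        train += group[:cut]
        valid += group[cut:]
    return train, valid
```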

  16. Experimental settings • GA Parameters: • For each chromosome K in the population, we initialized the positive part K⁺ at random, while we set K⁻[t] = 0 for each t ∈ Neg* (thus, K initially encodes a classifier Hc(Pos, Neg) with no negative terms).
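A sketch of this initialization (sizes are illustrative):

```python
import random

# K+ (bits over Pos*) is random; K- (bits over Neg*) starts all-zero,
# so each initial individual encodes a classifier with no negative terms.
def init_chromosome(n_pos, n_neg, rng=random.Random(0)):
    return [rng.randint(0, 1) for _ in range(n_pos)] + [0] * n_neg

print(init_chromosome(5, 5))  # e.g. [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
```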

  17. Comparative Evaluation • On both corpora, we carried out a direct comparison with the following systems: • SVM (both polynomial and radial-basis-function kernels) • Ripper (with two optimization steps) • C4.5 • Naive Bayes • Olex-Greedy • All competing systems (apart from Olex-Greedy) were run using the Weka library of ML algorithms.

  18. Performance Comparison on Reuters • Efficacy: SVMpoly > SVMrbf > Ripper ≈ Olex-GA > C4.5 > Olex-Greedy > NB • Efficiency: NB > Olex-Greedy > SVMpoly > Olex-GA > C4.5 > SVMrbf > Ripper

  19. Performance Comparison on OHSUMED • Efficacy: Olex-GA > Ripper > SVMpoly > Olex-Greedy > SVMrbf ≈ NB > C4.5 • Efficiency: NB > Olex-Greedy > SVMpoly > Olex-GA > C4.5 > SVMrbf > Ripper

  20. Discussion – Relation to other inductive rule learners • Conventional rule learners (Ripper, C4.5): • usually rely on a two-stage process: rule induction followed by rule pruning; • each of these stages in turn consists of several steps. • Olex-GA relies on a single-step process which does not need any post-induction optimization. • With respect to Olex-Greedy, Olex-GA provides better predictive accuracy, but is less efficient.

  21. Conclusions • Olex-GA encodes a classifier, in a very natural and compact way, as an individual. • The fitness of an individual is evaluated as the F-measure of the encoded classifier. • Experimental results show that: • Olex-GA quickly converges to very accurate classifiers; • Olex-GA performs at a competitive level with standard algorithms; • its time efficiency is lower than Olex-Greedy's but higher than that of other rule learning methods, such as Ripper and C4.5.

  22. Future work • Extension of the proposed technique to classifiers Hc({T1,…,Tn}, Neg) where each Ti is a conjunction of "simple" terms, i.e., rules that fire when at least one such conjunction is satisfied by the document.
