
ML: Classical methods from AI • Decision-Tree induction • Exemplar-based Learning • Rule Induction • TBEDL



  1. ML: Classical methods from AI • Decision-Tree induction • Exemplar-based Learning • Rule Induction • TBEDL

  2. DecisionTrees Decision Trees • Decision trees represent the rules underlying training data as hierarchical, sequential structures that recursively partition the data. • They have been used by many research communities (Pattern Recognition, Statistics, ML, etc.) for data exploration, with purposes including description, classification, and generalization. • From a machine-learning perspective: Decision Trees are n-ary branching trees that encode rules for classifying the objects of a certain domain into a set of mutually exclusive classes. • Acquisition: Top-Down Induction of Decision Trees (TDIDT) • Systems: CART (Breiman et al. 84), ID3, C4.5, C5.0 (Quinlan 86, 93, 98), ASSISTANT, ASSISTANT-R (Cestnik et al. 87; Kononenko et al. 95)

  3. DecisionTrees An Example • [Figure: a generic decision tree over attributes A1..A5 with values v1..v7, alongside a concrete tree that tests SIZE (small/big), then SHAPE (circle/triang) and COLOR (red/blue), with leaves labelled pos/neg for classes C1, C2, C3.]

  4. DecisionTrees Learning Decision Trees • [Figure: at training time, Training Set + TDIDT = DT; at test time, Example + DT = Class.]

  5. DTs General Induction Algorithm

  function TDIDT (X: set-of-examples; A: set-of-features)
  var tree1, tree2: decision-tree; X': set-of-examples; A': set-of-features end-var
  if (stopping_criterion(X)) then
    tree1 := create_leaf_tree(X)
  else
    amax := feature_selection(X, A);
    tree1 := create_tree(X, amax);
    for-all val in values(amax) do
      X' := select_examples(X, amax, val);
      A' := A \ {amax};
      tree2 := TDIDT(X', A');
      tree1 := add_branch(tree1, tree2, val)
    end-for
  end-if
  return(tree1)
  end-function

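The TDIDT pseudocode above maps directly onto a small recursive function. Below is a minimal Python sketch; the helper names `select_feature` and `stopping_criterion` are caller-supplied stand-ins for the slide's `feature_selection` and stopping test, and the dict-based tree representation is an assumption of this sketch:

```python
from collections import Counter

def tdidt(examples, features, select_feature, stopping_criterion):
    """Top-Down Induction of Decision Trees, following the slide's pseudocode.

    `examples` is a list of (feature_dict, label) pairs; `features` is a set
    of feature names; `select_feature(examples, features)` and
    `stopping_criterion(examples)` are supplied by the caller (e.g.
    information gain and "all labels equal")."""
    labels = [y for _, y in examples]
    if stopping_criterion(examples) or not features:
        # create_leaf_tree: label the leaf with the majority class
        return Counter(labels).most_common(1)[0][0]
    amax = select_feature(examples, features)            # feature_selection
    tree = {"feature": amax, "branches": {}}             # create_tree
    for val in {x[amax] for x, _ in examples}:           # values(amax)
        subset = [(x, y) for x, y in examples if x[amax] == val]  # select_examples
        subtree = tdidt(subset, features - {amax}, select_feature, stopping_criterion)
        tree["branches"][val] = subtree                  # add_branch
    return tree
```

A leaf is reached either when the stopping criterion fires or when no features remain, in which case the majority class labels the leaf.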

  7. DecisionTrees Feature Selection Criteria • Functions derived from Information Theory: • Information Gain, Gain Ratio (Quinlan 86) • Functions derived from Distance Measures • Gini Diversity Index (Breiman et al. 84) • RLM (López de Mántaras 91) • Statistically based • Chi-square test (Sestito & Dillon 94) • Symmetrical Tau (Zhou & Dillon 91) • RELIEFF-IG: variant of RELIEFF (Kononenko 94)

  8. DecisionTrees Information Gain (Quinlan 79)

  9. DecisionTrees Information Gain (2) (Quinlan 79)

  10. DecisionTrees Gain Ratio (Quinlan 86)
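The formulas on these slides did not survive the transcript, but the standard definitions are easy to restate. A sketch of entropy, information gain, and gain ratio for symbolic features (function names are my own, not from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """H(X) = -sum_c p_c * log2(p_c) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, feature):
    """Gain(X, a) = H(X) - sum_v (|X_v| / |X|) * H(X_v), Quinlan's criterion.
    `examples` is a list of (feature_dict, label) pairs."""
    n = len(examples)
    remainder = 0.0
    for val in {x[feature] for x, _ in examples}:
        sub = [y for x, y in examples if x[feature] == val]
        remainder += len(sub) / n * entropy(sub)
    return entropy([y for _, y in examples]) - remainder

def gain_ratio(examples, feature):
    """Gain normalised by the split information (the entropy of the feature's
    value distribution), which penalises many-valued features."""
    split_info = entropy([x[feature] for x, _ in examples])
    return information_gain(examples, feature) / split_info if split_info else 0.0
```

The normalisation in gain ratio is what counters information gain's bias towards attributes with many values, the issue slide 23 alludes to under "attributes with many values".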

  11. DecisionTrees RELIEF (Kira & Rendell, 1992)

  12. DecisionTrees RELIEFF (Kononenko, 1994)

  13. DecisionTrees RELIEFF-IG (Màrquez, 1999) • RELIEFF, but the distance measure used for finding the nearest hits/misses does not treat all attributes equally: it weights the attributes according to the IG measure.
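The RELIEF family can be sketched compactly. This is a simplified single-nearest-neighbour RELIEF (Kira & Rendell 92) for symbolic attributes, with a `dist_weights` hook illustrating the RELIEFF-IG idea of weighting attributes (e.g. by IG) inside the hit/miss distance; the real RELIEFF uses k nearest neighbours and handles multiple classes, which this sketch omits:

```python
def relief(examples, attrs, dist_weights=None, m=None):
    """Simplified RELIEF: for each instance, find its nearest hit (same class)
    and nearest miss (other class) and move each attribute's relevance weight
    towards attributes that separate classes.  `examples` is a list of
    (feature_dict, label) pairs; `dist_weights` optionally makes the distance
    treat attributes unequally (the RELIEFF-IG modification)."""
    w = dist_weights or {a: 1.0 for a in attrs}

    def diff(a, x1, x2):                 # 0/1 difference for symbolic values
        return 0.0 if x1[a] == x2[a] else 1.0

    def dist(x1, x2):                    # attribute-weighted distance
        return sum(w[a] * diff(a, x1, x2) for a in attrs)

    weights = {a: 0.0 for a in attrs}
    sample = examples[:m] if m else examples
    for x, y in sample:
        hits = [x2 for x2, y2 in examples if y2 == y and x2 is not x]
        misses = [x2 for x2, y2 in examples if y2 != y]
        if not hits or not misses:
            continue
        near_hit = min(hits, key=lambda x2: dist(x, x2))
        near_miss = min(misses, key=lambda x2: dist(x, x2))
        for a in attrs:
            weights[a] += (diff(a, x, near_miss) - diff(a, x, near_hit)) / len(sample)
    return weights
```

Relevant attributes end up with high weights (they differ on misses, agree on hits); noisy attributes drift towards zero or below.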

  14. DecisionTrees Extensions of DTs (Murthy 95) • (pre/post) Pruning • Minimizing the effect of the greedy approach: lookahead • Non-linear splits • Combination of multiple models • etc.

  15. DecisionTrees Decision Trees and NLP • Speech processing (Bahl et al. 89; Bakiri & Dietterich 99) • POS Tagging (Cardie 93, Schmid 94b; Magerman 95; Màrquez & Rodríguez 95,97; Màrquez et al. 00) • Word sense disambiguation (Brown et al. 91; Cardie 93; Mooney 96) • Parsing (Magerman 95,96; Haruno et al. 98,99) • Text categorization (Lewis & Ringuette 94; Weiss et al. 99) • Text summarization (Mani & Bloedorn 98) • Dialogue act tagging (Samuel et al. 98)

  16. DecisionTrees Decision Trees and NLP • Noun phrase coreference (Aone & Bennett 95; McCarthy & Lehnert 95) • Discourse analysis in information extraction (Soderland & Lehnert 94) • Cue phrase identification in text and speech (Litman 94; Siegel & McKeown 94) • Verb classification in Machine Translation (Tanaka 96; Siegel 97) • More recent applications of DTs to NLP combine them in a boosting framework (we will see this in the following sessions)

  17. DecisionTrees Example: POS Tagging using DT • POS Tagging: "He was shot in the hand as he chased the robbers in the back street" (The Wall Street Journal Corpus) • [Figure: several words in the sentence carry alternative tags such as NN/VB/JJ, illustrating the ambiguities to be resolved.]

  18. DecisionTrees POS Tagging using Decision Trees (Màrquez, PhD 1999) • [Diagram: Raw text → Morphological analysis → POS tagging, where a Language Model feeds the Disambiguation Algorithm → Tagged text]

  19. DecisionTrees POS Tagging using Decision Trees (Màrquez, PhD 1999) • [Diagram: the same pipeline, with Decision Trees as the language model used by the Disambiguation Algorithm]

  20. DecisionTrees POS Tagging using Decision Trees (Màrquez, PhD 1999) • [Diagram: the Language Model feeds three alternative disambiguation algorithms: RTT, STT, and RELAX]

  21. DecisionTrees DT-based Language Modelling • The "preposition-adverb" tree: root: P(IN)=0.81, P(RB)=0.19; branch Word Form = "As"/"as": P(IN)=0.83, P(RB)=0.17; branch tag(+1) = RB: P(IN)=0.13, P(RB)=0.87; branch tag(+2) = IN, leaf: P(IN)=0.013, P(RB)=0.987 • Statistical interpretation: P̂(RB | word="As/as" & tag(+1)=RB & tag(+2)=IN) = 0.987; P̂(IN | word="As/as" & tag(+1)=RB & tag(+2)=IN) = 0.013

  22. DecisionTrees DT-based Language Modelling • The same "preposition-adverb" tree path (Word Form = "As"/"as", tag(+1) = RB, tag(+2) = IN, leaf: P(IN)=0.013, P(RB)=0.987) captures collocations: "as_RB much_RB as_IN", "as_RB soon_RB as_IN", "as_RB well_RB as_IN"
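The tree fragment above can be hand-coded to show how the statistical interpretation is read off. A sketch, with an invented nested-dict representation; the "others" branches of the real tree are omitted:

```python
# A hand-coded fragment of the "preposition-adverb" tree from the slide.
# Each node tests one attribute of the context, and every node stores the
# class distribution estimated from the training set at that point.
pa_tree = {
    "dist": {"IN": 0.81, "RB": 0.19},
    "test": "word",
    "branches": {
        ("As", "as"): {
            "dist": {"IN": 0.83, "RB": 0.17},
            "test": "tag+1",
            "branches": {
                ("RB",): {
                    "dist": {"IN": 0.13, "RB": 0.87},
                    "test": "tag+2",
                    "branches": {
                        ("IN",): {"dist": {"IN": 0.013, "RB": 0.987}},  # leaf
                    },
                },
            },
        },
    },
}

def tag_distribution(tree, context):
    """Follow the branches matched by the context; the distribution at the
    deepest node reached is the statistical interpretation P(tag | context)."""
    node = tree
    while "test" in node:
        value = context[node["test"]]
        nxt = next((sub for vals, sub in node["branches"].items() if value in vals), None)
        if nxt is None:
            break  # the real tree's 'others' branches are omitted in this sketch
        node = nxt
    return node["dist"]
```

For the collocation context "as ... RB IN" the lookup returns the leaf distribution, heavily favouring RB; for an unmatched context it falls back to a shallower node's distribution.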

  23. DecisionTrees Language Modelling using DTs • Granularity? Ambiguity-class level: adjective-noun, adjective-noun-verb, etc. • Algorithm: Top-Down Induction of Decision Trees (TDIDT); supervised learning; CART (Breiman et al. 84), C4.5 (Quinlan 95), etc. • Attributes: local context, a (-3,+2) window of tokens • Particular implementation, minimizing the effect of over-fitting, data fragmentation & sparseness: • Branch-merging • CART post-pruning • Smoothing • Attributes with many values • Several functions for attribute selection

  24. DecisionTrees Model Evaluation The Wall Street Journal(WSJ) annotated corpus • 1,170,000 words • Tagset size: 45 tags • Noise: 2-3% of mistagged words • 49,000 word-form frequency lexicon • Manual filtering of 200 most frequent entries • 36.4% ambiguous words • 2.44 (1.52) average tags per word • 243 ambiguity classes

  25. DecisionTrees Model Evaluation The Wall Street Journal (WSJ) annotated corpus • [Charts: number of ambiguity classes that cover x% of the training corpus; arity of the classification problems]

  26. DecisionTrees • 12 ambiguity classes: they cover 57.90% of the ambiguous occurrences! • Experimental setting: 10-fold cross validation

  27. DecisionTrees N-fold Cross Validation

  Divide the training set S into a partition of n equal-size disjoint subsets: s1, s2, ..., sn
  for i := 1 to n do
    learn and test a classifier using:
      training_set := ∪ sj for all j different from i
      validation_set := si
  end-for
  return: the average accuracy from the n experiments

  Which is a good value for N? (2-10-...) Extreme case (N = training-set size): leave-one-out
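The procedure above can be sketched directly in Python; `learn` and `test` are caller-supplied callbacks standing in for "learn and test a classifier":

```python
def n_fold_cross_validation(examples, n, learn, test):
    """N-fold cross validation as in the slide: split S into n equal-size
    disjoint subsets, train on n-1 of them, validate on the held-out one,
    and average the accuracy.  `learn(training_set)` returns a classifier;
    `test(clf, validation_set)` returns its accuracy on that set.
    n == len(examples) gives the leave-one-out extreme case."""
    folds = [examples[i::n] for i in range(n)]       # disjoint subsets s1..sn
    accuracies = []
    for i in range(n):
        validation_set = folds[i]
        training_set = [e for j, fold in enumerate(folds) if j != i for e in fold]
        accuracies.append(test(learn(training_set), validation_set))
    return sum(accuracies) / n
```

The striding split (`examples[i::n]`) is one simple way to get roughly equal-size disjoint subsets; shuffling first would usually be advisable on ordered data.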

  28. DecisionTrees Size: Number of Nodes • Average size reduction: 51.7% / 46.5%; 74.1% (total)

  29. DecisionTrees Accuracy • No loss in accuracy (at least)

  30. DecisionTrees Feature Selection Criteria • The compared criteria are statistically equivalent

  31. DecisionTrees DT-based POS Taggers • Tree Base = Statistical Component • RTT: Reductionistic Tree-based tagger (Màrquez & Rodríguez 97) • STT: Statistical Tree-based tagger (Màrquez & Rodríguez 99) • Tree Base = Compatibility Constraints • RELAX: Relaxation-Labelling based tagger (Màrquez & Padró 97)

  32. DecisionTrees RTT (Màrquez & Rodríguez 97) • [Diagram: Raw text → Morphological analysis → Disambiguation loop (Classify → Filter → Update, repeated until the stop? condition holds) driven by the Language Model → Tagged text]

  33. DecisionTrees STT (Màrquez & Rodríguez 99) N-grams (trigrams)

  34. DecisionTrees STT (Màrquez & Rodríguez 99) • Contextual probabilities, estimated using Decision Trees

  35. DecisionTrees STT (Màrquez & Rodríguez 99) • Language Model: lexical probs. + contextual probs. • [Diagram: Raw text → Morphological analysis → Disambiguation via the Viterbi algorithm → Tagged text]

  36. DecisionTrees STT+ (Màrquez & Rodríguez 99) • Language Model: lexical probs. + contextual probs. + N-grams • [Diagram: Raw text → Morphological analysis → Disambiguation via the Viterbi algorithm → Tagged text]
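The Viterbi decoding step that STT/STT+ rely on can be sketched generically. This is a plain first-order Viterbi over log-probabilities, not the thesis implementation: in STT the contextual probabilities would be read off the decision trees (and in STT+ combined with N-grams), while here `lex_prob` and `ctx_prob` are invented callback names:

```python
import math

def viterbi(words, tags, lex_prob, ctx_prob):
    """Pick the tag sequence maximising the product of lexical probabilities
    P(tag | word) and contextual probabilities P(tag | previous tag).
    Zero probabilities are floored at 1e-12 to keep the logs finite."""
    # delta[t] = best log-prob of a partial sequence ending in tag t
    delta = {t: math.log(lex_prob(words[0], t) or 1e-12) for t in tags}
    backptrs = []
    for w in words[1:]:
        back, new_delta = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: delta[p] + math.log(ctx_prob(p, t) or 1e-12))
            new_delta[t] = (delta[best_prev]
                            + math.log(ctx_prob(best_prev, t) or 1e-12)
                            + math.log(lex_prob(w, t) or 1e-12))
            back[t] = best_prev
        delta, backptrs = new_delta, backptrs + [back]
    # follow the back-pointers from the best final tag
    seq = [max(tags, key=lambda t: delta[t])]
    for back in reversed(backptrs):
        seq.append(back[seq[-1]])
    return list(reversed(seq))
```

Swapping a richer contextual model into `ctx_prob` (tree-estimated probabilities, trigrams) changes nothing in the decoding itself, which is the point of the STT/STT+ split.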

  37. DecisionTrees • Tree Base = Statistical Component • RTT: Reductionistic Tree-based tagger (Màrquez & Rodríguez 97) • STT: Statistical Tree-based tagger (Màrquez & Rodríguez 99) • Tree Base = Compatibility Constraints • RELAX: Relaxation-Labelling based tagger (Màrquez & Padró 97)

  38. DecisionTrees RELAX (Màrquez & Padró 97) • Language Model: linguistic rules + N-grams + set of constraints • [Diagram: Raw text → Morphological analysis → Disambiguation via Relaxation Labelling (Padró 96) → Tagged text]

  39. DecisionTrees RELAX (Màrquez & Padró 97) Translating Trees into Constraints • The "preposition-adverb" tree path (Word Form = "As"/"as", tag(+1) = RB, tag(+2) = IN, leaf: P(IN)=0.013, P(RB)=0.987) becomes: • Positive constraint: 2.37 (RB) (0 "as" "As") (1 RB) (2 IN) • Negative constraint: -5.81 (IN) (0 "as" "As") (1 RB) (2 IN) • Compatibility values: estimated using Mutual Information
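How a tree branch might be turned into signed constraints can be illustrated with a mutual-information-style score. This is a hedged sketch with invented helper names; the exact estimation used in (Màrquez & Padró 97) may differ, so the compatibility values it produces will not match the slide's 2.37 / -5.81 exactly:

```python
import math

def branch_to_constraints(conditions, leaf_dist, prior):
    """Turn one tree branch into RELAX-style constraints.  Each (tag, p) pair
    at the leaf yields a constraint whose compatibility is the MI-style score
    log(P(tag | path) / P(tag)): positive when the context favours the tag,
    negative when it disfavours it.  `conditions` is the list of (position,
    value) tests along the branch; `prior` holds the root distribution."""
    constraints = []
    for tag, p in leaf_dist.items():
        compat = math.log(p / prior[tag])
        constraints.append((round(compat, 2), tag, tuple(conditions)))
    return constraints
```

On the preposition-adverb branch this yields one positive constraint for RB and one negative constraint for IN, mirroring the structure (if not the exact magnitudes) of the slide's examples.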

  40. DecisionTrees Experimental Evaluation Using the WSJ annotated corpus • Training set: 1,121,776 words • Test set: 51,990 words • Closed vocabulary assumption • Base of 194 trees • Covering 99.5% of the ambiguous occurrences • Storage requirement: 565 Kb • Acquisition time: 12 CPU-hours (Common LISP / Sparc10 workstation)

  41. DecisionTrees Experimental Evaluation RTT results • 67.52% error reduction with respect to MFT • Accuracy = 94.45% (ambiguous), 97.29% (overall) • Comparable to the best state-of-the-art automatic POS taggers • Recall = 98.22%, Precision = 95.73% (1.08 tags/word) • RTT allows a tradeoff between precision and recall to be set
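The recall/precision figures above make sense because a reductionistic tagger may leave more than one tag per word (hence 1.08 tags/word). A sketch of how such scores are computed, under that interpretation of the slide's numbers:

```python
def tagger_recall_precision(proposed, gold):
    """For a tagger that may propose several tags per word:
    recall    = words whose gold tag is among the proposed tags / total words
    precision = correct proposed tags / total proposed tags.
    `proposed` is a list of tag lists, `gold` a list of single gold tags.
    Leaving extra tags in place raises recall at the cost of precision."""
    n = len(gold)
    correct = sum(1 for tags, g in zip(proposed, gold) if g in tags)
    recall = correct / n
    precision = correct / sum(len(tags) for tags in proposed)
    return recall, precision
```

Fully disambiguating (exactly one tag per word) makes the two measures coincide with plain accuracy; that is the tradeoff RTT exposes.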

  42. DecisionTrees Experimental Evaluation STT results • Comparable to those of RTT • STT allows the incorporation of N-gram information, so some problems of sparseness and of coherence of the resulting tag sequence can be alleviated STT+ results • Better than those of RTT and STT

  43. DecisionTrees Experimental Evaluation Including trees into RELAX • Translation of 44 representative trees covering 84% of the examples = 8,473 constraints • Addition of: • bigrams (2,808 binary constraints) • trigrams (52,161 ternary constraints) • linguistically-motivated manual constraints (20)

  44. DecisionTrees Accuracy of RELAX • [Table: accuracies 92.82, 91.82, 92.72, 91.35 for different constraint combinations] • MFT = baseline, B = bigrams, T = trigrams, C = "tree constraints", H = set of 20 hand-written linguistic rules

  45. DecisionTrees Decision Trees: Summary • Advantages • Acquires symbolic knowledge in an understandable way • Very well-studied ML algorithms and variants • Can be easily translated into rules • Availability of software: C4.5, C5.0, etc. • Can be easily integrated into an ensemble

  46. DecisionTrees Decision Trees: Summary • Drawbacks • Computationally expensive when scaling to large natural-language domains: training examples, features, etc. • Data sparseness and data fragmentation: the problem of small disjuncts => probability estimation • DTs are a model with high variance (unstable) • Tendency to overfit the training data: pruning is necessary • Requires considerable effort in tuning the model
