## Seminar: Statistical NLP

**Machine Learning for Natural Language Processing**

Lluís Màrquez, TALP Research Center, Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya. Girona, June 2003.

**Outline**

- Machine Learning for NLP
- The Classification Problem
- Three ML Algorithms
- Applications to NLP

**ML4NLP: Machine Learning**

There are many general-purpose definitions of Machine Learning (or artificial learning): making a computer automatically acquire some kind of knowledge from a concrete data domain.

- Learners are computers: we study learning algorithms
- Resources are scarce: time, memory, data, etc.
- It has (almost) nothing to do with cognitive science, neuroscience, theory of scientific discovery and research, etc.
- Biological plausibility is welcome but not the main goal

**ML4NLP: Machine Learning**

- Learning... but what for?
  - To perform some particular task
  - To react to environmental inputs
  - Concept learning from data:
    - modelling concepts underlying data
    - predicting unseen observations
    - compacting the knowledge representation
    - knowledge discovery for expert systems

We will concentrate on supervised inductive learning for classification (= discriminative learning).

**ML4NLP: Machine Learning**

A more precise definition: obtaining a description of the concept in some representation language that explains observations and helps predict new instances of the same distribution.

What to read? Machine Learning (Mitchell, 1997)

**ML4NLP: Empirical NLP**

1990s: application of Machine Learning (ML) techniques to NLP problems.

- Lexical and structural ambiguity problems, which are classification problems:
  - Word selection (SR, MT)
  - Part-of-speech tagging
  - Semantic ambiguity (polysemy)
  - Prepositional phrase attachment
  - Reference ambiguity (anaphora)
  - etc.
- What to read? Foundations of Statistical Natural Language Processing (Manning & Schütze, 1999)

**ML4NLP: NLP "classification" problems**

Ambiguity is a crucial problem for natural language understanding/processing. Ambiguity resolution = classification.

Example sentence (The Wall Street Journal Corpus): "He was shot in the hand as he chased the robbers in the back street"

- Morpho-syntactic ambiguity, resolved by part-of-speech tagging: several words admit more than one tag (the slide marks the candidates NN, VB, JJ, VB, NN, VB).
- Semantic (lexical) ambiguity, resolved by word sense disambiguation: "hand" as body-part vs. clock-part.
- Structural (syntactic) ambiguity, resolved by PP-attachment disambiguation: He was shot in the hand as he (chased (the robbers)NP (in the back street)PP)

**Classification: Feature Vector Classification (AI perspective)**

- An instance is a vector $x = \langle x_1, \ldots, x_n \rangle$ whose components, called features (or attributes), are discrete or real-valued.
- Let $X$ be the space of all possible instances.
- Let $Y = \{y_1, \ldots, y_m\}$ be the set of categories (or classes).
- The goal is to learn an unknown target function, $f : X \to Y$.
- A training example is an instance $x$ belonging to $X$, labelled with the correct value for $f(x)$, i.e., a pair $\langle x, f(x) \rangle$.
- Let $D$ be the set of all training examples.

**Classification: Feature Vector Classification**

- The hypotheses space, $H$, is the set of functions $h : X \to Y$ that the learner can consider as possible definitions.
- The goal is to find a function $h$ belonging to $H$ such that for all pairs $\langle x, f(x) \rangle$ belonging to $D$, $h(x) = f(x)$.

**Classification: An Example**

Decision tree:

- COLOR = blue: negative
- COLOR = red: test SHAPE
  - SHAPE = circle: positive
  - SHAPE = triangle: negative

Rules:

- (COLOR=red) ∧ (SHAPE=circle) ⇒ positive
- otherwise ⇒ negative

**Classification: An Example**

Decision tree:

- SIZE = small: test SHAPE
  - SHAPE = circle: pos
  - SHAPE = triang: neg
- SIZE = big: test COLOR
  - COLOR = red: pos
  - COLOR = blue: neg

Rules:

- (SIZE=small) ∧ (SHAPE=circle) ⇒ positive
- (SIZE=big) ∧ (COLOR=red) ⇒ positive
- otherwise ⇒ negative

**Classification: Some important concepts**

- Inductive bias: "Any means that a classification learning system uses to choose between two functions that are both consistent with the training data is called inductive bias" (Mooney & Cardie, 99)
- Language / search bias (illustrated on the slide with the COLOR/SHAPE decision tree)

**Classification: Some important concepts**

- Inductive bias
- Training error and generalization error
- Generalization ability and overfitting
- Batch learning vs. on-line learning
- Symbolic vs. statistical learning
- Propositional vs. first-order learning

**Classification: Propositional vs. Relational Learning**

- Propositional learning:
  - color(red) ∧ shape(circle) ⇒ classA
- Relational learning = ILP (induction of logic programs):
  - course(X) ∧ person(Y) ∧ link_to(Y,X) ⇒ instructor_of(X,Y)
  - research_project(X) ∧ person(Z) ∧ link_to(L1,X,Y) ∧ link_to(L2,Y,Z) ∧ neighbour_word_people(L1) ⇒ member_proj(X,Z)

**Classification: The Classification Setting (Class, Point, Example, Data Set, ...)**

CoLT/SLT perspective:

- Input space: $X \subseteq \mathbb{R}^n$
- (Binary) output space: $Y = \{+1, -1\}$
- A point, pattern or instance: $x \in X$, $x = (x_1, x_2, \ldots, x_n)$
- Example: $(x, y)$ with $x \in X$, $y \in Y$
- Training set: a set of $m$ examples generated i.i.d. according to an unknown distribution $P(x, y)$: $S = \{(x_1, y_1), \ldots, (x_m, y_m)\} \in (X \times Y)^m$

**Classification: The Classification Setting (Learning, Error, ...)**

- The hypotheses space, $H$, is the set of functions $h : X \to Y$ that the learner can consider as possible definitions. In SVMs they are of the form: [formula not preserved in the source]
- The goal is to find a function $h$ belonging to $H$ such that the expected misclassification error on new examples, also drawn from $P(x, y)$, is minimal (Risk Minimization, RM).

**Classification: The Classification Setting (Learning, Error, ...)**

- Expected error (risk): [formula not preserved in the source]
- Problem: $P$ itself is unknown. Only the training examples are known ⇒ an induction principle is needed.
- Empirical Risk Minimization (ERM): find the function $h$ belonging to $H$ for which the training error (empirical risk) is minimal.

**Classification: The Classification Setting (Error, Over(under)fitting, ...)**

- Low training error ⇒ low true error?
- The overfitting dilemma (Müller et al., 2001): underfitting vs. overfitting
- Trade-off between training error and complexity
- Different learning biases can be used

**Outline**

- Machine Learning for NLP
- The Classification Problem
- Three ML Algorithms
  - Decision Trees
  - AdaBoost
  - Support Vector Machines
- Applications to NLP

**Algorithms: Learning Paradigms**

- Statistical learning: HMM, Bayesian Networks, ME, CRF, etc.
- Traditional methods from Artificial Intelligence (ML, AI): decision trees/lists, exemplar-based learning, rule induction, neural networks, etc.
- Methods from Computational Learning Theory (CoLT/SLT): Winnow, AdaBoost, SVMs, etc.
- Classifier combination: Bagging, Boosting, Randomization, ECOC, Stacking, etc.
- Semi-supervised learning (learning from labelled and unlabelled examples): Bootstrapping, EM, Transductive learning (SVMs, AdaBoost), Co-Training, etc.
- etc.

**Algorithms: Decision Trees**

- Decision trees are a way to represent rules underlying training data, with hierarchical structures that recursively partition the data.
- They have been used by many research communities (Pattern Recognition, Statistics, ML, etc.) for data exploration with some of the following purposes: Description, Classification, and Generalization.
- From a machine-learning perspective: Decision Trees are n-ary branching trees that represent classification rules for classifying the objects of a certain domain into a set of mutually exclusive classes.

**Algorithms: Decision Trees**

- Acquisition: Top-Down Induction of Decision Trees (TDIDT)
- Systems: CART (Breiman et al. 84), ID3, C4.5, C5.0 (Quinlan 86, 93, 98), ASSISTANT, ASSISTANT-R (Cestnik et al. 87) (Kononenko et al. 95), etc.

**Algorithms: An Example**

[Figure: a generic decision tree with attribute nodes A1, A2, A3, A5, branch values v1 to v7 and class leaves C1, C2, C3, shown next to the SIZE/SHAPE/COLOR tree with pos/neg leaves.]

**Algorithms: Learning Decision Trees**

- Training: Training Set + TDIDT = DT
- Test: Example + DT = Class

**Algorithms: General Induction Algorithm**

    function TDIDT (X: set-of-examples; A: set-of-features)
      var: tree1, tree2: decision-tree;
           X': set-of-examples;
           A': set-of-features
      end-var
      if (stopping_criterion(X)) then
        tree1 := create_leaf_tree(X)
      else
        amax := feature_selection(X, A);
        tree1 := create_tree(X, amax);
        for-all val in values(amax) do
          X' := select_examples(X, amax, val);
          A' := A - {amax};
          tree2 := TDIDT(X', A');
          tree1 := add_branch(tree1, tree2, val)
        end-for
      end-if
      return(tree1)
    end-function

**Algorithms: Feature Selection Criteria**

- Functions derived from Information Theory: Information Gain, Gain Ratio (Quinlan 86)
- Functions derived from Distance Measures: Gini Diversity Index (Breiman et al. 84), RLM (López de Mántaras 91)
- Statistically-based: Chi-square test (Sestito & Dillon 94), Symmetrical Tau (Zhou & Dillon 91)
- RELIEFF-IG: variant of RELIEFF (Kononenko 94)

**Algorithms: Extensions of DTs (Murthy 95)**

- Pruning (pre/post)
- Minimize the effect of the greedy approach: lookahead
- Non-linear splits
- Combination of multiple models
- Incremental learning (on-line)
- etc.

**Algorithms: Decision Trees and NLP**

- Speech processing (Bahl et al. 89; Bakiri & Dietterich 99)
- POS tagging (Cardie 93; Schmid 94b; Magerman 95; Màrquez & Rodríguez 95, 97; Màrquez et al. 00)
- Word sense disambiguation (Brown et al. 91; Cardie 93; Mooney 96)
- Parsing (Magerman 95, 96; Haruno et al. 98, 99)
- Text categorization (Lewis & Ringuette 94; Weiss et al. 99)
- Text summarization (Mani & Bloedorn 98)
- Dialogue act tagging (Samuel et al. 98)
- Noun phrase coreference (Aone & Bennett 95; McCarthy & Lehnert 95)
- Discourse analysis in information extraction (Soderland & Lehnert 94)
- Cue phrase identification in text and speech (Litman 94; Siegel & McKeown 94)
- Verb classification in Machine Translation (Tanaka 96; Siegel 97)

**Algorithms: Decision Trees: pros & cons**

Advantages:

- Acquires symbolic knowledge in an understandable way
- Very well studied ML algorithms and variants
- Can be easily translated into rules
- Existence of available software: C4.5, C5.0, etc.
- Can be easily integrated into an ensemble

Drawbacks:

- Computationally expensive when scaling to large natural language domains: training examples, features, etc.
- Data sparseness and data fragmentation: the problem of the small disjuncts ⇒ probability estimation
- DTs are a model with high variance (unstable)
- Tendency to overfit training data: pruning is necessary
- Requires quite a big effort in tuning the model

**Algorithms: Boosting algorithms**

- Idea: "to combine many simple and moderately accurate hypotheses (weak classifiers) into a single and highly accurate classifier"
- AdaBoost (Freund & Schapire 95) has been theoretically and empirically studied extensively
- Many other variants and extensions (1997-2003): http://www.lsi.upc.es/~lluism/seminari/ml&nlp.html

**Algorithms: AdaBoost: general scheme**

[Figure: TRAINING: training sets TS1, TS2, ..., TST with distributions D1, D2, ..., DT are fed to a weak learner in each round, producing weak hypotheses h1, h2, ..., hT, with the probability distribution updated between rounds. TEST: linear combination F(h1, h2, ..., hT) with weights a1, a2, ..., aT.]

**Algorithms: AdaBoost: algorithm**

(Freund & Schapire 97) [Algorithm listing not preserved in the source.]

**Algorithms: AdaBoost: example**

Weak hypotheses = vertical/horizontal hyperplanes.

[Figures: AdaBoost round 1, round 2, round 3.]
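
The AdaBoost example can be illustrated with a minimal sketch, assuming the standard discrete AdaBoost formulation attributed to Freund & Schapire: weak hypotheses are axis-parallel threshold rules (the vertical/horizontal hyperplanes of the example), each round reweights the training examples, and the final classifier is the sign of a weighted vote. The slide's own algorithm listing is not reproduced in this text, so the function names (`train_stump`, `stump_predict`, `adaboost`, `predict`) and the update formulas as written here are assumptions for illustration, not the author's code.

```python
import math

def train_stump(points, labels, weights):
    """Return (error, feature, threshold, sign) for the axis-parallel
    threshold rule with the lowest weighted training error."""
    best = None
    for j in range(len(points[0])):
        for threshold in sorted({p[j] for p in points}):
            for sign in (+1, -1):
                # Rule: predict `sign` if feature j < threshold, else -sign.
                err = sum(w for p, y, w in zip(points, labels, weights)
                          if (sign if p[j] < threshold else -sign) != y)
                if best is None or err < best[0]:
                    best = (err, j, threshold, sign)
    return best

def stump_predict(stump, p):
    _, j, threshold, sign = stump
    return sign if p[j] < threshold else -sign

def adaboost(points, labels, rounds):
    """Return a list of (alpha, stump) pairs; labels must be +1 or -1."""
    m = len(points)
    weights = [1.0 / m] * m          # D1: uniform distribution
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(points, labels, weights)
        err = max(stump[0], 1e-10)   # guard against log(1/0)
        if err >= 0.5:               # weak learner no better than chance
            break
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified examples, decrease the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, p))
                   for p, y, w in zip(points, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, p):
    """Sign of the weighted vote of all weak hypotheses."""
    score = sum(alpha * stump_predict(stump, p) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy data (hypothetical): no single threshold separates the classes,
# and the second coordinate is uninformative.
points = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
labels = [+1, +1, -1, -1, +1, +1]
ensemble = adaboost(points, labels, rounds=3)
print([predict(ensemble, p) for p in points])
```

On this toy set a single threshold rule cannot fit the training data, while the weighted vote of three rules can, which mirrors the three rounds shown in the example.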