## Seminar: Statistical NLP

Machine Learning for Natural Language Processing

Lluís Màrquez

TALP Research Center

Llenguatges i Sistemes Informàtics

Universitat Politècnica de Catalunya

Girona, June 2003

There are many general-purpose definitions of Machine Learning (or artificial learning):

Making a computer automatically acquire some kind of knowledge from a concrete data domain

ML4NLP

Machine Learning

- Learners are computers: we study learning algorithms
- Resources are scarce: time, memory, data, etc.
- It has (almost) nothing to do with: Cognitive science, neuroscience, theory of scientific discovery and research, etc.
- Biological plausibility is welcome but not the main goal

We will concentrate on:

Supervised inductive learning for classification

= discriminative learning

Machine Learning

- Learning... but what for?
- To perform some particular task
- To react to environmental inputs
- Concept learning from data:
  - modelling concepts underlying data
  - predicting unseen observations
  - compacting the knowledge representation
  - knowledge discovery for expert systems

Machine Learning

A more precise definition:

Obtaining a description of the concept in some representation language that explains observations and helps predict new instances of the same distribution

What to read? Machine Learning (Mitchell, 1997)

Empirical NLP

90’s: Application of Machine Learning (ML) techniques to NLP problems

- Lexical and structural ambiguity problems (classification problems):
  - Word selection (SR, MT)
  - Part-of-speech tagging
  - Semantic ambiguity (polysemy)
  - Prepositional phrase attachment
  - Reference ambiguity (anaphora)
  - etc.
- What to read? Foundations of Statistical Natural Language Processing (Manning & Schütze, 1999)

Ambiguity is a crucial problem for natural language understanding/processing. Ambiguity Resolution = Classification

NLP “classification” problems

- He was shot in the hand as he chased the robbers in the back street (The Wall Street Journal Corpus)
- Morpho-syntactic ambiguity: Part of Speech Tagging

[Figure: candidate tags NN, VB and JJ shown over the ambiguous words of the sentence]

- Semantic (lexical) ambiguity: Word Sense Disambiguation

[Figure: candidate senses body-part and clock-part]

- Structural (syntactic) ambiguity: PP-attachment disambiguation

He was shot in the hand as he (chased (the robbers)NP (in the back street)PP)

Outline

- Machine Learning for NLP

- The Classification Problem
- Three ML Algorithms in detail
- Applications to NLP

Classification

Feature Vector Classification (AI perspective)

An instance is a vector x = <x1, …, xn> whose components, called features (or attributes), are discrete or real-valued.

Let X be the space of all possible instances.

Let Y = {y1, …, ym} be the set of categories (or classes).

The goal is to learn an unknown target function, f : X → Y

A training example is an instance x belonging to X, labelled with the correct value for f(x), i.e., a pair <x, f(x)>

Let D be the set of all training examples.

Classification

Feature Vector Classification

- The goal is to find a function h belonging to H such that for all pairs <x, f(x)> belonging to D, h(x) = f(x)
- The hypotheses space, H, is the set of functions h : X → Y that the learner can consider as possible definitions
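The definitions of instance, training set and hypothesis can be made concrete with a short sketch. The toy data set, the feature names COLOR and SHAPE, and the helper `is_consistent` are illustrative assumptions, not part of the original slides.

```python
from typing import Callable, Dict, List, Tuple

# An instance x is a vector of feature values; here a dict keyed by feature name.
Instance = Dict[str, str]
# A training example is a pair <x, f(x)>.
Example = Tuple[Instance, str]
# A hypothesis h maps instances in X to categories in Y.
Hypothesis = Callable[[Instance], str]

# D: a hypothetical set of training examples (made up for illustration).
D: List[Example] = [
    ({"COLOR": "red", "SHAPE": "circle"}, "positive"),
    ({"COLOR": "red", "SHAPE": "triangle"}, "negative"),
    ({"COLOR": "blue", "SHAPE": "circle"}, "negative"),
]


def h_rule(x: Instance) -> str:
    """(COLOR=red) and (SHAPE=circle) => positive; otherwise negative."""
    if x["COLOR"] == "red" and x["SHAPE"] == "circle":
        return "positive"
    return "negative"


def h_always_negative(x: Instance) -> str:
    """A second candidate hypothesis that ignores its input."""
    return "negative"


def is_consistent(h: Hypothesis, data: List[Example]) -> bool:
    """True if h(x) = f(x) for every pair <x, f(x)> in the data."""
    return all(h(x) == y for x, y in data)
```

Under these assumptions, `is_consistent(h_rule, D)` holds while `is_consistent(h_always_negative, D)` does not, so only the first hypothesis would be an acceptable answer for the learner.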

Classification

An Example

Rules

(COLOR=red) ∧ (SHAPE=circle) ⇒ positive

otherwise ⇒ negative

Decision Tree

- COLOR
  - blue: negative
  - red: SHAPE
    - circle: positive
    - triangle: negative

Classification

An Example

Rules

(SIZE=small) ∧ (SHAPE=circle) ⇒ positive

(SIZE=big) ∧ (COLOR=red) ⇒ positive

otherwise ⇒ negative

Decision Tree

- SIZE
  - small: SHAPE
    - circle: pos
    - triang: neg
  - big: COLOR
    - red: pos
    - blue: neg

Classification

Some important concepts

Inductive Bias

“Any means that a classification learning system uses to choose between two functions that are both consistent with the training data is called inductive bias” (Mooney & Cardie, 99)

Language / Search bias

[Figure: the decision tree of the first example (COLOR at the root, SHAPE under the red branch)]

Classification

Some important concepts

- Inductive Bias
- Training error and generalization error
- Generalization ability and overfitting
- Batch Learning vs. on-line Learning
- Symbolic vs. statistical Learning
- Propositional vs. first-order learning

Classification

Propositional vs. Relational Learning

- Propositional learning

color(red) ∧ shape(circle) ⇒ classA

- Relational learning = ILP (induction of logic programs)

course(X) ∧ person(Y) ∧ link_to(Y,X) ⇒ instructor_of(X,Y)

research_project(X) ∧ person(Z) ∧ link_to(L1,X,Y) ∧ link_to(L2,Y,Z) ∧ neighbour_word_people(L1) ⇒ member_proj(X,Z)

The Classification Setting (CoLT/SLT perspective)

Class, Point, Example, Data Set, ...

- Input Space: X ⊆ R^n
- (binary) Output Space: Y = {+1, -1}
- A point, pattern or instance: x ∈ X, x = (x1, x2, …, xn)
- Example: (x, y) with x ∈ X, y ∈ Y
- Training Set: a set of m examples generated i.i.d. according to an unknown distribution P(x,y): S = {(x1, y1), …, (xm, ym)} ∈ (X × Y)^m

The Classification Setting

Learning, Error, ...

- The hypotheses space, H, is the set of functions h: X → Y that the learner can consider as possible definitions. In SVMs they are of the form: [equation not preserved in the transcript]
- The goal is to find a function h belonging to H such that the expected misclassification error on new examples, also drawn from P(x,y), is minimal (Risk Minimization, RM)

The Classification Setting

Learning, Error, ...

- Expected error (risk): [equation not preserved in the transcript]
- Problem: P itself is unknown. Only the training examples are known ⇒ an induction principle is needed
- Empirical Risk Minimization (ERM): Find the function h belonging to H for which the training error (empirical risk) is minimal
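For reference, the expected risk and the empirical risk are usually written as below. This is the standard textbook form under 0-1 loss; the symbols R, R_emp and the indicator function are assumptions and may differ from the notation used on the original slide.

```latex
R(h) = \int \mathbb{1}\,[h(x) \neq y] \; dP(x, y)
\qquad
R_{emp}(h) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\,[h(x_i) \neq y_i]
\qquad
h_{ERM} = \arg\min_{h \in H} R_{emp}(h)
```

R(h) cannot be computed because P is unknown, whereas R_emp(h) depends only on the m training examples in S, which is what makes ERM a usable induction principle.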

The Classification Setting

Error, Over(under)fitting, ...

- Low training error ⇒ low true error?
- The overfitting dilemma:

[Figure: underfitting vs. overfitting illustration (Müller et al., 2001)]

- Trade-off between training error and complexity
- Different learning biases can be used

Outline

- Machine Learning for NLP

- The Classification Problem
- Three ML Algorithms
  - Decision Trees
  - AdaBoost
  - Support Vector Machines
- Applications to NLP

Learning Paradigms

- Statistical learning:
  - HMM, Bayesian Networks, ME, CRF, etc.
- Traditional methods from Artificial Intelligence (ML, AI)
  - Decision trees/lists, exemplar-based learning, rule induction, neural networks, etc.
- Methods from Computational Learning Theory (CoLT/SLT)
  - Winnow, AdaBoost, SVMs, etc.

Learning Paradigms

- Classifier combination:
  - Bagging, Boosting, Randomization, ECOC, Stacking, etc.
- Semi-supervised learning: learning from labelled and unlabelled examples
  - Bootstrapping, EM, Transductive learning (SVMs, AdaBoost), Co-Training, etc.
- etc.

Decision Trees

- Decision trees are a way to represent rules underlying training data, with hierarchical structures that recursively partition the data.
- They have been used by many research communities (Pattern Recognition, Statistics, ML, etc.) for data exploration with some of the following purposes: Description, Classification, and Generalization.
- From a machine-learning perspective: Decision Trees are n-ary branching trees that represent classification rules for classifying the objects of a certain domain into a set of mutually exclusive classes

Decision Trees

- Acquisition: Top-Down Induction of Decision Trees (TDIDT)
- Systems:
  - CART (Breiman et al. 84)
  - ID3, C4.5, C5.0 (Quinlan 86, 93, 98)
  - ASSISTANT, ASSISTANT-R (Cestnik et al. 87) (Kononenko et al. 95)
  - etc.

Algorithms

An Example

[Figure: a generic decision tree whose internal nodes test features (A2, A3, A5), whose branches are labelled with values (v1 … v7) and whose leaves carry classes (C1, C2, C3), shown next to the SIZE / SHAPE / COLOR example tree with pos and neg leaves]

Algorithms

General Induction Algorithm

    function TDIDT (X: set-of-examples; A: set-of-features)
      var: tree1, tree2: decision-tree;
           X’: set-of-examples;
           A’: set-of-features
      end-var
      if (stopping_criterion(X)) then
        tree1 := create_leaf_tree(X)
      else
        amax := feature_selection(X, A);
        tree1 := create_tree(X, amax);
        for-all val in values(amax) do
          X’ := select_examples(X, amax, val);
          A’ := A - {amax};
          tree2 := TDIDT(X’, A’);
          tree1 := add_branch(tree1, tree2, val)
        end-for
      end-if
      return(tree1)
    end-function
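The TDIDT pseudocode can be sketched as runnable Python. Several choices here are assumptions rather than anything fixed by the slides: the stopping criterion (pure node or no features left), information gain as the `feature_selection` function, the nested-tuple tree representation, and the toy data.

```python
import math
from collections import Counter


def entropy(examples):
    """Shannon entropy (in bits) of the class labels of a list of (x, y) pairs."""
    counts = Counter(y for _, y in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(examples, feature):
    """Entropy reduction obtained by splitting the examples on one feature."""
    total = len(examples)
    remainder = 0.0
    for val in {x[feature] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[feature] == val]
        remainder += len(subset) / total * entropy(subset)
    return entropy(examples) - remainder


def majority_class(examples):
    """Most frequent class label among the examples."""
    return Counter(y for _, y in examples).most_common(1)[0][0]


def tdidt(examples, features):
    """Return a class label (leaf) or a (feature, {value: subtree}) node."""
    labels = {y for _, y in examples}
    # Stopping criterion (an assumption): pure node or no features left.
    if len(labels) == 1 or not features:
        return majority_class(examples)
    a_max = max(features, key=lambda a: information_gain(examples, a))
    remaining = [a for a in features if a != a_max]
    branches = {}
    # Only values observed in the examples get a branch.
    for val in sorted({x[a_max] for x, _ in examples}):
        subset = [(x, y) for x, y in examples if x[a_max] == val]
        branches[val] = tdidt(subset, remaining)
    return (a_max, branches)


def classify(tree, x):
    """Follow the branches matching x until a leaf is reached."""
    while isinstance(tree, tuple):
        feature, branches = tree
        tree = branches[x[feature]]
    return tree


# Hypothetical toy data, chosen to match the rule
# (COLOR=red) and (SHAPE=circle) => positive; otherwise negative.
DATA = [
    ({"COLOR": "red", "SHAPE": "circle"}, "positive"),
    ({"COLOR": "red", "SHAPE": "triangle"}, "negative"),
    ({"COLOR": "blue", "SHAPE": "circle"}, "negative"),
    ({"COLOR": "blue", "SHAPE": "triangle"}, "negative"),
]
TREE = tdidt(DATA, ["COLOR", "SHAPE"])
```

Because branches are created only for observed values, `classify` raises a `KeyError` on an unseen feature value; a practical system would fall back to the majority class at that node.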

Algorithms

Feature Selection Criteria

- Functions derived from Information Theory:
  - Information Gain, Gain Ratio (Quinlan 86)
- Functions derived from Distance Measures
  - Gini Diversity Index (Breiman et al. 84)
  - RLM (López de Mántaras 91)
- Statistically-based
  - Chi-square test (Sestito & Dillon 94)
  - Symmetrical Tau (Zhou & Dillon 91)
- RELIEFF-IG: variant of RELIEFF (Kononenko 94)
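Two of the criteria named above, the Gini diversity index and the gain ratio, can be illustrated with a minimal sketch. The formulations are the common textbook ones and are assumed to match the cited papers only in spirit; the helper `split_score` and the toy data are hypothetical.

```python
import math
from collections import Counter


def gini(labels):
    """Gini diversity index: 1 - sum_k p_k^2 over the class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def split_score(values, labels, impurity):
    """Impurity reduction when the examples are partitioned by feature value."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / total * impurity(g) for g in groups.values())
    return impurity(labels) - remainder


def gain_ratio(values, labels):
    """Information gain divided by the entropy of the split itself."""
    split_info = entropy(values)
    if split_info == 0.0:
        return 0.0  # a feature with a single value cannot split anything
    return split_score(values, labels, entropy) / split_info


# Hypothetical toy data: one feature value per example.
labels = ["pos", "pos", "neg", "neg"]
color = ["red", "red", "blue", "blue"]    # perfectly predictive
size = ["small", "big", "small", "big"]   # uninformative
```

With these made-up values both criteria rank `color` above `size`, which is the behaviour a `feature_selection` step relies on.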

Algorithms

Extensions of DTs (Murthy 95)

- Pruning (pre/post)
- Minimize the effect of the greedy approach: lookahead
- Non-linear splits
- Combination of multiple models
- Incremental learning (on-line)
- etc.

Decision Trees and NLP

- Speech processing (Bahl et al. 89; Bakiri & Dietterich 99)
- POS Tagging (Cardie 93, Schmid 94b; Magerman 95; Màrquez & Rodríguez 95,97; Màrquez et al. 00)
- Word sense disambiguation (Brown et al. 91; Cardie 93; Mooney 96)
- Parsing (Magerman 95,96; Haruno et al. 98,99)
- Text categorization (Lewis & Ringuette 94; Weiss et al. 99)
- Text summarization (Mani & Bloedorn 98)
- Dialogue act tagging (Samuel et al. 98)

Decision Trees and NLP

- Noun phrase coreference (Aone & Bennett 95; McCarthy & Lehnert 95)
- Discourse analysis in information extraction (Soderland & Lehnert 94)
- Cue phrase identification in text and speech (Litman 94; Siegel & McKeown 94)
- Verb classification in Machine Translation (Tanaka 96; Siegel 97)

Decision Trees: pros&cons

- Advantages
  - Acquires symbolic knowledge in an understandable way
  - Very well studied ML algorithms and variants
  - Can be easily translated into rules
  - Existence of available software: C4.5, C5.0, etc.
  - Can be easily integrated into an ensemble

Decision Trees: pros&cons

- Drawbacks
  - Computationally expensive when scaling to large natural language domains: training examples, features, etc.
  - Data sparseness and data fragmentation: the problem of the small disjuncts => Probability estimation
  - DTs are a high-variance (unstable) model
  - Tendency to overfit training data: pruning is necessary
  - Requires quite a big effort in tuning the model

Boosting algorithms

- Idea

“to combine many simple and moderately accurate hypotheses (weak classifiers) into a single and highly accurate classifier”

- AdaBoost (Freund & Schapire 95) has been theoretically and empirically studied extensively
- Many other variants and extensions (1997-2003)

http://www.lsi.upc.es/~lluism/seminari/ml&nlp.html

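Algorithms

AdaBoost: general scheme

[Figure: TRAINING: a chain of T Weak Learner calls on training samples TS1, TS2, …, TST, paired with distributions D1, D2, …, DT and linked by a probability distribution updating step, yields weak hypotheses h1, h2, …, hT with weights α1, α2, …, αT. TEST: the weak hypotheses are combined as F(h1, h2, ..., hT)]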
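The general scheme can be sketched as a small discrete AdaBoost with decision stumps as the weak learner. The stump learner, the early stop at 50% weighted error, the clamp on zero error and the toy data are assumptions made for illustration; the slides do not fix any of them.

```python
import math


def train_stump(X, y, weights):
    """Weak learner: best single-feature threshold test under the current weights."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for sign in (1, -1):
                preds = [sign if x[j] <= thr else -sign for x in X]
                err = sum(w for w, p, t in zip(weights, preds, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best


def stump_predict(stump, x):
    """Apply a stump (err, feature index, threshold, sign) to one instance."""
    _, j, thr, sign = stump
    return sign if x[j] <= thr else -sign


def adaboost(X, y, rounds):
    """Return a list of (alpha_t, h_t) pairs; labels must be +1 or -1."""
    m = len(X)
    weights = [1.0 / m] * m            # D_1: uniform distribution
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, weights)
        err = max(stump[0], 1e-10)     # avoid division by zero for a perfect stump
        if err >= 0.5:                 # no better than chance: stop
            break
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, stump))
        # Distribution update: raise the weight of misclassified examples.
        weights = [w * math.exp(-alpha * t * stump_predict(stump, x))
                   for w, x, t in zip(weights, X, y)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the weighted vote over the weak hypotheses."""
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical 1-D data that no single stump can separate.
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost(X, y, rounds=3)
```

On this toy set no single threshold classifies all six points, but the weighted vote of three stumps does, which is the "many moderately accurate hypotheses" idea in miniature.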

AdaBoost and NLP

- POS Tagging (Abney et al. 99; Màrquez 99)
- Text and Speech Categorization (Schapire & Singer 98; Schapire et al. 98; Weiss et al. 99)
- PP-attachment Disambiguation (Abney et al. 99)
- Parsing (Haruno et al. 99)
- Word Sense Disambiguation (Escudero et al. 00, 01)
- Shallow parsing (Carreras & Màrquez, 01a; 02)
- Email spam filtering (Carreras & Màrquez, 01b)
- Term Extraction (Vivaldi et al. 01)

AdaBoost: pros&cons

- Easy to implement and few parameters to set
- Time and space grow linearly with number of examples. Ability to manage very large learning problems
- Does not constrain explicitly the complexity of the learner
- Naturally combines feature selection with learning
- Has been successfully applied to many practical problems

AdaBoost: pros&cons

- Seems to be rather robust to overfitting (number of rounds) but sensitive to noise
- Performance is very good when there are relatively few relevant terms (features)
- Can perform poorly when there is insufficient training data relative to the complexity of the base classifiers, or when the training errors of the base classifiers become too large too quickly

SVM: A General Definition

- “Support Vector Machines (SVM) are learning systems that use a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimisation theory that implements a learning bias derived from statistical learning theory”. (Cristianini & Shawe-Taylor, 2000)

Key Concepts

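[Figure: positive (+) and negative (_) points on either side of a separating hyperplane with normal vector w]

Algorithms

Linear Classifiers

- Hyperplanes in R^N.
- Defined by a weight vector (w) and a threshold (b).
- They induce a classification rule: [equation not preserved in the transcript]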
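A hyperplane given by a weight vector and a threshold is commonly turned into a classifier by taking the sign of the affine function. The sketch below assumes the convention sign(<w, x> + b), with ties assigned to the positive class; the original slide may use a different sign for b. The function name and the example hyperplane are hypothetical.

```python
from typing import Sequence


def linear_classify(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 if <w, x> + b >= 0, otherwise -1."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if activation >= 0 else -1


# Hypothetical hyperplane x1 + x2 - 1 = 0 in R^2.
w, b = [1.0, 1.0], -1.0
```

Under this convention the point (2, 2) falls on the positive side of the example hyperplane and the origin on the negative side.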

Algorithms

Non-linear SVMs

- Implicit mapping into feature space via kernel functions

[Equations not preserved in the transcript; their labels were: Set of hypotheses, Dual formulation, Kernel function, Evaluation]

Seminari SVMs 22/05/2001

Algorithms

Non-linear SVMs

- Kernel functions
  - Must be efficiently computable
  - Characterization via Mercer’s theorem
  - One of the curious facts about using a kernel is that we do not need to know the underlying feature map in order to be able to learn in the feature space! (Cristianini & Shawe-Taylor, 2000)
  - Examples: polynomials, Gaussian radial basis functions, two-layer sigmoidal neural networks, etc.
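The remark about not needing the feature map can be checked numerically for a small case. The sketch below uses the homogeneous degree-2 polynomial kernel on 2-D inputs, for which an explicit feature map is known, plus a Gaussian RBF kernel; the function names, the gamma value and the sample vectors are assumptions for illustration.

```python
import math


def dot(x, z):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))


def poly_kernel(x, z, degree=2):
    """Homogeneous polynomial kernel K(x, z) = <x, z>^degree."""
    return dot(x, z) ** degree


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian radial basis function kernel exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def phi_degree2(x):
    """Explicit feature map for the degree-2 kernel on 2-D inputs."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


x, z = [1.0, 2.0], [3.0, -1.0]
implicit = poly_kernel(x, z)                     # computed in input space
explicit = dot(phi_degree2(x), phi_degree2(z))   # computed in feature space
```

Both routes give the same inner product up to rounding, but the kernel never builds the feature vectors, which is what keeps very high-dimensional feature spaces tractable.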


Algorithms

Non-linear SVMs

Degree 3 polynomial kernel

[Figure: two panels, one with a linearly non-separable data set and one with a linearly separable data set]

Toy Examples

- All examples have been run with the 2D graphic interface of LIBSVM (Chang and Lin, National Taiwan University)

“LIBSVM is an integrated software for support vector classification, (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM). It supports multi-class classification. The basic algorithm is a simplification of both SMO by Platt and SVMLight by Joachims. It is also a simplification of the modification 2 of SMO by Keerthi et al. Our goal is to help users from other fields to easily use SVM as a tool. LIBSVM provides a simple interface where users can easily link it with their own programs…”

- Available from: www.csie.ntu.edu.tw/~cjlin/libsvm (it includes a Web integrated demo tool)

Algorithms

Toy Examples (I)

Linearly separable data set

Linear SVM

Maximal margin Hyperplane

What happens if we add a blue training example here?

Toy Examples (I)

(still) Linearly separable data set

Linear SVM

High value of C parameter

Maximal margin Hyperplane

The example is correctly classified

Toy Examples (I)

(still) Linearly separable data set

Linear SVM

Low value of C parameter

Trade-off between: margin and training error

The example is now a bounded SV
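The role of C can be read off the usual soft-margin primal problem. This is the standard textbook form, which is also the one LIBSVM documents for C-SVC; the slack variables ξ_i are not named in the slides and are introduced here as an assumption.

```latex
\min_{w,\, b,\, \xi} \;\; \frac{1}{2}\,\lVert w \rVert^{2} + C \sum_{i=1}^{m} \xi_i
\qquad \text{subject to} \qquad
y_i \left( \langle w, x_i \rangle + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, m
```

A high C makes margin violations expensive, so the hyperplane moves to classify the added example correctly. A low C accepts some violations in exchange for a wider margin; examples whose dual coefficient reaches the upper bound C are the bounded support vectors.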

SVM: Summary

- SVMs introduced in COLT’92 (Boser, Guyon, & Vapnik, 1992). Great development since then
- Kernel-induced feature spaces: SVMs work efficiently in very high dimensional feature spaces (+)
- Learning bias: maximal margin optimisation. Reduces the danger of overfitting. Generalization bounds for SVMs (+)
- Compact representation of the induced hypothesis. The solution is sparse in terms of SVs (+)

SVM: Summary

- Due to Mercer’s conditions on the kernels, the optimisation problems are convex. No local minima (+)
- Optimisation theory guides the implementation. Efficient learning (+)
- Mainly for classification but also for regression, density estimation, clustering, etc.
- Success in many real-world applications: OCR, vision, bioinformatics, speech recognition, NLP: TextCat, POS tagging, chunking, parsing, etc. (+)
- Parameter tuning (–). Implications in convergence times, sparsity of the solution, etc.

NLP problems

- Warning! We will not focus on final NLP applications, but on intermediate tasks...
- We will classify the NLP tasks according to their (structural) complexity

NLP problems: structural complexity

- Decisional problems
  - Text Categorization, Document filtering, Word Sense Disambiguation, etc.
- Sequence tagging and detection of sequential structures
  - POS tagging, Named Entity extraction, syntactic chunking, etc.
- Hierarchical structures
  - Clause detection, full parsing, IE of complex concepts, composite Named Entities, etc.

Applications

POS tagging

- He was shot in the hand as he chased the robbers in the back street (The Wall Street Journal Corpus)
- Morpho-syntactic ambiguity: Part of Speech Tagging

[Figure: candidate tags NN, VB and JJ shown over the ambiguous words of the sentence]

Applications

POS tagging: “preposition-adverb” tree

- root: P(IN)=0.81, P(RB)=0.19; test on Word Form
  - “As”, “as”: P(IN)=0.83, P(RB)=0.17; test on tag(+1)
    - RB: P(IN)=0.13, P(RB)=0.87; test on tag(+2)
      - IN: leaf with P(IN)=0.013, P(RB)=0.987
      - others: ...
    - others: ...
  - others: ...

Probabilistic interpretation:

P̂(RB | word=“A/as” ∧ tag(+1)=RB ∧ tag(+2)=IN) = 0.987

P̂(IN | word=“A/as” ∧ tag(+1)=RB ∧ tag(+2)=IN) = 0.013

Collocations:

“as_RB much_RB as_IN”

“as_RB soon_RB as_IN”

“as_RB well_RB as_IN”

Applications

POS tagging: RTT (Màrquez & Rodríguez 97)

[Figure: tagger architecture. Raw text → Morphological analysis → Disambiguation → Tagged text. The disambiguation stage is labelled Language Model, Classify, Update, Filter and a stop? test (yes / no)]

A Sequential Model for Multi-class Classification: NLP/POS Tagging (Even-Zohar & Roth, 01)

Applications

POS tagging: STT (Màrquez & Rodríguez 97)

[Figure: tagger architecture. Raw text → Morphological analysis → Disambiguation → Tagged text. The disambiguation stage is labelled Language Model (Lexical probs. + Contextual probs.) and Viterbi algorithm]

The Use of Classifiers in sequential inference: Chunking (Punyakanok & Roth, 00)
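The combination of lexical probabilities, contextual probabilities and the Viterbi algorithm can be illustrated with a generic bigram decoder. This is a sketch of the textbook algorithm, not the STT tagger's actual model; the function signature, the probability tables and all numbers are invented for the example.

```python
def viterbi(words, tags, lexical, contextual, initial):
    """Most probable tag sequence under a bigram model.

    lexical[tag][word]    : P(word | tag)
    contextual[prev][tag] : P(tag | prev)
    initial[tag]          : P(tag at sentence start)
    """
    # best[i][t] = (probability of the best path ending in tag t at position i, backpointer)
    best = [{t: (initial.get(t, 0.0) * lexical[t].get(words[0], 0.0), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            emit = lexical[t].get(words[i], 0.0)
            prob, prev = max(
                ((best[i - 1][p][0] * contextual[p].get(t, 0.0) * emit, p) for p in tags),
                key=lambda pair: pair[0],
            )
            column[t] = (prob, prev)
        best.append(column)
    # Backtrack from the most probable final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))


# Made-up probabilities for a two-tag model (IN = preposition, RB = adverb).
TAGS = ["IN", "RB"]
LEXICAL = {"IN": {"as": 0.6}, "RB": {"as": 0.2, "soon": 0.5}}
CONTEXTUAL = {"IN": {"IN": 0.8, "RB": 0.2}, "RB": {"IN": 0.4, "RB": 0.6}}
INITIAL = {"IN": 0.3, "RB": 0.7}

tagged = viterbi(["as", "soon", "as"], TAGS, LEXICAL, CONTEXTUAL, INITIAL)
```

With these invented tables the decoder returns RB, RB, IN for "as soon as". Real taggers work with log-probabilities to avoid underflow on long sentences; plain products are kept here for readability.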

Detection of sequential and hierarchical structures

- Named Entity recognition
- Clause detection

Summary/conclusions

- We have briefly outlined:
  - The ML setting: “supervised learning for classification”
  - Three concrete machine learning algorithms
  - How to apply them to solve intermediate NLP tasks

Summary/conclusions

- Any ML algorithm for NLP should be:
  - Robust to noise and outliers
  - Efficient in large feature/example spaces
  - Adaptive to new/changing domains: portability, tuning, etc.
  - Able to take advantage of unlabelled examples: semi-supervised learning

Summary/conclusions

- Statistical and ML-based Natural Language Processing is a very active and multidisciplinary area of research
- Some current research lines
  - Appropriate learning paradigm for all kinds of NLP problems: TiMBL (DBZ99), TBEDL (Brill95), ME (Ratnaparkhi98), SNoW (Roth98), CRF (Pereira & Singer02).
  - Definition of an adequate (and task-specific) feature space: mapping from the input space to a high dimensional feature space, kernels, etc.
  - Resolution of complex NLP problems: inference with classifiers + constraint satisfaction
  - etc.

Bibliography

- You may find additional information at:

http://www.lsi.upc.es/~lluism/

tesi.html

publicacions/pubs.html

cursos/talks.html

cursos/MLandNL.html

cursos/emnlp1.html

- This talk at:

http://www.lsi.upc.es/~lluism/udg03.ppt.gz
