Seminar: Statistical NLP

Seminar: Statistical NLP

Machine Learning for Natural Language Processing

Lluís Màrquez

TALP Research Center

Llenguatges i Sistemes Informàtics

Universitat Politècnica de Catalunya

Girona, June 2003


Outline

  • Machine Learning for NLP

  • The Classification Problem

  • Three ML Algorithms

  • Applications to NLP





There are many general-purpose definitions of Machine Learning (or artificial learning):

Making a computer automatically acquire some kind of knowledge from a concrete data domain

ML4NLP

Machine Learning

  • Learners are computers: we study learning algorithms

  • Resources are scarce: time, memory, data, etc.

  • It has (almost) nothing to do with: Cognitive science, neuroscience, theory of scientific discovery and research, etc.

  • Biological plausibility is welcome but not the main goal


We will concentrate on:

Supervised inductive learning for classification

= discriminative learning

ML4NLP

Machine Learning

  • Learning... but what for?

    • To perform some particular task

    • To react to environmental inputs

    • Concept learning from data:

      • modelling concepts underlying data

      • predicting unseen observations

      • compacting the knowledge representation

      • knowledge discovery for expert systems


ML4NLP

Machine Learning

A more precise definition:

Obtaining a description of the concept in some representation language that explains the observations and helps to predict new instances of the same distribution

What to read? Machine Learning (Mitchell, 1997)


Lexical and structural ambiguity problems

Word selection (SR, MT)

Part-of-speech tagging

Semantic ambiguity (polysemy)

Prepositional phrase attachment

Reference ambiguity (anaphora)

etc.

Classification problems

ML4NLP

Empirical NLP

90’s: Application of Machine Learning (ML) techniques to NLP problems

  • What to read? Foundations of Statistical Natural Language Processing (Manning & Schütze, 1999)


Ambiguity is a crucial problem for natural language understanding/processing. Ambiguity Resolution = Classification

ML4NLP

NLP “classification” problems

  • He was shot in the hand as he chased the robbers in the back street

(The Wall Street Journal Corpus)




Morpho-syntactic ambiguity: Part of Speech Tagging

ML4NLP

NLP “classification” problems

  • He was shot in the hand as he chased the robbers in the back street

(Candidate PoS tags shown over the ambiguous words in the slide: NN, VB, JJ, VB, NN, VB)

(The Wall Street Journal Corpus)




Semantic (lexical) ambiguity: Word Sense Disambiguation

ML4NLP

NLP “classification” problems

  • He was shot in the hand as he chased the robbers in the back street

(Candidate senses shown for “hand”: body-part vs. clock-part)

(The Wall Street Journal Corpus)




Structural (syntactic) ambiguity: PP-attachment disambiguation

ML4NLP

NLP “classification” problems

  • He was shot in the hand as he (chased (the robbers)NP (in the back street)PP)

(The Wall Street Journal Corpus)


Outline

  • Machine Learning for NLP

  • The Classification Problem

  • Three ML Algorithms in detail

  • Applications to NLP


An instance is a vector x = <x1, …, xn> whose components, called features (or attributes), are discrete or real-valued.

Let X be the space of all possible instances.

Let Y={y1,…, ym}be the set of categories (or classes).

The goal is to learn an unknown target function, f : X → Y

A training example is an instance x belonging to X, labelled with the correct value of f(x), i.e., a pair <x, f(x)>

Let D be the set of all training examples.

Classification

Feature Vector Classification

(AI perspective)


The goal is to find a function h belonging to H such that for every pair <x, f(x)> belonging to D, h(x) = f(x)

Classification

Feature Vector Classification

  • The hypotheses space, H, is the set of functions h: X → Y that the learner can consider as possible definitions
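
To make the notation concrete, here is a minimal Python sketch of this setting; the COLOR/SHAPE attributes anticipate the decision-tree example on the next slides, and all names are illustrative rather than taken from the slides.

X_features = ["COLOR", "SHAPE"]                       # attributes of an instance
Y = {"positive", "negative"}                          # set of categories (classes)

D = [                                                 # training examples <x, f(x)>
    ({"COLOR": "red",  "SHAPE": "circle"},   "positive"),
    ({"COLOR": "red",  "SHAPE": "triangle"}, "negative"),
    ({"COLOR": "blue", "SHAPE": "circle"},   "negative"),
]

def h(x):                                             # one hypothesis h: X -> Y
    if x["COLOR"] == "red" and x["SHAPE"] == "circle":
        return "positive"
    return "negative"

# h is consistent with D: h(x) = f(x) for every pair <x, f(x)> in D
assert all(h(x) == fx for x, fx in D)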


Classification

An Example

Decision Tree:

COLOR
  blue -> negative
  red  -> SHAPE
            circle   -> positive
            triangle -> negative

Rules:

(COLOR=red) ∧ (SHAPE=circle) ⇒ positive
otherwise ⇒ negative


Classification

An Example

Decision Tree:

SIZE
  small -> SHAPE
             circle   -> positive
             triangle -> negative
  big   -> COLOR
             red  -> positive
             blue -> negative

Rules:

(SIZE=small) ∧ (SHAPE=circle) ⇒ positive
(SIZE=big) ∧ (COLOR=red) ⇒ positive
otherwise ⇒ negative
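
A possible encoding of this second tree as a nested Python structure, with a generic classification routine; the representation is my own, not the one used by the systems cited later.

tree = {"feature": "SIZE", "branches": {
    "small": {"feature": "SHAPE",
              "branches": {"circle": "positive", "triangle": "negative"}},
    "big":   {"feature": "COLOR",
              "branches": {"red": "positive", "blue": "negative"}},
}}

def classify(node, x):
    # Follow the branch selected by the value of the tested feature
    # until a leaf (a plain class label) is reached.
    while isinstance(node, dict):
        node = node["branches"][x[node["feature"]]]
    return node

print(classify(tree, {"SIZE": "small", "SHAPE": "circle", "COLOR": "blue"}))  # positive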


Classification

Some Important Concepts

Inductive Bias

“Any means that a classification learning system uses to choose between two functions that are both consistent with the training data is called inductive bias” (Mooney & Cardie, 99)

Language / Search bias (e.g., the COLOR/SHAPE decision tree shown above)


Classification

Some Important Concepts

  • Training error and generalization error

  • Generalization ability and overfitting

  • Batch learning vs. on-line learning

  • Symbolic vs. statistical learning

  • Propositional vs. first-order learning


Classification

Propositional vs. Relational Learning

  • Propositional learning

    color(red) ∧ shape(circle) ⇒ classA

  • Relational (first-order) learning

    course(X) ∧ person(Y) ∧ link_to(Y,X) ⇒ instructor_of(X,Y)

    research_project(X) ∧ person(Z) ∧ link_to(L1,X,Y) ∧ link_to(L2,Y,Z) ∧ neighbour_word_people(L1) ⇒ member_proj(X,Z)


Classification

The Classification Setting: Class, Point, Example, Data Set, ...

(CoLT/SLT perspective)

  • Input Space: X ⊆ R^n

  • (binary) Output Space: Y = {+1, -1}

  • A point, pattern or instance: x ∈ X, x = (x1, x2, …, xn)

  • Example: (x, y) with x ∈ X, y ∈ Y

  • Training Set: a set of m examples generated i.i.d. according to an unknown distribution P(x,y): S = {(x1, y1), …, (xm, ym)} ∈ (X × Y)^m


Classification

The Classification Setting: Learning, Error, ...

  • The hypotheses space, H, is the set of functions h: X → Y that the learner can consider as possible definitions. In SVMs they are linear, of the form h(x) = sign(w · x + b)

  • The goal is to find a function h belonging to H such that the expected misclassification error on new examples, also drawn from P(x,y), is minimal (Risk Minimization, RM)


Classification

The Classification Setting: Learning, Error, ...

  • Expected error (risk)

  • Problem: P itself is unknown. Only the training examples are known ⇒ an induction principle is needed

  • Empirical Risk Minimization (ERM): Find the function h belonging to H for which the training error (empirical risk) is minimal
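
The risk functionals themselves appear only as images in the original slides; in standard statistical-learning-theory notation (assuming 0/1 loss and Y = {+1, -1}) they would read:

R(h) = \int \tfrac{1}{2}\,\lvert h(x) - y \rvert \; dP(x,y)
       \qquad \text{(expected risk)}

R_{\mathrm{emp}}(h) = \frac{1}{m} \sum_{i=1}^{m} \tfrac{1}{2}\,\lvert h(x_i) - y_i \rvert
       \qquad \text{(empirical risk on } S\text{)}

ERM then selects h* = argmin over h in H of R_emp(h).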


Classification

The Classification Setting: Error, Over(under)fitting, ...

  • Low training error ⇒ low true error?

  • The overfitting dilemma (figure from Müller et al., 2001, illustrating underfitting vs. overfitting)

  • Trade-off between training error and complexity

  • Different learning biases can be used




Outline

  • Machine Learning for NLP

  • The Classification Problem

  • Three ML Algorithms

    • Decision Trees

    • AdaBoost

    • Support Vector Machines

  • Applications to NLP


Algorithms

Learning Paradigms

  • Statistical learning:

    • HMM, Bayesian Networks, ME, CRF, etc.

  • Traditional methods from Artificial Intelligence (ML, AI)

    • Decision trees/lists, exemplar-based learning, rule induction, neural networks, etc.

  • Methods from Computational Learning Theory (CoLT/SLT)

    • Winnow, AdaBoost, SVM’s, etc.


Algorithms

Learning Paradigms

  • Classifier combination:

    • Bagging, Boosting, Randomization, ECOC, Stacking, etc.

  • Semi-supervised learning: learning from labelled and unlabelled examples

    • Bootstrapping, EM, Transductive learning (SVM’s, AdaBoost), Co-Training, etc.

  • etc.


Algorithms

Decision Trees

  • Decision trees are a way to represent rules underlying training data, with hierarchical structures that recursively partition the data.

  • They have been used by many research communities (Pattern Recognition, Statistics, ML, etc.) for data exploration with some of the following purposes: Description, Classification, and Generalization.

  • From a machine-learning perspective: Decision Trees are n-ary branching trees that represent classification rules for classifying the objects of a certain domain into a set of mutually exclusive classes


Algorithms

Decision Trees

  • Acquisition: Top-Down Induction of Decision Trees (TDIDT)

  • Systems:

    CART (Breiman et al. 84),

    ID3, C4.5, C5.0 (Quinlan 86,93,98),

    ASSISTANT, ASSISTANT-R (Cestnik et al. 87) (Kononenko et al. 95)

    etc.


Algorithms

An Example

[Figure: a generic n-ary decision tree with root feature A1 (values v1, v2, v3, ...), internal nodes testing features A2, A3, A5, ... and leaves labelled with classes C1, C2, C3; next to it, the SIZE / SHAPE / COLOR example tree shown above.]


Algorithms

Learning Decision Trees

Training: Training Set + TDIDT = DT

Test: DT + Example = Class


Algorithms

General Induction Algorithm

function TDIDT (X: set-of-examples; A: set-of-features)
var
  tree1, tree2: decision-tree;
  X': set-of-examples;
  A': set-of-features
end-var
if (stopping_criterion(X)) then
  tree1 := create_leaf_tree(X)
else
  amax := feature_selection(X,A);
  tree1 := create_tree(X, amax);
  for-all val in values(amax) do
    X' := select_examples(X,amax,val);
    A' := A - {amax};
    tree2 := TDIDT(X',A');
    tree1 := add_branch(tree1,tree2,val)
  end-for
end-if
return(tree1)
end-function
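
The pseudocode above translates almost line for line into code. A minimal Python sketch, with stopping_criterion and feature_selection reduced to trivial stubs (majority-class leaves, first remaining feature); the data representation follows the earlier feature-vector sketch and is not from the slides.

def tdidt(examples, features):
    # examples: list of (feature_dict, label) pairs; features: list of feature names
    labels = [y for _, y in examples]
    if len(set(labels)) <= 1 or not features:            # stopping_criterion(X)
        return max(set(labels), key=labels.count)        # create_leaf_tree: majority class
    a_max = features[0]                                  # feature_selection stub; a real
                                                         # criterion (e.g. Information Gain) goes here
    tree = {"feature": a_max, "branches": {}}            # create_tree(X, amax)
    for val in {x[a_max] for x, _ in examples}:          # for-all val in values(amax)
        subset = [(x, y) for x, y in examples if x[a_max] == val]   # select_examples
        subtree = tdidt(subset, [a for a in features if a != a_max])
        tree["branches"][val] = subtree                  # add_branch
    return tree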




Functions derived from Information Theory:

Information Gain, Gain Ratio (Quinlan 86)

Functions derived from Distance Measures

Gini Diversity Index (Breiman et al. 84)

RLM (López de Mántaras 91)

Statistically-based

Chi-square test (Sestito & Dillon 94)

Symmetrical Tau (Zhou & Dillon 91)

RELIEFF-IG: variant of RELIEFF (Kononenko 94)

Algorithms

Feature Selection Criteria
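
As an illustration of the first family of criteria, a small Python sketch of Information Gain (Quinlan 86); the example representation follows the TDIDT sketch above, and the helper names are my own.

import math

def entropy(labels):
    # H(S) = - sum_c p(c) * log2 p(c)
    n = len(labels)
    counts = [labels.count(v) for v in set(labels)]
    return -sum((c / n) * math.log2(c / n) for c in counts)

def information_gain(examples, feature):
    # Gain(S, A) = H(S) - sum_v (|S_v| / |S|) * H(S_v)
    labels = [y for _, y in examples]
    gain = entropy(labels)
    for val in {x[feature] for x, _ in examples}:
        subset = [y for x, y in examples if x[feature] == val]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain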


Algorithms

Extensions of DTs

(Murthy 95)

  • Pruning (pre/post)

  • Minimize the effect of the greedy approach: lookahead

  • Non-linear splits

  • Combination of multiple models

  • Incremental learning (on-line)

  • etc.


Algorithms

Decision Trees and NLP

  • Speech processing (Bahl et al. 89; Bakiri & Dietterich 99)

  • POS Tagging (Cardie 93, Schmid 94b; Magerman 95; Màrquez & Rodríguez 95,97; Màrquez et al. 00)

  • Word sense disambiguation (Brown et al. 91; Cardie 93; Mooney 96)

  • Parsing (Magerman 95,96; Haruno et al. 98,99)

  • Text categorization (Lewis & Ringuette 94; Weiss et al. 99)

  • Text summarization (Mani & Bloedorn 98)

  • Dialogue act tagging (Samuel et al. 98)


Algorithms

Decision Trees and NLP

  • Noun phrase coreference (Aone & Benett 95; Mc Carthy & Lehnert 95)

  • Discourse analysis in information extraction (Soderland & Lehnert 94)

  • Cue phrase identification in text and speech (Litman 94; Siegel & McKeown 94)

  • Verb classification in Machine Translation (Tanaka 96; Siegel 97)


Algorithms

Decision Trees: pros&cons

  • Advantages

    • Acquires symbolic knowledge in an understandable way

    • Very well studied ML algorithms and variants

    • Can be easily translated into rules

    • Existence of available software: C4.5, C5.0, etc.

    • Can be easily integrated into an ensemble


Algorithms

Decision Trees: pros&cons

  • Drawbacks

    • Computationally expensive when scaling to large natural language domains: training examples, features, etc.

    • Data sparseness and data fragmentation: the problem of the small disjuncts => Probability estimation

    • DTs are models with high variance (unstable)

    • Tendency to overfit training data: pruning is necessary

    • Requires quite a big effort in tuning the model


Algorithms

Boosting algorithms

  • Idea

    “to combine many simple and moderately accurate hypotheses (weak classifiers) into a single and highly accurate classifier”

  • AdaBoost (Freund & Schapire 95) has been theoretically and empirically studied extensively

  • Many other variants and extensions (1997-2003)

    http://www.lsi.upc.es/~lluism/seminari/ml&nlp.html


Algorithms

AdaBoost: general scheme

[Figure - TRAINING: weak learners are trained in sequence on training sets TS1, TS2, ..., TST whose probability distributions D1, D2, ..., DT are updated after each round, yielding weak hypotheses h1, h2, ..., hT. TEST: the combined classifier is the linear combination F(h1, h2, ..., hT) with weights a1, a2, ..., aT.]


Algorithms

AdaBoost: algorithm

(Freund & Schapire 97)
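
The algorithm itself is shown as an image in the original slides. Below is a compact Python sketch of binary AdaBoost with labels in {-1, +1}, assuming a weak_learner(X, y, D) callable that returns a hypothesis h with h(x) in {-1, +1}; all names are mine, not Freund & Schapire's notation.

import math

def adaboost(X, y, weak_learner, T):
    m = len(X)
    D = [1.0 / m] * m                                   # initial uniform distribution
    ensemble = []                                       # list of (alpha_t, h_t)
    for _ in range(T):
        h = weak_learner(X, y, D)
        eps = sum(D[i] for i in range(m) if h(X[i]) != y[i])   # weighted training error
        eps = max(eps, 1e-12)                           # avoid division by zero
        if eps >= 0.5:                                  # weak hypothesis no better than chance
            break
        alpha = 0.5 * math.log((1.0 - eps) / eps)       # weight of this weak hypothesis
        D = [D[i] * math.exp(-alpha * y[i] * h(X[i])) for i in range(m)]
        Z = sum(D)                                      # normalisation constant
        D = [d / Z for d in D]
        ensemble.append((alpha, h))
    def F(x):                                           # combined hypothesis F(h1, ..., hT)
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
    return F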


Algorithms

AdaBoost: example

Weak hypotheses = vertical/horizontal hyperplanes


Algorithms

AdaBoost: rounds 1, 2 and 3

[Figures only: the weak hypothesis chosen and the reweighted example distribution at each of the three rounds.]


Algorithms

Combined Hypothesis

www.research.att.com/~yoav/adaboost


Algorithms

AdaBoost and NLP

  • POS Tagging(Abney et al. 99; Màrquez 99)

  • Text and Speech Categorization(Schapire & Singer 98; Schapire et al. 98; Weiss et al. 99)

  • PP-attachment Disambiguation(Abney et al. 99)

  • Parsing(Haruno et al. 99)

  • Word Sense Disambiguation(Escudero et al. 00, 01)

  • Shallow parsing(Carreras & Màrquez, 01a; 02)

  • Email spam filtering(Carreras & Màrquez, 01b)

  • Term Extraction(Vivaldi, et al. 01)


Algorithms

AdaBoost: pros&cons

  • Easy to implement and few parameters to set

  • Time and space grow linearly with number of examples. Ability to manage very large learning problems

  • Does not constrain explicitly the complexity of the learner

  • Naturally combines feature selection with learning

  • Has been successfully applied to many practical problems


Algorithms

AdaBoost: pros&cons

  • Seems to be rather robust to overfitting (number of rounds) but sensitive to noise

  • Performance is very good when there are relatively few relevant terms (features)

  • Can perform poorly when there is insufficient training data relative to the complexity of the base classifiers, or when the training errors of the base classifiers become too large too quickly




Algorithms

SVM: A General Definition

  • “Support Vector Machines (SVM) are learning systems that use a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimisation theory that implements a learning bias derived from statistical learning theory”. (Cristianini & Shawe-Taylor, 2000)

Key Concepts


Algorithms

Linear Classifiers

[Figure: positive (+) and negative (-) training points in the plane, separated by a hyperplane with weight vector w.]

  • Hyperplanes in R^N.

  • Defined by a weight vector (w) and a threshold (b).

  • They induce a classification rule: h(x) = sign(w · x + b)
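
As a short illustration of that rule, a Python sketch of the sign(w · x + b) decision function (names and values are illustrative only):

def linear_classifier(w, b):
    # Returns h(x) = sign(<w, x> + b), the rule induced by hyperplane (w, b)
    def h(x):
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
    return h

h = linear_classifier(w=[2.0, -1.0], b=0.5)
print(h([1.0, 1.0]), h([-1.0, 2.0]))   # 1  -1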




Algorithms

Optimal Hyperplane: Geometric Intuition

[Figure: the maximal margin hyperplane; the training points that lie on the margin are the support vectors.]


Algorithms

Linearly separable data

Quadratic Programming
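
The optimisation problem on this slide is an image in the original; in its standard hard-margin form it is presumably the quadratic program

\min_{w,\,b}\ \tfrac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \left( \langle w, x_i \rangle + b \right) \ge 1, \qquad i = 1, \dots, m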


Algorithms

Non-separable case (soft margin)
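
Again the formulation appears only as an image; the usual soft-margin version adds slack variables ξ_i and the trade-off constant C that reappears in the toy examples below:

\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\,\lVert w \rVert^{2} + C \sum_{i=1}^{m} \xi_i
\quad \text{subject to} \quad
y_i \left( \langle w, x_i \rangle + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0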


Algorithms

Non-linear SVMs

  • Non-linear mapping

  • Set of hypotheses

  • Dual formulation

  • Kernel function

  • Evaluation

  • Implicit mapping into feature space via kernel functions


Algorithms

Non-linear SVMs

  • Kernel functions

    • Must be efficiently computable

    • Characterization via Mercer’s theorem

    • One of the curious facts about using a kernel is that we do not need to know the underlying feature map in order to be able to learn in the feature space! (Cristianini & Shawe-Taylor, 2000)

    • Examples: polynomials, Gaussian radial basis functions, two-layer sigmoidal neural networks, etc.
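
Two of the kernels just mentioned, written out as a small Python sketch (vectors as plain lists; parameter names and default values are illustrative):

import math

def polynomial_kernel(x, z, degree=3, c=1.0):
    # K(x, z) = (<x, z> + c)^degree
    return (sum(xi * zi for xi, zi in zip(x, z)) + c) ** degree

def rbf_kernel(x, z, gamma=0.5):
    # K(x, z) = exp(-gamma * ||x - z||^2)  (Gaussian radial basis function)
    return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))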


Algorithms

Non-linear SVMs

[Figure: a degree 3 polynomial kernel maps a linearly non-separable data set into a feature space where it becomes linearly separable.]


Algorithms

Toy Examples

  • All examples have been run with the 2D graphic interface of LIBSVM (Chang and Lin, National Taiwan University)

    “LIBSVM is an integrated software for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM). It supports multi-class classification. The basic algorithm is a simplification of both SMO by Platt and SVMLight by Joachims. It is also a simplification of the modification 2 of SMO by Keerthi et al. Our goal is to help users from other fields to easily use SVM as a tool. LIBSVM provides a simple interface where users can easily link it with their own programs…”

  • Available from: www.csie.ntu.edu.tw/~cjlin/libsvm (it includes a Web-integrated demo tool)


Algorithms

Toy Examples (I)

What happens if we add a blue training example here?

Linearly separable data set

Linear SVM

Maximal margin Hyperplane


Algorithms

Toy Examples (I)

(still) Linearly separable data set

Linear SVM

High value of C parameter

Maximal margin Hyperplane

The example is correctly classified


Algorithms

Toy Examples (I)

(still) Linearly separable data set

Linear SVM

Low value of C parameter

Trade-off between: margin and training error

The example is now a bounded SV


Algorithms

Toy Examples (II) and (III)

[Figures only: further LIBSVM 2D demos; these slides contain no text.]


Algorithms

SVM: Summary

  • SVMs introduced in COLT’92 (Boser, Guyon, & Vapnik, 1992). Great development since then

  • Kernel-induced feature spaces: SVMs work efficiently in very high dimensional feature spaces (+)

  • Learning bias: maximal margin optimisation. Reduces the danger of overfitting. Generalization bounds for SVMs (+)

  • Compact representation of the induced hypothesis. The solution is sparse in terms of SVs (+)


Algorithms

SVM: Summary

  • Due to Mercer’s conditions on the kernels the optimisation problems are convex. No local minima (+)

  • Optimisation theory guides the implementation. Efficient learning (+)

  • Mainly for classification but also for regression, density estimation, clustering, etc.

  • Success in many real-world applications: OCR, vision, bioinformatics, speech recognition, NLP: TextCat, POS tagging, chunking, parsing, etc. (+)

  • Parameter tuning (–). Implications in convergence times, sparsity of the solution, etc.


Outline

  • Machine Learning for NLP

  • The Classification Problem

  • Three ML Algorithms

  • Applications to NLP


Applications

NLP problems

  • Warning! We will not focus on final NLP applications, but on intermediate tasks...

  • We will classify the NLP tasks according to their (structural) complexity


Applications

NLP problems: structural complexity

  • Decisional problems

    • Text Categorization, Document filtering, Word Sense Disambiguation, etc.

  • Sequence tagging and detection of sequential structures

    • POS tagging, Named Entity extraction, syntactic chunking, etc.

  • Hierarchical structures

    • Clause detection, full parsing, IE of complex concepts, composite Named Entities, etc.


Morpho-syntactic ambiguity: Part of Speech Tagging

Applications

POS tagging

  • He was shot in the hand as he chased the robbers in the back street

(Candidate PoS tags shown over the ambiguous words in the slide: NN, VB, JJ, VB, NN, VB)

(The Wall Street Journal Corpus)


Applications

POS tagging

“preposition-adverb” tree:

root:  P(IN)=0.81, P(RB)=0.19
  Word Form = “As”/“as” (vs. others):  P(IN)=0.83, P(RB)=0.17
    tag(+1) = RB (vs. others):  P(IN)=0.13, P(RB)=0.87
      tag(+2) = IN  ->  leaf:  P(IN)=0.013, P(RB)=0.987

Probabilistic interpretation (estimated probabilities):

P( RB | word=“as” ∧ tag(+1)=RB ∧ tag(+2)=IN ) = 0.987

P( IN | word=“as” ∧ tag(+1)=RB ∧ tag(+2)=IN ) = 0.013


Applications

POS tagging

The same “preposition-adverb” tree (shown above).

Collocations:

“as_RB much_RB as_IN”

“as_RB soon_RB as_IN”

“as_RB well_RB as_IN”


Applications

POS tagging

A Sequential Model for Multi-class Classification: NLP/POS Tagging (Even-Zohar & Roth, 01)

RTT (Màrquez & Rodríguez 97)

[Figure: Raw text -> Morphological analysis -> iterative Disambiguation (Classify, Filter, Update, guided by a Language Model, repeated until a stop? test is satisfied) -> Tagged text.]


Applications

POS tagging

The Use of Classifiers in Sequential Inference: Chunking (Punyakanok & Roth, 00)

STT (Màrquez & Rodríguez 97)

[Figure: Raw text -> Morphological analysis -> Disambiguation with the Viterbi algorithm, using a Language Model built from lexical probabilities + contextual probabilities -> Tagged text.]
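
To make the figure concrete, here is a generic Viterbi sketch in Python for such a tagger, combining per-word lexical probabilities with bigram contextual probabilities; the structure and names are mine, not the STT implementation.

def viterbi_tag(words, tags, lex_prob, ctx_prob):
    # lex_prob[t][w]      ~ P(word | tag)        (lexical probabilities)
    # ctx_prob[t_prev][t] ~ P(tag  | previous)   (contextual probabilities)
    V = [{t: lex_prob[t].get(words[0], 1e-12) for t in tags}]
    back = [{}]
    for i, w in enumerate(words[1:], start=1):
        V.append({})
        back.append({})
        for t in tags:
            best_prev = max(tags, key=lambda p: V[i - 1][p] * ctx_prob[p].get(t, 1e-12))
            V[i][t] = (V[i - 1][best_prev]
                       * ctx_prob[best_prev].get(t, 1e-12)
                       * lex_prob[t].get(w, 1e-12))
            back[i][t] = best_prev
    best = max(tags, key=lambda t: V[-1][t])          # best final tag
    path = [best]
    for i in range(len(words) - 1, 0, -1):            # follow the back-pointers
        path.insert(0, back[i][path[0]])
    return path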


Applications

Detection of sequential and hierarchical structures

  • Named Entity recognition

  • Clause detection


Conclusions

Summary/conclusions

  • We have briefly outlined:

    • The ML setting: “supervised learning for classification”

    • Three concrete machine learning algorithms

    • How to apply them to solve intermediate NLP tasks


Any ML algorithm for NLP should be:

Robust to noise and outliers

Efficient in large feature/example spaces

Adaptive to new/changing domains: portability, tuning, etc.

Able to take advantage of unlabelled examples: semi-supervised learning

Conclusions

Summary/conclusions


Conclusions

Summary/conclusions

  • Statistical and ML-based Natural Language Processing is a very active and multidisciplinary area of research


Conclusions

Some current research lines

  • Appropriate learning paradigm for all kinds of NLP problems: TiMBL (DBZ99), TBEDL (Brill95), ME (Ratnaparkhi98), SNoW (Roth98), CRF (Pereira & Singer02).

  • Definition of an adequate (and task-specific) feature space: mapping from the input space to a high dimensional feature space, kernels, etc.

  • Resolution of complex NLP problems: inference with classifiers + constraint satisfaction

  • etc.


Conclusions

Bibliography

  • You may find additional information at:

    http://www.lsi.upc.es/~lluism/

    tesi.html

    publicacions/pubs.html

    cursos/talks.html

    cursos/MLandNL.html

    cursos/emnlp1.html

  • This talk at:

    http://www.lsi.upc.es/~lluism/udg03.ppt.gz


Seminar: Statistical NLP

Machine Learning for Natural Language Processing

Lluís Màrquez

TALP Research Center

Llenguatges i Sistemes Informàtics

Universitat Politècnica de Catalunya

Girona, June 2003