
CMSC 723 / LING 645: Intro to Computational Linguistics

September 22, 2004: Dorr

Porter Stemmer, Intro to Probabilistic NLP and N-grams (chap 6.1-6.3)

Prof. Bonnie J. Dorr
Dr. Christof Monz
TA: Adam Lee

Computational Morphology (continued)

  • The Rules and the Lexicon

    • General versus Specific

    • Regular versus Irregular

    • Accuracy, speed, space

    • The Morphology of a language

  • Approaches

    • Lexicon only

    • Lexicon and Rules

      • Finite-state Automata

      • Finite-state Transducers

    • Rules only

Lexicon-Free Morphology: Porter Stemmer

  • Lexicon-Free FST Approach

  • By Martin Porter (1980)

  • Cascade of substitutions given specific conditions






Porter Stemmer

  • Definitions

    • C = string of one or more consonants, where a consonant is anything other than A E I O U or (Y preceded by C)

    • V = string of one or more vowels

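    • M = measure, roughly the number of syllables

    • Words = (C)*(V*C*)^M(V)*, i.e., [C](VC)^M[V] in Porter's notation: an optional C, then M repetitions of VC, then an optional V

      • M=0: TR, EE, TREE, Y, BY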



  • Conditions

    • *S - stem ends with S

    • *v* - stem contains a V

    • *d - stem ends with double C, e.g., -TT, -SS

    • *o - stem ends CVC, where second C is not W, X or Y, e.g., -WIL, -HOP
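
The definitions and conditions above can be sketched in code. This is a minimal illustration, not Porter's reference implementation; the function names (`cv_pattern`, `measure`, `contains_vowel`, `ends_double_consonant`, `ends_cvc`) are assumptions made for this example.

```python
import re

VOWELS = "aeiou"

def cv_pattern(stem: str) -> str:
    """Label each letter 'c' or 'v'; Y counts as a vowel only after a consonant."""
    pattern = []
    for i, ch in enumerate(stem.lower()):
        if ch in VOWELS:
            pattern.append("v")
        elif ch == "y" and i > 0 and pattern[i - 1] == "c":
            pattern.append("v")
        else:
            pattern.append("c")
    return "".join(pattern)

def measure(stem: str) -> int:
    """M = number of VC sequences in the form [C](VC)^M[V]."""
    return len(re.findall(r"v+c+", cv_pattern(stem)))

def contains_vowel(stem: str) -> bool:
    """Condition *v*: the stem contains a vowel."""
    return "v" in cv_pattern(stem)

def ends_double_consonant(stem: str) -> bool:
    """Condition *d: the stem ends with a doubled consonant (-TT, -SS)."""
    s = stem.lower()
    return len(s) >= 2 and s[-1] == s[-2] and cv_pattern(s)[-1] == "c"

def ends_cvc(stem: str) -> bool:
    """Condition *o: the stem ends CVC and the final C is not W, X or Y."""
    s = stem.lower()
    return cv_pattern(s)[-3:] == "cvc" and s[-1] not in "wxy"
```

For example, `measure("tree")` is 0 and `measure("trouble")` is 1, while `ends_cvc("hop")` holds and `ends_cvc("fail")` does not.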


Porter Stemmer

Step 1: Plural Nouns and Third Person Singular Verbs

SSES -> SS    caresses -> caress

IES -> I    ponies -> poni, ties -> ti

SS -> SS    caress -> caress

S -> ε    cats -> cat
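
A minimal sketch of Step 1 as an ordered list of suffix substitutions, where the first matching suffix wins. The function name `step1_plurals` is an assumption for this example, and input is assumed to be lowercase.

```python
def step1_plurals(word: str) -> str:
    """Apply the Step 1 rules in order; only the first matching rule fires."""
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(w, "->", step1_plurals(w))
```

The order matters: the SS -> SS rule exists only to stop "caress" from reaching the S rule.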

Step 2a: Verbal Past Tense and Progressive Forms

(M>0) EED -> EE    feed -> feed, agreed -> agree

i (*v*) ED -> ε    plastered -> plaster, bled -> bled

ii (*v*) ING -> ε    motoring -> motor, sing -> sing

Step 2b: If 2a.i or 2a.ii is successful, Cleanup

AT -> ATE    conflat(ed) -> conflate

BL -> BLE    troubl(ed) -> trouble

IZ -> IZE    siz(ed) -> size

(*d and not (*L or *S or *Z)) -> single letter    hopp(ing) -> hop, tann(ed) -> tan, hiss(ing) -> hiss, fizz(ed) -> fizz

(M=1 and *o) -> E    fail(ing) -> fail, fil(ing) -> file


Porter Stemmer

Step 3: Y -> I

(*v*) Y -> I    happy -> happi, sky -> sky

Porter Stemmer

Step 4: Derivational Morphology I: Multiple Suffixes

(m>0) ATIONAL -> ATE    relational -> relate

(m>0) TIONAL -> TION    conditional -> condition, rational -> rational

(m>0) ENCI -> ENCE    valenci -> valence

(m>0) ANCI -> ANCE    hesitanci -> hesitance

(m>0) IZER -> IZE    digitizer -> digitize

(m>0) ABLI -> ABLE    conformabli -> conformable

(m>0) ALLI -> AL    radicalli -> radical

(m>0) ENTLI -> ENT    differentli -> different

(m>0) ELI -> E    vileli -> vile

(m>0) OUSLI -> OUS    analogousli -> analogous

(m>0) IZATION -> IZE    vietnamization -> vietnamize

(m>0) ATION -> ATE    predication -> predicate

(m>0) ATOR -> ATE    operator -> operate

(m>0) ALISM -> AL    feudalism -> feudal

(m>0) IVENESS -> IVE    decisiveness -> decisive

(m>0) FULNESS -> FUL    hopefulness -> hopeful

(m>0) OUSNESS -> OUS    callousness -> callous

(m>0) ALITI -> AL    formaliti -> formal

(m>0) IVITI -> IVE    sensitiviti -> sensitive

(m>0) BILITI -> BLE    sensibiliti -> sensible

Porter Stemmer

Step 5: Derivational Morphology II: More Multiple Suffixes

(m>0) ICATE -> IC    triplicate -> triplic

(m>0) ATIVE -> ε    formative -> form

(m>0) ALIZE -> AL    formalize -> formal

(m>0) ICITI -> IC    electriciti -> electric

(m>0) ICAL -> IC    electrical -> electric

(m>0) FUL -> ε    hopeful -> hope

(m>0) NESS -> ε    goodness -> good


Porter Stemmer

Step 6: Derivational Morphology III: Single Suffixes

(m>1) AL -> ε    revival -> reviv

(m>1) ANCE -> ε    allowance -> allow

(m>1) ENCE -> ε    inference -> infer

(m>1) ER -> ε    airliner -> airlin

(m>1) IC -> ε    gyroscopic -> gyroscop

(m>1) ABLE -> ε    adjustable -> adjust

(m>1) IBLE -> ε    defensible -> defens

(m>1) ANT -> ε    irritant -> irrit

(m>1) EMENT -> ε    replacement -> replac

(m>1) MENT -> ε    adjustment -> adjust

(m>1) ENT -> ε    dependent -> depend

(m>1 and (*S or *T)) ION -> ε    adoption -> adopt

(m>1) OU -> ε    homologou -> homolog

(m>1) ISM -> ε    communism -> commun

(m>1) ATE -> ε    activate -> activ

(m>1) ITI -> ε    angulariti -> angular

(m>1) OUS -> ε    homologous -> homolog

(m>1) IVE -> ε    effective -> effect

(m>1) IZE -> ε    bowdlerize -> bowdler


Porter Stemmer

Step 7a: Cleanup

(m>1) E -> ε    probate -> probat, rate -> rate

(m=1 and not *o) E -> ε    cease -> ceas

Step 7b: More Cleanup

(m > 1 and *d and *L) -> single letter    controll -> control, roll -> roll

Porter Stemmer

  • Errors of Omission (related forms that are not conflated)

    • European / Europe

    • analysis / analyzes

    • matrices / matrix

    • noise / noisy

    • explain / explanation

  • Errors of Commission (unrelated forms that are wrongly conflated)

    • organization / organ

    • doing / doe

    • generalization / generic

    • numerical / numerous

    • university / universe

From Krovetz ‘93

Why (not) Statistics for NLP?

  • Pro

    • Disambiguation

    • Error Tolerant

    • Learnable

  • Con

    • Not always appropriate

    • Difficult to debug

Weighted Automata/Transducers

  • Speech recognition: storing a pronunciation lexicon

  • Augmentation of FSA: Each arc is associated with a probability

Pronunciation network for “about”

Probability Definitions

  • Experiment (trial)

    • Repeatable procedure with well-defined possible outcomes

  • Sample space

    • Complete set of outcomes

  • Event

    • Any subset of outcomes from sample space

  • Random Variable

    • Uncertain outcome in a trial

More Definitions

  • Probability

    • How likely is it to get a particular outcome?

    • Rate of getting that outcome in all trials

      Probability of drawing a spade from 52 well-shuffled playing cards: 13/52 = 1/4 = .25

  • Distribution: Probabilities associated with each outcome a random variable can take

    • Each outcome has probability between 0 and 1

    • The sum of all outcome probabilities is 1.
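
The card example can be checked by enumerating the sample space. This is a small sketch; the variable names are assumptions for illustration, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

SUITS = ["spades", "hearts", "diamonds", "clubs"]

# Sample space: 52 equally likely outcomes (13 ranks in each of 4 suits).
deck = [(rank, suit) for suit in SUITS for rank in range(1, 14)]

# Event: the subset of outcomes whose suit is spades.
p_spade = Fraction(sum(1 for _, suit in deck if suit == "spades"), len(deck))

# Distribution of the random variable "suit of the drawn card".
distribution = {
    s: Fraction(sum(1 for _, suit in deck if suit == s), len(deck)) for s in SUITS
}

print(p_spade)                     # probability of the event "spade"
print(sum(distribution.values()))  # the outcome probabilities sum to 1
```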





Conditional Probability

  • What is P(A|B)?

  • First, what is P(A)?

    • P(“It is raining”) = .06

  • Now what about P(A|B)?

    • P(“It is raining” | “It was clear 10 minutes ago”) = .004

Note: P(A,B)=P(A|B) · P(B)

Also: P(A,B) = P(B,A)

Independence

  • What is P(A,B) if A and B are independent?

  • P(A,B)=P(A) ·P(B) iff A,B independent.

    • P(heads,tails) = P(heads) · P(tails) = .5 · .5 = .25

    • P(doctor,blue-eyes) = P(doctor) · P(blue-eyes) = .01 · .2 = .002

  • What is P(A|B) if A,B independent?

    • P(A|B)=P(A) iff A,B independent

    • Also: P(B|A)=P(B) iff A,B independent

Bayes Theorem

  • Swap the order of dependence

  • Sometimes easier to estimate one kind of dependence than the other

  • P(A|B) = P(B|A) · P(A) / P(B)
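
The swap follows from the two identities noted on the Conditional Probability slide, P(A,B) = P(A|B) · P(B) and P(A,B) = P(B,A). A short derivation, assuming P(B) > 0:

```latex
\begin{align*}
P(A \mid B)\,P(B) &= P(A, B) = P(B, A) = P(B \mid A)\,P(A) \\
\Rightarrow\quad P(A \mid B) &= \frac{P(B \mid A)\,P(A)}{P(B)}
\end{align*}
```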

What does this have to do with the Noisy Channel Model?

Best H = argmax_H P(H|O) = argmax_H P(O|H) · P(H)

(by Bayes' theorem; P(O) is the same for every hypothesis H, so it drops out of the argmax)



Noisy Channel Applied to Word Recognition

  • argmax_w P(w|O) = argmax_w P(O|w) P(w)

  • Simplifying assumptions

    • pronunciation string correct

    • word boundaries known

  • Problem:

    • Given [n iy], what is the correct dictionary word?

  • What do we need?

[ni]: knee, neat, need, new

What is the most likely word given [ni]?

  • Compute prior P(w)

  • Now compute likelihood P([ni]|w), then multiply
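
The prior-times-likelihood computation can be sketched as follows. The slide's table of values did not survive in this transcript, so the numbers below are illustrative assumptions only, and the names `prior`, `likelihood` and `best_word` are hypothetical.

```python
# ASSUMED, illustrative values; not taken from the slide.
prior = {"knee": 0.000024, "neat": 0.00013, "need": 0.00056, "new": 0.001}       # P(w)
likelihood = {"knee": 1.0, "neat": 0.52, "need": 0.11, "new": 0.36}              # P([ni]|w)

def best_word(prior, likelihood):
    """Return argmax_w P(O|w) * P(w) together with all candidate scores."""
    scores = {w: likelihood[w] * prior[w] for w in prior}
    return max(scores, key=scores.get), scores

word, scores = best_word(prior, likelihood)
for w, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{w}: {s:.8f}")
print("best:", word)
```

With these assumed values the unigram prior dominates, which is the kind of context-free choice that the n-gram discussion sets out to improve.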

Why N-grams?

  • Compute likelihood P([ni]|w), then multiply

  • Unigram approach: ignores context

  • Need to factor in context (n-gram)

    • Use P(need|I) instead of just P(need)

    • Note: P(new|I) < P(need|I)

Next Word Prediction [borrowed from J. Hirschberg]

From a NY Times story...

  • Stocks plunged this ….

  • Stocks plunged this morning, despite a cut in interest rates

  • Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall ...

  • Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began

Next Word Prediction (cont)

  • Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last…

  • Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last Tuesday's terrorist attacks.

Human Word Prediction

  • Domain knowledge

  • Syntactic knowledge

  • Lexical knowledge

Claim

  • A useful part of the knowledge needed to allow Word Prediction can be captured using simple statistical techniques.

  • Compute:

    • probability of a sequence

    • likelihood of words co-occurring

Why would we want to do this?

  • Rank the likelihood of sequences containing various alternative hypotheses

  • Assess the likelihood of a hypothesis

Why is this useful?

  • Speech recognition

  • Handwriting recognition

  • Spelling correction

  • Machine translation systems

  • Optical character recognizers

Handwriting Recognition

  • Assume a note is given to a bank teller, which the teller reads as “I have a gub.” (cf. Woody Allen)

  • NLP to the rescue ….

    • gub is not a word

    • gun, gum, Gus, and gull are words, but gun has a higher probability in the context of a bank

Real Word Spelling Errors

  • They are leaving in about fifteen minuets to go to her house.

  • The study was conducted mainly be John Black.

  • The design an construction of the system will take more than a year.

  • Hopefully, all with continue smoothly in my absence.

  • Can they lave him my messages?

  • I need to notified the bank of….

  • He is trying to fine out.

For Spell Checkers

  • Collect list of commonly substituted words

    • piece/peace, whether/weather, their/there ...

  • Example: “On Tuesday, the whether …” -> “On Tuesday, the weather …”

Language Model

  • Definition: A language model is a model that enables one to compute the probability, or likelihood, of a sentence S, P(S).

  • Let’s look at different ways of computing P(S) in the context of Word Prediction

Word Prediction: Simple vs. Smart

  • Simple: Every word follows every other word with equal probability (0-gram)

    • Assume |V| is the size of the vocabulary

    • Likelihood of sentence S of length n is 1/|V| × 1/|V| × … × 1/|V| (n times) = (1/|V|)^n

    • If English has 100,000 words, probability of each next word is 1/100000 = .00001

  • Smarter: Probability of each next word is related to word frequency (unigram)

    – Likelihood of sentence S = P(w1) × P(w2) × … × P(wn)

    – Assumes probability of each word is independent of probabilities of other words.

  • Even smarter: Look at probability given previous words (N-gram)

    – Likelihood of sentence S = P(w1) × P(w2|w1) × … × P(wn|wn-1)

    – Assumes probability of each word depends on the preceding word(s).
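
A small sketch contrasting the 0-gram and unigram likelihoods. The toy corpus and the function names `zerogram_likelihood` and `unigram_likelihood` are assumptions for illustration; nothing here comes from a real corpus.

```python
import math
from collections import Counter

# Hypothetical toy corpus, for illustration only.
corpus = "the dog barks the cat sleeps the dog sleeps".split()
counts = Counter(corpus)
total_tokens = len(corpus)   # 9 tokens
vocab_size = len(counts)     # 5 types

def zerogram_likelihood(sentence):
    """Every word is equally likely: (1/|V|)^n."""
    return (1 / vocab_size) ** len(sentence)

def unigram_likelihood(sentence):
    """Product of relative word frequencies, ignoring context."""
    return math.prod(counts[w] / total_tokens for w in sentence)

sentence = ["the", "dog", "sleeps"]
print(zerogram_likelihood(sentence))
print(unigram_likelihood(sentence))
```

The unigram model already prefers sentences built from frequent words, but it would assign the same score to any reordering of them; conditioning on previous words addresses that.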

Chain Rule

  • Conditional Probability

    • P(A1,A2) = P(A1) · P(A2|A1)

  • The Chain Rule generalizes to multiple events

    • P(A1, …,An) = P(A1) P(A2|A1) P(A3|A1,A2)…P(An|A1…An-1)

  • Examples:

    • P(the dog) = P(the) P(dog | the)

    • P(the dog bites) = P(the) P(dog | the) P(bites| the dog)

Relative Frequencies and Conditional Probabilities

  • Relative word frequencies are better than equal probabilities for all words

    • With equal probabilities, in a corpus with 10K word types each word would have P(w) = 1/10K

    • This does not match our intuition that some words are more likely to occur than others (e.g., the)

  • Conditional probability more useful than individual relative word frequencies

    • Dog may be relatively rare in a corpus

    • But if we see barking, P(dog|barking) may be very large




For a Word String

  • In general, the probability of a complete string of words w1…wn is:

    P(w1…wn) = P(w1)P(w2|w1)P(w3|w1 w2)…P(wn|w1…wn-1)


  • But this approach to determining the probability of a word sequence is not very helpful in general….





Markov Assumption

  • How do we compute P(wn | w1…wn-1)? Trick: Instead of P(rabbit|I saw a), we use P(rabbit|a).

    • This lets us collect statistics in practice

    • A bigram model: P(the barking dog) = P(the|<start>)P(barking|the)P(dog|barking)

  • Markov models are the class of probabilistic models that assume that we can predict the probability of some future unit without looking too far into the past

    • Specifically, for N=2 (bigram): P(w1…wn) ≈ Π P(wk|wk-1)

  • Order of a Markov model: length of prior context

    • bigram is first order, trigram is second order, …
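
The Markov assumption amounts to truncating the history to its last N-1 words. A minimal sketch, with the hypothetical helper name `markov_context`:

```python
def markov_context(history, n):
    """Keep only the last n-1 words of the history (an order n-1 Markov model)."""
    if n <= 1:
        return ()  # unigram: no context at all
    return tuple(history[-(n - 1):])

history = ["I", "saw", "a"]
print(markov_context(history, 2))  # bigram context for predicting "rabbit"
print(markov_context(history, 3))  # trigram context
```

The explicit `n <= 1` branch is needed because the slice `history[-0:]` would return the whole history rather than an empty one.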

Counting Words in Corpora

  • What is a word?

    • e.g., are cat and cats the same word?

    • September and Sept?

    • zero and oh?

    • Is seventy-two one word or two? AT&T?

    • Punctuation?

  • How many words are there in English?

  • Where do we find the things to count?

Corpora

  • Corpora are (generally online) collections of text and speech

  • Examples:

    • Brown Corpus (1M words)

    • Wall Street Journal and AP News corpora

    • ATIS, Broadcast News (speech)

    • TDT (text and speech)

    • Switchboard, Call Home (speech)

    • TRAINS, FM Radio (speech)

Training and Testing

  • Probabilities come from a training corpus, which is used to design the model.

    • overly narrow corpus: probabilities don't generalize

    • overly general corpus: probabilities don't reflect task or domain

  • A separate test corpus is used to evaluate the model, typically using standard metrics

    • held out test set

    • cross validation

    • evaluation differences should be statistically significant

Terminology

  • Sentence: unit of written language

  • Utterance: unit of spoken language

  • Word Form: the inflected form that appears in the corpus

  • Lemma: lexical forms having the same stem, part of speech, and word sense

  • Types (V): number of distinct words that might appear in a corpus (vocabulary size)

  • Tokens (N): total number of words in a corpus

  • Types seen so far (T): number of distinct words seen so far in corpus (smaller than V and N)
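
Types and tokens can be counted directly once a definition of "word" is fixed. The sketch below assumes the simplest possible one (lowercase, split on whitespace), which sidesteps the questions about punctuation, hyphens and abbreviations; the function name is hypothetical.

```python
def count_types_and_tokens(text: str):
    """Return (types seen, tokens) under naive whitespace tokenization."""
    tokens = text.lower().split()
    return len(set(tokens)), len(tokens)

types_seen, n_tokens = count_types_and_tokens("The cat sat on the mat")
print(types_seen, n_tokens)
```

Here "The" and "the" collapse to one type only because of the lowercasing choice; a different tokenization would give different counts.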

Simple N-Grams

  • An N-gram model uses the previous N-1 words to predict the next one: P(wn | wn-N+1 wn-N+2 … wn-1)

    • unigrams: P(dog)

    • bigrams: P(dog | big)

    • trigrams: P(dog | the big)

    • quadrigrams: P(dog | chasing the big)
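
Extracting the N-grams of a token sequence is a sliding window of width N. A minimal sketch, with the assumed function name `ngrams`:

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "chasing the big dog".split()
print(ngrams(tokens, 2))
print(ngrams(tokens, 3))
print(ngrams(tokens, 4))
```

In each tuple the last element is the word being predicted and the preceding N-1 elements are its context.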







Using N-Grams

  • Recall that

    • N-gram: P(wn | w1…wn-1) ≈ P(wn | wn-N+1…wn-1)

    • Bigram: P(w1…wn) ≈ Π P(wk|wk-1)

  • For a bigram grammar

    • P(sentence) can be approximated by multiplying all the bigram probabilities in the sequence

  • Example: P(I want to eat Chinese food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese)

Computing Sentence Probability

  • P(I want to eat British food) = P(I|<start>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British) = .25×.32×.65×.26×.001×.60 = .0000081

  • vs. P(I want to eat Chinese food) = .00015

  • Probabilities seem to capture “syntactic” facts, “world knowledge”

    • eat is often followed by an NP

    • British food is not too popular

  • N-gram models can be trained by counting and normalization
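
The bigram product for the British-food sentence can be sketched in code using the six factors listed on this slide. The names `bigram_prob` and `sentence_probability` are assumptions for this example, and unseen bigrams are not handled (they raise `KeyError`).

```python
import math

# Bigram probabilities as listed on the slide: (previous word, word) -> P(word | previous word)
bigram_prob = {
    ("<start>", "I"): 0.25,
    ("I", "want"): 0.32,
    ("want", "to"): 0.65,
    ("to", "eat"): 0.26,
    ("eat", "British"): 0.001,
    ("British", "food"): 0.60,
}

def sentence_probability(words, probs):
    """Multiply P(w_k | w_{k-1}) over the sentence, starting from <start>."""
    tokens = ["<start>"] + words
    return math.prod(probs[(prev, cur)] for prev, cur in zip(tokens, tokens[1:]))

p = sentence_probability("I want to eat British food".split(), bigram_prob)
print(f"{p:.3e}")
```

For longer sentences the product underflows quickly, so summing log probabilities is a common alternative.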

BERP Bigram Probabilities: Use Unigram Count

  • Normalization: divide bigram count by unigram count of first word.

  • Computing the probability of I I

    • P(I|I) = C(I,I)/C(I) = 8 / 3437 = .0023

  • A bigram grammar is a V×V matrix of probabilities, where V is the vocabulary size

Learning a Bigram Grammar

  • The formula P(wn|wn-1) = C(wn-1 wn)/C(wn-1) is used for bigram “parameter estimation”

  • Relative Frequency

  • Maximum Likelihood Estimation (MLE): the parameter set maximizes the likelihood of the training set T given the model M, i.e., P(T|M).
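
Counting and normalization can be sketched as below. The three-sentence corpus and the name `train_bigram_mle` are assumptions for illustration. One design choice to note: the denominator counts a word only when it occurs as a context (has a successor), so the probabilities for each context sum to 1.

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Relative-frequency (MLE) estimates: P(w | prev) = C(prev w) / C(prev)."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<start>"] + sentence.split()
        for prev, cur in zip(tokens, tokens[1:]):
            bigram_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    return {bg: c / context_counts[bg[0]] for bg, c in bigram_counts.items()}

# Hypothetical toy training corpus.
model = train_bigram_mle(["I want food", "I want to eat", "I eat"])
print(model[("I", "want")])   # 2 of the 3 occurrences of "I" are followed by "want"
print(model[("want", "to")])
```

Any bigram absent from the training data gets no entry at all, which is the zero-probability problem that smoothing methods address.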

What do we learn about the language?

  • What's being captured with ...

    • P(want | I) = .32

    • P(to | want) = .65

    • P(eat | to) = .26

    • P(food | Chinese) = .56

    • P(lunch | eat) = .055

  • What about...

    • P(I | I) = .0023

    • P(I | want) = .0025

    • P(I | food) = .013

Readings for next time

  • J&M Chapter 5, 7.1-7.3