Richard Sproat

URL: http://www.cslu.ogi.edu/~sproatr/Courses/CompLing/

CS506/606: Computational Linguistics, Fall 2009, Unit 1
This Unit
  • Overview of the course
  • What is computational linguistics?
  • First linguistic problem: grammatical part-of-speech tagging
    • The problem
    • The source-channel model
    • Language modeling
    • Estimation
    • Finite-state methods
    • First homework: a WFST-based implementation of a source-channel tagger.
Format of the course
  • Lectures
  • Homeworks
    • 2-3 homeworks, which will be 70% of the grade
    • The homeworks will be work.
    • It is assumed you know how to program
  • Individual projects (30% of the grade)
    • You must discuss your project with me by the end of the third week
    • The final week will consist of (short) project presentations
Final projects
  • The project can be on any topic related to the course, e.g.:
    • Implement a parsing algorithm
    • Design a morphological analyzer for a non-trivial amount of morphology for a language
    • Build a sense-disambiguation system
    • Design a word-segmentation method for some written language that doesn't delimit words with spaces
    • Do a serious literature review of some area of the field
Readings for the course
  • Textbooks:
    • Brian Roark, Richard Sproat. Computational Approaches to Syntax and Morphology. Oxford University Press, 2007.
    • Daniel Jurafsky, James H. Martin. Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Second Edition. Prentice Hall, 2008.
  • A few readings from online sources
Prerequisites for this course
  • Not many really:
    • You must know how to program:
      • Any programming language is fine
      • Should also know simple shell scripting
    • You will need access to a computer (duh).
    • You will need linux or a linux-like environment:
      • For Windows users I recommend Cygwin (www.cygwin.com)
What you should expect to get out of this course
  • An understanding of a range of linguistic issues:
    • The course is organized around “units”, each of which deals with one linguistic problem and one or more computational solutions
    • Some problems are “practical”, some of more theoretical interest
  • Some sense of variation across languages and the kinds of things to expect when one deals with computational problems in various languages
  • A feel for some of the kinds of computational solutions people have proposed
[Diagram: fields placed along a spectrum from “more rigorous” (physics, chemistry, biology) through neuropsychology and psychology to “less rigorous”/“more flakey” (literary criticism), with the question of where linguistics and computational linguistics fall.]

What defines the rigor of a field?
  • Whether results are reproducible
  • Whether theories are testable/falsifiable
  • Whether there are a common set of methods for similar problems
  • Whether approaches to problems can yield interesting new questions/answers
[Diagram: the same spectrum from “more rigorous” to “less rigorous”, now placing engineering, linguistics, sociology, and literary criticism.]

The true situation with linguistics

[Diagram: subfields of linguistics spread along the rigor spectrum: experimental phonetics, historical linguistics, psycholinguistics, some areas of sociolinguistics (e.g. Bill Labov), and “theoretical” linguistics (e.g. lexical-functional grammar) toward the more rigorous end; other areas of sociolinguistics (e.g. Deborah Tannen) and “theoretical” linguistics (e.g. minimalist syntax) toward the less rigorous end.]

Okay, enough already: what is computational linguistics?
  • Text normalization/segmentation
  • Morphological analysis
  • Automatic word pronunciation prediction
  • Transliteration
  • Word-class prediction: e.g. part of speech tagging
  • Parsing
  • Semantic role labeling
  • Machine translation
  • Dialog systems
  • Topic detection
  • Summarization
  • Text retrieval
  • Bioinformatics
  • Language modeling for automatic speech recognition
  • Computer-aided language learning (CALL)
Computational linguistics
  • Often thought of as natural language engineering
  • But there is also a serious scientific component to it.
Goals of Computational Linguistics / Natural Language Processing
  • To get computers to deal with language the way humans do:
    • They should be able to understand language and respond appropriately in language
    • They should be able to learn human language the way children do
    • They should be able to perform linguistic tasks that skilled humans can do, such as translation
  • Yeah, right
Some interesting themes…
  • Finite-state methods:
    • Many application areas
    • Raises many interesting questions about how “regular” language is
  • Grammar induction:
    • Linguists have done a poor job at their stated goal of explaining how humans learn grammar
  • Computational models of language change:
    • Historical evidence for language change is only partial. There are many changes in language for which we have no direct evidence.
Why CL may seem ad hoc
  • Wide variety of areas (as in linguistics)
  • If it’s natural language engineering, the goal is often just to build something that works
  • Techniques tend to change in somewhat faddish ways…
    • For example: machine learning approaches fall in and out of favor
Machine learning in CL
  • In general it’s a plus since it has meant that evaluation has become more rigorous
  • But it’s important that the field not turn into applied machine learning
  • For this to be avoided, people need to continue to focus on what linguistic features are important
  • Fortunately, this seems to be happening
A well-worn example

[Image: astronauts Poole (Gary Lockwood) and Bowman (Keir Dullea) trying to elude the HAL 9000 computer.]

The HAL 9000
  • Perfect speech recognition
  • Perfect language understanding
  • Perfect synthesis:
    • Here’s the current reality:
  • Perfect modeling of discourse
  • (Vision)
  • (World knowledge)
  • And “experts” in the 1960s thought this would all be possible
Another example

[Image: the Gorn uses the Universal Translator in the Star Trek episode “Metamorphosis”.]
Are these even reasonable goals?
  • These are nice goals but they have more to do with science fiction than with science fact
  • Realistically we don’t have to go this far to have stuff that is useful:
    • Spelling correctors, grammar checkers, MT systems, tools for linguistic analysis, …
    • Limited speech interaction systems:
      • Early systems like AT&T’s VRCP (Voice Recognition Call Processing):
        • Please say collect, third party or calling card
      • More recent examples: Goog411, United Airlines flight info
Named Entity Recognition
  • Build a system that can find the names in a text:

Israeli Leader Suffers Serious Stroke

By STEVEN ERLANGER

JERUSALEM, Thursday, Jan. 5 - Israeli Prime Minister Ariel Sharon suffered a serious stroke Wednesday night after being taken to the hospital from his ranch in the Negev desert, and he underwent brain surgery early today to stop cerebral bleeding, a hospital official said.

Mr. Sharon's powers as prime minister were transferred to Vice Premier Ehud Olmert, said the cabinet secretary, Yisrael Maimon.

Name Transliteration
  • Handle cross-language transliteration
Abbreviation Expansion
  • Recover the underlying words in cases such as:
Interpret text into scenes

the very huge fried egg is on the table-vp23846. the very large american party hat is six inches above the egg. the chinstrap of the hat is invisible. the table is on the white tile floor. the french door is behind the table. the tall white wall is behind the french door. a white wooden chair is to the right of the table. it is facing left. it is sunrise. the impi-61 photograph is on the wall. it is three inches left of the door. it is three feet above the ground. the photograph is eighteen inches wide. a white table-vp23846 is one foot to the right of the chair. the big white teapot is on the table.

Interpret Text into Scenes

the glass bowling ball is behind the bowling pin. the

ground is silver. a goldfish is inside the bowling ball.

Interpret Text into Scenes

the humongous blue transparent ice cube is on the silver mountain range. the humongous green transparent ice cube is next to the blue ice cube. the humongous red transparent ice cube is on top of the green ice cube. the humongous yellow transparent ice cube is to the left of the green ice cube. the tiny santa claus is inside the red ice cube. the tiny christmas tree is inside the blue ice cube. the four tiny reindeer are inside the green ice cube. the tiny blue sleigh is inside the yellow ice cube. the small snowman-vp21048 is three feet in front of the green ice cube. the sky is pink.

Interpret Text into Scenes

the donut shop is on the dirty ground. the donut of the donut shop is silver. a green a tarmac road is to the right of the donut shop. the road is 1000 feet long and 50 feet wide. a yellow volkswagen bus is eight feet to the right of the donut shop. it is on the road. a restaurant waiter is in front of the donut shop. a red volkswagen beetle is eight feet in front of the volkswagen bus. the taxi is ten feet behind the volkswagen bus. the convertible is to the left of the donut shop. it is facing right. the shoulder of the road has a dirt texture. the grass of the road has a dirt texture.

Interpret Text into Scenes

The shiny blue goldfish is on the watery ground. The shiny red colorful-vp3982 is six inches away from the shiny blue goldfish. The polka dot colorful-vp3982 is to the right of the shiny blue goldfish. The polka dot colorful-vp3982 is five inches away from the shiny blue goldfish. The transparent orange colorful-vp3982 is above the shiny blue goldfish.The striped colorful-vp3982 is one foot away from the transparent orange colorful-vp3982. The huge silver wall is facing the shiny blue goldfish. The shiny blue goldfish is facing the silver wall. The silver wall is five feet away from the shiny blue goldfish.

How does the NLP in WordsEye work?
  • Statistical part-of-speech tagger
  • Simple morphological analyzer
  • Statistical parser
  • Reference resolution model based on world model
  • Semantic hierarchy (similar to WordNet)
Part-of-speech tagging
  • Part of speech (POS) tagging is simply the problem of placing words into equivalence classes.
  • The notion of part-of-speech tags can be attributed to Dionysius Thrax, the 1st-century BC Greek grammarian who classified Greek words into eight classes:
    • noun, verb, pronoun, preposition, adverb, conjunction, participle and article.
  • Tagging is arguably easiest in languages with rich (inflectional) morphology (e.g. Spanish), for two reasons:
    • It’s more obvious what the basic set of tags should be, since words fall into clearly delineated inflectional classes.
    • The morphology gives important cues to what the part of speech is:
      • cantaremos is highly likely to be a verb given the ending -ar-emos.
  • It’s arguably hardest in languages with minimal (inflectional) morphology:
    • there are fewer cues in English than there are in Spanish
    • for some languages like Chinese, cues are almost completely absent
    • linguists can’t even agree on whether (e.g.) Chinese distinguishes verbs from adjectives.
Part-of-speech tags
  • Linguists typically distinguish a relatively small set of basic categories (like Dionysius Thrax)—sometimes just 4 in the case of Chomsky’s [±N,±V] proposal.
    • But usually these analyses assume an additional set of morphosyntactic features.
  • Computational models of tagging usually involve a larger set, which in many cases can be thought of as the linguists’ small set, plus the features squished into one term:
    • eat/VB, eat/VBP, eats/VBZ, ate/VBD, eaten/VBN
  • Tagset size has a clear effect on performance of taggers:
    • “the Penn Treebank project collapsed many tags compared to the original Brown tagset, and got better results.” (http://www.ilc.cnr.it/EAGLES96/morphsyn/node18.html)
  • But choosing the right size tagset depends upon the intended application.
    • As far as I know, there is no demonstration of what is the “optimal” tagset.
The Penn Treebank tagset
  • 46 tags, collapsed from the Brown Corpus tagset
  • Some details:
    • to/TO not disambiguated
    • verbs and auxiliaries (have, be) not distinguished (though these were in the Brown tagset).
  • Some links:
    • http://www.computing.dcu.ie/~acahill/tagset.html
    • http://www.mozart-oz.org/mogul/doc/lager/brill-tagger/penn.html
    • http://www.scs.leeds.ac.uk/amalgam/tagsets/upenn.html
  • Link for the original Brown corpus tags:
    • http://www.scs.leeds.ac.uk/ccalas/tagsets/brown.html
  • Motivations for the Penn tagset modifications
    • “the Penn Treebank tagset is based on that of the Brown Corpus. However the stochastic orientation of the Penn Treebank and the resulting concern with sparse data led us to modify the Brown tagset by paring it down considerably” (Marcus, Santorini and Marcinkiewicz, 1993).
    • eliminated distinctions that were lexically recoverable: thus no separate tags for be, do, have
    • as well as distinctions that were syntactically recoverable (e.g. the distinction between subject and object pronouns)
Problematic cases
  • Even with a well-designed tagset, there are cases that even experts find difficult to agree on.
    • adjective or participle?
      • a seen event, a rarely seen event, an unseen event
      • a child seat, *a very child seat, *this seat is child
      • but: that’s a very MIT paper, she’s sooooooo California
  • Some cases are difficult to get right in the absence of further knowledge: preposition or particle?
    • he threw out the garbage
    • he threw the garbage out
    • he threw the garbage out the door
    • *he threw the garbage the door out
Typical examples used to motivate tagging
  • Can they can cans?
  • May may leave
  • He does not shoot does
  • You might use all your might
  • I am arriving at 3 am
The source-channel model
  • The basic idea: a lot of problems in computational linguistics can be construed as the problem of reconstructing an underlying “truth” given possibly noisy observations.
  • This is very much like the problem that Claude Shannon (the “father of Information Theory”) set out to solve for communication over a phone line.
    • Input I is clean speech
    • The channel (the phone line) corrupts I and produces O — what you hear at the other end
    • Can we reconstruct I from O?
  • Answer: you can if you have an estimate of the probability of the possible I’s and an estimate of the probability of generating O given I:

    Î = argmax over I of P(I|O) = argmax over I of P(I) P(O|I)

  • The first term P(I) is the language model and the second term P(O|I) is the channel model.
The source-channel model
  • For the tagging problem:
    • Want to maximize P(T|W)
    • From Bayes’ rule we know that:

      P(T|W) = P(W|T) P(T) / P(W)

    • Since P(W) is a constant for any given sentence, maximizing P(T|W) amounts to maximizing P(W|T) P(T).
Class-based language models
  • Suppose your corpus does not have every Monday but it does have every DAY-OF-WEEK for all the other days of the week.
  • A class-based language model can model this situation (see the sketch below):
  • P(wi | Ci) P(Ci | C0, C1 … Ci-1)
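To make the every Monday example concrete, here is a minimal sketch of a class-based bigram model in Python. The toy corpus, the class inventory, and the uniform within-class word distribution are all illustrative assumptions, not anything from the slides:

```python
from collections import defaultdict

# Toy corpus: "every <day>" is attested for several days, but never Monday.
corpus = [("every", "Tuesday"), ("every", "Wednesday"), ("every", "Friday")]
# Hypothetical class assignments.
word_class = {"every": "EVERY", "Monday": "DAY-OF-WEEK",
              "Tuesday": "DAY-OF-WEEK", "Wednesday": "DAY-OF-WEEK",
              "Friday": "DAY-OF-WEEK"}

members = defaultdict(set)
for w, c in word_class.items():
    members[c].add(w)

class_bigram = defaultdict(int)   # c(C1, C2)
class_unigram = defaultdict(int)  # c(C1) as a bigram history
for w1, w2 in corpus:
    c1, c2 = word_class[w1], word_class[w2]
    class_bigram[(c1, c2)] += 1
    class_unigram[c1] += 1

def p(w2, w1):
    """P(w2|w1) ~= P(C2|C1) * P(w2|C2), with a uniform emission P(w2|C2)."""
    c1, c2 = word_class[w1], word_class[w2]
    return (class_bigram[(c1, c2)] / class_unigram[c1]) / len(members[c2])

print(p("Monday", "every"))  # 0.25: non-zero although "every Monday" is unseen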
What are classes?
  • A word can be in its own class
  • Part-of-speech
  • Semantic classes (DAY-OF-WEEK)
Hidden Markov Models (HMMs)

[Diagram: a toy HMM with states <s>, N, V, </s>; arcs carry transition costs, and states emit words with emission costs, as follows.]

Transition probabilities:
  P(N|<s>) = 0.5    P(V|<s>) = 0.5
  P(N|N) = 0.1      P(V|N) = 0.8    P(</s>|N) = 0.1
  P(N|V) = 0.7      P(V|V) = 0.1    P(</s>|V) = 0.2

Emission probabilities:
  P(dog|N) = 0.9    P(eats|N) = 0.1
  P(dog|V) = 0.1    P(eats|V) = 0.9

For the tag sequence <s> N V N </s> generating dog eats dog:

  1.0 * 0.5 * 0.9 * 0.8 * 0.9 * 0.7 * 0.9 * 0.1 = 0.02

Note: set probabilities of starting in any state other than <s> to 0
Why “hidden”?
  • But if we see dog eats dog we don’t actually know what underlying tag sequence it came from:
    • The true sequence is hidden
  • Another possibility would be:
    • <s> V V V </s>:
    • 1.0 * 0.5 * 0.1 * 0.1 * 0.9 * 0.1 * 0.1 * 0.2 = 0.000009
  • So we need to consider all possibilities if we want an estimate of the probability of the sentence given the model (see the sketch below)
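A minimal sketch of that “sum over all possibilities” (the forward algorithm), using the toy HMM’s numbers; the dictionary encoding is my own assumption about representation, not the course’s:

```python
# Transition and emission probabilities from the toy HMM slide.
TRANS = {("<s>", "N"): 0.5, ("<s>", "V"): 0.5,
         ("N", "N"): 0.1, ("N", "V"): 0.8, ("N", "</s>"): 0.1,
         ("V", "N"): 0.7, ("V", "V"): 0.1, ("V", "</s>"): 0.2}
EMIT = {("N", "dog"): 0.9, ("N", "eats"): 0.1,
        ("V", "dog"): 0.1, ("V", "eats"): 0.9}
TAGS = ["N", "V"]

def forward(words):
    """P(words) = the sum over all hidden tag paths (forward algorithm)."""
    alpha = {t: TRANS[("<s>", t)] * EMIT[(t, words[0])] for t in TAGS}
    for w in words[1:]:
        alpha = {j: sum(alpha[i] * TRANS[(i, j)] for i in TAGS) * EMIT[(j, w)]
                 for j in TAGS}
    return sum(alpha[j] * TRANS[(j, "</s>")] for j in TAGS)

# Sums all eight paths, including <s> N V N </s> (~0.02)
# and <s> V V V </s> (0.000009).
print(forward("dog eats dog".split()))
```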
Part-of-speech tagging

For tagging, though, we don’t want the probability of the observed sequence: we want the part-of-speech sequence that maximizes that probability.

Viterbi algorithm pseudocode

[Pseudocode slide, not fully recoverable: the Viterbi recursion keeps, for each class j at time t, the best score and a backpointer; the best-scoring path is then reconstructed by following the backpointers.]
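A runnable rendering of that pseudocode for the toy HMM above (a sketch: the encoding is assumed, and a real tagger would work in log space to avoid underflow):

```python
# Transition and emission probabilities from the toy HMM slide.
TRANS = {("<s>", "N"): 0.5, ("<s>", "V"): 0.5,
         ("N", "N"): 0.1, ("N", "V"): 0.8, ("N", "</s>"): 0.1,
         ("V", "N"): 0.7, ("V", "V"): 0.1, ("V", "</s>"): 0.2}
EMIT = {("N", "dog"): 0.9, ("N", "eats"): 0.1,
        ("V", "dog"): 0.1, ("V", "eats"): 0.9}
TAGS = ["N", "V"]

def viterbi(words):
    # delta[t][j]: score of the best tag sequence ending in tag j at time t
    # psi[t][j]:   backpointer for class j at time t
    delta = [{j: TRANS[("<s>", j)] * EMIT[(j, words[0])] for j in TAGS}]
    psi = [{}]
    for w in words[1:]:
        delta.append({})
        psi.append({})
        for j in TAGS:
            best = max(TAGS, key=lambda i: delta[-2][i] * TRANS[(i, j)])
            psi[-1][j] = best
            delta[-1][j] = delta[-2][best] * TRANS[(best, j)] * EMIT[(j, w)]
    # Fold in the transition to </s>, then follow the backpointers.
    last = max(TAGS, key=lambda j: delta[-1][j] * TRANS[(j, "</s>")])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(psi[t][path[-1]])
    return list(reversed(path))

print(viterbi("dog eats dog".split()))  # ['N', 'V', 'N'], score ~0.02
```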

This Section
  • Introduction to formal languages
    • Regular languages
    • Finite (state) automata
    • Right linear grammars
  • Regular relations
    • Finite (state) transducers
Finite state methods
  • Used from the 1950’s onwards
  • Went out of fashion a bit during the 1980’s
  • Then a revival in the 1990’s with the advent of weighted finite-state methods
Formal languages
  • A language is a set (finite or infinite) of strings that can be formed out of an alphabet
  • An alphabet is a set (finite or infinite): letters, words of English, Chinese characters, beer bottles, varieties of Capsicum peppers.
Some Languages
  • English
  • Python
  • The set of palindromes over a given alphabet
  • Zero or more a’s followed by zero or more b’s
  • All words in an English dictionary ending in -ism
Regular Languages
  • A regular language is a language with a finite alphabet that can be constructed out of one or more of the following operations:
    • Set union
    • Concatenation
    • Transitive closure (Kleene star)
Some things that are regular languages
  • Zero or more a’s followed by zero or more b’s
  • The set of words in an English dictionary
  • English?
Some things that are not regular languages
  • Zero or more a’s followed by exactly the same number of b’s
  • The set of palindromes over (say) the English alphabet
  • The set of well-formed Bambara phrasal reduplications (C. Culy, 1985)
Some special regular languages
  • The universal language (Σ*)
  • The empty language (Ø)

Note: the empty language Ø (no strings at all) is not the same as the language containing just the empty string, {ε}.
Some closure properties of regular languages
  • Intersection
  • Complementation
  • Difference
  • Reversal
  • Power
Regular expressions
  • Regular expressions are a formal way of specifying a regular language
Caveats
  • Perl and other languages (see J&M, Chapter 2) have lots of stuff in their “regular expression” syntax. Strictly speaking, not all of these correspond to regular expressions in the formal sense, since they don’t describe regular languages.
  • For example, arbitrary substring copying is not expressible as a regular language, though one can do this in Perl (or Python …):

/(.+)\1/
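A quick illustration of that non-regular “regular expression” in Python (the example strings are mine):

```python
import re

# The backreference \1 matches an exact copy of whatever (.+) captured,
# i.e. the copy language {ww}, which is provably not regular.
print(bool(re.search(r"^(.+)\1$", "abcabc")))  # True: "abc" + "abc"
print(bool(re.search(r"^(.+)\1$", "abcabd")))  # False: not of the form ww
```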

Finite state automata: formal definition

A finite-state automaton is a quintuple (Q, Σ, δ, q0, F): a finite set of states Q, a finite alphabet Σ, a transition relation δ ⊆ Q × Σ × Q, a start state q0 ∈ Q, and a set of final states F ⊆ Q.

Every regular language can be recognized by a finite-state automaton.

Every finite-state automaton recognizes a regular language. (Kleene’s theorem)
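As a concrete illustration, a minimal sketch in Python of a (deterministic) automaton for a language mentioned earlier, zero or more a’s followed by zero or more b’s; the state numbering and table encoding are assumptions of mine:

```python
# DFA for a*b*: state 0 has seen only a's, state 1 has started on b's.
DELTA = {
    (0, "a"): 0, (0, "b"): 1,
    (1, "b"): 1,
}
START, FINALS = 0, {0, 1}

def accepts(s):
    state = START
    for sym in s:
        if (state, sym) not in DELTA:  # no transition: reject
            return False
        state = DELTA[(state, sym)]
    return state in FINALS

assert accepts("aabbb") and accepts("") and not accepts("ba")
```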

Deterministic versus non-deterministic finite automata
  • The definition of finite-state automata given above was for non-deterministic finite automata (NDFA): δ is a relation, meaning that from any state and given any symbol, one can in principle transition to any number of states.
  • In deterministic finite automata (DFA), every state/symbol pair maps to a unique state. In other words, δ is a function.
Any NFA can be represented by a DFA

http://www.cs.duke.edu/csed/jflap/tutorial/fa/nfa2dfa/index.html

Subset construction for determinization

http://www.cs.duke.edu/csed/jflap/tutorial/fa/nfa2dfa/index.html
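A minimal sketch of the subset construction in Python (epsilon-free NFA assumed for brevity; the encoding is mine, not the tutorial’s):

```python
from itertools import chain

def determinize(nfa_delta, start, finals):
    """Subset construction: nfa_delta maps (state, symbol) -> set of states;
    returns DFA transitions whose states are frozensets of NFA states."""
    symbols = {sym for (_, sym) in nfa_delta}
    start_set = frozenset([start])
    dfa_delta, seen, todo = {}, {start_set}, [start_set]
    while todo:
        S = todo.pop()
        for sym in symbols:
            # The DFA successor of S on sym is the union of the NFA
            # successors of every member of S on sym.
            T = frozenset(chain.from_iterable(
                nfa_delta.get((q, sym), ()) for q in S))
            if not T:
                continue
            dfa_delta[(S, sym)] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    dfa_finals = {S for S in seen if S & finals}
    return dfa_delta, start_set, dfa_finals

# Example: an NFA over {a, b} for strings ending in "ab".
nfa = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
delta, start, accepting = determinize(nfa, 0, {2})
```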

NFAs vs DFAs
  • If NDFA’s are typically smaller and simpler than their equivalent DFA’s, why do we care about DFA’s?
    • Answer: efficiency
Kleene’s theorem
  • Kleene’s Theorem, part 1: To each regular expression there corresponds a NDFA.
  • Kleene’s Theorem, part 2: To each NDFA there corresponds a regular expression.
Pumping Lemma

http://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages


Right- (Left-) Linear Grammars
  • (Assuming familiarity with the notions grammar, non-terminal, terminal . . . )
    • A grammar is right-linear if each rule is of the form:

A → w(B)

where A is a non-terminal, w is a string of terminals, and (B) is an optional single non-terminal

  • Right- (Left-) Linear Grammars generate regular languages (example below)
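For example, here is one right-linear grammar (my own illustration, in the A → w(B) format) for “zero or more a’s followed by zero or more b’s”:

  S → aS
  S → T
  T → bT
  T → ε

Every rule has at most one non-terminal, at the right edge, so the generated language is regular.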
Some closure properties for regular relations
  • Power (aⁿ)
  • Reversal
  • Inversion (a⁻¹)
  • Composition: R1 ∘ R2
Important consequence of closure under inversion
  • Since regular relations are closed under inversion, one can write a set of rules that derive a surface form from a more abstract form, and then invert the resulting transducer to produce a transducer that will analyze surface forms into abstract forms.
Some things that are not generally true of relations/transducers
  • Determinization: FST’s are not generally determinizable
  • Difference: relations are not generally closed under difference
Semirings

(The term “tropical” is in honor of Imre Simon.)
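For reference, the two semirings discussed here have the following standard definitions (each a set with a “collect” operation, an “extend” operation, and their identities; this formulation is the usual one, e.g. in OpenFst, not copied from the slide):

  • “times/plus” (probability) semiring: (R≥0, +, ×, 0, 1)
  • tropical semiring: (R≥0 ∪ {∞}, min, +, ∞, 0)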

Interpretation
  • In the “times/plus” semiring, weights (typically probabilities) are multiplied along paths, and likewise during intersection.
    • The weight of a set of paths is the sum of the individual path weights.
    • The cheapest path is the one with the highest weight.
  • In the “tropical” semiring, weights (typically negative log probabilities) are summed along paths, and likewise during intersection.
    • The weight of a set of paths is the minimum of the individual path weights.
    • The cheapest path is the one with the lowest weight.
  • The cheapest (or best) path is computed by a shortest-path algorithm – cf. the Viterbi algorithm
This Section
  • N-gram models
  • Sparse data
  • Smoothing:
    • “Add One”
    • Witten-Bell
    • Good-Turing
  • Backoff
  • Other issues:
    • Good-Turing and Word Frequency Distributions
    • Good-Turing and Morphological Productivity
  • Implementation of language models as weighted automata
N-gram models
  • Remember the chain rule:
    • P(w1 w2 w3 … wn) = P(w1) P(w2|w1) P(w3|w1 w2) … P(wn|w1 … wn-1)
  • Problem is we can’t model all these conditional probabilities
  • N-gram models approximate P(w1 w2 w3 … wn) by setting a bound on the amount of previous context: each P(wi|w1 … wi-1) is replaced by P(wi|wi-N+1 … wi-1)
  • This is the Markov assumption, and n-grams are often termed Markov models (see the sketch below)
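For instance, a minimal bigram (order-1 Markov) model with maximum-likelihood counts; the toy corpus is an illustrative assumption:

```python
from collections import Counter

tokens = "<s> the cat sat on the mat </s>".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p_bigram(w2, w1):
    """MLE estimate P(w2 | w1) = c(w1 w2) / c(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

# P(the cat sat ...) ~= P(the|<s>) * P(cat|the) * P(sat|cat) * ...
print(p_bigram("cat", "the"))  # 0.5: "the" occurs twice, once before "cat"
```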
Approximating Shakespeare
  • As we increase the value of N, the accuracy of the n-gram model increases
  • Generating sentences with random unigrams:
    • Every enter now severally so, let
    • Hill he late speaks; or! a more to leg less first you enter
  • With bigrams:
    • What means, sir. I confess she? then all sorts, he is trim, captain.
    • Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry.
  • Trigrams:
    • Sweet prince, Falstaff shall die.
    • This shall forbid it should be branded, if renown made it empty.
  • Tetragrams:
    • What! I will go seek the traitor Gloucester.
    • Will you not tell me who I am?
Approximating Shakespeare
  • There are 884,647 tokens, with 29,066 word form types, in Shakespeare’s works
  • Shakespeare produced 300,000 bigram types out of 844 million possible bigrams: so, 99.96% of the possible bigrams were never seen (have zero entries in the table).
  • Tetragrams are worse: What’s coming out looks like Shakespeare because it is Shakespeare.
  • The zeroes in the table are causing problems: we are being forced down a path of selecting only the tetragrams that Shakespeare used — not a very good model of Shakespeare, in fact
  • This is the sparse data problem
Sparse data
  • In fact the sparse data problem extends beyond zeroes:
    • the occurs about 28,000 times in Shakespeare, so by the MLE:
    • P(the) = 28000/884647 = .032
  • womenkind occurs once, so:
    • P(womenkind) = 1/884647 = .0000011
  • Do we believe this?
N-gram training sensitivity
  • If we repeated the Shakespeare experiment but trained on a Wall Street Journal corpus, there would be little overlap in the output
  • This has major implications for corpus selection or design
Some useful empirical observations: a review
  • A small number of events occur with high frequency
  • A large number of events occur with low frequency
  • You can quickly collect statistics on the high frequency events
  • You might have to wait an arbitrarily long time to get valid statistics on low frequency events
  • Some of the zeroes in the table are really zeroes. But others are simply low frequency events you haven’t seen yet.
  • Whatever are we to do?
Smoothing: general issues
  • Smoothing techniques manipulate the counts of the seen and unseen cases, replacing each count c by an adjusted count c* (a sketch of the simplest scheme follows below).
  • Alternatively we can view smoothing as producing an adjusted probability P* from an original probability P.
  • More sophisticated smoothing techniques try to arrange it so that the probability estimates of the higher counts are not changed too much, since we tend to trust those.
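A minimal sketch of “Add One” (Laplace) smoothing for bigrams; the toy corpus and vocabulary are illustrative assumptions:

```python
from collections import Counter

# Add-one smoothing replaces each bigram count c by c + 1 and grows the
# denominator by the vocabulary size V, so probabilities still sum to one.
tokens = "<s> the cat sat on the mat </s>".split()
V = len(set(tokens))  # 7
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p_addone(w2, w1):
    """Smoothed P(w2 | w1) = (c(w1 w2) + 1) / (c(w1) + V)."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

print(p_addone("cat", "the"))  # seen bigram:   (1 + 1) / (2 + 7)
print(p_addone("sat", "the"))  # unseen bigram: (0 + 1) / (2 + 7), non-zero
```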
Kneser-Ney modeling
  • Lower-order ngrams are only used when higher-order ngrams are lacking
    • So build these lower-order ngrams to suit that situation
  • New York is frequent
    • York is not too frequent except after New
    • If the previous word is New then we don’t care about the unigram estimate of York
    • If the previous word is not New then we don’t want to be counting all those cases when New occurs before York
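The standard way to “build those lower-order ngrams to suit” is Kneser-Ney’s continuation probability (this is the textbook formulation, not spelled out on the slide):

  Pcontinuation(w) = |{w′ : c(w′ w) > 0}| / |{(u, v) : c(u v) > 0}|

That is, the unigram weight of a word is proportional to the number of distinct words it follows, not to its raw frequency: York, which almost always follows New, gets a low continuation probability despite its high count.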
Estimation techniques: miscellanea
  • What if you have reason to doubt your counts? In what used to be recent but is now not-so-recent work (Riley, Roark and Sproat, 2003), we’ve tried to generalize Good-Turing to the case where the counts are “fractional”, as in the (lattice) output of a speech recognizer.
  • Chen and Goodman (1998) http://citeseer.nj.nec.com/22209.html is an oft-cited study of these various techniques (and many others) and how effective they are.
  • By the way, we haven’t said anything about how one measures effectiveness.
    • There are a couple of ways:
      • Actually use the n-gram language model in a real system (such as an ASR system)
      • Measure the perplexity on some held-out corpus (defined below)
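For the record, the perplexity of a model on a held-out corpus w1 … wN is defined as

  PP = P(w1 … wN)^(-1/N)

the inverse probability of the held-out text, normalized for length; lower perplexity means the model predicts the text better.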
Smoothing isn’t just for ngrams
  • The Good-Turing estimate of the probability mass of the unseen cases is related to the growth of the vocabulary
  • It gives you a measure of how likely it is that there are “more where that came from”
  • Hence it can be used to measure the productivity of a process
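Concretely, the Good-Turing estimate of the total probability mass of unseen events is

  P0 = N1 / N

where N1 is the number of types seen exactly once (the hapax legomena) and N is the total number of tokens: the more hapaxes, the more probability mass is reserved for “more where that came from”, which is what makes it usable as a productivity measure.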
Related points
  • Baayen and Sproat (1996) showed that the best predictor of the prior probability of a given usage of an unseen morphologically complex word is the most frequent usage among the hapax legomena (see http://acl.ldc.upenn.edu/J/J96/J96-2001.pdf).
  • Sproat and Shih (1996) showed that root compounds in Chinese are productive using a Good-Turing estimate
Summary
  • N-gram models are an approximation to the correct model as given by the chain rule
  • N-gram models are relatively easy to use, but suffer from severe sparse data problems
  • There are a variety of techniques for ameliorating sparse data problems
  • These techniques relate more generally to word frequency distributions and are useful in areas beyond n-gram modeling
Backoff

[Diagram, not fully recoverable: a backoff language model as a weighted automaton, in which the state for the history wi-2 wi-1 has a failure arc, carrying the backoff weight, to the lower-order history state wi-1.]
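In the usual Katz-style formulation (an assumption here, since the slide gives only the diagram):

  Pbo(wi | wi-2 wi-1) = P*(wi | wi-2 wi-1)            if c(wi-2 wi-1 wi) > 0
                      = α(wi-2 wi-1) Pbo(wi | wi-1)   otherwise

where P* is a discounted estimate and the backoff weight α is what labels the failure arc.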

Hidden Markov Models (HMMs)

[This slide repeats the earlier toy HMM: states <s>, N, V, </s>, with the same transition and emission probabilities and the same example computation for dog eats dog.]

An equivalent WFST

[Diagram: states <s>, N, V, </s>, with arcs such as V:dog/P(V|<s>)P(dog|V) and V:eats/P(V|<s>)P(eats|V) leaving <s>.]

  • Arcs are labeled with tag:word pairs
  • States represent last seen tag
  • Arc costs are combined transition and emission costs
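A minimal sketch of building those arcs from the toy HMM (plain tuples stand in for a real WFST library such as the one used for Homework 1; the arc format and names are my assumptions):

```python
# Toy HMM parameters (from the earlier slide).
TRANS = {("<s>", "N"): 0.5, ("<s>", "V"): 0.5,
         ("N", "N"): 0.1, ("N", "V"): 0.8, ("N", "</s>"): 0.1,
         ("V", "N"): 0.7, ("V", "V"): 0.1, ("V", "</s>"): 0.2}
EMIT = {("N", "dog"): 0.9, ("N", "eats"): 0.1,
        ("V", "dog"): 0.1, ("V", "eats"): 0.9}

# One arc per (previous tag, tag, word): states are the last tag seen,
# labels are tag:word pairs, and the weight combines the transition and
# emission probabilities. Arcs into </s> carry only the transition weight.
arcs = []
for (prev, tag), p_trans in TRANS.items():
    if tag == "</s>":
        arcs.append((prev, tag, None, p_trans))
        continue
    for (t, word), p_emit in EMIT.items():
        if t == tag:
            arcs.append((prev, tag, f"{tag}:{word}", p_trans * p_emit))

for arc in sorted(arcs, key=str):
    print(arc)
```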
Homework 1
  • See: http://www.cslu.ogi.edu/~sproatr/Courses/CompLing/Homework/homework1.html