LING / C SC 439/539 Statistical Natural Language Processing
Lecture 2, 1/14/2013

For you to do
  • Sign attendance sheet
  • Turn in survey/quiz
  • Fill out Doodle poll for office hours, if you think you might want to go to office hours
    • Link was sent to you in an e-mail
Recommended reading
  • Manning & Schutze
    • 1.4: Zipf’s Law
    • Chapter 4: Corpus-based work
  • Jurafsky & Martin Chapter 4
    • 4.1, 4.2, 4.3: counting n-grams
    • 4.5: smoothing
Outline
  • Word frequencies and Zipf’s law
  • N-grams and sparse data
  • Probability theory I
  • Smoothing I
Get statistics to do statistical NLP
  • Determine how often things occur in language
    • But language is an abstract entity
    • Look at a corpus instead
  • Corpus: large electronic file of language usage
    • Written texts: news articles, fiction, poetry, web pages
    • Spoken language: audio files, text transcriptions
  • Gather statistics from a corpus
    • Count occurrences of different types of linguistic units: words, n-grams (sequences), part of speech (POS) tags, etc.
Where to get corpora
  • Do a web search for “<language> corpus”
  • NLTK (Natural Language Toolkit):

http://nltk.org/nltk_data/

  • Linguistic Data Consortium (LDC)

http://www.ldc.upenn.edu/

    • Many LDC corpora are at the UA library
    • Main languages are English, Arabic, Chinese; spoken and written
    • http://corplinguistics.wordpress.com/2011/10/30/top-five-ldc-corpora/
  • Many other corpora, some accessible online:

http://en.wikipedia.org/wiki/Text_corpus

Brown Corpus
  • http://en.wikipedia.org/wiki/Brown_Corpus
  • Compilation of texts of American English in the 1960s
  • 500 texts from 15 different genres: news, religious texts, popular books, government documents, fiction, etc.
  • 1,154,633 words
  • You will have access to this corpus
How do we define “word”?
  • Simplest answer: any contiguous sequence of non-whitespace characters is a word
    • Examples on next slide
  • It is useful to first perform tokenization: split a raw text into linguistically meaningful lexical units
  • In this class we will use a pre-tokenized version of the Brown Corpus
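As a small illustration of the difference between whitespace splitting and tokenization, here is a minimal sketch; the regular expression is an illustrative assumption, not the tokenizer used to produce the class corpus:

```python
import re

raw = "Caldwell's troops, who reached Baden-Baden in 1961-62, spent $50 (the fund)."

# "Word" = any whitespace-separated chunk: punctuation stays glued on.
whitespace_words = raw.split()
# ["Caldwell's", 'troops,', 'who', ..., '(the', 'fund).']

# A rough tokenizer: keep hyphenated words, numbers, possessives, and
# dollar amounts together, and split punctuation off as separate tokens.
tokens = re.findall(r"\$?[\w'.,-]*\w|[^\w\s]", raw)
# ["Caldwell's", 'troops', ',', 'who', ..., '(', 'the', 'fund', ')', '.']

print(whitespace_words)
print(tokens)
```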
Some examples of “words” in the untokenized Brown Corpus

  • already
  • The
  • the
  • (the
  • Caldwell
  • Caldwell's
  • Caldwell,
  • ribbon''
  • Lt.
  • Ave.,
  • 1964
  • 1940s
  • $50
  • $6,100,000,000
  • fire-fighting
  • natural-law
  • Baden-Baden
  • 1961-62
  • France-Germany
  • Kohnstamm-positive
  • Colquitt--After
Frequencies of words in a corpus: types and tokens
  • Brown corpus:
    • 1,154,633 word tokens
    • 49,680 word types
  • Type vs. token (for words):
    • Type: a distinct word
      • e.g. “with”
    • Token: an individual occurrence of a word
      • e.g. “with” occurs 7,270 times
    • Question: what percentage of the word types in a corpus appear exactly once? (make a guess)
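A minimal sketch of gathering these counts, assuming a pre-tokenized corpus saved as plain text (the filename is hypothetical):

```python
from collections import Counter

# Hypothetical path to a pre-tokenized corpus; tokens separated by whitespace.
with open("brown_tokenized.txt", encoding="utf-8") as f:
    tokens = f.read().split()

counts = Counter(tokens)

print("tokens:", len(tokens))        # total occurrences (~1,154,633 for Brown)
print("types:", len(counts))         # distinct words   (~49,680 for Brown)
print("count('with'):", counts["with"])

# The question above: what fraction of word types occur exactly once?
hapaxes = sum(1 for c in counts.values() if c == 1)
print("fraction of types seen once:", hapaxes / len(counts))
```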
Frequency and rank
  • Sort words by decreasing frequency
  • Rank = order in sorted list
    • Rank 1: most-frequent word
    • Rank 2: second most-frequent word
    • Rank 3: third most-frequent word
    • etc.
  • Plot word frequencies by rank
Plot of word frequencies, log-log scale

log₁₀(10) = 1,  log₁₀(100) = 2,  log₁₀(1000) = 3

Plot of word frequencies, log-log scale

10 most-freq words: freq. > 10,000

Next 90 words: 1,000 < freq. < 10,000

Next 900 words: 100 < freq. < 1,000

Next 9000: 10 < freq. < 100

~40,000 words: 1 <= freq < 10

Word frequency distributions in language
  • Exemplifies a power law distribution
  • For any corpus and any language:
    • There are a few very common words
    • A substantial number of medium freq. words
    • A huge number of low frequency words
  • Brown corpus
    • 1,154,633 word tokens, 49,680 word types
    • 21,919 types appear only once! = 44.1%
Word frequencies follow Zipf’s law
  • http://en.wikipedia.org/wiki/Zipf%27s_law
  • Zipf’s Law (1935, 1949): the frequency F of a word w is inversely proportional to the rank R of w:

F ∝ 1 / R,   i.e.  F × R = k  for some constant k

  • Example: the 50th most common word occurs approximately 3 times as frequently as the 150th most common word:

freq. at rank 50 ∝ 1 / 50

freq. at rank 150 ∝ 1 / 150

( 1 / 50 ) / ( 1 / 150 ) = 3
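A sketch of checking Zipf's law on a corpus: rank the word types by frequency and inspect frequency × rank. The filename is hypothetical (as in the earlier sketch), and in practice the product is only roughly constant:

```python
from collections import Counter

# Hypothetical pre-tokenized corpus file, as in the earlier sketches.
with open("brown_tokenized.txt", encoding="utf-8") as f:
    counts = Counter(f.read().split())

# Frequencies sorted in decreasing order; index i holds the frequency of rank i+1.
freqs = sorted(counts.values(), reverse=True)

for rank in (1, 10, 50, 100, 150, 1000, 10000):
    freq = freqs[rank - 1]
    print(f"rank {rank:>6}   freq {freq:>8}   freq*rank {freq * rank:>10}")
# If Zipf's law held exactly, freq * rank would be the same constant k on every
# line, so the rank-50 word would be 3x as frequent as the rank-150 word.
```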

Zipf’s Law explains linear-like relationship between frequency and rank in log-log scale

Red line = constant k in Zipf’s Law

Some words with a frequency of 20 in Brown Corpus
  • pursue
  • purchased
  • punishment
  • promises
  • procurement
  • probabilities
  • precious
  • pitcher
  • pitch

  • replacement
  • repair
  • relating
  • rehabilitation
  • refund
  • receives
  • ranks
  • queen
  • quarrel
  • puts

Some words with a frequency of 1 in Brown Corpus
  • government-controlled
  • gouverne
  • goutte
  • gourmets
  • gourmet's
  • gothic
  • gossiping
  • gossiped
  • gossamer
  • gosh
  • gorgeously
  • gorge
  • gooshey
  • goooolick
  • goofed
  • gooey
  • goody
  • goodness'
  • goodies
  • good-will
Frequencies of different types of words
  • Extremely frequent words
    • Belong to part-of-speech categories of Determiner, Preposition, Conjunction, Pronoun
    • Also frequent adverbs, and punctuation
    • Linguistically, these are function words: convey the syntactic information in a sentence
  • Moderately frequent words
    • Common Nouns, Verbs, Adjectives, Adverbs
    • Linguistically, these are content words: convey semantic information in a sentence
  • Infrequent words
    • Also content words: Nouns, Verbs, Adjectives, rare Adverbs (such as “gorgeously”)
    • New words, names, foreign words, many numbers
Consequence of Zipf’s Law: unknown words
  • Because of highly skewed distribution, many possible words won’t appear in a corpus, and are therefore “unknown”
  • Some common words not found in Brown Corpus:

combustible, parabola, preprocess, headquartering, deodorizer, deodorizers, usurps, usurping

  • Names of people and places, especially foreign names
  • Vocabulary of specialized domains
    • Medical, legal, scientific, etc.
  • Neologisms (newly-formed words) won’t be in a corpus
Some neologisms (http://www.wordspy.com/diversions/neologisms.asp)
  • self-coin, verb
    • To coin an already existing word that you didn't know about.
  • awkword, noun
    • A word that is difficult to pronounce.
  • jig-sawdust, noun
    • The sawdust-like bits that fall off jig-saw puzzle pieces.
  • multidude, noun
    • The collective noun for a group of surfers.
Outline
  • Word frequencies and Zipf’s law
  • N-grams and sparse data
  • Probability theory I
  • Smoothing I
N-grams
  • An n-gram is a contiguous sequence of N units, for some choice of linguistic unit (words, POS tags, etc.)
  • Example sentence, POS-tagged:

The/DT brown/JJ fox/NN jumped/VBD ./.

    • 5 different word 1-grams (unigrams): The, brown, fox, jumped, .
    • 4 different word 2-grams (bigrams): The brown, brown fox, fox jumped, jumped .
    • 3 different word 3-grams (trigrams): The brown fox, brown fox jumped, fox jumped .
    • 4 different POS bigrams: DT JJ, JJ NN, NN VBD, VBD .
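A minimal sketch of extracting n-grams from a token sequence; the counts of distinct n-grams match the example above:

```python
def ngrams(tokens, n):
    """All contiguous n-grams (as tuples) in a sequence of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = ["The", "brown", "fox", "jumped", "."]
tags = ["DT", "JJ", "NN", "VBD", "."]

print(ngrams(words, 2))
# [('The', 'brown'), ('brown', 'fox'), ('fox', 'jumped'), ('jumped', '.')]
print(len(set(ngrams(words, 1))),   # 5 distinct unigrams
      len(set(ngrams(words, 2))),   # 4 distinct bigrams
      len(set(ngrams(words, 3))))   # 3 distinct trigrams
print(ngrams(tags, 2))
# [('DT', 'JJ'), ('JJ', 'NN'), ('NN', 'VBD'), ('VBD', '.')]
```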
N-gram frequencies in a corpus
  • Corpus linguists research language usage
    • See what words an author uses
    • See how language changes
  • Look at n-gram frequencies in corpora
    • Example: Google n-grams

http://books.google.com/ngrams

Sparse data problem
  • For many linguistic units, there will often be very few or even zero occurrences of logically possible n-grams in a corpus
  • Example: quick brown beaver
    • Not found in the Brown corpus (1.2 million words)
    • Not found by Google either!
Sparse data results from combinatorics
  • As the units to be counted in a corpus become larger (i.e., higher N for n-grams), fewer occurrences are expected to be seen
  • Due to exponential increase in # of logically possible N-grams
  • Example: Brown corpus: ~50,000 word types
    • Number of logically possible word N-grams

N = 1: 50,000 different unigrams

N = 2: 50,000² = 2.5 billion different bigrams

N = 3: 50,000³ = 125 trillion different trigrams
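In general, with a vocabulary of V word types there are V^N logically possible word N-grams; the counts above are just this formula with V = 50,000:

```latex
\#\{\text{possible } N\text{-grams}\} = V^{N}, \qquad
V = 5\times 10^{4}:\quad
V^{1} = 5\times 10^{4},\quad
V^{2} = 2.5\times 10^{9},\quad
V^{3} = 1.25\times 10^{14}
```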

Expected frequency of an N-gram (Brown: 1.2 million tokens, 50k word types)
  • Suppose N-grams are uniformly distributed
    • “uniform” = each occurs equally often
  • Calculate expected frequencies of N-grams:
    • 50,000 word unigrams
      • 1.2 million / 50,000 = 24 occurrences
    • 2.5 billion word bigrams
      • 1.2 million / 2.5 billion = .00048 occurrences
    • 125 trillion word trigrams:
      • 1.2 million / 125 trillion = .0000000096 occurrences
Consequences of Zipf’s Law (II)
  • Zipf’s law: highly skewed distribution for words
    • Also holds for n > 1 (bigrams, trigrams, etc.), and for any linguistic unit
  • Makes the sparse data problem even worse!
    • Because of combinatorics, a higher value of N leads to sparse data for N-grams
    • But because of Zipf’s Law, even if N is small, the frequencies of N-grams consisting of low-frequency words will be quite sparse
Dealing with sparse data
  • Since in NLP we develop systems to analyze novel examples of language, we must deal with sparse data somehow (cannot ignore)
  • Would need a massively massive corpus to count frequencies of rare n-grams
    • Infeasible; even Google isn’t big enough!
  • Smooth the frequencies
  • Impose structure upon data
    • Linguistic analysis: POS categories, syntax, morphemes, phonemes, semantic categories, etc.
Google ngrams again
  • Access at http://books.google.com/ngrams
  • Also available in library on 6 DVDs:

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13

  • Number of tokens: 1,024,908,267,229
  • Number of sentences: 95,119,665,584
  • Quantity of ngrams (types)

Number of unigrams: 13,588,391

Number of bigrams: 314,843,401

Number of trigrams: 977,069,902

Number of fourgrams: 1,313,818,354

Number of fivegrams: 1,176,470,663

Outline
  • Word frequencies and Zipf’s law
  • N-grams and sparse data
  • Probability theory I
  • Smoothing I
1. Discrete random variables
  • A discrete random variable takes on a range of values, or events
  • The set of possible events is the sample space, Ω
  • Example: rolling a die

Ω = {1 dot, 2 dots, 3 dots, 4 dots, 5 dots, 6 dots}

  • The occurrence of a random variable taking on a particular value from the sample space is a trial
2. Probability distribution
  • A set of data can be described as a probability distribution over a set of events
  • Definition of a probability distribution:
    • We have a set of events x drawn from a finite sample space Ω
    • Probability of each event is between 0 and 1
    • Sum of probabilities of all events is 1
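Written out, the two conditions on a distribution p over a finite sample space Ω are:

```latex
0 \le p(x) \le 1 \;\; \text{for all } x \in \Omega,
\qquad
\sum_{x \in \Omega} p(x) = 1
```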
Example: Probability distribution
  • Suppose you have a die that is equally weighted on all sides.
  • Let X be the random variable for the outcome of a single roll.

p(X=1 dot) = 1 / 6

p(X=2 dots) = 1 / 6

p(X=3 dots) = 1 / 6

p(X=4 dots) = 1 / 6

p(X=5 dots) = 1 / 6

p(X=6 dots) = 1 / 6

Probability estimation
  • Suppose you have a die and you don’t know how it is weighted.
  • Let X be the random variable for the outcome of a roll.
  • Want to produce values for p̂(X), which is an estimate of the probability distribution of X.
    • Read as “p-hat”
  • Do this through Maximum Likelihood Estimation (MLE): the probability of an event is the number of times it occurs, divided by the total number of trials.
Example: roll a die; random variable X
  • Data: roll a die 60 times, record the frequency of each event
    • 1 dot 9 rolls
    • 2 dots 10 rolls
    • 3 dots 9 rolls
    • 4 dots 12 rolls
    • 5 dots 9 rolls
    • 6 dots 11 rolls
Example: roll a die; random variable X
  • Maximum Likelihood Estimate:

p̂(X=x) = count(x) / total_count_of_all_events

  • p̂( X = 1 dot) = 9 / 60 = 0.150

p̂( X = 2 dots) = 10 / 60 = 0.166

p̂( X = 3 dots) = 9 / 60 = 0.150

p̂( X = 4 dots) = 12 / 60 = 0.200

p̂( X = 5 dots) = 9 / 60 = 0.150

p̂( X = 6 dots) = 11 / 60 = 0.183

Sum = 60 / 60 = 1.0
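A minimal sketch of computing the Maximum Likelihood Estimate from these counts:

```python
from collections import Counter

# The 60 die rolls from the slide above, as event -> observed count.
counts = Counter({"1 dot": 9, "2 dots": 10, "3 dots": 9,
                  "4 dots": 12, "5 dots": 9, "6 dots": 11})

def mle(counts):
    """MLE: probability of an event = its count / total number of trials."""
    total = sum(counts.values())
    return {event: c / total for event, c in counts.items()}

p_hat = mle(counts)
print(p_hat["4 dots"])        # 12 / 60 = 0.2
print(sum(p_hat.values()))    # 1.0: the estimates form a probability distribution
```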

Convergence of p̂(X)
  • Suppose we know that the die is equally weighted.
  • We observe that our values for p̂(X) are close to p(X), but not all exactly equal.
  • We would expect that as the number of trials increases, p̂(X) will get closer to p(X).
    • For example, we could roll the die 1,000,000 times. Probability estimate will improve with more data.
Simplify notation
  • People are often not precise, and write “p(X)” when they mean “p̂(X)”
    • We will do this also
  • Can also leave out the name of the random variable when it is understood
    • Example: p(X = 4 dots) → p(4 dots)
Outline
  • Word frequencies and Zipf’s law
  • N-grams and sparse data
  • Probability theory I
  • Smoothing I
Don’t use MLE for n-gram probabilities
  • Maximum Likelihood Estimation applied to n-grams:
    • count(X) = observed frequency of X in a corpus
    • p(X) = observed probability of X in a corpus
  • However, a corpus is only a sample of the language
    • Does not contain all possible words, phrases, constructions, etc.
    • Linguistically possible items that never occur in the corpus are assigned zero probability by MLE
    • Zero probability means impossible
    • But they are not impossible…
    • Need to revise probability estimate
Smoothing
  • Take probability mass away from observed items, and assign to zero-count items
    • Solid line: observed counts
    • Dotted line: smoothed counts
Smoothing methods

1. Add-one smoothing
2. Deleted estimation
3. Good-Turing smoothing

(will see later:)

4. Witten-Bell smoothing
5. Backoff smoothing
6. Interpolated backoff

Add-one smoothing (Laplace estimator)
  • Simplest method
  • Example: smoothing N-grams
    • First, determine how many N-grams there are (for some value of N)
    • The frequencies of N-grams that are observed in the training data are not modified
    • For all other N-grams, assign a small constant value, such as 0.05
Probability distribution changes after add-one smoothing
  • Since zero-count N-grams now have a nonzero frequency, the probability mass assigned to observed N-grams decreases
  • Example: 5 possible items

Original counts: [ 5, 3, 2, 0, 0 ]

Probability distribution: [ .5, .3, .2, 0.0, 0.0 ]

Smoothed counts, add 0.05: [5, 3, 2, .05, .05 ]

Smoothed probabilities: [.495, .297, .198, .005, .005 ]

Zero-count probabilities are now nonzero; observed probabilities are discounted.
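A sketch of the variant described here (observed counts left unchanged, zero-count items given a small constant count, then renormalized); it reproduces the 5-item example above:

```python
def smooth_and_normalize(counts, all_items, add=0.05):
    """Observed counts are kept as-is; unseen items get a small constant
    count; the result is renormalized into a probability distribution."""
    smoothed = {x: counts.get(x, 0) or add for x in all_items}
    total = sum(smoothed.values())
    return {x: c / total for x, c in smoothed.items()}

# 5 possible items, two of which were never observed.
counts = {"a": 5, "b": 3, "c": 2}
items = ["a", "b", "c", "d", "e"]
print(smooth_and_normalize(counts, items))
# ≈ {'a': 0.495, 'b': 0.297, 'c': 0.198, 'd': 0.005, 'e': 0.005}
```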

Add-one smoothing (Laplace estimator)
  • Advantage: very simple technique
  • Disadvantages:
    • Too much probability mass can be assigned to zero-count n-grams. This is because there can be a huge number of zero-count n-grams.
  • Example:
    • Suppose you smooth word bigrams in the Brown corpus (1.2 million words), and use a smoothed frequency of .05.
    • The 495,000 different observed bigrams have a total smoothed count of 1,200,000.
    • The 2.5 billion unobserved bigrams have a total smoothed count of .05 x 2,500,000,000 = 125,000,000.
    • The probability mass assigned to unobserved bigrams is 125,000,000 / (125,000,000 + 1,200,000) = 99% !!!
Add-one smoothing (Laplace estimator)
  • Solutions:
    • Use a smaller smoothed count
      • 0.5, .01, .001, .0001, .00001, etc.
      • Exact value depends on your application and the amount of data to be smoothed
    • Use a different smoothing method
2. Deleted estimation
  • Divide 3 million words of Wall Street Journal into two halves, compare word counts in each half
  • Many words appear in only one half, even for count > 1

Original source: Mark Liberman, 1992 ACL tutorial

Estimate counts by split halves
  • Average count in 2nd half for words in 1st half, including words that didn’t appear in 1st half

  • MLE for words in 1st half of corpus is “Half 1” column.
  • Using this as an estimate of word counts in “Half 2”:
    • MLE for zero-count words is very poor.
    • MLE for nonzero-count words is always too high.
Deleted estimation
  • Smoothed count by deleted estimation is the average of:
    • the average frequency in the first half, plus
    • the average frequency in the second half,
    • for words with a given count in the first half
  • Shifts probability mass towards zero-count items
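A rough sketch of the held-out counting behind this idea, under the simplifying assumption that the vocabulary is just the words seen in either half (function name and toy data are illustrative):

```python
from collections import Counter

def heldout_average_counts(half1_tokens, half2_tokens):
    """For each count r in half 1, return the average count in half 2 of the
    word types that occurred exactly r times in half 1 (including r = 0 for
    words unseen in half 1). Deleted estimation proper averages this
    statistic over both directions of the split."""
    c1, c2 = Counter(half1_tokens), Counter(half2_tokens)
    vocab = set(c1) | set(c2)
    counts_by_r = {}
    for w in vocab:
        counts_by_r.setdefault(c1[w], []).append(c2[w])
    return {r: sum(cs) / len(cs) for r, cs in sorted(counts_by_r.items())}

# Toy usage: split a token list in half and compare the two halves.
tokens = "the cat sat on the mat and the dog sat on the cat".split()
half = len(tokens) // 2
print(heldout_average_counts(tokens[:half], tokens[half:]))
# {0: 1.0, 1: 0.75, 2: 2.0} -- zero-count words get a nonzero estimate
```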
3. Good-Turing smoothing
  • Assumes that data is binomially distributed (see http://www.roymech.co.uk/Useful_Tables/Statistics/Statistics_Distributions.html)
  • Let Nc be the number of items that occur c times.
  • The adjusted frequency c* is:

c* = (c + 1) × N_{c+1} / N_c

Good-Turing smoothing: example (Jurafsky & Martin Figure 4.8)
  • Corpus: AP Newswire
  • c is Maximum Likelihood Estimate
  • c* is discounted, by Good-Turing
  • Example:

5* = (5 + 1) × (N6 / N5) = 6 × (48190 / 68379) = 4.22849
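A minimal sketch of the discount formula, using only the two counts-of-counts quoted in this example:

```python
def good_turing_count(c, Nc):
    """Good-Turing adjusted count: c* = (c + 1) * N_{c+1} / N_c,
    where Nc[c] is the number of N-gram types seen exactly c times."""
    return (c + 1) * Nc[c + 1] / Nc[c]

# The two counts-of-counts quoted above for the AP Newswire example.
Nc = {5: 68379, 6: 48190}
print(good_turing_count(5, Nc))   # 6 * 48190 / 68379 ≈ 4.228
```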

Results are corpus-dependent. Left: AP Newswire; right: Berkeley Restaurant corpus (J&M Fig. 4.8)
Problems with Good-Turing smoothing
  • To estimate 0*, you must know how many things never occurred
    • Number of zero-frequency N-grams: all possible combinations of words?
  • For larger c, the Nc values get very small, so they themselves must be smoothed!
    • In practice, just smooth low values of c

E.g., leave frequencies of 6 or more unsmoothed

How do we treat novel N-grams?
  • Simple methods that assign equal probability to all zero-count N-grams:
    • Add-one smoothing
    • Deleted estimation
    • Good-Turing smoothing
  • Assign differing probability to zero-count N-grams:
    • Witten-Bell smoothing
    • Backoff smoothing
    • Interpolated backoff