LING / C SC 439/539 Statistical Natural Language Processing


For you to do

- Sign attendance sheet
- Turn in survey/quiz
- Fill out Doodle poll for office hours, if you think you might want to go to office hours
- Link was sent to you in an e-mail

Recommended reading

- Manning & Schutze
- 1.4: Zipf’s Law
- Chapter 4: Corpus-based work
- Jurafsky & Martin Chapter 4
- 4.1, 4.2, 4.3: counting n-grams
- 4.5: smoothing

Outline

- Word frequencies and Zipf’s law
- N-grams and sparse data
- Probability theory I
- Smoothing I

Get statistics to do statistical NLP

- Determine how often things occur in language
- But language is an abstract entity
- Look at a corpus instead
- Corpus: large electronic file of language usage
- Written texts: news articles, fiction, poetry, web pages
- Spoken language: audio files, text transcriptions
- Gather statistics from a corpus
- Count occurrences of different types of linguistic units: words, n-grams (sequences), part of speech (POS) tags, etc.
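The counting step above can be sketched in a few lines of Python with `collections.Counter`; the toy text here is made up for illustration, whereas a real corpus would be a large file (e.g. the Brown Corpus via NLTK):

```python
from collections import Counter

# A tiny made-up "corpus"; a real corpus would be a large text file.
text = "the quick brown fox jumped over the lazy dog . the dog slept ."
tokens = text.split()          # simplest tokenization: split on whitespace
counts = Counter(tokens)       # frequency of each word type

print(counts["the"])   # 3
print(counts["dog"])   # 2
```

The same pattern counts any linguistic unit: replace `tokens` with a list of n-grams or POS tags.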

Where to get corpora

- Do a web search for “<language> corpus”
- NLTK (Natural Language Toolkit):

http://nltk.org/nltk_data/

- Linguistic Data Consortium (LDC)

http://www.ldc.upenn.edu/

- Many LDC corpora are at the UA library
- Main languages are English, Arabic, Chinese; spoken and written
- http://corplinguistics.wordpress.com/2011/10/30/top-five-ldc-corpora/
- Many other corpora, some accessible online:

http://en.wikipedia.org/wiki/Text_corpus

Brown Corpus

- http://en.wikipedia.org/wiki/Brown_Corpus
- Compilation of texts of American English in the 1960s
- 500 texts from 15 different genres: news, religious texts, popular books, government documents, fiction, etc.
- 1,154,633 words
- You will have access to this corpus

How do we define “word”?

- Simplest answer: any contiguous sequence of non-whitespace characters is a word
- Examples on next slide
- It is useful to first perform tokenization: split a raw text into linguistically meaningful lexical units
- In this class we will use a pre-tokenized version of the Brown Corpus

Some examples of “words” in the untokenized Brown Corpus

- already
- The
- the
- (the
- Caldwell
- Caldwell's
- Caldwell,
- ribbon''
- Lt.
- Ave.,
- 1964
- 1940s
- $50
- $6,100,000,000
- fire-fighting
- natural-law
- Baden-Baden
- 1961-62
- France-Germany
- Kohnstamm-positive
- Colquitt--After

Frequencies of words in a corpus: types and tokens

- Brown corpus:
- 1,154,633 word tokens
- 49,680 word types
- Type vs. token (for words):
- Type: a distinct word
- e.g. “with”
- Token: an individual occurrence of a word
- e.g. “with” occurs 7,270 times
- Question: what percentage of the word types in a corpus appear exactly once? (make a guess)
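The type/token distinction, and the "how many types occur exactly once" question, can be computed directly; the toy token list below is invented for illustration:

```python
from collections import Counter

# Invented toy data: 10 tokens, 6 types, 3 of which occur exactly once
tokens = "a a a b b c d d e f".split()
counts = Counter(tokens)

n_tokens = len(tokens)                              # 10 tokens
n_types = len(counts)                               # 6 types
hapaxes = [w for w, c in counts.items() if c == 1]  # types occurring once
print(len(hapaxes) / n_types)                       # 0.5
```

Run on a real corpus, the last line answers the question posed above.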

Frequency and rank

- Sort words by decreasing frequency
- Rank = order in sorted list
- Rank 1: most-frequent word
- Rank 2: second most-frequent word
- Rank 3: third most-frequent word
- etc.
- Plot word frequencies by rank
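Sorting by decreasing frequency to obtain ranks is a one-liner; the toy sentence is made up:

```python
from collections import Counter

counts = Counter("the cat sat on the mat with the cat".split())
# Sort (word, freq) pairs by decreasing frequency; rank = position in list
by_rank = sorted(counts.items(), key=lambda wc: wc[1], reverse=True)
for rank, (word, freq) in enumerate(by_rank, start=1):
    print(rank, word, freq)
# rank 1 is "the" (freq 3), rank 2 is "cat" (freq 2), ...
```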

Plot of word frequencies, log-log scale

- 10 most-frequent words: freq. > 10,000
- Next 90 words: 1,000 < freq. < 10,000
- Next 900 words: 100 < freq. < 1,000
- Next 9,000 words: 10 < freq. < 100
- ~40,000 words: 1 <= freq. < 10

Word frequency distributions in language

- Exemplifies a power law distribution
- For any corpus and any language:
- There are a few very common words
- A substantial number of medium freq. words
- A huge number of low frequency words
- Brown corpus
- 1,154,633 word tokens, 49,680 word types
- 21,919 types appear only once! = 44.1%

Word frequencies follow Zipf’s law

- http://en.wikipedia.org/wiki/Zipf%27s_law
- Zipf’s Law (1935, 1949): the frequency F of a word w is inversely proportional to the rank R of w:

F ∝ 1 / R

i.e., F × R = k, for some constant k

- Example: the 50th most common word occurs approximately 3 times as frequently as the 150th most common word

freq. at rank 50: 1 / 50

freq. at rank 150: 1 / 150

( 1 / 50 ) / ( 1 / 150 ) = 3
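The rank-50 vs. rank-150 calculation can be checked numerically; the constant k below is an arbitrary hypothetical value, since it cancels in the ratio:

```python
def zipf_freq(rank, k=60000.0):
    """Idealized Zipf frequency: F = k / R for some constant k (hypothetical here)."""
    return k / rank

ratio = zipf_freq(50) / zipf_freq(150)   # (k/50) / (k/150) = 150/50
print(ratio)                             # 3.0
```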

Zipf’s Law explains linear-like relationship between frequency and rank in log-log scale

Red line = constant k in Zipf’s Law

Some words with a frequency of 20 in Brown Corpus

- pursue
- purchased
- punishment
- promises
- procurement
- probabilities
- precious
- pitcher
- pitch

- replacement
- repair
- relating
- rehabilitation
- refund
- receives
- ranks
- queen
- quarrel
- puts

Some words with a frequency of 1 in Brown Corpus

- government-controlled
- gouverne
- goutte
- gourmets
- gourmet's
- gothic
- gossiping
- gossiped
- gossamer
- gosh

- gorgeously
- gorge
- gooshey
- goooolick
- goofed
- gooey
- goody
- goodness'
- goodies
- good-will

Frequencies of different types of words

- Extremely frequent words
- Belong to part-of-speech categories of Determiner, Preposition, Conjunction, Pronoun
- Also frequent adverbs, and punctuation
- Linguistically, these are function words: convey the syntactic information in a sentence
- Moderately frequent words
- Common Nouns, Verbs, Adjectives, Adverbs
- Linguistically, these are content words: convey semantic information in a sentence
- Infrequent words
- Also content words: Nouns, Verbs, Adjectives, rare Adverbs (such as “gorgeously”)
- New words, names, foreign words, many numbers

Consequence of Zipf’s Law: unknown words

- Because of highly skewed distribution, many possible words won’t appear in a corpus, and are therefore “unknown”
- Some common words not found in Brown Corpus:

combustible, parabola, preprocess, headquartering, deodorizer, deodorizers, usurps, usurping

- Names of people and places, especially foreign names
- Vocabulary of specialized domains
- Medical, legal, scientific, etc.
- Neologisms (newly-formed words) won’t be in a corpus

Some neologisms (http://www.wordspy.com/diversions/neologisms.asp)

- self-coin, verb
- To coin an already existing word that you didn't know about.
- awkword, noun
- A word that is difficult to pronounce.
- jig-sawdust, noun
- The sawdust-like bits that fall off jig-saw puzzle pieces.
- multidude, noun
- The collective noun for a group of surfers.

Outline

- Word frequencies and Zipf’s law
- N-grams and sparse data
- Probability theory I
- Smoothing I

N-grams

- An n-gram is a contiguous sequence of N units, for some choice of linguistic unit (words, POS tags, etc.)
- Example sentence, POS-tagged:

The/DT brown/JJ fox/NN jumped/VBD ./.

- 5 different word 1-grams (unigrams): The, brown, fox, jumped, .
- 4 different word 2-grams (bigrams): The brown, brown fox, fox jumped, jumped .
- 3 different word 3-grams (trigrams): The brown fox, brown fox jumped, fox jumped .
- 4 different POS bigrams: DT JJ, JJ NN, NN VBD, VBD .
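The n-gram extraction above can be sketched as a short function (a sliding window over the token list):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "The brown fox jumped .".split()
print(ngrams(words, 2))
# [('The', 'brown'), ('brown', 'fox'), ('fox', 'jumped'), ('jumped', '.')]
print(len(ngrams(words, 1)), len(ngrams(words, 2)), len(ngrams(words, 3)))
# 5 4 3 -- matching the unigram/bigram/trigram counts above
```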

N-gram frequencies in a corpus

- Corpus linguists research language usage
- See what words an author uses
- See how language changes
- Look at n-gram frequencies in corpora
- Example: Google n-grams

http://books.google.com/ngrams

Sparse data problem

- For many linguistic units, there will often be very few or even zero occurrences of logically possible n-grams in a corpus
- Example: quick brown beaver
- Not found in the Brown corpus (1.2 million words)
- Not found by Google either!

Sparse data results from combinatorics

- As the units to be counted in a corpus become larger (i.e., higher N for n-grams), fewer occurrences are expected to be seen
- Due to exponential increase in # of logically possible N-grams
- Example: Brown corpus: ~50,000 word types
- Number of logically possible word N-grams

N = 1: 50,000 different unigrams

N = 2: 50,000² = 2.5 billion different bigrams

N = 3: 50,000³ = 125 trillion different trigrams

Expected frequency of an N-gram (Brown: 1.2 million tokens, 50k word types)

- Suppose N-grams are uniformly distributed
- “uniform” = each occurs equally often
- Calculate expected frequencies of N-grams:
- 50,000 word unigrams
- 1.2 million / 50,000 = 24 occurrences
- 2.5 billion word bigrams
- 1.2 million / 2.5 billion = .00048 occurrences
- 125 trillion word trigrams:
- 1.2 million / 125 trillion = .0000000096 occurrences
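The combinatorics and expected-frequency arithmetic above is simple enough to verify directly:

```python
V = 50_000           # vocabulary size (word types in Brown)
tokens = 1_200_000   # corpus size in tokens (~1.2 million)

for n in (1, 2, 3):
    possible = V ** n              # logically possible n-grams
    expected = tokens / possible   # expected count if uniformly distributed
    print(n, possible, expected)
# n=1: 50,000 unigrams, expected count 24
# n=2: 2.5 billion bigrams, expected count 0.00048
# n=3: 125 trillion trigrams, expected count 9.6e-09
```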

Consequences of Zipf’s Law (II)

- Zipf’s law: highly skewed distribution for words
- Also holds for n > 1 (bigrams, trigrams, etc.), and for any linguistic unit
- Makes the sparse data problem even worse!
- Because of combinatorics, a higher value of N leads to sparse data for N-grams
- But because of Zipf’s Law, even if N is small, the frequencies of N-grams consisting of low-frequency words will be quite sparse

Dealing with sparse data

- Since in NLP we develop systems to analyze novel examples of language, we must deal with sparse data somehow (cannot ignore)
- Would need a massively massive corpus to count frequencies of rare n-grams
- Infeasible; even Google isn’t big enough!
- Smooth the frequencies
- Impose structure upon data
- Linguistic analysis: POS categories, syntax, morphemes, phonemes, semantic categories, etc.

Google ngrams again

- Access at http://books.google.com/ngrams
- Also available in library on 6 DVDs:

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13

- Number of tokens: 1,024,908,267,229
- Number of sentences: 95,119,665,584
- Quantity of ngrams (types)

Number of unigrams: 13,588,391

Number of bigrams: 314,843,401

Number of trigrams: 977,069,902

Number of fourgrams: 1,313,818,354

Number of fivegrams: 1,176,470,663

Outline

- Word frequencies and Zipf’s law
- N-grams and sparse data
- Probability theory I
- Smoothing I

1. Discrete random variables

- A discrete random variable takes on a range of values, or events
- The set of possible events is the sample space, Ω
- Example: rolling a die

Ω = {1 dot, 2 dots, 3 dots, 4 dots, 5 dots, 6 dots}

- The occurrence of a random variable taking on a particular value from the sample space is a trial

2. Probability distribution

- A set of data can be described as a probability distribution over a set of events
- Definition of a probability distribution:
- We have a set of events x drawn from a finite sample space Ω
- Probability of each event is between 0 and 1
- Sum of the probabilities of all events is 1: Σ p(x) = 1, summing over all x in Ω

Example: Probability distribution

- Suppose you have a die that is equally weighted on all sides.
- Let X be the random variable for the outcome of a single roll.

p(X=1 dot) = 1 / 6

p(X=2 dots) = 1 / 6

p(X=3 dots) = 1 / 6

p(X=4 dots) = 1 / 6

p(X=5 dots) = 1 / 6

p(X=6 dots) = 1 / 6

Probability estimation

- Suppose you have a die and you don’t know how it is weighted.
- Let X be the random variable for the outcome of a roll.
- Want to produce values for p̂(X), which is an estimate of the probability distribution of X.
- Read as “p-hat”
- Do this through Maximum Likelihood Estimation (MLE): the probability of an event is the number of times it occurs, divided by the total number of trials.

Example: roll a die; random variable X

- Data: roll a die 60 times, record the frequency of each event
- 1 dot 9 rolls
- 2 dots 10 rolls
- 3 dots 9 rolls
- 4 dots 12 rolls
- 5 dots 9 rolls
- 6 dots 11 rolls

Example: roll a die; random variable X

- Maximum Likelihood Estimate:

p̂(X=x) = count(x) / total_count_of_all_events

- p̂( X = 1 dot) = 9 / 60 = 0.150

p̂( X = 2 dots) = 10 / 60 = 0.167

p̂( X = 3 dots) = 9 / 60 = 0.150

p̂( X = 4 dots) = 12 / 60 = 0.200

p̂( X = 5 dots) = 9 / 60 = 0.150

p̂( X = 6 dots) = 11 / 60 = 0.183

Sum = 60 / 60 = 1.0
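The Maximum Likelihood Estimate from the table above can be computed in a couple of lines:

```python
# Observed die-roll frequencies from the example above (60 trials)
counts = {"1 dot": 9, "2 dots": 10, "3 dots": 9,
          "4 dots": 12, "5 dots": 9, "6 dots": 11}
total = sum(counts.values())                 # 60 trials

# MLE: p-hat(x) = count(x) / total count of all events
p_hat = {x: c / total for x, c in counts.items()}
print(p_hat["4 dots"])             # 0.2
print(sum(p_hat.values()))         # 1.0 -- a valid probability distribution
```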

Convergence of p̂(X)

- Suppose we know that the die is equally weighted.
- We observe that our values for p̂(X) are close to p(X), but not all exactly equal.
- We would expect that as the number of trials increases, p̂(X) will get closer to p(X).
- For example, we could roll the die 1,000,000 times. Probability estimate will improve with more data.

Simplify notation

- People are often not precise, and write “p(X)” when they mean “p̂(X)”
- We will do this also
- Can also leave out the name of the random variable when it is understood
- Example: p(X = 4 dots) → p(4 dots)

Outline

- Word frequencies and Zipf’s law
- N-grams and sparse data
- Probability theory I
- Smoothing I

Don’t use MLE for n-gram probabilities

- Maximum Likelihood Estimation applied to n-grams:
- count(X) = observed frequency of X in a corpus
- p(X) = observed probability of X in a corpus
- However, a corpus is only a sample of the language
- Does not contain all possible words, phrases, constructions, etc.
- Linguistically possible but nonexistent items in a corpus are assigned zero probability with MLE
- Zero probability means impossible
- But they are not impossible…
- Need to revise probability estimate

Smoothing

- Take probability mass away from observed items, and assign to zero-count items
- Solid line: observed counts
- Dotted line: smoothed counts

Smoothing methods

1. Add-one smoothing

2. Deleted estimation

3. Good-Turing smoothing

(will see later)

4. Witten-Bell smoothing

5. Backoff smoothing

6. Interpolated backoff

Add-one smoothing (Laplace estimator)

- Simplest method
- Example: smoothing N-grams
- First, determine how many N-grams there are (for some value of N)
- The frequencies of N-grams that are observed in the training data are not modified
- For all other N-grams, assign a small constant value, such as 0.05

Probability distribution changes after add-one smoothing

- Since zero-count N-grams now have a nonzero frequency, the probability mass assigned to observed N-grams decreases
- Example: 5 possible items

Original counts: [ 5, 3, 2, 0, 0 ]

Probability distribution: [ .5, .3, .2, 0.0, 0.0 ]

Smoothed counts, add 0.05: [5, 3, 2, .05, .05 ]

Smoothed probabilities: [.495, .297, .198, .005, .005 ]

(Zero-count probabilities are now nonzero; observed probabilities are discounted.)
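The worked example above (observed counts untouched, zero counts set to a small constant, then renormalized) can be sketched as follows:

```python
def add_k_smooth(counts, k=0.05):
    """Leave observed counts as-is; give each zero-count item a small count k,
    then renormalize into a probability distribution."""
    smoothed = [c if c > 0 else k for c in counts]
    total = sum(smoothed)
    return [c / total for c in smoothed]

probs = add_k_smooth([5, 3, 2, 0, 0], k=0.05)
print([round(p, 3) for p in probs])   # [0.495, 0.297, 0.198, 0.005, 0.005]
```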

Add-one smoothing (Laplace estimator)

- Advantage: very simple technique
- Disadvantages:
- Too much probability mass can be assigned to zero-count n-grams. This is because there can be a huge number of zero-count n-grams.
- Example:
- Suppose you smooth word bigrams in the Brown corpus (1.2 million words), and use a smoothed frequency of .05.
- The 495,000 different observed bigrams have a total smoothed count of 1,200,000.
- The 2.5 billion unobserved bigrams have a total smoothed count of .05 x 2,500,000,000 = 125,000,000.
- The probability mass assigned to unobserved bigrams is 125,000,000 / (125,000,000 + 1,200,000) = 99% !!!

Add-one smoothing (Laplace estimator)

- Solutions:
- Use a smaller smoothed count
- .01, .001, .0001, .00001, etc.
- Exact value depends on your application and the amount of data to be smoothed
- Use a different smoothing method

2. Deleted estimation

- Divide 3 million words of Wall Street Journal into two halves, compare word counts in each half
- Many words appear in only one half, even for count > 1

Original source; Mark Liberman, 1992 ACL tutorial

Estimate counts by split halves

- Average count in 2nd half for words in 1st half, including words that didn’t appear in 1st half

- MLE for words in 1st half of corpus is “Half 1” column.
- Using this as an estimate of word counts in “Half 2”:
- MLE for zero-count words is very poor.
- MLE for nonzero-count words is always too high.

Deleted estimation

- Smoothed count by deleted estimation is the average of:
- the average frequency in the first half, plus
- the average frequency in the second half,
- for words with a given count in the first half
- Shifts probability mass towards zero-count items
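The split-halves idea can be sketched for one direction (a full implementation would average both directions and account for the vocabulary never seen in either half; the two toy halves below are invented):

```python
from collections import Counter

def split_half_estimates(half1, half2):
    """For each count c in half 1, return the average count in half 2
    of the words having count c in half 1 (one direction only; deleted
    estimation also averages the symmetric direction)."""
    c1, c2 = Counter(half1), Counter(half2)
    groups = {}
    for w in set(half1) | set(half2):
        groups.setdefault(c1[w], []).append(c2[w])
    return {c: sum(v) / len(v) for c, v in sorted(groups.items())}

h1 = "a a a b b c".split()
h2 = "a a b b b d".split()
est = split_half_estimates(h1, h2)
print(est)   # e.g. words unseen in half 1 (count 0) average 1.0 in half 2
```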

3. Good-Turing smoothing

- Assumes that data is binomially distributed
- Let Nc be the number of items that occur c times.
- The adjusted frequency c* is:

http://www.roymech.co.uk/Useful_Tables/Statistics/Statistics_Distributions.html

Good-Turing smoothing: example (Jurafsky & Martin Figure 4.8)

- Corpus: AP Newswire
- c is Maximum Likelihood Estimate
- c* is discounted, by Good-Turing
- Example:

5* = (5 + 1) * (N6 / N5)

= 6 * (48190 / 68379)

= 4.22849
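The Good-Turing adjustment above is a direct translation of the formula; the counts-of-counts values are the AP Newswire figures quoted from J&M Fig. 4.8:

```python
def good_turing(c, Nc):
    """Good-Turing adjusted count: c* = (c + 1) * N_{c+1} / N_c."""
    return (c + 1) * Nc[c + 1] / Nc[c]

# Counts-of-counts from AP Newswire (J&M Fig. 4.8)
Nc = {5: 68379, 6: 48190}
print(round(good_turing(5, Nc), 5))   # 4.22849
```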

Results are corpus-dependent (left: AP Newswire, right: Berkeley Restaurant corpus; J&M Fig. 4.8)

Problems with Good-Turing smoothing

- To estimate 0*, you must know how many things never occurred
- Number of zero-frequency N-grams: all possible combinations of words?
- For larger c, the Nc values get very small, so they themselves must be smoothed!
- In practice, just smooth low values of c

E.g., leave frequencies of 6 or more unsmoothed

How do we treat novel N-grams?

- Simple methods that assign equal probability to all zero-count N-grams:
- Add-one smoothing
- Deleted estimation
- Good-Turing smoothing
- Assign differing probability to zero-count N-grams:
- Witten-Bell smoothing
- Backoff smoothing
- Interpolated backoff
