## Morphology & FSTs

Roadmap

- Two-level morphology summary
- Unsupervised morphology

Combining FST Lexicon & Rules

- Two-level morphological system: ‘Cascade’
- Transducer from Lexicon to Intermediate
- Rule transducers from Intermediate to Surface

Integrating the Lexicon

- Replace classes with stems

Using the E-insertion FST

(fox,fox): q0,q0,q0,q1, accept

(fox#,fox#): q0,q0,q0,q1,q0, accept

(fox^s#,foxes#): q0,q0,q0,q1,q2,q3,q4,q0, accept

(fox^s,foxs): q0,q0,q0,q1,q2,q5, reject

(fox^z#,foxz#) ?
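
The traces above can be replayed with a small simulator. This is a minimal sketch, not the slide's actual machine: the transition table is an assumption, filled in so that it reproduces the four traces listed, and the names `TRANSITIONS`, `run_fst`, and `identity_pairs` are hypothetical. Input is a list of (intermediate, surface) symbol pairs, with the empty string standing for epsilon.

```python
# Assumed transition table for an e-insertion transducer (states q0..q5).
# It is chosen to match the traces in the slide; other entries are guesses.
ACCEPTING = {0, 1, 2}
SIBILANTS = {"s", "x", "z"}

TRANSITIONS = {
    0: {"s": 1, "x": 1, "z": 1, "^:": 0, "#": 0, "other": 0},
    1: {"s": 1, "x": 1, "z": 1, "^:": 2, "#": 0, "other": 0},
    2: {"s": 5, "x": 1, "z": 1, ":e": 3, "#": 0, "other": 0},
    3: {"s": 4},
    4: {"#": 0},
    5: {"s": 1, "x": 1, "z": 1, "^:": 2, "other": 0},
}


def arc_label(upper, lower):
    """Map an (intermediate, surface) symbol pair to an arc label."""
    if (upper, lower) == ("^", ""):
        return "^:"          # morpheme boundary deleted on the surface
    if (upper, lower) == ("", "e"):
        return ":e"          # inserted e
    if upper == lower and upper in (SIBILANTS | {"#"}):
        return upper
    if upper == lower and upper:
        return "other"       # any other symbol mapped to itself
    return None              # pair not in the alphabet


def run_fst(pairs):
    """Return (state path, accepted?) for a sequence of symbol pairs."""
    state, path = 0, [0]
    for upper, lower in pairs:
        nxt = TRANSITIONS[state].get(arc_label(upper, lower))
        if nxt is None:
            return path, False   # no arc: reject
        state = nxt
        path.append(state)
    return path, state in ACCEPTING


def identity_pairs(text):
    """Pair each character with itself."""
    return [(c, c) for c in text]


foxes = identity_pairs("fox") + [("^", ""), ("", "e")] + identity_pairs("s#")
print(run_fst(foxes))
```

The pair sequence makes the alignment explicit, so the simulator never has to guess where the epsilon arcs fall.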

Issues

- What do you think of creating all the rules for a language by hand?
- Time-consuming, complicated
- Proposed approach:
- Unsupervised morphology induction
- Potentially useful for many applications
- IR, MT

Unsupervised Morphology

- Start from tokenized text (or word frequencies)
- talk 60
- talked 120
- walked 40
- walk 30
- Treat as coding/compression problem
- Find most compact representation of lexicon
- Popular model: MDL (Minimum Description Length)
- Smallest total encoding:
- Weighted combination of lexicon size & ‘rules’

Approach

- Generate initial model:
- Base set of words, compute MDL length
- Iterate:
- Generate a new set of words + some model to create a smaller description size
- E.g. for talk, talked, walk, walked
- 4 words
- 2 words (talk, walk) + 1 affix (-ed) + combination info
- 2 words (t,w) + 2 affixes (alk,-ed) + combination info
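
The three candidate analyses above can be compared with a toy description-length score. This is only a sketch under a strong assumption: the cost function (one unit per letter stored in the lexicon, plus one unit per morpheme pointer used to spell each word) is invented here for illustration and is far cruder than a real MDL encoding. The name `description_length` is hypothetical.

```python
def description_length(analyses):
    """Toy cost: letters in the morpheme inventory + pointers per word.

    `analyses` maps each word to the list of morphemes that spell it.
    """
    for word, parts in analyses.items():
        # Sanity check: the segmentation must rebuild the word.
        assert "".join(parts) == word
    morphemes = {m for parts in analyses.values() for m in parts}
    lexicon_cost = sum(len(m) for m in morphemes)
    pointer_cost = sum(len(parts) for parts in analyses.values())
    return lexicon_cost + pointer_cost


whole_words = {w: [w] for w in ["talk", "talked", "walk", "walked"]}
stem_plus_ed = {
    "talk": ["talk"], "talked": ["talk", "ed"],
    "walk": ["walk"], "walked": ["walk", "ed"],
}
over_split = {
    "talk": ["t", "alk"], "talked": ["t", "alk", "ed"],
    "walk": ["w", "alk"], "walked": ["w", "alk", "ed"],
}

for name, model in [("4 words", whole_words),
                    ("talk/walk + -ed", stem_plus_ed),
                    ("t/w + alk/-ed", over_split)]:
    print(name, description_length(model))
```

Under this toy cost the stem-plus-affix analysis is the most compact; the over-split analysis saves letters but pays for the extra combination information.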

Successful Applications

- Inducing word classes (e.g. N,V) by affix patterns
- Unsupervised morphological analysis for MT
- Word segmentation in CJK
- Word text/sound segmentation in English

Formal Languages

- Formal Languages and Grammars
- Chomsky hierarchy
- Languages and the grammars that accept/generate
- Equivalences
- Regular languages
- Regular grammars
- Regular expressions
- Finite State Automata

Finite-State Automata & Transducers

- Finite-State Automata:
- Deterministic & non-deterministic automata
- Equivalence and conversion
- Probabilistic & weighted FSAs
- Packages and operations: Carmel
- FSTs & regular relations
- Closures and equivalences
- Composition, inversion
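
Composition and inversion can be illustrated on finite relations. This is a sketch only: plain Python sets of string pairs stand in for regular relations (a real FST toolkit operates on the machines, not on enumerated pairs), and the lexical tags such as `fox+N+Pl` are hypothetical examples.

```python
def invert(relation):
    """Swap input and output sides of a relation."""
    return {(y, x) for x, y in relation}


def compose(r1, r2):
    """Relate x to z whenever r1 maps x to some y and r2 maps that y to z."""
    return {(x, z) for x, y in r1 for y2, z in r2 if y == y2}


# Lexicon transducer: lexical level to intermediate level (assumed tags).
lexical_to_intermediate = {("fox+N+Pl", "fox^s#"), ("cat+N+Pl", "cat^s#")}
# Rule transducer: intermediate level to surface level.
intermediate_to_surface = {("fox^s#", "foxes"), ("cat^s#", "cats")}

generate = compose(lexical_to_intermediate, intermediate_to_surface)
parse = invert(generate)  # the same relation, run in the other direction
print(sorted(generate))
print(sorted(parse))
```

Composing the cascade yields a single lexical-to-surface relation, and inverting it turns a generator into a parser.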

FSA/FST Applications

- Range of applications:
- Parsing
- Translation
- Tokenization…
- Morphology:
- Lexicon: cat: N, +Sg; -s: Pl
- Morphotactics: N+PL
- Orthographic rules: fox + s → foxes
- Parsing & Generation

Implementation

- Tokenizers
- FSA acceptors
- FST acceptors/translators
- Orthographic rule as FST

Roadmap

- Motivation:
- LM applications
- N-grams
- Training and Testing
- Evaluation:
- Perplexity

Predicting Words

- Given a sequence of words, the next word is (somewhat) predictable:
- I’d like to place a collect …..
- N-gram models: Predict next word given previous N-1 words
- Language models (LMs):
- Statistical models of word sequences
- Approach:
- Build model of word sequences from corpus
- Given alternative sequences, select the most probable

N-gram LM Applications

- Used in
- Speech recognition
- Spelling correction
- Augmentative communication
- Part-of-speech tagging
- Machine translation
- Information retrieval

Terminology

- Corpus (pl. corpora):
- Online collection of text or speech
- E.g. Brown corpus: 1M-word balanced text collection
- E.g. Switchboard: 240 hrs of speech; ~3M words
- Wordform:
- Full inflected or derived form of word: cats, glottalized
- Word types: # of distinct words in corpus
- Word tokens: total # of words in corpus

Corpus Counts

- Estimate probabilities by counts in large collections of text/speech
- Should we count:
- Wordform vs. lemma?
- Case? Punctuation? Disfluency?
- Type vs. token?

Words, Counts and Prediction

- They picnicked by the pool, then lay back on the grass and looked at the stars.
- Word types (excluding punct): 14
- Word tokens (excluding punct): 16
- I do uh main- mainly business data processing
- Utterance (spoken “sentence” equivalent)
- What about:
- Disfluencies
- main-: fragment
- uh: filler (aka filled pause)
- Keep, depending on app.: can help prediction; uh vs um
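
The type and token counts for the picnic sentence can be checked with a few lines of Python. This is a minimal sketch that assumes a token is any run of letters, so punctuation is dropped and case is kept as written.

```python
import re

sentence = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars.")

# Tokens: every running word, punctuation excluded.
tokens = re.findall(r"[A-Za-z]+", sentence)
# Types: the distinct wordforms among those tokens.
types = set(tokens)

print(len(tokens), "tokens;", len(types), "types")
```

The two counts differ only because "the" occurs three times.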

LM Task

- Training:
- Given a corpus of text, learn probabilities of word sequences
- Testing:
- Given trained LM and new text, determine sequence probabilities, or
- Select most probable sequence among alternatives
- LM types:
- Basic, Class-based, Structured

Word Prediction

- Goal:
- Given some history, what is probability of some next word?
- Formally, P(w|h)
- e.g. P(call|I’d like to place a collect)
- How can we compute?
- Relative frequency in a corpus
- C(I’d like to place a collect call)/C(I’d like to place a collect)
- Issues?
- Zero counts: language is productive!
- Joint word sequence probability of length N:
- Count of all sequences of length N & count of that sequence

Word Sequence Probability

- Notation:
- P(Xi=the) written as P(the)
- P(w1w2w3…wn) = P(w1)*P(w2|w1)*P(w3|w1w2)*…*P(wn|w1…wn-1)
- Compute probability of word sequence by chain rule
- Links to word prediction by history
- Issues?
- Potentially infinite history
- Language infinitely productive

Markov Assumptions

- Exact computation requires too much data
- Approximate probability given all prior words
- Assume finite history
- Unigram: Probability of word in isolation (0th order)
- Bigram: Probability of word given 1 previous
- First-order Markov
- Trigram: Probability of word given 2 previous
- N-gram approximation

Bigram sequence
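
The formula for this slide did not survive in the transcript. As a hedged reconstruction from the Markov assumption stated above (not necessarily the slide's exact notation), the bigram approximation replaces each full history in the chain rule with the single previous word, taking $w_0$ to be the beginning-of-sentence marker:

```latex
P(w_1 w_2 \dots w_n)
  = \prod_{k=1}^{n} P(w_k \mid w_1 \dots w_{k-1})
  \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})
```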

Unigram Models

- P(w1w2…wn) ~ P(w1)*P(w2)*…*P(wn)
- Training:
- Estimate P(w) given corpus
- Relative frequency: P(w) = C(w)/N, N=# tokens in corpus
- How many parameters?
- Testing: For sentence s, compute P(s)
- Model with PFA:
- Input symbols? Probabilities on arcs? States?
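
Training and testing a unigram model by relative frequency can be sketched as follows. This is an illustration under assumptions: the toy corpus reuses the lowercased picnic sentence from earlier in the deck, unseen words get probability zero (no smoothing), and the function names are hypothetical. Exact fractions are used so the arithmetic is easy to check.

```python
from collections import Counter
from fractions import Fraction


def train_unigram(tokens):
    """Relative frequency estimate: P(w) = C(w) / N."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: Fraction(c, total) for w, c in counts.items()}


def sentence_probability(model, words):
    """P(w1...wn) ~ P(w1) * ... * P(wn); unseen words contribute 0."""
    prob = Fraction(1)
    for w in words:
        prob *= model.get(w, Fraction(0))
    return prob


corpus = ("they picnicked by the pool then lay back on the grass "
          "and looked at the stars").split()
model = train_unigram(corpus)

print(model["the"])                                   # C(the) / N
print(sentence_probability(model, ["the", "stars"]))
print(len(model), "parameters, one per word type")
```

The model stores one probability per word type, which is the sense in which a unigram model has |V| parameters.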

Bigram Models

- P(w1w2…wn) = P(BOS w1w2…wn EOS)
- ~ P(BOS)*P(w1|BOS)*P(w2|w1)*…*P(wn|wn-1)*P(EOS|wn)
- Training:
- Relative frequency: P(wi|wi-1) = C(wi-1wi)/C(wi-1)
- How many parameters?
- Testing: For sentence s, compute P(s)
- Model with PFA:
- Input symbols? Probabilities on arcs? States?

Trigram Models

- P(w1w2…wn) = P(BOS w1w2…wn EOS)
- ~ P(BOS)*P(w1|BOS)*P(w2|BOS,w1)*…*P(wn|wn-2,wn-1)*P(EOS|wn-1,wn)
- Training:
- P(wi|wi-2,wi-1) = C(wi-2wi-1wi)/C(wi-2wi-1)
- How many parameters?
- How many states?

An Example

- <s> I am Sam </s>
- <s> Sam I am </s>
- <s> I do not like green eggs and ham </s>

Speech and Language Processing - Jurafsky and Martin
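
Bigram estimates for this three-sentence corpus can be computed by relative frequency, P(wi|wi-1) = C(wi-1wi)/C(wi-1). The sketch below is illustrative only: the function names are hypothetical, there is no smoothing, and `<s>` and `</s>` play the roles of BOS and EOS.

```python
from collections import Counter
from fractions import Fraction

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]


def train_bigram(sentences):
    """Return {(prev, cur): C(prev cur) / C(prev)} as exact fractions."""
    history_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        toks = sentence.split()
        for prev, cur in zip(toks, toks[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return {bg: Fraction(c, history_counts[bg[0]])
            for bg, c in bigram_counts.items()}


def sentence_probability(model, sentence):
    """Multiply bigram probabilities; an unseen bigram gives 0."""
    toks = sentence.split()
    prob = Fraction(1)
    for prev, cur in zip(toks, toks[1:]):
        prob *= model.get((prev, cur), Fraction(0))
    return prob


model = train_bigram(corpus)
print(model[("<s>", "I")], model[("I", "am")], model[("Sam", "</s>")])
print(sentence_probability(model, "<s> I am Sam </s>"))
```

Any sentence containing a bigram absent from the corpus scores zero here, which is the zero-count problem raised earlier.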

Recap

- Ngrams:
- # FSA states: |V|^(n-1)
- # Model parameters: |V|^n
- Issues:
- Data sparseness, Out-of-vocabulary elements (OOV)
- Smoothing
- Mismatches between training & test data
- Other Language Models
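
The growth of the state and parameter counts in the recap can be made concrete with a short calculation. This is a sketch: the function name is hypothetical, the counts are the upper bounds from the recap (one state per history of n-1 words, one parameter per n-word sequence), and the vocabulary size of 20,000 is an arbitrary illustration.

```python
def ngram_model_size(vocab_size, n):
    """Return (FSA states, model parameters) for an n-gram model."""
    states = vocab_size ** (n - 1)     # one state per (n-1)-word history
    parameters = vocab_size ** n       # one probability per n-gram
    return states, parameters


for n, name in [(1, "unigram"), (2, "bigram"), (3, "trigram")]:
    states, parameters = ngram_model_size(20_000, n)
    print(f"{name}: {states:,} states, {parameters:,} parameters")
```

Most of those n-grams never occur in any realistic training corpus, which is why data sparseness and smoothing head the list of issues.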
