
Morphology & FSTs

Shallow Processing Techniques for NLP (Ling570)

October 17, 2011

Roadmap
  • Two-level morphology summary
  • Unsupervised morphology
Combining FST Lexicon & Rules
  • Two-level morphological system: ‘Cascade’
    • Transducer from Lexicon to Intermediate
    • Rule transducers from Intermediate to Surface
Integrating the Lexicon
  • Replace classes with stems
Using the E-insertion FST

  • (fox, fox): q0, q0, q0, q1, accept
  • (fox#, fox#): q0, q0, q0, q1, q0, accept
  • (fox^s#, foxes#): q0, q0, q0, q1, q2, q3, q4, q0, accept
  • (fox^s, foxs): q0, q0, q0, q1, q2, q5, reject
  • (fox^z#, foxz#) ?
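
To make the traces concrete, here is a minimal simulation sketch (Python; not part of the original slides). The arc table is only a partial reconstruction chosen to reproduce the four traces above, not the full two-level e-insertion rule transducer.

    # Minimal sketch: check a (lexical, surface) pair against an FST given as
    # arcs (state, lexical symbol, surface symbol) -> next state.
    # EPS is the empty string on one tape; OTHER stands for any ordinary letter.
    # The arc set is a partial reconstruction that reproduces the traces above.

    EPS = ""
    OTHER = "OTHER"
    SPECIALS = {"x", "s", "z", "^", "#", "e"}

    ARCS = {
        ("q0", OTHER, OTHER): "q0",
        ("q0", "x", "x"): "q1",      # likewise z and s in the full machine
        ("q1", "#", "#"): "q0",
        ("q1", "^", EPS): "q2",      # morpheme boundary deleted
        ("q2", EPS, "e"): "q3",      # e inserted on the surface
        ("q2", "s", "s"): "q5",      # no e inserted: leads to rejection
        ("q3", "s", "s"): "q4",
        ("q4", "#", "#"): "q0",
    }
    FINAL = {"q0", "q1"}

    def sym(c):
        return c if c in SPECIALS else OTHER

    def accepts(lexical, surface, state="q0", i=0, j=0):
        """True if some path through ARCS consumes both strings completely."""
        if i == len(lexical) and j == len(surface):
            return state in FINAL
        moves = []
        if i < len(lexical) and j < len(surface):
            a, b = lexical[i], surface[j]
            if a in SPECIALS or b in SPECIALS or a == b:   # ordinary letters must match
                moves.append((sym(a), sym(b), i + 1, j + 1))
        if i < len(lexical):                               # lexical symbol realised as nothing
            moves.append((sym(lexical[i]), EPS, i + 1, j))
        if j < len(surface):                               # surface symbol with no lexical source
            moves.append((EPS, sym(surface[j]), i, j + 1))
        for a, b, ni, nj in moves:
            nxt = ARCS.get((state, a, b))
            if nxt is not None and accepts(lexical, surface, nxt, ni, nj):
                return True
        return False

    for lex, sur in [("fox", "fox"), ("fox#", "fox#"),
                     ("fox^s#", "foxes#"), ("fox^s", "foxs")]:
        print(lex, sur, "accept" if accepts(lex, sur) else "reject")
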

Issues
  • What do you think of creating all the rules for a language by hand?
    • Time-consuming, complicated
  • Proposed approach:
    • Unsupervised morphology induction
  • Potentially useful for many applications
    • IR, MT
Unsupervised Morphology
  • Start from tokenized text (or word frequencies)
    • talk 60
    • talked 120
    • walked 40
    • walk 30
  • Treat as coding/compression problem
    • Find most compact representation of lexicon
      • Popular model MDL (Minimum Description Length)
        • Smallest total encoding:
          • Weighted combination of lexicon size & ‘rules’
Approach
  • Generate initial model:
    • Base set of words, compute MDL length
  • Iterate:
    • Generate a new set of words plus a model that yields a smaller description length
  • E.g. for talk, talked, walk, walked
    • 4 words
    • 2 words (talk, walk) + 1 affix (-ed) + combination info
    • 2 words (t,w) + 2 affixes (alk,-ed) + combination info
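
To make the comparison above concrete, here is a rough sketch (Python; the character-count cost is a toy stand-in for a real MDL code length, and the numbers only illustrate the idea):

    # Toy description length: characters needed to store every stem/affix once,
    # plus one pointer per unit used when spelling out each word.
    def description_length(units, analyses):
        lexicon_cost = sum(len(u) for u in units)        # store each unit once
        pointer_cost = sum(len(a) for a in analyses)     # one pointer per unit used
        return lexicon_cost + pointer_cost

    words = ["talk", "talked", "walk", "walked"]

    # Model 1: four whole words, no affixes
    m1 = description_length(words, [[w] for w in words])                  # 24

    # Model 2: stems talk/walk + affix -ed
    m2 = description_length(["talk", "walk", "ed"],
                            [["talk"], ["talk", "ed"],
                             ["walk"], ["walk", "ed"]])                    # 16

    # Model 3: over-segmented stems t/w + affixes alk/-ed
    m3 = description_length(["t", "w", "alk", "ed"],
                            [["t", "alk"], ["t", "alk", "ed"],
                             ["w", "alk"], ["w", "alk", "ed"]])            # 17

    print(m1, m2, m3)   # smallest total wins; here talk/walk + -ed is most compact
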
Successful Applications
  • Inducing word classes (e.g. N,V) by affix patterns
  • Unsupervised morphological analysis for MT
  • Word segmentation in CJK
  • Word text/sound segmentation in English
Formal Languages
  • Formal Languages and Grammars
    • Chomsky hierarchy
    • Languages and the grammars that accept/generate them
  • Equivalences
    • Regular languages
    • Regular grammars
    • Regular expressions
    • Finite State Automata
Finite-State Automata & Transducers
  • Finite-State Automata:
    • Deterministic & non-deterministic automata
      • Equivalence and conversion
    • Probabilistic & weighted FSAs
  • Packages and operations: Carmel
  • FSTs & regular relations
    • Closures and equivalences
    • Composition, inversion
FSA/FST Applications
  • Range of applications:
    • Parsing
    • Translation
    • Tokenization…
  • Morphology:
    • Lexicon: cat: N, +Sg; -s: Pl
    • Morphotactics: N+PL
    • Orthographic rules: fox + s → foxes
    • Parsing & Generation
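
As a toy illustration of the generation direction (lexicon + morphotactics + an orthographic rule), here is a sketch in Python; the helper names and the string-rewriting shortcut are assumptions, since a real system would compose FSTs instead:

    # Toy generation: lexical form -> intermediate form (stem ^ s #) ->
    # surface form via an e-insertion rewrite.  Not the slides' FST cascade,
    # just a string-level sketch of the same three layers.
    import re

    LEXICON = {"cat": "N", "fox": "N", "dog": "N"}   # hypothetical mini-lexicon

    def generate(stem, features):
        if LEXICON.get(stem) != "N":
            raise ValueError("unknown noun stem: " + stem)
        if features == "+N +Sg":
            return stem
        if features == "+N +Pl":
            intermediate = stem + "^s#"                                   # morphotactics
            surface = re.sub(r"([zsx])\^s#", r"\1es#", intermediate)      # e-insertion
            return surface.replace("^", "").replace("#", "")              # clean-up
        raise ValueError("unsupported feature string: " + features)

    print(generate("cat", "+N +Pl"))   # cats
    print(generate("fox", "+N +Pl"))   # foxes
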
Implementation
  • Tokenizers
  • FSA acceptors
  • FST acceptors/translators
  • Orthographic rule as FST
Roadmap
  • Motivation:
    • LM applications
  • N-grams
  • Training and Testing
  • Evaluation:
    • Perplexity
Predicting Words
  • Given a sequence of words, the next word is (somewhat) predictable:
    • I’d like to place a collect …..
  • N-gram models: predict the next word given the previous N-1 words
  • Language models (LMs):
    • Statistical models of word sequences
  • Approach:
    • Build model of word sequences from corpus
    • Given alternative sequences, select the most probable
N-gram LM Applications
  • Used in
    • Speech recognition
    • Spelling correction
    • Augmentative communication
    • Part-of-speech tagging
    • Machine translation
    • Information retrieval
Terminology
  • Corpus (pl. corpora):
    • Online collection of text or speech
      • E.g. Brown corpus: 1M words, balanced text collection
      • E.g. Switchboard: 240 hrs of speech; ~3M words
  • Wordform:
    • Fully inflected or derived form of a word: cats, glottalized
  • Word types: # of distinct words in corpus
  • Word tokens: total # of words in corpus
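
A minimal sketch (Python; the one-line corpus is made up) of the type/token distinction:

    # Word types vs. word tokens in a toy corpus.
    from collections import Counter

    corpus = "the cat sat on the mat and the dog sat too"
    tokens = corpus.split()          # running words
    types = Counter(tokens)          # distinct wordforms with their counts

    print("tokens:", len(tokens))    # 11 running words
    print("types:", len(types))      # 8 distinct wordforms
    print(types.most_common(2))      # [('the', 3), ('sat', 2)]
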
Corpus Counts
  • Estimate probabilities by counts in large collections of text/speech
  • Should we count:
    • Wordform vs lemma?
    • Case? Punctuation? Disfluency?
    • Type vs token?
Words, Counts and Prediction
  • They picnicked by the pool, then lay back on the grass and looked at the stars.
    • Word types (excluding punct): 14
    • Word tokens (excluding punct): 16
  • I do uh main- mainly business data processing
    • Utterance (spoken “sentence” equivalent)
    • What about:
      • Disfluencies
        • main-: fragment
        • uh: filler (aka filled pause)
    • Keep or drop depending on the application: disfluencies can help prediction, and uh vs um behave differently
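
A small sketch (Python; the tokenization and the filler list are simplifying assumptions) of the counts above and of stripping disfluencies when the application calls for it:

    # Type/token counts for the example sentence, plus optional removal of
    # disfluencies (filled pauses and word fragments) from an utterance.
    sentence = ("They picnicked by the pool , then lay back on the grass "
                "and looked at the stars .")
    words = [w for w in sentence.split() if w not in {",", "."}]   # drop punctuation
    print(len(set(words)), "types,", len(words), "tokens")          # 14 types, 16 tokens

    FILLERS = {"uh", "um"}

    def strip_disfluencies(utterance):
        """Drop filled pauses and word fragments (tokens ending in '-')."""
        return " ".join(tok for tok in utterance.split()
                        if tok.lower() not in FILLERS and not tok.endswith("-"))

    print(strip_disfluencies("I do uh main- mainly business data processing"))
    # -> I do mainly business data processing
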
LM Task
  • Training:
    • Given a corpus of text, learn probabilities of word sequences
  • Testing:
    • Given trained LM and new text, determine sequence probabilities, or
    • Select most probable sequence among alternatives
  • LM types:
    • Basic, Class-based, Structured
Word Prediction
  • Goal:
    • Given some history, what is probability of some next word?
    • Formally, P(w|h)
      • e.g. P(call|I’d like to place a collect)
  • How can we compute?
    • Relative frequency in a corpus
      • C(I’d like to place a collect call)/C(I’d like to place a collect)
  • Issues?
    • Zero counts: language is productive!
    • Joint probability of a word sequence of length N:
      • Estimate from the count of that sequence and the count of all sequences of length N
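
A minimal sketch (Python; the three-line corpus and function names are assumptions) of the relative-frequency estimate P(w|h) = C(h w) / C(h):

    # Estimate P(w | h) by relative frequency: count "h w" and divide by "h".
    corpus = ("i would like to place a collect call . "
              "i would like to place an order . "
              "i would like to place a collect call now .")
    tokens = corpus.split()

    def count_phrase(phrase):
        """How often `phrase` occurs as a contiguous subsequence of the corpus."""
        n = len(phrase)
        return sum(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

    def p_next(word, history):
        h = history.split()
        denom = count_phrase(h)
        return count_phrase(h + [word]) / denom if denom else 0.0

    print(p_next("call", "like to place a collect"))   # 2/2 = 1.0
    print(p_next("an", "like to place"))               # 1/3
    print(p_next("the", "like to place"))              # 0/3: a zero count, the sparsity problem
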
Word Sequence Probability
  • Notation:
    • P(Xi=the) written as P(the)
    • P(w1 w2 w3 … wn): the joint probability of the whole sequence
  • Compute the probability of a word sequence by the chain rule:
    • P(w1 w2 … wn) = P(w1) P(w2|w1) P(w3|w1 w2) … P(wn|w1 … wn-1)
    • Links to word prediction by history
  • Issues?
    • Potentially infinite history
    • Language infinitely productive
Markov Assumptions
  • Exact computation requires too much data
  • Approximate probability given all prior words
    • Assume finite history
    • Unigram: Probability of word in isolation (0th order)
    • Bigram: Probability of word given 1 previous
      • First-order Markov
    • Trigram: Probability of word given 2 previous
  • N-gram approximation

  • Bigram sequence: P(w1 w2 … wn) ≈ ∏k P(wk | wk-1)

Unigram Models
  • P(w1 w2 … wn) ≈ P(w1) * P(w2) * … * P(wn)
  • Training:
    • Estimate P(w) given corpus
    • Relative frequency: P(w) = C(w)/N, N=# tokens in corpus
    • How many parameters?
  • Testing: For sentence s, compute P(s)
  • Model with PFA:
    • Input symbols? Probabilities on arcs? States?
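
A minimal sketch of unigram training and scoring (Python; the toy corpus and helper names are assumptions):

    # Unigram LM: P(w) = C(w)/N; a sentence's probability is the product of its
    # word probabilities, i.e. no conditioning on history.
    from collections import Counter
    import math

    train = "i would like to place a collect call i would like to order a pizza".split()
    counts = Counter(train)
    N = len(train)

    def p_unigram(w):
        return counts[w] / N                      # unseen words get probability 0

    def logprob(sentence):
        """Sum of log P(w); -inf if any word is unseen (no smoothing here)."""
        total = 0.0
        for w in sentence.split():
            p = p_unigram(w)
            if p == 0.0:
                return float("-inf")
            total += math.log(p)
        return total

    print(len(counts), "parameters: one probability per vocabulary word")
    print(logprob("place a call"), logprob("call a place"))   # identical: word order is ignored
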
Bigram Models
  • P(w1 w2 … wn) = P(BOS w1 w2 … wn EOS)
    • ≈ P(BOS) * P(w1|BOS) * P(w2|w1) * … * P(wn|wn-1) * P(EOS|wn)
  • Training:
    • Relative frequency: P(wi|wi-1) = C(wi-1 wi) / C(wi-1)
    • How many parameters?
  • Testing: For sentence s, compute P(s)
  • Model with PFA:
    • Input symbols? Probabilities on arcs? States?
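
A minimal sketch of bigram training with BOS/EOS padding and relative-frequency estimates (Python; the toy corpus is an assumption):

    # Bigram LM: pad with BOS/EOS, count bigrams, and estimate
    # P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1}).
    from collections import Counter

    sentences = ["i would like to place a collect call",
                 "i would like to place an order",
                 "i can place a collect call"]

    history_counts, bigram_counts = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        history_counts.update(words[:-1])                 # C(w_{i-1})
        bigram_counts.update(zip(words[:-1], words[1:]))  # C(w_{i-1} w_i)

    def p_bigram(w, prev):
        return bigram_counts[(prev, w)] / history_counts[prev]

    def sentence_prob(sentence):
        words = ["<s>"] + sentence.split() + ["</s>"]
        prob = 1.0
        for prev, w in zip(words[:-1], words[1:]):
            prob *= p_bigram(w, prev)
        return prob

    print(p_bigram("would", "i"))                 # 2/3
    print(p_bigram("collect", "a"))               # 2/2 = 1.0
    print(sentence_prob("i can place an order"))  # product of its bigram probabilities
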
Trigram Models
  • P(w1 w2 … wn) = P(BOS w1 w2 … wn EOS)
    • ≈ P(BOS) * P(w1|BOS) * P(w2|BOS, w1) * … * P(wn|wn-2, wn-1) * P(EOS|wn-1, wn)
  • Training:
    • P(wi|wi-2, wi-1) = C(wi-2 wi-1 wi) / C(wi-2 wi-1)
    • How many parameters?
  • How many states?
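
The same idea one order up, as a brief sketch (Python; the doubled BOS marker and toy corpus are assumptions):

    # Trigram LM: P(w_i | w_{i-2}, w_{i-1}) = C(w_{i-2} w_{i-1} w_i) / C(w_{i-2} w_{i-1}).
    # Two BOS markers give the first real word a full two-word history.
    from collections import Counter

    sentences = ["i would like to place a collect call",
                 "i would like to place an order",
                 "i can place a collect call"]

    bigram_counts, trigram_counts = Counter(), Counter()
    for s in sentences:
        words = ["<s>", "<s>"] + s.split() + ["</s>"]
        bigram_counts.update(zip(words[:-1], words[1:]))
        trigram_counts.update(zip(words, words[1:], words[2:]))

    def p_trigram(w, u, v):
        """P(w | u v) by relative frequency."""
        return trigram_counts[(u, v, w)] / bigram_counts[(u, v)]

    print(p_trigram("a", "to", "place"))    # C(to place a) / C(to place) = 1/2
    print(p_trigram("an", "to", "place"))   # 1/2
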
An Example
  • <s> I am Sam </s>
  • <s> Sam I am </s>
  • <s> I do not like green eggs and ham </s>

Speech and Language Processing - Jurafsky and Martin
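
A short sketch (Python) computing the bigram estimates this mini-corpus yields, e.g. P(I|&lt;s&gt;) = 2/3 and P(Sam|&lt;s&gt;) = 1/3:

    # Bigram relative-frequency estimates from the three example sentences.
    from collections import Counter

    corpus = ["<s> I am Sam </s>",
              "<s> Sam I am </s>",
              "<s> I do not like green eggs and ham </s>"]

    history_counts, bigram_counts = Counter(), Counter()
    for line in corpus:
        words = line.split()
        history_counts.update(words[:-1])
        bigram_counts.update(zip(words[:-1], words[1:]))

    def p(w, prev):
        return bigram_counts[(prev, w)] / history_counts[prev]

    for prev, w in [("<s>", "I"), ("<s>", "Sam"), ("I", "am"), ("Sam", "</s>")]:
        print("P(%s | %s) = %.3f" % (w, prev, p(w, prev)))
    # P(I | <s>) = 0.667, P(Sam | <s>) = 0.333, P(am | I) = 0.667, P(</s> | Sam) = 0.500
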

Recap
  • Ngrams:
    • # FSA states: |V|^(n-1)
    • # Model parameters: |V|^n
  • Issues:
    • Data sparseness, Out-of-vocabulary elements (OOV)
      • → Smoothing
    • Mismatches between training & test data
    • Other Language Models
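
A quick arithmetic sketch (Python; the vocabulary size is an assumed round number) of why data sparseness bites as n grows:

    # Number of FSA states (|V|^(n-1)) and model parameters (|V|^n) for an
    # n-gram model over a vocabulary of |V| word types.
    V = 20_000   # assumed vocabulary size

    for n in (1, 2, 3):
        print("%d-gram: %16d states, %20d parameters" % (n, V ** (n - 1), V ** n))
    # A trigram model already has 8 * 10**12 parameters, so almost all of its
    # counts in any real corpus are zero -> smoothing is needed.
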