
From Grammar to N-grams


Presentation Transcript


  1. From Grammar to N-grams: Estimating N-grams From a Context-Free Grammar and Sparse Data. Thomas K Harris, May 16, 2002

  2. Motivation • Recognizers typically use n-grams. • Systems are typically defined by CFGs. • Data collection is difficult. • Goal: a language model that benefits from both the grammar and the prior probabilities of its parses.

  3. Other Approaches • Ignore data, use a language model derived from the grammar alone. • Ignore grammar, use a language model derived from the data alone. • Interpolate between these two models.

  4. PCFG Strategy • Train grammar with some data. • Smooth grammar. • Compute n-grams. (Diagram: CFG + data → PCFG → n-grams.)

  5. The Software • Work in progress - available at http://www.cs.cmu.edu/~tkharris/pcfg • Written in C++ • A library (API) consisting of a PCFG class and an n-gram class. • A program that uses the library to create n-grams from Phoenix grammars and data. • A make script to automate building and testing.

  6. Procedure • Read Phoenix grammar file. • Convert to Chomsky Normal Form. • Read data and train grammar. • Smooth the grammar. • Compute n-grams from the smoothed PCFG.

  7. Reading Phoenix Formats • Doesn’t handle #include directive. • Doesn’t handle +* (Kleene closure) marker. • Net – Rewrite distinction is ignored. • + and * markers are rewritten as rules. • Conversion to CNF permanently mangles rules.
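Rewriting the + and * markers as ordinary rules can be sketched as follows. This is only an illustration of the idea, not the library's actual reader: the rule representation, symbol names, and the `rewrite_plus` helper are all hypothetical.

```python
# Illustrative sketch: replace a Kleene-plus symbol X+ on a right-hand side
# with a fresh recursive nonterminal X_PLUS, so no special markers remain.
# Rules are (lhs, [rhs symbols]) pairs; all names here are made up.

def rewrite_plus(rules):
    out, added = [], set()
    for lhs, rhs in rules:
        new_rhs = []
        for sym in rhs:
            if sym.endswith("+"):
                base = sym[:-1]
                plus = base + "_PLUS"
                if plus not in added:
                    out.append((plus, [base]))        # X_PLUS -> X
                    out.append((plus, [base, plus]))  # X_PLUS -> X X_PLUS
                    added.add(plus)
                new_rhs.append(plus)
            else:
                new_rhs.append(sym)
        out.append((lhs, new_rhs))
    return out

# Example: "Movie+" becomes Movie_PLUS with two recursive rules.
print(rewrite_plus([("Request", ["show", "Movie+"])]))
```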

  8. Chomsky Normal Form • Remove ε-transitions. • Remove unit productions. • Change all rules A->βaγ of length >1 to A->βNγ and N->a. • Recursively shorten all rules A->βBC of length >2 to A->βN and N->BC.
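Two of the CNF steps listed above (lifting terminals out of long rules and binarizing rules longer than two symbols) can be sketched like this. It is a minimal sketch under stated assumptions: ε-removal and unit-production removal are omitted, terminals are assumed to be lowercase strings, and the function name is hypothetical.

```python
# Minimal CNF sketch: (a) in rules of length > 1, replace each terminal a
# with a fresh preterminal N -> a; (b) binarize rules of length > 2 from the
# right, A -> beta B C  =>  A -> beta N, N -> B C.  Assumes lowercase = terminal.

def to_cnf_core(rules):
    out, term_map, fresh = [], {}, 0
    for lhs, rhs in rules:
        if len(rhs) > 1:                       # step (a): lift terminals
            new_rhs = []
            for sym in rhs:
                if sym.islower():
                    if sym not in term_map:
                        term_map[sym] = "T_" + sym
                        out.append((term_map[sym], [sym]))   # N -> a
                    new_rhs.append(term_map[sym])
                else:
                    new_rhs.append(sym)
            rhs = new_rhs
        while len(rhs) > 2:                    # step (b): binarize
            fresh += 1
            new_nt = "X_%d" % fresh
            out.append((new_nt, rhs[-2:]))     # N -> B C
            rhs = rhs[:-2] + [new_nt]
        out.append((lhs, rhs))
    return out

print(to_cnf_core([("S", ["NP", "likes", "NP", "PP"])]))
```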

  9. Training • Initialize rule probabilities. • For each sentence, • Use CYK chart parser to compute inside and outside probabilities. • Use those probabilities to determine the expected number of times the rule is used in the sentence. • Use the expectations to get a new set of rule probabilities. • Repeat until the corpus likelihood appears to asymptote.
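The re-estimation step described above is the standard inside-outside (EM) update for a PCFG in Chomsky Normal Form. In the usual textbook notation (not necessarily the exact form used in the implementation), with inside probabilities β and outside probabilities α computed from the CYK chart for a sentence W = w_1 … w_m:

```latex
% Expected number of times rule A -> B C is used in sentence W.
\[
  \mathrm{count}_W(A \to B\,C) \;=\;
  \frac{1}{P(W \mid S)}
  \sum_{i \le k < j}
  \alpha_A(i,j)\; P(A \to B\,C)\; \beta_B(i,k)\; \beta_C(k{+}1,j)
\]
% Re-estimated rule probability: expected uses of the rule, normalized by the
% expected uses of any rule with the same left-hand side, summed over the corpus.
\[
  \hat{P}(A \to B\,C) \;=\;
  \frac{\sum_{W} \mathrm{count}_W(A \to B\,C)}
       {\sum_{W} \sum_{A \to \lambda} \mathrm{count}_W(A \to \lambda)}
\]
```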

  10. Smoothing • A user-specified probability mass can be redistributed over unseen rules. • At the bottom of the tree this generalizes a class-based model. • This only smoothes the trained grammar over other grammatical sentences.
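One simple way to realize the redistribution described above is sketched below, assuming the user-specified mass is taken proportionally from each nonterminal's trained rules and spread evenly over that nonterminal's unseen rules. The data layout and function name are hypothetical, not the library's API.

```python
# Sketch: for each left-hand side, scale trained probabilities down by
# (1 - mass) and divide `mass` evenly among its zero-probability rules.

def smooth(rule_probs, mass=0.05):
    """rule_probs: {lhs: {rhs_tuple: prob}}; returns a smoothed copy."""
    smoothed = {}
    for lhs, dist in rule_probs.items():
        unseen = [r for r, p in dist.items() if p == 0.0]
        if not unseen:                      # nothing to redistribute to
            smoothed[lhs] = dict(dist)
            continue
        share = mass / len(unseen)
        smoothed[lhs] = {r: (share if p == 0.0 else p * (1.0 - mass))
                         for r, p in dist.items()}
    return smoothed

# Example: one trained rule and one unseen rule under the same nonterminal.
print(smooth({"NP": {("Det", "N"): 1.0, ("N",): 0.0}}, mass=0.1))
```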

  11. Precise N-grams • Precise n-grams can be computed from a PCFG. • P(w_n | w_1 … w_{n-1}) = E(w_1 … w_n | S) / E(w_1 … w_{n-1} | S), where E(· | S) is the expected number of times the word sequence occurs in a sentence generated from the start symbol S.
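The talk computes these expectations exactly (see the divide-and-conquer slide below). As a cheap sanity check of the relationship, the expectations can also be approximated by sampling sentences from a PCFG and counting n-grams; the toy grammar and helper names below are invented for illustration and are not the author's method.

```python
import random
from collections import Counter

GRAMMAR = {                   # toy PCFG: lhs -> [(prob, rhs)], terminals lowercase
    "S":  [(1.0, ["NP", "VP"])],
    "NP": [(0.7, ["i"]), (0.3, ["they"])],
    "VP": [(0.6, ["V", "NP"]), (0.4, ["sleep"])],
    "V":  [(1.0, ["like"])],
}

def sample(symbol):
    """Expand a symbol into a list of words by sampling rules."""
    if symbol not in GRAMMAR:                       # terminal symbol
        return [symbol]
    r, acc = random.random(), 0.0
    for prob, rhs in GRAMMAR[symbol]:
        acc += prob
        if r <= acc:
            return [w for s in rhs for w in sample(s)]
    return [w for s in GRAMMAR[symbol][-1][1] for w in sample(s)]

def expected_counts(n, samples=20000):
    """Monte Carlo estimate of E(w_1...w_n | S) for every n-gram."""
    counts = Counter()
    for _ in range(samples):
        words = sample("S")
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return {g: c / samples for g, c in counts.items()}

uni, bi = expected_counts(1), expected_counts(2)
print(bi[("i", "like")] / uni[("i",)])              # approximates P(like | i)
```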

  12. Divide and Conquer (Diagram: parse trees rooted at S, with subtrees headed by nonterminals A and B spanning parts of w_1 … w_n.)

  13. Data • USI MovieLine oracle transcripts • 2,000 sentences • Used only parsable sentences (85%) • Divided into 60% training, 40% test

  14. Results

  15. Results

  16. Conclusions • Lower perplexity than the pure-grammar method, comparable perplexity to the pure-data method. • More flexible and cheaper than pure-data methods.

  17. Future Directions • More smoothing work needs to be done. • Different smoothing over different classes. • Other smoothing methods? • Trigrams. • Testing for word error rate improvements. • Adapting to modified grammars.
