
A new framework for Language Model Training



  1. A new framework for Language Model Training
  David Huggins-Daines
  January 19, 2006

  2. Overview
  • Current tools
  • Requirements for new framework
  • User Interface Examples
  • Design and API

  3. Current status of LM training
  • The CMU SLM toolkit
    • Efficient implementation of basic algorithms
    • Doesn't handle all the tasks of building an LM:
      • Text normalization
      • Vocabulary selection
      • Interpolation/adaptation
    • Requires an expert to "put the pieces together"
  • Lots of scripts
    • SimpleLM, Communicator, CALO, etc.
  • Other LM toolkits
    • SRILM, Lemur, others?

  4. Requirements
  • LM training should be:
    • Repeatable: an "end-to-end" rebuild should produce the same result
    • Configurable: it should be easy to change parameters and rebuild the entire model to see their effect
    • Flexible: should support many types of source texts and methods of training
    • Extensible: a modular structure should allow new methods and data sources to be easily implemented

  5. Tasks of building an LM
  • Normalize source texts (see the sketch below)
    • They come in many different formats!
    • The LM toolkit expects a stream of words
    • What is a "word"?
      • Compound words, acronyms
      • Non-lexemes (filler words, pauses, disfluencies)
    • What is a "sentence"?
      • Segmentation of the input data
  • Annotate source texts with class tags
  • Select a vocabulary
    • Determine the optimal vocabulary size
    • Collect words from the training texts
    • Define vocabulary classes
    • Vocabulary closure
  • Build a dictionary (pronunciation modeling)
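  To make the normalization task concrete, here is a minimal standalone sketch of one such pass: uppercase the text, tokenize on whitespace, and strip filler words before emitting a word stream. The framework's real rules live in its InputFilter classes (slide 12); the filler list and conventions below are assumptions for the example only.

  #!/usr/bin/perl
  # Illustrative sketch only: one possible normalization pass.
  # The filler-word list is made up for this example.
  use strict;
  use warnings;

  my %filler = map { $_ => 1 } qw(uh um uh-huh [noise] [laughter]);

  while (my $line = <STDIN>) {
      chomp $line;
      # Tokenize on whitespace, fold case, drop non-lexemes
      my @words = grep { !$filler{lc $_} } split ' ', uc $line;
      print "@words\n" if @words;
  }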

  6. Tasks, continued
  • Estimate N-Gram model(s)
    • Choose the appropriate smoothing parameters
    • Find the appropriate divisions of the training set
  • Interpolate N-Gram models (see the formulas below)
    • Use a held-out set representative of the test set
    • Find weights for the different models which maximize likelihood (minimize perplexity) on this domain
  • Evaluate the language model
    • Jointly minimize perplexity and OOV rate (they tend to move in opposite directions)
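  The slide stops at the verbal description, so for concreteness here is the standard linear-interpolation setup it describes. The EM weight update is the textbook method for this fit and is an assumption, not something stated on the slide. In LaTeX:

  P_{\text{interp}}(w \mid h) \;=\; \sum_i \lambda_i \, P_i(w \mid h),
  \qquad \sum_i \lambda_i = 1, \quad \lambda_i \ge 0

  \lambda_i^{(t+1)} \;=\; \frac{1}{N} \sum_{n=1}^{N}
      \frac{\lambda_i^{(t)} \, P_i(w_n \mid h_n)}
           {\sum_j \lambda_j^{(t)} \, P_j(w_n \mid h_n)}

  where (w_n, h_n), n = 1..N, are the word/history pairs of the held-out set. Each EM iteration cannot decrease held-out likelihood, so perplexity on the held-out set falls monotonically as the weights are re-estimated.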

  7. A Simple Switchboard Example

  <!-- Top-level tag: there must be only one -->
  <NGramModel>
    <!-- A set of transcripts -->
    <Transcripts name="swb.files">
      <!-- The input filter to use -->
      <InputFilter::SWB>
        <!-- A list of files -->
        <Transcripts list="swb.files"/>
      </InputFilter::SWB>
    </Transcripts>
    <!-- cutoff="1" excludes singletons -->
    <Vocabulary cutoff="1">
      <!-- Backreference to the named object above -->
      <Transcripts name="swb.files"/>
    </Vocabulary>
  </NGramModel>

  8. A More Complicated Example (interpolation of ICSI and Switchboard)

  <NGramModel name="interp.test">
    <!-- Files can be listed directly in element contents -->
    <Transcripts name="swb.test">
      swb.test.lsn
    </Transcripts>
    <Transcripts name="icsi.test">
      <InputFilter::ICSI>
        icsi.test.mrt
      </InputFilter::ICSI>
    </Transcripts>
    <!-- Vocabularies can be nested (merged) -->
    <Vocabulary name="icsi.swb1">
      <Vocabulary cutoff="1">
        <Transcripts name="swb.test"/>
      </Vocabulary>
      <Vocabulary>
        <Transcripts name="icsi.test"/>
      </Vocabulary>
      <!-- Words can be listed directly in element contents -->
      BRAZIL
    </Vocabulary>
    <NGramModel name="swb.test">
      <Transcripts name="swb.test"/>
      <Vocabulary name="icsi.swb1"/>
    </NGramModel>
    <NGramModel name="icsi.test">
      <Transcripts name="icsi.test"/>
      <Vocabulary name="icsi.swb1"/>
    </NGramModel>
    <Interpolation>
      <!-- Held-out set for interpolation -->
      <InputFilter::CMU>
        cmu.test.trs
      </InputFilter::CMU>
      <!-- Interpolate the previously named LMs -->
      <NGramModel name="swb.test"/>
      <NGramModel name="icsi.test"/>
    </Interpolation>
  </NGramModel>

  9. Command-line Interface
  • lm_train: "runs" an XML configuration file (see the example below)
  • build_vocab: build vocabularies, normalize transcripts
  • ngram_train: train individual N-Gram models
  • ngram_test: evaluate N-Gram models
  • ngram_interpolate: interpolate and combine N-Gram models
  • ngram_pronounce: build a pronunciation lexicon from a language model or vocabulary
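  The slides don't show the exact invocation syntax, so this is an assumption: if lm_train takes the configuration file as its argument, running the Switchboard example from slide 7 (saved as a hypothetical swb.xml) would look like

  lm_train swb.xml

  with the remaining tools available for running the individual stages by hand.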

  10. Programming Interface
  • NGramFactory: builds an NGramModel from an XML specification (as seen previously; sketch below)
  • NGramModel: trains a single N-Gram LM from some transcripts
  • Vocabulary: builds a vocabulary from transcripts or other vocabularies
  • InputFilter: reads transcripts in some format and outputs a word stream
    • Subclassed into InputFilter::CMU, InputFilter::ICSI, InputFilter::HUB5, InputFilter::ISL, etc.
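  A hypothetical sketch of driving these classes from Perl. The slides name only the classes, so the constructor and method names below are guesses for illustration, not the real API:

  use strict;
  use NGramFactory;

  # Build an NGramModel from its XML specification (method names assumed)
  my $factory = NGramFactory->new();
  my $lm      = $factory->create('swb.xml');

  # Estimate the model from the transcripts named in the specification
  $lm->train();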

  11. Design in Plain English
  • NGramFactory builds an NGramModel
  • NGramModel has a Vocabulary
  • NGramModel and Vocabulary can have Transcripts
  • NGramModel and Vocabulary use an InputFilter (or maybe they don't)
  • NGramModel can merge two other NGramModels using a set of Transcripts
  • Vocabulary can merge another Vocabulary

  12. A very simple InputFilter (InputFilter/Simple.pm)

  use strict;                      # just good practice

  # Subclass of InputFilter
  package InputFilter::Simple;
  require InputFilter;
  use base 'InputFilter';

  sub process_transcript {
      my ($self, $file) = @_;
      local ($_, *FILE);
      # Read the input file
      open FILE, "<$file" or die "Failed to open $file: $!";
      while (<FILE>) {
          # Tokenize, normalize, etc.
          chomp;
          my @words = split;
          # Pass each sentence to output_sentence()
          $self->output_sentence(\@words);
      }
  }
  1;
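  Following the pattern of the XML examples on slides 7 and 8, where element names such as InputFilter::SWB select the corresponding filter package, the new filter could then be plugged into a configuration like this (the transcript file name is made up):

  <Transcripts name="my.files">
    <InputFilter::Simple>
      my.transcripts.txt
    </InputFilter::Simple>
  </Transcripts>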

  13. Where to get it
  • Currently in CVS on fife.speech
    • :ext:fife.speech.cs.cmu.edu:/home/CVS
    • module LMTraining
  • Future: CPAN and cmusphinx.org
  • Possibly integrated with the CMU SLM toolkit in the future

  14. Stuff TODO
  • Class LM support
    • Communicator-style class tags are recognized and supported
    • NGramModel will build .lmctl and .probdef files
    • However, this requires normalizing the files to a transcript first, then running the semi-automatic Communicator tagger
    • Automatic tagging would be nice…
  • Support for languages other than English
    • Text normalization conventions
    • Word segmentation (for Asian languages)
    • Character set support (case conversions, etc.)
    • Unicode (also a CMU-SLM problem)

  15. Questions?
