
A new framework for Language Model Training

David Huggins-Daines

January 19, 2006

Overview

  • Current tools

  • Requirements for new framework

  • User Interface Examples

  • Design and API


Current status of LM training

  • The CMU SLM toolkit

    • Efficient implementation of basic algorithms

    • Doesn’t handle all the tasks of building an LM

      • Text normalization

      • Vocabulary selection

      • Interpolation/adaptation

    • Requires an expert to “put the pieces together”

  • Lots of scripts

    • SimpleLM, Communicator, CALO, etc.

  • Other LM toolkits

    • SRILM, Lemur, others?

Requirements for new framework

  • LM training should be

    • Repeatable

      • An “end-to-end” rebuild should produce the same result

    • Configurable

      • It should be easy to change parameters and rebuild the entire model to see their effect

    • Flexible

      • Should support many types of source texts and methods of training

    • Extensible

      • Modular structure to allow new methods and data sources to be easily implemented


Tasks of building an LM

  • Normalize source texts

    • They come in many different formats!

    • LM toolkit expects a stream of words

    • What is a “word”?

      • Compound words, acronyms

      • Non-lexemes (filler words, pauses, disfluencies)

    • What is a “sentence”?

      • Segmentation of input data

    • Annotate source texts with class tags

  • Select a vocabulary

    • Determine optimal vocabulary size

    • Collect words from training texts (see the count-cutoff sketch after this list)

    • Define vocabulary classes

    • Vocabulary closure

    • Build a dictionary (pronunciation modeling)
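
To make the vocabulary-selection step concrete, here is a minimal Perl sketch of count-cutoff selection (an illustration only, not the framework's code): count every word in a normalized word stream, then keep the words occurring more often than the cutoff, as with the cutoff="1" attribute in the XML examples later in this deck.

use strict;
use warnings;

# Count words in a normalized word stream (one sentence per line).
my $cutoff = 1;          # exclude singletons
my %count;
while (<>) {
    $count{$_}++ for split;
}

# Keep every word that occurs more often than the cutoff.
for my $word (sort keys %count) {
    print "$word\n" if $count{$word} > $cutoff;
}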


Tasks, continued

  • Estimate N-Gram model(s)

    • Choose the appropriate smoothing parameters

    • Find the appropriate divisions of the training set

  • Interpolate N-Gram models (a weight-estimation sketch follows this list)

    • Use a held-out set representative of the test set

    • Find weights for different models which maximize likelihood (minimize perplexity) on this domain

  • Evaluate language model

    • Jointly minimize perplexity and OOV rate

    • (they tend to move in opposite directions)
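
To make the weight-finding step concrete, here is a minimal Perl sketch of the standard EM re-estimation of a single interpolation weight (an illustration only, not the framework's code; it assumes each model's per-word probabilities on the held-out set have already been computed).

use strict;
use warnings;

# EM for the weight lambda of the mixture lambda*P1 + (1-lambda)*P2.
# @$p1 and @$p2 hold each model's probability for every word of the
# held-out set.
sub interpolation_weight {
    my ($p1, $p2) = @_;
    my $lambda = 0.5;
    for (1 .. 20) {                 # a few EM iterations suffice
        my $post = 0;
        for my $i (0 .. $#$p1) {
            my $joint = $lambda * $p1->[$i] + (1 - $lambda) * $p2->[$i];
            $post += $lambda * $p1->[$i] / $joint;   # posterior of model 1
        }
        $lambda = $post / @$p1;     # re-estimate toward maximum likelihood
    }
    return $lambda;
}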


A Simple Switchboard Example

<!-- Top-level tag: there must be only one.  (Reconstructed: the
     slide's callout labels appear here as comments, and the name of
     the filter attribute is an assumption.) -->
<NGramModel>

  <!-- A set of transcripts; filter= gives the input filter to use -->
  <Transcripts name="swb.files" filter="CMU">
    <!-- A list of files -->
    <Transcripts list="swb.files"/>
  </Transcripts>

  <!-- cutoff="1": exclude singletons -->
  <Vocabulary cutoff="1">
    <!-- Backreference to the named object above -->
    <Transcripts name="swb.files"/>
  </Vocabulary>

</NGramModel>
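
The slide does not show how this configuration is run, but since lm_train "runs" an XML configuration file (see the command-line slide below), the invocation is presumably along these lines (the argument syntax is an assumption):

lm_train swb.xml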


A More Complicated Example

<!-- Interpolation of ICSI and Switchboard.  (Reconstructed: the
     slide's callout labels appear here as comments; file and word
     lists elided on the slide are shown as "...") -->
<NGramModel name="interp.test">

  <!-- Held-out set for interpolation.  Files can be listed directly
       in element contents -->
  <Transcripts name="swb.test">
    ...
  </Transcripts>
  <Transcripts name="icsi.test">
    ...
  </Transcripts>

  <!-- Vocabularies can be nested (merged) -->
  <Vocabulary name="icsi.swb1">
    <Vocabulary cutoff="1">
      <Transcripts name="swb.test"/>
      <Transcripts name="icsi.test"/>
    </Vocabulary>
    <!-- Words can be listed directly in element contents -->
    ...
  </Vocabulary>

  <!-- The component models -->
  <NGramModel name="swb.test">
    <Transcripts name="swb.test"/>
    <Vocabulary name="icsi.swb1"/>
  </NGramModel>
  <NGramModel name="icsi.test">
    <Transcripts name="icsi.test"/>
    <Vocabulary name="icsi.swb1"/>
  </NGramModel>

  <!-- Interpolate the previously named LMs -->
  <NGramModel name="swb.test"/>
  <NGramModel name="icsi.test"/>

</NGramModel>
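
Note the pattern in both examples: each object is defined once under a name attribute and then pulled in elsewhere by a backreference to that name. This is what makes a configuration repeatable and configurable in the sense of the requirements above: change one parameter and the whole model can be rebuilt end-to-end from the same file.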


Command-line Interface

  • lm_train

    • “Runs” an XML configuration file

  • build_vocab

    • Build vocabularies, normalize transcripts

  • ngram_train

    • Train individual N-Gram models

  • ngram_test

    • Evaluate N-Gram models

  • ngram_interpolate

    • Interpolate and combine N-Gram models

  • ngram_pronounce

    • Build a pronunciation lexicon from a language model or vocabulary
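
As a sketch of how the individual tools compose into the pipeline from the "Tasks" slides (the flags and argument order are assumptions; only the tool names come from this deck):

build_vocab ...          # normalize transcripts, select a vocabulary
ngram_train ...          # estimate an N-Gram model over that vocabulary
ngram_interpolate ...    # combine models on a held-out set
ngram_test ...           # report perplexity and OOV rate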


Programming Interface

  • NGramFactory

    • Builds an NGramModel from an XML specification (as seen previously)

  • NGramModel

    • Trains a single N-Gram LM from some transcripts

  • Vocabulary

    • Builds a vocabulary from transcripts or other vocabularies

  • InputFilter

    • Subclassed into InputFilter::CMU, InputFilter::ICSI, InputFilter::HUB5, InputFilter::ISL, etc.

    • Reads transcripts in some format and outputs a word stream


Design in Plain English

  • NGramFactory builds an NGramModel

  • NGramModel has a Vocabulary

  • NGramModel and Vocabulary can have Transcripts

  • NGramModel and Vocabulary use an InputFilter (or maybe they don’t)

  • NGramModel can merge two other NGramModels using a set of Transcripts

  • Vocabulary can merge another Vocabulary
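
A hypothetical Perl sketch of how these classes might compose, following the plain-English description above (the method names new, parse, and train are assumptions, not the documented API):

use strict;
use warnings;
use NGramFactory;

# NGramFactory builds an NGramModel from an XML specification.
my $factory = NGramFactory->new();        # assumed constructor
my $lm = $factory->parse("swb.xml");      # assumed entry point

# The model has a Vocabulary and Transcripts behind it; training
# runs the whole build described by the configuration.
$lm->train();                             # assumed entry point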


A very simple InputFilter



use strict;                   # this is just good practice

package InputFilter::Simple;
require InputFilter;
use base 'InputFilter';       # subclass of InputFilter

sub process_transcript {
    my ($self, $file) = @_;
    local ($_, *FILE);

    # Read the input file
    open FILE, "<$file" or die "Failed to open $file: $!";
    while (<FILE>) {
        # Tokenize, normalize, etc.
        my @words = split;

        # Pass each sentence to this method (the slide's callout; the
        # method name output_sentence is an assumption)
        $self->output_sentence(@words);
    }
    close FILE;
}

1;


Where to get it

  • Currently in CVS on fife.speech

    • :ext:fife.speech.cs.cmu.edu:/home/CVS

    • module LMTraining

  • Future: CPAN and cmusphinx.org

  • Possibly integrated with the CMU SLM toolkit in the future
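
Given the CVS root and module named above, a standard checkout is:

cvs -d :ext:fife.speech.cs.cmu.edu:/home/CVS checkout LMTraining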


Stuff TODO

  • Class LM support

    • Communicator-style class tags are recognized and supported

    • NGramModel will build .lmctl and .probdef files

    • However, this requires normalizing the files to a transcript first, then running the semi-automatic Communicator tagger

    • Automatic tagging would be nice…

  • Support for languages other than English

    • Text normalization conventions

    • Word segmentation (for Asian languages)

    • Character set support (case conversions, etc.)

    • Unicode (also a CMU-SLM problem)


