
SRILM Based Language Model



  1. SRILM Based Language Model Name: Venkata Subramanyan Sundaresan Instructor: Dr. Veton Kepuska

  2. N-GRAM Concept • The idea of word prediction is formalized with a probabilistic model called the N-gram. • Statistical models of word sequences are also called language models, or LMs. • The idea of the N-gram model is to approximate the history by just the last few words.
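
As a concrete form of this approximation, an N-gram model conditions each word on only the last N-1 words of the history; for a trigram (N = 3):

P(w_n \mid w_1, \ldots, w_{n-1}) \approx P(w_n \mid w_{n-2}, w_{n-1})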

  3. CORPUS • Counting things in natural language is based on a corpus. • What is a corpus? • It is an online collection of text or speech. • There are two popular corpora: • Brown (a collection of 1 million words) • Switchboard (a collection of 2,430 telephone conversations)

  4. Perplexity • Perplexity is interpreted as the weighted average branching factor of a language. • The branching factor of a language is the number of possible next words that can follow any word. • Perplexity is the most common evaluation metric for N-gram language models. • An improvement in perplexity does not guarantee an improvement in speech recognition performance. • It is commonly used as a quick check of an algorithm.
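
For reference, the perplexity of a test set W = w_1 w_2 \ldots w_N is the inverse probability of the test set, normalized by the number of words:

PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N}

so a lower perplexity corresponds to a smaller average branching factor and a better model.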

  5. SMOOTHING • Smoothing is the process of flattening the probability distribution implied by a language model, so that all reasonable word sequences can occur with some probability.
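
A minimal sketch of such flattening is add-one (Laplace) smoothing for bigrams, shown here only as an illustration (the models built below use Good-Turing and absolute discounting instead); V is the vocabulary size:

P_{add-1}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}

Adding 1 to every count gives unseen bigrams a small nonzero probability at the expense of the seen ones.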

  6. Aspiration • To use the SRILM (Language Modeling) toolkit to build different language models. • The following are the language models: • Good-Turing Smoothing • Absolute Discounting

  7. Linux Environment in Windows • To implement a Linux environment on the Windows operating system we have to install "Cygwin". • This is open-source software and can be downloaded from www.cygwin.com. • Another main reason for installing Cygwin is that SRILM can be built and run on the Cygwin platform.

  8. Installation Procedure: "Cygwin" • Go to the webpage above. • Download the setup file. • Select "Install from Internet". • Give the required destination where Cygwin is to be installed. • There will be a list of mirror sites to download from. • Select one site and install all the packages.

  9. SRILM • Download the SRILM toolkit, srilm.tgz, from the following source: • http://www.speech.sri.com/projects/srilm/ • Run the terminal window of Cygwin. • SRILM is downloaded as a compressed archive. • Unpack the srilm file inside the Cygwin environment. • Unpacking can be done with the following command: tar zxvf srilm.tgz

  10. SRILM Installation • Once the download is completed, we have to edit the Makefile in the unpacked SRILM folder (a sketch of this edit follows). • Once the editing is done, we run the following in the Cygwin terminal to build SRILM: $ make World
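
A sketch of that edit, assuming the tarball was unpacked to a directory named srilm (the exact path and variable layout can differ between SRILM versions; the path shown is only an example):

# In the top-level Makefile of the unpacked srilm directory,
# point SRILM at the install root, e.g.:
# SRILM = /home/user/srilm

$ cd srilm
$ make World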

  11. Functions of SRILM • Generate N-gram counts from a corpus. • Train a language model based on the N-gram count file. • Use the trained language model to calculate the perplexity of test data. • (A minimal end-to-end sketch follows.)
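
These three functions correspond to the two main SRILM tools, ngram-count and ngram. A minimal sketch of the whole pipeline (file names are examples, and test.txt is assumed to be a held-out test file):

# 1. Generate a trigram count file from the corpus
$ ngram-count -text corpus.txt -order 3 -write count.txt
# 2. Train a language model from the count file
$ ngram-count -read count.txt -order 3 -lm lm.txt
# 3. Calculate the perplexity of the test data
$ ngram -lm lm.txt -order 3 -ppl test.txt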

  12. Lexicon • A lexicon is a container of words belonging to the same language. • Reference: Wikipedia

  13. Lexicon Generation • Use the "wordtokenization.pl" file to generate the lexicon for our requirement. • Generate the lexicon using the following commands: • cat train/en_*.txt > corpus.txt • perl wordtokenization.pl < corpus.txt | sort | uniq > lexicon.txt
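
If wordtokenization.pl is not at hand, a rough shell equivalent of this step (assuming the script simply emits one word per line) is:

$ tr -sc 'A-Za-z' '\n' < corpus.txt | sort | uniq > lexicon.txt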

  14. Count File • Generate the 3-gram count file using the following command: • $ ./ngram-count -vocab lexicon.txt -text corpus.txt -order 3 -write count.txt -unk

  15. Good-Turing Language Model $ ./ngram-count -read project/count.txt -order 3 -lm project/gtlm.txt -gt1min 1 -gt1max 3 -gt2min 1 -gt2max 3 -gt3min 1 -gt3max 3 • This command has to be typed at the prompt of the terminal window. • -lm lmfile: Estimate a backoff N-gram model from the total counts, and write it to lmfile.
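
Once gtlm.txt has been written, the model can be evaluated with the ngram tool (test.txt here is an assumed held-out test file, not part of the training corpus):

$ ./ngram -lm project/gtlm.txt -order 3 -ppl test.txt

This reports the log-probability and perplexity of the test data, the quick check described on the Perplexity slide.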

  16. Absolute Discounting Language Model $ ./ngram-count -read project/count.txt -order 3 -lm adlm.txt -cdiscount1 0.5 -cdiscount2 0.5 -cdiscount3 0.5 • Here the order N can be anything between 1 and 9.
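
For context on what the -cdiscountN options do: absolute discounting subtracts a constant D (here 0.5) from every nonzero count at order N and redistributes the freed mass to the lower-order model. In its interpolated textbook form, roughly

P(w \mid h) = \frac{\max(C(hw) - D, 0)}{C(h)} + \lambda(h)\, P(w \mid h')

where h' is the history shortened by one word and \lambda(h) is chosen so the distribution sums to one; this is a sketch of the standard formulation, not SRILM's exact internal equations.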
