SRILM - The SRI Language Modeling Toolkit



  1. SRILM - The SRI Language Modeling Toolkit 2008. 10. 22. Presented by Yeon JongHeum Intelligent Database Systems Laboratory, SNU

  2. Contents • Environment • Download • Compile • Making Corpus • Execution • Result

  3. Environment • Hardware • IBM ThinkPad T41 • Intel(R) Pentium(R) M processor 1600MHz • 1GiB DDR RAM • OS • Ubuntu Linux 8.04

  4. Download • http://www.speech.sri.com/projects/srilm/download.html (SRILM) • http://clab.snu.ac.kr/class/cl-nlp0802/lecture/euc_txt.zip (sample corpus)
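
SRILM itself is distributed through a license form on the download page above, so it is fetched through a browser; the sample corpus, however, can be pulled directly. A minimal sketch (the unzip step assumes a plain zip archive):

  wget http://clab.snu.ac.kr/class/cl-nlp0802/lecture/euc_txt.zip   # morphologically analyzed sample corpus
  unzip euc_txt.zip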

  5. Compile • These instructions assume Ubuntu; the commands may differ slightly between Linux distributions • Install the required packages, such as csh, tcl, gcc, g++, and gawk • sudo aptitude install csh tcl tcl-dev build-essential gawk

  6. Compile (cont’d) • Extract the downloaded SRILM archive • tar xvfz srilm.tgz

  7. Compile (cont’d) Add write permission to the extracted files (see the sketch below).
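
The original slide shows this step only as a screenshot; a minimal sketch, assuming the current directory is the one the archive was unpacked into:

  chmod -R u+w .   # recursively add the owner's write permission to the unpacked files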

  8. Compile (cont’d) Set the SRILM variable in the top-level Makefile (see the sketch below).
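
This slide is also a screenshot in the original. The top-level Makefile has an SRILM variable that must point at the directory the toolkit was unpacked into; a sketch, with /home/user/srilm as a placeholder path:

  # in the top-level Makefile, change the SRILM line to your own install directory:
  SRILM = /home/user/srilm   # placeholder; use the directory srilm.tgz was extracted into

Alternatively, the variable can be overridden at build time, e.g. make World SRILM=$PWD.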

  9. Compile (cont’d) • Adjust CC, CXX, TCL_INCLUDE, etc. in common/Makefile.machine.ARCH • ARCH is the platform SRILM runs on; find it by running sbin/machine-type (see the sketch below)
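
A sketch of the lookup; the i686 output is only a guess for this 32-bit Pentium M machine, and the variable values are illustrative, not verified settings:

  $ sbin/machine-type
  i686
  # then edit common/Makefile.machine.i686, for example:
  #   CC = gcc
  #   CXX = g++
  #   TCL_INCLUDE = -I/usr/include/tcl8.4   # example path for Ubuntu's tcl-dev package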

  10. Compile (cont’d) • Compile with the make World command (a smoke-test sketch follows below).
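
After a successful build the binaries land in bin/ARCH; a minimal smoke test, assuming the i686 architecture from the previous step:

  make World                        # compile the whole toolkit
  export PATH=$PWD/bin/i686:$PATH   # i686 is an assumption; use the sbin/machine-type output
  ngram-count -version              # should print version information if the build succeeded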

  11. Corpus • Convert the encoding of the morphologically analyzed files from euc-kr to utf-8 • Collect the morphemes from the converted files into one large file • Split that file into a training set and a test set • Each line holds one sentence, with the morphemes separated by spaces • See http://ids.snu.ac.kr/wiki/SRILM for the scripts (a rough sketch follows below)
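
A rough sketch of these steps with hypothetical file names (the real scripts, including the morpheme extraction, are on the wiki page above; the split point is arbitrary):

  # re-encode every analyzed file from euc-kr to utf-8 and merge into one large file
  for f in euc_txt/*.txt; do iconv -f euc-kr -t utf-8 "$f"; done > morCorpus.txt
  # split into training and test sets
  head -n 800000  morCorpus.txt > train_morCorpus.txt
  tail -n +800001 morCorpus.txt > testCorpus.txt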

  12. Corpus - Example
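
The example in the original is a screenshot; schematically, the corpus is one sentence per line with space-separated morphemes (the tokens below are placeholders, not real corpus lines):

  morpheme1 morpheme2 morpheme3 morpheme4 morpheme5
  morpheme1 morpheme2 morpheme3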

  13. Execution • Pipeline: Training Set → ngram-count → Language Model; Language Model + Test Set → ngram → Perplexity

  14. ngram-count • Command ngram-count -text train_morCorpus.txt -lm lm_default.txt • Defaults: trigram order, Good-Turing discounting, Katz backoff • -text : the corpus to read • -lm : the output file for the language model

  15. Good-Turing Discounting • Command ngram-count -text train_morCorpus.txt -lm lm_gt_3_7.txt -order 3 -gt1min 3 -gt1max 7 -gt2min 3 -gt2max 7 -gt3min 3 -gt3max 7 • Parameters • -gtNmin count : minimum count for an N-gram to be included in the model • -gtNmax count : upper count threshold for Good-Turing discounting

  16. Format of the Language Model (e.g., lm_default.txt)

\data\
ngram 1=200989
ngram 2=2331224
ngram 3=1547582

\1-grams:
-6.522542 무조 -0.3102094
-4.676433 무조건 -0.2724784
-7.300586 무조소 -0.187773

\2-grams:
-0.3667601 군종 교구 -0.08042386
-1.530162 군종 교구장
-1.530162 군종 사목

The first number on each line is the N-gram's log probability (base 10); the trailing number, when present, is its log backoff weight (base 10).

  17. Ney’s absolute discounting • Command ngram-count -text train_morCorpus.txt -lm lm_absoulte0.5_3gram.txt -order 3 -cdiscount1 0.5 -cdiscount2 0.5 -cdiscount3 0.5 • Parameters • -order n : generate up to n-grams; if omitted, trigrams are generated • -cdiscountN value : value is the constant subtracted from N-gram counts

  18. Witten-Bell discounting • Command ngram-count -text train_morCorpus.txt -lm lm_witten_3gram.txt -order 3 -wbdiscount1 -wbdiscount2 -wbdiscount3

  19. Ristad's natural discounting • Command ngram-count -text train_morCorpus.txt -lm lm_nd_3gram.txt -order 3 -ndiscount1 -ndiscount2 -ndiscount3

  20. Chen and Goodman's modified Kneser-Ney discounting • Command ngram-count -text train_morCorpus.txt -lm lm_knd_5gram.txt -order 3 -kndiscount1 -kndiscount2 -kndiscount3

  21. Original Kneser-Ney discounting • Command ngram-count -text train_morCorpus.txt -lm lm_uknd_5gram.txt -order 3 -ukndiscount1 -ukndiscount2 -ukndiscount3

  22. Discounting with Interpolation • Original Kneser-Ney discounting + interpolation ngram-count -text train_morCorpus.txt -lm lm_uknd_inter_5gram.txt -order 3 -ukndiscount1 -ukndiscount2 -ukndiscount3 -interpolate1 -interpolate2 -interpolate3 • Parameter • -interpolateN • Only Witten-Bell, absolute discounting, and (original or modified) Kneser-Ney smoothing currently support interpolation

  23. Compute Perplexity • Command ngram -lm lm_default.txt -ppl testCorpus.txt • Parameters • -lm : the language model • -ppl : compute sentence scores (log probabilities) and perplexities from the sentences in the given text file • Result • file testCorpus.txt: 171154 sentences, 4829620 words, 26626 OOVs • 0 zeroprobs, logprob= -9.34268e+06 ppl= 75.5524 ppl1= 88.1413 • ppl counts end-of-sentence tokens as words; ppl1 excludes them from the word count
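
For closer inspection, ngram's -debug option raises the reporting detail; a sketch:

  ngram -lm lm_default.txt -ppl testCorpus.txt -debug 1   # also prints statistics for each sentence
  ngram -lm lm_default.txt -ppl testCorpus.txt -debug 2   # additionally prints each word's probability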

  24. Result
