

  1. >65536 Arthur Chan May 4, 2006

  2. What's so special about 65536? • 65536 = 2^16 • Did you know? • Sphinx III did not support language models with more than 65536 (2^16) words • The CMU-Cambridge LM Toolkit V2 is not happy with text containing more than 65536 unique words either • Though a single word's count could exceed 65536 (with -four_byte_counts) • (See the sketch below for the failure mode.)
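
A minimal C sketch of the failure mode: a 16-bit word ID silently wraps to zero once the vocabulary passes 65535 (the variable names here are illustrative, not from the Sphinx source):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint16_t wid = 65535;                 /* largest value a 16-bit word ID can hold */
    printf("wid = %u\n", (unsigned)wid);  /* prints 65535 */
    wid++;                                /* admit one more word to the vocabulary... */
    printf("wid = %u\n", (unsigned)wid);  /* prints 0: the ID silently wrapped around */
    return 0;
}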

  3. Why was 65536 the limit? • Both Sphinx III and the CMU-Cambridge LM Toolkit V2 were written in 1995-99 • A time when having 64 MB of RAM was extravagant • (Now 64 GB seems to be the number.) • (At that time, a Pentium 166 or 200 was hot) • Programmers therefore designed clever structures to deal with memory issues • In Sphinx, the DMP format was invented (word IDs: 16 bits) • In the CMU-Cambridge LM Toolkit, 16-bit data types were used for word IDs • (A back-of-the-envelope calculation follows.)
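
To see why those 16 bits mattered, here is a rough back-of-the-envelope calculation; the 8-byte entry size is an illustrative assumption, not the exact DMP layout. Suppose each bigram entry packs a 16-bit word ID plus three more 16-bit fields (probability index, back-off-weight index, trigram offset):

10,000,000 bigrams x  8 bytes/entry =  80 MB  (all fields 16-bit)
10,000,000 bigrams x 16 bytes/entry = 160 MB  (all fields 32-bit)

Doubling every field was unthinkable on a 64 MB machine, so 16-bit IDs were a sensible trade at the time.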

  4. This Talk (30-35 pages) • Describes our effort to break the 16-bit limit in • Sphinx 3 • CMU-Cambridge LM Toolkit V2 • Half a talk • Features not fully tested in real life • But the talk itself is quite long • Technical details of the changes • Sphinx III: the easy part (9 pages) • CMU-Cambridge LM Toolkit: the tough part (10 pages) • The Root of the Evil (11 pages) • Why does this problem exist? Why does it persist? • What if similar problems appear? How do we solve and avoid them?

  5. Disclaimer about the Speaker • Notorious for being negative about language modeling techniques • Symptom 1: Yells at others when his LM code has bugs • Symptom 2: Yells at others <period> • He should be forgiven because • His Master's thesis supervisor taught him that when he was young • Prof. Ronald Rosenfeld taught him the same • He also read Dr. Joshua Goodman's papers

  6. Terminology • "Probability" actually means • the estimate of the probability • "Back-off weight" means • when some n-gram is unseen in the training data • we back off to the (n-1)-gram "probability" times a weight (see the formula below) • According to Manning, a four-gram should really be a tetragram and a bigram a digram • Well, it's lucky it doesn't matter to us today
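
Concretely, for trigrams the back-off scheme reads (standard Katz-style back-off; the notation is mine, not the toolkit's):

P(w3 | w1 w2) = P*(w3 | w1 w2)             if the trigram (w1 w2 w3) was seen
              = bow(w1 w2) * P(w3 | w2)    otherwise

where P* is the discounted estimate and bow(w1 w2) is the back-off weight stored with the bigram (w1 w2).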

  7. LM Component of Sphinx III: The Easy Part

  8. What Sphinx 3.6 RC1 supports • ARPA LM • DMP LM • a memory-efficient version of the ARPA LM • can be run in disk mode as well • Class-based LM • Multiple LMs, with dynamic LM switching • lm_convert • (new in 3.6!) a conversion tool between ARPA and DMP LMs

  9. A note on the DMP format • A tree-like format • Bigrams are indexed by their prefix unigram • Trigrams are indexed by their prefix bigram • Bigram and trigram probabilities and back-off weights • are quantized to 4 decimal places, so you see the following statements in the code:

  10. Funny C statements in the Code
/* HACK!! to quantize probs to 4 decimal digits */
p = p3 * 10000;   /* p is an integer, so this truncates past 4 decimal digits */
p3 = p * 0.0001;  /* scale back: p3 now carries at most 4 decimal digits */
If you delete this, the LM will be larger, because without quantization fewer probability values coincide and can be shared.

  11. Reasons why Sphinx III only supports fewer than 65536 words • 16-bit data structures for • bigrams • trigrams • the cache structure • (A sketch of such a structure follows.)
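
A sketch in C of what a 16-bit bigram entry in a DMP-style tree layout looks like; the field names and exact widths are illustrative, the real structures live in Sphinx's lm.h:

#include <stdint.h>

typedef struct {
    uint16_t wid;       /* word ID of the second word: the 16-bit culprit */
    uint16_t prob_id;   /* index into a shared table of quantized probabilities */
    uint16_t bow_id;    /* index into a shared table of back-off weights */
    uint16_t first_tg;  /* offset of this bigram's first trigram */
} bigram16_t;

/* Each unigram stores the index of its first bigram, so the bigrams of
 * unigram u occupy the slice [first_bg[u], first_bg[u+1]).  Widening wid
 * to 32 bits is what the DMP32 layout (slide 14) amounts to. */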

  12. A Bogus Reason why Sphinx III doesn't support more than 65536 words • A very bad misconception • "Decoding is constrained by the dictionary" • WRONG • In both flat- and tree-lexicon search, only LM words are traversed • RIGHT • Generally, decoding is constrained by the intersection of the LM words and the dictionary words

  13. Several Proposed Surgical Procedures • 1. Rewrite the whole LM routine • Oops! That takes too much time, • and the old routine is very memory-efficient • 2. Replace the old LM by just switching the type of the data structure • Problem: all the binary LMs we generated have the old layout • We would lose backward compatibility very badly

  14. Final Solution • lm now supports two data structures: 16-bit and 32-bit • lm_convert and decode support two types of binary LMs • DMP, which has the 16-bit layout • DMP32, which has the 32-bit layout • A magic version number decides which layout to use (sketched below) • Regression tests ensure no bad code check-ins • Which format is in use is hidden from • anyone calling the lm routines (with a few exceptions)
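
A minimal sketch of dispatching on a magic version number at load time; the constants and the function name are illustrative, not the actual Sphinx III symbols:

#include <stdio.h>
#include <stdint.h>

/* Illustrative magic values only; the real DMP header differs. */
#define DMP_MAGIC_16BIT 1
#define DMP_MAGIC_32BIT 2

/* Read the leading version word and report which layout the file uses. */
int dmp_layout(FILE *fp)
{
    int32_t version;
    if (fread(&version, sizeof version, 1, fp) != 1)
        return -1;                    /* truncated file */
    switch (version) {
    case DMP_MAGIC_16BIT: return 16;  /* legacy DMP: 16-bit word IDs */
    case DMP_MAGIC_32BIT: return 32;  /* DMP32: 32-bit word IDs */
    default:              return -1;  /* unknown layout */
    }
}

The loader can then route to the 16-bit or 32-bit reader, so old binary LMs keep working unchanged.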

  15. Partial Verification of the Code • The 16-bit and 32-bit code produce exactly the same decoding results for • decode • decode_anytopo • (allphone's trigram could probably be left untested) • A faked LM with more than 65536 words can be used and run in decode

  16. Current Practical Limit • The lm data structure in lm.h • theoretically supports LMs with • fewer than 4 billion (2^32) unigrams • fewer than 4 billion bigrams • fewer than 4 billion trigrams • What if we have an n-gram count larger than 4 billion? • Answer: we are dead people • Further answer: it is easily fixable • Other data structures in Sphinx 3? • hash.c doesn't return primes larger than 900001 • Further answer: that is easily fixable as well

  17. Conclusion • Technically • Sphinx III's 32-bit mode is not that difficult to take care of • The problem was confined to one data structure • thanks to the modular design of Sphinx III • Pretty easy to solve • On Sphinx III's decision to use a binary format • If I were Ravi, I would have done the same • Much faster loading time for large models

  18. CMU-Cambridge LM Toolkit V2: The Tough Part

  19. CMU-Cambridge LM Toolkit Version 2 • What the toolkit supports • LM training • parameter estimation with back-off weight computation • Supports both • LMs in ARPA format • LMs in BINLM format • BINLM is not the same as the DMP format • binlm2arpa can translate BINLM to ARPA

  20. Purpose of the toolkit • Training LMs for • speech recognition • statistical machine translation • document classification • handwriting recognition • A note: • occasionally, speech recognition really isn't everything

  21. Standard Procedure of Training • In V2's time, • David Huggins-Daines wasn't at CMU • Training is separated into 4 stages (a sample invocation follows) • text2wfreq: • build the word frequency table • wfreq2vocab: • pick the vocabulary we need (smaller than the frequency table) • text2idngram: • convert the text into a stream of n-grams and their counts (idngram) • the n-gram word IDs are alphabetically sorted • idngram2lm: • gather the counts, compute the discounted estimates and the back-off weights
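
For concreteness, a typical run of the four stages looks roughly like this; the flags follow the toolkit's documented usage, but treat the exact options as an assumption and check the manual:

cat corpus.text | text2wfreq | wfreq2vocab -top 20000 > corpus.vocab
cat corpus.text | text2idngram -vocab corpus.vocab > corpus.idngram
idngram2lm -idngram corpus.idngram -vocab corpus.vocab -arpa corpus.arpa

The intermediate idngram file is what makes the pipeline disk-hungry (see slide 28).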

  22. Reasons why V2 doesn't support 65536 words • There is one single file that typedefs many of the data types • but those typedefs are not used very often • Most variables do not use the typedefs • Many of them are declared directly as • unsigned short • int

  23. Another Issue... • What if we have more than 4 billion n-grams this time? • e.g. if n > 5 • Not forgivable in LM training because • the MT people already have this problem (their unigram count is 5 million)

  24. Strategy • Spent 90% of the time making sure the data types were declared correctly • Gave up on handling both the 16-bit and 32-bit binary layouts together • A compile-time switch (THIRTYTWOBITS) is provided instead (sketched below) • Reasonable, because users seldom used BINLM anyway • Users need the DMP format for Sphinx III • The tool chain is now complete • The number of n-grams is a 64-bit number
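
A sketch of what the compile-time switch amounts to; the type names are hypothetical, the toolkit's own headers use their own names:

#include <stdint.h>

#ifdef THIRTYTWOBITS
typedef uint32_t wordid_t;    /* up to ~4 billion distinct words */
typedef uint64_t ngram_cnt_t; /* the number of n-grams is a 64-bit quantity */
#else
typedef uint16_t wordid_t;    /* V2 default: wraps past 65535 */
typedef uint32_t ngram_cnt_t;
#endif

Building with -DTHIRTYTWOBITS then selects the wide layout everywhere the typedefs are used.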

  25. What we support now • One can train an LM with more than 65536 words • text2wfreq, wfreq2vocab, text2idngram, and idngram2lm are fixed • One can convert an LM • binlm2arpa and ngram2mgram are fixed • One can compute the perplexity of an LM and some statistics of the text • evallm and idngram2stats are fixed

  26. Other Evils of Detail • V2's hash table used a very bad hash function • many collisions • a legacy from the pre-90s • One could take a 4-hour nap while the word list loaded when training a 500k-word model • After switching to Dan J. Bernstein's hash function (quoted below), • the load time is acceptable (<1 min) • The binary layout was one of the most time-consuming parts of development
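
Bernstein's hash (widely known as djb2) is short enough to quote in full; this is the commonly circulated version, which may differ in detail from what was actually committed to the toolkit:

/* djb2: hash = hash * 33 + c, seeded with 5381 */
unsigned long djb2_hash(const unsigned char *str)
{
    unsigned long hash = 5381;
    int c;
    while ((c = *str++) != 0)
        hash = ((hash << 5) + hash) + (unsigned long)c;  /* hash * 33 + c */
    return hash;
}

Its multiply-by-33 mixing spreads similar strings across buckets far better than the old function, which is consistent with the 4-hour load dropping to under a minute.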

  27. Verification • The 16-bit and 32-bit code produce exactly the same results • The 32-bit code can train an LM from a faked corpus with 10M unique words • Note: we are talking about unique words • Both Dave's Perl tests and Arthur's tests all pass • So things like LM interpolation are actually working too

  28. Current Limitation • Theoretically supports • 1.84 x 10^19 n-grams (i.e. 2^64) • The 4-step procedure uses too much space • Training on 100M words requires • 10 GB of hard disk • 1-2 GB of RAM • Training on 1G words requires more • 100 GB of hard disk and 20 GB of RAM?

  29. So, we still have issues when... • In ascending order of difficulty: • What if the MT people ask us to run their LM in our recognizer (1M-word limit)? • What if we need to run decoding for 10 languages, each with 100k words? • What if we need to train on an N-word corpus (N = 1 billion) with N x N x N possible trigrams? • What if Prof. Jim Baker came back? • What if there were aliens?

  30. Deliver Us From Evil: Why wasn't this feature implemented in 2001?

  31. An Important Observation • There is an implicit development deadlock between • Sphinx III • the CMU-Cambridge LM Toolkit • SphinxTrain

  32. General Pattern (part I) • The decoder's developers think • "Feature X is not implemented in the trainer" • "That is to say, there is no use in implementing feature X" • => Give up feature X

  33. General Pattern (part II) • The trainer's developers think • "Feature X is not implemented in the decoder" • "That is to say, there is no use in implementing feature X" • => Give up feature X

  34. Why wasn't feature X implemented in the first place? • Possible reason 1: • in the past, someone analyzed some results and concluded that feature X is not useful • Possible reason 2: • because of theoretical reasons Y and Z, someone concluded that feature X is not useful • Possible reason 3: • past hardware limitations

  35. In Reality... • Feature X can turn out to be very useful • E.g. • more than 65k words in the LM • N-grams with N > 3 • interpolation (instead of back-off) in N-grams

  36. Another Important Observation • Constantly giving up on new features • eventually means giving up on the whole software's development • Look at CMU-Cambridge LM Toolkit V2

  37. How should we deal with this problem? • 1. Know that this is a problem • (from anonymous self-help books) • 2. We need a joint understanding of both the decoder and the trainer(s) • Question to ask: is it really correct to always develop the decoder first? • 3. New training features can always be tested in cheap ways • N-best and lattice rescoring • Then the deadlock is broken on one side

  38. A Unified View of Our Software • [Diagram] The suite: CMU-Cambridge LM Toolkit + SphinxTrain + Sphinx {2,3,4} ("the Sphinx Brothers": which one depends on where you live)

  39. Issue 1 • Q: "Do we have the right to change the LM Toolkit?" • A: "Yes. According to the license, as long as we open the source for research purposes, we may change and distribute the code. • Our changes are endorsed by Prof. Rosenfeld (CMU), Dr. Clarkson (Cambridge) and Prof. Robinson (Cambridge)"

  40. Issue 2 • Q: "Do we have anything new in LM?" • A: "That depends on the brilliance of our students and staff • and, more generally, on the brilliance of the public • they have the right to contribute • Actually, in the past 10 years • a lot of new things were done at CMU in LM • it's just that no one collected them and put them together"

  41. Issue 3 • Q: "Are you just getting yourself into a lot of trouble?" • A: "The troubles were always there; we just never faced them."

  42. Digression: Project L • News: some folks are working on the LM toolkit now! • Project code: L • Three key supporters • a Young Professor (or Prof. AB) • hint: he is not exactly young • a Young Student (or DH) • a Young Staff member (or AC) • Gathering code from around the world • Thanks in advance to Professor Yannick from LIUM • Thanks to contributor AT

  43. Conclusion • 32-bit data structures are now supported in both Sphinx III and the CMU-Cambridge LM Toolkit • This brings up a lot of development issues • Maybe we should take the LM toolkit more seriously • maintenance (a must) • new feature development (if we have time)

  44. Preview of the next 2 talks • Project L: the Story of the Three Young Developers • Development progress of Sphinx 3.X (from X=3 to X=6) • What is the big picture of Sphinx?
