  1. ICASSP 05

  2. References • Rapid Language Model Development Using External Resources for New Spoken Dialog Domains • Ruhi Sarikaya1, Agustin Gravano2, Yuqing Gao1 • 1IBM, 2Columbia University • Maximum Entropy Based Generic Filter for Language Model Adaptation • Dong Yu, Milind Mahajan, Peter Mau, Alex Acero • Microsoft • Language Model Estimation for Optimizing End-to-End Performance of a Natural Language Call Routing System • Vaibhava Goel, Hong-Kwang Jeff Kuo, Sabine Deligne, Cheng Wu • IBM

  3. Introduction • LM adaptation consists of four steps • 1. Collect task-specific adaptation data • 2. Normalize it • Abbreviations, dates and times, punctuation • 3. Analyze the adaptation data and build a task-specific LM • 4. Interpolate the task-specific LM with the task-independent LM (a minimal sketch follows below)
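Step 4 is plain linear interpolation. A minimal Python sketch, assuming both models expose a hypothetical prob(word, history) method; the weight lam would be tuned on held-out in-domain data:

```python
# Linear interpolation of a task-specific LM with a task-independent LM.
# `task_lm` and `general_lm` are assumed to expose prob(word, history);
# `lam` is a hypothetical interpolation weight tuned on held-out data.
def interpolated_prob(word, history, task_lm, general_lm, lam=0.7):
    # P(w|h) = lam * P_task(w|h) + (1 - lam) * P_general(w|h)
    return (lam * task_lm.prob(word, history)
            + (1.0 - lam) * general_lm.prob(word, history))
```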

  4. Introduction • Language modeling research has concentrated in two directions • 1. Improving the language model probability estimation • 2. Obtaining additional training material • The largest data set is the World Wide Web (WWW) • More than 4 billion pages

  5. Introduction • Using web data for language modeling • Query generation • Filtering the relevant text from the retrieved pages • Web counts are certainly less sparse than the counts in a corpus of a fixed size • Web counts are also likely to be significantly noisier than counts obtained from a carefully cleaned and normalized corpus • Retrieval unit • Whole document vs. sentence (utterance)

  6. Building an LM for a new domain • In practice, when we start to build an SDS (spoken dialog system) for a new domain, the amount of in-domain data for the target domain is usually small • Definitions • Static resources: corpora collected for other tasks • Dynamic resource: web data

  7. Flow diagram for collecting relevant data

  8. Generating search queries • Using Google as the search engine • The more specific a query is, the more relevant the retrieved pages are (a query-generation sketch follows below)
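A hypothetical sketch of query generation, assuming queries are built from frequent n-grams of the small in-domain corpus; quoting each n-gram makes the engine match it as an exact phrase, and longer n-grams give the more specific queries the slide calls for:

```python
from collections import Counter

def make_queries(utterances, n=3, top_k=100):
    """Most frequent word n-grams from the in-domain data, as quoted phrases."""
    grams = Counter()
    for utt in utterances:
        words = utt.split()
        for i in range(len(words) - n + 1):
            grams[" ".join(words[i:i + n])] += 1
    # quoting makes the engine match the n-gram as an exact, specific phrase
    return ['"%s"' % g for g, _ in grams.most_common(top_k)]
```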

  9. Similarity-based sentence selection • Machine translation's BLEU (BiLingual Evaluation Understudy) score: $\mathrm{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$, where N is the maximum n-gram length and $w_n$ and $p_n$ are the corresponding weight and precision, respectively • The brevity penalty is $BP = 1$ if $c > r$ and $e^{1 - r/c}$ otherwise, where r and c are the lengths of the reference and candidate sentences, respectively • The selection threshold is 0.08 (a scoring sketch follows below)
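A minimal sentence-level BLEU scorer matching the formula above, with uniform weights $w_n = 1/N$ and a tiny floor on $p_n$ to avoid log(0); a web sentence is kept when its score against some in-domain sentence reaches the 0.08 threshold:

```python
import math
from collections import Counter

def ngrams(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def bleu(candidate, reference, max_n=4):
    """BLEU = BP * exp(sum_n w_n log p_n), uniform weights w_n = 1/max_n."""
    cand, ref = candidate.split(), reference.split()
    log_p = 0.0
    for n in range(1, max_n + 1):
        c_grams, r_grams = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(c, r_grams[g]) for g, c in c_grams.items())
        p_n = (overlap + 1e-9) / max(sum(c_grams.values()), 1)  # floor avoids log(0)
        log_p += math.log(p_n) / max_n
    r, c = len(ref), max(len(cand), 1)
    bp = 1.0 if c > r else math.exp(1.0 - r / c)  # brevity penalty
    return bp * math.exp(log_p)

def select_sentences(web_sentences, in_domain, thresh=0.08):
    # keep a web sentence if it is similar enough to any in-domain sentence
    return [s for s in web_sentences
            if any(bleu(s, ref) >= thresh for ref in in_domain)]
```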

  10. Experimental results • SCLM: language model built from the static corpora • WWW-20 / WWW-100: a predefined limit of 20 / 100 retrieved pages per query sentence

  11. E-mail corpus • Dictated and non-dictated

  12. Filtering the corpus • Filtering out non-dictated text is not easy in general • Hand-crafted rules (e.g. regular expressions; see the sketch below) • Limitations • They do not generalize well to situations we have not encountered • Rules are usually language-dependent • Developing and testing rules is very costly
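For illustration, a few hypothetical rules of the kind being criticized; they catch obvious e-mail artifacts but are language-dependent and miss anything unanticipated:

```python
import re

# Hypothetical hand-crafted rules for non-dictated lines in e-mail text.
NON_DICTATED = [
    re.compile(r"^(From|To|Cc|Subject|Date):"),  # mail headers
    re.compile(r"^\s*>"),                        # quoted reply lines
    re.compile(r"^--\s*$"),                      # signature separator
    re.compile(r"https?://\S+"),                 # bare URLs
]

def is_non_dictated(line):
    return any(p.search(line) for p in NON_DICTATED)
```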

  13. Maximum Entropy based filter • Consider the filtering task as a labeling problem: segment the adaptation data into two categories • Category D (dictated text): text which should be used for LM adaptation • Category N (non-dictated text): text which should not be used for LM adaptation • The text is divided into a sequence of text units (such as lines) $t_1, \ldots, t_n$, where $t_i$ is a text unit and $l_i$ is the label associated with $t_i$

  14. Label dependency • Assume that the labels of the text units are independent of each other given the complete sequence of text units: $P(l_1 \ldots l_n \mid t_1 \ldots t_n) = \prod_{i=1}^{n} P(l_i \mid t_1 \ldots t_n)$ • We further assume that the label for a given unit depends only upon units in a surrounding window of $k$ units: $P(l_i \mid t_1 \ldots t_n) \approx P(l_i \mid t_{i-k} \ldots t_{i+k})$ • For $k = 1$: $P(l_i \mid t_{i-1}, t_i, t_{i+1})$

  15. Classification Model • A MaxEnt model has the form $P(l \mid t) = \dfrac{\exp\left(\boldsymbol{\lambda} \cdot \mathbf{f}(t, l)\right)}{\sum_{l'} \exp\left(\boldsymbol{\lambda} \cdot \mathbf{f}(t, l')\right)}$, where $\boldsymbol{\lambda}$ is the vector of model parameters and $\mathbf{f}$ is the vector of feature functions

  16. Classification Model • A text unit is labeled as dictated text if $P(D \mid t_{i-1}, t_i, t_{i+1}) \geq P_{\text{thresh}} = 0.5$ (a filter sketch follows below)
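A minimal sketch of the whole filter, using scikit-learn's LogisticRegression as the MaxEnt classifier. The features are simple stand-ins for the paper's (rough OOV and end-of-sentence cues, not its actual feature set), and the k = 1 window is realized by concatenating each unit's features with its neighbors':

```python
from sklearn.linear_model import LogisticRegression

def unit_features(line, vocab):
    # stand-in features: length, OOV rate, end-of-sentence punctuation cue
    words = line.split()
    oov = sum(w.lower() not in vocab for w in words) / max(len(words), 1)
    eos = 1.0 if line.rstrip().endswith((".", "?", "!")) else 0.0
    return [float(len(words)), oov, eos]

def windowed(lines, i, vocab):
    # features of t_{i-1}, t_i, t_{i+1}, with zero padding at the edges
    def get(j):
        if 0 <= j < len(lines):
            return unit_features(lines[j], vocab)
        return [0.0, 0.0, 0.0]
    return get(i - 1) + get(i) + get(i + 1)

def filter_dictated(lines, train_lines, train_labels, vocab, p_thresh=0.5):
    # train_labels holds "D" / "N" per training line
    X_train = [windowed(train_lines, i, vocab) for i in range(len(train_lines))]
    clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
    X = [windowed(lines, i, vocab) for i in range(len(lines))]
    d_col = list(clf.classes_).index("D")
    probs = clf.predict_proba(X)[:, d_col]
    return [line for line, p in zip(lines, probs) if p >= p_thresh]
```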

  17. Features

  18. Space Splitting

  19. Evaluation • Uses only features RawCompact, EOS, and OOV • No space splitting

  20. Filtering is especially important and effective for adaptation data with a high percentage of non-dictated text (U2)

  21. Efficient Linear Combination for Distant n-gram Models • David Langlois, Kamel Smaili, Jean-Paul Haton • EUROSPEECH 2003, pp. 409-412

  22. Introduction • Classical n-gram model: $P(w_i \mid w_{i-n+1} \ldots w_{i-1})$ • Distant language models exploit information beyond this fixed window

  23. Modeling distance in SLM • Cache model (self-relationship) • It deals with the self-relationship between a word present in the history and itself: if a word is frequent in the history, it is more likely to appear again (a cache sketch follows below)
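A minimal cache-model sketch: the static LM probability is interpolated with the word's relative frequency in a window of recent words. Both `beta` and the cache size are assumed values:

```python
from collections import Counter, deque

class CacheLM:
    """Interpolates a static LM with a unigram cache over recent words."""
    def __init__(self, base_lm, cache_size=200, beta=0.1):
        self.base_lm = base_lm
        self.beta = beta                      # assumed cache weight
        self.cache = deque(maxlen=cache_size) # recent-history window

    def prob(self, word, history):
        # recomputing the Counter per call is fine for a sketch
        p_cache = (Counter(self.cache)[word] / len(self.cache)
                   if self.cache else 0.0)
        return ((1.0 - self.beta) * self.base_lm.prob(word, history)
                + self.beta * p_cache)

    def observe(self, word):
        self.cache.append(word)
```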

  24. Modeling distance in SLM (cont.) • Trigger model • The relationship between two words • It deals with pairs of words v → w such that if v (the triggering word) is in the history, w (the triggered word) is more likely to appear • In fact, the majority of triggers are self-triggers (v → v): a word triggers itself

  25. d-n-gram model • $P_d(w_i \mid w_{i-d-n+1} \ldots w_{i-d-1}) = \dfrac{N_d(w_{i-d-n+1} \ldots w_{i-d-1}\, w_i)}{N_d(w_{i-d-n+1} \ldots w_{i-d-1})}$, where $N_d(\cdot)$ is the discounted count and the history is shifted back by d words • The 0-n-gram model is the classical n-gram model (a counting sketch follows below)
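A counting sketch for the distant bigram (n = 2) under the reconstruction above: the history is the single word d + 1 positions back, so d = 0 recovers the classical bigram. Plain relative frequencies stand in for the paper's discounted counts $N_d(\cdot)$:

```python
from collections import defaultdict

def train_distant_bigram(words, d):
    """Relative-frequency estimate of P_d(w_i | w_{i-d-1})."""
    pair = defaultdict(int)
    hist = defaultdict(int)
    gap = d + 1                        # d = 0 gives the classical bigram
    for i in range(gap, len(words)):
        h, w = words[i - gap], words[i]
        pair[(h, w)] += 1
        hist[h] += 1
    def prob(w, h):
        return pair[(h, w)] / hist[h] if hist[h] else 0.0
    return prob
```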

  26. Evaluation • Vocabulary: 20k words • Training set: 38M words • Development set: 2M words • Test set: 2M words • Baselines: classical n-gram models

  27. Integration of distant n-gram models • A distant n-gram model cannot be used alone, since it takes into account only a part of the history • Perplexity is 717 for n = 2 and d = 4 • Several models with distances up to d are combined with the baseline model

  28. Improvements: 7.1% (distant bigrams) and 3.1% (distant trigrams) • The utility of distant n-gram models decreases with the distance: a distance greater than 2 does not provide more information

  29. Distant trigrams lead to an improvement, but it is smaller than for distant bigrams, because of the overlap between the histories of the d-trigram and the (d+1)-trigram

  30. Backoff smoothing

  31. With backoff smoothing: improvements of 7.9% and 11.6%

  32. Combination weights • Option 1: a unique weight per model • Option 2: the models' weights depend on the history (on the class of each sub-history)

  33. Combination of distant n-gram models • In order to combine K models $M_1, \ldots, M_K$, a set of weights $\lambda_1, \ldots, \lambda_K$ is defined and the combination is expressed by $P(w \mid h) = \sum_{k=1}^{K} \lambda_k P_{M_k}(w \mid h)$ • The development corpus is not sufficient to estimate a huge number of parameters • Instead, classify the histories and assign a weight to each class (a weight-estimation sketch follows below)
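A sketch of fitting the weights $\lambda_1, \ldots, \lambda_K$ on the development corpus by EM, a standard way to estimate linear-interpolation weights; the paper's exact procedure and its class-dependent variant are not spelled out here:

```python
def em_weights(models, dev_data, iters=20):
    """Fit interpolation weights lambda_k by EM on (word, history) pairs.
    Each model is assumed to expose prob(word, history)."""
    K = len(models)
    lam = [1.0 / K] * K
    for _ in range(iters):
        resp = [0.0] * K
        for word, hist in dev_data:
            joint = [lam[k] * models[k].prob(word, hist) for k in range(K)]
            z = sum(joint) or 1.0          # guard against all-zero probabilities
            for k in range(K):
                resp[k] += joint[k] / z    # posterior responsibility of model k
        total = sum(resp) or 1.0
        lam = [r / total for r in resp]    # M-step: renormalize
    return lam
```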

  34. Classification • Break the history into several parts (sub-histories); each sub-history is analyzed to estimate its importance for prediction and is then put into a class • Such a class is directly linked to the value of the sub-history's frequency • Each class gathers all sub-histories which have approximately the same frequency (a bucketing sketch follows below)
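A tiny sketch of the frequency-based classing, assuming log-scaled buckets so that sub-histories with roughly the same training frequency share one interpolation weight:

```python
import math

def frequency_class(sub_history, counts, n_buckets=20):
    """Bucket a sub-history by its (log-scaled) training frequency.
    The paper's 4000/8000 classes combine such buckets over all the
    sub-histories of the full history; this shows a single sub-history."""
    freq = counts.get(sub_history, 0)
    return min(int(math.log2(freq + 1)), n_buckets - 1)
```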

  35. With 8000 classes: perplexity 115.4, a 12.8% improvement over the baseline (132.4) and 5.3% over the single-weight combination (121.9) • With 4000 classes: perplexity 85.2, a 12.8% improvement over the baseline (97.8) and 1.5% over the single-weight combination (86.5)
