
Random Forests for Language Modeling


Presentation Transcript


  1. Random Forests for Language Modeling Peng Xu, Frederick Jelinek

  2. Outline • Basic Language Modeling • Language Models • Smoothing in n-gram Language Models • Decision Tree Language Models • Random Forests for Language Models • Random Forests • n-gram • Structured Language Model (SLM) • Experiments • Conclusions and Future Work

  3. Basic Language Modeling Estimate the source probability from a training corpus: a large amount of text chosen for similarity to the expected sentences. Parametric conditional models: P(W) = Π_i P(w_i | w_1 ... w_{i-1}), where w_1 ... w_{i-1} is the history.

  4. Basic Language Modeling Smooth models: assign non-zero probability to every word string. Perplexity (PPL): PPL = exp(-(1/N) Σ_i log P(w_i | w_1 ... w_{i-1})). n-gram models: approximate the history by its last n-1 words, P(w_i | w_1 ... w_{i-1}) ≈ P(w_i | w_{i-n+1} ... w_{i-1}).
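To make the perplexity definition concrete, here is a minimal Python sketch (the function name and the toy probabilities are illustrative, not from the talk):

```python
import math

def perplexity(word_probs):
    """PPL = exp(-(1/N) * sum_i log P(w_i | history)) over a test set,
    given the probability the model assigned to each test word."""
    n = len(word_probs)
    avg_log_prob = sum(math.log(p) for p in word_probs) / n
    return math.exp(-avg_log_prob)

# A model that assigns probability 0.1 to every test word has PPL ~10:
print(perplexity([0.1] * 50))
```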

  5. Estimate n-gram Parameters Maximum Likelihood (ML) estimate: P(w_i | w_{i-2} w_{i-1}) = C(w_{i-2} w_{i-1} w_i) / C(w_{i-2} w_{i-1}) • Best on training data: lowest PPL • Data sparseness problem: n=3, |V|=10k gives 10^12 possible trigrams, so on the order of a trillion training words would be needed • Zero probability for almost all test data!
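A minimal sketch of the ML trigram estimate and the zero-probability problem it causes (the toy corpus and function names are illustrative):

```python
from collections import defaultdict

def ml_trigram_estimator(words):
    """ML estimate: P(w | u, v) = C(u, v, w) / C(u, v)."""
    tri, bi = defaultdict(int), defaultdict(int)
    for u, v, w in zip(words, words[1:], words[2:]):
        tri[(u, v, w)] += 1
        bi[(u, v)] += 1
    def prob(u, v, w):
        return tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    return prob

p = ml_trigram_estimator("the cat sat on the mat".split())
print(p("the", "cat", "sat"))  # 1.0: seen trigram
print(p("the", "cat", "ran"))  # 0.0: unseen trigram gets zero probability
```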

  6. Dealing with Sparsity • Smoothing: use lower-order statistics • Word clustering: reduce the size of V • History clustering: reduce the number of histories • Maximum entropy: use exponential models • Neural networks: represent words in real space, use exponential models

  7. Smoothing Techniques Good smoothing techniques: Deleted Interpolation, Katz, Absolute Discounting, Kneser-Ney (KN) • Kneser-Ney: consistently the best [Chen & Goodman, 1998]
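As a rough illustration of the interpolated Kneser-Ney idea, a sketch for bigrams only with a single fixed discount (a simplification, not the exact configuration used in the talk):

```python
from collections import defaultdict

def interpolated_kn_bigram(words, discount=0.75):
    """Interpolated Kneser-Ney for bigrams with one fixed discount D:
    P(w | v) = max(c(v,w) - D, 0) / c(v) + D * N1+(v,.) / c(v) * Pcont(w),
    where Pcont(w) = N1+(.,w) / (number of distinct bigram types)."""
    bigram = defaultdict(int)
    context = defaultdict(int)        # c(v): count of v as a bigram context
    followers = defaultdict(set)      # distinct words seen after v
    predecessors = defaultdict(set)   # distinct words seen before w
    for v, w in zip(words, words[1:]):
        bigram[(v, w)] += 1
        context[v] += 1
        followers[v].add(w)
        predecessors[w].add(v)
    n_bigram_types = len(bigram)

    def prob(v, w):
        p_cont = len(predecessors[w]) / n_bigram_types
        if context[v] == 0:
            return p_cont             # unknown history: back off completely
        lam = discount * len(followers[v]) / context[v]
        return max(bigram[(v, w)] - discount, 0) / context[v] + lam * p_cont

    return prob

p = interpolated_kn_bigram("the cat sat on the mat".split())
print(p("the", "cat"), p("the", "sat"))  # the unseen bigram ("the", "sat") still gets mass
```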

  8. Decision Tree Language Models Goal: history clustering by a binary decision tree (DT) • Internal nodes: a set of histories, one or two questions • Leaf nodes: a set of histories • Node splitting algorithms • DT growing algorithms

  9. Example DT Training data (two-word history plus predicted word): aba, aca, acb, bcb, bda. Root node {ab, ac, bc, bd} (a:3, b:2) asks two questions. "Is the first word 'a'?" leads to leaf {ab, ac} (a:2, b:1); "Is the first word 'b'?" leads to leaf {bc, bd} (a:1, b:1). New event 'cba' (history 'cb'): it answers neither question, so the lookup is stuck!
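A tiny sketch of this lookup (the dictionary layout is made up for illustration): a history whose first word is neither 'a' nor 'b' matches no branch.

```python
# Leaves of the toy DT above, keyed by the first word of the history.
leaves = {
    "a": {"histories": {"ab", "ac"}, "counts": {"a": 2, "b": 1}},
    "b": {"histories": {"bc", "bd"}, "counts": {"a": 1, "b": 1}},
}

def leaf_for(history):
    """Return the leaf for a two-word history, or None if no question applies."""
    return leaves.get(history[0])

print(leaf_for("ab")["counts"])  # {'a': 2, 'b': 1}
print(leaf_for("cb"))            # None: the new event 'cba' is stuck
```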

  10. Previous Work • DT is an appealing idea: deals with data sparseness • [Bahl et al., 1989] 20 words in the history, slightly better than a 3-gram • [Potamianos and Jelinek, 1998] fair comparison, negative results on letter n-grams • Both are top-down with a stopping criterion • Why doesn't it work in practice? • Training data fragmentation: data sparseness • No theoretically founded stopping criterion: early termination • Greedy algorithms: early termination

  11. Outline • Basic Language Modeling • Language Models • Smoothing in n-gram Language Models • Decision Tree Language Models • Random Forests for Language Models • Random Forests • n-gram • Structured Language Model (SLM) • Experiments • Conclusions and Future Work

  12. Random Forests • [Amit & Geman 1997] shape recognition with randomized trees • [Ho 1998] random subspace • [Breiman 2001] random forests Random Forest (RF): a classifier consisting of a collection of tree-structured classifiers.

  13. Our Goal Main problems: • Data sparseness • Smoothing • Early termination • Greedy algorithms Expectations from Random Forests: • Less greedy algorithms: randomization and voting • Avoid early termination: randomization • Conquer data sparseness: voting

  14. Outline • Basic Language Modeling • Language Models • Smoothing in n-gram Language Models • Decision Tree Language Models • Random Forests for Language Models • Random Forests • n-gram: general approach • Structured Language Model (SLM) • Experiments • Conclusions and Future Work

  15. General DT Growing Approach • Grow a DT until maximum depth using training data • Perform no smoothing during growing • Prune fully grown DT to maximize heldout data likelihood • Incorporate KN smoothing during pruning

  16. Node Splitting Algorithm Questions: about identities of words in the history Definitions: • H(p): the set of histories in node p • position i: the distance from a word in the history to the predicted word • H_i(v): the set of histories in the node with word v in position i • split: non-empty sets A_i and B_i, each a union of sets H_i(v) • L(A_i): the training-data log-likelihood of the node under the split into A_i and B_i, using relative frequencies

  17. Node Splitting Algorithm Algorithm Sketch: • For each position i: • Initialize A_i and B_i • For each H_i(v) in A_i: tentatively move H_i(v) to B_i and calculate the resulting increase in log-likelihood; if the increase is positive, make the move and update the counts • Carry out the same pass for each H_i(v) in B_i • Repeat the two passes until no move is possible • Split the node at the best position: the one whose split gives the largest increase in log-likelihood
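A minimal sketch of the exchange pass for one position, assuming each element set H_i(v) is summarized by a Counter of predicted-word counts (the representation and helper names are assumptions, not the talk's implementation):

```python
import math
from collections import Counter

def loglik(counts):
    """Training-data log-likelihood of a set of histories under relative
    frequencies: L(S) = sum_w c_S(w) * log(c_S(w) / c_S)."""
    total = sum(counts.values())
    return sum(c * math.log(c / total) for c in counts.values() if c > 0)

def exchange_split(elements):
    """One position's exchange pass. `elements` maps word v to the Counter of
    predicted-word counts over H_i(v). Returns (A_i, B_i, gain over no split)."""
    words = list(elements)
    A, B = set(words[0::2]), set(words[1::2])   # initialization (randomized in the RF version)
    cA = sum((elements[v] for v in A), Counter())
    cB = sum((elements[v] for v in B), Counter())
    unsplit = loglik(cA + cB)                   # log-likelihood with no split

    moved = True
    while moved:                                # repeat until no move helps
        moved = False
        for v in words:
            src, dst, cs, cd = (A, B, cA, cB) if v in A else (B, A, cB, cA)
            if len(src) == 1:
                continue                        # keep both sets non-empty
            gain = (loglik(cs - elements[v]) + loglik(cd + elements[v])
                    - loglik(cs) - loglik(cd))
            if gain > 0:                        # positive increase: move H_i(v)
                src.remove(v); dst.add(v)
                cs.subtract(elements[v]); cd.update(elements[v])
                moved = True
    return A, B, loglik(cA) + loglik(cB) - unsplit
```

The full splitting step would run this once per position and split the node at the position with the largest gain.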

  18. Pruning a Decision Tree Smoothing: the leaf probabilities are KN-smoothed (slide 15). Define: • L(p): the set of all leaves of the subtree rooted at p • LH(p): the smoothed heldout-data log-likelihood of p treated as a single leaf • LH(L(p)): the smoothed heldout-data log-likelihood of the leaves L(p) • potential: LH(L(p)) - LH(p) Pruning: traverse all internal nodes; prune the subtree rooted at p if its potential is negative (similar to CART)
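A compact sketch of the pruning pass, assuming a node object with a `children` list and a callback that returns the KN-smoothed heldout log-likelihood of a node treated as a single leaf (both are assumptions for illustration):

```python
def prune(node, heldout_loglik):
    """Return the heldout log-likelihood of the (possibly pruned) subtree.
    potential(p) = LH(L(p)) - LH(p); collapse p's subtree when it is negative."""
    if not node.children:                       # already a leaf
        return heldout_loglik(node)
    lh_leaves = sum(prune(c, heldout_loglik) for c in node.children)
    lh_self = heldout_loglik(node)              # p treated as a single leaf
    if lh_leaves - lh_self < 0:                 # negative potential: prune
        node.children = []
        return lh_self
    return lh_leaves
```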

  19. Towards Random Forests Randomized question selection: • Randomized initialization of A_i and B_i • Randomized position selection Generating the random forest LM: • M decision trees are grown randomly • Each DT generates a probability sequence on the test data • Aggregation: average the M sequences, P_RF(w | history) = (1/M) Σ_m P_DT_m(w | history)
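A small sketch of the aggregation step, assuming the forest probability of each test word is the average of the M decision-tree probabilities (names and toy numbers are illustrative):

```python
import math

def rf_perplexity(dt_prob_sequences):
    """Each inner list holds one DT's probabilities for the test words, in
    order; the RF probability of word i is the average over the M trees."""
    m, n = len(dt_prob_sequences), len(dt_prob_sequences[0])
    log_sum = 0.0
    for i in range(n):
        p_rf = sum(seq[i] for seq in dt_prob_sequences) / m
        log_sum += math.log(p_rf)
    return math.exp(-log_sum / n)

# Two toy "trees" scoring a three-word test set:
print(rf_perplexity([[0.20, 0.10, 0.05], [0.10, 0.30, 0.15]]))
```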

  20. Remarks on RF-LM Random Forest Language Model (RF-LM): a collection of randomly constructed DT-LMs • A DT-LM is an RF-LM: a small forest • An n-gram LM is a DT-LM: no pruning • An n-gram LM is an RF-LM! • Single compact model

  21. Outline • Basic Language Modeling • Language Models • Smoothing in n-gram Language Models • Decision Tree Language Models • Random Forests for Language Models • Random Forests • n-gram • Structured Language Model (SLM) • Experiments • Conclusions and Future Work

  22. A Parse Tree

  23. The Structured Language Model (SLM)

  24. Partial Parse Tree

  25. SLM Probabilities Joint probability of words and parse: the product, over word positions, of the WORD-PREDICTOR, TAGGER, and CONSTRUCTOR probabilities, each conditioned on the two most recent exposed heads. Word probabilities: obtained by summing the joint probability over the partial parses of the word prefix.

  26. Using RFs for the SLM Ideally: run the SLM once, with the full RF inside each component. Parallel approximation: run the SLM M times, once per DT, and aggregate the M probability sequences.

  27. Outline • Basic Language Modeling • Language Models • Smoothing in n-gram Language Models • Decision Tree Language Models • Random Forests for Language Models • Random Forests • n-gram • Structured Language Model (SLM) • Experiments • Conclusions and Future Work

  28. Experiments Goal: Compare with Kneser-Ney (KN) • Perplexity (PPL): • UPenn Treebank: 1 million words training, 82k test • Normalized text • Word Error Rate (WER): • WSJ text: 20 or 40 million words training • WSJ DARPA’93 HUB1 test data: 213 utterances, 3446 words • N-best rescoring: standard trigram baseline on 40 million words

  29. Experiments: trigram perplexity • Baseline: KN-trigram • No randomization: DT-trigram • 100 random DTs: RF-trigram

  30. Experiments: Aggregating • Improvements within 10 trees!

  31. Experiments: Why does it work? When is a test event a "seen event"? • KN-trigram: the exact trigram (w_{i-2}, w_{i-1}, w_i) occurs in the training data • DT-trigram: the event (Φ(w_{i-2} w_{i-1}), w_i) occurs in the training data, where Φ maps a history to its DT leaf • RF-trigram: (Φ_m(w_{i-2} w_{i-1}), w_i) occurs in the training data for any m = 1, ..., M, so trees with different clusterings cover different test events
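A toy illustration of the non-reciprocal data sharing behind this, reusing the earlier example; phi_1 and phi_2 stand in for two trees' history clusterings and are made up for illustration:

```python
# Training events from the earlier toy example: (two-word history, predicted word).
training = [("ab", "a"), ("ac", "a"), ("ac", "b"), ("bc", "b"), ("bd", "a")]

phi_1 = lambda h: h[0]   # "tree 1" clusters histories by their first word
phi_2 = lambda h: h[1]   # "tree 2" clusters histories by their second word

def seen(phi, history, word):
    """Is (phi(history), word) observed anywhere in the training data?"""
    return any(phi(h) == phi(history) and w == word for h, w in training)

event = ("cb", "a")         # the new event 'cba': history 'cb', predicted word 'a'
print(seen(phi_1, *event))  # False: no training history starts with 'c'
print(seen(phi_2, *event))  # True: 'ab' shares second word 'b' and predicts 'a'
```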

  32. Experiments: SLM perplexity • Baseline: KN-SLM • 100 random DTs for each of the components • Parallel approximation • Interpolate with KN-trigram

  33. Experiments: speech recognition • Baseline: KN-trigram, KN-SLM • 100 random DTs for RF-trigram, RF-SLM-P (predictor) • Interpolate with KN-trigram (40M)

  34. Conclusions • New RF language modeling approach • More general LM: RF ⊇ DT ⊇ n-gram • Randomized history clustering: non-reciprocal data sharing • Good performance in PPL and WER • Generalizes well to unseen data • Portable to other tasks

  35. Future Work • Random samples of training data • More linguistically oriented questions • Direct implementation in the SLM • Lower order random forests • Larger test data for speech recognition • Language model adaptation

  36. Thank you!
