
Using CTW as a language modeler in Dasher

This article explores the use of CTW (Context Tree Weighting) as a language modeler in Dasher, with comparisons to PPM (Prediction by Partial Match). It discusses the goal of producing a generative model over strings and the requirements of sequential, fast, and adaptive models. The hierarchical Dirichlet model and binary decomposition techniques are also examined. Experimental results and conclusions are provided.


Presentation Transcript


  1. Using CTW as a language modeler in Dasher • Phil Cowans, Martijn van Veen • 25-04-2007 • Inference Group, Department of Physics, University of Cambridge

  2. Language Modelling • Goal is to produce a generative model over strings • Typically sequential predictions • Finite context models (standard forms written out below)
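
Written out in standard notation (with D the maximum context depth, the "context depth" parameter of slide 25), a sequential model factorises the string probability into one-step predictions, and a finite context model truncates the conditioning context:

    P(x_1, \ldots, x_N) = \prod_{i=1}^{N} P(x_i \mid x_1, \ldots, x_{i-1})

    P(x_i \mid x_1, \ldots, x_{i-1}) \approx P(x_i \mid x_{i-D}, \ldots, x_{i-1})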

  3. Dasher: Language Model • Conditional probability for each alphabet symbol, given the previous symbols • Similar to compression methods • Requirements: • Sequential • Fast • Adaptive • Model is trained • Better compression -> faster text input

  4. Basic Language Model • Independent distributions for each context • Use Dirichlet prior • Makes poor use of data • Intuitively we expect similarities between similar contexts
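
A rough sketch of what "independent distributions for each context with a Dirichlet prior" reduces to in practice (the class and parameter names here are illustrative, not Dasher's code): the posterior predictive is add-alpha smoothing of per-context counts.

    from collections import defaultdict

    class BasicContextModel:
        """Independent symmetric-Dirichlet(alpha) distribution per context.

        Sketch only: each context keeps its own counts, so data seen in one
        context says nothing about similar contexts."""

        def __init__(self, alphabet_size, alpha=1.0):
            self.alphabet_size = alphabet_size
            self.alpha = alpha
            # context (tuple of symbols) -> symbol -> count
            self.counts = defaultdict(lambda: defaultdict(int))

        def predict(self, context, symbol):
            # Posterior predictive of a symmetric Dirichlet: add-alpha smoothing.
            ctx = self.counts[context]
            total = sum(ctx.values())
            return (ctx[symbol] + self.alpha) / (total + self.alpha * self.alphabet_size)

        def update(self, context, symbol):
            self.counts[context][symbol] += 1

Because nothing is shared across contexts, a context seen only a few times gives poor predictions, which is the weakness the following slides address.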

  5. Basic Language Model

  6. Prediction By Partial Match • Associate a generative distribution with each leaf in the context tree • Share information between nodes using a hierarchical Dirichlet (or Pitman-Yor) prior • In practice use a fast, but generally good, approximation
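
A minimal PPM-style sketch of the fast approximation: blending with escape probabilities and no exclusions, using the alpha and beta values listed on slide 25. It illustrates the mechanism rather than reproducing the authors' hierarchical Dirichlet computation.

    class PPMNode:
        """One context node: counts of symbols observed in this exact context."""
        def __init__(self):
            self.counts = {}

    def ppm_predict(nodes, symbol, alphabet_size, alpha=0.49, beta=0.77):
        """Blend predictions from the longest matched context downwards.

        `nodes` is ordered longest context first. Escape mass is passed down
        to shorter contexts; the final fallback is a uniform distribution."""
        prob = 0.0
        escape = 1.0
        for node in nodes:
            total = sum(node.counts.values())
            if total == 0:
                continue            # empty context: pass all mass down
            distinct = len(node.counts)
            denom = total + alpha
            if symbol in node.counts:
                # Discounted count for the symbol in this context.
                prob += escape * (node.counts[symbol] - beta) / denom
            escape *= (alpha + beta * distinct) / denom
        # Whatever mass escapes every context goes to a uniform distribution.
        prob += escape / alphabet_size
        return prob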

  7. Hierarchical Dirichlet Model

  8. Context Tree Weighting • Combine nodes in the context tree • Tree structure treated as a random variable • Contexts associated with each leaf have the same generative distribution • Contexts associated with different leaves are independent • Dirichlet prior on generative distributions
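
The Dirichlet prior on the generative distributions gives each node the usual zero-order estimator: with symbol counts c_x, alphabet A and Dirichlet parameter alpha (the classic KT estimator takes alpha = 1/2; slide 25 lists alpha = 1/128),

    P_e(x \mid \text{counts}) = \frac{c_x + \alpha}{\sum_{y \in \mathcal{A}} c_y + |\mathcal{A}|\,\alpha}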

  9. CTW: Tree model • The tree model captures the source structure; the parameters at the leaves are memoryless

  10. Tree Partitions

  11. Recursive Definition • Either the children share one distribution, or the children are distributed independently
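
A binary-alphabet sketch of this recursion (illustrative, not Dasher's code): with weight w the children share one distribution, so the node's own estimate P_e applies; with weight 1 - w they are independent, so the product of the children's weighted probabilities applies. The classic algorithm uses w = 1/2; slide 25 uses w = 0.05.

    import math

    class CTWNode:
        """Binary CTW node mixing a local estimator with its children.

        Implements P_w = w * P_e + (1 - w) * P_w(child 0) * P_w(child 1)
        in the log domain."""

        def __init__(self, depth, w=0.5, alpha=0.5):
            self.depth = depth        # remaining context depth below this node
            self.w = w                # weight of "children share one distribution"
            self.alpha = alpha        # Dirichlet parameter of the local estimator
            self.counts = [0, 0]      # zeros and ones observed at this node
            self.log_pe = 0.0         # log prob of the data under the local estimator
            self.log_pw = 0.0         # log of the weighted (mixed) probability
            self.children = {}        # context bit -> child CTWNode

        def update(self, bit, context):
            """Observe `bit` with its context (most recent bit first)."""
            # Sequential update of the local Dirichlet (KT-style) estimator.
            num = self.counts[bit] + self.alpha
            den = self.counts[0] + self.counts[1] + 2 * self.alpha
            self.log_pe += math.log(num / den)
            self.counts[bit] += 1

            if self.depth == 0 or not context:
                self.log_pw = self.log_pe
                return

            # Only the child selected by the next context bit sees this observation.
            child = self.children.setdefault(
                context[0], CTWNode(self.depth - 1, self.w, self.alpha))
            child.update(bit, context[1:])

            # Absent children have log_pw = 0 (sequence probability 1).
            log_children = sum(c.log_pw for c in self.children.values())
            a = math.log(self.w) + self.log_pe
            b = math.log(1.0 - self.w) + log_children
            m = max(a, b)
            self.log_pw = m + math.log(math.exp(a - m) + math.exp(b - m))

The conditional probability of the next bit is the ratio of the root's weighted sequence probability after and before a trial update with that bit.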

  12. Experimental Results [256]

  13. Experimental Results [128]

  14. Experimental Results [27]

  15. Observations So Far • No clear overall winner without modification • PPM does better with small alphabets? • PPM initially learns faster? • CTW is more forgiving with redundant symbols?

  16. CTW for text • Properties of text-generating sources (Bell, Cleary, Witten 1989): • Large alphabet, but in any given context only a small subset is used, so code space is wasted on many probabilities that should be zero • Solution: adjust the zero-order estimator to decrease the probability of unlikely events, and use a binary decomposition • Text is only locally stationary • Solution: limit the counts to increase adaptivity

  17. Binary Decomposition • Decomposition tree
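
One way to realise a decomposition tree, turning one large-alphabet prediction into a chain of binary predictions (a balanced split is assumed here purely for the sketch; the tree used in the talk may differ):

    def symbol_to_bits(symbol_index, alphabet_size):
        """Decompose a symbol into a sequence of binary decisions.

        Each decision halves the remaining range of the alphabet; the
        symbol's probability is the product of the probabilities the
        binary models assign along this path."""
        lo, hi = 0, alphabet_size
        bits = []
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if symbol_index < mid:
                bits.append(0)
                hi = mid
            else:
                bits.append(1)
                lo = mid
        return bits

Each decision node gets its own binary model (for example the CTW node sketched after slide 11), which addresses the wasted code space described on the previous slide: unlikely branches of the tree quickly receive small probability.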

  18. Binary Decomposition • Results found by Aberg and Shtarkov: • All tests with full ASCII alphabet

  19. Count halving • If one count reaches a maximum, divide both counts by 2 • Forget older input data, increase adaptivity • In Dasher: predict user input with a model based on training text, so adaptivity is even more important
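
A minimal sketch of the halving rule as stated on the slide, for a binary counter (the maximum of 255 is an assumed example value, not taken from the talk):

    def update_counts(counts, bit, max_count=255):
        """Increment the count for `bit`; when it reaches the maximum,
        halve both counts. Old observations are gradually forgotten,
        making the estimator more adaptive to local statistics."""
        counts[bit] += 1
        if counts[bit] >= max_count:
            counts[0] //= 2
            counts[1] //= 2
        return counts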

  20. Count halving: Results

  21. Count halving: Results

  22. Results: Enron

  23. Combining PPM and CTW • Select the locally best model, or weight the models together • More alpha parameters for PPM, learned from data • PPM-like sharing, with a prior over context trees, as in CTW
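
One way to realise the "weight models together" option is a Bayesian mixture that re-weights PPM and CTW by how well each has predicted so far; the predict/update interface below is assumed for illustration, not taken from the talk.

    class ModelMixture:
        """Weight two sequential predictors by their past performance.

        Each model must expose predict(context, symbol) -> probability and
        update(context, symbol). The weights are posterior probabilities
        under a uniform prior over the models."""

        def __init__(self, models):
            self.models = models
            self.weights = [1.0 / len(models)] * len(models)

        def predict(self, context, symbol):
            return sum(w * m.predict(context, symbol)
                       for w, m in zip(self.weights, self.models))

        def update(self, context, symbol):
            likelihoods = [m.predict(context, symbol) for m in self.models]
            total = sum(w * p for w, p in zip(self.weights, likelihoods))
            # Bayes' rule: scale each weight by its model's likelihood.
            self.weights = [w * p / total
                            for w, p in zip(self.weights, likelihoods)]
            for m in self.models:
                m.update(context, symbol)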

  24. Conclusions • PPM and CTW have different strengths, so it makes sense to try combining them • Decomposition and count scaling may give clues for improving PPM • Look at performance on out-of-domain text in more detail

  25. Experimental Parameters • Context depth: 5 • Smoothing: 5% • PPM – alpha: 0.49, beta: 0.77 • CTW – w: 0.05, alpha: 1/128

  26. Comparing language models • PPM • Quickly learns repeating strings • CTW • Works on a set of all possible tree models • Not sensitive to parameter D, max. model depth • Easy to increase adaptivity • The weight factor (escape probability) is strictly defined
