
Using CTW as a language modeler in Dasher

This article explores the use of CTW (Context Tree Weighting) as a language modeler in Dasher, with comparisons to PPM (Prediction by Partial Match). It discusses the goal of producing a generative model over strings and the requirements of sequential, fast, and adaptive models. The hierarchical Dirichlet model and binary decomposition techniques are also examined. Experimental results and conclusions are provided.


Presentation Transcript


  1. Using CTW as a language modeler in Dasher • Phil Cowans, Martijn van Veen • 25-04-2007 • Inference Group, Department of Physics, University of Cambridge

  2. Language Modelling • Goal is to produce a generative model over strings • Typically sequential predictions • Finite context models (standard forms written out below)
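
Written out in standard notation (with D the maximum context depth, the "context depth" parameter of slide 25), a sequential model factorises the string probability into one-step predictions, and a finite context model truncates the conditioning context:

    P(x_1, \ldots, x_N) = \prod_{i=1}^{N} P(x_i \mid x_1, \ldots, x_{i-1})

    P(x_i \mid x_1, \ldots, x_{i-1}) \approx P(x_i \mid x_{i-D}, \ldots, x_{i-1})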

  3. Dasher: Language Model • Conditional probability for each alphabet symbol, given the previous symbols • Similar to compression methods • Requirements: • Sequential • Fast • Adaptive • Model is trained • Better compression -> faster text input

  4. Basic Language Model • Independent distributions for each context • Use Dirichlet prior • Makes poor use of data • Intuitively we expect similarities between similar contexts
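
A rough sketch of what "independent distributions for each context with a Dirichlet prior" reduces to in practice (the class and parameter names here are illustrative, not Dasher's code): the posterior predictive is add-alpha smoothing of per-context counts.

    from collections import defaultdict

    class BasicContextModel:
        """Independent symmetric-Dirichlet(alpha) distribution per context.

        Sketch only: each context keeps its own counts, so data seen in one
        context says nothing about similar contexts."""

        def __init__(self, alphabet_size, alpha=1.0):
            self.alphabet_size = alphabet_size
            self.alpha = alpha
            # context (tuple of symbols) -> symbol -> count
            self.counts = defaultdict(lambda: defaultdict(int))

        def predict(self, context, symbol):
            # Posterior predictive of a symmetric Dirichlet: add-alpha smoothing.
            ctx = self.counts[context]
            total = sum(ctx.values())
            return (ctx[symbol] + self.alpha) / (total + self.alpha * self.alphabet_size)

        def update(self, context, symbol):
            self.counts[context][symbol] += 1

Because nothing is shared across contexts, a context seen only a few times gives poor predictions, which is the weakness the following slides address.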

  5. Basic Language Model

  6. Prediction By Partial Match • Associate a generative distribution with each leaf in the context tree • Share information between nodes using a hierarchical Dirichlet (or Pitman-Yor) prior • In practice use a fast, but generally good, approximation
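
A minimal PPM-style sketch of the fast approximation: blending with escape probabilities and no exclusions, using the alpha and beta values listed on slide 25. It illustrates the mechanism rather than reproducing the authors' hierarchical Dirichlet computation.

    class PPMNode:
        """One context node: counts of symbols observed in this exact context."""
        def __init__(self):
            self.counts = {}

    def ppm_predict(nodes, symbol, alphabet_size, alpha=0.49, beta=0.77):
        """Blend predictions from the longest matched context downwards.

        `nodes` is ordered longest context first. Escape mass is passed down
        to shorter contexts; the final fallback is a uniform distribution."""
        prob = 0.0
        escape = 1.0
        for node in nodes:
            total = sum(node.counts.values())
            if total == 0:
                continue            # empty context: pass all mass down
            distinct = len(node.counts)
            denom = total + alpha
            if symbol in node.counts:
                # Discounted count for the symbol in this context.
                prob += escape * (node.counts[symbol] - beta) / denom
            escape *= (alpha + beta * distinct) / denom
        # Whatever mass escapes every context goes to a uniform distribution.
        prob += escape / alphabet_size
        return prob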

  7. Hierarchical Dirichlet Model

  8. Context Tree Weighting • Combine nodes in the context tree • Tree structure treated as a random variable • Contexts associated with each leaf have the same generative distribution • Contexts associated with different leaves are independent • Dirichlet prior on generative distributions
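
The Dirichlet prior on the generative distributions gives each node the usual zero-order estimator: with symbol counts c_x, alphabet A and Dirichlet parameter alpha (the classic KT estimator takes alpha = 1/2; slide 25 lists alpha = 1/128),

    P_e(x \mid \text{counts}) = \frac{c_x + \alpha}{\sum_{y \in \mathcal{A}} c_y + |\mathcal{A}|\,\alpha}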

  9. CTW: Tree model • The tree model captures the source structure; the parameters at the leaves are memoryless

  10. Tree Partitions

  11. Recursive Definition • Either the children share one distribution, or the children are distributed independently
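
A binary-alphabet sketch of this recursion (illustrative, not Dasher's code): with weight w the children share one distribution, so the node's own estimate P_e applies; with weight 1 - w they are independent, so the product of the children's weighted probabilities applies. The classic algorithm uses w = 1/2; slide 25 uses w = 0.05.

    import math

    class CTWNode:
        """Binary CTW node mixing a local estimator with its children.

        Implements P_w = w * P_e + (1 - w) * P_w(child 0) * P_w(child 1)
        in the log domain."""

        def __init__(self, depth, w=0.5, alpha=0.5):
            self.depth = depth        # remaining context depth below this node
            self.w = w                # weight of "children share one distribution"
            self.alpha = alpha        # Dirichlet parameter of the local estimator
            self.counts = [0, 0]      # zeros and ones observed at this node
            self.log_pe = 0.0         # log prob of the data under the local estimator
            self.log_pw = 0.0         # log of the weighted (mixed) probability
            self.children = {}        # context bit -> child CTWNode

        def update(self, bit, context):
            """Observe `bit` with its context (most recent bit first)."""
            # Sequential update of the local Dirichlet (KT-style) estimator.
            num = self.counts[bit] + self.alpha
            den = self.counts[0] + self.counts[1] + 2 * self.alpha
            self.log_pe += math.log(num / den)
            self.counts[bit] += 1

            if self.depth == 0 or not context:
                self.log_pw = self.log_pe
                return

            # Only the child selected by the next context bit sees this observation.
            child = self.children.setdefault(
                context[0], CTWNode(self.depth - 1, self.w, self.alpha))
            child.update(bit, context[1:])

            # Absent children have log_pw = 0 (sequence probability 1).
            log_children = sum(c.log_pw for c in self.children.values())
            a = math.log(self.w) + self.log_pe
            b = math.log(1.0 - self.w) + log_children
            m = max(a, b)
            self.log_pw = m + math.log(math.exp(a - m) + math.exp(b - m))

The conditional probability of the next bit is the ratio of the root's weighted sequence probability after and before a trial update with that bit.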

  12. Experimental Results [256]

  13. Experimental Results [128]

  14. Experimental Results [27]

  15. Observations So Far • No clear overall winner without modification • PPM does better with small alphabets? • PPM initially learns faster? • CTW is more forgiving with redundant symbols?

  16. CTW for text • Properties of text-generating sources (Bell, Cleary, Witten 1989): • Large alphabet, but in any given context only a small subset is used, so code space is wasted on many probabilities that should be zero • Solution: adjust the zero-order estimator to decrease the probability of unlikely events, and use a binary decomposition • Text is only locally stationary • Solution: limit the counts to increase adaptivity

  17. Binary Decomposition • Decomposition tree
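
One way to realise a decomposition tree, turning one large-alphabet prediction into a chain of binary predictions (a balanced split is assumed here purely for the sketch; the tree used in the talk may differ):

    def symbol_to_bits(symbol_index, alphabet_size):
        """Decompose a symbol into a sequence of binary decisions.

        Each decision halves the remaining range of the alphabet; the
        symbol's probability is the product of the probabilities the
        binary models assign along this path."""
        lo, hi = 0, alphabet_size
        bits = []
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if symbol_index < mid:
                bits.append(0)
                hi = mid
            else:
                bits.append(1)
                lo = mid
        return bits

Each decision node gets its own binary model (for example the CTW node sketched after slide 11), which addresses the wasted code space described on the previous slide: unlikely branches of the tree quickly receive small probability.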

  18. Binary Decomposition • Results found by Aberg and Shtarkov: • All tests with full ASCII alphabet

  19. Count halving • If one count reaches a maximum, divide both counts by 2 • Forget older input data, increase adaptivity • In Dasher: predict user input with a model based on training text, so adaptivity is even more important
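
A minimal sketch of the halving rule as stated on the slide, for a binary counter (the maximum of 255 is an assumed example value, not taken from the talk):

    def update_counts(counts, bit, max_count=255):
        """Increment the count for `bit`; when it reaches the maximum,
        halve both counts. Old observations are gradually forgotten,
        making the estimator more adaptive to local statistics."""
        counts[bit] += 1
        if counts[bit] >= max_count:
            counts[0] //= 2
            counts[1] //= 2
        return counts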

  20. Count halving: Results

  21. Count halving: Results

  22. Results: Enron

  23. Combining PPM and CTW • Select the locally best model, or weight the models together • More alpha parameters for PPM, learned from data • PPM-like sharing, with a prior over context trees, as in CTW
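
One way to realise the "weight models together" option is a Bayesian mixture that re-weights PPM and CTW by how well each has predicted so far; the predict/update interface below is assumed for illustration, not taken from the talk.

    class ModelMixture:
        """Weight two sequential predictors by their past performance.

        Each model must expose predict(context, symbol) -> probability and
        update(context, symbol). The weights are posterior probabilities
        under a uniform prior over the models."""

        def __init__(self, models):
            self.models = models
            self.weights = [1.0 / len(models)] * len(models)

        def predict(self, context, symbol):
            return sum(w * m.predict(context, symbol)
                       for w, m in zip(self.weights, self.models))

        def update(self, context, symbol):
            likelihoods = [m.predict(context, symbol) for m in self.models]
            total = sum(w * p for w, p in zip(self.weights, likelihoods))
            # Bayes' rule: scale each weight by its model's likelihood.
            self.weights = [w * p / total
                            for w, p in zip(self.weights, likelihoods)]
            for m in self.models:
                m.update(context, symbol)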

  24. Conclusions • PPM and CTW have different strengths, so it makes sense to try combining them • Decomposition and count scaling may give clues for improving PPM • Look at performance on out-of-domain text in more detail

  25. Experimental Parameters • Context depth: 5 • Smoothing: 5% • PPM – alpha: 0.49, beta: 0.77 • CTW – w: 0.05, alpha: 1/128

  26. Comparing language models • PPM • Quickly learns repeating strings • CTW • Works on a set of all possible tree models • Not sensitive to parameter D, max. model depth • Easy to increase adaptivity • The weight factor (escape probability) is strictly defined
