Information theory, MDL and human cognition
Nick Chater
Department of Psychology, University College London
[email protected]
Overview:
Bayes and MDL: an overview
Universality and induction
Some puzzles about model fitting
Cognitive science applications
Take codes as basic…
…when we know most about representation, e.g., grammars
Codes or priors? Which comes first?
2. The practical issue
Good continuation—most lines continue in the same direction in real images
And we can roughly assess the complexity of grammars (by length)
Not so clear how directly to set a prior over all grammars
(though can define a generative process in simple cases…)
Simplicity/MDL viewpoint (e.g., Goldsmith, 2001)
S → NP VP
VP → V NP
VP → V NP PP
NP → Det Noun
NP → NP PP
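To make the MDL idea concrete, here is a minimal sketch of a two-part code for a grammar like the one above: the score is the bits needed to write the grammar down plus the bits needed to encode a corpus with it. The symbol costs, the toy derivations and the flat codes are illustrative assumptions, not Goldsmith's actual scheme.

    import math

    # Toy CFG (the rules listed above)
    grammar = {
        "S":  [["NP", "VP"]],
        "VP": [["V", "NP"], ["V", "NP", "PP"]],
        "NP": [["Det", "Noun"], ["NP", "PP"]],
    }

    symbols = {s for lhs, rhss in grammar.items() for rhs in rhss for s in [lhs] + rhs}
    bits_per_symbol = math.log2(len(symbols))          # flat code over the symbol inventory

    # Part 1: cost of writing the grammar down (one code word per symbol occurrence)
    grammar_bits = sum((1 + len(rhs)) * bits_per_symbol
                       for rhss in grammar.values() for rhs in rhss)

    # Part 2: cost of the data given the grammar -- each derivation step costs
    # log2(number of alternative expansions of the chosen non-terminal)
    def derivation_bits(nonterminals):
        return sum(math.log2(len(grammar[nt])) for nt in nonterminals)

    corpus = [["S", "VP", "NP"], ["S", "VP", "NP", "NP"]]   # hypothetical derivations
    data_bits = sum(derivation_bits(parse) for parse in corpus)

    print(f"grammar: {grammar_bits:.1f} bits, data|grammar: {data_bits:.1f} bits, "
          f"total: {grammar_bits + data_bits:.1f} bits")

The best grammar minimises the total: extra rules must pay for themselves by shortening the encoding of the corpus.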
Gzip as a handy approximation!?!
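For instance, off-the-shelf compression gives a quick, if crude, upper bound on description length (a sketch; the strings are arbitrary examples and zlib stands in for gzip):

    import random
    import zlib

    def compressed_bits(s: str) -> int:
        # bits in the compressed encoding: a rough proxy for complexity
        return 8 * len(zlib.compress(s.encode("utf-8"), 9))

    random.seed(0)
    print(compressed_bits("ab" * 500))                         # highly regular
    print(compressed_bits("the cat sat on the mat " * 40))     # repetitive text
    print(compressed_bits("".join(random.choice("abcdefgh") for _ in range(1000))))  # near-random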
Possible, if we limit ourselves to computable models
Mixture of all (computable) priors q_i, with weights α_i that decline fairly fast: m(x) = Σ_i α_i q_i(x)
Then, this multiplicatively dominates all priors
though neutral priors will mean slow learning
m(x) are “universal” priors
The most neutral possible prior…
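A finite toy version of this mixture (the three component priors and the 2^-i weights below are made-up choices) shows the multiplicative-dominance property: m never falls below any component by more than a constant factor, i.e. by more than a constant number of bits.

    import math

    # hypothetical component priors over six outcomes
    q1 = [1/6] * 6                                        # uniform
    q2 = [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.03125]     # skewed
    q3 = [0.9, 0.02, 0.02, 0.02, 0.02, 0.02]              # peaked
    priors = [q1, q2, q3]

    raw = [2.0 ** -(i + 1) for i in range(len(priors))]   # fast-declining weights
    total = sum(raw)
    weights = [w / total for w in raw]

    def m(x):
        return sum(w * q[x] for w, q in zip(weights, priors))

    # multiplicative dominance: m(x) >= weight_i * q_i(x) for every component i
    for i, q in enumerate(priors):
        assert all(m(x) >= weights[i] * q[x] for x in range(6))

    print([round(m(x), 3) for x in range(6)])
    print("constant-bit penalties:", [round(-math.log2(w), 2) for w in weights])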
But Kolmogorov complexity can be made concrete…
210 bits: λ-calculus model
272 bits: combinators
Compact Universal Turing machines
Due to John Tromp, 2007
Not much room to hide here!
A key result:
K(x) = -log2 m(x) + O(1)
where m is a universal prior
Analogous to Shannon’s source coding theorem
And for any computable q,
K(x) ≤ -log2 q(x) + O(1)
for typical x drawn from q(x)
Any data x that is likely under any sensible probability distribution has low K(x)
Neutral priors and Kolmogorov complexity
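A one-step sketch of where the second bound comes from, using the mixture definition of m above (the standard argument, e.g. in Li & Vitányi):

    Since $m(x) \ge \alpha_q\, q(x)$, where $\alpha_q > 0$ is the weight that $q$ receives in the mixture,
    $$-\log_2 m(x) \le -\log_2 q(x) - \log_2 \alpha_q,$$
    and combining this with $K(x) = -\log_2 m(x) + O(1)$ gives
    $$K(x) \le -\log_2 q(x) + O(1).$$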
Li, M. & Vitányi, P. (1997). An Introduction to Kolmogorov Complexity and Its Applications (2nd ed.). Berlin: Springer.
y(x) = a_125 x^125 + a_124 x^124 + … + a_0
The hidden role of simplicity…
But who says how many parameters a function has got??
All the virtues of theft over honest toil
An impressive fit!
An even more impressive fit!
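A quick sketch of the worry (all choices here are arbitrary: 20 noisy samples from a sine curve, and polynomial degrees 3 and 19 standing in for the 126-coefficient fit above): the high-order polynomial fits the sample almost perfectly, but typically does far worse on fresh data.

    import numpy as np
    from numpy.polynomial import Polynomial

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 20)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)   # noisy observations

    x_new = np.linspace(0.0, 1.0, 200)                               # fresh inputs
    y_new = np.sin(2 * np.pi * x_new)                                # underlying curve

    for degree in (3, 19):
        p = Polynomial.fit(x, y, degree)
        train_mse = np.mean((y - p(x)) ** 2)
        new_mse = np.mean((y_new - p(x_new)) ** 2)
        print(f"degree {degree:2d}: training MSE {train_mse:.5f}, new-data MSE {new_mse:.5f}")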
Sense of moral outrage
Model must be fitted post-hoc
It would be different if I’d thought of it before the data arrived (cf empirical Bayes)
But order of data acquisition has no role in Bayes
Confirmation is just the same, whenever I thought of the model
Should the “cheating” model get a huge boost from this data?
The model gets a spectacular boost; but it is, a priori, even more spectacularly unlikely…
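A back-of-the-envelope version of this trade-off, with entirely made-up bit counts (nothing here is derived from real data):

    # assumed code lengths, in bits
    log2_prior_simple,  log2_lik_simple  = -20.0,   -60.0    # short model, modest fit
    log2_prior_complex, log2_lik_complex = -2016.0,  -1.0    # e.g. 126 parameters x 16 bits each, near-perfect fit

    log2_odds = (log2_prior_simple + log2_lik_simple) - (log2_prior_complex + log2_lik_complex)
    print(f"log2 posterior odds, simple : complex = {log2_odds:.0f} bits")   # about +1937

On these assumed numbers, the ~59-bit likelihood boost is swamped by the ~1996-bit prior penalty.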
We can discover that things are simpler than we thought (i.e., simplicity is not quite so subjective…)
Grouped: 6 + 1 vectors
Ungrouped: 6 × 2 vectors
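A toy code-length comparison for the two descriptions (the 16 bits per vector is an arbitrary assumption; only the 7-versus-12 vector count matters):

    bits_per_vector = 16                       # assumed cost of specifying one vector

    grouped = (6 + 1) * bits_per_vector        # 6 within-group positions + 1 group vector
    ungrouped = (6 * 2) * bits_per_vector      # 6 elements, each needing 2 vectors

    print(f"grouped: {grouped} bits, ungrouped: {ungrouped} bits")   # 112 vs 192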
Undergeneral grammars predict that good sentences are not allowed
just wait till one turns up
Overgeneral grammars predict that bad sentences are actually ok
Need negative evidence: say a bad sentence, and get corrected
And language acquisition: where it helps resolve an apparent learnability paradox
“Mere” non-occurrence of sentences is not enough…
…because almost all acceptable sentences also never occur
Backed up by formal results (Gold, 1967; though cf. Feldman, Horning et al.)
Argument for innateness?
The logical problem of language acquisition (e.g., Hornstein & Lightfoot, 1981; Pinker, 1979)
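A sketch of how simplicity can stand in for negative evidence (vocabulary size, sentence length, corpus size and the number of licensed strings below are all made-up): an overgeneral grammar licenses many more strings, so each observed sentence costs more bits under it, and positive evidence alone keeps favouring the tighter grammar.

    import math

    vocab, length, corpus_size = 10, 5, 100           # assumed toy numbers

    strings_tight = 100                                # strings licensed by the tight grammar
    strings_loose = vocab ** length                    # overgeneral grammar licenses all 10^5

    # assuming a flat code over licensed strings, each sentence costs log2(#licensed) bits
    bits_tight = corpus_size * math.log2(strings_tight)
    bits_loose = corpus_size * math.log2(strings_loose)
    print(f"tight grammar: {bits_tight:.0f} bits; overgeneral grammar: {bits_loose:.0f} bits")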
Linguistic environment: grammars
Measures of learning performance
Positive evidence only; computability
Simplicity
An “ideal” learning set-up (cf. ideal observers)
Method can be “scaled-down” to consider learnability of specific linguistic constructions
Generalization (strictly, confusability) is an exponential function of psychological “distance”
Shepard’s (1987) Universal Law
Shepard’s generalization measure, for “typical” items, and assuming items of roughly the same complexity, is an exponential function of psychological “distance”: the universal law
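A minimal numerical illustration of the law's shape (distances in arbitrary units, unit decay rate assumed):

    import numpy as np

    d = np.linspace(0.0, 5.0, 11)      # psychological distances (arbitrary units)
    g = np.exp(-d)                     # generalization / confusability, g = exp(-d)

    for di, gi in zip(d, g):
        print(f"distance {di:3.1f}   generalization {gi:.3f}")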
A simple array