
Information theory, MDL and human cognition

Nick Chater

Department of Psychology

University College London

[email protected]


Overview

  • Bayes and MDL: An overview

  • Universality and induction

  • Some puzzles about model fitting

  • Cognitive science applications


Bayes and MDL: A simplified story

  • Shannon’s coding theorem.

    • For a distribution Pr, the optimal code assigns a code of length -log2Pr(A) to event A

  • MDL model selection:

    • choose the model M that yields the shortest code for the data D, i.e., minimize:

    • -log2Pr(D, M)


A simple equivalence

  • Minimize:

    • -log2Pr(D, M)

  • Maximize:

    • Pr(D, M) = Pr(M|D)Pr(D)

  • Maximize:

    • Pr(M|D)

  • Just what Bayes recommends

    • (if choosing a single model)

  • The equivalence generalizes to parametric models M(θ), to ‘full’ Bayes, and in other ways

    • See Chater (1996) for an application to the simplicity and likelihood principles in perceptual organization
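
To make the equivalence concrete, here is a minimal Python sketch (my illustration, not from the talk; the model set and probabilities are invented) showing that minimising the joint code length -log2Pr(D, M) picks out the same model as maximising the posterior Pr(M|D).

```python
import math

# Toy example: two models and one observed dataset D.
# Pr(D, M) = Pr(D | M) * Pr(M); the numbers are invented for illustration.
prior = {"M1": 0.7, "M2": 0.3}
likelihood = {"M1": 0.02, "M2": 0.10}   # Pr(D | M)

joint = {m: likelihood[m] * prior[m] for m in prior}          # Pr(D, M)
code_length = {m: -math.log2(p) for m, p in joint.items()}    # bits, via Shannon's theorem

evidence = sum(joint.values())                                # Pr(D)
posterior = {m: joint[m] / evidence for m in joint}           # Pr(M | D)

mdl_choice = min(code_length, key=code_length.get)    # shortest code for (D, M)
bayes_choice = max(posterior, key=posterior.get)      # maximum a posteriori model

print(code_length, posterior)
print(mdl_choice == bayes_choice)   # True: MDL and MAP agree
```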


Codes or priors? Which comes first? 1. The philosophical issue

    • Bayesian viewpoint: Probabilities as basic

      • a calculus for degrees of belief (probability theory)

      • decision theory (probabilities meet action)

      • brain as a probabilistic calculation machine (whether belief propagation, dynamic programming…)


    • Simplicity/MDL viewpoint: Codes as basic

      • Rissanen: data is all there is; distributions are a fiction

      • Code structure is primary; code interpretation is secondary

      • Probabilities defined over events; but “events” are cognitive constructs

        • Leeuwenberg & Boselie (1988)


    Codes or priors? Which comes first? 2. The practical issue

    • Bayesian viewpoint:

      • Take probabilities as basic…

      • …when we know most about probability, e.g., image statistics

    • Simplicity/MDL viewpoint:

      • Take codes as basic…

      • …when we know most about representation, e.g., grammars


    Bayesian viewpoint (e.g., Geisler et al., 2001)

    Good continuation—most lines continue in the same direction in real images


    Simplicity/MDL viewpoint (e.g., Goldsmith, 2001)

    In, e.g., linguistics, representations are given by theory

    And we can roughly assess the complexity of grammars (by length)

    Not so clear how to set a prior directly over all grammars

    (though one can define a generative process in simple cases…)

    S → NP VP

    VP → V NP

    VP → V NP PP

    NP → Det Noun

    NP → NP PP

    “Binding constraints”

    Gzip as a handy approximation!?!
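
Since the slide half-jokingly suggests gzip as an approximation, here is a small Python sketch (my illustration, not from the talk) using zlib compressed length as a crude stand-in for description length: a highly regular string compresses far better than an irregular one of the same length.

```python
import os
import zlib

def description_length(s: str) -> int:
    """Crude proxy for code length: bytes after DEFLATE (zlib) compression."""
    return len(zlib.compress(s.encode("utf-8"), 9))

regular = "the cat saw the dog . " * 50          # highly repetitive "corpus"
irregular = os.urandom(len(regular) // 2).hex()  # incompressible-looking string, same length

print(len(regular), description_length(regular))      # long string, short description
print(len(irregular), description_length(irregular))  # same length, much longer description
```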



    Overview

    • Bayes and MDL: An overview

    • Universality and induction

    • Some puzzles about model fitting

    • Cognitive science applications


    The most neutral possible prior…

    Suppose we want a prior so neutral that it never rules out a model

    Possible, if we limit ourselves to computable models

    Mixture of all (computable) priors, with weights λi that decline fairly fast:

    m(x) = Σi λi Pri(x), with every λi > 0

    Then this mixture multiplicatively dominates all priors: m(x) ≥ λi Pri(x)

    (though such a neutral prior will mean slow learning)

    Such m(x) are “universal” priors
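
A minimal Python sketch of the mixture idea (my own toy, not from the talk): with a handful of candidate priors over binary strings and weights that decline geometrically, the mixture never assigns zero probability to anything some component allows, and m(x) ≥ λi·Pri(x) for every component i.

```python
from math import prod

def bernoulli_prior(theta):
    """Prior over binary strings: i.i.d. coin flips with bias theta."""
    return lambda x: prod(theta if b == "1" else 1 - theta for b in x)

# A small stand-in for "all computable priors": a few Bernoulli models.
components = [bernoulli_prior(t) for t in (0.1, 0.3, 0.5, 0.7, 0.9)]
weights = [2 ** -(i + 1) for i in range(len(components))]   # decline fast, sum < 1

def m(x):
    """Mixture 'prior': weighted sum of the component priors."""
    return sum(w * q(x) for w, q in zip(weights, components))

x = "1101110111"
for w, q in zip(weights, components):
    assert m(x) >= w * q(x)          # multiplicative domination of each component
print(m(x))
```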


    The most neutral possible coding language

    • Universal programming languages (Java, Matlab, UTMs, etc.)

    • K(x) = length of the shortest program (in Java, Matlab, on a UTM, …) that generates x (K is uncomputable)

    • Invariance theorem

      • for any languages L1, L2, there is a constant c such that,

      • for all x, |KL1(x) − KL2(x)| ≤ c

    • Mathematically justifies talk of K(x), not KJava(x), KMatlab(x),…


    So does this mean that choice of language doesn’t matter?

    • Not quite!

      • c can be large

    • And, for any L1 and any bound c0, there are a language L2 and an x such that

      • |KL1(x) − KL2(x)| ≥ c0

    • The problem of the one-instruction code for the entire data set…

    But Kolmogorov complexity can be made concrete…


    Compact Universal Turing machines

    Due to John Tromp (2007)

    210 bits: λ-calculus

    272 bits: combinators

    Not much room to hide, here!


    Neutral priors and Kolmogorov complexity

    A key result:

    K(x) = -log2 m(x) + O(1)

    where m is a universal prior

    (analogous to Shannon’s source coding theorem)

    And for any computable q,

    K(x) ≤ -log2 q(x) + O(1)

    for typical x drawn from q(x)

    Any data x that is probable under some sensible (computable) probability distribution has low K(x)


    Prediction by simplicity

    • Find shortest ‘program/explanation’ for current ‘corpus’ (binary string)

    • Predict using that program

      • Strictly, use ‘weighted sum’ of explanations, weighted by brevity
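
A toy Python sketch of the weighted-sum idea just mentioned (my illustration; the hypothesis set and code lengths are invented): candidate 'explanations' consistent with the observed string are weighted by 2^(-code length), and the next-symbol prediction is the brevity-weighted vote.

```python
# Candidate "explanations" for a binary corpus, each with a rough code length in bits
# (hypothetical values; in the full theory these come from a universal language).
hypotheses = [
    {"name": "alternate 0,1", "bits": 8,
     "fits": lambda s: ("01" * 50).startswith(s), "next": lambda s: ("01" * 50)[len(s)]},
    {"name": "repeat '010'",  "bits": 12,
     "fits": lambda s: ("010" * 50).startswith(s), "next": lambda s: ("010" * 50)[len(s)]},
    {"name": "all zeros",     "bits": 5,
     "fits": lambda s: set(s) <= {"0"}, "next": lambda s: "0"},
]

def predict_next(corpus: str) -> dict:
    """Weight each explanation consistent with the corpus by 2**(-code length)."""
    votes = {"0": 0.0, "1": 0.0}
    for h in hypotheses:
        if h["fits"](corpus):
            votes[h["next"](corpus)] += 2.0 ** -h["bits"]
    total = sum(votes.values())
    return {sym: v / total for sym, v in votes.items()}

print(predict_next("010"))   # '1' favoured: the briefer consistent explanation dominates
```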


    Prediction is possible (Solomonoff, 1978): summed error has a finite bound

    • sj is the squared error between the predicted and the true probability on item j

    • So the prediction error converges [faster than 1/(n log n)], for corpus size n

    • Computability assumptions only (no stationarity needed)
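
For reference, a hedged statement of the bound as it is usually given (following Solomonoff, 1978, and Li & Vitányi, 1997; the exact constant depends on the formulation): the expected squared prediction errors, summed over the whole corpus, are bounded by a constant times the complexity of the true computable distribution μ.

```latex
\sum_{j=1}^{\infty} s_j \;\le\; \frac{\ln 2}{2}\, K(\mu) \;<\; \infty,
\qquad
s_j \;=\; \sum_{x\,:\,|x| = j-1} \mu(x)\,\bigl( m(0 \mid x) - \mu(0 \mid x) \bigr)^{2}.
```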


    Summary so far…

    • Simplicity/MDL - close and deep connections with Bayes

    • Defines universal prior (i.e., based on simplicity)

    • Can be made “concrete”

    • General prediction results

    • A convenient “dual” framework to Bayes, when codes are easier than probabilities

    Li, M. & Vitányi, P. (1997). An Introduction to Kolmogorov Complexity and Its Applications (2nd ed.). Berlin: Springer.


    Overview

    • Bayes and MDL: An overview

    • Universality and induction

    • Some puzzles about model fitting

    • Cognitive science applications


    A problem of model selection? Or: why simplicity won’t go away

    • Where do priors come from?

      • Well, priors can be given by hyper-priors

      • And hyper-priors by hyper-hyper-priors

      • But it can’t go on forever!

    • And we need priors over models we’ve only just thought of

    • And, in some contexts, over models we haven’t yet thought of (!)

    • Code length in our representation language is a fixed basis

      • N.B. building probabilistic models = augmenting our language with new coding schemes


    The hidden role of simplicity…

    Bayesian model selection prefers the model

    y(x) = a2·x^2 + a1·x + a0

    not

    y(x) = a125·x^125 + a124·x^124 + … + a0

    But who says how many parameters a function has got??


    A trick…

    • Convert parameters to constants

      • y(x) = a125·x^125 + a124·x^124 + … + a0

        • 126 parameters

      • y(x) = .003·x^125 + .02·x^124 + … + 3x − 24.3

        • 0 parameters

    • And hence is favoured by Bayesian (and all other) model selection criteria

    All the virtues of theft over honest toil
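
A toy numerical illustration (mine, not from the talk) of why the trick games parameter-counting criteria: under a BIC-style score, a polynomial whose coefficients were secretly fitted post hoc but declared "constants" pays no complexity penalty at all and beats an honest low-order fit. (A degree-9 interpolant stands in for the degree-125 example to keep the arithmetic well conditioned.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
x = np.linspace(-1, 1, n)
y = 2 * x**2 + 1 + 0.1 * rng.standard_normal(n)     # data really generated by a quadratic

def bic(residual_ss: float, n_params: int) -> float:
    """BIC-style score: fit term plus a penalty per *declared* free parameter."""
    return n * np.log(residual_ss / n + 1e-300) + n_params * np.log(n)

# Honest quadratic fit: 3 free parameters.
quad = np.polyfit(x, y, deg=2)
rss_quad = float(np.sum((np.polyval(quad, x) - y) ** 2))

# "Trick" model: fit a degree-9 interpolating polynomial, then declare its
# coefficients to be fixed constants, i.e. claim 0 free parameters.
interp = np.polyfit(x, y, deg=9)
rss_trick = float(np.sum((np.polyval(interp, x) - y) ** 2))   # essentially zero

print("quadratic BIC:", bic(rss_quad, 3))
print("'constant' high-order BIC:", bic(rss_trick, 0))   # lower, so it "wins", absurdly
```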


    Zoubin’s problem for ML

    • The maximum-likelihood Gaussian is a delta function on one data point (the likelihood grows without bound as the variance shrinks to zero)
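
A quick Python sketch (my illustration) of the problem: centre a Gaussian on one observed data point and let the standard deviation shrink; the likelihood of that point grows without bound, so maximum likelihood "prefers" the degenerate delta-function solution.

```python
import math

def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

data_point = 1.7
for sigma in (1.0, 0.1, 0.01, 1e-6):
    # Likelihood of the single observed point under a Gaussian centred on it:
    print(sigma, gaussian_pdf(data_point, mu=data_point, sigma=sigma))
# The density diverges as sigma -> 0: ML picks the delta-function "model".
```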



    A related problem for Bayes?

    • The mixture of delta functions model (!)

    An even more impressive fit!


    Should the “cheating” model get a huge boost from this data?

    No!

    • Sense of moral outrage

    • The model must be fitted post hoc

    • It would be different if I’d thought of it before the data arrived (cf. empirical Bayes)

    Yes!

    • But the order of data acquisition has no role in Bayes

    • Confirmation is just the same, whenever I thought of the model

    The model gets a spectacular boost; but it is even more spectacularly unlikely…


    • So we need to take care with priors!

      • y = x

      • High prior; compact to state

      • y = .003·x^125 + .02·x^124 + … + 3x − 24

      • Low prior; not compact to state

    • With a different representation language, we could have the opposite bias

    • But we start from where we are: our actual representations

    We can discover that things are simpler than we thought (i.e., simplicity is not quite so subjective…)


    Overview

    • Bayes and MDL: An overview

    • Universality and induction

    • Some puzzles about model fitting

    • Cognitive science applications



    Here:

    • Perceptual organization

    • Language Acquisition

    • Similarity and generalization


    Long tradition of simplicity in perception (Mach, Koffka, Leeuwenberg); e.g., Gestalt laws

    [Figure: coding of the dot pattern in terms of position (x) and displacement (v) vectors. Grouped: 6 + 1 vectors; ungrouped: 6 × 2 vectors]


    And language acquisition: where it helps resolve an apparent learnability paradox

    Undergeneral grammars predict that good sentences are not allowed

    …just wait till one turns up

    Overgeneral grammars predict that bad sentences are actually OK

    Need negative evidence: say a bad sentence, and get corrected


    The logical problem of language acquisition (e.g., Hornstein & Lightfoot, 1981; Pinker, 1979)

    Without negative evidence, the learner can never eliminate overgeneral grammars

    “Mere” non-occurrence of sentences is not enough…

    …because almost all acceptable sentences also never occur

    Backed up by formal results (Gold, 1967; though cf. Feldman; Horning et al.)

    Argument for innateness?


    An “ideal” learning set-up (cf. ideal observers)

    [Diagram: a linguistic environment (grammars; positive evidence only; computability assumptions), a learning method (statistical, simplicity-based), and measures of learning performance]


    Overgeneralization Theorem (Chater & Vitányi)

    • Suppose the learner has probability εj of erroneously guessing an ungrammatical jth word

    • Then the summed error probabilities, Σj εj, have a finite bound

    • Intuitive explanation:

      • overgeneralization underloads the probabilities of grammatical sentences;

      • small probabilities imply longer code lengths
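
A toy Python sketch (my own, with invented numbers) of the intuition: if an overgeneral grammar spreads its probability over many sentences that never occur, the sentences that do occur get smaller probabilities and hence longer code lengths, so the overgeneral grammar loses the MDL comparison.

```python
import math

# Invented toy language: the corpus uses only 100 distinct sentence types.
corpus_size = 1_000

# Two grammars, each (for simplicity) spreading probability uniformly over the
# sentences it licenses.  The overgeneral grammar also licenses many "bad" sentences.
licensed = {"tight grammar": 100, "overgeneral grammar": 10_000}

for name, n_licensed in licensed.items():
    p_per_sentence = 1.0 / n_licensed                 # uniform over licensed sentences
    bits_per_sentence = -math.log2(p_per_sentence)    # Shannon code length
    total_bits = corpus_size * bits_per_sentence
    print(f"{name}: {bits_per_sentence:.2f} bits/sentence, {total_bits:.0f} bits for the corpus")

# The overgeneral grammar pays ~log2(10000/100) ≈ 6.6 extra bits per sentence:
# the absence of the "bad" sentences acts as implicit negative evidence.
```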


    Absence as implicit negative evidence

    • Overgeneral grammars predict missing sentences

    • And their absence is a clue that the grammar is wrong

    Method can be “scaled-down” to consider learnability of specific linguistic constructions


    Similarity and categorization

    • Cognitive dissimilarity: representational “distortion” required to get from x to y

      • DU(x,y) = K(y|x)

      • Not symmetrical

        • K(y|x) > K(x|y) when

        • K(y) > K(x)

          • Deletion is easy…
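
K(y|x) is uncomputable, but a rough compression-based stand-in (in the spirit of the gzip slide earlier; my illustration, not the talk's method) is C(xy) − C(x): the extra compressed bytes needed to describe y once x is available. The asymmetry shows up directly: going from the richer description back to the simpler one tends to be cheaper.

```python
import zlib

def c(s: str) -> int:
    """Compressed length in bytes: a crude stand-in for Kolmogorov complexity."""
    return len(zlib.compress(s.encode("utf-8"), 9))

def cond_complexity(y: str, x: str) -> int:
    """Rough stand-in for K(y|x): extra bytes needed for y once x is known."""
    return c(x + y) - c(x)

horse = "a horse: four legs, mane, tail, gallops, eats grass " * 20
pegasus = horse + "and it has large feathered wings and can fly " * 20   # horse plus extra structure

print("K(pegasus|horse) ~", cond_complexity(pegasus, horse))   # costly: must add the wings
print("K(horse|pegasus) ~", cond_complexity(horse, pegasus))   # cheap: deletion is easy
```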


    Shepard’s (1987) Universal Law

    Generalization (strictly, confusability) is an exponential (decay) function of psychological “distance”


    A derivation

    Shepard’s generalization measure

    for “typical” items

    Assuming items of roughly the same complexity

    The universal law
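
A hedged sketch of how the missing equations presumably fit together, following the information-distance treatment in Chater and Vitányi's work (my reconstruction, not the slide's own formulas): if confusability of y given x is governed by the universal distribution, and the items have roughly equal complexity, generalization falls off exponentially with the (roughly symmetric) distance.

```latex
% Confusability / generalization from x to y, via the universal distribution:
g(x \to y) \;\approx\; 2^{-K(y \mid x)}

% With K(x) \approx K(y), the conditional complexities are approximately symmetric,
% so writing a distance  D(x,y) \approx K(y \mid x) \approx K(x \mid y)  gives
g(x \to y) \;\approx\; 2^{-D(x,y)} \;=\; e^{-D(x,y)\,\ln 2},

% i.e. generalization decays exponentially with psychological distance: Shepard's law.
```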


    The asymmetry of similarity

    • What thing is this like?



    A heuristic measure of amount of information: Shannon’s guessing game…

    1. Pony?

    2. Cow?

    3. Dog?

    …

    345. Pegasus ✓

    345!


    Asymmetry of codelengths: asymmetry of similarity

    • Horse: guess #345 gets Pegasus. Pr(#345) is very small, so the code length -log2Pr(#345) is large.

    • Pegasus: guess #2 gets Horse. Pr(#2) is large, so the code length -log2Pr(#2) is very small.

    • So Pegasus is more like Horse, than Horse is like Pegasus

    • Many other examples of asymmetry, and many measures (search times, memory confusions…), which seem to fit this pattern


    Treisman & Souther (1985)

    A simple array


    A complex array


    Summary

    • MDL/Kolmogorov complexity: close relation with Bayes

    • Basis for a “universal” prior

    • Variety of applications to cognitive science

