
The Brain as a Statistical Information Processor

Presentation Transcript


  1. The Brain as a statistical Information Processor And you can too!

  2. My History and Ours [timeline figure with dates 1972, 1992, 2011]

  3. The Brain as a Statistical IP • Introduction • Evidence for Statistics • Bayes Law • Informative Priors • Joint Models • Inference • Conclusion

  4. Evidence for Statistics Two examples that seem to indicate that the brain is indeed processing statistical information

  5. Statistics for Word Segmentation • Saffran, Aslin, Newport. “Statistical Learning in 8-Month-Old Infants” • The infants listen to strings of nonsense words with no auditory cues to word boundaries. • E.g., “bidakupa …”, where “bidaku” is the first word. • They learn to distinguish words from other combinations that occur (with less frequency) over word boundaries.
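
The usual reading of this result is that infants track transitional probabilities between syllables: within a word P(next syllable | current syllable) is high, while across a word boundary it is low, and that statistic alone marks the boundaries. A minimal Python sketch of that idea; the syllable stream below is made up (only “bidaku” comes from the slide):

    from collections import Counter
    import random

    random.seed(0)

    # Three made-up nonsense "words" in the style of the Saffran et al. stimuli.
    words = ["bi-da-ku", "pa-do-ti", "go-la-bu"]
    stream = [syl for w in random.choices(words, k=200) for syl in w.split("-")]

    # Transitional probability P(next syllable | current syllable).
    pair_counts = Counter(zip(stream, stream[1:]))
    first_counts = Counter(stream[:-1])

    def trans_prob(a, b):
        return pair_counts[(a, b)] / first_counts[a]

    print(trans_prob("bi", "da"))   # within a word: 1.0
    print(trans_prob("ku", "pa"))   # across a word boundary: roughly 1/3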

  6. They Pay More Attention to Non-Words [figure: experimental setup with light, child, and speaker]

  7. Statistics in Low-level Vision • Based on Rosenholtz et al. (2011) [figure: panels A and B]

  8. Statistics in Low-level Vision • Based on Rosenholtz et al. (2011) [figure with the letters A N O B E L]

  9. Are summary statistics a good choice of representation? • A much better idea than spatial subsampling [figure labels: ~1000 pixels; original patch]

  10. Are summary statistics a good choice of representation? • A rich set of statistics can capture a lot of useful information [figure: original patch and a patch synthesized to match ~1000 statistical parameters (Portilla & Simoncelli, 2000)]
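
To make the idea concrete, here is a crude stand-in for a summary-statistics representation: reduce a patch to a fixed-length vector of marginal moments plus correlations of a few oriented-gradient responses. This is not the Portilla & Simoncelli statistic set, just a sketch of "pool a rich set of statistics instead of pixels":

    import numpy as np

    def summary_stats(patch, n_orientations=4):
        """Crude stand-in for a texture-statistics representation: marginal
        moments of the patch plus products of a few oriented-gradient
        responses (NOT the Portilla-Simoncelli statistic set)."""
        stats = [patch.mean(), patch.std(), ((patch - patch.mean()) ** 3).mean()]
        gy, gx = np.gradient(patch.astype(float))
        responses = [np.cos(t) * gx + np.sin(t) * gy
                     for t in np.linspace(0, np.pi, n_orientations, endpoint=False)]
        for i in range(n_orientations):
            for j in range(i, n_orientations):
                stats.append((responses[i] * responses[j]).mean())
        return np.array(stats)

    patch = np.random.rand(32, 32)        # stand-in for an image patch
    print(summary_stats(patch).shape)     # a small, fixed-length summary vector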

  11. Discrimination based on P&S stats predicts crowded letter recognition • Balas, Nakano, & Rosenholtz, JoV, 2009

  12. Bayes Law and Cognitive Science To my mind, at least, it packs a lot of information

  13. Bayes Law and Cognitive Science P(M|E) = P(M) P(E|M) / P(E) • M = learned model of the world • E = learner’s environment (sensory input)

  14. Bayes Law P(M|E) = P(M) P(E|M) / P(E) • It divides up responsibility correctly. • It requires a generative model (big, joint). • It (obliquely) suggests that, as far as learning goes, we ignore the programs that use the model. • But which M?

  15. Bayes Law Does not Pick M • Don’t pick M; integrate over all of them: P(E) = Σ_M P(M) P(E|M) • Pick the M that maximizes P(M) P(E|M). • Average over samples of M (Gibbs sampling).
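
A toy numerical sketch of the three options over a small discrete set of candidate models; the prior and likelihood numbers are made up for illustration:

    import random
    from collections import Counter

    random.seed(0)

    # Three candidate models M with a prior P(M) and a likelihood P(E|M) for
    # one fixed observation E.  All numbers are invented.
    prior      = {"M1": 0.5, "M2": 0.3, "M3": 0.2}
    likelihood = {"M1": 0.01, "M2": 0.20, "M3": 0.05}

    # Option 1: integrate M out: the marginal likelihood P(E) = sum_M P(M) P(E|M).
    p_E = sum(prior[m] * likelihood[m] for m in prior)

    # Bayes Law (slide 13): P(M|E) = P(M) P(E|M) / P(E).
    posterior = {m: prior[m] * likelihood[m] / p_E for m in prior}

    # Option 2: pick the single M that maximizes P(M) P(E|M)  (the MAP model).
    map_model = max(prior, key=lambda m: prior[m] * likelihood[m])

    # Option 3: average over samples of M drawn from the posterior instead of
    # committing to one model (what a Gibbs sampler would give you).
    samples = Counter(random.choices(list(posterior), weights=posterior.values(), k=1000))

    print(p_E, map_model, samples)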

  16. My Personal Opinion Don’t sweat it.

  17. Informative Priors Three examples where they are critical

  18. Parsing Visual Scenes (Sudderth, Jordan) [figure: scenes parsed into labeled regions: dome, sky, skyscraper, buildings, trees, temple, bell]

  19. Spatially Dependent Pitman-Yor • Cut random surfaces (samples from a GP) with thresholds (as in Level Set Methods). • Assign each pixel to the first surface which exceeds threshold (as in Layered Models). • Cf. Duan, Guindani, & Gelfand, Generalized Spatial DP, 2007.
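
A rough numerical sketch of this construction, not the actual Sudderth/Jordan model: draw a few smooth random surfaces over a pixel grid (here cheaply approximated by blurred white noise rather than exact GP samples) and give each pixel the label of the first surface that exceeds a threshold:

    import numpy as np

    rng = np.random.default_rng(0)

    def smooth_random_surface(size=64, width=8):
        """Cheap stand-in for a GP sample: white noise blurred with a
        separable box filter (just a smooth random field, not an exact GP)."""
        noise = rng.standard_normal((size, size))
        kernel = np.ones(width) / width
        for axis in (0, 1):
            noise = np.apply_along_axis(
                lambda v: np.convolve(v, kernel, mode="same"), axis, noise)
        return noise

    # Several surfaces; each pixel takes the label of the FIRST surface that
    # exceeds its threshold (the layered / level-set construction).
    n_layers, threshold = 5, 0.05
    surfaces = np.stack([smooth_random_surface() for _ in range(n_layers)])
    exceeds = surfaces > threshold
    # argmax over layers returns the first True; pixels where no surface
    # exceeds the threshold fall into a catch-all last label.
    labels = np.where(exceeds.any(axis=0), exceeds.argmax(axis=0), n_layers)
    print(np.bincount(labels.ravel(), minlength=n_layers + 1))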

  20. Samples from Spatial Prior [figure: samples from the spatial prior; comparison: Potts Markov Random Field]

  21. Prior for Word Segmentation • Based on the work of Goldwater et al. • Separate one “word” from the next in child-directed speech. • E.g., yuwanttusiD6bUk = “You want to see the book”

  22. Bag of Words • Generative story: For each utterance: pick a word w (or STOP) with probability P(w); if w = STOP, break; repeat. • If we pick M to maximize P(E|M), the model memorizes the data; i.e., it creates one “word” which is the concatenation of all the words in that sentence.
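
A minimal sketch of this generative story with a made-up mini lexicon (in the actual model the word probabilities are learned, not fixed in a table):

    import random

    random.seed(0)

    # Toy unigram ("bag of words") generative story for unsegmented speech.
    STOP = "<STOP>"
    lexicon = {"yu": 0.2, "want": 0.2, "tu": 0.2, "si": 0.1,
               "D6": 0.1, "bUk": 0.1, STOP: 0.1}

    def generate_utterance(p_word):
        words = []
        while True:
            w = random.choices(list(p_word), weights=p_word.values())[0]
            if w == STOP:
                return "".join(words)      # the observed form has no boundaries
            words.append(w)

    print(generate_utterance(lexicon))

    # Why maximum likelihood memorizes: if the "lexicon" contains each whole
    # utterance as a single word, every utterance needs only one word draw
    # plus one STOP, which can assign the data higher probability than any
    # genuine segmentation.  Hence the need for a prior (next slide: Dirichlet).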

  23. Results Using a Dirichlet Prior Precision: 61.6 Recall: 47.6 Example: youwant to see thebook

  24. Part-of-speech Induction • Primarily based on Clark (2003) • Given a sequence of words, deduce their parts of speech (e.g., DT, NN, etc.) • Generative story: for each word position i in the text, 1) propose a part-of-speech t_i with p(t_i | t_{i-1}); 2) propose a word w_i using p(w_i | t_i).
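
This is the generative story of a bigram hidden Markov model. A toy sampler, with made-up tag-transition and word-emission tables:

    import random

    random.seed(0)

    # Toy bigram HMM generative story for POS induction (all tables invented).
    START = "<s>"
    p_tag  = {"<s>": {"DT": 0.7, "NN": 0.2, "VB": 0.1},
              "DT":  {"NN": 0.9, "DT": 0.05, "VB": 0.05},
              "NN":  {"VB": 0.6, "NN": 0.3, "DT": 0.1},
              "VB":  {"DT": 0.5, "NN": 0.4, "VB": 0.1}}
    p_word = {"DT": {"the": 0.6, "a": 0.4},
              "NN": {"dog": 0.5, "bone": 0.5},
              "VB": {"likes": 0.5, "eats": 0.5}}

    def sample(d):
        return random.choices(list(d), weights=d.values())[0]

    def generate(n_words=5):
        prev, output = START, []
        for _ in range(n_words):
            t = sample(p_tag[prev])   # 1) propose tag t_i from p(t_i | t_{i-1})
            w = sample(p_word[t])     # 2) propose word w_i from p(w_i | t_i)
            output.append((w, t))
            prev = t
        return output

    print(generate())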

  25. Sparse Tag Distributions • We could put a Dirichlet prior on P(w|t). • But what we really want is sparse P(t|w). • Almost all words (by type) have only one part-of-speech. • We do best by only allowing this. • E.g., “can” is only a modal verb (we hope!). • Putting a sparse prior on P(word-type|t) also helps.
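
A small illustration, not Clark's actual model, of why a Dirichlet prior with a small concentration parameter encodes this kind of sparsity: draws from Dirichlet(0.01) put nearly all their mass on a single tag, while draws from Dirichlet(10) spread it out:

    import numpy as np

    rng = np.random.default_rng(0)
    n_tags = 10

    # Samples from a symmetric Dirichlet over tags for one word type.
    # Small concentration -> near-one-hot distributions (sparse P(t|w));
    # large concentration -> mass spread over many tags.
    for alpha in (0.01, 1.0, 10.0):
        draws = rng.dirichlet([alpha] * n_tags, size=1000)
        print(alpha, draws.max(axis=1).mean())  # average weight of the top tag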

  26. Joint Generative Modeling Two examples that show the strengths of modeling many phenomena jointly.

  27. Joint POS Tagging and Morphology • The Clark POS tagger also includes something sort of like a morphology model. • It assumes POS tags are correlated with spelling. • True morphology would recognize that “ride”, “riding”, and “rides” share a root. • I do not know of any true joint tagging-morphology model.

  28. Joint Reference and (Named) Entity Recognition • Based on Haghighi & Klein 2010 • Example: “Weiner said the problems were all Facebook’s fault. They should never have given him an account.” [figure: Type1 (person): Obama, Weiner, father; Type2 (organization): IBM, Facebook, company]

  29. Inference Otherwise known as hardware.

  30. It is not EM • More generally, it is not any mechanism that requires tracking all expectations. • Consider the word boundaries: between every two phonemes there may or may not be a boundary, so an utterance of n phonemes has 2^(n-1) possible segmentations. E.g., for “abcde”: abcde, a|bcde, ab|cde, abc|de, abcd|e, a|b|cde, …
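
A short enumeration sketch of that combinatorial point (toy string, brute-force enumeration only to show the count):

    from itertools import product

    def segmentations(s):
        """Enumerate all 2**(len(s)-1) ways to place word boundaries in s."""
        for pattern in product([False, True], repeat=len(s) - 1):
            out, start = [], 0
            for i, boundary in enumerate(pattern, start=1):
                if boundary:
                    out.append(s[start:i])
                    start = i
            out.append(s[start:])
            yield out

    segs = list(segmentations("abcde"))
    print(len(segs))    # 16 = 2**4
    print(segs[:3])     # e.g. ['abcde'], ['abcd', 'e'], ...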

  31. Gibbs Sampling • Start out with random guesses. • Do (roughly) forever: pick a random point; compute p(split) and p(join); pick a random r, 0 < r < 1; if p(split) / (p(split) + p(join)) > r, split, else join.
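
A minimal sketch of that loop for the word-segmentation case. The scoring function below is a made-up stand-in for P(M)P(E|M); the real sampler would use the segmentation model's predictive probabilities, but the split/join mechanics are the same:

    import math
    import random

    random.seed(0)

    utterance = "yuwanttusiD6bUk"
    # boundaries[i] == True means there is a word break after character i.
    boundaries = [random.random() < 0.5 for _ in range(len(utterance) - 1)]

    def words(bnd):
        out, start = [], 0
        for i, b in enumerate(bnd, start=1):
            if b:
                out.append(utterance[start:i])
                start = i
        out.append(utterance[start:])
        return out

    def score(bnd):
        """Toy stand-in for log P(M)P(E|M): favors small, reused lexicons.
        A real sampler would plug in the segmentation model here."""
        ws = words(bnd)
        return -sum(0.5 * len(w) + 1.0 for w in set(ws)) - 0.1 * len(ws)

    for _ in range(5000):                      # "do (roughly) forever"
        i = random.randrange(len(boundaries))  # pick a random potential boundary
        with_split, with_join = list(boundaries), list(boundaries)
        with_split[i], with_join[i] = True, False
        p_split = math.exp(score(with_split))
        p_join = math.exp(score(with_join))
        r = random.random()
        boundaries[i] = p_split / (p_split + p_join) > r

    print(words(boundaries))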

  32. Gibbs has Very Nice Properties

  33. Gibbs has Very Nice Properties

  34. It is not Gibbs Either • First, the nice properties only hold for “exchangeable” distributions, and it seems likely that most of the ones we care about are not (e.g., Haghighi & Klein). • But critically, it assumes we have all the training data at once and go over it many times.

  35. It is Particle Filtering • Or something like it. • At the level of detail here, just think “beam search.”

  36. Parsing and CKY Information Barrier [figure: parse tree (S (NP (NNS Dogs)) (VP (VBS like) (NP (NNS bones))))]

  37. It is Particle Filtering • Or something like it. • At the level of detail here, just think “beam search.” [figure: competing partial parses, e.g. “(ROOT”, “(ROOT (NP (NNS Dogs)”, “(ROOT (S (NP (NNS Dogs)”, “(ROOT (S (NP (NNS Dogs)) (VP (VBS eat)”]
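
A generic beam-search sketch, not the presenter's parser: as each token arrives, extend the current partial hypotheses and keep only the k best, which is the particle-filter-like behavior meant here. The tags and scoring table below are made up:

    import heapq

    def beam_search(tokens, extend, initial, beam_width=3):
        """Generic beam search: after each incoming token, keep only the
        `beam_width` best-scoring partial hypotheses.  `extend(hyp, tok)`
        must yield (new_hypothesis, score_increment) pairs."""
        beam = [(0.0, initial)]
        for tok in tokens:                    # data arrives left to right
            candidates = [(s + ds, h2)
                          for s, h in beam
                          for h2, ds in extend(h, tok)]
            beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
        return beam

    # Toy usage: "hypotheses" are tag sequences scored by an invented table.
    scores = {("dogs", "NN"): -0.1, ("dogs", "VB"): -2.0,
              ("eat", "VB"): -0.2, ("eat", "NN"): -1.5}

    def extend(hyp, tok):
        for tag in ("NN", "VB"):
            yield hyp + [tag], scores.get((tok, tag), -3.0)

    print(beam_search(["dogs", "eat"], extend, initial=[]))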

  38. Conclusion • The brain operates by manipulating probabilities. • World-model induction is governed by Bayes Law. • This implies we have a large joint generative model. • It seems overwhelmingly likely that we have a very informative prior. • Something like particle filtering is the inference/use mechanism.

  39. Thank You
