

## LING 696B: Phonotactics wrap-up, OT, Stochastic OT



**Remaining topics**
• 4 weeks to go (including the day before Thanksgiving):
  • Maximum-entropy as an alternative to OT (Jaime)
  • Rule induction (Mans) + decision tree
  • Morpho-phonological learning (Emily) and multiple generalizations (LouAnn’s lecture)
  • Learning and self-organization (Andy’s lecture)

**Towards a parametric model of phonotactics**
• Last time: simple sequence models with some simple variations
• Phonological generalization needs much more than this
• Different levels:
  • Natural classes: Bach + ed = ?; onset sl/*sr, *shl/shr
  • Also: position, stress, syllable, …
• Different ranges: seems to be unbounded
  • Hungarian (Hayes & Londe): ablak-nak / kert-nek; paller-nak / mutagen-nek
  • English: *sCVC, *sNVN (skok? spab? smin?)

**Towards a parametric model of phonotactics**
• Parameter explosion seems unavoidable
  • Searching over all possible natural classes?
  • Searching over unbounded ranges?
• Data sparsity is a serious problem
  • Esp. if counting type rather than token frequency
• Isolate generalization at specific positions/configurations with templates
  • Need theory for templates (why sCVC?)
  • Templates for everything?
  • Non-parametric/parametric boundary blurred

**Towards a parametric model of phonotactics**
• Critical survey of literature needed
  • How can phonological theory constrain parametric models of phonotactics?
• Homework assignment (counts as 2-3): a phonotactics literature review
  • E.g. V-V, C-C, V-C interaction, natural classes, positions, templates, …
  • Extra credit if you also present ideas about how they relate to modeling

**OT and phonological acquisition**
• Isn’t data sparsity already a familiar issue?
• Old friend: “poverty of stimulus” -- training data vastly insufficient for learning the distribution (recall: the limit sample size 0)
• Maybe the view is wrong: forget distribution in a certain language, focus on universals
• Standard OT: generalization hard-coded, abandon the huge parameter space
  • Justification: only consider the ones that are plausible/attested
• Learning problem made easier?

**OT learning: constraint demotion**
• Example: English (sibilant+liquid) onset
• Somewhat motivated constraints: *sh+C, *sr, Ident(s), Ident(sh). Starting equal.
• Demote constraints that prefer the wrong guys

*Example adapted from A. Albright

**OT learning: constraint demotion**
• Now, pass shleez/sleez to the learner
• No negative evidence: shl never appeared in English
• Conservative strategy: underlying form same as the surface by default (richness of the base)

**Biased constraint demotion (Hayes, Prince & Tesar)**
• Why the wrong generalization?
  • Faithfulness -- Ident(sh) is high, therefore allowing underlying sh to appear everywhere
• In general: high faithfulness leads to “too much” generalization in OT
  • Cf. the subset principle
• Recipe: keep faithfulness as low as possible, unless evidence suggests otherwise
• Hope: learn the “most restrictive” language
  • What kind of evidence?

**Remarks on OT approaches to phonotactics**
• The issues are never-ending
  • Not enough to put all F low; which F is low also matters (Hayes)
• Mission accomplished?
-- Are we almost getting the universal set of F and M?
• Even with hard-coded generalization, it still takes considerable work to fill all the gaps (e.g. sC/shC, *tl/*dl)
• Why does bwa sound better than tla? (Moreton)

**Two worlds**
• Statistical model and OT seem to ask different questions about learning
• OT/UG: what is possible/impossible?
  • Hard-coded generalizations
  • Combinatorial optimization (sorting)
• Statistical: among the things that are possible, what is likely/unlikely?
  • Soft-coded generalizations
  • Numerical optimization
• Marriage of the two?

**OT and variation**
• Motivation: systematic variation that leads to conflicting generalizations
• Example: Hungarian again (Hayes & Londe)

**Proposals on getting OT to deal with variation**
• Partial order rather than total order of constraints (Anttila)
  • Doesn’t predict which outcome is more likely than others
• Floating constraints (historical OT people)
  • Can’t really tell what the range is
• Stochastic OT (Boersma, Hayes)
  • Does produce a distribution
  • Moreover, a generative model
  • Somewhat unexpected complexity

**Stochastic OT**
• Want to set up a distribution to learn. But distribution over what?
• GEN? -- This does not lead to conflicting generalizations from a fixed ranking
• One idea: distribution over all grammars (also see Yang’s P&P framework)
• How many OT grammars? -- N!
• Lots of distributions are junk, e.g. (1,2,…,N) ~ 0.5, (N,N-1,…,1) ~ 0.5; everything else zero
• Idea: constrain the distribution over N!
grammars with (N-1) ranking values

**Stochastic Optimality Theory: Generation**
• Canonical OT: C1<<C3<<C2
• Stochastic OT: sample and evaluate ordering C1 C3 C2

**What is the nature of the data?**
• Unlike previous generative models, here the data is relational
• Candidates have been “pre-digested” as violation vectors
• Candidate pairs (+ frequency) contain information about the distribution over grammars
• Similar scenario: estimating numerical (0-100) grades from letter grades (A-F)

**Stochastic Optimality Theory: Learning**
• Canonical OT: (C1>>C3) (C2>>C3)
• Stochastic OT “ranking values”: G = (μ_1, …, μ_N) ∈ R^N
• Ordinal data (D) ???
  • max {C1, C2} > C3 ~ .77
  • max {C1, C2} < C3 ~ .23

**Gradual Learning Algorithm (Boersma & Hayes)**
• Two goals
  • A robust method for learning standard OT (note: arbitrary noise-polluted OT ranking is a graph cut problem -- NP)
  • A heuristic for learning Stochastic OT
• Example: mini grammar with variation

**How does GLA work**
• Repeat many times (forced to stop)
  • Pick a winner by throwing a die according to P(.)
  • Adjust constraints by a small value if the prediction doesn’t match the picked winner
• Similar to training neural nets
  • “Propagate” error to the ranking values
  • Some randomness is involved in getting the error

**GLA is stochastic local search**
• Stochastic local search: incomplete methods, often work well in practice (esp.
for intractable problems), but no guarantee
• Need something that works in general

**GLA as random walk**
• Fix the update values, then GLA behaves like a “drunken man”:
  • Probability of moving in each direction only depends on where you are
  • In general, does not “wander off”

[Figure: possible moves for GLA; axis labels: ranking value of *[+voi], Ident(voi)]

**Stationary distributions**
• Suppose we have a zillion GLAs running around independently, and look at their “collective answer”
• If they don’t wander off, then this answer doesn’t change much after a while -- convergence to the stationary distribution
• Equivalent to looking at many runs of just one program

**The Bayesian approach to learning Stochastic OT grammars**
• Key idea: simulating a distribution with computer power
• What is a meaningful stationary distribution?
  • The posterior distribution p(G|D) -- peaks at grammars that explain the data well
• How to construct a random walk that will eventually reach p(G|D)?
  • Technique: Markov-chain Monte-Carlo

**An example of Bayesian inference**
• Guessing the heads-on probability of a bent coin from the outcome of coin tosses

[Figure panels: prior; posterior after seeing 1 head; posterior after seeing 10 heads; posterior after seeing 100 heads]

**Why Bayesian? Maximum-Likelihood difficult**
• Need to deal with product of integrals!
• Likelihood of d: “max {C1, C2} > C3”
• No hope this can be done in a tractable way
• Bayesian method gets around doing calculus altogether

**Data Augmentation Scheme for Stochastic OT**
• Paradoxical aspect: “more is easier”
• “Missing data” (Y): the real values of constraints that generate the ranking d
• Idea: simulating P(G,Y|D) is easier than simulating P(G|D)

[Figure labels: G -- grammar; Y -- missing data; d: “max {C1, C2} > C3”]

**Gibbs sampling for Stochastic OT**
• p(G|Y,D) = p(G|Y) is easy: sampling mean from normal posterior
  • Random number generation: P(G|Y) ~ P(Y|G)P(G)
• p(Y|G,D) can also be done: fix each d, then sample Y from G, so that d holds -- use rejection sampling
  • Another round of random generation
• Gibbs sampler: iterate, and get p(G,Y|D) -- works in general

**Bayesian simulation: No need for integration!**
• Once we have samples (g,y) ~ p(G,Y|D), g ~ p(G|D) is automatic
• Use a few starting points to monitor convergence

[Figure labels: joint p(G,Y|D); just keep the G’s; marginal p(G|D)]

**Result: Stringency Hierarchy**
• Posterior marginal of the 3 constraints

[Figure labels: Ident(voice); *VoiceObs(coda); *VoiceObs; grammar used for generation]

**Conditional sampling of parameters p(G|Y,D)**
• Given Y, G is independent of D.
So p(G|Y,D) = p(G|Y)
• Sampling from p(G|Y) is just regular Bayesian statistics: p(G|Y) ~ p(Y|G)p(G)
  • p(Y|G) is normal with mean \bar{y} and variance \sigma^2/m
  • p(G) is chosen to have infinite variance -- an “uninformative” prior

**Conditional sampling of missing data p(Y|G,d)**
• Idea: decompose Y into (Y_1, …, Y_N), and sample one at a time
• Example: d = “max {C1, C2} > C3”
• Easier than !

**Conditional sampling of missing data p(Y|G,d)**
• form a random walk in R^3 that approximates

**Sampling tails of Gaussians**
• Direct sampling can be very slow
• For efficiency: rejection sampling with exponential density envelope

[Figure labels: target; envelope; need samples from tail; shape of envelope optimized for minimal rejection rate]

**Ilokano-like grammar**
• Is there a grammar that will generate p(.)?
  • Not obvious, since the interaction is not pair-wise. GLA always slightly off

**Summary**
• Two perspectives on the randomized learning algorithm
  • A Bayesian statistics simulation
  • A general stochastic search scheme
• Bayesian methods often provide approximate solutions to hard computational problems
  • The solution is exact if allowed to run forever
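
A minimal Python sketch of the Stochastic OT generation step and the GLA update described in the "Stochastic Optimality Theory: Generation" and "How does GLA work" slides. The toy tableau, the function names, the noise standard deviation (2.0), and the plasticity (0.1) are all assumptions made for illustration and do not come from the lecture. The tableau is built so that candidate "A" wins exactly when max {C1, C2} > C3, and the update follows the usual promote/demote convention: constraints that penalize the learner's wrong output more are raised, those that penalize the observed winner more are lowered.

```python
import random

# Hypothetical toy tableau: candidate -> {constraint: number of violations}.
# "A" wins exactly when C3 is not top-ranked, i.e. when max{C1, C2} > C3.
CANDIDATES = {
    "A": {"C3": 1},
    "B": {"C1": 1, "C2": 1},
}


def sample_ranking(ranking_values, noise_sd, rng):
    """Add Gaussian noise to each ranking value; return constraints, highest first."""
    noisy = {c: v + rng.gauss(0.0, noise_sd) for c, v in ranking_values.items()}
    return sorted(noisy, key=noisy.get, reverse=True)


def evaluate(candidates, ranking):
    """Standard OT evaluation: lexicographic comparison of violation profiles."""
    def profile(cand):
        return [candidates[cand].get(c, 0) for c in ranking]
    return min(candidates, key=profile)


def gla_step(ranking_values, candidates, observed, rng,
             plasticity=0.1, noise_sd=2.0):
    """One GLA update: predict a winner; on a mismatch, nudge ranking values."""
    predicted = evaluate(candidates, sample_ranking(ranking_values, noise_sd, rng))
    if predicted != observed:
        for c in ranking_values:
            diff = candidates[predicted].get(c, 0) - candidates[observed].get(c, 0)
            if diff > 0:      # constraint penalizes the wrong prediction: promote
                ranking_values[c] += plasticity
            elif diff < 0:    # constraint penalizes the observed winner: demote
                ranking_values[c] -= plasticity
    return predicted


def train(ranking_values, candidates, p_observed, rng, steps=5000):
    """Repeatedly draw an observed winner from p_observed and apply gla_step."""
    winners = list(p_observed)
    weights = [p_observed[w] for w in winners]
    for _ in range(steps):
        observed = rng.choices(winners, weights=weights)[0]
        gla_step(ranking_values, candidates, observed, rng)
    return ranking_values


def estimate(ranking_values, candidates, winner, rng, n=2000, noise_sd=2.0):
    """Monte Carlo estimate of how often `winner` is generated by the grammar."""
    hits = sum(
        evaluate(candidates, sample_ranking(ranking_values, noise_sd, rng)) == winner
        for _ in range(n)
    )
    return hits / n


if __name__ == "__main__":
    rng = random.Random(0)
    values = {"C1": 100.0, "C2": 100.0, "C3": 100.0}
    train(values, CANDIDATES, {"A": 0.77, "B": 0.23}, rng)
    print(values, estimate(values, CANDIDATES, "A", rng))
```

With a fixed plasticity the ranking values keep jittering around a region where the generated frequency of "A" is near the observed one, which is the random-walk picture from the "GLA as random walk" slide; no convergence guarantee is implied.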

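
The "Sampling tails of Gaussians" slide mentions rejection sampling with an exponential envelope. The sketch below shows one standard construction of that idea: a shifted-exponential proposal whose rate is chosen to minimize the rejection rate, as in Robert's (1995) truncated-normal sampler. The slides do not say which envelope was actually used, so treat this as an assumption; the function names are hypothetical. A draw of this kind is what a step of p(Y|G,d) needs when, say, the value of C1 must exceed the current value of C3.

```python
import math
import random


def sample_std_normal_tail(a, rng):
    """Draw z ~ N(0, 1) conditioned on z > a, for a > 0.

    Proposal: a + Exponential(rate=alpha). The rate below is the one that
    minimizes the rejection rate for this family of envelopes.
    """
    alpha = (a + math.sqrt(a * a + 4.0)) / 2.0
    while True:
        z = a + rng.expovariate(alpha)
        # Acceptance probability is target/envelope, scaled so its maximum is 1.
        if rng.random() <= math.exp(-0.5 * (z - alpha) ** 2):
            return z


def sample_normal_above(mu, sd, lower, rng):
    """Draw y ~ N(mu, sd^2) conditioned on y > lower."""
    a = (lower - mu) / sd
    if a <= 0.0:
        # Bound is at or below the mean: naive rejection accepts at least half.
        while True:
            y = rng.gauss(mu, sd)
            if y > lower:
                return y
    # Bound is in the upper tail: direct sampling would be slow, use the envelope.
    return mu + sd * sample_std_normal_tail(a, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    # Example: a constraint with ranking value 100 forced to exceed 106 (3 sd out).
    print([round(sample_normal_above(100.0, 2.0, 106.0, rng), 2) for _ in range(5)])
```

Naive sampling at three standard deviations out would accept roughly one draw in 700, which is the slowness the slide refers to; the envelope version accepts most proposals.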