This session delves into the use of mixture models within category learning, especially in the context of speech perception. Building on previous discussions surrounding the G&G model and non-parametric approaches, we will explore statistical estimation techniques, such as maximum likelihood estimation and the Expectation-Maximization algorithm. Key applications like learning categories from distributions, analyzing speech segmentation, and parameter tuning are also covered. Expect to gain insights into how statistical methods can enhance our understanding of category learning processes, particularly in infants.
LING 696B: Mixture model and its applications in category learning
Recap from last time • G&G model: a self-organizing map (neural net) that does unsupervised learning • Non-parametric approach: encoding the stimulus distribution with a large number of connection weights
Questions from last time • Scaling up to the speech that infants hear: higher dimensions? • Extending the G&G network in Problem 3 to Maye & Gerken’s data? • Speech segmentation: going beyond static vowels? • Model behavior: parameter tuning, starting points, degree of model fitting, when to stop …
Today’s agenda • Learning categories from distributions (Maye & Gerken) • Basic ideas of statistical estimation, consistency, maximum likelihood • Mixture model, learning with the Expectation-Maximization algorithm • Refinement of the mixture model • Applications in the speech domain
Learning categories after minimal pairs • Idea going back as early as Jakobson (1941): • knowing /bin/~/pin/ implies [voice] as a distinctive feature • [voice] differentiates /b/ and /p/ as two categories of English • Moreover, this predicts the order in which categories are learned • Completely falsified? (small project) • Obvious objection: early words don’t include many minimal pairs
Maye & Gerken, 00 • Categories can be learned from statistics, just like learning statistics from sequences • Choice of artificial contrast: English d and (s)t • Small difference in voicing and F0 • Main difference: F1, F2 onset
Detecting the d~(s)t contrast in Pegg and Werker, 97 • Most adults can do this, but not as well as with a native contrast • 6-8-month-olds do much better than 10-12-month-olds • (Need more than distributional learning?)
Maye & Gerken, 00 • Training on monomodal vs. bimodal distributions • Both groups heard the same number of stimuli
Maye & Gerken, 00 • Results from Maye’s thesis
Maye, Gerken & Werker, 02 • Similar experiment done more carefully on infants • Preferential looking time • Alternating and non-alternating trials
Maye, Gerken & Werker, 02 • Bimodal-trained infants look longer at alternating trials than at non-alternating trials -- the difference is significant • For monomodal-trained infants the difference is not significant
Reflections • The dimension along which the bimodal distribution differs from the monomodal one is abstract • The shape of the distribution is also hard to characterize • Adults/infants are not told what categories there are to learn • Nor do they know how many categories to learn • Machine learning does not have satisfying answers to all these questions
Statistical estimation • Basic setup: • The world: distributions p(x; θ), where θ is a set of free parameters -- “all models may be wrong, but some are useful” • Given a parameter θ, p(x; θ) tells us how to calculate the probability of x (also referred to as the “likelihood” p(x|θ)) • Observations: X = {x1, x2, …, xN} generated from some p(x; θ), where N is the number of observations • Model-fitting: based on the examples X, make guesses (learning, inference) about θ
Statistical estimation • Example: • Assume people’s heights follow a normal distribution, so θ = (mean, variance) • p(x; θ) = the probability density function of the normal distribution • Observations: measurements of people’s heights • Goal: estimate the parameters of the normal distribution
Statistical estimation: Hypothesis space matters • Example: curve fitting with polynomials
Criterion of consistency • Many model-fitting criteria • Least squares • Minimal classification errors • Measures of divergence, etc. • Consistency: as you get more and more data x1, x2, …, xN (N → ∞), your model-fitting procedure should produce an estimate that gets closer and closer to the true θ that generated X
Maximum likelihood estimate (MLE) • Likelihood function: the examples xi are assumed independent of one another, so the likelihood L(θ) is the product of the individual terms p(xi; θ) • Among all possible values of θ, choose the value for which L(θ) is biggest • The MLE is consistent
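In symbols, for independent observations (standard definitions):

```latex
L(\theta) = \prod_{i=1}^{N} p(x_i; \theta),
\qquad
\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} L(\theta)
  = \arg\max_{\theta} \sum_{i=1}^{N} \log p(x_i; \theta)
```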
MLE for Gaussian distributions • Parameters: mean and variance • Distribution function: the normal density (written out below) • MLE for mean and variance: the sample mean and the sample variance (written out below) • Exercise: derive this result in 2 dimensions
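For reference, the standard one-dimensional formulas (the normal density and its maximum-likelihood estimates):

```latex
p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
\qquad
\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i,
\qquad
\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat{\mu})^2
```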
Mixture of Gaussians • An extension of Gaussian distributions to handle data containing categories • Example: a mixture of 2 Gaussian distributions • More concrete example: the heights of males and females follow two different distributions, but we don’t know the gender behind any given measurement
Mixture of Gaussians • More parameters • Parameters of the two Gaussians: (μ1, σ1) and (μ2, σ2) -- two categories • The “mixing” proportion: call it ω, with 0 ≤ ω ≤ 1 • How are data generated? • Throw a coin that comes up heads with probability ω • If it lands heads, generate an example from the first Gaussian; otherwise generate one from the second
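A minimal sketch of this generative process in Python (the function name and the parameter values, including the F2-onset-like means, are illustrative choices, not taken from the course materials):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters: mixing proportion and two 1-D Gaussians
omega = 0.5                 # probability of category 1
mu = [680.0, 1020.0]        # hypothetical means (e.g., F2-onset-like values for /d/ vs. /t/)
sigma = [60.0, 60.0]        # standard deviations

def sample_mixture(n):
    """Generate n points from a two-component Gaussian mixture."""
    x = np.empty(n)
    for i in range(n):
        # Throw a coin with heads probability omega
        k = 0 if rng.random() < omega else 1
        # Generate the observation from the chosen Gaussian
        x[i] = rng.normal(mu[k], sigma[k])
    return x

data = sample_mixture(500)
```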
Maximum likelihood: Supervised learning • Seeing data x1, x2, …, xN (heights) as well as their category memberships y1, y2, …, yN (male or female) • MLE: • For each Gaussian, estimate (μk, σk) from the members of that category • ω is estimated as (number of examples in category 1) / N
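In symbols (the standard supervised estimates for a two-component mixture; the notation matches the reconstruction above):

```latex
\hat{\omega} = \frac{\#\{i : y_i = 1\}}{N},
\qquad
\hat{\mu}_k = \frac{1}{N_k} \sum_{i:\, y_i = k} x_i,
\qquad
\hat{\sigma}_k^2 = \frac{1}{N_k} \sum_{i:\, y_i = k} (x_i - \hat{\mu}_k)^2,
\qquad
N_k = \#\{i : y_i = k\}
```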
Maximum likelihood: Unsupervised learning • Only seeing the data x1, x2, …, xN, with no idea about the category memberships yi or the mixing proportion ω • Must estimate all the parameters (μ1, σ1, μ2, σ2, ω) based on X only • Key idea: relate this problem to supervised learning
The K-means algorithm • Clustering algorithm for designing “codebooks” (vector quantization) • Goal: dividing data into K clusters and representing each cluster by its center • First: random guesses about cluster membership (among 1,…,K)
The K-means algorithm • Then iterate: • Update the center of each cluster to the mean of the data currently belonging to that cluster • Re-assign each datum to the cluster whose center is closest • After a number of iterations, the assignments no longer change (see the sketch below)
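A minimal K-means sketch in Python following the two steps above (function and variable names are mine, not from the course materials):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Cluster the rows of X (shape n x d) into K groups by the two-step iteration above."""
    rng = np.random.default_rng(seed)
    # First: random guesses about cluster membership (among 0, ..., K-1)
    labels = rng.integers(K, size=len(X))
    centers = np.zeros((K, X.shape[1]))
    for _ in range(n_iter):
        # Update step: each center becomes the mean of the data assigned to it
        for k in range(K):
            members = X[labels == k]
            # Guard against an empty cluster by re-seeding it from a random datum
            centers[k] = members.mean(axis=0) if len(members) else X[rng.integers(len(X))]
        # Assignment step: each datum joins the cluster with the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # assignments have stopped changing
            break
        labels = new_labels
    return centers, labels
```

For the one-dimensional d~(s)t example, X would simply be the stimulus measurements reshaped into a single column (e.g. data.reshape(-1, 1)) with K = 2.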
K-means demo • Data generated from mixture of 2 Gaussians with mixing proportion 0.5
Why does K-means work? • In the beginning the centers are poorly chosen, so the clusters overlap a lot • But if the centers move away from each other, the clusters tend to separate better • Conversely, if the clusters are well separated, the centers will stay away from each other • Intuitively, the two steps “help each other”
Expectation-Maximization algorithm • Replacing the “hard” assignments in K-means with “soft” assignments • Hard: (0, 1) or (1, 0) • Soft: (p(/t/ | x), p(/d/ | x)), e.g. (0.5, 0.5)
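The soft label is just the posterior probability of each category under the current parameter guesses, i.e. Bayes’ rule applied to the mixture (here category 1 is identified with /t/, using the ω, μ, σ notation introduced above):

```latex
w_i = p(\text{/t/} \mid x_i)
    = \frac{\omega \, \mathcal{N}(x_i; \mu_1, \sigma_1^2)}
           {\omega \, \mathcal{N}(x_i; \mu_1, \sigma_1^2)
            + (1 - \omega) \, \mathcal{N}(x_i; \mu_2, \sigma_2^2)},
\qquad
p(\text{/d/} \mid x_i) = 1 - w_i
```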
Expectation-Maximization algorithm • Start from initial guesses: parameters for the /t/ and /d/ Gaussians and a mixing proportion ω0 = 0.5 • Expectation step: stick a “soft” label -- a pair (wi, 1-wi) -- onto each example • In the running example, three stimuli end up with soft labels [0.5 t, 0.5 d], [0.3 t, 0.7 d] and [0.1 t, 0.9 d]
Expectation-Maximization algorithm • Maximization step: go back and update the model by maximum likelihood, with each example weighted by its soft labels • In the running example, the new mixing proportion is ω1 = (0.5 + 0.3 + 0.1)/3 = 0.3
Common intuition behind K-means and EM • The labels are important, yet not observable -- “hidden variables” / “missing data” • Strategy: make probability-based guesses, then iterate guess-and-update until convergence • K-means: hard guesses among 1,…,K • EM: soft guesses (w1,…,wK) with w1+…+wK = 1 (see the sketch below)
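A compact sketch of EM for a two-component, one-dimensional Gaussian mixture, matching the E- and M-steps walked through above (an illustrative implementation, not code from the course):

```python
import numpy as np
from scipy.stats import norm

def em_two_gaussians(x, n_iter=100):
    """Fit a two-component 1-D Gaussian mixture to x by Expectation-Maximization."""
    # Crude initial guesses: equal mixing, means spread around the data mean
    omega = 0.5
    mu = np.array([x.mean() - x.std(), x.mean() + x.std()])
    sigma = np.array([x.std(), x.std()])
    for _ in range(n_iter):
        # E-step: soft labels w[i] = p(category 1 | x[i]) under the current parameters
        p1 = omega * norm.pdf(x, mu[0], sigma[0])
        p2 = (1 - omega) * norm.pdf(x, mu[1], sigma[1])
        w = p1 / (p1 + p2)
        # M-step: weighted maximum-likelihood updates (each datum counts w[i] toward
        # category 1 and 1 - w[i] toward category 2)
        omega = w.mean()
        mu = np.array([np.average(x, weights=w),
                       np.average(x, weights=1 - w)])
        sigma = np.sqrt(np.array([np.average((x - mu[0]) ** 2, weights=w),
                                  np.average((x - mu[1]) ** 2, weights=1 - w)]))
    return omega, mu, sigma
```

Run on data like that produced by the earlier sampling sketch, this should recover parameters close to the generating ones, up to label swapping and the local-maximum issues discussed below.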
Thinking of this as an exemplar-based model • Johnson’s (1997) exemplar model of categories: • When a new stimulus comes in, its membership is jointly determined by all previously memorized exemplars -- this is the E-step • After the new stimulus is memorized, the “weight” of each exemplar is updated -- this is the M-step
Convergence guarantee of EM • E-step: choose a lower bound of L(θ) that touches L(θ) at the current parameter value • M-step: move to the maximum of this lower bound, which is always ≤ L(θ) • Repeating the E-step then gives a new lower bound, so L(θ) never decreases across iterations
Local maxima • What if you start from a bad initial guess? EM climbs to the nearest local maximum of L(θ), which need not be the global one
Overcoming local maxima: Multiple starting points • Run EM from several different starting points and keep the solution with the highest likelihood
Overcoming local maxima: Model refinement • Guessing 6 categories at once is hard, but guessing 2 is easy • Hill-climbing strategy: start with 2, then 3, 4, … • Implementation: split the cluster that gives the maximum gain in likelihood (see the sketch below) • Intuition: discriminate within the biggest pile
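A sketch of this splitting strategy using scikit-learn’s GaussianMixture (the growth loop and perturbation scheme are my own illustration of the idea, not the course’s implementation):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def grow_mixture(X, K_max=6, seed=0):
    """Grow a Gaussian mixture from 2 up to K_max components by splitting clusters."""
    rng = np.random.default_rng(seed)
    gmm = GaussianMixture(n_components=2, random_state=seed).fit(X)
    while gmm.n_components < K_max:
        best = None
        for k in range(gmm.n_components):
            # Candidate: split component k by duplicating its mean with a small perturbation
            means = np.vstack([gmm.means_,
                               gmm.means_[k] + 0.1 * rng.standard_normal(X.shape[1])])
            cand = GaussianMixture(n_components=gmm.n_components + 1,
                                   means_init=means, random_state=seed).fit(X)
            # Keep the split that yields the largest gain in (average log-)likelihood
            if best is None or cand.score(X) > best.score(X):
                best = cand
        gmm = best
    return gmm
```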