A Bayesian approach to word segmentation: Theoretical and experimental results

A Bayesian approach to word segmentation: Theoretical and experimental results Sharon Goldwater Department of Linguistics Stanford University

Word segmentation • One of the first problems infants must solve when learning language. • Infants make use of many different cues. • Phonotactics, allophonic variation, metrical (stress) patterns, effects of coarticulation, and statistical regularities in syllable sequences. • Statistics may provide initial bootstrapping. • Used very early (Thiessen & Saffran, 2003). • Language-independent.

Modeling statistical segmentation • Previous work often focuses on how statistical information (e.g., transitional probabilities) can be used to segment speech. • Bayesian approach asks what information should be used by a successful learner. • What statistics should be collected? • What assumptions (by the learner) constrain possible generalizations?

Outline 1. Computational model and theoretical results • What are the consequences of using different sorts of information for optimal word segmentation? (joint work with Tom Griffiths and Mark Johnson) 2. Modeling experimental data • Do humans behave optimally? (joint work with Mike Frank, Vikash Mansinghka, Tom Griffiths, and Josh Tenenbaum)

Statistical segmentation • Work on statistical segmentation often discusses transitional probabilities(Saffran et al. 1996; Aslin et al. 1998, Johnson & Jusczyk, 2001). • P(syli | syli-1) is often lower at word boundaries. • What do TPs have to say about words? • A word is a unit whose beginning predicts its end, but it does not predict other words. • A word is a unit whose beginning predicts its end, and it also predicts future words. Or…

Interpretation of TPs • Most previous work assumes words are statistically independent. • Experimental work: Saffran et al. (1996), many others. • Computational work: Brent (1999). • What about words predicting other words? tupiro golabu bidaku padoti golabubidakugolabutupiropadotibidakupadotitupi…

Questions • If a learner assumes that words are independent units, what is learned (from more realistic input)? • Unigram model: Generate each word independently. • What if the learner assumes that words are units that help predict other units? • Bigram model: Generate each word conditioned on the previous word. Approach: use a Bayesian ideal observer model to examine the consequences of each assumption.

Bayesian learning • The Bayesian learner seeks to identify an explanatory linguistic hypothesis that • accounts for the observed data. • conforms to prior expectations. • Focus is on the goal of computation, not the procedure (algorithm) used to achieve the goal.

Bayesian segmentation • In the domain of segmentation, we have: • Data: unsegmented corpus (transcriptions). • Hypotheses: sequences of word tokens. • Optimal solution is the segmentation with highest prior probability. = 1 if concatenating words forms corpus, = 0 otherwise. Encodes unigram or bigram assumption (also others).

Brent (1999) • Describes a Bayesian unigram model for segmentation. • Prior favors solutions with fewer words, shorter words. • Problems with Brent’s system: • Learning algorithm is approximate (non-optimal). • Difficult to extend to incorporate bigram info.

A new unigram model (Dirichlet process) Assume word wi is generated as follows: 1. Is wi a novel lexical item? Fewer word types = Higher probability

A new unigram model (Dirichlet process) Assume word wi is generated as follows: 2.If novel, generate phonemic form x1…xm : If not, choose lexical identity of wi from previously occurring words: Shorter words = Higher probability Power law = Higher probability

Unigram model: simulations • Same corpus as Brent (Bernstein-Ratner, 1987): • 9790 utterances of phonemically transcribed child-directed speech (19-23 months). • Average utterance length: 3.4 words. • Average word length: 2.9 phonemes. • Example input: yuwanttusiD6bUk lUkD*z6b7wIThIzh&t &nd6dOgi yuwanttulUk&tDIs ...

Example results

Comparison to previous results • Proposed boundaries are more accurate than Brent’s, but fewer proposals are made. • Result: word tokens are less accurate. Precision: #correct / #found= [= hits / (hits + false alarms)] Recall: #found / #true = [= hits / (hits + misses)] F-score: an average of precision and recall.

What happened? • Model assumes (falsely) that words have the same probability regardless of context. • Positing amalgams allows the model to capture word-to-word dependencies. P(D&t) = .024 P(D&t|WAts) = .46 P(D&t|tu) = .0019

What about other unigram models? • Brent’s learning algorithm is insufficient to identify the optimal segmentation. • Our solution has higher probability under his model than his own solution does. • On randomly permuted corpus, our system achieves 96% accuracy; Brent gets 81%. • Formal analysis shows undersegmentation is the optimal solution for any (reasonable) unigram model.

Bigram model (hierachical Dirichlet process) Assume word wi is generated as follows: • Is (wi-1,wi) a novel bigram? • If novel, generate wi using unigram model (almost). If not, choose lexical identity of wi from words previously occurring after wi-1.

Example results

Quantitative evaluation • Compared to unigram model, more boundaries are proposed, with no loss in accuracy: • Accuracy is higher than previous models:

Summary • Two different assumptions about what defines a word are consistent with behavioral evidence. • Different assumptions lead to different results. • Beginning of word predicts end of word: Optimal solution undersegments, finding common multi-word units. • Word also predicts next word: Segmentation is more accurate, adult-like.

Remaining questions • Is unigram segmentation sufficient to start bootstrapping other cues (e.g., stress)? • How prevalent are multi-word chunks in infant vocabulary? • Are humans able to segment based on bigram statistics? • Is there any evidence that human performance is consistent with Bayesian predictions?

Testing model predictions Goal: compare our model (and others) to human performance in a Saffran-style experiment. • Problem: all models have near-perfect accuracy on experimental stimuli. • Solution: compare changes in model performance relative to humans as task difficulty is varied. tupiro golabu bidaku padoti golabubidakugolabutupiropadotibidakupadotitupiro…

Experimental method Examine segmentation performance under different utterance length conditions. • Example lexicon: lagi dazu tigupi bavulu kabitudu kipavazi • Conditions:

Procedure • Training: adult subjects listened to synthesized utterances in one length condition. • No pauses between syllables within utterances. • 500 ms pauses between utterances. • Testing: 2AFC between words and part-word distractors. Lexicon: lagi dazu tigupi bavulu kabitudu kipavazi lagitigupibavulukabitudulagikipavazi dazukipavazibavululagitigupikabitudu kipavazitigupidazukabitudulagitigupi …

Human performance

Model comparison • Evaluated six different models. • Each model trained and tested on same stimuli as humans. • To simulate 2AFC, produce a score s(w)for each word in choice pair and use Luce choice rule: • Compute best linear fit of each model to human data, then calculate correlation.

Models used • Three local statistic models, all similar to transitional probabilities (TP) • Segment at minima of P(syli | syli-1). • s(w)= minimum TP in w. • Swingley (2005) • Builds lexicon using local statistic and frequency thresholds. • s(w)= max threshold at which w appears in lexicon. • PARSER (Perruchet and Vinter, 1998) • Incorporates principles of lexical competition and memory decay. • s(w)= P(w)as defined by model. • GGJ (our Bayesian model) • s(w)= P(w)as defined by model.

Results: linear fit

Results: words vs. part-words

Summary • Statistical segmentation is more difficult when utterances contain more words. • Gradual decay in performance is predicted by Bayesian model, but not by others tested. • Bayes predicts difficulty is primarily due to effects of competition. • In longer utterances, correct words are less probable because more other possibilities exist. • Local statistic approaches don’t model competition.

Continuing work Experiments with other task modifications will further test our model’s predictions. • Vary the length of exposure to training stimulus: • Bayes: longer exposure => better performance. • TPs: no effect of exposure. • Vary the number of lexical items: • Bayes: larger lexicon => worse performance. • TPs: larger lexicon => better performance.

Conclusions Computer simulations and experimental work suggest that • Unigram assumption causes ideal learners to undersegment fluent speech. • Human word segmentation may approximate Bayesian ideal learning.

Bayesian segmentation Input data: Some hypotheses: whatsthat thedoggie wheresthedoggie ... whatsthat thedoggie wheresthedoggie whats that the doggie wheres the doggie wh at sth at thedo ggie wh eres thedo ggie w h a t s t h a t t h e d o g g i e w h e r e s t h e d o g g i e

Search algorithm Model defines a distribution over hypotheses. We use Gibbs sampling to find a good hypothesis. • Iterative procedure produces samples from the posterior distribution of hypotheses. • A batch algorithm (but online algorithms are possible, e.g., particle filtering). P(h|d) h

Gibbs sampler 1. Consider two hypotheses differing by a single word boundary: 2. Calculate probabilities of words that differ, given current analysis of all other words. • Model is exchangeable: probability of a set of outcomes does not depend on ordering. 3. Sample one of the two hypotheses according to the ratio of probabilities. whats.that the.doggie wheres.the.doggie whats.that the.dog.gie wheres.the.doggie

Models used • Transitional probabilities (TP) • Segment at minima of P(syli | syli-1). • s(w)= minimum TP in w. (Equivalently, use product). • Smoothed transitional probabilities • Avoid 0 counts by using add-λ smoothing. • Mutual information (MI) • Segment where MI between syllables is lowest. • s(w)= minimum MI in w. (Equivalently, use sum).

Models used • Swingley (2005) • Builds a lexicon, including syllable sequences above some threshold for both MI and n-gram frequency. • s(w)= max threshold at which w appears in lexicon. • PARSER (Perruchet and Vinter, 1998) • Lexicon-based model incorporating principles of lexical competition and memory decay. • s(w)= P(w)as defined by model. • GGJ (our Bayesian model) • s(w)= P(w)as defined by model.

Results: linear fit

Continuing work Comparisons to human data and other models: • Which words/categories are most robust? • Compare to Frequent Frames predictions (Mintz, 2003). • Compare to corpus data from children’s production. Modeling cue combination: • Integrate morphology into syntactic model. • Model experimental work on cue combination in category learning.

A Bayesian approach to word segmentation: Theoretical and experimental results