
Distributional Cues to Word Boundaries: Context Is Important



Presentation Transcript


  1. Distributional Cues to Word Boundaries: Context Is Important Sharon Goldwater Stanford University Tom Griffiths UC Berkeley Mark Johnson Microsoft Research / Brown University

  2. Word segmentation • One of the first problems infants must solve when learning language. • Infants make use of many different cues. • Phonotactics, allophonic variation, metrical (stress) patterns, effects of coarticulation, and statistical regularities in syllable sequences. • Statistics may provide initial bootstrapping. • Used very early (Thiessen & Saffran, 2003). • Language-independent.

  3. Distributional segmentation • Work on distributional segmentation often discusses transitional probabilities (Saffran et al., 1996; Aslin et al., 1998; Johnson & Jusczyk, 2001). • What do TPs have to say about words? • A word is a unit whose beginning predicts its end, but which does not predict other words. Or… • A word is a unit whose beginning predicts its end, and which also predicts future words.
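
For reference, the transitional probability statistic used in these studies is standardly defined as follows (the formula is not spelled out on the slide; C(·) counts occurrences in the syllable stream):

    \mathrm{TP}(y \mid x) = \frac{P(xy)}{P(x)} \approx \frac{C(xy)}{C(x)}

High TPs tend to occur inside words; low TPs tend to occur at word boundaries.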

  4. Interpretation of TPs • Most previous work assumes words are statistically independent. • Experimental work: Saffran et al. (1996), many others. • Computational work: Brent (1999). • What about words predicting other words? tupiro golabu bidaku padoti golabubidakugolabutupiropadotibidakupadotitupirobidakugolabutupiropadotibidakutupiro…
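
As an illustration only (not the authors' code), the sketch below computes syllable-to-syllable TPs over a stream built from the four nonsense words on this slide. TPs inside a word come out at 1.0, while TPs spanning a word boundary come out around 0.25, because in this sketch the next word is picked uniformly at random (with repetition allowed) from the four-word lexicon.

```python
from collections import Counter
import random

# Nonsense lexicon from the slide (Saffran-style stimuli).
words = ["tupiro", "golabu", "bidaku", "padoti"]
syllables_of = {w: [w[i:i + 2] for i in range(0, len(w), 2)] for w in words}

# Build a long random stream of words, then flatten it into an
# unsegmented sequence of syllables.
random.seed(0)
stream = [s for w in random.choices(words, k=2000) for s in syllables_of[w]]

# Count syllables and adjacent syllable pairs.
unigrams = Counter(stream)
bigrams = Counter(zip(stream, stream[1:]))

def tp(x, y):
    """Transitional probability TP(y | x) = C(xy) / C(x)."""
    return bigrams[(x, y)] / unigrams[x]

print(tp("tu", "pi"))  # word-internal: exactly 1.0 for this lexicon
print(tp("ro", "go"))  # spans a word boundary: about 0.25 here
```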

  5. Questions • If a learner assumes that words are independent units, what is learned (from more realistic input)? • What if the learner assumes that words are units that help predict other units? Approach: use a Bayesian “ideal observer” model to examine the consequences of making these different assumptions. What kinds of words are learned?

  6. Two kinds of models • Unigram model: words are independent. • Generate a sentence by generating each word independently. • Example: every position is drawn from the same distribution [look .1, that .2, at .4, …]; the sentence “look at that” is three independent draws.
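
A minimal sketch of what "generate each word independently" means, using the toy probabilities shown on the slide; the 0.3 of leftover mass assigned to a single placeholder word is an assumption made only to complete the distribution.

```python
import random

random.seed(1)

# Toy unigram probabilities from the slide; "..." lumps together all other
# words (the 0.3 leftover mass is an assumption so the distribution sums to 1).
P = {"look": 0.1, "that": 0.2, "at": 0.4, "...": 0.3}

def generate_unigram(n_words):
    """Unigram assumption: every word is an independent draw from the same
    distribution; context plays no role."""
    vocab, weights = list(P), list(P.values())
    return [random.choices(vocab, weights=weights)[0] for _ in range(n_words)]

print(generate_unigram(3))  # e.g. ['at', 'at', 'that'] -- word order carries no information
```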

  7. Two kinds of models • Bigram model: words predict other words. • Generate a sentence by generating each word, conditioned on the previous word. • Example: the distribution shifts with context: [look .4, that .2, at .1, …] for the first word, [look .1, that .3, at .5, …] after “look”, [look .1, that .5, at .1, …] after “at”, producing “look at that”.
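
A matching sketch for the bigram case. The rows keyed by "<s>", "look", and "at" follow the three tables on the slide; the start symbol "<s>", the remaining rows, and the "..." leftover mass are assumptions of this sketch.

```python
import random

random.seed(1)

# Toy conditional probabilities P(word | previous word).  Rows for "<s>",
# "look", and "at" follow the slide; the other rows and "..." are assumed.
P = {
    "<s>":  {"look": 0.4, "that": 0.2, "at": 0.1, "...": 0.3},
    "look": {"look": 0.1, "that": 0.3, "at": 0.5, "...": 0.1},
    "at":   {"look": 0.1, "that": 0.5, "at": 0.1, "...": 0.3},
    "that": {"look": 0.1, "that": 0.2, "at": 0.4, "...": 0.3},
    "...":  {"look": 0.1, "that": 0.2, "at": 0.4, "...": 0.3},
}

def generate_bigram(n_words):
    """Bigram assumption: each word is drawn conditioned on the previous word."""
    prev, sentence = "<s>", []
    for _ in range(n_words):
        dist = P[prev]
        prev = random.choices(list(dist), weights=list(dist.values()))[0]
        sentence.append(prev)
    return sentence

print(generate_bigram(3))  # sequences like ['look', 'at', 'that'] are now favored
```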

  8. Bayesian learning • The Bayesian learner seeks to identify an explanatory linguistic hypothesis that • accounts for the observed data. • conforms to prior expectations. • Focus is on the goal of computation, not the procedure (algorithm) used to achieve the goal.

  9. Bayesian segmentation • In the domain of segmentation, we have: • Data: unsegmented corpus (transcriptions). • Hypotheses: sequences of word tokens. • Optimal solution is the segmentation with the highest posterior probability, P(h | d) ∝ P(d | h) P(h). • Likelihood P(d | h) = 1 if concatenating the words forms the corpus, = 0 otherwise. • Prior P(h) encodes the unigram or bigram assumption (also others).
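
To make the decomposition concrete, the sketch below scores a candidate segmentation as likelihood times prior in log space. It uses a deliberately crude stand-in prior (uniform phoneme probabilities and a fixed word-stop probability, both assumed placeholder values), not the model introduced on the following slides.

```python
import math

# Assumed placeholder values for the illustrative prior.
P_PHONEME = 1 / 50   # uniform probability of each phoneme
P_STOP = 0.5         # probability of ending a word after each phoneme

def log_posterior(segmentation, corpus):
    """log P(h | d) up to a constant: log P(d | h) + log P(h).

    Likelihood P(d | h) is 1 if the hypothesized words concatenate back to
    the unsegmented corpus and 0 otherwise; the stand-in prior P(h) favors
    solutions with fewer and shorter words."""
    if "".join(segmentation) != corpus:
        return float("-inf")                     # likelihood = 0
    return sum(len(w) * math.log(P_PHONEME) + math.log(P_STOP)
               for w in segmentation)            # log prior

corpus = "lookatthat"
print(log_posterior(["look", "at", "that"], corpus))
print(log_posterior(["lookat", "that"], corpus))  # fewer words -> higher score here
```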

  10. Brent (1999) • Describes a Bayesian unigram model for segmentation. • Prior favors solutions with fewer words, shorter words. • Problems with Brent’s system: • Learning algorithm is approximate (non-optimal). • Difficult to extend to incorporate bigram info.

  11. A new unigram model Assume word wi is generated as follows: 1. Is wi a novel lexical item? (Fewer word types = higher probability.)

  12. A new unigram model 2. If novel, generate its phonemic form x1…xm (shorter words = higher probability). If not, choose the lexical identity of wi from previously occurring words (power-law word frequencies = higher probability).
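
The formulas themselves did not survive transcription. The slide annotations (fewer types, shorter words, power-law frequencies) are consistent with a Chinese-restaurant-process / Dirichlet-process formulation, sketched below; here n is the number of previously generated word tokens, n_\ell the count of lexical item \ell among them, and \alpha_0, the phoneme probabilities P(x_j), and the stop probability p_\# are assumed hyperparameters rather than values taken from the slides.

    P(w_i \text{ is novel}) = \frac{\alpha_0}{n + \alpha_0}, \qquad P(w_i \text{ is not novel}) = \frac{n}{n + \alpha_0}

    \text{if novel:}\quad P(w_i = x_1 \ldots x_m) = p_\# \,(1 - p_\#)^{m-1} \prod_{j=1}^{m} P(x_j)

    \text{if not novel:}\quad P(w_i = \ell) = \frac{n_\ell}{n}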

  13. Advantages of our model

  14. Unigram model: simulations • Same corpus as Brent: • 9790 utterances of phonemically transcribed child-directed speech (19-23 months). • Average utterance length: 3.4 words. • Average word length: 2.9 phonemes. • Example input: yuwanttusiD6bUk lUkD*z6b7wIThIzh&t &nd6dOgi yuwanttulUk&tDIs ...

  15. Example results

  16. Comparison to previous results • Proposed boundaries are more accurate than Brent’s, but fewer proposals are made. • Result: word tokens are less accurate. Precision: #correct / #found. Recall: #correct / #true. F-score: the harmonic mean of precision and recall.
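
As a purely illustrative example of how these scores behave, the sketch below computes boundary precision, recall, and F-score for one toy utterance; an undersegmented guess gets perfect precision but low recall, mirroring the pattern described above.

```python
def boundary_prf(proposed, true):
    """Precision, recall, and F-score over word-boundary positions.

    proposed / true: sets of character indices where a boundary is placed."""
    correct = len(proposed & true)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(true) if true else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

def boundaries(words):
    """Internal boundary positions of a segmentation (cumulative lengths)."""
    positions, total = set(), 0
    for w in words[:-1]:
        total += len(w)
        positions.add(total)
    return positions

true = boundaries(["you", "want", "to", "see", "the", "book"])
found = boundaries(["youwant", "to", "seethe", "book"])   # undersegmented guess
print(boundary_prf(found, true))  # (1.0, 0.6, 0.75): accurate boundaries, too few of them
```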

  17. What happened? • Model assumes (falsely) that words have the same probability regardless of context. • Positing amalgams allows the model to capture word-to-word dependencies. P(D&t) = .024, P(D&t | WAts) = .46, P(D&t | tu) = .0019

  18. What about other unigram models? • Brent’s learning algorithm is insufficient to identify the optimal segmentation. • Our solution has higher probability under his model than his own solution does. • On a randomly permuted corpus, our system achieves 96% accuracy; Brent’s gets 81%. • Formal analysis shows undersegmentation is the optimal solution for any (reasonable) unigram model.

  19. Bigram model Assume word wi is generated as follows: • Is (wi-1,wi) a novel bigram? • If novel, generate wi using unigram model. If not, choose lexical identity of wi from words previously occurring after wi-1.
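
Spelling the slide's three steps out in the same CRP-style notation as the unigram sketch above (again an assumption-laden sketch: \alpha_1 is an assumed hyperparameter, n_{w_{i-1}} the number of tokens previously generated after w_{i-1}, and n_{(w_{i-1}, \ell)} the number of those that were \ell):

    P\big((w_{i-1}, w_i) \text{ is a novel bigram}\big) = \frac{\alpha_1}{n_{w_{i-1}} + \alpha_1}

    \text{if novel:}\quad w_i \sim \text{the unigram model above}

    \text{if not novel:}\quad P(w_i = \ell \mid w_{i-1}) = \frac{n_{(w_{i-1}, \ell)}}{n_{w_{i-1}}}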

  20. Example results

  21. Quantitative evaluation • Compared to the unigram model, more boundaries are proposed, with no loss in accuracy. • Accuracy is higher than that of previous models.

  22. Conclusion • Different assumptions about what defines a word lead to different segmentations. • Beginning of word predicts end of word: Optimal solution undersegments, finding common multi-word units. • Word also predicts next word: Segmentation is more accurate, adult-like. • Important to consider how transitional probabilities and other statistics are used.

  23. Constraints on learning • Algorithms can impose implicit constraints. • Implication: the learning process prevents the learner from identifying the best solutions. • Specifics of the algorithm are critical, but their effects are hard to determine. • The prior imposes explicit constraints. • It states general expectations about the nature of language. • It assumes humans are good at learning.

  24. Algorithmic constraints • Venkataraman (2001) and Batchelder (2002) describe unigram model-based approaches to segmentation with no prior. • Venkataraman’s algorithm penalizes novel words. • Batchelder’s algorithm penalizes long words. • Without these algorithmic constraints, the models would memorize every utterance whole (insert no word boundaries).

  25. Remaining questions • Are multi-word chunks sufficient as an initial bootstrapping step in humans? (cf. Swingley, 2005) • Do children go through a stage with many chunks like these? (cf. MacWhinney, ??) • Are humans able to segment based on bigram statistics?
