LING / C SC 439/539 Statistical Natural Language Processing



  1. LING / C SC 439/539Statistical Natural Language Processing • Lecture 23 • 4/10/2013

  2. Recommended reading • Zellig Harris. 1954. From phoneme to morpheme. • Jenny R. Saffran, Richard N. Aslin, and Elissa L. Newport. 1996. Statistical learning by 8-month-old infants. Science, 274, 1926-1928. • Timothy Gambell and Charles Yang. 2005. Word segmentation: quick but not dirty. MS. • Daniel Hewlett and Paul Cohen. 2011. Word segmentation as general chunking. Proceedings of CoNLL. • Rie K. Ando and Lillian Lee. 2003. Mostly-unsupervised statistical segmentation of Japanese kanji sequences. Natural Language Engineering, 9(2).

  3. Outline • Introduction to unsupervised learning • Word segmentation by letter successors • Word segmentation by transitional probability • Word segmentation using vocabulary bootstrapping and stress information • Word segmentation as general chunking • Word segmentation in Chinese and Japanese

  4. Types of machine learning • Supervised learning • Data is annotated for the label to be predicted • Learn a mapping from features to labels • Semi-supervised learning • Partially annotated data • Unsupervised learning • Data is not annotated for the concept to be learned • Use features to group similar data together

  5. Why unsupervised learning? • Annotated data is expensive • Unsupervised learning can be used to discover structure in new data sets • Categories learned in an unsupervised manner may be useful as features in supervised learning • Gold-standard annotations do not necessarily directly reflect statistical properties of data • e.g. Nonterminal rewriting in parsing • Model child language acquisition • Children learn in an unsupervised manner

  6. Applications of unsupervised learning in NLP • Unsupervised induction of: • Word segmentation • Morphology • POS categories • Word collocations • Semantic categories of words • Paraphrase discovery • etc. • “Induction” = discover from scratch

  7. Computational approaches to unsupervised learning • Algorithms: k-means clustering, agglomerative clustering, mutual information clustering, singular value decomposition, probabilistic models • Computational issues: representation, search space, minimum description length, data sparsity, Zipf’s law

  8. Linguistic issues: learning bias • Unsupervised learning is interesting from a linguistic point of view because it involves both rationalist and empiricist approaches to language • Empiricist • Knowledge is obtained from experience (=data) • Rationalist • Knowledge results from the capacities of the mind • Learning bias: learner is predisposed to acquire certain end results due to how it was programmed • Learning cannot be entirely “knowledge-free”

  9. Linguistic issues: language specificity • Empiricist perspective: opposed to building linguistic theory into NLP systems merely for the sake of adhering to that theory • View language as just one of many kinds of data • Apply general-purpose learning algorithms that are applicable to other kinds of data • Language-specific learning algorithms are not necessary • If successful, this strengthens claims that linguistic theory isn’t needed

  10. First application of unsupervised learning: word segmentation • Word segmentation problem: howdoyousegmentacontinuousstreamofwords? • Use statistical regularities between items in sequence to discover word boundaries • Look at 5 different approaches, from different fields • Old-school Linguistics • Psychology • Computational Linguistics • Artificial Intelligence • Applied NLP

  11. Applications of word segmentation • Speech recognition • Break acoustic signal (which is continuous) into phonemes / morphemes / words • Languages written without spaces • Asian languages • Decipher ancient texts • Language acquisition • How children identify words from continuous speech • Identify morphemes in sign language

  12. Outline • Introduction to unsupervised learning • Word segmentation by letter successors • Word segmentation by transitional probability • Word segmentation using vocabulary bootstrapping and stress information • Word segmentation as general chunking • Word segmentation in Chinese and Japanese

  13. Zellig Harris • “Structuralist” linguist • Pre-Chomsky; was Chomsky’s advisor • Proposed automatic methods for a linguist to discover the structure of a language • Theories are based on, and account for observed data only • Do not propose abstract representations • Don’t use introspection

  14. Harris 1954: Letter successors • Have a sequence of phonemes, don’t know where the boundaries are • Idea: morpheme/word boundaries occur where there are many possible letter successors • Resembles entropy, but is more primitive • Example: successors of he’s: • he’s crazy • he’s quiet • he’s careless

  15. (from A. Albright) • Segment he’s quicker: hiyzqwIker • # of letter successors at each position • hI: 14 • hIy: 29 • hIyz: 29 • hIyzk: 11 • Propose boundary at local maximum • hIy

  16. Backtracking • When the successor count drops to zero, go back to the previous peak and treat it as the start of a chunk

  17. Segmentation algorithm • Calculate successor counts, with backtracking at zero successors • Segment at local maxima in successor counts • Results: • Lines are segmentation choices • Solid = true word boundary, dotted = morpheme boundary
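
To make the letter-successor idea concrete, here is a minimal Python sketch (not Harris's original procedure): it counts the distinct symbols that follow each substring in a toy corpus of unsegmented utterances, then proposes boundaries at local maxima of the successor count. The toy corpus and the particular local-maximum test are illustrative assumptions.

```python
from collections import defaultdict

def successor_counts(corpus):
    """Map each substring of each utterance to the set of symbols seen immediately after it."""
    succ = defaultdict(set)
    for utt in corpus:
        for i in range(len(utt)):
            for j in range(i + 1, len(utt)):
                succ[utt[i:j]].add(utt[j])
    return succ

def segment(utt, succ):
    """Propose boundaries at local maxima of the successor count, scanning left to right."""
    counts = [len(succ[utt[:j]]) for j in range(1, len(utt))]  # successors of each prefix
    boundaries = []
    for k in range(1, len(counts) - 1):
        if counts[k] > counts[k - 1] and counts[k] >= counts[k + 1]:
            boundaries.append(k + 1)   # boundary after the prefix utt[:k+1]
    return boundaries

# Toy, made-up corpus of unsegmented "utterances"
corpus = ["hesquiet", "hescrazy", "hescareless", "shesquiet"]
succ = successor_counts(corpus)
print(segment("hesquiet", succ))   # -> [3]: proposed boundary after "hes"
```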

  18. Outline • Introduction to unsupervised learning • Word segmentation by letter successors • Word segmentation by transitional probability • Word segmentation using vocabulary bootstrapping and stress information • Word segmentation as general chunking • Word segmentation in Chinese and Japanese

  19. Children figure out word segmentation • Normal conversation: continuous flow of speech, no pauses between words • One task in language acquisition is to figure out the word boundaries • How do children do it? • Bootstrap through isolated words • Phonetic/phonological constraints • Statistical approach: transitional probability

  20. 1. Bootstrap through isolated words • Old idea: • Children hear isolated words • doggy • Use these words to segment more speech • baddoggy • Problems: • How to recognize single-word utterances? • In English, only 9% of utterances are single words • Can’t use the number of syllables either: spaghetti is a single word with three syllables

  21. 2. Phonetic/phonological constraints • Phonotactics • Some sound combinations not allowable in English • zl, mb, tk • Could hypothesize word boundary here • However, could occur word-internally: embed • Articulatory cues • Aspirated vs. unaspirated t • tab vs. cat • Could use this knowledge to mark word boundaries • Problems: • This is from the adult point of view; how do children acquire this knowledge?

  22. 3. Use statistics: transitional probability • Transitional probability (same as conditional probability): TP(A→B) = p(AB) / p(A) • From a syllable A, each possible successor B, C, D, ... has its own transitional probability TP(AB), TP(AC), TP(AD) • Idea: TP of syllables signals word boundaries • High TP within words, low TP across words • Example: pre.tty ba.by • TP(pretty) > TP(ttyba) • A child could use TP statistics to segment words
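
As a hedged illustration (not the experimental materials), the sketch below estimates syllable transitional probabilities from bigram and unigram counts over a made-up syllable stream; the syllables and counts are invented for the example.

```python
from collections import Counter

def transitional_probs(syllable_stream):
    """Estimate TP(A -> B) = count(A B) / count(A) from a flat list of syllables."""
    unigrams = Counter(syllable_stream)
    bigrams = Counter(zip(syllable_stream, syllable_stream[1:]))
    return {(a, b): c / unigrams[a] for (a, b), c in bigrams.items()}

# Toy stream containing the "words" pre-tty and ba-by
stream = ["pre", "tty", "ba", "by", "pre", "tty", "doll", "ba", "by", "pre", "tty"]
tp = transitional_probs(stream)
print(tp[("pre", "tty")])   # high: "tty" always follows "pre" (within-word)
print(tp[("tty", "ba")])    # lower: "ba" follows "tty" only sometimes (across words)
```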

  23. Saffran, Aslin, & Newport 1996 • Test whether children can track statistics of transitional probabilities • 8 months old • Artificial language • 4 consonants (p,t,b,d), 3 vowels (a, i, u) • 12 syllables (pa, ti, bu, da, etc.) • 6 words: babupu, bupada, dutaba, patubi, pidabu, tutibu • TP: 1.0 within words; 0.33 across words. • No effect of co-articulation, stress, etc. • Stimuli • 2 minutes of continuous stream of words • monotone voice, synthesized speech • bidakupadotigolabubidaku...

  24. Example: ba-bu-pu bu-pa-da du-ta-ba • High TP between syllables within a word, low TP across word boundaries • Word boundaries occur at the dips in transitional probability

  25. Testing children • Test stimuli: • Same syllables • Novel words whose TPs differ from the training stimuli • Test preference for highly frequent (training) vs. rare (novel) words • Results: mean listening time 6.77 s for familiar words vs. 7.60 s for novel words (matched-pairs t test: t(23) = 2.3, p < 0.03) • Conclusion: • Infants are sensitive to the transitional probabilities of syllables, as shown by their preference for the novel stimuli

  26. Head-turn preference procedure • (Figure: child seated with a light and loudspeakers)

  27. Conclusions • Supports idea that children learn to segment words through transitional probability statistics • Frequently used as an argument against innate knowledge in language acquisition • “Results raise the possibility that infants possess experience dependent mechanisms that may be powerful enough to support not only word segmentation but also the acquisition of other aspects of language.” • Other research shows that it’s not unique to humans • Monkeys can do this, too

  28. Outline • Introduction to unsupervised learning • Word segmentation by letter successors • Word segmentation by transitional probability • Word segmentation using vocabulary bootstrapping and stress information • Word segmentation as general chunking • Word segmentation in Chinese and Japanese

  29. Computational model of acquisition of word segmentation • Saffran et al. showed that children are sensitive to transitional probabilities • But does that mean this is how children do it? • Test with a computational model • Precisely defined input/output and algorithm • Apply to a corpus: discover words from a continuous sequence of phoneme symbols

  30. Data • Portion of English CHILDES corpus • Transcriptions of adult / child speech • 226,178 words • 263,660 syllables • Corpus preparation • Take adult speech • Look up words in CMU pronunciation dictionary, which has stress indicated • cat K AE1 T • catapult K AE1 T AH0 P AH0 L T • Apply syllabification heuristics • Remove spaces between words

  31. Gambell & Yang 2005: test TP • Test syllable TP, without stress • Propose word boundaries at local minima in TP • i.e., propose a word boundary between syllables B and C if TP(AB) > TP(BC) < TP(CD) • Results • Precision: 41.6% • Recall: 23.3% • TP doesn’t work!
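
For concreteness, here is a rough sketch (my reconstruction, not Gambell & Yang's code) of proposing boundaries at local TP minima and scoring the proposals against gold boundaries; `tp` is assumed to be a dict of syllable-pair probabilities like the one sketched earlier, and boundaries are scored here at the boundary level as a simplification.

```python
def segment_at_tp_minima(syllables, tp):
    """Propose a boundary between syllables B and C whenever TP(A,B) > TP(B,C) < TP(C,D)."""
    tps = [tp.get((syllables[i], syllables[i + 1]), 0.0) for i in range(len(syllables) - 1)]
    boundaries = set()
    for i in range(1, len(tps) - 1):
        if tps[i - 1] > tps[i] < tps[i + 1]:
            boundaries.add(i + 1)        # boundary before syllables[i + 1]
    return boundaries

def precision_recall(proposed, gold):
    """Precision: fraction of proposed boundaries that are correct; recall: fraction of gold boundaries found."""
    hits = len(proposed & gold)
    precision = hits / len(proposed) if proposed else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall
```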

  32. Problems with Saffran et al. study • Artificial language is too artificial • Very small vocabulary • All words are 3 syllables • TPs used by Saffran et al. are 1 and 0.33 only. • Why TP doesn’t work • Sparse data: 54,448 different syllable pairs • TP requires multisyllable words! • Single-syllable words: no within-word TP • In corpus, single-syllable word followed by single-syllable word 85% of time

  33. TP is weak as a cognitive model • Computationally complex • Huge number of TPs to keep track of • Can’t be psychologically plausible • TP doesn’t work on the corpus → kids can’t be using just TP • Not linguistically motivated

  34. Gambell & Yang 2005: Use stress for segmentation • Unique Stress Constraint: • A word can bear, at most, one primary stress • Assumed innate, part of Universal Grammar • Example: Darth-Va-der = S1 S2 W (S = strong/stressed syllable, W = weak syllable) • Segment between stressed syllables: [Darth] [Va-der]

  35. Use stress for segmentation • Automatically identifies single-syllable stressed words • What about Chew-ba-cca (W S W)? • And in a sequence with one or more weak syllables between two strong syllables (S W W W S), where is the word boundary?

  36. Model 1: SL + USC (SL = Statistical Learning = TP) • Input: transcribed speech with stress • Training: calculate transitional probabilities • Testing • Scan the sequence of syllables • If two adjacent strong syllables, propose a word boundary between them • If multiple weak syllables occur between strong syllables, propose a word boundary where TP is lowest • Performance • Precision = 73.5%, Recall = 71.2%
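
The sketch below is one way to reconstruct Model 1 as described on this slide (not the authors' code): the Unique Stress Constraint forces a boundary between adjacent strong syllables, and TP decides where to split a run of weak syllables between two strong ones. The representation (parallel syllable/stress lists, a TP dict) is an assumption.

```python
def segment_sl_usc(syllables, stresses, tp):
    """syllables: list of syllable strings; stresses: parallel list of 'S'/'W'; tp: dict of TPs.
    Returns a set of boundary positions (boundary i = before syllables[i])."""
    boundaries = set()
    strong = [i for i, s in enumerate(stresses) if s == 'S']
    for a, b in zip(strong, strong[1:]):
        if b == a + 1:
            boundaries.add(b)           # USC: split two adjacent strong syllables
        else:
            # weak syllables between two strong ones: split where TP across the gap is lowest
            gaps = range(a, b)          # candidate boundary positions a+1 .. b
            best = min(gaps, key=lambda i: tp.get((syllables[i], syllables[i + 1]), 0.0))
            boundaries.add(best + 1)
    return boundaries
```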

  37. Models 2 and 3: vocabulary bootstrapping • Bootstrapping • Use known words to segment unknown ones • Iterative process that builds up a vocabulary • 3 cases for segmenting S W S: • [S W] S — [S W] is a known word • S [W S] — [W S] is a known word • S W S — unknown • No transitional probability!

  38. Models 2 and 3 • Problem: • S W W W S with no known words • Model 2: Algebraic agnostic • Just skip these cases • Might segment them later if a word is identified in a later iteration • Model 3: Algebraic random • Randomly choose a word boundary
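
A loose sketch of the vocabulary-bootstrapping idea (a greedy simplification, not the models' actual procedure): known words from a growing lexicon are peeled off a span of syllables, and spans with no known words are either left alone, roughly in the spirit of Model 2, or split at a random point, roughly in the spirit of Model 3. The function name and the lexicon representation are assumptions.

```python
import random

def bootstrap_segment(span, lexicon, random_fallback=False):
    """Greedy left-to-right segmentation of a syllable span using known words.
    `lexicon` is a set of tuples of syllables; unknown material is left unsegmented
    (Model 2-style) or split at a random point (Model 3-style) if random_fallback=True."""
    words = []
    i, n = 0, len(span)
    while i < n:
        for j in range(n, i, -1):                     # try the longest known word starting at i
            chunk = tuple(span[i:j])
            if chunk in lexicon:
                words.append(chunk)
                i = j
                break
        else:                                         # no known word starts at i
            if random_fallback and i + 1 < n:
                j = random.randint(i + 1, n)          # guess a boundary at random
                words.append(tuple(span[i:j]))
                i = j
            else:
                words.append(tuple(span[i:]))         # leave the rest unsegmented for now
                i = n
    return words
```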

  39. Conclusion • Utilizes linguistic knowledge from UG • Unique Stress Constraint • Result: no massive storage of TPs is necessary • Problems • How does the child identify stress in the first place? • What about unstressed words? • Function words are often reduced

  40. Outline • Introduction to unsupervised learning • Word segmentation by letter successors • Word segmentation by transitional probability • Word segmentation using vocabulary bootstrapping and stress information • Word segmentation as general chunking • Word segmentation in Chinese and Japanese

  41. Word segmentation as a general chunking problem • Algorithms for segmentation can also be applied to non-linguistic data • Voting Experts algorithm (Paul Cohen, U of A) • Word segmentation can be accomplished by algorithms that are not specific to language • Don’t need to utilize language-specific information such as “stress”

  42. Segmentation of robot behavior (non-linguistic data) • Robot wandered around a room for 30 minutes, examining objects • Robot had 8 different actions: • MOVE-FORWARD • TURN • COLLISION-AVOIDANCE • VIEW-INTERESTING-OBJECT • RELOCATE-INTERESTING-OBJECT • SEEK-INTERESTING-OBJECT • CENTER-CHASSIS-ON-OBJECT • CENTER-CAMERA-ON-OBJECT • Segment into 5 different episodes, based on actions at each time step: • FLEEING • WANDERING • AVOIDING • ORBITING-OBJECT • APPROACHING-OBJECT

  43. Characteristics of temporal chunks • Sequences are highly predictable within chunks, and unpredictable between chunks

  44. Expert #1: segment according to frequency of substrings • If a sequence is split so as to maximize the empirical frequency of the resulting subsequences, a high proportion of the splits will fall on word boundaries, relative to an equal number of random splits • Example: splitting off the frequent substrings ‘THE’ and ‘AT’ in THECATSATONTHEMATTOEATHERFOOD

  45. Expert #2: segment according to boundary entropy • If a sequence is split where the empirical uncertainty (entropy) about the next symbol is highest, a high proportion of the splits will fall on word boundaries, relative to an equal number of random splits • Example: after ‘AT’, the entropy of the next symbol is high in THECATSATONTHEMATTOEATHERFOOD

  46. Count # of following letters with a trie • (from P. Cohen)

  47. Voting • Each of the experts casts a vote at each point in the sequence • Segment where the number of votes is highest • Example: in THECATSATONTHEMATTOEATHERFOOD, several boundary positions each receive 2 votes (one from each expert)
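
A compact, hedged sketch of the voting idea (a simplification of the actual Voting Experts algorithm, which standardizes the experts' scores by n-gram length): within each sliding window, the frequency expert votes for the split that yields the most frequent chunks, the boundary-entropy expert votes for the split after the prefix with the most uncertain successor, and segmentation points are placed where votes pile up. The window size, n-gram length, and scoring details here are assumptions.

```python
import math
from collections import Counter, defaultdict

def ngram_stats(text, max_len=4):
    """Frequencies of all substrings up to max_len, and the successor distribution of each."""
    freq, succ = Counter(), defaultdict(Counter)
    for i in range(len(text)):
        for j in range(i + 1, min(i + max_len, len(text)) + 1):
            freq[text[i:j]] += 1
            if j < len(text):
                succ[text[i:j]][text[j]] += 1
    return freq, succ

def entropy(counter):
    """Shannon entropy (bits) of a successor distribution."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())

def vote(text, freq, succ, window=3):
    """Each expert casts one vote per sliding window; returns vote totals per boundary position."""
    votes = Counter()
    for start in range(len(text) - window + 1):
        chunk = text[start:start + window]
        # frequency expert: split maximizing the frequency of the two halves
        best_f = max(range(1, window), key=lambda k: freq[chunk[:k]] + freq[chunk[k:]])
        # boundary-entropy expert: split after the prefix whose successors are most uncertain
        best_h = max(range(1, window), key=lambda k: entropy(succ[chunk[:k]]))
        votes[start + best_f] += 1
        votes[start + best_h] += 1
    return votes

text = "THECATSATONTHEMATTOEATHERFOOD"
freq, succ = ngram_stats(text)
votes = vote(text, freq, succ)
print(sorted(votes.items(), key=lambda kv: -kv[1])[:5])   # top-voted boundary positions
```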
