FUL: Incorporating phonological theory into ASR Aditi Lahiri (Prof in Oxford)

FUL: Incorporating phonological theory into ASR Aditi Lahiri (Prof in Oxford) Henning Reetz (Prof in Frankfurt) presented by Jacques Koreman (ISK), presentation, speech group, IET at NTNU

Acknowledgment and responsibilities • Some of the slides (in Times New Roman) were made available by Henning Reetz • The ideas are all Aditi’s and Henning’s • Their (mis)representation is mine…

What is FUL, and why is it interesting? FUL stands for featurally underspecified lexicon. This presentation addresses its main characteristics: • Underspecified features are omitted from the underlying representation • Non-stochastic approach, in contrast to any current techniques in ASR • Psychological reality proven by psycholinguistic and other evidence

An example of underspecification Underspecification can help to deal with assimilation, as for instance in spontaneous speech: green bag, green grass are often realised as greem bag, greeng grass, while lame dog, long day are never realised as lane dog, lon day. Why? Because /n/ is underspecified for place and can therefore borrow a place feature from its neighbour, while /m/ is [LABIAL].

FUL featural specification The specification of features is constrained by universal properties and language-specific requirements: for German, [ABRUPT] and [CORONAL] (cf. "green") are not specified in the lexicon. • FUL uses monovalent, not binary features • V and C share the same place features The types of features are very much under debate: binary or monovalent, fully specified or underspecified, V and C features together or separate, feature names? On the next slide, the latest version of the feature hierarchy in FUL is shown.

Latest version FUL feature hierarchy

Lexical entries/access in FUL • Entries contain underspecified representations. As opposed to standard full, binary specification! • Each morpheme has a unique representation. Diametrically opposed view of dealing with variation in the signal compared to exemplar-based modelling! • Rough signal parameters mapped onto phonological features (no segments, syllables or other intermediate representation). Unlike detailed acoustic analysis in other systems! • Features used to directly access the lexicon using a non-stochastic, ternary matching procedure. Human speech processing as opposed to pattern matching?

ASR on the basis of a FUL How does ASR with FUL work? Slides 9-18 explain the recognition steps in the FUL system. Why does ASR with FUL work? After that, evidence for the approach from human speech processing will be presented.

Overview of the FUL system [Diagram: Acoustic signal (stream of samples) → Acoustic front end → Stream of phonological features → Matching process (match / no mismatch / mismatch) against the Word lexicon (Representation with phonological features) → Word candidates → Phonological & syntactic parsing. Further labels in the diagram: Segments, Prosody, Morphology, Syntax, Semantic.]

Acoustic front end [Diagram: Acoustic signal (stream of samples, waveform) → LPC (20 ms window, 1 ms step rate), FFT, … → Stream of formants and spectral shape parameters → Heuristics (e.g. [high] := F1 < 450 Hz), 1 ms step rate → Stream of features (labial, nasal, low, ...) → Heuristics (e.g. length > 5 ms) synchronise features → Stream of corrected and synchronised features.] Could maybe also be landmarks…
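The two heuristics named on this slide can be sketched in a few lines of Python. This is only an illustration, not the FUL front end itself: the threshold (F1 < 450 Hz) and the duration criterion (length > 5 ms at a 1 ms step rate) come from the slide, while the function names, the toy F1 track, and the choice to simply zero out short runs are assumptions.

```python
from typing import List

HIGH_F1_MAX_HZ = 450.0  # from the slide: [high] := F1 < 450 Hz
MIN_RUN_FRAMES = 5      # from the slide: length > 5 ms, at a 1 ms step rate


def high_feature(f1_track_hz: List[float]) -> List[int]:
    """Return 1 for every 1 ms frame whose F1 lies below the threshold."""
    return [1 if f1 < HIGH_F1_MAX_HZ else 0 for f1 in f1_track_hz]


def drop_short_runs(track: List[int], min_frames: int = MIN_RUN_FRAMES) -> List[int]:
    """Zero out runs of 1s that are not longer than min_frames (assumed clean-up rule)."""
    cleaned = list(track)
    i = 0
    while i < len(track):
        if track[i] == 1:
            j = i
            while j < len(track) and track[j] == 1:
                j += 1
            if j - i <= min_frames:  # run too short to count as a feature
                for k in range(i, j):
                    cleaned[k] = 0
            i = j
        else:
            i += 1
    return cleaned


# Toy F1 track (hypothetical values), one value per 1 ms frame.
f1 = [600.0] * 3 + [400.0] * 2 + [600.0] * 3 + [380.0] * 8 + [700.0] * 2
raw = high_feature(f1)
clean = drop_short_runs(raw)
print(raw)
print(clean)
```

In this toy track the 2 ms dip below 450 Hz is discarded, while the 8 ms stretch survives as [high].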

Acoustic front end: parameter extraction [Figure: speech signal → formants → heuristics, e.g. [high] := F1 < 450 Hz]

Acoustic front end: … to features [Figure: phonological features over time, e.g. [son], [high], [low], …]

Acoustic front end: features filtered/synchronised [Figure: binary matrix of phonological features over time (rows such as [son], [high], [low], …), filtered and synchronised]

Acoustic front end: lexical access with features. Lexicon search with underspecified features [Figure: the filtered and synchronised binary feature matrix (rows such as [son], [high], [low], …) with candidate segments and word candidates from the lexicon aligned to it]

The crunch of FUL: ternary matching [Figure: features computed from the signal at one instant in time, [m] (labial, nasal), are compared with features stored in the lexicon for /m/ (labial, nasal), /p/ (labial, consonantal), /n/ (nasal) and /s/ (strident); comparisons are labelled with their outcome, e.g. "no mismatch".]

The crunch of FUL: ternary matching [Figure: features computed from the signal at one instant in time, [m] (labial, nasal), and at another instant in time, [n] (coronal, nasal), are compared with features stored in the lexicon for /m/ (labial, nasal) and /n/ (nasal); comparisons with /n/ are labelled "no mismatch".]
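The three outcomes can be made concrete with a small Python sketch. The feature sets for [m], [n], /m/ and /n/ follow the slides; the conflict table (which signal features contradict which lexical features) and all function names are assumptions made for illustration, not the actual FUL rule set.

```python
from typing import Dict, FrozenSet, Set

# Assumed conflict table: a signal feature (key) contradicts these lexical features.
# Only a coronal entry is given here, for illustration.
CONFLICTS: Dict[str, FrozenSet[str]] = {
    "coronal": frozenset({"labial", "dorsal"}),
}

# Lexical entries as on the slides: /n/ carries no place feature.
LEXICON: Dict[str, Set[str]] = {
    "/m/": {"labial", "nasal"},
    "/n/": {"nasal"},
}


def compare_feature(signal_feature: str, lexical_features: Set[str]) -> str:
    """Ternary outcome for one feature extracted from the signal."""
    if signal_feature in lexical_features:
        return "match"
    if CONFLICTS.get(signal_feature, frozenset()) & lexical_features:
        return "mismatch"
    return "no mismatch"


def compare_segment(signal_features: Set[str], lexical_features: Set[str]) -> Dict[str, str]:
    """Outcome per signal feature against one lexical segment."""
    return {f: compare_feature(f, lexical_features) for f in sorted(signal_features)}


def survives(signal_features: Set[str], lexical_features: Set[str]) -> bool:
    """A candidate is dropped only if at least one feature mismatches."""
    return "mismatch" not in compare_segment(signal_features, lexical_features).values()


heard_m = {"labial", "nasal"}   # features computed from the signal for [m]
heard_n = {"coronal", "nasal"}  # features computed from the signal for [n]

for label, heard in (("[m]", heard_m), ("[n]", heard_n)):
    for entry, stored in LEXICON.items():
        print(label, "vs", entry, compare_segment(heard, stored), survives(heard, stored))
```

Under these assumptions, a heard [m] is tolerated by lexical /n/ (no mismatch on place), whereas a heard [n] is rejected by lexical /m/.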

Morphological extension of underspecif. /fa/: {„fang!“ catch! verb, imp., ....} {„fange“ I catch verb, 1st sg., ....} {„fangen“ we catch verb, 1st pl. + inf., ....} {„fang an“ start! verb, imp., ....} {„fange an“ I start verb, 1st sg., ....} {„fangen an“ we start verb, 1st pl., ....} {„fang auf“ catch! verb, imp., ....} • • •

The crunch of FUL: ternary matching • Mismatches cause words in the lexicon to be dropped from the list. • No-mismatches or matches do not, but lead to different scores for the word candidates by comparing the number of features derived from the signal with those specified in the lexicon: score = (matching features)² / (features in lexicon × features in signal)
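As a worked example, the score on this slide can be read as the squared number of matching features divided by the product of the number of features in the lexicon and in the signal. The sketch below assumes that reading; the function name and the handling of empty feature sets are my own choices.

```python
from typing import Set


def ful_score(signal_features: Set[str], lexical_features: Set[str]) -> float:
    """(matching features)^2 / (features in lexicon * features in signal)."""
    n_signal = len(signal_features)
    n_lexicon = len(lexical_features)
    if n_signal == 0 or n_lexicon == 0:
        return 0.0  # assumed convention: nothing to compare, no evidence
    n_matching = len(signal_features & lexical_features)
    return n_matching ** 2 / (n_lexicon * n_signal)


heard_m = {"labial", "nasal"}                          # features from the signal for [m]
score_m = ful_score(heard_m, {"labial", "nasal"})      # lexical /m/
score_n = ful_score(heard_m, {"nasal"})                # lexical /n/, no place feature
print(score_m, score_n)
```

Both candidates stay on the list, since neither comparison produces a mismatch, but lexical /m/ receives the higher score for a heard [m].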

An im-probable system? Evidence. FUL stands for featurally underspecified lexicon. • Underspecified features are omitted from the underlying representation • Non-stochastic approach, in contrast to any current techniques in ASR • Psychological reality proven by psycholinguistic and other evidence

Evidence for underspecification: semantic priming in lexical decision • Crossmodal experiment (German): • hear prime: Honig (honey), Hammer (hammer) • see target: Biene (bee), Nagel (nail) • Subjects’ task: lexical decision • Pseudo-word Ho[m]ig primes Biene, but Ha[n]er does not prime Nagel • Conclusion: [n] underspecified for place in lexicon leads to no-mismatch for Ho[m]ig, but [m] in lexicon is labial, thus mismatch for Ha[n]er

Evidence for underspecification: semantic priming in EEG • The N400 is an event-related potential (ERP) component typically elicited by unexpected linguistic stimuli. • It is characterized as a negative deflection peaking ca. 400ms after stimulus presentation. • In models of speech comprehension, N400 is often associated with the semantic integration of words in sentence context; its finding is interpreted as pointing to the activation of a process working on semantics in the general time frame.

Evidence for underspecification: semantic priming in EEG • word target: Hor[d]e (horde), Pro[b]e (test) • pseudo-word target: Hor[b]e (horde), Pro[d]e (test) • Subjects’ task: speeded lexical decision • Similar RTs for words and pseudo-words, but more errors in lexical decision for Hor[b]e (no-mismatch for Hor[d]e) than for Pro[d]e (mismatch on Pro[b]e) • Also large negative peak for Pro[d]e but not for Hor[b]e (which behaved similarly to real words). • Conclusion: [d] underspecified for place in lexicon, but [b] specified as [LABIAL]

Evidence for underspecification: vowel listening in MEG experiment • standard (continuous): [o:] • deviant (played once): [ø:] • Subjects’ task: just listen… • Asymmetrical Mismatch Negativity (MMN) effect (perception of change): greater for [o:]-[ø:] than for [ø:]-[o:], with a higher amplitude difference ca. 180 ms from onset of deviant and an earlier effect. Similar effects for other pairs. • Conclusion: Results fit with underspecification

Evidence for underspecification And there is more evidence • from CVC gating experiments in English and Bengali, where a non-nasalised oral vowel could lead to both oral and nasal responses when the CV is heard (Lahiri & Marslen-Wilson, 1991, 1992) • from priming experiments, suggesting there are two kinds of [o:] in German, one which is specified for [labial, dorsal] (Boote-Bötchen as primes for Boot), the other specified only for [labial] (Söhne-Söhnchen as primes for Sohn) • from language change in Miogliola (Northern Italian), where two types of [n] were shown to exist, one [coronal], the other unspecified for place (Ghini, 2001). … and more

Conclusions • FUL is an implementation of phonological theory in ASR. • FUL is firmly grounded in psycholinguistic experiments and observations on language change. • FUL recognition is robust against variation in speech, but does not contain mechanisms to normalize for variation not directly related to the linguistic content (as we possibly do when we begin to understand a speaker better when we first meet him/talk to him on the phone), nor to use this information.

References This presentation was mainly based on • (a draft version of) Lahiri, A. & Reetz, H. (2002). "Underspecified recognition", in C. Gussenhoven & N. Warner (eds.) Laboratory Phonology 7. Berlin: Mouton, 637-675. • Lahiri, A. & Reetz, H. (submitted to J. Phon.). "Distinctive features: phonological underspecification in processing". • See also: http://ling.uni-konstanz.de/pages/proj/sfb471/publ/d-3.html
