This research explores new speech representations based on articulatory features in order to better model pronunciation variation, and investigates the use of distance metrics in the articulatory feature space to improve speech decoding.
SPEECH VARIATION AND THE USE OF DISTANCE METRICS ON THE ARTICULATORY FEATURE SPACE
Louis ten Bosch
Contents
• Introduction
• Objectives
• Articulatory Features
• Speech Material
• Experimental details
  • Set-up
• Results
• Questions, future plans
Introduction
• Speech is usually represented as sequences of symbols from a limited set of phone-like units (ASR, synthesis, annotation)
• The 'beads-on-a-string' paradigm (Ostendorf, 1999; etc.)
  • Powerful as a meta-description
  • Weak at describing articulatory variation and pronunciation variation
• Research on new descriptions and models of speech
  • Many proposals for new signal representations (continuity-preserving, auditorily inspired) and new models (neural models, long-span models, parallel models)
• Here: articulatory features (AFs)
Objectives
• To obtain alternative representations that intrinsically better model variation in speech
  • Focus on articulatory/pronunciation variation
• To investigate the relation between better representations and decoding
Articulatory Features (AFs)
• The advantages of AFs are twofold:
  • They allow feature asynchrony
  • They deal with 'incompleteness': incomplete nasalization or voicing (illustrated in the sketch below)
• Intrinsically better modelling of continuous processes
• Assumed to better model fine phonetic detail (FPD)
  • FPD mediates human speech processing (lexical access)
  • [together with indexical information]
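A minimal sketch of the representational point, with hypothetical AF dimensions (the slide does not specify the actual feature set): each frame gets a continuous value per articulatory dimension, so partial phenomena such as incomplete nasalization are representable, unlike a hard phone label.

```python
import numpy as np

# Hypothetical AF dimensions; real systems may use more or different ones.
AF_DIMS = ["voicing", "nasality", "frication", "tongue_height", "lip_rounding"]

# One frame of a partially nasalized, fully voiced vowel: the nasality
# dimension sits between 0 (oral) and 1 (nasal) instead of being forced
# into either phone category.
frame = np.array([1.0, 0.4, 0.0, 0.8, 0.2])

for name, value in zip(AF_DIMS, frame):
    print(f"{name:14s} {value:.2f}")
```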
Distance Metric in AF Space
• Each utterance is a path in AF space
• A distance metric in AF space defines a 'speed' along the path
  • Compare with delta-features in ASR
• Detection of speed peaks imposes an intrinsic temporal structure (see the sketch below)
• Which distances to use?
  • Three types: L1, L2, cosine
• How does this 'intrinsic' temporal structure relate to external temporal structure, e.g. phone boundaries?
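A minimal sketch of this idea under assumed conventions: an utterance is a (T, D) sequence of AF vectors, frame-to-frame distance defines a 'speed' along the path, and local speed maxima yield candidate boundaries. The AF matrix here is random stand-in data, not the paper's material.

```python
import numpy as np

def frame_speed(af, metric="cosine"):
    """Distance between consecutive AF frames; af has shape (T, D)."""
    a, b = af[:-1], af[1:]
    if metric == "L1":
        return np.abs(a - b).sum(axis=1)
    if metric == "L2":
        return np.sqrt(((a - b) ** 2).sum(axis=1))
    if metric == "cosine":
        num = (a * b).sum(axis=1)
        den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
        return 1.0 - num / den
    raise ValueError(metric)

def speed_peaks(speed):
    """Indices where the speed curve has a local maximum."""
    return np.where((speed[1:-1] > speed[:-2]) & (speed[1:-1] > speed[2:]))[0] + 1

rng = np.random.default_rng(0)
af = rng.random((200, 5))          # 200 frames, 5 AF dimensions (toy data)
for metric in ("L1", "L2", "cosine"):
    s = frame_speed(af, metric)
    print(metric, "peaks at frames:", speed_peaks(s)[:10])
```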
Speech Material
• IFAcorpus (Dutch, read + prepared speech; 8 speakers: 6 used for training and development, 2 for testing)
• Many different, rich annotation levels
Alignment Results
• Number of hits (detected → observed) versus time window size (cf. Wesenick & Kipp, 1996)
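A sketch of the evaluation presumably behind this result, under assumptions: a detected event counts as a 'hit' if it falls within a given time window of some observed (manual) boundary, and hits are reported as a function of window size. The boundary values below are toy data.

```python
import numpy as np

def count_hits(detected, observed, window):
    """Number of detected events within +/- window of some observed boundary."""
    detected = np.asarray(detected)[:, None]
    observed = np.asarray(observed)[None, :]
    return int((np.abs(detected - observed) <= window).any(axis=1).sum())

detected = [0.11, 0.30, 0.52, 0.78]      # peak locations (s), toy values
observed = [0.10, 0.33, 0.50, 0.80]      # manual boundaries (s), toy values
for window in (0.01, 0.02, 0.05):
    print(f"window {window:.2f}s: {count_hits(detected, observed, window)} hits")
```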
Asynchrony and Phonetic Classes
Average (in number of frames) and standard deviation of the difference between cosine-peak location and manual boundary. Only the transitions with the most extreme negative and positive differences are shown.

Manner transition        avg. (st.dev.)
Fricative-fricative      -0.57 (1.6)
Vowel-vowel              -0.31 (1.8)
…
Silence-approximant       0.49 (1.8)
Approximant-stop          0.63 (1.6)
Vowel-silence             0.64 (2.1)
Nasal-approximant         0.66 (1.0)
Open questions 1
• To what extent does the type of distance (L1, L2, cosine) distinguish fine detail in the alignment with manual segmentation?
  • For distances close to 0, all metrics provide roughly the same result
  • The metrics deviate for larger distances, thereby putting more weight on different types of distinctions (see the toy example below)
  • This means that event parsing along the AF trajectory may result in essentially different segmentations for different metrics
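A toy illustration of the point above (an assumed example, not from the paper): the same two frame-to-frame changes can be ranked differently by L1 and L2, so boundary detection along an AF trajectory can depend on the metric.

```python
import numpy as np

concentrated = np.array([1.0, 0.0, 0.0, 0.0])   # change in one AF dimension
spread       = np.array([0.3, 0.3, 0.3, 0.3])   # small change in all dimensions

for name, d in [("concentrated", concentrated), ("spread", spread)]:
    print(f"{name:12s} L1 = {np.abs(d).sum():.2f}  L2 = {np.linalg.norm(d):.2f}")
# L1 ranks the spread change as larger (1.20 > 1.00),
# while L2 ranks the concentrated change as larger (1.00 > 0.60).
```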
Open questions 2
• What about cue trading (by using weights)?
  • Difficult; depends on the phone
• What about the precise quantification of asynchrony?
  • The variation of observed AF vectors around a canonical AF vector = feature asynchrony + variation in the classifier output (see the sketch below)
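A minimal sketch of why this quantification is hard, with entirely synthetic numbers: the spread of observed AF vectors around a phone's canonical AF vector mixes genuine feature asynchrony with classifier output noise, so the raw variance alone cannot separate the two.

```python
import numpy as np

rng = np.random.default_rng(1)
canonical = np.array([1.0, 0.0, 0.0, 0.8, 0.2])          # canonical AF target (toy)
asynchrony = rng.normal(0.0, 0.15, size=(100, 5))        # true articulatory spread (toy)
classifier_noise = rng.normal(0.0, 0.10, size=(100, 5))  # estimation error (toy)

observed = canonical + asynchrony + classifier_noise
total_var = observed.var(axis=0)
print("total variance per AF dimension:", np.round(total_var, 3))
# Observed variance ~ asynchrony variance + noise variance (if independent);
# disentangling them requires an estimate of the classifier's own variance.
```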
Near-future plans
• Exploit the phenomena described here as design principles for alternative procedures for data-driven annotation and unit selection
• Design a word recognition framework based on an AF representation of speech
• Study usability for memory-prediction models