An Elitist Approach to Articulatory-Acoustic Feature Classification in English and in Dutch

Presentation Transcript

  1. An Elitist Approach to Articulatory-Acoustic Feature Classification in English and in Dutch
  Steven Greenberg, Shawn Chang and Mirjam Wester
  International Computer Science Institute, 1947 Center Street, Berkeley, CA 94704
  http://www.icsi.berkeley.edu/~steveng  steveng@icsi.berkeley.edu
  http://www.icsi.berkeley.edu/~shawnc  shawnc@icsi.berkeley.edu
  Mirjam Wester, A2RT, Department of Language and Speech, Nijmegen University, Netherlands
  http://www.lands.let.kun.nl/Tspublic/wester  wester@let.kun.nl

  2. Acknowledgements and Thanks
  • Automatic Feature Classification and Analysis: Joy Hollenback, Lokendra Shastri, Rosaria Silipo
  • Research Funding: U.S. National Science Foundation, U.S. Department of Defense

  3-15. Motivation for Automatic Transcription
  • Many Properties of Spontaneous Spoken Language Differ from Those of Laboratory and Citation Speech
    • There are systematic patterns in “real” speech that potentially reveal underlying principles of linguistic organization
  • Phonetic and Prosodic Annotation Material is of Limited Quantity
    • Phonetic and prosodic material is important for understanding spoken language and for developing superior recognition and synthesis technology
  • Manual Annotation of Phonetic and Prosodic Material is a Pain in the Butt to Produce
    • Hand labeling and segmentation are time-consuming and expensive
    • It is difficult to find qualified transcribers, and training can be arduous
  • Automatic Alignment Systems (Used in Speech Recognition) are Inaccurate in Terms of Both Labeling and Segmentation
    • Forced-alignment-based segmentation is poor - ca. 40% off on phone boundaries
    • Phone classification error is ca. 30-50%
    • Speech recognition systems do not currently deal with prosody
  • Automatic Transcription is Likely to Aid the Development of Speech Recognition and Synthesis Technology
    • And is therefore worth the effort to develop

  16-19. Road Map of the Presentation
  • Introduction (Steven Greenberg)
    • Motivation for developing automatic phonetic transcription systems
    • Rationale for the current focus on articulatory-acoustic features (AFs)
    • The development corpus - NTIMIT
    • Justification for using NTIMIT for development of AF classifiers
  • The ELITIST Approach and Its Application to English (Shawn Chang)
    • The baseline system
    • The ELITIST approach
    • Manner-specific classification for place-of-articulation features
  • Application of the ELITIST Approach to Dutch (Mirjam Wester)
    • The training and testing corpus - VIOS
    • The nature of cross-linguistic transfer of articulatory-acoustic features
    • The ELITIST approach to frame selection as applied to the VIOS corpus
    • Improvement of place-of-articulation classification using manner-specific training in Dutch
  • Conclusions and Future Work (Steven Greenberg)
    • Development of fully automatic phonetic and prosodic transcription systems
    • An empirically oriented discipline based on annotated corpora

  20. Part One - INTRODUCTION
  • Motivation for Developing Automatic Phonetic Transcription Systems
  • Rationale for the Current Focus on Articulatory-Acoustic Features
  • Description of the Development Corpus - NTIMIT
  • Justification for Using the NTIMIT Corpus

  21-23. Corpus Generation - Objectives
  • Provides Detailed, Empirical Material for the Study of Spoken Language
    • Such data provide an important basis for scientific insight and understanding
    • Facilitates development of new models for spoken language
  • Provides Training Material for Technology Applications
    • Automatic speech recognition, particularly pronunciation models
    • Speech synthesis, ditto
    • Cross-linguistic transfer of technology algorithms
  • Promotes Development of NOVEL Algorithms for Speech Technology
    • Pronunciation models and lexical representations for automatic speech recognition and speech synthesis
    • Multi-tier representations of spoken language

  24. Our Focus - A Corpus-Centric View of Spoken Language
  • Our Focus in Today’s Presentation is on Articulatory Feature Classification
  • Other levels of linguistic representation are also extremely important to annotate

  25-32. Rationale for Articulatory-Acoustic Features
  • Articulatory-Acoustic Features (AFs) are the “Building Blocks” of the Lowest (i.e., Phonetic) Tier of Spoken Language
    • AFs can be combined in a multitude of ways to specify virtually any speech sound found in the world’s languages
    • AFs are therefore more appropriate for cross-linguistic transfer than phonetic segments
  • AFs are Systematically Organized at the Level of the Syllable
    • Syllables are a basic articulatory unit in speech
    • The pronunciation patterns observed in casual conversation are systematic at the AF level, but not at the phonetic-segment level, and can therefore be used to develop more accurate and flexible pronunciation models than phonetic segments
  • AFs are Potentially More Effective in Speech Recognition Systems
    • More accurate and flexible pronunciation models (tied to syllabic and lexical units)
    • Generally more robust under acoustic interference than phonetic segments
    • The relatively small number of alternative features along each AF dimension makes classification inherently more robust than for phonetic segments
  • AFs are Potentially More Effective in Speech Synthesis Systems
    • More accurate and flexible pronunciation models (tied to syllabic and lexical units)

  33-37. Primary Development Corpus - NTIMIT
  • Sentences Read by Native Speakers of American English
    • Quasi-phonetically balanced set of materials
    • Wide range of dialect variability, both genders, variation in speaker age
    • Relatively low semantic predictability: “She washed his dark suit in greasy wash water all year”
  • Corpus Manually Labeled and Segmented at the Phonetic-Segment Level
    • The precision of the phonetic annotation provides an excellent training corpus
    • Corpus was annotated at MIT
  • A Large Amount of Annotated Material
    • Over 2.5 hours of material used for training the classifiers
    • 20 minutes of material used for testing
  • Relatively Canonical Pronunciation Ideal for Training AF Classifiers
    • Formal pronunciation patterns provide a means of deriving articulatory features from phonetic-segment labels via mapping rules (cf. Proceedings paper for details)
  • NTIMIT is a Telephone Pass-band Version of the TIMIT Corpus
    • Sentential material passed through a channel between 0.3 and 3.4 kHz
    • Provides the capability of transfer to other telephone corpora (such as VIOS)
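The phone-to-AF mapping rules mentioned above can be sketched as a simple lookup table. The entries and feature values below are illustrative examples of the kind of mapping the slide describes, not the actual rules from the Proceedings paper:

```python
# Illustrative (hypothetical) phone-to-articulatory-feature mapping:
# each phone label maps to one value per AF dimension.
PHONE_TO_AF = {
    "p":  {"voicing": "voiceless", "manner": "stop",      "place": "bilabial"},
    "b":  {"voicing": "voiced",    "manner": "stop",      "place": "bilabial"},
    "s":  {"voicing": "voiceless", "manner": "fricative", "place": "alveolar"},
    "m":  {"voicing": "voiced",    "manner": "nasal",     "place": "bilabial"},
    "iy": {"voicing": "voiced",    "manner": "vowel",     "place": "front"},
}

def af_targets(phone_labels, dimension):
    """Map frame-level phone labels to training targets for one AF dimension."""
    return [PHONE_TO_AF[p][dimension] for p in phone_labels]

print(af_targets(["b", "iy", "s"], "manner"))  # ['stop', 'vowel', 'fricative']
```

Because every frame inside a hand-segmented phone inherits that phone's feature values, canonical (formal) pronunciations make the derived AF targets reliable, which is the justification given for training on NTIMIT.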

  38. Part Two - THE ELITIST APPROACH
  • The Baseline System for Articulatory-Acoustic Feature Classification
  • The ELITIST Approach to Systematic Frame Selection for AF Classification
  • Improving Place-of-Articulation Classification Using Manner-Specific Training

  39-40. The Baseline System for AF Classification
  • Spectro-Temporal Representation of the Speech Signal
    • Derived from the logarithmically compressed, critical-band energy pattern
    • 25-ms analysis windows (i.e., a frame)
    • 10-ms frame-sampling interval (i.e., 60% overlap between adjacent frames)
  • Multilayer Perceptron (MLP) Neural Network Classifiers
    • Single hidden layer of 200-400 units, trained with back-propagation
    • Nine frames of context used in the input
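The analysis-window parameters above can be sketched as a simple framing routine. The 8-kHz sampling rate is an assumption (consistent with telephone-band speech); the log critical-band energy computation itself is omitted here:

```python
import numpy as np

def frame_signal(x, sr=8000, win_ms=25, hop_ms=10):
    """Slice a signal into overlapping 25-ms analysis frames at a 10-ms
    frame-sampling interval, i.e. 60% overlap between adjacent frames."""
    win = int(sr * win_ms / 1000)   # 200 samples at 8 kHz
    hop = int(sr * hop_ms / 1000)   # 80 samples
    n_frames = 1 + max(0, (len(x) - win) // hop)
    return np.stack([x[i * hop : i * hop + win] for i in range(n_frames)])

x = np.zeros(8000)                  # 1 s of (silent) telephone-band audio
frames = frame_signal(x)
print(frames.shape)                 # (98, 200)
```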

  41. The Baseline System for AF Classification
  • An MLP Network for Each Articulatory Feature (AF) Dimension
    • A separate network is trained on voicing, place and manner of articulation, etc.
    • Training targets were derived from hand-labeled phonetic transcripts and a fixed phone-to-AF mapping
    • “Silence” was included as a feature in the classification of each AF dimension
    • All of the results reported are for FRAME accuracy (not segmental accuracy)
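The nine-frame input context described above can be sketched by stacking each frame with its four neighbors on either side. The 14-dimensional critical-band feature vector is an assumed size for illustration, not the system's actual dimensionality:

```python
import numpy as np

def stack_context(feats, context=9):
    """Stack `context` frames (center frame +/- 4) into one MLP input vector,
    padding at the utterance edges by repeating the first/last frame."""
    half = context // 2
    padded = np.concatenate([feats[:1].repeat(half, axis=0),
                             feats,
                             feats[-1:].repeat(half, axis=0)])
    return np.stack([padded[i : i + context].ravel()
                     for i in range(len(feats))])

feats = np.random.randn(100, 14)    # 100 frames x 14 features (illustrative)
X = stack_context(feats)
print(X.shape)                      # (100, 126) - 9 frames x 14 features each
```

Each row of `X` would then be fed to the per-dimension MLPs (voicing, manner, place, etc.), each producing one label per 10-ms frame.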

  42. The Baseline System for AF Classification
  • Focus on Articulatory Feature Classification Rather than Phone Identity
    • Provides a more accurate means of assessing the MLP-based classification system

  43. Baseline System Performance Summary (NTIMIT Corpus)
  • Classification of Articulatory Features Exceeds 80% - Except for Place
  • Objective - Improve Classification across All AF Dimensions, but Particularly on Place of Articulation

  44-45. Not All Frames are Created Equal
  • Correlation Between Frame Position and Classification Accuracy for MANNER-of-Articulation Features
    • The 20% of the frames closest to the segment BOUNDARIES are 73% correct
    • The 20% of the frames closest to the segment CENTER are 90% correct
  • Correlation Between Frame Position within a Segment and Classifier Output for MANNER Features
    • The 20% of the frames closest to the segment BOUNDARIES have a mean maximum output (“confidence”) level of 0.797
    • The 20% of the frames closest to the segment CENTER have a mean maximum output (“confidence”) level of 0.892
    • This dynamic range of ca. 0.1 (in absolute terms) is HIGHLY significant
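The boundary/center partition behind these statistics can be sketched by normalizing each frame's distance from its segment's center; the 20% cutoffs below mirror the quantiles quoted in the slide:

```python
import numpy as np

def position_within_segment(n_frames):
    """Distance of each frame from the segment center, normalized to [0, 1]
    (0 = segment center, 1 = segment boundary)."""
    idx = np.arange(n_frames)
    center = (n_frames - 1) / 2
    return np.abs(idx - center) / max(center, 1)

pos = position_within_segment(10)
center_mask = pos <= np.quantile(pos, 0.2)    # ~20% of frames nearest the center
boundary_mask = pos >= np.quantile(pos, 0.8)  # ~20% nearest the boundaries
```

Averaging frame accuracy (or maximum network output) separately over `center_mask` and `boundary_mask` frames reproduces the kind of comparison reported above.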

  46-48. Not All Frames are Created Equal
  • Manner Classification is Best for Frames in the Phonetic-Segment Center
  • MLP Network Confidence Level is Highly Correlated with Frame Accuracy
  • The Most Confidently Classified Frames are Generally More Accurate

  49. Selecting a Threshold for Frame Selection
  • The Correlation Between Neural-Network Confidence Level and Frame Position within the Phonetic Segment Can Be Exploited to Enhance Articulatory-Feature Classification
    • This insight provides the basis for the “Elitist” approach

  50. Selecting a Threshold for Frame Selection
  • The Most Confidently Classified Frames are Generally More Accurate
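The selection rule behind the Elitist approach (keep only the frames the network classifies most confidently) can be sketched as follows; the threshold value and toy posteriors are illustrative, since the slides describe choosing the cutoff empirically:

```python
import numpy as np

def elitist_select(posteriors, threshold=0.7):
    """Keep only the frames whose maximum network output ('confidence')
    meets the threshold; return their indices and winning class labels."""
    conf = posteriors.max(axis=1)
    keep = np.flatnonzero(conf >= threshold)
    return keep, posteriors[keep].argmax(axis=1)

# Toy posteriors over 3 manner classes for 4 frames
P = np.array([[0.90, 0.05, 0.05],   # confident  -> kept
              [0.40, 0.35, 0.25],   # ambiguous  -> discarded
              [0.10, 0.80, 0.10],   # confident  -> kept
              [0.34, 0.33, 0.33]])  # ambiguous  -> discarded
keep, labels = elitist_select(P)
print(keep, labels)                 # [0 2] [0 1]
```

Because confidence correlates with distance from segment boundaries, this filter preferentially retains segment-center frames, the ones shown above to be classified most accurately.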