Acoustic and Linguistic Characterization of Spontaneous Speech

Masanobu Nakamura, Koji Iwano, and Sadaoki Furui
Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan

Introduction (1/2)
Background
• Present speech recognition technology
  • High recognition accuracy for read speech
  • Rather poor accuracy for spontaneous speech
• Recognition accuracy for spontaneous speech needs to be improved.
• What are the differences between spontaneous and read speech?
• Why is the recognition accuracy for spontaneous speech low?

Introduction (2/2)
Goals
• Statistical and quantitative analysis of the acoustic and linguistic differences between spontaneous and read speech
• Investigation of the acoustic and linguistic characteristics that affect speech recognition performance on spontaneous speech

Corpus of Spontaneous Japanese (CSJ)
• A large-scale spontaneous speech corpus
  • Roughly 7M words (morphemes) with a total speech length of 650 hours
  • Orthographic and phonetic transcriptions are manually given.
• Speaking styles
  • Academic presentations (AP)
    • Live recordings of academic presentations
    • Fields: engineering, social science, and humanities
  • Extemporaneous presentations (EP)
    • Studio recordings of speech by paid lay speakers
    • Small audience and relatively relaxed atmosphere
    • More informal than AP
  • Dialogue speech (D)
    • Interviews, task-oriented dialogues, and free dialogues
  • Read speech (R)
    • Reading of the AP or EP transcriptions by the same speaker

Disfluency ratio
• Disfluency types: filled pauses (F), word fragments (W), and reduced articulation or mispronunciation (M)
• Approximately one-tenth of the words in the spontaneous speech in the CSJ are disfluencies.
• The ratio of F is significantly higher than that of W and M.

Acoustic characteristics

Acoustic feature extraction
• 39-dimensional feature vectors
  • 12-dimensional MFCC, log-energy, and their first and second derivatives
  • 25 ms window shifted every 10 ms
  • CMS (cepstral mean subtraction) is applied to each utterance.
• HMMs
  • Monophone HMMs with a single Gaussian mixture component
  • Left-to-right topology with three self-loops
  • Trained using samples of every combination of phoneme, speaker, and utterance style
• Acoustic features for each phoneme
  • Mean and variance vectors of the 12-dimensional MFCC at the 2nd state of the HMM
• Target phonemes
  • 31 Japanese phonemes (10 vowels and 21 consonants)
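The feature setup above can be checked with a little arithmetic. The following is a minimal sketch, not the authors' front end: it counts 25 ms frames at a 10 ms shift, confirms the 39-dimensional layout, and applies per-utterance cepstral mean subtraction. The 16 kHz sampling rate and the helper names `num_frames` and `cepstral_mean_subtraction` are assumptions made for illustration.

```python
from typing import List

# Static features: 12 MFCCs + log-energy; the first and second derivatives
# triple the size of the vector.
STATIC_DIM = 12 + 1
FEATURE_DIM = STATIC_DIM * 3  # 39


def num_frames(num_samples: int, sample_rate: int,
               win_ms: float = 25.0, shift_ms: float = 10.0) -> int:
    """Number of complete analysis windows that fit into the signal."""
    win = int(sample_rate * win_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // shift


def cepstral_mean_subtraction(frames: List[List[float]]) -> List[List[float]]:
    """Subtract the per-utterance mean from every cepstral coefficient."""
    if not frames:
        return []
    dim = len(frames[0])
    mean = [sum(f[k] for f in frames) / len(frames) for k in range(dim)]
    return [[f[k] - mean[k] for k in range(dim)] for f in frames]


# One second of audio at an assumed 16 kHz sampling rate.
frames_per_second = num_frames(16000, 16000)  # 98 complete windows
```

Subtracting the utterance mean removes a constant offset in the cepstral domain, which is why it is applied per utterance rather than per speaker or per corpus.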

Reduction ratio
• Quantitative analysis of the spectral space reduction for spontaneous speech
• Definition
  • μ_p(X) is the mean vector of a phoneme p uttered with a speaking style X.
  • μ_p(R) is the mean vector of the phoneme p in read speech.
  • Av: average over all phonemes
  • || ||: Euclidean norm/distance
[Figure: phoneme p in speaking style X and in read speech, each shown relative to the center of the distribution of all phonemes]
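The defining equation itself does not survive in this transcript. The sketch below assumes one reading that is consistent with the listed symbols: the distance of a phoneme's mean from the center of all phonemes in style X, divided by the same distance in read speech. That form, and the names `centroid` and `reduction_ratio`, are assumptions, and the two-dimensional vectors are toy values rather than CSJ measurements.

```python
import math
from typing import Dict, List

Vector = List[float]


def centroid(means: Dict[str, Vector]) -> Vector:
    """Av[.]: average of the mean vectors over all phonemes."""
    dim = len(next(iter(means.values())))
    return [sum(v[k] for v in means.values()) / len(means) for k in range(dim)]


def euclidean(a: Vector, b: Vector) -> float:
    """|| a - b ||: Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def reduction_ratio(phoneme: str,
                    means_x: Dict[str, Vector],
                    means_r: Dict[str, Vector]) -> float:
    """Assumed form: distance to the center in style X over that in read speech."""
    numerator = euclidean(means_x[phoneme], centroid(means_x))
    denominator = euclidean(means_r[phoneme], centroid(means_r))
    return numerator / denominator


# Toy mean vectors (hypothetical): style X lies closer to the center.
toy_read = {"a": [2.0, 0.0], "i": [-2.0, 0.0]}
toy_style_x = {"a": [1.0, 0.0], "i": [-1.0, 0.0]}
ratio_a = reduction_ratio("a", toy_style_x, toy_read)  # 0.5
```

Under this reading, a value below 1 means the phoneme sits closer to the center of the phoneme distribution than it does in read speech.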

Reduction ratio averaged over 10 speakers
• The MFCC space is reduced for almost all the phonemes, and the reduction is most significant for dialogue utterances.
[Figure: reduction ratio for each phoneme and speaking style, with red_p(X) = 1 marked]

Reduction ratio averaged over vowels and consonants
• The distribution of spontaneous speech is reduced in comparison with read speech for all the speaking styles, and the reduction is most significant for dialogue speech.

Between-phoneme distances
• The reduction of the MFCC distance between each phoneme pair is measured using the Mahalanobis distance.
[Figure: phoneme cepstra (r, a, n, k, m, u) in the MFCC space, with the Mahalanobis distance between each phoneme pair]

Mahalanobis distance
• Mahalanobis distance D_ij(X) between phonemes i and j:
• K: dimension of the MFCC vector (K = 12)
• μ_ik and σ_ik²: k-th elements of the mean and variance vectors of the MFCC for phoneme i uttered with a speaking style X
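The exact expression for the distance is not reproduced in this transcript. As a rough illustration only, the sketch below uses a common diagonal-covariance variant in which each squared mean difference is scaled by the average of the two phonemes' variances. That pooling choice and the function name `mahalanobis_diag` are assumptions and may differ from the authors' formula.

```python
import math
from typing import List


def mahalanobis_diag(mean_i: List[float], var_i: List[float],
                     mean_j: List[float], var_j: List[float]) -> float:
    """Distance between two diagonal Gaussians (assumed pooled-variance form)."""
    total = 0.0
    for mi, vi, mj, vj in zip(mean_i, var_i, mean_j, var_j):
        pooled = (vi + vj) / 2.0  # assumption: average the two variances
        total += (mi - mj) ** 2 / pooled
    return math.sqrt(total)


# With unit variances the measure reduces to the Euclidean distance.
example = mahalanobis_diag([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0])  # 5.0
```

In the setting of the slides, the inputs would be the K = 12 mean and variance elements taken from the 2nd HMM state of each phoneme.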

Cumulative frequency distribution of Mahalanobis distances
• Mahalanobis distances between every phoneme pair for each speaking style
• The Mahalanobis distance between phonemes decreases as the spontaneity of utterances increases.
• The more spontaneous the utterances become, the more reduced the cepstrum space becomes.
[Figure: cumulative frequency of distances for each speaking style, annotated "Increase of spontaneity"]

Relationship between phoneme distances and phoneme accuracy (1/2)
• Investigation of the relationship between mean phoneme distances and phoneme recognition accuracy
• Acoustic model
  • A common model for all speaking styles
  • Trained on data from 100 males and 100 females for AP and 150 males and 150 females for EP (about 2M phoneme samples each)
• Language model
  • Phoneme network constrained by phoneme-class probabilities

Relationship between phoneme distances and phoneme accuracy (2/2)
• Strong correlation between mean phoneme distance and phoneme accuracy (correlation coefficient: 0.97)
• The reduction of the distances between phonemes is a major factor contributing to the degradation of spontaneous speech recognition accuracy.
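A correlation coefficient of this kind is typically the Pearson coefficient between paired measurements, here one pair per condition. The following minimal sketch shows the computation. The input lists are made-up toy numbers chosen to be exactly linear; they are not the values behind the reported figure, and the function name `pearson` is an assumption.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data only: a perfectly linear relation gives +1, its reverse gives -1.
toy_distance = [1.0, 2.0, 3.0, 4.0]
toy_accuracy = [2.0, 4.0, 6.0, 8.0]
r_positive = pearson(toy_distance, toy_accuracy)
r_negative = pearson(toy_distance, list(reversed(toy_accuracy)))
```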

Linguistic characteristics

Written text and spontaneous speech corpora
• Mainichi newspaper (NP)
  • Written text corpus
• News commentary (NC)
  • Transcription of utterances spoken from prepared text
• Academic presentations (AP) (in CSJ)
• Extemporaneous presentations (EP) (in CSJ)
• Dialogue (D) (in CSJ)

Part-of-speech observation frequency
• The frequency of nouns is much higher in the newspaper corpus than in the spontaneous speech.
• The frequency of fillers is much higher in the dialogue than in news commentary and presentations.
[Figure: part-of-speech frequencies by corpus; labeled categories: Noun, Fillers]

Perplexity matrix
• Trigram language models are built for each speaking style, and test-set perplexity is measured for every combination of styles.
• Test-set perplexity for spontaneous speech is roughly five times larger than that for written newspaper texts.
[Figure: perplexity matrix, with the diagonal elements highlighted]
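Test-set perplexity is the exponential of the negative average log-probability that a model assigns to the test words. The sketch below, under stated assumptions, computes it for a trigram model and fills a style-by-style matrix. The `logprob(w1, w2, w3)` callable interface, the sentence padding symbols, and the choice to count the end-of-sentence token as a word are all assumptions made for illustration, not details taken from the slides.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: natural-log probability of w3 given (w1, w2).
TrigramLogProb = Callable[[str, str, str], float]


def trigram_perplexity(sentences: Sequence[Sequence[str]],
                       logprob: TrigramLogProb) -> float:
    """exp(-average log-probability per predicted token)."""
    total_logprob = 0.0
    total_tokens = 0
    for words in sentences:
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for i in range(2, len(padded)):
            total_logprob += logprob(padded[i - 2], padded[i - 1], padded[i])
            total_tokens += 1
    return math.exp(-total_logprob / total_tokens)


def perplexity_matrix(models: Dict[str, TrigramLogProb],
                      test_sets: Dict[str, List[List[str]]]
                      ) -> Dict[str, Dict[str, float]]:
    """Rows: training style of the model; columns: style of the test set."""
    return {
        train: {test: trigram_perplexity(sents, lp)
                for test, sents in test_sets.items()}
        for train, lp in models.items()
    }


def uniform_over_four(w1: str, w2: str, w3: str) -> float:
    """Toy model: every token has probability 1/4, so perplexity is 4."""
    return math.log(0.25)


toy_pp = trigram_perplexity([["a", "b"], ["c"]], uniform_over_four)
```

The diagonal of such a matrix holds the matched-condition perplexities, where the model and the test set share a style.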

Distance matrix for visualization
• Visualization of relationships between the language models
• Symmetrization of the perplexity matrix as follows:
[Figure: PP matrix (PP(a_ij)) → symmetrization → distance matrix (D(d_ij)) → visualization]
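The authors' symmetrization equation is not reproduced in this transcript, so the following is only a generic illustration of the idea. It assumes the simplest option, averaging the two off-diagonal perplexities and setting the diagonal to zero, which may well differ from the equation used in the paper. The function name `symmetrize` and the toy matrix are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def symmetrize(pp: Matrix) -> Matrix:
    """Assumed form: d_ij = (pp_ij + pp_ji) / 2 for i != j, and d_ii = 0."""
    n = len(pp)
    if any(len(row) != n for row in pp):
        raise ValueError("perplexity matrix must be square")
    return [
        [0.0 if i == j else (pp[i][j] + pp[j][i]) / 2.0 for j in range(n)]
        for i in range(n)
    ]


# Toy asymmetric matrix (hypothetical numbers, not results from the slides).
toy_pp_matrix = [[10.0, 30.0],
                 [50.0, 20.0]]
toy_distance_matrix = symmetrize(toy_pp_matrix)  # [[0.0, 40.0], [40.0, 0.0]]
```

Whatever form is used, the result needs to be symmetric with a zero diagonal before it can serve as input to multidimensional scaling.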

Correction
• Equation (3) in the paper is wrong.
• Correct equation (3) is as follows:

Difference between language models
• Relationship between the language models, projected onto a two-dimensional space derived from the distance matrix using multidimensional scaling (MDS)
• Newspaper text and dialogue are situated at the two extreme positions.
• Presentations and news commentary are situated in between.

Relationship between perplexity and word accuracy (1/2)
• Investigation of the relationship between test-set perplexity and word accuracy
• Acoustic model
  • A common model for all speaking styles
  • Trained on data from 10 males and 10 females for each speaking style (about 750K phoneme samples)
• Language models
  • Separate models for each speaking style

Relationship between perplexity and word accuracy (2/2)
• Test-set perplexity (diagonal elements of the PP matrix) versus word accuracy
• Experimental results show a high correlation (correlation coefficient: -0.98) between test-set perplexity and recognition accuracy across the different speaking styles.

Conclusion (1/2)
• Clarified the differences in acoustic and linguistic characteristics between spontaneous speech and read speech
• Acoustic characteristics
  • The spectral distribution of spontaneous speech is reduced in comparison with that of read speech.
  • The more spontaneous the speech, the smaller the distances between phonemes.
  • There is a high correlation between the mean phoneme distance and the phoneme recognition accuracy.
  • Spontaneous speech can be characterized by the reduction of spectral space in comparison with that of read speech, and this is one of the major factors contributing to the decrease in phoneme recognition accuracy.

Conclusion (2/2)
• Linguistic characteristics
  • The perplexity of language models for spontaneous speech is significantly higher than that for written text.
  • Spontaneous speech frequently includes ungrammatical phenomena and linguistic variations, including repetitions and repairs.
  • There is a high correlation between the test-set perplexity and the word recognition accuracy.
  • The increase in test-set perplexity for spontaneous speech is one of the major factors contributing to the decrease in word recognition accuracy.

Future research
• Analysis over wider ranges of spontaneous speech, using utterances other than those included in the CSJ
  • Is the relationship between phoneme distances and phoneme recognition accuracy general?
  • Is the relationship between test-set perplexity and word recognition accuracy general?
• How to incorporate filled pauses, repairs, hesitations, repetitions, partial words, and other disfluencies of spontaneous speech
• Investigation of how the results obtained in this paper can be used to improve recognition performance for spontaneous speech
  • Creating methods for adapting acoustic and language models to spontaneous speech

Thank you very much for your kind attention! E-mail: masa@furui.cs.titech.ac.jp
