Understanding Spoken Language using

Statistical and Computational Methods

Steven Greenberg

International Computer Science Institute

1947 Center Street, Berkeley, CA 94704

http://www.icsi.berkeley.edu/~steveng

(contains electronic versions of papers and links to data)

Patterns of Speech Sounds in Unscripted Communication -

Production, Perception, Phonology. Akademie Sankelmark, October 8-11, 2000



How I Learned to Stop Worrying

and Use

The Canonical Form


Disclaimer

I am a Phonetician - NOT!

(many thanks for the invite)


No Scientist is an Island …

IMPORTANT COLLEAGUES

PHONETIC TRANSCRIPTION OF SPONTANEOUS SPEECH (SWITCHBOARD)

Candace Cardinal, Rachel Coulston, Dan Ellis, Eric Fosler, Joy Hollenback, John Ohala, Colleen Richey

STATISTICAL ANALYSIS OF PRONUNCIATION VARIATION

Eric Fosler, Leah Hitchcock, Joy Hollenback

ARTICULATORY-ACOUSTIC BASIS OF CONSONANT RECOGNITION

Leah Hitchcock, Rosaria Silipo

AUTOMATIC PHONETIC TRANSCRIPTION OF SPONTANEOUS SPEECH

Shawn Chang, Lokendra Shastri


Germane Publications

STATISTICAL PROPERTIES OF SPOKEN LANGUAGE AND PRONUNCIATION MODELING

Fosler-Lussier, E., Greenberg, S. and Morgan, N. (1999) Incorporating contextual phonetics into automatic speech recognition. Proceedings of the International Congress of Phonetic Sciences, San Francisco.

Greenberg, S. and Fosler-Lussier, E. (2000) The uninvited guest: Information's role in guiding the production of spontaneous speech, in the Proceedings of the Crest Workshop on Models of Speech Production: Motor Planning and Articulatory Modelling, Kloster Seeon, Germany.

Greenberg, S. (1999) Speaking in shorthand - A syllable-centric perspective for understanding pronunciation variation, Speech Communication, 29, 159-176.

Greenberg, S. (1997) On the origins of speech intelligibility in the real world. Proceedings of the ESCA Workshop on Robust Speech Recognition for Unknown Communication Channels, Pont-a-Mousson, France, pp. 23-32.

Greenberg, S., Hollenback, J. and Ellis, D. (1996) Insights into spoken language gleaned from phonetic transcription of the Switchboard corpus, in Proc. Intern. Conf. Spoken Lang. (ICSLP), Philadelphia, pp. S24-27.

PERCEPTUAL BASES OF SPEECH INTELLIGIBILITY

Greenberg, S., Arai, T. and Silipo, R. (1998) Speech intelligibility derived from exceedingly sparse spectral information, Proceedings of the International Conference on Spoken Language Processing, Sydney, pp. 74-77.

Greenberg, S. (1996) Understanding speech understanding - towards a unified theory of speech perception. Proceedings of the ESCA Tutorial and Advanced Research Workshop on the Auditory Basis of Speech Perception, Keele, England, pp. 1-8.

Silipo, R., Greenberg, S. and Arai, T. (1999) Temporal constraints on speech intelligibility as deduced from exceedingly sparse spectral representations, Proceedings of Eurospeech, Budapest.

AUTOMATIC PHONETIC TRANSCRIPTION AND SEGMENTATION

Chang, S., Shastri, L. and Greenberg, S. (2000) Automatic phonetic transcription of spontaneous speech (American English). Proc. Int. Conf. Spoken Lang. Proc., Beijing.

Shastri, L., Chang, S. and Greenberg, S. (1999) Syllable detection and segmentation using temporal flow neural networks. Proceedings of the International Congress of Phonetic Sciences, San Francisco, pp. 1721-1724.

http://www.icsi.berkeley.edu/~steveng



Language - The Traditional Perspective

The “classical” view of spoken language posits a quasi-arbitrary relation between the lower and higher tiers of linguistic organization

Phonetic orthography


Language - A Syllable-Centric Perspective

A more empirical perspective of spoken language focuses on the syllable as the interface between “sound” and “meaning”

Within this framework, the relationship between the syllable and the higher and lower tiers is non-arbitrary and statistically systematic


Take Home Messages

  • SPONTANEOUS SPEECH DIFFERS IN CERTAIN SYSTEMATIC WAYS FROM CANONICAL DESCRIPTIONS OF LANGUAGE AND MORE FORMAL SPEAKING STYLES

    • Such insights can only be obtained at present with large amounts of phonetically labeled material (in this instance, four hours of telephone dialogues recorded in the United States - the Switchboard corpus)

  • THE PHONETIC PROPERTIES OF SPOKEN LANGUAGE APPEAR TO BE ORGANIZED AT THE SYLLABIC RATHER THAN AT THE PHONE LEVEL

    • Onsets are pronounced in canonical (i.e., dictionary) fashion 80-90% of the time

    • Nuclei and codas are expressed canonically only 60% of the time

    • Nuclei tend to be realized as vowels different from the canonical form

    • Codas are often deleted entirely

    • Articulatory-acoustic features are also organized in systematic fashion with respect to syllabic position

    • Therefore, it is important to model spoken language at the syllabic level

  • THE SYLLABIC ORGANIZATION OF SPONTANEOUS SPEECH CALLS INTO QUESTION SEGMENTALLY BASED PHONETIC ORTHOGRAPHY

    • It may be unrealistic to assume that any phonetic transcription based exclusively on segments (such as the IPA) is truly capable of capturing the important phonetic detail of spontaneous material


Take Home Messages

  • PHONETIC PROPERTIES OF SPONTANEOUS SPEECH REFLECT INFORMATION CONTENT

Greenberg, S. and Fosler-Lussier, E. (2000) The uninvited guest: Information's role in guiding the production of spontaneous speech, in the Proceedings of the Crest Workshop on Models of Speech Production: Motor Planning and Articulatory Modelling, Kloster Seeon, Germany.


Road Map

  • PHONETIC TRANSCRIPTION OF SPONTANEOUS AMERICAN ENGLISH

    • Provides the basis for the statistical analyses of spontaneous material

  • A BRIEF TOUR OF PRONUNCIATION VARIATION FROM THE PERSPECTIVE OF:

    • Phonetic segments

    • Words

    • Syllables

    • Articulatory-acoustic features

  • PERCEPTUAL EVIDENCE

    • The articulatory-acoustic basis of consonant recognition

    • Not all articulatory-acoustic features are created equal - place-of-articulation cues appear to be most important for consonant recognition

  • COMPUTATIONAL METHODS

    • Automatic methods for phonetic transcription based on articulatory-acoustic features

    • These are the most likely means of generating sufficient empirical data to rigorously test hypotheses germane to spoken language



Phonetic Transcription of Spontaneous English

  • TELEPHONE DIALOGUES OF 5-10 MINUTES DURATION - SWITCHBOARD

  • AMOUNT OF MATERIAL MANUALLY TRANSCRIBED    

    • 3 hours labeled at the phone level and segmented at the syllabic level (this material was later phonetically segmented by automatic methods)

    • 1 hour labeled and segmented at the phonetic-segment level

  • DIVERSITY OF MATERIAL TRANSCRIBED

    • Spans speech of both genders (ca. 50/50%) reflecting a wide range of American dialectal variation (6 regions + “army brat”), speaking rate and voice quality

  • TRANSCRIBED BY WHOM?

    • 7 undergraduates and 1 graduate student, all enrolled at UC-Berkeley. Most of the corpus was transcribed by three individuals out of the original eight

    • Supervised by Steven Greenberg and John Ohala

  • TRANSCRIPTION SYSTEM

    • A variant of Arpabet, with phonetic diacritics such as: _gl, _cr, _fr, _n, _vl, _vd

  • HOW LONG DOES TRANSCRIPTION TAKE? (Don’t Ask!)

    • 388 times real time for labeling and segmentation at the phonetic-segment level

    • 150 times real time for labeling phonetic segments and segmenting syllables

  • HOW WAS LABELING AND SEGMENTATION PERFORMED?

    • Using a display of the signal waveform, spectrogram, word transcription and “forced alignments” (estimates of phones and boundaries) + audio (listening at multiple time scales - phone, word, utterance) on Sun workstations

  • DATA AVAILABLE AT - http://www.icsi.berkeley.edu/real/stp
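
The real-time factors above translate into person-hours by simple multiplication. The sketch below works this out, assuming (the slide does not say so explicitly) that the 388x factor applies to the 1 hour labeled and segmented at the phonetic-segment level and the 150x factor to the 3 hours segmented at the syllabic level. The function and variable names are illustrative only.

```python
# Audio durations and real-time factors are taken from the slide above.
# ASSUMPTION: 388x goes with the 1-hour subset, 150x with the 3-hour subset.
PHONE_LEVEL = {"audio_hours": 1.0, "real_time_factor": 388}
SYLLABLE_LEVEL = {"audio_hours": 3.0, "real_time_factor": 150}


def labor_hours(audio_hours: float, real_time_factor: float) -> float:
    """Person-hours needed to transcribe `audio_hours` of speech."""
    return audio_hours * real_time_factor


phone_effort = labor_hours(**PHONE_LEVEL)        # 388.0 hours
syllable_effort = labor_hours(**SYLLABLE_LEVEL)  # 450.0 hours
total_effort = phone_effort + syllable_effort    # 838.0 hours

print(f"phonetic-segment level: {phone_effort:.0f} h")
print(f"syllable level:         {syllable_effort:.0f} h")
print(f"total:                  {total_effort:.0f} h")
```

Under these assumptions the four hours of material represent on the order of 800 person-hours of manual work, which is the motivation for the automatic methods discussed later.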


A Brief Tour of

Pronunciation Variation

in

Spontaneous American English


Cumulative Word Frequency in English

Focus on 100 most common words

The 10 most common words account for 27% of the corpus

The 100 most common words account for 67% of the corpus

The 1000 most common words account for 92% of the corpus

Thus, most informal dialogues are composed of a relatively small number of common words.

However, it is the infrequent words that typically provide the precision and detail required for complex information transfer

Computed from the Switchboard corpus (American English telephone dialogues)
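
A coverage figure of this kind (the share of all word tokens accounted for by the k most common words) can be computed as sketched below. This is a minimal illustration, not the procedure used for the slide; the function name and the toy sample are made up, and the sample is not Switchboard data.

```python
from collections import Counter


def cumulative_coverage(tokens, top_k):
    """Fraction of all tokens accounted for by the top_k most frequent word types."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for _, n in counts.most_common(top_k))
    return covered / total


# Toy sample for illustration only (NOT corpus data).
sample = "i know and i think that i know and you know".split()

for k in (1, 2, 3, 6):
    print(f"top {k} words cover {100 * cumulative_coverage(sample, k):.0f}% of tokens")
```

Run over a full word-level transcript, the same function with top_k set to 10, 100 and 1000 would give figures comparable to those quoted above.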


How Many Pronunciations of “And”?

(Table columns: N, Pronunciation)


How Many Different Pronunciations?

(Table columns: Rank, Word, N, #Pron, Most Common Pronunciation, MCP %Total)
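
The column headings of this table (N, #Pron, most common pronunciation and its share of the total) suggest a straightforward tally per word. A minimal sketch follows, assuming the transcription is available as one (word, pronunciation) pair per token. The function name, the dictionary keys and the toy pronunciations are hypothetical and are not taken from the corpus.

```python
from collections import Counter, defaultdict


def pronunciation_table(observations):
    """Summarize pronunciation variation per word.

    observations: iterable of (word, pronunciation) pairs, one per token.
    Returns a list of rows sorted by descending token count.
    """
    by_word = defaultdict(Counter)
    for word, pron in observations:
        by_word[word][pron] += 1

    rows = []
    for word, prons in by_word.items():
        n = sum(prons.values())
        mcp, mcp_n = prons.most_common(1)[0]
        rows.append({
            "word": word,
            "n": n,                          # number of tokens (N)
            "n_pron": len(prons),            # distinct pronunciations (#Pron)
            "mcp": mcp,                      # most common pronunciation
            "mcp_pct": 100.0 * mcp_n / n,    # its share of the tokens (MCP %Total)
        })

    rows.sort(key=lambda r: (-r["n"], r["word"]))
    for rank, row in enumerate(rows, start=1):
        row["rank"] = rank
    return rows


# Hypothetical tokens for illustration only.
toy = [
    ("and", "ae n"), ("and", "ae n"), ("and", "ax n"), ("and", "ae n d"),
    ("the", "dh ax"), ("the", "dh iy"),
]
table = pronunciation_table(toy)
for row in table:
    print(row["rank"], row["word"], row["n"], row["n_pron"], row["mcp"], f'{row["mcp_pct"]:.0f}%')
```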


English is (sort of) like Chinese ….

95% of the words contain just ONE or TWO syllables ….

81% of the word tokens are monosyllabic

Of the 100 most common words, 90 are one syllable in length

Only 22% of the words in the lexicon are one syllable long

Hence, there is a decided preference for monosyllabic words in informal discourse


Syllable and Word Frequencies are Similar

Words and syllables exhibit similar distributions over the 300 most common elements, accounting for 80% of the corpus

The similarity of their distributions is a consequence of most words consisting of just a single syllable


Word Frequency in Spontaneous English

Word frequency as a function of word rank approximates a 1/f distribution, particularly after rank-order 10

Word frequency is roughly inversely proportional to rank order in the corpus, i.e., a straight line on log-log axes (e.g., the 10th most common word occurs ca. 10 times more frequently than the 100th most common word, etc.)

Computed from the Switchboard corpus (American English telephone dialogues)
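
One way to check a rank-frequency relation of this kind is to fit a straight line to log frequency against log rank; a slope near -1 corresponds to the "10 times the rank, one tenth the frequency" pattern. The sketch below uses an ordinary least-squares fit written out by hand. The function name, the `min_rank` parameter and the synthetic test tokens are assumptions for illustration, not part of the original analysis.

```python
import math
from collections import Counter


def zipf_slope(tokens, min_rank=1):
    """Least-squares slope of log10(frequency) against log10(rank).

    Ranks below `min_rank` are ignored, e.g. min_rank=10 to skip the
    most common words, where the fit is reported to be poorer.
    """
    counts = sorted(Counter(tokens).values(), reverse=True)
    points = [
        (math.log10(rank), math.log10(freq))
        for rank, freq in enumerate(counts, start=1)
        if rank >= min_rank
    ]
    if len(points) < 2:
        raise ValueError("need at least two ranks to fit a slope")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Synthetic tokens with frequency exactly proportional to 1/rank
# (60, 30, 20, 15, 12, 10 occurrences); illustration only.
synthetic = []
for rank in range(1, 7):
    synthetic.extend([f"w{rank}"] * (60 // rank))

print(f"slope = {zipf_slope(synthetic):.3f}")
```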


Information Affects Pronunciation

The faster the speaking rate, the more likely the pronunciation is to deviate from the canonical form

However, the effect is much more pronounced for the 100 most common words than for more infrequent words

From Fosler, Greenberg and Morgan (1999); Greenberg and Fosler (2000)


English Syllable Structure is (sort of) Like Japanese

Most syllables are simple in form (no consonant clusters)

87% of the pronunciations are simple syllabic forms

84% of the canonical corpus is composed of simple syllabic forms

n = 103,054


Complex Syllables are Important, Though

Thus, despite English’s reputation for complex syllabic forms, only ca. 15% of the syllable tokens are actually complex

There are many “complex” syllable forms (consonant clusters), but all occur relatively infrequently

Complex codas are realized less often in actual pronunciation than their canonical representation would suggest

Complex onsets tend to preserve their canonical representation in actual pronunciation

n = 17,760
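
The "ca. 15%" figure can be cross-checked from the two n values, assuming they are the token counts for simple and complex syllables respectively (their sum, 120,814, matches the n given on the Syllable-Centric Pronunciation slide). The sketch also shows one possible definition of "complex", namely a consonant cluster in the onset or coda. The function names and the toy syllables are illustrative.

```python
def is_complex(onset, coda):
    """A syllable counts as complex here if its onset or coda is a consonant cluster."""
    return len(onset) > 1 or len(coda) > 1


def complex_share(n_simple, n_complex):
    """Fraction of syllable tokens that are complex."""
    return n_complex / (n_simple + n_complex)


# Token counts read off the two slides.
# ASSUMPTION: 103,054 = simple syllables, 17,760 = complex syllables.
N_SIMPLE = 103_054
N_COMPLEX = 17_760

share = complex_share(N_SIMPLE, N_COMPLEX)
print(f"complex syllable tokens: {100 * share:.1f}%")

# Toy syllables as (onset, coda) tuples of consonants; illustration only.
toy_syllables = [
    (("k",), ("t",)),            # "cat"    -> simple
    (("s", "t", "r"), ("t",)),   # "street" -> complex onset
    ((), ("n", "d")),            # "and"    -> complex coda
    (("dh",), ()),               # "the"    -> simple
]
flags = [is_complex(onset, coda) for onset, coda in toy_syllables]
print(flags)
```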


Syllable-Centric Pronunciation

Codas tend to be pronounced canonically more frequently in formal speech than in spontaneous dialogues

Onsets are pronounced canonically far more often than nuclei or codas

“Cat” [k ae t]

[k] = onset

[ae] = nucleus

[t] = coda

(Figure: percent canonically pronounced by syllable position, for read sentences and spontaneous speech)

n = 120,814
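
A tally of "percent canonically pronounced" by syllable position could look like the sketch below. It assumes each spoken syllable has already been aligned with its dictionary form and that a position counts as canonical only on an exact phone-for-phone match. The slides do not state the scoring rule actually used, and the data structures and toy examples here are hypothetical.

```python
from collections import Counter

POSITIONS = ("onset", "nucleus", "coda")


def percent_canonical(pairs):
    """Percent of syllable constituents realized exactly as in the canonical form.

    pairs: iterable of (canonical, realized) syllables, each a dict mapping
    "onset" / "nucleus" / "coda" to a tuple of phones. Positions that are
    empty in the canonical form are skipped.
    """
    total = Counter()
    match = Counter()
    for canonical, realized in pairs:
        for pos in POSITIONS:
            target = tuple(canonical.get(pos, ()))
            if not target:
                continue
            total[pos] += 1
            if tuple(realized.get(pos, ())) == target:
                match[pos] += 1
    return {pos: 100.0 * match[pos] / total[pos] for pos in POSITIONS if total[pos]}


# Hypothetical aligned syllables for illustration only.
cat = {"onset": ("k",), "nucleus": ("ae",), "coda": ("t",)}
cat_reduced = {"onset": ("k",), "nucleus": ("eh",), "coda": ()}
and_canon = {"onset": (), "nucleus": ("ae",), "coda": ("n", "d")}
and_reduced = {"onset": (), "nucleus": ("ae",), "coda": ("n",)}

result = percent_canonical([(cat, cat), (cat, cat_reduced), (and_canon, and_reduced)])
for pos, pct in result.items():
    print(f"{pos}: {pct:.0f}% canonical")
```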


Complex Onsets are Highly Canonical

Complex onsets are pronounced more canonically than simple onsets despite the greater potential for deviation from the standard pronunciation

(Figure: percent canonically pronounced by syllable onset type, for read sentences and spontaneous speech)


Speaking Style Affects Codas

Codas are much more likely to be realized canonically in formal than in spontaneous speech

(Figure: percent canonically pronounced by syllable coda type)


Onsets (but not Codas) Affect Nuclei

The presence of a syllable onset has a substantial impact on the realization of the nucleus

(Figure: percent canonically pronounced)


Syllable-Centric Feature Analysis

  • Place of articulation deviates most in nucleus position

  • Manner of articulation deviates most in onset and coda position

  • Voicing deviates most in coda position

Phonetic deviation along a SINGLE feature

Place is VERY unstable in nucleus position

Place deviates very little from canonical form in the onset and coda. It is a STABLE AF in these positions
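
Counting deviations "along a single feature" presupposes a mapping from phones to articulatory-acoustic features. The sketch below tallies substitutions in which exactly one feature differs from the canonical phone, broken down by syllable position. The small feature table is a textbook-style assumption covering only a handful of Arpabet symbols; it is not the feature inventory used in the study, and the aligned triples are made up.

```python
from collections import Counter

# ASSUMED feature values for a few Arpabet phones (illustration only).
FEATURES = {
    "t":  {"place": "alveolar", "manner": "stop",  "voicing": "voiceless"},
    "d":  {"place": "alveolar", "manner": "stop",  "voicing": "voiced"},
    "k":  {"place": "velar",    "manner": "stop",  "voicing": "voiceless"},
    "n":  {"place": "alveolar", "manner": "nasal", "voicing": "voiced"},
    "ae": {"place": "front",    "manner": "vowel", "voicing": "voiced"},
    "ax": {"place": "central",  "manner": "vowel", "voicing": "voiced"},
}


def single_feature_deviations(aligned):
    """Count substitutions that differ from canonical in exactly one feature.

    aligned: iterable of (position, canonical_phone, realized_phone) triples.
    Deletions (realized_phone is None) and multi-feature changes are skipped.
    Returns a Counter keyed by (position, feature).
    """
    tally = Counter()
    for position, canonical, realized in aligned:
        if realized is None:
            continue
        changed = [
            name for name, value in FEATURES[canonical].items()
            if FEATURES[realized][name] != value
        ]
        if len(changed) == 1:
            tally[(position, changed[0])] += 1
    return tally


# Hypothetical aligned phones for illustration only.
toy_alignment = [
    ("onset", "k", "k"),       # canonical, no deviation
    ("nucleus", "ae", "ax"),   # place only
    ("coda", "t", "d"),        # voicing only
    ("coda", "d", "n"),        # manner only
    ("coda", "t", "n"),        # manner and voicing: not a single-feature change
    ("coda", "t", None),       # deletion
]
tally = single_feature_deviations(toy_alignment)
for (position, feature), count in sorted(tally.items()):
    print(position, feature, count)
```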


Articulatory PLACE Feature Analysis

  • Place of articulation is a “dominant” feature in nucleus position only

  • Drives the feature deviation in the nucleus for manner and rounding

Phonetic deviation across SEVERAL features

Place “carries” manner and rounding in the nucleus


Articulatory MANNER Feature Analysis

  • Manner of articulation is a “dominant” feature in onset and coda position

  • Drives the feature deviation in onsets and codas for place and voicing

Phonetic deviation across SEVERAL features

Manner drives place and voicing deviations in the onset and coda

Manner is less stable in the coda than in the onset


Articulatory VOICING Feature Analysis

  • Voicing is a subordinate feature in all syllable positions

  • Its deviation pattern is controlled by manner in onset and coda positions

Phonetic deviation across SEVERAL features

Voicing is unstable in coda position and is dominated by manner


LIP-ROUNDING Feature Analysis

  • Lip-rounding is a subordinate feature

  • Its deviation pattern is driven by the place feature in nucleus position

Phonetic deviation across SEVERAL features

Rounding is stable everywhere except in the nucleus where its deviation pattern is driven by place


Perceptual Evidence

for the

Importance of Place (and Manner) of Articulation Features


Correlation - AFs/Consonant Recognition

Consonant recognition is almost perfectly correlated with place of articulation performance

This correlation suggests that the place feature is based on cues distributed across the entire speech bandwidth, in contrast to other features

Manner is also highly correlated with consonant recognition, voicing and rounding less so
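
A correlation of this kind can be computed by pairing, for each listening condition, the consonant recognition score with the score for a given articulatory feature. The sketch below implements the Pearson coefficient directly. The score lists are placeholder values chosen to be exactly linear so the result is easy to verify; they are not measurements from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# PLACEHOLDER scores per listening condition (not experimental data).
consonant_scores = [10.0, 20.0, 30.0, 40.0]
place_scores = [25.0, 45.0, 65.0, 85.0]   # exactly 2x + 5, so r should be 1

r_place = pearson_r(consonant_scores, place_scores)
print(f"r(consonant, place) = {r_place:.2f}")
```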


Automatic Phonetic Transcription

of Spontaneous Speech


Automatic Phonetic Transcription

  • MAINSTREAM ASR SYSTEMS USE MOSTLY AUTOMATIC ALIGNMENT DATA TO TRAIN NEW SYSTEMS

    • These materials are highly inaccurate (35-50% incorrect labeling of phonetic segments, and segment boundaries off by an average of 32 ms, or 40% of the mean phone duration)

  • IT IS BOTH TIME-CONSUMING AND EXPENSIVE TO MANUALLY LABEL AND SEGMENT PHONETIC MATERIAL FOR SPONTANEOUS CORPORA

    • Manual labeling and segmentation typically require 150-400 times real time to perform

  • WE HAVE DEVELOPED AN AUTOMATIC LABELING OF PHONETIC SEGMENTS (ALPS) SYSTEM TO PROVIDE TRAINING MATERIALS FOR ASR AND TO ENABLE RAPID DEPLOYMENT TO NEW CORPORA AND FOREIGN LANGUAGE MATERIAL

    • Such material will be extremely useful for developing pronunciation models and new algorithms for ASR

  • THE ALPS SYSTEM CURRENTLY LABELS SPONTANEOUS MATERIALS (OGI Numbers Corpus) WITH ca. 83% ACCURACY

    • The algorithms used are capable of achieving ca. 93% accuracy with only minor changes to the models



Spectro-Temporal Profile (STeP)

  • STePs provide a simple, accurate means of delineating the acoustic properties associated with phonetic features and segments

Vocalic


Spectro-Temporal Profile (STeP)

  • STePs incorporate information about the instantaneous modulation spectrum distributed across the (tonotopic) frequency axis and can be used for training neural networks.

Fricative


Label Accuracy per Frame

  • Frames away from the boundary are labeled very accurately


Sample Transcription Output

  • The automatic system performs very similarly to manual transcription in terms of both labels and segmentation

    • 11 ms average concordance in segmentation

    • 83% concordance with respect to phonetic labels
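
The two concordance measures above (agreement on labels, and average discrepancy in boundary placement) can be computed as in the following sketch. It assumes the automatic and manual transcriptions contain the same number of segments, aligned one to one, with times in milliseconds. The scoring protocol actually used is not described on the slide, and the toy segments are made up.

```python
def compare_transcriptions(manual, automatic):
    """Compare two segmentations given as lists of (label, start_ms, end_ms).

    ASSUMPTION: both lists have the same length and are aligned one to one.
    Returns (percent of segments with identical labels,
             mean absolute boundary difference in ms).
    """
    if len(manual) != len(automatic) or not manual:
        raise ValueError("need two non-empty, equally long segment lists")

    same_label = sum(m[0] == a[0] for m, a in zip(manual, automatic))

    boundary_diffs = []
    for m, a in zip(manual, automatic):
        boundary_diffs.append(abs(m[1] - a[1]))  # segment onset
        boundary_diffs.append(abs(m[2] - a[2]))  # segment offset

    label_pct = 100.0 * same_label / len(manual)
    mean_diff_ms = sum(boundary_diffs) / len(boundary_diffs)
    return label_pct, mean_diff_ms


# Hypothetical segments for illustration only.
manual_segments = [("s", 0, 100), ("eh", 100, 180), ("v", 180, 240)]
auto_segments = [("s", 0, 110), ("eh", 110, 170), ("f", 170, 240)]

label_pct, mean_diff_ms = compare_transcriptions(manual_segments, auto_segments)
print(f"label concordance: {label_pct:.0f}%")
print(f"mean boundary difference: {mean_diff_ms:.1f} ms")
```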



Grand Summary

  • SPONTANEOUS SPEECH DIFFERS IN CERTAIN SYSTEMATIC WAYS FROM CANONICAL DESCRIPTIONS OF LANGUAGE AND MORE FORMAL SPEAKING STYLES

    • Such insights can only be obtained at present with large amounts of phonetically labeled material (in this instance, four hours of telephone dialogues recorded in the United States - the Switchboard corpus)

    • Automatic methods will eventually supply badly needed data for more complete analyses and evaluation

  • THE PHONETIC PROPERTIES OF SPOKEN LANGUAGE APPEAR TO BE ORGANIZED AT THE SYLLABIC RATHER THAN AT THE PHONE LEVEL

    • Onsets are pronounced in canonical (i.e., dictionary) fashion 85-90% of the time

    • Nuclei and codas are expressed canonically only 60% of the time

    • Nuclei tend to be realized as vowels different from the canonical form

    • Codas are often deleted entirely

    • Articulatory-acoustic features are also organized in systematic fashion with respect to syllabic position

    • Therefore, it is important to model spoken language at the syllabic level

  • THE SYLLABIC ORGANIZATION OF SPONTANEOUS SPEECH CALLS INTO QUESTION SEGMENTALLY BASED PHONETIC ORTHOGRAPHY

    • It may be unrealistic to assume that any phonetic transcription based exclusively on segments (such as the IPA) is truly capable of capturing the important phonetic detail of spontaneous material


That’s All, Folks

Many Thanks for Your Time and Attention

