Distinctive Feature Detection For Automatic Speech Recognition. Jun Hou Prof. Lawrence Rabiner Dr. Sorin Dusan CAIP, ECE Dept., Rutgers University Sep.13, 2004. Outline. The history of Automatic Speech Recognition Current Feature Detection Technologies

Distinctive Feature Detection For Automatic Speech Recognition

Jun Hou

Prof. Lawrence Rabiner

Dr. Sorin Dusan

CAIP, ECE Dept., Rutgers University

Sep.13, 2004

Outline
  • The history of Automatic Speech Recognition
  • Current Feature Detection Technologies
  • ASAT – Automatic Speech Attribute Transcription
  • Distinctive Feature Detection, as a part of ASAT
  • Proposed Work schedule
The Evolution of Speech Recognition

Figure 1 S-curve limits ASR technology advances (C.-H. Lee)

  • Data-driven (1980s, 1990s, and 2000s) vs. knowledge-driven (1960s, 1970s)
  • The gap between Human Speech Recognition (HSR) and Automatic Speech Recognition (ASR) is still very large
  • Is HMM the end of the line? Or is there somewhere else to go?
Problems with Signals To Be Recognized
  • No two utterances of the same linguistic content are ever the same (often they are not even close in their waveforms or spectral characteristics)
    • Speaker variation
    • Speaking style
    • Background environment
    • etc.
Statistical Methods
  • Top-down approach. Higher level knowledge guides the processing primarily at the lower levels.
  • Incremental discrimination to get refined results (e.g., better stop consonant discrimination)
  • Utterance verification – Confidence measures to approximately estimate the reliability of the result, often on a word-by-word basis
  • Errors are inevitable, mainly when the measured features fall into the overlapping region of the different pdfs
  • Data driven => Sensitive to training data, both the amount and type
  • Robustness problem – Sensitive to speaking environment and transmission characteristics of the medium
  • No explicit use of acoustic or phonetic knowledge
  • No clear calculation of the required size of the training data set
  • High computational cost when the size of statistical patterns is large
HMM Issues

Figure 3 HMM diagram

  • Sequential model
  • Assumes frame independence – blindly treats all frames with equal importance; more or less okay when using cepstral features
  • No higher level (linguistic) knowledge used in acoustic modeling
  • Etc.
ANN Issues

Figure 4 ANN diagram (input layer, hidden layer(s), output layer)

  • No meaningful representation of the internal nodes
  • Lots of uncertainty as to what processing is happening
  • Computationally expensive
  • Hard to train; virtually impossible to guarantee convergence to the true minimum
  • Etc.
Knowledge Based Methods
  • Bottom-up approach. Uses acoustic-phonetic knowledge at all levels of processing.
  • Temporal features are critical in discriminating some speech sounds, e.g., VOT in stop detection
  • Spectral features are critical in discriminating other speech sounds, e.g., fricatives from spectral energy concentrations
  • Learn information in temporal and spectral domains using both static and dynamic features
Problems with Knowledge-Based Methods
  • The knowledge of the acoustic properties of phonetic units is not complete. Hard to cover all the rules.
  • The knowledge of phonetic properties of acoustic units is not complete.
  • Pronunciation models explain the formation of waveforms from vocal tract shapes, but no clear reverse knowledge exists.
  • The choice of features is not optimal in a well defined and meaningful sense.
  • The design of sound classifiers is not optimal.
  • No well-defined automatic tuning methods exist.
Feature Extraction – Ali et al.
  • Feature Extraction (Jakobson)

    1. Total energy
    2. Spectral Center of Gravity (SCG)
    3. Duration
    4. Low, medium and high frequency energy
    5. Formant transitions
    6. Silence detection
    7. Voicing detection
    8. Rate of change of energy in various frequency bands
    9. Rate of change of SCG
    10. Most prominent peak frequency
    11. Rate of change of the most prominent peak frequency
    12. Zero-crossing rate

  • Auditory-Based Front End Processing
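
Several of the attributes in the list above can be computed directly from one frame of samples. The following is a minimal sketch only, not the auditory-based front end of Ali et al.: it assumes plain rectangular frames and a naive DFT, and the function names, the sample rate, and the test tone are all illustrative assumptions.

```python
import cmath
import math


def total_energy(frame):
    """Total energy: sum of squared sample values in the frame."""
    return sum(x * x for x in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def power_spectrum(frame):
    """Power at DFT bins 0..N/2, computed naively (fine for short frames)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(frame)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def spectral_center_of_gravity(frame, sample_rate):
    """Power-weighted mean frequency of the frame, in Hz."""
    n = len(frame)
    power = power_spectrum(frame)
    total = sum(power)
    if total == 0:
        return 0.0
    return sum(k * sample_rate / n * p for k, p in enumerate(power)) / total


# Hypothetical test frame: a 1 kHz tone sampled at 8 kHz, 64 samples.
SAMPLE_RATE = 8000
frame = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE + math.pi / 8)
         for t in range(64)]
print(total_energy(frame))
print(zero_crossing_rate(frame))
print(spectral_center_of_gravity(frame, SAMPLE_RATE))
```

For a pure tone the spectral center of gravity sits at the tone frequency; for a fricative-like noise burst it would move toward the band where the energy is concentrated. Rates of change (items 8, 9 and 11) would follow by differencing these values across successive frames.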
Feature Extraction
  • Utterance Segmentation (silence, obstruents, sonorants)
  • Fine Utterance Classification into Four categories
    • Sonorants – fine identification
    • Stops – voiced and unvoiced
    • Fricatives – voiced and unvoiced
    • Silence
  • Excellent performance for stops and fricatives
Feature Extraction

Figure 5 Block diagram of the System

Figure 6 Block diagram of the front-end
Feature Extraction
  • Fricative classification
  • Voicing detection
    • DUP – The Duration of the Unvoiced Portion
  • Place of articulation detection
    • MDP - The Most Dominant Peak from the synchrony detector
    • MNSS - The Maximum Normalized Spectral Slope
    • SCG - The Spectral Center of Gravity
    • MDSS - The Most Dominant Spectral Slope
    • DRHF - The Dominant Relative to the Highest Filters
Feature Extraction
  • Stop detection
  • Voicing detection
    • Prevoicing
    • VOT
    • Closure duration
  • Place of articulation detection
    • BF - Burst Frequency
    • The second formant of the following vowel
    • MNSS
    • DRHF, LINP (most prominent peak of the synchrony response after being laterally inhibited by the higher 10 filters)
    • Formant transitions before and after the stop
    • The voicing decision
Landmark Detection
  • Landmark Detection – Juneja et al., PhD Thesis Proposal
  • Manner landmarks are used, whereas place and voicing are extracted using the locations provided by the manner landmarks
  • Two manner landmarks
    • Defined by abrupt change, e.g., burst landmark for stop consonants, vowel onset point
    • Defined by the most prominent manifestation of a manner phonetic feature, e.g., a point of maximum low energy in a vowel
  • Three steps:
    • Location of manner landmarks
    • Analysis of landmarks for place and voicing phonetic features
    • Matching phonetic features to features of words or sentence representations
Landmark Detection

Table 1 Broad manner classification of English phonemes

  • Recognition of 5 broad classes
    • Vowel
    • Stop
    • Fricative
    • Sonorant consonant
    • Silence
  • Use Support Vector Machines (SVM) to segment TIMIT data into binary classes
  • Results of 2 different feature organizations are reported:
    • Parallel – discriminate each feature against all other features
    • Hierarchical – distinguish the features using a probabilistic hierarchy
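
The two organizations can be contrasted with a small sketch. It assumes each binary SVM output has already been mapped to a score in [0, 1]; the detector names, the scores, and the particular decision tree are hypothetical placeholders chosen for illustration, not the hierarchy used in the thesis proposal.

```python
CLASSES = ["vowel", "stop", "fricative", "sonorant consonant", "silence"]


def classify_parallel(one_vs_rest):
    """Parallel organization: one detector per class, each trained against
    all other classes; choose the class with the highest score."""
    return max(CLASSES, key=lambda c: one_vs_rest[c])


def classify_hierarchical(binary, threshold=0.5):
    """Hierarchical organization: walk a tree of binary decisions.
    The tree shape here is an assumption made for illustration."""
    if binary["speech"] < threshold:
        return "silence"
    if binary["sonorant"] >= threshold:
        if binary["syllabic"] >= threshold:
            return "vowel"
        return "sonorant consonant"
    if binary["continuant"] >= threshold:
        return "fricative"
    return "stop"


# Hypothetical detector scores for a single frame.
one_vs_rest = {"vowel": 0.15, "stop": 0.10, "fricative": 0.70,
               "sonorant consonant": 0.20, "silence": 0.05}
binary = {"speech": 0.95, "sonorant": 0.55, "syllabic": 0.30,
          "continuant": 0.90}

print(classify_parallel(one_vs_rest))
print(classify_hierarchical(binary))
```

In this made-up frame a marginal decision at the sonorant node (0.55) sends the hierarchical classifier down the sonorant branch, so the strong continuant score (0.90) is never consulted. A single upper-level mistake therefore fixes the outcome for everything beneath it.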
Landmark Detection

Table 2 Landmarks extracted for each of the manner classes and knowledge based acoustic measurements

Landmark Detection

Table 3 Acoustic Parameters used in broad class segmentation

Landmark Detection
  • Compare the organizations of SVMs

Figure 7 Parallel SVM organization

Figure 8 Hierarchical SVM organization
Landmark Detection
  • Compare classification results

Table 4 Results of parallel SVM organization

Table 5 Results of hierarchical SVM organization
Landmark Detection


  • Combine landmarks with acoustic parameters
  • The gap between correctness and accuracy is due mainly to insertions of sonorant consonants and stops
  • The cause of the performance gap between the hierarchical and parallel SVM architectures is still open; possibly a wrong classification at an upper level of the hierarchy propagates errors to the lower levels
  • Isolated or connected word recognition
      • Use Finite State Automata (FSA) to constrain the segmentation paths
      • Doesn’t allow the use of a probabilistic language model
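
The gap between correctness and accuracy can be made concrete. The sketch below assumes the conventional scoring definitions, in which correctness ignores insertions and accuracy penalizes them; the slides do not state the formulas, and the counts are invented for illustration.

```python
def percent_correct(n_ref, deletions, substitutions):
    """Correctness: share of reference labels recognized, ignoring insertions."""
    return 100.0 * (n_ref - deletions - substitutions) / n_ref


def percent_accuracy(n_ref, deletions, substitutions, insertions):
    """Accuracy: like correctness, but every insertion also counts as an error."""
    return 100.0 * (n_ref - deletions - substitutions - insertions) / n_ref


# Hypothetical counts, not results from the slides.
n_ref, deletions, substitutions, insertions = 1000, 50, 100, 80
correct = percent_correct(n_ref, deletions, substitutions)
accuracy = percent_accuracy(n_ref, deletions, substitutions, insertions)
print(correct, accuracy, correct - accuracy)
```

Under these definitions the difference between the two figures is exactly 100 × insertions / n_ref, so a system that inserts many spurious segments can score well on correctness and still poorly on accuracy.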
Landmark Detection – ANN
  • Benoit Launay, et al.
  • Train Artificial Neural Network to map short-term spectral features to the posterior probability of some distinctive features
  • Feed features into HMM
ASAT – Automatic Speech Attribute Transcription

Figure 9 Bottom-up ASAT based on speech attribute detection, event merging and evidence verification (C.-H. Lee)

  • Knowledge-based, data-driven approach
Distinctive Feature Detection

Figure 10 Distinctive Feature Detection (block diagram with these blocks: Speech Signal; Attribute Detectors 1 … M; Attributes Combination; Feature Detectors 1 … N; Features 1 … N)

Questions annotated on the diagram:

1. What Attributes?
2. How to measure them?
3. What Features?
4. How to combine the attributes to form features?
5. What outputs?
6. How to compute them?


Attributes and Features in ASAT – Issues to be Resolved
  • Q1: What attributes?
  • Q2: How to measure them?
  • Q3: What features?
  • Q4: How to combine the attributes to form features?
  • Q5: What outputs?
  • Q6: How to compute the outputs?
  • Q7: Why use them?
Q1: What attributes?
  • MFCC and their derivatives, Energy in specific spectral ranges, Zero Crossing Rate, Formant Frequency, ratio of spectral peaks, etc.
  • Different set of attributes for each feature
  • VOT, energy onset, energy offset, etc.
  • Refer to those attributes in Ali’s paper
  • Find other indicative attributes in spectral graph, cepstral graph, etc.
  • Find other significant characteristics in waveforms
  • Find characteristics inside/between the time and frequency domains
Q2: How to measure them?
  • Observe and analyze the speech signal in both time and frequency domain, e.g., filter bank analysis
  • Data mining of meaningful “patterns”
  • Experiments needed to find distinguishing attributes for each acoustic feature
  • Enhance distinctive attributes, eliminate confusing attributes – better ways to measure things
  • Find the relations of attributes inside a frame, e.g., between prominent attributes, weak attributes.
  • Calculate correlation between attributes in succeeding frames
  • Calculate information redundancy for different attributes
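
One simple way to quantify the relation between attributes across frames is a lagged Pearson correlation. This is a sketch under assumptions: the slides do not say which correlation measure is intended, and the attribute tracks below are invented values, one per frame.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant track carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def lagged_correlation(track_a, track_b, lag=1):
    """Correlate attribute A at frame t with attribute B at frame t + lag."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    n = min(len(track_a), len(track_b)) - lag
    return pearson(track_a[:n], track_b[lag:lag + n])


# Hypothetical per-frame attribute tracks (one value per frame).
energy = [0.1, 0.4, 0.9, 1.0, 0.8, 0.3, 0.1]
zcr = [0.6, 0.5, 0.2, 0.1, 0.2, 0.5, 0.6]
print(lagged_correlation(energy, energy, lag=1))  # energy vs. next-frame energy
print(lagged_correlation(energy, zcr, lag=0))     # energy vs. ZCR, same frame
```

A correlation of large magnitude between two attributes hints that they carry overlapping information, which is one input to the redundancy question raised in the last bullet.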
Q2: How to measure them?
  • Topology of attribute organization
  • Parallel Organization – ASAT Organization
  • Graph Organization
  • Hierarchical – Juneja et al. (features)
  • Eliminate redundancy in computation
  • One attribute may trigger the test of existence of other attributes
  • Combined organization, i.e., sequential and graph methods combined
Q3: What features?
  • Features available in current acoustic-phonetic area: binary distinctive features
  • Distinctive features are related to:
    • Voicing
      • Whether or not the vocal folds vibrate
    • Place of articulation
      • The particular articulator that is used (glottis, soft palate, lips, etc.)
    • Manner of articulation
      • How that articulator is used to produce the sound
Q3: What features?
  • Initial list of twelve pairs of distinctive features
    • 1. Vocalic/non-vocalic
    • 2. Consonantal/non-consonantal
    • 3. Interrupted/continuant
    • 4. Checked/unchecked
    • 5. Strident/mellow
    • 6. Voiced/unvoiced
    • 7. Compact/diffuse
    • 8. Grave/acute
    • 9. Flat/plain
    • 10. Sharp/plain
    • 11. Tense/lax
    • 12. Nasal/oral
  • English is characterized by 9 pairs of these features
Q3: What features?
  • Need to detect all relevant features to perform automatic speech recognition at the phonetic level
  • Acoustic-phonetic features are intuitively plausible, but there might exist other good features obtained from data mining and/or clustering techniques
  • We can optimize (how we do it is unclear) and obtain the minimum necessary set of speech distinctive features
  • May use attributes directly and together with features when calculating the outputs from the detectors
Q4: How to compute or estimate the features?
  • Develop combination methods and optimize them to get better combination of attributes to form meaningful features, and select the best features for phonemes and possibly larger acoustic units
  • Possible combination algorithms:
    • Linearly weighted average
    • ANN
    • K-L
    • Fuzzy integral seems promising, compared with ANN (cf. Chang & Greenberg’s paper)
  • Prominent attributes characterize features. The existence of some particular attributes may help to further define the feature or features.
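
The simplest of the combination algorithms listed above, the linearly weighted average, might look like the following. This is a sketch under assumptions: the attribute names, scores, and weights are invented, and nothing here reflects tuned values or the other listed methods (ANN, K-L, fuzzy integral).

```python
def combine_linear(attribute_scores, weights):
    """Linearly weighted average of attribute detector scores."""
    if len(attribute_scores) != len(weights):
        raise ValueError("one weight per attribute score is required")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(w * s for w, s in zip(weights, attribute_scores))
    return weighted / total_weight


# Hypothetical attribute detector outputs for a "voiced" feature detector.
scores = [0.9, 0.6, 0.2]    # e.g. low-frequency energy, periodicity, ZCR cue
weights = [2.0, 1.0, 1.0]   # illustrative weights, not tuned values
voiced_score = combine_linear(scores, weights)
print(voiced_score)
```

Giving a prominent attribute a larger weight lets it dominate the feature score, while weaker attributes only nudge the result.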
Q5: What outputs?
  • Study the acoustic-phonetic theories and establish models that best describe the production of sound signals
    • Study each acoustic class and find their differences and relations
  • Modified features? Phonemes? Phoneme-like units?
Q6: How to compute the outputs?
  • Study acoustical variation during pronunciation, find common characteristics and distinguishing characteristics for acoustic-phonetic variations
  • Score the outputs of the feature detectors using probabilities or likelihood measures of the presence of these distinctive features
  • Other methods???
Q7: Why use them?
  • We have no other choice at this time
  • These attributes and features may be far from optimal, but they are well motivated by acoustic-phonetic theories
  • Will consider other ideas, as they are developed
  • Evaluation criteria for attributes, features
    • Mutual information (cf. Hasegawa-Johnson’s paper)
    • Entropy (e.g., traditional Shannon Entropy, Rényi Entropy, cf. Cachin’s paper)
    • Perplexity, like that used in language modeling
    • False acceptance rate, false rejection rate
    • Other criteria???
  • Use those criteria to find correlations between attributes, as well as between features
  • Gradually minimize the mutual information between attributes/features, e.g., by gradient descent, to obtain minimal sets of attributes and features
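
The evaluation criteria named above have standard textbook definitions for discrete distributions, sketched below. The function names, the joint probability tables, and the detector scores are assumptions made for illustration; the sketch does not attempt the minimization step itself.

```python
import math


def shannon_entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha in bits (alpha >= 0, alpha != 1)."""
    if alpha < 0 or alpha == 1:
        raise ValueError("alpha must be >= 0 and different from 1")
    return math.log2(sum(x ** alpha for x in p if x > 0)) / (1.0 - alpha)


def perplexity(p):
    """Perplexity: 2 raised to the Shannon entropy."""
    return 2.0 ** shannon_entropy(p)


def mutual_information(joint):
    """Mutual information in bits from a joint probability table
    (rows index one attribute's values, columns the other's)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )


def far_frr(scores, present, threshold):
    """False acceptance and false rejection rates of a detector.
    present[i] is True when the feature is really there."""
    false_accepts = sum(
        1 for s, y in zip(scores, present) if not y and s >= threshold)
    false_rejects = sum(
        1 for s, y in zip(scores, present) if y and s < threshold)
    negatives = sum(1 for y in present if not y)
    positives = sum(1 for y in present if y)
    return false_accepts / negatives, false_rejects / positives


# Hypothetical joint distributions of two binary attributes.
independent = [[0.25, 0.25], [0.25, 0.25]]
redundant = [[0.5, 0.0], [0.0, 0.5]]
print(mutual_information(independent), mutual_information(redundant))
print(far_frr([0.9, 0.8, 0.4, 0.3, 0.6],
              [True, True, True, False, False], 0.5))
```

A mutual information near zero suggests two attributes carry independent evidence; a value near the entropy of either one suggests that one of them is redundant and is a candidate for removal from the set.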
Segmentation of Speech
  • Study how humans segment different portions of speech, e.g., spectrum reading
  • Multiple segmentations are possible, and thus we might want to search through a range of segmentation candidates to find the best result
  • Collect the segments with high confidence scores
  • Use other knowledge sources to help clarify the segments with poor scores
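
The search over segmentation candidates can be sketched as follows. Everything here is an assumption made for illustration: the segment tuple layout, the duration-weighted scoring rule, the 0.7 confidence threshold, and the two candidate segmentations.

```python
def segmentation_score(segments):
    """Duration-weighted mean confidence of one candidate segmentation.
    Each segment is (start_s, end_s, label, confidence)."""
    total = sum(end - start for start, end, _, _ in segments)
    return sum((end - start) * conf
               for start, end, _, conf in segments) / total


def best_segmentation(candidates):
    """Search the candidate segmentations and keep the highest scoring one."""
    return max(candidates, key=segmentation_score)


def split_by_confidence(segments, threshold=0.7):
    """Separate confident segments from those needing other knowledge sources."""
    confident = [s for s in segments if s[3] >= threshold]
    uncertain = [s for s in segments if s[3] < threshold]
    return confident, uncertain


# Two hypothetical segmentations of the same 0.5 s stretch of speech.
candidate_a = [(0.0, 0.1, "silence", 0.95), (0.1, 0.3, "vowel", 0.90),
               (0.3, 0.5, "fricative", 0.40)]
candidate_b = [(0.0, 0.1, "silence", 0.95), (0.1, 0.2, "vowel", 0.60),
               (0.2, 0.5, "fricative", 0.50)]
best = best_segmentation([candidate_a, candidate_b])
confident, uncertain = split_by_confidence(best)
print(best is candidate_a, len(confident), len(uncertain))
```

The confident segments are kept as they are, and the uncertain ones are the segments that other knowledge sources would then be asked to clarify.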
Training and Testing
  • Database – TIMIT and/or Vic corpus
  • Divide the database into separate training and testing sets
  • Training
    • (1) On the training set
    • (2) On the training set + testing set – is this meaningful or proper?
    • Find the difference between (1) and (2), and the generalization ability of the features to out of task data
  • Test performance on the testing set
Training and Testing
  • Training
    • Study differences between isolated words, connected words, continuous and spontaneous speech
    • Try not to depend solely on the training data, but instead find rules that adapt to the data and can be applied to more general environments
    • Try not to diffuse the model as more data is added
Training and Testing
  • Testing
    • Find reasons why the detectors failed
    • Observe error patterns
    • Did the error patterns emerge due to different reasons? If so, re-examine previous steps, and combine the different information sources in ways that are less sensitive to the observed error patterns
Work Schedule
  • First year:
    • Set up the structure for the ASAT system
    • Define the most reasonable starting set of acoustic attributes and phonetic features
    • Look at a range of ways of combining evidence from the acoustic attributes to create the phonetic features
    • Evaluate the baseline performance of the system on a given training and testing set of data – most probably using TIMIT
    • Baseline alternative approaches, especially front ends, including auditory models and standard speech recognition features