Distinctive Feature Detection for Automatic Speech Recognition
Jun Hou, Prof. Lawrence Rabiner, Dr. Sorin Dusan
CAIP, ECE Dept., Rutgers University
Sep. 13, 2004
Outline • The history of Automatic Speech Recognition • Current Feature Detection Technologies • ASAT – Automatic Speech Attribute Transcription • Distinctive Feature Detection, as a part of ASAT • Proposed Work schedule
Figure 1 S-curve limits ASR technology advances (C.-H. Lee) The Evolution of Speech Recognition • Data-driven (1980’s, 1990’s and 2000’s) vs. knowledge-driven (1960’s, 1970’s) • The gap between Human Speech Recognition (HSR) and Automatic Speech Recognition (ASR) is still very large • Is HMM the end of the line? Or is there somewhere else to go?
Problems with Signals To Be Recognized • No two utterances of the same linguistic content are ever the same (often they are not even close in their waveforms or spectral characteristics) • Speaker variation • Speaking style • Background environment • etc.
Figure 2 State-of-the-art HMM-based systems (C.-H. Lee) Statistical Methods • Typical approaches: HMM and ANN
Statistical Methods • Top-down approach. Higher level knowledge guides the processing primarily at the lower levels. • Incremental discrimination to get refined results (e.g., better stop consonant discrimination) • Utterance verification – Confidence measures to approximately estimate the reliability of the result, often on a word-by-word basis • Errors inevitable, mainly when the measured features fall into the overlapped region of the different pdfs • Data driven => Sensitive to training data, both the amount and type • Robustness problem – Sensitive to speaking environment and transmission characteristics of the medium • No explicit use of acoustic, or phonetic knowledge • No clear calculation of the required size of the training data set • High computational cost when the size of statistical patterns is large
Figure 3 HMM diagram (time indices t_{i-1}, t_i, t_{i+1}, t_{i+2}) HMM Issues • Sequential model • Assumes frame independence and treats all frames as equally important; more or less acceptable when using cepstral features • No higher-level (linguistic) knowledge used in acoustic modeling • Etc.
Figure 4 ANN diagram (input layer, hidden layer(s), output layer) ANN Issues • No meaningful representation of the internal nodes • Lots of uncertainty as to what processing is happening • Computationally expensive • Hard to train; virtually impossible to guarantee convergence to the true (global) minimum • Etc.
Knowledge Based Methods • Bottom-up approach. Uses acoustic-phonetic knowledge at all levels of processing. • Temporal features are critical in discriminating some speech sounds, e.g., VOT in stop detection • Spectral features are critical in discriminating other speech sounds, e.g., fricatives from spectral energy concentrations • Learn information in temporal and spectral domains using both static and dynamic features
Problems with Knowledge-Based Methods • The knowledge of the acoustic properties of phonetic units is not complete. Hard to cover all the rules. • The knowledge of phonetic properties of acoustic units is not complete. • Pronunciation models explain the formation of waveforms from vocal tract shapes, but no clear reverse knowledge exists. • The choice of features is not optimal in a well defined and meaningful sense. • The design of sound classifiers is not optimal. • No well-defined automatic tuning methods exist.
Feature Extraction – Ali et al. • Feature Extraction (Jakobson) 1. Total energy 2. Spectral Center of Gravity (SCG) 3. Duration 4. Low, medium and high frequency energy 5. Formant transitions 6. Silence detection 7. Voicing detection 8. Rate of change of energy in various frequency bands 9. Rate of change of SCG 10. Most prominent peak frequency 11. Rate of change of the most prominent peak frequency 12. Zero-crossing rate • Auditory-Based Front End Processing
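Three of the listed measurements (total energy, zero-crossing rate, and spectral center of gravity) can be sketched per frame as follows. This is a minimal illustration, not the front end of Ali et al.: the function name `frame_features` is hypothetical, and the SCG is assumed here to be the power-weighted mean frequency of a plain DFT rather than of an auditory filter bank.

```python
import math

def frame_features(frame, sample_rate):
    """Return (total energy, zero-crossing rate, spectral center of gravity in Hz).

    Assumption: SCG is the power-weighted mean frequency of the frame's DFT.
    """
    n = len(frame)
    if n < 2:
        raise ValueError("frame must contain at least two samples")

    # Total energy: sum of squared samples.
    energy = sum(x * x for x in frame)

    # Zero-crossing rate: fraction of adjacent sample pairs whose signs differ.
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    zcr = crossings / (n - 1)

    # Naive DFT power spectrum for bins 0 .. n//2 (fine for a short sketch).
    power = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        power.append(re * re + im * im)

    total = sum(power)
    freqs = [k * sample_rate / n for k in range(n // 2 + 1)]
    scg = sum(f * p for f, p in zip(freqs, power)) / total if total > 0 else 0.0
    return energy, zcr, scg
```

Rates of change (items 8, 9 and 11 in the list) would then follow by differencing these values across successive frames.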
Feature Extraction • Utterance Segmentation (silence, obstruents, sonorants) • Fine Utterance Classification into Four categories • Sonorants – fine identification • Stops – voiced and unvoiced • Fricatives – voiced and unvoiced • Silence • Excellent performance for stops and fricatives
Figure 5 Block diagram of the System Figure 6 Block diagram of the front-end Feature Extraction
Feature Extraction • Fricative classification • Voicing detection • DUP – The Duration of the Unvoiced Portion • Place of articulation detection • MDP - The Most Dominant Peak from the synchrony detector • MNSS - The Maximum Normalized Spectral Slope • SCG - The Spectral Center of Gravity • MDSS - The Most Dominant Spectral Slope • DRHF - The Dominant Relative to the Highest Filters
Feature Extraction • Stop detection • Voicing detection • Prevoicing • VOT • Closure duration • Place of articulation detection • BF - Burst Frequency • The second formant of the following vowel • MNSS • DRHF, LINP (most prominent peak of the synchrony response after being laterally inhibited by the higher 10 filters) • Formant transitions before and after the stop • The voicing decision
Landmark Detection • Landmark Detection – Juneja, et al., PhD Thesis Proposal • Manner landmarks are used, whereas place and voicing are extracted using the locations provided by the manner landmarks • Two types of manner landmarks • Defined by abrupt change, e.g., burst landmark for stop consonants, vowel onset point • Defined by the most prominent manifestation of a manner phonetic feature, e.g., a point of maximum low energy in a vowel • Three steps: • Location of manner landmarks • Analysis of landmarks for place and voicing phonetic features • Matching phonetic features to features of words or sentence representations
Table 1 Broad manner classification of English phonemes Landmark Detection • Recognition of 5 broad classes • Vowel • Stop • Fricative • Sonorant consonant • Silence • Use Support Vector Machines (SVM) to segment TIMIT data into binary classes • Results of 2 different feature organizations are reported: • Parallel – discriminate each feature against all other features • Hierarchical – distinguish the features using a probabilistic hierarchy
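The difference between the two organizations can be sketched without committing to any particular classifier. In the sketch below, the parallel decision simply picks the class whose one-against-rest score is largest, while the hierarchical decision multiplies binary posteriors along a tree. The tree used here (speech, then sonorant, then syllabic or continuant) is an assumed illustration, not necessarily the hierarchy in the cited work, and all function and argument names are hypothetical.

```python
def parallel_decision(scores):
    """Parallel organization: one detector per class, each trained against
    all other classes; choose the class with the largest score."""
    return max(scores, key=scores.get)

def hierarchical_posteriors(p_speech, p_sonorant, p_syllabic, p_continuant):
    """Hierarchical organization: each argument is the posterior of one binary
    decision, conditioned on the decisions above it in an ASSUMED tree:

        speech? -> sonorant? -> syllabic?   (vowel vs. sonorant consonant)
                             -> continuant? (fricative vs. stop)
    """
    return {
        "silence": 1.0 - p_speech,
        "vowel": p_speech * p_sonorant * p_syllabic,
        "sonorant consonant": p_speech * p_sonorant * (1.0 - p_syllabic),
        "fricative": p_speech * (1.0 - p_sonorant) * p_continuant,
        "stop": p_speech * (1.0 - p_sonorant) * (1.0 - p_continuant),
    }

# Hypothetical detector outputs for a single frame.
posteriors = hierarchical_posteriors(0.9, 0.8, 0.7, 0.5)
best_class = max(posteriors, key=posteriors.get)
```

Because the leaf posteriors are products along a path, an error in an upper-level decision lowers every class beneath it, which is one way errors can propagate downward in a hierarchy.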
Landmark Detection Table 2 Landmarks extracted for each of the manner classes and knowledge based acoustic measurements
Landmark Detection Table 3 Acoustic Parameters used in broad class segmentation
Figure 8 Hierarchical SVM organization Figure 7 Parallel SVM organization Landmark Detection • Compare the organizations of SVMs
Table 4 Results of parallel SVM organization Table 5 Results of hierarchical SVM organization Landmark Detection • Compare classification results
Landmark Detection Discussion • Combine landmarks with acoustic parameters • The gap between correctness and accuracy is due mainly to insertions of sonorant consonants and stops • Performance gap between hierarchical SVM and parallel SVM architectures is due to ??? – possibly: wrong classification in the upper level in the hierarchical architecture causes error propagation to the lower level • Isolated or connected word recognition • Use Finite State Automata (FSA) to constrain the segmentation paths • Doesn’t allow the use of a probabilistic language model
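The role of insertions in the gap between correctness and accuracy can be made concrete. The sketch below assumes the commonly used scoring definitions, in which correctness ignores insertions and accuracy penalizes them; whether Tables 4 and 5 were scored exactly this way is an assumption, and the counts are invented for illustration only.

```python
def percent_correct(n_ref, deletions, substitutions):
    """Correctness: insertions are not penalized."""
    return 100.0 * (n_ref - deletions - substitutions) / n_ref

def percent_accuracy(n_ref, deletions, substitutions, insertions):
    """Accuracy: same as correctness, minus the insertion errors."""
    return 100.0 * (n_ref - deletions - substitutions - insertions) / n_ref

# Hypothetical counts: 1000 reference segments, 50 deleted, 100 substituted, 80 inserted.
correct = percent_correct(1000, 50, 100)
accuracy = percent_accuracy(1000, 50, 100, 80)
gap = correct - accuracy  # equals 100 * insertions / n_ref
```

Under these definitions the gap is exactly the insertion rate, so fewer inserted sonorant consonants and stops would close it directly.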
Landmark Detection– ANN • Benoit Launay, et al. • Train Artificial Neural Network to map short-term spectral features to the posterior probability of some distinctive features • Feed features into HMM
Figure 9 Bottom-up ASAT based on speech attribute detection, event merging and evidence verification (C.-H. Lee) NEW! ASAT – Automatic Speech Attribute Transcription • Knowledge-based, data driven approach
Figure 10 Distinctive Feature Detection (Speech Signal → Attribute Detectors 1 to M → Attributes Combination: Linear, ANN, K-L, etc. → Feature Detectors 1 to N → Features 1 to N) Distinctive Feature Detection • 1. What attributes? • 2. How to measure them? • 3. What features? • 4. How to combine the attributes to form features? • 5. What outputs? • 6. How to compute them?
Attributes and Features in ASAT – Issues to be Resolved • Q1: What attributes? • Q2: How to measure them? • Q3: What features? • Q4: How to combine the attributes to form features? • Q5: What outputs? • Q6: How to compute the outputs? • Q7: Why use them?
Q1: What attributes? • MFCC and their derivatives, Energy in specific spectral ranges, Zero Crossing Rate, Formant Frequency, ratio of spectral peaks, etc. • Different set of attributes for each feature • VOT, energy onset, energy offset, etc. • Refer to those attributes in Ali’s paper • Find other indicative attributes in spectral graph, cepstral graph, etc. • Find other significant characteristics in waveforms • Find characteristics inside/between the time and frequency domains
Q2: How to measure them? • Observe and analyze the speech signal in both time and frequency domain, e.g., filter bank analysis • Data mining of meaningful “patterns” • Experiments needed to find distinguishing attributes for each acoustic feature • Enhance distinctive attributes, eliminate confusing attributes – better ways to measure things • Find the relations of attributes inside a frame, e.g., between prominent attributes, weak attributes. • Calculate correlation between attributes in succeeding frames • Calculate information redundancy for different attributes
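One simple way to quantify the relation between attributes in succeeding frames is a lagged Pearson correlation between two attribute tracks. This is a sketch of one possible measure, not a method prescribed by the slides; the function names and the choice of Pearson correlation are assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def lagged_correlation(track_a, track_b, lag=1):
    """Correlate attribute A at frame t with attribute B at frame t + lag."""
    if len(track_a) != len(track_b):
        raise ValueError("tracks must have the same number of frames")
    if lag < 0 or lag > len(track_a) - 2:
        raise ValueError("lag must leave at least two overlapping frames")
    return pearson(track_a[:len(track_a) - lag], track_b[lag:])
```

A value near +1 or -1 for some pair of attributes suggests that one largely predicts the other a frame later, which is a first indication of redundancy.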
Q2: How to measure them? • Topology of attribute organization • Parallel Organization – ASAT Organization • Graph Organization • Hierarchical – Juneja et al. (features) • Eliminate redundancy in computation • One attribute may trigger the test of existence of other attributes • Combined organization, i.e., sequential and graph methods combined
Q3: What features? • Features available in current acoustic-phonetic area: binary distinctive features • Distinctive features are related to: • Voicing • Whether or not the vocal folds vibrate • Place of articulation • The particular articulator that is used (glottis, soft palate, lips, etc.) • Manner of articulation • How that articulator is used to produce the sound
Q3: What features? • Initial list of twelve pairs of distinctive features • 1. Vocalic/non-vocalic • 2. Consonantal/non-consonantal • 3. Interrupted/continuant • 4. Checked/unchecked • 5. Strident/mellow • 6. Voiced/unvoiced • 7. Compact/diffuse • 8. Grave/acute • 9. Flat/plain • 10. Sharp/plain • 11. Tense/lax • 12. Nasal/oral • English is characterized by 9 pairs of these features
Q3: What features? • Need to detect all relevant features to perform automatic speech recognition at the phonetic level • Acoustic-phonetic features are intuitively plausible, but there might exist other good features obtained from data mining and/or clustering techniques • We can optimize (how we do it is unclear) and obtain the minimum necessary set of speech distinctive features • May use attributes directly and together with features when calculating the outputs from the detectors
Q4: How to compute or estimate the features? • Develop combination methods and optimize them to get better combination of attributes to form meaningful features, and select the best features for phonemes and possibly larger acoustic units • Possible combination algorithms: • Linearly weighted average • ANN • K-L • Fuzzy integral seems promising, compared with ANN (cf. Chang & Greenberg’s paper) • Prominent attributes characterize features. The existence of some particular attributes may help to further define the feature or features.
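The simplest of the listed combination algorithms, the linearly weighted average, can be sketched as follows. The attribute scores, weights, and the 0.5 decision threshold are all hypothetical; in practice the weights would have to be optimized, which this sketch does not attempt.

```python
def weighted_average(attribute_scores, weights):
    """Combine attribute detector scores (each assumed to lie in [0, 1])
    into a single feature score using a linearly weighted average."""
    if len(attribute_scores) != len(weights):
        raise ValueError("need exactly one weight per attribute score")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, attribute_scores)) / total_weight

def feature_present(attribute_scores, weights, threshold=0.5):
    """Binary decision for one distinctive feature (threshold is an assumption)."""
    return weighted_average(attribute_scores, weights) >= threshold

# Hypothetical example: three attribute detectors vote on the feature "voiced";
# the first (more prominent) attribute is given twice the weight of the others.
voiced_score = weighted_average([0.9, 0.8, 0.2], [2.0, 1.0, 1.0])
```

Giving a prominent attribute a larger weight is one way to express the idea that prominent attributes characterize a feature while weaker ones only refine it.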
Q5: What outputs? • Study the acoustic-phonetic theories and establish models that best describe the production of sound signals • Study each acoustic class and find their differences and relations • Modified features? Phonemes? Phoneme-like units?
Q6: How to compute the outputs? • Study acoustical variation during pronunciation, find common characteristics and distinguishing characteristics for acoustic-phonetic variations • Score the outputs of the feature detectors using probabilities or likelihood measures of the presence of these distinctive features • Other methods???
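One possible likelihood measure for a feature detector output is a log-likelihood ratio between a "feature present" model and a "feature absent" model. The sketch below assumes a single scalar detector output modeled by one Gaussian per hypothesis; both the Gaussian assumption and the parameter values are illustrative only.

```python
import math

def gaussian_log_pdf(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)

def log_likelihood_ratio(x, present, absent):
    """Score detector output x; `present` and `absent` are (mean, std) pairs.
    Positive values favor the feature being present."""
    return gaussian_log_pdf(x, *present) - gaussian_log_pdf(x, *absent)

# Hypothetical models: outputs cluster near 1.0 when the feature is present
# and near 0.0 when it is absent, with equal spread.
score = log_likelihood_ratio(0.8, present=(1.0, 1.0), absent=(0.0, 1.0))
```

A score near zero marks a measurement that falls in the overlap between the two models, which is where a confidence measure would flag the decision as unreliable.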
Q7: Why use them? • We have no other choice at this time • These attributes and features may be far from optimal, but they are well motivated by acoustic-phonetic theories • Will consider other ideas, as they are developed
Evaluation • Evaluation criteria for attributes, features • Mutual information (cf. Hasegawa-Johnson’s paper) • Entropy (e.g., traditional Shannon Entropy, Rényi Entropy, cf. Cachin’s paper) • Perplexity, like that used in language modeling • False acceptance rate, false rejection rate • Other criteria??? • Use those criteria to find correlations between attributes, as well as between features • Gradually minimize the mutual information between attributes/features, e.g., Gradient Descent, and get the minimum sets of attributes and features
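Several of these criteria have short closed forms, sketched below for discrete distributions. The mutual information estimate uses plain relative frequencies from paired samples, which is an assumption made for brevity; the cited papers may use different estimators.

```python
import math
from collections import Counter

def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def renyi_entropy(probs, alpha):
    """Renyi entropy of order alpha in bits (alpha > 0, alpha != 1)."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    return math.log2(sum(p ** alpha for p in probs if p > 0)) / (1 - alpha)

def perplexity(probs):
    """Perplexity, i.e. 2 raised to the Shannon entropy."""
    return 2 ** shannon_entropy(probs)

def mutual_information(pairs):
    """I(X;Y) in bits, estimated from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    count_x = Counter(x for x, _ in pairs)
    count_y = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in joint.items()
    )
```

Two attributes whose binary outputs always agree share one full bit of information, while statistically independent outputs share none, so a high value between two attributes marks one of them as a candidate for removal from the minimum set.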
Segmentation of Speech • Study how humans segment different portions of speech, e.g., spectrum reading • Multiple segmentations are possible, and thus we might want to search through a range of segmentation candidates to find the best result • Collect the segments with high confidence scores • Use other knowledge sources to help clarify the segments with poor scores
Training and Testing • Database – TIMIT and/or Vic corpus • Divide the database into separate training and testing sets • Training • (1) On the training set • (2) On the training set + testing set – is this meaningful or proper? • Find the difference between (1) and (2), and the generalization ability of the features to out of task data • Test performance on the testing set
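A division into training and testing sets can be sketched as below, splitting by speaker so that no speaker appears in both sets. TIMIT ships with its own suggested train/test partition, which would normally be preferred; this sketch is for a corpus without one, and the `(speaker_id, utterance_id)` record layout, the 25% test fraction, and the fixed seed are all assumptions.

```python
import random

def split_by_speaker(utterances, test_fraction=0.25, seed=0):
    """Split (speaker_id, utterance_id) records into speaker-disjoint
    training and testing lists."""
    speakers = sorted({speaker for speaker, _ in utterances})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(speakers)
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train = [u for u in utterances if u[0] not in test_speakers]
    test = [u for u in utterances if u[0] in test_speakers]
    return train, test

# Hypothetical corpus: 8 speakers with 3 utterances each.
corpus = [(f"spk{s}", f"utt{u}") for s in range(8) for u in range(3)]
train_set, test_set = split_by_speaker(corpus)
```

Keeping speakers disjoint means the test score reflects generalization to unseen voices rather than memorization of the training speakers.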
Training and Testing • Training • Study differences between isolated words, connected words, continuous and spontaneous speech • Try not to depend solely on the training data, but instead find rules that adapt to the data and can be applied to more general environments • Try not to diffuse the model as more data is added
Training and Testing • Testing • Find reasons why the detectors failed • Observe error patterns • Did the error patterns emerge due to different reasons? If so, re-examine previous steps, and combine the different information sources in ways that are less sensitive to the observed error patterns
Work Schedule • First year: • Set up the structure for the ASAT system • Define the most reasonable starting set of acoustic attributes and phonetic features • Look at a range of ways of combining evidence from the acoustic attributes to create the phonetic features • Evaluate the baseline performance of the system on a given training and testing set of data – most probably using TIMIT • Baseline alternative approaches, especially front ends, including auditory models and standard speech recognition features