This project explores the capabilities of HTK (Hidden Markov Model Toolkit) for speech processing, focusing on feature extraction methods such as Linear Prediction Analysis, Cepstral Analysis, and Mel-Scaling. The outline includes details on feature extraction scripts, output types, and ideal solutions for processing various audio formats. Future development aims to enhance the script for broader audio input support and specific corpus annotations, enabling the generation of generic feature types from diverse corpora while accommodating different audio file types and annotations.
Speech Processing Using HTK Trevor Bowden 12/08/2008
Outline • Concept of Project • HTK Feature Extraction Capabilities • Details of Feature Extraction Script • Future Development
Concept of Project • Explore HTK Feature Extraction Capabilities • Feature Output Types • Additional Feature Parameters • Ideal Solution • Derive Any Feature Type from Any Corpus
HTK Feature Extraction Models • Linear Prediction Analysis • Cepstral Analysis [Diagram: block labels Hamming Window (shown twice), FFT(), Log()]
HTK Feature Extraction Capabilities • Feature Extraction Methods • Linear Prediction Analysis • Cepstral Analysis • Mel-Scaling • Perceptual Linear Prediction Analysis • Additional Feature Information • Signal Energy • Derivative Information
Linear Prediction Analysis • Vocal Tract Transfer Function • Transfer Function Coefficients Solution • Autocorrelation Matrices • Autocorrelation of Speech • Amplitude of Model
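The autocorrelation route to the transfer function coefficients can be sketched as follows. This is a minimal illustration, not HTK's own code: it assumes an all-pole predictor of the form s[n] ≈ Σ a_i · s[n−i], a Hamming-windowed frame, and the standard Levinson-Durbin recursion. The function names `hamming`, `autocorrelation`, and `levinson_durbin` are hypothetical.

```python
import math

def hamming(n):
    """Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * t / (n - 1)) for t in range(n)]

def autocorrelation(frame, order):
    """Autocorrelation lags r[0..order] of one windowed frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t + k] for t in range(n - k))
            for k in range(order + 1)]

def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations sum_j a_j r[|i-j|] = r[i].

    Returns (a, err): predictor coefficients a_1..a_p and the residual
    (prediction error) energy, which sets the gain of the model.
    """
    a = []
    err = r[0]
    for m in range(1, order + 1):
        # Reflection coefficient for model order m.
        acc = r[m] - sum(a[j] * r[m - 1 - j] for j in range(m - 1))
        k = acc / err
        # Update the lower-order coefficients, then append the new one.
        a = [a[j] - k * a[m - 2 - j] for j in range(m - 1)] + [k]
        err *= 1.0 - k * k
    return a, err
```

A frame would be multiplied by `hamming(len(frame))` before calling `autocorrelation`, and the resulting lags passed to `levinson_durbin` with the desired model order.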
Cepstral Analysis • Logarithmic Spectral Domain (Cepstral Domain) • Allows for Separation of Convolved Signals
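Why the log spectral domain separates convolved signals can be shown with a small sketch: convolution in time is multiplication in frequency, and the logarithm turns that product into a sum. The code below is an illustration under stated assumptions, not HTK's implementation. It uses a naive O(N²) DFT in place of an FFT to stay dependency-free, circular convolution so the identity holds exactly, and hypothetical function names.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def real_cepstrum(x):
    """Inverse DFT of the log magnitude spectrum of a real sequence x.

    Assumes no spectral bin of x is exactly zero.
    """
    n = len(x)
    log_mag = [math.log(abs(v)) for v in dft(x)]
    # log_mag is real and symmetric, so the inverse transform is real.
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def circular_convolve(x, h):
    """Circular convolution of two equal-length sequences."""
    n = len(x)
    return [sum(x[m] * h[(t - m) % n] for m in range(n)) for t in range(n)]
```

With these definitions, `real_cepstrum(circular_convolve(x, h))` equals the element-wise sum of `real_cepstrum(x)` and `real_cepstrum(h)` up to rounding, which is the separation property named on the slide.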
Mel-Scaling • Human perception of pitch is non-linear: tones that listeners judge to be equally spaced in pitch are not equally spaced in frequency. • The Mel scale warps the frequency axis to match this perceptual spacing.
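The warping can be made concrete with the commonly used Mel formula, Mel(f) = 2595 · log10(1 + f/700), which is also the form given in the HTK Book. The sketch below is illustrative only; the function names are hypothetical, and `mel_spaced_frequencies` simply shows that points equally spaced in Mel grow further apart in Hz.

```python
import math

def hz_to_mel(f_hz):
    """Map a frequency in Hz onto the Mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping from Mel back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_spaced_frequencies(low_hz, high_hz, count):
    """Return `count` frequencies (Hz) equally spaced on the Mel scale.

    Requires count >= 2; the end points are low_hz and high_hz.
    """
    lo, hi = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (hi - lo) / (count - 1)
    return [mel_to_hz(lo + i * step) for i in range(count)]
```

Calling `mel_spaced_frequencies(0.0, 8000.0, 10)` gives points that are dense at low frequencies and sparse at high ones, which is how a Mel filterbank allocates its channels.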
Perceptual Linear Prediction Analysis • Perceptual linear prediction combines linear prediction and cepstral analysis. • The spectrum of the speech data is first warped onto the Mel scale. • The warped spectrum is then compressed by taking the cube root, and linear prediction coefficients are computed. • These coefficients are then converted to cepstral coefficients.
Signal Energy and Derivatives • Signal Energy • Delta Coefficients • Acceleration Coefficients • Third Differential Coefficients
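Delta coefficients are typically computed with a regression over neighbouring frames, d_t = Σ θ·(c_{t+θ} − c_{t−θ}) / (2·Σ θ²) for θ = 1..Θ, and acceleration coefficients by applying the same formula to the deltas. The sketch below follows that formula; the window size of 2 and the edge handling (repeating the first or last frame) are assumptions chosen to mirror what the HTK Book describes as default behaviour, and the function name `deltas` is hypothetical.

```python
def deltas(frames, window=2):
    """Regression-based time derivatives of a sequence of feature vectors.

    frames: list of equal-length lists of floats, one per time step.
    window: regression half-width (theta runs from 1 to window).
    Frames beyond either end are replaced by the nearest edge frame.
    """
    n = len(frames)
    dim = len(frames[0])
    denom = 2.0 * sum(th * th for th in range(1, window + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(dim):
            num = 0.0
            for th in range(1, window + 1):
                ahead = frames[min(t + th, n - 1)][d]
                behind = frames[max(t - th, 0)][d]
                num += th * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out
```

Acceleration coefficients follow as `deltas(deltas(frames))`, and third differential coefficients by applying the function once more.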
Speech Processing of the AMI Corpus • Ideal Solution Yields Generic Feature Types from Generic Corpus • Corpora Have Varying Audio File Types and Varying Organizational Structures • Corpora Have Varying Methods for Annotation
Speech Processing of the AMI Corpus • Project Solution Yields Generic Feature Types from Corpora with RIFF-Format WAV Audio Files • Two Main Functions of Script • Traverse Corpus Directory Tree • Generate List of Audio Files • Produce Feature Data • Using User-Defined Configuration File
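The two functions of the script could look roughly like the sketch below. This is not the original script, which is not shown in the slides; it is an assumed reconstruction in which the directory walk collects `.wav` files, a script file of "source target" pairs is written, and the feature extraction step is delegated to HTK's `HCopy` tool through its `-C` (configuration file) and `-S` (script file) options. All function names, the `.mfc` extension, and the flat output directory are hypothetical.

```python
import os

def find_wav_files(corpus_root):
    """Walk the corpus directory tree and return sorted paths of .wav files."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(corpus_root):
        for name in filenames:
            if name.lower().endswith(".wav"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)

def write_script_file(wav_paths, feature_dir, scp_path, extension=".mfc"):
    """Write one 'source target' pair per line for HCopy's -S option.

    Assumes file base names are unique across the corpus and that paths
    contain no spaces; a real script would need to handle both cases.
    """
    os.makedirs(feature_dir, exist_ok=True)
    with open(scp_path, "w") as scp:
        for src in wav_paths:
            base = os.path.splitext(os.path.basename(src))[0]
            target = os.path.join(feature_dir, base + extension)
            scp.write(f"{src} {target}\n")

def hcopy_command(config_path, scp_path):
    """Build the HCopy invocation as an argument list (not executed here)."""
    return ["HCopy", "-C", config_path, "-S", scp_path]
```

The returned argument list could be passed to `subprocess.run` on a machine with HTK installed. The user-defined configuration file would then select the feature type, for example with settings such as `SOURCEFORMAT = WAV` and `TARGETKIND = MFCC_E_D_A`.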
Future Development • Expand Script to Handle Audio Inputs of Any File Type • Include Processing for Specific Corpus Annotations