This presentation describes the fundamental frequency variation (FFV) spectrum, a vector-valued representation of pitch variation inspired by vanishing-point perspective. Two 512-point Fourier transforms, FL and FR, are computed every 8 ms and compared using a vanishing-point product, a 1-point perspective counterpart of the standard inner product. Because the discrete transforms are in general not defined at the dilated frequencies this comparison needs, the values there are obtained by linear interpolation. The approach does not require peak identification, median filtering, or mean pitch estimation and subtraction. Initial experiments corroborate research on human turn-taking behavior; proposed next tasks include speaker verification and automatic speech recognition for tonal languages.
THE FUNDAMENTAL FREQUENCY VARIATION SPECTRUM FONETIK 2008 Kornel Laskowski, Mattias Heldner and Jens Edlund interACT, Carnegie Mellon University, Pittsburgh PA, USA Centre for Speech Technology, KTH Stockholm, Sweden Speaker: Hsiao-Tsung
Introduction • While speech recognition systems long ago transitioned from formant localization to spectral (vector-valued) formant representations, prosodic processing continues to rely squarely on a pitch tracker’s ability to identify a peak corresponding to the fundamental frequency (F0) of the speaker. • Even if a robust, local, analytic, statistical estimate of absolute pitch were available, applications require a representation of pitch variation and go to considerable additional effort to identify a speaker-dependent quantity for normalization.
The Fundamental Frequency Variation Spectrum • Instantaneous variation in pitch is normally computed by determining a single scalar, the F0, at two temporally adjacent instants and forming their difference.
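For reference, the conventional scalar computation described on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `delta_f0`, the use of NaN for frames without a pitch estimate, and the example values are all assumptions.

```python
import numpy as np

def delta_f0(f0_track):
    """Frame-to-frame difference of a scalar F0 track (Hz).

    Frames without a pitch estimate are assumed to be marked NaN,
    so any difference that touches one of them is also NaN.
    """
    f0 = np.asarray(f0_track, dtype=float)
    return f0[1:] - f0[:-1]

# Hypothetical F0 values at adjacent instants (Hz); NaN marks a missing estimate.
f0 = [120.0, 124.0, np.nan, 118.0, 115.0]
print(delta_f0(f0))
```

The sketch makes the dependence on the pitch tracker visible: one missing F0 value invalidates two consecutive differences.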
The Fundamental Frequency Variation Spectrum • We propose a vector-valued representation of pitch variation, inspired by vanishing-point perspective. • The standard inner product between two vectors can be viewed as the summation of pair-wise products, with pairs selected by orthonormal projection onto a point at infinity. • F: the signal’s spectral content (512-point FFT)
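The standard inner product mentioned on this slide pairs bin k of one spectrum with bin k of the other. A minimal sketch follows, assuming a Hann window, a 16 kHz sampling rate, and synthetic sinusoidal frames; none of these choices come from the slides, and the function names are hypothetical.

```python
import numpy as np

N_FFT = 512  # transform size given on the slide

def magnitude_spectrum(frame, n_fft=N_FFT):
    """Magnitude of the n_fft-point FFT of one frame.

    The Hann window is an assumption; the slide does not name a window.
    """
    frame = np.asarray(frame, dtype=float)
    return np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n=n_fft))

def inner_product(f_a, f_b):
    """Sum of pair-wise products, pairing bin k of f_a with bin k of f_b."""
    return float(np.sum(f_a * f_b))

# Two hypothetical frames: sinusoids at 200 Hz and 210 Hz, sampled at 16 kHz.
fs = 16000
t = np.arange(N_FFT) / fs
f_a = magnitude_spectrum(np.sin(2 * np.pi * 200 * t))
f_b = magnitude_spectrum(np.sin(2 * np.pi * 210 * t))
print(len(f_a), inner_product(f_a, f_b))
```

Because the pairing is fixed at equal bin indices, this product has no parameter that could express a change in pitch between the two frames.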
The Fundamental Frequency Variation Spectrum • The proposed vanishing-point product instead induces a 1-point perspective projection onto a point at
The Fundamental Frequency Variation Spectrum • The FFV spectrum is then given by • is undefined over the interval [-T0, +T0]
The Fundamental Frequency Variation Spectrum • A support for which is continuous over • In practice, we compute using magnitude rather than complex spectra
The Fundamental Frequency Variation Spectrum • FL and FR are 512-point Fourier transforms, computed every 8 ms. • However, the discrete transforms FL and FR are in general not defined at the corresponding dilated frequencies. • We resort to linear interpolation using the coefficients
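The interpolation step can be illustrated with a small sketch. The slide's own formula and interpolation coefficients did not survive in this transcript, so everything here is an assumption made for illustration: the generic `scale` parameter (its relation to the FFV variable is not specified), the zero fill beyond the last bin, the normalization, and the synthetic harmonic spectra.

```python
import numpy as np

def sample_dilated(mag, scale):
    """Linearly interpolate a magnitude spectrum at the fractional bins scale * k.

    Positions beyond the last defined bin are set to 0 (an assumption).
    """
    k = np.arange(len(mag))
    return np.interp(scale * k, k, mag, right=0.0)

def dilated_similarity(f_left, f_right, scale):
    """Normalized product of f_left sampled at scale * k with f_right at k."""
    fl = sample_dilated(f_left, scale)
    denom = np.sqrt(np.sum(fl ** 2) * np.sum(f_right ** 2))
    return float(np.sum(fl * f_right) / denom) if denom > 0 else 0.0

def harmonic_shape(x, spacing=20.0, width=2.0, n_harmonics=12):
    """Synthetic spectrum: Gaussian bumps at multiples of `spacing` bins."""
    centers = spacing * np.arange(1, n_harmonics + 1)
    return np.exp(-0.5 * ((x[:, None] - centers) / width) ** 2).sum(axis=1)

k = np.arange(257, dtype=float)
true_scale = 0.9                            # hypothetical dilation between frames
f_left = harmonic_shape(k)                  # harmonics every 20 bins
f_right = harmonic_shape(true_scale * k)    # the same shape, dilated

scales = np.linspace(0.8, 1.2, 41)
scores = [dilated_similarity(f_left, f_right, s) for s in scales]
best_scale = float(scales[int(np.argmax(scores))])
print(best_scale)
```

Scanning the scale and keeping the whole score curve, rather than only its maximum, is what gives a vector-valued description of the change between the two frames.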
The Fundamental Frequency Variation Spectrum • Energy independent
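One common way for a spectral comparison to become independent of frame energy is to divide by the norms of both spectra. Whether this matches the normalization used on the original slide is an assumption, since the formula is not preserved here; the function name and the random test spectra are hypothetical.

```python
import numpy as np

def normalized_product(f_left, f_right):
    """Inner product of two magnitude spectra divided by both of their norms."""
    num = np.sum(f_left * f_right)
    den = np.sqrt(np.sum(f_left ** 2) * np.sum(f_right ** 2))
    return float(num / den)

# Hypothetical magnitude spectra (non-negative random values).
rng = np.random.default_rng(0)
f_left = np.abs(rng.normal(size=257))
f_right = np.abs(rng.normal(size=257))

quiet = normalized_product(f_left, f_right)
# Rescale each frame by a different gain; the normalized value should not move.
loud = normalized_product(10.0 * f_left, 0.1 * f_right)
print(np.isclose(quiet, loud))
```

Any gain applied to either frame cancels between numerator and denominator, so loudness differences between the two frames do not affect the result.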
Filterbank • Slowly changing • Rapidly changing
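A filterbank reduces a long spectrum to a handful of band values. The sketch below is illustrative only: the slide names bands for slowly and rapidly changing pitch but gives no filter shapes or edges, so the rectangular filters, the 512-point spectrum length, the central band for unchanged pitch, and every band edge are assumptions.

```python
import numpy as np

def apply_filterbank(g, bands):
    """Average the spectrum g over each (start, stop) index band."""
    g = np.asarray(g, dtype=float)
    return np.array([g[a:b].mean() for a, b in bands])

# Hypothetical layout for a 512-point spectrum whose centre (index 256)
# is assumed to correspond to no pitch change. All edges are illustrative.
bands = [
    (0, 200),    # rapidly changing, one direction
    (200, 250),  # slowly changing, same direction
    (250, 262),  # roughly unchanged pitch
    (262, 312),  # slowly changing, other direction
    (312, 512),  # rapidly changing, other direction
]

# Toy spectrum with its mass slightly off centre.
g = np.zeros(512)
g[270:280] = 1.0
features = apply_filterbank(g, bands)
print(features)
```

In this toy case the mass falls entirely in the fourth band, so that entry of `features` is the only non-zero one.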
Discussion • Initial experiments show that HMMs built on this representation, when trained on dialogue data, corroborate research on human turn-taking behavior in conversations. • The approach does not require peak identification, dynamic time warping, median filtering, landmark detection, linearization, or mean pitch estimation and subtraction. • Immediate next steps include fine-tuning the filter banks and the HMM topologies, and testing the results on other tasks where pitch movements are expected to play a role, such as the attitudinal coloring of short feedback utterances, speaker verification, and automatic speech recognition for tonal languages.