Age and Gender Classification using Modulation Cepstrum
Jitendra Ajmera (presented by Christian Müller)
Speaker Odyssey 2008
Previous work
Characteristic acoustic features.
Jitter and Shimmer (C. Müller et al.)
Phonetic cues (S. Schoetz)
Cepstral coefficients
Motivation and intuition behind this work
Features such as cepstral coefficients characterize the exact content of the signal. Much of this
information is not useful for age/gender classification; e.g., we can identify age and gender from speech
in a foreign language that we do not understand.
Therefore, features that characterize the slowly varying temporal envelope should be more advantageous.
Mel Cepstrum Modulation Spectrum (MCMS) features (V. Tyagi et al.)
n: time instant
k: cepstral coefficient index
q: modulation frequency index
P: context window width (11 frames)
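The symbols above suggest a transform taken along time for each cepstral coefficient. The following is a minimal sketch, not necessarily the exact formulation used in this work: it assumes the modulation spectrum at frame n is the magnitude of a DFT over the P frames starting at n, computed separately for each cepstral index k, and that the first few non-DC bins q = 1, 2, ... are kept. The function name `mcms` and the parameter `num_q` are hypothetical.

```python
import numpy as np

def mcms(cepstra, P=11, num_q=3):
    """Sketch of modulation-spectrum features from a cepstral sequence.

    cepstra: array of shape (num_frames, num_coeffs), indexed [n, k].
    Returns magnitudes of shape (num_frames - P + 1, num_q, num_coeffs),
    indexed [n, q - 1, k].
    Assumptions (not from the slides): magnitude DFT, DC bin dropped.
    """
    cepstra = np.asarray(cepstra, dtype=float)
    num_frames, num_coeffs = cepstra.shape
    if num_frames < P:
        raise ValueError("need at least P frames")
    num_windows = num_frames - P + 1
    # windows[n, p, k] = cepstra[n + p, k] for p = 0 .. P-1
    windows = np.stack(
        [cepstra[p:p + num_windows] for p in range(P)], axis=1
    )
    # DFT along the context-window axis p gives one bin per index q
    spectrum = np.fft.fft(windows, axis=1)
    # Keep bins q = 1 .. num_q (skip the DC bin q = 0)
    return np.abs(spectrum[:, 1:num_q + 1, :])
```

For a cepstral trajectory that completes exactly one cycle per P frames, all of the energy falls in the q = 1 bin, which is a quick sanity check on the indexing.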
German SpeechDat Corpus
A human-labelling experiment on a subset of the test data yielded ~55%
overall classification accuracy.
Both systems are based on a GMM
(Gaussian Mixture Model) acoustic
model and a maximum-likelihood classifier.
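As an illustration of GMM-based maximum-likelihood classification, the sketch below scores an utterance against one GMM per class and picks the class with the highest total log-likelihood. It assumes diagonal covariances, independent frames, and already-trained model parameters; none of these details are given in the slides, and the names `gmm_log_likelihood` and `classify` are hypothetical.

```python
import numpy as np

def gmm_log_likelihood(frames, weights, means, variances):
    """Total log-likelihood of frames (T, D) under a diagonal-covariance GMM.

    weights: (M,), means: (M, D), variances: (M, D).
    """
    frames = np.asarray(frames, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    diff = frames[:, None, :] - means[None, :, :]            # (T, M, D)
    # Log density of each frame under each Gaussian component
    log_gauss = -0.5 * (
        np.sum(diff ** 2 / variances, axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)
    )                                                        # (T, M)
    log_mix = log_gauss + np.log(weights)
    # Numerically stable log-sum-exp over the components
    peak = log_mix.max(axis=1, keepdims=True)
    per_frame = peak[:, 0] + np.log(np.exp(log_mix - peak).sum(axis=1))
    return float(per_frame.sum())

def classify(frames, class_models):
    """Return the label whose GMM gives the highest log-likelihood.

    class_models maps a label to a (weights, means, variances) tuple.
    """
    scores = {
        label: gmm_log_likelihood(frames, *params)
        for label, params in class_models.items()
    }
    return max(scores, key=scores.get)
```

In a setup like this, each age/gender class would have its own GMM trained on that class's feature vectors, and `classify` would be called once per test utterance.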
Both systems have equal dimension (21) of
feature vectors and hence the same number
Performance of MCMS features as a
function of duration and in/out-domain data.
Classification accuracy saturates at 3 modulation
frequencies (3-14 Hz) and starts dropping after 4
modulation frequencies. This also explains why MFCC
features perform worse than MCMS features.
Modulation frequency response of the first 3 MCMS filters.
These 3 filters provide complementary
information. For speech recognition, 7 filters (3-22 Hz)
provide the best performance.