Deep Learning in Speech Processing: Potentials and Challenges

Dong Yu, Microsoft Research

Why Speech Recognition is Hard
• Sequential multi-class problem
• Variability
  • Articulation differences
  • Environment differences
• Variability
  • Within each frame
  • Along the whole sequence
• Variability
  • Mapping between different layers
  • Across different dialog contexts

Potentials of Deep Learning
• Derive robust and discriminative features
  • Directly from waveforms and/or spectra
  • With many layers of transformations
  • Compact with regard to the number of parameters
  • Sparse with regard to the number of active features
  • Distributed with regard to the information storage
• Learn and incorporate long-range dependencies
  • At different levels: semantic, syntactic, pronunciation
  • Discovered automatically
  • Learn when the context is important and when it is not

Challenges
• Basic theory
  • Why does greedy layer-wise pre-training help?
  • Is there a better way to pre-train the models?
• Basic model
  • How to integrate generative and discriminative abilities?
  • How to represent sequential patterns?
  • How to discover the linguistic hierarchy?
  • How to combine supervised, unsupervised, and lightly supervised learning?
• Special considerations
  • Is it robust to mismatched test conditions?
  • Can we scale the learning process up to more than 2000 hours of speech?
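For readers unfamiliar with the term, the greedy layer-wise pre-training mentioned under "Basic theory" can be sketched roughly as follows. This is a minimal illustration, not the method used in the talk: it assumes tied-weight sigmoid autoencoders trained on squared reconstruction error, because the slides do not say which building block is used. The function names, layer sizes, learning rate, and the random stand-in data are all hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train_autoencoder_layer(data, n_hidden, epochs=200, lr=0.1, seed=0):
    """Train one tied-weight sigmoid autoencoder with full-batch gradient descent.

    Returns the encoder weights, the hidden bias, and the per-epoch
    mean squared reconstruction error.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_visible = data.shape
    W = rng.normal(0.0, 0.1, size=(n_visible, n_hidden))
    b_h = np.zeros(n_hidden)
    b_v = np.zeros(n_visible)
    losses = []
    for _ in range(epochs):
        h = sigmoid(data @ W + b_h)        # encode
        recon = sigmoid(h @ W.T + b_v)     # decode with the same (tied) weights
        err = recon - data
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the squared error through decoder and encoder.
        d_recon = err * recon * (1.0 - recon)
        d_h = (d_recon @ W) * h * (1.0 - h)
        grad_W = (data.T @ d_h + d_recon.T @ h) / n_samples
        W -= lr * grad_W
        b_h -= lr * d_h.mean(axis=0)
        b_v -= lr * d_recon.mean(axis=0)
    return W, b_h, losses


def greedy_pretrain(data, layer_sizes, **kwargs):
    """Train layers one at a time; each layer sees the frozen output of the previous one."""
    stack = []
    layer_input = data
    for n_hidden in layer_sizes:
        W, b_h, losses = train_autoencoder_layer(layer_input, n_hidden, **kwargs)
        stack.append((W, b_h, losses))
        layer_input = sigmoid(layer_input @ W + b_h)
    return stack, layer_input


# Hypothetical stand-in for acoustic feature frames: 64 frames, 20 dimensions.
frames = np.random.default_rng(0).random((64, 20))
stack, top_features = greedy_pretrain(frames, layer_sizes=[12, 6])
```

In such a scheme the pre-trained weights would normally serve only as an initialization; the whole stack is then fine-tuned with a supervised objective.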
