A Linked-HMM for Robust Voicing and Speech Detection

Presented by: Emiliano Miluzzo

Why is the mic an important sensor for a people-centric sensing approach?

In a few words…
• A linked-HMM for simultaneous, robust voicing and speech detection.
• Targets a range of experimental settings: low sampling rates, far-field mics, ambient noise.
• Features are independent of signal energy.
• Exploits speech patterns, which are usually combinations of talking and silence segments.

What’s nice about the paper
• The first paper to apply a linked-HMM to speech and voicing detection.
• A “simple” algorithm: feature extraction plus the forward-backward algorithm.
• Experimental evaluation of some aspects of the proposed algorithm.
• I learned something useful: how to remove the impact of a constant source contribution (fan, wind blowing, etc.).
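The forward-backward algorithm named on this slide can be illustrated on an ordinary two-state HMM. The sketch below is not the paper's linked-HMM, which couples a voicing chain with a speech chain; it only shows the basic smoothing step on a single chain. The function name, the state labels, and all the numbers are assumptions chosen for illustration.

```python
def forward_backward(obs_lik, trans, prior):
    """Posterior state probabilities for a discrete-state HMM.

    obs_lik[t][s]: likelihood of the frame-t observation under state s
    trans[i][j]:   P(state j at t+1 | state i at t)
    prior[s]:      P(state s at t = 0)
    """
    T, S = len(obs_lik), len(prior)

    # Forward pass, renormalized at every frame to avoid underflow.
    alpha = []
    for t in range(T):
        if t == 0:
            a = [prior[s] * obs_lik[0][s] for s in range(S)]
        else:
            a = [obs_lik[t][s] * sum(alpha[t - 1][i] * trans[i][s] for i in range(S))
                 for s in range(S)]
        z = sum(a)
        alpha.append([x / z for x in a])

    # Backward pass, also renormalized per frame.
    beta = [[1.0] * S for _ in range(T)]
    for t in range(T - 2, -1, -1):
        b = [sum(trans[s][j] * obs_lik[t + 1][j] * beta[t + 1][j] for j in range(S))
             for s in range(S)]
        z = sum(b)
        beta[t] = [x / z for x in b]

    # Combine both passes into per-frame posteriors.
    posteriors = []
    for t in range(T):
        g = [alpha[t][s] * beta[t][s] for s in range(S)]
        z = sum(g)
        posteriors.append([x / z for x in g])
    return posteriors


# Hypothetical 3-frame example: state 0 = unvoiced, state 1 = voiced.
# The middle frame is ambiguous on its own and leans slightly unvoiced.
obs_lik = [[0.1, 0.9], [0.55, 0.45], [0.1, 0.9]]
trans = [[0.9, 0.1], [0.1, 0.9]]  # "sticky" states: voicing tends to persist
prior = [0.5, 0.5]

posteriors = forward_backward(obs_lik, trans, prior)
print(posteriors[1])  # posterior for the ambiguous middle frame
```

With sticky transitions, the ambiguous middle frame is pulled toward the voiced state by its neighbours, which is the kind of temporal smoothing an HMM contributes on top of per-frame features.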

How about the cons?
• Conceptually dense for a short paper.
• Consequently, clear explanations are often lacking.
• Is it generally applicable, for example to mobile devices such as cell phones?
• Trained on too few individuals (just 2), and this is a supervised ML method!
• The experimental protocol is unclear: what does “noisy conditions” mean?
• Is the comparison in Fig. 3 enough to show the improvement over an HMM?

Is the noisy autocorrelation always effective?
• What if the noise comes from a high-energy periodic source such as a motor?
• This suggests that the proposed technique might work better in indoor environments and perform worse on mobile devices.
• It is not clear how variations in a single feature (particularly the noisy autocorrelation) would affect the overall classification result.
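The concern about periodic noise can be made concrete with a toy experiment. The sketch below computes a plain normalized autocorrelation peak, not the paper's exact feature, and compares a synthetic 100 Hz hum with white noise. The sampling rate, frame length, hum frequency, minimum lag, and function name are all assumptions for illustration.

```python
import math
import random


def max_noninitial_autocorr(frame, min_lag=20):
    """Largest normalized autocorrelation value at lags >= min_lag.

    Small lags are skipped because any smooth signal correlates with
    itself there, whether or not it is periodic.
    """
    n = len(frame)
    mean = sum(frame) / n
    x = [v - mean for v in frame]
    energy = sum(v * v for v in x)
    if energy == 0:
        return 0.0
    best = 0.0
    for lag in range(min_lag, n // 2):
        r = sum(x[i] * x[i + lag] for i in range(n - lag)) / energy
        best = max(best, r)
    return best


random.seed(0)   # fixed seed so the noise frame is reproducible
RATE = 8000      # assumed sampling rate in Hz
N = 256          # assumed frame length in samples

# A motor-like hum: a pure 100 Hz tone (period of 80 samples at 8 kHz).
hum = [math.sin(2 * math.pi * 100 * i / RATE) for i in range(N)]
# Aperiodic background: white Gaussian noise.
noise = [random.gauss(0.0, 1.0) for _ in range(N)]

hum_peak = max_noninitial_autocorr(hum)
noise_peak = max_noninitial_autocorr(noise)
print("hum peak:", hum_peak)
print("noise peak:", noise_peak)
```

In this toy setup the hum yields a clearly stronger autocorrelation peak than the white noise, so a feature based on periodicity alone could plausibly confuse a motor with voiced speech. Whether the paper's full feature set and the linked-HMM resolve this is exactly the open question raised on the slide.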

A few questions
• How does the algorithm differentiate a singer singing a song from an actual conversation?
• Maybe checking whether the spectral content of the voiced segments changes over time would indicate that multiple people are talking.
• Does the system distinguish a conversation between speaker pair A from one between speaker pair B?
• Same as above; in addition, knowing the spectral pattern of the device owner's voice would help filter out outliers.

Overall
• A nice technique that could be applied to a broad set of scenarios, in my opinion mainly where computational resources are available and few sources of (periodic) noise are present. In these cases the error is small.
• Not sure about its applicability to mobile devices for real-time speech detection. Some aspects might be reusable, though.
• Can a scheme oriented to mobile devices trade off accuracy against speed?
