This study explores discriminative fine-tuning schemes for pre-trained language models (BERT and GPT2) in the context of speech recognition rescoring. The approach directly optimizes the minimum word-error-rate criterion so that the second-pass rescoring model is trained discriminatively. The experiments use LibriSpeech data, a Whisper tiny ASR model, and BERT/GPT2 rescoring models, with results demonstrating the effectiveness of the proposed methods.
Discriminative Speech Recognition Rescoring with Pre-trained Language Models
Prashanth Gurunath Shivakumar, Jari Kolehmainen, Yile Gu, Ankur Gandhe, Ariya Rastrow, Ivan Bulyko
Amazon Alexa AI, USA
ASRU 2023
Introduction
• Discriminative training, which directly optimizes the minimum word-error-rate (MWER) criterion, typically improves rescoring
• The authors propose and explore several discriminative fine-tuning schemes for pre-trained LMs (BERT and GPT2)
Non-Discriminative ASR Rescoring
New score of the $i$-th ASR hypothesis: $s_i = \log P_{LM}(x_i) + \beta \log P_{A}(a \mid x_i)$
$\log P_{LM}(x_i)$: likelihood score from the rescoring LM; $\log P_{A}(a \mid x_i)$: sequence probability from the 1st-pass acoustic model
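A minimal sketch of this score interpolation; the function name, tensor shapes, and the default value of beta are illustrative, not taken from the paper:

```python
import torch

def combine_scores(lm_log_probs: torch.Tensor,
                   am_log_probs: torch.Tensor,
                   beta: float = 0.5) -> torch.Tensor:
    """Second-pass rescoring score s_i = log P_LM(x_i) + beta * log P_A(a | x_i).

    lm_log_probs: (n,) log-likelihoods of the n-best hypotheses under the rescoring LM
    am_log_probs: (n,) first-pass acoustic/sequence log-probabilities
    beta:         interpolation weight (the value here is an assumption)
    """
    return lm_log_probs + beta * am_log_probs

# The hypothesis with the highest combined score is selected:
# best_idx = torch.argmax(combine_scores(lm_scores, am_scores))
```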
Non-Discriminative ASR Rescoring: GPT2
New score of the $i$-th ASR hypothesis: $s_i = \log P_{LM}(x_i) + \beta \log P_{A}(a \mid x_i)$ (likelihood score from the rescoring LM plus the sequence probability from the 1st-pass acoustic model)
GPT2: $P_{LM}(x_i) = \prod_{t=1}^{T} P(x_{i,t} \mid x_{i,1}, \ldots, x_{i,t-1})$
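A sketch of computing this causal log-likelihood with the Hugging Face `transformers` GPT2 checkpoint; the `gpt2` checkpoint mirrors the 117M model listed in the experimental setup, but the helper itself is illustrative:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def gpt2_log_likelihood(text: str) -> float:
    """Sum of log P(x_t | x_1..x_{t-1}) over the token sequence."""
    ids = tokenizer(text, return_tensors="pt").input_ids      # (1, T)
    logits = model(ids).logits                                 # (1, T, V)
    # position t predicts token t+1; the first token has no left context here
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:].unsqueeze(-1)                         # (1, T-1, 1)
    return log_probs.gather(-1, targets).sum().item()
```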
Non-Discriminative ASR Rescoring: BERT
New score of the $i$-th ASR hypothesis: $s_i = \log P_{LM}(x_i) + \beta \log P_{A}(a \mid x_i)$ (likelihood score from the rescoring LM plus the sequence probability from the 1st-pass acoustic model)
BERT: pseudo-log-likelihood $\log P_{LM}(x_i) = \sum_{t=1}^{T} \log P(x_{i,t} \mid x_{i,\setminus t})$
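A sketch of the masked-LM pseudo-log-likelihood with `transformers`; the `bert-base-uncased` checkpoint matches the 110M BERT size mentioned later, but the helper and its one-mask-at-a-time loop are illustrative:

```python
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def bert_pseudo_log_likelihood(text: str) -> float:
    """Mask one position at a time and sum log P(x_t | x_{\t})."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]   # includes [CLS]/[SEP]
    total = 0.0
    for t in range(1, len(ids) - 1):                           # skip the special tokens
        masked = ids.clone()
        masked[t] = tokenizer.mask_token_id
        logits = model(masked.unsqueeze(0)).logits[0, t]       # (V,)
        total += torch.log_softmax(logits, dim=-1)[ids[t]].item()
    return total
```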
Discriminative training for ASR
• Minimum word error rate (MWER) loss
• Minimize the ASR model's expected word error rate: $\mathcal{L}_{MWER}(x, x^*) = \mathbb{E}_{P(x \mid a)}\!\left[\mathcal{E}(x, x^*)\right]$
$\mathcal{E}(x, x^*)$: edit distance between the ASR hypothesis $x$ and the ground-truth transcript $x^*$; $P(x \mid a)$: probability of ASR hypothesis $x$ given speech $a$
• Approximated by restricting the sequence probability to the n-best hypotheses: $\mathcal{L}_{MWER}(x, x^*) = \sum_{i=1}^{n} P(x_i \mid a)\, \mathcal{E}(x_i, x^*)$
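A minimal sketch of the n-best MWER approximation; it assumes the per-hypothesis scores are log-probabilities (or any scores) that a softmax renormalizes over the list, and the function name is hypothetical:

```python
import torch

def mwer_loss(scores: torch.Tensor, word_errors: torch.Tensor) -> torch.Tensor:
    """Expected edit distance over the n-best list.

    scores:      (n,) per-hypothesis log-scores; softmax renormalizes them into P(x_i | a)
    word_errors: (n,) edit distance E(x_i, x*) of each hypothesis against the reference
    """
    probs = torch.softmax(scores, dim=-1)
    return (probs * word_errors).sum()

# Example: three hypotheses with 0, 1 and 2 word errors
# loss = mwer_loss(torch.tensor([-1.0, -1.5, -3.0]), torch.tensor([0.0, 1.0, 2.0]))
```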
Discriminative ASR Rescoring: GPT2 + MWER
• The MWER loss above can also be used to train the 2nd-pass rescoring model discriminatively
• Minimize the rescoring model's expected word error rate: $\mathcal{L}_{MWER}(x, x^*) = \sum_{i=1}^{n} \hat{P}(x_i \mid a)\, \mathcal{E}(x_i, x^*)$, with $\hat{P}(x_i \mid a) = \dfrac{e^{s_i}}{\sum_{j=1}^{n} e^{s_j}}$ the probability of choosing ASR hypothesis $x_i$ after rescoring
New score of the $i$-th ASR hypothesis after rescoring: $s_i = \log \prod_{t=1}^{T} P(x_{i,t} \mid x_{i,1}, \ldots, x_{i,t-1}) + \beta \log P_{A}(a \mid x_i)$
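A sketch of one MWER update for a GPT2 rescorer on a single utterance's n-best list; the optimizer, learning rate, beta value, and function names are assumptions, and the per-hypothesis loop is kept simple rather than batched:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)   # lr is an assumption

def gpt2_score(text: str) -> torch.Tensor:
    """Differentiable log P_LM(x) so gradients reach the GPT2 weights."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    log_probs = torch.log_softmax(model(ids).logits[:, :-1], dim=-1)
    return log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).sum()

def mwer_step(nbest, am_log_probs, word_errors, beta=0.5):
    """One discriminative update: softmax over rescored hypotheses, expected WER, backprop."""
    lm_scores = torch.stack([gpt2_score(h) for h in nbest])      # (n,)
    scores = lm_scores + beta * am_log_probs                     # s_i
    loss = (torch.softmax(scores, dim=-1) * word_errors).sum()   # L_MWER
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```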
Discriminative ASR Rescoring: BERT + MWER
• The MWER loss above can likewise be used to train the 2nd-pass rescoring model discriminatively
• Minimize the rescoring model's expected word error rate: $\mathcal{L}_{MWER}(x, x^*) = \sum_{i=1}^{n} \hat{P}(x_i \mid a)\, \mathcal{E}(x_i, x^*)$, with $\hat{P}(x_i \mid a) = \dfrac{e^{s_i}}{\sum_{j=1}^{n} e^{s_j}}$ the probability of choosing ASR hypothesis $x_i$ after rescoring
New score of the $i$-th ASR hypothesis after rescoring: $s_i = \sum_{t=1}^{T} \log P(x_{i,t} \mid x_{i,\setminus t}) + \beta \log P_{A}(a \mid x_i)$
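For BERT the same MWER step applies; only the scoring function changes to a differentiable pseudo-log-likelihood. A sketch, where the masking loop and helper name are illustrative and the scorer can stand in for `gpt2_score` in the update sketch above:

```python
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

def bert_pll_score(text: str) -> torch.Tensor:
    """Differentiable pseudo-log-likelihood: gradients flow back into BERT."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    total = torch.zeros(())
    for t in range(1, len(ids) - 1):                     # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[t] = tokenizer.mask_token_id
        logits = model(masked.unsqueeze(0)).logits[0, t]
        total = total + torch.log_softmax(logits, dim=-1)[ids[t]]
    return total
```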
Discriminative ASR Rescoring: RescoreBERT
$\mathcal{L}_{MWER}(x, x^*) = \sum_{i=1}^{n} \hat{P}(x_i \mid a)\, \mathcal{E}(x_i, x^*)$, with $\hat{P}(x_i \mid a) = \dfrac{e^{s_i}}{\sum_{j=1}^{n} e^{s_j}}$
New score of the $i$-th ASR hypothesis after rescoring: $s_i = \log P_{LM}(x_i) + \beta \log P_{A}(a \mid x_i)$
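In RescoreBERT-style models the LM term is typically produced by a small feed-forward head on BERT's [CLS] embedding rather than by a token-level likelihood. A minimal sketch, assuming a checkpoint, head sizes, and pooling choice that the slides do not specify:

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class RescoreBERT(nn.Module):
    """Feed-forward head on the [CLS] embedding produces the second-pass score."""
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.head = nn.Sequential(nn.Linear(768, 256), nn.Tanh(), nn.Linear(256, 1))

    def forward(self, input_ids, attention_mask):
        h_cls = self.bert(input_ids=input_ids,
                          attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.head(h_cls).squeeze(-1)              # one scalar score per hypothesis

# Usage: tokenize the n-best list with padding, score all hypotheses at once,
# then feed the scores into the MWER loss shown earlier.
# tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
# batch = tok(nbest, padding=True, return_tensors="pt")
# lm_scores = RescoreBERT()(batch.input_ids, batch.attention_mask)
```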
Discriminative ASR Rescoring: RescoreGPT
$\mathcal{L}_{MWER}(x, x^*) = \sum_{i=1}^{n} \hat{P}(x_i \mid a)\, \mathcal{E}(x_i, x^*)$, with $\hat{P}(x_i \mid a) = \dfrac{e^{s_i}}{\sum_{j=1}^{n} e^{s_j}}$
New score of the $i$-th ASR hypothesis after rescoring: $s_i = \log P_{LM}(x_i) + \beta \log P_{A}(a \mid x_i)$
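An analogous sketch for a GPT2-based scorer that pools the last token's hidden state into a scalar; the pooling choice and head sizes are assumptions rather than the paper's configuration (the attention pooling on the next slide is one alternative):

```python
import torch
import torch.nn as nn
from transformers import GPT2Model

class RescoreGPT(nn.Module):
    """Feed-forward head on GPT2's final hidden state yields the second-pass score."""
    def __init__(self):
        super().__init__()
        self.gpt2 = GPT2Model.from_pretrained("gpt2")
        self.head = nn.Sequential(nn.Linear(768, 256), nn.Tanh(), nn.Linear(256, 1))

    def forward(self, input_ids):
        h = self.gpt2(input_ids=input_ids).last_hidden_state   # (B, T, 768)
        return self.head(h[:, -1]).squeeze(-1)                 # score from the last token
```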
Discriminative ASR Rescoring: Attention Pooling
$\log P_{LM}(x_i) = w^{\top}\,\mathrm{softmax}\!\left(QK^{\top}\right) V$
$Q = W_Q H, \quad K = W_K H, \quad V = W_V H$
$H = [h_1, \ldots, h_T]$: hidden output embeddings; $w, W_Q, W_K, W_V$: learnable weights
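A sketch of attention pooling over the LM's output embeddings; the 1/sqrt(d) scaling and the mean over time before applying $w$ are assumptions made to obtain a well-defined scalar, since the slide only gives the general form:

```python
import torch
import torch.nn as nn

class AttentionPoolingScore(nn.Module):
    """Self-attention over the LM output embeddings H, reduced to a scalar score per hypothesis."""
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w = nn.Linear(d_model, 1, bias=False)
        self.scale = d_model ** 0.5

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, d_model) output embeddings h_1..h_T of the rescoring LM
        q, k, v = self.w_q(hidden), self.w_k(hidden), self.w_v(hidden)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)  # (B, T, T)
        pooled = (attn @ v).mean(dim=1)                                     # (B, d_model)
        return self.w(pooled).squeeze(-1)                                   # (B,)
```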
Dataset and Experimental Setup
• Data: LibriSpeech (1000 hours)
• ASR: Whisper tiny (Transformer, 39M parameters); generates 10-best hypotheses
• Rescoring models: BERT (110M), GPT2 (117M)
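A sketch of producing a 10-best list with Whisper tiny via Hugging Face `transformers` and `datasets`; the dataset split, beam settings, and use of `sequences_scores` as the first-pass score are illustrative assumptions, not the paper's exact pipeline:

```python
import torch
from datasets import load_dataset
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny").eval()

# one LibriSpeech utterance; streaming avoids downloading the full corpus
sample = next(iter(load_dataset("librispeech_asr", "clean", split="test", streaming=True)))
inputs = processor(sample["audio"]["array"], sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        inputs.input_features,
        num_beams=10,
        num_return_sequences=10,     # 10-best list for the rescoring model
        return_dict_in_generate=True,
        output_scores=True,
    )

nbest = processor.batch_decode(out.sequences, skip_special_tokens=True)
am_scores = out.sequences_scores     # per-hypothesis first-pass log score
```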