This study explores discriminative fine-tuning schemes for pre-trained language models (BERT and GPT2) in the context of speech recognition rescoring. The approach directly optimizes the minimum word-error-rate criterion so that the second-pass rescoring model is trained discriminatively. The experiments use LibriSpeech data, a Whisper tiny ASR model, and BERT/GPT2 rescoring models, with results demonstrating the effectiveness of the proposed methods.
Discriminative Speech Recognition Rescoring with Pre-trained Language Models
Prashanth Gurunath Shivakumar, Jari Kolehmainen, Yile Gu, Ankur Gandhe, Ariya Rastrow, Ivan Bulyko
Amazon Alexa AI, USA
ASRU 2023
Introduction
• Discriminative training, which directly optimizes the minimum word-error-rate (MWER) criterion, typically improves rescoring
• The authors propose and explore several discriminative fine-tuning schemes for pre-trained LMs (BERT and GPT2)
Non-Discriminative ASR Rescoring
New score of the $i$-th ASR hypothesis: $s_i = \log P_{LM}(x_i) + \beta \log P_{A}(a \mid x_i)$
$\log P_{LM}(x_i)$: likelihood score from the rescoring LM; $\log P_{A}(a \mid x_i)$: sequence probability from the 1st-pass acoustic model
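A minimal sketch of this score interpolation; the function name, tensor shapes, and the default value of beta are illustrative, not taken from the paper:

```python
import torch

def combine_scores(lm_log_probs: torch.Tensor,
                   am_log_probs: torch.Tensor,
                   beta: float = 0.5) -> torch.Tensor:
    """Second-pass rescoring score s_i = log P_LM(x_i) + beta * log P_A(a | x_i).

    lm_log_probs: (n,) log-likelihoods of the n-best hypotheses under the rescoring LM
    am_log_probs: (n,) first-pass acoustic/sequence log-probabilities
    beta:         interpolation weight (the value here is an assumption)
    """
    return lm_log_probs + beta * am_log_probs

# The hypothesis with the highest combined score is selected:
# best_idx = torch.argmax(combine_scores(lm_scores, am_scores))
```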
Non-Discriminative ASR Rescoring: GPT2
New score of the $i$-th ASR hypothesis: $s_i = \log P_{LM}(x_i) + \beta \log P_{A}(a \mid x_i)$ (likelihood score from the rescoring LM plus the sequence probability from the 1st-pass acoustic model)
GPT2: $P_{LM}(x_i) = \prod_{t=1}^{T} P(x_{i,t} \mid x_{i,1}, \ldots, x_{i,t-1})$
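A sketch of computing this causal log-likelihood with the Hugging Face `transformers` GPT2 checkpoint; the `gpt2` checkpoint mirrors the 117M model listed in the experimental setup, but the helper itself is illustrative:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def gpt2_log_likelihood(text: str) -> float:
    """Sum of log P(x_t | x_1..x_{t-1}) over the token sequence."""
    ids = tokenizer(text, return_tensors="pt").input_ids      # (1, T)
    logits = model(ids).logits                                 # (1, T, V)
    # position t predicts token t+1; the first token has no left context here
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:].unsqueeze(-1)                         # (1, T-1, 1)
    return log_probs.gather(-1, targets).sum().item()
```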
Non-Discriminative ASR Rescoring: BERT
New score of the $i$-th ASR hypothesis: $s_i = \log P_{LM}(x_i) + \beta \log P_{A}(a \mid x_i)$ (likelihood score from the rescoring LM plus the sequence probability from the 1st-pass acoustic model)
BERT: pseudo-log-likelihood $\log P_{LM}(x_i) = \sum_{t=1}^{T} \log P(x_{i,t} \mid x_{i,\setminus t})$
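A sketch of the masked-LM pseudo-log-likelihood with `transformers`; the `bert-base-uncased` checkpoint matches the 110M BERT size mentioned later, but the helper and its one-mask-at-a-time loop are illustrative:

```python
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def bert_pseudo_log_likelihood(text: str) -> float:
    """Mask one position at a time and sum log P(x_t | x_{\t})."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]   # includes [CLS]/[SEP]
    total = 0.0
    for t in range(1, len(ids) - 1):                           # skip the special tokens
        masked = ids.clone()
        masked[t] = tokenizer.mask_token_id
        logits = model(masked.unsqueeze(0)).logits[0, t]       # (V,)
        total += torch.log_softmax(logits, dim=-1)[ids[t]].item()
    return total
```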
Discriminative training for ASR
• Minimum word error rate (MWER) loss
• Minimize the ASR model's expected word error rate: $\mathcal{L}_{MWER}(x, x^*) = \mathbb{E}_{P(x \mid a)}\!\left[\mathcal{E}(x, x^*)\right]$
$\mathcal{E}(x, x^*)$: edit distance between the ASR hypothesis $x$ and the ground-truth transcript $x^*$; $P(x \mid a)$: probability of ASR hypothesis $x$ given speech $a$
• Approximated by restricting the sequence probability to the n-best hypotheses: $\mathcal{L}_{MWER}(x, x^*) = \sum_{i=1}^{n} P(x_i \mid a)\, \mathcal{E}(x_i, x^*)$
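A minimal sketch of the n-best MWER approximation; it assumes the per-hypothesis scores are log-probabilities (or any scores) that a softmax renormalizes over the list, and the function name is hypothetical:

```python
import torch

def mwer_loss(scores: torch.Tensor, word_errors: torch.Tensor) -> torch.Tensor:
    """Expected edit distance over the n-best list.

    scores:      (n,) per-hypothesis log-scores; softmax renormalizes them into P(x_i | a)
    word_errors: (n,) edit distance E(x_i, x*) of each hypothesis against the reference
    """
    probs = torch.softmax(scores, dim=-1)
    return (probs * word_errors).sum()

# Example: three hypotheses with 0, 1 and 2 word errors
# loss = mwer_loss(torch.tensor([-1.0, -1.5, -3.0]), torch.tensor([0.0, 1.0, 2.0]))
```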
Discriminative ASR Rescoring: GPT2 + MWER
• The MWER loss above can also be used to train the 2nd-pass rescoring model discriminatively
• Minimize the rescoring model's expected word error rate: $\mathcal{L}_{MWER}(x, x^*) = \sum_{i=1}^{n} \hat{P}(x_i \mid a)\, \mathcal{E}(x_i, x^*)$, with $\hat{P}(x_i \mid a) = \dfrac{e^{s_i}}{\sum_{j=1}^{n} e^{s_j}}$ the probability of choosing ASR hypothesis $x_i$ after rescoring
New score of the $i$-th ASR hypothesis after rescoring: $s_i = \log \prod_{t=1}^{T} P(x_{i,t} \mid x_{i,1}, \ldots, x_{i,t-1}) + \beta \log P_{A}(a \mid x_i)$
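A sketch of one MWER update for a GPT2 rescorer on a single utterance's n-best list; the optimizer, learning rate, beta value, and function names are assumptions, and the per-hypothesis loop is kept simple rather than batched:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)   # lr is an assumption

def gpt2_score(text: str) -> torch.Tensor:
    """Differentiable log P_LM(x) so gradients reach the GPT2 weights."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    log_probs = torch.log_softmax(model(ids).logits[:, :-1], dim=-1)
    return log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).sum()

def mwer_step(nbest, am_log_probs, word_errors, beta=0.5):
    """One discriminative update: softmax over rescored hypotheses, expected WER, backprop."""
    lm_scores = torch.stack([gpt2_score(h) for h in nbest])      # (n,)
    scores = lm_scores + beta * am_log_probs                     # s_i
    loss = (torch.softmax(scores, dim=-1) * word_errors).sum()   # L_MWER
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```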
Discriminative ASR Rescoring: BERT + MWER
• The MWER loss above can likewise be used to train the 2nd-pass rescoring model discriminatively
• Minimize the rescoring model's expected word error rate: $\mathcal{L}_{MWER}(x, x^*) = \sum_{i=1}^{n} \hat{P}(x_i \mid a)\, \mathcal{E}(x_i, x^*)$, with $\hat{P}(x_i \mid a) = \dfrac{e^{s_i}}{\sum_{j=1}^{n} e^{s_j}}$ the probability of choosing ASR hypothesis $x_i$ after rescoring
New score of the $i$-th ASR hypothesis after rescoring: $s_i = \sum_{t=1}^{T} \log P(x_{i,t} \mid x_{i,\setminus t}) + \beta \log P_{A}(a \mid x_i)$
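For BERT the same MWER step applies; only the scoring function changes to a differentiable pseudo-log-likelihood. A sketch, where the masking loop and helper name are illustrative and the scorer can stand in for `gpt2_score` in the update sketch above:

```python
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

def bert_pll_score(text: str) -> torch.Tensor:
    """Differentiable pseudo-log-likelihood: gradients flow back into BERT."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    total = torch.zeros(())
    for t in range(1, len(ids) - 1):                     # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[t] = tokenizer.mask_token_id
        logits = model(masked.unsqueeze(0)).logits[0, t]
        total = total + torch.log_softmax(logits, dim=-1)[ids[t]]
    return total
```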
Discriminative ASR Rescoring: RescoreBERT
$\mathcal{L}_{MWER}(x, x^*) = \sum_{i=1}^{n} \hat{P}(x_i \mid a)\, \mathcal{E}(x_i, x^*)$, with $\hat{P}(x_i \mid a) = \dfrac{e^{s_i}}{\sum_{j=1}^{n} e^{s_j}}$
New score of the $i$-th ASR hypothesis after rescoring: $s_i = \log P_{LM}(x_i) + \beta \log P_{A}(a \mid x_i)$
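In RescoreBERT-style models the LM term is typically produced by a small feed-forward head on BERT's [CLS] embedding rather than by a token-level likelihood. A minimal sketch, assuming a checkpoint, head sizes, and pooling choice that the slides do not specify:

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class RescoreBERT(nn.Module):
    """Feed-forward head on the [CLS] embedding produces the second-pass score."""
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.head = nn.Sequential(nn.Linear(768, 256), nn.Tanh(), nn.Linear(256, 1))

    def forward(self, input_ids, attention_mask):
        h_cls = self.bert(input_ids=input_ids,
                          attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.head(h_cls).squeeze(-1)              # one scalar score per hypothesis

# Usage: tokenize the n-best list with padding, score all hypotheses at once,
# then feed the scores into the MWER loss shown earlier.
# tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
# batch = tok(nbest, padding=True, return_tensors="pt")
# lm_scores = RescoreBERT()(batch.input_ids, batch.attention_mask)
```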
Discriminative ASR Rescoring: RescoreGPT
$\mathcal{L}_{MWER}(x, x^*) = \sum_{i=1}^{n} \hat{P}(x_i \mid a)\, \mathcal{E}(x_i, x^*)$, with $\hat{P}(x_i \mid a) = \dfrac{e^{s_i}}{\sum_{j=1}^{n} e^{s_j}}$
New score of the $i$-th ASR hypothesis after rescoring: $s_i = \log P_{LM}(x_i) + \beta \log P_{A}(a \mid x_i)$
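An analogous sketch for a GPT2-based scorer that pools the last token's hidden state into a scalar; the pooling choice and head sizes are assumptions rather than the paper's configuration (the attention pooling on the next slide is one alternative):

```python
import torch
import torch.nn as nn
from transformers import GPT2Model

class RescoreGPT(nn.Module):
    """Feed-forward head on GPT2's final hidden state yields the second-pass score."""
    def __init__(self):
        super().__init__()
        self.gpt2 = GPT2Model.from_pretrained("gpt2")
        self.head = nn.Sequential(nn.Linear(768, 256), nn.Tanh(), nn.Linear(256, 1))

    def forward(self, input_ids):
        h = self.gpt2(input_ids=input_ids).last_hidden_state   # (B, T, 768)
        return self.head(h[:, -1]).squeeze(-1)                 # score from the last token
```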
Discriminative ASR Rescoring: Attention Pooling
$\log P_{LM}(x_i) = w^{\top}\,\mathrm{softmax}\!\left(QK^{\top}\right) V$
$Q = W_Q H, \quad K = W_K H, \quad V = W_V H$
$H = [h_1, \ldots, h_T]$: hidden output embeddings; $w, W_Q, W_K, W_V$: learnable weights
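A sketch of attention pooling over the LM's output embeddings; the 1/sqrt(d) scaling and the mean over time before applying $w$ are assumptions made to obtain a well-defined scalar, since the slide only gives the general form:

```python
import torch
import torch.nn as nn

class AttentionPoolingScore(nn.Module):
    """Self-attention over the LM output embeddings H, reduced to a scalar score per hypothesis."""
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w = nn.Linear(d_model, 1, bias=False)
        self.scale = d_model ** 0.5

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, d_model) output embeddings h_1..h_T of the rescoring LM
        q, k, v = self.w_q(hidden), self.w_k(hidden), self.w_v(hidden)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)  # (B, T, T)
        pooled = (attn @ v).mean(dim=1)                                     # (B, d_model)
        return self.w(pooled).squeeze(-1)                                   # (B,)
```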
Dataset and Experimental Setup
• Data: LibriSpeech (1000 hours)
• ASR: Whisper tiny (Transformer, 39M parameters); generates 10-best hypotheses
• Rescoring models: BERT (110M), GPT2 (117M)
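A sketch of producing a 10-best list with Whisper tiny via Hugging Face `transformers` and `datasets`; the dataset split, beam settings, and use of `sequences_scores` as the first-pass score are illustrative assumptions, not the paper's exact pipeline:

```python
import torch
from datasets import load_dataset
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny").eval()

# one LibriSpeech utterance; streaming avoids downloading the full corpus
sample = next(iter(load_dataset("librispeech_asr", "clean", split="test", streaming=True)))
inputs = processor(sample["audio"]["array"], sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        inputs.input_features,
        num_beams=10,
        num_return_sequences=10,     # 10-best list for the rescoring model
        return_dict_in_generate=True,
        output_scores=True,
    )

nbest = processor.batch_decode(out.sequences, skip_special_tokens=True)
am_scores = out.sequences_scores     # per-hypothesis first-pass log score
```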