Attention Weight Smoothing Using Prior Distributions for Transformer-Based End-to-End ASR - Yahoo! JAPAN R&D

CONFERENCE (INTERNATIONAL) Attention Weight Smoothing Using Prior Distributions for Transformer-Based End-to-End ASR

Takashi Maekaku, Yuya Fujita, Yifan Peng (Carnegie Mellon University), Shinji Watanabe (Carnegie Mellon University)

The 23rd Annual Conference of the International Speech Communication Association (INTERSPEECH 2022)

September 19, 2022

Transformer-based encoder-decoder models are widely used for end-to-end automatic speech recognition. However, the self-attention weight matrix has been found to be too peaky in some cases and biased toward its diagonal components. Such an attention weight matrix contains little useful context information, which may result in poor speech recognition performance. We therefore propose two attention weight smoothing methods, based on the hypothesis that an attention weight matrix whose diagonal components are not peaky can capture more context information. The first linearly interpolates the attention weights with a learnable truncated prior distribution. The second uses the attention weights from the previous layer as the prior distribution, given that lower-layer weights tend to be less peaky and less diagonal. Experiments on LibriSpeech and Wall Street Journal show that the proposed approach achieves 2.9% and 7.9% relative improvement, respectively, over a vanilla Transformer model.
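
The two smoothing ideas described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names (`truncated_prior`, `smooth`), the fixed interpolation weight `lam`, and the choice to parameterise the prior as a softmax over logits for relative offsets inside a window are all assumptions made for illustration; the abstract does not specify them. In a real model the logits (and possibly `lam`) would be trained parameters.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; entries equal to -inf get probability 0."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def truncated_prior(seq_len, logits):
    """Build a (seq_len, seq_len) row-stochastic prior.

    ASSUMPTION: the prior depends only on the relative offset j - i and is
    zero outside a window of half-width K, where len(logits) == 2K + 1.
    `logits` stands in for the learnable parameters.
    """
    logits = np.asarray(logits, dtype=float)
    half_width = (len(logits) - 1) // 2
    pos = np.arange(seq_len)
    offsets = pos[None, :] - pos[:, None]            # offsets[i, j] = j - i
    inside = np.abs(offsets) <= half_width           # truncation window
    idx = np.clip(offsets + half_width, 0, len(logits) - 1)
    scores = np.where(inside, logits[idx], -np.inf)  # mask outside the window
    return softmax(scores, axis=-1)


def smooth(attn, prior, lam):
    """Linear interpolation between attention weights and a prior.

    ASSUMPTION: a single scalar weight `lam` in [0, 1] is used.
    Both inputs are row-stochastic, so the result is as well.
    """
    return (1.0 - lam) * attn + lam * prior


# Method 1 (sketch): smooth a fully diagonal attention matrix with a
# truncated prior that is uniform over offsets -1, 0, +1.
peaky_attn = np.eye(5)
prior = truncated_prior(5, np.zeros(3))
smoothed = smooth(peaky_attn, prior, lam=0.2)

# Method 2 (sketch): use the previous layer's attention weights as the prior.
# Random scores stand in for a flatter lower-layer attention matrix.
rng = np.random.default_rng(0)
lower_layer_attn = softmax(rng.normal(size=(5, 5)), axis=-1)
smoothed_from_lower = smooth(peaky_attn, lower_layer_attn, lam=0.2)
```

In both cases the interpolation moves probability mass off the diagonal while keeping each row a valid distribution, which is the effect the hypothesis in the abstract relies on.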

Speech Processing

Paper: Attention Weight Smoothing Using Prior Distributions for Transformer-Based End-to-End ASR (external link)