CONFERENCE (INTERNATIONAL) NON-AUTOREGRESSIVE END-TO-END AUTOMATIC SPEECH RECOGNITION INCORPORATING DOWNSTREAM NATURAL LANGUAGE PROCESSING
Motoi Omachi, Yuya Fujita, Shinji Watanabe (Carnegie Mellon University), Tianzi Wang (Johns Hopkins University)
2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP2022)
April 27, 2022
We propose a fast and accurate end-to-end (E2E) model that performs automatic speech recognition (ASR) and downstream natural language processing (NLP) simultaneously. From speech, the proposed approach predicts a single-aligned sequence of transcriptions and linguistic annotations such as part-of-speech (POS) tags and named entity (NE) tags. We use non-autoregressive (NAR) decoding instead of autoregressive (AR) decoding to reduce execution time, since NAR decoding can output multiple tokens in parallel across time. To predict the single-aligned sequence accurately, we use the connectionist temporal classification (CTC) model with mask-predict, i.e., Mask-CTC. Mask-CTC improves performance in two ways: it jointly trains CTC and a conditional masked language model, and it refines low-confidence output tokens conditioned on the reliable output tokens and the audio embeddings. The proposed method thus performs ASR and the downstream NLP task, i.e., POS or NE tagging, jointly in a NAR manner. Experiments on the Corpus of Spontaneous Japanese and the Spoken Language Understanding Resource Package show that the proposed E2E model predicts transcriptions and linguistic annotations with consistently better performance than vanilla CTC with greedy decoding, and runs 15 to 97 times faster than a Transformer-based AR model.
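The first half of the Mask-CTC idea mentioned in the abstract (take the greedy CTC output, then mask the tokens that have low confidence so they can be re-predicted) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the `<blank>` and `<mask>` symbols, the use of the maximum frame probability as a token's confidence, the 0.6 threshold, and the toy word and `<NOUN>`/`<VERB>` tag tokens standing in for a single-aligned sequence. The refinement step itself, in which a conditional masked language model fills the masked positions using the audio embeddings, is not modeled.

```python
from typing import List, Tuple

BLANK = "<blank>"  # hypothetical CTC blank symbol
MASK = "<mask>"    # hypothetical mask symbol


def ctc_greedy_collapse(frames: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Collapse per-frame argmax tokens the CTC way.

    `frames` holds one (token, probability) pair per frame. Consecutive
    repeats are merged and blanks are dropped. The confidence of a merged
    token is the maximum probability over its run (an assumption).
    """
    collapsed: List[Tuple[str, float]] = []
    prev = None
    for token, prob in frames:
        if token == prev:
            # Same run as the previous frame: update the run's confidence.
            if token != BLANK:
                last_token, last_prob = collapsed[-1]
                collapsed[-1] = (last_token, max(last_prob, prob))
        elif token != BLANK:
            # A new non-blank run starts here.
            collapsed.append((token, prob))
        prev = token
    return collapsed


def mask_low_confidence(tokens: List[Tuple[str, float]], threshold: float) -> List[str]:
    """Keep tokens whose confidence reaches `threshold`; mask the rest."""
    return [tok if prob >= threshold else MASK for tok, prob in tokens]


# Toy frame-level output: words interleaved with hypothetical POS tags.
frames = [
    ("dogs", 0.90), ("dogs", 0.80), (BLANK, 0.99),
    ("<NOUN>", 0.40), ("<NOUN>", 0.50), (BLANK, 0.90),
    ("bark", 0.95), ("<VERB>", 0.70),
]
collapsed = ctc_greedy_collapse(frames)
masked = mask_low_confidence(collapsed, threshold=0.6)
print(masked)
```

In this toy input only the `<NOUN>` tag falls below the threshold, so it is the only position that a refinement model would need to re-predict; all other tokens are kept from the single parallel CTC pass, which is where the speed advantage over token-by-token AR decoding comes from.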