CONFERENCE (INTERNATIONAL)
Simultaneous Detection and Localization of a Wake-Up Word using Multi-Task Learning of the Duration and Endpoint
Takashi Maekaku, Yusuke Kida, Akihiko Sugiyama
The 20th Annual Conference of the International Speech Communication Association (INTERSPEECH 2019)
September 19, 2019
This paper proposes a novel method for simultaneous detection and localization of a wake-up word using multi-task learning of its duration and endpoint. The onset of the wake-up word is estimated by going back in time from the estimated endpoint by the estimated duration of the wake-up word. Accurate endpoint estimation is achieved by training the network to fire only at the endpoint, rather than over the entire wake-up word. An accurate endpoint naturally leads to an accurate onset when it serves as the reference point from which the estimated duration, which reflects acoustic information across the entire wake-up word, is subtracted. Experimental results with real-environment data show relative accuracy improvements of 41% for onset estimation and 38% for endpoint estimation compared with a baseline method.
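To make the relationship between the two tasks concrete, here is a minimal Python sketch; it is not the authors' implementation. It assumes a network with two frame-level heads: an endpoint posterior per frame and a duration estimate (in frames) per frame. The helper names (`make_targets`, `multitask_loss`, `detect_and_localize`), the binary cross-entropy plus weighted L1 loss, the 0.5 detection threshold, and taking the peak-posterior frame within a buffered window as the endpoint are all illustrative assumptions that the abstract does not specify.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


def make_targets(num_frames: int, onset: int, endpoint: int) -> Tuple[List[float], float]:
    """Hypothetical labeling: the endpoint target is 1 only at the endpoint frame
    (not over the whole word); the duration target is the word length in frames."""
    if not 0 <= onset < endpoint < num_frames:
        raise ValueError("require 0 <= onset < endpoint < num_frames")
    endpoint_target = [0.0] * num_frames
    endpoint_target[endpoint] = 1.0
    return endpoint_target, float(endpoint - onset)


def multitask_loss(endpoint_post: Sequence[float], endpoint_target: Sequence[float],
                   duration_pred: float, duration_target: float,
                   weight: float = 1.0, eps: float = 1e-7) -> float:
    """Assumed loss: frame-averaged BCE on the endpoint head plus a weighted L1
    error on the duration predicted at the endpoint frame."""
    bce = -sum(y * math.log(max(p, eps)) + (1.0 - y) * math.log(max(1.0 - p, eps))
               for p, y in zip(endpoint_post, endpoint_target)) / len(endpoint_post)
    return bce + weight * abs(duration_pred - duration_target)


@dataclass
class Detection:
    onset_frame: int
    endpoint_frame: int


def detect_and_localize(endpoint_post: Sequence[float], duration_est: Sequence[float],
                        threshold: float = 0.5) -> Optional[Detection]:
    """Detect the word if the peak endpoint posterior exceeds the threshold, then
    place the onset by going back from the endpoint by the estimated duration."""
    if not endpoint_post:
        return None
    endpoint = max(range(len(endpoint_post)), key=lambda t: endpoint_post[t])
    if endpoint_post[endpoint] < threshold:
        return None  # no wake-up word detected in this window
    duration = round(duration_est[endpoint])  # duration read at the endpoint frame
    return Detection(onset_frame=max(0, endpoint - duration), endpoint_frame=endpoint)
```

For example, with endpoint posteriors `[0.1, 0.2, 0.9, 0.3]` and duration estimates `[0.0, 0.0, 2.4, 0.0]`, the endpoint is frame 2 and the onset is frame 0. Reading the duration only at the detected endpoint mirrors the abstract's point that onset accuracy inherits from endpoint accuracy.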