Conference (International) Simultaneous Detection and Localization of a Wake-Up Word using Multi-Task Learning of the Duration and Endpoint
Takashi Maekaku, Yusuke Kida, Akihiko Sugiyama
The 20th Annual Conference of the International Speech Communication Association (INTERSPEECH 2019)
This paper proposes a novel method for simultaneous detection and localization of a wake-up word using multi-task learning of the duration and endpoint. The onset of the wake-up word is estimated by going back in time from the estimated endpoint by the estimated duration of the wake-up word. Accurate endpoint estimation is achieved by training the network to fire only at the endpoint rather than over the entire wake-up word. The accurate endpoint naturally leads to an accurate onset, because the onset is calculated from it using an estimated duration that reflects the acoustic information over the entire wake-up word. Experimental results with real-environment data show relative accuracy improvements of 41% for onset estimation and 38% for endpoint estimation over a baseline method.
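
The onset calculation described in the abstract (endpoint minus estimated duration) can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `localize_wake_word`, the per-frame score and duration inputs, the peak-picking rule, and the 0.5 threshold are all assumptions made for illustration; the abstract does not specify how the network outputs are decoded.

```python
from typing import Optional, Sequence, Tuple


def localize_wake_word(
    endpoint_scores: Sequence[float],
    duration_frames: Sequence[float],
    threshold: float = 0.5,  # hypothetical detection threshold
) -> Optional[Tuple[int, int]]:
    """Return (onset_frame, endpoint_frame), or None if nothing is detected.

    endpoint_scores[t]: assumed per-frame score that frame t is the endpoint.
    duration_frames[t]: assumed per-frame estimate of the wake-up word
                        duration, in frames, as seen from frame t.
    """
    if not endpoint_scores or len(endpoint_scores) != len(duration_frames):
        return None

    # Assumed decoding rule: take the frame with the highest endpoint score.
    endpoint = max(range(len(endpoint_scores)), key=lambda t: endpoint_scores[t])
    if endpoint_scores[endpoint] < threshold:
        return None  # no wake-up word detected

    # Onset = endpoint minus the estimated duration, clamped to frame 0.
    onset = max(0, endpoint - int(round(duration_frames[endpoint])))
    return onset, endpoint


scores = [0.0, 0.1, 0.05, 0.2, 0.9, 0.3]
durations = [0.0, 0.0, 0.0, 0.0, 3.0, 0.0]
print(localize_wake_word(scores, durations))  # → (1, 4)
```

In this toy input the endpoint score peaks at frame 4 and the duration estimate there is 3 frames, so the onset is placed at frame 1. Any error in the endpoint carries over directly to the onset, which is why an accurate endpoint matters for this calculation.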