Conference (International): Robust DNN-based VAD augmented with phone entropy based rejection of background speech
Yuya Fujita, Ken-ichi Iso
The 17th Annual Conference of the International Speech Communication Association (InterSpeech 2016)
We propose a DNN-based voice activity detector (VAD) augmented with entropy-based frame rejection. A DNN-based VAD classifies each frame as speech or non-speech and achieves significantly higher VAD performance than conventional statistical model-based VAD. We observed that many of the remaining errors are false alarms caused by background human speech, such as TV / radio or surrounding people's conversations. To reject such background speech frames, we introduce an entropy-based confidence measure computed from the phone posterior probabilities output by a DNN-based acoustic model. Compared to the target speaker's voice, background speech tends to have relatively unclear pronunciation or is contaminated by other types of noise, so its entropy is larger than that of audio containing only the target speaker's voice. Combining the DNN-based VAD with the entropy criterion, we reject frames classified as speech by the DNN-based VAD whose entropy exceeds a threshold. We evaluated the proposed approach and confirmed a greater than 10% reduction in Sentence Error Rate.
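The rejection rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy posterior values, and the threshold of 1.0 nat are all hypothetical; the paper specifies only that frames marked as speech by the VAD are rejected when the entropy of their phone posteriors exceeds a threshold.

```python
import numpy as np

def phone_entropy(posteriors):
    """Per-frame entropy (in nats) of phone posterior distributions.

    posteriors: (T, P) array; each row is the phone posterior vector
    for one frame, as output by a DNN acoustic model (rows sum to 1).
    """
    eps = 1e-12  # avoid log(0)
    return -np.sum(posteriors * np.log(posteriors + eps), axis=1)

def vad_with_entropy_rejection(vad_speech, posteriors, threshold):
    """Keep only frames that the DNN-based VAD marks as speech AND
    whose phone entropy is at or below the threshold; high-entropy
    speech frames (likely background speech) are rejected."""
    entropy = phone_entropy(posteriors)
    return vad_speech & (entropy <= threshold)

# Toy example: 3 frames, 4 phones (hypothetical values).
post = np.array([
    [0.90, 0.05, 0.03, 0.02],  # clear pronunciation -> low entropy
    [0.25, 0.25, 0.25, 0.25],  # uniform posterior -> max entropy (ln 4)
    [0.70, 0.15, 0.10, 0.05],  # moderately peaked -> entropy < 1 nat
])
vad = np.array([True, True, False])
print(vad_with_entropy_rejection(vad, post, threshold=1.0))
# -> [ True False False]: frame 2 is rejected by entropy,
#    frame 3 was already non-speech according to the VAD.
```

In practice the threshold would be tuned on development data, trading off false alarms from background speech against rejection of genuinely unclear target speech.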