Conference (International) A Visually-grounded First-person Dialogue Dataset with Verbal and Non-verbal Responses
Hisashi Kamezawa (The University of Tokyo), Noriki Nishida (RIKEN Center for Advanced Intelligence Project (AIP)), Nobuyuki Shimizu, Takashi Miyazaki, Hideki Nakayama (The University of Tokyo)
The 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)
In real-world dialogue, first-person visual information about where the other speakers are and what they are paying attention to is crucial to understand their intentions. Non-verbal responses also play an important role in social interactions. In this paper, we propose a visually-grounded first-person dialogue (VFD) dataset with verbal and non-verbal responses. The VFD dataset provides manually annotated (1) first-person images of agents, (2) utterances of human speakers, (3) eye-gaze locations of the speakers, and (4) the agents’ verbal and non-verbal responses. We present experimental results obtained using the proposed VFD dataset and recent neural network models (e.g., BERT, ResNet). The results demonstrate that first-person vision helps neural network models correctly understand human intentions, and the production of non-verbal responses is a challenging task like that of verbal responses. Our dataset is publicly available.
Software: https://github.com/yahoojapan/VFD-Dataset (external site)