JOURNAL (DOMESTIC) Improving Conversation Task with Visual Scene Dataset

Abdurrisyad Fikri (Tokyo Institute of Technology), Hiep Le (LINE Fukuoka Corporation), Takashi Miyazaki, Manabu Okumura (Tokyo Institute of Technology), Nobuyuki Shimizu

Journal of Natural Language Processing (JNLP)

March 15, 2022

To build good conversation agents, one would assume that an accurate conversation context is required. Since using images as conversation contexts has proven effective, we argue that a conversation scene that includes the speakers could provide more information about the context. We constructed a visual conversation scene dataset (VCSD) that provides scenic images corresponding to conversations. Each entry in the dataset combines (1) a conversation scene image (third-person view), (2) the corresponding first utterance and its response, and (3) the corresponding speaker, respondent, and topic object. In our experiments on the response-selection task, we first examined BERT (text only) as a baseline. While BERT performed well on general conversations, where a response continues from the previous utterance, it failed in cases where visual information was necessary to understand the context. Our error analysis found that conversations requiring visual context can be categorized into three types: visual question-answering, image-referring response, and scene understanding. To make full use of conversation scene images and their focused parts, i.e., the speaker, respondent, and topic object, we proposed a model that receives text and multiple image features as inputs. Our model captures this information and achieved 91% accuracy.
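The dataset layout and the response-selection evaluation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `SceneExample`, `select_response`, `selection_accuracy`, and `word_overlap_score` are hypothetical, and representing the speaker, respondent, and topic object as bounding boxes is an assumption, since the abstract does not state how these regions are encoded. The toy scorer uses text only and stands in for a real model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Assumption: each focused region is a pixel box (x_min, y_min, x_max, y_max).
Box = Tuple[int, int, int, int]


@dataclass
class SceneExample:
    """One hypothetical dataset entry: scene image, utterance pair, regions."""
    image_path: str          # third-person view of the conversation scene
    utterance: str           # first utterance
    response: str            # ground-truth response
    speaker_box: Box         # region of the speaker (assumed encoding)
    respondent_box: Box      # region of the respondent (assumed encoding)
    topic_object_box: Box    # region of the topic object (assumed encoding)


Scorer = Callable[[SceneExample, str], float]


def select_response(example: SceneExample,
                    candidates: Sequence[str],
                    score: Scorer) -> int:
    """Return the index of the highest-scoring candidate response."""
    scores = [score(example, candidate) for candidate in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


def selection_accuracy(examples: Sequence[SceneExample],
                       candidate_sets: Sequence[Sequence[str]],
                       gold_indices: Sequence[int],
                       score: Scorer) -> float:
    """Fraction of examples where the selected candidate is the gold one."""
    correct = sum(
        select_response(example, candidates, score) == gold
        for example, candidates, gold
        in zip(examples, candidate_sets, gold_indices)
    )
    return correct / len(examples)


def word_overlap_score(example: SceneExample, candidate: str) -> float:
    """Toy text-only scorer: share of candidate words found in the utterance."""
    utterance_words = set(example.utterance.lower().split())
    candidate_words = set(candidate.lower().split())
    return len(utterance_words & candidate_words) / max(len(candidate_words), 1)


# Usage with made-up data (not taken from the dataset).
example = SceneExample(
    image_path="scene_0001.jpg",
    utterance="do you like this cake",
    response="yes this cake is great",
    speaker_box=(10, 20, 110, 220),
    respondent_box=(150, 25, 250, 230),
    topic_object_box=(120, 180, 160, 210),
)
candidates: List[str] = ["yes this cake is great", "the train is late"]
accuracy = selection_accuracy([example], [candidates], [0], word_overlap_score)
```

A model that uses the visual context would replace `word_overlap_score` with a scorer that also reads the image and the three regions. The selection and accuracy functions would stay the same.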

Paper: Improving Conversation Task with Visual Scene Dataset