Workshop (Domestic) A Comparative Study of Deep Learning Approaches for Visual Question Classification in Community QA
Hsin-Wen Liu (Waseda Univ.), Avikalp Srivastava (CMU), Sumio Fujita, Toru Shimizu, Riku Togashi, Tetsuya Sakai (Waseda Univ.)
The 11th Forum on Web and Databases / IPSJ IFAT Study Group (WebDB Forum 2018)
Tasks that take images as well as text as input, such as Visual Question Answering (VQA), have received growing attention and become an active research field in recent years. In this study, we consider the task of Visual Question Classification (VQC), in which a question containing both text and an image must be classified into one of a set of predefined categories for a Community Question Answering (CQA) site. Our experiments use real data from Yahoo Chiebukuro, a major Japanese CQA site. To our knowledge, our work is the first to systematically compare different deep learning approaches on VQC tasks for CQA. Our study shows that the model that uses HieText for text representation, ResNet50 for image representation, and Multimodal Compact Bilinear pooling for combining the two representations outperforms the other models in the VQC task by a statistically significant margin.
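As a rough illustration of the fusion step named in the abstract, the following is a minimal sketch of Multimodal Compact Bilinear pooling in NumPy. It is not the authors' implementation. The random vectors merely stand in for HieText and ResNet50 features, and the dimensions (`n_text`, `n_image`, `d`), the helper names `count_sketch` and `mcb_pool`, and the seed are all assumptions chosen for readability. The idea is that each feature vector is projected with a count sketch, and the two sketches are combined by circular convolution, computed as an element-wise product in the frequency domain.

```python
import numpy as np


def count_sketch(x, h, s, d):
    """Count sketch projection: out[h[i]] += s[i] * x[i]."""
    out = np.zeros(d)
    np.add.at(out, h, s * x)  # accumulates correctly when indices in h repeat
    return out


def mcb_pool(text_vec, image_vec, h_t, s_t, h_i, s_i, d):
    """Fuse two vectors by circularly convolving their count sketches."""
    sk_t = count_sketch(text_vec, h_t, s_t, d)
    sk_i = count_sketch(image_vec, h_i, s_i, d)
    # Circular convolution equals an element-wise product in the FFT domain.
    return np.fft.irfft(np.fft.rfft(sk_t) * np.fft.rfft(sk_i), n=d)


# Toy sizes (assumptions); real text and image features would be much larger.
rng = np.random.default_rng(0)
n_text, n_image, d = 8, 6, 16

# Random stand-ins for the text and image representations.
text_vec = rng.standard_normal(n_text)
image_vec = rng.standard_normal(n_image)

# Fixed random hash indices and signs, sampled once per modality.
h_t = rng.integers(0, d, size=n_text)
s_t = rng.choice([-1.0, 1.0], size=n_text)
h_i = rng.integers(0, d, size=n_image)
s_i = rng.choice([-1.0, 1.0], size=n_image)

fused = mcb_pool(text_vec, image_vec, h_t, s_t, h_i, s_i, d)
```

The resulting `fused` vector has length `d` and equals a count sketch of the full outer product of the two inputs, which is why this approach approximates bilinear pooling without materializing the `n_text * n_image` interaction matrix. In a classifier, a vector like this would typically be passed to further layers that predict the category.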