Learnings from the Limitations of Visual Question Answering Models: A Critical Analysis
Visual Question Answering (VQA) is the task of answering a natural-language question about a given image. VQA algorithms require an understanding of both the image and the semantics of the question. The task therefore draws on both computer vision (CV) and natural language processing (NLP). Visual and question features are merged to generate an answer. In this paper, we discuss the limitations of a few state-of-the-art VQA models and analyze their failure cases. This analysis suggests directions for future research aimed at addressing these limitations and improving the accuracy of VQA models.
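To make the feature-merging step concrete, the following is a minimal sketch and not the method of any particular model discussed in this paper. It assumes the image and the question have already been encoded as equal-length feature vectors, fuses them with an element-wise product (one common choice among several), and scores a fixed set of candidate answers with a linear layer. The function names `fuse` and `predict_answer`, the weights, and the toy feature values are all hypothetical.

```python
import math


def fuse(image_feat, question_feat):
    """Merge two equal-length feature vectors with an element-wise product."""
    if len(image_feat) != len(question_feat):
        raise ValueError("feature vectors must have the same length")
    return [v * q for v, q in zip(image_feat, question_feat)]


def softmax(scores):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_answer(image_feat, question_feat, weights, answers):
    """Score each candidate answer from the fused features; return the best one.

    `weights` holds one row per candidate answer (a hypothetical linear classifier).
    """
    fused = fuse(image_feat, question_feat)
    scores = [sum(w * f for w, f in zip(row, fused)) for row in weights]
    probs = softmax(scores)
    best = max(range(len(answers)), key=lambda i: probs[i])
    return answers[best], probs


# Toy values chosen for illustration only (assumptions, not real model outputs).
image_feat = [0.9, 0.1, 0.4]
question_feat = [1.0, 0.5, 0.0]
weights = [[2.0, 0.0, 0.0],
           [0.0, 2.0, 0.0]]
answers = ["red", "blue"]

prediction, probs = predict_answer(image_feat, question_feat, weights, answers)
print(prediction, probs)
```

In a real system the two feature vectors would typically come from learned image and text encoders, and the fusion and classifier would be trained jointly. The element-wise product here only illustrates where the two modalities meet.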