Maximizing Joint Probability in Visual Question Answering Models
Abstract
Visual Question Answering (VQA) is an important task for human-like comprehension, in which a computer must produce an effective answer to a question about a given picture. This article proposes a novel VQA model that derives both low-level and high-level semantics by exploiting intermediate CNN layers. The model, Hierarchical Feature Network (HF-Net), combines each hierarchical feature with an attention map and multimodal pooling. This combination occurs at the answer reasoning stage, so that both high-level and low-level semantics contribute to the answer. Qualitative experiments demonstrate that HF-Net produces superior attention regions. It also addresses the anomaly that arises in a state-of-the-art model when questions are rephrased.