Multi-Faceted Attention for Video Captioning
Abstract
Video captioning has attracted increasing interest for its ability to improve accessibility and to support information retrieval. While existing methods rely on various kinds of visual features and semantic attributes, they do not exploit these cues to their full extent. We propose a unified and extensible framework that leverages multiple visual and semantic features jointly. Our architecture is built on LSTMs with two distinct attention layers: it automatically selects the most salient visual or textual features, then feeds the aggregated representations into the sentence-generation component through learnable scaling operations. Experimental results on the challenging MSVD and MSR-VTT datasets demonstrate that our architecture outperforms previous work and remains robust even in the presence of noisy inputs.
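The two-level attention described above can be illustrated with a minimal sketch: a first attention step summarizes each feature modality into a single context vector, and a second step attends across modalities. All names, dimensions, and the simple dot-product scoring below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(items, query, W):
    """Bilinear attention: score each item against the query,
    return the attention-weighted sum of the items."""
    scores = items @ W @ query          # (n,)
    weights = softmax(scores)           # attention distribution over items
    return weights @ items              # (d,) weighted sum

d = 8  # shared embedding size (illustrative)

# Hypothetical per-modality feature sets, already projected to dimension d:
frame_feats  = rng.normal(size=(10, d))  # e.g. CNN frame features
motion_feats = rng.normal(size=(6, d))   # e.g. motion/clip features
attr_feats   = rng.normal(size=(5, d))   # e.g. semantic attribute embeddings

h = rng.normal(size=d)        # decoder LSTM hidden state, used as the query
W = rng.normal(size=(d, d))   # shared scoring matrix (one per layer in practice)

# Level 1: attend within each modality to get one context vector per modality.
modality_ctx = np.stack(
    [attend(f, h, W) for f in (frame_feats, motion_feats, attr_feats)]
)  # (3, d)

# Level 2: attend across modalities to pick the most relevant cues.
context = attend(modality_ctx, h, W)  # fused (d,) context fed to the decoder
print(context.shape)
```

In a full model, the query `h` would be updated at every decoding step and the fused context concatenated with the word embedding before the LSTM produces the next token.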