Enhanced Video Captioning through Residual and Bottleneck CNNs with LSTM Integration
Abstract
This study investigates methods for video captioning and highlights the novel contributions of the proposed model. Existing approaches such as Vid2Seq, Positive-Augmented Contrastive Learning, GL-RG, and TextKG apply diverse techniques to extract and interpret video features, achieving notable performance on datasets such as MSR-VTT and MSVD. The proposed model instead employs a CNN encoder that combines residual and bottleneck blocks to capture spatial and temporal features, paired with an LSTM-based RNN that handles sequential data and long-range dependencies. Evaluated on the MSR-VTT, MPII Cooking 2, and M-VAD datasets, the model achieves a peak BLEU score of 51, demonstrating its ability to generate high-quality video descriptions. The model retains a streamlined yet powerful architecture, marking a significant advancement in video captioning.
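The abstract describes an encoder built from residual and bottleneck convolutional blocks feeding an LSTM decoder. Below is a minimal PyTorch sketch of those three components; the class names, channel widths, hidden size, vocabulary size, and the frame-feature pooling assumed here are illustrative, not the paper's actual configuration.

```python
# Hypothetical sketch of the residual/bottleneck encoder blocks and LSTM decoder.
# Hyperparameters (channels, reduction, hidden size, vocab size) are assumptions.
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with an identity skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # residual (skip) connection


class BottleneckBlock(nn.Module):
    """1x1 -> 3x3 -> 1x1 bottleneck that narrows then restores channel width."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.layers = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1), nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.layers(x) + x)  # bottleneck output plus skip


class CaptionLSTM(nn.Module):
    """LSTM decoder: a pooled video feature initialises the state, then words are decoded."""
    def __init__(self, feat_dim, vocab_size, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.init_proj = nn.Linear(feat_dim, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, video_feat, captions):
        # video_feat: (batch, feat_dim); captions: (batch, seq_len) of token ids.
        h0 = torch.tanh(self.init_proj(video_feat)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions)
        out, _ = self.lstm(emb, (h0, c0))
        return self.out(out)  # (batch, seq_len, vocab_size) word logits
```

In this sketch the encoder blocks operate on per-frame feature maps, which would then be pooled into a single vector before decoding; how the paper actually aggregates temporal information is not specified in the abstract.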
