Enhanced Video Captioning through Residual and Bottleneck CNNs with LSTM Integration


Amruta Rajendra Chougule, Shankar D. Chavan

Abstract

This study reviews existing video-captioning methods and presents a new model. Prior approaches such as Vid2Seq, Positive-Augmented Contrastive Learning, GL-RG, and TextKG use diverse techniques to extract and interpret video features, achieving notable results on datasets such as MSR-VTT and MSVD. The proposed model instead employs a CNN encoder that combines residual and bottleneck blocks to capture spatial and temporal features, paired with an LSTM-based RNN that models sequential data and long-range dependencies. Evaluated on the MSR-VTT, MPII Cooking 2, and M-VAD datasets, the model achieves a peak BLEU score of 51, demonstrating its ability to generate high-quality video descriptions while maintaining a streamlined yet powerful architecture, a meaningful advance in video captioning.
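The encoder-decoder pairing the abstract describes can be illustrated with a minimal PyTorch sketch: a frame-level CNN built from one residual and one bottleneck block pools each frame into a feature vector, and the averaged video feature initializes an LSTM that emits caption tokens. All module names, layer sizes, and block counts below are illustrative assumptions, not the authors' published configuration.

```python
# Hypothetical sketch of the residual/bottleneck CNN encoder + LSTM decoder
# described in the abstract. Layer sizes and block counts are assumptions.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Standard two-conv residual block: out = ReLU(x + F(x))."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
    def forward(self, x):
        return torch.relu(x + self.body(x))

class BottleneckBlock(nn.Module):
    """1x1 -> 3x3 -> 1x1 bottleneck with a skip connection."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1),
            nn.BatchNorm2d(channels),
        )
    def forward(self, x):
        return torch.relu(x + self.body(x))

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Frame-level CNN encoder: stem, then residual + bottleneck blocks.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            ResidualBlock(64),
            BottleneckBlock(64),
            nn.AdaptiveAvgPool2d(1),      # -> (B*T, 64, 1, 1)
        )
        self.proj = nn.Linear(64, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # LSTM decoder models long-range dependencies in the caption.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frames, captions):
        # frames: (B, T, 3, H, W) video clips; captions: (B, L) token ids
        B, T = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).flatten(1)  # (B*T, 64)
        video = self.proj(feats.view(B, T, -1).mean(dim=1))    # (B, hidden)
        h0 = video.unsqueeze(0)            # video feature seeds the LSTM
        c0 = torch.zeros_like(h0)
        hidden, _ = self.lstm(self.embed(captions), (h0, c0))
        return self.out(hidden)            # (B, L, vocab) token logits

model = CaptionModel(vocab_size=10000)
logits = model(torch.randn(2, 8, 3, 64, 64), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```

In practice the encoder would be far deeper and the decoder would be trained with teacher forcing and decoded with beam search; the sketch only shows how pooled residual/bottleneck features can seed the LSTM state.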


DOI: https://doi.org/10.52783/pst.1728
