Enhanced Video Captioning through Residual and Bottleneck CNNs with LSTM Integration


Amruta Rajendra Chougule, Shankar D. Chavan

Abstract

This study reviews existing video-captioning methods and presents a new model. Prior approaches such as Vid2Seq, Positive-Augmented Contrastive Learning, GL-RG, and TextKG use diverse techniques to extract and interpret video features, achieving notable results on datasets such as MSR-VTT and MSVD. The proposed model instead employs a CNN encoder that combines residual and bottleneck blocks to capture spatial and temporal features, paired with an LSTM-based RNN that models sequential data and long-range dependencies. Evaluated on the MSR-VTT, MPII Cooking 2, and M-VAD datasets, the model achieves a peak BLEU score of 51, demonstrating its ability to generate high-quality video descriptions while maintaining a streamlined yet powerful architecture, a meaningful advance in video captioning.
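The encoder-decoder pairing the abstract describes can be illustrated with a minimal PyTorch sketch: a frame-level CNN built from one residual and one bottleneck block pools each frame into a feature vector, and the averaged video feature initializes an LSTM that emits caption tokens. All module names, layer sizes, and block counts below are illustrative assumptions, not the authors' published configuration.

```python
# Hypothetical sketch of the residual/bottleneck CNN encoder + LSTM decoder
# described in the abstract. Layer sizes and block counts are assumptions.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Standard two-conv residual block: out = ReLU(x + F(x))."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
    def forward(self, x):
        return torch.relu(x + self.body(x))

class BottleneckBlock(nn.Module):
    """1x1 -> 3x3 -> 1x1 bottleneck with a skip connection."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1),
            nn.BatchNorm2d(channels),
        )
    def forward(self, x):
        return torch.relu(x + self.body(x))

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Frame-level CNN encoder: stem, then residual + bottleneck blocks.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            ResidualBlock(64),
            BottleneckBlock(64),
            nn.AdaptiveAvgPool2d(1),      # -> (B*T, 64, 1, 1)
        )
        self.proj = nn.Linear(64, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # LSTM decoder models long-range dependencies in the caption.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frames, captions):
        # frames: (B, T, 3, H, W) video clips; captions: (B, L) token ids
        B, T = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).flatten(1)  # (B*T, 64)
        video = self.proj(feats.view(B, T, -1).mean(dim=1))    # (B, hidden)
        h0 = video.unsqueeze(0)            # video feature seeds the LSTM
        c0 = torch.zeros_like(h0)
        hidden, _ = self.lstm(self.embed(captions), (h0, c0))
        return self.out(hidden)            # (B, L, vocab) token logits

model = CaptionModel(vocab_size=10000)
logits = model(torch.randn(2, 8, 3, 64, 64), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```

In practice the encoder would be far deeper and the decoder would be trained with teacher forcing and decoded with beam search; the sketch only shows how pooled residual/bottleneck features can seed the LSTM state.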


DOI: https://doi.org/10.52783/pst.1728
