Sparse Spatiotemporal Feature Learning for Video-Based Hand Gesture Recognition
Abstract
Sign language and hand gesture recognition play a crucial role in enabling natural and intuitive communication between humans and machines, especially in assisting individuals with hearing and speech impairments. This paper presents a novel Sparse Motion Sequence Extraction Network (SMSE-Net) for efficient and accurate gesture recognition from video sequences. The proposed framework integrates a sparse image-wise feature extraction layer that identifies salient motion information with a hybrid sequence-wise modeling layer that captures temporal dependencies across consecutive frames. By selectively focusing on informative motion patterns and suppressing redundant data, SMSE-Net significantly improves recognition performance while reducing computational overhead. Extensive experimental evaluations demonstrate that the proposed approach outperforms existing methods such as CNN, R-CNN, YOLOv3, and ResNet across multiple performance metrics, including accuracy, precision, recall, and F1-score. The results confirm the robustness, efficiency, and real-time applicability of the proposed SMSE-Net framework.
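The abstract does not specify how the sparse image-wise layer selects salient frames; one common realization of the idea is frame differencing with a motion-energy threshold. The sketch below is an illustrative assumption, not the paper's implementation: all function names and the threshold value are hypothetical.

```python
# Hypothetical sketch of sparse motion sequence extraction: drop frames
# whose mean absolute difference from the last kept frame is small
# (i.e. redundant), keeping only salient motion for sequence modeling.

def motion_energy(prev, curr):
    """Mean absolute per-pixel difference between two grayscale frames."""
    return sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)

def extract_sparse_sequence(frames, threshold=10.0):
    """Keep the first frame plus frames showing significant motion."""
    if not frames:
        return []
    kept = [frames[0]]
    for frame in frames[1:]:
        if motion_energy(kept[-1], frame) >= threshold:
            kept.append(frame)
    return kept

# Toy example: three near-identical frames, then a moving hand region.
static = [0] * 16
moved = [0] * 8 + [200] * 8
frames = [static, static, static, moved]
print(len(extract_sparse_sequence(frames)))  # → 2 (redundant frames dropped)
```

The kept subsequence would then be passed to the sequence-wise modeling layer, so temporal modeling cost scales with the number of salient frames rather than the full video length.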
