Hybrid Vision Transformer Architectures with CNN Blocks for Multi-Label Chest Disease Classification


Rajendra D. Bhosale, D. M. Yadav

Abstract

This study presents a novel investigation into Vision Transformer (ViT)-based hybrid architectures for multi-label chest disease classification using the CXR-14 dataset. Traditional Convolutional Neural Networks (CNNs), though effective at local feature extraction, often struggle to capture global contextual dependencies. To address this limitation, three ViT-integrated models are proposed by embedding ViT blocks within standard CNN structures: Residual ViT, Bottleneck ViT, and MBConv-SE ViT. Each model replaces the conventional 3×3 convolution units within its respective block with self-attention to obtain enhanced feature representations. These hybrid architectures combine the inductive bias of CNNs with the global reasoning capabilities of Transformers, improving classification accuracy and interpretability. The models are evaluated against a comprehensive set of baseline methods, including attention-guided, region-guided, and semantic-guided models. Experimental results demonstrate that the proposed MBConv-SE ViT model outperforms existing approaches across multiple disease categories, highlighting the advantages of combining efficient convolutions, attention recalibration, and global context modeling. This work establishes a robust framework for designing transformer-augmented CNNs and shows their effectiveness in high-resolution, multi-label medical image analysis tasks such as automated chest X-ray diagnosis.
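To make the substitution concrete, the sketch below shows one way the MBConv-SE ViT idea could be realized in PyTorch: an inverted-residual (MBConv-style) block with squeeze-and-excitation in which the usual 3×3 depthwise convolution is replaced by multi-head self-attention over the flattened spatial positions. This is a minimal illustrative sketch, not the authors' implementation; the class names (MBConvSEViT, SqueezeExcite) and hyperparameters (expansion ratio, head count) are assumptions for demonstration.

```python
import torch
import torch.nn as nn


class SqueezeExcite(nn.Module):
    """Channel recalibration: global pooling followed by a two-layer gate."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.SiLU(),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)


class MBConvSEViT(nn.Module):
    """MBConv-SE block whose 3x3 depthwise conv is swapped for self-attention
    (an illustrative reading of the block described in the abstract)."""

    def __init__(self, channels, expansion=4, heads=4):
        super().__init__()
        hidden = channels * expansion
        # 1x1 expansion, as in a standard inverted-residual block.
        self.expand = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.SiLU(),
        )
        # Self-attention over the H*W positions replaces the 3x3 conv,
        # giving every position a global receptive field.
        self.norm = nn.LayerNorm(hidden)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.se = SqueezeExcite(hidden)
        # 1x1 projection back to the block's input width.
        self.project = nn.Sequential(
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        y = self.expand(x)
        # Flatten the feature map to a (B, H*W, C) token sequence.
        tokens = y.flatten(2).transpose(1, 2)
        tokens = self.norm(tokens)
        tokens, _ = self.attn(tokens, tokens, tokens)
        y = tokens.transpose(1, 2).reshape(b, -1, h, w)
        y = self.se(y)
        return x + self.project(y)  # residual connection


if __name__ == "__main__":
    block = MBConvSEViT(channels=64)
    out = block(torch.randn(2, 64, 14, 14))
    print(out.shape)  # torch.Size([2, 64, 14, 14])
```

In this reading, the Residual ViT and Bottleneck ViT variants would follow the same pattern, substituting attention for the 3×3 convolution inside a plain residual block and a 1×1/3×3/1×1 bottleneck block, respectively.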


DOI: https://doi.org/10.52783/pst.1729
