Transformer-Based Audio Classification: Adapting Sequence-Classification Techniques from NLP
Abstract
Audio classification plays an important role in fields such as music analysis and speech recognition. A key step is feature extraction from the audio signal, and the most widely used features are MFCCs and Mel-spectrograms, typically rendered as spectrogram representations for classification. Researchers have applied machine learning and deep learning techniques to classify such spectrograms, but these approaches can carry a high computational cost. To reduce this cost, this work explores a more direct approach inspired by sequence classification in NLP. This paper proposes a Transformer-based model for audio classification that uses MFCCs as input features. The proposed model is benchmarked on the Speech Commands v0.02, UrbanSound8k, and ESC-50 datasets and shows strong performance, with the highest accuracy of 95.2% obtained when the model is trained on UrbanSound8k. With only 127,544 parameters in total, the model is lightweight yet highly efficient at audio classification. This work thus offers an efficient, computationally inexpensive solution for audio classification that can benefit practitioners in machine learning and data science.
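
As a rough illustration of the pipeline the abstract describes, the sketch below feeds MFCC features of a short audio clip into a small Transformer encoder and classifies the pooled representation. It is a minimal sketch under stated assumptions: the use of PyTorch and torchaudio, the layer sizes, mean-pooling over time, and the 35-class output (matching Speech Commands v0.02) are illustrative choices, not the paper's exact architecture or parameter count.

```python
# Illustrative sketch only: MFCC frames treated as a token sequence for a small
# Transformer encoder. Hyperparameters are assumptions, not the paper's model.
import torch
import torch.nn as nn
import torchaudio


class MFCCTransformerClassifier(nn.Module):
    def __init__(self, n_mfcc=40, d_model=64, n_heads=4, n_layers=2, n_classes=35):
        super().__init__()
        # Project each MFCC frame (n_mfcc coefficients) to the model dimension.
        self.input_proj = nn.Linear(n_mfcc, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=128, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, mfcc):
        # mfcc: (batch, n_mfcc, time) -> (batch, time, n_mfcc): one token per frame
        x = mfcc.transpose(1, 2)
        x = self.input_proj(x)       # (batch, time, d_model)
        x = self.encoder(x)          # self-attention over the frame sequence
        x = x.mean(dim=1)            # mean-pool over time before classification
        return self.classifier(x)


# Example: classify a 1-second, 16 kHz waveform (e.g., a Speech Commands clip).
mfcc_transform = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=40)
waveform = torch.randn(1, 16000)     # stand-in for a loaded audio clip
logits = MFCCTransformerClassifier()(mfcc_transform(waveform))
print(logits.shape)                  # torch.Size([1, 35])
```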
