Transformer-Based Audio Classification: Adapting Sequence-Classification Techniques from NLP
Abstract
Audio classification plays an important role in fields such as music analysis and speech recognition. A key step is feature extraction from the audio signal, and the most widely used features are MFCCs and Mel-spectrograms, typically rendered as spectrogram representations for classification. Researchers have applied machine learning and deep learning techniques to classify such spectrograms, but these approaches can carry a high computational cost. To reduce this cost, this work explores a more direct approach inspired by sequence classification in NLP. This paper proposes a Transformer-based model for audio classification that uses MFCCs as input features. The proposed model is benchmarked on the Speech Commands v0.02, UrbanSound8k, and ESC-50 datasets and shows strong performance, with the highest accuracy of 95.2% obtained when the model is trained on UrbanSound8k. With only 127,544 parameters in total, the model is lightweight yet highly efficient at audio classification. This work thus offers an efficient, computationally inexpensive solution for audio classification that can benefit practitioners in machine learning and data science.
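
As a rough illustration of the pipeline the abstract describes, the sketch below feeds MFCC features of a short audio clip into a small Transformer encoder and classifies the pooled representation. It is a minimal sketch under stated assumptions: the use of PyTorch and torchaudio, the layer sizes, mean-pooling over time, and the 35-class output (matching Speech Commands v0.02) are illustrative choices, not the paper's exact architecture or parameter count.

```python
# Illustrative sketch only: MFCC frames treated as a token sequence for a small
# Transformer encoder. Hyperparameters are assumptions, not the paper's model.
import torch
import torch.nn as nn
import torchaudio


class MFCCTransformerClassifier(nn.Module):
    def __init__(self, n_mfcc=40, d_model=64, n_heads=4, n_layers=2, n_classes=35):
        super().__init__()
        # Project each MFCC frame (n_mfcc coefficients) to the model dimension.
        self.input_proj = nn.Linear(n_mfcc, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=128, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, mfcc):
        # mfcc: (batch, n_mfcc, time) -> (batch, time, n_mfcc): one token per frame
        x = mfcc.transpose(1, 2)
        x = self.input_proj(x)       # (batch, time, d_model)
        x = self.encoder(x)          # self-attention over the frame sequence
        x = x.mean(dim=1)            # mean-pool over time before classification
        return self.classifier(x)


# Example: classify a 1-second, 16 kHz waveform (e.g., a Speech Commands clip).
mfcc_transform = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=40)
waveform = torch.randn(1, 16000)     # stand-in for a loaded audio clip
logits = MFCCTransformerClassifier()(mfcc_transform(waveform))
print(logits.shape)                  # torch.Size([1, 35])
```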
