2025 Volume 16 Issue 4 Pages 1009-1021
In this study, we propose a method for constructing a single-channel-input image recognition model for music emotion classification using transfer learning and data augmentation. The data augmentation method generates a variety of spectrogram images by varying the STFT window size in small increments. It yields a dataset five times the size of the original and prevents the degradation of classification performance caused by insufficient data. The model construction method adapts EfficientNetV2, pre-trained on ImageNet, to grayscale images through transfer learning. The model constructed with transfer learning and the proposed data augmentation method achieved a classification accuracy of 94.8% on the 4Q Audio Emotion Dataset. Thus, our transfer learning construction method for grayscale images, combined with the proposed data augmentation method, is effective for high-accuracy music emotion classification.
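The two ideas summarized above can be illustrated with a minimal NumPy sketch. It is not the implementation used in the study: the base window size (1024), the increment (32), the hop size (256), and all function names are hypothetical values chosen only for illustration, and treating the five spectrograms as including the base window is an assumption. The abstract also does not state how the ImageNet weights are adapted to one channel; summing the first-layer RGB kernels, shown here, is one common option, which reproduces the response the original layer would give to a grayscale image replicated across three channels.

```python
import numpy as np

def stft_magnitude(signal, window_size, hop_size):
    """Magnitude spectrogram from a Hann-windowed STFT, shape (freq_bins, frames)."""
    window = np.hanning(window_size)
    n_frames = 1 + (len(signal) - window_size) // hop_size
    frames = np.stack([
        signal[i * hop_size : i * hop_size + window_size] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1)).T

def augment_by_window_size(signal, base_window=1024, step=32, n_variants=5, hop_size=256):
    """Return {window_size: spectrogram} for window sizes centred on base_window.

    base_window, step, and hop_size are hypothetical; the abstract only says
    the window size is varied "in small increments".
    """
    offsets = (np.arange(n_variants) - n_variants // 2) * step
    return {
        int(base_window + o): stft_magnitude(signal, int(base_window + o), hop_size)
        for o in offsets
    }

def rgb_kernel_to_grayscale(kernel_rgb):
    """Collapse a first-layer kernel (out, 3, kh, kw) to (out, 1, kh, kw).

    Summing over the input-channel axis is an assumed adaptation strategy,
    not one confirmed by the source.
    """
    return kernel_rgb.sum(axis=1, keepdims=True)

# One second of a 440 Hz tone at 22050 Hz stands in for a music clip.
sample_rate = 22050
clip = np.sin(2 * np.pi * 440 * np.arange(sample_rate) / sample_rate)
spectrograms = augment_by_window_size(clip)
for size, spec in spectrograms.items():
    print(size, spec.shape)
```

Because each window size gives a different number of frequency bins and frames, the resulting spectrograms would presumably be resized to a common image size before being passed to the network.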