Abstract
A room contains many different sound sources, and classifying these sounds has many applications, such as monitoring living conditions. The authors investigate a method for identifying each sound in the environment by converting time-series data of indoor sounds into spectrogram images and using them as input for transfer learning to build a discriminative model. The amount of sound data that can be prepared in advance is limited by the effort required for recording and by the variety of sound types that must be covered. As a result, the training data may be insufficient to achieve adequate classification accuracy. Therefore, this study proposes a data augmentation method to improve classification accuracy when the amount of data is limited, applies it to the classification of single and mixed sounds that exist in a room, and reports the results of evaluating its effectiveness.
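As an illustration of the first step described above (converting a time-series sound signal into a spectrogram image), the following is a minimal sketch using only NumPy. It is not the authors' implementation: the function names `log_spectrogram` and `to_image`, the frame length of 512 samples, the hop of 256 samples, the Hann window, and the 8-bit grayscale scaling are all assumptions chosen for illustration.

```python
import numpy as np


def log_spectrogram(signal, frame_len=512, hop=256, eps=1e-10):
    """Return a log-magnitude spectrogram of shape (freq_bins, frames).

    frame_len, hop, and the Hann window are illustrative assumptions,
    not values taken from the paper.
    """
    signal = np.asarray(signal, dtype=np.float64)
    # Pad very short signals so that at least one full frame exists.
    if signal.size < frame_len:
        signal = np.pad(signal, (0, frame_len - signal.size))
    n_frames = 1 + (signal.size - frame_len) // hop
    window = np.hanning(frame_len)
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # One-sided FFT of each frame gives frame_len // 2 + 1 frequency bins.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Convert to decibels; eps avoids log10(0).
    return 20.0 * np.log10(magnitude + eps).T


def to_image(spec):
    """Scale a spectrogram to an 8-bit grayscale image array (assumed format)."""
    lo, hi = spec.min(), spec.max()
    scaled = (spec - lo) / (hi - lo) if hi > lo else np.zeros_like(spec)
    return (255 * scaled).astype(np.uint8)


if __name__ == "__main__":
    # Synthetic 1-second, 1 kHz tone at 16 kHz stands in for a recorded sound.
    sr = 16000
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 1000 * t)
    image = to_image(log_spectrogram(tone))
    print(image.shape, image.dtype)
```

An image array of this kind could then be resized and passed to a pretrained image classifier for transfer learning; the choice of network and the augmentation procedure are not specified in this abstract and are therefore left out of the sketch.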