Abstract
Automatic music transcription, the conversion of audio into a symbolic representation such as sheet music, is one of the major tasks at the intersection of music and computing. Its subtask automatic drum transcription targets a drum kit composed of various percussion instruments, detecting which instrument is struck and at what time and converting the performance into a symbolic representation. Previous studies have addressed automatic music transcription mainly with audio as the sole input. However, audio-only methods often struggle to transcribe polyphonic material and recordings made in environments with background noise. One possible remedy is to perform transcription from visual information, that is, from a video recording of a person playing the instrument. This study therefore focuses on the snare drum, one of the main instruments of a drum kit, and examines whether onsets (the physical instants at which the drum is struck) can be detected under monophonic conditions from silent performance videos. Specifically, we fine-tuned a pre-trained ResNet-18 into a binary classification model using video frames labeled as onset or non-onset, and then evaluated the classification accuracy and the final-layer outputs on the evaluation data. The results of the validation experiments show that onsets can be detected with high accuracy.
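For illustration only, the following is a minimal PyTorch sketch of the kind of setup described above: a torchvision ResNet-18 pre-trained on ImageNet whose final fully connected layer is replaced with a two-class (onset / non-onset) head. The input size, optimizer, and learning rate are assumptions, not the authors' reported configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pre-trained ResNet-18 and replace its final fully
# connected layer with a two-class head (onset vs. non-onset).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 2)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed hyperparameters

def train_step(frames: torch.Tensor, labels: torch.Tensor) -> float:
    """One fine-tuning step on a batch of video frames.

    frames: (N, 3, 224, 224) tensor of frames from the performance video.
    labels: (N,) tensor with 1 for onset frames and 0 for non-onset frames.
    """
    model.train()
    optimizer.zero_grad()
    logits = model(frames)           # (N, 2) class scores from the final layer
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time, the final-layer outputs (or their softmax) for each frame can be inspected over time to locate onsets, which corresponds to the evaluation of classification accuracy and final-layer outputs mentioned in the abstract.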