Deep learning promises to make sports match analysis far more efficient, and many studies have focused on volleyball. However, previous studies have used only the frames in which the ball is alive (i.e., in-play frames), and recent analysis methods rely on visual data, which depend heavily on the camera's position and angle relative to the court. As a result, these methods require a large dataset of images captured from various angles to improve accuracy. To extract and analyze plays efficiently as they happen, we propose a model that distinguishes in-play from out-of-play segments by combining the visual and audio data of volleyball match videos through late fusion. To investigate the effectiveness of the proposed model, we apply Grad-CAM visualization to identify which pixels the model focuses on.
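The abstract does not specify the branch architectures or fusion weights; the following is a minimal sketch of the late-fusion idea it describes, assuming a CNN visual branch, a spectrogram-based audio branch, and a simple weighted average of their class probabilities. All layer sizes, the `alpha` weight, and the label convention are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal late-fusion sketch for in-play / out-of-play classification.
# Each modality is classified independently; only their class
# probabilities are combined (late fusion).
import torch
import torch.nn as nn

class VisualBranch(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, frames):                  # frames: (B, 3, H, W)
        return self.classifier(self.features(frames).flatten(1))

class AudioBranch(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, spec):                    # spec: (B, 1, n_mels, T)
        return self.classifier(self.features(spec).flatten(1))

class LateFusionClassifier(nn.Module):
    def __init__(self, alpha=0.5):
        super().__init__()
        self.visual = VisualBranch()
        self.audio = AudioBranch()
        self.alpha = alpha                      # visual weight (assumed value)

    def forward(self, frames, spec):
        p_vis = torch.softmax(self.visual(frames), dim=1)
        p_aud = torch.softmax(self.audio(spec), dim=1)
        return self.alpha * p_vis + (1 - self.alpha) * p_aud

# Example: a batch of 4 frames with matching audio spectrograms.
model = LateFusionClassifier()
probs = model(torch.randn(4, 3, 224, 224), torch.randn(4, 1, 64, 128))
pred = probs.argmax(dim=1)   # 0 = out-of-play, 1 = in-play (assumed labels)
```

Because the fusion happens at the probability level, Grad-CAM can be applied to the visual branch alone to inspect which pixels drive its contribution to the fused decision.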