Research on tactical and performance analysis utilizes videos of dynamic sports scenes, and an effective multi-view video switching method can support such analysis. Bullet-time video is a multi-view video browsing approach; because the captured images are presented almost as they are, it is well suited to high-quality observation of a subject from multiple directions. This paper proposes a multi-view image switching method for understanding dynamic scenes in large-scale spaces such as soccer games. We develop a prediction model for the camerawork used to shoot bullet-time videos. The model is a deep neural network that estimates a suitable viewpoint for observing the target scene from the position information of the soccer players, the ball, and the goals.
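A minimal sketch of this idea, assuming a hypothetical linear scorer with hand-set weights rather than the authors' trained network: flatten the player, ball, and goal positions into a feature vector, score each candidate camera, and pick the best one.

```python
# Sketch: pick a viewpoint from player/ball/goal positions with a linear
# scorer. The feature layout and all weights are illustrative assumptions,
# not the trained model from the paper.

def make_features(players, ball, goal):
    """Flatten 2D pitch positions into one feature vector."""
    feats = []
    for x, y in players:
        feats.extend([x, y])
    feats.extend(ball)
    feats.extend(goal)
    return feats

def score(weights, bias, feats):
    return sum(w * f for w, f in zip(weights, feats)) + bias

def predict_viewpoint(camera_models, players, ball, goal):
    """Return the index of the candidate camera with the highest score."""
    feats = make_features(players, ball, goal)
    scores = [score(w, b, feats) for w, b in camera_models]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy example: two candidate cameras, two players.
players = [(10.0, 5.0), (20.0, 8.0)]
ball = (15.0, 6.0)
goal = (52.5, 0.0)
cams = [([0.01] * 8, 0.0),    # camera 0: small uniform weights
        ([0.02] * 8, -1.0)]   # camera 1: larger weights, negative bias
best = predict_viewpoint(cams, players, ball, goal)
```

In practice the scorer would be a trained deep network, but the interface is the same: positions in, preferred viewpoint out.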
For automated game analysis, it is essential to detect the kicking motions of players in soccer videos in order to understand each player's actions. This paper presents a fast and accurate approach to detecting kicking motions with a ball-centric window in multi-view 4K soccer videos. Building on powerful object detection techniques such as SSD and YOLOv3 and pose estimation techniques such as OpenPose and CPN, we propose novel solutions to two challenges in 4K soccer videos. The first challenge is that processing the massive amount of data in multi-view 4K videos is prohibitively expensive. Our solution is to process only a small portion (i.e., a ball-centric window) of each 4K frame, aided by an object tracking technique and a homography transformation. The second challenge is that kicking motions may be incorrectly detected due to two factors: the absence of depth information and the inaccuracy of pose estimation. We fuse multiple views to avoid the depth problem, and we propose enlarging the person areas to effectively improve the accuracy of pose estimation. Experiments on real data from the J1 League demonstrate that the proposed approach detects kicking motions both faster and more accurately than conventional methods.
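The ball-centric window can be sketched as follows, assuming a placeholder homography matrix rather than real camera calibration: project the tracked ball's pitch position into the 4K frame, then crop a fixed-size window around it, clamped to the frame bounds.

```python
# Sketch: map a tracked ball's pitch position into a 4K frame via a planar
# homography, then crop a fixed-size window around it. The homography matrix
# below is an illustrative placeholder, not calibration from the paper.

def apply_homography(H, x, y):
    """Project a 2D point with a 3x3 homography (row-major nested lists)."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w

def ball_window(H, ball_xy, size=512, frame_w=3840, frame_h=2160):
    """Return (left, top, right, bottom) of a crop centred on the ball,
    clamped so the window stays inside the 4K frame."""
    u, v = apply_homography(H, *ball_xy)
    half = size // 2
    left = min(max(int(u) - half, 0), frame_w - size)
    top = min(max(int(v) - half, 0), frame_h - size)
    return left, top, left + size, top + size

# Toy calibration: scale pitch metres to pixels plus an offset (placeholder).
H = [[30.0, 0.0, 100.0],
     [0.0, 30.0, 50.0],
     [0.0, 0.0, 1.0]]
box = ball_window(H, (10.0, 5.0))
```

Only the pixels inside `box` would then be fed to the detector and pose estimator, which is what makes processing multi-view 4K streams tractable.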
The details of a soccer match can be estimated from visual and audio sequences, which correspond to the occurrence of important scenes; these sequences are therefore well suited to important-scene detection. In this paper, a new multimodal method for important-scene detection from visual and audio sequences in far-view soccer videos based on a single deep neural architecture is presented. A unique point of our method is that multiple classifiers are realized by a single deep neural architecture comprising a Convolutional Neural Network-based feature extractor and a Support Vector Machine-based classifier. This approach solves the problem that multiple different deep neural architectures cannot be optimized simultaneously from a small amount of training data. We then monitor the confidence measures output by this architecture for the multimodal data and integrate them to obtain the final classification result.
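The final integration step might look like the following sketch, where the per-modality confidences are SVM-style decision values and the weighting scheme is an illustrative assumption (the paper's exact integration rule may differ).

```python
# Sketch: fuse per-modality confidence scores into one decision.
# The weights and threshold are illustrative assumptions, not the
# integration rule from the paper.

def fuse_confidences(visual_conf, audio_conf, w_visual=0.6, w_audio=0.4):
    """Weighted sum of SVM-style decision values from each modality."""
    return w_visual * visual_conf + w_audio * audio_conf

def is_important_scene(visual_conf, audio_conf, threshold=0.0):
    """Classify a scene as important if the fused confidence clears threshold."""
    return fuse_confidences(visual_conf, audio_conf) > threshold

important = is_important_scene(1.2, -0.5)   # visual strongly positive
quiet = is_important_scene(-0.8, 0.1)       # visual negative, weak audio
```

The point of fusing at the confidence level is that a weak signal in one modality can be overruled by a strong signal in the other.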
This paper presents a novel method of estimating temporal offsets between unsynchronized multi-view videos. When synchronizing multiple cameras scattered across a large area with wide baselines (e.g., a sports stadium or an event hall), conventional epipolar-based approaches sometimes fail because robust point correspondences are difficult to obtain. In such cases, 2D projections of human joints can be robustly associated with each other even in wide-baseline videos and can be used as corresponding points. However, the detected 2D poses generally include detection errors that cause estimation failures. To address these problems, we introduce the motion rhythm of 2D human joints as a cue for synchronization. The proposed method detects motion rhythms from the videos and estimates the temporal offset at which the motion rhythms are best harmonized. Moreover, we propose a hybrid synchronization algorithm to achieve sub-frame precision. We demonstrate our method's performance on indoor and outdoor data.
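The frame-level offset search can be sketched as follows, assuming the motion rhythms have already been detected and reduced to binary per-frame sequences (1 = a motion peak in that frame); the sub-frame refinement of the hybrid algorithm is omitted.

```python
# Sketch: estimate a frame offset between two cameras by best-matching their
# binary "motion rhythm" sequences. Rhythm detection itself is assumed done.

def rhythm_agreement(a, b, offset):
    """Count frames where both rhythms have a peak after shifting b by offset."""
    hits = 0
    for i, va in enumerate(a):
        j = i + offset
        if 0 <= j < len(b) and va == b[j] == 1:
            hits += 1
    return hits

def estimate_offset(a, b, max_offset=10):
    """Return the offset (in frames) maximizing rhythm agreement."""
    return max(range(-max_offset, max_offset + 1),
               key=lambda off: rhythm_agreement(a, b, off))

# Toy rhythms: camera B lags camera A by 3 frames.
cam_a = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
cam_b = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1]
offset = estimate_offset(cam_a, cam_b)
```

Because the cue is the timing of motion peaks rather than appearance, this search tolerates the pose-detection errors that defeat point-based correspondence.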
An interpretable convolutional neural network (CNN) with attribute estimation for image classification is presented in this paper. Although CNNs perform highly accurate image classification, the reasons behind the classification results they produce are not clear. To provide interpretation of CNNs, the proposed method estimates attributes, which describe elements of objects, in an intermediate layer of the network. This improves the interpretability of CNNs and is the main contribution of this paper. Furthermore, the proposed method uses the estimated attributes for image classification in order to enhance its accuracy. Consequently, the proposed method not only provides interpretation of CNNs but also improves the performance of image classification.
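The architecture idea can be sketched as follows, with plain dense layers and placeholder weights standing in for the CNN: an attribute head reads the intermediate features, and the final classifier consumes the features concatenated with the estimated attributes.

```python
# Sketch: an "attribute head" reads intermediate features, and the final
# classifier consumes features + estimated attributes. All weights are
# illustrative placeholders, not a trained CNN.

def linear(weights, bias, x):
    """One dense layer: weights is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def predict(features, attr_w, attr_b, cls_w, cls_b):
    """Estimate binary attributes from intermediate features, then classify
    on the concatenation of features and attributes."""
    attributes = [1.0 if a > 0 else 0.0
                  for a in linear(attr_w, attr_b, features)]
    logits = linear(cls_w, cls_b, features + attributes)
    label = max(range(len(logits)), key=logits.__getitem__)
    return label, attributes

# Toy setup: 3 intermediate features, 2 attributes, 2 classes.
feats = [0.5, -0.2, 0.9]
attr_w = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
attr_b = [0.0, -1.0]
cls_w = [[1.0, 0.0, 0.0, 2.0, 0.0],
         [0.0, 1.0, 0.0, 0.0, 2.0]]
cls_b = [0.0, 0.0]
label, attrs = predict(feats, attr_w, attr_b, cls_w, cls_b)
```

The returned attribute vector is what makes the prediction inspectable: a human can check which object elements the network claims to have seen.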