2024, Vol. 144, No. 10, pp. 985-996
In this paper, we propose a method for detecting driving scenes in which cognitive function can be evaluated. The method defines an assessable scene as one composed of three elements: road structure, appearing objects, and driving operations, and it detects scenes matching arbitrary combinations of these elements. Because the detection targets combine multiple information sources, and because useful feature vectors for such targets are difficult to specify in advance, multimodal deep learning is employed. Existing research often adopts an intermediate-fusion model structure, but such models have been reported to be difficult to tune and to fail to learn inter-modality relationships when the modalities differ in the amount of information they carry. We therefore propose a new model structure that incorporates an attention mechanism into a late-fusion model. This model evaluates each modality individually before producing the final detection result, yielding a structure that is highly interpretable with respect to how the detection result is obtained. In experiments, the proposed method was compared, in terms of detection accuracy, with the intermediate-fusion model structure used in existing research, and improvements in both recall and precision were confirmed.
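The late-fusion-with-attention idea described above can be sketched as follows. This is a minimal, illustrative sketch, not the authors' implementation: it assumes each modality branch (road structure, appearing objects, operations) has already produced a scalar detection logit and an unnormalized relevance score, and it fuses them with softmax attention weights. All function and variable names here are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array
    e = np.exp(x - np.max(x))
    return e / e.sum()

def late_fusion_with_attention(modality_logits, relevance_scores):
    """Fuse per-modality detection logits via attention weights.

    modality_logits: one detection score per modality branch
    relevance_scores: unnormalized attention scores, one per modality
    Returns (fused_score, weights); the weights can be inspected to see
    how much each modality contributed to the final detection result,
    which is the readability property the abstract emphasizes.
    """
    weights = softmax(np.asarray(relevance_scores, dtype=float))
    fused = float(np.dot(weights, np.asarray(modality_logits, dtype=float)))
    return fused, weights

# Hypothetical per-modality outputs for one candidate scene:
# road-structure, appearing-object, and operation branches.
logits = [2.0, 0.5, 1.0]
relevance = [1.2, 0.3, 0.8]
score, w = late_fusion_with_attention(logits, relevance)
```

Because the fused score is a convex combination of per-modality scores, each branch can still be evaluated on its own, while the attention weights expose how the final decision was formed.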