Host: The Japanese Society for Artificial Intelligence
Name: The 37th Annual Conference of the Japanese Society for Artificial Intelligence
Number: 37
Location: [in Japanese]
Date: June 06, 2023 - June 09, 2023
The Vision Transformer (ViT), which replaces convolution with attention for feature extraction, has demonstrated high performance in image processing. This result shows that the Transformer can handle both time-series data and images, suggesting it can serve as a versatile model independent of data modality. However, many studies derived from ViT narrow the receptive field used for feature extraction, which compromises their adaptability to time-series data such as speech. In this paper, we propose a method that adaptively optimizes the receptive field for the modality of the given data. We built a model using the proposed method and conducted experiments on two types of data, images and speech, finding that the proposed method outperforms conventional methods on both. Visualization shows that the proposed method acquires a receptive field suited to the modality of the given data.
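The abstract does not specify how the receptive field is parameterized or optimized, so the following is only an illustrative sketch of the general idea: an attention layer whose effective receptive field is governed by a learnable, per-head radius, so that training on images versus speech can drive the field wider or narrower. The class name, the soft distance-penalty mechanism, and all hyperparameters here are assumptions, not the paper's actual method.

```python
# Hypothetical sketch of modality-adaptive receptive fields in attention.
# A learnable per-head radius scales a distance penalty on the attention
# logits: a small radius yields local, convolution-like attention, while a
# large radius recovers nearly global ViT-style attention.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveReceptiveFieldAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One learnable log-radius per head; softplus keeps the radius > 0.
        self.log_radius = nn.Parameter(torch.zeros(num_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); tokens are image patches or speech frames.
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (b, heads, n, head_dim)
        logits = q @ k.transpose(-2, -1) / self.head_dim ** 0.5

        # Soft locality bias: token pairs farther apart than the learned
        # radius are penalized, shrinking the effective receptive field.
        pos = torch.arange(n, device=x.device, dtype=x.dtype)
        dist = (pos[None, :] - pos[:, None]).abs()  # (n, n) token distances
        radius = F.softplus(self.log_radius)        # (heads,)
        bias = -dist[None, None] / radius[None, :, None, None]
        attn = (logits + bias).softmax(dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)


# Usage: the same module processes either modality; the radii are updated by
# gradient descent along with the rest of the network.
layer = AdaptiveReceptiveFieldAttention(dim=64, num_heads=4)
tokens = torch.randn(2, 196, 64)  # e.g., 14x14 grid of image patches
print(layer(tokens).shape)        # torch.Size([2, 196, 64])
```

In this sketch the radius is the only modality-dependent quantity, which keeps the extra parameter count at one scalar per head; the actual paper may optimize the receptive field through an entirely different mechanism.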