2022 Volume 2022 Issue AIMED-012 Pages 05-
Convolutional neural networks (CNNs) have been adopted as standard deep learning models in medical image analysis owing to their ability to automatically extract high-level features from training images. Recently, Vision Transformer (ViT) models have been proposed, which implement the Transformer architecture originally developed for natural language processing. Given their high predictive performance, we built a couple of ViT models to detect kidney cancer based on computed tomography (CT) images. Experimental results show that our ViT models outperformed conventional CNNs in terms of detection accuracy with various types of CT images. Moreover, we visualized the attention maps of our ViT models to help understand the basis for their detection output.