This study aimed to estimate, using multimodal deep learning, how willing people are to visit the places shown in cityscape images. Gaze information was acquired through subject experiments using a measurement device. We added the gaze information recorded when participants felt motivated to visit the depicted cityscape as an additional input, and tested whether it improved the model's estimation accuracy. We also built a model that uses pix2pix to generate pseudo-gaze images, and used these images for multimodal deep learning. Finally, we evaluated the accuracy of the proposed multimodal deep learning approach when the generated pseudo-gaze images were supplied as input.
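The idea of combining a cityscape image with gaze information can be illustrated with a minimal sketch. It assumes late fusion by concatenation followed by a single logistic output, which the abstract does not specify. The names `fuse` and `predict_willingness`, the feature sizes, and the random weights are hypothetical stand-ins for learned features and parameters, not the authors' architecture.

```python
import math
import random


def fuse(image_features, gaze_features):
    """Late fusion by concatenation: build one joint vector from both modalities."""
    return list(image_features) + list(gaze_features)


def predict_willingness(image_features, gaze_features, weights, bias):
    """Return a logistic score in (0, 1) computed from the fused feature vector."""
    fused = fuse(image_features, gaze_features)
    if len(fused) != len(weights):
        raise ValueError("weights must match the fused feature length")
    z = bias + sum(w * x for w, x in zip(weights, fused))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical inputs: in practice these would be features extracted from the
# cityscape image and from the measured or generated (pseudo-)gaze image.
rng = random.Random(0)
image_features = [rng.uniform(-1, 1) for _ in range(4)]
gaze_features = [rng.uniform(-1, 1) for _ in range(3)]
weights = [rng.uniform(-1, 1) for _ in range(7)]  # untrained, illustration only

score = predict_willingness(image_features, gaze_features, weights, bias=0.0)
print(f"estimated willingness score: {score:.3f}")
```

Because the gaze branch is just a second feature vector, the same function accepts features from either a measured gaze image or a generated pseudo-gaze image, which is what makes the comparison described above possible.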