In this paper, we focus on the double bind between textual and visual information and investigate three questions: (1) whether Large Language Models (LLMs) can detect a sense of incongruity, (2) whether images that evoke positive or negative impressions influence the detection of incongruity, and (3) whether the incongruity judgments made by LLMs align with those of human subjects. We examined three LLMs: GPT-4o, Gemini 1.5 Flash, and Claude 3 Haiku. Our results indicate that LLMs tend to detect the sense of incongruity arising from the double bind between text and images. Moreover, although impression evaluations varied across images, the LLMs consistently tended to detect incongruity. Finally, among the LLMs studied, GPT-4o’s incongruity judgments were most similar to those of human subjects.