How Does Human Image Recognition Differ from Computer Vision Image Recognition?

大森 宏; 羽生 和紀

doi:10.5057/jjske.TJSKE-D-23-00036

Abstract

Computer vision models (CVMs) such as CNNs, the Vision Transformer (ViT), and CLIP are pre-trained on very large datasets and achieve very high image recognition performance. In our environmental cognition research using photographs, we have measured inter-photo visual similarity manually. Our previous study found that CVM-based photo similarity and human visual similarity were quite similar when compared on multidimensional scaling (MDS) configurations of the photos. However, it also suggested that the difference in image cognition between humans and CVMs is related to human representation. Here we examine the difference between CVM-based photo similarity and visual similarity numerically and in detail, using six types of photo sets. The influence of representation could be evaluated by cluster size on the MDS configuration. Representation was shown to influence the cognition of shrines and temples, foods, insects, buildings, greenery, garden styles, perspective views, night views, the symbol tree, and so on.
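The abstract does not state how CVM-based similarity or cluster size on the MDS configuration was computed, so the following is only a minimal sketch under stated assumptions: dissimilarity between two photos is taken as one minus the cosine similarity of their feature vectors, and cluster size is taken as the root-mean-square distance of a group's points from the group centroid. The function names, coordinates, and labels are hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


def cosine_dissimilarity(u, v):
    """1 - cosine similarity of two feature vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def cluster_sizes(coords, labels):
    """RMS distance of each labelled group from its centroid.

    coords: list of points (e.g. 2-D MDS coordinates), one per photo.
    labels: list of group names (e.g. "shrine", "food"), one per photo.
    The RMS definition is an assumption made for illustration.
    """
    groups = defaultdict(list)
    for point, label in zip(coords, labels):
        groups[label].append(point)

    sizes = {}
    for label, points in groups.items():
        dim = len(points[0])
        centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        squared = [
            sum((p[d] - centroid[d]) ** 2 for d in range(dim)) for p in points
        ]
        sizes[label] = math.sqrt(sum(squared) / len(points))
    return sizes


# Hypothetical MDS coordinates for the same four photos under two similarity measures.
labels = ["shrine", "shrine", "food", "food"]
human_coords = [(0.0, 0.0), (0.2, 0.0), (3.0, 3.0), (3.0, 3.4)]
cvm_coords = [(0.0, 0.0), (1.0, 0.0), (3.0, 3.0), (3.0, 4.0)]

human_sizes = cluster_sizes(human_coords, labels)
cvm_sizes = cluster_sizes(cvm_coords, labels)
for label in human_sizes:
    print(label, human_sizes[label], cvm_sizes[label])
```

Under these assumptions, a group whose cluster is noticeably tighter in the human-based configuration than in the CVM-based one would be a candidate for a category where representation plays a role.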


Predecessor journal

感性工学研究論文集
