In this framework, we improve general visual inspection performance by changing the foundation Vision-Language Model (VLM), reconstructing the fine-tuning dataset, and proposing an example selection algorithm for In-Context Learning (ICL). The existing approach using a VLM with ICL provides non-defective or defective images together with an explanatory description as a prompt, allowing unknown products to be inspected without additional parameter updates. However, the foundation VLM used in the existing approach was chosen for its ICL capability, without considering local recognition capability. In this study, we therefore replace the foundation VLM with one focused on local recognition capability. We also reconstruct the fine-tuning dataset so that the model can detect the coordinates of defects. In addition, at inference time we propose an example selection algorithm based on Euclidean distance and supply the selected ICL example with a visual prompt. Experimental results show that our approach achieves an F1-score of 0.950 on MVTec AD in a one-shot setting.
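The abstract does not detail the selection procedure; as a minimal sketch, assuming the query and candidate images are embedded by the VLM's vision encoder, Euclidean-distance-based ICL example selection could look like the following (the function name, embedding source, and dimensions are all hypothetical):

```python
import numpy as np

def select_icl_example(query_emb: np.ndarray,
                       candidate_embs: np.ndarray) -> int:
    """Return the index of the candidate embedding closest to the
    query embedding under Euclidean (L2) distance."""
    # L2 distance between the query and every candidate embedding.
    dists = np.linalg.norm(candidate_embs - query_emb, axis=1)
    return int(np.argmin(dists))

# Hypothetical usage: embeddings would come from the VLM's image
# encoder, which the abstract does not specify.
rng = np.random.default_rng(0)
candidates = rng.standard_normal((8, 512))   # 8 candidate ICL examples
query = rng.standard_normal(512)             # query product image
print(f"nearest ICL example index: {select_icl_example(query, candidates)}")
```

The selected example image (and its description) would then be placed in the prompt alongside the visual prompt for the query image.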