Host: The Japanese Society for Artificial Intelligence
Name : The 37th Annual Conference of the Japanese Society for Artificial Intelligence
Number : 37
Location : [in Japanese]
Date : June 06, 2023 - June 09, 2023
Visual information plays an important role in language acquisition by humans. While most large language models (LLMs) that have succeeded at various NLP tasks are trained only on textual data, the work on Vokenization established a new way of incorporating visual information into LLM training to improve performance on NLP tasks. However, the Vokenization process tends to assign the same image to different tokens within a sentence, which prevents the LLM from learning effective word representations. In this study, to further improve LLM performance, we propose a method that diversifies the images assigned to tokens during LLM training by exploiting top-k or top-p sampling. Experimental results demonstrated the effectiveness of our method on GLUE, an English language understanding benchmark, where it outperformed the baseline method that used top-1 retrieval in Vokenization.
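The core idea can be illustrated with a small sketch. This is not the authors' implementation; it assumes, hypothetically, that each token comes with relevance scores over candidate images (as produced by the Vokenization retriever), and shows how top-k or top-p (nucleus) sampling over those scores yields varied image assignments instead of always taking the argmax.

```python
import numpy as np

def sample_image(scores, k=None, p=None, rng=None):
    """Pick one image index for a token from its retrieval scores.

    Hypothetical sketch: instead of top-1 retrieval (argmax), sample
    from the top-k or top-p candidates so that different tokens in a
    sentence can be assigned different images.
    """
    rng = rng or np.random.default_rng()
    scores = np.asarray(scores, dtype=float)
    # Softmax over relevance scores to get a distribution over images.
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]  # image indices, descending probability
    if k is not None:
        keep = order[:k]                       # top-k candidates
    elif p is not None:
        csum = np.cumsum(probs[order])
        cut = int(np.searchsorted(csum, p)) + 1  # smallest prefix with mass >= p
        keep = order[:cut]                     # top-p (nucleus) candidates
    else:
        return int(order[0])                   # fall back to top-1 retrieval
    kept = probs[keep] / probs[keep].sum()     # renormalize over kept candidates
    return int(rng.choice(keep, p=kept))
```

With k=1 (or a very small p) this reduces to the top-1 retrieval baseline; larger k or p spreads assignments over more images, which is the diversification the abstract describes.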