Host: The Japanese Society for Artificial Intelligence
Name : The 37th Annual Conference of the Japanese Society for Artificial Intelligence
Number : 37
Location : [in Japanese]
Date : June 06, 2023 - June 09, 2023
Visual information plays an important role in language acquisition by humans. While most large language models (LLMs) that have succeeded at various NLP tasks are trained only on textual data, the work on Vokenization established a new way of incorporating visual information into LLM training to improve performance on NLP tasks. However, the Vokenization process tends to assign the same image to different tokens within a sentence, which prevents the LLM from learning effective word representations. In this study, to further improve LLM performance, we propose a method that diversifies the images assigned to tokens during LLM training by exploiting top-k or top-p sampling. Experimental results demonstrated the effectiveness of our method on GLUE, an English language understanding benchmark, where it outperformed the baseline method that used top-1 retrieval in Vokenization.
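The core idea can be illustrated with a small sketch. This is not the authors' implementation; it assumes, hypothetically, that each token comes with relevance scores over candidate images (as produced by the Vokenization retriever), and shows how top-k or top-p (nucleus) sampling over those scores yields varied image assignments instead of always taking the argmax.

```python
import numpy as np

def sample_image(scores, k=None, p=None, rng=None):
    """Pick one image index for a token from its retrieval scores.

    Hypothetical sketch: instead of top-1 retrieval (argmax), sample
    from the top-k or top-p candidates so that different tokens in a
    sentence can be assigned different images.
    """
    rng = rng or np.random.default_rng()
    scores = np.asarray(scores, dtype=float)
    # Softmax over relevance scores to get a distribution over images.
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]  # image indices, descending probability
    if k is not None:
        keep = order[:k]                       # top-k candidates
    elif p is not None:
        csum = np.cumsum(probs[order])
        cut = int(np.searchsorted(csum, p)) + 1  # smallest prefix with mass >= p
        keep = order[:cut]                     # top-p (nucleus) candidates
    else:
        return int(order[0])                   # fall back to top-1 retrieval
    kept = probs[keep] / probs[keep].sum()     # renormalize over kept candidates
    return int(rng.choice(keep, p=kept))
```

With k=1 (or a very small p) this reduces to the top-1 retrieval baseline; larger k or p spreads assignments over more images, which is the diversification the abstract describes.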