Host : The Japanese Society for Artificial Intelligence
Name : The 38th Annual Conference of the Japanese Society for Artificial Intelligence
Number : 38
Location : [in Japanese]
Date : May 28, 2024 - May 31, 2024
The large amounts of pre-training data used to build large language models (LLMs) may contain text that is inappropriate for training, such as copyrighted material or personal information. To address this problem, a method was proposed for detecting whether a given text is included in an LLM's pre-training data. The existing method makes this determination using the tokens in a sequence that have low probabilities. It has been evaluated only on LLMs trained on English, and its effectiveness on LLMs trained on Japanese has not been investigated. In this study, we evaluate the effectiveness of the existing detection method on Japanese LLMs and compare it with its effectiveness on English LLMs. To this end, we construct JAWikiMIA, a benchmark for detecting Japanese pre-training data. We report that English LLMs achieve high AUC scores when the method uses the 20% of tokens in a sequence with the lowest probabilities, whereas Japanese LLMs achieve high AUC scores when the method uses all tokens in a sequence.
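The detection approach described above scores a sequence by the probabilities its tokens receive under the target LLM. Below is a minimal sketch of such a low-probability-token score (in the style of Min-K% Prob), not the authors' exact implementation: the helper name min_k_percent_score, the choice of k, and the example model rinna/japanese-gpt-neox-3.6b are illustrative assumptions.

```python
# Minimal sketch: score a text by the average log-probability of its
# lowest-probability tokens under a causal LM (higher score = more likely
# the text appeared in the pre-training data).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def min_k_percent_score(text: str, model, tokenizer, k: float = 0.2) -> float:
    """Average log-probability of the fraction k of tokens with the lowest
    probabilities; k = 1.0 uses all tokens in the sequence."""
    enc = tokenizer(text, return_tensors="pt")
    input_ids = enc.input_ids
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probability of each token conditioned on its preceding context.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    token_log_probs = log_probs.gather(
        2, input_ids[:, 1:].unsqueeze(-1)
    ).squeeze(-1).squeeze(0)
    # Keep only the k fraction of tokens with the lowest probabilities.
    n_keep = max(1, int(len(token_log_probs) * k))
    lowest = torch.sort(token_log_probs).values[:n_keep]
    return lowest.mean().item()


if __name__ == "__main__":
    # Hypothetical Japanese LLM used only for illustration.
    name = "rinna/japanese-gpt-neox-3.6b"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()
    print(min_k_percent_score("吾輩は猫である。名前はまだ無い。", model, tokenizer, k=0.2))
```

In an evaluation of this kind, such scores would be computed for both member and non-member texts in the benchmark, and the AUC would be taken over the two groups; setting k = 0.2 corresponds to the 20%-of-tokens configuration and k = 1.0 to using all tokens.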