Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
38th (2024)
Session ID : 4Xin2-98

The analysis of pretraining data detection on LLMs between English and Japanese
*Kyoko KOYANAGI, Miyu SATO, Teruno KAJIURA, Kimio KURAMITSU
Abstract

The large amount of pre-training data used to build large language models (LLMs) may contain data that is inappropriate for training, such as copyrighted text or personal information. To address this problem, a method for detecting the contents of an LLM's pre-training data has been proposed. The existing method bases its decision on the low token probabilities of a sequence. It has been evaluated on LLMs trained on English, but its effectiveness on LLMs trained on Japanese has not been investigated. In this study, we evaluate the effectiveness of the existing detection method on Japanese LLMs and compare it with its effectiveness on English LLMs. To this end, we constructed JAWikiMIA, a benchmark for detecting Japanese pre-training data. We report that English LLMs achieve high AUC scores when the method uses the 20% of tokens in a sequence with the lowest token probabilities, while Japanese LLMs achieve high AUC scores when the method uses all tokens in a sequence.
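The detection score described above can be illustrated with a minimal sketch: average the log-probability of the k fraction of tokens with the lowest probability under the model (k = 0.2 for the 20% setting, k = 1.0 to use all tokens), then threshold that score to decide membership. This is only an assumed reconstruction of the method from the abstract's description, using a generic Hugging Face causal LM; the model name and the function below are hypothetical examples, not the authors' released code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def low_prob_token_score(text: str, model, tokenizer, k: float = 0.2) -> float:
    """Mean log-probability of the k fraction of lowest-probability tokens.

    A higher (less negative) score suggests the sequence is more likely to
    have appeared in the model's pre-training data. k=1.0 uses all tokens.
    """
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probability assigned by the model to each actual next token.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(
        1, input_ids[0, 1:].unsqueeze(-1)
    ).squeeze(-1)
    # Keep only the k fraction of tokens with the lowest probability.
    n_keep = max(1, int(len(token_log_probs) * k))
    lowest = torch.sort(token_log_probs).values[:n_keep]
    return lowest.mean().item()


# Hypothetical usage with an example Japanese causal LM from the hub:
# tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt-neox-3.6b")
# model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt-neox-3.6b").eval()
# score_20 = low_prob_token_score("検出対象のテキスト", model, tokenizer, k=0.2)
# score_all = low_prob_token_score("検出対象のテキスト", model, tokenizer, k=1.0)
```

An AUC score, as reported in the abstract, would then be computed by sweeping a threshold over these scores for a set of member and non-member sequences.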

© 2024 The Japanese Society for Artificial Intelligence