Host : The Japanese Society for Artificial Intelligence
Name : The 38th Annual Conference of the Japanese Society for Artificial Intelligence
Number : 38
Location : [in Japanese]
Date : May 28, 2024 - May 31, 2024
The performance of large language models (LLMs) depends on massive training datasets, often exceeding hundreds of gigabytes, that must be preprocessed to high quality. Because it is difficult for a single organization to build datasets of this scale, a framework that distributes the work across multiple organizations is needed. KOGITUNE has been designed to facilitate the training of LLMs with such distributed datasets. Its main idea is to perform dataset preprocessing and tensorization independently on external machines and then deliver the resulting tensors on demand to the GPU side, which keeps GPU utilization high during training. KOGITUNE also provides practical features, such as the ability to adjust the mixing ratios of multiple corpora. This paper presents the design and implementation of KOGITUNE and reports our experience developing LLMs ranging from 0.06B to 1.3B parameters with it.
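To make the main concept concrete, the following is a minimal PyTorch sketch of the two ideas in the abstract: pre-tensorized chunks fetched on demand from external storage, and multiple corpora sampled according to configurable mixing ratios. All names here (fetch_chunk, MixedCorpusDataset, the example URLs) are hypothetical illustrations, not KOGITUNE's actual API.

    # A minimal sketch (not KOGITUNE's actual API) of on-demand delivery of
    # pre-tensorized data plus corpus mixing ratios.
    import random
    from typing import Iterator, List

    import torch
    from torch.utils.data import DataLoader, IterableDataset


    def fetch_chunk(url: str, index: int) -> torch.Tensor:
        """Stand-in for an on-demand fetch of a pre-tensorized chunk.

        In the scheme the abstract describes, preprocessing and tensorization
        happen on external machines; the GPU side only receives ready-made
        tensors. Here we fake the download with random token ids.
        """
        return torch.randint(0, 32000, (1024,), dtype=torch.long)


    class MixedCorpusDataset(IterableDataset):
        """Streams chunks from several corpora according to mixing ratios."""

        def __init__(self, corpus_urls: List[str], ratios: List[float]) -> None:
            assert len(corpus_urls) == len(ratios)
            total = sum(ratios)
            self.corpus_urls = corpus_urls
            self.weights = [r / total for r in ratios]  # normalize the ratios

        def __iter__(self) -> Iterator[torch.Tensor]:
            index = 0
            while True:
                # Pick a corpus in proportion to its mixing ratio, then fetch
                # the next pre-tensorized chunk from it on demand.
                (url,) = random.choices(self.corpus_urls, weights=self.weights)
                yield fetch_chunk(url, index)
                index += 1


    # Usage: 70% web text, 30% code, streamed straight into a training loop.
    dataset = MixedCorpusDataset(
        ["https://example.org/webtext", "https://example.org/code"],
        ratios=[0.7, 0.3],
    )
    loader = DataLoader(dataset, batch_size=8)
    batch = next(iter(loader))  # (8, 1024) LongTensor of token ids
    print(batch.shape)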