Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
38th (2024)
Session ID : 2O1-GS-3-02

KOGITUNE: Distributed Dataset Framework for Training Large Language Models
*Nao SOUMA, Momoka OBARA, Kimio KURAMITSU, Takahiro KATAGIRI, Yasuhiko YOKOTE, Yutaka ISHIKAWA
Abstract

The performance of large language models depends on massive, high-quality preprocessed datasets, often exceeding hundreds of gigabytes. Because developing datasets at this scale is difficult for a single organization, a distributed framework that spans multiple organizations is needed. KOGITUNE is designed to facilitate the training of large language models (LLMs) with such distributed datasets. Its main idea is to perform dataset preprocessing and tensorization independently on external machines and to deliver the resulting tensors on demand to the GPU side, thereby keeping GPU utilization high during training. KOGITUNE also provides practical features such as adjusting the mixing ratios of multiple corpora. This paper presents the design and implementation of KOGITUNE and reports our experience of developing LLMs ranging from 0.06B to 1.3B parameters with it.
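The on-demand delivery and corpus-mixing ideas can be illustrated with a small sketch. The Python code below is not the actual KOGITUNE API; the URLs, chunk layout, and class names are hypothetical assumptions. It shows a GPU-side iterable dataset that streams pre-tensorized token chunks prepared on external machines and interleaves several corpora according to configurable mixing ratios.

    # Minimal sketch (assumed names and layout, not the KOGITUNE API).
    import io
    import random
    import urllib.request

    import numpy as np
    import torch
    from torch.utils.data import IterableDataset


    class RemoteTensorCorpus:
        """One corpus whose token chunks were tensorized ahead of time on an external machine."""

        def __init__(self, base_url, num_chunks, block_size=512):
            self.base_url = base_url      # hypothetical location of pre-tensorized chunks
            self.num_chunks = num_chunks
            self.block_size = block_size

        def fetch_chunk(self, index):
            # Download one chunk on demand (assumed layout: a flat .npy array of token ids).
            url = f"{self.base_url}/chunk_{index:05d}.npy"
            with urllib.request.urlopen(url) as resp:
                buf = io.BytesIO(resp.read())
            tokens = np.load(buf)
            return torch.from_numpy(tokens.astype(np.int64))

        def blocks(self):
            # Yield fixed-length training blocks from each fetched chunk.
            for i in range(self.num_chunks):
                tokens = self.fetch_chunk(i)
                for start in range(0, len(tokens) - self.block_size, self.block_size):
                    yield tokens[start:start + self.block_size]


    class MixedCorpusDataset(IterableDataset):
        """Interleave blocks from several corpora in proportion to their mixing ratios."""

        def __init__(self, corpora, ratios, seed=42):
            assert len(corpora) == len(ratios)
            self.corpora = corpora
            self.ratios = ratios
            self.seed = seed

        def __iter__(self):
            rng = random.Random(self.seed)
            streams = [c.blocks() for c in self.corpora]
            ratios = list(self.ratios)
            while streams:
                # Choose the next corpus with probability proportional to its ratio.
                (i,) = rng.choices(range(len(streams)), weights=ratios, k=1)
                try:
                    yield next(streams[i])
                except StopIteration:
                    del streams[i], ratios[i]


    # Hypothetical usage: two corpora mixed 1:3, consumed by a standard DataLoader.
    corpora = [
        RemoteTensorCorpus("https://example.org/corpus_a", num_chunks=100),
        RemoteTensorCorpus("https://example.org/corpus_b", num_chunks=400),
    ]
    dataset = MixedCorpusDataset(corpora, ratios=[1.0, 3.0])
    loader = torch.utils.data.DataLoader(dataset, batch_size=8)

Because chunk fetching happens lazily inside the iterator, the expensive preprocessing stays on the external machines and the GPU process only pays for network transfer of already-tensorized data, which is the utilization argument sketched in the abstract.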

© 2024 The Japanese Society for Artificial Intelligence