Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
38th (2024)
Session ID : 2O1-GS-3-02

KOGITUNE: Distributed Dataset Framework for Training Large Language Models
*Nao SOUMA, Momoka OBARA, Kimio KURAMITSU, Takahiro KATAGIRI, Yasuhiko YOKOTE, Yutaka ISHIKAWA
Abstract

The performance of large language models depends on massive, high-quality preprocessed datasets, often exceeding hundreds of gigabytes. Because developing datasets at this scale is difficult for a single organization, a distributed framework that spans multiple organizations is needed. KOGITUNE is designed to facilitate the training of large language models (LLMs) with such distributed datasets. Its main idea is to perform dataset preprocessing and tensorization independently on external machines and to deliver the resulting tensors on demand to the GPU side, thereby keeping GPU utilization high during training. KOGITUNE also provides practical features such as adjusting the mixing ratios of multiple corpora. This paper presents the design and implementation of KOGITUNE and reports our experience of developing LLMs ranging from 0.06B to 1.3B parameters with it.
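The on-demand delivery and corpus-mixing ideas can be illustrated with a small sketch. The Python code below is not the actual KOGITUNE API; the URLs, chunk layout, and class names are hypothetical assumptions. It shows a GPU-side iterable dataset that streams pre-tensorized token chunks prepared on external machines and interleaves several corpora according to configurable mixing ratios.

    # Minimal sketch (assumed names and layout, not the KOGITUNE API).
    import io
    import random
    import urllib.request

    import numpy as np
    import torch
    from torch.utils.data import IterableDataset


    class RemoteTensorCorpus:
        """One corpus whose token chunks were tensorized ahead of time on an external machine."""

        def __init__(self, base_url, num_chunks, block_size=512):
            self.base_url = base_url      # hypothetical location of pre-tensorized chunks
            self.num_chunks = num_chunks
            self.block_size = block_size

        def fetch_chunk(self, index):
            # Download one chunk on demand (assumed layout: a flat .npy array of token ids).
            url = f"{self.base_url}/chunk_{index:05d}.npy"
            with urllib.request.urlopen(url) as resp:
                buf = io.BytesIO(resp.read())
            tokens = np.load(buf)
            return torch.from_numpy(tokens.astype(np.int64))

        def blocks(self):
            # Yield fixed-length training blocks from each fetched chunk.
            for i in range(self.num_chunks):
                tokens = self.fetch_chunk(i)
                for start in range(0, len(tokens) - self.block_size, self.block_size):
                    yield tokens[start:start + self.block_size]


    class MixedCorpusDataset(IterableDataset):
        """Interleave blocks from several corpora in proportion to their mixing ratios."""

        def __init__(self, corpora, ratios, seed=42):
            assert len(corpora) == len(ratios)
            self.corpora = corpora
            self.ratios = ratios
            self.seed = seed

        def __iter__(self):
            rng = random.Random(self.seed)
            streams = [c.blocks() for c in self.corpora]
            ratios = list(self.ratios)
            while streams:
                # Choose the next corpus with probability proportional to its ratio.
                (i,) = rng.choices(range(len(streams)), weights=ratios, k=1)
                try:
                    yield next(streams[i])
                except StopIteration:
                    del streams[i], ratios[i]


    # Hypothetical usage: two corpora mixed 1:3, consumed by a standard DataLoader.
    corpora = [
        RemoteTensorCorpus("https://example.org/corpus_a", num_chunks=100),
        RemoteTensorCorpus("https://example.org/corpus_b", num_chunks=400),
    ]
    dataset = MixedCorpusDataset(corpora, ratios=[1.0, 3.0])
    loader = torch.utils.data.DataLoader(dataset, batch_size=8)

Because chunk fetching happens lazily inside the iterator, the expensive preprocessing stays on the external machines and the GPU process only pays for network transfer of already-tensorized data, which is the utilization argument sketched in the abstract.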

© 2024 The Japanese Society for Artificial Intelligence