2024 Volume 31 Issue 3 Pages 1076-1106
This study proposes a method for building language models for specific speakers. Character-specific utterances are in demand for interactive agents and games such as RPGs. However, the training data available for building a language model specialized for a specific character are limited. Therefore, using T5, we transform the utterances of other characters appearing in the same work as the target speaker into the speech style of the target speaker, thereby augmenting the training data. We fine-tuned GPT-2, the base language model, using domain-adaptive pretraining (DAPT) followed by task-adaptive pretraining (TAPT): the utterances of the target speaker served as the training data for TAPT, and the utterances of the characters in the work served as the training data for DAPT. To handle the diversity of the data, we added the character's name at the beginning of each utterance. Additionally, we manually rewrote the characters' utterances into general utterances, producing parallel data of character-specific and general utterances. We fine-tuned T5 on these parallel data and created two transform models: (A) a model that transforms general into character-specific speech style and (B) a model that transforms character-specific into general speech style. Using these models, we augmented the utterances of the target speaker in two ways: (1) we transformed the manually rewritten general utterances into the character-specific style of the target speaker using Model (A), and (2) we transformed the utterances of other characters in the same work into the speech style of the target speaker using Models (A) and (B). The experiments showed that the average perplexity of the language models for seven characters was 27.33 when GPT-2 was trained only on the utterances of the target speaker, whereas it was 21.15 with the proposed method, demonstrating improved performance.
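To make the pipeline concrete, the following is a minimal sketch (not the authors' code) of how the two T5 transform models and a perplexity check for the fine-tuned GPT-2 could be wired up with the Hugging Face transformers library. The model paths, the B-then-A ordering for augmentation path (2), and the helper functions are assumptions for illustration only; the paper does not specify its implementation details here.

```python
# Hedged sketch: T5 style-transfer augmentation + GPT-2 perplexity evaluation.
# All model names/paths below are placeholders, not the checkpoints used in the paper.
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSeq2SeqLM

device = "cuda" if torch.cuda.is_available() else "cpu"

def load_seq2seq(path):
    tok = AutoTokenizer.from_pretrained(path)
    model = AutoModelForSeq2SeqLM.from_pretrained(path).to(device).eval()
    return tok, model

# (A) general -> character-specific style, (B) character-specific -> general style.
tok_a, model_a = load_seq2seq("t5-general-to-character")   # placeholder path
tok_b, model_b = load_seq2seq("t5-character-to-general")   # placeholder path

def transform(tok, model, text):
    """Run one T5 style transform on a single utterance."""
    ids = tok(text, return_tensors="pt").input_ids.to(device)
    out = model.generate(ids, max_new_tokens=64)
    return tok.decode(out[0], skip_special_tokens=True)

# Augmentation path (2), under one plausible reading: an utterance by a different
# character is first neutralized with Model (B), then restyled with Model (A).
other_utterance = "..."  # an utterance by another character in the same work
general = transform(tok_b, model_b, other_utterance)
augmented = transform(tok_a, model_a, general)

# Perplexity of the fine-tuned GPT-2 on a held-out target-speaker utterance.
gpt_tok = AutoTokenizer.from_pretrained("gpt2-dapt-tapt-character")  # placeholder
gpt = AutoModelForCausalLM.from_pretrained("gpt2-dapt-tapt-character").to(device).eval()

def perplexity(text):
    ids = gpt_tok(text, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        loss = gpt(ids, labels=ids).loss  # mean token-level cross-entropy
    return math.exp(loss.item())
```

Averaging perplexity() over a held-out set of target-speaker utterances would reproduce the kind of comparison reported in the paper (target-speaker-only training vs. the proposed augmentation), though the exact evaluation setup is not detailed in the abstract.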