2026, Vol. 14, No. 1, pp. 110-118
We propose a novel text-controllable polyphonic symbolic music generation method based on diffusion models. Symbolic music generation has garnered significant attention because it produces MIDI files, which integrate seamlessly with Digital Audio Workstations (DAWs) and are far easier to edit than waveform audio. Although existing techniques enable control through chords or other metadata, few methods allow intuitive control via text prompts, which align more naturally with user preferences. To address this limitation, we introduce Text-Controllable Polyphonic Symbolic Music Generation (TPSMG), a diffusion model specifically designed for text-conditioned symbolic music generation. Our approach incorporates a text condition module into a U-Net backbone within a Denoising Diffusion Probabilistic Model (DDPM). This module translates text prompts into embeddings that steer the denoising process, thereby enabling precise, text-based control over music generation. Experimental results demonstrate that our method generates high-quality polyphonic symbolic music that closely reflects the intended textual input.
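To make the described mechanism concrete, the following is a minimal sketch of one text-conditioned DDPM training step on a piano-roll representation. The tiny convolutional network stands in for the paper's U-Net backbone, the random `text_emb` tensor stands in for the output of the text condition module, and the FiLM-style scale/shift injection is one common conditioning technique, not necessarily the one the authors use; all names and shapes here are illustrative assumptions.

```python
# Minimal sketch of a text-conditioned DDPM training step (PyTorch).
# Assumption: music is encoded as a piano roll of shape (batch, 1, pitches, frames).
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000                                    # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)       # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class TinyConditionalDenoiser(nn.Module):
    """Predicts the injected noise eps from (x_t, t, text embedding).
    A stand-in for the paper's U-Net, not the authors' architecture."""
    def __init__(self, text_dim=64, ch=32):
        super().__init__()
        self.t_emb = nn.Embedding(T, ch)          # learned timestep embedding
        self.film = nn.Linear(text_dim, 2 * ch)   # FiLM-style scale/shift from text
        self.inp = nn.Conv2d(1, ch, 3, padding=1)
        self.mid = nn.Conv2d(ch, ch, 3, padding=1)
        self.out = nn.Conv2d(ch, 1, 3, padding=1)

    def forward(self, x_t, t, text_emb):
        h = self.inp(x_t) + self.t_emb(t)[:, :, None, None]
        scale, shift = self.film(text_emb).chunk(2, dim=-1)
        h = h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        h = F.silu(self.mid(F.silu(h)))
        return self.out(h)

model = TinyConditionalDenoiser()
x0 = torch.rand(8, 1, 128, 256)      # toy batch of piano rolls in [0, 1]
text_emb = torch.randn(8, 64)        # placeholder for the text module's output
t = torch.randint(0, T, (8,))

# Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
eps = torch.randn_like(x0)
abar = alphas_bar[t][:, None, None, None]
x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps

# Standard DDPM objective: regress the noise, with the text embedding
# steering the denoiser exactly as the abstract describes.
loss = F.mse_loss(model(x_t, t, text_emb), eps)
loss.backward()
print(loss.item())
```

At sampling time the same conditioned denoiser would be applied iteratively from pure noise down to a clean piano roll, with the fixed text embedding steering every reverse step.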