2026 Volume E109.D Issue 5 Pages 695-706
Accurately controlling the output length of large language models (LLMs) remains a non-trivial challenge: many existing approaches offer limited reliability or incur additional architectural and inference-time costs. In real-world applications such as news summarization and dialog systems, failure to adhere to user-specified length constraints significantly degrades system reliability. This paper addresses this gap by applying Group Relative Policy Optimization (GRPO), a stable, value-function-free reinforcement learning algorithm, to efficiently fine-tune LLMs for prompt-based length control without any architectural modification. We systematically compare four reward functions: a simple binary threshold (BLTR), a linear deviation penalty (PLR), and two novel proximity-aware variants with linear (LLPR) and exponential (ELPR) decay, which are designed to reward not only constraint satisfaction but also proximity to the target length. Experiments on the CNNDM (English) and XL-Sum (Japanese) datasets with 1-billion-parameter models show that our GRPO-based approach substantially improves length adherence. On Llama-3.2-1B-Instruct, the saturating PLR reward achieved the highest binary adherence (BLTR score: 0.705), whereas our proximity-aware ELPR reward achieved strong adherence (0.612) while markedly improving target proximity (LLPR score: from -24.994 to -2.293). Notably, on Gemma-3-1b-it, ELPR outperformed PLR on all metrics. Our analysis suggests that ELPR offers a strong balance of stability and performance. The results indicate that continuous, proximity-aware rewards may be more effective than simple binary signals for achieving robust and practical length control, highlighting a promising direction for future reward design.
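To illustrate why a proximity-aware reward can carry more learning signal than a binary one under GRPO, the following minimal sketch computes group-relative advantages for one group of sampled outputs. The abstract does not give the reward formulas, so both reward functions here are assumed, illustrative forms and not the paper's BLTR or ELPR definitions: the constraint is assumed to be an upper limit on length, and the names `binary_threshold_reward`, `exp_proximity_reward`, and the `scale` parameter are hypothetical. The advantage computation follows the standard GRPO normalization (reward minus group mean, divided by group standard deviation); using the population standard deviation is also an assumption.

```python
import math
from statistics import mean, pstdev


def binary_threshold_reward(length: int, target: int) -> float:
    """Assumed binary form: 1.0 if the output fits within the target, else 0.0."""
    return 1.0 if length <= target else 0.0


def exp_proximity_reward(length: int, target: int, scale: float = 20.0) -> float:
    """Assumed proximity-aware form (hypothetical, not the paper's ELPR).

    Outputs over the limit get 0.0; outputs within the limit get a reward
    that decays exponentially with the shortfall (target - length).
    """
    if length > target:
        return 0.0
    return math.exp(-(target - length) / scale)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: normalize each reward against its own group.

    No value function is needed; the group mean serves as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an implementation choice
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    target = 50
    lengths = [30, 48, 50, 65]  # lengths of four sampled outputs for one prompt

    binary = [binary_threshold_reward(n, target) for n in lengths]
    proximity = [exp_proximity_reward(n, target) for n in lengths]

    print("binary advantages:   ", group_relative_advantages(binary))
    print("proximity advantages:", group_relative_advantages(proximity))
```

Under the binary reward, the three outputs that satisfy the limit receive identical advantages, so the policy gets no signal to prefer the 50-token output over the 30-token one. Under the assumed exponential form, the advantages are ordered by closeness to the target, which is the kind of gradient toward proximity that the proximity-aware rewards are meant to provide.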