Journal of Advanced Computational Intelligence and Intelligent Informatics
Online ISSN : 1883-8014
Print ISSN : 1343-0130
ISSN-L : 1883-8014
Regular Papers
Interactive Image Caption Generation Reflecting User Intent from Trace Using a Diffusion Language Model
Satoko Hirano, Ichiro Kobayashi

2025 Volume 29 Issue 6 Pages 1417-1426

Abstract

This study proposes an image captioning method that incorporates user-specific explanatory intentions into the generated text, as signaled by the user’s trace on the image. We extract areas of interest from dense sections of the trace, determine the order of explanation by tracking changes in the pen-tip coordinates, and assess the degree of interest in each area by analyzing the time spent on it. Additionally, a diffusion language model is used to generate sentences in a non-autoregressive manner, allowing sentence length to be controlled based on the temporal data of the trace. In caption generation experiments, the proposed method achieved higher string similarity than conventional methods, including autoregressive models, and successfully captured user intent from the trace and faithfully reflected it in the generated text.
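The trace-processing steps described above (dense areas of interest, explanation order from pen-tip movement, interest from dwell time) can be sketched roughly as follows. This is a minimal illustrative assumption on our part, using a simple grid-density heuristic; the function name, cell size, and thresholds are hypothetical and not the paper’s actual pipeline.

```python
from collections import defaultdict

def trace_to_regions(trace, cell=50, min_points=5):
    """Group pen-tip samples (x, y, t) into dense grid cells, order the
    resulting areas of interest by first-visit time, and score interest
    by dwell time (time span of the samples in each cell).

    Illustrative sketch only: the grid heuristic and all parameters are
    assumptions, not the method proposed in the paper.
    """
    cells = defaultdict(list)
    for x, y, t in trace:
        cells[(x // cell, y // cell)].append(t)

    regions = []
    for key, times in cells.items():
        if len(times) >= min_points:  # keep only dense sections of the trace
            regions.append({
                "cell": key,
                "first_visit": min(times),
                "dwell": max(times) - min(times),  # proxy for degree of interest
            })

    # Explanation order follows the pen's first visit to each dense area.
    regions.sort(key=lambda r: r["first_visit"])
    return regions

# Hypothetical trace: the pen lingers over two areas in sequence.
trace = [(10 + i, 10, i) for i in range(6)] + [(120 + i, 30, 10 + i) for i in range(6)]
regions = trace_to_regions(trace)
```

In a full system, the per-region dwell times would then condition the diffusion language model, e.g. by mapping longer dwell to a longer target sentence length for that region.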


© 2025 Fuji Technology Press Ltd.

This article is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International license (https://creativecommons.org/licenses/by-nd/4.0/).
The journal is fully Open Access under Creative Commons licenses, and all articles are free to access at the JACIII official website.