2020 年 59 巻 6 号 p. 591-600
For communication between humans and intelligent agents such as robots, it is an important issue for agents to tell humans what they see. In this article, we introduce the results of our research on the generation of sentences that not only refer to objects correctly but also let humans find them quickly. If the target is not salient, finding the target itself becomes difficult. Therefore, we designed the model to utilize the salient contexts around it (e.g. “beside a car”) to help humans to find the targets. Moreover, we optimized the generation of sentences that are easily understood by using the time required to locate the referred objects by humans and their accuracies. To evaluate our system, we created a new dataset using images from Grand Theft Auto V (GTA V). Experimental results showed that our system generated sentences that are easily comprehended by humans, especially for less salient targets.