2020 Volume 27 Issue 2 Pages 257-279
This paper introduces a new task, generating recipes from photo sequences, and proposes a method for it. The goal is to let users obtain multimedia recipes simply by taking photographs. For this purpose, the output texts should include expressions with important terms that make sense as instructions. However, existing methods proposed for “Visual storytelling” do not take these expressions into account. To select expressions with important terms for describing each photo, the proposed method combines a retrieval method with a generation model. The method was implemented and evaluated on Japanese cooking recipes, and the experimental results confirmed that it outperforms standard baselines.