2023 Volume 30 Issue 3 Pages 1042-1060
We present Visual Recipe Flow, a new multimodal dataset that enables learning the result of each cooking action in a recipe text. The dataset consists of object state changes and the workflow of the recipe text. Each state change is represented as an image pair, while the workflow is represented as a recipe flow graph (r-FG). We explain the data collection and annotation procedure and evaluate the dataset by measuring inter-annotator agreement. Finally, we investigate the importance of each annotation component by conducting multimodal information retrieval experiments.
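As a rough illustration of the two annotation components described above, the following minimal sketch models a recipe flow graph as a labeled directed graph and attaches a before/after image pair to an action node. It is not the dataset's actual schema: the class names, field names, edge label, and file names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class StateChange:
    """Hypothetical container for one object state change (an image pair)."""
    before_image: str  # path to the image taken before the action (assumed)
    after_image: str   # path to the image taken after the action (assumed)


@dataclass
class RecipeFlowGraph:
    """Hypothetical, simplified r-FG: text spans as nodes, labeled directed edges."""
    nodes: Dict[int, str] = field(default_factory=dict)
    edges: List[Tuple[int, int, str]] = field(default_factory=list)
    state_changes: Dict[int, StateChange] = field(default_factory=dict)

    def add_node(self, node_id: int, text: str) -> None:
        self.nodes[node_id] = text

    def add_edge(self, src: int, dst: int, label: str) -> None:
        # Both endpoints must already exist so the graph stays consistent.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((src, dst, label))

    def successors(self, node_id: int) -> List[int]:
        # Nodes reachable from node_id by one outgoing edge.
        return [dst for src, dst, _ in self.edges if src == node_id]


# Toy example: the ingredient "tomato" is the target of the action "chop".
graph = RecipeFlowGraph()
graph.add_node(0, "tomato")
graph.add_node(1, "chop")
graph.add_edge(0, 1, "target")  # illustrative label, not the dataset's tag set
graph.state_changes[1] = StateChange("tomato_before.jpg", "tomato_after.jpg")
```

Keeping the image pairs in a mapping keyed by node identifier keeps the text workflow and the visual annotations separable, which is convenient if one wants to study each component on its own.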