Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
38th (2024)
Session ID : 4Xin2-52

Unedited Fixed-viewpoint Procedural Videos with Language Resources for Understanding Cooking Activities
*Atsushi HASHIMOTO, Koki MAEDA, Tosho HIRASAWA, Jun HARASHIMA, Leszek RYBICKI, Yusuke FUKASAWA, Yoshitaka USHIKU
CONFERENCE PROCEEDINGS FREE ACCESS

Abstract

Large-scale web videos have contributed significantly to recent progress in video analysis techniques. At the same time, the domain gap between web videos and unedited videos still limits vision-language applications to those that pair text with edited videos. Egocentric vision datasets are being actively collected to overcome this problem; as another format of unedited video, this paper provides a dataset of fixed-viewpoint unedited videos (FV videos), which can be recorded easily with commercial smartphones. We collected 145 videos, a total of 40 hours of footage, in which participants prepare food following given recipes. We manually annotated action graphs that tie the videos to the procedural texts while identifying the workflow of the process. In addition, we propose two benchmark tasks on this dataset: online recipe retrieval (OnRR) and dense video captioning on FV videos (DVC-FV). Experimental results demonstrate that recent state-of-the-art (SoTA) methods cannot trivially solve OnRR or DVC-FV.

© 2024 The Japanese Society for Artificial Intelligence