Construction of a Fixed-Viewpoint Video Dataset with Language Resources for Understanding Cooking Activities

橋本 敦史; 前田 航希; 平澤 寅庄; 原島 純; RYBICKI Leszek; 深澤 祐援; 牛久 祥孝

doi:10.11517/pjsai.JSAI2024.0_4Xin252

38th (2024)

Session ID : 4Xin2-52

DOI https://doi.org/10.11517/pjsai.JSAI2024.0_4Xin252

Conference information

Host: The Japanese Society for Artificial Intelligence

Name : The 38th Annual Conference of the Japanese Society for Artificial Intelligence

Number : 38

Location : [in Japanese]

Date : May 28, 2024 - May 31, 2024

Unedited Fixed-viewpoint Procedural Videos with Language Resources for Understanding Cooking Activities

*Atsushi HASHIMOTO, Koki MAEDA, Tosho HIRASAWA, Jun HARASHIMA, Leszek RYBICKI, Yusuke FUKASAWA, Yoshitaka USHIKU

Keywords: Vision and Language, Procedural Text Understanding, Video Analysis

CONFERENCE PROCEEDINGS FREE ACCESS

Abstract

Large-scale web videos have contributed significantly to recent progress in video analysis. However, the domain gap between web videos and unedited videos still limits vision-language applications largely to pairings of text and edited video. Egocentric-vision datasets are being actively collected to address this problem; this paper provides a dataset in another unedited format, fixed-viewpoint videos (FV videos), which can be recorded easily with an off-the-shelf smartphone. We collected 145 videos, totaling 40 hours of footage, in which participants prepare food by following given recipes. We manually annotated action graphs that link the videos to the procedural texts and identify the workflow of each process. In addition, we propose two benchmark tasks on this dataset: online recipe retrieval (OnRR) and dense video captioning on FV videos (DVC-FV). Experimental results show that recent state-of-the-art methods cannot trivially solve OnRR or DVC-FV.

Corresponding author
