Building a Video-and-Language Dataset with Human Actions for Multimodal Inference

Mai Yokozeki; Natsuki Murakami; Riko Suzuki; Hitomi Yanaka; Koji Mineshima; Daisuke Bekki

doi:10.11517/pjsai.JSAI2021.0_4I1GS7b01

35th (2021)

Session ID : 4I1-GS-7b-01

Conference information

Host: The Japanese Society for Artificial Intelligence

Name : The 35th Annual Conference of the Japanese Society for Artificial Intelligence

Number : 35

Location : [in Japanese]

Date : June 08, 2021 - June 11, 2021

Building a Video-and-Language Dataset with Human Actions for Multimodal Inference

*Mai YOKOZEKI, Natsuki MURAKAMI, Riko SUZUKI, Hitomi YANAKA, Koji MINESHIMA, Daisuke BEKKI

Keywords: Multimodal Inference, Visual Textual Entailment, Video Dataset

Abstract

This paper introduces a new video-and-language dataset with human actions for multimodal inference. The dataset consists of 200 videos, 5554 action labels, and 1942 action triplets of the form <subject, action, object>. The action labels include a variety of expressions, such as aspectual and intentional phrases, that are characteristic of videos but do not appear in existing video and image datasets. We expect the dataset to be useful for evaluating multimodal inference systems that judge the relation between a video and semantically complex sentences, such as those involving negation and quantity expressions.
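
As an illustration of the <subject, action, object> form mentioned in the abstract, the following is a minimal Python sketch of how such triplets might be represented and queried. It is not the authors' format or method: the names `ActionTriplet` and `holds`, the optional object field, and the example annotations are all hypothetical and were invented for this sketch. Treating a missing triplet as a true negation is a closed-world assumption made here only for simplicity.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class ActionTriplet:
    """Hypothetical container for one <subject, action, object> annotation."""
    subject: str
    action: str
    obj: Optional[str] = None  # assumption: None when the action takes no object


def holds(annotations: Set[ActionTriplet], query: ActionTriplet,
          negated: bool = False) -> bool:
    """Check a (possibly negated) triplet against one video's annotations.

    Assumption: closed world, so a triplet that is not annotated is treated
    as not happening in the video.
    """
    found = query in annotations
    return not found if negated else found


# Invented example annotations for a single video (not taken from the dataset).
video_annotations = {
    ActionTriplet("person", "open", "door"),
    ActionTriplet("person", "sit"),
}

opens_door = holds(video_annotations, ActionTriplet("person", "open", "door"))
does_not_close_door = holds(
    video_annotations, ActionTriplet("person", "close", "door"), negated=True
)
print(opens_door, does_not_close_door)
```

A real evaluation would compare a system's judgment on a video and a sentence against a gold label; the lookup above only shows how negated statements could be tested against triplet annotations under the stated assumption.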
