Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
35th (2021)
Session ID : 4I1-GS-7b-01

Building a Video-and-Language Dataset with Human Actions for Multimodal Inference
*Mai YOKOZEKI, Natsuki MURAKAMI, Riko SUZUKI, Hitomi YANAKA, Koji MINESHIMA, Daisuke BEKKI
Abstract

This paper introduces a new video-and-language dataset with human actions for multimodal inference. The dataset consists of 200 videos, 5554 action labels, and 1942 action triplets of the form <subject, action, object>. The action labels contain various expressions, such as aspectual and intentional phrases, that are characteristic of videos but do not appear in existing video and image datasets. The dataset is expected to be used for evaluating multimodal inference systems between videos and semantically complex sentences, such as those involving negation and quantification.
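As a minimal sketch of how one video's annotations in such a dataset might be represented, the Python fragment below pairs free-form action labels with structured <subject, action, object> triplets. The class and field names (ActionTriplet, VideoAnnotation, video_id) and the example values are illustrative assumptions, not the authors' actual data format.

    # Hypothetical representation of one annotated video (illustrative only).
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class ActionTriplet:
        subject: str  # e.g. "person"
        action: str   # e.g. "open"
        obj: str      # e.g. "refrigerator"

    @dataclass
    class VideoAnnotation:
        video_id: str
        action_labels: List[str]       # free-form labels, incl. aspectual/intentional phrases
        triplets: List[ActionTriplet]  # structured <subject, action, object> triples

    # Invented example entry, for illustration of the structure only.
    example = VideoAnnotation(
        video_id="video_0001",
        action_labels=["A person is about to open the refrigerator."],
        triplets=[ActionTriplet(subject="person", action="open", obj="refrigerator")],
    )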

© 2021 The Japanese Society for Artificial Intelligence