Construction and Validation of Action-Conditioned VideoGPT

Koudai Tabata; Junnosuke Kamohara; Ryosuke Unno; Makoto Sato; Taiju Watanabe; Taiga Kume; Masahiro Negishi; Ryo Okada; Yusuke Iwasawa; Yutaka Matsuo

doi:10.11517/pjsai.JSAI2023.0_1G4OS21a02

37th (2023)

Session ID : 1G4-OS-21a-02

Conference information

Host: The Japanese Society for Artificial Intelligence

Name : The 37th Annual Conference of the Japanese Society for Artificial Intelligence

Number : 37

Location : [in Japanese]

Date : June 06, 2023 - June 09, 2023

Construction and Validation of Action-Conditioned VideoGPT

*Koudai TABATA, Junnosuke KAMOHARA, Ryosuke UNNO, Makoto SATO, Taiju WATANABE, Taiga KUME, Masahiro NEGISHI, Ryo OKADA, Yusuke IWASAWA, Yutaka MATSUO

Author information

*Koudai TABATA
The University of Tokyo
Matsuo Institute
Junnosuke KAMOHARA
Tohoku University
Matsuo Institute
Ryosuke UNNO
The University of Tokyo
Matsuo Institute
Makoto SATO
Nara Institute of Science and Technology
Matsuo Institute
Taiju WATANABE
Waseda University
Matsuo Institute
Taiga KUME
Keio University
Matsuo Institute
Masahiro NEGISHI
The University of Tokyo
Matsuo Institute
Ryo OKADA
The University of Tokyo
Matsuo Institute
Yusuke IWASAWA
The University of Tokyo
Yutaka MATSUO
The University of Tokyo

Keywords: World Models, Conditioned video prediction


Abstract

World models learn the structure of the external world from observations and can predict how its future states change in response to an agent's actions. Recent advances in generative models and language models have contributed to multi-modal world models, which are expected to be applied in various domains, including autonomous driving and robotics. Video prediction is a field that has progressed toward high-fidelity, long-term prediction, and it has potential applications in world models for acquiring temporal representations. One architecture that has performed well combines an encoder-decoder latent variable model for image reconstruction with an autoregressive model that predicts the latent sequence. In this work, we extend VideoGPT, a video prediction model that uses VQ-VAE and Image-GPT, by introducing action conditioning. Validation on CARLA and RoboNet showed improved performance compared with the model without conditioning.
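
The two-stage design described in the abstract (discrete latent codes from a vector-quantized autoencoder, then an autoregressive model over those codes, conditioned on actions) can be illustrated with a toy sketch. Everything below is a hypothetical illustration and not the authors' implementation: the abstract does not say how actions are injected, so the sketch simply assumes an action-dependent bias added to the next-token logits. It also uses one token per frame and a lookup table in place of a transformer, and the names `quantize` and `ToyActionConditionedPrior` are invented for this example.

```python
import math
import random


def quantize(vectors, codebook):
    """Map each continuous vector to the index of its nearest codebook entry.

    This mirrors the vector-quantization step of a VQ-VAE encoder.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
        for v in vectors
    ]


class ToyActionConditionedPrior:
    """Toy stand-in for an autoregressive prior over latent tokens.

    Assumption: conditioning is an action-dependent bias on the logits.
    A real model would be a transformer over many tokens per frame.
    """

    def __init__(self, num_codes, num_actions, seed=0):
        rng = random.Random(seed)
        self.num_codes = num_codes
        # Randomly initialised (untrained) logit tables, for illustration only.
        self.token_logits = [
            [rng.gauss(0, 1) for _ in range(num_codes)] for _ in range(num_codes)
        ]
        self.action_logits = [
            [rng.gauss(0, 1) for _ in range(num_codes)] for _ in range(num_actions)
        ]

    def next_token_probs(self, prev_token, action):
        """Softmax over (previous-token logits + action bias)."""
        logits = [
            t + a
            for t, a in zip(self.token_logits[prev_token], self.action_logits[action])
        ]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def rollout(self, start_token, actions):
        """Greedily predict one latent token per action, feeding each back in."""
        tokens = [start_token]
        for action in actions:
            probs = self.next_token_probs(tokens[-1], action)
            tokens.append(max(range(self.num_codes), key=probs.__getitem__))
        return tokens


# Hypothetical 2-D codebook and per-frame encoder outputs.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frame_latents = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]]

tokens = quantize(frame_latents, codebook)  # [0, 1, 2]
prior = ToyActionConditionedPrior(num_codes=4, num_actions=2)
predicted = prior.rollout(tokens[-1], actions=[0, 1, 1])
```

In a full pipeline the predicted tokens would be passed through the autoencoder's decoder to reconstruct future frames. The unconditioned baseline mentioned in the abstract corresponds to dropping the action term from the logits.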
