Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
38th (2024)
Session ID : 3D1-GS-2-05

SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State Spaces
*Yuta OSHIMA, Shohei TANIGUCHI, Masahiro SUZUKI, Yutaka MATSUO
Abstract

Recent video diffusion models have utilized attention layers to extract temporal features. However, attention layers are limited by their memory consumption, which increases quadratically with sequence length. This limitation presents challenges when attempting to generate longer video sequences. To overcome this challenge, we propose leveraging state-space models (SSMs). SSMs have recently gained attention as viable alternatives because their memory consumption scales linearly with sequence length. In the experiments, we first evaluate our SSM-based model on UCF101. In this scenario, our approach outperforms attention-based models in terms of Fréchet Video Distance (FVD). In addition, to investigate the potential of SSMs for longer video generation, we perform an experiment using the MineRL Navigate dataset. In this setting, our SSM-based model reduces memory consumption for longer sequences while maintaining competitive FVD scores.
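For intuition on the linear-memory claim, the sketch below shows a diagonal state-space recurrence applied along the temporal axis of video features. It is a minimal illustration, not the authors' architecture: the class name `TemporalSSM`, the parameterization, and the tensor shapes are assumptions, and real SSM layers (e.g., S4 or Mamba) use structured initializations and discretization schemes omitted here.

```python
# Minimal sketch (illustrative, not the paper's implementation) of a
# diagonal SSM over the temporal axis: h_t = A*h_{t-1} + B*x_t, y_t = C*h_t.
# The recurrent state is fixed-size, so memory grows linearly with the
# sequence length T, unlike attention's T x T score matrix.
import torch
import torch.nn as nn

class TemporalSSM(nn.Module):  # hypothetical name for illustration
    def __init__(self, channels: int, state_dim: int = 16):
        super().__init__()
        # Hypothetical parameterization; S4-style layers would use a
        # HiPPO initialization and a principled discretization instead.
        self.log_A = nn.Parameter(torch.randn(channels, state_dim) - 2.0)
        self.B = nn.Parameter(torch.randn(channels, state_dim) * 0.1)
        self.C = nn.Parameter(torch.randn(channels, state_dim) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels) video features, flattened over space
        A = -torch.exp(self.log_A)      # stable negative-real poles
        A_bar = torch.exp(A)            # per-step decay in (0, 1)
        b, t, c = x.shape
        h = x.new_zeros(b, c, self.B.shape[-1])
        ys = []
        for step in range(t):           # O(T) time, fixed-size state
            h = A_bar * h + self.B * x[:, step].unsqueeze(-1)
            ys.append((h * self.C).sum(-1))
        return torch.stack(ys, dim=1)   # (batch, time, channels)
```

In the setting the abstract describes, a layer of this kind would stand in for the temporal attention layer of a video diffusion model, operating independently at each spatial location; the sequential loop here is for clarity, whereas practical SSM implementations compute the same recurrence with a convolution or parallel scan.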

© 2024 The Japanese Society for Artificial Intelligence