Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
38th (2024)
Session ID : 3D1-GS-2-05

SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State Spaces
*Yuta OSHIMA, Shohei TANIGUCHI, Masahiro SUZUKI, Yutaka MATSUO
Abstract

Recent video diffusion models have utilized attention layers to extract temporal features. However, attention layers are limited by their memory consumption, which increases quadratically with sequence length. This limitation presents challenges when attempting to generate longer video sequences. To overcome this challenge, we propose leveraging state-space models (SSMs). SSMs have recently gained attention as viable alternatives because their memory consumption scales linearly with sequence length. In the experiments, we first evaluate our SSM-based model on UCF101. In this scenario, our approach outperforms attention-based models in terms of Fréchet Video Distance (FVD). In addition, to investigate the potential of SSMs for longer video generation, we perform an experiment using the MineRL Navigate dataset. In this setting, our SSM-based model reduces memory consumption for longer sequences while maintaining competitive FVD scores.
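For intuition on the linear-memory claim, the sketch below shows a diagonal state-space recurrence applied along the temporal axis of video features. It is a minimal illustration, not the authors' architecture: the class name `TemporalSSM`, the parameterization, and the tensor shapes are assumptions, and real SSM layers (e.g., S4 or Mamba) use structured initializations and discretization schemes omitted here.

```python
# Minimal sketch (illustrative, not the paper's implementation) of a
# diagonal SSM over the temporal axis: h_t = A*h_{t-1} + B*x_t, y_t = C*h_t.
# The recurrent state is fixed-size, so memory grows linearly with the
# sequence length T, unlike attention's T x T score matrix.
import torch
import torch.nn as nn

class TemporalSSM(nn.Module):  # hypothetical name for illustration
    def __init__(self, channels: int, state_dim: int = 16):
        super().__init__()
        # Hypothetical parameterization; S4-style layers would use a
        # HiPPO initialization and a principled discretization instead.
        self.log_A = nn.Parameter(torch.randn(channels, state_dim) - 2.0)
        self.B = nn.Parameter(torch.randn(channels, state_dim) * 0.1)
        self.C = nn.Parameter(torch.randn(channels, state_dim) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels) video features, flattened over space
        A = -torch.exp(self.log_A)      # stable negative-real poles
        A_bar = torch.exp(A)            # per-step decay in (0, 1)
        b, t, c = x.shape
        h = x.new_zeros(b, c, self.B.shape[-1])
        ys = []
        for step in range(t):           # O(T) time, fixed-size state
            h = A_bar * h + self.B * x[:, step].unsqueeze(-1)
            ys.append((h * self.C).sum(-1))
        return torch.stack(ys, dim=1)   # (batch, time, channels)
```

In the setting the abstract describes, a layer of this kind would stand in for the temporal attention layer of a video diffusion model, operating independently at each spatial location; the sequential loop here is for clarity, whereas practical SSM implementations compute the same recurrence with a convolution or parallel scan.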

© 2024 The Japanese Society for Artificial Intelligence