Host: The Japanese Society for Artificial Intelligence
Name : The 38th Annual Conference of the Japanese Society for Artificial Intelligence
Number : 38
Location : [in Japanese]
Date : May 28, 2024 - May 31, 2024
Recent video diffusion models have utilized attention layers to extract temporal features. However, attention layers are limited by their memory consumption, which grows quadratically with sequence length. This limitation poses a challenge when generating longer video sequences. To overcome it, we propose leveraging state-space models (SSMs), which have recently gained attention as viable alternatives because their memory consumption grows only linearly with sequence length. In our experiments, we first evaluate our SSM-based model on UCF101, where it outperforms attention-based models in terms of Fréchet Video Distance (FVD). In addition, to investigate the potential of SSMs for longer video generation, we conduct an experiment on the MineRL Navigate dataset. In this setting, our SSM-based model reduces memory consumption for longer sequences while maintaining competitive FVD scores.
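As a rough illustration of why SSMs scale linearly in sequence length, the following is a minimal sketch of a diagonal state-space recurrence (not the model used in this work; all shapes and variable names are illustrative). The state update keeps only a fixed-size state per step, so memory is O(T) in the sequence length T, whereas self-attention materializes a T × T score matrix, giving O(T²) memory.

```python
import numpy as np

def ssm_scan(u, A, B, C):
    """Sequential SSM scan over a length-T input sequence.

    u: (T, d_in) inputs; A: (d_state,) diagonal state transition;
    B: (d_state, d_in) input map; C: (d_out, d_state) readout.
    Implements x_t = A * x_{t-1} + B u_t, y_t = C x_t.
    Only a (d_state,) state is carried between steps, so per-step
    memory does not depend on T.
    """
    T = u.shape[0]
    x = np.zeros(A.shape[0])
    ys = np.empty((T, C.shape[0]))
    for t in range(T):
        x = A * x + B @ u[t]  # elementwise diagonal transition + input
        ys[t] = C @ x         # linear readout of the hidden state
    return ys

# Toy usage with illustrative dimensions
rng = np.random.default_rng(0)
T, d_in, d_state, d_out = 16, 4, 8, 4
A = np.full(d_state, 0.9)                    # stable diagonal dynamics
B = 0.1 * rng.normal(size=(d_state, d_in))
C = 0.1 * rng.normal(size=(d_out, d_state))
y = ssm_scan(rng.normal(size=(T, d_in)), A, B, C)
print(y.shape)  # (16, 4)
```

Practical SSM layers (e.g. S4-style models) compute this recurrence with parallel scans or convolutions for speed, but the memory argument above is the same.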