International Journal of Activity and Behavior Computing
Online ISSN: 2759-2871
Multimodal Foundation Model for Cross-Modal Retrieval and Activity Recognition Tasks
Koki Matsuishi, Kosuke Ukita, Tsuyoshi Okita
Journal, Open Access

2025, Volume 2025, Issue 3, pp. 1-25

Abstract
In recent years, the widespread adoption of wearable devices has highlighted the growing importance of behavior analysis using inertial measurement units (IMUs). While applications span diverse fields such as healthcare and robotics, recent studies have increasingly focused on multimodal analysis in addition to unimodal analysis. Several studies have proposed multimodal foundation models that incorporate first-person video and text data; however, these models still fall short of providing a detailed analysis of full-body human activity. To address this limitation, we propose the Activity Understanding and Representations Alignment - Multimodal Foundation Model (AURA-MFM), a foundation model that integrates four modalities: third-person video, motion capture, IMU, and text. By incorporating third-person video and motion capture data, the model enables a detailed and multidimensional understanding of human activity that first-person video alone fails to capture. Additionally, a Transformer-based IMU encoder is employed to enhance the model's overall performance. Experimental evaluations on retrieval and activity recognition tasks demonstrate that our model surpasses existing methods. Notably, in zero-shot classification for action recognition, our method achieved significantly higher performance, with an F1-score of 0.6226 and an accuracy of 0.7320, whereas the existing method recorded an F1-score of 0.0747 and an accuracy of 0.1961. The code is available at https://github.com/Okita-Laboratory/AURA-MFM.
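To make the abstract's description concrete, the sketch below illustrates the general idea of a Transformer-based IMU encoder whose embeddings live in a shared space with text embeddings, and of zero-shot activity classification by cosine similarity against embedded label descriptions. This is a minimal illustration under assumed module names, dimensions, and placeholder text embeddings, not the authors' implementation; the actual architecture and training code are in the linked repository.

```python
# Minimal sketch (assumptions, not the AURA-MFM code): a Transformer-based IMU
# encoder producing embeddings in a shared space, plus zero-shot classification
# by cosine similarity against text-label embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class IMUTransformerEncoder(nn.Module):
    """Encode an IMU window of shape (batch, time, channels) into a joint-space vector."""

    def __init__(self, in_channels=6, d_model=256, n_heads=4, n_layers=4, embed_dim=512):
        super().__init__()
        self.input_proj = nn.Linear(in_channels, d_model)        # per-timestep projection
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, embed_dim)                # map to shared embedding space

    def forward(self, x):                                        # x: (B, T, C)
        h = self.input_proj(x)
        cls = self.cls_token.expand(h.size(0), -1, -1)
        h = self.encoder(torch.cat([cls, h], dim=1))
        return F.normalize(self.head(h[:, 0]), dim=-1)           # unit-norm IMU embedding


def zero_shot_classify(imu_emb, label_text_emb):
    """Pick, for each IMU embedding, the activity label with the highest cosine similarity."""
    label_text_emb = F.normalize(label_text_emb, dim=-1)
    return (imu_emb @ label_text_emb.t()).argmax(dim=-1)         # (B,) predicted label indices


if __name__ == "__main__":
    enc = IMUTransformerEncoder()
    imu = torch.randn(8, 200, 6)                                 # 8 windows, 200 steps, 6 axes
    text = torch.randn(10, 512)                                  # placeholder label embeddings
    print(zero_shot_classify(enc(imu), text).shape)              # torch.Size([8])
```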
© 2025 Author

This article is licensed under a Creative Commons [Attribution 4.0 International] License.
https://creativecommons.org/licenses/by/4.0/deed.ja