International Journal of Activity and Behavior Computing
Online ISSN: 2759-2871
Multimodal Foundation Model for Cross-Modal Retrieval and Activity Recognition Tasks
Koki Matsuishi, Kosuke Ukita, Tsuyoshi Okita
Journal, Open Access

2025, Volume 2025, Issue 3, pp. 1-25

Abstract
In recent years, the widespread adoption of wearable devices has highlighted the growing importance of behavior analysis using inertial measurement units (IMUs). While applications span diverse fields such as healthcare and robotics, recent studies have increasingly focused on multimodal analysis in addition to unimodal analysis. Several studies have proposed multimodal foundation models that incorporate first-person video and text data; however, these models still fall short of providing a detailed analysis of full-body human activity. To address this limitation, we propose the Activity Understanding and Representations Alignment - Multimodal Foundation Model (AURA-MFM), a foundation model that integrates four modalities: third-person video, motion capture, IMU, and text. By incorporating third-person video and motion capture data, the model enables a detailed and multidimensional understanding of human activity that first-person video alone fails to capture. Additionally, a Transformer-based IMU encoder is employed to enhance the model's overall performance. Experimental evaluations on retrieval and activity recognition tasks demonstrate that our model surpasses existing methods. Notably, in zero-shot classification for action recognition, our method achieved significantly higher performance, with an F1-score of 0.6226 and an accuracy of 0.7320, whereas the existing method recorded an F1-score of 0.0747 and an accuracy of 0.1961. The code is available at https://github.com/Okita-Laboratory/AURA-MFM.
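To make the abstract's description concrete, the sketch below illustrates the general idea of a Transformer-based IMU encoder whose embeddings live in a shared space with text embeddings, and of zero-shot activity classification by cosine similarity against embedded label descriptions. This is a minimal illustration under assumed module names, dimensions, and placeholder text embeddings, not the authors' implementation; the actual architecture and training code are in the linked repository.

```python
# Minimal sketch (assumptions, not the AURA-MFM code): a Transformer-based IMU
# encoder producing embeddings in a shared space, plus zero-shot classification
# by cosine similarity against text-label embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class IMUTransformerEncoder(nn.Module):
    """Encode an IMU window of shape (batch, time, channels) into a joint-space vector."""

    def __init__(self, in_channels=6, d_model=256, n_heads=4, n_layers=4, embed_dim=512):
        super().__init__()
        self.input_proj = nn.Linear(in_channels, d_model)        # per-timestep projection
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, embed_dim)                # map to shared embedding space

    def forward(self, x):                                        # x: (B, T, C)
        h = self.input_proj(x)
        cls = self.cls_token.expand(h.size(0), -1, -1)
        h = self.encoder(torch.cat([cls, h], dim=1))
        return F.normalize(self.head(h[:, 0]), dim=-1)           # unit-norm IMU embedding


def zero_shot_classify(imu_emb, label_text_emb):
    """Pick, for each IMU embedding, the activity label with the highest cosine similarity."""
    label_text_emb = F.normalize(label_text_emb, dim=-1)
    return (imu_emb @ label_text_emb.t()).argmax(dim=-1)         # (B,) predicted label indices


if __name__ == "__main__":
    enc = IMUTransformerEncoder()
    imu = torch.randn(8, 200, 6)                                 # 8 windows, 200 steps, 6 axes
    text = torch.randn(10, 512)                                  # placeholder label embeddings
    print(zero_shot_classify(enc(imu), text).shape)              # torch.Size([8])
```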
© 2025 Author

This article is licensed under a Creative Commons [Attribution 4.0 International] License.
https://creativecommons.org/licenses/by/4.0/deed.ja