Journal of the Japan Society for Precision Engineering (精密工学会誌)
Online ISSN : 1882-675X
Print ISSN : 0912-0289
ISSN-L : 0912-0289
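Paper
Introducing Temporal Cross-Modal Attention for the Task of Estimating Events from Audio and Video in Videos (動画内の音と映像によるイベント推定タスクにおける時間方向クロスモーダルアテンションの導入)
長崎 好輝, 林 昌希, 金子 直史, 青木 義満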
Journal, free access

2022, Vol. 88, No. 3, pp. 263-268

Abstract

In this paper, we propose a new method for audio-visual event localization 1), the task of finding the segments in which an audio event and a visual event correspond. Previous methods use Long Short-Term Memory (LSTM) networks to extract temporal features, but recurrent neural networks such as LSTM cannot precisely learn long-term features. We therefore propose a Temporal Cross-Modal Attention (TCMA) module, which extracts temporal features more precisely from the two modalities. Inspired by the success of attention modules in capturing long-term features, TCMA incorporates self-attention. With this module, we were able to localize audio-visual events precisely and achieved higher accuracy than previous work.
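The abstract does not describe the internals of TCMA, so the following is only a minimal sketch of the general idea it names: attention computed along the time axis, with one modality querying the other. It uses generic scaled dot-product attention in plain Python. The function names, the toy feature values, and the choice of audio as the query and visual features as keys and values are all assumptions for illustration, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_cross_attention(queries, keys, values):
    """Scaled dot-product attention across time segments (hypothetical helper).

    queries: one feature vector per segment from one modality (e.g. audio)
    keys, values: one feature vector per segment from the other modality
    Returns (outputs, weights); each weights[t] sums to 1 over the key segments.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query segment to every key segment, at any distance in time.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value vectors, one output vector per query segment.
        out = [sum(w[j] * values[j][c] for j in range(len(values)))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy input: 3 segments with 2-dimensional features per modality (made-up numbers).
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
visual = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]

attended_visual, attn = temporal_cross_attention(audio, visual, visual)
```

Unlike a recurrent network, every query segment here is compared directly with every segment of the other modality, so the cost of relating two segments does not grow with their distance in time. This is the property the abstract appeals to when it contrasts attention with LSTM.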

© 2022 The Japan Society for Precision Engineering (公益社団法人 精密工学会)