Introducing Temporal Cross-Modal Attention for Audio-Visual Event Estimation in Videos

長崎 好輝; 林 昌希; 金子 直史; 青木 義満

doi:10.2493/jjspe.88.263

Abstract

In this paper, we propose a new method for audio-visual event localization ¹⁾, the task of finding the segments in which an audio event and a visual event correspond. Previous methods use Long Short-Term Memory (LSTM) networks to extract temporal features, but recurrent neural networks such as LSTMs cannot precisely learn long-term features. We therefore propose a Temporal Cross-Modal Attention (TCMA) module, which extracts temporal features from the two modalities more precisely. Inspired by the success of attention modules in capturing long-term features, TCMA incorporates self-attention. With this module, we localized audio-visual events precisely and achieved higher accuracy than previous work.
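
The abstract does not specify how TCMA is constructed, so the following is only a minimal sketch of the general idea of cross-modal attention along the time axis, not the authors' module. It assumes per-segment audio and visual feature vectors of equal dimension and applies plain scaled dot-product attention, so that each segment of one modality attends to every segment of the other. The function names, the number of segments `T`, and the feature size `D` are all hypothetical choices made for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, context):
    """Scaled dot-product attention across time (illustrative assumption).

    queries: list of T_q feature vectors from one modality.
    context: list of T_c feature vectors from the other modality.
    Returns the attended features (T_q x D) and the weights (T_q x T_c).
    """
    d = len(queries[0])
    attended, weights = [], []
    for q in queries:
        # Similarity of this query segment to every context segment.
        scores = [
            sum(qi * ci for qi, ci in zip(q, c)) / math.sqrt(d)
            for c in context
        ]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of context features over all time steps.
        attended.append(
            [sum(w[t] * context[t][j] for t in range(len(context)))
             for j in range(d)]
        )
    return attended, weights


# Hypothetical sizes: 10 one-second segments, 8-dimensional features.
random.seed(0)
T, D = 10, 8
audio = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
visual = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]

# Audio segments attend to visual segments, and vice versa.
audio_att, w_av = cross_modal_attention(audio, visual)
visual_att, w_va = cross_modal_attention(visual, audio)
```

Because every segment can attend directly to every other segment, the path between distant time steps has length one, which is the usual argument for attention over recurrence when long-term dependencies matter. A practical module would add learned query, key, and value projections, which this sketch omits.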
