Target sound information extraction: Speech and audio processing with neural networks conditioned on target clues

Tsubasa Ochiai; Marc Delcroix; Takafumi Moriya; Takanori Ashihara; Hiroshi Sato; Naohiro Tawara; Tomohiro Nakatani; Shoko Araki

doi:10.1250/ast.e24.124

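A final published version of this article is available. Please refer to the final published version.
When citing this article, please cite the final published version.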

Keywords: Target speech extraction, personalized voice activity detection, target speaker automatic speech recognition, speech processing, audio processing

Journal, open access, advance online publication

Article ID: e24.124

DOI https://doi.org/10.1250/ast.e24.124

The final version of this article is now available: Vol. 46 (2025), No. 3 pp. 197-209

Abstract

This paper overviews neural target sound information extraction (TSIE), the task of extracting desired information about a sound source from an observed sound mixture, given clues about the target source. TSIE is a general framework that covers applications such as target speech/sound extraction (TSE), personalized voice activity detection (PVAD), and target speaker automatic speech recognition (TS-ASR). We formalize the ideas of TSIE and show, through examples including TSE, PVAD, and TS-ASR, how it can be implemented. We conclude with a discussion of potential future research directions.
