Target sound information extraction: Speech and audio processing with neural networks conditioned on target clues

Tsubasa Ochiai; Marc Delcroix; Takafumi Moriya; Takanori Ashihara; Hiroshi Sato; Naohiro Tawara; Tomohiro Nakatani; Shoko Araki

doi:10.1250/ast.e24.124

INVITED REVIEW

Keywords: Target speech extraction, Personalized voice activity detection, Target speaker automatic speech recognition, Speech processing, Audio processing

JOURNAL OPEN ACCESS

2025 Volume 46 Issue 3 Pages 197-209

Abstract

This paper provides an overview of neural target sound information extraction (TSIE), the task of extracting desired information about a target sound source from an observed sound mixture, given clues about that source. TSIE is a general framework that covers various applications, including target speech/sound extraction (TSE), personalized voice activity detection (PVAD), and target speaker automatic speech recognition (TS-ASR). We formalize the ideas of TSIE and show how it can be implemented, using TSE, PVAD, and TS-ASR as examples. We conclude the paper with a discussion of potential future research directions.
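
To make the shared interface concrete, the following toy sketch takes the same two inputs the abstract names, an observed mixture and a clue about the target source, and produces two different kinds of output: a per-frame activity decision (in the spirit of PVAD) and a masked version of the mixture (in the spirit of TSE). It is not a neural model and not the authors' implementation: the cosine-similarity rule, the 2-D features, the threshold, and the function names `target_activity` and `extract_target` are all assumptions chosen for illustration, standing in for what a trained, clue-conditioned network would learn.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def target_activity(frames, clue, threshold=0.5):
    """Toy PVAD-like output: flag frames whose features resemble the clue.

    The threshold is a hypothetical value, not taken from the paper.
    """
    return [cosine(frame, clue) > threshold for frame in frames]


def extract_target(frames, clue):
    """Toy TSE-like output: scale each frame by a clue-dependent mask in [0, 1]."""
    extracted = []
    for frame in frames:
        mask = max(0.0, cosine(frame, clue))  # clip negative similarity to zero
        extracted.append([mask * x for x in frame])
    return extracted


# Hypothetical 2-D features: the first frame resembles the target clue,
# the second resembles an interfering source, the third is silence.
clue = [1.0, 0.0]
frames = [[0.9, 0.1], [0.1, 0.9], [0.0, 0.0]]

activity = target_activity(frames, clue)
extracted = extract_target(frames, clue)
```

Both functions share the signature `(frames, clue)`, and only the type of the returned information changes, which is the sense in which a single clue-conditioned framework can cover several applications. In a neural system, the hand-written similarity would be replaced by learned layers conditioned on the clue.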
