Acoustical Science and Technology
Online ISSN : 1347-5177
Print ISSN : 1346-3969
ISSN-L : 0369-4232

This article has an officially published version. Please refer to the published version, and cite the published version when citing this work.

Autoregressive ConvNeXt-Transformer Fusion Framework for Polyphonic Optical Music Recognition with Focal Loss Optimization
Jun Wu, Wanshan Guo
Author information
Journal / Open Access / Advance online publication

Article ID: e25.60

Abstract

Optical music recognition technology has significantly enhanced the efficiency and accuracy of computational score transcription through deep learning methodologies. Although current techniques demonstrate strong performance in processing monophonic and single-voice scores, they struggle to achieve comparable accuracy when handling complex scores containing harmonic intervals, chords, polyphony, or multivoice compositions. In this paper, we propose ConvNeXt-Transformer Fusion (CNTF), an autoregressive end-to-end neural network framework employing an image-to-sequence architecture specifically optimized for automated transcription of intricate musical scores. The model integrates a ConvNeXt-based encoder for sheet music feature extraction and a Transformer-based decoder that generates transcription sequences through autoregressive prediction. To address class imbalance during training, we implement Focal Loss optimization. Experimental results demonstrate that the CNTF model achieves state-of-the-art performance in polyphony-rich score recognition, attaining lower character, symbol, and line error rates than existing baseline systems.
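The abstract does not give implementation details for the Focal Loss term it mentions. As a generic illustration only, the standard formulation (Lin et al.), FL(p_t) = -α(1 - p_t)^γ log(p_t), can be sketched in plain Python; the hyperparameters γ = 2.0 and α = 0.25 are conventional defaults, not values reported by the paper:

```python
import math

def focal_loss(p_correct, gamma=2.0, alpha=0.25):
    """Focal loss for one prediction, given the model's probability
    for the true class. The (1 - p)^gamma factor down-weights easy,
    well-classified examples so training focuses on hard ones --
    useful when symbol classes in a score corpus are imbalanced.
    Note: gamma/alpha here are assumed defaults, not the paper's."""
    return -alpha * (1.0 - p_correct) ** gamma * math.log(p_correct)

# A confident, correct prediction contributes almost no loss,
# while a low-confidence one keeps most of its cross-entropy weight.
easy = focal_loss(0.95)
hard = focal_loss(0.10)
```

With γ = 0 and α = 1 the expression reduces to ordinary cross-entropy, which is one way to sanity-check an implementation.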

© 2025 by The Acoustical Society of Japan

This article is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International license.
https://creativecommons.org/licenses/by-nd/4.0/