Acoustical Science and Technology
Online ISSN : 1347-5177
Print ISSN : 1346-3969
ISSN-L : 0369-4232
PAPERS
Autoregressive ConvNeXt-transformer fusion framework for polyphonic optical music recognition with focal loss optimization
Jun Wu, Wanshan Guo
Author information
Journal, open access

2026, Volume 47, Issue 2, pp. 86-96

Abstract

Optical music recognition technology has significantly enhanced the efficiency and accuracy of computational score transcription through deep learning methodologies. Although current techniques demonstrate strong performance on monophonic and single-voice scores, they struggle to achieve comparable accuracy on complex scores containing harmonic intervals, chords, polyphony, or multivoice compositions. In this paper, we propose ConvNeXt-Transformer Fusion (CNTF), an autoregressive end-to-end neural network framework employing an image-to-sequence architecture specifically optimized for automated transcription of intricate musical scores. The model integrates a ConvNeXt-based encoder for sheet music feature extraction with a Transformer-based decoder that generates transcription sequences through autoregressive prediction. To address class imbalance during training, we apply Focal Loss optimization. Experimental results demonstrate that the CNTF model achieves state-of-the-art performance in polyphony-rich score recognition, exhibiting lower character, symbol, and line error rates than existing baseline systems.
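The abstract cites Focal Loss as the remedy for class imbalance (rare musical symbols are swamped by common ones during training). As a minimal sketch of that loss, the snippet below implements the standard formulation, FL(p_t) = -α(1 - p_t)^γ log(p_t); the γ and α values shown are common defaults from the focal-loss literature, not parameters reported in this paper.

```python
import math

def focal_loss(p_true, gamma=2.0, alpha=0.25):
    """Focal loss for a single prediction.

    p_true: probability the model assigns to the correct class (0 < p_true <= 1).
    gamma:  focusing parameter; gamma=0 recovers alpha-weighted cross-entropy.
    alpha:  class-balance weight.
    """
    # The (1 - p_true)^gamma factor shrinks the loss of confident,
    # easy examples, so gradients concentrate on hard (rare) symbols.
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

easy = focal_loss(0.95)  # confident correct prediction: small loss
hard = focal_loss(0.30)  # uncertain prediction: much larger loss
```

With γ = 2, an example predicted at 0.95 confidence contributes orders of magnitude less loss than one predicted at 0.30, which is the mechanism the paper relies on to keep rare polyphonic symbols from being ignored.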

© 2026 by The Acoustical Society of Japan

This article is licensed under a Creative Commons [Attribution-NoDerivatives 4.0 International] license.
https://creativecommons.org/licenses/by-nd/4.0/