2026 Volume 47 Issue 2 Pages 86-96
Optical music recognition technology has significantly enhanced the efficiency and accuracy of computational score transcription through deep learning methodologies. Although current techniques demonstrate strong performance on monophonic and single-voice scores, they struggle to achieve comparable accuracy on complex scores containing harmonic intervals, chords, polyphony, or multivoice compositions. In this paper, we propose ConvNeXt-Transformer Fusion (CNTF), an autoregressive end-to-end neural network framework employing an image-to-sequence architecture specifically optimized for automated transcription of intricate musical scores. The model integrates a ConvNeXt-based encoder for sheet music feature extraction and a Transformer-based decoder that generates transcription sequences through autoregressive prediction. To address class imbalance during training, we optimize with Focal Loss. Experimental results demonstrate that CNTF achieves state-of-the-art performance in polyphony-rich score recognition, yielding lower character, symbol, and line error rates than existing baseline systems.
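The abstract notes that Focal Loss is used to counter class imbalance among music symbols. As a minimal pure-Python sketch of the standard focal loss form, the snippet below shows how the (1 - p_t)^gamma factor down-weights easy, well-classified tokens; the gamma and alpha values here are illustrative defaults, not values reported by the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def focal_loss(logits, target, gamma=2.0, alpha=0.25):
    """Focal Loss for a single token prediction.

    p_t is the predicted probability of the true class; the
    (1 - p_t)**gamma modulating factor shrinks the loss for
    confident correct predictions, so training gradients focus
    on rare or hard symbol classes (gamma/alpha are illustrative).
    """
    p_t = softmax(logits)[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct ("easy") prediction contributes far less
# loss than an uncertain ("hard") one for the same true class:
easy = focal_loss([5.0, 0.0, 0.0], target=0)
hard = focal_loss([0.0, 0.0, 0.0], target=0)
```

In a sequence decoder such as CNTF's, this per-token loss would be averaged over all output positions in place of plain cross-entropy.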