Acoustical Science and Technology
Online ISSN : 1347-5177
Print ISSN : 1346-3969
ISSN-L : 0369-4232
PAPERS
Autoregressive ConvNeXt-transformer fusion framework for polyphonic optical music recognition with focal loss optimization
Jun Wu, Wanshan Guo
JOURNAL OPEN ACCESS

2026 Volume 47 Issue 2 Pages 86-96

Abstract

Optical music recognition technology has significantly enhanced the efficiency and accuracy of computational score transcription through deep learning methodologies. Although current techniques perform well on monophonic and single-voice scores, they struggle to achieve comparable accuracy on complex scores containing harmonic intervals, chords, polyphony, or multivoice compositions. In this paper, we propose ConvNeXt-Transformer Fusion (CNTF), an autoregressive end-to-end neural network framework employing an image-to-sequence architecture specifically optimized for automated transcription of intricate musical scores. The model integrates a ConvNeXt-based encoder for sheet music feature extraction with a Transformer-based decoder that generates transcription sequences through autoregressive prediction. To address class imbalance during training, we adopt Focal Loss optimization. Experimental results demonstrate that the CNTF model achieves state-of-the-art performance in polyphony-rich score recognition, attaining lower character, symbol, and line error rates than existing baseline systems.
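The Focal Loss referenced in the abstract is, in its standard form, a cross-entropy variant that down-weights well-classified examples so training concentrates on rare or difficult symbols (here, presumably infrequent music-notation tokens). A minimal pure-Python sketch of the per-token loss is shown below; the gamma value and the usage comments are illustrative assumptions, not details taken from this paper:

```python
import math

def focal_loss(p_t: float, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss for one prediction: -alpha * (1 - p_t)**gamma * log(p_t).

    p_t is the model's predicted probability for the true class.
    gamma > 0 shrinks the loss of easy examples (p_t near 1);
    gamma = 0 recovers ordinary cross-entropy.
    The default gamma=2.0 is the common choice in the literature,
    not necessarily the value used by CNTF.
    """
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

# A confident correct prediction contributes little; a hard example dominates.
easy = focal_loss(0.95)  # heavily down-weighted by (1 - 0.95)**2
hard = focal_loss(0.10)  # barely down-weighted, keeps a large gradient
```

In a sequence model like the one described, a loss of this shape would typically be averaged over the decoder's output tokens in place of plain cross-entropy, which is how class imbalance between common and rare score symbols is usually mitigated.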

© 2026 by The Acoustical Society of Japan

This article is licensed under a Creative Commons [Attribution-NoDerivatives 4.0 International] license.
https://creativecommons.org/licenses/by-nd/4.0/