2026 Volume 47 Issue 2 Pages 86-96
Optical music recognition technology has significantly enhanced the efficiency and accuracy of computational score transcription through deep learning methodologies. Although current techniques demonstrate strong performance on monophonic and single-voice scores, they struggle to achieve comparable accuracy on complex scores containing harmonic intervals, chords, polyphony, or multivoice compositions. In this paper, we propose ConvNeXt-Transformer Fusion (CNTF), an autoregressive end-to-end neural network framework employing an image-to-sequence architecture specifically optimized for automated transcription of intricate musical scores. The model integrates a ConvNeXt-based encoder for sheet music feature extraction and a Transformer-based decoder that generates transcription sequences through autoregressive prediction. To address class imbalance during training, we optimize with Focal Loss. Experimental results demonstrate that CNTF achieves state-of-the-art performance in polyphony-rich score recognition, yielding lower character, symbol, and line error rates than existing baseline systems.
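The abstract notes that Focal Loss is used to counter class imbalance among music symbols. As a minimal pure-Python sketch of the standard focal loss form, the snippet below shows how the (1 - p_t)^gamma factor down-weights easy, well-classified tokens; the gamma and alpha values here are illustrative defaults, not values reported by the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def focal_loss(logits, target, gamma=2.0, alpha=0.25):
    """Focal Loss for a single token prediction.

    p_t is the predicted probability of the true class; the
    (1 - p_t)**gamma modulating factor shrinks the loss for
    confident correct predictions, so training gradients focus
    on rare or hard symbol classes (gamma/alpha are illustrative).
    """
    p_t = softmax(logits)[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct ("easy") prediction contributes far less
# loss than an uncertain ("hard") one for the same true class:
easy = focal_loss([5.0, 0.0, 0.0], target=0)
hard = focal_loss([0.0, 0.0, 0.0], target=0)
```

In a sequence decoder such as CNTF's, this per-token loss would be averaged over all output positions in place of plain cross-entropy.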