2020 Volume 41 Issue 1 Pages 160-165
The advent of deep learning has led to great progress in solving many problems that were once considered challenging. Several recent studies have shown promising results in directly translating styles between two different domains that share the same latent content, for example, from paintings to photographs and from simulated roads to real roads. One of the key ideas underlying this family of domain translation approaches is the generative adversarial network (GAN). Motivated by this concept of converting data from one style to another using GANs, we apply the technique to two challenging yet important applications in the music signal processing field: music source separation and automatic music transcription. Both tasks can be interpreted as a style transition between two different spectrogram domains that share the same content; i.e., from a mixture spectrogram to a specific source spectrogram in the case of source separation, and from an audio spectrogram to a piano-roll representation in the case of music transcription. Through experiments on real-world audio, we demonstrate that one general deep learning framework, namely ``spectrogram to spectrogram'' or ``Spec2Spec,'' can successfully be applied to tackle both problems.
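To make the adversarial formulation concrete, the following is a minimal illustrative sketch (not the paper's actual implementation) of the two losses typically used in conditional GAN-based spectrogram translation: a binary cross-entropy discriminator loss, and a generator loss that combines an adversarial term with an L1 reconstruction term against the target source spectrogram. The function names and the `l1_weight` value are assumptions for illustration only.

```python
import numpy as np

def d_loss(d_real, d_fake):
    """Discriminator loss: push predictions on real spectrograms toward 1
    and predictions on generated (fake) spectrograms toward 0."""
    eps = 1e-12  # numerical guard for log(0)
    return -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))

def g_loss(d_fake, fake_spec, target_spec, l1_weight=100.0):
    """Generator loss: fool the discriminator (adversarial term) while
    staying close to the target source spectrogram (L1 term)."""
    eps = 1e-12
    adv = -np.mean(np.log(d_fake + eps))          # non-saturating GAN loss
    l1 = np.mean(np.abs(fake_spec - target_spec)) # reconstruction penalty
    return adv + l1_weight * l1

# Toy usage: a 4x8 "mixture" mapped to a "source" spectrogram.
rng = np.random.default_rng(0)
target = rng.random((4, 8))
fake = target + 0.01 * rng.standard_normal((4, 8))  # near-perfect output
print(g_loss(np.array([0.9]), fake, target))
```

In practice the discriminator scores come from a network conditioned on the input mixture, and the L1 weight trades off realism against fidelity to the target source; the same loss structure applies when the target domain is a piano-roll representation instead of a source spectrogram.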