Abstract
We have previously proposed a method that converts noisy speech data into images and removes the noise using U-Net, a fully convolutional network. We have already conducted experiments in which this method removed various types of noise as well as interfering human speech, and good results were obtained in all cases. In this study, we consider situations in which a specific person's voice must be emphasized: for example, recording one speaker's voice during a meeting, or enhancing the voice of an emergency announcer or evacuation guide so that it can be converted into text and conveyed to hearing-impaired people. We prepared multiple datasets for training and built a speech enhancement model that extracts a specific speaker's speech from mixtures of up to six speakers. The results confirm that the speech of a specific person in mixed voice data can be enhanced by regenerating the voice.
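The abstract describes the overall pipeline only at a high level (speech converted to an image-like representation, denoised by U-Net, then regenerated as audio). The following is a minimal sketch of that kind of pipeline, not the authors' code: the network size, STFT settings, and training data here are illustrative assumptions, and the paper's actual architecture and parameters are not given in the abstract.

```python
# Sketch: treat the magnitude spectrogram as a 1-channel image, pass it through a
# small U-Net-style encoder-decoder, and regenerate the waveform with the mixture phase.
import torch
import torch.nn as nn


class TinyUNet(nn.Module):
    """Illustrative 2-level U-Net operating on a spectrogram 'image' (assumed, not from the paper)."""

    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU())
        self.out = nn.Conv2d(32, 1, 3, padding=1)  # skip connection doubles channels

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        d1 = self.dec1(e2)
        d1 = d1[..., : e1.size(-2), : e1.size(-1)]  # crop to match the encoder feature map
        return self.out(torch.cat([d1, e1], dim=1))  # U-Net skip connection


def enhance(mixture, model, n_fft=512, hop=128):
    """Mixture waveform (1-D tensor) -> enhanced waveform for the target speaker."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(mixture, n_fft, hop, window=window, return_complex=True)
    mag, phase = spec.abs(), spec.angle()
    with torch.no_grad():
        enhanced_mag = model(mag.unsqueeze(0).unsqueeze(0)).squeeze(0).squeeze(0)
    # Regenerate the voice: combine the enhanced magnitude with the mixture phase.
    enhanced = torch.polar(enhanced_mag, phase)
    return torch.istft(enhanced, n_fft, hop, window=window, length=mixture.numel())


if __name__ == "__main__":
    model = TinyUNet()            # in practice, trained on (mixture, target-speaker) pairs
    mixture = torch.randn(16000)  # 1 s of dummy 16 kHz audio standing in for a recording
    print(enhance(mixture, model).shape)
```

In a setup like this, the model would be trained on pairs of mixed-speaker spectrograms and the corresponding target-speaker spectrograms; how the paper actually constructs its multi-speaker training datasets is detailed in the body of the paper, not in this sketch.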