2025, Vol. 40, No. 4, pp. AG25-C_1-9
In recent years, inpainting of human face images has attracted a great deal of attention among the various inpainting tasks that aim to complete missing or obscured regions in digital images. Compared to inpainting of generic subjects such as landscapes, buildings, or patterns, inpainting of human face images presents unique and considerable challenges. This is because human faces have a complex visual structure, where even subtle differences in facial features or proportions can appear unnatural and cause a sense of discomfort. As such, faithfully completing missing regions in face images while maintaining a convincing and realistic appearance is an extremely difficult problem. Although some inpainting techniques fill in missing facial parts by analyzing and extrapolating from the surrounding visible facial information, most existing methods struggle to produce results that are coherent with the rest of the face. To address this, we propose leveraging supplementary voice data, which contains cues strongly correlated with an individual’s facial structure and expressions, to guide and enhance face image inpainting. Specifically, our proposed method uses voice segments as additional conditioning inputs when generating missing facial regions, with the aim of improving the fidelity and perceptual realism of the completed faces. To rigorously evaluate this voice-augmented face inpainting approach, we constructed a large test dataset consisting of around 20,000 pseudo-masked face images, each paired with preprocessed voice samples of the corresponding individual. In comparative experiments, our method attained significantly higher face completion quality than a baseline model without any voice inputs. Additionally, we conducted challenging real-world verification tests using actual masked face images and raw voice data as inputs. Although performance remained insufficient for reliably handling real occluded faces, these experiments confirmed that voice conditioning clearly improves results on artificial test data, demonstrating its viability as a supplementary signal to guide generative face inpainting systems.
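To make the conditioning scheme concrete, the sketch below shows one plausible way a voice embedding could be injected into an encoder-decoder inpainting generator: the embedding is projected, broadcast over the spatial feature map, and concatenated at the bottleneck before decoding. This is a minimal illustration only; the class name `VoiceConditionedInpainter`, the parameter `voice_dim`, and the layer layout are hypothetical and are not the architecture described in the paper.

```python
import torch
import torch.nn as nn

class VoiceConditionedInpainter(nn.Module):
    """Minimal encoder-decoder inpainting generator that fuses a voice
    embedding with image features at the bottleneck (hypothetical sketch)."""

    def __init__(self, voice_dim: int = 256):
        super().__init__()
        # Image encoder: masked RGB image plus binary mask channel
        # (4 input channels), downsampled twice.
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Project the voice embedding so it can be broadcast over the
        # spatial feature map and concatenated channel-wise.
        self.voice_proj = nn.Linear(voice_dim, 128)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, masked_img, mask, voice_emb):
        # masked_img: (B, 3, H, W); mask: (B, 1, H, W), 1 = missing region;
        # voice_emb: (B, voice_dim), e.g. from a pretrained speaker encoder.
        feat = self.encoder(torch.cat([masked_img, mask], dim=1))
        v = self.voice_proj(voice_emb)                       # (B, 128)
        v = v[:, :, None, None].expand(-1, -1, *feat.shape[2:])
        out = self.decoder(torch.cat([feat, v], dim=1))      # full prediction
        # Keep the known pixels; fill only the masked region.
        return masked_img * (1 - mask) + out * mask


# Usage example: complete a 128x128 face with a dummy voice embedding.
g = VoiceConditionedInpainter()
img = torch.randn(1, 3, 128, 128)
mask = torch.zeros(1, 1, 128, 128)
mask[..., 40:90, 30:100] = 1.0  # pseudo-mask over the lower face
completed = g(img * (1 - mask), mask, torch.randn(1, 256))
print(completed.shape)  # torch.Size([1, 3, 128, 128])
```

A design note on the final compositing step: blending the prediction back into the known pixels guarantees the visible region is preserved exactly, so any voice-derived signal influences only the reconstructed area, which mirrors the evaluation setting of comparing completed regions against a voice-free baseline.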