2022, Vol. 40, No. 4, pp. 351-354
This paper addresses the reconstruction of visual scenes from echolocation, with the aim of developing auditory scene understanding for robots and other systems. Although scene understanding with cameras and LIDAR has been studied extensively, it is sensitive to changes in lighting conditions and has difficulty detecting materials that are hard to capture visually. Ultrasonic sensors are also widely used, but their use is limited to distance estimation, and because most of their emitted power lies in inaudible frequency ranges, they carry an unavoidable risk of unnoticed ultrasonic exposure. To address these problems, we propose a framework for echolocation-based scene reconstruction (ELSR). ELSR reconstructs a visual scene from transmitted and received audible sound, exploiting a Generative Adversarial Network (GAN) to learn the translation from input sound to a visual scene. Because GANs were originally designed for image input, we carefully considered the differences between image and sound inputs and propose introducing cross-correlation and trigonometric-function-based features as the input audio features. The proposed framework is implemented on top of pix2pix, a kind of conditional GAN, and we newly created a dataset for ELSR consisting of 10,800 pairs of input sound and depth images recorded at 28 indoor locations. Experimental results on this dataset demonstrate the effectiveness of the proposed ELSR framework and audio features.
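As context for the audio features mentioned above, the following is a minimal sketch, not the authors' implementation, of how a cross-correlation feature between the transmitted sound and the received echo might be computed and reshaped into a 2-D map that an image-to-image network such as pix2pix could consume; the function name, frame length, and normalization are illustrative assumptions.

```python
# Hypothetical sketch of a cross-correlation audio feature (assumed details,
# not taken from the paper): correlate the received echo with the transmitted
# signal, normalize, and fold the result into a 2-D feature map.
import numpy as np
from scipy.signal import correlate


def cross_correlation_feature(transmitted: np.ndarray,
                              received: np.ndarray,
                              frame_len: int = 1024) -> np.ndarray:
    # Full cross-correlation; peaks correspond to echo delays (i.e., distances).
    xcorr = correlate(received, transmitted, mode="full")
    # Normalize so the feature does not depend on absolute signal level.
    xcorr = xcorr / (np.max(np.abs(xcorr)) + 1e-12)
    # Trim to a multiple of frame_len and reshape into a 2-D map for the GAN input.
    n_frames = len(xcorr) // frame_len
    return xcorr[: n_frames * frame_len].reshape(n_frames, frame_len)
```

The motivation for such a feature is that peaks in the cross-correlation encode the time delays of echoes, which relate directly to the distances of reflecting surfaces in the scene.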