This paper addresses the reconstruction of visual scenes based on echolocation, aiming to develop auditory scene understanding for robots and systems. Although scene understanding with cameras and LIDAR has been studied extensively, these sensors are sensitive to changes in lighting conditions and have difficulty detecting invisible materials. Ultrasonic sensors are also widely used, but their use is limited to distance estimation, and since most ultrasonic power lies in inaudible frequency ranges, they carry an unavoidable risk of unnoticed ultrasonic exposure. To solve these problems, we propose a framework for echolocation-based scene reconstruction (ELSR). ELSR reconstructs a visual scene from the transmitted and received audible sound, exploiting a Generative Adversarial Network (GAN) to learn the translation from input sound to a visual scene.
As GANs were originally designed for image inputs, we carefully considered the differences between image and sound inputs and propose introducing cross-correlation and trigonometric function-based features into the input audio features.
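As a rough illustration of such audio features, the sketch below (Python with NumPy/SciPy) builds a feature map from a transmitted/received signal pair by stacking a log-magnitude spectrogram, sin/cos of the spectrogram phase, and the transmit-receive cross-correlation; the function name, shapes, and exact feature definitions are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical feature construction for ELSR-style input; names and
# shapes are assumptions for illustration, not the authors' code.
import numpy as np
from scipy.signal import correlate, stft

def elsr_features(tx, rx, fs=16000, nperseg=512):
    """Stack spectral, phase, and cross-correlation features of an echo.

    tx: transmitted probe signal, rx: received echo (1-D float arrays).
    Returns an array of shape (4, freq_bins, time_frames).
    """
    # Log-magnitude spectrogram of the received echo.
    _, _, Z = stft(rx, fs=fs, nperseg=nperseg)
    log_mag = np.log1p(np.abs(Z))
    # sin/cos of the phase: continuous, unlike the raw wrapped angle.
    phase = np.angle(Z)
    sin_phase, cos_phase = np.sin(phase), np.cos(phase)
    # Cross-correlation of tx and rx encodes echo delays (distances).
    xcorr = correlate(rx, tx, mode="same")
    xcorr = xcorr / (np.abs(xcorr).max() + 1e-8)
    # Resample the 1-D correlation to the number of frequency bins and
    # tile it across time frames so all channels share one shape.
    col = np.interp(np.linspace(0, len(xcorr) - 1, log_mag.shape[0]),
                    np.arange(len(xcorr)), xcorr)
    xcorr_map = np.tile(col[:, None], (1, log_mag.shape[1]))
    return np.stack([log_mag, sin_phase, cos_phase, xcorr_map])

# Toy usage: a dummy probe signal and a delayed, attenuated "echo".
fs = 16000
t = np.arange(fs) / fs
tx = np.sin(2 * np.pi * (500 + 1500 * t) * t)  # chirp-like audible sweep
rx = 0.5 * np.roll(tx, 200)
feats = elsr_features(tx, rx, fs=fs)           # -> (4, 257, time_frames)
```

Representing phase through sin/cos rather than the raw angle is a common choice because it avoids the 2π wrap-around discontinuity, which is presumably part of the appeal of trigonometric function-based features here.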
The proposed framework is implemented based on pix2pix, a kind of conditional GAN, and a new dataset for ELSR, consisting of 10,800 pairs of input sounds and depth images recorded at 28 indoor locations, was created. Experimental results on this dataset showed the effectiveness of the proposed ELSR framework and of the introduced audio features.
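For context, the sketch below shows, under our own assumptions about layer sizes, 64x64 resolution, and a 4-channel audio feature input, how such feature maps could condition a pix2pix-style conditional GAN that outputs depth images; the adversarial-plus-L1 objective and the L1 weight of 100 follow the original pix2pix formulation, but nothing here should be read as the authors' exact architecture.

```python
# Minimal pix2pix-style conditional GAN sketch in PyTorch; all
# hyperparameters and layer shapes are illustrative assumptions.
import torch
import torch.nn as nn

def block(cin, cout, down=True):
    # Strided conv (encoder) or transposed conv (decoder) block.
    conv = (nn.Conv2d(cin, cout, 4, 2, 1) if down
            else nn.ConvTranspose2d(cin, cout, 4, 2, 1))
    return nn.Sequential(conv, nn.BatchNorm2d(cout), nn.LeakyReLU(0.2))

class Generator(nn.Module):
    """Maps a 4-channel audio feature map to a 1-channel depth image."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(block(4, 64), block(64, 128), block(128, 256))
        self.dec = nn.Sequential(block(256, 128, down=False),
                                 block(128, 64, down=False),
                                 nn.ConvTranspose2d(64, 1, 4, 2, 1), nn.Tanh())
    def forward(self, x):
        return self.dec(self.enc(x))

class Discriminator(nn.Module):
    """PatchGAN-style critic on concatenated (audio features, depth) pairs."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(block(4 + 1, 64), block(64, 128),
                                 nn.Conv2d(128, 1, 4, 1, 1))
    def forward(self, feat, depth):
        return self.net(torch.cat([feat, depth], dim=1))

# One illustrative training step with the pix2pix objective (cGAN + L1).
G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce, l1 = nn.BCEWithLogitsLoss(), nn.L1Loss()

feat = torch.randn(8, 4, 64, 64)   # dummy batch of audio feature maps
depth = torch.randn(8, 1, 64, 64)  # dummy matching depth images

fake = G(feat)
# Discriminator: push real pairs toward 1, generated pairs toward 0.
d_real, d_fake = D(feat, depth), D(feat, fake.detach())
loss_d = (bce(d_real, torch.ones_like(d_real))
          + bce(d_fake, torch.zeros_like(d_fake)))
opt_d.zero_grad()
loss_d.backward()
opt_d.step()
# Generator: fool the critic and stay close to ground truth in L1.
d_fake = D(feat, fake)
loss_g = bce(d_fake, torch.ones_like(d_fake)) + 100.0 * l1(fake, depth)
opt_g.zero_grad()
loss_g.backward()
opt_g.step()
```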