Journal of Robotics and Mechatronics
Online ISSN : 1883-8049
Print ISSN : 0915-3942
ISSN-L : 0915-3942
Volume 29, Issue 1
Review on Community-Centric System - Support of Human Ties -
  • Eri Sato-Shimokawara, Toru Yamaguchi
    Article type: Review
    2017 Volume 29 Issue 1 Pages 7-13
    Published: February 20, 2017
    Released on J-STAGE: November 20, 2018
    JOURNAL OPEN ACCESS

    Motivated by the goal of revitalizing local communities, we have been actively studying community support systems. In particular, we have studied a “community-centric system,” in which individuals are connected to one another in a natural manner. In this paper, based on this community-centric concept, we introduce various robot operating interfaces, dialogue robots, and a telepresence system for generating human ties.

    Download PDF (2404K)
Special Issue on Robot Audition Technologies
  • Hiroshi G. Okuno, Kazuhiro Nakadai
    Article type: Editorial
    2017 Volume 29 Issue 1 Pages 15
    Published: February 20, 2017
    Released on J-STAGE: November 20, 2018
    JOURNAL OPEN ACCESS

    Robot audition, the ability of a robot to listen to several things at once with its own “ears,” is crucial to the improvement of interactions and symbiosis between humans and robots. Since robot audition was originally proposed and has been pioneered by Japanese research groups, this special issue on robot audition technologies of the Journal of Robotics and Mechatronics covers a wide collection of advanced topics studied mainly in Japan. Specifically, two consecutive JSPS Grants-in-Aid for Scientific Research (S) on robot audition (PI: Hiroshi G. Okuno) from 2007 to 2017, the JST Japan-France Research Cooperative Program on binaural listening for humanoids (PI: Hiroshi G. Okuno and Patrick Danès) from 2009 to 2013, and the ImPACT Tough Robotics Challenge (PM: Prof. Satoshi Tadokoro) on extreme audition for search and rescue robots since 2015 have contributed to the promotion of robot audition research, and most of the papers in this issue are the outcome of these projects. Robot audition was surveyed in the special issue on robot audition in the Journal of the Robotics Society of Japan, Vol.28, No.1 (2011) and in our IEEE ICASSP-2015 paper. This issue covers the most recent topics in robot audition, except for human-robot interaction, which has been covered by many papers in Advanced Robotics as well as in other journals and at international conferences, including IEEE IROS.

    This issue consists of twenty-three papers accepted through peer review. They are classified into four categories: signal processing, music and pet robots, search and rescue robots, and monitoring animal acoustics in natural habitats.

    In signal processing for robot audition, Nakadai, Okuno, et al. report on the HARK open-source software for robot audition, Takeda, et al. develop noise-robust MUSIC-based sound source localization (SSL), and Yalta, et al. use deep learning for SSL. Odo, et al. develop active SSL by moving artificial pinnae, and Youssef, et al. propose binaural SSL for an immobile or mobile talker. Suzuki, Otsuka, et al. evaluate the influence of six impulse-response-measuring signals on MUSIC-based SSL, Sekiguchi, et al. propose an optimal allocation of distributed microphone arrays for sound source separation, and Tanabe, et al. develop 3D SSL using a microphone array and LiDAR. Nakadai and Koiwa present audio-visual automatic speech recognition, and Nakadai, Tezuka, et al. suppress ego-noise, that is, noise generated by the robot itself.

    In music and pet robots, Ohkita, et al. propose audio-visual beat tracking for a robot to dance with a human dancer, and Tomo, et al. develop a robot that operates a wayang puppet, an Indonesian world cultural heritage, by recognizing emotion in Gamelan music. Suzuki, Takahashi, et al. develop a pet robot that approaches a sound source.

    In search and rescue robots, Hoshiba, et al. implement real-time SSL with a microphone array installed on a multicopter UAV, and Ishiki, et al. design a microphone array for multicopters. Ohata, et al. detect a sound source with a multicopter microphone array, and Sugiyama, et al. identify detected acoustic events through a combination of signal processing and deep learning. Bando, et al. enhance human voices online and offline for a hose-shaped rescue robot with a microphone array.

    In monitoring animal acoustics in natural habitats, Suzuki, Matsubayashi, et al. design and implement HARKBird, Matsubayashi, et al. report on their experience of monitoring birds with HARKBird, and Kojima, et al. use a spatial-cue-based probabilistic model to analyze the songs of birds singing in their natural habitat. Aihara, et al. analyze a chorus of frogs with dozens of sound-to-light conversion devices called Fireflies, the design and analysis of which are reported by Mizumoto, et al.

    The editors and authors hope that this special issue will promote the further evolution of robot audition technologies in a diversity of applications.

    Download PDF (130K)
  • Kazuhiro Nakadai, Hiroshi G. Okuno, Takeshi Mizumoto
    Article type: Paper
    2017 Volume 29 Issue 1 Pages 16-25
    Published: February 20, 2017
    Released on J-STAGE: November 20, 2018
    JOURNAL OPEN ACCESS

    Robot audition is a research field that focuses on developing technologies so that robots can hear sound through their own ears (microphones). By compiling robot audition studies performed over more than 10 years, open source software for research purposes called HARK (Honda Research Institute Japan Audition for Robots with Kyoto University) was released to the public in 2008. HARK is updated every year, and free tutorials are often held for its promotion. In this paper, the major functions of HARK – such as sound source localization, sound source separation, and automatic speech recognition – are explained. To promote HARK further, HARK-Embedded, for use on embedded platforms, and HARK-SaaS, provided as Software as a Service (SaaS), have been actively studied and developed in recent years; these technologies are also described in this paper. In addition, applications of HARK are introduced as case studies.

    Download PDF (1868K)
  • Ryu Takeda, Kazunori Komatani
    Article type: Paper
    2017 Volume 29 Issue 1 Pages 26-36
    Published: February 20, 2017
    Released on J-STAGE: November 20, 2018
    JOURNAL OPEN ACCESS

    We focus on the problem of localizing soft/weak voices recorded by small humanoid robots, such as NAO. Sound source localization (SSL) for such robots requires fast processing and noise robustness owing to the restricted resources and the internal noise close to the microphones. Multiple signal classification using generalized eigenvalue decomposition (GEVD-MUSIC) is a promising method for SSL. It achieves noise robustness by whitening robot internal noise using prior noise information. However, whitening increases the computational cost and creates a direction-dependent bias in the localization score, which degrades the localization accuracy. We have thus developed a new implementation of GEVD-MUSIC based on steering vector transformation (TSV-MUSIC). Applying a transformation equivalent to whitening to the steering vectors in advance reduces the real-time computational cost of TSV-MUSIC. Moreover, normalization of the transformed vectors cancels the direction-dependent bias and improves the localization accuracy. Experiments using simulated data showed that TSV-MUSIC had the highest accuracy of the methods tested. An experiment using real recorded data showed that TSV-MUSIC outperformed GEVD-MUSIC and other MUSIC methods in terms of localization accuracy by about 4 points under low signal-to-noise-ratio conditions.
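
    As a rough illustration of moving the whitening step offline, the following Python sketch derives a whitening transform from a prior noise correlation matrix, applies it to the steering vectors in advance, normalizes them, and then runs a standard MUSIC search. All signals, steering vectors, and sizes are synthetic placeholders, not the paper's TSV-MUSIC definitions.

```python
# Minimal sketch of MUSIC-style localization with noise whitening applied to
# steering vectors in advance (in the spirit of TSV-MUSIC); every quantity
# below is a synthetic placeholder, not the paper's data or exact formulation.
import numpy as np

rng = np.random.default_rng(0)
n_mics, n_dirs, n_src = 8, 72, 1

# Hypothetical steering vectors for 72 candidate directions (one per 5 deg).
A = np.exp(1j * rng.uniform(0, 2 * np.pi, size=(n_dirs, n_mics)))

# Prior noise correlation matrix K (e.g., measured robot internal noise).
N = rng.normal(size=(n_mics, 200)) + 1j * rng.normal(size=(n_mics, 200))
K = N @ N.conj().T / 200

# Offline step: whiten the steering vectors once, then normalize them so the
# localization score carries no direction-dependent bias.
W = np.linalg.inv(np.linalg.cholesky(K))          # whitening matrix (K^-1/2-like)
A_t = (W @ A.T).T
A_t /= np.linalg.norm(A_t, axis=1, keepdims=True)

# Online step: ordinary standard-eigenvalue MUSIC on the whitened observation.
X = rng.normal(size=(n_mics, 500)) + 1j * rng.normal(size=(n_mics, 500))
R = W @ (X @ X.conj().T / 500) @ W.conj().T       # whitened spatial correlation
eigval, eigvec = np.linalg.eigh(R)                # eigenvalues in ascending order
E_n = eigvec[:, : n_mics - n_src]                 # noise subspace

music = 1.0 / np.linalg.norm(E_n.conj().T @ A_t.T, axis=0) ** 2
print("estimated direction index:", int(np.argmax(music)))
```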

    Download PDF (1674K)
  • Nelson Yalta, Kazuhiro Nakadai, Tetsuya Ogata
    Article type: Paper
    2017 Volume 29 Issue 1 Pages 37-48
    Published: February 20, 2017
    Released on J-STAGE: November 20, 2018
    JOURNAL OPEN ACCESS

    This study proposes the use of a deep neural network to localize a sound source using an array of microphones in a reverberant environment. During the last few years, applications based on deep neural networks have performed various tasks such as image classification or speech recognition to levels that exceed even human capabilities. In our study, we employ deep residual networks, which have recently shown remarkable performance in image classification tasks even when the training period is shorter than that of other models. Deep residual networks are used to process audio input similar to multiple signal classification (MUSIC) methods. We show that with end-to-end training and generic preprocessing, the performance of deep residual networks not only surpasses the block-level accuracy of linear models in nearly clean environments but also shows robustness to challenging conditions by exploiting the time delay on power information.
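
    A minimal PyTorch sketch of the kind of residual block that could map multichannel audio features to direction classes is given below; the layer sizes, input features, and number of direction classes are illustrative assumptions, not the architecture evaluated in the paper.

```python
# Minimal sketch of a residual network over multichannel audio features for
# direction-of-arrival classification; sizes and inputs are illustrative only.
import torch
import torch.nn as nn

class ResBlock1d(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm1d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        # Identity shortcut: the block learns a residual correction to x.
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)

class DoaNet(nn.Module):
    def __init__(self, in_ch=8, hidden=64, n_dirs=72):
        super().__init__()
        self.stem = nn.Conv1d(in_ch, hidden, kernel_size=3, padding=1)
        self.blocks = nn.Sequential(ResBlock1d(hidden), ResBlock1d(hidden))
        self.head = nn.Linear(hidden, n_dirs)       # one logit per candidate azimuth

    def forward(self, x):                            # x: (batch, mics, frames)
        h = self.blocks(self.stem(x)).mean(dim=2)    # average over time frames
        return self.head(h)

logits = DoaNet()(torch.randn(4, 8, 100))            # 4 blocks of 8-channel features
print(logits.shape)                                  # -> torch.Size([4, 72])
```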

    Download PDF (3387K)
  • Wataru Odo, Daisuke Kimoto, Makoto Kumon, Tomonari Furukawa
    Article type: Paper
    2017 Volume 29 Issue 1 Pages 49-58
    Published: February 20, 2017
    Released on J-STAGE: November 20, 2018
    JOURNAL OPEN ACCESS

    Animals use two ears to localize the source of a sound, and this paper considers a robot system that localizes a sound source by using two microphones with active external reflectors that mimic movable pinnae. The body of the robot and the environment both affect the propagation of sound waves, which complicates mapping the acoustic cues to the source. The mapping may be multimodal, and the observed acoustic cues may lead to incorrect estimation of the locations. In order to achieve sound source localization with such multimodal likelihoods, this paper presents a method for determining a configuration of active pinnae, which uses prior knowledge to optimize their location and orientation, and thus attenuates the effects of pseudo-peaks in the observations. The observations are also adversely affected by noise in the sensor signals, and thus a Bayesian inference approach to processing them is further introduced. Results of experiments that validate the proposed method are also presented.

    Download PDF (1037K)
  • Karim Youssef, Katsutoshi Itoyama, Kazuyoshi Yoshii
    Article type: Paper
    2017 Volume 29 Issue 1 Pages 59-71
    Published: February 20, 2017
    Released on J-STAGE: November 20, 2018
    JOURNAL OPEN ACCESS

    This paper jointly addresses the tasks of speaker identification and localization with binaural signals. The proposed system operates in noisy and echoic environments and involves limited computations. It demonstrates that a simultaneous identification and localization operation can benefit from a common signal processing front end for feature extraction. Moreover, a joint exploitation of the identity and position estimation outputs allows the outputs to limit each other’s errors. Equivalent rectangular bandwidth frequency cepstral coefficients (ERBFCC) and interaural level differences (ILD) are extracted. These acoustic features are respectively used for speaker identity and azimuth estimation through artificial neural networks (ANNs). The system was evaluated in simulated and real environments, with still and mobile speakers. Results demonstrate its ability to produce accurate estimations in the presence of noises and reflections. Moreover, the advantage of the binaural context over the monaural context for speaker identification is shown.
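
    The following numpy sketch shows one simple way interaural level differences (ILD) could be computed from a binaural pair and averaged into a feature vector; the frame length and the plain FFT bins used here are assumptions standing in for the paper's ERB-based front end.

```python
# Minimal sketch: interaural level difference (ILD) features from a binaural
# signal, computed per STFT bin and averaged over time; the frame length and
# band handling are illustrative, not the paper's ERB front end.
import numpy as np

def stft_mag(x, n_fft=512, hop=256):
    frames = [x[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(x) - n_fft, hop)]
    return np.abs(np.fft.rfft(np.asarray(frames), axis=1))   # (frames, bins)

def ild_features(left, right, eps=1e-10):
    L, R = stft_mag(left), stft_mag(right)
    ild = 20.0 * np.log10((L + eps) / (R + eps))   # level difference in dB per bin
    return ild.mean(axis=0)                        # average over frames

rng = np.random.default_rng(0)
fs = 16000
sig = rng.normal(size=fs)                          # 1 s of synthetic sound
left, right = 1.0 * sig, 0.5 * sig                 # fake interaural level difference
print(ild_features(left, right)[:5])               # ~ +6 dB in every band
```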

    Download PDF (834K)
  • Takuya Suzuki, Hiroaki Otsuka, Wataru Akahori, Yoshiaki Bando, Hiroshi ...
    Article type: Paper
    2017 Volume 29 Issue 1 Pages 72-82
    Published: February 20, 2017
    Released on J-STAGE: November 20, 2018
    JOURNAL OPEN ACCESS

    Two major functions provided by the robot audition open-source software HARK, sound source localization and sound source separation, exploit the acoustic transfer functions of a microphone array to improve performance. The acoustic transfer functions are calculated from the measured acoustic impulse responses. In the measurement, special signals such as the Time Stretched Pulse (TSP) are used to improve the signal-to-noise ratio of the measurement signals. Recent studies have identified the importance of selecting a measurement signal according to the application. In this paper, we investigate how six measurement signals – up-TSP, down-TSP, M-Series, Log-SS, NW-SS, and MN-SS – influence the performance of the MUSIC-based sound source localization provided by HARK. Experiments with simulated sounds involving up to three simultaneous sound sources demonstrate no significant difference among the six measurement signals in MUSIC-based sound source localization.
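
    As a generic stand-in for impulse-response measurement with a swept excitation (not the exact up-TSP, down-TSP, M-Series, Log-SS, NW-SS, or MN-SS definitions compared in the paper), the sketch below excites a synthetic system with a logarithmic sweep and recovers the impulse response by regularized FFT deconvolution.

```python
# Minimal sketch of impulse-response measurement with a swept-sine excitation
# and FFT deconvolution; this is a generic stand-in, not the measurement-signal
# definitions evaluated in the paper.
import numpy as np
from scipy.signal import chirp

fs, dur = 16000, 2.0
t = np.arange(int(fs * dur)) / fs
sweep = chirp(t, f0=20.0, f1=fs / 2 * 0.95, t1=dur, method="logarithmic")

# Pretend the room/robot body acts as a short FIR filter (the unknown IR).
true_ir = np.zeros(256)
true_ir[[0, 40, 120]] = [1.0, 0.5, 0.25]
recorded = np.convolve(sweep, true_ir)[: len(sweep)]
recorded += 0.001 * np.random.default_rng(0).normal(size=len(recorded))

# Deconvolve: divide spectra (with regularization) to recover the IR estimate.
n = 1 << int(np.ceil(np.log2(len(sweep) + len(true_ir))))
S, Y = np.fft.rfft(sweep, n), np.fft.rfft(recorded, n)
ir_est = np.fft.irfft(Y * np.conj(S) / (np.abs(S) ** 2 + 1e-6), n)[:256]
print("recovered peaks at samples:", np.argsort(ir_est)[-3:][::-1])
```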

    Download PDF (3278K)
  • Kouhei Sekiguchi, Yoshiaki Bando, Katsutoshi Itoyama, Kazuyoshi Yoshii
    Article type: Paper
    2017 Volume 29 Issue 1 Pages 83-93
    Published: February 20, 2017
    Released on J-STAGE: November 20, 2018
    JOURNAL OPEN ACCESS

    The active audition method presented here improves source separation performance by moving multiple mobile robots to optimal positions. One advantage of using multiple mobile robots, each equipped with a microphone array, is that each robot can work independently or as part of a large reconfigurable array. To determine the optimal layout of the robots, we must be able to predict source separation performance from source position information, because the actual source signals are unknown and the actual separation performance cannot be calculated. Our method thus simulates delay-and-sum beamforming for a possible layout to calculate the gain theoretically, i.e., the expected ratio of a target sound source to other sound sources in the corresponding separated signal. Robots are moved into the layout with the highest average gain over target sources. Experimental results showed that our method improved the harmonic mean of signal-to-distortion ratios (SDRs) by 5.5 dB in simulation and by 3.5 dB in a real environment.
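
    A minimal numpy sketch of the underlying idea, predicting delay-and-sum beamforming gain for a candidate microphone layout from source positions alone, is given below; the geometry, the single evaluation frequency, and the gain definition are illustrative assumptions rather than the paper's formulation.

```python
# Minimal sketch of predicting delay-and-sum beamforming gain for a candidate
# microphone layout, given only source positions (no actual signals); the
# geometry and the gain definition here are illustrative assumptions.
import numpy as np

C = 343.0          # speed of sound [m/s]
FREQ = 1000.0      # evaluate the array response at a single frequency [Hz]

def steering(mics, src):
    """Phase vector seen by the array for a source at position `src`."""
    delays = np.linalg.norm(mics - src, axis=1) / C
    return np.exp(-2j * np.pi * FREQ * delays)

def expected_gain(mics, target, others):
    """Ratio of the array response toward the target vs. the other sources (dB)."""
    w = steering(mics, target) / len(mics)         # delay-and-sum weights
    p_target = np.abs(np.vdot(w, steering(mics, target))) ** 2   # = 1.0
    p_other = sum(np.abs(np.vdot(w, steering(mics, o))) ** 2 for o in others)
    return 10 * np.log10(p_target / (p_other + 1e-12))

mics = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2]])   # one robot
target = np.array([2.0, 1.0])
others = [np.array([-1.5, 2.0]), np.array([1.0, -2.0])]
print(f"expected gain for this layout: {expected_gain(mics, target, others):.1f} dB")
```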

    Download PDF (1331K)
  • Ryo Tanabe, Yoko Sasaki, Hiroshi Takemura
    Article type: Paper
    2017 Volume 29 Issue 1 Pages 94-104
    Published: February 20, 2017
    Released on J-STAGE: November 20, 2018
    JOURNAL OPEN ACCESS

    This study proposes a probabilistic 3D sound source mapping system for a moving sensor unit. A microphone array is used for sound source localization and tracking based on the multiple signal classification (MUSIC) algorithm and a multiple-target tracking algorithm. Laser imaging detection and ranging (LIDAR) is used to generate a 3D geometric map and to estimate the unit's six-degrees-of-freedom (6-DoF) pose using the state-of-the-art gyro-integrated iterative closest point simultaneous localization and mapping (G-ICP SLAM) method. Combining these modules provides sound detection in 3D global space for a moving robot. The sound position is then estimated using Monte Carlo localization from the time series of a tracked sound stream. The results of experiments using the hand-held sensor unit indicate that the method is effective for arbitrary motions of the sensor unit in environments with multiple sound sources.

    Download PDF (8395K)
  • Kazuhiro Nakadai, Tomoaki Koiwa
    Article type: Paper
    2017 Volume 29 Issue 1 Pages 105-113
    Published: February 20, 2017
    Released on J-STAGE: November 20, 2018
    JOURNAL OPEN ACCESS

    Audio-visual speech recognition (AVSR) is a promising approach to improving the noise robustness of speech recognition in the real world. For AVSR, the auditory and visual units are the phoneme and viseme, respectively. However, these are often misclassified in the real world because of noisy input. To solve this problem, we propose two psychologically-inspired approaches. One is audio-visual integration based on missing feature theory (MFT) to cope with missing or unreliable audio and visual features for recognition. The other is phoneme and viseme grouping based on coarse-to-fine recognition. Preliminary experiments show that these two approaches are effective for audio-visual speech recognition. Integration based on MFT with an appropriate weight improves the recognition performance even at a signal-to-noise ratio of −5 dB, a noisy condition in which most speech recognition systems do not work properly. Phoneme and viseme grouping further improved the AVSR performance, particularly at a low signal-to-noise ratio.*

    * This work is an extension of our publication “Tomoaki Koiwa et al.: Coarse speech recognition by audio-visual integration based on missing feature theory, IROS 2007, pp. 1751-1756, 2007.”
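
    A small numpy sketch of missing-feature-theory-style integration is given below: per-feature reliability masks down-weight unreliable audio and visual features before the stream scores are combined with a stream weight. The masks, likelihoods, class counts, and the 0.7 weight are synthetic placeholders, not the paper's models.

```python
# Minimal sketch of MFT-style audio-visual integration: reliability masks in
# [0, 1] down-weight unreliable features when scoring each candidate unit.
# All likelihoods, masks, and the stream weight are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_units, n_audio_feat, n_visual_feat = 10, 13, 6    # e.g., phoneme/viseme classes

log_lik_audio = rng.normal(size=(n_units, n_audio_feat))
log_lik_visual = rng.normal(size=(n_units, n_visual_feat))

# Reliability masks: 0 = missing/unreliable feature, 1 = clean feature.
mask_audio = rng.uniform(0.2, 1.0, size=n_audio_feat)
mask_visual = rng.uniform(0.5, 1.0, size=n_visual_feat)
stream_weight = 0.7                                  # trust audio 70%, visual 30%

score = (stream_weight * (log_lik_audio * mask_audio).sum(axis=1)
         + (1 - stream_weight) * (log_lik_visual * mask_visual).sum(axis=1))
print("recognized unit index:", int(np.argmax(score)))
```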

    Download PDF (673K)
  • Kazuhiro Nakadai, Taiki Tezuka, Takami Yoshida
    Article type: Paper
    2017 Volume 29 Issue 1 Pages 114-124
    Published: February 20, 2017
    Released on J-STAGE: November 20, 2018
    JOURNAL OPEN ACCESS

    This paper addresses ego-motion noise suppression for a robot. Many ego-motion noise suppression methods use motion information such as position, velocity, and the acceleration of each joint to infer ego-motion noise. However, such inferences are not reliable, since motion information and ego-motion noise are not always correlated. We propose a new framework for ego-motion noise suppression based on single channel processing using only acoustic signals captured with a microphone. In the proposed framework, ego-motion noise features and their numbers are automatically estimated in advance from an ego-motion noise input using Infinite Non-negative Matrix Factorization (INMF), which is a non-parametric Bayesian model that does not use explicit motion information. After that, the proposed Semi-Blind INMF (SB-INMF) is applied to an input signal that consists of both the target and ego-motion noise signals. Ego-motion noise features, which are obtained with INMF, are used as inputs to the SB-INMF, and are treated as the fixed features for extracting the target signal. Finally, the target signal is extracted with SB-INMF using these newly-estimated features. The proposed framework was applied to ego-motion noise suppression on two types of humanoid robots. Experimental results showed that ego-motion noise was effectively and efficiently suppressed in terms of both signal-to-noise ratio and performance of automatic speech recognition compared to a conventional template-based ego-motion noise suppression method using motion information. Thus, the proposed method worked properly on a robot without a motion information interface.*

    * This work is an extension of our publication “Taiki Tezuka, Takami Yoshida, Kazuhiro Nakadai: Ego-motion noise suppression for robots based on Semi-Blind Infinite Non-negative Matrix Factorization, ICRA 2014, pp. 6293-6298, 2014.”
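
    The following numpy sketch illustrates the semi-blind idea with ordinary finite-basis NMF: noise bases learned from a noise-only segment are held fixed while target bases and activations are estimated from the mixture. The fixed basis counts and multiplicative KL updates are simplifying assumptions standing in for the paper's nonparametric Bayesian INMF/SB-INMF formulation.

```python
# Minimal sketch of semi-blind NMF noise suppression: noise spectral bases are
# learned in advance and kept fixed, while target bases and all activations are
# estimated from the mixture. Plain multiplicative KL updates and fixed basis
# counts stand in for the paper's Bayesian INMF/SB-INMF.
import numpy as np

rng = np.random.default_rng(0)
n_freq, n_frames, k_noise, k_target = 64, 100, 4, 4
eps = 1e-9

# Magnitude spectrograms (synthetic): ego-noise-only segment and a mixture.
V_noise = rng.gamma(2.0, 1.0, size=(n_freq, n_frames))
V_mix = rng.gamma(2.0, 1.0, size=(n_freq, n_frames))

def nmf(V, W, H, n_iter=100, fix_w_cols=0):
    """KL-NMF multiplicative updates; the first `fix_w_cols` columns of W stay fixed."""
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + eps))) / (W.T.sum(axis=1, keepdims=True) + eps)
        dW = ((V / (W @ H + eps)) @ H.T) / (H.sum(axis=1) + eps)
        dW[:, :fix_w_cols] = 1.0            # do not move the fixed noise bases
        W *= dW
    return W, H

# Step 1: learn noise bases from the noise-only recording.
W_n, _ = nmf(V_noise, rng.random((n_freq, k_noise)), rng.random((k_noise, n_frames)))

# Step 2: decompose the mixture with [fixed noise bases | free target bases].
W0 = np.hstack([W_n, rng.random((n_freq, k_target))])
W, H = nmf(V_mix, W0, rng.random((k_noise + k_target, n_frames)), fix_w_cols=k_noise)

# Step 3: Wiener-like mask built from the target components only.
V_target = W[:, k_noise:] @ H[k_noise:, :]
mask = V_target / (W @ H + eps)
print("mean target mask:", float(mask.mean()))
```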

    Download PDF (2647K)
  • Misato Ohkita, Yoshiaki Bando, Eita Nakamura, Katsutoshi Itoyama, Kazu ...
    Article type: Paper
    2017 Volume 29 Issue 1 Pages 125-136
    Published: February 20, 2017
    Released on J-STAGE: November 20, 2018
    JOURNAL OPEN ACCESS

    This paper presents a real-time beat-tracking method that integrates audio and visual information in a probabilistic manner to enable a humanoid robot to dance in synchronization with music and human dancers. Most conventional music robots have focused on either music audio signals or movements of human dancers to detect and predict beat times in real time. Since a robot needs to record music audio signals with its own microphones, however, the signals are severely contaminated with loud environmental noise. To solve this problem, we propose a state-space model that encodes a pair of a tempo and a beat time in a state-space and represents how acoustic and visual features are generated from a given state. The acoustic features consist of tempo likelihoods and onset likelihoods obtained from music audio signals and the visual features are tempo likelihoods obtained from dance movements. The current tempo and the next beat time are estimated in an online manner from a history of observed features by using a particle filter. Experimental results show that the proposed multi-modal method using a depth sensor (Kinect) to extract skeleton features outperformed conventional mono-modal methods in terms of beat-tracking accuracy in a noisy and reverberant environment.
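
    The sketch below illustrates the state-space idea with a toy particle filter over a (tempo, beat-phase) state; the synthetic onset observations stand in for the paper's audio tempo/onset likelihoods and Kinect-based skeleton features, and all noise levels and thresholds are arbitrary.

```python
# Minimal sketch of particle-filter beat tracking over a (tempo, phase) state;
# the observations are synthetic stand-ins for the paper's audio and dance
# (skeleton) features, and all constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_particles, true_bpm, frame_dt = 500, 120.0, 0.05

tempo = rng.uniform(60.0, 180.0, n_particles)       # beats per minute
phase = rng.uniform(0.0, 1.0, n_particles)          # position within the beat
weights = np.full(n_particles, 1.0 / n_particles)

for frame in range(200):
    t = frame * frame_dt
    # Predict: advance beat phase by the elapsed time, jitter tempo slightly.
    tempo = np.clip(tempo + rng.normal(0.0, 0.5, n_particles), 60.0, 180.0)
    phase = (phase + frame_dt * tempo / 60.0) % 1.0

    # Observe: a synthetic onset that fires exactly on the true 120-BPM beats.
    if np.isclose(t % (60.0 / true_bpm), 0.0, atol=frame_dt / 2):
        dist = np.minimum(phase, 1.0 - phase)        # distance to a predicted beat
        weights *= np.exp(-dist ** 2 / 0.01)
        weights /= weights.sum()

    # Resample when the effective particle count collapses.
    if 1.0 / np.sum(weights ** 2) < n_particles / 2:
        idx = rng.choice(n_particles, size=n_particles, p=weights)
        tempo, phase = tempo[idx], phase[idx]
        weights = np.full(n_particles, 1.0 / n_particles)

print(f"estimated tempo: {np.average(tempo, weights=weights):.1f} BPM")
```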

    Download PDF (2086K)
  • Tito Pradhono Tomo, Alexander Schmitz, Guillermo Enriquez, Shuji Hashi ...
    Article type: Paper
    2017 Volume 29 Issue 1 Pages 137-145
    Published: February 20, 2017
    Released on J-STAGE: November 20, 2018
    JOURNAL OPEN ACCESS

    This paper proposes a way to protect endangered wayang puppet theater, an intangible cultural heritage of Indonesia, by turning a robot into a puppeteer successor. We developed a seven-degrees-of-freedom (DOF) manipulator to actuate the sticks attached to the wayang puppet's body and hands. The robot can imitate 8 distinct manipulations of a human puppeteer. Furthermore, we developed gamelan music pattern recognition, toward a robot that can perform based on the gamelan music. In the offline experiment, we extracted energy (time domain), spectral rolloff, 13 Mel-frequency cepstral coefficients (MFCCs), and the harmonic ratio from 5 s long clips, every 0.025 s, with a window length of 1 s, for a total of 2576 features. Two classifiers, a 3-layer feed-forward neural network (FNN) and a multi-class Support Vector Machine (SVM), were compared. The SVM classifier outperformed the FNN classifier with a recognition rate of 96.4% for identifying the three different gamelan music patterns.
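
    A minimal Python sketch of the feature-plus-classifier pipeline (MFCC and simple spectral statistics feeding a multi-class SVM) is shown below; the synthetic clips, class names, and parameters are placeholders, not the paper's 2576-feature front end or its gamelan recordings.

```python
# Minimal sketch of the feature-extraction-plus-SVM idea: MFCCs plus simple
# rolloff/energy statistics from short clips feed a multi-class SVM. The
# synthetic "clips", labels, and parameters are placeholders only.
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

SR = 16000

def clip_features(y, sr=SR):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # (13, frames)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)    # (1, frames)
    rms = librosa.feature.rms(y=y)                            # (1, frames)
    feats = np.vstack([mfcc, rolloff, rms])
    return np.hstack([feats.mean(axis=1), feats.std(axis=1)]) # fixed-length vector

# Stand-in "clips": three synthetic patterns (replace with real labeled audio).
rng = np.random.default_rng(0)
t = np.arange(int(5.0 * SR)) / SR
def fake_clip(f0):
    return np.sin(2 * np.pi * f0 * t) + 0.1 * rng.normal(size=t.size)

X, y = [], []
for label, f0 in [("pattern_a", 220.0), ("pattern_b", 330.0), ("pattern_c", 440.0)]:
    for _ in range(10):
        X.append(clip_features(fake_clip(f0 * rng.uniform(0.98, 1.02))))
        y.append(label)

X_tr, X_te, y_tr, y_te = train_test_split(np.array(X), np.array(y),
                                          test_size=0.3, random_state=0, stratify=y)
clf = SVC(kernel="rbf", C=10.0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```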

    Download PDF (2043K)
  • Ryo Suzuki, Takuto Takahashi, Hiroshi G. Okuno
    Article type: Paper
    2017 Volume 29 Issue 1 Pages 146-153
    Published: February 20, 2017
    Released on J-STAGE: November 20, 2018
    JOURNAL OPEN ACCESS

    We have developed a self-propelled robotic pet in which the robot audition software HARK (Honda Research Institute Japan Audition for Robots with Kyoto University) was installed to equip it with sound source localization functions, thus enabling it to move in the direction of sound sources. The developed robot, which has neither cameras nor speakers, can communicate with humans by using only its own movements and the surrounding audio information obtained using a microphone. We have confirmed through field experiments, during which participants could gain hands-on experience with our developed robot, that participants behaved or felt as if they were touching a real pet. We also found that its high-precision sound source localization could contribute to the promotion and facilitation of human-robot interactions.

    Download PDF (2404K)
  • Kotaro Hoshiba, Osamu Sugiyama, Akihide Nagamine, Ryosuke Kojima, Mako ...
    Article type: Paper
    2017 Volume 29 Issue 1 Pages 154-167
    Published: February 20, 2017
    Released on J-STAGE: November 20, 2018
    JOURNAL OPEN ACCESS

    We have studied robot-audition-based sound source localization using a microphone array embedded on a UAV (unmanned aerial vehicle) to locate people who need assistance in a disaster-stricken area. A localization method with high robustness against noise and a small calculation cost has been proposed to solve a problem specific to the outdoor sound environment. In this paper, the proposed method is extended for practical use, a system based on the method is designed and implemented, and results of sound source localization conducted in an actual outdoor environment are shown. First, a 2.5-dimensional sound source localization method, which combines two-dimensional sound source localization with distance estimation, is proposed. Then, an offline sound source localization system is constructed using the proposed method, and the accuracy of the localization results is evaluated and discussed. As a result, the usability of the proposed extended method and the newly developed three-dimensional visualization tool is confirmed, and a change in the detection accuracy for different types or distances of the sound source is found. Next, sound source localization is conducted in real time by extending the offline system to an online one, confirming that the detection performance of the offline system is maintained in the online system. Moreover, the relationship between the parameters and the detection accuracy is evaluated so that only a target sound source is localized. As a result, indices for determining an appropriate threshold are obtained, and localization of a target sound source is realized at the designated accuracy.

    Download PDF (3070K)
  • Takahiro Ishiki, Kai Washizaki, Makoto Kumon
    Article type: Paper
    2017 Volume 29 Issue 1 Pages 168-176
    Published: February 20, 2017
    Released on J-STAGE: November 20, 2018
    JOURNAL OPEN ACCESS

    High expectations are placed on the use of unmanned aerial vehicles (UAVs) in such tasks as rescue operations, which require a system that makes use of visual or auditory information to recognize the surrounding environment. As an example of such a system, this study examines the recognition of the environment using a helicopter mounted with a microphone array. Because the rotors of a helicopter generate high noise during operation, it is necessary to reduce the effects of this noise and those from other sources to record the audio signals coming from the ground with onboard microphones. In particular, because of helicopter body control, the rotor speed changes continuously and causes an unsteady rotor noise, which implies that it would be effective to arrange the microphones at a sufficient distance from the rotors. When a large microphone array is employed, however, the array weight may alter the helicopter’s flight characteristics and increase the noise, presenting a dilemma. This paper presents a model of rotor noise that takes into account the effect of the microphone array on the helicopter’s dynamic characteristics and proposes a method of evaluating the optimality of the array configuration, which is necessary for design. The validity of the proposed method is investigated using a multirotor helicopter mounted with a microphone array previously developed by the authors. In addition, an application example for locating sound sources on the ground using this helicopter is presented.

    Download PDF (3375K)
  • Takuma Ohata, Keisuke Nakamura, Akihide Nagamine, Takeshi Mizumoto, Ta ...
    Article type: Paper
    2017 Volume 29 Issue 1 Pages 177-187
    Published: February 20, 2017
    Released on J-STAGE: November 20, 2018
    JOURNAL OPEN ACCESS

    This paper addresses sound source detection in an outdoor environment using a quadcopter with a microphone array. As the previously reported method has a high computational cost, we proposed a sound source detection algorithm called multiple signal classification based on incremental generalized singular value decomposition (iGSVD-MUSIC) that detects the sound source location and temporal activity at low computational cost. In addition, to relax the estimation error problem of a noise correlation matrix that is used in iGSVD-MUSIC, we proposed correlation matrix scaling (CMS) to achieve soft whitening of noise. As CMS requires a parameter to decide the degree of whitening, we analyzed the optimal value of the parameter by using numerical simulation. The prototype system based on the proposed methods was evaluated with two types of microphone arrays in an outdoor environment. The experimental results showed that the proposed iGSVD-MUSIC-CMS significantly improves sound source detection performance, and the prototype system achieves real-time processing. Moreover, we successfully clarified the behavior of the CMS parameter by using a numerical simulation in which the empirically-obtained optimal value corresponded with the analytical result.*

    * This work is an extension of our publication “Takuma Ohata et al.: Improvement in outdoor sound source detection using a quadrotor-embedded microphone array, IROS 2014, pp. 1902-1907, 2014.”

    Download PDF (1914K)
  • Osamu Sugiyama, Satoshi Uemura, Akihide Nagamine, Ryosuke Kojima, Keis ...
    Article type: Paper
    2017 Volume 29 Issue 1 Pages 188-197
    Published: February 20, 2017
    Released on J-STAGE: November 20, 2018
    JOURNAL OPEN ACCESS

    This paper addresses Acoustic Event Identification (AEI) of acoustic signals observed with a microphone array embedded in a quadrotor that is flying in a noisy outdoor environment. In such an environment, noise generated by rotors, wind, and other sound sources is a big problem. To solve this, we propose the combination of two approaches that have recently been introduced: Sound Source Separation (SSS) and Sound Source Identification (SSI). SSS improves the Signal-to-Noise Ratio (SNR) of the input sound, and SSI is then performed on the SNR-improved sound. Two SSS methods are investigated: a single-channel algorithm, Robust Principal Component Analysis (RPCA), and a multichannel method, Geometric High-order Decorrelation-based Source Separation (GHDSS-AS). For SSI, we investigate two types of deep neural networks, namely, the Stacked denoising Autoencoder (SdA) and the Convolutional Neural Network (CNN), which have been extensively studied as high-performance approaches in the fields of automatic speech recognition and visual object recognition. Preliminary experiments have shown the effectiveness of the proposed approaches, a combination of GHDSS-AS and CNN in particular. This combination correctly identified over 80% of the sounds in an 8-class sound classification task with sounds recorded by a hovering quadrotor. In addition, measurement of the prediction time showed that the implemented CNN identifier can be run even on a low-end CPU.

    Download PDF (1281K)
  • Yoshiaki Bando, Hiroshi Saruwatari, Nobutaka Ono, Shoji Makino, Katsut ...
    Article type: Paper
    2017 Volume 29 Issue 1 Pages 198-212
    Published: February 20, 2017
    Released on J-STAGE: November 20, 2018
    JOURNAL OPEN ACCESS

    This paper presents the design and implementation of a two-stage human-voice enhancement system for a hose-shaped rescue robot. When a microphone-equipped hose-shaped robot is used to search for a victim under a collapsed building, human-voice enhancement is crucial because the sound captured by a microphone array is contaminated by the ego-noise of the robot. For achieving both low latency and high quality, our system combines online and offline human-voice enhancement, providing an overview first and then details on demand. The online enhancement is used for searching for a victim in real time, while the offline one facilitates scrutiny by listening to highly enhanced human voices. Our online enhancement is based on an online robust principal component analysis, and our offline enhancement is based on an independent low-rank matrix analysis. The two enhancement methods are integrated with Robot Operating System (ROS). Experimental results showed that both the online and offline enhancement methods outperformed conventional methods.

    Download PDF (2866K)
  • Reiji Suzuki, Shiho Matsubayashi, Richard W. Hedley, Kazuhiro Nakadai, ...
    Article type: Paper
    2017 Volume 29 Issue 1 Pages 213-223
    Published: February 20, 2017
    Released on J-STAGE: November 20, 2018
    JOURNAL OPEN ACCESS

    Understanding auditory scenes is important when deploying intelligent robots and systems in real-world environments. We believe that robot audition can better recognize acoustic events in the field than conventional methods such as human observation or recording using a single-channel microphone. We are particularly interested in acoustic interactions among songbirds. Birds do not always vocalize at random; for example, they may divide a soundscape so that they avoid overlapping their songs with those of other birds. To understand such complex interaction processes, we must collect a large amount of spatiotemporal data in which multiple individuals and species are singing simultaneously. However, it is costly and difficult to annotate many or long recorded tracks manually to detect their interactions. In order to solve this problem, we are developing HARKBird, an easily available and portable system consisting of a laptop PC with the open-source robot audition software HARK (Honda Research Institute Japan Audition for Robots with Kyoto University) together with a low-cost and commercially available microphone array. HARKBird enables us to extract the songs of multiple individuals from recordings automatically. In this paper, we introduce the current status of our project and report preliminary results of recording experiments in two different types of forests – one in the USA and the other in Japan – using this system to automatically estimate the direction of arrival of the songs of multiple birds and separate them from the recordings. We also discuss asymmetries among species in terms of their tendency to partition temporal resources.

    Download PDF (4582K)
  • Shiho Matsubayashi, Reiji Suzuki, Fumiyuki Saito, Tatsuyoshi Murate, T ...
    Article type: Paper
    2017 Volume 29 Issue 1 Pages 224-235
    Published: February 20, 2017
    Released on J-STAGE: November 20, 2018
    JOURNAL OPEN ACCESS

    This paper reports the results of our field test of HARKBird, a portable system that consists of robot audition software, a laptop PC, and omnidirectional microphone arrays. We assessed its localization accuracy in monitoring songs of the great reed warbler (Acrocephalus arundinaceus) in time and two-dimensional space by comparing locational and temporal data collected by human observers and by HARKBird. Our analysis revealed that the stationarity of the singing individual affected the spatial accuracy. Temporally, HARKBird successfully captured the exact song duration in seconds, which cannot be easily achieved by human observers. The data derived from HARKBird suggest that one of the warbler males dominated the sound space. Given the assumption that the cost of singing activity is represented by song duration in relation to the total recording session, this particular male paid a higher cost of singing, possibly to win the territory of the best quality. Overall, this study demonstrated the high potential of HARKBird as an effective alternative to the point-count method for surveying bird songs in the field.

    Download PDF (2211K)
  • Ryosuke Kojima, Osamu Sugiyama, Kotaro Hoshiba, Kazuhiro Nakadai, Reij ...
    Article type: Paper
    2017 Volume 29 Issue 1 Pages 236-246
    Published: February 20, 2017
    Released on J-STAGE: November 20, 2018
    JOURNAL OPEN ACCESS

    This paper addresses bird song scene analysis based on semi-automatic annotation. Research in animal behavior, especially in birds, would be aided by automated or semi-automated systems that can localize sounds, measure their timing, and identify their sources. This is difficult to achieve in real environments, in which several birds at different locations may be singing at the same time. Analysis of recordings from the wild has usually required manual annotation. These annotations may be inaccurate or inconsistent, as they may vary within and between observers. Here we suggest a system that uses automated methods from robot audition, including sound source detection, localization, separation, and identification. In robot audition, these technologies are assessed separately, but combining them has often led to poor performance in natural settings. We propose a new Spatial-Cue-Based Probabilistic Model (SCBPM) for their integration focusing on spatial information. A second problem has been that supervised machine learning methods usually require a pre-trained model, which may need a large training set of annotated labels. We have employed a semi-automatic annotation approach, in which a semi-supervised training method is deduced for a new model. This method requires much less pre-annotation. Preliminary experiments with recordings of bird songs from the wild revealed that our system outperformed a method based on conventional robot audition in identification accuracy.*

    * This paper is an extension of our IROS 2015 conference paper.

    Download PDF (2125K)
  • Ikkyu Aihara, Ryu Takeda, Takeshi Mizumoto, Takuma Otsuka, Hiroshi G. ...
    Article type: Paper
    2017 Volume 29 Issue 1 Pages 247-254
    Published: February 20, 2017
    Released on J-STAGE: November 20, 2018
    JOURNAL OPEN ACCESS

    Sensing the external environment is a core function of robots and autonomous machines. This function is useful for monitoring and analyzing ecosystems, which deepens our understanding of nature and helps sustain those ecosystems. Here, we investigate the calling behavior of male frogs by applying audio-processing techniques to multiple audio recordings. In general, male frogs call from their breeding site, and a female frog approaches one of the males by hearing their calls. First, we conducted an indoor experiment to record the spontaneous calling behavior of three male Japanese tree frogs, and then separated their call signals using independent component analysis. The analysis of the separated signals shows that chorus size (i.e., the number of calling frogs) has a positive effect on call number, inter-call intervals, and chorus duration. We speculate that competition in a large chorus encourages the male frogs to make their call properties more attractive to conspecific females.
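
    The following scikit-learn sketch shows the independent-component-analysis step on synthetic data: three artificial pulse-train "calls" are mixed into three channels and separated with FastICA; the signals, mixing matrix, and call-counting heuristic are stand-ins for the actual multichannel recordings and analysis.

```python
# Minimal sketch: blind separation of simultaneous calls from multichannel
# recordings using FastICA; the synthetic "frog" signals and mixing matrix are
# stand-ins for the actual field data.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
fs, dur = 8000, 3.0
t = np.arange(int(fs * dur)) / fs

# Three synthetic callers: pulse trains with different rates and carriers.
def calls(rate_hz, carrier_hz):
    envelope = (np.sin(2 * np.pi * rate_hz * t) > 0.95).astype(float)
    return envelope * np.sin(2 * np.pi * carrier_hz * t)

S = np.c_[calls(2.0, 900.0), calls(3.0, 1200.0), calls(5.0, 700.0)]
A = rng.uniform(0.3, 1.0, size=(3, 3))             # unknown mixing (3 microphones)
X = S @ A.T + 0.01 * rng.normal(size=(len(t), 3))   # observed mixtures

ica = FastICA(n_components=3, random_state=0)
S_est = ica.fit_transform(X)                        # (samples, 3) separated signals

# Count call pulses per separated channel by thresholding the envelope.
for i in range(3):
    env = np.abs(S_est[:, i]) > 3 * np.abs(S_est[:, i]).std()
    onsets = np.sum(np.diff(env.astype(int)) == 1)
    print(f"separated channel {i}: ~{onsets} call pulses detected")
```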

    Download PDF (1355K)
  • Takeshi Mizumoto, Ikkyu Aihara, Takuma Otsuka, Hiromitsu Awano, Hirosh ...
    Article type: Paper
    2017 Volume 29 Issue 1 Pages 255-267
    Published: February 20, 2017
    Released on J-STAGE: November 20, 2018
    JOURNAL OPEN ACCESS

    While many robots have been developed to monitor environments, most studies are dedicated to navigation and locomotion and use off-the-shelf sensors. We focus on a novel acoustic device and its processing software, which is designed for a swarm of environmental monitoring robots equipped with the device. This paper demonstrates that a swarm of monitoring devices is useful for biological field studies, i.e., understanding the spatio-temporal structure of acoustic communication among animals in their natural habitat. The following processes are required in monitoring acoustic communication to analyze the natural behavior in the field: (1) working in their habitat, (2) automatically detecting multiple and simultaneous calls, (3) minimizing the effect on the animals and their habitat, and (4) working with various distributions of animals. We present a sound-imaging system using sound-to-light conversion devices called “Fireflies” and their data analysis method that satisfies the requirements. We can easily collect data by placing a swarm (dozens) of Fireflies and record their light intensities using an off-the-shelf video camera. Because each Firefly converts sound in its vicinity into light, we can easily obtain when, how long, and where animals call using temporal analysis of the Firefly light intensities. The device is evaluated in terms of three aspects: volume-to-light-intensity characteristics, battery life through indoor experiments, and water resistance via field experiments. We also present the visualization of a chorus of Japanese tree frogs (Hyla japonica) recorded in their habitat, that is, paddy fields.
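
    A small Python sketch of the temporal analysis step is given below: given a light-intensity trace extracted for one Firefly from video frames, thresholding recovers when and for how long the nearby frog called; the trace, frame rate, and threshold rule are synthetic assumptions, not the field data or the authors' exact procedure.

```python
# Minimal sketch of the Firefly analysis idea: threshold a per-device light-
# intensity trace (extracted from video frames) to recover call timing and
# duration; the trace here is synthetic, not field data.
import numpy as np

fps = 30.0
rng = np.random.default_rng(0)
t = np.arange(int(60 * fps)) / fps                  # 60 s of video frames

# Synthetic trace: baseline glow plus bright intervals while the frog calls.
intensity = 10.0 + rng.normal(0.0, 1.0, t.size)
for start in (5.0, 20.0, 41.5):                     # three calling bouts
    intensity[(t >= start) & (t < start + 2.0)] += 30.0

# Robust threshold: mean plus three median absolute deviations.
mad = np.median(np.abs(intensity - np.median(intensity)))
on = intensity > intensity.mean() + 3 * mad
edges = np.diff(on.astype(int))
starts, ends = np.where(edges == 1)[0] + 1, np.where(edges == -1)[0] + 1

for s, e in zip(starts, ends):
    print(f"call from {t[s]:.2f} s to {t[e]:.2f} s (duration {t[e] - t[s]:.2f} s)")
```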

    Download PDF (1539K)
Regular Papers
  • Atsushi Mitani, Yuhei Suzuki, Yuta Tochigi
    Article type: Development Report
    2017 Volume 29 Issue 1 Pages 269-272
    Published: February 20, 2017
    Released on J-STAGE: November 20, 2018
    JOURNAL OPEN ACCESS

    The mobile robot Riden, which uses a trident motif, was developed for the robot-triathlon contest held annually in Hokkaido, Japan. The robot-triathlon contest involves three tasks: line tracing, a wandering forest, and cone stacking. Robots must complete these tasks as fast as possible under autonomous control. This means that functional design usually takes priority over aesthetic appeal. We are the only team from a college of design education taking part, and our teams have developed robots that take both function and aesthetic appeal into account. Using 3D modeling technology, design education students apply their design and modeling skills to create robots that are both aesthetic and functional. Riden was designed using SolidWorks 3D-CAD software, and its parts were fabricated using a 3D printer.

    Download PDF (3096K)