Acoustical Science and Technology
Online ISSN : 1347-5177
Print ISSN : 1346-3969
ISSN-L : 0369-4232
Volume 41, Issue 1
The Commemoration of Universal Acoustical Communication Month 2018 (UAC2018)
Displaying 1-50 of 95 articles from this issue
FOREWORD
INVITED TUTORIALS
  • Chia-huei Tseng, Ya-Ting Wang, Satoshi Shioiri
    2020 Volume 41 Issue 1 Pages 2-5
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    ``Ma'' is a Japanese word with very rich meanings. It is commonly used by Japanese speakers to refer to space, time, and the things in between. Mutual understanding of, and agreement on, this concept among the individuals of a group is key to sustaining social harmony. In the past, this concept has been discussed primarily in literature and the humanities, and little in the scientific and engineering communities. In this presentation, I will offer a few examples (e.g. the appreciation of silence in music, and Japanese comic storytelling, rakugo) to demonstrate that it is possible to investigate the concept of ``Ma'' scientifically with an interdisciplinary approach. Furthermore, this may provide a starting point for designers and engineers to delve into interpersonal communication on other abstract concepts.

    Download PDF (542K)
  • Charles Spence
    2020 Volume 41 Issue 1 Pages 6-12
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    The last few years have seen an explosion of interest from researchers in the crossmodal correspondences, defined as the surprising connections that the majority of people share between seemingly unrelated stimuli presented in different sensory modalities. Intriguingly, many of the crossmodal correspondences that have been documented/studied to date have involved audition as one of the corresponding modalities. In fact, auditory pitch may well be the single most commonly studied dimension in correspondences research thus far. That said, relatively separate literatures have focused on the crossmodal correspondences involving simple versus more complex auditory stimuli. In this review, I summarize the evidence in this area and consider the relative explanatory power of the various different accounts (statistical, structural, semantic, and emotional) that have been put forward to explain the correspondences. The suggestion is made that the relative contributions of the different accounts likely differ in the case of correspondences involving simple versus more complex stimuli (i.e., pure tones vs. short musical excerpts). Furthermore, the consequences of presenting corresponding versus non-corresponding stimuli likely also differ in the two cases. In particular, while crossmodal correspondences may facilitate binding (i.e., multisensory integration) in the case of simple stimuli, the combination of more complex stimuli (such as, for example, musical excerpts and paintings) may instead be processed more fluently when the component stimuli correspond. Finally, attention is drawn to the fact that the existence of a crossmodal correspondence does not in-and-of-itself necessarily imply that a crossmodal influence of one modality on the perception of stimuli in the other will also be observed.

    Download PDF (87K)
  • Katharine Molloy, Nilli Lavie, Maria Chait
    2020 Volume 41 Issue 1 Pages 13-15
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    The extent to which auditory processing depends on attention has been a key question in auditory cognitive neuroscience, crucial for establishing how the acoustic environment is represented in the brain when attention is directed away from sound. Here I review emerging behavioural and brain imaging results which demonstrate that, contrary to the traditional view of a computationally encapsulated system, the auditory system shares computational resources with the visual system: high demand on visual processing (e.g. as a consequence of a task with high perceptual load) can undercut auditory processing such that both the neural response to, and perceptual awareness of, non-attended sounds are impaired. These results are discussed in terms of our understanding of the architecture of the auditory modality and its role as the brain's early warning system.

    Download PDF (49K)
  • Craig T. Jin
    2020 Volume 41 Issue 1 Pages 16-27
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    There is renewed interest in virtual auditory perception and spatial audio arising from a technological drive toward enhanced perception via mixed-reality systems. Because the various technologies for three-dimensional (3D) sound are so numerous, this tutorial focuses on underlying principles. We consider the rendering of virtual auditory space via both loudspeakers and binaural headphones. We also consider the recording of sound fields and the simulation of virtual auditory space. Special attention is given to areas with the potential for further research and development. We highlight some of the more recent technologies and provide references so that participants can explore issues in more detail.

    Download PDF (820K)
INVITED REVIEWS
  • Charles Spence
    2020 Volume 41 Issue 1 Pages 28-36
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    A growing number of food and beverage brands have recently started to become interested in trying to link extraordinary emotional experiences to their product offerings. Oftentimes, such extraordinary responses are triggered by product-extrinsic auditory and, to a lesser extent, visual stimuli, such as music or videos having particular sensory qualities or semantic meaning. While much of the interest in this area recently has been linked to the Autonomous Sensory Meridian Response (ASMR), it is worth noting that there are also a number of other responses, such as chills, thrills, and so-called `skin orgasms' that have been documented previously, if not always in a food-related context. Elsewhere, both multisensory dining experiences and experiential events have also been reported to bring people to tears. There are, in other words, a number of extraordinary emotional responses that can or, in some cases, already have been linked to the consumption of food and drink. While such responses to auditory stimuli (increasingly mediated by technology) in the context of food are by no means widespread, they nevertheless hold the potential of delivering dramatic food and beverage experiences that offer the promise of being more stimulating, more memorable, and more emotionally-engaging than anything that has gone before.

    Download PDF (203K)
  • Kaoru Sekiyama
    2020 Volume 41 Issue 1 Pages 37-38
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    Speech perception is often audiovisual, as demonstrated in the McGurk effect: Auditory and visual speech cues are integrated even when they are incongruent. Although this illusion suggests a universal process of audiovisual integration, the process has been shown to be modulated by language backgrounds. This paper reviews studies investigating inter-language differences in audiovisual speech perception. In these examinations with behavioral and neural data, it is shown that native speakers of English use visual speech cues more than those of Japanese, with different neural underpinnings for the two language groups.

    Download PDF (43K)
  • Hirokazu Takahashi
    2020 Volume 41 Issue 1 Pages 39-47
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    Place codes of frequency, or tonotopic maps, are commonly found along the auditory pathway, from the cochlea to the auditory cortex, and are thus believed to play substantial roles in auditory computation. In the auditory cortex, however, tonotopic activation is clearly observed in the onset responses within a 50-ms post-stimulus latency but decays rapidly for long-lasting or suboptimal stimuli, suggesting that the neural representation is formed beyond the tonotopic map. We recently demonstrated in the rat auditory cortex that the degree of response variance is closely correlated with the size of the representational area, suggesting that place coding is an effective strategy for generating diverse response properties within a neural population. We also demonstrated long-lasting, sound-induced steady-state local synchrony within the auditory cortex, where the neural representation might be formed in a manner different from the transient tonotopic activation at stimulus onset. These results support the idea of Darwinian computation, in which the tonotopic map effectively creates response variance, while the steady-state synchrony gradually selects a neural population beyond the tonotopic map.

    Download PDF (528K)
  • Maria Chait
    2020 Volume 41 Issue 1 Pages 48-53
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    Sensitivity to patterns is fundamental to sensory processing, in particular in the auditory system, and a major component of the influential `predictive coding' theory of brain function. Supported by growing experimental evidence, the `predictive coding' framework suggests that perception is driven by a mechanism of inference, based on an internal model of the signal source. However, a key element of this theory, the process through which the brain acquires this model and its neural underpinnings, remains poorly understood. Here I review recent brain imaging and behavioural work which focuses on this missing link. Together these emerging results paint a picture of the brain as a regularity seeker, rapidly extracting and maintaining representations of acoustic structure on multiple time scales, even when these are not relevant to behaviour.

    Download PDF (534K)
  • Andrej Kral, Mika Sato
    2020 Volume 41 Issue 1 Pages 54-58
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    The present manuscript reviews the contribution of cochlear implants to the understanding of the impact of congenital deafness on brain development. The results show that many characteristics of the afferent auditory system, particularly the anatomical features, are genetically determined (are ``nature'' in origin), and that experience is used to maintain and improve them to allow the discrimination of auditory stimuli. Experience (``nurture'') is additionally required to group auditory features at the level of the auditory cortex into abstract ``auditory objects.'' This requires interaction between bottom-up and top-down streams of information processing, since features define objects and context (i.e. active objects) defines which features may carry relevant information in a given condition. The interaction of feature-level and object-level representations is enabled by columnar microcircuits. The integration of bottom-up and top-down streams of information also controls adult learning. Since congenital deafness interferes with the relevant microcircuitry of the cortical column, congenital deafness, if it persists beyond a certain age, also leads to the failure of key high-level auditory processes, including the switch between juvenile and adult learning, and therefore closes the sensitive periods for its therapy.

    Download PDF (65K)
  • M. Charles Liberman
    2020 Volume 41 Issue 1 Pages 59-62
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    In acquired sensorineural hearing loss, the hearing impairment arises mainly from damage to cochlear hair cells or the sensory fibers of the auditory nerve that innervate them. Hair cell loss or damage is well captured by the changes in the threshold audiogram, but the degree of neural damage is not. We have recently shown, in animal models of noise damage and aging, and in autopsy specimens from aging humans, that the synapses connecting inner hair cells and auditory nerve fibers are the first to degenerate. This primary neural degeneration, or cochlear synaptopathy, leaves many surviving inner hair cells permanently disconnected from their sensory innervation, and many spiral ganglion cells surviving with only their central projections to the brainstem intact. This pathology represents a kind of ``hidden hearing loss.'' This review summarizes current speculations as to the functional consequences of this primary neural degeneration and the prospects for a therapeutic rescue based on local delivery of neurotrophins to elicit neurite extension and synaptogenesis in the adult ear.

    Download PDF (176K)
  • Shigeto Furukawa, Hiroki Terashima, Takuya Koumura, Hiroaki Tsukano
    2020 Volume 41 Issue 1 Pages 63-66
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    Typical neurophysiological experiments employ ``hypothesis-driven'' approaches: Researchers set a specific hypothesis, based on which stimuli and their parameters are chosen. However, there is always a concern that the hypothesis or stimulus parameters could be irrelevant to the essence of the brain function under study. The present paper reviews the authors' recent studies that have applied ``data-driven'' approaches, as relatively hypothesis-free methodologies, to traditional questions in auditory neurophysiology, such as neural frequency tuning and cortical topography. The results provide some new insights into the functional organization of the cortex and the optimality of the brain structure for auditory processing.

    Download PDF (404K)
  • Aleksander P. Sęk, Brian C. J. Moore
    2020 Volume 41 Issue 1 Pages 67-74
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    Current research in the field of psychoacoustics is mostly conducted using a computer to generate and present the stimuli and to collect the responses of the subject. However, writing the computer software to do this is time-consuming and requires technical expertise that is not possessed by many would-be researchers. We have developed a software package that makes it possible to set up and conduct a wide variety of experiments in psychoacoustics without the need for time-consuming programming or technical expertise. The only requirements are a personal computer (PC) with a good-quality sound card and a set of headphones. Parameters defining the stimuli and procedure are entered via boxes on the screen and drop-down menus. Possible experiments include measurement of the absolute threshold, simultaneous and forward masking (including notched-noise masking), comodulation masking release, intensity and frequency discrimination, amplitude-modulation detection and discrimination, gap detection, discrimination of interaural time and level differences, measurement of sensitivity to temporal fine structure, and measurement of the binaural masking level difference. The software is intended to be useful both for researchers and for students who want to try psychoacoustic experiments for themselves, which can be very valuable in helping them gain a deeper understanding of auditory perception.
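    As a rough illustration of the kind of adaptive procedure such experiments rely on, the sketch below implements a generic two-down, one-up staircase. It is an illustrative assumption rather than code from the package itself, and present_trial is a hypothetical callback that plays a stimulus and returns whether the listener's forced-choice response was correct.

    ```python
    # Generic two-down, one-up adaptive staircase (converges near 70.7% correct).
    # Not taken from the software package described above; purely illustrative.
    def adaptive_staircase(present_trial, start_level=60.0, step=4.0,
                           min_step=1.0, max_reversals=8, max_trials=200):
        level, n_correct, direction = start_level, 0, -1
        reversals, trials = [], 0
        while len(reversals) < max_reversals and trials < max_trials:
            trials += 1
            if present_trial(level):            # True if the 2AFC response was correct
                n_correct += 1
                if n_correct == 2:              # two correct in a row -> make task harder
                    n_correct = 0
                    if direction == +1:         # direction change = reversal
                        reversals.append(level)
                        step = max(step / 2.0, min_step)
                    direction = -1
                    level += direction * step
            else:                               # one incorrect -> make task easier
                n_correct = 0
                if direction == -1:
                    reversals.append(level)
                    step = max(step / 2.0, min_step)
                direction = +1
                level += direction * step
        last = reversals[-4:]
        return sum(last) / len(last) if last else float("nan")  # threshold estimate
    ```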

    Download PDF (263K)
  • Brian C. J. Moore
    2020 Volume 41 Issue 1 Pages 75-82
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    The quality of an audio device depends on how accurately the device transmits the properties of the sound source to the ear(s) of the listener. Two types of ``distortion'' can occur: (1) ``Linear'' distortion, namely a deviation of the frequency response from the ``target'' response; (2) Nonlinear distortion, which is characterised by frequency components in the output of the device that were not present in the input. These two forms of distortion have different perceptual effects. Their effects on sound quality can be predicted using a model of auditory processing with the following stages: filters simulating the transmission of sound from the device to the ear of the listener and through the middle ear; and an array of bandpass filters simulating the filters that exist in the cochlea. For predicting the perceptual effects of linear distortion, a model operating in the frequency domain can be used. For predicting the perceptual effects of nonlinear distortion, a model operating in the time domain is required, since the detailed waveforms at the outputs of the auditory filters need to be considered. The models described give accurate predictions for a wide range of ``artificial'' and ``real'' linear and nonlinear distortions.
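    As a toy illustration of the frequency-domain idea (a deliberately simplified stand-in, not the model described above), one can sample a device's deviation from the target response on an ERB-number-spaced grid and summarize its spread.

    ```python
    # Toy "linear distortion" score: spread of the level error between a device's
    # measured response and a target response, sampled at ERB-number-spaced
    # frequencies (Glasberg & Moore ERB-number scale). Illustrative only.
    import numpy as np

    def erb_spaced_freqs(f_lo=50.0, f_hi=16000.0, n=40):
        # ERB-number = 21.4 * log10(4.37 * f_kHz + 1); invert to get Hz values
        e_lo = 21.4 * np.log10(4.37 * f_lo / 1000 + 1)
        e_hi = 21.4 * np.log10(4.37 * f_hi / 1000 + 1)
        e = np.linspace(e_lo, e_hi, n)
        return (10 ** (e / 21.4) - 1) / 4.37 * 1000

    def linear_distortion_score(freqs_hz, device_db, target_db, n_bands=40):
        eval_f = erb_spaced_freqs(max(freqs_hz[0], 50.0),
                                  min(freqs_hz[-1], 16000.0), n_bands)
        err = (np.interp(eval_f, freqs_hz, device_db)
               - np.interp(eval_f, freqs_hz, target_db))
        return np.std(err)   # larger spread -> more audible spectral colouration
    ```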

    Download PDF (935K)
  • Hedwig E. Gockel
    2020 Volume 41 Issue 1 Pages 83-89
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    In recent years, there has been increased interest in the scalp-recorded frequency following response (FFR), which is an electrical signal that reflects sustained phase locking to sound of large populations of neurons mainly in the upper brainstem in response to stimulus-related periodicities. It provides a non-invasive measure of neural processing in humans, which can be compared to behavioural responses concerning the listener's perception. It has been argued that the FFR reflects processes important for the perception of pitch and that changes in the FFR with experience and/or training provide a measure of neural plasticity at the level of the brainstem. This paper reviews recent work aimed at elucidating the origin and the specifics of the information present in the FFR. It is argued that the neural responses measured by the FFR preserve temporal information important for pitch to a certain degree, but do not necessarily represent pitch-related processing over and above that present in the auditory periphery. In addition, multiple generators may affect the overall measure to various degrees, depending on the repetition rate of the stimulus.

    Download PDF (568K)
  • Takayuki Arai
    2020 Volume 41 Issue 1 Pages 90-93
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    Several types of physical models of the human vocal tract have previously been developed by our group. Even though they were originally designed for educational purposes in acoustics and speech science, some of the models can also be applied to research, pronunciation training, and clinical purposes. For example, a model for the English /r/ was originally designed to teach how the sound is produced, but we have also found it to be effective when applied to practicing English vowels for non-native speakers. Another model, for the lateral approximant, was originally designed to teach how lateral sounds are produced. This model was then tested to measure differences between the sounds radiated in the center and lateral directions, with the possibility of evaluating misarticulation in a clinical situation. A recent model with a movable lower lip and a rotating tongue to imitate the retroflex gesture was used to simulate the English /br/ cluster, a particularly difficult speech sound for Japanese native speakers. By using this model, users can visibly observe the movement of each articulator, with individually adjustable parameters for producing different speech sounds. Thus, the vocal-tract models can potentially contribute to the field of speech communication.

    Download PDF (522K)
  • Birger Kollmeier, Constantin Spille, Angel Mario Castro Martínez, Step ...
    2020 Volume 41 Issue 1 Pages 94-98
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    The advantages and limitations of utilizing automatic speech recognition (ASR) techniques for modelling human speech recognition are investigated for a set of ``critical'' speech maskers for which many standard models of human speech recognition fail. A deep neural net (DNN)-based ASR system utilizing a closed-set sentence recognition test is used to model the speech recognition threshold (SRT) of normal-hearing listeners for a variety of noise types. The benchmark data from Schubotz et al. (2016) include SRTs measured in conditions with an increasing complexity in terms of spectro-temporal modulation (from stationary speech-shaped noise to a single interfering talker). The DNN-based model as proposed in Spille et al. (2018) produces a higher prediction accuracy than baseline models (i.e., SII, ESII, STOI, and mr-sESPM) even though it does not require a clean speech reference signal (as is the case for most auditory model-based SRT predictions). The most accurate predictions are obtained with multi-condition training with known noise types and ASR features that explicitly account for temporal modulations in noisy sentences. Another advantage of the approach is that the DNN can serve as a valuable analysis tool to uncover signal recognition strategies: For instance, by identifying the most relevant cues for correct classification in modulated noise, it is shown that the DNN is listening in the dips. Finally, we present preliminary data indicating that the word error rate (WER) of the model can be replaced with an estimate of the WER, which does not require the transcript of utterances during test time and therefore eliminates an important limitation of the previous model that prevented it from being used in real-world scenarios.
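    A minimal sketch of how an SRT can be read off a psychometric function, whether the scores come from listeners or from an ASR back end, is given below; the 50%-correct target and the linear interpolation are illustrative choices, not details taken from the cited model.

    ```python
    # Illustrative SRT estimation: interpolate recognition scores measured at
    # several SNRs to the SNR giving the target fraction correct.
    import numpy as np

    def estimate_srt(snrs_db, fraction_correct, target=0.5):
        snrs = np.asarray(snrs_db, dtype=float)
        pc = np.asarray(fraction_correct, dtype=float)
        order = np.argsort(snrs)
        snrs, pc = snrs[order], pc[order]
        for i in range(len(snrs) - 1):
            lo, hi = pc[i], pc[i + 1]
            if (lo - target) * (hi - target) <= 0 and lo != hi:
                # linear interpolation between the two bracketing SNRs
                return snrs[i] + (target - lo) * (snrs[i + 1] - snrs[i]) / (hi - lo)
        return float("nan")   # target performance not bracketed by the measurements

    # e.g. estimate_srt([-12, -9, -6, -3, 0], [0.12, 0.31, 0.55, 0.78, 0.93])
    ```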

    Download PDF (312K)
  • Toshio Irino, Roy D. Patterson
    2020 Volume 41 Issue 1 Pages 99-107
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    We review the gammachirp (GC) auditory filter and its use in speech perception research. The GC was originally developed to explain the asymmetric auditory filter shapes derived in notched-noise (NN) masking studies, and the strongly compressive input-output function observed in the mammalian cochlea. This compressive GC was fitted to a very large collection of NN masking thresholds measured with a wide range of stimulus levels and center frequencies. The fit showed how the GC auditory filter could explain NN masking throughout the domain of human hearing with a relatively small number of parameters, only one of which was level dependent. Subsequently, a dynamic, compressive GC filterbank (dcGC-FB) was developed to simulate time-domain cochlear processing. This dcGC-FB has been used to cancel the peripheral compression of normal hearing and thereby simulate the most common forms of hearing loss. This simulator allows normal-hearing listeners to experience the difficulties of hearing-impaired listeners. It has been used in training courses for speech-language-hearing therapists and in psychoacoustic experiments. The dcGC-FB has also been used for modeling speaker size perception and predicting speech intelligibility with GEDI (the gammachirp envelope distortion index).
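    For readers unfamiliar with the filter, a commonly cited form of the gammachirp impulse response is sketched below; the notation follows the general literature and is not quoted from this paper.

    ```latex
    % Gammachirp impulse response (general-literature form; c = 0 gives a gammatone)
    g_c(t) = a\, t^{\,n-1} \exp\!\bigl(-2\pi b\,\mathrm{ERB}_N(f_r)\, t\bigr)
             \cos\!\bigl(2\pi f_r t + c \ln t + \varphi\bigr), \qquad t > 0
    ```

    Here n is the order of the gamma envelope, b scales the bandwidth, f_r is the asymptotic frequency, and c is the chirp parameter that produces the level-dependent asymmetry.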

    Download PDF (896K)
  • Andrew J. Oxenham
    2020 Volume 41 Issue 1 Pages 108-112
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    We are generally able to identify sounds and understand speech with ease, despite the large variations in the acoustics of each sound, which occur due to factors such as different talkers, background noise, and room acoustics. This form of perceptual constancy is likely to be mediated in part by the auditory system's ability to adapt to the ongoing environment or context in which sounds are presented. Auditory context effects have been studied under different names, such as spectral contrast effects in speech and auditory enhancement effects in psychoacoustics, but they share some important properties and may be mediated by similar underlying neural mechanisms. This review provides a survey of recent studies from our laboratory that investigate the mechanisms of speech spectral contrast effects and auditory enhancement in people with normal hearing, hearing loss, and cochlear implants. We argue that a better understanding of such context effects in people with normal hearing may allow us to restore some of these important effects for people with hearing loss via signal processing in hearing aids and cochlear implants, thereby potentially improving auditory and speech perception in the complex and variable everyday acoustic backgrounds that surround us.

    Download PDF (140K)
  • William A. Yost, M. Torben Pastore, Michael F. Dorman
    2020 Volume 41 Issue 1 Pages 113-120
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    A review is provided of data published or presented by the authors from two populations of subjects (normal-hearing listeners and patients fit with cochlear implants, CIs), involving research on sound source localization when listeners move. The overall theme of the review is that sound source localization requires an integration of auditory-spatial and head-position cues and is, therefore, a multisystem process. Research with normal-hearing listeners includes that related to the Wallach Azimuth Illusion, and additional aspects of sound source localization perception when listeners and sound sources rotate. Research with CI patients involves investigations of sound source localization performance by patients fit with a single CI, bilateral CIs, a CI and a hearing aid (bimodal patients), and single-sided deaf patients with one normally functioning ear and the other ear fit with a CI. Past research involving stationary CI patients and more recent data based on CI patients' use of head rotation to localize sound sources are summarized.

    Download PDF (603K)
  • Hayato Sato, Hiroshi Sato, Masayuki Morimoto
    2020 Volume 41 Issue 1 Pages 121-128
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    In Japan, auditory guide signals are installed in public spaces mainly for the purpose of guiding visually impaired pedestrians. The acoustic signal is emitted from a loudspeaker installed at a destination such as a ticket gate, a stairway, or a restroom. The pedestrians then move in accordance with the spatial information obtained from the signal. As the auditory guide signal is targeted at pedestrians, not only static but also dynamic sound localization cues are effective. In addition, unlike other applications such as sound field reproduction, precise sound image localization is not necessarily required; in this application it is more important to grasp a rough sound source position. Furthermore, the degradation of sound localization accuracy owing to background noise and reverberation cannot be ignored in public spaces. Considering the above factors, this review introduces the research on the optimization of auditory guide signals, based on sound localization tests with human listeners, carried out so far by the authors and their colleagues.

    Download PDF (1317K)
  • Sungyoung Kim
    2020 Volume 41 Issue 1 Pages 129-133
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    This paper reviews previous experimental studies on the relationship between a listener's cultural framework and the auditory perception of an enclosed space. Cultural influence on the auditory perception of noise and music has been assessed through a range of studies. Is the same true for spatial hearing? When we enter a space, does a particular cultural framework influence our understanding of the corresponding auditory environment? Just as physical buildings and enclosures reflect architectural and visual heritage, the auditory environment of an enclosed space also represents a unique and distinct heritage in which people have interacted with and shaped their culture. When two listener groups (East Asian and North American) compared reproduced sound fields, previous findings show that (1) the semantic value of the same descriptor was distinctly different for the two groups, and (2) there was an inverse relationship between the area of a listener's personal space and the size of the desired (preferred) auditory environment. With the advance of virtual reality (VR) technology, listeners can enter any auditory environment ubiquitously. Therefore, researchers and developers in the field should consider multiple user groups and the role of the cultural framework in virtual environments.

    Download PDF (1350K)
  • Dingding Yao, Huaxing Xu, Junfeng Li, Risheng Xia, Yonghong Yan
    2020 Volume 41 Issue 1 Pages 134-141
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    To provide listeners with an immersive listening experience, binaural rendering technology has become an important research topic, especially with the rising prominence of virtual/augmented reality in recent years. In this paper, we introduce our recent works on binaural rendering technology over headphones and loudspeakers. The first work is on crosstalk cancellation (CTC), which is critical for loudspeaker-based binaural rendering. An improved free-field CTC method is first presented, in which the head effects are formulated using an attenuation factor and a phase difference factor. A stochastic robust approximation method is then suggested to further improve its robustness against the perturbations caused by the listener's head movement or rotation. The second work is on the elevation perception of sound images, which is critical for both headphone- and loudspeaker-based binaural rendering methods. By analyzing the elevation-dependent head-related transfer function (HRTF), a parametric elevation control approach is presented in which the key perceptual cues (i.e., spectral peaks and notches) are modeled using digital filters and controlled according to learned rules. The effectiveness and performance of the suggested algorithms are verified by subjective and objective experiments.
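    For orientation, a textbook-style frequency-domain CTC design (regularized inversion of the 2x2 loudspeaker-to-ear matrix) is sketched below; it is a generic baseline under assumed inputs, not the improved free-field or stochastic-robust method proposed in the paper.

    ```python
    # Generic crosstalk-cancellation sketch: Tikhonov-regularized inversion of the
    # loudspeaker-to-ear transfer matrix at each frequency bin. Illustrative only.
    import numpy as np

    def ctc_filters(H, beta=1e-3):
        """H: complex array of shape (n_bins, 2, 2), rows = ears, cols = loudspeakers.
        Returns C of the same shape such that H[k] @ C[k] is approximately identity."""
        C = np.zeros_like(H)
        I = np.eye(2)
        for k in range(H.shape[0]):
            Hk = H[k]
            # Regularization limits the boost at ill-conditioned frequencies.
            C[k] = np.linalg.solve(Hk.conj().T @ Hk + beta * I, Hk.conj().T)
        return C
    ```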

    Download PDF (946K)
  • Makoto Otani, Haruki Shigetani, Masataka Mitsuishi, Ryo Matsuda
    2020 Volume 41 Issue 1 Pages 142-150
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    To better understand the acoustic environment and the resulting auditory perception, it is essential to capture, analyze, and reproduce a sound field as a three-dimensional physical phenomenon, because spatial aspects of auditory perception play important roles in various situations in our lives. Some approaches have been proposed to achieve the three-dimensional capture and reproduction of acoustic fields. Among them, Higher-Order Ambisonics (HOA), based on spherical harmonics expansion, enables the capture and reproduction of the directivity pattern of incoming sound waves. On the basis of HOA, three-dimensional auditory space can be presented to a listener typically via a spherical loudspeaker array. In addition, binaural synthesis emulating the loudspeaker presentation enables HOA reproduction with a set of headphones or several loudspeakers by employing crosstalk cancellation. Thus, we are developing an HOA-based binaural reproduction/auralization system with head tracking. This system is aimed at realizing the reproduction and auralization of a sound field, including one excited by the listener's own voice. In this paper, we review the topics related to the reproduction and auralization of the sound field and introduce the HOA-based binaural synthesis system we have developed, as well as our work on sweet-spot expansion in HOA decoding and self-voice reproduction/auralization.
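    As a much-simplified illustration of the virtual-loudspeaker idea (first-order, horizontal-only, static, with a basic sampling decoder), the decode-then-convolve chain can look like the sketch below; the system described above uses higher orders and head tracking, and the HRIR data here are placeholders.

    ```python
    # Simplified first-order Ambisonics-to-binaural sketch via virtual loudspeakers.
    # W, X, Y: horizontal B-format signals (numpy arrays of equal length).
    # hrirs_l[i], hrirs_r[i]: HRIRs of the i-th virtual loudspeaker (placeholder data).
    import numpy as np

    def foa_to_binaural(W, X, Y, speaker_azimuths_rad, hrirs_l, hrirs_r):
        n_spk = len(speaker_azimuths_rad)
        out_len = len(W) + len(hrirs_l[0]) - 1
        left, right = np.zeros(out_len), np.zeros(out_len)
        for i, az in enumerate(speaker_azimuths_rad):
            # Basic (sampling) decoder: project the field onto each speaker direction.
            feed = (W + X * np.cos(az) + Y * np.sin(az)) / n_spk
            left += np.convolve(feed, hrirs_l[i])
            right += np.convolve(feed, hrirs_r[i])
        return left, right
    ```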

    Download PDF (707K)
  • Akira Omoto, Hiroshi Kashiwazaki
    2020 Volume 41 Issue 1 Pages 151-159
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    Three-dimensional sound field reproduction systems can be categorized into two main types: physical reproduction and artistic reproduction. The former is sometimes referred to as scientific or engineering reproduction, and the latter is sometimes recognized as psychological reproduction using phantom images produced by, for example, amplitude panning and other effects. The purposes of such reproduction systems vary widely. A system can serve as a design tool for an enclosed space, such as a concert hall, before actual construction, by accurately reproducing its physical characteristics. A system can also be a pure entertainment tool, mostly combined with visual images. Of course, the scale and necessary conditions vary with the purpose and objectives; however, it is interesting to investigate what the essential factors are for higher total performance of reproduction systems. We currently hypothesize that the following four conditions might be necessary for the total performance of a versatile sound field reproduction system: A) physical accuracy, B) robustness against disturbance, C) flexibility for additional direction, and D) capability of integration with visual stimuli. As a platform for examination, a 24-channel narrow-directional microphone array and a 24-channel loudspeaker array are used. The boundary surface control principle and its modified version are adopted as the physical background. As examples, several practical efforts to ensure the total performance of the system are described.

    Download PDF (1193K)
INVITED PAPERS
  • Hyeong-Seok Choi, Juheon Lee, Kyogu Lee
    2020 Volume 41 Issue 1 Pages 160-165
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    The advent of deep learning has led to great progress in solving many problems that had been considered challenging. Several recent studies have shown promising results in directly changing the styles between two different domains that share the same latent content, for example, from paintings to photographs and from simulated roads to real roads. One of the key ideas underlying this series of domain-translation approaches is the concept of generative adversarial networks (GANs). Motivated by this concept of changing a certain style of data into another style using GANs, we apply this technique to two challenging and yet very important applications in the music signal processing field: music source separation and automatic music transcription. Both tasks can be interpreted as a style transition between two different spectrogram domains that share the same content; i.e., from a mixture spectrogram to a specific source spectrogram in the case of source separation, and from an audio spectrogram to a piano-roll representation in the case of music transcription. Through experiments using real-world audio, we demonstrate that one general deep learning framework, namely ``spectrogram to spectrogram'' or ``Spec2Spec,'' can successfully be applied to tackle these problems.

    Download PDF (354K)
  • Akinori Ito
    2020 Volume 41 Issue 1 Pages 166-169
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    This article briefly reviews the research works related to metacommunication. Metacommunication is a term meaning ``communication on communication,'' which is related to marginal communication such as conveying recognition, comprehension, and evaluation of an interlocutor's words. Herein, several research works are reviewed from the metacommunication point of view.

    Download PDF (59K)
  • Sakriani Sakti, Andros Tjandra, Satoshi Nakamura
    2020 Volume 41 Issue 1 Pages 170-172
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    In this paper, we introduce our recent machine speech chain frameworks based on deep learning, which learn not only to listen or speak but also to listen while speaking. To the best of our knowledge, this is the first deep learning model that integrates human speech perception and production behaviors. Our experimental results show that the proposed approach significantly improved performance over separate systems trained only on labeled data.

    Download PDF (257K)
  • M. Ercan Altinsoy
    2020 Volume 41 Issue 1 Pages 173-181
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    The first aim of this study is to investigate the relationship between the signal properties and the perceptual attributes of everyday push button sounds and the second aim is to investigate the effect of loudness on the perceived tactile feedback intensity from buttons. This knowledge is useful for product designers and sound engineers to find an optimum button sound and haptic feedback for a defined application. In the first step of this study, the physics and signal properties of button sounds are discussed, and an investigation was conducted to determine the users' common language to describe the perceptual properties of everyday button sounds. The results of this investigation showed that the fundamental perceptual factors of button sounds are pleasantness, confirmation, alerting, irritating, and quality. In the next step, a listening experiment was conducted to investigate the relationship between signal properties, such as frequency and damping, and the perceptual factors above. The second part of this study is concerned with auditory-tactile interaction. An experiment was conducted to understand the effect of button sound on the perceived tactile feedback. The results of the experiment clearly show that in bimodal judgments both haptic and auditory information contribute to the perceived tactile strength.

    Download PDF (1074K)
  • Nobuyuki Sakai
    2020 Volume 41 Issue 1 Pages 182-188
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    This article addresses the psycho-social attitude toward food perception, using three prior studies to show that cognitive processes play an important role in food perception. Many existing studies of food perception are based on animal behavior experiments, so the human cognitive processes involved in food perception have been overlooked. The three studies in this article emphasize the importance of human cognitive processes in human flavor perception.

    Download PDF (886K)
  • Waka Fujisaki
    2020 Volume 41 Issue 1 Pages 189-195
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    Shitsukan is a Japanese word that means ``a sense of quality.'' Shitsukan includes visual qualities such as ``glossiness'' and ``translucency''; acoustic qualities such as ``brightness,'' ``sharpness,'' and ``pitch''; tactile qualities such as ``roughness'' and ``hardness''; aspects of materials themselves such as ``glass,'' ``cloth,'' ``wood,'' ``stone,'' ``metal,'' and ``pearl''; and affective properties such as ``prettiness,'' ``fragility,'' ``expensiveness,'' ``preference,'' ``naturalness,'' and ``genuineness.'' Thus, a wide range of concepts has been examined with respect to Shitsukan perception. It is also important to note that Shitsukan perception is not merely the processing of information input through various sensory modalities; it also results from multimodal, adaptive, and active processes including prediction, decision-making, body motor control, and sensory-motor feedback. In this review, I would like to introduce the following three studies that my collaborators and I have recently conducted: 1) auditory modulation of the material properties of food by pseudo-mastication feedback sounds generated from electromyogram signals; 2) perception of the material properties of wood based on vision, audition, and touch; and 3) the rules of audiovisual integration in human perception of materials.

    Download PDF (557K)
  • Hidehiko Okamoto
    2020 Volume 41 Issue 1 Pages 196-200
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    Non-invasive neuroimaging techniques have revealed that not only child brains but also adult brains can be reorganized in humans. Cortical reorganization is usually beneficial, as in the increased auditory cortical representation of professional musicians. On the other hand, maladaptive cortical reorganization in the auditory cortex can lead to hearing disorders such as tinnitus and hyperacusis. We tried to non-invasively visualize the pathological neural activity in the human auditory cortex and to reverse maladaptive cortical reorganization by suitable behavioral training so as to decrease detrimental auditory symptoms. Here, we report our previous studies that measured the neural activity in the auditory cortex of hearing-impaired people using magnetoencephalography. The results obtained indicated that hearing impairments were related to the reorganization of the auditory neural pathway and that sound therapy was an effective approach for sudden sensorineural hearing loss. Visualization of healthy and pathological brain activity by non-invasive neuroimaging techniques can lead to the development of new clinical approaches for those affected.

    Download PDF (470K)
  • Masanori Higuchi, Yuko Suzuka
    2020 Volume 41 Issue 1 Pages 201-203
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    As a novel method for analyzing the auditory-evoked brain response, we used the coherence function between a sound envelope and a brain signal, such as a signal acquired by magnetoencephalography (MEG) or electroencephalography (EEG), in a selective listening study. In this study, we examined mixed sounds with various mixing ratios and investigated how the coherence value changes according to the ease of hearing. We used two types of mixed sounds: a speech mix and a music mix. We investigated three types of mixing ratios: easy, normal, and difficult with respect to hearing the target sound. MEG recordings were obtained while each mixed sound was heard. We calculated the coherence function between the sound envelope and the MEG data. This is a function of frequency, and a higher coherence value indicates that the two signals are more strongly correlated at that frequency. We found that the coherence value tends to increase with the ease of hearing the target sound. This property might be useful for developing a new type of hearing aid for selective listening.
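    A generic sketch of the analysis idea, assuming standard signal-processing tools rather than the authors' exact pipeline, is to take the Hilbert envelope of the stimulus and compute its magnitude-squared coherence with a brain channel.

    ```python
    # Sketch: coherence between a sound's temporal envelope and one MEG/EEG channel.
    import numpy as np
    from scipy.signal import hilbert, coherence, resample

    def envelope_brain_coherence(audio, brain, fs_brain, nperseg=1024):
        env = np.abs(hilbert(audio))        # temporal envelope of the stimulus
        env = resample(env, len(brain))     # align sample counts with the brain channel
        f, Cxy = coherence(env, brain, fs=fs_brain, nperseg=nperseg)
        return f, Cxy                       # higher Cxy(f): stronger envelope tracking
    ```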

    Download PDF (594K)
  • Yi-Wen Liu
    2020 Volume 41 Issue 1 Pages 204-208
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    Otoacoustic emission (OAE) refers to acoustic waves that originate from the cochlea. Since their discovery, various ways have been developed to elicit OAEs; those elicited by short clicks are called transient-evoked (TE) OAEs, and the cubic distortion products elicited by two tones are called distortion-product (DP) OAEs. In addition, spontaneous OAEs can be found in some ears without applying any external stimulus. Shera and Guinan proposed a taxonomy of OAEs that consists of three kinds: linearly reflected emissions, spontaneous emissions, and distortion emissions. This article aims to introduce an additional, fourth kind of OAE to the taxonomy. We have shown theoretically that, when a high-frequency, large-amplitude suppressor tone is present, it may set up a temporary and reversible impedance mismatch for the traveling waves that pass through its characteristic place. Because of this mismatch, the waves are partially reflected and travel backward toward the stapes. The derivation of this ``nonlinear reflection'' mechanism is based on de Boer's quasi-linear, equivalent-system framework, and may help explain the controversial tone-burst-evoked OAE results obtained in recent years.

    Download PDF (280K)
  • Qinglin Meng, Xianren Wang, Nengheng Zheng, Jan W. H. Schnupp, Alan Ka ...
    2020 Volume 41 Issue 1 Pages 209-213
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    Cochlear implants (CIs) convert sound to electrical stimulation by extracting the envelope in each frequency band while discarding the temporal fine structure (TFS). This processing removes the fine structure interaural time differences (ITDs), which are an important cue for locating sounds on the horizontal plane in normal-hearing (NH) listeners, but are unavailable to CI users. A temporal limits encoder (TLE) strategy was previously proposed to enhance TFS in CIs, and our previous studies via tone-carrier vocoder simulation have shown improved unilateral speech-in-noise understanding and pitch perception. Here, binaural benefits of TLE were assessed, by measuring the binaural intelligibility level difference (BILD), using a 22-channel tone-carrier vocoder in NH listeners. TLE was compared to continuous interleaved sampling (CIS), a common CI strategy. Speech reception thresholds (SRTs) were measured for diotic target speech (male), and diotically-colocated or dichotically-separated (applying ± 625 µs delay between ears) competitors (male or female). Compared to CIS, TLE showed significantly larger BILDs for different genders, indicating that TLE-simulation listeners were able to benefit from both pitch and spatial cues. However, SRTs for vocoded conditions were much higher than non-vocoded listening, likely due to a lack of familiarity with vocoded speech listening.

    Download PDF (283K)
  • Borys Kowalewski, Michal Fereczkowski, Olaf Strelcyk, Ewen MacDonald, ...
    2020 Volume 41 Issue 1 Pages 214-222
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    Sensorineural hearing loss results in both a reduced sensitivity to sound and suprathreshold deficits, such as loudness recruitment and degraded spectral and temporal resolution. To compensate for loudness recruitment, most hearing aids apply level-dependent amplification, such as multi-channel wide dynamic-range compression. However, the most appropriate choice of parameters, such as the time constants and the number of channels, has been controversial. Speech intelligibility has often been considered as an outcome measure, but it has been difficult to delineate the effects of hearing-aid signal processing on the representation of signals, owing to the complex spectro-temporal structure of speech. In the current study, hearing-aid compensation strategies were evaluated using synthetic stimuli in psychoacoustic experiments with hearing-impaired and normal-hearing listeners. A computational model of auditory signal processing was used to assess the effects of linear amplification and multi-channel fast-acting compression on spectral and temporal masking. Improvements in the decay of forward masking were predicted with both types of amplification owing to the increased audibility. On the other hand, spectral masking was reduced with compression, but not with linear amplification, owing to the increased signal-to-noise ratio across frequency. The results provide insights into the effects of hearing-aid amplification on basic auditory processing.
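    For readers unfamiliar with wide dynamic-range compression, a minimal single-channel sketch is shown below; the threshold, ratio, and time constants are arbitrary illustrative values, not the settings evaluated in the study.

    ```python
    # Minimal single-channel wide dynamic-range compressor (illustrative only).
    import numpy as np

    def wdrc(x, fs, threshold_db=-40.0, ratio=3.0, t_att=0.005, t_rel=0.050):
        x = np.asarray(x, dtype=float)
        a_att = np.exp(-1.0 / (t_att * fs))   # attack smoothing coefficient
        a_rel = np.exp(-1.0 / (t_rel * fs))   # release smoothing coefficient
        env_db, y = -120.0, np.zeros_like(x)
        for n, s in enumerate(x):
            lev = 20.0 * np.log10(abs(s) + 1e-12)
            a = a_att if lev > env_db else a_rel
            env_db = a * env_db + (1.0 - a) * lev      # running level estimate in dB
            over = max(env_db - threshold_db, 0.0)
            gain_db = -over * (1.0 - 1.0 / ratio)      # static compression curve
            y[n] = s * 10.0 ** (gain_db / 20.0)
        return y
    ```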

    Download PDF (628K)
  • John F. Culling, Rada Gocheva, Yanyu Li, Nur Kamaludin
    2020 Volume 41 Issue 1 Pages 223-228
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    The intelligibility of speech was measured in simulated rooms with parametrically manipulated acoustic features. The rectangular rooms were designed to simulate restaurant environments with either three or nine occupied tables, using either speech or noise as interfering sounds. More detailed acoustic features, such as furniture, were also modelled. The measurements revealed that reverberation time was poorly correlated with speech intelligibility. In contrast, a psychoacoustic model of spatial release from masking produced accurate predictions for noise interferers and ordinally correct predictions for speech interferers. It was found that rooms with high ceilings facilitated higher speech intelligibility than rooms with lower ceilings, and that acoustic treatment of walls facilitated higher speech intelligibility than equivalent treatment of ceilings. Ground-level acoustic clutter, formed by furniture and the presence of other diners, had a substantial beneficial effect. Where acoustic treatment was limited to the ceiling, it was found that continuous acoustic ceilings were more effective than suspended panels, and that the panels were more effective if acoustically absorbent on both sides. The results suggest that the most effective control of reverberation for the purpose of speech intelligibility is provided by absorbers placed vertically and close to the diners.

    Download PDF (460K)
  • Maki Sakamoto
    2020 Volume 41 Issue 1 Pages 229-232
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    Sounds can be expressed by onomatopoeias such as tick-tock and ding-dong, with which we verbalize the auditory information perceived from environmental sounds. Onomatopoeias, i.e., sound-symbolic words, are linguistic forms closely related to environmental sounds. In recent years, some researchers have reported that onomatopoeias are implicated in affective aspects. Our research group has developed a system to quantify the affective impression or texture of environmental sounds expressed by onomatopoeias. Interestingly, our system can estimate not only sound impressions but also tactile or taste impressions expressed by onomatopoeias.

    Download PDF (209K)
  • Masashi Unoki, Zhi Zhu
    2020 Volume 41 Issue 1 Pages 233-244
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    Speech signals can be represented as a sum of amplitude-modulated frequency bands. This sum can also be regarded as a temporal amplitude envelope (TAE) with temporal fine structure. Our previous studies using noise-vocoded speech (NVS) showed that the TAE of speech plays an important role in the perception of linguistic information (speech intelligibility) as well as non-linguistic information (e.g., vocal-emotion recognition). It was found that the upper limit of the modulation frequency from 4 to 8 Hz on the TAE is important for speech intelligibility, while that from 8 to 16 Hz is important for vocal-emotion recognition. However, speech intelligibility generally dramatically degrades due to reverberation. The concept of the modulation transfer function (MTF) takes into account the relationship between the transfer function in an enclosure in terms of input and output TAEs and characteristics of the enclosure under reverberant conditions. This concept was introduced as a measure in room acoustics for assessing the effect of an enclosure on speech intelligibility. For this study, we conducted two experiments involving word intelligibility tests and vocal-emotion recognition with NVS under reverberant conditions to investigate the relationship between the contributions of the TAE of speech and MTF of reverberation to modulation perception of NVS. We also pointed out that the straightforward scheme, i.e., the relationship between the contributions of the static features (peak/slope) in the modulation spectrum (MS) of speech and MTF of reverberation, cannot consistently account for the auditory perception of both linguistic and non-linguistic information obtained from these perceptual data of NVS under reverberant conditions. We then developed a scheme in which the relationship between the contributions of the temporal MS features and MTF of reverberation to modulation perception can consistently account for these perceptual data of NVS.
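    For reference, the classical room-acoustics expression for the MTF of an enclosure with reverberation time T (in seconds) and signal-to-noise ratio SNR (in dB), as given in the standard MTF literature rather than in this paper, is:

    ```latex
    % Modulation transfer function of a reverberant, noisy channel (Houtgast & Steeneken form)
    m(F) = \left[1 + \left(\frac{2\pi F\,T}{13.8}\right)^{2}\right]^{-1/2}
           \cdot \left[1 + 10^{-\mathrm{SNR}/10}\right]^{-1}
    ```

    where F is the modulation frequency in Hz; reverberation thus acts as a low-pass filter on the TAE, attenuating higher modulation frequencies more strongly.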

    Download PDF (1283K)
  • W. Owen Brimijoin, Shawn Featherly, Philip Robinson
    2020 Volume 41 Issue 1 Pages 245-248
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    The perception of acoustic motion is not uniform as a function of azimuth; listeners need roughly twice as much motion at the side as at the front to judge the two motions as equivalent. Self-generated acoustic motion perception has also been shown to be distorted: sounds that move slightly with the listener's head are more consistently judged to be world-stable than those that are truly static. These distortions can be captured by a model that incorporates a head-centric warping of perceived sound location, characterized by a displacement of the apparent sound location away from the acoustic midline. Such a distortion has been demonstrated; listeners tend to overestimate azimuth when they are asked to point at a sound source while keeping their head and eyes fixated straight ahead. Here we show that this mathematical framework may be inverted, and we demonstrate the benefits of re-mapping sound source locations toward the auditory midline. We show that listeners prefer different amounts of spatial remapping, but none preferred no remapping. Modelling shows minimal impact on spatial release from masking for small amounts of remapping, demonstrating that it is possible to achieve a more stable perceptual environment without sacrificing speech intelligibility in spatially complex environments.
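    A toy version of the remapping idea is sketched below; the compression factor is a placeholder for the listener-preferred amounts reported in the paper, not a recommended value.

    ```python
    # Toy azimuth remapping: compress rendered source azimuths toward the midline.
    import numpy as np

    def remap_azimuth(az_deg, compression=0.8):
        """compression = 1.0 leaves locations unchanged; smaller values pull sources
        toward 0 degrees (straight ahead), counteracting the tendency to
        overestimate lateral angles."""
        az = np.asarray(az_deg, dtype=float)
        return np.clip(compression * az, -180.0, 180.0)
    ```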

    Download PDF (369K)
  • Akio Honda, Yoji Masumi, Yôiti Suzuki, Shuichi Sakamoto
    2020 Volume 41 Issue 1 Pages 249-252
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    This study investigated the effect of passive whole-body rotation on the accuracy of a listener's subjective straight ahead. Listeners sat on a digitally controlled spinning chair placed at the center of a circular loudspeaker array (radius = 1.1 m, speaker spacing = 2.5°) and were exposed to a single 30-ms pink noise burst emitted from one loudspeaker of this array. Under the chair-still condition, listeners were asked to keep their head still, whereas under the chair-rotation condition, listeners were asked to keep their head still while their chairs were rotated at angular velocities of 5, 10, or 20°/s. In both conditions, listeners judged whether the stimulus was presented to the right or left of their subjective straight ahead. Sound localization accuracy decreased significantly under the chair-rotation condition, whereas the chair rotation speed had almost no effect on sound localization accuracy.

    Download PDF (370K)
  • Tapio Lokki, Jukka Pätynen
    2020 Volume 41 Issue 1 Pages 253-259
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    One of the central acoustical features of a concert hall is its ability to make sound sufficiently loud. Acoustics researchers often measure the objective parameter ``strength'' to investigate the sound-amplifying properties of a hall. However, strength is a linear measure; it does not reveal anything about the true dynamic responsiveness of a hall. A hall should render music expressive with large dynamics, and in this respect the dynamic responsiveness plays an inseparable role. As an example, we analyze measurements from two concert halls, combining the binaural sound levels with additional information on the music and its dynamics. These factors represent the spectral changes in the source signals as well as the sensitivity of binaural hearing according to the sound level. With such factors combined with the information obtained from the conventional impulse response, the dynamic responsiveness as well as the actual dynamic range experienced by the listener could be objectively measured. The presented analysis method shows the overall magnitude of the differences in dynamic responsiveness that could be observed between concert halls.
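    For reference, the strength parameter mentioned above is commonly defined (in ISO 3382-1 style notation, not quoted from the paper) as the level of the impulse response measured in the hall relative to that of the same source in the free field at 10 m:

    ```latex
    % Sound strength G (standard room-acoustics definition)
    G = 10 \log_{10} \frac{\int_{0}^{\infty} p^{2}(t)\, \mathrm{d}t}
                          {\int_{0}^{\infty} p_{10}^{2}(t)\, \mathrm{d}t} \quad [\mathrm{dB}]
    ```

    Being an energy ratio of impulse responses, G is independent of the level and spectrum of the music actually played, which is why it cannot capture dynamic responsiveness on its own.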

    Download PDF (819K)
  • Toru Kamekawa, Atsushi Marui
    2020 Volume 41 Issue 1 Pages 260-268
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    Various microphone techniques for three-dimensional audio (abbreviated as 3D audio) with playback channels along the height direction, such as NHK 22.2 multichannel sound, Dolby Atmos, and Auro 3D, have been proposed. In this study, we compared three microphone techniques for 22.2 multichannel sound: a spaced microphone array, a near-coincident microphone array, and a coincident microphone array (Ambisonics). First, the evaluation attributes were extracted by referring to the repertory grid technique. Then, using these attributes, participants compared the differences between these microphone techniques, including the difference in the listening position, through two experiments. From the results, we observe that the difference depending on the listening position was smallest for the spaced array. In addition, it was estimated that Ambisonics gives the impression of ``hard,'' the near-coincident array gives ``rich'' and ``wide,'' and the spaced array gives ``clear'' and ``presence.'' Furthermore, ``presence'' was evaluated from the viewpoints of clarity and richness of reverberation, showing a negative correlation with the spectral centroid and a positive correlation with the reflections from the lateral and vertical directions.

    Download PDF (947K)
  • Shoichi Koyama
    2020 Volume 41 Issue 1 Pages 269-275
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    Estimating and interpolating a sound field from measurements made with multiple microphones are fundamental tasks in sound field analysis for sound field reconstruction. Sound field reconstruction inside a source-free region is achieved by decomposing the sound field into plane-wave or harmonic functions. When the target region includes sources, it is necessary to impose some assumptions on the sources. Recently, it has become increasingly popular to apply sparse representation algorithms to various sound field analysis methods. In this paper, we present an overview of sparsity-based sound field reconstruction methods and demonstrate their application to sound field recording and reproduction.
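
    To make the idea concrete, the following illustrative sketch (not the paper's algorithm; all positions and parameters are invented) models the microphone pressures as a sparse combination of plane waves and solves the resulting l1-regularized least-squares problem with a simple iterative soft-thresholding (ISTA) loop.

    ```python
    import numpy as np

    def plane_wave_dictionary(mic_pos, directions, k):
        """D[m, d] = exp(-1j * k * <mic_pos[m], directions[d]>) for unit direction vectors."""
        return np.exp(-1j * k * (mic_pos @ directions.T))

    def ista(D, p, lam=1e-2, n_iter=500):
        """Solve min_a 0.5*||D a - p||_2^2 + lam*||a||_1 for complex a by soft thresholding."""
        step = 1.0 / (np.linalg.norm(D, 2) ** 2)          # 1 / Lipschitz constant of the gradient
        a = np.zeros(D.shape[1], dtype=complex)
        for _ in range(n_iter):
            r = a - step * (D.conj().T @ (D @ a - p))     # gradient step
            mag = np.abs(r)
            shrink = np.maximum(mag - lam * step, 0.0)
            a = np.where(mag > 0, r / np.maximum(mag, 1e-12) * shrink, 0.0)
        return a

    # Example: 16 microphones, 180 candidate plane-wave directions in the horizontal plane.
    rng = np.random.default_rng(0)
    mic_pos = rng.uniform(-0.5, 0.5, size=(16, 3))                     # metres
    angles = np.linspace(0.0, 2.0 * np.pi, 180, endpoint=False)
    directions = np.stack([np.cos(angles), np.sin(angles), np.zeros_like(angles)], axis=1)
    k = 2.0 * np.pi * 500.0 / 343.0                                    # wavenumber at 500 Hz
    D = plane_wave_dictionary(mic_pos, directions, k)
    p = D[:, 20]                                                       # field of a single plane wave
    a_hat = ista(D, p)
    # With a_hat known, pressure can be interpolated at any point from the same plane waves.
    ```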

    Download PDF (856K)
  • Jorge Treviño, Shuichi Sakamoto, Yôiti Suzuki
    2020 Volume 41 Issue 1 Pages 276-281
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    Virtual Auditory Displays (VADs) are used to present realistic spatial sound. High-quality VADs must account for three factors: individuality (Head-Related Transfer Function), room acoustics (Room Transfer Function) and freedom of motion (active listening). The Auditory Display based on the VIrtual SpherE model (ADVISE) was proposed to simplify the problem by dividing it, through the Kirchhoff-Helmholtz integral theorem, into 1) a listener-free room acoustics simulation and 2) a free-field VAD using HRTFs. Users of ADVISE can move freely within the free-field region, thus accounting for active listening. This paper revisits the classic theory of ADVISE and identifies three oversights in the original proposal: 1) The ADVISE formulation suffers from non-unique boundary conditions at some frequencies. 2) The original proposal re-creates a set of boundary conditions using secondary sources that diverge on the boundary itself. 3) Considerations for sound propagation are absent in the original formulation. Two new formulations that retain the philosophy of ADVISE but are free from these problems are presented. The first one is based on the theory of Boundary Matching Filters, while the second is inspired by High-Order Ambisonics. The latter is found to be better suited for applications where freedom of motion is important since the presented sound field can be shifted by a translation matrix.

    Download PDF (616K)
  • Tsukasa Suenaga, Shoken Kaneko, Hiraku Okumura
    2020 Volume 41 Issue 1 Pages 282-287
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    Three-dimensional (3D) audio reproduction systems delivered over headphones are growing in popularity alongside the evolution of virtual reality (VR) and augmented reality (AR) technology. In this paper, some applications of binaural 3D audio systems are presented. Game audio is one application field for binaural 3D audio, and we have developed a binaural 3D audio system for game development. When applying a binaural 3D audio system to games, the computational load must be kept manageable; we therefore developed a hybrid system of virtual loudspeakers and object sound sources. In addition, the problem of timbre change was highlighted. Meanwhile, 360° video is another potential application field for 3D audio. In conventional binaural systems, it is difficult to express ambient sound sources such as the sound of rustling leaves. Our system makes it possible to reproduce ambient sound sources by combining higher-order Ambisonics (HOA) and head-related transfer functions (HRTFs).
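
    In general form, the virtual-loudspeaker idea sketched below (this is an assumed, generic formulation, not the authors' implementation) pans or decodes many sound objects to a small, fixed set of virtual loudspeakers, so only one pair of HRIR convolutions per virtual loudspeaker is needed regardless of the number of objects.

    ```python
    import numpy as np
    from scipy.signal import fftconvolve

    def render_binaural(speaker_feeds, hrirs_left, hrirs_right):
        """
        speaker_feeds: (n_spk, n_samples) signals of the virtual loudspeakers.
        hrirs_left/right: (n_spk, hrir_len) head-related impulse responses.
        Returns the left/right ear signals as the sum of per-speaker convolutions.
        """
        left = sum(fftconvolve(feed, h) for feed, h in zip(speaker_feeds, hrirs_left))
        right = sum(fftconvolve(feed, h) for feed, h in zip(speaker_feeds, hrirs_right))
        return left, right

    # Object sources are first panned (or an Ambisonics scene decoded) to the fixed
    # virtual loudspeakers, e.g. speaker_feeds = decode_matrix @ hoa_signals, so the
    # number of convolutions stays constant however many objects are active.
    ```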

    Download PDF (548K)
  • William L. Martens, Michael Cohen
    2020 Volume 41 Issue 1 Pages 288-296
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    Spatial soundscape superposition occurs whenever multiple sound signals impinge upon a human listener's ears from multiple sources, as in augmented reality displays that combine natural soundscapes with reproduced soundscapes. Part I of this two-part contribution on spatial soundscape superposition regards perceptual superposition of soundscapes, and therefore focuses upon human response to displayed auditory scenes, and the influence of subject (listener) motion on making sense of them in the context of information received from other sensory systems, especially the visual and vestibular systems. Consideration of listener motion and multimodal integration here is intended to lay the foundation for Part II of this contribution, which focuses upon physical stimuli, i.e., sounds and signals, and the systems used to mix, transmit, and display them. Through superposition of complex sound stimuli at the ears of a moving listener, these systems create complex auditory scenes, the nature of which cannot be predicted by simple combination of physical stimuli. As it is left to the human listener to interpret auditory scenes comprising those stimuli, this part focuses upon perceptual principles, such as grouping principles, that can aid in successfully predicting whether multiple auditory events are perceptually segregated or fused in the auditory scenes that are experienced. Furthermore, for moving listeners, four fundamental laws are identified here describing sensorimotor contingencies that enable prediction not only of what auditory images are formed, but also where in auditory space those images are likely to be perceived.

    Download PDF (293K)
  • Michael Cohen, William L. Martens
    2020 Volume 41 Issue 1 Pages 297-307
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    We present an analytical framework for a cognitively informed organization of signals involved in computational representations of spatial soundscape superposition, defined here as ``procedural superposition,'' building on the accompanying article Part I, where we discussed physical (acoustical) and perceptual (subjective and psychological) frameworks for soundscape representations in virtual auditory displays. Exploiting multimodal sensation and mental models of situations and environments, convention and idiom can tighten listeners' apprehension of an auditory scene, using metaphor and relaxed expectation of sonorealism to enrich communication. Besides physical and psychological combinations, procedural (logical and cognitive) superposition considers metaphorical mappings between audio sources and virtual location, including such aspects as separation of visual and auditory perspectives; separation of direction and distance; parameterized binaural and spatial effects, including directionality; range-compression and indifference; layering of soundscapes; ``audio windowing'' (analogous to graphical user interface windows), narrowcasting, and multipresence as strategies for managing privacy; and rotation as revolution. These auditory display strategies leverage virtual relaxations of sonorealism to enable enhanced soundscape representation.

    Download PDF (1810K)
  • Craig T. Jin, Shiduo Yu, Fabio Antonacci, Augusto Sarti
    2020 Volume 41 Issue 1 Pages 308-317
    Published: January 01, 2020
    Released on J-STAGE: January 06, 2020
    JOURNAL FREE ACCESS

    Hands-free audio services supporting speech communication are playing an increasingly ubiquitous and foundational role in everyday life as services for the home and work become more automated, interactive and robotic. People speak their instructions (e.g., to Siri) to control and interact with their environment. This makes it an exciting time for acoustics engineering because the demands on microphone array performance are rapidly increasing. Microphone arrays are expected to work at increasing distances in noisy and reverberant situations; they are expected to record not just the sound content, but also the sound field; and they are expected to work in multi-talker situations and even on moving, robotic platforms. Audio technology is currently undergoing rapid change in which it is becoming feasible, from both a cost and a hardware point of view, to incorporate multiple and distributed microphone arrays with hundreds or even thousands of microphones within a built environment. In this review paper, we consider microphone array signal processing from two relatively recent vantage points: sparse recovery and ray space analysis. To a lesser extent, we also consider neural networks. We present the principles underlying each method. We consider the advantages and disadvantages of the approaches and also present possible methods to integrate these techniques.
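
    As background for the baseline that such methods aim to improve upon (this sketch is not from the paper), a conventional frequency-domain delay-and-sum beamformer steers a microphone array toward a look direction by compensating per-microphone propagation delays; sparse-recovery and ray-space approaches target the noisy, reverberant, multi-talker conditions where this simple processing struggles.

    ```python
    import numpy as np

    def delay_and_sum(X, mic_pos, freqs, look_dir, c=343.0):
        """
        Frequency-domain delay-and-sum beamformer under a far-field plane-wave model.
        X: (n_mics, n_freqs) one STFT frame; mic_pos: (n_mics, 3) in metres;
        freqs: (n_freqs,) in Hz; look_dir: unit vector from the array toward the
        assumed source. Assumes the usual exp(-j*2*pi*f*t) DFT convention.
        """
        rel_delays = (mic_pos @ look_dir) / c                     # seconds, per microphone
        steering = np.exp(-2j * np.pi * np.outer(rel_delays, freqs))
        return np.mean(steering * X, axis=0)                      # align channels and average
    ```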

    Download PDF (1009K)
ACOUSTICAL LETTERS