Acoustical Science and Technology
Online ISSN : 1347-5177
Print ISSN : 1346-3969
ISSN-L : 0369-4232
Volume 39, Issue 2
—Special Issue on speech communication—
FOREWORD
PAPERS
  • Kenta Ofuji, Naomi Ogasawara
    2018 Volume 39 Issue 2 Pages 56-65
    Published: March 01, 2018
    Released on J-STAGE: March 01, 2018
    JOURNAL FREE ACCESS
    In this paper, we study the effects of acoustic characteristics of spoken disaster warnings in Japanese on listeners' perceived intelligibility, reliability, and urgency. Our findings are threefold: (a) For both speaking speed and fo, the normal setting (compared with slow/fast (±20%) for speed, and with low/high (± up to 36 Hz) for fo) improved the average evaluations for intelligibility and reliability. (b) For urgency only, raising speed (both slow to normal and normal to fast) or raising fo (both low to normal and normal to high) improved the average evaluation. (c) For intelligibility, reliability, and urgency alike, the main effect of speaking speed was the most dominant; urgency could be influenced by the speed factor alone by up to 39%. With speed set to fast (+20%), all other things being equal, the average perceived urgency rose to 4.0 on the 1–5 scale, from 3.2 at normal speed. Based on these results, we argue that the speech rate may effectively be varied depending on the purpose of an evacuation call: whether it prioritizes urgency, or intelligibility and reliability. Care should be taken, however, as respondent-specific variation and experimental conditions may interact with these results. (A sketch of the speed and fo manipulations follows this entry.)
    Download PDF (640K)
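    The speed and fo manipulations above can be illustrated with a short resynthesis sketch. The abstract does not name the authors' resynthesis tool, so librosa is assumed here purely for illustration; the input file name and base fo are hypothetical.

      import numpy as np
      import librosa
      import soundfile as sf

      y, sr = librosa.load("warning.wav", sr=None)   # hypothetical input recording

      # Speaking-speed conditions: -20% (slow), normal, +20% (fast).
      for label, rate in [("slow", 0.8), ("normal", 1.0), ("fast", 1.2)]:
          sf.write(f"warning_speed_{label}.wav",
                   librosa.effects.time_stretch(y, rate=rate), sr)

      # fo conditions: shift down/up by up to 36 Hz around an assumed base fo.
      base_fo = 120.0                                # Hz, hypothetical speaker average
      for label, delta in [("low", -36.0), ("high", +36.0)]:
          n_steps = 12 * np.log2((base_fo + delta) / base_fo)  # Hz offset -> semitones
          sf.write(f"warning_fo_{label}.wav",
                   librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps), sr)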
  • Masako Fujimoto, Seiya Funatsu
    2018 Volume 39 Issue 2 Pages 66-74
    Published: March 01, 2018
    Released on J-STAGE: March 01, 2018
    JOURNAL FREE ACCESS
    Japanese voiced geminates have a tendency to devoice (e.g. baggu > bakku `bag'). Voiced obstruents have an inherent susceptibility to devoicing due to the aerodynamic voicing constraint (AVC), and the susceptibility is higher for geminate obstruents than for singletons. To investigate how Japanese speakers realize the [±voice] contrast in obstruents, we examined oral and nasal airflow patterns during intervocalic voiced and voiceless stops in singletons and geminates. The results showed that no nasal airflow appeared during either voiced or voiceless stops. Oral airflow showed an asymmetry between single and geminate stops in the realization of the stop voicing contrast: while the oral airflow pattern clearly differentiates the voiced vs. voiceless contrast in singletons, the patterns are similar in geminates. Acoustic signals show the same asymmetry between singletons and geminates. This convergence, a clear voicing contrast in singletons versus a lack of the contrast in geminates in both oral airflow and acoustic signals, indicates a tendency for voiced geminates to neutralize to voiceless ones. Our results support the idea of phonetic and articulatory bases in the phonological patterning of voicing neutralization in Japanese geminate stops. (A sketch of the closure-airflow comparison follows this entry.)
    Download PDF (1682K)
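    The asymmetry above rests on comparing oral airflow during the stop closure. A minimal sketch of that comparison follows, assuming a hypothetical data layout (a sampled airflow trace plus hand-labeled closure times); the authors' actual measurement setup is not reproduced here.

      import numpy as np

      def mean_closure_flow(flow, fs, closure_start, closure_end):
          """Mean oral airflow over a stop-closure interval (units as recorded)."""
          i0, i1 = int(closure_start * fs), int(closure_end * fs)
          return float(np.mean(flow[i0:i1]))

      # A voiced singleton is expected to retain some oral airflow during
      # closure, while a devoiced geminate approaches the voiceless pattern;
      # comparing these means across conditions quantifies the asymmetry.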
  • Jeff Moore, Jason Shaw, Shigeto Kawahara, Takayuki Arai
    2018 Volume 39 Issue 2 Pages 75-83
    Published: March 01, 2018
    Released on J-STAGE: March 01, 2018
    JOURNAL FREE ACCESS
    This study examines the tongue shapes used by Japanese speakers to produce the English liquids /ɹ/ and /l/. Four native Japanese speakers at varying levels of English acquisition and one North American English speaker were recorded both acoustically and with electromagnetic articulography (EMA). Seven distinct articulation strategies were identified. Results indicate that the least advanced speaker used a single articulation strategy for both sounds, intermediate speakers used a wide range of articulations, and the most advanced non-native speaker relied on a single strategy for each sound. (A clustering sketch follows this entry.)
    Download PDF (718K)
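    The paper identifies seven strategies from the EMA data. As an illustration only (the abstract does not state that the strategies were found by automatic clustering), grouping tokens by k-means over tongue-sensor coordinates might look like this, with synthetic values standing in for real recordings.

      import numpy as np
      from sklearn.cluster import KMeans

      # Hypothetical token-by-feature matrix: one row per /ɹ/ or /l/ token,
      # columns = (x, z) positions of five tongue sensors sampled at the
      # consonantal constriction (synthetic values here).
      tokens = np.random.default_rng(0).normal(size=(200, 10))

      labels = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(tokens)
      print(np.bincount(labels))    # tokens per candidate articulation strategy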
  • Alexei Kochetov
    2018 Volume 39 Issue 2 Pages 84-91
    Published: March 01, 2018
    Released on J-STAGE: March 01, 2018
    JOURNAL FREE ACCESS
    This study employed electropalatography (EPG) to explore place and manner of articulation differences in Japanese consonants. Linguopalatal contact data were collected from five native speakers using custom-made artificial palates. The materials included words with 10 word-initial consonants and a word-final moraic nasal. Quantitative analyses of the data revealed some consistent differences among consonants in constriction location and constriction degree, even within same-place classes. Certain differences among dorsal consonants, as well as among consonants with no active lingual constriction, were also observed. The results for Japanese coronal consonants were further compared with previous quantitative findings for English and Spanish, with the goal of establishing common manner-specific patterns of linguopalatal contact across languages. (A sketch of two common EPG contact indices follows this entry.)
    Download PDF (970K)
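    Quantitative EPG analyses typically summarize each contact frame with indices such as overall contact and an anteriority-weighted center of gravity. The sketch below computes both on a simplified 8x8 binary frame (standard Reading-style palates actually have 62 electrodes); the /t/-like frame is hypothetical, and the exact indices used in the paper may differ.

      import numpy as np

      def contact_percent(frame):
          """Overall linguopalatal contact: percentage of activated electrodes."""
          return frame.mean() * 100.0

      def center_of_gravity(frame):
          """Row-weighted index of constriction location; higher = more anterior.
          Rows are numbered 1 (front) to 8 (back) here."""
          rows = np.arange(1, 9)
          row_contacts = frame.sum(axis=1)       # contacts per row, front to back
          if row_contacts.sum() == 0:
              return np.nan
          return float(((9 - rows) * row_contacts).sum() / row_contacts.sum())

      frame = np.zeros((8, 8), dtype=int)
      frame[0:2, :] = 1                          # hypothetical alveolar (/t/-like) contact
      print(contact_percent(frame), center_of_gravity(frame))   # -> 25.0 7.5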
  • Hafiyan Prafiyanto, Takashi Nose, Yuya Chiba, Akinori Ito
    2018 Volume 39 Issue 2 Pages 92-100
    Published: March 01, 2018
    Released on J-STAGE: March 01, 2018
    JOURNAL FREE ACCESS
    We investigate the effect of speaking rate and pauses on the perception of spoken Easy Japanese, a simplified register of Japanese that uses mostly easy words to facilitate understanding by non-native speakers. We used synthetic speech with various speaking rates, pause positions, and pause lengths to investigate how these factors correlate with non-native listeners' perception of Easy Japanese. We found that speech rates of 320 and 360 morae per minute are perceived as closest to the ideal speaking rate. Inserting pauses in places natural for Japanese native speakers, based on the dependency-relation rules of Japanese, makes sentences easier to listen to for non-native speakers as well, whereas inserting too many pauses makes them hard to listen to. (The rate arithmetic is sketched after this entry.)
    Download PDF (902K)
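    The rates reported above are simple arithmetic over mora counts and durations; the example numbers below are hypothetical, chosen to land on one of the two rates listeners judged closest to ideal.

      def morae_per_minute(n_morae, duration_s):
          """Speaking rate of an utterance in morae per minute."""
          return 60.0 * n_morae / duration_s

      # E.g. a 30-mora sentence synthesized to last 5.0 s gives 360 morae/min
      # (whether pauses count toward the duration depends on the convention).
      print(morae_per_minute(30, 5.0))   # -> 360.0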
  • Carlos Toshinori Ishi, Jun Arai
    2018 Volume 39 Issue 2 Pages 101-108
    Published: March 01, 2018
    Released on J-STAGE: March 01, 2018
    JOURNAL FREE ACCESS
    Pressed voice is a type of voice quality produced by pressing/straining the vocal folds; it often appears in Japanese conversational speech when expressing paralinguistic information related to emotional or attitudinal behaviors of the speaker. With the aim of clarifying the acoustic and physiological features involved in pressed voice production, we conducted periodicity, spectral, and electroglottographic (EGG) analyses on pressed voice segments extracted from spontaneous dialogue speech of several speakers. Periodicity analysis indicated that pressed voice is usually accompanied by creaky or harsh voice, with irregularities in periodicity, but can also be accompanied by periodic voice with fundamental frequencies in the range of modal phonation. Spectral analysis indicated that power is usually reduced in the low-frequency components of pressed segments. A spectral measure, H1'-A1', was then proposed for characterizing pressed voice segments, which commonly have weak or no harmonic structure. H1'-A1' was shown to be effective for identifying most pressed segments, but fails when nasalization occurs. Vocal fold vibratory pattern analysis from the EGG signals revealed that most pressed voice segments (including nasalized vowels) are characterized by glottal pulses whose closed intervals are, on average, longer than their open intervals, regardless of periodicity. (A simplified H1-A1 sketch follows this entry.)
    Download PDF (1335K)
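    H1-A1-type measures compare the level of the first harmonic with the level of the strongest spectral peak near the first formant; low values indicate reduced low-frequency power, as in pressed segments. The sketch below computes a plain H1-A1 from one FFT frame; the primed corrections in the paper's H1'-A1' (designed for segments with weak harmonicity) are not reproduced, and f0 and F1 are assumed to be given.

      import numpy as np

      def h1_minus_a1(frame, fs, f0, f1, half_bw=50.0):
          """dB difference between the first-harmonic peak (near f0) and the
          strongest spectral peak near the first formant (f1)."""
          spec = 20 * np.log10(
              np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) + 1e-12)
          freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)

          def peak_near(f):
              band = (freqs > f - half_bw) & (freqs < f + half_bw)
              return spec[band].max()

          # A low value means the first harmonic is weak relative to the
          # formant region, i.e. reduced low-frequency power.
          return peak_near(f0) - peak_near(f1)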
  • Eri Iwagami, Takayuki Arai, Keiichi Yasu, Kei Kobayashi
    2018 Volume 39 Issue 2 Pages 109-118
    Published: March 01, 2018
    Released on J-STAGE: March 01, 2018
    JOURNAL FREE ACCESS
    In this study, two perception experiments were conducted to investigate the misperception of Japanese words with devoiced vowels and/or geminate consonants by young and elderly listeners. In Experiment 1, eight young normal-hearing listeners participated under a white-noise condition; eight elderly listeners participated in Experiment 2. Two word sets consisting of combinations of vowels (V = /i, u/) and voiceless consonants (C = /k, t, s/) were used as stimuli. The first set comprised two- or three-mora words, and the second set comprised 14 minimal pairs of the form CVC(:)V, where (:) stands for the presence or absence of a geminate consonant. The results of both experiments showed that misperception was frequent for words with devoiced vowels and even more frequent for words with geminate consonants. In particular, misperception of consonants containing high-frequency components, such as /shi/ or /shu/, was observed for elderly listeners. (A tallying sketch follows this entry.)
    Download PDF (1322K)
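    Per-word misperception rates like those reported above can be tallied directly from listener transcriptions. The responses below are hypothetical, and the paper's actual scoring details may differ.

      # Stimulus word -> listeners' transcriptions (hypothetical data).
      responses = {
          "kiku": ["kiku", "kiku", "kku", "kiku"],
          "kikku": ["kiku", "kikku", "kiku", "kiku"],   # geminate often missed
      }

      for word, answers in responses.items():
          errors = sum(a != word for a in answers)
          print(word, f"misperception rate = {errors / len(answers):.0%}")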
  • Kei Sawada, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, Keiichi ...
    2018 Volume 39 Issue 2 Pages 119-129
    Published: March 01, 2018
    Released on J-STAGE: March 01, 2018
    JOURNAL FREE ACCESS
    This paper proposes a method for constructing text-to-speech (TTS) systems for languages with unknown pronunciations. One goal of speech synthesis research is to establish a framework that can be used to construct TTS systems for any written language. Generally, language-specific knowledge is required to construct a TTS system for a new language, but such knowledge is difficult to acquire for each new language, so constructing a TTS system for a new language entails huge costs. To address this problem, we investigate a framework for automatically constructing a TTS system from a target-language database consisting of only speech data and corresponding Unicode texts. In the proposed method, pseudo phonetic information for the target language with unknown pronunciation is obtained with a speech recognizer for a rich-resource proxy language. A grapheme-to-phoneme converter and a statistical parametric speech synthesizer are then constructed from the obtained pseudo phonetic information. The proposed method was applied to Japanese and evaluated in terms of objective and subjective measures. Additionally, we attempted to construct TTS systems for nine Indian languages using the proposed method; these systems were evaluated in the Blizzard Challenge 2014 and 2015. (The pipeline is sketched after this entry.)
    Download PDF (831K)
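    The three-step pipeline (proxy-language recognition, G2P training, synthesizer training) can be outlined as follows. Every class and function here is a toy placeholder: the paper uses a speech recognizer for a rich-resource proxy language, a trained grapheme-to-phoneme converter, and a statistical parametric synthesizer, none of whose actual interfaces appear in the abstract.

      class ProxyASR:
          """Stand-in for the proxy-language speech recognizer."""
          def decode(self, utterance):
              return ["a", "k", "a"]             # placeholder pseudo phones

      def train_g2p(texts, phone_seqs):
          """Toy G2P: in the paper this is learned from Unicode text paired
          with the recognizer's pseudo phone sequences."""
          table = dict(zip(texts, phone_seqs))
          return lambda text: table.get(text, [])

      def build_tts(speech_data, unicode_texts, proxy_asr):
          pseudo = [proxy_asr.decode(u) for u in speech_data]    # step 1
          g2p = train_g2p(unicode_texts, pseudo)                 # step 2
          # Step 3 (omitted): train a statistical parametric synthesizer on
          # speech labeled with the pseudo phones; at run time, synthesize
          # from g2p(text).
          return g2p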
  • William F. Katz, Sonya Mehta, Matthew Wood
    2018 Volume 39 Issue 2 Pages 130-137
    Published: March 01, 2018
    Released on J-STAGE: March 01, 2018
    JOURNAL FREE ACCESS
    In order to investigate the articulatory processes involved in producing Japanese /r/, we obtained speech recordings from native talkers of standard Japanese using an electromagnetic articulography (EMA) system. Each talker produced repetitions of /r/ in a carrier phrase designed to contrast syllable (CV and VCV) and vowel (/a/, /i/, /u/, /e/, and /o/) contexts. Kinematic recordings were made using tongue (tip, TT; dorsum, TD; body, TB; left lateral, TLL; and right lateral, TRL) and lower lip/jaw (LL) sensors. We measured TT vertical displacement, TT duration at maximum position, and tongue blade width for the consonant gestures. In a perceptual experiment, American English listeners decided whether each consonant was 'l,' 'r,' or 'd.' The kinematic results indicate that Japanese talkers produced CV consonants with greater stricture and longer closures than consonants in intervocalic position. CV productions also had narrower tongue blade widths than VCV productions, especially in /i/ and /u/ contexts. The data were modeled with Dirichlet regression in order to determine how strongly tongue width and context (syllable and vowel) factors predict listeners' judgments. The results showed a significant fit for 'r' judgments, with the tongue-width fit successively improved by the addition of syllable and vowel context information. (A Dirichlet-regression sketch follows this entry.)
    Download PDF (1075K)
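    Dirichlet regression models each token's vector of response proportions (here over 'l,' 'r,' 'd') as Dirichlet-distributed, with concentration parameters tied to predictors through a log link. The sketch below fits such a model by maximum likelihood on synthetic data; the predictors and their coding are assumptions, not the paper's actual design matrix.

      import numpy as np
      from scipy.optimize import minimize
      from scipy.stats import dirichlet

      rng = np.random.default_rng(0)
      n = 120
      X = np.column_stack([np.ones(n),                    # intercept
                           rng.normal(size=n),            # tongue blade width (scaled)
                           rng.integers(0, 2, size=n)])   # syllable context: CV vs VCV
      Y = rng.dirichlet([2.0, 3.0, 1.0], size=n)          # proportions of 'l'/'r'/'d'

      def neg_loglik(beta_flat):
          beta = beta_flat.reshape(X.shape[1], 3)
          alpha = np.exp(X @ beta)                        # log link keeps alpha > 0
          return -sum(dirichlet.logpdf(y, a) for y, a in zip(Y, alpha))

      fit = minimize(neg_loglik, np.zeros(X.shape[1] * 3), method="L-BFGS-B")
      print(fit.x.reshape(X.shape[1], 3))                 # per-category coefficients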
ACOUSTICAL LETTERS