Journal of the Acoustical Society of Japan (E)

Foreword to Special Issue on Speech Database/Assessment for Oriental Languages

Shuichi Itahashi

1999 Volume 20 Issue 3 Pages 159-161
Published: 1999
Released on J-STAGE: February 17, 2011

DOIhttps://doi.org/10.1250/ast.20.159

JOURNAL FREE ACCESS

Download PDF (332K)
On recent speech corpora activities in Japan

Shuichi Itahashi

1999 Volume 20 Issue 3 Pages 163-169
Published: 1999
Released on J-STAGE: February 17, 2011

DOIhttps://doi.org/10.1250/ast.20.163

JOURNAL FREE ACCESS

Show abstractHide abstract

This paper describes a range of Japanese projects which are concerned with speech corpora. ETL is to be credited for initiating research on speech database in 1973, while Tohoku University played a pioneering role in speech corpus development. The JEIDA Japanese Common Speech Data Corpus was reported in 1986 and then later converted to DAT form. Subsequently in 1990, the JEIDA Noise Database was released to the public. Other important contributions are due not only to ATR which has developed a wide varietyof speech corpora, but also to the so-called priority area projects funded by MESSC. On the one hand, the “Spoken Language” project has yielded data on continuous speech, while the “Spoken Japanese” project yielded data on various dialectal speech from all over Japan. On the other hand, the “Spoken Dialogue” project has yielded data on various spoken dialogues. Six CD-ROMs were produced by a committee of the Acoustical Society of Japan. Three of them contain speech of isolated sentences that are phonetically balanced, while the remaining three include continuous speech obtained for various guide-tasks. This paper finally refers to the new ASJ corpus and the “Real World Computing Program” formulated in 1992 by the Japanese Government.

View full abstract

Download PDF (2561K)
Performance comparison of recognition systems: A Bayesian approach

Kazuhiko Ozeki, Yoshiyasu Ishigami, Kazuyuki Takagi

1999 Volume 20 Issue 3 Pages 171-179
Published: 1999
Released on J-STAGE: February 17, 2011

DOIhttps://doi.org/10.1250/ast.20.171

JOURNAL FREE ACCESS

Show abstractHide abstract

This paper describes a Bayesian approach to performance comparison of recognition systems. Unlike a conventional statistical test, this method makes no decision whether there is a significant difference between the true recognition rate of System A and that of System B. Instead, it gives the probability of the event that the true recognition rate of A is higher than that of B given their recognition results. The probability is referred to as the superiority of A to B. This is similar to a numerical weather forecast, in which what is predicted is the probability of having a certain amount of rain, not a prospect of being sunny or rainy. The superiority is exemplified in various cases for the manner of inputting test data and observing the recognition results, and then its sensitivity for the difference between the respective sample recognition rates of A and B is investigated. All the results support that this method has natural properties which conform to our intuition. The relationship between the superiority in this method and the level of significance in statistical tests is also discussed.

View full abstract

Download PDF (2315K)
Design of Mongolian speech database considering dialectal characteristics

Idomuso Dawa, Shigeki Okawa, Katsuhiko Shirai

1999 Volume 20 Issue 3 Pages 181-188
Published: 1999
Released on J-STAGE: February 17, 2011

DOIhttps://doi.org/10.1250/ast.20.181

JOURNAL FREE ACCESS

Show abstractHide abstract

In this paper, aiming at the advancement of speech research for Mongolian language, we discuss design and collection of Mongolian speech database. Mongolian, one of the most spoken oriental languages, has several dialectal variations in its phonetic and graphic systems because of the historical and the geographical background. This causes a significant obstacle to develop their culture and education. Furthermore, unlike the other major languages, the basis of linguistic and phonetic research has not been investigated well yet. In this study, therefore, we first analyze fundamental characteristics of Mongolian spoken language in order to define phoneme labels. Next, we design the Mongolian speech database by collecting actual native speech. Lastly, we apply the database to a simple word recognition experiment.

View full abstract

Download PDF (8441K)
Building a Thai part-of-speech tagged corpus (ORCHID)

Virach Sornlertlamvanich, Naoto Takahashi, Hitoshi Isahara

1999 Volume 20 Issue 3 Pages 189-198
Published: 1999
Released on J-STAGE: February 17, 2011

DOIhttps://doi.org/10.1250/ast.20.189

JOURNAL FREE ACCESS

Show abstractHide abstract

ORCHID (Open linguistic Resources CHanelled toward InterDisciplinary research) is an initiative project aimed at building linguistic resources to support research in, but not limited to, natural language processing. Based on the concept of an open architecture design, the resources must be fully compatible with similar resources, and software tools must also be made available. This paper presents one result of the project, the construction of a Thai part-of-speech (POS) tagged corpus, which is a preliminary stage in the construction of a Thai speech corpus. The POS-tagged corpus is the result of collaborative research between the Communications Research Laboratory (CRL) in Japan and the National Electronics and Computer Technology Center (NECTEC) in Thailand, with technical support from the Electrotechnical Laboratory (ETL) in Japan. In this paper, we propose a new tagset, based on the results of a prior multilingual machine translation project. The corpus is annotated on three levels: the paragraph, sentence, and word levels. Text information is maintained in the form of the text information lines and the number lines, which are both utilized in data retrieval. Both word segmentation and POS tagging were carried out by way of a probabilistic trigram model. Rules for syllable demarkation were additionally used to reduce the number of candidates in computing tagging probabilities

View full abstract

Download PDF (5240K)
JNAS: Japanese speech corpus for large vocabulary continuous speech recognition research

Katunobu Itou, Mikio Yamamoto, Kazuya Takeda, Toshiyuki Takezawa, Tats ...

1999 Volume 20 Issue 3 Pages 199-206
Published: 1999
Released on J-STAGE: February 17, 2011

DOIhttps://doi.org/10.1250/ast.20.199

JOURNAL FREE ACCESS

Show abstractHide abstract

In this paper we present the first public Japanese speech corpus for large vocabulary continuous speech recognition (LVCSR) technology, which we have titled JNAS (Japanese Newspaper Article Sentences). We designed it to be comparable to the corpora used in the American and European LVCSR projects. The corpus contains speech recordings (60 h) and their orthographic transcriptions for 306 speakers (153 males and 153 females) reading excerpts from the newspaper's articles and phonetically balanced (PB) sentences. This corpus contains utterances of about 45, 000 sentences as a whole with each speaker reading about 150 sentences. JNAS is being distributed on 16 CD-ROMs.

View full abstract

Download PDF (963K)
A Japanese spontaneous speech corpus collected using automatically inferencing Wizard of OZ system

Katunobu Itou, Tomoyosi Akiba, Osamu Hasegawa, Satoru Hayamizu, Kazuyo ...

1999 Volume 20 Issue 3 Pages 207-214
Published: 1999
Released on J-STAGE: February 17, 2011

DOIhttps://doi.org/10.1250/ast.20.207

JOURNAL FREE ACCESS

Show abstractHide abstract

In this paper, we describe a corpus that contains spontaneous speech data between human and machine using the WOZ technique. It is designed to collect data in order to analyze the elements that enable interaction between human and machine, such as taking turns naturally during the dialog, chiming in, interruption, and natural recovery from interruption. The corpus contains the data from forty speakers during 197 sessions. The net amount of time that was spent on dialog sessions was over 1, 300 min. The corpus contains speech signal waves, pitch patterns, transcriptions, utterance segment boundaries, and semantic representations of the user utterance.

View full abstract

Download PDF (3065K)
Machine readable phonetic transcription system for Chinese dialects spoken in Taiwan

Chiu-yu Tseng, Fu-chiang Chou

1999 Volume 20 Issue 3 Pages 215-223
Published: 1999
Released on J-STAGE: February 17, 2011

DOIhttps://doi.org/10.1250/ast.20.215

JOURNAL FREE ACCESS

Show abstractHide abstract

Since existing ASCII versions of phonetic transcription systems appear to aim at transcribing European languages only, they prove to be insufficient to accommodate syllable based tonal languages such as Chinese. An ASCII encoding of phonetic transcription system for speech database of Chinese was designed. Major characteristics of the proposed design are: 1. Though based on the principles of the International Phonetic Alphabet (IPA), the current design includes two levels of transcription, namely, segmental and prosodic. The consequence is a more elaborate system than the IPA or its equivalent. 2. The proposed system specifically aims to transcribe three major Chinese dialects spoken in Taiwan, namely, Mandarin, Taiwanese and Hakka. The consequence is a more language-dependent system rather than a general system as the IPA.

View full abstract

Download PDF (5098K)
Sound scene data collection in real acoustical environments

Satoshi Nakamura, Kazuo Hiyane, Futoshi Asano, Takashi Endo

1999 Volume 20 Issue 3 Pages 225-231
Published: 1999
Released on J-STAGE: February 17, 2011

DOIhttps://doi.org/10.1250/ast.20.225

JOURNAL FREE ACCESS

Show abstractHide abstract

This paper describes a sound scene database necessary for studies such as sound source localization, sound retrieval, sound recognition and speech recognition in real acoustical environments. Many speech databases have been collected for speech recognition so far. The statistical modeling of speech based on the collected speech databases realizes a drastic improvement of speech recognition performance. However, there are only a few databases available for sound scene data including non-speech sound in real environments. A sound scene database is obviously necessary for studies of acoustical signal processing and sound recognition. This paper reports on a project for collection of the sound scene database supported by Real World Computing Partnership (RWCP). There are many kinds of sound scenes in real environments. The sound scene is denoted by sound sources androomacoustics. The number of combination of the sound sources, source positions and rooms is huge in real acoustical environments. Two approaches are taken to build the sound scene database in the early stage of the project. The first approach is to collect isolated sound sources of many kinds of non-speech sounds and speech sounds. The second approach is to collect impulse responses in various acoustical environments. The sound in the collected environments can be simulated by convolution of the isolated sound sources and impulse responses. In a later stage, the sound scene data in real acoustical environments is planned to be collected using a three dimensional microphone array. In this paper, the plan and progress of our sound scene database project are described.

View full abstract

Download PDF (4605K)
Japanese Dictation Toolkit-1997 version-

Tatsuya Kawahara, Akinobu Lee, Tetsunori Kobayashi, Kazuya Takeda, Nob ...

1999 Volume 20 Issue 3 Pages 233-239
Published: 1999
Released on J-STAGE: February 17, 2011

DOIhttps://doi.org/10.1250/ast.20.233

JOURNAL FREE ACCESS

Show abstractHide abstract

The Japanese Dictation Toolkit has been designed and developed as a baseline platform for Japanese LVCSR (Large Vocabulary Continuous Speech Recognition). The platform consists of a standard recognition engine, Japanese phone models and Japanese statistical language models. We set up a variety of Japanese phone HMMs from a contextindependent monophone to a triphone model of thousands of states. They are trained with ASJ (The Acoustical Society of Japan) databases. A lexicon and word N-gram (2-gram and 3-gram) models are constructed with a corpus of Mainichi newspaper. The recognition engine JULIUS is developed for evaluation of both acoustic and language models. As an integrated system of these modules, we have implemented a baseline 5, 000-word dictation system and evaluated various components. The software repository is available to the public.

View full abstract

Download PDF (760K)
A call for generic-use large-scale single-speaker speech corpora and an example of their application in concatenative speech synthesis

Nick Campbell

1999 Volume 20 Issue 3 Pages 241-246
Published: 1999
Released on J-STAGE: February 17, 2011

DOIhttps://doi.org/10.1250/ast.20.241

JOURNAL FREE ACCESS

Download PDF (6189K)

Register with J-STAGE for free!