This article outlines the creation of the Japanese Map Task Corpus and presents part of its analysis. The project recorded 128 dialogues in which two participants were engaged in performing the Map Task; the total length of the recordings is about 23 hours. The dialogues are transcribed in Japanese kana characters. The corpus is planned for publication on CD-ROMs, containing the digitized sound files and the transcriptions in TEI P3 format, together with the software that supported the creation of the corpus and will be helpful in its analysis. The article describes and discusses the principles and practices followed in creating the corpus, and presents some results from its analysis.
This article reports on databases for conversational speech translation research built by ATR Interpreting Telecommunications Research Laboratories. It first describes a bilingual travel conversation database and a Japanese monolingual travel conversation database for speech translation research. It then turns to a Japanese speech database for speaker-independent continuous speech recognition research; this database contains dialogues and read speech uttered by about 4,000 speakers covering all regions of Japan. In addition, we describe another speech database for research purposes, which contains read speech uttered by professional announcers and narrators, supplied with precise labels of speech segments. All of these databases have already been released outside ATR and are used for research at many research institutes and universities.
This article describes the CallHome Japanese (CHJ) corpus and a project for annotating this corpus with various sorts of linguistic tags. The CHJ corpus is a collection of digitized speech data and text transcriptions of 120 spontaneous, unscripted telephone conversations. The annotation of the corpus provides word segmentations, part-of-speech tags, and alignment with the speech for all words, semantic classes for nouns and verbs, and argument structures for verbs. A large-scale, high-quality corpus of naturally occurring conversations with such extensive linguistic annotations will provide a basis for scientific and technological investigation into human speech communication.
This article provides an overview of the X-ray microbeam speech production database in Japanese. The database is a research resource characterized by (1) synchronized acoustic and articulatory data, (2) a relatively large number of speakers (N=19), and (3) a rich inventory of speech tasks that can accommodate a wide range of research interests. The rationale, tasks, speakers, data characteristics, previous research, future research possibilities, and an application to phonetics education are discussed.
This CD-ROM dictionary contains accent data, both contemporary and ancient, for 65,928 words in Osaka Japanese (spoken by three older-generation and three younger-generation speakers) and Tokyo Japanese. For the 5,684 basic entries, the CD-ROM enables you to listen to the actual pronunciation from the two regions while simultaneously viewing the waveform and pitch contour. Over 30 years in the making, this dictionary opens new possibilities for learning and/or conducting research on accent in Japanese.
The NTT psycholinguistic databases "Lexical Properties of Japanese" were developed for a large number of Japanese words and characters. The databases contain word and character information such as familiarity, frequency of occurrence, appropriateness of accent, appropriateness of orthography, subjective complexity, and other important psycholinguistic properties. Words and characters satisfying multiple search conditions can be obtained very efficiently as stimuli for psycholinguistic experiments, which was previously impossible without such a database. The databases provide a basis for psycholinguistic research on Japanese.
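To illustrate the kind of multi-condition stimulus search described above, the following is a minimal sketch; the records, property names, and thresholds are invented for illustration and do not reflect the actual format of the NTT databases.

```python
# Hypothetical sketch of selecting experimental stimuli by multiple
# lexical properties at once. All entries and thresholds below are
# invented example data, not taken from the NTT databases.

# Each record: (word, familiarity rating, frequency of occurrence)
lexicon = [
    ("sakura", 6.5, 120),
    ("kisha", 4.2, 35),
    ("denwa", 6.8, 300),
    ("rinri", 3.9, 12),
]

def select_stimuli(entries, min_familiarity, min_frequency):
    """Return words satisfying all search conditions simultaneously."""
    return [
        word
        for word, familiarity, frequency in entries
        if familiarity >= min_familiarity and frequency >= min_frequency
    ]

# High-familiarity, high-frequency words, e.g. for a control condition.
print(select_stimuli(lexicon, 6.0, 100))  # -> ['sakura', 'denwa']
```

In practice, such a query would run against the full database tables rather than an in-memory list, but the principle of conjoining several property constraints is the same.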
Compilation of a large-scale corpus of spontaneous Japanese monologue is underway as a joint project of the National Language Research Institute (under the Agency for Cultural Affairs) and the Communications Research Laboratory (under the Ministry of Posts and Telecommunications). The corpus will contain about 700 hours of digitized speech (about 7 million morphemes), its transcription, and various tagging information such as POS tags. Phonological labels (segmental as well as prosodic) will be provided for a subset of the corpus. The corpus will become publicly available in the spring of 2004.
Infants prefer their native language to a foreign language soon after birth. This experiment aimed to determine whether infants can detect the phonological differences between a native dialect and an unfamiliar dialect, even though both dialects belong to the same language. The subjects were 43 infants aged 5 to 8 months. All infants came from families speaking only the Eastern Japanese dialect and had had little exposure to the Western Japanese dialect. Preferential listening time toward the Eastern and Western dialects was measured by the Head Turn Preference Procedure. The results demonstrated that 8-month-old infants showed a greater preference for the native Eastern dialect than for the unfamiliar Western dialect, whereas 5-, 6-, and 7-month-old infants did not show any significant difference in their listening times. This suggests that 8-month-old infants can detect the difference between these dialects and pay more attention to a native dialect than to an unfamiliar one.