-
Dongyuan Li, Ying Zhang, Yusong Wang, Kotaro Funakoshi, Manabu Okumura
2024 Volume 31 Issue 3 Pages 825-867
Published: 2024
Released on J-STAGE: September 15, 2024
JOURNAL
FREE ACCESS
Speech emotion recognition (SER) has garnered increasing attention due to its wide range of applications in various fields, including human-machine interaction, virtual assistants, and mental health assistance. However, existing SER methods often overlook the information gap between the pre-training speech recognition task and the downstream SER task, resulting in sub-optimal performance. Moreover, current methods require considerable time for fine-tuning on each specific speech dataset, such as IEMOCAP, which limits their effectiveness in real-world scenarios with large-scale noisy data. To address these issues, we propose an active learning (AL)-based fine-tuning framework for SER, called After, which leverages task adaptation pre-training (TAPT) and AL methods to enhance performance and efficiency. Specifically, we first use TAPT to minimize the information gap between the pre-training speech recognition task and the downstream speech emotion recognition task. Then, AL methods are employed to iteratively select a subset of the most informative and diverse samples for fine-tuning, thereby reducing time consumption. Experiments demonstrate that our proposed method After, using only 20% of samples, improves accuracy by 8.45% and reduces time consumption by 79%. An additional extension of After and ablation studies further confirm its effectiveness and applicability to various real-world scenarios.
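The AL selection loop described in this abstract can be sketched generically. The abstract only says the selected samples should be "informative and diverse"; the concrete criteria below (predictive entropy for informativeness, a farthest-point heuristic over feature vectors for diversity, and a 2k uncertainty pre-filter) are illustrative assumptions, not After's actual strategy.

```python
import math

def entropy(probs):
    """Predictive entropy: higher means the model is less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_informative_diverse(features, prob_dists, k):
    """Rank the unlabeled pool by uncertainty, then greedily pick a
    diverse subset among the top candidates (farthest-point heuristic).
    Illustrative sketch only -- not the paper's exact criteria."""
    order = sorted(range(len(features)), key=lambda i: -entropy(prob_dists[i]))
    candidates = order[: 2 * k]          # uncertainty pre-filter
    chosen = [candidates[0]]             # start from the most uncertain sample
    while len(chosen) < k:
        # pick the candidate farthest (in min-distance) from everything chosen
        best = max((c for c in candidates if c not in chosen),
                   key=lambda c: min(dist(features[c], features[s]) for s in chosen))
        chosen.append(best)
    return chosen
```

Each AL round would fine-tune on the selected indices, re-estimate `prob_dists`, and repeat until the labeling budget (e.g., 20% of the pool) is spent.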
-
Kosuke Doi, Katsuhito Sudoh, Satoshi Nakamura
2024 Volume 31 Issue 3 Pages 868-893
Published: 2024
Released on J-STAGE: September 15, 2024
This paper describes the development of a large-scale English-Japanese simultaneous interpretation corpus named NAIST-SIC and presents analyses of it. We collected recordings of simultaneous interpreting sentences (SIsent). To understand the characteristics of simultaneous interpreting by human simultaneous interpreters (SIers), we analyzed a subset of this corpus. The speech samples were interpreted by three SIers with different levels of experience, so the subset can be used to compare SIsent attributes in terms of the SIers’ experience. Using this corpus subset, we analyzed the differences in latency, quality, and word order. The results show that (1) SIers with more experience tended to generate higher-quality SIsent, and (2) they controlled latency and quality better. We also observed that (3) a large latency degraded SIsent quality.
-
Kazuki Ishikawa, Kohei Ogawa, Satoshi Sato
2024 Volume 31 Issue 3 Pages 894-934
Published: 2024
Released on J-STAGE: September 15, 2024
Many different speech styles are used in spoken Japanese. Authors of Japanese young-adult novels exploit this characteristic to indicate which character speaks an utterance. For a reader of such novels, the speech style of an utterance is a clue for identifying the speaker when there is no speaker information in the ground sentences around the utterance. To realize automatic speaker identification using speech style, we propose three things: (1) a speech-style encoder that converts utterances into vectors (speech-style vectors) in which the features of the speech style are embedded; (2) a method for automatically identifying the speaker of utterances (speech-style-based speaker identification) using this encoder; and (3) a speaker identification system that combines the above method with a speaker candidate generation module. Using this system, we conducted speaker identification experiments on five novels. For four of the five novels, the proposed method outperformed the two baseline methods.
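A minimal sketch of speech-style-based speaker identification, assuming utterances have already been encoded into style vectors (the paper's encoder is learned; the toy 2-dimensional vectors and the nearest-centroid-by-cosine rule below are illustrative assumptions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def centroid(vectors):
    """Component-wise mean of a speaker's known style vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def identify_speaker(utterance_vec, candidate_profiles):
    """candidate_profiles maps each candidate speaker (e.g., from a
    candidate generation module) to the style vectors of their known
    utterances; return the speaker whose centroid is most similar."""
    return max(candidate_profiles,
               key=lambda s: cosine(utterance_vec, centroid(candidate_profiles[s])))
```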
-
Zhengdong Yang, Shuichiro Shimizu, Chenhui Chu, Sheng Li, Sadao Kuroha ...
2024 Volume 31 Issue 3 Pages 935-957
Published: 2024
Released on J-STAGE: September 15, 2024
Speech-to-text translation (ST) translates speech from the source language into text in the target language. Because ST deals with different forms of language, it faces a language style gap between spoken and written language. The gap lies not only between the input speech and the output text but also between the input speech and the bilingual parallel corpora that are often used in ST. These gaps become an obstacle to improving the performance of ST. Spoken-to-written style conversion has been proven to improve cascaded Japanese-English ST by reducing such gaps. Integrating this conversion into end-to-end ST is desirable because of its ease of deployment, improved efficiency, and reduced error propagation compared to cascaded ST. In this study, we construct a large-scale Japanese-English lecture domain ST dataset. We also propose a joint task of speech-to-text spoken-to-written style conversion and end-to-end ST, as well as an interactive-attention-based multi-decoder model for the joint task to improve end-to-end ST. Experiments on the constructed dataset show that our model outperforms a strong baseline.
-
Kyosuke Takahagi, Kanako Komiya, Hiroyuki Shinnou
2024 Volume 31 Issue 3 Pages 958-983
Published: 2024
Released on J-STAGE: September 15, 2024
Data augmentation is a technique for augmenting training data to improve model performance in supervised learning and has been widely used in the field of computer vision. However, the technique remains underdeveloped in natural language processing. In this study, we focus on two data augmentation methods that can be used for Japanese natural language processing tasks. The first method replaces a word in a sentence with another word using the masked language model of a BERT model different from the one used for analysis and inference. The second method shuffles the order of phrases so that the dependency relations of the sentence are not broken. In this study, we provide an overview of each method and the conversions it performs, and then describe the tasks for which each method is effective.
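The second method (dependency-preserving phrase shuffling) can be illustrated with a toy enumerator. It assumes each phrase's head index is already known from a parser; since Japanese dependencies point rightward, any order that keeps every phrase before its head preserves the dependency relations. The romanized example sentence is made up for illustration.

```python
import itertools

def valid_orders(phrases, heads):
    """Yield reorderings of the phrases in which every phrase still
    precedes its head (head index -1 marks the root), so the sentence's
    dependency structure is preserved under scrambling."""
    for perm in itertools.permutations(range(len(phrases))):
        pos = {idx: p for p, idx in enumerate(perm)}
        if all(h == -1 or pos[i] < pos[h] for i, h in enumerate(heads)):
            yield [phrases[i] for i in perm]
```

For augmentation, one would sample a few of these orders per sentence rather than enumerate them all.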
-
Takayoshi Shibahara, Ikuya Yamada, Noriki Nishida, Hiroki Teranishi, K ...
2024 Volume 31 Issue 3 Pages 984-1014
Published: 2024
Released on J-STAGE: September 15, 2024
Named entity recognition (NER) is a fundamental and important task in natural language processing. However, traditional NER methods require large amounts of supervised data. As a result, they cannot respond flexibly to real-world demands, e.g., extracting categories with varying granularities depending on the user’s requirements. Weakly supervised NER, which uses contexts in which known words occur as pseudo-data, can satisfy the demand for a variety of categories when combined with a large-scale thesaurus. Previous studies on weakly supervised NER have proposed learning methods that are robust to errors in pseudo-supervised data. However, the models created using such learning methods suffer from the side effect of making predictions across the boundary between categories of interest and other categories. To mitigate this shortcoming, we propose a method that utilizes all categories in the thesaurus, including those demanded by users, for pseudo-data generation, and we empirically demonstrate the usefulness of the holistic knowledge contained in the thesaurus.
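The core of pseudo-data generation from a thesaurus can be sketched as a simple longest-match dictionary labeler over tokenized text. This is a toy illustration (the entries and the BIO tagging detail are assumptions, and the actual approach uses the contexts in which known words occur rather than plain lookup):

```python
def pseudo_label(tokens, thesaurus):
    """Tag tokens by longest-match lookup in a thesaurus of
    entity -> category; unmatched tokens get 'O'. Labeling with *all*
    thesaurus categories, not only those a user asked for, gives the
    model negative evidence at category boundaries."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):      # try the longest span first
            span = " ".join(tokens[i:j])
            if span in thesaurus:
                cat = thesaurus[span]
                labels[i] = "B-" + cat
                for k in range(i + 1, j):
                    labels[k] = "I-" + cat
                i = j
                break
        else:
            i += 1
    return labels
```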
-
Fuka Narita, Shiki Sato, Ryoko Tokuhisa, Kentaro Inui
2024 Volume 31 Issue 3 Pages 1015-1048
Published: 2024
Released on J-STAGE: September 15, 2024
Expressing personal impressions contributes to the liveliness of dialogue in open-domain conversations. However, generating natural impressions of topics or other utterances requires an understanding of the conversation subject and the interlocutor’s utterances, as well as the utilization of common-sense knowledge, making it a challenging task for open-domain chatbot systems. We aim to develop an open-domain chatbot system capable of generating appropriate impressions in conversational contexts by incorporating real people’s impressions as external information. In this study, we constructed a “News Commentary Chat Corpus,” enabling open-domain chatbot systems to learn to select suitable impressions and to generate responses based on the selected impressions. The proposed corpus comprises 1005 triplets containing “news articles,” “people’s impressions of the news articles,” and “dialogues on the news articles.” Each dialogue was collected using the Wizard of Oz method, in which the system-side speaker engages in conversations by incorporating impressions written in social media posts. Training systems on this corpus to generate responses using people’s impressions as external information revealed that the systems produced responses that were natural in context. Additionally, these systems generated a considerable number of responses that included impressions, thus enhancing the overall liveliness of open-domain conversations.
-
Yasumasa Kano, Katsuhito Sudoh, Satoshi Nakamura
2024 Volume 31 Issue 3 Pages 1049-1075
Published: 2024
Released on J-STAGE: September 15, 2024
In simultaneous translation, translation begins before the end of an input speech segment. Its evaluation should be conducted based on latency and quality. For users, the smallest possible latency is preferable. Most existing metrics measure latency based on the start timings of partial translations and ignore their duration. This implies that such metrics do not penalize the latency caused by a long translation output, which delays user comprehension and subsequent translations. In this paper, we propose a novel latency evaluation metric for simultaneous translation called the Average Token Delay (ATD), which focuses on the duration of partial translations. We demonstrate its effectiveness through analyses that simulate user-side latency based on the Ear-Voice Span (EVS). In our experiments, ATD had the highest correlation with EVS among the baseline latency metrics under most conditions. These results suggest that ATD provides a more accurate evaluation of latency.
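The intuition that a long translation output itself adds delay can be sketched roughly as follows. This is a simplified reading of the abstract, not the paper's exact ATD definition: output tokens are paired with input tokens by position (capped at the last input), emissions are chained so a long output pushes later tokens back, and delay is measured at each output token's end time.

```python
def average_token_delay(in_times, out_start_times, token_dur):
    """Rough sketch of the ATD idea (not the paper's exact formula).
    Each output token takes `token_dur` to emit and cannot overlap the
    previous one; its delay is its end time minus the arrival time of
    the positionally paired input token."""
    delays, prev_end = [], 0.0
    for t, start in enumerate(out_start_times):
        end = max(start, prev_end) + token_dur   # long outputs push later tokens back
        src_time = in_times[min(t, len(in_times) - 1)]
        delays.append(end - src_time)
        prev_end = end
    return sum(delays) / len(delays)
```

With identical input timings and start times, doubling the per-token output duration raises the measured delay, which start-timing-only metrics would not capture.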
-
Wen Ma, Mika Kishino, Kanako Komiya, Hiroyuki Shinnou
2024 Volume 31 Issue 3 Pages 1076-1106
Published: 2024
Released on J-STAGE: September 15, 2024
This study proposes a method for building language models for specific speakers. Character-specific utterances are currently required for interactive agents and games, such as RPGs. However, the training data for building a language model specialized for a specific character are limited. Therefore, using T5, we transform the utterances of other characters in the same work as the target speaker into the speech style of the target speaker, thereby augmenting the training data. We fine-tuned GPT-2, the base language model, using domain adaptive pretraining (DAPT) + task adaptive pretraining (TAPT) methods. We regarded the utterances of the target speaker as the training data for TAPT and the utterances of the characters in the work as the training data for DAPT. We added character names at the beginning of the utterances to deal with the diversity of the data. Additionally, we manually transformed the utterances of the characters into general utterances, which produced parallel data of character-specific and general utterances. We fine-tuned T5 using these parallel data and created two types of T5 models: (A) a model that transforms general speech into a character-specific style and (B) a model that transforms character-specific speech into a general style. We augmented the utterances of the target speaker in two ways using these models: (1) we transformed the manually rewritten general utterances into the character-specific style of the target speaker using Model (A), and (2) we transformed the utterances of other characters in the same work into the speech style of the target speaker using Models (A) and (B). The experiments showed that the average perplexity of language models for seven characters was 27.33 when GPT-2 was trained with only the utterances of the target speaker, whereas it was 21.15 with the proposed method, showing an improvement in performance.
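The evaluation metric reported above, perplexity, is the exponential of the average per-token negative log-likelihood; lower values mean the model fits the target speaker's utterances better. A quick sketch:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(average negative log-likelihood per token).
    `token_logprobs` are the model's natural-log probabilities for the
    reference tokens of a held-out utterance."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

For example, a model that assigns probability 0.25 to every token has perplexity 4.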
-
Nobuhiro Ueda, Hideko Habe, Yoko Matsui, Akishige Yuguchi, Seiya Kawan ...
2024 Volume 31 Issue 3 Pages 1107-1139
Published: 2024
Released on J-STAGE: September 15, 2024
Understanding the situation in the physical world is crucial for robots assisting humans in the real world. In particular, when a robot is required to collaborate with humans through verbal interactions, such as dialogues, the verbal information that appears in user interactions must be grounded in the visual information observed in egocentric views. To this end, we proposed a multimodal reference resolution task and constructed a Japanese Conversation dataset for Real-world Reference Resolution (J-CRe3). Our dataset contains egocentric videos and dialogue audio from real-world conversations between two people acting as a master and an assistant robot at home. It is annotated with crossmodal tags between phrases in the utterances and object bounding boxes in the video frames. These tags include indirect reference relations, such as predicate-argument structures and bridging references, as well as direct reference relations. We also constructed an experimental model that combined an existing textual reference resolution model with a phrase grounding model. Our experiments with this model showed that crossmodal reference resolution is significantly more challenging than textual reference resolution in the proposed task.
-
Zhishen Yang, Raj Dabre, Hideki Tanaka, Naoaki Okazaki
2024 Volume 31 Issue 3 Pages 1140-1165
Published: 2024
Released on J-STAGE: September 15, 2024
Figures in scholarly documents provide a straightforward method of communicating scientific findings to readers. Automating figure caption generation enhances model understanding of scientific documents beyond text and helps authors write informative captions. Unlike previous studies, we frame scientific figure captioning as a knowledge-augmented image-captioning task in which models must utilize knowledge embedded across modalities for caption generation. To this end, we extend the large-scale SciCap dataset (Hsu et al. 2021) to SciCap+, which includes mention paragraphs (paragraphs mentioning figures) and OCR tokens. We then conducted experiments using the M4C-Captioner (a multimodal transformer-based model with a pointer network) as a baseline for our study. Our results indicate that mention paragraphs serve as additional context knowledge, significantly boosting automatic standard image caption evaluation scores compared to figure-only baselines. Human evaluations further reveal the challenges associated with generating figure captions that are informative to readers. The code and SciCap+ dataset are publicly available: https://github.com/ZhishenYang/scientific_figure_captioning_dataset
-
Yuki Yasuda, Taro Miyazaki, Jun Goto
2024 Volume 31 Issue 3 Pages 1166-1192
Published: 2024
Released on J-STAGE: September 15, 2024
Multi-label text classification, which assigns multiple labels to a single text, is a key task in natural language processing. In this task, a model is often trained on an imbalanced dataset whose label frequencies follow a long-tail distribution. Low-frequency labels that rarely appear in training data have an extremely small number of positive samples, so most of the input samples are negative. Therefore, the model learns low-frequency labels with the loss value dominated by the negative samples. In this research, we propose a method called weighted asymmetric loss that combines the appearance frequency weight of labels, the weight that suppresses the loss value derived from negative samples, and a label smoothing method in accordance with the co-occurrences of each label. Experimental results demonstrate that the proposed method improves the accuracy compared to existing methods, especially on imbalanced datasets.
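The combination described above can be sketched for a single sample: inverse-frequency label weights plus an asymmetric focusing term with a probability margin that suppresses the loss from easy negatives. The hyperparameter values and the exact weighting scheme below are illustrative assumptions, and the co-occurrence-based label smoothing component is omitted for brevity.

```python
import math

def weighted_asymmetric_loss(y, p, freq, gamma_pos=0.0, gamma_neg=4.0, margin=0.05):
    """One-sample multi-label loss: y are 0/1 targets, p predicted
    probabilities, freq per-label appearance frequencies. Negatives get
    a larger focusing exponent and a probability margin, so confident
    (easy) negatives contribute almost nothing to the loss."""
    total = 0.0
    for c in range(len(y)):
        w = 1.0 / freq[c]                     # rarer labels weigh more
        if y[c] == 1:
            total += w * (1 - p[c]) ** gamma_pos * -math.log(max(p[c], 1e-12))
        else:
            pm = max(p[c] - margin, 0.0)      # shift easy negatives toward zero
            total += w * pm ** gamma_neg * -math.log(max(1.0 - pm, 1e-12))
    return total
```

With this shape, the gradient signal for a low-frequency label is no longer drowned out by the mass of easy negative samples.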
-
Tomoya Kurosawa, Hitomi Yanaka
2024 Volume 31 Issue 3 Pages 1193-1238
Published: 2024
Released on J-STAGE: September 15, 2024
Characters are the smallest units of natural language, and humans understand texts from characters. Past studies have attempted to train language models with the information obtained from character sequences (character-level information), in addition to tokens, to improve model performance on various natural language processing tasks in various languages. However, these studies treated the improvement from character-level information simply as the performance difference between models with and without characters. The extent to which these models actually use character-level information to solve these tasks remains unclear. The effects of linguistic features, such as morphological factors, on performance differences across languages have also not been fully investigated. In this study, we examine existing character-employed neural models and how their performance varies with character-level information. We focus on four languages: English, German, Italian, and Dutch, and three tasks: part-of-speech (POS) tagging, dependency parsing, and Discourse Representation Structure (DRS) parsing. The experimental results show that character-level information has the greatest effect on model performance on the POS tagging and dependency parsing tasks in German and on the DRS parsing task in Italian. Based on these results, we hypothesize that the significant effects on model performance in German are caused by the average lengths of words and the forms of common nouns. A detailed analysis reveals a strong correlation between average word length and the effectiveness of character-level information on POS tagging in German.
-
Kouta Nakayama, Shuhei Kurita, Yukino Baba, Satoshi Sekine
2024 Volume 31 Issue 3 Pages 1239-1291
Published: 2024
Released on J-STAGE: September 15, 2024
Named entity recognition (NER), which detects named entities in text and classifies them into classes such as PERSON or LOCATION, is a fundamental technique in natural language processing. Recently, there has been growing demand for NER systems that classify entities into fine-grained classes. Generally, training data are required to construct an NER system. However, manual labeling is costly, particularly when fine-grained classes are involved. Previous studies proposed utilizing the link structure of Wikipedia to automatically create training data for NER. However, Wikipedia links alone are insufficient for constructing training data. Therefore, researchers have attempted to extend these links using language-dependent methods that do not apply to Japanese. In this study, we propose a method for extending links using deep learning and a method for estimating the entity rate of Wikipedia articles. The estimated value is used to impose constraints during training, thereby mitigating the effects of links that cannot be extended by the former method. Additionally, we construct a Japanese NER system for 200 categories of an extended named entity hierarchy. For evaluation, we create data by manually annotating web news articles. Experimental results show that the proposed method performs better than previous methods.
-
Ryo Sekizawa, Hitomi Yanaka
2024 Volume 31 Issue 3 Pages 1292-1329
Published: 2024
Released on J-STAGE: September 15, 2024
Using honorifics properly is essential for maintaining harmonious relationships when communicating in Japanese. Japanese honorifics have both grammatical aspects (e.g., verb conjugation) and contextual ones (e.g., social relationships among people). Therefore, a precise understanding of honorifics is challenging for systems because it requires knowledge of grammatical rules and the ability to understand contexts. While large language models are known to perform well on Japanese tasks, no existing dataset has aimed to evaluate these models’ ability to use Japanese honorifics flexibly according to contextual information. In this paper, we introduce two honorific understanding tasks that require contextual information: an acceptability judgment task regarding the usage of honorifics and an honorific conversion task. We first construct a new Japanese honorifics dataset using a template-based method to generate data in a controlled way. We also sample data from an existing Japanese honorifics corpus and annotate them with additional information to evaluate the models on more natural data. Using our datasets, we then conduct experiments to evaluate the performance of large language models, including GPT-4, on the two tasks from multiple perspectives. Our experimental results demonstrate that, in the honorific conversion task, the models still have room for improvement on sentences with complicated structures compared with simpler ones.