Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Current issue
Preface (Non Peer-Reviewed)
General Paper (Peer-Reviewed)
  • Yuka Ko, Ryo Fukuda, Yuta Nishikawa, Yasumasa Kano, Katsuhito Sudoh, S ...
    2025 Volume 32 Issue 2 Pages 404-437
    Published: 2025
    Released on J-STAGE: June 15, 2025
    JOURNAL FREE ACCESS

    Simultaneous speech translation (SimulST) translates speech incrementally, requiring a monotonic input-output correspondence to reduce latency. This is particularly challenging for distant language pairs, such as English and Japanese, as most SimulST models are trained using offline speech translation (ST) data, where the entire speech input is observed during translation. In simultaneous interpretation (SI), a simultaneous interpreter translates source language speech into target language speech without waiting for the speaker to finish speaking. Therefore, the SimulST model can learn SI-style translations using SI data. However, owing to the limited availability of SI data, fine-tuning an offline ST model using SI data may result in overfitting. To address this problem, we propose an efficient training method for the speech-to-text SimulST model using a combination of small SI and relatively large offline ST data. We trained a single model with mixed data by incorporating style tags to instruct the model to generate either SI or offline-style outputs. This approach, called mixed fine-tuning with style tags, can be extended further using the multistage self-training approach. In this case, we use the trained model to generate pseudo-SI data. Our experimental results for several test sets demonstrated that our models trained using mixed fine-tuning and multistage self-training outperformed baselines across various latency ranges.
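    As a minimal sketch of the mixed fine-tuning idea (in Python, with hypothetical tag strings and data layout, not the paper's actual implementation), a single training set can combine both corpora by prepending a style tag to each source side:

        # Sketch: mix small SI data with larger offline ST data under style tags.
        # The tags "<si>" / "<off>" and the (source, target) layout are assumptions.
        def build_mixed_dataset(si_pairs, offline_pairs):
            mixed = []
            for src, tgt in si_pairs:
                mixed.append((f"<si> {src}", tgt))   # SI-style target
            for src, tgt in offline_pairs:
                mixed.append((f"<off> {src}", tgt))  # offline-style target
            return mixed

    At inference time, prepending the SI tag requests SI-style output; in the multistage variant, such outputs serve as pseudo-SI data for the next round of fine-tuning.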

    Download PDF (1058K)
  • Kosuke Doi, Katsuhito Sudoh, Satoshi Nakamura, Taro Watanabe
    2025 Volume 32 Issue 2 Pages 438-479
    Published: 2025
    Released on J-STAGE: June 15, 2025
    JOURNAL FREE ACCESS

    In foreign language learning, writing tasks play a crucial role in developing and assessing learners’ language abilities, but manual scoring requires significant time and effort. Automated essay scoring (AES) is a way to mitigate this problem. Although human raters consider grammatical items and their difficulties as clues for judging learners’ proficiency levels while scoring essays, it is unclear whether current state-of-the-art AES models, which use BERT-based essay representations, consider these factors. In this paper, we propose to incorporate grammatical features into BERT-based AES models in three ways: (1) using grammatical features as additional model inputs, (2) performing multi-task learning (MTL) with holistic and grammar scores while using grammatical features as model inputs, and (3) reconstructing grammatical features through MTL with holistic scores. For grammatical features, we model learners’ grammar usage using item response theory (IRT), which measures learners’ grammar abilities and the characteristics of grammatical items, including their difficulties, from essay data without teacher labels. The experimental results show that grammatical features improve scoring performance, and that MTL with holistic and grammar scores brings further improvements. We also show that weighting grammatical items by IRT-estimated difficulties improves scoring performance, and that IRT-estimated grammar abilities can be used as labels for MTL.
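    As a rough illustration of the IRT component (a one-parameter Rasch-model sketch in Python; the paper's exact IRT formulation may differ), the probability that a learner uses a grammatical item correctly is a logistic function of ability minus difficulty:

        import math

        def p_correct(ability, difficulty):
            # Rasch model: P(correct) = sigmoid(ability - difficulty)
            return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

        print(p_correct(ability=0.8, difficulty=-0.5))  # easy item, able learner

    Difficulties estimated this way can weight the grammatical features fed to the BERT-based scorer, and the estimated abilities can serve as labels for multi-task learning.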

    Download PDF (760K)
  • Masanari Ohi, Masahiro Kaneko, Ryuto Koike, Mengsay Loem, Naoaki Okaza ...
    2025 Volume 32 Issue 2 Pages 480-496
    Published: 2025
    Released on J-STAGE: June 15, 2025
    JOURNAL FREE ACCESS

    Large language models (LLMs) are widely used as automated metrics to evaluate natural language generation tasks. However, the likelihood, which measures how plausible an LLM finds a sentence, can vary with superficial differences between sentences, such as word order and sentence structure. LLM-based evaluation may therefore exhibit a likelihood bias: overrating sentences with higher likelihoods while underrating those with lower likelihoods. In this paper, we investigate the presence and impact of likelihood bias in LLM-based evaluators and propose a method to mitigate it. Our method utilizes highly biased instances as few-shot examples for in-context learning. Our experiments on evaluating the data-to-text and grammatical error correction tasks reveal that several LLMs we test display a likelihood bias. Furthermore, our proposed method successfully mitigates this bias and also significantly improves evaluation performance (in terms of correlation with human scores).
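    To make the notion of likelihood concrete, here is a sketch of scoring a sentence with a causal LM (using the Hugging Face API; the model choice is illustrative, not the one evaluated in the paper):

        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        tok = AutoTokenizer.from_pretrained("gpt2")
        lm = AutoModelForCausalLM.from_pretrained("gpt2")

        def sentence_log_likelihood(text):
            ids = tok(text, return_tensors="pt").input_ids
            with torch.no_grad():
                loss = lm(ids, labels=ids).loss  # mean token-level NLL
            return -loss.item() * (ids.shape[1] - 1)  # total log-likelihood

    Instances whose evaluator scores track this quantity rather than true quality are "highly biased"; the proposed method reuses such instances as few-shot examples to counteract the bias.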

    Download PDF (508K)
  • Takuya Uematsu, Hao Wang, So Fukuda, Daisuke Kawahara, Tomohide Shibat ...
    2025 Volume 32 Issue 2 Pages 497-519
    Published: 2025
    Released on J-STAGE: June 15, 2025
    JOURNAL FREE ACCESS

    To develop high-performance and robust natural language processing (NLP) models, it is important to have various question answering (QA) datasets to train, evaluate, and analyze them. Although there are various QA datasets available in English, there are only a few QA datasets in other languages. We focus on Japanese, a language with only a few basic QA datasets, and aim to build a Japanese version of Natural Questions (NQ), JNQ, consisting of questions that naturally arise from human information needs. We collect natural questions from query logs of a Japanese search engine and build the dataset using crowdsourcing. Furthermore, we construct a Japanese version of BoolQ, JBoolQ, which is derived from NQ and consists of yes/no questions. We also re-define the dataset specification of the original NQ/BoolQ to construct JNQ/JBoolQ. JNQ consists of 16,641 questions, and JBoolQ consists of 6,467 questions. We also define three tasks from JNQ and one from JBoolQ and establish baselines using competitive methods drawn from related literature. We hope that these datasets will facilitate research on QA and NLP models in Japanese. We will make JNQ and JBoolQ publicly available.

    Download PDF (776K)
  • Terufumi Morishita, Gaku Morio, Atsuki Yamaguchi, Yasuhiro Sogawa
    2025 Volume 32 Issue 2 Pages 520-571
    Published: 2025
    Released on J-STAGE: June 15, 2025
    JOURNAL FREE ACCESS

    Large language models (LLMs) are capable of solving a wide range of tasks, yet they have struggled with reasoning. To address this, we propose Additional Logic Training (ALT), which aims to enhance LLMs’ reasoning capabilities through training on program-generated logical reasoning samples. We first establish principles for designing high-quality samples by integrating symbolic logic theory and previous empirical insights. Then, based on these principles, we construct a synthetic corpus named Formal Logic Deduction Diverse (FLD×2). Finally, we empirically show that ALT on FLD×2 substantially enhances the reasoning capabilities of state-of-the-art LLMs, including LLaMA-3.1-70B. Improvements include gains of up to 30 points on logical reasoning benchmarks, up to 10 points on math and coding benchmarks, and 5 points on the benchmark suite BBH.
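    As an illustration of what program-generated logical reasoning samples can look like (a toy modus ponens generator in Python; the actual FLD×2 generator is far more varied), one can instantiate deduction templates with surface fillers:

        import random

        RULES = [("it rains", "the ground gets wet"),
                 ("the alarm rings", "everyone wakes up")]

        def make_sample():
            p, q = random.choice(RULES)
            premises = f"If {p}, then {q}. Also, {p}."
            hypothesis = f"Therefore, {q}."
            # One deduction step (modus ponens) with a gold proof label.
            return {"premises": premises, "hypothesis": hypothesis,
                    "proof": "modus ponens", "label": "PROVED"}

        print(make_sample())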

    Download PDF (1977K)
  • Zihan Wang, Naoki Yoshinaga
    2025 Volume 32 Issue 2 Pages 572-597
    Published: 2025
    Released on J-STAGE: June 15, 2025
    JOURNAL FREE ACCESS

    Esports, organized competitive video gaming, has become one of the most prominent forms of sporting events. Despite the large accumulation of esports play logs, only a small portion is accompanied by text commentaries that help the audience retrieve and understand the plays. In this study, we introduce the task of generating commentaries from the data records of esports games. We begin by building large-scale esports data-to-text datasets that pair structured data records with textual commentaries from a popular esports game, League of Legends. We then explore several generation models to produce game commentaries from structured data records, while also examining the impact of pre-trained language models. To assess the generated commentaries, we design evaluation metrics that focus on unique characteristics of esports data, such as strategic depth. The experimental results of data-to-text generation using our dataset reveal the remaining challenges of this novel task.
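    As a hypothetical example of one instance in such a dataset (the field names and values below are invented for illustration, not the dataset's actual schema), a structured record is paired with the commentary it should yield:

        record = {
            "time": "23:41",
            "event": "kill",
            "actor": "mid_laner_blue",
            "target": "adc_red",
            "dragons": {"blue": 2, "red": 1},
        }
        commentary = ("Blue's mid laner picks off the enemy ADC, and with a "
                      "two-dragon lead the map is tilting blue.")
        # A generation model maps the linearized record to the commentary and
        # is judged with metrics sensitive to, e.g., strategic depth.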

    Download PDF (1214K)
  • Lee Sangmyeong, Seitaro Shinagawa, Koichiro Yoshino, Satoshi Nakamura
    2025 Volume 32 Issue 2 Pages 598-632
    Published: 2025
    Released on J-STAGE: June 15, 2025
    JOURNAL FREE ACCESS

    In the context of Real World Understanding (RWU) for vision and language (V&L) models, accurately aligning language with the corresponding visual scene is critical. Since current models typically assume language inputs to be plain text, RWU faces potential issues with structural ambiguity, where a single sentence can have multiple meanings due to different phrase structures. This paper proposes using linguistic formalism as input, which enriches the language information and addresses the issue of structural ambiguity. We focus on the Contrastive Language-Image Pre-training (CLIP) model, a prominent V&L model, and on the image discrimination tasks of RWU. Our experiments test various approaches to incorporating formalism into the CLIP model, depending on the type of formalism and how it is processed. We aim to determine the effectiveness of formalism in discriminating ambiguous images and to identify which formalism works best. Additionally, we employ a gradient-based method to gain insights into how formalism is interpreted within the model’s architecture.
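    A minimal sketch of the idea, assuming a Hugging Face CLIP checkpoint and a bracketed constituency string as one possible serialization of a parse (the paper may encode formalism differently):

        from PIL import Image
        from transformers import CLIPModel, CLIPProcessor

        model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

        # Two parses of "the man saw the girl with the telescope":
        texts = [
            "(S (NP the man) (VP saw (NP the girl (PP with the telescope))))",
            "(S (NP the man) (VP saw (NP the girl) (PP with the telescope)))",
        ]
        # Placeholders; real ambiguous scene images would go here.
        images = [Image.new("RGB", (224, 224)) for _ in texts]

        inputs = proc(text=texts, images=images, return_tensors="pt", padding=True)
        sims = model(**inputs).logits_per_image  # image-text similarity matrix

    The question is whether parse-disambiguated inputs let the model match each image to the intended reading.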

    Download PDF (919K)
  • Yui Oka, Daiki Yanamoto, Tsutomu Hirao, Kyosuke Nishida
    2025 Volume 32 Issue 2 Pages 633-659
    Published: 2025
    Released on J-STAGE: June 15, 2025
    JOURNAL FREE ACCESS

    Implicit Discourse Relation Recognition (IDRR) involves identifying the sense label of an implicit connective between adjacent text spans and has traditionally been approached as a classification task. However, sense labels alone cannot exhaustively represent all discourse relations. This paper presents Implicit Sense-labeled Connective Recognition (ISCR), which identifies implicit connectives as well as their sense labels between adjacent text spans. ISCR can be treated as a classification task, but this is difficult in practice owing to the large number of potential categories, the use of sense labels, and the uneven distribution of instances among them. Accordingly, this paper instead handles ISCR as a text-generation task, using an encoder-decoder model to generate both connectives and their sense labels. Our evaluation results show that our generation-based method outperforms the conventional classification-based method.
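    As a sketch of the generation formulation (a T5-style encoder-decoder via Hugging Face; the input/output templates here are hypothetical, not the paper's exact format):

        from transformers import T5ForConditionalGeneration, T5Tokenizer

        tok = T5Tokenizer.from_pretrained("t5-small")
        model = T5ForConditionalGeneration.from_pretrained("t5-small")

        arg1 = "It was raining heavily."
        arg2 = "The match was cancelled."
        source = f"arg1: {arg1} arg2: {arg2}"
        # A training target would pair the implicit connective with its sense,
        # e.g. "as a result | Contingency.Cause.Result", so one decoder pass
        # yields both predictions. (An off-the-shelf t5-small is untrained for
        # this task; fine-tuning on ISCR pairs would precede real use.)
        ids = tok(source, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=16)
        print(tok.decode(out[0], skip_special_tokens=True))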

    Download PDF (464K)
  • Mai Omura, Yoshiko Kawabata, Hikari Konishi, Masayuki Asahara, Johane ...
    2025 Volume 32 Issue 2 Pages 660-678
    Published: 2025
    Released on J-STAGE: June 15, 2025
    JOURNAL FREE ACCESS

    In this study, we constructed a database of expressions referring to location and route information through crowdsourcing, and made it publicly available as open data. Twenty maps were used as stimuli, with 40 participants per map asked to describe the location of a target point, resulting in 800 referring expressions. For route information, two routes were defined on each map, and 40 participants per route were asked to describe the route between two points, yielding 1,600 referring expressions. Each expression was evaluated to determine whether it constituted a relative reference based on landmarks on the map. Location-referring expressions were categorized into four types: first-person perspective, within-space perspective, within-space movement, and bird’s-eye view. Route-referring expressions were labeled according to the presence of information about the starting point, waypoints, and the endpoint. Additionally, a survey was conducted to assess the comprehensibility of each expression, and the resulting ratings were also collected.

    Download PDF (764K)
Society Column (Non Peer-Reviewed)
Supporting Member Column (Non Peer-Reviewed)
Information (Non Peer-Reviewed)