Journal of Natural Language Processing

Preface (not peer-reviewed)

Evaluation and Reproducibility of Experiments

鈴木潤

2025, Vol. 32, No. 1, pp. 1-2
Published: 2025
Released online: 2025/03/15

DOI: https://doi.org/10.5715/jnlp.32.1

Journal, free access


General Papers (peer-reviewed)

Resolving Question Ambiguity in Visual Question Answering Using Gaze Information

稲積駿, 河野誠也, 湯口彰重, 川西康友, 吉野幸一郎

2025, Vol. 32, No. 1, pp. 3-35
Published: 2025
Released online: 2025/03/15

DOI: https://doi.org/10.5715/jnlp.32.3

Journal, free access

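In conversations that refer to an image, such as visual question answering (VQA), the use of demonstratives makes questions ambiguous. Moreover, in some languages the arguments that carry the core information of a question are omitted, which complicates the problem further. When a question is ambiguous in this way, the speaker often presupposes information shared implicitly with the listener, such as gaze (fixation) or pointing. This study focuses on resolving question ambiguity by referring to such gaze information and proposes LookVQA, a gaze-annotated VQA dataset in which gaze targets are associated with the demonstratives and omissions in the questions. To improve question answering accuracy on this dataset, we further propose a question answering model that uses gaze target estimation from the speaker's gaze source. Experimental results show that the proposed model answers specific question types in LookVQA accurately and outperforms existing models that do not use gaze target estimation.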

Testing Hypotheses about Metaphor Based on Large-Scale Automatic Metaphor Identification Results

青野広太郎, 笹野遼平, 武田浩一

2025, Vol. 32, No. 1, pp. 36-54
Published: 2025
Released online: 2025/03/15

DOI: https://doi.org/10.5715/jnlp.32.36

Journal, free access

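Linguistics offers several hypotheses about which kinds of words tend to be used as metaphors. However, few studies have tested these hypotheses on the everyday and diverse textual expressions found in large-scale corpora. In this study, we apply an automatic metaphor classifier to sentences extracted from Common Crawl, a large-scale corpus collected from Web data, and thereby verify and analyze existing hypotheses about metaphor on the basis of a large-scale corpus. Specifically, we test five hypotheses in total: three concerning the concreteness, imageability, and familiarity of the objects of verb metaphors, and two concerning emotion and subjectivity in sentences that contain metaphors. Through this verification, we show that all five hypotheses hold: verbs whose objects have lower concreteness, imageability, and familiarity are more likely to be metaphorical, and metaphors are more likely to be used in sentences that express emotion or subjectivity.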

Universal Dependencies for the Corpus of Everyday Japanese Conversation: UD_Japanese-CEJC

大村舞, 若狭絢, 松田寛, 浅原正幸

2025, Vol. 32, No. 1, pp. 55-90
Published: 2025
Released online: 2025/03/15

DOI: https://doi.org/10.5715/jnlp.32.55

Journal, free access

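In this study, we developed and constructed UD_Japanese-CEJC, a treebank of spoken Japanese obtained by converting the Corpus of Everyday Japanese Conversation (CEJC) into the Universal Dependencies format, and we report on the resulting data. CEJC is a large-scale spoken-language corpus that records a wide variety of everyday Japanese conversations and includes word segmentation and part-of-speech annotations. For UD_Japanese-CEJC, we newly annotated CEJC with long-unit-word morphological information and bunsetsu-based dependency information. UD_Japanese-CEJC was constructed from Japanese morphological information, bunsetsu-based dependency structure information, and conversion rules manually developed from CEJC. Using the constructed UD_Japanese-CEJC, we compared it with written Japanese corpora, evaluated UD dependency parsing accuracy, and examined various issues in constructing UD for CEJC.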

Confidence Estimation for Language Model Generation Using Training Data

吉川和, 岡崎直観

2025, Vol. 32, No. 1, pp. 91-113
Published: 2025
Released online: 2025/03/15

DOI: https://doi.org/10.5715/jnlp.32.91

Journal, free access

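As the performance of large language models improves, detecting and countering errors in the content they generate has become an urgent issue. One way to detect errors in language model generation is to estimate the confidence of the output on the basis of information obtained at generation time. Existing confidence estimation methods use the model's outputs and internal states, whereas confidence estimation and its evaluation in a setting where the language model's training data are accessible have not been sufficiently explored. In this study, to examine the usefulness of training data for estimating the confidence of a trained language model's outputs, we trained a medium-sized language model, built a datastore consisting of the full text of the training data, and examined and evaluated several confidence estimation methods based on the training data. Experiments on a knowledge evaluation task for language models confirmed that combining the likelihood output by the model with information on whether related examples exist in the training data improves confidence estimation accuracy compared with not using the training data.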

Divide-and-Conquer Neural Machine Translation Using Within-Sentence Context

石川隆太, 加納保昌, 須藤克仁, 中村哲

2025, Vol. 32, No. 1, pp. 114-133
Published: 2025
Released online: 2025/03/15

DOI: https://doi.org/10.5715/jnlp.32.114

Journal, free access

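Neural machine translation (NMT) often produces high-quality translations thanks to flexible word choice and fluent output, but translation quality can degrade for long input sentences. To address this problem, divide-and-conquer approaches have been proposed that split a long sentence into short segments, translate them, and then reorder and concatenate the results; however, their gains with NMT have been limited. In this study, we therefore propose a new divide-and-conquer NMT method that improves the translation of long sentences by using within-sentence context. The proposed method (1) splits the sentence before and after the coordinating conjunctions that connect clauses identified by syntactic parsing, (2) translates each resulting clause with a clause-level translation model adapted so that it can use the within-sentence context, and (3) combines the translated clauses with a separate sequence-to-sequence model to obtain the translation of the whole sentence. In English-to-Japanese translation experiments on ASPEC using a pre-trained multilingual BART model, the proposed method achieved higher translation accuracy than the baseline multilingual BART NMT, particularly for long input sentences of 41 words or more.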

Discovering Unusual Word Usages with Masked Language Model via Pseudo-label Training

Tatsuya Aoki, Jey Han Lau, Hidetaka Kamigaito, Hiroya Takamura, Timoth ...

2025, Vol. 32, No. 1, pp. 134-175
Published: 2025
Released online: 2025/03/15

DOI: https://doi.org/10.5715/jnlp.32.134

Journal, free access

User-generated texts contain not only non-standard words such as b4 for before, but also unusual word usages such as catfish for a person who uses a fake identity online; handling such cases in natural language processing requires knowledge about the words. We present a neural model for detecting the non-standard usages in social media text. To deal with the lack of training data for this task, we propose a method for synthetically generating pseudo non-standard examples from a corpus, which enables us to train the model without manually-annotated training data and for any arbitrary language. Experimental results on Twitter and Reddit datasets show that our proposed method achieves better performance than existing methods, and is effective across different languages.

JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources

Benjamin Clavié

2025, Vol. 32, No. 1, pp. 176-218
Published: 2025
Released online: 2025/03/15

DOI: https://doi.org/10.5715/jnlp.32.176

Journal, free access

Neural Information Retrieval has advanced rapidly in high-resource languages, but progress in lower-resource ones such as Japanese has been hindered by data scarcity, among other challenges. Consequently, multilingual models have dominated Japanese retrieval, despite their computational inefficiencies and inability to capture linguistic nuances. While recent multi-vector monolingual models like JaColBERT have narrowed this gap, they still lag behind multilingual methods in large-scale evaluations. This work addresses the suboptimal training methods of multi-vector retrievers in lower-resource settings, focusing on Japanese. We systematically evaluate and improve key aspects of the inference and training settings of JaColBERT, and more broadly, multi-vector models. We further enhance performance through a novel checkpoint merging step, showcasing it to be an effective way of combining the benefits of fine-tuning with the generalization capabilities of the original checkpoint. Building on our analysis, we introduce a novel training recipe, resulting in the JaColBERTv2.5 model. JaColBERTv2.5, with only 110 million parameters and trained in under 15 hours on 4 A100 GPUs, significantly outperforms all existing methods across all common benchmarks, reaching an average score of 0.754, significantly above the previous best of 0.720. To support future research, we make our final models, intermediate checkpoints and all data used publicly available.

Data Augmentation for Low-Resource Languages in Multilingual Dependency Parsing

Jiannan Mao, Chenchen Ding, Hour Kaing, Hideki Tanaka, Masao Utiyama, ...

2025, Vol. 32, No. 1, pp. 219-251
Published: 2025
Released online: 2025/03/15

DOI: https://doi.org/10.5715/jnlp.32.219

Journal, free access

UDify (Kondratyuk and Straka 2019) is a multilingual, multi-task parser fine-tuned on mBERT that achieves remarkable performance on high-resource languages. However, on some low-resource languages, its performance saturates early and decreases gradually as training proceeds. To address this issue, this study applies a data augmentation method to improve parsing performance. We conducted experiments on five few-shot and three zero-shot languages to test the effectiveness of this approach. The unlabeled attachment scores were improved on the zero-shot language dependency parsing tasks, with the average score increasing from 55.6% to 59.0%. Meanwhile, dependency parsing tasks in high-resource languages and other Universal Dependencies tasks were almost unaffected. The experimental results demonstrate that the data augmentation method is effective for low-resource languages in multilingual dependency parsing. Furthermore, our experiments confirm that continuously increasing the quantity of synthetic data enhances UDify's performance. This improvement was particularly effective for zero-shot target languages.

DiLM: Distilling Dataset into Language Model for Text-level Dataset Distillation

Aru Maekawa, Satoshi Kosugi, Kotaro Funakoshi, Manabu Okumura

2025, Vol. 32, No. 1, pp. 252-282
Published: 2025
Released online: 2025/03/15

DOI: https://doi.org/10.5715/jnlp.32.252

Journal, free access

Dataset distillation aims to compress a training dataset by creating a few informative synthetic samples such that the neural networks trained on them perform as well as those trained on the original training dataset. Current text dataset distillation methods create each synthetic sample as a sequence of word embeddings instead of text data to apply gradient-based optimization; however, such embedding-level distilled datasets cannot be used for training other models whose word embedding weights are different from the model used for distillation. To address this issue, we propose a novel text dataset distillation approach, called distilling dataset into language model (DiLM), which trains a language model to generate informative synthetic training samples as text data, rather than directly optimizing synthetic samples. We evaluated DiLM on various text classification datasets and showed that the distilled synthetic datasets from DiLM outperformed those from the current coreset selection methods. DiLM achieved remarkable generalization performance in training different types of models and in the in-context learning of large language models. Our code is available at https://github.com/arumaekawa/DiLM.

Dataset Distillation with Attention Labels for Fine-tuning BERT

Aru Maekawa, Naoki Kobayashi, Kotaro Funakoshi, Manabu Okumura

2025, Vol. 32, No. 1, pp. 283-299
Published: 2025
Released online: 2025/03/15

DOI: https://doi.org/10.5715/jnlp.32.283

Journal, free access

Dataset distillation aims to create a small dataset of informative synthetic samples to rapidly train neural networks that retain the performance of the original dataset. In this study, we focus on constructing distilled few-shot datasets for natural language processing (NLP) tasks to fine-tune pre-trained transformers. Specifically, we propose introducing attention labels, which can efficiently distill knowledge from the original dataset and transfer it to transformer models via attention probabilities. We evaluated our dataset distillation methods in four NLP tasks and demonstrated that it is possible to create distilled few-shot datasets with attention labels, yielding an impressive performance for fine-tuning BERT. Specifically, in AGNews, which is a four-class news classification task, our distilled few-shot dataset achieved up to 93.2% accuracy, which is 98.5% of that of the original dataset, even with only one sample per class and only one gradient step.

Negation Scope Conversion for Unifying Negation-Annotated Datasets

Asahi Yoshida, Yoshihide Kato, Shigeki Matsubara

2025, Vol. 32, No. 1, pp. 300-329
Published: 2025
Released online: 2025/03/15

DOI: https://doi.org/10.5715/jnlp.32.300

Journal, free access

Negation scope resolution is a technique that identifies the part of a sentence affected by the negation cue. The three major corpora used for it, the BioScope corpus, the SFU review corpus, and the Sherlock dataset, have different annotation schemes for negation scope. Due to the different annotations, it is difficult to use the three corpora together in the study of negation scope resolution. To address this issue by merging the corpora into a unified dataset based on a common annotation scheme, we propose a method for automatically converting the scopes of BioScope and SFU to those of Sherlock. We conducted an experiment to evaluate the accuracy of our method using a dataset obtained by manually annotating the negation scopes to a tiny portion of BioScope and SFU, verifying that our method can convert the scopes with high accuracy. In addition, we conducted another experiment to verify the effectiveness of our method from a pragmatic perspective, where we fine-tuned PLM-based negation scope resolution models using the unified dataset obtained by our method. The results demonstrated that the performances of the models increase when fine-tuned on the unified dataset, unlike the simply combined one, which supports the effectiveness of our method.


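Society Articles (not peer-reviewed)

Report on the 30th Anniversary Commemorative Projects of the Association for Natural Language Processing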

井之上直也, 大内啓樹, 河原大輔, 黒橋禎夫, 小町守, 須藤克仁, 永田亮, 松原茂樹, 宮尾祐介

2025, Vol. 32, No. 1, pp. 330-341
Published: 2025
Released online: 2025/03/15

DOI: https://doi.org/10.5715/jnlp.32.330

Journal, free access

Unveiling Multi-level and Multi-modal Semantic Representations in the Human Brain using Large Language Models

中木裕子, 松山卓矢

2025, Vol. 32, No. 1, pp. 342-347
Published: 2025
Released online: 2025/03/15

DOI: https://doi.org/10.5715/jnlp.32.342

Journal, free access

Simultaneous Interpretation Corpus Construction by Large Language Models in Distant Language Pair

坂井優介

2025, Vol. 32, No. 1, pp. 348-354
Published: 2025
Released online: 2025/03/15

DOI: https://doi.org/10.5715/jnlp.32.348

Journal, free access

Filtered Direct Preference Optimization

森村哲郎

2025, Vol. 32, No. 1, pp. 355-359
Published: 2025
Released online: 2025/03/15

DOI: https://doi.org/10.5715/jnlp.32.355

Journal, free access

Can Language Models Induce Grammatical Knowledge from Indirect Evidence?

大羽未悠

2025, Vol. 32, No. 1, pp. 360-365
Published: 2025
Released online: 2025/03/15

DOI: https://doi.org/10.5715/jnlp.32.360

Journal, free access

Initialization of Large Language Models via Reparameterization to Mitigate Loss Spikes

西田光甫

2025, Vol. 32, No. 1, pp. 366-371
Published: 2025
Released online: 2025/03/15

DOI: https://doi.org/10.5715/jnlp.32.366

Journal, free access

An Attempt to Open New Horizons in Computational Psycholinguistics: Testing the Efficient Communication Hypothesis

梶川康平

2025, Vol. 32, No. 1, pp. 372-378
Published: 2025
Released online: 2025/03/15

DOI: https://doi.org/10.5715/jnlp.32.372

Journal, free access


Back Matter (not peer-reviewed)

Editor's Postscript, Manuscript Preparation Guide, Editorial Schedule, Statistics, and Society Information

2025, Vol. 32, No. 1, pp. 379-401
Published: 2025
Released online: 2025/03/15

DOI: https://doi.org/10.5715/jnlp.32.379

Journal, free access
