Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Current issue
Preface (Non Peer-Reviewed)
General Paper (Peer-Reviewed)
  • Yusuke Sakai, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe
    2024 Volume 31 Issue 4 Pages 1427-1457
    Published: 2024
    Released on J-STAGE: December 15, 2024
    JOURNAL FREE ACCESS

    Pre-trained language models (PLMs) can answer known problems using the knowledge and natural language understanding capabilities acquired during pre-training, whereas unknown problems require pure inference capabilities. To evaluate pure inference capabilities, memorization capability must be considered separately, which is difficult with existing datasets because the information they contain is already known to PLMs. This study targets knowledge graph completion (KGC), the task of predicting unknown relations (links) from known ones in a knowledge graph. Traditional embedding-based KGC methods predict missing links through pure inference, while recent PLM-based KGC methods also utilize knowledge obtained during pre-training; KGC is therefore well suited to evaluating the respective effects of memorization and inference capabilities. We propose a method for constructing datasets that measure the performance of memorized knowledge and of inference capability in KGC. We discuss whether PLMs make inferences based on memorized knowledge about entities, and our conclusion suggests that PLMs also learn inference capabilities for unknown problems.
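    The contrast between embedding-based and PLM-based KGC can be made concrete with a toy scoring function. Below is a minimal sketch of TransE-style link prediction, which ranks candidate tails purely from learned vectors, with no access to textual knowledge; the embeddings and triples are invented for illustration and are not from the paper.

```python
import numpy as np

# Toy entity and relation embeddings (2-dimensional for illustration).
entity_emb = {
    "Tokyo":  np.array([1.0, 0.0]),
    "Japan":  np.array([1.0, 1.0]),
    "Paris":  np.array([0.0, 0.0]),
    "France": np.array([0.0, 1.0]),
}
relation_emb = {"capital_of": np.array([0.0, 1.0])}

def transe_score(head, relation, tail):
    """TransE plausibility: higher (closer to zero) means more plausible."""
    return -np.linalg.norm(entity_emb[head] + relation_emb[relation] - entity_emb[tail])

# Rank candidate tails for the query (Tokyo, capital_of, ?).
candidates = ["Japan", "France"]
best = max(candidates, key=lambda t: transe_score("Tokyo", "capital_of", t))
```

A PLM-based KGC model would instead verbalize the triple (e.g. "Tokyo is the capital of [MASK]") and may answer from memorized pre-training knowledge, which is exactly the confound the proposed datasets are designed to isolate.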

    Download PDF (1310K)
  • Yuya Ogasa, Tomoyuki Kajiwara, Yuki Arase
    2024 Volume 31 Issue 4 Pages 1458-1486
    Published: 2024
    Released on J-STAGE: December 15, 2024
    JOURNAL FREE ACCESS

    Paraphrases that exhibit significant surface differences are valuable for data augmentation, yet their generation is known to be challenging. In this study, we develop a model capable of generating such desired paraphrases employing a straightforward mechanism to manage similarity: tags denoting semantic and lexical similarities are affixed to the beginning of input sentences. We compile a training corpus by selecting paraphrase pairs with distinct surface characteristics from a variety of pseudo-paraphrases generated via round-trip translation. Experimental results demonstrate the efficacy of our approach through data augmentation in contrastive learning and pre-fine-tuning of pretrained language models. Additionally, our findings indicate that (1) achieving an appropriate level of paraphrase similarity largely depends on the downstream task and (2) a mixture of paraphrases exhibiting varying degrees of similarity adversely affects downstream task performance.
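    The tagging mechanism described above can be sketched as follows; the tag names and similarity bins here are hypothetical stand-ins, not the paper's actual scheme.

```python
def tag_input(sentence, semantic_sim, lexical_sim):
    """Prefix coarse similarity-control tags to a source sentence
    (hypothetical binning scheme for illustration)."""
    def bin_tag(name, value):
        # Bin a similarity in [0, 1] into low / mid / high.
        level = "high" if value >= 0.8 else "mid" if value >= 0.5 else "low"
        return f"<{name}:{level}>"
    return f"{bin_tag('sem', semantic_sim)} {bin_tag('lex', lexical_sim)} {sentence}"

# Request a paraphrase that keeps the meaning but changes the surface:
# high semantic similarity, low lexical similarity.
src = tag_input("The cat sat on the mat.", semantic_sim=0.9, lexical_sim=0.3)
```

At training time the tags are computed from each pseudo-paraphrase pair; at generation time the user sets them to steer how different the output surface should be.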

    Download PDF (2389K)
  • Zhidong Ling, Taichi Aida, Teruaki Oka, Mamoru Komachi
    2024 Volume 31 Issue 4 Pages 1487-1522
    Published: 2024
    Released on J-STAGE: December 15, 2024
    JOURNAL FREE ACCESS

    Research in natural language processing has seen growing interest in automatically detecting and analyzing words whose meanings evolve over time in corpora. While diachronic corpora and evaluation word lists have been established for languages such as English and German, such resources are lacking for Japanese. This study addresses the gap by introducing the Japanese Lexical Semantic Change Detection Dataset (JaSemChange), which provides an evaluation word list for Japanese. Leveraging three diachronic corpora spanning near-modern to contemporary Japanese, we sampled usages of target words as pairs. A team of four experts annotated a total of 2,280 usage pairs of target words with semantic similarity to gauge the degree of semantic change. Furthermore, we assessed the performance of word embedding-based methods in detecting semantic change on this dataset: in addition to a frequency-based baseline, we compared the effectiveness of typical type-based and token-based methods and explored their respective characteristics. The dataset, comprising the list of words with their assigned degrees of semantic change and the annotation scores for the usage pairs, is publicly available on GitHub.
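    As a rough illustration of the type-based detection methods evaluated on such a dataset, a word's degree of semantic change can be scored as the cosine distance between its embeddings trained on two corpus periods, assuming the two spaces have already been aligned (e.g. by orthogonal Procrustes). The vectors below are toy values, not from the dataset.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def change_degree(vec_old, vec_new):
    """Type-based semantic change score: cosine distance between a
    word's period-specific embeddings (spaces assumed pre-aligned)."""
    return 1.0 - cosine(vec_old, vec_new)

# A word whose vector is unchanged scores 0; an orthogonal shift scores 1.
stable  = change_degree(np.array([1.0, 0.0]), np.array([1.0, 0.0]))
shifted = change_degree(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

System scores like these can then be correlated with the expert similarity annotations to evaluate how well each method ranks words by degree of change.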

    Download PDF (1125K)
  • An Wang, Junfeng Jiang, Youmi Ma, Ao Liu, Naoaki Okazaki
    2024 Volume 31 Issue 4 Pages 1523-1544
    Published: 2024
    Released on J-STAGE: December 15, 2024
    JOURNAL FREE ACCESS

    Aspect sentiment quad prediction (ASQP) analyzes the aspect terms, opinion terms, sentiment polarity, and aspect categories in a text. One challenge in this task is the scarcity of data owing to the high annotation cost. Data augmentation techniques are commonly used to address this issue. However, existing approaches simply rewrite texts in the training data, restricting the semantic diversity of the generated data and impairing its quality due to inconsistencies between texts and quads. To address these limitations, we augment quads and train a quads-to-text model to generate the corresponding texts. Furthermore, we design novel strategies to filter out low-quality data and balance the sample-difficulty distribution of the augmented dataset. Empirical studies on two ASQP datasets demonstrate that our method outperforms other data augmentation methods and achieves state-of-the-art performance on the benchmarks.
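    The augment-then-generate idea can be sketched in miniature. The recombination rule and the template below are hypothetical stand-ins for the paper's augmentation strategies and its trained quads-to-text model.

```python
import random

# An ASQP quad: (aspect term, aspect category, sentiment polarity, opinion term).
seed_quads = [
    ("pizza", "food quality", "positive", "delicious"),
    ("waiter", "service general", "negative", "rude"),
]

def augment_quads(quads, rng):
    """Hypothetical augmentation: recombine aspect and opinion slots
    across quads that share the same sentiment polarity."""
    by_polarity = {}
    for quad in quads:
        by_polarity.setdefault(quad[2], []).append(quad)
    augmented = []
    for polarity, group in by_polarity.items():
        for aspect, category, _, _ in group:
            _, _, _, opinion = rng.choice(group)  # sample a compatible opinion
            augmented.append((aspect, category, polarity, opinion))
    return augmented

def quads_to_text(quad):
    """Stand-in for the trained quads-to-text generator: a fixed template."""
    aspect, _, _, opinion = quad
    return f"The {aspect} was {opinion}."

texts = [quads_to_text(q) for q in augment_quads(seed_quads, random.Random(0))]
```

Generating the text from the quad, rather than rewriting the text and hoping the quad still fits, is what keeps text and labels consistent by construction.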

    Download PDF (4646K)
  • Masashi Oshika, Kosuke Yamada, Ryohei Sasano, Koichi Takeda
    2024 Volume 31 Issue 4 Pages 1545-1562
    Published: 2024
    Released on J-STAGE: December 15, 2024
    JOURNAL FREE ACCESS

    When a sports match is broadcast, X users often enjoy sharing comments about it, and one can roughly follow a match’s progress by reading these posts. However, because of the diverse nature of the posts, it can be challenging to grasp the progress quickly. In this study, we focus on soccer matches and build a system that generates live updates from posts so that users can instantly grasp a match’s progress. Our system is based on the language model T5 and outputs updates at certain times, taking as input the posts related to a specific match. However, simply applying the model to this task caused two problems: an excessive number of generated updates and redundant updates. We therefore propose a mechanism that incorporates a classifier to control the number of generated updates and a mechanism that takes previous updates into account to mitigate redundancy.

    Download PDF (704K)
  • Shotaro Ishihara, Hiromu Takahashi, Hono Shirai
    2024 Volume 31 Issue 4 Pages 1563-1597
    Published: 2024
    Released on J-STAGE: December 15, 2024
    JOURNAL FREE ACCESS

    Auditing time-series performance degradation has become a challenge as researchers and practitioners commonly use pre-trained models. Pre-trained language models typically incur huge training and inference costs; therefore, efficient auditing and retraining schemes are important. This study proposes a framework for auditing the time-series performance degradation of pre-trained language models and word embeddings by calculating the semantic shift of words in the training corpus, thereby supporting decision-making about re-training. First, we constructed RoBERTa and word2vec models on training corpora from different periods, using Japanese and English news articles from 2011 to 2021, and observed time-series performance degradation. Semantic Shift Stability, a metric calculated from the diachronic semantic shift of words in the training corpus, was smaller when the performance of the pre-trained models degraded significantly over time, confirming that the metric is useful for monitoring applications. The proposed framework has the further advantage of suggesting the cause of degradation by surfacing words whose meanings changed significantly; our experiments implied effects of the 2016 U.S. presidential election and the 2020 COVID-19 pandemic. The source code is available at https://github.com/Nikkei/semantic-shift-stability.
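    A simplified reading of such a stability metric (not the authors' exact formulation): average the cosine similarity of shared words across two period-specific embedding spaces, assumed pre-aligned, so that lower values flag a larger corpus-wide semantic shift. The vocabulary and vectors below are invented.

```python
import numpy as np

def semantic_shift_stability(emb_a, emb_b, vocab):
    """Mean cosine similarity of shared words across two period-specific
    embedding spaces (assumed pre-aligned). Lower values indicate a larger
    corpus-wide semantic shift and hence a stronger case for retraining.
    Simplified sketch, not the paper's exact definition."""
    sims = []
    for word in vocab:
        u, v = emb_a[word], emb_b[word]
        sims.append(float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))))
    return float(np.mean(sims))

# "virus" drifts between the two periods; "bank" stays put.
emb_2015 = {"virus": np.array([1.0, 0.0]), "bank": np.array([0.0, 1.0])}
emb_2020 = {"virus": np.array([0.6, 0.8]), "bank": np.array([0.0, 1.0])}
stability = semantic_shift_stability(emb_2015, emb_2020, ["virus", "bank"])
```

Because the metric needs only word vectors from the training corpora, it can be monitored far more cheaply than re-evaluating the full pre-trained model.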

    Download PDF (1662K)
  • Shohei Hisada, Shoko Wakamiya, Eiji Aramaki
    2024 Volume 31 Issue 4 Pages 1598-1634
    Published: 2024
    Released on J-STAGE: December 15, 2024
    JOURNAL FREE ACCESS

    The growing social concern over offensive language on digital platforms has spurred research into datasets and automated detection to better understand its nature and develop countermeasures. Existing datasets often simplify the task and rely on subjective judgments by non-expert annotators recruited through crowdsourcing. This leads to a disconnect from real-world issues and a lack of consideration for social and cultural contexts, indicating the need for approaches that adapt to individual societal contexts while utilising social science expertise. This study proposes a Japanese dataset for offensive language detection based on Japanese court cases. Our dataset provides labels for offensive language, for legal rights such as the right to reputation and sense of honour, and for judicial decisions. Furthermore, by evaluating automated detection methods on the dataset, we identify gaps between current methods and practical issues and discuss areas for improvement. This research aims to build a dataset that reflects real societal issues, promoting fairer content moderation practices and fostering discussions on integrating expertise from other domains.

    Download PDF (984K)
  • Yuiko Tsunomori, Ryuichiro Higashinaka
    2024 Volume 31 Issue 4 Pages 1635-1664
    Published: 2024
    Released on J-STAGE: December 15, 2024
    JOURNAL FREE ACCESS

    To build a chat-oriented dialogue system that users will continue using over the long term, it is important to establish a good relationship between the user and the system. In this paper, aiming to realize a personalizable chat-oriented dialogue system that establishes such a relationship by naturally using arbitrary user information in dialogues, we constructed a novel corpus designed to incorporate arbitrary user information into system utterances, regardless of the current dialogue topic, while remaining appropriate to the context. We then trained a model on the constructed corpus to generate appropriate system utterances. A subjective evaluation indicated that the model could generate system utterances that incorporate arbitrary user information and the dialogue context. Furthermore, we integrated the trained model into a dialogue system and validated, through interactive dialogues with users, the effectiveness of system utterances that incorporate arbitrary user information and dialogue context.

    Download PDF (737K)
  • Soichi Kageyama, Takashi Inui
    2024 Volume 31 Issue 4 Pages 1665-1690
    Published: 2024
    Released on J-STAGE: December 15, 2024
    JOURNAL FREE ACCESS

    This study proposes a method to measure geographic specificity for mentions containing geographic location attributes, such as place names and landmarks that appear in documents, and evaluates its effectiveness. We first introduce the two key components of geographic specificity: geographic ambiguity and name exclusivity. We then propose an approach for calculating index values derived from Wikipedia data, utilizing techniques inspired by existing entity-linking methods. Subsequently, we conducted document geolocation experiments, integrating geographic specificity information into current document geolocation methods. Our findings indicate that both components of geographic specificity significantly enhance the accuracy of document geolocation.
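    The two components can be given toy operational definitions. The mention dictionary and the Wikipedia link counts below are invented, and the formulas are illustrative proxies rather than the paper's actual index values.

```python
# Hypothetical mention dictionary: mention -> locations it can refer to.
mention_candidates = {
    "Springfield":  {"Springfield, IL", "Springfield, MA", "Springfield, MO"},
    "Eiffel Tower": {"Paris, France"},
}

def geographic_ambiguity(mention):
    """Proxy for ambiguity: the more candidate locations, the more
    ambiguous the mention is."""
    return len(mention_candidates.get(mention, set()))

def name_exclusivity(mention, location, link_counts):
    """Proxy for exclusivity: the fraction of the mention's usages that
    refer to this location, estimated from hypothetical link counts."""
    total = sum(link_counts[mention].values())
    return link_counts[mention].get(location, 0) / total

link_counts = {
    "Springfield": {"Springfield, IL": 60, "Springfield, MA": 30, "Springfield, MO": 10},
}
```

A geolocation system can then weight mentions by these scores, trusting an unambiguous, exclusive mention like "Eiffel Tower" more than a scattered one like "Springfield".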

    Download PDF (455K)
  • Runa Yoshida, Takuya Matsuzaki
    2024 Volume 31 Issue 4 Pages 1691-1716
    Published: 2024
    Released on J-STAGE: December 15, 2024
    JOURNAL FREE ACCESS

    This study clarifies how the domain adaptation of bidirectional encoder representations from transformers (BERT) contributes to the syntactic analysis of mathematical texts, and what its limitations are. Experimental results show that domain adaptation of BERT is highly effective even with a relatively small amount of raw in-domain data, improving the accuracy of syntactic dependency analysis by up to four points without any annotated in-domain data. By analyzing the improvements, we found that numerous errors involving mathematical expressions were corrected. Errors related to structures that are infrequent in the out-of-domain fine-tuning data were difficult to improve through domain adaptation of BERT alone. This study also revealed that the effectiveness of BERT depends on how mathematical expressions are represented in the input: among several representations, the highest dependency accuracy was achieved with a simple method in which an entire mathematical expression is replaced with a dedicated special token.
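    The winning preprocessing step admits a short sketch: replace each mathematical expression with a single special token before parsing. Here inline math is approximated by $...$ spans, and the token name is a hypothetical placeholder, not necessarily the one used in the paper.

```python
import re

MATH_TOKEN = "[MATH]"  # hypothetical dedicated special token

def mask_math(sentence):
    """Replace every inline math expression (approximated here as a
    $...$ span) with one dedicated special token before parsing."""
    return re.sub(r"\$[^$]+\$", MATH_TOKEN, sentence)

masked = mask_math("Let $f(x) = x^2$ be a function on $[0, 1]$.")
```

Collapsing each expression to one token keeps the parser from attaching dependencies to the internal symbols of a formula, which the surrounding sentence treats as a single noun-like unit.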

    Download PDF (788K)
System Paper (Peer-Reviewed)
  • Shotaro Ishihara, Eiki Murata, Yasufumi Nakama, Hiromu Takahashi
    2024 Volume 31 Issue 4 Pages 1717-1745
    Published: 2024
    Released on J-STAGE: December 15, 2024
    JOURNAL FREE ACCESS

    This study presents an editing support system based on domain-specific pre-trained models to support the summarization of Japanese news articles. Specifically, we organized the real-world system requirements and present an editing support system built by combining existing technologies, along with the evaluation points to be investigated. First, we pre-trained and fine-tuned T5 models on Japanese financial news corpora to reproduce a specific writing style and observed that they outperformed general models on headline and three-line summary generation, despite the smaller training corpus. Second, we quantitatively and qualitatively analyzed the hallucinations produced by the domain-specific T5 models to reveal their characteristics. Finally, we discussed the usefulness of the overall system, including domain-specific BERT models for predicting click-through rates.

    Download PDF (798K)
Society Column (Non Peer-Reviewed)
Information (Non Peer-Reviewed)