2024 Volume 31 Issue 2 Pages 637-679
Temporal inference, i.e., natural language inference involving time, is a challenging task because of the complex interactions among various time-related linguistic phenomena, such as tense and aspect. Although various temporal inference datasets have been provided to assess the temporal inference ability of language models, they focus primarily on English and cover only a few linguistic phenomena. Therefore, whether Japanese language models can generalize across diverse temporal inference patterns is yet to be understood. In this research, we constructed Jamp_sp, a controlled Japanese temporal inference dataset that takes aspect into account and includes a variety of temporal inference patterns. The training and test data in Jamp_sp can be controlled on the basis of problem attributes such as temporal inference patterns and time formats, thereby allowing a detailed analysis of the generalization capacity of language models. To this end, we trained language models on the training data both before and after applying these controlled splits, and evaluated them on our test data. The results demonstrate that Jamp_sp is challenging not only for discriminative language models but also for current generative language models such as GPT-4, and that there is room for improvement in the generalization capacity of these models.
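To make the idea of attribute-controlled splitting concrete, the sketch below shows one way a train/test split over problem attributes could be implemented in Python. The field names (`pattern`, `time_format`), the toy examples, and the helper `controlled_split` are illustrative assumptions and do not reflect the actual Jamp_sp schema or evaluation protocol.

```python
# Hypothetical sketch of attribute-controlled splitting.
# Field names and held-out values are illustrative assumptions,
# not the actual Jamp_sp schema or protocol.
from typing import Dict, List, Set, Tuple


def controlled_split(
    examples: List[Dict],
    attribute: str,
    held_out_values: Set[str],
) -> Tuple[List[Dict], List[Dict]]:
    """Send examples whose `attribute` value is held out to the test set;
    all remaining examples form the training set."""
    train = [ex for ex in examples if ex[attribute] not in held_out_values]
    test = [ex for ex in examples if ex[attribute] in held_out_values]
    return train, test


# Toy usage: hold out one temporal inference pattern at test time.
examples = [
    {"premise": "...", "hypothesis": "...", "label": "entailment",
     "pattern": "before-after", "time_format": "date"},
    {"premise": "...", "hypothesis": "...", "label": "neutral",
     "pattern": "during", "time_format": "clock-time"},
]
train, test = controlled_split(examples, attribute="pattern",
                               held_out_values={"during"})
```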