2021, Volume 28, Issue 2, pp. 380-403
Most machine translation (MT) research has focused on sentences as translation units (sentence-level MT) and has achieved acceptable translation quality for sentences that do not require cross-sentential context, mainly in high-resource languages. Recently, many researchers have worked on MT models that can take cross-sentential context into account; these models are often called context-aware or document-level MT models. Document-level MT is difficult to 1) train, because only a small amount of document-level data is available, and 2) evaluate, because the main methods and datasets focus on sentence-level evaluation. To address the first issue, we present a Japanese–English conversation corpus in which cross-sentential context is available. For the second issue, we manually identify the main areas in which sentence-level MT fails to produce adequate translations in the absence of context. We then create an evaluation set in which these phenomena are annotated to facilitate the automatic evaluation of document-level systems. We train MT models using our corpus to demonstrate how the use of context leads to improvements.
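For illustration only, the sketch below shows one common way (not necessarily the setup used in this paper) to expose cross-sentential context to an otherwise sentence-level NMT model: concatenating the previous source sentence(s) to the current one with a separator token, so that document-level data can be fed to a standard sentence-to-sentence architecture. The `<sep>` token, the `add_context` helper, and the toy conversation are hypothetical.

```python
# Minimal sketch of concatenation-based context-aware MT input preparation.
# Assumption: a document is a list of (source, target) sentence pairs in order.

from typing import List, Tuple

SEP = " <sep> "  # assumed separator token added to the model vocabulary


def add_context(document: List[Tuple[str, str]],
                context_size: int = 1) -> List[Tuple[str, str]]:
    """Prepend up to `context_size` previous source sentences to each source
    sentence; targets stay unchanged (an "N-to-1" concatenation setup)."""
    augmented = []
    for i, (src, tgt) in enumerate(document):
        context = [s for s, _ in document[max(0, i - context_size):i]]
        augmented_src = SEP.join(context + [src]) if context else src
        augmented.append((augmented_src, tgt))
    return augmented


if __name__ == "__main__":
    # Toy conversation: the second Japanese sentence drops the subject,
    # so the English pronoun "He" can only be inferred from context.
    doc = [
        ("ボブはどこですか。", "Where is Bob?"),
        ("会議室にいます。", "He is in the meeting room."),
    ]
    for src, tgt in add_context(doc):
        print(src, "=>", tgt)
```

The second example in the toy conversation illustrates the kind of phenomenon the abstract refers to: without the preceding sentence, a sentence-level system has no way to recover the dropped subject when translating into English.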