Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Volume 29, Issue 4
Preface (Non Peer-Reviewed)
General Paper (Peer-Reviewed)
  • Katsuki Chousa, Makoto Morishita, Masaaki Nagata
    2022 Volume 29 Issue 4 Pages 1052-1081
    Published: 2022
    Released on J-STAGE: December 15, 2022
    JOURNAL FREE ACCESS

    Lexically constrained machine translation is a task in which the translation model must output translations that contain all specified phrase constraints. In this paper, we propose a method for improving the efficiency of lexically constrained decoding by extending the input sequence of the model. Experiments on En↔Ja show that the proposed method achieves higher translation accuracy at lower computational cost than conventional methods. Furthermore, we propose a method for automatically extracting noisy lexical constraints by using the lexically constrained machine translation method. Experiments on Ja→En show that the proposed method achieves higher accuracy than general machine translation methods.
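
As a rough illustration of what "extending the input sequence" could look like, the sketch below appends each constraint phrase to the tokenized source behind a separator token and checks whether an output contains every constraint. The separator string `<sep>`, the function names, and the toy sentences are assumptions made for illustration; the abstract does not specify the actual input format or decoding procedure.

```python
def augment_source(source_tokens, constraints, sep="<sep>"):
    """Append each constraint phrase to the source, preceded by a separator.

    `sep` is a hypothetical special token; the real format may differ.
    """
    augmented = list(source_tokens)
    for phrase in constraints:
        augmented.append(sep)
        augmented.extend(phrase)
    return augmented


def satisfies_constraints(output_tokens, constraints):
    """Return True if every constraint phrase occurs contiguously in the output."""
    output = list(output_tokens)

    def contains(seq, sub):
        n = len(sub)
        return any(seq[i:i + n] == sub for i in range(len(seq) - n + 1))

    return all(contains(output, list(phrase)) for phrase in constraints)


# Toy example: an English source with two target-side constraint phrases.
source = ["the", "cat", "sat"]
constraints = [["猫"], ["座っ", "た"]]
model_input = augment_source(source, constraints)
```

Here `model_input` is the sequence a model would read in place of the plain source, and `satisfies_constraints` is the kind of check a constrained decoder has to guarantee for its final output.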

  • Hideya Mino, Kazutaka Kinugawa, Hitoshi Ito, Isao Goto, Ichiro Yamada, ...
    2022 Volume 29 Issue 4 Pages 1082-1105
    Published: 2022
    Released on J-STAGE: December 15, 2022
    JOURNAL FREE ACCESS

    Knowledge distillation is a representative approach in neural machine translation (NMT) for compressing a large model into a lightweight one. This approach first trains a strong teacher model and then forces a more compact student model to imitate the teacher. Although the key to successful knowledge distillation is constructing a stronger teacher model, even a teacher model built on state-of-the-art NMT may remain inadequate owing to translation errors. An inadequate teacher model severely degrades the student model through error propagation, especially for words important to sentence meaning. To mitigate this degradation, we propose a knowledge distillation method that uses a lexical constraint as privileged information for NMT. The proposed method trains a teacher model with a lexical constraint, a list of words automatically extracted from a target sentence in the training data. We configure the lexical constraint according to the importance of words and the fallibility of NMT. Models trained with the proposed method produce better translations than those trained with a baseline method on English↔German and English↔Japanese translation tasks under the condition without ensemble decoding and beam-search decoding.
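
The idea of a student imitating a teacher can be sketched with a word-level distillation loss: the cross-entropy between the teacher's and the student's output distributions at each target position. This is a generic form of knowledge distillation chosen for illustration, not necessarily the loss used in the paper, and the function names and toy logits are assumptions. In the proposed setting the teacher logits would come from a model that also sees the lexical constraint, which this sketch does not model.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def distillation_loss(teacher_logits, student_logits):
    """Average cross-entropy of the student against the teacher distribution.

    Both arguments are lists of per-position logit lists with matching shapes.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        teacher_probs = [math.exp(lp) for lp in log_softmax(t_row)]
        student_log_probs = log_softmax(s_row)
        total += -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))
    return total / len(teacher_logits)


# Toy example: one target position, vocabulary of three words.
teacher = [[2.0, 0.0, 0.0]]
agreeing_student = [[2.0, 0.0, 0.0]]
disagreeing_student = [[0.0, 2.0, 0.0]]
```

The loss is lowest when the student reproduces the teacher's distribution, which is also why errors in the teacher propagate to the student.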

  • Taichi Nishimura, Kojiro Sakoda, Atsushi Ushiku, Atsushi Hashimoto, Na ...
    2022 Volume 29 Issue 4 Pages 1106-1137
    Published: 2022
    Released on J-STAGE: December 15, 2022
    JOURNAL FREE ACCESS

    In this study, we propose an egocentric biochemical video-and-language dataset called BioVL2 comprising eight videos for each of four experiments, with a total duration of 2.5 hours for all 32 samples. Each video corresponds to a protocol and two types of linguistic annotations are provided: (1) video-and-text alignment and (2) bounding boxes linked to objects in the protocol. As an application of the BioVL2 dataset, we consider the task of generating a protocol from an experimental video. Our experimental results show that the proposed system can generate better protocols than a weak baseline designed to output objects appearing in the video frames. The BioVL2 dataset will be released for research purposes only.

  • Qin Dai, Benjamin Heinzerling, Naoya Inoue, Kentaro Inui
    2022 Volume 29 Issue 4 Pages 1138-1164
    Published: 2022
    Released on J-STAGE: December 15, 2022
    JOURNAL FREE ACCESS

    This paper explores how distantly supervised relation extraction (DS-RE) can benefit from the use of a Universal Graph (UG), the combination of a Knowledge Graph (KG) and a large-scale text collection. A straightforward extension of a current state-of-the-art neural model for DS-RE with a UG may lead to degradation in performance. We first report that this degradation is associated with the difficulty of learning a UG and then propose three training strategies: (1) Path Type Adaptive Pretraining, which sequentially trains the model with different types of UG paths; (2) Path Type-wise Local Loss, an alternative to Path Type Adaptive Pretraining that generates UG path type-wise local error signals to prevent reliance on a single type of UG path; and (3) a Complexity Ranking Guided Attention mechanism, which restricts the attention span according to the complexity of UG paths to force the model to extract features not only from simple UG paths but also from complex ones. Experimental results on both biomedical and NYT10 datasets demonstrate the robustness of our methods, which achieve a new state-of-the-art result on the commonly used NYT10 dataset. The code and datasets used in this paper are available at https://github.com/baodaiqin/UGDSRE. In addition, a DS-RE toolkit developed based on this work is available at https://github.com/baodaiqin/UKG-RE.

  • Shuntaro Yada, Ribeka Tanaka, Fei Cheng, Eiji Aramaki, Sadao Kurohashi
    2022 Volume 29 Issue 4 Pages 1165-1197
    Published: 2022
    Released on J-STAGE: December 15, 2022
    JOURNAL FREE ACCESS

    Natural language processing for medical applications (medical NLP) requires high-quality annotated corpora. In this study, we designed a versatile annotation scheme for clinical-medical text and a set of associated guidelines, which address two common subtasks used in medical NLP: named entity recognition (NER) and relation extraction (RE). The annotation scheme integrates similar existing schemes and defines clinical-medical entities and relations to encode useful information for many medical NLP applications. The guidelines aim to increase the annotation feasibility by reducing the necessity of judgement based on medical knowledge so as to enable non-medical professionals to annotate the text. We adopted a recursive discussion procedure involving NLP researchers, medical professionals, and annotators to develop the scheme and guidelines based on real annotation examples while increasing the corpus size. Further, we obtained annotated corpora comprising 3,769 medical records and radiology reports of patients with serious lung diseases. For improved efficiency, preliminary NER and RE models were created after the first half was annotated; they were subsequently applied to the second half, which was then corrected manually. This two-step annotation also increased the inter-coder agreement. Finally, a joint NER + RE model trained on our corpora showed sufficiently promising performance to suggest its practical implementation.

  • Masayasu Muraoka, Naoaki Okazaki, Ryosuke Kohita, Etsuko Ishii
    2022 Volume 29 Issue 4 Pages 1198-1232
    Published: 2022
    Released on J-STAGE: December 15, 2022
    JOURNAL FREE ACCESS

    We propose a new task called image-to-text matching (ITeM) to facilitate multimodal document understanding. ITeM requires a system to learn a plausible assignment of images to texts in a multimodal document. To study this task, we systematically construct a dataset comprising 66,947 documents with 320,200 images from Wikipedia. We evaluate two existing state-of-the-art multimodal systems on this task to assess its validity and difficulty. Experimental results show that the systems greatly outperform simple baselines, while their performance is still far from that of humans. Further, the proposed task does not contribute significantly to the existing multimodal tasks; however, detailed analysis suggests that the task becomes more complex when more images are present in a document and that the proposed task can offer a new capability for image-to-text understanding not achievable through existing tasks, such as multiple image consideration or image abstraction.

Technical Report (Peer-Reviewed)
  • Keiichi Goshima, Mototsugu Shintani, Hiroya Takamura
    2022 Volume 29 Issue 4 Pages 1233-1253
    Published: 2022
    Released on J-STAGE: December 15, 2022
    JOURNAL FREE ACCESS

    In this study, we construct a sentiment dictionary for the macroeconomic domain and present its applications. Our dictionary contains words selected by several economists from a corpus of newspaper articles on topics related to the economy. This was supplemented with additional words by using supervised learning. We use our sentiment dictionary to construct a daily business cycle index designed to capture the current state of the economy in a timely manner.

  • Chenchen Ding, Masao Utiyama, Eiichiro Sumita
    2022 Volume 29 Issue 4 Pages 1254-1271
    Published: 2022
    Released on J-STAGE: December 15, 2022
    JOURNAL FREE ACCESS

    In this study, an input method editor called AKKHARA is developed to accommodate writing systems comprising several tens to hundreds of symbols. As an engineering realization, AKKHARA accepts and applies a set of rewrite rules with priorities such that the alternation, substitution, and normalization of character strings are applied alongside the keystrokes. Compared with general key-character editors, AKKHARA provides greater flexibility for editing Romanization-based rules. Compared with the input methods developed for Chinese and Japanese, AKKHARA is lightweight and easy to maintain. As an application case of AKKHARA, this study illustrates the realization of a Romanization-based Myanmar input method using the Unicode standard. A version of AKKHARA for Microsoft Windows has been released; it supports Unicode characters and provides customizable functions for editing rewrite rules.
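
The notion of prioritized rewrite rules applied alongside keystrokes can be sketched as follows. After each keystroke, the highest-priority rule whose pattern matches the end of the buffer is applied, and this repeats until no rule matches. The `Rule` class, the function names, the suffix-matching strategy, and the toy rules are all assumptions made for illustration; they are not AKKHARA's actual rule format, and the rules are not a real Romanization scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    priority: int      # higher value wins when several rules match
    pattern: str       # text to match at the end of the buffer
    replacement: str   # text that replaces the matched suffix


# Hypothetical toy rules, not taken from any real input method.
RULES = [
    Rule(2, "ng", "ŋ"),
    Rule(1, "g", "ɣ"),
    Rule(1, "aa", "ā"),
]


def press(buffer, key, rules=RULES, max_steps=16):
    """Append one keystroke, then rewrite the buffer suffix until stable."""
    buffer += key
    for _ in range(max_steps):  # guard against rules that rewrite forever
        matches = [r for r in rules if buffer.endswith(r.pattern)]
        if not matches:
            break
        best = max(matches, key=lambda r: r.priority)
        buffer = buffer[: len(buffer) - len(best.pattern)] + best.replacement
    return buffer


def type_keys(keys, rules=RULES):
    """Feed a sequence of keystrokes through the rewrite rules."""
    buffer = ""
    for key in keys:
        buffer = press(buffer, key, rules)
    return buffer
```

With these toy rules, typing `n` then `g` triggers the higher-priority `ng` rule rather than the `g` rule, whereas `g` typed after `a` falls through to the lower-priority rule.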

Society Column (Non Peer-Reviewed)
Information (Non Peer-Reviewed)