Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Volume 30, Issue 1
Displaying 1-15 of 15 articles from this issue
Preface (Non Peer-Reviewed)
General Paper (Peer-Reviewed)
  • Mai Omura, Aya Wakasa, Masayuki Asahara
    2023 Volume 30 Issue 1 Pages 4-29
    Published: 2023
    Released on J-STAGE: March 15, 2023
    JOURNAL FREE ACCESS

    Universal Dependencies (UD) is an international project that aims to construct cross-linguistically consistent dependency treebanks. Annotation standards for grammar (parts of speech, morphological features, and syntactic dependencies) are defined consistently across languages, and treebanks have been compiled for more than 100 languages. For languages written without word delimiters, the units that serve as syntactic words must be defined under the UD guidelines. Previous UD Japanese resources are based on NINJAL's short-unit words, which are defined by lexicon-based morphology. This study introduces the UD Japanese resources UD_Japanese-GSDLUW, UD_Japanese-PUDLUW, and UD_Japanese-BCCWJLUW, which are based on NINJAL's long-unit words; these are more suitable as syntactic words in Japanese than the short-unit words. (A toy illustration of the treebank file format follows this entry.)

    Download PDF (1057K)
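
    The LUW-based resources above are distributed as UD treebanks in the standard CoNLL-U format. Below is a minimal, self-contained illustration of that format; the three-token sentence and its analysis are a toy example of my own, not actual treebank content or a faithful LUW segmentation.

# Toy CoNLL-U reader: each token line has 10 tab-separated columns
# (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC).
SAMPLE = (
    "# text = toy example, not actual treebank content\n"
    "1\t研究所\t研究所\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
    "2\tが\tが\tADP\t_\t_\t1\tcase\t_\t_\n"
    "3\t発表した\t発表する\tVERB\t_\t_\t0\troot\t_\t_\n"
)

def read_conllu(text):
    """Yield sentences as lists of (id, form, lemma, upos, head, deprel)."""
    sentence = []
    for line in text.splitlines() + [""]:      # trailing "" flushes the last sentence
        if line.startswith("#"):
            continue                            # skip comment lines
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        sentence.append((cols[0], cols[1], cols[2], cols[3], cols[6], cols[7]))

for sent in read_conllu(SAMPLE):
    for token in sent:
        print(token)
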
  • Ken Yano, Akira Utsumi
    2023 Volume 30 Issue 1 Pages 30-62
    Published: 2023
    Released on J-STAGE: March 15, 2023
    JOURNAL FREE ACCESS

    We propose a novel pipeline method for translating signed Japanese sentences into written Japanese. Sign languages often suppress functional words such as particles, and most words are not morphologically inflected as they are in spoken languages. Our method explicitly compares and contrasts the two languages and divides the translation process into two tasks: it first translates glosses into lemmatized Japanese words or phrases, and then complements particles and conjugates predicates such as verbs, auxiliary verbs, and adjectives. Our method is especially effective when the parallel corpus is very limited and costly to obtain but plenty of monolingual data is available for the target language. Specifically, our method first uses phrase-based statistical machine translation (PBSMT) to map sign glosses to corresponding Japanese words or phrases, and then employs a transformer-based neural machine translation (NMT) model trained on a monolingual corpus to refine the first-stage output. Experimental results show that our pipeline method outperforms direct PBSMT and competitive NMT models with data augmentation, including back-translation and transfer learning, in a low-resource setting with a corpus size on the order of 10^4 words. (A structural sketch of the pipeline follows this entry.)

    Download PDF (860K)
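
    As a purely structural sketch of the two-stage pipeline described above: the gloss-to-lemma table and the refinement rule below are hypothetical toy stand-ins; in the paper itself the first stage is PBSMT and the second stage is a transformer NMT model trained with a monolingual written-Japanese corpus.

# Stage 1 (stand-in for PBSMT): map sign glosses to lemmatized Japanese words.
# Stage 2 (stand-in for the monolingual-corpus NMT model): complement particles
# and conjugate predicates to produce written Japanese.

GLOSS_TO_LEMMA = {"WATASHI": "私", "GAKKOU": "学校", "IKU": "行く"}  # toy phrase table

def stage1_gloss_to_lemmas(glosses):
    """PBSMT stand-in: gloss sequence -> lemmatized Japanese words."""
    return [GLOSS_TO_LEMMA.get(g, g) for g in glosses]

def stage2_refine(lemmas):
    """NMT stand-in: insert particles and conjugate predicates."""
    toy_rules = {("私", "学校", "行く"): "私は学校に行く。"}
    return toy_rules.get(tuple(lemmas), " ".join(lemmas))

glosses = ["WATASHI", "GAKKOU", "IKU"]
print(stage2_refine(stage1_gloss_to_lemmas(glosses)))  # -> 私は学校に行く。
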
  • Kentaro Kurihara, Daisuke Kawahara, Tomohide Shibata
    2023 Volume 30 Issue 1 Pages 63-87
    Published: 2023
    Released on J-STAGE: March 15, 2023
    JOURNAL FREE ACCESS

    To develop high-performance natural language understanding (NLU) models, it is necessary to have a benchmark to evaluate and analyze NLU ability from various perspectives. The English NLU benchmark, GLUE (Wang et al. 2018), has been the forerunner, and benchmarks for languages other than English have been constructed, such as CLUE (Xu et al. 2020) for Chinese and FLUE (Le et al. 2020) for French. However, there is no such benchmark for Japanese, and this is a serious problem in Japanese NLP. We build a Japanese NLU benchmark, JGLUE, from scratch without translation to measure the general NLU ability in Japanese. JGLUE consists of three kinds of tasks: text classification, sentence pair classification, and QA. We hope that JGLUE will facilitate NLU research in Japanese.

    Download PDF (854K)
  • Masato Mimura, Tatsuya Kawahara
    2023 Volume 30 Issue 1 Pages 88-124
    Published: 2023
    Released on J-STAGE: March 15, 2023
    JOURNAL FREE ACCESS

    Because conventional automatic speech recognition (ASR) systems are designed to faithfully reproduce utterances word by word, their outputs are not necessarily easy to read even when they contain few recognition errors. To address this issue, we propose a novel ASR approach that outputs readable and clean text directly from speech by removing fillers and disfluent regions, substituting colloquial expressions with formal ones, inserting punctuation, recovering omitted particles, and performing other appropriate corrections. We formalize this approach as end-to-end generation of written-style text from speech using a single neural network. We also propose a method to guide the training of this end-to-end model using automatically generated faithful transcripts, as well as a novel speech segmentation strategy based on online punctuation detection. An evaluation using 700 hours of Japanese parliamentary speech demonstrates that the proposed direct approach generates clean transcripts suitable for human consumption more accurately, and at a faster decoding speed, than the conventional cascade approach. We also provide an in-depth analysis of the types of edits professional human editors perform to create the official written records of Japanese parliamentary meetings, and evaluate how well the proposed system handles each edit type. (A shape-level sketch of the single-network formulation follows this entry.)

    Download PDF (1327K)
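
    A shape-level sketch of the single-network formulation above: acoustic frames go in, written-style token logits come out, and fillers or disfluent regions are simply absent from the target text. The architecture, sizes, and the omission of positional encoding and decoder masking are simplifications of mine, not the authors' configuration.

import torch
import torch.nn as nn

class Speech2WrittenText(nn.Module):
    """Toy encoder-decoder mapping acoustic frames directly to written-style text."""
    def __init__(self, n_mels=80, vocab=8000, d_model=256):
        super().__init__()
        self.frame_proj = nn.Linear(n_mels, d_model)   # project log-mel frames
        self.tok_emb = nn.Embedding(vocab, d_model)    # embed previous output tokens
        self.seq2seq = nn.Transformer(d_model=d_model, nhead=4,
                                      num_encoder_layers=2, num_decoder_layers=2,
                                      batch_first=True)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, frames, tokens):
        # frames: (B, T_frames, n_mels); tokens: (B, T_text) previous written tokens
        hidden = self.seq2seq(self.frame_proj(frames), self.tok_emb(tokens))
        return self.out(hidden)                        # (B, T_text, vocab) logits

model = Speech2WrittenText()
logits = model(torch.randn(2, 300, 80), torch.randint(0, 8000, (2, 20)))
print(logits.shape)  # torch.Size([2, 20, 8000])
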
  • Hayato Tsukagoshi, Ryohei Sasano, Koichi Takeda
    2023 Volume 30 Issue 1 Pages 125-155
    Published: 2023
    Released on J-STAGE: March 15, 2023
    JOURNAL FREE ACCESS

    Sentence embeddings, which represent sentences as dense vectors, have been actively studied as a fundamental technique for natural language processing with deep learning. In particular, sentence embedding methods based on Natural Language Inference (NLI) tasks have been successful. However, these methods rely heavily on large NLI datasets and thus cannot be expected to produce adequate sentence embeddings for languages for which large NLI datasets are not available. In this paper, we propose a sentence embedding method that uses definition sentences from a word dictionary, a resource available in many languages. Experimental results on standard benchmarks demonstrate that our method performs comparably to NLI-based methods. Furthermore, we show that performance differs depending on the properties of the evaluation task and data, and that even higher performance can be achieved by combining the two methods. (A toy sketch of the idea follows this entry.)

    Download PDF (2322K)
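
    One way to picture the approach, under my own assumption about the training signal (the abstract does not spell it out): an encoder is trained so that the pooled embedding of a dictionary definition predicts the word it defines, and the pooled vector is then used as a general-purpose sentence embedding. The tiny GRU encoder and single toy definition below are placeholders for the pretrained transformer and real dictionary used in the paper.

import torch
import torch.nn as nn

vocab = {"<pad>": 0, "dog": 1, "a": 2, "domesticated": 3, "animal": 4,
         "that": 5, "barks": 6}

class ToyDefinitionEncoder(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.enc = nn.GRU(dim, dim, batch_first=True)
        self.predict_word = nn.Linear(dim, vocab_size)

    def embed(self, ids):                     # mean-pooled sentence embedding
        hidden, _ = self.enc(self.emb(ids))
        return hidden.mean(dim=1)

    def forward(self, ids):                   # predict the defined word
        return self.predict_word(self.embed(ids))

definition = torch.tensor([[vocab[w] for w in
                            ["a", "domesticated", "animal", "that", "barks"]]])
target = torch.tensor([vocab["dog"]])

model = ToyDefinitionEncoder(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(50):                           # fit the single toy example
    loss = nn.functional.cross_entropy(model(definition), target)
    opt.zero_grad(); loss.backward(); opt.step()

print(model.embed(definition).shape)          # the resulting sentence embedding
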
  • Yukun Feng, Hidetaka Kamigaito, Hiroya Takamura, Manabu Okumura
    2023 Volume 30 Issue 1 Pages 156-183
    Published: 2023
    Released on J-STAGE: March 15, 2023
    JOURNAL FREE ACCESS

    In this study, we propose a simple and effective method to inject word-level information into character-aware neural language models. Unlike previous approaches, which typically inject word-level information as input to a long short-term memory (LSTM) network, we inject such information into the softmax function. The resultant model can be considered a combination of a character-aware language model and a simple word-level language model. Experimental results on 14 typologically diverse languages show empirically that our injection method performs better than previous methods that inject word-level information at the input, including a gating mechanism, averaging, and concatenation of word vectors. Our method can also be used together with these previous injection methods. Finally, we provide a comprehensive comparison with previous injection methods and analyze in detail the effectiveness of word-level information in character-aware language models and the properties of our injection method. (A schematic sketch follows this entry.)

    Download PDF (277K)
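
    A schematic reading of "injecting word-level information into the softmax" (this particular combination is my illustration, not necessarily the paper's exact formulation): the character-aware model produces a hidden state as usual, but the output logit for each word is computed against the sum of a character-side output embedding and a plain word-level embedding.

import torch
import torch.nn as nn

class SoftmaxInjectionLM(nn.Module):
    """Toy character-aware LM whose softmax also sees word-level embeddings.
    The char-CNN and the combination scheme are simplified placeholders."""
    def __init__(self, n_chars=60, n_words=1000, char_dim=16, dim=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_cnn = nn.Conv1d(char_dim, dim, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.char_out_emb = nn.Embedding(n_words, dim)  # character-side output embeddings
        self.word_out_emb = nn.Embedding(n_words, dim)  # injected word-level embeddings

    def forward(self, char_ids):
        # char_ids: (B, T_words, T_chars) character ids of each input word
        b, t, c = char_ids.shape
        x = self.char_emb(char_ids).view(b * t, c, -1).transpose(1, 2)
        word_repr = self.char_cnn(x).max(dim=2).values.view(b, t, -1)
        hidden, _ = self.lstm(word_repr)                # (B, T_words, dim)
        out_emb = self.char_out_emb.weight + self.word_out_emb.weight
        return hidden @ out_emb.T                       # logits over the word vocabulary

model = SoftmaxInjectionLM()
logits = model(torch.randint(0, 60, (2, 7, 10)))
print(logits.shape)  # torch.Size([2, 7, 1000])
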
  • Jingyi You, Dongyuan Li, Hidetaka Kamigaito, Kotaro Funakoshi, Manabu Okumura
    2023 Volume 30 Issue 1 Pages 184-214
    Published: 2023
    Released on J-STAGE: March 15, 2023
    JOURNAL FREE ACCESS

    Timeline summarization (TLS) is the task of summarizing events in chronological order, giving readers a comprehensive understanding of how a story evolves. Previous studies on TLS ignored the information interaction between sentences and dates and adopted pre-defined, unlearnable representations for them, which significantly degrades performance. They also treated date selection and event detection as two independent tasks, which makes it impossible to integrate their advantages and obtain a globally optimal summary. In this paper, we present a joint learning-based heterogeneous graph attention network for TLS (HeterTls), in which date selection and event detection are combined into a unified framework to improve extraction accuracy and remove redundant sentences simultaneously. Our heterogeneous graph involves multiple types of nodes, whose representations are iteratively learned across the heterogeneous graph attention layer. We evaluated our model on four datasets and found that it significantly outperforms current state-of-the-art baselines with regard to ROUGE scores and date selection metrics. (A stripped-down sketch of heterogeneous graph attention follows this entry.)

    Download PDF (4851K)
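
    A stripped-down sketch of attention over a heterogeneous graph with sentence and date nodes: each node type gets its own projection, and every node updates its representation by attending to its neighbors. The node types, projections, and toy adjacency below are my simplification of the general idea, not the HeterTls architecture itself.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyHeteroGraphAttention(nn.Module):
    """One attention layer over a graph whose nodes are sentences and dates."""
    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.ModuleDict({"sent": nn.Linear(dim, dim),
                                   "date": nn.Linear(dim, dim)})
        self.att = nn.Linear(2 * dim, 1)

    def forward(self, feats, types, adj):
        # feats: (N, dim) node features; types: list of "sent"/"date"; adj: (N, N) 0/1 mask
        h = torch.stack([self.proj[t](feats[i]) for i, t in enumerate(types)])
        n = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                           h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        scores = self.att(pairs).squeeze(-1).masked_fill(adj == 0, float("-inf"))
        alpha = F.softmax(scores, dim=-1)      # attention weights over neighbors
        return alpha @ h                       # updated node representations

feats = torch.randn(4, 64)                     # 3 sentence nodes + 1 date node
types = ["sent", "sent", "sent", "date"]
adj = torch.tensor([[1, 1, 0, 1],              # toy adjacency (with self-loops)
                    [1, 1, 1, 1],
                    [0, 1, 1, 1],
                    [1, 1, 1, 1]])
print(ToyHeteroGraphAttention()(feats, types, adj).shape)  # torch.Size([4, 64])
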
Society Column (Non Peer-Reviewed)
Information (Non Peer-Reviewed)