Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Volume 29, Issue 2
Displaying 1-27 of 27 articles from this issue
Preface (Non Peer-Reviewed)
General Paper (Peer-Reviewed)
  • Taiki Watanabe, Akihiro Tamura, Takashi Ninomiya, Takuya Makino, Tomoy ...
    2022 Volume 29 Issue 2 Pages 294-313
    Published: 2022
    Released on J-STAGE: June 15, 2022
    JOURNAL FREE ACCESS

    We propose a method to improve named entity recognition (NER) for chemical compounds using multi-task learning, in which a chemical NER model and a chemical compound name paraphrase model are trained jointly. Our method enables the NER model to capture chemical compound paraphrases by sharing the NER parameters and the character embeddings, based on long short-term memory (LSTM), with the paraphrase model. Experimental results on BioCreative IV CHEMDNER show that learning paraphrases in this way contributes to improved accuracy. (An illustrative sketch of the parameter sharing follows this entry.)

    Download PDF (426K)
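
    As a concrete picture of the parameter sharing described above, here is a minimal PyTorch sketch of a character-level bi-LSTM encoder shared between an NER tagging head and a paraphrase-scoring head. All module names, layer sizes, and the scoring heads are illustrative assumptions, not the paper's actual architecture.

    ```python
    import torch
    import torch.nn as nn

    class SharedCharEncoder(nn.Module):
        """Character-level bi-LSTM shared by both tasks."""
        def __init__(self, n_chars=100, char_dim=32, hidden=64):
            super().__init__()
            self.embed = nn.Embedding(n_chars, char_dim)
            self.lstm = nn.LSTM(char_dim, hidden, batch_first=True,
                                bidirectional=True)

        def forward(self, char_ids):          # (batch, seq)
            h, _ = self.lstm(self.embed(char_ids))
            return h                          # (batch, seq, 2 * hidden)

    class MultiTaskModel(nn.Module):
        def __init__(self, encoder, n_tags=9):
            super().__init__()
            self.encoder = encoder            # parameters shared across tasks
            self.ner_head = nn.Linear(128, n_tags)   # per-character BIO tags
            self.para_head = nn.Linear(128, 1)       # paraphrase-pair score

        def forward(self, char_ids, task):
            h = self.encoder(char_ids)
            if task == "ner":
                return self.ner_head(h)               # tag logits per character
            return self.para_head(h.mean(dim=1))      # one score per string

    model = MultiTaskModel(SharedCharEncoder())
    x = torch.randint(0, 100, (2, 20))                # toy character IDs
    print(model(x, "ner").shape, model(x, "para").shape)
    ```

    Training would alternate batches from the two tasks, so gradients from both flow into the shared encoder.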
  • Ikumi Yamashita, Masahiro Kaneko, Masato Mita, Satoru Katsumata, Aizha ...
    2022 Volume 29 Issue 2 Pages 314-343
    Published: 2022
    Released on J-STAGE: June 15, 2022
    JOURNAL FREE ACCESS

    In this study, we explore cross-lingual transfer learning for grammatical error correction (GEC). Few studies have investigated the use of knowledge from other languages for GEC; therefore, it is unclear whether useful grammatical knowledge can be transferred. Similar languages often share grammatical items, and it may be possible to perform cross-lingual transfer learning by exploiting these grammatical similarities. In this study, we use a pre-trained model and a multilingual learner corpus for cross-lingual transfer learning in GEC. Our results demonstrate that transfer learning from other languages can improve the accuracy of GEC. We also demonstrate that proximity to the source languages has a significant impact on the accuracy of correcting certain types of errors. (A schematic sketch of the transfer recipe follows this entry.)

    Download PDF (606K)
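
    The transfer recipe itself is simple to state: fine-tune a pretrained model first on a source-language learner corpus and then on the target-language one. The sketch below shows that two-stage loop with a toy stand-in model and random data; the real setup (a pretrained seq2seq GEC model and a multilingual learner corpus) is replaced here by illustrative assumptions.

    ```python
    import torch
    import torch.nn as nn

    def fine_tune(model, pairs, epochs=2, lr=1e-3):
        """One fine-tuning stage over (erroneous, corrected) pairs."""
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.MSELoss()            # stand-in for token cross-entropy
        for _ in range(epochs):
            for src, tgt in pairs:
                opt.zero_grad()
                loss_fn(model(src), tgt).backward()
                opt.step()
        return model

    # Stand-in "GEC model" and toy data: vectors in place of sentences.
    model = nn.Linear(8, 8)
    source_lang_pairs = [(torch.randn(8), torch.randn(8)) for _ in range(4)]
    target_lang_pairs = [(torch.randn(8), torch.randn(8)) for _ in range(4)]

    model = fine_tune(model, source_lang_pairs)  # stage 1: transfer language
    model = fine_tune(model, target_lang_pairs)  # stage 2: target language
    ```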
  • Ryo Fukuda, Katsuhito Sudoh, Satoshi Nakamura
    2022 Volume 29 Issue 2 Pages 344-366
    Published: 2022
    Released on J-STAGE: June 15, 2022
    JOURNAL FREE ACCESS

    Recent studies consider knowledge distillation a promising method for speech translation (ST) using end-to-end models. However, its usefulness in cascade ST, which combines automatic speech recognition (ASR) and machine translation (MT) models, has not yet been clarified. ASR output typically contains speech recognition errors, and an MT model trained only on human transcripts performs poorly on error-containing ASR results; it should therefore be trained with the presence of ASR errors at inference time in mind. In this paper, we propose using knowledge distillation to train the MT model of a cascade ST system for robustness against ASR errors. We distilled knowledge from a teacher model based on human transcripts to a student model based on erroneous transcriptions. Our experimental results show that the proposed method improves translation performance on erroneous transcriptions. Combining knowledge distillation with fine-tuning consistently improved performance on two different datasets: MuST-C English–Italian and Fisher Spanish–English. (An illustrative sketch of the distillation objective follows this entry.)

    Download PDF (230K)
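
    The distillation objective can be made concrete in a few lines of PyTorch: the teacher consumes the clean human transcript, the student consumes the erroneous ASR output, and the student is trained to match the teacher's output distribution. The logits below are random stand-ins for the two models' decoder outputs; the loss itself is standard KL-based distillation, not necessarily the paper's exact variant.

    ```python
    import torch
    import torch.nn.functional as F

    vocab, seq = 50, 7
    teacher_logits = torch.randn(1, seq, vocab)       # input: clean transcript
    student_logits = torch.randn(1, seq, vocab,       # input: ASR hypothesis
                                 requires_grad=True)

    kd_loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),        # student distribution
        F.softmax(teacher_logits, dim=-1),            # teacher distribution
        reduction="batchmean",
    )
    kd_loss.backward()    # gradients update only the student
    print(float(kd_loss))
    ```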
  • Shuhei Moriyama, Tomohiro Ohno
    2022 Volume 29 Issue 2 Pages 367-394
    Published: 2022
    Released on J-STAGE: June 15, 2022
    JOURNAL FREE ACCESS

    This paper describes the morphological analysis of unsegmented Hiragana strings, which are known to contain more ambiguities than Kanji-Kana mixed strings. Several morphological analysis methods aimed mainly at Hiragana strings have been developed, but most do not achieve sufficient accuracy. One prior method is more accurate than the well-known conventional morphological analysis tool for Kanji-Kana mixed strings, but it requires a considerable amount of analysis time. Aiming for high-accuracy, practical-speed analysis of unsegmented Hiragana strings, we propose a sequential morphological analysis method using a recurrent neural network (RNN) and logistic regression. To speed up the analysis, the proposed method sequentially estimates word boundaries at each character boundary and estimates morpheme information for each word. To improve accuracy, it estimates word boundaries and morpheme information by integrating an estimate based on local information from logistic regression with an estimate based on global information from the RNN. Experimental results confirmed that the proposed method is more than 100 times faster than the prior method while achieving higher analysis accuracy. (An illustrative sketch of the score integration follows this entry.)

    Download PDF (939K)
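
    The integration of local and global evidence can be sketched as a simple score combination: at each character boundary, a logistic-regression probability over local features is blended with a probability derived from an RNN hidden state. The features, weights, and interpolation below are illustrative assumptions, not the paper's exact formulation.

    ```python
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def boundary_probability(local_feats, rnn_state, w_lr, w_rnn, alpha=0.5):
        p_local = sigmoid(local_feats @ w_lr)   # logistic regression (local)
        p_global = sigmoid(rnn_state @ w_rnn)   # RNN state score (global)
        return alpha * p_local + (1 - alpha) * p_global

    rng = np.random.default_rng(0)
    feats, state = rng.normal(size=5), rng.normal(size=8)
    w_lr, w_rnn = rng.normal(size=5), rng.normal(size=8)

    p = boundary_probability(feats, state, w_lr, w_rnn)
    print("word boundary" if p > 0.5 else "no boundary", round(float(p), 3))
    ```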
  • Kosuke Yamada, Ryohei Sasano, Koichi Takeda
    2022 Volume 29 Issue 2 Pages 395-415
    Published: 2022
    Released on J-STAGE: June 15, 2022
    JOURNAL FREE ACCESS

    Recent studies on semantic frame induction have achieved relatively high performance using clustering-based methods with contextualized word embeddings. However, these methods have two potential drawbacks: they focus too heavily on the surface form of the frame-evoking verb, and they tend to split instances of the same verb into too many different frame clusters. To overcome these drawbacks, we propose a semantic frame induction method using masked word embeddings and two-step clustering. Through experiments on data from the English FrameNet, we demonstrate that masked word embeddings help avoid over-reliance on the surface information of frame-evoking verbs, and that two-step clustering yields a more appropriate number of frame clusters for instances of the same verb. (An illustrative sketch of the two-step clustering follows this entry.)

    Download PDF (590K)
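
    The two-step clustering can be sketched directly: cluster each verb's instance embeddings separately, then cluster the resulting per-verb centroids across verbs into frame clusters. The random vectors below stand in for contextualized embeddings of masked frame-evoking verbs, and the cluster counts are arbitrary choices.

    ```python
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    rng = np.random.default_rng(0)
    instances = {                      # verb -> embeddings of its instances
        "run": rng.normal(size=(12, 16)),
        "get": rng.normal(size=(15, 16)),
    }

    # Step 1: within-verb clustering, keeping each cluster's centroid.
    centroids = []
    for verb, X in instances.items():
        labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
        centroids.extend(X[labels == c].mean(axis=0) for c in range(2))

    # Step 2: cross-verb clustering of centroids into frame clusters.
    frames = AgglomerativeClustering(n_clusters=3).fit_predict(np.array(centroids))
    print(frames)                      # frame-cluster ID per verb-level cluster
    ```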
  • Tetsuya Ishida, Yohei Seki, Wakako Kashino, Noriko Kando
    2022 Volume 29 Issue 2 Pages 416-442
    Published: 2022
    Released on J-STAGE: June 15, 2022
    JOURNAL FREE ACCESS

    Citizen feedback is essential for improving hospitality in government policies and customer services. In this study, we propose a method for extracting citizen feedback from social media according to appraisal opinion type, filtering tweets based on multiple viewpoints such as regional dependency, citizen status, and polarity. To improve the F1-score of opinion-unit viewpoint estimation, we implement a multitask learning framework that estimates associated viewpoints using a BERT model. In our experiments, we focus on two domains of citizen life during the COVID-19 pandemic: nursery school life and restaurant takeout services. Our multitask learning approach was effective in estimating viewpoints on opinions. In addition, we demonstrate that filtering citizen feedback by specific viewpoints is valuable for investigating chronological opinion transitions by appraisal opinion type. (An illustrative sketch of the multitask setup follows this entry.)

    Download PDF (2271K)
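
    The multitask framework amounts to one shared encoder with a classification head per viewpoint, trained with a joint loss. Below is a minimal PyTorch sketch; the encoder is a toy stand-in for BERT, and the viewpoint names and label counts are illustrative assumptions.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiViewpointClassifier(nn.Module):
        def __init__(self, dim=32):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
            self.heads = nn.ModuleDict({   # one head per viewpoint
                "regional": nn.Linear(dim, 2),
                "status": nn.Linear(dim, 3),
                "polarity": nn.Linear(dim, 3),
            })

        def forward(self, x):
            h = self.encoder(x)            # shared representation
            return {name: head(h) for name, head in self.heads.items()}

    model = MultiViewpointClassifier()
    logits = model(torch.randn(4, 32))     # 4 toy opinion units

    # Joint loss: sum of per-viewpoint cross-entropies (toy gold labels).
    gold = torch.zeros(4, dtype=torch.long)
    loss = sum(F.cross_entropy(v, gold) for v in logits.values())
    loss.backward()
    ```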
  • Ryuichiro Higashinaka, Masahiro Araki, Hiroshi Tsukahara, Masahiro Miz ...
    2022 Volume 29 Issue 2 Pages 443-466
    Published: 2022
    Released on J-STAGE: June 15, 2022
    JOURNAL FREE ACCESS

    This study proposes a taxonomy of errors in chat-oriented dialogue systems. Two taxonomies were previously proposed: one theory-driven and the other data-driven. The former suffers from the fact that dialogue theories for human conversation are often not appropriate for categorizing errors made by chat-oriented dialogue systems; the latter can only cope with system errors for which data exist. This paper integrates these two taxonomies to create a comprehensive taxonomy of errors in chat-oriented dialogue systems. With our integrated taxonomy, errors can be annotated reliably, with a higher Fleiss’ kappa than with the previously proposed taxonomies.

    Download PDF (526K)
  • Jingun Kwon, Naoki Kobayashi, Hidetaka Kamigaito, Hiroya Takamura, Man ...
    2022 Volume 29 Issue 2 Pages 467-492
    Published: 2022
    Released on J-STAGE: June 15, 2022
    JOURNAL FREE ACCESS

    On social media, small images called emojis are frequently used in posts and play a key role in recent communication. However, little attention has been paid to their positions in text, even though users are known to carefully choose and place emojis that match their posts. Exploring the position of emojis in texts should enhance our understanding of the relationship between emojis and texts. In this paper, we propose a novel task of inserting an emoji at an appropriate position in a given tweet. We extend an emoji label prediction method to consider position information by jointly learning the emoji position in a tweet together with the emoji label. This additional position information improves the performance of emoji prediction, and human evaluations validate that a suitable emoji position exists in a tweet. The proposed task makes tweets fancier and more natural. In addition, the emoji position can further improve the performance of irony detection compared with emoji label prediction alone. We also report experimental results on a modified version of the dataset from the first emoji label prediction shared task at SemEval 2018, owing to problems with the original dataset. (An illustrative sketch of the joint prediction follows this entry.)

    Download PDF (315K)
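
    Joint learning of the emoji label and its position can be sketched with two heads over a shared tweet encoding: one predicts which emoji to use, the other scores each token position as the insertion point. The GRU encoder and all sizes below are illustrative stand-ins, not the paper's model.

    ```python
    import torch
    import torch.nn as nn

    class EmojiInserter(nn.Module):
        def __init__(self, dim=32, n_emojis=20):
            super().__init__()
            self.encoder = nn.GRU(dim, dim, batch_first=True)
            self.label_head = nn.Linear(dim, n_emojis)   # which emoji
            self.pos_head = nn.Linear(dim, 1)            # where to insert it

        def forward(self, tokens):                       # (batch, seq, dim)
            h, _ = self.encoder(tokens)
            label_logits = self.label_head(h.mean(dim=1))  # (batch, n_emojis)
            pos_logits = self.pos_head(h).squeeze(-1)      # (batch, seq)
            return label_logits, pos_logits

    model = EmojiInserter()
    labels, positions = model(torch.randn(2, 10, 32))    # 2 toy tweets
    print(labels.argmax(-1), positions.argmax(-1))       # emoji ID, position
    ```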
  • Andre Rusli, Makoto Shishido
    2022 Volume 29 Issue 2 Pages 493-514
    Published: 2022
    Released on J-STAGE: June 15, 2022
    JOURNAL FREE ACCESS

    This study proposes a support tool for building zero-pronoun evaluation sets, called the zero-pronoun annotation support tool (0Past; pronounced zero-past). The tool provides a chat-like user interface to ease navigation for human annotators: each conversation is displayed separately, and while the user views a conversation, its messages are displayed individually, with a distinct color for the newest message. Using 0Past, we constructed two zero-pronoun evaluation sets, which are then used to evaluate how well neural machine translation (NMT) models translate Japanese conversations into English with the correct pronouns. Additionally, this study builds a zero-pronoun classification model from the newly constructed evaluation sets, enabling the tool to provide automated pre-annotations that human annotators can refine manually. Finally, this study reports the results of training a Japanese–English NMT model and compares its performance with two publicly available pretrained models in translating parallel conversational sentences, which contain many omitted pronouns, from Japanese to English. The results confirm that phenomenon-specific evaluation sets are essential for properly assessing NMT models on conversational Japanese, in which the anaphoric zero-pronoun phenomenon is pervasive.

    Download PDF (359K)
  • Yuki Yamamoto, Yuji Matsumoto, Taro Watanabe
    2022 Volume 29 Issue 2 Pages 515-541
    Published: 2022
    Released on J-STAGE: June 15, 2022
    JOURNAL FREE ACCESS

    The syntax-based AMR parsing approach assumes a close mapping between syntactic and semantic structures. However, this mapping is not evident in complex sentences, causing parsers to fail to build the correct core structure of a tree. In this paper, as an aid to AMR parsing, we propose a dependency matching system that first detects complex-sentence structures in a dependency parse tree and then returns a corresponding AMR skeleton structure. We manually designed a dictionary of dependency patterns and corresponding AMR skeletons for the types of complex-sentence constructions that appear in the AMR corpus. A disambiguation step is necessary for certain constructions with semantically ambiguous subordinators; we show that this disambiguation can be formulated as sentence-pair classification by fine-tuning a pretrained BERT model. The classification models were trained on data derived from the AMR and Wikipedia corpora, establishing a novel baseline for future research. (An illustrative sketch of the pattern dictionary follows this entry.)

    Download PDF (676K)
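
    The dictionary lookup at the heart of the system can be pictured as a mapping from dependency patterns to AMR skeletons. The patterns, relation names, and skeletons below are invented for illustration; the paper's dictionary is hand-built from the constructions found in the AMR corpus.

    ```python
    # (subordinator, dependency relation) -> AMR skeleton with slots
    PATTERNS = {
        ("because", "advcl"): "(c / cause-01 :ARG0 <sub> :ARG1 <main>)",
        ("if", "advcl"): "(h / have-condition-91 :ARG1 <main> :ARG2 <sub>)",
    }

    def match_skeleton(dep_edges):
        """dep_edges: (head, relation, subordinator) triples from a parse."""
        for _, rel, marker in dep_edges:
            skeleton = PATTERNS.get((marker, rel))
            if skeleton is not None:
                return skeleton
        return None   # no complex-sentence construction detected

    # "She stayed home because it rained."
    edges = [("stayed", "advcl", "because")]
    print(match_skeleton(edges))
    ```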
  • Shota Koyama, Hiroya Takamura, Naoaki Okazaki
    2022 Volume 29 Issue 2 Pages 542-586
    Published: 2022
    Released on J-STAGE: June 15, 2022
    JOURNAL FREE ACCESS

    Inadequate training data renders neural grammatical error correction less effective, and researchers have recently proposed data augmentation methods to address this problem. These methods rest on three assumptions: (1) error diversity in the generated data contributes to performance improvement; (2) error generation for a certain error type affects the correction performance for same-type errors; and (3) a larger corpus used in error generation results in better performance. In this study, we design multiple error generation rules for various grammatical categories and propose a method that combines these rules, validating the above assumptions by varying the error types in the generated data. Results show that assumptions (1) and (2) are valid, whereas assumption (3) depends on the number of training steps and the number of generated errors. Furthermore, our proposed method can train a high-performance model even in unsupervised settings and corrects writing errors more effectively than a model based on round-trip translation. Finally, we find that the error types corrected by models based on round-trip and back translation differ from those corrected by our method. (An illustrative sketch of rule-based error generation follows this entry.)

    Download PDF (1093K)
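
    Rule-based error generation of this kind can be sketched as a set of per-category corruption functions combined by sampling. The two rules below (article deletion, preposition confusion) are simplistic stand-ins for the paper's rules; applying them to clean text yields synthetic (erroneous, clean) training pairs.

    ```python
    import random

    def drop_article(tokens):              # article-error rule
        return [t for t in tokens if t not in ("a", "an", "the")]

    def swap_preposition(tokens):          # preposition-error rule
        table = {"in": "on", "on": "in", "at": "in"}
        return [table.get(t, t) for t in tokens]

    RULES = [drop_article, swap_preposition]

    def corrupt(sentence, k=1, seed=0):
        """Apply k randomly chosen rules to a clean sentence."""
        random.seed(seed)
        tokens = sentence.split()
        for rule in random.sample(RULES, k):
            tokens = rule(tokens)
        return " ".join(tokens)

    clean = "the cat sat on the mat"
    print(corrupt(clean, k=2), "<-", clean)   # synthetic training pair
    ```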
  • Shintaro Harada, Taro Watanabe
    2022 Volume 29 Issue 2 Pages 587-610
    Published: 2022
    Released on J-STAGE: June 15, 2022
    JOURNAL FREE ACCESS

    Grammatical information has been reported to be useful for machine translation (MT). However, annotating grammatical information incurs significant human cost. It is also not trivial to adapt grammatical information to MT: grammatical annotation usually follows tokenization standards that might not capture the relation between two languages, and the subword tokenization (such as byte-pair encoding) used to alleviate out-of-vocabulary problems might not be compatible with those annotations. In this work, we introduce two methods to incorporate grammatical information without explicit annotation supervision: first, latent phrase structure is induced in an unsupervised fashion from an attention mechanism; second, the induced latent phrase structures in the encoder and decoder are synchronized via training constraints so that they are compatible with each other. We demonstrate that our approach performs better on two tasks, translation and word alignment, without extra resources. An analysis of the induced phrase and alignment structures shows that the synchronization constraint enhances alignment precision. (An illustrative sketch of the synchronization constraint follows this entry.)

    Download PDF (608K)
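
    One way to picture the synchronization constraint is as an auxiliary loss that pushes phrase-boundary scores induced from encoder attention and decoder attention to agree. Everything below, including the boundary-score proxy (how little attention mass crosses a split point), is an illustrative assumption rather than the paper's parameterization.

    ```python
    import torch
    import torch.nn.functional as F

    seq = 8
    enc_logits = torch.randn(seq, seq, requires_grad=True)  # toy attention logits
    dec_logits = torch.randn(seq, seq, requires_grad=True)
    enc_attn = torch.softmax(enc_logits, dim=-1)
    dec_attn = torch.softmax(dec_logits, dim=-1)

    def boundary_scores(attn):
        """Score a split after position i by the attention crossing it."""
        scores = []
        for i in range(seq - 1):
            cross = attn[: i + 1, i + 1:].sum() + attn[i + 1:, : i + 1].sum()
            scores.append(-cross)          # less crossing -> likelier boundary
        return torch.stack(scores)

    sync_loss = F.mse_loss(boundary_scores(enc_attn), boundary_scores(dec_attn))
    sync_loss.backward()   # added to the translation loss during training
    print(float(sync_loss))
    ```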
  • Shuichiro Shimizu, Chenhui Chu, Sheng Li, Sadao Kurohashi
    2022 Volume 29 Issue 2 Pages 611-637
    Published: 2022
    Released on J-STAGE: June 15, 2022
    JOURNAL FREE ACCESS

    End-to-end speech translation (ST) is the task of directly translating source-language speech into target-language text. It has the potential to generate better translations than those obtained by simply combining automatic speech recognition (ASR) with machine translation (MT). We propose cross-lingual transfer learning for end-to-end ST, in which model parameters are transferred from an ST pretraining stage for one language pair to an ST fine-tuning stage for another language pair. Experiments on the CoVoST 2 and multilingual TEDx datasets in many-to-one settings show that our model outperforms a model that uses English ASR pretraining by up to 2.3 BLEU points. An ablation study investigating which layers of the sequence-to-sequence architecture carry transferable information demonstrated that the lower layers of the encoder contain language-independent information suitable for cross-lingual transfer. Extensive studies were conducted on (1) the ASR pretraining language, (2) the ST pretraining language pair, (3) multilingual methods, and (4) model sizes. They demonstrated that (1) using the same language for ASR pretraining and as the ST fine-tuning source language yields good performance, (2) a high-resource language pair is a good choice for the ST pretraining pair, (3) the proposed method works well in conjunction with multilingual methods, and (4) the proposed method operates across different model sizes. (An illustrative sketch of the parameter transfer follows this entry.)

    Download PDF (327K)
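
    The parameter-transfer step can be sketched as selective state-dict copying: only the lower encoder layers of the pretrained ST model are loaded into the model for the new language pair. The layer layout and the cut-off below are illustrative assumptions, not the paper's architecture.

    ```python
    import torch.nn as nn

    def make_st_model(n_layers=4, dim=16):
        return nn.ModuleDict({
            "encoder": nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_layers)),
            "decoder": nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_layers)),
        })

    pretrained = make_st_model()    # trained on the pretraining language pair
    target = make_st_model()        # to be fine-tuned on the new pair

    LOWER = 2                       # transfer only encoder layers 0 .. LOWER-1
    state = {k: v for k, v in pretrained.state_dict().items()
             if k.startswith("encoder.") and int(k.split(".")[1]) < LOWER}
    target.load_state_dict(state, strict=False)  # rest stays freshly initialized
    print(sorted(state))            # only lower-encoder weights were copied
    ```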
  • Kazutaka Kinugawa, Hideya Mino, Isao Goto, Ichiro Yamada
    2022 Volume 29 Issue 2 Pages 638-668
    Published: 2022
    Released on J-STAGE: June 15, 2022
    JOURNAL FREE ACCESS

    In Japanese, time expressions are often unaccompanied by explicit temporal markers, so their temporal types are not always obvious. One of the most representative cases is the date–duration ambiguity of the commonly used time expression “** 日 [** nichi].” To build a supervised classifier for this ambiguity while minimizing the annotation burden, we introduce an automatic label generation method that uses a bilingual corpus. Inspired by annotation projection techniques, we associate Japanese time expressions with their corresponding English words; the ambiguity of a Japanese time expression is comparatively easy to resolve from its associated English words. We prepared several simple rules to determine temporal-type labels from sentence pairs and automatically created a training set for this task. A human evaluation verified that 98.7% of the sampled labels match hand-crafted labels. We then trained a classification model on these examples and compared our automatically created examples with existing manually annotated data. Experimental results show that the produced examples improve classification models by up to 14.0 accuracy points. Hence, our label generation method not only minimizes the annotation burden but is also sufficiently reliable for building temporal-type classifiers. (An illustrative sketch of the projection rules follows this entry.)

    Download PDF (340K)
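
    The label projection can be sketched as a handful of rules over the English side of each sentence pair: certain English cues make the date/duration reading of “** 日” unambiguous. The cue patterns below are invented examples, and scanning the whole English sentence is a simplification of the paper's association between Japanese time expressions and their corresponding English words.

    ```python
    import re

    def project_label(english_sentence):
        """Infer the temporal type of a time expression from the English side."""
        if re.search(r"\bfor\s+\w+\s+days?\b", english_sentence, re.I):
            return "DURATION"              # e.g. "for three days"
        if re.search(r"\bon\s+[A-Z][a-z]+\s+\d+", english_sentence):
            return "DATE"                  # e.g. "on June 15"
        return None                        # leave unlabeled

    pairs = [
        ("3 日間滞在した。", "I stayed there for three days."),
        ("6 月 15 日に公開された。", "It was released on June 15."),
    ]
    for ja, en in pairs:
        print(ja, "->", project_label(en))
    ```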
  • Tetsuro Nishihara, Yuji Iwamoto, Masato Yoshinaka, Tomoyuki Kajiwara, ...
    2022 Volume 29 Issue 2 Pages 669-687
    Published: 2022
    Released on J-STAGE: June 15, 2022
    JOURNAL FREE ACCESS

    Supervised quality estimation methods require a corpus in which the quality of translation outputs is manually annotated. To avoid such a costly annotation process, previous studies have proposed unsupervised quality estimation methods based on machine translation models trained on large-scale parallel corpora; however, these methods are not applicable to low-resource or zero-resource language pairs. This study addresses the problem by utilizing a pre-trained multilingual denoising autoencoder. Specifically, the proposed method constructs a machine translation model by fine-tuning the multilingual denoising autoencoder on parallel corpora, and then estimates translation quality as the forced-decoding probability of a translation output given its source sentence. Because the pre-trained denoising autoencoder captures linguistic characteristics across languages, our method can evaluate the translation quality of low-resource and zero-resource language pairs. Evaluation on the WMT20 quality estimation task confirms that the proposed method achieves the best unsupervised quality estimation performance for five language pairs under the black-box setting. Detailed analysis shows that the proposed method also performs well under the zero-shot setting. (An illustrative sketch of forced-decoding scoring follows this entry.)

    Download PDF (468K)
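
    Forced-decoding scoring reduces to reading off the model's log-probabilities of the given translation under teacher forcing. The sketch below shows only that scoring step, with random logits standing in for the decoder outputs of the fine-tuned multilingual denoising autoencoder; length normalization by the mean is an illustrative choice.

    ```python
    import torch
    import torch.nn.functional as F

    def forced_decoding_logprob(logits, target_ids):
        """logits: (seq, vocab) decoder outputs when teacher-forcing target_ids."""
        log_probs = F.log_softmax(logits, dim=-1)
        token_scores = log_probs[torch.arange(len(target_ids)), target_ids]
        return token_scores.mean().item()   # higher = better estimated quality

    vocab, seq = 100, 6
    logits = torch.randn(seq, vocab)            # stand-in decoder outputs
    target = torch.randint(0, vocab, (seq,))    # the translation to score
    print(forced_decoding_logprob(logits, target))
    ```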
Society Column (Non Peer-Reviewed)
Supporting Member Column (Non Peer-Reviewed)
Information (Non Peer-Reviewed)