2024, Vol. 31, No. 3, pp. 1193-1238
Characters are the smallest units of written language, and humans understand texts by building them up from characters. Past studies have trained language models on information obtained from character sequences (character-level information) in addition to tokens, improving performance on various natural language processing tasks in various languages. However, these studies measured the benefit of character-level information only as the performance difference between models trained with and without characters, so the extent to which the models actually use character-level information to solve these tasks remains unclear. How linguistic features such as morphological factors produce the performance differences observed across languages is also still under investigation. In this study, we examine existing character-aware neural models and how their performance varies with character-level information. We focus on four languages (English, German, Italian, and Dutch) and three tasks: part-of-speech (POS) tagging, dependency parsing, and Discourse Representation Structure (DRS) parsing. The experimental results show that character-level information has the greatest effect on model performance for POS tagging and dependency parsing in German and for DRS parsing in Italian. Based on these results, we hypothesize that the large effect in German is caused by the average length of its words and the forms of its common nouns (for example, German capitalizes all common nouns, a cue directly visible at the character level). A detailed analysis reveals a strong correlation between average word length and the effectiveness of character-level information for POS tagging in German.
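
For concreteness, the sketch below (PyTorch) shows one widely used way of adding character-level information to a neural tagger: a character-level BiLSTM whose final states are concatenated to the word embedding. This is an illustrative assumption about the general setup, not the specific architectures evaluated in the paper; the class name, the use_chars switch, and all dimensions are hypothetical.

    import torch
    import torch.nn as nn

    class CharAwareWordEncoder(nn.Module):
        """Word representation with an optional character-level view."""

        def __init__(self, n_words, n_chars,
                     word_dim=100, char_dim=25, char_hidden=25):
            super().__init__()
            self.word_emb = nn.Embedding(n_words, word_dim)
            self.char_emb = nn.Embedding(n_chars, char_dim)
            # BiLSTM over the characters of each word; its two final hidden
            # states serve as the word's character-level representation.
            self.char_lstm = nn.LSTM(char_dim, char_hidden,
                                     bidirectional=True, batch_first=True)

        def forward(self, word_ids, char_ids, use_chars=True):
            # word_ids: (n_tokens,); char_ids: (n_tokens, max_word_len)
            w = self.word_emb(word_ids)              # (n_tokens, word_dim)
            if not use_chars:                        # word-only baseline
                return w
            c = self.char_emb(char_ids)              # (n_tokens, len, char_dim)
            _, (h, _) = self.char_lstm(c)            # h: (2, n_tokens, char_hidden)
            chars = torch.cat([h[0], h[1]], dim=-1)  # forward + backward states
            return torch.cat([w, chars], dim=-1)     # word + character view

Comparing a tagger built on use_chars=True against the use_chars=False baseline is the "with versus without characters" performance difference that past studies reported and that this study analyzes in more depth.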