JGLUE: Japanese General Language Understanding Evaluation

To develop high-performance natural language understanding (NLU) models, it is necessary to have a benchmark to evaluate and analyze NLU ability from various perspectives. While the English NLU benchmark, GLUE, has been the forerunner, benchmarks are now being released for languages other than English, such as CLUE for Chinese and FLUE for French; but there is no such benchmark for Japanese. We build a Japanese NLU benchmark, JGLUE, from scratch without translation to measure the general NLU ability in Japanese. We hope that JGLUE will facilitate NLU research in Japanese.


Introduction
To develop high-performance natural language understanding (NLU) models, it is necessary to have a benchmark (a set of datasets) to evaluate and analyze NLU ability from various perspectives. In the case of English, the GLUE (General Language Understanding Evaluation) benchmark (Wang et al., 2018) is publicly available. Once an NLU model that can achieve a certain level of high score on GLUE is developed, a more difficult benchmark, such as SuperGLUE (Wang et al., 2019), is released, creating a virtuous cycle of benchmark construction and NLU model development. Along with the trend of active NLU studies in English, benchmarks for languages other than English have been constructed, including CLUE (Xu et al., 2020) for Chinese, FLUE (Le et al., 2020) for French, and KLUE (Park et al., 2021) for Korean. Although there are many studies on Japanese, which is the 13th most spoken language in the world as of 2021, there is no benchmark such as GLUE. Japanese is linguistically different from English and other languages in the following aspects.
• The Japanese alphabet includes hiragana, katakana, Chinese characters, and the Latin alphabet.
• There are no spaces between words.
• The word order is relatively free.
Due to these differences, findings on English datasets are not necessarily applicable to Japanese. Given this situation, there is an urgent need to develop a benchmark for Japanese NLU. Although individual Japanese datasets, such as JSNLI (Yoshikoshi et al., 2020) and JSICK (Yanaka and Mineshima, 2021), have been constructed, their construction methods mainly involve machine translation or manual translation from English datasets. With either translation method, the unnaturalness of the translated text and the cultural/social discrepancy between the original language (mostly English) and the target language (Japanese in our case) are major problems, as discussed in Clark et al. (2020) and Park et al. (2021). Although there are also Japanese datasets in specific domains, such as hotel reviews (Hayashibe, 2020) and the driving domain (Takahashi et al., 2019), these are not suitable for evaluating NLU ability in the general domain. In this study, we build a general NLU benchmark for Japanese, JGLUE, from scratch without translation. JGLUE is designed to cover a wide range of GLUE and SuperGLUE tasks and consists of three kinds of tasks: text classification, sentence pair classification, and QA, as shown in Table 1. Each task consists of multiple datasets. JGLUE is available at https://randd.yahoo.co.jp/en/softwaredata#jglue. We hope that this benchmark will facilitate NLU research in Japanese.

Related Work
The first benchmark for evaluating NLU models is GLUE, which consists of two kinds of tasks, i.e., sentence classification and sentence pair classification, and nine datasets in total. SuperGLUE is a more difficult benchmark than GLUE and contains eight datasets. It keeps the most challenging dataset of GLUE, i.e., natural language inference (NLI), and adds more difficult tasks, such as QA and commonsense reasoning. Such benchmark construction in English has stimulated the development of NLU models, including BERT (Devlin et al., 2019) and many extended models. This situation has caused a growing movement to build NLU benchmarks in many languages, such as CLUE, FLUE, KLUE, IndicGLUE (Kakwani et al., 2020), ARLUE (Abdul-Mageed et al., 2021), ALUE (Seelawi et al., 2021), and CLUB (Rodriguez-Penagos et al., 2021), in Chinese, French, Korean, Indian languages, Arabic, and Catalan, respectively. Multilingual benchmarks, such as XGLUE (Liang et al., 2020), XTREME, and XTREME-R (Ruder et al., 2021), have also been built. Although they contain datasets in various languages, only a few of them include Japanese.

Label      Train    Dev    Test   Total
positive   165,477  4,832  4,895  175,204
negative   22,051   822    744    23,617
Overall    187,528  5,654  5,639  198,821

Table 2: Statistics of MARC-ja.

JGLUE Benchmark
JGLUE consists of the tasks of text classification, sentence pair classification, and QA, as shown in Table 1. In the following sections, we explain how we construct the datasets for each task. As one of the text classification datasets, JCoLA (the Japanese version of CoLA (Warstadt et al., 2019), the Corpus of Linguistic Acceptability) will be provided by another research organization. Since it is still under construction, this paper does not explain it. We use Yahoo! Crowdsourcing (https://crowdsourcing.yahoo.co.jp/) for all crowdsourcing tasks in constructing each dataset.

MARC-ja
As one of the text classification datasets, we build a dataset based on the Multilingual Amazon Reviews Corpus (MARC) (Keung et al., 2020). MARC is a multilingual corpus of product reviews with 5-level star ratings (1-5) on the Amazon shopping site. This corpus covers six languages, including English and Japanese. For JGLUE, we use the Japanese part of MARC. To make it easy for both humans and computers to judge a class label, we cast the text classification task as a binary classification task: 1- and 2-star ratings are converted to "negative", and 4- and 5-star ratings are converted to "positive". We do not use reviews with a 3-star rating.
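The rating-to-label conversion can be sketched as follows (a minimal illustration; the function name is ours and not part of the MARC distribution):

```python
def star_to_label(stars):
    """Map a 1-5 star rating to a binary sentiment label.

    3-star reviews are ambiguous and are dropped (None).
    """
    if stars in (1, 2):
        return "negative"
    if stars in (4, 5):
        return "positive"
    return None  # 3-star reviews are not used

# Keep only reviews that receive a label.
reviews = [("great product", 5), ("broke quickly", 1), ("it's okay", 3)]
labeled = [(text, star_to_label(s)) for text, s in reviews
           if star_to_label(s) is not None]
```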
One of the problems with MARC is that it sometimes contains data where the rating diverges from the review text. This happens, for example, when a review with positive content is given a rating of 1 or 2. These data degrade the quality of our dataset.
To improve the quality of the dev/test instances used for evaluation, we crowdsource a positive/negative judgment task for approximately 12,000 reviews. We adopt only the reviews for which 7 or more out of 10 workers gave the same vote and assign the majority label to these reviews. We divide the resulting reviews into dev/test data; through this procedure, we obtained 5,654 and 5,639 instances for the dev and test data, respectively. For the training data, we extracted 187,528 instances directly from MARC without performing the cleaning procedure because of the large number of training instances. The statistics of MARC-ja are listed in Table 2.
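The vote aggregation described above can be sketched as follows (a minimal sketch under the stated 7-of-10 agreement rule; the helper name is ours):

```python
from collections import Counter

def aggregate_votes(votes, min_agreement=7):
    """Return the majority label if at least `min_agreement` workers
    agree; otherwise None (the review is discarded)."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agreement else None
```

For example, a review with 8 "positive" and 2 "negative" votes is kept as "positive", while a 6-4 split is discarded.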
For the evaluation metric for MARC-ja, we use accuracy because it is a binary classification task of texts.

JSTS and JNLI
For the sentence pair classification datasets, we construct a semantic textual similarity (STS) dataset, JSTS, and a natural language inference (NLI) dataset, JNLI.
Overview

STS is a task of estimating the semantic similarity of a sentence pair. Gold similarity is usually assigned as the average of integer values from 0 (completely different meaning) to 5 (equivalent meaning) assigned by multiple workers through crowdsourcing. NLI is a task of recognizing the inference relation that a premise sentence has to a hypothesis sentence. Inference relations are generally defined by three labels: "entailment", "contradiction", and "neutral". Gold inference relations are often assigned by majority voting after collecting answers from multiple workers through crowdsourcing.
For the STS and NLI tasks, STS-B (Cer et al., 2017) and MultiNLI (Williams et al., 2018) are included in GLUE, respectively. As Japanese datasets, JSNLI (Yoshikoshi et al., 2020) is a machine-translated version of the NLI dataset SNLI (Stanford NLI), and JSICK (Yanaka and Mineshima, 2021) is a human-translated version of the STS/NLI dataset SICK (Marelli et al., 2014). As mentioned in Section 1, these have problems originating from automatic/manual translation. To solve this problem, we construct STS/NLI datasets in Japanese from scratch. We basically extract sentence pairs in JSTS and JNLI from the Japanese version of the MS COCO Caption Dataset (Chen et al., 2015), namely the YJ Captions Dataset (Miyazaki and Shimizu, 2016). Most of the sentence pairs in JSTS and JNLI overlap, allowing us to analyze the relationship between similarities and inference relations for the same sentence pairs, as in SICK and JSICK. The similarity value in JSTS is a real number from 0 to 5, as in STS-B. The inference relation in JNLI is one of the above three labels, as in SNLI and MultiNLI. The definitions of the inference relations are also based on SNLI.

Example captions (with English translations):

1-5: 歩道の反対側を車が走っている。 (A car is running on the other side of the sidewalk.)
2-5: 山の麓に木々が生えている。 (The trees are growing at the foot of the mountain.)
i-1: 夕焼けに照らされている男性。 (A man is illuminated by the setting sun.)
i-2: 短髪の男性が立っている。 (A man with short hair is standing there.)
i-5: 黒い服を着た男性が笑っている。 (A man dressed in black is laughing.)

Sentence 1 / Premise | Sentence 2 / Hypothesis | Similarity | Relation | Origin
街中の道路を大きなバスが走っています。 (A big bus is running on the road in the city.) | 道路を大きなバスが走っています。 (There is a big bus running on the road.) | 4.4 | entailment | A
テーブルに料理がならべられています。 (The food is laid out on the table.) | テーブルに食べかけの料理があります。 (There are some dishes on the table that are about to be eaten.) | 3.0 | neutral | A
野球選手がバットをスイングしています。 (A baseball player swings a bat.) | 野球選手がキャッチボールをしています。 (A baseball player plays catch.) | 2.0 | contradiction | C
フリスビーをくわえた犬がいます。 (There is a dog with a Frisbee in its mouth.) | 建物の前にバスが一台停車しています。 (There is a bus parked in front of the building.) | 0.0 | - | B

Table 3: Examples of JSTS and JNLI.

Method of Construction
Our construction flow for JSTS and JNLI is shown in Figure 1. Basically, two captions for the same image of YJ Captions are used as sentence pairs. For these sentence pairs, similarities and NLI relations of entailment and neutral are obtained by crowdsourcing. However, it is difficult to collect sentence pairs with low similarity and contradiction relations from captions for the same image. To solve this problem, we collect sentence pairs with low similarity from captions for different images. We collect contradiction relations by asking workers to write contradictory sentences for a given caption. The detailed construction procedure for JSTS and JNLI is described below.
1. We crowdsource an STS task using two captions for the same image from YJ Captions. We ask five workers to answer the similarity between the two captions and take the mean value as the gold similarity. We delete sentence pairs with a large variance in the answers because such pairs have poor answer quality. We performed this task on 16,000 sentence pairs and deleted sentence pairs with a similarity variance of 1.0 or higher, resulting in the collection of 10,236 sentence pairs with gold similarity. We refer to this collected data as JSTS-A.
2. To collect sentence pairs with low similarity, we perform the same STS task on pairs of captions taken from different images. We refer to this collected data as JSTS-B.

3. We crowdsource an NLI task for the sentence pairs of JSTS-A, asking workers to answer inference relations in both directions for sentence pairs. As mentioned earlier, it is difficult to collect instances of contradiction from JSTS-A, which was collected from the captions of the same images, and thus we collect instances of entailment and neutral in this step. We collect inference relation answers from 10 workers. If six or more workers give the same answer, we adopt it as the gold label, provided it is entailment or neutral. To obtain inference relations in both directions for JSTS-A, we performed this task on 20,472 sentence pairs, twice as many as JSTS-A. As a result, we collected inference relations for 17,501 sentence pairs. We refer to this collected data as JNLI-A. We do not use JSTS-B for the NLI task because it is difficult to define and determine the inference relations between captions of different images.

4. To collect NLI instances of contradiction, we crowdsource a task of writing four contradictory sentences for each caption in YJ Captions. From the written sentences, we remove sentence pairs with an edit distance of 0.75 or higher to remove low-quality sentences, such as short sentences and sentences with low relevance to the original sentence.
Furthermore, we perform a one-way NLI task with 10 workers to verify whether the created sentence pairs are contradictory. Only the sentence pairs answered as contradiction by at least six workers are adopted. Finally, since the contradiction relation has no direction, we automatically assign contradiction in the opposite direction of the adopted sentence pairs. Using 1,800 captions, we acquired 7,200 sentence pairs, from which we collected 3,779 sentence pairs to which we assigned the one-way contradiction relation. By automatically assigning the contradiction relation in the opposite direction, we doubled the number of instances to 7,558. We refer to this collected data as JNLI-C.

5. For the 3,779 sentence pairs collected in Step 4, we crowdsource an STS task, assigning similarity and filtering in the same way as in Steps 1 and 2. In this way, we collected 2,303 sentence pairs with gold similarity from the 3,779 pairs. We refer to this collected data as JSTS-C.
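The score aggregation of Step 1 and the edit-distance filter of Step 4 can be sketched as follows (a minimal sketch; we assume population variance for the 1.0 threshold and assume the 0.75 threshold refers to a length-normalized edit distance, as the text does not spell these out):

```python
def aggregate_similarity(scores, max_variance=1.0):
    """Average workers' 0-5 similarity scores (Step 1); pairs whose
    score variance is `max_variance` or higher are discarded (None)."""
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    return round(mean, 1) if variance < max_variance else None

def normalized_edit_distance(a, b):
    """Levenshtein distance divided by the longer string length,
    computed with a single-row dynamic-programming table."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, n, 1)
```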
We constructed JSTS from JSTS-A, B, and C, and JNLI from JNLI-A and C. Finally, we filtered out 12 sentence pairs from JSTS and 44 pairs from JNLI based on automatic matching and manual checking. Table 3 shows examples of the JSTS and JNLI datasets. The statistics of JSTS and JNLI are listed in Tables 4 and 5, respectively.

To examine the quality of JSTS, we calculated the variance of the similarities of each sentence pair answered by 10 crowdworkers and took the mean and standard deviation over all the pairs. The resulting values were sufficiently small, as listed in Table 6. These results support the quality of our annotation.
To assess the inter-annotator agreement of JNLI, we calculated Fleiss' Kappa values for 10 crowdworkers' answers of all the sentence pairs. Its value was 0.399, demonstrating fair to moderate agreement. Although this result showed that each answer was not very reliable, aggregated labels obtained by majority voting could be reliable as shown in the human scores (reported in Section 4.2).
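Fleiss' kappa, used above for inter-annotator agreement, can be computed from the per-item label counts as follows (a self-contained sketch of the standard formula, not the exact evaluation script):

```python
from collections import Counter

def fleiss_kappa(label_matrix,
                 categories=("entailment", "contradiction", "neutral")):
    """Fleiss' kappa for N items, each labeled by the same number of raters.
    `label_matrix` is a list of per-item label lists."""
    n_items = len(label_matrix)
    n_raters = len(label_matrix[0])
    counts = [Counter(labels) for labels in label_matrix]  # n_ij per item
    total = n_items * n_raters
    # p_j: overall proportion of assignments to each category
    p = [sum(c[cat] for c in counts) / total for cat in categories]
    # P_i: per-item observed agreement, averaged over items
    p_bar = sum(
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ) / n_items
    p_e = sum(pj ** 2 for pj in p)  # chance agreement
    return (p_bar - p_e) / (1 - p_e)
```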

Evaluation Metric
The evaluation metric for JSTS is the Pearson and Spearman correlation coefficients, following STS-B, and that for JNLI is accuracy, following SNLI and MultiNLI.
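Both coefficients can be computed without external libraries; Spearman's coefficient is simply Pearson's applied to average ranks (a sketch; evaluation scripts typically use scipy.stats instead):

```python
def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson on ranks (ties get their mean rank)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # mean of 1-based ranks i+1..j+1
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r
    return pearson(ranks(x), ranks(y))
```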

JSQuAD
As QA datasets, we build JSQuAD, a Japanese version of SQuAD (Rajpurkar et al., 2016), one of the reading comprehension datasets, and a Japanese version of CommonsenseQA, which is explained in the next section. Reading comprehension is the task of reading a document and answering questions about it. Many reading comprehension evaluation sets have been built in English, followed by those in other languages or multilingual ones. In Japanese, reading comprehension datasets for quizzes (Suzuki et al., 2018) and those in the driving domain (Takahashi et al., 2019) have been built, but there is no dataset in the general domain.

Method of Construction

First, to extract high-quality articles from Wikipedia, we use Nayuki, which estimates the quality of articles on the basis of hyperlinks in Wikipedia. We randomly chose 822 articles from the top-ranked 10,000 articles. For example, the articles include "熊本県 (Kumamoto Prefecture)" and "フランス料理 (French cuisine)". Next, we divide each article into paragraphs, present each paragraph to crowdworkers, and ask them to write questions and answers that can be answered if one understands the paragraph. Figure 2 shows an example of JSQuAD. We ask workers to write two additional answers for the dev and test sets to make the system evaluation robust.

Evaluation Metric

The evaluation metrics for JSQuAD are exact match (EM) and F1, following SQuAD. Because there are no explicit word boundaries in Japanese, the F1 value differs depending on the word segmenter used. Therefore, we calculate it on a character level.
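Character-level F1 can be sketched as follows (a minimal illustration; the full SQuAD-style evaluation additionally takes the maximum score over the multiple gold answers written for dev/test):

```python
from collections import Counter

def char_f1(prediction, gold):
    """Character-level F1 between a predicted and a gold answer span,
    avoiding dependence on any particular Japanese word segmenter."""
    pred, ref = Counter(prediction), Counter(gold)
    overlap = sum((pred & ref).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```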

JCommonsenseQA

Overview
JCommonsenseQA is a Japanese version of CommonsenseQA (Talmor et al., 2019), which consists of five-choice QA to evaluate commonsense reasoning ability. Figure 3 shows examples of JCommonsenseQA. In the same way as CommonsenseQA, JCommonsenseQA is built using crowdsourcing with seeds extracted from the knowledge base ConceptNet (Speer et al., 2017). ConceptNet is a multilingual knowledge base that consists of triplets of two concepts and their relation. The triplets are directional and represented as (source concept, relation, target concept), for example, (bullet train, AtLocation, station).
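Deriving question sets from such directional triplets can be sketched as follows (a toy illustration with hypothetical helper names; the real extraction uses the full ConceptNet data and the relation filtering described below):

```python
from collections import defaultdict

# Directional triplets: (source concept, relation, target concept).
triplets = [
    ("bullet train", "AtLocation", "station"),
    ("timetable", "AtLocation", "station"),
    ("ticket gate", "AtLocation", "station"),
]

def question_sets(triplets):
    """Group target concepts sharing a (concept, relation) pair, in both
    the forward and inverse directions, and keep groups of three."""
    groups = defaultdict(list)
    for src, rel, tgt in triplets:
        groups[(src, rel)].append(tgt)          # forward direction
        groups[(tgt, rel + "^-1")].append(src)  # inverse direction
    return {key: tgts[:3] for key, tgts in groups.items()
            if len(tgts) >= 3}

qs = question_sets(triplets)
# The inverse direction yields a question set for the source "station".
```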

Method of Construction
The construction flow for JCommonsenseQA is shown in Figure 4. First, we collect question sets (QSs) from ConceptNet, each of which consists of a source concept and three target concepts that have the same relation to the source concept. Next, for each QS, we crowdsource a task of writing a question with only one target concept as the answer and a task of adding two distractors. We describe the detailed construction procedure for JCommonsenseQA below, showing how it differs from CommonsenseQA.
1. We collect Japanese QSs from ConceptNet. CommonsenseQA uses only forward relations (source concept, relation, target concept), excluding general ones such as "RelatedTo" and "IsA". JCommonsenseQA similarly uses a set of 22 relations [5], excluding general ones, but the direction of the relations is bidirectional to make the questions more diverse. In other words, we also use relations in the opposite direction (source concept, relation^-1, target concept). [6] With this setup, we extracted 43,566 QSs with Japanese source/target concepts and randomly selected 7,500 from them.
2. Some low-quality questions in CommonsenseQA contain distractors that can be considered to be an answer. To improve the quality of distractors, we add the following two processes that are not adopted in CommonsenseQA. First, if the three target concepts of a QS include a spelling variation or a synonym of one another, the QS is removed. To identify spelling variations, we use the word IDs of the morphological dictionary JumanDic [7]. Second, we crowdsource a task of judging whether the target concepts contain a synonym. As a result, we adopted 5,920 QSs out of 7,500.
3. For each QS, we crowdsource a task of writing a question sentence in which only one of the three target concepts is the answer. In the example shown in Figure 4, "駅 (station)" is the answer, and the others are distractors. To remove low-quality question sentences, we remove the following:
• Question sentences that contain a choice word (because such a question is easily solved).
• Question sentences that contain the expression "XX characters" (where XX is a number). [8]
• Improperly formatted question sentences that do not end with "?".

As a result, 5,920 × 3 = 17,760 question sentences were created, from which we adopted 15,310 by removing inappropriate question sentences.

4. In CommonsenseQA, when adding distractors, one is selected from ConceptNet, and the other is created by crowdsourcing. In JCommonsenseQA, to have a wider variety of distractors, two distractors are created by crowdsourcing instead of being selected from ConceptNet.

Footnotes:
[5] The relations are Antonym, AtLocation, CapableOf, Causes, CausesDesire, DefinedAs, DerivedFrom, Desires, DistinctFrom, EtymologicallyDerivedFrom, HasA, HasFirstSubevent, HasLastSubevent, HasPrerequisite, HasProperty, InstanceOf, MadeOf, MotivatedByGoal, NotDesires, PartOf, SymbolOf, and UsedFor.
[6] For example, from triplets such as (station, AtLocation^-1, bullet train), we obtain the target concepts "bullet train", "timetable", and "ticket gate" for the source concept "station".
[7] https://github.com/ku-nlp/JumanDIC
[8] This is set up to exclude questions like "What is a word that means overpriced in two Chinese characters?".
To improve the quality of the questions (a question here refers to a set of a question sentence and its choices), we remove questions whose added distractors fall into one of the following categories: (a) Distractors are included in the question sentence.
(b) Distractors overlap with one of the existing choices.
As a result, distractors were added to the 15,310 questions, of which we adopted 13,906.
5. We ask three crowdworkers to answer each question and adopt only the questions answered correctly by at least two workers. As a result, we adopted 11,263 of the 13,906 questions.
Finally, we filtered out 14 questions based on automatic pattern matching and manual checking.
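The surface filters on question sentences described in Step 3 can be sketched as follows (a minimal sketch; we assume half-width digits for the "XX文字" pattern, and the real pipeline may apply additional normalization):

```python
import re

def is_valid_question(question, choices):
    """Apply the Step 3 filters: drop questions that contain a choice
    word, contain an 'XX characters' (XX文字) expression, or do not
    end with a question mark."""
    if any(choice in question for choice in choices):
        return False  # trivially solvable: the answer appears verbatim
    if re.search(r"\d+文字", question):
        return False  # "word of XX characters" riddle pattern
    if not question.endswith(("?", "？")):
        return False  # improperly formatted
    return True
```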

Evaluation Metric
The evaluation metric for JCommonsenseQA is accuracy following CommonsenseQA.

Evaluation using JGLUE
By using the constructed benchmark, we evaluated several publicly available pretrained models.

Experimental Settings
The pretrained models used in the experiments are shown in Table 7, where MeCab (Kudo et al., 2004) and Juman++ (Morita et al., 2015) are Japanese word segmenters and "CC" in the pretraining texts represents Common Crawl. These models were fine-tuned for each task/dataset as follows, using the transformers library provided by Hugging Face (https://github.com/huggingface/transformers):

• Text classification and sentence pair classification tasks: classification/regression problems with vector representations of the [CLS] tokens.
• JSQuAD: the classification problem of whether each token in a paragraph is a start/end position of an answer span. (XLM-RoBERTa BASE and XLM-RoBERTa LARGE use the unigram language model as a tokenizer and are excluded from this task because the token delimitation and the start/end of the answer span often do not match, resulting in poor performance.)

The best hyperparameters were searched using the dev set, and the performance was evaluated on the test set using those hyperparameters. The hyperparameters used are listed in Table 8.

Results
Table 9 shows the performance of each model along with human scores. The human scores were obtained using crowdsourcing in the same way as the dataset construction. The comparison of the models is summarized as follows:
• Overall, XLM-RoBERTa LARGE performed the best. This may be due to its large model size and its use of Common Crawl, which is larger than Wikipedia, as pretraining text.
• As for the basic unit, the subword-based model (Tohoku BERT BASE) performed consistently better than the character-based model (Tohoku BERT BASE (char)).
• Since JCommonsenseQA requires commonsense knowledge that is hard to describe in Wikipedia, the models pretrained on Common Crawl performed better. Figure 5 shows an example where the output of XLM-RoBERTa LARGE (which uses Common Crawl as pretraining text) was correct while the output of Tohoku BERT BASE (which does not use Common Crawl) was incorrect.
• On all the datasets other than JCommonsenseQA, the performance of the best model equaled or exceeded the human score.
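As described in the experimental settings, JSQuAD fine-tuning casts answer extraction as start/end token classification, which requires mapping a gold answer's character span to token indices. A minimal sketch with a toy whitespace tokenizer (real models use subword tokenizers with offset mappings):

```python
def token_offsets(text):
    """Toy whitespace tokenizer returning (start, end) character offsets."""
    offsets, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        offsets.append((start, start + len(tok)))
        pos = start + len(tok)
    return offsets

def char_span_to_token_span(text, answer_start, answer_text):
    """Return indices of the first/last tokens overlapping the answer."""
    answer_end = answer_start + len(answer_text)
    start_tok = end_tok = None
    for i, (s, e) in enumerate(token_offsets(text)):
        if s < answer_end and e > answer_start:  # token overlaps answer
            if start_tok is None:
                start_tok = i
            end_tok = i
    return start_tok, end_tok
```

Tokenizers whose segmentation does not align with answer boundaries leave no clean token span, which is the mismatch that led to excluding the XLM-RoBERTa models from this task.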

Discussion
Is the amount of training data enough? The amount of training data was reduced to 0.75 and 0.5 of its original size to see how the performance changed. The model with the best performance on each dataset was used. The learning curves are shown in Figure 6. Note that the performance of XLM-RoBERTa LARGE on JCommonsenseQA at a fraction of 0.5 is extremely low, and this datapoint is therefore excluded from the graph. The performance is almost saturated for all the datasets, indicating that the amount of the constructed data is sufficient.

Annotation artifacts in JNLI In datasets constructed by asking crowdworkers to write sentences, a problem called annotation artifacts arises, especially in NLI (Poliak et al., 2018; Tsuchiya, 2018). If hypothesis sentences are written by workers and include annotation artifacts, a system looking only at hypotheses could achieve moderate performance. We tested this hypothesis-only baseline on JNLI. First, we extracted a subset of JNLI for this experiment. Specifically, from the sentence pairs whose relation is contradiction, we extracted the sentence pairs in which a worker-generated contradictory sentence is the hypothesis. From the sentence pairs whose relation is entailment or neutral, we extracted one-way sentence pairs. We then compared the hypothesis-only baseline with the majority baseline, where all the outputs are neutral. The results are shown in Table 10.
Since the hypothesis-only baseline using the Tohoku BERT BASE model outperformed the majority baseline, annotation artifacts are presumably present. We hope that studies on mitigating annotation artifacts will be conducted on our constructed dataset.
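The majority baseline used above can be sketched as follows (a minimal sketch; the helper name is ours):

```python
from collections import Counter

def majority_baseline_accuracy(gold_labels):
    """Return the most frequent gold label and the accuracy obtained
    by always predicting it (e.g., 'neutral' on the JNLI subset)."""
    majority, count = Counter(gold_labels).most_common(1)[0]
    return majority, count / len(gold_labels)
```

A hypothesis-only classifier is evidence of artifacts exactly when its accuracy exceeds this value.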

Lexical overlap in JSQuAD
To assess the quality of JSQuAD, we investigated lexical overlap, a problem pointed out for SQuAD (Clark et al., 2020). Lexical overlap is the ratio of word overlap between a paragraph and a question. It is reported that the larger the ratio is, the more easily the question can be solved by a model. We calculated the ratio of lexical overlap for each paragraph and question pair of JSQuAD by segmenting them into words. As a result, its average value was 0.795, indicating that JSQuAD shares this problem with SQuAD. Because there has been no benchmark in Japanese so far, we expect that studies on this problem in Japanese will proceed with our benchmark as a starting point.
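One plausible formulation of the overlap ratio is the fraction of question words that also appear in the paragraph (a sketch under that assumption; the exact formula and word segmenter are not specified here, and Japanese input must first be segmented into words):

```python
def lexical_overlap(paragraph_words, question_words):
    """Fraction of distinct question words that also occur in the
    paragraph; higher values suggest an easier question."""
    question = set(question_words)
    if not question:
        return 0.0
    return len(question & set(paragraph_words)) / len(question)
```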

Conclusion and Future Work
This paper described the construction procedure of JGLUE, a general language understanding benchmark for Japanese. We hope that JGLUE will be used to comprehensively evaluate pretrained models and to construct more difficult NLU datasets, such as HotpotQA (Yang et al., 2018), a multi-hop QA dataset, and Adversarial GLUE. In the future, we plan to build Japanese datasets for generation tasks such as GLGE and for few-shot tasks such as FLEX (Bragg et al., 2021).

Acknowledgements
This work was carried out in a joint research project between Yahoo Japan Corporation and Waseda University.