Database of Human Evaluations of Machine Translation Systems for Patent Translation

This paper discusses a database of human evaluations of patent machine translation, from Chinese to English, Japanese to English, and English to Japanese. The evaluations were conducted for the NTCIR-9 Patent Machine Translation Task (PatentMT). Different types of systems, such as research systems and commercial systems, and rule-based systems and statistical machine translation systems were evaluated. Since human evaluation results are important when investigating automatic evaluation of translation quality, the database of the evaluation results is valuable. From the NTCIR project, resources including the human evaluation database, translation results, and test/reference data are available for research purposes.


Introduction
Automatic evaluation of translation quality is important for the development of machine translation systems; thus, it is an active area of research (Papineni, Roukos, Ward, and Zhu 2002; Lin and Hovy 2003; Isozaki, Hirao, Duh, Sudoh, and Tsukada 2010a). Human evaluation resources, which can be used as fundamental data for this research, have been published. NIST Metrics for Machine Translation (MetricsMATR) published human evaluation resources for the news domain and for Arabic/Chinese-to-English translations. The Workshop on Statistical Machine Translation (WMT) (Callison-Burch, Koehn, Monz, Post, Soricut, and Specia 2012) published human evaluation resources for news and European Parliament Proceedings and for translations between European languages.
The database of human evaluations that we introduce in this paper differs from the above-mentioned human evaluation resources in that it targets the patent domain and focuses on translations that include the Asian languages of Japanese and Chinese.

Task design
PatentMT had three patent machine translation subtasks: Chinese to English (CE), Japanese to English (JE), and English to Japanese (EJ). Participants chose the subtasks in which they wished to participate and were provided with training data, development data, and test data.
Participants translated the test data using their machine translation systems and submitted the translations to the PatentMT organizers. The PatentMT organizers evaluated the submitted translations and returned the evaluation results to the participants. Finally, the participants presented their research results at the NTCIR-9 workshop.

Data provided to the participants
The provided data consisted of training data, development data, test data, context documents, and reference data. The reference data was provided after the submission of translation results. The training data consisted of a parallel corpus and a monolingual corpus. The parallel sentence pairs for the training, development, and test/reference data were drawn from patent description sentences (patent documents consist of a title, abstract, claims, and description).
The parallel sentence pairs for the training data were automatically extracted from patent documents using bilingual dictionaries. The Chinese-English parallel sentence pairs were extracted from Patent Cooperation Treaty (PCT) patents in Chinese and English (Lu, Tsou, Jiang, Kwong, and Zhu 2010). The Japanese-English parallel sentence pairs were extracted from the patent family in Japanese and English (Utiyama and Isahara 2007). The training data was built from patent documents published between 1993 and 2005. The number of patent parallel sentence pairs for the training data was 1 million for Chinese-English and approximately 3.2 million for Japanese-English. The monolingual training data was a monolingual patent corpus in the target language spanning 13 years (1993-2005).
The test data was built by randomly selecting parallel sentences from a portion of the automatically built patent parallel sentence pairs published in 2006 and 2007, manually judging whether the sentence pairs were correct translations, and then selecting 2,000 correct sentence pairs as the test data and their reference data. The patent documents from which the test sentences were extracted were provided as context documents for the test data.

Evaluation methodology
We conducted human evaluations and regarded these as the primary evaluation.
Human evaluations were carried out by paid evaluation experts (footnote 6) and employed the criteria of adequacy and acceptability, which will be explained later. For each criterion, three evaluators evaluated 100 sentences per system. The three evaluators evaluated different sentences. Thus, 300 sentences were evaluated per system. The 300 sentences were randomly selected from the test sentences. In this evaluation, the evaluators looked at a source sentence and its translation results to be evaluated.
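The sampling scheme described above can be illustrated with a minimal sketch. The function name, the fixed seed, and the use of integer indices for test sentences are assumptions made for illustration; the source does not describe how the random selection was implemented.

```python
import random

def assign_sentences(num_test=2000, num_evaluators=3, per_evaluator=100, seed=0):
    """Randomly pick sentences from the test set and split them among evaluators.

    Hypothetical helper: the seed and the index-based representation are
    illustrative assumptions, not part of the PatentMT procedure.
    """
    rng = random.Random(seed)
    # Sample without replacement so that no sentence is evaluated twice.
    sampled = rng.sample(range(num_test), num_evaluators * per_evaluator)
    # Each evaluator receives a disjoint block of the sampled sentences.
    return [
        sampled[i * per_evaluator:(i + 1) * per_evaluator]
        for i in range(num_evaluators)
    ]

groups = assign_sentences()
print([len(g) for g in groups])
```

Each evaluator then sees a disjoint set of 100 sentences, giving 300 evaluated sentences per system.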

Adequacy
We conducted a 5-scale (1 to 5) adequacy evaluation. The main purpose of the adequacy evaluation was to compare the systems.
Adequacy can be defined in multiple ways. White, O'Connell, and O'Mara (1994) defined it as how much of the information from a fragment of a reference sentence is contained in the translation results. They argued that fragmentation is intended to avoid biasing the results in favor of linguistic compositional approaches (which may do relatively better on longer, clause-level strings) or statistical approaches (which may do better on shorter strings not associated with syntactic constituency). However, this evaluation cannot determine whether the sentence meaning is correct, because simply containing all of the fragments of the reference information does not guarantee a correct sentence meaning. The NTCIR-7 Patent Translation Task (Fujii et al. 2008) conducted adequacy evaluations using a criterion based on the degree of preservation of sentence-level meaning instead of the proportion of reference-information fragments contained.
We believed that, for evaluating translation quality, the degree of sentence-level meaning preservation is a better criterion than the proportion of reference-information fragments contained. However, since the cost of checking sentence meanings is high, we evaluated quality considering the clause-level meanings for adequacy.
The instructions for the adequacy criterion are given in Appendix A.1. Examples of adequacy values and translations are shown in Appendix A.2.
The systems were ranked based on adequacy using the average system scores.
Footnote 6: Because we evaluated machine translations and not human translations, we did not perform evaluations in the same manner as for human translations. The evaluators were not patent experts for the domains. This evaluation did not check whether the translations of technical terms were perfect. The writing style for the test data is a general style because all of the test sentences were from description sections of patents and were not from claim sections. Therefore, the evaluators, who are not patent experts, can distinguish whether a translated sentence represents the source sentence meaning. The evaluator profiles were as follows: For CE, adequacy: Chinese native speakers who can understand English; acceptability: Chinese native speakers whose English abilities are very high. For JE, adequacy and acceptability: English native speakers who can understand Japanese. For EJ, adequacy and acceptability: Japanese native speakers who can understand English.

Acceptability
We conducted a 5-scale acceptability evaluation as shown in Fig. 1. The main purpose of an acceptability evaluation is to clarify the percentage of translated sentences for which the source sentence meanings can be understood from randomly selected test sentences (footnote 7). Acceptability is an evaluation of sentence-level meaning. Compared with adequacy, the acceptability criterion used in this evaluation is aimed more at practical evaluation. For example, if the requirement of a translation system is that the source sentence meaning can be understood, translations of C or higher are useful; however, if the requirement is that the source sentence meaning can be understood and the sentence is grammatically correct, then only translations of A or higher are useful. We can then know how many sentences from a system would be useful for each requirement. An adequacy criterion cannot answer these requirements.
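The requirement-dependent reading of acceptability grades described above can be sketched as a threshold count. This is a minimal illustration: the grade order F < C < B < A < AA is an assumption (the text names F, C, A, and AA explicitly; B is inferred from the 5-scale design), and the grade list is toy data, not evaluation results.

```python
# Assumed ordering of the five acceptability grades, lowest to highest.
GRADE_ORDER = ["F", "C", "B", "A", "AA"]

def share_at_or_above(grades, threshold):
    """Return the fraction of sentences graded at `threshold` or higher."""
    cutoff = GRADE_ORDER.index(threshold)
    hits = sum(GRADE_ORDER.index(g) >= cutoff for g in grades)
    return hits / len(grades)

# Toy grades for five translated sentences from one hypothetical system.
toy_grades = ["AA", "A", "C", "F", "B"]
print(share_at_or_above(toy_grades, "C"))  # meaning can be understood
print(share_at_or_above(toy_grades, "A"))  # understood and grammatically correct
```

The same grade list answers both requirements by changing only the threshold, which is the property that an averaged adequacy score does not provide.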
Acceptability also incorporates fluency in the target language, since fluency affects the grading from C to AA. If the adequacy of a translation is very low, then the translation is not correct even if the fluency is high. If the integrated evaluation score is calculated by averaging the adequacy and fluency scores, then those translations could be overvalued.

Footnote 7: Informative, shown in the ALPAC report (Pierce, Carroll, Hamp, Hays, Hockett, Oettinger, and Perlis 1966) p. 70, is also an evaluation criterion of translation quality. Informative is a measure of how informative the original version is perceived to be after the translation has been seen, using a scale from 0 to 9. The upper grades are decided on the basis of whether a translation includes word-level or sentence structure-level errors. Because there are cases where the source sentence meaning can be understood as well as cases where it cannot be understood when a translation includes sentence structure-level errors, informative cannot clarify the difference in understandability. In contrast, acceptability can judge whether the source sentence meaning can be understood independent of the existence of word-level or sentence structure-level errors.

We ranked the systems based on acceptability using a pairwise comparison, which will now be explained. The pairwise score for a system A reflects how frequently it was judged to be better than or equal to other systems. Suppose there are five systems to be compared. For each input sentence, system A is included in four pairwise comparisons (against the other systems).
System A is awarded 1.0 for each comparison in which system A is ranked the higher of the two, and 0.5 for each comparison in which system A ties. System A's score is the total awarded score in the pairwise comparisons divided by the total number of pairwise comparisons involving system A.
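The pairwise score defined above can be sketched as follows. This is a minimal illustration under stated assumptions: the grade order F < C < B < A < AA is assumed (B is inferred from the 5-scale design), the function name and data layout are hypothetical, and the grades are toy data rather than evaluation results.

```python
from itertools import combinations

# Assumed ordinal rank of each acceptability grade (higher is better).
GRADE_RANK = {"F": 0, "C": 1, "B": 2, "A": 3, "AA": 4}

def pairwise_scores(grades):
    """grades: dict mapping a system name to its per-sentence grades.

    All systems must be graded on the same sentences, in the same order.
    """
    systems = list(grades)
    num_sentences = len(grades[systems[0]])
    reward = {s: 0.0 for s in systems}
    for i in range(num_sentences):
        for a, b in combinations(systems, 2):
            rank_a = GRADE_RANK[grades[a][i]]
            rank_b = GRADE_RANK[grades[b][i]]
            if rank_a > rank_b:
                reward[a] += 1.0      # a wins the comparison
            elif rank_b > rank_a:
                reward[b] += 1.0      # b wins the comparison
            else:
                reward[a] += 0.5      # a tie gives each system 0.5
                reward[b] += 0.5
    # Each system takes part in (number of systems - 1) comparisons per sentence.
    comparisons = num_sentences * (len(systems) - 1)
    return {s: reward[s] / comparisons for s in systems}

toy = {"X": ["AA", "C"], "Y": ["A", "C"], "Z": ["F", "F"]}
print(pairwise_scores(toy))
```

Because only the order of the grades is used, this score avoids assigning numeric values to the grades themselves.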
Note that the average score of acceptability was not used for system ranking. The reason is as follows. Here we assume that the differences between the grades are measured by general usability. It is important to be able to understand the contents from the source sentence. There is a large difference in usability between F and C. However, at the A-level, while the translations are at a non-native level, the contents from the source sentences can be understood and they are grammatically correct; thus, they have the potential to be useful in many cases. Thus, it is believed that the difference in usability between A and AA is smaller than that between F and C. In addition, we think that useful grades depend on specific usage. Therefore, it is difficult to give an appropriate score for each grade, and we avoided the conversion of grades to scores and calculation of averages.

Human evaluation procedure
We conducted human evaluation training before the main evaluation to normalize the evaluators' criteria. In the training, all evaluators evaluated 100 translations, and a meeting was held to determine common results for each subtask. The main evaluation was then performed.
The common results produced at the training were used as the reference results for the main evaluation.
The instructions for the human evaluation procedure are shown in Appendix C.

Schedule
Translations were done over a two-week period in May 2011.

Participants and submissions
We received submissions from 21 groups. The number of groups for each subtask was: 18 for CE, 12 for JE, and 9 for EJ. Table 2 shows the Group IDs, the participant organizations, system description papers, and the subtasks in which they participated. The types of translation systems are statistical machine translation (SMT), rule-based machine translation (RBMT), example-based machine translation (EBMT), or hybrids of two or more types (HYBRID).
In addition to the submissions from the participants, the organizers submitted results for baseline systems that consisted of 2 SMT systems, 5 commercial RBMT systems, and 1 online SMT system. The baseline systems are shown in Table 3. The SMT baseline systems consisted of publicly available software, and the procedures for building the systems and translating using the systems were published on the PatentMT web page so that those with the training data can build the SMT baseline systems and compare their results. The commercial RBMT systems and the Google online translation system were operated by the organizers. The translation results from the Google translation system were created by translating the test data via its web interface. We note that these RBMT companies and Google did not themselves submit results.
Since our objective does not include comparing the commercial RBMT systems of companies who did not themselves participate, the System IDs of the commercial RBMT systems are kept anonymous in this paper.
Each participant was allowed to submit as many translated results ("runs") as desired, but the submitted runs had to be prioritized by the group. In this paper, we distinguish their runs using a Run ID expressed by Group ID (or System ID for the baseline systems) and a priority number connected by "-". The resource information used by each run is indicated as follows:
• Resource B : The system used the bilingual training data provided by the organizers.
• Resource M : The system used the monolingual training data provided by the organizers.
• Resource E : The system used external knowledge other than data provided by the organizers, or the system uses a rule-based system.

Human evaluation results
We evaluated the adequacy for at least all of the first priority submissions. However, because of budget limitations, acceptability was evaluated for only selected systems.

Adequacy evaluation
Table 4 shows the results of the CE adequacy evaluation. Table 5 shows the results of the statistical significance test of the adequacy evaluation using a sign test. In the tables showing the results of a statistical significance test, the marks (" ", ">", "-") indicate whether the Run ID to the left of a mark is significantly better than that above the mark.
From these results, we can observe the following:
• All of the top systems are SMT systems. The top system, BBN-1, shows a significantly higher adequacy than the other systems.
To improve translation quality, the top BBN-1 system (Ma and Matsoukas 2011) used the following techniques: generalization of infrequent numerical expressions, optimization of Chinese word segmentation, adaptation of language models, addition of features, and utilization of English dependency structures. The results demonstrate the effectiveness of a system using these techniques.
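The sign test mentioned for the significance tables can be sketched as follows. This is a minimal illustration, assuming a two-sided exact binomial test on paired per-sentence scores with ties discarded; the source does not state whether a one-sided or two-sided variant was used, and the function name and data are hypothetical.

```python
from math import comb

def sign_test_p_value(scores_a, scores_b):
    """Two-sided exact sign test for paired per-sentence scores.

    Assumption: ties are dropped, and under the null hypothesis each
    remaining pair favours either system with probability 0.5.
    """
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    losses = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins + losses
    if n == 0:
        return 1.0  # only ties: no evidence of a difference
    k = min(wins, losses)
    # Probability of observing k or fewer of the rarer sign.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Toy adequacy scores for two hypothetical systems on ten sentences.
system_a = [5, 4, 5, 4, 5, 4, 5, 4, 5, 4]
system_b = [4, 3, 4, 3, 4, 3, 4, 3, 4, 3]
p = sign_test_p_value(system_a, system_b)
print(p, p < 0.01)
```

A p-value below 0.01 or 0.05 would correspond to the two significance levels reported in the sign test tables.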

Acceptability evaluation
Table 6 shows the results of the CE acceptability evaluation. Table 7 shows the results of the statistical significance test of the acceptability evaluation using a sign test.

Table 7 Sign test of CE acceptability. " ": significantly different at α = 0.01, ">": significantly different at α = 0.05, "-": not significantly different at α = 0.05.

From the results, we can see that the meaning of the source language could be understood (C-rank and above) for 79.7% of the translated sentences in the best-ranked system (BBN-1).This result significantly surpasses the others.

Adequacy evaluation
Table 8 shows the results of the JE adequacy evaluation. Table 9 shows the results of the statistical significance test of the adequacy evaluation using a sign test.
The top five systems, JAPIO-1, RBMT1-1, EIWA-1, RBMT3-1, and RBMT2-1, are either commercial RBMT systems or systems that use commercial RBMT systems. From these results, the following is observed:
• The commercial RBMT systems had higher adequacies than the state-of-the-art SMT systems for patent machine translation from Japanese to English.
The reason that the SMT systems could not achieve adequacy scores as high as those of the top RBMT systems is thought to be word ordering. Since the word order in Japanese and English is significantly different (Japanese is a Subject-Object-Verb (SOV) language and English is a Subject-Verb-Object (SVO) language), word ordering is difficult for Japanese-English translation. The current SMT performs well for word selection, but not for the difficult word ordering of Japanese-English translation. On the other hand, the baseline commercial RBMT systems perform well for the difficult word ordering of Japanese-English translations.
The finding that RBMT systems were better than SMT systems is consistent with the previous human evaluation results at NTCIR-7 (Fujii et al. 2008).

Table 9 Sign test of JE adequacy. " ": significantly different at α = 0.01, ">": significantly different at α = 0.05, "-": not significantly different at α = 0.05.

Acceptability evaluation
Table 10 shows the results of the JE acceptability evaluation. Table 11 shows the results of the statistical significance test of the acceptability evaluation using a sign test.
From the results, we can see that the source sentence meaning could be understood (C-rank and above) for 63% of the sentences in the best-ranked system using RBMT (JAPIO-1). For the best-ranked SMT system (NTT-UT-1), the source sentence meaning could be understood for 25% of the translated sentences (C-rank and above).
There was a large difference in the ability to retain the sentence-level meanings between the top-level commercial RBMT systems and the SMT systems for Japanese-to-English patent translation.

Table 11 Sign test of JE acceptability. " ": significantly different at α = 0.01, ">": significantly different at α = 0.05, "-": not significantly different at α = 0.05.

Adequacy evaluation
Table 12 shows the results of the EJ adequacy evaluation. Table 13 shows the results of the statistical significance test of the adequacy evaluation using a sign test.
From these results, we can observe the following:
• The top SMT systems NTT-UT-1 and NTT-UT-3 achieved human evaluation scores (adequacy) that were equal to or better than those of the top-level commercial RBMT systems. This was not the case for any SMT system at NTCIR-7, and it is believed to be the first time that this has been achieved.
• The adequacy scores for the commercial RBMT systems were higher than those for SMT systems other than NTT-UT-1 and NTT-UT-3.
English-to-Japanese translation is difficult for SMT because the English and Japanese word orders are significantly different. However, the top SMT systems achieved results equal to or better than the RBMT systems. One feature of the top SMT systems improved translation quality. This feature, used in NTT-UT-1 and NTT-UT-3 (Sudoh et al. 2011), is a method that pre-orders English input sentences using parse results and head finalization rules (Isozaki, Sudoh, Tsukada, and Duh 2010b) and then translates in almost monotone word orders. Since NTT-UT-1 uses a combination of three MT systems (two pre-ordering systems and one forest-to-string system), the effectiveness of the pre-ordering method was not clear from the NTT-UT-1 evaluation result alone. However, NTT-UT-3 consisted of only an MT system with the pre-ordering method using the head finalization rules. This allowed the effectiveness of the pre-ordering method using the head finalization rules to be seen from the results.

Acceptability evaluation
Table 14 shows the results of the EJ acceptability evaluation. Table 15 shows the results of the statistical significance test of the acceptability evaluation using a sign test.
For the best SMT system (NTT-UT-1), the source sentence meaning could be understood (C and above) for 60% of the sentences. Of the systems using RBMT, the source sentence meaning could be understood (C or above) for 60% of the translated sentences in the best system (RBMT6-1).
The translation quality of the top SMT system (NTT-UT-1) was equal to or better than that of the top-level commercial RBMT systems for retaining the sentence-level meanings.

Table 15 Sign test of EJ acceptability. " ": significantly different at α = 0.01, ">": significantly different at α = 0.05, "-": not significantly different at α = 0.05.

Validation of Human Evaluation Results
To discuss the reliability of the human evaluation, we present the correlation between the evaluation results for divided data. We validated the reliability of human evaluation as follows:
(1) The human evaluation data was divided into the first half data (Half-1) and the second half data (Half-2). Each contains half of all of the sentences evaluated by each evaluator.
(2) Scores for the systems based on the halved data were calculated.
(3) Correlation of system comparisons between the halved data was calculated.
Since the test data were built by random selection, it is assumed that the evaluation is not affected by differences in the halved data.Under this assumption, the following is true: If the evaluation is reliable, the top systems based on the first half data will also be the top systems based on the second half data, and the lower-ranking systems based on the first half data will also be the lower-ranking systems based on the second half data, i.e., there is good correlation between system comparison results of the two halved data.On the other hand, if the evaluation is not reliable, the top systems based on the first half data would be the lower-ranking systems based on the second half data, or the lower-ranking systems based on the first half data would be the top systems based on the second half data, i.e., there is poor correlation between system comparison results of the two halved data.Therefore, we validated the reliability based on the correlation between the evaluation results for the divided data.In this section, pairwise scores for systems were used for normalization purposes.A pairwise score for a system reflects the frequency with which it was judged to be better than or equal to other systems.A detailed explanation of the pairwise score is given in Section 2.2.2.
Figures 2-7 show the evaluation results for the first half of the data (Half-1), the second half of the data (Half-2), and all of the data (All).In the figures, the vertical axis is the pairwise score, and the horizontal axis is the Run ID.Although there are slight differences between the half data, there are no large differences that reverse the high-ranked and low-ranked systems.
Table 16 shows the Pearson correlation coefficients of the system evaluation scores between the half data. The Pearson correlation coefficients are close to 1.0 for all of the data pairs. This indicates that the evaluations of 150 sentences are consistent for system comparison, and this consistency shows the reliability of the evaluation results. The evaluation results of 300 sentences are thought to be more reliable than the evaluation results of 150 sentences because the number of sentences is larger.
In addition to the above main validation for reliability, we also checked the differences between evaluators. For each subtask and criterion, three evaluators evaluated the translations of 100 different source sentences. We checked the correlation between the evaluation results based on the 100 source sentences evaluated by the same evaluator. Table 17 shows the Pearson correlation coefficients for the system evaluation scores between evaluators. These values indicate that there is a high correlation between evaluators. Thus, even when the evaluators and the data are different, the evaluations are thought to be consistent for system comparison.
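The split-half validation can be sketched as follows. This is a simplified illustration: it scores each system by its mean score per half, whereas the validation described above uses pairwise scores, and it treats one list of scores per system rather than splitting within each evaluator. The function names and toy scores are hypothetical.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def split_half_correlation(scores):
    """scores: dict mapping a system to per-sentence scores in evaluation order.

    Simplification (assumption): system scores are plain means per half.
    """
    half1, half2 = [], []
    for per_sentence in scores.values():
        mid = len(per_sentence) // 2
        half1.append(mean(per_sentence[:mid]))   # Half-1 system score
        half2.append(mean(per_sentence[mid:]))   # Half-2 system score
    return pearson(half1, half2)

# Toy per-sentence adequacy scores for three hypothetical systems.
toy = {"S1": [5, 4, 4, 5], "S2": [3, 3, 4, 3], "S3": [1, 2, 2, 2]}
print(split_half_correlation(toy))
```

A coefficient close to 1.0 means that the two halves order the systems in nearly the same way, which is the notion of reliability used in this section.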

Meta-Evaluation of the Automatic Evaluation Measure of BLEU
We calculated the BLEU scores based on the 2,000 test sentences to investigate the reliability of the automatic evaluation measure of BLEU (Papineni et al. 2002), which is widely used to evaluate translation quality, in the patent domain for the language pairs of CE, JE, and EJ.
The Spearman rank-order correlation coefficients and the Pearson correlation coefficients between human evaluations (average adequacy scores) and the BLEU scores are shown in Table 18. From Table 18, it can be seen that the BLEU scores have a high correlation with the human evaluation for the CE evaluation, but do not have a high correlation with the human evaluation for the JE and EJ evaluations including RBMT systems. The Spearman rank-order correlation coefficients and the Pearson correlation coefficients between human evaluation and the BLEU scores excluding the RBMT systems for JE and EJ are shown in Table 19. The correlations excluding RBMT systems for JE and EJ are higher than those including the RBMT systems. Therefore, for the automatic evaluation of JE and EJ patent translation quality, BLEU is more reliable when comparing systems that do not include RBMT systems than when the comparison includes RBMT systems.
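The Spearman rank-order correlation used in this meta-evaluation can be sketched as the Pearson correlation of ranks. This is a minimal illustration: ties receive average ranks (an assumption, since the source does not state how ties were handled), the function names are hypothetical, and the adequacy and BLEU values are toy numbers, not results from the tables.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)

# Toy system-level scores for four hypothetical systems.
adequacy = [3.5, 3.0, 2.0, 1.5]
bleu = [0.20, 0.30, 0.25, 0.10]
print(spearman(adequacy, bleu))
```

Running the same computation on a system list with and without the RBMT systems would reproduce the kind of comparison made between Table 18 and Table 19.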

Method for Obtaining the Database
This section explains the method used to obtain resources. The available resources for research purposes consist of a human evaluation database, test data, reference data, and submission data (translated data). Resources are provided by the NTCIR project. The method used to obtain the resources is given at the URLs shown in Table 20. Applicants are asked to sign a user agreement (memorandum on permission to use) to obtain the resources. The use of these resources is free of charge.

Conclusion
This paper presented information regarding the database of human evaluations from the NTCIR-9 Patent Machine Translation Task and the knowledge obtained from these evaluations.
The evaluations showed the effectiveness of a number of machine translation systems in the patent translation field. The database of human evaluations is valuable for translation quality evaluation research. The resources will also be useful for system combination research. Resources including the human evaluation database, translation results, and test/reference data are available from the NTCIR project for research purposes.

(b) Relative comparison:
• A sentence whose sentence-level meaning is not correct would be evaluated as 1-4 not only by the absolute criterion (most, much, little, and none) but also by a relative comparison among the multiple translation outputs.
• The relative comparison must be consistent in all of the data.

A.2 Example Values of Adequacy
Examples of adequacy values and translations are shown in Table 22.

Generally, the closer to the end point, the greater the amount of fall of the crowning.
Translation 3: Generally, the amount of clowning omissions is as large as an end.

Source: 各々の二次電池１を単独で順番に充電する組電池は、スイッチング回路６のスイッチング素子を利用して充電できる。
Reference: A battery assembly which sequentially charges each individual rechargeable battery 1 can utilize the switching devices of the switching circuitry 6 for charging.
Translation 2: Each of the secondary battery 1 to charge the battery can be charged by utilizing a switching element of the switching circuit 6 is solely order .

Source: この場合も、隣り合う貫通孔４との最小間隔が小さい貫通孔４には、最小間隔が大きい貫通孔４よりも少量の樹脂ペースト７を充填するようにする。
Reference: Also in this instance, the through holes 4 arranged at relatively smaller intervals are filled with a smaller amount of resin paste 7 than the through holes 4 arranged at relatively larger intervals.
Translation 1: Also in this case, the minimum interval is greater than the minimum interval between adjacent through holes 4 is small, a small amount of resin paste 7 to fill up the through hole 4 through hole 4 .

code is not the same, e.g., "１２３" and "123" are considered to be the same.
(9) Special characters such as Greek letters in the source sentences are replaced by letters enclosed by periods or enclosed by ampersands and semicolons. These replacements are permissible, e.g., "５μｍ" → "5 .mu.m" or "5 &mu;m".
(10) Some translations mistakenly include segments of characters from the source language. These segments are ignored if the translation works out appropriately without the segments.

B.2 Example Values of Acceptability
Examples of acceptability values and translations are shown in Table 23.

C.1 Evaluation Method for Training and Main Evaluations
• The criteria for evaluation are based on the guidelines.
• One input sentence (or one reference sentence) and all of the system outputs are shown simultaneously to compare systems.
• An evaluator evaluates all of the translations for the same input sentence.
• The MT output sentences for each input sentence are given to the evaluators in a random order.

Generally, the closer to the end point, the greater the amount of fall of the crowning.
Translation F: Generally, the amount of clowning omissions is as large as an end.

Fig. 1 Acceptability.

Acceptability avoids the overvaluation that can arise when adequacy and fluency scores are averaged, allowing us to consider fluency. The instructions for the acceptability criterion are shown in Appendix B.1. Examples of acceptability values and translations are shown in Appendix B.2.

Table 2
Participants and subtasks participated in.
ISTIC: Institute of Scientific and Technical Information of China (He et al. 2011)
LIUM: University of Le Mans (Schwenk and Abdul-Rauf 2011)

Table 3
Baseline systems.

Table 4
Results of CE adequacy.

Table 6
Results of CE acceptability.

Table 8
Results of JE adequacy.

Table 10
Results of JE acceptability.

Table 12
Results of EJ adequacy.

Table 14
Results of EJ acceptability.

Table 16
Pearson correlation coefficient between data.

Table 17
Pearson correlation coefficient between evaluators by different data sets.

Table 18
Correlation coefficients between adequacy and the BLEU scores

Table 22
Examples of adequacy values and translations.

Table 23
Examples of acceptability values and translations.