Our English-to-Japanese machine translation system uses surface verbal case frames (hereafter, case frames) to select a Japanese translation for an English verb. The need to acquire and accumulate case frames leads directly to two problems:
1. How can case frames be made transparent? Case frames are sometimes changed after they are written, and it is hard to predict how these changes affect translation selection.
2. How can the case elements and their restrictions be kept consistent? These elements should be used consistently, because the matching calculation between case frames and syntactic structures (parser output) assumes that they are used uniformly.
To solve these problems, we propose two methods.
1. Use of a decision tree to represent case frames (a case frame tree).
2. Use of a statistical inductive learning algorithm to derive a case frame tree from a bilingual corpus.
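As a minimal sketch of the first method, a case frame tree can be pictured as a nested structure in which each internal node tests one case category and each leaf is a Japanese translation. The tree contents, the `default` fallback, and the function name `select_translation` are assumptions made for illustration and are not taken from the system described here.

```python
# Hypothetical case frame tree for the English verb "take".
# Internal nodes test one case category; leaves are Japanese translations.
# Categories, restrictions, and translations are illustrative only.
TAKE_TREE = {
    "case": "object",
    "branches": {
        "medicine": "nomu",
        "bus": {
            "case": "subject",
            "branches": {"person": "noru"},
            "default": "noseru",
        },
    },
    "default": "toru",
}


def select_translation(tree, frame):
    """Walk the tree using the case elements found in the parser output.

    `frame` maps case categories to the restriction observed in the
    sentence (here, a word form). Missing or unmatched values fall back
    to the node's default branch (an assumed convention).
    """
    node = tree
    while isinstance(node, dict):
        value = frame.get(node["case"])
        node = node["branches"].get(value, node["default"])
    return node
```

In this sketch, editing the subtree under the `"bus"` branch can change only the translations reached through that branch; the result for `{"object": "medicine"}` is unaffected, which is the kind of locality the tree representation is meant to provide.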
The first method solves problem one: a change at any node in a case frame tree affects only the translations under the changed node. The second method solves problem two: the case elements and their restrictions are evaluated on the same basis, according to their ability to distinguish the verb translations in the corpus. We used the learning algorithm C4.5, devised by Quinlan. C4.5 takes as input a table listing attributes, values, and classes. To acquire a case frame tree, we replace attributes with case categories, values with restrictions on the case categories, and classes with Japanese translations of English verbs. We call such a table a Primitive Case Frame Table (PCFT). Before running acquisition experiments on seven English verbs (“come”, “get”, “give”, “go”, “make”, “run”, “take”), we constructed an English-Japanese bilingual corpus from AP (Associated Press) newswire texts; the corpus comprises about 6,000 translation pairs with syntactic tags. In the first experiment, we converted the corpus into a PCFT using all case categories appearing in the corpus, with word forms as their restrictions. The acquired case frame trees basically duplicated the human work, but were far more precise in discriminating the verb translations appearing in the corpus. Although these results indicate the basic effectiveness of our approach, the acquired case frame trees did not seem to have enough predictive power on open data, since many of the word forms could be unknown words. To solve this problem, we generalized the word forms in the PCFT using four-digit semantic codes (from Ruigo-Kokugo-Jiten) and then applied C4.5. Five-fold cross-validation was used to ensure the precision of the evaluation (error rate). The error rate on open data for each verb was between 2.4% and 32.2%. Comparison of these figures with the baseline errors (error rates obtained by simply outputting the most frequent translation of a verb) showed a gain of between 13.6% and 55.3%, which indicates the basic effectiveness of using semantic codes. To lower the error rate, we are devising an algorithm that can integrate word forms and semantic codes in an acquired case frame tree.
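The following sketch illustrates, under stated assumptions, how a PCFT row maps case categories to restrictions and a translation class, how the baseline error is obtained from the most frequent translation, and how generalizing word forms to semantic codes changes the gain ratio, the split criterion used by C4.5. The rows, the code values in `SEM_CODE`, and all function names are invented for illustration; they are not data from the experiments, and the snippet scores a single split rather than building a full tree.

```python
import math
from collections import Counter

# Hypothetical PCFT rows for one verb: (case category -> restriction, class).
PCFT = [
    ({"object": "medicine", "subject": "patient"}, "nomu"),
    ({"object": "pill", "subject": "doctor"}, "nomu"),
    ({"object": "bus", "subject": "patient"}, "noru"),
    ({"object": "train", "subject": "doctor"}, "noru"),
]

# Made-up four-digit semantic codes; real codes would come from the thesaurus.
SEM_CODE = {"medicine": "1111", "pill": "1111", "bus": "2222", "train": "2222"}


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(rows, category):
    """Information gain of splitting on `category`, divided by split info."""
    labels = [cls for _, cls in rows]
    groups = {}
    for cases, cls in rows:
        groups.setdefault(cases.get(category), []).append(cls)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    split_info = entropy([cases.get(category) for cases, _ in rows])
    gain = entropy(labels) - remainder
    return gain / split_info if split_info > 0 else 0.0


def baseline_error(rows):
    """Error rate when always outputting the most frequent translation."""
    labels = [cls for _, cls in rows]
    return 1 - Counter(labels).most_common(1)[0][1] / len(labels)


def generalize(rows, category):
    """Replace word forms of one case category with their semantic codes."""
    return [({**cases, category: SEM_CODE.get(cases[category], cases[category])}, cls)
            for cases, cls in rows]
```

With these toy rows, the object slot separates the two translations either way, but every word form is a distinct value, so the split is penalized and any unseen noun would be unmatched. After `generalize(PCFT, "object")`, the same separation is achieved with two values, which gives a higher gain ratio and lets unseen words sharing a code follow an existing branch.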