A Generalized Dependency Tree Language Model for SMT

In this paper we describe a generalized dependency tree language model for machine translation. We consider in detail the question of how to define tree-based n-grams, or 't-treelets', and thoroughly explore the strengths and weaknesses of our approach by evaluating the effect on translation quality for nine major languages. In addition, we show that it is possible to attain a significant improvement in translation quality even for non-structured machine translation by reranking filtered parses of k-best string output.


Introduction
Since the early days of word-based translation, the research community has been moving towards increasingly syntactic approaches to translation. Classic n-gram language models are effective at capturing translation fluency at the word level; however, such approaches often fail at the syntactic and semantic level. In this study we abstract the traditional definition of the classic n-gram to dependency trees and show how our approach is able to improve on more challenging issues such as long-distance word agreement. The primary motivation for using structured language models is that we can reduce the 'distance' between words that are interdependent on a syntactic (and often semantic) level. While tree-based models have a more complicated structure than their string-based counterparts, the sparsity of the most important information is reduced, giving a more compact and relevant representation of context. Despite recent advances in neural and structured language modeling technology, the most widespread language modeling paradigm in major translation systems is still classic n-gram modeling. Classic n-gram models are combined with a variety of smoothing methods, the most popular being modified Kneser-Ney (Chen and Goodman 1996) and Stupid Backoff (Brants, Popat, Xu, Och, and Dean 2007), to form simple and robust models of linear word context. There exist highly optimized implementations, such as KenLM (Heafield 2011), making n-gram models popular in modern systems (Koehn, Hoang, Birch, Callison-Burch, Federico, Bertoldi, Cowan, Shen, Moran, Zens, Dyer, Bojar, Constantin, and Herbst 2007; Neubig 2013; Richardson, Cromières, Nakazawa, and Kurohashi 2014).
There have been a number of issues that have prevented the widespread adoption of tree-based language models. We believe that the two main problems are the lack of an agreed standard on the most effective definition of tree-based context, and the requirement for a syntax-based decoder for full integration into a machine translation system. The use of parsers as language models has shown little improvement in translation experiments (Och, Gildea, Khudanpur, Sarkar, Yamada, Fraser, Kumar, Shen, Smith, Eng, Jain, Jin, and Radev 2004; Post and Gildea 2008), and previous attempts to use syntax-based language modeling have failed to show any statistically significant increase in BLEU (Schwartz, Callison-Burch, Schuler, and Wu 2011) (see errata). In this paper we show that we can still achieve a significant improvement in translation quality as judged by humans without requiring a syntax-based decoder.
In this paper we frequently refer to the problem of long-distance word agreement. This is a tricky issue for string-based machine translation, which does not consider long-distance dependencies and is therefore susceptible to errors such as incorrect noun/verb agreement. We show that our model is most effective for languages, such as those in the Slavic family, that are morphologically rich and allow free word order, because they contain more examples of non-local dependencies that affect fluency. See Figure 1 for an example of such a long-distance word agreement error.

Related Work
The idea of capturing structure in a language model has been around since the late 1990s (Chelba 1997), particularly in the speech recognition community. Such tree-based, or 'structured', language models have mainly considered only limited word histories, such as ancestors (Gubbins and Vlachos 2013) or specific parent/sibling relationships (Shen, Xu, and Weischedel 2008); however, more recent attempts have started to define more general syntactic word histories (Sidorov, Velasquez, Stamatatos, Gelbukh, and Chanona-Hernández 2013). The beginnings of such generalized approaches can be traced back to 'arbori-context' trees (Mori, Nishimura, and Itoh 2001), which are designed to select optimal partial histories.
Other effective syntactic approaches in recent years have included a bilingual language model (Mariño, Banchs, Crego, de Gispert, Lambert, Fonollosa, and Costa-jussà 2006; Niehues, Herrmann, Vogel, and Waibel 2011) enhanced with some dependency information (Garmash and Monz 2014), specifically the POS tags of the parent/grandparent and closest left/right siblings, and modeling a generative dependency structure on top of a classic n-gram language model (Ding and Yamamoto 2014).
While not directly designed as 'syntax-based' language models, approaches based on neural networks have also been shown to be effective at capturing a more general word history than classic n-gram models. Such approaches include feed-forward (Bengio, Ducharme, Vincent, and Janvin 2003; Schwenk 2007) and recurrent (Mikolov, Karafiát, Burget, Cernocký, and Khudanpur 2010) neural network language models, and more recently LSTM-based language models (Sundermeyer, Schlüter, and Ney 2012). The most recent approach at the time of writing considers a hybrid of syntactic and neural network components (Sennrich 2015).
Two major drawbacks of neural network based approaches are that it can be difficult to 'reverse engineer' the syntactic knowledge learned, and that model training can be computationally expensive on large data.
Our approach expands on existing studies by proposing a more generalized framework for syntactic context and analyzing its effectiveness for a large range of language pairs.We show that it is applicable even to systems without target-side syntax.
3 Model Details

Classic n-grams with Linear History
The classic generative story for language models is a Markov process. Generation of a sequence of words is based on the notion of 'history' or 'context', i.e., the ordered list of words already generated. It would be desirable to consider complete histories; however, in practice this is not tractable. To counter sparsity and computational issues, a selective history must be used.
These ideas inspire the design of the classic n-gram language model. We assume an (n − 1)-th order Markov property and use a linear context, i.e., we model the probability of any given word as conditional on the previous n − 1 words. The probability of a sentence w_1, ..., w_m can be written as:

P(w_1, ..., w_m) ≈ ∏_{i=1}^{m} P(w_i | w_{i−n+1}, ..., w_{i−1})

In many cases, this linear history can be helpful in determining the next word. For example, the word 'Francisco' is more likely to appear after 'San' than 'Los'. But when it comes to modeling other issues affecting fluency, such as word agreement, we must also use non-local context, ideally without increasing model sparsity. This is the primary motivation for our tree-based language model.
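As an illustrative sketch (not the paper's implementation), the linear-history model above can be realized with unsmoothed maximum-likelihood estimates; the function names `train_counts` and `sentence_log_prob` are ours:

```python
import math
from collections import defaultdict

def train_counts(corpus, n=3):
    """Count n-grams and their (n-1)-gram histories from a tokenized corpus."""
    counts = defaultdict(int)
    for sent in corpus:
        padded = ["<s>"] * (n - 1) + sent
        for i in range(n - 1, len(padded)):
            gram = tuple(padded[i - n + 1:i + 1])
            counts[gram] += 1       # full n-gram
            counts[gram[:-1]] += 1  # its (n-1)-gram history
    return counts

def sentence_log_prob(sentence, counts, n=3):
    """Log probability of a sentence under the (n-1)-th order Markov
    assumption, using maximum-likelihood estimates (no smoothing)."""
    logp = 0.0
    padded = ["<s>"] * (n - 1) + sentence
    for i in range(n - 1, len(padded)):
        gram = tuple(padded[i - n + 1:i + 1])
        logp += math.log(counts[gram] / counts[gram[:-1]])
    return logp
```

In practice such counts are always combined with a smoothing method such as the Stupid Backoff scheme described later.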

Syntax-Based History and t-treelets
We now consider how to define the history of a word on the syntactic level. In this paper we consider generalized tree n-grams, or 't-treelets', as the syntax-based equivalent of the classic n-gram. Our definition of t-treelets is similar to the concept of syntactic n-grams (Sidorov et al. 2013), which we formalize and extend over arbitrary tree structures. This is in contrast to previous work that considers only a limited subset of possible syntactic relations.
Let us assume a sentence S = {w_1, w_2, ..., w_m} of length m with a virtual root R and a connected tree structure T : S → S ∪ {R} mapping each word w_i to one head T(w_i) ∈ S ∪ {R} with T(w_i) ≠ w_i. The design of T can be motivated by any arbitrary set of standards; however, a natural choice for machine translation applications would be dependency parses.
We now define the 'history' H_i for w_i as the subset of S consisting of the words visited by an in-order depth-first traversal of {S, T} starting at R and ending at w_i. The diagram on the right of Figure 2 shows the tree-based history of an example sentence (shown on the left). The core reasoning behind this definition of history is that we wish to ensure that the t-treelet history of every word respects a well-defined ordering (in this case the order in which nodes are visited by the depth-first traversal). This ensures that we never encounter any cyclic dependencies or ambiguity when calculating the probability of an entire tree. Note that this well-defined ordering is trivial in the case of classic n-gram models; however, it is an important consideration in tree-based models.
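The traversal and the resulting history can be sketched as follows, assuming a head-array representation of the parse (`heads[i-1]` is the 1-based head of word i, with 0 denoting the virtual root R); the function names are ours:

```python
def traversal_order(heads):
    """In-order depth-first traversal of a dependency tree: for each node,
    visit its left-side children, then the node itself, then its
    right-side children. Returns 1-based word indices in visiting order."""
    children = {i: [] for i in range(len(heads) + 1)}
    for i, h in enumerate(heads, start=1):
        children[h].append(i)  # children appended in left-to-right order
    order = []
    def visit(v):
        for c in children[v]:
            if c < v:
                visit(c)
        if v != 0:           # the virtual root is not a word
            order.append(v)
        for c in children[v]:
            if c > v:
                visit(c)
    visit(0)
    return order

def history(heads, i):
    """History H_i: all words visited up to and including word i."""
    order = traversal_order(heads)
    return order[:order.index(i) + 1]
```

For a projective parse this traversal recovers the surface word order, so H_i contains the preceding words; the difference from linear context lies in which connected substructures (t-treelets) are extracted from it.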
While in this paper we use in-order depth-first traversal, any well-defined ordering could be used, for example to reflect the ordering used for hypothesis combination in a tree-based decoder.
As an example of the differences caused by this choice, an in-order depth-first traversal allows us to include the children of left-side siblings in the word history, which is useful for many word agreement problems; this is not possible with a breadth-first traversal.
Conversely, we cannot make use of the right-side siblings of a word. This could be useful in rare cases such as 'le prix fixe' ('the set price'), where we need to use a modifier to the right of a determiner (the adjective 'fixe') to determine the gender/number of its head noun ('prix'), which in this case could be either singular or plural.
We now define the 't-treelets of size l for w_i' as all connected subtrees S′ ⊆ H_i such that w_i ∈ S′ and |S′| = l, together with the tree structure T. See the lower half of Figure 2 for the t-treelets of order l ≤ 3 extracted for the word 'mice' in the example sentence. Our t-treelet definition captures in particular the difference between left and right dependencies, e.g., whether w_i is to the left or right of T(w_i) (we treat each case separately), and relative sibling positions, which to our knowledge are not considered in previous work.
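Since l is small (≤ 3 in our experiments), the connected subtrees of the definition above can be enumerated by brute force over subsets of the history; the following sketch (our own illustration, with a head-array parse as before) checks connectivity with a depth-first search over head/child edges:

```python
from itertools import combinations

def t_treelets(heads, hist, i, l):
    """All connected subtrees of size l inside history hist (a list of
    1-based word indices) that contain word i. heads[k-1] is the head
    of word k (0 = virtual root)."""
    others = [w for w in hist if w != i]
    found = []
    for combo in combinations(others, l - 1):
        nodes = set(combo) | {i}
        # Build the tree edges restricted to the candidate node set.
        edges = {w: set() for w in nodes}
        for w in nodes:
            h = heads[w - 1]
            if h in nodes:
                edges[w].add(h)
                edges[h].add(w)
        # Depth-first search from word i to test connectivity.
        seen, stack = {i}, [i]
        while stack:
            for nb in edges[stack.pop()]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        if seen == nodes:
            found.append(tuple(sorted(nodes)))
    return found
```

This enumeration is exponential in l in the worst case, which is another practical reason to keep l small.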
While there is only one possible linear n-gram of a given size for any word, the same cannot be said for t-treelets. Furthermore, the number of possible t-treelet shapes increases with l and depends on the sentence structure. For convenience we normalize each {S′, T} by renumbering the words and dependencies from 1 to l. This allows us to classify t-treelet shapes into groups.
The possible shapes of t-treelets for l ≤ 3 are shown with natural language examples in Figure 3. For completeness we also add the empty t-treelet. There are 10 such possible t-treelets: 1 of size 1, 2 of size 2 and 7 of size 3. A major benefit of this approach is that we are able to turn each shape g 'on' and 'off', a process which is described in Section 3.2.1 below.
We can now model the probability of generating a given word w_i with t-treelets G as:

P(w_i | H_i) ≈ ∏_{g ∈ G} P(w_i | g)

Note that the different g ∈ G are not always independent, so the exact probability can be rather complicated to calculate. We found that this approximation works well, although it would also be possible, for example, to treat the individual P(w_i | g) as scores and combine them with a log-linear model. In our case we found that it was difficult to learn weights for a log-linear model, since in our experiments BLEU was not sensitive to changes in long-range word agreement.
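Under the independence approximation, combining the per-shape scores is a simple sum of log probabilities over the enabled shapes; this short sketch (illustrative names, not the paper's code) also shows how shape types can be switched on and off:

```python
import math

def treelet_log_prob(shape_probs, enabled):
    """Approximate log P(w_i | H_i) as the sum of log P(w_i | g) over the
    enabled t-treelet shapes g, treating the shapes as independent.
    shape_probs maps a shape id to P(w_i | g); enabled is a set of ids."""
    return sum(math.log(p) for g, p in shape_probs.items() if g in enabled)
```

Disabling a shape simply removes its factor from the product, which is how the task-specific shape selection below is realized.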

Task-specific shape selection
Previous definitions of tree-based histories have considered subsets of the possible t-treelet shapes defined above, such as ancestor chains {w_i, T(w_i), ..., T^{l−1}(w_i)} (Gubbins and Vlachos 2013) and restricted parent/sibling relations (Shen et al. 2008). Our definition not only expands upon these, but also adds flexibility, as we are able to turn each shape type 'on' and 'off' depending on the requirements of the task. An additional benefit is that it becomes possible to compare directly with previous work by simply selecting t-treelet shapes.
In particular, we found that there are types of dependency relations that may or may not affect word agreement depending on the language and parsing standards. The natural language examples shown in Figure 3 give classic cases where certain t-treelet shapes are important for determining word morphology. Equally, there are types that are never (or very rarely) used in certain languages, and we found that considering these types sometimes caused unnecessary noise. This is similar to using unnecessarily long n-gram sizes in classic language models, where there is often not enough gain in expressiveness to warrant the additional errors caused by sparsity and irregular smoothing.
See Section 5 for experimental results and a more detailed analysis of the relative performance of various special cases of our model, including comparison with previous work.

Smoothing
Since we conducted our translation experiments on web-scale data, we designed our model to be used with Stupid Backoff (Brants et al. 2007) smoothing, which has been shown to perform well on this kind of data.
The mathematical formulation of Stupid Backoff smoothing is shown below. The formula is applied recursively until a known n-gram is found, and unigram scores are defined as SB(w) = c(w)/N, where c(w) is the observed frequency of word w in the training corpus and N is the corpus size. The parameter α controls the degree to which we penalize backing off to shorter n-grams.

SB(w_i | w_{i−k+1}, ..., w_{i−1}) =
    c(w_{i−k+1}, ..., w_i) / c(w_{i−k+1}, ..., w_{i−1})   if c(w_{i−k+1}, ..., w_i) > 0
    α · SB(w_i | w_{i−k+2}, ..., w_{i−1})                  otherwise        (1)

The primary advantage of this smoothing method is that it does not require the calculation or lookup of modified t-treelet counts. This allows for fast and simple calculation of backoff scores, and works well when using each t-treelet type score as a feature. Note that Stupid Backoff smoothing was also used in similar work (Gubbins and Vlachos 2013) that we use for comparison.
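The recursion in Equation (1) takes only a few lines in its n-gram form; the dictionary-based `counts` interface below is an illustrative simplification (the t-treelet variant backs off over shape types rather than suffixes):

```python
def stupid_backoff(ngram, counts, total, alpha=0.4):
    """Stupid Backoff score for ngram = (w_{i-k+1}, ..., w_i).
    counts maps word tuples to training-corpus frequencies; total is the
    corpus word count N used for the unigram base case."""
    if len(ngram) == 1:
        return counts.get(ngram, 0) / total
    if counts.get(ngram, 0) > 0:
        # If the full n-gram was seen, its history was seen too.
        return counts[ngram] / counts[ngram[:-1]]
    return alpha * stupid_backoff(ngram[1:], counts, total, alpha)
```

Note that the resulting scores are not normalized probabilities, which is why we avoid perplexity as an evaluation measure (see Section 4.2).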
When backing off to shorter t-treelets, note that we calculate probabilities based on the shorter shape type, and that this type is always unique because of the (in-order depth-first) ordering constraint. For example (in Figure 3), 'il est grand' (type 4) backs off unambiguously to 'est grand' (type 2).

Application to SMT: Filtering and Reranking
In our experiments (see Section 4) we measure the translation improvement gained by reranking k-best machine translation output using our language model.
Reranking is a very flexible and simple approach. In particular, we do not make any assumptions about the decoding algorithm of the underlying MT system, and we are able to use a standard string-to-string system by simply parsing its output. As mentioned in the introduction, we believe that a major stumbling block for syntax-based language modeling has been the lack of applicability to string-based MT systems (which are still the most common), and we show that for our model this is not an issue.
The obvious problem of using string output is that we cannot guarantee reliable parsing, particularly of (poorly formed) machine translation output.We propose the simple approach of using a filtering heuristic based on dependency tree consistency to reduce this problem.
We parse all k-best candidates and extract the dependency treelets of size l centered on each word that differs between a candidate and the 1-best (baseline) translation. We then discard any k-best candidates containing such treelets whose dependency structure differs from that of the corresponding treelet in the 1-best translation. This simple heuristic was very effective at reducing errors caused by bad translations/parses, as simple word changes (e.g., changing the gender of a definite article) should not affect the parse tree. Naturally, this filtering leads to a small reduction in recall.
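The consistency check can be sketched as follows. This is a simplified stand-in for the full heuristic: it compares only the head and child positions of each differing word (a size-2/3 neighborhood) and assumes equal-length candidates, whereas the paper compares treelets of size l; all names are ours:

```python
def local_structure(heads, i):
    """Head position and child positions of word i (1-based), a minimal
    stand-in for the treelet centered on i. heads[k-1] is word k's head."""
    children = tuple(j for j, h in enumerate(heads, start=1) if h == i)
    return (heads[i - 1], children)

def passes_filter(cand_words, cand_heads, base_words, base_heads):
    """Keep a candidate only if every word that differs from the 1-best
    translation keeps the same local dependency structure."""
    if len(cand_words) != len(base_words):
        return False  # simplification: compare equal-length outputs only
    for i in range(1, len(cand_words) + 1):
        if cand_words[i - 1] != base_words[i - 1]:
            if local_structure(cand_heads, i) != local_structure(base_heads, i):
                return False
    return True
```

A morphological change such as 'le' → 'la' leaves the parse intact and passes, while a candidate whose changed word was re-attached by the parser is discarded.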

Experimental Setup
We performed a series of experiments to measure the improvement in translation quality obtainable by reranking MT output using our proposed tree-based language model. In particular, we were interested in correcting morphological errors such as word agreement errors.

Language Choice
In our experiments we built and evaluated models for nine major languages. This allowed us to clearly analyze the types of morphological error that the proposed model was able to improve. The languages were selected from a variety of language families, and all display word agreement to varying degrees.
The languages chosen were: Czech and Russian [Slavic]; Hungarian [Uralic]; Dutch and German [Germanic]; French, Portuguese and Spanish [Romance]; and Hebrew [Semitic].For consistency we used English as the source language for all translation experiments.See Table 1 for an overview of the characteristics of these languages affecting (long-distance) word agreements.
All language models were trained on mixed domain monolingual web corpora of 5-10 billion unique sentences per language.

Automatic Evaluation Metrics
Translation quality was measured with BLEU (Papineni, Roukos, Ward, and Zhu 2002) and the language model was intrinsically evaluated using a method of evaluation we call 'win-rate' (see below).
The BLEU metric has the following formulation:

BLEU = BP · exp( ∑_{n=1}^{N} w_n log p_n )

where BP is a brevity penalty, w_n are weights and p_n are n-gram precisions. As can be seen from the definition, BLEU considers only the precision of local n-grams. BLEU has been shown in the past to be ineffective in evaluating syntax-based approaches, with improvements being 'invisible to [such] an n-gram metric' (Sennrich 2015).
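To make the formulation concrete, a minimal sentence-level variant with uniform weights w_n = 1/N can be computed as follows; this is our own illustrative simplification (real evaluations use corpus-level statistics and smoothing):

```python
import math
from collections import Counter

def sentence_bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU sketch: clipped n-gram precisions with uniform
    weights, multiplied by a simple brevity penalty."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n])
                      for i in range(len(reference) - n + 1))
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any zero precision zeroes the score
        log_prec += math.log(clipped / total) / max_n
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_prec)
```

The clipping and brevity penalty make clear why a single corrected word ending far from its head barely moves the score, which motivates the win-rate measure below.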
We also found that BLEU was unreliable at reflecting changes in translation quality for long-distance dependencies, and that its sensitivity was low because only a small fraction of words were changed by using the proposed model (for example, many sentences do not contain word agreement errors). Nonetheless, it was practical to use such an automatic measurement for parameter tuning.
As another point of reference, we also used a method of intrinsic language model evaluation we call 'win-rate'. For each sentence we calculated the language model score (using the proposed model) of the baseline MT system output and of the reference translation. The win-rate is then the fraction of sentences for which our model gives a higher score to the reference translation than to the baseline output. The model can be considered useful if it can successfully give a higher score to the reference translation than to the baseline MT output. While we do not claim that this metric is strongly correlated with human judgment, we believe it gives useful information and is very simple to implement.
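The metric is simple enough to state in a few lines; `score` stands for any sentence-level language model scoring function:

```python
def win_rate(score, references, baseline_outputs):
    """Fraction of sentences where the model scores the reference
    translation strictly higher than the baseline MT output."""
    wins = sum(1 for ref, base in zip(references, baseline_outputs)
               if score(ref) > score(base))
    return wins / len(references)
```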
The classic method of intrinsic language model evaluation is perplexity; however, we chose not to use this measure because it assumes normalized probabilities, which we cannot strictly guarantee when using our model approximations and Stupid Backoff smoothing.

Training and Lookup
Prior to model training, we tokenized the entire training corpus and collected word frequencies. Tokens with frequency less than or equal to a certain threshold (in our case 1) were replaced with an 'unknown' token in order to model t-treelet counts that include out-of-vocabulary tokens during lookup.
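This preprocessing step can be sketched as follows (the function name and the `<unk>` token string are illustrative choices):

```python
from collections import Counter

def replace_rare_tokens(corpus, threshold=1, unk="<unk>"):
    """Replace tokens whose corpus frequency is at most `threshold`
    with an unknown-word token, prior to t-treelet counting."""
    freq = Counter(tok for sent in corpus for tok in sent)
    return [[tok if freq[tok] > threshold else unk for tok in sent]
            for sent in corpus]
```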
Training was conducted by parsing the training sentences and counting all t-treelets of size ≤ 3. It would be possible to use longer t-treelets; however, we found that there were not many cases where longer context was necessary for determining correct word agreement. To save memory, t-treelets could be pruned based on frequency, but we found that this negatively impacted performance (see Table 2). Parsing was conducted with the shift-reduce dependency parser described in (Lerner and Petrov 2013).

Reranking SMT Output
In order to evaluate the effectiveness of the proposed model, we tested its ability to rerank the 1,000-best translation output of a string-based SMT system. The baseline translation system was a state-of-the-art in-house phrase-based translation system trained on large web data. The baseline used a standard 5-gram language model trained on the same data as the proposed tree-based model, and for comparison also used Stupid Backoff smoothing.
The 1,000-best translation candidates were filtered using the dependency tree consistency heuristic described in Section 3.4. We also removed noisy sentences consisting of over 50% non-alphanumeric characters, and evaluated on sentences between 10 and 30 words in length.

Optimization of Model Parameters
We first explored the effects of varying our model parameters, in particular the selection of t-treelet shapes, comparing with previous work. The experiments were conducted on a development data set of approximately 10,000 sentences per language, held out from our baseline and language model training data.
For comparison with previous work, we first experimented with settings enabling various sets of t-treelet shapes (for size l ≤ 3).The setups 'Ancestors' and 'Siblings' were designed to correspond to the models of (Gubbins and Vlachos 2013) and (Shen et al. 2008) respectively.
Note that there are some slight differences, in particular the smoothing algorithm for 'Siblings' (Shen et al. did not mention any smoothing) and the fact that our models are more general than previous work, differentiating between left and right children.

Results
Table 3 shows the results for the four system variants. We can see that the most effective settings were to use all t-treelets or all trigrams, and these more general setups performed better than the more restrictive settings based on previous work. We found that using only trigrams gave better results than using all t-treelets, because the lower-order t-treelets often gave less reliable information (i.e., we need longer context).
Additional tuning experiments showed that improvements were made by increasing model size (reducing the t-treelet filtering threshold frequency f for training; see Table 2). An increase of on average 0.1 BLEU per language was observed by varying the beam width from 1 to 100, and we used a beam width of 100 for all our evaluation results. We note that parsing quality was roughly the same for all languages. For a detailed parser evaluation, see (Lerner and Petrov 2013).
We also found empirically that it was effective to penalize unseen t-treelets more heavily than in previous work (Brants et al. 2007) by changing the backoff parameter α from the standard 0.4 to 0.004 (see Section 3.3 for more details). We did not conduct a full-scale experiment to find the optimal value.

Experimental Settings
We conducted a full evaluation of our proposed approach on nine language pairs. For the final evaluation we used the 'Trigrams' setting shown in Section 5 to be the most effective overall. We decided to use this setting for all language pairs, since we did not believe that the BLEU and win-rate scores gave a clear enough winner for each individual language pair. The experiments were conducted by translating mixed-domain English web sentences held out from the baseline SMT system and language model training data. As we were interested in evaluating the differences between the baseline and proposed models, we translated a large test set and then, for evaluation, randomly selected (on average) 400 sentences per language that had different output between the baseline and proposed systems. The change rates in Table 5 show the percentages of sentences that were translated differently.

Human Evaluation
Translation quality was measured by skilled human raters in order to maximize the reliability of the evaluation.The raters were bilingual speakers of each language pair but not professional translators.
For each sentence the raters were instructed to give a score between 0 and 6 (inclusive), given the source sentence and translation, with one rater per sentence. The rating guidelines are shown in Table 4. The number of raters per language pair is shown in Table 5.
We calculated the following scores for each language pair:
• 'mean-diff-score': The mean difference between the sentence-level human ratings of the proposed and baseline systems.
• 'mean-diff-sign': The mean difference between the number of sentence-level wins and losses (in terms of human ratings) of the proposed and baseline systems.
• 'change-rate': Percentage of sentences that were different between the baseline and proposed systems.
• 'baseline': Mean sentence-level human evaluation score for the baseline system.
• 'proposed': Mean sentence-level human evaluation score for the proposed system.

Table 4 Rating guidelines for human evaluation.

Score 0: Nearly all the information is lost between the translation and source. Grammar is irrelevant.
Score 2: The sentence preserves some of the meaning of the source sentence but misses significant parts. Grammar may be poor.
Score 4: The sentence retains most of the meaning of the source sentence. It may have some grammar mistakes.
Score 6: The meaning of the translation is completely consistent with the source, and the grammar is correct.

Results
Table 5 shows the results sorted by language family. Significantly positive (p < 0.05) results for 'mean-diff-score' and 'mean-diff-sign' are shown in bold type. Table 6 shows the exact number of test sentences with each score difference (−6 to +6) between the proposed and baseline systems. The results show a significantly positive improvement for Czech, Russian, Hungarian, German and French, with more fluent output as judged by human raters. In particular, all the languages displaying noun declension were significantly improved (Czech, Russian, Hungarian and German). The translation quality for Romance languages (with the exception of French), Hebrew and Dutch did not change significantly when using a tree-based language model. In the next section we analyze these findings in detail.

Discussion
The results of the human evaluation showed a noticeable difference in the effectiveness of the tree-based language model between languages. In particular, the morphological characteristics of each language (to some extent captured by the language family) appear to greatly affect the utility of a structured language model in comparison to the baseline, which uses a standard n-gram model.
All nine languages require correct word agreement for high fluency; however, the type and nature of these agreements varies between languages. For example, adjective-noun and noun-verb agreement in Romance languages can be expressed relatively simply with an n-gram model, as the words in question normally appear together, which could explain why we saw less improvement for this language family. In contrast, case choice in Slavic languages requires consideration of complicated and long-distance dependencies, which is consistent with the large improvement shown by our proposed system. Similar observations that such approaches are more effective for languages with relatively free word order have been made in previous work (Sennrich 2015).
The magnitude of improvement per language also showed some correlation with the baseline translation quality. Analysis of the scores given by human raters showed a tendency for lowered sensitivity to word endings when the translation quality was high. We found many examples of generally well-translated sentences (particularly in Portuguese and Spanish) that were given a score of 6 (out of 6) irrespective of word ending errors. This means that there were a number of improvements not reflected in our results. Conversely, Russian sentences with a lower average score tended to gain or lose a whole point when word endings were changed. This could also be due to the type of errors themselves, as, for example, adjective agreement mistakes are unlikely to impede understanding as much as case errors, and shows the importance of accurate long-distance agreement.

Overall, the most common cause of errors was out-of-vocabulary t-treelets, particularly phrases involving rare nouns. Our current model does not attempt to guess deep information, such as the gender or case of out-of-vocabulary words; however, this would make for interesting future work. The second most common error was changing the original sentence meaning, for example by modifying the tense or changing singular to plural. This could be improved in the future by considering a bilingual approach incorporating source tokens.

Error Categorization
Despite the parse error reduction achieved by our filtering method, errors in the parsing of the training corpus still led to learning some incorrect t-treelets. In particular, structures such as 'the JJ NN and NN' (for example 'the unusual character and Amy') caused the most errors, as they were often incorrectly parsed and caused word agreement errors (e.g., 'unusual' was given plural agreement).
While in the majority of cases the proposed model formulation picked sufficient and appropriate context, there were some difficult cases requiring larger context than was covered by length-3 t-treelets, in particular for words with many (> 3) siblings.

Example Sentences
See Appendix A for examples of sentences improved using the proposed method.Appendix B gives examples of worsened translations with an explanation of each error.

Conclusion and Future Work
In this paper we have described a generalized dependency tree language model for machine translation. We performed a thorough human evaluation on nine major languages using models trained on large web data, and have shown a significantly positive improvement in translation quality for five morphologically rich languages.
Analysis suggests that a generalized tree-based language model is best suited to language groups such as Slavic and Uralic that display many non-local features such as cases, as we saw no significant improvement over a classic n-gram language model for groups such as the Romance languages, with high baseline quality and few non-local word agreements. Despite the common concern that tree-based language models are incompatible with string-based MT systems, we have shown that our model is capable of performing well even in this scenario by using filtered parses of string MT output.
As future work we would like to experiment with other methods of integrating the language model score into machine translation systems.The natural starting point is to query the tree language model during decoding, as the reranking method proposed in this paper has access to a limited number of hypotheses and does not integrate other features that are available to the decoder.
In addition, as we have shown that BLEU is insensitive to changes made using a syntax-based language model, in the future we would like to try using metrics based on syntactic n-grams (Sennrich 2015). This would allow for improved model tuning.
It would also be interesting to use the language model to generate word endings and then use these to edit the 1-best translation. This has the benefit of increasing the search space without affecting decoding complexity. Our preliminary experiments show that source-side information must also be used so as not to generate candidates that change the meaning of the original sentence.

B.2 French
While the change is grammatically correct, it incorrectly modifies the original tense (past into present).
• Input: The song is sung by a soldier on the island of Corfu, following the tactical withdrawal of the Serbian Army through east and west Albania.

B.4 Portuguese
This is caused by a parse error around the infinitive construction 'não saber' ('not knowing').

B.5 Russian
This is a tricky example where the Russian preposition 'na' ('to/on') can take either of two grammatical cases (accusative or prepositional) depending on meaning, and the incorrect meaning is selected.

Fig. 1
Fig. 1 Example of a long-distance agreement error in a French translation. The grammatical gender of the adjective utilisées (f.pl.) should be corrected to utilisés (m.pl.) to agree with the noun minéraux (m.pl.). The verb sont also demonstrates long-distance agreement that would be difficult to capture with a classic n-gram model.

Fig. 2
Fig. 2 Example of t-treelet extraction. The left figure shows an example sentence {S, T} and the right figure shows the history H for 'mice'. The five possible t-treelets of order l ≤ 3 are shown beneath.

Fig. 3
Fig. 3 The shapes of all possible t-treelet types for l ≤ 3, with natural language examples and word-by-word glosses. The words marked with dark nodes represent the w_i around which the t-treelets are centered.

Table 1
Word agreement/ordering characteristics of the nine languages selected for translation experiments.

Table 2
Result of varying model size by changing t-treelet filtering threshold f in training.These results used the setting 'AllTypes'.

Table 3
Comparison of model formulations enabling various t-treelet types.BLEU and win-rate are shown for each proposed system.The best results are shown in bold type.

Table 5
Human evaluation results, comparing mean difference between proposed and baseline systems, sorted by language group.

Table 6
Number of test sentences with each score difference (−6 to +6) between proposed and baseline systems.
Table 7 gives an error categorization for a random sample of ten incorrect sentences for each of five languages (Dutch, French, German, Portuguese and Russian). The categories are defined

Table 7
Error categories for analyzed test sentences.