Sub-Subword N-Gram Features for Subword-Level Neural Machine Translation

Neural machine translation (NMT) systems often use subword segmentation to limit vocabulary sizes. This type of segmentation is particularly useful for morphologically complex languages, whose vocabularies can grow prohibitively large, and it also replaces infrequent tokens with more frequent subwords. Fine segmentation with short subword units has been shown to produce better results for smaller training datasets. Character-level NMT, which can be considered an extreme case of subword segmentation in which each subword consists of a single character, can improve transliteration results, but also tends to produce grammatical errors. We propose a novel approach to this problem that combines subword-level segmentation with character-level information in the form of character n-gram features, which are used to construct the embedding matrices and softmax output projections of a standard encoder-decoder model. We use a custom algorithm to select a small number of effective binary character n-gram features. Through four sets of experiments, we demonstrate the advantages of the proposed approach for processing resource-limited language pairs. Our proposed approach yields better performance in terms of BLEU score compared to subword- and character-based baseline methods under low-resource conditions. In particular, the proposed approach allows increasing the vocabulary size for small training datasets without reducing translation quality.


Introduction
Neural machine translation (NMT) has made remarkable progress in recent years (Sutskever et al. 2014; Bahdanau et al. 2014; Vaswani et al. 2017) and has become a popular approach for general machine translation tasks.
NMT systems take a sequence of source language tokens as input and predict a sequence of target language tokens as the corresponding translation output. The granularity of such input and output tokens can be smaller than words, such as subwords (Sennrich et al. 2015). Replacing words with sequences of subwords has become a popular method. Subword segmentation methods, such as byte-pair encoding (BPE), replace infrequent words with more frequent subword segments to limit vocabulary sizes. Limiting vocabulary sizes is particularly important for morphologically complex languages because their vocabularies may contain hundreds of thousands of entries. The optimal size for a vocabulary depends on the particular training data. As indicated by Denkowski and Neubig (2017), NMT models trained on smaller corpora tend to perform better with smaller BPE vocabularies, because subwords with very low frequencies are avoided.
† Nara Institute of Science and Technology  †† RIKEN Center for Advanced Intelligence Project
However, BPE segmentation does not consider the context of a word or its morphemes.
Subwords with similar surface forms are likely to contain the same morphemes, but if these subwords are represented by unique IDs, their similarities are not considered.
Character-level NMT (Chung et al. 2016; Lee et al. 2016) can be regarded as an extreme case of subword segmentation in which each subword corresponds to a single character. Character-level systems can produce better transliteration results than subword-level systems, as shown by Sennrich (2017). Sennrich (2017) compared the character-level model proposed by Lee et al. (2016) and the BPE-to-character model proposed by Chung et al. (2016) to their own BPE-only model (Sennrich et al. 2016a). They concluded that character-level decoders "perform worse at modelling morphosyntactic agreement, where information needs to be carried over long distances." In a more recent paper, Cherry et al. (2018) demonstrated that character-level models with large capacities (large numbers of layers in their encoders and decoders) can outperform BPE when their training datasets are very large (in their experiments, 39.9 million sentence pairs). However, such models suffer from a disproportionate reduction in speed.
In this paper, we explore a novel training method for NMT models that allows us to utilize the character n-gram information of each subword to learn corresponding embeddings. This method is described in detail in Section 3. Our proposed models operate at the subword level and make use of short n-gram features within subwords. In this paper, we refer to the character n-gram features of subwords as sub-subword features.
We propose a subword-level NMT training method that utilizes sub-subword n-gram information to regularize subword embeddings and the output layer. Sub-subword-level features are selected from all n-grams in the subword vocabulary using a custom feature selection algorithm.
The set of selected sub-subword features can unambiguously identify every subword in the vocabulary. Sub-subword binary features are fed through a multilayer feed-forward network to produce subword embeddings. Because subword embeddings are produced exclusively from constant sub-subword features, the subword embeddings can be pre-computed once the model has been trained. Therefore, our method does not need to modify baseline architectures or the numbers of parameters in trained models.
Another method that works well on low-resource language pairs is back-translation (BT) (Sennrich et al. 2016b;Poncelas et al. 2018;Edunov et al. 2018). BT is a simple method that uses an NMT model to translate sentences in a target language into source language sentences, allowing one to synthesize a parallel corpus. We explored how our proposed model can be combined with BT.
Through four sets of experiments, we determined the optimal architecture for applying sub-subword features to subword-level NMT. We found that this method works particularly well with small training sets. Our main contributions can be summarized as follows:
• We propose an NMT model training approach combining subword segmentation to limit vocabulary sizes with n-gram features to construct subword embeddings using a standard multilayer feed-forward network (Subsection 3.1).
• We use a feature selection algorithm to select a small number of character n-gram features.
Selected features should uniquely identify every word in the vocabulary (Subsection 3.2).
• We demonstrate that a standard multilayer feed-forward network works better than a self-attention mechanism (Subsection 4.2).
• We demonstrate that constructing subword embeddings strictly from n-gram features works better than using n-gram features to complement standard embeddings (Subsection 4.2).
• We demonstrate that our method facilitates large vocabulary sizes with small training datasets, thereby improving translation results (Subsection 4.3).
• We demonstrate that our method works well for small training datasets (Subsection 4.4).
• We explore how the proposed method can be combined with BT (Subsection 4.5).
This paper consists of five sections. The Introduction (1) is followed by a Preliminaries section (2). Section 3 describes the proposed approach. Four sets of experiments are detailed in Section 4. Finally, Section 5 presents a summary of our findings and discusses future work.

Neural Machine Translation
NMT models are used to translate sequences of tokens, such as words, in a source language into corresponding sequences of tokens in a target language. The sets of all distinct tokens in the source and target language sentences are called the source and target vocabularies, respectively.
NMT models operate based on closed vocabularies. The tokens in each vocabulary are assigned unique identification numbers.
The input tokens of a source sentence x are represented as a sequence of one-hot vectors x := (x_1, ..., x_{T_x}), where T_x is the number of tokens in the sentence. These tokens are projected into dense embedding vectors:

e^(x)_i = E_x x_i,  (1)
e^(y)_j = E_y y_j,  (2)

where E_x and E_y are the source and target embedding matrices. To output a probability distribution p_t for a token y_t, the model is conditioned on the previous output tokens y := (y_1, ..., y_{t-1}) as well as the input x:

q_t = NMT(e^(x), {e^(y)_j}_{0<j<t}),  (3)
p_t = softmax(W_y q_t + b_y),  (4)

where the function NMT represents a model that, given an input embedding sequence e^(x) and the previous translation tokens' embedding sequence {e^(y)_j}_{0<j<t}, approximates a dense vector q_t corresponding to the next token in the translation, and W_y and b_y are the output layer weights and bias.
The technique called weight tying (Press and Wolf 2017) consists of defining W_y as the transpose of E_y. If both the source language and target language are represented by the same set of tokens, they can share an embedding matrix, resulting in E_x = E_y = W_y^⊺.
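As a small illustration of embedding lookup, the output softmax, and weight tying, the following NumPy sketch uses toy dimensions (all names and sizes here are illustrative, not a real NMT system):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 6, 4                      # toy vocabulary size and embedding dimension

# Shared embedding matrix E (D x V); with weight tying, W_y = E^T.
E = rng.normal(size=(D, V))
W_y = E.T                        # output projection reuses the embeddings
b_y = np.zeros(V)

def one_hot(i, size):
    v = np.zeros(size)
    v[i] = 1.0
    return v

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Embedding lookup is a matrix-vector product with a one-hot vector.
x_i = one_hot(2, V)
e_x = E @ x_i                    # equals column 2 of E

# Output distribution over the (shared) vocabulary for a decoder state q_t.
q_t = rng.normal(size=D)
p_t = softmax(W_y @ q_t + b_y)
```

With tying, no separate output matrix is stored: the softmax logits are inner products between the decoder state and the embedding columns.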

Subword Segmentation
As mentioned in Section 1, word-level vocabularies can be very large, particularly for morphologically complex languages. Subword segmentation is used to reduce vocabulary sizes as an alternative to selecting the top-N most frequently used words.
The number of different tokens can be reduced by splitting words into smaller subword units.
Subword segmentation increases the length of input and output sequences, and smaller vocabularies result in longer sequences. The optimal vocabulary size depends on the training data, where smaller datasets typically favor smaller vocabularies.
One very popular subword segmentation method is BPE (Sennrich et al. 2015), which starts from single characters and iteratively merges the most frequent adjacent symbol pairs into new subword units. There are various approaches to word segmentation in the literature that are similar to BPE. One such approach is the WordPieceModel (Schuster and Nakajima 2012). Instead of merging subwords based on their frequency, this model merges subwords to optimize the likelihood of a language model. Different merging policies were explored by Wu and Zhao (2018), including frequency, accessor variety, and description length gain. Another approach to word segmentation is the unigram language model (Kudo 2018), which can produce multiple segmentation candidates.
In this model, the subword vocabulary is selected to optimize the probabilities of words by considering the probability of one word to be the product of the unigram probabilities of its subwords.
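The frequency-based merging used by BPE can be sketched in a few lines. The toy corpus and helper names below are our own; real implementations (e.g. subword-nmt) add details such as vocabulary thresholds and merge-rule serialization:

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} dict (toy sketch)."""
    # Represent each word as a tuple of symbols, starting from characters,
    # with an explicit end-of-word marker.
    vocab = {tuple(w) + ('</w>',): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges, vocab

merges, vocab = learn_bpe({'low': 5, 'lower': 2, 'lowest': 3}, num_merges=3)
```

On this toy corpus the first merges join 'l'+'o' and 'lo'+'w', so the frequent word "low" quickly becomes a single token while rarer suffixes stay segmented.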
Once data have been segmented into subword units, they are treated as independent tokens in the manner described in Section 2.1. The word embedding matrix can then be considered as a subword embedding matrix containing subword features. The information regarding which characters comprise a token is ignored. The potential benefits of taking advantage of this information represent the main motivation for the use of sub-subword features.

Character n-gram Features
Using n-gram features to generate word vectors is not a new concept: Wieting et al. (2016) used n-gram count vectors to represent words for different natural language processing (NLP) tasks. Bojanowski et al. (2017) trained word vectors on a skipgram language model by representing each word as the sum of the vectors representing its n-grams. This method takes word-boundary symbols into consideration. They considered n-gram sizes ranging from three to six characters. Because representing each n-gram using a unique vector can consume excessive memory, n-grams are grouped into buckets using a hash function, and the n-grams in a bucket share a vector representation. Takase et al. (2019) used character n-gram features to construct word embeddings and output layer weights for language modeling. They also conducted experiments on NMT using trigram features to produce embeddings. In contrast to other approaches, they used a self-attention mechanism to calculate a weighted sum of feature embeddings.
We briefly describe the model proposed by Takase et al. (2019). Their method replaces equations (1) and (2) with the following two equations:

e^(x)_i = E_x x_i + c(x_i, θ_c^(x)),
e^(y)_j = E_y y_j + c(y_j, θ_c^(y)),

where c(ω, θ_c) is a neural network with a parameter set θ_c that produces a dense vector inferred from the n-gram features of the word ω. The n-gram features thus augment word embeddings without replacing them. The vector c(ω, θ_c) is computed as

c(ω, θ_c) = Σ_{i=1}^{I(ω)} [RowSoftmax(W_c S^(ω))]_i ⊙ [S^(ω)]_i,

where ⊙ denotes the element-wise product of vectors, [·]_i is the i-th column of a given matrix, RowSoftmax is a softmax function applied row-wise, and I(ω) is the number of n-gram features in ω. The matrices S^(ω) ∈ R^{D_e×I(ω)} represent the embeddings of the n-gram features in the word ω. The dimension of the embeddings is D_e.
These n-gram feature embeddings and the square matrix W_c ∈ R^{D_e×D_e} are trainable parameters that can differ between input and output layers.
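This attention-based composition can be sketched in NumPy as follows (our reading of the weighted-sum mechanism; variable names and dimensions are illustrative):

```python
import numpy as np

def row_softmax(M):
    # Softmax applied independently to each row of M.
    e = np.exp(M - M.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def ngram_attention_embedding(S, W_c):
    """Weighted sum of n-gram feature embeddings (self-attention sketch).

    S   : (D_e, I) matrix of n-gram feature embeddings for one word
    W_c : (D_e, D_e) trainable square matrix
    """
    A = row_softmax(W_c @ S)        # per-dimension attention over the n-grams
    return (A * S).sum(axis=1)      # element-wise product, summed over columns

rng = np.random.default_rng(1)
D_e, I = 4, 3
S = rng.normal(size=(D_e, I))       # three n-gram features, 4-dim embeddings
W_c = rng.normal(size=(D_e, D_e))
c = ngram_attention_embedding(S, W_c)
```

Each embedding dimension gets its own attention distribution over the word's n-grams, so the result is more expressive than a plain sum of feature vectors.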

Training Method
In this study, we explored different approaches utilizing n-gram features to represent translation tokens, instead of learning their representations as embedding matrix parameters or using n-gram features to augment embeddings. Because we used BPE subwords to limit vocabulary sizes in our experiments, we will refer to the translation tokens as "subwords" and to their n-gram features as "sub-subword features." We denote V_x and V_y as the vocabulary sizes, and d_x and d_y as the numbers of n-gram features for the input and output languages, respectively. During training, our model learns to produce embeddings and output matrices from two sparse binary feature matrices F_x ∈ R^{V_x×d_x} and F_y ∈ R^{V_y×d_y} for the source and target language vocabularies, respectively. Each element in a feature matrix indicates whether an n-gram feature is included in a subword. When the input and output vocabularies are the same, such as when training BPE subwords jointly, the feature matrices F_x and F_y are equal. We will refer to this part of the model as the feature-to-embedding (FTE) network. Unlike the method presented by Takase et al. (2019), we use a simple feed-forward network.
Our proposed approach does not train the feature matrices; they remain constant. Equations (1), (2), and (4) are replaced with the following equations:

E_x^⊺ = FTE(F_x; φ_x),
E_y^⊺ = FTE(F_y; φ_y),
W_y = FTE(F_y; φ_o),

where FTE(·; φ) applies the feed-forward network with parameters φ to each row of a feature matrix. The parameter weights for the feed-forward networks, namely φ_x, φ_y, and φ_o, are updated during training. If weight tying is applied, then φ_y = φ_o. Figure 1 presents a diagram of the training model using weight tying.
Once the model is trained, we revert to equations (1), (2), and (4) by pre-computing the values of E_x, E_y, W_y, and b_y from the constant feature matrices using the equations above. After these calculations are completed, the FTE weights are no longer necessary and can be discarded. The resulting model is the same size as the base model.
Our goal is to compare different architectures and n-gram features and to determine their utility. As shown later in Section 4, we determined that a three-layer feed-forward network yields the best results among the compared architectures. We use a rectifier activation function between layers.
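A minimal sketch of the FTE computation is shown below (NumPy, toy sizes; the layer widths and names are illustrative, not the actual hyperparameters used in our experiments):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def fte(F, weights):
    """Feature-to-embedding (FTE) network: maps each row of the binary
    feature matrix F (V x d) to a dense vector via a multilayer
    feed-forward network with rectifier activations between layers."""
    H = F
    for i, (W, b) in enumerate(weights):
        H = H @ W + b
        if i < len(weights) - 1:     # no activation after the last layer
            H = relu(H)
    return H                         # (V x D_e) matrix, one row per subword

rng = np.random.default_rng(2)
V, d, D_e = 10, 7, 5                 # toy vocabulary, feature, embedding sizes
F = (rng.random((V, d)) < 0.3).astype(float)   # sparse binary feature matrix
weights = [(rng.normal(size=(d, 8)), np.zeros(8)),     # 3-layer network
           (rng.normal(size=(8, 8)), np.zeros(8)),
           (rng.normal(size=(8, D_e)), np.zeros(D_e))]

# Because F is constant, the embedding matrix can be pre-computed once
# after training and the FTE weights discarded, so the deployed model
# has the same size as the baseline.
E = fte(F, weights)
```

Since the FTE input is deterministic, two subwords with identical feature rows necessarily receive identical embeddings, which is what motivates the feature selection in Section 3.2.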
Using all of the fixed-width n-grams can result in very large feature matrices. In practice, when using all available features, we can only train models on bigram features based on memory constraints. To be able to use longer n-grams, we use a custom feature selection algorithm. We extract all n-grams for every possible n and select a final subset using the algorithm described in Section 3.2.

Feature Selection
The goal of the proposed algorithm is to select a small number of features that unambiguously represent a given vocabulary. As an example, consider a vocabulary that contains the subwords ana and anana. When only using bigram features, both subwords are represented as {^a, an, na, a$ }. In order to disambiguate these two subwords, the trigram feature nan may be selected. The algorithm selects the feature that disambiguates the maximum number of subword occurrences.
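The ambiguity in this example is easy to verify programmatically, using the boundary markers ^ and $ as above:

```python
def ngrams(subword, n):
    """Character n-grams of a subword, with boundary markers ^ and $."""
    s = '^' + subword + '$'
    return {s[i:i + n] for i in range(len(s) - n + 1)}

# With bigrams alone, 'ana' and 'anana' are indistinguishable...
assert ngrams('ana', 2) == ngrams('anana', 2) == {'^a', 'an', 'na', 'a$'}
# ...but the trigram 'nan' appears only in 'anana'.
assert 'nan' in ngrams('anana', 3) and 'nan' not in ngrams('ana', 3)
```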
Given a large set of potential character n-gram features, the proposed feature selection algorithm selects a small subset. The initial feature set contains features for all n-grams contained in the subword vocabulary, where n ranges from one up to the maximum subword length in the vocabulary. Features are selected greedily, as shown in Algorithm 1, until the unigram probabilities of the unambiguously identified subwords sum up to a predefined threshold T_g. To limit the search space, only a subset Φ(G) of the current partitions of ambiguous subwords is examined at each step. When selecting a candidate n-gram feature f, we also only consider a fixed number N_Ψ of candidates, denoted as Ψ(G). Ψ(G) consists of the features with the top-N_Ψ highest Ω(f, G) values. Ω(f, G) ranks the relevance of the features with respect to their frequency within the partitions, and it is defined as follows:

Ω(f, G) = Σ_{v ∈ G : f ∈ Q(v)} p_v(v),

where G is a partition of subwords sharing the same feature representation, Q(v) represents the set of n-gram features that appear in v, and p_v(v) is the unigram probability of v.
These optimizations are optional and can be disabled by defining Φ(G) = G and Ψ(G) as the set of all the features that have not been selected.
Algorithm 1 Feature selection algorithm. Data: V, a set of subword tokens; Q(v), the n-gram features that appear in v ∈ V; p_v(v), the unigram probability of v; and M, the maximum number of n-gram features. Ψ and Φ, defined in the text, are used to prune the search space. The notation [ ] represents an empty sequence, [f_i] represents a sequence of a single element f_i, and the operator · denotes sequence concatenation.
To illustrate the selections made by the proposed algorithm, we next present some statistics
for an English-Turkish dataset when using a vocabulary size of approximately 64,000 words and T_g = 0.95. The dataset contained 2,399 bigrams and 14,933 trigrams. The number of trigram features was too large to train a model using the proposed approach on a single GPU. The number of features selected by the feature selection algorithm was 598, out of which 28.8% were unigrams, 43.5% were bigrams, 14.5% were trigrams, 8% were four-grams, 3.8% were five-grams, and the remaining 1.4% were longer n-grams. Other datasets yielded similar statistics. Table 1 shows some of the selected features of more than two characters.
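To convey the flavor of the selection procedure, the following is a heavily simplified greedy sketch. Unlike Algorithm 1, it ignores unigram probabilities, the threshold T_g, and the Ψ/Φ pruning; it simply keeps picking the feature that splits the most groups of subwords that still share a representation:

```python
def all_ngrams(v):
    """All character n-grams of a subword, with boundary markers."""
    s = '^' + v + '$'
    return {s[i:i + n] for n in range(1, len(s) + 1)
            for i in range(len(s) - n + 1)}

def select_features(vocab, max_features):
    """Greedy sketch of the selection idea (not the paper's Algorithm 1)."""
    candidates = set().union(*(all_ngrams(v) for v in vocab))
    selected = []
    while len(selected) < max_features:
        # Group subwords by their signature under the selected features.
        groups = {}
        for v in vocab:
            sig = frozenset(f for f in selected if f in all_ngrams(v))
            groups.setdefault(sig, []).append(v)
        ambiguous = [g for g in groups.values() if len(g) > 1]
        if not ambiguous:
            break                    # every subword is uniquely identified
        # Pick the candidate feature that splits the most ambiguous groups.
        def gain(f):
            return sum(1 for g in ambiguous
                       if 0 < sum(f in all_ngrams(v) for v in g) < len(g))
        best = max(candidates - set(selected), key=gain)
        if gain(best) == 0:
            break
        selected.append(best)
    return selected

features = select_features(['ana', 'anana', 'an', 'na'], max_features=10)
```

For this tiny vocabulary a handful of features suffices to give every subword a unique binary signature; the real algorithm additionally weights subwords by frequency so that common tokens are disambiguated first.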

Experiments and Analysis
To investigate the effectiveness of the proposed approach, we conducted four sets of experiments.

Experimental Setup
We ran experiments targeting five languages: English (En), Turkish (Tr), German (De), French (Fr), and Finnish (Fi). The source language was Turkish for the models targeting English; for the rest of the models, the source language was English. The Tr-En and En-Tr models were trained using the same dataset. Table 2 shows the number of sentences in each dataset. These datasets were sub-sampled to create smaller training sets that simulate low-resource settings.
The training dataset for the English-Turkish translation task comes from Tiedemann (2012) and contains 205,000 parallel sentences. The German and French data were distributed as part of larger publicly available corpora. In our experiments, the input and target languages shared a subword vocabulary and sub-subword features. All related weights were shared as well; as a result, E_x = E_y = W_y^⊺ = E and F_x = F_y = F. Significance testing was done using bootstrap resampling (Koehn 2004). We consider improvements with a p-value smaller than 0.05 to be significant.
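Paired bootstrap resampling (Koehn 2004) can be sketched as follows. For simplicity, sentence-level scores stand in for corpus-level BLEU here; the actual test resamples the test set with replacement and recomputes corpus-level BLEU on each sample:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Fraction of bootstrap resamples in which system A outscores B.

    1 minus this fraction estimates the p-value for the hypothesis
    'A is better than B' (sketch; real BLEU is a corpus-level metric).
    """
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_samples

# Toy example: system A is better than B on every sentence.
a = [0.32, 0.41, 0.28, 0.37, 0.45, 0.30, 0.39, 0.33]
b = [0.30, 0.38, 0.27, 0.35, 0.42, 0.29, 0.36, 0.31]
p_win = paired_bootstrap(a, b)
```

Because resampling is paired (the same sentence indices are used for both systems), sentence difficulty is controlled for and the comparison is more sensitive than comparing the two corpus scores directly.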

Approach Comparison Experiments
In this subsection, we compare different approaches to determine the best architecture. All models were trained using a vocabulary size of approximately 32,000 subwords. All training datasets contained approximately 200,000 sentences. Excluding the Turkish-English pairs, the training data were sub-sampled from larger datasets. Table 3 lists the results for different models divided into three sections. Section A compares the feed-forward models (l-FF, where l ∈ [1, 4]) to the self-attention (att) approach proposed by Takase et al. (2019). The 1-FF approach is equivalent to summing feature embeddings, similar to some of the models introduced in Section 1 (Wieting et al. 2016;Morishita et al. 2018). One can see that the best results in Section A are provided by the three-layer feed-forward models.
Adding extra layers does not result in any significant gains in terms of BLEU scores.
Section B in Table 3 compares the proposed feature selection algorithm to a naive approach of selecting the features with the highest frequency. Both model sets use a similar number of features, but the proposed selection algorithm results in scores comparable to those achieved when using all bigrams, whereas the naive approach yields poor results. When using the frequency-selected bigrams, many subwords have non-unique representations. As an example, if the features "po", "pa", "oi", and "ai" were not selected, the words point and paint would have the same representation: {"^p", "in", "nt", "t$"}. In such cases, the subwords are indistinguishable and the output layer will always output the one with the highest unigram probability.
Table 3 Different approaches compared in terms of BLEU score results. The models starting with "2-gram" used all known bigrams as features, while those starting with "selection" use the proposed feature selection algorithm. The l-FF models use l-layered feed-forward networks to produce embeddings, while those containing "att" use the approach proposed by Takase et al. (2019). For the +comp models, the embeddings produced from the features are used to complement the traditional embeddings through addition. It should be noted that the numbers of parameters for all models are the same after training. Best scores excluding subword regularization are shown in bold.
While the cited approaches (Wieting et al. 2016; Morishita et al. 2018; Takase et al. 2019) use n-gram features to complement standard trainable embeddings, our proposed approach avoids training subword embeddings directly. Section C reveals that for small training datasets, using features to complement standard embeddings is sub-optimal. For example, consider the results for the models 2-gram 3-FF and 2-gram 3-FF +comp, where the latter complements n-gram features with subword features. The results are consistently better for the model without subword embeddings. We believe this occurs because, without subword embeddings, models are forced to infer subword features from n-gram features, resulting in better generalization. In contrast, when subword embeddings are available, subwords can be represented directly by their embeddings without using the n-gram features, resulting in poorer generalization and the learning of training data biases.
Section D in Table 3 explores the subword regularization technique proposed by Kudo (2018).
Their approach produced good results with significant improvements with respect to the baseline, in accordance with their report. The sampling hyperparameters were α = 0.2 and l = ∞.
Their method obtained significantly better BLEU scores than our proposed approach for Turkish-English, English-Turkish, and English-Finnish. The subword regularization model also obtained a higher BLEU score for English-German but the gain (+0.3) is not significant. For English-French, the subword regularization model obtained significantly worse results than our proposed model.
The reason for this may be that neither English nor French is morphologically complex. Combining our proposed approach with subword regularization improved the results, but not significantly.

Vocabulary Size Experiments
When using BPE subword segmentation, the vocabulary size can be regulated. The results in Table 4 reveal the effects of varying vocabulary sizes.
The English-German NMT models are trained on approximately 4.5 million sentences and the English-Turkish models are trained on approximately 0.2 million sentences. Having a larger vocabulary leads to less word segmenting and shorter token sequences. When the training dataset is sufficiently large, a larger vocabulary results in higher BLEU scores. However, performance drops when there is insufficient data for training the rarest subwords. This effect is diminished by the proposed method because embeddings are inferred from more frequent sub-subword features.
The character-level models (char) yield poor BLEU scores, particularly in the low-resource settings. As shown by Cherry et al. (2018), character-level NMT requires deeper models.
The test time column shows the seconds needed to translate the test set of 3,007 sentences. Larger vocabulary sizes produce shorter subword sequences, which results in faster decoding speeds. Decoding was done using beam search with a beam size of 5. The updates/sec column shows the number of parameter updates per second during training. The training time penalty for the proposed model is larger for larger vocabulary sizes.

Grammatical Mistakes
Sennrich (2017) demonstrated that character-level NMT performs poorly in terms of morphosyntactic agreement as the distance between tokens increases. Their evaluation data, denoted as lingeval97, contain sentences with categorized synthetic errors.
We evaluated our models to determine how vocabulary size affects grammatical accuracy and to elucidate the effects of the proposed method. The evaluated models are those trained on the large English-German corpus of approximately 4.5 M sentences. Table 6 lists accuracy metrics of different models for various error categories. Out of the 13 error categories, the proposed method performs best on ten. The absolute best results are shown in bold font and the best results among the baseline models are underlined. The results are divided into four groups.
Some errors seem to be less common with smaller vocabulary sizes, even when the BLEU score is lower. When comparing each vocabulary size independently, we observe that our proposed model is superior to the baseline model with vocabulary sizes larger than 2 K subwords. The effect of including sub-subword information is larger for larger vocabulary sizes. As Table 4 shows, our proposed model does not significantly improve the BLEU scores for this large corpus.
However, our proposed model is more resilient to the grammatical errors evaluated by lingeval97.
The baseline and proposed models with 64 K subwords both have the same BLEU score of 26.5, but the proposed model behaves better for 10 out of 13 grammatical mistake categories.
Table 6 Grammatical accuracy for different error categories. The models correspond to the English-German results in Table 4. The best result out of the baseline models is underlined and the absolute best result is shown in bold.
The first group of results is related to morpheme order (Compound ) and transliteration (Transl ). Small vocabularies work best for these error categories based on the nature of the target problem. Overall, including character n-gram information improves transliteration accuracy.
The second group of results concerns grammatical number. In the German language, the pronoun "Sie" can mean "they" or "she." NMT models must choose the correct grammatical number for verbs by referring to source sentences. Including character n-gram features improves accuracy because the third-person singular English verbs in the source sentences include a feature indicating that they end with the letter "s."
The third group of results is related to agreement. Morphosyntactic agreement between a subject and verb (SubjVerb), and between determiners and nouns in noun phrases (NP), considers number and gender. Agreement between verbs and particles (VerbPart) and with auxiliary verbs (Aux) is improved by the proposed method. Excluding auxiliary verb agreement, these error categories are less common with larger vocabularies (i.e., shorter sequences).
The final group of errors changes the polarity of a sentence by either inserting or deleting polarity markers. When the negating word "nicht" or other polarity affixes are inserted (NichtIns, AffixIns), the models operating on longer sequences perform better because they are more likely to drop source sentence information.
There is no single optimal vocabulary size. When considering only the BLEU scores, larger vocabulary sizes are better. However, these improvements come as a trade-off with some error categories, such as NichtIns and AffixIns.

Corpus Size Experiments
Table 7 compares the results of different methods for different dataset sizes. We sampled random sentences from the English-German training corpus to construct training sets of increasing sizes. We then trained NMT models using these sets. The vocabulary size was approximately 64,000 subwords. We trained three models for each dataset.
One can see that the benefit of applying the proposed method decreases for larger training datasets. For the full dataset of approximately 4.5 M sentences, the proposed method does not improve the results of the baseline model.
The models trained on smaller datasets are more prone to overfitting. We believe the proposed method provides better generalization because subword features are derived from n-gram features.
This regularization effect is more prominent for small datasets. For large datasets, the need for regularization is lessened and the proposed method is less salient.

BT Experiments
BT (Sennrich et al. 2016b; Poncelas et al. 2018; Edunov et al. 2018) has been widely used to improve NMT. BT is particularly helpful for low-resource language pairs, but it requires a large monolingual corpus of the target language, which may not exist for some low-resource languages.
Table 8 Results for different BT approaches for English-to-Turkish translation. Using the proposed approach to perform BT on the monolingual sentences to generate a synthetic corpus (proposed-baseline) yields the best performance. It should be noted that the number of parameters for the proposed model after training is the same as that for the baseline. The scores marked with * show a statistically significant improvement with respect to baseline-baseline.
We measured how the proposed approach behaves in combination with BT. Table 8 compares the BLEU scores of five English-to-Turkish NMT models. The first two rows contain the best results from Table 4. The remaining rows represent the three models using BT; these models have vocabulary sizes of approximately 32,000 words. We translated four million randomly sampled sentences from the Turkish monolingual common crawl corpus. The model baseline-baseline uses the baseline model with an 8,000-word vocabulary from Table 4 to perform BT. The models proposed-baseline and proposed-proposed use the proposed model with a vocabulary size of 64,000 words to perform BT on the monolingual sentences. We determined that there is no benefit to using the proposed method for the forward model once the training data have been synthetically enhanced.
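The BT data flow can be sketched as follows. The toy "model" below is only a stand-in to show the shape of the pipeline; a real setup would use a trained target-to-source NMT system (e.g. one of the Tr-to-En models above):

```python
def back_translate(target_mono, reverse_model):
    """Synthesize a parallel corpus from target-language monolingual data:
    each target sentence is translated into the source language by a
    target-to-source model, yielding (synthetic source, real target) pairs."""
    return [(reverse_model(t), t) for t in target_mono]

# Stand-in 'model' that just reverses token order, to show the data flow.
toy_reverse_model = lambda sent: ' '.join(reversed(sent.split()))
mono = ['bir iki üç', 'dört beş']
synthetic = back_translate(mono, toy_reverse_model)

# The synthetic (source, target) pairs are then mixed with the genuine
# parallel data to train the forward translation model.
```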

Conclusion
We explored different approaches for utilizing character n-gram features in subword-level NMT without changing the baseline model architecture or its number of parameters. We analyzed the best approach for utilizing n-gram features and determined that a nonlinear function performs better than a (weighted) sum of multiple feature embeddings. For low-resource language pairs, translation accuracy (BLEU score) can be improved by utilizing n-gram features. This approach also improves the grammatical accuracy of the produced translations.
We explicitly analyzed how the vocabulary size affects grammatical accuracy and found that although models using larger vocabularies tend to produce fewer grammatical errors, they are difficult to train when training data is scarce. Using n-gram features to infer embeddings can help with training.
By varying the dataset size, we determined that the proposed method works particularly well on small datasets, but its benefits decrease as the dataset size increases. When combining the proposed method with BT, we determined that it is helpful to use the proposed method with a BT model, but it provides no benefits when applied to a forward translating model after incorporating a large synthetic dataset into the training data.
In the future, we wish to explore how the proposed approach can be applied to tasks other than NMT, such as named entity recognition. Additionally, it would be of interest to compare different feature selection algorithms.