A Selection Support System for Enterprise Resource Planning Package Components using Ensembles of Multiple Models with Round-trip Translation

An enterprise resource planning (ERP) package consists of software to support day-to-day business activities and contains multiple components. System engineers combine the most appropriate software components for system integration using ERP packages. Because component selection is a very difficult task, even for experienced system engineers, there is a demand for machine-learning-based systems that support appropriate component selection by reading the text of requirement specifications and predicting suitable components. However, sufficient prediction accuracy has not been achieved thus far as a result of the sparsity and diversity of training data, which consist of specification texts paired with their corresponding components. We implemented round-trip translation at both training and testing times to alleviate the sparsity and diversity problems, adopted pre-trained models to exploit the similarity of text data, and utilized an ensemble of diverse models to take advantage of models for both the original and round-trip translated data. Through experiments with actual project data from ERP system integration, we confirmed that round-trip translation alleviates the problems mentioned above and improves prediction accuracy. As a result, our method achieved sufficient accuracy for practical use.

• The domain adaptation of translation models and N-best translation improve the quality of augmented data.
• Our ensemble method with round-trip translation is applicable to different neural architectures, including models based on both long short-term memory (LSTM) and bidirectional encoder representations from transformers (BERT).
• Our error analysis reveals that the domain-adapted neural machine translation (NMT) model generates diverse expressions for a single component, which may increase generalization ability for unseen inputs.
• Subjective analysis revealed that our method achieves accuracy levels that satisfy the requirements for practical use.

Background
ERP packages are very flexible in their ability to combine various functions and components, and it is simple for a system designer to construct a system at a low cost using such packages.
Based on these features, ERP packages have been widely used since the 1990s for the purposes of standardization and system cooperation (Mabert et al. 2000; Olhager and Selldin 2003). The number of components has also increased with the introduction of new technologies such as automation and prediction (Madakam et al. 2020). As a result, a single ERP package can now provide as many as tens of thousands of enterprise system functions. Furthermore, the detailed functionality of each component can be changed through parameter settings. Therefore, no individual engineer can understand all components in detail.

When constructing a system, system designers must select components that satisfy customer requirements for every project because requirements differ from customer to customer. Expert system designers can select components and set parameters effectively based on their knowledge of the business area and past projects. However, non-experts, even if they are expert software engineers, cannot narrow the scope of investigation to select appropriate components. The incorrect selection of components leads to additional implementation work and inferior system quality, as well as increased cost and delivery time. Becoming an expert in this field requires many years of training because designers must acquire knowledge and experience from many projects.
Therefore, the number of experts cannot increase rapidly.
Selecting components is thus an important part of a project when designing an ERP system, and it tends to be a bottleneck. ERP package component selection support systems are designed to support designers by automatically suggesting appropriate components for a project. Once a candidate component is suggested, it is straightforward for ordinary software engineers, even non-experts, to verify whether the component satisfies customer requirements by actually operating it.
To support software development, various recommendation systems for software libraries or APIs have been developed (Zhou and Resnick 2009; Ouni et al. 2017; Almarimi et al. 2019; Chen and Xing 2016). Zhou and Resnick (2009) found that conversation co-mentions and text matching from Drupal.org online forums are effective for software module recommendation based on collaborative filtering. In particular, co-mentions work well with popular modules and text matching works well with less popular modules. Ouni et al. (2017) and Almarimi et al. (2019) used software descriptions as a data source to calculate the similarity between third-party libraries. Chen and Xing (2016) used tags for online articles and constructed a knowledge base to find analogical libraries in other programming languages. The main targets of previous studies on software development support have been open-source software, so the usage histories and descriptions of software can be easily obtained. In contrast, the usage histories and descriptions of ERP components are difficult to obtain because many ERP components are used to construct proprietary systems and the information of such systems is kept secret. Therefore, to support software development using ERP components, it is necessary to use limited resources effectively.
In a previous study on ERP software development support, Nakamura et al. (2014) used a business function ontology to suggest related components and graphically display the relationships between components as an explanation for suggestions. They combined business domain knowledge and the text similarity of ERP component descriptions and customer requirements. They achieved a high recall because their business function ontology covered related functions, but the precision was low because not all components in the same category were necessary for every project. Sakamoto et al. (2018) simplified ERP software development support as a classification task that uses ERP component text descriptions and introduced round-trip translation to increase the amount and variation of functional requirement specifications (FRSs). Round-trip translation is a method for augmenting data by translating text from one language into another and then translating the result back. This is a well-known data augmentation technique in the field of machine translation. Their work demonstrated that data augmentation and ensemble prediction yield improved accuracy, although the final accuracy still did not satisfy the requirements for practical use. In this study, we developed a system that performs a classification task, similar to Sakamoto et al. (2018), to suggest ERP components.

Approaches to Data Sparsity and Diversity Problems
There are three well-known approaches to improving the accuracy of text classification: data augmentation, using a pre-trained model with fine-tuning, and using ensembles of prediction results. To solve the problem of data sparsity and diversity, data augmentation increases the amount and diversity of data, and the use of a pre-trained model exploits the semantic similarity of text from a large amount of pre-trained data. Finally, using ensembles of prediction results enhances the generalization ability of models.
Data augmentation methods for text data include the use of round-trip translation and word replacement (Liu et al. 2020). Most existing data augmentation methods (Lu et al. 2006;Ragni et al. 2014;Ohga et al. 2018) increase the accuracy of learning models on general public data such as conversation data. However, we focus on how data augmentation contributes to increasing the accuracy of business data.
Round-trip translation increases the amount of data by obtaining paraphrases using machine translation (Yu et al. 2018; Wieting et al. 2017; Wieting and Gimpel 2018; Hu et al. 2019a, 2019b). Recently, it has become easier to perform neural machine translation (NMT) locally because NMT libraries and parallel corpora (Morishita et al. 2020) can be obtained easily. Therefore, applying NMT round-trip translation to business data has also become easier because local NMT maintains data confidentiality. Furthermore, we can expect to obtain high-quality translations of target-domain text through the domain adaptation of NMT (Koehn and Knowles 2017; Chu and Wang 2018), which leverages an in-domain corpus in addition to a general-domain corpus.
Word replacement generates similar text data by either substituting a word in text (Kobayashi 2018;Wei and Zou 2019), or by swapping or deleting words (Wei and Zou 2019). Wei and Zou (2019) introduced easy data augmentation (EDA), which improves text classification accuracy and consists of four simple operations. Two of the operations use synonyms to replace or insert words in the text data. The other operations are the deletion and swapping of words.
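Two of the EDA operations, random swap and random deletion, do not require a synonym resource, which makes them language independent. A minimal sketch in Python is shown below; the function names are illustrative, not taken from the EDA implementation.

```python
import random

def random_swap(tokens, n_swaps=1, rng=random):
    """Randomly swap two token positions n_swaps times (one EDA operation)."""
    tokens = list(tokens)
    for _ in range(n_swaps):
        if len(tokens) < 2:
            break
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1, rng=random):
    """Delete each token with probability p; always keep at least one token."""
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else [rng.choice(list(tokens))]
```

Because neither operation consults WordNet, both can be applied directly to tokenized Japanese text.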
Pre-trained language representation models, which are trained on large amounts of text, have been shown to improve various tasks in natural language processing (NLP). The transformer model (Vaswani et al. 2017) was recently proposed and has become the state-of-the-art method in NMT. Additionally, pre-trained models such as BERT (Devlin et al. 2019), XLNet (Yang et al. 2019), RoBERTa (Liu et al. 2019), ALBERT (Lan et al. 2020), and ELECTRA (Clark et al. 2020) are used in various NLP tasks. For handling Japanese, there are some BERT models available that are pre-trained on Japanese text (Shibata et al. 2019;Suzuki 2019; National Institute of Information and Communications Technology 2020).
Constructing ensembles of multiple models yields greater generalization ability than using each model individually (Sagi and Rokach 2018). Different ensemble methods vary in complexity.
Simpler methods calculate the linear sum of model results (Breiman 1996;Wolpert 1992). The linear sum is analogous to the "or" operation and works well for integrating different predictions.
The product of experts (PoE) (Hinton 2002) multiplies similar probability distributions from multiple models. PoE is analogous to the "and" operation and works well for integrating similar predictions. More complex methods use deep learning in ensemble models (Coscrato et al. 2020).
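The contrast between the two simpler ensemble styles can be illustrated with a small sketch; the function names and toy score distributions are illustrative, not from any of the cited implementations.

```python
def linear_ensemble(p1, p2, w=0.5):
    """Weighted linear sum of two score distributions ("or"-like behavior)."""
    return [w * a + (1 - w) * b for a, b in zip(p1, p2)]

def poe_ensemble(p1, p2):
    """Product of experts: multiply distributions and renormalize ("and"-like)."""
    prod = [a * b for a, b in zip(p1, p2)]
    z = sum(prod)
    return [v / z for v in prod] if z > 0 else prod
```

For example, with p1 = [0.6, 0.4, 0.0] and p2 = [0.5, 0.0, 0.5], the linear sum retains every class proposed by either model, whereas PoE assigns all mass to the single class that both models support.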

Task Definition of ERP Package Component Selection Support
For an ERP package component selection support system, we considered component selection as a classification task. The data used for training were labeled FRSs from past projects and the data for prediction were FRSs without labels from new projects. Each labeled FRS is a sample (x, y), where x is an FRS that specifies a component with the necessary function, y ∈ L is a component ID, which is treated as a class label, and L is the set of all labels. The task requires the return of (x, y * ) from an input x, where x is an FRS from a new project and y * is the predicted label.
To develop an ERP package component selection support system, we divided the labeled data D into training data D train and testing data D test . Writing M(D) for a model M trained on data D, the prediction for x is defined as

  y* = M(D train )(x),

where M(D train )(x) denotes the prediction of the model for x. The ensemble result of predictions from two models M 1 and M 2 for x is defined as

  E(M 1 (D train )(x), M 2 (D train )(x)),

where E denotes the ensemble method. For data augmentation, we denote the augmented data as Aug(D), where Aug represents augmentation and is replaced by the name of the augmentation method. For example, if the augmentation method is round-trip translation (RT in the equation) and the LSTM model is trained using augmented data combined with the original data, we define the prediction for x as

  LSTM({D train ∪ RT(D train )})(x).

Overview of EMMRT: Ensemble of Multiple Models with Roundtrip Translation
We propose a method called EMMRT to improve classification accuracy based on text data.
EMMRT uses ensembles of multiple models, where each model is trained either on the original data or on data augmented by round-trip translation. An overview of EMMRT is presented in the accompanying figure. EMMRT has the following features.
1. It uses pre-trained BERT for computing contextualized embedding and a fully connected linear network that minimizes cross-entropy loss for multi-class classification.
2. Each of the two classifiers is fine-tuned for FRSs using the augmented training data from round-trip translation based on domain-adapted translation models.
3. It uses ensembles of separately trained models: one trained on the original training data and the other trained on the augmented training data.
First, we prepared an NMT system trained on a large general bilingual corpus and a small ERP-related bilingual corpus, and then obtained the round-trip translation data for the training and testing data. Next, we separately trained models on the original training data and augmented training data. Finally, we applied the ensemble principle to the prediction scores of the models.
One score was obtained from the original testing data and predicted by the model that was trained on the original training data. The other score was the average score from the augmented testing datasets that was predicted by the model that was trained on the augmented training data. We defined a hyperparameter to determine the best mixing ratio based on a linear sum of prediction scores.
EMMRT can be defined as follows according to our notation:

  EMMRT(x) = Linear(2:1)( jBERT(D train )(x), avg_{x′ ∈ adaptRTn(x)} jBERT(adaptRTn(D train ))(x′) ),

where Linear(a:b) denotes a linear sum of the prediction scores with a ratio of a:b, jBERT denotes BERT trained on Japanese text, adaptRTn(D) denotes the round-trip translation result of D from the domain-adapted NMT using N-best translation, and avg denotes the average score over the round-trip translated variants of x. The mixture of the original data and round-trip translated data is denoted as {D ∪ adaptRTn(D)}.
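This combination step can be sketched as follows, assuming each model returns a probability-like score list over component IDs; `emmrt_score` is an illustrative name, not taken from the paper's code.

```python
def emmrt_score(orig_scores, aug_scores_list, ratio=(2, 1)):
    """EMMRT-style combination: a linear a:b sum of the original-model score
    and the average score over the round-trip-translated test variants."""
    a, b = ratio
    # Average the scores over the N x M translated variants of the input.
    avg_aug = [sum(s) / len(aug_scores_list) for s in zip(*aug_scores_list)]
    total = a + b
    return [(a * o + b * g) / total for o, g in zip(orig_scores, avg_aug)]
```

With the paper's 2:1 ratio, the original-data model contributes two thirds of the final score and the averaged augmented-data predictions contribute one third.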

Data Augmentation for Training Data
Improving the accuracy of prediction requires a large amount and diversity of training data.
In other words, the prediction of component IDs requires more labeled FRSs as training data.
However, it is difficult to prepare a large amount of business data. To increase the number of labeled FRSs, we used ja-en and en-ja round-trip translation to obtain FRSs with similar meanings, but different expressions. By using N-best translation, we controlled the number of FRSs in the augmented training data. The number of FRSs became N×M times larger when N-best translation for ja-en and M-best translation for en-ja were performed.
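The N×M multiplication can be sketched as nested N-best calls; here, `ja_en` and `en_ja` stand for hypothetical translator callables that return the top-n hypotheses, not for any particular NMT API.

```python
def round_trip_nbest(text, ja_en, en_ja, n=2, m=2):
    """Round-trip translate `text` with N-best forward and M-best backward
    hypotheses, yielding up to N x M paraphrases (duplicates and the
    original text are removed)."""
    paraphrases = []
    for en in ja_en(text, nbest=n):        # N English hypotheses
        for ja in en_ja(en, nbest=m):      # M back-translations for each
            if ja != text and ja not in paraphrases:
                paraphrases.append(ja)
    return paraphrases
```

With distinct hypotheses, a 2×2-best setting yields four paraphrases per FRS, matching the N×M growth described above.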
As mentioned previously, D train denotes the original training data and adaptRTn(D train ) denotes the training data augmented by the domain-adapted NMT using N-best translation. For comparisons to a baseline method, we also used RT(D train ) and RTn(D train ), which are outputs from round-trip translation without domain adaptation. The character n following RT indicates that N×M-best round-trip translation is used for augmentation.

Data Augmentation for Testing Data
In EMMRT, data augmentation is applied to both the training data and testing data, which are round-trip translated and fed into the model trained on round-trip translated data. This means that round-trip translation is also executed in the prediction phase. The final prediction is the ensemble result of two predictions: one from the FRSs in a new project and the other from the round-trip translation results of the FRSs. Because the NMT model is trained on both an ERP-related corpus and general corpus, the round-trip translated FRSs are expected to contain more diverse expressions than the original FRSs. Because we apply round-trip translation to both the training data and testing data, we can narrow the gap between the expressions in the training data and those in the testing data. N×M-best round-trip translation is also useful for testing data augmentation because translation candidates further reduce this gap.
Similar to the training data, we denote the original testing data as D test and augmented testing data (augmented by domain-adapted NMT using N-best translation) as adaptRTn(D test ).
RT(D test ) and RTn(D test ) are the outputs of round-trip translation without domain adaptation.

Experiments
In this section, we demonstrate that each element of EMMRT contributes to improving the accuracy of the LSTM and BERT methods by comparing it to a baseline method and the results of an ablation study. The experimental settings are explained first, followed by the experimental results.

Data Preparation
We collected data from actual ERP package system integration projects and from explanatory descriptions of components; the data statistics are summarized in Table 1. We gathered FRSs from three types of documents: 765 FRSs from requirement documents for real projects, 3,318 FRSs from explanations of components written in operational workflows and incident logs of specific components, and 1,067 FRSs from public information provided by SAP and METI.
The total number of classes in the training data was 521 and multiple labels were assigned to many instances. However, more than 85% of the instances had only a single label, so we simplified the task to single-label classification by expanding a multi-label instance into multiple single-label instances, where all labels were correct. For example, a training instance x with two labels y 1 and y 2 was treated as two instances (x, y 1 ) and (x, y 2 ), and both expanded instances were assessed as correct for loss calculation if either y 1 or y 2 was predicted. There were 314 classes with one to five instances, 107 classes with 6 to 10 instances, and 100 classes with 11 or more instances. Therefore, more than half of the classes contained five or fewer instances and there was a significant imbalance between classes.

To alleviate this imbalanced distribution, we increased the number of instances of minority classes by applying N×M-best round-trip translation to the training data, choosing N and M according to the number of instances in the original training data so that each class had more than 10 instances after round-trip translation. For example, a 4×3-best round-trip translation was applied to a class that had only one instance in the training data and a 3×2-best round-trip translation was applied to a class that had two instances. We used GiNZA (Matsuda 2020) as a tokenizer for LSTM-based models.
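The multi-label expansion and the per-class instance count after N×M-best augmentation can be sketched as follows; the helper names are illustrative.

```python
def expand_multilabel(samples):
    """Expand (text, [labels]) pairs into single-label (text, label) instances."""
    return [(x, y) for x, ys in samples for y in ys]

def instances_after_rt(count, n, m):
    """Instances per class after adding N x M-best round-trip paraphrases
    to the original `count` instances."""
    return count * (1 + n * m)
```

Both schedules mentioned above clear the 10-instance threshold: a one-instance class with 4×3-best translation yields 13 instances, and a two-instance class with 3×2-best translation yields 14.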
The testing data consisted of FRSs extracted from the same source documents as the training data under the condition that each component had two or more instances. A total of 506 instances were included. We applied a 2×2-best round-trip translation to the testing data and calculated an average score over the four resulting testing sets. To prepare the NMT systems, we used the JParaCrawl parallel corpus (v2.0) as a general corpus for training and an in-house ERP parallel corpus of approximately 40,000 words (collected from sources different from those of the FRSs) for domain adaptation.

NMT Systems for Round-trip Translation
We prepared two NMT systems for our experiments. One was a general NMT system that was trained on a general corpus and the other was a domain-adapted NMT system that was trained on both the general corpus and an ERP corpus. We selected fairseq as a modeling toolkit and SentencePiece (Kudo and Richardson 2018) as a tokenizer. We split sentences into sub-words with a vocabulary size of 32,000 and set the learning parameters based on the WMT19 NMT robustness task (Berard et al. 2019; Ott et al. 2018; Vaswani et al. 2017). We then removed sentences with lengths exceeding 250 sub-words. The validation and testing sets, both of which contained 10,000 sentences, were extracted from the complete JParaCrawl data. After training the translation model for 50 epochs on approximately 10 million sentences, the BLEU scores (Papineni et al. 2002) on the testing set were 29.4 for ja-en and 30.9 for en-ja.
We also investigated the effects of NMT domain adaptation. First, we split the ERP corpus into a training set (99.0%), validation set (0.5%), and testing set (0.5%). We oversampled (by 250 times) the ERP training set and merged it with the JParaCrawl training set before training the domain-adapted translation model. This is because some previous studies (Morishita et al. 2020;Song et al. 2020) reported that oversampling is better than fine-tuning for domain adaptation.
The BLEU scores for the testing set of the ERP corpus increased to 46.8 for ja-en and 34.6 for en-ja. These scores are 15.2 and 8.2 points higher, respectively, than those of the model trained using only JParaCrawl.
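The oversampling strategy amounts to repeating the in-domain corpus before merging it with the general corpus; a minimal sketch, assuming each corpus is represented as a list of sentence pairs, is shown below (the function name is illustrative).

```python
def build_adapted_corpus(general, in_domain, factor=250):
    """Merge the general corpus with the in-domain corpus oversampled
    `factor` times, as an alternative to fine-tuning for domain adaptation."""
    return list(general) + list(in_domain) * factor
```

The resulting training set is then shuffled and fed to the standard NMT training loop, so the in-domain sentence pairs are seen `factor` times as often as they would be otherwise.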

Classification using BERT Models
We used the Simple Transformers implementation of a BERT-based classifier. In this model, the BERT output layer is followed by a linear layer that transforms an input vector into another vector whose dimension is equal to the number of classes, as shown in Fig. 1. Japanese text preprocessing for each BERT model was performed based on the experimental procedure used for NICT BERT. First, we fine-tuned the BERT models using the original data. All BERT models produced poor accuracy after one epoch, so we set the number of fine-tuning epochs to 50. Because the fine-tuned BERT base model sometimes produced better results than the fine-tuned BERT large model when all training data were used, we divided the training data into multiple training and validation sets using a stratified 10-fold split (Kohavi 1995) and calculated the average accuracy over the resulting experiments to compare accuracies more precisely. We then calculated the prediction scores on the testing data of the multiple models that were fine-tuned either on the original data or on the augmented data for the ensemble. In this paper, we report the results of the Kyoto large BERT model.

Evaluation Metrics
Because the system outputs a ranking of the predicted component IDs ordered by certainty, we adopted indicators that evaluate ranking accuracy. Specifically, the results are evaluated by the P@k indicator, which is defined as the proportion of results in which the correct component ID is within the top-k-ranked component IDs predicted by the proposed system. We used P@1, P@3, and P@20 for comparisons to the baseline method; these reflect whether the top prediction of EMMRT is correct, whether the correct prediction lies near the top, and whether system designers can find correct components by examining the prediction results, respectively. The target prediction accuracy for EMMRT is P@3 = 80% (the correct component within the top three for 80% of inputs) because poor prediction accuracy would waste subsequent work. We also calculated the mean reciprocal rank (MRR) as a total ranking score.
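A sketch of P@k and MRR, assuming each prediction is a ranked list of component IDs (the function names are illustrative):

```python
def p_at_k(rankings, gold, k):
    """Fraction of inputs whose correct label appears in the top-k ranking."""
    hits = sum(1 for ranked, y in zip(rankings, gold) if y in ranked[:k])
    return hits / len(gold)

def mrr(rankings, gold):
    """Mean reciprocal rank of the correct label (0 contribution if absent)."""
    total = 0.0
    for ranked, y in zip(rankings, gold):
        if y in ranked:
            total += 1.0 / (ranked.index(y) + 1)
    return total / len(gold)
```

For instance, if one correct label is ranked first and another second, P@1 is 0.5, P@3 is 1.0, and the MRR is 0.75.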

Main Results
We selected the method from Sakamoto et al. (2018) as the baseline, which ensembles the prediction scores from the original testing data and the round-trip-translated testing data with a ratio of 2:1. The formulation for their method is defined as follows:

  Baseline(x) = Linear(2:1)( LSTM({D train ∪ RT(D train )})(x), LSTM({D train ∪ RT(D train )})(RT(x)) ),

where {D train ∪ RT(D train )} represents the mixture of the original training data and round-trip-translated training data. The baseline method constructs one model from the mixed training data and combines the prediction scores from the original testing data and the round-trip-translated testing data.
We also compared our results to those of three other methods. The first was fine-tuned BERT in the form of two fine-tuned pre-trained BERT models using the original training data and an ensemble of the two models with a mixing ratio of 0.5, for a fair comparison because EMMRT uses two BERT models. The second was EDA as an alternative data augmentation method. The third was PoE as an alternative ensemble method. EMMRT improves the accuracy by 13.6 points for P@1, 13.8 points for P@3, and 4.4 points for P@20, compared to the baseline method. EMMRT also outperforms the fine-tuned BERT without data augmentation by a significant margin (p < 0.001) based on the Wilcoxon signed-rank test. EDA cannot improve the accuracy of fine-tuned BERT without data augmentation. PoE works well when it is used to integrate the results of a linear sum. PoE improves the P@20 accuracy, but its calculation cost is three times greater than that of EMMRT.

Ablation Studies
In this subsection, we report the results of ablation studies using the LSTM-based and BERT-based models. We prepared a general NMT model trained using only JParaCrawl and used the same neural network architecture as the baseline method for classification. For comparison to the baseline method, we trained an LSTM-based model for 50 epochs using all training data. To confirm the effects of EMMRT, we changed the experimental settings one by one from EMMRT to the baseline method. Oversampling is denoted by oversample(D) in Table 3 and increases the number of instances by the same amount as N×M-best round-trip translation performed on the training data. The results of the ablation study for the LSTM-based model are presented in Table 3, whose format is the same as that of Table 2 ("baseline+DS+DA" corresponds to the same model as EMMRT, except for the use of LSTM instead of BERT). In Table 3, DS indicates data separation, meaning an ensemble of two models separately trained using either the first or second set of training data (designated in the ID column of Table 3), whereas the vanilla baseline is an ensemble of two models, both of which were trained on a mixture of the two types of training data. DA indicates domain adaptation and OS indicates oversampling.
In our experiment on the LSTM-based model, the oversampled training data improved classification accuracy. The accuracy was also improved by the domain adaptation of NMT and by handling the augmented data separately. The results from models that combined the original data with augmented data were worse than those of models using separate data. Compared to the results in Table 2, the most effective improvement was obtained by using BERT, which improved P@1 by more than 8 points. (We also examined fine-tuned BERT without an ensemble and fine-tuned BERT + EDA with an ensemble, but their accuracy did not exceed that of EMMRT.)
Similar to the LSTM-based model, the results of an ablation study on the BERT-based model are presented in Table 4, whose format is the same as that of Table 3. It is worth noting that these scores cannot be directly compared to those in Table 3 because the underlying model architectures differ.

Comparison to Other Augmentation Methods
We compared EMMRT to two other data augmentation methods: fine-tuned BERT and EDA (Wei and Zou 2019). For fine-tuned BERT, we fine-tuned two pre-trained BERT models using the original training data and used an ensemble of the two models with a mixing ratio of 0.5.
For EDA, we used the original EDA source code. EDA provides four simple operations for data augmentation in English, all of which have positive effects on prediction. However, half of the EDA operations rely on WordNet (Miller 1995) to obtain synonyms; we adopted random swap from EDA because this operation does not depend on WordNet and is applicable to Japanese text. We then constructed the ensemble using the data augmented by EDA instead of the round-trip translated data. In Table 5, EDA(D train ) denotes the randomly swapped stratified 10-fold training data.
Applying the EDA and EDA + ensemble methods yielded accuracy similar to that of the fine-tuned BERT without data augmentation, but the scores were inferior to those of EMMRT according to the Wilcoxon signed-rank test (p < 0.001). This experiment revealed that the ensemble component of EMMRT can also be effective with other data augmentation methods and that EDA can serve as an alternative augmentation method within the ensemble.

Combination with Other Data
We examined the extent to which accuracy can be improved by EMMRT if fewer data are available, i.e., when the amount of training data is reduced. D train1K indicates that the amount of training data is reduced to approximately 1,000 instances and D train2K indicates that it is reduced to approximately 2,000 instances. In both cases, EMMRT improves all P@k scores and improves the MRR score by more than 0.004 points.
Additionally, we evaluated methods using English language representation models and English augmented data translated by the ja-en translator instead of Japanese language representation models and Japanese data augmented by round-trip translation. Because various pre-trained language representation models are available for English, we used BERT, XLNet, RoBERTa, ALBERT, and ELECTRA with the Simple Transformers implementation in the same manner as in the other experiments. The ensemble results are listed in Table 7 (comparisons using 10-fold translated data and English language representation models). adaptTn(D) denotes the stratified 10-fold translated training data, which are increased by N-best translation with the domain-adapted NMT. We omitted the results for ALBERT and ELECTRA because they could not improve accuracy. Similar to round-trip translation, the classification accuracy of the ensemble results for BERT, XLNet, and RoBERTa, when fine-tuned with translated data, is higher than that of the fine-tuned BERT without data augmentation or ensembling. Although it is not reflected in the table, domain adaptation improves the classification accuracy of the translated data and has a positive effect on the ensembles of prediction results. For example, BERT and XLNet are improved by approximately three points in terms of P@1. In particular, RoBERTa significantly improves the accuracy (by approximately 9 points at P@1) with domain adaptation. This indicates that improving NMT translation quality facilitates the use of RoBERTa and increases the availability of language representation models.

Combination with Other Ensemble Methods
We also investigated whether multiple ensembles could improve accuracy. Given the large number of potential combinations, we used PoE (Hinton 2002) to integrate linear-sum ensemble results. PoE works well when it is used to integrate the results from a linear sum: P@3 and P@20 are improved by combining the ensemble results from multiple data and models. Although it is not reflected in the table, we also attempted to use an ensemble for integrating the EDA results (from Table 5) and found that this also improved accuracy slightly. However, the accuracy at low epochs was lower than that before the calculation of PoE, so PoE should be used only after learning has converged. Furthermore, combining more predictions makes interpretation more difficult and increases the amount of work required to update a model, so the best balance for practical use should be considered. We judged that round-trip translation with a linear-sum ensemble is the best combination for ERP packages because interpretation can be performed in Japanese and it is possible to increase the amount of training data while continuing to use a model; exploring further combinations is left to future work.

Discussion
In this section, we evaluate the classification accuracy of the proposed method from the perspective of its practical use for ERP package system integration and describe its prospects for additional benefits. We received a response from experienced system designers that "83% for P@3 is useful for work." This indicates that EMMRT exceeds the target accuracy of P@3 = 80% and achieves sufficient accuracy for practical use. As mentioned in Section 1, after referring to the suggestions from EMMRT, system designers conduct subsequent work using the suggested components by selecting and operating them to check whether the components satisfy the requirements of customers. Improving the accuracy of P@1 and P@3 compared to the baseline method reduces the time required for this subsequent work and improves its quality. Additionally, we received positive feedback that the proposed method can speed up the training of system designers by helping them understand and notice related functions.

Analysis
First, we confirmed the relationship between accuracy improvement and the number of instances in the training data. We divided all classes into three groups: classes containing five or fewer instances, classes containing 6 to 10 instances, and classes containing 11 or more instances.
Compared to fine-tuned BERT without an ensemble and round-trip translation, EMMRT improved the MRR of classes containing five or fewer instances in the training data without reducing the MRR of classes containing 11 or more instances in the training data.
Second, we analyzed the predictions based on round-trip translated data to check the number of classes whose classification accuracies changed. Because linear summation produced a better result than PoE in ensembles of predictions based on round-trip translated data, the characteristics of the round-trip translation data appear to be different from those of the original data.
Based on the differences between the predictions from the original data and those from the round-trip translated data, we analyzed how round-trip translation affects accuracy. We also analyzed the classes and instances that were significantly different before and after the domain adaptation of NMT. Because the data considered in this study contained a large number of classes and over half of the classes contained five or fewer instances, we counted the number of times the correct labels (P@1) were predicted during 10-fold iterations. The number of correct labels varied between 0 and 10 because in the worst case, none of the 10-fold models would predict the correct label and in the best case, all of the 10-fold models would predict the correct label.
We then calculated the differences between the models. The results are listed in Table 9. Table 9 presents the model performance before adaptation, model performance after adaptation, and differences between the two results from left to right. The comparison source and comparison destination are described by the training data and testing data that they use. The comparison results are divided into stages from ">+5" to "<−5," where ">+5" indicates that the comparison destination is dramatically improved and "<−5" indicates that the comparison destination is dramatically deteriorated compared to the source. For example, the case in which the correct label is predicted in 2 of 10 iterations in the source and 8 of 10 iterations in the destination is categorized as ">+5." The results of this analysis demonstrate that classification based on round-trip translated data is inferior to that based on original data, but new correct answers are acquired in some cases. A comparison between the results before and after domain adaptation reveals both improvement and deterioration, but the improvement is greater than the deterioration. This confirms that the superior prediction results from round-trip translated data also apply to ensemble results.
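The comparison in Table 9 can be sketched as a small counting routine. This is an illustration under stated assumptions: the per-class correct counts below are invented, the class names are hypothetical, and only the outermost bins (">+5" and "<-5") are taken from the text, so the intermediate bin boundaries are our assumption.

```python
from collections import Counter

def compare_correct_counts(src_counts, dst_counts):
    """Count classes per change stage. src_counts/dst_counts map each
    class to the number of 10-fold iterations (0-10) in which P@1 was
    correct. Bins other than >+5 and <-5 are illustrative assumptions."""
    def stage(d):
        if d > 5:
            return ">+5"      # destination dramatically improved
        if d > 0:
            return "+1..+5"
        if d == 0:
            return "0"
        if d >= -5:
            return "-1..-5"
        return "<-5"          # destination dramatically deteriorated
    return Counter(stage(dst_counts[c] - src_counts[c]) for c in src_counts)

# Hypothetical per-class counts before and after NMT domain adaptation
src = {"ClassA": 2, "ClassB": 7, "ClassC": 5}
dst = {"ClassA": 8, "ClassB": 7, "ClassC": 3}
print(compare_correct_counts(src, dst))  # ClassA moves 2 -> 8, i.e. ">+5"
```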
We also compared the unique word n-grams in the original data and round-trip translated data to examine whether the sparsity and diversity problems were alleviated. For diversity, we examined how many new n-grams that were not in the original data were generated by roundtrip translation. For sparsity, we calculated the average number of shared n-grams between the training data and round-trip translated data for each class of ERP components. The results are listed in Table 10.
As a result of round-trip translation, the number of unique n-grams in the training data decreased and that in the testing data increased. In the training data, approximately 31% of 1-grams and 47% of 2-grams were new expressions, and in the testing data, approximately 45% and 70% were new, respectively. However, in classes where the number of shared n-grams per class increased, the accuracy during the 10-fold iterations did not improve in all cases. These results indicate that round-trip translation produces new expressions and makes the training data and testing data more similar.
Therefore, round-trip translation alleviates the problems of diversity and sparsity in some cases.
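The two n-gram measurements used here, novelty (diversity) and overlap (sparsity), can be sketched as set operations over word n-grams. The helper names and the toy sentences are our own; the sentences echo the cash-flow example discussed below.

```python
def ngrams(tokens, n):
    """Set of unique word n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novelty_rate(original, translated, n):
    """Fraction of the translated text's unique n-grams that do not
    appear in the original text (the diversity measurement)."""
    orig, new = ngrams(original, n), ngrams(translated, n)
    return len(new - orig) / len(new) if new else 0.0

def shared_count(train_tokens, test_tokens, n):
    """Number of n-grams shared between training and testing text
    (the sparsity measurement)."""
    return len(ngrams(train_tokens, n) & ngrams(test_tokens, n))

orig = "a cash flow table".split()
rtt  = "a schedule of funding".split()
print(novelty_rate(orig, rtt, 1))   # 3 of 4 unigrams are new -> 0.75
print(shared_count(orig, rtt, 1))   # only "a" is shared -> 1
```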
For example, the original expression (shikin guri hyo) 'a cash flow table' is changed to (shikin sukejuru) 'a schedule of funding' by round-trip translation, and another original expression (nyusyukkin yotei) 'a plan of receiving and making payments' is changed to (nyusyukkin sukejuru) 'a schedule of receiving and making payments.' Both original expressions are related to cash flow and funding management, and the new expressions are relatively similar to each other. Conversely, in other cases, round-trip translation changed expressions into new expressions with different meanings and the classification accuracy deteriorated. For example, (sho-kai) 'refer' was changed to (toiawase) 'inquiry' because (sho-kai) is also used to express an inquiry. Consequently, in some cases, more accurate predictions that could not be derived from the original data were produced from round-trip translated data, and the ensemble model took advantage of both types of results.
14 The same trends were observed in shared 3-grams and 4-grams.

Conclusion
To support system designers in ERP package integration, we improved the accuracy of the automatic suggestion of software components. To improve accuracy, we applied data augmentation through round-trip translation to both training data and testing data using a domain-adapted NMT and proposed an effective ensemble method for round-trip translated data. As a result, we achieved a sufficient level of accuracy for practical use.
The proposed method can also be applied to text classification tasks other than ERP component suggestion and may improve accuracy in various business areas where accuracy is currently inadequate owing to a lack of data for machine learning.
This study demonstrated that machine translation, which has already been used to support human-to-human communication, can also be used to support human-machine interaction. We found that NLP can even benefit areas with limited data by using a large-scale trained model with rich text data, such as BERT, and using NMT round-trip translation for data augmentation.