Probing Simple Factoid Question Answering Based on Linguistic Knowledge

Recent studies have indicated that existing systems for simple factoid question answering over a knowledge base are not robust across different datasets. We evaluated the ability of a pretrained language model, BERT, to perform this task on four datasets, Free917, FreebaseQA, SimpleQuestions, and WebQSP, and found that, like other existing systems, a BERT-based system cannot solve them robustly. To investigate the reason for this problem, we employ a statistical method, partial least squares path modeling (PLSPM), with 24 BERT models and two probing suites, SentEval and GLUE. Our results reveal that the existing BERT-based system tends to depend on the surface and syntactic features of each dataset, and that this dependence undermines the generality and robustness of the system's performance. We also discuss the reason for this phenomenon by considering the features of each dataset and the method used to evaluate the simple factoid question answering task.

Introduction

Simple factoid question answering, a subtask in the QAKB field, is considered to be a task that has already been solved (Petrochuk and Zettlemoyer 2018). This task is a simplified version of QAKB because simple factoid questions require only one fact (subject, relation, object) to be answered. Many papers have reported successful accuracies on SimpleQuestions (Bordes et al. 2015), the largest benchmark dataset of simple factoid question answering. However, high accuracy on SimpleQuestions does not mean that simple factoid question answering has been conquered.
Simple factoid questions in other datasets applicable to the QAKB task, such as Free917 (Cai and Yates 2013), WebQSP (Yih et al. 2016), and FreebaseQA (Jiang et al. 2019), remain unsolved, even by systems that succeed on SimpleQuestions. Han et al. (2020b) investigated this problem with four datasets and four systems. They revealed that existing systems such as BuboQA (Mohammed et al. 2018), HR-BiLSTM (Yu et al. 2017), KBQA-Adapter, and KEQA cannot reach the upper bound accuracies for simple factoid questions in WebQSP and FreebaseQA. Moreover, they reported that existing systems showed a lack of transferability in experiments across pairs of datasets. Although their results indicate that the systems proposed in previous studies are limited in terms of general simple question answering, the effectiveness of pretrained language models for this problem has not yet been examined.
In this paper, we examine the effectiveness, transferability, and robustness of BERT by assessing its performance on simple factoid question answering. Because many studies have reported successful results for various natural language processing tasks with BERT (Devlin et al. 2019; Wolf et al. 2020), we expected that BERT would be able to solve simple factoid question answering regardless of the differences among datasets. To test this hypothesis in the same experimental setting as Han et al. (2020b), we employed an advanced version of BuboQA (Mohammed et al. 2018), which originally uses an LSTM and a CNN to encode a given question: following Lukovnikov et al. (2019), we replaced all the encoders of BuboQA with BERT. In our experiments, even though our BERT-based system attained accuracies higher than those of BuboQA on the four datasets, we also found that it is still limited in robustness and transferability, similar to the original BuboQA.
We conducted a statistical analysis to examine the inner workings of BERT and determine why BERT fails at general simple factoid question answering. Recently, several studies (Jawahar et al. 2019; Ravishankar et al. 2019; Kovaleva et al. 2019) attempted to explain the inner workings of BERT using probing tasks such as SentEval (Conneau and Kiela 2018) and GLUE (Wang et al. 2019a). However, because these studies base their conclusions on only one or a few observations, their results can conflict with one another, as also happened in earlier studies that probed word embeddings (Schnabel et al. 2015; Chiu et al. 2016; Wang et al. 2019b; Han et al. 2020a).
Here, we propose a different approach in which this problem is defined as a statistical examination of causal relationships between the results of probing tasks and the results of simple factoid question answering. Han et al. (2020a) employed partial least squares path modeling (PLSPM) (Wold 1982) to investigate causal relationships between linguistic knowledge and NLP downstream tasks on word embeddings. Compared with other statistical methods, the advantage of PLSPM is its robustness to small sample sizes (Tenenhaus et al. 2005a). In addition, it is easy to apply this method to analyze experimental results because it requires fewer assumptions on the observed variables (Tenenhaus et al. 2005b). Hence, we estimated PLSPM models using 24 BERT models (Turc et al. 2019) to explain the results of our BERT-based system from the results of the probing tasks. We found that the accuracy of our BERT-based system can be causally explained by the accuracy of probing tasks on surface and syntactic information in our PLSPM models. This indicates that existing simple factoid question answering systems may to a large extent depend on the surface and syntactic information of the target dataset.
Our study makes the following contributions.
• We show that a system that employs a pretrained language model, such as BERT, still experiences the same transferability and robustness problems as other existing systems.
• We employ PLSPM, a statistical method, to probe and examine the relationship between the inner workings of BERT and linguistic knowledge.
• The PLSPM analysis with SentEval and GLUE revealed that the accuracies of probing tasks for semantic understanding are not causally related to the accuracies of our BERT-based system, whereas those of surface and syntactic tasks can explain them with significant path coefficients.
• According to additional error analyses, the method used to evaluate simple factoid question answering and the source of each dataset play an important role in this phenomenon.

Question answering over knowledge base
Question answering over a knowledge base is a semantic parsing task. It aims to find the correct answer to a given question in a target knowledge base, such as Freebase.
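For intuition, a simple factoid question can be answered with a single fact triple from the knowledge base. A minimal illustration follows; the entity id and answer value are invented for exposition, and only the relation name follows Freebase's actual naming style.

```python
# One simple factoid question and the single Freebase-style fact
# (subject, relation, object) that answers it.
question = "what is barack obama's profession"
fact = ("barack_obama",               # subject entity (hypothetical id)
        "people.person.profession",  # relation
        "politician")                # object, i.e., the answer
subject, relation, answer = fact
```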
Free917 (Cai and Yates 2013) is one of the early datasets for this task over Freebase. Free917 was constructed by two annotators who wrote 917 questions using Freebase Commons, a subset of Freebase. However, this dataset contains only 917 questions and is too small to sufficiently train a machine-learning model. Furthermore, Berant et al. (2013) noted that the questions in Free917 tend to contain the label of the gold relation directly. Therefore, other datasets, such as WebQuestions (Berant et al. 2013), SimpleQuestions (Bordes et al. 2015), and FreebaseQA (Jiang et al. 2019), have since been proposed.
WebQuestions was proposed first. This dataset contains 5,810 questions that were collected using the Google Suggest API to ensure the naturalness of the questions. Although WebQuestions succeeds in aggregating more naturally written questions than Free917, it does not contain executable formal queries. To overcome this problem, Yih et al. (2016) proposed WebQSP, a subset of WebQuestions containing annotated SPARQL queries. SimpleQuestions is the most popular dataset for this task because of its size of over 100,000 questions. The questions in SimpleQuestions were generated by crowd workers referring to facts randomly sampled from Freebase. The approach FreebaseQA uses to aggregate questions differs from that of the above-mentioned datasets: Jiang et al. (2019) annotated 28,348 questions from TriviaQA (Joshi et al. 2017) with Freebase knowledge, aiming to provide a dataset with more difficult and naturally written questions, such as those used in trivia quizzes.
Previously, researchers tried to solve this task by transforming questions into logical forms (Berant et al. 2013; Reddy et al. 2016; Trivedi et al. 2017). Recently, many studies have reported state-of-the-art accuracies for the above datasets using neural network-based systems (Yu et al. 2017; Petrochuk and Zettlemoyer 2018; Mohammed et al. 2018; Huang et al. 2019). For example, BuboQA (Mohammed et al. 2018), a well-known benchmark system for this task, consists of the following four submodules:
• entity detection, a sequence-tagging network to find the span of an entity in a given question.
• entity linking, a string-match module to link an entity in a knowledge base with the span of an entity.
• relation prediction, a sentence classifier network to predict a relation in a knowledge base from a given question.
• evidence integration, a scoring module to rerank the predicted entities and relations (a minimal sketch of this four-module pipeline is given at the end of this subsection).
Petrochuk and Zettlemoyer (2018) argued that their neural network-based model nearly solved SimpleQuestions. They examined the upper bound accuracy of SimpleQuestions and showed that their system, which employs a network similar to that of BuboQA, almost reached this upper bound. However, success on SimpleQuestions does not mean that the task of simple factoid question answering has been solved in general.
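The sketch below makes the division of labor among the four submodules concrete. It is our own illustration of a BuboQA-style pipeline, not BuboQA's actual code; all class and method names are hypothetical.

```python
from typing import List, Tuple

class BuboQAStylePipeline:
    """Illustrative four-module pipeline for simple factoid QA.
    Each method stands in for the corresponding BuboQA submodule."""

    def detect_entity(self, question: str) -> str:
        """Sequence tagging: return the entity span mentioned in the question."""
        raise NotImplementedError

    def link_entity(self, span: str) -> List[str]:
        """String matching: return candidate knowledge-base entity ids for the span."""
        raise NotImplementedError

    def predict_relation(self, question: str) -> List[str]:
        """Sentence classification: return candidate knowledge-base relations."""
        raise NotImplementedError

    def integrate_evidence(self, entities: List[str],
                           relations: List[str]) -> Tuple[str, str]:
        """Rerank (entity, relation) pairs and return the best fact key."""
        raise NotImplementedError

    def answer(self, question: str) -> Tuple[str, str]:
        span = self.detect_entity(question)
        entities = self.link_entity(span)
        relations = self.predict_relation(question)
        return self.integrate_evidence(entities, relations)
```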

BERT and BERTology
Contextual embeddings, such as ELMo (Peters et al. 2018) and BERT (Devlin et al. 2019), have become indispensable tools in natural language processing, in addition to non-contextual distributional word representations, such as word2vec (Mikolov et al. 2013) and fastText (Bojanowski et al. 2017). Although many studies have employed contextual embeddings to achieve state-of-the-art performance for a variety of natural language processing tasks, the reason for their usefulness has not been clearly explained. Therefore, researchers have investigated the inner workings of contextual embeddings to explain their effectiveness, especially with respect to BERT.
These studies, known as BERTology (Rogers et al. 2020), usually accept the traditional hypothesis that the linguistic knowledge encoded in a language model can explain the accuracies of downstream NLP tasks (Chiu et al. 2016). Thus, BERTology includes various analyses of both the structural features of BERT and the linguistic knowledge encoded in BERT.
The linguistic knowledge encoded in BERT has been examined in various ways, such as attention analysis (Liu et al. 2019), edge probing (Tenney et al. 2019b), and comparison with intrinsic evaluations such as SentEval and GLUE. For example, Tenney et al. (2019a) showed through edge probing analysis that BERT encodes syntactic information, such as parts of speech, chunking spans, and syntactic and semantic roles.
Although BERTology has succeeded in determining what linguistic knowledge is encoded in BERT and how, these studies usually derive their conclusions from one or a few observations without any statistical analysis or verification. Because many language models have been proposed, one or a few observations are not sufficient to support a general conclusion in BERTology. Indeed, previous studies that probed word embeddings sometimes reported conflicting results (Schnabel et al. 2015; Chiu et al. 2016; Rogers et al. 2018; Wang et al. 2019b; Han et al. 2020a). Therefore, the general relationship between encoded linguistic knowledge and the performance of downstream applications is not yet clear.

Statistical analysis of word embeddings
Traditionally, the linguistic knowledge encoded in a language model has been believed to be helpful for solving downstream natural language processing tasks (Chiu et al. 2016). Before BERTology, many researchers (Schnabel et al. 2015; Chiu et al. 2016; Rogers et al. 2018; Wang et al. 2019b) attempted to verify this intuition with distributional word representations, such as word2vec (Mikolov et al. 2013). Han et al. (2020a) argued that these previous studies have two limitations. First, they only conducted correlation analyses between the results of two tasks, for example, the accuracy of word similarity and the accuracy of POS tagging. Because a downstream NLP task usually requires multiple types of linguistic knowledge to be solved, a correlation analysis between a probing task and a downstream task has limited ability to reveal their causal relationship. Second, as we mentioned in Section 2.2, researchers sometimes reported conflicting results for the same issue because their conclusions were usually based on a few observations (Schnabel et al. 2015; Chiu et al. 2016; Rogers et al. 2018; Wang et al. 2019b; Han et al. 2020a).
In the field of statistics, researchers have employed structural equation modeling (SEM) (Jöreskog 1970) to evaluate causal assumptions among variables. In SEM, a causal diagram representing the causal assumptions for the target variables is first suggested by the user. Figure 1 shows an example of a causal diagram involving probing tasks for semantic knowledge and the QAKB task. Figure 1 contains the causal assumptions listed below:
• Encoded semantic information (y_1) affects the accuracies of the probing tasks (x_{1,1}, x_{1,2}, ..., x_{1,n}).
• Encoded semantic information (y_1) affects the accuracy of the QAKB task (y_2).
Fig. 1 Example of a causal diagram

Based on a given causal diagram and the observed variables, SEM estimates the regression formulas for each causal hypothesis. We refer to an estimated regression as a structural equation.
Compared with correlation analysis, the advantage of SEM is its ability to handle multiple variables at once. Moreover, it provides many reliable indexes for evaluating causal assumptions.
For example, the score of y_2 can be predicted by the following structural equation:

y_2 = β_1 · y_1 + ζ_1,

where β_1 is the estimated weight and ζ_1 is the estimated error term. This analysis enables us to measure how much of a latent variable's score can be predicted by its structural equation, which researchers refer to as the explainability of that latent variable.
Han et al. (2020a) employed partial least squares path modeling (PLSPM) (Wold 1982), one variant of SEM, to investigate the relationship between the linguistic knowledge encoded in a language model and the accuracies of NLP tasks. They selected PLSPM because it places fewer requirements on the observed variables and is robust with small samples. They examined causal diagrams that assume causal relationships between the accuracies of probing tasks and downstream tasks, using training algorithms, corpora, and hyperparameters as variables. Although they verified some relationships between linguistic knowledge and downstream task accuracy in their causal diagrams, they did not consider contextual embeddings.
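To make the estimation concrete, the following is a minimal, self-contained sketch of PLS path modeling for a single path between two latent blocks (mode A outer estimation, centroid inner scheme). It illustrates the algorithm family, not the exact implementation used in this study; all names are ours.

```python
import numpy as np

def standardize(m):
    """Standardize an array to zero mean and unit variance (per column)."""
    return (m - m.mean(axis=0)) / m.std(axis=0)

def plspm_single_path(X1, X2, tol=1e-6, max_iter=500):
    """Estimate the path y_2 = beta_1 * y_1 + zeta_1 between two latent
    variables, each measured by an observed block (rows = samples,
    e.g., 24 BERT models; columns = task accuracies)."""
    X1, X2 = standardize(X1), standardize(X2)
    w1 = np.ones(X1.shape[1])
    w2 = np.ones(X2.shape[1])
    for _ in range(max_iter):
        # Outer estimation: latent scores as standardized weighted sums.
        y1 = standardize(X1 @ w1)
        y2 = standardize(X2 @ w2)
        # Inner estimation (centroid scheme): each latent variable's
        # proxy is its neighbour's score, signed by their correlation.
        c = np.corrcoef(y1, y2)[0, 1]
        sign = 1.0 if c >= 0 else -1.0
        z1, z2 = sign * y2, sign * y1
        # Outer weight update (mode A): covariance of indicators and proxy.
        w1_new = X1.T @ z1 / len(z1)
        w2_new = X2.T @ z2 / len(z2)
        converged = max(np.abs(w1_new - w1).max(),
                        np.abs(w2_new - w2).max()) < tol
        w1, w2 = w1_new, w2_new
        if converged:
            break
    y1, y2 = standardize(X1 @ w1), standardize(X2 @ w2)
    beta1 = (y1 @ y2) / (y1 @ y1)  # OLS slope: the path coefficient
    r_squared = beta1 ** 2          # one predictor: R^2 = beta_1^2
    return beta1, r_squared
```

In our setting, X1 could hold, say, the surface-task accuracies of the 24 BERT models and X2 the accuracies of a BertQA submodule, so that beta_1 measures how well the former explains the latter.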

Experimental settings
In this study, we employ the system suggested by Lukovnikov et al. (2019), an extended version of BuboQA that uses BERT. BuboQA, the benchmark system for simple factoid question answering (Mohammed et al. 2018), consists of four submodules: entity detection, entity linking, relation prediction, and evidence integration, as mentioned in Section 2.1. Among them, entity detection and relation prediction use neural network-based machine learning, whereas the other submodules employ string matching or rule-based weight calculation. The original BuboQA employed LSTM- and CNN-based encoders for entity detection and relation prediction; however, Lukovnikov et al. (2019) reported that performance increased when these encoders were replaced with BERT. Therefore, we employ the model proposed in their paper as our BERT-based simple factoid question answering system.
Note that we implemented it ourselves because the official GitHub repository of Lukovnikov et al. (2019) is no longer available. Our implementation followed the instructions in the original paper as much as possible, except for the design of the network for entity detection and relation prediction. In the original paper, the two submodules were combined into one classifier to improve the accuracy of the proposed system. However, our aim was to examine the performance of BERT relative to the original BuboQA. Therefore, we kept the original design of BuboQA and changed only its encoders. Hereinafter, we refer to this system as BertQA to distinguish it from both the original BuboQA and the system suggested by Lukovnikov et al. (2019).
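As an illustration of this encoder swap, a minimal sketch using the Hugging Face transformers library is shown below. It reflects the description above (token classification for entity detection, sequence classification for relation prediction), not the authors' released code, and the label counts are placeholders.

```python
import torch
from transformers import (BertTokenizerFast,
                          BertForTokenClassification,
                          BertForSequenceClassification)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Entity detection: tag each token as inside/outside the entity span.
entity_detector = BertForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # 2 = {O, ENTITY}; placeholder

# Relation prediction: classify the whole question into one KB relation.
num_relations = 1000  # placeholder: size of the relation inventory
relation_predictor = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=num_relations)

question = "what is barack obama's profession"
inputs = tokenizer(question, return_tensors="pt")
with torch.no_grad():
    span_logits = entity_detector(**inputs).logits        # (1, seq_len, 2)
    relation_logits = relation_predictor(**inputs).logits  # (1, num_relations)
```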
The other experimental settings we used are the same as those of Han et al. (2020b). They prepared four datasets, Free917, FreebaseQA, SimpleQuestions, and WebQSP, to examine the robustness and transferability of existing systems. Because Free917, FreebaseQA, and WebQSP were not designed for simple factoid question answering, they contain questions that require multiple facts to be answered. To guarantee the same domain and level of difficulty as SimpleQuestions, Han et al. (2020b) removed all questions that require two or more facts to be answered or that cannot be solved with the FB2M dataset (Bordes et al. 2014), which is the source dataset of SimpleQuestions and a subset of Freebase. We employed their filtered datasets, hereinafter F917, FBQ, SQ, and WQ, for our experiments.

Table 3 Labeling of questions in the WQ validation split that fail when the training data changes from WQ to SQ.
ambient: predicted entity can reach the answer, but is not the same as the gold entity (5)
ambirel: predicted relation can reach the answer, but is not the same as the gold relation (24)
null: no predicted entity or relation (13)
total: 75

Results
However, FreebaseQA annotated this question as a simple factoid question.
Because this indicates that the problems BertQA experiences when solving FBQ may be caused by FBQ itself, we excluded FBQ from further analyses in this study. Table 3 indicates that approximately 70% of the questions failed with respect to relation prediction. This problem was also reported by Han et al. (2020b), who argued that the errors in relation prediction are the main obstacle when a system trained on one dataset is evaluated on another.

Statistical Analysis of Simple Factoid Question Answering with BERT
In this section, we present a statistical analysis of the results of BertQA to understand why BertQA still has the same problems as other existing systems. Han et al. (2020b) attempted to explain the results they obtained for simple factoid question answering systems, but their analyses were limited to ad-hoc error analysis. This prompted us to employ PLSPM analysis, which is an a priori way to investigate the inner workings of existing systems.

Causal diagram and data preparation
In this study, we aimed to investigate the inner workings of BertQA when it solves simple factoid question answering. As explained in Section 2.3, to employ PLSPM, the causal assumptions for the target variables should first be expressed as a causal diagram. We drew inspiration for our causal diagram from previous studies of word embedding models, which assumed that the accuracies of probing tasks can explain the accuracies of NLP tasks (Chiu et al. 2016). Following this traditional assumption, we suggest the causal diagrams shown in Figure 2. We decompose BertQA into its submodules, including entity detection, entity linking, relation prediction, and evidence integration, following the original paper of BuboQA. Because entity linking and evidence integration do not use a BERT-based machine-learning system, we set causal hypotheses only between the probing tasks and the two machine-learning submodules of BertQA, entity detection and relation prediction. Therefore, our PLSPM models based on the diagrams in Figure 2 aim to estimate structural equations for the following hypothesis:
• The accuracies of the probing tasks, SentEval and GLUE, can explain the accuracy of entity detection and the accuracy of relation prediction.
For the observed variables in our PLSPM models, we prepared the results of SentEval, GLUE, and the submodules of BertQA with 24 BERT models. We followed the original paper of each task to define the indicator for measuring its performance. Furthermore, we grouped the employed tasks according to their original papers to obtain the composite latent variables, except for QQP and WNLI in the GLUE dataset; these two tasks have low correlation coefficients with the other tasks in the same category, which negatively affects the reliability of the PLSPM models. Our code will be shared in a GitHub repository for reproducibility.
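For concreteness, the structure of one such model (with SentEval on the probing side) can be written down as plain data. The sketch below is illustrative: the grouping follows the surface/syntactic/semantic categories named in this paper, but the indicator lists and names are placeholders, not our full configuration.

```python
# Illustrative specification of one causal diagram (SentEval -> BertQA).
# Each latent variable is measured by a block of observed accuracies,
# with one row per BERT model (24 rows in total).
blocks = {
    "surface":   ["sentence_length", "word_content"],      # placeholder lists
    "syntactic": ["tree_depth", "top_constituents", "bigram_shift"],
    "semantic":  ["tense", "subject_number", "object_number"],
    "entity_detection":    ["entity_detection_accuracy"],   # BertQA submodule
    "relation_prediction": ["relation_prediction_accuracy"],
}
# Directed paths to estimate: probing-task latents -> submodule latents.
paths = [(src, dst)
         for src in ("surface", "syntactic", "semantic")
         for dst in ("entity_detection", "relation_prediction")]
```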

Results
First, we examine the extent to which the accuracies of SentEval can predict the accuracies of BertQA through PLSPM analysis. Table 5 lists the path coefficients of each path in SentEval-FBQ, SentEval-SQ, and SentEval-WQ. When a PLSPM model is interpreted, the path coefficient is regarded as the explainability of the target path. We do not list paths whose p-value is larger than 0.05. Note that a path with a negative coefficient is still considered a meaningful index, because a meaningless path would be rejected on the basis of its p-value. As a result, we can conclude that the accuracies of SentEval meaningfully explain the accuracies of BertQA, except for the semantic tasks of SentEval. This indicates that BertQA cannot overcome the discrepancy between different datasets because the inconsistency in the distribution of questions across datasets is related to surface and syntactic knowledge.
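Significance of a PLSPM path coefficient is commonly judged by resampling. The sketch below illustrates the idea with a simple bootstrap over model-level latent scores; it is a simplified stand-in for the actual test, and all names are ours.

```python
import numpy as np

def bootstrap_path(y1, y2, n_boot=2000, seed=0):
    """Bootstrap the structural path y_2 = beta_1 * y_1 + zeta_1.
    y1, y2: latent scores, one value per BERT model (here, 24)."""
    rng = np.random.default_rng(seed)
    n = len(y1)
    betas = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)   # resample models with replacement
        x = y1[idx] - y1[idx].mean()
        y = y2[idx] - y2[idx].mean()
        betas[b] = (x @ y) / (x @ x)  # OLS slope on the resample
    lo, hi = np.percentile(betas, [2.5, 97.5])
    return betas.mean(), (lo, hi)     # estimate and 95% interval

# A path whose interval excludes zero is kept; note that a kept path
# can still have a negative coefficient, as discussed above.
```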
We find it more difficult to interpret the results of the PLSPM models that use the GLUE dataset, as is clear from Table 6. Among the GLUE tasks, only the accuracies of the inference tasks explain the accuracy of entity detection with a p-value < 0.05, and they report negative coefficients for explaining the accuracies of relation prediction. Instead, the accuracies of single-sentence tasks, such as CoLA, mainly explain relation prediction. One interesting point is that the accuracies of the similarity and paraphrase tasks are rejected as explanations of the accuracy of BertQA, with p-values > 0.05. The datasets of the similarity and paraphrase tasks, such as MRPC, require an understanding of the semantics of the given sentences. This means that encoded semantic knowledge and the ability to understand the given sentences are not very helpful in explaining the accuracies of BertQA, even in our PLSPM model with the GLUE dataset.
In terms of the overall results of our PLSPM models, we found another problem regarding the discrepancy between the simple factoid question answering datasets: the Goodness-of-Fit (GoF) values of the PLSPM models for WQ are lower than those for the other datasets. A further issue is the discrepancy in the GoF values between the PLSPM models using SentEval and those using GLUE. As shown in Table 7, the PLSPM models using GLUE have higher GoF values than those using SentEval. This indicates that the linguistic knowledge measured by SentEval has lower explainability than that measured by GLUE under our causal diagrams in Figure 2. We suppose there are two reasons for this phenomenon. First, Conneau and Kiela (2018) reported that the accuracy of a probing task sometimes shows no correlation with the accuracy of any downstream task. For example, the tree depth and bigram shift tasks correlate with only one of the 17 downstream tasks. This means that the probing tasks in SentEval are not sufficiently robust to explain the accuracy of downstream tasks.
Second, SentEval was proposed for sentence encoders, such as SkipThought (Kiros et al. 2015) and InferSent (Conneau et al. 2017), before the advent of contextual embeddings. We suppose that this is why the probing tasks in SentEval have lower explainability for the BERT-based model.

BertQA and semantic understanding
The results in Table 5 show that BertQA largely depends on encoded surface and syntactic knowledge. This also means that the surface and syntactic features of each dataset affect the accuracy of the BertQA model. On the other hand, the accuracies of semantic information tasks cannot explain the accuracies of BertQA. As mentioned in Section 3.2, semantic understanding is required to solve questions that are ambiguous across datasets, such as those involving people.person.profession. Therefore, the lack of semantic understanding in BertQA is an important factor in its failure at general simple factoid question answering.
The method used to evaluate the simple factoid question answering task, namely matching the predicted subject and relation with the gold data (Bordes et al. 2015), is one reason for this problem. Particularly in SimpleQuestions, the subject and the relation can be extracted from the question without semantic understanding, because the questions usually contain the labels of the subject and the relation (Serban et al. 2016). This means that the evaluation method compels existing QA models to concentrate on surface and syntactic features. Furthermore, the possibility of multiple correct facts for a given question is another problem with this evaluation method. For example, when BertQA predicted people.person.profession for a given question, it was able to find the correct answer in Freebase, but the traditional evaluation method may reject this prediction if the gold relation in the dataset is common.topic.notable_types. Hereinafter, we refer to the traditional evaluation method as the matching accuracy because its criterion is matching the correct subject and relation for a given question.
To examine the effect of the evaluation method on the accuracy of BertQA, we conducted additional experiments. To overcome the limitation of the matching accuracy, we employed an older evaluation method (Berant et al. 2013; Berant and Liang 2014), which considers the object, that is, the answer to a given question in the QAKB task. Evidence integration, a submodule of BertQA, combines the predicted subjects and relations for a given question and evaluates the result against the gold fact given by the dataset. In contrast, our employed evaluation method compares the predicted object, which is derived from a predicted subject and relation, with the gold object in the dataset. Moreover, we extended this method to entity linking and relation prediction: we aggregated all facts from FB2M to automatically examine whether the predicted result of each submodule can reach the gold object. We refer to this evaluation method as the reachability accuracy because its criterion is the reachability of the correct answer in the knowledge base. Unlike the reachability accuracy, the matching accuracy does not account for paraphrasing or synonyms, which are strongly related to semantic understanding.
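A minimal sketch of how such a reachability check could be computed from an FB2M-style fact list follows; the data structures and function names are ours, and the example facts are invented for illustration (they reuse the relation pair discussed above).

```python
from collections import defaultdict

# Index FB2M-style facts for lookup: (subject, relation) -> set of objects.
facts = [
    ("barack_obama", "people.person.profession", "politician"),
    ("barack_obama", "common.topic.notable_types", "politician"),
]
index = defaultdict(set)
for subj, rel, obj in facts:
    index[(subj, rel)].add(obj)

def matching_correct(pred, gold):
    """Matching accuracy: predicted (subject, relation) must equal gold."""
    return pred == gold

def reachability_correct(pred, gold):
    """Reachability accuracy: the predicted pair is correct if it reaches
    the same object(s) in the knowledge base as the gold pair."""
    return bool(index[pred] & index[gold])

pred = ("barack_obama", "people.person.profession")
gold = ("barack_obama", "common.topic.notable_types")
print(matching_correct(pred, gold))      # False: relations differ
print(reachability_correct(pred, gold))  # True: both reach "politician"
```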
We also estimated the PLSPM models with the reachability accuracies of BertQA. Note that we did not change our causal models at all; we only employed the reachability accuracies of BertQA as the observed variables. The new PLSPM models still rejected the structural equations between the probing tasks for semantic information and BertQA. However, the new PLSPM models reported higher Goodness-of-Fit values, on average, for all datasets than the PLSPM models with matching accuracies, as indicated in Table 9. In particular, Table 10 shows that the difference in the R^2 values of the latent variables is generally larger for WQ than for SQ. Because this is related to the features of WebQuestions, we continue this discussion in the next section.

Special characteristics of WQ
As we reported in Section 4.2, the Goodness-of-Fit values of the PLSPM models for WQ are lower than those of the PLSPM models for the other datasets. This indicates that the same causal diagram, which assumes that encoded linguistic knowledge can explain the accuracies of BertQA, does not fit WQ, unlike the other datasets. To investigate why these problems occur, we manually labeled 20 error questions, which were evaluated as correct in entity detection but incorrect in entity linking, as shown in Table 11. As a result, we find two main problems in linking the entity in the given questions. First, Freebase links too many entities to a single entity label. For example, when we look up the label mexico in the index preprocessed by BuboQA, we obtain 2,830 results. Because the entity linking of BuboQA and BertQA has no scoring process to rank such ambiguous results, BertQA sometimes cannot find the correct entity for a given question within the top-n results. The second problem is that the string label of an entity in Freebase and the written form of the subject in WQ are not always identical. For example, a question in WQ contains the phrase communist party of china, but the label of the corresponding entity in Freebase is Chinese communist party.
Therefore, these characteristics of WQ are the reason why entity detection and entity linking for WQ are not explained well by our PLSPM models.
For relation prediction, we find that the same relation is sometimes expressed with different patterns by the annotators of each dataset. We examined the relation people.person.profession as a sample relation to investigate the difference in writing patterns between datasets. If the term profession appears directly in the question, we label it as directly specify relation. If a term similar to profession, such as job or work, appears in the question, we use the indirectly specify relation label; and if no term similar to profession appears in the question, we use the paraphrasing label.

As discussed above, WQ has different characteristics in terms of entities and relations compared with the other datasets, especially SQ. According to our analysis, entity linking for WQ requires the ability to disambiguate many candidate entities with the same label in Freebase. Furthermore, relation prediction for WQ requires semantic understanding to predict the gold relation, because the questions in WQ tend to contain a paraphrased term or a synonym of the gold relation's label. Han et al. (2020b) reported that BuboQA and KEQA, other state-of-the-art systems for SimpleQuestions, have the same difficulty as BertQA with entity linking and relation prediction for WQ. This indicates that WQ is a more challenging dataset than SQ, even for other state-of-the-art systems.
One main reason for this phenomenon is the difference in the methods used to generate the questions in each dataset. For WebQuestions, the source of WQ, the questions were collected via the Google Suggest API as naturally written user queries (Berant et al. 2013). On the other hand, the questions in SimpleQuestions, the source of SQ, were written artificially by crowd workers based on suggested facts (Bordes et al. 2015). Therefore, the difference in the method used to create each dataset explains the difference in distribution among the datasets.
Moreover, this is also the reason why each dataset is solved using different linguistic knowledge in our PLSPM models.

Conclusion
In this paper, we examined whether the pretrained language model BERT can robustly solve simple factoid question answering. BERT-based models have proved robust in other areas of natural language processing (Talmor and Berant 2019). However, we found that BertQA failed to solve the task robustly across datasets, as did previous systems. We conducted a PLSPM analysis to investigate why BertQA could not overcome the discrepancy in the distributions among datasets.
As a result, our experiments revealed that even BERT depends on the surface and syntactic features of each dataset, rather than on the semantic understanding required for general simple factoid question answering. This indicates that a pretrained language model alone is not sufficient to bridge the distribution discrepancy among existing datasets. In addition, we discussed the source of each dataset and the particular method used to evaluate simple factoid question answering, both of which are important for understanding why even BERT depends on surface and syntactic features. Although our analyses yielded promising results, we also suggest that it is necessary to reconsider the evaluation method for simple factoid question answering. For example, changing the objective from matching the subject and relation to reaching the correct object may improve a QA system's semantic understanding of the given questions. In future work, we hope to propose a more robust system for simple factoid question answering based on the findings of this study and those of other researchers.