Abstract
In recent years, large language models (LLMs) have been increasingly adopted in practical applications and
interactive systems. Contemporary LLMs rely heavily on inductive learning and are therefore strongly influenced by the
characteristics and quality of their training data. While the importance of high-quality training data is widely acknowledged,
it is equally plausible that erroneous interpretations present in existing sources are inherited by the models. In this study, we report
a concrete example of such an error encountered during the design of a statistical learning game utilizing an LLM. During
discussions with the model regarding statistical hypothesis testing, a critical issue arose concerning tests of normality. In
standard statistical practice, normality tests are conducted under the null hypothesis that the sample data follow a normal
distribution. If this null hypothesis is not rejected, the correct conclusion is that the data cannot be determined to either follow
or deviate from a normal distribution. However, the LLM consistently interpreted a non-rejection of the null hypothesis as
evidence that the data do follow a normal distribution. Through extended interaction, the model eventually acknowledged
this reasoning as incorrect. Notably, similar misinterpretations were also observed in other artificial intelligence models.
Given that analogous misunderstandings of normality testing can be found in academic papers and conference presentations,
it is plausible that these errors were acquired from the training data. These observations suggest that LLMs cannot be
assumed to reason correctly about statistical concepts without careful scrutiny. Consequently, particular caution is required
when incorporating LLM-based reasoning into educational tools such as statistical learning games.
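To make the distinction concrete, the following minimal Python sketch illustrates the correct reading of a normality test result. The choice of the Shapiro-Wilk test, the 0.05 significance threshold, and the sample itself are illustrative assumptions, not details taken from the study.

    # Minimal sketch (illustrative assumptions: Shapiro-Wilk test, alpha = 0.05).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    sample = rng.uniform(-1.0, 1.0, size=20)  # small sample, actually non-normal

    # H0: the sample was drawn from a normal distribution.
    statistic, p_value = stats.shapiro(sample)

    if p_value < 0.05:
        print("H0 rejected: the data significantly deviate from normality.")
    else:
        # Failure to reject H0 is NOT evidence of normality; the test merely
        # lacked grounds (e.g., statistical power) to reject it.
        print("H0 not rejected: normality is neither confirmed nor ruled out.")

With a sample this small, the test will often fail to reject H0 even though the data are uniform rather than normal, which is precisely why non-rejection must not be read as confirmation of normality.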