As evidence of risk factors for severe cases of coronavirus disease 2019 (COVID-19) was uncertain in the early phases of the pandemic, the development of an efficient predictive model for severe cases to triage high-risk individuals represented an urgent yet challenging issue. It is crucial to select appropriate statistical models when available data and evidence are limited. This study was conducted to assess the accuracy of different statistical models in predicting severe cases using demographic data from patients with COVID-19 prior to the emergence of consequential variants. We analyzed data from 929 consecutive patients diagnosed with COVID-19 prior to March 2021, including their age, sex, body mass index, and past medical histories, and compared areas under the receiver operating characteristic curve (ROC AUC) between different statistical models. The random forest (RF) model, deep learning (DL) models with relatively few neurons, and the naïve Bayes model exhibited AUC measures of > 0.70 with the validation datasets. The naïve Bayes model performed best, with AUC measures of > 0.80. The accuracy of RF was more robust, with a narrower distribution of AUC measures, than that of DL. The benefit of performing feature selection with a training dataset before building models was seen in some models, but not in all. In summary, the naïve Bayes and RF models exhibited ideal predictive performance even with limited available data. The benefit of performing feature selection before building models with limited data resources depended on the machine learning methods and parameters.
The coronavirus disease 2019 (COVID-19) pandemic, caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has raised global public health concerns since 2019 (Huang et al. 2020; Zhu et al. 2020). In the early stages of the pandemic, each country had to optimize public health strategies to quarantine infected individuals. In Japan, each prefectural government converted pre-existing facilities, such as non-governmental hotels, into quarantine stations (Akashi et al. 2022; Machida and Wada 2022). Many studies have been conducted to build effective prediction models for the identification of potential severe cases of COVID-19 (Gallo Marin et al. 2021). The effective pre-admission triage of infected individuals to hospitals for high-risk patients or quarantine facilities for low-risk patients represented an important issue during the pandemic. Consequently, appropriate statistical models had to be developed to predict severe COVID-19 cases from the limited data available during early stages of the pandemic. To date, most predictive models have incorporated clinical or laboratory COVID-19 infection data (Li et al. 2020; Sun et al. 2020; Wang et al. 2020; Gude-Sampedro et al. 2021; Meng et al. 2021). Many of these models have achieved moderate to high predictive accuracy with areas under the receiver operating characteristic curve (ROC AUC) exceeding 0.70 (Metz 1978). However, the symptoms differ with respect to the number of days from infection (Lauer et al. 2020), and incorporating this information into predictive models requires careful consideration of how and when to collect the data. Furthermore, laboratory and imaging data were unavailable for most cases during the initial triage step. Another issue is the selection of an appropriate statistical model for the predictive task. Candidate statistical models include conventional multivariate analysis with a logistic regression model as well as machine learning (ML)-based models. However, few studies to date have employed real-world data to evaluate which statistical models can more accurately predict a severe clinical course given limited available data and evidence of risk factors (Xiong et al. 2022; Ustebay et al. 2023). Moreover, the accuracy and robustness of Bayesian models, such as the naïve Bayes classification model, in the prediction of severe COVID-19 cases remain unknown. Therefore, the present study was conducted to evaluate the accuracy of different statistical models in predicting a severe disease course using basic demographic data and medical histories of COVID-19 patients obtained prior to the emergence of consequential variant strains.
The present study utilized data from individuals infected with COVID-19 who were assigned to the largest quarantine hotel in Miyagi Prefecture between December 2020 and February 2021 (Tadano et al. 2023). Because the study period preceded the start of the mass vaccination campaign, none of the participants had previously been vaccinated against SARS-CoV-2. This period also preceded the first occurrence of the delta variant in Japan, which has been associated with more severe disease profiles (Hu et al. 2022; Ong et al. 2022).
All participants were assessed during preadmission interviews by local government health workers to be (1) clinically mild and (2) without severe medical conditions that might predispose them to life-threatening events due to COVID-19 infection. Detailed eligibility criteria for admission to the isolation facility have been reported previously (Tadano et al. 2023). Specifically, scores were calculated for each patient based on a combination of potential risk factors including older age, pregnancy, occurrence of serious conditions, obesity, smoking habits, and past medical history of diabetes mellitus (DM), bronchial asthma (BA), chronic obstructive pulmonary disease (COPD), uncontrolled hypertension, cardiovascular diseases (CVD), chronic kidney diseases (CKD), and malignancies. Symptomatic patients were allowed to leave the facility once they were free of fever (without antipyretics) and respiratory symptoms, provided that (1) 10 days had passed since onset and (2) 72 hours had passed since the resolution of fever. Asymptomatic patients were permitted to leave the facility 10 days after testing positive for SARS-CoV-2.
Body mass index (BMI) data were provided for approximately 40% of the admitted patients. Patients with BMI data were used as the original cohort, and those without BMI data were used for the sensitivity analyses. A flow diagram of the study design is presented in Fig. 1.

Flow diagram of study design.
A total of 929 patients with mild to moderate coronavirus disease 2019 (COVID-19) symptoms, admitted to a quarantine hotel between Dec. 2020 and Feb. 2021, were enrolled in this study. Patients were divided into cohorts of 358 and 571 individuals based on the presence or absence of body mass index (BMI) data, respectively. For each cohort, several sensitivity analyses were performed to evaluate the robustness of model performance. Feature selection using the least absolute shrinkage and selection operator (LASSO) before building models was performed only with the training dataset to avoid data leakage. ROC AUC, area under the receiver operating characteristic curve; HTN, hypertension; DM, diabetes mellitus; DL, dyslipidemia; BA, bronchial asthma; CVD, cardiovascular disease; COPD, chronic obstructive pulmonary disease; CKD, chronic kidney disease; HU, hyperuricemia; SAS, sleep apnea syndrome; LDA, linear discriminant analysis; LR, logistic regression; SVM, Support Vector Machines; CART, Classification and Regression Trees; NN, neural network; DL, deep learning.
Variables considered prior to feature selection with the least absolute shrinkage and selection operator (LASSO) included age, sex, nationality, BMI, antibiotic prescription before admission, current smoking status, and medical history of the following 14 conditions: hypertension, DM, dyslipidemia, BA, heart disease, CVD, malignancies, COPD, CKD, hyperuricemia (HU), liver disease, psychiatric disease, sleep apnea syndrome (SAS), and atopy. The evaluated outcome was the occurrence of hypoxia with a prolonged decrease in percutaneous arterial oxygen saturation (SpO2) to ≤ 93%.
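For illustration only, the binary outcome could be derived from longitudinal SpO2 records roughly as in the following Python sketch. This is not the authors' code; the long-format layout, the column names "patient_id" and "spo2", and the simplification of "prolonged" to any single reading of ≤ 93% are assumptions made purely for demonstration.

import pandas as pd

# Hypothetical long-format daily SpO2 records (placeholder values)
daily = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 2],
    "spo2":       [97, 95, 92, 98, 97, 96],
})

# Label each patient 1 if any daily SpO2 reading fell to 93% or below
# (the study's "prolonged" criterion is simplified here)
outcome = (
    daily.groupby("patient_id")["spo2"]
    .apply(lambda s: int((s <= 93).any()))
    .rename("hypoxia")
)
print(outcome)  # patient 1 -> 1, patient 2 -> 0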
Statistical analysis
The machine learning process in this study consisted of the following steps: (1) data preprocessing with z-score normalization, (2) data splitting into training and validation datasets, (3) feature selection with the training dataset, (4) model training, and (5) cross-validation with the validation dataset for accuracy estimation. As data preprocessing to improve prediction performance, each variable was standardized using the z-score, Z = (x − M) / SD, where x is each patient's raw score, M is the mean of the population, and SD is the standard deviation (Andrade 2021; Tanaka et al. 2022). Before constructing supervised ML models to predict the development of severe COVID-19, the features to be incorporated into these models were determined using LASSO to minimize dimensionality and maximize predictive power (Yamada et al. 2014). In the LASSO, the ℓ1 norm was used as the penalty, and features whose calculated coefficients were reduced to zero were excluded from subsequent ML models. Feature selection was performed only with the training dataset and not with the validation dataset, to avoid data leakage and the excessively optimistic evaluation caused by using the whole dataset before cross-validation in building models (Yagis et al. 2021). In the subsequent cross-validation step, 60% of the data were allocated as the training dataset, with the remaining 40% reserved for validation. The following supervised ML models were prepared: linear discriminant analysis (LDA), nonlinear discrimination with logistic regression (LR), a tree-based model with classification and regression trees (CART), support vector machines (SVM), random forest (RF), a single-hidden-layer neural network (NN), deep learning (DL) models with multiple hidden layers, and naïve Bayes models with and without kernel density estimation for non-parametrically estimating the probability density function (Uddin et al. 2019). In the SVM model, three-fold cross-validation was performed. In the NN models, validation accuracies were obtained using 100 epochs with three-fold cross-validation. In the RF model, a random subset of features equal in size to the square root of the number of incorporated features was considered at each split to build 500 decision trees. The number of decision trees was determined after checking the relationship between the error rate and the number of decision trees. In the single-hidden-layer NN model, five units were prepared in the hidden layer. Multiple patterns for the numbers of hidden layers and neurons in each layer were tested for the single-hidden-layer NN and DL models, with predictive accuracies compared by ROC AUC using the DeLong test (DeLong et al. 1988). Multiple comparisons of the AUC measures were not adjusted, given the exploratory nature of the present study. For the ML model with the highest predictive accuracy, the importance of preliminary feature selection was verified by calculating the AUC with and without said selection. To verify the robustness of the AUC measures, 20-fold iterated measurements were performed for the NN, DL, and RF models. Comparisons of repeated AUC measures between models with and without feature selection were performed using the Mann-Whitney U test. The AUC measurements were performed again after randomly reallocating the training and validation datasets in the first cohort with BMI data. To verify the robustness of the findings, AUC measurements were further performed in the second cohort without BMI data.
All statistical analyses were performed using Python 3.11.1 and R version 4.1.3 (R Foundation, Vienna, Austria).
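As a rough illustration of steps (1)-(3) above, the following Python sketch shows one common leakage-free arrangement: split first, standardize with training-set statistics only, then keep the features whose LASSO coefficients remain nonzero. This is not the authors' pipeline; scikit-learn is assumed (the text does not name the libraries used), random placeholder data stand in for the study dataset, and whether a linear or logistic LASSO was fitted in the study is not stated, so LassoCV is used here purely for demonstration.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(358, 20))    # placeholder for the 20 candidate features
y = rng.integers(0, 2, size=358)  # placeholder binary hypoxia outcome

# (2) 60/40 split into training and validation datasets
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y)

# (1) z-score standardization, Z = (x - M) / SD, fitted on the training data only
scaler = StandardScaler().fit(X_tr)
X_tr_z, X_va_z = scaler.transform(X_tr), scaler.transform(X_va)

# (3) LASSO (L1-penalized) feature selection with the training dataset only,
#     retaining features whose coefficients are not shrunk to zero
lasso = LassoCV(cv=5, random_state=0).fit(X_tr_z, y_tr)
keep = np.flatnonzero(lasso.coef_ != 0)
print("retained feature indices:", keep)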
Ethics
This study was approved by the institutional review board of Tohoku University Graduate School of Medicine (approval number: 2021-1-1178). Written informed consent was waived by the review board because of the anonymity of the present study and to prevent unnecessary risks of transmitting the infection by obtaining written forms from the participants. All processes of the study were performed in accordance with the latest version of the Declaration of Helsinki, as revised in 2013 (World Medical Association 2013).
Of the 929 patients with reliable daily SpO2 measurements enrolled during the study period, 358 had BMI data and the remaining 571 did not. Fifteen patients (1.6%) were transferred from hospitals to quarantine facilities for continued isolation. Previously reported demographic data (Tadano et al. 2023) indicate that although none of the patients were hypoxic (with SpO2 ≤ 93%) on admission to the isolation facility, 63 (6.8%) developed hypoxia at a median of 8 days (interquartile range: 6-10 days) from the clinical onset.
First cohort
The first cohort included data from 358 individuals (197 males and 161 females), including 96 current smokers, with the evaluated variables and reliable SpO2 measurement results. The median age and interquartile range (IQR; 25th-75th percentile) at hotel admission were 39 and 24-52 years, respectively. Among them, 341 were of Japanese nationality, 15 were Asians from countries other than Japan, and 2 were Caucasians. The following prevalences of medical histories were reported: 60 with hypertension, 16 with diabetes mellitus, 34 with dyslipidemia, 24 with BA, 14 with heart disease, 3 with CVD, 9 with malignancies, 2 with COPD, 2 with CKD, 6 with HU, 3 with liver diseases, and 4 with psychiatric diseases. The median BMI was 22.67 (IQR, 20.30-26.35). There were 23 patients (6.4%) who developed hypoxia following clinical onset. The overall cohort was randomly divided into a training dataset (215 individuals, including 15 who developed hypoxia) and a validation dataset (143 individuals, including 8 who developed hypoxia).
Feature selection
In the initial feature selection stage using LASSO with the training dataset, the following pre-infection variables produced non-zero coefficients and were used in the subsequent ML models: age (Wald χ2 = 4.65; p = 0.0310), BMI (χ2 = 5.43; p = 0.0198), sex (χ2 = 2.27; p = 0.1318), history of hypertension (χ2 = 0.12; p = 0.7249), history of DM (χ2 = 2.89; p = 0.0893), history of BA (χ2 = 0.16; p = 0.6866), history of HU (χ2 = 0.46; p = 0.4967), and antibiotic prescription before admission (χ2 = 0.55; p = 0.4577). All other variables produced coefficients of zero and were eliminated from the subsequent ML models. For the naïve Bayes models, variables producing variance errors were further excluded.
AUC measures for ML models (original data configuration)
Table 1 lists the obtained AUC measures and 95% confidence intervals obtained for each ML model using the validation dataset for the first cohort. The linear discriminator (AUC, 0.671; p = 0.0525), logistic regression (AUC, 0.522; p = 0.5852), SVM (AUC, 0.594; p = 0.1863), and DL models with too many hidden layers or neurons failed to show significant prediction accuracy. The highest AUC measures were obtained with the naïve Bayes model, with both configurations with and without kernel density estimation exhibiting AUC measures greater than 0.80 (p = 0.0004 with kernel density estimation and p = 0.0010 without it). In the NN and DL models, AUC measures largely differed according to parameters such as the numbers of hidden layers and neurons, with too many of either resulting in lower AUC measures, possibly reflecting overfitting of the models. The ROC curves obtained by the evaluated models are presented in Fig. 2, suggesting the superiority of the naïve Bayes model over the other models.
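As an illustration of how such a validation AUC can be computed for a Gaussian naïve Bayes classifier, a minimal Python sketch is given below, assuming scikit-learn and using random placeholder data with the same shapes as the first cohort's split; the kernel-density variant and the DeLong comparisons between curves (commonly run with roc.test in the R package pROC) are not reproduced here.

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
# Placeholder data: 215 training and 143 validation patients, 8 selected features
X_tr, y_tr = rng.normal(size=(215, 8)), rng.integers(0, 2, size=215)
X_va, y_va = rng.normal(size=(143, 8)), rng.integers(0, 2, size=143)

# Gaussian naive Bayes fitted on the training data and scored on the validation data
nb = GaussianNB().fit(X_tr, y_tr)
auc = roc_auc_score(y_va, nb.predict_proba(X_va)[:, 1])
print(f"validation ROC AUC: {auc:.3f}")  # ~0.5 with random placeholder data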
Next, to determine the optimal number of trees in the RF model for a stable prediction, the relationship between the number of trees and the prediction error was evaluated three times with different random seeds (Fig. 3). In obtaining the errors in these analyses, the outcome was used as a dummy variable. The obtained results indicated that approximately 300 trees realized a minimal error rate, and that the error rate did not decrease with further increases in the number of trees. Based on this finding, the number of trees in RF was set to 500 in the subsequent sensitivity analyses.
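A schematic version of this kind of check is sketched below in Python, assuming scikit-learn and random placeholder data in place of the training dataset; tracking the out-of-bag (OOB) error as trees are added is one common way to do this and is not necessarily the exact error measure used in the study.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X_tr, y_tr = rng.normal(size=(215, 8)), rng.integers(0, 2, size=215)  # placeholders

# Grow the same forest incrementally (warm_start) and record the OOB error;
# max_features="sqrt" mirrors considering sqrt(number of features) at each split
rf = RandomForestClassifier(max_features="sqrt", oob_score=True,
                            warm_start=True, random_state=0)
for n_trees in (50, 100, 300, 500):
    rf.n_estimators = n_trees
    rf.fit(X_tr, y_tr)
    print(f"{n_trees:4d} trees: OOB error = {1 - rf.oob_score_:.3f}")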

Area under the receiver operating characteristic curve (ROC AUC) measures with different machine learning models and parameters (original dataset).
The AUC measure in each machine learning (ML) model was obtained with or without feature selection by least absolute shrinkage and selection operator (LASSO) using the training dataset. Feature selection with LASSO was performed with the training dataset before building each ML model to reduce dimensionality, with eight eligible features identified with nonzero coefficients [i.e., age, sex, body mass index (BMI), hypertension (HTN), diabetes mellitus (DM), bronchial asthma (BA), hyperuricemia (HU), and antibiotics]. The ML models were built based on a training dataset of 215 individuals, and AUC measures were obtained from the validation dataset encompassing the 143 remaining individuals. For all models, the occurrence of prolonged decrement in SpO2 measures ≤ 93% was used as the binary outcome.
CART, Classification and Regression Trees; HL, hidden layer; NN, neural network; SVM, Support Vector Machines.
*All 20 features before feature selection by LASSO were used in these models.

The receiver operating characteristic (ROC) curves with evaluated machine learning (ML) models for predicting severe COVID-19 cases.
The conventional logistic regression model (Model 2) exhibited poor AUC measures below 0.60. Other ML models produced AUC measures greater than 0.70 when the parameters were appropriately assigned. In particular, the naïve Bayes models exhibited AUC measures of > 0.80. CART, classification and regression trees; LDA, linear discriminant analysis; LR, logistic regression; NN, neural network; RF, random forest.

Relationship between the number of trees in random forest (RF) model and predictive error.
As the predictive error was expected to decrease with an increasing number of trees in the RF model, the relationship between the number of trees and the predictive error rate was evaluated three times to determine the optimal number of trees for a stable prediction. The obtained line graphs suggested that approximately 300 trees realized a minimal error rate, and that the error rate was stabilized above this number of trees.
To determine the variability of AUC measures with the NN, DL, and RF models, 20-fold repeated AUC measurements with the validation dataset were performed with NN models having two hidden layers (of [3, 2] or [10, 5] neurons), a DL model with three hidden layers (of [4, 3, 2] neurons), and an RF model. The distribution of AUC measurements is depicted in Fig. 4. These measures were more widely distributed with the NN and DL models, irrespective of parameters, than with the RF model, suggesting the high reliability and reproducibility of the RF model. The benefit of performing feature selection with the training dataset before building models depended on the type of ML model and the parameters.
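A schematic Python version of this repeated-measurement procedure is sketched below, assuming scikit-learn and SciPy and using random placeholder data; the particular features dropped in the "with feature selection" arm here are arbitrary and stand in for the LASSO-selected subset only for illustration.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(3)
X, y = rng.normal(size=(358, 20)), rng.integers(0, 2, size=358)  # placeholders

def repeated_auc(X, y, n_iter=20):
    # Refit and rescore the model over repeated random 60/40 splits
    aucs = []
    for seed in range(n_iter):
        X_tr, X_va, y_tr, y_va = train_test_split(
            X, y, test_size=0.4, random_state=seed, stratify=y)
        rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                    random_state=seed).fit(X_tr, y_tr)
        aucs.append(roc_auc_score(y_va, rf.predict_proba(X_va)[:, 1]))
    return np.asarray(aucs)

auc_selected = repeated_auc(X[:, :8], y)    # e.g., 8 LASSO-selected features
auc_all      = repeated_auc(X, y)           # all 20 candidate features
print(mannwhitneyu(auc_selected, auc_all))  # compare the two AUC distributions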

Variability of AUC measures based on different machine learning (ML) models.
The distributions shown with eight different ML models were obtained from 20 iterations of ML simulations among 358 patients with body mass index (BMI) data. The AUC measures obtained with random forest (RF) exhibited narrower distributions than those obtained via deep learning (DL) models, implying greater robustness of the former. The AUC measure obtained with the RF method was higher when preliminary feature selection was performed with the least absolute shrinkage and selection operator (LASSO) (p < 0.0001), demonstrating the importance of feature selection in this model. Meanwhile, the benefit of preliminary feature selection could not be confirmed with the neural network (NN) and DL models. The p-values are results of the Mann-Whitney U test. NN [4,3,2] denotes a DL model with three hidden layers comprising 4, 3, and 2 neurons. Validation accuracies in NN models were obtained using 100 epochs with 3-fold cross-validation. The number of trees in RF was 500.
Next, a sensitivity analysis with randomly changed datasets for training (215 individuals, including 13 with the primary outcome episode) and validation (143 individuals, including 10 with the primary outcome episode) was performed to verify the robustness of the present findings. The AUC measurements obtained using the changed datasets are listed in Table 2. Again, the RF model, DL models with a small number of neurons, and naïve Bayes models exhibited robust predictive accuracies with moderate-to-high AUC measures greater than 0.70. In particular, the naïve Bayes models again achieved the highest AUC measures, exceeding 0.80, and were the only models that showed AUC measures greater than 0.80 for both data configurations.

Sensitivity analysis for AUC measures with randomly reassigned training and validation datasets following feature selection.
Sensitivity analysis was performed by randomly selecting another pair of datasets for training (215 patients) and validation (143 patients) using the features selected by the least absolute shrinkage and selection operator (LASSO). The feature selection was performed with the new training dataset, and the following 11 features had nonzero coefficients: age, sex, body mass index (BMI), hypertension (HTN), diabetes mellitus (DM), bronchial asthma (BA), heart diseases, cardiovascular disease (CVD), chronic obstructive pulmonary disease (COPD), hyperuricemia (HU), and atopy. Validation accuracies in neural network (NN) models were obtained using 100 epochs with 3-fold cross-validation.
ML, machine learning; SVM, Support Vector Machines; CART, Classification and Regression Trees; HL, hidden layer.
Finally, to further evaluate the reproducibility of the results, another sensitivity analysis was performed with the 571 patients without available BMI data. Among this cohort, 40 patients (7.0%) developed hypoxia following clinical onset. Sixty percent of this cohort was randomly allocated for the training dataset (343 individuals, including 23 patients who developed hypoxia), with the remaining 40% reserved for the validation dataset (228 individuals, including 17 patients who developed hypoxia). First, feature selection using LASSO with the training dataset identified the following six characteristics with non-zero coefficients: age, sex, dyslipidemia, heart disease, liver disease, and psychiatric disease. The AUC measures obtained with different ML-based models by using these characteristics are listed in Table 3. As in the previous analyses, the naïve Bayes models, NN models with not too many hidden layers or neurons, and RF model exhibited moderate-to-high AUC measures of > 0.70, with the naïve Bayes models showing the highest AUC.

AUC measures following feature selection using data from another cohort of 571 patients.
Another sensitivity analysis was conducted on a different cohort of 571 patients without available body mass index (BMI) data. Sixty percent of the patients were randomly allocated for the training dataset (n = 343), with the remaining 40% reserved for the validation dataset (n = 228). Feature selection was performed using the least absolute shrinkage and selection operator (LASSO) method for the training dataset, and the following six features had nonzero coefficients: age, sex, dyslipidemia (DL), heart disease, liver disease, and psychiatric disease.
ML, machine learning; SVM, Support Vector Machines; CART, Classification and Regression Trees; HL, hidden layer; NN, neural network.
In the present study, different types of prediction models based on the conventional logistic regression model and other ML models were comprehensively evaluated, and their accuracies were compared in the prediction of severe conditions using only pre-infection data. Although the models in this study did not use data directly pertaining to COVID-19-related symptoms, most of them exhibited moderate-to-high AUC measures exceeding 0.70. In particular, the naïve Bayes models exhibited the highest AUC measures, exceeding 0.80, in all evaluated data configurations. These findings suggest that some ML-based models, including the RF, DL, and naïve Bayes models, can achieve higher AUC measures than the conventional logistic regression model even with data resources limited in size and variables. Such ML-based predictive models may contribute to the initial triage stage of public health agencies when predicting outcomes in the absence of reliable data and evidence of risk factors, especially in the early phases of a pandemic. Another notable finding of the present study is the wide distribution of expected AUC measures with NN and DL models depending on their parameters, such as the numbers of hidden layers and neurons. These findings collectively indicate the excellent usability of the RF and naïve Bayes models in predicting severe COVID-19 cases when reliable clinical or laboratory data are unavailable. In clinical studies, it is crucial to determine whether a predictive model can be structurally interpreted in view of distinct risk factors. In this respect, the conventional logistic regression model has an advantage over ML-based models. Conversely, in view of the practical usability of predictive models in actual triage processes, predictive accuracy may be prioritized over interpretability by certain public health policies, in which case ML models are more desirable than those derived from conventional logistic regression. Although the naïve Bayes models employed fewer features, they exhibited promising potential as predictive models for severe COVID-19 cases. Subsequent attempts to develop predictive models for severe cases may benefit from using the naïve Bayes classifier, particularly given a relatively small training dataset.
To date, few studies have evaluated the accuracy of naïve Bayes classifiers in the prediction of severe COVID-19 cases. The naïve Bayes classifier uses Bayes' theorem with a strong assumption of independence between features, with parameters approximated using the maximum likelihood method. Despite this strong assumption of independence between input features, which appears to be oversimplified for real-world data, the model has been reported to exhibit excellent performance compared to logistic regression and even other supervised ML models (Awan et al. 2020; Golpour et al. 2020; Mfateneza et al. 2022). One strength of the naïve Bayes classifier is that it often works well with a relatively small amount of training data, as demonstrated in a previous study (Sardesai et al. 2021). This strength may be derived from the assumption of independence between features, as the number of parameters that must be estimated is kept much lower than in other ML models, including DL models. This is particularly important when the data dimensionality is high relative to the number of training samples, which results in the curse-of-dimensionality problem. In the early stages of a pandemic, with limited direct data and evidence, the issue of high dimensionality with a low training data size is a frequent occurrence. In such situations, the naïve Bayes model represents a promising approach for the prediction of outcomes, along with the conventional logistic regression model or penalized feature selection methods, such as LASSO, for extracting significant risks.
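For reference, the classification rule referred to above can be written in its standard textbook form (this formulation is general and not specific to the present study). Given features x_1, ..., x_n and class label y, Bayes' theorem and the naïve independence assumption give

P(y \mid x_1, \ldots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y),

so the predicted class is

\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y).

Because only the class prior P(y) and the per-feature conditionals P(x_i | y) need to be estimated, the number of parameters grows roughly linearly with the number of features, which is consistent with the robustness of the model when training data are scarce.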
This study had several limitations. First, the number of available data points was relatively small, and the incidence of the primary outcomes was relatively low at less than 10%. Consequently, the generalizability of our findings to other demographics, including other variants of COVID-19, remains uncertain. Further studies using data from patients with these variants are required to confirm our hypotheses. Another limitation pertains to the patterns of parameters used for the NN and DL models, with 1-5 hidden layers and 2-50 neurons in each layer. Parameter optimization is an essential but difficult issue in developing DL models, and the advantages of these models could not be statistically evaluated in this study. Finally, because the present study encompassed the period before the development of vaccines against COVID-19, the collected features did not include a history of vaccination. In similar future studies, the vaccination status of each patient must also be considered as an important demographic feature, as vaccination status is known to suppress the incidence of severe COVID-19 (Ng et al. 2022).
In conclusion, this study evaluated the robustness and accuracy of clinical predictive models based on logistic regression and ML models given a limited sample size and set of variables. Several ML-based models, including the naïve Bayes, RF, NN, and DL models, performed better than the conventional logistic regression model for this task. Conversely, an excessive number of hidden layers or neurons in a DL model resulted in suboptimal predictive accuracy. The benefit of performing feature selection before building models depended on the types of ML models and their parameters. Overall, this study demonstrated the high usability of the naïve Bayes model in building prediction models when data resources and evidence of risk factors are limited.
The authors appreciate all medical staff and local government staff (Miyagi Prefecture) who cooperated in the management of the quarantine facility where the present study was performed.
This study was funded by JSPS KAKENHI Grant Number JP21K10367.
T.A., Y.T., and T.I. contributed to the concept, design, and data collection of this study. T.A. and Y.T. performed statistical analyses. Y.K. and N.Y. verified the machine learning processes. Y.T. played a primary role in data collection. T.A. drafted the manuscript and prepared figures and tables. T.I. and N.Y. supervised this study. All authors critically revised and approved the final version of the manuscript.
The authors declare no conflict of interest.