Environmental and Occupational Health Practice
Online ISSN : 2434-4931
Original Articles
Prediction and predictor elucidation of metabolic syndrome onset among young workers using machine learning techniques: A nationwide study in Japan
Miyuki Suda, Tadao Ooka, Zentaro Yamagata

2022 Volume 4 Issue 1 Article ID: 2021-0023-OA

Abstract

Objectives: Predictive models for the onset of metabolic syndrome (MS) for people in their 30s are scarce. This study aimed to construct a highly accurate model to predict MS onset by 40 years of age and to identify important predictors of MS onset using health checkup data of Japanese employees aged between 30 and 35 years. Methods: The study included 6,048 Japanese employees aged 40 years who underwent periodic health examinations over 10 years. We developed predictive models for MS onset using machine learning methods, including random forest and logistic regression models. For the random forest models, the variable importance of each explanatory variable was calculated to identify important predictors of MS onset. Results: Of the 2,998 participants aged 30 years, 164 developed MS by age 40 years, as did 180 of the 4,045 participants aged 35 years. The random forest models had higher predictive power (e.g., AU-ROC 0.867 for males aged 30) than the logistic regression models. In these models, diastolic blood pressure was the most important predictor of MS onset for males, while body mass index was the most important predictor for females. Conclusions: We created machine learning models to predict MS onset at the age of 40 years with high accuracy from health examination data obtained at the age of 30 or 35 years. Sex differences in important predictors of MS onset were shown by the variable importance indices of the random forest. Applying our model in routine healthcare management could enable early health interventions to prevent MS onset.

Introduction

Metabolic syndrome (MS) is a combination of metabolic disorders, including obesity, hyperglycemia, hypertension, and lipid abnormalities, which predispose patients to diabetes and cardiovascular disease1). There are over 1 billion people with MS worldwide, indicating the need for effective measures to prevent a further increase in MS prevalence2). Therefore, preventing MS is an important issue from a public health perspective. In addition, MS prevention has an emergent influence on health economics3), health inequalities4), and occupational health problems, including overwork injury prevention and the promotion of older workers5).

In 1999, the World Health Organization (WHO) proposed criteria for the diagnosis of MS6). Since then, two approaches to the diagnostic criteria for MS have been advanced. The first is based on the WHO concept, which includes insulin resistance and visceral fat6,7). The second is based on the overlap of risk factors for cardiovascular disease, including obesity and hypertension8,9). The former concept is mainly used in Japan7), while the latter is commonly used in the United States and European countries.

In Japan, people aged 40 years or older have the option to receive health guidance as a preventive measure against MS in accordance with national law. However, there are no MS prevention measures for people younger than 40 years old. To the best of our knowledge, few studies have examined long-term (at least 10 years), large-scale health checkup data of employees in their 30s. However, in recent years, health checkups and health guidelines for young workers in their 30s have been promoted in Japan, and there is a need for improvement in the methodology and rationale for providing health interventions for workers in this age bracket.

Previous studies have reported that the basis for MS is established before the age of 4010) and that the lifestyle choices in one’s 30s are associated with the later development of MS11,12,13). In addition, a previous study on weight control suggests that health education by the age of 35 years leads to weight reduction at the age of 40 years14). Another study also reported that health guidance for people in their 30s is important because lifestyle habits in their 30s are reflected in future health checkup results15).

Therefore, identifying individuals at high risk of MS in their 30s and intervening to improve lifestyle habits could be useful for the prevention of MS. However, to the best of our knowledge, there are no studies on the identification of high-risk groups for developing MS in their 30s. We hypothesized that the results of health checkups in one’s 30s reflect lifestyle habits in the same period and that the prediction of MS onset in one’s 40s is possible using health checkup data. The longitudinal collection of health checkup data from people in their 30s will enable us to verify this hypothesis.

From an analytical perspective, existing studies often use regression analysis for prediction10), and one limitation of regression analysis is collinearity of explanatory variables. Random forest (RF), a machine learning technique, is an analytical method that avoids this problem. A previous study reported that the prediction accuracy of a diabetes prediction model was higher with RF models than with logistic analysis models because of the differences in collinearity16).

In this study, we used longitudinal health examination data of males and females in their 30s from a single Japanese company. By using a highly interpretable machine learning method17) and comparing the detected variables to clinically well-known factors, we confirmed the validity of the models and identified important factors associated with the development of MS in males and females in their 30s.

Methods

Study design and participants

The study included Japanese employees aged 30 years in 2008 or 2009 who underwent continuous periodic health examinations conducted between 2008 and 2019 by Health Insurance Association A. Health Insurance Association A oversees health management across 525 business sites throughout Japan. The business sites span various industries, including manufacturing, sales, engineering, and clerical positions. Of the 6,248 individuals who received physical examinations at age 30 years and of the 6,235 individuals who underwent physical examinations at age 35 years, the participants who did not develop MS at age 30 years but developed MS at age 40 years were included in this analysis (Figure 1).

Fig. 1.

Participant selection flow.

*Study participants who could not be assessed for MS because of missing values. MS, metabolic syndrome.

We excluded participants who could not be evaluated for MS due to missing values and those confirmed to have MS prior to the age of 40 years. If participants underwent two medical examinations in the same year, only the examination with fewer missing values was included in the analysis; when the number of missing values was equal in the two examinations, the first examination was used.

Finally, we prepared two datasets for analysis in this study. The first was created by combining data from the health examination at the age of 30 years with the MS evaluation data at the age of 40 years. The second dataset was created by combining data from the health examination at the age of 35 years with the MS evaluation data at the age of 40 years. These two datasets were used to construct predictive models for MS onset at the age of 40 years and to examine important predictors of MS onset.

Measures

Outcome

The primary outcome of this study was onset of MS at the age of 40 years. The diagnostic criteria for MS were based on the 2005 Japanese Journal of Internal Medicine criteria4), which are the most widely used criteria in Japan.

For the diagnosis of MS, a large waist circumference (male ≥85 cm, female ≥90 cm) was set as a mandatory criterion, along with at least two of the following exceeding standard values: blood pressure (systolic blood pressure ≥130 mmHg or diastolic blood pressure ≥85 mmHg), lipid levels (triglycerides ≥150 mg/dL or high-density lipoprotein cholesterol [HDL-C] ≤40 mg/dL), and blood glucose levels (fasting blood glucose ≥110 mg/dL). In addition, patients receiving medication for blood pressure, lipids, or blood glucose were considered to meet the criteria for each item, even if the standard values were not exceeded.
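The diagnostic rule described above can be written as a short function. The following Python sketch is illustrative only: the function name, argument names, and the "male"/"female" coding are assumptions introduced here, and the thresholds are copied from the text as stated.

```python
def has_metabolic_syndrome(
    sex: str,
    waist_cm: float,
    sbp: float,
    dbp: float,
    triglycerides: float,
    hdl_c: float,
    fasting_glucose: float,
    on_bp_medication: bool = False,
    on_lipid_medication: bool = False,
    on_glucose_medication: bool = False,
) -> bool:
    """Apply the MS criteria as described in the text (hypothetical helper)."""
    if sex not in ("male", "female"):
        raise ValueError("sex must be 'male' or 'female'")

    # Mandatory criterion: waist circumference at or above the sex-specific cutoff.
    waist_cutoff = 85.0 if sex == "male" else 90.0
    if waist_cm < waist_cutoff:
        return False

    # A component counts if a measured value crosses its threshold
    # or if the person receives medication for that component.
    blood_pressure = sbp >= 130 or dbp >= 85 or on_bp_medication
    lipids = triglycerides >= 150 or hdl_c <= 40 or on_lipid_medication
    glucose = fasting_glucose >= 110 or on_glucose_medication

    # At least two of the three components are required.
    return (int(blood_pressure) + int(lipids) + int(glucose)) >= 2
```

Because the waist criterion is mandatory, a participant with abnormal blood pressure, lipids, and glucose but a waist circumference below the cutoff is not classified as having MS under this rule.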

Predictive variables

In constructing the predictive model, we used 16 examination items and 12 interview items from the health examination prescribed by Japanese law. The validity and reliability of these interview items were verified by the Standard Health Examination and Health Guidance Program18,19) from the Japanese Ministry of Health, Labour and Welfare. The examination items included (1) body mass index (BMI) (kg/m²); (2) waist circumference (cm); (3) systolic blood pressure (mmHg); (4) diastolic blood pressure (mmHg); (5) HDL-C (mg/dL); (6) low-density lipoprotein cholesterol (LDL-C) (mg/dL); (7) triglycerides (mg/dL); (8) alanine aminotransferase (ALT) (U/L); (9) aspartate aminotransferase (AST) (U/L); (10) γ-glutamyl transpeptidase (γ-GTP) (U/L); (11) blood glucose (mg/dL); (12) hematocrit (%); (13) hemoglobin (g/dL); (14) red blood cells (10⁴/μL); (15) white blood cells (10²/μL); and (16) uric acid (mg/dL). LDL-C was obtained using either the direct method or the Friedewald estimation formula. Blood glucose was measured in the fasting state.
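For readers unfamiliar with the Friedewald estimate mentioned above, a minimal Python sketch follows. It assumes that total cholesterol was available in the source data (the text does not list it among the 16 items) and that all values are in mg/dL; the function name and the explicit triglyceride guard are introduced here for illustration.

```python
def friedewald_ldl_c(total_cholesterol: float, hdl_c: float, triglycerides: float) -> float:
    """Estimate LDL-C (mg/dL) with the Friedewald formula: TC - HDL-C - TG/5."""
    # The estimate is generally regarded as unreliable at high triglyceride levels.
    if triglycerides >= 400:
        raise ValueError("Friedewald estimate is not recommended for triglycerides >= 400 mg/dL")
    return total_cholesterol - hdl_c - triglycerides / 5.0


# Example: TC 200, HDL-C 50, TG 150 gives an estimated LDL-C of 120 mg/dL.
print(friedewald_ldl_c(200, 50, 150))
```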

The interview items included (1) daily alcohol consumption; (2) having breakfast; (3) paying attention to nutritional balance; (4) walking for more than 1 h per day; (5) walking speed; (6) intention to improve health; (7) eating before bed; (8) restful sleep; (9) eating too fast; (10) cigarette smoking; (11) weight gain >10 kg; and (12) exercising more than twice a week.

Physical measurements and blood tests were treated as continuous variables, and all questionnaire items were treated as binary variables in the prediction models for better adaptability. The cutoff points for converting questionnaires with multiple options to binary variables are based on previous studies15,20). A summary table on variable processing is provided in eTable 1.

Statistical analysis

The predictive models for MS onset were created using machine learning methods, including RF and logistic regression (LR), and were evaluated with the area under the receiver operating characteristic curve (AU-ROC) and the area under the precision-recall curve (AU-PRC). A precision-recall curve is a graph with precision values on the y-axis and recall values on the x-axis.
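To make the two evaluation metrics concrete, the sketch below computes them in plain Python. It is not the authors' R code: AU-ROC is computed through its rank interpretation (the probability that a randomly chosen case scores higher than a randomly chosen non-case), and AU-PRC is approximated by average precision, which is one common estimator and is an assumption here. Function names are hypothetical.

```python
def au_roc(y_true, scores):
    """AU-ROC as the fraction of (case, non-case) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Approximate AU-PRC as the mean precision at the rank of each true case."""
    n_pos = sum(1 for y in y_true if y == 1)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank samples from highest to lowest predicted score (ties are not treated specially).
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


labels = [0, 0, 1, 1]
predicted = [0.1, 0.4, 0.35, 0.8]
print(au_roc(labels, predicted), average_precision(labels, predicted))
```

Unlike AU-ROC, whose chance level is 0.5, the chance level of AU-PRC is the outcome prevalence, so AU-PRC values should be read against the low incidence of MS in a given dataset.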

Training data were used to create the machine learning models, and test data were used to check model accuracy. The training and test datasets were randomly selected from the original dataset in a 4:1 ratio for males and a 1:1 ratio for females because of the small number of MS cases for females. In all models, the outcome was either the presence or absence of MS at the age of 40 years, and the 28 items, including all examination items and interview items, were used as explanatory variables. In these models, sex was treated as a stratified variable rather than an adjusted variable because outcome criteria are different for males and females.
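The sex-specific split ratios can be illustrated as follows. This Python sketch assumes each record is a dictionary with a "sex" key and uses a simple random shuffle; whether the original split was additionally balanced on the outcome is not stated in the text, so no outcome stratification is applied here. All names are hypothetical.

```python
import random

# Training fractions implied by the text: 4:1 for males, 1:1 for females.
TRAIN_FRACTION = {"male": 4 / 5, "female": 1 / 2}


def split_train_test(records, train_fraction, seed=0):
    """Randomly split records into (training, test) lists."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


def split_by_sex(records, seed=0):
    """Split males and females separately, so that each sex gets its own models."""
    result = {}
    for sex, fraction in TRAIN_FRACTION.items():
        subset = [r for r in records if r["sex"] == sex]
        result[sex] = split_train_test(subset, fraction, seed)
    return result
```

Splitting within each sex mirrors the stratified design: separate models are trained and tested for males and females rather than including sex as a covariate.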

In the construction of the predictive model, RF modelling was performed using the randomForest package of the statistical software R (R Foundation for Statistical Computing, Vienna, Austria). The number of decision trees was set to 1,000, and the minimum size of the terminal nodes was set to 1. All other parameters, including the number of features used to create the decision trees, were automatically set by the R caret package21). The Gini index was used as an impurity function14). The analysis was performed with 10-fold cross-validation.
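As a brief illustration of the impurity function named above, the following Python sketch computes the Gini impurity of a node and the impurity decrease achieved by a candidate split. It is a didactic sketch, not the randomForest implementation, and the function names are introduced here.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) over the class proportions p_k in a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def gini_decrease(parent, left, right):
    """Impurity decrease when a parent node is split into left and right children."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# A perfectly separating split of a balanced node removes all impurity (0.5 -> 0).
print(gini_decrease([0, 0, 1, 1], [0, 0], [1, 1]))
```

Each tree chooses, at every node, the split with the largest impurity decrease among a random subset of candidate features.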

In the LR model, all variables were used as explanatory variables with the forced entry method to compare the performance using the same number of explanatory variables as RF. To avoid the complete separation problem that arises with a small number of incident cases, Firth’s bias-reduced logistic regression was used to create the LR model for the 30-year-old females22). LR models were not used to evaluate the importance of the predictive variables because a small number of incident cases can affect the interpretation of those variables.
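For reference, Firth's method replaces the ordinary log-likelihood of logistic regression with a penalized version. The notation below is introduced here and does not appear in the original text: X is the design matrix, π_i is the fitted probability for participant i, and h_i is the i-th diagonal element of the hat matrix.

```latex
\ell^{*}(\beta) = \ell(\beta) + \tfrac{1}{2}\log\det I(\beta),
\qquad
I(\beta) = X^{\top} W X,
\qquad
W = \operatorname{diag}\{\pi_i(1-\pi_i)\}

U^{*}_{r}(\beta) = \sum_{i=1}^{n}\Bigl[\,y_i - \pi_i + h_i\bigl(\tfrac{1}{2} - \pi_i\bigr)\Bigr]x_{ir} = 0,
\qquad
H = W^{1/2} X \,(X^{\top} W X)^{-1} X^{\top} W^{1/2}
```

The penalty term keeps the coefficient estimates finite even when a predictor perfectly separates cases from non-cases, which is the situation that can arise with only a handful of incident cases.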

When we created the RF models, the variable importance of each explanatory variable was calculated to identify important predictors of MS. In the calculation of variable importance, we used RF with conditional inference trees (using the cForest package)23) because standard RF tends to underestimate the importance of categorical variables with fewer categories. In addition to creating the predictive models, we used multidimensional scaling (MDS) to evaluate the similarity between MS and non-MS participants. MDS visualizes the degree of similarity between individual participants in a dataset24). All analyses were performed using R version 3.6.1, with the significance level set at 0.05.
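The general idea behind a permutation-style variable importance, and the rescaling to a 0 to 100% axis, can be sketched in Python as below. This is plain permutation importance measured on accuracy, a simplified stand-in and not the conditional inference forest importance used by the authors; `predict` is any hypothetical function mapping a feature list to a predicted label.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the prediction matches the label."""
    return sum(predict(row) == y for row, y in zip(rows, labels)) / len(labels)


def permutation_importance(predict, rows, labels, n_features, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling one feature column at a time."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)
            # Rebuild each row with feature j replaced by a shuffled value.
            permuted = [row[:j] + [value] + row[j + 1:] for row, value in zip(rows, column)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def scale_to_percent(importances):
    """Express importances relative to the most influential variable (= 100%)."""
    top = max(importances)
    if top <= 0:
        raise ValueError("no variable has positive importance")
    return [100.0 * (max(value, 0.0) / top) for value in importances]
```

A feature the model never uses receives an importance of zero, because shuffling it cannot change any prediction.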

Ethical considerations

In this study, we used health checkup data collected by Health Insurance Association A. All datasets used in this study were anonymized and statistically processed in systems that were not connected to any external networks or the internet. Personal information was strictly protected and managed in accordance with the ethical guidelines established by the government (Ethical Guidelines for Medical Research Involving Human Subjects)25).

This study was approved by the Ethics Committee of the University of Yamanashi (Ethics Committee receipt number 2201) and by the Ethics Committee of Health Insurance Association A (receipt number 2019-002). The study details were described on the websites of the company and the university where the study was conducted, and all study participants were offered the opportunity to review and opt out of this study.

Results

Of the 6,248 participants who underwent health checkups at the age of 30 years and the 6,235 participants who underwent health checkups at the age of 35 years, 2,998 (2,342 males and 656 females) aged 30 years and 4,045 (3,098 males and 947 females) aged 35 years had MS assessment data available and had not developed MS before the age of 40 years. Of the 2,998 participants aged 30 years, 164 had developed MS by age 40 years, as had 180 of the 4,045 participants aged 35 years (Figure 1).

When blood test data taken at the age of 30 were compared between the MS-onset and non-MS-onset groups at the age of 40, all blood test data except HDL-C were significantly higher in the MS-onset group, and HDL-C was significantly lower in the MS-onset group (Table 1). Blood test data at the age of 35 years were also compared between the MS-onset and non-MS-onset groups aged 40 years, with the same results as at the age of 30 years.

Table 1. Characteristics of study participants

Column groups give the age at examination (30, 35, or 40 years); MS (-) and MS (+) give MS status at 40 years.

(a) Examination data

| Examination data | 30-year-old: MS (-) | 30-year-old: MS (+) | p-value | 35-year-old: MS (-) | 35-year-old: MS (+) | p-value | 40-year-old: MS (-) | 40-year-old: MS (+) | p-value |
|---|---|---|---|---|---|---|---|---|---|
| Male | n=2,186 | n=156 | | n=2,927 | n=171 | | n=3,857 | n=351 | |
| Body mass index, kg/m²* | 22.03 (2.80) | 25.34 (3.27) | <0.001 | 22.60 (3.02) | 26.18 (3.02) | <0.001 | 23.26 (3.22) | 28.83 (4.07) | <0.001 |
| Waist circumference, cm* | 77.95 (7.57) | 86.73 (8.58) | <0.001 | 80.18 (8.20) | 89.64 (7.69) | <0.001 | 82.28 (8.66) | 96.50 (9.23) | <0.001 |
| Systolic blood pressure, mmHg* | 115.08 (11.45) | 123.35 (12.35) | <0.001 | 116.42 (11.10) | 125.91 (12.06) | <0.001 | 118.20 (11.43) | 134.99 (12.89) | <0.001 |
| Diastolic blood pressure, mmHg | 68.94 (8.43) | 75.97 (9.42) | <0.001 | 70.97 (8.60) | 78.95 (10.03) | <0.001 | 73.67 (9.28) | 88.01 (10.42) | <0.001 |
| HDL-C, mg/dL* | 58.16 (12.49) | 52.16 (13.61) | <0.001 | 57.14 (12.92) | 49.76 (11.45) | <0.001 | 58.01 (13.65) | 46.93 (10.98) | <0.001 |
| LDL-C, mg/dL* | 111.59 (28.83) | 127.62 (32.99) | <0.001 | 119.20 (30.78) | 132.49 (35.76) | <0.001 | 123.54 (30.82) | 135.12 (33.04) | <0.001 |
| Blood glucose, mg/dL* | 89.15 (7.91) | 92.22 (12.80) | <0.001 | 90.30 (8.43) | 94.38 (9.32) | <0.001 | 90.89 (9.91) | 105.85 (31.75) | <0.001 |
| Uric acid, mg/dL* | 5.97 (1.11) | 6.63 (1.19) | <0.001 | 6.05 (1.12) | 7.01 (1.18) | <0.001 | 6.29 (1.23) | 7.22 (1.38) | <0.001 |
| Hemoglobin, g/dL* | 15.30 (0.87) | 15.62 (0.89) | <0.001 | 15.33 (0.89) | 15.79 (0.87) | <0.001 | 15.25 (0.93) | 15.97 (0.97) | <0.001 |
| Hematocrit, %* | 46.49 (2.62) | 47.49 (2.85) | <0.001 | 46.21 (2.71) | 47.66 (2.76) | <0.001 | 46.11 (2.79) | 48.08 (2.88) | <0.001 |
| Red blood cells, 10⁴/μL* | 503.22 (31.52) | 515.49 (31.54) | <0.001 | 502.00 (31.87) | 517.09 (33.15) | <0.001 | 499.61 (33.44) | 522.59 (35.49) | <0.001 |
| White blood cells, 10²/μL* | 58.30 (14.05) | 65.33 (15.89) | <0.001 | 60.05 (17.59) | 66.68 (18.64) | <0.001 | 59.79 (15.70) | 70.87 (17.61) | <0.001 |
| Alanine aminotransferase, U/L | 26 (14-26) | 30 (20-48) | <0.001 | 20 (15-30) | 35 (23-57) | <0.001 | 22 (17-33) | 46 (31-69) | <0.001 |
| Aspartate aminotransferase, U/L | 23 (17-23) | 23 (19-30) | <0.001 | 20 (17-24) | 25 (20-31) | <0.001 | 21 (18-26) | 29 (23-40) | <0.001 |
| γ-glutamyl transpeptidase, U/L | 31 (18-31) | 35.5 (24-58) | <0.001 | 25 (18-38) | 43 (30-74) | <0.001 | 28 (20-46) | 60 (41-99) | <0.001 |
| Triglycerides, mg/dL | 105 (56-105) | 117 (87-170) | <0.001 | 83 (60-120) | 136 (106-193) | <0.001 | 89 (63-126) | 198 (157-263) | <0.001 |
| Female | n=648 | n=8 | | n=938 | n=9 | | n=1,762 | n=24 | |
| Body mass index, kg/m²* | 20.04 (2.59) | 30.98 (4.12) | <0.001 | 20.76 (3.12) | 34.68 (4.10) | <0.001 | 21.49 (3.50) | 34.10 (5.81) | <0.001 |
| Waist circumference, cm* | 71.08 (7.22) | 95.89 (9.44) | <0.001 | 73.53 (8.29) | 105.69 (14.15) | <0.001 | 75.41 (9.01) | 104.88 (14.13) | <0.001 |
| Systolic blood pressure, mmHg* | 105.51 (11.06) | 124.25 (12.67) | <0.001 | 107.59 (12.16) | 126.89 (12.44) | <0.001 | 110.24 (12.88) | 138.88 (14.95) | <0.001 |
| Diastolic blood pressure, mmHg | 63.99 (8.58) | 75.88 (8.11) | <0.001 | 66.05 (9.39) | 78.44 (4.45) | <0.001 | 67.75 (10.19) | 86.04 (10.19) | <0.001 |
| HDL-C, mg/dL* | 70.37 (13.71) | 55.00 (12.18) | 0.002 | 68.56 (14.17) | 56.00 (15.84) | 0.008 | 69.19 (14.49) | 48.92 (10.77) | <0.001 |
| LDL-C, mg/dL* | 99.17 (23.17) | 127.75 (24.28) | 0.001 | 106.27 (26.73) | 133.89 (21.79) | 0.002 | 109.01 (28.16) | 140.54 (31.72) | <0.001 |
| Blood glucose, mg/dL* | 84.74 (6.05) | 93.38 (6.50) | <0.001 | 86.52 (7.25) | 100.67 (9.45) | <0.001 | 87.46 (8.13) | 123.96 (52.13) | <0.001 |
| Uric acid, mg/dL* | 4.19 (0.87) | 5.63 (1.26) | <0.001 | 4.23 (0.87) | 6.33 (1.28) | <0.001 | 4.35 (0.97) | 5.71 (1.14) | <0.001 |
| Hemoglobin, g/dL* | 12.92 (1.09) | 14.03 (1.33) | 0.004 | 12.96 (1.10) | 13.98 (1.04) | 0.006 | 12.89 (1.26) | 14.27 (0.93) | <0.001 |
| Hematocrit, %* | 40.25 (3.02) | 44.27 (3.37) | <0.001 | 40.01 (2.91) | 42.43 (2.62) | 0.013 | 40.01 (3.33) | 43.82 (1.91) | <0.001 |
| Red blood cells, 10⁴/μL* | 438.23 (29.83) | 488.75 (41.93) | <0.001 | 442.08 (30.99) | 481.00 (24.12) | <0.001 | 442.10 (31.76) | 489.08 (32.09) | <0.001 |
| White blood cells, 10²/μL | 57.68 (15.18) | 76.25 (16.88) | <0.001 | 58.30 (15.49) | 75.56 (11.66) | <0.001 | 58.82 (16.08) | 82.79 (14.79) | <0.001 |
| Alanine aminotransferase, U/L | 12 (10-15) | (14-30) | 0.008 | 12 (10-15) | 24 (17-89) | <0.001 | 16 (12-21) | 39.5 (25-48) | <0.001 |
| Aspartate aminotransferase, U/L | 17 (15-19) | (14-21) | 0.524 | 17 (15-19) | 18 (18-63) | 0.05 | 17 (15-20) | 25.5 (17-42) | <0.001 |
| γ-glutamyl transpeptidase, U/L | 15 (12-18) | (16-43) | 0.004 | 15 (12-19) | 26 (20-44) | <0.001 | 13 (10-17) | 27 (19-59) | <0.001 |
| Triglycerides, mg/dL | 52 (40-67) | (58-157) | 0.003 | 57 (44-75) | 125 (106-131) | <0.001 | 60 (47-82) | 158 (125-205) | <0.001 |

(b) Questionnaire

| Questionnaire | Response | 30-year-old: MS (-) | 30-year-old: MS (+) | p-value | 35-year-old: MS (-) | 35-year-old: MS (+) | p-value | 40-year-old: MS (-) | 40-year-old: MS (+) | p-value |
|---|---|---|---|---|---|---|---|---|---|---|
| Male | | n=2,186 | n=156 | | n=2,927 | n=171 | | n=3,857 | n=351 | |
| Drinking alcohol every day | Yes | 216 (9.9) | 21 (13.5) | 0.168 | 410 (14.1) | 32 (18.8) | 0.091 | 633 (16.4) | 55 (15.7) | 0.763 |
| | No | 1,969 (90.1) | 135 (86.5) | | 2,503 (85.9) | 138 (81.2) | | 3,222 (83.6) | 296 (84.3) | |
| Having breakfast | Yes | 1,409 (64.5) | 90 (57.7) | 0.101 | 2,012 (69) | 100 (58.8) | 0.006 | 2,769 (71.8) | 229 (65.4) | 0.013 |
| | No | 777 (35.5) | 66 (42.3) | | 902 (31) | 70 (41.2) | | 1,085 (28.2) | 121 (34.6) | |
| Paying attention to nutritional balance | Yes | 1,509 (76.5) | 96 (73.8) | 0.522 | 2,345 (80.5) | 141 (82.9) | 0.485 | 1,958 (50.8) | 155 (44.3) | 0.022 |
| | No | 463 (23.5) | 34 (26.2) | | 568 (19.5) | 29 (17.1) | | 1,894 (49.2) | 195 (55.7) | |
| Walking more than 1 hour per day | Yes | 800 (36.6) | 48 (30.8) | 0.144 | 1,087 (37.3) | 61 (35.9) | 0.744 | 2,455 (63.7) | 181 (51.6) | <0.001 |
| | No | 1,383 (63.4) | 108 (69.2) | | 1,824 (62.7) | 109 (64.1) | | 1,400 (36.3) | 170 (48.4) | |
| Walking speed | Yes | 1,140 (52.2) | 70 (44.9) | 0.082 | 1,459 (50.1) | 76 (44.7) | 0.18 | 2,004 (52.0) | 134 (38.2) | <0.001 |
| | No | 1,045 (47.8) | 86 (55.1) | | 1,454 (49.9) | 94 (55.3) | | 1,850 (48.0) | 217 (61.8) | |
| Intention to improve health | Yes | 1,592 (72.9) | 131 (84) | 0.002 | 1,989 (68.3) | 140 (82.4) | <0.001 | 2,725 (70.7) | 300 (85.5) | <0.001 |
| | No | 591 (27.1) | 25 (16) | | 924 (31.7) | 30 (17.6) | | 1,129 (29.3) | 51 (14.5) | |
| Eating before bed | Yes | 1,223 (55.9) | 90 (58.1) | 0.617 | 1,677 (57.6) | 105 (62.1) | 0.263 | 754 (19.6) | 71 (20.3) | 0.726 |
| | No | 963 (44.1) | 65 (41.9) | | 1,232 (42.4) | 64 (37.9) | | 3,099 (80.4) | 279 (79.7) | |
| Getting rest from sleep | Yes | 1,223 (56.2) | 81 (52.3) | 0.358 | 1,710 (58.7) | 102 (60) | 0.81 | 2,688 (69.7) | 219 (62.4) | 0.005 |
| | No | 954 (43.8) | 74 (47.7) | | 1,202 (41.3) | 68 (40) | | 1,166 (30.3) | 132 (37.6) | |
| Eating too fast | Yes | 776 (35.5) | 70 (45.2) | 0.019 | 979 (33.6) | 84 (49.4) | <0.001 | 1,502 (39.0) | 142 (40.5) | 0.607 |
| | No | 1,408 (64.5) | 85 (54.8) | | 1,935 (66.4) | 86 (50.6) | | 2,350 (61.0) | 209 (59.5) | |
| Smoke cigarettes | Yes | 730 (33.4) | 65 (41.7) | 0.044 | 916 (31.4) | 62 (36.5) | 0.176 | 2,305 (59.8) | 225 (64.1) | 0.124 |
| | No | 1,456 (66.6) | 91 (58.3) | | 1,997 (68.6) | 108 (63.5) | | 1,549 (40.2) | 126 (35.9) | |
| Gained more than 10 kg in weight | Yes | 393 (18) | 75 (48.1) | <0.001 | 805 (29.1) | 109 (65.7) | <0.001 | 1,576 (41.6) | 285 (82.1) | <0.001 |
| | No | 1,793 (82) | 81 (51.9) | | 1,957 (70.9) | 57 (34.3) | | 2,213 (58.4) | 62 (17.9) | |
| Exercise more than twice a week | Yes | 369 (16.9) | 39 (25) | 0.016 | 515 (17.7) | 29 (17.2) | 0.918 | 766 (19.9) | 60 (17.1) | 0.233 |
| | No | 1,816 (83.1) | 117 (75) | | 2,399 (82.3) | 140 (82.8) | | 3,089 (80.1) | 291 (82.9) | |
| Female | | n=648 | n=8 | | n=938 | n=9 | | n=1,762 | n=24 | |
| Drinking alcohol every day | Yes | 23 (3.5) | 1 (12.5) | 0.259 | 56 (6.00) | 1 (11.1) | 0.430 | 138 (7.8) | 2 (8.30) | 0.712 |
| | No | 625 (96.5) | 7 (87.5) | | 880 (94.0) | 8 (88.9) | | 1,622 (92.2) | 22 (91.7) | |
| Having breakfast | Yes | 142 (21.9) | 1 (12.5) | 1 | 759 (81.0) | 8 (88.9) | 1 | 1,431 (81.3) | 18 (75.0) | 0.430 |
| | No | 506 (78.1) | 7 (87.5) | | 178 (19.0) | 1 (11.1) | | 329 (18.7) | 6 (25.0) | |
| Paying attention to nutritional balance | Yes | 521 (85) | 6 (85.7) | 1 | 817 (87.3) | 8 (88.9) | 1 | 1,562 (88.8) | 21 (87.5) | 0.746 |
| | No | 92 (15) | 1 (14.3) | | 119 (12.7) | 1 (11.1) | | 198 (11.2) | 3 (12.5) | |
| Walking more than 1 hour per day | Yes | 176 (27.2) | 1 (12.5) | 0.689 | 237 (25.3) | 4 (44.4) | 0.244 | 526 (29.9) | 8 (33.3) | 0.823 |
| | No | 472 (72.8) | 7 (87.5) | | 700 (74.7) | 5 (55.6) | | 1,233 (70.1) | 16 (66.7) | |
| Walking speed | Yes | 245 (37.8) | 2 (25) | 0.717 | 380 (40.6) | 3 (33.3) | 0.746 | 758 (43.1) | 5 (20.8) | 0.036 |
| | No | 403 (62.2) | 6 (75) | | 557 (59.4) | 6 (66.7) | | 1,001 (56.9) | 19 (79.2) | |
| Intention to improve health | Yes | 504 (78) | 6 (75) | 0.691 | 698 (74.7) | 8 (88.9) | 0.463 | 1,361 (77.3) | 22 (91.7) | 0.136 |
| | No | 142 (22) | 2 (25) | | 237 (25.3) | 1 (11.1) | | 399 (22.7) | 2 (8.30) | |
| Eating before bed | Yes | 267 (41.3) | 1 (12.5) | 0.150 | 294 (31.4) | 3 (33.3) | 1 | 461 (26.2) | 6 (25.0) | 1 |
| | No | 380 (58.7) | 7 (87.5) | | 643 (68.6) | 6 (66.7) | | 1,298 (73.8) | 18 (75.0) | |
| Getting rest from sleep | Yes | 347 (53.6) | 6 (75) | 0.298 | 496 (53.1) | 3 (33.3) | 0.320 | 912 (51.8) | 12 (50.0) | 1 |
| | No | 300 (46.4) | 2 (25) | | 438 (46.9) | 6 (66.7) | | 848 (48.2) | 12 (50.0) | |
| Eating too fast | Yes | 139 (21.5) | 2 (25) | 0.684 | 201 (21.5) | 3 (33.3) | 0.415 | 436 (24.8) | 7 (29.2) | 0.636 |
| | No | 508 (78.5) | 6 (75) | | 735 (78.5) | 6 (66.7) | | 1,324 (75.2) | 17 (70.8) | |
| Smoke cigarettes | Yes | 81 (12.5) | 2 (25) | 0.268 | 92 (9.80) | 1 (11.1) | 0.608 | 183 (10.4) | 4 (16.7) | 0.308 |
| | No | 567 (87.5) | 6 (75) | | 845 (90.2) | 8 (88.9) | | 1,577 (89.6) | 20 (83.3) | |
| Gained more than 10 kg in weight | Yes | 49 (7.6) | 5 (62.5) | <0.001 | 110 (12.0) | 8 (88.9) | <0.001 | 360 (22.2) | 22 (100.0) | <0.001 |
| | No | 597 (92.4) | 3 (37.5) | | 803 (88.0) | 1 (11.1) | | 1,260 (77.8) | 0 (00.0) | |
| Exercise more than twice a week | Yes | 66 (10.2) | 1 (12.5) | 0.580 | 79 (8.4) | 2 (22.2) | 0.176 | 199 (11.3) | 3 (12.5) | 0.747 |
| | No | 582 (89.8) | 7 (87.5) | | 858 (91.6) | 7 (77.8) | | 1,561 (88.7) | 21 (87.5) | |

Values are presented as mean (SD), median (IQR), or n (%). *t-test. Medians were compared with the Mann–Whitney U test and proportions with the chi-square test.

HDL-C, high-density lipoprotein cholesterol; IQR, interquartile range; LDL-C, low-density lipoprotein cholesterol; MS, metabolic syndrome; SD, standard deviation.

For the health examination questions at the ages of 30 and 35 years, the MS-onset group had a significantly higher prevalence of “yes” responses to six questions (“Drinking alcohol every day,” “Not having breakfast,” “Intention to improve,” “Gained more than 10 kg in weight,” “Eating too fast,” and “Smoke cigarettes”) (Table 1).

We compared the predictive accuracy of the models using AU-ROC and AU-PRC and found that RF had a higher AU-ROC and AU-PRC than LR in all the models, although the differences in accuracy were not significant (Table 2 and Figure 2).

Table 2. Predictive accuracy of the random forest and logistic regression models

| Model | Metric | Random forest | Logistic regression | p-value* |
|---|---|---|---|---|
| Prediction for 30-year-old males | AU-ROC | 0.867 | 0.852 | 0.64 |
| | Sensitivity | 0.882 | 0.765 | |
| | Specificity | 0.737 | 0.863 | |
| | AU-PRC | 0.392 | 0.356 | |
| Prediction for 35-year-old males | AU-ROC | 0.876 | 0.852 | 0.15 |
| | Sensitivity | 0.818 | 0.879 | |
| | Specificity | 0.820 | 0.767 | |
| | AU-PRC | 0.444 | 0.427 | |
| Prediction for 30-year-old females | AU-ROC | 0.985 | 0.748 | 0.31 |
| | Sensitivity | 1.000 | 1.000 | |
| | Specificity | 0.974 | 0.510 | |
| | AU-PRC | 0.550 | 0.377 | |
| Prediction for 35-year-old females | AU-ROC | 0.981 | 0.940 | 0.16 |
| | Sensitivity | 1.000 | 1.000 | |
| | Specificity | 0.927 | 0.882 | |
| | AU-PRC | 0.613 | 0.271 | |

*DeLong’s test for two correlated ROC curves. Sensitivity and specificity are reported at the optimal point on the ROC curve.

AU-PRC, area under the precision-recall curve; AU-ROC, area under the receiver operating characteristic curve.
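Sensitivity and specificity depend on the cutoff chosen on the ROC curve. The text does not state how the optimal point was defined; the Python sketch below assumes Youden's J statistic (sensitivity + specificity - 1), a common default, purely for illustration. The function name is hypothetical.

```python
def best_youden_threshold(y_true, scores):
    """Return (threshold, sensitivity, specificity) maximizing Youden's J.

    A sample is predicted positive when its score is >= threshold.
    Ties in J are resolved in favor of the lowest threshold.
    """
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    best = None
    for threshold in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
        sensitivity = tp / n_pos
        specificity = tn / n_neg
        j = sensitivity + specificity - 1.0
        if best is None or j > best[0]:
            best = (j, threshold, sensitivity, specificity)
    return best[1], best[2], best[3]
```

With very few incident cases in a test set, a single cutoff can yield a sensitivity of 1.000, so such values are best read together with the threshold-free AU-ROC and AU-PRC.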

Fig. 2.

Receiver operating characteristic (ROC) curve and precision-recall (PR) curve for predicting the onset of metabolic syndrome via the random forest and logistic regression models. A PR curve is a graph with precision values (i.e., positive predictive value) on the y-axis and recall values (i.e., sensitivity) on the x-axis.

We assessed the importance of the predictors by calculating the variable importance of the explanatory variables in the RF models, and diastolic blood pressure was shown to be the most important predictor of MS onset in males aged 30 and 35 years. LDL-C, BMI, HDL-C, waist circumference, and walking time were also identified as important factors for predicting MS onset among male participants aged 30 years. HDL-C and skipping breakfast were also revealed as important factors among male participants aged 35 years. For female participants aged 30 and 35 years, BMI, waist circumference, uric acid levels, and triglyceride levels were the most important predictors of MS onset (Figure 3).

Fig. 3.

Important predictors of metabolic syndrome onset with values of important variables calculated using the random forest method.

*Variable importance is expressed as a percentage. The x-axis indicates the variable importance in the random forest model, with the importance of the most influential variable in each model set to 100%.

The MDS plots of the male participants at the ages of 30 and 35 years were divided into two major clusters, and most of the MS-onset group was included in the right cluster of the plot. Conversely, in the MDS plots of the female participants at the age of 30 and 35 years, non-MS cases formed a single cluster and MS-onset cases were scattered away from the non-MS-onset cluster (Figure 4).

Fig. 4.

Clustering of young workers with the multidimensional scaling (MDS) plot using the random forest method. The similarity between samples can be evaluated from the distance between the dots on the MDS plot. Dim1 and Dim2 represent the eigenvectors of the proximity matrix from the random forest model.
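For readers who want to see how two plotting dimensions can be derived from pairwise dissimilarities, a classical MDS sketch in plain Python follows. It assumes that dissimilarity is defined as 1 minus the random forest proximity, as in the MDSplot function of the R randomForest package, and uses power iteration to extract the leading eigenvectors. It is an illustration under those assumptions, not the authors' code, and the function name is hypothetical.

```python
import math


def classical_mds(dist, k=2, n_iter=200):
    """Classical MDS of a symmetric distance matrix (list of lists) into k dimensions."""
    n = len(dist)
    squared = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in squared]
    grand_mean = sum(row_mean) / n
    # Double centering: B = -1/2 * J * D^2 * J (row and column means coincide by symmetry).
    b = [[-0.5 * (squared[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * k for _ in range(n)]
    for dim in range(k):
        # Non-constant start vector, because the all-ones vector lies in the null space of B.
        v = [float(i + 1) for i in range(n)]
        eigenvalue = 0.0
        for _ in range(n_iter):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left in the remaining dimensions
                eigenvalue = 0.0
                break
            v = [x / norm for x in w]
            eigenvalue = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        # Negative eigenvalues (non-Euclidean dissimilarities) are clipped to zero.
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i][dim] = v[i] * scale
        # Deflate B so that the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords


# Usage with a hypothetical proximity matrix `prox`:
# dist = [[1.0 - p for p in row] for row in prox]; dims = classical_mds(dist)
```

Participants that frequently end up in the same terminal nodes have high proximity and therefore appear close together on the resulting plot.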

Discussion

In this study, we applied the RF and LR machine learning methods to 10-year longitudinal health checkup data from individuals aged 30 and 35 years employed at a Japanese company and developed models to predict MS onset among these individuals. RF models were found to have a higher predictive power than logistic regression models. We also determined important predictors by comparing the variable importance in RF models. The MDS plots using the RF method showed different characteristics between the participants with MS and those without MS onset.

In this study, the accuracy of the RF predictive model was higher than that of the LR model. In LR models, we assume that samples can be linearly stratified for each outcome, whereas the calculation methods used in RF models assure stratification26). Furthermore, RF has been reported to have better performance than other methods when there are many explanatory variables and interactions between variables27). In the present study, 28 explanatory variables were used to create the model, and a group of variables with interactions (e.g., BMI and waist circumference) was included. This may be why the accuracy of predicting the onset of MS was higher in RF models than in LR models.

In a previous study, the predictors of MS onset were investigated using regression analysis in Japanese males aged 30, 35, and 40 years, and an increase in BMI was reported as the most important predictor13). However, in the present study, diastolic blood pressure was found to be the most important predictor. This difference may be due to multicollinearity that occurred differently in the RF and LR models. A previous study showed that RF has better performance for nonlinear relationships than regression analysis (Cox proportional hazards regression)16). This characteristic of RF may lead to the presence of different factors for MS prediction. Another previous study using the neural network method also showed diastolic blood pressure as an important predictor28), and another study also reported that diastolic blood pressure is associated with the development of MS at a young age29). These studies support the results of the present study. The value of diastolic blood pressure may be more useful for judging the future onset of MS than the increase in BMI because blood pressure can be ascertained at one point, while the increase in BMI cannot be measured at one time.

In the analysis of female participants, the most important predictor of MS was BMI, followed by waist circumference, as noted in a previous study13). Waist circumference is the criterion for determining the onset of MS, and the criteria for waist circumference are stricter in females than in males. Therefore, the importance of waist circumference and BMI might be higher in female participants than in male participants.

Among the questionnaire items, walking time in males at the age of 30 years, skipping breakfast in males at the age of 35 years, walking speed in females at the age of 30 years, and not feeling rested from sleep in females at the age of 35 years were identified as important predictors. A previous study reported that daily exercise habits, regular diet, and restful sleep were associated with the development of MS in both males and females20). This finding is consistent with the results of the current study.

In the MDS plot of female participants, MS cases were sporadically located away from the cluster. This sporadic population may represent a future unhealthy population. However, in the plot of male participants, MS cases were concentrated in one of two separate clusters; thus, clusters with a high concentration of MS cases may represent future unhealthy populations. The differences between MS and non-MS participants in the MDS plot were more conspicuous among female participants than among male participants. This may have been because the waist circumference criterion for determining MS was stricter in females than in males, so that females who experienced MS onset at the age of 40 had more distinctive characteristics than males in the same situation.

The strength of our study is our construction of models with machine learning methods to predict the onset of MS in males and in females using large longitudinal health examination data of young people collected over a 10-year period at a Japanese company.

In addition, using a highly interpretable machine learning method, we were able to identify important predictors from many health checkup items. To the best of our knowledge, this is the first study to develop predictive models with machine learning methods to predict the onset of MS using longitudinal data of males and females in their 30s.

Limitations

This study has several limitations. First, there is a possibility of selection bias because we developed and evaluated the predictive model using data of healthy employees who spontaneously underwent health checkups in various business sites of a large company only. We also included multiple industry types; thus, the study population may not represent average young workers in Japan.

Second, it is difficult to apply the same result to populations in other countries because this study only used Japanese health examination data and MS criteria specific to Japan. However, since East Asians share similar characteristics, the present results may be applicable to East Asia, including China, Korea, and Taiwan. Furthermore, when the MS criteria of the International Diabetes Federation were applied to this study, we obtained results similar to those of the main analyses (eFigure 1 and eFigure 2).

Third, a small sample size and low incidence of MS in female participants might lower the predictive ability of the models for the female population. Hence, we used Firth’s bias reduction method to reduce the problem of complete separation caused by the small number of cases.
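For reference, Firth's method can be summarized as maximizing a penalized version of the logistic log-likelihood. The notation below is introduced here for illustration only and is not taken from the original analysis: $\beta$ denotes the regression coefficients, $\ell(\beta)$ the ordinary log-likelihood, $X$ the design matrix, and $\pi_i$ the fitted probability of MS onset for participant $i$.

```latex
\ell^{*}(\beta) = \ell(\beta) + \tfrac{1}{2}\,\log\det I(\beta),
\qquad
I(\beta) = X^{\top} W(\beta)\, X,
\qquad
W(\beta) = \operatorname{diag}\bigl(\pi_i (1 - \pi_i)\bigr)
```

Because the penalty term decreases without bound as the fitted probabilities approach 0 or 1, the penalized estimates remain finite even when a predictor separates cases from non-cases completely.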

Fourth, we could not prepare an external dataset for this study. Very few companies have collected health checkup data for 10 consecutive years starting in the employees' 30s; therefore, it is impractical to prepare an external dataset. To address this problem, we performed internal validation using 10-fold cross-validation.
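As an illustration of internal validation by 10-fold cross-validation, the following minimal Python sketch generates the train/test index splits. It assumes stratified folds (each fold keeps roughly the same share of MS cases), which is a common choice for rare outcomes but is not stated in this article; the function name `stratified_kfold` and the toy labels are hypothetical.

```python
import random


def stratified_kfold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each test fold keeps roughly the
    same share of each class as the full data set (an assumption here)."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    # Shuffle the indices of each class, then deal them round-robin into folds.
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    # Each fold serves once as the test set; the other k - 1 folds form the training set.
    for f in range(k):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, test_idx


# Hypothetical toy data: 200 participants, 20 of whom develop MS (label 1).
labels = [1] * 20 + [0] * 180
splits = list(stratified_kfold(labels, k=10))
```

In practice, a model would be fitted on each training set and evaluated on the corresponding test set, and the fold-wise metrics (e.g., AU-ROC) would then be averaged.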

Fifth, a moderate number of individuals were excluded from the study. Most exclusions were due to missing health examination data at the age of 40 years. A comparison of baseline characteristics showed that the study subjects differed slightly from the excluded individuals in BMI, HDL-C, “bedtime eating,” and “willingness to improve” (eTable 2). The performance of the developed model may therefore be slightly reduced when it is applied to people with characteristics similar to those of the excluded individuals.

Sixth, the current study does not include socioeconomic factors (e.g., education and household income) or occupational factors (e.g., long working hours and shift work), which are important factors in predicting the onset of MS. Therefore, the validity of this study needs further investigation.

Seventh, limiting the outcome of this study to 40 years of age may lead to an underestimation of MS incidence. However, because the health checkup items related to MS were introduced in 2008, it is difficult to secure sufficient data on MS onset for workers who were 30 years old at that time. Although we have data up to the age of 41 years, some participants would then be evaluated over 2 years (ages 40 and 41) and others over a single year (age 40 only), which may bias the prediction models. Therefore, we restricted the outcome to 40 years of age.

Finally, the method of measuring LDL-C was not standardized in this study, although the Friedewald equation is the preferred method of estimating LDL-C for the diagnosis of MS. Furthermore, the reagents used for blood testing and the physical measurement protocols (e.g., for weight and blood pressure) were not standardized across health checkup sites. This might have affected the prediction models and the interpretation of variable importance in this study.
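For readers unfamiliar with the Friedewald equation mentioned above, a minimal Python sketch follows. It assumes that total cholesterol, HDL-C, and fasting triglycerides are available in mg/dL; the function name `friedewald_ldl` and the example values are hypothetical and are not taken from the study data.

```python
def friedewald_ldl(total_chol, hdl, triglycerides):
    """Estimate LDL-C (mg/dL) as total cholesterol - HDL-C - triglycerides / 5.

    All inputs are in mg/dL. Returns None when triglycerides are 400 mg/dL
    or higher, where the estimate is generally considered unreliable.
    """
    if triglycerides >= 400:
        return None
    return total_chol - hdl - triglycerides / 5.0


# Hypothetical example values (mg/dL), for illustration only.
print(friedewald_ldl(200, 50, 100))  # 130.0
```

Directly measured and estimated LDL-C values can differ, which is one reason why the lack of standardization may matter for the variable importance results.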

Clinical indication

By applying the predictive model developed in this study to the health checkup data of males and females in their 30s, it may be possible to prevent MS onset effectively. For instance, companies and municipalities with limited medical resources can identify groups at high risk of MS by applying our models to their data.

In this study, in addition to BMI and waist circumference, diastolic blood pressure, LDL-C, and HDL-C in males and uric acid and triglycerides in females were identified as important predictors of MS. Walking habits in both males and females, skipping breakfast in males, and restful sleep in females were also identified as important factors. Based on these results, MS onset may be prevented more efficiently by focusing on sex-specific items when conducting health guidance for people in their 30s.

Until recently, statutory medical checkups in Japan were not required to include items related to MS onset (e.g., blood glucose and lipids) for people in their 30s. However, since April 2018, revisions to the law have led many companies to include these items for people under the age of 40. Therefore, we believe that the prediction model developed in this study will be useful for people in their 30s in the future.

A previous study showed that health guidance for people under 40 years of age is effective for the prevention of MS onset30). Therefore, using this model to identify high-risk subjects in their 30s, we may efficiently prevent the onset of MS. Furthermore, providing health guidance to young people at high risk and focusing on the predictors identified in this study might lead to the effective prevention of MS onset, resulting in a reduction in the nation’s healthcare costs.

Conclusion

We developed a high-accuracy predictive model with a machine learning method that predicts MS onset at the age of 40 years based on health examination data obtained at the age of 30 or 35 years. Some important sex-specific predictors were identified using this highly interpretable machine learning method. Applying our models to routine healthcare management could provide early and appropriate health interventions to prevent the onset of MS in young people.

Acknowledgments

We are grateful to all the staff of Health Insurance Association A for preparing the dataset for the current study. The analysis code for this study is available at https://github.com/mysuda/tokuho_pred.

Conflicts of interest

There are no conflicts of interest to declare.

Disclosures

The study protocol was examined and approved by the Ethics Committee of the University of Yamanashi (Ethics Committee receipt number R01688). The study was also approved by the Ethics Committee of Health Insurance Association A (receipt number 2019-002). All participants had the opportunity to opt out.

Funding

The authors received no financial support for the research, authorship, and publication of this article.

Author contributions

M.S. and T.O. participated in the design and conception of the study and its coordination, data acquisition, statistical analysis, and manuscript drafting. M.S. and T.O. contributed equally to this manuscript. Z.Y. reviewed the analysis and manuscript.

References
Appendices

eTable 1. List of questionnaire item categorical variables

| Questionnaire item | Response option | Category | Code |
|---|---|---|---|
| Drinking alcohol every day | Drink daily | yes | 1 |
| | Drink 4–6 days a week; Drink 1–3 days a week; I don’t drink. | no | 0 |
| Having breakfast | | yes | 1 |
| | | no | 0 |
| Paying attention to nutritional balance | Not paying much attention | no | 1 |
| | Paying attention from time to time; Always attentive | yes | 0 |
| Walking more than 1 hour per day | | yes | 1 |
| | | no | 0 |
| Walking speed | slow | no | 1 |
| | fast | yes | 0 |
| Intention to improve health | I don’t intend to improve | no | 1 |
| | I will improve it (within 6 months); I plan to improve it in the near future.; I’m starting little by little (within 1 month).; Already working on it (under 6 months).; Already working on it (over 6 months). | yes | 0 |
| Eating before bed | | yes | 1 |
| | | no | 0 |
| Getting rest from sleep | | no | 1 |
| | | yes | 0 |
| Eating too fast | fast | yes | 1 |
| | average; slow | no | 0 |
| Smoke cigarettes | current smoker | yes | 1 |
| | former smoker; never smoker | no | 0 |
| Gained more than 10 kg in weight | | yes | 1 |
| | | no | 0 |
| Exercises more than twice a week | | no | 1 |
| | | yes | 0 |

eTable 2. Comparison of study participants and exclusions

| Examination data | 30 years old: Excluded | 30 years old: Analysis | p-value | 35 years old: Excluded | 35 years old: Analysis | p-value |
|---|---|---|---|---|---|---|
| Male | n=1,817 | n=2,342 | | n=1,283 | n=3,098 | |
| Body mass index, kg/m²* | 22.79±3.88 | 22.25±2.95 | <0.001 | 23.77±4.27 | 22.80±3.13 | <0.001 |
| Waist circumference, cm* | 80.10±10.25 | 78.53±7.94 | <0.001 | 83.58±11.20 | 80.70±8.46 | <0.001 |
| Systolic blood pressure, mmHg* | 117.52±12.37 | 115.63±11.6 | <0.001 | 119.86±13.42 | 116.95±11.36 | <0.001 |
| Diastolic blood pressure, mmHg* | 70.14±9.34 | 69.41±8.67 | .010 | 73.68±10.57 | 71.41±8.87 | <0.001 |
| HDL-C, mg/dL* | 56.30±13.12 | 57.76±12.65 | <0.001 | 54.90±13.89 | 56.73±12.95 | <0.001 |
| LDL-C, mg/dL* | 112.56±28.72 | 112.67±29.39 | .914 | 121.41±31.04 | 119.94±31.21 | .167 |
| Blood glucose, mg/dL* | 90.16±11.75 | 89.35±8.36 | .012 | 92.88±18.02 | 90.53±8.53 | <0.001 |
| Uric acid, mg/dL* | 6.05±1.22 | 6.01±1.12 | .288 | 6.19±1.29 | 6.11±1.14 | .031 |
| Hemoglobin, g/dL* | 15.39±0.92 | 15.32±0.88 | .018 | 15.48±0.94 | 15.36±0.90 | <0.001 |
| Hematocrit, %* | 46.66±2.74 | 46.56±2.64 | .251 | 46.45±2.78 | 46.29±2.73 | .075 |
| Red blood cells, 10⁴/μL* | 505.58±32.90 | 504.04±31.66 | .143 | 506.96±34.17 | 502.83±32.12 | <0.001 |
| White blood cells, 10²/μL* | 61.71±16.80 | 58.77±14.28 | <0.001 | 63.67±18.20 | 60.42±17.71 | <0.001 |
| Alanine aminotransferase, U/L† | 19 (14–31) | 19 (14–27) | .048 | 23 (16–37) | 21 (15–31) | <0.001 |
| Aspartate aminotransferase, U/L† | 20 (16–24) | 19 (17–23) | .124 | 21 (18–27) | 20 (17–25) | <0.001 |
| γ-glutamyl transpeptidase, U/L† | 24 (18–37) | 23 (18–32) | .004 | 28 (20–50) | 25 (19–40) | <0.001 |
| Triglycerides, mg/dL† | 83 (57–123) | 78 (57–108) | <0.001 | 98 (65–160) | 86 (62–124) | <0.001 |
| Female | n=1,433 | n=656 | | n=907 | n=947 | |
| Body mass index, kg/m²* | 20.12±3.06 | 20.18±2.88 | .069 | 20.99±3.67 | 20.89±3.40 | .552 |
| Waist circumference, cm* | 71.52±7.74 | 71.38±7.74 | .714 | 73.98±9.34 | 73.84±8.92 | .753 |
| Systolic blood pressure, mmHg* | 106.32±11.58 | 105.74±11.26 | .283 | 107.53±13.04 | 107.77±12.30 | .686 |
| Diastolic blood pressure, mmHg* | 64.32±8.70 | 64.14±8.67 | .664 | 65.40±9.50 | 66.17±9.43 | .081 |
| HDL-C, mg/dL* | 70.44±13.28 | 70.18±13.79 | .684 | 68.77±14.38 | 68.44±14.23 | .628 |
| LDL-C, mg/dL* | 99.67±25.47 | 99.52±23.38 | .896 | 106.49±27.49 | 106.54±26.81 | .971 |
| Blood glucose, mg/dL* | 84.55±8.07 | 84.85±6.13 | .408 | 86.09±12.47 | 86.65±7.40 | .238 |
| Uric acid, mg/dL* | 4.13±0.86 | 4.20±0.88 | .076 | 4.19±0.97 | 4.25±0.89 | .235 |
| Hemoglobin, g/dL* | 12.93±1.12 | 12.93±1.09 | .945 | 12.83±1.24 | 12.97±1.11 | .011 |
| Hematocrit, %* | 39.86±3.15 | 40.30±3.06 | .024 | 39.58±3.27 | 40.03±2.92 | .002 |
| Red blood cells, 10⁴/μL* | 436.82±31.09 | 438.85±30.48 | .169 | 436.27±34.96 | 442.45±31.15 | <0.001 |
| White blood cells, 10²/μL* | 58.78±16.50 | 57.90±15.33 | .258 | 59.00±16.02 | 58.46±15.54 | .470 |
| Alanine aminotransferase, U/L† | 12 (9–15) | 12 (10–15) | .671 | 12 (10–16) | 12 (10–15) | .106 |
| Aspartate aminotransferase, U/L† | 17 (15–19) | 17 (15–19) | .967 | 17 (15–19) | 17 (15–19) | .696 |
| γ-glutamyl transpeptidase, U/L† | 15 (12–18) | 15 (12–18) | .847 | 15 (12–19) | 15 (12–19) | .767 |
| Triglycerides, mg/dL† | 52 (41–69) | 52 (40–67) | .426 | 56 (45–78) | 57 (44–76) | .321 |

| Questionnaire | Response | 30 years old: Excluded | 30 years old: Analysis | p-value | 35 years old: Excluded | 35 years old: Analysis | p-value |
|---|---|---|---|---|---|---|---|
| Male | | n=1,817 | n=2,342 | | n=1,283 | n=3,098 | |
| Drinking alcohol every day | Yes | 194 (11.0) | 237 (10.1) | .355 | 178 (14.0) | 442 (14.3) | .812 |
| | No | 1,567 (89.0) | 2,104 (89.9) | | 1,093 (86.0) | 2,641 (85.7) | |
| Having breakfast | Yes | 1,083 (61.9) | 1,499 (64.0) | .159 | 853 (67.0) | 2,112 (68.5) | .353 |
| | No | 668 (38.1) | 843 (36.0) | | 420 (33.0) | 972 (31.5) | |
| Paying attention to nutritional balance | Yes | 1,113 (73.7) | 1,605 (76.4) | .066 | 957 (75.7) | 2,486 (80.6) | <0.001 |
| | No | 398 (26.3) | 497 (23.6) | | 308 (24.3) | 597 (19.4) | |
| Walking more than 1 hour per day | Yes | 609 (34.7) | 848 (36.3) | .322 | 489 (38.5) | 1,148 (37.3) | .470 |
| | No | 1,144 (65.3) | 1,491 (63.7) | | 782 (61.5) | 1,933 (62.7) | |
| Walking speed | Yes | 855 (48.9) | 1,210 (51.7) | .077 | 614 (48.2) | 1,535 (49.8) | .351 |
| | No | 894 (51.1) | 1,131 (48.3) | | 659 (51.8) | 1,548 (50.2) | |
| Intention to improve health | Yes | 1,276 (72.8) | 1,723 (73.7) | .544 | 874 (68.7) | 2,129 (69.1) | .801 |
| | No | 477 (27.2) | 616 (26.3) | | 399 (31.3) | 954 (30.9) | |
| Eating before bed | Yes | 949 (54.2) | 1,313 (56.1) | .240 | 694 (54.5) | 1,782 (57.9) | .043 |
| | No | 802 (45.8) | 1,028 (43.9) | | 579 (45.5) | 1,296 (41.2) | |
| Getting rest from sleep | Yes | 936 (53.7) | 1,304 (55.9) | .171 | 699 (55.1) | 1,812 (58.5) | .028 |
| | No | 806 (46.3) | 1,028 (44.1) | | 569 (44.9) | 1,270 (41.2) | |
| Eating too fast | Yes | 679 (38.8) | 846 (36.2) | .089 | 481 (37.8) | 1,063 (34.5) | .040 |
| | No | 1,072 (61.2) | 1,493 (63.8) | | 792 (62.6) | 2,021 (65.5) | |
| Smoke cigarettes | Yes | 707 (40.1) | 795 (33.9) | <0.001 | 467 (36.7) | 978 (31.7) | .002 |
| | No | 1,055 (59.9) | 1,547 (66.1) | | 806 (63.3) | 2,105 (68.3) | |
| Gained more than 10 kg in weight | Yes | 448 (25.6) | 468 (20.0) | <0.001 | 493 (40.0) | 914 (31.2) | <0.001 |
| | No | 1,305 (74.4) | 1,874 (80.8) | | 740 (60.0) | 2,014 (68.8) | |
| Exercise more than twice a week | Yes | 301 (17.2) | 408 (17.4) | .835 | 205 (16.1) | 544 (17.6) | .233 |
| | No | 1,452 (82.8) | 1,933 (82.6) | | 1,068 (83.9) | 2,539 (82.4) | |
| Female | | n=1,433 | n=656 | | n=907 | n=947 | |
| Drinking alcohol every day | Yes | 54 (3.9) | 24 (3.7) | .902 | 48 (5.3) | 57 (6.0) | .548 |
| | No | 1,338 (96.1) | 632 (96.3) | | 850 (94.7) | 888 (94.0) | |
| Having breakfast | Yes | 1,055 (76.6) | 513 (78.2) | .463 | 706 (78.6) | 767 (81.1) | .201 |
| | No | 322 (23.4) | 143 (21.8) | | 192 (21.4) | 179 (18.9) | |
| Paying attention to nutritional balance | Yes | 1,048 (85.5) | 527 (85.0) | .781 | 769 (86.7) | 825 (87.3) | .728 |
| | No | 178 (14.5) | 93 (15.0) | | 118 (13.3) | 120 (12.7) | |
| Walking more than 1 hour per day | Yes | 334 (24.3) | 177 (27.0) | .190 | 221 (24.7) | 241 (25.5) | .707 |
| | No | 1,042 (75.7) | 479 (73.0) | | 675 (75.3) | 705 (74.5) | |
| Walking speed | Yes | 513 (37.4) | 247 (37.7) | .922 | 365 (40.7) | 383 (40.5) | .962 |
| | No | 858 (62.6) | 409 (62.3) | | 532 (59.3) | 563 (59.5) | |
| Intention to improve health | Yes | 1,156 (84.1) | 510 (78.0) | .001 | 713 (79.6) | 706 (74.8) | .015 |
| | No | 218 (15.9) | 144 (22.0) | | 183 (20.4) | 238 (25.2) | |
| Eating before bed | Yes | 422 (30.8) | 268 (40.9) | <0.001 | 271 (30.2) | 297 (31.4) | .614 |
| | No | 949 (69.2) | 387 (59.1) | | 626 (69.8) | 649 (68.6) | |
| Getting rest from sleep | Yes | 725 (53.0) | 353 (53.9) | .739 | 439 (48.9) | 499 (52.9) | .085 |
| | No | 643 (47.0) | 302 (46.1) | | 459 (51.1) | 444 (47.1) | |
| Eating too fast | Yes | 300 (21.9) | 141 (21.5) | .908 | 236 (26.3) | 204 (21.6) | .019 |
| | No | 1,072 (78.1) | 514 (78.5) | | 661 (73.7) | 741 (78.4) | |
| Smoke cigarettes | Yes | 216 (15.5) | 83 (12.7) | .094 | 116 (12.9) | 93 (9.8) | .040 |
| | No | 1,177 (84.5) | 573 (87.3) | | 782 (87.1) | 853 (90.2) | |
| Gained more than 10 kg in weight | Yes | 105 (7.6) | 54 (8.3) | .659 | 131 (14.8) | 118 (12.8) | .246 |
| | No | 1,269 (92.4) | 600 (91.7) | | 756 (85.2) | 804 (87.2) | |
| Exercise more than twice a week | Yes | 135 (9.8) | 67 (10.2) | .812 | 104 (11.6) | 81 (8.6) | .036 |
| | No | 1,241 (90.2) | 589 (89.8) | | 793 (88.4) | 865 (91.4) | |

Examination data are presented as mean ± SD or median (IQR). *t-test; †Mann–Whitney U test.

Questionnaire items are presented as n (%) and were compared using the chi-square test.

HDL-C, high-density lipoprotein cholesterol; IQR, interquartile range; LDL-C, low-density lipoprotein cholesterol; SD, standard deviation.
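As a worked illustration of the chi-square comparison used for the questionnaire items, the following Python sketch recomputes the test for one 2×2 table from eTable 2 (smoking status of males at age 30). It assumes a Pearson chi-square test without continuity correction; the article does not state whether a correction was applied, and the function name `chi_square_2x2` is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table
        [[a, b],
         [c, d]].
    Returns (statistic, p_value) with 1 degree of freedom."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Smoking status of males at age 30, counts taken from eTable 2:
# excluded: 707 yes / 1,055 no; analysis: 795 yes / 1,547 no.
stat, p = chi_square_2x2(707, 1055, 795, 1547)
print(f"chi-square = {stat:.2f}, p = {p:.2g}")
```

Under these assumptions the p-value falls below 0.001, in line with the value reported in the table for this item.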

eFigure 1.

Receiver operating characteristic (ROC) curves and precision-recall (PR) curves using overseas criteria for determining metabolic syndrome (International Diabetes Federation: IDF)

eFigure 2.

The important predictors for metabolic syndrome onset using overseas criteria for determining metabolic syndrome (International Diabetes Federation: IDF)

*Variable importance is expressed relative to the top-ranked variable, whose importance is set to 100%.
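The rescaling described in this footnote can be sketched in a few lines of Python. The function name `scale_importance` and the raw scores below are hypothetical and serve only to illustrate the calculation; they are not values from the study.

```python
def scale_importance(raw):
    """Rescale raw importance scores so that the largest score equals 100."""
    top = max(raw.values())
    if top <= 0:
        raise ValueError("at least one importance score must be positive")
    return {name: 100.0 * value / top for name, value in raw.items()}


# Hypothetical raw scores, for illustration only (not values from the study).
raw_scores = {"BMI": 0.30, "waist circumference": 0.15, "uric acid": 0.06}
scaled = scale_importance(raw_scores)
for name, value in sorted(scaled.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.0f}%")
```

Expressing importance this way makes rankings comparable across models, although the absolute magnitudes of the raw scores are no longer visible.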

 
© 2022 The Authors.

This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made.
https://creativecommons.org/licenses/by-nc-nd/4.0/