Novel Statistical Classification Model of Type 2 Diabetes Mellitus Patients for Tailor-made Prevention Using Data Mining Algorithm

To estimate the usefulness of data mining algorithms for extracting risk predictors of diabetic vascular complications in proper order in the future, we tried applying the Classification and Regression Trees (CART) method to the prevalence data of 165 type 2 diabetic outpatients and already known risk factors. Among the 6 categorical and 15 continuous risk factors, age (cutoff: 65.4) was the best predictor for classifying patients into groups with and without macroangiopathy (p=0.000). Body weight (cutoff: 53.9) was the best predictor (p=0.006) in the older group (age >65.4), whereas systolic blood pressure (cutoff: 144.5) was the best predictor in the remaining group (p=0.002). Age (cutoff: 64.8) was also the best predictor for categorizing them into groups with and without microangiopathy (p=0.000). In the older group (age >64.8), BMI (cutoff: 21.5) was the best predictor (p=0.001), whereas morbidity term (cutoff: 15.5) was the best predictor in the other group (p=0.010). Because the orders and values of all risk factors and cutoff points mined were reasonable clinically, this method may have the potential to highlight predictors in order of importance to apply tailor-made prevention of diabetic vascular complications.


INTRODUCTION
Type 2 diabetes mellitus is one of the most common diseases in developed countries.Since its morbidity term is quite long and it greatly affects patient's quality of life, the prevention of its complications is very important.Various risk factors for diabetic complications have been revealed in previous studies 1-4} Age, high blood pressure, long morbidity term, poor glycemic control, high triglyceride, high total cholesterol, high low density lipoprotein, low high density lipoprotein, high BMI, smoking, and male sex are known to be the risk factors for macroangiopathy.Age, poor glycemic control, high blood pressure, high BMI, smoking and male sex are known to be the risk factors for microangiopathy.
Logistic regression analysis in large scale randomized controlled trials has made great contributions to identifying multiple risk factors I).It has yielded odds ratios with 95% confidence intervals for each factor, and the bigger the odds ratio is, the larger its influence would be expected to be.However, since the regression model is optimized for fitting to whole samples, this reasoning is valid only in whole populations, not in the latent subgroups within it.Usual statistical methods are not effective enough to indicate the priority of risk factors in individual patients.
Our final goal is to determine which factors contribute most to the occurrence of complications in each latent subgroup by applying the Classification and Regression Trees (CART) method.As a preliminary step, we estimate here the usefulness of this data mining algorithm using the prevalence of diabetic complications and its risk factors which are already known .

SUBJECTS AND METHODS
We enrolled 165 type 2 diabetic patients from the Keio University Hospital outpatient clinic in the first half of 2000 .We explained the study purpose and plan , and got consents by all of them before the enrollment.To protect the privacies , we omitted patient's name and patient's number from the data before the analysis in case of data leakage .We took the patients' past histories of macroangiopathy and microangiopathy as the endpoints of this analysis.Macroangiopathy here means myocardial infarction, coronary stenosis of more than 75% in diameter as confirmed by coronary angiography, cerebral infarction confirmed by either brain magnetic resonance imaging or computed tomography, or a history of amputation.Microangiopathy here means stage progressions of diabetic retinopathy and nephropathy.The stage of retinopathy was diagnosed by ophthalmologists, and patients diagnosed as having diabetic retinopathy were counted.The stage of nephropathy was decided by the urine albumin-creatinine ratio (Alb/Cr) value, and a value of more than 30 mg/gCr was counted, since the Alb/Cr value have been identified as reliable and valid means of estimating albumin excretion rate in clinical settings 8,9).
We collected the patients' clinical and biochemical data and medical histories related to type 2 diabetes, obtaining data on 6 categorical variables (i.e., sex, smoking habit, family histories of coronary artery disease, cerebrovascular disease, hypertension, and type 2 diabetes mellitus), and 15 continuous ones (i.e., age, morbidity term, systolic blood pressure, diastolic blood pressure, body weight, height, body mass index (BMI), maximum weight, hemoglobin Alc, fast blood sugar, total cholesterol, triglyceride, free fatty acid, low density lipoprotein, and high density lipoprotein) as predictor variables for the analysis.All of the measurements were done in the same laboratory at Keio University Hospital.The characteristics of the patients recruited are shown in Table 1.
We used the Answer Tree 2.1 software produced by the SPSS Company on a Windows NT workstation, and classified the 165 patients by one of the classification tree algorithms called CART.A classification tree is an empirical rule for predicting the class of an object from the values of predictor variables.The concept of classification trees is statistically mature.The first classification tree method was developed by Kass 10)  shown at each branch point of the trees (Figures 1 and 2).The predictors are then compared, and the predictor with the best improvement is selected for the split.The process repeats recursively until one of the stopping rules (shown in the Appendix) is triggered'.The nearer to the root node, the larger the influence is in regard to classifying patients with or without macroangiopathy and microangiopathy.At each branch point, we additionally showed the Chi-square value and p value in order to evaluate the dependency of the classification conventionally.In the branch points in which the expected value of at least one cell of 2x2 cells was less than 5, Fisher's exact test was performed (shown as the p' value).

RESULTS
The root node of Figure 1 indicates that 55 of the 165 patients have had an episode of macroangiopathy.Each node below the root node has the same meaning: the column in the node labeled "category," "%," and "n" indicate which category is described in a row, what percent of the subjects in that node fall into that category, and how many patients are in the category.
The root node of Figure 2 indicates that 70 of the 165 patients have the episode of microangiopathy.From the standpoint of microangiopathy, aging (cutoff: 64.8) was the best predictor for classifying patients with or without microangiopathy (p=0.000).In this old (age >64.8) group, BMI (cutoff: 21.5) was the best predictor (p<0.001),whereas morbidity term (cutoff: 15.5) was the one in the other (age <=64.8) group (p<0.010)(Figure2).In the same way, each node of the tree diagram is interpreted, and represents the patient categories.Due to the recursiveness of the CART algorithm, we can find which factors principally contribute to the occurrence of the complications in each category.For example, from the standpoint of macroangiopathy risk, patients who belong to the old (age >65.4) and heavy (weight >53.9) group in Figure 1 should pay more attention to lowering their LDL level, since it is the most dominant factor.
All the classifications were statistically significant at the 0.02 criterion level (two-sided) by the Chi-square test or Fisher's exact test.

DISCUSSION
The cutoff values derived from the CART algorithm seem consistent with clinical experiences.The cutoff value for systolic blood pressure (144.5) is very close to the WHO's criterion for hypertension (i.e., 140).The cutoff value for fast blood sugar (140.5) is about the same as the old criterion for the diagnosis of diabetes of the American Diabetes Association (i.e., 140).All the other cutoff values were also reasonable clinically.Age, blood pressure, weight control, glycemic control and morbidity term are the dominant risk factors clinicians of the authors predicted before the analysis.The mined cutoff value of age (about 65 in both tree models) is not only near to the predicted value, but also to the administrative senescent cutoff (65 in Japan).Additional Chi-square tests and Fisher's exact tests showed that all of the classifications were statistically significant.These results indicate that this classification method efficiently categorizes diabetic patients with or without diabetic complications to some extent.
Examination of the nodes one by one shows that the most dominant factor differs with the patient category.The results indicate that the risk factors which type 2 diabetic patients must focus on have an order of priority, and that they differ with patient category.For example, from the standpoint of macroangiopathy risk, patients who belong to the middle-aged (age <=65.4) and hypertensive (systolic blood pressure >144.5) group in Figure 1 should pay more attention to lowering their fast blood sugar.However, patients who belong to middle-aged and normotensive group in Figure 1, should pay more attention to lowering their triglyceride level.
Patients at diabetes outpatient clinics often receive many instructions from their doctors on their lifestyles and day-to-MACROANG The root node of Figure 1 indicates that there were 165 total observations, and 55 of the 165 patients have had an episode of macroangiopathy.(i.e., patients belonging to category 1 had a previous episode, and patients belonging to category 0 had not had an episode.)Each node below the root node has the same meaning: the column in the node labeled "category" tells us which category is described in a row, the column labeled "%" tells us what percentage of the subjects in that node fall into that category , and the column labeled "n" tells us how many patients are in that category.At each branch point, "improvement" shows the value of the Gini standard function (see Appendix) at the point."p" stands for the p value of the chi-square test and "p' (p prime)" stands for the p value of Fisher's exact test.All differences are significant at p<0.02.MACROANG, BPSYS, TG, GLU, LDL and FFA in Figure 1 stand for macroangiopathy, systolic blood pressure, triglyceride, fast blood sugar, low density lipoprotein and free fatty acid , respectively.The root node of Figure 2 indicates that there are 165 total observations, and 70 of the 165 patients have had an episode of microangiopathy.The meaning of the nodes in Figure 2 is the same as that in Figure 1.At each branch point, "improvement" shows the value of the Gini standard function (see Appendix) at the point."p" stands for the p value of the chi-square test and "p' (p prime)" stands for the p value of Fisher's exact test.All differences are significant at p<0.02.MICROANG, TERM, BMI, HDL, HBA1C, and GLU in Figure 2 stand for microangiopathy, morbidity term, body mass index, high density lipoprotein, hemoglobin Alc and fast blood sugar, respectively.If used properly with an appropriate study design, this data mining method may have the potential to highlight predictors in order of importance, thus leading to tailor-made prevention plans for individual patients.

Figure 1 .
Figure 1.Classification tree of diabetes patients by macroangiopathy risk factors .

Figure 2 .
Figure 2. Classification tree of diabetes patients by microangiopathy risk factors.

Table 1 .
Baseline characteristics of patients recruited.While this method (the Chi-square Automatic Interaction Detection, CHAID) handles categorical dependent variables, the CART developed byBreiman et al.11)in 1984 can also handle continuous variables.CART is a non-parametric method that splits a node based on a data-defined impurity function (i.e., Gini criterion function described in the Appendix).More recently, in 1997, Loh and Shih developed the Quick Unbiased Efficient Statistical Tree (QUEST) "a, which is intended to be faster computationally and less biased in variable selection than either CHAID or CART.However, QUEST currently handles only categorical dependent variables, and empirical studies indicate that QUEST is not always faster than CART.13)Thus, CART remains a popular classification tree method, and that is why we adopted it in this study.The CART method generates binary decision trees.The trees are constructed by splitting subsets of the data set, using all *mean (SD) , #geometric mean (range) in 1980 to predict and classify data that were related in complex ways.predictorvariables to create two child nodes repeatedly, beginning with the entire data set.The best predictor is chosen using impurity measurement (see Appendix).The goal is to produce subsets of data that are as homogeneous as possible with respect to the target variable (i.e., to divide data into high risk groups and low risk groups as clearly as possible).Each predictor is evaluated for each split to identify the best cut point (continuous predictors) or groupings of categories (nominal and ordinal predictors) based on the improvement score (calculated by Gini criterion function defined as in the Appendix)