2024 Volume 44 Issue 5 Pages 253-262
For prediction models that use structured data such as laboratory test results, ensemble learning is considered the most appropriate machine learning method. We therefore evaluated the performance of several ensemble models in predicting the sex and age of patients from routine laboratory test results alone. For this purpose, we collected datasets on 77,965 cases, each including the results of 17 clinical chemistry tests as features. Ensemble models were constructed with gradient boosting decision tree (GBDT) implementations, namely LightGBM, XGBoost, and CatBoost. The contribution of each feature to the models was analysed using the SHAP value method. LightGBM, XGBoost, and CatBoost achieved areas under the ROC curve (AUROC) of 0.927, 0.927, and 0.930, respectively, for sex classification, and coefficients of determination of 0.676, 0.682, and 0.690, respectively, for age prediction. Among the other machine learning methods examined, including logistic regression, support-vector machine (SVM), and linear regression, SVM achieved the best AUROC for sex classification (0.907), and linear regression with L1 regularization showed a coefficient of determination of 0.410. The top four features by SHAP value in every GBDT model were CRE, UA, γGTP, and TC for sex classification, and ALB, CRE, TC, and either BUN or γGTP for age prediction. These results suggest that GBDT is promising for predicting physiological status, and possibly underlying diseases, from routine laboratory test results alone, and that it may be helpful in the diagnostic process.
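For readers unfamiliar with the two evaluation metrics quoted above, AUROC for sex classification and the coefficient of determination (R²) for age prediction, the following is a minimal sketch of how each can be computed from model outputs. It uses only the Python standard library. The function names `auroc` and `r_squared` and the toy values are illustrative assumptions, not the study's code or data.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("y_true must not be constant")
    return 1.0 - ss_res / ss_tot


# Toy values (hypothetical, not from the study).
sex_labels = [1, 1, 0, 0, 1, 0]               # binary sex label, coding is arbitrary
sex_scores = [0.9, 0.6, 0.7, 0.2, 0.8, 0.3]   # predicted probability of class 1
true_ages = [30.0, 50.0, 70.0]
pred_ages = [35.0, 50.0, 65.0]

print("AUROC:", auroc(sex_labels, sex_scores))
print("R^2:  ", r_squared(true_ages, pred_ages))
```

The pairwise AUROC loop is quadratic in the number of cases and is meant only to make the definition explicit; for a dataset of the size described here, a rank-based implementation from an established library would be the practical choice.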