Constructing Regression Models with High Prediction Accuracy and Interpretability Based on Decision Tree and Random Forests

Journal of Computer Chemistry, Japan

Online ISSN : 1347-3824
Print ISSN : 1347-1767
ISSN-L : 1347-1767

A final published version of this article is available. Please refer to the final version.
When citing this article, please cite the final version.

Naoto SHIMIZU, Hiromasa KANEKO

Keywords: Model interpretability, Predictive ability, Decision tree, Random forests, Regression model

Free-access journal, HTML, advance online publication

Article ID: 2020-0021

DOI https://doi.org/10.2477/jccj.2020-0021

The final version of this article is now available: Vol. 20 (2021), No. 2 pp. 71-87

Abstract

Machine-learning models that predict the properties or activities of materials can lead to the discovery of new mechanisms underlying those properties or activities. However, constructing models that achieve both high prediction accuracy and high interpretability is still a work in progress, because the two exhibit a trade-off relationship. In this study, we propose a new model-construction method that combines a decision tree (DT) with random forests (RF), which we call DT-RF. In DT-RF, the dataset to be analyzed is divided by a DT model, and an RF model is constructed for each subdataset. The DT model enables global interpretation of the data, while the RF models improve the prediction accuracy and enable local interpretation. Case studies were performed using three datasets: the boiling points of compounds, the water solubility of compounds, and the transition temperatures of inorganic superconductors. We examined the proposed method in terms of its validity, prediction accuracy, and interpretability.
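
The two-stage idea described in the abstract (a DT divides the data, then one RF is fitted per subdataset) can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `DTRFSketch`, the hyperparameter values, and the synthetic data are all hypothetical, and the paper's actual tree-growing and model-selection settings are not given on this page.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor


class DTRFSketch:
    """Illustrative DT-RF-style regressor (hypothetical name and defaults).

    A shallow decision tree partitions the samples into leaves, and a
    separate random forest is fitted on the samples of each leaf.
    """

    def __init__(self, max_leaf_nodes=4, min_samples_leaf=20,
                 n_estimators=100, random_state=0):
        # Assumed settings: a small tree keeps the global split rules readable.
        self.tree = DecisionTreeRegressor(
            max_leaf_nodes=max_leaf_nodes,
            min_samples_leaf=min_samples_leaf,
            random_state=random_state,
        )
        self.n_estimators = n_estimators
        self.random_state = random_state
        self.local_models = {}  # leaf index -> fitted RandomForestRegressor

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
        self.tree.fit(X, y)
        leaf_ids = self.tree.apply(X)  # index of the leaf each sample falls into
        for leaf in np.unique(leaf_ids):
            mask = leaf_ids == leaf
            rf = RandomForestRegressor(
                n_estimators=self.n_estimators,
                random_state=self.random_state,
            )
            self.local_models[leaf] = rf.fit(X[mask], y[mask])
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        leaf_ids = self.tree.apply(X)  # route each sample with the DT
        y_pred = np.empty(X.shape[0])
        for leaf, rf in self.local_models.items():
            mask = leaf_ids == leaf
            if mask.any():
                y_pred[mask] = rf.predict(X[mask])  # local RF prediction
        return y_pred


# Synthetic data (not from the paper): two regimes separated by the sign of x0.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = np.where(X_demo[:, 0] > 0,
                  5.0 + X_demo[:, 1] ** 2,
                  -5.0 + np.sin(X_demo[:, 1]))

model = DTRFSketch().fit(X_demo, y_demo)
pred_demo = model.predict(X_demo)
```

In this sketch, the split rules of `model.tree` play the role of the global interpretation, and `feature_importances_` of each forest in `model.local_models` gives one possible per-leaf (local) view of variable importance.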

Figures

Figure 1.

Basic concept of DT-RF.

Figure 2.

DT for BP dataset.

Figure 3.

Results of DCV for BP dataset.

Figure 4.

Results of DCV for each node for BP dataset.

Figure 5.

Importance of variables of BP dataset for local RF models.

Figure 6.

Importance of variables of BP dataset for global RF model.

Figure 7.

DT for logS dataset.

Figure 8.

Results of DCV for logS dataset.

Figure 9.

Results of DCV for each node for logS dataset.

Figure 10.

Importance of variables of logS dataset for local RF models.

Figure 11.

Importance of variables of logS dataset for global RF model.

Figure 12.

DT for T_c dataset.

Figure 13.

Results of DCV for T_c dataset.

Figure 14.

Importance of variables of T_c dataset for local RF models.

Figure 15.

Predicted T_c for compounds simulated based on T_c dataset.

Tables

Table 1. Prediction results for BP dataset

Method	r²_train	MAE_train	r²_DCV	MAE_DCV
PLS	0.890	16.0	0.823	20.5
SVR	0.961	3.67	0.836	13.8
RF	0.946	12.1	0.700	27.7
DT-RF	0.971	7.60	0.847	17.2

Table 2. Prediction results for logS dataset

Method	r²_train	MAE_train	r²_DCV	MAE_DCV
PLS	0.883	0.540	0.461	0.785
SVR	0.984	0.175	0.879	0.505
RF	0.918	0.441	0.825	0.687
DT-RF	0.971	0.257	0.875	0.550

Table 3. Prediction results for T_c dataset

Method	r²_train	MAE_train	r²_DCV	MAE_DCV
PLS	0.752	11.9	-30.7	17.7
RF	0.990	1.85	0.869	7.24
DT-RF	0.923	5.14	0.842	7.96
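
The r² and MAE columns in the tables above can be read with the following sketch, which assumes the standard definitions (r² as the coefficient of determination, 1 - SS_res / SS_tot, and MAE as the mean absolute error); this page does not state the formulas, so that is an assumption. The numbers are made up for illustration and only show that, under this definition, r² can be strongly negative while MAE stays moderate, for example when one prediction is far off. It makes no claim about what caused any particular value in the tables.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values, not taken from the paper.
y_obs = [10.0, 20.0, 30.0, 40.0]
y_close = [12.0, 18.0, 33.0, 39.0]      # small errors everywhere
y_outlier = [10.0, 20.0, 30.0, 140.0]   # exact except for one large miss

r2_close = r2_score(y_obs, y_close)
mae_close = mean_absolute_error(y_obs, y_close)
r2_outlier = r2_score(y_obs, y_outlier)
mae_outlier = mean_absolute_error(y_obs, y_outlier)
```

Because r² squares each error while MAE does not, a single large miss dominates r² far more than it dominates MAE, which is why the two metrics are reported side by side.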

© 2021 Society of Computer Chemistry, Japan