Abstract
Machine learning models that predict the properties or activities of materials can
lead to the discovery of new mechanisms underlying those properties or activities.
However, methods for constructing models that exhibit both high prediction accuracy and
interpretability remain a work in progress because the prediction accuracy and
interpretability exhibit a trade-off relationship. In this study, we propose a new
model-construction method that combines a decision tree (DT) with random forests (RF), which
we call DT-RF. In DT-RF, the dataset to be analyzed is divided by a DT model,
and RF models are constructed for each subdataset. This enables global interpretation of
the data based on the DT model, while the RF models improve the prediction accuracy and
enable local interpretations. Case studies were performed using three datasets, namely,
those containing data on the boiling point of compounds, their water solubility, and the
transition temperature of inorganic superconductors. We examined the proposed method in
terms of its validity, prediction accuracy, and interpretability.
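
The two-stage idea described above (a DT model divides the data, then an RF model is fitted to each subdataset) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `DTRFSketch`, the hyperparameter values, the use of the tree's leaves as the subdatasets, and the synthetic data are all hypothetical. It relies on scikit-learn's `DecisionTreeRegressor` and `RandomForestRegressor`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor


class DTRFSketch:
    """Hypothetical DT-RF-style regressor: a shallow tree splits the data,
    and one random forest is trained on the samples in each leaf."""

    def __init__(self, max_leaf_nodes=4, min_samples_leaf=20,
                 n_estimators=100, random_state=0):
        # Assumed settings; the abstract does not specify hyperparameters.
        self.tree = DecisionTreeRegressor(
            max_leaf_nodes=max_leaf_nodes,
            min_samples_leaf=min_samples_leaf,
            random_state=random_state,
        )
        self.n_estimators = n_estimators
        self.random_state = random_state
        self.forests = {}  # leaf id -> fitted RandomForestRegressor

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.tree.fit(X, y)
        leaf_ids = self.tree.apply(X)  # leaf index reached by each sample
        for leaf in np.unique(leaf_ids):
            mask = leaf_ids == leaf
            forest = RandomForestRegressor(
                n_estimators=self.n_estimators,
                random_state=self.random_state,
            )
            forest.fit(X[mask], y[mask])  # local model for this subdataset
            self.forests[leaf] = forest
        return self

    def predict(self, X):
        X = np.asarray(X)
        leaf_ids = self.tree.apply(X)  # route each sample through the tree
        y_pred = np.empty(X.shape[0])
        for leaf, forest in self.forests.items():
            mask = leaf_ids == leaf
            if mask.any():
                y_pred[mask] = forest.predict(X[mask])
        return y_pred


# Synthetic data with two regimes separated by the sign of the first feature.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 3))
y = np.where(X[:, 0] > 0, 5.0 + 2.0 * X[:, 1], -5.0 + X[:, 2] ** 2)
y = y + rng.normal(scale=0.1, size=300)

model = DTRFSketch().fit(X, y)
y_pred = model.predict(X)
```

In this sketch, the split rules of `model.tree` give the global view of the data, while the `feature_importances_` attribute of each forest in `model.forests` offers one possible route to a local interpretation within a subdataset.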
Tables
Table 1. Prediction results for BP dataset

| Method | r² (train) | MAE (train) | r² (DCV) | MAE (DCV) |
|---|---|---|---|---|
| PLS | 0.890 | 16.0 | 0.823 | 20.5 |
| SVR | 0.961 | 3.67 | 0.836 | 13.8 |
| RF | 0.946 | 12.1 | 0.700 | 27.7 |
| DT-RF | 0.971 | 7.60 | 0.847 | 17.2 |

Table 2. Prediction results for logS dataset

| Method | r² (train) | MAE (train) | r² (DCV) | MAE (DCV) |
|---|---|---|---|---|
| PLS | 0.883 | 0.540 | 0.461 | 0.785 |
| SVR | 0.984 | 0.175 | 0.879 | 0.505 |
| RF | 0.918 | 0.441 | 0.825 | 0.687 |
| DT-RF | 0.971 | 0.257 | 0.875 | 0.550 |

Table 3. Prediction results for Tc dataset

| Method | r² (train) | MAE (train) | r² (DCV) | MAE (DCV) |
|---|---|---|---|---|
| PLS | 0.752 | 11.9 | -30.7 | 17.7 |
| RF | 0.990 | 1.85 | 0.869 | 7.24 |
| DT-RF | 0.923 | 5.14 | 0.842 | 7.96 |
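
For readers interpreting the tables, the following sketch shows how the two reported metrics behave, assuming that r² is the coefficient of determination (1 minus the ratio of the residual sum of squares to the total sum of squares) and that MAE is the mean absolute error. Under that assumed definition, r² becomes negative whenever a model predicts worse than the mean of the observed values, which is how a value such as the -30.7 listed for PLS in Table 3 can arise. The function names and the toy numbers are hypothetical and are not taken from the study.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_abs_error(y_true, y_pred):
    """Mean absolute error between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
good_pred = [1.1, 1.9, 3.2, 3.8]  # close to the observations
bad_pred = [4.0, 3.0, 2.0, 1.0]   # reversed order, worse than the mean

r2_good = r_squared(y_true, good_pred)
r2_bad = r_squared(y_true, bad_pred)   # negative: worse than predicting the mean
mae_good = mean_abs_error(y_true, good_pred)
mae_bad = mean_abs_error(y_true, bad_pred)
```

Because MAE is expressed in the units of the target variable while r² is dimensionless, a model can show a moderate MAE alongside a strongly negative r² when a few predictions deviate far from the observed values.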