Terpenoids, phenylpropanoids, and polyketides make up the majority of secondary metabolites that contain carbon, hydrogen, and oxygen. In this work, 19,769 metabolites accumulated in KNApSAcK Core DB were classified into 71 subgroups within three major groups (terpenoids, phenylpropanoids, and polyketides) according to the scientific literature. We represented the metabolites as molecular fingerprints that include chemical properties, and used those descriptors for classification with a random forest model. We found that both training and test metabolites were well classified into the subgroups, with 94.06% and 94.23% accuracy, respectively. Although classifying metabolites by metabolic pathway is very time-consuming when done manually, machine learning with molecular fingerprints made the classification attainable. This work will shed light on the systematic and evolutionary understanding of diverse secondary metabolites based on secondary metabolic pathways. Data science is an interdisciplinary and applied field that uses techniques and theories drawn from statistics, mathematics, computer science, and information science. By combining these resources, data science enables the extraction of meaningful and practical insights into secondary metabolites.
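The fingerprint-plus-random-forest workflow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the fingerprints are synthetic 32-bit vectors rather than descriptors computed from KNApSAcK structures, only the three major groups are used as labels instead of the 71 subgroups, and the forest is a small from-scratch version written with the Python standard library. All function names and parameters (`make_fingerprint`, `fit_forest`, `n_trees`, and so on) are assumptions made for illustration.

```python
import random
from collections import Counter

# ASSUMPTION: every name and all data below are illustrative, not from the paper.
CLASSES = ["terpenoid", "phenylpropanoid", "polyketide"]
N_BITS = 32  # toy fingerprint length; bits 24-31 carry no class signal


def make_fingerprint(class_index, rng):
    """Synthetic fingerprint: bits in the class's own 8-bit block are usually on."""
    block = range(class_index * 8, class_index * 8 + 8)
    return [int(rng.random() < (0.9 if bit in block else 0.1)) for bit in range(N_BITS)]


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def build_tree(X, y, rng, n_feats, depth=0, max_depth=8):
    """Grow one tree on binary bits, testing a random subset of bits at each node."""
    if depth >= max_depth or len(set(y)) == 1:
        return majority(y)  # leaf node: a class label
    best = None  # (weighted impurity, bit index)
    for j in rng.sample(range(N_BITS), n_feats):
        left = [lab for row, lab in zip(X, y) if row[j] == 0]
        right = [lab for row, lab in zip(X, y) if row[j] == 1]
        if not left or not right:
            continue  # this bit does not split the node
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or score < best[0]:
            best = (score, j)
    if best is None:
        return majority(y)
    j = best[1]
    children = []
    for bit in (0, 1):
        idx = [i for i, row in enumerate(X) if row[j] == bit]
        children.append(build_tree([X[i] for i in idx], [y[i] for i in idx],
                                   rng, n_feats, depth + 1, max_depth))
    return (j, children[0], children[1])  # internal node


def predict_tree(node, row):
    """Follow bit tests from the root until a leaf label is reached."""
    while isinstance(node, tuple):
        j, left, right = node
        node = right if row[j] else left
    return node


def fit_forest(X, y, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n_feats = max(1, int(N_BITS ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], rng, n_feats))
    return forest


def predict(forest, row):
    """Majority vote over all trees."""
    return majority([predict_tree(tree, row) for tree in forest])


# Build a labelled synthetic data set, hold out part of it, and measure accuracy.
data_rng = random.Random(42)
samples = [(make_fingerprint(k, data_rng), name)
           for k, name in enumerate(CLASSES) for _ in range(80)]
data_rng.shuffle(samples)
train, test = samples[:180], samples[180:]
forest = fit_forest([x for x, _ in train], [lab for _, lab in train])
accuracy = sum(predict(forest, x) == lab for x, lab in test) / len(test)
print(f"held-out accuracy: {accuracy:.3f}")
```

Reporting accuracy on both the training set and a held-out test set, as the abstract does, is what shows whether the model generalizes rather than memorizes. A real pipeline would compute fingerprints from chemical structures with a cheminformatics toolkit and would more likely use an established random forest library than a hand-written one.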