Recently, the fragment molecular orbital (FMO) method has attracted considerable attention as an electronic structure calculation scheme applicable to macromolecular systems. A major advantage is that lists of interfragment interaction energies (IFIEs) are obtained straightforwardly from FMO calculations. IFIE-based analyses are well recognized as useful for grasping the nature of interactions in a given target system in practical applications. However, there is a severe limitation: the IFIE between covalently bonded fragments takes an abnormally large value (about -15.2 hartree) because of the fragmentation by the so-called bond detached atom technique, which degrades the usability of FMO calculations in several cases. In our previous paper (J. Comput. Aided Chem. 18, 143 (2017)), we examined a correction method to solve this problem, based on fictitious dissociation processes. In this paper, we propose a more realistic model with a radical dissociation correction scheme.
Fluorescent substances are used in a wide range of applications, and a method that effectively designs molecules with desirable absorption and emission wavelengths is required. In this study, we used boron-dipyrromethene (BODIPY) compounds as a case study and constructed a high-precision wavelength prediction model using ensemble learning. Prediction accuracy improved in a stacking model using RDKit descriptors and Morgan fingerprints. The variables related to the molecular skeleton and the conjugation length were shown to be important. We also proposed an applicability domain (AD) estimation model that directly uses the descriptors, based on Tanimoto distance. The performance of the AD models was shown to be better than that of the OCSVM-based model. Using our proposed stacking model and AD model, we screened newly generated compounds and obtained 602 compounds that were estimated to be inside the AD for both absorption wavelength and emission wavelength.
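As an illustration of a Tanimoto-distance-based applicability domain, the following is a minimal sketch and not the authors' model. It assumes that each compound is represented by the set of "on" bit indices of a binary fingerprint, and that a query counts as inside the AD when its mean Tanimoto distance to the k nearest training compounds does not exceed a threshold. The function names, k = 3, and the threshold of 0.5 are all hypothetical choices made for this example.

```python
from typing import FrozenSet, Sequence

# A fingerprint is modeled as the set of indices of its "on" bits (assumption).
Fingerprint = FrozenSet[int]


def tanimoto_distance(a: Fingerprint, b: Fingerprint) -> float:
    """Return 1 - |a & b| / |a | b|; two empty fingerprints count as identical."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def mean_knn_distance(query: Fingerprint,
                      training: Sequence[Fingerprint],
                      k: int) -> float:
    """Mean Tanimoto distance from the query to its k nearest training compounds."""
    if not training or k < 1:
        raise ValueError("need at least one training fingerprint and k >= 1")
    nearest = sorted(tanimoto_distance(query, t) for t in training)[:k]
    return sum(nearest) / len(nearest)


def inside_ad(query: Fingerprint,
              training: Sequence[Fingerprint],
              k: int = 3,
              threshold: float = 0.5) -> bool:
    """Hypothetical AD rule: inside when the mean k-NN distance is <= threshold."""
    return mean_knn_distance(query, training, k) <= threshold


# Toy fingerprints, purely illustrative (not real compounds).
training_set = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({2, 3, 4, 6}),
]
similar_query = frozenset({1, 2, 3, 4})
dissimilar_query = frozenset({10, 11, 12})
```

With these toy sets, `inside_ad(similar_query, training_set)` evaluates to true and `inside_ad(dissimilar_query, training_set)` to false. In practice the threshold would have to be tuned on held-out data.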
Using a semi-empirical molecular orbital method, we have studied the efficacy of magic numbers in predicting the emergence of Y-aromaticity in specific compounds having a trifurcated structure, which we simply referred to as “n-Tridentene” in our previous report. In this study, we obtained computational results suggesting that the HOMO–LUMO gap, which relates to kinetic stabilization, tends to increase in n-Tridentene ions containing the same number of π electrons as the magic number. Furthermore, energetic and reaction kinetics considerations suggest the possible development of Y-aromaticity in the 6-Tridentene anion.
Prof. Kimito Funatsu received the Honor Award in Division of Chemoinformatics, the Chemical Society of Japan at the 42nd Annual Meeting of Chemoinformatics held on Nov. 28th 2019. The award recognizes his significant contributions to the development of the chemoinformatics discipline in the world as well as in Japan. His research efforts extend over multiple domains such as (i) system development including elucidation of chemical structures and prediction of organic reactions, (ii) quantitative structure activity relationship (QSAR), (iii) quantitative structure property relationship (QSPR), and (iv) international collaborations in chemoinformatics. In the present review, we focus on chemoinformatics in the world as well as in Japan based on “Special issue dedicating to Honor Award: Prof. Kimito Funatsu”, which consists of five invited papers by world-famous, distinguished foreign researchers and six papers from domestic researchers. Taking these papers into consideration, we discuss the significance of the Honor Award presented to Prof. Kimito Funatsu.
On the occasion of honoring Kimito Funatsu with the 2019 Herman Skolnik Award, aspects of similarity, diversity and complexity are mentioned in relation to chemoinformatics, chemometrics, Japan, and personal encounters with the awardee.
The achievements of Professor Kimito Funatsu for the development of chemoinformatics in Japan are briefly summarized. Furthermore, some aspects of the collaboration of this author with Kimito Funatsu are discussed.
Computer-assisted de novo drug design has been a central research topic in the field of chemoinformatics for approximately 30 years. Professor Kimito Funatsu’s research has been a formative component in these developments. His seminal work has contributed inverse quantitative-structure-activity relationship (QSAR) models for small molecule and peptide design. This article highlights a class of recurrent neural networks, so-called long short-term memory (LSTM) networks for generative molecular design, which further the conceptual approach of inverse QSAR. We review the LSTM method for molecular design along with selected practical applications.
Multi-target activity (promiscuity) of small molecules provides the basis of drug polypharmacology. Computationally, promiscuity can be explored through systematic analysis of compound activity data. Inhibitors of the human kinome represent an instructive example.
Information on the transition states of similar reactions is the key to locating those of unknown reactions. To utilize this feature, we are constructing a database called QMRDB, which gathers results of quantum mechanical calculations for elementary reactions as well as for related molecules. Another database (TSDB) stores information on name reactions in organic synthesis. Retrieval results from these databases are used to analyze reaction mechanisms that have not been experimentally examined. We developed a cloud system that manages both databases and the theoretical calculations. The present paper summarizes the TSDB cloud system and describes how to use it to perform in silico screening for synthesizing drug candidates.
Finding direct correlations between electronic structures of molecules and their properties, which we call “electronic-structure informatics”, is one of the challenging issues in chemoinformatics because the electronic degree of freedom is an essential factor determining the chemical characteristics. Herein we develop computational methods to automatically draw two types of orbital correlation diagrams. They are expected to be useful for machine learning that includes electronic degrees of freedom. In the present approach, we focus on an electronic similarity called orbital similarity, whose score is defined as the spatial overlap between two molecular orbitals (MOs) enclosed by their iso-value surfaces. The similarity scores are also used to derive another orbital correlation diagram called the “orbital interaction diagram”. This diagram relates the MOs of a target molecule to those of its fragments. Through applications to benzene derivatives, these diagrams are shown to be reasonable, indicating the potential usefulness of the present method in machine learning for quantitative predictions of molecular properties and chemical reactivities.
Fatty acid synthase (FASN) inhibitors are known to work as anti-cancer drugs. To find important factors in their structure-activity relationships and to derive a predictive model for the activity, we attempted to develop regression models using descriptors that represent chemical reactivities and intermolecular interactions. By employing descriptors calculated with electronic-structure theory, regression models for the experimental IC50 values were derived. Good correlations between the predicted and experimental values were obtained for natural products having inhibitory activity against FASN. The obtained models are expected to be useful for the systematic search for more efficient inhibitors. At the same time, the present results justify the use of the newly suggested descriptors evaluated in electronic-structure calculations.
A number of studies have investigated the relations between the structures and activities of metabolites. It has been proposed that structural similarity between metabolites implies similarity in their activities. In light of this, we propose a method for predicting the activity of secondary metabolites based on an association philosophy. First, we determined the structural similarity scores between targeted metabolite pairs using the COMPLIG algorithm. To increase the likelihood of obtaining clusters rich in known metabolites, we calculated the structural similarity between metabolite pairs in which the activity of at least one metabolite is known, and then selected the pairs whose similarity score is higher than a threshold (s > 0.95). The network of such metabolite pairs was then clustered using the DPClusO algorithm. Statistically significant cluster-activity pairs were selected using the hypergeometric test. Finally, the biological activities of unannotated metabolites were predicted from the activities of the metabolites included in the statistically overrepresented clusters.
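The hypergeometric test used to select cluster-activity pairs can be sketched as follows. This is a generic one-sided enrichment test under stated assumptions, not the authors' code. The background is taken to be all annotated metabolites in the network, and the function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(population: int,
                                with_activity: int,
                                cluster_size: int,
                                cluster_with_activity: int) -> float:
    """One-sided hypergeometric p-value P(X >= cluster_with_activity).

    population            -- annotated metabolites in the whole network
    with_activity         -- how many of them carry the activity of interest
    cluster_size          -- annotated metabolites in the cluster
    cluster_with_activity -- how many of those carry the activity
    """
    if not (0 <= with_activity <= population and 0 <= cluster_size <= population):
        raise ValueError("counts must satisfy 0 <= count <= population")
    # X cannot exceed either the cluster size or the number of active metabolites.
    upper = min(with_activity, cluster_size)
    total = comb(population, cluster_size)
    tail = sum(
        comb(with_activity, i) * comb(population - with_activity, cluster_size - i)
        for i in range(cluster_with_activity, upper + 1)
    )
    return tail / total
```

For example, with 10 annotated metabolites of which 3 have a given activity, a cluster of 4 that contains 2 of them gives a p-value of 1/3, so the pair would not be called significant at any usual level. The significance cutoff and any multiple-testing correction are left open here because the text does not specify them.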
In polymer material development, several physical and chemical properties often need to be optimized simultaneously. However, there is no established method for predicting different properties of polymers with the same approach. In this study, property values of various polymers were collected from the literature, and their relationships were examined by hierarchical clustering. PLSR models were constructed to predict density, glass transition temperature, and dissolution parameter using descriptors obtained from the structure information of the monomer unit. The R2 values of the models ranged from 0.88 to 0.97. These results indicate that an informatics approach makes it possible to predict different polymer properties in a similar way.
Pesticides are considered a vital component of modern farming, playing a major role in maintaining high agricultural productivity. Pesticide recovery rates in vegetables and fruits determined using GC/MS depend on various factors, including the matrix effect and chemical interactions between pesticides and other compounds present in crops. In this study, the recovery rate of a pesticide is defined as the ratio of the peak area of 50 ppb spiked in a crop sample to that in the solvent standard calibration curve. Estimating the recovery rates of pesticides in crops enables precise evaluation of their contents in the crops. We built regression models of the recovery rates based on molecular descriptors using the R packages rcdk and caret. The chemical structure of each of 248 pesticides was converted to 174 molecular descriptors; then, for 7 crops, we created 69 ordinary and 20 ensemble learning regression models for estimating the recovery rates from the molecular descriptors using caret. Two machine learning regression methods, mSBC and xgbLinear, performed best in terms of prediction rates and execution times. In these two regression models, the recovery rates of pesticides are predicted within the local distribution of chemical properties derived from the 174 molecular descriptors. We conclude that pesticides that are closely related in chemical space also have very similar recovery rates.
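The recovery-rate definition above is a simple ratio, which the following minimal sketch makes explicit. The function name and the example peak areas are hypothetical and are not data from the study. The study itself used R, so Python here is only an illustrative choice.

```python
def recovery_rate(peak_area_in_crop: float, peak_area_in_solvent: float) -> float:
    """Ratio of the peak area of the spiked crop sample to the peak area
    read from the solvent standard calibration curve at the same level."""
    if peak_area_in_solvent <= 0:
        raise ValueError("solvent-standard peak area must be positive")
    return peak_area_in_crop / peak_area_in_solvent
```

For instance, a hypothetical crop-sample peak area of 80 against a solvent-standard peak area of 100 gives a recovery rate of 0.8. Values above 1 would indicate signal enhancement by the matrix.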
International interest in developing in silico methods for predicting the ecotoxicity of chemical substances has made this a hot topic in recent years. The most common conventional method classifies chemical substances based on their structural information and then predicts ecotoxicity empirically by linear regression on Log Kow for chemicals with a similar structure. Nevertheless, it is challenging to predict the ecotoxicity of inorganic and ionized chemical substances with multiple functional groups. To overcome these problems, we previously developed an in silico prediction method that applies machine learning to the fingerprints of chemicals with known ecotoxicity test data from AIST-MeRAM. That method provides better prediction accuracy than conventional methods for a broader range of chemical substances, including inorganic and ionized compounds. To further improve and explain the prediction ability for inorganic and ionized chemical substances, this study investigated the contribution of structural features to the prediction of fish acute ecotoxicity with supervised machine learning using two kinds of target variables. We found that the ecotoxicity of metal compounds was mainly predicted based on their hydrophilicity, which is structurally related to the numbers of oxygen atoms, benzene rings, and methyl groups. Moreover, the prediction accuracy of this method proved to be better than that of our previous method.
In silico methods for predicting the ecotoxicity of chemical substances have attracted attention as a way to reduce animal testing. The most common models classify chemical substances empirically based on functional groups and then predict ecotoxicity by linear regression using a descriptor of the chemical substance such as Log Kow. However, this conventional method outputs duplicate results for a single chemical substance when it has multiple functional groups. Moreover, it is not appropriate for predicting the ecotoxicity of metal compounds. To overcome these challenges, this study developed a new fingerprint as a feature set for machine learning and a new supervised machine learning model for predicting chronic ecotoxicity to fish. The new fingerprint extracts features of a chemical substance by judging the presence of structures that contribute to ecotoxicity, such as carbamate insecticides, organophosphorus pesticides, organic halogens, various metal elements, and hexavalent chromium. We compared the accuracy of various machine learning models in predicting chronic ecotoxicity to fish by 10-fold cross-validation, using this new fingerprint, general fingerprints, and descriptors together as a feature set. As a result, our method with the stacking ensemble was the most accurate in this study. This method improved accuracy by using the results of multiple machine learning algorithms as part of the feature set. The results of the benchmark test show that the prediction accuracy of this method was better than that of conventional methods.
Batch and semi-batch processes are widely used in industrial chemical plants. Soft-sensor models can be employed to monitor such processes efficiently. Many previously proposed soft-sensor models assume that objective variable values for model construction are available at any time during process operation. However, in many chemical plants, it is difficult to sample product from the ongoing process because of extreme reaction conditions such as high pressure and temperature. Therefore, it is important to understand the relationship between the predictability of a time-series soft-sensor model and the number of sampling points. In the present work, we clarified this relationship using simulation datasets, which can be easily reproduced. When sampling points were scarce, a data augmentation strategy was also found to be effective. Soft-sensor models can be built effectively using sampling points in the early phase of the process. These findings were applied to build a soft-sensor model of an industrial semi-batch process.