Classification of the mode of action and regression modeling are reported for the environmental toxicity of phenol derivatives. In this study, the descriptors for machine learning were evaluated numerically using electronic-structure calculations.
In materials development and drug development, it is necessary to search a large number of candidates for a compound that satisfies the desired properties and activity. The number of experiments must be reduced in order to lower financial and time costs. Sequential Model-Based Optimization (SMBO) is one method for reducing the number of experiments by using machine learning-based prediction models. Conventional methods used as models are not suitable for extrapolation. However, extrapolation is required in compound discovery to achieve properties and performance that are not covered by existing data. In this study, we propose a new nonlinear regression method, Stochastic Threshold Model Trees (STMT), which is applicable to extrapolation, and apply it to SMBO to achieve efficient compound search. By applying a new acquisition function to STMT, we show that the search performance of the proposed method was better than that of the conventional methods on the dataset used for verification. We also visualized the search process of each method and confirmed that the proposed method is efficient.
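The SMBO loop described in this abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: STMT is replaced by an ordinary least-squares line (chosen only because it can extrapolate), the acquisition function is a greedy choice of the highest predicted value rather than the new acquisition function of the study, and `toy_property`, `fit_line`, and `smbo` are hypothetical names.

```python
import math

def toy_property(x):
    """Hypothetical 'experiment': the hidden property value of candidate x."""
    return 2.0 * x + math.sin(x)

def fit_line(xs, ys):
    """Ordinary least-squares line; a stand-in surrogate, not STMT."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x

def smbo(candidates, evaluate, initial, n_iter):
    """Fit the surrogate, pick the most promising untested candidate, test it, repeat."""
    observed = {x: evaluate(x) for x in initial}
    for _ in range(n_iter):
        slope, intercept = fit_line(list(observed), list(observed.values()))
        pool = [x for x in candidates if x not in observed]
        if not pool:
            break
        # Greedy acquisition (an assumption): highest predicted property value.
        chosen = max(pool, key=lambda x: slope * x + intercept)
        observed[chosen] = evaluate(chosen)
    return observed

# Start from three low-valued candidates; the best one lies outside that range.
candidates = [float(i) for i in range(21)]
history = smbo(candidates, toy_property, initial=[0.0, 1.0, 2.0], n_iter=3)
best = max(history, key=history.get)
```

Because the surrogate extrapolates the upward trend, the loop moves straight to candidates far outside the initial data, which is the behavior the abstract argues conventional interpolating models lack.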
We conducted machine learning studies of epoxy resins for structure-property relationship analyses of mechanical and thermal properties. The training sets were generated by fully atomistic molecular dynamics calculations. Since the physical properties of thermosetting resins are strongly dependent on the higher-order structure formed by crosslinking reactions, it is necessary to develop a property prediction model that takes into account not only the molecular structure of the pre-polymer but also the higher-order structure of the resin after curing. In this study, in addition to the regression analysis with molecular descriptors and vectorized molecular fingerprints of pre-polymers, higher-order structure–property relationship analysis and molecular descriptor–higher-order structure correlation analysis were carried out using topological data analysis (TDA) techniques such as persistent homology. By connecting multi-scale structures with learning models, we were able to achieve both the prediction accuracy and the understanding of the phenomena that are necessary for inverse analysis to find new materials.
Quantitative structure-activity relationship (QSAR) and quantitative structure-property relationship (QSPR) models predict biological activities and molecular properties based on the numerical relationship between chemical structures and activity (property) values. Topological information of molecular structures is usually utilized for this purpose (2D representations, 2D descriptors). However, conformational information seems important because molecules exist in three-dimensional space. As a three-dimensional molecular representation (3D descriptors) applicable to diverse compounds, the similarity between a test molecule and a set of reference molecules has been previously proposed. In this study, we introduced the 3D descriptors into QSAR/QSPR modeling (regression tasks). Furthermore, we investigated the relative merits of 3D descriptors over 2D descriptors in terms of the diversity of training and test data sets. For the prediction of quantum mechanics-based properties, the 3D descriptors were superior to the 2D descriptors. For predicting the activity of small molecules against specific biological targets, no consistent trend was observed in the performance difference between the two types of representations, irrespective of the diversity of training and test data sets.
In optimizing compound structures using machine learning, it is important not only to build highly accurate models and predict with them, but also to obtain new knowledge that can be applied to material development. So far, methods for visualizing the prediction basis of compounds using deep learning have been reported, but methods that can be applied to small data sets, which are difficult to analyze by deep learning, are also required. In this presentation, we propose a method that integrates the contribution of each substructure obtained by a machine learning model using fingerprints as descriptors, taking the spread of the substructure into account, and visualizes it on the target molecule. We also applied the proposed method to published data and obtained a prediction basis that is reasonable in comparison with known chemical findings. Furthermore, we report the results of using the proposed method for the optimization of copolymers.
In this study, we constructed two different neural network models. One is a convolutional neural network model for evaluating the quality of catalysts from a set of nanoscale images. The other is a graph convolutional neural network model for predicting the glass transition temperature (Tg) of polyesters. We further applied a gradient-based method to visualize saliency maps in order to understand which nanostructures or chemical structures affect the performance. Along this line, approaches based on integrated gradients are expected to be significantly more effective for structural characterization tasks and may save both the time and the cost required for the design and development of materials.
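As a rough illustration of the integrated-gradients idea mentioned in this abstract, the sketch below attributes the output of a small scoring function to its inputs. It is not the authors' networks: `model` is a hypothetical toy function standing in for a trained model, gradients are taken by finite differences instead of automatic differentiation, and the all-zero baseline is an assumption.

```python
import math

def model(x):
    """Hypothetical differentiable scorer standing in for a trained network."""
    return math.tanh(0.8 * x[0] - 0.5 * x[1] + 0.3 * x[0] * x[2])

def gradient(f, x, eps=1e-6):
    """Central-difference gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        up[i] += eps
        down = list(x)
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad

def integrated_gradients(f, x, baseline, steps=200):
    """Average the gradient along the straight path from baseline to x
    (midpoint rule), then scale by the input difference."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(f, point)):
            total[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

x = [1.0, 2.0, -1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(model, x, baseline)
```

A useful sanity check is the completeness property: the attributions should sum, up to numerical error, to `model(x) - model(baseline)`. In an image or molecular-graph setting, the per-input attributions would be mapped back onto pixels or atoms to draw the saliency map.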
The 2nd AMES/QSAR International Challenge Project is held for the newly established Ames test database of approximately 13,000 compounds. This study aims to provide a new predictive benchmark for this database using statistical methods (QSAR). We have developed AMES/QSAR models using a Graph Neural Network, which extracts features from molecular graphs through end-to-end learning, and machine learning models based on molecular descriptors (LightGBM, XGBoost, and Neural Network). Our modeling scheme introduced the stacking ensemble method to integrate the predictions of each model. This is motivated by the ability to combine different input representations of molecular structures and different classification algorithms to improve prediction accuracy. Both the descriptor-based machine learning models and the Graph Neural Network showed good prediction performance. Stacking these models further improved prediction accuracy. This study's findings can be used as a benchmark for AMES/QSAR models for new mutagenicity databases.
Although present-day proteins are constructed from about 20 amino acids, primitive proteins are often considered to have been composed of only a limited number of amino acids. Generally, Gly, Ala, Asp, and Val are presumed to be the components of primitive proteins. In this study, Glu was added to the tentative primitive proteins as the “fifth” amino acid, and the protein structures were evaluated. Gly, Ala, Asp, Val, and Glu were randomly sequenced, and the three-dimensional structures of the random peptides were predicted using molecular dynamics simulations. The results suggest that the tentative primitive proteins including Glu can form secondary structures more frequently than those without Glu. In addition, the structural rigidity of the random peptides including Glu was larger than that of the peptides without Glu. Thus, “protein-like” peptides can be obtained from only five types of amino acids, Gly, Ala, Asp, Val, and Glu, which could have existed on the primitive Earth.
Lithium-ion batteries have many problems as a sustainable energy resource, and the development of post-lithium-ion batteries is urgently needed. In battery design, machine learning-based screening, which has a low computational cost, has attracted much attention as an alternative to first-principles calculations. For cathode materials, a model that predicts the average voltage of intercalation cathodes from a given composition formula has been proposed. However, it has two problems. The first is that the composition formula is incompatible with detailed property evaluation by first-principles calculations. The second is that it is hard to evaluate the applicability domain of the model. In this work, we developed a model that predicts the average voltage from crystal structures and allows the reliability of the prediction to be evaluated easily. Our model overcame both problems of the previous model while maintaining accuracy. We will perform machine learning-based high-throughput screening for sodium-ion battery cathodes by using our model.
We have developed a scheme for accurately predicting energy densities based on neural network-based batch machine learning that connects electron density information with energy density. In this presentation, we extend the scheme to online machine learning, which continuously learns from a large amount of data in the electron/energy density database, in order to improve the general applicability of the method. The results suggest that the present system, based on the online version of the extreme learning machine, can predict accurate kinetic/correlation energy densities for arbitrary compounds, which has been difficult with conventional functionals.
Machine learning was applied to derive regression models for predicting hole mobility of hole-transporting phthalocyanine derivatives. In the analysis, the descriptors were numerically evaluated by using quantum chemical calculations.
In drug discovery and material design, efficient methods to search for novel molecules with desired properties are needed. The use of graph neural networks (GNNs) as quantitative structure-property relationship models enables virtual screening of candidate structures with better prediction performance than that of conventional feature extraction methods. However, previous studies have used a large amount of structure and property data for training. In molecular design, where a large amount of data is difficult to collect, GNNs may fail to predict properties accurately. In this study, we designed a GNN model called Perturbating Message Passing Neural Network (PMPNN), which is based on MPNN, to augment graph data by adding perturbations to feature vectors during the message passing operation. We compared PMPNN with MPNN on the QM9 dataset, verified the effectiveness of the proposed method, and discussed the effect of the perturbation on predictions. We also showed that the proposed method could achieve the same level of prediction performance with about half of the dataset, which suggests that it can extract features successfully even with a small amount of graph data.
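The idea of perturbing feature vectors during message passing can be sketched as follows. This is only a toy under stated assumptions, not the PMPNN architecture: the update rule is a plain neighbor sum without learned weights, the perturbation is assumed to be additive Gaussian noise on each incoming message, and `message_passing_step`, `noise_scale`, and the three-node path graph are hypothetical.

```python
import random

def message_passing_step(features, neighbors, noise_scale=0.0, rng=None):
    """One round of sum-aggregation message passing.

    With noise_scale > 0, each incoming neighbor feature is perturbed by
    Gaussian noise (an assumed form of perturbation), so repeated passes
    over the same graph yield slightly different training inputs.
    """
    rng = rng or random.Random(0)
    updated = []
    for v, h_v in enumerate(features):
        message = [0.0] * len(h_v)
        for u in neighbors[v]:
            for d, value in enumerate(features[u]):
                noise = rng.gauss(0.0, noise_scale) if noise_scale > 0 else 0.0
                message[d] += value + noise
        # Toy update: add the aggregated message to the node's own features.
        updated.append([a + b for a, b in zip(h_v, message)])
    return updated

# Three-node path graph 0-1-2 with two-dimensional node features.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
nbrs = [[1], [0, 2], [1]]
clean = message_passing_step(feats, nbrs)
noisy = message_passing_step(feats, nbrs, noise_scale=0.1, rng=random.Random(1))
```

In a training loop, the noisy variant would typically be used only during training and the clean variant at prediction time; whether PMPNN does exactly this is not stated in the abstract.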
A formal oxidation number in a transition metal complex is an important factor for estimating geometric structures, understanding the reactivity of homogeneous catalysts, elucidating the redox properties, and so on. On the other hand, a charge obtained by a quantum chemical calculation is often used to analyze an electronic state of a transition metal. However, the charge, which contains various effects, sometimes shows a different behavior from the formal oxidation number. In this study, we propose a scheme to interpret the formal oxidation number based on quantum chemical calculations. The scheme gives the same results of the formal oxidation number as the IUPAC’s definition. Furthermore, the charge is divided into three contributions: formal oxidation number, bonding, and remaining contributions. Each contribution is expected to be used for reactivity analysis and as new descriptors for machine learning in chemistry.
Estimation of synthetic accessibility is an important task in computer-aided drug design. A number of methods to predict synthetic accessibility have been reported. Most of them are based on retrosynthetic analysis, molecular complexity, and/or fragment contributions, and almost none use machine learning. We have reported a deep learning-based model to predict synthetic accessibility. Although our prediction model successfully distinguishes synthetically difficult compounds from easier ones, it cannot quantify synthetic feasibility, especially for compounds of medium synthetic accessibility. To address this issue, we first examined whether optimizing the discriminant model would improve the quantitative prediction accuracy of synthetic feasibility. The results show that the optimization improved prediction accuracy on the test sets from 99.08% to 99.32%, but the model could not distinguish compounds of medium synthetic accessibility in the validation set. Methods for interpreting the outputs of the predictive model are currently under further investigation.
Reaction kinetics simulations that adopt transition states obtained by theoretical calculations will be useful for flow reactor design and similar applications. However, when a simulation is attempted for a system that includes multiple reactions sharing multiple substances, the calculation cost becomes enormous. In this study, we propose a method to solve the reaction kinetics of complex systems as simultaneous ordinary differential equations based on Eyring’s absolute reaction rate equation. We also implemented this method in Kinerator, a reaction kinetics simulator that we are developing in Python 3. For an anhydrous reaction of benzoic acid collected in the TSDB, which is a four-step reaction, Kinerator read the activation Gibbs free energies and reaction Gibbs free energies and then simulated the change in substance concentrations over time.
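The approach outlined in this abstract, rate constants from the Eyring equation fed into coupled ordinary differential equations, can be sketched in a few lines. This is not the Kinerator code: the function names are hypothetical, every step is assumed to be reversible and first order in a simple linear chain, the transmission coefficient is taken as 1, and the free energies in the usage note are made-up values rather than TSDB data. The Eyring rate constant used is k = (k_B T / h) exp(-ΔG‡ / RT).

```python
import math

KB = 1.380649e-23    # Boltzmann constant, J/K
H = 6.62607015e-34   # Planck constant, J*s
R = 8.314462618      # gas constant, J/(mol*K)

def eyring_rate(dg_act_kj_mol, temperature=298.15):
    """Eyring rate constant in 1/s for a first-order step (transmission coefficient 1)."""
    return (KB * temperature / H) * math.exp(
        -dg_act_kj_mol * 1000.0 / (R * temperature))

def simulate_chain(dg_act, dg_rxn, c0, t_end, n_steps, temperature=298.15):
    """Integrate the chain S0 <-> S1 <-> ... <-> Sn with the classical RK4 method.

    dg_act[i]: activation Gibbs free energy of step i (kJ/mol)
    dg_rxn[i]: reaction Gibbs free energy of step i (kJ/mol)
    """
    kf = [eyring_rate(g, temperature) for g in dg_act]
    # Reverse barrier = forward barrier minus reaction free energy.
    kr = [eyring_rate(g - d, temperature) for g, d in zip(dg_act, dg_rxn)]

    def rhs(c):
        dc = [0.0] * len(c)
        for i, (f, r) in enumerate(zip(kf, kr)):
            flux = f * c[i] - r * c[i + 1]
            dc[i] -= flux
            dc[i + 1] += flux
        return dc

    h = t_end / n_steps
    c = list(c0)
    for _ in range(n_steps):
        k1 = rhs(c)
        k2 = rhs([x + 0.5 * h * k for x, k in zip(c, k1)])
        k3 = rhs([x + 0.5 * h * k for x, k in zip(c, k2)])
        k4 = rhs([x + h * k for x, k in zip(c, k3)])
        c = [x + h * (a + 2.0 * b + 2.0 * d + e) / 6.0
             for x, a, b, d, e in zip(c, k1, k2, k3, k4)]
    return c
```

For example, `simulate_chain([85.0, 90.0, 88.0, 92.0], [-10.0, -5.0, -8.0, -20.0], [1.0, 0.0, 0.0, 0.0, 0.0], 3600.0, 3600)` follows a hypothetical four-step chain for one hour. A fixed-step explicit integrator is only adequate when every rate constant times the step size is small; systems with widely separated barriers are stiff and would call for an implicit solver.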
The development of synthetic routes for functional chemicals has depended heavily on the experience and intuition of synthetic organic chemists. When the desired molecules have complex structures, there are many possible synthetic routes, and it is often difficult to determine which one should be adopted. For reactions of molecules with many substituents, we have proposed a method to locate the transition state (TS) structure of a target reaction by using TS structures of similar reactions stored in the TSDB. However, this method seldom gives the most stable TS structure among the possible conformers. That is, the stability of the optimized TSs, reactants, and products is highly dependent on the initial structures used for optimization. Therefore, this method is likely to give inadequate data for comparing calculated and measured values of other synthetic reactions. For such comparisons, we have to find the reaction mechanism with the most stable TS and the most stable molecules involved in the reaction. In this paper, we propose a method to search for the most stable reaction path and show the results of applying it to the Pinner pyrimidine reaction of 1-phenylbutane-1,3-dione and ethanimidamide.