The half-parent type diazomethane (EMind)CHN2, which bears a bulky steric protecting group (Rind: octa-R-substituted s-hydrindacene; EMind), is expected to be useful for the synthesis of a silyne having a carbon-silicon triple bond. Since diazomethanes generate highly reactive carbenes upon photolysis, controlling the carbene reaction is the key to synthesizing the silyne. In this study, we investigated the photoreactions of the half-parent type diazomethane using density functional theory (DFT) and time-dependent density functional theory (TD-DFT). First, we used phenyldiazomethane (PhCHN2) as a model molecule and searched for reaction paths on the ground state using the artificial force induced reaction (AFIR) method, which has been proposed as an automatic reaction path search method. Next, we analyzed the molecular orbitals involved in the photoexcitation of diazomethane using the TD-DFT method. We also introduce a method for searching for crossings of two potential energy surfaces, such as conical intersections (CI) and intersystem crossings (ISC).
We constructed a machine-learned electronic correlation model to develop a method for evaluating electronic correlation energy with high accuracy and low computational cost. The model is constructed using the correlation energy density at the complete basis-set (CBS) limit of coupled cluster theory as the objective variable. The grid-based energy density analysis and the composite method for evaluating the correlation energy at the CBS limit, both proposed in our group, were applied to obtain the objective variable. As descriptors, density variables such as the electron density and the density gradient were used, in the same way as in correlation functionals of density functional theory (DFT). A multi-layer neural network was adopted as the machine learning method. Numerical assessments showed that our correlation model is capable of reproducing the accurate electron correlation energy with a relatively small basis set. Furthermore, reaction energies of chemical reactions were calculated by combining the model with the CBS limit of the Hartree-Fock energy, yielding better accuracy than DFT calculations with a large number of exchange-correlation functionals.
A MOF crystal can be regarded as an integrated structure of inorganic nanoclusters. By utilizing such MOF materials, it is possible to achieve a highly self-organized structure that is difficult to obtain by simple assembly of inorganic nanomaterials, and excellent optical properties and carrier mobility can be expected. In this presentation, we report the electronic properties of MOFs that have high-dimensional cluster structures in the framework. In addition, we propose a method based on machine learning to improve the accuracy of predicting the synthesis conditions of MOFs. In this work, we explored the synthesis conditions of MOFs containing sulfide-metal bonds with high-throughput screening systems. We attempted to synthesize a MOF composed of trithiocyanuric acid as the sulfide-containing ligand and Ag ion as the metal. The relationship between the synthesis conditions and the obtained X-ray diffraction patterns was estimated by decision tree analysis. In addition, we succeeded in determining the crystal structures of three novel MOFs.
Designing novel functional polymer resins by experiment costs time and money, so design by chemoinformatics is desirable. Ingredients for novel polymers are, however, generally hard to find on the market, which results in small datasets. In this study, a linear model with polymer descriptors obtained as linear combinations of monomer descriptors and compositions was used to model a small industrial dataset. The monomers were expressed with Morgan fingerprints with radii 0-3, and a PLS model was employed. The main feature of the model is that it can estimate the contribution of each monomer to the objective variable by setting the mole fraction of that monomer to one. After model construction, optimization of the monomer compositions and structure generation were carried out. A novel combination of monomers obtained in the optimization step gave the most desirable value of the objective variable, and the optimization results were verified by experiments. In the structure generation step, the regression model with the fragment descriptors was combined with mol2vec, a sort of auto-encoder in the deep learning field. With both the linear model and mol2vec, structure generation provided a large number of structures that are possibly effective for polymer resins with the desired objective variable.
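The composition-weighted descriptor and the "mole fraction set to one" reading of monomer contributions can be illustrated with a minimal sketch. The descriptor vectors, coefficients, and intercept below are hypothetical toy values, and a plain linear model stands in for the fitted PLS model.

```python
from typing import Dict, List

def polymer_descriptor(monomer_desc: Dict[str, List[float]],
                       mole_fraction: Dict[str, float]) -> List[float]:
    """Composition-weighted sum of monomer descriptor vectors."""
    n = len(next(iter(monomer_desc.values())))
    out = [0.0] * n
    for name, frac in mole_fraction.items():
        for i, value in enumerate(monomer_desc[name]):
            out[i] += frac * value
    return out

def predict(coef: List[float], intercept: float, x: List[float]) -> float:
    """Linear model: intercept plus the dot product of coefficients and descriptors."""
    return intercept + sum(c * v for c, v in zip(coef, x))

# Hypothetical toy data (not from the study).
monomer_desc = {"A": [1.0, 0.0, 2.0], "B": [0.0, 3.0, 1.0]}
coef = [0.5, -1.0, 2.0]
intercept = 10.0

# Prediction for a 25:75 copolymer of A and B.
x = polymer_descriptor(monomer_desc, {"A": 0.25, "B": 0.75})
y = predict(coef, intercept, x)

# Contribution of each monomer: prediction with its mole fraction set to one.
contribution = {
    m: predict(coef, intercept, polymer_descriptor(monomer_desc, {m: 1.0}))
    for m in monomer_desc
}
```

Because the model is linear and the mole fractions sum to one, the copolymer prediction equals the mole-fraction-weighted average of the single-monomer contributions.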
This presentation introduces a chemical structure search method using SMILES with regular expression extension. As regular expressions dramatically improve the convenience of string search, a regular expression extension for SMILES can improve the convenience of chemical structure search.
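As a point of comparison, the sketch below applies ordinary Python regular expressions directly to SMILES strings. This is not the extension introduced in the presentation; the pattern and the example SMILES are assumptions chosen only to show what string-level matching can and cannot do.

```python
import re

# Three or more consecutive aliphatic carbon symbols; "C(?!l)" avoids
# counting the "C" of chlorine ("Cl") as a carbon atom.
CHAIN = re.compile(r"(?:C(?!l)){3,}")

def find_matches(pattern, smiles_list):
    """Return the SMILES strings in which the pattern occurs."""
    return [s for s in smiles_list if pattern.search(s)]

smiles = ["CCCCO", "c1ccccc1O", "CCl", "CC(=O)O"]
hits = find_matches(CHAIN, smiles)
```

Plain regular expressions see only the character sequence, so branches, ring closures, and alternative SMILES spellings of the same molecule are not handled, which is the kind of gap a SMILES-aware extension would need to close.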
An activity cliff (AC) is formed by a pair of structurally similar compounds with a large difference in biological potency. Successful AC prediction leads to efficient exploration of lead compounds in medicinal chemistry. With previous ligand-based approaches to AC prediction, it was found that similarity values between pairs of compounds having the same core but different attachment points were overestimated. Furthermore, if the compounds in the training data set did not share a core with those in the test data set, the AC prediction accuracy decreased. In the present study, we investigated whether ACs can be accurately predicted in these difficult situations. We propose a novel AC prediction scheme that takes the attachment points of cores into account. The proposed scheme was applied to AC prediction in several activity classes, and the prediction accuracy was confirmed to improve over the previously proposed scheme.
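The definition of an activity cliff can be made concrete with a minimal sketch. The set-based fingerprints, the similarity threshold of 0.6, and the potency-difference threshold of 2 log units are illustrative assumptions, not the criteria used in the study.

```python
from itertools import combinations

def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity of two fingerprints given as sets of feature ids."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def activity_cliffs(compounds, sim_min=0.6, dpot_min=2.0):
    """Return name pairs that are similar yet differ strongly in potency."""
    cliffs = []
    for (n1, fp1, p1), (n2, fp2, p2) in combinations(compounds, 2):
        if tanimoto(fp1, fp2) >= sim_min and abs(p1 - p2) >= dpot_min:
            cliffs.append((n1, n2))
    return cliffs

# Hypothetical compounds: (name, fingerprint features, potency as pKi).
compounds = [
    ("c1", {1, 2, 3, 4, 5}, 8.5),
    ("c2", {1, 2, 3, 4, 6}, 5.9),
    ("c3", {1, 2, 3, 4, 5, 7}, 8.1),
    ("c4", {10, 11, 12}, 4.0),
]
cliffs = activity_cliffs(compounds)
```

A whole-molecule similarity of this kind does not distinguish where substituents attach to a shared core, which is the overestimation problem the abstract describes.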
The purpose of this study was to investigate the correlation between electronic descriptors and antibacterial activity, and to predict molecules with high antibacterial activity. First, the electronic structures of 27 molecules with reported MIC values against E. coli were calculated, and 11 electronic descriptors selected on the basis of the mode of action were obtained. Multiple regression analysis was performed using this descriptor set. In the analysis, descriptors with a p-value larger than 0.05 were eliminated in a stepwise manner, leaving four descriptors in the end. The coefficient of determination of the regression model built with the selected descriptors was 0.763, indicating a reasonably good correlation between the predicted and experimental MIC values. The regression model was then used to predict the activity of 147 natural products. The molecules selected in this procedure were found to be reasonable, since they have been reported to have high antibacterial activity.
Recently, machine learning approaches that extract physical features from molecular structures have been applied in various settings to classify molecular structural similarity and to predict biochemical activities, such as ligand activity against target proteins. In this study, we introduce an application that trains molecular feature extraction more efficiently using a molecular graph convolutional neural network (MGCNN), which applies a deep learning model to chemical molecules.
This work proposes a unified approach to predicting glass transition temperatures (Tg) of polymers through a machine-learning-based QSPR (Quantitative Structure–Property Relationship) study. Our approach encompasses all three scenarios: linear homo- and heteropolymers, plus reticulated heteropolymers, by generating descriptors of the reagents undergoing polymerization. Three predictive SVR (Support Vector Regression) models generated from ISIDA (In Silico design and Data Analysis) descriptors are discussed here. In 12-times-repeated 3-fold cross-validation challenges, the approach reached its highest accuracy of Q2 = 0.920 and RMSE = 34.3 K over the training set of 270 polymers, and R2 = 0.779 and RMSE = 35.9 K for an external test set of 119 polymers. GTM (Generative Topographic Mapping) analysis produced a 2D map of “polymer chemical space”, highlighting the various classes of polymers included in the study and their relationship with respect to Tg values.
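The validation protocol (12 repetitions of 3-fold cross-validation on 270 polymers) can be sketched with the standard library alone. The splitting scheme, the seed, and the Q2 and RMSE helpers below follow the usual textbook definitions and are assumptions about, not a reproduction of, the authors' workflow.

```python
import math
import random

def repeated_kfold(n_samples, n_splits=3, n_repeats=12, seed=0):
    """Yield (train, test) index lists for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[i::n_splits] for i in range(n_splits)]
        for k in range(n_splits):
            test = folds[k]
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, test

def q2(y_true, y_cv_pred):
    """Cross-validated determination coefficient: 1 - PRESS / total sum of squares."""
    mean = sum(y_true) / len(y_true)
    press = sum((t - p) ** 2 for t, p in zip(y_true, y_cv_pred))
    total = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - press / total

def rmse(y_true, y_pred):
    """Root-mean-square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

splits = list(repeated_kfold(270))
```

Each of the 36 splits trains on 180 polymers and tests on 90; a regression model of choice would be fitted inside the loop and its out-of-fold predictions passed to `q2` and `rmse`.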
Iterative screening surveys a small set of (virtual) compounds, whose property values are determined by experiments and used as feedback for updating quantitative structure-property (activity) relationship models. This cycle is repeated until compounds exhibiting the desired or better property (activity) values are identified. In the present work, we conducted a series of virtual experiments to assess the characteristics of different iterative screening methods using compounds from the ZINC and ChEMBL databases. Overall, batch-based Bayesian optimization with a Gaussian process, which imposes a penalty on the acquisition function for compounds proximal to those already sampled in a batch, performed better in terms of the number of iterations needed to identify one of the goal compounds. Linear regression models that do not take the applicability domain of the regression model into account also worked consistently for properties for which a key factor was present in the set of molecular descriptors.
Applications of machine learning methods to chemistry and materials science have been attracting much attention in recent studies. In these fields, only a limited number of experimental data are usually available for supervised machine learning (SML). Herein, we performed a model study on the accuracy of SML in obtaining regression models for the electron-transfer rate from a small set of reference data. The model data were prepared by applying the Marcus theory of electron transfer. Three parameters in the formula that reflect the characteristics of the reaction substrate were generated using random numbers, and training and test data sets of 1000 points were created. Arbitrary numbers of points were chosen from the training set, 0-30% error was added, and the performance was compared by performing prediction using support vector regression (SVR). As a result, when there was no error in the training data, at least 30 data points were required for the R2 value of the test set to reach 0.8 or more.
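The data-generation step can be sketched with the standard nonadiabatic Marcus rate expression, k = (2π/ħ) |H_AB|² (4πλk_BT)^(-1/2) exp(-(ΔG° + λ)² / (4λk_BT)). The abstract does not name its three parameters; taking them to be the electronic coupling H_AB, the reorganization energy λ, and the driving force ΔG°, along with the sampling ranges and the uniform multiplicative noise, is an assumption made for illustration.

```python
import math
import random

HBAR = 1.054571817e-34   # reduced Planck constant, J s
KB = 1.380649e-23        # Boltzmann constant, J/K
EV = 1.602176634e-19     # joules per electronvolt

def marcus_rate(h_ab_ev, lam_ev, dg_ev, temp_k=298.15):
    """Nonadiabatic Marcus electron-transfer rate in 1/s (inputs in eV)."""
    h_ab, lam, dg = h_ab_ev * EV, lam_ev * EV, dg_ev * EV
    prefactor = (2.0 * math.pi / HBAR) * h_ab ** 2
    density = 1.0 / math.sqrt(4.0 * math.pi * lam * KB * temp_k)
    return prefactor * density * math.exp(-(dg + lam) ** 2 / (4.0 * lam * KB * temp_k))

def make_dataset(n, noise=0.0, seed=0):
    """Random (H_AB, lambda, dG, k) rows; noise is a relative error bound."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        h_ab = rng.uniform(0.001, 0.05)   # assumed range, eV
        lam = rng.uniform(0.5, 1.5)       # assumed range, eV
        dg = rng.uniform(-2.0, 0.0)       # assumed range, eV
        k = marcus_rate(h_ab, lam, dg) * (1.0 + rng.uniform(-noise, noise))
        rows.append((h_ab, lam, dg, k))
    return rows

data = make_dataset(30, noise=0.3)
```

The rate peaks at ΔG° = -λ and falls off symmetrically on both sides, including the inverted region, which is the nonlinearity a regression model has to learn from few points.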
Ferulic acid is known to have strong antioxidant properties. In the present study, we investigated the electronic structures of ferulic acid and of the radical formed by abstracting the hydrogen atom from its phenolic hydroxyl group. Using several machine learning methods, we discuss the relation of the results to the radical scavenging activity measured with the DPPH reagent by Sakamoto et al.
In recent years, functional organic molecules have been actively developed. The use of structure generators is one way to develop such molecules efficiently. Here we present a novel algorithm to diversify the structures generated by the DAECS structure generator, which was previously developed to generate structures having target properties. Two rules for structural transformation, bond contraction and ring merge, were newly added. The new algorithm, which restricts the search area and subsequently clusters structures on a 2-dimensional map produced by generative topographic mapping, was implemented for the selection of seed structures. To evaluate the proposed method, we generated ligand structures for the histamine H1 receptor. In this experiment, we observed that the proposed method improved diversity. The results also suggest that new structures may be obtained by the proposed method.
To differentiate between various Oh skeletons (octahedron, cube, cuboctahedron, truncated octahedron, and truncated hexahedron), the newly developed combined-permutation representations are used under the GAP system, where their mark tables and USCI-CF tables are calculated. Because these tables differ from each other, they are standardized by using newly developed GAP functions to generate the standard mark table and the standard USCI-CF table. Symmetry-itemized enumeration based on these Oh skeletons is then conducted by applying Fujita's USCI approach (S. Fujita, Symmetry and Combinatorial Enumeration in Chemistry, Springer-Verlag, Berlin-Heidelberg, 1991).
Using machine learning to improve the efficiency of discovering new compounds and searching for synthetic pathways has received attention, and there is great demand for automatically extracting chemical information from documents. Named entity recognition, the task of detecting chemical entities in documents, is a fundamental and important process for chemical information extraction. The BiLSTM-CRF model has been widely used for named entity recognition. The input of the model is a sequence of words, which are converted to vectors called distributed representations. Recently, it has been reported that contextualized distributed representations improve the performance of neural models for named entity recognition. In this paper, to apply these approaches to the chemical informatics domain, we employ a contextualized word representation combined with the BiLSTM-CRF model, and our method achieved state-of-the-art performance in the chemical named entity recognition task.
There are various types of adhesives made from epoxy resin or the like. Different physical properties are manifested depending on the materials and their blending amounts. Conventionally, the blending amounts have been optimized to achieve the desired physical properties based on the knowledge and intuition of researchers. The effects of each material on the physical properties have not been fully revealed, and a number of experiments are necessary before a new adhesive is invented. To solve this problem, machine learning models that predict the target physical properties from the composition of the adhesive were built. In this presentation, we describe the results of building models for glass transition temperature, moisture permeability, and adhesive strength, which are examples of physical properties of interest for adhesives. We also report the results of experiments that verify compositions predicted to exhibit the desired physical properties, selected from a large number of generated candidate compositions.
Along with advances in methods for manipulating the atomistic and spectroscopic characteristics of materials, designing materials with machine learning algorithms has become increasingly common in recent years. Such design is formulated as black-box optimization, which is generally a difficult problem: its difficulty grows exponentially with the number of variables and severely hampers classical search algorithms. We combine a regression model called the factorization machine with quantum annealing to propose a new quantum-classical hybrid algorithm and show how it can be incorporated into automated materials discovery. Quantum annealing greatly reduces the time needed for selection from the massive number of candidates. As a proof-of-principle work, we used the algorithm together with RCWA, an analytical method in computational electromagnetics, to design a wavelength-selective radiator. The resulting material showed much better concordance with the thermal atmospheric transparency window than existing human-designed alternatives. This result points to further use of quantum annealing in real-world design problems.
Recently, the prediction of material properties and the search for crystal structures with machine learning have been extensively explored. This study addresses the generation of crystal structures with generative models. So far, an algorithm called CrystalGAN has been proposed. This algorithm generates crystal structures of the form "A-H-B" (A, B: metal, H: hydrogen) with DiscoGAN, a generative model across different domains. CrystalGAN is a simple algorithm for generating crystal structures. However, since it builds a feature by combining the lattice vectors and the coordinates of hydrogen and metals, it does not sufficiently consider the geometric structure of the crystals. We propose an algorithm that generates crystal structures by representing crystals as graphs in order to consider their geometric structures. The proposed algorithm has three key ideas: (1) use of the crystal graph as a feature, (2) use of generative models for graph structures such as GraphGAN, and (3) conversion of the crystal graph into an interpretable format such as POSCAR.
Catalytic performance in the oxidative coupling of methane (OCM) reaction was predicted with two kinds of machine learning (ML) approaches using previously reported experimental data. The first approach takes catalyst compositions and experimental conditions as input values. The second approach takes elemental features as input representations instead of inputting catalyst compositions directly. In 10-fold cross-validation, XGB Regressor provided the best results, and the prediction accuracy was improved by the second approach. In addition, SHAP values were calculated to identify the input variables that most influence catalyst performance. Experimental conditions such as reaction temperature and partial pressure of reaction gases, as well as catalyst components such as Mn, Na, and Li, were identified as highly important. A partial dependence plot was obtained to visualize the relationship between catalytic performance and catalyst composition for the Mn/Na2WO4/SiO2 type catalyst. Finally, optimization of the catalyst composition and experimental conditions was explored using a SMAC procedure with ML as a “surrogate model”. The top 20 promising candidate catalysts were identified for future study.
In electronic devices, when the number of paths connecting the source and drain electrodes increases, the conductance of the device should also increase. This is true in the macroscopic case, but not always on the nanoscale. Kirchhoff’s superposition law tells us that when the number of paths is doubled, the conductance is also doubled. However, for paths in the sense of molecular graph theory, things are not so simple. When the number of paths in a molecule is doubled, one of two situations arises: the conductance more than doubles, or it even decreases. Our theoretical study with the non-equilibrium Green’s function method has revealed that the distinction between these situations is closely related to the aromaticity of the ring formed as a result of doubling the path. We will see how helpful it is to characterize molecular transmission features on the basis of frontier orbital theory and orbital interactions. Some discrete mathematical aspects of the relation between atom connectivity and electron conductivity are also described.
In this study, the DFT method was employed to perform calculations on 33 alkylphenol molecules that are toxic to Tetrahymena pyriformis. We evaluated 29 descriptors of these molecules derived from physicochemical considerations. Statistical analysis was performed to derive a regression model for prediction on the basis of the calculated electronic descriptors. Good prediction was achieved with the random forest method. By defining a partial set of descriptors, it was found that the molecular size parameters are important in describing the toxicity of alkylphenols.
In drug discovery, the Fragment Molecular Orbital (FMO) method has attracted attention as a method for quantitatively evaluating the strength of target protein-ligand interactions by electronic structure calculation. The FMO method dramatically reduces computational cost, but it is currently not suitable for virtual screening of many compounds. For this reason, it needs to be combined with a technique for narrowing down candidate ligands for new drugs in advance. In this study, we examined whether the strength of the protein-ligand interaction could be reproduced from electronic descriptors representing the characteristics of the ligand alone, by machine learning on the FMO database. A random forest regression model constructed for p38 MAP kinase ligands confirmed a good correlation between the electronic descriptors and the interaction strength. The obtained regression model is therefore expected to allow virtual screening of candidate compounds that bind strongly to the target protein.
Synthesis route developing systems (SRDS) create synthesis routes for target molecules. However, there is no guarantee that experiments following the created routes will produce the targets, for the following three reasons. First, the created routes may use precursors that are much more difficult to synthesize than the target itself; the SA score for molecular complexity is key to solving this problem. Second, routes diverge in multistep reactions, and in silico screening is very useful for reducing the number of experiments to check. Third, side reactions are likely to proceed together with the main one in the synthesis reactions offered by SRDS. All plausible reactions of the reactants have to be examined to determine which is the main reaction on the basis of theoretically calculated free energies of activation. The reaction handling function in RDKit was used to predict possible reactions, and theoretical calculations were then performed to obtain the energies and check which is the main product of the reactions proposed by SRDS. A procedure for this purpose was created and applied to reactants that undergo the ene reaction.
The target of this study is the CVD process for the fabrication of semiconductor devices. To automate the entire process of CVD research and development, we developed and evaluated a system that automatically designs experiments for identifying reaction models. Our conventional system adopted algorithms using CMA-ES. However, many dominated solutions were generated, and many candidate experimental conditions that were not effective for identifying reaction models were proposed. Therefore, in this study, algorithms using SPEA2 were adopted to generate the Pareto solution group while avoiding the dominated solution group. In addition, the x-means method, which automatically determines the number of clusters, was introduced as a method for classifying the obtained solutions, aiming at further automation of the system.
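The distinction between dominated and Pareto (non-dominated) solutions can be shown with a minimal sketch. It illustrates only the dominance test, assuming every objective is minimized; it is not SPEA2, which additionally uses an archive, fitness assignment, and density estimation. The candidate points are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical two-objective scores for candidate experimental conditions.
candidates = [(1, 5), (2, 3), (3, 4), (4, 1), (5, 5)]
front = pareto_front(candidates)
```

Here (3, 4) is dominated by (2, 3) and (5, 5) by (1, 5), so only three candidates remain on the front.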
Chemical Vapor Deposition (CVD) is a major process in semiconductor device fabrication. To analyze CVD processes at low cost and high speed, we introduced a novel calculation method that reproduces the deposition profiles of CVD processes. We considered three types of reactors: a batch-type reactor with round substrates, a tubular reactor, and substrates with trenches. The deposition profiles of the reactors can be obtained from exact solutions of the mass balance equations. The calculation speed was greatly improved while maintaining the same accuracy as the conventional calculation method, which relies on iterated numerical integration. Moreover, by implementing the exact solution of the deposition rate distribution, we could analyze various types of CVD reactors at high speed and with high accuracy.
Our research group has studied experiment-oriented materials informatics (MI) based on small original experimental datasets combined with the experience and intuition of researchers. In the present study, layered organic-inorganic composites were exfoliated in dispersion media. The yield of the nanosheets was measured for 128 different guest-medium combinations to prepare the training dataset. The important descriptors were explored from 35 potential factors related to the yield. The number of explanatory variables was restricted to 16 using the minimax concave penalty (MCP), a machine learning method. Then, two descriptors were extracted on the basis of chemical relevance. A simple prediction model was obtained by this sparse modeling. The yield was then predicted for 211 unknown guest-medium combinations in three different host layers. High and low yields were actually achieved as predicted on the three new host layered materials. The results indicate that experiment-oriented MI has potential for small experimental datasets.
A method of searching for alloy catalysts for direct methane conversion is developed. To create a more realistic model, the compositions and structures of the alloys were selected on the basis of the enthalpy of formation, and the surfaces were determined using surface energies obtained from DFT calculations. Regression models were built by the partial least squares (PLS) method to predict the two conditions necessary for direct methane conversion: selectivity to suppress undesired reactions, and high reactivity to cleave the strong C-H bond of methane.
Terpenoids are one of the main groups of secondary metabolites and have a variety of physical properties and physiological activities, so they are widely used in fragrances, pharmaceuticals, and fuels. In terpenoid biosynthesis, various terpene skeletons are formed from a common precursor at an intermediate stage by enzymes called terpene synthases. Terpene synthases have reaction specificity, and each terpene synthase catalyzes the formation of a defined terpene skeleton. Understanding the relationship between the amino acid sequence and the reaction characteristics is therefore important in applications such as the biotechnological production of a desired terpenoid, but the correspondence is still unclear. We examine the correspondence between the structure of the terpene skeleton and the amino acid sequence by creating a neural network model that performs interconversion between the terpene skeleton and the amino acid sequence of the terpene synthase.
KNApSAcK Family DB is a set of databases associated with natural products and organisms. In the present article, we explain the species-natural product relation DB, the KNApSAcK Core DB, together with the current status of KNApSAcK Family DB, in view of expanding the DB so that it can be utilized in multifaceted scientific fields and of acquiring new knowledge through mining techniques. Alkaloids have extremely diverse chemical structures, including heterocyclic ring systems, and encompass more than 20,000 different molecules in organisms. To facilitate a systematic understanding of the species-metabolite relationship, we have developed KNApSAcK Family DB. KNApSAcK Core DB stores 116,315 metabolite-species pairs and 51,179 different metabolites. Of these, 12,460 metabolites are alkaloid compounds, which cover almost all plant-produced alkaloids (approximately 12,000 alkaloids). An evaluation of the numbers of alkaloids linked to different starting substances provides information on the origin and evolution of the diversity of alkaloids. We applied the MGCNN model to the 12,460 compounds in the KNApSAcK Core DB. A large number of alkaloids were predicted to be associated with six starting substances, i.e. L-Arg, L-Tyr, L-Pro, L-Lys, L-Asp and L-Trp. These starting substances may fundamentally contribute to creating the diversity of the chemical structures of alkaloids.
Natural compounds continue to attract researchers' interest due to their rich biological activity. However, the series of enzymatic reactions that biosynthesize the complex molecular skeletons of natural compounds is extremely difficult to elucidate, and many of the reactions cannot be envisaged within organic synthesis reaction theory. In this study, we focus on the secondary metabolic pathways of natural compounds and are developing in silico tools to calculate these metabolic pathways appropriately. In this paper, we report the results of an attempt to reproduce the secondary metabolic pathways of natural compounds in silico.
In metabolome analysis, structure estimation for unknown compounds is a recurring problem. In this study, two independent approaches to it are taken. One is estimation based on tandem mass spectrometry (MS/MS), which is widely used in the metabolomics field. The other is our distinctive approach based on metabolism. By combining them, we succeeded in identifying several unknown compounds. For some of them, one of the two approaches performed better than the other, while some compounds were estimated effectively by both. MS/MS analysis sometimes loses structural information such as the position of a functional group, whereas our metabolism-based approach can handle it. In contrast, MS/MS can perform well for compounds with relatively large structures. The two independent approaches can both complement and reinforce the estimates obtained from each other.
Dioxins such as dioxin and dibenzofuran are compounds having acute and chronic toxicity. The interaction between nucleobases and dioxins was analyzed by molecular orbital calculation. Model molecules were used for the nucleobases adenine, cytosine, guanine, thymine, and uracil, and calculations were carried out for four dioxin compounds with large toxicity equivalence factors, two dioxins and two dibenzofurans. MP2 calculations were performed on the optimized structures obtained by HF calculations to obtain the stabilization energy, with correction for BSSE. Stable structures of the nucleobase-dioxin complexes were obtained. The stabilization energy between nucleobase and dioxin was less than about half that of a Watson-Crick base pair, but the results suggest the possibility that nucleobases and dioxins form hydrogen bonds.
Microarray technology has produced a large amount of gene expression data, which are widely used in many fields of research. At the same time, breast cancer is one of the most common cancers diagnosed in women worldwide and is also the leading cause of cancer death in women. Accordingly, a growing number of studies have used machine learning to predict cancer prognosis from DNA microarray data. However, most of them used single classifiers instead of ensemble learning. This study aims to demonstrate the effectiveness of a stacking model in predicting the 5-year survival of breast cancer patients, in comparison with six single classifiers (SVM, Random Forest, Logistic Regression, XGBoost, GBDT, KNN). The dataset contains 1592 samples and 22283 features, so Lasso regression was used to select features. According to the ACC, TPR, and AUC results, the stacking algorithm performed better than the single classifiers.
Currently, in Japan, 28-day repeated dose studies are conducted on animals as a compound toxicity test under the Chemical Substances Control Law. This test has problems such as high costs of tens of millions of yen per compound, reduced competitiveness of the Japanese science industry due to long test periods, and ethical issues. In addition, in conventional toxicity prediction models that use only quantitative structure-activity relationships (QSAR) built by machine learning, the mechanism of action of compounds on cells is a black box and the applicability domain of the model is not clear. Therefore, we are considering a three-stage model. The first stage predicts whether "the compound will be absorbed into the body", the second predicts "cytotoxicity test results from compound information" for absorbed compounds, and the third predicts "toxicity in each organ from compound information and cytotoxicity test results", with the aim of clarifying the applicability domain (AD) of each model. In this study, we analyzed various outlier detection methods for the model that predicts "cytotoxicity test results from compound information" and investigated which method is effective as an index for setting the AD.
A mass spectrometer identifies a molecular structure by matching the obtained spectrum against a database, so it is difficult to identify a molecule that has not been measured before. In this study, we developed a deep learning method that learns the spectra in a database to infer the molecular structure. Since the inference is done by assembling the molecular structure, it is possible to infer structures that are not in the database. By using the latent representation of the molecular structure, we succeeded in inferring molecular structures with high similarity.
Identifying the molecules contained in a sample from an NMR spectrum is a fundamental and important problem. Although various identification methods compare NMR spectra with those in a database, many of them rely on human judgment, so the identification of a molecule from its NMR spectrum depends largely on the analyst. This human dependence raises problems of objectivity and effort. To address these problems, we are developing a method that identifies a molecule by combining a search algorithm with machine learning. In this presentation, we will show the application of a de novo molecule generator coupled with quantum chemical calculation to identify a molecule from its spectrum by designing molecular candidates.
Chemical space is the space of possible molecules. Investigating the structure of this space can help us explore compounds with desirable properties. A previous study proposed a network representation of the biologically relevant chemical space. The network consisted of nodes representing compounds, and a link was drawn between two nodes if the similarity between the compounds exceeded a preset threshold. However, the network topology depends on the threshold. In this study, instead of using a threshold, we investigated a weighted network in which the weight of a link equals the similarity between the two compounds. We examined bioactive compounds taken from the ChEMBL database. By analyzing the weighted network of bioactive compounds for each target, we found that each network has a homogeneous structure, in which nodes connected to others with extreme weights are rare. The community structure of each network was also weak. However, some detected communities were tightly connected and exhibited a bioactivity distribution biased relative to that of the whole network. Furthermore, we found that compounds with significantly high or low bioactivity are strongly connected to each other.
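The threshold-free weighted network described above can be sketched as follows. The abstract does not name the similarity measure, so this example assumes the Tanimoto coefficient on fingerprints represented as sets of "on" bit indices; the compound names and fingerprints are made up for illustration.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two fingerprints given as sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def weighted_network(fingerprints):
    """Complete weighted graph: every pair of compounds is linked, and the
    link weight is their similarity (no threshold is applied)."""
    return {
        (i, j): tanimoto(fingerprints[i], fingerprints[j])
        for i, j in combinations(sorted(fingerprints), 2)
    }


def node_strength(edges):
    """Sum of link weights attached to each node (the weighted degree)."""
    strength = {}
    for (i, j), w in edges.items():
        strength[i] = strength.get(i, 0.0) + w
        strength[j] = strength.get(j, 0.0) + w
    return strength


# Toy fingerprints (assumed): sets of "on" bit positions.
fps = {"A": {1, 2, 3, 4}, "B": {2, 3, 4, 5}, "C": {7, 8}}
edges = weighted_network(fps)
strength = node_strength(edges)

print(edges)
print(strength)
```

Because every pair is linked, quantities such as node strength replace the degree used in thresholded networks, and the topology no longer depends on an arbitrary cutoff.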
Estimating synthetic accessibility is an important aspect of computer-aided drug design. Several methods for predicting synthetic accessibility have been reported, based on retrosynthetic analysis, molecular complexity, and fragment contributions. However, there are almost no methods that use machine learning. Here we report a machine learning method for predicting synthetic accessibility. Because synthetic accessibility is a subjective judgment, it is difficult to prepare a large-scale training set for machine learning. We therefore assume that the compounds obtained by removing the ZINC15 compounds (purchasable “drug-like” compounds) from the GDB-17 compounds (the chemical universe database of compounds with up to 17 atoms of C, N, O, S, and halogens) are likely to be difficult to synthesize, and that ZINC15 compounds are easier to synthesize than these compounds. Based on this hypothesis, we created a data set and used it to train a neural network classifier. We then evaluated the model using a validation set obtained from the literature. The results show that the model was able to distinguish compounds that are difficult to synthesize from easier ones. We are developing models using different machine learning methods and expect to report a comparison with the neural network model.
Quantitative structure-property relationship (QSPR) modeling is a method for predicting the properties of compounds. In QSPR, a regression model is constructed from training data consisting of the structures and properties of compounds. In many cases, molecular descriptors are calculated from the structure and used as input. However, finding the best set of descriptors for each prediction is very difficult, and the descriptors may not contain sufficient information about the target property. In this study, molecular structures were represented by atom positions, atom types, and graph structure, and a regression model was constructed using a graph convolutional neural network (GCNN). In one case study, the proposed method outperformed the existing descriptor-based method, but in another it performed worse. One likely reason is the insufficient representational power of the model. Thus, reconsidering the input form or the model structure may improve the predictive ability of the proposed method.
The intrinsic reaction coordinate (IRC), defined as the minimum energy pathway connecting two equilibrium structures via a transition state structure on the potential energy surface, is a useful tool for describing an elementary reaction mechanism. Recently, the global reaction route mapping strategy has been developed, which enables us to construct a global reaction route network composed of all IRC pathways for a given molecular system. However, the overall positional relationship of the molecular structures in the network is difficult to reproduce in a lower-dimensional space because of the high dimensionality of molecular structures. Very recently, to visualize the static reaction coordinate in a two- or three-dimensional space, we applied principal coordinate (PCo) analysis, a dimensionality reduction technique, to IRC pathways and a global reaction route network. In this study, we embed classical trajectories obtained by the on-the-fly molecular dynamics method into a PCo subspace obtained by projecting an IRC pathway, and we discuss dynamical reaction pathways on the basis of the visualized static reaction pathway.
Coordination polymers (CPs) exhibit various promising functionalities owing to their uniform framework structures. Most reported CPs have oxygen or nitrogen as the coordinating atoms in the metal-bridging part, and few structures contain sulfur or other elements as coordinating atoms. Because the mechanism of the crystallization process of CPs is not fully understood and a rational design strategy for novel CPs has not been established, time-consuming exploration has been required to optimize the synthesis conditions. Here, we focus on two machine learning techniques, cluster analysis and decision tree analysis, to improve the accuracy of predicting synthesis conditions. In this work, we explored the synthesis conditions of CPs containing sulfide-metal bonds using a high-throughput screening system. We applied cluster analysis to categorize their powder X-ray diffraction (PXRD) patterns into several types. The relationship between the synthesis conditions and the obtained PXRD patterns was estimated by decision tree analysis. We have demonstrated that cluster analysis and decision tree analysis are useful for predicting the reaction mechanism and exploring synthesis conditions for novel CPs.
Various types of palladium complexes bearing phosphine-sulfonate ligands have been developed for the coordination–insertion copolymerization of olefins with polar monomers. Characteristic features of the ligands, such as their electronic and steric properties, have been discussed in relation to catalytic performance. Aiming at further analysis of the obtained data, here we report the development of a machine learning prediction method for the copolymerization of ethylene and methyl acrylate. The machine learning predictions identified the parameters that are important for the molecular weight of the obtained polymers and for the polymerization activity. These results suggest concepts for new catalyst designs.
Metal complexes are used as catalysts for the polymerization of alkyd resin, where they accelerate the drying of the paint. The catalytic ability of the metal complexes depends heavily on the metal and the ligand. Thus, a better understanding of the mechanism of this catalytic reaction could contribute to the rational design of catalysts. In this study, we examined the Gibbs free energy profiles of three possible reaction pathways and compared the activation barriers of the reactions catalyzed by three metal complexes. The activation barriers for the three catalytic systems were similar. Only the stable spin states of the intermediates depended on the catalytic system.
Planar four-membered ring structures, which are not the global minima of crystallogen (Cr) compounds Cr4R4, can be formed by introducing bulky substituents on the side chain (R). Although bulky substituents are effective for controlling the stability of local minima, it is still not easy to synthesize a desired structure because bulky side chains are difficult to synthesize. To design appropriate substituents, we applied an automated reaction path search strategy, called global reaction route mapping, to the model crystallogen compounds Cr4H4 (Cr = C, Si, Ge, and Sn) to search exhaustively for their local minima and transition states. Focusing on the electronic structures, bond characters, reactivities, and their dependence on the substituents, we will discuss a strategy for designing appropriate substituents that afford the desired molecular structures.
The inference of chemical reactions and the development of functional materials using machine learning and deep learning are attracting the attention of synthetic and theoretical chemists. In this paper, we report the inference of activation energies for the coordination-insertion of ethylene into cationic metallocene complexes using machine learning combined with DFT calculations. Theoretical reaction pathways of the coordination-insertion mechanism for 68 metallocene catalysts consisting of methyl-substituted Cp rings and group 4 elements (Ti, Zr, Hf) were calculated using B3LYP with the 6-31++G(d,p) basis set, and the activation energies were obtained. The resulting activation energies, as objective variables, were predicted by random forest regression from steric and electronic parameters of the cationic metallocene complexes before ethylene coordination, which served as explanatory variables. The relationship between the objective variables and the categorical explanatory variables was summarized in violin plots, which showed that the values of E-INT1 for the Zr and Hf complexes are almost the same, whereas the Ti complex was less stabilized than the Zr and Hf complexes even by coordination of ethylene. The same tendency was seen in E-TS and E-TS-INT1.
Recently, computational chemistry and chemoinformatics have come to play an important role in the theoretical study of chemical reactions. In the present study, we attempted to predict the activation free energies (ΔG‡) of substrates with different substituents without searching for transition states, which is one of the most difficult tasks in computational chemistry. For this purpose, we adopted the reaction of an aza-β-lactam that inhibits the activity of protein phosphatase methylesterase 1 (PME-1). Methanol was used as a model of the serine residue of PME-1. The mechanism of the ring-opening reaction including a water molecule was analyzed, and the ΔG‡ values were calculated as the objective variable for PLS regression. The explanatory variables were the HOMO and LUMO energy levels of the reactants, structure-related parameters from the DRAGON program, and molecular volumes from the Winmostar program. The Chemish program gave a good regression model with R² = 0.932 and Q² = 0.915. The major part of the model consisted of the LUMO energy and Hammett's σ value.
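The fitted and cross-validated coefficients of determination reported in the abstract above can be illustrated with a small sketch. For brevity it substitutes ordinary least squares on a single descriptor for the PLS regression used in the study, and it assumes that the cross-validated value is obtained by leave-one-out cross-validation, which the abstract does not state. The descriptor and target values are made-up toy numbers.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return slope, my - slope * mx


def r_squared(x, y):
    """Coefficient of determination of the model fitted to all samples."""
    slope, intercept = fit_line(x, y)
    my = sum(y) / len(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def q_squared_loo(x, y):
    """Leave-one-out cross-validated counterpart: each sample is predicted
    by a model fitted without it, and the squared errors are summed (PRESS)."""
    my = sum(y) / len(y)
    press = 0.0
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (slope * x[i] + intercept)) ** 2
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - press / ss_tot


# Toy data (assumed): one descriptor and a noisy, roughly linear target.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r2 = r_squared(x, y)
q2 = q_squared_loo(x, y)
print(r2, q2)
```

For a least-squares model, the leave-one-out value never exceeds the fitted one, so a small gap between the two, as in the reported 0.932 and 0.915, indicates that the fit is not driven by individual samples.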
A new quantum chemical descriptor, the quinoid stabilization energy (QSE), is established for the computational design of narrow-bandgap polymers. QSE is defined on the basis of the energy change of homodesmotic reactions of a dimethylated monomer with oligoacetylene. Density functional theory (DFT) calculations revealed a relationship between QSE and the bandgap of polymers. According to the relationships obtained for 268 homopolymers and 179 alternating copolymers selected from many different families, narrow-bandgap polymers can be designed with QSE = 0, which indicates an intermediate state between the aromatic and quinoid forms. Copolymers with QSE = 0 can be obtained by combining a quinoidal monomer with an aromatic one. The main advantage of this approach is that it requires only information on the monomers and their linking sites. The narrow bandgap around QSE = 0 was shown to be related to the level crossing of the aromatic-type and quinoid-type orbitals. To rationally design ultra-narrow-bandgap polymers considering both aromatic-quinoidal and donor-acceptor characters, a bandgap prediction model was constructed by machine learning methods using QSE and the difference between the LUMO of the acceptor and the HOMO of the donor as descriptors.