Drug discovery and development involves many processes, but its overall success rate is extremely low, and it requires a very long development period and considerable costs. Clearly, there is a need to reduce research and development costs by improving the probability of success and increasing process efficiency. One promising approach to this challenge is so-called “in silico drug discovery,” that is, drug discovery utilizing information and communications technologies (ICT) such as artificial intelligence (AI) and molecular simulation. In recent years, ICT-based science and technology, such as bioinformatics, systems biology, cheminformatics, and molecular simulation, developed mainly in the life science and chemistry fields, have changed the face of drug development. AI-based methods have been developed for the drug discovery process, mainly in relation to drug target discovery and pharmacokinetic analysis. In drug target discovery, an in silico method has been developed that uses a probabilistic framework to circumvent the problems of conventional experimental approaches and provide a key to understanding the pathways and mechanisms linking compounds to phenotypes. In the field of pharmacokinetic analysis, a method has been developed that uses nonclinical data to predict human pharmacokinetic parameters, which are important for predicting drug efficacy and toxicity in clinical trials. In this article, we provide an overview of these methods.
The drug development process consists of three parts: basic research, nonclinical testing, and clinical trials. Basic research includes the search for drug target molecules, the search for hit compounds in vast compound libraries, and the optimization of the hit compounds. Nonclinical testing examines a drug’s pharmacodynamics, in vivo kinetics, and adverse effects in animals. Clinical trials confirm pharmacokinetics (absorption, distribution, metabolism, excretion [ADME]) and safety (adverse events and side effects) in humans. However, the success rate of drug development is extremely low, and the major issues are the extremely long development period and exorbitant development costs (more than 10 years of development and more than U.S. $1.8 billion in costs).1) Despite this high cost of drug development, the number of new drugs approved remains low. Therefore, improving the probability of success in drug development and reducing costs by improving process efficiency are pressing issues for the pharmaceutical industry. To overcome these challenges, various artificial intelligence (AI) technologies for drug discovery have been developed around the world, and some have even reached practical application. The main purpose of such AI techniques is to reduce the probability of failure by making predictions before the actual experiments are conducted. By reducing the number of experiments, AI techniques are expected to dramatically curb the cost and duration of drug development.
In recent years, the development of AI technology and the accumulation of data have led to the increased use of computers in various fields. Computational methods such as molecular dynamics simulation and machine learning tools are also being applied to drug discovery. These methods are referred to as “in silico drug discovery” and are being actively developed to improve the efficiency of the drug development process. The AI technologies involved range from conventional machine learning, such as support vector machines and Random Forest, to multilayer perceptrons and other neural network models, convolutional neural networks (applied, for example, to diagnostic imaging in medicine), and graph convolutional neural networks. In addition, high-speed computing hardware such as graphics processing units (GPUs) has made it possible to complete large-scale AI training within a practical amount of computation time, enabling analysis using AI technologies. Various large-scale databases are also available: PubChem/BioAssay, the world’s largest bioactivity database, published by NIH/NCBI; ChEMBL2) and DrugBank,3) which contain activity information on compounds and target proteins; SIDER4) and FAERS, which contain information on compounds and side effects; and the Kyoto Encyclopedia of Genes and Genomes (KEGG),5) which integrates information on diseases, drugs, and intermolecular networks such as metabolism and signal transduction.
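As a concrete illustration of how such databases can be used programmatically, the sketch below retrieves bioactivity records from ChEMBL through its public REST interface. The endpoint and field names follow the documented ChEMBL web services; the target ID and filters are arbitrary examples chosen for illustration, not values from this article.

```python
# Minimal sketch: fetching bioactivity records from the ChEMBL REST API.
import requests

url = "https://www.ebi.ac.uk/chembl/api/data/activity.json"
params = {
    "target_chembl_id": "CHEMBL205",  # illustrative target ID
    "standard_type": "IC50",          # restrict to IC50 measurements
    "limit": 20,                      # first page of results only
}
resp = requests.get(url, params=params, timeout=30)
resp.raise_for_status()
for act in resp.json()["activities"]:
    # standard_value / standard_units may be None for some records
    print(act["molecule_chembl_id"], act["standard_value"], act["standard_units"])
```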
In the drug discovery process, AI has been applied to the search for target molecules, the discovery of hit compounds, and the prediction of the activity and physical properties of candidate compounds. This is because AI is well suited to processing large amounts of data, for example, identifying common features among huge data sets and searching for candidates based on those features.
This review provides an overview of some examples of in silico drug discovery technology development in the drug discovery process, that is, in the drug target discovery process and in the nonclinical test process.
Drug target discovery is the first stage of the drug development process. To start the development of a therapeutic drug for a certain disease, the first step is to identify target molecules for the disease. Target molecules are mainly proteins. Many drugs exert their effects by binding to target proteins, which are enzymes or receptors, and inhibiting or promoting their functions. Most target molecule searches are conducted through experimental research. Target molecules are often selected based on information from basic research described in academic papers.
One of the reasons for the low success rate of drug development is the increasing number of cases in which a drug’s expected efficacy is not achieved during clinical trials, the latter stage of the drug development process, and development is halted. According to a report by the Biotechnology Innovation Organization,6) the average probability of progressing from Phase 1, a study in healthy subjects, to approval over the decade 2006–2015 was 9.6%. In particular, the probability of moving from Phase 2, a study with a small number of patients, to Phase 3, a study with a large number of patients, is quite low at 30%. Evaluating the efficacy of candidate compounds is the objective of Phase 2; a new drug candidate that fails to pass Phase 2 has therefore failed to demonstrate efficacy for the disease. In addition, it has been reported that approximately 60% of Phase 2 terminations are due to failure to achieve drug efficacy because of an incorrect choice of target molecule, for example when the true target is a different molecule.7) Drug targets that are easy to develop against have already been studied, and the development of therapeutics for the remaining target molecules is likely to remain highly challenging. Moreover, development of therapeutic agents for diseases whose target molecules have not been identified cannot progress. The pharmaceutical industry is therefore running out of target molecules with which to start new drug development.
Various experimental and computational methods have been proposed to overcome the problem of depletion of target molecules in the pharmaceutical field. In the following, we describe the methods that have been developed and the challenges they face, and introduce a new prediction method for target molecules.8)
2.1. Previous Drug Target Discovery Methods and Challenges
2.1.1. Bioactive Processes of Pharmaceuticals
The process of characterizing the biological activity of a therapeutic drug is briefly described in Fig. 1. First, the compound binds to target proteins, and the activated proteins transmit signals to cells (known as a signaling pathway). Second, the signals trigger various proteins to interact with each other. Finally, this produces a phenotypic change in the cells, such as cell death, cell proliferation, or differentiation. Understanding the complex relationships among the molecular players involved, namely compounds, target proteins, pathways, and phenotypes, will improve the success rate of drug discovery.
For the selection of target molecules, although it would be desirable if data on direct relationships between proteins and phenotypes could be used, such data rarely exist. Since information on compound–protein interactions and compound–phenotype associations has accumulated, it is conceivable that such information could be used indirectly. However, experimental data on compound–protein interactions and compound–phenotype associations are not sufficient given that there is a huge variety of both compounds and proteins. As a result, a number of new experimental and computational approaches have been proposed.
2.1.2. Phenotypic and Target-Based Approaches
Phenotypic and target-based approaches developed in the field of chemical biology are useful methods that have been applied in drug development.9,10) The phenotypic approach is an experimental method for evaluating the phenotypic responses of cells and tissues to chemical substances; it is also known as a cellular assay or in vivo assay. In drug development, it is used to search for compounds that elicit phenotypic responses that improve pathological conditions in cell models of disease.11,12) Historically, drug development has relied on the phenotypic approach because it is based on biological functions and phenotypes. However, this approach depends on the researcher’s experience regarding, for example, which model cells to use and what cellular changes to measure. Moreover, it relies on trial and error and does not necessarily reflect the mechanism of action of the test compound.13) In contrast, the target-based approach is a rational approach that searches for drug candidates by targeting disease-causing molecules in vivo.14) With recent advances in high-throughput experimental techniques, new drug development applying the target-based approach has become popular. However, this approach is difficult to apply when the target molecule is unknown.
Thus, phenotypic and target-based approaches each have their own advantages and disadvantages. In general, many drugs with novel mechanisms of action have been identified by phenotypic approaches, whereas target-based approaches can reduce the incidence of side effects and toxicity, clarify pharmacological responses, and provide a certain understanding of the mechanism of action, since they are focused on the target molecule. Recently, a combination of these two approaches has increasingly been adopted. It is reported that each method, compensating for the other’s shortcomings, increases the success rate of development and advances our understanding of molecular mechanisms and disease biology.15)
2.1.3. Two Serious Issues
Two serious issues remain for both phenotypic and target-based experimental approaches, namely target deconvolution and polypharmacology12,16–19) (Fig. 1). Target deconvolution is the identification of the target molecules responsible for an observed phenotypic response. For example, even if a phenotypic approach yields a hit compound in screening that affects a specific phenotype, it does not reveal the target molecule on which the hit compound directly acts, or the relationship between the target molecule and the phenotype.12,16) In addition, the target-based approach cannot be applied when the target molecule is unknown because it searches for compounds that act directly on a known target molecule.9,20) Target deconvolution is thus a bottleneck between phenotypic and target-based approaches. Several experimental methods have been developed in the fields of chemical biology and chemical genetics to circumvent this bottleneck. However, it remains difficult to understand the mechanism of action, and experimental costs can be very high.
The second serious issue in drug discovery is polypharmacology: “many drugs do not act on only a single target biomolecule, but act on multiple target biomolecules and affect phenotypes such as efficacy and toxicity.”18,19) In other words, to account for polypharmacology, target deconvolution must be achieved for the multiple target molecules that may affect drug efficacy and toxicity, making experiments extremely challenging. Furthermore, even if target deconvolution of multiple molecules could be achieved, it would be very difficult from both labor and cost perspectives to experimentally synthesize compounds that interact with all of those target molecules.
2.2. A Probabilistic Framework of Drug Target Molecule Prediction Based on AI
To overcome the two serious issues of target deconvolution and polypharmacology, we have developed a new in silico method based on a probabilistic framework. For more details, please refer to our previous article.8) This method uses machine learning to estimate compound–target protein–phenotype association networks by integrating experimental data on compound–target protein interactions from the target-based approach and compound–phenotype assay information from the phenotypic approach.
The method consists of two steps corresponding to the target-based approach and the phenotypic approach; the analysis procedure for each step is shown in Fig. 2. In the first step, a machine learning model is constructed using the compound–target protein interaction information obtained from the ChEMBL database,21) that is, data produced by the target-based approach, to predict the interactions of unknown compound–protein combinations. We employed a chemical genomics-based virtual screening method22) for training on and predicting compound–protein interaction information; the specific analysis procedure corresponds to the lowercase Roman numerals in Fig. 2.
Fig. 2. The method consists of two steps: construction of a model for predicting compound–protein interactions and prediction of interactions using that model; and construction of a model for predicting compound–target protein–phenotype associations and selection of the target proteins related to a phenotype using that model. Adapted from Ref. 8.
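The first step in Fig. 2 can be pictured with a minimal chemogenomics-style sketch: each compound–protein pair is encoded by concatenating a compound fingerprint with a simple protein descriptor, and a classifier is trained on known interacting and non-interacting pairs. This is only a generic illustration of the idea behind the chemical genomics-based virtual screening method cited above; the actual descriptors and learner in ref. 22 differ, and RDKit and scikit-learn are assumed here.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def compound_features(smiles, n_bits=1024):
    """Morgan (ECFP-like) bit fingerprint of the compound."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    return np.array(fp)

def protein_features(sequence):
    """Amino acid composition: a deliberately simple protein descriptor."""
    return np.array([sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS])

def pair_features(smiles, sequence):
    """Compound-protein pair vector: fingerprint concatenated with composition."""
    return np.concatenate([compound_features(smiles), protein_features(sequence)])

def train_cpi_model(pairs):
    """pairs: iterable of (SMILES, protein sequence, interacts: 0/1) triples."""
    X = np.array([pair_features(s, p) for s, p, _ in pairs])
    y = np.array([label for _, _, label in pairs])
    model = RandomForestClassifier(n_estimators=500, random_state=0)
    model.fit(X, y)
    return model  # model.predict_proba() then scores unseen compound-protein pairs
```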
Pharmacokinetic research plays an important role in drug development. In the early stages of the drug discovery process, simple compound synthesis and in vitro experiments are the main activities, and the costs involved are small; even if the process is not successful, it is easy to change the business strategy. In the latter stages, by contrast, the cost of clinical research is enormous in addition to the cumulative costs incurred up to that point, so failure is not an option. Nevertheless, although all pharmaceutical companies are acutely aware of this situation, there is no shortage of examples of clinical trial failures in late-stage development.26,27) In the 1990s, about 40% of failures were attributed to poor pharmacokinetics. Since then, the importance of pharmacokinetic studies has been reevaluated, and pharmacokinetic studies have been incorporated into the early stages of drug discovery in pharmaceutical companies. By the 2000s, this failure rate had improved to about 10%.28) Today, pharmacokinetic studies play an important role throughout drug development, from hit compound discovery through lead compound optimization, preclinical animal studies, and clinical trials.29)
Accurate prediction of human pharmacokinetic parameters, mainly clearance (CL) and volume of distribution (Vd), which are important for predicting drug efficacy and toxicity in clinical trials, enables optimal clinical dose estimation. The challenge is to predict human pharmacokinetic parameters from nonclinical data. In recent years, the development of high-throughput in vitro screening technologies has led to the accumulation of large ADME data sets, and research on in silico prediction of ADME properties using machine learning methods has been active.
3.1. Methods of CL and Vd Value Predictions
CL is a major pharmacokinetic parameter. Clearance proceeds through multiple pathways, including metabolism and excretion; metabolism may occur via various metabolic enzymes, and excretion may involve various transporters in addition to simple renal excretion. Because CL arises from such a complex mechanism, it is difficult to predict. Even the most accurate allometry-based method (the rat single-species method) has an average prediction error of more than 2-fold.30) Another approach is in vitro/in vivo extrapolation, which scales up the intrinsic CL measured in human liver microsomes or hepatocytes, but this method is only applicable to drugs metabolized in the liver.31) Machine learning methods using chemical descriptors as explanatory variables have also been proposed, but their accuracy is comparable with that of allometry.32)
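Concretely, single-species allometric scaling reduces to one power-law formula. The sketch below illustrates the rat single-species method mentioned above, using the conventional 0.75 exponent for CL and a 70 kg reference human; both values are standard defaults rather than settings taken from this article.

```python
# Single-species allometric scaling of clearance: a minimal sketch.
def sss_clearance(cl_animal, bw_animal_kg, bw_human_kg=70.0, exponent=0.75):
    """Scale an animal CLtot (e.g., mL/min) to a predicted human CLtot."""
    return cl_animal * (bw_human_kg / bw_animal_kg) ** exponent

# Example: a rat CLtot of 10 mL/min (body weight 0.25 kg) scales to
# 10 * (70 / 0.25) ** 0.75, i.e., roughly 685 mL/min predicted in humans.
print(round(sss_clearance(10.0, 0.25), 1))  # ~684.5
```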
Vd is the apparent volume relating the total amount of drug in the body to its plasma concentration; after entering the body, a drug spreads via the bloodstream through the systemic vascular network and distributes into tissues and organs. Vd is largely determined by the physical properties of the drug, such as protein binding and membrane permeability.33) Therefore, Vd prediction from nonclinical data has been relatively successful. For example, allometric methods and their variants (e.g., the Øie–Tozer model34)) are used, and their prediction errors are reported to be within a factor of 2 on average.35) In addition to allometric methods, machine learning methods such as Random Forest, Partial Least Squares, and Support Vector Machine have recently been proposed that use chemical descriptors calculated by various software packages as explanatory variables. Their accuracy and stability are as good as or better than those of allometry.36,37)
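To make the descriptor-based machine learning approach concrete, the sketch below computes a handful of RDKit physicochemical descriptors (standing in for the various commercial descriptor packages cited) and fits a Random Forest regressor. Modeling log10(Vdss) is a common choice because pharmacokinetic parameters are roughly log-normally distributed; this is a generic illustration, not the published models.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.ensemble import RandomForestRegressor

def descriptors(smiles):
    """A few standard physicochemical descriptors for one compound."""
    mol = Chem.MolFromSmiles(smiles)
    return [
        Descriptors.MolWt(mol),        # molecular weight
        Descriptors.MolLogP(mol),      # lipophilicity (Crippen logP)
        Descriptors.TPSA(mol),         # topological polar surface area
        Descriptors.NumHDonors(mol),   # H-bond donors
        Descriptors.NumHAcceptors(mol) # H-bond acceptors
    ]

def train_vdss_model(smiles_list, vdss_values):
    X = np.array([descriptors(s) for s in smiles_list])
    y = np.log10(np.array(vdss_values))   # model log10(Vdss)
    model = RandomForestRegressor(n_estimators=500, random_state=0)
    model.fit(X, y)
    return model  # 10 ** model.predict(X_new) recovers Vdss in L/kg
```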
We have proposed a new prediction method to improve the accuracy of total body CL (CLtot) and steady-state Vd (Vdss) predictions, which combines three elements: (1) a multimodal machine learning model that takes compound structure information and nonclinical data as input; (2) data interpolation that predicts missing nonclinical data by in silico methods to increase the training data; and (3) feature selection over the nonclinical data during machine learning model construction.38) These three elements are described in the following sections.
3.1.1. Multimodal Model
A multimodal model predicts objective variables using data of different qualities as explanatory variables. In early multimodal research, prediction models for image classification were constructed by learning not only the images but also the corresponding text.39) In the field of drug discovery, attempts have been made to use multimodal models to predict compound–protein interactions.40,41) In the pharmacokinetic field, by contrast, no such explicit attempts had been reported. Wajima et al. described a method that formulates multiple regression equations to predict human CL using animal CL values and simple physicochemical parameters of the compounds as explanatory variables, which may be considered a multimodal prediction method.42) However, their data set was insufficient: the number of compounds was very small at 68.
Therefore, we proposed a multimodal model for CLtot and Vdss prediction that uses chemical structure-derived descriptors and nonclinical data of different qualities as explanatory variables.43) As chemical structure-derived descriptors, extended-connectivity fingerprints up to four bonds (ECFP4) are used for the XGBoost method, and chemical structure graphs are used for Deep Tensor, a graph-based deep learning method. For CLtot prediction, rat CLtot, dog CLtot, monkey CLtot, the human unbound fraction (human fu), rat fu, dog fu, monkey fu, pKa acid, pKa base, solubility, and Caco-2 permeability were used to construct the prediction model. For Vdss prediction, rat Vdss, dog Vdss, monkey Vdss, human fu, rat fu, dog fu, monkey fu, pKa acid, pKa base, solubility, and Caco-2 permeability were used.
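A minimal sketch of the XGBoost arm of this multimodal model follows: the ECFP4 bit vector (Morgan radius 2 = diameter 4) is concatenated with the 11-item nonclinical feature vector, and log10(CLtot) is regressed. The feature ordering, hyperparameters, and log transform are illustrative assumptions; Deep Tensor, the graph-based arm, is proprietary and not reproduced here.

```python
import numpy as np
import xgboost as xgb
from rdkit import Chem
from rdkit.Chem import AllChem

# The 11 nonclinical items listed above, in an assumed fixed order.
NONCLINICAL = ["rat_CLtot", "dog_CLtot", "monkey_CLtot", "human_fu", "rat_fu",
               "dog_fu", "monkey_fu", "pKa_acid", "pKa_base",
               "solubility", "caco2_permeability"]

def ecfp4(smiles, n_bits=2048):
    """ECFP4 fingerprint (Morgan, radius 2) as a numpy bit vector."""
    mol = Chem.MolFromSmiles(smiles)
    return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits))

def multimodal_matrix(smiles_list, nonclinical_rows):
    """nonclinical_rows: (n_compounds, 11) array ordered as NONCLINICAL."""
    fps = np.array([ecfp4(s) for s in smiles_list])
    return np.hstack([fps, np.asarray(nonclinical_rows)])

def train_cltot_model(smiles_list, nonclinical_rows, cltot_values):
    X = multimodal_matrix(smiles_list, nonclinical_rows)
    y = np.log10(np.asarray(cltot_values))  # regress log10(CLtot)
    model = xgb.XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.05)
    model.fit(X, y)
    return model
```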
3.1.2. Data Interpolation for Missing Values
The multimodal approach, which uses chemical structure and rat experimental values as explanatory variables, has increased prediction accuracy.43) Accuracy is further enhanced by using as explanatory variables not only rat values but also values from other animals (e.g., dogs and monkeys) and in vitro experimental values such as protein binding rates in various animals. Accordingly, animal experimental data (CLtot, Vdss, and unbound fraction values for rat, dog, and monkey) and human fu data were collected for compounds with measured human CLtot and Vdss values. In addition, pKa acid, pKa base, solubility, and Caco-2 permeability data, including calculated values for each compound, were obtained from PubChem and DrugBank. As a result, the number of compounds for which all 11 of the above data items were available was only 46 for the CLtot data and 45 for the Vdss data. We named these groups of compounds the “evaluation data set” and used them to evaluate prediction accuracy. As Fig. 3 indicates, there were 741 and 751 compounds with actual measured values of human CLtot and Vdss, respectively, so requiring complete nonclinical data drastically reduces the number of usable compounds. In other words, we faced the problem of missing values in the experimental data, which significantly reduced the number of compounds available.
Fig. 3. Human CLtot prediction flow: (i) there were 741 compounds with human CLtot data, of which 46 had values for all 11 nonclinical data items; (ii) all feature values were predicted by ADMEWORKS; (iii) feature selection was performed using XGBoost or Random Forest, and a prediction model was constructed. Human Vdss prediction flow: (i) there were 751 compounds with human Vdss data, of which 45 had values for all 11 nonclinical data items; (ii) all feature values were predicted by ADMEWORKS; (iii) feature selection was performed using XGBoost or Random Forest, and a prediction model was constructed.
Given these circumstances, we developed a method for predicting human CLtot and Vdss values that uses missing-value interpolation as a preliminary step to the multimodal machine learning method (Fig. 3). First, to interpolate the missing values, a prediction model was created for each of the 11 nonclinical data items using ADMEWORKS (Fujitsu Limited, Japan). Predicting the missing values with this set of models allowed us to use a large amount of nonclinical data without reducing the training data. A machine learning model was then constructed to predict human pharmacokinetic parameters using the interpolated nonclinical data and chemical structure information as explanatory variables.
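Because ADMEWORKS is a commercial package, the sketch below illustrates the interpolation idea with per-item scikit-learn models as stand-ins: for each of the 11 nonclinical items, a model is trained on the compounds for which that item was measured, and its predictions fill the gaps.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def interpolate_missing(fingerprints, nonclinical):
    """fingerprints: (n, n_bits) array; nonclinical: (n, 11) array with np.nan gaps."""
    filled = nonclinical.copy()
    for j in range(nonclinical.shape[1]):
        observed = ~np.isnan(nonclinical[:, j])
        if observed.all():
            continue  # nothing to fill for this item
        # Train on compounds where item j was measured, predict where it wasn't.
        model = RandomForestRegressor(n_estimators=300, random_state=0)
        model.fit(fingerprints[observed], nonclinical[observed, j])
        filled[~observed, j] = model.predict(fingerprints[~observed])
    return filled  # complete 11-item matrix for multimodal training
```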
3.1.3. Feature Selection
Some of the 11 nonclinical data items were not useful for prediction, and inappropriate interpolation of their missing values could adversely affect the prediction results. Therefore, feature selection was used to remove the items that did not contribute to the prediction. This selection is based on the importance of the explanatory variables determined during construction of the predictive model with the Random Forest and XGBoost methods. First, predictive models for human CLtot and human Vdss were constructed using Random Forest or XGBoost with all 11 nonclinical data items as explanatory variables, which allowed the importance of each of the 11 variables to be evaluated. Models built from the top k most important variables were then compared for k = 1 to 11, and the most accurate subset of explanatory variables was selected. For the detailed analysis method, please refer to the paper.38) For CLtot prediction, feature selection chose rat CLtot, dog CLtot, human fu, and pKa acid for XGBoost, and rat CLtot, dog CLtot, human fu, and pKa acid for Deep Tensor. For Vdss prediction, it chose rat Vdss, dog Vdss, pKa acid, pKa base, and human fu for XGBoost, and dog Vdss, rat Vdss, pKa acid, pKa base, solubility, and human fu for Deep Tensor.
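The importance-based feature selection can be sketched as follows: rank the 11 nonclinical items by feature importance from a fitted model, then evaluate the top-k subsets for k = 1 to 11 and keep the best-scoring one. The cross-validation scheme and scoring metric below are illustrative assumptions, not the exact protocol of ref. 38.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import cross_val_score

def select_features(X_nonclinical, y, k_max=11, cv=5):
    """Return the indices of the best top-k subset of nonclinical items."""
    ranker = xgb.XGBRegressor(n_estimators=300)
    ranker.fit(X_nonclinical, y)
    order = np.argsort(ranker.feature_importances_)[::-1]  # most important first
    best_k, best_score = 1, -np.inf
    for k in range(1, k_max + 1):
        cols = order[:k]
        score = cross_val_score(xgb.XGBRegressor(n_estimators=300),
                                X_nonclinical[:, cols], y, cv=cv,
                                scoring="neg_mean_squared_error").mean()
        if score > best_score:
            best_k, best_score = k, score
    return order[:best_k]  # indices of the selected nonclinical items
```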
3.2. Results of CL and Vd Value Predictions
Table 1 shows the accuracy of 11 methods evaluated on a common evaluation data set: five conventional methods; machine learning models using only chemical structure (CS); multimodal models using CS and all 11 nonclinical data items; and multimodal models with some nonclinical data items removed through feature selection. First, we discuss the CLtot prediction results. Among the five conventional methods, those using monkey CLtot data, namely SSS monkey (geometric mean fold error (GMFE): 1.93; % of 2-fold error: 58.7%) and the fraction-unbound corrected intercept method (FCIM) (GMFE: 1.99; % of 2-fold error: 52.2%), showed the highest accuracy. However, these highly accurate methods are challenging to use because of the high cost of obtaining data from large animals such as dogs and monkeys and the ethical issues associated with large animal models. Among the multimodal models proposed in this study, which combine CS with nonclinical data interpolated for missing values, the models using all 11 items (CS + 11 features) achieved relatively high accuracy: XGBoost (GMFE: 2.06; % of 2-fold error: 58.7%) and Deep Tensor (GMFE: 2.11; % of 2-fold error: 52.2%). In addition, the models in which some features were removed by feature selection (CS + selected features), XGBoost (GMFE: 1.98; % of 2-fold error: 50.0%) and Deep Tensor (GMFE: 1.92; % of 2-fold error: 66.5%), were comparable with the conventional methods using monkey data. These results show that the predictive model can be improved by increasing the number of compounds used for training and by interpolating missing data with prepredicted values.
Table 1.

| CLtot | | | Vdss | | |
|---|---|---|---|---|---|
| Method | GMFE | % of 2-fold error | Method | GMFE | % of 2-fold error |
| SSS rat | 2.36 | 43.5 | SSS rat | 1.91 | 62.2 |
| SSS dog | 2.30 | 39.1 | SSS dog | 1.93 | 71.1 |
| SSS monkey | 1.93 | 58.7 | SSS monkey | 1.60 | 80.0 |
| SA | 2.33 | 45.7 | SA | 2.07 | 68.9 |
| FCIM | 1.99 | 52.2 | Øie–Tozer | 1.46 | 84.4 |
| XGBoost: Only CS | 2.40 | 50.0 | XGBoost: Only CS | 1.70 | 77.8 |
| XGBoost: CS + 11 features | 2.06 | 58.7 | XGBoost: CS + 11 features | 1.64 | 71.1 |
| XGBoost: CS + selected features | 1.98 | 50.0 | XGBoost: CS + selected features | 1.66 | 71.1 |
| Deep Tensor: Only CS | 2.44 | 45.7 | Deep Tensor: Only CS | 1.85 | 62.2 |
| Deep Tensor: CS + 11 features | 2.11 | 52.2 | Deep Tensor: CS + 11 features | 1.75 | 69.8 |
| Deep Tensor: CS + selected features | 1.92 | 66.5 | Deep Tensor: CS + selected features | 1.74 | 74.2 |
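For reference, the two accuracy metrics in Table 1 can be computed as follows under their common definitions: GMFE is the geometric mean fold error, i.e., 10 raised to the mean absolute log10 ratio of predicted to observed values, and “% of 2-fold error” is the percentage of compounds whose prediction falls within 2-fold of the observed value.

```python
import numpy as np

def gmfe(pred, obs):
    """Geometric mean fold error: 10 ** mean(|log10(pred / obs)|)."""
    return 10 ** np.mean(np.abs(np.log10(np.asarray(pred) / np.asarray(obs))))

def pct_within_2fold(pred, obs):
    """Percentage of predictions within 2-fold of the observed value."""
    ratio = np.asarray(pred) / np.asarray(obs)
    return 100.0 * np.mean((ratio >= 0.5) & (ratio <= 2.0))
```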
We summarize the Vdss prediction results as follows. The Øie–Tozer method (GMFE: 1.46; % of 2-fold error: 84.4%) showed the highest accuracy among the conventional methods evaluated, followed closely by SSS monkey (GMFE: 1.60; % of 2-fold error: 80.0%). The XGBoost multimodal model using CS data and nonclinical data with missing-value interpolation (GMFE: 1.64; % of 2-fold error: 71.1%) did not differ greatly from the conventional methods, and the feature-selected XGBoost model showed similar results (GMFE: 1.66; % of 2-fold error: 71.1%). For Deep Tensor, the CS + 11 features model yielded a GMFE of 1.75 with 69.8% of predictions within 2-fold, and the CS + selected features model a GMFE of 1.74 with 74.2% within 2-fold. Although feature selection slightly improved the overall accuracy, as reflected in the % of 2-fold error values, the Vdss prediction, unlike the CLtot prediction, remained more accurate with the traditional Øie–Tozer method based on animal scale-up data.
In this review, we first described the depletion of target molecules, which is one of the reasons for the stagnation of drug development, and introduced our in silico drug discovery method to address this problem. Our method overcomes the twin challenges of target deconvolution and polypharmacology and is expected to facilitate drug discovery while providing insights into compound–phenotype pathways and molecular mechanisms.8) We then demonstrated that CLtot and Vdss can be predicted with high accuracy by combining a multimodal model that takes compound structural information and nonclinical data as input with interpolation of missing nonclinical data by prediction. Because this method does not require new animal experiments, it can be used from the early stages of the drug development process.38,43)
We presented two examples of the application of AI technology in the drug discovery process: one in the drug target discovery process and the other in the prediction of pharmacokinetic parameters in the nonclinical testing process. Given that the overall success rate of drug development is extremely low, requiring a lengthy process and high costs, the use of AI is expected to reduce the probability of failure by making predictions before actual experiments and trials are conducted. The application of AI should thereby dramatically reduce the cost and time of development by reducing the number of experiments and tests.
In addition to the processes described above, many unexplored themes remain in the field of drug discovery, and expectations for future development are high. Many of the methods discussed here are still under development, and only a few have been put into practical use in actual drug discovery settings. Important technical issues in AI technology thus remain to be solved, such as the black box problem (i.e., AI models tend to be difficult to explain) and the quality and quantity of training data. It is hoped that these issues will be solved one by one, and that the practical application of ICT will increase the probability of success and process efficiency in drug research and development.
This article is based on partial results of research supported by JSPS KAKENHI (Grant Number JP20K12063) and a Grant-in-Aid from the Fugaku Trust for Medicinal Research.
The author declares no conflict of interest.