Quantitative prediction of hERG inhibitory activities using support vector regression and the integrated hERG dataset in AMED cardiotoxicity database

The inhibition of hERG potassium channel is closely related to the prolonged QT interval, and to assess the risk could greatly contribute to the development of safer therapeutic compounds. In the hit-to-lead optimization stage of drug development, quantitative prediction of hERG inhibitory activity is crucial to design drug candidates without cardiotoxicity risk. Here, we developed a hERG regression model combining support vector regression (SVR) and descriptor selection by non-dominated sorting genetic algorithm (NSGA-II) based on AMED cardiotoxicity database consisting of hERG blocking information built by integrating public and commercial databases. To construct a regression model, 6,561 compounds with IC50 and/or Ki values were derived from AMED cardiotoxicity database, and randomly separated into training set (70%) for model building and test set (30%) for performance evaluation. To avoid overfitting by employing many non-relevant explanatory variables, NSGA-II, a variation of genetic algorithm for multiple objective optimization, was used for descriptor selection in order to maximize Q2 and minimize RMSE in 5-fold cross validation and minimize the number of used descriptors spontaneously. The prediction performance was then compared to those of ADMET predictor, commercial software providing various ADMET property predictions. The SVR model recorded R2 of 0.594 and RMSE of 0.604 for test set, clearly exceeding those of ADMET predictor (0.134 and 0.690, respectively). The regression model is available at our home page (https://drugdesign.riken.jp/hERG).


Introduction
Drug safety is an important issue in drug discovery process [1]. While lack of efficacy and safety covered approximately 30% of the failures in clinical trials in the 2000s, cardiotoxicity, hepatotoxicity, genotoxicity, and phototoxicity are also frequently observed. It was reported that cardiotoxicity and hepatotoxicity were found in 30 to 40% of clinical trials and post-approval studies, respectively [1]. The human Ether-a-go-go Related Gene potassium channel (hERG) is strongly related to drug-induced QT interval prolongation and critical arrhythmia, Torsades de Pointes (TdP) [2,3]. Because hERG can be be bound with structurally diverse compounds, various drugs (e.g., terfenadine [4], cisapride [5], and sertindole [6]) have been withdrawn from the market due to the hERG inhibition. Since, the assessment of hERG inhibitory activity is of great importance in drug discovery, various computational prediction methods have been developed using both structure and ligand-based approaches [7,8]. These studies included statistical models based on the structural information of small molecule inhibitors, and structure-based approaches employing docking simulations using modeled 3D structures of hERG. Although the electronic microscopy structure of hERG was reported in 2017 [9], docking simulations with hERG are still difficult challenges because of the high structural flexibility. In contrast, recent increase of the bioactivity information about hERG inhibitors in public databases (e.g., ChEMBL, PubChem) have been accelerating the improvement of statistical models using machine learning techniques.
The previous studies about machine learning models for hERG inhibition were summarized by Wang [7] and Villoutreix [8]. Among the recent studies, Pred-hERG, reported by Braga et al. [10] in 2005, recorded correct classification rate of 0.84 by building the classification models based on Morgan fingerprints and Chemistry Development Kit descriptors for 5,984 compounds in ChEMBL. Schyman et al. combined 3D similarity conformation and 2D similarity ensemble approach, and achieved 69% sensitivity and 95% specificity on an independent external data set [11]. In the previous study, we integrated the hERG-associated information from ChEMBL [12], GOSTAR [13], NIH Chemical Genomics Center dataset registered in PubChem [14], and hERGCentral [15] into a dataset consisting of more than 291,000 structurally diverse compounds named AMED cardiotoxicity database [16], and developed a novel hERG classification model using support vector machine with descriptor optimization by non-dominated sorting genetic algorithm-II (NSGA-II), resulting in the superior prediction performance to 3 commercially available software with accuracy of 0.984, ROC score of 0.966, and kappa statistics of 0.733 [17].
Though discrimination models can cover larger chemical space because of the availability of larger number of assay information and are especially useful to filter hit compounds obtained from initial high throughput screening, quantitative prediction is needed in hit-to-lead optimization stage in which chemical derivatization of a potent compound are performed to improve various ADMET properties while maintaining its binding affinity to the target protein. Among the previous works listed by Viloutreix [8], the largest dataset used for regression study was 882 compounds by Cianchetta [18]. The limited number of training data could result in the narrower applicability domain of the prediction model. In this study, we developed a regression model for hERG inhibitory activity by applying support vector regression (SVR) to AMED cardiotoxicity database as successfully applied to discrimination analysis in the previous study. The prediction performance of the SVR model was then compared to those by commercially available software.

Data set
To build a regression model for hERG inhibition, hERG blockage information registered in AMED cardiotoxicity database was used. AMED cardiotoxicity database was developed to construct comprehensive database of small molecule inhibitors for cardiac ion channels by integrating information available in various public databases and in-house assay results measured in RIKEN. The hERG inhibitory activity information was derived from ChEMBL (v23), GOSTAR, the NIH Chemical Genomics Center (NCGC) dataset in PubChem bioassay, hERGCentral. Detailed procedures of the curation to format ontologies, the standardization of chemical structures to merge data entries, and the classification of hERG inhibitors and inactive compounds were reported previously [16]. Among the hERG inhibitory activity information, assay results reporting valid IC50 (or Ki, Kd) values were derived, then mean pIC50 value was calculated for each compound. As the results, a dataset consisting of 6,561 compounds with mean pIC50 values ranging from 3.015 to 9.854 was compiled. The distribution of the pIC50 values was shown in Figure 1. The mean and median values of the pIC50 of the data set were 5.541 and 5.377, respectively. To evaluate the regression models, the dataset was randomly separated into training set (70%, 4,630 compounds) for model building and test set (30%, 1,931 compounds) for performance evaluation.

Machine learning
To select a machine learning algorithm for detailed optimization, linear regression, SVR, random forest (RF), and deep neural network (DNN) were first applied to build hERG regression models using R packages implemented in Pipeline Pilot [19] with default parameter settings. Considering the results of initial evaluation, SVR was selected for further optimization by descriptor selection and parameter tuning. SVR is a nonlinear regression model by mapping the data vectors to a high-dimensional descriptor space using kernel trick. In this study, a radial basis function (RBF) was chosen as a kernel function. During the descriptor selection described below, the gamma for the RBF and C, the constant for the slacks variant, were fixed to the default values of Pipeline Pilot (1.0 and 1.0/nx, respectively (nx indicates the number of descriptors)), to reduce computational cost. After the descriptor selection, C and gamma were optimized by a grid search in the 5-fold cross validation on the training set. The search ranges of C and gamma are as follows: C = 0.5, 1.0, or 2.0, Gamma = 0.5/nx, 1.0/nx, or 2.0/nx, where nx represented the number of explanatory variables. Other parameters were set to the default values of Pipeline Pilot.

Descriptor selection
Following the successful application in the hERG discrimination model in previous study [17], NSGA-II, a genetic algorithm based multi-objective optimization algorithm, was applied for the descriptor selection to build a regression model for hERG inhibition. In this study, 1,111 descriptors and ECFP4 structural fingerprint were calculated using RDKit [20], CDK [21], and Mordred [22] as the source of explanatory variables. Then, descriptors showing zero variance among the dataset, descriptors with missing values, highly correlated descriptors whose correlation coefficient to other descriptors exceeded 0.9, and obviously non-relevant descriptors such as the number of unknown stereo atoms were removed. To decrease the complexity, only ECFP4 was considered among the various structural fingerprints because inclusion of multiple high-dimensional fingerprints could increase the data size and calculation time. Descriptors consisting of multiple values were treated as a set, when their usage or exclusion for the model building was considered in the descriptor selection. As the results, 309 sets of descriptors were considered for the descriptor selection by NSGA-II. The list of 309 descriptor sets was available in supporting information (Table S1).
NSGA-II is an optimization algorithm to find the Pareto optimal for multiple objective functions. First, an initial population is generated randomly. The individuals in each population are sorted according to their dominance levels. The individuals in the Pareto front, which no other individual shows higher values for all objective functions than, are defined as level 1. Then, the individuals only dominated (overtaken for all of the objective functions) by the level 1 individuals are defined as level 2, and so on. In addition to the dominance level, the crowding distance is also calculated to rank the individuals with the same dominance level. The crowding distance represents how close an individual is to its neighbors. In each generation, a given number of individuals were selected based on their dominance level. If individuals with the same dominance level remain, then those with a larger crowding distance are selected. The selected individuals become the parents of the next generation, and generate the offspring by mutation and crossover. The selected individuals and offspring compose the population of the next generation. This procedure is iterated until it reaches the specified number.
In this study, NSGA-II was employed to select descriptor set by simultaneously maximizing Q 2 , minimizing root mean square error (RMSE), and minimizing the number of used descriptors by 5-fold cross validation for 50 iterations. The parameters of NSGA-II were defined as follows: generations = 50, population size = 20, mutation rates = 0.02 (off to on) and 0.2 (on to off), and crossover rate = 0.6.

Performance metrics
To evaluate the prediction performance, coefficient of determination (R 2 ) and RMSE were calculated for each machine learning method. During the descriptor selection, Q 2 and RMSE were calculated from the 5-fold cross validation and used as the objective function, while the separated test set was used for the evaluation of the optimized SVR model. To compare the SVR model to commercial software, the built-in hERG regression model of ADMET predictor (version 8.0) [23] was also employed to predict the inhibitory activities of the test set compounds.

Comparison of machine learning algorithms
The prediction performances of machine learning algorithms using all 309 descriptor sets as the explanatory variables were summarized in Table 1. Among the machine learning techniques, SVR provided the lowest RMSE (0.653) for the prediction of the test set, while RF recorded slightly higher R 2 value (0.536). Since the initial assessment was performed only using the default parameters implemented in Pipeline Pilot, SVR contained more parameters to optimize, such as cost factor and gamma of RBF kernel function than RF, thus, SVR was selected for the further optimization by the NSGA-II based descriptor selection.

Descriptor selection by NSGA-II
The proceeding of descriptor selection by NSGA-II for 50 iterations was shown in Figure 2. The chart shows the mean value of Q 2 and RMSE in the cross validation and the number of used explanatory variables for descriptor sets derived in each generation. Along with the optimization, Q 2 and RMSE slightly improved with less explanatory variables, and almost reached the plateau around 30th iteration, indicating the sufficient optimization was achieved by the NSGA-II for 50 iterations. Among the derived descriptor sets, the descriptor set which showed the highest R 2 value consisted of 91 descriptors (323-dimensional vector in total) and recorded Q 2 of 0.526 and RMSE of 0.655 in the 5-fold cross validation of training set, and was used as the explanatory variables for the SVR models with the optimization of the parameter C and gamma value. The selected descriptors were listed in supporting information (Table S2).

Prediction performance of optimized SVR model
The prediction performances of the SVR model based on the 91 descriptors selected by NSGA-II was assessed using the separated test set. The SVR model recorded the R 2 value of 0.594 and RMSE of 0.604, showing steady improvement from the SVR model using all the calculated descriptors and default parameter settings built in the initial assessment. The optimized SVR model was then compared to commercially available hERG inhibitory activity prediction software. Figure  3 shows the scatter plots of the experimentally determined hERG pIC50 and the predicted pIC50 by the optimized SVR model and ADMET predictor. The presented SVR model clearly outperformed AMED predictor in the prediction of the test set with R 2 value of 0.594, compared to 0.134 by ADMET predictor. To assess how the SVR model outperformed ADMET predictor, the 5 compounds in the test set which recorded the largest differences in the absolute errors by the SVR model and ADMET predictor were listed in Table 2. Though the results contain both over and underestimate errors by ADMET predictor, all 5 compounds have a commonality in the existence of sterically bulky substituents around cationic tertiary amine. These bulky substituents make the assessment of the accessibility of the cationic atoms to the potential cation-pie interaction counterpart, hERG Y652 amino acid residue, difficult, which could result in the dramatic effect to the hERG inhibitory activities. By the use of comprehensive hERG information collected in AMED cardiotoxicity database, the SVR model could learn the potency of these scaffolds, indicating the importance of the structural diversity of the training dataset. Along with the structural diversity, our previous study revealed that the distributions of molecular properties of compounds in each database integrated in AMED cardiotoxicity database could differ due to their data source [16]. For example, ChEMBL and GOSTAR which consist of scientific literatures and patents tend to contain larger compounds compared to high throughput hERG assay results of LOPAC1280 library provided by NCGC, because manuscripts and patents often contain highly optimized structures compared to screening library compounds. Thus, more robust property distribution could be expected for AMED cardiotoxicity database by integrating various databases, which could contribute the higher prediction performance of the presented SVR model. To further evaluate the prediction performance of the SVR model, absolute errors between the experimental and predicted hERG pIC50 were compared to the structural similarities of the test set compounds to the training set compounds. Figure 4 showed the relationship by box plot with the bin size of 0.1. In the similarity range above 0.4, the mean, median, and upper limit of the absolute error constantly decreased when more similar compounds were contained in the training set. Considering the rapid increase of publicly available activity information for hERG inhibition, the use of AMED cardiotoxicity database developed by integrating multiple small molecule databases would contribute the prediction performance of the SVR model by providing structurally diverse small molecules as the training data.

Conclusion
In the presented study, the hERG regression model using SVR and NSGA-II based descriptor selection was developed. In the initial comparison of machine learning methods, SVR recorded the lowest RMSE and selected as the algorithm for further optimization. Considering that the initial comparison was performed using the default parameters of Pipeline Pilot for the model buildings, the prediction performance of DNN could be somewhat underestimated, because the wide customize capacity of the DNN, including the number of layers, the number of neurons, the use of dropout, was not utilized in the procedure. For the future work, rigorous hyperparameter optimization of DNN could improve the prediction performance of the hERG regression model.
Along with the previously reported discrimination analysis, NSGA-II based descriptor selection successfully trimmed the non-relevant descriptors for the prediction, and yielded the descriptor set consisting of 91 descriptors showing the prediction performance of R 2 =0.594 and RMSE=0.604, which clearly exceeded the prediction performance of ADMET predictor (R 2 =0.134, RMSE=0.690) for the test set. Along with the optimization by the descriptor selection, the increased size of training data by the use of AMED cardiotoxicity database could contribute the improvement of prediction performance, considering the observed correlation between the absolute error of the prediction and structural similarity to training data. To improve the generalization property of the prediction model, further analysis to diminish the training set dependency of the prediction performance could provide better applicability to novel chemical series in future. The prediction model is available at https://drugdesign.riken.jp/hERG/, and will be updated with the AMED cardiotoxicity database. Since the cardiac active potential was modulated by various ion channels, additional consideration of effect from other ion channels to hERG inhibition could improve the assessment of cardiotoxicity risk of small molecules. Thus, more comprehensive analysis of the ion channel inhibition is planned for the future study.