2024 Volume 72 Issue 2 Pages 173-178
Histone deacetylase 8 (HDAC8) is a zinc-dependent HDAC that catalyzes the deacetylation of nonhistone proteins. It is involved in cancer development and HDAC8 inhibitors are promising candidates as anticancer agents. However, most reported HDAC8 inhibitors contain a hydroxamic acid moiety, which often causes mutagenicity. Therefore, we used machine learning for drug screening and attempted to identify non-hydroxamic acids as HDAC8 inhibitors. In this study, we established a prediction model based on the random forest (RF) algorithm for screening HDAC8 inhibitors because it exhibited the best predictive accuracy in the training dataset, including data generated by the synthetic minority over-sampling technique (SMOTE). Using the trained RF-SMOTE model, we screened the Osaka University library for compounds and selected 50 virtual hits. However, the 50 hits in the first screening did not show HDAC8-inhibitory activity. In the second screening, using the RF-SMOTE model, which was established by retraining the dataset including 50 inactive compounds, we identified non-hydroxamic acid 12 as an HDAC8 inhibitor with an IC50 of 842 nM. Interestingly, its IC50 values for HDAC1 and HDAC3-inhibitory activity were 38 and 12 µM, respectively, showing that compound 12 has high HDAC8 selectivity. Using machine learning, we expanded the chemical space for HDAC8 inhibitors and identified non-hydroxamic acid 12 as a novel HDAC8 selective inhibitor.
Reversible acetylation/deacetylation of protein lysine residues is catalyzed by histone acetyltransferases/histone deacetylases (HDACs), which play an important role in regulating various biological functions.1) The HDAC isoforms are categorized into zinc-dependent and NAD+-dependent enzymes; the zinc-dependent HDACs are subdivided into class I (HDAC1, 2, 3, and 8), class IIa (HDAC4, 5, 7, and 9), class IIb (HDAC6 and 10), and class IV (HDAC11); NAD+-dependent HDACs are Sir2-like proteins as class III HDACs (SIRT1–7). HDAC8, a zinc-dependent HDAC, localizes to the nucleus or cytoplasm and catalyzes the deacetylation of non-histone proteins, such as cohesin.2) Furthermore, HDAC8 is implicated in the growth of various cancers, including adult T-cell leukemia.3–5) Therefore, HDAC8 inhibitors are expected to have potential as anticancer agents.
To date, many HDAC8 inhibitors have been reported and most of them possess a hydroxamic acid structure6–12) (Fig. 1). The hydroxamic acid structure is an important pharmacophore for HDAC inhibition because it coordinates the catalytic zinc ion of HDAC. However, it is associated with potential mutagenicity13,14); consequently, there is a need to discover non-hydroxamic acids as HDAC8 inhibitors. To achieve this, many researchers have undertaken exploratory studies; however, the chemical space of HDAC8 inhibitors is limited, and there are a few non-hydroxamic acid series that inhibit HDAC8.15) Although several HDAC8 inhibitors without hydroxamic acid scaffolds have been already reported, their application studies have been hardly performed. It might be caused by their cytotoxicity or unsafety. In addition, they lack structural diversity. Therefore, a novel approach that can expand the potential chemical space is necessary to identify novel HDAC8 inhibitors without hydroxamic acid scaffolds. In this study, we focused on machine learning to solve this problem.
Recently, machine learning has received much attention in the early stages of drug discovery studies.16) Machine learning has provided opportunities to expand the potential chemical space beyond the confines of current experimental techniques.17,18) For example, Collins and colleagues identified a new type of antibiotic that is structurally distinct from known antibiotics using a deep neural network.18) The key to achieving such success is the construction of a suitable machine-learning model and the organization of a training dataset appropriate for the model. In the field of medicinal chemistry, various machine learning models employ algorithms such as decision tree (DT), random forest (RF), support vector classifier (SVC), k-nearest neighbors (kNN), Gaussian naive Bayes (GNB), and deep neural network.18–20) Although we can easily access the reported machine learning models through open sources, it is usually difficult to select and construct the most suitable model for a specific target protein without empirical validation. In addition, it is essential to prepare an appropriate training dataset for the machine learning models. However, data of sufficient quality may not be obtained. For example, imbalanced data, where the distribution of active and inactive compound data is uneven, prevents the preparation of high-quality datasets and identification of active compounds.21) Various strategies to address imbalanced data are known, including oversampling of minority classes or undersampling of majority classes. Among the various oversampling methods, the synthetic minority oversampling technique (SMOTE)22) has been previously employed in drug discovery to enhance the prediction accuracy.23) In this study, SMOTE contributed to the identification of a novel HDAC8 inhibitor.
Herein, we describe the details of our drug discovery study of HDAC8 inhibitors based on a machine learning model combined with SMOTE.
Initially, learning datasets were prepared by extracting data on the structures of the compounds and their HDAC-inhibitory activities from the ChEMBL24) database. After curating the datasets by eliminating compounds without an IC50 value for HDAC8 and duplicated compounds, the other compounds were defined as active or inactive samples based on their pIC50 values. Compounds with a pIC50 value of more than 7 were labeled as active compounds, and the others were labeled as inactive compounds. As shown in Table 1, the distribution of active and inactive compounds was imbalanced, with a ratio of 1 to 8.28. The compounds in this dataset were transformed into input data suitable for machine learning. Several descriptors have been used for the construction of predictive models, including extended-connectivity fingerprint (ECFP)25) and PubChem fingerprint. ECFP, which is generated by a hash function, is frequently used for machine learning in cheminformatics. However, the bits of the fingerprint do not correspond to the structural fragment. This often makes it difficult to analyze what structural fragments contribute to the predictive activities of a compound. On the other hand, PubChem fingerprint, which is a structural fingerprint, has each bit position and corresponds to the structural fragment, enabling us to analyze how each model predicts compound activities from a chemical structure. To take into consideration the relationship between substructures and predictive activity in this study, we used PubChem fingerprint as a descriptor.
Target | Active | Inactive | Ratio |
---|---|---|---|
HDAC8 | 218 | 1805 | 1 : 8.28 |
Active compounds had pIC50 values greater than 7.
Each dataset was randomly split into training and testing datasets at a ratio of 8 : 2. The training dataset was applied to the training of five machine learning models (DT, RF, SVC, kNN, and GNB models), and their areas under the curves (AUCs) of receiver operating characteristic (ROC), which were estimated using the testing dataset, were compared (Table 2). The results indicated that the AUCs of the ROC were below 0.8 in all the models, which suggested their prediction accuracy was insufficient because values over 0.9 are required. We attributed this to the imbalance between the active and inactive compound data in the training dataset. Therefore, we applied oversampling to minority classes (active compounds) using SMOTE. Using SMOTE, the ratio of active-to-inactive-compound data was adjusted to 1 : 1, resulting in the construction of a balanced dataset. As a result, the AUCs consistently exceeded 0.9 in all four machine learning models, except for the GNB model, with the RF model exhibiting the highest predictive accuracy. Furthermore, the precision (0.94), recall (0.96), and F1 score (0.95) of the RF model combined with SMOTE (RF-SMOTE model) were superior to those of the RF model with an actual distribution. These results suggest that the RF-SMOTE model is useful for predicting the HDAC8 inhibitory activity of screening compounds.
Target | Actual Distribution | SMOTE | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
DT | RF | SVC | kNN | GNB | DT | RF | SVC | kNN | GNB | |
HDAC8 | 0.66 | 0.75 | 0.75 | 0.71 | 0.56 | 0.91 | 0.98 | 0.95 | 0.95 | 0.76 |
(DT, decision tree; RF, random forest; SVC, support vector classifier; kNN, k-nearest neighbors; GNB, Gaussian naïve Bayes).
Subsequently, we screened the Osaka University library using the scheme illustrated in Fig. 2. We performed screening using the trained RF-SMOTE model to predict 1566 active compounds. Then, 68 compounds, which are highly similar to compounds recorded in ChEMBL dataset, were eliminated, because the purpose of this study is to find a novel scaffold for HDAC8 inhibition. The other 1488 compounds were screened by absorption, distribution, metabolism, excretion and toxicity (ADMET) to eliminate those with inadequate drug-like properties. This screening was performed using ADMETlab2.0,26) and DataWarrior drug-likeness scores.27) Consequently, the top 50 compounds were selected as HDAC8 inhibitor candidates and were experimentally evaluated using an HDAC8 inhibition assay. The HDAC8-inhibitory activity of the 50 compounds at 10 µM was tested in a commercially available HDAC8 assay kit. Unfortunately, the results indicated that none of the compounds inhibited HDAC8 enzymatic activity by more than 50% (Supplementary Fig. S1).
In the context of the results of the first screening, we believe that the machine learning model needs to be improved. Because the top 50 compounds selected in the first screening were inactive, they were labeled as inactive samples and integrated into the training dataset for the second screening. The five machine learning models were retrained based on the integrated training dataset. As was the case with the first screening, the RF-SMOTE model showed the best AUC of the ROCs among all models (Table 3). Furthermore, the potent precision (0.94), recall (0.98), and F1 score (0.96) of the RF-SMOTE model were equal to or higher than those of the first screening test. Thus, the RF-SMOTE model was applied for a second screening.
Target | Actual Distribution | SMOTE | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
DT | RF | SVC | kNN | NB | DT | RF | SVC | kNN | NB | |
HDAC8 | 0.68 | 0.84 | 0.82 | 0.79 | 0.64 | 0.92 | 0.98 | 0.96 | 0.96 | 0.72 |
In the second screening of the Osaka University library using the retrained RF-SMOTE model (Fig. 3), the number of compounds predicted to be active decreased from 1566 to 235. Unlike in the first screening, none of the HDAC8-inhibitory activity hits were recorded in the ChEMBL database. In the screening using ADMETlab2.0, 62 compounds successfully met the screening criteria. In the second screening, DataWarrior drug-likeness scores were not employed, as screening through ADMETlab2.0 was deemed sufficient for filtering compounds with low drug-like properties. Instead, the docking scores for their binding to HDAC8 calculated using Glide software were utilized to sort the compounds and the top 50 compounds were selected for further analysis. Incidentally, the 50 compounds in the second screening were predicted to be negative in the first screening.
The top 50 compounds were subsequently subjected to the HDAC8-inhibitory activity assay (Fig. 4). Notably, non-hydroxamic acid compound 12 with 1,2,4-triazole-3-thione scaffold exhibited the most potent HDAC8-inhibitory activity with about 95% inhibition at a concentration of 10 µM. We then determined the IC50 value of compound 12. Compound 12 inhibited HDAC8 expression in a dose-dependent manner with an IC50 value of 0.87 ± 0.17 µM (Supplementary Fig. S2).
To understand why we were successful in the identification of compound 12 in the second screening but not the first screening, we attempted to analyze what structural fragments were predicted to strongly affect the HDAC8-inhibitory activity in the first and second prediction models. However, in the first screening, the prediction models could not catch the important structure contributing to the HDAC8-inhibitory activity. This resulted in high false positive rates. In contrast, the second prediction model captured substructures affecting the HDAC8-inhibitory activity, suggesting that the substructures expressed in SMARTS notation, such as “N–C–C–C–N” and “O–N–C–C–C,” would work negatively for HDAC8 inhibition. Compound 12 does not have these structures, suggesting that the second prediction model had a higher true positive rate due to capturing these substructures that are predicted to bring about a negative effect on HDAC8 inhibition.
Next, we evaluated the HDAC8 selectivity of compound 12 over other class I HDACs. Before experimentally testing the HDAC1 and HDAC3 inhibitory activities of 12, we predicted them using machine-learning models. Similar to the prediction of HDAC8 inhibition, we used SMOTE and established an RF-SMOTE model because the distributions of compounds labeled as active or inactive in the HDAC1/HDAC3 inhibitory activities were also imbalanced (Supplementary Table S1). The AUCs of the ROCs for HDAC1 and HDAC3 in the RF-SMOTE model were 0.94 and 0.95, respectively (Supplementary Table S2). In these models, compound 12 was predicted to be inactive for both HDAC1 and HDAC3. In actuality, the experimental results demonstrated that compound 12 exhibited weak inhibitory activity with IC50 values of 38 and 12 µM for HDAC1 and HDAC3 inhibition, respectively (Table 4). In contrast, vorinostat, a pan-HDAC inhibitor, inhibited HDAC1 and HDAC3 more strongly than it did HDAC8 in these assay systems. These results indicate that compound 12 shows high HDAC8 selectivity over other class I HDACs.
Compound | IC50 (µM) | ||
---|---|---|---|
HDAC1 | HDAC3 | HDAC8 | |
12 | 38 ± 17 | 12 ± 0.60 | 0.87 ± 0.17 |
Vorinostat | 0.046 ± 0.003 | 0.17 ± 0.025 | 1.80 ± 0.53 |
Finally, we present the mode of binding of compound 12 to HDAC8. It was simulated using Glide software (Fig. 5). These results indicated that the 1,2,4-triazole-3-thione moiety of compound 12 underwent tautomerization of the 1H-1,2,4-triazole-3-thiolate structure. Interestingly, the thiolate anion coordinated with the zinc ion chelated by Asp178, His180, and Asp267 at the active site of HDAC8 (Fig. 5a). The 1H-1,2,4-triazole-3-thiolate structure of compound 12 interacted with His143 and His180 via a π–π interaction (Fig. 5b). The piperidine moiety entered the pocket of HDAC8 and interacted with Tyr306 through hydrogen bonding (Fig. 5b). In contrast, a phenyl group was observed on the protein surface, implying that it did not directly contribute to the binding process (Fig. 5b). To obtain further structural insights, we need to test the activity of peripheral compounds of 12. However, the peripheral compounds were not included in the library and we have not evaluated their activity yet. This will be a future challenge for further development of HDAC8 inhibitors.
In recent years, machine learning has been regarded as a useful tool and widely applied in drug discovery studies. However, we must construct suitable models and prepare high-quality datasets to take advantage of machine learning. In our search for HDAC8 inhibitors, using SMOTE to improve the imbalanced data, an RF-SMOTE model was constructed to identify non-hydroxamic acid inhibitors. Following comprehensive screening and filtration, compound 12 emerged as an HDAC8 inhibitor with an IC50 value of 0.87 µM. Interestingly, this is substantially lower than the IC50 values of compound 12 for HDAC1 and HDAC3 inhibition (38 and 12 µM, respectively), indicating that compound 12 exhibits HDAC8 selectivity over other class I HDACs. These results suggested that the RF-SMOTE model enabled us to explore a broader chemical space for HDAC8 selective inhibitors, thereby facilitating the identification of non-hydroxamic acid HDAC8 inhibitors.
ChEMBL data were retrieved using the ChEMBL web source client in Python 3.9. Human HDACs were chosen and the dataset was filtered for retrieval of compounds with IC50 values and the removal of any duplicates. Subsequently, the IC50 values was converted to the pIC50 values, and compounds with a pIC50 greater than 7 were labeled as active and the others as inactive. Canonical SMILES of each structure were used to generate PubChem fingerprints using the PaDELPy package (https://github.com/ecrl/padelpy).
Machine Learning Models and PredictionThe HDAC8 dataset was randomly divided into training and testing datasets in a ratio of 8 : 2. The models were built using the Scikit-learn package with default parameters. SMOTE was implemented using the imbalanced-learn package. The AUCs of the ROC from the test set of the RocCurveDisplay module from Scikit-learn were used. For prediction, the canonical SMILES of the Osaka University library were converted into PubChem fingerprints and the fingerprints were used to predict with the output as a label of ‘active’ or ‘inactive’ for each compound. Compounds labeled ‘active’ were filtered for the next step.
Molecule Comparison to Known ChEMBL Active CompoundsTo eliminate the rediscovery of known compounds, the compounds predicted to be active were compared with the ChEMBL database. This was made possible using DataWarrior’s Get Similar Compounds from the ChEMBL Actives option. Using this option, DataWarrior compared the predicted compounds with the ChEMBL database based on their SkelSphere descriptors28) and calculated their Tanimoto similarity scores. Compounds that had a Tanimoto similarity score of 0.8 or higher compared to any known active ChEMBL compounds were filtered out.
ADMET Screening Using ADMETlab2.0The ADMETlab2.0 web resource (https://admetmesh.scbdd.com/) was used for ADMET screening. The compound was deemed to pass this screening if it met all the criteria.
Docking Simulation Using GlideDocking was performed using the Glide software. The protein crystal structure of HDAC8 was imported from the Protein Data Bank (ID:1T69). Grid generation was performed excluding water molecules. Ligand preparation was performed in LigPrep with an OPLS4 force field and ionization at pH 7.4 using Epik, and a metal-binding site was added. The structures were desalted and tautomers were generated. For sorting, docking was performed in the SP mode, whereas for the binding mode calculation of compound 12, docking was performed in XP mode.
HDAC8 Inhibitory AssayThe HDAC8 inhibitory assay was performed using the CycLex® HDAC8 Deacetylase Fluorometric Assay Kit. To screen the top 50 compounds, the final concentration of each compound was set to 10 µM. Following their protocol, the reaction was performed in a black 96-well plate at 30 °C for 30 min in a shaker at 500 rpm. After the stop solution was added, fluorescence measurements were performed using an EnSight Fluorescence plate reader at an excitation wavelength of 380 nm, an emission wavelength of 460 nm, and 100 flashes. The experiment was performed in triplicate. The IC50 of compound 12 (Enamine Ltd., Z238423340; its purity and mass spectrum were provided in Supporting Information) was calculated using GraFit software.
HDAC1 and HDAC3 Inhibitory AssayThe reaction mixture containing the HDAC assay buffer, compound 12, and recombinant HDAC was mixed in a 96-well plate and incubated at 30 °C for 30 min in a shaker at 500 rpm. The reaction substrate, which was a peptide with the sequence RHKK(Ac)AMC, was added, and the mixture was incubated again under the same conditions. Subsequently, the developer solution was added, and the mixture was incubated again under the same conditions. The fluorescence level was measured using an EnSight Fluorescence plate reader with an excitation wavelength of 380 nm, an emission wavelength of 460 nm, and 100 flashes. The experiment was performed in triplicate. The IC50 was calculated using GraFit software.
This work was supported by Grants for JSPS KAKENHI (21K15226 to Y.Y.), and the Platform Project for Supporting Drug Discovery and Life Science Research (Basis for Supporting Innovative Drug Discovery and Life Science Research; BINDS) from the Japan Agency for Medical Research and Development (AMED) (JP23ama121054).
The authors declare no conflict of interest.
This articles contains supplementary materials.