Chemical and Pharmaceutical Bulletin
Online ISSN : 1347-5223
Print ISSN : 0009-2363
ISSN-L : 0009-2363
Regular Articles
CBDPS 1.0: A Python GUI Application for Machine Learning Models to Predict Bitter-Tasting Children’s Oral Medicines
Guoliang Bai, Tiantian Wu, Libo Zhao, Xiaoling Wang, Shan Li, Xin Ni

2021 Volume 69 Issue 10 Pages 989-994

Abstract

Bitter tastes are innately aversive and are thought to help protect animals from consuming poisons. Children are extremely sensitive to drug tastes, and their compliance is especially poor with bitter medicines. Therefore, judging whether a drug is bitter and adopting flavor-correction and taste-masking strategies are key to solving the problem of drug compliance in children. Although various machine learning models for bitterness and sweetness prediction have been reported in the literature, no prediction model or bitterness database dedicated to children's medication has yet been reported. In this study, we trained four different machine learning models to predict bitterness. The goal of this study was to develop and validate a machine learning tool called the "Children's Bitter Drug Prediction System" (CBDPS), built with Tkinter, which predicts the bitterness of a medicine based on its chemical structure. Users can enter the Simplified Molecular-Input Line-Entry System (SMILES) formula for a single compound or for multiple compounds, and CBDPS will predict the bitterness of children's medicines made from those compounds. The best-performing XGBoost–Molecular ACCess System (XgBoost–MACCS) model yielded an accuracy of 88% under cross-validation.

Introduction

Most children are administered oral medications to treat various diseases when they are sick. However, some children refuse to take medications for reasons that include the child's age, difficulty swallowing, and taste preferences. In pediatric patients, unpleasant drug taste is the most common cause of treatment refusal.1) Humans can perceive five different taste modalities: bitter, salty, sour, sweet and umami.2,3) Bitterness is the primary culprit in oral medication noncompliance; more than 90 percent of pediatricians report that taste and palatability are the main obstacles to completing treatment.4) Bitterness is related to toxicity, and many bitter compounds are derived from plants. These bitter compounds evolved in plants to deter consumption; therefore, animals tend to reject substances that taste bitter.

The sense of taste is controlled by different cell types that express unique receptors. Recently, it was shown that taste receptors are located not only on the tongue but also in other nontaste tissues that express G-protein coupled receptors (e.g., the liver, pancreas, brain, testes, and sperm).5,6) Twenty-five members of the Taste Receptor Type 2 (TAS2R) bitter receptor family have been reported over the past ten years; these receptors are selectively sensitive to specific compounds and are genetically quite diverse.7,8) Many molecules have a bitter taste, including amino acids, alkaloids, glucosides, flavonoids and terpenoids. Many common dietary phytonutrients in fruits, vegetables, oral pharmaceuticals and herbal medicine-based drugs can cause bitterness reactions, and this bitterness is the main reason for poor compliance in children.9–11)

It is not easy for humans to taste candidate drugs because different people have different experiences, which leads to inconsistent evaluation results. Moreover, some drugs are potentially toxic, so human taste testing requires ethical approval and comprehensive toxicology research. An electronic tongue may be a good choice for assessing a drug's taste characteristics through an array of sensors, but it is limited to evaluating water-soluble drugs, and the pH of the formulation and the excipients may affect sensor sensitivity.12) Therefore, bitter-taste evaluation technology is highly significant in drug research and development, taste correction and taste masking. However, developing taste evaluation technology is not limited to a single pharmaceutical field; it requires multidisciplinary efforts. In this space, chemoinformatics based on computer models plays an important role in supporting and advancing research related to taste chemistry.13)

Research attention has shifted toward faster and more efficient machine learning methods that rely on a drug's chemical structure. Several computational approaches have been developed to predict chemical bitterness. For example, "BitterPredict" is a machine learning classifier that predicts whether a compound is bitter based on its chemical structure and has achieved an accuracy of 80%.14) "BitterSweetForest" was the first open access model based on the KNIME workflow to provide a platform to predict bitter and sweet chemical compound tastes using molecular fingerprints and a random forest-based classifier. The BitterSweetForest model yielded an accuracy of 95% and an area under the receiver operating characteristic (ROC) curve (AUC) of 0.98 under cross-validation.11) Quantitative structure–activity relationship (QSAR) models have been developed to build mathematical relationships between chemical structures and their respective properties.11,15) The absorption, distribution, metabolism, excretion (ADME)/Tox machine learning model was developed to predict the bitterness of compounds.14) Many bitterness prediction models have been reported in the literature; currently, the best-known public bitterness database is BitterDB (a database of bitter compounds), which was established by the Hebrew University of Jerusalem and contains 1041 compounds consisting of both synthetic compounds and natural monomers extracted from plants.14,16,17) However, many of the compounds in BitterDB are not ingredients of medicines. Therefore, we built a bitterness prediction model to construct a database of oral drugs for children.

In this study, we developed a machine learning system named the “Children’s Bitter Drug Prediction System” (CBDPS) based on Tkinter and built a “bitter database for children’s oral medication,” which uses an XGBoost–Molecular ACCess System (XgBoost–MACCS) model to predict medicinal bitterness based on chemical structure. Users can directly input the Simplified Molecular–Input Line–Entry System (SMILES) files of the main drug ingredients to predict the taste of children’s medicine.

Experimental

Model Dataset Preparation

The training datasets used in this study were collected from both the published literature and open databases. The key step in building the machine learning model is to construct datasets containing both positive and negative samples. The chemical structures of 2367 compounds, both artificial and natural, consisting of nonbitter compounds (1456 structures, including tasteless and sweet compounds) and bitter compounds (911) (Table 1), were extracted from BitterDB16,17) and other available published data.3,14,18) Using the number of bitter molecules as a benchmark, the pandas.DataFrame.sample function randomly samples nonbitter molecules at a 1:1 ratio to construct a new dataset, yielding a dataset that balances bitter and nonbitter samples. The train_test_split function (train_size = 0.8) then divides the balanced data into a training set and a validation set at an 8:2 ratio (a sketch of this step is given after Table 3). The training set samples are used for model training and fitting, the validation set samples are used to adjust the model parameters and perform a preliminary evaluation of the model's ability, and the test set is used to evaluate the generalizability of the final model. The data composition is shown in Tables 2 and 3.

Table 1. Data Sources and Quantities
Data category | Reference | Number
Bitter | Rojas et al. (Theor. Chem. Acc., 2016)27) | 81
Bitter | Sarah Rodgers (J. Chem. Inf. Model., 2006)28) | 29
Bitter | Fenaroli's Handbook of Flavor Ingredients | 33
Bitter | Biochemical Targets of Plant Bioactive Compounds by Gideon Polya | 39
Bitter | BitterDB | 592
Bitter | The Good Scents Company Database | 43
Bitter | Ayana Dagan-Wiener et al. (Scientific Reports, 2017) | 94
Nonbitter | Ayana Dagan-Wiener et al. (Scientific Reports, 2017) | 66
Tasteless | Rojas et al. (Theor. Chem. Acc., 2016) | 133
Tasteless | Fenaroli's Handbook of Flavor Ingredients | 3
Tasteless | ToxNet | 72
Sweet | Rojas et al. (Theor. Chem. Acc., 2016) | 433
Sweet | Fenaroli's Handbook of Flavor Ingredients | 426
Sweet | Biochemical Targets of Plant Bioactive Compounds by Gideon Polya | 32
Sweet | SuperSweet | 199
Sweet | The Good Scents Company Database | 158
Table 2. Dataset Based on Chemicophysical Descriptors
Data category | Training set | Validation set | Test set
Bitter data | 710 | 178 | 91
Nonbitter data | 710 | 178 | 21

Table 3. Dataset Based on MACCS Fingerprints
Data category | Training set | Validation set | Test set
Bitter data | 648 | 162 | 91
Nonbitter data | 648 | 162 | 21
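As referenced above, the balancing and 8:2 splitting step can be illustrated with a short sketch. This is a minimal example, assuming a hypothetical compounds.csv file with "smiles" and binary "label" columns; it is not the authors' actual preprocessing script.

```python
# Minimal sketch of the balancing and 8:2 splitting step; the file name and the
# "smiles"/"label" column names are assumptions for illustration only.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("compounds.csv")        # hypothetical file: SMILES + bitter label (1/0)
bitter = data[data["label"] == 1]
nonbitter = data[data["label"] == 0]

# Down-sample the nonbitter compounds to a 1:1 ratio with the bitter compounds.
balanced = pd.concat([bitter, nonbitter.sample(n=len(bitter), random_state=42)])

# Divide the balanced data into training and validation sets at an 8:2 ratio.
train_df, valid_df = train_test_split(balanced, train_size=0.8,
                                      stratify=balanced["label"], random_state=42)
print(len(train_df), len(valid_df))
```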

Extracting Children’s Oral Medication Data

The data on children's oral chemical medicines (COCMs) were collected from the WHO Model List of Essential Medicines for Children (WHO-ELMC) and from the medical data of 30 Chinese children's hospitals, including Beijing Children's Hospital. The WHO-ELMC dataset contains 319 medicines, of which 158 are oral chemical medicines. The 30 children's hospitals in China use 206 oral chemical medicines after proprietary Chinese medicines are excluded. The Key Project from the Ministry of Science and Technology of China (Grant No. 2018ZX09721003, in progress) lists 129 medicines, 68 of which are oral chemical medicines. In this study, 222 children's oral chemical medicines were obtained after combining these sources (Table 4 and supplementary Table S1).

Table 4. Children's Oral Chemical Medicines
Data source | Oral chemical medicines
WHO Model List of Essential Medicines for Children | 158
Beijing Children's Hospital and 30 other children's hospitals in China | 206
Key Project from the Ministry of Science and Technology of China (Grant No. 2018ZX09721003, in progress) | 68
Combined medicines (all sources) | 222

Evaluation Metrics

The accuracy (ACC) reflects the proportion of correctly classified samples among all samples:

  ACC = (TP + TN) / (TP + TN + FP + FN)

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.

Precision (positive predictive value, PPV) reflects the proportion of true positives among all samples that the model predicts to be positive:

  PPV = TP / (TP + FP)

Sensitivity (true positive rate, TPR) reflects the proportion of actual positive samples that the model correctly predicts as positive:

  TPR = TP / (TP + FN)

Specificity (true negative rate, TNR) reflects the proportion of actual negative samples that the model correctly predicts as negative:

  TNR = TN / (TN + FP)

The F1-score is the harmonic mean of precision and recall:

  F1 = 2 × PPV × TPR / (PPV + TPR)

The ROC curve plots the true positive rate against the false positive rate as the classification threshold is varied. Its advantage is that, even when the distribution of positive and negative samples changes, the shape of the ROC curve remains largely unchanged. Therefore, this evaluation index can reduce interference caused by different test sets and measure the performance of the model itself more objectively.19)
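For reference, all of the metrics above can be derived from a confusion matrix with scikit-learn; the short sketch below uses made-up labels and scores, not the study's validation data.

```python
# Illustrative computation of ACC, PPV, TPR, TNR, F1 and AUC with scikit-learn;
# y_true and y_score are toy placeholders, not the study's data.
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.3, 0.6, 0.8, 0.1])
y_pred = (y_score >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
acc = (tp + tn) / (tp + tn + fp + fn)   # accuracy
ppv = tp / (tp + fp)                    # precision
tpr = tp / (tp + fn)                    # sensitivity
tnr = tn / (tn + fp)                    # specificity
f1 = f1_score(y_true, y_pred)           # harmonic mean of PPV and TPR
auc = roc_auc_score(y_true, y_score)    # area under the ROC curve
print(acc, ppv, tpr, tnr, f1, auc)
```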

This study mainly trains models based on the random forest (RF) and XGBoost (XgB) algorithms using chemicophysical descriptors and Molecular ACCess System (MACCS) fingerprints. The RF algorithm integrates multiple trees using the ensemble learning concept; its basic unit is the decision tree, and it belongs to the ensemble learning branch of machine learning. The working principle is to construct many decision trees during training (usually more than 100), each of which uses a subset of the features and data points. After training, the predictions made by the individual decision trees are aggregated to produce a final prediction.20) XgB is a newer machine learning method introduced by Chen and He,21) which is based on principles similar to those of gradient boosting machines (GBMs) but adopts a more rigorous model to control overfitting. It integrates multiple tree models and improves on boosting: in each iteration it computes the residual of the current ensemble and then fits a new tree that further reduces this residual in the gradient direction. Finally, all the trees are combined linearly to obtain the model, which can exploit the parallel computing capability of multicore central processing units (CPUs) to improve the accuracy and speed of calculation. It is a scalable, end-to-end tree boosting system.22) Published results show that XgB performs better than RF in some problem domains involving difficult learning tasks.23)
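For orientation, the two learners can be set up as in the hedged sketch below, using scikit-learn's RandomForestClassifier and the xgboost package; the synthetic 0/1 matrix merely stands in for the descriptor or fingerprint features and bitterness labels used in the study.

```python
# Hedged sketch of the RF and XgB classifiers compared in this study; the random
# 0/1 matrix stands in for MACCS-style features and bitter/nonbitter labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 167))        # placeholder fingerprint matrix
y = rng.integers(0, 2, size=200)               # placeholder labels

rf = RandomForestClassifier(n_estimators=400, max_depth=9, random_state=0).fit(X, y)
xgb = XGBClassifier(n_estimators=100, max_depth=8, subsample=0.8,
                    eval_metric="logloss").fit(X, y)
print(rf.predict(X[:3]), xgb.predict(X[:3]))
```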

Chemicophysical descriptors reflect the physical and chemical properties of compounds, including atomic mass, charge number, electronegativity, number of bonds, structural information and number of rings. We used ChemoPy24) to calculate the chemicophysical descriptors and selected ten groups that describe the physical and chemical properties and structures of molecules from different angles, for a total of approximately 560 molecular descriptors. The MACCS keys are a structural fingerprint developed by Molecular Design Limited (MDL) and are widely used in chemoinformatics as a fast method for screening substructures in molecular databases. The structural fragments to be matched were defined by experts in the field, and each fragment corresponds to a bit at a fixed position in the fingerprint: the bit is set to 1 if the fragment is present in the molecule and 0 otherwise. The MACCS keys comprise 166 such bits; in RDKit, the fingerprint is stored as a 167-bit vector.25,26) We trained four machine learning models (RF + descriptors, RF + MACCS, XgB + descriptors, and XgB + MACCS).
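The MACCS fingerprint calculation can be reproduced with RDKit as in the sketch below; the caffeine SMILES string is only an illustrative input, and the actual CBDPS feature pipeline may differ.

```python
# Sketch of MACCS key generation with RDKit; caffeine is used only as an example input.
import numpy as np
from rdkit import Chem
from rdkit.Chem import MACCSkeys

mol = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")   # caffeine
fp = MACCSkeys.GenMACCSKeys(mol)                           # 167-bit vector (bit 0 unused)
bits = np.array(list(fp))                                  # 0/1 feature vector for the classifier
print(bits.shape, int(bits.sum()))
```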

Results and Discussion

Algorithms and Updates

The most important parameters of XgB are n_estimators and max_depth, which serve the same functions as in the RF algorithm. As shown in Table 5 and Fig. 1E, the models built on MACCS fingerprints classify better than those built on chemicophysical descriptors, and the XgB models perform better than the RF models. The best result was obtained with the XgB + MACCS combination, whose accuracy reached 0.88. Except for the RF + descriptors combination, the sensitivity of the other three combinations is higher than or equal to their specificity, indicating that the models can accurately identify bitter compounds.

Table 5. Classification Results of Different Training Models
Model | Characteristic attribute type | Optimal parameter list | ACC | PPV | TPR | TNR | F-score
RF + descriptors | descriptors | n_estimators = 400, max_features = auto, max_depth = 9 | 0.833 | 0.852 | 0.830 | 0.838 | 0.826
RF + MACCS | MACCS fingerprints | n_estimators = 1100, max_depth = 9, max_features = auto | 0.848 | 0.846 | 0.846 | 0.848 | 0.848
XgB + descriptors | descriptors | n_estimators = 100, max_depth = 9, subsample = 0.8 | 0.865 | 0.871 | 0.873 | 0.846 | 0.872
XgB + MACCS | MACCS fingerprints | n_estimators = 100, max_depth = 8, subsample = 0.8 | 0.882 | 0.880 | 0.889 | 0.880 | 0.881
Fig. 1. ROC Curve of the Training Model and Classification Results

Each point on the ROC curve corresponds to one classification threshold; the curve plots the true positive rate against the false positive rate and is commonly used to evaluate binary classifiers. The ROC curves of the four models are shown in Figs. 1A–D; their AUC values are 0.92, 0.93, 0.93 and 0.95, respectively. XgB + MACCS has the largest AUC, which means that the XgB + MACCS model performs best. Therefore, in subsequent experiments, we used the XgB + MACCS combination to establish the bitterness prediction model.
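The paper does not state how the optimal parameter values in Table 5 were found; one common way to search n_estimators and max_depth is a cross-validated grid search, sketched below under the assumption that X_train and y_train hold the MACCS training matrix and bitterness labels from the earlier split.

```python
# Hypothetical grid search over the key XgB parameters; X_train/y_train are assumed
# to come from the training split described in the Experimental section.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {"n_estimators": [100, 400, 1100],
              "max_depth": [8, 9],
              "subsample": [0.8, 1.0]}
search = GridSearchCV(XGBClassifier(eval_metric="logloss"),
                      param_grid, scoring="roc_auc", cv=5)
# search.fit(X_train, y_train)
# print(search.best_params_, search.best_score_)
```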

Feature Importance

This study calculates 10 groups of feature descriptors with ChemoPy, a third-party Python library, for a total of 560 molecular descriptors. Because the information expressed by some descriptors is redundant, feature selection must be applied to the molecular descriptors before constructing the model. Here, a point-biserial correlation coefficient, the Pearson correlation coefficient and a support vector machine-recursive feature elimination (SVM-RFE)-based feature selection method were used to select the molecular descriptors. First, the point-biserial correlation coefficient was used to measure the correlation between each attribute and the result, and only features with a correlation coefficient greater than 0.05 were retained, so that the retained variables correlate with the result as strongly as possible. Then, the Pearson correlation coefficients between the feature attributes were calculated; a coefficient greater than 0.9 indicates a strong correlation, and for each such pair the feature with the lower correlation with the result was deleted. Finally, the SVM-RFE method was used to rank the importance of all remaining features, eliminate the least important ones, and retain the most appropriate features, reducing the number of descriptors from 560 to 114. Table 6 lists the correlation coefficients between some of the feature descriptors and the results.

Table 6. Correlation Coefficients for Some of the Feature Descriptors
Feature descriptor | Correlation coefficient | Meaning
PEOEVSA7 | 0.389 | MOE-type descriptors using partial charges and surface area contributions
bcute1 | 0.379 | Burden descriptors based on atomic electronegativity
UI | 0.331 | Unsaturation index
Smax29 | 0.320 | Sum of E-State of atom type: aaN
bcute9 | 0.319 | Burden descriptors based on atomic electronegativity
nring | 0.285 | Number of rings
mChi1 | 0.282 | Mean chi1 (Randić's connectivity index)
bcute15 | 0.281 | Burden descriptors based on atomic electronegativity
slogPVSA5 | 0.278 | MOE-type descriptors using SLogP contributions and surface area contributions
slogPVSA7 | 0.278 | MOE-type descriptors using SLogP contributions and surface area contributions
naro | 0.275 | Number of aromatic bonds
Chi6ch | 0.265 | Simple molecular connectivity Chi indices for cycles of 3-6
MRVSA2 | 0.264 | MOE-type descriptors using MR contributions and surface area contributions
EstateVSA1 | 0.247 | MOE-type descriptors using Estate indices and surface area contributions
MATSm4 | 0.241 | Moran's autocorrelation descriptors based on atomic mass
bcute12 | 0.235 | Burden descriptors based on atomic electronegativity
Smax28 | 0.234 | Sum of E-State of atom type: dsN
Smax11 | 0.233 | Sum of E-State of atom type: dsCH
GATSm2 | 0.230 | Geary autocorrelation descriptors based on atomic mass
IC2 | 0.230 | Information content with order 2 proposed by Basak
bcutm8 | 0.229 | Burden descriptors based on atomic mass
MATSv4 | 0.225 | Moran's autocorrelation descriptors based on atomic van der Waals volume
bcutv3 | 0.224 | Burden descriptors based on atomic volumes
nnitro | 0.223 | Number of N atoms
PEOEVSA6 | 0.216 | MOE-type descriptors using partial charges and surface area contributions
Smax17 | 0.214 | Sum of E-State of atom type: aasC
QNmin | 0.212 | Most negative charge on an N atom
QNmax | 0.209 | Most positive charge on an N atom
ncarb | 0.206 | Number of C atoms
… (table truncated)
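A hedged sketch of the three-stage descriptor selection described above is given below; it uses scipy and scikit-learn as stand-ins, with the thresholds taken from the text (0.05, 0.9, 114 retained features). The authors' actual selection code is not published here, and the function and variable names are illustrative.

```python
# Hedged sketch of the three-stage descriptor selection: point-biserial filtering,
# Pearson redundancy removal, then SVM-RFE. X is a descriptor DataFrame, y the labels.
import pandas as pd
from scipy.stats import pointbiserialr
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

def select_descriptors(X: pd.DataFrame, y: pd.Series, n_keep: int = 114) -> list:
    # 1) keep descriptors whose correlation with the label exceeds 0.05
    kept = [c for c in X.columns if abs(pointbiserialr(y, X[c])[0]) > 0.05]
    X = X[kept]
    # 2) for descriptor pairs with Pearson correlation > 0.9, drop the weaker one
    corr = X.corr().abs()
    drop = set()
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] > 0.9:
                weaker = min((a, b), key=lambda c: abs(pointbiserialr(y, X[c])[0]))
                drop.add(weaker)
    X = X.drop(columns=sorted(drop))
    # 3) SVM-RFE: recursively eliminate the least important remaining descriptors
    rfe = RFE(SVC(kernel="linear"), n_features_to_select=min(n_keep, X.shape[1])).fit(X, y)
    return list(X.columns[rfe.support_])
```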

Performance Comparison between the Different Models

Among the five models (Table 7), the accuracy and precision of BitterSweetForest are the highest (0.96 and 0.95). The CBDPS developed here falls in the middle of the range (0.88 and 0.87). However, because the different models were trained and evaluated on different datasets, these figures are not directly comparable.

Table 7. Performance Comparison between the Different Models
Model name | Accuracy | Precision | Sensitivity | Specificity
CBDPS | 0.88 | 0.87 | 0.89 | 0.87
BitterX | 0.87 | >0.9 | >0.9 | >0.9
BitterPredict | 0.83 | 0.66 | 0.77 | 0.86
E-bitter | 0.93 | 0.92 | 0.95 | 0.9
BitterSweetForest | 0.96 | 0.95 | – | –

CBDPS Functional Modules

In this study, the cross-platform Python language was used to develop CBDPS with a graphical user interface (GUI). The system judges whether a drug is bitter according to the SMILES representations of the main components of the children's drug. Because CBDPS is a small system that must run across platforms, Tkinter was an appropriate choice for developing the GUI.

The main interface is divided into three functional modules: training model, data input and bitterness prediction, and results display (Fig. 2).

Fig. 2. The CBDPS GUI
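A minimal Tkinter skeleton showing this three-module layout is sketched below; the widget labels and callbacks are illustrative placeholders and not the actual CBDPS source.

```python
# Minimal Tkinter skeleton of the three CBDPS modules; callbacks are placeholders.
import tkinter as tk
from tkinter import filedialog, messagebox

root = tk.Tk()
root.title("CBDPS 1.0")

# Module 1: retrain the model when new data become available
tk.Button(root, text="Train model",
          command=lambda: messagebox.showinfo("CBDPS", "training placeholder")).pack(fill="x")

# Module 2: enter a single SMILES string or load a file of compounds
smiles_entry = tk.Entry(root, width=60)
smiles_entry.pack(fill="x")
tk.Button(root, text="Load SMILES file",
          command=lambda: filedialog.askopenfilename()).pack(fill="x")

# Module 3: display the prediction result
result_label = tk.Label(root, text="Prediction: -")
result_label.pack(fill="x")

root.mainloop()
```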

  • (1) Training Model

    • The collected data (and consequently the training data) are limited; therefore, the model can be retrained once more data have been collected, which should improve its accuracy.

  • (2) Data Input and Bitterness Prediction

    • Users can input the SMILES formula for a single compound; when many compounds need to be predicted, users can instead upload a file for batch prediction. The system parses the input, calculates the MACCS fingerprint of each SMILES string as the feature attribute, and then predicts bitterness with the trained model (see the sketch after this list).

  • (3) Results Display

    • If the input is a single SMILES string, the result is displayed immediately; if the input is a file with multiple compounds, the results are saved to a downloadable file. The results page also displays the model's performance indicators and its ROC curve. The current experiments were executed on a Windows-based computer equipped with an Intel Xeon W CPU @ 4.0 GHz and 32 GB of RAM.
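The sketch below illustrates the prediction step referenced in module (2): SMILES strings are converted to MACCS bits and passed to the trained classifier. The model file name and helper function are assumptions, not the shipped CBDPS code.

```python
# Illustrative prediction pipeline: SMILES -> MACCS bits -> trained XgB classifier.
# "cbdps_model.pkl" and predict_bitterness() are hypothetical names.
import pickle
import numpy as np
from rdkit import Chem
from rdkit.Chem import MACCSkeys

def predict_bitterness(smiles_list, model_path="cbdps_model.pkl"):
    with open(model_path, "rb") as f:
        model = pickle.load(f)                       # previously trained XgB + MACCS model
    feats, parsed = [], []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:                              # skip SMILES that cannot be parsed
            continue
        feats.append(np.array(list(MACCSkeys.GenMACCSKeys(mol))))
        parsed.append(smi)
    preds = model.predict(np.vstack(feats))
    return dict(zip(parsed, preds))                  # 1 = predicted bitter, 0 = nonbitter
```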

Children’s Bitter Oral Chemical Medicines

The applicability of a model is as important as its development. Among the 222 kinds of oral children's medicines (OCMs) obtained from the initial screening and collection, consulting the literature and BitterDB identified 84 known bitter medicines. Of the remaining 138 medicines, the CBDPS model predicted 105 to be bitter. Thus, a total of 189 known and predicted bitter medicines were obtained (Table 8 and supplementary Table S2). This result indicates that more than 85% of oral medications for children are bitter, which explains children's poor compliance with oral medication and makes it difficult for them to take clinically prescribed medicines. Given the bitterness of children's medicines, appropriate taste-correction and taste-masking strategies, using the flavors and dosage forms that children prefer, should be selected.

Table 8. Bitter Children's Oral Chemical Medicines
Oral chemical medicines | Known as bitter | Predicted as bitter | Total
222 | 84 | 105 | 189

Bitterness Database of Children’s Oral Chemical Drugs

We created a web-based database for the 189 types of OCMs identified as bitter to support the development of taste correction or taste masking for children's drugs. The database can be searched by drug name, chemical structure, CAS registry number, and treatment-related disease. It also provides information on the molecular properties of the related drugs (molecular formula, molecular weight, number of hydrogen-bond donors and acceptors, lipid-water partition coefficient, number of aromatic rings, etc.), the source of the bitterness assignment (literature or prediction), the administration method and dosage, identifiers for the different compounds (SMILES structural formula and International Union of Pure and Applied Chemistry (IUPAC) name), and links to the PubChem and DrugBank entries for these compounds. We hope to make this a publicly available, electronically searchable database of bitter COCMs.
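The molecular properties listed for each database entry can be computed with RDKit, as in the hedged sketch below; the caffeine SMILES string is used only as an example input, and the property names are illustrative rather than the database's actual field names.

```python
# Sketch of computing the database's per-drug molecular properties with RDKit.
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski, rdMolDescriptors

mol = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")   # caffeine, as an example
properties = {
    "molecular_formula": rdMolDescriptors.CalcMolFormula(mol),
    "molecular_weight": round(Descriptors.MolWt(mol), 2),
    "h_bond_donors": Lipinski.NumHDonors(mol),
    "h_bond_acceptors": Lipinski.NumHAcceptors(mol),
    "logP": round(Crippen.MolLogP(mol), 2),          # lipid-water partition coefficient
    "aromatic_rings": rdMolDescriptors.CalcNumAromaticRings(mol),
}
print(properties)
```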

Conclusion

Bitterness is a major obstacle to the development of pharmaceutical products, especially for pediatric patients; thus, knowing which drugs are bitter is particularly important for formulation experts. CBDPS is a bitterness prediction model for children's oral medicines based on XgB + MACCS, and it is the first model for predicting the bitterness of COCMs. We plan to establish a web server that embeds the CBDPS model, supports model predictions, and links to literature reports on bitter oral medicines for children, providing services to children's drug developers. By integrating the latest information on children's bitter drugs into the database, we hope to further improve the prediction accuracy of the model; the linked reports are also of considerable value for developing taste masking for children's drug preparations.

Acknowledgments

This work was supported by the National Major Science and Technology Projects of China (Grant No. 2018ZX09721003).

Conflict of Interest

The authors declare no conflict of interest.

Supplementary Materials

The online version of this article contains supplementary materials.

References
 
© 2021 The Pharmaceutical Society of Japan