Journal of Computer Chemistry, Japan -International Edition
Online ISSN : 2189-048X
ISSN-L : 2189-048X
Improvement of Blood-Brain Barrier Permeability Prediction Using Cosine Similarity
Hiroshi SAKIYAMA, Ryushi MOTOKI, Takashi OKUNO, Jian-Qiang LIU

2023, Volume 9, Article ID: 2023-0017

Abstract

Prediction of blood-brain barrier permeability for chemicals is one of the key issues in brain drug development. In this study, the effect of using training data relatively similar to the test data was investigated in order to improve the performance of machine learning methods in predicting blood-brain barrier permeability. The results showed that selecting training data with high cosine similarity to the test data improved prediction performance even with a smaller amount of training data. The best model in this study also showed improved scores on two external test sets used to examine generalization performance, outperforming excellent existing models. The cosine similarity method is expected to be effective for predicting the properties of compounds with large diversity and a limited amount of data.

1 INTRODUCTION

The blood-brain barrier function originates from the structure of brain capillaries and controls the influx of chemical substances from the blood to the brain [1,2,3,4,5,6]. Understanding the permeability of chemicals through the barrier is important for the development of therapeutic agents for brain diseases such as dementia, and animal experiments have been conducted using rats and mice. Many efforts have also been made to predict blood-brain barrier permeability (or penetration) using machine learning [7,8,9,10,11,12,13], and in recent years, high prediction performance has been reported [12,13].

The problem here is the diversity of chemical substances. Candidates for future drugs may not always be extensions of known substances, and statistical predictions may break down when the new candidates are completely dissimilar to the existing ones. To overcome the diversity problem, using hundreds of thousands of data points is effective, and indeed, huge amounts of data have been accumulated [14], but the possibility remains that predictions for unknown substances will not work.

To overcome the diversity problem, MoleculeNet introduced the scaffold split method for predicting blood-brain barrier penetration (BBBP) [8]. In this method, the chemicals in the test set are similar neither to those in the training set nor to those in the validation set. The method first classifies the entire data set into groups with common scaffolds, then sorts the scaffold groups in descending order of size, and finally splits the entire data set into training, validation, and test sets in the ratio of 8:1:1. In the MoleculeNet paper [8], a BBBP data set consisting of 2053 entries [7] was used, and the resulting test set consisted only of substances with no common scaffold. With MoleculeNet's original algorithm, the obtained test set differed significantly from the training and validation sets, and thus the prediction score (area under the receiver operating characteristic curve [ROC-AUC]) was only around 0.729 [8], which was subsequently improved to 0.753 [11]. Therefore, prediction using MoleculeNet's scaffold split is one of the most challenging tasks in BBBP prediction, and models that score better on this task are expected to have better generalization performance.
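As a concrete illustration, the following Python sketch reproduces the scaffold split described above using RDKit's Bemis-Murcko scaffolds. The function name scaffold_split and the greedy assignment of whole scaffold groups are our assumptions; MoleculeNet's actual implementation may differ in detail.

from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, frac_train=0.8, frac_valid=0.1):
    # Group compound indices by their Bemis-Murcko scaffold SMILES.
    groups = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        groups[MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)].append(i)
    # Sort scaffold groups in descending order of size.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(smiles_list)
    train, valid, test = [], [], []
    # Fill the training set first, then the validation set; the remainder
    # becomes the test set, which thus collects the rarest scaffolds.
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test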

In our previous work [15], the BBBP data set [7], consisting of 2053 entries, was carefully curated to create two data sets (1957 entries each), the free-form and in-blood-form data sets, and the highest prediction score (0.773 by a single model [15]) was obtained for the difficult task created by MoleculeNet's scaffold split. In addition, the model scored well in predictions for two external test sets, showing good generalization performance.

In this study, we examined the effect of collecting, into the training set, data relatively similar to the test set, with the aim of making better predictions for unknown chemicals that may emerge in the future. It should be emphasized here that the features used to assess similarity were derived solely from the chemical bonding information of the substances and did not include features derived from the target permeability data; therefore, no data leakage, which must be avoided, occurred. Cosine similarity was chosen in this study because it gives a normalized similarity between −1 and 1.

2 METHODS

2.1 Computations

Computations were conducted in two environments: one with Python (3.7.9) [16] in an Anaconda (2020.02) [17] and JupyterLab (1.2.6) [18] environment on a Windows 11 computer (Dell Inspiron 7391, Intel Core i7-10510U CPU), and the other with Python (3.7.12) [16] in the Kaggle [19] environment.

2.2 Data sets

The curated free-form BBBP data set (bbbp_free.csv) [15] was used to study the effect of the cosine similarity method. Two external BBBP test sets (lbbb31.csv and tx95.csv) [15] were used to evaluate the generalization performance. Molecular descriptors used as features were obtained with RDKit (2019.09.3.0) [20] and Mordred (1.2.0) [21], according to the literature [15]. The names of the molecular descriptor sets [15] are summarized in Table 1; a sketch of the feature generation is given after the table. Three descriptor sets, Large212, RDKit200, and RDKit61, were used in this study, with 212, 200, and 61 features, respectively.

Table 1. Sets of molecular descriptors

Name of descriptor set  Molecular descriptors
RDKit61  MaxEStateIndex, MinEStateIndex, MinAbsEStateIndex, qed, MolWt, MinPartialCharge, MaxAbsPartialCharge, FpDensityMorgan1, BalabanJ, BertzCT, Chi0, HallKierAlpha, LabuteASA, PEOE_VSA1, PEOE_VSA10, PEOE_VSA11, PEOE_VSA12, PEOE_VSA13, PEOE_VSA14, PEOE_VSA2, PEOE_VSA3, PEOE_VSA4, PEOE_VSA5, PEOE_VSA6, PEOE_VSA7, PEOE_VSA8, PEOE_VSA9, SMR_VSA1, SMR_VSA10, SMR_VSA2, SMR_VSA3, SMR_VSA4, SMR_VSA5, SMR_VSA6, SMR_VSA7, SMR_VSA9, TPSA, EState_VSA1, EState_VSA10, EState_VSA11, EState_VSA2, EState_VSA3, EState_VSA4, EState_VSA5, EState_VSA6, EState_VSA7, EState_VSA8, EState_VSA9, VSA_EState1, VSA_EState10, VSA_EState2, VSA_EState3, VSA_EState4, VSA_EState5, VSA_EState6, VSA_EState7, VSA_EState8, VSA_EState9, FractionCSP3, MolLogP, MolMR
RDKit200  200 RDKit descriptors [20]
Large212  RDKit200 + nH, nB, nC, nN, nO, nS, nP, nF, nCl, nBr, nI, nX from Mordred [21]
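As a sketch of the feature generation (our reconstruction; the actual curated feature files come from ref. [15]), the RDKit200 descriptors can be computed from RDKit's registered descriptor list, and the 12 Mordred atom counts of Large212 from Mordred's AtomCount module:

from rdkit import Chem
from rdkit.Chem import Descriptors
from mordred import Calculator, AtomCount

ATOM_COUNTS = ["nH", "nB", "nC", "nN", "nO", "nS",
               "nP", "nF", "nCl", "nBr", "nI", "nX"]

def large212_features(smiles):
    # RDKit200: all descriptors registered in RDKit's descriptor list.
    mol = Chem.MolFromSmiles(smiles)
    feats = {name: func(mol) for name, func in Descriptors.descList}
    # Add the 12 Mordred atom counts of Table 1; the AtomCount module
    # also provides other counts, which are filtered out here.
    counts = Calculator(AtomCount)(mol).asdict()
    feats.update({k: counts[k] for k in ATOM_COUNTS if k in counts})
    return feats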

2.3 Cosine similarity split

Cosine similarity between two n-dimensional vectors, a and b, is defined by the following equation:

\[
\cos\theta = \frac{\mathbf{a}\cdot\mathbf{b}}{|\mathbf{a}|\,|\mathbf{b}|}
= \frac{a_1 b_1 + a_2 b_2 + \cdots + a_n b_n}{\sqrt{a_1^2 + a_2^2 + \cdots + a_n^2}\;\sqrt{b_1^2 + b_2^2 + \cdots + b_n^2}},
\]

where \(a_i\) and \(b_i\) are the i-th components of vectors a and b, respectively.

In the cosine similarity split, the original learning data (training plus validation data) were sorted in descending order of the cosine similarity between the learning-feature vector and the test-feature vector, and a specified fraction with the highest similarity was used as the training set, the rest serving as the validation set.
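The following is a minimal sketch of this split, assuming NumPy feature matrices. How the single test-feature vector is formed from the test set (here, the mean over the test compounds) is our assumption, since the per-compound similarities must be reduced to one ordering:

import numpy as np

def cosine_similarity_split(X_learn, y_learn, X_test, train_fraction=0.8):
    # Representative test-feature vector (assumption: mean over the test set).
    ref = X_test.mean(axis=0)
    # Cosine similarity of every learning compound to the reference vector.
    sims = X_learn @ ref / (np.linalg.norm(X_learn, axis=1) * np.linalg.norm(ref))
    # Sort in descending order: the most test-like compounds come first.
    order = np.argsort(sims)[::-1]
    n_train = int(train_fraction * len(X_learn))
    train_idx, valid_idx = order[:n_train], order[n_train:]
    return (X_learn[train_idx], y_learn[train_idx],
            X_learn[valid_idx], y_learn[valid_idx])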

2.4 Model

A random forest classifier model (scikit-learn (1.0.2) [22]) was used throughout this study. The "n_estimators" and "max_depth" parameters were fixed to 90 and 10, respectively, values that performed well in previous studies [15], because small changes in these hyperparameters were found to have little impact on the results. The "random_state" parameter was varied as needed to examine the distribution of results. All other parameters were fixed to their default values.
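A sketch of this configuration (scikit-learn 1.0.2; the helper name make_model is ours):

from sklearn.ensemble import RandomForestClassifier

def make_model(random_state):
    # n_estimators=90 and max_depth=10 follow the previous study [15];
    # all other parameters keep their scikit-learn defaults.
    return RandomForestClassifier(n_estimators=90, max_depth=10,
                                  random_state=random_state)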

2.5 Evaluation

Prediction results were evaluated by the area under the receiver operating characteristic curve (ROC-AUC) between the predicted probability and the observed target. The ROC-AUC scores were calculated with the roc_auc_score function of scikit-learn (1.0.2) [22].
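As a sketch (evaluate is a hypothetical helper; it assumes a fitted binary classifier):

from sklearn.metrics import roc_auc_score

def evaluate(model, X_test, y_test):
    # ROC-AUC between the predicted probability of the positive
    # (permeable) class and the observed labels.
    proba = model.predict_proba(X_test)[:, 1]
    return roc_auc_score(y_test, proba)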

3 RESULTS AND DISCUSSION

3.1 Baseline

First, a baseline prediction was conducted with the random forest model for the free-form BBBP test set [15], obtained by MoleculeNet's scaffold split, in which the free-form data set with 1957 chemicals was split into training, validation, and test sets in an 8:1:1 ratio. The numbers of chemicals in the resulting sets were 1565, 196, and 196, respectively. This test set was used unchanged throughout the study. The prediction was conducted 100 times, changing the random_state parameter from 0 to 99. According to the reported method [15], the "n_estimators" and "max_depth" parameters were fixed to 90 and 10, respectively, and all other parameters were kept at their defaults. The distribution of the resulting 100 ROC-AUC scores is shown in Figure 1. The average score was 0.767 (7), where the standard deviation in the last digit is given in parentheses. The 95% confidence interval was (0.765, 0.768). This score may not seem excellent in itself, but that is because the task is difficult, as discussed in the Introduction: due to MoleculeNet's scaffold split, the test set was composed of chemicals that are not similar to the chemicals in the learning set (training set plus validation set).
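This protocol can be sketched as follows, reusing the make_model and evaluate helpers from Section 2; the normal-approximation construction of the 95% confidence interval of the mean is our assumption:

import numpy as np

scores = []
for seed in range(100):                      # random_state 0..99
    model = make_model(random_state=seed)
    model.fit(X_train, y_train)
    scores.append(evaluate(model, X_test, y_test))

scores = np.asarray(scores)
mean, sd = scores.mean(), scores.std(ddof=1)
half = 1.96 * sd / np.sqrt(len(scores))      # 95% CI of the mean
print(f"ROC-AUC = {mean:.3f} ({sd:.3f}); "
      f"95% CI = ({mean - half:.3f}, {mean + half:.3f})")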

Figure 1. Distribution of the ROC-AUC scores in the baseline prediction.

3.2 Random shuffle split of the learning set

Next, predictions were conducted after shuffling the learning data (training plus validation data). Apart from the shuffling between the training and validation sets, all conditions were identical to those of the baseline prediction. Ten thousand predictions were made by varying the random seeds of the shuffling and of the random forest model independently from 0 to 99 (100 × 100 combinations). The average score was 0.769 (8), and the 95% confidence interval was (0.7692, 0.7695). These scores were slightly better than the baseline scores, indicating that prediction using MoleculeNet's scaffold split is the more difficult task.

3.3 Cosine similarity split of the learning set

For each chemical in the test set, the learning set (training plus validation sets) was sorted in descending order of cosine similarity, and the 1565 chemicals (80% of the full data) with the highest similarity were used as the training set. The resulting ROC-AUC score was 0.771 (8), slightly better than the baseline prediction [0.767 (7)] and the random split prediction [0.769 (8)]. The scores are included in Table 2.

Table 2. ROC-AUC scores for some split methods

Split Descriptor set Size of training set ROC-AUC
Scaffold (Baseline) Large212 80% (1565/1957) 0.767 (7)
Random Large212 80% (1565/1957) 0.769 (8)
Cosine similarity Large212 80% (1565/1957) 0.771 (8)
Cosine similarity RDKit200 80% (1565/1957) 0.771 (6)
Cosine similarity RDKit61 80% (1565/1957) 0.772 (8)
Cosine similarity RDKit61 70% (1369/1957) 0.773 (7)
Cosine similarity RDKit61 60% (1175/1957) 0.777 (7)
Cosine similarity RDKit61 50% (979/1957) 0.772 (8)

When the number of features used in training was decreased, the smallest descriptor set, RDKit61, with 61 descriptors, gave a better score than the larger sets, Large212 and RDKit200. In our previous work [15], some smaller descriptor sets were shown to perform better in deep neural network models; however, the current random forest model did not give good results with the smaller descriptor sets, so they are not described here. The resulting scores are also included in Table 2. Using the best descriptor set, RDKit61, as features, the size of the training set was changed to 1369 (70%), 1175 (60%), and 979 (50%); the 60%-size model scored best [0.777 (7)], with a 95% confidence interval of (0.775, 0.778). When the distributions of the scores are compared with those of the baseline model, as shown in Figure 2, the improvement is clearly seen. Although a direct comparison must be treated with caution because of the different splitting methods, the result indicates that the prediction is better when there is less dissimilar data in the training set. On the other hand, if the size of the training set is reduced too much, the prediction becomes worse. From these considerations, the cosine similarity method is expected to give better predictions when the data diversity is large and the amount of data is limited.

Figure 2. Distributions of the ROC-AUC scores in the baseline prediction (left) and in the best cosine similarity model (right).

3.4 Generalization performance

In section 3.3, the cosine similarity model was found to improve the ROC-AUC score of the test set by 0.01 in the best case. Predicting this test set is one of the most difficult tasks, because the chemicals in this test set are dissimilar to those in the learning set due to MoleculeNet's scaffold split. In this section, the generalization performance of the best cosine similarity model from section 3.3 was examined using the two external test sets, lbbb31 and tx95. In these evaluations, the full data set was sorted by the cosine similarity of the 212-dimensional feature vectors, and the 1175 (60%) chemicals with the highest similarity were used as the training set; the RDKit61 features were used for the random forest model with the same hyperparameters. Each evaluation was conducted 100 times, and the obtained ROC-AUC distributions are shown in Figure 3, compared with our previous prediction results [15] using the baseline model. For the lbbb31 test set, the previous 95% confidence interval was (0.884, 0.890), and the present cosine similarity model greatly improved it to (0.943, 0.947) (Figure 3a). For the tx95 test set, the previous 95% confidence interval (0.941, 0.946) was greatly improved by the cosine similarity model to (0.966, 0.967) (Figure 3b). In both cases, the width of the confidence intervals decreased significantly, i.e., the variation in the predictions became smaller, and better prediction results were obtained.

Figure 3. Distributions of the ROC-AUC scores for the external test sets, lbbb31 (a) and tx95 (b); baseline model (left) and cosine similarity model (right).

4 CONCLUSION

In this study, based on the random forest model used previously [15], we examined the effect of collecting training data similar to the test data. In practice, not much training data truly similar to the test data could be collected, but the cosine similarity of the feature vectors was made as high as possible. When the size of the training set and the feature set used for training and prediction were varied, the results indicated that the prediction is better when there is less dissimilar data in the training set; on the other hand, if the size of the training set is reduced too much, the prediction becomes worse. Compared with the random shuffle split (section 3.2), the mean ROC-AUC score of the best cosine similarity model improved from 0.769 (8) to 0.777 (7), and the 95% confidence interval improved from (0.769, 0.770) to (0.775, 0.778) (see also Figure 2). Comparing the best model to the baseline for the two external test sets, the 95% confidence intervals improved significantly from (0.884, 0.890) to (0.943, 0.947) and from (0.941, 0.946) to (0.966, 0.967) (Figure 3). The size of the learning data was not particularly large, but using data with relatively high cosine similarity for training improved the prediction. Therefore, the cosine similarity method is expected to be effective for predicting the properties of compounds with large diversity and a limited amount of data.

5 DECLARATION OF COMPETING INTEREST

The authors declare no conflicts of interest.

Acknowledgment

The authors are grateful for support from the Key Scientific Research Projects of Colleges and Universities of the Education Department of Guangdong Province (20202ZDZX2046, 2021ZDZX2052, and 2022ZDZX2022), the Guangdong Medical University Research Project (1019k2022003), and the open research fund of the Songshan Lake Materials Laboratory (2022SLABFN12).

REFERENCES
 
© 2023 Society of Computer Chemistry, Japan

This is an open-access article distributed under the terms of the Creative Commons Attribution Non-Commercial No Derivatives (CC BY-NC-ND) 4.0 License.
https://creativecommons.org/licenses/by-nc-nd/4.0/deed.ja