Skin sensitizer classification using dual-input machine learning model

Skin sensitization is an important aspect of occupational and consumer safety. Because of the ban on animal testing for skin sensitization in Europe, in silico approaches to predict skin sensitizers are needed. Recently, several machine learning approaches, such as the gradient boosting decision tree (GBDT) and deep neural networks (DNNs), have been applied to chemical reactivity prediction, showing remarkable accuracy. Herein, we performed a study on DNN- and GBDT-based modeling to investigate their potential for use in predicting skin sensitizers. We separately input two types of chemical properties (physical and structural properties) in the form of one-hot labeled vectors into single- and dual-input models. All the trained dual-input models achieved higher accuracy than single-input models, suggesting that a multi-input machine learning model with different types of chemical properties has excellent potential for skin sensitizer classification.


Introduction
Skin sensitization is a key safety endpoint in occupational and consumer settings. Skin sensitizers have traditionally been assessed by animal testing (OECD TG406). Because Europe has recently banned animal testing of cosmetic ingredients, including testing for skin sensitization, in vitro testing approaches (OECD TG442C, TG442E) and in silico predictive models (e.g., Quantitative Structure-Activity Relationship (QSAR) [1]) have begun to be investigated. To meet the 3R (Replacement, Reduction, and Refinement) principles, there is high demand for alternative methods of skin sensitizer prediction.
Among classification algorithms, the deep neural network (DNN), a type of artificial neural network with more than one hidden layer, has been used to predict chemical reactivity and has won competitions such as the TOX21 Data Challenge 2014 [2] with remarkable accuracy.
Recently, Kato et al. reported the high predictive performance of a QSAR/DNN model for chemical activity [3]. The light gradient boosting machine (LightGBM), a type of gradient boosting decision tree (GBDT), was proposed in 2017 [4], and several subsequent reports have verified its potential for assessing chemical toxicity [5]. However, articles investigating skin sensitizer classification with these approaches are lacking.
In this study, we investigated QSAR-like DNN and LightGBM modeling to gain insight into their potential for predicting skin sensitizers. We used chemicals classified as skin sensitizers or non-sensitizers according to Globally Harmonized System (GHS) labeling. The physical and structural properties of each chemical were collected as descriptors. The processed data were then input into the models to determine their classification accuracy.

Workflow of DNN- and LightGBM-based classification model construction
The workflow for building the classification models is briefly described below. Further details of the modeling are given in the Supplemental Information (Section A).
The list of chemicals with GHS labeling for skin sensitization was prepared using data from the National Institute of Technology and Evaluation [6]. The chemicals categorized as skin sensitizers (category 1 and sub-categories 1A/1B) were defined as "positive," and the out-of-category chemicals were defined as "negative." After data processing, the numbers of positive and negative chemicals were 408 and 275, respectively.
We used 200 chemical descriptors computed with the RDKit library to represent the physical properties of each chemical. The values of each descriptor were standardized using z-score normalization and then one-hot encoded. The one-hot encoded data were converted into vectors and input to the models. The MACCS fingerprint of each chemical, also obtained with the RDKit library, was used as the structural property. The MACCS fingerprints were converted into one-hot labeled vectors that were input to the models.
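The standardization and one-hot encoding of a continuous descriptor can be illustrated with a minimal sketch. The binning scheme is an assumption made for illustration only: the bin edges (BIN_EDGES) and the helper names are hypothetical, and the actual discretization used in this study is described in the Supplemental Information rather than here.

```python
import math

# Hypothetical bin edges in z-score units (an assumption, not the study's values).
BIN_EDGES = [-1.0, 0.0, 1.0]


def z_score(values):
    """Standardize one descriptor column to zero mean and unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    if std == 0.0:
        # A constant descriptor carries no information; map it to zero.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def one_hot_bin(z, edges=BIN_EDGES):
    """Return a one-hot vector marking the bin that contains the z-score."""
    index = sum(1 for edge in edges if z >= edge)
    vector = [0] * (len(edges) + 1)
    vector[index] = 1
    return vector


def encode_column(values):
    """Standardize a descriptor column, then one-hot encode every value."""
    return [one_hot_bin(z) for z in z_score(values)]


def encode_matrix(rows):
    """Encode a chemicals-by-descriptors table into one flat vector per chemical."""
    columns = [encode_column(list(column)) for column in zip(*rows)]
    return [[bit for column in columns for bit in column[i]]
            for i in range(len(rows))]
```

With three bin edges, each descriptor contributes four binary inputs, so a table of 200 descriptors would yield an 800-dimensional vector per chemical under this assumed scheme.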
The single-input DNN model had two or four hidden layers. The dual-input DNN model was built by concatenating two single-input DNN branches, each with two hidden layers, followed by two further hidden layers. We also developed single-input LightGBM models using either the physical or the structural properties, and a dual-input LightGBM model using both.
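The topology of the dual-input DNN can be sketched as a plain forward pass. This is an untrained illustration of the layer arrangement only, not the model used in the study: the class name DualInputNet, the layer width, the ReLU and sigmoid activations, and the random initialization are all assumptions.

```python
import math
import random


def dense(n_in, n_out, rng):
    """Create a randomly initialized dense layer as (weights, biases)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(layer, x, activation):
    """Apply one dense layer followed by an element-wise activation."""
    weights, biases = layer
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def relu(v):
    return max(0.0, v)


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


class DualInputNet:
    """Two branches of two hidden layers each, merged, then two hidden layers."""

    def __init__(self, n_phys, n_struct, hidden=16, seed=0):
        rng = random.Random(seed)
        self.phys_branch = [dense(n_phys, hidden, rng), dense(hidden, hidden, rng)]
        self.struct_branch = [dense(n_struct, hidden, rng), dense(hidden, hidden, rng)]
        self.head = [dense(2 * hidden, hidden, rng), dense(hidden, hidden, rng)]
        self.output = dense(hidden, 1, rng)

    def predict(self, phys, struct):
        """Return a score in (0, 1) for one chemical's two input vectors."""
        for layer in self.phys_branch:
            phys = forward(layer, phys, relu)
        for layer in self.struct_branch:
            struct = forward(layer, struct, relu)
        merged = phys + struct  # list concatenation joins the two branches
        for layer in self.head:
            merged = forward(layer, merged, relu)
        return forward(self.output, merged, sigmoid)[0]
```

The merge step is where the two property types first interact; each branch can learn a representation suited to its own input before the shared layers combine them.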

Accuracy comparison
We input the physical properties and/or structural properties of the chemicals into the single- and dual-input DNN and LightGBM models. The accuracy, loss, and area under the precision-recall curve (PR-AUC) of each model were calculated by 10 iterations of five-fold cross-validation.
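The evaluation protocol of 10 iterations of five-fold cross-validation can be sketched as an index generator. Stratification by class and the fixed random seed are assumptions, since the text does not state how the folds were drawn; the function names are hypothetical.

```python
import random


def repeated_stratified_kfold(labels, n_splits=5, n_repeats=10, seed=0):
    """Yield (train_indices, test_indices) for repeated stratified k-fold CV."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        # Deal each class round-robin so every fold keeps the class ratio.
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            for position, index in enumerate(shuffled):
                folds[position % n_splits].append(index)
        for k in range(n_splits):
            test = sorted(folds[k])
            train = sorted(i for j, fold in enumerate(folds) if j != k
                           for i in fold)
            yield train, test


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Applied to the 408 positive and 275 negative chemicals, this yields 50 train/test splits, and each metric would be averaged over them. For reference, always predicting "positive" on this dataset gives an accuracy of 408/683, or about 59.7%.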

Results and discussion
In this study, we took a data-driven approach to classifying skin sensitizers with the DNN and LightGBM algorithms, using the physical and structural properties of chemicals as input variables. In contrast to the MACCS fingerprint, the physical properties of chemicals are continuous variables. For neural network models such as the DNN, Hayashi et al. reported that networks trained with augmented discretized input were more accurate than those trained with the original continuous input [7]. Indeed, the DNN model could not be trained successfully with the original physical data of the chemicals (data not shown). The loss of all DNN models trained with the one-hot encoded training data decreased over the epochs, suggesting that one-hot encoding of the physical properties of chemicals is a good option for DNN training.
We then compared the performance of the single- and dual-input DNN and LightGBM models. For both algorithms, the dual-input model delivered the best performance, outperforming the corresponding single-input models in terms of accuracy and PR-AUC. The dual-input LightGBM model performed particularly well in terms of PR-AUC. Moreover, 7.1% and 9.7% of the chemicals were uniquely predicted by the DNN and LightGBM models, respectively (Figure S3), suggesting that the two algorithms differ in their potential for classifying skin sensitizers.
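As an illustration of the PR-AUC metric used in this comparison, the sketch below computes average precision, one common estimator of the area under the precision-recall curve. Whether the study used this estimator or a trapezoidal one is not stated, so the choice is an assumption, as is the function name; tied scores are not treated specially.

```python
def average_precision(y_true, scores):
    """Estimate PR-AUC as the mean precision at the rank of each positive."""
    n_positive = sum(y_true)
    if n_positive == 0:
        raise ValueError("average precision is undefined without positives")
    # Rank chemicals from the highest to the lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, index in enumerate(order, start=1):
        if y_true[index] == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_positive
```

Because the metric summarizes how well positives are ranked above negatives, it is less dependent on a single decision threshold than accuracy is, which is useful for an imbalanced dataset such as this one.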
To the author's knowledge, this is the first report on skin sensitizer classification with DNN and LightGBM models. The dual-input strategy showed potential for predicting skin sensitizers with both deep learning and GBDT algorithms. Although the sets of chemicals correctly predicted by the two models were almost identical, the dual-input LightGBM model yielded more true positive compounds, whereas the dual-input DNN model yielded more true negative compounds (Figure S3). This might reflect differences between the algorithms, such as how they assign importance to features.
Golden et al. reported the accuracy of eight in silico skin sensitization prediction models based on the Hazardous Substances Data Bank [8]. Although direct comparison with the results of other investigations is difficult, the dual-input DNN model (72%) and the dual-input LightGBM model (74%) were fairly accurate compared with other traditional approaches (55%-81%). Although we manually optimized the hyper-parameters only within a limited range, the results suggest that the dual-input machine learning model performs well for skin sensitizer classification. The dual-input LightGBM model showed significantly better performance than the single-input models, whereas no statistically significant difference was found among the DNN models.
A general drawback of DNN models is that small and imbalanced datasets, such as the skin sensitizer dataset, negatively affect performance. Recently, Matsuzaka et al. developed DeepSnap, a procedure for generating omnidirectional snapshots portraying the 3D structures of chemicals [9]. This approach would enable data augmentation, which could mitigate the dataset problem for skin sensitizers.
In addition, we used only the MACCS fingerprint as a structural property. Among prediction algorithms based on molecular structures, graph convolutional networks (GCNs), a type of neural network for graph-related tasks, have shown strong classification performance. Kojima et al. recently developed kGCN, open-source software that provides GCN-based prediction of chemical properties, and showed that the multi-modal approach delivers better predictive performance than single-task approaches [10]. These results suggest that the use of structural properties other than MACCS keys may improve the performance of the dual-input DNN model. As a next step, the chemical space covered by the dual-input models should also be investigated to understand their potential for skin sensitizer classification more precisely.

Conflict of interest
The author is an employee of Japan Tobacco Inc. and declares no conflicts of interest with respect to the research, authorship, and/or publication of this article.