Abstract
Although supervised neural networks such as BNN (Back Propagation Neural Network) and CNN (Counter Propagation Neural Network) are useful techniques for modeling nonlinear data, their predictive ability for a test set is insufficient when a large number of descriptors is used. Furthermore, the resulting model is difficult to interpret, which makes the design of new compounds cumbersome. It is therefore important to remove irrelevant descriptors, which make no significant contribution to the model. To select the significant descriptors from the huge number of possible combinations, GA (Genetic Algorithm) has been developed and used in QSAR (Quantitative Structure-Activity Relationship) studies. In previous work, we successfully combined CNN and GA and applied the method to structure-activity data for phenylalkylamines. In this report, we examined the practical utility of our method using steroid data, which have a larger number of descriptors than the phenylalkylamine data. First, we showed by PLS (Partial Least Squares) that this data set is nonlinear. Next, we built a CNN model with all 51 descriptors, but its prediction for the test set was poor. GA was then used for variable selection, and it reduced the number of descriptors from 51 to 11. The predictive ability of the CNN model with these 11 descriptors was much improved. Finally, the loading vector maps of the selected descriptors and of the activity were compared. The colored loading maps made the trend between the activity and each descriptor easy to understand.
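The variable-selection step summarized above can be illustrated with a minimal sketch of GA-based descriptor selection. It is not the implementation used in this work. As a loudly stated assumption, an ordinary least-squares fit stands in for the counter-propagation network as the model scored by the GA. The function names (`fitness`, `ga_select`) and the GA settings (population size, number of generations, mutation rate) are also hypothetical choices made only for illustration.

```python
import numpy as np

def fitness(mask, X_train, y_train, X_val, y_val):
    """Negative validation RMSE of a model built on the selected descriptors.

    Assumption: a least-squares fit with an intercept stands in for the
    counter-propagation network, which is not reproduced here.
    """
    if not mask.any():
        return -np.inf  # an empty descriptor set cannot predict anything
    A = np.column_stack([X_train[:, mask], np.ones(len(X_train))])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    B = np.column_stack([X_val[:, mask], np.ones(len(X_val))])
    rmse = np.sqrt(np.mean((B @ coef - y_val) ** 2))
    return -rmse

def ga_select(X_train, y_train, X_val, y_val,
              pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    """Search binary descriptor masks with a simple genetic algorithm."""
    rng = np.random.default_rng(seed)
    n_desc = X_train.shape[1]
    # Each chromosome is a boolean mask: True means the descriptor is kept.
    pop = rng.random((pop_size, n_desc)) < 0.5
    # Start from the full descriptor set as the baseline to beat.
    best_mask = np.ones(n_desc, dtype=bool)
    best_fit = fitness(best_mask, X_train, y_train, X_val, y_val)
    for _ in range(generations):
        scores = np.array([fitness(m, X_train, y_train, X_val, y_val)
                           for m in pop])
        i = int(np.argmax(scores))
        if scores[i] > best_fit:
            best_fit, best_mask = scores[i], pop[i].copy()
        # Binary tournament selection of parents.
        a = rng.integers(pop_size, size=pop_size)
        b = rng.integers(pop_size, size=pop_size)
        parents = pop[np.where(scores[a] >= scores[b], a, b)]
        # Uniform crossover with a randomly permuted partner.
        partners = parents[rng.permutation(pop_size)]
        cross = rng.random((pop_size, n_desc)) < 0.5
        children = np.where(cross, parents, partners)
        # Bit-flip mutation, then elitism: keep the best mask found so far.
        flip = rng.random((pop_size, n_desc)) < mutation_rate
        pop = children ^ flip
        pop[0] = best_mask
    return best_mask, float(-best_fit)
```

In this sketch the masks are scored on a validation split, so the final model would still need to be checked on a separate test set, in the spirit of the test-set predictions described above.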