Host: The Japanese Society of Toxicology
Name: The 51st Annual Meeting of the Japanese Society of Toxicology
Date: July 03, 2024 - July 05, 2024
Chemical language models apply natural language processing techniques to compound structures, and SMILES strings are their principal input format. However, because databases process structures differently, notation is inconsistent across them, even for Canonical SMILES. This study examines how these 'dialects' affect chemical language models applied to QSAR tasks.
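As a rough illustration of such a 'dialect', the sketch below counts the stereochemistry markers in SMILES strings and flags records of the same compound that disagree. It is a string-level heuristic using only the Python standard library, not the procedure used in this study; the database names and the two alanine records are hypothetical examples.

```python
import re

# In SMILES, tetrahedral chirality is written with @ or @@ and
# double-bond geometry with the directional bond symbols / and \.
CHIRAL_TAG = re.compile(r"@{1,2}")
BOND_DIRECTION = re.compile(r"[/\\]")


def stereo_markers(smiles: str) -> dict:
    """Count stereo annotations in a SMILES string (string-level heuristic)."""
    return {
        "chiral_centers": len(CHIRAL_TAG.findall(smiles)),
        "directional_bonds": len(BOND_DIRECTION.findall(smiles)),
    }


# Hypothetical records of the same compound (alanine) as two
# databases might store it; the names are placeholders.
records = {
    "database_a": "C[C@H](N)C(=O)O",  # stereocentre specified
    "database_b": "CC(N)C(=O)O",      # stereocentre omitted
}

profiles = {name: stereo_markers(smi) for name, smi in records.items()}

# The records disagree if their marker counts differ.
inconsistent = len({tuple(p.values()) for p in profiles.values()}) > 1

for name, profile in profiles.items():
    print(name, profile)
print("stereo annotation differs:", inconsistent)
```

Counting markers rather than comparing strings directly avoids false alarms when two databases merely list atoms in a different order. A cheminformatics toolkit that parses the structure would be more robust than this regular-expression check.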
An initial investigation revealed that SMILES representations of stereochemistry differ across databases, which could affect the application of chemical language models. To assess the influence of these inconsistencies, three pre-processing methods were implemented: standard procedures, explicit stereoisomer assignment via 3D structure calculation, and exclusion of stereoisomeric data. A model was constructed for each method and evaluated on the Ames test dataset for translation accuracy and classification accuracy. The proposed pre-processing methods notably improved translation accuracy, while classification accuracy improved only marginally.
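Two of the three pre-processing variants can be sketched with the standard library alone. The sketch assumes that "exclusion of stereoisomeric data" means dropping records whose SMILES carry stereo markers, which is only one possible reading. The function names, mode names, and toy records are hypothetical, and the labels are arbitrary placeholders rather than Ames results. Explicit stereoisomer assignment via 3D structure calculation needs a cheminformatics toolkit and is not shown.

```python
import re

# Any of @, / or \ indicates that the SMILES string carries stereo information.
STEREO_MARKER = re.compile(r"[@/\\]")


def has_stereo(smiles: str) -> bool:
    """Return True if the SMILES string contains a stereo marker."""
    return bool(STEREO_MARKER.search(smiles))


def preprocess(dataset, mode):
    """Apply one pre-processing variant to a list of (smiles, label) pairs.

    Mode names are hypothetical:
      "standard"       - keep every record unchanged
      "exclude_stereo" - drop records whose SMILES carry stereo markers
                         (assumed reading of "exclusion of stereoisomeric data")
    """
    if mode == "standard":
        return list(dataset)
    if mode == "exclude_stereo":
        return [(smi, label) for smi, label in dataset if not has_stereo(smi)]
    raise ValueError(f"unknown mode: {mode}")


# Toy records; the labels are arbitrary placeholders, not Ames results.
toy = [
    ("c1ccccc1", 0),
    ("C[C@H](N)C(=O)O", 0),
    ("F/C=C/F", 1),
]

for mode in ("standard", "exclude_stereo"):
    print(mode, preprocess(toy, mode))
```

This filter only detects stereochemistry that is written out. A record with an unannotated stereocentre passes through unchanged, and that gap is what explicit assignment from a 3D structure would address.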
These results show that (1) compound databases contain many notational inconsistencies, and (2) taking stereoisomerism into account may improve accuracy when applying chemical language models. Appropriate use of chemical language models is expected to improve performance on QSAR tasks such as Ames mutagenicity prediction.