Circulation Journal
Online ISSN : 1347-4820
Print ISSN : 1346-9843
ISSN-L : 1346-9843

A final published version of this article is available. Please refer to the published version, and cite the published version when citing this article.

The Importance of Interpretability and Validations of Machine-Learning Models
Daisuke Yamasawa, Hideki Ozawa, Shinichi Goto
Journal / Open access / HTML / Advance online publication

Article ID: CJ-23-0857


Neural networks have demonstrated utility in the medical field for handling complex, multidimensional medical data.1,2 Recent studies have shown that machine-learning (ML) models built on neural networks can detect disease and predict prognosis from a single recording of a 12-lead ECG beyond the capabilities of fully trained experts.3–10 Because the ECG is an inexpensive and widely available test, using it to detect cardiac abnormalities has the potential to enable earlier diagnosis and treatment. Furthermore, ECGs have been integrated into wearable devices, enabling recording without a visit to a healthcare provider. Although these wearable devices record ECGs with only a limited number of leads, a recent study showed that such recordings are feasible for training ML models to detect specific cardiac abnormalities.11

Article p ????

In this issue of the Journal, Sato et al12 use a large dataset derived from multiple centers to explore the detection of multiple cardiac abnormalities from a single-lead ECG. They extracted the lead-I voltage recording from 12-lead ECGs to develop models predicting reduced ejection fraction (EF), wall motion abnormality, left ventricular hypertrophy, left ventricular dilatation, and left atrial dilatation (LAD). Although performance was moderate for some of the abnormalities, the study adds important evidence showing that a single-lead ECG from lead I contains information to predict left heart abnormalities other than reduced EF. As the authors note, their findings were not directly confirmed using ECGs recorded by wearable devices, but the results suggest potential utility in that setting.

Although ML is a valuable tool for building clinically useful models, it suffers from the fundamental problem of being a “black box”, which prevents the widespread use of this technology in the medical field, where decisions can directly lead to life-threatening situations for patients. There is ongoing research to address this issue by improving the interpretability and/or explainability of neural networks. Various methods have been proposed to address the lack of interpretability, such as gradient-weighted class activation mapping (Grad-CAM), local interpretable model-agnostic explanations (LIME), and SHapley Additive exPlanations (SHAP). Of these methods, Sato et al used Grad-CAM (Figure) to visualize the localization of features within the ECG. They found that the model focused on the P wave for the detection of LAD and on the QRS complex for the other detections. Although these analyses add some level of interpretability to the “black box” model, it should be noted that the method only provides information on “where” the feature is and does not show “what” the feature is. For example, the focus on the P wave does not tell us whether the model used the amplitude, the duration, or the shape of the P wave to detect LAD. This remains a topic for future research.

Figure.

Gradient-weighted class activation mapping (Grad-CAM) of (A) an original picture. Grad-CAM localizes class-discriminative regions for (B) the cat, (C) the desktop computer, and (D) the cup.
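
For readers unfamiliar with the technique, the following is a minimal sketch of how a Grad-CAM map could be computed for a 1-dimensional convolutional network applied to a single-lead ECG (Python/PyTorch; the model, layer, and variable names are placeholders, and this is not the implementation used by Sato et al). The map weights the channels of the last convolutional layer by the averaged gradient of the target class score, which is what localizes “where” the model is looking along the tracing.

    import torch
    import torch.nn.functional as F

    def grad_cam_1d(model, ecg, target_layer, class_idx):
        """Per-timepoint importance of a single-lead ECG for one output class."""
        activations, gradients = [], []

        # Hook the last convolutional layer to capture its output and its gradient.
        fwd = target_layer.register_forward_hook(
            lambda _m, _i, out: activations.append(out))
        bwd = target_layer.register_full_backward_hook(
            lambda _m, _gi, gout: gradients.append(gout[0]))

        logits = model(ecg)              # ecg shape: (1, 1, n_samples)
        model.zero_grad()
        logits[0, class_idx].backward()  # gradient of the target class score
        fwd.remove(); bwd.remove()

        acts, grads = activations[0], gradients[0]   # both (1, channels, length)
        weights = grads.mean(dim=2, keepdim=True)    # channel weights: averaged gradients
        cam = F.relu((weights * acts).sum(dim=1))    # weighted sum over channels, (1, length)
        cam = F.interpolate(cam.unsqueeze(1), size=ecg.shape[-1],
                            mode="linear", align_corners=False).squeeze()
        return (cam / (cam.max() + 1e-8)).detach()   # normalized to [0, 1]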

Another important aspect of improving the reliability of ML models is validation in an external cohort. Given the incomplete interpretability of neural network models, they could be using any feature present in the input, including irrelevant ones such as differences in the vendor of the ECG recorder. This can introduce bias in certain situations, for example when the recorder vendor differs between the emergency department (where the prevalence of acute disease is high) and the outpatient clinic (where the prevalence of chronic disease is high). Because these conditions usually differ between institutions, external validation can identify models that rely on such irrelevant features. Sato et al conducted an external validation using data from JR Tokyo General Hospital and NTT Medical Center Tokyo, and the similar performance in this analysis adds confidence in the model’s reliability.
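
As an illustration only (not the authors’ analysis pipeline), the core of such a check can be as simple as comparing discrimination between the development cohort and an independent hospital’s cohort. The sketch below assumes a fitted classifier with a scikit-learn-style predict_proba and (X, y) arrays for each cohort; all names are placeholders.

    from sklearn.metrics import roc_auc_score

    def compare_cohorts(model, internal, external):
        """Compare discrimination on the development cohort vs. an external cohort.

        `internal` and `external` are (X, y) pairs from different hospitals.
        """
        (X_in, y_in), (X_ex, y_ex) = internal, external
        auc_in = roc_auc_score(y_in, model.predict_proba(X_in)[:, 1])
        auc_ex = roc_auc_score(y_ex, model.predict_proba(X_ex)[:, 1])
        # A large drop on the external cohort suggests reliance on site-specific
        # artifacts (e.g., recorder vendor) rather than generalizable physiology.
        return auc_in, auc_ex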

However, it must be noted that, even with all these efforts, there are still challenges to overcome before this model can be applied to screening with wearable ECGs. As the authors note, the model was not tested on data from an actual wearable device. First, the “lead-I ECG” recorded by a wearable device probably differs from the lead-I trace extracted from a 12-lead ECG, being affected by various factors such as posture (standing or moving vs. lying down). Validation using data from wearable devices is therefore essential to evaluate the model’s utility on such devices. Second, even in a dataset of hospital patients with a relatively high prevalence of disease, the positive predictive values of the models were low; because a screening population is usually healthier, the positive predictive values are expected to be even lower, as illustrated below. Third, whether the model performs well in healthy individuals remains to be elucidated. The model may detect severe cases more readily, which could lead to lower performance in the milder cases that are likely to be the target of screening.
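
The dependence of the positive predictive value on prevalence follows directly from Bayes’ rule. The short calculation below uses purely illustrative sensitivity and specificity values (not the metrics reported by Sato et al) to show how the same operating point loses most of its predictive value when moved from a hospital cohort to a screening population.

    def ppv(sensitivity, specificity, prevalence):
        """Positive predictive value from Bayes' rule."""
        true_pos = sensitivity * prevalence
        false_pos = (1 - specificity) * (1 - prevalence)
        return true_pos / (true_pos + false_pos)

    # Illustrative operating point only:
    print(ppv(0.85, 0.85, 0.20))   # ~0.59 at 20% prevalence (hospital cohort)
    print(ppv(0.85, 0.85, 0.02))   # ~0.10 at 2% prevalence (screening population)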

In conclusion, the application of artificial intelligence in the medical field is still in the midst of development. Simply showing excellent performance metrics does not guarantee a model’s clinical utility. Extensive validation and improvement of interpretability/explainability should be the next key steps toward the widespread use of ML models in medicine.

References
 
© 2023, THE JAPANESE CIRCULATION SOCIETY

This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license.
https://creativecommons.org/licenses/by-nc-nd/4.0/