2025 Volume 7 Issue 4 Pages 257-266
Background: Low peak oxygen uptake (V̇O2), especially ≤14 mL/min/kg, is a strong indicator of poor prognosis in patients with heart failure (HF). However, measuring this parameter is sometimes difficult if the maximal workload is not reached. This study developed a predictive classification model for low peak V̇O2 in HF patients using machine learning (ML).
Methods and Results: We retrospectively analyzed the data for 343 patients with chronic HF and left ventricular ejection fraction <50% who underwent a symptom-limited cardiopulmonary exercise test and extracted 33 variables from their laboratory, echocardiographic, and exercise data up to the submaximal workload. The dataset was randomly divided into training and testing datasets in a 4 : 1 ratio. ML methods, including an exhaustive search for predictor selection, were used, and a support vector machine algorithm was applied for model optimization. We identified 5 important predictors: age, B-type natriuretic peptide, left ventricular end-diastolic diameter, V̇O2 at rest, and V̇O2 at respiratory exchange ratio of 1.00. Using these 5 predictors, an optimized predictive model was validated on the testing dataset, yielding an accuracy of 85%, F1 score of 0.81, and area under the receiver operating curve of 0.94 (95% confidence interval: 0.89–1.00).
Conclusions: Using readily available parameters, ML methods can enable accurate prediction of low peak V̇O2 in patients with HF.
Exercise capacity in patients with chronic heart failure (HF) is a useful indicator that comprehensively reflects cardiopulmonary function and muscle strength, and its significance in prognosis determination and guiding of future treatment strategies has been highlighted.1–5 In particular, peak oxygen uptake (V̇O2), measured through cardiopulmonary exercise testing (CPX), is a representative indicator of exercise capacity because it is highly reproducible in a controlled environment.6 Notably, peak V̇O2 ≤14 mL/min/kg is recognized as a benchmark for low exercise capacity associated with poor prognosis, one of the defining indices for advanced HF, and an indication for heart transplantation.7,8 Despite its importance, the measurement of peak V̇O2 is sometimes challenging in real-world clinical settings because of interruptions during submaximal exercise for reasons such as the risk of maximal loading, knee or leg pain, and lack of motivation. In a prior study, approximately 15% of subjects did not reach the minimal sufficient condition of peak V̇O2, as indicated by a peak respiratory exchange ratio (RER) of 1.10,9 with peak V̇O2 with insufficient RER exhibiting a diminished correlation with prognosis in HF patients.10 Another study showed that approximately 45% of patients with HF failed to attain a peak RER of 1.05, mainly because of leg fatigue.11 If a predictive model of low peak V̇O2 can be established using only resting or submaximal exercise data, which can be attained in most patients, it will be useful in assessing the severity and prognostic risk of HF in those patients who fail to achieve sufficient maximal load.
Conventional machine-learning (ML) models, particularly using a support vector machine, have been more accurate than neural network models or traditional linear regression models in predicting exercise capacity in healthy subjects, probably because of the nonlinear characteristics of exercise capacity.12 Deep-learning models can also predict exercise capacity in patients with HF.13 However, compared with deep-learning models, conventional ML models require fewer cases and variables for construction, and the contributions of important features to the model are interpretable, suggesting the model’s potential for deployment in clinical applications such as interventions for each important predictor. Additionally, in constructing accurate and robust ML models, it is better to extract useful predictors from various candidate variables14 and avoid a highly heterogeneous cohort. Thus, in this study, we focused on patients with HF and left ventricular ejection fraction (LVEF) <50% to reduce the heterogeneity of HF characteristics and planned to extract important features and construct a predictive classification model of exercise intolerance, defined as a peak V̇O2 ≤14 mL/min/kg, for risk classification by using ML methods comprising the filter method, the wrapper method, and an exhaustive search for predictor extraction and support vector machine algorithm for final model optimization.
Data for patients with chronic HF and LVEF <50% who underwent symptom-limited CPX at Osaka University Hospital between January 2015 and December 2020 were retrospectively analyzed. Patients who were less than 20 years old, had undergone heart transplantation, had a left ventricular assist device, had congenital heart disease or primary pulmonary vascular disease, or were on maintenance dialysis were excluded. To accurately evaluate exercise capacity, patients with peak RER <1.10, indicating an inability to perform symptom-limited exercise,15–18 were also excluded.
This study adhered to the tenets of the Declaration of Helsinki and Good Clinical Practice and was approved by the Ethical Review Board of Osaka University Hospital (Approval No. 19210). All patients were given the opportunity to refuse participation through a public opt-out mechanism.
Clinical and Laboratory Data
Clinical data, including the patient’s characteristics, medications, and echocardiographic data, were obtained from the medical records closest to the date of CPX. For patients without records on the day of examination, data from the period immediately before the examination were used. Echocardiographic data that were not included in clinical reports were obtained from stored images.
A symptom-limited exercise test was conducted using a ramp protocol. The setup included an upright, electromagnetically braked cycle ergometer with a 12-lead ECG and a blood pressure monitoring device. Throughout the exercise test, V̇O2 and carbon dioxide output were measured breath-by-breath with an AE-310 respiromonitor. The protocol involved a 1–3-min resting period until expiratory gas stabilized, followed by a 10-W steady-state load as a warm-up and then incremental loading of 1 W every 3–6 s until the symptomatic limit was reached. The ramp rate was adjusted according to the patient’s background, including age, severity of cardiac disease, and usual activity level. Resting parameters were calculated by averaging a maximum of 120 s of expiratory gas data during a stable phase with little variability. The time series of V̇O2 data recorded breath-by-breath during the exercise test was resampled every 3 s by linear spline interpolation and then smoothed with a centered moving average over 9 samples (the current sample plus 4 samples before and 4 after). Peak V̇O2 was defined as the maximum V̇O2 during exercise in the moving-average time-series data. For the construction of predictive classification models, patients were divided into 2 groups: L (low-capacity group) and P (preserved-capacity group), characterized by peak V̇O2 ≤14 mL/min/kg and peak V̇O2 >14 mL/min/kg, respectively. To use only readily available predictor variables, we chose heart rate, workload, and V̇O2 at the last point where RER ≤1.00 (at R1), which do not require specialized analysis. For patients who did not have an RER ≤1.00 after the resting period (n=4), the resting data were used as the data at R1. Heart rhythms were analyzed from the ECG recorded at rest.
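The V̇O2 processing described in this paragraph can be sketched as follows. This is an illustration only, not the authors' MATLAB implementation: the function names are hypothetical, and the handling of the first and last samples (a window that simply shrinks at the edges) is an assumption, as the text does not specify it.

```python
from bisect import bisect_right

def resample_linear(t, y, step=3.0):
    """Linearly interpolate breath-by-breath samples (t in seconds) onto a regular grid."""
    grid, values = [], []
    x = t[0]
    while x <= t[-1]:
        i = min(max(bisect_right(t, x) - 1, 0), len(t) - 2)
        w = (x - t[i]) / (t[i + 1] - t[i])
        grid.append(x)
        values.append(y[i] + w * (y[i + 1] - y[i]))
        x += step
    return grid, values

def centered_moving_average(y, half_window=4):
    """Average each sample with up to 4 neighbours on each side (9 samples in total).
    Assumption: the window shrinks at the edges of the series."""
    smoothed = []
    for i in range(len(y)):
        window = y[max(0, i - half_window): i + half_window + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

def peak_vo2(t, vo2):
    """Maximum of the resampled, smoothed VO2 series."""
    _, resampled = resample_linear(t, vo2)
    return max(centered_moving_average(resampled))

def classify_group(peak, threshold=14.0):
    """Group L: peak VO2 <= 14 mL/min/kg; Group P: peak VO2 > 14 mL/min/kg."""
    return "L" if peak <= threshold else "P"

def last_index_rer_at_most_one(rer):
    """Index of the last sample with RER <= 1.00 (R1), or None if there is none."""
    hits = [i for i, r in enumerate(rer) if r <= 1.00]
    return hits[-1] if hits else None
```

When `last_index_rer_at_most_one` returns `None`, the resting values would stand in for the R1 values, as described for the 4 patients above.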
Predictors
We selected 33 variables for ML predictor extraction, focusing on those associated with exercise capacity, severity, and prognosis of HF, as reported in prior studies (Table 1). The body mass index (BMI), body surface area (BSA), estimated appendicular skeletal muscle mass (ASM),19 and estimated ASM index (SMI) were calculated for each patient. Values of ventilatory anaerobic threshold (VAT) estimated by age and peak V̇O2 estimated by age were calculated:20
Table 1. The 33 Candidate Predictors for the Machine-Learning Model
Category | Variables |
---|---|
Clinical data | Age, sex, height, weight, BMI, BSA, estimated ASM, estimated SMI, ICM/non-ICM, estimated VAT, estimated peak V̇O2 |
Laboratory data | Hb, Na, Cr, BNP, CONUT score, PNI, GNRI |
TTE data | LVEDd, LVESd, LVEF, LAd, LAVi |
Chest X-ray | CTR |
CPX data | Sinus rhythm, AF/AFL/AT rhythm, pacing rhythm, HR at rest, V̇O2 at rest, HR at R1, V̇O2 at R1, load at R1 |
Parameters obtained from laboratory data, TTE data, chest X-ray, and submaximal exercise test data were selected as candidate predictors for exercise capacity, prognosis, and severity of heart failure. AF/AFL/AT, atrial fibrillation, atrial flutter, or atrial tachycardia; ASM, appendicular skeletal muscle mass; BMI, body mass index; BNP, brain natriuretic peptide; BSA, body surface area; CONUT, controlling nutritional status; CPX, cardiopulmonary exercise test; Cr, creatinine; CTR, cardiothoracic ratio; GNRI, geriatric nutritional risk index; Hb, hemoglobin; HR, heart rate; ICM, ischemic cardiomyopathy; LAd, left atrial diameter; LAVi, left atrial volume index; LVEDd, left ventricular end-diastolic diameter; LVEF, left ventricular ejection fraction; LVESd, left ventricular end-systolic diameter; Na, sodium; Peak V̇O2, peak oxygen uptake; PNI, prognostic nutritional index; R1, respiratory gas exchange ratio=1; SMI, skeletal muscle mass index; TTE, transthoracic echocardiography; VAT, ventilatory anaerobic threshold; V̇O2, oxygen uptake.
estimated VAT = −0.100 × Age + 21.44 (male)
estimated VAT = −0.069 × Age + 19.35 (female)
estimated peak V̇O2 = −0.272 × Age + 42.29 (male)
estimated peak V̇O2 = −0.196 × Age + 35.58 (female)
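As a worked example, the age-based estimates can be written as two small functions. The coefficients are taken from the equations above; the function names and the string encoding of sex are illustrative assumptions.

```python
def estimated_vat(age, sex):
    """Age-predicted ventilatory anaerobic threshold, using the coefficients given above."""
    if sex == "male":
        return -0.100 * age + 21.44
    if sex == "female":
        return -0.069 * age + 19.35
    raise ValueError("sex must be 'male' or 'female'")

def estimated_peak_vo2(age, sex):
    """Age-predicted peak VO2, using the coefficients given above."""
    if sex == "male":
        return -0.272 * age + 42.29
    if sex == "female":
        return -0.196 * age + 35.58
    raise ValueError("sex must be 'male' or 'female'")

# Example: a 54-year-old man (54 years is the median age of the cohort).
vat_54m = estimated_vat(54, "male")        # -0.100 * 54 + 21.44
peak_54m = estimated_peak_vo2(54, "male")  # -0.272 * 54 + 42.29
```

Because both estimates are linear functions of age within each sex, they carry little information beyond age itself, which is relevant to the collinearity screening applied later.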
Actual values of VAT were not used because of pragmatic limitations. In particular, VAT could not be determined for several patients. As the exact determination of VAT is typically challenging, expert analysis is required to ensure its reproducibility.
Values of the controlling nutritional status (CONUT) score, prognostic nutritional index (PNI), and geriatric nutritional risk index (GNRI) were calculated from laboratory data. Patients with missing data for any variable were excluded from the analysis.
ML Model Construction
A supervised ML model was constructed to predict whether a patient belonged to Group L or Group P. First, the dataset including all patients was randomly divided into a training dataset for supervised learning and a testing dataset for model validation in a 4 : 1 ratio. Next, several supervised feature selection techniques were used to narrow down the critical predictors suitable for predictive model construction. Finally, the selected important predictors were used for supervised learning using support vector machines with the Gaussian kernel to construct the final predictive model (Figure 1).
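A random 4 : 1 split of this kind can be sketched as below. The seed, the function name, and the rounding rule (rounding the testing set size down) are assumptions; with 343 patients this rule happens to give the 275/68 partition reported in Table 2, but the article does not state how the split was implemented.

```python
import random

def split_train_test(n_patients, test_fraction=0.2, seed=0):
    """Randomly assign patient indices to training and testing sets (about 4:1)."""
    indices = list(range(n_patients))
    random.Random(seed).shuffle(indices)
    n_test = int(n_patients * test_fraction)  # assumption: round down
    return sorted(indices[n_test:]), sorted(indices[:n_test])

train_idx, test_idx = split_train_test(343)
```

All subsequent predictor selection and tuning would use only `train_idx`, so that `test_idx` remains untouched until the final validation.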
Figure 1. Protocol of machine-learning model construction and validation, showing process flow of validation and feature selection in the construction of the machine-learning model. The training dataset was used to select important predictors, which were then used to construct an optimized predictive model. Details in Supplementary Figure 1. The accuracy of the constructed predictive model was validated on the testing dataset. Shapley values were used to interpret the model.
Given that the use of an exhaustive search method to select important predictors from all candidate variables is time-consuming, filter and wrapper methods were used to narrow down the candidate pool in advance (Supplementary Methods, Supplementary Figure 1).
Step 1: Predictor Exclusion
First, predictors with P values exceeding 0.1, as obtained by statistical analyses, were excluded from the candidates. Subsequently, using the permutated feature importance (PFI) values calculated by random forest algorithms, predictors only weakly related to exercise intolerance or exhibiting high collinearity with other predictors were excluded (Supplementary Methods).
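The idea behind permutation-based importance can be illustrated with a model-agnostic sketch: shuffle one predictor at a time and record how much the accuracy drops. This is not the random forest procedure used in the study (whose importance values are on a different scale); the toy data, the threshold classifier, and all names below are hypothetical.

```python
import random

def accuracy(predict, rows, labels):
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, n_features, n_repeats=20, seed=0):
    """Mean drop in accuracy when a single feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy data: column 0 drives the label, column 1 is ignored by the classifier.
rows = [[7.0, 1], [7.5, 0], [8.0, 1], [10.0, 0], [11.0, 1], [12.0, 0]]
labels = ["L", "L", "L", "P", "P", "P"]
predict = lambda r: "L" if r[0] <= 8.5 else "P"
importances = permutation_importance(predict, rows, labels, n_features=2)
```

A predictor the model does not rely on yields an importance near zero, and a negative value means that shuffling did not hurt (or even helped), which is the rationale for discarding such predictors.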
Step 2: Narrowing-Down the Predictors
The wrapper method was used to further narrow down the most influential predictors. To extract various correlations, we used 3 ML algorithms (Naïve Bayes, support vector machine, and decision tree), each of which performed forward sequential selection and backward sequential elimination (Supplementary Methods). The best set of predictors with the highest accuracy was selected for each algorithm, and the 3 sets were combined to identify the potential important predictors.
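Forward sequential selection can be sketched as a greedy loop that adds one predictor at a time and remembers the best-scoring subset seen along the way. The `score` callable stands in for a cross-validated accuracy of whichever algorithm is being wrapped; the toy scoring function and the predictor names `a` to `d` are purely illustrative.

```python
def forward_selection(candidates, score):
    """Greedy forward selection: at each step add the predictor giving the best score,
    and return the best subset encountered over all subset sizes."""
    selected, best_subset, best_score = [], [], float("-inf")
    remaining = list(candidates)
    while remaining:
        trials = [(score(selected + [c]), c) for c in remaining]
        step_score, choice = max(trials, key=lambda trial: trial[0])
        selected.append(choice)
        remaining.remove(choice)
        if step_score > best_score:
            best_subset, best_score = list(selected), step_score
    return best_subset, best_score

# Hypothetical score: two informative predictors and a small cost per added predictor.
GAIN = {"a": 0.10, "b": 0.05}

def toy_score(subset):
    return 0.6 + sum(GAIN.get(name, 0.0) for name in subset) - 0.01 * len(subset)

best_subset, best_score = forward_selection(["a", "b", "c", "d"], toy_score)
```

Backward sequential elimination mirrors this loop: it starts from the full set and removes, at each step, the predictor whose removal gives the best score.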
Step 3: Selection of Important Predictors and Model Optimization
After wrapper method selection, all combinations of the N candidate predictors were compared. For each of the 2^N − 1 predictor combinations, a predictive model was constructed through supervised learning using a support vector machine incorporating the Gaussian kernel method. Among the predictive models with k predictors (k=1, 2, ..., N), the model with the highest area under the receiver-operating characteristic curve (AUC) was subjected to model optimization, and the model with the highest accuracy and AUC among the N optimized models was selected as the optimized predictive model (Supplementary Methods). The predictors used in the optimized predictive model were considered important predictors.
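The exhaustive search amounts to enumerating every non-empty subset of the N candidates and keeping the best-scoring subset of each size. In the sketch below, `score` is a placeholder for the cross-validated AUC of a model trained on a given subset; the function names and the toy score are assumptions.

```python
from itertools import combinations

def all_subsets(predictors):
    """Yield every non-empty combination of predictors (2^N - 1 in total)."""
    for k in range(1, len(predictors) + 1):
        yield from combinations(predictors, k)

def best_subset_per_size(predictors, score):
    """For each subset size k, keep the subset with the highest score."""
    best = {}
    for subset in all_subsets(predictors):
        s = score(subset)
        k = len(subset)
        if k not in best or s > best[k][1]:
            best[k] = (subset, s)
    return best

# With N = 14 candidates the search covers 2^14 - 1 subsets.
n_models = sum(1 for _ in all_subsets(list(range(14))))

# Toy score (mean of the values) just to exercise the bookkeeping.
toy_best = best_subset_per_size([1, 2, 3], lambda s: sum(s) / len(s))
```

The number of subsets doubles with every added candidate, which is why the filter and wrapper steps are used to shrink the candidate pool first.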
In the wrapper method, exhaustive search, and model optimization, the 10-fold cross-validation method was applied to the iterative analysis.
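A 10-fold cross-validation partition can be generated as in the following sketch. The shuffling seed and the way leftover samples are distributed across folds are assumptions; the article does not describe these details.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most 1
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

# Example with the 275 patients of the training dataset.
splits = list(k_fold_indices(275))
```

Each candidate model is then fitted on `training` and scored on `validation` ten times, and the ten scores are averaged.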
Step 4: Validation
Using the receiver-operating characteristic (ROC) curve of the training dataset, we selected an optimal probability threshold and applied the same threshold to the testing dataset to determine the accuracy, sensitivity, specificity, and F1 score for the testing dataset. We also evaluated another threshold for a false-negative rate <5% for screening use in clinical practice.
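The two thresholds can be illustrated with a short sketch. The article does not state which criterion defines the "optimal" point on the ROC curve; maximizing Youden's J (sensitivity + specificity − 1) is assumed here. The screening threshold is taken as the highest probability cut-off that keeps the false-negative rate below 5%. The scores and labels are invented toy values.

```python
def sens_spec(scores, labels, threshold):
    """labels: 1 = Group L (positive), 0 = Group P. Positive if score >= threshold."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    tn = sum(s < threshold and y == 0 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    return tp / (tp + fn), tn / (tn + fp)

def optimal_threshold(scores, labels):
    """Assumption: 'optimal' means the threshold maximizing Youden's J."""
    def youden(t):
        sensitivity, specificity = sens_spec(scores, labels, t)
        return sensitivity + specificity - 1
    return max(sorted(set(scores)), key=youden)

def screening_threshold(scores, labels, max_fnr=0.05):
    """Highest threshold whose false-negative rate (1 - sensitivity) is below max_fnr."""
    ok = [t for t in set(scores) if 1 - sens_spec(scores, labels, t)[0] < max_fnr]
    return max(ok)

# Toy predicted probabilities of belonging to Group L.
scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.42, 0.45, 0.70, 0.80, 0.90]
labels = [0, 0, 1, 0, 0, 0, 1, 1, 1, 1]
t_opt = optimal_threshold(scores, labels)
t_screen = screening_threshold(scores, labels)
```

Both thresholds are fixed on the training data and then applied unchanged to the testing data, so the testing metrics are not tuned after the fact.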
Analysis of the Optimized Predictive Model
The model validity was assessed by calculating the average of the absolute Shapley values, representing the contribution of each important predictor in the optimized predictive model.
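For a model with only a handful of predictors, Shapley values can be computed exactly by enumerating coalitions, as sketched below. The convention used here, replacing predictors outside a coalition with a single background (reference) value, is an assumption; the study's software may use a different value function. The linear toy model is included only because its Shapley values are easy to verify by hand.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, background):
    """Exact Shapley values for one sample x.
    Assumption: features outside a coalition take the background value."""
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else background[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def mean_absolute_shapley(predict, samples, background):
    """Average of |Shapley value| per predictor over a set of samples."""
    per_sample = [shapley_values(predict, x, background) for x in samples]
    n = len(background)
    return [sum(abs(p[i]) for p in per_sample) / len(per_sample) for i in range(n)]

toy_model = lambda z: 2 * z[0] + 3 * z[1] - z[2]
phi = shapley_values(toy_model, [1, 1, 1], [0, 0, 0])
mean_abs = mean_absolute_shapley(toy_model, [[1, 1, 1], [-1, 0, 2]], [0, 0, 0])
```

The Shapley values of a sample sum to the difference between the model output for that sample and for the background, which makes the averaged absolute values comparable across predictors.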
Statistical Analysis
Continuous variables with a Gaussian distribution are presented as mean±standard deviation, whereas those with other distributions are presented as median (first–third quartile). Categorical variables are presented as the number (percentage) of patients with the corresponding attribute. Depending on the result of the Anderson-Darling test for normality, values deviating beyond the mean ± 2 × standard deviation, or beyond the first quartile − 1.5 × interquartile range and the third quartile + 1.5 × interquartile range, were clamped to these threshold values. P values for continuous variables were obtained using the Wilcoxon rank-sum test, and those for categorical variables using the chi-square test, for comparisons between groups P and L and between the training and testing datasets. High collinearity between any 2 predictors was defined as a Spearman’s rank correlation coefficient (r) ≥0.9. Measurements of model performance included the values of AUC, accuracy, sensitivity, specificity, and the F1 score. Subgroup analyses of age (<60 vs. ≥60 years), sex, etiology, pacemaker implantation, BMI (<18.5 vs. ≥18.5 and <25 vs. ≥25 kg/m2), B-type natriuretic peptide (BNP) (<200 vs. ≥200 pg/mL), and LVEF (<30% vs. ≥30%) categories were performed. A two-tailed P value <0.05 was considered statistically significant. MATLAB® R2021b was used for ML and statistical analyses.
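The clamping of extreme values can be sketched as follows. The normality decision is passed in as a flag because the Anderson-Darling test is not part of the Python standard library, and the quartile convention of `statistics.quantiles` may differ from the one used in MATLAB; both points are assumptions of this illustration.

```python
import statistics

def clamp_outliers(values, gaussian):
    """Clamp values to mean +/- 2 SD (Gaussian variables) or to
    [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] (other variables)."""
    if gaussian:
        mean = statistics.mean(values)
        sd = statistics.stdev(values)
        low, high = mean - 2 * sd, mean + 2 * sd
    else:
        q1, _, q3 = statistics.quantiles(values, n=4)
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [min(max(v, low), high) for v in values]

# Toy example: one extreme value is pulled back to the upper fence.
clamped = clamp_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100], gaussian=False)
```

Clamping keeps every patient in the dataset while limiting the leverage of extreme measurements on the fitted models.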
Table 2 lists the clinical characteristics of the studied cohort. Compared with Group P, Group L exhibited lower V̇O2 and workload at R1, suggesting an early shift to anaerobic metabolism at a lower load. The median peak V̇O2 values were 12.3 (10.7–13.1) and 17.8 (15.9–20.2) mL/min/kg for Groups L and P, respectively. Neither the proportion of patients in Group L nor the peak V̇O2 differed between the training and testing datasets.
Table 2. Baseline Characteristics of Patients According to Their Exercise Capacity
Variable | Entire cohort (n=343) | Group P (n=205) | Group L (n=138) | P value (P vs. L) | Training data set (n=275) | Testing data set (n=68) | P value (training vs. testing) |
---|---|---|---|---|---|---|---|
Clinical data | |||||||
Age (years) | 54 (45–64) | 51 (42–61) | 59 (51–67) | <0.001 | 54 (45–64) | 55 (43–64) | 0.967 |
Male, n (%) | 278 (81) | 170 (82.9) | 108 (78.3) | 0.280 | 223 (81.1) | 55 (80.9) | 0.969 |
BMI (kg/m2) | 23.3±3.8 | 23.1±3.7 | 23.6±3.9 | 0.197 | 23.4±3.8 | 23.2±3.6 | 0.682 |
BSA (m2) | 1.73±0.18 | 1.73±0.18 | 1.74±0.19 | 0.529 | 1.73±0.19 | 1.73±0.17 | 0.960 |
NYHA, n (%) | |||||||
I | 61 (17.8) | 53 (25.9) | 8 (5.8) | <0.001 | 49 (17.8) | 12 (17.6) | 0.974 |
II | 216 (63) | 133 (64.9) | 83 (60.1) | 0.373 | 172 (62.5) | 44 (64.7) | 0.741 |
III | 66 (19.2) | 19 (9.3) | 47 (34.1) | <0.001 | 54 (19.6) | 12 (17.6) | 0.709 |
Non-ICM, n (%) | 248 (72.3) | 157 (76.6) | 91 (65.9) | 0.031 | 196 (71.3) | 52 (76.5) | 0.391 |
AF/AFL/AT, n (%) | 24 (7.0) | 10 (4.9) | 14 (10.1) | 0.061 | 21 (7.6) | 3 (4.4) | 0.351 |
Pacing rhythm, n (%) | 52 (15.2) | 18 (8.8) | 34 (24.6) | <0.001 | 44 (16.0) | 8 (11.8) | 0.383 |
Pacemaker, n (%) | 103 (30.0) | 42 (20.5) | 61 (44.2) | <0.001 | 85 (30.9) | 18 (26.5) | 0.475 |
Hypertension, n (%) | 69 (20.1) | 44 (21.5) | 25 (18.1) | 0.448 | 54 (19.6) | 15 (22.1) | 0.655 |
Diabetes mellitus, n (%) | 78 (22.7) | 40 (19.5) | 38 (27.5) | 0.082 | 57 (20.7) | 21 (30.9) | 0.074 |
Smoker, n (%) | 27 (7.9) | 21 (10.2) | 6 (4.3) | 0.047 | 22 (8.0) | 5 (7.4) | 0.859 |
β-blockers, n (%) | 320 (93.3) | 188 (91.7) | 132 (95.7) | 0.152 | 258 (93.8) | 62 (91.2) | 0.435 |
ACEi/ARBs, n (%) | 297 (86.6) | 180 (87.8) | 117 (84.8) | 0.421 | 233 (84.7) | 64 (94.1) | 0.042 |
Statins, n (%) | 139 (40.5) | 65 (31.7) | 74 (53.6) | <0.001 | 111 (40.4) | 28 (41.2) | 0.903 |
MRAs, n (%) | 224 (65.3) | 129 (62.9) | 95 (68.8) | 0.259 | 181 (65.8) | 43 (63.2) | 0.689 |
Diuretics, n (%) | 254 (74.1) | 133 (64.9) | 121 (87.7) | <0.001 | 206 (74.9) | 48 (70.6) | 0.467 |
Laboratory data | |||||||
Hemoglobin (g/dL) | 13.6±1.6 | 13.9±1.6 | 13.1±1.6 | <0.001 | 13.6±1.7 | 13.8±1.6 | 0.263 |
Na (mEq/L) | 139 (137–141) | 140 (138–141) | 138 (136–140) | <0.001 | 139 (137–141) | 139 (138–141) | 0.739 |
Cr (mg/dL) | 0.95 (0.82–1.14) | 0.93 (0.79–1.05) | 1.06 (0.88–1.33) | <0.001 | 0.96 (0.84–1.13) | 0.93 (0.78–1.21) | 0.642 |
BNP (pg/mL) | 148 (63–281) | 97 (41–214) | 226 (127–405) | <0.001 | 150 (78–290) | 90 (53–251) | 0.027 |
CONUT score ≥2, n (%) | 144 (42.0) | 65 (31.7) | 79 (57.2) | <0.001 | 121 (44.0) | 23 (33.8) | 0.128 |
PNI | 49.6±4.9 | 50.6±4.7 | 48.1±4.8 | <0.001 | 49.5±5.0 | 50.0±4.5 | 0.447 |
GNRI | 104.5±9.5 | 105.0±9.5 | 103.8±9.4 | 0.253 | 104.5±9.7 | 104.9±8.6 | 0.747 |
TTE data | |||||||
LVEDd (mm) | 66.1±11.3 | 64.6±10.5 | 68.3±12.0 | 0.003 | 65.9±11.2 | 66.7±11.5 | 0.587 |
LVESd (mm) | 58.4±12.8 | 56.2±12.2 | 61.5±13.1 | <0.001 | 58.3±12.8 | 58.7±12.8 | 0.834 |
LVEF (%) | 29 (21–35) | 31 (24–37) | 25 (20–33) | <0.001 | 29 (21–35) | 30 (23–35) | 0.819 |
LAd (mm) | 45±9 | 43±8 | 48±9 | <0.001 | 45±8 | 46±9 | 0.554 |
LAVi (mL/m2) | 48.1 (35.2–63.1) | 42.7 (31.7–57.9) | 54.8 (42.4–69.8) | <0.001 | 47.0 (33.7–62.1) | 55.2 (41.4–68.1) | 0.013 |
MR moderate or more, n (%) | 49 (14.3) | 27 (13.2) | 22 (15.9) | 0.472 | 37 (13.5) | 12 (17.6) | 0.376 |
Chest X-ray | |||||||
CTR (%) | 53 (49–57) | 51 (48–55) | 55 (52–59) | <0.001 | 53 (49–57) | 52 (50–57) | 0.997 |
CPX data | |||||||
HR at rest (bpm) | 75 (70–83) | 76 (70–84) | 75 (69–81) | 0.235 | 75 (70–82) | 75 (69–85) | 0.607 |
V̇O2 at rest (mL/min/kg) | 3.6 (3.2–3.9) | 3.7 (3.4–4.1) | 3.4 (3.1–3.7) | <0.001 | 3.5 (3.2–3.9) | 3.6 (3.2–4.0) | 0.757 |
HR at R1 (bpm) | 97 (85–108) | 101 (91–111) | 90 (79–102) | <0.001 | 96 (85–107) | 101 (83–111) | 0.379 |
V̇O2 at R1 (mL/min/kg) | 9.5 (7.9–11.3) | 10.8 (9.4–12.9) | 7.8 (6.5–8.9) | <0.001 | 9.4 (7.8–11.3) | 9.7 (8.3–11.4) | 0.259 |
Workload at R1 (W) | 50 (39–63) | 57 (46–76) | 41 (31–50) | <0.001 | 50 (38–63) | 51 (44–64) | 0.149 |
Peak V̇O2 (mL/min/kg) | 15.4 (12.6–18.3) | 17.8 (15.9–20.2) | 12.3 (10.7–13.1) | – | 15.3 (12.5–18.2) | 15.6 (13.1–19.3) | 0.338 |
Peak RER | 1.29 (1.23–1.35) | 1.29 (1.23–1.34) | 1.30 (1.23–1.37) | 0.243 | 1.29 (1.23–1.36) | 1.28 (1.24–1.33) | 0.265 |
Group L, n (%) | 138 (40.2) | – | – | – | 111 (40.4) | 27 (39.7) | 0.921 |
Values are presented as mean±SD, median (1st–3rd quartile), or n (%). ACEi, angiotensin-converting enzyme inhibitor; ARB, angiotensin II receptor blocker; BMI, body mass index; BNP, B-type natriuretic peptide; NYHA, New York Heart Association functional classification; MR, mitral regurgitation; MRA, mineralocorticoid receptor antagonist. Other abbreviations as in Table 1.
Excluded Predictors for Constructing Models
In the analysis of the training dataset consisting of 275 patients, predictors with P values >0.1 between groups L and P were excluded, including sex, height, weight, BMI, BSA, estimated ASM, estimated SMI, heart rate at rest, presence of ischemic cardiomyopathy, and presence of arrhythmia, including atrial fibrillation. Subsequently, a random forest model was constructed using all remaining predictors, and their PFI values were calculated (Figure 2).
Figure 2. Permutated feature importance (PFI) in a random forest model. The bar graph shows the PFI value of each predictor, shown in red on the right-hand side. V̇O2 at R1 had the highest PFI value (i.e., it was the factor that contributed the most to the prediction of exercise capacity in this random forest model). Only CTR had negative PFI values and was thus excluded from the candidate pool. BNP, B-type natriuretic peptide; CONUT, controlling nutritional status; Cr, creatinine; CTR, cardiothoracic ratio; GNRI, geriatric nutritional risk index; Hb, hemoglobin; HR, heart rate; LAd, left atrial diameter; LAVi, left atrial volume index; LVEDd, left ventricular end-diastolic diameter; LVEF, left ventricular ejection fraction; Na, sodium; PNI, prognostic nutritional index; R1, respiratory gas-exchange ratio=1; V̇O2, oxygen uptake.
The estimated VAT by age (eVAT) and estimated peak V̇O2 by age (ePV̇O2) were excluded because of their strong collinearity with age (eVAT: r=0.986, ePV̇O2: r=0.935) and their lower PFI values than that of age (age: 0.603, eVAT: 0.465, ePV̇O2: 0.415). Left ventricular end-systolic diameter (LVESd) was also excluded because of its strong collinearity with left ventricular end-diastolic diameter (LVEDd, r=0.966) and its lower PFI value (LVEDd: 0.446, LVESd: 0.218). The presence of pacing rhythm exhibited a negative PFI and was thus excluded. The cardiothoracic ratio was also excluded because of its negative PFI value. The remaining 17 predictors were included in the next step of the model construction.
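The collinearity rule applied here can be checked with the values reported in this paragraph: within each highly correlated pair (r ≥0.9), the predictor with the lower PFI is dropped. The dictionary keys and the function name are illustrative, and taking the absolute value of r is an assumption.

```python
# PFI values and correlation coefficients as reported in the text.
pfi = {"age": 0.603, "eVAT": 0.465, "ePVO2": 0.415, "LVEDd": 0.446, "LVESd": 0.218}
collinear_pairs = [
    ("age", "eVAT", 0.986),
    ("age", "ePVO2", 0.935),
    ("LVEDd", "LVESd", 0.966),
]

def drop_collinear(pfi, pairs, r_threshold=0.9):
    """Within each pair with |r| >= r_threshold, drop the predictor with the lower PFI."""
    dropped = set()
    for a, b, r in pairs:
        if abs(r) >= r_threshold:
            dropped.add(a if pfi[a] < pfi[b] else b)
    return dropped

dropped = drop_collinear(pfi, collinear_pairs)
```

Applied to the reported values, the rule removes eVAT, ePV̇O2, and LVESd while retaining age and LVEDd, in agreement with the exclusions described above.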
Narrowed-Down Predictors for Constructing Models
Using the wrapper method on the 17 predictors, we obtained the best-performing predictor combination for each of the 3 algorithms (Table 3). For instance, in the decision tree algorithm, the highest accuracy was achieved using 8 predictors selected using the forward sequential wrapper method (Figure 3). By combining these predictors from the 3 algorithms, 14 predictors were selected as candidates: V̇O2 at R1, hemoglobin (Hb), age, LVEF, LVEDd, BNP, left atrial volume index, left atrial diameter, CONUT score, sodium concentration, sinus rhythm, V̇O2 at rest, workload at R1, and creatinine. These predictors were used in the next step of the model construction.
Table 3. Predictor Combination for Each Algorithm, Obtained Using Wrapper Methods
Algorithm | Selected predictors |
---|---|
Decision tree | V̇O2 at R1, LAVi, Hb, CONUT score, Sinus rhythm, LVEF, Load at R1, Cr |
Support vector machine | V̇O2 at R1, BNP, LAd, Na, Hb, Age, V̇O2 at rest |
Naïve Bayes | V̇O2 at R1, LVEDd, Age, LVEF |
Candidates | V̇O2 at R1, Hb, Age, LVEF, LVEDd, BNP, LAVi, LAd, CONUT score, Na, Sinus rhythm, V̇O2 at rest, Workload at R1, Cr |
The best-performing predictor combination was selected using forward/backward sequential wrapper methods with accuracy as the evaluation function. The table lists the best-performing combination for each algorithm. All candidates selected by each algorithm were combined, and the 14 predictors were the candidates for the exhaustive search. Abbreviations as in Table 1.
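The pooling of the three selections into 14 candidates is a simple set union, which can be verified from the table entries. The list names and the helper `combine` are illustrative; the predictor labels are copied from the table.

```python
decision_tree = ["VO2 at R1", "LAVi", "Hb", "CONUT score", "Sinus rhythm",
                 "LVEF", "Load at R1", "Cr"]
support_vector_machine = ["VO2 at R1", "BNP", "LAd", "Na", "Hb", "Age", "VO2 at rest"]
naive_bayes = ["VO2 at R1", "LVEDd", "Age", "LVEF"]

def combine(*selections):
    """Order-preserving union of the predictor lists."""
    combined = []
    for selection in selections:
        for name in selection:
            if name not in combined:
                combined.append(name)
    return combined

candidates = combine(decision_tree, support_vector_machine, naive_bayes)
```

V̇O2 at R1 is the only predictor chosen by all 3 algorithms; Hb, age, and LVEF are each chosen by 2.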
Figure 3. Sequential wrapper methods with the decision tree algorithm. The plots show the accuracy of the cross-validated decision tree models constructed using this method when the predictors are sequentially added or eliminated. The evaluation function in the wrapper method was the misclassification error (MCE). When predictors were sequentially added, the accuracy was the highest when the number of selected predictors was 8 (black marker, Left). When predictors were sequentially eliminated, the accuracy was the highest after 9 predictors had been eliminated (i.e., when the number of retained predictors was 8 [black marker, Right]). The model based on the forward wrapper method was more accurate, and therefore, the 8 predictors of this model were considered to be the most effective for constructing the decision tree model.
Constructed Model
Cross-validated support vector machine models were constructed for all possible combinations of the 14 predictors, resulting in a total of 16,383 models. The corresponding AUCs were compared. Table 4 outlines the combination with the best AUC for each number of predictors k (k=1, 2, …, 14).
Table 4. Results of Exhaustive Search and Model Optimization
Predictors | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
V̇O2 at R1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Age | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||
BNP | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||
Hb | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||||
V̇O2 at rest | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||||
CONUT score | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||||||
LVEF | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||||||
Na | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||||||
LAVi | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||||||||
Cr | ✓ | ✓ | ✓ | ✓ | ✓ | |||||||||
LAd | ✓ | ✓ | ✓ | ✓ | ✓ | |||||||||
LVEDd | ✓ | ✓ | ✓ | ✓ | ||||||||||
Load at R1 | ✓ | ✓ | ✓ | ✓ | ||||||||||
Sinus rhythm | ✓ | ✓ | ✓ | |||||||||||
Performance | ||||||||||||||
ACC | 81.1 | 82.2 | 83.3 | 84.7 | 85.5* | 84.4 | 85.1 | 85.5 | 84.4 | 83.6 | 84.4 | 82.9 | 82.9 | 83.6 |
AUC | 0.887 | 0.898 | 0.911 | 0.907 | 0.928* | 0.924 | 0.924 | 0.919 | 0.918 | 0.918 | 0.914 | 0.912 | 0.918 | 0.915 |
Sensitivity | 78 | 86 | 86 | 83 | 86* | 90 | 84 | 86 | 86 | 86 | 91 | 84 | 88 | 84 |
Specificity | 84 | 80 | 82 | 87 | 86* | 82 | 87 | 85 | 85 | 84 | 82 | 85 | 80 | 85 |
F1 score | 0.759 | 0.774 | 0.795 | 0.813 | 0.823* | 0.805 | 0.816 | 0.82 | 0.804 | 0.795 | 0.807 | 0.789 | 0.791 | 0.795 |
ACC × AUC | 0.719 | 0.738 | 0.758 | 0.769 | 0.793* | 0.78 | 0.787 | 0.785 | 0.774 | 0.768 | 0.771 | 0.756 | 0.761 | 0.766 |
*The model performance of the best predictive model. The exhaustive search revealed the best combinations of k predictors (k=1, 2, 3, ... 14). (Upper) Each column represents the selected predictors with the highest AUC, as obtained by the exhaustive search among the combinations of k predictors. (Lower) Performance metrics of the model optimized with cross-validation for each combination. Among these combinations, the combination of 5 predictors provided the best predictive model with the highest accuracy and AUC. ACC, accuracy; ACC × AUC, ACC (as a fraction) multiplied by AUC; AUC, area under the receiver-operating characteristic curve. Other abbreviations as in Table 1.
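The combined criterion in the last row of the table can be recomputed from the ACC and AUC rows, as in the sketch below. Because the tabulated ACC and AUC values are themselves rounded, a recomputed product may differ from the printed one in the third decimal place; the variable names are illustrative.

```python
# ACC (%) and AUC for k = 1..14 predictors, copied from the table.
acc = [81.1, 82.2, 83.3, 84.7, 85.5, 84.4, 85.1, 85.5, 84.4, 83.6, 84.4, 82.9, 82.9, 83.6]
auc = [0.887, 0.898, 0.911, 0.907, 0.928, 0.924, 0.924, 0.919,
       0.918, 0.918, 0.914, 0.912, 0.918, 0.915]

# Product of accuracy (as a fraction) and AUC for each k.
acc_auc = [a / 100 * u for a, u in zip(acc, auc)]

# Number of predictors with the largest product.
best_k = max(range(1, len(acc) + 1), key=lambda k: acc_auc[k - 1])
```

The product peaks at k=5, the same model that has the highest AUC and shares the highest accuracy with the 8-predictor model.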
After model optimization with hyperparameter tuning, the model constructed using a combination of 5 predictors was identified as the best model for predicting exercise intolerance (Table 4). This optimized predictive model exhibited AUC, accuracy, sensitivity, specificity, and F1 score values of 0.93, 85%, 0.86, 0.86, and 0.82, respectively (Figure 4A). The predictive model with the threshold for screening use also exhibited good accuracy, sensitivity, specificity, and F1 score (80%, 0.95, 0.71, and 0.69, respectively). The following predictors were identified as the most important: age, BNP obtained from laboratory data, LVEDd obtained from echocardiographic data, V̇O2 at rest, and V̇O2 at R1 obtained from CPX data.
Figure 4. Performance of the optimized model for predicting low exercise capacity in heart failure. (A) ROC curve of the optimized predictive model on the training dataset, with an AUC of 0.93. The red point shows the optimal probability threshold (0.43649), and the green point shows the threshold for screening use (0.18988). (B) ROC curve of the testing dataset, with an AUC of 0.94. Based on the optimal threshold, the accuracy was 85%, sensitivity was 0.78, and specificity was 0.90. (C,D) Average Shapley value for each important predictor in the optimized SVM model. These contributions were highly similar for the training (C) and testing (D) datasets. The Shapley value of V̇O2 at R1 was the highest among the 5 important predictors. AUC, area under the curve; BNP, B-type natriuretic peptide; LVEDd, left ventricular end-diastolic diameter; R1, respiratory gas exchange ratio=1; ROC, receiver operating characteristic; SVM, support vector machine; V̇O2, oxygen uptake.
Validation of the Optimized Predictive Model
When validated on the testing dataset, the model with the optimal threshold achieved an AUC of 0.94 (95% confidence interval: 0.89–1.00), accuracy of 85%, sensitivity of 0.78, specificity of 0.90, and F1 score of 0.81, comparable with the performance metrics on the training dataset (Figure 4B, Supplementary Figure 2A). The model with the threshold for screening use achieved an accuracy of 90%, sensitivity of 1.00, specificity of 0.83, and F1 score of 0.89, indicating that the model is suitable for screening high-risk patients with exercise intolerance (Supplementary Figure 2B). Subgroup analyses of the predictive model were performed for age, sex, etiology, pacemaker implantation, BMI, BNP, and LVEF categories, and no significant difference was observed between any pair of subgroups (Supplementary Table).
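The testing-set metrics can be cross-checked with simple arithmetic. Table 2 gives 27 Group L patients among the 68 testing patients, which leaves 41 in Group P. The confusion-matrix counts used below are not reported in the article; they are back-calculated assumptions, chosen because they reproduce the reported values after rounding.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity, specificity, and F1 score from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

# Assumed counts (27 Group L, 41 Group P); inferred, not taken from the article.
optimal = metrics(tp=21, fp=4, fn=6, tn=37)     # optimal threshold
screening = metrics(tp=27, fp=7, fn=0, tn=34)   # threshold for screening use
```

Under these assumed counts, lowering the threshold for screening use trades 3 additional false positives for the elimination of all 6 false negatives.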
Contribution Analysis of Predictors in Optimized Predictive Model
The optimized predictive model was analyzed by calculating the average of the absolute Shapley values (Figure 4C,D). The contribution of each important feature was similar over the training and testing datasets. The contribution of V̇O2 at R1 was the highest among the 5 important predictors, followed by BNP; the average effect on the posterior probability was 20% and 10%, respectively. Comparable values were observed for the other 3 factors: LVEDd, age, and V̇O2 at rest.
This study represents the first successful attempt at using conventional ML algorithms to develop a highly accurate classification model for predicting exercise intolerance in patients with chronic HF using simple parameters commonly obtained during routine clinical practice and exercise tests up to submaximal loads. The proposed predictive model can serve as a screening tool and/or a valuable reference tool for treatment decisions regarding prognosis prediction and indications for cardiac transplantation in patients with chronic HF. This approach is particularly advantageous because it can mitigate the risk of cardiac overload attributable to maximal workload, as it requires exercise data only up to submaximal workload. In this context, the proposed model offers a safer alternative, not only for high-risk patients but also for patients with concerns about the negative effects of excessive load on their heart.
Practicality of the Predictive Model
Several predictive models for lower exercise capacity in patients with HF have been reported, based on the severity of symptoms and BNP21 or on standard echocardiographic parameters.22 However, those models used higher thresholds of exercise capacity than the threshold associated with prognosis,2,7,8 which lies beyond the pragmatic threshold for patients with chronic HF. Predictive models using prognosis-related thresholds have also been reported, but they have not achieved a practical level of accuracy, with AUCs typically ranging from 0.66 to 0.81.23 Compared with the existing models, our model used a threshold of peak V̇O2=14 mL/min/kg, a standard threshold relevant to prognosis in HF, and exhibited high accuracy through the incorporation of readily available submaximal exercise parameters. One reason for this high accuracy may be the focus on classification models rather than regression models, although the clinical value of the 2 types of models is different. V̇O2 at R1 is observable in almost all patients when testing up to submaximal workloads and thus can be easily analyzed, unlike VAT. This highlights the practicality of the proposed model, and it is assumed that appropriate probability thresholds will be set depending on the application case and clinical purpose.
ML Model Interpretation
The support vector machine using the Gaussian kernel method, a nonlinear ML model, can be used to construct a model for accurately predicting exercise capacity in healthy subjects if the predictors are carefully selected.12,14 In this study, we demonstrated that this method, combined with careful selection of important predictors, can precisely predict exercise intolerance even in patients with HF. Nevertheless, the interpretation of ML models remains challenging because of their complex model structure. Recent advancements in explainable artificial intelligence techniques, such as Shapley values,24 enable a more comprehensive analysis of ML models. For example, it has been reported that V̇O2 at R1, an index of exercise capacity at submaximal load, strongly correlates with maximal exercise capacity, as Shapley values have shown.25–27 Moreover, BNP and LVEDd, as predictors of HF severity, are related to exercise capacity.28,29 Although age is an established predictor of exercise capacity in healthy subjects, its Shapley value was inferior to those of BNP and LVEDd as indices of HF severity, which suggests that HF severity would be more influential than age in exercise tolerance in patients with HF, except for exercise performance at submaximal load. Resting V̇O2 has been reported to be lower in obese patients,30 and body fat percentage has been reported as a significant predictor in healthy individuals in addition to exercise parameters.31 Given that exercise capacity is weakly correlated with body size, such as height, weight, and estimated muscle mass, but resting V̇O2 has been noted to be strongly associated with it, body composition may be more strongly related to exercise capacity than body size.
Interestingly, certain factors known to be closely related to exercise capacity were not identified as important predictors in this study. For example, Hb, which is directly related to oxygen-carrying capacity and exercise capacity,32 was not selected. This may be attributable to the small variation in Hb levels in Group P, resulting in a smaller contribution than expected.
Muscle strength, measured muscle mass as body composition, VAT, and the minute ventilation/carbon dioxide production slope (V̇E vs. V̇CO2 slope) during exercise have been reported to strongly correlate with peak V̇O2. In this study, we used estimated muscle mass and did not include exercise parameters requiring waveform analysis. Incorporating these parameters could potentially improve the predictive accuracy; however, specialized analysis of the waveform is not as practical as using the parameters proposed in this study.
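The Shapley-value interpretation discussed in this section can be made concrete with a small sketch in Python. It computes exact Shapley values by enumerating all coalitions of predictors, which is feasible only for a handful of predictors. Several things here are assumptions for illustration and are not taken from the study: the toy model stands in for the trained support vector machine, absent predictors are replaced by a fixed baseline value, and all function names are hypothetical. How the Shapley values were estimated in the study is not described in this section.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Predictors outside a coalition are set to their baseline value
    (an assumption; other ways of 'removing' a predictor exist).
    """
    n = len(x)

    def value(coalition):
        z = [x[j] if j in coalition else baseline[j] for j in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def mean_abs_shapley(model, rows, baseline):
    """Average of absolute Shapley values over a dataset, per predictor."""
    totals = [0.0] * len(baseline)
    for x in rows:
        for i, phi in enumerate(shapley_values(model, x, baseline)):
            totals[i] += abs(phi)
    return [t / len(rows) for t in totals]


def toy_model(z):
    """Hypothetical stand-in for a trained model: one additive term, one interaction."""
    return 2.0 * z[0] + z[1] * z[2]


baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, [1.0, 2.0, 3.0], baseline)
print(phi)  # contributions sum to toy_model(x) - toy_model(baseline)
```

In the toy model, the additive term is credited entirely to the first predictor, and the interaction term is shared equally between the other two. Averaging the absolute values over all patients, as in `mean_abs_shapley`, gives one importance score per predictor that can be compared between training and testing data.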
Effective Protocol for Feature Selection
In this study, 3 algorithms were applied to refine the selection of important predictors. Although many ML algorithms exhibit nonlinear characteristics, the 3 selected algorithms use different types of learning methods: support vector machine uses error-based learning, Naïve Bayes adopts probability-based learning, and decision tree relies on information-based learning. By combining these algorithms, various types of nonlinear characteristics could be selected. The final combination of important predictors could not have been derived by any single algorithm, highlighting the effectiveness of adopting a combination of algorithms.
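The general idea of an exhaustive search over predictor subsets can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the protocol used in the study: a nearest-centroid classifier stands in for the support vector machine, Naïve Bayes, and decision tree learners; leave-one-out accuracy stands in for the cross-validation scheme; the data are toy values; and the way the results of the 3 algorithms were combined is not reproduced here. All function and variable names are hypothetical.

```python
from itertools import combinations


def nearest_centroid_predict(train_X, train_y, test_row):
    """Stand-in classifier: assign the class whose feature-wise mean is closest."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(test_row, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_accuracy(X, y, columns, predict):
    """Leave-one-out accuracy using only the given predictor columns."""
    correct = 0
    for i in range(len(X)):
        train_X = [[row[c] for c in columns] for j, row in enumerate(X) if j != i]
        train_y = [label for j, label in enumerate(y) if j != i]
        test_row = [X[i][c] for c in columns]
        correct += predict(train_X, train_y, test_row) == y[i]
    return correct / len(X)


def exhaustive_search(X, y, names, max_size, predict):
    """Score every predictor subset of size 1..max_size; return (score, names)."""
    best = (-1.0, ())
    for size in range(1, max_size + 1):
        for columns in combinations(range(len(names)), size):
            score = loo_accuracy(X, y, columns, predict)
            if score > best[0]:  # ties keep the earlier (smaller) subset
                best = (score, tuple(names[c] for c in columns))
    return best


# Toy data: the first column separates the classes, the second does not.
names = ["signal", "noise"]
X = [[1.0, 5.0], [1.2, 1.0], [0.9, 3.0], [3.0, 4.9], [3.1, 1.1], [2.9, 3.1]]
y = [1, 1, 1, 0, 0, 0]

print(exhaustive_search(X, y, names, 2, nearest_centroid_predict))
```

Because `predict` is passed in as a parameter, the same search could in principle be repeated with different learners and the selected subsets compared, which is the spirit of combining algorithms with different learning principles.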
Clinical Perspectives
The proposed predictive model can serve as a valuable reference tool when considering patients in NYHA functional class III/IV, which indicates poor prognosis and is one of the indications for cardiac transplantation in patients with chronic HF, especially at referring institutions. This approach is particularly advantageous because it can mitigate the risk of cardiac overload attributable to maximal workload by requiring exercise data only up to submaximal workload. In this context, the proposed model offers a safer alternative, not only for high-risk patients, but also for patients with concerns about the negative effects of excessive load on their heart.
Study Limitations
The clinical utility of the proposed predictive model needs to be further verified. In this study, we prevented overfitting through appropriate cross-validation during model learning and demonstrated the generalization performance by verifying the model’s accuracy over a testing dataset. In addition, although the average of the absolute Shapley values was used to analyze the model structure, the values were similar for both the training and testing datasets, and the order of their contributions was also sufficient for validation. These results highlight that the proposed model can be generalized to new datasets. However, as our analysis was based on a dataset of limited sample size from a single center, validation with additional data from other centers is desirable to ensure generalization performance for all patients with HF. To create more accurate models, HF patients with LVEF <50% were examined. Therefore, the versatility of our model in patients with HF and LVEF >50% or with Stage B HF also needs to be verified in the future. We have added the program files (for MATLAB®) as Supplementary Files (PredictLowPVO2fromEXCEL.m, PredictMdl.mat, and Example Table in the Supplementary Materials) so that the model can be extensively validated in patients with various types of HF in different hospitals. Parameters related to muscle and respiratory function, which may improve the accuracy of the prediction model, were not included for reasons of practicality. Peak V̇O2 measured by ergometer may be underestimated in patients with a pacemaker, but the predictabilities did not differ between patients with and without a pacemaker (0.87 vs. 0.83; P=0.319). Although low peak V̇O2 is a well-established indicator, the proposed classification model may be less versatile than a regression model that accurately predicts peak V̇O2 in real-world clinical practice. In future exploratory regression model construction, inclusion of the important predictors identified in this study may help enhance model accuracy.
The use of a conventional ML algorithm with an elaborate protocol enabled accurate prediction of exercise intolerance in patients with HF using parameters obtained in routine practice and test data from exercise up to moderate intensity.
We thank Editage (www.editage.jp) for English language editing.
This study was supported by a management expenses grant from Osaka University Hospital. The authors have no competing interests to declare.
The present study was approved by the Ethical Review Board of Osaka University Hospital (reference number: 19210).
Please find supplementary file(s):
https://doi.org/10.1253/circrep.CR-24-0135