Predicting outcomes in patients with aortic stenosis using machine learning: the Aortic Stenosis Risk (ASteRisk) score

Objective To use echocardiographic and clinical features to develop an explainable clinical risk prediction model in patients with aortic stenosis (AS), including those with low-gradient AS (LGAS), using machine learning (ML).

Methods In 1130 patients with moderate or severe AS, we used bootstrap lasso regression (BLR), an ML method, to identify echocardiographic and clinical features important for predicting the combined outcome of all-cause mortality or aortic valve replacement (AVR) within 5 years after the initial echocardiogram. A separate hold-out set from a different centre (n=540) was used to test the generalisability of the model. We also evaluated model performance with respect to each outcome separately and in different subgroups, including patients with LGAS.

Results Of the 69 available variables, 26 features were identified as predictive by BLR, and expert knowledge was used to further reduce this set to 9 easily available input features without loss of efficacy. A ridge logistic regression model constructed using these features had an area under the receiver operating characteristic curve (AUC) of 0.74 for the combined outcome of mortality/AVR. The model reliably identified patients at high risk of death in years 2–5 (HRs ≥2.0, upper vs other quartiles, for years 2–5, p<0.05; p=not significant in year 1) and was also predictive in the cohort with LGAS (n=383, HRs ≥3.3, p<0.05). The model performed similarly well in the independent hold-out set (AUC 0.78, HR ≥2.5 in years 1–5, p<0.05).

Conclusion In two separate longitudinal databases, ML identified prognostic features and produced an algorithm that predicts outcome for up to 5 years of follow-up in patients with AS, including patients with LGAS. Our algorithm, the Aortic Stenosis Risk (ASteRisk) score, is available online for public use.


INTRODUCTION
Aortic stenosis (AS) is increasingly prevalent and is associated with increased mortality, even when moderate in severity. 1 2 Identifying patients with AS who are at increased risk of death is challenging because of the complex interplay of multiple factors that determine risk. Conventional risk assessments use few echocardiographic criteria, namely aortic valve area, mean transvalvular gradient and peak transvalvular velocity. 3 This is clearly incomplete, 4 as two patients with identical valve gradients can have very different risks based on other interacting, unaccounted-for features.
WHAT IS ALREADY KNOWN ON THIS TOPIC
⇒ Current guideline criteria used to make clinical decisions in aortic stenosis (AS) are limited in number and are particularly challenging to apply in certain patient subgroups, namely low-gradient AS (LGAS). Machine learning (ML) models in medicine are traditionally challenging to apply at the bedside.

WHAT THIS STUDY ADDS
⇒ We show that application of an ML algorithm to combined echocardiographic and clinical data in patients with AS can provide good predictive capability for mortality and aortic valve replacement up to 5 years post echocardiography, and the algorithm performs well in the clinically challenging patient subgroup of LGAS. Furthermore, the algorithm developed is explainable and interpretable at a clinical level and uses few inputs that can be easily incorporated at the bedside.

HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE AND/OR POLICY
⇒ ML algorithms applied to echocardiographic and clinical data can yield valuable risk-predictive capability using few inputs at the bedside, extending the use of ML in AS beyond phenotyping and diagnosis and making use of available data that current guidelines for AS ignore.

Machine learning approaches can provide enhanced analysis of otherwise wasted data that are acquired as part of routine clinical care and can be used to estimate clinical risks. Such approaches applied to electronic health record (EHR) data have been used to accurately predict clinical outcomes. [5][6][7] Yet, findings are not always explainable and are not currently practically usable at the bedside. No study to date has evaluated clinical risk prediction in AS using machine learning analysis of combined echocardiographic and clinical data. We sought to develop an explainable model 8 that could be readily used at the bedside through a simple risk calculator requiring a few easily available input features. We also aimed to have a model that would be effective in the subgroup of patients with low-gradient aortic stenosis (LGAS), in whom clinical decision-making is extremely challenging. 9 10

METHODS

Primary cohort
This study analysed a previously described longitudinal 'loyalty' cohort of patients (n=1130) with moderate or greater severity AS, defined by echocardiographic aortic valve area ≤1.5 cm² and mean gradient ≥20 mm Hg. 11 All patients had longitudinal primary care at the Massachusetts General Hospital (MGH), Boston, Massachusetts, USA, with complete outcome follow-up until either death or the end of the study period (study period commencing 2006 and ending 31 December 2017) with no loss to follow-up. Patients with aortic dissection, coarctation, high left ventricular outflow tract velocity (≥1.6 m/s, which might indicate subaortic restriction to flow) or moderate or greater aortic or mitral regurgitation were excluded. One patient from the originally described cohort 11 was excluded from this study because the exact date of death was not available.

Data preprocessing
Patients in the primary cohort were described by a total of 316 features (62 clinical and 254 echocardiographic). We excluded features that had missing data in greater than 50% of the patients, leaving 69 features (online supplemental table 1). Of these remaining 69 features, the rate of missing data was very low overall, with most variables complete (median missing data per variable 0.15%, mode 0% and mean (skewed data) 5.6%). Categorical variables with more than two categories were binarised (ie, recoded to have only two categories). For example, aortic valve morphology was originally coded as tricuspid, bicuspid, vertical bicuspid or horizontal bicuspid; we grouped all bicuspid categories into a single category, leaving only two categories: bicuspid or tricuspid. We also binarised the aortic valve area (≤1.0 cm² coded as 1), the mean pressure gradient (≥40 mm Hg coded as 1) and the flow rate (≤210 mL/s coded as 1). All continuous features were min-max normalised to fall between 0 and 1, inclusive, so that all input variables were on the same scale, which helps in interpreting the coefficients of a logistic regression model.
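As a concrete illustration of these preprocessing steps, the sketch below binarises aortic valve morphology and the three threshold variables, and min-max normalises the continuous features. The column names and demo values are illustrative assumptions, not the study's actual field names.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # Collapse the multi-category morphology variable to binary:
    # all bicuspid variants become a single 'bicuspid' category.
    out["bicuspid"] = out["av_morphology"].isin(
        ["bicuspid", "vertical bicuspid", "horizontal bicuspid"]
    ).astype(int)

    # Binarise the guideline thresholds described in the text:
    # AVA <= 1.0 cm^2, mean gradient >= 40 mm Hg and
    # flow rate <= 210 mL/s are each coded as 1.
    out["ava_le_1"] = (out["ava_cm2"] <= 1.0).astype(int)
    out["mg_ge_40"] = (out["mean_gradient"] >= 40).astype(int)
    out["flow_le_210"] = (out["flow_rate"] <= 210).astype(int)

    # Min-max normalise remaining continuous features to [0, 1] so
    # logistic-regression coefficients are on a comparable scale.
    for col in ["age", "lvef"]:
        lo, hi = out[col].min(), out[col].max()
        out[col] = (out[col] - lo) / (hi - lo)
    return out


demo = pd.DataFrame({
    "av_morphology": ["tricuspid", "bicuspid", "vertical bicuspid"],
    "ava_cm2": [0.8, 1.2, 1.0],
    "mean_gradient": [45.0, 25.0, 40.0],
    "flow_rate": [190.0, 250.0, 210.0],
    "age": [60.0, 80.0, 70.0],
    "lvef": [55.0, 35.0, 45.0],
})
result = preprocess(demo)
```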
Missing data were handled in one of two ways. For some binary variables, we assumed that a missing entry indicated that the reporting clinician considered the result normal and not worth reporting. We refer to these features as presumed normal and replaced their missing entries with the appropriate code for normal for that variable. For aortic valve morphology, for example, a blank was assumed to indicate normal tricuspid morphology. For features where a missing entry could not be presumed normal, we imputed values using a multivariate imputation method. 12 Thirty-one features had missing data requiring imputation.
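The two strategies above can be sketched as follows, using scikit-learn's iterative imputer as a stand-in for the multivariate imputation method cited; the column names and the normal code of 0 are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    # 'Presumed normal': a blank means the clinician saw nothing
    # abnormal, so missing entries get the code for normal (here 0).
    "bicuspid": [1, np.nan, 0, np.nan],
    # Cannot be presumed normal: impute from the other variables.
    "mean_gradient": [45.0, 25.0, np.nan, 38.0],
    "age": [72.0, 65.0, 80.0, 58.0],
})

# Strategy 1: fill presumed-normal binary fields with the normal code.
df["bicuspid"] = df["bicuspid"].fillna(0)

# Strategy 2: multivariate imputation for the remaining features.
imputer = IterativeImputer(random_state=0)
df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```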

Feature selection
Bootstrap lasso (Least Absolute Shrinkage and Selection Operator) regression (BLR) was used for feature selection. 13 14 In this method, an L1-regularised logistic regression model is trained using repeated rounds of bootstrapping. Lasso regression models have the property that many of the feature weights in the model are forced to zero, leaving only the most important features in the final model. Since the features that are selected by lasso regression may differ depending on the precise dataset used for training, we only use features that are consistently retained (ie, have non-zero weights) after many bootstrap iterations. We used 100 bootstrap splits, stratified by outcome, in which 80% of the data were used for training. Each bootstrap split consisted of a different set of patients randomly sampled with replacement from the entire dataset. Features with non-zero weight in at least 85% of the bootstrapped splits were retained for further model development. The regularisation parameter for the L1 regression was chosen using threefold cross validation, with the parameter being chosen separately for each bootstrap split (see online supplemental information). The choice of BLR threshold (85% in this case) entails a trade-off between which features are deemed important according to domain knowledge and the number of features selected. Based on prior work, 12 we initially used a value of 90%, but found that the mean gradient, a feature that is known to have prognostic significance, was not selected. We elected to lower the threshold to 85% to include this classically prognostic feature (see Discussion section).
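The selection procedure above can be sketched on synthetic data. This is a minimal illustration, not the study's code: `LogisticRegressionCV` stands in for the per-split cross-validated choice of the L1 regularisation parameter, and fewer splits are used than the 100 in the study to keep the demo quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.utils import resample

# Synthetic binary-outcome dataset standing in for the AS cohort.
X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           n_redundant=2, random_state=0)

n_splits = 30  # the study used 100; fewer here for speed
counts = np.zeros(X.shape[1])
for i in range(n_splits):
    # 80% resample with replacement, stratified by outcome.
    idx = resample(np.arange(len(y)), replace=True,
                   n_samples=int(0.8 * len(y)), stratify=y, random_state=i)
    # L1-penalised logistic regression; regularisation strength chosen
    # by 3-fold cross-validation within each split.
    model = LogisticRegressionCV(Cs=3, cv=3, penalty="l1",
                                 solver="liblinear")
    model.fit(X[idx], y[idx])
    counts += (np.abs(model.coef_.ravel()) > 1e-10)

# Retain features with non-zero weight in at least 85% of splits.
selected = np.flatnonzero(counts / n_splits >= 0.85)
```

Lowering the 0.85 threshold, as the authors did from 0.90, widens the retained set at the cost of admitting less consistently selected features.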

Aortic Stenosis Risk (ASteRisk) score
We trained a logistic regression model to predict a combined outcome consisting of all-cause mortality or aortic valve replacement. As we hypothesised that patients who received an aortic valve replacement (AVR) were deemed to be at high risk of death if they did not receive a valve replacement, we included AVR in the combined outcome to improve our ability to identify patients who are at high risk of death; that is, AVR was treated as an aborted death event.
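An illustrative encoding of this combined 5-year outcome (death OR AVR, with AVR treated as an aborted death event) is sketched below; the column names are assumptions, not the study's actual fields.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "years_to_death": [3.0, np.nan, 7.0, np.nan],
    "years_to_avr":   [np.nan, 2.5, np.nan, 6.0],
})

HORIZON = 5.0  # years after the initial echocardiogram

# .le() treats NaN (no event recorded) as False.
death_5y = df["years_to_death"].le(HORIZON)
avr_5y = df["years_to_avr"].le(HORIZON)

# Positive if either death or AVR occurred within the horizon.
df["combined_outcome"] = (death_5y | avr_5y).astype(int)
```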
From 69 input features, we obtained 26 features from BLR. We then selected a subset of these 26 features to form a parsimonious set that could be readily entered into an online risk calculator. In reducing the feature set size, we kept features that were readily available to clinicians, are measured under current practice guidelines and appeared in the most bootstrap splits (figure 1). The resulting 9 features were used to train an L2-regularised logistic regression model to predict the combined outcome of death/AVR. Model performance, by area under the receiver operating characteristic curve (AUC) analysis, was compared among models containing 69, 26 and 9 features. For transvalvular flow rate, we used an empiric cut-off of 230 mL/s in the early BLR but chose to refine the threshold to 210 mL/s in the final model construction, to be consistent with the recent identification of this threshold flow rate at which aortic valve area becomes prognostic. 11 The thresholds for the aortic valve area and mean gradient were the same as those used in the BLR analysis.

Box 1
…
5. Posterior wall thickness
6. Heart failure
7. Myocardial infarction OR peripheral vascular disease OR regional wall motion abnormality
8. Hyperlipidaemia
9. CKD
*Calculated from aortic valve area, mean gradient and peak velocity using validated formula.
†Calculated from aortic valve area, aortic sinus diameter and transvalvular flow rate using validated formula.
‡Calculated from aortic valve area and aortic sinus diameter using validated formula; differs from feature 19 in that this feature does not use the transvalvular flow rate.

Bootstrapping was done to obtain statistical measures of performance. A total of 10 stratified bootstrap splits (80% training, 20% testing) were performed for evaluation, and the results are reported over the bootstrapped test sets. We used AUC analysis as well as the 1–5 year HRs. For the HRs, we chose the upper quartile of risk to denote the high-risk subgroup.
Cox proportional hazards models were used for time-to-event analyses. CIs for the AUCs were calculated as the mean±2 SEs of the AUCs across the 10 bootstrap splits. All CIs in the primary cohort are reported over these 10 bootstrap splits. AUCs were compared using a paired t-test.
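A minimal sketch of this evaluation scheme on synthetic data: an L2-regularised logistic regression is scored by AUC over 10 stratified 80/20 splits, with the CI formed as mean±2 SEs across splits. Here the SE is taken as SD/√n and a plain stratified split stands in for bootstrap resampling; the paper's exact conventions may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 9-feature cohort data.
X, y = make_classification(n_samples=600, n_features=9, random_state=0)

aucs = []
for i in range(10):  # 10 splits, as in the primary cohort
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=0.8, stratify=y, random_state=i)
    # L2 regularisation is scikit-learn's default penalty.
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

mean_auc = np.mean(aucs)
se = np.std(aucs, ddof=1) / np.sqrt(len(aucs))
ci = (mean_auc - 2 * se, mean_auc + 2 * se)
```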

Validation cohort
The validation cohort consisted of 540 patients with AS from Laval University, Quebec, Canada, who also had longitudinal follow-up and similar inclusion and exclusion criteria. This cohort came from a previously reported subset of patients, 11 which itself was drawn from a prior study cohort recruited between 1999 and 2007. 15 There were differences in the coding of two features between the primary cohort (MGH) and the validation cohort (Laval). First, the MGH dataset encodes a history of congestive heart failure without specifying the New York Heart Association (NYHA) class, while the validation cohort lists the NYHA class for each patient. To unify the coding, in the validation cohort we coded NYHA class ≤2 as zero and NYHA class >2 as one. Second, peripheral vascular disease at baseline and regional wall motion abnormalities were not available in the Laval dataset. Since the MGH-based model uses the 'logical or' of 'myocardial infarction, peripheral vascular disease, or abnormal wall motion' to represent significant atherosclerotic burden, we used only myocardial infarction for this feature in the Laval dataset.
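The two harmonisation steps can be sketched as follows; the column names are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the Laval dataset.
laval = pd.DataFrame({"nyha_class": [1, 2, 3, 4], "mi": [0, 1, 0, 1]})

# NYHA class <= 2 -> 0, NYHA class > 2 -> 1, matching the MGH binary
# heart-failure flag.
laval["heart_failure"] = (laval["nyha_class"] > 2).astype(int)

# PVD and wall-motion data are unavailable in this cohort, so the
# atherosclerosis composite falls back to myocardial infarction alone.
laval["mi_pvd_rwma"] = laval["mi"]
```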
To impute missing data, a second multivariate imputation model was trained on the final set of 9 features from the primary cohort and applied to the validation cohort.
CIs for the AUCs were calculated by randomly sampling 20% of the dataset, computing the AUC in that sample, and calculating the SD over 10 such random samples. The CIs were computed using the mean±2 SEs as in the primary cohort.

Baseline risk model
For comparison, we also constructed a baseline risk model using conventional clinical features: (1) mean transvalvular gradient, (2) aortic valve area, (3) age and (4) left ventricular ejection fraction. 6 16 These four features were used as input to an L2-regularised logistic regression model that was trained and tested in the same way as the ASteRisk score, as described above.

RESULTS
Descriptive statistics for the primary and validation cohorts are shown in table 1. Of the 69 features available for each patient, 26 were selected by the algorithm. This was further reduced to 9 features (box 1).

Primary cohort
The minimum sample size required for a model with 9 features, an event fraction of 0.35 and a maximum root mean squared prediction error of 0.05 is 450, which is met in this analysis. 17 The ASteRisk score, trained on all 9 features, had an AUC of 0.74 (95% CI 0.73 to 0.76) for the combined outcome. The discriminatory ability of this model was superior to that of the baseline model (table 2). The ASteRisk score reliably identified patients at high risk of death at years 2–5 of follow-up, while the baseline model identified high-risk patients only at year 1 (table 3). Moreover, the discriminatory ability of the ASteRisk score was similar to that of a model trained with all 69 features (AUC of full model 0.75, 95% CI 0.73 to 0.77; p=0.37 vs ASteRisk score). There was also no significant difference between the performance of the ASteRisk score and that of the intermediate model with 26 features.
For patients with LGAS (n=383), both the baseline and ASteRisk score reliably identified those at high risk of death for years 2-5, with the baseline model also being predictive at year 1 (table 3). Among the cohort of patients who did not receive an intervention (n=776), both models had statistically significant HRs for years 1-5 (table 3, see also online supplemental table 2).
Time-to-event analyses for both the combined outcome and mortality alone are shown in figure 2. Time-to-event analyses for patients with LGAS and for patients not undergoing intervention are presented in figures 3 and 4, respectively. For both the combined outcome and mortality alone, both curves separate at 1 year, with differences between years 2 and 5 being statistically significant for predicting mortality.

Validation cohort
In the validation cohort, the ASteRisk score had superior discriminatory ability relative to that of the baseline model (table 2). The ASteRisk score also identified patients at high risk of death at years 1-5 in the entire validation cohort, LGAS cohort (n=316) and patients not undergoing intervention (n=196) (table 4). By contrast, the baseline model identified high-risk patients in the overall cohort and patients not undergoing an intervention at years 1-5, but not at any time point in the LGAS cohort (table 4).
Time-to-event analyses for all patients in the validation cohort are presented in figure 5 for the ASteRisk score. Time-to-event analyses for patients with LGAS and patients not undergoing intervention in the validation cohort are presented in figures 6 and 7, respectively. Again, both curves separate by 1 year, with differences between years 1 and 5 being statistically significant for predicting mortality.

DISCUSSION
Using two independent longitudinal databases from large tertiary hospitals, we have demonstrated that a machine learning algorithm using a small set of important features, readily available at the bedside, can reliably calculate clinical risk in patients with aortic stenosis over long-term follow-up (figure 8). The algorithm outperforms a baseline risk model built on the conventional risk factors used to judge AS severity and also works in the traditionally challenging subgroup of LGAS. Moreover, the algorithm development process identified important features not traditionally considered in clinical risk assessment but known from physiology to contribute to haemodynamic loading in AS.
A driving principle of this work was that a successful clinical risk model is not only determined by its performance but also by its ease of use. 8 Although modern EHR systems can implement risk models that use an arbitrary number of features, such systems are not available in all clinical settings. 18 Samad et al 5 demonstrated good prediction capability (AUC 0.89) with a random forest model, but used echocardiography and clinical data to predict all-cause mortality in a general population. Our study focused on outcome within AS, and specifically also evaluated the extremely clinically challenging subgroup of LGAS. While the results of Samad et al are compelling, the models work at a population level, and teasing out specific risks in particular phenotypic subgroups is a different challenge altogether. This brings up the fundamental dilemma in machine learning applied to health, whereby there is a debate about the degree of 'explainability' a model needs. 8 We take the view that if clinical practices are to be informed and influenced by machine learning models, and clinicians are to accept them, they first need to start with comprehensible models rather than broad 'black-box' approaches.
In addition to providing insight into how risks were determined, our model also permits some insights into the pathophysiology of AS. The set of features that appears in our predictive model includes both clinical (eg, heart failure) and echocardiographic features, among them transvalvular energy loss, which is not yet commonly used in current clinical practice. Transvalvular energy loss is considered the best measure of the left ventricular afterload resulting from AS, 16 20 21 and our algorithm's selection of energy loss as a key determinant of outcome suggests that the concept of energy loss should be revisited in the assessment of AS. Furthermore, our findings suggest that echocardiographic approximations of energy loss are indeed clinically valuable and discriminatory for outcomes. 20 21 The fact that our algorithm performs well on two large, independent datasets argues that it is indeed generalisable and can therefore be applied more widely to different patient cohorts. We therefore constructed a user interface that enables clinicians and researchers to easily use our algorithm. This online risk calculator, the MGH-MIT Aortic Stenosis Risk (ASteRisk) Calculator, is available at https://calc-as.herokuapp.com/. Our goal is to provide this tool for clinicians to use as an aid to clinical decision-making in AS, but we recognise that it also provides an opportunity for prospective validation of our model across disparate geographic, demographic and healthcare settings globally. The calculator provides users with quantitative risk scoring based on our machine learning algorithm applied to the 9 bedside inputs.
Although the final ASteRisk score uses 9 inputs, it ultimately relies on 11 features, all of which can be derived from the 9 routine measurements that the clinician enters as input: transvalvular flow rate and energy loss are derived from the other inputs (figure 1). 11 16 This was done because transvalvular flow rate and energy loss are not measured during a routine echocardiographic study, and we wanted the inputs to be clinically accessible. Furthermore, the algorithm does not require EHR infrastructure, increasing its potential utility in a range of clinical settings where these technologies are unavailable or limited in scope. 18 We narrowed the number of inputs to 9 with no significant difference in performance relative to a model with all 69 inputs.

Some limitations should be considered when interpreting our findings. We excluded patients with moderate or greater aortic or mitral regurgitation. Longitudinal data were retrospectively analysed; such retrospective data are necessary to achieve sufficient numbers to train a machine learning algorithm, but prospective validation would be important. The majority of patients in the primary cohort were Caucasian, making applicability to other demographics less certain.
Our data are only applicable to patients with moderate-to-severe AS. Patients with mild AS were excluded because they are unlikely to proceed to adverse clinical outcomes within 5 years of diagnosis. 22 The outcome of AVR in the combined death/AVR endpoint is subject to bias arising from the clinical decision-making to refer for and proceed with AVR, but it is nonetheless a commonly used outcome in AS studies. We also used all-cause mortality as a stand-alone outcome, which is less susceptible to bias.
We chose a BLR threshold of 85% (rather than 90%) to incorporate mean gradient as a feature. We did this because numerous studies have established the predictive power of mean gradient in AS, [23][24][25][26][27] and mean gradient is a central tenet of clinical AS risk assessment. 3 9 We therefore believed that any model that left out mean gradient would be viewed with scepticism by the clinical community. Our cohort did have a large number of patients with LGAS, which may have skewed the findings such that mean gradient was not selected at a 90% threshold. Nonetheless, we felt it important to include mean gradient to maintain our model's external validity and applicability across a range of cohorts, acknowledging our own data skew towards LGAS. We believed that adjusting the BLR threshold by 5% was a small but necessary change to permit inclusion of this important and universally recognised feature.
The recruitment periods for the primary and validation cohorts were different with respect to the ease of access and technical success rates with transcatheter aortic interventions which have improved considerably in the last decade. This may have had an effect on outcomes within the two cohorts.
Finally, our cohorts (and hence models) arise from two large tertiary referral centres in North America. Their applicability to other settings must be considered in this context, and further data from other populations and settings would be of value.
Using a machine learning algorithm, we were able to predict clinical outcome in two separate longitudinal cohorts for patients with AS, including in LGAS and patients not undergoing intervention. We provide an online risk calculator that permits the use of our algorithm for clinical and research purposes.