
Original research
Machine learning approaches improve risk stratification for secondary cardiovascular disease prevention in multiethnic patients
Ashish Sarraju (1), Andrew Ward (2), Sukyung Chung (3), Jiang Li (3), David Scheinker (4,5) and Fàtima Rodríguez (1)

  1. Division of Cardiovascular Medicine and Cardiovascular Institute, Stanford University School of Medicine, Stanford, California, USA
  2. Department of Electrical Engineering, Stanford University, Stanford, California, USA
  3. Palo Alto Medical Foundation Research Institute, Palo Alto, California, USA
  4. Department of Management Science and Engineering, Stanford University School of Engineering, Stanford, California, USA
  5. Division of Pediatric Endocrinology, Stanford University School of Medicine, Stanford, California, USA

Correspondence to Dr Fàtima Rodríguez; frodrigu{at}stanford.edu

Abstract

Objectives Identifying high-risk patients is crucial for effective cardiovascular disease (CVD) prevention. It is not known whether electronic health record (EHR)-based machine-learning (ML) models can improve CVD risk stratification compared with a secondary prevention risk score developed from randomised clinical trials (Thrombolysis in Myocardial Infarction Risk Score for Secondary Prevention, TRS 2°P).

Methods We identified patients with CVD in a large health system, including atherosclerotic CVD (ASCVD), split into 80% training and 20% test sets. A rich set of EHR patient features was extracted. ML models were trained to estimate 5-year CVD event risk (random forests (RF), gradient-boosted machines (GBM), extreme gradient-boosted models (XGBoost), logistic regression with an L2 penalty and L1 penalty (Lasso)). ML models and TRS 2°P were evaluated by the area under the receiver operating characteristic curve (AUC).
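As a rough illustration of the workflow described in the Methods (an 80%/20% split, several classifiers, evaluation by AUC on the held-out set), the sketch below uses scikit-learn on synthetic data. It is not the authors' code: the synthetic features, the stratified split, the hyperparameters and all variable names are assumptions made for illustration, and XGBoost is omitted because it requires a separate package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for EHR features (assumption; NOT the study data).
# The class weights give roughly a 12.5% event rate.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=6,
    weights=[0.875], random_state=0,
)

# 80% training / 20% test split; stratifying on the outcome is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# Default-ish hyperparameters, chosen only to keep the sketch short.
models = {
    "logistic_l2": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)  # L2 is the default penalty
    ),
    "logistic_l1_lasso": make_pipeline(
        StandardScaler(), LogisticRegression(penalty="l1", solver="liblinear")
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gbm": GradientBoostingClassifier(random_state=0),
}

# Fit each model on the training set and score predicted risk on the test set.
test_auc = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    risk = model.predict_proba(X_test)[:, 1]  # predicted probability of an event
    test_auc[name] = roc_auc_score(y_test, risk)
    print(f"{name}: test AUC = {test_auc[name]:.3f}")
```

The AUC values printed here describe only the synthetic data and say nothing about the study cohort.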

Results The cohort included 32 192 patients (median age 74 years, with 46% female, 63% non-Hispanic white and 12% Asian patients and 23 475 patients with ASCVD). There were 4010 events over 5 years of follow-up. ML models demonstrated good overall performance; XGBoost demonstrated AUC 0.70 (95% CI 0.68 to 0.71) in the full CVD cohort and AUC 0.71 (95% CI 0.69 to 0.73) in patients with ASCVD, with comparable performance by GBM, RF and Lasso. TRS 2°P performed poorly in all CVD (AUC 0.51, 95% CI 0.50 to 0.53) and ASCVD (AUC 0.50, 95% CI 0.48 to 0.52) patients. ML identified nontraditional predictive variables including education level and primary care visits.
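The Results report each AUC with a 95% CI. The abstract does not state how the intervals were derived; a percentile bootstrap over patients is one common choice and is assumed in the minimal sketch below. The functions `auc` and `bootstrap_auc_ci` and the toy data are hypothetical and use only the Python standard library.

```python
import random

def auc(labels, scores):
    """AUC as the probability that a random event case is scored above
    a random non-event case, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one event and one non-event")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate plus percentile-bootstrap interval for the AUC."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample patients with replacement
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return auc(labels, scores), lo, hi

# Toy data (assumption): about 12.5% events, with scores shifted upward for events.
toy_rng = random.Random(1)
labels = [1 if toy_rng.random() < 0.125 else 0 for _ in range(300)]
scores = [toy_rng.gauss(0.8 * y, 1.0) for y in labels]

point, lo, hi = bootstrap_auc_ci(labels, scores, n_boot=500)
print(f"AUC {point:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

An AUC near 0.5, as reported for TRS 2°P, means the score ranks event and non-event patients about as well as chance; the pairwise definition in `auc` makes that interpretation explicit.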

Conclusions In a multiethnic real-world population, EHR-based ML approaches significantly improved CVD risk stratification for secondary prevention.

  • electronic health records
  • coronary artery disease
  • risk factors

Data availability statement

No data are available. The data analysed during the current study are not publicly available. Owing to privacy and security concerns, the underlying EHR data cannot readily be redistributed to researchers other than those engaged in the Institutional Review Board (IRB)-approved research collaborations in the current project. The corresponding author may be contacted about access to EHR data for an IRB-approved collaboration.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

Supplementary materials

  • Supplementary Data

    This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.

Footnotes

  • Contributors AS and AW contributed equally to this paper as joint first authors. DS and FR also contributed equally to this paper as joint senior authors. All authors made substantial contributions to the conception and design of the work, and interpretation and analysis of the data. All authors were involved in drafting and critically revising the paper. All authors have reviewed and approved the final version of the paper and accept full responsibility for the finished work. FR is the guarantor and accepts full responsibility for the work and/or the conduct of the study, had access to the data, and controlled the decision to publish.

  • Funding FR received support from the National Heart, Lung and Blood Institute, National Institutes of Health (1K01HL144607) and the American Heart Association/Robert Wood Johnson Harold Amos Medical Faculty Development Program. AS received support from the American Heart Association (Grant 20SFRN35360178). AW was supported by the Department of Defense through a National Defense Science and Engineering Graduate Fellowship.

  • Competing interests Outside of the submitted work, FR reports equity from HealthPals and Carta, and advisory board and consulting fees from NovoNordisk, HealthPals and Novartis.

  • Provenance and peer review Not commissioned; externally peer reviewed.
