Original research
Single-lead arrhythmia detection through machine learning: cross-sectional evaluation of a novel algorithm using real-world data
  1. Henry Mitchell1,
  2. Nicole Rosario1,
  3. Carme Hernandez1,2,3,
  4. Stuart R Lipsitz1,4 and
  5. David M Levine1,4
  1. 1Division of General Internal Medicine, Brigham and Women's Hospital, Boston, Massachusetts, USA
  2. 2University of Barcelona, Barcelona, Catalunya, Spain
  3. 3Hospital Clinic, Barcelona, Spain
  4. 4Harvard Medical School, Boston, Massachusetts, USA
  1. Correspondence to Dr David M Levine; dmlevine{at}bwh.harvard.edu

Abstract

Background Computer-assisted interpretation of single-lead ECG is the preliminary method for clinicians to flag and further evaluate an arrhythmia of clinical importance for acutely ill patients. Critical scrutiny of novel detection algorithms is lacking, particularly in external real-world data sets. This study’s objective was to evaluate a hybrid machine learning model’s ability to classify eight arrhythmias from a single-lead ECG signal from acutely ill patients.

Methods This cross-sectional external retrospective evaluation of a previously trained hybrid machine learning model against an ECG reading team in the setting of home hospital care (acute care delivered at home substituting for traditional hospital care) draws from patients admitted at two hospitals in Boston, Massachusetts, USA between 12 June 2017 and 23 November 2019. We calculated classifier statistics for each arrhythmia, all arrhythmias and strips where the model identified normal sinus rhythm.

Results The model analysed 2 680 162 min of single-lead ECG data from 423 patients and identified 691 478 arrhythmias. Patients had a mean age of 70 years (SD, 18), 60% were female and 45% were white. For any arrhythmia, the model had a sensitivity of 98%, a specificity of 100%, an accuracy of 98%, a positive predictive value of 100%, a negative predictive value of 93% and an F1 Score of 99%. Performance was best for pause (F1 Score, 99%) and worst for paroxysmal supraventricular tachycardia (F1 Score, 92%). The model’s false positive rate for any arrhythmia was 0.2%, ranging from 0.4% for pause to 7.2% for paroxysmal supraventricular tachycardia. The false negative rate for any arrhythmia was 1.9%.

Conclusions A hybrid machine learning model was effective at classifying common cardiac arrhythmias from a single-lead ECG in real-world data.

  • ARRHYTHMIAS
  • Arrhythmias, Cardiac
  • Electrocardiography

Data availability statement

Data are available upon reasonable request. Select data are available upon reasonable request within the bounds of our data privacy policies.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

WHAT IS ALREADY KNOWN ON THIS TOPIC

  • The increase in patient monitoring data has created a need for computer-assisted interpretation, which has suffered from misdiagnosis. ECG data provide an opportunity for deep learning algorithms to identify arrhythmias, as evaluated in this study.

WHAT THIS STUDY ADDS

  • Current machine learning algorithms have the potential to assist in identifying common cardiac arrhythmias in acutely ill patients.

HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY

  • Algorithms used to identify cardiac arrhythmias may also be used to monitor chronically ill patients and serve as an indicator for overall cardiac health.

Introduction

The ECG is a foundational tool for the evaluation and management of cardiac conditions, available beginning in the early 1900s.1 Computer-assisted interpretation of the ECG has been available for over 50 years, yet existing commercial algorithms continue to misdiagnose.2–5 With the advent of continuous telemetry monitoring and consistently larger swathes of patient monitoring data, the need for computer-assisted interpretation has grown. Clinicians are unable to critically analyse these new large quantities of data efficiently or ‘just in time’, as is the case in acute care; thus, computer-assisted interpretation becomes highly beneficial. Unfortunately, commercially available single- and three-lead systems have also continued to suffer from misdiagnosis.5 6 Digitisation of ECG data and advances in machine learning likely allow for improvements in currently available systems.

Various machine learning methods developed in the last several decades, often called representation learning, offer the opportunity to improve computer-assisted analysis, as they rely significantly less on manual feature extraction to transform raw data.7 Instead, representation learning allows a machine to discover representations using raw data. ‘Deep’ learning refers to multiple levels of representation.

ECG data and their interpretation offer an ideal use case for deep learning. The data are increasingly digitised in a standard format and their interpretation requires highly trained pattern recognition, among other methods. Use of deep learning for 12-lead ECG interpretation has shown great promise.8–12 Single-lead ECG interpretation is a burgeoning field with limitations including lack of external validation, small data sets such as the MIT-BIH Arrhythmia Database,13 14 algorithms targeting small numbers of arrhythmias15 and issues of overfitting and generalisability.15–17 There have recently been new developments in single-lead arrhythmia detection using artificial intelligence.8 16 18–21 However, pure machine learning approaches may face challenges from data that are noisy. A hybrid model that combines traditional signal processing for noise reduction with machine learning for pattern recognition is designed to address this problem. Few evaluations have been conducted in patients who are acutely ill.

We sought to evaluate a hybrid machine learning model using real-world single-lead ECG data to classify eight arrhythmias from acutely ill patients receiving care at home in an innovative ‘home hospital’ programme that provides acute care at home as a substitute for care that would traditionally be provided in the hospital (figure 1).22–27

Figure 1

An acutely ill patient generates continuous single-lead ECG interpreted by traditional computer algorithms during usual care with only fair arrhythmia recognition, generating significant alarm fatigue due to false positives. The hybrid model evaluated in this study demonstrated performance comparable to gold standard annotators when challenged with real-world single-lead ECG data from acutely ill patients. Refer to online supplemental figure S1 for a more detailed schematic of the hybrid model. AFIB, atrial fibrillation; Brady, sinus bradycardia; DNN, deep neural network; NSR, normal sinus rhythm; PSVT, paroxysmal supraventricular tachycardia; PVC, premature ventricular contraction; Tachy, sinus tachycardia; V. Big, ventricular bigeminy; V. Trig, ventricular trigeminy.

Methods

Setting and participants

Select acutely ill patients who require inpatient hospitalisation can be cared for at home instead of in the hospital when a specialised home hospital team is dispatched to the home.22–30 We previously described the home hospital intervention in detail.23 Briefly, all patients received at least one daily visit from an attending general internist and two daily visits from a registered nurse (Mass General Brigham Healthcare at Home), with additional visits performed as needed. Patients received hospital-level care, including intravenous infusions, respiratory treatments, imaging and blood tests as needed. Also tailored to patient need, participants could receive medical meals and the services of a home health aide, social worker, physical therapist and/or occupational therapist. All patients who were home hospitalised, irrespective of diagnosis, wore a patch monitor (VitalConnect) throughout their stay, capable of transmitting continuous single-lead ECG, accelerometry (including falls detection) and body temperature.

Patients were admitted from either Brigham and Women’s Hospital, an academic medical centre, or Brigham and Women’s Faulkner Hospital, a community hospital, between 12 June 2017 and 23 November 2019.

Data source

We extracted the continuous (sampling frequency: 125 Hz) unfiltered single-lead ECG signal for all patients enrolled in the programme. These data were transmitted wirelessly to the remote monitoring endpoint, stored remotely and analysed retrospectively. The ECG recordings were divided into 1 min non-overlapping strips for easy comparison and standardisation. Each strip was analysed by the model for each arrhythmia, meaning that strips can fall into multiple categories. Up to 1000 strips for each arrhythmia type identified by the algorithm were randomly selected for human annotation, balanced by patient. Additionally, to determine a false negative rate (FNR), 2000 strips with no model-noted abnormality were randomly selected for human annotation from patients home hospitalised for a heart failure exacerbation or with known arrhythmia.
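
The strip construction described above can be sketched in a few lines. This is an illustrative sketch only, not the study's code: the function name `split_into_strips` is hypothetical, and dropping a trailing partial minute is an assumption, since the text does not say how incomplete strips were handled.

```python
from typing import List, Sequence

SAMPLING_HZ = 125    # sampling frequency stated in the text
STRIP_SECONDS = 60   # 1 min strips, as stated in the text


def split_into_strips(signal: Sequence[float],
                      fs: int = SAMPLING_HZ,
                      seconds: int = STRIP_SECONDS) -> List[Sequence[float]]:
    """Cut a signal into consecutive, non-overlapping strips.

    Assumption: a trailing partial strip is dropped; the source does not
    describe how incomplete minutes were treated.
    """
    n = fs * seconds  # samples per strip (7500 at 125 Hz)
    return [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]
```

At 125 Hz each 1 min strip holds 7500 samples, so a recording of 2 min and a few extra samples yields two strips under this sketch.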

Gold standard annotation

Each of the randomly selected strips described above was annotated beat-by-beat by two expert technicians who were certified cardiac technicians and certified rhythm analysis technicians. When the two readings disagreed, a third senior technician analysed the strip to resolve the discrepancy. In the event of multiple incongruencies, the senior technician’s reading prevailed. Each technician’s reading was reviewed by a cardiologist for accuracy before final labelling.

Prior algorithm development

The ECG interpretation algorithm is a hybrid machine learning model developed separately from and prior to the present study using a different patient cohort. It was originally trained on approximately 3 700 000 annotated single-lead ECGs of US adult patients with heart failure and atrial fibrillation (AFIB) monitored at home between 2015 and 2018 (RhythmAnalytics, Biofourmis). The data were captured via Holter monitor at a sampling rate of 256 Hz. Without manual preprocessing, the model was trained to autonomously identify AFIB, pause, paroxysmal supraventricular tachycardia (PSVT), premature ventricular contraction (PVC), sinus bradycardia, sinus tachycardia, ventricular bigeminy, ventricular trigeminy and normal sinus rhythm (figure 2). The distribution of arrhythmias in the training set and the model’s performance against the training data set are shown in online supplemental tables S1 and S2.

Figure 2

Example ECGs demonstrating arrhythmias. (A) Atrial fibrillation. (B) Sinus bradycardia. (C) Normal sinus rhythm. (D) Pause. (E) Paroxysmal supraventricular tachycardia. (F) Sinus tachycardia. (G) Ventricular bigeminy. (H) Ventricular trigeminy.

The hybrid machine learning model receives input from the unfiltered signal obtained from the single-lead ECG (online supplemental figure S1 and table S3). This signal passes through a 0.5–40 Hz band pass filter31 and amplitude ratio thresholding to detect and remove artefacts.32 This is then further filtered33 to remove contributions from baseline wandering,31 power line interference, electromyographic (EMG) and motion artefact interference.32 Together, these filtering steps reduce the noise in the ECG signal. The filtered signal is passed to modules that output the peaks of the P, Q, R, S and T waves as well as the morphological features for each beat.34
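
To make the band pass step concrete, the sketch below cascades a first-order high-pass at 0.5 Hz with a first-order low-pass at 40 Hz. It is a deliberately simple stand-in: the text gives the pass band but not the filter type or order used by the model, so the first-order design, the function name `band_pass` and its parameters are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def band_pass(signal: Sequence[float], fs: float = 125.0,
              low_hz: float = 0.5, high_hz: float = 40.0) -> List[float]:
    """First-order high-pass at low_hz followed by first-order low-pass at high_hz.

    Assumption: the filter order and design are NOT taken from the source;
    only the 0.5-40 Hz pass band is.
    """
    dt = 1.0 / fs
    rc_hp = 1.0 / (2.0 * math.pi * low_hz)   # time constant of the high-pass stage
    rc_lp = 1.0 / (2.0 * math.pi * high_hz)  # time constant of the low-pass stage
    a_hp = rc_hp / (rc_hp + dt)
    a_lp = dt / (rc_lp + dt)

    out: List[float] = []
    hp_prev = 0.0  # previous high-pass output
    x_prev = 0.0   # previous raw sample
    lp_prev = 0.0  # previous low-pass output
    for i, x in enumerate(signal):
        # High-pass stage removes slow drift such as baseline wander.
        hp = 0.0 if i == 0 else a_hp * (hp_prev + x - x_prev)
        # Low-pass stage attenuates high-frequency noise.
        lp = lp_prev + a_lp * (hp - lp_prev)
        out.append(lp)
        hp_prev, x_prev, lp_prev = hp, x, lp
    return out
```

A constant offset is rejected by the high-pass stage, while a 10 Hz component, which lies inside the pass band, passes with only modest attenuation. A production filter would normally use a higher-order design with zero-phase filtering.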

Following this processing, a hybrid approach is adopted by combining traditional ECG feature engineering and a deep neural network (DNN).34 Established ECG techniques such as r-peak detection,34 p-wave detection,35 t-wave offset detection,36 37 wavelet transform31 38–40 and sample entropy41 are used as inputs to the arrhythmia modules. In addition, the DNN, which was trained directly from the ECG signal, is used to output morphological features and independent peak detection features. These morphological features and peak detection features are then passed to each of the eight classifiers, allowing for the detection of multiple arrhythmias per strip. This architecture allows a combination of techniques to be used in different independent arrhythmia modules. For example, the arrhythmia module that classifies sinus bradycardia and sinus tachycardia receives input from r-peak detection and ignores morphological features from the DNN. In contrast, the arrhythmia module that detects ventricular arrhythmias (ventricular bigeminy, ventricular trigeminy and PVC) uses mainly morphological features and peak detection features from the DNN. Lastly, the arrhythmia module that detects atrial arrhythmias (AFIB and PSVT) uses both morphological features from the DNN and other established ECG features such as p-wave detection and sample entropy.
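
As an illustration of how a rate-based module can work from r-peak positions alone, the sketch below derives a mean heart rate from R-R intervals and applies rate cut-offs. The function names are hypothetical, and the 60 and 100 beats/min thresholds are the conventional clinical cut-offs rather than values reported for this model. The sketch also labels rate only; confirming a sinus origin would require additional features such as p-wave detection.

```python
from typing import Sequence


def mean_heart_rate_bpm(r_peak_samples: Sequence[int], fs: float = 125.0) -> float:
    """Mean heart rate from R-peak sample indices via the mean R-R interval."""
    if len(r_peak_samples) < 2:
        raise ValueError("need at least two R-peaks")
    rr_seconds = [(b - a) / fs
                  for a, b in zip(r_peak_samples, r_peak_samples[1:])]
    return 60.0 / (sum(rr_seconds) / len(rr_seconds))


def rate_label(bpm: float,
               brady_below: float = 60.0,    # assumption: conventional cut-off
               tachy_above: float = 100.0) -> str:  # assumption: conventional cut-off
    """Label a strip by rate only; thresholds are not taken from the source."""
    if bpm < brady_below:
        return "low rate"
    if bpm > tachy_above:
        return "high rate"
    return "normal rate"
```

For example, R-peaks spaced 250 samples apart at 125 Hz correspond to an R-R interval of 2 s, which this sketch labels as a low rate.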

The model was not tuned or adapted for this validation study. The only differences in inputs between the training set and this study’s external validation set came from hardware differences (eg, sampling rate) and the patient population.

Statistical analysis

We summarised patient sociodemographic characteristics with counts and percentages or means and 95% CIs as appropriate. Algorithm evaluation was performed by calculating accuracy, F1 Score (the harmonic mean of PPV and sensitivity), negative predictive value (NPV), positive predictive value (PPV), sensitivity and specificity. Wilson intervals were used to calculate 95% CIs. These statistics were calculated for each arrhythmia subgroup to determine the algorithm’s ability to identify specific arrhythmias. A true positive was defined as both the model and annotators indicating the same arrhythmia. We obtained the mean performance across all arrhythmias, termed ‘mean of specific arrhythmias’. We chose these classifier metrics instead of area under receiver operating characteristic curve and similar measures because the probability threshold for the algorithm was frozen on submission for Food and Drug Administration (FDA) approval.
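
The classifier statistics and Wilson intervals named above follow standard formulas, sketched here for a single 2×2 confusion matrix. The function names are hypothetical and the code is not the study's analysis script. F1 is computed with its conventional definition, the harmonic mean of PPV and sensitivity.

```python
import math
from typing import Dict, Tuple


def wilson_interval(successes: int, n: int,
                    z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (default z gives 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return centre - half, centre + half


def classifier_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Strip-level metrics from confusion-matrix counts.

    Note: no guard against empty rows or columns; a zero denominator raises.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "f1": 2.0 * ppv * sensitivity / (ppv + sensitivity),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }
```

A 95% CI for sensitivity, for instance, would be obtained as `wilson_interval(tp, tp + fn)`, and for specificity as `wilson_interval(tn, tn + fp)`.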

Additionally, these same statistics were calculated to determine the algorithm’s ability to recognise that a strip contained an arrhythmia, regardless of the exact arrhythmia present. For those analyses, a true positive is defined as the model and the annotators both indicating arrhythmias, though the arrhythmia indicated by the model need not be the same as the arrhythmia indicated by the annotators.
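
The difference between the two true positive definitions can be shown with label sets for a single strip. In this hypothetical sketch, each strip carries the set of arrhythmias flagged by the model and the set flagged by the annotators; the function names and the set representation are assumptions for illustration.

```python
from typing import AbstractSet


def outcome(model_positive: bool, annotator_positive: bool) -> str:
    """Map a pair of binary calls to a confusion-matrix cell."""
    if model_positive and annotator_positive:
        return "TP"
    if model_positive:
        return "FP"
    if annotator_positive:
        return "FN"
    return "TN"


def any_arrhythmia_outcome(model: AbstractSet[str],
                           annotators: AbstractSet[str]) -> str:
    """'Any arrhythmia' analysis: the labels need not match, only be present."""
    return outcome(bool(model), bool(annotators))


def specific_outcome(name: str, model: AbstractSet[str],
                     annotators: AbstractSet[str]) -> str:
    """Arrhythmia-specific analysis: both parties must name the same rhythm."""
    return outcome(name in model, name in annotators)
```

A strip the model calls AFIB and the annotators call PVC counts as a true positive in the any-arrhythmia analysis, but as a false positive for AFIB and a false negative for PVC in the arrhythmia-specific analyses.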

As post hoc analyses, we evaluated the algorithm’s operating characteristics by gender, race/ethnicity42 and the patient level. For the patient-level analyses, a true positive was defined as the model indicating the presence of an arrhythmia in any strip for a given patient and the annotators indicating the presence of that arrhythmia in any strip for that given patient. As we did not find clinically systematic differences, we present these data in online supplemental tables S4 and S5.
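
A patient-level analysis of this kind amounts to pooling labels across all of a patient's strips before comparing. The sketch below assumes the comparison is between the model and the annotators, and uses a hypothetical `(patient_id, model_labels, annotator_labels)` tuple per strip; neither the data layout nor the function name comes from the source.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical record: (patient_id, model labels, annotator labels) for one strip.
Strip = Tuple[str, Set[str], Set[str]]


def patient_level_labels(strips: Iterable[Strip]) -> Dict[str, Tuple[Set[str], Set[str]]]:
    """Pool model and annotator labels over every strip of each patient."""
    model: Dict[str, Set[str]] = defaultdict(set)
    annotators: Dict[str, Set[str]] = defaultdict(set)
    for patient_id, model_labels, annotator_labels in strips:
        model[patient_id] |= model_labels
        annotators[patient_id] |= annotator_labels
    return {pid: (model[pid], annotators[pid]) for pid in model}
```

Under this pooling, an arrhythmia is a patient-level true positive when it appears in both pooled sets, even if the model and the annotators flagged it on different strips.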

Results

Patient characteristics

We analysed 423 acutely ill patients (mean length of stay: 4.5 days), totalling 2 680 162 min of ECG monitoring. Patients had a mean age of 70 years (SD, 18), 60% were female, 45% were white and 71% spoke English as their primary language (table 1). Twenty-three per cent of patients were admitted primarily with a cardiac diagnosis (heart failure, hypertensive urgency, venous thromboembolism and AFIB with rapid ventricular response).

Table 1

Patient characteristics

Arrhythmia detection

The machine learning algorithm identified 691 478 arrhythmias across 418 291 unique 1 min strips (table 2). PVC was the most prevalent (n=233 317), followed by AFIB (n=184 142). Least prevalent were pauses (n=685) and PSVT (n=9044). The 9156 unique strips annotated by the technicians came from 271 patients (64% of the total patient population).

Table 2

Arrhythmias identified by the hybrid machine learning model and selected for analysis

For any arrhythmia, the algorithm had a sensitivity of 98% (95% CI, 97.8% to 98.3%), a specificity of 100% (95% CI, 99.7% to 99.9%), an accuracy of 98% (95% CI, 98.2% to 98.7%), a PPV of 100% (95% CI, 99.9% to 100.0%), an NPV of 93% (95% CI, 92.4% to 93.5%) and an F1 Score of 99% (95% CI, 98.8% to 99.2%; table 3). For the mean of specific arrhythmias, the algorithm had a sensitivity of 99.5% (95% CI, 99.2% to 99.5%), a specificity of 97.2% (95% CI, 97.0% to 97.8%) and an F1 Score of 96.7% (95% CI, 96.5% to 97.3%). Performance was best for pause (F1 Score, 99.5% (95% CI, 99.1% to 99.7%)) and worst for PSVT (F1 Score, 92% (95% CI, 90.6% to 92.5%)). The confusion matrices for each arrhythmia, the mean of the specific arrhythmias and overall arrhythmia detection are in figure 3.

Figure 3

Confusion matrices for each arrhythmia and any arrhythmia. Colours indicate proportion of patients in each category. (A) Any arrhythmia. (B) Atrial fibrillation. (C) Pause. (D) Paroxysmal supraventricular tachycardia. (E) Premature ventricular contraction. (F) Sinus bradycardia. (G) Sinus tachycardia. (H) Ventricular bigeminy. (I) Ventricular trigeminy. (J) Mean of specific arrhythmias. Note: Some arrhythmias show fewer than 3000 strips total due to some strips being duplicated between both the arrhythmia-containing and normal sinus rhythm categories.

Table 3

Performance characteristics of the hybrid machine learning model versus human annotators at the strip level

False positive and negative rates

The machine learning algorithm’s false positive rate for any arrhythmia was 0.2% (95% CI, 0.14% to 0.33%), ranging from 0.4% (95% CI, 0.19% to 0.65%) for pause to 7.2% (95% CI, 6.3% to 8.2%) for PSVT (table 3). Among the randomly selected strips where the model did not find any arrhythmia, the FNR was 1.9% (95% CI, 1.7% to 2.2%).

Discussion

In this retrospective analysis of a single-lead arrhythmia detection algorithm using real-world data, we found an overall F1 Score of 99% at determining the presence of an arrhythmia, and a mean F1 Score of 97% at identifying specific arrhythmias. To our knowledge, this represents the first evaluation of a single-lead ECG hybrid machine learning algorithm against external real-world data and in acutely ill patients.

We foresee multiple use cases for this algorithm that can function on often noisy real-world data, even when collected in the home of acutely ill patients during home hospital care. First, this algorithm could be used to interpret traditional telemetry feeds for acutely ill patients. Despite being relied on by clinicians as a key screening tool, current telemetry systems such as GE’s EK-Pro perform with sensitivities of 92% for AFIB and 93% for PSVT on the limited MIT-BIH database.43 A machine learning algorithm has the potential to significantly improve arrhythmia recognition among acutely ill patients monitored by single-lead ECG. Second, this algorithm could monitor chronically ill patients who require additional short-term monitoring during medication titration or diagnostic investigation for occult arrhythmia. Third, this algorithm could form the beginnings of a system that consistently monitors cardiac health through arrhythmia detection.44 Future research can correlate long-term arrhythmia burden with health outcomes.

Our analysis builds on others’ work. A single-lead ECG arrhythmia classifier by Hannun et al16 classified ECG tracings into 12 rhythmic categories—AFIB/atrial flutter, atrioventricular block, bigeminy, ectopic atrial rhythm, idioventricular rhythm, junctional rhythm, noise, normal sinus rhythm, supraventricular tachycardia (SVT), tachycardia and Wenckebach block. Its average F1 Score when tested against a 10% sample of the PhysioNet data set was 83.7%. A single-lead ECG arrhythmia classifier from Dias et al classified heartbeats as normal, supraventricular ectopic beats or ventricular ectopic beats with F1 Scores of 52% for supraventricular ectopic beats and 91% for ventricular ectopic beats, lower than the presently considered model for PSVT (92%), ventricular bigeminy (98%) and trigeminy (96%).18 A 12-lead ECG arrhythmia classifier developed by Baygin et al19 which classifies ECG tracings into seven rhythmic categories—AFIB, atrial flutter, bradycardia, SVT, tachycardia, sinus irregularity, normal sinus rhythm—achieved a maximum single-lead F1 Score of 84.0%. A 12-lead ECG analysis neural network from Hughes et al achieved a mean F1 Score of 81.2% at detecting arrhythmias.8

Work by Pandey et al classified PVC with a PPV of 98.81%, a sensitivity of 98.84% and an F1 Score of 98.82%, all better than the present algorithm.20 The work of de Albuquerque et al resulted in a maximum F1 Score of 93% for identifying PVC.21 However, both of these models were trained and tested on the two-channel MIT-BIH database, not on an external data set.

We also were interested to determine whether this model had embedded performance disparities by demographic characteristics, as has been described in other ECG analysis models.42 In a post hoc analysis, we did not find systematic biases in the algorithm by demographic characteristics, although this deserves further prospective analysis to allow for proper statistical comparison.

Our study has limitations. First, our analysis may miss conflation errors (eg, if the model misidentified AFIB as tachycardia, those false negatives would not arise in the AFIB analysis). However, this type of misclassification would have reduced the PPV of the conflated arrhythmia, which does not appear to have happened. Second, the model does not evaluate all arrhythmias of clinical interest. Future iterations should include arrhythmias such as ventricular tachycardia and ventricular fibrillation. It should also monitor for QT prolongation, for example.45 46 Third, although our patient population was diverse by multiple sociodemographic characteristics, our population drew from a single metropolitan area and did not encompass the complete array of cardiac conditions, limiting generalisability. Finally, our analysis does not include an evaluation of the model on a common data set such as the MIT-BIH Arrhythmia Database—or an evaluation of other models on this data set—to allow for head-to-head comparison with other models, although use of a real-world data set the algorithm had never seen is a strength of this analysis.

Overall, a previously trained hybrid machine learning model for the detection of common cardiac arrhythmias demonstrated performance comparable to gold standard annotators when challenged with external real-world single-lead ECG data from acutely ill patients.

Ethics statements

Patient consent for publication

Ethics approval

The study protocol was approved by the Mass General Brigham institutional review board.

Footnotes

  • Twitter @davidlevinemd

  • Contributors HM had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. Study concept and design: DML. Acquisition, analysis or interpretation of data; critical revision of the manuscript for important intellectual content: all authors. Drafting of the manuscript: DML and HM. Statistical analysis: HM and SRL. Administrative, technical or material support: HM. Study supervision: DML.

  • Funding Biofourmis funded the present PI-initiated manuscript. Biofourmis had no role in the design, analysis and interpretation of the data. Biofourmis had no role in preparation of the manuscript, was blind to all results and did not control publication.

  • Competing interests DML and SRL receive grant support from Biofourmis and IBM.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.