Original research

Heart-to-heart with ChatGPT: the impact of patients consulting AI for cardiovascular health advice

Abstract

Objectives The advent of conversational artificial intelligence (AI) systems employing large language models such as ChatGPT has sparked public, professional and academic debates on the capabilities of such technologies. This mixed-methods study sets out to review and systematically explore the capabilities of ChatGPT to adequately provide health advice to patients when prompted regarding four topics from the field of cardiovascular diseases.

Methods As of 30 May 2023, 528 items on PubMed contained the term ChatGPT in their title and/or abstract, with 258 being classified as journal articles and included in our thematic state-of-the-art review. For the experimental part, we systematically developed and assessed 123 prompts across the four topics based on three classes of users and two languages. Medical and communications experts scored ChatGPT’s responses according to the 4Cs of language model evaluation proposed in this article: correct, concise, comprehensive and comprehensible.

Results The articles reviewed were fairly evenly distributed across discussing how ChatGPT could be used for medical publishing, in clinical practice and for education of medical personnel and/or patients. Quantitatively and qualitatively assessing the capability of ChatGPT on the 123 prompts demonstrated that, while the responses generally received above-average scores, they occupy a spectrum from the concise and correct via the absurd to what can only be described as hazardously incorrect and incomplete. Prompts formulated at higher levels of health literacy generally yielded higher-quality answers. Counterintuitively, responses in a lower-resource language were often of higher quality.

Conclusions The results emphasise the relationship between prompt and response quality and hint at potentially concerning futures in personalised medicine. The widespread use of large language models for health advice might amplify existing health inequalities and will increase the pressure on healthcare systems by providing easy access to many seemingly likely differential diagnoses and recommendations for seeing a doctor for even harmless ailments.

What is already known on this topic

  • AI systems such as ChatGPT are promising tools for medical and patient education, and extant studies find ChatGPT to be empathetic and able to deliver quality answers, but limited research has been conducted on the use of ChatGPT on cardiovascular topics.

What this study adds

  • This systematic study finds and suggests explanations for the varying performance of ChatGPT on four cardiovascular topics.

  • Furthermore, we conclude that the use of AI for patient education might increase health inequalities and is likely to contribute to the overburdening of healthcare systems.

How this study might affect research, practice or policy

  • This study highlights the importance of providing appropriate guidance for patients in using ChatGPT to seek medical information to mitigate risks associated with using the technology and provides insights into opportunities and threats of employing the technology that might influence medical education.

  • The 4Cs proposed in this work for scoring ChatGPT’s answers have the potential to shape future research on assessing responses from large language models and other AI technologies.

Introduction

The release of ChatGPT as a ‘research preview’ on 30 November 2022 has sparked a discussion on the opportunities and threats associated with current and upcoming technologies based on large language models (LLMs) for medical research, clinical practice and health communication and education.1 2 An as-of-yet limited body of research posits LLMs as a potentially valuable tool for patients to access health advice,3–5 a scenario that has likely already become a reality, as health is inherently a topic of great interest to all of us. Patients already use Google extensively for health advice,6 and the societal lockdowns during the COVID-19 pandemic have further conditioned patients to seek advice online instead of visiting medical professionals.

The use of LLMs for health advice represents a double-edged sword. On the one hand, ChatGPT has been found to provide more informative responses than Google snippets in the context of patients and health professionals accessing information on cancer.4 ChatGPT has also been found to be good at reproducing consensus rather than fringe opinions,5 providing an opportunity to counteract the echo chamber effect of social media and other user-generated content.7 When responses from ChatGPT were compared with human responses, the former were perceived as more empathetic,8 opening avenues for more individualised, compassionate and scalable healthcare.9 On the other hand, ChatGPT has been found to provide dangerous advice in infection scenarios and to show its limits by being unable to ask clarifying questions,10 raising questions regarding the quality of responses in safety-critical applications such as health advice.

In this article, we aim to provide an overview of the academic debate on the potential of LLMs and to systematically and critically explore this potential for the case of patients searching for advice on cardiovascular conditions using ChatGPT. Cardiovascular diseases were chosen as they are the leading cause of death on a global scale, the academic debate already discusses the potential of LLMs to contribute to cardiovascular medicine,11 and initial explorations have investigated their use in educating citizens on cardiopulmonary resuscitation.3 We contribute to research on the use of LLMs in medical fields by providing insight into how the language, description depth and health literacy of the prompts impact the quality of the responses obtained. We also showcase a generic template for further systematic explorations and structured evaluations based on a 4C model for assessing correctness, comprehensiveness, conciseness and comprehensibility. Ultimately, we contribute to the wider academic debate on the opportunities and threats of employing LLMs in medical fields, pointing to mechanisms that might amplify existing health inequalities and increase pressure on healthcare systems.

Materials and methods

This mixed-methods study comprises two phases: (1) a systematic state-of-the-art review of the academic debate on ChatGPT and LLMs in health communication, education and research and (2) a systematic experiment with prompts and a multidimensional evaluation of the quality of the associated responses in the field of cardiovascular diseases.

State-of-the-art review

To keep the review of extant literature feasible in the face of the veritable explosion of publications related to ChatGPT (see figure 1), we considered all 528 items from the PubMed database as of 30 May 2023 where the title and/or abstract contained the search term ‘ChatGPT’, including all 258 journal articles and excluding 270 commentaries, editorials and other publication types. We performed a thematic coding of the included articles, identifying the different potential and actual areas of application of LLMs discussed in different medical contexts.

Figure 1

Number of publications indexed in the PubMed database that mention ChatGPT in the title and/or abstract, by month.

Systematic experimentation with ChatGPT

Prior studies have experimented with prompting LLMs but, until now, few have performed systematic simulation experiments, often owing to concerns about limited replicability and/or a lack of best practices for doing so. Most works focus on a single chat session or a handful of prompts that are run a manageable number of times and are then analysed and discussed by the authors,4 12 effectively providing some first impressions of the potential of LLMs in the chosen medical context.13 A few more extensive studies randomly sample real-world requests for health advice and corresponding human responses from online sources to compare human and AI-generated responses.8

In our experiment, we define a latent space of prompts and systematically construct a variety of prompts in this space. We assess the responses by employing a hybrid approach where we first perform a quantitative evaluation,5 followed by a qualitative analysis.14 In particular, we score responses by how correct, concise, comprehensive and comprehensible they are and refer to these criteria as the 4Cs of LLM evaluation in medical fields (see table 3). Instead of running the same prompts multiple times, we opt for prompt diversity in each of the four topics outlined in table 1, to investigate whether and to what degree different ways of prompting about the same subject influence the results.

Table 1 | Topics used to assess the ability of ChatGPT to respond to prompts for advice regarding acute, chronic and preventive conditions from the cardiovascular domain

We choose four cardiovascular topics that cover different degrees of urgency. Namely, for chats on the myocardial infarction (MI) topic, it is crucial to direct patients to emergency services if they or another real person in their immediate surroundings appear to be suffering from relevant acute symptoms. For peripheral arterial disease (PAD), critical symptoms should be recognised, but the risks and need for treatment in more mundane or chronic cases should not be exaggerated. Similarly, for varicose veins (VV) and cardiovascular prevention (CP), the LLM should simply answer the patients’ questions as accurately as possible while keeping the responses proportionate.

For each of these four topics, we systematically construct prompts from the latent prompt space distributed across two languages (English and Danish), two levels of description depth (low and high) and two levels of health literacy (low and high). As we do not combine low description depth with high health literacy, the space effectively comprises the six classes of prompts presented in table 2. Extant work portrays the ability of conversational AI systems to meet patients on their level as one of their greatest potential strengths.5 9 The six combinations of language, description depth and health literacy allow us to investigate how well this strength is realised in practice.

Table 2 | Overview of the six classes of prompting that are employed across the four topics
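
To make the structure of this latent prompt space concrete, the following minimal sketch enumerates the six classes across the four topics. The Python representation and variable names are ours and purely illustrative; the class labels, the excluded combination and the identifier format follow table 2 and the prompt-response identifiers used in the online supplemental materials.

```python
from itertools import product

TOPICS = ["MI", "PAD", "VV", "CP"]   # myocardial infarction, peripheral arterial disease,
                                     # varicose veins, cardiovascular prevention
HEALTH_LITERACY = ["L", "H"]         # low, high
DESCRIPTION_DEPTH = ["L", "H"]       # low, high
LANGUAGES = ["E", "D"]               # English, Danish

# The full space holds 2 x 2 x 2 = 8 combinations; dropping the excluded pairing of
# low description depth with high health literacy leaves the six classes of table 2.
classes = [
    (literacy, depth, language)
    for literacy, depth, language in product(HEALTH_LITERACY, DESCRIPTION_DEPTH, LANGUAGES)
    if not (literacy == "H" and depth == "L")
]
assert len(classes) == 6

# At least five prompt variants per class and topic give >= 4 * 6 * 5 = 120 slots
# (123 prompts were constructed in total).
prompt_ids = [
    f"{topic}.{literacy}{depth}.{language}{variant}"   # e.g. PAD.LH.E2
    for topic in TOPICS
    for literacy, depth, language in classes
    for variant in range(1, 6)
]
```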

After constructing at least five prompts for each combination of the six classes and four topics, we are left with the 123 prompts presented together with the responses in the online supplemental materials. We ran all 123 prompts through the 23 March 2023 version of ChatGPT using the state-of-the-art GPT-4 model.15 Each prompt was entered into a new chat session with no repeats (ie, we did not rerun prompts to obtain alternative responses) or follow-ups (ie, we did not supply further prompts to clarify or extend responses). The responses were subsequently scored on the 4Cs of correctness, conciseness, comprehensiveness and comprehensibility according to the criteria set forth in table 3. Reflections were noted down by the author team, which comprises experts in cardiovascular disease and nursing, machine learning, communication and patient perspectives. Correctness and comprehensiveness were scored by medical professionals, while conciseness and comprehensibility were scored by communication and patient perspective experts. Each score was reviewed by at least two authors to ensure inter-rater reliability by consensus.

Table 3 | Overview of the 4Cs of large language model evaluation in medical fields, detailing the scoring criteria to which we subjected all responses generated by ChatGPT
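
Purely as an illustration of the resulting scoring scheme, each prompt-response pair can be thought of as a small record holding the four scores on the five-point scale of table 3; this is a minimal sketch, and the class and field names are illustrative rather than those used in our materials.

```python
from dataclasses import dataclass

@dataclass
class ScoredResponse:
    """One prompt-response pair scored on the 4Cs (1 = worst, 5 = best)."""
    prompt_id: str        # e.g. "PAD.LH.E2"
    correct: int          # scored by medical professionals
    comprehensive: int    # scored by medical professionals
    concise: int          # scored by communication/patient perspective experts
    comprehensible: int   # scored by communication/patient perspective experts

# Fictitious example: a Danish, high-health literacy, high-description depth MI prompt.
example = ScoredResponse("MI.HH.D1", correct=4, comprehensive=4, concise=3, comprehensible=4)
```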

The main limitations of our research design are that we refrained from including patients in the assessment of prompt-response pairs and that we limited our systematic exploration to a selection of four cardiovascular topics. However, small-scale experimentation by the authors of this article suggests that a similar exploration of other common medical fields such as endocrinology and oncology is likely to yield comparable results.

Results

As of 30 May 2023, 528 items on PubMed contain the term ChatGPT in their title and/or abstract, with 258 being classified as journal articles and included in our thematic state-of-the-art review. The five main themes identified represent potential and actual application areas of LLMs and comprise clinical practice (73, clipra), scientific publishing (72, medpub), education of medical professionals (71, edupro), patient education (65, patedu) and medical research (39, medres), supplemented by minor themes of often discursive/philosophical nature (48). Many articles discussed more than one of these themes, with the distribution of themes to articles represented in figure 2.

Figure 2

UpSet plot of the distribution of the five main themes found across the 258 included PubMed articles.

Next in this section, we provide an overview of the experimental results. The full breakdown of every individual prompt-response pair is found in the online supplemental materials, where we identify the pairs by topic (MI/PAD/VV/CP), health literacy (L/H for low/high), description depth (L/H), language (E/D for English/Danish) and a variant number (1–5). For example, PAD.LH.E2 identifies the second English, low-health literacy, high-description depth prompt variant for PAD.

Overall, the prompt responses score somewhere between three and four out of five, with conciseness generally being the weakest of the 4Cs (see figure 3). This trend is somewhat consistent across various sortings and subdivisions of the data, though with some variance.

Figure 3

Histogram illustrating the distribution of scores across the four criteria correct, concise, comprehensive and comprehensible on the whole dataset of 123 prompt-response pairs.

When sorting by topic (see table 4), MI appears to be among the more consistently correct and concise topics, while VV receives better scores on average for comprehensiveness and comprehensibility. CP generally seems to be doing the worst on all fronts, indicating that ChatGPT may be trying to weave together too many contradictory ideas or that the scoring was in some way unintentionally biased.

Table 4 | Average scores by topic with SE

When arranging the data by the different prompt classes (see table 5), we observe that, on average, the high-health literacy prompts gave more correct, comprehensive and comprehensible (for patients with high literacy) results compared with low-health literacy prompts, but they were also less concise. Description depth seems to have little effect, except that a more descriptive prompt sometimes gives a more comprehensible response, which seems sensible.

Table 5 | Average scores grouped by different prompt classes
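
The per-topic and per-class averages in tables 4 and 5 correspond to straightforward group-wise means with standard errors. A minimal pandas sketch of this aggregation is given below; the file name and column names are illustrative assumptions, not part of our supplemental materials.

```python
import pandas as pd

# Hypothetical flat table of the 123 scored prompt-response pairs; the file name and
# column names are illustrative, not those of the online supplemental materials.
scores = pd.read_csv("prompt_response_scores.csv")
# columns: topic, literacy, depth, language, variant,
#          correct, concise, comprehensive, comprehensible

criteria = ["correct", "concise", "comprehensive", "comprehensible"]

# Mean and standard error of the mean per topic (cf. table 4) ...
by_topic = scores.groupby("topic")[criteria].agg(["mean", "sem"])

# ... and per prompt class (cf. table 5).
by_class = scores.groupby(["literacy", "depth", "language"])[criteria].agg(["mean", "sem"])

print(by_topic.round(2), by_class.round(2), sep="\n\n")
```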

Finally, the prompts and responses in Danish scored better on all criteria, which is unexpected given that ChatGPT was trained predominantly on English-language datasets. A likely explanation is that the Danish training data contain less of the user-generated and/or self-help material that might obscure and dilute the influence of official recommendations, textbooks and freely available professional doctor manuals. As an example of such material surfacing in English, consider the absurd recommendation to avoid ‘tight clothing, particularly around your waist, groin and legs’ as this ‘can restrict blood flow’ (VV.HH.E1).

Figure 4 visualises the class imbalance-corrected, relative counts across health literacy (L/H), description depth (L/H) and language (E/D). Regarding health literacy, low-health literacy prompts receive worse scores on comprehensiveness and comprehensibility than high-health literacy prompts but slightly better scores on conciseness. Description depth presents a more balanced picture, though high description depth also accounts for the majority of the comprehensibility scores of five. Regarding language, Danish has both the most fives and the fewest ones on all parameters.

Figure 4

Clustered relative population size histogram (corrected for class imbalance) visualising how the different characteristics influence the distributions of scores. If the filled and half-filled bars meet at 50%, the two categories received that score equally often; if the boundary tends towards one side, one of the categories received fewer of that score than the other.
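
One way to compute such class imbalance-corrected relative counts is sketched below. This is an illustrative reading of figure 4 rather than the exact procedure used to generate it, and all names are assumptions.

```python
import pandas as pd

def relative_score_counts(scores: pd.DataFrame, category: str, criterion: str) -> pd.DataFrame:
    """Share of each category (eg, language E vs D) among responses receiving a given
    score, after correcting for unequal category sizes."""
    # Within each category, the proportion of responses receiving each score (1-5).
    shares = (scores.groupby(category)[criterion]
                    .value_counts(normalize=True)
                    .unstack(fill_value=0.0))
    # Normalise per score so the categories sum to 1: a 50/50 split means no difference,
    # a skew means one category received that score relatively less often.
    return shares.div(shares.sum(axis=0), axis=1)

# Example: how Danish versus English responses split across correctness scores.
# relative_score_counts(scores, category="language", criterion="correct")
```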

We conclude the quantitative assessment of the prompt-response pairs with an analysis of pairwise mutual information. Mutual information is a measure of the information shared between two variables, that is, of how much we can learn about one variable by knowing the other.16 If X and Y are random variables over the joint space 𝒳 × 𝒴, then their mutual information is

I(X; Y) = H(X) − H(X | Y),

where H(X) is the entropy of X and H(X | Y) is the conditional entropy of X given Y. If X and Y are completely independent, then I(X; Y) = 0. If X is completely determined by the value of Y, then I(X; Y) = H(X).
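
As an illustration, the pairwise mutual information underlying figure 5 can be estimated directly from the tracked discrete variables, for example, with scikit-learn; the sketch below is a minimal, assumed setup, and continuous variables such as prompt and response length would first need to be discretised (eg, binned).

```python
from itertools import combinations

import pandas as pd
from sklearn.metrics import mutual_info_score

# Hypothetical table of the tracked variables per prompt-response pair (see the
# aggregation sketch above); continuous variables such as prompt and response length
# would need to be binned (eg, with pd.qcut) before being included.
data = pd.read_csv("prompt_response_scores.csv")
variables = ["topic", "literacy", "depth", "language",
             "correct", "concise", "comprehensive", "comprehensible"]

# Pairwise I(X; Y) in nats; the matrix is symmetric with H(X) on the diagonal.
mi = pd.DataFrame(0.0, index=variables, columns=variables)
for x in variables:
    mi.loc[x, x] = mutual_info_score(data[x], data[x])            # I(X; X) = H(X)
for x, y in combinations(variables, 2):
    mi.loc[x, y] = mi.loc[y, x] = mutual_info_score(data[x], data[y])
```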

Figure 5 provides a heatmap of the pairwise mutual information. Notably, comprehensibility and correctness are associated, while the other criteria are not significantly influenced by each other. What matters most for the scores are the response (and prompt) lengths and, to some degree, the topic.

Figure 5

Pairwise mutual information of the variables we tracked in the systematic exploration of prompting and scoring the responses. Mutual information measures how much knowing one variable tells us about another variable.

Qualitatively analysing the responses, we find that the verbs and nouns used in the prompts, as well as whether negation is used, strongly influence the style and wording of the responses. The patients tend to receive ‘advice’ (MI.LL.E2) when asking what they are supposed to do and ‘general information’ (MI.LL.E1) when asking about the symptoms of a heart attack.

Interestingly, the description depth appears to have little bearing on the correctness and conciseness of the responses. In some instances, additional details in the prompt derail rather than improve the responses, for example, when ChatGPT focuses on the (arbitrary) ‘fitness training’ context mentioned in the prompt (MI.LL.D5), missing a potentially acute MI and instead providing extensive best practices for safe fitness training.

The correctness of the health advice is also questionable in some of the responses, where ChatGPT seems to reproduce health beliefs rather than evidence-based medical consensus. In addition to ‘avoid tight clothing’ (VV.HH.E1), ChatGPT also suggests ‘herbal remedies’, which it notes ‘are believed to improve blood circulation’ (VV.LL.E2). It also suggests treating PAD by ‘sleeping on your side’ and advises patients to ‘elevate your legs while sleeping’. While the former is nonsensical, the latter should be considered harmful as it may result in falling perfusion pressure, triggering pain and, ultimately, ischaemic tissue loss. As noted above, such responses are more prolific in English than in Danish.

In many responses scoring low on conciseness, ChatGPT lists a multitude of symptoms, diagnoses and treatment options (PAD.LL.D5) or even other potential medical conditions (PAD.LL.E2). Despite these extensive responses, ChatGPT remains vague on the more quantitative aspects involved, such as how many symptoms a patient should exhibit before medical assistance is sought. When asked directly for quantitative information, ChatGPT usually just lists risk factors instead of estimating risk (CP.HH.E5) or provides highly vague responses such as ‘80%–90%’ when asked what proportion of ‘heart attacks’ is ‘preventable’ (CP.HH.E4). Finally, ChatGPT also provides some additional information without being prompted to do so, for example, by prominently including insights on how symptoms vary between individuals of differing gender (MI.HH.E5) or racial/ethnic backgrounds (T1.HH.D3).

Discussion

Our quantitative analysis found average scores of just under four out of five for correctness. These results align well with extant literature, in which subject professionals scored 100 and 96 prompt-response pairs, respectively, on a five-point grading scale, finding average scores of around four out of five for validity/factual correctness.17 18 Our results demonstrate the value of considering three further dimensions when evaluating prompt-response pairs: two independent dimensions (conciseness, comprehensibility) as well as one somewhat correlated but predominantly complementary dimension (comprehensiveness).

Our qualitative analysis uncovered some areas of concern that can be expected to have both direct and indirect implications for the use of LLMs for patient education, as well as potential for further improvements. While the correctness of the responses of LLMs is, of course, important, it is even more important to understand to what degree patients believe the health information and advice acquired through LLMs,19 and how those beliefs affect their practices of diagnosis, treatment and interaction. Health information campaigns need to stress that LLMs can make mistakes, that their responses and advice should be viewed critically and that officially sanctioned health information should be preferred. The development of carefully curated, self-surveilling LLMs such as Med-PaLM-2 is a first step in this direction.20

Furthermore, to increase validity and contextualise responses, AI systems such as ChatGPT need to be taught to find and/or derive quantitative estimates of risks, disease prevalence, biochemical thresholds, etc, in addition to qualitative summaries and lists of symptoms, risk factors, diagnosis and treatment options.

The organisations developing LLMs also need to critically re-evaluate in which situations the responses should contain disclaimers, present long lists of diagnosis and treatment options that might or might not be available to the patients, or even nudge patients to seek medical assistance, and when this is neither required nor desirable. Here, LLMs could and arguably should be imbued with the capacity to interview the patient to better understand their situation and provide more personalised and selective responses.

For the time being, many of the responses we analysed leave the impression of shying away from such responsible considerations in favour of providing long, authoritative-sounding but only semi-comprehensive responses to which disclaimers have been added as part of the human feedback integrated into LLM development. Patients are confronted with unlikely but severe potential diseases, long lists of symptoms without guidance and many diagnosis and treatment options. This has the potential both to instigate fear and anxiety in patients and to reinforce trends of self-diagnosis and overdiagnosis. Patients might also feel obliged to act on the advice obtained, with the alternative being to bear the responsibility of inactivity. Acknowledging the potential of LLMs like ChatGPT to get more patients the right diagnosis and necessary treatment in good time and, thus, to improve health outcomes, an important but non-trivial task will be to find the right balance between underplaying and overplaying risks and the need for action.

Last but not least, observing that the quality of ChatGPT’s responses nearly always reflects the quality of the prompts, LLMs might prove a hazard to rather than an aid for already disadvantaged populations. Not only do those who are less proficient at formulating prompts and possess less health literacy receive less valuable responses; the correlation between prompt and response quality also has the potential to amplify existing health inequalities, increasing the health literacy and agency of those who already possess relatively high levels while yielding little to no benefit for those who do not.21 Ultimately, this can be expected to amplify differences regarding key areas of contemporary health policy such as patient involvement, education and empowerment.

While the responses of ChatGPT might be perceived as more empathetic than human responses,8 medical and cross-disciplinary research on LLMs in medical fields needs to take into account how socioeconomic and cultural resources of the patients impact not only their interaction with LLMs but also their ability to benefit from this interaction.22 The very perception of an empathetic conversational partner in matters of health might exacerbate an already existing problem where low-resource patients shy away from interactions with medical professionals. Simultaneously, the ease of obtaining confident and extensive health advice might increase anxiety among high-resource patients.

Demanding consumer patients, who already equip themselves with printed results of Google searches when consulting medical professionals,19 might sooner rather than later start to demand to be diagnosed and treated using all the options known to and listed by LLMs, which have already demonstrated surprising adeptness at differential diagnosis.23 Under current legislation, healthcare workers typically have no choice other than to comply with such demands to avoid liability and the risk of prosecution in case unlikely events do occur. This is a potential bomb under healthcare systems: insurance-based systems will need to raise their premiums, while tax-financed systems, operating under constantly sparse financial resources, will increasingly be forced to spend critical human resources on healthy citizens instead of on patients actually requiring medical assistance.