Guidelines for AI Research in Medicine

Reporting guidelines for medical research first appeared in the 1990s as a means to promote high-quality, transparent scientific communication. Over the past year, we have witnessed a growing number of guidelines being developed specifically for medical AI research. These guidelines represent expert consensus and will ultimately help establish AI research within the medical sciences, while also acknowledging its nuances. In this article, we will explore ten guidelines serving multiple stages of a typical medical AI application lifecycle. We will then outline how guidelines are used by journals and researchers today. We argue that they may fail to assume their intended advisory role, and are often regarded as an after-the-fact journal requirement. We will then identify issues related to the adherence and accuracy of guidelines, and finally touch on research aimed at measuring their long-term impact.

Medical Research and Guidelines

Medical research in the 1970s and 1980s suffered significantly from poor, or at best mediocre, methodological quality^↗︎. In response, the community came together to develop guidelines for several types of study design to ensure accurate and transparent reporting. A prominent example of this movement was in clinical trials. At the time, it was demonstrated that poorly reported trials were associated with bias in estimating treatment effects^↗︎. To address this, stakeholders co-developed the Consolidated Standards of Reporting Trials (CONSORT) statement in 1996. CONSORT would become one of the earliest reporting guidelines, initiating a cascade of changes and improvements to the reporting of medical research in scientific journals^↗︎.

As it turns out, developing these guidelines is no easy feat. These efforts often involve large steering committees of transdisciplinary experts: academic faculty, researchers, practitioners, policy makers, and patient groups. After extensive meetings and voting on candidate components, the guideline is delivered in the form of a checklist together with a statement paper describing the development process^↗︎. This checklist represents a consensus-based minimal set of 15 to 30 items that these experts have determined should be reported in a study^↗︎.

“Readers should not have to infer what was probably done, they should be told explicitly.^↗︎”

While the ultimate goal is ensuring high-quality, transparent, and complete scientific communication, no two studies adhere to a given guideline in exactly the same way. As such, these guidelines are not meant to be strictly followed, but rather used as an advisory mechanism supporting authors as they develop their research. They are also designed to assist editors, peer reviewers, and the general readership in understanding, interpreting, and critically appraising the findings^↗︎. Many journals have since required authors to submit completed checklists indicating where each item has been reported.

Guidelines for Medical AI

AI applications in medicine have increasingly populated scientific journals since ca. 2015. Despite the momentum, inflated hype around the technology and a reproducibility crisis^↗︎ have led many to question the quality and scientific rigor of the analyses being conducted. Once again, we see the community mobilizing efforts to address this. Just over the past year, we have started to see existing guidelines being “extended” to include studies with AI components, as well as newly minted AI-specific guidelines being developed. Acknowledging medical AI research in such a manner brings much-needed legitimacy to the field and will play an important role in shaping its future. For context, it took nearly two decades to identify issues with clinical trials and take corrective measures^↗︎. This time, it took only 5 years to do the same for AI interventions. Progress.

Here, we explore 10 guidelines serving four stages of a typical medical AI application lifecycle: development & in silico validation, intermediate clinical evaluation, randomized clinical trials, and finally, product procurement.

Figure: An overview of 10 guidelines serving four stages of a typical medical AI application lifecycle.

Stage 1: Development & in silico Validation

Most guidelines serve this crucial proof-of-concept stage and aim to capture the nuances of developing AI applications. The argument here is that differences in terminology between existing guidelines and contemporary ML research may explain why those guidelines are underutilized^↗︎. For instance, while TRIPOD focused on simpler regression-based models^↗︎, its extension TRIPOD-ML may include specifics pertaining to training and validating neural networks and the associated hyperparameter tuning.

Academia is no stranger to duplicated efforts, and it is unclear why multiple guidelines are needed here, as they all tend to address identical ML concepts: model design, data partitioning, validation metrics, etc. This is also true for specialized guidelines such as PRIME for cardiovascular imaging. Instead of focusing on generic concepts, these specialized guidelines should focus more on what makes the specialty unique, i.e., the specific data it uses and the clinical task it addresses.

“A cynic might be forgiven for thinking that there are now so many publication guidelines that nobody can keep track of, and that they will all sink quietly into oblivion^↗︎”

Stage 2: Intermediate Clinical Evaluation

While it is great to see guidelines being developed here, the mere acknowledgement of this stage is worth noting. In silico AI development is a proof of concept that does not inspire sufficient confidence to run clinical trials. This intermediate clinical evaluation stage allows for studying ergonomics and human factors by running small clinical experiments. Such a “dry run” deployment enables rapid prototyping with user feedback in a simulated clinical environment, ultimately informing the go/no-go decision to conduct large and expensive trials. Because this stage may also allow for testing the AI’s safety profile, it has been compared to phase 1/2 trials in the drug development space^↗︎. These types of intermediate clinical experiments represent a minority share of the literature today, as most studies tend to be in silico.

Stage 3: Randomized Clinical Trials

Clinical trials are considered the gold standard for medical evidence. Trials for AI interventions are fairly new, as we have only started to see them in the past few years^↗︎. It is only a matter of time before they become an established research line, one that will immensely help discern hype from true clinical utility. From a regulatory perspective, trial-based evidence is urgently needed, as the FDA is currently approving AI applications mainly based on preliminary evidence^↗︎.

With SPIRIT-AI focusing on trial protocols and CONSORT-AI on trial results, these guidelines may help standardize the reporting of how AI interventions are administered and how AI outputs contribute to users’ decision-making. This new breed of AI trials requires additional considerations. For instance, while traditional trials report inclusion/exclusion criteria for human participants, AI trials must also report the same for input data, together with protocols for handling subpar data. While the guidelines’ authors chose not to discuss continuously updated AI models trained on new data^↗︎^↗︎, it will be exciting to see how future trial designs will take this into consideration.
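As a rough illustration of what pre-specified inclusion/exclusion criteria for input data could look like in practice, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `ImageRecord` fields, the thresholds, and the exclusion reasons are hypothetical and are not taken from SPIRIT-AI or CONSORT-AI.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImageRecord:
    """Hypothetical metadata for one input image submitted to the AI system."""
    record_id: str
    modality: str
    width_px: int
    height_px: int
    quality_score: float  # hypothetical 0-1 score from a quality-control step


# Hypothetical, pre-specified criteria; a real trial would define its own.
ALLOWED_MODALITIES = {"fundus"}
MIN_SIDE_PX = 512
MIN_QUALITY = 0.6


def exclusion_reason(record: ImageRecord) -> Optional[str]:
    """Return why a record is excluded, or None if it meets all criteria."""
    if record.modality not in ALLOWED_MODALITIES:
        return "unsupported modality"
    if min(record.width_px, record.height_px) < MIN_SIDE_PX:
        return "resolution below minimum"
    if record.quality_score < MIN_QUALITY:
        return "subpar image quality"
    return None


def screen(records):
    """Split records into included ones and a log of exclusions with reasons."""
    included, excluded = [], {}
    for record in records:
        reason = exclusion_reason(record)
        if reason is None:
            included.append(record)
        else:
            excluded[record.record_id] = reason
    return included, excluded
```

Keeping a reason for every excluded input, rather than silently dropping it, is what would make it possible to report how subpar data was handled alongside the participant flow.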

Stage 4: Product Procurement

We are also seeing guidelines being developed beyond the academic sphere, specifically for procuring AI products. This is essential, as the evaluation of medical device software with AI components differs from that of its generic counterparts. In addition to algorithm-related considerations, one must note that AI software consumes data to generate value, as opposed to solely performing record-keeping functions (create, read, update, and delete records). To ensure maximum value from a given product, the “data profile” of an institution must match what the product is designed to work with.
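To make the idea of matching a “data profile” concrete, the following is a small Python sketch. The attribute names and values are hypothetical, and representing a profile as a dictionary of supported values is an assumption made here for illustration, not something prescribed by any procurement guideline.

```python
def profile_mismatches(product_profile, institution_profile):
    """List attributes where the institution's data falls outside what the product supports.

    product_profile maps an attribute name to the set of values the product
    was designed for; institution_profile maps the same names to the value
    found at the institution. Both structures are hypothetical.
    """
    mismatches = []
    for attribute, supported in product_profile.items():
        actual = institution_profile.get(attribute)
        if actual is None:
            mismatches.append(f"{attribute}: not documented by institution")
        elif actual not in supported:
            mismatches.append(f"{attribute}: {actual!r} not in {sorted(supported)}")
    return mismatches


# Hypothetical example: the product supports two scanner vendors, the site uses a third.
product = {"modality": {"CT"}, "scanner_vendor": {"A", "B"}}
site = {"modality": "CT", "scanner_vendor": "C"}
issues = profile_mismatches(product, site)
```

An empty list would suggest the institution's data falls within what the product was designed for, at least for the attributes the vendor chose to disclose.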

In addition to ECLAIR, other guidelines exist, including the NHSX AI buyer’s guide^↗︎ as well as other AI purchasing guides^↗︎. While these offer a wishlist of information about a product, much of that information may never be made available. For instance, while it is natural for a clinician to inquire about the data used to train an AI model, vendors today do not reveal this information, nor are they legally required to do so. It will take some time for vendors to adjust to the nature of AI products and hopefully provide more transparency. While there may be only a handful of AI vendor options in each application area today, their increasing number will likely cause “vendor fragmentation” and make for a more challenging procurement process. We are already seeing this today in EHR^↗︎. In the same way we have consultants to help with procuring, designing, and implementing EHR^↗︎, expect to see AI consultants moving forward.

The Nuances of AI Guidelines

Guidelines for medical AI research call for additional considerations beyond their more generic ancestors. Best practices of AI research are constantly evolving, and so should the guidelines that accompany them. While CONSORT has been revised twice since its inception in 1996, AI guidelines may require more frequent revisions. As data plays a central role in AI applications, we are also seeing guidelines become more data-specific. AI research today is centered around computer vision, and therefore most guidelines cater to imaging-based applications^↗︎. Guidelines for text and speech data are yet to be formalized.

There are also limits to what AI guidelines can achieve. Some guidelines promise to encourage reproducible research through checklist items such as “complete sharing of the code” and “allow a third party to evaluate the code”^↗︎. While this may help measure transparency in the field, reproducibility challenges are often cultural (the appetite to share), computational (controlling software environments), and ethical (healthcare data privacy). It is unlikely that reporting guidelines will move the needle in that regard.

How are Guidelines Really Used?

Despite their delightful acronym-based names, guidelines often fail to serve their intended advisory role. The main reason: guidelines have become journal-specific rather than study-specific. Your likelihood of consulting a specific guideline depends almost entirely on whether the journal you are submitting to requires it as part of its submission process. As such, authors rarely consult guidelines during the development and writing phases of research. Instead, guidelines end up being treated as an after-the-fact checklist filled in at submission time and appended to published studies.

While the logical move of incorporating guidelines into the journal submission process has helped extend their reach, it has also pushed them towards becoming part of an otherwise highly mundane process. Journal submission portals are not the most user-friendly and often involve lengthy, frustrating data entry of author names, affiliations, and other information. The guideline checklist has simply become one more item uploaded to these portals.

Adherence & Accuracy

Both the adherence to and accuracy of reporting guidelines have come under scrutiny. The responsibility of adhering to guidelines falls entirely on the author(s), given that they are most familiar with the work presented. Both journal editors and peer reviewers have distanced themselves from policing the correct use of guidelines, arguing that it is a burden that falls outside their competence^↗︎. As a result, authors’ unfamiliarity with guidelines, the lack of a second opinion, and other external factors such as word count limits may all lead to checklists that do not reflect what is reported in the study^↗︎. Moreover, the large variability in how journals incorporate reporting guidelines into their “instructions to authors” may cause additional confusion. The wording ranges from “please refer to” and “encourage” to “should conform” and “must be reported”^↗︎.

Authors will always prioritize fulfilling the editors’ and peer reviewers’ requests over conforming to a checklist that has no direct bearing on whether the study will be accepted for publishing.

Measuring Guideline Impact and Reach

The emergence of guidelines to inform AI research in medicine just 5 years into this relatively new field is a positive hint at its prospects. To fully understand the potential impact of these newly proposed guidelines, one must look into the track record of existing guidelines. Research into this area continues to report mixed findings. For one specific guideline, STARD, some report no meaningful differences in the quality of reporting between journals that endorse it and those that do not^↗︎, while others report a small but significant improvement^↗︎. The same is also true for CONSORT where one study reports “extensive misunderstandings” around guideline interpretation among journals^↗︎^↗︎, while another reports that its adoption is associated with improved reporting of trials^↗︎.

Guideline reach today is measured by the number of journal endorsements and citations. We are lacking data on how readers interact with, interpret, and use guidelines to appraise studies. If readers fail to utilize guidelines in the intended manner, the guidelines will fail to deliver on their promises. Additionally, more work remains to be done in improving authors’ familiarity with the guidelines and promoting their use earlier in the research journey, before the academic editorial process starts.

It will be a while before we experience the impact of these guidelines. To make the most of them, they must be used in the right way^↗︎: less as a quality evaluation form or a strict document to be followed verbatim^↗︎, and more as overarching guidance for research^↗︎.