Analysis

The Challenge of Evaluating AI Products in Healthcare

Danny Tobey, Ashley Carr, Vernessa Pollard, Michael Atleson / Feb 23, 2026

Fanny Maurel & Digit, Ambient Scribes, licensed under CC BY 4.0.

As the use of AI-enabled products in healthcare settings continues to rise, so do questions about how to evaluate their safety and efficacy. From development to deployment, no settled benchmarks exist to answer some fundamental questions: what does good look like in the new world of generative health technology? And how do you test and document that quality for the benefit of patients and innovators alike? No less than the White House, as part of its AI Action Plan, has noted the issue, tasking NIST with leading “the development of the science of measuring and evaluating AI models” and addressing what one noted AI scientist has called “the evaluation crisis.”

These issues span the most basic medical chatbots to generative systems used for medical notetaking, synthesizing vast health records, defragmenting information across disparate EHRs and institutions, and running hospitals through agentic triage and risk monitoring. As the stakes rise from low-risk administrative functions to high-impact health interventions, so does the need for stronger quality assurance metrics. Risk can also hide in unexpected places: the EU AI Act classified general chatbots as an exemplar of limited-risk AI, only for the world to discover that one of their most popular unexpected uses was as therapists. Innovators need to know where the goal lines are for medical AI, and providers and patients should be able to trust these products to be safe and effective.

Recent regulatory and research activity has focused on the responsibilities and roles of diverse stakeholders in healthcare, grappling with how to evaluate chatbots and other emerging AI products entering the market. Here we explore some of this activity to reveal both the challenges and reasons for optimism – and to warn about unsubstantiated claims for these products based on tests that fail to measure real-world performance.

Governance and oversight challenges

Regulators and lawmakers addressing these evaluation issues include federal health regulators like the FDA, state medical boards, and state legislatures that are passing broad and narrow AI-related laws. Health systems and makers of devices and software are obviously key stakeholders as well. Their collective efforts reflect how new technologies can highlight regulatory gaps and overlaps, with each category having its own challenges. Regulatory gaps create dead zones where no one is sure which, if any, agency or law applies. At the same time, overlapping jurisdiction means having to navigate multiple frameworks, which are sometimes redundant (yet different enough to cause waste) or inconsistent. None of this helps patients or innovators.

In October 2025, the Journal of the American Medical Association (JAMA) published its Summit Report on Artificial Intelligence, intended as a roadmap for safe and effective AI in healthcare. The report focuses principally on the need for better and more robust methods and tools to evaluate the efficacy and safety of AI products in clinical and administrative settings, and offers specific recommendations to that end. “All these tools can have important health effects (good or bad),” the authors say, “but these effects are often not quantified because evaluations are extremely challenging or not required, in part because many are outside the US Food and Drug Administration’s regulatory oversight.” They also note that, even for tools over which the FDA does have authority, “clearance does not necessarily require demonstration of improved clinical outcomes.” Moreover, “generative and agentic AI tools can be capable of so many tasks as to seriously challenge the traditional intended use framework for device regulation.”

Later that month, researchers from Johns Hopkins, Georgetown, and Yale published an article in JAMA Health Forum discussing the lack of clinical validation for many AI-enabled medical devices (AIMDs) that companies released and later recalled. In analyzing roughly 1,000 AIMDs cleared by the FDA, they found that most of the recalled devices from public companies had not been clinically tested and that many recall events had occurred within the first year after FDA clearance. One of the authors commented, "We just thought, 'wow—if AI hasn't been tested on people, then people become the test.'"

Meanwhile, the FDA accepted submissions through December 1, 2025, in response to a Request for Public Comment seeking insights on evaluating the real-world performance of AIMDs, though the agency did not seek comments on its own processes. Focusing specifically on AIMDs for mental health, the FDA Digital Health Advisory Committee held a public meeting on November 6, 2025, on “Generative Artificial Intelligence Enabled Digital Mental Health Medical Devices.” At its previous meeting, in November 2024, the Committee concluded that regulatory approaches are needed to balance innovative public health products against the risk of harm to users, and to provide reasonable assurance of safety and effectiveness. The 2025 event included experts from industry, government, and academia discussing the evaluation and regulation of these devices.

Once the Advisory Committee makes any recommendations, it will remain to be seen what the FDA does with them. To date, the agency has not approved any device that uses generative AI for mental health, though it has given at least two such devices a “breakthrough device designation,” which speeds regulatory review. Some companies are likely avoiding the FDA process altogether by taking the position that their bots are intended for wellness, not mental health therapy, and are thus not medical devices at all. In 2022, the agency said that it would “exercise enforcement discretion” over a limited subset of software functions that could be used in the diagnosis or treatment of psychiatric conditions and diseases, because, while they might be medical devices, they “pose lower risk to the public.” Examples of products in this category include educational tools, reminders, symptom trackers and motivational or skill-building functions that support users in building or reinforcing habits or behaviors to help manage a diagnosed psychiatric condition.

As we have explored in past DLA Piper alerts, however, the regulatory winds have shifted recently with the intense focus on the mental health risk of chatbots, especially for children. For example, the FTC’s ongoing market study of companion bots includes consideration of what companies do, before and after deployment, to address potential negative impacts, with a focus on protections for minors. State attorneys general have sent a series of letters to AI companies warning them about legal consequences if they do not use the data they have on user interactions to mitigate such impacts.

Meanwhile, state legislatures have been passing laws restricting the marketing and deployment of mental health and companion bots, including a recently passed law in California. None of those state laws specifically addresses pre-market evaluation. In California, Governor Newsom vetoed a separate bill that likely would have had the practical effect of requiring such evaluation, in that it would have prohibited making companion chatbots available to children unless the chatbots were not foreseeably capable of engaging in specified conduct that could harm children.

However, one broad state law, the Colorado AI Act, which takes effect on June 30, 2026, would treat medical AI tools as high-risk systems to the extent they are a substantial factor in “consequential decisions” as to the provision or denial of healthcare services to consumers. If so covered, developers are subject to notable requirements, including impact assessments and risk management.

We may also see activity in this area from state medical boards. In 2024, the Federation of State Medical Boards (FSMB) issued a paper, Navigating the Responsible and Ethical Incorporation of Artificial Intelligence into Clinical Practice, which recognizes “the need for verification of AI-generated clinical information for accuracy.” The FSMB said that it should “develop documentation detailing the capabilities and limitations of the most commonly used AI tools” and that clinical decision-makers “should design a process for regular review of the efficacy of the tools.” The paper also notes AI’s possible impact on the corporate practice of medicine doctrine, the applicability and contours of which vary by jurisdiction, but which is intended to ensure that only licensed physicians make medical decisions. Regulators could use that doctrine to restrict the usage of certain medical AI tools in ways that effectively mandate more human oversight of their efficacy.

Benchmark limitations and the measurement problem in generative AI

Healthcare is by no means the only sector facing serious challenges in evaluating the safety or efficacy of new AI products. Law is another one. And these challenges are not new; prominent researchers have been focusing for years on the general problems with existing benchmarks and methods for testing predictive and generative models.

On December 2, 2025, the National Institute of Standards and Technology (NIST) published a blog post from its Center for AI Standards and Innovation (CAISI) describing what it is doing to meet the “need for improved AI measurement science.” Today, CAISI says, “many evaluations of AI systems do not precisely articulate what has been measured, much less whether the measurements are valid.” As noted in the post, tackling this problem is part of the White House’s AI Action Plan, which tasks NIST with leading “the development of the science of measuring and evaluating AI models.” CAISI explains that challenges include ensuring that measurement is valid in terms of construct validity (does a test measure what it claims to measure?) and generalization (will results apply in other contexts and real-world settings?). The blog post also describes how benchmarks can be problematic because of the train-test issue (whether the test questions are already in the model’s training data) and susceptibility to developer gaming (whether the model was designed or optimized specifically for the test).
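To make the train-test issue concrete, the overlap check at its simplest can be sketched in a few lines of Python. This is an illustrative toy, not a tool CAISI describes; the corpus document and questions below are invented, and real contamination audits use far larger corpora and fuzzier matching.

```python
# Toy contamination check: what fraction of a benchmark question's
# word 8-grams appear verbatim in a training corpus sample?
# All data below is invented for illustration.

def ngrams(text, n=8):
    """Set of word n-grams in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(question, corpus_docs, n=8):
    """Fraction of the question's n-grams found verbatim in any
    corpus document; a high score suggests the question (or a
    near-copy) may already sit in the training data."""
    q = ngrams(question, n)
    if not q:
        return 0.0
    corpus = set()
    for doc in corpus_docs:
        corpus |= ngrams(doc, n)
    return len(q & corpus) / len(q)

corpus = ["A 54-year-old patient presents with chest pain radiating "
          "to the left arm and diaphoresis, suggesting acute coronary syndrome."]
leaked = ("A 54-year-old patient presents with chest pain radiating "
          "to the left arm and diaphoresis. What is the most likely diagnosis?")
fresh = ("A hiker reports a circular rash and fatigue two weeks "
         "after a tick bite. What is the most likely diagnosis?")

print(contamination_score(leaked, corpus))  # high: 6 of 13 n-grams shared
print(contamination_score(fresh, corpus))   # 0.0
```

A model that aces leaked items while failing comparable fresh ones is exhibiting exactly the memorization-versus-generalization gap CAISI warns about.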

CAISI’s efforts are not limited to generative models, but it is well recognized that solid methods for evaluating generative models are proving especially difficult to establish. Noted computer scientist Andrej Karpathy even referred to an “evaluation crisis” regarding large language models (LLMs). An international group of researchers led by the Oxford Internet Institute published findings that 445 LLM benchmarks lacked scientific rigor, having been “built on unclear definitions or weak analytical methods, making it difficult to draw reliable conclusions about AI progress, capabilities or safety.”

Several reasons explain why, as some researchers put it, “[a]cross academia, industry, and government … there is an increasing awareness that the measurement tasks involved in evaluating GenAI systems are more difficult” than those involving traditional machine learning systems. Generative AI is, of course, a newer technology, and basic standards – such as for assessing accuracy – take time to establish. It is also hard to establish metrics because they would vary by use case, whereas these models are generally not built for a single use case, and because use cases also evolve over time. These systems also accept a variety of inputs and produce diverse outputs that can have a range of effects on people. As a result, what often needs to be measured is abstract and contested. Evaluation is further complicated because companies building these models are often not transparent about how and with what data they were developed, and downstream users thus have little access to or control over them.

A March 2025 paper by University of California and University of Virginia researchers focuses specifically on the problem of construct validity for medical large language benchmarks, finding that they “are arbitrarily constructed using medical licensing exam questions.” To measure real progress, the authors say, these benchmarks “must accurately capture the real-world tasks they aim to represent.” Their vision for a new ecosystem of valid benchmarks involves empirical evaluation, analogous to psychological tests.

This approach dovetails with a June 2025 paper that discusses the general lack of scientific rigor for evaluating generative AI systems and calls for the use of measurement theory from the social sciences. Yet another recent paper discusses the same topic and suggests borrowing from fields such as transportation, aerospace, and pharmaceutical engineering.

A September 2025 study, this one from Microsoft Research, examined the top scores achieved by six flagship multimodal models on medical benchmarks. When “stress tested,” these frontier models “often guess correctly even when key inputs like images are removed, flip answers under trivial prompt changes, and fabricate convincing yet flawed reasoning.” The authors conclude that their tests “expose how today's benchmarks reward test-taking tricks over medical understanding” and thus do not directly reflect real-world readiness.
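The kind of stress test described in that study can be sketched schematically. In this sketch, `ask_model` is a hypothetical stand-in for a real multimodal model API (no particular vendor's interface is assumed); the harness checks whether an answer survives a trivial reshuffling of option letters, and whether the model still lands on the correct answer when the image is withheld.

```python
# Schematic stress test for a multiple-choice medical benchmark item.
# `ask_model(question, options, image)` is a hypothetical stand-in for
# a real model API; it returns an option letter such as "A".

import random

def shuffle_options(options, seed=0):
    """Reassign option letters: a trivial perturbation that should not
    change which underlying answer a robust model picks. Assumes
    exactly four options, A-D."""
    texts = list(options.values())
    random.Random(seed).shuffle(texts)
    return dict(zip("ABCD", texts))

def stress_test(item, ask_model):
    base = ask_model(item["question"], item["options"], item.get("image"))
    shuffled = shuffle_options(item["options"])
    perturbed = ask_model(item["question"], shuffled, item.get("image"))
    blind = ask_model(item["question"], item["options"], None)
    return {
        # Does the chosen answer *text* survive the reshuffle?
        "stable_under_shuffle": item["options"][base] == shuffled[perturbed],
        # Answering "correctly" with the image removed suggests the
        # item rewards test-taking shortcuts, not image understanding.
        "guesses_without_image": blind == item["answer"],
    }
```

On a sound benchmark item, one would want `stable_under_shuffle` to be true and `guesses_without_image` to be false; the Microsoft findings suggest frontier models often show the opposite pattern.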

In January 2026, the ARISE network, a Stanford-Harvard collaboration, released its State of Clinical AI (2026) report, which focused on AI use in clinical settings and noted that diagnostic results in real-world deployments were mixed compared with results on controlled benchmarks. The report emphasized the importance of evaluation frameworks, concluding: “The consensus is clear: better evaluation, not just better models, is the prerequisite for trustworthy clinical AI.”

Emerging solutions for evaluation and risk management

Despite the many challenges with AI evaluation, there are reasons for optimism beyond the NIST announcement and the academic efforts noted above. Several new collaborations have been launched to evaluate AI both broadly and in the healthcare space specifically. For example, a group of researchers, the EvalEval Coalition, is tackling the problem by identifying gaps and priorities in AI evaluation science, as well as designing and performing relevant research. In addition, a new consortium of research organizations, the AI Evaluator Forum, has formed “to establish standards, share knowledge, and ensure that AI evaluations meet the highest levels of methodological rigor and independence.”

Another collaborative, the Health AI Partnership (HAIP), is engaging in multiple efforts to assist health delivery organizations (HDOs) in evaluating AI tools for safe and effective use. For example, HAIP has piloted a hub-and-spoke network to support local HDOs that do not have the resources and capabilities to evaluate their AI tools. The partnership, led by the Duke Institute for Health Innovation, has also developed the AI Vendor Disclosure Framework, intended to help HDOs and vendors evaluate AI systems via expert guidance on how to surface critical information. (Two DLA Piper attorneys, Danny Tobey, M.D., J.D., and Zev Eigen, J.D., Ph.D., sit on HAIP’s Leadership Council.)

While pre-deployment evaluation of medical AI tools is crucial, so is monitoring and evaluating them after deployment. In recognition of needs in this area, an Aspen Policy Fellow has developed a framework for post-deployment evaluation of AI-enabled clinical decision support tools.

Finally, some research has focused on LLM use in patient-facing contexts rather than use by medical professionals. A set of researchers from South Korea developed a benchmark, PatientSafeBench, to evaluate such LLMs. Testing 11 models against that benchmark, they found “that no model met our safety criteria for patient use, with medical-specific LLMs surprisingly underperforming general-purpose models.”

While regulatory, research, and community efforts proceed, makers and sellers of AI-enabled healthcare products – especially those designed or used for mental health – remain under scrutiny by regulators, healthcare institutions, and the public. In this climate, developers and deployers should ensure that marketing claims are backed up by reliable testing and that they’ve identified and mitigated any substantial risks to users. Among other things, they shouldn’t rely on bold claims about, or based on, unscientific benchmarks. Regardless of what the FDA and others may do in this area, due diligence here may be a legal requirement under other federal or state legal regimes, such as the FTC Act and AI-related state laws, as well as laws and rules in the European Union and other parts of the world.

Authors

Danny Tobey
Danny Tobey is a Partner at DLA Piper and Global Co-Chair and Chair of DLA Piper Americas AI and Data Analytics Practice.
Ashley Carr
Ashley Carr is a Partner at DLA Piper.
Vernessa Pollard
Vernessa Pollard is a Partner at DLA Piper and Chair, FDA Regulatory Practice.
Michael Atleson
Michael Atleson is Of Counsel at DLA Piper.
