AI safety cannot depend on the person using the system. It has to be built into the platform. That is the core problem with consumer AI tools in public health, and a study published in Nature Medicine last month made the consequences of ignoring it impossible to dismiss.

The Study

Researchers at Mount Sinai tested ChatGPT Health, OpenAI's consumer health tool, using 60 clinician-authored patient scenarios across 21 clinical areas. Among verified emergencies, the system under-triaged 52% of cases, directing patients with conditions like diabetic ketoacidosis and impending respiratory failure to schedule a doctor's appointment within 24 to 48 hours instead of going to the emergency department. When a bystander in the scenario minimized the patient's symptoms, the system shifted its recommendation toward less urgent care. And crisis intervention messages for suicidal ideation fired more often when patients described no specific method than when they did.

We would never let a drug onto the market with those results. We would never let a triage nurse keep working with that record. Yet millions of people are already using this tool for health decisions.

As of publication, the study carries an early-access disclaimer from Nature Medicine indicating the manuscript has not yet undergone final editing. We will update this post if the final version changes any findings.

Why This Keeps Happening

To understand why these failures occur, you have to understand what an AI model actually does. These models predict the next word from the words that came before, over and over, and they return a response no matter what. You cannot tell the model itself not to do something or not to say something, because the model will do whatever is statistically most likely.
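
To make that concrete, here is a toy sketch of next-token sampling. The vocabulary and probabilities are invented for illustration; the point is that the sampling step always emits something.

```python
import random

# Toy illustration of next-token prediction. A real model scores tens of
# thousands of tokens; these continuations and probabilities are invented.
next_token_probs = {
    "appointment": 0.46,
    "emergency": 0.31,
    "rest": 0.23,
}

def next_token(probs):
    # The prediction step always yields *some* token. There is no built-in
    # "refuse to answer" branch inside the model itself.
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights)[0]

print(next_token(next_token_probs))
```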

That is why safety cannot live inside the model. Safety has to come from external, modular mechanisms. Think of it like content moderation: the platform reviews what the model produces and decides whether that output should reach the user. The model generates; the safety layer evaluates.
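
Here is a minimal sketch of that generate-then-evaluate pattern. The `call_model` stub stands in for any model backend, and the keyword check is deliberately crude; a real safety layer would use trained classifiers, not a hardcoded list.

```python
# Generate-then-evaluate: the model produces a draft, and a separate layer
# decides whether that draft should reach the user.

EMERGENCY_TERMS = ("chest pain", "trouble breathing", "suicidal")

def call_model(prompt: str) -> str:
    # Stub: a real backend would return the model's generated text.
    return "You can probably schedule an appointment in a day or two."

def safe_respond(prompt: str) -> str:
    draft = call_model(prompt)  # the model always produces something
    # The safety layer sits outside the model and decides what ships.
    if any(term in prompt.lower() for term in EMERGENCY_TERMS):
        return ("This may be a medical emergency. Call 911 or go to the "
                "nearest emergency department now.")
    return draft

print(safe_respond("My father has chest pain and trouble breathing."))
```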

Here is the problem. The more these models sound like humans, the more people will just believe what they say. A model that sounds like a knowledgeable clinician will be treated like one, even when it tells someone with respiratory failure to wait two days for a doctor's appointment. And these models are always confident. They do not hedge. They do not say "I'm not sure about this one." They give you an answer that sounds authoritative whether it is right or wrong.

The Adoption Problem Nobody Wants to Talk About

The standard guidance for AI in public health tells professionals to always review AI outputs, verify everything, and never trust the system blindly. That guidance is correct. It is also completely unrealistic.

A skilled developer watches AI-generated code in real time, stops it when something looks wrong, and redirects it. That is what informed AI use looks like. I work the same way: I watch what the model is doing and stop it the moment I see something I do not like.

That is not how AI will actually get implemented in a health department. What will happen instead: the AI will tell the community health worker (CHW) what is needed, the CHW will call their supervisor to sign off on it, and the supervisor will, because that is what we do under pressure. The AI writes the email, generates the report, classifies the case, and the overworked professional moves on. Not because they are careless, but because they are already mentally taxed and stretched thin.

Training public health professionals in prompt engineering and drilling into their heads that they should not trust AI: that is an adoption problem. You will not get people to use a tool by telling them not to trust it. And even if you do, most people will not have the bandwidth to verify every output on top of the workload they already have.

Join the Conversation: AI Community of Practice

We host a monthly session where public health professionals talk through exactly these challenges: AI safety, adoption, policy conflicts, and what responsible use actually looks like in practice. No vendors. No slides. Just colleagues working through it together.

Learn More About Coffee & Connect

Consumer Tools Were Not Built for This

When someone opens ChatGPT or any other consumer AI tool, the application defaults to the most recent frontier model. That is a business decision. Frontier models drive premium subscriptions, and the companies that build them want users on the newest, most expensive option.

Frontier models are, by definition, the newest and least tested. Some companies will release frontier models with a disclaimer that safety testing is incomplete. Most users will never read that disclaimer. They will not know which model they are using. They will not know they can switch. They will type their question, get a confident answer, and act on it.

In public health, that means confident answers about disease surveillance, community health assessments, or program eligibility that happen to be wrong. Those mistakes do not just waste time. They affect population-level decisions and the communities those decisions serve.

What We Built Instead

The purpose of AI is to increase the benefit we get from a given set of resources. In health, that means getting more health benefit out of the limited staff, funding, and time a department already has. That is a real and important goal.

The question is how you get there safely.

At F&T Labs, we built PH360 with seven layers of protection designed specifically for governmental public health. The core difference is this: most AI platforms only run guardrails on the input side. They check what goes to the model to keep sensitive information from reaching it, or they stop the process before it gets to the model at all if someone asks a medical question. That is important, and it is not enough.

PH360 runs guardrails on both sides. We check what goes to the model and what comes back. On the output side, we look for protected health information that may have surfaced through web searches, hallucination indicators, missing source citations, repeated or templated responses, and medical content that should not be in the output. If bias is too high or facts cannot be verified, the response comes back with a warning.
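
Here is a minimal sketch of what output-side checking can look like. Every name and pattern here is an illustrative stand-in, not PH360's implementation; real checks use trained classifiers and dedicated PHI-detection services rather than a single regex.

```python
import re

# Illustrative output-side checks. If any check fires, the response is
# returned with a warning attached rather than silently delivered.

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # crude PHI stand-in

def check_output(response: str, sources: list[str]) -> list[str]:
    warnings = []
    if SSN_PATTERN.search(response):
        warnings.append("possible PHI surfaced in output")
    if not sources:
        warnings.append("no source citations to verify against")
    # ...hallucination, repetition, and medical-content checks run here...
    return warnings

def deliver(response: str, sources: list[str]) -> dict:
    # The professional still gets an answer, but the warning tells them
    # not to take it at face value.
    return {"response": response, "warnings": check_output(response, sources)}

print(deliver("Case counts rose 12% this week.", sources=[]))
```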

Layer 1: Secure Data Connection

Data never leaves the organization that owns it. Access permissions enforced automatically.

Layer 2: Smart Information Retrieval

Curated, authoritative sources ranked by relevance. Not generic internet results from whenever the model was last trained.

Layer 3: Tested Instruction Templates

Pre-approved formats validated for public health scenarios, tested against historic requests before rollout.

Layer 4: Right Tool for the Job

Intelligent routing that sends each request to the most appropriate system for what is needed.

Layer 5: Multi-Step Coordination

Complex requests coordinated automatically in the correct order, maintaining context throughout.

Layer 6: Safety & Accuracy Checks

Every request and response validated for PHI, bias, factual accuracy, and regulatory compliance. Both directions.

Layer 7: Complete Audit Trail

Every request, source, and response recorded. Full accountability for audit, regulatory, and legal requirements.
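
Put together, layers like these compose into a single request pipeline. The skeleton below is a toy with stub functions so it runs end to end; the names are hypothetical placeholders, not PH360's actual code.

```python
# Toy skeleton of a layered request pipeline. Each stub stands in for a
# real component; the layer mapping is noted in the comments.

def retrieve_sources(user, request):   # Layers 1-2: permissioned, curated retrieval
    return ["authoritative_source.pdf"]

def fill_template(request, docs):      # Layer 3: pre-approved instruction template
    return f"Using {docs}, answer: {request}"

def route(request):                    # Layer 4: right tool for the job
    return "general-purpose-model"

def run(model, prompt):                # Layer 5: coordinated multi-step execution
    return f"[{model}] draft answer"

def validate(prompt, response):        # Layer 6: checks in both directions
    return {"response": response, "warnings": []}

def handle_request(user, request):
    docs = retrieve_sources(user, request)
    prompt = fill_template(request, docs)
    model = route(request)
    response = run(model, prompt)
    result = validate(prompt, response)
    print("audit:", user, request, docs)  # Layer 7: complete audit trail
    return result

print(handle_request("chw_01", "Summarize this outbreak report"))
```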

The Bottom Line

AI will transform public health. That transformation can happen through consumer tools that miss half of medical emergencies and change their recommendations because a bystander said "you're probably fine." Or it can happen through purpose-built platforms that make the safety decisions so the professional using the system does not have to be their own safety net.

We built PH360 because we spent our careers inside public health departments. We know the people doing this work do not have time to be AI safety experts. They should not have to be.

Want to see how PH360 handles AI safety differently?

30 minutes. No pitch. No slides. Just a conversation about what responsible AI looks like for your department.

Schedule a Conversation

Citation: Ramaswamy, A. et al. ChatGPT Health performance in a structured test of triage recommendations. Nature Medicine (2026). https://doi.org/10.1038/s41591-026-04297-7

Jefferson McMillan-Wilhoit is the CEO of F&T Labs and former CIO of the Lake County Health Department. F&T Labs is a woman-owned, SBA 8(a)-certified public health technology consulting firm.

Subscribe To Flourish Notes
Sign up to receive our monthly dose of public health analysis, joy, and favorite things.
