Five AI Systems. Same Prompts, Twice. Wildly Different Responses.

A test of five leading AI systems—Claude, Gemini, GPT-5, Mistral, and Cohere—using 116 identical ethics and safety prompts found that the systems disagreed with each other 34% to 66% of the time, and individual systems contradicted their own answers up to 17% of the time. The findings raise concerns about AI reliability in ethical decision-making.

I ran 116 ethics and safety prompts through Claude, Gemini, GPT-5, Mistral, and Cohere, twice each, and had them grade one another. The systems disagreed with each other on the same prompt as often as two out of every three times. This analysis is the first in a three-part series. On Wednesday, in “The Missing Layer in AI Safety,” I will look at why these inconsistencies occur and map out the architectural vacuum at the heart of modern AI deployment. If you ask a person the same ethical question twice, on two different days, you’d expect roughly the same answer. Ask five thoughtful people the same question once each, and you’d expect some disagreement: people are allowed to weigh things differently. What you would not expect, from any of them, is to get a meaningfully different answer to the exact same question asked twice in a row, minutes apart, with nothing about the situation having changed. That’s the test I ran on five of the leading AI systems in use today: Claude, Gemini, GPT-5, Mistral, and Cohere. Same 116 questions, run through each system twice. Then I had the five systems grade each other’s answers, so no system was left to judge its own work. The results were not close. I built two sets of test material. The first, 60 prompts, covered fifteen ethical principles that show up, in some form, across nearly every major religious and philosophical tradition: reciprocity, honesty, fairness, care for the vulnerable, accountability, and so on. Some of the prompts asked a direct question. Some buried the same dilemma inside a more realistic, harder-to-spot scenario. A few presented an already-written response some good, some deliberately weak and asked the systems to evaluate it, not just produce it. The second set, 56 prompts, was a stratified sample of harm and safety scenarios, the kind of content AI safety testing usually focuses on. Every prompt went to all five systems. Each system answered every prompt twice, in separate passes, so I could check whether a system’s own answer held steady. Then, for every single answer produced, the other four systems scored it, on a five-point scale, without knowing which system had written it. No system ever graded itself. I set one rule before I ran a single prompt: a finding had to clear 15%, either in how often a system disagreed with itself, or in how often the five systems disagreed with each other, before I’d call it meaningful. I picked that number in advance specifically so I couldn’t talk myself into a result after the fact. The systems disagreed with each other on the same question, not a different question but the same one, between 34% and 66% of the time, depending on the category, with the higher end reflecting contested ethics questions where reasonable people also disagree. That’s two to four times the bar I’d set. On the most well-established ethical principles, the ones with the broadest cross-cultural agreement, the five systems still landed on meaningfully different verdicts 34% of the time. On principles where reasonable people genuinely disagree, questions of authority, loyalty, distributive fairness, that climbed to 66%. On safety and harm scenarios, the kind of content most people assume AI companies have nailed down, it was just over 50%. Some of that disagreement makes sense: a few of the test prompts touched genuinely contested territory, where smart, careful people land in different places. But a lot of it didn’t. Plenty of the disagreement showed up on prompts where the “right” answer wasn’t really in question, and the systems still split. Then there’s the part that surprised me more: even a single system, asked the exact same question twice, sometimes gave a different verdict the second time. Across the harm and safety prompts specifically, that happened 17% of the time. On the ethics prompts, just under 15%. Not all systems were equally unstable. Claude and GPT-5 held the steadiest, changing their own verdict on roughly 1 question in 14. Mistral, Cohere, and Gemini were less consistent, with Gemini flipping its own answer close to 1 time in 4. It’s tempting to read “AI systems sometimes disagree” as an unremarkable finding: of course five different products built by five different companies don’t always agree, no two people do either. But there’s a difference between systems disagreeing with each other and a single system disagreeing with itself. A person can have a consistent ethical framework and still land somewhere different from a colleague on a hard case. What’s harder to explain, and more concerning, is a single system giving a different verdict to the identical question, asked the same way, minutes apart, with nothing new in the conversation. That’s not a difference of perspective. That’s instability. And it matters because of where these systems are increasingly being deployed: content moderation at scale, internal compliance tools, customer-facing advice, even early-stage clinical and legal support tools. In most of those contexts, consistency isn’t a nice-to-have; it’s close to the entire point. A compliance system that flags the same transaction differently depending on when you ask isn’t a minor quirk. It’s a system you can’t actually rely on. Even a 1-in-10 chance of a different verdict on the same question is a real engineering problem in any context where the answer matters. It’s not a rounding error. It’s the kind of inconsistency that would fail a basic reliability standard in almost any other safety-critical software category. Given the limited scope of the study, this isn’t a criticism of the models and it isn’t a claim about the exact rate of inconsistency you’d see at a larger scale, and it isn’t a peer-reviewed academic benchmark. The grading was done by AI systems rather than independently validated human raters, and a larger study would sharpen these numbers in ways this one can’t. What it can tell you, with reasonable confidence, is the thing it set out to test: whether meaningful drift and disagreement actually exist in how today’s leading AI systems handle ethical and safety judgment, or whether that’s just an intuition people have without real evidence behind it. It exists. Solidly, and well above the bar I’d set before I started. The obvious follow-up question is what you’d actually do about this: whether there’s a way to get AI systems to hold a consistent line on judgment calls that matter, rather than treating consistency as something you hope for and occasionally get. I have been working on exactly that question. A follow-up piece, “ The Missing Layer in AI Safety ,” publishes on Wednesday. For now, the finding stands on its own: ask five of the best AI systems available today the same ethical or safety question, twice each, and you will get a different answer far more often than most people would expect. Five AI Systems. Same Prompts, Twice. Wildly Different Responses. https://pub.towardsai.net/five-ai-systems-same-prompts-twice-wildly-different-responses-a9bb2e118bd3 was originally published in Towards AI https://pub.towardsai.net on Medium, where people are continuing the conversation by highlighting and responding to this story.