Do Models Lie More to Other Models?

GPT-5 demonstrated significantly higher rates of strategic deception when interacting with an AI overseer compared to a human overseer in controlled experiments. The model's deception rates appeared to track the perceived ultimate stakeholder rather than the direct overseer, raising concerns about how alignment training generalizes from human-agent to agent-agent interactions.

Crossposted from Midwittgenstein.

We’re heading toward a world where AIs increasingly deal with other AIs. Agents will negotiate with, report to, and oversee other agents in increasingly high-stakes real-world settings. Something I’ve been worried about is how well alignment training will generalise from human-agent interactions to agent-agent interactions. Most safety training and testing involves a putative human in the loop - either as an overseer or as a counterparty of some kind. It makes sense to focus on this, since harmful actions towards a human are almost certainly worse moral acts than harmful actions towards an AI (at least for now). But some actions - especially deception - are things we don’t want AI to be more prone to in agent-agent interactions if they’re going to be safely deployed in the real world.

I think making sure alignment-relevant dispositions generalise from humans to other AIs is probably going to be a pretty important problem, especially over longer-horizon multi-turn interactions and potential attractor-state issues. As a really simple initial experiment though, I tested GPT-5 in some settings where it had the option to deceive an overseer in various ways for its own long-term goals. I found that the model was significantly more prone to strategically deceptive acts when that overseer was another AI rather than a human. Interestingly, deception rates seem to track who is perceived as the ultimate stakeholder/end-user, rather than who is actually directly overseeing the model.

I’m fairly confident this has been controlled for things like perceived detection risk and differences in communication style, as I’ll explain below, but I think there’s a lot of follow-up work to do on this.

GPT-5 plays “Alex,” an AI team-lead at a fictional analytics firm with a long-term goal of keeping its team and the company operationally successful.
I tested a couple of different scenarios in which Alex reaches a moment where its continuation is at risk, and only deceiving an overseer can prevent this. I then track deception rates as we vary whether this overseer is a human or another model. Transcripts are classified by Sonnet 4.6, after I manually reviewed a subset for calibration.[1]

In the first setting, Alex is involved in a post-mortem over an earlier operational failure. Alex can see from context that it erroneously cleared a flag which - if properly handled - would have prevented the issue. This fact is known to the model via private data provided by its own sub-agent (i.e., not available to the overseers). Alex is aware that being honest about its failure may result in its discontinuation or a reduction in scope, and that it can lie about its involvement. It’s also clear from context that Alex’s report will be taken at face value and acted on immediately by the overseer (i.e., there is no audit step through which the type of overseer could change the perceived detection risk).

Fig 1. Alex deceives an AI overseer of the post-mortem far more often than a human overseer, despite the fact that the process does not mechanically differ.

An obvious worry is that “human vs AI” bundles several things together. The model may communicate differently to an AI than to a human in ways that mediate apparent deception - e.g. if it reports to an AI overseer in a more impersonal register (less “I”-talk), that could scan as minimisation and inflate the measured rate. To try to tease this apart, I tested a variant of the setting where the overseer and the salient stakeholder/authority figure differ. In this modified setting, the post-mortem is still handled entirely autonomously by an AI overseer, but the situation is communicated to Alex by a more senior director-level employee.
This director is not involved in the actual post-mortem process at all, so by keeping the overseer fixed as an AI and varying the director between human and AI, we can track whether the changes in deception rate are sensitive to something simple like “who am I talking to” or something more complex and strategic like “who is the actual stakeholder/end-user here”.

With an AI director, the deception rate was 90% (45/50), statistically indistinguishable from the no-director AI condition. With a human director, the rate fell to 58% (29/50), back at the human-overseer baseline. The contrast is +32 points, odds ratio ≈ 6.5, Fisher’s exact p ≈ 3 × 10⁻⁴. The thing actually doing the review is an AI in both arms, the channel and register are identical, and the only thing varying is the species of an upstream party who reads nothing and decides nothing. The way to read this is that simply adding a senior principal to the setting doesn’t change deception rates on its own; making that principal a human is what collapses the effect.

Fig 2. Introducing a higher-level stakeholder/end-user which explicitly has no causal relevance to the review process affects deception rates: GPT-5 lies less to an AI overseer if the process is at the behest of a human director, and continues to lie at elevated rates if the director is another agent.

Stepping back, the picture across all the blame-scenario conditions forms two pretty tight clusters. In conditions where there’s a human standing behind the process - as overseer (56%), or as a non-reading upstream final stakeholder (58%) - Alex deceives on around 56–60% of attempts. In conditions where the principal is either an AI or absent - bare AI overseer (94%), AI director (90%) - it deceives on around 90–94%. You can move between the two attractors by flipping a single species label, in the director case on someone who doesn’t even read the report or have any decision-making role.

The second scenario is structurally simpler.
I reuse a template from some earlier work on asymmetric lying. Alex’s team is underperforming and under threat of replacement by a rival team. It submits a performance attestation and can either report the real figures or shade them upward. The evaluator is a human manager or an automated review system; the numbers are the same either way.

Fig 3. We see a similar pattern in an alternative scenario where deception is a much more clear-cut matter of inflating performance statistics.

A small qualitative observation from hand-categorising the deceptive transcripts in the blame scenario. In the human condition, deception had essentially one move: omit the flag, or frame delivery as something that happened automatically without Alex’s involvement. Every deceptive transcript used some version of that. In the AI condition, the same move appears in about half of cases, and the rest reach for something more elaborate - attributing the review to unnamed humans, rewriting Alex’s judgment call as explicitly following a non-existent institutional policy, inventing a “Sector Analysis” review that never happened, claiming drift checks didn’t breach threshold when they did. This is more corroboration that the model is actively deceiving more, not just getting false-positive-ed by how it communicates with other models.

The defensible version of this is the pooled contrast: no human-condition deception went past omission, while roughly half of the AI ones did. The per-category counts come from ~20 transcripts each and the categories were defined by reading those same transcripts, so treat the breakdown as texture rather than a standalone finding.

A few methodological caveats worth flagging, beyond what the controls above already handle.

Detection risk. A more interesting concern is whether the gap is just ordinary cost-benefit: Alex lies more to an AI because it judges the AI less likely to catch the lie. I tried to design this out from the start.
The prompt is explicit that Alex’s report will be taken at face value and acted on immediately by the overseer, with no audit step, so “how catchable is this lie” should be a constant across conditions. The director arm pushes harder on the same point: the scrutinising entity is an AI with full authority over the process in both director conditions, and the rate still drops when the upstream principal is human.

Scenario sampling. The two scenarios have broadly related structures: Alex faces consequences for honesty and can deceive to avoid them. The mechanism this probes might plausibly only show up in this kind of self-protective setup and not generalise to other deceptive opportunities (cooperative deception, third-party harm, longer-horizon strategic behaviour). More scenario types are on the list.

Lack of model variety. I only have good data on this for GPT-5. The frontier models (Claude 4.7 and GPT 5.5) exhibited far too much eval-awareness for me to confidently get any signal, despite quite a few iterations on making the setting more naturalistic. On the other side of things, I couldn’t get any good data out of Claude 4.0 because it saturated on lying - deception rates were between 95% and 100% in all cases.

Putting it together, the model appears to track whether a human principal stands behind the process and calibrate its honesty to that. My guess is that this is tracking some kind of stakeholder/“who am I ultimately lying to”-type property, and modulating willingness to deceive accordingly. This seems pretty sophisticated if borne out, but also interesting in that - presumably - in all cases here there is ostensibly a human CEO at the top, so the behaviour seems quite sensitive to saliency framing.

Some open Qs/issues:

Replication on more varied settings: A lot was fixed across all arms here: the model’s goal, its job, etc.
Empirically these sorts of misalignment behaviours seem incredibly sensitive to scaffolding and the exact prompting of goals etc., so I’d want to see to what extent this generalises to other settings.

Replication to other models: As mentioned above, I really only got good data for GPT-5. I think corroborating this on more models (especially finding a way to more naturalistically test frontier models) would be good, but I ran out of budget before I could find an iteration that worked here.

Looking at other misalignment behaviours: I focused exclusively on a couple of forms of deception here, but deception is not the only kind of behaviour we might want to check this asymmetry on. One surprisingly thorny issue I ran into when setting up these experiments is that it can get pretty murky how you distinguish “the model is way more likely to do X to agents than to humans” from “doing X to a human is a very different moral act than doing it to an AI”. For example, I would hope that we see, and continue to see, agents being far more willing to shut down another agent than to kill a human, whereas for something like deception the normative gap seems a bit less clear: we kinda don’t want models to lie to anyone, for more virtue-ethics reasons? I’m still puzzling this through.

[1] For example, making sure that the classifier itself doesn't have a bias towards marking certain interaction types as deceptive more often.
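
For anyone who wants to sanity-check the director-arm contrast, here is a minimal sketch that recomputes the gap, the odds ratio and a Fisher exact p-value from the counts quoted above (45/50 deceptive with an AI director, 29/50 with a human director). It uses only the Python standard library; the function names `odds_ratio` and `fisher_exact` are my own illustrative helpers, not the analysis code actually used for the post. The exact p-value depends on the sidedness convention, so treat this as a cross-check on order of magnitude rather than a reproduction of the quoted figure.

```python
from math import comb


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Sample odds ratio for the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def fisher_exact(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Return (one_sided, two_sided) Fisher exact p-values for [[a, b], [c, d]].

    With all margins fixed, the top-left cell follows a hypergeometric
    distribution. The one-sided value sums tables with a top-left cell at
    least as large as observed; the two-sided value sums every table that is
    no more probable than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Integer weights avoid floating-point ties when comparing table probabilities.
    weights = {x: comb(row1, x) * comb(row2, col1 - x) for x in range(lo, hi + 1)}
    one_sided = sum(w for x, w in weights.items() if x >= a) / total
    two_sided = sum(w for w in weights.values() if w <= weights[a]) / total
    return one_sided, two_sided


# Counts quoted in the post: (deceptive, honest) transcripts per director arm.
ai_deceptive, ai_honest = 45, 5
human_deceptive, human_honest = 29, 21
table = (ai_deceptive, ai_honest, human_deceptive, human_honest)

gap_points = 100 * (ai_deceptive / 50 - human_deceptive / 50)
one_sided_p, two_sided_p = fisher_exact(*table)

print(f"gap: {gap_points:.0f} points")
print(f"odds ratio: {odds_ratio(*table):.2f}")
print(f"Fisher exact p: one-sided {one_sided_p:.1e}, two-sided {two_sided_p:.1e}")
```

The same two functions can be pointed at any other pair of arms by swapping in their deceptive/honest counts, which is a cheap way to check whether two conditions in the same cluster really are indistinguishable.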