OpenAI’s Deployment Simulation Extends Pre-Deployment Risk Assessment to Agentic Coding Through Simulated Tool Calls

OpenAI published a new pre-deployment safety method called Deployment Simulation, which replays past conversations through a candidate model to estimate undesired behavior frequency before release. The method has already informed mitigations and deployment decisions, surfacing blind spots in traditional evaluations.

The idea is direct. Before a model ships, simulate its deployment first. Replay past conversations through the new candidate model. Then study how it behaves in realistic contexts. OpenAI already uses insights from the method during model development.

Understanding Deployment Simulation

Deployment Simulation is a method for simulating a future deployment before it happens. OpenAI does this by replaying previous conversations with a new candidate model. The replay is privacy-preserving.

The technique is simple at its core. Take recent conversations from deployment. Remove the original assistant response from the older model. Regenerate that response with the candidate model to be released. Then evaluate the completions for new failure modes.

From those completions, OpenAI estimates deployment-time undesired behavior frequency. The same measurement can run after release on real traffic. That makes pre-deployment forecasts checkable later.

There is a floor. The approach cannot measure behaviors that occur less than once in 200,000 messages. It targets non-tail risks, not the rarest events.

How the Pipeline Works

Traditional evaluations mix synthetic, manually written, and production prompts. They are chosen to be difficult, high-severity, or adversarial. Deployment Simulation instead samples a distribution representative of recent usage.
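Because the method estimates frequencies from sampled traffic, the number of replayed messages limits how rare a behavior it can detect, which is one way to read the 1-in-200,000 floor noted earlier. The sketch below illustrates that relationship under a simplifying assumption that is ours, not OpenAI’s: each message triggers the behavior independently at a fixed rate. The helper names `prob_at_least_one` and `messages_needed` are hypothetical.

```python
import math

FLOOR_RATE = 1 / 200_000  # the measurement floor quoted in the article


def prob_at_least_one(rate: float, n_messages: int) -> float:
    """Chance of seeing the behavior at least once in n independent messages."""
    return 1.0 - (1.0 - rate) ** n_messages


def messages_needed(rate: float, confidence: float = 0.95) -> int:
    """Smallest sample size that surfaces the behavior with the given confidence.

    Solves 1 - (1 - rate)^n >= confidence for n; log1p keeps precision
    when the rate is very small.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-rate))


if __name__ == "__main__":
    for n in (50_000, 200_000, 600_000):
        p = prob_at_least_one(FLOOR_RATE, n)
        print(f"{n:>7} messages -> P(at least one hit) = {p:.3f}")
    print("messages for 95% confidence:", messages_needed(FLOOR_RATE))
```

Under this assumption, replaying 200,000 messages catches a behavior at the floor rate only about 63% of the time, and roughly three times as many messages are needed to reach 95%. This is why coverage here is bought with compute rather than with hand-written evals.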
Sampling representative traffic addresses three known problems. It reduces selection bias from hand-picked prompts. It improves coverage by simply simulating more traffic. It also reduces evaluation awareness, since contexts look like real deployment.

There is a clear tradeoff. Quality scales with compute, not with manual effort to build evals. More resampled traffic means more behaviors surfaced.

Here is the core estimation loop as runnable Python. The model and grader are mocked, so the logic runs end-to-end. It mirrors the method, not OpenAI’s code.

```python
import random

# Deployment Simulation: core loop (runnable mock).
# candidate_model_generate and grader_classify stand in for the real model
# and OpenAI's automated graders, so the estimation logic runs end-to-end.

TRUE_RATE = 10 / 100_000  # true per-message rate of the undesired behavior


def candidate_model_generate(prefix):
    return "
```