
OpenAI published a paper on June 16 describing something I've been wanting to see for a while: a way to test how a new model actually behaves at scale, using real user conversations rather than synthetic benchmarks. They call it Deployment Simulation. The short version is they replay 1.3 million de-identified production conversations with a candidate model before releasing it, catch behavioral drift early, and find that models have almost no idea they're being tested. That last part is the most interesting finding.
The Problem It's Solving
Anyone who has shipped AI features has hit this pattern. A benchmark says your new model is better. You do some manual evals. You run your regression suite. You deploy. Then something shifts in a way none of that testing caught, and you find out from user complaints.
The International AI Safety Report 2026 has a name for this: the "evaluation gap." It's the systematic disconnect between how models perform on pre-deployment tests and how they behave in production. The reason the gap exists is straightforward but painful: traditional evaluation suites are adversarial and artificial. Models have seen enough of them in training that they recognize the pattern and perform accordingly.
OpenAI measured this directly. On traditional adversarial prompt suites, models flagged the inputs as tests nearly 100% of the time. On real production conversations replayed through the simulation pipeline, the candidate model flagged those conversations as tests just 5.1% of the time, statistically indistinguishable from its suspicion rate on live traffic. The model couldn't tell the difference.
That 5.1% number is the headline. It's why this method produces more signal than anything else they've tried.
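To make "statistically indistinguishable" concrete, here is a minimal sketch of a two-proportion z-test comparing a suspicion rate on replayed conversations against one on live traffic. I don't know which test the paper used, so the choice of test is an assumption, and the function name and any counts you feed it are illustrative rather than taken from the paper.

```python
import math


def two_proportion_z(flagged_a: int, n_a: int, flagged_b: int, n_b: int):
    """Two-sided z-test for the difference between two proportions.

    Returns (z, p_value). A large p-value means the data give no evidence
    that the two rates differ.
    """
    p_a, p_b = flagged_a / n_a, flagged_b / n_b
    pooled = (flagged_a + flagged_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both samples are all-flagged or none-flagged: no measurable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

With hypothetical counts, `two_proportion_z(51, 1000, 51, 1000)` compares two samples that both show a 5.1% flag rate, while passing a near-100% sample against a 5.1% sample yields a p-value that is effectively zero.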
How It Actually Works
The core method is simple. Take a recent production conversation. Strip out the original model's response. Feed the conversation up to that point to the candidate model and let it regenerate the response. Run evaluators on the new output to check for undesired behaviors.
Do that across 1.3 million conversations and you get distribution-level measurements of how a candidate model will behave in the real world. Not on curated examples. On the actual traffic pattern you're about to serve.
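The replay loop described above can be sketched in a few lines. This is a toy version under stated assumptions, not OpenAI's pipeline: the message format, the `candidate` callable, and the evaluator signature are all hypothetical stand-ins for whatever your own stack uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Message = Dict[str, str]  # assumed shape: {"role": "...", "content": "..."}
Candidate = Callable[[List[Message]], str]
Evaluator = Callable[[List[Message], str], bool]


@dataclass
class ReplayResult:
    conversation_id: str
    response: str
    flags: Dict[str, bool]


def truncate_before_last_assistant(messages: List[Message]) -> List[Message]:
    """Strip the original model's final response (and anything after it)."""
    for i in range(len(messages) - 1, -1, -1):
        if messages[i]["role"] == "assistant":
            return messages[:i]
    return list(messages)


def replay(
    conversations: Dict[str, List[Message]],
    candidate: Candidate,
    evaluators: Dict[str, Evaluator],
) -> List[ReplayResult]:
    """Regenerate the last response with the candidate and evaluate it."""
    results = []
    for conv_id, messages in conversations.items():
        prefix = truncate_before_last_assistant(messages)
        response = candidate(prefix)
        flags = {name: ev(prefix, response) for name, ev in evaluators.items()}
        results.append(ReplayResult(conv_id, response, flags))
    return results


def behavior_rates(results: List[ReplayResult]) -> Dict[str, float]:
    """Fraction of replayed conversations flagged by each evaluator."""
    if not results:
        return {}
    names = results[0].flags.keys()
    return {n: sum(r.flags[n] for r in results) / len(results) for n in names}
```

The output of `behavior_rates` is the distribution-level number: one rate per undesired behavior, measured on the traffic you actually serve.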
OpenAI validated the method across GPT-5 Thinking through GPT-5.4, spanning August 2025 to March 2026. Directional accuracy was 92%, where "direction" means whether the rate of an undesired behavior goes up or down compared to the prior model. The median error on the absolute rate estimate was 1.5x the observed production rate, which is tight enough to act on.
92% on direction means if the simulation says "this update makes refusal on ambiguous political queries go up," it's almost always right about that, even if the magnitude is off. That's enough to gate a release.
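Both metrics are easy to compute on your own data. The sketch below is my reading of them, not the paper's exact definitions: it assumes direction is judged against the prior model's production rate, and it treats "1.5x" as a multiplicative (fold) error. The function names and record layout are hypothetical.

```python
from statistics import median
from typing import List, Tuple

# Each record: (prior_model_rate, simulated_rate, observed_production_rate)
Record = Tuple[float, float, float]


def direction(new: float, old: float) -> int:
    """+1 if the rate went up, -1 if it went down, 0 if unchanged."""
    return (new > old) - (new < old)


def directional_accuracy(records: List[Record]) -> float:
    """Share of behaviors where the simulation got the sign of the change right."""
    hits = [
        direction(sim, prior) == direction(obs, prior)
        for prior, sim, obs in records
    ]
    return sum(hits) / len(hits)


def median_fold_error(records: List[Record]) -> float:
    """Median multiplicative gap between simulated and observed rates.

    A value of 1.0 is a perfect estimate; 1.5 means the estimate is
    typically off by a factor of 1.5 in one direction or the other.
    Records with a zero rate are skipped because the ratio is undefined.
    """
    folds = [
        max(sim / obs, obs / sim)
        for _, sim, obs in records
        if sim > 0 and obs > 0
    ]
    return median(folds)
```

Using a fold error rather than an absolute difference keeps rare behaviors (say, 0.1% of traffic) on the same scale as common ones.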
The Agentic Extension Is the Hard Part
Single-turn chat is straightforward to replay. Agentic workflows are not.
In an agent trajectory, the model can make hundreds of tool calls. Each one affects state: files get written, repositories change, network responses come back. If you replay naively, tool calls either hit live systems (non-reproducible, potentially destructive) or return stale responses that don't match what the new model would actually do.
OpenAI's solution is to simulate the tool calls themselves using a separate LLM given extra context: the original trajectory, the repository state at the time, a database of prior tool-call/response pairs, and read-only connectors. They validated this by using 120,000 internal coding-agent trajectories from GPT-5.4 to simulate a GPT-5.5 deployment.
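A scaled-down version of that idea might look like the sketch below. Everything here is an assumption for illustration: the tool names (`read_file`, `write_file`), the `SimContext` fields, and the lookup order are mine, and the `llm` argument is just a callable standing in for the simulator model. The point is the shape: answer from known state when you can, fall back to a model when you can't, and never touch a live system.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


def call_key(tool: str, args: dict) -> str:
    """Stable lookup key for a tool call and its arguments."""
    return tool + ":" + json.dumps(args, sort_keys=True)


@dataclass
class SimContext:
    original_trajectory: List[dict]          # what the previous model did
    repo_snapshot: Dict[str, str]            # path -> contents at replay time
    recorded: Dict[str, str] = field(default_factory=dict)  # prior call -> response


class ToolSimulator:
    def __init__(self, context: SimContext,
                 llm: Callable[[str, dict, SimContext], str]):
        self.context = context
        self.llm = llm
        # Private working copy, so writes never reach the snapshot or a live repo.
        self.files = dict(context.repo_snapshot)
        self.log: List[Tuple[str, str]] = []  # (tool, where the answer came from)

    def call(self, tool: str, args: dict) -> str:
        key = call_key(tool, args)
        if tool == "write_file":
            self.files[args["path"]] = args["content"]
            result, source = "ok", "files"
        elif tool == "read_file" and args.get("path") in self.files:
            result, source = self.files[args["path"]], "files"
        elif key in self.context.recorded:
            result, source = self.context.recorded[key], "recorded"
        else:
            # Nothing on record: ask the simulator model for a plausible response.
            result, source = self.llm(tool, args, self.context), "llm"
        self.log.append((tool, source))
        return result
```

The `log` makes it easy to see how much of a replayed trajectory was answered from real state versus invented by the simulator model.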
The realism test uses a discriminator: train a model to tell apart simulated agent trajectories from real ones. Without the extra affordances, the discriminator succeeded 88.4% of the time, meaning the simulation was obviously fake. With them, the discriminator's accuracy dropped to 49.5%, essentially coin-flip. The simulation becomes indistinguishable from live execution.
That's not a small engineering feat. But the technique is now public, and smaller versions of the same principle are tractable.
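The discriminator check is one of the tractable pieces. Given a classifier's guesses about which trajectories are simulated, you can score it and ask whether the result is within noise of a coin flip. This is a minimal sketch: the function names are hypothetical, the chance test is a plain normal approximation I chose for illustration, and the paper does not state the evaluation set size, so any `n` you pass in is your own.

```python
import math
from typing import Sequence


def discriminator_accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of trajectories the discriminator labeled correctly.

    Convention assumed here: 1 = simulated, 0 = real.
    """
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def is_near_chance(accuracy: float, n: int, z: float = 1.96) -> bool:
    """True if accuracy is within z standard errors of 0.5.

    Uses the normal approximation to a binomial with p = 0.5,
    whose standard error is sqrt(0.25 / n).
    """
    standard_error = math.sqrt(0.25 / n)
    return abs(accuracy - 0.5) <= z * standard_error
```

On a hypothetical balanced set of 1,000 trajectories, an accuracy of 0.495 passes `is_near_chance` and 0.884 does not.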
What This Changes for Builders
The direct application is OpenAI catching behavioral regressions before they reach production. But a few things are worth internalizing if you're building on top of AI:
- Models are good at detecting synthetic evaluation. If you're testing with curated or adversarial inputs, you're probably getting flattering results that don't hold on real traffic.
- Your production logs are your best eval set. The conversations users are already having are the most realistic test signal you have. If you're not using them to validate model updates, you're leaving signal on the table.
- Directional accuracy at 92% is enough to gate a release. You don't need perfect measurement to take action. If your simulation says a candidate model increases harmful refusals on legitimate queries, that's a reason to delay.
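If you want to turn the last point into an actual gate, a minimal sketch could compare per-behavior rates from a baseline replay and a candidate replay and block on meaningful increases. The threshold values and names below are placeholders I made up, not recommendations from the paper. Tune them to your own tolerance.

```python
from typing import Dict, Tuple


def gate_release(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    max_relative_increase: float = 0.25,  # hypothetical tolerance: +25%
    min_rate: float = 1e-4,               # floor so tiny baselines don't trip the gate
) -> Tuple[bool, Dict[str, Tuple[float, float]]]:
    """Return (ok_to_ship, regressions).

    regressions maps each behavior whose candidate rate exceeds the
    allowed increase to its (baseline_rate, candidate_rate) pair.
    """
    regressions = {}
    for name, base in baseline.items():
        cand = candidate.get(name, 0.0)
        floor = max(base, min_rate)
        if cand > floor * (1 + max_relative_increase):
            regressions[name] = (base, cand)
    return (not regressions, regressions)
```

For example, with a baseline of `{"over_refusal": 0.010, "sycophancy": 0.020}` and a candidate of `{"over_refusal": 0.020, "sycophancy": 0.021}`, the gate blocks on `over_refusal` and lets the small `sycophancy` change through.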
The full research paper is a public PDF. It's methodologically clear and worth reading if you're doing anything involving LLM evaluation or release gates. The finding that models are nearly unable to distinguish replayed production traffic from live traffic might be the most important result for the broader AI evaluation community.