OpenAI published a paper on June 16 describing something I've been wanting to see for a while: a way to test how a new model actually behaves at scale, using real user conversations rather than synthetic benchmarks. They call it Deployment Simulation. The short version is they replay 1.3 million de-identified production conversations with a candidate model before releasing it, catch behavioral drift early, and find that models have almost no idea they're being tested. That last part is the most interesting finding.

The Problem It's Solving

Anyone who has shipped AI features has hit this pattern. A benchmark says your new model is better. You do some manual evals. You run your regression suite. You deploy. Then something shifts in a way none of that testing caught, and you find out from user complaints. The International AI Safety Report 2026 has a name for this: the "evaluation gap." It's the systematic disconnect between how models perform on pre-deploym...
I've been defaulting to Opus-tier or GPT-5.5 for anything agent-related because that felt like the safe call. Better reasoning, better tool use, better outcomes. Flash-tier models were for batch jobs, summaries, things where you didn't care that much about output quality. That calculus broke for me after spending time with the Gemini 3.5 Flash benchmarks. The model went GA on May 19 at Google I/O. The number that got my attention: 83.6% on MCP Atlas, a benchmark specifically for multi-step tool orchestration using Model Context Protocol servers. That puts it 8.3 points ahead of GPT-5.5 (75.3%) and 4.5 points ahead of Claude Opus 4.7 on the same eval. "Flash" doesn't mean what it used to.

What MCP Atlas Is Actually Measuring

MCP Atlas tests whether a model can chain together multiple tool calls across MCP servers, recover from partial failures, and complete multi-step tasks without going off-script. It's not a writing or reasoning benchmark. If you're ...