Posts

OpenAI's Deployment Simulation: Testing AI Behavior Against Real Traffic Before Release

OpenAI published a paper on June 16 describing something I've been wanting to see for a while: a way to test how a new model actually behaves at scale, using real user conversations rather than synthetic benchmarks. They call it Deployment Simulation. The short version is they replay 1.3 million de-identified production conversations with a candidate model before releasing it, catch behavioral drift early, and find that models have almost no idea they're being tested. That last part is the most interesting finding.

The Problem It's Solving

Anyone who has shipped AI features has hit this pattern. A benchmark says your new model is better. You do some manual evals. You run your regression suite. You deploy. Then something shifts in a way none of that testing caught, and you find out from user complaints. The International AI Safety Report 2026 has a name for this: the "evaluation gap." It's the systematic disconnect between how models perform on pre-deploym...

Recent posts

Gemini 3.5 Flash and the End of 'Use the Biggest Model' for Agents

I've been defaulting to Opus-tier or GPT-5.5 for anything agent-related because that felt like the safe call. Better reasoning, better tool use, better outcomes. Flash-tier models were for batch jobs, summaries, things where you didn't care that much about output quality. That calculus broke for me after spending time with the Gemini 3.5 Flash benchmarks. The model went GA on May 19 at Google I/O. The number that got my attention: 83.6% on MCP Atlas, a benchmark specifically for multi-step tool orchestration using Model Context Protocol servers. That puts it 8.3 points ahead of GPT-5.5 (75.3%) and 4.5 points ahead of Claude Opus 4.7 on the same eval. "Flash" doesn't mean what it used to.

What MCP Atlas Is Actually Measuring

MCP Atlas tests whether a model can chain together multiple tool calls across MCP servers, recover from partial failures, and complete multi-step tasks without going off-script. It's not a writing or reasoning benchmark. If you're ...

Multi-Armed Bandits Are Not Smarter A/B Tests

Multi-armed bandits are an adaptive testing method that shifts traffic toward your best-performing variant as the test runs, rather than holding a fixed 50/50 split throughout. The idea is to minimize the cost of running a losing variant. The problem is that teams adopt them as an upgrade to A/B testing, and they're not: they're a different tool that trades statistical validity for short-term efficiency. If you're using MABs for product features, checkout flows, or anything you'll iterate on, you're probably getting cleaner-looking results that tell you less than you think.

The Core Tradeoff You're Actually Making

A/B tests assign traffic randomly. That randomness is the whole point. It's what lets you make causal claims. When you can say "I randomly assigned users to this variant, and they converted at a higher rate," you're not just observing a correlation. You've approximated a controlled experiment. MABs discard that guarantee. By ...
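
To make the traffic-shifting behavior concrete, here is a minimal sketch of one common bandit strategy, epsilon-greedy. The excerpt doesn't say which algorithm it has in mind, so the choice of epsilon-greedy, the function name `epsilon_greedy`, and the conversion rates are all assumptions made for illustration.

```python
import random


def epsilon_greedy(true_rates, rounds=5000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit; return (pulls, wins) per variant.

    true_rates are hypothetical conversion rates, unknown to the bandit.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    pulls = [0] * n_arms
    wins = [0] * n_arms
    for _ in range(rounds):
        if rng.random() < epsilon or 0 in pulls:
            # Explore: pick a variant uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: send the visitor to the variant that looks best so far.
            arm = max(range(n_arms), key=lambda i: wins[i] / pulls[i])
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
    return pulls, wins


if __name__ == "__main__":
    pulls, wins = epsilon_greedy([0.05, 0.20])
    print("traffic per variant:", pulls)
    print("conversions per variant:", wins)
```

Note that each visitor's assignment depends on the outcomes seen so far. That dependence is what moves traffic away from a fixed 50/50 split, and it is also why the allocation is no longer the plain random assignment an A/B test relies on.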

The Winner's Curse in A/B Testing: Why Your Biggest Lifts Are Probably Exaggerated

I've audited a lot of experimentation programs. The most common red flag isn't a low win rate. It's a suspiciously high one. If your team is consistently reporting 40%, 50%, or 60%+ win rates with lifts above 20% on your primary metric, something is probably wrong. Not "wrong" in the sense of fraud, but wrong in the statistical sense: you're almost certainly looking at the winner's curse.

What the Winner's Curse Actually Is

The winner's curse is not about bad luck. It's a mathematical outcome of running underpowered tests. Here's the mechanism: when a test is underpowered (say, 30% or 40% statistical power instead of the standard 80%), the test usually fails to detect a real effect. Most runs come back null. But occasionally, by chance, the noise in your data pushes the result over the significance threshold. When that happens, the observed lift is almost always an exaggeration of the true effect. The only way a small, underpowered tes...
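
The mechanism described above can be checked with a small simulation. This is a sketch under stated assumptions, not the author's analysis: the function name `simulate_winners_curse`, the 10% baseline conversion rate, the 10% true relative lift, and the sample size are all hypothetical, chosen so that a normal-approximation calculation puts power at roughly 30%. It runs many identical tests, keeps only the ones where treatment wins at z > 1.96, and averages the lift those "winners" report.

```python
import math
import random


def simulate_winners_curse(base_rate=0.10, true_lift=0.10, n_per_arm=4000,
                           runs=500, seed=0):
    """Return (share of runs that reach significance, mean observed lift among them)."""
    rng = random.Random(seed)
    treat_rate = base_rate * (1 + true_lift)
    significant_lifts = []
    for _ in range(runs):
        # Simulate conversions for one control arm and one treatment arm.
        c = sum(rng.random() < base_rate for _ in range(n_per_arm))
        t = sum(rng.random() < treat_rate for _ in range(n_per_arm))
        if c == 0:
            continue  # relative lift is undefined without control conversions
        p_c, p_t = c / n_per_arm, t / n_per_arm
        # Two-proportion z-test with a pooled standard error.
        pooled = (c + t) / (2 * n_per_arm)
        se = math.sqrt(2 * pooled * (1 - pooled) / n_per_arm)
        z = (p_t - p_c) / se
        if z > 1.96:  # treatment "wins" at the usual two-sided 5% threshold
            significant_lifts.append((p_t - p_c) / p_c)
    power = len(significant_lifts) / runs
    if significant_lifts:
        mean_lift = sum(significant_lifts) / len(significant_lifts)
    else:
        mean_lift = float("nan")
    return power, mean_lift
```

Calling `simulate_winners_curse()` returns the empirical power and the average lift among significant runs. With these parameters a run can only clear the threshold if its observed difference is larger than the true one, so the average reported lift among winners comes out above the 10% that was actually simulated.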

Best AEO Tools in 2026: Top 5 Answer Engine Optimization Platforms Compared

If your brand isn't showing up in ChatGPT, Perplexity, or Google AI Overviews, you're missing a fast-growing slice of product discovery. 37% of product discovery queries now start inside AI interfaces, not search engines. Answer Engine Optimization (AEO) is the practice of fixing that, and you need the right tools to track it, measure it, and improve it. I've gone through the options available in 2026 and narrowed it down to the five that actually deliver.

What to Look for in an AEO Tool

Before picking a tool, know what you actually need. The AEO tool market splits into two buyer types: teams extending an existing SEO platform (Ahrefs, Semrush, SE Ranking) and teams buying a dedicated AI visibility platform (Profound, Scrunch, Otterly.AI). The core capabilities to check:

Engine coverage: Which AI platforms does it monitor? ChatGPT, Perplexity, and Google AI Overviews are the minimum. Claude, Gemini, Copilot, and Grok are increasingly important.

Citation and mention t...

n8n MCP in 2026: Three Ways to Connect AI Agents to Your Workflows (Compared)

If you're building AI agent workflows, n8n is no longer just a "webhook plus HTTP node" automation tool. As of late 2025, it has native Model Context Protocol support on both ends: it can call external MCP servers and expose its own workflows as MCP tools. That changes how you think about connecting AI agents to automation. Here are the three distinct ways you can wire n8n and MCP together, and where each one actually fits.

Why MCP Matters for n8n Developers

MCP (Model Context Protocol), open-sourced by Anthropic in late 2024, became the de facto standard for AI-to-tool communication through 2025. The idea is simple: instead of hardcoding tool schemas into every AI app, you expose them through a standard JSON-RPC interface over SSE or streamable HTTP. Any MCP-compatible client (Claude, GPT-4o, Cursor, Windsurf) can discover and call those tools without custom integration code. n8n added two nodes that put it on both sides of this equation. The community announcement...

AEO Platform Breakdown: What Gets You Cited in ChatGPT vs Perplexity vs Google AI Overviews (2026)

Only 11% of domains cited by ChatGPT show up in Perplexity's answers too. That figure comes from an Averi analysis of 680 million AI citations published in March 2026. If you're running a single "AEO strategy" and calling it done, you're optimizing for one platform and leaving the other three on the table. I've been digging into this for client work at Publicis Sapient and the platform differences are bigger than most guides admit. Here's what each engine actually rewards.

Why Platform-Specific AEO Matters Now

Over 40% of search queries in 2026 go to AI assistants rather than traditional search engines. ChatGPT alone accounts for 87.4% of all AI referral traffic to brand websites. And 68% of consumers now start product research in ChatGPT or Perplexity before they visit a brand website at all. The problem is that these platforms don't pull from the same source pool. Each has a different retrieval architecture, different freshness requirements, and...