Posts

Showing posts from 2026

OpenCode Has More GitHub Stars Than Claude Code. Here's What You're Actually Trading.

OpenCode, the terminal coding agent from the SST team, just shipped v1.17.8 and has 176,000 GitHub stars as of June 2026. Claude Code sits at 132K. OpenCode also topped LogRocket's AI dev tool power rankings this month, displacing Cursor. That's a real market signal, not just GitHub vanity metrics. The core pitch: OpenCode is free, MIT-licensed, and works with 75+ AI providers. You pick the model. The agent is the constant; the intelligence behind it is a config option. What OpenCode Actually Is It's a Go-based CLI with a terminal UI, built by the SST team (the people behind SST and terminal.shop), and it runs on a client/server architecture. The TUI is just one client. The agent process runs on your machine and can be driven remotely from another client, including a desktop app for macOS and Windows. It ships with two built-in agents: a build agent with full filesystem and shell access, and a plan agent that's read-only for code exploration and analysis. An addit...

OpenAI's Deployment Simulation: Testing AI Behavior Against Real Traffic Before Release

OpenAI published a paper on June 16 describing something I've been wanting to see for a while: a way to test how a new model actually behaves at scale, using real user conversations rather than synthetic benchmarks. They call it Deployment Simulation. The short version is they replay 1.3 million de-identified production conversations with a candidate model before releasing it, catch behavioral drift early, and find that models have almost no idea they're being tested. That last part is the most interesting finding. The Problem It's Solving Anyone who has shipped AI features has hit this pattern. A benchmark says your new model is better. You do some manual evals. You run your regression suite. You deploy. Then something shifts in a way none of that testing caught, and you find out from user complaints. The International AI Safety Report 2026 has a name for this: the "evaluation gap." It's the systematic disconnect between how models perform on pre-deploym...

Gemini 3.5 Flash and the End of 'Use the Biggest Model' for Agents

I've been defaulting to Opus-tier or GPT-5.5 for anything agent-related because that felt like the safe call. Better reasoning, better tool use, better outcomes. Flash-tier models were for batch jobs, summaries, things where you didn't care that much about output quality. That calculus broke for me after spending time with the Gemini 3.5 Flash benchmarks. The model went GA on May 19 at Google I/O. The number that got my attention: 83.6% on MCP Atlas, a benchmark specifically for multi-step tool orchestration using Model Context Protocol servers. That puts it 8.3 points ahead of GPT-5.5 (75.3%) and 4.5 points ahead of Claude Opus 4.7 on the same eval. "Flash" doesn't mean what it used to. What MCP Atlas Is Actually Measuring MCP Atlas tests whether a model can chain together multiple tool calls across MCP servers, recover from partial failures, and complete multi-step tasks without going off-script. It's not a writing or reasoning benchmark. If you're ...

Claude Fable 5 Lasted Three Days. What That Means for How You Build.

Claude Fable 5 went live on June 9, 2026. By June 12, the US Department of Commerce had issued an export control directive ordering Anthropic to suspend access to it and Claude Mythos 5 for all foreign nationals. Three days. Anthropic couldn't cleanly separate domestic from foreign users in real time, so they pulled both models for everyone worldwide. As of today, both are still offline with no resolution timeline announced. This is the first time the US government has applied export controls directly to an AI model rather than the chips or hardware underneath it. If you had shipped anything on claude-fable-5 last week, you got a few hours of warning. I want to talk about what this means architecturally, because Anthropic actually shipped a useful resilience mechanism in Fable 5's API design. Most teams hadn't configured it. What Fable 5 Actually Shipped The capabilities are real. Fable 5 shares its weights with Mythos 5 (available only through Project Glasswing to ap...

Multi-Armed Bandits Are Not Smarter A/B Tests

Multi-armed bandits are an adaptive testing method that shifts traffic toward your best-performing variant as the test runs, rather than holding a fixed 50/50 split throughout. The idea is to minimize the cost of running a losing variant. The problem is that teams adopt them as an upgrade to A/B testing, and they're not: they're a different tool that trades statistical validity for short-term efficiency. If you're using MABs for product features, checkout flows, or anything you'll iterate on, you're probably getting cleaner-looking results that tell you less than you think. The Core Tradeoff You're Actually Making A/B tests assign traffic randomly. That randomness is the whole point. It's what lets you make causal claims. When you can say "I randomly assigned users to this variant, and they converted at a higher rate," you're not just observing a correlation. You've approximated a controlled experiment. MABs discard that guarantee. By ...
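To make the traffic-shifting behavior concrete, here is a minimal sketch using Thompson sampling, one common bandit policy. The excerpt does not say which policy it has in mind, so the choice of Thompson sampling, the function name `thompson_sampling`, and the conversion rates are all assumptions made for illustration.

```python
import random


def thompson_sampling(true_rates, rounds=10_000, seed=0):
    """Simulate a Bernoulli bandit and return how many visitors each arm received.

    true_rates are hypothetical conversion rates, unknown to the policy itself.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms

    for _ in range(rounds):
        # Draw one plausible conversion rate per arm from its Beta posterior
        # (uniform Beta(1, 1) prior), then send this visitor to the best draw.
        draws = [
            rng.betavariate(successes[i] + 1, failures[i] + 1)
            for i in range(n_arms)
        ]
        arm = draws.index(max(draws))
        pulls[arm] += 1

        # Observe a simulated conversion and update that arm's counts.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1

    return pulls


pulls = thompson_sampling(true_rates=[0.05, 0.10])
print("visitors per arm:", pulls)
```

The allocation ends up far from 50/50, which is the point: assignment now depends on earlier outcomes rather than on a fixed random split, and that dependence is what weakens the causal guarantee a plain A/B test gives you.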

The Winner's Curse in A/B Testing: Why Your Biggest Lifts Are Probably Exaggerated

I've audited a lot of experimentation programs. The most common red flag isn't a low win rate. It's a suspiciously high one. If your team is consistently reporting 40%, 50%, or 60%+ win rates with lifts above 20% on your primary metric, something is probably wrong. Not "wrong" in the sense of fraud, but wrong in the statistical sense: you're almost certainly looking at the winner's curse. What the Winner's Curse Actually Is The winner's curse is not about bad luck. It's a mathematical outcome of running underpowered tests. Here's the mechanism: when a test is underpowered (say, 30% or 40% statistical power instead of the standard 80%), the test usually fails to detect a real effect. Most runs come back null. But occasionally, by chance, the noise in your data pushes the result over the significance threshold. When that happens, the observed lift is almost always an exaggeration of the true effect. The only way a small, underpowered tes...
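The mechanism is easy to see in a small simulation. This is a sketch under simplifying assumptions: the observed lift is modeled as normally distributed around the true lift, only positive significant results count as "wins", and the specific numbers (a true lift of 2.0 points with a standard error of 1.39, giving roughly 30% power) are hypothetical values chosen to match the underpowered case described above.

```python
import random
import statistics


def simulate_winners_curse(true_lift, standard_error, runs=100_000,
                           z_crit=1.96, seed=0):
    """Return (power, exaggeration ratio) across many simulated tests.

    Each run draws an observed lift from Normal(true_lift, standard_error)
    and counts as a win only if its z-score clears z_crit.
    """
    rng = random.Random(seed)
    winning_lifts = []
    for _ in range(runs):
        observed = rng.gauss(true_lift, standard_error)
        if observed / standard_error > z_crit:
            winning_lifts.append(observed)

    power = len(winning_lifts) / runs
    # Average lift reported by the winners, relative to the true lift.
    exaggeration = statistics.mean(winning_lifts) / true_lift
    return power, exaggeration


power, exaggeration = simulate_winners_curse(true_lift=2.0, standard_error=1.39)
print(f"share of runs reaching significance: {power:.2f}")
print(f"average winning lift / true lift:    {exaggeration:.2f}")
```

Because only the runs where noise pushed the estimate upward clear the threshold, the average lift among the winners sits well above the true lift, even though every individual run was an honest measurement.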

Best AEO Tools in 2026: Top 5 Answer Engine Optimization Platforms Compared

If your brand isn't showing up in ChatGPT, Perplexity, or Google AI Overviews, you're missing a fast-growing slice of product discovery. 37% of product discovery queries now start inside AI interfaces, not search engines. Answer Engine Optimization (AEO) is the practice of fixing that, and you need the right tools to track it, measure it, and improve it. I've gone through the options available in 2026 and narrowed it down to the five that actually deliver. What to Look for in an AEO Tool Before picking a tool, know what you actually need. The AEO tool market splits into two buyer types: teams extending an existing SEO platform (Ahrefs, Semrush, SE Ranking) and teams buying a dedicated AI visibility platform (Profound, Scrunch, Otterly.AI). The core capabilities to check: Engine coverage: Which AI platforms does it monitor? ChatGPT, Perplexity, and Google AI Overviews are the minimum. Claude, Gemini, Copilot, and Grok are increasingly important. Citation and mention t...

n8n MCP in 2026: Three Ways to Connect AI Agents to Your Workflows (Compared)

If you're building AI agent workflows, n8n is no longer just a "webhook plus HTTP node" automation tool. As of late 2025, it has native Model Context Protocol support on both ends: it can call external MCP servers and expose its own workflows as MCP tools. That changes how you think about connecting AI agents to automation. Here are the three distinct ways you can wire n8n and MCP together, and where each one actually fits. Why MCP Matters for n8n Developers MCP (Model Context Protocol), open-sourced by Anthropic in late 2024, became the de facto standard for AI-to-tool communication through 2025. The idea is simple: instead of hardcoding tool schemas into every AI app, you expose them through a standard JSON-RPC interface over SSE or streamable HTTP. Any MCP-compatible client (Claude, GPT-4o, Cursor, Windsurf) can discover and call those tools without custom integration code. n8n added two nodes that put it on both sides of this equation. The community announcement...
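For a sense of what "a standard JSON-RPC interface" looks like on the wire, here is a rough sketch of the two messages a client sends to discover and then call a tool. The `tools/list` and `tools/call` method names come from the MCP specification; the helper `jsonrpc_request`, the tool name `send_invoice_reminder`, and its arguments are hypothetical, and the transport (SSE or streamable HTTP) is left out entirely.

```python
import json
from itertools import count

# JSON-RPC requests carry an id so responses can be matched to them.
_request_ids = count(1)


def jsonrpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request body as a plain dict."""
    message = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


# Step 1: ask the server which tools it exposes.
list_tools = jsonrpc_request("tools/list")

# Step 2: call one of them by name. The tool name and arguments here are
# made up; a real server advertises its own in the tools/list response.
call_tool = jsonrpc_request(
    "tools/call",
    {"name": "send_invoice_reminder", "arguments": {"invoice_id": "INV-1001"}},
)

print(json.dumps(list_tools))
print(json.dumps(call_tool, indent=2))
```

The client never needs a hardcoded schema for `send_invoice_reminder`: it learns the tool's name and input shape from the discovery response and builds the call from that.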

AEO Platform Breakdown: What Gets You Cited in ChatGPT vs Perplexity vs Google AI Overviews (2026)

Only 11% of domains cited by ChatGPT show up in Perplexity's answers too. That figure comes from an Averi analysis of 680 million AI citations published in March 2026. If you're running a single "AEO strategy" and calling it done, you're optimizing for one platform and leaving the other three on the table. I've been digging into this for client work at Publicis Sapient and the platform differences are bigger than most guides admit. Here's what each engine actually rewards. Why Platform-Specific AEO Matters Now Over 40% of search queries in 2026 go to AI assistants rather than traditional search engines. ChatGPT alone accounts for 87.4% of all AI referral traffic to brand websites. And 68% of consumers now start product research in ChatGPT or Perplexity before they visit a brand website at all. The problem is that these platforms don't pull from the same source pool. Each has a different retrieval architecture, different freshness requirements, and...

MCP Goes Stateless: Breaking Down the July 2026 Spec and What You Need to Change

The Model Context Protocol's next specification (release candidate locked May 21, 2026, final spec shipping July 28) is the most significant protocol revision since MCP launched. If you're running MCP servers in any production context, the headline change is architectural: the stateful session layer is gone. I've been tracking the SEPs (Specification Enhancement Proposals) that make up this release. Here's the breakdown of what's actually changing and what you need to do before July 28. The Big Change: MCP Is Now Stateless The current spec requires an initialize / initialized handshake and tracks sessions via Mcp-Session-Id. That means sticky routing: every request mid-session must hit the same server instance that handled the handshake. For anyone running more than one server instance behind a load balancer, this has meant either session affinity configs, shared session stores, or both. The July 2026 spec eliminates all of that. No session handshake. No s...

Sample Ratio Mismatch: Why One in Ten A/B Tests Is Lying to You

A few years back I was consulting for a retail brand running a product page test in Adobe Target. The variant had bolder CTAs and tighter copy. After two weeks, the numbers looked... fine. Flat. The control won by a hair and the team was ready to call it and move on. Something felt off. The control had 52,000 sessions. The variant had 46,000. We'd set it to a 50/50 split. That 6,000-session gap shouldn't exist in a balanced allocation. We ran a chi-squared test. p-value: below 0.0001. The test was broken. That was a sample ratio mismatch, and it had silently invalidated two weeks of data. What SRM Actually Is Sample ratio mismatch (SRM) happens when the observed visitor counts across variants don't match the ratio you configured. Set a 50/50 split and get 53/47 on 5,000 sessions? Might be noise. Get 53/47 on 100,000 sessions? Almost certainly not. Detection is a chi-squared goodness-of-fit test comparing observed counts against expected. Microsoft's ExP team uses a thr...
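The check from the anecdote can be sketched in a few lines with only the standard library. The function name `srm_check` is hypothetical, and the 0.001 alert threshold is an assumption picked for illustration, not a value taken from the post. The closed-form p-value used here is only valid for two variants (one degree of freedom).

```python
import math


def srm_check(observed, expected_ratios=(0.5, 0.5), threshold=0.001):
    """Chi-squared goodness-of-fit test for a two-variant split.

    Returns (chi2 statistic, p-value, flagged). The threshold is an
    illustrative default, not an official recommendation.
    """
    if len(observed) != 2 or len(expected_ratios) != 2:
        raise ValueError("this sketch only handles two variants")

    total = sum(observed)
    expected = [total * ratio for ratio in expected_ratios]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # With one degree of freedom, the chi-squared survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, p_value < threshold


# Session counts from the story: 52,000 control vs 46,000 variant.
chi2, p_value, flagged = srm_check([52_000, 46_000])
print(f"chi2 = {chi2:.2f}, p = {p_value:.3g}, SRM flagged: {flagged}")
```

For more than two variants the same statistic applies, but the p-value needs the general chi-squared distribution with (number of variants minus one) degrees of freedom, which a statistics library provides.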

Building Private AI: How to Keep Your Data Local with OpenClaw

Cloud AI means your data goes to cloud providers. What if it didn't have to? Last week, I watched a developer paste an entire customer database into ChatGPT to "analyze patterns." The data left their computer, went to OpenAI's servers, got processed, and theoretically got deleted. Theoretically. That's not acceptable for most businesses. The Problem With Cloud AI When you use ChatGPT, Claude, or any cloud API: your data leaves your control; it gets transmitted over the internet; a third-party company stores and processes it; they might train on it (check the terms); it's subject to their privacy policies and government data requests; you lose all compliance guarantees. For casual use? Maybe fine. For healthcare, finance, legal, or sensitive business data? Absolutely not. Why Private AI is Actually Better Local AI isn't a step backward. It's a step forward. Security Your data never leaves your servers. Period. No internet tr...

Building Trustworthy AI: Beyond Benchmarks

Last month I was evaluating three frontier models for a client workflow at Publicis Sapient. One of them scored highest on every benchmark we checked. It was also the one that fell apart in production within two weeks. That experience pushed me to write this down, because I think the industry has a benchmark problem it isn't talking about honestly enough. Benchmarks Are Saturated and Getting Gamed MMLU and MMLU-Pro, two of the most cited evaluation benchmarks, are now functionally saturated above 88% for frontier models. The score differences between the top models are statistically meaningless at that level. Meanwhile, data contamination and annotation error rates above 50% undermine what these scores even measure in the first place. It gets worse. Most teams building internal benchmarks overestimate how well their models perform by 30% or more, because they test on clean inputs, cooperative conditions, and scenarios where the model's known strengths are on display. Tha...

From Single API to Network Intelligence: How Request Scout Changed My Debugging Workflow

A story about building smarter dev tools with your own AI The Moment It Clicked Last week, I was debugging a performance issue on a client's site. I had Chrome DevTools open, watching the Network tab with hundreds of requests flying by. I had Gemini in another window, pasting individual API responses, asking "What's this endpoint doing?" and "Are these headers correct?" Then it hit me: Why am I talking to Gemini about one API at a time, when I should be talking to it about my entire network? I opened my notebook and sketched something radical: What if I could ask my network tab questions directly? "Show me all failed API calls" "Which domains took longest?" "Find any requests with leaked auth tokens" "What's the pattern here?" Three days later, Request Scout was born. The Problem Nobody Talks About Network debugging is fragmented: 🕵️ You stare at the Network tab (human-scale = ~100 requests, max) 🔍...

250,000 AI Agent Instances Exposed on the Internet — Is Yours One of Them?

If You're Running OpenClaw, You May Want to Read This A public watchboard has surfaced listing over 250,000 OpenClaw instances that are directly reachable from the internet. Some of these instances have leaked credentials. Many are running on infrastructure already flagged for known CVEs and threat actor activity. This isn't theoretical. It's happening right now. You can check the exposure list yourself at openclaw.allegro.earth. Why This Is a Big Deal OpenClaw is a powerful AI agent framework. That power comes with serious responsibility. A typical OpenClaw deployment runs with: personal API keys (OpenAI, Anthropic, Google, cloud provider credentials); broad system permissions (file access, shell execution, network requests); autonomous execution capabilities (the agent can act without human approval); and complex codebases (large attack surfaces that haven't been fully audited). When one of these instances is publicly reachable without authentication...