Posts

Showing posts from June, 2026

Apple's Foundation Models Framework Is Now a Model Router. Here's What Changes for Builders.

At WWDC26, Apple made a move that most coverage missed. They didn't just update the Foundation Models framework with new models. They restructured it into something closer to a model abstraction layer, one where your Swift code stays the same whether you're calling an on-device model, Apple's Private Cloud Compute, or a third-party provider like Claude or Gemini. That changes the architecture of iOS AI apps significantly.

What Actually Changed
The Foundation Models framework has existed since Apple Intelligence launched. But until now, it was essentially one thing: an on-device Apple model you called from Swift, with the privacy and latency benefits that come from never leaving the device. WWDC26 turned that into three distinct tiers accessible through one API:
The existing on-device model (fast, private, capability-constrained)
A new Private Cloud Compute model (bigger, reasoning-capable, 32K token context window)
Third-party models including Claude and Gemini, cal...

Concurrent A/B Tests: How to Know When Interaction Effects Actually Matter

If you've run experimentation at any scale, you've hit this scenario. You've got three tests live simultaneously: one on the hero headline, one on the checkout CTA, one on the product page layout. The checkout CTA test shows a 12% lift. You ship it. The lift evaporates. Post-ship numbers look nothing like the test. Your first instinct is novelty effect. But the real culprit might be that the checkout CTA test was running at the same time as the product page layout test, and users who saw both variants behaved differently than those who saw just one. That's an interaction effect. It's one of the least understood problems in applied experimentation, and it's where a lot of phantom wins actually come from.

What an interaction effect actually is
In statistics, an interaction happens when the effect of one variable changes depending on the level of another. In A/B testing, it means the combined effect of two experiments running on overlapping user populations is...
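One way to make the statistical definition concrete is a difference-in-differences over the four cells of two overlapping tests. This is a minimal sketch, not a full analysis: the conversion rates are made up for illustration, the function name `interaction_effect` is hypothetical, and a real check would also need a standard error on the result.

```python
def interaction_effect(rates):
    """Difference-in-differences for two overlapping tests.

    `rates` maps (cta_variant, layout_variant) -> conversion rate,
    where 0 is control and 1 is treatment for each test.
    """
    # Lift from the CTA change among users on the control layout.
    cta_lift_old_layout = rates[(1, 0)] - rates[(0, 0)]
    # Lift from the same CTA change among users on the new layout.
    cta_lift_new_layout = rates[(1, 1)] - rates[(0, 1)]
    # If the two tests do not interact, these lifts match and this is ~0.
    return cta_lift_new_layout - cta_lift_old_layout


# Hypothetical conversion rates, chosen only to illustrate the idea.
rates = {
    (0, 0): 0.100,  # control CTA, control layout
    (1, 0): 0.112,  # new CTA, control layout
    (0, 1): 0.105,  # control CTA, new layout
    (1, 1): 0.105,  # new CTA, new layout
}

print(f"interaction: {interaction_effect(rates):+.3f}")
```

In this made-up data the CTA is worth 1.2 points on the old layout and nothing on the new one, so a CTA test read in isolation would overstate what you get once the new layout also ships.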

MiniMax M3: The Open-Weight Model That Beat GPT-5.5 on Coding for 8x Less

MiniMax released M3 on June 1, 2026, and it's the first open-weight model to genuinely combine three things at once: frontier-level coding performance, a 1M-token context window, and native multimodal input. The interesting part isn't the feature list. It's the architectural trick that makes long-context inference practical at a fraction of what GPT-5.5 costs.

A New Way to Do Attention at Scale
Standard transformer attention scales quadratically with context length, which is why running a full 1M-token window at inference time is usually too expensive to be useful. MiniMax's answer is MSA (MiniMax Sparse Attention), and the mechanics are worth understanding. Instead of computing attention over every token in the context, MSA uses a two-stage process. A lightweight index branch first scans incoming tokens and selects which blocks of the KV cache are actually relevant to the query. The main attention layer then processes only those selected blocks. MiniMax's numbe...
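The two-stage idea (select blocks cheaply, then attend only within them) can be sketched in a few lines. To be clear about assumptions: this is a generic top-k block-selection toy, not MiniMax's implementation. Scoring a block by the query's dot product with the block's mean key is my stand-in for the index branch, and `sparse_attend`, `block_size`, and `top_k` are hypothetical names.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Plain scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(q))
    weights = softmax([dot(q, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def sparse_attend(q, keys, values, block_size, top_k):
    """Two-stage attention: pick top_k blocks, then attend only inside them."""
    blocks = [
        range(start, min(start + block_size, len(keys)))
        for start in range(0, len(keys), block_size)
    ]

    # Stage 1 (assumed scoring rule): rank blocks by q . mean(keys in block).
    def block_score(block):
        mean_key = [
            sum(keys[i][j] for i in block) / len(block) for j in range(len(q))
        ]
        return dot(q, mean_key)

    ranked = sorted(
        range(len(blocks)), key=lambda b: block_score(blocks[b]), reverse=True
    )
    chosen = sorted(ranked[:top_k])

    # Stage 2: full attention, restricted to tokens in the chosen blocks.
    idx = [i for b in chosen for i in blocks[b]]
    out = attend(q, [keys[i] for i in idx], [values[i] for i in idx])
    return out, chosen


# Toy data: the second block's keys line up with the query, the first's do not.
q = [1.0, 0.0]
keys = [[0.0, 1.0], [0.0, 1.0], [5.0, 0.0], [5.0, 0.0]]
values = [[1.0], [1.0], [9.0], [9.0]]
out, chosen = sparse_attend(q, keys, values, block_size=2, top_k=1)
print(chosen, out)
```

The cost argument is visible in the structure: stage 1 touches one summary vector per block, and stage 2 only pays full attention cost on `top_k * block_size` tokens instead of the whole context.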

OpenCode Has More GitHub Stars Than Claude Code. Here's What You're Actually Trading.

OpenCode, the terminal coding agent from the SST team, just shipped v1.17.8 and has 176,000 GitHub stars as of June 2026. Claude Code sits at 132K. OpenCode also topped LogRocket's AI dev tool power rankings this month, displacing Cursor. That's a real market signal, not just GitHub vanity metrics. The core pitch: OpenCode is free, MIT-licensed, and works with 75+ AI providers. You pick the model. The agent is the constant; the intelligence behind it is a config option.

What OpenCode Actually Is
It's a Go-based CLI with a terminal UI, built by the SST team (the people behind SST and terminal.shop), and it runs on a client/server architecture. The TUI is just one client. The agent process runs on your machine and can be driven remotely from another client, including a desktop app for macOS and Windows. It ships with two built-in agents: a build agent with full filesystem and shell access, and a plan agent that's read-only for code exploration and analysis. An addit...

OpenAI's Deployment Simulation: Testing AI Behavior Against Real Traffic Before Release

OpenAI published a paper on June 16 describing something I've been wanting to see for a while: a way to test how a new model actually behaves at scale, using real user conversations rather than synthetic benchmarks. They call it Deployment Simulation. The short version is they replay 1.3 million de-identified production conversations with a candidate model before releasing it, catch behavioral drift early, and find that models have almost no idea they're being tested. That last part is the most interesting finding.

The Problem It's Solving
Anyone who has shipped AI features has hit this pattern. A benchmark says your new model is better. You do some manual evals. You run your regression suite. You deploy. Then something shifts in a way none of that testing caught, and you find out from user complaints. The International AI Safety Report 2026 has a name for this: the "evaluation gap." It's the systematic disconnect between how models perform on pre-deploym...

Gemini 3.5 Flash and the End of 'Use the Biggest Model' for Agents

I've been defaulting to Opus-tier or GPT-5.5 for anything agent-related because that felt like the safe call. Better reasoning, better tool use, better outcomes. Flash-tier models were for batch jobs, summaries, things where you didn't care that much about output quality. That calculus broke for me after spending time with the Gemini 3.5 Flash benchmarks. The model went GA on May 19 at Google I/O. The number that got my attention: 83.6% on MCP Atlas, a benchmark specifically for multi-step tool orchestration using Model Context Protocol servers. That puts it 8.3 points ahead of GPT-5.5 (75.3%) and 4.5 points ahead of Claude Opus 4.7 on the same eval. "Flash" doesn't mean what it used to.

What MCP Atlas Is Actually Measuring
MCP Atlas tests whether a model can chain together multiple tool calls across MCP servers, recover from partial failures, and complete multi-step tasks without going off-script. It's not a writing or reasoning benchmark. If you're ...

Claude Fable 5 Lasted Three Days. What That Means for How You Build.

Claude Fable 5 went live on June 9, 2026. By June 12, the US Department of Commerce had issued an export control directive ordering Anthropic to suspend access to it and Claude Mythos 5 for all foreign nationals. Three days. Anthropic couldn't cleanly separate domestic from foreign users in real time, so they pulled both models for everyone worldwide. As of today, both are still offline with no resolution timeline announced. This is the first time the US government has applied export controls directly to an AI model rather than the chips or hardware underneath it. If you had shipped anything on claude-fable-5 last week, you got a few hours of warning. I want to talk about what this means architecturally, because Anthropic actually shipped a useful resilience mechanism in Fable 5's API design. Most teams hadn't configured it.

What Fable 5 Actually Shipped
The capabilities are real. Fable 5 shares its weights with Mythos 5 (available only through Project Glasswing to ap...

Multi-Armed Bandits Are Not Smarter A/B Tests

Multi-armed bandits are an adaptive testing method that shifts traffic toward your best-performing variant as the test runs, rather than holding a fixed 50/50 split throughout. The idea is to minimize the cost of running a losing variant. The problem is that teams adopt them as an upgrade to A/B testing, and they're not: they're a different tool that trades statistical validity for short-term efficiency. If you're using MABs for product features, checkout flows, or anything you'll iterate on, you're probably getting cleaner-looking results that tell you less than you think.

The Core Tradeoff You're Actually Making
A/B tests assign traffic randomly. That randomness is the whole point. It's what lets you make causal claims. When you can say "I randomly assigned users to this variant, and they converted at a higher rate," you're not just observing a correlation. You've approximated a controlled experiment. MABs discard that guarantee. By ...
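To see what "shifts traffic toward your best-performing variant" looks like mechanically, here is a small simulation. It uses Thompson sampling, which is one common allocation rule and my choice for illustration, not something the excerpt above specifies. The conversion rates and the `thompson_sampling` helper are hypothetical.

```python
import random


def thompson_sampling(true_rates, n_users, seed=0):
    """Simulate a Beta-Bernoulli Thompson sampling bandit.

    Returns how many users each arm received. `true_rates` are the
    (normally unknown) conversion rates used to simulate outcomes.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms

    for _ in range(n_users):
        # Draw one plausible conversion rate per arm from its Beta posterior.
        samples = [
            rng.betavariate(successes[i] + 1, failures[i] + 1)
            for i in range(n_arms)
        ]
        # Send this user to the arm whose draw is highest.
        arm = samples.index(max(samples))
        pulls[arm] += 1

        # Simulate the outcome and update that arm's posterior.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1

    return pulls


# Hypothetical rates: arm 1 converts three times as often as arm 0.
pulls = thompson_sampling([0.05, 0.15], n_users=5000)
print(pulls)
```

Note what the loop does to assignment: which arm a user lands in depends on every outcome observed before them. That is the efficiency gain, and it is also why the split is no longer the fixed random assignment that a standard A/B analysis assumes.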

Evals Are Just A/B Testing for Your Agents

If you've spent serious time in experimentation, you already understand LLM evals better than most AI engineers. You just haven't been told yet. I've been running A/B tests for enterprise teams for years. Last year I started building agents in earnest. And somewhere around the third time an "improved" prompt made things quietly worse in production, I had the realization: the eval problem and the experimentation problem are structurally identical. Teams are reinventing controlled comparison, doing it badly, because nobody told them they'd been here before.

The Match Nobody Is Pointing Out
Here is what an eval actually is. You have a baseline: your current prompt, model, or agent configuration. You make a change. You want to know whether that change made things better or worse. You need a consistent way to measure "better." That's an A/B test. Exactly that. The golden dataset is your holdout test set. The eval judge, human or LLM, is your me...

The Winner's Curse in A/B Testing: Why Your Biggest Lifts Are Probably Exaggerated

I've audited a lot of experimentation programs. The most common red flag isn't a low win rate. It's a suspiciously high one. If your team is consistently reporting 40%, 50%, or 60%+ win rates with lifts above 20% on your primary metric, something is probably wrong. Not "wrong" in the sense of fraud, but wrong in the statistical sense: you're almost certainly looking at the winner's curse.

What the Winner's Curse Actually Is
The winner's curse is not about bad luck. It's a mathematical outcome of running underpowered tests. Here's the mechanism: when a test is underpowered (say, 30% or 40% statistical power instead of the standard 80%), the test usually fails to detect a real effect. Most runs come back null. But occasionally, by chance, the noise in your data pushes the result over the significance threshold. When that happens, the observed lift is almost always an exaggeration of the true effect. The only way a small, underpowered tes...
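The mechanism is easy to reproduce in a simulation. This is a sketch under simplifying assumptions: each test's observed lift is drawn from a normal distribution around a fixed true lift, significance is a one-sided z-test at 1.96, and the numbers (a true lift of 0.02 with a standard error of 0.015) are invented to give an underpowered setup. `simulate_winners` is a hypothetical helper.

```python
import random
import statistics


def simulate_winners(true_lift, std_error, n_tests=20000, z_crit=1.96, seed=0):
    """Run many identical underpowered tests and look only at the 'winners'.

    Returns (share of tests that reach significance, mean observed lift
    among those significant tests).
    """
    rng = random.Random(seed)
    # Each test observes the true lift plus sampling noise.
    observed = [rng.gauss(true_lift, std_error) for _ in range(n_tests)]
    # A test "wins" only if its z-score clears the significance threshold.
    winners = [lift for lift in observed if lift / std_error > z_crit]
    return len(winners) / n_tests, statistics.mean(winners)


true_lift = 0.02
power, mean_winner_lift = simulate_winners(true_lift, std_error=0.015)
print(f"share significant: {power:.2f}")
print(f"true lift: {true_lift:.3f}, mean lift among winners: {mean_winner_lift:.3f}")
```

With these inputs, a result only counts as significant if the observed lift exceeds 1.96 × 0.015 ≈ 0.029, which is already well above the true 0.02. Every reported winner overstates the effect by construction, and the smaller the test, the higher that bar sits relative to the truth.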

Best AEO Tools in 2026: Top 5 Answer Engine Optimization Platforms Compared

If your brand isn't showing up in ChatGPT, Perplexity, or Google AI Overviews, you're missing a fast-growing slice of product discovery. 37% of product discovery queries now start inside AI interfaces, not search engines. Answer Engine Optimization (AEO) is the practice of fixing that, and you need the right tools to track it, measure it, and improve it. I've gone through the options available in 2026 and narrowed it down to the five that actually deliver.

What to Look for in an AEO Tool
Before picking a tool, know what you actually need. The AEO tool market splits into two buyer types: teams extending an existing SEO platform (Ahrefs, Semrush, SE Ranking) and teams buying a dedicated AI visibility platform (Profound, Scrunch, Otterly.AI). The core capabilities to check:
Engine coverage: Which AI platforms does it monitor? ChatGPT, Perplexity, and Google AI Overviews are the minimum. Claude, Gemini, Copilot, and Grok are increasingly important.
Citation and mention t...