Posts

OpenAI's Jalapeño Chip: Nine Months to Custom Silicon and What the 50% Cost Claim Really Means

OpenAI just announced Jalapeño, its first custom inference processor, built in partnership with Broadcom and taped out in just nine months. If the cost numbers hold, this is a structural shift in how OpenAI runs its models, and it eventually affects what builders pay to call the API. What Jalapeño Actually Is Jalapeño is an inference-only ASIC (application-specific integrated circuit). Not a training chip. Inference is what runs every time you call gpt-4o or o3. That's where the compute costs actually land at scale. The chip is built on TSMC's 3nm process node, the same manufacturing tier Apple uses for its A18 Pro. It's a reticle-sized die, meaning it's about as large as a chip can physically be before yield becomes a serious problem at that node. The package includes one large compute chiplet surrounded by eight HBM (high-bandwidth memory) stacks. HBM is what you need for LLM inference: huge memory bandwidth, physically close to the compute. GPUs do this too, b...
Recent posts

GitHub Copilot Switched to Token Billing in June. Some Teams Saw Their Bills Jump Overnight.

GitHub Copilot switched to usage-based billing on June 1, 2026. If your team didn't notice until the invoice arrived, you're not alone. This is the biggest change to Copilot's pricing model since launch, and the developer community's response was clear: over 900 downvotes and 400 comments on GitHub's own announcement thread in the first week. Here's what actually changed, who got hurt, and what to do before next month's bill lands. What the Old Model Was The previous system used Premium Request Units (PRUs). Your plan came with a fixed monthly allotment. When you burned through it, Copilot didn't cut you off. It quietly fell back to a lighter base model. You kept working. You just didn't know you'd dropped to a less capable model. That was a reasonable trade-off for predictability. That safety net is gone. The New Model: AI Credits Every plan now ships with a monthly AI Credits allowance. One credit costs $0.01. The plan prices didn't...

Apple's Foundation Models Framework Is Now a Model Router. Here's What Changes for Builders.

At WWDC26, Apple made a move that most coverage missed. They didn't just update the Foundation Models framework with new models. They restructured it into something closer to a model abstraction layer, one where your Swift code stays the same whether you're calling an on-device model, Apple's Private Cloud Compute, or a third-party provider like Claude or Gemini. That changes the architecture of iOS AI apps significantly. What Actually Changed The Foundation Models framework has existed since Apple Intelligence launched. But until now, it was essentially one thing: an on-device Apple model you called from Swift, with the privacy and latency benefits that come from never leaving the device. WWDC26 turned that into three distinct tiers accessible through one API: The existing on-device model (fast, private, capability-constrained) A new Private Cloud Compute model (bigger, reasoning-capable, 32K token context window) Third-party models including Claude and Gemini, cal...

Concurrent A/B Tests: How to Know When Interaction Effects Actually Matter

If you've run experimentation at any scale, you've hit this scenario. You've got three tests live simultaneously: one on the hero headline, one on the checkout CTA, one on the product page layout. The checkout CTA test shows a 12% lift. You ship it. The lift evaporates. Post-ship numbers look nothing like the test. Your first instinct is novelty effect. But the real culprit might be that the checkout CTA test was running at the same time as the product page layout test, and users who saw both variants behaved differently than those who saw just one. That's an interaction effect. It's one of the least understood problems in applied experimentation, and it's where a lot of phantom wins actually come from. What an interaction effect actually is In statistics, an interaction happens when the effect of one variable changes depending on the level of another. In A/B testing, it means the combined effect of two experiments running on overlapping user populations is...
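The definition in this excerpt can be made concrete with a small sketch. It assumes a 2x2 setup (two experiments, each with a control and a treatment) and measures the interaction as a difference of differences: how much test A's effect changes when test B's treatment is also on. The function name and the conversion rates are hypothetical and chosen for illustration only; they are not data from the post.

```python
def interaction_effect(rates):
    """Difference-in-differences for two concurrent tests.

    `rates` maps (a_variant, b_variant) -> conversion rate,
    where each variant is 0 (control) or 1 (treatment).
    """
    # Effect of test A among users who saw B's control
    effect_a_when_b_off = rates[(1, 0)] - rates[(0, 0)]
    # Effect of test A among users who saw B's treatment
    effect_a_when_b_on = rates[(1, 1)] - rates[(0, 1)]
    # Zero means the tests are additive; nonzero means they interact
    return effect_a_when_b_on - effect_a_when_b_off


# Hypothetical conversion rates, for illustration only
rates = {
    (0, 0): 0.100,  # neither treatment
    (1, 0): 0.112,  # A only: looks like a 12% relative lift
    (0, 1): 0.105,  # B only
    (1, 1): 0.104,  # both: A's lift disappears
}

print(round(interaction_effect(rates), 4))
```

In this made-up example, A looks like a win when measured against B's control, but the gain vanishes among users who also saw B's treatment. A real analysis would also need a significance test on the interaction term, since small differences of differences are noisy.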

MiniMax M3: The Open-Weight Model That Beat GPT-5.5 on Coding for 8x Less

MiniMax released M3 on June 1, 2026, and it's the first open-weight model to genuinely combine three things at once: frontier-level coding performance, a 1M-token context window, and native multimodal input. The interesting part isn't the feature list. It's the architectural trick that makes long-context inference practical at a fraction of what GPT-5.5 costs. A New Way to Do Attention at Scale Standard transformer attention scales quadratically with context length, which is why running a full 1M-token window at inference time is usually too expensive to be useful. MiniMax's answer is MSA (MiniMax Sparse Attention), and the mechanics are worth understanding. Instead of computing attention over every token in the context, MSA uses a two-stage process. A lightweight index branch first scans incoming tokens and selects which blocks of the KV cache are actually relevant to the query. The main attention layer then processes only those selected blocks. MiniMax's numbe...
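The two-stage idea described in this excerpt can be sketched for a single query in plain Python. This is a toy illustration, not MiniMax's implementation: scoring each block by the dot product of the query with the block's mean key is an assumption made here for simplicity, and the function name, `block_size`, and `top_k` are hypothetical.

```python
import math


def block_sparse_attention(query, keys, values, block_size=2, top_k=1):
    """Toy two-stage attention for one query vector.

    Stage 1 (index branch, ASSUMED scoring rule): score each block of the
    KV cache by the dot product of the query with the block's mean key.
    Stage 2: run ordinary softmax attention over the selected blocks only.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Split token positions into contiguous blocks
    blocks = [list(range(s, min(s + block_size, len(keys))))
              for s in range(0, len(keys), block_size)]

    # Stage 1: cheap per-block relevance score
    def mean_key(block):
        return [sum(keys[i][d] for i in block) / len(block)
                for d in range(len(query))]

    scores = [dot(query, mean_key(b)) for b in blocks]
    chosen = sorted(range(len(blocks)), key=lambda i: scores[i],
                    reverse=True)[:top_k]
    selected = [i for b in sorted(chosen) for i in blocks[b]]

    # Stage 2: scaled dot-product attention over selected tokens only
    scale = math.sqrt(len(query))
    logits = [dot(query, keys[i]) / scale for i in selected]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    output = [sum(w * values[i][d] for w, i in zip(weights, selected)) / total
              for d in range(len(values[0]))]
    return output, selected


keys = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
values = [[1.0], [1.0], [5.0], [5.0]]
output, selected = block_sparse_attention([1.0, 0.0], keys, values)
print(selected, output)
```

The cost saving comes from stage 2 touching only `top_k * block_size` tokens instead of the full context; the quality of the result depends entirely on how well the stage 1 scorer picks the relevant blocks.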

OpenCode Has More GitHub Stars Than Claude Code. Here's What You're Actually Trading.

OpenCode, the terminal coding agent from the SST team, just shipped v1.17.8 and has 176,000 GitHub stars as of June 2026. Claude Code sits at 132K. OpenCode also topped LogRocket's AI dev tool power rankings this month, displacing Cursor. That's a real market signal, not just GitHub vanity metrics. The core pitch: OpenCode is free, MIT-licensed, and works with 75+ AI providers. You pick the model. The agent is the constant; the intelligence behind it is a config option. What OpenCode Actually Is It's a Go-based CLI with a terminal UI, built by the SST team (the people behind SST and terminal.shop), and it runs on a client/server architecture. The TUI is just one client. The agent process runs on your machine and can be driven remotely from another client, including a desktop app for macOS and Windows. It ships with two built-in agents: a build agent with full filesystem and shell access, and a plan agent that's read-only for code exploration and analysis. An addit...

OpenAI's Deployment Simulation: Testing AI Behavior Against Real Traffic Before Release

OpenAI published a paper on June 16 describing something I've been wanting to see for a while: a way to test how a new model actually behaves at scale, using real user conversations rather than synthetic benchmarks. They call it Deployment Simulation. The short version is they replay 1.3 million de-identified production conversations with a candidate model before releasing it, catch behavioral drift early, and find that models have almost no idea they're being tested. That last part is the most interesting finding. The Problem It's Solving Anyone who has shipped AI features has hit this pattern. A benchmark says your new model is better. You do some manual evals. You run your regression suite. You deploy. Then something shifts in a way none of that testing caught, and you find out from user complaints. The International AI Safety Report 2026 has a name for this: the "evaluation gap." It's the systematic disconnect between how models perform on pre-deploym...