OpenAI's Jalapeño Chip: Nine Months to Custom Silicon and What the 50% Cost Claim Really Means


OpenAI just announced Jalapeño, its first custom inference processor, built in partnership with Broadcom and taped out in just nine months. If the cost numbers hold, this is a structural shift in how OpenAI runs its models, and it eventually affects what builders pay to call the API.

What Jalapeño Actually Is

Jalapeño is an inference-only ASIC (application-specific integrated circuit). Not a training chip. Inference is what runs every time you call gpt-4o or o3. That's where the compute costs actually land at scale.

The chip is built on TSMC's 3nm process node, the same manufacturing tier Apple uses for its A18 Pro. It's a reticle-sized die, meaning it's about as large as a chip can physically be before yield becomes a serious problem at that node. The package includes one large compute chiplet surrounded by eight HBM (high-bandwidth memory) stacks. HBM is what you need for LLM inference: huge memory bandwidth, physically close to the compute. GPUs do this too, but a purpose-built ASIC strips out everything a GPU needs for general graphics and puts that die area and power budget toward memory bandwidth and matrix multiply throughput.
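To see why memory bandwidth dominates, here is a rough back-of-envelope sketch in Python. It assumes the common simplification that each decode step has to stream every model weight from memory once, and it ignores KV-cache traffic and compute limits. The function name and all numbers are hypothetical illustrations, not Jalapeño specifications.

```python
def decode_tokens_per_second(param_count, bytes_per_param,
                             mem_bandwidth_bytes_per_s, batch_size=1):
    """Rough upper bound on decode throughput for a bandwidth-bound model.

    Simplifying assumption: every decode step reads all weights from
    memory once, and that read is the only bottleneck.
    """
    weight_bytes = param_count * bytes_per_param
    # Each step produces one token per sequence in the batch.
    steps_per_second = mem_bandwidth_bytes_per_s / weight_bytes
    return steps_per_second * batch_size


# Hypothetical example: a 70B-parameter model stored at 1 byte per
# parameter, on hardware with 4 TB/s of memory bandwidth.
single = decode_tokens_per_second(70e9, 1, 4e12)
batched = decode_tokens_per_second(70e9, 1, 4e12, batch_size=8)
print(f"batch 1: ~{single:.0f} tokens/s, batch 8: ~{batched:.0f} tokens/s")
```

Under these made-up numbers a single sequence tops out at about 57 tokens per second no matter how much matrix-multiply throughput the chip has. Doubling bandwidth doubles the ceiling, which is why die area spent on HBM interfaces pays off for inference.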

OpenAI says engineering samples are already running at target clock speed and handling ML workloads including GPT-5.3-Codex-Spark. The Broadcom announcement confirms prototype deployments are planned for late 2026, with the rollout then scaling alongside Microsoft into gigawatt-scale data centers.

The 50% Cost Claim, and Why to Read It Carefully

OpenAI claims Jalapeño delivers roughly 50% lower cost per inference token versus current GPU alternatives, and "substantially better" performance per watt. A few things to note before taking that at face value.

First, this is OpenAI's own benchmark, run against workloads of its choosing, with no disclosed comparison baseline. Is it versus H100s? H200s? GB200 NVL72 racks? The baseline matters, because the same chip can look very different against older and newer GPUs. A purpose-built inference ASIC can absolutely outperform general-purpose GPUs by eliminating the overhead that GPUs carry for graphics and general compute. But 50% is a specific number that needs external validation.
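A small sketch makes the baseline problem concrete. The Python below computes cost per million tokens from an hourly hardware cost and a throughput figure, then compares one candidate chip against two different baselines. Every number and label here is invented for illustration; none of it comes from OpenAI, Broadcom, or NVIDIA.

```python
def cost_per_million_tokens(hourly_cost_usd, tokens_per_second):
    """Cost in USD to generate one million tokens at steady throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


def relative_savings(baseline_cost, candidate_cost):
    """Fraction by which the candidate is cheaper than the baseline."""
    return 1 - candidate_cost / baseline_cost


# Hypothetical all-in hourly costs and throughputs (assumptions only).
candidate = cost_per_million_tokens(2.00, 5000)   # custom ASIC
older_gpu = cost_per_million_tokens(3.00, 3000)   # older GPU baseline
newer_gpu = cost_per_million_tokens(4.00, 8000)   # newer GPU baseline

for name, baseline in [("older GPU", older_gpu), ("newer GPU", newer_gpu)]:
    pct = relative_savings(baseline, candidate) * 100
    print(f"vs {name}: {pct:.0f}% lower cost per token")
```

With these invented inputs the same chip is 60% cheaper than the older baseline and only 20% cheaper than the newer one. A headline percentage without a named baseline cannot be checked.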

Second, this is an inference chip only. Training still runs on GPUs, and training runs remain a very large compute bill of their own. OpenAI is not escaping NVIDIA dependence. It is carving out inference, the workload whose shape it controls most tightly.

Third, the chip is internal only and won't be sold to external customers. It reduces OpenAI's own cost structure, which could eventually flow into API pricing, but there's no direct mechanism at launch. Your API calls won't run on Jalapeño this year.

Why Every Large Lab Is Building Custom Silicon

This is not new. Google has been running Tensor Processing Units since 2015 and now controls roughly a quarter of global AI compute outside NVIDIA's supply chain. Amazon has shipped over a million Trainium chips. Meta has MTIA. Microsoft has Maia. Every large lab at scale eventually builds custom silicon, and none of them replace NVIDIA outright. They run custom chips for the workloads they can tightly control and still buy NVIDIA for everything else.

What's notable about OpenAI is the timeline. Nine months from blank-slate design to tape-out is fast for a chip program. Broadcom has done this kind of work before (Google's early TPU development was also a Broadcom collaboration), so they know the process. But a nine-month ASIC cycle still requires that you know your workload extremely well upfront, because you cannot change hardware mid-build. OpenAI has been running GPT-scale inference for three years. They know what their matmul shapes look like.

The partnership structure is also worth noting. Broadcom designs the chip, TSMC fabricates it. OpenAI funds it and owns the resulting silicon. The money flows to Broadcom and TSMC, not NVIDIA. That's intentional.

What Actually Changes for Builders

Honestly, not much in the short term. Production scale doesn't arrive until 2027 at the earliest. Your API calls run on NVIDIA hardware for now.

The longer arc is more interesting. If Jalapeño works and OpenAI's cost structure improves, they have more room to price inference competitively. Cost per token has already dropped dramatically over the past two years, through model efficiency gains and infrastructure work. Custom silicon is the next lever, and it's one OpenAI now controls rather than waiting on NVIDIA supply allocation.

There's also a reliability angle. OpenAI's inference capacity today is partly constrained by how many chips NVIDIA can deliver, and when. Custom silicon means OpenAI can plan its own production roadmap. That's better for capacity predictability long-term, though I haven't seen specific SLA commitments come out of this announcement.

I haven't run any workloads on Jalapeño. No one outside OpenAI has. The 50% figure is a marketing number until the chip is in production and someone runs independent benchmarks. But building inference ASICs to reduce per-token cost is the right structural move for any lab running at this scale. The only real question is whether nine months was enough runway to get the workload assumptions right before tape-out locked them in.

ngLover
