ngLover

Posts

Showing posts from 2026

JADEPUFFER Is the First LLM-Driven Ransomware. Here's What Actually Happened.

Earlier this month, Sysdig published research on JADEPUFFER, what they assess to be the first end-to-end agentic ransomware operation: a full attack chain from initial access to data destruction, driven by an LLM agent with no human operator at the keyboard. I've been tracking this story since it dropped July 2, and I want to walk through what actually happened before the hot takes bury the technical detail.

What the Attack Actually Did

The entry point was CVE-2025-3248, a missing-authentication vulnerability in Langflow's code validation endpoint. CVSS score of 9.8. Langflow is the drag-and-drop flow builder a lot of teams use to wire together LLM pipelines. The flaw lets an unauthenticated caller execute arbitrary Python on the host. Game over for initial access. Once in, the agent ran a methodical sweep: dumped Langflow's PostgreSQL database, harvested environment variables, scraped credentials from config files, inventoried a MinIO object store. The credential haul ...

Claude Sonnet 5: Near-Opus Performance at 40% Lower Cost Changes Your Agent Routing

Anthropic dropped Claude Sonnet 5 on June 30, and the positioning is deliberate: this is the most agentic Sonnet model they've shipped. Close to Opus 4.8 in performance, at 40% lower cost at standard pricing, and 60% cheaper during the introductory window that runs through August 31, 2026. For anyone running multi-step agents at scale, this changes the cost math in a concrete way.

What's actually in the release

Sonnet 5 (claude-sonnet-5) sits between Haiku 4.5 and Opus 4.8 in the lineup, but Anthropic is pitching it as the default workhorse for most agentic work. It carries a 1M token context window, 128k max output, and adaptive thinking. Fast latency. And it defaults to effort: high on the Claude API and Claude Code, meaning the model engages its full reasoning budget by default. That's the same default as Opus 4.8.

Here's the full pricing picture now:
Haiku 4.5: $1/$5 per million tokens. Fast, near-frontier for simple tasks.
Sonnet 5 intro (through Aug 31)...

What Tesla's $200 AI Cap Gets Wrong About Token Costs

Three things happened this week that, together, tell you something real about where AI tooling is headed. Tesla announced a $200-per-week spending cap on employee AI tools, effective July 6. Uber's COO publicly stated the company burned through its entire 2026 AI budget in four months, then capped per-person spending at $1,500 per month. And reporting from Electrek revealed that Meta's internal AI usage hit 73.7 trillion tokens in a single month, putting the company on track for billions in annual costs. Meta tracks this on an internal leaderboard called "Claudeonomics." The bill for AI-assisted engineering is no longer theoretical.

Why Token Costs Are Spiking Now

If you've been wondering why this is happening all at once, the short answer is agents. Chat-based AI usage is relatively predictable: a developer opens a window, types a question, gets an answer. The cost per interaction is low enough that most companies could treat it like a SaaS subscription an...

The Man Who Invented the Transformer Is Now at OpenAI Designing What Comes Next

Noam Shazeer co-authored "Attention Is All You Need" in 2017. That paper introduced the Transformer architecture, which sits underneath every major AI model running today: GPT, Gemini, Claude, all of them. He then co-founded Character.AI, Google paid around $2.7 billion to bring him back as VP of Engineering and co-lead on Gemini in August 2024, and less than two years later he walked out the door to join OpenAI. His title at OpenAI: Lead for Architecture Research. His mandate: design next-generation architectures beyond the current GPT line. That's not a talent story. That's a directional signal.

What Actually Happened

On June 18, 2026, Shazeer announced he was leaving Google for OpenAI. Google had spent $2.7 billion to retain him (as part of a partnership deal that brought back his Character.AI co-founder Daniel De Freitas alongside him) less than two years prior. They paid that to keep him. He left anyway. The same week, Google also lost a prominent researcher to Ant...

At $0.87/M Output Tokens, DeepSeek V4-Pro Just Repriced Your Agent Architecture

DeepSeek made the 75% discount on V4-Pro permanent in late June. Not a promo extension, not a trial period. They called it an "efficiency gain being passed through." That framing matters. It means the new price floor is structural, not a marketing play designed to flip later. The numbers: $0.435/M input, $0.87/M output, and cache hits at $0.003625/M. For context: GPT-5.5 sits at $5/M input and $30/M output. Claude Fable 5 is $10/M and $50/M. DeepSeek V4-Pro is roughly 34x cheaper per output token than GPT-5.5. At that delta, you're not comparing pricing tiers anymore. You're looking at different economic regimes.

What Actually Changed

V4-Pro was already a serious model before the cut. It's a 1.6 trillion parameter MoE with 49B active params, a 1M token context window, and MIT-licensed. It scores 80.6% on SWE-bench Verified, the highest open-weights entry, tied with Gemini 3.1 Pro. The price cut didn't change the model. It changed what's economically v...
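The price gap quoted above is easy to sanity-check with a few lines of arithmetic. This is a minimal sketch using only the per-million-token prices stated in the post; the dictionary keys and the sample workload (2M input tokens, 500k output tokens) are hypothetical labels and numbers chosen for illustration, not real API identifiers or measured usage.

```python
# Prices in USD per million tokens, as quoted in the post.
# The dictionary keys are illustrative labels, not official model IDs.
PRICES = {
    "deepseek-v4-pro": {"input": 0.435, "output": 0.87},
    "gpt-5.5": {"input": 5.00, "output": 30.00},
}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one workload at list price, ignoring cache hits."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


def output_price_ratio(expensive: str, cheap: str) -> float:
    """How many times more the expensive model charges per output token."""
    return PRICES[expensive]["output"] / PRICES[cheap]["output"]


# Ratio behind the "roughly 34x" claim: 30 / 0.87.
ratio = output_price_ratio("gpt-5.5", "deepseek-v4-pro")

# Hypothetical agent run: 2M input tokens, 500k output tokens.
cheap_run = cost_usd("deepseek-v4-pro", 2_000_000, 500_000)
pricey_run = cost_usd("gpt-5.5", 2_000_000, 500_000)

if __name__ == "__main__":
    print(f"output price ratio: {ratio:.1f}x")
    print(f"same workload: ${cheap_run:.3f} vs ${pricey_run:.2f}")
```

Note that the whole-workload gap depends on the input/output mix, so it will generally differ from the output-only ratio.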

Apple's Foundation Models Framework Is Now a Model Router. Here's What Changes for Builders.

At WWDC26, Apple made a move that most coverage missed. They didn't just update the Foundation Models framework with new models. They restructured it into something closer to a model abstraction layer, one where your Swift code stays the same whether you're calling an on-device model, Apple's Private Cloud Compute, or a third-party provider like Claude or Gemini. That changes the architecture of iOS AI apps significantly.

What Actually Changed

The Foundation Models framework has existed since Apple Intelligence launched. But until now, it was essentially one thing: an on-device Apple model you called from Swift, with the privacy and latency benefits that come from never leaving the device. WWDC26 turned that into three distinct tiers accessible through one API:
The existing on-device model (fast, private, capability-constrained)
A new Private Cloud Compute model (bigger, reasoning-capable, 32K token context window)
Third-party models including Claude and Gemini, cal...

What Noam Shazeer Leaving Google for OpenAI Actually Means

On June 18, Noam Shazeer posted on X that he was joining OpenAI. His title: lead for AI architecture research, confirmed by OpenAI's chief research officer Mark Chen. If that name doesn't mean anything to you, here's the context that makes it matter. Shazeer is one of eight co-authors of "Attention Is All You Need," the 2017 paper that introduced the Transformer architecture. The architecture every major language model runs on today. GPT-5.5, Claude Fable 5, Gemini 3.5 Flash, Llama, Mistral, all of them. The ideas in that paper are as foundational to modern AI as UNIX was to operating systems. He left Google after that paper, co-founded Character.AI, and built it into a consumer AI product with enormous scale. In August 2024, Google paid approximately $2.7 billion (structured as a technology license from Character.AI) to bring Shazeer and a cohort of researchers back into Google DeepMind. His role: VP of engineering and co-lead of Gemini, specifically owning the ...

Gemini 3.5 Pro Missed June. Four Researchers Left for Anthropic. Here's What I'm Watching.

Google made a specific promise at Google I/O on May 19: Gemini 3.5 Pro would be generally available by June. Sundar Pichai, when pressed on the timeline, said "give us until next month." The audience groaned audibly. They were right to. As of June 29, Gemini 3.5 Pro is still in limited Vertex AI enterprise preview. The public launch has been pushed to July. And in the same week the June deadline slipped, four senior Gemini researchers announced they were leaving for Anthropic. Google's AI coding teams have lost six researchers in five months. That combination is not damning by itself. But it is a signal, and I think it's the more interesting story here.

What a Month's Slip Actually Costs

A one-month delay sounds minor. It rarely is when you're managing a product roadmap around it. If your team planned feature launches, customer commitments, or integration timelines around Gemini 3.5 Pro hitting GA in June, you just ate a planning hit. The areas Google ...

Obsidian + Claude: Turning Your Vault Into a Context Layer That Actually Works

Obsidian is a note-taking app, but that description undersells it. The thing that makes it different from Notion, Roam, or any of the cloud-based alternatives is this: your vault is just a folder of Markdown files on your computer. No proprietary database, no sync service you're locked into, no account required. Just .md files sitting on disk, organized however you want. That simplicity is also what makes it uniquely useful as a context layer for AI.

What Obsidian Actually Is

The core features: a Markdown editor with live preview, a graph view that visualizes how your notes connect via backlinks, and a community plugin ecosystem with over 1,000 plugins. It runs on Mac, Windows, Linux, iOS, and Android. Sync is optional and paid, but your vault works completely offline. What you build in Obsidian is a personal knowledge graph. Each note links to others via [[wikilinks]]. Over time, the graph view shows you which ideas are densely connected and which are isolated. It's the...

OKF: Why Your Agent's Context Layer Is the Problem, Not Your Retrieval Strategy

Every agent project I've built that touches internal data hits the same wall. The agent needs context: what is this BigQuery table, what do the columns mean, how does it join to the orders table, what's "monthly active users" in your org and not the textbook definition. You end up dumping SQL schemas into the system prompt, pointing at Confluence pages, writing a bespoke context builder that assembles fragments before each request. It works, barely, and it doesn't travel. Move to a different team's data, start a new project, and you're rebuilding it from scratch. Google Cloud published a spec on June 12, 2026 that addresses exactly this: the Open Knowledge Format (OKF), v0.1. It formalizes what Andrej Karpathy called the "LLM wiki" into a portable, interoperable format.

What OKF Is (and What It Isn't)

OKF is not a service or a platform. It's a file format. The spec fits on a single page. Your knowledge base is a directory of markdow...

GPT-5.6 Sol: Ultra Mode, Three-Tier Pricing, and Why METR Says Its Benchmarks Are Broken

OpenAI previewed GPT-5.6 on June 26, 2026, in three variants: Sol, Terra, and Luna. Access is currently limited to roughly 20 US government-approved partner organizations, which means most teams cannot run their own tests yet. But there is still a lot worth digging into: a genuinely interesting architecture change with "ultra" mode, and a finding from METR that fundamentally changes how you should read any Sol benchmark score you encounter.

Sol, Terra, and Luna: The Three-Tier Model

The naming is celestial but the logic is familiar. OpenAI has codified what we have all been doing informally: routing different tasks to different models based on cost and capability.

Sol is the flagship. It targets hard problems in coding (Terminal-Bench 2.1 state of the art at 88.8%), biology (GeneBench v1), and cybersecurity (ExploitBench). Pricing is $5 per million input tokens, $30 output.

Terra is the balanced tier, aimed at high-volume business tasks, customer support, document anal...

OpenAI's Jalapeño Chip: Nine Months to Custom Silicon and What the 50% Cost Claim Really Means

OpenAI just announced Jalapeño, its first custom inference processor, built in partnership with Broadcom and taped out in just nine months. If the cost numbers hold, this is a structural shift in how OpenAI runs its models, and it eventually affects what builders pay to call the API.

What Jalapeño Actually Is

Jalapeño is an inference-only ASIC (application-specific integrated circuit). Not a training chip. Inference is what runs every time you call gpt-4o or o3. That's where the compute costs actually land at scale. The chip is built on TSMC's 3nm process node, the same manufacturing tier Apple uses for its A18 Pro. It's a reticle-sized die, meaning it's about as large as a chip can physically be before yield becomes a serious problem at that node. The package includes one large compute chiplet surrounded by eight HBM (high-bandwidth memory) stacks. HBM is what you need for LLM inference: huge memory bandwidth, physically close to the compute. GPUs do this too, b...

GitHub Copilot Switched to Token Billing in June. Some Teams Saw Their Bills Jump Overnight.

GitHub Copilot switched to usage-based billing on June 1, 2026. If your team didn't notice until the invoice arrived, you're not alone. This is the biggest change to Copilot's pricing model since launch, and the developer community's response was clear: over 900 downvotes and 400 comments on GitHub's own announcement thread in the first week. Here's what actually changed, who got hurt, and what to do before next month's bill lands.

What the Old Model Was

The previous system used Premium Request Units (PRUs). Your plan came with a fixed monthly allotment. When you burned through it, Copilot didn't cut you off. It quietly fell back to a lighter base model. You kept working. You just didn't know you'd dropped to a less capable model. That was a reasonable trade-off for predictability. That safety net is gone.

The New Model: AI Credits

Every plan now ships with a monthly AI Credits allowance. One credit costs $0.01. The plan prices didn't...
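To make the credit math concrete, here is a rough sketch of how a team might estimate spend beyond its allowance. The only figure taken from the post is the price of one credit ($0.01). The allowance and usage numbers are hypothetical, and the sketch assumes credits consumed beyond the allowance are billed at that same per-credit rate, which the excerpt does not confirm.

```python
# One AI Credit costs $0.01 (from the post), i.e. exactly one cent.
CENTS_PER_CREDIT = 1


def overage_usd(credits_used: int, monthly_allowance: int) -> float:
    """Estimated charge for credits consumed beyond the plan allowance.

    Assumption: overage is billed at the same $0.01 per credit.
    Working in integer cents avoids floating-point drift.
    """
    extra_credits = max(0, credits_used - monthly_allowance)
    return extra_credits * CENTS_PER_CREDIT / 100


# Hypothetical seat: 3,000 credits included, 4,200 consumed this month.
example_overage = overage_usd(credits_used=4200, monthly_allowance=3000)

if __name__ == "__main__":
    print(f"estimated overage: ${example_overage:.2f}")
```

Running the same function across every seat in an organization gives a quick first estimate of next month's bill before the invoice arrives.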

Concurrent A/B Tests: How to Know When Interaction Effects Actually Matter

If you've run experimentation at any scale, you've hit this scenario. You've got three tests live simultaneously: one on the hero headline, one on the checkout CTA, one on the product page layout. The checkout CTA test shows a 12% lift. You ship it. The lift evaporates. Post-ship numbers look nothing like the test. Your first instinct is novelty effect. But the real culprit might be that the checkout CTA test was running at the same time as the product page layout test, and users who saw both variants behaved differently than those who saw just one. That's an interaction effect. It's one of the least understood problems in applied experimentation, and it's where a lot of phantom wins actually come from.

What an interaction effect actually is

In statistics, an interaction happens when the effect of one variable changes depending on the level of another. In A/B testing, it means the combined effect of two experiments running on overlapping user populations is...
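The definition above can be made concrete with a 2x2 breakdown of two overlapping tests. This is a minimal sketch with made-up conversion rates, not data from any real experiment: it compares the lift from test A among users who did not see test B's variant against the lift from A among users who did. A difference near zero suggests the tests are roughly independent; a large one is an interaction. The function name and all numbers are hypothetical.

```python
def interaction_effect(
    rate_neither: float,
    rate_a_only: float,
    rate_b_only: float,
    rate_both: float,
) -> float:
    """Difference-in-differences estimate of the A x B interaction.

    Each argument is the conversion rate for users in one cell of the
    2x2 grid (saw neither variant, only A, only B, or both).
    """
    lift_a_without_b = rate_a_only - rate_neither
    lift_a_with_b = rate_both - rate_b_only
    return lift_a_with_b - lift_a_without_b


# Hypothetical rates: the CTA variant (A) lifts conversion by 1.2 points
# on the old product page, but by nothing on the new layout (B).
effect = interaction_effect(
    rate_neither=0.100,
    rate_a_only=0.112,
    rate_b_only=0.105,
    rate_both=0.105,
)

if __name__ == "__main__":
    print(f"interaction effect: {effect:+.3f}")
```

In practice each rate carries sampling noise, so the estimate needs a confidence interval (for example from a regression with an interaction term) before it is treated as real.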

MiniMax M3: The Open-Weight Model That Beat GPT-5.5 on Coding for 8x Less

MiniMax released M3 on June 1, 2026, and it's the first open-weight model to genuinely combine three things at once: frontier-level coding performance, a 1M-token context window, and native multimodal input. The interesting part isn't the feature list. It's the architectural trick that makes long-context inference practical at a fraction of what GPT-5.5 costs.

A New Way to Do Attention at Scale

Standard transformer attention scales quadratically with context length, which is why running a full 1M-token window at inference time is usually too expensive to be useful. MiniMax's answer is MSA (MiniMax Sparse Attention), and the mechanics are worth understanding. Instead of computing attention over every token in the context, MSA uses a two-stage process. A lightweight index branch first scans incoming tokens and selects which blocks of the KV cache are actually relevant to the query. The main attention layer then processes only those selected blocks. MiniMax's numbe...
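The two-stage idea described above can be illustrated with a toy block-sparse attention in plain Python. This is a sketch of the general select-then-attend pattern, not MiniMax's implementation: scoring each block by the dot product between the query and the block's mean key is an assumed stand-in for the index branch, and all function names, the block size, and the top-k value are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_blocks(query, keys, block_size, top_k):
    """Stage 1: cheap index pass that picks the top_k most relevant blocks.

    Assumption: a block's relevance is the query dotted with its mean key.
    Returns the starting token index of each selected block, in order.
    """
    scores = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scores.append((dot(query, mean_key), start))
    best = sorted(scores, reverse=True)[:top_k]
    return sorted(start for _, start in best)


def sparse_attention(query, keys, values, block_size, top_k):
    """Stage 2: softmax attention over tokens in the selected blocks only."""
    starts = select_blocks(query, keys, block_size, top_k)
    idx = [i for s in starts for i in range(s, min(s + block_size, len(keys)))]
    scale = math.sqrt(len(query))
    logits = [dot(query, keys[i]) / scale for i in idx]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    out = [
        sum(w * values[i][d] for w, i in zip(weights, idx)) / total
        for d in range(dim)
    ]
    return out, idx


# Toy cache: 8 tokens in 4 blocks of 2; only the second block aligns
# with the query direction.
keys = [[0, 1], [0, 1], [5, 0], [5, 0], [-1, 0], [-1, 0], [0, 0], [0, 0]]
values = [[float(i)] for i in range(8)]
output, attended = sparse_attention([1, 0], keys, values, block_size=2, top_k=1)

if __name__ == "__main__":
    print("attended token indices:", attended)
    print("output:", output)
```

The saving comes from stage 2 touching only `top_k * block_size` tokens instead of the full context; the quality of the result depends entirely on how well the stage 1 scorer finds the blocks that matter.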