Posts

Showing posts with the label DevTools

Gemini 3.5 Flash and the End of 'Use the Biggest Model' for Agents

I've been defaulting to Opus-tier or GPT-5.5 for anything agent-related because that felt like the safe call. Better reasoning, better tool use, better outcomes. Flash-tier models were for batch jobs, summaries, things where you didn't care that much about output quality.

That calculus broke for me after spending time with the Gemini 3.5 Flash benchmarks. The model went GA on May 19 at Google I/O. The number that got my attention: 83.6% on MCP Atlas, a benchmark specifically for multi-step tool orchestration using Model Context Protocol servers. That puts it 8.3 points ahead of GPT-5.5 (75.3%) and 4.5 points ahead of Claude Opus 4.7 on the same eval. "Flash" doesn't mean what it used to.

What MCP Atlas Is Actually Measuring

MCP Atlas tests whether a model can chain together multiple tool calls across MCP servers, recover from partial failures, and complete multi-step tasks without going off-script. It's not a writing or reasoning benchmark. If you're ...

From Single API to Network Intelligence: How Request Scout Changed My Debugging Workflow

A story about building smarter dev tools with your own AI

The Moment It Clicked

Last week, I was debugging a performance issue on a client's site. I had Chrome DevTools open, watching the Network tab with hundreds of requests flying by. I had Gemini in another window, pasting individual API responses, asking "What's this endpoint doing?" and "Are these headers correct?"

Then it hit me: Why am I talking to Gemini about one API at a time, when I should be talking to it about my entire network?

I opened my notebook and sketched something radical: What if I could ask my network tab questions directly?

"Show me all failed API calls"
"Which domains took longest?"
"Find any requests with leaked auth tokens"
"What's the pattern here?"

Three days later, Request Scout was born.

The Problem Nobody Talks About

Network debugging is fragmented:

🕵️ You stare at the Network tab (human-scale = ~100 requests, max)
🔍...
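To make questions like "Show me all failed API calls" or "Which domains took longest?" concrete, here is a minimal Python sketch of what answering them over a captured request list might look like. This is not Request Scout's implementation. The sample records, the field names (`url`, `status`, `time_ms`), the function names, and the set of suspicious query parameter names are all assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import parse_qs, urlparse

# Hypothetical sample data, shaped loosely like simplified network log entries.
REQUESTS = [
    {"url": "https://api.example.com/users?id=7", "status": 200, "time_ms": 120.0},
    {"url": "https://api.example.com/orders", "status": 500, "time_ms": 340.0},
    {"url": "https://cdn.example.net/app.js", "status": 200, "time_ms": 80.0},
    {"url": "https://api.example.com/profile?access_token=abc123", "status": 200, "time_ms": 95.0},
    {"url": "https://tracker.example.org/pixel", "status": 404, "time_ms": 610.0},
]

# Assumed list of query parameter names that suggest a credential in the URL.
SUSPICIOUS_PARAMS = {"access_token", "token", "api_key", "apikey"}


def failed_requests(requests):
    """'Show me all failed API calls': keep entries with a 4xx or 5xx status."""
    return [r for r in requests if r["status"] >= 400]


def total_time_by_domain(requests):
    """'Which domains took longest?': sum time per host, slowest first."""
    totals = defaultdict(float)
    for r in requests:
        totals[urlparse(r["url"]).netloc] += r["time_ms"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def requests_with_token_in_url(requests):
    """'Find any requests with leaked auth tokens': a crude check that only
    looks at query parameter names, not headers or bodies."""
    hits = []
    for r in requests:
        param_names = {name.lower() for name in parse_qs(urlparse(r["url"]).query)}
        if SUSPICIOUS_PARAMS & param_names:
            hits.append(r)
    return hits


if __name__ == "__main__":
    print([r["url"] for r in failed_requests(REQUESTS)])
    print(total_time_by_domain(REQUESTS))
    print([r["url"] for r in requests_with_token_in_url(REQUESTS)])
```

Plain filters like these cover the first three questions. An open-ended one like "What's the pattern here?" is where handing the whole request list to a model, rather than one response at a time, would plausibly come in.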