Claude Sonnet 4.5 Claims the Coding Throne
LAUNCH
1. Claude Sonnet 4.5 Claims the Coding Throne.
Anthropic just shipped Claude Sonnet 4.5 and is calling it the best coding model in the world — strongest for building complex agents, top of the charts on coding benchmarks, and designed to handle the multi-step agentic workflows that are becoming the default way serious teams ship software. The confidence here is notable: not "competitive with" or "approaching" — Anthropic is planting the flag outright. If you're evaluating models for code-heavy agent pipelines, this is your new baseline to test against. (20,052 likes | 3,159 RTs) Read more →
2. ChatGPT Adds Instant Checkout — OpenAI Wants to Be Your Mall.
OpenAI is making a commerce play: ChatGPT now supports instant checkout, letting users go from product discovery to purchase without leaving the conversation. This isn't just a feature — it's a business model pivot. OpenAI has been watching millions of users ask ChatGPT "what should I buy?" and decided to close the loop with a buy button. The affiliate revenue potential here is enormous, and it positions ChatGPT as a direct competitor to Google Shopping, not just Google Search. (10,284 likes | 1,310 RTs) Read more →
3. Mistral Drops Open Speech Recognition Models, Claims SOTA.
Mistral is going after Whisper's crown with a new family of open speech recognition models they're calling the world's best. The "open" part matters — Mistral is doubling down on the strategy of releasing weights while OpenAI keeps tightening access. If you're building voice-first applications and tired of API-only options, these are worth benchmarking on your language and domain. (4,353 likes | 470 RTs) Read more →
MiniMax Ships M2.7 on Hugging Face. MiniMax-M2.7 lands on Hugging Face as a text-generation model, continuing the trend of Chinese AI labs releasing competitive open-weight models. Early downloads are modest (873), but MiniMax has been quietly building a reputation for punching above its weight class. (455 likes | 873 downloads) Read more →
TOOL
4. Claude Code Gets /ultraplan — Build Plans on the Web, Execute Anywhere.
Claude Code just added /ultraplan, and it's a workflow shift: Claude builds a full implementation plan for you on the web, you review and edit it in your browser, then execute it back in your terminal or on the web. This bridges the gap between "thinking about architecture" and "writing code" — you get a collaborative planning phase with a human-in-the-loop before a single line ships. Available now in preview for all Claude Code web users. (10,245 likes | 658 RTs) Read more →
Claude Code Upgrades Plus New Context Management on Developer Platform. Anthropic shipped several upgrades to Claude Code alongside two new context management features on the Claude Developer Platform. The context tools are the interesting part — managing what your agent knows and when is becoming the key bottleneck in agentic workflows, and dedicated platform-level support for it signals where Anthropic thinks the hard problems are. For a deeper look at how Claude Code stacks up against alternatives, see our Claude Code vs Codex comparison. (4,146 likes | 336 RTs) Read more →
Claude Lands in Slack with Web Search and Tool Access. Claude is now available directly in Slack — DMs, thread mentions via @Claude, or the AI assistant panel — with web search, document analysis, and connected tools built in. This is Anthropic's enterprise distribution play: meet knowledge workers where they already live instead of making them context-switch to a separate app. (3,683 likes | 385 RTs) Read more →
RESEARCH
Berkeley Researchers Show How Easy It Is to Game AI Agent Benchmarks. UC Berkeley's Trustworthy AI group published a damning investigation into the most prominent AI agent benchmarks — and showed they can be exploited with relatively simple techniques. The implication is stark: the leaderboard numbers that labs trumpet in launch announcements may reflect optimization for benchmark mechanics rather than genuine capability. If you've been choosing between models based on a 2-point SWE-Bench gap, you might be measuring test-taking strategy, not engineering skill. (488 likes | 124 RTs) Read more →
INSIGHT
OpenAI Flags Axios Supply Chain Scare, Forces macOS App Update. OpenAI disclosed a security incident involving the third-party Axios library — part of a broader industry supply chain attack. No user data was accessed and no systems were compromised, but OpenAI is updating security certifications and requiring all macOS users to update. The real story: even frontier AI labs are vulnerable to dependency chain attacks, and the forced update shows how seriously they're taking it. (5,777 likes | 512 RTs) Read more →
Anthropic Quietly Cut Cache TTL and Developers Noticed. Anthropic reduced their prompt cache time-to-live back on March 6th, and developers have been filing issues about it ever since. Shorter cache TTL means higher costs for applications that rely on cached prompts — and many agentic workflows do. The quiet rollout without announcement drew frustration; developers building cost-sensitive applications need predictable pricing infrastructure, not surprise changes. (461 likes | 355 RTs) Read more →
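To see why a shorter TTL stings, here is a minimal sketch of how prompt caching is typically requested in an Anthropic-style Messages API payload: a large, reused prompt prefix is tagged with a `cache_control` marker so subsequent calls within the TTL window are billed at the cached rate. The model name and field shapes below are illustrative assumptions, not a verified spec.

```python
# Sketch: marking a reusable system prompt as cacheable in an
# Anthropic-style Messages API request. A shorter server-side TTL
# means this cached prefix expires sooner, and later calls pay
# full input-token price again.
def build_cached_request(system_prompt, user_msg,
                         model="claude-sonnet-4-5"):  # assumed model tag
    return {
        "model": model,
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # "ephemeral" marks this block for prompt caching;
                # how long it survives is governed by the provider's TTL.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_msg}],
    }
```

The cost sensitivity comes from the fact that agentic loops resend the same large prefix on every turn; if the cache expires between turns, every one of those resends is billed as fresh input.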
ChatGPT Voice Mode Runs a Weaker Model Than You Think. Simon Willison flags what many users suspected: ChatGPT's voice mode uses a less capable model than what you get in text mode. The trade-off makes sense — voice needs lower latency — but the lack of transparency is the issue. If you're evaluating ChatGPT's capabilities through voice interactions, you're testing a different (and weaker) model than the one on the benchmark charts. Read more →
TECHNIQUE
Run a Free Local Coding Agent with Gemma 4 + Ollama + Claude Code. Here's a setup worth bookmarking: run Gemma 4 through Ollama with Anthropic API compatibility, plug it into Claude Code, and you get a fully local agentic coding workflow — no API costs, no cloud dependency, full control. The tutorial walks through the entire stack. If you've been wanting to experiment with coding agents but don't want to burn through API credits, this is your on-ramp. (301 likes | 42 RTs) Read more →
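The core of setups like this is pointing Claude Code's API calls at a local endpoint instead of Anthropic's servers. A rough sketch of the environment plumbing, under stated assumptions: the base-URL environment variable, the placeholder key, the model tag, and Ollama's default port are all taken from the tutorial's general description and common convention, not verified here.

```python
import os

# Assumed values: Ollama conventionally listens on port 11434; the
# env-var names mirror the Anthropic SDK convention; the model tag
# is a hypothetical local identifier, not a confirmed release name.
LOCAL_BASE = "http://localhost:11434"

ENV = {
    "ANTHROPIC_BASE_URL": LOCAL_BASE,   # redirect API calls to the local server
    "ANTHROPIC_API_KEY": "ollama",      # placeholder; no real key needed locally
    "ANTHROPIC_MODEL": "gemma-local",   # hypothetical tag for the pulled model
}

def apply_env(env=ENV):
    """Export the local-stack settings into the current process env."""
    os.environ.update(env)
    return {k: os.environ[k] for k in env}
```

With these variables exported before launching Claude Code, requests never leave the machine, which is where the "no API costs, no cloud dependency" claim comes from.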
BUILD
narrator-ai-cli-skill Automates Full Movie Commentary Videos via Agent. This open-source agent skill file plugs into Claude Code or similar agent environments and automates the entire movie commentary video pipeline — script generation, scene matching, voice synthesis (63 voices), visual templates, background music, and final render. It ships with a 93-film asset library, cost estimation before each run, and 18 pre-handled API error codes. The engineering is surprisingly mature for a community project. (729 likes | 191 RTs) Read more →
Gemma 4 31B Gets a Fast Quantized Release on Hugging Face. LilaRest/gemma-4-31B-it-NVFP4-turbo brings Google's Gemma 4 31B to consumer hardware via NVFP4 quantization. With 21.6K downloads already, the demand for running capable models locally is clearly there. Pair this with the Ollama setup above and you've got a serious local stack. (169 likes | 21.6K downloads) Read more →
MODEL LITERACY
Benchmark Contamination vs. Benchmark Gaming. With Sonnet 4.5 claiming the coding crown and Berkeley exposing agent benchmark exploits on the same weekend, understanding the difference matters. Benchmark contamination is accidental — training data happens to include test questions, so the model has "seen the exam" without anyone intending it. Benchmark gaming is deliberate — optimizing model behavior specifically for benchmark tasks in ways that don't generalize to real-world performance. Contamination is a data hygiene problem; gaming is a strategy choice. Both inflate scores, but gaming is harder to detect because the model genuinely "solves" the benchmark — just through shortcuts that don't transfer. Next time a lab drops a leaderboard chart, ask: does this score predict performance on my task, or just on their test?
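Contamination, at least, is mechanically checkable. A toy sketch of the standard first-pass test: measure what fraction of a benchmark item's n-grams appear verbatim in the training corpus. Real contamination audits are far more sophisticated (canonical decontamination pipelines, embedding similarity, canary strings); this illustrates only the basic idea.

```python
def ngrams(text, n=3):
    """Set of word-level n-grams in a lowercased text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_score(train_corpus, test_item, n=3):
    """Fraction of the test item's n-grams found verbatim in training docs.

    A score near 1.0 suggests the model may have 'seen the exam';
    a score near 0.0 rules out only verbatim overlap, nothing more.
    """
    train = set()
    for doc in train_corpus:
        train |= ngrams(doc, n)
    test = ngrams(test_item, n)
    if not test:
        return 0.0
    return len(test & train) / len(test)
```

Note the asymmetry the explainer describes: a check like this can catch contamination, but it says nothing about gaming, because a gamed model scores well on text it has genuinely never seen.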
QUICK LINKS
- ChatGPT Goals: Persistent task tracking comes to ChatGPT — set goals and let it follow up over time. (6,015 likes | 413 RTs) Link
- "Claude Is Eating Startups": The meme that launched a thousand nervous pitch decks goes viral. (424 likes) Link
- AI Engineer Europe 2026: Latent Space recaps the first AIE conference in London. Link
- SQLite 3.53.0: New release of everyone's favorite embedded database. Link
- Mistral's "European AI" Playbook: Mistral publishes its strategic vision for European AI sovereignty. (132 likes | 67 RTs) Link
- SQLite Query Result Formatter: Neat demo for formatting SQLite output. Link
PICK OF THE DAY
Berkeley's benchmark exploits reveal that the leaderboards driving billion-dollar model races may be measuring optimization tricks more than genuine capability. This is the most important story of the week, and the timing is almost poetic — it drops the same weekend Anthropic declares Sonnet 4.5 the best coding model in the world and every other lab is jockeying for position on the same benchmarks. Berkeley's researchers didn't just theorize about benchmark fragility; they demonstrated concrete exploits against the most prominent AI agent evaluations. The implications ripple outward: enterprise buyers choosing models by SWE-Bench score, VCs valuing companies by benchmark position, developers picking tools based on leaderboard rank — all of them are making decisions on potentially shaky ground. This doesn't mean benchmarks are useless, but it means the gap between "scores well on a test" and "works well on your problem" may be wider than anyone is comfortable admitting. The fix isn't to abandon benchmarks — it's to run your own evals, on your own data, on your own infrastructure. Anything else is trusting someone else's exam to predict your employees' performance. Read more →
Until next time ✌️