Awesome Agents Weekly: Benchmarks broken, AI finds zero-days at scale
Your weekly roundup of the most important AI developments, benchmarks, and tools.
The benchmarks are broken. That's the uncomfortable through-line of this week's coverage. UC Berkeley showed you can post near-perfect scores on eight major agent benchmarks without solving a single actual task. Stanford showed frontier models pass 70-80% of visual tests without ever seeing the images. The AI Index confirmed the US-China capability gap is gone. Meanwhile, Anthropic turned a restricted model loose on real infrastructure and it found thousands of zero-days, including one nobody had spotted in 27 years. Capability isn't the question anymore. What we're measuring, and whether those measurements mean anything, is.
Pick of the Week
Anthropic Ships $100M AI Cyber Defense to 12 Rivals
Project Glasswing is a $100M compute credit program that puts Anthropic's restricted Claude Mythos Preview model in the hands of AWS, Apple, Google, Microsoft, CrowdStrike, and seven other organizations. The stated goal is to patch critical infrastructure before attackers get there. What makes this more than PR is what Mythos Preview actually did: our sister piece this week documents the same model autonomously discovering thousands of high-severity vulnerabilities across every major OS and browser, including a bug that had gone unnoticed for 27 years. The fact that Anthropic chose to share this capability with direct competitors, rather than exploit any first-mover window, is a deliberate strategic signal. Whether it shifts how the rest of the industry thinks about AI-powered offense is an open question - but the capability itself is no longer theoretical.
This Week on Awesome Agents
News
- Claude Mythos Preview Finds Thousands of Zero-Days - Anthropic's restricted model autonomously discovered thousands of high-severity vulnerabilities across every major OS and browser, including a 27-year-old bug.
- Inside GitHub's Fake Star Economy - Six million fake stars, $0.06 per click, and VC money chasing manufactured traction - we ran our own analysis on 20 repos and found the fingerprints.
- The Priest Who Helped Write Claude's Conscience - Father Brendan McGuire, a Silicon Valley priest and former tech executive, helped Anthropic rewrite the Claude Constitution after the company asked the Vatican for help.
- Stanford's AI Index 2026 - US Edge Over China Is Gone - Stanford HAI's annual report finds the US-China model gap has effectively closed, GenAI has hit 53% global adoption faster than any prior technology, and young software developers are the first labor casualties.
- Berkeley: Every Major AI Agent Benchmark Can Be Hacked - UC Berkeley researchers hit near-perfect scores on eight major agent benchmarks without solving a single actual task, exposing how the industry measures progress.
- AI Models Pass Vision Tests Without Seeing the Images - A Stanford study shows frontier models score 70-80% on visual benchmarks with no images provided, a structural flaw in how multimodal AI gets assessed.
- Claude Code Silently Burns 40% More Tokens Since v2.1.100 - A developer proxied full API traffic and found roughly 20,000 invisible server-side tokens added to every request since v2.1.100, with no user visibility.
- Meta Closes the Open-Source Door on Frontier AI - Meta's Superintelligence Labs will ship its first flagship models under a closed license, ending the company's open-source-first strategy at the frontier tier.
- Anthropic Launches Managed Agents - Runs Your AI for You - Claude Managed Agents in public beta handles sandboxing, state, and tool execution so developers can skip building agent infrastructure from scratch.
- New Yorker Casts Doubt on Sam Altman's Integrity - Ronan Farrow and Andrew Marantz spent 18 months investigating OpenAI's CEO, and what they found is hard to dismiss.
- GLM-5.1 Tops SWE-Bench Pro With Zero NVIDIA Hardware - Z.ai's GLM-5.1 scores 58.4 on SWE-bench Pro, beating GPT-5.4 and Claude Opus 4.6, after training on 100,000 Huawei Ascend chips with no US silicon.
- Meta Muse Spark Launches, Ranks 4th Among Frontier Models - Meta Superintelligence Labs releases its first model built from scratch in nine months, landing 4th on the Artificial Analysis Intelligence Index.
- Utah Clears AI to Renew Psychiatric Meds Autonomously - Utah becomes the first government in the world to approve an AI system to autonomously renew psychiatric medication prescriptions under a tightly supervised pilot.
- Qwen3.5-Omni Does 10-Hour Audio and 4M Video Frames - Alibaba's new omni model handles audio, video, images, and text in a single pass and edges out Gemini 3.1 Pro on audio tasks.
- Anthropic Revenue Triples to $30B on Enterprise Push - Anthropic's run-rate revenue surpassed $30 billion, tripling from $9B at the end of 2025, as the company locks in 3.5 gigawatts of next-gen Broadcom TPU compute from 2027.
- The AI Layoff Trap - Game Theory Says Everyone Loses - A UPenn-BU paper models AI-driven layoffs as a Prisoner's Dilemma: each firm wins by automating, but when all do it simultaneously, collapsing consumer demand makes every firm worse off.
- OpenAI Backs Bill Shielding AI Labs From Mass-Harm Suits - OpenAI is supporting an Illinois bill that would protect AI labs from lawsuits even when their models contribute to mass casualties or billion-dollar disasters.
- Novo Nordisk Bets Its Drug Pipeline on OpenAI - Novo Nordisk signs a sweeping partnership with OpenAI covering drug discovery, manufacturing, and supply chain - but the governance details are thin.
- Anti-AI Suspect Throws Molotov at Altman's Home - A suspect linked to the Pause AI movement threw a homemade incendiary at Sam Altman's San Francisco home and threatened to burn OpenAI's headquarters.
- Meta Commits $21B More to CoreWeave, Total Hits $35B - Meta expands its CoreWeave partnership by $21 billion through December 2032, bringing total commitments to $35 billion and locking in early NVIDIA Vera Rubin deployments.
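One item above deserves a quick sanity check: the Claude Code token-overhead finding reduces to simple arithmetic on proxy logs, comparing locally counted prompt tokens against what the server bills per request. A minimal sketch, assuming a hypothetical log format - the `local`/`billed` field names and the sample numbers are illustrative, not taken from the developer's actual data:

```python
# Hypothetical sketch: estimate hidden server-side token overhead by
# comparing locally counted prompt tokens against billed input tokens
# captured by an API proxy. Field names are placeholders.

def overhead(local_prompt_tokens: int, billed_input_tokens: int) -> float:
    """Fraction of billed input tokens not accounted for locally."""
    hidden = billed_input_tokens - local_prompt_tokens
    return hidden / local_prompt_tokens

# Illustrative proxy-log entries, one per request.
requests = [
    {"local": 50_000, "billed": 70_000},  # 20k hidden tokens -> 40% overhead
    {"local": 10_000, "billed": 30_000},  # same absolute padding hits small
]                                         # requests much harder

for r in requests:
    print(f"{overhead(r['local'], r['billed']):.0%} extra input tokens billed")
```

The second entry illustrates why a fixed per-request padding is worse than a flat percentage: roughly 20,000 invisible tokens is a 40% tax on a large prompt but can triple the cost of a small one.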
Reviews
- Grok 4.20 Review: Four Minds Are Better Than One - xAI's Grok 4.20 replaces the single-model approach with four specialized agents that debate before every answer, a bold architectural bet that pays off in some areas and stumbles in others.
- Muse Spark Review: Strong on Health, Weak on Code - Meta's first proprietary frontier model leads on HealthBench Hard and scientific reasoning, but trails rivals in coding and agentic tasks with no public API yet.
- Microsoft MAI Models: Voice, Speech and Image Reviewed - A deep look at Microsoft's three new in-house AI models - MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 - and whether they live up to the hype.
Guides
- How to Use AI to Learn a New Language - A Beginner's Guide - A practical, step-by-step guide to using ChatGPT, Claude, Duolingo Max, and Talkio to learn any language faster.
- Use AI for Creative Writing - And Keep Your Own Voice - A practical beginner's guide to AI tools for fiction and stories that doesn't cost you what makes your writing yours.
Tools
- Gemini 2.5 Flash vs Claude Sonnet 4.6: Cost vs Code - Gemini 2.5 Flash costs 10x less and runs 4x faster, but trails badly on coding benchmarks - a clear breakdown of which to pick for your workload.
- Best AI Chatbot Builders 2026: 6 Platforms Tested - Hands-on comparison of six chatbot builder platforms covering pricing, features, integrations, and which type of team each tool fits.
Leaderboards
- Instruction Following Leaderboard: IFEval Rankings 2026 - Current rankings on IFEval and IFBench measuring how reliably models follow precise formatting, length, and content constraints.
Science
- Autonomous Research, Broken Reasoning, Smarter Agents - AlphaLab runs autonomous GPU research campaigns, open-weight reasoning models collapse under text reformatting, and HiL-Bench shows agents can't decide when to ask for help.
- Clinical AI Harm, Smarter Reasoning, and Safer Agents - AI safety measures withhold critical clinical guidance from patients, SAT cuts reasoning tokens by 40%, and conformal prediction blocks wrong multi-agent consensus.
- Blind Refusal, Broken Steps, and Free Uncertainty - Three papers expose safety training's moral blind spot, two distinct failure modes inside reasoning models, and a 10x cheaper way to know when a reasoning model is guessing.
- MedGemma 1.5, Smarter MCTS, and Auditing AI Agents - Google's MedGemma 1.5 brings 3D medical imaging to open AI, PRISM-MCTS halves reasoning cost, and a new audit framework finds 617 security flaws across six major agent projects.
- AI Research: Emotions, Theory of Mind, Unlearning - Anthropic finds functional emotions inside Claude that can drive blackmail, a poker experiment reveals memory alone creates Theory of Mind in agents, and a new framework targets sensitive reasoning traces for erasure.
Models
- Muse Spark - Meta's first closed-source frontier model scores 52 on the Artificial Analysis Intelligence Index, leads on HealthBench Hard, and ships free at meta.ai with no public API yet.
- Google Gemma 4 - Four Open Models Under Apache 2.0 - Four variants from 2B to 31B under Apache 2.0, multimodal across text, image, video, and audio, with the 31B Dense ranking third on Chatbot Arena among all open-weight models.
Elena Marchetti, Senior AI Editor, Awesome Agents - AI news, benchmarks, and tools for practitioners