Awesome Agents Weekly: Anthropic Overtakes OpenAI, MCP Under Fire
Your weekly roundup of the most important AI developments, benchmarks, and tools.
This week had two dominant stories running in parallel: a historic revenue flip at the top of the AI industry, and a wave of serious security findings across the agent ecosystem. Anthropic crossed $30B ARR and overtook OpenAI for the first time - while researchers and auditors found critical flaws in MCP servers, agent routers, and at least two major platforms. Claude Opus 4.7 also shipped, Amazon doubled down with a $25B commitment, and the Stanford AI Index dropped its bleakest transparency data yet.
Pick of the Week
Anthropic Passes OpenAI in Revenue at $30B ARR
This one matters beyond the number itself. For the first time since ChatGPT launched generative AI into the mainstream, a rival has outpaced OpenAI on revenue - and that rival is Anthropic, which has spent most of its life positioning itself as the safety-first alternative to the move-fast crowd. Crossing $30B ARR, ahead of OpenAI's $24B run rate, is both a commercial milestone and a signal about what enterprise buyers actually want. The timing - simultaneous with Amazon's additional $25B commitment and news of Dario Amodei meeting senior White House officials - makes it hard to read this as anything other than a genuine inflection point for the industry.
This Week on Awesome Agents
News
Industry and Business
- Anthropic Passes OpenAI in Revenue at $30B ARR - Anthropic's annualized revenue crossed $30 billion in April, overtaking OpenAI's $24 billion run rate for the first time.
- Amazon Bets $25B on Anthropic and 5GW of Trainium - Amazon adds up to $25B more to its Anthropic stake, with Anthropic committing over $100 billion to AWS infrastructure over the next decade.
- Stanford 2026 AI Index: Cash In, Transparency Out - Global AI investment hit $581B in 2025 while foundation model transparency scores fell by a third.
- AI Labs Are Losing Billions - Here's Who Really Pays - OpenAI burned $2.5B cash on $4.3B revenue in H1 2025; Anthropic cut gross margin forecasts from 50% to 40%.
- Cursor Targets $50B Valuation - Enterprise Now Pays the Bills - Cursor is in advanced talks for a $2B+ raise at $50B pre-money, nearly double its November figure, with enterprise clients now driving 60% of revenue.
- Factory Raises $150M to Scale Enterprise AI Droids - Factory closed a $150M Series C at $1.5B to expand autonomous agents handling full software development lifecycles.
- TSMC Q1: $35.9B Record as AI Now Powers 61% of Revenue - AI and HPC now account for 61% of TSMC's wafer sales, with CoWoS packaging still fully booked.
- 74% of AI's Gains Flow to Just 20% of Firms - PwC - A PwC survey of 1,217 executives finds 74% of AI's economic returns concentrate in just 20% of companies, while 56% of CEOs report no measurable benefit.
Model Releases
- Claude Opus 4.7 Is Here - Less Supervision, Better Vision - Anthropic releases Claude Opus 4.7 with 3x higher-resolution vision, a new xhigh effort level, task budgets for cost control, and cyber safeguards.
- Kimi K2.6 - Open Weights, 300 Agents, Top Coding Score - Moonshot AI releases Kimi K2.6 under Modified MIT with open weights, 300-agent swarm execution, and the top SWE-Bench Pro score among open models.
- Alibaba's Qwen3.6-Max Ships Closed - Tops Six Coding Evals - Alibaba's first closed-weights flagship ranks third globally on the Artificial Analysis Intelligence Index while topping six coding benchmarks.
- OpenAI Releases GPT-Rosalind for Drug Discovery - A frontier reasoning model for biology that outranked human experts on RNA prediction and competes directly with AlphaFold.
- Physical Intelligence Launches π0.7 for Untrained Tasks - PI's robot model generalizes to never-trained tasks by recombining skills compositionally, matching specialist fine-tunes.
- NVIDIA Lyra 2.0 - Explorable 3D Worlds from One Photo - NVIDIA's Spatial Intelligence Lab released Lyra 2.0, a 14B model that turns a single photograph into a navigable 3D environment under a research-only license.
Security
- MCP's STDIO Flaw Puts 200K AI Servers at Risk - Ox Security found MCP's STDIO transport executes arbitrary OS commands before validating the server, exposing 200K+ instances across every major AI coding tool.
- The Claw Security Ledger: 10 Products in the Dock - An audit of ten Claw-branded AI agent products found 11 live CVEs, 130 published advisories, 1,184 malicious marketplace skills, and one leaked SSL private key - concentrated almost entirely in a single vendor.
- 9 of 428 LLM Routers Were Secretly Hijacking Agent Calls - UC Santa Barbara researchers found 9 of 428 third-party LLM routers injecting malicious tool calls, draining crypto, and stealing AWS credentials from agent sessions.
- Lovable Users Report Leak of Chats, Code, Credentials - Free Lovable accounts can still read other users' AI chat histories, source code, and database credentials on projects created before November 2025.
- Vercel Breach Traced to AI Office Suite OAuth Token Theft - Vercel confirms an April 19 intrusion that pivoted from compromised Context.ai OAuth tokens into internal systems holding customer environment variables.
- MCP Marketplace Audit: 32% of Servers Are Stale - An audit of 11,447 MCP servers across four registries found nearly a third haven't been touched in six months.
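To put two of the figures above in proportion, here is a small arithmetic sketch. It uses only the numbers quoted in the items (9 of 428 routers; 32% of 11,447 MCP servers); the variable names are ours, and the stale-server count is a rounded estimate, not a number taken from the audit.

```python
def share(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole

# Figures quoted in the Security items above.
hijacking_routers, routers_tested = 9, 428
mcp_servers_audited, stale_fraction = 11_447, 0.32

router_rate = share(hijacking_routers, routers_tested)
# Rounded estimate derived from the quoted 32%; not an exact audit count.
stale_servers = round(mcp_servers_audited * stale_fraction)

print(f"Hijacking routers: {router_rate:.1f}% of those tested")  # 2.1%
print(f"Stale MCP servers: roughly {stale_servers:,}")           # roughly 3,663
```

In other words, about one in fifty of the routers tested was malicious, while roughly 3,700 of the registry-listed MCP servers would count as stale.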
Policy and Geopolitics
- NSA Uses Mythos Even as Pentagon Blacklists Anthropic - NSA is running Anthropic's Mythos Preview while its parent department, the Pentagon, fights to keep Anthropic out of federal systems.
- The Left Hand Bans What the Right Hand Deploys - The Trump administration is simultaneously suing Anthropic over a supply chain risk designation and sending Treasury officials to convince banks to use Claude.
- Trump Says 'Who?' as His Own Staff Courts Anthropic - Dario Amodei met with Susie Wiles and Scott Bessent at the White House while Trump, on a Phoenix runway, said he had "no idea" about the meeting.
- Japan Forms $6B AI Alliance to Rival US and China - SoftBank, Sony, Honda, and NEC formed Japan AI Foundation Model Development with $6.3B in government backing to build a trillion-parameter physical AI model.
- Google Bids for Pentagon's Classified Gemini Contract - Google is negotiating to deploy Gemini on classified Pentagon networks - the same tier Anthropic was blacklisted for refusing to serve without safeguards.
Infrastructure and Developer Tools
- Snap Fires 1,000 as AI Now Writes 65% of Its Code - Snap cut 16% of its workforce citing AI-produced code as the direct cause; the stock jumped, 1,000 employees didn't.
- A $900 RTX 3090 Now Beats an M5 Max at LLM Inference - Researchers fused all 24 layers of Qwen 3.5-0.8B into a single CUDA kernel, delivering 1.8x the throughput of an M5 Max - the gap was software, not silicon.
- Linux Kernel Finally Sets Rules for AI-Assisted Code - Linux 7.0 ships an official AI code policy: disclose AI tool usage with an Assisted-by tag and keep humans accountable for every line.
- Anthropic Safety Overseer Gets Board Majority at Last - Anthropic's Long-Term Benefit Trust appointed Novartis CEO Vas Narasimhan, giving its independent safety overseers a board majority for the first time.
- OpenAI Gives Codex Desktop Control and 111 Plugins - Codex now runs background computer use on Mac, adds in-app browsing, image generation via gpt-image-1.5, and 111 new plugins.
- Claude Code Desktop Gets a Ground-Up Rebuild for Parallel Work - Anthropic rebuilt Claude Code's desktop app with an integrated terminal, in-app file editing, a diff viewer, SSH on Mac, and parallel session management.
- Cal.com Closes Its Source Code, Blames AI Hackers - Cal.com moved its core codebase private after five years of open source, arguing AI tools make public code 5-10x easier to exploit.
- Claude Beat Human Alignment Researchers - Then Failed - Nine Claude Opus 4.6 agents hit 97% on a core alignment benchmark vs. 23% for humans - then showed no statistically significant improvement in production.
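For teams that want to mirror the kernel's disclosure rule in their own repos, the check could look something like the sketch below. It assumes the tag appears as a git-style trailer line of the form `Assisted-by: <tool>`; the exact format the kernel policy requires is not spelled out in the item above, and the function name, the sample commit message, and the tool name are all hypothetical.

```python
import re

# Assumption: the disclosure is a trailer line such as "Assisted-by: <tool>".
# The precise format mandated by the kernel policy may differ.
ASSISTED_BY = re.compile(r"^Assisted-by:[ \t]*(\S.*)$", re.MULTILINE)

def assisted_by_tools(commit_message: str) -> list[str]:
    """Return the tools named in Assisted-by trailer lines, if any."""
    return [match.strip() for match in ASSISTED_BY.findall(commit_message)]

# Hypothetical commit message, for illustration only.
example = """mm: fix off-by-one in example helper

Signed-off-by: Jane Developer <jane@example.org>
Assisted-by: ExampleAssistant
"""

print(assisted_by_tools(example))  # ['ExampleAssistant']
```

A check like this only confirms that a disclosure line is present; it says nothing about whether a human actually reviewed the code, which is the part of the policy that carries the accountability.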
Reviews
- Claude Opus 4.7 Review: Coding Giant, Mixed Signals - Leads SWE-bench and agent benchmarks but regresses on web research, inflates token costs by up to 35%, and trades prose quality for literal instruction-following.
- GPT-5.4-Cyber Review: Defensive AI, Controlled Access - A fine-tuned defensive security model with lowered refusal thresholds and binary reverse engineering, but access is identity-gated through the Trusted Access for Cyber program.
- GLM-5.1 Review: Open-Source Model Tops SWE-Bench Pro - Z.ai's 754B open-weight model claims the top spot on SWE-Bench Pro without a single NVIDIA chip - here's how it holds up in practice.
Guides
- How to Use AI for Travel Planning in 2026 - A beginner's guide to building full trip itineraries with AI, from destination selection to day-by-day schedules and packing lists.
- How to Build AI Presentations - A Beginner's Guide - How to use Gamma, Canva, and PowerPoint Copilot to build polished decks in minutes, even without design experience.
Tools
This week brought a large directory refresh across 40+ tool categories. A few standout additions:
- Best Open-Weights AI Models 2026 - Top picks by size tier, from 400B+ MoE giants to 1B edge models, with benchmark scores and deployment hardware.
- Best AI Observability Tools 2026 - LangSmith, Langfuse, Arize Phoenix, WhyLabs, and more compared across LLM tracing, eval, and production monitoring.
- Best Open-Source LLM Inference Servers 2026 - vLLM, SGLang, TGI, llama.cpp, and TensorRT-LLM benchmarked head to head.
- Best AI Deep Research Tools 2026 - OpenAI, Claude, Perplexity, Gemini, Grok, Exa, and Elicit compared for accuracy and pricing.
- Best AI Fine-Tuning Platforms 2026 - 14 managed and open-source platforms with verified pricing, supported methods, and a decision matrix.
Leaderboards
A full leaderboard refresh dropped across 20+ benchmark categories. Key updates:
- Overall LLM Rankings: April 2026 - Comprehensive ranking combining reasoning, coding, knowledge, and cost-adjusted value across 12 frontier and open-weight models, updated with Claude Opus 4.7 and Qwen 3.6.
- SWE-Bench Coding Agent Leaderboard 2026 - Pass rates, pricing, and scaffold notes for the top software engineering agents, updated with Claude Opus 4.7 and Kimi K2.6.
- Jailbreak and Red-Team Resistance Leaderboard - How 14 frontier LLMs hold up against adversarial prompts, injection, and harmful-behavior elicitation across HarmBench, AdvBench, and AgentHarm.
- Web Agent Benchmarks Leaderboard: Apr 2026 - Verified scores for browser-driving AI agents across WebArena, WebVoyager, BrowseComp, Mind2Web, and more.
- Vision-Language Benchmarks: Image Reasoning Ranked - AI model rankings on MMMU, MathVista, ChartQA, DocVQA, and more, updated to reflect Claude Opus 4.7's vision improvements.
Science
- LeCun's JEPA World Model Plans 47x Faster on One GPU - LeWorldModel strips JEPA world models to two loss terms, trains 15M parameters on a single GPU in hours, and plans roughly 47x faster than DINO-WM.
- Distillation Leaks, Weak Agents, and Research Sabotage - New papers show distillation silently transfers unsafe behaviors, weak agents bottleneck multi-agent pipelines, and frontier AI can't reliably audit sabotaged ML research.
- MoE Routing, Prompt Gambles, and Where Reasoning Breaks - Three papers challenge assumptions in MoE routing design, prompt optimization workflows, and LLM reasoning chains.
- LLM Chaos, AI Peer Review, and Auto Fine-Tuning - Floating-point chaos in transformers, GPT-5 reviewing 22,977 AAAI papers, and an agent that automates LLM fine-tuning better than human experts.
- Compact Contexts, Smarter Fine-Tuning, and the Solver Trap - A joint fix for KV cache bloat and attention cost, new evidence that fine-tuning belongs in the middle of a transformer, and why stronger reasoning hurts behavioral simulation.
- MoE Myths, Context Compression, and Steering Proofs - Three papers challenge how we think about MoE expert routing, LLM context management, and the limits of activation steering.
Models
- Claude Opus 4.7 - Anthropic's latest flagship with 3x higher-resolution vision, xhigh effort level, and 13% better coding at unchanged pricing.
- Kimi K2.6 - Moonshot AI's 1T-parameter MoE with 32B active per token, 300-agent swarm execution, and the top SWE-Bench Pro score among open weights.
- Qwen3.6-Max-Preview - Alibaba's first closed-weights flagship with 256K context, topping six agentic coding benchmarks and ranking third on the global intelligence index.
- Qwen 3.6-35B-A3B - A 35B sparse MoE activating only 3B parameters per token, scoring 73.4% on SWE-bench Verified with vision and video support under Apache 2.0.
- GPT-5.4-Cyber - OpenAI's defensive security fine-tune with 88.23% on professional CTFs, access gated through the Trusted Access for Cyber program.
- GPT-Rosalind - OpenAI's first domain-specific reasoning model for biology and drug discovery, with a 0.751 BixBench score in US-only research preview.
- Veo 3.1 - Google DeepMind's 4K video model with native audio, now free for every Google account at 10 clips per month via Google Vids.
- Gemini 3.1 Flash TTS - Google's voice model with 30 voices, 70+ languages, 200+ inline audio tags, and Elo 1,211 on the Artificial Analysis TTS Arena.
- EXAONE 4.5 - LG AI Research's 33B open-weight vision-language model with 262K context and STEM scores above GPT-5-mini, under a non-commercial research license.
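The two MoE entries above quote both total and active parameter counts, so the sparsity is easy to work out. A quick sketch, using only the figures in the list (1T total and 32B active for Kimi K2.6; 35B total and 3B active for Qwen 3.6-35B-A3B) and treating "1T" as 1,000B, which is our rounding assumption:

```python
# (total parameters, active parameters per token), both in billions,
# as quoted in the Models list above. "1T" is taken as 1,000B.
moe_models = {
    "Kimi K2.6": (1000, 32),
    "Qwen 3.6-35B-A3B": (35, 3),
}

def active_share(total_b: float, active_b: float) -> float:
    """Percentage of parameters active for each token."""
    return 100.0 * active_b / total_b

for name, (total_b, active_b) in moe_models.items():
    print(f"{name}: {active_share(total_b, active_b):.1f}% active per token")
# Kimi K2.6: 3.2% active per token
# Qwen 3.6-35B-A3B: 8.6% active per token
```

Active share is only a rough proxy for per-token compute; memory footprint still scales with the total parameter count.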
Elena Marchetti, Senior AI Editor
Awesome Agents - AI news, benchmarks, and tools for practitioners