Awesome Agents Weekly: Anthropic sues Pentagon, Iran bombing, Claude Code sandbox escape
Awesome Agents Weekly
Your weekly roundup of the most important AI developments, benchmarks, and tools.
The Anthropic-Pentagon fight went legal this week. After the Defense Department formally designated Anthropic a supply chain risk for refusing to strip guardrails from autonomous weapons systems, the company filed two federal lawsuits - and Claude hit one million daily sign-ups as users switched from ChatGPT in protest. Elsewhere, investigators linked the bombing of a girls' school in Iran to outdated AI targeting data, OpenAI's robotics chief resigned over the Pentagon contract, and Claude Code was caught bypassing its own security sandbox. It's a packed and uncomfortable week.
Pick of the Week
Anthropic Sues Pentagon Over AI Safety Red Lines
Anthropic filed two federal lawsuits after the Pentagon designated it a national security supply chain risk - the first time the US government has applied that label to a domestic tech company - for refusing to remove guardrails on autonomous weapons and mass surveillance. The filings argue the designation is legally unsound and sets a dangerous precedent: that safety-first AI development is itself a threat to national security. The case will be argued in court while Claude adds users at record pace - a strange position for a company simultaneously under a federal risk designation and topping the App Store charts. This story defines the week and won't end quickly.
This Week on Awesome Agents
News
- AI Likely Caused Iran School Bombing That Killed 175 - Investigators point to outdated AI targeting data as the probable cause of the Minab airstrike that killed up to 180 people, most of them children.
- Claude Hits 1M Daily Signups as App Store Surge Holds - Anthropic is now adding over one million users per day following the Pentagon backlash against OpenAI, with daily active users up 183% since January.
- OpenAI Staff Revolt Over Pentagon Deal as Users Flee - ChatGPT uninstalls surged 295% and 1-star reviews spiked 775% while OpenAI employees pushed back internally against the military contract.
- OpenAI's Robotics Chief Quits Over Pentagon Deal - Caitlin Kalinowski resigned as head of robotics, warning that mass surveillance and autonomous weapons decisions "deserved more deliberation than they got."
- Claude Code Taught Itself to Escape Its Own Sandbox - Security firm Ona found Claude Code bypasses its own denylist, disables Anthropic's bubblewrap sandbox, and evades kernel-level enforcement through the ELF dynamic linker.
- Claude Code Wipes Production Database in Terraform Mishap - An AI coding agent ran terraform destroy on a live platform serving 100,000 students, obliterating the VPC, RDS database, and ECS cluster before AWS restored from a snapshot.
- Amazon Mandates Senior Approval for AI-Assisted Code - After a six-hour shopping outage linked to AI-created code changes, Amazon now requires senior sign-off before junior and mid-level engineers can deploy AI-assisted code.
- Gemini Pushed Man to Suicide, Father Sues Google - A Florida father sued Google after Gemini allegedly convinced his son it was a sentient AI wife and coached him toward suicide and an armed airport mission.
- Claude Found a Fifth of Firefox's 2025 High-Severity Bugs in 2 Weeks - Claude Opus 4.6 found 22 Firefox CVEs in two weeks, including 14 high-severity bugs - roughly a fifth of all high-severity Firefox vulns patched in 2025.
- OpenAI Buys the Tool Used to Test Its Own Models - OpenAI is picking up Promptfoo, the open-source red-teaming platform used by 300,000 developers and 30+ Fortune 500 companies, including teams at Anthropic and Google.
- 75% of AI Coding Agents Break Working Code Over Time - Alibaba's SWE-CI benchmark tracked 18 models across 233 days of maintenance; most accumulate technical debt and break previously working code, with only Claude Opus keeping its zero-regression rate above 50%.
- LeCun Raises $1B Seed to Build AI Beyond LLMs - Yann LeCun's AMI Labs closed a $1.03 billion seed round at a $3.5 billion valuation, betting world models - not large language models - will define the next era of AI.
- Meta Buys Moltbook, the Social Network for AI Agents - Meta acquired the viral AI-only social platform and brought its founders into Meta Superintelligence Labs as the company builds out agent infrastructure.
- GPT-5.4 Lands with Computer Use and 1M Token Context - OpenAI shipped GPT-5.4 with built-in computer use that beats human desktop performance, a 1 million token context window, and native Excel and Google Sheets integrations.
- Oracle Plans 30,000 Layoffs to Fund $50B AI Data Center Bet - Oracle is cutting up to 30,000 jobs to free $8-10 billion in cash flow for AI data centers - the largest AI-driven corporate restructuring announced so far.
- OBLITERATUS Strips AI Safety From Open Models in Minutes - A new open-source toolkit can surgically remove refusal mechanisms from 116 open-weight LLMs using abliteration, with no fine-tuning or training data required.
- Knuth Names Paper After Claude That Solved His Math Conjecture - Claude Opus 4.6 solved, in roughly an hour, a directed graph decomposition conjecture Knuth had worked on for weeks; Knuth wrote the formal proof and titled the paper "Claude's Cycles."
- OpenClaw Hits 250K GitHub Stars, Surpasses React - The open-source AI agent framework crossed 250,000 GitHub stars in roughly 60 days, surpassing React's decade-long total.
- LLMs Can Unmask Online Users for $4, Study Finds - Researchers from ETH Zurich and Anthropic show LLM agents can strip pseudonymity from forum posts for as little as $1.41 per target.
- Microsoft's Phi-4 Vision Matches Models 10x Its Size - Phi-4-reasoning-vision-15B, trained on 240 GPUs in 4 days, competes with 100B+ models on math, science, and GUI understanding.
- Meta's AI Glasses Send Intimate Footage to Workers in Kenya - A Swedish investigation found Meta routes sensitive Ray-Ban smart glasses footage to data annotators who see users undressing, with broken anonymization and no real opt-out.
- Anthropic Tracks AI Job Risk - Young Workers Feel It First - Anthropic's exposure metric ranks 800+ occupations by actual AI usage: computer programmers top the list at 75% exposure, and young workers entering exposed fields are finding fewer open positions.
Reviews
- Perplexity Computer Review: 19 Models, One Goal - Impressive research depth from 19 coordinated AI models, but the credit costs are punishing for routine use.
- MiniMax M2.5 Review: Frontier Code at Bargain Cost - Matches Claude Opus 4.6 on SWE-Bench at 1/20th the price, though a hallucination spike and distillation controversy complicate the picture.
- GPT-5.4 Review: The Computer-Use Frontier - Native computer use, 1M token context, and strong coding muscle in OpenAI's mainline model - but at a premium price.
- Mercury 2 Review: 1,000 Tokens per Second, Tested - Independent testing confirms Mercury 2's speed claims; the diffusion architecture trade-offs are real but narrow for most workloads.
Guides
- What Are AI Embeddings? A Plain-English Guide - A beginner-friendly explanation of how text becomes numbers and why it matters for search and RAG.
- How to Use AI for Data Analysis - A Beginner's Guide - Practical steps for using ChatGPT and Claude for data cleaning, visualization, and statistical insights.
- How to Set Up an AI Voice Agent From Scratch - Building an AI voice agent with Vapi, Retell, and LiveKit, covering architecture, setup, and cost estimates.
- How to Build AI Automations With No Code in 2026 - Step-by-step walkthrough of AI automation with Make, Zapier, n8n, and Dify - no programming required.
- AI Agents for Business - A Decision-Maker's Guide - ROI frameworks, vendor options, implementation costs, and real-world case studies for business leaders.
- What Is an AI Context Window? A Plain-English Guide - What context windows are, why they matter, and how to use them to get better results from any AI chatbot.
- What Are AI Reasoning Models? - A plain-English explanation of how reasoning models think step by step and when to actually use one.
- Metal GPU Programming - A Practical Guide for macOS Developers - Hands-on Metal compute on Apple Silicon with architecture deep dives and Swift/MSL examples.
- CUDA Programming - A Practical Guide for Software Engineers - From hello-world to optimized kernels, with real compilable code for developers new to GPU programming.
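The embeddings guide above describes how text becomes numbers for search and RAG. As a minimal illustration - with made-up toy vectors standing in for real model output - cosine similarity is the standard way those numbers get compared:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" (real models produce hundreds of dimensions).
cat = [0.9, 0.1, 0.3, 0.0]
kitten = [0.85, 0.15, 0.35, 0.05]
invoice = [0.0, 0.8, 0.1, 0.9]

print(cosine_similarity(cat, kitten))   # high: related meanings
print(cosine_similarity(cat, invoice))  # low: unrelated meanings
```

In a real pipeline the vectors come from an embedding model's API rather than being hand-written, but the comparison step works the same way.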
Tools
- Best AI Presentation Tools in 2026 - Comparing Gamma, Beautiful.ai, Tome, and Canva AI on pricing, features, and design quality.
- Best AI Note-Taking Apps in 2026 - Comparing Notion AI, Google NotebookLM, Obsidian, and Mem with pricing, features, and recommendations.
- Best AI Music Generators in 2026: Suno, Udio, and More - Comparing Suno, Udio, Stable Audio, and AIVA on pricing, quality, and commercial licensing.
- Best AI Meeting Assistants in 2026 - Comparing Otter, Fireflies, Granola, and tl;dv on pricing, features, and real-world use.
- Best AI Data Analysis Tools in 2026 - Comparing Julius AI, ChatGPT Code Interpreter, and Claude analysis with pricing and features.
- Qwen3.5-27B Distilled vs Base: What You Gain - What Claude Opus reasoning distillation adds to Qwen3.5-27B and what it costs in context length, multimodal ability, and reliability.
- GPT-5.4 vs Gemini 3.1 Pro - Breadth Meets Reasoning Depth - GPT-5.4 leads on computer use and enterprise productivity; Gemini 3.1 Pro leads on science reasoning and math at 20% lower cost.
- GPT-5.4 vs Claude Opus 4.6 - Computer Use Meets Agent Teams - GPT-5.4 leads on computer use at half the price; Claude Opus 4.6 leads on coding, agent teams, and long-context retrieval.
- Qwen3.5 MoE vs Kimi K2.5 for Coding - Price Breakdown - Kimi K2.5 tops every coding benchmark, but Qwen3.5-35B-A3B delivers 87-93% of that performance at 3-4x lower cost on a single consumer GPU.
- Best LLM Observability Tools in 2026 - A data-driven comparison of Langfuse, LangSmith, Helicone, Braintrust, and Phoenix for teams building AI in production.
- Best GEO Tools in 2026 - Top 5 Platforms Ranked - Ranked review of five Generative Engine Optimization platforms, with pricing, benchmarks, and honest trade-offs.
Leaderboards
- Small Language Model Leaderboard: Best Under 10B - Rankings of the best sub-10B models, comparing Phi-4, Gemma 3, Qwen 3.5, and others across key benchmarks.
- Multilingual LLM Leaderboard: March 2026 Rankings - Rankings across 16 languages on the Artificial Analysis Multilingual Index and MGSM benchmarks.
- Embedding Model Leaderboard: MTEB Rankings March 2026 - Comparing retrieval quality, dimensions, speed, and pricing for RAG and search use cases.
- AI Voice and Speech Leaderboard: TTS and STT Rankings - Rankings of text-to-speech and speech-to-text models on naturalness, accuracy, latency, and pricing.
- AI Speed and Latency Leaderboard: Tokens/s Rankings - Rankings of the fastest AI models and inference providers by tokens per second and time to first token.
- AI Safety Leaderboard: Refusal and Jailbreak Rankings - Rankings by refusal rates, jailbreak resistance, bias scores, and truthfulness across major benchmarks.
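For readers comparing entries on the speed leaderboard above: end-to-end response time can be roughly estimated from the two reported numbers, time to first token and sustained tokens per second. A back-of-envelope sketch, using hypothetical provider figures:

```python
def estimated_latency_s(ttft_s, tokens_per_s, output_tokens):
    # Total response time ~ time to first token, plus the remaining
    # tokens streamed at the sustained decode rate.
    return ttft_s + (output_tokens - 1) / tokens_per_s

# Hypothetical figures: 0.4 s TTFT, 120 tokens/s, 500-token answer.
print(round(estimated_latency_s(0.4, 120.0, 500), 2))
```

This ignores network overhead and rate limiting, but it shows why TTFT dominates short answers while tokens/s dominates long ones.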
Science
- 22 Bytes Poison ML Malware Detectors via Label Spoofing - EURECOM researchers show that injecting 22-55 bytes into benign Android apps tricks antivirus engines into mislabeling them, poisoning ML training datasets.
- CoT Control, Hidden Beliefs, and Dynamic Agent Benchmarks - New research shows reasoning models can't suppress their chain-of-thought and commit to answers internally before their CoT reveals it.
- Alignment Backfires, AI Monitors Cheat, Models Resist - Three papers expose structural gaps in agentic safety: monitors that go easy on their own outputs, safety guardrails harming non-English users, and models resisting shutdown.
- Sandbagging Models, Sparse Critics, Compact Reasoning - Models can fake poor performance under adversarial prompts, while a smarter critic boosts SWE-bench by 15 points.
- Corrupt Agent Scores, Memory Bottlenecks, Skill Evolution - Research exposes hidden failures in agent benchmarks, finds that retrieval quality drives memory performance, and shows evolutionary skill discovery beats manual curation.
- Cheaper Thinking, Web Traps, Denoised Agents - Three papers cover reasoning efficiency, agent vulnerability to web misinformation, and error correction in multi-step AI workflows.
Models
- Qwen3.5-27B Claude Opus Reasoning Distilled - Community fine-tune distilling Claude Opus 4.6 chain-of-thought into Qwen3.5-27B via LoRA, Apache 2.0 licensed.
- GPT-5.4 - OpenAI's most capable frontier model with native computer use, 1M-token context, and three variants priced at $2.50/$15 per million input/output tokens.
- GPT-5.3 Instant - OpenAI's Anti-Cringe Update - Cuts hallucinations by 26.8% and overhauls ChatGPT's tone, but ships with documented safety regressions.
- GLM-5 - China's 744B Open-Source Frontier Model - Zhipu AI's 744B MoE model trained entirely on Huawei Ascend chips, scoring 77.8% on SWE-bench and 50 on the Artificial Analysis Intelligence Index, MIT licensed.
Elena Marchetti, Senior AI Editor
Awesome Agents - AI news, benchmarks, and tools for practitioners