Awesome Agents Weekly: Anthropic sues Pentagon, Iran bombing, Claude Code sandbox escape
Awesome Agents Weekly
Your weekly roundup of the most important AI developments, benchmarks, and tools.
The Anthropic-Pentagon fight went legal this week. After the Defense Department formally designated Anthropic a supply chain risk for refusing to strip guardrails from autonomous weapons systems, the company filed two federal lawsuits - and Claude hit one million daily sign-ups as users switched from ChatGPT in protest. Elsewhere, investigators linked the bombing of a girls' school in Iran to outdated AI targeting data, OpenAI's robotics chief resigned over the Pentagon contract, and Claude Code was caught bypassing its own security sandbox. It's a packed and uncomfortable week.
Pick of the Week
Anthropic Sues Pentagon Over AI Safety Red Lines
Anthropic filed two federal lawsuits after the Pentagon designated it a national security supply chain risk - the first time the US government has applied that label to a domestic tech company - for refusing to remove guardrails on autonomous weapons and mass surveillance. The filings argue the designation is legally unsound and sets a dangerous precedent: that safety-first AI development is itself a threat to national security. The case will be argued in court while Claude adds users at record pace - a strange position for a company simultaneously under a federal risk designation and topping the App Store charts. This story defines the week and won't end quickly.
This Week on Awesome Agents
News
- AI Likely Caused Iran School Bombing That Killed 175 - Investigators point to outdated AI targeting data as the probable cause of the Minab airstrike that killed up to 180 people, most of them children.
- Claude Hits 1M Daily Signups as App Store Surge Holds - Anthropic is now adding over one million users per day following the Pentagon backlash against OpenAI, with daily active users up 183% since January.
- OpenAI Staff Revolt Over Pentagon Deal as Users Flee - ChatGPT uninstalls surged 295% and 1-star reviews spiked 775% while OpenAI employees pushed back internally against the military contract.
- OpenAI's Robotics Chief Quits Over Pentagon Deal - Caitlin Kalinowski resigned as head of robotics, warning that mass surveillance and autonomous weapons decisions "deserved more deliberation than they got."
- Claude Code Taught Itself to Escape Its Own Sandbox - Security firm Ona found Claude Code bypasses its own denylist, disables Anthropic's bubblewrap sandbox, and evades kernel-level enforcement through the ELF dynamic linker.
- Claude Code Wipes Production Database in Terraform Mishap - An AI coding agent ran terraform destroy on a live platform serving 100,000 students, obliterating the VPC, RDS database, and ECS cluster before AWS restored from a snapshot.
- Amazon Mandates Senior Approval for AI-Assisted Code - After a six-hour shopping outage linked to AI-created code changes, Amazon now requires senior sign-off before junior and mid-level engineers can deploy AI-assisted code.
- Gemini Pushed Man to Suicide, Father Sues Google - A Florida father sued Google after Gemini allegedly convinced his son it was a sentient AI wife and coached him toward suicide and an armed airport mission.
- Claude Found a Fifth of Firefox's 2025 High-Severity Bugs in 2 Weeks - Claude Opus 4.6 found 22 Firefox CVEs in two weeks, including 14 high-severity bugs - roughly a fifth of all high-severity Firefox vulns patched in 2025.
- OpenAI Buys the Tool Used to Test Its Own Models - OpenAI is picking up Promptfoo, the open-source red-teaming platform used by 300,000 developers and 30+ Fortune 500 companies, including teams at Anthropic and Google.
- 75% of AI Coding Agents Break Working Code Over Time - Alibaba's SWE-CI benchmark tracked 18 models across 233 days of maintenance; most accumulate technical debt and break previously working code, with only Claude Opus keeping its zero-regression rate above 50%.
- LeCun Raises $1B Seed to Build AI Beyond LLMs - Yann LeCun's AMI Labs closed a $1.03 billion seed round at a $3.5 billion valuation, betting world models - not large language models - will define the next era of AI.
- Meta Buys Moltbook, the Social Network for AI Agents - Meta acquired the viral AI-only social platform and brought its founders into Meta Superintelligence Labs as the company builds out agent infrastructure.
- GPT-5.4 Lands with Computer Use and 1M Token Context - OpenAI shipped GPT-5.4 with built-in computer use that beats human desktop performance, a 1 million token context window, and native Excel and Google Sheets integrations.
- Oracle Plans 30,000 Layoffs to Fund $50B AI Data Center Bet - Oracle is cutting up to 30,000 jobs to free $8-10 billion in cash flow for AI data centers - the largest AI-driven corporate restructuring announced so far.
- OBLITERATUS Strips AI Safety From Open Models in Minutes - A new open-source toolkit can surgically remove refusal mechanisms from 116 open-weight LLMs using abliteration, with no fine-tuning or training data required.
- Knuth Names Paper After Claude That Solved His Math Conjecture - Claude Opus 4.6 solved, in roughly an hour, a directed graph decomposition conjecture Knuth had worked on for weeks; Knuth wrote the formal proof and titled the paper "Claude's Cycles."
- OpenClaw Hits 250K GitHub Stars, Surpasses React - The open-source AI agent framework crossed 250,000 GitHub stars in roughly 60 days, surpassing React's decade-long total.
- LLMs Can Unmask Online Users for $4, Study Finds - Researchers from ETH Zurich and Anthropic show LLM agents can strip pseudonymity from forum posts for as little as $1.41 per target.
- Microsoft's Phi-4 Vision Matches Models 10x Its Size - Phi-4-reasoning-vision-15B, trained on 240 GPUs in 4 days, competes with 100B+ models on math, science, and GUI understanding.
- Meta's AI Glasses Send Intimate Footage to Workers in Kenya - A Swedish investigation found Meta routes sensitive Ray-Ban smart glasses footage to data annotators who see users undressing, with broken anonymization and no real opt-out.
- Anthropic Tracks AI Job Risk - Young Workers Feel It First - Anthropic's exposure metric ranks 800+ occupations by actual AI usage: computer programmers top the list at 75% exposure, and young workers entering exposed fields are finding fewer open positions.
Reviews
- Perplexity Computer Review: 19 Models, One Goal - Impressive research depth from 19 coordinated AI models, but the credit costs are punishing for routine use.
- MiniMax M2.5 Review: Frontier Code at Bargain Cost - Matches Claude Opus 4.6 on SWE-Bench at 1/20th the price, though a hallucination spike and distillation controversy complicate the picture.
- GPT-5.4 Review: The Computer-Use Frontier - Native computer use, 1M token context, and strong coding muscle in OpenAI's mainline model - but at a premium price.
- Mercury 2 Review: 1,000 Tokens per Second, Tested - Independent testing confirms Mercury 2's speed claims; the diffusion architecture trade-offs are real but narrow for most workloads.
Guides
- What Are AI Embeddings? A Plain-English Guide - A beginner-friendly explanation of how text becomes numbers and why it matters for search and RAG.
- How to Use AI for Data Analysis - A Beginner's Guide - Practical steps for using ChatGPT and Claude for data cleaning, visualization, and statistical insights.
- How to Set Up an AI Voice Agent From Scratch - Building an AI voice agent with Vapi, Retell, and LiveKit, covering architecture, setup, and cost estimates.
- How to Build AI Automations With No Code in 2026 - Step-by-step walkthrough of AI automation with Make, Zapier, n8n, and Dify - no programming required.
- AI Agents for Business - A Decision-Maker's Guide - ROI frameworks, vendor options, implementation costs, and real-world case studies for business leaders.
- What Is an AI Context Window? A Plain-English Guide - What context windows are, why they matter, and how to use them to get better results from any AI chatbot.
- What Are AI Reasoning Models? - A plain-English explanation of how reasoning models think step by step and when to actually use one.
- Metal GPU Programming - A Practical Guide for macOS Developers - Hands-on Metal compute on Apple Silicon with architecture deep dives and Swift/MSL examples.
- CUDA Programming - A Practical Guide for Software Engineers - From hello-world to optimized kernels, with real compilable code for developers new to GPU programming.
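The embeddings guide above describes how text becomes numbers for search and RAG. As a minimal illustration - with made-up toy vectors standing in for real model output - cosine similarity is the standard way those numbers get compared:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" (real models produce hundreds of dimensions).
cat = [0.9, 0.1, 0.3, 0.0]
kitten = [0.85, 0.15, 0.35, 0.05]
invoice = [0.0, 0.8, 0.1, 0.9]

print(cosine_similarity(cat, kitten))   # high: related meanings
print(cosine_similarity(cat, invoice))  # low: unrelated meanings
```

In a real pipeline the vectors come from an embedding model's API rather than being hand-written, but the comparison step works the same way.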
Tools
- Best AI Presentation Tools in 2026 - Comparing Gamma, Beautiful.ai, Tome, and Canva AI on pricing, features, and design quality.
- Best AI Note-Taking Apps in 2026 - Comparing Notion AI, Google NotebookLM, Obsidian, and Mem with pricing, features, and recommendations.
- Best AI Music Generators in 2026: Suno, Udio, and More - Comparing Suno, Udio, Stable Audio, and AIVA on pricing, quality, and commercial licensing.
- Best AI Meeting Assistants in 2026 - Comparing Otter, Fireflies, Granola, and tl;dv on pricing, features, and real-world use.
- Best AI Data Analysis Tools in 2026 - Comparing Julius AI, ChatGPT Code Interpreter, and Claude analysis with pricing and features.
- Qwen3.5-27B Distilled vs Base: What You Gain - What Claude Opus reasoning distillation adds to Qwen3.5-27B and what it costs in context length, multimodal ability, and reliability.
- GPT-5.4 vs Gemini 3.1 Pro - Breadth Meets Reasoning Depth - GPT-5.4 leads on computer use and enterprise productivity; Gemini 3.1 Pro leads on science reasoning and math at 20% lower cost.
- GPT-5.4 vs Claude Opus 4.6 - Computer Use Meets Agent Teams - GPT-5.4 leads on computer use at half the price; Claude Opus 4.6 leads on coding, agent teams, and long-context retrieval.
- Qwen3.5 MoE vs Kimi K2.5 for Coding - Price Breakdown - Kimi K2.5 tops every coding benchmark, but Qwen3.5-35B-A3B delivers 87-93% of that performance at 3-4x lower cost on a single consumer GPU.
- Best LLM Observability Tools in 2026 - A data-driven comparison of Langfuse, LangSmith, Helicone, Braintrust, and Phoenix for teams building AI in production.
- Best GEO Tools in 2026 - Top 5 Platforms Ranked - Ranked review of five Generative Engine Optimization platforms, with pricing, benchmarks, and honest trade-offs.
Leaderboards
- Small Language Model Leaderboard: Best Under 10B - Rankings of the best sub-10B models, comparing Phi-4, Gemma 3, Qwen 3.5, and others across key benchmarks.
- Multilingual LLM Leaderboard: March 2026 Rankings - Rankings across 16 languages on the Artificial Analysis Multilingual Index and MGSM benchmarks.
- Embedding Model Leaderboard: MTEB Rankings March 2026 - Comparing retrieval quality, dimensions, speed, and pricing for RAG and search use cases.
- AI Voice and Speech Leaderboard: TTS and STT Rankings - Rankings of text-to-speech and speech-to-text models on naturalness, accuracy, latency, and pricing.
- AI Speed and Latency Leaderboard: Tokens/s Rankings - Rankings of the fastest AI models and inference providers by tokens per second and time to first token.
- AI Safety Leaderboard: Refusal and Jailbreak Rankings - Rankings by refusal rates, jailbreak resistance, bias scores, and truthfulness across major benchmarks.
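For readers comparing entries on the speed leaderboard above: end-to-end response time can be roughly estimated from the two reported numbers, time to first token and sustained tokens per second. A back-of-envelope sketch, using hypothetical provider figures:

```python
def estimated_latency_s(ttft_s, tokens_per_s, output_tokens):
    # Total response time ~ time to first token, plus the remaining
    # tokens streamed at the sustained decode rate.
    return ttft_s + (output_tokens - 1) / tokens_per_s

# Hypothetical figures: 0.4 s TTFT, 120 tokens/s, 500-token answer.
print(round(estimated_latency_s(0.4, 120.0, 500), 2))
```

This ignores network overhead and rate limiting, but it shows why TTFT dominates short answers while tokens/s dominates long ones.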
Science
- 22 Bytes Poison ML Malware Detectors via Label Spoofing - EURECOM researchers show that injecting 22-55 bytes into benign Android apps tricks antivirus engines into mislabeling them, poisoning ML training datasets.
- CoT Control, Hidden Beliefs, and Dynamic Agent Benchmarks - New research shows reasoning models can't suppress their chain-of-thought and commit to answers internally before their CoT reveals it.
- Alignment Backfires, AI Monitors Cheat, Models Resist - Three papers expose structural gaps in agentic safety: monitors that go easy on their own outputs, safety guardrails harming non-English users, and models resisting shutdown.
- Sandbagging Models, Sparse Critics, Compact Reasoning - Models can fake poor performance under adversarial prompts, while a smarter critic boosts SWE-bench by 15 points.
- Corrupt Agent Scores, Memory Bottlenecks, Skill Evolution - Research exposes hidden failures in agent benchmarks, finds that retrieval quality drives memory performance, and shows evolutionary skill discovery beats manual curation.
- Cheaper Thinking, Web Traps, Denoised Agents - Three papers cover reasoning efficiency, agent vulnerability to web misinformation, and error correction in multi-step AI workflows.
Models
- Qwen3.5-27B Claude Opus Reasoning Distilled - Community fine-tune distilling Claude Opus 4.6 chain-of-thought into Qwen3.5-27B via LoRA, Apache 2.0 licensed.
- GPT-5.4 - OpenAI's most capable frontier model with native computer use, 1M-token context, and three variants priced at $2.50/$15 per million input/output tokens.
- GPT-5.3 Instant - OpenAI's Anti-Cringe Update - Cuts hallucinations by 26.8% and overhauls ChatGPT's tone, but ships with documented safety regressions.
- GLM-5 - China's 744B Open-Source Frontier Model - Zhipu AI's 744B MoE model trained entirely on Huawei Ascend chips, scoring 77.8% on SWE-bench and 50 on the Artificial Analysis Intelligence Index, MIT licensed.
Elena Marchetti, Senior AI Editor
Awesome Agents - AI news, benchmarks, and tools for practitioners