Awesome Agents Weekly: Dual $1T IPOs, Claude 4.8 tops SWE-bench
Your weekly roundup of the most important AI developments, benchmarks, and tools.
This was the week AI went public - twice. Anthropic and OpenAI filed confidential S-1s within seven days of each other, together targeting valuations near $2 trillion. Anthropic complicated its own story by calling for a global AI pause while also disclosing that Claude now writes 80% of its own code. Meanwhile, Claude Opus 4.8 topped SWE-bench Pro, NVIDIA dropped a 550B open-weight model, and Florida became the first US state to name an AI CEO in a lawsuit.
Pick of the Week
Anthropic Files for $1T IPO, Warns AI May Escape Control
Four days after filing its S-1 at a valuation near $965B, Anthropic published internal data showing Claude writes 80% of its own codebase - and called for a coordinated global pause on AI development. The company that has most consistently argued AI poses existential risk is now racing to go public faster than any AI lab before it. Whether this reads as principled tension or a very expensive contradiction depends on what you think the point of safety research is. It's the most revealing single story of the year so far.
This Week on Awesome Agents
News
- OpenAI Files for IPO, Eyes $1 Trillion Valuation - OpenAI filed a confidential S-1 targeting a public debut above $1 trillion as early as September 2026, following Anthropic's filing by one week.
- Anthropic Files for IPO, Eyes $1 Trillion Debut - Anthropic's confidential S-1 targets an October 2026 IPO at near-$1 trillion after a 5x revenue surge in six months.
- Claude Opus 4.8 Leads SWE-Bench Pro, Adds Parallel Agents - Anthropic's Claude Opus 4.8 scores 69.2% on SWE-bench Pro and now ships hundreds of parallel subagents in Claude Code, with pricing unchanged at $5 per million input tokens.
- NVIDIA Ships Nemotron 3 Ultra - 550B Open-Weight MoE - NVIDIA's 550B Nemotron 3 Ultra tops the US open-weight leaderboard with a hybrid Mamba-Transformer architecture and over 300 tokens per second throughput.
- MiniMax M3 Makes 1M Context Viable With Sparse Attention - MiniMax M3 uses sparse attention to cut long-context inference cost 20x and tops GPT-5.5 on coding benchmarks at a fraction of the price.
- OpenAI Dreaming V3 Starts the AI Memory Wars - OpenAI replaced ChatGPT's flat memory store with a hierarchical relational system, kicking off a four-way race for AI personalization dominance.
- Florida Sues OpenAI and Altman Over ChatGPT Safety Lapses - Florida became the first US state to hold an AI CEO personally liable, filing an 83-page complaint accusing OpenAI and Sam Altman of hiding ChatGPT's dangers while racing for market share.
- Sanders Targets OpenAI, Anthropic, xAI With 50% Tax - Senator Bernie Sanders proposed seizing 50% of OpenAI, Anthropic, and xAI equity to fund a federal sovereign wealth fund with government board seats attached.
- Trump Eyes Government Equity Stake in OpenAI - The Trump administration is in talks with OpenAI about donating equity to a US sovereign-style fund, which would make American taxpayers co-owners of the most valuable AI startup on Earth.
- Alphabet's $85B AI Bet Reverses Decade of Buybacks - Alphabet priced an $84.75B equity raise to fund a $180-190B AI infrastructure buildout, with Berkshire Hathaway contributing a $10B anchor bet.
- Google Pays SpaceX $920M Monthly for Compute Bridge - Google is paying SpaceX $920 million per month for 110,000 NVIDIA GPUs at Colossus 1, citing unexpected demand for its Gemini Enterprise agent platform.
- AirTrunk Commits $30B to 5GW India Data Centers - Blackstone-backed AirTrunk pledges $30 billion and 5GW of AI data center capacity in India by 2030, more than triple the country's current installed base.
- Apple's iOS 27 Beta Ships the Multi-Model Extensions API - iOS 27 Beta 1 is live for developers, shipping Apple's Extensions framework that lets Gemini, Claude, and ChatGPT plug directly into Siri.
- Great American AI Act Would Preempt State AI Laws - A bipartisan bill would freeze state AI laws for three years and require frontier developers to publish risk plans, submit to federal audits, and face $1M daily fines.
- 56% of High-Risk Hackers Now Use AI, Anthropic Reports - Anthropic analyzed 832 banned accounts over 12 months and found AI-assisted threat actors grew from a third to more than half of all high-risk cases.
- Claude Mythos Finds 10K Flaws in Critical Systems - Anthropic expanded Project Glasswing to 150 organizations across 15 countries, with Claude Mythos Preview surfacing 10,000 high-severity vulnerabilities since April.
- ChatGPT Lockdown Mode Targets Prompt Injection Data Theft - OpenAI's new Lockdown Mode cuts the network exits that prompt injection attacks use to steal data from ChatGPT, though it doesn't stop malicious instructions from entering the model.
- Google Gemma 4 QAT Fits Frontier AI in Under 1GB - Google DeepMind's QAT checkpoints shrink the Gemma 4 E2B model to under 1GB, making serious on-device AI viable for phones and budget laptops.
- NVIDIA Dynamo Snapshot Slashes Kubernetes AI Cold Starts - NVIDIA's Dynamo Snapshot uses CRIU and cuda-checkpoint to freeze and restore GPU inference containers in seconds, cutting cold-start times by up to 21x for large models.
- New Open Standard Puts AI Agents Under Runtime Control - The Agent Control Standard defines open middleware hooks that let teams block, allow, or modify AI agent actions before they execute.
- DeepSeek Nears $7.4B Close With Tencent and CATL - DeepSeek's first external funding round is nearing completion at a $59B valuation, with Tencent and battery giant CATL as the biggest outside investors.
- Microsoft Launches Polaris and Foundry Local at Build 2026 - Microsoft's Build 2026 keynote ships Project Polaris to replace GPT-4 in GitHub Copilot by August and declares Foundry Local generally available for on-device inference.
- Trump Signs Voluntary AI Review Order After Pushback - Trump signed a narrowed AI executive order giving the government 30 days of voluntary pre-release access to frontier models, after industry lobbying gutted the original mandatory proposal.
- Canada Launches $2.3B National AI Strategy - PM Mark Carney's AI for All commits $2.3 billion toward 250,000 new jobs and 60% business adoption by 2034, though critics call it short on delivery mechanisms.
Reviews
- Claude Opus 4.8 Review: Reliability Over Raw Scores - Claude Opus 4.8 sets new highs on SWE-bench Pro and long-context tasks, with a 4x improvement in code flaw detection that may matter more than any benchmark number.
- MiniMax M3 Review: The Price Disruptor with Caveats - MiniMax M3 combines frontier coding, 1M-token context, and native multimodality at budget pricing, but every benchmark figure is self-reported and the weights weren't shipped at launch.
- GPT-Rosalind Review: The Gated Drug Discovery Model - OpenAI's life sciences reasoning model gets a June update with global access and new NGS plugins - strong benchmarks, but still locked behind a Trusted Access Program with no public pricing.
Science
- Safety Evals Break Under Attack, Agents Work 87% Faster - Strategic attack timing exposes gaps in AI control evaluations, while Perplexity's agents slash task time by 87% and Lean4 formal proofs make agent workflows more reliable.
- AI Sabotage Blind Spots, Code Drift, and ZK Proofs - New arXiv papers show developers miss AI sabotage 94% of the time, LLMs converge structurally in code evolution, and ZK proofs could verify frontier AI training.
- AI Attachment, Smarter Spending, and Cascading RAG Errors - Three papers tackle how routine AI use rewires emotional habits, how to direct compute toward high-cost failures, and why agentic RAG errors compound before anyone notices.
- When to Stop - Overthinking, Handoffs, and Abstention - New research shows AI agents fail not by doing the wrong thing, but by continuing when they should have stopped.
- Reasoning Leaks, Hard Limits, and Self-Aware LLMs - Three papers expose how reasoning traces can be extracted from hidden model internals, where chain-of-thought hits architectural ceilings, and how RL teaches models to know when to quit.
Tools
- Best AI Coding Agents 2026: 6 Tools Tested and Ranked - Benchmark-driven comparison of Claude Code, Kiro, Devin, OpenAI Codex, Windsurf, and OpenHands - the six coding agents worth using this year.
- Claude Opus 4.8 vs GPT-5.5: Frontier Model Showdown - Full benchmark and pricing comparison of Claude Opus 4.8 vs GPT-5.5 for coding, agents, and knowledge work.
Models
- Devstral 2 - Mistral's open-weight coding agent at 123B parameters, 72.2% on SWE-bench Verified, and $0.40/M input token pricing.
- Grok Build 0.1 - xAI's first agentic coding model, with a 256K context window, native MCP support, and always-on reasoning at $1/M input tokens.
- Ministral 3 14B - Mistral AI's largest Ministral 3 model at 14B parameters with 256K context, multimodal support, and Apache 2.0 license.
- Ministral 3 8B - Mistral AI's mid-tier open-weight edge model at 8B parameters with 256K context and Apache 2.0 license, built for agentic pipelines.
- NVIDIA Nemotron 3 Ultra 550B-A55B - NVIDIA's 550B open-weight MoE with 55B active parameters, hybrid Mamba-Transformer architecture, and 1M token context.
- MiniMax M3 - Open-weight frontier model with a 1M-token context window, native multimodal input, and strong agentic coding at $0.60/M input tokens.
- Llama 3.3 70B Instruct - Meta's Llama 3.3 70B matches Llama 3.1 405B on instruction following and math while running at 4-5x lower cost.
- Cohere Command A+ - Cohere's 218B sparse MoE with Apache 2.0 license, native citations, and a 128K context window that runs on just two H100 GPUs.
Guides
- How to Use AI for Meeting Notes and Action Items - Beginner's guide to Fathom, Otter.ai, Zoom AI, and Google Meet's Gemini for automatic meeting notes and follow-up tasks.
- How to Use AI for Shopping and Find Better Deals - How to use ChatGPT, Perplexity, Gemini, and Amazon's AI assistant to research products, compare prices, and spot fake reviews.
Elena Marchetti, Senior AI Editor
Awesome Agents - AI news, benchmarks, and tools for practitioners