Awesome Agents Weekly: Gemini 3.1 Pro Takes the Crown, Supply Chain Attacks Rock AI Tooling

Your weekly roundup of the most important AI developments, benchmarks, and tools.

February 20, 2026

This was one of those weeks where everything moved at once. Google dropped Gemini 3.1 Pro and immediately claimed the benchmark crown, Alibaba fired back with an open-source model that claims to beat GPT-5.2, and a wave of supply chain attacks reminded us that the tooling we build on is only as secure as its weakest link. Meanwhile, Anthropic published data showing AI agents now work autonomously for 45 minutes at a stretch - which feels less like a stat and more like a warning.
Pick of the Week
Google Launches Gemini 3.1 Pro, Claims Top Spot on 13 of 16 Benchmarks
Google just reshuffled the leaderboard. Gemini 3.1 Pro tops Claude Opus 4.6 and GPT-5.2 on 13 of 16 industry benchmarks, marking the first time in months that Google holds the lead on reasoning and coding tasks. Whether this holds through the next round of releases is anyone's guess, but for now, Mountain View is back in the driver's seat.
This Week on Awesome Agents
News

Stripe's AI 'Minions' Now Ship 1,300 Pull Requests Per Week With Zero Human-Written Code - Stripe's autonomous coding agents generate over 1,300 merged PRs weekly, all reviewed by humans but written entirely by AI.
Anthropic's New Study Reveals AI Agents Now Run 45 Minutes Without Human Input - Analysis of millions of Claude Code sessions shows agents work autonomously nearly twice as long as four months ago.
Alibaba's Qwen 3.5 Claims to Beat GPT-5.2 and Claude Opus 4.5 - and It's Open Source - Alibaba releases a 397B open-weight model that claims to outperform US frontier models at a fraction of the cost.
AlphaGo Architect Raises $1B to Build Superintelligence Without LLMs - David Silver leaves DeepMind to launch Ineffable Intelligence, pursuing superintelligence through reinforcement learning instead of language models.
The #1 Skill on OpenClaw's Marketplace Was Malware - 1,184 malicious skills found on ClawHub stealing SSH keys, crypto wallets, and browser passwords.
Cline CLI Compromised: Hijacked npm Package Silently Installed OpenClaw - A compromised npm token pushed a malicious Cline version that silently installed OpenClaw via postinstall script.
AI Safety's Exodus: The People Who Built the Guardrails Are Walking Away - A wave of safety researchers quit OpenAI, Anthropic, and xAI in a single week, warning the industry is moving too fast.
300 Million Private AI Chat Messages Leaked by a Single Misconfigured Database - Chat & Ask AI left its Firebase database wide open, exposing 300M messages including deeply sensitive conversations.
Your AI Assistant Is a Backdoor: Researchers Turn Copilot and Grok Into Malware Command Channels - Check Point Research shows how Copilot and Grok can be hijacked as covert C2 proxies.
Meta Goes All-In on Nvidia: Millions of Chips, Tens of Billions of Dollars - Meta and Nvidia announce a multiyear deal spanning millions of GPUs, with Meta first to deploy Grace CPUs standalone at scale.
xAI Launches Grok 4.20 With Four AI Agents That Debate Each Other - Grok 4.20 replaces the single-model approach with four specialized agents that reason in parallel and fact-check each other.
StepFun's Step 3.5 Flash Uses Just 11B of Its 196B Parameters and Still Rivals GPT-5.2 - Shanghai AI lab open-sources a sparse MoE model matching frontier performance with radical efficiency.
Apple Bets Big on AI Wearables With Smart Glasses, Pendant, and Camera AirPods - Apple accelerates three AI wearable devices built around Gemini-powered Siri, colliding with Meta's wearables push.
RAMmageddon: AI's Hunger for Memory Chips Is Starving the Rest of the Tech Industry - AI data centers now consume 70% of global memory production, triggering price surges and product delays.
90% of CEOs Say AI Has Had Zero Impact on Productivity - A landmark NBER study of 6,000 executives finds the vast majority report no measurable productivity effects from AI.
OpenAI Rebrands Aardvark as Codex Security, Adds Malware Analysis - OpenAI's agentic security researcher gets a new name and a malware analysis pipeline.
Pentagon Threatens to Blacklist Anthropic Over Military AI Guardrails - Defense Secretary reportedly close to designating Anthropic a 'supply chain risk' after the company refused autonomous weapons use.
Anthropic Locks Down Claude Code: OAuth Tokens Banned in Third-Party Tools - Anthropic enforces server-side blocks and account bans for OAuth token use in third-party tools.
A Nature Paper Says AGI Is Already Here. Not Everyone Agrees. - Four UC San Diego researchers argue current LLMs already constitute AGI, igniting fierce community debate.
Security Researchers Find Exposed Source Code on Persona's Government Verification Platform - 53MB of TypeScript source maps extracted from Persona's FedRAMP-authorized government endpoint.
Claude Sonnet 4.6 Arrives With 1M Context and Near-Opus Coding Performance - Anthropic's new mid-tier model matches Opus 4.6 on coding benchmarks with a million-token context window at $3/$15 pricing.
OpenClaw Creator Peter Steinberger Joins OpenAI - The developer behind OpenClaw joins OpenAI to build next-gen personal agents; the project moves to an open-source foundation.
Not Even Modi Could Make Them Hold Hands - At the India AI Summit, Altman and Amodei raised clenched fists while everyone else clasped hands. The internet noticed.
India AI Summit Opens With $100B in Pledges - Every major AI CEO attended, Adani pledged $100B for data centers, and Anthropic opened its first India office.
How Many Tokens Does Moltbook Burn? - Moltbook's 46,000 AI agents consume up to 4 billion tokens per day, and 93% of the comments they post get zero replies.
Moltbook Built a CAPTCHA That Proves You're AI, Not Human - The AI-only social network deploys lobster-themed math puzzles that LLMs solve instantly but humans cannot.
50 Posts About Buying Mac Minis, Zero Apps Shipped - A viral tweet exposes an uncomfortable pattern: endless hardware purchases, near-zero shipped products.

Reviews

OpenAI Frontier Review: The Enterprise Agent Operating System - An in-depth review of OpenAI's enterprise platform for building, deploying, and managing AI agents.
OpenClaw Review: The Open-Source AI Agent That Wants to Run Your Life - We test OpenClaw's skills system, multi-agent workflows, and security posture across its 196K-star codebase.
Best AI Cybersecurity Ranges and Red Teaming Platforms in 2026 - A roundup of 15+ platforms for AI security practice, LLM red teaming, and prompt injection training.

Guides

What Is RAG? Retrieval-Augmented Generation Explained in Plain English - A beginner-friendly explanation of the technique that lets AI pull in real facts before answering.
What Is Vibe Coding? A Beginner's Guide to Building Apps With AI - Everything you need to know about building software by describing what you want in plain English.
The Complete Free AI Coding Setup for 2026 - How to build a professional AI-assisted coding environment that costs absolutely nothing.

Tools

Best AI Coding CLI Tools in 2026: 7 Terminal Agents Compared - Data-driven comparison of Claude Code, Gemini CLI, Codex CLI, Aider, OpenCode, Warp, and Amp.
Best AI Code Review Tools in 2026: 6 Options Tested and Compared - CodeRabbit, Qodo, Greptile, DeepSource, Sourcery, and GitHub Copilot code review head to head.
Every Free AI API in 2026: The Complete Guide to Zero-Cost Inference - Comparison of 20+ free inference providers from Google AI Studio and Groq to OpenRouter and Cerebras.

Leaderboards

Do AI Benchmarks Still Matter? - A data-driven look at benchmark contamination, leaderboard gaming, and whether public benchmarks still tell us anything useful.
Home GPU LLM Leaderboard - Best open-source models ranked by VRAM tier with real-world tokens/s benchmarks on RTX 4090, 3090, and Apple M-series.
Long-Context Benchmarks Leaderboard - Rankings of models for long-context tasks across MRCR, RULER, and LongBench v2 from 128K to 10M tokens.

Elena Marchetti, Senior AI Editor
Awesome Agents - AI news, benchmarks, and tools for practitioners
