The 4 Agents That Actually Shipped This Week (and why it matters)

This week in agentic AI
Four stories worth your attention. One you probably missed entirely.
OpenAI Codex gets persistent memory between runs.
The CLI agent can now maintain state across sessions through its MCP memory integration and local session transcripts. Persistent state is what separates a tool you call once from an agent you actually trust with a workflow. Early users report it cuts repeated context-setting significantly on multi-day projects.
Anthropic advances multi-agent orchestration.
Features discovered in Claude Code point to native agent team coordination, with typed outputs, error codes, and retry rules for sub-task handoffs. The pattern reduces the glue code every team is currently writing by hand. If Anthropic ships this broadly, expect the LangGraph-style orchestration layer to become optional rather than required.
Google DeepMind expands SIMA to new environments.
The embodied agent that plays video games now operates in expanded 3D scenarios, including simulated manufacturing and logistics. SIMA 2 showed roughly double the success rate of its predecessor. The interesting signal: performance still drops sharply when spatial reasoning is required, which tells you where the next wave of multimodal training will focus.
Mistral releases an agent-tuned open model.
A 7B parameter model fine-tuned for tool use and chain-of-thought planning, published under Apache 2.0 with minimal guardrails. Useful for automated pipelines running behind proper access controls. A liability if you point it at anything user-facing without a wrapper. Builders are already integrating it into local agent stacks where latency matters.

One thing to watch
Every major lab is racing to solve agent reliability: the gap between "it works in the demo" and "it works at 2am when no one is watching." The teams winning right now are not the ones with the best models. They are the ones who built the best evals, logging, and retry logic around adequate models. Build your stack like something will fail, because something will.
Quick hits
- Cursor adds agent mode to its web interface (no IDE required)
- Scale AI publishes a new benchmark for multi-step tool-use accuracy across leading models
- A 3-person team shipped an agent that autonomously files GDPR deletion requests on behalf of users; 400 sign-ups in 48 hours
That is #004. Forward this to one person who is actually building with agents.
- Lenny
Edition #004 | March 17, 2026