The Dawn of Autonomous ML: How Andrej Karpathy's 'Autoresearch' Agent Ran 700 Experiments While He Slept
Andrej Karpathy's open-source 'autoresearch' framework is transforming Small Language Model (SLM) optimization by handing the execution loop to AI agents. By autonomously running hundreds of experiments overnight, the system discovers compounding code improvements that even top human experts miss.
March 2026 marked a quiet but profound shift in how machine learning research is conducted. Andrej Karpathy, one of the field's most respected researchers, released a minimalist open-source framework called autoresearch. The premise is deceptively simple: hand the repetitive loop of hypothesis, execution, and evaluation over to an autonomous AI agent, and let it run overnight.
The results, however, are anything but simple. In a span of two days, Karpathy's agent ran approximately 700 autonomous experiments on his already well-optimized nanochat codebase, discovering 20 genuine improvements that human intuition had missed.
The Mechanics of Autonomous Iteration
The traditional machine learning workflow is heavily bottlenecked by human execution. A researcher forms a hypothesis, tweaks the code, kicks off a training run, waits for the GPU to finish, checks the results, and decides what to do next. Autoresearch automates this entire cycle.
Karpathy's implementation enforces strict boundaries to keep the agent focused and productive:
- The Immutable Sandbox: Data preparation (prepare.py) and evaluation protocols are locked down. The agent is only permitted to modify a single mutable file (train.py), which contains the model architecture, optimizer, and training loop.
- The Five-Minute Constraint: Every experiment operates under a hard five-minute training window. This ensures rapid iteration (roughly 12 experiments per hour on a single GPU) while maintaining comparable baselines.
- The Binary Objective: The agent evaluates success based on a single scalar metric: validation bits per byte (val_bpb). If the metric improves, the change is committed via Git. If it degrades, the change is reverted.
By removing the human from the micro-decisions, autoresearch allows an LLM agent to grind through the grueling mechanical work of hyperparameter tuning and architectural tweaking while the researcher sleeps.
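The commit-or-revert loop described above can be sketched in a few lines of Python. This is an illustrative model of the pattern, not Karpathy's actual implementation; `train_fn` stands in for a capped five-minute training run that returns validation bits per byte.

```python
def run_experiment(train_fn, baseline_bpb):
    """One iteration: try a change, keep it only if val_bpb improves.

    Hypothetical sketch; `train_fn` represents a five-minute training run
    that returns validation bits per byte (lower is better).
    """
    new_bpb = train_fn()
    if new_bpb < baseline_bpb:
        return new_bpb, "commit"   # the real agent would `git commit` here
    return baseline_bpb, "revert"  # metric degraded: discard the change

def research_session(train_fn, baseline_bpb, n_experiments):
    """Run experiments back to back, carrying forward only improvements."""
    history = []
    for _ in range(n_experiments):
        baseline_bpb, action = run_experiment(train_fn, baseline_bpb)
        history.append(action)
    return baseline_bpb, history
```

Because only metric-improving changes survive, a long overnight session monotonically lowers val_bpb while every failed idea is cleanly rolled back.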
Catching the Blind Spots of Human Experts
What makes the 700-experiment nanochat run so compelling is not just the volume of tests, but the nature of the discoveries. Karpathy is a world-class ML engineer, yet the autoresearch agent found stacking improvements that cut the time-to-GPT-2-quality by 11% (from 2.02 hours to 1.80 hours).
The agent identified several critical blind spots in the human-written code:
- A missing scalar multiplier in the Parameterless QKNorm, which had made the attention mechanism too diffuse.
- An absence of regularization on Value Embeddings.
- A banded attention window that was overly conservative.
- Suboptimal initializations, AdamW betas, and weight decay schedules.
Crucially, these weren't just overfit hacks for a tiny toy model. Karpathy noted that the improvements transferred seamlessly from 12-layer depth configurations to larger 24-layer models.
The SLM Renaissance and Industry Adoption
The timing of autoresearch aligns perfectly with the industry's pivot toward Small Language Models (SLMs). As enterprises realize they don't always need massive, trillion-parameter models, the demand for highly optimized, domain-specific SLMs has surged.
Because SLMs are small enough to train and iterate on quickly, they are the ideal target for the autoresearch pattern. The framework has already seen rapid adoption beyond Karpathy's personal lab:
- Shopify's Internal Models: Tobi Lütke, CEO of Shopify, pointed the autoresearch framework at an internal 0.8 billion parameter model. Left running overnight, the agent executed 37 experiments and delivered a massive 19% performance gain.
- Cluster-Scale Parallelization: Teams like SkyPilot scaled the agent across 16 GPUs, running over 910 experiments in just eight hours. The system autonomously learned to screen ideas on cheaper hardware before validating them on premium H200 chips.
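The screen-then-validate behavior the agent learned can be sketched abstractly. The function names and threshold logic below are assumptions for illustration, not SkyPilot's actual pipeline: cheap hardware filters candidate ideas, and only promising ones consume premium GPU time.

```python
def two_tier_search(ideas, cheap_eval, premium_eval, screen_threshold):
    """Screen ideas on cheap hardware; validate survivors on premium GPUs.

    Hypothetical sketch of the two-tier pattern. `cheap_eval` and
    `premium_eval` return a score (higher is better).
    """
    validated = []
    for idea in ideas:
        if cheap_eval(idea) >= screen_threshold:       # fast, low-cost filter
            validated.append((idea, premium_eval(idea)))  # expensive confirmation
    return validated
```

The economic logic is the same as in human labs: spend scarce H200 hours only on hypotheses that already cleared a cheap sanity check.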
Redefining the Role of the ML Researcher
The success of autoresearch forces a reevaluation of what it means to be a machine learning researcher. As AI agents become proficient at the execution loop, the human's role shifts from writing training code to designing the boundaries of exploration.
In the autoresearch paradigm, the human's primary contribution is writing the program.md file—the "operator's manual" that defines the agent's permissions, constraints, and the ultimate metric of success. The value shifts from knowing exactly how to optimize an AdamW optimizer, to knowing what measurable outcome matters most.
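The exact contents of program.md have not been detailed publicly; based on the constraints the article describes, a hypothetical operator's manual might look like this:

```markdown
# program.md (hypothetical sketch)

## Objective
Minimize validation bits per byte (val_bpb) on the held-out set.

## Permissions
- You may modify: train.py
- You may NOT modify: prepare.py, the evaluation code, or this file.

## Constraints
- Each training run must finish within 5 minutes on one GPU.
- Commit a change via Git only if val_bpb improves; otherwise revert it.
```

Everything the agent is allowed to do, and the single number it is judged by, fits on one page; that brevity is what keeps hundreds of unsupervised experiments on the rails.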
We are moving away from ML research as a manual craft and toward an era of highly parallel, autonomous iteration. As agents continue to refine their own architectures, the 11% speedups of today will compound, accelerating the deployment of highly efficient, hyper-optimized models across the tech landscape.