Autoresearch: How AI is Autonomously Optimizing Its Own Training Code
Andrej Karpathy's new 'Autoresearch' project demonstrates how AI agents can autonomously run hundreds of experiments to optimize their own training code. The groundbreaking system achieved an 11% speed increase in just 48 hours, signaling a massive shift toward self-improving agentic workflows.
The Dawn of the "Loopy Era" in AI Development
For years, the artificial intelligence community has fixated on scaling laws—the idea that pumping more data and compute into neural networks will reliably yield better models. But a subtle yet profound shift is underway. The focus is expanding from optimizing model weights to optimizing the scaffolding that surrounds them. At the forefront of this shift is Autoresearch, an open-source project by AI luminary Andrej Karpathy that recently achieved an 11% speed increase in training code through purely autonomous iteration.
This is not just another coding assistant completing boilerplate functions. It is a glimpse into a future where AI systems systematically conduct empirical research, test hypotheses, and refine their own operational logic without human intervention.
The 48-Hour Experiment That Sparked a Movement
In early March 2026, Karpathy—a founding member of OpenAI and former Director of AI at Tesla—released a surprisingly lightweight repository. He tasked an autonomous AI agent with a singular goal: improve the training efficiency of a small language model.
Over a continuous 48-hour period, the agent ran without pause, executing 700 distinct experiments. It wasn't tweaking the AI's internal reasoning; it was modifying the Python training script itself. The results were striking:
- 20 Successful Optimizations: The agent independently discovered configurations that improved efficiency.
- 11% Speed Increase: When applied to a larger language model, these algorithmic tweaks resulted in a measurable reduction in training time.
While the tech press often sensationalizes AI milestones, the true significance here lies in the methodology. Karpathy demonstrated that a single, well-configured agent could independently discover measurable performance gains in days, a process that traditionally requires weeks of meticulous human labor.
Under the Hood: The "Modify-Verify-Keep" Loop
The elegance of Autoresearch lies in its simplicity. Instead of relying on complex, sprawling architectures, the system utilizes a tightly constrained "modify-verify-keep" loop.
Here is how the autonomous cycle functions:
- Hypothesis Generation: The agent reads a markdown specification (program.md) and proposes a modification to the core training script (train.py).
- Timed Execution: It launches a training run with a strict wall-clock budget (e.g., five minutes).
- Empirical Evaluation: The system checks a mechanical metric, typically validation bits per byte (val_bpb), where lower is better.
- Decision Making: If the metric improves, the change is committed. If it degrades, the change is discarded, and the agent tries a new approach.
This disciplined approach ensures the agent doesn't spiral into hallucinated code. It forces the AI to prove its worth empirically against real-world constraints.
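The cycle above can be sketched in a few lines of Python. This is an illustrative skeleton, not code from the Autoresearch repository: `propose_edit` and `timed_run` are hypothetical stand-ins for the agent's hypothesis step and the budgeted training run, and the random metric merely simulates a val_bpb readout.

```python
import random

def propose_edit(script: str) -> str:
    """Stand-in for the agent's hypothesis step: return a candidate
    variant of the training script (here, returned unchanged)."""
    return script

def timed_run(script: str, budget_s: int = 300) -> float:
    """Stand-in for a wall-clock-budgeted training run that reports
    validation bits per byte (val_bpb, lower is better)."""
    return random.uniform(3.0, 4.0)  # simulated metric

def modify_verify_keep(script: str, n_experiments: int) -> tuple[str, float]:
    """Run the modify-verify-keep loop and return the best script found."""
    best_script, best_bpb = script, timed_run(script)  # baseline measurement
    for _ in range(n_experiments):
        candidate = propose_edit(best_script)          # modify
        bpb = timed_run(candidate)                     # verify
        if bpb < best_bpb:                             # keep only improvements
            best_script, best_bpb = candidate, bpb
        # otherwise the change is discarded and the loop tries again
    return best_script, best_bpb
```

The key design choice is that the acceptance test is purely mechanical: a single scalar metric decides whether a change survives, which is what keeps the agent from accumulating plausible-looking but unverified edits.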
Beyond the Lab: Enterprise Adoption and Generalized Skills
The implications of Autoresearch extend far beyond machine learning laboratories. The pattern is already being adapted for enterprise environments, demonstrating immediate commercial value.
Shortly after Karpathy's release, Shopify CEO Tobias Lütke deployed a similar setup on internal company data. After letting the autonomous agent run overnight, it executed 37 experiments and delivered a 19% performance gain for an internal AI model.
The open-source community is now generalizing this "Karpathy Loop." Developers have built Claude Code skills that apply this autonomous iteration to entirely different domains:
- Fraud Detection: Automatically tuning hyperparameters for fraud-scoring models based on expected economic cost.
- Security Auditing: Iteratively patching code to reduce vulnerability counts measured by static analysis tools.
- API Optimization: Testing configurations to maximize throughput and minimize latency.
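To make the fraud-detection case concrete, here is a minimal sketch of the same keep-if-better pattern applied to a fraud-score threshold, using expected economic cost as the mechanical metric. The cost figures and data shape are invented for the example; no particular fraud library or skill implementation is assumed.

```python
FALSE_POSITIVE_COST = 5.0    # assumed cost of blocking a legitimate transaction
FALSE_NEGATIVE_COST = 200.0  # assumed cost of missing a fraudulent one

def expected_cost(threshold: float, scored: list[tuple[float, bool]]) -> float:
    """scored = [(fraud_score, is_actually_fraud), ...]."""
    cost = 0.0
    for score, is_fraud in scored:
        flagged = score >= threshold
        if flagged and not is_fraud:
            cost += FALSE_POSITIVE_COST   # blocked a good transaction
        elif not flagged and is_fraud:
            cost += FALSE_NEGATIVE_COST   # let fraud through
    return cost

def tune_threshold(scored, candidates):
    """Each candidate threshold is one 'experiment'; keep the cheapest."""
    best_t, best_cost = None, float("inf")
    for t in candidates:
        c = expected_cost(t, scored)      # verify against the metric
        if c < best_cost:                 # keep only improvements
            best_t, best_cost = t, c
    return best_t, best_cost
```

Swapping the metric (val_bpb, dollars of expected loss, vulnerability count, request latency) is all it takes to retarget the loop, which is why the pattern generalizes across these domains.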
The Agent Evidence Gap: New Governance Challenges
While the technical achievements of autonomous optimization are undeniable, they introduce new complexities in AI governance. Security researchers refer to this as the "Agent Evidence Gap."
In a system running hundreds of experiments overnight, the only record of the agent's decision-making process is its self-reported log. When an AI agent modifies production infrastructure or financial parameters, independent verification becomes crucial. If the agent discovers that bypassing an evaluation harness artificially inflates its success metric, it might commit the change while logging a benign explanation.
As we transition into an era where agent swarms collaboratively tune critical systems, establishing cryptographic proof of AI behavior—ensuring the agent actually did what its logs claim—will become a paramount engineering challenge.
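One building block for narrowing that gap is a tamper-evident experiment log, where each entry's hash covers both its contents and the previous entry's hash. The sketch below is an illustration of the general idea, not a mechanism used by Autoresearch; record fields and function names are invented.

```python
import hashlib
import json

GENESIS = "0" * 64  # sentinel hash for the first entry

def append_record(chain: list[dict], record: dict) -> list[dict]:
    """Append an experiment record whose hash binds it to the prior entry,
    so any later edit to an earlier record invalidates the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    payload = json.dumps(record, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    chain.append({"record": record, "prev": prev_hash, "hash": entry_hash})
    return chain

def verify_chain(chain: list[dict]) -> bool:
    """Recompute every hash from the genesis sentinel forward."""
    prev_hash = GENESIS
    for entry in chain:
        payload = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True
```

Note the limits: a hash chain only proves the log wasn't altered after the fact. Proving the agent actually ran the experiments its log describes would additionally require some form of attested execution, which is the harder engineering problem the passage points to.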
The Future of ML Engineering
Autoresearch does not render the human researcher obsolete; rather, it elevates their role. In this new paradigm, humans will spend less time manually editing Python files and tweaking learning rates. Instead, they will focus on designing the experiment space, defining strict constraints, and setting the precise mechanical metrics that guide the AI's autonomous exploration.
The shift from manual coding to autonomous, empirical optimization is no longer theoretical. The loop is closed, the agents are running, and the speed of AI development is about to accelerate yet again.