autoresearch

AI Research Automation | Free

autoresearch - AI Agents for LLM Training Experiments

Last updated May 21, 2026

What is autoresearch?

autoresearch is an experimental AI research automation tool from Andrej Karpathy that shows how coding agents can run small LLM training experiments without constant human editing. The repository is intentionally narrow: it uses a simplified single-GPU version of nanochat, a training script, a preparation script, and a program.md file that tells the agent how to behave.

The point is not to replace a full research lab. The point is to make the research loop explicit enough that an AI coding agent can propose a change, edit the training code, run a fixed experiment, inspect the metric, and decide whether the change helped. The default workflow uses a fixed five-minute wall-clock training budget so experiments stay comparable on the same machine. The metric highlighted in the README is validation bits per byte, where lower is better. That design matters because it prevents an agent from simply making an experiment longer or larger and calling it progress. The agent is expected to work inside the repo, modify train.py, respect prepare.py as stable setup code, and use program.md as the human-editable operating brief.

For builders, autoresearch is useful as a reference pattern for agentic experimentation. It demonstrates how to wrap a real objective, a bounded runtime, and a feedback signal around a codebase so an agent can iterate. That pattern applies beyond nanochat: evaluation-driven prompt work, architecture tests, hyperparameter sweeps, data-cleaning variants, and automated ablation studies can all borrow the same structure. Humans still own the goal, constraints, and review; the agent handles repetitive edit-run-measure cycles.

The tool is not a turnkey AutoML platform. The README assumes Python 3.10+, uv, a single NVIDIA GPU, and enough comfort with model training to understand failures. It has no hosted dashboard, no commercial support plan, and no guarantee that overnight agent runs will produce better models. Its value is educational and practical for advanced users who want to see a compact example of autonomous research loops with modern coding agents.

OpenTools lists autoresearch as a developer AI tool because it gives builders a concrete template for research agents: small surface area, measurable loop, hard time budget, and source code that can be inspected or modified directly. For evaluation, treat autoresearch as a builder-focused open-source project rather than a managed SaaS. Review the upstream README, license, install path, and issue activity before adopting it. Teams should test it in a disposable repository or development environment first, document the exact version they use, and keep production workflows behind normal code review, monitoring, and rollback practices.
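
The propose, run, measure, and accept-or-reject cycle described above can be sketched in a few lines of Python. This is a minimal illustration of the pattern, not code from the autoresearch repository: `Trial`, `research_loop`, `run_experiment`, and the scores in `FAKE_SCORES` are all hypothetical, and a real setup would replace the lookup with an actual fixed-budget training run and revert rejected edits (for example with version control).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Trial:
    change: str      # short description of the proposed edit
    val_bpb: float   # validation bits per byte after the run (lower is better)
    accepted: bool   # True if the change beat the best score so far


def research_loop(
    proposals: Iterable[str],
    run_experiment: Callable[[str], float],
    baseline_bpb: float,
) -> Tuple[float, List[Trial]]:
    """Try each proposed change once and keep it only if the metric improves."""
    best = baseline_bpb
    history: List[Trial] = []
    for change in proposals:
        score = run_experiment(change)   # hypothetical: one fixed-budget run
        accepted = score < best          # lower val_bpb counts as progress
        if accepted:
            best = score
        history.append(Trial(change, score, accepted))
    return best, history


# Made-up scores standing in for real training runs (illustration only).
FAKE_SCORES = {
    "wider MLP": 1.02,
    "lower learning rate": 0.97,
    "remove warmup": 0.99,
}

best, history = research_loop(
    FAKE_SCORES, FAKE_SCORES.__getitem__, baseline_bpb=1.00
)
for trial in history:
    print(trial)
```

Because every run is scored against the best result so far under the same budget, a change is accepted only when it improves the metric, which is the property the fixed experiment design is meant to protect.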

autoresearch's Top Features

Key capabilities that make autoresearch stand out.

Agent-oriented loop where code changes are proposed, run, measured, and accepted or rejected

Fixed five-minute training budget for comparable nanochat experiments

program.md file for human-written operating instructions to the coding agent

Validation bits-per-byte metric for evaluating training changes

Small Python codebase designed to be inspected and modified directly
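
As an illustration of the fixed wall-clock budget listed above, the sketch below stops training when a time limit is reached instead of after a set number of steps. It is an assumption about how such a budget could be enforced, not the repository's implementation; `train_with_budget`, `step_fn`, and the injectable `clock` parameter are hypothetical names.

```python
import time
from typing import Callable


def train_with_budget(
    step_fn: Callable[[], None],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Call step_fn repeatedly until the wall-clock budget is used up.

    Returns the number of steps completed. time.monotonic is the default
    clock because it is not affected by system clock adjustments.
    """
    start = clock()
    steps = 0
    while clock() - start < budget_seconds:
        step_fn()   # hypothetical: one optimizer step
        steps += 1
    return steps


if __name__ == "__main__":
    # Tiny demo with a 0.05 second budget and a step that just sleeps briefly.
    completed = train_with_budget(lambda: time.sleep(0.01), budget_seconds=0.05)
    print(f"completed {completed} steps within the budget")
```

With a budget like this, a faster model or a more efficient step gets more updates in the same time, so efficiency gains show up in the metric while simply training longer does not.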

Use Cases

Who benefits most from this tool.

AI researchers

Run small controlled LLM training experiments where an agent handles repetitive edit-run-measure loops.

Agent builders

Study a compact pattern for giving coding agents bounded objectives and measurable feedback.

ML educators

Show students how autonomous research loops work on a real but approachable training setup.

autoresearch's Pricing

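Free. autoresearch is an open-source repository with no commercial support plan; users run it in their own development and GPU environment.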

Frequently Asked Questions

Is autoresearch a hosted research platform?

No. It is an open-source GitHub repository that users run in their own development and GPU environment.

What hardware does autoresearch expect?

The README targets a single NVIDIA GPU and notes testing on H100, along with Python 3.10+ and uv.

What metric does the loop optimize?

The README highlights validation bits per byte, or val_bpb, where lower values are better.
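
For readers unfamiliar with the metric, bits per byte is commonly computed by converting the summed cross-entropy loss from nats to bits and dividing by the number of bytes in the evaluated text. The sketch below assumes that standard definition; the repository's own evaluation code may differ in detail, and `bits_per_byte` is a hypothetical helper name.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte.

    total_nll_nats: cross-entropy summed over all predicted tokens, in nats.
    total_bytes:    number of bytes (e.g. UTF-8) in the text being evaluated.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    total_bits = total_nll_nats / math.log(2)   # nats -> bits
    return total_bits / total_bytes


# 1000 bits of total loss spread over 500 bytes is about 2.0 bits per byte.
example = bits_per_byte(total_nll_nats=1000 * math.log(2), total_bytes=500)
print(round(example, 6))
```

Normalizing by bytes rather than tokens keeps the score comparable when a change alters the tokenizer or vocabulary size.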

Can I use agents other than Claude Code?

The README describes using Claude, Codex, or another coding agent inside the repo, as long as the agent can read instructions and edit files.