Recursive reports early results from automated AI research system

Recursive has released early results from an automated AI research system it says can carry out much of the research loop on its own. The company says the system has reached state-of-the-art performance on three benchmarks spanning model training speed, fixed-budget training quality, and GPU kernel optimization.

The announcement describes a setup that does more than search for a single better model. According to Recursive, the system proposes ideas, implements them, runs experiments, checks the results, and uses each outcome to decide what to try next. It can manage multiple research threads over long time horizons, preserve useful context from earlier experiments, and merge promising branches. It then validates whether an apparent gain is genuine rather than a reward hack or a random fluctuation.
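The last of those steps, telling a real gain from a random fluctuation, can be illustrated with a minimal sketch. The release does not describe how Recursive does this, so the function name `is_genuine_gain`, the two-standard-error threshold, and the convention that lower scores are better are all assumptions made for illustration. The sketch does not address reward hacks, which need inspection of what the code actually does rather than statistics.

```python
import math
import statistics


def is_genuine_gain(baseline_scores, candidate_scores, z_threshold=2.0):
    """Hypothetical check: does the candidate beat the baseline by more than noise?

    Each list holds one score per random seed. Lower is better (as with BPB).
    The threshold of 2 standard errors is an illustrative choice.
    """
    if len(baseline_scores) < 2 or len(candidate_scores) < 2:
        raise ValueError("need at least two seeds per arm to estimate noise")

    # Positive gain means the candidate's mean score is lower (better).
    gain = statistics.mean(baseline_scores) - statistics.mean(candidate_scores)

    # Standard error of the difference between two independent means.
    std_err = math.sqrt(
        statistics.variance(baseline_scores) / len(baseline_scores)
        + statistics.variance(candidate_scores) / len(candidate_scores)
    )
    if std_err == 0.0:
        return gain > 0.0
    return gain / std_err > z_threshold
```

A check like this rejects an improvement that shows up in the mean but is smaller than the seed-to-seed spread, which is why evaluating on several seeds matters.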

Recursive says the benchmarks were chosen because they matter in practice and provide fast feedback. The tasks focus on three levers that affect AI progress: training algorithms, training speed, and hardware efficiency. The company also said it is open-sourcing artifacts from the runs so other researchers can inspect and build on the results.

In one benchmark, based on Andrej Karpathy's NanoChat autoresearch setup, Recursive said its system beat the best public community result after cleaning up small reward hacks and evaluating across 10 random seeds. The task asks systems to train a small language model within a five-minute compute budget on one GPU. Recursive reported a score of 0.9109 bits per byte (BPB, where lower is better), compared with 0.9372 for the prior state of the art. It also said the system reached the quality of the original overnight NanoChat baseline about 1.3 times faster than the best community solution.
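For readers unfamiliar with the metric, bits per byte is the model's total cross-entropy on a text, expressed in bits, divided by the number of bytes in that text. The sketch below shows that conversion under the common convention that training code reports loss in nats. The function name and arguments are hypothetical, and the article does not say how the NanoChat harness computes the figure internally.

```python
import math


def bits_per_byte(total_nats, total_bytes):
    """Convert a summed cross-entropy loss in nats into bits per byte.

    total_nats:  sum of per-token negative log-likelihoods (natural log).
    total_bytes: number of bytes in the evaluated text (e.g. UTF-8 length).
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) converts nats to bits.
    return total_nats / (math.log(2) * total_bytes)
```

Because the denominator counts bytes rather than tokens, the metric stays comparable across models that use different tokenizers.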

Recursive also tested the system starting from a weaker baseline, a vanilla Transformer with AdamW. In that run, the system improved the model from 1.059 BPB to 0.9344 BPB on an NVIDIA B200 GPU, which the company said was still competitive with the public best result. Recursive noted that this does not prove the system independently rediscovered every technique used by the community, since the underlying models may already know many public methods. Even so, the company said the search process was able to assemble a strong training stack from an initial implementation with far fewer built-in optimizations.

The release highlights several techniques discovered or combined by the system. These include hashed bigram and trigram tables, which add sparse n-gram information into the Transformer's value stream through learned gates. Recursive said this gives the model a cheap way to capture local context without relying on slower alternatives. The company also said different layers used different hash functions to reduce collisions across layers.
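The idea can be sketched roughly as follows. This is a toy, pure-Python illustration and not Recursive's implementation: the class name `HashedNgramTable`, the multiplicative hash, the per-layer multiplier, and the single scalar gate are all assumptions. In a real model the table and the gate would be trained parameters.

```python
import math
import random


class HashedNgramTable:
    """Toy hashed n-gram embedding table with a scalar gate (hypothetical)."""

    def __init__(self, n, num_buckets, dim, layer_index, seed=0):
        self.n = n
        self.num_buckets = num_buckets
        # Assumed scheme: a different odd multiplier per layer, so the same
        # n-gram tends to land in different buckets in different layers.
        self.multiplier = 2 * (layer_index * 7919 + 12345) + 1
        rng = random.Random(seed + layer_index)
        self.table = [
            [rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_buckets)
        ]
        self.gate_logit = 0.0  # would be learned; sigmoid(0) = 0.5

    def bucket(self, ngram):
        """Hash a sequence of token ids to a bucket index."""
        h = 0
        for token in ngram:
            h = (h * self.multiplier + token + 1) % (2**61 - 1)
        return h % self.num_buckets

    def add_to_values(self, tokens, values):
        """Return a copy of `values` with gated n-gram embeddings added.

        Position t uses the n-gram ending at t, so no future tokens are read.
        Positions with fewer than n tokens of history are left unchanged.
        """
        gate = 1.0 / (1.0 + math.exp(-self.gate_logit))
        out = [row[:] for row in values]
        for t in range(self.n - 1, len(tokens)):
            emb = self.table[self.bucket(tokens[t - self.n + 1 : t + 1])]
            for d in range(len(emb)):
                out[t][d] += gate * emb[d]
        return out
```

The lookup costs one hash and one table read per position, which is the sense in which such tables are cheap. Collisions are unavoidable with a fixed number of buckets, so varying the hash per layer spreads them out rather than eliminating them.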

Other changes mentioned in the release include architectural adjustments, short-context memory changes, auxiliary losses, attention tweaks, optimizer changes, weight decay schedules, and compiler settings. In the vanilla Transformer run, Recursive said the system also converged on token shifting, byte-level feature embeddings, and weight averaging before evaluation.
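Weight averaging before evaluation is the simplest of these to illustrate. The release does not say which variant the system used, so the sketch below assumes a plain uniform average over saved checkpoints. The function name and the representation of a checkpoint as a dictionary of flat lists are hypothetical.

```python
def average_checkpoints(checkpoints):
    """Uniformly average a list of checkpoints (assumed variant).

    Each checkpoint maps a parameter name to a flat list of floats.
    All checkpoints must share the same names and shapes.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged = {}
    for name in checkpoints[0]:
        vectors = [ckpt[name] for ckpt in checkpoints]
        # zip(*vectors) groups the i-th element of every checkpoint together.
        averaged[name] = [sum(column) / len(vectors) for column in zip(*vectors)]
    return averaged
```

The averaged weights are used only for evaluation, while training continues from the unaveraged ones. An exponential moving average is a common alternative that avoids storing several checkpoints.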

A third benchmark, SOL-ExecBench, measured GPU kernel optimization across 235 kernels. Recursive said the system improved the mean score from 0.699 to 0.754, which it described as an 18% reduction in the gap to the estimated optimum. The company framed the result as evidence that the system can also help with lower-level hardware work, not only model training.
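The 18% figure can be checked with simple arithmetic if one assumes the estimated optimum corresponds to a score of 1.0. The article does not state that normalization, so it is an assumption here, as is the helper name `gap_closed`.

```python
def gap_closed(before, after, optimum=1.0):
    """Fraction of the remaining gap to `optimum` that was closed.

    Assumes higher scores are better and that `optimum` is the best
    achievable score (1.0 is an assumption, not stated in the article).
    """
    gap = optimum - before
    if gap <= 0:
        raise ValueError("baseline is already at or above the optimum")
    return (after - before) / gap


fraction = gap_closed(0.699, 0.754)
```

Under that assumption the starting gap is 0.301 and the improvement is 0.055, giving a fraction of about 0.183, which is consistent with the reported 18%.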

Recursive says the project is an early step toward automated AI research rather than a finished system. Still, the results suggest that a machine-driven loop can already generate and combine ideas that improve practical AI workloads, at least within tightly defined benchmark settings.