AI Engineering Automation Raises Fears of Agent ‘Slop’ From Weak Evaluations

AI automation of the engineering loop draws caution

A new discussion in the AI engineering community is highlighting a growing concern: while much of the software engineering loop can now be automated with AI agents, handing over too much control may create poor-quality output that is hard to detect. The warning centers on the gap between what agents can technically do and what humans should still oversee.

The conversation was prompted by a post from Lotte Verheyden, who shared an article arguing that AI can now eat much of the AI engineering loop. The post suggests that the full workflow, from building to evaluating and refining systems, can be automated in principle. But the broader message is more cautious. Automation alone does not guarantee quality, especially when the checks used to judge agent output are incomplete or unreliable.

The concern is not that agents cannot produce code or other engineering artifacts. Rather, it is that they can produce plausible results that pass weak tests without actually improving the underlying system. In that scenario, teams risk accumulating what critics are calling agent slop, a shorthand for low-value, superficially convincing output that looks productive but does not hold up under closer inspection.

That risk is especially acute in the evaluation stage of the AI development process. If an agent is allowed to generate code, run experiments, and assess its own performance, any flaws in the evaluation mechanism can quickly compound. A model may optimize for the wrong metric, misread results, or repeatedly reinforce mistakes that a human reviewer would catch. The more autonomous the loop becomes, the more damaging those errors can be.

The discussion reflects a broader tension in AI engineering. Teams want to use agents to speed up repetitive work, reduce manual effort, and iterate faster. At the same time, there is growing recognition that evaluation remains one of the hardest parts of building AI systems. When the measurement itself is weak, automated optimization can be misleading, even if it appears efficient on paper.

The article linked in Verheyden’s post appears to recommend a selective approach. Instead of giving agents control over everything, it argues that some tasks are suitable for automation while others should remain under human direction. That distinction appears to be the core message of the debate: agents can help, but they should not replace judgment where quality is difficult to verify.

For engineering teams, the implications are practical. Automation may work well for drafting code, generating tests, or handling routine experimentation. But evaluation design, deciding what counts as success, and interpreting ambiguous outcomes still require careful oversight. Without that, a team can end up producing more output without producing better systems.

The discussion also points to a familiar pattern in AI adoption. New tools often make it possible to automate processes faster than organizations can define safeguards for them. In AI engineering, that mismatch may be especially risky because the systems being built are themselves probabilistic and easy to overestimate.

For now, the message from the debate is less about slowing down and more about being precise. AI agents may be capable of taking on a larger share of the engineering loop, but the quality of the results will still depend on whether the human-created evaluation framework can reliably tell useful progress from convincing noise.