Three Ways Coding Agents Break (and How to Fix Them)
SetupBench tests coding agents on 93 real repositories across Python, TypeScript, Go, Rust, Java, and C++. The results: even the best agent succeeds only 62% of the time, uses over a million tokens, and wastes 38-69% of its actions on redundant work. The failures fall into three clear patterns.
1. Missing Test Tools
Agents read the README, install the runtime dependencies, and call it done. But READMEs are written for end-users, not contributors. The agent misses that tox.ini means tox needs to be installed, or that jest.config.js means the project uses Jest. When validation runs the test suite, the test runner is not found.
Fix: Rank files by setup relevance and read them early. Build a dependency graph that separates runtime, test, and build requirements. Install all of them, not just the obvious ones.
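The idea above can be sketched in a few lines. The marker-file-to-tool mapping below is a hypothetical, deliberately small example (file names like tox.ini and jest.config.js are real conventions, but a production agent would need a far larger table), and the runtime/test/build split follows the dependency-graph fix described here:

```python
from pathlib import Path

# Marker files that imply a tool the README may never mention.
# Illustrative subset only; a real agent would need a much larger table.
MARKER_TOOLS = {
    "tox.ini": ("tox", "test"),
    "jest.config.js": ("jest", "test"),
    "noxfile.py": ("nox", "test"),
    "Makefile": ("make", "build"),
    "requirements.txt": ("pip", "runtime"),
}

def required_tools(repo: str) -> dict[str, set[str]]:
    """Walk the repo and group implied tools by category."""
    graph: dict[str, set[str]] = {"runtime": set(), "test": set(), "build": set()}
    for path in Path(repo).rglob("*"):
        entry = MARKER_TOOLS.get(path.name)
        if entry:
            tool, category = entry
            graph[category].add(tool)
    return graph
```

An agent that installs everything in all three buckets, rather than only the "runtime" bucket, avoids the missing-test-runner failure at validation time.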
2. Hallucinated Config Values
Agents make up configuration values. They set ports that nobody asked for, add flags the project does not use, and install package versions that contradict the requirements file. This is not a rare edge case: 24-30% of failures come from invented constraints.
Fix: Require a citation for every config decision. The agent must point to a specific file, line, and quote that justifies the value. If it cannot find one, it flags the decision for human review instead of guessing.
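One way to enforce this rule is to make evidence part of the data model, so an uncited value cannot silently pass through. The class and field names below are illustrative, not from any real agent framework:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Citation:
    file: str    # e.g. "docker-compose.yml"
    line: int
    quote: str   # the exact text that justifies the value

@dataclass
class ConfigDecision:
    key: str
    value: Optional[str]
    citation: Optional[Citation]

    @property
    def needs_review(self) -> bool:
        # No evidence means no guess: escalate to a human
        # instead of hallucinating a value.
        return self.citation is None
```

A decision like `ConfigDecision("PORT", "5432", Citation("docker-compose.yml", 12, 'ports: ["5432:5432"]'))` carries its own justification; one constructed with `citation=None` is flagged for review rather than applied.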
3. Setup That Vanishes
The agent installs everything correctly in its session. You open a new terminal and nothing works. Tools are not on PATH. Environment variables are gone. Version managers are not sourced. This happens because agents modify the current shell but do not write to persistent config files like ~/.bashrc.
Fix: A four-step protocol. Write to persistent configs. Source them in the current session. Log every change. Verify everything works in a fresh shell subprocess before reporting success.
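A minimal sketch of the four steps, assuming the change is a single `export` line and that `bash` is available. The rc file is a parameter here for illustration; in real use it would be a persistent config like ~/.bashrc:

```python
import os
import subprocess
from pathlib import Path

def persist_and_verify(export_line: str, check_cmd: str,
                       rc_file: str, log: list[str]) -> bool:
    """Write, apply, log, then verify in a fresh shell."""
    rc = Path(rc_file)
    # 1. Write to the persistent config file.
    with rc.open("a") as f:
        f.write(export_line + "\n")
    # 2. Apply the change to the current session as well.
    key, _, value = export_line.removeprefix("export ").partition("=")
    os.environ[key] = os.path.expandvars(value)
    # 3. Log every change made.
    log.append(f"appended to {rc}: {export_line}")
    # 4. Verify in a fresh shell that sees only the rc file,
    #    not this process's environment.
    result = subprocess.run(
        ["bash", "-c", f"source {rc} && {check_cmd}"],
        capture_output=True, text=True,
        env={"PATH": os.defpath},
    )
    return result.returncode == 0
```

Passing a stripped-down `env` to the verification subprocess is the important part: if the check only passed because the current session's environment leaked in, the agent would report success that evaporates in the user's next terminal.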
Cutting Waste
Beyond the three failure modes, agents waste massive effort. They read the same file incrementally (40 lines, then 60, then 100, then the whole thing). They repeatedly re-check whether Python is installed even though nothing has changed since the last check. They read example files and benchmarks that have nothing to do with setup.
The fix is straightforward: read files completely the first time, track system state so you do not re-check, and skip low-relevance paths. This alone cuts wasted steps from 38-69% to under 15%.
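All three habits amount to missing memoization. A session-state cache along these lines (class and method names are illustrative) eliminates the repeat work:

```python
import shutil
from pathlib import Path

class SessionState:
    """Cache file contents and tool checks for the session."""

    def __init__(self) -> None:
        self._files: dict[str, str] = {}
        self._tools: dict[str, bool] = {}

    def read_file(self, path: str) -> str:
        # Read the whole file once; later requests hit the cache
        # instead of re-reading 40, then 60, then 100 lines.
        if path not in self._files:
            self._files[path] = Path(path).read_text()
        return self._files[path]

    def has_tool(self, name: str) -> bool:
        # Check each tool at most once per session.
        if name not in self._tools:
            self._tools[name] = shutil.which(name) is not None
        return self._tools[name]
```

Skipping low-relevance paths then falls out of the same ranking used for the test-tool fix: files the ranker scores near zero are simply never read.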
Where This Goes
These are solvable problems. Better file ranking catches test tools. Citation requirements eliminate hallucinations. Persistence protocols make setup durable. None of this requires a breakthrough in AI. It requires good engineering applied to well-understood failure modes.