
Why Coding Agents Fail at Setup

October 10, 2025

Coding agents look great on benchmarks where every dependency is pre-installed. Point them at a real repository and ask them to set it up from scratch, and they fall apart. SetupBench shows that even the best agents succeed only 62% of the time on environment setup. They miss test tools, invent config values, and install things that vanish when you open a new terminal.

Three Failure Modes

Missing test tools. Agents install runtime dependencies but skip test frameworks. They will run pip install -r requirements.txt but ignore the tox.ini file sitting right there. When validation then runs tox, the command is not found. This accounts for 17-26% of failures.

Hallucinated config. Agents invent configuration values. They will set a port to 53012 because they "inferred" it, when no such requirement exists anywhere in the repo. They will add flags that the project does not use. 24-30% of failures come from made-up constraints.

Non-persistent setup. Agents install tools that work in their session but disappear in a new terminal. They run pip install --user tox but never add ~/.local/bin to PATH in a persistent config file. 45% of session handoffs fail because of this.

Our Approach

For test tools, we use file ranking to find setup-critical files early, not just READMEs. Files like tox.ini, jest.config.js, and CI workflows get priority. We build a full dependency graph that separates runtime, test, and build dependencies.
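As a rough illustration of the ranking idea, here is a minimal sketch that scores file paths by how setup-critical they look. The pattern list, the weights, and the names SETUP_PATTERNS, score, and rank_files are all assumptions made for this example, not the actual ranking used in our system.

```python
from fnmatch import fnmatchcase

# Hypothetical weights: higher means more setup-critical.
# Patterns and numbers are illustrative assumptions only.
SETUP_PATTERNS = [
    ("tox.ini", 10),
    ("jest.config.js", 10),
    (".github/workflows/*.yml", 9),
    ("requirements*.txt", 8),
    ("package.json", 8),
    ("README*", 5),
]


def score(path: str) -> int:
    """Return the highest weight of any pattern that matches the path."""
    name = path.rsplit("/", 1)[-1]
    best = 0
    for pattern, weight in SETUP_PATTERNS:
        # Patterns containing a slash are matched against the full path,
        # all others against the file name only.
        target = path if "/" in pattern else name
        if fnmatchcase(target, pattern):
            best = max(best, weight)
    return best


def rank_files(paths):
    """Sort paths so the most setup-critical files come first."""
    return sorted(paths, key=lambda p: (-score(p), p))


if __name__ == "__main__":
    repo = ["src/main.py", "README.md", "tox.ini", ".github/workflows/ci.yml"]
    for path in rank_files(repo):
        print(score(path), path)
```

The point of the sketch is only the ordering: test and CI configuration outrank the README, so an agent reading files in this order sees tox.ini before it decides which dependencies to install.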

For hallucinations, every config decision must cite its source: the exact file, line, and quote that justifies the value. No citation, no action. Uncertain decisions get flagged for review instead of applied silently.
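The "no citation, no action" rule can be sketched as a gate that checks each proposed value against the repository text before it is applied. The types Citation and ConfigDecision, the functions verify_citation and triage, and the in-memory repo_files mapping are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Citation:
    file: str   # path of the file that justifies the value
    line: int   # 1-based line number within that file
    quote: str  # exact text expected on that line


@dataclass
class ConfigDecision:
    key: str
    value: str
    citation: Optional[Citation] = None


def verify_citation(c: Citation, repo_files: dict) -> bool:
    """Check that the quote really appears on the cited line of the cited file."""
    text = repo_files.get(c.file)
    if text is None:
        return False
    lines = text.splitlines()
    if not 1 <= c.line <= len(lines):
        return False
    return c.quote in lines[c.line - 1]


def triage(decisions, repo_files):
    """Split decisions into those safe to apply and those flagged for review."""
    to_apply, to_review = [], []
    for d in decisions:
        if d.citation is not None and verify_citation(d.citation, repo_files):
            to_apply.append(d)
        else:
            to_review.append(d)
    return to_apply, to_review
```

Under this sketch, a port of 53012 proposed without a citation, or with a quote that does not appear on the cited line, lands in the review list instead of being written to a config file.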

For persistence, we enforce a protocol: write to durable config files (~/.bashrc, /etc/profile.d/), source them in the current session, and verify in a fresh shell subprocess before marking setup as complete.
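The three steps can be sketched as follows. This is an approximation under stated assumptions: the function names are hypothetical, bash is assumed to be installed, and the "fresh shell" is modeled as a new bash process with a minimal environment that explicitly sources the config file.

```python
import os
import shutil
import subprocess
from pathlib import Path


def persist_path(rc_file: Path, directory: str) -> None:
    """Append a PATH export to rc_file unless the exact line is already there."""
    line = f'export PATH="{directory}:$PATH"'
    existing = rc_file.read_text() if rc_file.exists() else ""
    if line not in existing.splitlines():
        with rc_file.open("a") as fh:
            if existing and not existing.endswith("\n"):
                fh.write("\n")
            fh.write(line + "\n")


def apply_to_current_session(directory: str) -> None:
    """Mirror the change in this process, standing in for sourcing the file."""
    os.environ["PATH"] = directory + os.pathsep + os.environ.get("PATH", "")


def verify_in_fresh_shell(tool: str, rc_file: Path) -> bool:
    """Return True if tool resolves in a new bash process that sources rc_file."""
    bash = shutil.which("bash") or "bash"
    # Minimal environment, so nothing leaks in from the current session.
    clean_env = {"HOME": str(Path.home()), "PATH": "/usr/bin:/bin"}
    result = subprocess.run(
        [bash, "--noprofile", "--norc", "-c",
         'source "$1" && command -v "$2"', "_", str(rc_file), tool],
        env=clean_env,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

Setup would be marked complete only when verify_in_fresh_shell returns True. One caveat with this approximation: some distributions ship a default ~/.bashrc that returns early in non-interactive shells, so a real check may need an interactive shell or a file under /etc/profile.d/ instead.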

Results

These changes raise test tooling success from 74-83% to 97%+, cut hallucinations from 24-30% to under 3%, and raise persistence reliability from 55% to 98%+. There is still work to do, but the failure modes are well understood and solvable.