Boiling Agent Syndrome

“Even if it’s not your fault, it’s your responsibility.”

An agent runs a test suite. Some tests fail. The agent checks the commit history, notes the failures predate its changes, and reports: pre-existing failures, not caused by this task. Task: complete.

Technically correct. This is also how you end up with an always-red CI pipeline.

The always-red pipeline is one of the more reliable indicators that a team has normalized failure. It starts small — one flaky test, one dependency that breaks on Fridays, one suite skipped “temporarily” during a refactor. The refactor ships. The suite stays skipped. Someone adds a feature. New tests fail. Pre-existing, say the logs. The dashboard turns a shade redder. Eventually, nobody merges without manually checking which failures are “real” and which are “expected.” The real ones are a subset of the expected ones that someone happens to care about this week.

This is broken windows theory applied to codebases: once CI is red, nobody notices when it gets redder. The baseline shifts. The question stops being are tests passing? and becomes are more tests failing than last time?

Agents accelerate this. They’re trained on completing discrete tasks, so they think in discrete task scope. Each agent evaluates its contribution in isolation: my changes are correct, the tests I affected pass, the pre-existing failures aren’t mine. That’s accurate. Human teams have the same instinct — nobody wants to own someone else’s broken test — but social friction keeps it bounded. A senior engineer gets annoyed when CI has been red for three days and tracks down the cause unprompted. That’s not heroism. It’s the social cost of working on a shared system: you feel the embarrassment of the broken build. The difference between my code works and the repo is healthy requires thinking beyond the current task. That requires some sense of shared ownership, which requires some incentive to care. We give agents none of that. Define task, complete task, exit. The repository is somebody else’s problem.

The fix is defense in depth, not incentives. Make CI health a condition of the Definition — the harness spec includes repo health, not just task completion. Enforce it: pre-existing failures become blockers, not background noise. If CI was red before the task started, fixing CI is part of the task — not as a courtesy, but as a condition of completion. Audit what slips through. And when it does, Resolve the root cause: fix the process that allowed failure to accumulate, not just the instance.

The system that makes shared ownership real doesn’t rely on agents caring. It makes caring irrelevant by making failure a blocker.

Pre-existing failures on the branch this draft lives on: one. Fixed before merging. Worth noting: it wasn’t my fault.


Related reading:

  • Most Rules Exist for a Reason — on why a failing test is a fence worth understanding before you skip it
  • Oops I Did It Again, I Forgot –dry-run — on building systems where failures can’t be quietly passed along
  • Death by a Thousand Tests — on what an always-red CI pipeline is actually costing you
  • Pull the Cord — on the maintenance work that prevents the always-red pipeline in the first place