The Push Bike Harness

Models now write most of the code on capable teams. The failures that remain aren't code-quality failures; the model can write good code. They're definition failures. Nobody told the agent precisely enough what "done" meant, and it shipped something that compiled, ran, passed tests, and was wrong.

That's the part the harness is supposed to handle. Pre-execution gates, sandboxes, blocking checks, lint hooks, branch protections. The harness is what turns "this model has a 5% failure rate" into "this model ships."

The problem is that harnesses go stale. The agent gets better. The model that was reckless six months ago is careful now. The check you wrote to stop a specific failure mode is still firing, except the failure mode hasn't been seen in months, and the check is now blocking a perfectly reasonable action. The harness that protected the agent at version N becomes the thing slowing it down at version N+1.

So what makes a harness good? Not how thoroughly it prevents failure. A harness that prevents every fall also prevents the learning that comes from falling. What matters is whether the harness gets out of the way as the agent improves, or quietly keeps the agent dependent on a scaffold long after it could ride on its own.

Which, conveniently, is the same question parents face when teaching a kid to ride a bike.

There are roughly three approaches.

Training wheels keep the bike upright. No falls. Also no balance: the kid is leaning on the training wheels, not balancing. When the wheels come off, they have to learn from scratch. The skill they spent six months practicing turns out to be the wrong skill. This is a kind of negative transfer: the practiced skill quietly works against the one you actually need.

Bubble wrap is the over-engineered version. Wrap the bike. Wrap the kid. Wrap the bubble wrap. The kid never falls because the kid never moves.

Push bikes — no pedals, just two wheels, the ground, and the kid's feet — skip the training-wheel stage entirely. The kid solves the actual problem (balance) immediately. There are scrapes. Studies of children who learn on balance bikes find they ride independently as much as two years earlier than kids on training wheels, because by the time pedals show up, the hard part is over. Add brakes. Add gears. Each addition is trivial because the foundation isn't being relearned.

Most agent harnesses are training wheels. They prevent the falls. They also teach the agent, and the engineers building it, to depend on a scaffold that won't survive contact with the next model.

A push-bike harness is the one I'm trying to build. Less about preventing every fall, more about catching them, learning from them, and shrinking the scaffold over time. The structure I keep coming back to is DEAR.

Define. What does "done" mean for this task, in machine-checkable terms? Not "the code looks right." Predicates that can be evaluated automatically. If "done" isn't defined, every other phase is guesswork.
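A minimal sketch of what "machine-checkable" could look like in practice. The TaskState fields and the three predicates are hypothetical, chosen only for illustration; a real task would carry its own.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical snapshot of a finished task; field names are illustrative.
@dataclass(frozen=True)
class TaskState:
    tests_passed: bool
    uncommitted_files: int
    branch_pushed: bool

@dataclass(frozen=True)
class Predicate:
    name: str
    check: Callable[[TaskState], bool]

# "Done" is the conjunction of named predicates that can be evaluated automatically.
DONE = [
    Predicate("tests_green", lambda s: s.tests_passed),
    Predicate("worktree_clean", lambda s: s.uncommitted_files == 0),
    Predicate("branch_pushed", lambda s: s.branch_pushed),
]

def unmet(state: TaskState, predicates: list[Predicate] = DONE) -> list[str]:
    """Return the names of failing predicates; an empty list means done."""
    return [p.name for p in predicates if not p.check(state)]
```

Returning the names of the failing predicates, rather than a bare boolean, gives the agent something specific to fix and gives the later phases something specific to count.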

Enforce. Block known failure modes at execution time. Hooks, circuit breakers, branch protection. Cheap, synchronous, mandatory. This is the layer that stops the agent from rm -rf-ing your home directory because it misread a path.
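As a rough sketch of that kind of hook, assuming the harness can see a shell command as a string before it runs: the function below refuses destructive commands whose targets fall outside a workspace directory. The allow_command name and the DESTRUCTIVE set are assumptions made for this example, and the parsing is deliberately naive (it does not handle sudo, pipes, or shell expansion).

```python
import shlex
from pathlib import Path

DESTRUCTIVE = {"rm", "rmdir", "shred"}  # illustrative, not exhaustive

def allow_command(command: str, workspace: Path) -> bool:
    """Return False when a destructive command targets a path outside workspace."""
    argv = shlex.split(command)
    if not argv or argv[0] not in DESTRUCTIVE:
        return True
    root = workspace.resolve()
    for arg in argv[1:]:
        if arg.startswith("-"):
            continue  # skip flags such as -rf
        target = Path(arg).expanduser()
        if not target.is_absolute():
            target = root / target  # relative paths are read against the workspace
        target = target.resolve()
        # Block the workspace root itself and anything not strictly inside it.
        if target == root or root not in target.parents:
            return False
    return True
```

The check is cheap enough to run on every command, which is the property that earns it a place in the synchronous path.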

Audit. Catch what slipped past Enforce. Most checks don't belong in the synchronous path. The cost of running them on every action is higher than the cost of catching the failure later. Defer them. Run periodic sweeps. Audit is the layer that finds the things you didn't know to enforce.
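One way to picture the deferral, as a sketch under assumptions: actions append to a log, which is the only synchronous cost, and a periodic sweep runs the slower checks over whatever has arrived since the last sweep. AuditLog and the event dictionaries are hypothetical names for this example.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class AuditLog:
    """Append-only record of agent actions; appending is the only synchronous cost."""
    events: list[dict] = field(default_factory=list)
    cursor: int = 0  # index of the first event not yet swept

    def record(self, event: dict) -> None:
        self.events.append(event)

    def sweep(self, checks: dict[str, Callable[[dict], bool]]) -> list[tuple[str, dict]]:
        """Run every deferred check over events added since the last sweep."""
        findings = []
        for event in self.events[self.cursor:]:
            for name, passes in checks.items():
                if not passes(event):
                    findings.append((name, event))
        self.cursor = len(self.events)
        return findings
```

A sweep might be called from a timer or a scheduled job. Each finding pairs the failed check with the event that tripped it, which is the raw material the next phase works from.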

Resolve. A loop that runs against the harness's own failure data. When something slips through, fix the root cause, not just the instance. Add a Definition predicate, write the Enforce hook, expand the Audit. Each failure should leave the harness a little better, permanently.

The Resolve loop is what separates a push bike from training wheels. Training wheels never get smaller. DEAR is supposed to make itself less intrusive, one failure at a time.
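The shrinking half of that loop can be sketched as a report over hook history. This assumes the harness records, per hook, when it last blocked a real failure and how often it blocked a reasonable action. HookStats, the field names, and the 90-day quiet window are all assumptions for illustration, and the output is a list of candidates for a human or a Resolve pass to review, not an automatic deletion.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class HookStats:
    name: str
    last_true_positive: Optional[datetime]  # last time the hook blocked a real failure
    false_blocks: int                       # times it blocked a reasonable action

def retirement_candidates(hooks, now, quiet_for=timedelta(days=90)):
    """Names of hooks that no longer catch real failures but still get in the way."""
    stale = []
    for h in hooks:
        quiet = h.last_true_positive is None or now - h.last_true_positive > quiet_for
        if quiet and h.false_blocks > 0:
            stale.append(h.name)
    return stale
```

A hook that is quiet and never blocks anything reasonable costs little, so this sketch leaves it alone; the candidates are the ones that have stopped protecting and started obstructing.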

There's a question buried in Enforce that the bike analogy answers cleanly: where does the check run?

Push it left as far as it'll go. The cheapest moment to catch a failure is inside the local agent loop, the moment the agent is about to do the wrong thing. One agent running an expensive check locally is cheaper than every agent, every PR, every push, every retry, running it again in CI/CD. The further right the check moves, the more agents pay for it, the slower the feedback, and the more the agent depends on an external system to tell it what it should already know.

By the time a DEAR-shaped task reaches CI/CD, every check should already run green. CI/CD is there as a safety net, not as the agent's verification loop. If the agent is uploading a PR without already being confident CI/CD will pass, the loop is shaped wrong.

Which makes nearly every CI/CD failure a Define failure. Flaky tests happen, but they're not the interesting case. The interesting question is: how did a PR go up for review without the relevant tests being run locally and confirmed green first? That's a job for Resolve. Either the predicate wasn't defined locally, or it was defined but not enforced before push, or the agent's loop didn't include it. Each one traces back to Define, to "done" not being pinned down tightly enough to evaluate before the agent claimed to be done.
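That triage can be written down as a small decision function. This is a sketch under the assumption that the harness knows three sets of check names: those defined locally, those required before push, and those the agent actually ran green. The Cause enum and the triage function are hypothetical names.

```python
from enum import Enum

class Cause(Enum):
    NOT_DEFINED = "predicate was never defined locally"
    NOT_ENFORCED = "predicate defined but not required before push"
    NOT_RUN = "predicate required but missing from the agent's loop"
    RAN_GREEN = "ran green locally; suspect flakiness or environment drift"

def triage(failed_check: str, defined: set[str], pre_push: set[str],
           ran_green: set[str]) -> Cause:
    """Classify a CI failure by where the local loop let it through."""
    if failed_check not in defined:
        return Cause.NOT_DEFINED
    if failed_check not in pre_push:
        return Cause.NOT_ENFORCED
    if failed_check not in ran_green:
        return Cause.NOT_RUN
    return Cause.RAN_GREEN
```

The first three outcomes each point at a specific fix in the local loop. Only the last one leaves the failure outside the agent's control.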

CI/CD as the agent's primary feedback loop is the training-wheels version of verification. External. After the fact. The agent learns the wheels are there, not balance. A push-bike harness moves verification into the agent's own loop, so the agent checks itself before anyone else has to.

It's tempting to read "the harness shrinks" as "the harness becomes less important." That's half right, and the half it gets wrong is the half that matters.

As models get better, Enforce shrinks. Prompt-injection guards, permission modes, command verification, "don't let the agent touch the network without asking." These are training wheels. They're being absorbed into the model's defaults. The model just doesn't rm -rf home anymore. Take the hooks out.

But Define grows. A model that does the right thing is only useful if "the right thing" is encoded somewhere. As agents take more autonomous action, the leverage of having "done" pinned down in machine-checkable predicates, not vibes, goes up, not down. Enforce was the harness's bulk. Define is its skeleton.

And Audit changes shape. It stops being "did the agent break something we already had a rule for" and becomes the surprise-detection layer, where you find the failure modes you didn't know to write rules for. The harness's job at maturity isn't to prevent every fall. It's to be the place where new falls become predicates for next time.

Resolve is the redistribution mechanism. It moves load from Enforce to Define as the model improves. Same harness, different shape.

There's another redistribution that's easier to miss. The standard story for agent reliability is that the bottleneck is validation: making sure the model didn't ship something wrong. But once you can validate, the next bottleneck is which model.

The cheapest model that gets a task right is rarely the most capable one available, and the most capable model is rarely the cheapest. Which one is right depends on the task, on how familiar the model is with the tools involved (a CLI it's seen a million times in training is a different problem from a CLI it's only seen in your README), and on what came out last week. The calculus shifts weekly from the model side and per-task from the tooling side. It's a moving target with too many dimensions for a human to track.

It's exactly the shape of problem agents are good at. Run model A most of the time, route some percentage of tasks to model B, compare cost-to-success, repeat. A/B testing at the workflow level: automatic, continuous, with thousands of tasks per day informing the next routing decision. The harness is in the right place to do this. The human isn't.
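A toy version of that loop, as a sketch rather than a routing system: most tasks go to a default model, a fixed share goes to a challenger, and the two swap roles once the challenger is cheaper per successful task. ModelRouter, the 10% share, and the minimum of 30 successes are assumptions for illustration. The success flag is assumed to come from the task's "done" predicates, and a real comparison would want a proper significance test instead of a raw threshold.

```python
import random
from collections import defaultdict

class ModelRouter:
    """Route most tasks to a default model and a fixed share to a challenger."""

    def __init__(self, default, challenger, challenger_share=0.1, rng=None):
        self.default, self.challenger = default, challenger
        self.challenger_share = challenger_share
        self.rng = rng or random.Random()
        self.cost = defaultdict(float)     # total spend per model
        self.successes = defaultdict(int)  # tasks whose "done" predicates all passed

    def pick(self):
        """Choose the model for the next task."""
        if self.rng.random() < self.challenger_share:
            return self.challenger
        return self.default

    def record(self, model, cost, success):
        self.cost[model] += cost
        self.successes[model] += int(success)

    def cost_per_success(self, model):
        wins = self.successes[model]
        return self.cost[model] / wins if wins else float("inf")

    def promote_if_cheaper(self, min_successes=30):
        """Swap roles once the challenger has enough successes and costs less per success."""
        c, d = self.challenger, self.default
        if (self.successes[c] >= min_successes
                and self.cost_per_success(c) < self.cost_per_success(d)):
            self.default, self.challenger = c, d
            return True
        return False
```

Failed tasks still add to a model's cost but not to its successes, so a cheap model that fails often does not look cheap here.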

The catch is you can't A/B test something you can't measure. Cost-to-intelligence optimization is gated on the Define layer. A team without good "done" predicates can run any model and won't know which one was better. The harness can't tell whether B beat A if "success" is a vibe.

Which is the same point coming back around. Define isn't just the safety net. It's the thing that lets the harness pick, on your behalf, the cheapest model that can actually do the work. A push-bike harness for model selection has the same shape as a push-bike harness for safety: it shrinks because it can measure, and once it can measure, the choices route themselves.

A concrete example of the smaller shape. Seventeen research sessions were stranded or disappeared over a few weeks. The training-wheels fix would have been more gates: warn before exit, double-prompt on cleanup, require human approval. We added one synchronous check (a pre-exit hook with five machine-checkable completion predicates) and one asynchronous one (a periodic audit that flags stranded branches before cleanup). Sessions lost since: zero. Same road, smoother surface, without locking the harness into a shape it'd have to grow out of.

Here's where the harness actually is. The Enforce layer is solid: hooks, circuit breakers, branch tracking. The Audit layer is partial. Events get logged, cross-checks run, the dashboard exists, but coverage is incomplete. The Resolve loop, the part that automatically improves the harness from its own failure data, is mostly aspirational. The plumbing is sketched. The automation isn't.

The harness has training wheels of its own. The earliest version had a five-role orchestration that cross-checked agent output on every task. That kind of scaffolding made sense when models were less reliable; now it mostly adds latency and rarely catches an error the model wouldn't have avoided on its own. It's the part of the harness I'd take off first. Deferred Enforcement is in the protocol; the harness doesn't always follow its own advice. Pot, kettle.

The dear-agent repository is one person's attempt to operationalize this. Offered as a starting point, not a prescription.

The ideal harness doesn't prevent falls. It catches them, learns from them, and teaches the agent how not to fall the same way twice. After enough cycles, nobody thinks about the guardrails, not because they're gone, but because the rider doesn't need them anymore.

What the rider does still need is for someone to decide where they're going. The harness can shrink to invisibility, the model can answer any question that's asked, and there's still an irreducible human job: holding the question. Deciding what "done" means before anyone, agent or otherwise, tries to be done. That part doesn't go away. It's the part that gets encoded into Define, every time.

Related reading:

Don't Hate the Agent, Hate the Process — on the process failures DEAR is designed to catch
Most Rules Exist for a Reason — on why the harness itself needs to understand its own rules
Death by a Thousand Tests — on Deferred Enforcement and when checks belong in the synchronous path
Oops I Did It Again, I Forgot --dry-run — on safety systems that let you go faster