I Don't Debug Code — I Debug Agreements
Systems fail at boundaries, not implementations. A practice for writing falsifiable contracts and debugging by auditing agreements instead of tracing execution.
On a Sunday afternoon, I merged a refactor and load tests started failing. The ArtifactStore was throwing write-contention errors under concurrency. Intermittent, load-dependent. The worst kind.
I didn't open the ArtifactStore implementation. I opened a file I'd written months earlier. A set of rules I'd started keeping for exactly this kind of moment.
textArtifactStore - Responsibility: persist trained artifacts - Invariant: no training or transformation logic - Boundary: called exclusively by orchestrator - If violated: storage layer gains business logic, concurrent writes from multiple callers
Then I checked the Trainer's entry:
textTrainer - Responsibility: execute distributed training - Invariant: stateless, no persistence logic - Boundary: must NOT call storage directly - If violated: training couples to storage, race conditions with orchestrator writes
I grepped the Trainer module for ArtifactStore and found a helper function added during an earlier LLM-assisted refactor: persist_checkpoint_if_present(). Well-meaning. Completely violated the boundary. The Trainer was calling storage directly, racing against the orchestrator's writes.
Ripped the call out. Routed checkpointing through the orchestrator. Reran the test.
Failures vanished. Twelve minutes.
A year ago, this would have taken most of a day. Tracing execution, adding logging, building a mental model of how the pieces interacted. Traditional debugging is archaeological. Dig until you find something.
This felt different. Less like detective work. More like auditing a contract.
But here's what the twelve-minute story doesn't show: the months of getting the contracts wrong.
The Contracts I Got Wrong First
I started keeping these files when I realized that the model was operating without the architecture I thought was obvious. At small scale, this worked fine: a utility function, a single-purpose script with clear inputs and outputs. But at system scale, responsibility drifted. Validators assumed things about persistence. Transformers re-checked schemas. The orchestrator accumulated business logic.
This wasn't about AI being bad at architecture. It was about AI refusing to hallucinate the parts I never wrote down.
The obvious diagnosis: the architecture only existed in shared context, and shared context doesn't survive scale. So I wrote it down.
My first attempt looked like this:
textDataValidator - Validates incoming data - Ensures data quality
Useless. "Validates" could mean schema checks, type coercion, sanitization, logging malformed records, enriching with defaults. When I gave this to the model, it picked whatever pattern it had seen most often. When I came back to debug, the contract told me nothing I couldn't infer from the class name.
Second attempt:
textDataValidator - Checks schema compliance - No side effects
Better. But "no side effects" is a wishlist, not a constraint. How would I know if it was violated? I'd have to read the implementation, which defeats the purpose.
Third attempt:
textDataValidator - Responsibility: schema validation only - Invariant: pure function (no I/O, no logging, no persistence) - Interface: Result<ValidatedData, ValidationError> - If violated: side effects leak into validation layer, failures become unobservable or non-deterministic
Now the contract is falsifiable. I can grep for I/O calls. I can check whether logging exists. I can verify the return type. The "if violated" clause tells me what symptom to look for.
The difference between a useless contract and a useful one: can you structurally audit it without opening the code?
How Boundaries Get Discovered
The module ownership table I ended up with looked clean:
| Module | Responsibility | Invariant |
|---|---|---|
| DataIngestor | Load raw data | No transformation or validation |
| DataValidator | Schema validation | Pure function, no I/O |
| FeatureTransformer | Transform to features | Assumes validated input |
| Trainer | Distributed training | Stateless, no persistence |
| ArtifactStore | Persist artifacts | No business logic |
| PipelineOrchestrator | Control flow | Owns no domain logic |
It didn't start that way.
First pass, I had five modules. The Trainer handled its own checkpointing. Seemed natural; checkpoints are a training concern. Except checkpointing meant persistence, which meant the Trainer needed to know about storage backends, which meant it accumulated configuration for S3 and GCS and local filesystem. By week three, the Trainer was 40% training logic and 60% storage abstraction.
I split out ArtifactStore. But then the Trainer was calling it directly, and so was the orchestrator, and the FeatureTransformer wanted to cache intermediate results there too. Concurrent writes. Race conditions. Same pattern as the Sunday bug, except back then I didn't have the contract that made it obvious.
The failure wasn't incorrect output. It was responsibility drift: modules quietly absorbing concerns that belonged elsewhere, each change locally reasonable, the aggregate slowly incoherent.
The boundary that finally worked: only the orchestrator calls ArtifactStore. Everyone else produces artifacts and hands them up. The orchestrator decides when and where to persist.
This boundary wasn't designed up front. It was discovered by watching where coupling kept reappearing despite refactors. The pattern: any module everyone wants to call directly is infrastructure, and infrastructure without a choke point becomes a concurrency bug factory.
In systems past a certain size, boundaries emerge from pain. You don't design them in the abstract. You notice where drift keeps happening, where coupling keeps recurring, where bugs keep hiding. Then you draw a line and make it explicit.
When This Breaks Down
The twelve-minute story makes this sound clean. It often isn't.
Gray areas. The Trainer produces metrics during training. Are metrics a "persistence concern" or a "training concern"? The contract said "stateless, no persistence logic." But logging metrics to a dashboard isn't really the same as writing model checkpoints, is it? I debated this for an hour, mostly because the contract was doing real work instead of hand-waving. Ended up with a carve-out: "metrics emission via callback is permitted; direct I/O is not." Ugly but functional. The invariant didn't disappear; it narrowed from "no effects" to "no effects that create hidden coupling."
Contracts that lie. Three months in, I found the DataIngestor doing light normalization: lowercasing strings, trimming whitespace. Contract said "no transformation." But is lowercasing a transformation or just cleaning? I'd written the ingestor myself and convinced myself it didn't count. The contract only works if you're honest when you write it.
Evolution pressure. Requirements change. A contract that made sense in month one becomes a straitjacket in month four. The PipelineOrchestrator originally "owned no domain logic." Then I needed conditional branching based on data characteristics. Is a routing rule domain logic or control flow?
I rewrote the contract twice before landing on language that fit. The rule that stuck: the orchestrator may inspect data to choose paths, but may not transform or interpret it.
Invariants that age. Some contracts are technically correct but operationally misleading. Latency expectations, partial failure semantics, performance characteristics under load. These don't violate the contract; they just make it irrelevant when it matters most.
Ownership. On a team, someone has to own each contract. Not "the team" in the abstract. A specific person who can say "no, that crosses the boundary." Without clear ownership, contracts become suggestions. In my experience, module owner works better than tech lead; the person closest to the code is most likely to notice drift.
I use "agreement" deliberately, but in practice these are contracts: written, falsifiable constraints with named owners—not social norms.
Overhead. Maintaining these files takes time. For a side project, probably not worth it. For a codebase with one contributor who holds it all in their head, definitely not worth it. The break-even point, in my experience, is when the system exceeds your working memory or when you're collaborating with agents (human or otherwise) who don't share your context.
Once collaborators stop sharing a mental model—whether remote teammates or LLMs—agreements stop being documentation and start being control.
The Cost Model
Explicit contracts are expensive to create and cheap to check.
Writing a good contract for a module takes an hour or two of real thought: what's the responsibility, what's the invariant, what are the violation symptoms. For a six-module system, that's a day of upfront work.
But checking a contract takes seconds. Grep for violations, verify interfaces, confirm invariants hold. The Sunday bug took twelve minutes. The second time I caught a similar violation, five minutes. Third time, three.
The real cost isn't writing contracts. It's defending them when velocity pressure pushes in the opposite direction. Without someone willing to say "no, that violates the boundary," they rot. With AI accelerating code velocity, they rot faster.
Debugging as Auditing
Here's another one, from a few weeks before the Sunday bug. The pipeline threw validation errors after the transform stage. Made no sense. Validation runs before transformation.
I checked the FeatureTransformer's contract:
textFeatureTransformer - Assumes input is already validated - Performs no validation itself
Grepped for validation logic. Found a helper: ensure_valid_schema(). The model had added a redundant check that used slightly different rules than the real validator. The redundant check was failing on edge cases the real validator allowed.
Responsibility drift again. The transformer had quietly absorbed validation concerns, trying to be helpful.
Deleted the helper. Errors stopped.
Same pattern. Don't trace execution first. Audit the agreements that were supposed to make execution boring.
In Practice
Everything that follows assumes a system past the break-even point: too large to hold in your head, maintained over months, with collaborators who don't share your context.
Delegation
Bounded prompts. When I delegate to the model, the prompt includes the contract, not just the task.
textBuild the ImageValidator module. Rules: - Single entry point: validate() - No I/O, no logging, no persistence - Returns ValidationResult (ok: true or ok: false with reason) - Must not import any other module from this project
The last line matters. "Must not import" is auditable. "Keep it focused" is not.
Git as sandbox. Exploration gets a branch. If the model thrashes (and it will), git reset --hard is cheaper than untangling. If the whole direction was wrong, delete the branch. This removes the emotional cost of letting AI try things that might not work.
bashgit checkout -b avatar-refactor # prompt, iterate, commit # if it thrashes: git reset --hard # if direction was wrong: git checkout main && git branch -D avatar-refactor
Enforcement
Tests as invariants. The agreements I care most about get pinned with tests at the seams. This one asserts that validation runs before storage. If a shortcut gets added, the test catches it.
typescriptit("runs validation before storage", async () => { const invalid = { userId: "u1", data: "xxx", mimeType: "text/plain", sizeBytes: 10 }; const storageSpy = vi.spyOn(Storage, "putObject"); await expect(uploadAvatar(invalid)).rejects.toThrow(); expect(storageSpy).not.toHaveBeenCalled(); });
The test doesn't verify behavior. It verifies that the boundary wasn't crossed.
AI as contract linter. Structural violations—bad imports, I/O where there shouldn't be—are detectable without reading code. Semantic drift is harder: a function that now logs, violating "no side effects." Static analysis can't catch that. AI can. I run a check on PRs that passes the contract and the code to the model:
textHere is the contract for DataValidator: [contract] Here is the current implementation: [code] Does the implementation violate any invariant? List violations only.
This closes the loop. Contracts stop being documentation that drifts and become enforcement that runs on every commit. The same model that can introduce violations can catch them, if you give it the contract to check against.
What This Actually Is
The debugging shift is from "where is the bug?" to "which agreement was violated?"
But that's the easy part. The hard part is writing agreements worth checking.
A contract is only useful if it's falsifiable—if you can check for violations without reading the implementation. It's only accurate if you're honest when the boundary doesn't quite fit. It's only stable if you update it when requirements evolve.
The contracts aren't the insight. The discipline of writing falsifiable constraints and checking them first: that's the practice. The contracts are just the artifact.
I started doing this because LLMs forced the issue. The model couldn't infer my architecture, so I had to write it down. Frontier tools have caught up since—memory, project files, persistent context. Models can remember now. But memory isn't architecture. They still can't infer boundaries you haven't articulated.
The ideas aren't new. What's new is that LLMs remove the safety net of shared context, forcing latent architecture to become explicit. And that collaborator ships code faster than you can review it.
Twelve minutes. That's what the Sunday bug took. Not because I'm fast, but because the contracts were already there, waiting to be checked.
The contracts weren't the insight. Writing them before I needed them was.