Harness Engineering: How AI Development Earns Its Place in Regulated Software

Most of the anxiety about AI-assisted development is aimed at the wrong layer. People argue about which model writes the best code, as if the model were the system. It isn’t. The convenient shorthand the field has settled on is Agent = Model + Harness: the model is one stochastic component inside a much larger apparatus — the environment it runs in, the tools it can call, the gates it must pass, the evidence it leaves behind. That apparatus is the harness, and building it well is a distinct engineering discipline. The signal is hard to ignore: teams have moved benchmark scores by double digits by changing the harness alone and never touching the model. In a regulated product — a medical device, an avionics subsystem, an industrial safety controller — the harness is not a productivity nicety. It is the thing that makes the agent admissible at all.

What the harness actually is

A coding agent left alone produces text. A coding agent inside a good harness produces change you can reason about. The difference is everything the harness adds around the generation step:

A deterministic environment — pinned dependencies, seeded data, reproducible builds. The agent may be non-deterministic; the ground it stands on must not be.
A constrained tool surface — the exact set of commands, files, and APIs the agent may touch, and the ones it may never touch.
Feedback loops — compilers, type checkers, contract validators, and tests that the agent reads and reacts to before a human ever sees the diff.
Verification gates — the pass/fail checkpoints that decide whether a change is allowed to proceed.
Observability — a complete record of what was generated, from what inputs, and why it was accepted.

In my earlier writing on spec-driven development the word harness kept appearing almost incidentally — import { runScenario } from "../harness". That was not an accident of naming. The harness is where the golden tests live, where the runtime probes fire, where traceability IDs get checked. Harness engineering is the work of treating all of that as a first-class system rather than a pile of scripts that accreted around a CI file.

// The agent runs inside the harness, not the other way around

flowchart LR
S[Spec + constraints] --> H
subgraph H[Harness]
  direction TB
  E[Deterministic env] --> A((Agent run<br/>pinned + logged))
  A --> F[Feedback loops<br/>compile / types / contracts]
  F --> A
  A --> G{Verification gates}
end
G -- fail --> D[Drift / evidence report]
D --> S
G -- pass --> R[Merge + evidence pack]
R -. runtime probes .-> D
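
The loop in the diagram can be sketched in a few lines of Python. This is a minimal sketch, not the author's harness: `run_harness`, `Check`, `RunResult`, and the stubbed agent are hypothetical names, and the agent is replaced by a plain function so the example stays self-contained.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical types for illustration; none of these names come from a real library.
Check = Callable[[str], Optional[str]]  # returns an error message, or None on success


@dataclass
class RunResult:
    accepted: bool
    attempts: int
    log: List[str] = field(default_factory=list)


def run_harness(generate: Callable[[str, List[str]], str],
                spec: str,
                feedback: List[Check],
                gates: List[Check],
                max_attempts: int = 3) -> RunResult:
    """Drive one agent run: generate, feed errors back, then apply the gates."""
    log: List[str] = []
    errors: List[str] = []
    for attempt in range(1, max_attempts + 1):
        candidate = generate(spec, errors)   # the agent sees the previous errors
        errors = [msg for check in feedback if (msg := check(candidate))]
        log.append(f"attempt {attempt}: {len(errors)} feedback error(s)")
        if errors:
            continue                         # loop back into the agent
        failures = [msg for gate in gates if (msg := gate(candidate))]
        log.extend(f"gate failed: {msg}" for msg in failures)
        return RunResult(accepted=not failures, attempts=attempt, log=log)
    log.append("feedback budget exhausted")
    return RunResult(accepted=False, attempts=max_attempts, log=log)


# Stubbed agent: the first draft lacks a return annotation, the second adds it.
def fake_agent(spec: str, errors: List[str]) -> str:
    return "def dose() -> int: return 1" if errors else "def dose(): return '1'"


def needs_annotation(code: str) -> Optional[str]:
    return None if "-> int" in code else "missing return annotation"


def no_forbidden_call(code: str) -> Optional[str]:
    return "calls eval" if "eval(" in code else None


result = run_harness(fake_agent, "REQ-DOSE-12", [needs_annotation], [no_forbidden_call])
print(result.accepted, result.attempts)
print(result.log)
```

The feedback checks run inside the loop and are cheap to retry; the gates run once on a candidate that already passed feedback, and their verdict is final for that run.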

The harness is a drift-control plant

The spec-driven development (SDD) posts framed drift as the central new bug class: the gap between what the spec says and what the system does, in its interpretation, edit, regeneration, and runtime flavours. The honest follow-up question is: where does drift get caught? The answer is always the same place: inside the harness. Golden tests catch regeneration drift at CI time. Conformance probes catch runtime drift after merge. A pre-merge check on orphan requirement IDs catches edit drift before review.
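
The orphan-ID check for edit drift is the simplest of the three to sketch. The version below assumes requirement IDs look like `REQ-DOSE-12` and are cited in source comments; the regex, the `find_orphans` name, and the sample inputs are illustrative assumptions, not the scheme used in the earlier posts.

```python
import re
from typing import Dict, Set

# Assumed ID shape, modelled on REQ-DOSE-12; adjust to the project's real scheme.
REQ_ID = re.compile(r"\bREQ-[A-Z]+-\d+\b")


def find_orphans(spec_text: str, sources: Dict[str, str]) -> Dict[str, Set[str]]:
    """Return IDs cited in code but absent from the spec, and vice versa."""
    in_spec = set(REQ_ID.findall(spec_text))
    in_code: Set[str] = set()
    for text in sources.values():
        in_code.update(REQ_ID.findall(text))
    return {
        "unknown_in_code": in_code - in_spec,    # code cites a requirement the spec lacks
        "untraced_in_spec": in_spec - in_code,   # requirement with no code citing it
    }


spec = "REQ-DOSE-12: cap the infusion rate.\nREQ-DOSE-13: log every override."
sources = {
    "pump.py": "# REQ-DOSE-12\ndef cap(rate): ...",
    "legacy.py": "# REQ-DOSE-9 (removed from spec)\ndef old(): ...",
}
orphans = find_orphans(spec, sources)
print(orphans)
```

Run as a pre-merge step, a non-empty result in either set would fail the check and name the requirement that needs attention.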

None of those techniques are interesting in isolation. What makes them work is that they share one machine: the same environment, the same traceability scheme, the same routing of a failure back to the requirement that owns it. Harness engineering is the discipline of building that machine so the techniques compose instead of contradicting each other. A team that bolts on golden tests but runs them against an unpinned environment hasn’t built a harness — it’s built a flaky alarm that everyone eventually mutes.
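
One way to keep golden tests from becoming that flaky alarm is to store an environment fingerprint with each golden record and refuse to compare when it differs. The sketch below is an assumption about how that could look; `env_fingerprint`, `check_golden`, and the golden-record layout are hypothetical.

```python
import hashlib
import json
import platform
from typing import Dict


def env_fingerprint(lockfile_text: str) -> str:
    """Hash the interpreter version together with the pinned dependency list."""
    payload = json.dumps(
        {"python": platform.python_version(), "lock": lockfile_text},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def check_golden(golden: Dict, actual_output: Dict, lockfile_text: str) -> str:
    """Return 'pass', 'drift', or 'env-mismatch' for one golden case."""
    if golden["env"] != env_fingerprint(lockfile_text):
        # A differing environment makes the comparison meaningless, so the
        # check reports that instead of a possibly spurious drift.
        return "env-mismatch"
    return "pass" if golden["expected"] == actual_output else "drift"


lock = "numpy==1.26.4\n"
golden = {"requirement": "REQ-DOSE-12",
          "env": env_fingerprint(lock),
          "expected": {"max_rate_ml_h": 120}}

print(check_golden(golden, {"max_rate_ml_h": 120}, lock))              # same env, same output
print(check_golden(golden, {"max_rate_ml_h": 125}, lock))              # same env, changed output
print(check_golden(golden, {"max_rate_ml_h": 120}, "numpy==2.0.0\n"))  # dependency moved
```

Separating "drift" from "env-mismatch" keeps the failure routable: the first goes back to the requirement owner, the second to whoever owns the environment.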

Why regulated software needs this most

Here is the inversion that makes harness engineering the lever for AI adoption in regulated domains: the agent’s worst property for compliance is non-determinism, and the harness’s entire job is to surround non-determinism with determinism and evidence.

Standards like IEC 62304 and ISO 13485 — and the FDA’s 2025 total-product-lifecycle draft guidance for AI-enabled device software — don’t ask “did a human write this?” They ask for a controlled, repeatable lifecycle with traceability from need to verified code, and an auditable record of who decided what. A well-built harness produces exactly those artifacts as a byproduct of running, not as documentation written afterward. Every gate the agent passes is a verification record. Every pinned generation is a design-history entry. The compliance burden moves from chasing the code with retrospective paperwork to maintaining a harness that emits evidence continuously.

The practical core is to make every agent run a controlled, signed event:

# harness/run-manifest.yaml — emitted on every agent invocation, archived to the DHF
run_id: gen-2026-05-31-REQ-DOSE-12-a47f
requirement: REQ-DOSE-12
safety_class: C                      # IEC 62304 — drives required rigor
inputs:
  spec_commit: 9f3c1a2               # controlled document SHA (ISO 13485 §4.2)
  agent_model: claude-opus-4-8
  prompt_hash: sha256:7b91…          # exact prompt, hashed
  seed: 42
forbidden_surface:                   # the agent must never touch these
  - src/dosing/calc_core.*           # Class C math, human-authored only
gates:
  - contract:   spec/contracts/dose.openapi.yaml   # passed
  - golden:     REQ-DOSE-12.golden.json            # passed  -> verification record V-DOSE-12
  - class_guard: no Class A→C import violations     # passed
  - reviewer:   independent, signed                 # required for Class C
result: accepted
evidence_pack: dhf/V-DOSE-12/
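
How such a manifest gets hashed and signed is not shown here. A minimal sketch, assuming an HMAC over the canonical JSON form of the manifest; a production harness would more likely use asymmetric signatures with managed keys, and `build_manifest`, `canonical`, `sign`, and `verify` are hypothetical names.

```python
import hashlib
import hmac
import json
from typing import Dict


def build_manifest(requirement: str, spec_commit: str, model: str,
                   prompt: str, seed: int) -> Dict:
    """Collect the inputs that identify one generation; the prompt is stored as a hash."""
    return {
        "requirement": requirement,
        "inputs": {
            "spec_commit": spec_commit,
            "agent_model": model,
            "prompt_hash": "sha256:" + hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
            "seed": seed,
        },
    }


def canonical(manifest: Dict) -> bytes:
    # Sorted keys and fixed separators so the same manifest always serializes the same way.
    return json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(manifest: Dict, key: bytes) -> str:
    return hmac.new(key, canonical(manifest), hashlib.sha256).hexdigest()


def verify(manifest: Dict, signature: str, key: bytes) -> bool:
    return hmac.compare_digest(sign(manifest, key), signature)


key = b"demo-key-not-for-production"   # assumption: real keys never live in source
manifest = build_manifest("REQ-DOSE-12", "9f3c1a2", "claude-opus-4-8",
                          prompt="Implement the dose cap per REQ-DOSE-12.", seed=42)
signature = sign(manifest, key)

tampered = json.loads(json.dumps(manifest))   # deep copy via a JSON round trip
tampered["inputs"]["seed"] = 43

print(verify(manifest, signature, key))   # untouched manifest
print(verify(tampered, signature, key))   # an edited field invalidates the signature
```

Hashing the prompt rather than storing it inline keeps the manifest small while still letting an auditor confirm that an archived prompt is the one that was used.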

A gate, in turn, is just code that refuses to let a change through and writes down why it did or didn’t:

# harness/gates/class_guard.py
# has_approved_contract, record_evidence, and GateFailure are assumed to be
# defined elsewhere in the harness; safety_class is assumed to order as A < B < C.
def check(diff, requirement):
    """Reject diffs where a higher-safety module pulls in lower-class symbols
    without an approved interface contract. Emits an audit line either way."""
    violations = [
        imp for imp in diff.new_imports
        # Flag imports whose class is lower than the module's own (e.g. C importing A).
        if imp.safety_class < requirement.safety_class
        and not has_approved_contract(requirement, imp)
    ]
    record_evidence(requirement.id, gate="class_guard",
                    passed=not violations, detail=violations)
    if violations:
        raise GateFailure(f"{requirement.id}: {violations} cross a safety-class boundary")
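
`record_evidence` is called in the gate above but not shown. One plausible shape, offered as an assumption rather than the author's implementation, is an append-only log in which each entry carries the hash of the previous one, so a later edit to any record is detectable. The `EvidenceLog` class and its field names are hypothetical, and timestamps are left out to keep the example deterministic.

```python
import hashlib
import json
from typing import Any, Dict, List


class EvidenceLog:
    """In-memory, hash-chained audit log (a real one would persist to the DHF)."""

    def __init__(self) -> None:
        self.entries: List[Dict[str, Any]] = []

    @staticmethod
    def _digest(body: Dict[str, Any]) -> str:
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()

    def record(self, requirement_id: str, gate: str, passed: bool, detail: Any = None) -> None:
        body = {
            "requirement": requirement_id,
            "gate": gate,
            "passed": passed,
            "detail": detail,
            "prev": self.entries[-1]["hash"] if self.entries else None,
        }
        self.entries.append({**body, "hash": self._digest(body)})

    def verify(self) -> bool:
        """Recompute every hash and check that each entry points at its predecessor."""
        prev = None
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if body["prev"] != prev or self._digest(body) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True


log = EvidenceLog()
log.record("REQ-DOSE-12", gate="contract", passed=True)
log.record("REQ-DOSE-12", gate="class_guard", passed=False, detail=["util.fastmath"])
print(log.verify())                 # chain is intact

log.entries[0]["passed"] = False    # simulate someone editing history
print(log.verify())                 # the first entry no longer matches its hash
```

Recording both passes and failures matters here: an auditor needs to see the gate ran, not only that nothing was rejected.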

Three properties make this harness audit-grade rather than merely tidy:

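Pinned, signed generations. The harness can answer “show me exactly what produced this binary”: model, prompt, spec SHA, seed. Because a fixed seed does not guarantee identical output from every model or serving stack, the generated artifact itself is archived alongside that record. Without it, an agent that emits different code on two runs undermines IEC 62304’s expectation of a controlled, repeatable lifecycle.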
A forbidden surface as a hard boundary. The agent is a tool of known limitation. The harness, not the prompt, enforces that it never authors dosing math or crypto primitives — and logs every attempt that hit the fence.
Gates that double as verification records. Each golden case maps 1:1 to a verification protocol entry. The test report is the V&V evidence pack, not a screenshot pasted into a ticket.
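
The forbidden surface is the easiest of the three properties to show in code. A minimal sketch, assuming the harness sees the list of paths a change touches before anything is written; `enforce_surface` and `ForbiddenSurfaceError` are hypothetical names, and the pattern is the one from the run manifest.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List

FORBIDDEN = ["src/dosing/calc_core.*"]   # pattern from the manifest's forbidden_surface


class ForbiddenSurfaceError(Exception):
    """Raised when a change touches a path the agent may never author."""


def enforce_surface(changed_paths: Iterable[str], forbidden: List[str],
                    audit: List[str]) -> None:
    """Raise if any changed path matches a forbidden pattern; log every hit."""
    # Note: fnmatch-style '*' also matches '/', so patterns are not segment-aware.
    hits = [p for p in changed_paths
            if any(fnmatchcase(p, pattern) for pattern in forbidden)]
    for path in hits:
        audit.append(f"BLOCKED agent write to {path}")
    if hits:
        raise ForbiddenSurfaceError(f"forbidden paths touched: {hits}")


audit_log: List[str] = []
enforce_surface(["src/ui/dose_view.py"], FORBIDDEN, audit_log)   # allowed, no error

try:
    enforce_surface(["src/ui/dose_view.py", "src/dosing/calc_core.c"],
                    FORBIDDEN, audit_log)
except ForbiddenSurfaceError as exc:
    print(exc)

print(audit_log)
```

A real gate would normalize paths first (resolving `..` segments and symlinks) so the fence cannot be sidestepped by spelling the same file differently.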

The roles don’t disappear — they own gates. The architect owns the class guard and the pinning policy; QA owns the golden and probe gates; the developer owns the orphan-ID check; the product owner owns whether an acceptance criterion is prose or an executable contract. The harness is where their accountabilities become enforceable instead of aspirational.

Trade-offs

Approach	Advantage	Drawback
Thin harness (lint + unit tests)	Cheap; familiar to any team	Non-determinism and drift pass straight through to review
Deterministic, pinned harness	Reproducible runs; audit can replay any generation	Real investment in environment hygiene and seed control
Evidence-emitting gates	V&V artifacts are a byproduct, not extra work	Gate authoring is upfront effort; weak gates give false comfort
Forbidden-surface enforcement	Keeps the agent out of safety-critical code by construction	Boundary maintenance; over-fencing slows legitimate work
No harness (raw agent output)	Fastest to start	Inadmissible in any regulated lifecycle; drift is invisible

The uncomfortable truth is that AI does not lower the bar for regulated software — it raises the importance of the apparatus around it. A model swap is a Tuesday. A harness that turns stochastic generation into pinned, gated, evidenced change is a year of engineering, and it is the only version of “AI in a medical device” that survives an audit. Build the harness first. The agent is the easy part.