Measuring AI Harness Engineering: JSON Formatting With vs Without Guardrails

Lead

Hi, I’m Pomarano.

In this post, I compare AI behavior on the same JSON-formatting task with and without a harness.
The question is simple: how much difference do guardrails make in real runs?

To keep this reproducible, I include prompts, evaluation rules, logs, and aggregated results.

What I mean by “harness”

In this article, a harness means execution guardrails such as:

Fixing output format (JSON schema)
Retrying when validation fails
Checking required keys and types
Logging failure reasons

The point is to stabilize output quality with verifiable steps, not just “better prompting.”

Task used for comparison (JSON formatting)

The model converts unstructured text into JSON.

Fixed input text

注文ID: A-1042
顧客名: 田中 花子
商品: ワイヤレスマウス, USB-Cケーブル
合計金額: 4280円
配送希望日: 2026-05-30
支払い: クレジットカード
備考: 領収書希望

The input is kept in Japanese because it is the test data. Its fields are, in order: order ID, customer name, items, total amount, requested delivery date, payment method, and note.

Expected JSON keys

order_id: string
customer_name: string
items: string[]
total_yen: number
delivery_date: string (YYYY-MM-DD)
payment_method: string
note: string

Experiment setup

Condition A: Without harness

Plain prompt only
No schema validation
No retries

Condition B: With harness

Apply the four guardrails listed under “Harness configuration used (Condition B)” below
Include JSON schema in the prompt (see B. Prompt with harness below)
Validate output against the schema; re-run up to 2 times on violation
Accept only the final valid output

Number of runs

3 runs per condition (6 total)

Metrics

Success rate (schema-compliant outputs)
Average execution time
Retry count
Format violation count

Harness configuration used (Condition B)

For Condition B, I combined four guardrails in Cursor chat with Claude. There was no automation script: I put the schema in the prompt, validated each output by hand, and planned to re-run the prompt whenever validation failed.

#	Setting	What it does
1	Output schema	JSON Schema embedded in the prompt (below)
2	Validation	Compare output against the schema (pass/fail)
3	Retry	Re-run the same prompt on failure (up to 2 retries)
4	Logging	Record success/fail, time, retries, and notes in a table

Output schema (JSON Schema)

This is the schema pasted into the prompt. Setting additionalProperties to false rejects any key that is not listed under properties, which includes Japanese keys such as 注文ID.

{
  "type": "object",
  "required": ["order_id","customer_name","items","total_yen","delivery_date","payment_method","note"],
  "properties": {
    "order_id": {"type": "string"},
    "customer_name": {"type": "string"},
    "items": {"type": "array", "items": {"type": "string"}},
    "total_yen": {"type": "number"},
    "delivery_date": {"type": "string", "pattern": "^\\d{4}-\\d{2}-\\d{2}$"},
    "payment_method": {"type": "string"},
    "note": {"type": "string"}
  },
  "additionalProperties": false
}

Validation (pass/fail rules)

Each run passes only if all of the following hold:

Root value is a JSON object (no prose or Markdown around it)
All 7 required keys exist with English key names
items is an array of strings
total_yen is a number (not the string "4280")
delivery_date matches YYYY-MM-DD
No extra keys (e.g. Japanese field names)
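
The runs in this post were checked by hand. As a rough sketch of how the same six rules could be automated with only the Python standard library, something like the following might work. The names validate_order, EXPECTED_TYPES, and DATE_RE are illustrative assumptions, not part of the original setup.

```python
import json
import re

# Expected keys and Python types, mirroring the schema in this post.
# (Hypothetical helper names; the original checks were manual.)
EXPECTED_TYPES = {
    "order_id": str,
    "customer_name": str,
    "items": list,
    "total_yen": (int, float),
    "delivery_date": str,
    "payment_method": str,
    "note": str,
}
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}", re.ASCII)


def validate_order(raw: str) -> list[str]:
    """Return a list of violations; an empty list means the run passes."""
    try:
        # Prose or Markdown fences around the JSON make parsing fail here.
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not parseable as JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["root value is not a JSON object"]

    problems = []
    for key in sorted(EXPECTED_TYPES.keys() - data.keys()):
        problems.append(f"missing required key: {key}")
    for key in sorted(data.keys() - EXPECTED_TYPES.keys()):
        problems.append(f"unexpected key: {key}")

    for key, expected in EXPECTED_TYPES.items():
        if key not in data:
            continue
        value = data[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for {key}: {type(value).__name__}")

    items = data.get("items")
    if isinstance(items, list) and not all(isinstance(i, str) for i in items):
        problems.append("items must contain only strings")
    date_text = data.get("delivery_date")
    if isinstance(date_text, str) and not DATE_RE.fullmatch(date_text):
        problems.append("delivery_date does not match YYYY-MM-DD")
    return problems
```

Returning the list of violations rather than a bare True/False keeps the failure reasons available for the log.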

Retry policy

Any validation failure → re-run the same prompt
Up to 2 retries per run (at most 3 attempts: the initial attempt plus 2 retries)
Only schema-compliant output counts as the final result
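
The policy above was applied manually. If it were scripted, a minimal sketch might look like this. Here generate stands in for whatever sends the prompt to the model and validate for any checker that returns a list of violations; both are assumptions, as is the name run_with_retries.

```python
from typing import Callable


def run_with_retries(
    generate: Callable[[], str],
    validate: Callable[[str], list[str]],
    max_retries: int = 2,
) -> dict:
    """Call generate() until validate() reports no problems or retries run out."""
    failures = []
    for attempt in range(max_retries + 1):  # initial attempt + retries
        output = generate()  # hypothetical model call: same prompt every time
        problems = validate(output)
        if not problems:
            return {
                "passed": True,
                "retries": attempt,
                "output": output,
                "failures": failures,
            }
        failures.append(problems)  # keep the reasons for the run log
    # Every attempt failed: no output is accepted as the final result.
    return {
        "passed": False,
        "retries": max_retries,
        "output": None,
        "failures": failures,
    }
```

A run that passes on the first attempt reports retries as 0, which matches how the retry count is recorded in this post.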

In all three Condition B runs, the schema prompt passed validation on the first attempt, so no retries were needed.

Logging

After each run I recorded:

Run number (1–3)
Condition (with / without harness)
Pass or fail
Execution time (approximate ms)
Retry count
Notes (Japanese keys, type mismatch, etc.)
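
I kept these records in a table by hand. For anyone who prefers a script, one possible shape for the same six fields is sketched below. RunRecord and to_tsv are hypothetical names, and the tab-separated layout is just one convenient choice.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class RunRecord:
    """One row of the run log (hypothetical structure)."""
    run: int
    condition: str
    passed: bool
    time_ms: int
    retries: int
    notes: str


def to_tsv(records: list[RunRecord]) -> str:
    """Serialize records as tab-separated text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=[f.name for f in fields(RunRecord)],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buf.getvalue()
```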

Prompts used

For Condition B, embedding the schema in the prompt is part of the harness.

A. Prompt without harness

The prompt was given in Japanese. In English it reads: “Format the following text as JSON. Be sure to return only JSON.”

次のテキストをJSONに整形してください。
必ずJSONだけを返してください。

注文ID: A-1042
顧客名: 田中 花子
商品: ワイヤレスマウス, USB-Cケーブル
合計金額: 4280円
配送希望日: 2026-05-30
支払い: クレジットカード
備考: 領収書希望

B. Prompt with harness

The prompt was given in Japanese. In English it reads: “Return the following text as JSON that satisfies the JSON schema below. Do not output anything other than JSON.”

次のテキストを、以下のJSON schemaを満たすJSONとして返してください。
JSON以外の文字は出力しないでください。

Schema:
{
  "type": "object",
  "required": ["order_id","customer_name","items","total_yen","delivery_date","payment_method","note"],
  "properties": {
    "order_id": {"type":"string"},
    "customer_name": {"type":"string"},
    "items": {"type":"array","items":{"type":"string"}},
    "total_yen": {"type":"number"},
    "delivery_date": {"type":"string","pattern":"^\\d{4}-\\d{2}-\\d{2}$"},
    "payment_method": {"type":"string"},
    "note": {"type":"string"}
  },
  "additionalProperties": false
}

Input:
注文ID: A-1042
顧客名: 田中 花子
商品: ワイヤレスマウス, USB-Cケーブル
合計金額: 4280円
配送希望日: 2026-05-30
支払い: クレジットカード
備考: 領収書希望

Representative outputs

A. Without harness (failure example)

{
  "注文ID": "A-1042",
  "顧客名": "田中 花子",
  "商品": ["ワイヤレスマウス", "USB-Cケーブル"],
  "合計金額": 4280,
  "配送希望日": "2026-05-30",
  "支払い": "クレジットカード",
  "備考": "領収書希望"
}

This is valid JSON, but it fails the required schema because the keys are in Japanese rather than the required English names such as order_id.
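
To make the mismatch concrete, here is a small sketch that parses the Condition A output and compares its keys with the required ones. The variable names are illustrative only.

```python
import json

REQUIRED_KEYS = {
    "order_id", "customer_name", "items", "total_yen",
    "delivery_date", "payment_method", "note",
}

# The Condition A output shown above, copied verbatim.
condition_a_output = """
{
  "注文ID": "A-1042",
  "顧客名": "田中 花子",
  "商品": ["ワイヤレスマウス", "USB-Cケーブル"],
  "合計金額": 4280,
  "配送希望日": "2026-05-30",
  "支払い": "クレジットカード",
  "備考": "領収書希望"
}
"""

data = json.loads(condition_a_output)  # parses fine: it is valid JSON
missing = sorted(REQUIRED_KEYS - set(data))  # required keys that are absent
extra = sorted(set(data) - REQUIRED_KEYS)    # keys the schema does not allow

print("missing:", missing)
print("extra:", extra)
```

All seven required keys end up in missing and all seven Japanese keys in extra, so the output violates both the required list and additionalProperties: false.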

B. With harness (success example)

{
  "order_id": "A-1042",
  "customer_name": "田中 花子",
  "items": ["ワイヤレスマウス", "USB-Cケーブル"],
  "total_yen": 4280,
  "delivery_date": "2026-05-30",
  "payment_method": "クレジットカード",
  "note": "領収書希望"
}

All required keys, types, and format constraints are satisfied.

Run log

run	condition	pass/fail	time (ms)	notes
1	without harness	fail	4000	Valid JSON, but Japanese keys (schema mismatch)
2	without harness	fail	3000	Same issue; required keys like `order_id` missing
3	without harness	fail	3000	Same issue; stable output but requirement unmet
1	with harness	pass	2000	Keys and types match schema
2	with harness	pass	4000	Schema-compliant, one-line compact JSON
3	with harness	pass	3000	Schema-compliant and repeatable

Aggregated results

metric	without harness	with harness	delta
Success rate	0% (0/3)	100% (3/3)	+100pt
Average execution time	3,333ms	3,000ms	-333ms
Retry count	0	0	±0
Format violations	3	0	-3
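
The figures in this table can be recomputed from the run log. The sketch below does that, assuming one format violation is counted per failed run. RUN_LOG and summarize are illustrative names.

```python
from statistics import mean

# (condition, passed, time_ms), copied from the run log above.
RUN_LOG = [
    ("without harness", False, 4000),
    ("without harness", False, 3000),
    ("without harness", False, 3000),
    ("with harness", True, 2000),
    ("with harness", True, 4000),
    ("with harness", True, 3000),
]


def summarize(condition: str) -> dict:
    """Aggregate the metrics for one condition."""
    rows = [row for row in RUN_LOG if row[0] == condition]
    passes = sum(1 for row in rows if row[1])
    return {
        "runs": len(rows),
        "success_rate_pct": 100 * passes / len(rows),
        "avg_time_ms": round(mean(row[2] for row in rows)),
        # Assumption: each failed run counts as one format violation.
        "format_violations": len(rows) - passes,
    }


for name in ("without harness", "with harness"):
    print(name, summarize(name))
```

The averages come out to 3333 ms without the harness and 3000 ms with it, matching the table.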

Takeaways

The biggest difference was spec compliance.
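Without a harness, all outputs looked “reasonable” but failed the schema: the plain prompt never names the expected keys, so the model kept the Japanese field names.
With a harness, all runs passed on the first attempt with consistent key names and types. Because no retry was ever triggered, the improvement measured here comes from the schema in the prompt; the validation step confirmed it.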

What improved:

Success rate: 0% (0/3) to 100% (3/3), on a small sample of three runs per condition
Verifiability: pass/fail is measurable by schema
Reproducibility: less output-format drift for downstream systems

What to keep in mind:

This input was simple and the times are approximate, so the 333 ms difference should not be read as a real speedup
Strong models can produce “good-looking JSON” even when it violates the requirements
Production setups still need checks beyond format (missing values, business rules)
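
As an illustration of that last point, the schema's date pattern accepts a string like 2026-02-30, which is not a real date. A sketch of a few checks that go beyond format might look like this. The specific rules (positive whole-yen total, non-empty items) are assumptions made for the example, not requirements from this experiment, and semantic_problems is a hypothetical name.

```python
from datetime import date


def semantic_problems(order: dict) -> list[str]:
    """Check an already schema-valid order against example business rules."""
    problems = []

    # The regex only checks the shape; this checks the calendar.
    try:
        date.fromisoformat(order["delivery_date"])
    except ValueError:
        problems.append("delivery_date is not a real calendar date")

    # Assumed rule: yen totals are positive whole numbers.
    total = order["total_yen"]
    if total <= 0 or (isinstance(total, float) and not total.is_integer()):
        problems.append("total_yen must be a positive whole number")

    # Assumed rule: an order has at least one item.
    if not order["items"]:
        problems.append("items is empty")

    return problems
```

These checks assume the order has already passed schema validation, so the keys exist and have the expected types.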

Harness engineering is less about a “magic accuracy boost” and more about making failures visible and outputs safe to use downstream.

Summary

On the same JSON formatting task with the same input, adding a harness changed the result from 0/3 to 3/3 schema-compliant outputs.
Moving from “it usually works” to “it can be validated and operated” is the practical value of harness engineering.

Next, I plan to run the same comparison on other tasks (code edits and constrained summarization).