Lead
Hi, I’m Pomarano.
In this post, I compare AI behavior on the same JSON-formatting task with and without a harness.
The question is simple: how much difference do guardrails make in real runs?
To keep this reproducible, I include prompts, evaluation rules, logs, and aggregated results.

What I mean by “harness”
In this article, a harness means execution guardrails such as:
- Fixing output format (JSON schema)
- Retrying when validation fails
- Checking required keys and types
- Logging failure reasons
The point is to stabilize output quality with verifiable steps, not just “better prompting.”
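To make the four guardrails concrete, here is a minimal Python sketch of how they could fit together. This is not the code used for the runs in this post: `run_with_harness`, `generate`, and `validate` are hypothetical names, and `generate` stands in for whatever model call you use.

```python
import json
import logging
from typing import Any, Callable, List, Optional, Tuple

logger = logging.getLogger("harness")


def run_with_harness(
    generate: Callable[[], str],            # hypothetical: returns raw model text
    validate: Callable[[Any], List[str]],   # hypothetical: returns violations
    max_retries: int = 2,
) -> Tuple[Optional[Any], int, List[str]]:
    """Retry `generate` until the output validates or retries run out.

    Returns (parsed output or None, retries used, logged failure reasons).
    """
    failures: List[str] = []
    for attempt in range(max_retries + 1):
        raw = generate()
        parsed = None
        try:
            parsed = json.loads(raw)
            errors = validate(parsed)
        except json.JSONDecodeError as exc:
            errors = [f"invalid JSON: {exc.msg}"]
        if not errors:
            return parsed, attempt, failures
        # Record every failure reason so rejected attempts stay visible.
        for err in errors:
            reason = f"attempt {attempt}: {err}"
            failures.append(reason)
            logger.warning(reason)
    return None, max_retries, failures
```

The return value keeps the failure reasons alongside the accepted output, so a run that passed only after a retry can still be told apart from one that passed on the first attempt.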
Task used for comparison (JSON formatting)
The model converts unstructured text into JSON.
Fixed input text
注文ID: A-1042
顧客名: 田中 花子
商品: ワイヤレスマウス, USB-Cケーブル
合計金額: 4280円
配送希望日: 2026-05-30
支払い: クレジットカード
備考: 領収書希望
Expected JSON keys
- order_id: string
- customer_name: string
- items: string[]
- total_yen: number
- delivery_date: string (YYYY-MM-DD)
- payment_method: string
- note: string
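The key list above translates fairly directly into a check. The following is a sketch using only the standard library; `EXPECTED_TYPES` and `validate_order` are names invented for illustration, and rejecting extra keys is an assumption that mirrors the `additionalProperties: false` setting in the schema used for condition B.

```python
import re
from typing import Any, List

# Expected keys and their Python-side types (names are illustrative).
EXPECTED_TYPES = {
    "order_id": str,
    "customer_name": str,
    "items": list,
    "total_yen": (int, float),
    "delivery_date": str,
    "payment_method": str,
    "note": str,
}
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def validate_order(data: Any) -> List[str]:
    """Return human-readable violations; an empty list means compliant."""
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    errors = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in data:
            errors.append(f"missing key: {key}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(data[key], bool) or not isinstance(data[key], expected):
            errors.append(f"wrong type for {key}")
    for key in data:
        if key not in EXPECTED_TYPES:
            errors.append(f"unexpected key: {key}")
    items = data.get("items")
    if isinstance(items, list) and not all(isinstance(i, str) for i in items):
        errors.append("items must contain only strings")
    delivery = data.get("delivery_date")
    if isinstance(delivery, str) and not DATE_PATTERN.fullmatch(delivery):
        errors.append("delivery_date is not YYYY-MM-DD")
    return errors
```

Returning a list of reasons rather than a single boolean makes it possible to log exactly why an output was rejected.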
Experiment setup
Condition A: Without harness
- Plain prompt only
- No schema validation
- No retries
Condition B: With harness
- Explicit JSON schema
- Retry up to 2 times on schema violation
- Only final valid output is accepted
Number of runs
- 3 runs per condition (6 total)
Metrics
- Success rate (schema-compliant outputs)
- Average execution time
- Retry count
- Format violation count
Prompts used
A. Prompt without harness
次のテキストをJSONに整形してください。
必ずJSONだけを返してください。
注文ID: A-1042
顧客名: 田中 花子
商品: ワイヤレスマウス, USB-Cケーブル
合計金額: 4280円
配送希望日: 2026-05-30
支払い: クレジットカード
備考: 領収書希望
B. Prompt with harness
次のテキストを、以下のJSON schemaを満たすJSONとして返してください。
JSON以外の文字は出力しないでください。
Schema:
{
"type": "object",
"required": ["order_id","customer_name","items","total_yen","delivery_date","payment_method","note"],
"properties": {
"order_id": {"type":"string"},
"customer_name": {"type":"string"},
"items": {"type":"array","items":{"type":"string"}},
"total_yen": {"type":"number"},
"delivery_date": {"type":"string","pattern":"^\\d{4}-\\d{2}-\\d{2}$"},
"payment_method": {"type":"string"},
"note": {"type":"string"}
},
"additionalProperties": false
}
Input:
注文ID: A-1042
顧客名: 田中 花子
商品: ワイヤレスマウス, USB-Cケーブル
合計金額: 4280円
配送希望日: 2026-05-30
支払い: クレジットカード
備考: 領収書希望
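One way to keep the prompt and the validator from drifting apart is to hold the schema as data and render prompt B from it. The sketch below assumes hypothetical names (`ORDER_SCHEMA`, `build_prompt`); the instruction sentences and the schema are copied from prompt B above, and nothing here calls a model.

```python
import json

# Same schema as in prompt B, held as a Python dict (name is illustrative).
ORDER_SCHEMA = {
    "type": "object",
    "required": [
        "order_id", "customer_name", "items", "total_yen",
        "delivery_date", "payment_method", "note",
    ],
    "properties": {
        "order_id": {"type": "string"},
        "customer_name": {"type": "string"},
        "items": {"type": "array", "items": {"type": "string"}},
        "total_yen": {"type": "number"},
        "delivery_date": {"type": "string", "pattern": r"^\d{4}-\d{2}-\d{2}$"},
        "payment_method": {"type": "string"},
        "note": {"type": "string"},
    },
    "additionalProperties": False,
}


def build_prompt(input_text: str, schema: dict = ORDER_SCHEMA) -> str:
    """Render the harness prompt from the schema and the raw input text."""
    # ensure_ascii=False keeps any non-ASCII schema text readable in the prompt.
    schema_text = json.dumps(schema, ensure_ascii=False, indent=2)
    return (
        "次のテキストを、以下のJSON schemaを満たすJSONとして返してください。\n"
        "JSON以外の文字は出力しないでください。\n"
        f"Schema:\n{schema_text}\n"
        f"Input:\n{input_text}"
    )
```

Because `json.dumps` handles the escaping, the date pattern is written once as a raw string and appears in the prompt with the doubled backslashes that JSON requires.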
Representative outputs
A. Without harness (failure example)
{
"注文ID": "A-1042",
"顧客名": "田中 花子",
"商品": ["ワイヤレスマウス", "USB-Cケーブル"],
"合計金額": 4280,
"配送希望日": "2026-05-30",
"支払い": "クレジットカード",
"備考": "領収書希望"
}
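This is valid JSON, but it fails the required schema: the keys are the Japanese labels from the input rather than the expected English keys such as order_id. Prompt A never names the expected keys, so the model had nothing to match them against.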
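The mismatch can be made explicit with a set difference over key names. This is a small illustrative sketch; `REQUIRED_KEYS` and `diff_keys` are hypothetical names, and the sample output is the failure example shown above.

```python
import json

REQUIRED_KEYS = {
    "order_id", "customer_name", "items", "total_yen",
    "delivery_date", "payment_method", "note",
}

# The condition A failure example, as raw model output.
FAILURE_OUTPUT = """
{
  "注文ID": "A-1042",
  "顧客名": "田中 花子",
  "商品": ["ワイヤレスマウス", "USB-Cケーブル"],
  "合計金額": 4280,
  "配送希望日": "2026-05-30",
  "支払い": "クレジットカード",
  "備考": "領収書希望"
}
"""


def diff_keys(raw: str):
    """Return (missing required keys, unexpected keys) for a JSON object."""
    actual = set(json.loads(raw))
    return sorted(REQUIRED_KEYS - actual), sorted(actual - REQUIRED_KEYS)


missing, unexpected = diff_keys(FAILURE_OUTPUT)
print("missing:", missing)
print("unexpected:", unexpected)
```

For this output, all seven required keys are reported missing and all seven Japanese keys are reported unexpected, even though the values themselves are correct.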
B. With harness (success example)
{
"order_id": "A-1042",
"customer_name": "田中 花子",
"items": ["ワイヤレスマウス", "USB-Cケーブル"],
"total_yen": 4280,
"delivery_date": "2026-05-30",
"payment_method": "クレジットカード",
"note": "領収書希望"
}
All required keys, types, and format constraints are satisfied.
Run log
| run | condition | pass/fail | time (ms) | retries | notes |
|---|---|---|---|---|---|
| 1 | without harness | fail | 4000 | 0 | Valid JSON, but Japanese keys (schema mismatch) |
| 2 | without harness | fail | 3000 | 0 | Same issue; required keys like order_id missing |
| 3 | without harness | fail | 3000 | 0 | Same issue; stable output but requirement unmet |
| 1 | with harness | pass | 2000 | 0 | Keys and types match schema |
| 2 | with harness | pass | 4000 | 0 | Schema-compliant, one-line compact JSON |
| 3 | with harness | pass | 3000 | 0 | Schema-compliant and repeatable |
Aggregated results
| metric | without harness | with harness | delta |
|---|---|---|---|
| Success rate | 0% (0/3) | 100% (3/3) | +100pt |
| Average execution time | 3,333ms | 3,000ms | -333ms |
| Retry count | 0 | 0 | ±0 |
| Format violations | 3 | 0 | -3 |
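The aggregated figures can be recomputed from the run log. A short sketch: the tuple layout and the `aggregate` helper are my own naming, and counting each failed run as one format violation is an assumption that matches the table above.

```python
from statistics import mean

# (condition, passed, time_ms, retries), copied from the run log.
RUN_LOG = [
    ("without harness", False, 4000, 0),
    ("without harness", False, 3000, 0),
    ("without harness", False, 3000, 0),
    ("with harness", True, 2000, 0),
    ("with harness", True, 4000, 0),
    ("with harness", True, 3000, 0),
]


def aggregate(condition: str) -> dict:
    """Compute the four reported metrics for one condition."""
    rows = [r for r in RUN_LOG if r[0] == condition]
    passes = sum(1 for r in rows if r[1])
    return {
        "success_rate_pct": 100 * passes / len(rows),
        "avg_time_ms": round(mean(r[2] for r in rows)),
        "retries": sum(r[3] for r in rows),
        # Assumption: one violation per failed run.
        "format_violations": len(rows) - passes,
    }


for name in ("without harness", "with harness"):
    print(name, aggregate(name))
```

With only three runs per condition, the averages are sensitive to a single slow or fast run, which is worth keeping in mind when reading the time delta.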
Takeaways
The biggest difference was spec compliance.
Without a harness, all outputs looked “reasonable” but failed schema requirements.
With a harness, all runs passed on the first attempt with consistent key names and types, so the retry path was never exercised in this experiment.
What improved:
- Success rate: 0% (0/3) to 100% (3/3)
- Verifiability: pass/fail is measurable by schema
- Reproducibility: less output-format drift for downstream systems
What to keep in mind:
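- This input was simple and each condition had only 3 runs, so the 333 ms gap in average time sits well inside the 2,000 to 4,000 ms spread of individual runs
- Strong models can produce “good-looking JSON” even when it does not meet the requirements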
- Production setups still need checks beyond format (missing values, business rules)
Harness engineering is less about a “magic accuracy boost” and more about making failures visible and outputs operational.
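As an example of a check beyond format: the schema's date pattern only tests the shape of the string, so a value like 2026-13-45 would pass it. The sketch below adds a few semantic checks on top of a schema-compliant object. The specific rules (non-empty items, positive total) and the name `check_business_rules` are assumptions for illustration, not rules used in this experiment.

```python
from datetime import date
from typing import List


def check_business_rules(order: dict) -> List[str]:
    """Semantic checks for an order that already passed schema validation."""
    problems = []
    if not order["order_id"].strip():
        problems.append("order_id is blank")
    if not order["items"]:
        problems.append("items is empty")
    if order["total_yen"] <= 0:
        problems.append("total_yen must be positive")
    try:
        # Rejects impossible dates that the regex pattern would accept.
        date.fromisoformat(order["delivery_date"])
    except ValueError:
        problems.append("delivery_date is not a real calendar date")
    return problems
```

Keeping these checks separate from the schema check makes it clear in the logs whether a failure was a format problem or a content problem.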
Summary
Even on the same JSON-formatting task, adding a harness clearly changed reliability: 0 of 3 outputs were schema-compliant without it, and 3 of 3 were with it.
Moving from “it usually works” to “it can be validated and operated” is the practical value of harness engineering.
Next, I plan to run the same comparison on other tasks (code edits and constrained summarization).
