Measuring AI Harness Engineering: JSON Formatting With vs Without Guardrails

Lead

Hi, I’m Pomarano.

In this post, I compare AI behavior on the same JSON-formatting task with and without a harness.
The question is simple: how much difference do guardrails make in real runs?

To keep this reproducible, I include prompts, evaluation rules, logs, and aggregated results.


What I mean by “harness”

In this article, a harness means execution guardrails such as:

  • Fixing output format (JSON schema)
  • Retrying when validation fails
  • Checking required keys and types
  • Logging failure reasons

The point is to stabilize output quality with verifiable steps, not just “better prompting.”


Task used for comparison (JSON formatting)

The model converts unstructured text into JSON.

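Fixed input text (a Japanese order record with one "label: value" pair per line: order ID, customer name, items, total amount, requested delivery date, payment method, note)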

注文ID: A-1042
顧客名: 田中 花子
商品: ワイヤレスマウス, USB-Cケーブル
合計金額: 4280円
配送希望日: 2026-05-30
支払い: クレジットカード
備考: 領収書希望

Expected JSON keys

  • order_id: string
  • customer_name: string
  • items: string[]
  • total_yen: number
  • delivery_date: string (YYYY-MM-DD)
  • payment_method: string
  • note: string
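
As a rough illustration of how the key and type list above could be checked in code, here is a minimal sketch using only the Python standard library. The names validate_order and EXPECTED are my own for this example, not part of any library, and the checks are an approximation of the schema used later in this post.

```python
import json
import re

# Expected keys and Python types, mirroring the list above.
EXPECTED = {
    "order_id": str,
    "customer_name": str,
    "items": list,
    "total_yen": (int, float),
    "delivery_date": str,
    "payment_method": str,
    "note": str,
}
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def validate_order(raw: str) -> list[str]:
    """Return violation messages; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    errors = []
    for key, expected_type in EXPECTED.items():
        if key not in data:
            errors.append(f"missing key: {key}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(data[key], bool) or not isinstance(data[key], expected_type):
            errors.append(f"wrong type for {key}")
    for key in sorted(set(data) - set(EXPECTED)):
        errors.append(f"unexpected key: {key}")

    items = data.get("items")
    if isinstance(items, list) and not all(isinstance(i, str) for i in items):
        errors.append("items must contain only strings")
    delivery = data.get("delivery_date")
    if isinstance(delivery, str) and not DATE_RE.fullmatch(delivery):
        errors.append("delivery_date is not YYYY-MM-DD")
    return errors
```

Returning a list of messages rather than a single boolean keeps the failure reasons available for logging.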

Experiment setup

Condition A: Without harness

  • Plain prompt only
  • No schema validation
  • No retries

Condition B: With harness

  • Explicit JSON schema
  • Retry up to 2 times on schema violation
  • Only final valid output is accepted
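
The retry rule in Condition B can be sketched roughly as follows. This is not the exact code behind the runs in this post: call_model is a hypothetical stand-in for whatever function sends the prompt to the model, and check_keys is a deliberately simplified validator that only compares key names.

```python
import json
from typing import Callable

REQUIRED_KEYS = {"order_id", "customer_name", "items", "total_yen",
                 "delivery_date", "payment_method", "note"}


def check_keys(raw: str) -> list[str]:
    """Simplified check: a JSON object with exactly the required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["invalid JSON"]
    if not isinstance(data, dict):
        return ["not an object"]
    if set(data) != REQUIRED_KEYS:
        return ["key set does not match schema"]
    return []


def run_with_harness(call_model: Callable[[str], str], prompt: str,
                     max_retries: int = 2) -> dict:
    """Call the model, validate, and retry up to max_retries times."""
    failures = []  # one entry of failure reasons per rejected attempt
    for attempt in range(max_retries + 1):
        raw = call_model(prompt)
        errors = check_keys(raw)
        if not errors:
            return {"passed": True, "output": raw,
                    "retries": attempt, "failures": failures}
        failures.append(errors)
    return {"passed": False, "output": None,
            "retries": max_retries, "failures": failures}


# Hypothetical stand-in for a model: fails once, then returns a compliant object.
replies = iter(['{"注文ID": "A-1042"}',
                json.dumps({key: "" for key in REQUIRED_KEYS})])
result = run_with_harness(lambda _prompt: next(replies), "prompt text")
print(result["passed"], result["retries"])  # True 1
```

Only an output that passes validation is returned as accepted; if every attempt fails, the caller gets the logged reasons instead of an unchecked output.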

Number of runs

  • 3 runs per condition (6 total)

Metrics

  • Success rate (schema-compliant outputs)
  • Average execution time
  • Retry count
  • Format violation count

Prompts used

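A. Prompt without harness

The prompt is in Japanese. In English it reads: "Format the following text as JSON. Be sure to return JSON only."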

次のテキストをJSONに整形してください。
必ずJSONだけを返してください。

注文ID: A-1042
顧客名: 田中 花子
商品: ワイヤレスマウス, USB-Cケーブル
合計金額: 4280円
配送希望日: 2026-05-30
支払い: クレジットカード
備考: 領収書希望

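B. Prompt with harness

The prompt is in Japanese. In English it reads: "Return the following text as JSON that satisfies the JSON schema below. Do not output any characters other than JSON."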

次のテキストを、以下のJSON schemaを満たすJSONとして返してください。
JSON以外の文字は出力しないでください。

Schema:
{
  "type": "object",
  "required": ["order_id","customer_name","items","total_yen","delivery_date","payment_method","note"],
  "properties": {
    "order_id": {"type":"string"},
    "customer_name": {"type":"string"},
    "items": {"type":"array","items":{"type":"string"}},
    "total_yen": {"type":"number"},
    "delivery_date": {"type":"string","pattern":"^\\d{4}-\\d{2}-\\d{2}$"},
    "payment_method": {"type":"string"},
    "note": {"type":"string"}
  },
  "additionalProperties": false
}

Input:
注文ID: A-1042
顧客名: 田中 花子
商品: ワイヤレスマウス, USB-Cケーブル
合計金額: 4280円
配送希望日: 2026-05-30
支払い: クレジットカード
備考: 領収書希望

Representative outputs

A. Without harness (failure example)

{
  "注文ID": "A-1042",
  "顧客名": "田中 花子",
  "商品": ["ワイヤレスマウス", "USB-Cケーブル"],
  "合計金額": 4280,
  "配送希望日": "2026-05-30",
  "支払い": "クレジットカード",
  "備考": "領収書希望"
}

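This is valid JSON, but it fails the schema: the keys are the Japanese labels from the input rather than the required English keys such as order_id. Prompt A never names the target keys, so the model reuses the labels it sees.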
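
To make the mismatch concrete, a quick key comparison like the one below shows which required keys are absent and which keys the schema does not allow. This is only an illustrative sketch; the variable names are my own.

```python
import json

REQUIRED_KEYS = {"order_id", "customer_name", "items", "total_yen",
                 "delivery_date", "payment_method", "note"}

# The Condition A output shown above.
output_a = json.loads("""
{
  "注文ID": "A-1042",
  "顧客名": "田中 花子",
  "商品": ["ワイヤレスマウス", "USB-Cケーブル"],
  "合計金額": 4280,
  "配送希望日": "2026-05-30",
  "支払い": "クレジットカード",
  "備考": "領収書希望"
}
""")

# Keys the schema requires but the output lacks.
missing = sorted(REQUIRED_KEYS - set(output_a))
# Keys the output has but "additionalProperties": false forbids.
unexpected = sorted(set(output_a) - REQUIRED_KEYS)

print(len(missing), len(unexpected))  # 7 7
```

All seven required keys are missing and all seven keys present are unexpected, even though the values themselves are correct.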

B. With harness (success example)

{
  "order_id": "A-1042",
  "customer_name": "田中 花子",
  "items": ["ワイヤレスマウス", "USB-Cケーブル"],
  "total_yen": 4280,
  "delivery_date": "2026-05-30",
  "payment_method": "クレジットカード",
  "note": "領収書希望"
}

All required keys, types, and format constraints are satisfied.


Run log

run | condition       | pass/fail | time (ms) | retries | notes
1   | without harness | fail      | 4000      | 0       | Valid JSON, but Japanese keys (schema mismatch)
2   | without harness | fail      | 3000      | 0       | Same issue; required keys like order_id missing
3   | without harness | fail      | 3000      | 0       | Same issue; stable output but requirement unmet
1   | with harness    | pass      | 2000      | 0       | Keys and types match schema
2   | with harness    | pass      | 4000      | 0       | Schema-compliant, one-line compact JSON
3   | with harness    | pass      | 3000      | 0       | Schema-compliant and repeatable

Aggregated results

metric                 | without harness | with harness | delta
Success rate           | 0% (0/3)        | 100% (3/3)   | +100 pt
Average execution time | 3,333 ms        | 3,000 ms     | -333 ms
Retry count            | 0               | 0            | ±0
Format violations      | 3               | 0            | -3
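
The aggregated numbers can be recomputed from the run log with a few lines of Python. This is a small sketch for checking the arithmetic; the RUNS list is typed in by hand from the run log above, and the aggregate function is my own naming.

```python
from statistics import mean

# Rows copied from the run log: (condition, passed, time_ms, retries)
RUNS = [
    ("without harness", False, 4000, 0),
    ("without harness", False, 3000, 0),
    ("without harness", False, 3000, 0),
    ("with harness", True, 2000, 0),
    ("with harness", True, 4000, 0),
    ("with harness", True, 3000, 0),
]


def aggregate(condition: str) -> dict:
    """Compute the four metrics for one condition."""
    rows = [r for r in RUNS if r[0] == condition]
    passed = sum(1 for r in rows if r[1])
    return {
        "success_rate": passed / len(rows),
        "avg_time_ms": round(mean(r[2] for r in rows)),
        "retries": sum(r[3] for r in rows),
        "violations": len(rows) - passed,
    }


print(aggregate("without harness"))
# {'success_rate': 0.0, 'avg_time_ms': 3333, 'retries': 0, 'violations': 3}
print(aggregate("with harness"))
# {'success_rate': 1.0, 'avg_time_ms': 3000, 'retries': 0, 'violations': 0}
```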

Takeaways

The biggest difference was spec compliance.
Without a harness, all outputs looked “reasonable” but failed schema requirements.
With a harness, all runs passed with consistent key names and types.

What improved:

  • Success rate: 0% (0/3) to 100% (3/3)
  • Verifiability: pass/fail is measurable by schema
  • Reproducibility: less output-format drift for downstream systems

What to keep in mind:

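  • This input was simple and each condition ran only 3 times, so the 333 ms gap in average time is within run-to-run variation (individual runs ranged from 2,000 to 4,000 ms)
  • No retries fired in any run, so here the gain came from stating the schema up front; validation confirmed the outputs rather than repairing them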
  • Strong models can produce “good-looking JSON” even when requirements are wrong
  • Production setups still need checks beyond format (missing values, business rules)
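
As an example of a check beyond format: the schema's date pattern accepts any string shaped like YYYY-MM-DD, including impossible dates such as 2026-02-30. A second validation layer might look like the sketch below. The specific rules (non-empty items, positive total) are hypothetical examples of business rules, not requirements from this experiment, and the function assumes the input has already passed schema validation.

```python
from datetime import date


def business_checks(order: dict) -> list[str]:
    """Checks that a format-only schema cannot express (illustrative rules)."""
    problems = []
    try:
        # Raises ValueError for dates that do not exist on the calendar.
        date.fromisoformat(order["delivery_date"])
    except ValueError:
        problems.append("delivery_date is not a real calendar date")
    if not order["items"]:
        problems.append("items is empty")
    if order["total_yen"] <= 0:
        problems.append("total_yen must be positive")
    return problems


order = {
    "order_id": "A-1042",
    "customer_name": "田中 花子",
    "items": ["ワイヤレスマウス", "USB-Cケーブル"],
    "total_yen": 4280,
    "delivery_date": "2026-05-30",
    "payment_method": "クレジットカード",
    "note": "領収書希望",
}
print(business_checks(order))  # []
print(business_checks({**order, "delivery_date": "2026-02-30"}))
# ['delivery_date is not a real calendar date']
```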

Harness engineering is less about a “magic accuracy boost” and more about making failures visible and outputs usable in operation.


Summary

On the same JSON-formatting task, adding the harness changed schema compliance from 0 of 3 runs to 3 of 3.
Moving from “it usually works” to “it can be validated and operated” is the practical value of harness engineering.

Next, I plan to run the same comparison on other tasks (code edits and constrained summarization).

