Building and Comparing Agents — With vs Without Proofreading (AI Agent Study Notes Part 3)

Lead

Hi, I’m Pomarano.

This is Part 3 of my AI Agent Study Notes.

Series index: Building Your Own AI Agents · Part 1 · Part 2
In Part 2 we designed the copy and proofread agents. Here we build and run them.
Focus: the difference between running with and without a proofreading agent.

Like the harness article, we split conditions, use the same task and rubric, and report results you can reproduce.

Japanese version: 校正エージェントあり/なし（第3回）

Overview of this part

flowchart TB
  subgraph B["Condition B — copy + proofread"]
    B1["Copy agent"]
    B2["Draft md"]
    B3["Proofread agent<br/>6-item check · fix"]
    B4["Human final review<br/>~2 min"]
    B1 --> B2 --> B3 --> B4
  end

  subgraph A["Condition A — copy only"]
    A1["Copy agent"]
    A2["Draft md"]
    A3["Human scores 6 items<br/>~8 min edits"]
    A1 --> A2 --> A3
  end

  classDef agent fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1a1a1a
  classDef human fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#1a1a1a
  class A1 agent
  class A2 agent
  class A3 human
  class B1 agent
  class B2 agent
  class B3 agent
  class B4 human

Experiment goal and setup

Goals

Compare Condition A (copy only) with Condition B (copy → proofread) on checklist pass rate and human edit time
Show numerically where writing the rules into the spec is not enough on its own

Conditions

Repository: pomarano/x_auto_writing

Item	Content
Copy spec	`x-shuuchaku-agent-spec.md`
Copy prompt	`automation/x-daily/prompt.md`
Proofread spec	`x-proofread-agent-spec.md`
Proofread prompt	`automation/x-proofread/prompt.md`
Environment	Cursor Agent (manual start)
Sample size	5 runs per condition
Scoring	6-item checklist from Part 2 (human grader)
Raw log	`x-proofread-experiment-log.md`

This post summarizes the log. See the log for per-run scores and edit times.

Condition A — copy agent only

2-1. Procedure

Give the agent automation/x-daily/prompt.md
Wait until social/x-drafts/YYYY-MM-DD.md exists
Do not run proofreading
Human scores 6 items and edits for posting (record time)
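
The "wait until the draft exists" step can be sketched as a small polling helper. This is only an illustration: the `social/x-drafts/YYYY-MM-DD.md` path comes from the procedure above, while `draft_path`, `wait_for_draft`, and the timeout values are hypothetical names and defaults, not part of the repository.

```python
import time
from datetime import date
from pathlib import Path

DRAFT_DIR = Path("social/x-drafts")  # directory named in the procedure


def draft_path(day: date, root: Path = Path(".")) -> Path:
    """Return the expected draft path, social/x-drafts/YYYY-MM-DD.md."""
    return root / DRAFT_DIR / f"{day.isoformat()}.md"


def wait_for_draft(path: Path, timeout_s: float = 600.0, poll_s: float = 5.0) -> bool:
    """Poll until the draft exists; return False if the timeout passes first.

    The timeout and poll interval are assumed defaults for illustration.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if path.exists():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

In a manual run you would simply watch the folder; a helper like this only matters once the steps are chained.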

2-2. What we expected — and saw

The first real file (2026-06-03) had 248 characters — far over the 140-char spec.

(Japanese draft on the attachment / clinging theme, about 248 characters)

The teaching and practice content was usable, but the draft did not fit within X's limit for Japanese text.
Practice step: when irritated, separate facts from mental labels.

#内省テック

Under Condition A, a human has to fix this on every run.

The 2026-06-05 draft (A2) repeated the pattern: 217 characters, teaching split across two sentences, one long practice sentence.

Even with “140 chars” and “one teaching + one practice” written into the spec, the rules slip when there is no verification layer. That observation is the starting point for this experiment.
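
To make "verification layer" concrete, here is a minimal sketch of two mechanical checks. It is not the Part 2 checklist or the actual proofread agent. It assumes length is counted with Python's `len` (Unicode code points), which may differ from how X counts, and it assumes at most one hashtag is allowed, inferred from the "2 hashtags" violation in run A3. `check_draft` and both constants are hypothetical names.

```python
import re

MAX_CHARS = 140    # limit stated in the copy spec
MAX_HASHTAGS = 1   # assumption: inferred from the "2 hashtags" violation in A3


def check_draft(body: str) -> dict[str, bool]:
    """Return pass/fail for two mechanical rules on a draft body."""
    text = body.strip()
    # A hashtag here is '#' followed by a run of non-whitespace characters.
    hashtags = re.findall(r"#\S+", text)
    return {
        "length_ok": len(text) <= MAX_CHARS,       # len() counts code points
        "hashtags_ok": len(hashtags) <= MAX_HASHTAGS,
    }
```

Checks like these only cover rules that can be counted; structure and phrasing still need an agent or a human to judge.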

Condition B — copy + proofreading agent

3-1. Procedure

Generate the copy exactly as in Condition A
Run automation/x-proofread/prompt.md (specify the date)
The proofread agent updates the body, appends a proofread section, and updates char_count
Human does a final review (record time)

3-2. Escalation rules

The proofread agent attempts at most one fix pass
If the theme duplicates one from the last 30 days, do not change the topic; set needs_human: true
Do not touch files with status: posted
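
The procedure and escalation rules above can be read as a small decision function. The sketch below is an assumption-heavy illustration, not the agent's real logic: the `Draft` class and the `proofread` function are hypothetical, `fix` stands in for the agent's single rewrite pass, and flagging a draft that is still over 140 characters after that pass is an added assumption. Only the field names `status`, `char_count`, and `needs_human` come from the text.

```python
from dataclasses import dataclass, replace
from typing import Callable

MAX_CHARS = 140  # limit stated in the copy spec


@dataclass(frozen=True)
class Draft:
    body: str
    status: str = "draft"      # files with status "posted" must not be touched
    char_count: int = 0
    needs_human: bool = False


def proofread(draft: Draft, theme_is_duplicate: bool,
              fix: Callable[[str], str]) -> Draft:
    """Apply the escalation rules around a single fix pass."""
    # Rule: never touch a file that has already been posted.
    if draft.status == "posted":
        return draft
    # Rule: at most one fix pass, so fix() is called once with no retry loop.
    new_body = fix(draft.body)
    # Rule: a duplicate theme is escalated to a human; the topic is not changed.
    # Assumption: a draft still over the limit after one pass is escalated too.
    needs_human = theme_is_duplicate or len(new_body) > MAX_CHARS
    return replace(draft, body=new_body, char_count=len(new_body),
                   needs_human=needs_human)
```

Keeping the draft immutable and returning a new value makes it easy to compare the before and after states when scoring a run.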

Results

4-1. Summary

Metric	A (copy only)	B (copy + proofread)
Checklist pass rate	avg 2.4 / 6 (40%)	avg 5.6 / 6 (93%) after proofread
Over limit (first output)	5 / 5	5 / 5 at copy → 0 / 5 after proofread
Structure violations	3 / 5	0 / 5 after proofread
Human edit time	avg 8 min	avg 2 min
Post as-is OK	0 / 5	4 / 5

For B, the pass rate is scored after proofreading; for A, it is scored right after copy generation.

4-2. Condition A — runs (excerpt)

#	Date	Pass	Main violations
A1	2026-06-03	2/6	248 chars, long paragraphs
A2	2026-06-05	3/6	217 chars, structure
A3	2026-06-10	2/6	156 chars, 2 hashtags
A4	2026-06-12	3/6	148 chars, split practice
A5	2026-06-14	2/6	162 chars, near-banned phrasing
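
The Condition A average in the summary can be reproduced from the per-run pass counts in the table above. This is a small arithmetic sketch; the variable and function names are illustrative only.

```python
from statistics import mean

CHECKLIST_ITEMS = 6

# Pass counts copied from the Condition A table (runs A1 to A5).
condition_a_passes = {"A1": 2, "A2": 3, "A3": 2, "A4": 3, "A5": 2}


def summarize(passes: dict[str, int]) -> tuple[float, float]:
    """Return (average items passed, average pass rate as a fraction)."""
    avg = mean(passes.values())
    return avg, avg / CHECKLIST_ITEMS


avg, rate = summarize(condition_a_passes)
print(f"avg {avg:.1f} / {CHECKLIST_ITEMS} ({rate:.0%})")  # avg 2.4 / 6 (40%)
```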

4-3. Condition B — runs (excerpt)

#	Date	Before	After	Main fixes
B1	2026-06-03	2/6	6/6	248→128 chars, two-sentence reshape
B2	2026-06-05	3/6	5/6	217→132 chars
B3	2026-06-10	2/6	6/6	156→125 chars, tag cleanup
B4	2026-06-12	3/6	6/6	practice merged to one sentence
B5	2026-06-14	2/6	4/6	rephrase; duplicate theme → human

After B1 (~128 chars):

「執着」とは、物事に張り付く見方だと言われます。今日はイライラしたとき、事実と頭の評価を分けて書き出し、評価の行だけ眺めてみる。#内省テック

How to read the results

Biggest gains

Item	Finding
Length	Copy-only output was over the limit in 5 / 5 runs; proofread output was consistently ≤ 140
Structure	Proofread reshapes drafts to one teaching sentence + one practice sentence (violations 3 / 5 → 0 / 5)
Human time	8 min → 2 min on average; the work shifts from rewriting to checking

Where proofread is not enough

Item	Finding
Theme duplication	Needs a human decision or a re-run of the copy agent
Factual / teaching accuracy	Not a rule violation, so it still needs human review
Cost	Two agent runs per post: acceptable for personal use, but not free

The lesson matches the harness post: put verification on its own layer. It works for multi-agent setups too.

Implementation notes

Piece	Content
Run	Cursor Agent + prompt (manual or Actions)
Storage	Repo `.md` for history and reproducibility
Semi-auto	Human posts; no X API auto-post
Extension	GitHub Actions + email, as described in the X semi-auto article

Actions is orthogonal to the design: settle the two-agent split and verification first, and add scheduling later.

Proofread spec: x-proofread-agent-spec.md; prompt: automation/x-proofread/prompt.md. It follows the same pattern as the copy agent: rules live in the spec and the prompt stays thin (Part 2).

GitHub Actions (extension)

You can run the copy agent daily via GitHub Actions + Cursor SDK (see the operations article).
This series is about splitting roles and the effect of proofreading. Actions decides when a run starts; whether to proofread is a separate question.

Stage	Content
Part 3	Manual A/B comparison
Operations	Automate copy; proofread manual or chained
Full pipeline	Copy → proofread → email → human

Running the comparison first makes it easier to judge whether Actions is worth it.

Summary

Condition A: ~40% checklist pass rate, ~8 min of human edits
Condition B: 93% after proofread, ~2 min of human review, with big wins on length and structure
Proofreading does not catch everything: theme duplication and factual accuracy stay with humans
Part 4 wraps up the series (Japanese Part 4)