Building and Comparing Agents — With vs Without Proofreading (AI Agent Study Notes Part 3)

Lead

Hi, I’m Pomarano.

This is Part 3 of my AI Agent Study Notes.

Series index: Building Your Own AI Agents · Part 1 · Part 2
In Part 2 we designed the copy and proofreading agents. Here we build and run them.
Focus: what changes with and without a proofreading agent.

Like the harness article, we split conditions, use the same task and rubric, and report results you can reproduce.


Overview of this part

flowchart TB
  subgraph B["Condition B — copy + proofread"]
    B1["Copy agent"]
    B2["Draft md"]
    B3["Proofread agent<br/>6-item check · fix"]
    B4["Human final review<br/>~2 min"]
    B1 --> B2 --> B3 --> B4
  end

  subgraph A["Condition A — copy only"]
    A1["Copy agent"]
    A2["Draft md"]
    A3["Human scores 6 items<br/>~8 min edits"]
    A1 --> A2 --> A3
  end

  classDef agent fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1a1a1a
  classDef human fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#1a1a1a
  class A1 agent
  class A2 agent
  class A3 human
  class B1 agent
  class B2 agent
  class B3 agent
  class B4 human

Experiment goal and setup

Goals

  • Compare Condition A (copy only) vs Condition B (copy → proofread) on checklist pass rate and human edit time
  • Show with numbers where putting the rules in the spec is, on its own, not enough

Conditions

Repository: pomarano/x_auto_writing

| Item | Content |
| --- | --- |
| Copy spec | x-shuuchaku-agent-spec.md |
| Copy prompt | automation/x-daily/prompt.md |
| Proofread spec | x-proofread-agent-spec.md |
| Proofread prompt | automation/x-proofread/prompt.md |
| Environment | Cursor Agent (manual start) |
| Sample size | 5 runs per condition |
| Scoring | 6-item checklist from Part 2 (human grader) |
| Raw log | x-proofread-experiment-log.md |

This post summarizes the log. See the log for per-run scores and edit times.


Condition A — copy agent only

2-1. Procedure

  1. Give Cursor Agent the prompt automation/x-daily/prompt.md
  2. Wait until social/x-drafts/YYYY-MM-DD.md exists
  3. Do not run proofreading
  4. Human scores 6 items and edits for posting (record time)

2-2. What we expected — and saw

The first real file (2026-06-03) had 248 characters — far over the 140-char spec.

(Japanese draft — attachment / clinging theme, ~248 chars)

Teaching + practice content was usable, but not within X's Japanese limit.
Practice step: when irritated, separate facts from mental labels.

#内省テック

Under Condition A, you fix this every run.

The 2026-06-05 draft (A2) repeated the pattern: 217 characters, teaching split across two sentences, one long practice sentence.

Even with “140 chars” and “one teaching + one practice” written into the spec, the rules slip when no layer verifies them. That observation is the starting point for this experiment.
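A verification layer for the mechanical rules can be very small. The sketch below covers only two checks, length and hashtag count; it is not the 6-item checklist from Part 2. The name `check_draft`, the one-hashtag limit (inferred from a run being flagged for "2 hashtags"), and counting characters with Python's `len` are assumptions made for illustration.

```python
import re

MAX_CHARS = 140    # limit stated in the copy spec
MAX_HASHTAGS = 1   # assumption: the post flags "2 hashtags" as a violation


def check_draft(body: str) -> list[str]:
    """Return a list of rule violations for one draft body (empty list = pass)."""
    violations = []

    # len() counts Unicode code points, so one kana or kanji counts as 1.
    n_chars = len(body)
    if n_chars > MAX_CHARS:
        violations.append(f"length: {n_chars} chars (limit {MAX_CHARS})")

    # \w matches kana and kanji in Python 3, so Japanese hashtags are found too.
    hashtags = re.findall(r"#\w+", body)
    if len(hashtags) > MAX_HASHTAGS:
        violations.append(f"hashtags: {len(hashtags)} (limit {MAX_HASHTAGS})")

    return violations


if __name__ == "__main__":
    print(check_draft("あ" * 248))               # over the length limit
    print(check_draft("短い下書き。#内省テック"))  # passes both checks
```

Checks like structure ("one teaching + one practice") and near-banned phrasing are harder to express as a regex, which is one reason to leave them to a proofreading agent or a human.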


Condition B — copy + proofreading agent

3-1. Procedure

  1. Same as A — generate copy
  2. Run automation/x-proofread/prompt.md (specify date)
  3. Proofread updates body, appends proofread section, updates char_count
  4. Human final review (record time)
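
Step 3 can be pictured as a pure text transformation on the draft file. This is a minimal sketch, not what the proofreading agent actually runs: it assumes the draft starts with `---`-delimited front matter that holds `char_count`, and that the proofread section is a `## Proofread` heading with bullet notes. The function name `apply_proofread` and both layout details are hypothetical.

```python
import re


def apply_proofread(draft: str, new_body: str, notes: list[str]) -> str:
    """Replace the body, refresh char_count, and append a proofread section."""
    # Assumed layout: "---", front matter lines, "---", then the body.
    # Raises ValueError if the draft has no front matter block.
    _, front, _old_body = draft.split("---", 2)

    body = new_body.strip()
    count_line = f"char_count: {len(body)}"
    if re.search(r"^char_count:.*$", front, flags=re.MULTILINE):
        # Overwrite the existing count in place.
        front = re.sub(r"^char_count:.*$", count_line, front, flags=re.MULTILINE)
    else:
        # No count yet: add one as the last front matter line.
        front = front.rstrip("\n") + "\n" + count_line + "\n"

    section = "\n".join(["## Proofread"] + [f"- {note}" for note in notes])
    return f"---{front}---\n\n{body}\n\n{section}\n"
```

Keeping the count in the file means the human reviewer can see at a glance whether the proofread pass brought the draft under the limit.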

3-2. Escalation rules

  • Proofread tries at most one fix pass
  • If theme duplicates last 30 days → do not change topic; set needs_human: true
  • Do not touch files with status: posted
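
The three escalation rules can be written down as one small decision function. This is a sketch under assumptions: `status: posted` and `needs_human` come from the rules above, while the `Plan` type, the function name `plan_proofread`, and its parameters are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    run_fix: bool      # may the proofreader attempt a fix pass?
    needs_human: bool  # should the file be flagged needs_human: true?


def plan_proofread(status: str, fix_passes_done: int, theme_is_duplicate: bool) -> Plan:
    """Decide what the proofreader may do with one draft."""
    # Rule 3: never touch files that are already posted.
    if status == "posted":
        return Plan(run_fix=False, needs_human=False)

    # Rule 1: at most one fix pass.
    # Rule 2: a duplicate theme is flagged for a human; the topic is not changed.
    return Plan(run_fix=fix_passes_done < 1, needs_human=theme_is_duplicate)


if __name__ == "__main__":
    print(plan_proofread("draft", 0, False))   # fix allowed, no escalation
    print(plan_proofread("draft", 0, True))    # fix allowed, flagged for a human
    print(plan_proofread("posted", 0, True))   # left untouched
```

Returning a plan instead of editing the file directly keeps the rules testable without running an agent.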

Results

4-1. Summary

| Metric | A (copy only) | B (copy + proofread) |
| --- | --- | --- |
| Checklist pass rate | avg 2.4 / 6 (40%) | avg 5.6 / 6 (93%) after proofread |
| Over limit (first output) | 5 / 5 | 5 / 5 at copy → 0 / 5 after proofread |
| Structure violations | 3 / 5 | 0 / 5 after proofread |
| Human edit time | avg 8 min | avg 2 min |
| Post as-is OK | 0 / 5 | 4 / 5 |

For B, the pass rate is scored after proofreading; for A, right after copy generation.

4-2. Condition A — runs (excerpt)

| # | Date | Pass | Main violations |
| --- | --- | --- | --- |
| A1 | 2026-06-03 | 2/6 | 248 chars, long paragraphs |
| A2 | 2026-06-05 | 3/6 | 217 chars, structure |
| A3 | 2026-06-10 | 2/6 | 156 chars, 2 hashtags |
| A4 | 2026-06-12 | 3/6 | 148 chars, split practice |
| A5 | 2026-06-14 | 2/6 | 162 chars, near-banned phrasing |

4-3. Condition B — runs (excerpt)

| # | Date | Before | After | Main fixes |
| --- | --- | --- | --- | --- |
| B1 | 2026-06-03 | 2/6 | 6/6 | 248→128 chars, two-sentence reshape |
| B2 | 2026-06-05 | 3/6 | 5/6 | 217→132 chars |
| B3 | 2026-06-10 | 2/6 | 6/6 | 156→125 chars, tag cleanup |
| B4 | 2026-06-12 | 3/6 | 6/6 | practice merged to one sentence |
| B5 | 2026-06-14 | 2/6 | 4/6 | rephrase; duplicate theme → human |

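B1 after proofreading (~128 chars):

「執着」とは、物事に張り付く見方だと言われます。今日はイライラしたとき、事実と頭の評価を分けて書き出し、評価の行だけ眺めてみる。#内省テック

(English gloss: "Attachment" is said to be a way of seeing that clings to things. Today, when you feel irritated, write down the facts and your mind's judgments separately, then look only at the judgment lines.)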
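
To reproduce the summary averages, the per-run scores in the two excerpt tables can be averaged directly. The sketch below uses only the "Pass" column for Condition A and the "After" column for Condition B; the function name `summarize` is an assumption for illustration.

```python
A_PASS = [2, 3, 2, 3, 2]    # runs A1..A5, "Pass" column
B_AFTER = [6, 5, 6, 6, 4]   # runs B1..B5, "After" column
ITEMS = 6                   # checklist items per run


def summarize(scores: list[int], items: int = ITEMS) -> tuple[float, int]:
    """Return (average items passed, pass rate in whole percent)."""
    avg = sum(scores) / len(scores)
    return round(avg, 1), round(100 * avg / items)


if __name__ == "__main__":
    print("A:", summarize(A_PASS))
    print("B:", summarize(B_AFTER))
```

For Condition A this gives 2.4 / 6 (40%), matching the summary. For Condition B the five excerpted rows give 5.4 / 6 (90%), slightly below the 5.6 / 6 (93%) in the summary table, so treat x-proofread-experiment-log.md as the authoritative record for the per-run values.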

How to read the results

Biggest gains

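| Item | Finding |
| --- | --- |
| Length | Copy-only output was over the limit in all 5 runs; after proofreading, 0 of 5 were over 140 chars |
| Structure | Proofreading reshaped drafts to one teaching sentence + one practice sentence |
| Human time | 8 min → 2 min: the work shifts from rewriting to checking |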

Where proofread is not enough

| Item | Finding |
| --- | --- |
| Theme duplication | Needs a human decision or a re-run of the copy agent |
| Factual / teaching accuracy | Not a rule violation, so it still needs human eyes |
| Cost | Two agent runs per post: acceptable for personal use, but not free |

This is the same lesson as the harness post: put verification on its own layer. It holds for a multi-agent setup too.


Implementation notes

| Piece | Content |
| --- | --- |
| Run | Cursor Agent + prompt (manual or Actions) |
| Storage | Repo .md for history and reproducibility |
| Semi-auto | Human posts; no X API auto-post |
| Extension | GitHub Actions + email per X semi-auto article |

Actions is orthogonal to the design: get the two-agent split and the verification step right first, and add scheduling later.

Proofread spec: x-proofread-agent-spec.md; prompt: automation/x-proofread/prompt.md. Same pattern as copy — rules in spec, thin prompt (Part 2).


GitHub Actions (extension)

You can run the copy agent daily via GitHub Actions + Cursor SDK (operations article).
This series is about splitting roles and the effect of proofreading. Actions only decides when a run starts; whether to proofread is a separate question.

| Stage | Content |
| --- | --- |
| Part 3 | Manual A/B comparison |
| Operations | Automate copy; proofread manual or chained |
| Full pipeline | Copy → proofread → email → human |

Running the comparison first makes it easier to judge whether Actions is worth it.


Summary

  • Condition A: ~40% checklist pass, ~8 min human edits
  • Condition B: 93% after proofread, ~2 min — big wins on length and structure
  • Proofreading is not universal — duplication and facts stay with humans
  • Part 4 wraps the series — Japanese Part 4
