Parallel AI Research Pipelines
Three systems for orchestrating parallel AI agents — from JSONL work items to declarative workspaces to phased research pipelines. The patterns that actually work.
I needed protocol documentation for 19 top-level domains — DNS behavior, WHOIS formats, RDAP endpoints, registration rules, rate limits, raw captures. Each TLD is its own research unit with its own servers, formats, and quirks. Doing them sequentially would take days.
So I wrote a prompt that launched 19 parallel subagents, each researching one TLD in its own isolated directory, then ran a review pass to find gaps, then launched a second research wave, then a documentation pass. The whole thing ran in one session.
This article is about the pattern that emerged — not the TLD research itself, but the structure for running parallel AI research at scale.
The Problem with Naive Parallel Agents
The obvious approach: “research these 19 things in parallel.” Give each agent a topic and let it go. This fails in predictable ways:
- Agents overwrite each other. Two agents writing to the same summary file. Merge conflicts in shared state. Lost work.
- No consistency. Agent 1 captures WHOIS response time. Agent 7 doesn’t. Agent 12 uses a different JSON schema. You can’t compare findings across units.
- No refinement. First-pass research always has gaps. Without a review step, gaps stay gaps.
- No machine-readable output. Agents default to markdown prose. Prose is hard to aggregate, diff, or feed into code.
The Three-Phase Pattern
The structure that works:
```
Phase 1: Explore (parallel)   → raw findings per unit
Phase 2: Review & Refine      → cross-unit analysis → v2 template → second pass
Phase 3: Document (parallel)  → uniform deliverables
```

Each phase has different parallelism characteristics. Phases 1 and 3 are embarrassingly parallel (one agent per unit, no coordination). Phase 2 is sequential — a single review agent reads everything and produces the refined template.
The folder structure
```
research_root/
├── 1_explore/{unit_a, unit_b, ...}/   # Phase 1 workspaces
├── 2_research/{unit_a, unit_b, ...}/  # Phase 2 workspaces
├── 3_writing/{unit_a, unit_b, ...}/   # Phase 3 workspaces
├── {unit}_documentation/              # Final deliverables
├── prompts/                           # Templates (v1, v2)
├── templates/                         # Schemas, response formats
├── summaries/                         # Cross-unit analysis
├── analysis/                          # Review outputs
└── tools/                             # Shared scripts, configs
```

The key insight: each phase gets its own directory tree. Phase 2 agents don't touch Phase 1 directories. This makes the workspace append-only at the directory level — you can always go back and see exactly what each agent produced at each stage.
Isolation: Three Approaches
The single most important rule across all three systems: agents must not interfere with each other. There are different ways to enforce this:
Directory isolation (research pipeline) — each agent writes only in its assigned directory:
```
Agent for unit "net" in Phase 1:
  CAN write:  1_explore/net/*
  CAN read:   tools/*, prompts/*, templates/*
  CANNOT:     1_explore/org/*, 2_research/*, anything else
```

Git worktree isolation (work system) — each agent gets a separate copy of the repository on disk:
```shell
# Each task runs in its own worktree
claude --worktree work-W001 "Fix the port mismatch..."
# Creates branch worktree-work-W001, separate working directory
# Other agents on other worktrees can't see uncommitted changes
```

Pane isolation (workspace manager) — each agent runs in its own terminal pane, sharing the repo but partitioned by prompt:
```yaml
# workspace manager: declarative layout, agents share the repo but work on different dirs
panes:
  - name: agent-01
    closing: "Work ONLY in src/parser/. Commit when done."
  - name: agent-02
    closing: "Work ONLY in src/extraction/. Commit when done."
```

Directory isolation is simplest — no git machinery needed. Worktrees are strongest — agents literally can't see each other's uncommitted work. Pane isolation is fastest to set up — just a YAML file — but relies on the agent obeying its prompt.
For research, directory isolation is sufficient. For code changes, worktrees are safer.
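Directory isolation is also easy to enforce mechanically rather than by prompt alone. A minimal sketch — the function name and check are my own, not from any of the three systems — that treats `phase/unit` as the only writable root:

```python
from pathlib import Path

def write_allowed(unit: str, phase_dir: str, target: str) -> bool:
    """An agent may write only inside its own <phase>/<unit>/ workspace.

    Sketch only: real enforcement should resolve() paths first to
    reject '..' traversal and symlink escapes.
    """
    allowed_root = Path(phase_dir) / unit
    try:
        # relative_to raises ValueError when target is outside allowed_root
        Path(target).relative_to(allowed_root)
        return True
    except ValueError:
        return False
```

A pre-write hook calling a check like this turns "please stay in your directory" from a polite request into a hard boundary.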
Machine-Readable First
The second critical rule: JSON is authoritative, markdown is derived.
Each agent produces two outputs per phase:
- `findings.json` — structured data with a defined schema, every field sourced
- `notes.md` — human-readable summary, explicitly non-authoritative
Why not just markdown? Because the review agent needs to aggregate across all units. Reading 19 markdown files and extracting comparable data is fragile. Reading 19 JSON files with the same schema is trivial.
```json
{
  "unit": "net",
  "registry_operator": "Verisign",
  "lookup_server": "whois.example-registry.com",
  "whois_available_pattern": "No match for \"DOMAIN.NET\".",
  "rdap_base": "https://rdap.verisign.com/net/v1",
  "rdap_available_status": 404,
  "min_label_length": 3,
  "rate_limiting": {"whois": "undocumented", "rdap": "429 + Retry-After"},
  "sources": ["https://www.verisign.com/...", "live probe 2026-03-28"]
}
```

Every field has a `sources` array. If the review agent questions a finding, it can trace back to the original source. No "trust me, I researched it."
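This is why the shared schema matters: cross-unit aggregation becomes a loop, not a parsing project. A sketch of the review agent's first step, assuming the directory layout described earlier (one `findings.json` per unit workspace):

```python
import json
from pathlib import Path

def aggregate_findings(explore_root):
    """Merge every unit's findings.json into one dict keyed by unit."""
    merged = {}
    for path in sorted(Path(explore_root).glob("*/findings.json")):
        data = json.loads(path.read_text())
        merged[data["unit"]] = data
    return merged
```

With 19 units this yields a single in-memory dict the review agent can group, diff, and gap-check.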
The Two-Pass Refinement
This is what makes the pattern actually work, not just “parallel agents doing things.”
Phase 1 uses a generic template. Agents do their best, but they don’t know what they don’t know. Some agents capture edge cases others miss. Some discover dimensions the template didn’t anticipate.
The review step reads all Phase 1 outputs and produces:
- A global findings file (unified spec across all units)
- A taxonomy of categories discovered (not just the ones you predicted)
- A gap analysis (what each unit is missing)
- A v2 template incorporating everything Phase 1 revealed
Phase 2 uses the v2 template. Now every agent knows to look for the edge cases that only some agents found in Phase 1. The quality floor rises dramatically.
What the review agent actually produces
For the TLD research, the review agent read 19 findings.json files and produced:
- Implementation tiers — grouping TLDs by complexity (trivial: .net is identical to .com; custom: .uk needs a unique parser; special: .ch blocks WHOIS entirely)
- Parser families — TLDs sharing the same backend/format (Identity Digital runs .org, .io, .ai with identical WHOIS patterns)
- Gap analysis — “.fr agent didn’t capture rate limit behavior” / “.se agent missed zone file AXFR access”
- v2 template — now includes: "check for AXFR zone file access" (only discovered by the .se agent), "capture WHOIS connection terminator behavior" (only .de closes the connection instead of using `<<<`)
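The gap analysis itself is mechanical once all findings share a schema. A minimal sketch (function and field names are illustrative, not the pipeline's actual code):

```python
def gap_analysis(findings_by_unit, required_fields):
    """Report, per unit, which required fields are missing or empty."""
    gaps = {}
    for unit, findings in findings_by_unit.items():
        missing = sorted(f for f in required_fields
                         if findings.get(f) in (None, "", [], {}))
        if missing:
            gaps[unit] = missing
    return gaps
```

The output feeds directly into the v2 template: every gap becomes an explicit instruction for the second pass.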
Live Captures as Ground Truth
Agents shouldn’t just search the web. They should probe live systems and capture raw responses.
```shell
# Shared probe tool available to all agents (read-only)
probe.py --target net --type whois --domain google.net
```

Raw captures serve two purposes:
- Truth. Web search results can be outdated. RFC text can be ambiguous. A raw WHOIS response is unambiguous.
- Parser guidance. When you later implement a parser, the raw captures are your test fixtures. You don’t need to re-query live servers.
Captures are immutable — written once, never edited. If a second probe gives different results, you capture both. Contradictions are data.
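A write-once capture store takes only a few lines. A sketch, assuming a per-unit captures directory; the naming scheme (timestamp plus content hash) is my own convention, not the original pipeline's:

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path

def store_capture(capture_dir, kind, raw: bytes) -> Path:
    """Write a raw probe response once; a differing re-probe gets a new file."""
    digest = hashlib.sha256(raw).hexdigest()[:12]
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = Path(capture_dir) / f"{kind}_{stamp}_{digest}.raw"
    if not path.exists():  # immutable: never rewrite an existing capture
        path.write_bytes(raw)
    return path
```

Because the filename includes a content hash, contradictory probes land side by side instead of overwriting each other — the contradiction is preserved as data.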
From Pattern to Tool
A template describes a pattern. A task file makes it executable. A CLI makes it repeatable. Each layer reduces how much the operator needs to get right.
Layer 1: Natural language prompt (one-shot)
The TLD research started as a single message:
“Research all remaining TLDs… use one subagent per TLD, give each its own directory… ensure agents never overwrite each other’s work.”
This works once. It’s not repeatable — the next researcher writes a different prompt, gets different structure, produces incomparable output.
Layer 2: Template with variables (repeatable)
Extract the pattern into a template with {{VARIABLES}}:
```
Phase 1: Explore (parallel — one agent per {{UNIT}})
  - Each agent works ONLY in 1_explore/{{UNIT_ID}}/
  - Live probe: {{PROBE_TARGETS}} via proxied connections
  - Persist: findings.json + notes.md
```

Now anyone can fill in the variables and get the same structure. But it's still manual — you read the template, fill it in mentally, write the prompt.
Layer 3: Task file (executable)
Make the filled-in template machine-readable — a JSONL record per unit:
```json
{
  "id": "C01",
  "slug": "bot-detection-2026",
  "title": "How Websites Detect Bots in 2026",
  "kind": "article",
  "status": "planned",
  "source_map": "analysis/01-bot-detection-2026.md",
  "sources": ["docs/research/03-anti-bot-landscape-2026.md", "..."],
  "parallel": true
}
```

This is the same pattern as a tasks.jsonl in any work system — each line is one unit of work with enough context to build a prompt and launch an agent.
Layer 4: CLI tool (trackable)
A script reads the task file, builds the prompt, and launches the agent:
```shell
# Development tasks (sequential work system)
work run T001             # read JSONL → build prompt → claude --worktree

# Parallel agents (declarative workspace manager)
workspace start team.yml  # read YAML → split panes → launch agents

# Content pipeline (same pattern)
scripts/content draft C01 # read JSONL → read source map → claude
```

All three do the same thing: read structured task data, assemble a prompt with the right context, launch Claude. The data model and orchestration differ, but the core loop is identical.
The three-layer prompt sandwich
workspace manager introduces a useful pattern for prompt assembly — the three-layer sandwich:
```
Layer 1: Universal rules       (TESTING-RULES.md — same across all agents)
Layer 2: Task-specific prompt  (01-parser-accuracy.md — unique per agent)
Layer 3: Closing block         (verification + tracking + commit sequence)
```

Layers 1 and 3 stay constant. Layer 2 is the variable. This ensures every agent follows the same verification and state-update protocol, regardless of what task it's working on.
The research pipeline has the same structure implicitly: the template is Layer 1 + 3, the unit-specific assignment is Layer 2. Making it explicit (like workspace manager does) is cleaner.
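Made explicit, the sandwich is a trivial function — and that triviality is the point: only the middle layer ever varies. A sketch using the file names from the example above:

```python
from pathlib import Path

def build_prompt(rules_path, task_path, closing: str) -> str:
    """Three-layer sandwich: constant rules, per-agent task, constant closing."""
    return "\n\n".join([
        Path(rules_path).read_text().strip(),  # Layer 1: universal rules
        Path(task_path).read_text().strip(),   # Layer 2: task-specific
        closing.strip(),                       # Layer 3: verify/track/commit
    ])
```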
What This Produced
For the TLD research specifically:
| Metric | Value |
|---|---|
| TLDs researched | 19 |
| Phases | 3 (explore, research, documentation) |
| Total agents launched | ~60 (19 per phase + review agents) |
| Raw captures | WHOIS + DNS + RDAP per TLD, both registered and available |
| Final output | 19 implementation guides with raw captures |
| Implementation tiers identified | 5 (trivial → special) |
| Parser families identified | 14 |
The deliverables were dense enough that implementing a new TLD in the scanner required reading one README and copying one set of raw captures as test fixtures. No additional research needed.
Three Systems, One Pattern
I’ve now built three systems that all solve the same problem — coordinating parallel AI agents with shared state — in different domains:
| Aspect | Work System | workspace manager | Research Pipeline |
|---|---|---|---|
| Domain | Development tasks | Any parallel agents | Research/writing |
| Task data | JSONL (tasks.jsonl) | YAML (workspace config) | JSONL (items.jsonl) |
| Isolation | Git worktrees | Terminal panes + prompt rules | Directory per unit |
| Launch | run T001 | workspace manager start config.yml | Subagent per unit |
| Parallelism | run-all --max 3 | All panes start at once | Per-phase parallel |
| Review | review T001 (diff + build + test) | RUNBOOK totals + workspace manager read | Review agent reads all findings |
| State tracking | JSONL (append-only) | JSONL + RUNBOOK.md | JSON (findings per unit) |
| Prompt assembly | Script builds from item fields | 3-layer sandwich (YAML) | Template + source map |
The shared principles:
- JSONL for everything. Append-only, git-trackable, human-readable, no database server. Every system uses it for state.
- Isolation by default. Whether worktrees, directories, or prompt boundaries — agents don’t share mutable state.
- Structured launch. Read task data → build prompt → launch agent. Never hand-write the prompt.
- Review as verification. Automated checks (build, test, schema validation) before human review. Persist the verdict.
- The ratchet. Each agent reads current state, does work, updates state. Progress only moves forward.
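The ratchet falls out of append-only JSONL naturally: state is never edited, only appended, and "current state" is a replay where the last record per id wins. A minimal sketch (function names are my own):

```python
import json

def current_state(log_lines):
    """Replay an append-only JSONL log; the last record per id wins."""
    state = {}
    for line in log_lines:
        rec = json.loads(line)
        state[rec["id"]] = rec
    return state

def advance(log_lines, rec_id, **updates):
    """The ratchet step: append an updated record, never rewrite history."""
    rec = dict(current_state(log_lines).get(rec_id, {"id": rec_id}))
    rec.update(updates)
    return log_lines + [json.dumps(rec)]
```

Because history is never rewritten, any agent (or human) can audit exactly how a task moved from planned to done.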
When to Use Which
Work system (a task runner script) — when tasks are code changes that need build/test verification. Each task gets a worktree, a prompt, and an auto-review. Best for: bug fixes, refactors, feature additions.
workspace manager — when you want N agents working simultaneously with visual monitoring. Declarative YAML, all agents start at once, workspace manager read to check progress. Best for: parallel reviews, round-based enrichment, any task where you want to watch agents work.
Research pipeline — when you’re researching N items across the same dimensions and need two-pass refinement. Directory isolation, phased execution, machine-readable findings. Best for: protocol documentation, competitive analysis, API surveys.
All three are overkill for single tasks. Use a plain prompt for that.
The Original Prompt
For reference, here’s the prompt that kicked off the TLD research. One message, natural language, no template:
Please websearch for all remaining TLDs — same info as we have for .com and .de: basic infos and special stuff, allowed characters and domain rules, price, how to get / availability of domain lists, ways for domain check — DNS, DNS auth, WHOIS, other niche special options — and for all of those the full possible metadata it could provide. Then run for real (use proxies) and capture and store full raw responses as truth and for potential parser/implementation guidance. Use one subagent per TLD and give him his own dir where he can download, code, write etc (persist findings in machine-readable way with sources). Then run review over everything creating a global specs/findings file (with all niches and categories etc). Use that to create v2 template/research task. Then launch second pass of agents (one per TLD, same procedure). Then again review and create a compressed, information-dense documentation for each TLD with everything needed (including raw/real captures in a uniform clean format). Ensure agents never overwrite each other’s work / step on each other’s toes.
The template is the reusable pattern extracted from this. The task file is the machine-readable instance. The CLI is the executor. Each layer makes the pattern more reproducible and less dependent on the operator getting the prompt right.