ADR-0015: Named-Agent-Messaging for Inter-Agent Loop Coordination

Status: Accepted
Date: 2026-05-08
Supersedes: None
Relates-to: ADR-0001, ADR-0002, ADR-0005, ADR-0010

Context and Problem Statement

The Nubosloop (ADR-0010) coordinates per-task spawns through file artefacts: nubosloop.checkpoint.json, the per-task tool_use_audit log (Layer C), the Critic's <report_path> (L5), and the orchestrator's stuck-state file. Within a round, the routing engine reads these artefacts and dispatches the next spawn.

Two coordination patterns are not addressed by these artefacts:

Per-finding addressed handback between agents. When the Critic emits N findings, all findings of category style → executor are merged into a single Build-Fixer prompt. The Executor cannot acknowledge an individual finding ("understood; fixed; ready for re-check"), so the Critic re-runs against the next round's diff and re-evaluates from scratch. Findings the Build-Fixer cannot resolve (genuinely ambiguous remediation) cannot be deferred or re-routed without falling out to askuser.
Critic ↔ Executor dialogue inside a round. Today the round structure is fixed: Executor → Critic → Routing → either commit or re-loop. There is no mechanism for the Critic to ask the Executor a clarifying question ("did you intend to delete FoobarService, or was that a side-effect?") without falling out to askuser.

A persistent, addressed, file-based message channel inside .nubos-pilot/messages/ lets agents within a round (and within a task across rounds) carry on a structured dialogue. Each message carries from, to, kind, subject, body, and an optional in_reply_to thread-id. The mechanism is not a daemon and not a network bus. It is append-only filesystem state, identical in spirit to lib/handoff.cjs and the per-task audit log.

The Rule

Agents in np:execute-phase may exchange addressed messages via lib/messaging.cjs and the np:messages-* subcommands. Messages persist under .nubos-pilot/messages/ as JSON files. The Nubosloop loop-termination condition is extended: a round may not commit while any expects_reply: true message in the current task is unarchived.

Decision Drivers

Addressed re-check loops: the Critic-to-Executor handback today is per-round, not per-finding. Per-finding addressing reduces re-evaluation cost and surfaces genuinely stuck findings earlier (a finding that bounces back and forth for ≥ 2 rounds is a stuck signal, not an executor-quality signal).
Audit-trail richness: the message log is a Layer-C-adjacent artefact with the same threat model (synthetic evidence) and the same defence (orchestrator-stamped, filesystem-witnessed). Critics auditing prior phases see a thread, not a state snapshot.
No daemon, no bus: every message is a file. Reads are readdir + readFile. No coordination process. (ADR-0001.)
Zero runtime deps: pure Node fs + crypto. No new package.json entry. (ADR-0002.)
Three-tree orthogonality: .nubos-pilot/messages/ is a strict Project-State sub-tree. (ADR-0005.)

Considered Options

A: No messaging, status quo. Reject: per-finding handback impossible; round-level granularity is the only available unit.
B: In-process message bus (e.g. EventEmitter inside the orchestrator). Reject: violates ADR-0001 (in-session daemon-shaped state) and is invisible to Layer-C audit.
C: Network message broker (Redis / NATS / etc.). Reject: violates ADR-0001 hard, requires service install on consumer machine.
D: File-based addressed messaging in .nubos-pilot/messages/, append-only manifest, archive-on-ack. Chosen.

Decision Outcome

Chosen: Option D, file-based addressed messaging, because it preserves the no-daemon stance, ships zero deps, integrates with the existing audit-trail philosophy of ADR-0010, and gives the routing engine a concrete loop-termination predicate (no unarchived expects_reply messages).

Layout

.nubos-pilot/messages/                        # Project-State sub-tree
  inbox/<agent-name>/<msg-id>.json            # unread, addressed at <agent-name>
  archive/<msg-id>.json                       # processed (acked / replied)
  archive/by-task/<task-id>/<msg-id>.json     # post-task historical archive
  manifest.jsonl                              # append-only audit log, all events

<msg-id> is a UUIDv4 plus a millisecond timestamp prefix for sort-stability: <unix-ms>-<uuid>. Filenames sort chronologically, and no two messages share an id even on the same millisecond.
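The id format can be sketched in a few lines. This is an illustration only, not the contents of lib/messaging.cjs: the helper names newMessageId and parseMessageId are hypothetical, and the only library call used is crypto.randomUUID from Node's standard library.

```javascript
const { randomUUID } = require("node:crypto");

// Hypothetical helper: builds "<unix-ms>-<uuid>" as described above.
// `now` is injectable so the example is deterministic in its prefix.
function newMessageId(now = Date.now()) {
  return `${now}-${randomUUID()}`;
}

// Hypothetical helper: splits an id back into timestamp and UUID parts.
// The first dash is the separator; the UUID itself contains further dashes.
function parseMessageId(id) {
  const dash = id.indexOf("-");
  return { unixMs: Number(id.slice(0, dash)), uuid: id.slice(dash + 1) };
}

const a = newMessageId(1730000000123);
const b = newMessageId(1730000000123); // same millisecond, different UUID
const c = newMessageId(1730000000124); // one millisecond later

// Plain string comparison is enough to order ids from different milliseconds.
const sorted = [c, a].sort();
```

Two caveats follow from the format itself. String order matches time order only while the millisecond prefix keeps the same number of digits (13 digits covers timestamps from 2001 until 2286). Two ids minted in the same millisecond are distinct but sort by their random UUID part, so their relative order is arbitrary.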

Message schema

{
  "id": "1730000000123-9b3e...",
  "from": "np-critic",
  "to": "np-executor",
  "phase": "M005-S007-T0002",
  "round": 2,
  "kind": "request|response|notify",
  "subject": "filament-resource-policy-missing",
  "body": "...",
  "expects_reply": true,
  "in_reply_to": "1729999999999-...|null",
  "created_at": "2026-05-08T..."
}

kind semantics:

request: expects_reply: true; receiver must archive with a reply or escalate.
response: expects_reply: false; carries in_reply_to pointing at the request it answers; archives the request as a side-effect.
notify: expects_reply: false; no reply required; archived by receiver after read.

subject is a kebab-case finding-category or topic id, matching the Critic finding-category taxonomy where applicable (style, dead-code, missing-test, weak-assertion, unmet-criterion, scope-creep, …) so the routing engine can index by it.
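The kind semantics and the kebab-case rule for subject can be read as a small set of invariants. The validator below is a sketch under that reading; validateMessage and its error codes are hypothetical names, not part of the documented library surface, and the sample id and body are made up for the example.

```javascript
const KINDS = new Set(["request", "response", "notify"]);
const KEBAB = /^[a-z0-9]+(-[a-z0-9]+)*$/;

// Hypothetical validator: returns a list of problem codes, empty when valid.
function validateMessage(msg) {
  const errors = [];
  for (const field of ["id", "from", "to", "phase", "subject", "body"]) {
    if (typeof msg[field] !== "string" || msg[field] === "") {
      errors.push(`missing-${field}`);
    }
  }
  if (!KINDS.has(msg.kind)) errors.push("unknown-kind");
  if (typeof msg.subject === "string" && msg.subject !== "" && !KEBAB.test(msg.subject)) {
    errors.push("subject-not-kebab-case");
  }
  // request: must expect a reply.
  if (msg.kind === "request" && msg.expects_reply !== true) {
    errors.push("request-must-expect-reply");
  }
  // response and notify: must not expect a reply.
  if ((msg.kind === "response" || msg.kind === "notify") && msg.expects_reply !== false) {
    errors.push("only-request-expects-reply");
  }
  // response: must point at the request it answers.
  if (msg.kind === "response" && typeof msg.in_reply_to !== "string") {
    errors.push("response-needs-in-reply-to");
  }
  return errors;
}

// Sample request; the id uses a placeholder UUID.
const request = {
  id: "1730000000123-00000000-0000-4000-8000-000000000001",
  from: "np-critic",
  to: "np-executor",
  phase: "M005-S007-T0002",
  round: 2,
  kind: "request",
  subject: "missing-test",
  body: "Placeholder question text for the example.",
  expects_reply: true,
  in_reply_to: null,
};

const requestErrors = validateMessage(request);
const orphanResponse = validateMessage({ ...request, kind: "response", expects_reply: false });
const chattyNotify = validateMessage({ ...request, kind: "notify" });
```

Here requestErrors is empty, orphanResponse reports the missing in_reply_to, and chattyNotify is rejected because a notify may not set expects_reply: true.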

Library surface

lib/messaging.cjs:

async function send(opts: { from, to, kind, subject, body, expects_reply, in_reply_to? }): Promise<MessageId>
async function inbox(agent: string, opts?: { kind?, since? }): Promise<Message[]>
async function archive(msg_id: string): Promise<void>
async function thread(msg_id: string): Promise<Message[]>
async function pendingReplies(task_id: string): Promise<Message[]>
async function sweepTaskOnCommit(task_id: string): Promise<number>

Each send() writes to inbox/<to>/<id>.json and appends a sent event to manifest.jsonl. Each archive() moves inbox/<...>/<id>.json to archive/<id>.json and appends an archived event. All file mutations go through atomicWriteFileSync / rename; no partial-write window.
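A minimal sketch of the send/archive flow follows, assuming a write-to-temp-then-rename pattern behind atomicWriteFileSync. It simplifies in three labelled ways: it is synchronous, whereas the documented functions return Promises; archive takes the agent name explicitly rather than locating the inbox file from the id alone; and the manifest event field names (event, id, to) are assumptions.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");
const { randomUUID } = require("node:crypto");

// Stand-in for atomicWriteFileSync: write a temp file next to the target,
// then rename it into place. Readers that only pick up "*.json" never see
// a half-written message.
function atomicWrite(file, data) {
  const tmp = `${file}.${randomUUID()}.tmp`;
  fs.writeFileSync(tmp, data);
  fs.renameSync(tmp, file);
}

// Appends one JSON line to the append-only manifest.
function logEvent(root, event) {
  fs.appendFileSync(path.join(root, "manifest.jsonl"), JSON.stringify(event) + "\n");
}

function send(root, msg) {
  const id = `${Date.now()}-${randomUUID()}`;
  const dir = path.join(root, "inbox", msg.to);
  fs.mkdirSync(dir, { recursive: true });
  atomicWrite(path.join(dir, `${id}.json`), JSON.stringify({ ...msg, id }, null, 2));
  logEvent(root, { event: "sent", id, to: msg.to });
  return id;
}

function archive(root, agent, id) {
  const archiveDir = path.join(root, "archive");
  fs.mkdirSync(archiveDir, { recursive: true });
  // A rename within one filesystem moves the file in a single step.
  fs.renameSync(path.join(root, "inbox", agent, `${id}.json`), path.join(archiveDir, `${id}.json`));
  logEvent(root, { event: "archived", id });
}

// Demo against a throwaway directory instead of .nubos-pilot/messages/.
const root = fs.mkdtempSync(path.join(os.tmpdir(), "np-messages-"));
const id = send(root, {
  from: "np-critic",
  to: "np-executor",
  phase: "M005-S007-T0002",
  kind: "notify",
  subject: "style",
  body: "Placeholder body for the example.",
  expects_reply: false,
});
archive(root, "np-executor", id);
```

After the demo runs, the inbox directory for np-executor is empty, archive/ holds the message file, and manifest.jsonl contains one sent line followed by one archived line.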

Subcommands

np:messages-send: emit a message; prints the new id.
np:messages-inbox: list unread messages addressed to an agent.
np:messages-archive: mark processed; refuses with messages-archive-without-reply when expects_reply: true AND no reply was sent.
np:messages-thread: print the reply-chain in causal order.
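The archive refusal and the thread walk are both small predicates over the in_reply_to links. The sketch below works on in-memory message objects; checkArchive and threadOf are hypothetical names, only the error code messages-archive-without-reply comes from the text above, and the ids are shortened placeholders. Sorting by id is used as an approximation of causal order, on the assumption that a reply is created in a later millisecond than the message it answers.

```javascript
// Hypothetical guard: a message with expects_reply: true may only be
// archived once some other message names it in in_reply_to.
function checkArchive(msg, allMessages) {
  if (msg.expects_reply !== true) return { ok: true };
  const replied = allMessages.some((m) => m.in_reply_to === msg.id);
  return replied ? { ok: true } : { ok: false, error: "messages-archive-without-reply" };
}

// Hypothetical thread walk: climb in_reply_to to the thread root, collect
// every message reachable from it, then order the result by id.
function threadOf(msgId, allMessages) {
  const byId = new Map(allMessages.map((m) => [m.id, m]));
  let root = byId.get(msgId);
  if (!root) return [];
  const visited = new Set([root.id]); // guards against malformed reply cycles
  while (root.in_reply_to && byId.has(root.in_reply_to) && !visited.has(root.in_reply_to)) {
    root = byId.get(root.in_reply_to);
    visited.add(root.id);
  }
  const members = new Set([root.id]);
  let grew = true;
  while (grew) {
    grew = false;
    for (const m of allMessages) {
      if (!members.has(m.id) && members.has(m.in_reply_to)) {
        members.add(m.id);
        grew = true;
      }
    }
  }
  return allMessages
    .filter((m) => members.has(m.id))
    .sort((x, y) => (x.id < y.id ? -1 : x.id > y.id ? 1 : 0));
}

const question = { id: "1730000000001-q", kind: "request", expects_reply: true, in_reply_to: null };
const answer = { id: "1730000000002-r", kind: "response", expects_reply: false, in_reply_to: question.id };
const unrelated = { id: "1730000000003-n", kind: "notify", expects_reply: false, in_reply_to: null };

const beforeReply = checkArchive(question, [question, unrelated]);
const afterReply = checkArchive(question, [question, answer, unrelated]);
const chain = threadOf(answer.id, [unrelated, answer, question]).map((m) => m.id);
```

In this example beforeReply carries the refusal code, afterReply is allowed, and chain lists the request followed by its response while leaving the unrelated notify out.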

Nubosloop integration

Critic emits findings → optional addressed messages. When the single-Critic spawn (ADR-0010 §Single-Critic Revision) wants to address a specific Executor question or per-finding clarification, it calls messages-send --to np-executor --kind request --subject <finding-category> --expects-reply from inside the spawn. The aggregate findings JSON still travels via <report_path> (L5); messages carry the dialogue layer, not the findings layer.
Executor / Build-Fixer reads inbox before writing. Round-2+ Build-Fixer's prompt includes messages-inbox --agent np-executor in the read-list. Replies are sent via messages-send --kind response --in-reply-to <id>.
Loop-termination predicate. The Layer-B precondition in bin/np-tools/loop-run-round.cjs::_runCommit is extended: in addition to verify_exit_code=0 and findings: [], pendingReplies(task_id).length === 0 MUST hold. A pending reply means a Critic question is unanswered: the commit is blocked and the round routes back to the Executor, with the open inbox surfaced via details.pending_subjects.
Phase-completion sweep. After a successful commit, _runCommit calls sweepTaskOnCommit(task_id) which moves every message with the matching phase from inbox/ and archive/ into archive/by-task/<task_id>/ and emits a task-swept event in manifest.jsonl. Future tasks see clean inboxes; the per-task audit-trail stays accessible.
Stuck escalation. If pendingReplies(task_id) does not shrink across two consecutive rounds (the Executor's reply doesn't satisfy the Critic and a new request bounces back), the routing engine treats this as a messaging-stalemate finding and triggers the standard askuser four-options dialog (ADR-0010 §Failure Mode). (Routing-engine entry tracked separately; the surface is in place.)
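The extended commit precondition and the stalemate check can be sketched as two pure functions. This is an illustration, not the code in loop-run-round.cjs: commitGate, isMessagingStalemate, and the reason strings are hypothetical, and only verify_exit_code, findings, and details.pending_subjects are taken from the text above. The stalemate check assumes one reading of "does not shrink across two consecutive rounds", namely that the pending count after the latest round is non-zero and no smaller than after the round before it.

```javascript
// Hypothetical shape of the extended Layer-B precondition. `pending` is
// whatever pendingReplies(task_id) returned for the current task.
function commitGate({ verify_exit_code, findings, pending }) {
  if (verify_exit_code !== 0) return { commit: false, reason: "verify-failed" };
  if (findings.length > 0) return { commit: false, reason: "open-findings" };
  if (pending.length > 0) {
    return {
      commit: false,
      reason: "pending-replies",
      details: { pending_subjects: pending.map((m) => m.subject) },
    };
  }
  return { commit: true };
}

// pendingCounts[i] is the number of pending replies at the end of round i.
// Stalemate (under the stated assumption): the last two rounds both ended
// with open requests and the count did not go down between them.
function isMessagingStalemate(pendingCounts) {
  if (pendingCounts.length < 2) return false;
  const [previous, latest] = pendingCounts.slice(-2);
  return latest > 0 && latest >= previous;
}

const blocked = commitGate({
  verify_exit_code: 0,
  findings: [],
  pending: [{ subject: "missing-test" }, { subject: "scope-creep" }],
});
const clean = commitGate({ verify_exit_code: 0, findings: [], pending: [] });
```

With this shape, blocked routes back to the Executor carrying both open subjects, while clean allows the commit. A count history such as [2, 1] is progress, whereas [1, 1] would be treated as a stalemate.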

Layer-C compatibility

Messaging events do not replace the Layer-C audit; they augment it. A Critic that emits messages-send --kind request from inside its spawn still has its loop-audit-tool-use --agent np-critic stamp applied by the orchestrator after spawn-return. The message itself is an additional artefact, not a substitute for the audit entry.

A hostile orchestrator that wants to fake messaging can write inbox files directly (the filesystem is unguarded). This is the same threat model as Layer-C today (the audit log is appended by the orchestrator, not by the runtime) and the same mitigation applies: future Stage-2 runtime instrumentation will stamp message provenance the orchestrator cannot forge.

Cleanup and lifecycle

Per-task scope. Messages are scoped to the task that produced them (the phase field holds the task id). At task-commit time, all messages with that phase move from inbox/ and archive/ into archive/by-task/<task-id>/, where they remain available for historical lookup but take no part in active routing.
Manifest is append-only. manifest.jsonl is never truncated; it is the audit-trail.
TTL: none in v1. If inbox/ accumulates due to abandoned tasks (operator-killed mid-task), np:doctor surfaces messaging-orphan-inbox with --fix semantics that move orphans to archive/orphan/. (Doctor finding tracked separately.)
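The task-commit sweep described in this section could look roughly like the following. It is a synchronous sketch run against a temporary directory, whereas the documented sweepTaskOnCommit is async and takes only task_id. The manifest field names other than the task-swept event, the helper names, and the second task id (M005-S007-T0003) are assumptions, and the message ids are shortened placeholders.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Collects message files from inbox/<agent>/ and the top level of archive/.
// Sub-directories of archive/ (such as by-task/) are skipped.
function listMessageFiles(root) {
  const files = [];
  const inbox = path.join(root, "inbox");
  if (fs.existsSync(inbox)) {
    for (const agent of fs.readdirSync(inbox, { withFileTypes: true })) {
      if (!agent.isDirectory()) continue;
      for (const name of fs.readdirSync(path.join(inbox, agent.name))) {
        files.push(path.join(inbox, agent.name, name));
      }
    }
  }
  const archive = path.join(root, "archive");
  if (fs.existsSync(archive)) {
    for (const entry of fs.readdirSync(archive, { withFileTypes: true })) {
      if (entry.isFile()) files.push(path.join(archive, entry.name));
    }
  }
  return files.filter((f) => f.endsWith(".json"));
}

// Moves every message whose phase matches taskId into archive/by-task/<taskId>/
// and records a task-swept event. Returns the number of files moved.
function sweepTaskOnCommit(root, taskId) {
  const dest = path.join(root, "archive", "by-task", taskId);
  fs.mkdirSync(dest, { recursive: true });
  let moved = 0;
  for (const file of listMessageFiles(root)) {
    const msg = JSON.parse(fs.readFileSync(file, "utf8"));
    if (msg.phase !== taskId) continue;
    fs.renameSync(file, path.join(dest, path.basename(file)));
    moved += 1;
  }
  const event = { event: "task-swept", task_id: taskId, moved };
  fs.appendFileSync(path.join(root, "manifest.jsonl"), JSON.stringify(event) + "\n");
  return moved;
}

// Demo fixture: two messages for the committed task, one for another task.
function put(dir, id, phase) {
  fs.mkdirSync(dir, { recursive: true });
  fs.writeFileSync(path.join(dir, `${id}.json`), JSON.stringify({ id, phase }));
}

const root = fs.mkdtempSync(path.join(os.tmpdir(), "np-sweep-"));
put(path.join(root, "inbox", "np-executor"), "1730000000001-a", "M005-S007-T0002");
put(path.join(root, "archive"), "1730000000002-b", "M005-S007-T0002");
put(path.join(root, "inbox", "np-critic"), "1730000000003-c", "M005-S007-T0003");

const moved = sweepTaskOnCommit(root, "M005-S007-T0002");
```

In the demo, the two messages of the committed task end up under archive/by-task/M005-S007-T0002/, and the message belonging to the other task stays in its inbox.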

Messaging is never committed

Per the same user directive as ADR-0014: .nubos-pilot/messages/ is runtime state, not source of truth. The directory is added to the consumer project's .gitignore. The audit-trail value lives in the directory during a project's life; replay from manifest.jsonl is possible but not load-bearing.

Consequences

Good, because per-finding Critic ↔ Executor dialogue is now expressible without falling out to askuser.
Good, because pendingReplies is a deterministic loop-termination predicate, not a heuristic.
Good, because the message log is a Layer-C-adjacent audit artefact: Critics in later phases can replay the dialogue for stuck-detection and recurring-pattern analysis.
Good, because zero new runtime deps: pure fs + crypto.randomUUID.
Bad, because the messaging surface is one more thing to teach agents; agent prompts grow. Mitigation: optional surface, so agents that don't send continue to work as today.
Bad, because filesystem message races on parallel-spawn topologies (Researcher swarm k=3, future parallel critics) require care. Mitigation: filenames are timestamp-prefixed UUIDs; readers sort by filename.
Bad, because a wave of orphan inbox messages from killed tasks is possible. Mitigation: np:doctor covers it; archive/by-task/ retention is bounded by task-archive policy.

More information

Library: lib/messaging.cjs; bin/np-tools/loop-run-round.cjs::_runCommit (extended Layer-B precondition + sweep hook).
Subcommand: bin/np-tools/messages-{send,inbox,archive,thread}.cjs.
Concept: Named-Agent-Messaging.
Agents: agents/np-critic.md (Inter-Agent Messaging section), agents/np-executor.md (Round 2+ inbox-read), agents/np-build-fixer.md (Step 0 inbox-read).

This ADR specifies the in-task addressed-messaging surface. Inter-task and inter-phase messaging (operator-to-agent notes, mid-phase escalation tickets) are out-of-scope and remain handled by lib/handoff.cjs and the existing askuser path.
