Skip to content

ADR-0022: Context Compression via Reversible Elision

  • Status: Accepted. The decision (deterministic, opt-in, reversible compression of model-bound content, plus two output-side levers) is implemented across both dispatch paths: the off-host agent loop and the native claude -p proxy. It is off by default and changes nothing until enabled.
  • Date: 2026-06-23
  • Supersedes: None
  • Related: ADR-0001 (No-Daemon), ADR-0002 (Zero Runtime Deps), ADR-0021 (Model-Agnostic Dispatch — the off-host loop this hooks into)

Context and Problem Statement

Once nubos-pilot started running agent loops itself (ADR-0021), the token bill became its own problem rather than the host's. In an agentic loop the message history is re-sent on every turn, so a tool result read once is paid for once per remaining turn. A single 40 KB grep result or a 200-line test log can dominate the cost of a long loop, and most of those bytes are predictable filler: clean log lines, unchanged diff context, the homogeneous middle of a JSON array.

The same waste shows up outside the loop. Headless spawns (bin/np-tools/spawn-headless.cjs) inline large fenced blocks into the prompt, the Researcher-Schwarm re-inlines an identical input k times, and verify output handed to a critic is often far larger than the budget the critic needs.

We wanted to cut that cost without taking on either of the usual risks:

  • Tail-truncation is cheap but lossy in the worst way — it drops the one error line that explains the failure.
  • Model-based summarisation adds latency, cost and a second place for the model to hallucinate, which defeats the point of saving tokens.

The native claude -p path adds a wrinkle: its loop runs inside the Claude CLI, so nubos-pilot never sees the message array and cannot compress in-loop the way it can off-host.

Decision Drivers

  • The drop has to be recoverable. Whatever we remove, an agent must be able to get back. A lossy reduction is only acceptable if the worst case is an uncompressed answer, never a wrong one.
  • No model in the compression path. Compression that calls a model to save model tokens is circular. The reducers must be deterministic rules.
  • [ADR-0002] Zero runtime dependencies. No compression library, no tokenizer package. Built-ins only.
  • [ADR-0001] No daemon — mostly. The native-path proxy is the one long-lived process, and it is scoped to the lifetime of the spawn it serves, not owned indefinitely by nubos-pilot.
  • The default must not regress. With the feature off, prompts and tool results pass through byte-for-byte and verify logs fall back to the old tail truncation. Turning it on is opt-in.
  • The prompt cache must survive. Any rewrite that sits in the cached prefix has to stay byte-stable, or we trade a token saving for a cache miss that costs more than it saved.

Considered Options

  • Tail-truncate oversized blocks. Rejected as the general answer. It is what the feature falls back to when off, but on its own it silently discards the lines that matter most (errors, stack frames).
  • Summarise blocks with a cheap model. Rejected. Adds latency and cost, and puts a hallucination surface in the middle of the cost-saving mechanism.
  • Compress the transport only (gzip the HTTP body). Rejected. It saves bandwidth, not tokens — the model still sees and is billed for the full decompressed content.
  • Deterministic structural reducers with a reversible on-disk store. Chosen.

Decision Outcome

Chosen: rule-based structural reducers in lib/compress.cjs that keep the signal lines and drop the predictable bulk, backed by an on-disk elision store (lib/elision.cjs) that makes every drop reversible. Where the bulk used to be, the content carries a ⟦elided:<hash>⟧ marker; the original is cached at .nubos-pilot/elision/<hash>.json and recoverable by hash. Everything sits behind the compression.enabled master switch and is off by default.

The reducers

lib/compress.cjs dispatches on the shape of the block and keeps error, failure and outlier lines first. Blocks smaller than min_block_bytes (default 2048) are left untouched, since the marker would cost more than it saves. The shapes it handles: JSON arrays, build/test logs, grep output, unified diffs, C-family source, brace-light source such as Python, and prose. Brace counting masks string literals and comments before it counts, so braces inside them do not throw off structure detection.

The elision store

Each dropped original is written to .nubos-pilot/elision/<hash>.json, named by a 12-char SHA-256 prefix, validated against the elision-entry.v1 schema, and written atomically under a file lock. Entries carry a TTL (compression.elision.ttl_ms, default 30 min); after it lapses, retrieve reports expired and prune clears the entry lazily. Two ways back to the original: the elision-get CLI, and the in-loop context-expand tool the off-host agent gets automatically while compression is on. A context-expand result is never re-compressed, so expansion terminates.

Two dispatch paths

  • Off-host (in-loop). openai-compat agents run their loop inside nubos-pilot (lib/runtime/agent-loop.cjs), so each tool result is crushed before it enters the re-sent history. This is where the saving compounds, and it is the strongest lever.
  • Native claude -p (proxy). Because that loop is invisible to nubos-pilot, compression.proxy.enabled forks a localhost proxy (lib/elision-proxy.cjs), points the child's ANTHROPIC_BASE_URL at it, and crushes oversized tool_result blocks in each /v1/messages request on the way upstream. It touches only tool_result blocks; system, tool definitions and ordinary text stay byte-identical, and because the rewrite is deterministic the compressed prefix stays cache-stable.

Output-side levers

Two further levers cut the tokens the model writes back, under compression.output_steering.*. Verbosity steering appends a fixed terseness directive after the cached prefix (inside an idempotent <nubos_output_shaping> block, so re-enrichment never stacks and the cache is unaffected). Effort routing downgrades reasoning effort on mechanical continuation turns in the off-host loop, and only sends the reasoning_effort field when a concrete base_effort is pinned, so providers without effort support are untouched.

Cache alignment (proxy path only)

lib/cache-align.cjs keeps the native-spawn prompt cacheable. It detects cache-volatile tokens (UUID, ISO datetime, JWT, long hex hash) in the system prompt and warns, and optionally normalises the tool-definition prefix to a stable byte layout with one cache_control breakpoint. This is the one lever that deliberately rewrites a prefix instead of preserving it byte-for-byte, which is why it has its own opt-in toggle and yields to any cache_control the caller already placed.

Compliance with standing invariants

  • ADR-0002 (Zero Runtime Deps): the reducers, the store and the proxy use Node built-ins only (zlib is not used; compression here is structural, not byte-level). No package added.
  • ADR-0001 (No-Daemon): the off-host path adds no process. The proxy is the single exception, and it lives only as long as the native spawn it serves.

Consequences

  • Good, because the token cost of tool-heavy loops drops substantially, with the biggest effect off-host where the saving compounds across every turn.
  • Good, because the drop is reversible: an agent that needs the elided detail pulls it back by hash, so the failure mode is an uncompressed answer, not a wrong one.
  • Good, because the default is untouched — with the master switch off, behaviour is byte-for-byte as before.
  • Good, because there is no model and no dependency in the compression path, so it cannot add latency or a hallucination surface, and it does not widen the supply chain.
  • Good, because the work is measurable: elision-bench reports compression ratio, critical-line preservation and byte-exact reversibility over a fixture corpus, and already caught two reducer bugs in lib/compress.cjs that per-block assertions alone would have missed.
  • Bad, because the proxy is a long-lived process for the duration of a native spawn, which is a narrow exception to the no-daemon invariant and a new moving part to get right (port binding, upstream forwarding, clean teardown).
  • Bad, because the reducers are heuristics tuned to block shapes; a block the dispatcher misclassifies is compressed by the wrong rule. The elision store keeps this safe rather than correct — the agent can always recover, but it may spend a context-expand round-trip doing so.
  • Bad, because the cache-alignment normaliser rewrites the tool prefix, so a future change to how a host lays out tools could desync it from the cached layout; it is opt-in and yields to caller-placed cache_control precisely to bound that risk.
  • Bad, because the config surface grows (compression.* with proxy, output-steering and cache-align sub-trees), adding keys that must be validated and documented.

More Information

  • Reducers + store: lib/compress.cjs, lib/elision.cjs, schema lib/schemas/data/elision-entry.v1.json. CLI: bin/np-tools/elision-get.cjs.
  • Off-host hook: lib/runtime/agent-loop.cjs (per-tool-result compression), lib/runtime/tools/index.cjs (context-expand tool, EXPAND_TOOL_NAME).
  • Native proxy: lib/elision-proxy.cjs, bin/np-tools/_elision-proxy-entry.cjs, alignment in lib/cache-align.cjs.
  • Output side: lib/output-steering.cjs, cost estimation in lib/token-cost.cjs.
  • Measurement: lib/elision-bench.cjs, CLI bin/np-tools/elision-bench.cjs.
  • Config: compression.* in lib/config-defaults.cjs + lib/config-schema.cjs; documented in Configuration § compression.*.
  • Concept page: Context Compression & Elision.
  • Related ADRs:
    • ADR-0021: the off-host agent loop this compresses in place.
    • ADR-0002: built-ins only, no compression package.
    • ADR-0001: one-shot off-host loop; the proxy is the scoped exception.