Context Compression & Elision

Opt-in, reversible compression of the content nubos-pilot puts in front of a model: the tool results an agent re-reads every turn, the large fenced blocks in a spawn prompt, the verify logs a critic reads. There are also two levers on the output side that cut the tokens the model writes back. None of it calls a model or adds a dependency. Every reducer is a plain rule in lib/compress.cjs, and every block it drops is cached on disk first, so the marker always resolves back to the original. The worst a lossy drop can do is leave an answer uncompressed; it can't make the answer wrong.

The full config surface lives in Configuration § compression.*. This page is the why and the how-it-fits-together.

Why

In an agentic loop the message history is re-sent every turn. A 40 KB grep result or a 200-line test log gets read once but billed once per remaining turn, so the cost compounds fast. The two obvious fixes both fail. Dumb tail-truncation throws away the one error line that mattered. Asking a model to summarise the block adds latency, cost, and a fresh place for the model to be wrong, which is the opposite of what you wanted.

Elision does neither. It reduces structurally: keep the signatures, headers, error lines and outliers, drop the predictable middle. Before it drops anything it caches the original under a hash, so the reduction is reversible. Where the bulk used to be, the agent sees a ⟦elided:<hash>⟧ marker, and it can pull the full text back the moment it actually needs it.

The master switch

compression.enabled (default false) gates everything below it. While it is off nothing changes: verify logs fall back to plain tail-verify_max_bytes truncation, and prompts and tool results pass through byte-for-byte. Turn it on and the reducers run and start emitting elision markers. Each sub-lever (proxy, output_steering, cache_align) keeps its own toggle and stays off until you set it, so flipping the master switch on its own is a conservative change.

Where it bites: two paths

How far compression can reach depends on whose loop the agent runs in.

Off-host agents (in-loop)

openai-compat providers run their agentic loop inside nubos-pilot (lib/runtime/agent-loop.cjs). Because nubos-pilot owns the message array, it crushes each tool result before that result enters the history it re-sends every turn, so the saving compounds across the whole loop. This is the strongest of the levers, and on tool-heavy loops the reductions are large.

While compression is on, these agents also get a context-expand tool (lib/runtime/tools/index.cjs, EXPAND_TOOL_NAME). The model hands it the 12-char hash from a marker and gets the original back mid-loop, with no human in the path. A context-expand result is never re-compressed itself, so expansion always terminates.

Native `claude -p` spawns (proxy)

Native spawns run their loop inside the Claude CLI, where nubos-pilot can't see it. To reach that traffic, compression.proxy.enabled forks a localhost proxy (lib/elision-proxy.cjs), points the child's ANTHROPIC_BASE_URL at it, and crushes oversized tool_result blocks in each /v1/messages request on the way upstream. It touches only tool_result blocks; system, the tool definitions and ordinary text stay byte-identical. The rewrite is deterministic, so the compressed prefix stays stable and Anthropic prompt-caching keeps working.

The reducers

lib/compress.cjs dispatches on the shape of the block. Each reducer keeps the lines that carry signal, drops the predictable bulk, and protects error, failure and outlier lines before anything else. A block under min_block_bytes (default 2048) is left byte-identical, since below that size the marker costs more than it saves.

Block type	What is kept	What is elided
JSON arrays	head + tail items, any item with an error/outlier field	the homogeneous middle
Build / test logs	error, failure, stack-trace and warning lines	clean progress chatter
Grep / search output	a bounded sample per file, capped total	the long tail of matches
Unified diffs	hunk headers + a few context lines	wide unchanged context
C-family source	signatures, structure, `throw`/`TODO`/`FIXME` lines	statement bodies
Brace-light source (Python &c.)	module-level lines, block headers, `raise`/`assert`	deep, indented statement bodies
Prose	head + tail sentences, any directive/warning sentence	the sampled middle

Brace counting for C-family code masks string literals and // … /* */ comments first, so braces inside them never throw off the structure detection.

The elision store

lib/elision.cjs caches each dropped original at .nubos-pilot/elision/<hash>.json. The filename is the 12-char SHA-256 prefix, the file is validated against the elision-entry.v1 schema, and the write goes through a file lock atomically. Entries carry a TTL (compression.elision.ttl_ms, default 30 min); once it passes, retrieve reports expired and prune clears the entry lazily.

There are two ways back to the original:

bash

# CLI — recover the text behind a marker
node .nubos-pilot/bin/np-tools.cjs elision-get <hash>          # raw original to stdout
node .nubos-pilot/bin/np-tools.cjs elision-get <hash> --json   # status envelope

…and the in-loop context-expand tool described above. Setting compression.elision.enabled to false crushes blocks without caching them, which makes the drop irreversible; don't, unless you have a specific reason to.

Output steering: the other direction

lib/output-steering.cjs works the other way, cutting the tokens the model writes back (compression.output_steering.*). There are two independent levers, both safe with the prompt cache.

Verbosity steering adds a fixed terseness directive to the end of the system prompt, inside an idempotent <nubos_output_shaping> block so re-enrichment never stacks. Because it lands after the cached prefix, prompt-caching is unaffected. The profiles get progressively tighter: balanced (a no-op), concise, terse, minimal.
Effort routing looks at each off-host turn structurally. A fresh user ask, or a turn recovering from a tool error, keeps full effort (base_effort); a turn that only continues from clean tool_results counts as mechanical and is downgraded to mechanical_effort. The value reaches the provider as the openai-compat reasoning_effort field, and that field goes out only when you pin a concrete base_effort, which is how you tell nubos-pilot the provider supports it. Leave it unpinned and the lever does nothing and the field is never sent, so a provider without effort support sees no change. Single-shot native spawns have no multi-turn loop, so routing doesn't apply to them.

Cache alignment: proxy path only

lib/cache-align.cjs (compression.cache_align.*) keeps the native-spawn prompt cacheable. Both levers act on the /v1/messages body the proxy already sees, and both are idempotent.

Detect never mutates anything. It scans the system prompt for cache-volatile tokens (a UUID, an ISO-8601 datetime, a JWT, a long hex hash) and logs one warning per kind, so when a prefix keeps missing the cache you can see why instead of guessing.
Normalize does mutate, and is opt-in. It gives the tool-definition prefix a stable byte layout (tools sorted by name, schema keys sorted recursively, array order left alone) and adds a single cache_control: {type: "ephemeral"} breakpoint on the last tool. If the caller already set any cache_control on system or tools, it leaves everything as-is — a deliberate cache layout is never disturbed.

The rest of compression.* keeps prefixes byte-identical. This lever is the exception: it rewrites the tool prefix on purpose to make it canonical, which is why it sits behind its own toggle.

Swarm input dedup

With compression.enabled and compression.elision.enabled both on, the Researcher-Schwarm stops re-inlining its identical k-spawn input. It caches the shared payload once, and each spawn spec then carries an input_ref hash instead of the same blob repeated k times.

Measuring it with `elision-bench`

elision-bench is how you check the reducers are actually behaving. It runs them over a fixture corpus and reports three things, all deterministic and with no model in the loop:

ratio — how much smaller the output got;
critical-line preservation — whether every error, failure and outlier line survived;
byte-exact reversibility — whether every marker resolves back to the original, byte for byte.

bash

# deterministic fidelity over the fixtures
node .nubos-pilot/bin/np-tools.cjs elision-bench

# scaled corpus + holdout control, with an estimated token (and cost) saving
node .nubos-pilot/bin/np-tools.cjs elision-bench --size large \
  --price-per-mtok 3 --currency EUR

# add answer-equivalence: raw vs compressed through a real model
node .nubos-pilot/bin/np-tools.cjs elision-bench --with-model --tier sonnet

It doubles as the regression net for the reducers. It has already caught two bugs in lib/compress.cjs that the per-block fidelity checks on their own would have let through.

ADR-0022 — the decision this page describes, with the options weighed and the consequences accepted.
Configuration § compression.* — the complete key-by-key reference.
Runtimes — off-host openai-compat vs. native claude -p, the split that decides which path applies.
Researcher-Schwarm — consumes the k-spawn input dedup.
Data-Schema Validation — elision-entry.v1, the on-disk store schema.

Context Compression & Elision ​

Why ​

The master switch ​

Where it bites: two paths ​

Off-host agents (in-loop) ​

Native claude -p spawns (proxy) ​

The reducers ​

The elision store ​

Output steering: the other direction ​

Cache alignment: proxy path only ​

Swarm input dedup ​

Measuring it with elision-bench ​

Related ​