Skip to content

Context Compression & Elision

Opt-in, reversible compression of the content nubos-pilot puts in front of a model: the tool results an agent re-reads every turn, the large fenced blocks in a spawn prompt, the verify logs a critic reads. There are also two levers on the output side that cut the tokens the model writes back. None of it calls a model or adds a dependency. Every reducer is a plain rule in lib/compress.cjs, and every block it drops is cached on disk first, so the marker always resolves back to the original. The worst a lossy drop can do is leave an answer uncompressed; it can't make the answer wrong.

The full config surface lives in Configuration § compression.*. This page is the why and the how-it-fits-together.

Why

In an agentic loop the message history is re-sent every turn. A 40 KB grep result or a 200-line test log gets read once but billed once per remaining turn, so the cost compounds fast. The two obvious fixes both fail. Dumb tail-truncation throws away the one error line that mattered. Asking a model to summarise the block adds latency, cost, and a fresh place for the model to be wrong, which is the opposite of what you wanted.

Elision does neither. It reduces structurally: keep the signatures, headers, error lines and outliers, drop the predictable middle. Before it drops anything it caches the original under a hash, so the reduction is reversible. Where the bulk used to be, the agent sees a ⟦elided:<hash>⟧ marker, and it can pull the full text back the moment it actually needs it.

The master switch

compression.enabled (default false) gates everything below it. While it is off nothing changes: verify logs fall back to plain tail-verify_max_bytes truncation, and prompts and tool results pass through byte-for-byte. Turn it on and the reducers run and start emitting elision markers. Each sub-lever (proxy, output_steering, cache_align) keeps its own toggle and stays off until you set it, so flipping the master switch on its own is a conservative change.

Where it bites: two paths

How far compression can reach depends on whose loop the agent runs in.

Off-host agents (in-loop)

openai-compat providers run their agentic loop inside nubos-pilot (lib/runtime/agent-loop.cjs). Because nubos-pilot owns the message array, it crushes each tool result before that result enters the history it re-sends every turn, so the saving compounds across the whole loop. This is the strongest of the levers, and on tool-heavy loops the reductions are large.

While compression is on, these agents also get a context-expand tool (lib/runtime/tools/index.cjs, EXPAND_TOOL_NAME). The model hands it the 12-char hash from a marker and gets the original back mid-loop, with no human in the path. A context-expand result is never re-compressed itself, so expansion always terminates.

Native claude -p spawns (proxy)

Native spawns run their loop inside the Claude CLI, where nubos-pilot can't see it. To reach that traffic, compression.proxy.enabled forks a localhost proxy (lib/elision-proxy.cjs), points the child's ANTHROPIC_BASE_URL at it, and crushes oversized tool_result blocks in each /v1/messages request on the way upstream. It touches only tool_result blocks; system, the tool definitions and ordinary text stay byte-identical. The rewrite is deterministic, so the compressed prefix stays stable and Anthropic prompt-caching keeps working.

The reducers

lib/compress.cjs dispatches on the shape of the block. Each reducer keeps the lines that carry signal, drops the predictable bulk, and protects error, failure and outlier lines before anything else. A block under min_block_bytes (default 2048) is left byte-identical, since below that size the marker costs more than it saves.

Block typeWhat is keptWhat is elided
JSON arrayshead + tail items, any item with an error/outlier fieldthe homogeneous middle
Build / test logserror, failure, stack-trace and warning linesclean progress chatter
Grep / search outputa bounded sample per file, capped totalthe long tail of matches
Unified diffshunk headers + a few context lineswide unchanged context
C-family sourcesignatures, structure, throw/TODO/FIXME linesstatement bodies
Brace-light source (Python &c.)module-level lines, block headers, raise/assertdeep, indented statement bodies
Prosehead + tail sentences, any directive/warning sentencethe sampled middle

Brace counting for C-family code masks string literals and ///* */ comments first, so braces inside them never throw off the structure detection.

The elision store

lib/elision.cjs caches each dropped original at .nubos-pilot/elision/<hash>.json. The filename is the 12-char SHA-256 prefix, the file is validated against the elision-entry.v1 schema, and the write goes through a file lock atomically. Entries carry a TTL (compression.elision.ttl_ms, default 30 min); once it passes, retrieve reports expired and prune clears the entry lazily.

There are two ways back to the original:

bash
# CLI — recover the text behind a marker
node .nubos-pilot/bin/np-tools.cjs elision-get <hash>          # raw original to stdout
node .nubos-pilot/bin/np-tools.cjs elision-get <hash> --json   # status envelope

…and the in-loop context-expand tool described above. Setting compression.elision.enabled to false crushes blocks without caching them, which makes the drop irreversible; don't, unless you have a specific reason to.

Output steering: the other direction

lib/output-steering.cjs works the other way, cutting the tokens the model writes back (compression.output_steering.*). There are two independent levers, both safe with the prompt cache.

  • Verbosity steering adds a fixed terseness directive to the end of the system prompt, inside an idempotent <nubos_output_shaping> block so re-enrichment never stacks. Because it lands after the cached prefix, prompt-caching is unaffected. The profiles get progressively tighter: balanced (a no-op), concise, terse, minimal.
  • Effort routing looks at each off-host turn structurally. A fresh user ask, or a turn recovering from a tool error, keeps full effort (base_effort); a turn that only continues from clean tool_results counts as mechanical and is downgraded to mechanical_effort. The value reaches the provider as the openai-compat reasoning_effort field, and that field goes out only when you pin a concrete base_effort, which is how you tell nubos-pilot the provider supports it. Leave it unpinned and the lever does nothing and the field is never sent, so a provider without effort support sees no change. Single-shot native spawns have no multi-turn loop, so routing doesn't apply to them.

Cache alignment: proxy path only

lib/cache-align.cjs (compression.cache_align.*) keeps the native-spawn prompt cacheable. Both levers act on the /v1/messages body the proxy already sees, and both are idempotent.

  • Detect never mutates anything. It scans the system prompt for cache-volatile tokens (a UUID, an ISO-8601 datetime, a JWT, a long hex hash) and logs one warning per kind, so when a prefix keeps missing the cache you can see why instead of guessing.
  • Normalize does mutate, and is opt-in. It gives the tool-definition prefix a stable byte layout (tools sorted by name, schema keys sorted recursively, array order left alone) and adds a single cache_control: {type: "ephemeral"} breakpoint on the last tool. If the caller already set any cache_control on system or tools, it leaves everything as-is — a deliberate cache layout is never disturbed.

The rest of compression.* keeps prefixes byte-identical. This lever is the exception: it rewrites the tool prefix on purpose to make it canonical, which is why it sits behind its own toggle.

Swarm input dedup

With compression.enabled and compression.elision.enabled both on, the Researcher-Schwarm stops re-inlining its identical k-spawn input. It caches the shared payload once, and each spawn spec then carries an input_ref hash instead of the same blob repeated k times.

Measuring it with elision-bench

elision-bench is how you check the reducers are actually behaving. It runs them over a fixture corpus and reports three things, all deterministic and with no model in the loop:

  • ratio — how much smaller the output got;
  • critical-line preservation — whether every error, failure and outlier line survived;
  • byte-exact reversibility — whether every marker resolves back to the original, byte for byte.
bash
# deterministic fidelity over the fixtures
node .nubos-pilot/bin/np-tools.cjs elision-bench

# scaled corpus + holdout control, with an estimated token (and cost) saving
node .nubos-pilot/bin/np-tools.cjs elision-bench --size large \
  --price-per-mtok 3 --currency EUR

# add answer-equivalence: raw vs compressed through a real model
node .nubos-pilot/bin/np-tools.cjs elision-bench --with-model --tier sonnet

It doubles as the regression net for the reducers. It has already caught two bugs in lib/compress.cjs that the per-block fidelity checks on their own would have let through.