Context Compression & Elision
Opt-in, reversible compression of the content nubos-pilot puts in front of a model: the tool results an agent re-reads every turn, the large fenced blocks in a spawn prompt, the verify logs a critic reads. There are also two levers on the output side that cut the tokens the model writes back. None of it calls a model or adds a dependency. Every reducer is a plain rule in lib/compress.cjs, and every block it drops is cached on disk first, so the marker always resolves back to the original. The worst a lossy drop can do is leave an answer uncompressed; it can't make the answer wrong.
The full config surface lives in Configuration § compression.*. This page is the why and the how-it-fits-together.
Why
In an agentic loop the message history is re-sent every turn. A 40 KB grep result or a 200-line test log gets read once but billed once per remaining turn, so the cost compounds fast. The two obvious fixes both fail. Dumb tail-truncation throws away the one error line that mattered. Asking a model to summarise the block adds latency, cost, and a fresh place for the model to be wrong, which is the opposite of what you wanted.
Elision does neither. It reduces structurally: keep the signatures, headers, error lines and outliers, drop the predictable middle. Before it drops anything it caches the original under a hash, so the reduction is reversible. Where the bulk used to be, the agent sees a ⟦elided:<hash>⟧ marker, and it can pull the full text back the moment it actually needs it.
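The cache-then-drop idea can be sketched in a few lines. This is only an illustration, assuming an in-memory `Map` in place of the on-disk store and a keep-head-and-tail rule in place of the real reducers; `elide`, `expand` and `hashOf` are hypothetical names, not the API of lib/compress.cjs. The 12-character hash and the marker format follow the description on this page.

```javascript
const crypto = require('node:crypto');

// Hypothetical in-memory stand-in for the on-disk elision store.
const demoStore = new Map();

// 12-char SHA-256 prefix, as described on this page.
function hashOf(text) {
  return crypto.createHash('sha256').update(text, 'utf8').digest('hex').slice(0, 12);
}

// Keep the first and last `keep` lines, cache the full original,
// and leave a marker where the middle used to be.
function elide(text, keep = 2) {
  const lines = text.split('\n');
  if (lines.length <= keep * 2 + 1) return text; // nothing worth dropping
  const hash = hashOf(text);
  demoStore.set(hash, text); // cache BEFORE anything is dropped
  return [...lines.slice(0, keep), `⟦elided:${hash}⟧`, ...lines.slice(-keep)].join('\n');
}

// Resolve a marker hash back to the original text, or null if unknown.
function expand(hash) {
  return demoStore.has(hash) ? demoStore.get(hash) : null;
}

const original = Array.from({ length: 50 }, (_, i) => `line ${i + 1}`).join('\n');
const reduced = elide(original);
const marker = reduced.match(/⟦elided:([0-9a-f]{12})⟧/);
console.log(reduced.split('\n').length);     // 5 lines instead of 50
console.log(expand(marker[1]) === original); // true
```

Because the original is stored before the middle is dropped, a failed or skipped expansion only costs context, never correctness.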
The master switch
compression.enabled (default false) gates everything below it. While it is off nothing changes: verify logs fall back to plain tail truncation at verify_max_bytes, and prompts and tool results pass through byte-for-byte. Turn it on and the reducers run and start emitting elision markers. Each sub-lever (proxy, output_steering, cache_align) keeps its own toggle and stays off until you set it, so flipping the master switch on its own is a conservative change.
Where it bites: two paths
How far compression can reach depends on whose loop the agent runs in.
Off-host agents (in-loop)
openai-compat providers run their agentic loop inside nubos-pilot (lib/runtime/agent-loop.cjs). Because nubos-pilot owns the message array, it crushes each tool result before that result enters the history it re-sends every turn, so the saving compounds across the whole loop. This is the strongest of the levers, and on tool-heavy loops the reductions are large.
While compression is on, these agents also get a context-expand tool (lib/runtime/tools/index.cjs, EXPAND_TOOL_NAME). The model hands it the 12-char hash from a marker and gets the original back mid-loop, with no human in the path. A context-expand result is never re-compressed itself, so expansion always terminates.
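A rough sketch of the in-loop rule: crush every tool result before it enters the history, except the output of the expand tool itself. `appendToolResult` and `toyCrush` are hypothetical helpers written for this example, and the message shape assumes the usual openai-compat `role: 'tool'` message; the actual code in lib/runtime/agent-loop.cjs may look quite different.

```javascript
const EXPAND_TOOL_NAME = 'context-expand'; // tool name as given on this page

// Append one tool result to the history that is re-sent every turn.
// `crush` is whatever reducer the caller supplies (assumed signature: string -> string).
function appendToolResult(history, call, output, crush) {
  // Expansion output passes through untouched; otherwise the model could
  // expand a marker and get another marker back, without end.
  const content = call.name === EXPAND_TOOL_NAME ? output : crush(output);
  history.push({ role: 'tool', tool_call_id: call.id, content });
  return history;
}

// Toy reducer for the demo only: keep the first 40 characters of anything long.
const toyCrush = (text) => (text.length > 100 ? `${text.slice(0, 40)} [rest dropped]` : text);

const history = [];
const bigOutput = 'x'.repeat(5000);
appendToolResult(history, { id: 'call_1', name: 'grep' }, bigOutput, toyCrush);
appendToolResult(history, { id: 'call_2', name: EXPAND_TOOL_NAME }, bigOutput, toyCrush);
console.log(history[0].content.length < bigOutput.length); // true: crushed before entering history
console.log(history[1].content === bigOutput);             // true: expansion passes through intact
```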
Native claude -p spawns (proxy)
Native spawns run their loop inside the Claude CLI, where nubos-pilot can't see it. To reach that traffic, compression.proxy.enabled forks a localhost proxy (lib/elision-proxy.cjs), points the child's ANTHROPIC_BASE_URL at it, and crushes oversized tool_result blocks in each /v1/messages request on the way upstream. It touches only tool_result blocks; system, the tool definitions and ordinary text stay byte-identical. The rewrite is deterministic, so the compressed prefix stays stable and Anthropic prompt-caching keeps working.
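The request rewrite can be pictured roughly as follows. This is a sketch under assumptions, not the code in lib/elision-proxy.cjs: `crushToolResults` is a hypothetical name, the reducer is passed in as a plain function, and the size threshold is assumed to be the same 2048-byte `min_block_bytes` default described further down. The block shapes (`tool_result` with either string content or an array of text parts) follow the Anthropic Messages API.

```javascript
// Rewrite only oversized tool_result blocks in a /v1/messages request body.
// Everything else is returned as the same object it came in as.
function crushToolResults(body, crush, minBytes = 2048) {
  const oversized = (s) => Buffer.byteLength(s, 'utf8') >= minBytes;
  const messages = body.messages.map((msg) => {
    if (!Array.isArray(msg.content)) return msg; // plain string content: untouched
    const content = msg.content.map((block) => {
      if (block.type !== 'tool_result') return block; // ordinary text etc.: untouched
      if (typeof block.content === 'string') {
        return oversized(block.content) ? { ...block, content: crush(block.content) } : block;
      }
      if (Array.isArray(block.content)) {
        const parts = block.content.map((part) =>
          (part.type === 'text' && oversized(part.text) ? { ...part, text: crush(part.text) } : part));
        return { ...block, content: parts };
      }
      return block;
    });
    return { ...msg, content };
  });
  return { ...body, messages }; // system and tools are carried over by reference
}
```

As long as `crush` is a pure function of its input, two identical requests produce identical rewritten bodies, which is what keeps the compressed prefix cacheable.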
The reducers
lib/compress.cjs dispatches on the shape of the block. Each reducer keeps the lines that carry signal, drops the predictable bulk, and protects error, failure and outlier lines before anything else. A block under min_block_bytes (default 2048) is left byte-identical, since below that size the marker costs more than it saves.
| Block type | What is kept | What is elided |
|---|---|---|
| JSON arrays | head + tail items, any item with an error/outlier field | the homogeneous middle |
| Build / test logs | error, failure, stack-trace and warning lines | clean progress chatter |
| Grep / search output | a bounded sample per file, capped total | the long tail of matches |
| Unified diffs | hunk headers + a few context lines | wide unchanged context |
| C-family source | signatures, structure, throw/TODO/FIXME lines | statement bodies |
| Brace-light source (Python &c.) | module-level lines, block headers, raise/assert | deep, indented statement bodies |
| Prose | head + tail sentences, any directive/warning sentence | the sampled middle |
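To make one row of the table concrete, here is a minimal sketch of the JSON-array rule: keep the head and tail items plus anything carrying an `error` field, and count the rest as dropped. `reduceJsonArray` and its `head`/`tail` options are hypothetical, and outlier detection is left out entirely; the real reducer in lib/compress.cjs may choose differently.

```javascript
// Keep head + tail items and any item with an `error` field; drop the homogeneous middle.
function reduceJsonArray(items, { head = 2, tail = 2 } = {}) {
  const kept = [];
  let dropped = 0;
  items.forEach((item, i) => {
    const isEdge = i < head || i >= items.length - tail;
    const isFlagged = item !== null && typeof item === 'object' && 'error' in item;
    if (isEdge || isFlagged) kept.push(item);
    else dropped += 1;
  });
  return { kept, dropped };
}

// 100 near-identical rows, one of which failed.
const rows = Array.from({ length: 100 }, (_, i) =>
  (i === 50 ? { id: i, error: 'boom' } : { id: i }));
const result = reduceJsonArray(rows);
console.log(result.kept.map((r) => r.id)); // [ 0, 1, 50, 98, 99 ]
console.log(result.dropped);               // 95
```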
Brace counting for C-family code masks string literals and // … /* */ comments first, so braces inside them never throw off the structure detection.
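The masking step might look something like the sketch below. It is an illustration only: `maskStringsAndComments` and `braceDepthDelta` are hypothetical names, and the scanner is deliberately simple (it treats template literals as plain strings and knows nothing about regex literals). Masked characters become spaces so that offsets and line breaks stay where they were.

```javascript
// Replace the contents of string literals and comments with spaces so that
// braces inside them are invisible to a later depth count.
function maskStringsAndComments(src) {
  let out = '';
  let i = 0;
  while (i < src.length) {
    const two = src.slice(i, i + 2);
    if (two === '//') {
      const nl = src.indexOf('\n', i);
      const stop = nl === -1 ? src.length : nl; // the line break itself is kept
      out += ' '.repeat(stop - i);
      i = stop;
    } else if (two === '/*') {
      const end = src.indexOf('*/', i + 2);
      const stop = end === -1 ? src.length : end + 2;
      out += src.slice(i, stop).replace(/[^\n]/g, ' ');
      i = stop;
    } else if (src[i] === '"' || src[i] === "'" || src[i] === '`') {
      const quote = src[i];
      let j = i + 1;
      while (j < src.length && src[j] !== quote) {
        j += src[j] === '\\' ? 2 : 1; // step over escaped characters
      }
      const stop = Math.min(j + 1, src.length);
      out += src.slice(i, stop).replace(/[^\n]/g, ' ');
      i = stop;
    } else {
      out += src[i];
      i += 1;
    }
  }
  return out;
}

// Net change in brace depth across a chunk of source, ignoring masked regions.
function braceDepthDelta(src) {
  let depth = 0;
  for (const ch of maskStringsAndComments(src)) {
    if (ch === '{') depth += 1;
    else if (ch === '}') depth -= 1;
  }
  return depth;
}

console.log(braceDepthDelta('function f() { return "}"; // }\n}')); // 0
console.log(braceDepthDelta('if (x) { /* { */'));                   // 1
```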
The elision store
lib/elision.cjs caches each dropped original at .nubos-pilot/elision/<hash>.json. The filename is the 12-char SHA-256 prefix, the file is validated against the elision-entry.v1 schema, and each write is atomic and goes through a file lock. Entries carry a TTL (compression.elision.ttl_ms, default 30 min); once it passes, retrieve reports expired and prune clears the entry lazily.
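The TTL behaviour can be sketched with an in-memory store and an injectable clock. Everything here is an assumption made for illustration: `createStore`, the `ok` and `missing` status names, and the entry fields are hypothetical, and the real store writes schema-validated JSON files under a lock rather than keeping a `Map`. Only the `expired` status, the lazy prune and the 30-minute default come from this page.

```javascript
const crypto = require('node:crypto');

// In-memory stand-in for the elision store, with a clock that tests can control.
function createStore({ ttlMs = 30 * 60 * 1000, now = Date.now } = {}) {
  const entries = new Map();
  const isExpired = (entry) => now() - entry.storedAt > ttlMs;
  return {
    put(original) {
      const hash = crypto.createHash('sha256').update(original, 'utf8').digest('hex').slice(0, 12);
      entries.set(hash, { original, storedAt: now() });
      return hash;
    },
    // Reports expiry but does not delete: cleanup is left to prune().
    retrieve(hash) {
      const entry = entries.get(hash);
      if (!entry) return { status: 'missing' };
      if (isExpired(entry)) return { status: 'expired' };
      return { status: 'ok', original: entry.original };
    },
    // Removes every expired entry and returns how many were cleared.
    prune() {
      let removed = 0;
      for (const [hash, entry] of entries) {
        if (isExpired(entry)) {
          entries.delete(hash);
          removed += 1;
        }
      }
      return removed;
    },
  };
}

let fakeClock = 0;
const ttlStore = createStore({ ttlMs: 1000, now: () => fakeClock });
const storedHash = ttlStore.put('full log text');
console.log(ttlStore.retrieve(storedHash).status); // ok
fakeClock = 1001;
console.log(ttlStore.retrieve(storedHash).status); // expired
```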
There are two ways back to the original:
```bash
# CLI: recover the text behind a marker
node .nubos-pilot/bin/np-tools.cjs elision-get <hash>          # raw original to stdout
node .nubos-pilot/bin/np-tools.cjs elision-get <hash> --json   # status envelope
```

…and the in-loop context-expand tool described above. Setting compression.elision.enabled to false crushes blocks without caching them, which makes the drop irreversible; don't, unless you have a specific reason to.
Output steering: the other direction
lib/output-steering.cjs works the other way, cutting the tokens the model writes back (compression.output_steering.*). There are two independent levers, both safe with the prompt cache.
- Verbosity steering adds a fixed terseness directive to the end of the system prompt, inside an idempotent `<nubos_output_shaping>` block so re-enrichment never stacks. Because it lands after the cached prefix, prompt-caching is unaffected. The profiles get progressively tighter: `balanced` (a no-op), `concise`, `terse`, `minimal`.
- Effort routing looks at each off-host turn structurally. A fresh user ask, or a turn recovering from a tool error, keeps full effort (`base_effort`); a turn that only continues from clean `tool_result`s counts as mechanical and is downgraded to `mechanical_effort`. The value reaches the provider as the openai-compat `reasoning_effort` field, and that field goes out only when you pin a concrete `base_effort`, which is how you tell nubos-pilot the provider supports it. Leave it unpinned and the lever does nothing and the field is never sent, so a provider without effort support sees no change. Single-shot native spawns have no multi-turn loop, so routing doesn't apply to them.
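Both levers are small enough to sketch. The following is a guess at the shape, not the contents of lib/output-steering.cjs: the directive wording, the closing tag, the `isError` flag on tool messages and the function names `applyVerbosity`, `isMechanicalTurn` and `effortFields` are all assumptions. Only the profile names, the `<nubos_output_shaping>` tag and the `reasoning_effort` field come from this page.

```javascript
// Directive wording is hypothetical; only the profile names come from this page.
const DIRECTIVES = {
  concise: 'Answer concisely.',
  terse: 'Answer tersely; no preamble.',
  minimal: 'Answer with the minimum text needed.',
};
const SHAPING_BLOCK = /\n?<nubos_output_shaping>[\s\S]*?<\/nubos_output_shaping>/g;

// Append the directive to the END of the system prompt, replacing any earlier block.
function applyVerbosity(systemPrompt, profile) {
  const base = systemPrompt.replace(SHAPING_BLOCK, ''); // strip first so it never stacks
  const directive = DIRECTIVES[profile];
  if (!directive) return base; // `balanced` is a no-op
  return `${base}\n<nubos_output_shaping>${directive}</nubos_output_shaping>`;
}

// A turn is mechanical when the history ends in one or more clean tool results.
function isMechanicalTurn(messages) {
  let sawTool = false;
  for (let i = messages.length - 1; i >= 0 && messages[i].role === 'tool'; i -= 1) {
    if (messages[i].isError) return false; // recovering from a tool error: full effort
    sawTool = true;
  }
  return sawTool;
}

// Extra request fields for one off-host turn.
function effortFields(messages, { baseEffort, mechanicalEffort } = {}) {
  if (!baseEffort) return {}; // unpinned: the field is never sent
  const effort = isMechanicalTurn(messages) && mechanicalEffort ? mechanicalEffort : baseEffort;
  return { reasoning_effort: effort };
}

const prompt = applyVerbosity('You are a reviewer.', 'terse');
console.log(applyVerbosity(prompt, 'terse') === prompt); // true: re-enrichment does not stack
```

Stripping before appending is what makes the verbosity lever idempotent, and returning an empty object when `base_effort` is unpinned is what keeps providers without effort support unaffected.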
Cache alignment: proxy path only
lib/cache-align.cjs (compression.cache_align.*) keeps the native-spawn prompt cacheable. Both levers act on the /v1/messages body the proxy already sees, and both are idempotent.
- Detect never mutates anything. It scans the system prompt for cache-volatile tokens (a UUID, an ISO-8601 datetime, a JWT, a long hex hash) and logs one warning per kind, so when a prefix keeps missing the cache you can see why instead of guessing.
- Normalize does mutate, and is opt-in. It gives the tool-definition prefix a stable byte layout (tools sorted by name, schema keys sorted recursively, array order left alone) and adds a single `cache_control: {type: "ephemeral"}` breakpoint on the last tool. If the caller already set any `cache_control` on `system` or `tools`, it leaves everything as-is, so a deliberate cache layout is never disturbed.
The rest of compression.* keeps prefixes byte-identical. This lever is the exception: it rewrites the tool prefix on purpose to make it canonical, which is why it sits behind its own toggle.
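A minimal sketch of the normalize lever, assuming the Anthropic request shape (`tools[].name`, `tools[].input_schema`, `cache_control` on blocks). `sortKeys`, `hasCacheControl` and `normalizeTools` are hypothetical names and lib/cache-align.cjs may differ in detail; the sketch only shows how the rules in the list above fit together.

```javascript
// Recursively sort object keys; arrays keep their order.
function sortKeys(value) {
  if (Array.isArray(value)) return value.map(sortKeys);
  if (value !== null && typeof value === 'object') {
    const out = {};
    for (const key of Object.keys(value).sort()) out[key] = sortKeys(value[key]);
    return out;
  }
  return value;
}

// True when any block in the list already carries a cache_control field.
function hasCacheControl(blocks) {
  return Array.isArray(blocks)
    && blocks.some((b) => b !== null && typeof b === 'object' && 'cache_control' in b);
}

function normalizeTools(body) {
  if (!Array.isArray(body.tools) || body.tools.length === 0) return body;
  // A caller-defined cache layout is left exactly as it is.
  if (hasCacheControl(body.system) || hasCacheControl(body.tools)) return body;
  const tools = body.tools
    .map((tool) => (tool.input_schema ? { ...tool, input_schema: sortKeys(tool.input_schema) } : tool))
    .sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
  // One breakpoint, on the last tool of the now-canonical list.
  tools[tools.length - 1] = { ...tools[tools.length - 1], cache_control: { type: 'ephemeral' } };
  return { ...body, tools };
}
```

A second pass sees the breakpoint it added on the first pass and returns the body unchanged, which is one simple way to get the idempotence the section describes.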
Swarm input dedup
With compression.enabled and compression.elision.enabled both on, the Researcher-Schwarm stops re-inlining its identical k-spawn input. It caches the shared payload once, and each spawn spec then carries an input_ref hash instead of the same blob repeated k times.
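The dedup amounts to storing the shared payload once and handing each spawn a reference. A small sketch, assuming a `Map` as the cache and a hypothetical `buildSpawnSpecs` helper with a made-up `role` field; only the `input_ref` field name comes from this page.

```javascript
const crypto = require('node:crypto');

// Cache the shared input once and give every spawn spec a hash reference to it.
function buildSpawnSpecs(sharedInput, roles, cache) {
  const hash = crypto.createHash('sha256').update(sharedInput, 'utf8').digest('hex').slice(0, 12);
  cache.set(hash, sharedInput); // stored once, however many spawns reference it
  return roles.map((role) => ({ role, input_ref: hash }));
}

const sharedCache = new Map();
const sharedInput = 'research brief '.repeat(1000);
const specs = buildSpawnSpecs(sharedInput, ['a', 'b', 'c'], sharedCache);
console.log(specs.length);     // 3
console.log(sharedCache.size); // 1
```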
Measuring it with elision-bench
elision-bench is how you check the reducers are actually behaving. It runs them over a fixture corpus and reports three things, all deterministic and with no model in the loop:
- ratio — how much smaller the output got;
- critical-line preservation — whether every error, failure and outlier line survived;
- byte-exact reversibility — whether every marker resolves back to the original, byte for byte.
```bash
# deterministic fidelity over the fixtures
node .nubos-pilot/bin/np-tools.cjs elision-bench
# scaled corpus + holdout control, with an estimated token (and cost) saving
node .nubos-pilot/bin/np-tools.cjs elision-bench --size large \
  --price-per-mtok 3 --currency EUR
# add answer-equivalence: raw vs compressed through a real model
node .nubos-pilot/bin/np-tools.cjs elision-bench --with-model --tier sonnet
```

It doubles as the regression net for the reducers. It has already caught two bugs in lib/compress.cjs that the per-block fidelity checks on their own would have let through.
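For intuition, two of the three checks can be approximated in a few lines. This is not how elision-bench is implemented; the definition of a critical line (the `CRITICAL` pattern), the choice of reduced bytes over original bytes as the ratio, and the function names are all assumptions made for the example.

```javascript
// Hypothetical definition of a critical line; the real bench may use a different one.
const CRITICAL = /\b(error|fail(ed|ure)?|exception)\b/i;

// Size of the reduced text relative to the original, in UTF-8 bytes (lower is smaller).
function ratio(original, reduced) {
  return Buffer.byteLength(reduced, 'utf8') / Buffer.byteLength(original, 'utf8');
}

// True when every critical line of the original still appears verbatim in the reduced text.
function criticalLinesPreserved(original, reduced) {
  const kept = new Set(reduced.split('\n'));
  return original.split('\n').filter((line) => CRITICAL.test(line)).every((line) => kept.has(line));
}

const rawLog = ['step 1 ok', 'step 2 ok', 'ERROR: disk full', 'step 3 ok'].join('\n');
const goodReduction = 'step 1 ok\nERROR: disk full';
const badReduction = 'step 1 ok\nstep 3 ok';
console.log(criticalLinesPreserved(rawLog, goodReduction)); // true
console.log(criticalLinesPreserved(rawLog, badReduction));  // false
console.log(ratio(rawLog, goodReduction) < 1);              // true
```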
Related
- ADR-0022: the decision this page describes, with the options weighed and the consequences accepted.
- Configuration § `compression.*`: the complete key-by-key reference.
- Runtimes: off-host openai-compat vs. native `claude -p`, the split that decides which path applies.
- Researcher-Schwarm: consumes the k-spawn input dedup.
- Data-Schema Validation: `elision-entry.v1`, the on-disk store schema.
