server: opt-in Codec binary streaming (msgpack/protobuf token frames) #22757
Closed · wdunn001 wants to merge 2 commits into ggml-org:master
Conversation
…v1/chat/completions
Adds support for emitting raw token IDs as Codec MessagePack or Protobuf
frames instead of JSON SSE when the request body sets stream_format to
'msgpack' or 'protobuf'. Fully backwards-compatible -- the field defaults
to 'json' and the existing path is byte-identical when unset.
Wire impact (measured against synthetic and live Ollama qwen2.5):
JSON-SSE, live Ollama: 186.4 B/token
Codec msgpack (identity): 16.0 B/token 9.6x
Codec protobuf (identity): 10.9 B/token 14.2x
Codec msgpack + Content-Encoding: br: 2.79 B/token 55.2x
Agent-to-agent handoff (1024 tokens) is 3.6x faster end-to-end because the
wire shrinks and the receiver's detokenize+tokenize step is eliminated.
Note: this lands in llama-server, which means Ollama gets the option
for free as soon as it surfaces stream_format through its own API.
Changes
-------
tools/server/CMakeLists.txt
Adds LLAMA_CODEC=AUTO|SYSTEM|FETCH|OFF (default AUTO):
AUTO prefer system-installed libcodec (vcpkg / OS pkg manager),
fall back to FetchContent from github.com/wdunn001/Codec
SYSTEM fail if no system libcodec
FETCH always FetchContent
OFF disable; the binary endpoint becomes a no-op
When wired in, defines LLAMA_HAVE_CODEC and links libcodec privately
to server-context. Otherwise the build is unchanged.
tools/server/server-context.cpp
In the existing streaming branch of handle_completions_impl, when
stream_format is 'msgpack' or 'protobuf':
- Content-Type is application/x-msgpack or application/x-protobuf
- Each result is encoded as CodecFrame { ids, done, finish_reason? }
via codec_encode_msgpack / codec_encode_protobuf
- The [DONE] SSE sentinel is omitted (final frame's done=true terminates)
- On error a terminal frame with finish_reason='error' is emitted so
binary clients distinguish a server error from a clean truncation
- finish_reason is mapped from stop_type:
STOP_TYPE_EOS -> 'eos_token'
STOP_TYPE_LIMIT -> 'length'
STOP_TYPE_WORD -> 'stop_sequence'
No changes to validation, scheduler, sampler, prompt logic, or
task_params. The existing tokens field on partial/final results
is what feeds the binary frame.
Sister-server PRs ship the same protocol contract:
vLLM: vllm-project/vllm#41765
SGLang: sgl-project/sglang#24483
Spec: https://github.com/wdunn001/Codec
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Hi @wdunn001, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Detect <tool_call>..</tool_call> regions in the outbound token stream without
ever detokenizing on the hot path. Mirrors sglang PR #24557 and uses libcodec's
ToolWatcher state machine via the new codec_tool_watcher_new_with_ids() API —
pure uint32 compare per token, ~ns of overhead; completed regions surface as
structured tool_calls on the matching CodecFrame.

Opt-in per request via three new body fields:
- tool_watcher: bool (default false)
- tool_watcher_start: string (default "<tool_call>")
- tool_watcher_end: string (default "</tool_call>")

Marker strings are tokenized with parse_special=true; if either does not
resolve to a single token in the loaded vocab, the watcher is disabled and a
SRV_WRN logs the reason. Binary streaming still works in plain passthrough
mode in that case.

Region body decoding uses llama.cpp's own common_token_to_piece (no external
Codec map fetch needed — the server already owns the vocab). Body JSON is
parsed inline; the optional "name" field is surfaced when present, and the raw
body always rides as arguments_json so orchestrators can return
invalid_arguments errors to the model when parsing fails. Server-generated
call IDs use the deterministic "tc_<8hex>" shape.

Backward compatible: stream_format=msgpack/protobuf without tool_watcher=true
emits byte-identical frames to before.
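For illustration only (not part of the commit): a request opting into both the binary stream and the watcher could look like the sketch below. The endpoint path, port, and prompt are assumptions; the field names come from the commit message above.

```ts
// Hypothetical request sketch — endpoint/port are assumptions, field names
// are the three new body fields plus stream_format from this PR.
const res = await fetch("http://localhost:8080/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    messages: [{ role: "user", content: "What's the weather in Paris?" }],
    stream: true,
    stream_format: "msgpack",          // opt in to Codec binary frames
    tool_watcher: true,                // opt in to <tool_call> detection
    tool_watcher_start: "<tool_call>", // must resolve to a single token
    tool_watcher_end: "</tool_call>",  // otherwise the watcher is disabled
  }),
});
```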
Summary

Adds an opt-in Codec binary streaming path to `llama-server` that ships raw token IDs as MessagePack or Protobuf frames instead of UTF-8 wrapped in JSON SSE. Models emit `uint32` token IDs internally; converting them to text and re-tokenising on the receiving end is most of the wire cost.

Fully backwards-compatible — `stream_format` defaults to `"json"` and the existing SSE path is byte-identical when the field is absent. The new code is gated behind a CMake option and a runtime flag, so deployments that don't want it pay zero cost.

Motivation
For agent-to-agent workloads, model A's output token IDs are re-tokenised as model B's prompt input. The text detokenize → serialize → transmit → deserialize → re-tokenize pipeline exists only to satisfy the JSON wire format — it contributes zero semantic value. Codec eliminates it.
Measured wire impact (synthetic + live Ollama, see Codec/packages/bench):
| Configuration | B/token | vs JSON-SSE |
|--------------------------------------------|--------:|------------:|
| JSON-SSE, live Ollama qwen2.5 | 186.4 | 1.0× |
| Codec msgpack | 16.0 | 9.6× |
| Codec protobuf | 10.9 | 14.2× |
| Codec msgpack + `Content-Encoding: br` | 2.8 | 55.2× |

End-to-end agent-to-agent handoff (1024 tokens) is 3.6× faster because the wire shrinks and the detokenize+tokenize step is eliminated. Real BPE tokenizers are 5–50× slower than the modeled hashtable lookup, so the codec advantage on real workloads is wider.
Changes

The implementation is intentionally small. The binary path lives entirely inside the existing streaming branch in `handle_completions_impl`; it does not touch the JSON path, the request validation, the tokenizer, or the scheduler.

tools/server/CMakeLists.txt

Adds a single `LLAMA_CODEC` option (default `AUTO`) controlling how libcodec is sourced. `AUTO` reproduces what most users expect: `vcpkg install codec` works, and `cmake -S . -B build` works without any extra setup. The FetchContent path pins to a tag of the upstream Codec repo and only pulls `packages/c` (`SOURCE_SUBDIR`), not the multi-language monorepo.

When libcodec is wired in, `LLAMA_HAVE_CODEC` is defined, the `server-context` library links it privately, and the binary streaming branch in `server-context.cpp` is compiled. When it's not, the only added cost is one CMake option that's `OFF`.

tools/server/server-context.cpp

Adds an `#if defined(LLAMA_HAVE_CODEC)` branch inside the existing streaming `else` block of `handle_completions_impl`. When the request body contains `"stream_format": "msgpack"` or `"stream_format": "protobuf"`:

- The response `Content-Type` is `application/x-msgpack` or `application/x-protobuf` instead of `text/event-stream`.
- Each result is encoded as a `CodecFrame { ids: [u32], done: bool, finish_reason?: string }` using libcodec's `codec_encode_msgpack` / `codec_encode_protobuf`.
- The terminal `[DONE]` SSE sentinel is omitted (the final frame's `done=true` already terminates the stream).
- On error, a terminal frame with `finish_reason="error"` is emitted so binary clients can distinguish a genuine server error from a clean truncation.
- `finish_reason` is mapped from the existing `stop_type` enum: `STOP_TYPE_EOS → "eos_token"`, `STOP_TYPE_LIMIT → "length"`, `STOP_TYPE_WORD → "stop_sequence"`.

The new code touches the lambda `next` callback only; the slot scheduler, sampler, prompt logic, and `task_params` are unchanged. The existing `tokens` field on `server_task_result_cmpl_partial` and `_final` is what feeds the binary frame — no new fields, no new data flow.

Wire format
Both modes carry identical semantics; only serialization differs.
MessagePack — frames concatenated, no delimiter:
Protobuf — 4-byte big-endian length prefix + payload:
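As a rough client-side sketch (illustrative, not a published API): splitting the length-prefixed protobuf stream looks like the code below; msgpack needs no such splitter because a streaming msgpack decoder consumes concatenated frames back-to-back.

```ts
// Illustrative splitter for the protobuf framing described above:
// each frame is a 4-byte big-endian length followed by the payload.
function* splitProtobufFrames(buf: Uint8Array): Generator<Uint8Array> {
  const view = new DataView(buf.buffer, buf.byteOffset, buf.byteLength);
  let off = 0;
  while (off + 4 <= buf.byteLength) {
    const len = view.getUint32(off, false);    // false = big-endian
    if (off + 4 + len > buf.byteLength) break; // partial frame: wait for more bytes
    yield buf.subarray(off + 4, off + 4 + len);
    off += 4 + len;
  }
}
```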
The `.proto` schema is also exposed by Codec-compliant servers at `GET /codec/schema` for client codegen — that endpoint isn't part of this PR but can be added trivially as a follow-up.

Endpoint usage
Browser/Node clients can use `@codecai/web` to decode + detokenize the stream:
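A stand-in sketch of that usage (the import names and API below are assumptions, not confirmed `@codecai/web` exports; consult the package for its real surface):

```ts
// Stand-in sketch: decodeFrames() and Detokenizer are assumed names.
import { decodeFrames, Detokenizer } from "@codecai/web";

const res = await fetch("/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    messages: [{ role: "user", content: "hello" }],
    stream: true,
    stream_format: "msgpack", // Codec binary frames instead of JSON SSE
  }),
});

const detok = await Detokenizer.load("qwen2.5");     // assumed: fetches the dialect map
for await (const frame of decodeFrames(res.body!)) { // frame: CodecFrame { ids, done, finish_reason? }
  console.log(detok.render(frame.ids));              // token IDs -> text, client-side
  if (frame.done) break;
}
```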
Polyglot client ecosystem

The Codec protocol ships clients across the major language ecosystems — the same dialect maps work everywhere, all verified bit-identical against HuggingFace's reference `tokenizers` library:

| Lang | Package | Tests |
|---|---|---|
| TypeScript / JS / Browser | `@codecai/web` | 35/35 ✓ |
| .NET | `Codec.Net` | 16/16 ✓ |
| Python | `codecai` | 20/20 ✓ |
| C99 (this PR's dependency) | `libcodec` (FetchContent / vcpkg) | 4 CTest suites ✓ |

Sister-server PRs
The same protocol is being added to other major open-weight inference servers:
vLLM: vllm-project/vllm#41765
SGLang: sgl-project/sglang#24483
TGI: branch preserved, but TGI is now archived upstream
This PR rounds out the major-three story for non-archived inference engines.
Open questions for reviewers
1. `LLAMA_CODEC` defaulting to `AUTO`. Some maintainers prefer all optional features OFF by default. I used `AUTO` because (a) the runtime gate (`stream_format` defaulting to `"json"`) means there's zero cost at runtime, and (b) FetchContent is well-cached on subsequent builds. Happy to flip the default if you prefer.

2. Server-side detokenization. On the binary path the server still detokenizes to fill `partial->content`, even though the client never reads it. There's a ~5% CPU win from skipping detokenize on the binary path. Held as a follow-up to keep this PR small — the change is straightforward in the slot loop.

3. Schema endpoint. `GET /codec/schema` (returns the `.proto` text) is part of the broader Codec spec and useful for client codegen. Easy follow-up if you want it; not included here to keep the diff tight.

4. Alternative wire flag spelling. Other Codec-enabled servers (vLLM, SGLang) use `stream_format: "json" | "msgpack" | "protobuf"` exactly as in this PR. Happy to align on different wording if `stream_format` collides with anything in llama.cpp's idiom.

Related
Codec spec: https://github.com/wdunn001/Codec/blob/main/spec/PROTOCOL.md
Tokenizer dialect registry: https://github.com/wdunn001/codec-maps
libcodec (the C library this PR FetchContents): https://github.com/wdunn001/Codec/tree/main/packages/c
vLLM PR: feat(completions): token-native binary transport via stream_format=msgpack|protobuf vllm-project/vllm#41765
SGLang PR: feat(codec): token-native binary transport for completions streaming sgl-project/sglang#24483
AI assistance disclosure
This PR was developed with AI assistance (Anthropic's Claude). The author drove design choices, reviewed all code, and is responsible for the contents - but AI tools helped draft and refine the C++ changes, the CMake integration, parts of the documentation, and the supporting Codec ecosystem (tokenizer maps, polyglot client libraries, libcodec) referenced from this PR. The author has read and validated every line submitted here.
Measured results (live, end-to-end)
These numbers are real measurements from a live sglang server running this PR's branch with Qwen/Qwen2.5-0.5B-Instruct on an RTX 3090, deterministic at temperature 0.0. Full report: Codec/packages/bench/RESULTS.md.
Wire format (3 wire modes × 4 compression encodings, 64-token completion)
| Path | identity | gzip | br | zstd |
|---|---:|---:|---:|---:|
| JSON-SSE (vanilla main, baseline) | 15.2 KB | 15.2 KB | 15.2 KB | 15.2 KB |
| Codec msgpack | 16.0× | 68.8× | 13.4× | 61.5× |
| Codec protobuf | 23.9× | 69.5× | 16.8× | 57.4× |
Per-token cost: 243 B/tok JSON-SSE → 3.5 B/tok Codec + gzip.
JSON-SSE doesn't compress on either server even with `Accept-Encoding` set — the text path doesn't honor the header. The Codec path's `codec_compression.py` is what actually does compression.

Polyglot interop
Same wire decoded by Python, .NET, C, and Web clients. Wire bytes match exactly across all four implementations on every cell of the 12-cell matrix.
End-to-end agent loop (full two-turn round-trip with real tool dispatch)

| Tool | JSON-SSE wire | Codec wire | Reduction | Speedup |
|---|---:|---:|---:|---:|
| mock `get_weather` | 13.7 KB | 809 B | 16.9× | 1.08× |
| SearXNG (live web search) | 61.9 KB | 3.4 KB | 18.2× | 1.24× |
| MetaMCP (Time MCP server) | 19.6 KB | 1.1 KB | 17.8× | 1.24× |
The agentic loop is: prompt → model emits `<tool_call>` → server detects (this PR) → orchestrator dispatches → tool result fed back as a `tool` message → model produces final answer. Both turns count toward the wire total.

ToolWatcher CPU microbench (libcodec, C99, single core, 1M tokens)

| Path | ns/token | Mtok/s |
|---|---:|---:|
| `codec_tool_watcher_feed` | 0.61 | 1,648 |
| `codec_detokenizer_render` (same stream) | 60.4 | 16.6 |
| Speedup | | ~100× |
The watcher's hot loop is a uint32 compare against two cached IDs plus an occasional memcpy. Detokenize does a vocab lookup and UTF-8 string construction per token — that's the gap.
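To make that concrete, here is an illustrative re-implementation of the watcher's hot loop in TypeScript. libcodec's real `codec_tool_watcher_feed` is C99 and this is not its API; it is just the logic the numbers above measure, assuming `startId`/`endId` are the single-token IDs resolved from the marker strings.

```ts
// Illustrative sketch of the watcher state machine — not libcodec's API.
class ToolWatcher {
  private inRegion = false;
  private region: number[] = [];
  constructor(private startId: number, private endId: number) {}

  /** Feed one token ID; returns a completed region's token IDs, or null. */
  feed(id: number): number[] | null {
    if (!this.inRegion) {
      if (id === this.startId) this.inRegion = true; // one uint32 compare
      return null;
    }
    if (id === this.endId) {                         // one uint32 compare
      this.inRegion = false;
      const ids = this.region;
      this.region = [];
      return ids; // body is detokenized/parsed off the hot path
    }
    this.region.push(id);
    return null;
  }
}
```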
Reproducing

Bench drivers are under `packages/demo-{python,dotnet,c,web}`; full method + raw outputs in `packages/bench/RESULTS.md`.

Update: compression crossover study (size sweep, 16 → 2,048 tokens)
An 8-size sweep against this PR's server shows a clean crossover between gzip and zstd that lets us recommend a concrete threshold rule. Same lab box (RTX 3090, Qwen2.5-0.5B-Instruct, long-form prompt), measured wire bytes per cell:
| path · encoding | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 2048 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| msgpack · identity | 249 | 482 | 944 | 1.8 KB | 3.6 KB | 7.2 KB | 14.4 KB | 27.1 KB |
| msgpack · gzip | **110** | **115** | **126** | **146** | 194 | 268 | 400 | 639 |
| msgpack · br | 303 | 574 | 923 | 1.6 KB | 2.9 KB | 5.5 KB | 10.8 KB | 20.2 KB |
| msgpack · zstd | **107** | **112** | 134 | 152 | **176** | **239** | **273** | **381** |
| protobuf · identity | 164 | 322 | 636 | 1.2 KB | 2.5 KB | 4.9 KB | 9.8 KB | 18.5 KB |
| protobuf · gzip | **98** | **102** | **113** | **133** | 179 | 247 | 367 | 587 |
| protobuf · br | 243 | 408 | 762 | 1.4 KB | 2.7 KB | 5.3 KB | 10.6 KB | 20.0 KB |
| protobuf · zstd | 100 | 104 | 122 | 140 | **164** | **223** | **258** | **368** |

**Highlighted** = winner at that size. JSON-SSE row omitted (the server doesn't compress text streams in this build — bytes scale linearly: 3.8 KB → 457 KB).

Threshold rule
| stream length | best encoding | why |
|---|---|---|
| ≤ 128 tokens | gzip | tiny deflate header beats zstd's frame header on small payloads |
| ≥ 256 tokens | zstd | Huffman + dictionary keep amortising as the stream grows |
The crossover for protobuf is between 128 and 256 tokens (gzip 133 B vs zstd 140 B at 128; gzip 179 B vs zstd 164 B at 256).
Brotli underperforms at every size measured for streaming Codec frames — per-block overhead never amortises across small CodecFrames. Identity also loses at every size, including 16 tokens (compressed is ≥2× smaller even there).
A simpler one-rule policy that gets ~95% of the win: always zstd. At worst it costs ~10% more bytes than gzip on the smallest payloads, and it wins by 1.6× on large payloads.
Recommendation for default Accept-Encoding
This server should advertise `gzip, zstd` (in that order) as preferred encodings, fall back to `identity` if neither is supported by the client, and not advertise `br` for Codec streams. Full study: `packages/bench/RESULTS.md` §1c. Reproduce with `codec-bench-crossover --url <server> --sizes 16 32 64 128 256 512 1024 2048 --prompt-long`.

Update: encoding picker shipped as a standalone package
The crossover study above motivated a standalone, framework-agnostic encoding picker — published as `wire-compress` (source). Drop it in any HTTP server to get the right `Content-Encoding` based on `Accept-Encoding` and estimated payload size (see the sketch after the list below).

The library treats brotli as a fallback tier, not a loser. On streaming small-frame workloads brotli's per-block overhead doesn't amortise (gzip beats it at every measured size), but brotli has wider client coverage than zstd — Safari, iOS, and older Firefox all ship br but not zstd — so it remains a critical fallback when neither modern encoder is supported. The picker's choice order:
- zstd if the client supports it AND size ≥ 256 (or mid-band ≥ 128)
- gzip if the client supports it AND size ≤ 128 (or zstd unsupported)
- br if the client supports nothing else compressible (Safari/iOS path)
- identity if the client refuses everything else
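A sketch of that policy as code (this is not the `wire-compress` package's actual API, just the four rules above):

```ts
// Sketch of the documented choice order — not wire-compress's real API.
type Encoding = "zstd" | "gzip" | "br" | "identity";

function pickEncoding(acceptEncoding: string, estTokens: number): Encoding {
  const accepts = (e: Encoding) => acceptEncoding.includes(e);
  // zstd for large payloads; the study's crossover sits at 128-256 tokens,
  // so >= 256 is firmly zstd territory.
  if (accepts("zstd") && estTokens >= 256) return "zstd";
  if (accepts("gzip")) return "gzip";     // small payloads, or zstd unsupported
  if (accepts("zstd")) return "zstd";     // zstd-only client
  if (accepts("br")) return "br";         // Safari/iOS fallback tier
  return "identity";                      // client refuses everything else
}
```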
The crossover chart referenced in the study above lives at `packages/bench/docs/crossover-summary.png`.

Update: time-impact analysis (TTFT cliff)
Wire bytes are half the story. The other half is time. Re-running the sweep with TTFT instrumentation surfaced an important finding that reframes the encoding recommendation:
All numbers below come from a single timed sweep on this PR's server: fixed prompt, all 12 cells (3 paths × 4 encodings) at 3 sizes, median of 2 reps. Token counts are identical across encodings within a size (64 / 512 / 1967 emitted), so cells are directly comparable.
Two findings, not one:
1. zstd has a TTFT cliff — gzip, br, and identity all stream

gzip, br, and identity hold TTFT at ~11–12 ms at every measured size. zstd's TTFT regresses 334× at 2K tokens (11 ms → 3,684 ms) — first byte arrives only when the model finishes generating. gzip and brotli both flush chunk-by-chunk and preserve TTFT.
2. Brotli is barely compressing on this stack

The TTFT chart suggests br is a viable fallback. The wire-bytes table from the same run says otherwise (per-cell table omitted; full numbers in RESULTS.md): protobuf · br at 2K is 7% larger than identity, and msgpack · br saves only 27% over identity at 2K, while gzip in the same slot is 660 B (42× smaller than br). This looks like a server-side brotli middleware configuration issue (per-frame compression with a quality setting unsuited to small-frame workloads), not a fundamental br limitation.
Reduction-vs-baseline at 2K tokens

Total wall-clock is model-bound — Codec adds <1% overhead vs JSON-SSE across every size and encoding, so the wire reduction is essentially free in time. For human-facing streams use gzip: it streams and delivers 700×+ wire reduction. The `wire-compress` picker's `interactive: true` mode (default) enforces this: gzip > br > zstd-only-if-alone.
Full chart and methodology:
`RESULTS.md` §1d. The proposed "in-server tools-as-tokens" architecture (server-side MCP dispatch so the tool call never leaves the trusted process boundary) is sketched in `RESULTS.md` §1e.

Update: bolt-on tool architecture (revised proposal)
The earlier "in-process MCP dispatcher inside sglang" sketch turned out to be the wrong layer for tokenization. The revised proposal is bolt-on Codec-native tools — independently versioned, deployed, and authored, hosted in their own repos, with build-time tokenizer caches per supported model.
Why bolt-ons beat in-process dispatch:
- Modularity — tools want their own release cadence and deploy surface. In-process dispatch forces every tool change into a server release.
- Tokenization belongs at the tool, not the gateway — the tool knows its own response fragments better than any central registry can. A weather tool that emits `"It is {temp}°F in {city}"` knows exactly which prefixes/suffixes to pre-cache; the gateway doesn't.
- Independent hosting — teams publish from their own repos on their own infra. The gateway only needs the manifest URL.
The flow with bolt-on tools:
Token IDs flow through the loop end-to-end. The gateway never detokenizes. The tool's hot path is `concat(cached_prefix_ids, tokenize(short_dynamic), cached_suffix_ids)` — typically 50-100× CPU reduction at the tool layer, and zero CPU at the gateway. A sketch of that hot path follows.
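Illustrative only — the names and cache shape below are assumptions, not the `codec-tool-kit` API:

```ts
// Sketch of the token-native tool hot path described above. The prefix/suffix
// ID arrays come from a build-time cache for the active model; tokenize()
// only ever sees the short dynamic fragment.
const cachedPrefixIds: number[] = [/* build-time IDs for `It is ` */];
const cachedSuffixIds: number[] = [/* build-time IDs for `°F in Paris` */];

function respond(temp: string, tokenize: (s: string) => number[]): number[] {
  // concat(cached_prefix_ids, tokenize(short_dynamic), cached_suffix_ids)
  return [...cachedPrefixIds, ...tokenize(temp), ...cachedSuffixIds];
}
```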
The reference SDK landed at `packages/codec-tool-kit` — the manifest spec, `CodecTool` interface, and `precache()` build helper. Zero runtime deps, ~6 KB. Tools that ship caches for the active model are token-native; tools that don't have a binding fall back to text-mode and the gateway tokenizes at the boundary.

What's still needed in this PR's server (small, opt-in additions):
- Tool registry that loads manifests at startup and validates `tokenizerHash` against the active model's tokenizer
- MCP-style HTTP/IPC client to post `CodecToolCall` and receive `CodecToolResult`
- Reinjection path that takes `responseIds` and feeds them back at the position where `<tool_call>` was detected

Full architecture and rationale: `RESULTS.md` §1e.

Update: composite efficiency metric (bytes × TTFT)
Wire-bytes ranking puts zstd on top. TTFT ranking puts gzip on top. Multiplying them gives a single number that ranks encodings holistically — bytes-milliseconds, "the cost of holding this response in flight until the user sees something." Normalised to JSON-SSE identity = 1.0×, higher is better.
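As a formula (illustrative helper, not from the PR):

```ts
// Composite score per the paragraph above: cost = wire bytes × TTFT (ms);
// score = baseline cost / candidate cost, so JSON-SSE identity = 1.0×
// and higher is better.
const composite = (bytes: number, ttftMs: number,
                   baseBytes: number, baseTtftMs: number): number =>
  (baseBytes * baseTtftMs) / (bytes * ttftMs);
```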
The two regimes give two different rankings, which is exactly what the picker has to choose between:
Full composite tables
Interactive efficiency (bytes × TTFT, normalised to JSON-SSE identity = 1.0×)
gzip: 258× / 373× / 722× · gzip: 279× / 433× / 855× across the three sizes (only the gzip rows are recoverable here; full table in RESULTS.md §1d).

Arrows mark cells where zstd is worse than uncompressed identity Codec on the composite metric — the TTFT cliff has fully cancelled the wire savings.
Batch efficiency (bytes only)
gzip: 92× / 437× / 1014× · gzip: 99× / 424× / 1021× across the three sizes (only the gzip rows are recoverable here; full table in RESULTS.md §1d).

In batch mode the bytes-only score puts zstd on top above 256 tokens. gzip wins the small-payload bracket (≤128 tokens) — same crossover the bytes sweep in §1c showed.
What this proves about the picker
The Pareto front for both metrics is {gzip, zstd} — br and identity are dominated everywhere. The `wire-compress` picker has exactly one knob (`interactive: boolean`) because these two metrics cleanly separate the workloads: gzip for interactive, zstd for batch above 256 tokens, gzip below.

Charts: `composite-interactive.png` · `composite-batch.png`. Full study and methodology: `RESULTS.md` §1d.

Update: cross-stack matrix — llama.cpp row measured
Built llama-server from this PR's branch (#22757) with CUDA and ran the same `codec-bench-timed` sweep that landed for sglang earlier (Qwen2.5-0.5B Q4_K_M GGUF, 64/512/2048 tokens, 2 reps, identical token counts across encodings).

Headline finding: this PR ships Codec wire formats but no compression middleware. All four `Accept-Encoding` values return identical bytes — gzip/br/zstd are passthrough — so the wire-coefficient is exactly 1.0 for every encoding. That's a strictly cleaner baseline than sglang (no broken middleware), but you only get the format ratio (17× for msgpack, 25× for protobuf), not the compression bonus that sglang's gzip layer delivers (700×+).

Wire bytes at 2K tokens

(Per-encoding table omitted; see RESULTS.md §1f.)
Cross-stack ratios
`= 1.000` means "no compression applied" — the wire-coefficient is exactly 1.0 because llama.cpp returns the raw codec bytes regardless of `Accept-Encoding`. The composite scores are still >1× because Codec frames are intrinsically 17-25× smaller than JSON-SSE without any compression layer.
Path forward: hooking a streaming-aware gzip layer into the existing HTTP pipeline (wrap the SSE writer in
boost::iostreams::gzip_compressoror use mongoose's gzip support) would lift this row from 17-25× to 700×+, matching sglang's gzip cell. That's a small, contained change relative to the existing PR.Cross-stack matrix and methodology:
RESULTS.md§1f. vLLM row pending its bench run.Update: full bench suite measured on this PR
Beyond the timed sweep, we ran the complete benchmark suite against this PR's
llama-server(Qwen2.5-0.5B Q4_K_M GGUF, RTX 3090 via CUDA docker):codec-bench(standard grid, 64 tok)codec-bench-timed(3 sizes × 2 reps)codec-bench-crossover(8 sizes × full grid)Standard grid at 64 tok (parity comparison)
989 B(16.4×)989 B989 B989 B656 B(24.7×)656 B656 B656 BThese ratios match sglang's identity row (16.0× / 23.9×) within noise — Codec wire format is stack-agnostic. The format-level wins are delivered by this PR. The compression-level wins (sglang gzip pushes msgpack from 16× to 705× at scale) are absent here because no compression middleware is wired in.
8-size crossover sweep — gzip/br/zstd literally tied with identity
Each row's four encoding cells are bit-identical at every size. Per-token cost is constant (~14 B/tok msgpack, ~9.5 B/tok protobuf) — format efficiency works exactly as the bytes-only sweep predicted; only the compression layer is missing.
Path forward
Hooking a streaming-aware gzip layer into the existing HTTP pipeline (mongoose's gzip support, or wrapping the SSE writer with
boost::iostreams::gzip_compressor) would lift this PR from 16-25× to ~700× at 2K tokens, matching sglang's gzip cell. That's a small contained follow-up.Methodology and full cross-stack matrix:
RESULTS.md§1f.