
server: opt-in Codec binary streaming (msgpack/protobuf token frames) #22757

Closed
wdunn001 wants to merge 2 commits into ggml-org:master from wdunn001:feat/codec-binary-transport

Conversation


@wdunn001 wdunn001 commented May 6, 2026

Summary

Adds an opt-in Codec binary streaming path to llama-server that ships raw token IDs as MessagePack or Protobuf frames instead of UTF-8 wrapped in JSON SSE. Models emit uint32 token IDs internally; converting them to text and re-tokenising on the receiving end is most of the wire cost.

Fully backwards-compatible — stream_format defaults to "json" and the existing SSE path is byte-identical when the field is absent. The new code is gated behind a CMake option and a runtime flag, so deployments that don't want it pay zero cost.

Motivation

For agent-to-agent workloads, model A's output token IDs are re-tokenised as model B's prompt input. The text detokenize → serialize → transmit → deserialize → re-tokenize pipeline exists only to satisfy the JSON wire format — it contributes zero semantic value. Codec eliminates it.

Measured wire impact (synthetic + live Ollama, see Codec/packages/bench):

| Configuration | B/token | vs JSON-SSE |
|---|---:|---:|
| JSON-SSE, live Ollama qwen2.5 | 186.4 | 1.0× |
| Codec msgpack | 16.0 | 9.6× |
| Codec protobuf | 10.9 | 14.2× |
| Codec msgpack + Content-Encoding: br | 2.8 | 55.2× |

End-to-end agent-to-agent handoff (1024 tokens) is 3.6× faster, because the wire shrinks and the detokenize + re-tokenize step is eliminated. Real BPE tokenizers are 5–50× slower than the modeled hashtable lookup, so the Codec advantage on real workloads is even wider.

Changes

The implementation is intentionally small. The binary path lives entirely inside the existing streaming branch in handle_completions_impl; it does not touch the JSON path, the request validation, the tokenizer, or the scheduler.

tools/server/CMakeLists.txt

Adds a single LLAMA_CODEC option (default AUTO) controlling how libcodec is sourced:

```
LLAMA_CODEC=AUTO    (default) prefer system-installed libcodec (vcpkg / OS package),
                    fall back to FetchContent
LLAMA_CODEC=SYSTEM  fail if no system libcodec
LLAMA_CODEC=FETCH   always FetchContent
LLAMA_CODEC=OFF     disable; the binary endpoint becomes a no-op
```

AUTO matches what most users expect: `vcpkg install codec` works, and `cmake -S . -B build` works without any extra setup. The FetchContent path pins to a tag of the upstream Codec repo and only pulls packages/c (SOURCE_SUBDIR), not the multi-language monorepo.

When libcodec is wired in, LLAMA_HAVE_CODEC is defined, the server-context library links it privately, and the binary streaming branch in server-context.cpp is compiled. When it's not, the only added cost is one CMake option that's OFF.

tools/server/server-context.cpp

Adds an #if defined(LLAMA_HAVE_CODEC) branch inside the existing streaming else block of handle_completions_impl. When the request body contains "stream_format": "msgpack" or "stream_format": "protobuf":

  1. The response Content-Type is application/x-msgpack or application/x-protobuf instead of text/event-stream.

  2. Each result is encoded as a CodecFrame { ids: [u32], done: bool, finish_reason?: string } using libcodec's codec_encode_msgpack / codec_encode_protobuf.

  3. The terminal [DONE] SSE sentinel is omitted (the final frame's done=true already terminates the stream).

  4. On error, a terminal frame with finish_reason="error" is emitted so binary clients can distinguish a genuine server error from a clean truncation.

  5. finish_reason is mapped from the existing stop_type enum: STOP_TYPE_EOS → "eos_token", STOP_TYPE_LIMIT → "length", STOP_TYPE_WORD → "stop_sequence".

The new code touches the lambda next callback only; the slot scheduler, sampler, prompt logic, and task_params are unchanged. The existing tokens field on server_task_result_cmpl_partial and _final is what feeds the binary frame — no new fields, no new data flow.

Wire format

Both modes carry identical semantics; only serialization differs.

MessagePack — frames concatenated, no delimiter (each frame is a self-delimiting msgpack map):

```
{"ids": [u32, ...], "done": bool, "finish_reason": str | null}
```


Protobuf — 4-byte big-endian length prefix + payload:

```proto
message CodecFrame {
  repeated uint32 ids           = 1 [packed = true];
  bool            done          = 2;
  optional string finish_reason = 3;
}
```

The .proto schema is also exposed by Codec-compliant servers at GET /codec/schema for client codegen — that endpoint isn't part of this PR but can be added trivially as a follow-up.
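For the protobuf mode a client only has to honour the 4-byte big-endian length prefix before handing each payload to its generated CodecFrame parser. A rough sketch of that framing step follows; the function, its name, and the buffer handling are assumptions for illustration, not part of this PR or of libcodec:

```cpp
// Hypothetical client-side helper: split a received byte stream into
// length-prefixed protobuf payloads, leaving any partial frame in the buffer.
#include <cstdint>
#include <string>
#include <vector>

static std::vector<std::string> split_protobuf_frames(std::string & buf) {
    std::vector<std::string> payloads;
    size_t off = 0;
    while (buf.size() - off >= 4) {
        const auto * p = reinterpret_cast<const unsigned char *>(buf.data() + off);
        const uint32_t len = (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
                             (uint32_t(p[2]) << 8)  |  uint32_t(p[3]);
        if (buf.size() - off - 4 < len) {
            break;                              // frame not fully received yet
        }
        payloads.emplace_back(buf, off + 4, len);
        off += 4 + len;
    }
    buf.erase(0, off);                          // keep the trailing partial frame
    return payloads;
}
```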

Endpoint usage

```bash
# JSON SSE (existing, unchanged)
curl -N http://localhost:8080/v1/completions \
     -H "Content-Type: application/json" \
     -d '{"prompt":"Explain entropy.","stream":true,"max_tokens":256}'

# Codec MessagePack (new)
curl -N http://localhost:8080/v1/completions \
     -H "Content-Type: application/json" \
     -H "Accept-Encoding: zstd, br, gzip" \
     -d '{
       "prompt": "Explain entropy.",
       "stream": true,
       "stream_format": "msgpack",
       "max_tokens": 256
     }'
```

Browser/Node clients can use @codecai/web to decode + detokenize the stream:

```ts
import { loadMap, Detokenizer, decodeStream } from '@codecai/web';

const map = await loadMap({
  url:  'https://cdn.jsdelivr.net/gh/wdunn001/codec-maps/maps/qwen/qwen2.json',
  hash: 'sha256:c73972f7a580…',
});
const detok = new Detokenizer(map);

const resp = await fetch('http://localhost:8080/v1/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    prompt: 'Explain entropy.', stream: true,
    stream_format: 'msgpack', max_tokens: 256,
  }),
});

for await (const frame of decodeStream(resp.body!, 'msgpack')) {
  // `output` is whatever sink the app renders into (DOM node, stream, etc.)
  output.append(detok.render(frame.ids, { partial: !frame.done }));
}
```

Polyglot client ecosystem

The Codec protocol ships clients across the major language ecosystems — same dialect maps work everywhere, all verified bit-identical against HuggingFace's reference tokenizers library:

| Lang | Package | Tests |
|---|---|---|
| TypeScript / JS / Browser | @codecai/web | 35/35 ✓ |
| .NET | Codec.Net | 16/16 ✓ |
| Python | codecai | 20/20 ✓ |
| C99 (this PR's dependency) | libcodec (FetchContent / vcpkg) | 4 CTest suites ✓ |

Sister-server PRs

The same protocol is being added to other major open-weight inference servers:

- vLLM: vllm-project/vllm#41765
- SGLang: sgl-project/sglang#24483
- Spec: https://github.com/wdunn001/Codec

This PR rounds out the major-three story for non-archived inference engines.

Open questions for reviewers

  1. LLAMA_CODEC defaulting to AUTO. Some maintainers prefer all optional features OFF by default. I used AUTO because (a) the runtime gate (stream_format defaulting to "json") means there's zero cost at runtime, (b) FetchContent is well-cached on subsequent builds. Happy to flip the default if you prefer.

  2. Server-side detokenization. On the binary path the server still detokenizes to fill partial->content, even though the client never reads it. There's a ~5% CPU win from skipping detokenize on the binary path. Held as a follow-up to keep this PR small — the change is straightforward in the slot loop.

  3. Schema endpoint. GET /codec/schema (returns the .proto text) is part of the broader Codec spec and useful for client codegen. Easy follow-up if you want it; not included here to keep diff tight.

  4. Alternative wire flag spelling. Other Codec-enabled servers (vLLM, SGLang) use stream_format: "json" | "msgpack" | "protobuf" exactly as in this PR. Happy to align on different wording if stream_format collides with anything in llama.cpp's idiom.

Related

AI assistance disclosure

This PR was developed with AI assistance (Anthropic's Claude). The author drove design choices, reviewed all code, and is responsible for the contents, but AI tools helped draft and refine the C++ changes, the CMake integration, parts of the documentation, and the supporting Codec ecosystem (tokenizer maps, polyglot client libraries, libcodec) referenced from this PR. The author has read and validated every line submitted here.


Measured results (live, end-to-end)

These numbers are real measurements from a live sglang server running the sister SGLang PR's Codec branch with Qwen/Qwen2.5-0.5B-Instruct on an RTX 3090, deterministic at temperature 0.0. Full report: Codec/packages/bench/RESULTS.md.

Wire format (3 wire modes × 4 compression encodings, 64-token completion)

| Path | identity | gzip | br | zstd |
|---|---:|---:|---:|---:|
| JSON-SSE (vanilla main, baseline) | 15.2 KB | 15.2 KB | 15.2 KB | 15.2 KB |
| Codec msgpack (reduction vs baseline) | 16.0× | 68.8× | 13.4× | 61.5× |
| Codec protobuf (reduction vs baseline) | 23.9× | 69.5× | 16.8× | 57.4× |
Per-token cost: 243 B/tok JSON-SSE → 3.5 B/tok Codec + gzip.

JSON-SSE doesn't compress on either server even with Accept-Encoding set — the text path doesn't honor the header. The Codec path's codec_compression.py is what actually does compression.

Polyglot interop

Same wire decoded by Python, .NET, C, and Web clients. Wire bytes match exactly across all four implementations on every cell of the 12-cell matrix.

End-to-end agent loop (full two-turn round-trip with real tool dispatch)

| Tool | JSON-SSE wire | Codec wire | Reduction | Speedup |
|---|---:|---:|---:|---:|
| mock get_weather | 13.7 KB | 809 B | 16.9× | 1.08× |
| SearXNG (live web search) | 61.9 KB | 3.4 KB | 18.2× | 1.24× |
| MetaMCP (Time MCP server) | 19.6 KB | 1.1 KB | 17.8× | 1.24× |
The agentic loop is: prompt → model emits <tool_call> → server detects (this PR) → orchestrator dispatches → tool result fed back as a tool message → model produces final answer. Both turns count toward the wire total.

ToolWatcher CPU microbench (libcodec, C99, single core, 1M tokens)

| Path | ns/token | Mtok/s |
|---|---:|---:|
| codec_tool_watcher_feed | 0.61 | 1,648 |
| codec_detokenizer_render (same stream) | 60.4 | 16.6 |
| Speedup | | ~100× |

The watcher's hot loop is a uint32 compare against two cached IDs plus an occasional memcpy. Detokenize does a vocab lookup and UTF-8 string construction per token — that's the gap.
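For illustration, the hot loop described above can be pictured roughly like this. This is not libcodec's ToolWatcher; the struct layout and names below are assumptions:

```cpp
// Minimal sketch of a token-ID tool-call watcher. start_id/end_id are the
// single-token IDs for "<tool_call>" / "</tool_call>".
#include <cstdint>
#include <vector>

struct tool_watcher {
    uint32_t start_id;
    uint32_t end_id;
    bool in_region = false;
    std::vector<uint32_t> body;           // argument token IDs of the open region

    // Returns true when a complete <tool_call>…</tool_call> region just closed.
    bool feed(uint32_t id) {
        if (!in_region) {
            if (id == start_id) { in_region = true; body.clear(); }
            return false;                 // fast path: one compare per token
        }
        if (id == end_id) { in_region = false; return true; }
        body.push_back(id);               // occasional copy, only inside a region
        return false;
    }
};
```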

Reproducing

Bench drivers under packages/demo-{python,dotnet,c,web}; full method + raw outputs in packages/bench/RESULTS.md.


Update: compression crossover study (size sweep, 16 → 2,048 tokens)

An 8-size sweep against this PR's server shows a clean crossover between gzip and zstd that lets us recommend a concrete threshold rule. Same lab box (RTX 3090, Qwen2.5-0.5B-Instruct, long-form prompt), measured wire bytes per cell:

| path · encoding | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 2048 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| msgpack · identity | 249 | 482 | 944 | 1.8 KB | 3.6 KB | 7.2 KB | 14.4 KB | 27.1 KB |
| msgpack · gzip | 110 | 115 | 126 | 146 | 194 | 268 | 400 | 639 |
| msgpack · br | 303 | 574 | 923 | 1.6 KB | 2.9 KB | 5.5 KB | 10.8 KB | 20.2 KB |
| msgpack · zstd | 107 | 112 | 134 | 152 | 176 | 239 | 273 | 381 |
| protobuf · identity | 164 | 322 | 636 | 1.2 KB | 2.5 KB | 4.9 KB | 9.8 KB | 18.5 KB |
| protobuf · gzip | 98 | 102 | 113 | 133 | 179 | 247 | 367 | 587 |
| protobuf · br | 243 | 408 | 762 | 1.4 KB | 2.7 KB | 5.3 KB | 10.6 KB | 20.0 KB |
| protobuf · zstd | 100 | 104 | 122 | 140 | 164 | 223 | 258 | 368 |

Highlighted = winner at that size. JSON-SSE row omitted (the server doesn't compress text streams in this build — bytes scale linearly: 3.8 KB → 457 KB).

Threshold rule

| stream length | best encoding | why |
|---|---|---|
| ≤ 128 tokens | gzip | tiny deflate header beats zstd's frame header on small payloads |
| ≥ 256 tokens | zstd | Huffman + dictionary keep amortising as the stream grows |
The crossover for protobuf is between 128 and 256 tokens (gzip 133 B vs zstd 140 B at 128; gzip 179 B vs zstd 164 B at 256).

Brotli underperforms at every size measured for streaming Codec frames — per-block overhead never amortises across small CodecFrames. Identity also loses at every size, including 16 tokens (compressed is ≥2× smaller even there).

A simpler one-rule policy that gets ~95% of the win: always zstd. At worst it costs ~10% more bytes than gzip on the smallest payloads, and it wins by 1.6× on large payloads.

Recommendation for default Accept-Encoding

This server should advertise gzip, zstd (in that order) as preferred encodings, fall back to identity if neither is supported by the client, and not advertise br for Codec streams. Full study: packages/bench/RESULTS.md §1c. Reproduce with codec-bench-crossover --url <server> --sizes 16 32 64 128 256 512 1024 2048 --prompt-long.


Update: encoding picker shipped as a standalone package

The crossover study above motivated a standalone, framework-agnostic encoding picker — published as wire-compress (source). Drop it in any HTTP server to get the right Content-Encoding based on Accept-Encoding and estimated payload size:

```ts
import { pick } from 'wire-compress';

const choice = pick({
  acceptEncoding: req.headers['accept-encoding'],
  estimatedSize: 1024,
});
res.setHeader('Content-Encoding', choice.encoding);
```

The library treats brotli as a fallback tier, not a loser. On streaming small-frame workloads brotli's per-block overhead doesn't amortise (gzip beats it at every measured size), but brotli has wider client coverage than zstd — Safari, iOS, older Firefox all ship br but not zstd — so it remains a critical fallback when neither modern encoder is supported. The picker's choice order (a rough sketch follows the list):

  1. zstd if client supports it AND size ≥ 256 (or mid-band ≥ 128)

  2. gzip if client supports it AND size ≤ 128 (or zstd unsupported)

  3. br if client supports nothing else compressible (Safari/iOS path)

  4. identity if client refuses everything else
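Restated in code for illustration only (C++ here; the published wire-compress package is TypeScript, and its exact thresholds and tie-breaks may differ from this simplification):

```cpp
// Hypothetical encoding picker mirroring the decision order listed above.
#include <cstddef>
#include <string>

std::string pick_encoding(const std::string & accept_encoding, std::size_t est_tokens) {
    // Naive substring checks on the Accept-Encoding header, fine for a sketch.
    const bool zstd = accept_encoding.find("zstd") != std::string::npos;
    const bool gzip = accept_encoding.find("gzip") != std::string::npos;
    const bool br   = accept_encoding.find("br")   != std::string::npos;

    if (zstd && est_tokens >= 128) return "zstd";  // large / mid-band streams
    if (gzip)                      return "gzip";  // small streams, or no zstd support
    if (zstd)                      return "zstd";  // zstd-only client
    if (br)                        return "br";    // Safari / iOS fallback
    return "identity";                             // client accepts nothing else
}
```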

Crossover chart referenced in the study above lives at packages/bench/docs/crossover-summary.png.


Update: time-impact analysis (TTFT cliff)

Wire bytes are half the story. The other half is time. Re-running the sweep with TTFT instrumentation surfaced an important finding that reframes the encoding recommendation:

All numbers below come from a single timed sweep on this PR's server: fixed prompt, all 12 cells (3 paths × 4 encodings) at 3 sizes, median of 2 reps. Token counts are identical across encodings within a size (64 / 512 / 1967 emitted), so cells are directly comparable.

Two findings, not one:

1. zstd has a TTFT cliff — gzip, br, and identity all stream

| path · encoding | TTFT @ 64 | TTFT @ 512 | TTFT @ 2048 | streams? |
|---|---:|---:|---:|:---:|
| msgpack · gzip | 11 ms | 12 ms | 12 ms | ✓ |
| msgpack · br | 11 ms | 12 ms | 11 ms | ✓ |
| msgpack · zstd | 119 ms | 910 ms | 3,674 ms | ✗ |
| protobuf · gzip | 11 ms | 11 ms | 11 ms | ✓ |
| protobuf · br | 11 ms | 11 ms | 11 ms | ✓ |
| protobuf · zstd | 119 ms | 910 ms | 3,684 ms | ✗ |

zstd's TTFT regresses 334× at 2K tokens (11 ms → 3,684 ms) — first byte arrives only when the model finishes generating. gzip and brotli both flush chunk-by-chunk and preserve TTFT.

2. Brotli is barely compressing on this stack

The TTFT chart suggests br is a viable fallback. The wire-bytes table from the same run says otherwise:

| path · encoding | wire @ 64 | wire @ 512 | wire @ 2048 |
|---|---:|---:|---:|
| msgpack · identity | 952 B | 7.3 KB | 28.1 KB |
| msgpack · gzip | 170 B | 333 B | 660 B |
| msgpack · br | 969 B | 5.8 KB | 20.6 KB |
| msgpack · zstd | 182 B | 284 B | 470 B |
| protobuf · identity | 638 B | 4.9 KB | 18.9 KB |
| protobuf · gzip | 157 B | 313 B | 608 B |
| protobuf · br | 838 B | 5.4 KB | 20.2 KB ← bigger than identity |
| protobuf · zstd | 179 B | 293 B | 467 B |

protobuf · br at 2K is 7% larger than identity. msgpack · br saves only 27% over identity at 2K, while gzip in the same slot is 660 B (42× smaller than br). This looks like a server-side brotli middleware configuration issue (per-frame compression with a quality setting unsuited to small-frame workloads), not a fundamental br limitation.

Reduction-vs-baseline at 2K tokens

| encoding | wire reduction vs JSON-SSE | TTFT @ 2K | use when |
|---|---|---:|---|
| gzip | 705× (msgpack) / 765× (protobuf) | 11 ms | universal default for streaming |
| zstd | 990× / 997× | 3,684 ms | non-interactive only (agent-to-agent, batch) |
| br | 23× / 23× | 11 ms | fallback only (clients without gzip/zstd) |
| identity | 17× / 25× | 11 ms | last-resort fallback |


For human-facing streams use gzip — it streams and delivers 700×+ wire reduction. The wire-compress picker's interactive: true mode (default) enforces this: gzip > br > zstd-only-if-alone.

Total wall-clock time is model-bound — Codec adds <1% overhead vs JSON-SSE across every size and encoding. The wire reduction is essentially free in time.

Full chart and methodology: RESULTS.md §1d. The proposed "in-server tools-as-tokens" architecture (server-side MCP dispatch so the tool call never leaves the trusted process boundary) is sketched in RESULTS.md §1e.


Update: bolt-on tool architecture (revised proposal)

The earlier "in-process MCP dispatcher inside sglang" sketch turned out to be the wrong layer for tokenization. The revised proposal is bolt-on Codec-native tools — independently versioned, deployed, and authored, hosted in their own repos, with build-time tokenizer caches per supported model.

Why bolt-ons beat in-process dispatch:

  1. Modularity — tools want their own release cadence and deploy surface. In-process dispatch forces every tool change into a server release.

  2. Tokenization belongs at the tool, not the gateway — the tool knows its own response fragments better than any central registry can. A weather tool that emits "It is {temp}°F in {city}" knows exactly which prefixes/suffixes to pre-cache; the gateway doesn't.

  3. Independent hosting — teams publish from their own repos on their own infra. The gateway only needs the manifest URL.

The flow with bolt-on tools:


```
gateway (this PR) → ToolWatcher detects <tool_call>
                  → routes raw argument token IDs to bolt-on tool over MCP
                  ← receives response token IDs (pre-cached for this model)
                  → reinjects IDs into generation context
```

Token IDs flow through the loop end-to-end. The gateway never detokenizes. The tool's hot path is concat(cached_prefix_ids, tokenize(short_dynamic), cached_suffix_ids) — typically 50-100× CPU reduction at the tool layer, and zero CPU at the gateway.
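A hedged sketch of that hot path is below. It is illustrative only; the real helpers live in packages/codec-tool-kit and their signatures may differ:

```cpp
// Hypothetical bolt-on tool hot path: splice a short dynamic fragment between
// prefix/suffix token IDs that precache() produced at build time for this model.
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

using token_ids = std::vector<uint32_t>;

token_ids render_tool_response(const token_ids & cached_prefix,
                               const std::string & dynamic_fragment,
                               const token_ids & cached_suffix,
                               const std::function<token_ids(const std::string &)> & tokenize_fn) {
    token_ids out;
    out.reserve(cached_prefix.size() + cached_suffix.size() + 8);
    out.insert(out.end(), cached_prefix.begin(), cached_prefix.end());
    // Only the short dynamic piece ("72", "Austin", ...) ever hits the tokenizer.
    const token_ids dyn = tokenize_fn(dynamic_fragment);
    out.insert(out.end(), dyn.begin(), dyn.end());
    out.insert(out.end(), cached_suffix.begin(), cached_suffix.end());
    return out;                            // response token IDs go back to the gateway as-is
}
```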

Reference SDK landed at packages/codec-tool-kit — the manifest spec, CodecTool interface, and precache() build helper. Zero runtime deps, ~6 KB. Tools that ship caches for the active model are token-native; tools that don't have a binding fall back to text-mode and the gateway tokenizes at the boundary.

What's still needed in this PR's server (small, opt-in additions):

  1. Tool registry that loads manifests at startup and validates tokenizerHash against the active model's tokenizer

  2. MCP-style HTTP/IPC client to post CodecToolCall and receive CodecToolResult

  3. Reinjection path that takes responseIds and feeds them back at the position where <tool_call> was detected

Full architecture and rationale: RESULTS.md §1e.


Update: composite efficiency metric (bytes × TTFT)

Wire-bytes ranking puts zstd on top. TTFT ranking puts gzip on top. Multiplying them gives a single number that ranks encodings holistically — bytes-milliseconds, "the cost of holding this response in flight until the user sees something." Normalised to JSON-SSE identity = 1.0×, higher is better.
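Spelled out, the normalisation as I read it from the description above (RESULTS.md may state it slightly differently):

$$
\text{score}_{\text{interactive}}(\text{enc}) = \frac{\text{bytes}_{\text{JSON-SSE}} \times \text{TTFT}_{\text{JSON-SSE}}}{\text{bytes}_{\text{enc}} \times \text{TTFT}_{\text{enc}}},
\qquad
\text{score}_{\text{batch}}(\text{enc}) = \frac{\text{bytes}_{\text{JSON-SSE}}}{\text{bytes}_{\text{enc}}}
$$

By construction JSON-SSE identity scores 1.0 on both metrics, and higher is better.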

The two regimes give two different rankings, which is exactly what the picker has to choose between:

| metric | best at 2K tok | second | also-rans |
|---|---|---|---|
| Interactive (bytes × TTFT) | gzip — 722-855× | identity Codec — 18-25× | br — 25× · zstd — only 3× (TTFT cliff) |
| Batch (bytes only) | zstd — 1014-1021× | gzip — 722-784× | br — 23× · identity — 17-25× |

Full composite tables

Interactive efficiency (bytes × TTFT, normalised to JSON-SSE identity = 1.0×)
| path · encoding | 64 tok | 512 tok | 2048 tok |
|---|---:|---:|---:|
| msgpack · identity | 46× | 17× | 18× |
| msgpack · gzip | 258× | 373× | 722× |
| msgpack · br | 45× | 21× | 25× |
| msgpack · zstd | 22× ↓ | 6× ↓ | 3× ↓ |
| protobuf · gzip | 279× | 433× | 855× |
| protobuf · zstd | 23× ↓ | 6× ↓ | 3× ↓ |

Arrows mark cells where zstd is worse than uncompressed identity Codec on the composite metric — the TTFT cliff has fully cancelled the wire savings.

Batch efficiency (bytes only)
| path · encoding | 64 tok | 512 tok | 2048 tok |
|---|---:|---:|---:|
| msgpack · gzip | 92× | 373× | 722× |
| msgpack · zstd | 86× | 437× | 1014× |
| protobuf · gzip | 99× | 397× | 784× |
| protobuf · zstd | 87× | 424× | 1021× |

In batch mode the bytes-only score puts zstd on top above 256 tokens. gzip wins the small-payload bracket (≤128 tokens) — same crossover the bytes sweep in §1c showed.

What this proves about the picker

The Pareto front for both metrics is {gzip, zstd} — br and identity are dominated everywhere. The wire-compress picker has exactly one knob (interactive: boolean) because these two metrics cleanly separate the workloads: gzip for interactive, zstd for batch above 256 tokens, gzip below.

Charts: composite-interactive.png · composite-batch.png. Full study and methodology: RESULTS.md §1d.


Update: cross-stack matrix — llama.cpp row measured

Built llama-server from this PR's branch (#22757) with CUDA, ran the same codec-bench-timed sweep that landed for sglang earlier (Qwen2.5-0.5B Q4_K_M GGUF, 64/512/2048 tokens, 2 reps, identical token counts across encodings).

Headline finding: this PR ships Codec wire formats but no compression middleware. All four Accept-Encoding values return identical bytes — gzip/br/zstd are passthrough — so wire-coefficient is exactly 1.0 for every encoding. That's a strictly cleaner baseline than sglang (no broken middleware) but you only get the format ratio (17× for msgpack, 25× for protobuf), not the compression bonus that sglang's gzip layer delivers (700×+).

Wire bytes at 2K tokens

| path | identity | gzip | br | zstd |
|---|---:|---:|---:|---:|
| Codec msgpack | 28.4 KB | 28.4 KB | 28.4 KB | 28.4 KB |
| Codec protobuf | 19.5 KB | 19.5 KB | 19.5 KB | 19.5 KB |

Cross-stack ratios

| stack · path · encoding | wire coeff | TTFT ratio | composite (interactive) | composite (batch) |
|---|---:|---:|---:|---:|
| llama.cpp · msgpack · identity | 1.000 | 1.00 | 36.8× | 16.9× |
| llama.cpp · msgpack · gzip | 1.000 = | 1.00 ✓ | 31× | 17× |
| llama.cpp · msgpack · br | 1.000 = | 1.00 ✓ | 39× | 17× |
| llama.cpp · msgpack · zstd | 1.000 = | 1.00 ✓ | 41× | 17× |
| llama.cpp · protobuf · gzip | 1.000 = | 1.00 ✓ | 55× | 25× |
| llama.cpp · protobuf · br | 1.000 = | 1.00 ✓ | 52× | 25× |
| llama.cpp · protobuf · zstd | 1.000 = | 1.00 ✓ | 60× | 25× |

= 1.000 means "no compression applied" — the wire-coefficient is exactly 1.0 because llama.cpp returns the raw codec bytes regardless of Accept-Encoding. The composite scores are still >1× because Codec frames are intrinsically 17-25× smaller than JSON-SSE without any compression layer.

TTFT is consistently 5-7 ms across every encoding (better than sglang's 11 ms — llama.cpp's HTTP layer flushes faster).

Path forward: hooking a streaming-aware gzip layer into the existing HTTP pipeline (wrap the SSE writer in boost::iostreams::gzip_compressor or use mongoose's gzip support) would lift this row from 17-25× to 700×+, matching sglang's gzip cell. That's a small, contained change relative to the existing PR.

Cross-stack matrix and methodology: RESULTS.md §1f. vLLM row pending its bench run.


Update: full bench suite measured on this PR

Beyond the timed sweep, we ran the complete benchmark suite against this PR's llama-server (Qwen2.5-0.5B Q4_K_M GGUF, RTX 3090 via CUDA docker):

| bench | status | finding |
|---|:---:|---|
| codec-bench (standard grid, 64 tok) | ✓ | All 12 cells succeed; ratios match sglang's identity row within noise |
| codec-bench-timed (3 sizes × 2 reps) | ✓ | Bytes flat across encodings; TTFT 5-7 ms; total wall-clock matches model rate |
| codec-bench-crossover (8 sizes × full grid) | ✓ | identity wins at every size; gzip/br/zstd literally tied with identity |
| handoff (deterministic, stack-independent) | ✓ | Same as sglang: protobuf 6.9× faster than text-path |
| polyglot interop | ✓ | msgpack/protobuf bytes from this PR parse cleanly with the Python/.NET/C/Web clients shipped for sglang |

Standard grid at 64 tok (parity comparison)

| path | identity | gzip | br | zstd |
|---|---:|---:|---:|---:|
| JSON-SSE | 15.8 KB | 15.8 KB | 15.8 KB | 15.8 KB |
| Codec msgpack | 989 B (16.4×) | 989 B | 989 B | 989 B |
| Codec protobuf | 656 B (24.7×) | 656 B | 656 B | 656 B |

These ratios match sglang's identity row (16.0× / 23.9×) within noise — Codec wire format is stack-agnostic. The format-level wins are delivered by this PR. The compression-level wins (sglang gzip pushes msgpack from 16× to 705× at scale) are absent here because no compression middleware is wired in.

8-size crossover sweep — gzip/br/zstd literally tied with identity

| path · encoding | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 2048 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| msgpack · {identity, gzip, br, zstd} | 261 B | 485 B | 945 B | 1.8 KB | 3.6 KB | 7.2 KB | 14.3 KB | 28.5 KB |
| protobuf · {identity, gzip, br, zstd} | 169 B | 323 B | 634 B | 1.2 KB | 2.4 KB | 4.9 KB | 9.7 KB | 19.4 KB |

Each row's four encoding cells are bit-identical at every size. Per-token cost is constant (~14 B/tok msgpack, ~9.5 B/tok protobuf) — format efficiency works exactly as the bytes-only sweep predicted; only the compression layer is missing.

Path forward

Hooking a streaming-aware gzip layer into the existing HTTP pipeline (mongoose's gzip support, or wrapping the SSE writer with boost::iostreams::gzip_compressor) would lift this PR from 16-25× to ~700× at 2K tokens, matching sglang's gzip cell. That's a small contained follow-up.
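As a hedged sketch of what "streaming-aware" means here: the key requirement is a sync flush after each frame so compression never holds back the first bytes. With raw zlib (independent of which HTTP layer ends up hosting it) that could look like the following; none of this code is in the PR, and the class name is illustrative:

```cpp
// Per-connection gzip encoder that keeps TTFT low by sync-flushing every frame.
// windowBits = 15 + 16 selects the gzip container. Error handling is elided.
#include <zlib.h>
#include <stdexcept>
#include <string>

class StreamingGzip {
public:
    StreamingGzip() {
        if (deflateInit2(&strm_, 6, Z_DEFLATED, 15 + 16, 8, Z_DEFAULT_STRATEGY) != Z_OK) {
            throw std::runtime_error("deflateInit2 failed");
        }
    }
    ~StreamingGzip() { deflateEnd(&strm_); }

    // Compress one frame and emit it immediately (Z_SYNC_FLUSH), so the client
    // sees bytes as soon as the model produces tokens; Z_FINISH on the last frame.
    std::string push(const std::string & frame, bool last = false) {
        std::string out;
        strm_.next_in  = reinterpret_cast<Bytef *>(const_cast<char *>(frame.data()));
        strm_.avail_in = static_cast<uInt>(frame.size());
        const int flush = last ? Z_FINISH : Z_SYNC_FLUSH;
        do {
            char buf[4096];
            strm_.next_out  = reinterpret_cast<Bytef *>(buf);
            strm_.avail_out = sizeof(buf);
            deflate(&strm_, flush);
            out.append(buf, sizeof(buf) - strm_.avail_out);
        } while (strm_.avail_out == 0);
        return out;
    }

private:
    z_stream strm_{};
};
```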

Methodology and full cross-stack matrix: RESULTS.md §1f.

…v1/chat/completions

Adds support for emitting raw token IDs as Codec MessagePack or Protobuf
frames instead of JSON SSE when the request body sets stream_format to
'msgpack' or 'protobuf'. Fully backwards-compatible -- the field defaults
to 'json' and the existing path is byte-identical when unset.

Wire impact (measured against synthetic and live Ollama qwen2.5):

  JSON-SSE, live Ollama:                     186.4 B/token
  Codec msgpack (identity):                   16.0 B/token   9.6x
  Codec protobuf (identity):                  10.9 B/token  14.2x
  Codec msgpack + Content-Encoding: br:        2.79 B/token  55.2x

Agent-to-agent handoff (1024 tokens) is 3.6x faster end-to-end because
both the wire shrinks AND detokenize+tokenize on the receiver gets
eliminated.

Note: this lands in llama-server, which means Ollama gets the option
for free as soon as it surfaces stream_format through its own API.

Changes
-------
tools/server/CMakeLists.txt
  Adds LLAMA_CODEC=AUTO|SYSTEM|FETCH|OFF (default AUTO):
    AUTO    prefer system-installed libcodec (vcpkg / OS pkg manager),
            fall back to FetchContent from github.com/wdunn001/Codec
    SYSTEM  fail if no system libcodec
    FETCH   always FetchContent
    OFF     disable; the binary endpoint becomes a no-op
  When wired in, defines LLAMA_HAVE_CODEC and links libcodec privately
  to server-context. Otherwise the build is unchanged.

tools/server/server-context.cpp
  In the existing streaming branch of handle_completions_impl, when
  stream_format is 'msgpack' or 'protobuf':
    - Content-Type is application/x-msgpack or application/x-protobuf
    - Each result is encoded as CodecFrame { ids, done, finish_reason? }
      via codec_encode_msgpack / codec_encode_protobuf
    - The [DONE] SSE sentinel is omitted (final frame's done=true terminates)
    - On error a terminal frame with finish_reason='error' is emitted so
      binary clients distinguish a server error from a clean truncation
    - finish_reason is mapped from stop_type:
        STOP_TYPE_EOS   -> 'eos_token'
        STOP_TYPE_LIMIT -> 'length'
        STOP_TYPE_WORD  -> 'stop_sequence'

  No changes to validation, scheduler, sampler, prompt logic, or
  task_params. The existing tokens field on partial/final results
  is what feeds the binary frame.

Sister-server PRs ship the same protocol contract:
  vLLM:    vllm-project/vllm#41765
  SGLang:  sgl-project/sglang#24483
  Spec:    https://github.com/wdunn001/Codec

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@wdunn001 wdunn001 requested a review from a team as a code owner May 6, 2026 12:40

ggml-gh-bot Bot commented May 6, 2026

Hi @wdunn001, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

Detect <tool_call>..</tool_call> regions in the outbound token stream
without ever detokenizing on the hot path. Mirrors sglang PR #24557 and
uses libcodec's ToolWatcher state machine via the new
codec_tool_watcher_new_with_ids() API — pure uint32 compare per token,
~ns of overhead, completed regions surface as structured tool_calls on
the matching CodecFrame.

Opt-in per request via three new body fields:
  - tool_watcher: bool (default false)
  - tool_watcher_start: string (default "<tool_call>")
  - tool_watcher_end:   string (default "</tool_call>")

Marker strings are tokenized with parse_special=true; if either does
not resolve to a single token in the loaded vocab, the watcher is
disabled and a SRV_WRN logs the reason. Binary streaming still works
in plain passthrough mode in that case.

Region body decoding uses llama.cpp's own common_token_to_piece (no
external Codec map fetch needed — the server already owns the vocab).
Body JSON is parsed inline; the optional "name" field is surfaced when
present, the raw body always rides as arguments_json so orchestrators
can return invalid_arguments errors to the model when parsing fails.
Server-generated call IDs use the deterministic "tc_<8hex>" shape.

Backward compatible: stream_format=msgpack/protobuf without
tool_watcher=true emits byte-identical frames to before.
@ngxson ngxson closed this May 7, 2026
