spec: qwen3-moe-streaming-sse-v1 — follow-up contract to M32d KV cache by noahgift · Pull Request #1835 · paiml/aprender

noahgift · 2026-05-20T05:53:07Z

Summary

Registers a new provable contract `qwen3-moe-streaming-sse-v1` for per-token SSE streaming on the qwen3_moe chat-completions path. Natural follow-up to #1832 (M32d KV cache).

Why now

Pre-M32d, streaming SSE was meaningless on qwen3_moe — full-prefill-per-token mode at ~0.5 tok/s meant the client would see no output for ~30 minutes before all 256 tokens arrived at once. Post-M32d at 9.62 tok/s sustained, per-token emits become valuable for chat UX.

Contract gates

V1_001: chat-completions with `stream=true` emits per-token SSE events (not buffered)
V1_002: `stream=false` still returns a single JSON response (regression check)
V1_003: streaming throughput ≥ 2 tok/s, i.e. median inter-event time ≤ 500 ms
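
One way to read V1_003 is that a throughput of at least 2 tok/s corresponds to a median gap of at most 500 ms between consecutive SSE events. The sketch below checks that reading against a list of event arrival times. The function names `median_inter_event` and `passes_v1_003` are hypothetical and not part of the contract; the 500 ms bound is derived from the 2 tok/s figure above.

```rust
use std::time::Duration;

/// Median gap between consecutive event arrival times (offsets from the
/// start of the request). Returns None when fewer than two events exist.
fn median_inter_event(arrivals: &[Duration]) -> Option<Duration> {
    if arrivals.len() < 2 {
        return None;
    }
    // Gap between each event and the one before it.
    let mut gaps: Vec<Duration> = arrivals
        .windows(2)
        .map(|w| w[1].saturating_sub(w[0]))
        .collect();
    gaps.sort();
    let mid = gaps.len() / 2;
    Some(if gaps.len() % 2 == 1 {
        gaps[mid]
    } else {
        // Even count: average the two middle gaps.
        (gaps[mid - 1] + gaps[mid]) / 2
    })
}

/// Assumed reading of V1_003: >= 2 tok/s means a median gap <= 500 ms.
fn passes_v1_003(arrivals: &[Duration]) -> bool {
    match median_inter_event(arrivals) {
        Some(gap) => gap <= Duration::from_millis(500),
        None => false,
    }
}
```

Note that a fully buffered response, where every event arrives at once, would have a median gap near zero and pass this check on its own, which is presumably why V1_001 is a separate gate.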

Implementation phases (engineer playbook)

Phase 1 (~2hr): callback variant of `run_qwen3_moe_generate` accepting `&mut dyn FnMut(u32) -> bool`
Phase 2 (~4hr): wire into `try_qwen3_moe_backend` — `tokio::task::spawn_blocking` + mpsc channel + axum SSE stream
Phase 3 (~2hr): cargo integration test (env-gated, `#[ignore]` by default; mirrors `qwen3_moe_serve_dispatch_v1.rs`)
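
As a rough illustration of the Phase 1 shape, the sketch below shows a generate loop that hands each token to a `&mut dyn FnMut(u32) -> bool` callback as soon as it is produced. Everything here is a stand-in: `next_token` and `generate_with_callback` are hypothetical names, not the real `run_qwen3_moe_generate`, and treating a `false` return as "stop generating" is an assumption, since the PR does not say what the boolean means.

```rust
/// Hypothetical stand-in for the real decode step; returns a fake token id.
fn next_token(step: usize) -> u32 {
    1000 + step as u32
}

/// Callback-driven generate loop. `on_token` is invoked once per token,
/// immediately after it is produced. Assumption: returning `false` asks
/// the loop to stop early (e.g. the client disconnected).
/// Returns the number of tokens produced.
fn generate_with_callback(
    max_tokens: usize,
    on_token: &mut dyn FnMut(u32) -> bool,
) -> usize {
    let mut produced = 0;
    for step in 0..max_tokens {
        let token = next_token(step);
        produced += 1;
        if !on_token(token) {
            break;
        }
    }
    produced
}
```

A caller that only wants the buffered behaviour can pass a closure that pushes into a `Vec<u32>` and always returns `true`, which keeps the `stream=false` path expressible on top of the same loop.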

Total ~6-8 hours; operator-actionable once #1832 merges.
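
The Phase 2 wiring can be pictured as a blocking producer feeding a channel that the HTTP layer drains. The sketch below uses only the standard library (`std::thread` and `std::sync::mpsc`) in place of the `tokio::task::spawn_blocking`, tokio mpsc channel, and axum SSE stream named above, so it stays dependency-free. All names (`generate_with_callback`, `sse_frame`, `stream_tokens`) are hypothetical, the JSON payload is a placeholder rather than the real chat-completions chunk schema, and the `[DONE]` sentinel is an assumption borrowed from OpenAI-style streaming.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for the blocking, callback-driven generate call.
/// Emits token ids 0..max_tokens and stops if the callback returns false.
fn generate_with_callback(max_tokens: u32, on_token: &mut dyn FnMut(u32) -> bool) {
    for token in 0..max_tokens {
        if !on_token(token) {
            break;
        }
    }
}

/// Formats one token as an SSE event: a `data:` line followed by a blank line.
/// The payload shape is illustrative only.
fn sse_frame(token: u32) -> String {
    format!("data: {{\"token\":{}}}\n\n", token)
}

/// Runs generation on a worker thread and forwards one frame per token.
fn stream_tokens(max_tokens: u32) -> mpsc::Receiver<String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        generate_with_callback(max_tokens, &mut |token| {
            // A failed send means the receiver was dropped, so stop generating.
            tx.send(sse_frame(token)).is_ok()
        });
        // Assumed end-of-stream sentinel; ignore the error if nobody listens.
        let _ = tx.send("data: [DONE]\n\n".to_string());
    });
    rx
}
```

Draining the receiver, for example with `for frame in stream_tokens(3) { print!("{frame}"); }`, yields each frame as soon as the worker sends it, which is the per-token behaviour V1_001 asks for. Having the callback return `false` on a failed send lets a disconnected client stop generation instead of running to the token limit.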

NOT in scope

MoE inference correctness (`qwen3-moe-serve-dispatch-v1` covers it)
KV cache mechanics (M32d / #1832)
Streaming for dense models (already exists via `OwnedQuantizedModelCachedSync` continuous batching)
Tool-call streaming deltas (separate contract)

Test plan

Pure contract YAML; no code touched
CI: `pv validate` (if wired)

🤖 Generated with Claude Code

Registers a new provable contract for per-token SSE streaming on the qwen3_moe chat-completions path. This is the natural follow-up to #1832 (M32d KV cache): pre-M32d, streaming was meaningless because full-prefill-per-token mode took ~30 minutes per 256-token completion. Post-M32d at 9.62 tok/s sustained, per-token SSE emits become valuable for chat UX.

## Falsification gates

- V1_001: chat-completions with stream=true emits per-token SSE events (not buffered into pregenerated SSE)
- V1_002: stream=false still returns a single JSON response (regression)
- V1_003: streaming throughput ≥ 2 tok/s median inter-event time

## Implementation phases (engineer playbook)

- Phase 1 (~2hr): callback variant of run_qwen3_moe_generate
- Phase 2 (~4hr): wire into try_qwen3_moe_backend in cuda_chat_backend.rs
- Phase 3 (~2hr): cargo integration test

Total ~6-8 hours, operator-actionable once #1832 merges.

NOT in scope:

- MoE inference correctness (covered by qwen3-moe-serve-dispatch-v1)
- KV cache mechanics (M32d / #1832)
- Streaming for dense models (already exists via OwnedQuantizedModelCachedSync)
- Tool-call streaming (separate contract)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift and others added 2 commits May 20, 2026 07:52

Merge branch 'main' into spec/qwen3-moe-streaming-sse-v1

bd8b88c

noahgift wants to merge 2 commits into main from spec/qwen3-moe-streaming-sse-v1
