spec: qwen3-moe-streaming-sse-v1 — follow-up contract to M32d KV cache #1835

Open
noahgift wants to merge 2 commits into main from spec/qwen3-moe-streaming-sse-v1

@noahgift
Contributor

Summary

Registers a new provable contract `qwen3-moe-streaming-sse-v1` for per-token SSE streaming on the qwen3_moe chat-completions path. Natural follow-up to #1832 (M32d KV cache).

Why now

Pre-M32d, streaming SSE was meaningless on qwen3_moe — full-prefill-per-token mode at ~0.5 tok/s meant the client would see no output for ~30 minutes before all 256 tokens arrived at once. Post-M32d at 9.62 tok/s sustained, per-token emits become valuable for chat UX.

Contract gates

  • V1_001: chat-completions with `stream=true` emits per-token SSE events (not buffered)
  • V1_002: `stream=false` still returns a single JSON response (regression check)
  • V1_003: streaming throughput ≥ 2 tok/s, measured from the median inter-event time (median gap between SSE events ≤ 500 ms)
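A minimal sketch of how the V1_003 check could be computed, assuming the test harness records one arrival timestamp per SSE event. The function names `median_inter_event` and `meets_throughput_gate` are hypothetical and not part of the contract; the reading that "≥ 2 tok/s" means a median gap of at most 500 ms is an interpretation of the gate, not something the contract text spells out.

```rust
use std::time::Duration;

/// Median of the gaps between consecutive event arrival times.
/// Returns `None` when fewer than two events were observed.
/// (Hypothetical helper; `arrivals` are offsets from the request start.)
fn median_inter_event(arrivals: &[Duration]) -> Option<Duration> {
    if arrivals.len() < 2 {
        return None;
    }
    // Gap between each pair of neighbouring events.
    let mut gaps: Vec<Duration> = arrivals
        .windows(2)
        .map(|w| w[1].saturating_sub(w[0]))
        .collect();
    gaps.sort();
    let mid = gaps.len() / 2;
    Some(if gaps.len() % 2 == 1 {
        gaps[mid]
    } else {
        // Even count: average the two middle gaps.
        (gaps[mid - 1] + gaps[mid]) / 2
    })
}

/// Assumed reading of V1_003: the median gap must not exceed
/// 1 / min_tok_per_s seconds (500 ms for 2 tok/s).
fn meets_throughput_gate(arrivals: &[Duration], min_tok_per_s: f64) -> bool {
    match median_inter_event(arrivals) {
        Some(gap) => gap.as_secs_f64() <= 1.0 / min_tok_per_s,
        None => false,
    }
}
```

Using the median rather than the mean keeps a single slow first token (prefill latency) from failing the gate on its own, which seems to be the intent of phrasing the threshold in terms of inter-event time.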

Implementation phases (engineer playbook)

  • Phase 1 (~2hr): callback variant of `run_qwen3_moe_generate` accepting `&mut dyn FnMut(u32) -> bool`
  • Phase 2 (~4hr): wire into `try_qwen3_moe_backend` — `tokio::task::spawn_blocking` + mpsc channel + axum SSE stream
  • Phase 3 (~2hr): cargo integration test (env-gated, `#[ignore]` by default; mirrors `qwen3_moe_serve_dispatch_v1.rs`)
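The shape of Phases 1 and 2 can be sketched with the standard library alone. This is an illustration under stated assumptions, not the real code: `generate_with_callback` is a hypothetical stand-in for the callback variant of `run_qwen3_moe_generate` (it replays a fixed token list instead of running the model), `std::thread::spawn` and `std::sync::mpsc` stand in for `tokio::task::spawn_blocking` and an async mpsc channel, and the axum SSE layer is reduced to formatting `data:` frames. The JSON payload shape and the `[DONE]` sentinel are also assumptions.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for the callback variant of `run_qwen3_moe_generate`.
/// `on_token` is called once per generated token id; returning `false`
/// asks the generator to stop early (for example, the client went away).
/// Returns the number of tokens handed to the callback.
fn generate_with_callback(
    token_ids: &[u32],
    max_tokens: usize,
    on_token: &mut dyn FnMut(u32) -> bool,
) -> usize {
    let mut emitted = 0;
    for &tok in token_ids.iter().take(max_tokens) {
        emitted += 1;
        if !on_token(tok) {
            break;
        }
    }
    emitted
}

/// Formats one SSE frame: a `data:` line followed by a blank line.
fn sse_frame(payload: &str) -> String {
    format!("data: {}\n\n", payload)
}

/// Runs generation on a worker thread and forwards one SSE frame per token
/// through a channel, so the consumer sees tokens as they are produced.
fn stream_tokens(token_ids: Vec<u32>, max_tokens: usize) -> mpsc::Receiver<String> {
    let (tx, rx) = mpsc::channel::<String>();
    thread::spawn(move || {
        let mut on_token = |tok: u32| {
            // `send` fails once the receiver is dropped; returning `false`
            // then stops generation instead of wasting compute.
            let payload = format!("{{\"token_id\":{}}}", tok);
            tx.send(sse_frame(&payload)).is_ok()
        };
        generate_with_callback(&token_ids, max_tokens, &mut on_token);
        // Assumed end-of-stream sentinel; ignore the error if nobody listens.
        let _ = tx.send(sse_frame("[DONE]"));
    });
    rx
}
```

The `bool` return from the callback is what lets a dropped receiver cancel generation: the closure maps a failed `send` to `false`, and the generator loop exits on the next token.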

Total ~6-8 hours; operator-actionable once #1832 merges.

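NOT in scope

  • MoE inference correctness (covered by `qwen3-moe-serve-dispatch-v1`)
  • KV cache mechanics (M32d / #1832)
  • Streaming for dense models (already exists via `OwnedQuantizedModelCachedSync`)
  • Tool-call streaming (separate contract)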

Test plan

  • Pure contract YAML; no code touched
  • CI: `pv validate` (if wired)

🤖 Generated with Claude Code

noahgift and others added 2 commits May 20, 2026 07:52
Registers a new provable contract for per-token SSE streaming on the
qwen3_moe chat-completions path. This is the natural follow-up to
#1832 (M32d KV cache) — pre-M32d, streaming was
meaningless because full-prefill-per-token mode took ~30 minutes per
256-token completion. Post-M32d at 9.62 tok/s sustained, per-token
SSE emits become valuable for chat UX.

## Falsification gates

- V1_001: chat-completions with stream=true emits per-token SSE events
  (not buffered into pregenerated SSE)
- V1_002: stream=false still returns a single JSON response (regression)
- V1_003: streaming throughput ≥ 2 tok/s, measured from the median
  inter-event time (median gap between SSE events ≤ 500 ms)

## Implementation phases (engineer playbook)

- Phase 1 (~2hr): callback variant of run_qwen3_moe_generate
- Phase 2 (~4hr): wire into try_qwen3_moe_backend in cuda_chat_backend.rs
- Phase 3 (~2hr): cargo integration test

Total ~6-8 hours, operator-actionable once #1832 merges.

NOT in scope:
- MoE inference correctness (covered by qwen3-moe-serve-dispatch-v1)
- KV cache mechanics (M32d / #1832)
- Streaming for dense models (already exists via OwnedQuantizedModelCachedSync)
- Tool-call streaming (separate contract)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>