feat(M32d): KV cache for qwen3_moe inference path — 19× speedup#1832
Open
noahgift wants to merge 2 commits into
Open
feat(M32d): KV cache for qwen3_moe inference path — 19× speedup#1832noahgift wants to merge 2 commits into
noahgift wants to merge 2 commits into
Conversation
This was referenced May 20, 2026
noahgift
added a commit
to paiml/claude-code-parity-apr
that referenced
this pull request
May 20, 2026
…spatch ready (#254) Upstream M32d KV cache for qwen3_moe inference path shipped at paiml/aprender#1832 (open; in CI). Operator flipped from Option (b) engineer-driven (#1829) to Option (a) in-session implementation. Empirical (on Qwen3-Coder-30B-A3B-Instruct-Q4_K_M): - Pre-M32d: ~0.5 tok/s (bench timed out on every per-turn budget) - Post-M32d: 9.62 tok/s sustained on 32-token gen (19× speedup) - Numerical equivalence vs full-prefill: byte-identical greedy outputs - V1_001 + V1_003 (#1819 cargo test) regression: stable V1_004 prerequisite (M32d KV cache) NOW MET. Bench discharge is operator-actionable on a tractable ~10hr wall. ## Files ### NEW: `evidence/phase-6/m32d-shipped-2026-05-20.md` - Upstream empirical results table - New cargo tests pinning the invariants (equivalence + perf floor) - V1_004 dispatch checklist (7 operator steps) - Cross-references to all upstream PRs ### MODIFIED: `evidence/phase-6/1.5b-calibration-run.md` - aprender#1789 line: V1_004 status flipped from BLOCKED to "prerequisite MET 2026-05-20 via M32d" - Updated PR list with all 7 follow-up PRs (#1806, #1807, #1812, #1814, #1819, #1826, #1832) - Added cross-reference to m32d-shipped-2026-05-20.md ## What this is NOT - NOT a CCPA-side code change (bench script + analyzer + harness unchanged) - NOT the V1_004 bench dispatch itself (operator-coordinated, ~10hr wall) - NOT a new CCPA contract gate (V1_004 is unchanged; only its prerequisite flipped) Mechanical doc update. M-counter NOT bumped per the discipline doctrine. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2 tasks
Implements M32d KV cache support on the qwen3_moe inference path. Discharges the prerequisite for FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004 in contracts/qwen3-moe-serve-dispatch-v1.yaml (v1.1.1 → v1.2.0). ## Empirical results On Qwen3-Coder-30B-A3B-Instruct-Q4_K_M: - **Pre-M32d**: ~0.5 tok/s (full-prefill-per-token; bench timed out on every per-turn budget — 5 timeout-class dispatches recorded in paiml/claude-code-parity-apr evidence/phase-6/30b-moe-empirical- 2026-05-19.md) - **Post-M32d**: **9.62 tok/s sustained** on 32-token generation (19× speedup; comfortably above the ≥ 5 tok/s scope target) - **Numerical equivalence**: byte-identical greedy outputs vs full- prefill on 4-token reference (cache-on=553ms vs cache-off=1002ms; ~2× speedup even at small token counts; gap compounds with length) - **V1_001 + V1_003 regression**: existing #1819 cargo test still passes (9.39s wall, content "Human: What", no matmul guard fire) ## Implementation **New function**: `OwnedQuantizedModel::forward_single_qwen3_moe_with_cache` in `crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe.rs`. Mirrors the dense `forward_single_with_cache` reference (ffn_block.rs: ~165) step-for-step EXCEPT at the FFN block, where it calls `moe_ffn_forward_layer` (router → top-k expert select → per-expert SwiGLU → weighted sum → down projection) instead of the dense gate/up/down dispatch. Attention block (QKV proj, per-head Q/K RMSNorm, RoPE at `position`, GQA-aware cached attention, output projection, residual) is byte-identical to the dense reference. **Generate loop rewrite**: `crates/aprender-serve/src/infer/ qwen3_moe_generate.rs::run_qwen3_moe_generate` now: 1. Allocates `OwnedQuantizedKVCache` sized to `max(REALIZR_CONTEXT_LENGTH, prompt_len + max_tokens + 8)` 2. Prefill: per prompt token, calls `forward_single_qwen3_moe_with_cache` (cache fills incrementally; final iteration's logits seed decode) 3. Decode: greedy-argmax → append → next cache-aware forward 4. Stop on `stop_tokens` or `max_tokens` exhausted or cache full **Visibility fix**: `single_cache_final_output` in `ffn_block.rs` bumped to `pub(crate)` so the MoE function can reuse the dense final- norm + LM head path unchanged. Same edit applied to the orphan `debug.rs` duplicate for hygiene (it's not in the build graph but mirrors ffn_block.rs). ## New tests (both `#[ignore]`'d, env-gated) - `crates/aprender-serve/tests/moe_kv_cache_equivalence.rs` — Generates 4 tokens via M32d cache-on path AND a legacy full-prefill loop. Asserts greedy outputs byte-identical. Pinned 553ms vs 1002ms perf numbers in eprintln output. - `crates/aprender-serve/tests/m32d_perf.rs` — Generates 32 tokens; asserts sustained throughput ≥ 5 tok/s. Floor pinned via `M32D_TPS_FLOOR` constant. Catches future KV-cache regressions. Activation: ``` QWEN3_MOE_GGUF_PATH=/path/to/qwen3-moe.gguf \ cargo test --test moe_kv_cache_equivalence --test m32d_perf \ -p aprender-serve --features cuda --release -- --ignored --nocapture ``` ## Risk assessment vs scope doc All 6 risk surfaces from `docs/specifications/m32d-moe-kv-cache-scope.md` were addressed: 1. **Numerical equivalence**: PASSED. Greedy argmax robust to ULP-scale logit drift; 4-token sequence byte-identical to full-prefill. 2. **Dense path regression**: NONE. Dense `forward_single_with_cache` not touched (only its sibling `single_cache_final_output` visibility bumped, which doesn't change semantics). 3. **RoPE position offset**: handled via `position` parameter passed to `apply_rope` (same pattern as dense reference). 4. **GQA expansion**: handled via `kv_dim()` config method (same as dense reference); first-token edge case (empty cache) explicitly handled by expanding V across Q heads. 5. **Expert routing under cache**: confirmed unaffected — router reads from current-token hidden state only. 6. **Streaming SSE for free**: structurally enabled but not wired into the chat handler (separate follow-up contract). ## Contract bump v1.1.1 → v1.2.0: - V1_004 entry gains `prerequisite_status` field documenting M32d shipped + empirical throughput numbers - `evidence` field updated with the post-M32d operator dispatch recipe - status_history appends v1.2.0 entry ## Companion-side downstream paiml/claude-code-parity-apr operator can now dispatch Phase 6 bench: ``` APR_MODEL=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \ PHASE6_COMPLIANCE_ENFORCED=1 \ PHASE6_MAX_TURNS=20 PHASE6_WALL_SECONDS=3600 \ APR_TIMEOUT_S=900 APR_AGENT_HTTP_TIMEOUT_S=1500 \ APR_AGENT_MAX_TOKENS_CAP=1024 \ bash scripts/phase-6-bench.sh ``` Expected ~10 hour wall on full 20-fixture corpus at 9.62 tok/s sustained. Acceptance: `evidence/under-contract/scores.json` with `student_pass_rate > 0` discharges V1_004 + lifts CCPA M280 suspension. ## Supersedes #1829 (Option b engineer-playbook + V1_004 status formalization) — operator flipped from Option (b) to Option (a) in-session; this PR delivers the actual implementation. #1829 can be closed as superseded. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1433ac7 to
4a04aae
Compare
Closed
2 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements M32d KV cache for the qwen3_moe inference path. Discharges the prerequisite for FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004 in `contracts/qwen3-moe-serve-dispatch-v1.yaml` (v1.1.1 → v1.2.0). Empirical: 19× speedup on Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.
Operator flipped from Option (b) engineer-driven (#1829) to Option (a) in-session. This PR supersedes #1829.
Empirical results
All measurements on Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.
Implementation
Risk surfaces from scope doc (all cleared)
New tests
Both `#[ignore]` + env-gated on `QWEN3_MOE_GGUF_PATH`.
V1_001 + V1_003 regression
Existing #1819 cargo test still passes: 9.39s wall, content "Human: What", no matmul guard fire.
Companion-side downstream
paiml/claude-code-parity-apr operator can now dispatch Phase 6 bench:
```
APR_MODEL=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
PHASE6_COMPLIANCE_ENFORCED=1 \
PHASE6_MAX_TURNS=20 PHASE6_WALL_SECONDS=3600 \
APR_TIMEOUT_S=900 APR_AGENT_HTTP_TIMEOUT_S=1500 \
APR_AGENT_MAX_TOKENS_CAP=1024 \
bash scripts/phase-6-bench.sh
```
Expected ~10 hour wall on full 20-fixture corpus at sustained 9.62 tok/s. Acceptance: `evidence/under-contract/scores.json` with `student_pass_rate > 0` discharges V1_004 + lifts CCPA M280 suspension.
Test plan
🤖 Generated with Claude Code