feat(M32d): KV cache for qwen3_moe inference path — 19× speedup by noahgift · Pull Request #1832 · paiml/aprender

noahgift · 2026-05-20T05:34:05Z

Summary

Implements M32d KV cache for the qwen3_moe inference path. Discharges the prerequisite for FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004 in `contracts/qwen3-moe-serve-dispatch-v1.yaml` (v1.1.1 → v1.2.0). Empirical: 19× speedup on Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.

Operator flipped from Option (b) engineer-driven (#1829) to Option (a) in-session. This PR supersedes #1829.

Empirical results

| Metric | Pre-M32d | Post-M32d | Speedup |
|---|---|---|---|
| Sustained throughput (32 tok) | ~0.5 tok/s | 9.62 tok/s | 19× |
| Wall on 4 tokens | 1002 ms | 553 ms | 1.8× |
| Greedy output equivalence | n/a | byte-identical | ✓ |

All measurements on Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.
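The speedup figures follow directly from the measured numbers. As a quick check, the arithmetic can be sketched as below; the function names are illustrative and not part of the PR.

```rust
/// Ratio of new throughput to old throughput (higher tok/s is better).
fn throughput_speedup(before_tps: f64, after_tps: f64) -> f64 {
    after_tps / before_tps
}

/// Ratio of old wall time to new wall time (lower ms is better).
fn wall_speedup(before_ms: f64, after_ms: f64) -> f64 {
    before_ms / after_ms
}

fn main() {
    // Figures taken from the results table: ~0.5 -> 9.62 tok/s, 1002 ms -> 553 ms.
    let sustained = throughput_speedup(0.5, 9.62);
    let short_run = wall_speedup(1002.0, 553.0);
    println!("sustained: {sustained:.1}x, 4-token wall: {short_run:.1}x");
}
```

The pre-M32d figure is approximate (~0.5 tok/s), so the 19× ratio is approximate as well.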

Implementation

New: `OwnedQuantizedModel::forward_single_qwen3_moe_with_cache` in `forward_qwen3_moe.rs`. Mirrors the dense `forward_single_with_cache` step-for-step EXCEPT the FFN block, which calls `moe_ffn_forward_layer` (router → top-k → per-expert SwiGLU) instead of dense gate/up/down dispatch.
Rewrite: `run_qwen3_moe_generate` now uses cache-aware decode: prefill per-prompt-token + decode per-output-token, both via the new single-token function.
Visibility fix: `single_cache_final_output` in `ffn_block.rs` → `pub(crate)` so MoE path reuses the dense final-norm + LM head unchanged.
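The shape of the cache-aware generate loop can be sketched with a toy model. This is a minimal illustration, not the code in `run_qwen3_moe_generate`: `ToyKvCache`, `toy_forward_single_with_cache`, and the "next token is current token plus one" rule are all hypothetical stand-ins, and whether the real loop appends a stop token before halting is an assumption (here it does not). The cache sizing follows the `max(context_length, prompt_len + max_tokens + 8)` rule stated in the commit message.

```rust
/// Toy stand-in for a KV cache: it only records which positions were stored.
struct ToyKvCache {
    positions: Vec<usize>,
    capacity: usize,
}

impl ToyKvCache {
    fn new(capacity: usize) -> Self {
        Self { positions: Vec::with_capacity(capacity), capacity }
    }
    fn is_full(&self) -> bool {
        self.positions.len() >= self.capacity
    }
}

/// Hypothetical single-token forward pass: stores this position in the cache
/// and returns logits. Toy rule: the peak is at (token + 1) % vocab.
fn toy_forward_single_with_cache(
    token: u32,
    position: usize,
    cache: &mut ToyKvCache,
    vocab: usize,
) -> Vec<f32> {
    cache.positions.push(position);
    let mut logits = vec![0.0; vocab];
    logits[(token as usize + 1) % vocab] = 1.0;
    logits
}

/// Greedy selection: index of the largest logit (first one wins on ties).
fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best as u32
}

fn generate(
    prompt: &[u32],
    max_tokens: usize,
    stop_tokens: &[u32],
    context_len: usize,
    vocab: usize,
) -> Vec<u32> {
    // Cache sizing rule from the commit message.
    let capacity = context_len.max(prompt.len() + max_tokens + 8);
    let mut cache = ToyKvCache::new(capacity);

    // Prefill: one forward per prompt token; the last logits seed decode.
    let mut logits = Vec::new();
    for (pos, &tok) in prompt.iter().enumerate() {
        logits = toy_forward_single_with_cache(tok, pos, &mut cache, vocab);
    }

    let mut out = Vec::new();
    if logits.is_empty() {
        return out; // empty prompt: nothing to decode from
    }

    // Decode: greedy argmax, append, next cache-aware forward.
    while out.len() < max_tokens {
        let next = argmax(&logits);
        if stop_tokens.contains(&next) {
            break; // assumption: the stop token itself is not emitted
        }
        out.push(next);
        if out.len() == max_tokens || cache.is_full() {
            break;
        }
        let pos = prompt.len() + out.len() - 1;
        logits = toy_forward_single_with_cache(next, pos, &mut cache, vocab);
    }
    out
}

fn main() {
    // Prompt [1, 2, 3]; token 7 is the stop token.
    let out = generate(&[1, 2, 3], 4, &[7], 16, 10);
    println!("{out:?}"); // prints [4, 5, 6]
}
```

Each decode step costs one single-token forward instead of re-running the whole sequence, which is where the gap over full-prefill-per-token comes from as output length grows.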

Risk surfaces from scope doc (all cleared)

✅ Numerical equivalence — byte-identical greedy outputs on 4-token reference
✅ Dense path regression — `forward_single_with_cache` untouched
✅ RoPE position offset — handled via `position` parameter (same pattern)
✅ GQA expansion — handled via `kv_dim()` + first-token edge case explicit
✅ Expert routing under cache — confirmed unaffected
◯ Streaming SSE — structurally enabled; not wired (separate follow-up)
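The "expert routing under cache" item holds because routing is a function of the current token's hidden state alone. A minimal sketch of top-k routing, assuming softmax over router logits followed by renormalisation of the selected weights (the exact normalisation used by `moe_ffn_forward_layer` is not stated in this PR, and `route_top_k` is a hypothetical name):

```rust
/// Softmax over router logits, stabilised by subtracting the maximum.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Select the k highest-probability experts for ONE token and renormalise
/// their weights to sum to 1. Note the signature: no cache, no past tokens.
fn route_top_k(router_logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    let probs = softmax(router_logits);
    let mut indexed: Vec<(usize, f32)> = probs.into_iter().enumerate().collect();
    // Sort descending by probability.
    indexed.sort_by(|a, b| b.1.total_cmp(&a.1));
    indexed.truncate(k);
    let total: f32 = indexed.iter().map(|&(_, p)| p).sum();
    indexed.into_iter().map(|(i, p)| (i, p / total)).collect()
}

fn main() {
    // Router logits for a single token over 4 toy experts.
    let selected = route_top_k(&[0.1, 2.0, -1.0, 1.5], 2);
    for (expert, weight) in &selected {
        println!("expert {expert}: weight {weight:.3}");
    }
}
```

Because the router takes only the current token's logits as input, caching K and V for earlier positions cannot change which experts are chosen.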

New tests

`crates/aprender-serve/tests/moe_kv_cache_equivalence.rs` — generates 4 tokens via cache-on AND legacy full-prefill, asserts greedy outputs byte-identical
`crates/aprender-serve/tests/m32d_perf.rs` — asserts ≥ 5 tok/s sustained on 32-token gen (pinned floor)

Both `#[ignore]` + env-gated on `QWEN3_MOE_GGUF_PATH`.
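The gating pattern can be sketched as follows. This is an illustration under assumptions, not the contents of `m32d_perf.rs`: `model_path` and `meets_floor` are hypothetical helpers, and the test body omits model loading. The `M32D_TPS_FLOOR` name and the 5 tok/s value come from the commit message.

```rust
use std::path::PathBuf;
use std::time::Duration;

/// Pinned throughput floor in tokens per second.
const M32D_TPS_FLOOR: f64 = 5.0;

/// Turn the raw environment value into a model path; unset or empty means "skip".
fn model_path(raw: Option<String>) -> Option<PathBuf> {
    raw.filter(|s| !s.is_empty()).map(PathBuf::from)
}

/// True when `tokens` generated in `elapsed` meets the pinned floor.
fn meets_floor(tokens: usize, elapsed: Duration) -> bool {
    tokens as f64 / elapsed.as_secs_f64() >= M32D_TPS_FLOOR
}

#[test]
#[ignore] // needs a large GGUF file; run with `cargo test -- --ignored`
fn perf_floor_sketch() {
    let Some(_path) = model_path(std::env::var("QWEN3_MOE_GGUF_PATH").ok()) else {
        eprintln!("QWEN3_MOE_GGUF_PATH not set; skipping");
        return;
    };
    // A real test would load the model at `_path`, time a 32-token
    // generation, and assert `meets_floor(32, elapsed)`.
}

fn main() {
    // 32 tokens in 4 s is 8 tok/s: above the floor.
    println!("{}", meets_floor(32, Duration::from_secs(4)));
    // 32 tokens in 64 s is 0.5 tok/s: below the floor.
    println!("{}", meets_floor(32, Duration::from_secs(64)));
}
```

Combining `#[ignore]` with the environment check keeps the default `cargo test` run fast and lets the test exit cleanly on machines without the model file.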

V1_001 + V1_003 regression

Existing #1819 cargo test still passes: 9.39s wall, content "Human: What", no matmul guard fire.

Companion-side downstream

paiml/claude-code-parity-apr operator can now dispatch Phase 6 bench:

```
APR_MODEL=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
PHASE6_COMPLIANCE_ENFORCED=1 \
PHASE6_MAX_TURNS=20 PHASE6_WALL_SECONDS=3600 \
APR_TIMEOUT_S=900 APR_AGENT_HTTP_TIMEOUT_S=1500 \
APR_AGENT_MAX_TOKENS_CAP=1024 \
bash scripts/phase-6-bench.sh
```

Expected ~10 hour wall on full 20-fixture corpus at sustained 9.62 tok/s. Acceptance: `evidence/under-contract/scores.json` with `student_pass_rate > 0` discharges V1_004 + lifts CCPA M280 suspension.

Test plan

`cargo check -p aprender-serve --lib --features cuda` — clean
`cargo check --test moe_kv_cache_equivalence --test m32d_perf` — clean
`QWEN3_MOE_GGUF_PATH=... cargo test moe_kv_cache_equivalence --release -- --ignored` — PASS, 4 tokens byte-identical
`QWEN3_MOE_GGUF_PATH=... cargo test m32d_perf --release -- --ignored` — PASS, 9.62 tok/s sustained
`QWEN3_MOE_GGUF_PATH=... cargo test qwen3_moe_serve_dispatch_v1 --release -- --ignored` (regression check against the test added in #1819, "fix(#1789): V1_001 + V1_003 integration test against real Qwen3-MoE GGUF"): PASS
CI (sovereign-ci full workflow)
Companion-side CCPA Phase 6 V1_004 discharge bench (operator-coordinated, ~10 hr wall)

🤖 Generated with Claude Code

…spatch ready (#254)

Upstream M32d KV cache for qwen3_moe inference path shipped at paiml/aprender#1832 (open; in CI). Operator flipped from Option (b) engineer-driven (#1829) to Option (a) in-session implementation.

Empirical (on Qwen3-Coder-30B-A3B-Instruct-Q4_K_M):

- Pre-M32d: ~0.5 tok/s (bench timed out on every per-turn budget)
- Post-M32d: 9.62 tok/s sustained on 32-token gen (19× speedup)
- Numerical equivalence vs full-prefill: byte-identical greedy outputs
- V1_001 + V1_003 (#1819 cargo test) regression: stable

V1_004 prerequisite (M32d KV cache) NOW MET. Bench discharge is operator-actionable on a tractable ~10hr wall.

## Files

### NEW: `evidence/phase-6/m32d-shipped-2026-05-20.md`

- Upstream empirical results table
- New cargo tests pinning the invariants (equivalence + perf floor)
- V1_004 dispatch checklist (7 operator steps)
- Cross-references to all upstream PRs

### MODIFIED: `evidence/phase-6/1.5b-calibration-run.md`

- aprender#1789 line: V1_004 status flipped from BLOCKED to "prerequisite MET 2026-05-20 via M32d"
- Updated PR list with all 7 follow-up PRs (#1806, #1807, #1812, #1814, #1819, #1826, #1832)
- Added cross-reference to m32d-shipped-2026-05-20.md

## What this is NOT

- NOT a CCPA-side code change (bench script + analyzer + harness unchanged)
- NOT the V1_004 bench dispatch itself (operator-coordinated, ~10hr wall)
- NOT a new CCPA contract gate (V1_004 is unchanged; only its prerequisite flipped)

Mechanical doc update. M-counter NOT bumped per the discipline doctrine.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

Implements M32d KV cache support on the qwen3_moe inference path. Discharges the prerequisite for FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004 in contracts/qwen3-moe-serve-dispatch-v1.yaml (v1.1.1 → v1.2.0).

## Empirical results

On Qwen3-Coder-30B-A3B-Instruct-Q4_K_M:

- **Pre-M32d**: ~0.5 tok/s (full-prefill-per-token; bench timed out on every per-turn budget; 5 timeout-class dispatches recorded in paiml/claude-code-parity-apr evidence/phase-6/30b-moe-empirical-2026-05-19.md)
- **Post-M32d**: **9.62 tok/s sustained** on 32-token generation (19× speedup; comfortably above the ≥ 5 tok/s scope target)
- **Numerical equivalence**: byte-identical greedy outputs vs full-prefill on 4-token reference (cache-on=553ms vs cache-off=1002ms; ~2× speedup even at small token counts; gap compounds with length)
- **V1_001 + V1_003 regression**: existing #1819 cargo test still passes (9.39s wall, content "Human: What", no matmul guard fire)

## Implementation

**New function**: `OwnedQuantizedModel::forward_single_qwen3_moe_with_cache` in `crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe.rs`. Mirrors the dense `forward_single_with_cache` reference (ffn_block.rs:~165) step-for-step EXCEPT at the FFN block, where it calls `moe_ffn_forward_layer` (router → top-k expert select → per-expert SwiGLU → weighted sum → down projection) instead of the dense gate/up/down dispatch. Attention block (QKV proj, per-head Q/K RMSNorm, RoPE at `position`, GQA-aware cached attention, output projection, residual) is byte-identical to the dense reference.

**Generate loop rewrite**: `crates/aprender-serve/src/infer/qwen3_moe_generate.rs::run_qwen3_moe_generate` now:

1. Allocates `OwnedQuantizedKVCache` sized to `max(REALIZR_CONTEXT_LENGTH, prompt_len + max_tokens + 8)`
2. Prefill: per prompt token, calls `forward_single_qwen3_moe_with_cache` (cache fills incrementally; final iteration's logits seed decode)
3. Decode: greedy-argmax → append → next cache-aware forward
4. Stop on `stop_tokens` or `max_tokens` exhausted or cache full

**Visibility fix**: `single_cache_final_output` in `ffn_block.rs` bumped to `pub(crate)` so the MoE function can reuse the dense final-norm + LM head path unchanged. Same edit applied to the orphan `debug.rs` duplicate for hygiene (it's not in the build graph but mirrors ffn_block.rs).

## New tests (both `#[ignore]`'d, env-gated)

- `crates/aprender-serve/tests/moe_kv_cache_equivalence.rs`: Generates 4 tokens via M32d cache-on path AND a legacy full-prefill loop. Asserts greedy outputs byte-identical. Pinned 553ms vs 1002ms perf numbers in eprintln output.
- `crates/aprender-serve/tests/m32d_perf.rs`: Generates 32 tokens; asserts sustained throughput ≥ 5 tok/s. Floor pinned via `M32D_TPS_FLOOR` constant. Catches future KV-cache regressions.

Activation:

```
QWEN3_MOE_GGUF_PATH=/path/to/qwen3-moe.gguf \
cargo test --test moe_kv_cache_equivalence --test m32d_perf \
  -p aprender-serve --features cuda --release -- --ignored --nocapture
```

## Risk assessment vs scope doc

All 6 risk surfaces from `docs/specifications/m32d-moe-kv-cache-scope.md` were addressed:

1. **Numerical equivalence**: PASSED. Greedy argmax robust to ULP-scale logit drift; 4-token sequence byte-identical to full-prefill.
2. **Dense path regression**: NONE. Dense `forward_single_with_cache` not touched (only its sibling `single_cache_final_output` visibility bumped, which doesn't change semantics).
3. **RoPE position offset**: handled via `position` parameter passed to `apply_rope` (same pattern as dense reference).
4. **GQA expansion**: handled via `kv_dim()` config method (same as dense reference); first-token edge case (empty cache) explicitly handled by expanding V across Q heads.
5. **Expert routing under cache**: confirmed unaffected; router reads from current-token hidden state only.
6. **Streaming SSE for free**: structurally enabled but not wired into the chat handler (separate follow-up contract).

## Contract bump

v1.1.1 → v1.2.0:

- V1_004 entry gains `prerequisite_status` field documenting M32d shipped + empirical throughput numbers
- `evidence` field updated with the post-M32d operator dispatch recipe
- status_history appends v1.2.0 entry

## Companion-side downstream

paiml/claude-code-parity-apr operator can now dispatch Phase 6 bench:

```
APR_MODEL=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
PHASE6_COMPLIANCE_ENFORCED=1 \
PHASE6_MAX_TURNS=20 PHASE6_WALL_SECONDS=3600 \
APR_TIMEOUT_S=900 APR_AGENT_HTTP_TIMEOUT_S=1500 \
APR_AGENT_MAX_TOKENS_CAP=1024 \
bash scripts/phase-6-bench.sh
```

Expected ~10 hour wall on full 20-fixture corpus at 9.62 tok/s sustained. Acceptance: `evidence/under-contract/scores.json` with `student_pass_rate > 0` discharges V1_004 + lifts CCPA M280 suspension.

## Supersedes

#1829 (Option b engineer-playbook + V1_004 status formalization). Operator flipped from Option (b) to Option (a) in-session; this PR delivers the actual implementation. #1829 can be closed as superseded.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift mentioned this pull request May 20, 2026

spec: qwen3-moe-streaming-sse-v1 — follow-up contract to M32d KV cache #1835

Open

2 tasks

noahgift force-pushed the feat/m32d-moe-kv-cache branch from 1433ac7 to 4a04aae Compare May 20, 2026 05:53

noahgift enabled auto-merge (squash) May 20, 2026 06:04

noahgift mentioned this pull request May 20, 2026

refactor(forward): lift attention_layer_with_cache helper (M32d Day 1 prep, #1830 PR-1 of 4) #1831

Closed

2 tasks

Merge branch 'main' into feat/m32d-moe-kv-cache

0709293

noahgift wants to merge 2 commits into main from feat/m32d-moe-kv-cache
