feat(M32d): KV cache for qwen3_moe inference path — 19× speedup#1832

Open
noahgift wants to merge 2 commits into main from feat/m32d-moe-kv-cache

@noahgift
Contributor

Summary

Implements M32d KV cache for the qwen3_moe inference path. Discharges the prerequisite for FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004 in `contracts/qwen3-moe-serve-dispatch-v1.yaml` (v1.1.1 → v1.2.0). Empirical: 19× speedup on Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.

Operator flipped from Option (b) engineer-driven (#1829) to Option (a) in-session. This PR supersedes #1829.

Empirical results

| Metric | Pre-M32d | Post-M32d | Speedup |
|---|---|---|---|
| Sustained throughput (32 tok) | ~0.5 tok/s | 9.62 tok/s | 19× |
| Wall on 4 tokens | 1002 ms | 553 ms | 1.8× |

Greedy output equivalence: byte-identical.

All measurements on Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.

Implementation

  • New: `OwnedQuantizedModel::forward_single_qwen3_moe_with_cache` in `forward_qwen3_moe.rs`. Mirrors the dense `forward_single_with_cache` step-for-step EXCEPT the FFN block, which calls `moe_ffn_forward_layer` (router → top-k → per-expert SwiGLU) instead of dense gate/up/down dispatch.
  • Rewrite: `run_qwen3_moe_generate` now uses cache-aware decode: prefill per-prompt-token + decode per-output-token, both via the new single-token function.
  • Visibility fix: `single_cache_final_output` in `ffn_block.rs` → `pub(crate)` so MoE path reuses the dense final-norm + LM head unchanged.
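The FFN substitution described above (router → top-k → per-expert SwiGLU) can be sketched roughly as follows. This is a minimal illustration on dense `f32` weights, not the PR's `moe_ffn_forward_layer`, which operates on quantized GGUF tensors. The names `Expert`, `top_k_route`, and `moe_ffn` are hypothetical, and normalizing the routing weights with a softmax over only the selected logits is an assumption.

```rust
/// Hypothetical dense expert. `gate` and `up` are [ffn_dim][hidden]; `down` is [hidden][ffn_dim].
struct Expert {
    gate: Vec<Vec<f32>>,
    up: Vec<Vec<f32>>,
    down: Vec<Vec<f32>>,
}

/// Row-major matrix-vector product.
fn matvec(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

fn silu(v: f32) -> f32 {
    v / (1.0 + (-v).exp())
}

/// SwiGLU for one expert: down(silu(gate·x) ⊙ (up·x)).
fn swiglu(e: &Expert, x: &[f32]) -> Vec<f32> {
    let g = matvec(&e.gate, x);
    let u = matvec(&e.up, x);
    let h: Vec<f32> = g.iter().zip(&u).map(|(g, u)| silu(*g) * u).collect();
    matvec(&e.down, &h)
}

/// Picks the k largest router logits and returns (expert index, weight),
/// with weights given by a softmax over just those k logits (assumption).
fn top_k_route(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    // Stable sort, descending: ties keep the lower expert index first.
    idx.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));
    idx.truncate(k);
    let max = idx.iter().map(|&i| logits[i]).fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = idx.iter().map(|&i| (logits[i] - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    idx.into_iter().zip(exps).map(|(i, e)| (i, e / sum)).collect()
}

/// MoE FFN for a single token: route, run the selected experts, sum their weighted outputs.
fn moe_ffn(router: &[Vec<f32>], experts: &[Expert], x: &[f32], k: usize) -> Vec<f32> {
    let logits = matvec(router, x);
    let mut out = vec![0.0f32; x.len()];
    for (i, w) in top_k_route(&logits, k) {
        for (o, y) in out.iter_mut().zip(swiglu(&experts[i], x)) {
            *o += w * y;
        }
    }
    out
}
```

Because `moe_ffn` reads only the current token's hidden state `x`, nothing in it depends on the KV cache, which is why only the attention block needs cache awareness.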

Risk surfaces from scope doc (all cleared)

  1. ✅ Numerical equivalence — byte-identical greedy outputs on 4-token reference
  2. ✅ Dense path regression — `forward_single_with_cache` untouched
  3. ✅ RoPE position offset — handled via `position` parameter (same pattern)
  4. ✅ GQA expansion — handled via `kv_dim()` + first-token edge case explicit
  5. ✅ Expert routing under cache — confirmed unaffected
  6. ◯ Streaming SSE — structurally enabled; not wired (separate follow-up)
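As a rough illustration of item 1: greedy decoding picks the argmax of the logits, so the cached and full-prefill paths emit the same token whenever the per-logit difference between them stays below half the gap between the top two logits. The helpers below are hypothetical and are not taken from the equivalence test.

```rust
/// Index of the largest logit; ties resolve to the lowest index.
fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best
}

/// Gap between the largest and second-largest logit.
/// If every logit drifts by less than half of this margin, argmax is unchanged.
fn top2_margin(logits: &[f32]) -> f32 {
    let top = argmax(logits);
    let runner_up = logits
        .iter()
        .enumerate()
        .filter(|(i, _)| *i != top)
        .map(|(_, &v)| v)
        .fold(f32::NEG_INFINITY, f32::max);
    logits[top] - runner_up
}
```

A byte-identical token sequence is therefore a weaker claim than bit-identical logits: it shows the drift never crossed a decision boundary on the reference prompt.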

New tests

  • `crates/aprender-serve/tests/moe_kv_cache_equivalence.rs` — generates 4 tokens via cache-on AND legacy full-prefill, asserts greedy outputs byte-identical
  • `crates/aprender-serve/tests/m32d_perf.rs` — asserts ≥ 5 tok/s sustained on 32-token gen (pinned floor)

Both `#[ignore]` + env-gated on `QWEN3_MOE_GGUF_PATH`.

V1_001 + V1_003 regression

Existing #1819 cargo test still passes: 9.39s wall, content "Human: What", no matmul guard fire.

Companion-side downstream

paiml/claude-code-parity-apr operator can now dispatch Phase 6 bench:

```
APR_MODEL=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
PHASE6_COMPLIANCE_ENFORCED=1 \
PHASE6_MAX_TURNS=20 PHASE6_WALL_SECONDS=3600 \
APR_TIMEOUT_S=900 APR_AGENT_HTTP_TIMEOUT_S=1500 \
APR_AGENT_MAX_TOKENS_CAP=1024 \
bash scripts/phase-6-bench.sh
```

Expected ~10 hour wall on full 20-fixture corpus at sustained 9.62 tok/s. Acceptance: `evidence/under-contract/scores.json` with `student_pass_rate > 0` discharges V1_004 + lifts CCPA M280 suspension.

Test plan

  • `cargo check -p aprender-serve --lib --features cuda` — clean
  • `cargo check --test moe_kv_cache_equivalence --test m32d_perf` — clean
  • `QWEN3_MOE_GGUF_PATH=... cargo test moe_kv_cache_equivalence --release -- --ignored` — PASS, 4 tokens byte-identical
  • `QWEN3_MOE_GGUF_PATH=... cargo test m32d_perf --release -- --ignored` — PASS, 9.62 tok/s sustained
  • `QWEN3_MOE_GGUF_PATH=... cargo test qwen3_moe_serve_dispatch_v1 --release -- --ignored` (regression check against the #1819 V1_001 + V1_003 integration test) — PASS
  • CI (sovereign-ci full workflow)
  • Companion-side CCPA Phase 6 V1_004 discharge bench (operator-coordinated, ~10 hr wall)

🤖 Generated with Claude Code

noahgift added a commit to paiml/claude-code-parity-apr that referenced this pull request May 20, 2026
…spatch ready (#254)

Upstream M32d KV cache for qwen3_moe inference path shipped at
paiml/aprender#1832 (open; in CI). Operator flipped from Option (b)
engineer-driven (#1829) to Option (a) in-session implementation.

Empirical (on Qwen3-Coder-30B-A3B-Instruct-Q4_K_M):
- Pre-M32d: ~0.5 tok/s (bench timed out on every per-turn budget)
- Post-M32d: 9.62 tok/s sustained on 32-token gen (19× speedup)
- Numerical equivalence vs full-prefill: byte-identical greedy outputs
- V1_001 + V1_003 (#1819 cargo test) regression: stable

V1_004 prerequisite (M32d KV cache) NOW MET. Bench discharge is
operator-actionable on a tractable ~10hr wall.

## Files

### NEW: `evidence/phase-6/m32d-shipped-2026-05-20.md`

- Upstream empirical results table
- New cargo tests pinning the invariants (equivalence + perf floor)
- V1_004 dispatch checklist (7 operator steps)
- Cross-references to all upstream PRs

### MODIFIED: `evidence/phase-6/1.5b-calibration-run.md`

- aprender#1789 line: V1_004 status flipped from BLOCKED to "prerequisite MET 2026-05-20 via M32d"
- Updated PR list with all 7 follow-up PRs (#1806, #1807, #1812, #1814, #1819, #1826, #1832)
- Added cross-reference to m32d-shipped-2026-05-20.md

## What this is NOT

- NOT a CCPA-side code change (bench script + analyzer + harness unchanged)
- NOT the V1_004 bench dispatch itself (operator-coordinated, ~10hr wall)
- NOT a new CCPA contract gate (V1_004 is unchanged; only its prerequisite flipped)

Mechanical doc update. M-counter NOT bumped per the discipline doctrine.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Implements M32d KV cache support on the qwen3_moe inference path.
Discharges the prerequisite for FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004
in contracts/qwen3-moe-serve-dispatch-v1.yaml (v1.1.1 → v1.2.0).

## Empirical results

On Qwen3-Coder-30B-A3B-Instruct-Q4_K_M:

- **Pre-M32d**: ~0.5 tok/s (full-prefill-per-token; bench timed out on
  every per-turn budget — 5 timeout-class dispatches recorded in
  paiml/claude-code-parity-apr
  evidence/phase-6/30b-moe-empirical-2026-05-19.md)
- **Post-M32d**: **9.62 tok/s sustained** on 32-token generation
  (19× speedup; comfortably above the ≥ 5 tok/s scope target)
- **Numerical equivalence**: byte-identical greedy outputs vs
  full-prefill on 4-token reference (cache-on=553ms vs cache-off=1002ms;
  ~1.8× speedup even at small token counts; gap compounds with length)
- **V1_001 + V1_003 regression**: existing #1819 cargo test still
  passes (9.39s wall, content "Human: What", no matmul guard fire)

## Implementation

**New function**: `OwnedQuantizedModel::forward_single_qwen3_moe_with_cache`
in `crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe.rs`.
Mirrors the dense `forward_single_with_cache` reference
(ffn_block.rs:~165) step-for-step EXCEPT at the FFN block, where it calls
`moe_ffn_forward_layer` (router → top-k expert select → per-expert
SwiGLU → weighted sum → down projection) instead of the dense
gate/up/down dispatch. The attention block (QKV proj, per-head Q/K
RMSNorm, RoPE at `position`, GQA-aware cached attention, output
projection, residual) is byte-identical to the dense reference.

**Generate loop rewrite**: `run_qwen3_moe_generate` in
`crates/aprender-serve/src/infer/qwen3_moe_generate.rs` now:
  1. Allocates `OwnedQuantizedKVCache` sized to
     `max(REALIZR_CONTEXT_LENGTH, prompt_len + max_tokens + 8)`
  2. Prefill: per prompt token, calls
     `forward_single_qwen3_moe_with_cache` (cache fills incrementally;
     final iteration's logits seed decode)
  3. Decode: greedy-argmax → append → next cache-aware forward
  4. Stops on a `stop_tokens` hit, `max_tokens` exhausted, or cache full
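
The four steps above can be sketched as a small loop. This is an assumption-laden outline, not the body of `run_qwen3_moe_generate`: `ToyCache` stands in for `OwnedQuantizedKVCache`, the `SingleTokenForward` trait stands in for `forward_single_qwen3_moe_with_cache`, and whether a stop token is appended to the output is a guess (here it is not).

```rust
/// Hypothetical stand-in for the KV cache: tracks filled positions only.
struct ToyCache {
    len: usize,
    capacity: usize,
}

/// Hypothetical stand-in for the single-token cache-aware forward pass.
/// Implementations are expected to append one position to the cache per call.
trait SingleTokenForward {
    fn forward(&self, token: u32, position: usize, cache: &mut ToyCache) -> Vec<f32>;
}

/// Step 1: max(context length, prompt_len + max_tokens + 8).
fn cache_capacity(context_length: usize, prompt_len: usize, max_tokens: usize) -> usize {
    context_length.max(prompt_len + max_tokens + 8)
}

fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best as u32
}

fn generate<M: SingleTokenForward>(
    model: &M,
    prompt: &[u32],
    max_tokens: usize,
    stop_tokens: &[u32],
    context_length: usize,
) -> Vec<u32> {
    let mut out = Vec::new();
    if prompt.is_empty() || max_tokens == 0 {
        return out;
    }
    let mut cache = ToyCache {
        len: 0,
        capacity: cache_capacity(context_length, prompt.len(), max_tokens),
    };
    // Step 2: prefill one prompt token at a time; the last call's logits seed decode.
    let mut logits: Vec<f32> = Vec::new();
    for &tok in prompt {
        logits = model.forward(tok, cache.len, &mut cache);
    }
    // Steps 3 and 4: greedy argmax, append, next cache-aware forward, until a stop condition.
    while out.len() < max_tokens {
        let next = argmax(&logits);
        if stop_tokens.contains(&next) {
            break;
        }
        out.push(next);
        if out.len() == max_tokens || cache.len >= cache.capacity {
            break;
        }
        logits = model.forward(next, cache.len, &mut cache);
    }
    out
}
```

Each forward call here touches exactly one new token, whereas the legacy full-prefill loop re-ran the whole sequence for every output token; that difference is where the gap that "compounds with length" comes from.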

**Visibility fix**: `single_cache_final_output` in `ffn_block.rs`
bumped to `pub(crate)` so the MoE function can reuse the dense
final-norm + LM head path unchanged. Same edit applied to the orphan
`debug.rs` duplicate for hygiene (it's not in the build graph but mirrors
ffn_block.rs).

## New tests (both `#[ignore]`'d, env-gated)

- `crates/aprender-serve/tests/moe_kv_cache_equivalence.rs` —
  Generates 4 tokens via M32d cache-on path AND a legacy full-prefill
  loop. Asserts greedy outputs byte-identical. Reports the measured
  timings (553ms cache-on vs 1002ms cache-off) via `eprintln!`.
- `crates/aprender-serve/tests/m32d_perf.rs` —
  Generates 32 tokens; asserts sustained throughput ≥ 5 tok/s.
  Floor pinned via `M32D_TPS_FLOOR` constant. Catches future
  KV-cache regressions.

Activation:
```
QWEN3_MOE_GGUF_PATH=/path/to/qwen3-moe.gguf \
  cargo test --test moe_kv_cache_equivalence --test m32d_perf \
  -p aprender-serve --features cuda --release -- --ignored --nocapture
```
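
For readers unfamiliar with the pattern, an `#[ignore]`'d, env-gated throughput test might be shaped like the sketch below. It is not the contents of `m32d_perf.rs`: `TPS_FLOOR` mirrors the ≥ 5 tok/s floor stated above but is a hypothetical name (the PR uses `M32D_TPS_FLOOR`), `gguf_path_from_env` and `measure_tps` are invented helpers, and the closure is a placeholder where a real test would load the model and run one decode step.

```rust
use std::env;
use std::path::PathBuf;
use std::time::Instant;

/// Hypothetical floor in tokens per second, matching the 5 tok/s target in the description.
const TPS_FLOOR: f64 = 5.0;

/// Returns the model path only if `QWEN3_MOE_GGUF_PATH` is set and non-empty.
fn gguf_path_from_env() -> Option<PathBuf> {
    env::var_os("QWEN3_MOE_GGUF_PATH")
        .filter(|v| !v.is_empty())
        .map(PathBuf::from)
}

/// Times `n_tokens` calls of `generate_one` and returns tokens per second.
fn measure_tps(n_tokens: usize, mut generate_one: impl FnMut()) -> f64 {
    let start = Instant::now();
    for _ in 0..n_tokens {
        generate_one();
    }
    n_tokens as f64 / start.elapsed().as_secs_f64()
}

#[test]
#[ignore = "needs a real Qwen3-MoE GGUF; set QWEN3_MOE_GGUF_PATH"]
fn perf_floor_sketch() {
    // Second gate: even with `--ignored`, do nothing when the env var is missing.
    let Some(path) = gguf_path_from_env() else {
        eprintln!("QWEN3_MOE_GGUF_PATH not set; skipping");
        return;
    };
    // Placeholder work: a real test would run one cache-aware decode step per call.
    let tps = measure_tps(32, || {
        std::hint::black_box(&path);
    });
    eprintln!("{tps:.2} tok/s on {}", path.display());
    assert!(tps >= TPS_FLOOR, "throughput {tps:.2} tok/s is below the floor");
}
```

The double gate keeps default `cargo test` runs fast and model-free, while `-- --ignored` with the env var set opts in to the expensive check.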

## Risk assessment vs scope doc

All 6 risk surfaces from `docs/specifications/m32d-moe-kv-cache-scope.md`
were addressed:

1. **Numerical equivalence**: PASSED. Greedy argmax robust to ULP-scale
   logit drift; 4-token sequence byte-identical to full-prefill.
2. **Dense path regression**: NONE. Dense `forward_single_with_cache`
   not touched (only its sibling `single_cache_final_output` visibility
   bumped, which doesn't change semantics).
3. **RoPE position offset**: handled via `position` parameter passed
   to `apply_rope` (same pattern as dense reference).
4. **GQA expansion**: handled via `kv_dim()` config method (same as
   dense reference); first-token edge case (empty cache) explicitly
   handled by expanding V across Q heads.
5. **Expert routing under cache**: confirmed unaffected — router reads
   from current-token hidden state only.
6. **Streaming SSE for free**: structurally enabled but not wired into
   the chat handler (separate follow-up contract).
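
Item 4's first-token edge case can be made concrete with a short sketch. With an empty cache there is exactly one key per KV head, so the softmax weight is 1 and each query head's attention output is just the V slice of the KV head it shares. The function names and the flat `[n_kv_heads * head_dim]` layout are assumptions for illustration, not the crate's API.

```rust
/// Maps a query head to the KV head it shares under grouped-query attention.
fn kv_head_for(q_head: usize, n_heads: usize, n_kv_heads: usize) -> usize {
    q_head / (n_heads / n_kv_heads)
}

/// Attention output for the first token (empty cache).
/// `v` is laid out as [n_kv_heads * head_dim]; the result is [n_heads * head_dim].
fn first_token_attn_out(
    v: &[f32],
    n_heads: usize,
    n_kv_heads: usize,
    head_dim: usize,
) -> Vec<f32> {
    assert_eq!(v.len(), n_kv_heads * head_dim);
    assert_eq!(n_heads % n_kv_heads, 0);
    let mut out = Vec::with_capacity(n_heads * head_dim);
    for h in 0..n_heads {
        // A single key means softmax == 1.0, so the head's output is its group's V slice.
        let kv = kv_head_for(h, n_heads, n_kv_heads);
        out.extend_from_slice(&v[kv * head_dim..(kv + 1) * head_dim]);
    }
    out
}
```

For example, with 4 query heads, 2 KV heads, and `head_dim = 2`, a V vector `[1, 2, 3, 4]` expands to `[1, 2, 1, 2, 3, 4, 3, 4]`.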

## Contract bump

v1.1.1 → v1.2.0:
- V1_004 entry gains `prerequisite_status` field documenting M32d
  shipped + empirical throughput numbers
- `evidence` field updated with the post-M32d operator dispatch recipe
- status_history appends v1.2.0 entry

## Companion-side downstream

paiml/claude-code-parity-apr operator can now dispatch Phase 6 bench:

```
APR_MODEL=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
PHASE6_COMPLIANCE_ENFORCED=1 \
PHASE6_MAX_TURNS=20 PHASE6_WALL_SECONDS=3600 \
APR_TIMEOUT_S=900 APR_AGENT_HTTP_TIMEOUT_S=1500 \
APR_AGENT_MAX_TOKENS_CAP=1024 \
bash scripts/phase-6-bench.sh
```

Expected ~10 hour wall on full 20-fixture corpus at 9.62 tok/s
sustained. Acceptance: `evidence/under-contract/scores.json` with
`student_pass_rate > 0` discharges V1_004 + lifts CCPA M280
suspension.

## Supersedes

#1829 (Option b engineer-playbook + V1_004 status
formalization) — operator flipped from Option (b) to Option (a)
in-session; this PR delivers the actual implementation. #1829 can be
closed as superseded.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>