spec(M32d): Option (b) chosen — engineer playbook + V1_004 status formalization by noahgift · Pull Request #1829 · paiml/aprender

noahgift · 2026-05-20T05:18:01Z

Summary

Operator decision 2026-05-20: Option (b) Engineer-driven follow-up chosen for M32d KV cache work. This PR formalizes the choice + ships the engineer playbook so anyone with aprender inference-stack familiarity can pick up the work. Tracking issue: #1830.

What this PR contains

docs/specifications/m32d-moe-kv-cache-scope.md: status banner updated to ACTIVE; operator-decision section replaced with chosen rationale; 250-line Engineer playbook (Option b) appended.
contracts/qwen3-moe-serve-dispatch-v1.yaml v1.1.1 → v1.2.0: V1_004 gains blocked_on + blocker_status fields with empirical-evidence + schedule context; status_history appended.

What this PR does NOT contain

NOT M32d implementation. Pure spec / contract / playbook update.
NOT a binding calendar — playbook is structured by Day-N units to let the engineer pace themselves.
NOT a hand-off to a named engineer: issue #1830 (M32d: KV cache for qwen3_moe inference path, engineer-driven, 1-2 week) is open for anyone to claim.

Engineer playbook highlights

Hand-off criteria: 6 closeable items — forward_single_qwen3_moe_with_cache ships, run_qwen3_moe_generate uses it, equivalence test passes, dense path unregressed, ≥ 5 tok/s on 30B-MoE, V1_004 bench discharges.
Day-by-day plan: Day 1 ramp-up → Day 2-3 lift helpers (attention from dense + MoE FFN from full-prefill) → Day 4-5 new function → Day 6 wire → Day 7 tests → Day 8-10 V1_004 dispatch (operator-coordinated).
PR layout: 4 PRs (2 pure refactor + 1 core M32d + 1 test) to keep each diff reviewable.
Risk gates: each PR has a gate test that must pass before next PR starts.
Open questions for engineer's Day 1: prefill efficiency choice, GPU variant parity, cache rollback semantics, multi-turn chat reuse.
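
The throughput hand-off criterion (≥ 5 tok/s) reduces to simple arithmetic over a token count and a wall-clock interval. A minimal sketch of such a gate follows; the names `TPS_FLOOR`, `tokens_per_second`, and `meets_floor` are illustrative assumptions, not identifiers from the aprender codebase.

```rust
use std::time::Duration;

/// Throughput floor from the hand-off criteria, in tokens per second.
/// The constant name is hypothetical, chosen for this sketch only.
const TPS_FLOOR: f64 = 5.0;

/// Sustained throughput for `tokens` generated over `elapsed`.
/// Returns 0.0 for a zero-length interval to avoid dividing by zero.
fn tokens_per_second(tokens: usize, elapsed: Duration) -> f64 {
    let secs = elapsed.as_secs_f64();
    if secs > 0.0 {
        tokens as f64 / secs
    } else {
        0.0
    }
}

/// True when a generation run meets or exceeds the floor.
fn meets_floor(tokens: usize, elapsed: Duration) -> bool {
    tokens_per_second(tokens, elapsed) >= TPS_FLOOR
}
```

Under this sketch, 32 tokens in 4 s is 8 tok/s and passes, while 32 tokens in 64 s is 0.5 tok/s and fails.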

Test plan

Doc-only / spec-only. No code touched.
CI: spec contract pv validate (if wired)

🤖 Generated with Claude Code

…malization

Operator decision 2026-05-20: M32d KV cache for qwen3_moe path will be delivered via Option (b) engineer-driven follow-up (1-2 week calendar target), not Claude in-session (8-hour Option a).

## Changes

### `docs/specifications/m32d-moe-kv-cache-scope.md`

- Status banner: ACTIVE, Option (b) chosen.
- Operator decision section: replaced the 3-way choice with the chosen rationale + historical reference numbers.
- NEW: 250-line "Engineer playbook (Option b)" section covering:
  - Audience + calendar target + hand-off criteria (6 closeable items)
  - Day-by-day plan (Day 1 ramp-up through Day 8-10 V1_004 dispatch)
  - PR layout (4 PRs: 2 prep refactors + 1 core M32d + 1 test)
  - Risk gates between each PR
  - 4 open questions for the engineer's Day 1 investigation
  - Cross-team coordination (reviewer expectations + CCPA companion)
  - Closing-the-loop checklist (contract bump + companion-side update + optional follow-up streaming SSE contract)

### `contracts/qwen3-moe-serve-dispatch-v1.yaml` v1.1.1 → v1.2.0

- V1_004 entry: added `blocked_on: "M32d KV cache for qwen3_moe path"` and `blocker_status` field documenting the 2026-05-19 empirical evidence (5 timeout-class dispatches) and the 2026-05-20 schedule (Option b chosen).
- `evidence` field updated: expected post-M32d-merge throughput target (≥ 5 tok/s) + bench wall estimate (~10 hours on 20-fixture corpus).
- status_history appends v1.2.0 entry.
- v1.1.1 PR reference updated to "#1819" (was placeholder).

## What this is NOT

- NOT M32d implementation. Pure spec / contract / playbook update.
- NOT a binding commitment on calendar dates: the playbook is structured by Day-N units to let the engineer pace themselves.
- NOT a hand-off to a specific named engineer. Anyone with aprender inference-stack familiarity (or willing Day-1 ramp-up) can claim.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift · 2026-05-20T05:34:17Z

Superseded by #1832 (M32d in-session implementation). Operator flipped from Option (b) engineer-driven to Option (a) in-session. The playbook in this PR remains useful historical reference for the day-by-day scope analysis, but the actual implementation is delivered directly. Closing as superseded.

Implements M32d KV cache support on the qwen3_moe inference path. Discharges the prerequisite for FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004 in contracts/qwen3-moe-serve-dispatch-v1.yaml (v1.1.1 → v1.2.0).

## Empirical results

On Qwen3-Coder-30B-A3B-Instruct-Q4_K_M:

- **Pre-M32d**: ~0.5 tok/s (full-prefill-per-token; bench timed out on every per-turn budget; 5 timeout-class dispatches recorded in paiml/claude-code-parity-apr evidence/phase-6/30b-moe-empirical-2026-05-19.md)
- **Post-M32d**: **9.62 tok/s sustained** on 32-token generation (19× speedup; comfortably above the ≥ 5 tok/s scope target)
- **Numerical equivalence**: byte-identical greedy outputs vs full-prefill on 4-token reference (cache-on=553ms vs cache-off=1002ms; ~2× speedup even at small token counts; gap compounds with length)
- **V1_001 + V1_003 regression**: existing #1819 cargo test still passes (9.39s wall, content "Human: What", no matmul guard fire)

## Implementation

**New function**: `OwnedQuantizedModel::forward_single_qwen3_moe_with_cache` in `crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe.rs`. Mirrors the dense `forward_single_with_cache` reference (ffn_block.rs:~165) step-for-step EXCEPT at the FFN block, where it calls `moe_ffn_forward_layer` (router → top-k expert select → per-expert SwiGLU → weighted sum → down projection) instead of the dense gate/up/down dispatch. Attention block (QKV proj, per-head Q/K RMSNorm, RoPE at `position`, GQA-aware cached attention, output projection, residual) is byte-identical to the dense reference.

**Generate loop rewrite**: `crates/aprender-serve/src/infer/qwen3_moe_generate.rs::run_qwen3_moe_generate` now:

1. Allocates `OwnedQuantizedKVCache` sized to `max(REALIZR_CONTEXT_LENGTH, prompt_len + max_tokens + 8)`
2. Prefill: per prompt token, calls `forward_single_qwen3_moe_with_cache` (cache fills incrementally; final iteration's logits seed decode)
3. Decode: greedy-argmax → append → next cache-aware forward
4. Stop on `stop_tokens` or `max_tokens` exhausted or cache full

**Visibility fix**: `single_cache_final_output` in `ffn_block.rs` bumped to `pub(crate)` so the MoE function can reuse the dense final-norm + LM head path unchanged. Same edit applied to the orphan `debug.rs` duplicate for hygiene (it's not in the build graph but mirrors ffn_block.rs).

## New tests (both `#[ignore]`'d, env-gated)

- `crates/aprender-serve/tests/moe_kv_cache_equivalence.rs`: Generates 4 tokens via M32d cache-on path AND a legacy full-prefill loop. Asserts greedy outputs byte-identical. Pinned 553ms vs 1002ms perf numbers in eprintln output.
- `crates/aprender-serve/tests/m32d_perf.rs`: Generates 32 tokens; asserts sustained throughput ≥ 5 tok/s. Floor pinned via `M32D_TPS_FLOOR` constant. Catches future KV-cache regressions.

Activation:

```
QWEN3_MOE_GGUF_PATH=/path/to/qwen3-moe.gguf \
cargo test --test moe_kv_cache_equivalence --test m32d_perf \
-p aprender-serve --features cuda --release -- --ignored --nocapture
```

## Risk assessment vs scope doc

All 6 risk surfaces from `docs/specifications/m32d-moe-kv-cache-scope.md` were addressed:

1. **Numerical equivalence**: PASSED. Greedy argmax robust to ULP-scale logit drift; 4-token sequence byte-identical to full-prefill.
2. **Dense path regression**: NONE. Dense `forward_single_with_cache` not touched (only its sibling `single_cache_final_output` visibility bumped, which doesn't change semantics).
3. **RoPE position offset**: handled via `position` parameter passed to `apply_rope` (same pattern as dense reference).
4. **GQA expansion**: handled via `kv_dim()` config method (same as dense reference); first-token edge case (empty cache) explicitly handled by expanding V across Q heads.
5. **Expert routing under cache**: confirmed unaffected; the router reads from current-token hidden state only.
6. **Streaming SSE for free**: structurally enabled but not wired into the chat handler (separate follow-up contract).
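
The generate-loop steps and the equivalence test described above can be illustrated with a toy model. This is a sketch under stated assumptions, not the aprender implementation: `ToyCache`, `project`, `next_token`, `toy_forward_with_cache`, and both `generate_*` functions are hypothetical stand-ins, and the "attention" is collapsed to a deterministic sum so that the cached and full-prefill paths can be compared exactly.

```rust
/// Toy stand-in for a KV cache: one scalar "key" per cached position.
struct ToyCache {
    keys: Vec<u64>,
    capacity: usize,
}

/// Toy vocabulary size (assumption for this sketch).
const VOCAB: u64 = 17;

/// Hypothetical per-token projection; stands in for K/V projection plus
/// a position-dependent transform such as RoPE at `position`.
fn project(token: u32, position: usize) -> u64 {
    (token as u64 + 1) * (position as u64 + 3)
}

/// Stand-in for attention over all keys followed by greedy argmax.
fn next_token(keys: &[u64]) -> u32 {
    (keys.iter().sum::<u64>() % VOCAB) as u32
}

/// Cache-aware single-token forward: append one key, attend over the cache.
fn toy_forward_with_cache(token: u32, position: usize, cache: &mut ToyCache) -> u32 {
    cache.keys.push(project(token, position));
    next_token(&cache.keys)
}

/// Cached generation: per-token prefill, then greedy decode.
fn generate_cached(prompt: &[u32], max_tokens: usize, context_len: usize, stop: &[u32]) -> Vec<u32> {
    // Sizing rule from the description: max(context length, prompt_len + max_tokens + 8).
    let capacity = context_len.max(prompt.len() + max_tokens + 8);
    let mut cache = ToyCache { keys: Vec::with_capacity(capacity), capacity };
    let mut out = Vec::new();

    // Prefill: the final iteration's prediction seeds decode.
    let mut seed = None;
    for (pos, &tok) in prompt.iter().enumerate() {
        seed = Some(toy_forward_with_cache(tok, pos, &mut cache));
    }
    let Some(mut tok) = seed else { return out };

    // Decode: stop on a stop token, max_tokens exhausted, or cache full.
    while out.len() < max_tokens && !stop.contains(&tok) && cache.keys.len() < cache.capacity {
        out.push(tok);
        let pos = cache.keys.len();
        tok = toy_forward_with_cache(tok, pos, &mut cache);
    }
    out
}

/// Legacy-style generation: recompute every key from scratch per token.
fn generate_full_prefill(prompt: &[u32], max_tokens: usize, stop: &[u32]) -> Vec<u32> {
    let mut seq = prompt.to_vec();
    let mut out = Vec::new();
    if seq.is_empty() {
        return out;
    }
    while out.len() < max_tokens {
        let keys: Vec<u64> = seq.iter().enumerate().map(|(p, &t)| project(t, p)).collect();
        let tok = next_token(&keys);
        if stop.contains(&tok) {
            break;
        }
        out.push(tok);
        seq.push(tok);
    }
    out
}
```

An equivalence check in the spirit of the test described above would generate a few tokens through both functions from the same prompt and assert that the two output vectors are identical. The cached path does one projection per step, while the full-prefill path redoes all of them, which is why the gap grows with sequence length.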
## Contract bump

v1.1.1 → v1.2.0:

- V1_004 entry gains `prerequisite_status` field documenting M32d shipped + empirical throughput numbers
- `evidence` field updated with the post-M32d operator dispatch recipe
- status_history appends v1.2.0 entry

## Companion-side downstream

paiml/claude-code-parity-apr operator can now dispatch Phase 6 bench:

```
APR_MODEL=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
PHASE6_COMPLIANCE_ENFORCED=1 \
PHASE6_MAX_TURNS=20 PHASE6_WALL_SECONDS=3600 \
APR_TIMEOUT_S=900 APR_AGENT_HTTP_TIMEOUT_S=1500 \
APR_AGENT_MAX_TOKENS_CAP=1024 \
bash scripts/phase-6-bench.sh
```

Expected ~10 hour wall on full 20-fixture corpus at 9.62 tok/s sustained. Acceptance: `evidence/under-contract/scores.json` with `student_pass_rate > 0` discharges V1_004 + lifts CCPA M280 suspension.

## Supersedes

#1829 (Option b engineer-playbook + V1_004 status formalization): operator flipped from Option (b) to Option (a) in-session; this PR delivers the actual implementation. #1829 can be closed as superseded.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

This was referenced May 20, 2026

M32d: KV cache for qwen3_moe inference path (engineer-driven, 1-2 week) #1830

Open

feat(M32d): KV cache for qwen3_moe inference path — 19× speedup #1832

Merged

noahgift closed this May 20, 2026

noahgift wants to merge 1 commit into main from spec/m32d-option-b-engineer-playbook
