spec(M32d): Option (b) chosen — engineer playbook + V1_004 status formalization#1829

Closed
noahgift wants to merge 1 commit into main from spec/m32d-option-b-engineer-playbook

@noahgift noahgift commented May 20, 2026

Summary

Operator decision 2026-05-20: Option (b) Engineer-driven follow-up chosen for M32d KV cache work. This PR formalizes the choice + ships the engineer playbook so anyone with aprender inference-stack familiarity can pick up the work. Tracking issue: #1830.

What this PR contains

  • docs/specifications/m32d-moe-kv-cache-scope.md: status banner updated to ACTIVE; operator-decision section replaced with chosen rationale; 250-line Engineer playbook (Option b) appended.
  • contracts/qwen3-moe-serve-dispatch-v1.yaml v1.1.1 → v1.2.0: V1_004 gains blocked_on + blocker_status fields with empirical-evidence + schedule context; status_history appended.

What this PR does NOT contain

Engineer playbook highlights

  • Hand-off criteria: 6 closeable items — forward_single_qwen3_moe_with_cache ships, run_qwen3_moe_generate uses it, equivalence test passes, dense path unregressed, ≥ 5 tok/s on 30B-MoE, V1_004 bench discharges.
  • Day-by-day plan: Day 1 ramp-up → Day 2-3 lift helpers (attention from dense + MoE FFN from full-prefill) → Day 4-5 new function → Day 6 wire → Day 7 tests → Day 8-10 V1_004 dispatch (operator-coordinated).
  • PR layout: 4 PRs (2 pure refactor + 1 core M32d + 1 test) to keep each diff reviewable.
  • Risk gates: each PR has a gate test that must pass before next PR starts.
  • Open questions for engineer's Day 1: prefill efficiency choice, GPU variant parity, cache rollback semantics, multi-turn chat reuse.

Test plan

  • Doc-only / spec-only. No code touched.
  • CI: spec contract pv validate (if wired)

🤖 Generated with Claude Code

…malization

Operator decision 2026-05-20: M32d KV cache for qwen3_moe path will be
delivered via Option (b) engineer-driven follow-up (1-2 week calendar
target), not Claude in-session (8-hour Option a).

## Changes

### `docs/specifications/m32d-moe-kv-cache-scope.md`

- Status banner: ACTIVE, Option (b) chosen.
- Operator decision section: replaced the 3-way choice with the chosen
  rationale + historical reference numbers.
- NEW: 250-line "Engineer playbook (Option b)" section covering:
  - Audience + calendar target + hand-off criteria (6 closeable items)
  - Day-by-day plan (Day 1 ramp-up through Day 8-10 V1_004 dispatch)
  - PR layout (4 PRs: 2 prep refactors + 1 core M32d + 1 test)
  - Risk gates between each PR
  - 4 open questions for the engineer's Day 1 investigation
  - Cross-team coordination (reviewer expectations + CCPA companion)
  - Closing-the-loop checklist (contract bump + companion-side update +
    optional follow-up streaming SSE contract)

### `contracts/qwen3-moe-serve-dispatch-v1.yaml` v1.1.1 → v1.2.0

- V1_004 entry: added `blocked_on: "M32d KV cache for qwen3_moe path"`
  and `blocker_status` field documenting the 2026-05-19 empirical
  evidence (5 timeout-class dispatches) and the 2026-05-20 schedule
  (Option b chosen).
- `evidence` field updated: expected post-M32d-merge throughput target
  (≥ 5 tok/s) + bench wall estimate (~10 hours on 20-fixture corpus).
- status_history appends v1.2.0 entry.
- v1.1.1 PR reference updated to "#1819" (was placeholder).

## What this is NOT

- NOT M32d implementation. Pure spec / contract / playbook update.
- NOT a binding commitment on calendar dates — playbook is structured
  by Day-N units to let the engineer pace themselves.
- NOT a hand-off to a specific named engineer. Anyone with aprender
  inference-stack familiarity (or willing Day-1 ramp-up) can claim.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift
Contributor Author

Superseded by #1832 (M32d in-session implementation). The operator flipped from Option (b) engineer-driven to Option (a) in-session. The playbook in this PR remains a useful historical reference for the day-by-day scope analysis, but the actual implementation is delivered directly in #1832. Closing as superseded.

@noahgift noahgift closed this May 20, 2026
noahgift added a commit that referenced this pull request May 20, 2026
Implements M32d KV cache support on the qwen3_moe inference path.
Discharges the prerequisite for FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004
in contracts/qwen3-moe-serve-dispatch-v1.yaml (v1.1.1 → v1.2.0).

## Empirical results

On Qwen3-Coder-30B-A3B-Instruct-Q4_K_M:

- **Pre-M32d**: ~0.5 tok/s (full-prefill-per-token; bench timed out on
  every per-turn budget; 5 timeout-class dispatches recorded in
  paiml/claude-code-parity-apr
  `evidence/phase-6/30b-moe-empirical-2026-05-19.md`)
- **Post-M32d**: **9.62 tok/s sustained** on 32-token generation
  (19× speedup; comfortably above the ≥ 5 tok/s scope target)
- **Numerical equivalence**: byte-identical greedy outputs vs full-
  prefill on 4-token reference (cache-on=553ms vs cache-off=1002ms;
  ~2× speedup even at small token counts; gap compounds with length)
- **V1_001 + V1_003 regression**: existing #1819 cargo test still
  passes (9.39s wall, content "Human: What", no matmul guard fire)
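
The ratios quoted above can be re-derived from the raw figures. This is only an arithmetic sketch; the helper names are made up for illustration and do not come from the repository.

```rust
/// Ratio of a new throughput to an old one (both in tokens per second).
fn throughput_speedup(old_tps: f64, new_tps: f64) -> f64 {
    new_tps / old_tps
}

/// Ratio of two wall times for the same work (slow over fast).
fn latency_speedup(slow_ms: f64, fast_ms: f64) -> f64 {
    slow_ms / fast_ms
}
```

With the numbers in this message, `throughput_speedup(0.5, 9.62)` gives about 19.2 and `latency_speedup(1002.0, 553.0)` gives about 1.8 for the 4-token reference.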

## Implementation

**New function**: `OwnedQuantizedModel::forward_single_qwen3_moe_with_cache`
in `crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe.rs`.
Mirrors the dense `forward_single_with_cache` reference (ffn_block.rs:
~165) step-for-step EXCEPT at the FFN block, where it calls
`moe_ffn_forward_layer` (router → top-k expert select → per-expert
SwiGLU → weighted sum → down projection) instead of the dense
gate/up/down dispatch. Attention block (QKV proj, per-head Q/K RMSNorm,
RoPE at `position`, GQA-aware cached attention, output projection,
residual) is byte-identical to the dense reference.
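
For readers unfamiliar with the FFN swap described above, here is a minimal, generic top-k MoE sketch on plain `f32` vectors. It is not the body of `moe_ffn_forward_layer`: the `Expert` struct, the unquantized matrices, and the choice to softmax over only the selected logits are all assumptions made for illustration.

```rust
/// Row-major matrix-vector product: each row of `w` has `x.len()` columns.
fn matvec(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

fn silu(v: f32) -> f32 {
    v / (1.0 + (-v).exp())
}

/// Hypothetical toy expert holding three small dense matrices.
struct Expert {
    gate: Vec<Vec<f32>>, // [ffn_dim][hidden]
    up: Vec<Vec<f32>>,   // [ffn_dim][hidden]
    down: Vec<Vec<f32>>, // [hidden][ffn_dim]
}

impl Expert {
    /// SwiGLU: down(silu(gate * x) elementwise-times (up * x)).
    fn forward(&self, x: &[f32]) -> Vec<f32> {
        let g = matvec(&self.gate, x);
        let u = matvec(&self.up, x);
        let act: Vec<f32> = g.iter().zip(&u).map(|(g, u)| silu(*g) * u).collect();
        matvec(&self.down, &act)
    }
}

/// Indices of the `k` largest router logits, with softmax weights
/// normalized over just those `k` entries (an assumption of this sketch).
fn top_k_route(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));
    idx.truncate(k);
    let max = idx.iter().map(|&i| logits[i]).fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = idx.iter().map(|&i| (logits[i] - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    idx.into_iter().zip(exps).map(|(i, e)| (i, e / sum)).collect()
}

/// Router, top-k select, per-expert SwiGLU, weighted sum.
fn moe_ffn(x: &[f32], router: &[Vec<f32>], experts: &[Expert], k: usize) -> Vec<f32> {
    let logits = matvec(router, x);
    let mut out = vec![0.0f32; x.len()];
    for (i, w) in top_k_route(&logits, k) {
        for (o, y) in out.iter_mut().zip(experts[i].forward(x)) {
            *o += w * y;
        }
    }
    out
}
```

The point relevant to caching is that `moe_ffn` takes only the current token's hidden vector `x`, so nothing in it depends on earlier positions.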

**Generate loop rewrite**: `run_qwen3_moe_generate` in
`crates/aprender-serve/src/infer/qwen3_moe_generate.rs` now:
  1. Allocates `OwnedQuantizedKVCache` sized to
     `max(REALIZR_CONTEXT_LENGTH, prompt_len + max_tokens + 8)`
  2. Prefill: per prompt token, calls
     `forward_single_qwen3_moe_with_cache` (cache fills incrementally;
     final iteration's logits seed decode)
  3. Decode: greedy-argmax → append → next cache-aware forward
  4. Stop on `stop_tokens` or `max_tokens` exhausted or cache full
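
The four steps above can be sketched against a toy interface. The `CachedForward` trait, `generate`, and `cache_capacity` names are hypothetical stand-ins rather than the repository's API; the real function works on `OwnedQuantizedKVCache`, which this sketch hides behind the trait. It also assumes the stop token itself is not emitted, which the message does not state.

```rust
/// Hypothetical stand-in for a model with an internal KV cache.
trait CachedForward {
    /// Consumes `token` at `position`, extends the cache, returns logits.
    fn forward_single(&mut self, token: u32, position: usize) -> Vec<f32>;
}

/// Step 1: capacity is the larger of the context length and the
/// worst-case sequence length plus a small margin.
fn cache_capacity(context_len: usize, prompt_len: usize, max_tokens: usize) -> usize {
    context_len.max(prompt_len + max_tokens + 8)
}

/// Index of the largest logit (first one wins on ties).
fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0;
    for (i, v) in logits.iter().enumerate() {
        if *v > logits[best] {
            best = i;
        }
    }
    best as u32
}

fn generate<M: CachedForward>(
    model: &mut M,
    prompt: &[u32],
    max_tokens: usize,
    stop_tokens: &[u32],
    capacity: usize,
) -> Vec<u32> {
    let mut logits = Vec::new();
    let mut position = 0;
    // Step 2: prefill one token at a time; the last logits seed decode.
    for &tok in prompt.iter().take(capacity) {
        logits = model.forward_single(tok, position);
        position += 1;
    }
    let mut out = Vec::new();
    if logits.is_empty() || max_tokens == 0 {
        return out;
    }
    // Steps 3 and 4: greedy decode until a stop condition fires.
    loop {
        let next = argmax(&logits);
        if stop_tokens.contains(&next) {
            break; // stop token reached (not emitted in this sketch)
        }
        out.push(next);
        if out.len() >= max_tokens || position >= capacity {
            break; // token budget exhausted or cache full
        }
        logits = model.forward_single(next, position);
        position += 1;
    }
    out
}
```

Each decode step costs one forward call on a single token, which is where the gain over re-running the whole prefix per token comes from.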

**Visibility fix**: `single_cache_final_output` in `ffn_block.rs`
bumped to `pub(crate)` so the MoE function can reuse the dense final-
norm + LM head path unchanged. Same edit applied to the orphan
`debug.rs` duplicate for hygiene (it's not in the build graph but mirrors
ffn_block.rs).

## New tests (both `#[ignore]`'d, env-gated)

- `crates/aprender-serve/tests/moe_kv_cache_equivalence.rs` —
  Generates 4 tokens via M32d cache-on path AND a legacy full-prefill
  loop. Asserts greedy outputs byte-identical. Pinned 553ms vs 1002ms
  perf numbers in eprintln output.
- `crates/aprender-serve/tests/m32d_perf.rs` —
  Generates 32 tokens; asserts sustained throughput ≥ 5 tok/s.
  Floor pinned via `M32D_TPS_FLOOR` constant. Catches future
  KV-cache regressions.

Activation:
```
QWEN3_MOE_GGUF_PATH=/path/to/qwen3-moe.gguf \
  cargo test --test moe_kv_cache_equivalence --test m32d_perf \
  -p aprender-serve --features cuda --release -- --ignored --nocapture
```
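
As a rough illustration of how an env-gated throughput floor of this kind can be checked, here is a sketch. `TPS_FLOOR` stands in for the `M32D_TPS_FLOOR` constant named above, and every helper here is hypothetical; the actual test bodies are not shown in this message.

```rust
use std::path::PathBuf;
use std::time::{Duration, Instant};

/// Illustrative floor matching the 5 tok/s target quoted above.
const TPS_FLOOR: f64 = 5.0;

/// Returns the model path if the gating variable is set; a test would
/// return early (skip) on `None`.
fn gguf_path_from_env() -> Option<PathBuf> {
    std::env::var_os("QWEN3_MOE_GGUF_PATH").map(PathBuf::from)
}

fn tokens_per_second(tokens: usize, elapsed: Duration) -> f64 {
    tokens as f64 / elapsed.as_secs_f64()
}

/// Times `generate`, which reports how many tokens it produced.
fn measure_tps<F: FnOnce() -> usize>(generate: F) -> f64 {
    let start = Instant::now();
    let produced = generate();
    tokens_per_second(produced, start.elapsed())
}

fn meets_floor(tps: f64) -> bool {
    tps >= TPS_FLOOR
}
```

A test built this way would call `measure_tps` around the 32-token generation and assert `meets_floor` on the result, so a later change that reintroduces per-token full prefill fails loudly.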

## Risk assessment vs scope doc

All 6 risk surfaces from `docs/specifications/m32d-moe-kv-cache-scope.md`
were addressed:

1. **Numerical equivalence**: PASSED. Greedy argmax robust to ULP-scale
   logit drift; 4-token sequence byte-identical to full-prefill.
2. **Dense path regression**: NONE. Dense `forward_single_with_cache`
   not touched (only its sibling `single_cache_final_output` visibility
   bumped, which doesn't change semantics).
3. **RoPE position offset**: handled via `position` parameter passed
   to `apply_rope` (same pattern as dense reference).
4. **GQA expansion**: handled via `kv_dim()` config method (same as
   dense reference); first-token edge case (empty cache) explicitly
   handled by expanding V across Q heads.
5. **Expert routing under cache**: confirmed unaffected — router reads
   from current-token hidden state only.
6. **Streaming SSE for free**: structurally enabled but not wired into
   the chat handler (separate follow-up contract).
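
Item 4 can be made concrete with a small sketch of grouped-query head mapping. The free functions below are illustrative only (the source names just a `kv_dim()` config method), and they assume the number of Q heads is a multiple of the number of K/V heads. With a single attended position the softmax weight is 1, so each Q head's output is simply the V slice of the K/V head it shares.

```rust
/// Maps a query head to the K/V head it shares.
fn kv_head_for(q_head: usize, n_heads: usize, n_kv_heads: usize) -> usize {
    assert!(n_kv_heads > 0 && n_heads % n_kv_heads == 0);
    q_head / (n_heads / n_kv_heads)
}

/// Width of one token's K (or V) row.
fn kv_dim(n_kv_heads: usize, head_dim: usize) -> usize {
    n_kv_heads * head_dim
}

/// First-token case: copy each K/V head's V slice to every Q head in
/// its group, producing a vector of width n_heads * head_dim.
fn expand_v_across_q_heads(
    v: &[f32],
    n_heads: usize,
    n_kv_heads: usize,
    head_dim: usize,
) -> Vec<f32> {
    assert_eq!(v.len(), kv_dim(n_kv_heads, head_dim));
    let mut out = Vec::with_capacity(n_heads * head_dim);
    for h in 0..n_heads {
        let kv = kv_head_for(h, n_heads, n_kv_heads);
        out.extend_from_slice(&v[kv * head_dim..(kv + 1) * head_dim]);
    }
    out
}
```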

## Contract bump

v1.1.1 → v1.2.0:
- V1_004 entry gains `prerequisite_status` field documenting M32d
  shipped + empirical throughput numbers
- `evidence` field updated with the post-M32d operator dispatch recipe
- status_history appends v1.2.0 entry

## Companion-side downstream

paiml/claude-code-parity-apr operator can now dispatch Phase 6 bench:

```
APR_MODEL=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
PHASE6_COMPLIANCE_ENFORCED=1 \
PHASE6_MAX_TURNS=20 PHASE6_WALL_SECONDS=3600 \
APR_TIMEOUT_S=900 APR_AGENT_HTTP_TIMEOUT_S=1500 \
APR_AGENT_MAX_TOKENS_CAP=1024 \
bash scripts/phase-6-bench.sh
```

Expected ~10 hour wall on full 20-fixture corpus at 9.62 tok/s
sustained. Acceptance: `evidence/under-contract/scores.json` with
`student_pass_rate > 0` discharges V1_004 + lifts CCPA M280
suspension.

## Supersedes

#1829 (Option b engineer-playbook + V1_004 status
formalization) — operator flipped from Option (b) to Option (a)
in-session; this PR delivers the actual implementation. #1829 can be
closed as superseded.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>