refactor(forward): lift moe_ffn_layer helper (M32d Day 1 prep, #1830 PR-2 of 4) by noahgift · Pull Request #1834 · paiml/aprender

noahgift · 2026-05-20T05:38:06Z

Summary

Pure refactor of the per-position MoE FFN block (pre-norm + dispatch + residual) into a private helper method on `OwnedQuantizedModel`. Behavior is byte-identical to the previous batched-norm + per-position-dispatch + outer-residual version.

Second of 4 PRs implementing #1830 (M32d KV cache for qwen3_moe inference path).

Cascade progress

#	What	Status
#1831	Extract `attention_layer_with_cache` helper from dense path	Open
#1832 (this)	Extract `moe_ffn_layer` helper from `forward_qwen3_moe`	This PR
PR-3	Add `forward_single_qwen3_moe_with_cache` + wire generate loop	Next
PR-4	`moe_kv_cache_equivalence` integration test	After PR-3

After PR-1 + PR-2 land, PR-3 becomes a small composition: per layer, call `attention_layer_with_cache` then `moe_ffn_layer`.

What changed

`forward_qwen3_moe.rs` lines 224-260 (the per-layer 2f Pre-FFN-norm + 2g MoE FFN dispatch + residual block) lifted into a new private method `moe_ffn_layer`.
Outer for-position loop in `forward_qwen3_moe` now calls the helper once per position via a temporary `tok_hidden` buffer.

Helper signature follows the spec:

```rust
pub(crate) fn moe_ffn_layer(
&self,
hidden_token: &mut [f32], // [hidden_dim], single token
layer: &OwnedQuantizedLayer, // for ffn_norm_weight + bias
moe_layer: &Qwen3MoeQuantizedLayer, // for router + experts
num_experts: usize,
num_experts_per_tok: usize,
moe_intermediate: usize,
data: &[u8],
use_rmsnorm: bool,
) -> Result<()>
```

Math equivalence

The previous batched code:

Compute `ffn_input = rms_norm(hidden)` over `seq_len * hidden_dim` (rms_norm is per-vector, so this is N independent calls inlined)
For each position `s`: `pos_out[s] = moe_ffn_forward_layer(ffn_input[s])`
Residual: `hidden += ffn_output`

The new per-position code does the same math in a different order:

```
for s in 0..seq_len:
norm(hidden[s..s+hidden_dim]) → ffn_input
moe_ffn_forward_layer(ffn_input) → pos_out
hidden[s..s+hidden_dim] += pos_out
```

`rms_norm`/`layer_norm` are per-vector (only mix elements within one `hidden_dim` vector, never across tokens), so batching them is convenience, not required for correctness. The per-token MoE dispatch was already the structure.

Cost note

Adds two per-position copies: `hidden→tok_hidden` and `tok_hidden→hidden`. For Qwen3-Coder-30B (`hidden_dim=2048`, f32) each copy is 8KB — negligible vs the per-expert matvecs inside `moe_ffn_forward_layer` (~6MB Q4_K bytes per expert × top-K=8 selected experts).

Test plan

`cargo check -p aprender-serve --lib --features cuda` → clean
`cargo test gguf::inference::forward` → 109 passed / 0 failed / 2 ignored
`cargo test qwen3_moe` → 17 passed / 0 failed / 0 ignored

Cross-refs

Issue: M32d: KV cache for qwen3_moe inference path (engineer-driven, 1-2 week) #1830 (M32d KV cache for qwen3_moe inference path)
Predecessor: refactor(forward): lift attention_layer_with_cache helper (M32d Day 1 prep, #1830 PR-1 of 4) #1831 (PR-1: attention helper)
Spec: `docs/specifications/m32d-moe-kv-cache-scope.md`
Contract gate (downstream, blocked by full M32d): FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004

🤖 Generated with Claude Code

… prep, #1830 PR-1 of 4) Pure refactor of the dense KV-cache path's per-layer attention sub-block (steps 2a–2g) into a private helper method on OwnedQuantizedModel. Behavior is bit-identical to the previous inline version. ## Why M32d (KV cache for qwen3_moe inference path, #1830) needs to add a new forward function forward_single_qwen3_moe_with_cache that mirrors forward_single_with_cache's structure but swaps the dense FFN block for the MoE expert dispatch. The attention block is reusable as-is. This PR is the first of 4 in the M32d cascade per docs/specifications/m32d-moe-kv-cache-scope.md: PR 1 (this) — extract attention_layer_with_cache from dense path PR 2 — extract moe_ffn_layer helper from forward_qwen3_moe PR 3 — add forward_single_qwen3_moe_with_cache + wire generate PR 4 — add moe_kv_cache_equivalence integration test PR 1 is the lowest-risk prep step. The 34 existing single_tests must remain green (regression gate — the spec calls this out as 'medium risk: must not regress dense KV cache'). ## What changed - debug.rs lines 477-589 (the attention sub-block inside the per-layer for loop of forward_single_with_cache) lifted verbatim into a new private method attention_layer_with_cache. - Inline block replaced with a single call to the helper. - Helper signature follows the spec doc: hidden: &mut Vec<f32> layer: &OwnedQuantizedLayer layer_idx: usize cache: &mut OwnedQuantizedKVCache position: usize attn_out_buffer: &mut [f32] use_rmsnorm: bool The helper preserves ALL behavior of the inline version: PMAT-260 debug trace calls, GH-278 LayerNorm bias branch, GH-479 Qwen3 per-head QK RMSNorm, RoPE skip for absolute-position models, GQA expansion for empty cache first-token path, CORRECTNESS-013 attention output trace, fused_matmul output projection, attn_output_bias add, residual. ## Test plan - [x] cargo check -p aprender-serve --lib --features cuda: clean - [x] cargo test -p aprender-serve --lib --features cuda gguf::inference::forward::single: 34 passed / 0 failed / 1 ignored — covers test_forward_single_with_cache_*, GQA variants, multi-token sequential decode, Q8K path ## Cross-refs - Issue: #1830 (M32d KV cache for qwen3_moe inference path) - Spec: docs/specifications/m32d-moe-kv-cache-scope.md - Contract gate (downstream): FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004 - Sibling future PR: forward_qwen3_moe.rs lift of moe_ffn_layer helper Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> EOF

…PR-2 of 4) Pure refactor of the per-position MoE FFN block (pre-norm + dispatch + residual) into a private helper method on OwnedQuantizedModel. Behavior is byte-identical to the previous batch-norm + per-position- dispatch + outer-residual version. ## Why Continues the M32d cascade (#1830). PR-1 (#1831) extracted the dense attention helper. This PR extracts the MoE FFN helper. Together they let the upcoming forward_single_qwen3_moe_with_cache (PR-3) be a small composition: for each layer: attention_layer_with_cache(...) // dense helper from PR-1 moe_ffn_layer(...) // MoE helper from THIS PR ## What changed - forward_qwen3_moe.rs lines 224-260 (the per-layer 2f Pre-FFN-norm + 2g MoE FFN dispatch + residual block) lifted into a new private method moe_ffn_layer. - The outer for-position loop in forward_qwen3_moe now calls the helper once per position via a temporary tok_hidden buffer. ## Math equivalence The previous batched code: 1. Compute ffn_input = rms_norm(hidden) over seq_len * hidden_dim (rms_norm is per-vector, so this is N independent calls inlined) 2. For each position s: pos_out[s] = moe_ffn_forward_layer(ffn_input[s]) 3. Residual: hidden += ffn_output The new per-position code does the same thing in a different order: 1. For each position s: norm(hidden[s..s+hidden_dim]) → ffn_input moe_ffn_forward_layer(ffn_input) → pos_out hidden[s..s+hidden_dim] += pos_out Both compute the same per-position math. rms_norm/layer_norm are per-vector (they only mix elements within one hidden_dim vector, never across tokens) so batching them is just convenience, not required for correctness. ## Cost note The refactor adds two per-position copies: hidden→tok_hidden and tok_hidden→hidden. For Qwen3-Coder-30B (hidden_dim=2048, f32) each copy is 8KB — negligible vs the per-expert matvecs inside moe_ffn_forward_layer (~6MB Q4_K bytes per expert × top-K=8). ## Test plan - [x] cargo check -p aprender-serve --lib --features cuda: clean - [x] cargo test gguf::inference::forward: 109 passed / 0 failed / 2 ignored - [x] cargo test qwen3_moe: 17 passed / 0 failed / 0 ignored ## Cross-refs - Issue: #1830 (M32d KV cache for qwen3_moe inference path) - Predecessor: #1831 (PR-1: attention_layer_with_cache helper) - Spec: docs/specifications/m32d-moe-kv-cache-scope.md - Next: PR-3 — wire forward_single_qwen3_moe_with_cache + run_qwen3_moe_generate Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift · 2026-05-20T06:04:45Z

Superseded by #1832 — operator flipped from Option (b) engineer-driven to Option (a) in-session and shipped the full M32d KV cache feature monolithically (without the helper extraction this PR proposed). #1832 ships forward_single_qwen3_moe_with_cache + cache-aware run_qwen3_moe_generate + moe_kv_cache_equivalence test + m32d_perf bench with empirical 19× speedup. Closing this prep refactor PR; not needed once #1832 lands.

This is the standard chain-PR squash-leapfrog cleanup pattern (memory: feedback_chain_pr_squash_leapfrog.md).

Noah Gift and others added 2 commits May 20, 2026 07:26

noahgift enabled auto-merge (squash) May 20, 2026 05:38

noahgift closed this May 20, 2026

auto-merge was automatically disabled May 20, 2026 06:04
Pull request was closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(forward): lift moe_ffn_layer helper (M32d Day 1 prep, #1830 PR-2 of 4)#1834

refactor(forward): lift moe_ffn_layer helper (M32d Day 1 prep, #1830 PR-2 of 4)#1834
noahgift wants to merge 2 commits into
mainfrom
feat/m32d-moe-kv-cache-prep-moe-ffn-helper

noahgift commented May 20, 2026

Uh oh!

noahgift commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented May 20, 2026

Summary

Cascade progress

What changed

Math equivalence

Cost note

Test plan

Cross-refs

Uh oh!

noahgift commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant