Skip to content

refactor(forward): lift moe_ffn_layer helper (M32d Day 1 prep, #1830 PR-2 of 4)#1834

Closed
noahgift wants to merge 2 commits into
mainfrom
feat/m32d-moe-kv-cache-prep-moe-ffn-helper
Closed

refactor(forward): lift moe_ffn_layer helper (M32d Day 1 prep, #1830 PR-2 of 4)#1834
noahgift wants to merge 2 commits into
mainfrom
feat/m32d-moe-kv-cache-prep-moe-ffn-helper

Conversation

@noahgift
Copy link
Copy Markdown
Contributor

Summary

Pure refactor of the per-position MoE FFN block (pre-norm + dispatch + residual) into a private helper method on `OwnedQuantizedModel`. Behavior is byte-identical to the previous batched-norm + per-position-dispatch + outer-residual version.

Second of 4 PRs implementing #1830 (M32d KV cache for qwen3_moe inference path).

Cascade progress

# What Status
#1831 Extract `attention_layer_with_cache` helper from dense path Open
#1832 (this) Extract `moe_ffn_layer` helper from `forward_qwen3_moe` This PR
PR-3 Add `forward_single_qwen3_moe_with_cache` + wire generate loop Next
PR-4 `moe_kv_cache_equivalence` integration test After PR-3

After PR-1 + PR-2 land, PR-3 becomes a small composition: per layer, call `attention_layer_with_cache` then `moe_ffn_layer`.

What changed

  • `forward_qwen3_moe.rs` lines 224-260 (the per-layer 2f Pre-FFN-norm + 2g MoE FFN dispatch + residual block) lifted into a new private method `moe_ffn_layer`.
  • Outer for-position loop in `forward_qwen3_moe` now calls the helper once per position via a temporary `tok_hidden` buffer.

Helper signature follows the spec:

```rust
pub(crate) fn moe_ffn_layer(
&self,
hidden_token: &mut [f32], // [hidden_dim], single token
layer: &OwnedQuantizedLayer, // for ffn_norm_weight + bias
moe_layer: &Qwen3MoeQuantizedLayer, // for router + experts
num_experts: usize,
num_experts_per_tok: usize,
moe_intermediate: usize,
data: &[u8],
use_rmsnorm: bool,
) -> Result<()>
```

Math equivalence

The previous batched code:

  1. Compute `ffn_input = rms_norm(hidden)` over `seq_len * hidden_dim` (rms_norm is per-vector, so this is N independent calls inlined)
  2. For each position `s`: `pos_out[s] = moe_ffn_forward_layer(ffn_input[s])`
  3. Residual: `hidden += ffn_output`

The new per-position code does the same math in a different order:

```
for s in 0..seq_len:
norm(hidden[s..s+hidden_dim]) → ffn_input
moe_ffn_forward_layer(ffn_input) → pos_out
hidden[s..s+hidden_dim] += pos_out
```

`rms_norm`/`layer_norm` are per-vector (only mix elements within one `hidden_dim` vector, never across tokens), so batching them is convenience, not required for correctness. The per-token MoE dispatch was already the structure.

Cost note

Adds two per-position copies: `hidden→tok_hidden` and `tok_hidden→hidden`. For Qwen3-Coder-30B (`hidden_dim=2048`, f32) each copy is 8KB — negligible vs the per-expert matvecs inside `moe_ffn_forward_layer` (~6MB Q4_K bytes per expert × top-K=8 selected experts).

Test plan

  • `cargo check -p aprender-serve --lib --features cuda` → clean
  • `cargo test gguf::inference::forward` → 109 passed / 0 failed / 2 ignored
  • `cargo test qwen3_moe` → 17 passed / 0 failed / 0 ignored

Cross-refs

🤖 Generated with Claude Code

Noah Gift and others added 2 commits May 20, 2026 07:26
… prep, #1830 PR-1 of 4)

Pure refactor of the dense KV-cache path's per-layer attention sub-block
(steps 2a–2g) into a private helper method on OwnedQuantizedModel.
Behavior is bit-identical to the previous inline version.

## Why

M32d (KV cache for qwen3_moe inference path, #1830) needs to add a new
forward function forward_single_qwen3_moe_with_cache that mirrors
forward_single_with_cache's structure but swaps the dense FFN block for
the MoE expert dispatch. The attention block is reusable as-is.

This PR is the first of 4 in the M32d cascade per
docs/specifications/m32d-moe-kv-cache-scope.md:

  PR 1 (this) — extract attention_layer_with_cache from dense path
  PR 2       — extract moe_ffn_layer helper from forward_qwen3_moe
  PR 3       — add forward_single_qwen3_moe_with_cache + wire generate
  PR 4       — add moe_kv_cache_equivalence integration test

PR 1 is the lowest-risk prep step. The 34 existing single_tests must
remain green (regression gate — the spec calls this out as 'medium
risk: must not regress dense KV cache').

## What changed

- debug.rs lines 477-589 (the attention sub-block inside the per-layer
  for loop of forward_single_with_cache) lifted verbatim into a new
  private method attention_layer_with_cache.
- Inline block replaced with a single call to the helper.
- Helper signature follows the spec doc:
    hidden: &mut Vec<f32>
    layer: &OwnedQuantizedLayer
    layer_idx: usize
    cache: &mut OwnedQuantizedKVCache
    position: usize
    attn_out_buffer: &mut [f32]
    use_rmsnorm: bool

The helper preserves ALL behavior of the inline version: PMAT-260 debug
trace calls, GH-278 LayerNorm bias branch, GH-479 Qwen3 per-head QK
RMSNorm, RoPE skip for absolute-position models, GQA expansion for
empty cache first-token path, CORRECTNESS-013 attention output trace,
fused_matmul output projection, attn_output_bias add, residual.

## Test plan

- [x] cargo check -p aprender-serve --lib --features cuda: clean
- [x] cargo test -p aprender-serve --lib --features cuda gguf::inference::forward::single:
       34 passed / 0 failed / 1 ignored
       — covers test_forward_single_with_cache_*, GQA variants,
         multi-token sequential decode, Q8K path

## Cross-refs

- Issue: #1830 (M32d KV cache for qwen3_moe inference path)
- Spec: docs/specifications/m32d-moe-kv-cache-scope.md
- Contract gate (downstream): FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004
- Sibling future PR: forward_qwen3_moe.rs lift of moe_ffn_layer helper

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
EOF
…PR-2 of 4)

Pure refactor of the per-position MoE FFN block (pre-norm + dispatch +
residual) into a private helper method on OwnedQuantizedModel.
Behavior is byte-identical to the previous batch-norm + per-position-
dispatch + outer-residual version.

## Why

Continues the M32d cascade (#1830). PR-1 (#1831) extracted the dense
attention helper. This PR extracts the MoE FFN helper. Together they
let the upcoming forward_single_qwen3_moe_with_cache (PR-3) be a small
composition:

  for each layer:
    attention_layer_with_cache(...)  // dense helper from PR-1
    moe_ffn_layer(...)               // MoE helper from THIS PR

## What changed

- forward_qwen3_moe.rs lines 224-260 (the per-layer 2f Pre-FFN-norm
  + 2g MoE FFN dispatch + residual block) lifted into a new private
  method moe_ffn_layer.
- The outer for-position loop in forward_qwen3_moe now calls the
  helper once per position via a temporary tok_hidden buffer.

## Math equivalence

The previous batched code:
  1. Compute ffn_input = rms_norm(hidden) over seq_len * hidden_dim
     (rms_norm is per-vector, so this is N independent calls inlined)
  2. For each position s: pos_out[s] = moe_ffn_forward_layer(ffn_input[s])
  3. Residual: hidden += ffn_output

The new per-position code does the same thing in a different order:
  1. For each position s:
       norm(hidden[s..s+hidden_dim]) → ffn_input
       moe_ffn_forward_layer(ffn_input) → pos_out
       hidden[s..s+hidden_dim] += pos_out

Both compute the same per-position math. rms_norm/layer_norm are
per-vector (they only mix elements within one hidden_dim vector,
never across tokens) so batching them is just convenience, not
required for correctness.

## Cost note

The refactor adds two per-position copies: hidden→tok_hidden and
tok_hidden→hidden. For Qwen3-Coder-30B (hidden_dim=2048, f32) each
copy is 8KB — negligible vs the per-expert matvecs inside
moe_ffn_forward_layer (~6MB Q4_K bytes per expert × top-K=8).

## Test plan

- [x] cargo check -p aprender-serve --lib --features cuda: clean
- [x] cargo test gguf::inference::forward: 109 passed / 0 failed / 2 ignored
- [x] cargo test qwen3_moe: 17 passed / 0 failed / 0 ignored

## Cross-refs

- Issue: #1830 (M32d KV cache for qwen3_moe inference path)
- Predecessor: #1831 (PR-1: attention_layer_with_cache helper)
- Spec: docs/specifications/m32d-moe-kv-cache-scope.md
- Next: PR-3 — wire forward_single_qwen3_moe_with_cache + run_qwen3_moe_generate

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 20, 2026 05:38
@noahgift
Copy link
Copy Markdown
Contributor Author

Superseded by #1832 — operator flipped from Option (b) engineer-driven to Option (a) in-session and shipped the full M32d KV cache feature monolithically (without the helper extraction this PR proposed). #1832 ships forward_single_qwen3_moe_with_cache + cache-aware run_qwen3_moe_generate + moe_kv_cache_equivalence test + m32d_perf bench with empirical 19× speedup. Closing this prep refactor PR; not needed once #1832 lands.

This is the standard chain-PR squash-leapfrog cleanup pattern (memory: feedback_chain_pr_squash_leapfrog.md).

@noahgift noahgift closed this May 20, 2026
auto-merge was automatically disabled May 20, 2026 06:04

Pull request was closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant