refactor(forward): lift moe_ffn_layer helper (M32d Day 1 prep, #1830 PR-2 of 4)#1834
Closed
noahgift wants to merge 2 commits into
Closed
refactor(forward): lift moe_ffn_layer helper (M32d Day 1 prep, #1830 PR-2 of 4)#1834noahgift wants to merge 2 commits into
noahgift wants to merge 2 commits into
Conversation
… prep, #1830 PR-1 of 4) Pure refactor of the dense KV-cache path's per-layer attention sub-block (steps 2a–2g) into a private helper method on OwnedQuantizedModel. Behavior is bit-identical to the previous inline version. ## Why M32d (KV cache for qwen3_moe inference path, #1830) needs to add a new forward function forward_single_qwen3_moe_with_cache that mirrors forward_single_with_cache's structure but swaps the dense FFN block for the MoE expert dispatch. The attention block is reusable as-is. This PR is the first of 4 in the M32d cascade per docs/specifications/m32d-moe-kv-cache-scope.md: PR 1 (this) — extract attention_layer_with_cache from dense path PR 2 — extract moe_ffn_layer helper from forward_qwen3_moe PR 3 — add forward_single_qwen3_moe_with_cache + wire generate PR 4 — add moe_kv_cache_equivalence integration test PR 1 is the lowest-risk prep step. The 34 existing single_tests must remain green (regression gate — the spec calls this out as 'medium risk: must not regress dense KV cache'). ## What changed - debug.rs lines 477-589 (the attention sub-block inside the per-layer for loop of forward_single_with_cache) lifted verbatim into a new private method attention_layer_with_cache. - Inline block replaced with a single call to the helper. - Helper signature follows the spec doc: hidden: &mut Vec<f32> layer: &OwnedQuantizedLayer layer_idx: usize cache: &mut OwnedQuantizedKVCache position: usize attn_out_buffer: &mut [f32] use_rmsnorm: bool The helper preserves ALL behavior of the inline version: PMAT-260 debug trace calls, GH-278 LayerNorm bias branch, GH-479 Qwen3 per-head QK RMSNorm, RoPE skip for absolute-position models, GQA expansion for empty cache first-token path, CORRECTNESS-013 attention output trace, fused_matmul output projection, attn_output_bias add, residual. ## Test plan - [x] cargo check -p aprender-serve --lib --features cuda: clean - [x] cargo test -p aprender-serve --lib --features cuda gguf::inference::forward::single: 34 passed / 0 failed / 1 ignored — covers test_forward_single_with_cache_*, GQA variants, multi-token sequential decode, Q8K path ## Cross-refs - Issue: #1830 (M32d KV cache for qwen3_moe inference path) - Spec: docs/specifications/m32d-moe-kv-cache-scope.md - Contract gate (downstream): FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004 - Sibling future PR: forward_qwen3_moe.rs lift of moe_ffn_layer helper Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> EOF
…PR-2 of 4) Pure refactor of the per-position MoE FFN block (pre-norm + dispatch + residual) into a private helper method on OwnedQuantizedModel. Behavior is byte-identical to the previous batch-norm + per-position- dispatch + outer-residual version. ## Why Continues the M32d cascade (#1830). PR-1 (#1831) extracted the dense attention helper. This PR extracts the MoE FFN helper. Together they let the upcoming forward_single_qwen3_moe_with_cache (PR-3) be a small composition: for each layer: attention_layer_with_cache(...) // dense helper from PR-1 moe_ffn_layer(...) // MoE helper from THIS PR ## What changed - forward_qwen3_moe.rs lines 224-260 (the per-layer 2f Pre-FFN-norm + 2g MoE FFN dispatch + residual block) lifted into a new private method moe_ffn_layer. - The outer for-position loop in forward_qwen3_moe now calls the helper once per position via a temporary tok_hidden buffer. ## Math equivalence The previous batched code: 1. Compute ffn_input = rms_norm(hidden) over seq_len * hidden_dim (rms_norm is per-vector, so this is N independent calls inlined) 2. For each position s: pos_out[s] = moe_ffn_forward_layer(ffn_input[s]) 3. Residual: hidden += ffn_output The new per-position code does the same thing in a different order: 1. For each position s: norm(hidden[s..s+hidden_dim]) → ffn_input moe_ffn_forward_layer(ffn_input) → pos_out hidden[s..s+hidden_dim] += pos_out Both compute the same per-position math. rms_norm/layer_norm are per-vector (they only mix elements within one hidden_dim vector, never across tokens) so batching them is just convenience, not required for correctness. ## Cost note The refactor adds two per-position copies: hidden→tok_hidden and tok_hidden→hidden. For Qwen3-Coder-30B (hidden_dim=2048, f32) each copy is 8KB — negligible vs the per-expert matvecs inside moe_ffn_forward_layer (~6MB Q4_K bytes per expert × top-K=8). ## Test plan - [x] cargo check -p aprender-serve --lib --features cuda: clean - [x] cargo test gguf::inference::forward: 109 passed / 0 failed / 2 ignored - [x] cargo test qwen3_moe: 17 passed / 0 failed / 0 ignored ## Cross-refs - Issue: #1830 (M32d KV cache for qwen3_moe inference path) - Predecessor: #1831 (PR-1: attention_layer_with_cache helper) - Spec: docs/specifications/m32d-moe-kv-cache-scope.md - Next: PR-3 — wire forward_single_qwen3_moe_with_cache + run_qwen3_moe_generate Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Contributor
Author
|
Superseded by #1832 — operator flipped from Option (b) engineer-driven to Option (a) in-session and shipped the full M32d KV cache feature monolithically (without the helper extraction this PR proposed). #1832 ships forward_single_qwen3_moe_with_cache + cache-aware run_qwen3_moe_generate + moe_kv_cache_equivalence test + m32d_perf bench with empirical 19× speedup. Closing this prep refactor PR; not needed once #1832 lands. This is the standard chain-PR squash-leapfrog cleanup pattern (memory: feedback_chain_pr_squash_leapfrog.md). |
auto-merge was automatically disabled
May 20, 2026 06:04
Pull request was closed
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Pure refactor of the per-position MoE FFN block (pre-norm + dispatch + residual) into a private helper method on `OwnedQuantizedModel`. Behavior is byte-identical to the previous batched-norm + per-position-dispatch + outer-residual version.
Second of 4 PRs implementing #1830 (M32d KV cache for qwen3_moe inference path).
Cascade progress
After PR-1 + PR-2 land, PR-3 becomes a small composition: per layer, call `attention_layer_with_cache` then `moe_ffn_layer`.
What changed
Helper signature follows the spec:
```rust
pub(crate) fn moe_ffn_layer(
&self,
hidden_token: &mut [f32], // [hidden_dim], single token
layer: &OwnedQuantizedLayer, // for ffn_norm_weight + bias
moe_layer: &Qwen3MoeQuantizedLayer, // for router + experts
num_experts: usize,
num_experts_per_tok: usize,
moe_intermediate: usize,
data: &[u8],
use_rmsnorm: bool,
) -> Result<()>
```
Math equivalence
The previous batched code:
The new per-position code does the same math in a different order:
```
for s in 0..seq_len:
norm(hidden[s..s+hidden_dim]) → ffn_input
moe_ffn_forward_layer(ffn_input) → pos_out
hidden[s..s+hidden_dim] += pos_out
```
`rms_norm`/`layer_norm` are per-vector (only mix elements within one `hidden_dim` vector, never across tokens), so batching them is convenience, not required for correctness. The per-token MoE dispatch was already the structure.
Cost note
Adds two per-position copies: `hidden→tok_hidden` and `tok_hidden→hidden`. For Qwen3-Coder-30B (`hidden_dim=2048`, f32) each copy is 8KB — negligible vs the per-expert matvecs inside `moe_ffn_forward_layer` (~6MB Q4_K bytes per expert × top-K=8 selected experts).
Test plan
Cross-refs
🤖 Generated with Claude Code