
OneDNN BRGeMM Micro-Kernel Integration for BF16 MatMul #903

Open

bbhattar wants to merge 5 commits into google:dev from Intel-tensorflow:feature/onednn-brgemm

Conversation

@bbhattar

This PR integrates OneDNN BRGeMM (Batch-Reduced General Matrix Multiply) micro-kernels as an alternative compute path for BF16 MatMul on Intel Xeon platforms with AMX or AVX-512 BF16 support.

What

When enabled via the GEMMA_ONEDNN_BRGEMM compile-time flag, BF16×BF16 MatMul operations are dispatched to JIT-compiled BRGeMM kernels instead of the Highway SIMD path. This targets Gemma model workloads (FFW projections, attention) on Intel Xeon Scalable (SPR/EMR) processors. Support has been added to both the CMake and Bazel build systems.

How to Enable

# CMake
cmake -DGEMMA_ONEDNN_BRGEMM=ON ..

# Bazel
bazel build --define gemma_onednn_brgemm=1 ...

Runtime Fallback

When GEMMA_ONEDNN_BRGEMM is enabled at compile time, the BRGeMM path activates for BF16×BF16 operations whose dimensions meet AMX tile constraints (M, N, K ≥ 32 and K % 32 == 0). All other cases — non-BF16 types, smaller or non-aligned dimensions, mixed precision — fall through to the standard Highway SIMD MatMul path automatically.
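
For reference, a minimal sketch of that eligibility check, written as a standalone predicate; the actual UseOneDnnBrgemm() declared in ops/brgemm.h may take different parameters:

// Sketch only: mirrors the dimension rules described above.
inline bool UseOneDnnBrgemm(size_t M, size_t K, size_t N,
                            bool a_is_bf16, bool b_is_bf16) {
  if (!a_is_bf16 || !b_is_bf16) return false;    // BF16 x BF16 only.
  if (M < 32 || N < 32 || K < 32) return false;  // AMX tile minimums.
  return (K % 32) == 0;                          // K must be a multiple of 32.
}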

Changes

File                 Description
ops/brgemm.h         Types, caches, thread-local buffers, UseOneDnnBrgemm(), autotuning candidates
ops/brgemm-inl.h     DoMatMul_BRGeMM(): kernel JIT/caching, B-packing with hugepages, tiled parallel execution
ops/matmul-inl.h     BRGeMM dispatch block in MatMul() guarded by #if GEMMA_ONEDNN_BRGEMM
ops/matmul.h         #include "ops/brgemm.h", brgemm_autotune field in MMPerKey
ops/bench_matmul.cc  Check brgemm_autotune.Best() to avoid infinite loop when BRGeMM handles dispatch
CMakeLists.txt       GEMMA_ONEDNN_BRGEMM option, FetchContent for OneDNN v3.11, conditional target linking
BUILD.bazel          config_setting for gemma_onednn_brgemm, conditional OneDNN dep and defines for x86_64
MODULE.bazel         OneDNN v3.11 http_archive dependency
bazel/onednn.BUILD   Bazel build rules for OneDNN
util/zones.h         kBRGeMM caller enum for thread pool dispatch
util/zones.cc        CallerName mapping for kBRGeMM

Testing

  • matmul_test passes with and without GEMMA_ONEDNN_BRGEMM (all original test shapes, types, and correctness checks preserved)
  • bench_matmul runs successfully with BRGeMM enabled
  • No changes to existing tests; zero impact when OneDNN is not enabled or on non-x86 platforms

@google-cla

google-cla Bot commented Apr 28, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@bbhattar force-pushed the feature/onednn-brgemm branch from 629b569 to e072d70 on April 28, 2026 at 22:19

@jan-wassenberg left a comment


Very nice work :) Just some fairly minor suggestions:

Comment thread ops/brgemm.h

struct BRGeMMConfig {
  int64_t M_blk;
  int64_t N_blk;

We could set these to 32 directly, as member initializers? Possibly also make them const to make clear that they do not change.
Also, prefer size_t for all size-like things to prevent sign-conversion warnings.
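
A sketch of what that could look like, combining the member-initializer and size_t suggestions (the defaults of 32 are illustrative):

struct BRGeMMConfig {
  // Defaults documented in the type itself; size_t avoids sign-conversion
  // warnings when these feed size arithmetic.
  size_t M_blk = 32;
  size_t N_blk = 32;
};

Making them const would also signal immutability, at the cost of deleting copy assignment, which matters if configs are stored in containers such as the candidates vector.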

Comment thread ops/brgemm.h
// Tunable: M_blk in {32,64}, batch_size in {16,32,64,128,256}.
inline std::vector<BRGeMMConfig> BRGeMMCandidates(size_t M, size_t K,
                                                  size_t N) {
  std::vector<BRGeMMConfig> out;

Let's .reserve with some estimate, also to document how many there will be?
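
For example, a sketch sized from the candidate grids mentioned in the comment above, at most one entry per (M_blk, batch_size) pair, i.e. 2 * 5 = 10 (std::size comes from <iterator>):

// Documents the upper bound on candidates and avoids reallocation.
out.reserve(std::size(kMBlkValues) * std::size(kBatchValues));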

Comment thread ops/brgemm.h
static constexpr int64_t kMBlkValues[] = {32, 64};
static constexpr int64_t kBatchValues[] = {16, 32, 64, 128, 256};

const int64_t k_chunks = static_cast<int64_t>(K) / kKBlk;

Should this round up? We have hwy::DivCeil.
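
If rounding up is indeed the intent, a sketch of the DivCeil form:

// Rounds up so a K that is not a multiple of kKBlk still gets a final chunk.
const size_t k_chunks = hwy::DivCeil(K, static_cast<size_t>(kKBlk));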

Comment thread ops/brgemm.h
  }
  madvise(ptr_, size_, MADV_HUGEPAGE);
  for (size_t off = 0; off < size_; off += kHugePageSize) {
    static_cast<volatile uint8_t*>(ptr_)[off] = 0;

Possibly safer/more portable: consider ptr_[off] = 0; hwy::PreventElision(ptr_[off]).
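
A sketch of that variant, assuming ptr_ is the void* returned by the allocation above:

// Touch one byte per huge page; PreventElision keeps the store from being
// optimized away without needing a volatile-qualified pointer.
uint8_t* const bytes = static_cast<uint8_t*>(ptr_);
for (size_t off = 0; off < size_; off += kHugePageSize) {
  bytes[off] = 0;
  hwy::PreventElision(bytes[off]);
}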

Comment thread ops/brgemm.h
// Kernel cache key: identifies a JIT-compiled kernel set.
struct BRGeMMKernelKey {
  size_t M, K, N;
  int64_t M_blk, N_blk, K_blk, batch_size;

Can these also be size_t? And below.

Comment thread ops/brgemm-inl.h
ke.M_blk =
    static_cast<int64_t>(std::min(static_cast<size_t>(cfg.M_blk), M));

ke.M_tail = M % ke.M_blk;

Do we want precomputed hwy::Divisor here to avoid actual division?
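
hwy::Divisor precomputes the reciprocal once so later remainders avoid a hardware divide; a sketch, assuming M and the block size fit in uint32_t:

// Built once per kernel setup, reused for every remainder/divide.
const hwy::Divisor div_m_blk(static_cast<uint32_t>(ke.M_blk));
ke.M_tail = div_m_blk.Remainder(static_cast<uint32_t>(M));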

Comment thread ops/brgemm-inl.h
const int64_t ldb_for[2] = {ke.N_blk, ke.N_tail ? ke.N_tail : ke.N_blk};
const int64_t ldc_for[2] = {ke.N_blk, ke.N_tail ? ke.N_tail : ke.N_blk};

// Create brgemm kernels for each (M-tile, N-tile) variant.

I think these are "do we have an M and N tail" variants, could the comment be rephrased to make that more clear?

Comment thread ops/brgemm-inl.h
auto& kern_cache = GetBRGeMMKernelCache();
auto kern_it = kern_cache.find(kern_key);

if (kern_it == kern_cache.end()) {

This block is quite big. Might help readability and codegen to put it into a HWY_NOINLINE helper function?

Comment thread ops/brgemm-inl.h
if (!MakeBrgemm(ke.brg_first_all[mi][ni], ms, ns, ke.K_blk,
                ke.K_super_size, ke.lda, ldb_for[ni], ldc_for[ni],
                a_dt, b_dt, c_dt, false)) {
  return;

Should we HWY_WARN on failure? Or even HWY_ABORT? If failure can happen, should we fall back to the prior matmul?
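
A sketch of the warn-and-return option applied to the quoted call; whether the caller can then fall back to the existing Highway MatMul depends on how the surrounding dispatch is structured:

if (!MakeBrgemm(ke.brg_first_all[mi][ni], ms, ns, ke.K_blk,
                ke.K_super_size, ke.lda, ldb_for[ni], ldc_for[ni],
                a_dt, b_dt, c_dt, false)) {
  // Surface the failure instead of returning silently; the caller could then
  // take the existing Highway SIMD path.
  HWY_WARN("BRGeMM kernel creation failed; falling back to Highway MatMul.");
  return;
}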

Comment thread ops/brgemm-inl.h
const auto va = hn::Load(df, add_row + n);
const auto result = hn::MulAdd(v, vscale, va);
if constexpr (hwy::IsSame<TC, float>()) {
  hn::Store(result, df, reinterpret_cast<float*>(C_row) + n);

Better to use HWY_RCAST_ALIGNED to tell the compiler this is element-aligned. (also below)
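
Applied to the quoted Store, a sketch:

// HWY_RCAST_ALIGNED performs the reinterpret_cast and tells the compiler the
// pointer is aligned to the destination element type.
hn::Store(result, df, HWY_RCAST_ALIGNED(float*, C_row) + n);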
