[1/5] autotune: harden cache key + add restore_value (#770) by jhinpan · Pull Request #783 · ROCm/FlyDSL

jhinpan · 2026-07-01T06:39:50Z

First of a planned series making FlyDSL's autotuner (python/flydsl/autotune.py) a correct, adopted tuning path per #770. Today it is well-built but unused, with two gaps that must be closed before any kernel can safely adopt it.

What this PR does

Harden the cache key (_make_key). It previously specialized on shape/dtype only, so a config tuned under one compiler build / GPU arch / memory layout would be silently reused under another. This folds in the axes Triton and quack rely on:

normalized stride pattern ({0, 1, other} — broadcast vs contiguous vs strided; exact numbers don't matter, the pattern does)
device arch (get_rocm_arch)
toolchain fingerprint (reuses the existing jit_function._flydsl_key() — hashes compiler source, native libs, version)
cache-invalidating env vars (reuses _cache_invalidating_env_values())
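As a rough illustration of how these axes might combine into one key, here is a minimal sketch. The helpers `normalize_strides` and `make_key` are hypothetical stand-ins, not the PR's `_normalize_strides` / `_make_key`. The arch, toolchain, and env fingerprints are passed in as plain values instead of being computed, and `EXAMPLE_FLAG` is a made-up variable name.

```python
from typing import Sequence, Tuple


def normalize_strides(strides: Sequence[int]) -> Tuple[str, ...]:
    """Collapse exact strides to the pattern {0, 1, other}."""
    return tuple("0" if s == 0 else "1" if s == 1 else "s" for s in strides)


def make_key(shapes, dtypes, strides, arch, toolchain, env):
    """Build a hashable key from every axis that can change the best config.

    shapes / dtypes / strides are dicts keyed by argument name (assumed layout).
    """
    names = sorted(shapes)  # sort so keyword order does not change the key
    return (
        tuple((n, tuple(shapes[n])) for n in names),
        tuple((n, dtypes[n]) for n in names),
        tuple((n, normalize_strides(strides[n])) for n in names),
        arch,
        toolchain,
        tuple(sorted(env.items())),
    )


base = dict(
    shapes={"x": (4, 8), "w": (8,)},
    dtypes={"x": "f16", "w": "f16"},
    strides={"x": (8, 1), "w": (1,)},
    arch="gfx942",
    toolchain="abc123",            # placeholder fingerprint
    env={"EXAMPLE_FLAG": "0"},     # hypothetical env var
)
k_contig = make_key(**base)
k_padded = make_key(**{**base, "strides": {"x": (16, 1), "w": (1,)}})
k_bcast = make_key(**{**base, "strides": {"x": (0, 1), "w": (1,)}})
k_arch = make_key(**{**base, "arch": "gfx950"})
```

In this sketch `k_contig` and `k_padded` compare equal because only the stride pattern is keyed, while `k_bcast` and `k_arch` each produce a distinct key.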

Add restore_value — the correctness soul of autotune. Benchmarking runs the same kernel dozens of times; an in-place / accumulating kernel (e.g. fused-add rmsnorm, where output overlaps the residual/input buffers) corrupts its own inputs across reps and picks a config on garbage. We snapshot the named tensors once and restore before every rep. Exposed on @autotune alongside the existing reset_to_zero.
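The snapshot-and-restore idea can be sketched without a GPU by using Python lists as stand-in tensors. `bench_with_restore` and `fused_add` below are hypothetical names for illustration only; the PR's actual benchmark loop, timing, and tensor copy mechanism are not shown here.

```python
import copy


def bench_with_restore(kernel, args, restore_value, rep=3):
    """Run kernel(**args) `rep` times, restoring the named args before each rep."""
    # Snapshot once, before any rep has touched the buffers.
    snapshot = {name: copy.deepcopy(args[name]) for name in restore_value}
    for _ in range(rep):
        for name, saved in snapshot.items():
            args[name][:] = saved  # restore in place so aliases stay valid
        kernel(**args)  # timing omitted in this sketch


def fused_add(x, residual):
    """Toy in-place kernel: residual += x."""
    for i, v in enumerate(x):
        residual[i] += v


args = {"x": [1.0, 2.0], "residual": [10.0, 20.0]}
bench_with_restore(fused_add, args, restore_value=["residual"], rep=5)
after_restore = list(args["residual"])

args_bad = {"x": [1.0, 2.0], "residual": [10.0, 20.0]}
bench_with_restore(fused_add, args_bad, restore_value=[], rep=5)
after_no_restore = list(args_bad["residual"])
```

With restore, every rep sees the original residual, so the buffer ends in the single-run state. Without it, five reps accumulate into the same buffer and each rep measures a different input.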

Keep the core import-light — defer the CompilationContext import so the autotuner core (Config, key, restore) stays importable and unit-testable without the compiled flydsl._mlir bindings.
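A deferred import of this kind might look like the sketch below. The `Config` class and `load_compiler_backend` helper are simplified, hypothetical stand-ins, and the real location of `CompilationContext` is not given in this PR description, so the sketch only defers a module import by name.

```python
import importlib


class Config:
    """Pure-Python tuning config; constructing it needs no compiled bindings."""

    def __init__(self, **kwargs):
        self.kwargs = dict(kwargs)

    def to_dict(self):
        return dict(self.kwargs)


def load_compiler_backend(module_name="flydsl._mlir"):
    """Import the heavy backend only at the point a kernel is compiled."""
    # Importing here, not at module top level, keeps `Config` usable
    # in environments where the compiled bindings are absent.
    return importlib.import_module(module_name)


cfg = Config(BLOCK=128)  # works even if the backend cannot be imported
```

The trade-off is that a missing backend surfaces as an `ImportError` on first compile instead of at import time.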

Tests

Adds tests/unit/test_autotune.py — 16 GPU-free unit tests (no torch, no compiled bindings) covering Config serialization, every cache-key axis, restore_value / reset_to_zero semantics, config pruning, and disk-cache round-trip.

16 passed in 0.24s

ruff + black clean.

Series roadmap (#770)

this PR — harden key + restore_value + GPU-free unit tests
two-track config (get_default heuristic + exhaustive) + first real adopter (rmsnorm / fused-add rmsnorm)
offline emit + runtime lookup (SGLang-style, keyed by device+shape+dtype)
CI guard (unit tests wired in + opt-in GPU integration + regression gate) + docs
parallel precompile to bound first-run cost (revisits feat(jit): two-phase compilation for autotuning compile_hints #266)

Refs #770. No behavior change for existing code — the autotuner has no current callers.

🤖 Generated with Claude Code

Copilot

Pull request overview

This PR strengthens FlyDSL’s Python autotuner (python/flydsl/autotune.py) so cached tuned configs are not incorrectly reused across differing compilation/device/env contexts, and adds a restore_value mechanism to make benchmarking correct for in-place kernels. It also introduces GPU-free unit tests to validate serialization, key construction, and restore/reset semantics without requiring torch or compiled bindings.

Changes:

Harden autotune cache keys by adding stride-pattern normalization, device arch fingerprint, toolchain fingerprint, and cache-invalidating env values.
Add restore_value snapshot/restore support and defer CompilationContext import to keep autotuner core import-light.
Add a new tests/unit/test_autotune.py suite with GPU-free coverage for key axes, restore/reset behavior, pruning, and disk cache.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
python/flydsl/autotune.py	Extends cache key axes; adds env/toolchain/device fingerprinting, stride normalization, and `restore_value` snapshot/restore; defers compiler import for testability.
tests/unit/test_autotune.py	Adds GPU-free unit tests validating config round-trip, cache-key axes, restore/reset semantics, pruning, and disk cache behavior.


+        # Dtypes + normalized strides of tensor args for type/layout specialization
        dtype_parts = []
+        stride_parts = []
        for name, val in sig_args.items():
            if hasattr(val, "dtype"):
                dtype_parts.append(f"{name}:{val.dtype}")
+            if hasattr(val, "shape") and hasattr(val, "stride"):
+                stride_parts.append(f"{name}:{_normalize_strides(val)}")
        key_vals.append(tuple(dtype_parts))
+        key_vals.append(tuple(stride_parts))


+        try:
+            return self._do_bench(kernel_call, warmup=self.warmup, rep=self.rep)
+        finally:
+            # Leave the caller's tensors as the kernel would have left them on a
+            # single clean run: restore inputs, then run once more.
+            if snapshot:
+                self._restore_tensors(snapshot)


FlyDSL's autotuner exists but nothing uses it, and two gaps block real adoption. This is the first of a series making it a correct, adopted path.

Cache key (_make_key) previously specialized on shape/dtype only. A config tuned under one compiler build, GPU arch, or memory layout would be silently reused under another. Fold in the axes Triton/quack rely on:

- normalized stride pattern ({0,1,other}: broadcast vs contiguous vs strided)
- device arch (get_rocm_arch)
- toolchain fingerprint (reuses jit_function._flydsl_key)
- cache-invalidating env vars (reuses _cache_invalidating_env_values)

The dtype/stride axes are sorted by arg name so a call is keyed identically regardless of kwarg order (no duplicate tuning / cache files).

restore_value (new) is the correctness soul of autotune: benchmarking runs the same kernel dozens of times, so an in-place / accumulating kernel (e.g. fused-add rmsnorm) corrupts its own inputs and picks a config on garbage. Snapshot the named tensors once and restore before every rep.

reset_to_zero is now also re-applied on the real (non-benchmark) call, both the post-tune run and cache hits, via a shared _run_config, so an accumulate-into-zero kernel returns the single-clean-run result instead of carrying benchmark-rep state. (Was applied only inside the bench loop.)

Also defer the CompilationContext import so the autotuner core stays importable and unit-testable without the compiled flydsl._mlir bindings.

Adds tests/unit/test_autotune.py: 19 GPU-free tests covering Config serialization, every cache-key axis (incl. env-fingerprint change and kwarg-order insensitivity), restore_value/reset_to_zero semantics (incl. the final-run and cache-hit reset), pruning, and disk-cache round-trip.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
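The reset_to_zero change described in this commit can be pictured with a small sketch in which benchmark reps and the real call share one launch path. `run_config` and `accumulate` are hypothetical names and Python lists stand in for tensors; this is not the PR's `_run_config`.

```python
def run_config(kernel, args, reset_to_zero):
    """Zero the named buffers, then launch once. Shared by bench and real calls."""
    for name in reset_to_zero:
        buf = args[name]
        buf[:] = [0.0] * len(buf)  # zero in place so the caller's buffer is reused
    kernel(**args)


def accumulate(x, out):
    """Toy accumulating kernel: out += x."""
    for i, v in enumerate(x):
        out[i] += v


args = {"x": [1.0, 2.0], "out": [0.0, 0.0]}
for _ in range(4):                      # benchmark reps
    run_config(accumulate, args, ["out"])
run_config(accumulate, args, ["out"])   # the real call after tuning
result = list(args["out"])
```

Because the real call also zeroes `out` first, the caller sees the result of one clean run instead of state left behind by the benchmark reps.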

Comment-only cleanup of the PR1 additions: keep the one key fact per helper, drop the Triton/quack background, redundant restatements, and by-example prose. No logic change; 19 unit tests still pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings July 1, 2026 06:39

Copilot started reviewing on behalf of jhinpan July 1, 2026 06:40

jhinpan changed the title ~~autotune: harden cache key + add restore_value (#770, 1/5)~~ [1/5] autotune: harden cache key + add restore_value (#770) Jul 1, 2026

Copilot AI reviewed Jul 1, 2026


This was referenced Jul 1, 2026

[2/5] autotune: two-track config + first real adopter (rmsnorm) (#770) #785

Open

[3/5] autotune: offline config emit + runtime lookup (#770) #786

Open

jhinpan force-pushed the feat/autotune-cache-key-restore branch from 85e4161 to 6f3351f Compare July 1, 2026 07:45

jhinpan mentioned this pull request Jul 1, 2026

[4/5] autotune: CI guard + committed-config regression check (#770) #788

Open

Merge branch 'main' into feat/autotune-cache-key-restore

50a95c2

jhinpan force-pushed the feat/autotune-cache-key-restore branch 2 times, most recently from 58e9931 to f8e5bc9 Compare July 2, 2026 06:51

jhinpan force-pushed the feat/autotune-cache-key-restore branch from f8e5bc9 to bff0d76 Compare July 2, 2026 07:07

jhinpan wants to merge 3 commits into ROCm:main from jhinpan:feat/autotune-cache-key-restore
