[1/5] autotune: harden cache key + add restore_value (#770) #783

Open
jhinpan wants to merge 3 commits into ROCm:main from jhinpan:feat/autotune-cache-key-restore

jhinpan (Contributor) commented Jul 1, 2026

First of a planned series making FlyDSL's autotuner (python/flydsl/autotune.py) a correct, adopted tuning path per #770. Today it is well-built but unused, with two gaps that must be closed before any kernel can safely adopt it.

What this PR does

Harden the cache key (_make_key). It previously specialized on shape/dtype only, so a config tuned under one compiler build / GPU arch / memory layout would be silently reused under another. This folds in the axes Triton and quack rely on:

  • normalized stride pattern ({0, 1, other} — broadcast vs contiguous vs strided; exact numbers don't matter, the pattern does)
  • device arch (get_rocm_arch)
  • toolchain fingerprint (reuses the existing jit_function._flydsl_key() — hashes compiler source, native libs, version)
  • cache-invalidating env vars (reuses _cache_invalidating_env_values())
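
As a rough illustration of the axes listed above, the sketch below collapses strides to the {0, 1, other} pattern and folds the context axes into a key. Everything here is hypothetical: `FakeTensor`, `normalize_strides`, `make_key`, and the arch, toolchain, and env values are made up for the example and are not the code in python/flydsl/autotune.py.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a tensor: only shape, stride and dtype matter here."""
    shape: Tuple[int, ...]
    strides: Tuple[int, ...]
    dtype: str

    def stride(self) -> Tuple[int, ...]:
        return self.strides


def normalize_strides(t) -> Tuple[str, ...]:
    """Collapse each stride to '0' (broadcast), '1' (contiguous) or 'other'."""
    return tuple(
        "0" if s == 0 else "1" if s == 1 else "other" for s in t.stride()
    )


def make_key(tensors: dict, arch: str, toolchain: str, env: tuple) -> tuple:
    """Illustrative key: shape, dtype and stride pattern plus the context axes."""
    per_tensor = tuple(
        (name, t.shape, t.dtype, normalize_strides(t))
        for name, t in sorted(tensors.items())
    )
    return (per_tensor, arch, toolchain, env)


row_major = FakeTensor((4, 8), (8, 1), "f16")
padded = FakeTensor((4, 8), (16, 1), "f16")     # same pattern, different numbers
transposed = FakeTensor((4, 8), (1, 4), "f16")  # different pattern
broadcast = FakeTensor((4, 8), (0, 1), "f16")   # broadcast along the first dim

# Example context values only; the real fingerprints come from the toolchain.
ctx = ("gfx942", "toolchain-abc", (("SOME_FLAG", "1"),))
key = make_key({"x": row_major}, *ctx)
```

Under these assumptions, `padded` produces the same key as `row_major` because only the pattern is recorded, while `transposed`, `broadcast`, or a different arch string each produce a distinct key.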

Add restore_value — the correctness soul of autotune. Benchmarking runs the same kernel dozens of times; an in-place / accumulating kernel (e.g. fused-add rmsnorm, where output overlaps the residual/input buffers) corrupts its own inputs across reps and picks a config on garbage. We snapshot the named tensors once and restore before every rep. Exposed on @autotune alongside the existing reset_to_zero.
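
The failure mode and the fix can be shown with plain Python lists in place of GPU tensors. This is a minimal sketch, not the autotuner's implementation: `bench_with_restore` and `fused_add` are hypothetical names, and no timing is measured.

```python
from typing import Callable, Dict, List

Buffers = Dict[str, List[float]]


def bench_with_restore(
    kernel: Callable[[Buffers], None],
    args: Buffers,
    restore_value: List[str],
    rep: int = 3,
) -> None:
    """Run `kernel` `rep` times, restoring the named buffers before each rep."""
    # Snapshot once, before any rep has touched the buffers.
    snapshot = {name: list(args[name]) for name in restore_value}
    for _ in range(rep):
        for name, saved in snapshot.items():
            args[name][:] = saved  # copy back in place, keeping the same buffer
        kernel(args)


def fused_add(args: Buffers) -> None:
    """Toy in-place kernel: residual += x, mimicking an accumulating kernel."""
    for i, v in enumerate(args["x"]):
        args["residual"][i] += v


dirty = {"x": [1.0, 2.0], "residual": [10.0, 20.0]}
for _ in range(3):
    fused_add(dirty)  # no restore: the residual drifts with every rep

clean = {"x": [1.0, 2.0], "residual": [10.0, 20.0]}
bench_with_restore(fused_add, clean, restore_value=["residual"], rep=3)
```

Without the restore, `dirty["residual"]` ends at `[13.0, 26.0]` after three reps; with it, every rep sees the original input and `clean["residual"]` ends at `[11.0, 22.0]`, the result of a single clean run.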

Keep the core import-light — defer the CompilationContext import so the autotuner core (Config, key, restore) stays importable and unit-testable without the compiled flydsl._mlir bindings.
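
The deferral pattern is roughly the following. The module name `heavy_compiled_backend` is a placeholder for any package backed by compiled bindings, and `get_compilation_context` and `make_key` are hypothetical helpers, not FlyDSL functions.

```python
import importlib


def get_compilation_context():
    """Import the heavy dependency only when compilation is actually requested."""
    # Hypothetical module name standing in for a compiled-bindings package.
    module = importlib.import_module("heavy_compiled_backend")
    return module.CompilationContext


def make_key(shape, dtype):
    """Pure-Python helper: usable even when the compiled backend is missing."""
    return (tuple(shape), str(dtype))


key = make_key([4, 8], "f16")  # works without the backend installed

try:
    get_compilation_context()
    backend_available = True
except ImportError:
    backend_available = False
```

Because the import happens inside the function body, importing this module and calling `make_key` never touches the backend; only the code path that actually compiles pays for it (or fails with `ImportError`).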

Tests

Adds tests/unit/test_autotune.py: 16 GPU-free unit tests (no torch, no compiled bindings) covering Config serialization, every cache-key axis, restore_value / reset_to_zero semantics, config pruning, and disk-cache round-trip.

16 passed in 0.24s

ruff + black clean.

Series roadmap (#770)

  1. this PR — harden key + restore_value + GPU-free unit tests
  2. two-track config (get_default heuristic + exhaustive) + first real adopter (rmsnorm / fused-add rmsnorm)
  3. offline emit + runtime lookup (SGLang-style, keyed by device+shape+dtype)
  4. CI guard (unit tests wired in + opt-in GPU integration + regression gate) + docs
  5. parallel precompile to bound first-run cost (revisits feat(jit): two-phase compilation for autotuning compile_hints #266)

Refs #770. No behavior change for existing code — the autotuner has no current callers.

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings July 1, 2026 06:39
jhinpan changed the title from "autotune: harden cache key + add restore_value (#770, 1/5)" to "[1/5] autotune: harden cache key + add restore_value (#770)" Jul 1, 2026

Copilot AI (Contributor) left a comment

Pull request overview

This PR strengthens FlyDSL’s Python autotuner (python/flydsl/autotune.py) so cached tuned configs are not incorrectly reused across differing compilation/device/env contexts, and adds a restore_value mechanism to make benchmarking correct for in-place kernels. It also introduces GPU-free unit tests to validate serialization, key construction, and restore/reset semantics without requiring torch or compiled bindings.

Changes:

  • Harden autotune cache keys by adding stride-pattern normalization, device arch fingerprint, toolchain fingerprint, and cache-invalidating env values.
  • Add restore_value snapshot/restore support and defer CompilationContext import to keep autotuner core import-light.
  • Add a new tests/unit/test_autotune.py suite with GPU-free coverage for key axes, restore/reset behavior, pruning, and disk cache.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| python/flydsl/autotune.py | Extends cache key axes; adds env/toolchain/device fingerprinting, stride normalization, and restore_value snapshot/restore; defers compiler import for testability. |
| tests/unit/test_autotune.py | Adds GPU-free unit tests validating config round-trip, cache-key axes, restore/reset semantics, pruning, and disk cache behavior. |

Comment thread python/flydsl/autotune.py Outdated
Comment on lines +240 to +249
    # Dtypes + normalized strides of tensor args for type/layout specialization
    dtype_parts = []
    stride_parts = []
    for name, val in sig_args.items():
        if hasattr(val, "dtype"):
            dtype_parts.append(f"{name}:{val.dtype}")
        if hasattr(val, "shape") and hasattr(val, "stride"):
            stride_parts.append(f"{name}:{_normalize_strides(val)}")
    key_vals.append(tuple(dtype_parts))
    key_vals.append(tuple(stride_parts))

Comment thread python/flydsl/autotune.py
Comment on lines +325 to +331
    try:
        return self._do_bench(kernel_call, warmup=self.warmup, rep=self.rep)
    finally:
        # Leave the caller's tensors as the kernel would have left them on a
        # single clean run: restore inputs, then run once more.
        if snapshot:
            self._restore_tensors(snapshot)

FlyDSL's autotuner exists but nothing uses it, and two gaps block real
adoption. This is the first of a series making it a correct, adopted path.

Cache key (_make_key) previously specialized on shape/dtype only. A config
tuned under one compiler build, GPU arch, or memory layout would be silently
reused under another. Fold in the axes Triton/quack rely on:
  - normalized stride pattern ({0,1,other}: broadcast vs contiguous vs strided)
  - device arch (get_rocm_arch)
  - toolchain fingerprint (reuses jit_function._flydsl_key)
  - cache-invalidating env vars (reuses _cache_invalidating_env_values)
The dtype/stride axes are sorted by arg name so a call is keyed identically
regardless of kwarg order (no duplicate tuning / cache files).
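
A small sketch of why sorting matters, using a hypothetical `dtype_axis` helper and `Arg` class (neither is the PR's code): keyword arguments keep the caller's order, so an unsorted key would differ between two otherwise identical calls.

```python
def dtype_axis(sig_args: dict) -> tuple:
    """Build the dtype part of a key, sorted by argument name."""
    return tuple(
        f"{name}:{val.dtype}"
        for name, val in sorted(sig_args.items())
        if hasattr(val, "dtype")
    )


class Arg:
    """Hypothetical tensor-like argument carrying only a dtype."""

    def __init__(self, dtype: str):
        self.dtype = dtype


def call(**kwargs) -> tuple:
    # kwargs preserves the caller's keyword order, so an unsorted key would too.
    return dtype_axis(kwargs)


a = call(x=Arg("f16"), w=Arg("f32"))
b = call(w=Arg("f32"), x=Arg("f16"))
```

Both calls yield `("w:f32", "x:f16")`, so they would share one cache entry instead of triggering two tuning runs.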

restore_value (new) is the correctness soul of autotune: benchmarking runs
the same kernel dozens of times, so an in-place / accumulating kernel (e.g.
fused-add rmsnorm) corrupts its own inputs and picks a config on garbage.
Snapshot the named tensors once and restore before every rep.

reset_to_zero is now also re-applied on the real (non-benchmark) call — both
the post-tune run and cache hits — via a shared _run_config, so an
accumulate-into-zero kernel returns the single-clean-run result instead of
carrying benchmark-rep state. (Was applied only inside the bench loop.)
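
One way to picture the shared path, again with lists standing in for tensors. `run_config`, `tune_then_run`, and `accumulate` are hypothetical names chosen for this sketch; the PR's `_run_config` may be structured differently.

```python
from typing import Callable, Dict, List

Buffers = Dict[str, List[float]]


def run_config(kernel: Callable[[Buffers], None], args: Buffers,
               reset_to_zero: List[str]) -> None:
    """Zero the named buffers, then launch once. Shared by bench and real calls."""
    for name in reset_to_zero:
        args[name][:] = [0.0] * len(args[name])
    kernel(args)


def tune_then_run(kernel, args: Buffers, reset_to_zero: List[str],
                  rep: int = 5) -> None:
    for _ in range(rep):  # benchmark reps
        run_config(kernel, args, reset_to_zero)
    # The real call goes through the same path, so it also starts from zero.
    run_config(kernel, args, reset_to_zero)


def accumulate(args: Buffers) -> None:
    """Toy kernel that accumulates into `out` and expects it to start at zero."""
    for i, v in enumerate(args["x"]):
        args["out"][i] += v


args = {"x": [1.0, 2.0], "out": [0.0, 0.0]}
tune_then_run(accumulate, args, reset_to_zero=["out"])
```

After tuning, `args["out"]` is `[1.0, 2.0]`, the single-clean-run result. A later cache-hit call that reuses `run_config` would return the same values rather than adding to leftover state.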

Also defer the CompilationContext import so the autotuner core stays
importable and unit-testable without the compiled flydsl._mlir bindings.

Adds tests/unit/test_autotune.py: 19 GPU-free tests covering Config
serialization, every cache-key axis (incl. env-fingerprint change and
kwarg-order insensitivity), restore_value/reset_to_zero semantics (incl. the
final-run and cache-hit reset), pruning, and disk-cache round-trip.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jhinpan force-pushed the feat/autotune-cache-key-restore branch 2 times, most recently from 58e9931 to f8e5bc9 July 2, 2026 06:51

Comment-only cleanup of the PR1 additions: keep the one key fact per
helper, drop the Triton/quack background, redundant restatements, and
by-example prose. No logic change; 19 unit tests still pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jhinpan force-pushed the feat/autotune-cache-key-restore branch from f8e5bc9 to bff0d76 July 2, 2026 07:07